[MPICH] Collective *v operation question

Martin Schwinzerl martin.schwinzerl at edu.uni-graz.at
Fri Nov 23 11:34:33 CST 2007


Dear all!

I'm currently hunting down a bug in a rather large program that works fine
on our local cluster with MPICH2 but crashes, seemingly at random, on
another cluster where only an MPI 1.2 runtime environment
(MPICH-1.2.7p1, IIRC) is available.

In my humble opinion, the crashes are most likely caused by a (known, but
as yet unidentified) configuration problem in the local scheduler /
cluster management software (some jobs are not created properly, etc.) and
are therefore unrelated to this list's subject. To rule out another
possible cause, however, I would be grateful if somebody here could confirm
that my understanding of the MPI standard with respect to the *v variants
of the collective operations (e.g. Gatherv, Scatterv, ...) is correct:

SITUATION :

In my example, each process of the intracommunicator performs a fixed
number Np of operations and increments a counter cnt for each successful
operation (i.e. after the calculation has finished, each process's cnt
satisfies 0 <= cnt <= Np).

The root process then gathers the counter values from all processes and
computes the receive displacement entries for a Gatherv operation from
them. If the process with rank r has cnt == 0, the running offset is not
advanced, so its receive displacement has the same value as that of the
next process (r + 1). (See the code sample below for clarification.)

QUESTION :

It has so far been my understanding that this is compatible with the
following restriction from the MPI 1.1 report / standard (chapter 4.5):

> The specification of counts, types, and displacements should not cause
> any location on the root to be written more than once. Such a call is
> erroneous.

My reasoning is that processes with cnt == 0 send only zero-length messages
and therefore cause no write operations on the receive buffer.

---> Is this assumption correct?
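
To make the reasoning above concrete, here is a minimal sketch of a check
that could be run on the root before calling Gatherv. It reflects my reading
of the restriction, not a confirmed interpretation of the standard: only
ranks with a non-zero receive count are treated as writing to the receive
buffer, so a zero-count rank may share its displacement with a neighbour.
The function name blocksOverlap is hypothetical and not part of MPI; counts
and displacements are assumed to be in units of the receive datatype and to
have the same length.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of MPI): returns true if two ranks with a
// non-zero receive count would write to the same element of the receive
// buffer. Assumes counts.size() == displs.size().
bool blocksOverlap( const std::vector< int >& counts,
                    const std::vector< int >& displs )
{
    // Collect the half-open intervals [begin, end) of ranks that send data.
    // Ranks with a zero count are skipped because they write nothing.
    std::vector< std::pair< int, int > > blocks;
    for( std::size_t r = 0 ; r < counts.size() ; ++r )
    {
        if( counts[ r ] > 0 )
        {
            blocks.push_back(
                std::make_pair( displs[ r ], displs[ r ] + counts[ r ] ) );
        }
    }

    // After sorting by start position, an overlap can only occur between
    // neighbouring intervals.
    std::sort( blocks.begin(), blocks.end() );

    for( std::size_t i = 1 ; i < blocks.size() ; ++i )
    {
        if( blocks[ i ].first < blocks[ i - 1 ].second )
        {
            return true;
        }
    }
    return false;
}
```

With counts { 2, 0, 3 } and displacements { 0, 2, 2 }, the zero-count rank
shares its displacement with the rank after it, yet no element is written
twice under this reading.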

CODE SAMPLE FOR CLARIFICATION :

The following code sample (C++ bindings) sketches the situation in question:

// ... rank, size from MPI_Init

int cnt = 0;

//Holds the results of the successful operation(s) :
std::vector< double > data; 

while( .... )
{
    // processes 0, ... size - 1 perform  essentially a random number
    // of operations. For each operation, cnt is incremented.
}

std::vector< int > recvCount( size );
std::vector< int > recvOffset( size );

MPI::COMM_WORLD.Gather( &cnt, 1, MPI::INT,
                        &recvCount[ 0 ], 1, MPI::INT, 0 );
int n = 0;

if( rank == 0 )
{
    int offset = 0;
    for( int r = 0 ; r < size ; ++r )
    {
        n += recvCount[ r ];
        recvOffset[ r ] = offset;

        if( recvCount[ r ] > 0 )
        {
            offset += recvCount[ r ];
        }
    }
}

std::vector< double > recvData( n );
MPI::COMM_WORLD.Gatherv( &data[ 0 ], cnt, MPI::DOUBLE,
                         &recvData[ 0 ], &recvCount[ 0 ], &recvOffset[ 0 ],
                         MPI::DOUBLE, 0 );

// .... -> end of example
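
One detail of the sample that is independent of the MPI question: when
cnt == 0 the vector data is empty, and on every rank other than 0 the vector
recvData has size n == 0. Applying operator[] to an empty std::vector, as in
&data[ 0 ], is undefined behaviour in C++, so it may be worth ruling out as
well. A minimal sketch of a guard follows; the helper name bufferPtr is
hypothetical and is not part of MPI or of the original program.

```cpp
#include <vector>

// Hypothetical helper: returns a pointer to the first element of v, or a
// null pointer if v is empty, so that operator[] is never applied to a
// vector without elements.
template< typename T >
T* bufferPtr( std::vector< T >& v )
{
    return v.empty() ? 0 : &v[ 0 ];
}
```

The Gatherv call could then pass bufferPtr( data ) and bufferPtr( recvData )
instead of &data[ 0 ] and &recvData[ 0 ]. This assumes that the MPI
implementation accepts a null buffer together with a count of zero, which I
would expect but have not verified for MPICH-1.2.7p1.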

Many thanks in advance!

Yours truly,
Martin Schwinzerl




