[mpich-discuss] "unexpected messages" Question

Darius Buntinas buntinas at mcs.anl.gov
Thu Jan 7 12:50:54 CST 2010


Hi Dave,

Receiving too many unexpected messages is a common problem people hit.  It 
is often caused by some processes "running ahead" of others.

First, an unexpected message is a message that has been received by the 
MPI library but for which a matching receive hasn't been posted (i.e., the 
program has not yet called a receive function like MPI_Recv or MPI_Irecv).  
What happens is that for small messages ("small" being determined by the 
particular MPI library and/or interconnect you're using), the library 
stores a copy of the message locally.  If you receive enough of these, 
you run out of memory.
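
Roughly, in code (the ranks, tag, and buffer here are made up purely for 
illustration):

#include <mpi.h>

/* Illustrative only: rank 0 sends one int to rank 1 with tag 0. */
static void one_message(int rank)
{
    int buf = 0;

    if (rank == 0) {
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* If this receive is posted before the message arrives, the
         * library matches it right away ("expected").  If the message
         * arrives first, the library keeps an internal copy
         * ("unexpected") until this call is finally made. */
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}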

So let's say you have a program where one process receives a message and 
then does some computation on it, repeatedly in a loop, and another 
process sends messages to the first process, also in a loop.  Because 
the first process spends some time processing each message, it'll 
probably run slower than the second, and the second process will end up 
sending messages faster than the first process can receive them.

You then end up with all of these unexpected messages because the 
receiver hasn't been able to post the receives fast enough.  Note that 
this can happen even if you're using blocking sends.  Remember that 
MPI_Send() returns when the send buffer is free to be reused, and not 
necessarily when the receiver has received the message.
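
Here's a minimal sketch of that pattern (two ranks assumed; the message 
size, iteration count, and the per-message work are invented for the 
example):

#include <mpi.h>

static void producer_consumer(int rank, int niters)
{
    double msg[64] = {0};

    for (int i = 0; i < niters; i++) {
        if (rank == 0) {
            /* For a small message, MPI_Send may return as soon as the
             * data has been buffered, long before rank 1 receives it,
             * so this loop can run far ahead of the receiver. */
            MPI_Send(msg, 64, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(msg, 64, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* ... slow per-message processing here means rank 0's
             * messages pile up in rank 1's unexpected queue ... */
        }
    }
}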

Another place where you can run into this problem of unexpected messages 
is with collectives.  People often think of collective operations as 
synchronizing operations.  This is not always the case.  Consider 
reduce.  A reduction operation is typically performed in a tree fashion, 
where each process receives messages from its children, performs the 
operation, and sends the result to its parent.  If a process which 
happens to be a leaf of the tree calls MPI_Reduce before its parent 
process does, it will result in an unexpected message at the parent. 
Note also that the leaf process may return from MPI_Reduce before its 
parent even calls MPI_Reduce.  Now look at what happens if you have a 
loop with MPI_Reduce in it.  Because the non-leaf nodes have to receive 
messages from several children, perform a calculation, and send the 
result, they will run slower than the leaf nodes, which only have to 
send a single message, so you may end up with an "unexpected message 
storm" similar to the one above.
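
As a sketch of that last case (the iteration count and the local value 
are invented for the example):

#include <mpi.h>

static void reduce_loop(int niters)
{
    for (int i = 0; i < niters; i++) {
        double local = (double)i;   /* stand-in for some local result */
        double global;

        /* Leaf ranks of the reduction tree only send one message, so
         * they can return and start the next iteration while interior
         * ranks are still receiving and combining values from their
         * children -- those messages arrive unexpected. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
    }
}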

So how can you fix the problem?  See if you can rearrange your code to 
get rid of loops like the ones described above.  Otherwise, you'll 
probably have to introduce some synchronization between the processes, 
which may affect performance.  For loops with collectives, you can add 
an MPI_Barrier in the loop.  For loops with sends/receives, you can use 
synchronous sends (MPI_Ssend and friends) or have the sender wait for 
an explicit ack message from the receiver.  Of course, you could optimize 
this so that you're not synchronizing in every iteration of the loop, 
e.g., by calling MPI_Barrier only every 100th iteration.
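
For example, something along these lines (the interval of 100 is just a 
placeholder; you'd tune it for your application):

#include <mpi.h>

static void reduce_loop_throttled(int niters)
{
    for (int i = 0; i < niters; i++) {
        double local = (double)i;   /* stand-in for some local result */
        double global;

        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        /* Every 100th iteration, let the slower (interior) ranks catch
         * up before anyone races further ahead. */
        if (i % 100 == 99)
            MPI_Barrier(MPI_COMM_WORLD);
    }
}

For point-to-point loops, replacing MPI_Send with MPI_Ssend has a similar 
throttling effect, since the send won't complete until the matching 
receive has started.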

I hope this helps some.

Darius



On 01/07/2010 11:27 AM, Hiatt, Dave M wrote:
> I'm following up on an earlier question.  I'm auditing the number of Bcast and Sends I do versus an exception message that is thrown during processing.  The message says "261894 unexpected messages queued".  This number is dramatically different than what appears to be the count of messages the app is sending (I'm counting a Bcast as 1 message) and the count of messages being received and sent between node 0 and the compute nodes.  This cluster has 496 total nodes.  When I run on a 60 node cluster I never see any hint of a problem like this.  And the network utilization does not indicate some kind of large congestion, but clearly something is happening.  So I'm assuming it's my app.  To that end, a few questions if I might ask:
>
> First question - Is a Bcast considered 1 message, or will it be N messages, where N is the number of active nodes, in terms of this kind of count?
> Second question - What constitutes an "unexpected message"?  I am assuming any Send or Bcast is expected.  Am I confused on this nomenclature?
> Third question  - I've assumed that the message count stated in this queue translates directly to the number of MPI::Send and MPI::Bcast calls I make.
>
> I have not been able so far to duplicate this problem on my test clusters (albeit they are much smaller, typically 60 nodes).  And I have no indication that I'm creating some kind of "message storm", as it were, in some kind of race condition.
>
> Thanks
> dave
>
> "Consequences, Schmonsequences, as long as I'm rich". - Daffy Duck
> Dave Hiatt
> Market Risk Systems Integration
> CitiMortgage, Inc.
> 1000 Technology Dr.
> Third Floor East, M.S. 55
> O'Fallon, MO 63368-2240
>
> Phone:  636-261-1408
> Mobile: 314-452-9165
> FAX:    636-261-1312
> Email:     Dave.M.Hiatt at citigroup.com
>
>
>
>

