[mpich-discuss] "unexpected messages" Question
Darius Buntinas
buntinas at mcs.anl.gov
Thu Jan 7 12:50:54 CST 2010
Hi Dave,
Receiving too many unexpected messages is a common bug people hit. This
is often caused by some processes "running ahead" of others.
First, an unexpected message is a message that has been received by the
MPI library but for which no receive has been posted (i.e., the program
has not yet called a receive function like MPI_Recv or MPI_Irecv). What
happens is that for small messages ("small" being determined by the
particular MPI library and/or interconnect you're using) the library
stores a copy of the message locally. If you receive enough of these,
you run out of memory.
So let's say you have a program where one process receives a message
and then does some computation on it, repeatedly in a loop, and another
process sends messages to that process, also in a loop. Because
the first process spends some time processing each message, it'll
probably run slower than the second, and the second process will end up
sending messages faster than the first process can receive them.
You then end up with all of these unexpected messages because the
receiver hasn't been able to post the receives fast enough. Note that
this can happen even if you're using blocking sends. Remember that
MPI_Send() returns when the send buffer is free to be reused, not
necessarily when the receiver has received the message.
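To make that concrete, here is a rough sketch of the pattern (the
iteration count and the do_work() routine are just placeholders, not
anything from your code). Run it with at least two processes; rank 0
keeps racing ahead of rank 1, and the small sends complete eagerly and
pile up as unexpected messages on rank 1.

#include <mpi.h>

static double do_work(int x)            /* stand-in for per-message computation */
{
    double s = 0.0;
    int i;
    for (i = 0; i < 100000; i++)
        s += (double)x * i;
    return s;
}

int main(int argc, char **argv)
{
    int rank, i, buf = 42;
    volatile double sink = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < 1000000; i++)
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* returns once buf is reusable */
    } else if (rank == 1) {
        for (i = 0; i < 1000000; i++) {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sink += do_work(buf);       /* the receiver falls behind here */
        }
    }

    MPI_Finalize();
    return 0;
}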
Another place where you can run into this problem of unexpected messages
is with collectives. People often think of collective operations as
synchronizing operations. This is not always the case. Consider
reduce. A reduction operation is typically performed in a tree fashion,
where each process receives messages from its children, performs the
operation, and sends the result to its parent. If a process which
happens to be a leaf of the tree calls MPI_Reduce before its parent
process does, it will result in an unexpected message at the parent.
Note also that the leaf process may return from MPI_Reduce before its
parent even calls MPI_Reduce. Now look at what happens if you have a loop
with MPI_Reduce in it. Because the non-leaf nodes have to receive
messages from several children, perform a calculation, and send the
result, they will run slower than the leaf nodes, which only have to send
a single message, so you may end up with an "unexpected message storm"
similar to the one above.
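A minimal sketch of that kind of loop (the iteration count is arbitrary;
the point is that nothing in the loop keeps the ranks in step, so leaf
ranks can be several reduces ahead of their parents):

#include <mpi.h>

int main(int argc, char **argv)
{
    int i, rank, val, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100000; i++) {
        val = rank + i;
        /* leaves only send here, so they return quickly and loop again;
           interior ranks must receive, combine, and forward */
        MPI_Reduce(&val, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}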
So how can you fix the problem? See if you can rearrange your code to
get rid of loops like the ones described above. Otherwise, you'll
probably have to introduce some synchronization between the processes,
which may affect performance. For loops with collectives, you can add
an MPI_Barrier in the loop. For loops with sends/receives, you can use
synchronous sends (MPI_Ssend and friends) or have the sender wait for
an explicit ack message from the receiver. Of course, you can optimize
this so you're not doing the synchronization in every iteration of
the loop, e.g., by calling MPI_Barrier only every 100th iteration.
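A rough sketch of both workarounds (the SYNC_INTERVAL of 100 and the
function names are just placeholders you'd tune for your application):

#include <mpi.h>

#define SYNC_INTERVAL 100   /* how often to synchronize; tune for your app */

/* Collective loop: add an occasional barrier so leaves can't run ahead. */
void reduce_loop(int niter, MPI_Comm comm)
{
    int i, rank, val, sum;
    MPI_Comm_rank(comm, &rank);
    for (i = 0; i < niter; i++) {
        val = rank + i;
        MPI_Reduce(&val, &sum, 1, MPI_INT, MPI_SUM, 0, comm);
        if (i % SYNC_INTERVAL == SYNC_INTERVAL - 1)
            MPI_Barrier(comm);
    }
}

/* Point-to-point loop: sender waits for an explicit ack every SYNC_INTERVAL sends. */
void send_loop(int niter, int dest, MPI_Comm comm)
{
    int i, buf = 0, ack;
    for (i = 0; i < niter; i++) {
        MPI_Send(&buf, 1, MPI_INT, dest, 0, comm);
        if (i % SYNC_INTERVAL == SYNC_INTERVAL - 1)
            MPI_Recv(&ack, 1, MPI_INT, dest, 1, comm, MPI_STATUS_IGNORE);
    }
}

void recv_loop(int niter, int src, MPI_Comm comm)
{
    int i, buf, ack = 1;
    for (i = 0; i < niter; i++) {
        MPI_Recv(&buf, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
        /* ... process buf ... */
        if (i % SYNC_INTERVAL == SYNC_INTERVAL - 1)
            MPI_Send(&ack, 1, MPI_INT, src, 1, comm);
    }
}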
I hope this helps some.
Darius
On 01/07/2010 11:27 AM, Hiatt, Dave M wrote:
> I'm following up on an earlier question. I'm auditing the number of Bcasts and Sends I do versus an exception message that is thrown during processing. The message says "261894 unexpected messages queued". This number is dramatically different from what appear to be the counts of messages the app is sending (I'm counting a Bcast as 1 message) and of messages being received and sent between node 0 and the compute nodes. This cluster has 496 total nodes. When I run on a 60 node cluster I never see any hint of a problem like this. And the network utilization does not indicate any kind of large congestion, but clearly something is happening. So I'm assuming it's my app. To that end, a few questions if I might ask:
>
> First question - Is a BCast considered 1 message or will it be N messages where N is the number of active nodes in terms of this kind of count?
> Second question - What constitutes an "unexpected message"? I am assuming any Send or BCast is expected. Am I confused on this nomenclature?
> Third question - I've assumed that the message count stated in this queue translates directly to the number of MPI::Send and MPI::Bcast calls I make.
>
> So far I have not been able to duplicate this problem on my test clusters (albeit they are much smaller, typically 60 nodes), and I have no indication that I can create some kind of "message storm", as it were, through some kind of race condition.
>
> Thanks
> dave
>
> "Consequences, Schmonsequences, as long as I'm rich". - Daffy Duck
> Dave Hiatt
> Market Risk Systems Integration
> CitiMortgage, Inc.
> 1000 Technology Dr.
> Third Floor East, M.S. 55
> O'Fallon, MO 63368-2240
>
> Phone: 636-261-1408
> Mobile: 314-452-9165
> FAX: 636-261-1312
> Email: Dave.M.Hiatt at citigroup.com
>
>
>
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss