[mpich-discuss] "unexpected messages" Question

chan at mcs.anl.gov
Thu Jan 7 12:09:13 CST 2010


It sounds like you have a message-imbalance problem in your code,
which usually shows up when scaling up the node count.  One option is
to throttle the message rate in your code so messages won't arrive
too fast at any particular node.  A tool like padb
(http://padb.pittman.org.uk/modes.html#mpi-queue)
may help identify the problem if you are running on a supported platform.

Comments are inlined below:

----- "Dave M Hiatt" <dave.m.hiatt at citi.com> wrote:

> I'm following up on an earlier question.  I'm auditing the number of
> Bcasts and Sends I do versus an exception message that is thrown during
> processing.  The message says "261894 unexpected messages
> queued".  This number is dramatically different from what appear to
> be the counts of messages the app is sending (I'm counting a Bcast as
> 1 message), counting messages received and sent between node
> 0 and the compute nodes.  This cluster has 496 total nodes.  When I
> run on a 60-node cluster I never see any hint of a problem like this.
> And the network utilization does not indicate any large
> congestion, but clearly something is happening.  So I'm assuming it's
> my app.  To that end, a few questions if I might ask:
> 
> First question - Is a BCast considered 1 message or will it be N
> messages where N is the number of active nodes in terms of this kind
> of count?

AFAIK, Bcast produces N messages.

> Second question - What constitutes an "unexpected message"?  I am
> assuming any Send or BCast is expected.  

An "unexpected message" is one that has arrived and been stored in the
unexpected-message queue (inside the MPI implementation) before the user
code has posted a matching receive for it.

> Am I confused on this
> nomenclature?
> Third question - I've assumed that the message count stated for
> this queue translates directly to the number of MPI::Send and
> MPI::Bcast calls I make.
> 
> So far I have not been able to duplicate this problem on my test
> clusters (albeit they are much smaller, typically 60 nodes).  And I
> have no indication of being able to create some kind of "message
> storm", as it were, through some kind of race condition.
> 
> Thanks
> dave
> 
> "Consequences, Schmonsequences, as long as I'm rich". - Daffy Duck
> Dave Hiatt
> Market Risk Systems Integration
> CitiMortgage, Inc.
> 1000 Technology Dr.
> Third Floor East, M.S. 55
> O'Fallon, MO 63368-2240
> 
> Phone:  636-261-1408
> Mobile: 314-452-9165
> FAX:    636-261-1312
> Email:     Dave.M.Hiatt at citigroup.com
> 
> 
> 
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
