[mpich-discuss] Recovering from a Bcast Timeout

Hiatt, Dave M dave.m.hiatt at citi.com
Tue Jan 5 10:11:44 CST 2010


Is there a good tutorial on decomposing MPI error messages?

For example, here's what I'm seeing at intermittent times. I'd like to become much more adept at deciphering these:
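As a starting point for decoding errors like the one below, MPI itself can translate error codes into readable strings if you switch the communicator's error handler from the default MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN. A minimal sketch (not from the original thread; the deliberately bad count is just to force an error):

```c
/* Sketch: install MPI_ERRORS_RETURN so MPI calls return an error code
 * instead of aborting, then decode the code with MPI_Error_string.
 * Build with: mpicc decode.c -o decode */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Default is MPI_ERRORS_ARE_FATAL; switch to returning codes. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rc = MPI_Bcast(NULL, -1, MPI_BYTE, 0, MPI_COMM_WORLD); /* bad count on purpose */
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len, errclass;
        MPI_Error_class(rc, &errclass);      /* coarse category of the failure */
        MPI_Error_string(rc, msg, &len);     /* human-readable description */
        fprintf(stderr, "MPI error class %d: %s\n", errclass, msg);
    }

    MPI_Finalize();
    return 0;
}
```

Note that MPICH's verbose error stacks (like the one below) already embed the internal call chain; MPI_Error_string gives you the portable, standard-level view of the same failure.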

Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x56e0048c, count=742412, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(536)...........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(436):
MPIDI_CH3_PktHandler_EagerSend(570).......: Failed to allocate memory for an unexpected message. 261894 unexpected messages queued.[cli_0]: aborting job:

In particular, if it's true that there are 261,894 outstanding messages, then I've got some compute nodes that have gone rogue on me.  But I can't reproduce this behavior at all on a test grid.  I still suspect that hardware problems or timeouts are actually causing this.

On restart from the exact point where this occurs, things run fine and will take off for many more hours, but often one of these "rogue waves" will just roll in.  So it's not just data related per se.  We've done extensive leak checking and found no leaks.  But this really looks like a network timeout.

So, where's a good place to learn how to really understand these error messages and what they're telling me?  And do you concur that I'm probably getting a timeout from a slow network, or an overwhelmed NIC out in the pool somewhere?


-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov]On Behalf Of Pavan Balaji
Sent: Tuesday, January 05, 2010 7:45 AM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Recovering from a Bcast Timeout



Calling an init after a finalize in the same program is incorrect as per
the MPI standard. If it worked in some cases, you were lucky :-).

See pg. 291 line 1 of the MPI-2.2 standard.
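To make the rule concrete, a hedged sketch: MPI_Initialized and MPI_Finalized let a process query the library's state, but once MPI_Finalize has run, no further MPI_Init is legal in that process; the only clean restart is to re-launch the process itself.

```c
/* Sketch: query MPI's lifecycle state. After MPI_Finalize, only a
 * handful of calls (MPI_Finalized among them) remain legal; calling
 * MPI_Init again in the same process violates the standard. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int inited = 0, finalized = 0;

    MPI_Initialized(&inited);
    if (!inited)
        MPI_Init(&argc, &argv);

    /* ... application work ... */

    MPI_Finalize();
    MPI_Finalized(&finalized);      /* legal even after finalize */
    if (finalized)
        printf("MPI finalized; re-initializing here would be illegal\n");
    return 0;
}
```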

 -- Pavan

On 01/04/2010 05:11 PM, Hiatt, Dave M wrote:
> A general question to those in the know.  From time to time I get a Bcast timeout error.  I'm putting in an error handler to "catch" this exception (C++).  My question is: will an MPI::Finalize() followed by an MPI::Init() work from the same process?  This error is being caused by our deficient network; we've never lost a blade, and I'm confident both the app and MPI are functioning properly through considerable investigation.
>
> So are there any consequences to simply doing a Finalize() and a new Init() to start up, or will I have to stop the whole process and start again?  I'm assuming that it should restart without prejudice.  I'm on 1.2.1 Windows/Linux releases.
>
> Thanks
>
> dave
>
>
> "Consequences, Schmonsequences, as long as I'm rich". - Daffy Duck
> Dave Hiatt
> Market Risk Systems Integration
> CitiMortgage, Inc.
> 1000 Technology Dr.
> Third Floor East, M.S. 55
> O'Fallon, MO 63368-2240
>
> Phone:  636-261-1408
> Mobile: 314-452-9165
> FAX:    636-261-1312
> Email:     Dave.M.Hiatt at citigroup.com
>
>
>
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji
