[mpich-discuss] Recovering from a Bcast Timeout

Pavan Balaji balaji at mcs.anl.gov
Tue Jan 5 12:29:23 CST 2010


It looks like the problem is what the error message says -- too many
unexpected messages. You might want to use a parallel debugger such as
TotalView to figure out what exactly is going on in your application.
There are open-source debuggers too (such as padb:
http://padb.pittman.org.uk) that you can try, though I have personally
not used padb so far.
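
For background, an "unexpected message" is one that arrives before the
matching receive has been posted, so MPICH2 has to buffer it internally;
a few hundred thousand of those will exhaust memory, which is exactly
what the stack trace reports. A minimal sketch of the pattern that
produces this (the counts and tag are illustrative, not taken from your
application):

    // Every worker floods rank 0 with small eager sends before rank 0
    // has posted the matching receives.
    #include <mpi.h>

    int main(int argc, char* argv[])
    {
        MPI::Init(argc, argv);
        int rank = MPI::COMM_WORLD.Get_rank();
        int size = MPI::COMM_WORLD.Get_size();
        int msg  = rank;

        if (rank != 0) {
            for (int i = 0; i < 100000; ++i)
                // Small messages go out eagerly; nothing slows the
                // sender down.  Ssend() here would block each send
                // until the receive is posted, throttling the flood.
                MPI::COMM_WORLD.Send(&msg, 1, MPI::INT, 0, 0);
        } else {
            // Imagine rank 0 spending a long time on unrelated work
            // here; meanwhile every Send() above lands on the
            // unexpected-message queue, since no receive is posted yet.
            for (int i = 0; i < 100000 * (size - 1); ++i)
                MPI::COMM_WORLD.Recv(&msg, 1, MPI::INT,
                                     MPI::ANY_SOURCE, 0);
        }

        MPI::Finalize();
        return 0;
    }

Switching the senders to Ssend() (or adding some other acknowledgment-
based flow control) keeps the queue bounded at the cost of some latency;
and if one node really has gone rogue in a loop like this, the stack
traces from a debugger should show it quickly.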

 -- Pavan

On 01/05/2010 10:11 AM, Hiatt, Dave M wrote:
> Is there a good tutorial on decomposing MPI error messages?
> 
> For example, here's what I'm seeing at intermittent times.  I'd like to become much more adept at deciphering these.
> 
> Fatal error in MPI_Bcast: Other MPI error, error stack:
> MPI_Bcast(786)............................: MPI_Bcast(buf=0x56e0048c, count=742412, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast(536)...........................:
> MPIC_Sendrecv(126)........................:
> MPIC_Wait(270)............................:
> MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(436):
> MPIDI_CH3_PktHandler_EagerSend(570).......: Failed to allocate memory for an unexpected message. 261894 unexpected messages queued.
> [cli_0]: aborting job:
> 
> In particular, if it's true that there are 261,894 outstanding messages, then I've got some compute nodes that have gone rogue on me.  But I can't find this behavior at all on a test grid.  I'm still very suspicious that hardware problems or timeouts are actually causing this.
> 
> On restart from the exact point where this occurs, things run fine and will take off for many more hours, but often one of these "rogue waves" will just roll in.  So it's not just data related per se.  We've done extensive leak checking and found no leaks.  But this really looks like a network timeout.
> 
> So, where's a good place to learn how to really understand these error messages and what they're telling me?  And do you guys concur that I'm probably getting a timeout from a slow network or an overwhelmed NIC out in the pool somewhere?
> 
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov]On Behalf Of Pavan Balaji
> Sent: Tuesday, January 05, 2010 7:45 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Recovering from a Bcast Timeout
> 
> 
> 
> Calling an init after a finalize in the same program is incorrect as per
> the MPI standard. If it worked in some cases, you were lucky :-).
> 
> See pg. 291 line 1 of the MPI-2.2 standard.
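> 
> In other words, each process may call init exactly once and finalize
> exactly once, in that order; recovering from a communication failure
> means restarting the process, not re-initializing MPI inside it. A
> minimal sketch of the only legal lifecycle (error handling elided):
> 
>     #include <mpi.h>
> 
>     int main(int argc, char* argv[])
>     {
>         if (!MPI::Is_initialized())
>             MPI::Init(argc, argv);   // at most once per process
> 
>         // ... all MPI communication happens between these two calls ...
> 
>         if (!MPI::Is_finalized())
>             MPI::Finalize();   // also at most once; after this no MPI
>                                // calls are allowed, not even MPI::Init()
>         return 0;
>     }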
> 
>  -- Pavan
> 
> On 01/04/2010 05:11 PM, Hiatt, Dave M wrote:
>> A general question to those in the know.  From time to time I get a Bcast timeout error.  I'm putting in an error handler to do a "catch" on this exception (C++).  My question is, will an MPI::Finalize() followed by an MPI::Init() work from the same process?  This error is being caused by our deficient network; we've never lost a blade, and I'm confident through considerable investigation that both the app and MPI are functioning properly.
>>
>> So are there any consequences to simply doing a Finalize() and a new Init() to start up, or will I have to stop the whole process and start again?  I'm assuming that it should restart without prejudice.  I'm on the 1.2.1 Windows/Linux releases.
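>>
>> The handler setup I'm using is the standard C++-bindings pattern,
>> roughly this (buf and count stand in for the real arguments):
>>
>>     // The default handler is ERRORS_ARE_FATAL; switch it so that
>>     // failures throw an exception we can catch instead of aborting.
>>     MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);
>>     try {
>>         MPI::COMM_WORLD.Bcast(buf, count, MPI::BYTE, 0);
>>     } catch (MPI::Exception& e) {
>>         // e.Get_error_code() / e.Get_error_string() identify the
>>         // failure; this is where I'd like to Finalize() and re-Init().
>>     }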
>>
>> Thanks
>>
>> dave
>>
>>
>> "Consequences, Schmonsequences, as long as I'm rich". - Daffy Duck
>> Dave Hiatt
>> Market Risk Systems Integration
>> CitiMortgage, Inc.
>> 1000 Technology Dr.
>> Third Floor East, M.S. 55
>> O'Fallon, MO 63368-2240
>>
>> Phone:  636-261-1408
>> Mobile: 314-452-9165
>> FAX:    636-261-1312
>> Email:     Dave.M.Hiatt at citigroup.com
>>
>>
>>
>>
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

