[mpich-discuss] Recovering from a Bcast Timeout

Pavan Balaji balaji at mcs.anl.gov
Tue Jan 5 12:25:15 CST 2010


What you are really looking for is fault tolerance in MPI programs. The
current MPI standard (MPI-2.2) doesn't provide any fault tolerance (this
is being worked on for the upcoming MPI-3 standard). However, a few
things can be done within the MPICH2 stack to help in these cases. Some
of these are planned for the MPICH2-1.3.x release series (see
https://trac.mcs.anl.gov/projects/mpich2/roadmap).
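
To illustrate the two recovery paths discussed in the quoted message
below (resend while the communicator is still intact, otherwise restart
the whole job), here is a minimal C++ sketch of a bounded retry. It is
not MPICH2 code: retry_collective and Outcome are hypothetical names, and
the callable stands in for something like an MPI_Bcast issued on a
communicator whose error handler has been set to MPI_ERRORS_RETURN. Keep
in mind that the MPI standard leaves the state of MPI undefined after an
error, so whether a resend works at all depends on the implementation.

```cpp
#include <cassert>
#include <functional>

// Result of trying a collective a bounded number of times.
enum class Outcome { Succeeded, RestartJob };

// Hypothetical helper (not part of MPI or MPICH2). Calls `op` until it
// returns 0 or `max_attempts` calls have been made. In a real program `op`
// is assumed to wrap a call such as MPI_Bcast and return its error code,
// where 0 corresponds to MPI_SUCCESS.
inline Outcome retry_collective(const std::function<int()>& op,
                                int max_attempts,
                                int* attempts_used = nullptr) {
    assert(max_attempts > 0);  // at least one attempt must be allowed
    int used = 0;
    Outcome result = Outcome::RestartJob;
    while (used < max_attempts) {
        ++used;
        if (op() == 0) {  // the operation reported success
            result = Outcome::Succeeded;
            break;
        }
    }
    if (attempts_used != nullptr) {
        *attempts_used = used;  // report how many calls were made
    }
    return result;
}
```

A caller would pass a lambda that performs the broadcast and returns its
error code. On Outcome::RestartJob the process would exit (for example
via MPI_Abort) so that a fresh job can be launched, rather than calling
init again after finalize.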

 -- Pavan

On 01/05/2010 09:58 AM, Hiatt, Dave M wrote:
> So, let's say I lose an mpd out there on the grid. (With 550 blades, and given standard MTBF for a 1GbE network and all the 2nd- and 3rd-level switches, I'm going to get a timeout or hiccup about every 10 to 12 hours.) In that case there is no option: I drop the process that did the original MPI::Init and start a new process.
> 
> If I don't lose the mpd, then I don't need to call MPI::Finalize at all (since the size of the communicators remains accurate), and I can just resend the message and off we go.
> 
> My reading is that those are the two paths I can follow.  Does that about cover the options?
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov]On Behalf Of Pavan Balaji
> Sent: Tuesday, January 05, 2010 7:45 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Recovering from a Bcast Timeout
> 
> 
> 
> Calling an init after a finalize in the same program is incorrect as per
> the MPI standard. If it worked in some cases, you were lucky :-).
> 
> See pg. 291 line 1 of the MPI-2.2 standard.
> 
>  -- Pavan
> 
> On 01/04/2010 05:11 PM, Hiatt, Dave M wrote:
>> A general question to those in the know.  From time to time I get a Bcast timeout error.  I'm putting in an error handler to "catch" this exception (C++).  My question is: will an MPI::Finalize() followed by an MPI::Init() work from the same process?  This error is being caused by our deficient network; we've never lost a blade, and after considerable investigation I'm confident both the app and MPI are functioning properly.
>>
>> So are there any consequences to simply doing a Finalize() and a new Init() to start up, or will I have to stop the whole process and start again?  I'm assuming that it should restart without prejudice.  I'm on the 1.2.1 Windows/Linux releases.
>>
>> Thanks
>>
>> dave
>>
>>
>> "Consequences, Schmonsequences, as long as I'm rich". - Daffy Duck
>> Dave Hiatt
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

