[mpich-discuss] Recovering from a Bcast Timeout

Hiatt, Dave M dave.m.hiatt at citi.com
Tue Jan 5 09:58:21 CST 2010


So, let's say I lose an mpd out there on the grid. (With 550 blades, standard MTBF for a 1 GbE network, and all the 2nd- and 3rd-level switches, the reality is that I'm going to get a timeout or hiccup about every 10 to 12 hours.) In that case there is no option: I have to drop the process that did the original MPI::Init and start a new process.

If I don't lose the mpd, then I don't need to do an MPI::Finalize at all (since the size of the communicators remains accurate), and I can just resend the message and off we go.

My reading is that those are the two paths I can follow.  Does that about cover the options?
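
The second path (keep the process alive and resend) can be sketched as a bounded retry loop. This is only an illustration, not MPICH code: `retry_collective`, `RetryResult`, and the `success_code` parameter are hypothetical names, and the operation is passed in as a callable so the sketch compiles without MPI. With a real MPI build the callable would wrap the broadcast, and the communicator would need an error handler that returns codes (such as MPI_ERRORS_RETURN) instead of aborting. The MPI standard leaves the state of MPI undefined after an error, so whether a resend actually works depends on the implementation, and every rank has to re-enter the collective for it to complete.

```cpp
#include <functional>

// Hypothetical helper (not part of MPI): outcome of a bounded retry.
struct RetryResult {
    bool ok;        // true if some attempt returned success_code
    int attempts;   // number of times the operation was invoked
    int last_code;  // return code of the final attempt
};

// Invokes op() up to max_attempts times and stops at the first success.
// op is assumed to report failure through its integer return code, the way
// MPI calls do once the communicator's error handler returns codes.
inline RetryResult retry_collective(const std::function<int()>& op,
                                    int success_code,
                                    int max_attempts) {
    RetryResult result{false, 0, success_code};
    for (int i = 0; i < max_attempts; ++i) {
        result.last_code = op();
        ++result.attempts;
        if (result.last_code == success_code) {
            result.ok = true;
            break;
        }
    }
    return result;
}
```

If `result.ok` is still false after the last attempt, that corresponds to the first path: give up on this process and start a new one.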

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Pavan Balaji
Sent: Tuesday, January 05, 2010 7:45 AM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Recovering from a Bcast Timeout



Calling an init after a finalize in the same program is incorrect as per
the MPI standard. If it worked in some cases, you were lucky :-).

See pg. 291 line 1 of the MPI-2.2 standard.
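
The lifecycle rule can be modelled in a few lines. This is a sketch under stated assumptions, not how MPICH tracks its state: `MpiLifecycle` is a hypothetical stand-in class, written without MPI so that it compiles on its own. With a real MPI build the two flags would be queried with MPI_Initialized() and MPI_Finalized(), and the "initialized" flag stays true after finalize, as it does here.

```cpp
#include <stdexcept>

// Hypothetical model of the one-shot MPI lifecycle:
// not initialized -> initialized -> finalized, with no way back.
class MpiLifecycle {
public:
    bool initialized() const { return initialized_; }
    bool finalized() const { return finalized_; }

    // Allowed exactly once, and never after finalize().
    void init() {
        if (finalized_) {
            throw std::logic_error("init after finalize is not allowed");
        }
        if (initialized_) {
            throw std::logic_error("already initialized");
        }
        initialized_ = true;
    }

    // Allowed exactly once, and only after init().
    void finalize() {
        if (!initialized_ || finalized_) {
            throw std::logic_error("finalize without a matching init");
        }
        finalized_ = true;
    }

private:
    bool initialized_ = false;
    bool finalized_ = false;
};
```

A recovery path that wants a fresh MPI environment therefore needs a new process; a second `init()` on the same object throws.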

 -- Pavan

On 01/04/2010 05:11 PM, Hiatt, Dave M wrote:
> A general question to those in the know.  From time to time I get a Bcast timeout error.  I'm putting in an error handler to do a "catch" on this exception (C++).  My question is, will an MPI::Finalize() followed by an MPI::Init() work from the same process?  This error is being caused by our deficient network; we've never lost a blade, and I'm confident both the app and MPI are functioning properly through considerable investigation.
>
> So are there any consequences to simply doing a Finalize() and a new Init() to start up, or will I have to stop the whole process and start again?  I'm assuming that it should restart without prejudice.  I'm on 1.2.1 Windows/Linux releases.
>
> Thanks
>
> dave
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji