[mpich-discuss] 1.2.1p1 and Linux

Darius Buntinas buntinas at mcs.anl.gov
Thu Sep 30 11:28:49 CDT 2010


Right.  Ideally, after a failed process is detected, you'd want to be able to create a new communicator excluding that process.  Unfortunately, for now, creating a communicator requires collective operations.  These are the issues we're looking into.
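
To make the problem concrete, here is a minimal sketch (in C, using only standard MPI calls; "failed_rank" is a hypothetical value the application somehow learned out of band) of what excluding a failed rank would look like, and why it is stuck on collectives:

    #include <mpi.h>

    /* Sketch only: build a communicator without a rank believed dead.
     * The catch is that MPI_Comm_split is itself collective over the
     * old communicator, so it requires participation from the very
     * process that is gone. */
    int exclude_failed(MPI_Comm comm, int failed_rank, MPI_Comm *newcomm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        /* Survivors pick color 0; the failed rank would have to pass
         * MPI_UNDEFINED -- but it can no longer call anything. */
        int color = (rank == failed_rank) ? MPI_UNDEFINED : 0;
        return MPI_Comm_split(comm, color, rank, newcomm);
    }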

-d

On Sep 30, 2010, at 11:05 AM, Hiatt, Dave M wrote:

> Roger that.  One more question, if I may.  I assume the rule still applies that a process that has done an MPI_FINALIZE (the full process, not just the thread that executed the MPI_INIT) cannot do a subsequent MPI_INIT, correct?  So to restart with a reduced communicator, the process that had the failure must terminate completely, and a new process must start (or be forked) before a new MPI_INIT can be done on what would presumably be the reduced communicator.
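> 
> Just to be sure I'm reading the rule right, a pattern like the
> following (my own sketch, not anything from the MPICH docs) would
> be illegal:
> 
>     #include <mpi.h>
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         /* ... work; a peer process fails ... */
>         MPI_Finalize();
>         /* Illegal: MPI_Init may be called at most once per process,
>          * even after MPI_Finalize completes.  Only a freshly started
>          * (or forked/exec'd) process may initialize MPI again. */
>         MPI_Init(&argc, &argv);
>         return 0;
>     }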
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius Buntinas
> Sent: Thursday, September 30, 2010 10:56 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] 1.2.1p1 and Linux
> 
> 
> The 1.3 release has checkpointing <http://wiki.mcs.anl.gov/mpich2/index.php/Checkpointing> and improved tolerance of process and communication failures (using TCP).  If a process sets the error handler to MPI_ERRORS_RETURN, then in the event of a communication failure, e.g., if a process dies and another process tries to send a message to it, an error code will be returned to the application, and communication with other processes will still be possible.  However, collectives are problematic with failed processes: the behavior of collectives with failed processes is not defined by the standard, so such a collective operation may currently fail, succeed, or hang for different processes in the communicator.  We, along with the MPI Forum, are working on these issues.
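> 
> In code, the setup looks like this (a minimal sketch using standard
> MPI-2 calls; the destination rank and payload are made up for
> illustration):
> 
>     #include <mpi.h>
>     #include <stdio.h>
> 
>     /* Send to a possibly failed rank; report the error instead of
>      * letting MPI abort the whole job. */
>     static int try_send(int *buf, int count, int dest, int tag)
>     {
>         int err = MPI_Send(buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD);
>         if (err != MPI_SUCCESS) {
>             char msg[MPI_MAX_ERROR_STRING];
>             int len;
>             MPI_Error_string(err, msg, &len);
>             fprintf(stderr, "send to rank %d failed: %s\n", dest, msg);
>         }
>         return err;
>     }
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         /* Return error codes instead of aborting; the default
>          * handler is MPI_ERRORS_ARE_FATAL. */
>         MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> 
>         int data = 42;
>         try_send(&data, 1, 1, 0);   /* rank 1 may or may not be alive */
> 
>         MPI_Finalize();
>         return 0;
>     }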
> 
> -d
> 
> On Sep 29, 2010, at 5:39 PM, Hiatt, Dave M wrote:
> 
>> Yes, we need to go to Hydra anyway.  
>> Speaking of 1.3, that has fixes for fault tolerance, doesn't it?  Is there documentation available yet on how to use it from the application's perspective?
>> 
>> Thanks
>> dave
>> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius Buntinas
>> Sent: Wednesday, September 29, 2010 2:46 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] 1.2.1p1 and Linux
>> 
>> OK, good luck :-).  You could try the 1.3 release candidate, or use Hydra.
>> 
>> -d
>> 
>> On Sep 29, 2010, at 2:42 PM, Hiatt, Dave M wrote:
>> 
>>> Can't get the MPD ring to start.  We're having to drop back to backup systems because of a hardware failure on our RHEL 5 grid, where things were running fine.  This is kind of a fire-drill situation, so there could be a lot of reasons for this that are entirely our fault, configuration problems for sure, even though we thought it should work.  Just wanted the reassurance to avoid a pie in the face before investing much time in chasing this problem down.  Thanks
>>> 
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius Buntinas
>>> Sent: Wednesday, September 29, 2010 2:28 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] 1.2.1p1 and Linux
>>> 
>>> 
>>> It should.  What issues are you seeing?
>>> 
>>> -d
>>> 
>>> On Sep 29, 2010, at 2:23 PM, Hiatt, Dave M wrote:
>>> 
>>>> 1.2.1p1 should function on RHEL 4.x, should it not?
>>>> 
>>>> "With sufficient thrust pigs can fly, but it is not necessarily a good idea"
>>>> 
>>>> Dave M. Hiatt
>>>> Director, Risk Analytics
>>>> CitiMortgage
>>>> 1000 Technology Drive
>>>> O'Fallon, MO 63368-2240
>>>> 
>>>> Telephone:  636-261-1408
>>>> 
>> 
> 