[mpich-discuss] 1.2.1p1 and Linux

Darius Buntinas buntinas at mcs.anl.gov
Thu Sep 30 10:56:15 CDT 2010


The 1.3 release has checkpointing <http://wiki.mcs.anl.gov/mpich2/index.php/Checkpointing> and improved tolerance of process and communication failures (when using TCP).  If a process sets the error handler to MPI_ERRORS_RETURN, then in the event of a communication failure (e.g., a process dies and another process tries to send a message to it), an error code is returned to the application, and communication with other processes remains possible.  Collectives, however, are problematic with failed processes: the standard does not define how a collective behaves when a process in the communicator has failed.  Currently such a collective operation may fail, succeed, or hang for different processes in the communicator.  We, along with the MPI Forum, are working on these issues.

-d

On Sep 29, 2010, at 5:39 PM, Hiatt, Dave M wrote:

> Yes, we need to go to Hydra anyway.  
> Speaking of 1.3, that has fixes for fault tolerance, doesn't it?  Is there documentation available now on how to use it from the application's perspective?
> 
> Thanks
> dave
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius Buntinas
> Sent: Wednesday, September 29, 2010 2:46 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] 1.2.1p1 and Linux
> 
> OK, good luck :-).  You could try the 1.3 release candidate, or use hydra.
> 
> -d
> 
> On Sep 29, 2010, at 2:42 PM, Hiatt, Dave M wrote:
> 
>> Can't get the MPD ring to start.  We had to drop back to backup systems because of a hardware failure on our RHEL 5 grid, where things were running fine.  This is kind of a fire drill situation, so there could be a lot of reasons for this that are entirely our fault, most likely configuration, but I thought it should work.  Just wanted the reassurance to avoid a pie in the face before investing much time in chasing this problem down.  Thanks
>> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius Buntinas
>> Sent: Wednesday, September 29, 2010 2:28 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] 1.2.1p1 and Linux
>> 
>> 
>> It should.  What issues are you seeing?
>> 
>> -d
>> 
>> On Sep 29, 2010, at 2:23 PM, Hiatt, Dave M wrote:
>> 
>>> 1.2.1p1 should function on RHEL 4.x, should it not?
>>> 
>>> "With sufficient thrust pigs can fly, but it is not necessarily a good idea"
>>> 
>>> Dave M. Hiatt
>>> Director, Risk Analytics
>>> CitiMortgage
>>> 1000 Technology Drive
>>> O'Fallon, MO 63368-2240
>>> 
>>> Telephone:  636-261-1408
>>> 
>>> _______________________________________________
>>> mpich-discuss mailing list
>>> mpich-discuss at mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>> 
> 


