[mpich-discuss] Student Projects about Fault Tolerance

Pavan Balaji balaji at mcs.anl.gov
Mon Sep 6 13:19:14 CDT 2010


We are certainly happy to help out on student projects.

We have made several improvements to MPICH2 so that when one process 
of an application dies abnormally, the remaining processes get an 
error return but continue to function correctly. The application can 
then spawn extra processes as needed (using MPI-2 dynamic processes). 
There is also BLCR checkpoint-restart support. Most of this was released 
in 1.3b1, and we have added some more improvements that will be released 
in 1.3rc1 (to be out soon).

Please try these out and let us know if you run into any issues.

Unfortunately, I don't believe we have gotten around to writing any 
documentation on how it works yet, but I've created a ticket for it: 
https://trac.mcs.anl.gov/projects/mpich2/ticket/1089

  -- Pavan

On 09/06/2010 04:08 AM, 牛海波 wrote:
> My dear friend:
> I'm a student interested in fault tolerance in MPICH, and fortunately I
> saw the news about the Student Projects, including Fault Tolerance, on
> your site. You said that "If you are interested in pursuing one of the
> below, contact us and we will be able to provide you with some
> guidance." So I think maybe I could get some docs about the MPICH
> implementation or some other helpful guidance.
> I would very much appreciate your help.
> Looking forward to a prompt reply.
> Sincerely yours:
> herman

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

