[mpich-discuss] Student Projects about Fault Tolerance
Pavan Balaji
balaji at mcs.anl.gov
Mon Sep 6 13:19:14 CDT 2010
We are certainly happy to help out on student projects.
We have made several improvements to MPICH2 so that when one process
of an application dies abnormally, the remaining processes just see an
error returned, but continue to function correctly. The application can
spawn extra processes as needed (using MPI-2 dynamic processes). There's
also BLCR checkpoint-restart support. Most of these have been released
in 1.3b1, but we added some more improvements that will be released in
1.3rc1 (to be out soon).
Please try these out and let us know if you run into any issues.
Unfortunately, I don't believe we have gotten around to writing any
documentation on how it works yet. But I've created a ticket for it:
https://trac.mcs.anl.gov/projects/mpich2/ticket/1089
-- Pavan
On 09/06/2010 04:08 AM, 牛海波 wrote:
> My dear friend:
> I'm a student interested in fault tolerance in MPICH, and fortunately I
> saw the news about the Student Projects, including Fault Tolerance, on
> your site. You said that "If you are interested in pursuing one of the
> below, contact us and we will be able to provide you with some
> guidance." So I think maybe I could get some docs about the MPICH
> implementation or some other helpful guidance.
> I would very much appreciate your help.
> Looking forward to a prompt reply.
> Sincerely yours:
> herman
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji