[MPICH] unmanaged disconnection from mpd ring

Tue May 15 12:34:26 CDT 2007

Clarify this part:
> However, the job was still running (waiting) this morning, since 6
> processors were off the ring with the mpdtrace output as
> chara
> cha02.

Do you mean that sometime during the night, after you submitted the job,
some machines dropped out of the ring (rebooted, lost network connectivity,
whatever)?  If that's the case, I don't think MPICH2 has error handling
capable of keeping a job running properly if the ring topology changes
during the job.  Both the MPI implementation and the program itself would
need to handle it.  It's a tall order.  But maybe you're just asking WHY the
machines left the ring?  That's a very different question.
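
If the goal is at least to notice when hosts drop out, one option is to
compare the mpdtrace output against the hosts you expect.  The following is
only a rough sketch, not anything MPICH2 provides: the function names
missing_hosts and current_ring are made up for illustration, and it assumes
mpdtrace is on the PATH and prints one hostname per line, as in your output.

```python
import subprocess


def missing_hosts(expected, mpdtrace_output):
    """Return the expected hosts that do not appear in the mpdtrace output.

    Assumes one hostname per line in mpdtrace_output (hypothetical helper).
    """
    present = {line.strip() for line in mpdtrace_output.splitlines() if line.strip()}
    return [host for host in expected if host not in present]


def current_ring():
    """Run mpdtrace and return its raw output.

    Assumes an mpd ring is running and mpdtrace is on the PATH;
    raises CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        ["mpdtrace"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling missing_hosts(expected, current_ring()) periodically, for example
from cron, with your own host list as expected, would show which machines
left the ring and roughly when.  It would not explain why they left.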

-Matt Chambers

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Jinyou Liang
> Sent: Tuesday, May 15, 2007 11:31 AM
> To: mpich-discuss-digest at mcs.anl.gov
> Subject: [MPICH] unmanaged disconnection from mpd ring
> 
> Dear friends,
> 
> I encountered a problem with mpd, described below, and would
> appreciate any insights you can offer to prevent similar
> problems.
> Thanks in advance,
> Paul
> 
> The problem:
> 
> Yesterday, I linked 8 dual-processor machines together and mpdtrace output was:
> chara
> cha02
> cha03
> ...
> cha08.
> 
> I submitted a job that was supposed to finish during the night.
> 
> However, the job was still running (waiting) this morning, since 6
> processors were off the ring with the mpdtrace output as
> chara
> cha02.