[MPICH] unmanaged disconnection from mpd ring
Matthew Chambers
matthew.chambers at vanderbilt.edu
Tue May 15 12:34:26 CDT 2007
Clarify this part:
> However, the job was still running (waiting) this morning, since 6
> processors were off the ring with the mpdtrace output as
> chara
> cha02.
Do you mean that sometime during the night, after you submitted the job,
some machines dropped out of the ring (rebooted, lost network
connectivity, whatever)? If that's the case, I don't think MPICH2 has
error handling capable of keeping a job running properly when the ring
topology changes mid-job. Both the program and the MPI implementation
would need to handle it, which is a tall order. But maybe you're just
asking WHY the machines left the ring? That's a very different question.
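
If the goal is just to notice when hosts leave the ring, something like
the rough sketch below might help. It is not a fix for the job itself.
It compares the host names printed by mpdtrace against a list you
expect to see. It assumes mpdtrace is on the PATH and prints one host
name per line, as in the output quoted in this thread. The function
names and the host list are hypothetical, made up for illustration.

```python
import subprocess


def parse_mpdtrace(output):
    """Return the set of host names in mpdtrace output (one per line)."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def missing_hosts(expected, trace_output):
    """Return the expected hosts that are absent from the ring, in order."""
    present = parse_mpdtrace(trace_output)
    return [host for host in expected if host not in present]


def current_ring():
    """Run mpdtrace and return its standard output.

    Assumption: mpdtrace is on the PATH and an mpd ring is reachable.
    """
    result = subprocess.run(
        ["mpdtrace"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical host list; replace it with the hosts in your own ring.
    expected = ["chara", "cha02", "cha03", "cha08"]
    gone = missing_hosts(expected, current_ring())
    if gone:
        print("hosts missing from the mpd ring: " + ", ".join(gone))
    else:
        print("all expected hosts are in the ring")
```

Run from cron or in a loop next to a long job, this would at least tell
you roughly when the hosts dropped out, which narrows down the WHY.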
-Matt Chambers
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-
> discuss at mcs.anl.gov] On Behalf Of Jinyou Liang
> Sent: Tuesday, May 15, 2007 11:31 AM
> To: mpich-discuss-digest at mcs.anl.gov
> Subject: [MPICH] unmanaged disconnection from mpd ring
>
> Dear friends,
>
> I encountered a problem with mpd as described below, and would
> appreciate any insights that you may kindly offer to prevent similar
> problem.
> Thanks in advance,
> Paul
>
> The problem:
>
> Yesterday, I linked 8 dual processors together and mpdtrace output was:
> chara
> cha02
> cha03
> ...
> cha08.
>
> I submitted a job that was supposed to finish during the night.
>
> However, the job was still running (waiting) this morning, since 6
> processors were off the ring with the mpdtrace output as
> chara
> cha02.