[MPICH] unmanaged disconnection from mpd ring

Tue May 15 12:55:50 CDT 2007

That is usually caused by network connectivity problems.  It's also
possible that the MPD daemon crashed on those machines or that the machines
themselves crashed.  How are the machines connected, and have you checked
whether they stayed up the whole time or whether the MPD daemon exited?
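
One way to narrow down when hosts leave the ring is to compare the
mpdtrace output against the list of hosts you expect.  The following is
only a rough sketch, not a tested tool: it assumes plain mpdtrace prints
one hostname per line (as in the output quoted below), and the function
names are made up for illustration.

```python
import subprocess


def parse_mpdtrace(output):
    """Return the set of hostnames in plain mpdtrace output (one per line)."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def missing_hosts(expected, trace_output):
    """Return the expected hosts that are absent from the ring, in order."""
    present = parse_mpdtrace(trace_output)
    return [host for host in expected if host not in present]


def check_ring(expected):
    """Run mpdtrace on the local node and list expected hosts that are gone.

    Assumption: a non-zero exit status means no usable ring was reachable
    from this node, so every expected host is reported as missing.
    """
    result = subprocess.run(["mpdtrace"], capture_output=True, text=True)
    if result.returncode != 0:
        return list(expected)
    return missing_hosts(expected, result.stdout)
```

Calling check_ring(["chara", "cha02", "cha03", "cha08"]) from a cron job
and logging the result with a timestamp would show roughly when each host
dropped out, which you can then match against the system logs and uptime
on that machine.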

-Matt

> -----Original Message-----
> From: Jinyou Liang [mailto:jliang at arb.ca.gov]
> Sent: Tuesday, May 15, 2007 12:51 PM
> To: Matthew Chambers
> Cc: mpich-discuss-digest at mcs.anl.gov
> Subject: Re: [MPICH] unmanaged disconnection from mpd ring
> 
> Matt,
> 
> Thanks for your clarification on my question.  I am just wondering why
> the machines left the ring, and how that could happen?
> 
> Paul
> 
> 
> Matthew Chambers wrote:
> > Clarify this part:
> >
> >> However, the job was still running (waiting) this morning, since 6
> >> processors were off the ring with the mpdtrace output as
> >> chara
> >> cha02.
> >>
> >
> > Do you mean that sometime during the night, after you submitted the job,
> > some machines dropped out of the ring (rebooted, dropped network
> > connectivity, whatever)?  If that's the case, I don't think that MPICH2
> > has error handling capable of keeping a job running properly if the ring
> > topology changes during a job.  The program would need to be able to
> > handle it as well as the MPI implementation.  It's a tall order.  But
> > maybe you're just asking WHY the machines left the ring?  And that's a
> > very different question.
> >
> > -Matt Chambers
> >
> >
> >
> >> -----Original Message-----
> >> From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Jinyou Liang
> >> Sent: Tuesday, May 15, 2007 11:31 AM
> >> To: mpich-discuss-digest at mcs.anl.gov
> >> Subject: [MPICH] unmanaged disconnection from mpd ring
> >>
> >> Dear friends,
> >>
> >> I encountered a problem with mpd as described below, and would
> >> appreciate any insights that you may kindly offer to prevent similar
> >> problems.
> >> Thanks in advance,
> >> Paul
> >>
> >> The problem:
> >>
> >> Yesterday, I linked 8 dual-processor machines together, and the mpdtrace output was:
> >> chara
> >> cha02
> >> cha03
> >> ...
> >> cha08.
> >>
> >> I submitted a job that was supposed to finish during the night.
> >>
> >> However, the job was still running (waiting) this morning, since 6
> >> processors were off the ring with the mpdtrace output as
> >> chara
> >> cha02.
> >>