[MPICH] unmanaged disconnection from mpd ring

jliang at arb.ca.gov jliang at arb.ca.gov
Wed May 16 10:09:05 CDT 2007


So you believe the problem was caused by disconnected nodes and/or a busy network? Could you explain how the burn-in process would help prevent the problem from happening in the future?  Thanks.

With best regards,
Paul

----- Original Message -----
From: Matthew Chambers <matthew.chambers at vanderbilt.edu>
Date: Tuesday, May 15, 2007 1:07 pm
Subject: RE: [MPICH] unmanaged disconnection from mpd ring

> I suggest doing an mpdringtest for an absurdly large number of loops as a
> kind of burn-in for your nodes and/or network.
> 
> > -----Original Message-----
> > From: Jinyou Liang [jliang at arb.ca.gov]
> > Sent: Tuesday, May 15, 2007 2:17 PM
> > To: Matthew Chambers
> > Cc: mpich-discuss at mcs.anl.gov
> > Subject: Re: [MPICH] unmanaged disconnection from mpd ring
> > 
> > Matt,
> > 
> >     I have an account on all 8 nodes, and set up the mpd ring for
> > myself in the following way:
> > On master node (chara):
> >       mpd -e &
> >       which prints a port number, e.g., 1234
> > On other nodes (cha02, cha03, ..., cha08):
> >       mpd -h chara  -p  1234 &
> > 
> > then confirmed with mpdtrace that all 8 nodes were connected.  For an
> > unknown reason, mpdboot has not worked on this cluster recently, so I used
> > the above method to work around the problem.
> > 
> >     Following your suggestion, I ran 'uptime' on each node and confirmed
> > that they were all up last night.
> > 
> > Paul :)
> > 
> > 
> > 
> > Matthew Chambers wrote:
> > > I'd have to have more information about how the MPD ring boots up to
> > > diagnose the issue there.  If the MPD ring has been set up to boot up for
> > > all users by the superuser (i.e. using an mpd.conf in /etc) then you're
> > > somewhat SOL.  If you have an account on all the nodes, you can set up MPD
> > > to run as your own user without ever needing superuser access.  Also, it's
> > > not safe to rule out the possibility that the machine crashed and rebooted
> > > unless you ran "uptime" and saw that the affected nodes indeed did not
> > > reboot. :)
> > >
> > >
> > >> -----Original Message-----
> > >> From: Jinyou Liang [jliang at arb.ca.gov]
> > >> Sent: Tuesday, May 15, 2007 1:37 PM
> > >> To: Matthew Chambers
> > >> Cc: mpich-discuss-digest at mcs.anl.gov
> > >> Subject: Re: [MPICH] unmanaged disconnection from mpd ring
> > >>
> > >> Matt,
> > >>
> > >> The machine is an 8-node cluster inside my office building, behind a
> > >> firewall.  This morning mpdtrace showed that mpd was running only on the
> > >> master node and node 2, although all nodes are up.  Since the system
> > >> administrator is on vacation, no one could have touched the machine during
> > >> the interval.  Hence, I think it is safe to rule out the possibility that
> > >> the machine crashed and rebooted during this interval, which leaves the
> > >> possibility of a network problem or an mpd problem.
> > >>
> > >> If the cause is network connectivity issues or the MPD daemon crashing
> > >> on the 6 other nodes, is there any way to prevent this from happening
> > >> from the user's end?
> > >>
> > >> Paul
> > >>
> > >> Matthew Chambers wrote:
> > >>
> > >>> That would usually be because of network connectivity issues.  It's also
> > >>> possible that the MPD daemon crashed on those machines or that the machine
> > >>> itself crashed.  How are the machines connected and have you checked to see
> > >>> if they stayed up the whole time or if the MPD daemon closed?
> > >>>
> > >>> -Matt
> > >>>
> > >
> > >
> > >
> 
> 
> 
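
The burn-in suggested in the thread could be scripted roughly as follows. This is only a sketch, not something posted by either correspondent: it assumes mpdringtest accepts a loop count as its argument and that mpdtrace prints one hostname per line in the same form as the names you pass in (short names versus fully qualified names may differ on your cluster). The function names missing_from_ring and burn_in are hypothetical helpers, not part of MPICH.

```shell
#!/bin/sh
# Hypothetical helper (not part of MPICH): read a ring listing from stdin,
# one host per line, and print every expected host that is absent from it.
missing_from_ring() {
    listing=$(cat)
    for host in "$@"; do
        # -x: match the whole line, -F: treat the hostname as a fixed string
        printf '%s\n' "$listing" | grep -qxF -- "$host" || printf '%s\n' "$host"
    done
}

# Hypothetical helper: run mpdringtest repeatedly and, after each round,
# check that every expected host is still listed by mpdtrace.
# Usage: burn_in ROUNDS LOOPS_PER_ROUND HOST...
burn_in() {
    rounds=$1
    loops=$2
    shift 2
    i=1
    while [ "$i" -le "$rounds" ]; do
        if ! mpdringtest "$loops"; then
            echo "round $i: mpdringtest failed" >&2
            return 1
        fi
        missing=$(mpdtrace | missing_from_ring "$@")
        if [ -n "$missing" ]; then
            echo "round $i: hosts missing from ring:" $missing >&2
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}

# Example invocation (list every node in the ring; counts are arbitrary):
#   burn_in 100 10000 chara cha02 cha03
```

Stopping at the first failed round records roughly when a node dropped out, which may help tell a flaky network link from a crashed daemon.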



