[MPICH] unmanaged disconnection from mpd ring

Matthew Chambers matthew.chambers at vanderbilt.edu
Tue May 15 13:45:09 CDT 2007


I'd need more information about how the MPD ring boots up to diagnose the
issue there.  If the superuser set up the MPD ring to boot for all users
(i.e. using an mpd.conf in /etc), then you're somewhat SOL.  If you have an
account on all the nodes, you can set up MPD to run as your own user
without ever needing superuser access.  Also, it's not safe to rule out the
possibility that the machine crashed and rebooted unless you ran "uptime"
on the affected nodes and saw that they indeed did not reboot. :)
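
A rough sketch of how you might spot dropped nodes from a user account,
assuming you keep your own list of the hosts that should be in the ring.
It compares that list against the output of "mpdtrace" (one host per line,
or "host_port (ip)" with -l).  The helper names and the host names in the
example are made up for illustration; only the "mpdtrace" command itself
comes from MPD.

```python
import subprocess


def short_name(host):
    """Return the part of a host name before the first dot, lower-cased."""
    return host.strip().split(".")[0].lower()


def parse_mpdtrace(output):
    """Turn mpdtrace output into a set of short host names.

    Handles plain 'host' lines and the 'host_port (ip)' lines
    printed by 'mpdtrace -l'.
    """
    hosts = set()
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[0]
        head, sep, tail = name.rpartition("_")
        if sep and tail.isdigit():
            name = head  # drop the trailing '_port' from the -l form
        hosts.add(short_name(name))
    return hosts


def missing_hosts(expected, trace_output):
    """List the expected hosts that do not appear in the mpdtrace output."""
    in_ring = parse_mpdtrace(trace_output)
    return [h for h in expected if short_name(h) not in in_ring]


def current_ring():
    """Run mpdtrace on this node and return its raw output (needs MPD installed)."""
    result = subprocess.run(["mpdtrace"], capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical host list; replace with the nodes of your own cluster.
    expected = ["master", "node2", "node3", "node4"]
    print(missing_hosts(expected, current_ring()))
```

If this reports missing nodes and the ring runs under your own account, one
option is to shut it down with "mpdallexit" and start it again with
"mpdboot"; whether that works depends on how the ring was set up.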

> -----Original Message-----
> From: Jinyou Liang [mailto:jliang at arb.ca.gov]
> Sent: Tuesday, May 15, 2007 1:37 PM
> To: Matthew Chambers
> Cc: mpich-discuss-digest at mcs.anl.gov
> Subject: Re: [MPICH] unmanaged disconnection from mpd ring
> 
> Matt,
> 
> The machine is an 8-node cluster inside my office building behind a
> firewall, and mpdtrace showed that mpd was running only on the master
> node and node 2, although all nodes are up this morning.  Since the
> system administrator is on vacation, no one could have touched the
> machine during the interval.  Hence, I think it is safe to rule out the
> possibility that the machine crashed and rebooted during this interval,
> which leaves a network problem or an mpd problem as the possible cause.
> 
> If the cause is network connectivity issues or the MPD daemon crashing
> on the 6 other nodes, is there any way to prevent this from happening
> from the user's end?
> 
> Paul
> 
> Matthew Chambers wrote:
> > That would usually be because of network connectivity issues.  It's also
> > possible that the MPD daemon crashed on those machines or that the machine
> > itself crashed.  How are the machines connected and have you checked to see
> > if they stayed up the whole time or if the MPD daemon closed?
> >
> > -Matt




