[MPICH] unmanaged disconnection from mpd ring
Matthew Chambers
matthew.chambers at vanderbilt.edu
Tue May 15 15:07:24 CDT 2007
I suggest running mpdringtest with an absurdly large number of loops as a
kind of burn-in for your nodes and/or network.
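The burn-in suggestion above could be wrapped in a small helper like the
following sketch. The `burn_in` function name is an assumption (not from the
thread); it simply calls mpdringtest with the given loop count and reports
the result.

```shell
# Hypothetical burn-in helper, assuming an mpd ring is already up
# (i.e. mpdtrace lists all your nodes).
burn_in() {
    local loops="$1"
    # mpdringtest circulates a message around the mpd ring `loops` times;
    # a very large count exercises the daemons and the network for a while.
    if mpdringtest "$loops"; then
        echo "ring survived $loops loops"
    else
        echo "ring test FAILED at $loops loops -- check mpd daemons and network" >&2
        return 1
    fi
}
# usage: burn_in 1000000
```

If the ring drops nodes partway through a long run, that points at flaky
network connectivity or a crashing mpd rather than a one-off glitch.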
> -----Original Message-----
> From: Jinyou Liang [mailto:jliang at arb.ca.gov]
> Sent: Tuesday, May 15, 2007 2:17 PM
> To: Matthew Chambers
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [MPICH] unmanaged disconnection from mpd ring
>
> Matt,
>
> I have an account on all 8 nodes, and set up the mpd ring for myself in
> the following way:
> On the master node (chara):
> mpd -e &
> which outputs a port number, e.g., 1234
> On the other nodes (cha02, cha03, ..., cha08):
> mpd -h chara -p 1234 &
>
> I then confirmed with mpdtrace that all 8 nodes were connected. For some
> unknown reason, mpdboot has not worked on this cluster recently, so I used
> the above method to circumvent the problem.
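The manual bring-up quoted above can be sketched as a dry-run script. The
`start_ring` function and the use of ssh are assumptions for illustration
(the poster ran each command by hand on every node); it only prints the
commands that would join each worker to the ring.

```shell
# Dry-run sketch of the manual ring bring-up described above.
# Assumes: `mpd -e &` was already run on the master and printed its port.
start_ring() {
    local master="$1" port="$2"
    shift 2
    # For each remaining node, show the command that joins it to the
    # ring at the master's listening port (prepend ssh if you have
    # passwordless access; otherwise run it on each node by hand).
    for node in "$@"; do
        echo "ssh $node mpd -h $master -p $port &"
    done
}
# start_ring chara 1234 cha02 cha03 cha04 cha05 cha06 cha07 cha08
# then run mpdtrace on the master to confirm all 8 nodes joined
```

Running mpdtrace afterward is the quickest way to confirm the ring is
complete before launching jobs.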
>
> Following your suggestion, I ran 'uptime' on each node and confirmed
> they were all on last night.
>
> Paul :)
>
>
>
> Matthew Chambers wrote:
> > I'd have to have more information about how the MPD ring boots up to
> > diagnose the issue there. If the MPD ring has been set up to boot up
> > for all users by the superuser (i.e. using an mpd.conf in /etc), then
> > you're somewhat SOL. If you have an account on all the nodes, you can
> > set up MPD to run as your own user without ever needing superuser
> > access. Also, it's not safe to rule out the possibility that the
> > machine crashed and rebooted unless you ran "uptime" and saw that the
> > affected nodes indeed did not reboot. :)
> >
> >
> >> -----Original Message-----
> >> From: Jinyou Liang [mailto:jliang at arb.ca.gov]
> >> Sent: Tuesday, May 15, 2007 1:37 PM
> >> To: Matthew Chambers
> >> Cc: mpich-discuss-digest at mcs.anl.gov
> >> Subject: Re: [MPICH] unmanaged disconnection from mpd ring
> >>
> >> Matt,
> >>
> >> The machine is an 8-node cluster inside my office building, behind a
> >> firewall, and mpdtrace showed that mpd was running only on the master
> >> node and node 2, even though all nodes were up this morning. Since the
> >> system administrator is on vacation, no one could have touched the
> >> machine during the interval. Hence, I think it is safe to rule out the
> >> possibility that the machine crashed and rebooted during this interval,
> >> which leaves the possibility of a network problem or an mpd problem.
> >>
> >> If it is because of network connectivity issues, or because the MPD
> >> daemon crashed on the 6 other nodes, is there any way to prevent this
> >> from happening from the user's end?
> >>
> >> Paul
> >>
> >> Matthew Chambers wrote:
> >>
> >>> That would usually be because of network connectivity issues. It's
> >>> also possible that the MPD daemon crashed on those machines or that
> >>> the machine itself crashed. How are the machines connected, and have
> >>> you checked to see if they stayed up the whole time or if the MPD
> >>> daemon closed?
> >>>
> >>> -Matt
> >>>
> >
> >
> >