[MPICH] unmanaged disconnection from mpd ring
Jinyou Liang
jliang at arb.ca.gov
Tue May 15 14:16:56 CDT 2007
Matt,
I have an account on all 8 nodes, and I set up the mpd ring for
myself in the following way:
On the master node (chara):
mpd -e &
which printed a port number, e.g., 1234.
On the other nodes (cha02, cha03, ..., cha08):
mpd -h chara -p 1234 &
I then confirmed with mpdtrace that all 8 nodes were connected. For some
unknown reason, mpdboot has not worked on this cluster recently, so I used
the above method to work around the problem.
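The mpdtrace check could be scripted along these lines, so that a node dropping out of the ring shows up by name. This is only a sketch: missing_nodes is a made-up helper, not part of MPICH, and it assumes mpdtrace prints one host name per line, spelled the same way as the names passed in.

```shell
#!/bin/sh
# Hypothetical helper: list the expected hosts that are absent from the ring.
# Reads mpdtrace-style output (assumed: one host name per line) on stdin;
# the expected host names are given as arguments.
missing_nodes() {
    trace=$(cat)
    for host in "$@"; do
        # -x matches the whole line, -F treats the host name literally
        printf '%s\n' "$trace" | grep -qxF "$host" || printf '%s\n' "$host"
    done
}

# Example use on the master node (assumes mpdtrace is on PATH):
# mpdtrace | missing_nodes chara cha02 cha03 cha04 cha05 cha06 cha07 cha08
```

An empty result would mean every expected node is still in the ring; any name printed is a node whose mpd is no longer connected.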
Following your suggestion, I ran 'uptime' on each node and confirmed
that all of them had stayed up through last night.
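For reference, the uptime check can be done from one node with a loop like the one below. It is a rough sketch that assumes password-less ssh to every node; uptime_all and the REMOTE_SH override are names invented here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run `uptime` on each listed host and print
# "host: <uptime output>", or "host: unreachable" if the remote command fails.
# REMOTE_SH is an illustrative override hook; it defaults to ssh.
uptime_all() {
    for host in "$@"; do
        printf '%s: ' "$host"
        # stdin is redirected so the remote shell cannot consume the caller's input
        ${REMOTE_SH:-ssh} "$host" uptime </dev/null || printf 'unreachable\n'
    done
}

# Example use:
# uptime_all chara cha02 cha03 cha04 cha05 cha06 cha07 cha08
```

A node whose reported uptime is shorter than the time since the ring was started has rebooted in between.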
Paul :)
Matthew Chambers wrote:
> I'd have to have more information about how the MPD ring boots up to
> diagnose the issue there. If the MPD ring has been set up to boot up for
> all users by the superuser (i.e. using an mpd.conf in /etc) then you're
> somewhat SOL. If you have an account on all the nodes, you can set up MPD
> to run as your own user without ever needing superuser access. Also, it's
> not safe to rule out the possibility that the machine crashed and rebooted
> unless you ran "uptime" and saw that the affected nodes indeed did not
> reboot. :)
>
>
>> -----Original Message-----
>> From: Jinyou Liang [mailto:jliang at arb.ca.gov]
>> Sent: Tuesday, May 15, 2007 1:37 PM
>> To: Matthew Chambers
>> Cc: mpich-discuss-digest at mcs.anl.gov
>> Subject: Re: [MPICH] unmanaged disconnection from mpd ring
>>
>> Matt,
>>
>> The machine is an 8-node cluster inside my office building, behind a
>> firewall. This morning all nodes were up, but mpdtrace showed that mpd
>> was running only on the master node and node 2. Since the system
>> administrator is on vacation, no one could have touched the machine
>> during the interval. Hence, I think it is safe to rule out the
>> possibility that the machine crashed and rebooted during this interval,
>> which leaves either a network problem or an mpd problem.
>>
>> If the cause is a network connectivity issue or a crash of the MPD
>> daemon on the 6 other nodes, is there any way to prevent this from
>> happening from the user's end?
>>
>> Paul
>>
>> Matthew Chambers wrote:
>>
>>> That would usually be because of network connectivity issues. It's also
>>> possible that the MPD daemon crashed on those machines or that the machine
>>> itself crashed. How are the machines connected and have you checked to see
>>> if they stayed up the whole time or if the MPD daemon closed?
>>>
>>> -Matt
>>>
>
>
>