[MPICH] unmanaged disconnection from mpd ring

Jinyou Liang jliang at arb.ca.gov
Tue May 15 13:37:27 CDT 2007


Matt,

The machine is an 8-node cluster inside my office building, behind a 
firewall.  All of the nodes were up this morning, but mpdtrace showed 
mpd running only on the master node and node 2.  Since the system 
administrator is on vacation, no one could have touched the machine 
during that interval, so I think it is safe to rule out the possibility 
that the machines crashed and rebooted, which leaves either a network 
problem or an mpd problem.

If the cause is a network connectivity problem or the mpd daemon 
crashing on the 6 other nodes, is there any way to prevent this from 
happening from the user's end?
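The best user-level workaround I can come up with so far is a small 
watchdog, run periodically (e.g. from cron), that notices when the ring 
has shrunk and rebuilds it.  It cannot rescue a job that is already 
stuck waiting, only get the ring back for a resubmit.  A minimal 
sketch, assuming my 8 hosts are listed in ~/mpd.hosts and the mpd 
commands are on the PATH:

    #!/bin/sh
    # Minimal watchdog sketch: if mpdtrace reports fewer hosts than
    # expected, shut down what is left of the ring and reboot it from
    # the hosts file.  NHOSTS and HOSTFILE reflect my own setup.
    NHOSTS=8
    HOSTFILE=$HOME/mpd.hosts

    up=`mpdtrace | wc -l`
    if [ "$up" -lt "$NHOSTS" ]; then
        echo "mpd ring has only $up of $NHOSTS hosts; rebuilding" >&2
        mpdallexit                  # tear down the remnants of the old ring
        mpdboot -n "$NHOSTS" -f "$HOSTFILE"
    fi

Does something along these lines sound reasonable, or is there a better 
way to keep the ring up?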

Paul

Matthew Chambers wrote:
> That would usually be because of network connectivity issues.  It's also
> possible that the MPD daemon crashed on those machines or that the machine
> itself crashed.  How are the machines connected and have you checked to see
> if they stayed up the whole time or if the MPD daemon closed?
>
> -Matt
>
>
>   
>> -----Original Message-----
>> From: Jinyou Liang [mailto:jliang at arb.ca.gov]
>> Sent: Tuesday, May 15, 2007 12:51 PM
>> To: Matthew Chambers
>> Cc: mpich-discuss-digest at mcs.anl.gov
>> Subject: Re: [MPICH] unmanaged disconnection from mpd ring
>>
>> Matt,
>>
>> Thanks for your clarification on my question.  I am just wondering why
>> the machines left the ring, and how that could happen.
>>
>> Paul
>>
>>
>> Matthew Chambers wrote:
>>     
>>> Clarify this part:
>>>
>>>       
>>>> However, the job was still running (waiting) this morning: 6 nodes
>>>> had dropped off the ring, and the mpdtrace output was only
>>>> chara
>>>> cha02.
>>>>
>>>>         
>>> Do you mean that sometime during the night, after you submitted the job,
>>> some machines dropped out of the ring (rebooted, dropped network
>>> connectivity, whatever)?  If that's the case, I don't think that MPICH2 has
>>> error handling capable of keeping a job running properly if the ring
>>> topology changes during a job.  The program would need to be able to handle
>>> it as well as the MPI implementation.  It's a tall order.  But maybe you're
>>> just asking WHY the machines left the ring?  And that's a very different
>>> question.
>>>
>>> -Matt Chambers
>>>
>>>
>>>
>>>       
>>>> -----Original Message-----
>>>> From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-
>>>> discuss at mcs.anl.gov] On Behalf Of Jinyou Liang
>>>> Sent: Tuesday, May 15, 2007 11:31 AM
>>>> To: mpich-discuss-digest at mcs.anl.gov
>>>> Subject: [MPICH] unmanaged disconnection from mpd ring
>>>>
>>>> Dear friends,
>>>>
>>>> I encountered a problem with mpd, described below, and would
>>>> appreciate any insights you can offer on how to prevent similar
>>>> problems.
>>>> Thanks in advance,
>>>> Paul
>>>>
>>>> The problem:
>>>>
>>>> Yesterday, I linked the 8 dual-processor nodes into a ring, and the mpdtrace output was:
>>>> chara
>>>> cha02
>>>> cha03
>>>> ...
>>>> cha08.
>>>>
>>>> I submitted a job that was supposed to finish during the night.
>>>>
>>>> However, the job was still running (waiting) this morning: 6 nodes
>>>> had dropped off the ring, and the mpdtrace output was only
>>>> chara
>>>> cha02.
>>>>
>>>>         
>>>
>>>       
>
>
>   
