[MPICH] unmanaged disconnection from mpd ring

Jinyou Liang jliang at arb.ca.gov
Tue May 15 12:51:04 CDT 2007


Matt,

Thanks for clarifying my question.  I am just wondering why
the machines left the ring, and how that could happen.

Paul


Matthew Chambers wrote:
> Clarify this part:
>   
>> However, the job was still running (waiting) this morning, since 6
>> processors were off the ring with the mpdtrace output as
>> chara
>> cha02.
>>     
>
> Do you mean that sometime during the night, after you submitted the job,
> some machines dropped out of the ring (rebooted, dropped network
> connectivity, whatever)?  If that's the case, I don't think that MPICH2 has
> error handling capable of keeping a job running properly if the ring
> topology changes during a job.  The program would need to be able to handle
> it as well as the MPI implementation.  It's a tall order.  But maybe you're
> just asking WHY the machines left the ring?  And that's a very different
> question.
>
> -Matt Chambers
>
>
>   
>> -----Original Message-----
>> From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-
>> discuss at mcs.anl.gov] On Behalf Of Jinyou Liang
>> Sent: Tuesday, May 15, 2007 11:31 AM
>> To: mpich-discuss-digest at mcs.anl.gov
>> Subject: [MPICH] unmanaged disconnection from mpd ring
>>
>> Dear friends,
>>
>> I encountered a problem with mpd as described below, and would
>> appreciate any insights that you may kindly offer to prevent similar
>> problems.
>> Thanks in advance,
>> Paul
>>
>> The problem:
>>
>> Yesterday, I linked 8 dual-processor machines together and mpdtrace output was:
>> chara
>> cha02
>> cha03
>> ...
>> cha08.
>>
>> I submitted a job that was supposed to finish during the night.
>>
>> However, the job was still running (waiting) this morning, since 6
>> processors were off the ring with the mpdtrace output as
>> chara
>> cha02.
>>     
>
>
>   


