[MPICH] mpi jobs not exiting
Steve Angelovich
sangelovich at lgc.com
Fri Jun 30 14:10:50 CDT 2006
It is really a hang.
Thanks,
Steve
Rajeev Thakur wrote:
>Sometimes I have found that the job appears to hang, but if I hit the return
>key a few times, the prompt comes back. Does that work for you or is it a
>real hang?
>
>Rajeev
>
>
>
>>-----Original Message-----
>>From: owner-mpich-discuss at mcs.anl.gov
>>[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Steve Angelovich
>>Sent: Thursday, June 29, 2006 5:15 PM
>>To: mpich-discuss at mcs.anl.gov
>>Subject: [MPICH] mpi jobs not exiting
>>
>>We have a cluster running Red Hat AW 3 that has started having
>>problems with MPI jobs not terminating properly. I've been able to
>>reproduce the problem by doing the following:
>> - Start the mpd ring:
>>
>>mpdboot -n 16 -f ~/mpd.hosts
>>
>> - Run the following command:
>>
>>mpiexec -n 16 uptime
>>
>>It usually takes several iterations before the mpiexec command
>>hangs. As best I can figure out, the process that was created on each
>>of the nodes has completed and exited, but for some reason the mpd
>>daemon still thinks it is running. If I list the jobs running on the
>>ring, the job still shows up. I can signal the job, but there is no
>>response.
>>
>>I've looked in the log file for the head node on the cluster and have
>>found nothing useful. Any insight into how to track down this issue
>>would be greatly appreciated.
>>
>>Thanks,
>>Steve
>
>
>
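Since the hang in the quoted report only shows up after several iterations of the mpiexec command, it may help to automate the repetition and stop at the first run that does not return. The following is a minimal sketch of that idea, not anything from the thread itself: the function name run_until_hang, the 60-second timeout, and the iteration limit are all assumptions chosen for illustration.

```python
import subprocess
import sys


def run_until_hang(cmd, timeout, max_iters):
    """Run cmd repeatedly; return the 1-based iteration that timed out, or None."""
    for i in range(1, max_iters + 1):
        try:
            # subprocess.run kills the child process if it exceeds the
            # timeout and then raises TimeoutExpired.
            subprocess.run(cmd, stdout=subprocess.DEVNULL, timeout=timeout)
        except subprocess.TimeoutExpired:
            print(f"iteration {i} did not finish within {timeout}s", file=sys.stderr)
            return i
    return None


# Hypothetical usage on the cluster described above (values are assumptions):
#   hung_at = run_until_hang(["mpiexec", "-n", "16", "uptime"],
#                            timeout=60, max_iters=100)
```

Note that the timeout only kills the local mpiexec process. Whether the job still shows up in the mpd ring afterwards would have to be checked separately, for example by listing the jobs on the ring right after the function returns.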