[MPICH] mpi jobs not exiting

Rajeev Thakur thakur at mcs.anl.gov
Fri Jun 30 12:10:07 CDT 2006


Sometimes I have found that the job appears to hang, but if I hit the return
key a few times, the prompt comes back. Does that work for you or is it a
real hang?
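
One way to tell the two cases apart without relying on the terminal is to
rerun the command in a loop, with stdin taken from /dev/null and a time
limit on each run. This is only a rough sketch: the function name
run_until_hang, the run count, and the time limit are made up for
illustration, and it assumes the GNU coreutils `timeout` command is
installed on the head node.

```shell
#!/bin/sh
# Hypothetical helper, not part of MPICH.
# Usage: run_until_hang MAX_RUNS LIMIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to MAX_RUNS times and stops at the first run that does
# not exit within LIMIT_SECONDS (GNU timeout reports that as status 124).
run_until_hang() {
    max_runs=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max_runs" ]; do
        # stdin comes from /dev/null so the result does not depend on
        # anyone pressing return at the terminal.
        timeout "$limit" "$@" </dev/null >/dev/null 2>&1
        status=$?
        if [ "$status" -eq 124 ]; then
            echo "run $i: no exit within ${limit}s"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $max_runs runs exited"
    return 0
}

# Example call (counts and limit are arbitrary):
#   run_until_hang 50 60 mpiexec -n 16 uptime
```

If a run is reported as not exiting, running mpdlistjobs at that point
should show whether the ring still lists the job.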

Rajeev 

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Steve Angelovich
> Sent: Thursday, June 29, 2006 5:15 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] mpi jobs not exiting
> 
> We have a cluster running redhat aw 3 that has started having problems
> with mpi jobs not terminating properly.  I've been able to reproduce the
> problem by doing the following:
>  - start the mpd ring
> 
> mpdboot -n 16 -f ~/mpd.hosts
> 
>  - run the following command:
> 
> mpiexec -n 16 uptime
> 
> It usually takes several iterations before the mpiexec command hangs.
> As best I can figure out, the process that was created on each of the
> nodes has completed and exited, but for some reason the mpd daemon still
> thinks it is running.  If I list the jobs running on the ring, the job
> still shows up.  I can signal the job, but there is no response.
> 
> I've looked in the log file for the head node on the cluster and have 
> found nothing useful.  Any insight into how to track down this issue 
> would be greatly appreciated.
> 
> Thanks,
> Steve



