[MPICH] mpi jobs not exiting

Rusty Lusk lusk at mcs.anl.gov
Thu Jun 29 21:52:48 CDT 2006


What release of MPICH2 are you running?

From: Steve Angelovich <sangelovich at lgc.com>
Subject: [MPICH] mpi jobs not exiting
Date: Thu, 29 Jun 2006 16:14:49 -0600

> We have a cluster running redhat aw 3 that has started having problems 
> with mpi jobs not terminating properly.  I've been able to reproduce the 
> problem by doing the following:
>  - start the mpd ring
> 
> mpdboot -n 16 -f ~/mpd.hosts
> 
>  - run the following command repeatedly:
> 
> mpiexec -n 16 uptime
> 
> It usually takes several iterations before the mpiexec command 
> hangs.  As best I can figure out, the process that was created on each of 
> the nodes has completed and exited, but for some reason the mpd daemon 
> still thinks it is running.  If I list the jobs running on the ring, the 
> job still shows up.  I can signal the job, but there is no response.
> 
> I've looked in the log file for the head node on the cluster and have 
> found nothing useful.  Any insight into how to track down this issue 
> would be greatly appreciated.
> 
> Thanks,
> Steve
> 
> 
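
For anyone trying to reproduce the hang described above, the "run it several
times until it hangs" step can be automated. The following is only a rough
sketch, not something from the original report: the helper name
run_until_hang, the 60-second timeout, and the limit of 50 iterations are all
assumptions chosen for illustration. It assumes the mpd ring has already been
started with mpdboot and that mpiexec is on the PATH.

```python
import subprocess


def run_until_hang(cmd, max_iterations=50, timeout_s=60.0):
    """Run cmd repeatedly and report the first iteration that times out.

    Returns the 1-based iteration number at which cmd failed to finish
    within timeout_s seconds, or None if every run completed in time.
    The timeout and iteration count are arbitrary illustrative values.
    """
    for iteration in range(1, max_iterations + 1):
        try:
            # Output is discarded; only completion within the timeout matters.
            # On timeout, subprocess.run kills the child before raising.
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return iteration
    return None


if __name__ == "__main__":
    # The command from the report; adjust -n to match the ring size.
    hung_at = run_until_hang(["mpiexec", "-n", "16", "uptime"])
    if hung_at is None:
        print("no hang observed")
    else:
        print(f"mpiexec did not finish on iteration {hung_at}")
```

Note that the timeout kills the local mpiexec process, so the job listing on
the ring (for example with mpdlistjobs) should be checked afterwards to see
whether the mpd daemons still report the job as running.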



