[MPICH] mpi jobs not exiting

Steve Angelovich sangelovich at lgc.com
Thu Jun 29 17:14:49 CDT 2006


We have a cluster running Red Hat aw 3 that has started having problems 
with MPI jobs not terminating properly.  I've been able to reproduce the 
problem by doing the following:
 - Start the mpd ring:

mpdboot -n 16 -f ~/mpd.hosts

 - Run the following command repeatedly:

mpiexec -n 16 uptime

It usually takes several iterations before the mpiexec command hangs.  
As best I can tell, the process that was created on each of the nodes 
has completed and exited, but for some reason the mpd daemon still 
thinks it is running.  If I list the jobs running on the ring, the job 
still shows up.  I can signal the job, but there is no response.
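
Since the hang only appears after several iterations, a small loop can 
catch it automatically.  The following is a minimal sketch, not 
something from the original setup: the function name run_until_hang and 
the iteration and time limits are hypothetical, and it assumes GNU 
coreutils timeout is available on the head node (it exits with status 
124 when the time limit is reached).

```shell
#!/bin/sh
# Run a command repeatedly and stop at the first iteration that
# exceeds the time limit (treated here as a hang).
# Usage: run_until_hang MAX_ITERATIONS TIMEOUT_SECONDS COMMAND [ARGS...]
# Assumption: GNU coreutils `timeout` is installed.
run_until_hang() {
    max=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max" ]; do
        # Discard the command's output; only the exit status matters here.
        timeout "$limit" "$@" >/dev/null 2>&1
        status=$?
        if [ "$status" -eq 124 ]; then
            echo "iteration $i: timed out after ${limit}s"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no hang in $max iterations"
    return 0
}

# Example with hypothetical limits (100 iterations, 30 seconds each):
#   run_until_hang 100 30 mpiexec -n 16 uptime
```

When the loop reports a timeout, the ring can be inspected right away, 
for example with mpdlistjobs to see whether the job is still listed and 
mpdtrace to confirm which hosts are still in the ring.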

I've looked in the log file for the head node on the cluster and have 
found nothing useful.  Any insight into how to track down this issue 
would be greatly appreciated.

Thanks,
Steve







