[MPICH] mpi jobs not exiting
Steve Angelovich
sangelovich at lgc.com
Thu Jun 29 17:14:49 CDT 2006
We have a cluster running redhat aw 3 that has started having problems
with MPI jobs not terminating properly. I've been able to reproduce the
problem by doing the following:
- Start the mpd ring:
    mpdboot -n 16 -f ~/mpd.hosts
- Run the following command repeatedly:
    mpiexec -n 16 uptime
It usually takes several iterations before the mpiexec command hangs.
As best I can tell, the process created on each of the nodes has
completed and exited, but for some reason the mpd daemon still thinks it
is running. If I list the jobs running on the ring, the job still shows
up. I can signal the job, but there is no response.
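Since the hang only appears after several iterations, the reproduction can be automated. The following is a minimal sketch, not part of the original setup: it assumes the GNU coreutils timeout command is available, and the function name run_until_hang, the iteration cap, and the time limit are all illustrative choices.

```shell
#!/bin/sh
# Repeat a command until one run exceeds a time limit, to catch an
# intermittent hang. All names and limits here are illustrative.
# Usage: run_until_hang MAX_ITERATIONS TIMEOUT_SECONDS COMMAND [ARGS...]
run_until_hang() {
    max=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max" ]; do
        # GNU timeout exits with status 124 when the time limit is reached.
        timeout "$limit" "$@" >/dev/null 2>&1
        status=$?
        if [ "$status" -eq 124 ]; then
            echo "iteration $i: timed out after ${limit}s"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no hang in $max iterations"
    return 0
}

# Example (assumed limits: up to 100 runs, 60 seconds each); on a hang,
# list the jobs the ring still believes are running:
#   run_until_hang 100 60 mpiexec -n 16 uptime || mpdlistjobs
```

Stopping at the first timed-out run leaves the ring in the stuck state, so the job list and the mpd logs on each node can be inspected right away.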
I've looked in the log file for the head node on the cluster and have
found nothing useful. Any insight into how to track down this issue
would be greatly appreciated.
Thanks,
Steve