[MPICH] mpi jobs not exiting
Rajeev Thakur
thakur at mcs.anl.gov
Fri Jun 30 12:10:07 CDT 2006
Sometimes I have found that the job appears to hang, but if I hit the return
key a few times, the prompt comes back. Does that work for you or is it a
real hang?
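
One way to tell the two cases apart is to rerun the command in a loop with
stdin detached and a time limit. This is only a rough sketch and not part of
the MPICH distribution: the function name run_until_hang is made up for
illustration, and it assumes the GNU coreutils `timeout` utility is installed,
which exits with status 124 when the limit expires.

```shell
#!/bin/sh
# Hypothetical helper (not an MPICH tool): run a command repeatedly and stop
# at the first iteration that does not exit within the time limit.
# Usage: run_until_hang MAX_RUNS LIMIT_SECONDS command [args...]
run_until_hang() {
    max_runs=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max_runs" ]; do
        # stdin comes from /dev/null, so a command that is merely waiting
        # on terminal input sees end-of-file instead of blocking.
        timeout "$limit" "$@" < /dev/null > /dev/null 2>&1
        status=$?
        # Assumption: GNU timeout reports an expired limit as status 124.
        if [ "$status" -eq 124 ]; then
            echo "iteration $i: no exit within ${limit}s"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $max_runs iterations exited"
    return 0
}
```

For the case in this thread it could be called as
`run_until_hang 50 60 mpiexec -n 16 uptime`, where the 50 runs and the
60-second limit are arbitrary choices. If an iteration still times out with
stdin detached, that suggests a real hang rather than a prompt waiting for
a keypress.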
Rajeev
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Steve Angelovich
> Sent: Thursday, June 29, 2006 5:15 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] mpi jobs not exiting
>
> We have a cluster running redhat aw 3 that has started having problems
> with mpi jobs not terminating properly. I've been able to reproduce the
> problem by doing the following:
> - Start the mpd ring:
>
> mpdboot -n 16 -f ~/mpd.hosts
>
> - Run the following command:
>
> mpiexec -n 16 uptime
>
> It usually takes several iterations before the mpiexec command hangs.
> As best I can figure out, the process that was created on each of the
> nodes has completed and exited, but for some reason the mpd daemon still
> thinks it is running. If I list the jobs running on the ring, the job
> still shows up. I can signal the job, but there is no response.
>
> I've looked in the log file for the head node on the cluster and have
> found nothing useful. Any insight into how to track down this issue
> would be greatly appreciated.
>
> Thanks,
> Steve