[mpich-discuss] mpich2-1.0.8 errors
Anne M. Hammond
hammond at txcorp.com
Thu Jan 15 14:31:59 CST 2009
We are seeing a significant number of mpich2 jobs crash
with this error message:
mpiexec_boron.corp.com (mpiexec 392): no msg recvd from mpd when expecting ack of request
mpdtrace on the master node shows that all the nodes are
in the ring, and a simple test job completes successfully,
printing 13 hostnames (master + 12 nodes):
mpirun -l -n 13 hostname
Can you tell me how to diagnose and fix the "no msg recvd
from mpd" error?
The error doesn't seem to correlate with load on the
master or on the NFS server for the /scr filesystem from
which the job is run. It appears to be random: if the same
job is resubmitted, it may complete successfully.
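Since the failure is intermittent, one way to put a number on it would be to rerun a command many times and count the failures. The following is only a minimal sketch: count_failures is a hypothetical helper, not part of mpich2, and it assumes the failing run exits with a nonzero status.

```shell
#!/bin/sh
# Hypothetical helper (not part of mpich2): run a command a given number
# of times and print how many of those runs exited with a nonzero status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Run the command passed as the remaining arguments; discard output.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example usage (assumes the mpd ring is already up):
#   count_failures 50 mpirun -l -n 13 hostname
```

Comparing the count for the trivial hostname job against the count for the real job might help show whether the problem depends on the job itself.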
Thanks in advance.