[mpich-discuss] mpich2-1.0.8 errors

Anne M. Hammond hammond at txcorp.com
Thu Jan 15 14:31:59 CST 2009

We are seeing a significant number of mpich2 jobs crash
with this error message:

mpiexec_boron.corp.com (mpiexec 392): no msg recvd from mpd when expecting ack of request

mpdtrace from the master node shows that all the nodes are
in the ring.

mpirun -l -n 13 hostname
(This completes successfully and prints 13 hostnames: the master plus 12 nodes.)

Can you tell me how to diagnose and fix the "no msg recvd
from mpd" error?

The error doesn't seem to correlate with load on the
master or on the NFS server for the /scr filesystem, from
which the job is run.  The error appears to be random: if the
same job is resubmitted, it may complete successfully.
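
Since the failure is intermittent, a loop along the following lines could
measure how often it occurs. This is only a sketch: count_failures is a
hypothetical helper name (not part of mpich2), and it assumes that
mpirun/mpiexec exits with a nonzero status when it reports this error.

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and prints how many of those
# runs exited with a nonzero status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only its exit status is counted.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example (assumes mpirun is on PATH and the mpd ring is already up):
# count_failures 50 mpirun -l -n 13 hostname
```

Replacing the hostname command with a short real job would show whether
the failure rate depends on what is being launched.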

Thanks in advance.

