[mpich-discuss] mpich2-1.0.8 errors

Anthony Chan chan at mcs.anl.gov
Thu Jan 15 14:48:37 CST 2009


Have you tried mpdcheck to see whether your network setup
has a problem?

Appendix A of the installer's guide has a detailed description
of how to use mpdcheck.

http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
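
As a rough sketch of the kind of checks the appendix walks through: the
wrapper function names and the MPDCHECK variable below are only
illustrative, and the -f, -s and -c options are the ones I recall from
the guide, so please verify them against the appendix before relying on
them.

```shell
#!/bin/sh
# Sketch of the mpdcheck steps described in Appendix A of the installer's guide.
# MPDCHECK and the function names are illustrative; they are not part of MPICH2.
MPDCHECK="${MPDCHECK:-mpdcheck}"

# Check name resolution for every host listed in a file (one hostname per line).
check_hosts() {
    "$MPDCHECK" -f "$1"
}

# Start a listener on this node; it should print the host and port it listens on.
start_server() {
    "$MPDCHECK" -s
}

# From another node, connect to the listener using the host and port it printed.
connect_client() {
    "$MPDCHECK" -c "$1" "$2"
}
```

For example, run start_server on the master, then on a compute node run
connect_client with the host and port that the server printed. Since the
failure is intermittent, it may be worth repeating the pair several times
and in both directions (master to node and node to master).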

A.Chan

----- "Anne M. Hammond" <hammond at txcorp.com> wrote:

> We are getting significant instances of mpich2 jobs crashing
> with this error message:
> 
> mpiexec_boron.corp.com (mpiexec 392): no msg recvd from mpd when 
> expecting ack of request
> 
> mpdtrace from the master node shows that all the nodes are
> in the ring.
> 
> mpirun -l -n 13 hostname
> (this completes successfully with 13 hostnames (master + 12 nodes))
> 
> Can you tell me how to diagnose and fix the "no msg received
> from mpd" error?
> 
> The error doesn't seem to correlate with load on the
> master or the /scr filesystem nfs-server from which
> the job is run.  The error seems to be random.  If the same
> job is resubmitted, it may successfully complete.
> 
> Thanks in advance.


