[mpich-discuss] mpich2-1.0.8 errors
Anthony Chan
chan at mcs.anl.gov
Thu Jan 15 14:48:37 CST 2009
Have you tried mpdcheck to see whether your network setup
has a problem?
Appendix A of the installer's guide has a detailed description
of how to use mpdcheck.
http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
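Since the failure is intermittent, a single successful
"mpirun -l -n 13 hostname" may not show it. Below is a minimal
sketch of a helper that repeats the test and records the runs
that fail. It assumes Python 3 is available on the master node,
and the function name count_failures, the run count, and the
timeout are illustrative choices, not part of mpich2. It also
assumes a failed run either exits non-zero or prints the
"no msg recvd from mpd" text quoted below; adjust the check if
your mpiexec behaves differently.

```python
import subprocess

# Error text taken from the report quoted below.
MPD_ERROR = "no msg recvd from mpd"


def count_failures(cmd, runs=20, timeout=60):
    """Run cmd `runs` times and return a list of (run_index, reason)."""
    failures = []
    for i in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            failures.append((i, "timeout"))
            continue
        except OSError as exc:
            # The command could not be started at all.
            failures.append((i, str(exc)))
            continue
        output = result.stdout.decode(errors="replace").strip()
        if result.returncode != 0 or MPD_ERROR in output:
            failures.append((i, output))
    return failures


if __name__ == "__main__":
    # Same test command as in the quoted report.
    cmd = ["mpirun", "-l", "-n", "13", "hostname"]
    failed = count_failures(cmd, runs=20)
    print("%d of 20 runs failed" % len(failed))
    for index, reason in failed:
        print("run %d: %s" % (index, reason))
```

If some runs fail, comparing their timestamps against the
mpdcheck results for each node may help narrow down whether one
host or one network link is involved.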
A.Chan
----- "Anne M. Hammond" <hammond at txcorp.com> wrote:
> We are getting significant instances of mpich2 jobs crashing
> with this error message:
>
> mpiexec_boron.corp.com (mpiexec 392): no msg recvd from mpd when
> expecting ack of request
>
> mpdtrace from the master node shows that all the nodes are
> in the ring.
>
> mpirun -l -n 13 hostname
> (this completes successfully with 13 hostnames (master + 12 nodes))
>
> Can you tell me how to diagnose and fix the "no msg received
> from mpd" error?
>
> The error doesn't seem to correlate with load on the
> master or the /scr filesystem nfs-server from which
> the job is run. The error seems to be random. If the same
> job is resubmitted, it may successfully complete.
>
> Thanks in advance.