[mpich-discuss] mpich2-1.0.8 errors
Anne M. Hammond
hammond at txcorp.com
Thu Jan 15 22:27:31 CST 2009
Thanks for the suggestion. I went through the Troubleshooting
section, but all looks good.
[hammond at boron ~]$ rcom-nodes "hostname; mpdcheck; echo "
node11.cl.txcorp.com
node12.cl.txcorp.com
node13.cl.txcorp.com
node14.cl.txcorp.com
node15.cl.txcorp.com
node16.cl.txcorp.com
node17.cl.txcorp.com
node18.cl.txcorp.com
node19.cl.txcorp.com
node20.cl.txcorp.com
node21.cl.txcorp.com
node22.cl.txcorp.com
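
For anyone without a helper like rcom-nodes (a local script here), the
same per-node check can be sketched as a plain shell loop. This is only
a sketch under assumptions: it takes mpdcheck with no arguments to print
nothing when it finds no problem, which matches the output above, and it
assumes the nodes are reachable with ssh. The function name check_ring
is hypothetical.

```shell
# check_ring RUNNER NODE...
# Runs "mpdcheck" on each node through RUNNER (assumed to be ssh) and
# reports every node where mpdcheck printed anything at all.
# Returns 0 if all nodes were silent, 1 otherwise.
check_ring() {
    runner=$1
    shift
    bad=0
    for node in "$@"; do
        # Capture stdout and stderr; any output is treated as a warning.
        out=$($runner "$node" mpdcheck 2>&1)
        if [ -n "$out" ]; then
            printf '%s: mpdcheck reported:\n%s\n' "$node" "$out"
            bad=$((bad + 1))
        fi
    done
    [ "$bad" -eq 0 ]
}

# Example call (not executed here):
#   check_ring ssh node11 node12 node13
```

A silent run with exit status 0 corresponds to the clean result shown
above. It does not rule out intermittent problems, since mpdcheck only
sees the network state at the moment it runs.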
On Thu, 15 Jan 2009, Anthony Chan wrote:
>
> Have you tried mpdcheck to see if your network setup
> has a problem?
>
> Appendix A of the installer's guide has a detailed description
> of how to use mpdcheck.
>
> http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
>
> A.Chan
>
> ----- "Anne M. Hammond" <hammond at txcorp.com> wrote:
>
>> We are getting significant instances of mpich2 jobs crashing
>> with this error message:
>>
>> mpiexec_boron.txcorp.com (mpiexec 392): no msg recvd from mpd when
>> expecting ack of request
>>
>> mpdtrace from the master node shows that all the nodes are
>> in the ring.
>>
>> mpirun -l -n 13 hostname
>> (this completes successfully with 13 hostnames (master + 12 nodes))
>>
>> Can you tell me how to diagnose and fix the "no msg received
>> from mpd" error?
>>
>> The error doesn't seem to correlate with load on the
>> master or the /scr filesystem nfs-server from which
>> the job is run. The error seems to be random. If the same
>> job is resubmitted, it may successfully complete.
>>
>> Thanks in advance.
>
>
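
Since the quoted message notes that a resubmitted job may complete, one
stopgap while the cause is being tracked down is to wrap the launch in a
retry loop. This is a workaround sketch, not a fix for the mpd error. It
assumes the job is safe to rerun from the start, and it retries on any
non-zero exit status, not only on the "no msg recvd from mpd" case. The
names retry and RETRY_DELAY are hypothetical.

```shell
# retry MAX CMD [ARGS...]
# Runs CMD up to MAX times and stops at the first success.
# RETRY_DELAY (seconds, default 5) is the pause between attempts.
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        "$@" && return 0
        echo "attempt $attempt of $max failed" >&2
        # Pause only if another attempt will follow.
        if [ "$attempt" -lt "$max" ]; then
            sleep "${RETRY_DELAY:-5}"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Example call (not executed here; ./myjob is a placeholder):
#   retry 3 mpiexec -l -n 13 ./myjob
```

Keeping the stderr messages from the failed attempts makes it possible
to see afterwards how often the job had to be relaunched.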