[mpich-discuss] mpich2-1.0.8 errors

Anne M. Hammond hammond at txcorp.com
Thu Jan 15 22:27:31 CST 2009


Thanks for the suggestion.  I went through the Troubleshooting
section (Appendix A of the installer's guide), and everything looks
good: mpdcheck prints nothing on any of the 12 nodes.

[hammond at boron ~]$ rcom-nodes "hostname; mpdcheck; echo "
node11.cl.txcorp.com

node12.cl.corp.com

node13.cl.corp.com

node14.cl.corp.com

node15.cl.corp.com

node16.cl.corp.com

node17.cl.corp.com

node18.cl.corp.com

node19.cl.corp.com

node20.cl.corp.com

node21.cl.corp.com

node22.cl.txcorp.com
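
Since the failure appears to be random, it may help to measure how often
it occurs before changing anything.  Below is a rough sketch, not a tested
diagnostic: it runs a given command repeatedly and counts the runs whose
output contains the "no msg recvd from mpd" text quoted below.  The names
run_trials and is_mpd_ack_failure are made up for this illustration, and
the command passed in would need to be the actual failing job, since the
simple hostname test completes successfully.

```python
import subprocess

# Substring of the error reported in this thread. Only the first part is
# matched, because the archived message wraps the line.
MPD_ACK_ERROR = "no msg recvd from mpd"


def is_mpd_ack_failure(output):
    """Return True if the captured output contains the mpd ack error."""
    return MPD_ACK_ERROR in output


def run_trials(cmd, trials):
    """Run cmd `trials` times and return (ack_failures, other_failures).

    cmd is a list of arguments, for example the mpiexec command line of
    the job that fails intermittently (an assumption; adjust as needed).
    """
    ack_failures = 0
    other_failures = 0
    for _ in range(trials):
        # Merge stderr into stdout so the error is found wherever it is printed.
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        if is_mpd_ack_failure(proc.stdout):
            ack_failures += 1
        elif proc.returncode != 0:
            other_failures += 1
    return ack_failures, other_failures
```

For example, run_trials(["mpiexec", "-l", "-n", "13", "./myjob"], 50) would
run a hypothetical ./myjob 50 times; the trial count is arbitrary.  Running
this at different times of day, or against subsets of the nodes, could show
whether the failures cluster around a particular host or period.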



On Thu, 15 Jan 2009, Anthony Chan wrote:

>
> Have you tried mpdcheck to see whether your network setup
> has a problem?
>
> Appendix A of the installer's guide has a detailed description
> of how to use mpdcheck.
>
> http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
>
> A.Chan
>
> ----- "Anne M. Hammond" <hammond at txcorp.com> wrote:
>
>> We are getting significant instances of mpich2 jobs crashing
>> with this error message:
>>
>> mpiexec_boron.corp.com (mpiexec 392): no msg recvd from mpd when
>> expecting ack of request
>>
>> mpdtrace from the master node shows that all the nodes are
>> in the ring.
>>
>> mpirun -l -n 13 hostname
>> (this completes successfully with 13 hostnames (master + 12 nodes))
>>
>> Can you tell me how to diagnose and fix the "no msg recvd
>> from mpd" error?
>>
>> The error doesn't seem to correlate with load on the
>> master or on the NFS server for the /scr filesystem, from
>> which the job is run.  The error seems to be random.  If the
>> same job is resubmitted, it may complete successfully.
>>
>> Thanks in advance.
>
>
