[mpich-discuss] mpich2-1.0.8 errors

Anne M. Hammond hammond at txcorp.com
Thu Jan 15 22:56:53 CST 2009


Thanks Anthony.

To test this consistently for all users, recvTimeout in
mpiexec.py was changed from 20 to 30 seconds.
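
For anyone following along, here is a rough sketch of how a timeout
like this is typically resolved. It is not the actual code in
mpiexec.py. The function name resolve_recv_timeout and the precedence
(command line over environment variable over default) are assumptions
made for illustration; the option name -recvtimeout, the variable
MPIEXEC_TIMEOUT, and the 20-second default are taken from Anthony's
message below.

```python
import os

DEFAULT_RECV_TIMEOUT = 20  # seconds; the default mentioned in this thread


def resolve_recv_timeout(argv, environ=None, default=DEFAULT_RECV_TIMEOUT):
    """Return the receive timeout in seconds.

    Hypothetical helper: the precedence used here (command line, then
    environment, then default) is an assumption, not mpiexec.py's logic.
    """
    if environ is None:
        environ = os.environ

    timeout = default

    # Environment variable name as given in the thread.
    if "MPIEXEC_TIMEOUT" in environ:
        timeout = int(environ["MPIEXEC_TIMEOUT"])

    # Command-line option name as given in the thread; it overrides
    # the environment variable in this sketch.
    if "-recvtimeout" in argv:
        idx = argv.index("-recvtimeout")
        if idx + 1 >= len(argv):
            raise ValueError("-recvtimeout needs a value in seconds")
        timeout = int(argv[idx + 1])

    if timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return timeout
```

With this sketch, resolve_recv_timeout(["-recvtimeout", "30", "-n", "13",
"hostname"], {}) gives 30, and an argument list without the option and
an empty environment falls back to 20. Editing the default in the script
itself, as done here, changes the fallback for every user at once.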

Will report.

Anne

On Thu, 15 Jan 2009, Anthony Chan wrote:

> Anne,
>
> Our mpd expert, Ralph Butler, pointed out that mpd could time out
> (assuming mpd hasn't failed) after the default 20 seconds.
> He suggests modifying the default timeout value in mpiexec
> through the mpiexec command-line argument -recvtimeout (or the
> environment variable MPIEXEC_TIMEOUT).
>
> A.Chan
>
> ----- "Anne M. Hammond" <hammond at txcorp.com> wrote:
>
>> We are getting significant instances of mpich2 jobs crashing
>> with this error message:
>>
>> mpiexec_boron.corp.com (mpiexec 392): no msg recvd from mpd when
>> expecting ack of request
>>
>> mpdtrace from the master node shows that all the nodes are
>> in the ring.
>>
>> mpirun -l -n 13 hostname
>> (this completes successfully with 13 hostnames (master + 12 nodes))
>>
>> Can you tell me how to diagnose and fix the "no msg received
>> from mpd" error?
>>
>> The error doesn't seem to correlate with load on the
>> master or the /scr filesystem nfs-server from which
>> the job is run.  The error seems to be random.  If the same
>> job is resubmitted, it may successfully complete.
>>
>> Thanks in advance.
>
>
