[mpich-discuss] mpich2-1.0.8 errors

Ralph Butler rbutler at mtsu.edu
Fri Jan 16 08:27:30 CST 2009


Actually, I believe this particular one is overridden by
MPIEXEC_RECV_TIMEOUT, which operates on the recvTimeout variable.
The other one (MPIEXEC_TIMEOUT) sets a time limit for the job.
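
As a minimal sketch of the above, the variable can be set in the shell that
launches the job. The value 60 is an arbitrary illustration, not a
recommendation; the default mentioned in this thread is 20 seconds.

```shell
# Raise mpiexec's timeout for receiving acks from mpd.
# 60 is an assumed, illustrative value (thread states the default is 20 s).
export MPIEXEC_RECV_TIMEOUT=60
echo "MPIEXEC_RECV_TIMEOUT=$MPIEXEC_RECV_TIMEOUT"

# Then launch the job from this same shell as usual, for example:
#   mpirun -l -n 13 hostname
# The -recvtimeout command line argument mentioned below should be an
# alternative to the environment variable.
```

The export only affects jobs started from the same shell session, so a batch
script would need to set it before its mpiexec line.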

On Thu Jan 15, at 8:08PM, Anthony Chan wrote:

> Anne,
>
> Our mpd expert, Ralph Butler, pointed out that mpd could time out
> (assuming mpd hasn't failed) after the default 20 seconds.
> He suggests you could modify the default timeout value in mpiexec
> through the mpiexec command line argument -recvtimeout (or the
> environment variable MPIEXEC_TIMEOUT).
>
> A.Chan
>
> ----- "Anne M. Hammond" <hammond at txcorp.com> wrote:
>
>> We are getting significant instances of mpich2 jobs crashing
>> with this error message:
>>
>> mpiexec_boron.corp.com (mpiexec 392): no msg recvd from mpd when
>> expecting ack of request
>>
>> mpdtrace from the master node shows that all the nodes are
>> in the ring.
>>
>> mpirun -l -n 13 hostname
>> (this completes successfully with 13 hostnames (master + 12 nodes))
>>
>> Can you tell me how to diagnose and fix the "no msg received
>> from mpd" error?
>>
>> The error doesn't seem to correlate with load on the
>> master or the /scr filesystem nfs-server from which
>> the job is run.  The error seems to be random.  If the same
>> job is resubmitted, it may successfully complete.
>>
>> Thanks in advance.



