[mpich-discuss] mpich2-1.0.8 errors

Anne M. Hammond hammond at txcorp.com
Tue Jan 27 14:15:08 CST 2009


This is a progress report.

In troubleshooting this, we determined the ring had been booted
with mpich2-1.0.7.  On 01/16/09, the ring was booted using mpich2-1.0.8.

Since that time, there have been no mpd timeouts.

I would like to thank the developers for their suggestions and response.

Anne

On Fri, 16 Jan 2009, Ralph Butler wrote:

> Actually, I believe this particular one is overridden by
> MPIEXEC_RECV_TIMEOUT.  It operates on the recvTimeout variable.
> The other one sets a time limit for the job.
>
> On Thu Jan 15, at 8:08PM, Anthony Chan wrote:
>
>> Anne,
>> 
>> Our mpd expert, Ralph Butler, pointed out that mpd could time out
>> (assuming mpd hasn't failed) after the default 20 seconds.
>> He suggests you could modify the default timeout value in mpiexec
>> through the mpiexec command line argument -recvtimeout (or the
>> environment variable MPIEXEC_TIMEOUT).
>> 
>> A.Chan
>> 
>> ----- "Anne M. Hammond" <hammond at txcorp.com> wrote:
>> 
>> > We are getting significant instances of mpich2 jobs crashing
>> > with this error message:
>> > 
>> > mpiexec_boron.corp.com (mpiexec 392): no msg recvd from mpd when
>> > expecting ack of request
>> > 
>> > mpdtrace from the master node shows that all the nodes are
>> > in the ring.
>> > 
>> > mpirun -l -n 13 hostname
>> > (this completes successfully with 13 hostnames (master + 12 nodes))
>> > 
>> > Can you tell me how to diagnose and fix the "no msg received
>> > from mpd" error?
>> > 
>> > The error doesn't seem to correlate with load on the
>> > master or the /scr filesystem nfs-server from which
>> > the job is run.  The error seems to be random.  If the same
>> > job is resubmitted, it may successfully complete.
>> > 
>> > Thanks in advance.
>
>
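
For anyone hitting the same error, the timeout suggestion in the quoted
thread above can be sketched roughly as follows. This is only an
illustration, not something taken from the mpich2 sources: it assumes
that MPIEXEC_RECV_TIMEOUT (the variable Ralph names) is read from the
environment of the launching process, and the helper names
(mpiexec_env, run_job), the 60-second value, and the retry count are all
made up for the example. The retry loop reflects only the observation
that a resubmitted job may complete.

```python
import os
import subprocess

# Text taken from the error message quoted in this thread.
MPD_ACK_ERROR = "no msg recvd from mpd"


def mpiexec_env(recv_timeout_s, base_env=None):
    """Return a copy of the environment with MPIEXEC_RECV_TIMEOUT set.

    Assumption: mpiexec picks this variable up from its environment,
    as suggested in the thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MPIEXEC_RECV_TIMEOUT"] = str(int(recv_timeout_s))
    return env


def run_job(cmd, recv_timeout_s=60, max_attempts=3, runner=subprocess.run):
    """Run cmd, retrying only when the mpd ack error appears in the output.

    Returns (attempts_used, last_result). `runner` is injectable so the
    logic can be exercised without an mpd ring.
    """
    env = mpiexec_env(recv_timeout_s)
    result = None
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd, env=env, capture_output=True, text=True)
        output = (result.stdout or "") + (result.stderr or "")
        # Stop on success, or on any failure that is not the mpd ack error.
        if result.returncode == 0 or MPD_ACK_ERROR not in output:
            return attempt, result
    return max_attempts, result
```

With a working ring, something like
run_job(["mpirun", "-l", "-n", "13", "hostname"]) would launch the same
test command shown in the original report. Retrying only on the specific
error text is deliberate, so that genuine job failures are not masked.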

-- 

Anne M. Hammond - Systems / Network Administration - Tech-X Corp
                   hammond_at_txcorp.com 720-974-1840

