[mpich-discuss] mpich2-1.0.8 errors

Anthony Chan chan at mcs.anl.gov
Thu Jan 15 20:08:33 CST 2009


Anne,

Our mpd expert, Ralph Butler, pointed out that mpd could time out 
(assuming mpd hasn't failed) after the default 20 seconds.  
He suggests modifying the default timeout value in mpiexec
through the mpiexec command line argument -recvtimeout (or the 
environment variable MPIEXEC_TIMEOUT).
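
As a rough sketch of what that could look like for your 13-process test:
the 60-second value below is only an assumed example (pick whatever suits
your cluster), and the RUN_JOB switch is a hypothetical convenience so the
command can be inspected before it is actually launched.

```shell
#!/bin/sh
# Sketch only: 60 seconds is an assumed example value, not a recommendation.
RECV_TIMEOUT=60
NPROCS=13

# Build the command line with the -recvtimeout argument mentioned above.
CMD="mpiexec -recvtimeout $RECV_TIMEOUT -l -n $NPROCS hostname"

# Print the command so it can be checked first.
echo "$CMD"

# RUN_JOB is a hypothetical switch: launch only when it is set to 1.
if [ "${RUN_JOB:-0}" = "1" ]; then
    $CMD
fi
```

If the error goes away with a larger value, that would point to a slow
reply from mpd rather than a failed mpd.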

A.Chan

----- "Anne M. Hammond" <hammond at txcorp.com> wrote:

> We are getting significant instances of mpich2 jobs crashing
> with this error message:
> 
> mpiexec_boron.corp.com (mpiexec 392): no msg recvd from mpd when 
> expecting ack of request
> 
> mpdtrace from the master node shows that all the nodes are
> in the ring.
> 
> mpirun -l -n 13 hostname
> (this completes successfully with 13 hostnames (master + 12 nodes))
> 
> Can you tell me how to diagnose and fix the "no msg received
> from mpd" error?
> 
> The error doesn't seem to correlate with load on the
> master or the /scr filesystem nfs-server from which
> the job is run.  The error seems to be random.  If the same
> job is resubmitted, it may successfully complete.
> 
> Thanks in advance.


