[mpich-discuss] mpich2-1.0.8 errors
Anne M. Hammond
hammond at txcorp.com
Tue Jan 27 14:15:08 CST 2009
This is a progress report.
In troubleshooting this, we determined the ring had been booted
with mpich2-1.0.7. On 01/16/09, the ring was booted using mpich2-1.0.8.
Since that time, there have been no mpd timeouts.
I would like to thank the developers for their suggestions and response.
Anne
On Fri, 16 Jan 2009, Ralph Butler wrote:
> Actually, I believe this particular one is overridden by
> MPIEXEC_RECV_TIMEOUT. It operates on the recvTimeout variable.
> The other one sets a time limit for the job.
>
> On Thu Jan 15, at 8:08 PM, Anthony Chan wrote:
>
>> Anne,
>>
>> Our mpd expert, Ralph Butler, pointed out that mpd could time out
>> (assuming mpd hasn't failed) after the default 20 seconds.
>> He suggests you could modify the default timeout value in mpiexec
>> through the mpiexec command-line argument -recvtimeout (or the
>> environment variable MPIEXEC_TIMEOUT).
>>
>> A.Chan
>>
>> ----- "Anne M. Hammond" <hammond at txcorp.com> wrote:
>>
>> > We are getting significant instances of mpich2 jobs crashing
>> > with this error message:
>> >
>> > mpiexec_boron.corp.com (mpiexec 392): no msg recvd from mpd when
>> > expecting ack of request
>> >
>> > mpdtrace from the master node shows that all the nodes are
>> > in the ring.
>> >
>> > mpirun -l -n 13 hostname
>> > (this completes successfully with 13 hostnames (master + 12 nodes))
>> >
>> > Can you tell me how to diagnose and fix the "no msg received
>> > from mpd" error?
>> >
>> > The error doesn't seem to correlate with load on the
>> > master or the /scr filesystem nfs-server from which
>> > the job is run. The error seems to be random. If the same
>> > job is resubmitted, it may successfully complete.
>> >
>> > Thanks in advance.
>
>
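
For anyone hitting the same error, here is a minimal sketch of raising the
receive timeout that Ralph describes above. MPIEXEC_RECV_TIMEOUT and the
hostname test come from this thread; the 60-second value is only an
assumption chosen for illustration, and the check for mpiexec on the PATH
is added so the snippet is harmless on a machine without MPICH2.

```shell
# Assumption: 60 seconds is an illustrative value, not one recommended in
# the thread. The default mentioned by the developers is 20 seconds.
export MPIEXEC_RECV_TIMEOUT=60

# Same smoke test as in the original report (master + 12 nodes).
# Only run it if mpiexec is actually installed on this machine.
if command -v mpiexec >/dev/null 2>&1; then
    mpiexec -l -n 13 hostname || echo "mpiexec reported an error" >&2
fi
```

The thread also names -recvtimeout as the mpiexec command-line argument for
the same setting. MPIEXEC_TIMEOUT, by contrast, sets a time for the whole job.
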
--
Anne M. Hammond - Systems / Network Administration - Tech-X Corp
hammond_at_txcorp.com 720-974-1840