[MPICH] issue regarding mpd timeout in mpiexec at scale
Rusty Lusk
lusk at mcs.anl.gov
Tue May 29 12:15:03 CDT 2007
I think that changing that timeout would be a reasonable approach.
There is no number that is too large :-). It depends on how long
you want to wait to find out that something is wrong. I would
experiment with various values, starting at a large number, and tune
it for your environment.
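As a rough sketch of what I mean (the install path is a placeholder,
and the exact formatting of the assignment may differ in your release):

    # raise the default 20-second timeout before launching the big job
    sed -i 's/recvTimeout = 20/recvTimeout = 300/' /path/to/mpiexec.py
    mpiexec -n 2048 ./your_app
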
Regards,
Rusty Lusk
On May 29, 2007, at 11:09 AM, Gregory Bauer wrote:
> I was able to reproduce the mpiexec failure mechanism using the
> latest MPICH2 release (mpich2-1.0.5p4.tar.gz) as I documented with
> the MVAPICH2 release.
>
> Since the failure is intermittent, I can't be sure if changing
> recvTimeout to a larger value actually fixes the issue.
>
> What is a reasonable value to set recvTimeout to for task counts
> greater than 2048? Is 120 (is the unit seconds?) too large?
>
> -Greg
>
> Gregory Bauer wrote:
>
>> This is a repost of a message I sent to the mvapich-discuss list
>> the other day. It was recommended that I post it here as well.
>>
>> When running at scale (2048 tasks and greater, with 8 tasks per
>> node or ppn=8) we occasionally see the following from mpiexec:
>>
>> mpiexec_abe1192 (mpiexec 411): no msg recvd from mpd when
>> expecting ack of request
>>
>> The reporting node may change, so it is not tied to any node in
>> particular.
>>
>> The sequence we use is:
>> mpdboot
>> mpdtrace
>> mpiexec
>>
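>> Concretely, with typical options that looks like the following (the
>> node count, hosts file name, and application name are illustrative
>> only):
>>
>>     mpdboot -n 256 -f mpd.hosts   # one mpd per node
>>     mpdtrace                      # confirm every mpd responds
>>     mpiexec -n 2048 ./your_app    # round-robin gives 8 tasks per node
>>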
>> The output from mpdtrace is fine. It is only when mpiexec is ready
>> to start up the actual MPI tasks that this issue appears.
>>
>> After looking at mpiexec.py and mpdlib.py, I see that there is a
>> parameter in mpiexec.py called recvTimeout that is set to 20.
>>
>> If we set this to a larger value, will this reduce the likelihood
>> of getting the 'ack' timeout?
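>>
>> For concreteness, the edit would be a one-line change in mpiexec.py
>> (I am assuming the unit is seconds and that the assignment looks the
>> same in other releases):
>>
>>     recvTimeout = 120    # raised from the shipped default of 20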
>>
>> This was with mvapich2-0.9.8-2007-05-03, but we are now testing
>> with mvapich2-0.9.8p2, using OFED 1.1.
>>
>> -Greg
>>
>