[MPICH] issue regarding mpd timeout in mpiexec at scale
Gregory Bauer
gbauer at ncsa.uiuc.edu
Tue May 29 11:09:44 CDT 2007
I was able to reproduce the mpiexec failure mechanism using the latest
MPICH2 release (mpich2-1.0.5p4.tar.gz) as I documented with the
MVAPICH2 release.
Since the failure is intermittent, I can't be sure if changing
recvTimeout to a larger value actually fixes the issue.
What is a reasonable value to set recvTimeout to for task counts
greater than 2048? Is 120 too large? (I assume the unit is seconds?)
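For anyone else digging into this: the pattern below is only a sketch of how a
recv-with-timeout of the kind mpiexec.py/mpdlib.py appear to use can behave,
not MPD's actual code. The recv_msg name and the use of socketpair here are my
own for illustration; only the recvTimeout default of 20 and the error text
come from the thread.

```python
# Hypothetical sketch (not MPD's real implementation) of a timed wait
# for an ack, to show why a fixed recvTimeout can fire at scale.
import select
import socket

RECV_TIMEOUT = 20  # seconds; mirrors the recvTimeout default in mpiexec.py


def recv_msg(sock, timeout=RECV_TIMEOUT):
    """Wait up to `timeout` seconds for data on `sock`, else raise."""
    ready, _, _ = select.select([sock], [], [], timeout)
    if not ready:
        # Same wording mpiexec prints when the ack never arrives in time.
        raise RuntimeError("no msg recvd from mpd when expecting ack of request")
    return sock.recv(4096)


if __name__ == "__main__":
    a, b = socket.socketpair()
    b.sendall(b"ack")
    print(recv_msg(a, timeout=1))  # ack already queued, returns at once
    try:
        recv_msg(a, timeout=0.1)   # nothing queued: times out
    except RuntimeError as e:
        print("timed out:", e)
```

With 2048+ tasks the ring of mpds has more work to do before the ack comes
back, so a timeout tuned for small runs can intermittently expire even though
nothing is actually wrong; raising it just widens that window.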
-Greg
Gregory Bauer wrote:
> This is a repost from the mvapich-discuss list that I posted the other
> day. It was recommended that I post it here as well.
>
> When running at scale (2048 tasks and greater, with 8 tasks per node
> or ppn=8) we occasionally see the following from mpiexec:
>
> mpiexec_abe1192 (mpiexec 411): no msg recvd from mpd when expecting
> ack of request
>
> The reporting node may change, so it is not tied to any node in
> particular.
>
> The sequence we use is:
> mpdboot
> mpdtrace
> mpiexec
>
> The output from mpdtrace is fine. It is only when mpiexec is ready
> to start up the actual MPI tasks that this issue appears.
>
> After looking at mpiexec.py and mpdlib.py I see that there is a
> parameter in mpiexec.py called
> recvTimeout
> that is set to 20.
>
> If we set this to a larger value, will this reduce the likelihood of
> getting the 'ack' timeout?
>
> This was with mvapich2-0.9.8-2007-05-03 but we are now at
> mvapich2-0.9.8p2 for testing, using ofed 1.1.
>
> -Greg
>