[MPICH] issue regarding mpd timeout in mpiexec at scale

Gregory Bauer gbauer at ncsa.uiuc.edu
Tue May 29 11:09:44 CDT 2007


I was able to reproduce the mpiexec failure mechanism using the latest 
MPICH2 release (mpich2-1.0.5p4.tar.gz), as I documented with the 
MVAPICH2 release.

Since the failure is intermittent, I can't be sure whether changing 
recvTimeout to a larger value actually fixes the issue.

What is a reasonable value for recvTimeout at task counts 
greater than 2048? Is 120 (is the unit seconds?) too large?
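For reference, a receive timeout of this kind typically just bounds a blocking socket read. The sketch below is a hypothetical illustration of that pattern, not the actual mpd code, and it assumes the value is in seconds (which the default of 20 suggests):

```python
import socket

RECV_TIMEOUT = 20  # hypothetical default, in seconds, mirroring mpiexec.py's recvTimeout

def recv_with_timeout(sock, nbytes, timeout=RECV_TIMEOUT):
    """Return received bytes, or None if nothing arrives within `timeout` seconds."""
    sock.settimeout(timeout)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # This is the situation the "no msg recvd from mpd when
        # expecting ack of request" error would correspond to.
        return None

# Demo: a loopback pair where the peer never writes, so the read times out.
a, b = socket.socketpair()
print(recv_with_timeout(a, 1024, timeout=0.1))  # -> None after ~0.1 s
```

Raising the timeout only helps if the ack is genuinely late (e.g. the mpd ring is slow to fan out at large task counts) rather than lost outright; in the lost case a larger value just delays the same error.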

-Greg

Gregory Bauer wrote:

> This is a repost from the mvapich-discuss list that I posted the other 
> day. It was recommended that I post it here as well.
>
> When running at scale (2048 tasks and greater, with 8 tasks per node 
> or ppn=8) we occasionally see the following from mpiexec:
>
> mpiexec_abe1192 (mpiexec 411): no msg recvd from mpd when expecting 
> ack of request
>
> The reporting node may change, so the failure is not tied to any node 
> in particular.
>
> The sequence we use is:
> mpdboot
> mpdtrace
> mpiexec
>
> The output from mpdtrace is fine. It is only when mpiexec is ready 
> to start up the actual MPI tasks that this issue appears.
>
> After looking at mpiexec.py and mpdlib.py I see that there is a 
> parameter in mpiexec.py called
> recvTimeout
> that is set to 20.
>
> If we set this to a larger value, will this reduce the likelihood of 
> getting the 'ack' timeout?
>
> This was with mvapich2-0.9.8-2007-05-03, but we are now testing 
> mvapich2-0.9.8p2, using OFED 1.1.
>
> -Greg
>




More information about the mpich-discuss mailing list