[MPICH] issue regarding mpd timeout in mpiexec at scale
Gregory Bauer
gbauer at ncsa.uiuc.edu
Thu May 17 09:58:43 CDT 2007
This is a repost from the mvapich-discuss list that I posted the other
day. It was recommended that I post it here as well.
When running at scale (2048 tasks or more, with 8 tasks per node, i.e.
ppn=8) we occasionally see the following from mpiexec:
mpiexec_abe1192 (mpiexec 411): no msg recvd from mpd when expecting ack of request
The reporting node may change, so it is not tied to any node in
particular.
The sequence we use is:
mpdboot
mpdtrace
mpiexec
The output from mpdtrace is fine. It is only when mpiexec is ready to
start up the actual MPI tasks that this issue appears.
After looking at mpiexec.py and mpdlib.py I see that there is a
parameter in mpiexec.py called
recvTimeout
that is set to 20.
If we set this to a larger value, will this reduce the likelihood of
getting the 'ack' timeout?
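To illustrate the behavior I am asking about, here is a minimal sketch of a receive that gives up after a fixed number of seconds. This is not the actual mpiexec.py or mpdlib.py code: the names recv_with_timeout, delayed_ack and try_once are made up for illustration, and the socketpair stands in for the mpiexec-to-mpd connection.

```python
import select
import socket
import threading
import time


def recv_with_timeout(sock, timeout):
    """Return data from sock, or None if nothing arrives within `timeout` seconds."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        # Analogous to the "no msg recvd ... when expecting ack" case.
        return None
    return sock.recv(4096)


def delayed_ack(sock, delay):
    """Hypothetical slow peer: sends an ack after `delay` seconds."""
    time.sleep(delay)
    sock.sendall(b"ack")


def try_once(timeout, delay):
    """Wait up to `timeout` seconds for an ack that the peer sends after `delay`."""
    a, b = socket.socketpair()
    peer = threading.Thread(target=delayed_ack, args=(b, delay))
    peer.start()
    try:
        return recv_with_timeout(a, timeout)
    finally:
        peer.join()
        a.close()
        b.close()
```

In this sketch, try_once(0.05, 0.3) gives up before the ack arrives, while try_once(1.0, 0.05) receives it. Whether the ack from mpd is merely late at this scale (so that a larger recvTimeout would help) or is never sent at all is what I am trying to find out.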
This was with mvapich2-0.9.8-2007-05-03 but we are now at
mvapich2-0.9.8p2 for testing, using ofed 1.1.
-Greg