[MPICH] issue regarding mpd timeout in mpiexec at scale

Gregory Bauer gbauer at ncsa.uiuc.edu
Thu May 17 09:58:43 CDT 2007


This is a repost from the mvapich-discuss list that I posted the other 
day. It was recommended that I post it here as well.

When running at scale (2048 tasks and greater, with 8 tasks per node or 
ppn=8) we occasionally see the following from mpiexec:

mpiexec_abe1192 (mpiexec 411): no msg recvd from mpd when expecting ack 
of request

The reporting node changes from run to run, so the problem is not tied 
to any particular node.

The sequence we use is:
mpdboot
mpdtrace
mpiexec
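
For illustration, the sequence above could be driven from a small Python 
script that also checks each step's output for the ack-timeout message. 
This is only a sketch: the helper names (run_step, hit_ack_timeout, 
run_sequence) are hypothetical, and the host file name, node count and 
program name in the example are placeholders, not our actual job setup.

```python
import subprocess

# Text of the mpiexec complaint, without the host name and pid prefix.
ACK_TIMEOUT_TEXT = "no msg recvd from mpd when expecting ack"


def run_step(argv):
    """Run one command; return (exit_code, combined stdout/stderr text)."""
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        # Command not installed or not executable on this machine.
        return 127, str(exc)
    return proc.returncode, proc.stdout


def hit_ack_timeout(output):
    """True if the output contains the ack-timeout message."""
    # Collapse whitespace first, since the message may be line-wrapped.
    return ACK_TIMEOUT_TEXT in " ".join(output.split())


def run_sequence(steps):
    """Run steps in order, stopping at the first failure or ack timeout.

    Returns a list of (argv, exit_code, timed_out) tuples.
    """
    results = []
    for argv in steps:
        code, out = run_step(argv)
        timed_out = hit_ack_timeout(out)
        results.append((argv, code, timed_out))
        if code != 0 or timed_out:
            break
    return results


if __name__ == "__main__":
    # Placeholder arguments: 256 nodes at ppn=8 gives 2048 tasks.
    steps = [
        ["mpdboot", "-n", "256", "-f", "mpd.hosts"],
        ["mpdtrace"],
        ["mpiexec", "-n", "2048", "./a.out"],
    ]
    for argv, code, timed_out in run_sequence(steps):
        print(" ".join(argv), "exit:", code, "ack timeout:", timed_out)
```

Stopping at the first failing step makes it clear whether the ring came 
up (mpdboot, mpdtrace) before mpiexec reported the timeout.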

The output from mpdtrace is fine. It is only when mpiexec is ready to 
start up the actual MPI tasks that this issue appears.

After looking at mpiexec.py and mpdlib.py I see that there is a 
parameter in mpiexec.py called
recvTimeout
that is set to 20.

If we set this to a larger value, will this reduce the likelihood of 
getting the 'ack' timeout?
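
To make the question concrete, here is a minimal sketch of how a receive 
timeout of this kind usually behaves. It is not the code from mpdlib.py; 
recv_with_timeout is a hypothetical helper built on the standard select 
module, and it assumes the timeout value is in seconds, which I have not 
confirmed for recvTimeout.

```python
import select
import socket


def recv_with_timeout(sock, timeout_s, bufsize=4096):
    """Return data received within timeout_s seconds, or None on timeout."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    if not readable:
        # Nothing arrived in time: the analogue of "no msg recvd".
        return None
    return sock.recv(bufsize)


if __name__ == "__main__":
    a, b = socket.socketpair()
    # The peer stays silent, so a short wait gives up with None.
    print(recv_with_timeout(a, 0.1))
    # Once the peer answers, the same call returns the message.
    b.sendall(b"ack")
    print(recv_with_timeout(a, 0.1))
    a.close()
    b.close()
```

In a scheme like this, a larger timeout only helps if the reply is late 
rather than lost: a slow ack would then arrive inside the window, while 
a reply that never comes would still fail, just later.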

This was with mvapich2-0.9.8-2007-05-03; we are now testing with 
mvapich2-0.9.8p2, using OFED 1.1.

-Greg



