[MPICH] Impact of changing recvTimeout

Galton, Simon galtons at aecl.ca
Tue May 23 07:48:28 CDT 2006


I finally found the "timeout" variable in mpdrun.py.

It's called recvTimeout, and it's normally set to 20 (which turns out to be
20 seconds).

I found that I had to set it to 45 to run reliably on 54 CPUs, and to 110 to
run reliably on 122 CPUs.
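
As a rough way to guess a value for other job sizes, here is a minimal
sketch that draws a straight line through the two data points above. It
assumes the required timeout grows linearly with CPU count, which two
points cannot confirm; estimate_timeout is a hypothetical helper, not
anything in mpdrun.py.

```python
# Data points reported in this message:
# (CPU count, recvTimeout value in seconds that ran reliably).
observations = [(54, 45), (122, 110)]

(c1, t1), (c2, t2) = observations
slope = (t2 - t1) / (c2 - c1)    # extra seconds per additional CPU
intercept = t1 - slope * c1      # value the line takes at 0 CPUs


def estimate_timeout(cpus):
    """Hypothetical helper: straight-line estimate through the two points.

    Assumption: linear scaling. Treat the result as a starting guess only.
    """
    return slope * cpus + intercept


print(f"{slope:.2f} s per additional CPU")
print(f"estimate for 96 CPUs: {estimate_timeout(96):.0f} s")
```

The line reproduces both measured values exactly; anything outside the
54 to 122 CPU range is extrapolation.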

Is there anything "wrong" with changing this variable?  Is there any other
impact?

Simon

-----Original Message-----
From: Galton, Simon 
Sent: May 18, 2006 11:08 AM
To: mpich-discuss at mcs.anl.gov
Subject: Serious MPICH2 problem with many CPUs

Folks,

We're running into a problem when dispatching to "many" CPUs.  When we reach
~50 CPUs on our dual-Xeon GigE-connected cluster, we start to get the
following error, and the job fails:

mpdrun_12429 (mpd_recv 386): other error after recv
__main__.mpdrunInterrupted :SIGALRM:
mpdrun failed: no msg recvd from mpd when expecting ack of request
    traceback: [('/usr/local/mpich2/bin/mpdrun.py', '256', 'mpdrun'),
('/usr/local/mpich2/bin/mpdrun.py', '978', '?')]

It happens 100% of the time at 54 CPUs or more.

It looks like a setup timeout.  Can we increase/fix this?  I'd rather not
recompile (validation issue)...
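
The SIGALRM in the traceback suggests an alarm-based timeout around a
blocking receive. The following is a generic sketch of that pattern for
illustration only; it is not the code in mpdrun.py, and RecvInterrupted
and recv_with_alarm are hypothetical names. It is Unix-only and must run
in the main thread.

```python
import signal
import socket


class RecvInterrupted(Exception):
    """Hypothetical exception raised when the alarm fires."""


def _on_alarm(signum, frame):
    # Raising here aborts the blocking recv() in the main thread.
    raise RecvInterrupted("SIGALRM")


def recv_with_alarm(sock, nbytes, timeout_s):
    """Block in recv(), but give up after timeout_s seconds.

    Returns the received bytes, or None if the alarm fired first.
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout_s)
    try:
        return sock.recv(nbytes)
    except RecvInterrupted:
        return None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)     # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)  # restore previous handler


# Demonstration with a local socket pair standing in for the mpd connection.
a, b = socket.socketpair()
print(recv_with_alarm(a, 1024, 0.2))   # peer is silent, so the alarm fires
b.sendall(b"ack")
print(recv_with_alarm(a, 1024, 0.2))   # data is already waiting
```

With this pattern, a peer that answers later than the timeout looks the
same as one that never answers, which would be consistent with failures
that appear only above a certain job size.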


Simon Galton