[MPICH] Impact of changing recvTimeout

Yusong Wang ywang25 at aps.anl.gov
Tue May 23 10:10:10 CDT 2006


I suspect this is related to socket operations. You could try the
command
netstat | grep tcp | wc -l
on the master working node (not the head node from which you submitted
the job) to see the number of ports used before, during and after
execution of your application. If you run the application several
times, I would expect the returned number to increase.
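
If you would rather log the count from a script, the same check could
be sketched in Python roughly as follows. This is only an illustration:
the names count_tcp_lines and snapshot are made up here, the sample
text is not output from a real cluster, and snapshot assumes netstat is
on the PATH of the node where you run it.

```python
import subprocess


def count_tcp_lines(netstat_output: str) -> int:
    """Count lines containing 'tcp', mirroring `grep tcp | wc -l`."""
    return sum(1 for line in netstat_output.splitlines() if "tcp" in line)


def snapshot() -> int:
    """Run netstat once and return the TCP line count.

    Assumes a `netstat` binary is available on PATH.
    """
    result = subprocess.run(
        ["netstat"], capture_output=True, text=True, check=True
    )
    return count_tcp_lines(result.stdout)


# Illustrative sample only, not real cluster output.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp        0      0 node1:45012   node2:33881     ESTABLISHED
tcp        0      0 node1:45013   node3:33882     TIME_WAIT
udp        0      0 node1:ntp     *:*
"""

print(count_tcp_lines(SAMPLE))  # prints 2
```

Calling snapshot() before, during and after a job, and again after
several repeated runs, would give the numbers to compare.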

Yusong

 
On Tue, 2006-05-23 at 08:48 -0400, Galton, Simon wrote:
> I finally found the "timeout" variable in mpdrun.py.
> 
> It's called recvTimeout, and it's normally set to 20 (which turns out
> to be 20 seconds).
> 
> I found that I had to set it to 45 to reliably run on 54 cpus, and 110
> to reliably run on 122 cpus.
> 
> Is there anything "wrong" with changing this variable?  Is there any
> other impact?
> 
> Simon
> 
> -----Original Message----- 
> From: Galton, Simon  
> Sent: May 18, 2006 11:08 AM 
> To: mpich-discuss at mcs.anl.gov 
> Subject: Serious MPICH2 problem with many CPUs
> 
> Folks,
> 
> We're running into a problem when dispatching to "many" CPUs.  It
> seems when we hit ~50 CPUs on our dual-Xeon GigE-connected cluster we
> start to get the following error, and the job fails:
> 
> mpdrun_12429 (mpd_recv 386): other error after recv
> __main__.mpdrunInterrupted :SIGALRM: 
> mpdrun failed: no msg recvd from mpd when expecting ack of request 
>     traceback: [('/usr/local/mpich2/bin/mpdrun.py', '256', 'mpdrun'),
> ('/usr/local/mpich2/bin/mpdrun.py', '978', '?')]
> 
> It happens 100% of the time at 54 CPUs or more.
> 
> It looks like a setup timeout.  Can we increase/fix this?  I'd rather
> not recompile (validation issue)...
> 
> 
> Simon Galton