<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<META NAME="Generator" CONTENT="MS Exchange Server version 5.5.2658.34">
<TITLE>Serious MPICH2 problem with many CPUs</TITLE>
</HEAD>
<BODY>
<P><FONT SIZE=2>Folks,</FONT>
</P>
<P><FONT SIZE=2>We're running into a problem when dispatching jobs to many CPUs. Once we reach about 50 CPUs on our dual-Xeon, GigE-connected cluster, we start getting the following error and the job fails:</FONT></P>
<P><FONT SIZE=2>mpdrun_12429 (mpd_recv 386): other error after recv __main__.mpdrunInterrupted :SIGALRM:</FONT>
<BR><FONT SIZE=2>mpdrun failed: no msg recvd from mpd when expecting ack of request</FONT>
<BR><FONT SIZE=2> traceback: [('/usr/local/mpich2/bin/mpdrun.py', '256', 'mpdrun'), ('/usr/local/mpich2/bin/mpdrun.py', '978', '?')]</FONT>
</P>
<P><FONT SIZE=2>It happens 100% of the time at 54 CPUs or more.</FONT>
</P>
<P><FONT SIZE=2>It looks like a timeout during job setup. Is there a way to increase it, or otherwise fix this, without recompiling? I'd rather not rebuild because of validation issues.</FONT>
</P>
<BR>
<P><FONT SIZE=2>Simon Galton</FONT>
</P>
</BODY>
</HTML>