[MPICH] MPICH2 ring breaking; three times in two days

Rajeev Thakur thakur at mcs.anl.gov
Tue Jun 27 17:34:25 CDT 2006


Can you try upgrading to 1.0.3 and see if the problem goes away?
 
Rajeev


  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Galton, Simon
Sent: Tuesday, June 27, 2006 8:34 AM
To: 'mpich-discuss at mcs.anl.gov'
Subject: [MPICH] MPICH2 ring breaking; three times in two days



Help! :) 

We've been seeing a problem where most of our nodes drop out of the MPICH2
ring; this has happened three times in the last two days.  It's not always
the same nodes, either :(

The syslog file on our head node shows the following error: 

(handle_rhs_challenge_response 1010): INVALID msg for rhs response msg=:{}: from host=xxxxx

xxxxx represents the various hosts which drop out of the ring. 

Could this be a misbehaving job? 

Help! :) 

I'm using mpich2-1.0.1 on RHEL3 

Simon 




