[MPICH] MPICH2 ring breaking; three times in two days

Galton, Simon galtons at aecl.ca
Tue Jun 27 08:34:29 CDT 2006


Help! :)

We've been seeing a problem where most of our nodes drop out of the MPICH2
ring; this has happened three times in the last two days.  It's not always
the same nodes, either :(

The syslog file on our head node shows the following error:

(handle_rhs_challenge_response 1010): INVALID msg for rhs response msg=:{}: from host=xxxxx

xxxxx represents the various hosts which drop out of the ring.
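
To see which hosts drop out most often, the syslog entries can be tallied per host. The following is a minimal sketch, not part of MPICH2. It assumes each error appears on a single syslog line in the form quoted above, and the function name `count_dropped_hosts` is made up for illustration. It uses `collections.Counter`, so it needs a newer Python than the one shipped with RHEL3; run it on a copy of the log on another machine if necessary.

```python
import re
from collections import Counter

# Regex built from the error text quoted in this message. The syslog prefix
# (timestamp, host, process) in front of it is an assumption and is skipped.
PATTERN = re.compile(
    r"handle_rhs_challenge_response \d+\): "
    r"INVALID msg for rhs response.*?from host=(\S+)"
)

def count_dropped_hosts(lines):
    """Count how many times each host appears in the rhs-response error."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, assuming the head node logs to /var/log/messages, `count_dropped_hosts(open("/var/log/messages"))` returns a counter whose `most_common()` lists the hosts that hit the error most often.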

Could this be a misbehaving job?

Help! :)

I'm using mpich2-1.0.1 on RHEL3.

Simon