[mpich-discuss] MPD daemons disconnect/die unexpectedly

Michael Ahlmann mahlmann at ucdavis.edu
Wed Mar 26 17:50:29 CDT 2008


I am running a 4-node cluster with CentOS 5 and dual-core Athlon 
chips.  I installed MPICH2 for parallel computations.  I can use mpdboot 
to set up my daemons, and mpdtrace confirms that all nodes are 
present; mpdringtest also runs as expected.  However, during some jobs, 
the MPDs die unexpectedly.  I originally thought this was caused by a 
network connectivity issue (maybe a firewall?), but if I run a single 
daemon on the head node (using "mpd &"), the same thing happens, though 
only sometimes.  The problem is sporadic: on some occasions the 
cluster runs the same job for days on end without problems, and at other 
times the daemons die within minutes.  Any help in figuring out how 
to debug this problem would be greatly appreciated.  Thanks!

-Michael



