[mpich-discuss] MPD daemons disconnect/die unexpectedly
Michael Ahlmann
mahlmann at ucdavis.edu
Wed Mar 26 17:50:29 CDT 2008
I am running a 4-node cluster with CentOS 5 and dual-core Athlon
chips, and I installed MPICH2 for parallel computations. I can use mpdboot
to set up my daemons, and mpdtrace confirms that all nodes are
present. mpdringtest also runs as expected. However, during some jobs,
the MPDs die unexpectedly. I originally thought this was caused by a
network connectivity issue (maybe a firewall?), but if I run a single
daemon on the head node (using mpd &), the same thing happens, though
only sometimes. The problem is sporadic: on some occasions the
cluster runs the same job for days on end without problems, and at other
times the problem occurs within minutes. Any help in figuring out how
to debug this problem would be highly appreciated. Thanks!
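
Because the failures are sporadic, one possible first step is to record
exactly when a daemon disappears, so the time can be lined up with job
activity. The following is only a minimal sketch, not part of MPICH2: the
names pid_alive and watch and the script name watch_mpd.py are hypothetical,
and it assumes the PID of the running mpd is already known and that the
script runs on the same node as that daemon.

```python
import os
import sys
import time


def pid_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def watch(pid, interval=5.0, out=sys.stderr):
    """Poll until the process is gone, then log a timestamp and return it."""
    while pid_alive(pid):
        time.sleep(interval)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"{stamp} process {pid} is no longer running", file=out)
    return stamp


if __name__ == "__main__":
    # Hypothetical usage: python watch_mpd.py <PID of the mpd daemon>
    watch(int(sys.argv[1]))
```

The logged timestamp can then be compared against job start times and the
system logs on that node. Polling is used here only because it needs nothing
beyond the standard library; it says when the daemon went away, not why.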
-Michael