[mpich2-dev] Real issue with MPICH2 in disk-less cluster environment
Joseph Norris
jnorris at ucmerced.edu
Wed May 12 09:27:45 CDT 2010
Hello to all, and many thanks in advance for any information you can
give me.
I have recently moved our 66-node cluster to a CentOS-based setup in
which the compute nodes boot diskless over TFTP from a chrooted
environment on the head node, with SGE installed. I was not at all
familiar with MPICH2, but many of my users rely on it, so I downloaded
the tarball, installed it, and read the docs, and I thought things were
going OK. However, I can run mpirun on a node, but a job that goes out
to the nodes via SGE gives me the following in my error output:
Got 1 slots.
===================================================================
Here we go
===================================================================
mpiexec_c20: cannot connect to local mpd (/tmp/mpd2.console_joseph);
possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
I have jumped through hoops getting mpd to run on the nodes (well, most
of them). It seems that when mpdboot goes through the mpd.host file it
invariably finds a node that returns the following error:
mpdboot_elcapitan.ucmerced.edu (handle_mpd_output 415): failed to
connect to mpd on c31
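One way to narrow down which nodes are failing is to test each host in
the hosts file separately before running mpdboot. The following is only
a minimal sketch, not something taken from the MPICH2 docs: it assumes
mpdboot is reaching the nodes over ssh, that the hosts file has one host
per line (optionally followed by ":ncpus" or other fields), and the
function names list_hosts and check_hosts are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print one hostname per line from an mpd hosts file,
# skipping blank lines and '#' comments and dropping anything after the
# hostname (for example a ":ncpus" suffix or extra fields).
list_hosts() {
    sed -e 's/#.*//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]].*//' \
        -e 's/:.*//' "$1" | grep -v '^$'
}

# Hypothetical helper: try a non-interactive ssh to every host in the file
# and report which ones answer. BatchMode=yes makes ssh fail instead of
# prompting for a password; ConnectTimeout=5 stops it hanging on dead nodes.
check_hosts() {
    for h in $(list_hosts "$1"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true \
                </dev/null 2>/dev/null; then
            echo "ok   $h"
        else
            echo "FAIL $h"
        fi
    done
}
```

Running "check_hosts mpd.host" from the head node would then list any
node (such as c31 above) that cannot even be reached without a password
prompt, which is a separate problem from mpd itself failing to start
there.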
I am really out of my depth with these issues. This was running so
that users could attach to an mpd run as root if they wanted to, or
start their own if they preferred. I am not sure where I have missed the
mark, and I really need to get this up and running.
Thank you for your help.
--
Joseph Norris
Application Developer & Server Administrator
209-228-4576
jnorris at ucmerced.edu