[mpich2-dev] Real issue with MPICH2 in disk-less cluster environment

Joseph Norris jnorris at ucmerced.edu
Wed May 12 09:27:45 CDT 2010


Hello to all, and many thanks in advance for any information you can 
give me.

I recently moved our 66-node cluster to a CentOS-based setup: the 
compute nodes boot diskless over TFTP from a chrooted environment on 
the head node, with SGE installed. I was not at all familiar with 
MPICH2, but many of my users rely on it, so I downloaded the tarball, 
installed it, read the docs, and thought things were going fine. 
However, while I can run mpirun directly on a node, a job that goes out 
to the nodes via SGE returns the following in its error output:

Got 1 slots.
===================================================================
                         Here we go
===================================================================
mpiexec_c20: cannot connect to local mpd (/tmp/mpd2.console_joseph); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.

I have jumped through hoops getting mpd to run on the nodes (well, most 
of them). When mpdboot goes through the mpd.host file, it invariably 
hits a node that returns the following error:

mpdboot_elcapitan.ucmerced.edu (handle_mpd_output 415): failed to connect to mpd on c31
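
One way to narrow down which hosts mpdboot cannot reach is to test each 
entry of the hosts file on its own before booting the ring. The 
following is only a minimal sketch, not a tested fix. It assumes the 
file lists one host per line with an optional ":ncpus" suffix, that 
lines starting with "#" are comments, and that passwordless ssh from 
the head node to the compute nodes is expected to work. The function 
names list_hosts and check_hosts are made up for illustration.

```shell
#!/bin/sh
# list_hosts FILE: print one hostname per line from an mpd hosts file.
# Assumed format: "host" or "host:ncpus", with '#' starting a comment.
list_hosts() {
    sed -e 's/#.*//' -e 's/[[:space:]]*$//' -e '/^[[:space:]]*$/d' "$1" |
        awk '{ sub(/:.*/, "", $1); print $1 }'
}

# check_hosts FILE: run a trivial command over ssh on every listed host
# and report which hosts could not be reached without a password prompt.
check_hosts() {
    for h in $(list_hosts "$1"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true </dev/null; then
            echo "ok      $h"
        else
            echo "FAILED  $h"
        fi
    done
}
```

After sourcing the script, running "check_hosts mpd.host" prints one 
line per node. Hosts reported as FAILED are candidates for the node 
that mpdboot complains about. Once a ring does come up, mpdtrace lists 
the hosts that actually joined it.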

I am really out of my depth with these issues. This was set up so that 
users could attach to an mpd run as root if they wanted to, or roll 
their own. I am not sure where I have missed the mark, and I really 
need to get this up and running.

Thank you for your help.

-- 
Joseph Norris
Application Developer & Server Administrator
209-228-4576
jnorris at ucmerced.edu 


