[mpich-discuss] [mpich2-dev] Real issue with MPICH2 in disk-less cluster environment
Dave Goodell
goodell at mcs.anl.gov
Wed May 12 09:39:47 CDT 2010
What version of MPICH2 are you using?
Please try the hydra process manager instead of MPD and let us know if
that works for you: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
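
For anyone following along under SGE, here is a rough sketch of what a Hydra launch from inside a job script might look like. It assumes that the MPICH2 install provides the `mpiexec.hydra` binary, that the job runs in a parallel environment so SGE sets `$PE_HOSTFILE` and `$NSLOTS`, and that the program is called `./my_mpi_app` (a hypothetical name). The helper `pe_to_hydra_hosts` is likewise only illustrative; it rewrites SGE's host file into the `host:slots` form that Hydra's `-f` option accepts.

```shell
#!/bin/sh
# Sketch only: convert an SGE PE host file (columns: host, slots, queue,
# processor range) into Hydra's "host:slots" host-file format.
pe_to_hydra_hosts() {
    awk 'NF >= 2 { print $1 ":" $2 }' "$1"
}

# Launch only when running inside an SGE parallel-environment job and
# mpiexec.hydra is on the PATH (both are assumptions about the site setup).
if [ -n "${PE_HOSTFILE:-}" ] && command -v mpiexec.hydra >/dev/null 2>&1; then
    hosts="${TMPDIR:-/tmp}/hydra_hosts.$$"
    pe_to_hydra_hosts "$PE_HOSTFILE" > "$hosts"
    # ./my_mpi_app is a hypothetical program name.
    mpiexec.hydra -f "$hosts" -n "$NSLOTS" ./my_mpi_app
    rm -f "$hosts"
fi
```

Because Hydra starts the remote processes itself, nothing like mpdboot has to be run first and no mpd ring has to be alive on the compute nodes, so this should sidestep the "cannot connect to local mpd" error.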
-Dave
On May 12, 2010, at 9:27 AM, Joseph Norris wrote:
> Hello to all, and many thanks in advance for any information you can
> give me.
>
> I have recently moved our 66-node cluster to a CentOS-based setup
> that uses diskless TFTP boot from a chrooted environment on the head
> node out to the compute nodes, with SGE loaded, etc. I was not at all
> familiar with MPICH2, and many of my users use it, so I downloaded the
> tarball, installed it, and read the docs, and I thought I was doing
> OK. However, I can run mpirun on a node, but in a job that goes out
> to the nodes via SGE I get the following in my error return:
>
> Got 1 slots.
> ===================================================================
> Here we go
> ===================================================================
> mpiexec_c20: cannot connect to local mpd (/tmp/mpd2.console_joseph);
> possible causes:
> 1. no mpd is running on this host
> 2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
> mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
>
> I have jumped through hoops getting mpd to run on the nodes (well,
> most of them). It seems that when mpdboot goes through the mpd.host
> file it invariably finds a node that returns the following error:
>
> mpdboot_elcapitan.ucmerced.edu (handle_mpd_output 415): failed to
> connect to mpd on c31
>
> I am really out of my depth with these issues. This was set up so
> that users could attach to an mpd run as root if they wanted to, or
> roll their own if they preferred. I am not sure where I have missed
> the mark, and I really need to get this up and running.
>
> Thank you for your help.
>
> --
> Joseph Norris
> Application Developer & Server Administrator
> 209-228-4576
> jnorris at ucmerced.edu