[mpich2-dev] Real issue with MPICH2 in disk-less cluster environment

Dave Goodell goodell at mcs.anl.gov
Wed May 12 09:39:47 CDT 2010


What version of MPICH2 are you using?

Please try the hydra process manager instead of MPD and let us know if  
that works for you: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
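
If you do try Hydra under SGE, one possible way to hand it the hosts SGE allocated is sketched below. This is only a sketch under some assumptions: that the job runs in an SGE parallel environment so that $PE_HOSTFILE is set, that each line of that file begins with "hostname slots", and that Hydra reads a host file of "hostname:slots" lines. The function names and the output path are made up for illustration.

```python
import os


def pe_hostfile_to_hydra(lines):
    """Turn SGE PE_HOSTFILE lines ("host slots queue range") into "host:slots"."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        entries.append("%s:%d" % (host, slots))
    return entries


def write_hydra_hostfile(out_path, pe_path=None):
    """Write a Hydra-style host file from $PE_HOSTFILE (or an explicit path)."""
    if pe_path is None:
        # Assumption: SGE exports PE_HOSTFILE inside parallel-environment jobs.
        pe_path = os.environ["PE_HOSTFILE"]
    with open(pe_path) as src:
        entries = pe_hostfile_to_hydra(src)
    with open(out_path, "w") as dst:
        dst.write("\n".join(entries) + "\n")
    return entries
```

From the job script you could then call write_hydra_hostfile("hydra.hosts") and run something like "mpiexec.hydra -f hydra.hosts -n $NSLOTS ./your_app", where hydra.hosts and ./your_app are placeholders. Since Hydra does not need an mpd ring, this would sidestep the mpdboot step entirely.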

-Dave

On May 12, 2010, at 9:27 AM, Joseph Norris wrote:

> Hello to all, and many thanks in advance for any information you can
> give me.
>
> I have recently moved our 66-node cluster to a CentOS-based setup in
> which the compute nodes boot disklessly via TFTP from a chrooted
> environment on the head node, with SGE loaded, etc. I was not familiar
> at all with MPICH2, but many of my users rely on it, so I downloaded
> the tarball, installed it, and read the docs. I thought I was doing
> OK; however, while I can run mpirun on a node, a job that goes out
> to the nodes via SGE gives me the following in my error return:
>
> Got 1 slots.
> ===================================================================
>                        Here we go
> ===================================================================
> mpiexec_c20: cannot connect to local mpd (/tmp/mpd2.console_joseph);
> possible causes:
> 1. no mpd is running on this host
> 2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>   mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
>
> I have jumped through hoops getting mpd to run on the nodes (well,
> most of them). When mpdboot goes through the mpd.host file, it
> invariably finds a node that returns the following error:
>
> mpdboot_elcapitan.ucmerced.edu (handle_mpd_output 415): failed to  
> connect to mpd on c31
>
> I am really out of my depth with these issues. This was running so
> that users could either attach to an mpd run as root or roll their
> own, as they preferred. I am not sure where I have missed the mark,
> and I really need to get this up and running.
>
> Thank you for your help.
>
> -- 
> Joseph Norris
> Application Developer & Server Administrator
> 209-228-4576
> jnorris at ucmerced.edu


