[mpich-discuss] mpd as system process?

Reuti reuti at staff.uni-marburg.de
Thu Aug 5 15:09:17 CDT 2010


Am 05.08.2010 um 21:53 schrieb Marc Moreau:

> I have been playing with this for some time with no luck whatsoever.  mpd
> never boots and everything times out.  Here is what I get:
> 
> === Begin ===
> -catch_rsh /gridware/sge/default/spool/compute-1-24/active_jobs/53194.1/pe_hostfile
> /opt/mpich2-1.2.1p1
> compute-1-24:4

Ok, this looks like ROCKS. In this case it is necessary to add "--short" to the `hostname` command so that the returned name matches the name of the master node of the parallel job, unless you installed SGE to honor the FQDN (set during installation). This way the first round of the loop always starts the local mpd; otherwise $PORT is empty and you get the error you saw.

NODE=`hostname --short`
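
The check boils down to something like the following sketch (variable names are mine, not taken verbatim from the Howto; only the pe_hostfile entry and `hostname --short` come from the output above):

```shell
#!/bin/sh
# Sketch of the master-node check in the start script (illustrative names).
# The first entry of SGE's pe_hostfile names the master node of the
# parallel job, e.g. "compute-1-24:4" as in the log above.
pe_first_entry="compute-1-24:4"

NODE=`hostname --short`                       # short name, no domain part
MASTER=`printf '%s\n' "$pe_first_entry" | cut -d: -f1`

# Only when the two names match is the local mpd started in the first
# round of the loop; a FQDN/short-name mismatch leaves $PORT empty.
if [ "$NODE" = "$MASTER" ]; then
    echo "master node: starting local mpd"
else
    echo "name mismatch ($NODE vs $MASTER): check FQDN vs. short hostnames"
fi
```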

-- Reuti

PS: Yes, I should really rework the Howto to mention this.


> usage: start_mpich2 [-n <hostname>] mpich2-mpd-path [mpd-parameters ..]
> 
> where: 'hostname' gives the name of the target host
> startmpich2.sh: check for mpd daemons (1 of 10)
> startmpich2.sh: check for mpd daemons (2 of 10)
> startmpich2.sh: check for mpd daemons (3 of 10)
> startmpich2.sh: check for mpd daemons (4 of 10)
> startmpich2.sh: check for mpd daemons (5 of 10)
> startmpich2.sh: check for mpd daemons (6 of 10)
> startmpich2.sh: check for mpd daemons (7 of 10)
> startmpich2.sh: check for mpd daemons (8 of 10)
> startmpich2.sh: check for mpd daemons (9 of 10)
> startmpich2.sh: check for mpd daemons (10 of 10)
> startmpich2.sh: got only 8 of 1 nodes, aborting
> -catch_rsh /opt/mpich2-1.2.1p1
> mpdallexit: cannot connect to local mpd
> (/tmp/mpd2.console_marc.moreau_sge_53194.undefined); possible causes:
>  1. no mpd is running on this host
>  2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>    mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
> === END ===
> 
> Any suggestions?
> 
> -- Marc
> 
> On Thu, Aug 5, 2010 at 11:53 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
>> Hi Marc,
>> 
>> Am 05.08.2010 um 19:46 schrieb Marc Moreau:
>> 
>>> I'm setting up MPICH2 on my cluster where users run many relatively
>>> short processes (2-10 hours).  I am using SunGridEngine to manage
>>> the scheduling. The problem that I am running into is that SGE kills
>>> the mpd process when the job is done, even when other jobs are using
>>> it.  So if there are multiple MPI jobs running on the same node, they
>>> all die when the first process dies.
>>> 
>>> As a solution I'd like to set everything up so that users can just
>>> 'run' MPI jobs and not need to worry about starting and killing mpd
>>> within each job.  I'm thinking it would be nice to setup mpd as a
>>> system process and then have all the jobs run on the system mpd.  Is
>>> this sane and possible? Any other solutions ?
>> 
>> please have a look here:
>> 
>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>> 
>> It will create one dedicated ring per job. The ring is set up and removed by the PE's start/stop_proc_args scripts. The users just need to set the correct port number in their scripts (please check the demo script included in the archive for this).
>> 
>> -- Reuti
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>> 
> 


