[mpich-discuss] FW: multiple mpd rings as one user

Rajeev Thakur thakur at mcs.anl.gov
Thu Jan 29 09:27:25 CST 2009


-----Original Message-----
Sent: Thursday, January 29, 2009 6:50 AM
Subject: Re: [mpich-discuss] multiple mpd rings as one user

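I believe the following session demonstrates how to do it: set MPD_CON_EXT
to a different value for each ring before running mpdboot, and set it again
before each mpiexec or mpdallexit to select the ring that command talks to.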

(bp400:71)% ./mpdboot -f temph -v -n 3 --remcons
running mpdallexit on bp400
LAUNCHED mpd on bp400  via
RUNNING: mpd on bp400
LAUNCHED mpd on bp401  via  bp400
LAUNCHED mpd on bp402  via  bp400
RUNNING: mpd on bp401
RUNNING: mpd on bp402
(bp400:72)%
(bp400:72)%
(bp400:72)%
(bp400:72)% mpiexec -l -n 3 sh -c 'echo $MPD_CON_EXT'
0: t1
2: t1
1: t1
(bp400:73)%
(bp400:73)% setenv MPD_CON_EXT t2
(bp400:74)%
(bp400:74)% ./mpdboot -f temph -v -n 3 --remcons
running mpdallexit on bp400
LAUNCHED mpd on bp400  via
RUNNING: mpd on bp400
LAUNCHED mpd on bp401  via  bp400
LAUNCHED mpd on bp402  via  bp400
RUNNING: mpd on bp401
RUNNING: mpd on bp402
(bp400:75)% mpiexec -l -n 3 sh -c 'echo $MPD_CON_EXT'
0: t2
2: t2
1: t2
(bp400:76)% setenv MPD_CON_EXT t1
(bp400:77)% mpiexec -l -n 3 sh -c 'echo $MPD_CON_EXT'
0: t1
2: t1
1: t1
(bp400:78)% mpdallexit
(bp400:79)% setenv MPD_CON_EXT t2
(bp400:80)% mpdallexit
(bp400:81)%
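
For anyone scripting this from a Bourne-style shell rather than csh, the
session above could be sketched roughly as follows. This is only an
illustration under a few assumptions: run_in_ring and demo are hypothetical
helper names (not part of MPD), the hosts file temph and the ring names t1
and t2 are taken from the transcript, hostname stands in for the real job,
and mpdboot, mpiexec and mpdallexit are assumed to be on the PATH. By default
the helper only prints what it would run.

```shell
#!/bin/sh
# Sketch: address two MPD rings as one user by switching MPD_CON_EXT.
# run_in_ring and demo are hypothetical helpers, not MPD commands.

# run_in_ring RING CMD [ARGS...]
# Runs CMD with MPD_CON_EXT=RING set for that one command only, which is the
# sh counterpart of "setenv MPD_CON_EXT RING" followed by CMD in csh.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
run_in_ring() {
    ring=$1
    shift
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "MPD_CON_EXT=$ring $*"
    else
        MPD_CON_EXT=$ring "$@"
    fi
}

# Same order of operations as the transcript: boot both rings, run a job in
# each, then shut each ring down separately.
demo() {
    run_in_ring t1 mpdboot -f temph -v -n 3 --remcons
    run_in_ring t2 mpdboot -f temph -v -n 3 --remcons
    run_in_ring t1 mpiexec -l -n 3 hostname
    run_in_ring t2 mpiexec -l -n 3 hostname
    run_in_ring t1 mpdallexit
    run_in_ring t2 mpdallexit
}
```

Calling demo prints the six commands; running it with DRY_RUN=0 would
execute them. Passing the variable per command avoids leaving a stale
MPD_CON_EXT in the shell, which is easy to do with setenv.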

========================================================================
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of oliver.wissdorf at boehringer-ingelheim.com
> Sent: Wednesday, January 28, 2009 1:58 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] multiple mpd rings as one user
>
> Hello,
>
> I am not sure whether I need multiple rings. At the moment I am using one
> ring.
> The problem with one ring on a big cluster (144 nodes) is that I have to
> start the ring independently of the job: I first start the ring around the
> whole cluster, and when a job is ready to start it gets the machine list
> from the queuing system. This works fine until one machine in the cluster
> stops working. Then the ring is broken and the job crashes.
> Is there a better way to solve the problem? Is it possible to remove
> machines from the ring, or add machines to it, in order to repair it?
>
> Thanks,
>
> Oliver
>
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
> Sent: Tuesday, January 27, 2009 18:00
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] multiple mpd rings as one user
>
> Do you really need multiple MPD rings? You can run multiple jobs with 
> one ring.
>
> Rajeev
>
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of oliver.wissdorf at boehringer-ingelheim.com
> Sent: Tuesday, January 27, 2009 10:04 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] multiple mpd rings as one user
>
> Hello,
>
> I want to start multiple rings as one user on a Linux cluster so that I
> can submit more than one job at a time. For this I use mpdboot and
> mpiexec:
>
>
>
> /usr/mpi/gcc/mvapich2-1.0.2/bin/mpdboot -n 17 -f $PWD/mpd.txt --verbose --ifhn=172.17.30.101
>
> /usr/mpi/gcc/mvapich2-1.0.2/bin/mpiexec -machinefile $WORKDIR/mpd.txt -n $np <jobscript>
>
> I tried setting MPD_CON_EXT before running mpdboot. This lets me start
> multiple mpds on the host where the ring starts, but not on the other
> hosts of the ring.
>
> I also tried the -1 option of mpdboot, but this did not work either.
>
> Is there a way to solve this issue? Did I make any mistakes?
>
> Oliver
>



