[mpich-discuss] FW: multiple mpd rings as one user

Reuti reuti at staff.uni-marburg.de
Thu Jan 29 09:33:59 CST 2009


On 29.01.2009 at 16:27, Rajeev Thakur wrote:

> -----Original Message-----
> Sent: Thursday, January 29, 2009 6:50 AM
> Subject: Re: [mpich-discuss] multiple mpd rings as one user
>
> I believe the following demonstrates how to do it.
>
> (bp400:71)% ./mpdboot -f temph -v -n 3 --remcons
> running mpdallexit on bp400
> LAUNCHED mpd on bp400  via
> RUNNING: mpd on bp400
> LAUNCHED mpd on bp401  via  bp400
> LAUNCHED mpd on bp402  via  bp400
> RUNNING: mpd on bp401
> RUNNING: mpd on bp402
> (bp400:72)%
> (bp400:72)%
> (bp400:72)%
> (bp400:72)% mpiexec -l -n 3 sh -c 'echo $MPD_CON_EXT'
> 0: t1
> 2: t1
> 1: t1
> (bp400:73)%
> (bp400:73)% setenv MPD_CON_EXT t2
> (bp400:74)%
> (bp400:74)% ./mpdboot -f temph -v -n 3 --remcons
> running mpdallexit on bp400
> LAUNCHED mpd on bp400  via
> RUNNING: mpd on bp400
> LAUNCHED mpd on bp401  via  bp400
> LAUNCHED mpd on bp402  via  bp400
> RUNNING: mpd on bp401
> RUNNING: mpd on bp402
> (bp400:75)% mpiexec -l -n 3 sh -c 'echo $MPD_CON_EXT'
> 0: t2
> 2: t2
> 1: t2
> (bp400:76)% setenv MPD_CON_EXT t1
> (bp400:77)% mpiexec -l -n 3 sh -c 'echo $MPD_CON_EXT'
> 0: t1
> 2: t1
> 1: t1
> (bp400:78)% mpdallexit
> (bp400:79)% setenv MPD_CON_EXT t2
> (bp400:80)% mpdallexit
> (bp400:81)%
>
> =========================================================================
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov]
>> On Behalf Of oliver.wissdorf at boehringer-ingelheim.com
>> Sent: Wednesday, January 28, 2009 1:58 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] multiple mpd rings as one user
>>
>> Hello,
>>
>> I am not sure whether I need multiple rings. At the moment I am using
>> one ring. The problem with one ring on a big cluster (144 nodes) is
>> that I have to start the ring independently of the job: I first start
>> the ring around the whole cluster, and when a job is about to start it
>> gets the machine list from the queuing system.

Which queuing system are you using?

-- Reuti


>> This works fine
>> until one machine in the cluster stops working. Then the ring is
>> broken and the job crashes.
>> Is there a better way to solve the problem? Is it possible to remove
>> machines from the ring, or add machines to it, in order to repair it?
>>
>> Thanks,
>>
>> Oliver
>>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov]
>> On Behalf Of Rajeev Thakur
>> Sent: Tuesday, January 27, 2009 18:00
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] multiple mpd rings as one user
>>
>> Do you really need multiple MPD rings? You can run multiple jobs with
>> one ring.
>>
>> Rajeev
>>
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov]
>> On Behalf Of oliver.wissdorf at boehringer-ingelheim.com
>> Sent: Tuesday, January 27, 2009 10:04 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: [mpich-discuss] multiple mpd rings as one user
>>
>> Hello,
>>
>> I want to start multiple rings as one user on a Linux cluster so that
>> I can submit more than one job at a time. To do this I use mpdboot and
>> mpiexec:
>>
>>
>>
>> /usr/mpi/gcc/mvapich2-1.0.2/bin/mpdboot -n 17 -f $PWD/mpd.txt --verbose --ifhn=172.17.30.101
>>
>> /usr/mpi/gcc/mvapich2-1.0.2/bin/mpiexec -machinefile $WORKDIR/mpd.txt -n $np <jobscript>
>>
>> I tried setting MPD_CON_EXT before running mpdboot. This lets me
>> start multiple mpds on the host where the ring starts, but not on the
>> other hosts of the ring.
>>
>> I also tried the -1 option with mpdboot, but that did not work
>> either.
>>
>> Is there a way to solve this issue? Did I make any mistakes?
>>
>> Oliver
>>
>
>
>
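The quoted session uses csh syntax (setenv) and types each step by hand. As a rough sketch only, the same sequence (set MPD_CON_EXT, mpdboot, mpiexec, mpdallexit) could be wrapped in a Bourne-shell helper so that each job gets its own ring. The names run, ring_job and DRY_RUN are hypothetical and not part of MPD; the mpdboot and mpiexec flags are copied from the quoted session, and the sketch has not been tried against a real MPD installation. DRY_RUN defaults to 1 here, so the commands are only printed.

```shell
#!/bin/sh
# Sketch: one private MPD ring per job, selected through MPD_CON_EXT.
# DRY_RUN, run and ring_job are illustrative names (assumptions), not MPD tools.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"          # show what would be executed
    else
        "$@"                 # execute the command unchanged
    fi
}

# Usage: ring_job EXT HOSTFILE NHOSTS NPROCS COMMAND [ARGS...]
ring_job() {
    ext=$1
    hostfile=$2
    nhosts=$3
    nprocs=$4
    shift 4

    # sh equivalent of the csh line "setenv MPD_CON_EXT t2"
    MPD_CON_EXT=$ext
    export MPD_CON_EXT

    run mpdboot -f "$hostfile" -n "$nhosts" --remcons   # boot this job's ring
    run mpiexec -n "$nprocs" "$@"                       # run the job in that ring
    status=$?
    run mpdallexit                                      # tear down only this ring
    return "$status"
}
```

For example, ring_job t2 temph 3 3 hostname prints the mpdboot, mpiexec and mpdallexit lines, each prefixed with "+ ", and leaves MPD_CON_EXT set to t2. Setting DRY_RUN=0 would run the commands for real, assuming the MPD tools are on the PATH.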


