[mpich-discuss] multiple mpd rings as one user
oliver.wissdorf at boehringer-ingelheim.com
Wed Jan 28 01:58:27 CST 2009
Hello,
I am not sure whether I need multiple rings. At the moment I am using one ring.
The problem with a single ring on a big cluster (144 nodes) is that I have to
start the ring independently of the job: I first start the ring around the
whole cluster, and when a job is about to start it gets its machine list from
the queuing system. This works fine until one machine in the cluster stops
working. Then the ring is broken and the job crashes.
Is there a better way to solve this problem? Is it possible to remove machines
from the ring, or add machines to it, in order to repair it?
Thanks,
Oliver
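One possible way around a cluster-wide ring, sketched here only as an illustration, is to boot a small ring per job from the machine list the queuing system hands out and to shut it down again when the job ends. The sketch assumes the machine list has one host per line (optionally as host:ncpus, repeated once per slot). The function names, the DRY_RUN switch and the file names are hypothetical and not part of MPD; only mpdboot, mpiexec and mpdallexit are real MPD commands. Whether the local host has to appear in the host file, and how it counts towards -n, depends on the installation and should be checked against the local mpdboot documentation.

```shell
#!/bin/sh
# Sketch: one MPD ring per job, built from the scheduler's machine list.
# Hypothetical helpers; DRY_RUN=1 (the default) only prints the commands.

# Reduce a machine list to unique host names, keeping first-seen order.
# Accepts "host" or "host:ncpus" in the first column of each line.
unique_hosts() {
    awk 'NF { split($1, a, ":"); if (!seen[a[1]]++) print a[1] }'
}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Usage: start_job_ring <machinefile> <np> <program> [args...]
start_job_ring() {
    machinefile=$1
    np=$2
    shift 2

    hosts=$(mktemp) || return 1
    unique_hosts < "$machinefile" > "$hosts"
    nhosts=$(wc -l < "$hosts" | tr -d ' ')

    # Boot a ring that spans only the hosts assigned to this job.
    if ! run mpdboot -n "$nhosts" -f "$hosts"; then
        rm -f "$hosts"
        return 1
    fi

    # Run the job, remember its exit status, then always tear the ring down.
    run mpiexec -machinefile "$machinefile" -n "$np" "$@"
    status=$?
    run mpdallexit
    rm -f "$hosts"
    return "$status"
}
```

With DRY_RUN=1, calling start_job_ring machines.txt 8 ./jobscript only prints the three commands, which makes it easy to check the host count before letting the script start anything. A ring built this way contains only the hosts the scheduler considers usable for the job, so a failed node elsewhere in the cluster does not affect it. It does not by itself allow two rings of the same user to share a host.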
-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
Sent: Tuesday, January 27, 2009 18:00
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] multiple mpd rings as one user
Do you really need multiple MPD rings? You can run multiple jobs with
one ring.
Rajeev
________________________________
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
oliver.wissdorf at boehringer-ingelheim.com
Sent: Tuesday, January 27, 2009 10:04 AM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] multiple mpd rings as one user
Hello,
I want to start multiple rings as one user on a Linux cluster in order to
submit more than one job at a time. To do this I use mpdboot and mpiexec:
/usr/mpi/gcc/mvapich2-1.0.2/bin/mpdboot -n 17 -f $PWD/mpd.txt --verbose --ifhn=172.17.30.101
/usr/mpi/gcc/mvapich2-1.0.2/bin/mpiexec -machinefile $WORKDIR/mpd.txt -n $np <jobscript>
I tried setting MPD_CON_EXT before running mpdboot. This allows me to start
multiple mpds on the host where the ring starts, but not on the other hosts of
the ring.
I also tried the -1 option of mpdboot, but this did not work either.
Is there a way to solve this issue? Did I make a mistake somewhere?
Oliver