[mpich-discuss] multiple mpd rings as one user

oliver.wissdorf at boehringer-ingelheim.com
Wed Jan 28 01:58:27 CST 2009


Hello,
 
I am not sure whether I need multiple rings. At the moment I am using one ring.
The problem with a single ring on a big cluster (144 nodes) is that I have to
start the ring independently of the job: I first start the ring around the
whole cluster, and when a job starts it gets its machine list from the
queuing system. This works fine until one machine in the cluster stops
working. Then the ring is broken and the job crashes.
Is there a better way to solve this problem? Is it possible to remove machines
from the ring, or add machines to it, in order to repair it?
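
One alternative to a cluster-wide ring is to boot a small ring per job from
the machine list the queuing system hands out, and to shut it down when the
job ends. The following is only a minimal sketch under assumptions that are
not stated above: each job gets its nodes exclusively, the script runs on one
of the hosts in the list, and mpdboot, mpiexec and mpdallexit are on PATH. The
function names and the DRY_RUN switch are hypothetical helpers introduced for
illustration.

```shell
#!/bin/sh
# Sketch: one private MPD ring per job instead of one ring for the cluster.
# Assumes exclusive nodes per job and that this script runs on a host that
# appears in the machine file (mpdboot counts the local host in -n).

# Print the unique hosts of a machine file, one per line.
unique_hosts() {
    # Strip optional ":ncpus" suffixes, drop blank lines, de-duplicate.
    sed -e 's/:.*$//' -e '/^[[:space:]]*$/d' "$1" | sort -u
}

# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Boot a ring on the job's hosts, run the program, then tear the ring down.
# Usage: run_job_in_own_ring MACHINEFILE NP PROGRAM [ARGS...]
run_job_in_own_ring() {
    machinefile=$1; np=$2; shift 2
    hosts=$(mktemp) || return 1
    unique_hosts "$machinefile" > "$hosts"
    nhosts=$(wc -l < "$hosts" | tr -d ' ')
    run mpdboot -n "$nhosts" -f "$hosts" || { rm -f "$hosts"; return 1; }
    run mpiexec -machinefile "$machinefile" -n "$np" "$@"
    status=$?
    run mpdallexit          # shut down this job's ring
    rm -f "$hosts"
    return $status
}
```

Calling run_job_in_own_ring with DRY_RUN=1 set only prints the three commands,
which makes it possible to check the host count before touching the cluster.
With this layout a dead node affects only jobs scheduled onto it, not a ring
spanning all 144 nodes.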
 
Thanks,
 
Oliver
 

 

	-----Original Message-----
	From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
	Sent: Tuesday, January 27, 2009 18:00
	To: mpich-discuss at mcs.anl.gov
	Subject: Re: [mpich-discuss] multiple mpd rings as one user
	
	
	Do you really need multiple MPD rings? You can run multiple jobs with
one ring.
	 
	Rajeev


________________________________

		From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
oliver.wissdorf at boehringer-ingelheim.com
		Sent: Tuesday, January 27, 2009 10:04 AM
		To: mpich-discuss at mcs.anl.gov
		Subject: [mpich-discuss] multiple mpd rings as one user
		
		

		Hello, 

		I want to start multiple rings as one user on a Linux cluster
so that I can submit more than one job at a time. To do so, I use mpdboot and
mpiexec:



		/usr/mpi/gcc/mvapich2-1.0.2/bin/mpdboot -n 17 -f $PWD/mpd.txt --verbose --ifhn=172.17.30.101

		/usr/mpi/gcc/mvapich2-1.0.2/bin/mpiexec -machinefile $WORKDIR/mpd.txt -n $np <jobscript>

		I tried setting MPD_CON_EXT before running mpdboot. This
allows me to start multiple mpds on the host where the ring starts, but not
on the other hosts in the ring.

		I also tried the -1 option of mpdboot, but this did not work
either.

		Is there a way to solve this issue? Did I make any mistakes? 

		Oliver 
