[MPICH] Problem with -machinefile

Blankenship, David David.Blankenship at kla-tencor.com
Fri Apr 20 16:38:51 CDT 2007


That does seem to fix the problem. Thank you.
 
David

________________________________

From: Rajeev Thakur [mailto:thakur at mcs.anl.gov] 
Sent: Friday, April 20, 2007 12:56 PM
To: Blankenship, David; mpich-discuss at mcs.anl.gov
Subject: RE: [MPICH] Problem with -machinefile


Are you using the latest release, 1.0.5p4? There is a fix in there for a
problem with machinefile. It might help.
 
Rajeev
 


________________________________

	From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Blankenship, David
	Sent: Friday, April 20, 2007 11:27 AM
	To: mpich-discuss at mcs.anl.gov
	Subject: [MPICH] Problem with -machinefile
	
	

	I am having a problem running mpiexec with the -machinefile option (Red Hat Enterprise Linux 4, 64-bit).

	When I use the -machinefile option, my application hangs (deadlocks) while attempting communication. The master is sending, the workers are receiving, but nothing happens. Any thoughts?

	I start my MPD ring as follows: 

	> mpdboot -n 3 -f mpd.hosts 
	> cat mpd.hosts 
	pad-lnx52:2 
	noclue:2 
	question:4 

	I can then run my application successfully, either by placing processes explicitly with the -host option or by letting the MPD ring choose the systems. Both of the following command lines work:

	> mpiexec -l -n 1 -host pad-lnx52 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml : \
	      -n 2 -host noclue lithorun : \
	      -n 2 -host question lithorun

	> mpiexec -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml

	But when I try to use the -machinefile option, my application hangs. The master is sending and all of the workers are receiving, but no communication appears to actually be happening.

	> mpiexec -machinefile mpd.hosts -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml

	Here is a trace of the process when it hangs. You can see that the workers have been started and are waiting for a work packet in an MPI::COMM_WORLD.Probe call. The master has divided up the work and is attempting to send the first packet using an MPI::COMM_WORLD.Send call. Then nothing else happens. This only occurs when I use the -machinefile option.

	> mpiexec -machinefile mpd.hosts -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml
	3: Worker on noclue 
	3: Waiting for work... 
	2: Worker on noclue 
	2: Waiting for work... 
	1: Worker on pad-lnx52.kla-tencor.com 
	1: Waiting for work... 
	0: Master on pad-lnx52.kla-tencor.com 
	0: Loading dev/LithoWare/Samples/FEM1D.xml 
	0: Found Factorial(FEM1D) 
	4: Worker on question.kla-tencor.com 
	4: Waiting for work... 
	0: Loading Sample.plt 
	0: Distributing Factorial(FEM1D) with 45 experiments over 4 processes with 3 work packets
	0: Sending work(1625) 
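
The rank placement in the hanging trace above (ranks 0 and 1 on pad-lnx52, ranks 2 and 3 on noclue, rank 4 on question) is consistent with filling each "host:count" entry of mpd.hosts in order. A minimal Python sketch of that reading follows. `expand_machinefile` is a hypothetical helper written for illustration, not MPICH code; the in-order fill is inferred from this one trace, and the wrap-around for larger process counts is purely an assumption.

```python
def expand_machinefile(text, nprocs):
    """Return a list mapping rank -> host for 'host:count' entries.

    Assumptions (not taken from MPICH source): entries are consumed top to
    bottom, a bare 'host' counts as one slot, and the slot list wraps around
    if nprocs exceeds the total slot count.
    """
    slots = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        host, _, count = line.partition(":")
        slots.extend([host] * (int(count) if count else 1))
    if not slots:
        raise ValueError("machinefile has no host entries")
    return [slots[rank % len(slots)] for rank in range(nprocs)]


# The mpd.hosts contents quoted earlier in this message.
MPD_HOSTS = "pad-lnx52:2\nnoclue:2\nquestion:4\n"

for rank, host in enumerate(expand_machinefile(MPD_HOSTS, 5)):
    print(f"{rank}: {host}")
```

Under these assumptions the sketch places two ranks on pad-lnx52, two on noclue, and one on question, matching the hosts reported by the workers in the hanging run.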


	For reference, here is a trace of the process when it works:

	> mpiexec -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml
	0: Master on pad-lnx52.kla-tencor.com 
	0: Loading dev/LithoWare/Samples/FEM1D.xml 
	0: Found Factorial(FEM1D) 
	0: Loading Sample.plt 
	0: Distributing Factorial(FEM1D) with 45 experiments over 4 processes with 3 work packets
	0: Sending work(1625) 
	1: Worker on noclue 
	1: Waiting for work... 
	2: Worker on noclue 
	2: Waiting for work... 
	3: Worker on question.kla-tencor.com 
	0: Sent work(1625) 
	0: Sending work(1619) 
	3: Waiting for work... 
	4: Worker on question.kla-tencor.com 
	4: Waiting for work... 
	4: Received work(1625) 
	4: Found Factorial(FEM1D) 
	4: Loading Sample.plt 
	0: Sent work(1619) 
	0: Sending work(1622) 
	4: Running Factorial(FEM1D) with 15 experiments 
	3: Received work(1619) 
	3: Found Factorial(FEM1D) 
	3: Loading Sample.plt 
	3: Running Factorial(FEM1D) with 15 experiments 
	0: Sent work(1622) 
	0: Waiting for results... 
	2: Received work(1622) 
	2: Found Factorial(FEM1D) 
	2: Loading Sample.plt 
	2: Running Factorial(FEM1D) with 15 experiments 
	0: Received results(1672) 
	0: Waiting for results... 
	4: Factorial(FEM1D) complete (0.04175) 
	4: Sending results(1672) 
	4: Waiting for work... 
	3: Factorial(FEM1D) complete (0.0420239) 
	0: Received results(1652) 
	0: Waiting for results... 
	3: Sending results(1652) 
	3: Waiting for work... 
	2: Factorial(FEM1D) complete (0.0852771) 
	0: Received results(1400) 
	0: Factorial(FEM1D) complete (0.136751) 
	2: Sending results(1400) 
	2: Waiting for work... 
	1: Received work(0) 
	4: Received work(0) 
	3: Received work(0) 
	2: Received work(0) 
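
The visible difference between the two traces is that the hanging run has a "Sending work(1625)" line with no matching "Sent work(1625)". A small Python sketch for spotting that pattern in `mpiexec -l` output is given here. `unmatched_work_sends` is a hypothetical helper, and the "Sending work(N)" / "Sent work(N)" formats are lithorun's own log messages as quoted in this message, so the sketch applies only to this application's output.

```python
import re

# Matches the rank label added by "mpiexec -l" plus lithorun's log messages.
SENDING = re.compile(r"^\s*(\d+): Sending work\((\d+)\)")
SENT = re.compile(r"^\s*(\d+): Sent work\((\d+)\)")


def unmatched_work_sends(trace):
    """Return (rank, size) pairs that logged 'Sending work' but never 'Sent work'."""
    pending = []
    for line in trace.splitlines():
        started = SENDING.match(line)
        if started:
            pending.append(started.groups())
            continue
        finished = SENT.match(line)
        if finished and finished.groups() in pending:
            pending.remove(finished.groups())
    return pending


hung = "0: Loading Sample.plt\n0: Sending work(1625)\n"
working = "0: Sending work(1625)\n1: Worker on noclue\n0: Sent work(1625)\n"

print(unmatched_work_sends(hung))     # one pending send from rank 0
print(unmatched_work_sends(working))  # no pending sends
```

A non-empty result only indicates that a send was logged as started and not as finished; it does not by itself say why the send is blocked.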
