[MPICH] Problem with -machinefile

Rajeev Thakur thakur at mcs.anl.gov
Fri Apr 20 12:55:57 CDT 2007


Are you using the latest release, 1.0.5p4? There is a fix in there for a
problem with machinefile. It might help.
 
Rajeev
 


  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Blankenship, David
Sent: Friday, April 20, 2007 11:27 AM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] Problem with -machinefile



I am having a problem running mpiexec with the -machinefile option (Red Hat
Enterprise Linux 4, 64-bit).

When I use the -machinefile option, my application hangs (deadlocks) while
attempting communication. The master is sending, the workers are receiving,
but nothing happens. Any thoughts?

I start my MPD ring as follows: 

> mpdboot -n 3 -f mpd.hosts 
> cat mpd.hosts 
pad-lnx52:2 
noclue:2 
question:4 

I can then run my application with the -host option, or by letting the MPD
ring choose the systems, using either of the following command lines:

> mpiexec -l -n 1 -host pad-lnx52 lithorun dev/LithoWare/Samples/FEM1D.xml
Output.xml : -n 2 -host noclue lithorun : -n 2 -host question lithorun

> mpiexec -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml 

But when I try to use the -machinefile option, my application hangs. The
master is sending and all of the workers are receiving, but no communication
actually appears to happen.

> mpiexec -machinefile mpd.hosts -l -n 5 lithorun
dev/LithoWare/Samples/FEM1D.xml Output.xml 

Here is a trace of the process when it hangs. You can see that the workers
have been started and are waiting for a work packet in an
MPI::COMM_WORLD.Probe call. The master has divided up the work and is
attempting to send the first packet with an MPI::COMM_WORLD.Send call. Then
nothing else happens. This only occurs when I use the -machinefile option.
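For context, here is a minimal, self-contained sketch of the Probe/Send
master-worker exchange described above. This is not the actual lithorun
source (which is not shown here); the WORK_TAG/DONE_TAG values and the
string payload are illustrative assumptions only.

// Minimal sketch (not the actual lithorun code) of the master/worker
// exchange described above. The tags and the payload are assumptions.
#include <mpi.h>
#include <cstring>
#include <vector>
#include <iostream>

static const int WORK_TAG = 1;  // assumed tag for a work packet
static const int DONE_TAG = 2;  // assumed tag for the "no more work" signal

int main(int argc, char *argv[])
{
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();
    int size = MPI::COMM_WORLD.Get_size();

    if (rank == 0) {
        // Master: send one work packet to every worker, then tell each to stop.
        const char packet[] = "sample work packet";
        for (int dest = 1; dest < size; ++dest)
            MPI::COMM_WORLD.Send(packet, std::strlen(packet), MPI::CHAR,
                                 dest, WORK_TAG);
        for (int dest = 1; dest < size; ++dest)
            MPI::COMM_WORLD.Send(packet, 0, MPI::CHAR, dest, DONE_TAG);
    } else {
        // Worker: "Waiting for work..." -- block in Probe until a message arrives.
        while (true) {
            MPI::Status status;
            MPI::COMM_WORLD.Probe(0, MPI::ANY_TAG, status);
            int count = status.Get_count(MPI::CHAR);
            std::vector<char> buf(count > 0 ? count : 1);
            MPI::COMM_WORLD.Recv(&buf[0], count, MPI::CHAR,
                                 0, status.Get_tag(), status);
            if (status.Get_tag() == DONE_TAG)
                break;  // assumed analogue of the zero-length "work(0)" messages in the trace
            std::cout << rank << ": received " << count << " bytes of work" << std::endl;
        }
    }
    MPI::Finalize();
    return 0;
}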

> mpiexec -machinefile mpd.hosts -l -n 5 lithorun
dev/LithoWare/Samples/FEM1D.xml Output.xml 
3: Worker on noclue 
3: Waiting for work... 
2: Worker on noclue 
2: Waiting for work... 
1: Worker on pad-lnx52.kla-tencor.com 
1: Waiting for work... 
0: Master on pad-lnx52.kla-tencor.com 
0: Loading dev/LithoWare/Samples/FEM1D.xml 
0: Found Factorial(FEM1D) 
4: Worker on question.kla-tencor.com 
4: Waiting for work... 
0: Loading Sample.plt 
0: Distributing Factorial(FEM1D) with 45 experiments over 4 processes with 3
work packets 
0: Sending work(1625) 


For a point of reference, here is a trace of the process when it works: 

> mpiexec -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml 
0: Master on pad-lnx52.kla-tencor.com 
0: Loading dev/LithoWare/Samples/FEM1D.xml 
0: Found Factorial(FEM1D) 
0: Loading Sample.plt 
0: Distributing Factorial(FEM1D) with 45 experiments over 4 processes with 3
work packets 
0: Sending work(1625) 
1: Worker on noclue 
1: Waiting for work... 
2: Worker on noclue 
2: Waiting for work... 
3: Worker on question.kla-tencor.com 
0: Sent work(1625) 
0: Sending work(1619) 
3: Waiting for work... 
4: Worker on question.kla-tencor.com 
4: Waiting for work... 
4: Received work(1625) 
4: Found Factorial(FEM1D) 
4: Loading Sample.plt 
0: Sent work(1619) 
0: Sending work(1622) 
4: Running Factorial(FEM1D) with 15 experiments 
3: Received work(1619) 
3: Found Factorial(FEM1D) 
3: Loading Sample.plt 
3: Running Factorial(FEM1D) with 15 experiments 
0: Sent work(1622) 
0: Waiting for results... 
2: Received work(1622) 
2: Found Factorial(FEM1D) 
2: Loading Sample.plt 
2: Running Factorial(FEM1D) with 15 experiments 
0: Received results(1672) 
0: Waiting for results... 
4: Factorial(FEM1D) complete (0.04175) 
4: Sending results(1672) 
4: Waiting for work... 
3: Factorial(FEM1D) complete (0.0420239) 
0: Received results(1652) 
0: Waiting for results... 
3: Sending results(1652) 
3: Waiting for work... 
2: Factorial(FEM1D) complete (0.0852771) 
0: Received results(1400) 
0: Factorial(FEM1D) complete (0.136751) 
2: Sending results(1400) 
2: Waiting for work... 
1: Received work(0) 
4: Received work(0) 
3: Received work(0) 
2: Received work(0) 
