[MPICH] Problem with -machinefile

Blankenship, David David.Blankenship at kla-tencor.com
Fri Apr 20 11:26:51 CDT 2007


I am having a problem running mpiexec with the -machinefile option
(Red Hat Enterprise Linux 4, 64-bit).

When I use the -machinefile option, my application hangs (deadlocks)
while attempting communication. The master is sending, the workers are
receiving, but nothing happens. Any thoughts?

I start my MPD ring as follows:

> mpdboot -n 3 -f mpd.hosts
> cat mpd.hosts
pad-lnx52:2
noclue:2
question:4
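
To make the host file above concrete, here is a minimal Python sketch of
how its `host:n` entries could map onto ranks. It assumes each entry
contributes n consecutive ranks in file order; that is an assumption about
the launcher, not documented behaviour, although it agrees with the
placement visible in the -machinefile trace in this message. The function
name `expand_machinefile` is hypothetical.

```python
def expand_machinefile(text, nprocs):
    """Map ranks 0..nprocs-1 to hosts.

    Assumption: each 'host:n' line offers n consecutive rank slots,
    taken in file order; a bare 'host' line offers one slot.
    """
    slots = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        host, _, count = line.partition(":")
        slots.extend([host] * (int(count) if count else 1))
    if nprocs > len(slots):
        raise ValueError("machinefile offers fewer slots than requested ranks")
    return slots[:nprocs]


# Contents of mpd.hosts as shown in the post.
MPD_HOSTS = """pad-lnx52:2
noclue:2
question:4
"""

placement = expand_machinefile(MPD_HOSTS, 5)
for rank, host in enumerate(placement):
    print(f"{rank}: {host}")
```

Under that assumption, ranks 0 and 1 land on pad-lnx52, ranks 2 and 3 on
noclue, and rank 4 on question.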

I can then run my application with the -host option or by letting the
MPD ring choose the systems using either of the following command lines:

> mpiexec -l -n 1 -host pad-lnx52 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml \
    : -n 2 -host noclue lithorun \
    : -n 2 -host question lithorun

> mpiexec -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml

But when I try to use the -machinefile option, my application hangs.
The master is sending and all of the workers are receiving, but no
communication actually takes place.

> mpiexec -machinefile mpd.hosts -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml

Here is a trace of the process when it hangs. You can see that the
workers have been started and are waiting for a work packet in an
MPI::COMM_WORLD.Probe call. The master has divided up the work and is
attempting to send the first packet using an MPI::COMM_WORLD.Send call.
Then nothing else happens. This only occurs when I use the -machinefile
option.

> mpiexec -machinefile mpd.hosts -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml
3: Worker on noclue
3: Waiting for work...
2: Worker on noclue
2: Waiting for work...
1: Worker on pad-lnx52.kla-tencor.com
1: Waiting for work...
0: Master on pad-lnx52.kla-tencor.com
0: Loading dev/LithoWare/Samples/FEM1D.xml
0: Found Factorial(FEM1D)
4: Worker on question.kla-tencor.com
4: Waiting for work...
0: Loading Sample.plt
0: Distributing Factorial(FEM1D) with 45 experiments over 4 processes with 3 work packets
0: Sending work(1625)


For reference, here is a trace of the process when it works:

> mpiexec -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml
0: Master on pad-lnx52.kla-tencor.com
0: Loading dev/LithoWare/Samples/FEM1D.xml
0: Found Factorial(FEM1D)
0: Loading Sample.plt
0: Distributing Factorial(FEM1D) with 45 experiments over 4 processes with 3 work packets
0: Sending work(1625)
1: Worker on noclue
1: Waiting for work...
2: Worker on noclue
2: Waiting for work...
3: Worker on question.kla-tencor.com
0: Sent work(1625)
0: Sending work(1619)
3: Waiting for work...
4: Worker on question.kla-tencor.com
4: Waiting for work...
4: Received work(1625)
4: Found Factorial(FEM1D)
4: Loading Sample.plt
0: Sent work(1619)
0: Sending work(1622)
4: Running Factorial(FEM1D) with 15 experiments
3: Received work(1619)
3: Found Factorial(FEM1D)
3: Loading Sample.plt
3: Running Factorial(FEM1D) with 15 experiments
0: Sent work(1622)
0: Waiting for results...
2: Received work(1622)
2: Found Factorial(FEM1D)
2: Loading Sample.plt
2: Running Factorial(FEM1D) with 15 experiments
0: Received results(1672)
0: Waiting for results...
4: Factorial(FEM1D) complete (0.04175)
4: Sending results(1672)
4: Waiting for work...
3: Factorial(FEM1D) complete (0.0420239)
0: Received results(1652)
0: Waiting for results...
3: Sending results(1652)
3: Waiting for work...
2: Factorial(FEM1D) complete (0.0852771)
0: Received results(1400)
0: Factorial(FEM1D) complete (0.136751)
2: Sending results(1400)
2: Waiting for work...
1: Received work(0)
4: Received work(0)
3: Received work(0)
2: Received work(0)
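
One way to compare the two runs is to pull the rank-to-host placement out
of the `-l` labelled output. The Python sketch below does that for the
"Master on" / "Worker on" lines copied from the two traces in this
message. The helper name `rank_hosts` is hypothetical, and the script only
reports where placement differs; it does not establish the cause of the
hang.

```python
import re

# Matches lines such as "3: Worker on noclue" from `mpiexec -l` output.
ROLE_LINE = re.compile(r"^(\d+): (?:Master|Worker) on (\S+)$")


def rank_hosts(trace):
    """Return {rank: host} from the Master/Worker lines of a labelled trace."""
    hosts = {}
    for line in trace.splitlines():
        match = ROLE_LINE.match(line.strip())
        if match:
            hosts[int(match.group(1))] = match.group(2)
    return hosts


# Placement lines from the hanging (-machinefile) run.
HANGING = """3: Worker on noclue
2: Worker on noclue
1: Worker on pad-lnx52.kla-tencor.com
0: Master on pad-lnx52.kla-tencor.com
4: Worker on question.kla-tencor.com
"""

# Placement lines from the working run (MPD chooses the hosts).
WORKING = """0: Master on pad-lnx52.kla-tencor.com
1: Worker on noclue
2: Worker on noclue
3: Worker on question.kla-tencor.com
4: Worker on question.kla-tencor.com
"""

hanging = rank_hosts(HANGING)
working = rank_hosts(WORKING)

# Ranks whose host differs between the two runs.
moved = {r: (hanging[r], working[r])
         for r in sorted(hanging) if hanging[r] != working.get(r)}
for rank, (bad, good) in moved.items():
    print(f"rank {rank}: {bad} (hangs) vs {good} (works)")
```

For these traces the differences are rank 1, which shares the master's
host only in the hanging run, and rank 3, which runs on noclue instead of
question.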
