[MPICH] Problem with -machinefile
Rajeev Thakur
thakur at mcs.anl.gov
Fri Apr 20 12:55:57 CDT 2007
Are you using the latest release, 1.0.5p4? It includes a fix for a
problem with machinefile that might help.
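A quick way to confirm the installed version (assuming a standard MPICH2
install with its bin directory on your PATH) is the mpich2version utility:
> mpich2version
If it reports something older than 1.0.5p4, the machinefile fix mentioned
above is likely missing.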
Rajeev
_____
From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Blankenship, David
Sent: Friday, April 20, 2007 11:27 AM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] Problem with -machinefile
I am having a problem running mpiexec with the -machinefile option (Red
Hat Enterprise Linux 4, 64-bit).
When I use the -machinefile option, my application hangs (deadlocks) while
attempting communication. The master is sending, the workers are receiving,
but nothing happens. Any thoughts?
I start my MPD ring as follows:
> mpdboot -n 3 -f mpd.hosts
> cat mpd.hosts
pad-lnx52:2
noclue:2
question:4
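As a sanity check (assuming the standard MPD tools that ship with MPICH2),
running mpdtrace after mpdboot should list all three hosts in the ring:
> mpdtrace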
I can then run my application with the -host option or by letting the MPD
ring choose the systems using either of the following command lines:
> mpiexec -l -n 1 -host pad-lnx52 lithorun dev/LithoWare/Samples/FEM1D.xml
Output.xml : -n 2 -host noclue lithorun : -n 2 -host question lithorun
> mpiexec -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml
But when I try to use the -machinefile option, my application hangs. The
master is sending and all of the workers are receiving, but no communication
actually takes place.
> mpiexec -machinefile mpd.hosts -l -n 5 lithorun
dev/LithoWare/Samples/FEM1D.xml Output.xml
Here is a trace of the process when it hangs. You can see that the workers
have started and are waiting for a work packet in an
MPI::COMM_WORLD.Probe call. The master has divided up the work and is
attempting to send the first packet with an MPI::COMM_WORLD.Send call. Then
nothing else happens. This occurs only when I use the -machinefile option.
(A minimal sketch of this Probe/Send pattern follows the trace.)
> mpiexec -machinefile mpd.hosts -l -n 5 lithorun
dev/LithoWare/Samples/FEM1D.xml Output.xml
3: Worker on noclue
3: Waiting for work...
2: Worker on noclue
2: Waiting for work...
1: Worker on pad-lnx52.kla-tencor.com
1: Waiting for work...
0: Master on pad-lnx52.kla-tencor.com
0: Loading dev/LithoWare/Samples/FEM1D.xml
0: Found Factorial(FEM1D)
4: Worker on question.kla-tencor.com
4: Waiting for work...
0: Loading Sample.plt
0: Distributing Factorial(FEM1D) with 45 experiments over 4 processes with 3
work packets
0: Sending work(1625)
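For context, the exchange that hangs is the classic probe-then-receive
master/worker pattern. Here is a minimal self-contained sketch of that
pattern using the MPI C++ bindings (the tag and packet size are
hypothetical; this is not the actual lithorun code):

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char* argv[])
{
    MPI::Init(argc, argv);
    const int rank = MPI::COMM_WORLD.Get_rank();
    const int size = MPI::COMM_WORLD.Get_size();
    const int WORK_TAG = 1;                 // hypothetical tag

    if (rank == 0) {
        // Master: send one dummy work packet to every worker.
        std::vector<char> packet(1625);     // size borrowed from the trace
        for (int w = 1; w < size; ++w) {
            std::printf("0: Sending work(%d)\n", (int) packet.size());
            MPI::COMM_WORLD.Send(&packet[0], (int) packet.size(), MPI::CHAR,
                                 w, WORK_TAG);
        }
    } else {
        // Worker: probe first so the receive buffer can be sized correctly,
        // then receive -- the same Probe/Send handshake described above.
        std::printf("%d: Waiting for work...\n", rank);
        MPI::Status status;
        MPI::COMM_WORLD.Probe(0, WORK_TAG, status);
        std::vector<char> packet(status.Get_count(MPI::CHAR));
        MPI::COMM_WORLD.Recv(&packet[0], (int) packet.size(), MPI::CHAR,
                             0, WORK_TAG);
        std::printf("%d: Received work(%d)\n", rank, (int) packet.size());
    }

    MPI::Finalize();
    return 0;
}

If a stripped-down program like this also hangs under -machinefile but runs
under -host, that points at the process-manager setup rather than the
application code.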
For a point of reference, here is a trace of the process when it works:
> mpiexec -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml
0: Master on pad-lnx52.kla-tencor.com
0: Loading dev/LithoWare/Samples/FEM1D.xml
0: Found Factorial(FEM1D)
0: Loading Sample.plt
0: Distributing Factorial(FEM1D) with 45 experiments over 4 processes with 3
work packets
0: Sending work(1625)
1: Worker on noclue
1: Waiting for work...
2: Worker on noclue
2: Waiting for work...
3: Worker on question.kla-tencor.com
0: Sent work(1625)
0: Sending work(1619)
3: Waiting for work...
4: Worker on question.kla-tencor.com
4: Waiting for work...
4: Received work(1625)
4: Found Factorial(FEM1D)
4: Loading Sample.plt
0: Sent work(1619)
0: Sending work(1622)
4: Running Factorial(FEM1D) with 15 experiments
3: Received work(1619)
3: Found Factorial(FEM1D)
3: Loading Sample.plt
3: Running Factorial(FEM1D) with 15 experiments
0: Sent work(1622)
0: Waiting for results...
2: Received work(1622)
2: Found Factorial(FEM1D)
2: Loading Sample.plt
2: Running Factorial(FEM1D) with 15 experiments
0: Received results(1672)
0: Waiting for results...
4: Factorial(FEM1D) complete (0.04175)
4: Sending results(1672)
4: Waiting for work...
3: Factorial(FEM1D) complete (0.0420239)
0: Received results(1652)
0: Waiting for results...
3: Sending results(1652)
3: Waiting for work...
2: Factorial(FEM1D) complete (0.0852771)
0: Received results(1400)
0: Factorial(FEM1D) complete (0.136751)
2: Sending results(1400)
2: Waiting for work...
1: Received work(0)
4: Received work(0)
3: Received work(0)
2: Received work(0)