[MPICH] Problem with -machinefile
Blankenship, David
David.Blankenship at kla-tencor.com
Fri Apr 20 16:38:51 CDT 2007
That does seem to fix the problem. Thank you.
David
________________________________
From: Rajeev Thakur [mailto:thakur at mcs.anl.gov]
Sent: Friday, April 20, 2007 12:56 PM
To: Blankenship, David; mpich-discuss at mcs.anl.gov
Subject: RE: [MPICH] Problem with -machinefile
Are you using the latest release, 1.0.5p4? There is a fix in there for a
problem with machinefile. It might help.
Rajeev
________________________________
From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Blankenship, David
Sent: Friday, April 20, 2007 11:27 AM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] Problem with -machinefile
I am having a problem running mpiexec with the -machinefile
option (Red Hat Enterprise 4, 64-bit).
When I use the -machinefile option, my application hangs
(deadlocks) while attempting communication. The master is sending, the
workers are receiving, but nothing happens. Any thoughts?
I start my MPD ring as follows:
> mpdboot -n 3 -f mpd.hosts
> cat mpd.hosts
pad-lnx52:2
noclue:2
question:4
I can then run my application with the -host option or by
letting the MPD ring choose the systems using either of the following
command lines:
> mpiexec -l -n 1 -host pad-lnx52 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml : -n 2 -host noclue lithorun : -n 2 -host question lithorun
> mpiexec -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml
But when I try to use the -machinefile option, my application
hangs. The master is sending and all of the workers are receiving, but no
communication appears to actually be happening.
> mpiexec -machinefile mpd.hosts -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml
Here is a trace of the process when it hangs. You can see that the
workers have been started and are waiting for a work packet in an
MPI::COMM_WORLD.Probe call. The master has divided up the work and is
attempting to send the first packet using an MPI::COMM_WORLD.Send call.
Then nothing else happens. This only occurs when I use the
-machinefile option.
> mpiexec -machinefile mpd.hosts -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml
3: Worker on noclue
3: Waiting for work...
2: Worker on noclue
2: Waiting for work...
1: Worker on pad-lnx52.kla-tencor.com
1: Waiting for work...
0: Master on pad-lnx52.kla-tencor.com
0: Loading dev/LithoWare/Samples/FEM1D.xml
0: Found Factorial(FEM1D)
4: Worker on question.kla-tencor.com
4: Waiting for work...
0: Loading Sample.plt
0: Distributing Factorial(FEM1D) with 45 experiments over 4 processes with 3 work packets
0: Sending work(1625)
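The placement in this trace (ranks 0 and 1 on pad-lnx52, ranks 2 and 3 on noclue, rank 4 on question) is what you would get by filling each host:count entry of mpd.hosts in file order. A minimal Python sketch of that reading follows. The function names are hypothetical, and the fill-in-order rule is only an assumption inferred from the trace, not MPD's actual placement code.

```python
def parse_machinefile(text):
    """Parse 'host' or 'host:count' lines into (host, count) pairs."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, _, count = line.partition(":")
        slots = int(count) if count else 1  # assumed default of one slot
        if slots < 1:
            raise ValueError(f"invalid slot count for {host!r}")
        entries.append((host, slots))
    return entries


def place_ranks(entries, nprocs):
    """Assumed rule: fill each host's slots in file order, wrapping if needed."""
    if not entries:
        raise ValueError("machinefile has no hosts")
    placement = []
    while len(placement) < nprocs:
        for host, slots in entries:
            for _ in range(slots):
                if len(placement) == nprocs:
                    return placement
                placement.append(host)
    return placement


# Contents of mpd.hosts as shown earlier in this message.
MPD_HOSTS = "pad-lnx52:2\nnoclue:2\nquestion:4\n"

for rank, host in enumerate(place_ranks(parse_machinefile(MPD_HOSTS), 5)):
    print(f"{rank}: {host}")
```

For this file and -n 5, the sketch puts the master and one worker together on pad-lnx52, which differs from the placement in the run without -machinefile.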
For a point of reference, here is a trace of the process when it
works:
> mpiexec -l -n 5 lithorun dev/LithoWare/Samples/FEM1D.xml Output.xml
0: Master on pad-lnx52.kla-tencor.com
0: Loading dev/LithoWare/Samples/FEM1D.xml
0: Found Factorial(FEM1D)
0: Loading Sample.plt
0: Distributing Factorial(FEM1D) with 45 experiments over 4 processes with 3 work packets
0: Sending work(1625)
1: Worker on noclue
1: Waiting for work...
2: Worker on noclue
2: Waiting for work...
3: Worker on question.kla-tencor.com
0: Sent work(1625)
0: Sending work(1619)
3: Waiting for work...
4: Worker on question.kla-tencor.com
4: Waiting for work...
4: Received work(1625)
4: Found Factorial(FEM1D)
4: Loading Sample.plt
0: Sent work(1619)
0: Sending work(1622)
4: Running Factorial(FEM1D) with 15 experiments
3: Received work(1619)
3: Found Factorial(FEM1D)
3: Loading Sample.plt
3: Running Factorial(FEM1D) with 15 experiments
0: Sent work(1622)
0: Waiting for results...
2: Received work(1622)
2: Found Factorial(FEM1D)
2: Loading Sample.plt
2: Running Factorial(FEM1D) with 15 experiments
0: Received results(1672)
0: Waiting for results...
4: Factorial(FEM1D) complete (0.04175)
4: Sending results(1672)
4: Waiting for work...
3: Factorial(FEM1D) complete (0.0420239)
0: Received results(1652)
0: Waiting for results...
3: Sending results(1652)
3: Waiting for work...
2: Factorial(FEM1D) complete (0.0852771)
0: Received results(1400)
0: Factorial(FEM1D) complete (0.136751)
2: Sending results(1400)
2: Waiting for work...
1: Received work(0)
4: Received work(0)
3: Received work(0)
2: Received work(0)
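Because -l prefixes every output line with its rank, interleaved traces like the two in this message can be reduced to the last message each rank printed, which makes a stalled state easier to see. A minimal Python sketch follows. The helper name is hypothetical, and the sample is an excerpt of the hanging trace, not the full output.

```python
import re

# Matches lines of the form "<rank>: <message>" as printed by mpiexec -l.
RANK_LINE = re.compile(r"^(\d+): (.*)$")


def last_message_per_rank(trace):
    """Return {rank: last message seen}, ignoring lines without a rank prefix."""
    last = {}
    for raw in trace.splitlines():
        match = RANK_LINE.match(raw.strip())
        if match:
            last[int(match.group(1))] = match.group(2)
    return last


# Excerpt of the hanging trace from the -machinefile run.
HUNG_TRACE = """\
3: Worker on noclue
3: Waiting for work...
2: Worker on noclue
2: Waiting for work...
1: Worker on pad-lnx52.kla-tencor.com
1: Waiting for work...
0: Master on pad-lnx52.kla-tencor.com
4: Worker on question.kla-tencor.com
4: Waiting for work...
0: Sending work(1625)
"""

for rank, message in sorted(last_message_per_rank(HUNG_TRACE).items()):
    print(f"{rank}: {message}")
```

On the excerpt, rank 0 ends at "Sending work(1625)" and ranks 1 through 4 end at "Waiting for work...", which is the stalled state described above.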