[MPICH] mpirun vs. mpiexec
Steve Young
chemadm at hamilton.edu
Tue Jun 5 10:22:28 CDT 2007
Hello,
I am trying to understand when to use mpirun and when to use mpiexec.
Currently, we have a cluster (x86_64, 38 nodes, 4 CPUs per node)
that is set up with mpich2-1.0.5 and runs a ring that root starts
across all the nodes.
We are also using PBS (torque-2.0.0p7) to manage the resources.
Our main problem is with the sander.MPI program from the Amber9
software. However, I have been able to reproduce the same behavior with
the simple bounce program.
First, when I use mpirun, the program runs as expected:
mpirun -np 8 sander.MPI -O......
However, with mpirun the program doesn't go to the proper nodes
that PBS allocates to the job. When I try to give mpirun the
-machinefile argument, mpirun complains, as it doesn't appear to
recognize that option.
So now I try mpiexec, which does understand -machinefile and
lets me pass it the proper nodes that PBS allocates
to the job.
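
For illustration, a minimal sketch of how the PBS node list could be handed to mpiexec from inside a job script. It assumes the job runs under PBS/Torque, which exports PBS_NODEFILE (a file with one line per allocated processor slot), and that mpiexec accepts -machinefile and -np as described above. The helper name build_mpiexec_cmd is hypothetical, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: build an mpiexec command line from the PBS node list.
# build_mpiexec_cmd is a hypothetical helper name, not part of MPICH or PBS.
build_mpiexec_cmd() {
    # PBS_NODEFILE is set by PBS/Torque inside a running job.
    nodefile="${PBS_NODEFILE:?PBS_NODEFILE is not set; run inside a PBS job}"
    # One non-empty line per allocated slot, so the count is the process count.
    nprocs=$(grep -c . "$nodefile")
    # Print the command (dry run) so it can be inspected before use.
    echo "mpiexec -machinefile $nodefile -np $nprocs $*"
}
```

Inside a job script, calling build_mpiexec_cmd ./bounce prints the command line, which can be checked against the nodes PBS reports before it is actually executed.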
However, it seems that mpich goes into some kind of waiting state and
never runs. All the CPUs get allocated on the proper nodes, but usage
on those CPUs stays at 0%.
When I just run tests of the bounce program from the command line, I get
the following results:
[clutest@herculaneum-clu bounce_herc]% mpirun -np 4 bounce
Number of processors = 4
msglen = 0 bytes, elapsed time = 0.1730 msec
msglen = 80 bytes, elapsed time = 0.0535 msec
msglen = 800 bytes, elapsed time = 1.8448 msec
msglen = 8000 bytes, elapsed time = 0.1773 msec
msglen = 80000 bytes, elapsed time = 0.7781 msec
msglen = 800000 bytes, elapsed time = 6.9986 msec
msglen = 8000000 bytes, elapsed time = 68.1831 msec
latency = 173.0 microseconds
bandwidth = 117.331127123461 MBytes/sec
(approximate values for mp_bsend/mp_brecv)
[clutest@herculaneum-clu bounce_herc]% mpiexec -machinefile machinefile -np 4 bounce
Number of processors = 4
[clutest@herculaneum-clu bounce_herc]%
The mpiexec trial just sits there indefinitely. Running strace on the
same command gives this after many pages of output:
select(7, [0 4 5 6], [], [], {1, 0}) = 0 (Timeout)
select(7, [0 4 5 6], [], [], {1, 0}) = 0 (Timeout)
This just continues over and over again and doesn't appear to stop. Any
ideas what it is waiting for?
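
For context, each select line in the trace is a one-second wait ({1, 0} is 1 second, 0 microseconds) on file descriptors 0, 4, 5, and 6, and the return value 0 means none of them became readable before the timeout. A rough sketch for counting these timeouts in a saved trace follows. It assumes the strace output was written to a file, for example with strace -f -o; the helper name and the trace file name are hypothetical.

```shell
#!/bin/sh
# Sketch only: count select() calls that timed out in a saved strace log.
# count_select_timeouts is a hypothetical helper name.
# A trace file could be produced with, for example:
#   strace -f -o mpiexec.trace mpiexec -machinefile machinefile -np 4 bounce
count_select_timeouts() {
    # Matches lines such as:
    #   select(7, [0 4 5 6], [], [], {1, 0}) = 0 (Timeout)
    # The pattern is not anchored, so PID-prefixed lines from -f also match.
    grep -c 'select(.*= 0 (Timeout)' "$1"
}
```

A count that keeps growing while no other system calls appear suggests the process is idle and polling, not computing.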
What is the best way to run mpich with PBS? Any information you could
share would be greatly appreciated. Thanks,
-Steve