[MPICH] RE: [MPICH] MPICH2 doesn't distribute jobs when running applications
Matthew Chambers
matthew.chambers at vanderbilt.edu
Mon May 14 12:00:38 CDT 2007
What are you using the machine file for if you're using MPD? Are you
booting and closing your MPD ring for each job?
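For reference, the usual lifecycle I'd expect is something like the following
(mpd.hosts just lists your two servers, one per line, and the commands are a
rough sketch, not copied from your setup):

    mpdboot -n 2 -f mpd.hosts    # start an mpd on the local host plus the second server
    mpiexec -n 16 a.out          # run the job against the ring
    mpdallexit                   # tear the ring down when you're done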
Quoted from the user manual section on MPD:
If you are using the mpd process manager, which is the default, then many
options are available. If you are using mpd, then before you run mpiexec, you
will have started, or will have had started for you, a ring of processes called
mpds (multi-purpose daemons), each running on its own host. It is likely, but
not necessary, that each mpd will be running on a separate host. You can find
out what this ring of hosts consists of by running the program mpdtrace. One of
the mpds will be running on the local machine, the one where you will run
mpiexec. The default placement of MPI processes, if one runs

mpiexec -n 10 a.out

is to start the first MPI process (rank 0) on the local machine and then to
distribute the rest around the mpd ring one at a time. If there are more
processes than mpds, then wraparound occurs. If there are more mpds than MPI
processes, then some mpds will not run MPI processes. Thus any number of
processes can be run on a ring of any size. While one is doing development, it
is handy to run only one mpd, on the local machine. Then all the MPI processes
will run locally as well.
The first modification to this default behavior is the -1 option to mpiexec
(not a great argument name). If -1 is specified, as in

mpiexec -1 -n 10 a.out

then the first application process will be started by the first mpd in the ring
after the local host. (If there is only one mpd in the ring, then this will be
on the local host.) This option is for use when a cluster of compute nodes has
a head node where commands like mpiexec are run but not application processes.
If an mpd is started with the --ncpus option, then when it is its turn to start
a process, it will start several application processes rather than just one
before handing off the task of starting more processes to the next mpd in the
ring. For example, if the mpd is started with

mpd --ncpus=4

then it will start as many as four application processes, with consecutive
ranks, when it is its turn to start processes. This option is for use in
clusters of SMPs, when the user would like consecutive ranks to appear on the
same machine. (In the default case, the same number of processes might well run
on the machine, but their ranks would be different.)
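Relating the manual text above to your two 8-core servers, my reading is: with
plain mpds, rank 0 starts on the local server and the remaining ranks alternate
between the two machines; with --ncpus=8, ranks 0-7 would land on the first
server and ranks 8-15 on the second. Roughly like this (server1/server2 are
placeholders, and I'm starting the mpds by hand here rather than with mpdboot):

    # on server1, the machine where mpiexec will be run
    mpd --ncpus=8 &
    # on server2, join the ring; get server1's listening port from "mpdtrace -l" run on server1
    mpd -h server1 -p <port> --ncpus=8 &
    # with 16 processes, ranks 0-7 should land on server1 and ranks 8-15 on server2
    mpiexec -n 16 a.out
    # without --ncpus, the same command should alternate: ranks 0, 2, 4, ... on server1 and 1, 3, 5, ... on server2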
It seems like you should only see the behavior you are seeing if you did
start your MPDs with the --ncpus option. Otherwise, it should rotate
between machines round-robin style. But I don't understand why you've got a
machine file if you're using an MPD ring.
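If it were me, I'd first confirm what the ring actually looks like and try
leaving the machine file out entirely, something like:

    mpdtrace              # should print both server names; if only one shows up, the ring never spanned both
    mpiexec -n 16 a.out   # no machine file; let the mpd ring handle placement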
-Matt Chambers
_____
From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Christian M. Probst
Sent: Saturday, May 12, 2007 11:50 PM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] MPICH2 doesn't distribute jobs when running applications
Hi, folks.
I am running MPICH2 on two servers with 8 cores each... I have
configured both servers properly and passed all the troubleshooting steps
provided in the installation guide.
But when I try to run my applications of interest (both bioinformatics tools,
mpiClustal and mpiHMMer), the following scenario appears:
If I run on just one server, the job is distributed successfully across the 8
cores...
If I run using both servers in my machine file, but with -np 8, all jobs go to
one server and I get the same runtime as the previous run (OK, expected)...
But no job is distributed to the second machine...
If I run using both servers in my machine file, with any -np from 9 to 16, all
processes are still distributed to just one server (mpdtrace -l shows several
0s at the beginning of its lines) and I have no results after hours of
waiting...
As I said before, if I run the tests, they distribute properly across both
servers, no matter what -np was set to.
Any clues?
Thanks in advance.
Christian