[MPICH] RE: [MPICH] MPICH2 doesn't distribute jobs when running applications

Matthew Chambers matthew.chambers at vanderbilt.edu
Mon May 14 12:00:38 CDT 2007


What are you using the machine file for if you’re using MPD?  Are you
booting and closing your MPD ring for each job?
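(For reference, a typical sequence on a two-host setup would look something
like the following, assuming a hosts file named mpd.hosts that lists both
servers; the file name is just an example:

mpdboot -n 2 -f mpd.hosts
mpdtrace -l
mpiexec -n 16 a.out
mpdallexit

If mpdtrace only reports one of the two machines, the ring itself is the
problem rather than mpiexec.)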

 

Quoted from the user manual section on MPD:

If you are using the mpd process manager, which is the default, then many
options are available. If you are using mpd, then before you run mpiexec,
you will have started, or will have had started for you, a ring of processes
called mpd’s (multi-purpose daemons), each running on its own host. It is
likely, but not necessary, that each mpd will be running on a separate host.
You can find out what this ring of hosts consists of by running the program
mpdtrace. One of the mpd’s will be running on the “local” machine, the one
where you will run mpiexec. The default placement of MPI processes, if one runs

mpiexec -n 10 a.out

is to start the first MPI process (rank 0) on the local machine and then to
distribute the rest around the mpd ring one at a time. If there are more
processes than mpd’s, then wraparound occurs. If there are more mpd’s than
MPI processes, then some mpd’s will not run MPI processes. Thus any number
of processes can be run on a ring of any size. While one is doing
development, it is handy to run only one mpd, on the local machine. Then
all the MPI processes will run locally as well.

The first modification to this default behavior is the -1 option to mpiexec
(not a great argument name). If -1 is specified, as in

mpiexec -1 -n 10 a.out

then the first application process will be started by the first mpd in the
ring after the local host. (If there is only one mpd in the ring, then this
will be on the local host.) This option is for use when a cluster of compute
nodes has a “head node” where commands like mpiexec are run but not
application processes.

If an mpd is started with the --ncpus option, then when it is its turn to
start a process, it will start several application processes rather than
just one before handing off the task of starting more processes to the next
mpd in the ring. For example, if the mpd is started with

mpd --ncpus=4

then it will start as many as four application processes, with consecutive
ranks, when it is its turn to start processes. This option is for use in
clusters of SMP’s, when the user would like consecutive ranks to appear on
the same machine. (In the default case, the same number of processes might
well run on the machine, but their ranks would be different.)
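To make that concrete for your setup: with one mpd on each of your two
servers and no --ncpus, something like mpiexec -n 16 a.out should alternate
ranks between the two hosts, while mpd --ncpus=8 on each server should put
ranks 0-7 on the first machine and 8-15 on the second. A quick sanity check
is to push a trivial non-MPI command through the same ring, for example:

mpiexec -l -n 16 hostname

Both hostnames should appear in the output (the -l flag just labels each
line with the rank that produced it; drop it if your mpiexec doesn't accept
it).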

 

It seems like you should only see the behavior you are seeing if you did
start your MPDs with the --ncpus option.  Otherwise, it should rotate
between machines round-robin style.  But I don’t understand why you’ve got a
machine file if you’re using an MPD ring.
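If you really do want the machine file to control placement, I believe the
mpd version of mpiexec will take it directly, along the lines of:

mpiexec -machinefile machines -n 16 a.out

where "machines" (just a placeholder name) lists one host per line,
optionally as host:ncpus. Otherwise, boot the ring from that file with
mpdboot and let the default round-robin placement do the work.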

 

-Matt Chambers

 

  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Christian M. Probst
Sent: Saturday, May 12, 2007 11:50 PM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] MPICH2 doesn't distribute jobs when running applications

 

Hi, folks.

 

I am running MPICH2 on two servers with 8 cores each... I have configured
both servers properly and passed all the troubleshooting steps provided in
the installation guide.

 

But when I try to run my applications of interest (both bioinformatics
tools: mpiClustal and mpiHMMer), the following scenario appears:

 

If I run on just one server, the job is distributed successfully across the
8 cores...

If I run using both servers in my machine file, but with -np 8, all
processes are distributed to one server, and I get the same runtime as the
previous run (OK, expected)... But nothing is distributed to the second
machine...

 

If I run using both servers in my machine file, with any -np from 9 to 16,
all processes are still distributed to just one server (mpdtrace -l shows
several 0s at the beginning of its lines) and I get no results after hours
of waiting...

 

As I said before, if I run the tests, they are distributed properly across
both servers, no matter what -np is set to.

 

Any clues?

 

Thanks in advance.

Christian

 


