[MPICH] MPICH2 startup w/ PBS

Darius Buntinas buntinas at mcs.anl.gov
Tue Apr 4 10:57:38 CDT 2006


What's "screaming"?  mpdboot or mpiexec?

Try:
   mpdboot -n ${NP} -f ${PBS_NODEFILE}
   mpiexec -n ${NP} ./${EXE}

You don't need a machinefile with mpiexec unless you want to execute on a 
subset of the nodes in your mpd ring, or you want control of the 
process-to-node mapping.

I think that mpdboot should only start one mpd per node, even if the node 
is specified more than one time in the file (you really only ever need one 
mpd per node).  If mpdboot is having trouble because you're asking for 
${NP} mpds but there are only ${NP}/2 unique nodes in the file, you can 
try something like:

   NUM_NODES=`sort -u ${PBS_NODEFILE} | wc -l | awk '{print $1}'`
   mpdboot -n ${NUM_NODES} -f ${PBS_NODEFILE}
   mpiexec -n ${NP} ./${EXE}

I'm not a PBS expert, so there might be an easier way to do that, but give 
it a try.
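
Putting the pieces from this thread together, a job script might look 
roughly like the sketch below. It is only a sketch: it assumes 
${PBS_NODEFILE} lists one line per processor slot (so the line count is 
the process count), that ${EXE} is set elsewhere in the script, and the 
helper names count_lines and count_unique_hosts are made up for 
illustration.

```shell
#!/bin/sh
# Sketch only: derive both counts from the PBS node file so that neither
# the host list nor the mpd count has to be edited by hand.
# count_lines and count_unique_hosts are hypothetical helper names.

count_lines() {
    # Total entries in the file (assumed: one per processor slot).
    wc -l < "$1" | tr -d ' '
}

count_unique_hosts() {
    # Distinct hosts, i.e. the number of mpds to start.
    sort -u "$1" | wc -l | tr -d ' '
}

# Only attempt the launch inside a PBS job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    NP=$(count_lines "${PBS_NODEFILE}")
    NUM_NODES=$(count_unique_hosts "${PBS_NODEFILE}")

    mpdboot -n "${NUM_NODES}" -f "${PBS_NODEFILE}"
    mpiexec -n "${NP}" ./"${EXE}"
    mpdallexit
fi
```

With ppn=2 on 8 nodes, this would ask mpdboot for 8 mpds and mpiexec for 
16 processes, which is the combination the error message was complaining 
about.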

If you are concerned about your process-to-node mapping and want to check 
what it is, try:
   mpiexec -l -n ${NP} hostname
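
If the list is long, the output can be tallied per host. The sketch below 
assumes each line produced by -l looks like "rank: hostname" (rank label 
first, hostname as the second field); ranks_per_host is a made-up helper 
name, not part of MPICH2.

```shell
# Hypothetical helper: reads "rank: hostname" lines on stdin and prints
# one "count hostname" line per host, sorted by hostname.
ranks_per_host() {
    awk '{ n[$2]++ } END { for (h in n) print n[h], h }' | sort -k2
}

# Usage inside a job (needs a running mpd ring):
#   mpiexec -l -n "${NP}" hostname | ranks_per_host
```

Each host should show up with as many ranks as it has slots in the node 
file; a host that is missing, or one that carries every rank, points to a 
mapping problem.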

-d

On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:

> No joy. It always screams about not having enough hosts:
>
> totalnum=16  numhosts=8
> there are not enough hosts on which to start all processes
>
> I think this is because we have two processors per node (ppn=2).
> Consequently PBS_NODEFILE has the hosts repeated. I've
> tried using --totalnum=${NP} --ncpus=2 and this didn't work
> either (same error message).
>
> Thanks!
>
> Jeff
>
>> 
>> How about the following 3 lines in your script:
>> 
>> mpdboot -n ${NP} -f ${PBS_NODEFILE}
>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
>> mpdallexit
>> 
>> Wei-keng
>> 
>> 
>> On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:
>> 
>>> Good morning,
>>>
>>>  I hate to bother everyone early in the morning, but I'm
>>> looking for some advice on MPICH2 startup. I've been starting
>>> an mpd on each node in the cluster via,
>>> 
>>> mpdboot -n 25 -f /home/jlayton/mpd.hosts
>>> 
>>> where the file mpd.hosts contains a list of all possible hosts.
>>> So I'm basically starting mpd on every node. Then I run the
>>> code using mpiexec
>>> 
>>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
>>> 
>>> and run mpdallexit after the code is finished to stop all of the
>>> mpds. Notice that I'm using PBS for queuing/scheduling.
>>>  This is something of a pain, because we lose nodes for
>>> various projects or training so I'm constantly having to go into
>>> the list of hosts and edit it. I also have to change the count on
>>> the mpdboot command.
>>>  Is there a better way to start up MPICH2 codes using PBS?
>>> 
>>> Thanks!
>>> 
>>> Jeff
>>> 
>> 
>
>



