[MPICH] MPICH2 startup w/ PBS
Darius Buntinas
buntinas at mcs.anl.gov
Tue Apr 4 10:57:38 CDT 2006
What's "screaming"? mpdboot or mpiexec?
Try:
mpdboot -n ${NP} -f ${PBS_NODEFILE}
mpiexec -n ${NP} ./${EXE}
You don't need a machinefile with mpiexec unless you want to execute on a
subset of the nodes in your mpd ring, or you want control of the
process-to-node mapping.
I think that mpdboot should only start one mpd per node, even if the node
is specified more than one time in the file (you really only ever need one
mpd per node). If mpdboot is having trouble because you're asking for
${NP} mpds but there are only ${NP}/2 unique nodes in the file, you can
try something like:
NUM_NODES=`sort -u ${PBS_NODEFILE} | wc -l | awk '{print $1}'`
mpdboot -n ${NUM_NODES} -f ${PBS_NODEFILE}
mpiexec -n ${NP} ./${EXE}
I'm not a PBS expert, so there might be an easier way to do that, but give
it a try.
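As a rough sketch, the node-counting step can be wrapped in a small shell function so it can be tried on a sample file outside of a PBS job. The function name count_unique_nodes is made up for illustration, and the sketch assumes the node file lists one hostname per line, repeated once per processor slot. The commented job-script lines reuse only the commands already shown in this thread.

```shell
#!/bin/sh
# Sketch: count the distinct hosts in a PBS-style node file.
# Assumption: one hostname per line, repeated once per processor (ppn).
# count_unique_nodes is a hypothetical helper name, not part of MPICH2 or PBS.
count_unique_nodes() {
    # sort -u collapses repeated hostnames; grep -c . counts non-empty lines
    sort -u "$1" | grep -c .
}

# Possible use inside a PBS job script (not executed here):
#   NUM_NODES=$(count_unique_nodes "${PBS_NODEFILE}")
#   mpdboot -n "${NUM_NODES}" -f "${PBS_NODEFILE}"
#   mpiexec -n "${NP}" ./"${EXE}"
#   mpdallexit
```

With ppn=2 and 8 nodes, the node file would hold 16 lines and the function would report 8, which is the count mpdboot needs.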
If you are concerned about your process-to-node mapping and want to check
what it is, try:
mpiexec -l -n ${NP} hostname
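To make that output easier to read on a larger job, something like the following could tally how many ranks landed on each host. This is only a sketch: it assumes each labeled line looks like "RANK: HOSTNAME" (the rank label added by -l, then the hostname), and tally_ranks_per_host is a made-up helper name.

```shell
#!/bin/sh
# Sketch: summarize rank placement from labeled hostname output.
# Assumption: each input line has the form "RANK: HOSTNAME".
# tally_ranks_per_host is a hypothetical helper name.
tally_ranks_per_host() {
    # Field 2 is the hostname; count occurrences, then sort for stable output
    awk '{ count[$2]++ } END { for (h in count) print h, count[h] }' | sort
}

# Possible use (not executed here):
#   mpiexec -l -n "${NP}" hostname | tally_ranks_per_host
```

Each output line is a hostname followed by the number of ranks placed on it, so an uneven mapping shows up at a glance.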
-d
On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:
> No joy. It always screams about not having enough hosts:
>
> totalnum=16 numhosts=8
> there are not enough hosts on which to start all processes
>
> I think this is because we have two processors per node (ppn=2).
> Consequently PBS_NODEFILE has the hosts repeated. I've
> tried using --totalnum=${NP} --ncpus=2 and this didn't work
> either (same error message).
>
> Thanks!
>
> Jeff
>
>>
>> How about the following 3 lines in your script:
>>
>> mpdboot -n ${NP} -f ${PBS_NODEFILE}
>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
>> mpdallexit
>>
>> Wei-keng
>>
>>
>> On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:
>>
>>> Good morning,
>>>
>>> I hate to bother everyone early in the morning, but I'm
>>> looking for some advice on MPICH2 startup. I've been starting
>>> an mpd on each node in the cluster via,
>>>
>>> mpdboot -n 25 -f /home/jlayton/mpd.hosts
>>>
>>> where the file mpd.hosts contains a list of all possible hosts.
>>> So I'm basically starting mpd on every node. Then I run the
>>> code using mpiexec
>>>
>>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
>>>
>>> and run mpdallexit after the code is finished to stop all of the
>>> mpds. Notice that I'm using PBS for queuing/scheduling.
>>> This is something of a pain, because we lose nodes for
>>> various projects or training so I'm constantly having to go into
>>> the list of hosts and edit it. I also have to change the count on
>>> the mpdboot command.
>>> Is there a better way to start up MPICH2 codes using PBS?
>>>
>>> Thanks!
>>>
>>> Jeff
>>>
>>
>
>