[MPICH] MPICH2 startup w/ PBS

Jeffrey B. Layton laytonjb at charter.net
Tue Apr 4 11:10:23 CDT 2006


Darius Buntinas wrote:
>
> What's "screaming"?  mpdboot or mpiexec?

I'm pretty sure it's mpdboot.


I'll try the method below to see what happens. I'm also going
to try Pete's mpiexec based on some recommendations to see
if that reduces the pain.

Thanks!

Jeff

>
> Try:
>   mpdboot -n ${NP} -f ${PBS_NODEFILE}
>   mpiexec -n ${NP} ./${EXE}
>
> You don't need a machinefile with mpiexec unless you want to execute 
> on a subset of the nodes in your mpd ring, or you want control of the 
> process-to-node mapping.
>
> I think that mpdboot should only start one mpd per node, even if the 
> node is specified more than one time in the file (you really only ever 
> need one mpd per node).  If mpdboot is having trouble because you're 
> asking for ${NP} mpds but there are only ${NP}/2 unique nodes in the 
> file, you can try something like:
>
>   NUM_NODES=`sort -u ${PBS_NODEFILE} | wc -l | awk '{print $1}'`
>   mpdboot -n ${NUM_NODES} -f ${PBS_NODEFILE}
>   mpiexec -n ${NP} ./${EXE}
>
> I'm not a PBS expert, so there might be an easier way to do that, but 
> give it a try.
>
> If you are concerned about your process-to-node mapping and want to 
> check what it is, try:
>   mpiexec -l -n ${NP} hostname
>
> -d
>
> On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:
>
>> No joy. It always screams about not having enough hosts:
>>
>> totalnum=16  numhosts=8
>> there are not enough hosts on which to start all processes
>>
>> I think this is because we have two processors per node (ppn=2).
>> Consequently PBS_NODEFILE has the hosts repeated. I've
>> tried using --totalnum=${NP} --ncpus=2 and this didn't work
>> either (same error message).
>>
>> Thanks!
>>
>> Jeff
>>
>>>
>>> How about the following 3 lines in your script:
>>>
>>> mpdboot -n ${NP} -f ${PBS_NODEFILE}
>>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
>>> mpdallexit
>>>
>>> Wei-keng
>>>
>>>
>>> On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:
>>>
>>>> Good morning,
>>>>
>>>>  I hate to bother everyone early in the morning, but I'm
>>>> looking for some advice on MPICH2 startup. I've been starting
>>>> an mpd on each node in the cluster via,
>>>>
>>>> mpdboot -n 25 -f /home/jlayton/mpd.hosts
>>>>
>>>> where the file mpd.hosts contains a list of all possible hosts.
>>>> So I'm basically starting mpd on every node. Then I run the
>>>> code using mpiexec
>>>>
>>>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
>>>>
>>>> and run mpdallexit after the code is finished to stop all of the
>>>> mpds. Notice that I'm using PBS for queuing/scheduling.
>>>>  This is something of a pain, because we lose nodes for
>>>> various projects or training so I'm constantly having to go into
>>>> the list of hosts and edit it. I also have to change the count on
>>>> the mpdboot command.
>>>>  Is there a better way to start up MPICH2 codes using PBS?
>>>>
>>>> Thanks!
>>>>
>>>> Jeff
>>>>
>>>
>>
>>
>
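
For reference, the suggestions in this thread can be combined into a
single PBS job script body. This is only a minimal sketch, not a tested
recipe: the helper names count_unique_nodes and count_slots are
hypothetical, and deriving NP from the line count of PBS_NODEFILE is an
assumption (the scripts quoted above set NP elsewhere). EXE is assumed to
be set by the surrounding job script. The MPI commands run only inside a
PBS job where mpdboot is on the PATH.

```sh
#!/bin/sh
# Hypothetical helper: number of distinct hostnames in a node file.
count_unique_nodes() {
    sort -u "$1" | wc -l | awk '{print $1}'
}

# Hypothetical helper: total number of lines (one per requested CPU).
count_slots() {
    wc -l < "$1" | awk '{print $1}'
}

# Only touch mpd when running under PBS with the MPICH2 mpd tools present.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpdboot >/dev/null 2>&1; then
    NUM_NODES=$(count_unique_nodes "${PBS_NODEFILE}")
    NP=$(count_slots "${PBS_NODEFILE}")   # assumption: one process per line

    # One mpd per unique node, then as many processes as requested CPUs.
    mpdboot -n "${NUM_NODES}" -f "${PBS_NODEFILE}"
    mpiexec -n "${NP}" ./"${EXE}"
    mpdallexit
fi
```

With the node file described above (8 hosts, each listed twice because of
ppn=2), NUM_NODES would come out as 8 and NP as 16, which matches the
totalnum=16 numhosts=8 values in the error message.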



