[MPICH] MPICH2 startup w/ PBS

Jeffrey B. Layton laytonjb at charter.net
Tue Apr 4 13:41:06 CDT 2006


The machine I'm using has multiple ethernet interfaces. By using
the -machinefile argument I can control which interface I run on.
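
As a rough sketch only (not the exact script I run), the pieces in this
thread could be combined in a PBS job script along these lines. It assumes
PBS sets PBS_NODEFILE and that NP and EXE are set earlier in the script.
The function names and the "-eth1" interface-hostname suffix are
hypothetical and site-specific. The "host:count ifhn=name" machinefile
entry is my understanding of the mpd machinefile format; check the
documentation for your MPICH2 version before relying on it.

```shell
#!/bin/sh
# Sketch of a PBS job-script fragment. PBS_NODEFILE, NP and EXE are
# taken from the thread; everything else here is illustrative.

# Count unique hosts in a node file (PBS repeats each host ppn times).
count_unique_nodes() {
    sort -u "$1" | wc -l | awk '{print $1}'
}

# Collapse repeated hosts into "host:count ifhn=host<suffix>" lines.
# ASSUMPTION: the interface to use resolves as <host><suffix>, e.g.
# node01-eth1. The suffix is hypothetical; adjust it for your site.
make_machinefile() {
    nodefile=$1
    outfile=$2
    suffix=${3:--eth1}
    sort "$nodefile" | uniq -c |
        awk -v sfx="$suffix" '{printf "%s:%s ifhn=%s%s\n", $2, $1, $2, sfx}' \
        > "$outfile"
}

# Start one mpd per unique node, run the code, then shut the ring down.
run_job() {
    num_nodes=$(count_unique_nodes "${PBS_NODEFILE}")
    mf=./machinefile.$$
    make_machinefile "${PBS_NODEFILE}" "$mf"
    mpdboot -n "${num_nodes}" -f "${PBS_NODEFILE}"
    mpiexec -machinefile "$mf" -n "${NP}" ./"${EXE}"
    mpdallexit
    rm -f "$mf"
}
```

Calling run_job at the end of the job script would then replace the
hand-edited host list and the hard-coded mpdboot count.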

> Is there any particular reason you use the -machinefile argument to
> mpiexec, rather than just let the mpd's start the processes in
> round-robin fashion?  
>
> From: "Jeffrey B. Layton" <laytonjb at charter.net>
> Subject: Re: [MPICH] MPICH2 startup w/ PBS
> Date: Tue, 04 Apr 2006 12:10:23 -0400
>
>   
>> Darius Buntinas wrote:
>>     
>>> What's "screaming"?  mpdboot or mpiexec?
>>>       
>> I'm pretty sure it's mpdboot.
>>
>>
>> I'll try the method below to see what happens. I'm also going
>> to try Pete's mpiexec based on some recommendations to see
>> if that reduces the pain.
>>
>> Thanks!
>>
>> Jeff
>>
>>     
>>> Try:
>>>   mpdboot -n ${NP} -f ${PBS_NODEFILE}
>>>   mpiexec -n ${NP} ./${EXE}
>>>
>>> You don't need a machinefile with mpiexec unless you want to execute 
>>> on a subset of the nodes in your mpd ring, or you want control of the 
>>> process-to-node mapping.
>>>
>>> I think that mpdboot should only start one mpd per node, even if the 
>>> node is specified more than one time in the file (you really only ever 
>>> need one mpd per node).  If mpdboot is having trouble because you're 
>>> asking for ${NP} mpds but there are only ${NP}/2 unique nodes in the 
>>> file, you can try something like:
>>>
>>>   NUM_NODES=`sort -u ${PBS_NODEFILE} | wc -l | awk '{print $1}'`
>>>   mpdboot -n ${NUM_NODES} -f ${PBS_NODEFILE}
>>>   mpiexec -n ${NP} ./${EXE}
>>>
>>> I'm not a PBS expert, so there might be an easier way to do that, but 
>>> give it a try.
>>>
>>> If you are concerned about your process-to-node mapping and want to 
>>> check what it is, try:
>>>   mpiexec -l -n ${NP} hostname
>>>
>>> -d
>>>
>>> On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:
>>>
>>>       
>>>> No joy. It always screams about not having enough hosts:
>>>>
>>>> totalnum=16  numhosts=8
>>>> there are not enough hosts on which to start all processes
>>>>
>>>> I think this is because we have two processors per node (ppn=2).
>>>> Consequently PBS_NODEFILE has the hosts repeated. I've
>>>> tried using --totalnum=${NP} --ncpus=2 and this didn't work
>>>> either (same error message).
>>>>
>>>> Thanks!
>>>>
>>>> Jeff
>>>>
>>>>         
>>>>> How about the following 3 lines in your script:
>>>>>
>>>>> mpdboot -n ${NP} -f ${PBS_NODEFILE}
>>>>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
>>>>> mpdallexit
>>>>>
>>>>> Wei-keng
>>>>>
>>>>>
>>>>> On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:
>>>>>
>>>>>           
>>>>>> Good morning,
>>>>>>
>>>>>>  I hate to bother everyone early in the morning, but I'm
>>>>>> looking for some advice on MPICH2 startup. I've been starting
>>>>>> an mpd on each node in the cluster via,
>>>>>>
>>>>>> mpdboot -n 25 -f /home/jlayton/mpd.hosts
>>>>>>
>>>>>> where the file mpd.hosts contains a list of all possible hosts.
>>>>>> So I'm basically starting mpd on every node. Then I run the
>>>>>> code using mpiexec
>>>>>>
>>>>>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
>>>>>>
>>>>>> and run mpdallexit after the code is finished to stop all of the
>>>>>> mpds. Notice that I'm using PBS for queuing/scheduling.
>>>>>>  This is something of a pain, because we lose nodes for
>>>>>> various projects or training so I'm constantly having to go into
>>>>>> the list of hosts and edit it. I also have to change the count on
>>>>>> the mpdboot command.
>>>>>>  Is there a better way to start up MPICH2 codes using PBS?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>>             
>>>>         
>
>   



