[MPICH] MPICH2 startup w/ PBS

Rusty Lusk lusk at mcs.anl.gov
Tue Apr 4 11:27:44 CDT 2006


Is there any particular reason you use the -machinefile argument to
mpiexec, rather than just let the mpd's start the processes in
round-robin fashion?  
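
One way to see what the round-robin placement actually does is to tally the output of the "mpiexec -l -n ${NP} hostname" check quoted below. The following is only a sketch: the helper name tally_ranks_per_host is made up for illustration, and it assumes that -l prefixes each output line with the rank followed by a colon and a space (for example "0: node01").

```shell
#!/bin/sh
# Hypothetical helper: tally how many ranks landed on each host.
# Reads lines of the ASSUMED form "<rank>: <hostname>" on stdin and
# prints "<hostname> <count>", sorted by hostname.
tally_ranks_per_host() {
    awk -F': ' '{ count[$2]++ } END { for (h in count) print h, count[h] }' | sort
}

# Inside a job with a running mpd ring, usage would look like:
#   mpiexec -l -n "${NP}" hostname | tally_ranks_per_host
```

With ppn=2, an even placement would show a count of 2 next to every host.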

From: "Jeffrey B. Layton" <laytonjb at charter.net>
Subject: Re: [MPICH] MPICH2 startup w/ PBS
Date: Tue, 04 Apr 2006 12:10:23 -0400

> Darius Buntinas wrote:
> >
> > What's "screaming"?  mpdboot or mpiexec?
> 
> I'm pretty sure it's mpdboot.
> 
> 
> I'll try the method below to see what happens. I'm also going
> to try Pete's mpiexec based on some recommendations to see
> if that reduces the pain.
> 
> Thanks!
> 
> Jeff
> 
> >
> > Try:
> >   mpdboot -n ${NP} -f ${PBS_NODEFILE}
> >   mpiexec -n ${NP} ./${EXE}
> >
> > You don't need a machinefile with mpiexec unless you want to execute 
> > on a subset of the nodes in your mpd ring, or you want control of the 
> > process-to-node mapping.
> >
> > I think that mpdboot should only start one mpd per node, even if the 
> > node is specified more than one time in the file (you really only ever 
> > need one mpd per node).  If mpdboot is having trouble because you're 
> > asking for ${NP} mpds but there are only ${NP}/2 unique nodes in the 
> > file, you can try something like:
> >
> >   NUM_NODES=`sort -u ${PBS_NODEFILE} | wc -l | awk '{print $1}'`
> >   mpdboot -n ${NUM_NODES} -f ${PBS_NODEFILE}
> >   mpiexec -n ${NP} ./${EXE}
> >
> > I'm not a PBS expert, so there might be an easier way to do that, but 
> > give it a try.
> >
> > If you are concerned about your process-to-node mapping and want to 
> > check what it is, try:
> >   mpiexec -l -n ${NP} hostname
> >
> > -d
> >
> > On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:
> >
> >> No joy. It always screams about not having enough hosts:
> >>
> >> totalnum=16  numhosts=8
> >> there are not enough hosts on which to start all processes
> >>
> >> I think this is because we have two processors per node (ppn=2).
> >> Consequently PBS_NODEFILE has the hosts repeated. I've
> >> tried using --totalnum=${NP} --ncpus=2 and this didn't work
> >> either (same error message).
> >>
> >> Thanks!
> >>
> >> Jeff
> >>
> >>>
> >>> How about the following 3 lines in your script:
> >>>
> >>> mpdboot -n ${NP} -f ${PBS_NODEFILE}
> >>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
> >>> mpdallexit
> >>>
> >>> Wei-keng
> >>>
> >>>
> >>> On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:
> >>>
> >>>> Good morning,
> >>>>
> >>>>  I hate to bother everyone early in the morning, but I'm
> >>>> looking for some advice on MPICH2 startup. I've been starting
> >>>> an mpd on each node in the cluster via,
> >>>>
> >>>> mpdboot -n 25 -f /home/jlayton/mpd.hosts
> >>>>
> >>>> where the file mpd.hosts contains a list of all possible hosts.
> >>>> So I'm basically starting mpd on every node. Then I run the
> >>>> code using mpiexec
> >>>>
> >>>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
> >>>>
> >>>> and run mpdallexit after the code is finished to stop all of the
> >>>> mpds. Notice that I'm using PBS for queuing/scheduling.
> >>>>  This is something of a pain, because we lose nodes for
> >>>> various projects or training so I'm constantly having to go into
> >>>> the list of hosts and edit it. I also have to change the count on
> >>>> the mpdboot command.
> >>>>  Is there a better way to start up MPICH2 codes using PBS?
> >>>>
> >>>> Thanks!
> >>>>
> >>>> Jeff
> >>>>
> >>>
> >>
> >>
> >
> 
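Putting the suggestions in this thread together, a PBS job script might look roughly like the sketch below. It is not a tested recipe. The function name count_unique_nodes is invented for illustration, and deriving NP from the number of lines in PBS_NODEFILE is an assumption (the thread treats NP as already set). EXE is the executable name used in the thread and must be set by the job. The MPI commands are guarded so that they only run inside a PBS job where mpdboot is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: count the distinct hosts in a PBS-style node file
# (one hostname per line, repeated once per allocated processor).
count_unique_nodes() {
    sort -u "$1" | wc -l | awk '{print $1}'
}

# Only touch MPI when running under PBS with mpdboot available.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpdboot >/dev/null 2>&1; then
    # One mpd per distinct node, as suggested in the thread.
    NUM_NODES=$(count_unique_nodes "${PBS_NODEFILE}")
    # ASSUMPTION: one MPI process per line of the node file.
    NP=$(wc -l < "${PBS_NODEFILE}" | awk '{print $1}')

    mpdboot -n "${NUM_NODES}" -f "${PBS_NODEFILE}"
    mpiexec -n "${NP}" ./"${EXE}"
    # Shut the mpd ring down when the run is over.
    mpdallexit
fi
```

Because the host list comes from PBS_NODEFILE on every run, there is no hand-edited mpd.hosts file or hard-coded mpd count to maintain when nodes leave the pool.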



