[MPICH] MPICH2 startup w/ PBS
Rusty Lusk
lusk at mcs.anl.gov
Tue Apr 4 11:27:44 CDT 2006
Is there any particular reason you use the -machinefile argument to
mpiexec, rather than just letting the mpds start the processes in
round-robin fashion?
From: "Jeffrey B. Layton" <laytonjb at charter.net>
Subject: Re: [MPICH] MPICH2 startup w/ PBS
Date: Tue, 04 Apr 2006 12:10:23 -0400
> Darius Buntinas wrote:
> >
> > What's "screaming"? mpdboot or mpiexec?
>
> I'm pretty sure it's mpdboot.
>
>
> I'll try the method below to see what happens. I'm also going
> to try Pete's mpiexec based on some recommendations to see
> if that reduces the pain.
>
> Thanks!
>
> Jeff
>
> >
> > Try:
> > mpdboot -n ${NP} -f ${PBS_NODEFILE}
> > mpiexec -n ${NP} ./${EXE}
> >
> > You don't need a machinefile with mpiexec unless you want to execute
> > on a subset of the nodes in your mpd ring, or you want control of the
> > process-to-node mapping.
> >
> > I think that mpdboot should only start one mpd per node, even if the
> > node is specified more than one time in the file (you really only ever
> > need one mpd per node). If mpdboot is having trouble because you're
> > asking for ${NP} mpds but there are only ${NP}/2 unique nodes in the
> > file, you can try something like:
> >
> > NUM_NODES=`sort -u ${PBS_NODEFILE} | wc -l | awk '{print $1}'`
> > mpdboot -n ${NUM_NODES} -f ${PBS_NODEFILE}
> > mpiexec -n ${NP} ./${EXE}
> >
> > I'm not a PBS expert, so there might be an easier way to do that, but
> > give it a try.
> >
> > If you are concerned about your process-to-node mapping and want to
> > check what it is, try:
> > mpiexec -l -n ${NP} hostname
> >
> > -d
> >
> > On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:
> >
> >> No joy. It always screams about not having enough hosts:
> >>
> >> totalnum=16 numhosts=8
> >> there are not enough hosts on which to start all processes
> >>
> >> I think this is because we have two processors per node (ppn=2).
> >> Consequently PBS_NODEFILE has the hosts repeated. I've
> >> tried using --totalnum=${NP} --ncpus=2 and this didn't work
> >> either (same error message).
> >>
> >> Thanks!
> >>
> >> Jeff
> >>
> >>>
> >>> How about the following 3 lines in your script:
> >>>
> >>> mpdboot -n ${NP} -f ${PBS_NODEFILE}
> >>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
> >>> mpdallexit
> >>>
> >>> Wei-keng
> >>>
> >>>
> >>> On Tue, 4 Apr 2006, Jeffrey B. Layton wrote:
> >>>
> >>>> Good morning,
> >>>>
> >>>> I hate to bother everyone early in the morning, but I'm
> >>>> looking for some advice on MPICH2 startup. I've been starting
> >>>> an mpd on each node in the cluster via,
> >>>>
> >>>> mpdboot -n 25 -f /home/jlayton/mpd.hosts
> >>>>
> >>>> where the file mpd.hosts contains a list of all possible hosts.
> >>>> So I'm basically starting mpd on every node. Then I run the
> >>>> code using mpiexec
> >>>>
> >>>> mpiexec -machinefile ${PBS_NODEFILE} -n ${NP} ./${EXE}
> >>>>
> >>>> and run mpdallexit after the code is finished to stop all of the
> >>>> mpds. Notice that I'm using PBS for queuing/scheduling.
> >>>> This is something of a pain, because we lose nodes for
> >>>> various projects or training so I'm constantly having to go into
> >>>> the list of hosts and edit it. I also have to change the count on
> >>>> the mpdboot command.
> >>>> Is there a better way to start up MPICH2 codes using PBS?
> >>>>
> >>>> Thanks!
> >>>>
> >>>> Jeff
> >>>>
> >>>
> >>
> >>
> >
>
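
For reference, the node-counting step suggested in the quoted thread can be
sketched as a pair of small shell helpers. This is only a sketch under
assumptions: the function names count_unique_nodes and count_slots are made
up for illustration, and the demo uses a hypothetical four-line node file in
place of ${PBS_NODEFILE}. With ppn=2 each host is listed twice, so the number
of mpds to boot (unique hosts) is half the number of processes to run (lines).

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not part of MPICH2 or PBS.

# Number of distinct hosts in a PBS-style node file (one mpd per host).
count_unique_nodes() {
    sort -u "$1" | wc -l | awk '{print $1}'
}

# Number of lines in the node file (one line per requested processor).
count_slots() {
    wc -l < "$1" | awk '{print $1}'
}

# Demo with a made-up node file standing in for ${PBS_NODEFILE}:
# two hosts, each repeated twice as PBS does with ppn=2.
demo_nodefile=$(mktemp)
printf 'node01\nnode01\nnode02\nnode02\n' > "$demo_nodefile"

NUM_NODES=$(count_unique_nodes "$demo_nodefile")
NP=$(count_slots "$demo_nodefile")
echo "unique nodes: ${NUM_NODES}, processes: ${NP}"

rm -f "$demo_nodefile"

# Assumed use inside a real job script (not executed here), following
# the commands given in the thread:
#   mpdboot -n "${NUM_NODES}" -f "${PBS_NODEFILE}"
#   mpiexec -n "${NP}" ./"${EXE}"
#   mpdallexit
```

Deriving both counts from the node file inside the job script means neither
the host list nor the mpdboot count has to be edited by hand when nodes are
taken out of the pool.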