[Swift-devel] Re: [Swift-user] pbs ppn count and stuff
Michael Wilde
wilde at mcs.anl.gov
Tue Feb 1 16:01:46 CST 2011
Thanks for the quick response. One followup - when you say:
> > What is "count in cog"? ...
> It's a task attribute. It means "start this many instances of the
> process".
By "task attribute", I assume you mean a parameter to the Karajan task() element and the associated CoG execution providers? Is there no direct way to set it from Swift's sites.xml, or is there? Or is it just set in the process of translating Swift requests into jobs?
- Mike
----- Original Message -----
> On Tue, 2011-02-01 at 15:34 -0600, Michael Wilde wrote:
> > Hi Mihael,
> >
> > This issue is very timely - it came up in our meeting on the 0.92
> > release.
> >
> > I don't understand the specifics of much of what you say below,
> > regarding which of the many count parameters you are referring to,
> > how this works with coasters, plain PBS and SGE (and Condor
> > providers), and MPI issues.
> >
> > I think a good step would be to help us (Sarah, Justin, and me)
> > update the User Guide with all that a user needs to know to get
> > node and processor counts specified correctly for the many
> > different configurations of sites and Swift that are possible.
> >
> > Some of my initial questions are below. Maybe this would be best
> > discussed in a teleconference, but we can start by trying to clarify
> > the issues using this email thread.
> >
> > > On Mon, 2011-01-24 at 10:46 -0800, Mihael Hategan wrote:
> > > > So I think some of the problems with ppn are as follows:
> > > > 1. count in cog means number of processes. count in PBS means
> > > > number of nodes.
> >
> > What is "count in cog"? Presumably a pool attribute? How does it get
> > specified both for coasters and non-coasters? Is this related to the
> > xcount parameter in the GLOBUS profile in the Swift User Guide MPI
> > example: GLOBUS::host_xcount=3 ?
>
> It's a task attribute. It means "start this many instances of the
> process".
> >
> > > > 2. when the number of nodes requested was 1 but ppn > 1,
> >
> > You mean the number of nodes that Swift requested in the PBS submit
> > file?
>
> Right.
> >
> > as in #PBS -l nodes=$nodes:ppn=$cores
>
> No. As in #PBS -l nodes=1:ppn=n, with n > 1.
>
> So one physical node with multiple processes on that node.
> >
> > > > the multiple job scheme was not enabled so, despite having
> > > > multiple lines in PBS_NODEFILE, only one process would get
> > > > started. If count was > 1 then PBS would understand that
> > > > count*ppn lines should be in PBS_NODEFILE, which would result
> > > > in that number of processes being started. In other words
> > > > there was no way to tell PBS to start 4 jobs on only one node.
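
For reference, in Torque-style PBS the file named by PBS_NODEFILE lists one hostname per allocated processor slot, so a nodes=1:ppn=4 request yields four lines naming the same host. A minimal Python sketch for counting those slots follows; the helper name and the sample host names are made up for illustration and are not part of the PBS provider.

```python
from collections import Counter


def slots_per_host(nodefile_text: str) -> Counter:
    """Count how many lines (processor slots) each host has in a node file.

    Hypothetical helper: takes the text of a PBS_NODEFILE-style file,
    one hostname per line, and ignores blank lines.
    """
    hosts = (line.strip() for line in nodefile_text.splitlines())
    return Counter(host for host in hosts if host)


# Made-up sample: what nodes=1:ppn=4 would look like on a host named node01.
sample = "node01\nnode01\nnode01\nnode01\n"
slots = slots_per_host(sample)
print(sum(slots.values()), dict(slots))
```

In a real job the text would come from reading the path stored in the PBS_NODEFILE environment variable.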
> >
> >
> > > > So:
> > > >
> > > > - I changed this to be consistent with 1. Count means number of
> > > > processes to be started. This imposes the restriction that
> > > > count % ppn = 0. If not, the pbs provider will throw an exception.
> >
> > # of processes to be started is number of workers in coaster case?
>
> Yes. Number of instances of the worker.pl process.
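
The count/ppn rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the PBS provider's code: the function name is hypothetical, and deriving the node count as count / ppn is my reading of the "count % ppn = 0" restriction.

```python
def pbs_resource_request(count: int, ppn: int) -> str:
    """Turn a process count into a '#PBS -l nodes=N:ppn=P' line.

    Hypothetical helper. Assumes count is the total number of processes
    and that the node count is count / ppn, so count must be a multiple
    of ppn (the restriction mentioned in the thread).
    """
    if count < 1 or ppn < 1:
        raise ValueError("count and ppn must be positive")
    if count % ppn != 0:
        raise ValueError(f"count ({count}) must be a multiple of ppn ({ppn})")
    nodes = count // ppn
    return f"#PBS -l nodes={nodes}:ppn={ppn}"


print(pbs_resource_request(4, 4))  # four processes on one node
print(pbs_resource_request(8, 4))  # eight processes on two nodes
```

With count=5 and ppn=4 the helper raises ValueError, standing in for the exception the provider is said to throw.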
> >
> > > > - I also added mppnppn if USE_MPPWIDTH is enabled.
> >
> > Where & how should USE_MPPWIDTH be specified?
>
> Justin added support for it, so I'm assuming there was a place where
> it was needed.
> >
> > > >
> > > > This is in trunk.
> >
> > Should it be retrofitted to 0.92?
>
> It's a pretty radical change. I will port one thing to 0.92, and that
> is enabling the multi-job handling when ppn>1.
> >
> > Does it apply to SGE and the associated "pe" parallel environment
> > issues?
>
> This is strictly about PBS.
> >
> > How does it relate to workersPerNode and the various coaster
> > settings that control size of node allocations?
>
> If you specify ppn > 1, then you need to have nodeGranularity=ppn. We
> should also change nodeGranularity to read coreGranularity.
> >
> > How does it relate to issues of whether or not a site does
> > node-packing, and whether or not a user wants to use node-packing
> > (i.e. single-core jobs in most or all cases)?
>
> If a site feels like re-defining the notion of a node from the
> physical thing with multiple cores to a virtual thing with a single
> core, there's not much we can do about it. But it is not much
> different from considering the site to physically have 1-core nodes.
>
> > I apologize that I can't formulate the question cleanly, but I'm
> > finding the terminology and processor-count model between Swift,
> > cog, coasters, and multiple schedulers with multiple modes to be so
> > complex as to require a more detailed review of this entire issue,
> > with a Swift end-user focus.
>
> It's somewhat complex. But the way to look at it is that you pick one
> model (say the cog/globus one, which says count=number of processes)
> and stick with that. Then you translate that into the specifics of
> each site.
> >
> > Let's start with a voice call and then bring the issue back to the
> > devel list.
> > - Mike
> >
> > > >
> > > > Mihael
> > > >
> > > > _______________________________________________
> > > > Swift-user mailing list
> > > > Swift-user at ci.uchicago.edu
> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> > >
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory