[Swift-devel] Re: [Swift-user] pbs ppn count and stuff

Mihael Hategan hategan at mcs.anl.gov
Tue Feb 1 15:47:44 CST 2011


On Tue, 2011-02-01 at 15:34 -0600, Michael Wilde wrote:
> Hi Mihael,
> 
> This issue is very timely - it came up in our meeting on the 0.92
> release.
> 
> I don't understand the specifics of much of what you say below,
> regarding which of the many count parameters you are referring to, how
> this works with coasters, plain PBS and SGE (and Condor providers),
> and MPI issues.
> 
> I think a good step would be to help us (Sarah, Justin, and me) update
> the User Guide with all that a user needs to know to get node and
> processor counts specified correctly for the many different
> configurations of sites and Swift that are possible.
> 
> Some of my initial questions are below. Maybe this would be best
> discussed in a teleconference, but we can start by trying to clarify
> the issues using this email thread.
> 
> > On Mon, 2011-01-24 at 10:46 -0800, Mihael Hategan wrote:
> > > So I think some of the problems with ppn are as follows:
> > > 1. count in cog means number of processes. count in PBS means
> > > number of nodes.
> 
> What is "count in cog"? Presumably a pool attribute? How does it get
> specified both for coasters and non-coasters? Is this related to the
> xcount parameter in the GLOBUS profile in the Swift User Guide MPI
> example: GLOBUS::host_xcount=3 ?

It's a task attribute. It means "start this many instances of the
process".
> 
> > > 2. when the number of nodes requested was 1 but ppn > 1,
> 
> You mean the number of nodes that Swift requested in the PBS submit
> file?

Right.
> 
> as in #PBS -l nodes=$nodes:ppn=$cores

No. As in #PBS -l nodes=1:ppn=n, with n > 1.

So one physical node with multiple processes on that node.
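
A rough sketch of what that request implies, not taken from the provider code. It assumes the usual PBS behavior of writing one line per allocated process slot to the file named by PBS_NODEFILE, so nodes=1:ppn=4 lists the same host four times. The function name and the hostnames are made up for the example.

```python
def expected_nodefile(hosts, ppn):
    """Return the lines PBS would be expected to write to $PBS_NODEFILE.

    Assumption: each allocated host appears once per process slot,
    i.e. ppn times in a row.
    """
    if ppn < 1:
        raise ValueError("ppn must be at least 1")
    return [host for host in hosts for _ in range(ppn)]


# nodes=1:ppn=4 -> one physical node, four process slots
lines = expected_nodefile(["node001"], 4)
print(len(lines))  # 4
```

With the multi-job handling enabled, one process would be started for each of those lines rather than a single process for the whole job.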
> 
> > > the multiple job scheme was not enabled so, despite having
> > > multiple lines in PBS_NODEFILE, only one process would get
> > > started. If count was > 1 then PBS would understand that
> > > count*ppn lines should be in PBS_NODEFILE, which would result in
> > > that number of processes being started. In other words there was
> > > no way to tell PBS to start 4 jobs on only one node.
> 
> 
> > > So:
> > >
> > > - I changed this to be consistent with 1. Count means number of
> > > processes to be started. This imposes the restriction that
> > > count % ppn = 0. If not, the pbs provider will throw an exception.
> 
> # of processes to be started is number of workers in coaster case?

Yes. Number of instances of the worker.pl process.
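
The count/ppn rule can be sketched roughly as follows. This is an illustration in Python, not the actual pbs provider code. It assumes the node count is derived as count / ppn, which is the reason count has to be a multiple of ppn. The function name is hypothetical, and ValueError merely stands in for whatever exception the provider throws.

```python
def pbs_resource_line(count, ppn=1):
    """Translate a cog-style process count into a PBS resource request.

    Assumption: nodes = count / ppn, so count must be a multiple of ppn.
    """
    if count < 1 or ppn < 1:
        raise ValueError("count and ppn must be positive")
    if count % ppn != 0:
        # Stand-in for the exception the pbs provider throws.
        raise ValueError(f"count ({count}) is not a multiple of ppn ({ppn})")
    nodes = count // ppn
    return f"#PBS -l nodes={nodes}:ppn={ppn}"


print(pbs_resource_line(4, ppn=4))  # #PBS -l nodes=1:ppn=4
print(pbs_resource_line(8, ppn=4))  # #PBS -l nodes=2:ppn=4
```

In the coaster case, count here would be the number of worker.pl instances.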
> 
> > > - I also added mppnppn if USE_MPPWIDTH is enabled.
> 
> Where & how should USE_MPPWIDTH be specified?

Justin added support for it, so I'm assuming there was a place where it
was needed.
> 
> > >
> > > This is in trunk.
> 
> Should it be retrofitted to 0.92?

It's a pretty radical change. I will port one thing to 0.92, and that is
enabling the multi-job handling when ppn>1.
> 
> Does it apply to SGE and the associated "pe" parallel environment
> issues?

This is strictly about PBS.
> 
> How does it relate to workersPerNode and the various coaster settings
> that control size of node allocations?

If you specify ppn > 1, then you need to have nodeGranularity=ppn. We
should also change nodeGranularity to read coreGranularity.
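
A minimal sketch of that constraint as a sanity check. The dict of settings is hypothetical (it is not a Swift data structure), and a missing nodeGranularity is treated as 1 purely for the purposes of the example.

```python
def check_granularity(settings):
    """Raise ValueError if ppn > 1 but nodeGranularity does not equal ppn.

    `settings` is a hypothetical dict of already-parsed site settings.
    Missing keys default to 1 (an assumption made for this sketch).
    """
    ppn = int(settings.get("ppn", 1))
    granularity = int(settings.get("nodeGranularity", 1))
    if ppn > 1 and granularity != ppn:
        raise ValueError(
            f"ppn={ppn} requires nodeGranularity={ppn}, got {granularity}"
        )
    return True


print(check_granularity({"ppn": 8, "nodeGranularity": 8}))  # True
```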
> 
> How does it relate to issues of whether or not a site does
> node-packing, and whether or not a user wants to use node-packing
> (i.e., single-core jobs in most or all cases)?

If a site feels like re-defining the notion of a node from the physical
thing with multiple cores to a virtual thing with a single core, there's
not much we can do about it. But it is not much different from
considering the site to physically have 1-core nodes.

> I apologize that I can't formulate the question cleanly, but I'm finding
> the terminology and processor-count model between Swift, cog,
> coasters, and multiple schedulers with multiple modes to be so complex
> as to require a more detailed review of this entire issue, with a
> Swift end-user focus.

It's somewhat complex. But the way to look at it is that you pick one
model (say the cog/globus one, which says count=number of processes) and
stick with that. Then you translate that into the specifics of each
site.
> 
> Let's start with a voice call and then bring the issue back to the
> devel list.
> - Mike
> 
> > >
> > > Mihael
> > >