[Swift-devel] sites.xml for ranger sge coasters

David Kelly davidk at ci.uchicago.edu
Thu Mar 8 10:58:50 CST 2012


Mike,

Sure, here is a quick description. When SGE is used with coasters, the submit script should start one worker.pl on each node, since a single worker.pl can handle multiple jobs. Instead, it was starting one worker.pl for every core on every node. With jobsPerNode set to 16 on nodes with 16 cores, you could see up to 256 jobs per node (16 workers x 16 jobs each).
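
To make the arithmetic concrete, here is a rough sketch in Python. It is only an illustration, not code from Swift or the SGE provider: the function name max_jobs_per_node and the constants are made up for this example, and it assumes each worker.pl independently accepts up to jobsPerNode jobs.

```python
def max_jobs_per_node(workers_per_node: int, jobs_per_node: int) -> int:
    """Upper bound on simultaneous jobs on one node.

    Assumes every worker accepts up to jobs_per_node jobs on its own
    (hypothetical model, for illustration only).
    """
    return workers_per_node * jobs_per_node


CORES_PER_NODE = 16  # assumed: nodes with 16 cores, as in the example above
JOBS_PER_NODE = 16   # the jobsPerNode profile value

# Before the fix: one worker per core on every node.
before_fix = max_jobs_per_node(workers_per_node=CORES_PER_NODE,
                               jobs_per_node=JOBS_PER_NODE)

# After the fix: one worker per node.
after_fix = max_jobs_per_node(workers_per_node=1,
                              jobs_per_node=JOBS_PER_NODE)

print(before_fix)  # 256
print(after_fix)   # 16
```

With one worker per node, the per-node limit matches the jobsPerNode setting, which is what the setting is meant to control.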

I think this only affected SGE+Coasters, but we should add some tests to the suite to verify.

David

----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "David Kelly" <davidk at ci.uchicago.edu>
> Cc: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>, "swift-devel" <swift-devel at ci.uchicago.edu>
> Sent: Thursday, March 8, 2012 9:21:43 AM
> Subject: Re: sites.xml for ranger sge coasters
> David, thanks for addressing this problem. Does it affect any of the
> other local providers: pbs, condor, (sge), cobalt?
> 
> (I need to do some cobalt runs on Eureka for a user today, so I hope
> that provider is OK).
> 
> You should describe the issue and fix on swift-devel.
> 
> We should start a convention where we can document known issues for
> releases, so that users don't have to discover these bugs on their own.
> Can you make an action item to propose and start such a place
> (probably crosslinked to both Downloads and Documentation)? Not urgent
> for today, but next week would be good.
> 
> Thanks,
> 
> - Mike
> 
> 
> 
> ----- Original Message -----
> > From: "David Kelly" <davidk at ci.uchicago.edu>
> > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > Cc: "Michael Wilde" <wilde at mcs.anl.gov>
> > Sent: Thursday, March 8, 2012 8:45:00 AM
> > Subject: Re: sites.xml for ranger sge coasters
> > 0.93 is frozen, but I committed the same change to 0.93.1 this
> > morning.
> >
> > ----- Original Message -----
> > > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > To: "David Kelly" <davidk at ci.uchicago.edu>
> > > Cc: "Michael Wilde" <wilde at mcs.anl.gov>
> > > Sent: Wednesday, March 7, 2012 6:39:09 PM
> > > Subject: Re: sites.xml for ranger sge coasters
> > > Is it committed in 0.93 too?
> > >
> > >
> > > On Wed, Mar 7, 2012 at 6:10 PM, David Kelly <davidk at ci.uchicago.edu> wrote:
> > >
> > >
> > > I submitted a fix to trunk for the SGE provider. The submit script
> > > was wrong - it started one worker per core, rather than one worker
> > > per host. (Oddly it's been like that for years without anybody
> > > noticing). I ran a few sleep/hostname tests and it seems to be
> > > working. Can you please give it a try?
> > >
> > > Below is the sites.xml I used for my test:
> > >
> > > <config>
> > > <pool handle="ranger">
> > > <execution jobmanager="local:sge" provider="coaster" url="none"/>
> > >
> > > <filesystem provider="local" url="none" />
> > > <profile namespace="globus" key="maxWallTime">5</profile>
> > > <profile namespace="globus" key="maxTime">600</profile>
> > > <profile key="jobsPerNode" namespace="globus">16</profile>
> > > <profile key="slots" namespace="globus">1</profile>
> > > <profile key="nodeGranularity" namespace="globus">3</profile>
> > > <profile key="pe" namespace="globus">16way</profile>
> > > <profile key="maxNodes" namespace="globus">3</profile>
> > > <profile key="queue" namespace="globus">development</profile>
> > > <profile key="jobThrottle" namespace="karajan">0.4799</profile>
> > > <profile key="initialScore" namespace="karajan">10000</profile>
> > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > <workdirectory>/share/home/01503/davidkel/swiftwork</workdirectory>
> > > </pool>
> > > </config>
> > >
> > > Thanks,
> > > David
> > >
> > >
> > >
> > >
> > > --
> > > Ketan
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory


