[Swift-user] [Swift-devel] gram on ranger

Michael Wilde wilde at mcs.anl.gov
Thu Oct 20 10:21:43 CDT 2011


Thanks, Ketan. If I understand you correctly, then I would consider this a Swift bug, in that maxnodes should always mean *nodes*, for every type of resource provider including SGE. Based on what you say, the SGE provider is in this case treating the requested maxnodes count as cores (assuming Anjali was running the same Swift revision as you were testing on here).

But then that might not explain the error in the log that Sarah posted.
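
To make the nodes-versus-cores distinction concrete, here is a minimal Python sketch of the conversion a provider would need to perform. It assumes Ranger's 16 cores per node and that the count on the SGE -pe line is a total core (slot) count, as Ketan's description suggests. The function names are hypothetical and this is not the actual Swift SGE provider code.

```python
def sge_pe_request(max_nodes: int, pe: str = "16way", cores_per_node: int = 16) -> str:
    """Build the SGE parallel-environment line for a request of max_nodes nodes.

    Assumption: the number after the PE name counts cores (slots), so a
    node count has to be multiplied by the cores per node.
    """
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    slots = max_nodes * cores_per_node
    return f"#$ -pe {pe} {slots}"


def nodes_from_pe_request(line: str, cores_per_node: int = 16) -> int:
    """Recover the node count from a line such as '#$ -pe 16way 256'."""
    parts = line.split()
    if len(parts) != 4 or parts[:2] != ["#$", "-pe"]:
        raise ValueError(f"not an SGE -pe line: {line!r}")
    slots = int(parts[3])
    if slots % cores_per_node != 0:
        raise ValueError("slot count is not a whole number of nodes")
    return slots // cores_per_node


if __name__ == "__main__":
    # maxnodes=16 interpreted as nodes gives 256 slots
    print(sge_pe_request(16))
    # the line Ketan observed, '#$ -pe 16way 16', corresponds to a single node
    print(nodes_from_pe_request("#$ -pe 16way 16"))
```

Under these assumptions, maxnodes=16 would yield a request for 256 slots, whereas the observed line with 16 slots amounts to one node.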

It seems the next step is to try the run on a smaller job (we can test this ourselves), and see if we can replicate and diagnose the error, with SGE submit files and output/error logs.

David, can you do this, since you were working on SGE testing last week?
You and Ketan should share what you know about the situation via swift-devel, as Ketan is also running on Ranger with persistent coasters, I think.
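
As a quick pre-flight check before resubmitting, the sites file can be compared against the queue limit. The sketch below is illustrative only: the 16-node limit for the development queue is the theory from the quoted message below, not a verified figure, and check_sites and QUEUE_NODE_LIMITS are hypothetical names rather than part of Swift. The embedded XML is a trimmed copy of the sites file Sarah posted.

```python
import xml.etree.ElementTree as ET

# Node limits per queue. The "development" value is an assumption taken
# from the theory in the quoted message, not from TACC documentation.
QUEUE_NODE_LIMITS = {"development": 16}

# Trimmed copy of the posted sites file (only the relevant profile keys).
SITES = """
<config>
  <pool handle="RANGER">
    <profile namespace="globus" key="nodeGranularity">64</profile>
    <profile namespace="globus" key="maxNodes">256</profile>
    <profile namespace="globus" key="queue">development</profile>
    <profile namespace="globus" key="pe">16way</profile>
  </pool>
</config>
"""


def globus_profiles(pool):
    """Return the pool's globus-namespace profile entries, keyed in lower case."""
    return {
        p.get("key", "").lower(): (p.text or "").strip()
        for p in pool.findall("profile")
        if p.get("namespace") == "globus"
    }


def check_sites(xml_text, limits=QUEUE_NODE_LIMITS):
    """List settings that exceed the node limit of the pool's queue."""
    problems = []
    for pool in ET.fromstring(xml_text.strip()).findall("pool"):
        prof = globus_profiles(pool)
        queue = prof.get("queue", "")
        limit = limits.get(queue)
        if limit is None:
            continue  # no known limit for this queue, nothing to check
        for key in ("maxnodes", "nodegranularity"):
            value = prof.get(key)
            if value is not None and int(value) > limit:
                problems.append(
                    f"{pool.get('handle')}: {key}={value} exceeds the "
                    f"{limit}-node limit assumed for queue '{queue}'"
                )
    return problems


if __name__ == "__main__":
    for problem in check_sites(SITES):
        print(problem)
```

Profile keys are compared in lower case because the posted files mix spellings such as maxtime and maxTime.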

Thanks,

Mike


----- Original Message -----
> From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Sarah Kenny" <skenny at uchicago.edu>, "Anjali Raja" <anjraja at gmail.com>, "Swift Devel"
> <swift-devel at ci.uchicago.edu>, "Swift User" <swift-user at ci.uchicago.edu>
> Sent: Thursday, October 20, 2011 9:54:33 AM
> Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> On Thu, Oct 20, 2011 at 7:50 AM, Michael Wilde < wilde at mcs.anl.gov >
> wrote:
> 
> 
> Hi Sarah, Anjali,
> 
> My initial theory on what's failing in this job is that the Ranger
> development queue is limited to jobs of 16 nodes or fewer. (The Ranger
> User Guide says maxprocs 256 for that queue, and qconf -sq development
> says slots 16, which agrees). So you need to either change to one of
> the production queues (normal, long, etc.) or reduce the values of
> maxnodes and nodegranularity.
> 
> 
> 
> I have a little confusion here: to get 256 procs, the desired line in
> the final SGE submit script should be:
> #$ -pe <n>way 256
> However, putting maxnodes=16 in sites.xml results in the following
> line in the script:
> #$ -pe <n>way 16
> I understand that this number (16 or 256) counts procs, since Ranger
> does allow the job to run in the development queue when it is set to
> 256.
> 
> 
> 
> I would also suggest (unless you have already done this) that you test
> first on a very small run (like a single RInvoke app call) and then
> scale up to just a few voxels per dataset before trying such a large
> run. Have you already tested that?
> 
> Lastly, when reporting problems like this, the swift standard
> output/err is also very helpful to get a higher-level view of what
> went wrong.
> 
> Swift needs to clearly return errors from the local resource provider,
> which it doesn't seem to be doing here. I've filed this as bug 593 and
> assigned it to David.
> 
> Please let us know if changing the queue and/or slots resolves the
> problem. As mentioned in the bug report I think you can set debug=true
> (or yes?) in the provider-sge.properties file and get swift to
> preserve the output from SGE in ~/.globus/scripts. (In fact that may
> already be preserved, I am not sure). Please check there to see if the
> SGE error is there.
> 
> Thanks,
> 
> - Mike
> 
> 
> 
> ----- Original Message -----
> > From: "Sarah Kenny" < skenny at uchicago.edu >
> 
> > To: "Mihael Hategan" < hategan at mcs.anl.gov >
> > Cc: "Anjali Raja" < anjraja at gmail.com >, "Swift Devel" <
> > swift-devel at ci.uchicago.edu >, "Swift User"
> > < swift-user at ci.uchicago.edu >
> > Sent: Thursday, October 20, 2011 6:07:09 AM
> > Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> 
> > hi all, one of our users, anjali (cc'd here) is trying to submit this
> > ~400k job workflow to ranger...thought i'd see if you felt like having
> > a look :)
> >
> > log is here:
> > /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log
> >
> > sites file:
> >
> > <config>
> > <pool handle="RANGER">
> > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > <profile namespace="globus" key="maxtime">7200</profile>
> > <profile namespace="globus" key="maxWallTime">00:20:00</profile>
> > <profile namespace="globus" key="jobsPerNode">1</profile>
> > <profile namespace="globus" key="nodeGranularity">64</profile>
> > <profile namespace="globus" key="maxNodes">256</profile>
> > <profile namespace="globus" key="queue">development</profile>
> > <profile namespace="karajan" key="jobThrottle">1.28</profile>
> > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > <profile namespace="globus" key="pe">16way</profile>
> > <profile namespace="karajan" key="initialScore">10000</profile>
> > <workdirectory>/work/00926/tg459516/swiftwork</workdirectory>
> > </pool>
> > </config>
> >
> >
> > On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan <
> > hategan at mcs.anl.gov
> > > wrote:
> >
> >
> >
> > On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote:
> > >
> > >
> > > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan <
> > > hategan at mcs.anl.gov >
> > > wrote:
> > > Is this with a persistent coaster service?
> > >
> > > admittedly i have not used persistent coaster service...should i?
> >
> > No. I was just trying to figure out whether it might be something
> > related to the persistent version.
> >
> >
> >
> >
> > > i feel like it's documented *somewhere* (?)
> > >
> > > > for now i've tried setting 'sitedir.keep=true' in the config so maybe
> > > > it won't try to run the cleanup job...we'll see (waiting in q)
> > >
> > >
> > >
> > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote:
> > > >
> > > >
> > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly
> > > < davidk at ci.uchicago.edu >
> > > > wrote:
> > > >
> > > > > That could be it.. maybe a cleanup script is not getting the right
> > > > > parameters and failing. Do you happen to have a copy of the coaster log?
> > > >
> > > > just put it in /home/skenny/swift_logs
> > > >
> > > >
> > > > Maybe there will be some clues in there.
> > > >
> > > > ----- Original Message -----
> > > > > From: "Sarah Kenny" < skenny at uchicago.edu >
> > > >
> > > > > To: "David Kelly" < davidk at ci.uchicago.edu >
> > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >,
> > > "Swift
> > > > User" < swift-user at ci.uchicago.edu >, "Justin M
> > > Wozniak"
> > > > > < wozniak at mcs.anl.gov >
> > > >
> > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM
> > > > > Subject: Re: [Swift-user] gram on ranger
> > > >
> > > > > > so, this workflow completes all the jobs but then just hangs
> > > > > > indefinitely at the end...maybe a stray cleanup job?
> > > > > >
> > > > > > log is here:
> > > > > >
> > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log
> > > > > >
> > > > > > just tweaked the sites file a bit from what david sent me:
> > > > > >
> > > > > > <config>
> > > > > > <pool handle="RANGER">
> > > > > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > > <profile namespace="globus" key="maxtime">28800</profile>
> > > > > > <profile namespace="globus" key="maxWallTime">00:15:00</profile>
> > > > > > <profile namespace="globus" key="jobsPerNode">1</profile>
> > > > > > <profile namespace="globus" key="nodeGranularity">64</profile>
> > > > > > <profile namespace="globus" key="maxNodes">256</profile>
> > > > > > <profile namespace="globus" key="queue">normal</profile>
> > > > > > <profile namespace="karajan" key="jobThrottle">1</profile>
> > > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > > > <workdirectory>/work/00043/tg457040/sidgrid_out/skenny</workdirectory>
> > > > > > </pool>
> > > > > > </config>
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny <
> > > > skenny at uchicago.edu >
> > > > > wrote:
> > > > >
> > > > >
> > > > > > ok, thanks, got in the queue now...also, realized my last run may
> > > > > > have been using the old swift. apparently i had SWIFT_HOME set in
> > > > > > my env and that overrides the newer swift i had set in my PATH.
> > > > >
> > > > > ~sk
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly <
> > > > davidk at ci.uchicago.edu
> > > > > > wrote:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Sarah,
> > > > >
> > > > > > Can you give this another try with the latest 0.93? I made some
> > > > > > changes to the coaster and sge providers and was able to get it
> > > > > > working with a simple catns script. Here is the configuration file
> > > > > > I was using:
> > > > > >
> > > > > > <config>
> > > > > > <pool handle="ranger">
> > > > > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > > <profile namespace="globus" key="maxtime">3600</profile>
> > > > > > <profile namespace="globus" key="maxWallTime">00:00:03</profile>
> > > > > > <profile namespace="globus" key="jobsPerNode">1</profile>
> > > > > > <profile namespace="globus" key="nodeGranularity">16</profile>
> > > > > > <profile namespace="globus" key="maxNodes">16</profile>
> > > > > > <profile namespace="globus" key="queue">development</profile>
> > > > > > <profile namespace="karajan" key="jobThrottle">0.9</profile>
> > > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > > <workdirectory>/share/home/01503/davidkel/swiftwork</workdirectory>
> > > > > > </pool>
> > > > > > </config>
> > > > >
> > > > > Thanks,
> > > > >
> > > > > David
> > > > >
> > > > > ----- Original Message -----
> > > > >
> > > > > > From: "Sarah Kenny" < skenny at uchicago.edu >
> > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov >
> > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu
> > > >, "Swift
> > > > User" <
> > > > > > swift-user at ci.uchicago.edu >
> > > > >
> > > > >
> > > > >
> > > > > > Sent: Friday, October 7, 2011 3:13:57 PM
> > > > > > Subject: Re: [Swift-user] gram on ranger
> > > > >
> > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log
> > > > > >
> > > > > > on ci
> > > > > >
> > > > > >
> > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak
> > > <
> > > > > > wozniak at mcs.anl.gov
> > > > > > > wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > Can I take a look at the log?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > hey all, i'm trying to submit to gram on ranger using the latest
> > > > > > > swift (built from trunk). it fails like so:
> > > > > > >
> > > > > > > Cannot submit job
> > > > > > > Caused by:
> > > > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> > > > > > > Cannot submit job
> > > > > > > Caused by: org.globus.gram.GramException: Parameter not supported
> > > > > > > Cannot submit job
> > > > > > >
> > > > > > > the gram log was saying first that 'jobsPerNode' is not supported so
> > > > > > > i changed it to workersPerNode and then it was saying 'maxnodes' is
> > > > > > > not supported. here's my sites file:
> > > > > >
> > > > > > > <config>
> > > > > > > <pool handle="RANGER">
> > > > > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > > > > <profile namespace="karajan" key="jobThrottle">1</profile>
> > > > > > > <profile namespace="globus" key="maxWallTime">00:15:00</profile>
> > > > > > > <profile namespace="globus" key="maxTime">86400</profile>
> > > > > > > <profile namespace="globus" key="slots">1</profile>
> > > > > > > <profile namespace="globus" key="maxNodes">256</profile>
> > > > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > > > <profile namespace="globus" key="workersPerNode">1</profile>
> > > > > > > <profile namespace="globus" key="nodeGranularity">64</profile>
> > > > > > > <profile namespace="globus" key="queue">normal</profile>
> > > > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > > > <execution provider="coaster" jobManager="gt2:gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > > <execution provider="gt2" jobManager="SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > > <workdirectory>/work/00043/tg457040</workdirectory>
> > > > > > > </pool>
> > > > > > > </config>
> > > > > >
> > > > > > thoughts? ideas?
> > > > > >
> > > > > > --
> > > > > > Justin M Wozniak
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sarah Kenny
> > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224
> > > Bio Sci
> > > > III
> > > > > > University of California Irvine, Dept. of
> > > Neurology ~
> > > > 773-818-8300
> > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > Swift-user mailing list
> > > > > > Swift-user at ci.uchicago.edu
> > > > > >
> > > >
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sarah Kenny
> > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224
> > > Bio Sci III
> > > > > University of California Irvine, Dept. of
> > > Neurology ~
> > > > 773-818-8300
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sarah Kenny
> > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224
> > > Bio Sci III
> > > > > University of California Irvine, Dept. of
> > > Neurology ~
> > > > 773-818-8300
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Sarah Kenny
> > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > > > University of California Irvine, Dept. of Neurology ~
> > > 773-818-8300
> > > >
> > > > _______________________________________________
> > > > Swift-user mailing list
> > > > Swift-user at ci.uchicago.edu
> > > >
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Sarah Kenny
> > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > > University of California Irvine, Dept. of Neurology ~ 773-818-8300
> > >
> >
> >
> >
> >
> >
> > --
> > Sarah Kenny
> > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > University of California Irvine, Dept. of Neurology ~ 773-818-8300
> >
> >
> 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> 
> 
> 
> 
> 
> 
> 
> --
> Ketan

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



