[Swift-user] [Swift-devel] gram on ranger

Michael Wilde wilde at mcs.anl.gov
Sat Oct 22 09:41:17 CDT 2011


Sarah, was this 50K version run with the same sites file and Swift version?

At any rate, David is correcting some known problems in the SGE provider and increasing its test coverage. Once that's done, we can try again.

In the meantime, if you want to push this forward in parallel, can you try to run again and capture the SGE submit and stdout/err files?

I'm not 100% sure the following is correct, but I think you can put the SGE provider into debug mode by doing one or both of the following:

  etc/provider-sge.properties: add line: debug=true

(I think this works for the PBS provider and assume it does for SGE; we need to verify)

Also the sites/pbs page on the swiftdevel site has this, which *might* also give more debug info for SGE (again, needs to be checked):

# Special functionality: suppresses auto-deletion of PBS submit file
log4j.logger.org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor=DEBUG

log4j.logger.org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor=DEBUG
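
If the SGE provider follows the same class layout as the PBS one, the SGE-specific analogue would be something like the line below, added to etc/log4j.properties. The sge.SGEExecutor class name is my guess from the PBS naming convention, so treat it as unverified until someone checks the source:

  # alongside the AbstractExecutor line above (class name unverified):
  log4j.logger.org.globus.cog.abstraction.impl.scheduler.sge.SGEExecutor=DEBUG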


- Mike

----- Original Message -----
> From: "Sarah Kenny" <skenny at uchicago.edu>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>, "David Kelly" <davidk at ci.uchicago.edu>, "Anjali Raja"
> <anjraja at gmail.com>, "Swift Devel" <swift-devel at ci.uchicago.edu>, "Swift User" <swift-user at ci.uchicago.edu>
> Sent: Saturday, October 22, 2011 5:57:45 AM
> Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> fyi, this works on a smaller workflow; we've run it several times on a
> 50k version.
> 
> 
> On Thu, Oct 20, 2011 at 8:21 AM, Michael Wilde < wilde at mcs.anl.gov >
> wrote:
> 
> 
> Thanks, Ketan. If I understand you correctly, I would consider this a
> Swift bug: maxNodes should always mean *nodes* for every type of
> resource provider, including SGE. Based on what you say, the SGE
> provider is in this case treating the requested maxNodes count as
> cores (assuming Anjali was running the same Swift revision you were
> testing here).
> 
> But then that might not explain the error in the log that Sarah
> posted.
> 
> It seems the next step is to try a smaller run (we can test this
> ourselves) and see if we can replicate and diagnose the error, with the
> SGE submit files and output/error logs in hand.
> 
> David, can you do this, since you were working on SGE testing last
> week? You and Ketan should share what you know about the situation via
> swift-devel, as Ketan is also running on Ranger with persistent
> coasters, I think.
> 
> Thanks,
> 
> 
> Mike
> 
> 
> ----- Original Message -----
> 
> > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > To: "Michael Wilde" < wilde at mcs.anl.gov >
> > Cc: "Sarah Kenny" < skenny at uchicago.edu >, "Anjali Raja" <
> > anjraja at gmail.com >, "Swift Devel"
> > < swift-devel at ci.uchicago.edu >, "Swift User" <
> > swift-user at ci.uchicago.edu >
> > Sent: Thursday, October 20, 2011 9:54:33 AM
> 
> 
> 
> > Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> > On Thu, Oct 20, 2011 at 7:50 AM, Michael Wilde < wilde at mcs.anl.gov >
> > wrote:
> >
> >
> > Hi Sarah, Anjali,
> >
> > My initial theory on what's failing in this job is that the Ranger
> > development queue is limited to jobs of 16 nodes or fewer. (The Ranger
> > User Guide says maxprocs 256 for that queue, and qconf -sq development
> > says slots 16, which agrees.) So you need to either switch to one of
> > the production queues (normal, long, etc.) or reduce the values of
> > maxNodes and nodeGranularity.
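> >
> > To double-check those limits from a Ranger login node, something like
> > the following should work (qconf -sq and qstat -g c are stock SGE
> > commands; the egrep pattern is just what I would look at, not verified
> > on Ranger):
> >
> >   # queue definition: check the "slots" and "pe_list" fields
> >   qconf -sq development | egrep 'slots|pe_list'
> >   # per-queue summary of used/available slots
> >   qstat -g c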
> >
> >
> >
> > I am a little confused here: to get 256 procs, the desired line in the
> > final SGE submit script should be "#$ -pe <n>way 256"; however, setting
> > maxNodes=16 in sites.xml results in the following line instead:
> > #$ -pe <n>way 16
> > I take this 16/256 number to be procs, since requesting 256 with the
> > development queue does let the job run in that queue.
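> >
> > To make the two readings concrete, here is a sketch of the header in
> > each case (16way and 16 cores per node are just Ranger's usual numbers;
> > the exact generated text is my guess, not copied from a real submit
> > file):
> >
> >   # maxNodes read as nodes (what I expected): 16 nodes x 16 cores/node = 256 slots
> >   #$ -pe 16way 256
> >   # maxNodes read as procs/slots (what actually comes out)
> >   #$ -pe 16way 16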
> >
> >
> >
> > I would also suggest (unless you have already done this) that you test
> > first on a very small run (like a single RInvoke app call) and then
> > scale up to just a few voxels per dataset before trying such a large
> > run. Have you already tested that?
> >
> > Lastly, when reporting problems like this, the Swift standard
> > output/err is also very helpful for getting a higher-level view of
> > what went wrong.
> >
> > Swift needs to clearly return errors from the local resource provider,
> > which it doesn't seem to be doing here. I've filed this as bug 593 and
> > assigned it to David.
> >
> > Please let us know if changing the queue and/or slots resolves the
> > problem. As mentioned in the bug report, I think you can set debug=true
> > (or yes?) in the provider-sge.properties file and get Swift to preserve
> > the output from SGE in ~/.globus/scripts. (In fact, that may already be
> > preserved; I am not sure.) Please check whether the SGE error shows up
> > there.
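> >
> > If they are preserved, the quickest check is just to list that
> > directory by time on the Ranger side (I am not sure of the exact file
> > naming, so no name filter here):
> >
> >   ls -lt ~/.globus/scripts | head -20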
> >
> > Thanks,
> >
> > - Mike
> >
> >
> >
> > ----- Original Message -----
> > > From: "Sarah Kenny" < skenny at uchicago.edu >
> >
> > > To: "Mihael Hategan" < hategan at mcs.anl.gov >
> > > Cc: "Anjali Raja" < anjraja at gmail.com >, "Swift Devel" <
> > > swift-devel at ci.uchicago.edu >, "Swift User"
> > > < swift-user at ci.uchicago.edu >
> > > Sent: Thursday, October 20, 2011 6:07:09 AM
> > > Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> >
> > > hi all, one of our users, anjali (cc'd here) is trying to submit this
> > > ~400k job workflow to ranger...thought i'd see if you felt like having
> > > a look :)
> > >
> > > log is here:
> > > /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log
> > >
> > > sites file:
> > >
> > > <config>
> > > <pool handle="RANGER">
> > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > <profile namespace="globus" key="maxtime">7200</profile>
> > > <profile namespace="globus" key="maxWallTime">00:20:00</profile>
> > > <profile namespace="globus" key="jobsPerNode">1</profile>
> > > <profile namespace="globus" key="nodeGranularity">64</profile>
> > > <profile namespace="globus" key="maxNodes">256</profile>
> > > <profile namespace="globus" key="queue">development</profile>
> > > <profile namespace="karajan" key="jobThrottle">1.28</profile>
> > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > <profile namespace="globus" key="pe">16way</profile>
> > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > <workdirectory>/work/00926/tg459516/swiftwork</workdirectory>
> > > </pool>
> > > </config>
> > >
> > >
> > > On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan < hategan at mcs.anl.gov > wrote:
> > >
> > >
> > >
> > > On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote:
> > > >
> > > >
> > > > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan <
> > > > hategan at mcs.anl.gov >
> > > > wrote:
> > > > Is this with a persistent coaster service?
> > > >
> > > > admittedly i have not used persistent coaster service...should
> > > > i?
> > >
> > > No. I was just trying to figure out whether it might be something
> > > related to the persistent version.
> > >
> > >
> > >
> > >
> > > > i feel like it's documented *somewhere* (?)
> > > >
> > > > for now i've tried setting 'sitedir.keep=true' in the config so
> > > > maybe
> > > > it won't try to run the cleanup job...we'll see (waiting in q)
> > > >
> > > >
> > > >
> > > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote:
> > > > >
> > > > >
> > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly < davidk at ci.uchicago.edu > wrote:
> > > > >
> > > > > That could be it.. maybe a cleanup script is not getting the
> > > > > right parameters and failing. Do you happen to have a copy of
> > > > > the coaster log?
> > > > >
> > > > > just put it in /home/skenny/swift_logs
> > > > >
> > > > >
> > > > > Maybe there will be some clues in there.
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Sarah Kenny" < skenny at uchicago.edu >
> > > > >
> > > > > > To: "David Kelly" < davidk at ci.uchicago.edu >
> > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, "Swift User" < swift-user at ci.uchicago.edu >, "Justin M Wozniak" < wozniak at mcs.anl.gov >
> > > > >
> > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM
> > > > > > Subject: Re: [Swift-user] gram on ranger
> > > > >
> > > > > > so, this workflow completes all the jobs but then just hangs
> > > > > > indefinitely at the end...maybe a stray cleanup job?
> > > > > >
> > > > > > log is here:
> > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log
> > > > > >
> > > > > > just tweaked the sites file a bit from what david sent me:
> > > > > >
> > > > > > <config>
> > > > > > <pool handle="RANGER">
> > > > > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > > <profile namespace="globus" key="maxtime">28800</profile>
> > > > > > <profile namespace="globus" key="maxWallTime">00:15:00</profile>
> > > > > > <profile namespace="globus" key="jobsPerNode">1</profile>
> > > > > > <profile namespace="globus" key="nodeGranularity">64</profile>
> > > > > > <profile namespace="globus" key="maxNodes">256</profile>
> > > > > > <profile namespace="globus" key="queue">normal</profile>
> > > > > > <profile namespace="karajan" key="jobThrottle">1</profile>
> > > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > > > <workdirectory>/work/00043/tg457040/sidgrid_out/skenny</workdirectory>
> > > > > > </pool>
> > > > > > </config>
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < skenny at uchicago.edu > wrote:
> > > > > >
> > > > > >
> > > > > > ok, thanks, got in the queue now...also, realized my last run may have
> > > > > > been using the old swift. apparently i had SWIFT_HOME set in my env
> > > > > > and that overrides the newer swift i had set in my PATH.
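> > > > > >
> > > > > > fwiw, a quick way to double-check which swift a shell actually picks up
> > > > > > (i *think* swift -version prints the build/revision, but i'm not 100%
> > > > > > sure of that flag):
> > > > > >
> > > > > >   echo $SWIFT_HOME
> > > > > >   which swift
> > > > > >   swift -version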
> > > > > >
> > > > > > ~sk
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < davidk at ci.uchicago.edu > wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Sarah,
> > > > > >
> > > > > > Can you give this another try with the latest 0.93? I made some
> > > > > > changes to the coaster and sge providers and was able to get it
> > > > > > working with a simple catns script. Here is the configuration file I
> > > > > > was using:
> > > > > >
> > > > > > <config>
> > > > > > <pool handle="ranger">
> > > > > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > > <profile namespace="globus" key="maxtime">3600</profile>
> > > > > > <profile namespace="globus" key="maxWallTime">00:00:03</profile>
> > > > > > <profile namespace="globus" key="jobsPerNode">1</profile>
> > > > > > <profile namespace="globus" key="nodeGranularity">16</profile>
> > > > > > <profile namespace="globus" key="maxNodes">16</profile>
> > > > > > <profile namespace="globus" key="queue">development</profile>
> > > > > > <profile namespace="karajan" key="jobThrottle">0.9</profile>
> > > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > > <workdirectory>/share/home/01503/davidkel/swiftwork</workdirectory>
> > > > > > </pool>
> > > > > > </config>
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > David
> > > > > >
> > > > > > ----- Original Message -----
> > > > > >
> > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu >
> > > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov >
> > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, "Swift User" < swift-user at ci.uchicago.edu >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM
> > > > > > > Subject: Re: [Swift-user] gram on ranger
> > > > > >
> > > > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log
> > > > > > >
> > > > > > > on ci
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak < wozniak at mcs.anl.gov > wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Can I take a look at the log?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > hey all, i'm trying to submit to gram on ranger using the latest
> > > > > > > swift (built from trunk). it fails like so:
> > > > > > >
> > > > > > > Cannot submit job
> > > > > > > Caused by:
> > > > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> > > > > > > Cannot submit job
> > > > > > > Caused by: org.globus.gram.GramException: Parameter not supported
> > > > > > > Cannot submit job
> > > > > > >
> > > > > > > the gram log was saying first that 'jobsPerNode' is not supported so i
> > > > > > > changed it to workersPerNode and then it was saying 'maxnodes' is not
> > > > > > > supported. here's my sites file:
> > > > > > >
> > > > > > > <config>
> > > > > > > <pool handle="RANGER">
> > > > > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > > > > <profile namespace="karajan" key="jobThrottle">1</profile>
> > > > > > > <profile namespace="globus" key="maxWallTime">00:15:00</profile>
> > > > > > > <profile namespace="globus" key="maxTime">86400</profile>
> > > > > > > <profile namespace="globus" key="slots">1</profile>
> > > > > > > <profile namespace="globus" key="maxNodes">256</profile>
> > > > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > > > <profile namespace="globus" key="workersPerNode">1</profile>
> > > > > > > <profile namespace="globus" key="nodeGranularity">64</profile>
> > > > > > > <profile namespace="globus" key="queue">normal</profile>
> > > > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > > > <execution provider="coaster" jobManager="gt2:gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > > <execution provider="gt2" jobManager="SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > > <workdirectory>/work/00043/tg457040</workdirectory>
> > > > > > > </pool>
> > > > > > > </config>
> > > > > > >
> > > > > > > thoughts? ideas?
> > > > > > >
> > > > > > > --
> > > > > > > Justin M Wozniak
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Sarah Kenny
> > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > > > > > > University of California Irvine, Dept. of Neurology ~ 773-818-8300
> > > > > > >
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Swift-user mailing list
> > > > > > > Swift-user at ci.uchicago.edu
> > > > > > >
> > > > >
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sarah Kenny
> > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > > > > > University of California Irvine, Dept. of Neurology ~ 773-818-8300
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sarah Kenny
> > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > > > > > University of California Irvine, Dept. of Neurology ~ 773-818-8300
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sarah Kenny
> > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > > > > University of California Irvine, Dept. of Neurology ~ 773-818-8300
> > > > >
> > > > > _______________________________________________
> > > > > Swift-user mailing list
> > > > > Swift-user at ci.uchicago.edu
> > > > >
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Sarah Kenny
> > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > > > University of California Irvine, Dept. of Neurology ~ 773-818-8300
> > > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Sarah Kenny
> > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > > University of California Irvine, Dept. of Neurology ~ 773-818-8300
> > >
> > >
> >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> >
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> >
> >
> >
> > --
> > Ketan
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> 
> 
> 
> --
> Sarah Kenny
> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> University of California Irvine, Dept. of Neurology ~ 773-818-8300

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



