[Swift-user] [Swift-devel] gram on ranger

Michael Wilde wilde at mcs.anl.gov
Thu Oct 20 07:50:39 CDT 2011


Hi Sarah, Anjali,

My initial theory on what's failing in this job is that the Ranger development queue is limited to jobs of 16 nodes or fewer. (The Ranger User Guide says maxprocs 256 for that queue, which at 16 cores per node is 16 nodes, and qconf -sq development says slots 16, which agrees.) Your sites file asks for maxNodes 256 and nodeGranularity 64, both well above that limit. So you need to either change to one of the production queues (normal, long, etc.) or reduce the values of maxNodes and nodeGranularity.
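
As a rough sketch of the second option, the relevant entries in the sites file quoted below could be lowered to match the values David used in his earlier development-queue test. The value 16 is an assumption based on the queue limit described above; the other entries would stay as they are.

```xml
<!-- Sketch only: 16 is an assumed ceiling for the development queue, not a verified limit -->
<profile namespace="globus" key="nodeGranularity">16</profile>
<profile namespace="globus" key="maxNodes">16</profile>
<profile namespace="globus" key="queue">development</profile>
```

If the run really needs more than 16 nodes, the alternative is to leave the node settings alone and change only the queue entry to a production queue such as normal.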

I would also suggest (unless you have already done this) that you test first on a very small run (like a single RInvoke app call) and then scale up to just a few voxels per dataset before trying such a large run.  Have you already tested that?

Lastly, when reporting problems like this, Swift's standard output and standard error are also very helpful for getting a higher-level view of what went wrong.

Swift needs to clearly return errors from the local resource provider, which it doesn't seem to be doing here. I've filed this as bug 593 and assigned it to David.

Please let us know if changing the queue and/or slots resolves the problem. As mentioned in the bug report, I think you can set debug=true (or yes?) in the provider-sge.properties file and get Swift to preserve the output from SGE in ~/.globus/scripts. (In fact, that may already be preserved; I am not sure.) Please check there to see if the SGE error is there.

Thanks,

- Mike


----- Original Message -----
> From: "Sarah Kenny" <skenny at uchicago.edu>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Anjali Raja" <anjraja at gmail.com>, "Swift Devel" <swift-devel at ci.uchicago.edu>, "Swift User" <swift-user at ci.uchicago.edu>
> Sent: Thursday, October 20, 2011 6:07:09 AM
> Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> hi all, one of our users, anjali (cc'd here) is trying to submit this
> ~400k job workflow to ranger...thought i'd see if you felt like having
> a look :)
> 
> log is here:
> /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log
> 
> sites file:
> 
> <config>
> <pool handle="RANGER">
> <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> <profile namespace="globus" key="maxtime">7200</profile>
> <profile namespace="globus" key="maxWallTime">00:20:00</profile>
> <profile namespace="globus" key="jobsPerNode">1</profile>
> <profile namespace="globus" key="nodeGranularity">64</profile>
> <profile namespace="globus" key="maxNodes">256</profile>
> <profile namespace="globus" key="queue">development</profile>
> <profile namespace="karajan" key="jobThrottle">1.28</profile>
> <profile namespace="globus" key="project">TG-DBS080004N</profile>
> <profile namespace="globus" key="pe">16way</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <workdirectory>/work/00926/tg459516/swiftwork</workdirectory>
> </pool>
> </config>
> 
> 
> On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> 
> 
> On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote:
> >
> >
> > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > Is this with a persistent coaster service?
> >
> > admittedly i have not used persistent coaster service...should i?
> 
> No. I was just trying to figure out whether it might be something
> related to the persistent version.
> 
> 
> 
> 
> > i feel like it's documented *somewhere* (?)
> >
> > for now i've tried setting 'sitedir.keep=true' in the config so maybe it won't try to run the cleanup job...we'll see (waiting in q)
> >
> >
> >
> > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote:
> > >
> > >
> > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly <davidk at ci.uchicago.edu> wrote:
> > > >
> > > > That could be it.. maybe a cleanup script is not getting the right parameters and failing. Do you happen to have a copy of the coaster log?
> > >
> > > just put it in /home/skenny/swift_logs
> > >
> > >
> > > Maybe there will be some clues in there.
> > >
> > > ----- Original Message -----
> > > > > From: "Sarah Kenny" <skenny at uchicago.edu>
> > > > > To: "David Kelly" <davidk at ci.uchicago.edu>
> > > > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>, "Swift User" <swift-user at ci.uchicago.edu>, "Justin M Wozniak" <wozniak at mcs.anl.gov>
> > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM
> > > > > Subject: Re: [Swift-user] gram on ranger
> > >
> > > > > so, this workflow completes all the jobs but then just hangs indefinitely at the end...maybe a stray cleanup job?
> > > > >
> > > > > log is here:
> > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log
> > > > >
> > > > > just tweaked the sites file a bit from what david sent me:
> > > >
> > > > > <config>
> > > > > <pool handle="RANGER">
> > > > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > <profile namespace="globus" key="maxtime">28800</profile>
> > > > > <profile namespace="globus" key="maxWallTime">00:15:00</profile>
> > > > > <profile namespace="globus" key="jobsPerNode">1</profile>
> > > > > <profile namespace="globus" key="nodeGranularity">64</profile>
> > > > > <profile namespace="globus" key="maxNodes">256</profile>
> > > > > <profile namespace="globus" key="queue">normal</profile>
> > > > > <profile namespace="karajan" key="jobThrottle">1</profile>
> > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > > <workdirectory>/work/00043/tg457040/sidgrid_out/skenny</workdirectory>
> > > > > </pool>
> > > > > </config>
> > > >
> > > >
> > > >
> > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny <skenny at uchicago.edu> wrote:
> > > > >
> > > > > ok, thanks, got in the queue now...also, realized my last run may have been using the old swift. apparently i had SWIFT_HOME set in my env and that overrides the newer swift i had set in my PATH.
> > > >
> > > > ~sk
> > > >
> > > >
> > > >
> > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly <davidk at ci.uchicago.edu> wrote:
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Sarah,
> > > >
> > > > > Can you give this another try with the latest 0.93? I made some changes to the coaster and sge providers and was able to get it working with a simple catns script. Here is the configuration file I was using:
> > > >
> > > > > <config>
> > > > > <pool handle="ranger">
> > > > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > <profile namespace="globus" key="maxtime">3600</profile>
> > > > > <profile namespace="globus" key="maxWallTime">00:00:03</profile>
> > > > > <profile namespace="globus" key="jobsPerNode">1</profile>
> > > > > <profile namespace="globus" key="nodeGranularity">16</profile>
> > > > > <profile namespace="globus" key="maxNodes">16</profile>
> > > > > <profile namespace="globus" key="queue">development</profile>
> > > > > <profile namespace="karajan" key="jobThrottle">0.9</profile>
> > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > <workdirectory>/share/home/01503/davidkel/swiftwork</workdirectory>
> > > > > </pool>
> > > > > </config>
> > > >
> > > > Thanks,
> > > >
> > > > David
> > > >
> > > > ----- Original Message -----
> > > >
> > > > > > From: "Sarah Kenny" <skenny at uchicago.edu>
> > > > > > To: "Justin M Wozniak" <wozniak at mcs.anl.gov>
> > > > > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>, "Swift User" <swift-user at ci.uchicago.edu>
> > > > > > Sent: Friday, October 7, 2011 3:13:57 PM
> > > > > > Subject: Re: [Swift-user] gram on ranger
> > > >
> > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log
> > > > >
> > > > > on ci
> > > > >
> > > > >
> > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak <wozniak at mcs.anl.gov> wrote:
> > > > >
> > > > >
> > > > >
> > > > > Can I take a look at the log?
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote:
> > > > >
> > > > >
> > > > >
> > > > > > hey all, i'm trying to submit to gram on ranger using the latest swift (built from trunk). it fails like so:
> > > > > >
> > > > > > Cannot submit job
> > > > > > Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job
> > > > > > Caused by: org.globus.gram.GramException: Parameter not supported
> > > > > > Cannot submit job
> > > > > >
> > > > > > the gram log was saying first that 'jobsPerNode' is not supported so i changed it to workersPerNode and then it was saying 'maxnodes' is not supported. here's my sites file:
> > > > >
> > > > > > <config>
> > > > > > <pool handle="RANGER">
> > > > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > > > <profile namespace="karajan" key="jobThrottle">1</profile>
> > > > > > <profile namespace="globus" key="maxWallTime">00:15:00</profile>
> > > > > > <profile namespace="globus" key="maxTime">86400</profile>
> > > > > > <profile namespace="globus" key="slots">1</profile>
> > > > > > <profile namespace="globus" key="maxNodes">256</profile>
> > > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > > <profile namespace="globus" key="workersPerNode">1</profile>
> > > > > > <profile namespace="globus" key="nodeGranularity">64</profile>
> > > > > > <profile namespace="globus" key="queue">normal</profile>
> > > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > > <execution provider="coaster" jobManager="gt2:gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > <execution provider="gt2" jobManager="SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > <workdirectory>/work/00043/tg457040</workdirectory>
> > > > > > </pool>
> > > > > > </config>
> > > > >
> > > > > thoughts? ideas?
> > > > >
> > > > > --
> > > > > Justin M Wozniak
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sarah Kenny
> > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > > > > > University of California Irvine, Dept. of Neurology ~ 773-818-8300
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Swift-user mailing list
> > > > > Swift-user at ci.uchicago.edu
> > > > >
> > >
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Sarah Kenny
> > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224
> > Bio Sci III
> > > > University of California Irvine, Dept. of
> > Neurology ~
> > > 773-818-8300
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Sarah Kenny
> > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224
> > Bio Sci III
> > > > University of California Irvine, Dept. of
> > Neurology ~
> > > 773-818-8300
> > >
> > >
> > >
> > >
> > > --
> > > Sarah Kenny
> > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > > University of California Irvine, Dept. of Neurology ~
> > 773-818-8300
> > >
> > > _______________________________________________
> > > Swift-user mailing list
> > > Swift-user at ci.uchicago.edu
> > >
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> >
> >
> >
> >
> >
> >
> > --
> > Sarah Kenny
> > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> > University of California Irvine, Dept. of Neurology ~ 773-818-8300
> >
> 
> 
> 
> 
> 
> --
> Sarah Kenny
> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> University of California Irvine, Dept. of Neurology ~ 773-818-8300
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



