[Swift-user] [Swift-devel] gram on ranger

Sarah Kenny skenny at uchicago.edu
Sat Oct 22 05:57:45 CDT 2011


fyi, this works on a smaller workflow, we've run it several times on a 50k
version.

On Thu, Oct 20, 2011 at 8:21 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:

> Thanks, Ketan. If I understand you correctly, then I would consider this a
> Swift bug, in that maxnodes should always mean *nodes*, for every type of
> resource provider including SGE.  Based on what you say, the SGE provider is
> in this case treating the requested maxnode count as cores (assuming Anjali
> was running the same Swift revision as you were testing on here).
>
> But then that might not explain the error in the log that Sarah posted.
>
> It seems the next step is to try the run on a smaller job (we can test this
> ourselves), and see if we can replicate and diagnose the error, with SGE
> submit files and output/error logs.
>
> David, can you do this, since you were working on SGE testing last week?
> You and Ketan should share what you know about the situation, via
> swift-devel, as Ketan is also running on Ranger with persistent coasters I
> think.
>
> Thanks,
>
> Mike
>
>
> ----- Original Message -----
> > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > Cc: "Sarah Kenny" <skenny at uchicago.edu>, "Anjali Raja" <anjraja at gmail.com>,
> > "Swift Devel" <swift-devel at ci.uchicago.edu>, "Swift User" <swift-user at ci.uchicago.edu>
> > Sent: Thursday, October 20, 2011 9:54:33 AM
> > Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> > On Thu, Oct 20, 2011 at 7:50 AM, Michael Wilde < wilde at mcs.anl.gov >
> > wrote:
> >
> >
> > Hi Sarah, Anjali,
> >
> > My initial theory on what's failing in this job is that the Ranger
> > development queue is limited to jobs of 16 nodes or less. (The Ranger
> > User Guide says maxprocs 256 for that queue, and qconf -sq development
> > says slots 16, which agrees). So you need to either change to one of
> > the production queues (normal, long etc) or reduce the values of
> > maxnode and nodegranularity.
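
A quick way to catch this kind of mismatch before submitting is to compare the sites file against the queue limit. The following is only a sketch: the 16-node figure for the development queue is the one quoted above, while the names `QUEUE_NODE_LIMITS`, `check_pool` and `check_sites` are made up for illustration and are not part of Swift.

```python
import xml.etree.ElementTree as ET

# Node limits per queue. Only the "development" value comes from this thread;
# other queues would have to be filled in from the site documentation.
QUEUE_NODE_LIMITS = {"development": 16}


def check_pool(pool):
    """Return a list of warnings for one <pool> element of a sites file."""
    # Key names are lower-cased because the thread uses both maxNodes and maxnodes.
    profiles = {
        p.get("key").lower(): (p.text or "").strip()
        for p in pool.findall("profile")
        if p.get("key")
    }
    queue = profiles.get("queue")
    limit = QUEUE_NODE_LIMITS.get(queue)
    warnings = []
    if limit is None:
        return warnings  # no known limit for this queue, nothing to check
    for key in ("maxnodes", "nodegranularity"):
        value = profiles.get(key)
        if value is not None and int(value) > limit:
            warnings.append(
                f"{key}={value} exceeds the {limit}-node limit of queue '{queue}'"
            )
    return warnings


def check_sites(xml_text):
    """Check every <pool> in a sites.xml document given as a string."""
    root = ET.fromstring(xml_text)
    return [w for pool in root.findall("pool") for w in check_pool(pool)]


# Reduced copy of the sites file posted in this thread.
SITES = """<config>
<pool handle="RANGER">
<profile namespace="globus" key="nodeGranularity">64</profile>
<profile namespace="globus" key="maxNodes">256</profile>
<profile namespace="globus" key="queue">development</profile>
</pool>
</config>"""

for warning in check_sites(SITES):
    print(warning)
```

With the values posted in this thread (maxNodes 256, nodeGranularity 64, queue development) the check flags both settings as over the limit.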
> >
> >
> >
> > I have a little confusion here: the desired line in the final SGE
> > submit script should be: #$ -pe <n>way 256; in order to have 256 procs.
> > However, putting maxnodes=16 in sites.xml results in the following
> > line in the script:
> > #$ -pe <n>way 16;
> > I understand this number 16/256 is for procs since, when putting 256
> > with development queue, ranger indeed allows the job to run in
> > development queue.
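
To make the nodes-versus-cores arithmetic explicit, here is a small sketch. It assumes 16 cores per Ranger node (consistent with the 16-node / 256-proc development limit mentioned above) and that the number after <n>way is a total core count, which is what the observation above suggests. `pe_directive` is a hypothetical helper, not Swift code.

```python
# Assumption: 16 cores per node, so 16 nodes * 16 cores = 256 procs,
# matching the development-queue limits quoted in this thread.
CORES_PER_NODE = 16


def pe_directive(nodes, ways=16, cores_per_node=CORES_PER_NODE):
    """Build the SGE parallel-environment line for a request of `nodes` whole nodes."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    total_cores = nodes * cores_per_node
    return f"#$ -pe {ways}way {total_cores}"


print(pe_directive(16))
```

Under these assumptions maxnodes=16 would translate to `#$ -pe 16way 256`, rather than the `#$ -pe <n>way 16` seen in the generated script.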
> >
> >
> >
> > I would also suggest (unless you have already done this) that you test
> > first on a very small run (like a single RInvoke app call) and then
> > scale up to just a few voxels per dataset before trying such a large
> > run. Have you already tested that?
> >
> > Lastly, when reporting problems like this, the swift standard
> > output/err is also very helpful to get a higher-level view of what
> > went wrong.
> >
> > Swift needs to clearly return errors from the local resource provider,
> > which it doesn't seem to be doing here. I've filed this as bug 593 and
> > assigned it to David.
> >
> > Please let us know if changing the queue and/or slots resolves the
> > problem. As mentioned in the bug report I think you can set debug=true
> > (or yes?) in the provider-sge.properties file and get swift to
> > preserve the output from SGE in ~/.globus/scripts. (In fact that may
> > already be preserved, I am not sure). Please check there to see if the
> > SGE error is there.
> >
> > Thanks,
> >
> > - Mike
> >
> >
> >
> > ----- Original Message -----
> > > From: "Sarah Kenny" < skenny at uchicago.edu >
> > > To: "Mihael Hategan" < hategan at mcs.anl.gov >
> > > Cc: "Anjali Raja" < anjraja at gmail.com >, "Swift Devel" < swift-devel at ci.uchicago.edu >,
> > > "Swift User" < swift-user at ci.uchicago.edu >
> > > Sent: Thursday, October 20, 2011 6:07:09 AM
> > > Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> > >
> > > hi all, one of our users, anjali (cc'd here) is trying to submit this
> > > ~400k job workflow to ranger...thought i'd see if you felt like having
> > > a look :)
> > >
> > > log is here:
> > > /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log
> > >
> > > sites file:
> > >
> > > <config>
> > > <pool handle="RANGER">
> > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > <profile namespace="globus" key="maxtime">7200</profile>
> > > <profile namespace="globus" key="maxWallTime">00:20:00</profile>
> > > <profile namespace="globus" key="jobsPerNode">1</profile>
> > > <profile namespace="globus" key="nodeGranularity">64</profile>
> > > <profile namespace="globus" key="maxNodes">256</profile>
> > > <profile namespace="globus" key="queue">development</profile>
> > > <profile namespace="karajan" key="jobThrottle">1.28</profile>
> > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > <profile namespace="globus" key="pe">16way</profile>
> > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > <workdirectory>/work/00926/tg459516/swiftwork</workdirectory>
> > > </pool>
> > > </config>
> > >
> > >
> > > On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan < hategan at mcs.anl.gov > wrote:
> > >
> > >
> > >
> > > On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote:
> > > >
> > > >
> > > > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan <
> > > > hategan at mcs.anl.gov >
> > > > wrote:
> > > > Is this with a persistent coaster service?
> > > >
> > > > admittedly i have not used persistent coaster service...should i?
> > >
> > > No. I was just trying to figure out whether it might be something
> > > related to the persistent version.
> > >
> > >
> > >
> > >
> > > > i feel like it's documented *somewhere* (?)
> > > >
> > > > for now i've tried setting 'sitedir.keep=true' in the config so maybe
> > > > it won't try to run the cleanup job...we'll see (waiting in q)
> > > >
> > > >
> > > >
> > > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote:
> > > > >
> > > > >
> > > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly < davidk at ci.uchicago.edu > wrote:
> > > > > >
> > > > > > That could be it.. maybe a cleanup script is not getting the
> > > > > > right parameters and failing. Do you happen to have a copy of
> > > > > > the coaster log?
> > > > >
> > > > > just put it in /home/skenny/swift_logs
> > > > >
> > > > >
> > > > > Maybe there will be some clues in there.
> > > > >
> > > > > ----- Original Message -----
> > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu >
> > > > > > > To: "David Kelly" < davidk at ci.uchicago.edu >
> > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, "Swift User" < swift-user at ci.uchicago.edu >,
> > > > > > > "Justin M Wozniak" < wozniak at mcs.anl.gov >
> > > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM
> > > > > > > Subject: Re: [Swift-user] gram on ranger
> > > > > > >
> > > > > > > so, this workflow completes all the jobs but then just hangs
> > > > > > > indefinitely at the end...maybe a stray cleanup job?
> > > > > > >
> > > > > > > log is here:
> > > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log
> > > > > > >
> > > > > > > just tweaked the sites file a bit from what david sent me:
> > > > > >
> > > > > > > <config>
> > > > > > > <pool handle="RANGER">
> > > > > > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > > > <profile namespace="globus" key="maxtime">28800</profile>
> > > > > > > <profile namespace="globus" key="maxWallTime">00:15:00</profile>
> > > > > > > <profile namespace="globus" key="jobsPerNode">1</profile>
> > > > > > > <profile namespace="globus" key="nodeGranularity">64</profile>
> > > > > > > <profile namespace="globus" key="maxNodes">256</profile>
> > > > > > > <profile namespace="globus" key="queue">normal</profile>
> > > > > > > <profile namespace="karajan" key="jobThrottle">1</profile>
> > > > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > > > > <workdirectory>/work/00043/tg457040/sidgrid_out/skenny</workdirectory>
> > > > > > > </pool>
> > > > > > > </config>
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < skenny at uchicago.edu > wrote:
> > > > > >
> > > > > >
> > > > > > > ok, thanks, got in the queue now...also, realized my last run may have
> > > > > > > been using the old swift. apparently i had SWIFT_HOME set in my env
> > > > > > > and that overrides the newer swift i had set in my PATH.
> > > > > >
> > > > > > ~sk
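
A small check can surface this kind of SWIFT_HOME / PATH mismatch before a run. This is only a sketch: it assumes the launcher lives at $SWIFT_HOME/bin/swift, which may not hold for every install, and `swift_conflict` is a hypothetical helper name.

```python
import os
import shutil


def swift_conflict(environ, swift_on_path):
    """Return a warning string if SWIFT_HOME points away from the swift found on PATH.

    `environ` is a mapping such as os.environ; `swift_on_path` is the launcher
    path found on PATH (or None if there is none).
    """
    home = environ.get("SWIFT_HOME")
    if not home or not swift_on_path:
        return None  # nothing to compare
    # Assumption: the launcher is installed under $SWIFT_HOME/bin.
    home_bin = os.path.realpath(os.path.join(home, "bin"))
    path_bin = os.path.dirname(os.path.realpath(swift_on_path))
    if home_bin != path_bin:
        return f"SWIFT_HOME={home} does not match swift on PATH ({swift_on_path})"
    return None


# Prints None when no mismatch is detected in the current environment.
print(swift_conflict(os.environ, shutil.which("swift")))
```

Passing the environment and the PATH lookup as arguments keeps the comparison easy to try with made-up values.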
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < davidk at ci.uchicago.edu > wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Sarah,
> > > > > >
> > > > > > > Can you give this another try with the latest 0.93? I made some
> > > > > > > changes to the coaster and sge providers and was able to get it
> > > > > > > working with a simple catns script. Here is the configuration file I
> > > > > > > was using:
> > > > > >
> > > > > > > <config>
> > > > > > > <pool handle="ranger">
> > > > > > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > > > <profile namespace="globus" key="maxtime">3600</profile>
> > > > > > > <profile namespace="globus" key="maxWallTime">00:00:03</profile>
> > > > > > > <profile namespace="globus" key="jobsPerNode">1</profile>
> > > > > > > <profile namespace="globus" key="nodeGranularity">16</profile>
> > > > > > > <profile namespace="globus" key="maxNodes">16</profile>
> > > > > > > <profile namespace="globus" key="queue">development</profile>
> > > > > > > <profile namespace="karajan" key="jobThrottle">0.9</profile>
> > > > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > > > <workdirectory>/share/home/01503/davidkel/swiftwork</workdirectory>
> > > > > > > </pool>
> > > > > > > </config>
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > David
> > > > > >
> > > > > > ----- Original Message -----
> > > > > >
> > > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu >
> > > > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov >
> > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, "Swift User" < swift-user at ci.uchicago.edu >
> > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM
> > > > > > > > Subject: Re: [Swift-user] gram on ranger
> > > > > > > >
> > > > > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log
> > > > > > >
> > > > > > > on ci
> > > > > > >
> > > > > > >
> > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak < wozniak at mcs.anl.gov > wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Can I take a look at the log?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > hey all, i'm trying to submit to gram on ranger using the latest
> > > > > > > > swift (built from trunk). it fails like so:
> > > > > > > >
> > > > > > > > Cannot submit job
> > > > > > > > Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job
> > > > > > > > Caused by: org.globus.gram.GramException: Parameter not supported
> > > > > > > > Cannot submit job
> > > > > > > >
> > > > > > > > the gram log was saying first that 'jobsPerNode' is not supported so i
> > > > > > > > changed it to workersPerNode and then it was saying 'maxnodes' is not
> > > > > > > > supported. here's my sites file:
> > > > > > >
> > > > > > > > <config>
> > > > > > > > <pool handle="RANGER">
> > > > > > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > > > > > <profile namespace="karajan" key="jobThrottle">1</profile>
> > > > > > > > <profile namespace="globus" key="maxWallTime">00:15:00</profile>
> > > > > > > > <profile namespace="globus" key="maxTime">86400</profile>
> > > > > > > > <profile namespace="globus" key="slots">1</profile>
> > > > > > > > <profile namespace="globus" key="maxNodes">256</profile>
> > > > > > > > <profile namespace="globus" key="pe">16way</profile>
> > > > > > > > <profile namespace="globus" key="workersPerNode">1</profile>
> > > > > > > > <profile namespace="globus" key="nodeGranularity">64</profile>
> > > > > > > > <profile namespace="globus" key="queue">normal</profile>
> > > > > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > > > > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > > > > > > > <execution provider="coaster" jobManager="gt2:gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > > > <execution provider="gt2" jobManager="SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > > > > > > <workdirectory>/work/00043/tg457040</workdirectory>
> > > > > > > > </pool>
> > > > > > > > </config>
> > > > > > >
> > > > > > > thoughts? ideas?
> > > > > > >
> > > > > > > --
> > > > > > > Justin M Wozniak
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Sarah Kenny
> > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224
> > > > Bio Sci
> > > > > III
> > > > > > > University of California Irvine, Dept. of
> > > > Neurology ~
> > > > > 773-818-8300
> > > > > > >
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Swift-user mailing list
> > > > > > > Swift-user at ci.uchicago.edu
> > > > > > >
> > > > >
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> > --
> > Ketan
>


-- 
Sarah Kenny
Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
University of California Irvine, Dept. of Neurology ~ 773-818-8300