<br><div class="gmail_quote">On Thu, Oct 20, 2011 at 7:50 AM, Michael Wilde <span dir="ltr"><<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi Sarah, Anjali,<br>
<br>
My initial theory on what's failing in this job is that the Ranger development queue is limited to jobs of 16 nodes or fewer. (The Ranger User Guide says maxprocs 256 for that queue, and qconf -sq development says slots 16, which agrees.) So you need to either change to one of the production queues (normal, long, etc.) or reduce the values of maxNodes and nodeGranularity.<br>
</blockquote><div><br></div><div>I am a little confused here. To get 256 procs, the line in the final generated SGE submit script should be: #$ -pe <n>way 256. However, setting maxNodes=16 in sites.xml produces the following line in the script instead:</div>
<div>#$ -pe <n>way 16</div><div>My understanding is that this number (16 or 256) counts procs, not nodes, because when I set it to 256 with the development queue, Ranger does allow the job to run in that queue.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
I would also suggest (unless you have already done this) that you test first on a very small run (like a single RInvoke app call) and then scale up to just a few voxels per dataset before trying such a large run. Have you already tested that?<br>
<br>
Lastly, when reporting problems like this, the swift standard output/err is also very helpful to get a higher-level view of what went wrong.<br>
<br>
Swift needs to clearly return errors from the local resource provider, which it doesn't seem to be doing here. I've filed this as bug 593 and assigned it to David.<br>
<br>
Please let us know if changing the queue and/or slots resolves the problem. As mentioned in the bug report I think you can set debug=true (or yes?) in the provider-sge.properties file and get swift to preserve the output from SGE in ~/.globus/scripts. (In fact that may already be preserved, I am not sure). Please check there to see if the SGE error is there.<br>
<br>
Thanks,<br>
<br>
- Mike<br>
<div class="im"><br>
<br>
----- Original Message -----<br>
> From: "Sarah Kenny" <<a href="mailto:skenny@uchicago.edu">skenny@uchicago.edu</a>><br>
</div><div class="im">> To: "Mihael Hategan" <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
> Cc: "Anjali Raja" <<a href="mailto:anjraja@gmail.com">anjraja@gmail.com</a>>, "Swift Devel" <<a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>>, "Swift User"<br>
> <<a href="mailto:swift-user@ci.uchicago.edu">swift-user@ci.uchicago.edu</a>><br>
</div>> Sent: Thursday, October 20, 2011 6:07:09 AM<br>
> Subject: Re: [Swift-devel] [Swift-user] gram on ranger<br>
<div class="im">> hi all, one of our users, anjali (cc'd here) is trying to submit this<br>
> ~400k job workflow to ranger...thought i'd see if you felt like having<br>
> a look :)<br>
><br>
> log is here:<br>
> /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log<br>
><br>
> sites file:<br>
><br>
> <config><br>
> <pool handle="RANGER"><br>
> <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/><br>
</div><div><div></div><div class="h5">> <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/><br>
> <profile namespace="globus" key="maxtime">7200</profile><br>
> <profile namespace="globus" key="maxWallTime">00:20:00</profile><br>
> <profile namespace="globus" key="jobsPerNode">1</profile><br>
> <profile namespace="globus" key="nodeGranularity">64</profile><br>
> <profile namespace="globus" key="maxNodes">256</profile><br>
> <profile namespace="globus" key="queue">development</profile><br>
> <profile namespace="karajan" key="jobThrottle">1.28</profile><br>
> <profile namespace="globus" key="project">TG-DBS080004N</profile><br>
> <profile namespace="globus" key="pe">16way</profile><br>
> <profile namespace="karajan" key="initialScore">10000</profile><br>
> <workdirectory>/work/00926/tg459516/swiftwork</workdirectory><br>
> </pool><br>
> </config><br>
><br>
><br>
> On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan < <a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a><br>
> > wrote:<br>
><br>
><br>
><br>
> On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote:<br>
> ><br>
> ><br>
> > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan <<br>
> > <a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a> ><br>
> > wrote:<br>
> > Is this with a persistent coaster service?<br>
> ><br>
> > admittedly i have not used persistent coaster service...should i?<br>
><br>
> No. I was just trying to figure out whether it might be something<br>
> related to the persistent version.<br>
><br>
><br>
><br>
><br>
> > i feel like it's documented *somewhere* (?)<br>
> ><br>
> > > for now i've tried setting 'sitedir.keep=true' in the config so maybe it won't try to run the cleanup job...we'll see (waiting in q)<br>
> ><br>
> ><br>
> ><br>
> > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote:<br>
> > ><br>
> > ><br>
> > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly<br>
> > < <a href="mailto:davidk@ci.uchicago.edu">davidk@ci.uchicago.edu</a> ><br>
> > > wrote:<br>
> > ><br>
> > > > That could be it... maybe a cleanup script is not getting the right parameters and failing. Do you happen to have a copy of the coaster log?<br>
> > ><br>
> > > just put it in /home/skenny/swift_logs<br>
> > ><br>
> > ><br>
> > > Maybe there will be some clues in there.<br>
> > ><br>
> > > ----- Original Message -----<br>
> > > > From: "Sarah Kenny" < <a href="mailto:skenny@uchicago.edu">skenny@uchicago.edu</a> ><br>
> > ><br>
> > > > To: "David Kelly" < <a href="mailto:davidk@ci.uchicago.edu">davidk@ci.uchicago.edu</a> ><br>
> > > > Cc: "Swift Devel" < <a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a> >,<br>
> > "Swift<br>
> > > User" < <a href="mailto:swift-user@ci.uchicago.edu">swift-user@ci.uchicago.edu</a> >, "Justin M<br>
> > Wozniak"<br>
> > > > < <a href="mailto:wozniak@mcs.anl.gov">wozniak@mcs.anl.gov</a> ><br>
> > ><br>
> > > > Sent: Tuesday, October 11, 2011 1:32:37 PM<br>
> > > > Subject: Re: [Swift-user] gram on ranger<br>
> > ><br>
> > > > > so, this workflow completes all the jobs but then just hangs indefinitely at the end...maybe a stray cleanup job?<br>
> > > ><br>
> > > > log is here:<br>
> > > ><br>
> > ><br>
> > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log<br>
> > > ><br>
> > > > just tweaked the sites file a bit from what david<br>
> > sent me:<br>
> > > ><br>
> > > > > <config><br>
> > > > > <pool handle="RANGER"><br>
> > > > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/><br>
> > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/><br>
> > > > > <profile namespace="globus" key="maxtime">28800</profile><br>
> > > > > <profile namespace="globus" key="maxWallTime">00:15:00</profile><br>
> > > > > <profile namespace="globus" key="jobsPerNode">1</profile><br>
> > > > > <profile namespace="globus" key="nodeGranularity">64</profile><br>
> > > > > <profile namespace="globus" key="maxNodes">256</profile><br>
> > > > > <profile namespace="globus" key="queue">normal</profile><br>
> > > > > <profile namespace="karajan" key="jobThrottle">1</profile><br>
> > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile><br>
> > > > > <profile namespace="globus" key="pe">16way</profile><br>
> > > > > <profile namespace="karajan" key="initialScore">10000</profile><br>
> > > > > <workdirectory>/work/00043/tg457040/sidgrid_out/skenny</workdirectory><br>
> > > > > </pool><br>
> > > > > </config><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny <<br>
> > > <a href="mailto:skenny@uchicago.edu">skenny@uchicago.edu</a> ><br>
> > > > wrote:<br>
> > > ><br>
> > > ><br>
> > > > > ok, thanks, got in the queue now...also, realized my last run may have been using the old swift. apparently i had SWIFT_HOME set in my env and that overrides the newer swift i had set in my PATH.<br>
> > > ><br>
> > > > ~sk<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly <<br>
> > > <a href="mailto:davidk@ci.uchicago.edu">davidk@ci.uchicago.edu</a><br>
> > > > > wrote:<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > Sarah,<br>
> > > ><br>
> > > > > Can you give this another try with the latest 0.93? I made some changes to the coaster and sge providers and was able to get it working with a simple catns script. Here is the configuration file I was using:<br>
> > > ><br>
> > > > > <config><br>
> > > > > <pool handle="ranger"><br>
> > > > > <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/><br>
> > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/><br>
> > > > > <profile namespace="globus" key="maxtime">3600</profile><br>
> > > > > <profile namespace="globus" key="maxWallTime">00:00:03</profile><br>
> > > > > <profile namespace="globus" key="jobsPerNode">1</profile><br>
> > > > > <profile namespace="globus" key="nodeGranularity">16</profile><br>
> > > > > <profile namespace="globus" key="maxNodes">16</profile><br>
> > > > > <profile namespace="globus" key="queue">development</profile><br>
> > > > > <profile namespace="karajan" key="jobThrottle">0.9</profile><br>
> > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile><br>
> > > > > <profile namespace="globus" key="pe">16way</profile><br>
> > > > > <workdirectory>/share/home/01503/davidkel/swiftwork</workdirectory><br>
> > > > > </pool><br>
> > > > > </config><br>
> > > ><br>
> > > > Thanks,<br>
> > > ><br>
> > > > David<br>
> > > ><br>
> > > > ----- Original Message -----<br>
> > > ><br>
> > > > > From: "Sarah Kenny" < <a href="mailto:skenny@uchicago.edu">skenny@uchicago.edu</a> ><br>
> > > > > To: "Justin M Wozniak" < <a href="mailto:wozniak@mcs.anl.gov">wozniak@mcs.anl.gov</a> ><br>
> > > > > Cc: "Swift Devel" < <a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a><br>
> > >, "Swift<br>
> > > User" <<br>
> > > > > <a href="mailto:swift-user@ci.uchicago.edu">swift-user@ci.uchicago.edu</a> ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > > Sent: Friday, October 7, 2011 3:13:57 PM<br>
> > > > > Subject: Re: [Swift-user] gram on ranger<br>
> > > ><br>
> > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log<br>
> > > > ><br>
> > > > > on ci<br>
> > > > ><br>
> > > > ><br>
> > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak<br>
> > <<br>
> > > > > <a href="mailto:wozniak@mcs.anl.gov">wozniak@mcs.anl.gov</a><br>
> > > > > > wrote:<br>
> > > > ><br>
> > > > ><br>
> > > > ><br>
> > > > > Can I take a look at the log?<br>
> > > > ><br>
> > > > ><br>
> > > > ><br>
> > > > ><br>
> > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote:<br>
> > > > ><br>
> > > > ><br>
> > > > ><br>
> > > > > > hey all, i'm trying to submit to gram on ranger using the latest swift (built from trunk). it fails like so:<br>
> > > > ><br>
> > > > > > Cannot submit job<br>
> > > > > > Caused by:<br>
> > > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job<br>
> > > > > > Caused by: org.globus.gram.GramException: Parameter not supported<br>
> > > > > > Cannot submit job<br>
> > > > ><br>
> > > > > > the gram log was saying first that 'jobsPerNode' is not supported so i changed it to workersPerNode and then it was saying 'maxnodes' is not supported. here's my sites file:<br>
> > > > ><br>
> > > > > > <config><br>
> > > > > > <pool handle="RANGER"><br>
> > > > > > <profile namespace="karajan" key="initialScore">10000</profile><br>
> > > > > > <profile namespace="karajan" key="jobThrottle">1</profile><br>
> > > > > > <profile namespace="globus" key="maxWallTime">00:15:00</profile><br>
> > > > > > <profile namespace="globus" key="maxTime">86400</profile><br>
> > > > > > <profile namespace="globus" key="slots">1</profile><br>
> > > > > > <profile namespace="globus" key="maxNodes">256</profile><br>
> > > > > > <profile namespace="globus" key="pe">16way</profile><br>
> > > > > > <profile namespace="globus" key="workersPerNode">1</profile><br>
> > > > > > <profile namespace="globus" key="nodeGranularity">64</profile><br>
> > > > > > <profile namespace="globus" key="queue">normal</profile><br>
> > > > > > <profile namespace="globus" key="project">TG-DBS080004N</profile><br>
> > > > > > <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/><br>
> > > > > > <execution provider="coaster" jobManager="gt2:gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/><br>
> > > > > > <execution provider="gt2" jobManager="SGE" url="gatekeeper.ranger.tacc.teragrid.org"/><br>
> > > > > > <workdirectory>/work/00043/tg457040</workdirectory><br>
> > > > > > </pool><br>
> > > > > > </config><br>
> > > > ><br>
> > > > > thoughts? ideas?<br>
> > > > ><br>
> > > > > --<br>
> > > > > Justin M Wozniak<br>
> > > > ><br>
> > > > ><br>
> > > > ><br>
> > > > > --<br>
> > > > > Sarah Kenny<br>
> > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224<br>
> > Bio Sci<br>
> > > III<br>
> > > > > University of California Irvine, Dept. of<br>
> > Neurology ~<br>
> > > 773-818-8300<br>
> > > > ><br>
> > > > ><br>
> > > > > _______________________________________________<br>
> > > > > Swift-user mailing list<br>
> > > > > <a href="mailto:Swift-user@ci.uchicago.edu">Swift-user@ci.uchicago.edu</a><br>
> > > > ><br>
> > ><br>
> > <a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user</a><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
</div></div><div class="im">> _______________________________________________<br>
> Swift-devel mailing list<br>
> <a href="mailto:Swift-devel@ci.uchicago.edu">Swift-devel@ci.uchicago.edu</a><br>
> <a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel</a><br>
<br>
</div><font color="#888888">--<br>
Michael Wilde<br>
Computation Institute, University of Chicago<br>
Mathematics and Computer Science Division<br>
Argonne National Laboratory<br>
</font><div><div></div><div class="h5"><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Ketan<br><br><br>
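<div>P.S. Here is a rough sketch of the arithmetic behind my question at the top of this message. It assumes the trailing number on the -pe line is a slot (core) count and that a Ranger node has 16 cores, which is what the "maxprocs 256" and "16 nodes" figures quoted above imply. The function names are made up for illustration; this is not how the Swift SGE provider actually builds the line.</div>

```python
# Assumption: 16 cores per node, inferred from 256 procs / 16 nodes in this thread.
CORES_PER_NODE = 16


def pe_directive(nodes, tasks_per_node=16, cores_per_node=CORES_PER_NODE):
    """Build an SGE '-pe' line whose trailing number is a slot count, not a node count."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if tasks_per_node not in range(1, cores_per_node + 1):
        raise ValueError("tasks_per_node must be between 1 and cores_per_node")
    slots = nodes * cores_per_node  # whole nodes are requested, so count every core
    return "#$ -pe {}way {}".format(tasks_per_node, slots)


def nodes_for_slots(slots, cores_per_node=CORES_PER_NODE):
    """Number of whole nodes needed to cover a slot count (ceiling division)."""
    return -(-slots // cores_per_node)


if __name__ == "__main__":
    print(pe_directive(16))      # what a 16-node request would look like
    print(nodes_for_slots(16))   # nodes covered by a trailing value of 16
    print(nodes_for_slots(256))  # nodes covered by a trailing value of 256
```

<div>If that reading is right, a trailing value of 16 asks for a single node and 256 asks for 16 nodes, which would explain why 256 is accepted in the development queue.</div>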