[Swift-devel] swift on ranger

Michael Wilde wilde at mcs.anl.gov
Wed Dec 21 09:48:16 CST 2011


In the incident Sarah reported, can you tell from the log whether the coaster provider generated an ill-formed job request, perhaps as the script was completing? I.e., something that either exceeded the SGE limits or had, e.g., a zero-node request?
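
Once the submit files are available, one way to check this offline is to scan them for the parallel-environment request. The following is a minimal sketch, not part of Swift: it assumes the generated SGE submit file carries the request as a "#$ -pe <pe_name> <n>" directive (the syntax shown in the JSV message quoted below), and the name check_pe_requests is hypothetical.

```python
import re

# Matches an SGE parallel-environment directive such as "#$ -pe 16way 64".
# Assumption: the submit file uses the standard "#$" directive prefix.
PE_DIRECTIVE = re.compile(r"^#\$\s+-pe\s+(\S+)\s+(\d+)\s*$")

# The multiple of 16 comes from the JSV rejection message in this thread.
SLOT_MULTIPLE = 16


def check_pe_requests(submit_text, multiple=SLOT_MULTIPLE):
    """Return (line_number, pe_name, slots, problem) for each bad -pe request."""
    problems = []
    for lineno, line in enumerate(submit_text.splitlines(), start=1):
        match = PE_DIRECTIVE.match(line.strip())
        if not match:
            continue  # not a -pe directive
        pe_name, slots = match.group(1), int(match.group(2))
        if slots == 0:
            problems.append((lineno, pe_name, slots, "zero-slot request"))
        elif slots % multiple != 0:
            problems.append(
                (lineno, pe_name, slots, "not a multiple of %d" % multiple)
            )
    return problems
```

Running it over the text of each submit file from the run would show whether any request was zero or not a multiple of 16.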

- Mike

----- Original Message -----
> From: "David Kelly" <davidk at ci.uchicago.edu>
> To: "Justin M Wozniak" <wozniak at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Wednesday, December 21, 2011 9:44:59 AM
> Subject: Re: [Swift-devel] swift on ranger
> Yep, very good idea.
> 
> ----- Original Message -----
> > From: "Justin M Wozniak" <wozniak at mcs.anl.gov>
> > To: "David Kelly" <davidk at ci.uchicago.edu>
> > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Wednesday, December 21, 2011 9:40:22 AM
> > Subject: Re: [Swift-devel] swift on ranger
> > Should we provide an option to copy the submit file text into the
> > log?
> >
> > On Wed, 21 Dec 2011, David Kelly wrote:
> >
> > > Sarah,
> > >
> > > > Could you please send the submit files that were generated from this
> > > > run? That should help narrow it down a bit.
> > >
> > > Thanks,
> > > David
> > >
> > > ----- Original Message -----
> > >> From: "Sarah Kenny" <skenny at uci.edu>
> > >> To: "Swift Devel" <swift-devel at ci.uchicago.edu>, "Swift User"
> > >> <swift-user at ci.uchicago.edu>
> > >> Sent: Wednesday, December 21, 2011 6:57:32 AM
> > >> Subject: [Swift-devel] swift on ranger
> > > >> getting this when submitting to ranger with both the latest and our
> > > >> previous version of swift (swift-r5259 cog-r3313)
> > >>
> > > >> Final status: time: Wed, 21 Dec 2011 04:49:15 -0800 Finished successfully:100
> > > >> The following warnings have occurred:
> > > >> 1. org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job: Could not submit job (qsub reported an exit code of 1).
> > >> --------------------------------------------------------------------------
> > >> Welcome to TACC's Ranger System, an NSF XD Resource
> > >> ---------------------------------------------------------------------------->
> > >> Checking that you specified -V...--> Checking that you specified
> > >> a
> > >> time limit...--> Checking that you specified a queue...-->
> > >> Setting
> > >> project...--> Checking that you specified a parallel
> > >> environment...-->
> > >> Checking that you specified a valid parallel environment
> > >> name...-->
> > >> Checking that the minimum and maximum PE counts are the
> > >> same...-->
> > >> Checking that the number of PEs requested is
> > >> valid...------------------> Rejecting job <------------------Your
> > >> slot
> > >> (or core) request is not a multiple of 16.Syntax: -pe <pe_name>
> > >> <n>where <n> is a multiple of
> > >> 16.-----------------------------------------------------
> > >> Unable to run job: JSV rejected job.Exiting.
> > >>
> > > >> at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63)
> > > >> at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:45)
> > > >> at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:57)
> > > >> at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40)
> > > >> Caused by: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Could not submit job (qsub reported an exit code of 1).
> > >> --------------------------------------------------------------------------
> > >> Welcome to TACC's Ranger System, an NSF XD Resource
> > >> ---------------------------------------------------------------------------->
> > >> Checking that you specified -V...--> Checking that you specified
> > >> a
> > >> time limit...--> Checking that you specified a queue...-->
> > >> Setting
> > >> project...--> Checking that you specified a parallel
> > >> environment...-->
> > >> Checking that you specified a valid parallel environment
> > >> name...-->
> > >> Checking that the minimum and maximum PE counts are the
> > >> same...-->
> > >> Checking that the number of PEs requested is
> > >> valid...------------------> Rejecting job <------------------Your
> > >> slot
> > >> (or core) request is not a multiple of 16.Syntax: -pe <pe_name>
> > >> <n>where <n> is a multiple of
> > >> 16.-----------------------------------------------------
> > >> Unable to run job: JSV rejected job.Exiting.
> > >>
> > > >> at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:108)
> > > >> at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53)
> > >> ... 3 more
> > >>
> > >> ################### sites file
> > >>
> > >> <config>
> > >> <pool handle="RANGER">
> > > >> <execution provider="coaster" jobManager="gt2:SGE" url="gatekeeper.ranger.tacc.teragrid.org"/>
> > > >> <filesystem provider="gsiftp" url="gsiftp://gridftp.ranger.tacc.teragrid.org"/>
> > >> <profile namespace="globus" key="maxtime">86400</profile>
> > >> <profile namespace="globus" key="maxWallTime">02:00:00</profile>
> > >> <profile namespace="globus" key="jobsPerNode">1</profile>
> > >> <profile namespace="globus" key="nodeGranularity">64</profile>
> > >> <profile namespace="globus" key="maxNodes">4096</profile>
> > >> <profile namespace="globus" key="queue">normal</profile>
> > >> <profile namespace="karajan" key="jobThrottle">1.28</profile>
> > >> <profile namespace="globus" key="project">TG-DBS080004N</profile>
> > >> <profile namespace="globus" key="pe">16way</profile>
> > >> <profile namespace="karajan" key="initialScore">10000</profile>
> > >> <workdirectory>/work/00043/tg457040/swiftwork</workdirectory>
> > >> </pool>
> > >> </config>
> > >>
> > > >> same settings we've been using for a while, i'm not sure why this
> > > >> seems to be popping up now, but it's rather consistent. all jobs are
> > > >> finishing successfully, so it's rather confusing...any idea what i
> > > >> might be missing here?
> > >>
> > >> thanks
> > >> ~sk
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> _______________________________________________
> > >> Swift-devel mailing list
> > >> Swift-devel at ci.uchicago.edu
> > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > >
> >
> > --
> > Justin M Wozniak

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



