[Swift-user] Getting swift to run on Fusion

Jonathan Margoliash jmargolpeople at gmail.com
Wed Sep 12 15:44:10 CDT 2012


Thanks David!

I was expecting the jobs themselves to crash, but I wanted to get to that
point in the debugging process. I'll try this script for now, and we'll
figure out why the other one wasn't working. Thanks again,

Jonathan

On Wed, Sep 12, 2012 at 3:24 PM, David Kelly <davidk at ci.uchicago.edu> wrote:

> Jonathan,
>
> I think the error is related to something being misconfigured in
> sites.xml. When I tried the same version of sites.xml, I saw the same qdel
> error. I will look to see why that is, but in the meantime, can you please
> try using this sites.xml:
>
> <config>
> <pool handle="fusion">
>   <execution jobmanager="local:pbs" provider="coaster" url="none"/>
>   <filesystem provider="local" url="none" />
>   <profile namespace="globus" key="maxtime">3600</profile>
>   <profile namespace="globus" key="jobsPerNode">8</profile>
>   <profile namespace="globus" key="queue">shared</profile>
>   <profile namespace="globus" key="slots">100</profile>
>   <profile namespace="globus" key="nodeGranularity">1</profile>
>   <profile namespace="globus" key="maxNodes">2</profile>
>   <profile namespace="karajan" key="jobThrottle">5.99</profile>
>   <profile namespace="karajan" key="initialScore">10000</profile>
>   <profile namespace="globus" key="HighOverAllocation">100</profile>
>   <profile namespace="globus" key="LowOverAllocation">100</profile>
>
> <workdirectory>/homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork</workdirectory>
> </pool>
> </config>
>
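> For reference, a sites file like this is normally handed to swift on the
> command line; a minimal sketch (the script and tc file names below are
> placeholders, not the actual contents of fusion_start_sce.sh):
>
>     swift -sites.file sites.xml -tc.file tc.data sce.swift
>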
> (Modify the workdirectory as needed.) I added this to my copy of
> fusion_start_sce.sh, in
> /homes/davidk/my_SwiftSCE2_branch_matlab/fusion_start_sce.sh. It seems
> to work for me, at least in terms of submitting jobs and reporting on
> their status. The jobs themselves fail because some scripts reference
> /usr/bin/octave, which doesn't exist on Fusion. Hopefully this helps
> you get a little further.
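>
> As a hedged sketch (the module name and resulting path are guesses I have
> not verified on Fusion), one way to chase down a working interpreter would
> be:
>
>     module avail octave     # check whether Fusion provides an Octave module
>     module load octave
>     which octave            # e.g. /soft/octave/bin/octave (hypothetical path)
>
> and then update the /usr/bin/octave references in those scripts (or the
> octave entry in the tc file) to point at that path.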
>
> Thanks,
> David
>
> ----- Original Message -----
> > From: "Jonathan Margoliash" <jmargolpeople at gmail.com>
> > To: "David Kelly" <davidk at ci.uchicago.edu>
> > Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" <eyan at anl.gov>
> > Sent: Wednesday, September 12, 2012 11:32:56 AM
> > Subject: Re: Getting swift to run on Fusion
> > I attached the .0.rlog, .log, .d and the swift.log files. Which of
> > those files do you use for debugging? And these files are all located
> > in the directory
> >
> >
> > /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235
> >
> >
> > on Fusion, if that's what you were asking for. Thanks!
> >
> >
> > Jonathan
> >
> >
> > On Wed, Sep 12, 2012 at 12:20 PM, David Kelly <davidk at ci.uchicago.edu> wrote:
> >
> >
> > Jonathan,
> >
> > Could you please provide a pointer to the log file that got created
> > from this run?
> >
> > Thanks,
> > David
> >
> >
> >
> > ----- Original Message -----
> > > From: "Jonathan Margoliash" < jmargolpeople at gmail.com >
> > > To: swift-user at ci.uchicago.edu , "Swift Language" <
> > > davidk at ci.uchicago.edu >, "Professor E. Yan" < eyan at anl.gov >
> > > Sent: Wednesday, September 12, 2012 10:50:35 AM
> > > Subject: Getting swift to run on Fusion
> > > Hello swift support,
> > >
> > >
> > > This is my first attempt at getting Swift to run on Fusion, and I'm
> > > getting the following output in the terminal:
> > >
> > >
> > > ------
> > >
> > >
> > >
> > > Warning: Function toint is deprecated, at line 10
> > > Swift trunk swift-r5882 cog-r3434
> > >
> > >
> > > RunID: 20120912-1032-5y7xb1ug
> > > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500
> > > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34
> > > Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34
> > > Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34
> > > Submitted:8
> > > ...
> > > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34
> > > Submitted:8
> > > Failed to shut down block: Block 0912-321051-000005 (8x60.000s)
> > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> > > Failed to cancel task. qdel returned with an exit code of 153
> > > at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205)
> > > at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
> > > at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69)
> > > at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102)
> > > at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91)
> > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46)
> > > at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320)
> > > at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100)
> > > at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203)
> > > at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237)
> > > at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225)
> > > at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318)
> > > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293)
> > > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552)
> > > at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140)
> > > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34
> > > Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34
> > > Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34
> > > Submitted:8
> > > ...
> > >
> > > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34
> > > Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34
> > > Submitted:8
> > > Failed to shut down block: Block 0912-321051-000006 (8x60.000s)
> > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> > > Failed to cancel task. qdel returned with an exit code of 153
> > > at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205)
> > > at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
> > > at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69)
> > > at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102)
> > > at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91)
> > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46)
> > > at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320)
> > > at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100)
> > > at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203)
> > > at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237)
> > > at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225)
> > > at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318)
> > > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293)
> > > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552)
> > > at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140)
> > > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34
> > > Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34
> > > Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34
> > > Submitted:8
> > > ...
> > >
> > >
> > > ------
> > >
> > >
> > > I understand the long runs of unchanging "Progress: ..." reports -
> > > the shared queue is busy, so I am not expecting my job to be
> > > executed right away. However, I don't understand why I'm getting
> > > these "failed to cancel task" errors. I gave each individual app
> > > more than enough time to run to completion. And while I set the
> > > time limit on the entire process
> > > (<profile namespace="globus" key="maxTime">60</profile> in
> > > sites.xml) to be much smaller than it needs - the process could run
> > > for days - I presumed the whole run would simply be shut down after
> > > 60 seconds of runtime. Why are these errors cropping up? Thanks,
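> > >
> > > (For reference, by the per-app time limit I mean a maxwalltime profile
> > > on the app's tc entry - sketched here with a hypothetical path and
> > > value, just to show the form:
> > >
> > > fusion octave /usr/bin/octave INSTALLED INTEL32::LINUX globus::maxwalltime="01:00:00"
> > >
> > > as opposed to the 60-second maxTime above, which I understood to apply
> > > to the whole run.)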
> > >
> > >
> > > Jonathan
>