[Swift-user] Getting swift to run on Fusion

David Kelly davidk at ci.uchicago.edu
Wed Sep 12 15:24:17 CDT 2012


Jonathan,

I think the error comes from a misconfiguration in sites.xml. When I tried the same sites.xml, I saw the same qdel error. I will look into why that is, but in the meantime, could you please try this sites.xml:

<config>
<pool handle="fusion">
  <execution jobmanager="local:pbs" provider="coaster" url="none"/>
  <filesystem provider="local" url="none" />
  <profile namespace="globus" key="maxtime">3600</profile>
  <profile namespace="globus" key="jobsPerNode">8</profile>
  <profile namespace="globus" key="queue">shared</profile>
  <profile namespace="globus" key="slots">100</profile>
  <profile namespace="globus" key="nodeGranularity">1</profile>
  <profile namespace="globus" key="maxNodes">2</profile>
  <profile namespace="karajan" key="jobThrottle">5.99</profile>
  <profile namespace="karajan" key="initialScore">10000</profile>
  <profile namespace="globus" key="HighOverAllocation">100</profile>
  <profile namespace="globus" key="LowOverAllocation">100</profile>
  <workdirectory>/homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork</workdirectory>
</pool>
</config>

(Modify your workdirectory as needed.) I added this to my copy of fusion_start_sce.sh, at /homes/davidk/my_SwiftSCE2_branch_matlab/fusion_start_sce.sh. It seems to work for me, at least in terms of submitting jobs and reporting their status. The jobs themselves fail because some scripts reference /usr/bin/octave, which doesn't exist on Fusion. Hopefully this helps you get a little further.
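
As a rough illustration of what the settings above imply, here is a minimal Python sketch that reads a trimmed copy of that sites.xml and derives a few numbers from it. The formulas are assumptions based on my reading of the Swift coaster settings, not something verified on Fusion: maxtime is taken as the block walltime in seconds, slots x maxNodes x jobsPerNode as an upper bound on concurrent worker slots, and jobThrottle * 100 + 1 as the cap on concurrently submitted jobs. The function names read_profiles and summarize are made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sites.xml above (only the keys used below).
SITES_XML = """\
<config>
<pool handle="fusion">
  <execution jobmanager="local:pbs" provider="coaster" url="none"/>
  <profile namespace="globus" key="maxtime">3600</profile>
  <profile namespace="globus" key="jobsPerNode">8</profile>
  <profile namespace="globus" key="slots">100</profile>
  <profile namespace="globus" key="maxNodes">2</profile>
  <profile namespace="karajan" key="jobThrottle">5.99</profile>
</pool>
</config>
"""


def read_profiles(xml_text, handle):
    """Return {(namespace, key): value} for the pool with the given handle."""
    root = ET.fromstring(xml_text)
    for pool in root.iter("pool"):
        if pool.get("handle") == handle:
            return {
                (p.get("namespace"), p.get("key")): (p.text or "").strip()
                for p in pool.iter("profile")
            }
    raise KeyError(f"no pool with handle {handle!r}")


def summarize(profiles):
    """Derive rough limits from the profile values (assumed formulas)."""
    slots = int(profiles[("globus", "slots")])
    max_nodes = int(profiles[("globus", "maxNodes")])
    jobs_per_node = int(profiles[("globus", "jobsPerNode")])
    throttle = float(profiles[("karajan", "jobThrottle")])
    return {
        # Assumption: maxtime is the walltime of one coaster block, in seconds.
        "block_walltime_seconds": int(profiles[("globus", "maxtime")]),
        # Assumption: each block spans up to maxNodes nodes,
        # each running jobsPerNode tasks at once.
        "max_worker_slots": slots * max_nodes * jobs_per_node,
        # Assumption: Swift allows jobThrottle * 100 + 1 concurrent jobs.
        "max_concurrent_jobs": round(throttle * 100) + 1,
    }


if __name__ == "__main__":
    for name, value in summarize(read_profiles(SITES_XML, "fusion")).items():
        print(f"{name}: {value}")
```

If those assumptions hold, the one-hour block walltime here is the main difference from a maxTime of 60, which asks PBS for blocks that last only one minute.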

Thanks,
David

----- Original Message -----
> From: "Jonathan Margoliash" <jmargolpeople at gmail.com>
> To: "David Kelly" <davidk at ci.uchicago.edu>
> Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" <eyan at anl.gov>
> Sent: Wednesday, September 12, 2012 11:32:56 AM
> Subject: Re: Getting swift to run on Fusion
> I attached the .0.rlog, .log, .d and the swift.log files. Which of
> those files do you use for debugging? And these files are all located
> in the directory
> 
> 
> /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235
> 
> 
> on Fusion, if that's what you were asking for. Thanks!
> 
> 
> Jonathan
> 
> 
> On Wed, Sep 12, 2012 at 12:20 PM, David Kelly <davidk at ci.uchicago.edu> wrote:
> 
> 
> Jonathan,
> 
> Could you please provide a pointer to the log file that got created
> from this run?
> 
> Thanks,
> David
> 
> 
> 
> ----- Original Message -----
> > From: "Jonathan Margoliash" <jmargolpeople at gmail.com>
> > To: swift-user at ci.uchicago.edu, "Swift Language" <davidk at ci.uchicago.edu>, "Professor E. Yan" <eyan at anl.gov>
> > Sent: Wednesday, September 12, 2012 10:50:35 AM
> > Subject: Getting swift to run on Fusion
> > Hello swift support,
> >
> >
> > This is my first attempt getting swift to work on Fusion, and I'm
> > getting the following output to the terminal:
> >
> >
> > ------
> >
> >
> >
> > Warning: Function toint is deprecated, at line 10
> > Swift trunk swift-r5882 cog-r3434
> >
> >
> > RunID: 20120912-1032-5y7xb1ug
> > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500
> > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 Submitted:8
> > ...
> > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 Submitted:8
> > Failed to shut down block: Block 0912-321051-000005 (8x60.000s)
> > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed to cancel task. qdel returned with an exit code of 153
> >     at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205)
> >     at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
> >     at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69)
> >     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102)
> >     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100)
> >     at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293)
> >     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552)
> >     at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140)
> > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 Submitted:8
> > ...
> >
> > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 Submitted:8
> > Failed to shut down block: Block 0912-321051-000006 (8x60.000s)
> > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed to cancel task. qdel returned with an exit code of 153
> >     at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205)
> >     at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
> >     at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69)
> >     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102)
> >     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100)
> >     at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293)
> >     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552)
> >     at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140)
> > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 Submitted:8
> > ...
> >
> >
> > ------
> >
> >
> > I understand the long lines of unchanging "Progress: ..." reports -
> > the shared queue is busy, and so I am not expecting my job to be
> > executed right away. However, I don't understand why I'm getting
> > these "failed to cancel task" errors. I gave each individual app well
> > more than enough time for it to run to completion. And while I set
> > the timelimit on the entire process to be much smaller than it needs
> > (<profile namespace="globus" key="maxTime">60</profile> in sites.xml,
> > when the process could run for days), I presumed the entire process
> > would just get shut down after 60 seconds of runtime. Why is this
> > cropping up? Thanks,
> >
> >
> > Jonathan
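
One possible reading of the qdel exit code of 153 quoted above, offered as an assumption and not confirmed anywhere in this thread: Torque-style PBS error numbers start at 15000, and a process exit status only keeps the low 8 bits, so 153 would correspond to error 15001, which Torque's pbs_error.h names PBSE_UNKJOBID ("Unknown Job Id"). That would fit a block whose PBS job had already ended by the time Swift tried to cancel it. The short Python sketch below only checks the arithmetic; whether Fusion's qdel actually reports errors this way is not something I have verified, and the helper names are made up for this example.

```python
# Assumption: Torque-style PBS error numbers live in the 15000 range, and
# qdel exits with the error number truncated to 8 bits by the OS.
PBS_ERROR_BASE = 15000
PBSE_UNKJOBID = 15001  # "Unknown Job Id" in Torque's pbs_error.h


def exit_status(pbs_errno):
    """Return the 8-bit exit status a shell would see for this error number."""
    return pbs_errno % 256


def candidate_errnos(status, base=PBS_ERROR_BASE):
    """List error numbers in [base, base + 256) that truncate to `status`."""
    return [code for code in range(base, base + 256) if code % 256 == status]


if __name__ == "__main__":
    print(exit_status(PBSE_UNKJOBID))
    print(candidate_errnos(153))
```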


