[Swift-user] Getting swift to run on Fusion

David Kelly davidk at ci.uchicago.edu
Thu Sep 13 10:01:00 CDT 2012


Jonathan,

Just trying to understand the workflow better. From what I can tell, it is currently something like this:

You run the swift script called calculate_point_values.swift
calculate_point_values calls run_swat_wrapper, which runs matlab
matlab loads a file called run_swat_wrapper.m
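
If so, I'd guess run_swat_wrapper is a shell wrapper along these lines (purely a sketch on my part; the file names come from your setup, everything else is assumed):

#!/bin/bash
# run_swat_wrapper (assumed): invoked by Swift as an app on a worker
# node. It runs run_swat_wrapper.m in batch-mode matlab and passes
# matlab's exit status back to Swift so failures are detected.
matlab -nodisplay -nosplash -r "run_swat_wrapper; exit"
exit $?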

Is one of the matlab files launching instances of swift on the worker nodes?

David

----- Original Message -----
> From: "Jonathan Margoliash" <jmargolpeople at gmail.com>
> To: "David Kelly" <davidk at ci.uchicago.edu>
> Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" <eyan at anl.gov>
> Sent: Wednesday, September 12, 2012 4:46:20 PM
> Subject: Re: Getting swift to run on Fusion
> So, I am using your sites.xml file for a slightly newer version of the
> code (located at /home/jmargoliash/my_SwiftSCE2_branch, as opposed to
> /home/jmargoliash/my_SwiftSCE2_branch_matlab). When I run this
> version, I get the following error out of matlab:
> 
> 
> 
> ----
> 
> stderr.txt: Fatal Error on startup: Unable to start the JVM.
> Error occurred during initialization of VM
> Could not reserve enough space for object heap
> 
> 
> There is not enough memory to start up the Java virtual machine.
> Try quitting other applications or increasing your virtual memory.
> -----
> 
> 
> My question is: is this error matlab's fault, or is this sites.xml
> file trying to run too many apps on each node at once?
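> 
> If it's the latter, I suppose I could reduce jobsPerNode, or, assuming
> nothing in run_swat_wrapper.m actually needs Java, start matlab with
> no JVM at all, e.g. in the wrapper:
> 
> matlab -nodisplay -nojvm -r "run_swat_wrapper; exit"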
> 
> 
> Also, a tangentially related question about the coaster service:
> Why would you want to have more than one coaster worker running on
> each node? Or rather, does each coaster worker correspond to a single
> app invocation, or can one coaster worker manage many simultaneous app
> invocations on a single node?
> 
> 
> 
> 
> On Wed, Sep 12, 2012 at 3:44 PM, Jonathan Margoliash < jmargolpeople at gmail.com > wrote:
> 
> 
> Thanks David!
> 
> I was expecting the jobs themselves to crash, but I wanted to get to
> that point in the debugging process. I'll try this script for now, and
> we'll figure out why the other one wasn't working. Thanks again,
> 
> 
> Jonathan
> 
> 
> 
> 
> On Wed, Sep 12, 2012 at 3:24 PM, David Kelly < davidk at ci.uchicago.edu > wrote:
> 
> 
> Jonathan,
> 
> I think the error is related to something being misconfigured in
> sites.xml. When I tried the same version of sites.xml, I saw the same
> qdel error. I will look to see why that is, but in the meantime, can
> you please try using this sites.xml:
> 
> <config>
>   <pool handle="fusion">
>     <execution jobmanager="local:pbs" provider="coaster" url="none"/>
>     <filesystem provider="local" url="none"/>
>     <profile namespace="globus" key="maxtime">3600</profile>
>     <profile namespace="globus" key="jobsPerNode">8</profile>
>     <profile namespace="globus" key="queue">shared</profile>
>     <profile namespace="globus" key="slots">100</profile>
>     <profile namespace="globus" key="nodeGranularity">1</profile>
>     <profile namespace="globus" key="maxNodes">2</profile>
>     <profile namespace="karajan" key="jobThrottle">5.99</profile>
>     <profile namespace="karajan" key="initialScore">10000</profile>
>     <profile namespace="globus" key="HighOverAllocation">100</profile>
>     <profile namespace="globus" key="LowOverAllocation">100</profile>
>     <workdirectory>/homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork</workdirectory>
>   </pool>
> </config>
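> 
> For what it's worth, if I have the throttle arithmetic right,
> jobThrottle caps concurrent app invocations at roughly
> jobThrottle * 100 + 1, i.e. 5.99 * 100 + 1 = 600 here, and the large
> initialScore opens that throttle immediately rather than letting it
> ramp up slowly.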
> 
> (Modify your workdirectory as needed). I added this to my copy of
> fusion_start_sce.sh, in
> /homes/davidk/my_SwiftSCE2_branch_matlab/fusion_start_sce.sh. It seems
> to work for me, at least in terms of submitting and reporting on the
> status of jobs. The jobs themselves fail because there are references
> to /usr/bin/octave in some scripts which doesn't exist on Fusion.
> Hopefully this should help you get a little further.
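> 
> If you want to hunt those references down, something generic like
> this should work from the branch directory (untested on Fusion):
> 
> grep -rn '/usr/bin/octave' .
> 
> and then point them at whatever "which octave" (or the appropriate
> module load) reports there, for instance via /usr/bin/env octave.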
> 
> 
> Thanks,
> David
> 
> ----- Original Message -----
> > From: "Jonathan Margoliash" < jmargolpeople at gmail.com >
> > To: "David Kelly" < davidk at ci.uchicago.edu >
> > Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" < eyan at anl.gov >
> > Sent: Wednesday, September 12, 2012 11:32:56 AM
> > Subject: Re: Getting swift to run on Fusion
> > I attached the .0.rlog, .log, .d, and swift.log files. Which of
> > those files do you use for debugging? They are all located on
> > Fusion in the directory
> >
> > /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235
> >
> > if that's what you were asking for. Thanks!
> >
> >
> > Jonathan
> >
> >
> > On Wed, Sep 12, 2012 at 12:20 PM, David Kelly < davidk at ci.uchicago.edu > wrote:
> >
> >
> > Jonathan,
> >
> > Could you please provide a pointer to the log file that got created
> > from this run?
> >
> > Thanks,
> > David
> >
> >
> >
> > ----- Original Message -----
> > > From: "Jonathan Margoliash" < jmargolpeople at gmail.com >
> > > To: swift-user at ci.uchicago.edu, "Swift Language" < davidk at ci.uchicago.edu >, "Professor E. Yan" < eyan at anl.gov >
> > > Sent: Wednesday, September 12, 2012 10:50:35 AM
> > > Subject: Getting swift to run on Fusion
> > > Hello swift support,
> > >
> > >
> > > This is my first attempt at getting swift to work on Fusion, and I'm
> > > getting the following output to the terminal:
> > >
> > >
> > > ------
> > >
> > >
> > >
> > > Warning: Function toint is deprecated, at line 10
> > > Swift trunk swift-r5882 cog-r3434
> > >
> > >
> > > RunID: 20120912-1032-5y7xb1ug
> > > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500
> > > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 Submitted:8
> > > ...
> > > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 Submitted:8
> > > Failed to shut down block: Block 0912-321051-000005 (8x60.000s)
> > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed to cancel task. qdel returned with an exit code of 153
> > >   at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205)
> > >   at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
> > >   at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69)
> > >   at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102)
> > >   at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91)
> > >   at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46)
> > >   at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320)
> > >   at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100)
> > >   at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203)
> > >   at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237)
> > >   at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225)
> > >   at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318)
> > >   at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293)
> > >   at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552)
> > >   at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140)
> > > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 Submitted:8
> > > ...
> > >
> > > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 Submitted:8
> > > Failed to shut down block: Block 0912-321051-000006 (8x60.000s)
> > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed to cancel task. qdel returned with an exit code of 153
> > >   at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205)
> > >   at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
> > >   at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69)
> > >   at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102)
> > >   at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91)
> > >   at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46)
> > >   at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320)
> > >   at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100)
> > >   at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203)
> > >   at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237)
> > >   at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225)
> > >   at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318)
> > >   at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293)
> > >   at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552)
> > >   at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140)
> > > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 Submitted:8
> > > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 Submitted:8
> > > ...
> > >
> > >
> > > ------
> > >
> > >
> > > I understand the long run of unchanging "Progress: ..." reports:
> > > the shared queue is busy, so I am not expecting my job to be
> > > executed right away. What I don't understand is why I'm getting
> > > these "Failed to cancel task" errors. I gave each individual app
> > > well more than enough time to run to completion. And while I set
> > > the time limit on the entire process much smaller than it needs
> > > (<profile namespace="globus" key="maxTime">60</profile> in
> > > sites.xml, when the whole process could run for days), I presumed
> > > the entire run would simply be shut down after 60 seconds of
> > > runtime. Why is this cropping up? Thanks,
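> > >
> > > One guess about the qdel errors themselves: if Fusion's scheduler
> > > is PBS/Torque, an exit code of 153 from qdel would be 15001 mod
> > > 256, i.e. PBSE_UNKJOBID ("unknown job id"). That would mean swift
> > > is trying to cancel blocks the scheduler has already retired,
> > > which should be easy to check by hand:
> > >
> > > qdel <id of a job that has already finished>; echo $?
> > > # prints 153 if unknown-job-id is indeed the cause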
> > >
> > >
> > > Jonathan


