[Swift-user] Getting swift to run on Fusion

Jonathan Margoliash jmargolpeople at gmail.com
Thu Sep 13 10:13:46 CDT 2012


David,

I've now got swift working on Fusion! My most recent code, in
/home/jmargoliash/my_SwiftSCE2_branch, runs swift and manages to get inside
the app invocations before crashing. I'm still working out some kinks
(specifically, the Fusion support team just installed octave on Fusion, and
I'm trying to get that working). For now, though, the swift component seems
fine. I might have questions later about modifying the sites.xml file.

As for my workflow: only one instance of swift runs at any given time, on
the node that called ./fusion_start_sce.sh

The workflow is:
./fusion_start_sce.sh -> octave/matlab code -> calculate_point_values.swift
calculate_point_values.swift creates run_swat_wrapper.sh jobs
run_swat_wrapper.sh -> run_swat_wrapper.m (octave/matlab) -> ... -> SWAT model

There is a similar workflow when generate_offspring.swift is called from
the control node instead of calculate_point_values.swift (but the two are
never called concurrently).

One last question:
I don't want my main matlab/octave method, the one that orchestrates the
repeated calls to swift, running on the Fusion login nodes, since I'm
afraid it would eat up too much of their resources. How do I get it running
on a compute node instead? Do I just qsub the ./fusion_start_sce.sh
command? If so, swift would then be invoked directly from a compute node
and would try to distribute jobs to other compute nodes. Would that work,
or would swift try to submit new jobs to the scheduler rather than working
within the current job's allocation?
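
For concreteness, I mean something like this hypothetical submit script
(call it submit_sce.pbs; the resource requests are guesses on my part):

    #!/bin/bash
    #PBS -N sce_driver
    #PBS -q shared
    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=48:00:00
    # Run the SCE driver from the directory the job was submitted from.
    cd $PBS_O_WORKDIR
    ./fusion_start_sce.sh

which I would then submit with "qsub submit_sce.pbs".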

Thanks,

Jonathan

On Thu, Sep 13, 2012 at 10:01 AM, David Kelly <davidk at ci.uchicago.edu> wrote:

> Jonathan,
>
> Just trying to understand the workflow better. From what I can tell, it is
> currently something like this:
>
> You run the swift script called calculate_point_values.swift.
> calculate_point_values calls run_swat_wrapper, which runs matlab.
> matlab loads a file called run_swat_wrapper.m.
>
> Is one of the matlab files launching instances of swift on the worker
> nodes?
>
> David
>
> ----- Original Message -----
> > From: "Jonathan Margoliash" <jmargolpeople at gmail.com>
> > To: "David Kelly" <davidk at ci.uchicago.edu>
> > Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" <eyan at anl.gov>
> > Sent: Wednesday, September 12, 2012 4:46:20 PM
> > Subject: Re: Getting swift to run on Fusion
> > So, I am using your sites.xml file with a slightly newer version of the
> > code (located at /home/jmargoliash/my_SwiftSCE2_branch, as opposed to
> > /home/jmargoliash/my_SwiftSCE2_branch_matlab). When I run this
> > version, I'm getting the following error out of matlab:
> >
> > ----
> > stderr.txt: Fatal Error on startup: Unable to start the JVM.
> > Error occurred during initialization of VM
> > Could not reserve enough space for object heap
> >
> > There is not enough memory to start up the Java virtual machine.
> > Try quitting other applications or increasing your virtual memory.
> > ----
> >
> > My question is: is this error matlab's fault, or is this sites.xml
> > file trying to run too many apps on each node at once?
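> >
> > (For scale, and this is back-of-the-envelope with assumed numbers: with
> > jobsPerNode=8, eight matlab instances share one node's memory, so on a
> > 16 GB node each would get roughly 2 GB, which may be too little for
> > matlab's default JVM heap. A quick way to check on a node:
> >
> >     free -g   # total and available memory on the node
> >     java -XX:+PrintFlagsFinal -version | grep MaxHeapSize   # default heap cap
> >
> > I don't know Fusion's actual node memory, so the 16 GB is a guess.)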
> >
> >
> > Also, a tangentially related question about the coaster service:
> > Why would you want to have more than one coaster worker running on
> > each node? Or rather, does each coaster worker correspond to a single
> > app invocation, or can one coaster worker manage many simultaneous app
> > invocations on a single node?
> >
> >
> >
> >
> > On Wed, Sep 12, 2012 at 3:44 PM, Jonathan Margoliash <jmargolpeople at gmail.com> wrote:
> >
> >
> > Thanks David!
> >
> > I was expecting the jobs themselves to crash, but I wanted to get to
> > that point in the debugging process. I'll try this script for now, and
> > we'll figure out why the other one wasn't working. Thanks again,
> >
> >
> > Jonathan
> >
> >
> >
> >
> > On Wed, Sep 12, 2012 at 3:24 PM, David Kelly <davidk at ci.uchicago.edu> wrote:
> >
> >
> > Jonathan,
> >
> > I think the error is related to something being misconfigured in
> > sites.xml. When I tried the same version of sites.xml, I saw the same
> > qdel error. I will look to see why that is, but in the meantime, can
> > you please try using this sites.xml:
> >
> > <config>
> >   <pool handle="fusion">
> >     <execution jobmanager="local:pbs" provider="coaster" url="none"/>
> >     <filesystem provider="local" url="none"/>
> >     <profile namespace="globus" key="maxtime">3600</profile>
> >     <profile namespace="globus" key="jobsPerNode">8</profile>
> >     <profile namespace="globus" key="queue">shared</profile>
> >     <profile namespace="globus" key="slots">100</profile>
> >     <profile namespace="globus" key="nodeGranularity">1</profile>
> >     <profile namespace="globus" key="maxNodes">2</profile>
> >     <profile namespace="karajan" key="jobThrottle">5.99</profile>
> >     <profile namespace="karajan" key="initialScore">10000</profile>
> >     <profile namespace="globus" key="HighOverAllocation">100</profile>
> >     <profile namespace="globus" key="LowOverAllocation">100</profile>
> >     <workdirectory>/homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork</workdirectory>
> >   </pool>
> > </config>
> >
> > (Modify the workdirectory as needed.) I added this to my copy of
> > fusion_start_sce.sh, in
> > /homes/davidk/my_SwiftSCE2_branch_matlab/fusion_start_sce.sh. It seems
> > to work for me, at least in terms of submitting jobs and reporting
> > their status. The jobs themselves fail because some scripts reference
> > /usr/bin/octave, which doesn't exist on Fusion. Hopefully this helps
> > you get a little further.
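> >
> > (One portable workaround, untested on my end: resolve the octave path
> > at runtime instead of hard-coding it, e.g.
> >
> >     OCTAVE=$(command -v octave) || { echo "octave not found" >&2; exit 1; }
> >     "$OCTAVE" --eval "run_swat_wrapper"
> >
> > or, if Fusion provides octave as a module, load that module in the
> > wrapper script before calling it.)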
> >
> >
> > Thanks,
> > David
> >
> > ----- Original Message -----
> > > From: "Jonathan Margoliash" <jmargolpeople at gmail.com>
> > > To: "David Kelly" <davidk at ci.uchicago.edu>
> > > Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" <eyan at anl.gov>
> > > Sent: Wednesday, September 12, 2012 11:32:56 AM
> > > Subject: Re: Getting swift to run on Fusion
> > > I attached the .0.rlog, .log, .d and swift.log files. Which of those
> > > files do you use for debugging? They are all located in the directory
> > >
> > > /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235
> > >
> > > on Fusion, if that's what you were asking for. Thanks!
> > >
> > >
> > > Jonathan
> > >
> > >
> > > On Wed, Sep 12, 2012 at 12:20 PM, David Kelly <davidk at ci.uchicago.edu> wrote:
> > >
> > >
> > > Jonathan,
> > >
> > > Could you please provide a pointer to the log file that got created
> > > from this run?
> > >
> > > Thanks,
> > > David
> > >
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Jonathan Margoliash" <jmargolpeople at gmail.com>
> > > > To: swift-user at ci.uchicago.edu, "Swift Language" <davidk at ci.uchicago.edu>, "Professor E. Yan" <eyan at anl.gov>
> > > > Sent: Wednesday, September 12, 2012 10:50:35 AM
> > > > Subject: Getting swift to run on Fusion
> > > > Hello swift support,
> > > >
> > > >
> > > > This is my first attempt at getting swift to work on Fusion, and I'm
> > > > getting the following output in the terminal:
> > > >
> > > >
> > > > ------
> > > >
> > > > Warning: Function toint is deprecated, at line 10
> > > > Swift trunk swift-r5882 cog-r3434
> > > >
> > > >
> > > > RunID: 20120912-1032-5y7xb1ug
> > > > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500
> > > > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 Submitted:8
> > > > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 Submitted:8
> > > > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 Submitted:8
> > > > ...
> > > > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 Submitted:8
> > > > Failed to shut down block: Block 0912-321051-000005 (8x60.000s)
> > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> > > > Failed to cancel task. qdel returned with an exit code of 153
> > > >     at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205)
> > > >     at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
> > > >     at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69)
> > > >     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102)
> > > >     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91)
> > > >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46)
> > > >     at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320)
> > > >     at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100)
> > > >     at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203)
> > > >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237)
> > > >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225)
> > > >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318)
> > > >     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293)
> > > >     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552)
> > > >     at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140)
> > > > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 Submitted:8
> > > > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 Submitted:8
> > > > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 Submitted:8
> > > > ...
> > > > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 Submitted:8
> > > > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 Submitted:8
> > > > Failed to shut down block: Block 0912-321051-000006 (8x60.000s)
> > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> > > > Failed to cancel task. qdel returned with an exit code of 153
> > > >     [stack trace identical to the one above]
> > > > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 Submitted:8
> > > > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 Submitted:8
> > > > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 Submitted:8
> > > > ...
> > > >
> > > >
> > > > ------
> > > >
> > > >
> > > > I understand the long run of unchanging "Progress: ..." reports:
> > > > the shared queue is busy, so I am not expecting my job to be
> > > > executed right away. What I don't understand is why I'm getting
> > > > these "Failed to cancel task" errors. I gave each individual app
> > > > well more than enough time to run to completion. And while I set
> > > > the time limit on the entire process much smaller than it needs
> > > > (<profile namespace="globus" key="maxTime">60</profile> in
> > > > sites.xml, when the process could run for days), I presumed the
> > > > entire process would simply be shut down after 60 seconds of
> > > > runtime. Why is this error cropping up? Thanks,
> > > >
> > > >
> > > > Jonathan
>