[Swift-user] Getting swift to run on Fusion

Jonathan Margoliash jmargolpeople at gmail.com
Wed Sep 12 16:46:20 CDT 2012


So, I am using your sites.xml file with a slightly newer version of the code
(located at /home/jmargoliash/my_SwiftSCE2_branch, as opposed
to /home/jmargoliash/my_SwiftSCE2_branch_matlab). When I run this version,
I'm getting the following error out of MATLAB:

----
stderr.txt: Fatal Error on startup: Unable to start the JVM.
Error occurred during initialization of VM
Could not reserve enough space for object heap

There is not enough memory to start up the Java virtual machine.
Try quitting other applications or increasing your virtual memory.
-----

My question is: is this error MATLAB's fault, or is this sites.xml file
trying to run too many apps on each node at once?
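For what it's worth, here is the back-of-the-envelope check I'm using (a sketch: the 8 slots come from the jobsPerNode setting in the sites.xml in this thread, but the 16 GB node-memory figure is an assumption; substitute Fusion's real per-node RAM):

```python
# Rough estimate of memory available to each concurrent app on one node.
node_ram_gb = 16    # ASSUMPTION -- check the actual Fusion node specs
jobs_per_node = 8   # from <profile namespace="globus" key="jobsPerNode">8</profile>

ram_per_app_gb = node_ram_gb / jobs_per_node
print(f"~{ram_per_app_gb:.1f} GB per concurrent app")  # -> ~2.0 GB per concurrent app
# Each MATLAB instance must fit its own process *plus* its embedded JVM
# heap into this budget; eight JVM heaps at once can exhaust the node.
```

If that budget is too small, the usual workarounds would be lowering jobsPerNode, or (if the scripts don't need Java) starting MATLAB with -nojvm / -nodisplay so no JVM heap is reserved at all.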

Also, a tangentially related question about the coaster service:
Why would you want to have more than one coaster worker running on each
node? Or rather, does each coaster worker correspond to a single app
invocation, or can one coaster worker manage many simultaneous app
invocations on a single node?
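My current understanding (worth confirming against the Swift docs) is that one coaster worker per node runs up to jobsPerNode app invocations concurrently, and the karajan jobThrottle caps the total number of jobs Swift will run in parallel via the formula maxJobs = jobThrottle * 100 + 1. A sketch with the values from the sites.xml in this thread:

```python
# Concurrency implied by the sites.xml settings quoted below.
job_throttle = 5.99   # karajan jobThrottle
jobs_per_node = 8     # app invocations per coaster worker/node
slots = 100           # maximum number of coaster blocks
max_nodes = 2         # maximum nodes per block

throttle_cap = round(job_throttle * 100) + 1     # Swift's throttle formula
worker_cap = slots * max_nodes * jobs_per_node   # upper bound from block settings
print(throttle_cap, worker_cap)  # -> 600 1600
# The effective concurrency is the smaller of the two caps.
```

So 5.99 is presumably chosen to allow up to 600 parallel jobs, well below what the block settings could in principle supply.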


On Wed, Sep 12, 2012 at 3:44 PM, Jonathan Margoliash <
jmargolpeople at gmail.com> wrote:

> Thanks David!
>
> I was expecting the jobs themselves to crash, but I wanted to get to that
> point in the debugging process. I'll try this script for now, and we'll
> figure out why the other one wasn't working. Thanks again,
>
> Jonathan
>
>
> On Wed, Sep 12, 2012 at 3:24 PM, David Kelly <davidk at ci.uchicago.edu> wrote:
>
>> Jonathan,
>>
>> I think the error is related to something being misconfigured in
>> sites.xml. When I tried the same version of sites.xml, I saw the same qdel
>> error. I will look to see why that is, but in the meantime, can you please
>> try using this sites.xml:
>>
>> <config>
>> <pool handle="fusion">
>>   <execution jobmanager="local:pbs" provider="coaster" url="none"/>
>>   <filesystem provider="local" url="none" />
>>   <profile namespace="globus" key="maxtime">3600</profile>
>>   <profile namespace="globus" key="jobsPerNode">8</profile>
>>   <profile namespace="globus" key="queue">shared</profile>
>>   <profile namespace="globus" key="slots">100</profile>
>>   <profile namespace="globus" key="nodeGranularity">1</profile>
>>   <profile namespace="globus" key="maxNodes">2</profile>
>>   <profile namespace="karajan" key="jobThrottle">5.99</profile>
>>   <profile namespace="karajan" key="initialScore">10000</profile>
>>   <profile namespace="globus" key="HighOverAllocation">100</profile>
>>   <profile namespace="globus" key="LowOverAllocation">100</profile>
>>
>> <workdirectory>/homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork</workdirectory>
>> </pool>
>> </config>
>>
>> (Modify your workdirectory as needed). I added this to my copy of
>> fusion_start_sce.sh,
>> in /homes/davidk/my_SwiftSCE2_branch_matlab/fusion_start_sce.sh. It seems
>> to work for me, at least in terms of submitting and reporting on the status
>> of jobs. The jobs themselves fail because some of the scripts reference
>> /usr/bin/octave, which doesn't exist on Fusion. Hopefully this helps you
>> get a little further.
>>
>> Thanks,
>> David
>>
>> ----- Original Message -----
>> > From: "Jonathan Margoliash" <jmargolpeople at gmail.com>
>> > To: "David Kelly" <davidk at ci.uchicago.edu>
>> > Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" <eyan at anl.gov>
>> > Sent: Wednesday, September 12, 2012 11:32:56 AM
>> > Subject: Re: Getting swift to run on Fusion
>> > I attached the .0.rlog, .log, .d and the swift.log files. Which of
>> > those files do you use for debugging? And these files are all located
>> > in the directory
>> >
>> >
>> > /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235
>> >
>> >
>> > on Fusion, if that's what you were asking for. Thanks!
>> >
>> >
>> > Jonathan
>> >
>> >
>> > On Wed, Sep 12, 2012 at 12:20 PM, David Kelly <davidk at ci.uchicago.edu> wrote:
>> >
>> >
>> > Jonathan,
>> >
>> > Could you please provide a pointer to the log file that got created
>> > from this run?
>> >
>> > Thanks,
>> > David
>> >
>> >
>> >
>> > ----- Original Message -----
>> > > From: "Jonathan Margoliash" <jmargolpeople at gmail.com>
>> > > To: swift-user at ci.uchicago.edu, "Swift Language"
>> > > <davidk at ci.uchicago.edu>, "Professor E. Yan" <eyan at anl.gov>
>> > > Sent: Wednesday, September 12, 2012 10:50:35 AM
>> > > Subject: Getting swift to run on Fusion
>> > > Hello swift support,
>> > >
>> > >
>> > > This is my first attempt getting swift to work on Fusion, and I'm
>> > > getting the following output to the terminal:
>> > >
>> > >
>> > > ------
>> > >
>> > >
>> > >
>> > > Warning: Function toint is deprecated, at line 10
>> > > Swift trunk swift-r5882 cog-r3434
>> > >
>> > >
>> > > RunID: 20120912-1032-5y7xb1ug
>> > > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500
>> > > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34
>> > > Submitted:8
>> > > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34
>> > > Submitted:8
>> > > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34
>> > > Submitted:8
>> > > ...
>> > > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34
>> > > Submitted:8
>> > > Failed to shut down block: Block 0912-321051-000005 (8x60.000s)
>> > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
>> > > Failed to cancel task. qdel returned with an exit code of 153
>> > > at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205)
>> > > at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
>> > > at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69)
>> > > at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102)
>> > > at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91)
>> > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46)
>> > > at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320)
>> > > at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100)
>> > > at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203)
>> > > at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237)
>> > > at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225)
>> > > at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318)
>> > > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293)
>> > > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552)
>> > > at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140)
>> > > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34
>> > > Submitted:8
>> > > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34
>> > > Submitted:8
>> > > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34
>> > > Submitted:8
>> > > ...
>> > >
>> > > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34
>> > > Submitted:8
>> > > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34
>> > > Submitted:8
>> > > Failed to shut down block: Block 0912-321051-000006 (8x60.000s)
>> > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
>> > > Failed to cancel task. qdel returned with an exit code of 153
>> > > (stack trace identical to the one above)
>> > > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34
>> > > Submitted:8
>> > > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34
>> > > Submitted:8
>> > > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34
>> > > Submitted:8
>> > > ...
>> > >
>> > >
>> > > ------
>> > >
>> > >
>> > > I understand the long runs of unchanging "Progress: ..." reports -
>> > > the shared queue is busy, so I'm not expecting my job to be
>> > > executed right away. However, I don't understand why I'm getting
>> > > these "failed to cancel task" errors. I gave each individual app
>> > > more than enough time to run to completion. And while I set the
>> > > time limit on the entire process much shorter than it needs
>> > > (<profile namespace="globus" key="maxTime">60</profile> in
>> > > sites.xml, when the process could run for days),
>> > > I presumed the entire process would simply be shut down after 60
>> > > seconds of runtime. Why is this error cropping up? Thanks,
>> > >
>> > >
>> > > Jonathan
>>
>
>