So, I am using your sites.xml file for a slightly newer version of the code (located at /home/jmargoliash/my_SwiftSCE2_branch, as opposed to /home/jmargoliash/my_SwiftSCE2_branch_matlab). When I run this version, I'm getting the following error out of MATLAB:
-----
stderr.txt: Fatal Error on startup: Unable to start the JVM.
Error occurred during initialization of VM
Could not reserve enough space for object heap

There is not enough memory to start up the Java virtual machine.
Try quitting other applications or increasing your virtual memory.
-----

My question is: is this error MATLAB's fault, or is this sites.xml file trying to run too many apps on each node at once?

Also, a tangentially related question about the coaster service: why would you want to have more than one coaster worker running on each node? Or rather, does each coaster worker correspond to a single app invocation, or can one coaster worker manage many simultaneous app invocations on a single node?
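
(In case it is the latter: my guess is that each concurrent app starts its own MATLAB session, and each session tries to start its own JVM, so eight at once could exhaust a node's memory. If that guess is right, I'd assume the first thing to try is lowering jobsPerNode in the sites.xml below, e.g.

<profile namespace="globus" key="jobsPerNode">2</profile>

where the value 2 is just a guess at what a Fusion node can hold, not a tested number. Alternatively, if the MATLAB code doesn't need Java at all, starting MATLAB with the -nojvm flag might sidestep the heap allocation entirely.)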
<div><br></div><div><br><div class="gmail_quote">On Wed, Sep 12, 2012 at 3:44 PM, Jonathan Margoliash <span dir="ltr"><<a href="mailto:jmargolpeople@gmail.com" target="_blank">jmargolpeople@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thanks David!<br><br>I was expecting the jobs themselves to crash, but I wanted to get to that point in the debugging process. I'll try this script for now, and we'll figure out why the other one wasn't working. Thanks again,<div>
<br></div><div>Jonathan<div><div class="h5"><br><br><div class="gmail_quote">
On Wed, Sep 12, 2012 at 3:24 PM, David Kelly <span dir="ltr"><<a href="mailto:davidk@ci.uchicago.edu" target="_blank">davidk@ci.uchicago.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Jonathan,

I think the error is related to something being misconfigured in sites.xml. When I tried the same version of sites.xml, I saw the same qdel error. I will look into why that is, but in the meantime, can you please try using this sites.xml:
<config>
  <pool handle="fusion">
    <execution jobmanager="local:pbs" provider="coaster" url="none"/>
    <filesystem provider="local" url="none"/>
    <profile namespace="globus" key="maxtime">3600</profile>
    <profile namespace="globus" key="jobsPerNode">8</profile>
    <profile namespace="globus" key="queue">shared</profile>
    <profile namespace="globus" key="slots">100</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxNodes">2</profile>
    <profile namespace="karajan" key="jobThrottle">5.99</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="globus" key="HighOverAllocation">100</profile>
    <profile namespace="globus" key="LowOverAllocation">100</profile>
    <workdirectory>/homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork</workdirectory>
  </pool>
</config>
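
(In case the knobs are unfamiliar, here is roughly what the coaster-related keys above control, as I understand them from the Swift user guide; treat this as a sketch rather than authoritative documentation:

maxtime - wall time in seconds requested for each coaster block, i.e. the PBS job that hosts the workers
jobsPerNode - how many app invocations run concurrently on each node
slots - maximum number of coaster blocks active at once
nodeGranularity / maxNodes - blocks are sized in multiples of nodeGranularity nodes, up to maxNodes per block
jobThrottle / initialScore - Swift-side concurrency limit; a jobThrottle of 5.99 allows roughly 600 simultaneous apps, and the large initialScore makes that limit take effect immediately
HighOverAllocation / LowOverAllocation - how much wall time blocks request relative to the wall times of the jobs they are expected to run)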

(Modify your workdirectory as needed.) I added this to my copy of fusion_start_sce.sh, in /homes/davidk/my_SwiftSCE2_branch_matlab/fusion_start_sce.sh. It seems to work for me, at least in terms of submitting and reporting on the status of jobs. The jobs themselves fail because some of the scripts reference /usr/bin/octave, which doesn't exist on Fusion. Hopefully this helps you get a little further.
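
(If you want to check the octave situation yourself, running something like 'which octave' or 'octave --version' on a Fusion login node should show whether and where it's installed; if Fusion uses environment modules, 'module avail octave' may also be worth a try, though I don't know offhand whether such a module exists there.)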

Thanks,
David

----- Original Message -----
> From: "Jonathan Margoliash" <jmargolpeople@gmail.com>
> To: "David Kelly" <davidk@ci.uchicago.edu>
> Cc: swift-user@ci.uchicago.edu, "Professor E. Yan" <eyan@anl.gov>
> Sent: Wednesday, September 12, 2012 11:32:56 AM
> Subject: Re: Getting swift to run on Fusion
> I attached the .0.rlog, .log, .d, and the swift.log files. Which of
> those files do you use for debugging? And these files are all located
> in the directory
>
> /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235
>
> on Fusion, if that's what you were asking for. Thanks!
>
> Jonathan
>
> On Wed, Sep 12, 2012 at 12:20 PM, David Kelly <davidk@ci.uchicago.edu> wrote:
>
> Jonathan,
>
> Could you please provide a pointer to the log file that got created
> from this run?
>
> Thanks,
> David
>
> ----- Original Message -----
> > From: "Jonathan Margoliash" <jmargolpeople@gmail.com>
> > To: swift-user@ci.uchicago.edu, "Swift Language" <davidk@ci.uchicago.edu>, "Professor E. Yan" <eyan@anl.gov>
> > Sent: Wednesday, September 12, 2012 10:50:35 AM
> > Subject: Getting swift to run on Fusion
> > Hello swift support,
> >
> > This is my first attempt at getting swift to work on Fusion, and I'm
> > getting the following output to the terminal:
> >
> > ------
> >
> > Warning: Function toint is deprecated, at line 10
> > Swift trunk swift-r5882 cog-r3434
> >
> > RunID: 20120912-1032-5y7xb1ug
> > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500
> > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 Submitted:8
> > ...
> > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 Submitted:8
> > Failed to shut down block: Block 0912-321051-000005 (8x60.000s)
> > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> > Failed to cancel task. qdel returned with an exit code of 153
> >     at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205)
> >     at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
> >     at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69)
> >     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102)
> >     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100)
> >     at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293)
> >     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552)
> >     at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140)
> > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 Submitted:8
> > ...
> > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 Submitted:8
> > Failed to shut down block: Block 0912-321051-000006 (8x60.000s)
> > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> > Failed to cancel task. qdel returned with an exit code of 153
> >     at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205)
> >     at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
> >     at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69)
> >     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102)
> >     at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100)
> >     at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293)
> >     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552)
> >     at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140)
> > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 Submitted:8
> > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 Submitted:8
> > ...
> >
> > ------
> >
> > I understand the long runs of unchanging "Progress: ..." reports:
> > the shared queue is busy, so I am not expecting my job to be
> > executed right away. However, I don't understand why I'm getting
> > these "failed to cancel task" errors. I gave each individual app
> > well more than enough time to run to completion. And while I set the
> > time limit on the entire process to be much smaller than it needs
> > (<profile namespace="globus" key="maxTime">60</profile> in sites.xml,
> > when the process could run for days), I presumed the entire process
> > would just get shut down after 60 seconds of runtime. Why are these
> > errors cropping up? Thanks,
> >
> > Jonathan