Here's the full log (I think). <div><br><div>What Mike's describing is basically my gut feeling as well...</div><div><br></div><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; ">
Did you leave the tail end of this run running long enough for the current block to end, to see if it starts a new 3600 second block?</blockquote><div>A different run, from before I tried to reproduce the problem, ran like that all last night without starting any new blocks (though the settings were very slightly different, with fewer "slots", and it stalled with 7 jobs left, I think).<br>
<br><div class="gmail_quote">On Mon, Aug 9, 2010 at 6:01 PM, Michael Wilde <span dir="ltr"><<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
I have seen a common problem where maxwalltime on queued jobs exceeds maxtime, in which case Swift hangs, never finding a block it can fit the jobs into.<br>
<br>
I wonder if this is another manifestation of that behavior/bug: the time left in the running block is less than the 25 min maxwalltime for the remaining tasks, and Swift does not realize that it needs to end that block and start a new one.<br>
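<br>
As a rough illustration of that hypothesis, here is a minimal Python sketch of the fit check. It is only a model of the arithmetic, not Swift's actual coaster scheduling code: the function names walltime_to_seconds and job_fits are made up for this example, and the 2400 second elapsed time is an arbitrary illustrative value. The 3600 and 00:25:00 values are the maxtime and maxwalltime from the sites entry quoted below.<br>
<br>
<pre>
```python
def walltime_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def job_fits(block_maxtime: int, block_elapsed: int, job_walltime: int) -> bool:
    """Return True if the job's walltime fits in the time left in the block.

    Hypothetical model of the check: a job is only placed in a block
    when the block's remaining time covers the job's full walltime.
    """
    remaining = block_maxtime - block_elapsed
    return remaining >= job_walltime


maxtime = 3600                                  # block length in seconds
maxwalltime = walltime_to_seconds("00:25:00")   # 1500 seconds per task

# Fresh block: 3600 s remain, so a 1500 s task fits.
print(job_fits(maxtime, 0, maxwalltime))

# 40 minutes into the block: only 1200 s remain, so the task does not fit.
print(job_fits(maxtime, 2400, maxwalltime))

# Latest elapsed time at which a task can still be placed in the block.
print(maxtime - maxwalltime)
```
</pre>
With these settings a task only fits if it starts within the first 2100 seconds (35 minutes) of the block; after that it can run only if a new block is started. The same check fails from the very start of a block whenever maxwalltime is larger than maxtime, which is the first case described above.<br>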
<br>
Did you leave the tail end of this run running long enough for the current block to end, to see if it starts a new 3600 second block?<br>
<br>
I'm just surmising one possible cause; the actual problem here might be completely different.<br>
<br>
- Mike<br>
<div><div></div><div class="h5"><br>
<br>
----- "Mihael Hategan" <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>> wrote:<br>
<br>
> That error might be related. Can I have the full log?<br>
><br>
> On Mon, 2010-08-09 at 17:44 -0400, Glen Hocky wrote:<br>
> > Hey everyone,<br>
> > I've been trying to run some short jobs in the "fast" queue on<br>
> pads.<br>
> > That means I need to keep the wall time under 1 hour, and my tasks<br>
> are<br>
> > around 20 min. What's been happening, at least for a smallish<br>
> number<br>
> > of jobs, is that swift decreases the number of jobs submitted to<br>
> the<br>
> > queue as the number of tasks is reduced and at the end, some tasks<br>
> > remain unfinished while no jobs are in the queue, and this<br>
> continues<br>
> > indefinitely.<br>
> ><br>
> ><br>
> > The following is one sites entry where I reproducibly had this<br>
> problem<br>
> > for 70 tasks<br>
> ><br>
> ><br>
> > <execution provider="coaster" url="none"<br>
> > jobManager="local:pbs"/><br>
> > <!--<profile namespace="globus"<br>
> > key="queue">fast</profile>--><br>
> > <profile namespace="globus"<br>
> key="maxtime">3600</profile><br>
> > <profile namespace="globus"<br>
> > key="maxwalltime">00:25:00</profile><br>
> > <profile namespace="globus"<br>
> > key="workersPerNode">1</profile><br>
> > <profile namespace="globus"<br>
> > key="internalHostname">172.5.86.5</profile><br>
> > <profile namespace="globus" key="slots">120</profile><br>
> > <profile namespace="globus"<br>
> > key="nodeGranularity">1</profile><br>
> > <profile namespace="globus" key="maxNodes">1</profile><br>
> > <profile namespace="karajan"<br>
> > key="jobThrottle">0.99</profile><br>
> > <profile namespace="karajan"<br>
> > key="initialScore">10000</profile><br>
> > <profile namespace="globus"<br>
> > key="project">CI-CCR000013</profile><br>
> > <gridftp url="local://localhost" /><br>
> > <scratch>/tmp</scratch><br>
> ><br>
> ><br>
> <workdirectory>/home/hockyg/reichman/glassy_dynamics/code/swift/run/real</workdirectory><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > There are also some of this type of error<br>
> > Exception caught while unregistering channel<br>
> ><br>
> org.globus.cog.karajan.workflow.service.channels.ChannelException:<br>
> Trying to bind invalid channel (2027063355: {}) to 60652275: {}<br>
> > at<br>
> ><br>
> org.globus.cog.karajan.workflow.service.channels.MetaChannel.bind(MetaChannel.java:67)<br>
> > at<br>
> ><br>
> org.globus.cog.karajan.workflow.service.channels.ChannelManager.unregisterChannel(ChannelManager.java:401)<br>
> > at<br>
> ><br>
> org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411)<br>
> > at<br>
> ><br>
> org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284)<br>
> > at<br>
> ><br>
> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83)<br>
> > at<br>
> ><br>
> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)<br>
> > but I'm not sure that's related...<br>
> ><br>
> ><br>
> ><br>
> ><br>
> > Running with "Swift svn swift-r3432 (swift modified locally)<br>
> > cog-r2829"<br>
> ><br>
> ><br>
> > Swift output went something like<br>
> > Progress: Submitted:69 Active:1 Finished successfully:1<br>
> > Progress: Submitted:67 Active:3 Finished successfully:1<br>
> > Progress: Submitted:66 Active:4 Finished successfully:1<br>
> > Progress: Submitted:65 Active:5 Finished successfully:1<br>
> > Progress: Submitted:64 Active:6 Finished successfully:1<br>
> > Progress: Submitted:61 Active:9 Finished successfully:1<br>
> > Progress: Submitted:58 Active:12 Finished successfully:1<br>
> > Progress: Submitted:57 Active:13 Finished successfully:1<br>
> > Progress: Submitted:54 Active:16 Finished successfully:1<br>
> > Progress: Submitted:52 Active:18 Finished successfully:1<br>
> > Progress: Submitted:51 Active:19 Finished successfully:1<br>
> > Progress: Submitted:50 Active:20 Finished successfully:1<br>
> > Progress: Submitted:49 Active:21 Finished successfully:1<br>
> > Progress: Submitted:48 Active:22 Finished successfully:1<br>
> > Progress: Submitted:41 Active:29 Finished successfully:1<br>
> > Progress: Submitted:38 Active:32 Finished successfully:1<br>
> > Progress: Submitted:37 Active:33 Finished successfully:1<br>
> > Progress: Submitted:35 Active:35 Finished successfully:1<br>
> > Progress: Submitted:31 Active:39 Finished successfully:1<br>
> > Progress: Submitted:30 Active:40 Finished successfully:1<br>
> > Progress: Submitted:26 Active:44 Finished successfully:1<br>
> > Progress: Submitted:26 Active:44 Finished successfully:1<br>
> > Progress: Submitted:26 Active:44 Finished successfully:1<br>
> > Progress: Submitted:26 Active:44 Finished successfully:1<br>
> > Progress: Submitted:26 Active:44 Finished successfully:1<br>
> > Progress: Submitted:26 Active:44 Finished successfully:1<br>
> > Progress: Submitted:26 Active:43 Checking status:1 Finished<br>
> > successfully:1<br>
> > Progress: Submitted:26 Active:43 Finished successfully:2<br>
> > Progress: Submitted:26 Active:42 Checking status:1 Finished<br>
> > successfully:2<br>
> > Progress: Submitted:25 Active:42 Checking status:1 Finished<br>
> > successfully:3<br>
> > Progress: Submitted:25 Active:41 Checking status:1 Finished<br>
> > successfully:4<br>
> > Progress: Submitted:25 Active:41 Finished successfully:5<br>
> > Progress: Submitted:25 Active:40 Checking status:1 Finished<br>
> > successfully:5<br>
> > Progress: Submitted:25 Active:39 Checking status:1 Finished<br>
> > successfully:6<br>
> > Progress: Submitted:24 Active:40 Finished successfully:7<br>
> > Progress: Submitted:24 Active:39 Checking status:1 Finished<br>
> > successfully:7<br>
> > Progress: Submitted:24 Active:38 Checking status:1 Finished<br>
> > successfully:8<br>
> > Progress: Submitted:24 Active:38 Finished successfully:9<br>
> > Progress: Submitted:24 Active:37 Checking status:1 Finished<br>
> > successfully:9<br>
> > Progress: Submitted:24 Active:35 Checking status:1 Finished<br>
> > successfully:11<br>
> > Progress: Submitted:23 Active:35 Checking status:1 Finished<br>
> > successfully:12<br>
> > Progress: Submitted:22 Active:35 Checking status:1 Finished<br>
> > successfully:13<br>
> > Progress: Submitted:22 Active:35 Finished successfully:14<br>
> > Progress: Submitted:22 Active:34 Checking status:1 Finished<br>
> > successfully:14<br>
> > Progress: Submitted:21 Active:34 Checking status:1 Finished<br>
> > successfully:15<br>
> > Progress: Submitted:21 Active:34 Finished successfully:16<br>
> > Progress: Submitted:21 Active:33 Checking status:1 Finished<br>
> > successfully:16<br>
> > Progress: Submitted:21 Active:33 Finished successfully:17<br>
> > Progress: Submitted:20 Active:32 Checking status:1 Finished<br>
> > successfully:18<br>
> > Progress: Submitted:20 Active:32 Finished successfully:19<br>
> > Progress: Submitted:20 Active:31 Checking status:1 Finished<br>
> > successfully:19<br>
> > Progress: Submitted:19 Active:31 Finished successfully:21<br>
> > Progress: Submitted:19 Active:30 Checking status:1 Finished<br>
> > successfully:21<br>
> > Progress: Submitted:18 Active:30 Checking status:1 Finished<br>
> > successfully:22<br>
> > Progress: Submitted:18 Active:29 Checking status:1 Finished<br>
> > successfully:23<br>
> > Progress: Submitted:18 Active:28 Checking status:1 Finished<br>
> > successfully:24<br>
> > Progress: Submitted:17 Active:29 Finished successfully:25<br>
> > Progress: Submitted:17 Active:29 Finished successfully:25<br>
> > Progress: Submitted:17 Active:28 Checking status:1 Finished<br>
> > successfully:25<br>
> > Progress: Submitted:17 Active:27 Checking status:1 Finished<br>
> > successfully:26<br>
> > Progress: Submitted:17 Active:26 Checking status:1 Finished<br>
> > successfully:27<br>
> > Progress: Submitted:17 Active:25 Checking status:1 Finished<br>
> > successfully:28<br>
> > Progress: Submitted:17 Active:24 Checking status:1 Finished<br>
> > successfully:29<br>
> > Progress: Submitted:16 Active:25 Finished successfully:30<br>
> > Progress: Submitted:16 Active:24 Checking status:1 Finished<br>
> > successfully:30<br>
> > Progress: Submitted:15 Active:24 Checking status:1 Finished<br>
> > successfully:31<br>
> > Progress: Submitted:15 Active:24 Finished successfully:32<br>
> > Progress: Submitted:15 Active:23 Checking status:1 Finished<br>
> > successfully:32<br>
> > Progress: Submitted:14 Active:24 Finished successfully:33<br>
> > Progress: Submitted:14 Active:23 Checking status:1 Finished<br>
> > successfully:33<br>
> > Progress: Submitted:14 Active:22 Checking status:1 Finished<br>
> > successfully:34<br>
> > Progress: Submitted:14 Active:22 Finished successfully:35<br>
> > Progress: Submitted:14 Active:21 Checking status:1 Finished<br>
> > successfully:35<br>
> > Progress: Submitted:13 Active:22 Finished successfully:36<br>
> > Progress: Submitted:13 Active:22 Finished successfully:36<br>
> > Progress: Submitted:13 Active:20 Checking status:1 Finished<br>
> > successfully:37<br>
> > Progress: Submitted:12 Active:21 Finished successfully:38<br>
> > Progress: Submitted:12 Active:20 Checking status:1 Finished<br>
> > successfully:38<br>
> > Progress: Submitted:12 Active:19 Checking status:1 Finished<br>
> > successfully:39<br>
> > Progress: Submitted:12 Active:19 Finished successfully:40<br>
> > Progress: Submitted:12 Active:18 Checking status:1 Finished<br>
> > successfully:40<br>
> > Progress: Submitted:12 Active:17 Checking status:1 Finished<br>
> > successfully:41<br>
> > Progress: Submitted:11 Active:17 Checking status:1 Finished<br>
> > successfully:42<br>
> > Progress: Submitted:11 Active:17 Finished successfully:43<br>
> > Progress: Submitted:11 Active:16 Checking status:1 Finished<br>
> > successfully:43<br>
> > Progress: Submitted:11 Active:15 Checking status:1 Finished<br>
> > successfully:44<br>
> > Progress: Submitted:10 Active:16 Finished successfully:45<br>
> > Progress: Submitted:3 Active:22 Finished successfully:46<br>
> > Progress: Submitted:3 Active:21 Checking status:1 Finished<br>
> > successfully:46<br>
> > Progress: Submitted:3 Active:19 Finished successfully:49<br>
> > Progress: Submitted:3 Active:19 Finished successfully:49<br>
> > Progress: Submitted:2 Active:20 Finished successfully:49<br>
> > Progress: Submitted:1 Active:21 Finished successfully:49<br>
> > .<br>
> > .<br>
> > .<br>
> > Progress: Submitted:1 Active:15 Finished successfully:55<br>
> > Progress: Submitted:1 Active:15 Finished successfully:55<br>
> > Progress: Submitted:1 Active:15 Finished successfully:55<br>
> > Progress: Submitted:1 Active:15 Finished successfully:55<br>
> > Progress: Submitted:1 Active:14 Checking status:1 Finished<br>
> > successfully:55<br>
> > Progress: Submitted:1 Active:14 Finished successfully:56<br>
> > Progress: Submitted:1 Active:13 Checking status:1 Finished<br>
> > successfully:56<br>
> > Progress: Submitted:1 Active:12 Checking status:1 Finished<br>
> > successfully:57<br>
> > Progress: Submitted:1 Active:12 Finished successfully:58<br>
> > Progress: Submitted:1 Active:11 Checking status:1 Finished<br>
> > successfully:58<br>
> > Progress: Submitted:1 Active:10 Checking status:1 Finished<br>
> > successfully:59<br>
> > Progress: Submitted:1 Active:10 Finished successfully:60<br>
> > Progress: Submitted:1 Active:10 Finished successfully:60<br>
> > Progress: Submitted:1 Active:10 Finished successfully:60<br>
> > Progress: Submitted:1 Active:8 Checking status:1 Finished<br>
> > successfully:61<br>
> > Progress: Submitted:1 Active:8 Finished successfully:62<br>
> > Progress: Submitted:1 Active:7 Checking status:1 Finished<br>
> > successfully:62<br>
> > Progress: Submitted:1 Active:7 Finished successfully:63<br>
> > Progress: Submitted:1 Active:6 Checking status:1 Finished<br>
> > successfully:63<br>
> > Progress: Submitted:1 Active:4 Checking status:1 Finished<br>
> > successfully:65<br>
> > Progress: Submitted:1 Active:3 Checking status:1 Finished<br>
> > successfully:66<br>
> > Progress: Submitted:1 Active:3 Finished successfully:67<br>
> > Progress: Submitted:1 Active:3 Finished successfully:67<br>
> > Progress: Submitted:1 Active:2 Checking status:1 Finished<br>
> > successfully:67<br>
> > Progress: Submitted:1 Active:2 Finished successfully:68<br>
> > Progress: Submitted:1 Active:1 Checking status:1 Finished<br>
> > successfully:68<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> > Progress: Submitted:1 Finished successfully:70<br>
> ><br>
> ><br>
> > etc<br>
> > _______________________________________________<br>
> > Swift-devel mailing list<br>
> > <a href="mailto:Swift-devel@ci.uchicago.edu">Swift-devel@ci.uchicago.edu</a><br>
> > <a href="http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel" target="_blank">http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel</a><br>
><br>
><br>
<br>
</div></div><font color="#888888">--<br>
Michael Wilde<br>
Computation Institute, University of Chicago<br>
Mathematics and Computer Science Division<br>
Argonne National Laboratory<br>
<br>
</font></blockquote></div><br></div></div>