[Swift-devel] decaying number of coaster jobs leaves some tasks unfinished

Glen Hocky hockyg at uchicago.edu
Mon Aug 9 17:19:41 CDT 2010


Here's the full log (I think).

What Mike's describing is basically my gut feeling as well...

> Did you leave the tail end of this run running long enough for the current
> block to end, to see if it starts a new 3600 second block?

A different run, from before I tried to reproduce the problem, ran all night
like that last night without starting any new blocks (but the settings were
very slightly different, with fewer "slots", and it stalled with 7 jobs left,
I think).
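
To make the suspected condition concrete, here is a minimal Python sketch of
the arithmetic Mike describes in the quoted message below. It is only an
illustration of the time-fit check, not Swift's actual coaster scheduling code;
the function names parse_hms and fits_in_block are made up for this example,
and the values are taken from the maxtime and maxwalltime settings in the
sites entry.

```python
def parse_hms(walltime: str) -> int:
    """Convert an HH:MM:SS walltime string into seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def fits_in_block(block_maxtime_s: int, elapsed_s: int, job_walltime_s: int) -> bool:
    """Return True if the job's requested walltime fits in the time left in the block."""
    return block_maxtime_s - elapsed_s >= job_walltime_s


maxtime = 3600                       # "maxtime" from the sites entry, in seconds
maxwalltime = parse_hms("00:25:00")  # "maxwalltime" from the sites entry

# A fresh block can take a job, and so can one that is 30 minutes old...
print(fits_in_block(maxtime, 0, maxwalltime))        # True
print(fits_in_block(maxtime, 30 * 60, maxwalltime))  # True
# ...but after 40 minutes only 20 minutes remain, less than the 25 requested.
print(fits_in_block(maxtime, 40 * 60, maxwalltime))  # False
```

If the hypothesis is right, a leftover task in the last case can only run once
a new block is started.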

On Mon, Aug 9, 2010 at 6:01 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:

> I have seen a common problem where maxwalltime on queued jobs exceeds
> maxtime, in which case Swift hangs, never finding a block it can fit the
> jobs into.
>
> I wonder if this is another manifestation of that behavior/bug: the time
> left in the running block is less than the 25 min maxwalltime for the
> remaining tasks, and Swift does not realize that it needs to end that block
> and start a new one.
>
> Did you leave the tail end of this run running long enough for the current
> block to end, to see if it starts a new 3600 second block?
>
> I'm just surmising one possible cause; actual problem here might be
> completely different.
>
> - Mike
>
>
> ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
>
> > That error might be related. Can I have the full log?
> >
> > On Mon, 2010-08-09 at 17:44 -0400, Glen Hocky wrote:
> > > Hey everyone,
> > > I've been trying to run some short jobs in the "fast" queue on pads.
> > > That means I need to keep the wall time under 1 hour, and my tasks are
> > > around 20 min.  What's been happening, at least for a smallish number
> > > of jobs, is that swift decreases the number of jobs submitted to the
> > > queue as the number of tasks is reduced and at the end, some tasks
> > > remain unfinished while no jobs are in the queue, and this continues
> > > indefinitely.
> > >
> > >
> > > The following is one sites entry where I reproducibly had this problem
> > > for 70 tasks
> > >
> > >
> > >     <execution provider="coaster" url="none" jobManager="local:pbs"/>
> > >     <!--<profile namespace="globus" key="queue">fast</profile>-->
> > >     <profile namespace="globus" key="maxtime">3600</profile>
> > >     <profile namespace="globus" key="maxwalltime">00:25:00</profile>
> > >     <profile namespace="globus" key="workersPerNode">1</profile>
> > >     <profile namespace="globus" key="internalHostname">172.5.86.5</profile>
> > >     <profile namespace="globus" key="slots">120</profile>
> > >     <profile namespace="globus" key="nodeGranularity">1</profile>
> > >     <profile namespace="globus" key="maxNodes">1</profile>
> > >     <profile namespace="karajan" key="jobThrottle">0.99</profile>
> > >     <profile namespace="karajan" key="initialScore">10000</profile>
> > >     <profile namespace="globus" key="project">CI-CCR000013</profile>
> > >     <gridftp  url="local://localhost" />
> > >     <scratch>/tmp</scratch>
> > >     <workdirectory>/home/hockyg/reichman/glassy_dynamics/code/swift/run/real</workdirectory>
> > >
> > >
> > >
> > >
> > > There are also some errors of this type:
> > > Exception caught while unregistering channel
> > > org.globus.cog.karajan.workflow.service.channels.ChannelException: Trying to bind invalid channel (2027063355: {}) to 60652275: {}
> > >     at org.globus.cog.karajan.workflow.service.channels.MetaChannel.bind(MetaChannel.java:67)
> > >     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.unregisterChannel(ChannelManager.java:401)
> > >     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411)
> > >     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284)
> > >     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83)
> > >     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)
> > > but I'm not sure that's related...
> > >
> > >
> > >
> > >
> > > Running with "Swift svn swift-r3432 (swift modified locally) cog-r2829"
> > >
> > >
> > > Swift output went something like
> > > Progress:  Submitted:69  Active:1  Finished successfully:1
> > > Progress:  Submitted:67  Active:3  Finished successfully:1
> > > Progress:  Submitted:66  Active:4  Finished successfully:1
> > > Progress:  Submitted:65  Active:5  Finished successfully:1
> > > Progress:  Submitted:64  Active:6  Finished successfully:1
> > > Progress:  Submitted:61  Active:9  Finished successfully:1
> > > Progress:  Submitted:58  Active:12  Finished successfully:1
> > > Progress:  Submitted:57  Active:13  Finished successfully:1
> > > Progress:  Submitted:54  Active:16  Finished successfully:1
> > > Progress:  Submitted:52  Active:18  Finished successfully:1
> > > Progress:  Submitted:51  Active:19  Finished successfully:1
> > > Progress:  Submitted:50  Active:20  Finished successfully:1
> > > Progress:  Submitted:49  Active:21  Finished successfully:1
> > > Progress:  Submitted:48  Active:22  Finished successfully:1
> > > Progress:  Submitted:41  Active:29  Finished successfully:1
> > > Progress:  Submitted:38  Active:32  Finished successfully:1
> > > Progress:  Submitted:37  Active:33  Finished successfully:1
> > > Progress:  Submitted:35  Active:35  Finished successfully:1
> > > Progress:  Submitted:31  Active:39  Finished successfully:1
> > > Progress:  Submitted:30  Active:40  Finished successfully:1
> > > Progress:  Submitted:26  Active:44  Finished successfully:1
> > > Progress:  Submitted:26  Active:44  Finished successfully:1
> > > Progress:  Submitted:26  Active:44  Finished successfully:1
> > > Progress:  Submitted:26  Active:44  Finished successfully:1
> > > Progress:  Submitted:26  Active:44  Finished successfully:1
> > > Progress:  Submitted:26  Active:44  Finished successfully:1
> > > Progress:  Submitted:26  Active:43  Checking status:1  Finished successfully:1
> > > Progress:  Submitted:26  Active:43  Finished successfully:2
> > > Progress:  Submitted:26  Active:42  Checking status:1  Finished successfully:2
> > > Progress:  Submitted:25  Active:42  Checking status:1  Finished successfully:3
> > > Progress:  Submitted:25  Active:41  Checking status:1  Finished successfully:4
> > > Progress:  Submitted:25  Active:41  Finished successfully:5
> > > Progress:  Submitted:25  Active:40  Checking status:1  Finished successfully:5
> > > Progress:  Submitted:25  Active:39  Checking status:1  Finished successfully:6
> > > Progress:  Submitted:24  Active:40  Finished successfully:7
> > > Progress:  Submitted:24  Active:39  Checking status:1  Finished successfully:7
> > > Progress:  Submitted:24  Active:38  Checking status:1  Finished successfully:8
> > > Progress:  Submitted:24  Active:38  Finished successfully:9
> > > Progress:  Submitted:24  Active:37  Checking status:1  Finished successfully:9
> > > Progress:  Submitted:24  Active:35  Checking status:1  Finished successfully:11
> > > Progress:  Submitted:23  Active:35  Checking status:1  Finished successfully:12
> > > Progress:  Submitted:22  Active:35  Checking status:1  Finished successfully:13
> > > Progress:  Submitted:22  Active:35  Finished successfully:14
> > > Progress:  Submitted:22  Active:34  Checking status:1  Finished successfully:14
> > > Progress:  Submitted:21  Active:34  Checking status:1  Finished successfully:15
> > > Progress:  Submitted:21  Active:34  Finished successfully:16
> > > Progress:  Submitted:21  Active:33  Checking status:1  Finished successfully:16
> > > Progress:  Submitted:21  Active:33  Finished successfully:17
> > > Progress:  Submitted:20  Active:32  Checking status:1  Finished successfully:18
> > > Progress:  Submitted:20  Active:32  Finished successfully:19
> > > Progress:  Submitted:20  Active:31  Checking status:1  Finished successfully:19
> > > Progress:  Submitted:19  Active:31  Finished successfully:21
> > > Progress:  Submitted:19  Active:30  Checking status:1  Finished successfully:21
> > > Progress:  Submitted:18  Active:30  Checking status:1  Finished successfully:22
> > > Progress:  Submitted:18  Active:29  Checking status:1  Finished successfully:23
> > > Progress:  Submitted:18  Active:28  Checking status:1  Finished successfully:24
> > > Progress:  Submitted:17  Active:29  Finished successfully:25
> > > Progress:  Submitted:17  Active:29  Finished successfully:25
> > > Progress:  Submitted:17  Active:28  Checking status:1  Finished successfully:25
> > > Progress:  Submitted:17  Active:27  Checking status:1  Finished successfully:26
> > > Progress:  Submitted:17  Active:26  Checking status:1  Finished successfully:27
> > > Progress:  Submitted:17  Active:25  Checking status:1  Finished successfully:28
> > > Progress:  Submitted:17  Active:24  Checking status:1  Finished successfully:29
> > > Progress:  Submitted:16  Active:25  Finished successfully:30
> > > Progress:  Submitted:16  Active:24  Checking status:1  Finished successfully:30
> > > Progress:  Submitted:15  Active:24  Checking status:1  Finished successfully:31
> > > Progress:  Submitted:15  Active:24  Finished successfully:32
> > > Progress:  Submitted:15  Active:23  Checking status:1  Finished successfully:32
> > > Progress:  Submitted:14  Active:24  Finished successfully:33
> > > Progress:  Submitted:14  Active:23  Checking status:1  Finished successfully:33
> > > Progress:  Submitted:14  Active:22  Checking status:1  Finished successfully:34
> > > Progress:  Submitted:14  Active:22  Finished successfully:35
> > > Progress:  Submitted:14  Active:21  Checking status:1  Finished successfully:35
> > > Progress:  Submitted:13  Active:22  Finished successfully:36
> > > Progress:  Submitted:13  Active:22  Finished successfully:36
> > > Progress:  Submitted:13  Active:20  Checking status:1  Finished successfully:37
> > > Progress:  Submitted:12  Active:21  Finished successfully:38
> > > Progress:  Submitted:12  Active:20  Checking status:1  Finished successfully:38
> > > Progress:  Submitted:12  Active:19  Checking status:1  Finished successfully:39
> > > Progress:  Submitted:12  Active:19  Finished successfully:40
> > > Progress:  Submitted:12  Active:18  Checking status:1  Finished successfully:40
> > > Progress:  Submitted:12  Active:17  Checking status:1  Finished successfully:41
> > > Progress:  Submitted:11  Active:17  Checking status:1  Finished successfully:42
> > > Progress:  Submitted:11  Active:17  Finished successfully:43
> > > Progress:  Submitted:11  Active:16  Checking status:1  Finished successfully:43
> > > Progress:  Submitted:11  Active:15  Checking status:1  Finished successfully:44
> > > Progress:  Submitted:10  Active:16  Finished successfully:45
> > > Progress:  Submitted:3  Active:22  Finished successfully:46
> > > Progress:  Submitted:3  Active:21  Checking status:1  Finished successfully:46
> > > Progress:  Submitted:3  Active:19  Finished successfully:49
> > > Progress:  Submitted:3  Active:19  Finished successfully:49
> > > Progress:  Submitted:2  Active:20  Finished successfully:49
> > > Progress:  Submitted:1  Active:21  Finished successfully:49
> > > .
> > > .
> > > .
> > > Progress:  Submitted:1  Active:15  Finished successfully:55
> > > Progress:  Submitted:1  Active:15  Finished successfully:55
> > > Progress:  Submitted:1  Active:15  Finished successfully:55
> > > Progress:  Submitted:1  Active:15  Finished successfully:55
> > > Progress:  Submitted:1  Active:14  Checking status:1  Finished successfully:55
> > > Progress:  Submitted:1  Active:14  Finished successfully:56
> > > Progress:  Submitted:1  Active:13  Checking status:1  Finished successfully:56
> > > Progress:  Submitted:1  Active:12  Checking status:1  Finished successfully:57
> > > Progress:  Submitted:1  Active:12  Finished successfully:58
> > > Progress:  Submitted:1  Active:11  Checking status:1  Finished successfully:58
> > > Progress:  Submitted:1  Active:10  Checking status:1  Finished successfully:59
> > > Progress:  Submitted:1  Active:10  Finished successfully:60
> > > Progress:  Submitted:1  Active:10  Finished successfully:60
> > > Progress:  Submitted:1  Active:10  Finished successfully:60
> > > Progress:  Submitted:1  Active:8  Checking status:1  Finished successfully:61
> > > Progress:  Submitted:1  Active:8  Finished successfully:62
> > > Progress:  Submitted:1  Active:7  Checking status:1  Finished successfully:62
> > > Progress:  Submitted:1  Active:7  Finished successfully:63
> > > Progress:  Submitted:1  Active:6  Checking status:1  Finished successfully:63
> > > Progress:  Submitted:1  Active:4  Checking status:1  Finished successfully:65
> > > Progress:  Submitted:1  Active:3  Checking status:1  Finished successfully:66
> > > Progress:  Submitted:1  Active:3  Finished successfully:67
> > > Progress:  Submitted:1  Active:3  Finished successfully:67
> > > Progress:  Submitted:1  Active:2  Checking status:1  Finished successfully:67
> > > Progress:  Submitted:1  Active:2  Finished successfully:68
> > > Progress:  Submitted:1  Active:1  Checking status:1  Finished successfully:68
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > > Progress:  Submitted:1  Finished successfully:70
> > >
> > >
> > > etc
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> >
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: glassEquilCavities-20100809-1547-i1s75vd0.log
Type: application/octet-stream
Size: 1659730 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20100809/448e91e6/attachment.obj>

