[Swift-devel] decaying number of coaster jobs leaves some tasks unfinished
Mihael Hategan
hategan at mcs.anl.gov
Mon Aug 9 17:37:27 CDT 2010
On Mon, 2010-08-09 at 18:19 -0400, Glen Hocky wrote:
> Here's the full log (I think).
>
> What Mike's describing is basically my gut feeling as well...
Right. The log should tell us.
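Before digging through the full log, the stalled state can be spotted directly in the progress output quoted below. The following is a minimal sketch, not part of Swift: `parse_progress` and `is_stalled` are hypothetical helper names, and the stall condition (tasks still "Submitted" while none are "Active") is an assumption taken from the symptom described in this thread. A single such line can also appear at the very start of a run, so only a long run of them points to a hang.

```python
import re

# Matches "Name:count" pairs such as "Submitted:26" or "Finished successfully:70".
PROGRESS_FIELD = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def parse_progress(line: str) -> dict:
    """Parse one Swift progress line into a dict of state name -> task count."""
    body = line.split("Progress:", 1)[1]
    return {name.strip(): int(count) for name, count in PROGRESS_FIELD.findall(body)}


def is_stalled(counts: dict) -> bool:
    """Assumed stall condition: tasks are still waiting but nothing is active."""
    return counts.get("Submitted", 0) > 0 and counts.get("Active", 0) == 0


running = parse_progress(
    "Progress: Submitted:26 Active:43 Checking status:1 Finished successfully:1"
)
hung = parse_progress("Progress: Submitted:1 Finished successfully:70")

print(is_stalled(running))
print(is_stalled(hung))
```

Counting how many consecutive lines satisfy `is_stalled` gives a rough idea of how long the run has been idle.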
>
>
> Did you leave the tail end of this run running long enough for
> the current block to end, to see if it starts a new 3600
> second block?
> A different run, before I tried to reproduce the problem, ran all night
> like that last night without starting any new blocks... (but the
> settings were very slightly different (fewer "slots") and it stalled
> with 7 jobs left, I think)
>
> On Mon, Aug 9, 2010 at 6:01 PM, Michael Wilde <wilde at mcs.anl.gov>
> wrote:
> I have seen a common problem where maxwalltime on queued jobs
> exceeds maxtime, in which case Swift hangs, never finding a
> block it can fit the jobs into.
>
> I wonder if this is another manifestation of that
> behavior/bug: the time left in the running block is less than
> the 25 min maxwalltime for the remaining tasks, and Swift does
> not realize that it needs to end that block and start a new
> one.
>
> Did you leave the tail end of this run running long enough for
> the current block to end, to see if it starts a new 3600
> second block?
>
> I'm just surmising one possible cause; the actual problem here
> might be completely different.
>
> - Mike
>
>
>
> ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
>
> > That error might be related. Can I have the full log?
> >
> > On Mon, 2010-08-09 at 17:44 -0400, Glen Hocky wrote:
> > > Hey everyone,
> > > I've been trying to run some short jobs in the "fast" queue on pads.
> > > That means I need to keep the wall time under 1 hour, and my tasks are
> > > around 20 min. What's been happening, at least for a smallish number
> > > of jobs, is that swift decreases the number of jobs submitted to the
> > > queue as the number of tasks is reduced and at the end, some tasks
> > > remain unfinished while no jobs are in the queue, and this continues
> > > indefinitely.
> > >
> > >
> > > The following is one sites entry where I reproducibly had this problem
> > > for 70 tasks
> > >
> > >
> > > <execution provider="coaster" url="none" jobManager="local:pbs"/>
> > > <!--<profile namespace="globus" key="queue">fast</profile>-->
> > > <profile namespace="globus" key="maxtime">3600</profile>
> > > <profile namespace="globus" key="maxwalltime">00:25:00</profile>
> > > <profile namespace="globus" key="workersPerNode">1</profile>
> > > <profile namespace="globus" key="internalHostname">172.5.86.5</profile>
> > > <profile namespace="globus" key="slots">120</profile>
> > > <profile namespace="globus" key="nodeGranularity">1</profile>
> > > <profile namespace="globus" key="maxNodes">1</profile>
> > > <profile namespace="karajan" key="jobThrottle">0.99</profile>
> > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > <profile namespace="globus" key="project">CI-CCR000013</profile>
> > > <gridftp url="local://localhost" />
> > > <scratch>/tmp</scratch>
> > > <workdirectory>/home/hockyg/reichman/glassy_dynamics/code/swift/run/real</workdirectory>
> > >
> > >
> > >
> > >
> > > There are also some of this type of error
> > > Exception caught while unregistering channel
> > > org.globus.cog.karajan.workflow.service.channels.ChannelException: Trying to bind invalid channel (2027063355: {}) to 60652275: {}
> > > at org.globus.cog.karajan.workflow.service.channels.MetaChannel.bind(MetaChannel.java:67)
> > > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.unregisterChannel(ChannelManager.java:401)
> > > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411)
> > > at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284)
> > > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83)
> > > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)
> > > but i'm not sure that's related...
> > >
> > >
> > >
> > >
> > > Running with "Swift svn swift-r3432 (swift modified locally) cog-r2829"
> > >
> > >
> > > Swift output went something like
> > > Progress: Submitted:69 Active:1 Finished successfully:1
> > > Progress: Submitted:67 Active:3 Finished successfully:1
> > > Progress: Submitted:66 Active:4 Finished successfully:1
> > > Progress: Submitted:65 Active:5 Finished successfully:1
> > > Progress: Submitted:64 Active:6 Finished successfully:1
> > > Progress: Submitted:61 Active:9 Finished successfully:1
> > > Progress: Submitted:58 Active:12 Finished successfully:1
> > > Progress: Submitted:57 Active:13 Finished successfully:1
> > > Progress: Submitted:54 Active:16 Finished successfully:1
> > > Progress: Submitted:52 Active:18 Finished successfully:1
> > > Progress: Submitted:51 Active:19 Finished successfully:1
> > > Progress: Submitted:50 Active:20 Finished successfully:1
> > > Progress: Submitted:49 Active:21 Finished successfully:1
> > > Progress: Submitted:48 Active:22 Finished successfully:1
> > > Progress: Submitted:41 Active:29 Finished successfully:1
> > > Progress: Submitted:38 Active:32 Finished successfully:1
> > > Progress: Submitted:37 Active:33 Finished successfully:1
> > > Progress: Submitted:35 Active:35 Finished successfully:1
> > > Progress: Submitted:31 Active:39 Finished successfully:1
> > > Progress: Submitted:30 Active:40 Finished successfully:1
> > > Progress: Submitted:26 Active:44 Finished successfully:1
> > > Progress: Submitted:26 Active:44 Finished successfully:1
> > > Progress: Submitted:26 Active:44 Finished successfully:1
> > > Progress: Submitted:26 Active:44 Finished successfully:1
> > > Progress: Submitted:26 Active:44 Finished successfully:1
> > > Progress: Submitted:26 Active:44 Finished successfully:1
> > > Progress: Submitted:26 Active:43 Checking status:1 Finished successfully:1
> > > Progress: Submitted:26 Active:43 Finished successfully:2
> > > Progress: Submitted:26 Active:42 Checking status:1 Finished successfully:2
> > > Progress: Submitted:25 Active:42 Checking status:1 Finished successfully:3
> > > Progress: Submitted:25 Active:41 Checking status:1 Finished successfully:4
> > > Progress: Submitted:25 Active:41 Finished successfully:5
> > > Progress: Submitted:25 Active:40 Checking status:1 Finished successfully:5
> > > Progress: Submitted:25 Active:39 Checking status:1 Finished successfully:6
> > > Progress: Submitted:24 Active:40 Finished successfully:7
> > > Progress: Submitted:24 Active:39 Checking status:1 Finished successfully:7
> > > Progress: Submitted:24 Active:38 Checking status:1 Finished successfully:8
> > > Progress: Submitted:24 Active:38 Finished successfully:9
> > > Progress: Submitted:24 Active:37 Checking status:1 Finished successfully:9
> > > Progress: Submitted:24 Active:35 Checking status:1 Finished successfully:11
> > > Progress: Submitted:23 Active:35 Checking status:1 Finished successfully:12
> > > Progress: Submitted:22 Active:35 Checking status:1 Finished successfully:13
> > > Progress: Submitted:22 Active:35 Finished successfully:14
> > > Progress: Submitted:22 Active:34 Checking status:1 Finished successfully:14
> > > Progress: Submitted:21 Active:34 Checking status:1 Finished successfully:15
> > > Progress: Submitted:21 Active:34 Finished successfully:16
> > > Progress: Submitted:21 Active:33 Checking status:1 Finished successfully:16
> > > Progress: Submitted:21 Active:33 Finished successfully:17
> > > Progress: Submitted:20 Active:32 Checking status:1 Finished successfully:18
> > > Progress: Submitted:20 Active:32 Finished successfully:19
> > > Progress: Submitted:20 Active:31 Checking status:1 Finished successfully:19
> > > Progress: Submitted:19 Active:31 Finished successfully:21
> > > Progress: Submitted:19 Active:30 Checking status:1 Finished successfully:21
> > > Progress: Submitted:18 Active:30 Checking status:1 Finished successfully:22
> > > Progress: Submitted:18 Active:29 Checking status:1 Finished successfully:23
> > > Progress: Submitted:18 Active:28 Checking status:1 Finished successfully:24
> > > Progress: Submitted:17 Active:29 Finished successfully:25
> > > Progress: Submitted:17 Active:29 Finished successfully:25
> > > Progress: Submitted:17 Active:28 Checking status:1 Finished successfully:25
> > > Progress: Submitted:17 Active:27 Checking status:1 Finished successfully:26
> > > Progress: Submitted:17 Active:26 Checking status:1 Finished successfully:27
> > > Progress: Submitted:17 Active:25 Checking status:1 Finished successfully:28
> > > Progress: Submitted:17 Active:24 Checking status:1 Finished successfully:29
> > > Progress: Submitted:16 Active:25 Finished successfully:30
> > > Progress: Submitted:16 Active:24 Checking status:1 Finished successfully:30
> > > Progress: Submitted:15 Active:24 Checking status:1 Finished successfully:31
> > > Progress: Submitted:15 Active:24 Finished successfully:32
> > > Progress: Submitted:15 Active:23 Checking status:1 Finished successfully:32
> > > Progress: Submitted:14 Active:24 Finished successfully:33
> > > Progress: Submitted:14 Active:23 Checking status:1 Finished successfully:33
> > > Progress: Submitted:14 Active:22 Checking status:1 Finished successfully:34
> > > Progress: Submitted:14 Active:22 Finished successfully:35
> > > Progress: Submitted:14 Active:21 Checking status:1 Finished successfully:35
> > > Progress: Submitted:13 Active:22 Finished successfully:36
> > > Progress: Submitted:13 Active:22 Finished successfully:36
> > > Progress: Submitted:13 Active:20 Checking status:1 Finished successfully:37
> > > Progress: Submitted:12 Active:21 Finished successfully:38
> > > Progress: Submitted:12 Active:20 Checking status:1 Finished successfully:38
> > > Progress: Submitted:12 Active:19 Checking status:1 Finished successfully:39
> > > Progress: Submitted:12 Active:19 Finished successfully:40
> > > Progress: Submitted:12 Active:18 Checking status:1 Finished successfully:40
> > > Progress: Submitted:12 Active:17 Checking status:1 Finished successfully:41
> > > Progress: Submitted:11 Active:17 Checking status:1 Finished successfully:42
> > > Progress: Submitted:11 Active:17 Finished successfully:43
> > > Progress: Submitted:11 Active:16 Checking status:1 Finished successfully:43
> > > Progress: Submitted:11 Active:15 Checking status:1 Finished successfully:44
> > > Progress: Submitted:10 Active:16 Finished successfully:45
> > > Progress: Submitted:3 Active:22 Finished successfully:46
> > > Progress: Submitted:3 Active:21 Checking status:1 Finished successfully:46
> > > Progress: Submitted:3 Active:19 Finished successfully:49
> > > Progress: Submitted:3 Active:19 Finished successfully:49
> > > Progress: Submitted:2 Active:20 Finished successfully:49
> > > Progress: Submitted:1 Active:21 Finished successfully:49
> > > .
> > > .
> > > .
> > > Progress: Submitted:1 Active:15 Finished successfully:55
> > > Progress: Submitted:1 Active:15 Finished successfully:55
> > > Progress: Submitted:1 Active:15 Finished successfully:55
> > > Progress: Submitted:1 Active:15 Finished successfully:55
> > > Progress: Submitted:1 Active:14 Checking status:1 Finished successfully:55
> > > Progress: Submitted:1 Active:14 Finished successfully:56
> > > Progress: Submitted:1 Active:13 Checking status:1 Finished successfully:56
> > > Progress: Submitted:1 Active:12 Checking status:1 Finished successfully:57
> > > Progress: Submitted:1 Active:12 Finished successfully:58
> > > Progress: Submitted:1 Active:11 Checking status:1 Finished successfully:58
> > > Progress: Submitted:1 Active:10 Checking status:1 Finished successfully:59
> > > Progress: Submitted:1 Active:10 Finished successfully:60
> > > Progress: Submitted:1 Active:10 Finished successfully:60
> > > Progress: Submitted:1 Active:10 Finished successfully:60
> > > Progress: Submitted:1 Active:8 Checking status:1 Finished successfully:61
> > > Progress: Submitted:1 Active:8 Finished successfully:62
> > > Progress: Submitted:1 Active:7 Checking status:1 Finished successfully:62
> > > Progress: Submitted:1 Active:7 Finished successfully:63
> > > Progress: Submitted:1 Active:6 Checking status:1 Finished successfully:63
> > > Progress: Submitted:1 Active:4 Checking status:1 Finished successfully:65
> > > Progress: Submitted:1 Active:3 Checking status:1 Finished successfully:66
> > > Progress: Submitted:1 Active:3 Finished successfully:67
> > > Progress: Submitted:1 Active:3 Finished successfully:67
> > > Progress: Submitted:1 Active:2 Checking status:1 Finished successfully:67
> > > Progress: Submitted:1 Active:2 Finished successfully:68
> > > Progress: Submitted:1 Active:1 Checking status:1 Finished successfully:68
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > >
> > >
> > > etc
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> >
>
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
>
>
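To make Mike's hypothesis above concrete, here is a minimal sketch of the arithmetic using the maxtime (3600 s) and maxwalltime (00:25:00) values from the quoted sites entry. It is not the coaster block allocator's actual logic; `hms_to_seconds` and `job_fits_in_block` are hypothetical names, and the fit rule (remaining block time must be at least maxwalltime) is an assumption based on the behavior described in this thread.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def job_fits_in_block(maxtime_s: int, elapsed_s: int, maxwalltime_s: int) -> bool:
    """Assumed rule: a job fits only if the block has at least maxwalltime left."""
    return maxtime_s - elapsed_s >= maxwalltime_s


MAXTIME_S = 3600                             # maxtime from the sites entry
MAXWALLTIME_S = hms_to_seconds("00:25:00")   # maxwalltime from the sites entry

# Latest point in a block (seconds after it starts) at which a job still fits.
last_start_s = MAXTIME_S - MAXWALLTIME_S

print(last_start_s)
print(job_fits_in_block(MAXTIME_S, 30 * 60, MAXWALLTIME_S))  # 30 minutes into the block
print(job_fits_in_block(MAXTIME_S, 40 * 60, MAXWALLTIME_S))  # 40 minutes into the block
```

Under that assumed rule, any task still queued more than 35 minutes into a block can only run if a new block is started, which is the step the hypothesis says is not happening.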