[Swift-devel] decaying number of coaster jobs leaves some tasks unfinished

Mihael Hategan hategan at mcs.anl.gov
Mon Aug 9 17:37:27 CDT 2010


On Mon, 2010-08-09 at 18:19 -0400, Glen Hocky wrote:
> Here's the full log (I think). 
> 
> What Mike's describing is basically my gut feeling as well...

Right. The log should tell us.

> 
> 
>         Did you leave the tail end of this run running long enough for
>         the current block to end, to see if it starts a new 3600
>         second block?
> A different run before I tried to reproduce the problem ran all night
> like that last night without starting any new blocks (but the
> settings were very slightly different, with fewer "slots"), and it
> stalled with 7 jobs left, I think.
> 
> On Mon, Aug 9, 2010 at 6:01 PM, Michael Wilde <wilde at mcs.anl.gov>
> wrote:
>         I have seen a common problem where maxwalltime on queued jobs
>         exceeds maxtime, in which case Swift hangs, never finding a
>         block it can fit the jobs into.
>         
>         I wonder if this is another manifestation of that
>         behavior/bug: the time left in the running block is less than
>         the 25 min maxwalltime for the remaining tasks, and Swift does
>         not realize that it needs to end that block and start a new
>         one.
>         
>         Did you leave the tail end of this run running long enough for
>         the current block to end, to see if it starts a new 3600
>         second block?
>         
>         I'm just surmising one possible cause; the actual problem here
>         might be completely different.
>         
>         - Mike
>         
>         
>         
>         ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
>         
>         > That error might be related. Can I have the full log?
>         >
>         > On Mon, 2010-08-09 at 17:44 -0400, Glen Hocky wrote:
>         > > Hey everyone,
>         > > I've been trying to run some short jobs in the "fast" queue on
>         > > pads. That means I need to keep the wall time under 1 hour, and
>         > > my tasks are around 20 min. What's been happening, at least for
>         > > a smallish number of jobs, is that Swift decreases the number of
>         > > jobs submitted to the queue as the number of tasks is reduced,
>         > > and at the end some tasks remain unfinished while no jobs are in
>         > > the queue, and this continues indefinitely.
>         > >
>         > >
>         > > The following is one sites entry where I reproducibly had this
>         > > problem for 70 tasks:
>         > >
>         > >
>         > >             <execution provider="coaster" url="none" jobManager="local:pbs"/>
>         > >             <!--<profile namespace="globus" key="queue">fast</profile>-->
>         > >             <profile namespace="globus" key="maxtime">3600</profile>
>         > >             <profile namespace="globus" key="maxwalltime">00:25:00</profile>
>         > >             <profile namespace="globus" key="workersPerNode">1</profile>
>         > >             <profile namespace="globus" key="internalHostname">172.5.86.5</profile>
>         > >             <profile namespace="globus" key="slots">120</profile>
>         > >             <profile namespace="globus" key="nodeGranularity">1</profile>
>         > >             <profile namespace="globus" key="maxNodes">1</profile>
>         > >             <profile namespace="karajan" key="jobThrottle">0.99</profile>
>         > >             <profile namespace="karajan" key="initialScore">10000</profile>
>         > >             <profile namespace="globus" key="project">CI-CCR000013</profile>
>         > >             <gridftp  url="local://localhost" />
>         > >             <scratch>/tmp</scratch>
>         > >             <workdirectory>/home/hockyg/reichman/glassy_dynamics/code/swift/run/real</workdirectory>
>         > >
>         > >
>         > >
>         > >
>         > > There are also some of this type of error
>         > > Exception caught while unregistering channel
>         > > org.globus.cog.karajan.workflow.service.channels.ChannelException: Trying to bind invalid channel (2027063355: {}) to 60652275: {}
>         > >                 at org.globus.cog.karajan.workflow.service.channels.MetaChannel.bind(MetaChannel.java:67)
>         > >                 at org.globus.cog.karajan.workflow.service.channels.ChannelManager.unregisterChannel(ChannelManager.java:401)
>         > >                 at org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411)
>         > >                 at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284)
>         > >                 at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83)
>         > >                 at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)
>         > > but i'm not sure that's related...
>         > >
>         > >
>         > >
>         > >
>         > > Running with "Swift svn swift-r3432 (swift modified locally) cog-r2829"
>         > >
>         > >
>         > > Swift output went something like
>         > > Progress:  Submitted:69  Active:1  Finished successfully:1
>         > > Progress:  Submitted:67  Active:3  Finished successfully:1
>         > > Progress:  Submitted:66  Active:4  Finished successfully:1
>         > > Progress:  Submitted:65  Active:5  Finished successfully:1
>         > > Progress:  Submitted:64  Active:6  Finished successfully:1
>         > > Progress:  Submitted:61  Active:9  Finished successfully:1
>         > > Progress:  Submitted:58  Active:12  Finished successfully:1
>         > > Progress:  Submitted:57  Active:13  Finished successfully:1
>         > > Progress:  Submitted:54  Active:16  Finished successfully:1
>         > > Progress:  Submitted:52  Active:18  Finished successfully:1
>         > > Progress:  Submitted:51  Active:19  Finished successfully:1
>         > > Progress:  Submitted:50  Active:20  Finished successfully:1
>         > > Progress:  Submitted:49  Active:21  Finished successfully:1
>         > > Progress:  Submitted:48  Active:22  Finished successfully:1
>         > > Progress:  Submitted:41  Active:29  Finished successfully:1
>         > > Progress:  Submitted:38  Active:32  Finished successfully:1
>         > > Progress:  Submitted:37  Active:33  Finished successfully:1
>         > > Progress:  Submitted:35  Active:35  Finished successfully:1
>         > > Progress:  Submitted:31  Active:39  Finished successfully:1
>         > > Progress:  Submitted:30  Active:40  Finished successfully:1
>         > > Progress:  Submitted:26  Active:44  Finished successfully:1
>         > > Progress:  Submitted:26  Active:44  Finished successfully:1
>         > > Progress:  Submitted:26  Active:44  Finished successfully:1
>         > > Progress:  Submitted:26  Active:44  Finished successfully:1
>         > > Progress:  Submitted:26  Active:44  Finished successfully:1
>         > > Progress:  Submitted:26  Active:44  Finished successfully:1
>         > > Progress:  Submitted:26  Active:43  Checking status:1  Finished successfully:1
>         > > Progress:  Submitted:26  Active:43  Finished successfully:2
>         > > Progress:  Submitted:26  Active:42  Checking status:1  Finished successfully:2
>         > > Progress:  Submitted:25  Active:42  Checking status:1  Finished successfully:3
>         > > Progress:  Submitted:25  Active:41  Checking status:1  Finished successfully:4
>         > > Progress:  Submitted:25  Active:41  Finished successfully:5
>         > > Progress:  Submitted:25  Active:40  Checking status:1  Finished successfully:5
>         > > Progress:  Submitted:25  Active:39  Checking status:1  Finished successfully:6
>         > > Progress:  Submitted:24  Active:40  Finished successfully:7
>         > > Progress:  Submitted:24  Active:39  Checking status:1  Finished successfully:7
>         > > Progress:  Submitted:24  Active:38  Checking status:1  Finished successfully:8
>         > > Progress:  Submitted:24  Active:38  Finished successfully:9
>         > > Progress:  Submitted:24  Active:37  Checking status:1  Finished successfully:9
>         > > Progress:  Submitted:24  Active:35  Checking status:1  Finished successfully:11
>         > > Progress:  Submitted:23  Active:35  Checking status:1  Finished successfully:12
>         > > Progress:  Submitted:22  Active:35  Checking status:1  Finished successfully:13
>         > > Progress:  Submitted:22  Active:35  Finished successfully:14
>         > > Progress:  Submitted:22  Active:34  Checking status:1  Finished successfully:14
>         > > Progress:  Submitted:21  Active:34  Checking status:1  Finished successfully:15
>         > > Progress:  Submitted:21  Active:34  Finished successfully:16
>         > > Progress:  Submitted:21  Active:33  Checking status:1  Finished successfully:16
>         > > Progress:  Submitted:21  Active:33  Finished successfully:17
>         > > Progress:  Submitted:20  Active:32  Checking status:1  Finished successfully:18
>         > > Progress:  Submitted:20  Active:32  Finished successfully:19
>         > > Progress:  Submitted:20  Active:31  Checking status:1  Finished successfully:19
>         > > Progress:  Submitted:19  Active:31  Finished successfully:21
>         > > Progress:  Submitted:19  Active:30  Checking status:1  Finished successfully:21
>         > > Progress:  Submitted:18  Active:30  Checking status:1  Finished successfully:22
>         > > Progress:  Submitted:18  Active:29  Checking status:1  Finished successfully:23
>         > > Progress:  Submitted:18  Active:28  Checking status:1  Finished successfully:24
>         > > Progress:  Submitted:17  Active:29  Finished successfully:25
>         > > Progress:  Submitted:17  Active:29  Finished successfully:25
>         > > Progress:  Submitted:17  Active:28  Checking status:1  Finished successfully:25
>         > > Progress:  Submitted:17  Active:27  Checking status:1  Finished successfully:26
>         > > Progress:  Submitted:17  Active:26  Checking status:1  Finished successfully:27
>         > > Progress:  Submitted:17  Active:25  Checking status:1  Finished successfully:28
>         > > Progress:  Submitted:17  Active:24  Checking status:1  Finished successfully:29
>         > > Progress:  Submitted:16  Active:25  Finished successfully:30
>         > > Progress:  Submitted:16  Active:24  Checking status:1  Finished successfully:30
>         > > Progress:  Submitted:15  Active:24  Checking status:1  Finished successfully:31
>         > > Progress:  Submitted:15  Active:24  Finished successfully:32
>         > > Progress:  Submitted:15  Active:23  Checking status:1  Finished successfully:32
>         > > Progress:  Submitted:14  Active:24  Finished successfully:33
>         > > Progress:  Submitted:14  Active:23  Checking status:1  Finished successfully:33
>         > > Progress:  Submitted:14  Active:22  Checking status:1  Finished successfully:34
>         > > Progress:  Submitted:14  Active:22  Finished successfully:35
>         > > Progress:  Submitted:14  Active:21  Checking status:1  Finished successfully:35
>         > > Progress:  Submitted:13  Active:22  Finished successfully:36
>         > > Progress:  Submitted:13  Active:22  Finished successfully:36
>         > > Progress:  Submitted:13  Active:20  Checking status:1  Finished successfully:37
>         > > Progress:  Submitted:12  Active:21  Finished successfully:38
>         > > Progress:  Submitted:12  Active:20  Checking status:1  Finished successfully:38
>         > > Progress:  Submitted:12  Active:19  Checking status:1  Finished successfully:39
>         > > Progress:  Submitted:12  Active:19  Finished successfully:40
>         > > Progress:  Submitted:12  Active:18  Checking status:1  Finished successfully:40
>         > > Progress:  Submitted:12  Active:17  Checking status:1  Finished successfully:41
>         > > Progress:  Submitted:11  Active:17  Checking status:1  Finished successfully:42
>         > > Progress:  Submitted:11  Active:17  Finished successfully:43
>         > > Progress:  Submitted:11  Active:16  Checking status:1  Finished successfully:43
>         > > Progress:  Submitted:11  Active:15  Checking status:1  Finished successfully:44
>         > > Progress:  Submitted:10  Active:16  Finished successfully:45
>         > > Progress:  Submitted:3  Active:22  Finished successfully:46
>         > > Progress:  Submitted:3  Active:21  Checking status:1  Finished successfully:46
>         > > Progress:  Submitted:3  Active:19  Finished successfully:49
>         > > Progress:  Submitted:3  Active:19  Finished successfully:49
>         > > Progress:  Submitted:2  Active:20  Finished successfully:49
>         > > Progress:  Submitted:1  Active:21  Finished successfully:49
>         > > .
>         > > .
>         > > .
>         > > Progress:  Submitted:1  Active:15  Finished successfully:55
>         > > Progress:  Submitted:1  Active:15  Finished successfully:55
>         > > Progress:  Submitted:1  Active:15  Finished successfully:55
>         > > Progress:  Submitted:1  Active:15  Finished successfully:55
>         > > Progress:  Submitted:1  Active:14  Checking status:1  Finished successfully:55
>         > > Progress:  Submitted:1  Active:14  Finished successfully:56
>         > > Progress:  Submitted:1  Active:13  Checking status:1  Finished successfully:56
>         > > Progress:  Submitted:1  Active:12  Checking status:1  Finished successfully:57
>         > > Progress:  Submitted:1  Active:12  Finished successfully:58
>         > > Progress:  Submitted:1  Active:11  Checking status:1  Finished successfully:58
>         > > Progress:  Submitted:1  Active:10  Checking status:1  Finished successfully:59
>         > > Progress:  Submitted:1  Active:10  Finished successfully:60
>         > > Progress:  Submitted:1  Active:10  Finished successfully:60
>         > > Progress:  Submitted:1  Active:10  Finished successfully:60
>         > > Progress:  Submitted:1  Active:8  Checking status:1  Finished successfully:61
>         > > Progress:  Submitted:1  Active:8  Finished successfully:62
>         > > Progress:  Submitted:1  Active:7  Checking status:1  Finished successfully:62
>         > > Progress:  Submitted:1  Active:7  Finished successfully:63
>         > > Progress:  Submitted:1  Active:6  Checking status:1  Finished successfully:63
>         > > Progress:  Submitted:1  Active:4  Checking status:1  Finished successfully:65
>         > > Progress:  Submitted:1  Active:3  Checking status:1  Finished successfully:66
>         > > Progress:  Submitted:1  Active:3  Finished successfully:67
>         > > Progress:  Submitted:1  Active:3  Finished successfully:67
>         > > Progress:  Submitted:1  Active:2  Checking status:1  Finished successfully:67
>         > > Progress:  Submitted:1  Active:2  Finished successfully:68
>         > > Progress:  Submitted:1  Active:1  Checking status:1  Finished successfully:68
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > > Progress:  Submitted:1  Finished successfully:70
>         > >
>         > >
>         > > etc
>         > > _______________________________________________
>         > > Swift-devel mailing list
>         > > Swift-devel at ci.uchicago.edu
>         > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>         >
>         >
>         
>         
>         --
>         Michael Wilde
>         Computation Institute, University of Chicago
>         Mathematics and Computer Science Division
>         Argonne National Laboratory
>         
> 
> 
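
To make the hypothesis quoted above concrete, here is a minimal sketch of the "does this job still fit in the running block?" check, using the maxtime and maxwalltime values from the sites entry. This is not Swift's or the coaster scheduler's actual code; the function names parse_walltime and job_fits are hypothetical, and the elapsed times are made-up illustration values.

```python
def parse_walltime(hhmmss: str) -> int:
    """Convert an HH:MM:SS walltime string (as in the maxwalltime profile) to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def job_fits(block_maxtime: int, block_elapsed: int, job_walltime: int) -> bool:
    """Return True if the job's requested walltime fits in the block's remaining time.

    Hypothetical check: all arguments are in seconds.
    """
    remaining = block_maxtime - block_elapsed
    return remaining >= job_walltime


maxtime = 3600                             # from key="maxtime"
maxwalltime = parse_walltime("00:25:00")   # from key="maxwalltime", 1500 s

# Fresh block: the full 3600 s is available, so a 25 min job fits.
print(job_fits(maxtime, 0, maxwalltime))

# 40 minutes into the block only 1200 s remain, so the job no longer fits.
# If no new block is started at this point, the job would wait indefinitely.
print(job_fits(maxtime, 40 * 60, maxwalltime))
```

If the scheduler behaves as surmised, any task still queued once less than 25 minutes remain in the last block can only run in a new block, which would match a run that sits at "Submitted:1" with nothing in the queue.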




