[Swift-devel] decaying number of coaster jobs leaves some tasks unfinished

Michael Wilde wilde at mcs.anl.gov
Mon Aug 9 17:01:24 CDT 2010


I have seen a common problem where maxwalltime on queued jobs exceeds maxtime, in which case Swift hangs, never finding a block it can fit the jobs into.
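
As a rough illustration of that first case, the check below is a minimal Python sketch, not Swift's actual scheduling code. The function names are hypothetical, the walltime is assumed to be in HH:MM:SS form, and any overhead the coaster service might reserve inside a block is ignored.

```python
def walltime_to_seconds(walltime: str) -> int:
    """Convert an HH:MM:SS string such as "00:25:00" to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def job_can_ever_fit(maxwalltime: str, maxtime_seconds: int) -> bool:
    """Return True if a job with this walltime fits in a fresh block.

    Hypothetical helper: a job whose requested walltime is longer than
    the longest block that may be allocated can never be placed.
    """
    return walltime_to_seconds(maxwalltime) <= maxtime_seconds
```

With a maxtime of 3600 and a maxwalltime of 00:25:00, job_can_ever_fit returns True, so a setup with those values would not be hitting this particular misconfiguration.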

I wonder if this is another manifestation of that behavior/bug: the time left in the running block is less than the 25 min maxwalltime for the remaining tasks, and Swift does not realize that it needs to end that block and start a new one.
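
The suspected condition can be written out as simple arithmetic. This is only a sketch in Python under the numbers in this thread (a 3600 second block and a 25 minute maxwalltime); the function names are made up for illustration and do not come from the coaster code.

```python
def remaining_seconds(block_length: int, elapsed: int) -> int:
    """Seconds left in a running block, never negative."""
    return max(block_length - elapsed, 0)


def fits_in_running_block(block_length: int, elapsed: int, job_walltime: int) -> bool:
    """True if the job's full walltime still fits in what is left of the block."""
    return remaining_seconds(block_length, elapsed) >= job_walltime


BLOCK = 3600          # maxtime, in seconds
WALLTIME = 25 * 60    # maxwalltime of 00:25:00, in seconds

# Latest point in the block at which a 25 minute task can still be started.
latest_start = BLOCK - WALLTIME
```

Under these assumptions a task can only start during the first 2100 seconds (35 minutes) of a block; anything still queued after that point would need a new block.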

Did you leave the tail end of this run running long enough for the current block to end, to see if it starts a new 3600 second block?

I'm just surmising one possible cause; the actual problem here might be completely different.

- Mike


----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:

> That error might be related. Can I have the full log?
> 
> On Mon, 2010-08-09 at 17:44 -0400, Glen Hocky wrote:
> > Hey everyone,
> > I've been trying to run some short jobs in the "fast" queue on pads.
> > That means I need to keep the wall time under 1 hour, and my tasks are
> > around 20 min.  What's been happening, at least for a smallish number
> > of jobs, is that Swift decreases the number of jobs submitted to the
> > queue as the number of tasks is reduced and at the end, some tasks
> > remain unfinished while no jobs are in the queue, and this continues
> > indefinitely.
> > 
> > 
> > The following is one sites entry where I reproducibly had this
> > problem for 70 tasks:
> > 
> > 
> >     <execution provider="coaster" url="none" jobManager="local:pbs"/>
> >     <!--<profile namespace="globus" key="queue">fast</profile>-->
> >     <profile namespace="globus" key="maxtime">3600</profile>
> >     <profile namespace="globus" key="maxwalltime">00:25:00</profile>
> >     <profile namespace="globus" key="workersPerNode">1</profile>
> >     <profile namespace="globus" key="internalHostname">172.5.86.5</profile>
> >     <profile namespace="globus" key="slots">120</profile>
> >     <profile namespace="globus" key="nodeGranularity">1</profile>
> >     <profile namespace="globus" key="maxNodes">1</profile>
> >     <profile namespace="karajan" key="jobThrottle">0.99</profile>
> >     <profile namespace="karajan" key="initialScore">10000</profile>
> >     <profile namespace="globus" key="project">CI-CCR000013</profile>
> >     <gridftp url="local://localhost" />
> >     <scratch>/tmp</scratch>
> >     <workdirectory>/home/hockyg/reichman/glassy_dynamics/code/swift/run/real</workdirectory>
> > 
> > 
> > 
> > 
> > There are also some errors of this type:
> > Exception caught while unregistering channel
> > org.globus.cog.karajan.workflow.service.channels.ChannelException: Trying to bind invalid channel (2027063355: {}) to 60652275: {}
> >         at org.globus.cog.karajan.workflow.service.channels.MetaChannel.bind(MetaChannel.java:67)
> >         at org.globus.cog.karajan.workflow.service.channels.ChannelManager.unregisterChannel(ChannelManager.java:401)
> >         at org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411)
> >         at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284)
> >         at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83)
> >         at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)
> > but I'm not sure that's related...
> > 
> > 
> > 
> > 
> > Running with "Swift svn swift-r3432 (swift modified locally) cog-r2829"
> > 
> > 
> > Swift output went something like
> > Progress:  Submitted:69  Active:1  Finished successfully:1
> > Progress:  Submitted:67  Active:3  Finished successfully:1
> > Progress:  Submitted:66  Active:4  Finished successfully:1
> > Progress:  Submitted:65  Active:5  Finished successfully:1
> > Progress:  Submitted:64  Active:6  Finished successfully:1
> > Progress:  Submitted:61  Active:9  Finished successfully:1
> > Progress:  Submitted:58  Active:12  Finished successfully:1
> > Progress:  Submitted:57  Active:13  Finished successfully:1
> > Progress:  Submitted:54  Active:16  Finished successfully:1
> > Progress:  Submitted:52  Active:18  Finished successfully:1
> > Progress:  Submitted:51  Active:19  Finished successfully:1
> > Progress:  Submitted:50  Active:20  Finished successfully:1
> > Progress:  Submitted:49  Active:21  Finished successfully:1
> > Progress:  Submitted:48  Active:22  Finished successfully:1
> > Progress:  Submitted:41  Active:29  Finished successfully:1
> > Progress:  Submitted:38  Active:32  Finished successfully:1
> > Progress:  Submitted:37  Active:33  Finished successfully:1
> > Progress:  Submitted:35  Active:35  Finished successfully:1
> > Progress:  Submitted:31  Active:39  Finished successfully:1
> > Progress:  Submitted:30  Active:40  Finished successfully:1
> > Progress:  Submitted:26  Active:44  Finished successfully:1
> > Progress:  Submitted:26  Active:44  Finished successfully:1
> > Progress:  Submitted:26  Active:44  Finished successfully:1
> > Progress:  Submitted:26  Active:44  Finished successfully:1
> > Progress:  Submitted:26  Active:44  Finished successfully:1
> > Progress:  Submitted:26  Active:44  Finished successfully:1
> > Progress:  Submitted:26  Active:43  Checking status:1  Finished successfully:1
> > Progress:  Submitted:26  Active:43  Finished successfully:2
> > Progress:  Submitted:26  Active:42  Checking status:1  Finished successfully:2
> > Progress:  Submitted:25  Active:42  Checking status:1  Finished successfully:3
> > Progress:  Submitted:25  Active:41  Checking status:1  Finished successfully:4
> > Progress:  Submitted:25  Active:41  Finished successfully:5
> > Progress:  Submitted:25  Active:40  Checking status:1  Finished successfully:5
> > Progress:  Submitted:25  Active:39  Checking status:1  Finished successfully:6
> > Progress:  Submitted:24  Active:40  Finished successfully:7
> > Progress:  Submitted:24  Active:39  Checking status:1  Finished successfully:7
> > Progress:  Submitted:24  Active:38  Checking status:1  Finished successfully:8
> > Progress:  Submitted:24  Active:38  Finished successfully:9
> > Progress:  Submitted:24  Active:37  Checking status:1  Finished successfully:9
> > Progress:  Submitted:24  Active:35  Checking status:1  Finished successfully:11
> > Progress:  Submitted:23  Active:35  Checking status:1  Finished successfully:12
> > Progress:  Submitted:22  Active:35  Checking status:1  Finished successfully:13
> > Progress:  Submitted:22  Active:35  Finished successfully:14
> > Progress:  Submitted:22  Active:34  Checking status:1  Finished successfully:14
> > Progress:  Submitted:21  Active:34  Checking status:1  Finished successfully:15
> > Progress:  Submitted:21  Active:34  Finished successfully:16
> > Progress:  Submitted:21  Active:33  Checking status:1  Finished successfully:16
> > Progress:  Submitted:21  Active:33  Finished successfully:17
> > Progress:  Submitted:20  Active:32  Checking status:1  Finished successfully:18
> > Progress:  Submitted:20  Active:32  Finished successfully:19
> > Progress:  Submitted:20  Active:31  Checking status:1  Finished successfully:19
> > Progress:  Submitted:19  Active:31  Finished successfully:21
> > Progress:  Submitted:19  Active:30  Checking status:1  Finished successfully:21
> > Progress:  Submitted:18  Active:30  Checking status:1  Finished successfully:22
> > Progress:  Submitted:18  Active:29  Checking status:1  Finished successfully:23
> > Progress:  Submitted:18  Active:28  Checking status:1  Finished successfully:24
> > Progress:  Submitted:17  Active:29  Finished successfully:25
> > Progress:  Submitted:17  Active:29  Finished successfully:25
> > Progress:  Submitted:17  Active:28  Checking status:1  Finished successfully:25
> > Progress:  Submitted:17  Active:27  Checking status:1  Finished successfully:26
> > Progress:  Submitted:17  Active:26  Checking status:1  Finished successfully:27
> > Progress:  Submitted:17  Active:25  Checking status:1  Finished successfully:28
> > Progress:  Submitted:17  Active:24  Checking status:1  Finished successfully:29
> > Progress:  Submitted:16  Active:25  Finished successfully:30
> > Progress:  Submitted:16  Active:24  Checking status:1  Finished successfully:30
> > Progress:  Submitted:15  Active:24  Checking status:1  Finished successfully:31
> > Progress:  Submitted:15  Active:24  Finished successfully:32
> > Progress:  Submitted:15  Active:23  Checking status:1  Finished successfully:32
> > Progress:  Submitted:14  Active:24  Finished successfully:33
> > Progress:  Submitted:14  Active:23  Checking status:1  Finished successfully:33
> > Progress:  Submitted:14  Active:22  Checking status:1  Finished successfully:34
> > Progress:  Submitted:14  Active:22  Finished successfully:35
> > Progress:  Submitted:14  Active:21  Checking status:1  Finished successfully:35
> > Progress:  Submitted:13  Active:22  Finished successfully:36
> > Progress:  Submitted:13  Active:22  Finished successfully:36
> > Progress:  Submitted:13  Active:20  Checking status:1  Finished successfully:37
> > Progress:  Submitted:12  Active:21  Finished successfully:38
> > Progress:  Submitted:12  Active:20  Checking status:1  Finished successfully:38
> > Progress:  Submitted:12  Active:19  Checking status:1  Finished successfully:39
> > Progress:  Submitted:12  Active:19  Finished successfully:40
> > Progress:  Submitted:12  Active:18  Checking status:1  Finished successfully:40
> > Progress:  Submitted:12  Active:17  Checking status:1  Finished successfully:41
> > Progress:  Submitted:11  Active:17  Checking status:1  Finished successfully:42
> > Progress:  Submitted:11  Active:17  Finished successfully:43
> > Progress:  Submitted:11  Active:16  Checking status:1  Finished successfully:43
> > Progress:  Submitted:11  Active:15  Checking status:1  Finished successfully:44
> > Progress:  Submitted:10  Active:16  Finished successfully:45
> > Progress:  Submitted:3  Active:22  Finished successfully:46
> > Progress:  Submitted:3  Active:21  Checking status:1  Finished successfully:46
> > Progress:  Submitted:3  Active:19  Finished successfully:49
> > Progress:  Submitted:3  Active:19  Finished successfully:49
> > Progress:  Submitted:2  Active:20  Finished successfully:49
> > Progress:  Submitted:1  Active:21  Finished successfully:49
> > .
> > .
> > .
> > Progress:  Submitted:1  Active:15  Finished successfully:55
> > Progress:  Submitted:1  Active:15  Finished successfully:55
> > Progress:  Submitted:1  Active:15  Finished successfully:55
> > Progress:  Submitted:1  Active:15  Finished successfully:55
> > Progress:  Submitted:1  Active:14  Checking status:1  Finished successfully:55
> > Progress:  Submitted:1  Active:14  Finished successfully:56
> > Progress:  Submitted:1  Active:13  Checking status:1  Finished successfully:56
> > Progress:  Submitted:1  Active:12  Checking status:1  Finished successfully:57
> > Progress:  Submitted:1  Active:12  Finished successfully:58
> > Progress:  Submitted:1  Active:11  Checking status:1  Finished successfully:58
> > Progress:  Submitted:1  Active:10  Checking status:1  Finished successfully:59
> > Progress:  Submitted:1  Active:10  Finished successfully:60
> > Progress:  Submitted:1  Active:10  Finished successfully:60
> > Progress:  Submitted:1  Active:10  Finished successfully:60
> > Progress:  Submitted:1  Active:8  Checking status:1  Finished successfully:61
> > Progress:  Submitted:1  Active:8  Finished successfully:62
> > Progress:  Submitted:1  Active:7  Checking status:1  Finished successfully:62
> > Progress:  Submitted:1  Active:7  Finished successfully:63
> > Progress:  Submitted:1  Active:6  Checking status:1  Finished successfully:63
> > Progress:  Submitted:1  Active:4  Checking status:1  Finished successfully:65
> > Progress:  Submitted:1  Active:3  Checking status:1  Finished successfully:66
> > Progress:  Submitted:1  Active:3  Finished successfully:67
> > Progress:  Submitted:1  Active:3  Finished successfully:67
> > Progress:  Submitted:1  Active:2  Checking status:1  Finished successfully:67
> > Progress:  Submitted:1  Active:2  Finished successfully:68
> > Progress:  Submitted:1  Active:1  Checking status:1  Finished successfully:68
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > Progress:  Submitted:1  Finished successfully:70
> > 
> > 
> > etc
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



