[Swift-devel] decaying number of coaster jobs leaves some tasks unfinished

Glen Hocky hockyg at uchicago.edu
Mon Aug 9 16:44:06 CDT 2010


Hey everyone,
I've been trying to run some short jobs in the "fast" queue on pads. That
means I need to keep the wall time under 1 hour, and my tasks take around 20
minutes each. What's been happening, at least for a smallish number of jobs,
is that Swift submits fewer jobs to the queue as the number of remaining
tasks drops, and at the end some tasks remain unfinished while no jobs are
in the queue. This continues indefinitely.

The following is one sites entry with which I could reproduce this problem
for 70 tasks:

    <execution provider="coaster" url="none" jobManager="local:pbs"/>
    <!--<profile namespace="globus" key="queue">fast</profile>-->
    <profile namespace="globus" key="maxtime">3600</profile>
    <profile namespace="globus" key="maxwalltime">00:25:00</profile>
    <profile namespace="globus" key="workersPerNode">1</profile>
    <profile namespace="globus" key="internalHostname">172.5.86.5</profile>
    <profile namespace="globus" key="slots">120</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxNodes">1</profile>
    <profile namespace="karajan" key="jobThrottle">0.99</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="globus" key="project">CI-CCR000013</profile>
    <gridftp  url="local://localhost" />
    <scratch>/tmp</scratch>
    <workdirectory>/home/hockyg/reichman/glassy_dynamics/code/swift/run/real</workdirectory>
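
As a rough sanity check on the two time settings above, here is a minimal Python sketch. It assumes that maxtime is the walltime of a coaster block in seconds and that maxwalltime is the per-task request in HH:MM:SS; the helper names are hypothetical, and whether the coaster scheduler actually packs tasks into a block this way is an assumption, not something confirmed in this report.

```python
def walltime_to_seconds(hhmmss):
    """Convert an HH:MM:SS string such as '00:25:00' to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def tasks_per_block(maxtime_s, task_walltime):
    """Return (whole tasks that fit back to back, leftover seconds).

    Assumption: one worker runs tasks sequentially inside a block whose
    walltime is maxtime_s, and each task reserves task_walltime.
    """
    task_s = walltime_to_seconds(task_walltime)
    return maxtime_s // task_s, maxtime_s % task_s


# Values taken from the sites entry above.
fit, leftover = tasks_per_block(3600, "00:25:00")
print(fit, leftover)
```

Under these assumptions a 3600-second block holds two 25-minute task slots per worker and leaves 600 seconds that are too short for a third.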



There are also some errors of this type:
Exception caught while unregistering channel
org.globus.cog.karajan.workflow.service.channels.ChannelException: Trying to bind invalid channel (2027063355: {}) to 60652275: {}
        at org.globus.cog.karajan.workflow.service.channels.MetaChannel.bind(MetaChannel.java:67)
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.unregisterChannel(ChannelManager.java:401)
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411)
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)

but I'm not sure that's related.


Running with "Swift svn swift-r3432 (swift modified locally) cog-r2829"

Swift output went something like this:
Progress:  Submitted:69  Active:1  Finished successfully:1
Progress:  Submitted:67  Active:3  Finished successfully:1
Progress:  Submitted:66  Active:4  Finished successfully:1
Progress:  Submitted:65  Active:5  Finished successfully:1
Progress:  Submitted:64  Active:6  Finished successfully:1
Progress:  Submitted:61  Active:9  Finished successfully:1
Progress:  Submitted:58  Active:12  Finished successfully:1
Progress:  Submitted:57  Active:13  Finished successfully:1
Progress:  Submitted:54  Active:16  Finished successfully:1
Progress:  Submitted:52  Active:18  Finished successfully:1
Progress:  Submitted:51  Active:19  Finished successfully:1
Progress:  Submitted:50  Active:20  Finished successfully:1
Progress:  Submitted:49  Active:21  Finished successfully:1
Progress:  Submitted:48  Active:22  Finished successfully:1
Progress:  Submitted:41  Active:29  Finished successfully:1
Progress:  Submitted:38  Active:32  Finished successfully:1
Progress:  Submitted:37  Active:33  Finished successfully:1
Progress:  Submitted:35  Active:35  Finished successfully:1
Progress:  Submitted:31  Active:39  Finished successfully:1
Progress:  Submitted:30  Active:40  Finished successfully:1
Progress:  Submitted:26  Active:44  Finished successfully:1
Progress:  Submitted:26  Active:44  Finished successfully:1
Progress:  Submitted:26  Active:44  Finished successfully:1
Progress:  Submitted:26  Active:44  Finished successfully:1
Progress:  Submitted:26  Active:44  Finished successfully:1
Progress:  Submitted:26  Active:44  Finished successfully:1
Progress:  Submitted:26  Active:43  Checking status:1  Finished successfully:1
Progress:  Submitted:26  Active:43  Finished successfully:2
Progress:  Submitted:26  Active:42  Checking status:1  Finished successfully:2
Progress:  Submitted:25  Active:42  Checking status:1  Finished successfully:3
Progress:  Submitted:25  Active:41  Checking status:1  Finished successfully:4
Progress:  Submitted:25  Active:41  Finished successfully:5
Progress:  Submitted:25  Active:40  Checking status:1  Finished successfully:5
Progress:  Submitted:25  Active:39  Checking status:1  Finished successfully:6
Progress:  Submitted:24  Active:40  Finished successfully:7
Progress:  Submitted:24  Active:39  Checking status:1  Finished successfully:7
Progress:  Submitted:24  Active:38  Checking status:1  Finished successfully:8
Progress:  Submitted:24  Active:38  Finished successfully:9
Progress:  Submitted:24  Active:37  Checking status:1  Finished successfully:9
Progress:  Submitted:24  Active:35  Checking status:1  Finished successfully:11
Progress:  Submitted:23  Active:35  Checking status:1  Finished successfully:12
Progress:  Submitted:22  Active:35  Checking status:1  Finished successfully:13
Progress:  Submitted:22  Active:35  Finished successfully:14
Progress:  Submitted:22  Active:34  Checking status:1  Finished successfully:14
Progress:  Submitted:21  Active:34  Checking status:1  Finished successfully:15
Progress:  Submitted:21  Active:34  Finished successfully:16
Progress:  Submitted:21  Active:33  Checking status:1  Finished successfully:16
Progress:  Submitted:21  Active:33  Finished successfully:17
Progress:  Submitted:20  Active:32  Checking status:1  Finished successfully:18
Progress:  Submitted:20  Active:32  Finished successfully:19
Progress:  Submitted:20  Active:31  Checking status:1  Finished successfully:19
Progress:  Submitted:19  Active:31  Finished successfully:21
Progress:  Submitted:19  Active:30  Checking status:1  Finished successfully:21
Progress:  Submitted:18  Active:30  Checking status:1  Finished successfully:22
Progress:  Submitted:18  Active:29  Checking status:1  Finished successfully:23
Progress:  Submitted:18  Active:28  Checking status:1  Finished successfully:24
Progress:  Submitted:17  Active:29  Finished successfully:25
Progress:  Submitted:17  Active:29  Finished successfully:25
Progress:  Submitted:17  Active:28  Checking status:1  Finished successfully:25
Progress:  Submitted:17  Active:27  Checking status:1  Finished successfully:26
Progress:  Submitted:17  Active:26  Checking status:1  Finished successfully:27
Progress:  Submitted:17  Active:25  Checking status:1  Finished successfully:28
Progress:  Submitted:17  Active:24  Checking status:1  Finished successfully:29
Progress:  Submitted:16  Active:25  Finished successfully:30
Progress:  Submitted:16  Active:24  Checking status:1  Finished successfully:30
Progress:  Submitted:15  Active:24  Checking status:1  Finished successfully:31
Progress:  Submitted:15  Active:24  Finished successfully:32
Progress:  Submitted:15  Active:23  Checking status:1  Finished successfully:32
Progress:  Submitted:14  Active:24  Finished successfully:33
Progress:  Submitted:14  Active:23  Checking status:1  Finished successfully:33
Progress:  Submitted:14  Active:22  Checking status:1  Finished successfully:34
Progress:  Submitted:14  Active:22  Finished successfully:35
Progress:  Submitted:14  Active:21  Checking status:1  Finished successfully:35
Progress:  Submitted:13  Active:22  Finished successfully:36
Progress:  Submitted:13  Active:22  Finished successfully:36
Progress:  Submitted:13  Active:20  Checking status:1  Finished successfully:37
Progress:  Submitted:12  Active:21  Finished successfully:38
Progress:  Submitted:12  Active:20  Checking status:1  Finished successfully:38
Progress:  Submitted:12  Active:19  Checking status:1  Finished successfully:39
Progress:  Submitted:12  Active:19  Finished successfully:40
Progress:  Submitted:12  Active:18  Checking status:1  Finished successfully:40
Progress:  Submitted:12  Active:17  Checking status:1  Finished successfully:41
Progress:  Submitted:11  Active:17  Checking status:1  Finished successfully:42
Progress:  Submitted:11  Active:17  Finished successfully:43
Progress:  Submitted:11  Active:16  Checking status:1  Finished successfully:43
Progress:  Submitted:11  Active:15  Checking status:1  Finished successfully:44
Progress:  Submitted:10  Active:16  Finished successfully:45
Progress:  Submitted:3  Active:22  Finished successfully:46
Progress:  Submitted:3  Active:21  Checking status:1  Finished successfully:46
Progress:  Submitted:3  Active:19  Finished successfully:49
Progress:  Submitted:3  Active:19  Finished successfully:49
Progress:  Submitted:2  Active:20  Finished successfully:49
Progress:  Submitted:1  Active:21  Finished successfully:49
.
.
.
Progress:  Submitted:1  Active:15  Finished successfully:55
Progress:  Submitted:1  Active:15  Finished successfully:55
Progress:  Submitted:1  Active:15  Finished successfully:55
Progress:  Submitted:1  Active:15  Finished successfully:55
Progress:  Submitted:1  Active:14  Checking status:1  Finished successfully:55
Progress:  Submitted:1  Active:14  Finished successfully:56
Progress:  Submitted:1  Active:13  Checking status:1  Finished successfully:56
Progress:  Submitted:1  Active:12  Checking status:1  Finished successfully:57
Progress:  Submitted:1  Active:12  Finished successfully:58
Progress:  Submitted:1  Active:11  Checking status:1  Finished successfully:58
Progress:  Submitted:1  Active:10  Checking status:1  Finished successfully:59
Progress:  Submitted:1  Active:10  Finished successfully:60
Progress:  Submitted:1  Active:10  Finished successfully:60
Progress:  Submitted:1  Active:10  Finished successfully:60
Progress:  Submitted:1  Active:8  Checking status:1  Finished successfully:61
Progress:  Submitted:1  Active:8  Finished successfully:62
Progress:  Submitted:1  Active:7  Checking status:1  Finished successfully:62
Progress:  Submitted:1  Active:7  Finished successfully:63
Progress:  Submitted:1  Active:6  Checking status:1  Finished successfully:63
Progress:  Submitted:1  Active:4  Checking status:1  Finished successfully:65
Progress:  Submitted:1  Active:3  Checking status:1  Finished successfully:66
Progress:  Submitted:1  Active:3  Finished successfully:67
Progress:  Submitted:1  Active:3  Finished successfully:67
Progress:  Submitted:1  Active:2  Checking status:1  Finished successfully:67
Progress:  Submitted:1  Active:2  Finished successfully:68
Progress:  Submitted:1  Active:1  Checking status:1  Finished successfully:68
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70
Progress:  Submitted:1  Finished successfully:70

etc
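
The stalled tail of the output above (one task stuck in Submitted, nothing Active, the same line repeating) can be spotted mechanically. The following is a minimal Python sketch, not part of Swift: the function names and the min_repeats threshold are hypothetical, and it assumes each progress record sits on a single line, so a log with wrapped lines would need to be rejoined first.

```python
import re

# One "Name:count" field, e.g. "Checking status:1" or "Finished successfully:70".
FIELD = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def parse_progress(line):
    """Return {state name: count} for a 'Progress:' line, or None otherwise."""
    if not line.startswith("Progress:"):
        return None
    body = line[len("Progress:"):]
    return {name.strip(): int(count) for name, count in FIELD.findall(body)}


def find_stall(lines, min_repeats=5):
    """Return the final snapshot if the log ends in a stall, else None.

    A stall here means: the last min_repeats (or more) progress lines are
    identical, some tasks are still Submitted, and none are Active.
    """
    snapshots = [p for p in map(parse_progress, lines) if p is not None]
    if not snapshots:
        return None
    last = snapshots[-1]
    repeats = 0
    for snap in reversed(snapshots):
        if snap != last:
            break
        repeats += 1
    stalled = last.get("Submitted", 0) > 0 and last.get("Active", 0) == 0
    return last if stalled and repeats >= min_repeats else None


# Sample built from the lines reported above.
log = ["Progress:  Submitted:1  Active:1  Checking status:1  Finished successfully:68"]
log += ["Progress:  Submitted:1  Finished successfully:70"] * 15
print(find_stall(log))
```

Run against the tail of the output above, this returns the repeated snapshot with one Submitted task and 70 finished, which is the condition described at the top of this message.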