[Swift-user] Coasters submitting too many jobs simultaneously

Jonathan Margoliash jmargolpeople at gmail.com
Thu Sep 13 12:50:04 CDT 2012


Hey David, another question:

When I run Swift on Fusion using the sites.xml file you sent me, Swift is
scheduling many jobs on Fusion. Why is that? The sites.xml specifies
<execution jobmanager="local:pbs" provider="coaster" url="none"/>
and I thought the point of using coasters as the execution provider was to
wrap all of my separate app calls into a single job submission. With swift
scheduling so many jobs, it's hard to track down and manually abort them
when I need to.

Maybe this stems from my lack of understanding of the coaster system. I
thought jobsPerNode limited the number of app calls that would be sent to
any node at a given time. However, in looking back at the web page, I'm now
thinking that maybe it limits the number of swift coaster workers on each
node, while each swift coaster worker can run many apps at once. If that is
true, then how do I limit the number of apps run on each node
simultaneously? And if each swift worker can run many apps at once, why
would I ever want jobsPerNode > 1? Also, does the slots variable have
anything to do with this? If so, what does it do?
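
To be concrete about which reading I mean: I had interpreted

 <profile namespace="globus" key="jobsPerNode">8</profile>

as "run at most 8 app calls on each node at any one time", not as "start 8
coaster workers on each node, each of which can run many apps at once". Which
of those readings is correct?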

For reference, the workdirectory for the swift call is
/home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403
Here's the output of a bunch of tests I ran while swift was going:

--------------------------------
sites.xml:

<config>
  <pool handle="fusion">
    <execution jobmanager="local:pbs" provider="coaster" url="none"/>
    <filesystem provider="local" url="none" />
    <profile namespace="globus" key="maxtime">3600</profile>
    <profile namespace="globus" key="jobsPerNode">8</profile>
    <profile namespace="globus" key="queue">shared</profile>
    <profile namespace="globus" key="slots">100</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxNodes">2</profile>
    <profile namespace="karajan" key="jobThrottle">5.99</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="globus" key="HighOverAllocation">100</profile>
    <profile namespace="globus" key="LowOverAllocation">100</profile>
    <workdirectory>/home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403/swiftwork</workdirectory>
  </pool>
</config>
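
If I am reading the documentation right (and I may well not be), the throttle
settings above would already allow far more concurrency than I intended:

 jobThrottle 5.99  ->  roughly 5.99 * 100 + 1 = 600 app calls in flight at once
 slots 100 * maxNodes 2 * jobsPerNode 8  ->  up to 1600 worker CPU slots,
   spread over as many as 100 separate PBS block jobs

Is that arithmetic right, or am I misreading what these keys mean?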

---------------------------------
Terminal output from running swift:

Entering swift from create_random_sample ---- text generated by my code
Warning: Function toint is deprecated, at line 10
Progress:  time: Thu, 13 Sep 2012 12:14:22 -0500
Progress:  time: Thu, 13 Sep 2012 12:14:23 -0500  Initializing:1
Progress:  time: Thu, 13 Sep 2012 12:14:24 -0500  Stage in:99  Submitting:1
Progress:  time: Thu, 13 Sep 2012 12:14:25 -0500  Stage in:86  Submitting:1  Submitted:13
Progress:  time: Thu, 13 Sep 2012 12:14:27 -0500  Submitted:99  Active:1
Progress:  time: Thu, 13 Sep 2012 12:14:30 -0500  Submitted:91  Active:9
Progress:  time: Thu, 13 Sep 2012 12:14:31 -0500  Submitted:59  Active:41
Progress:  time: Thu, 13 Sep 2012 12:14:32 -0500  Submitted:27  Active:73
Progress:  time: Thu, 13 Sep 2012 12:14:34 -0500  Submitted:12  Active:88
Progress:  time: Thu, 13 Sep 2012 12:14:37 -0500  Submitted:12  Active:88
Progress:  time: Thu, 13 Sep 2012 12:14:39 -0500  Submitted:11  Active:89
Progress:  time: Thu, 13 Sep 2012 12:14:40 -0500  Submitted:4  Active:96
Progress:  time: Thu, 13 Sep 2012 12:14:43 -0500  Submitted:4  Active:96
Progress:  time: Thu, 13 Sep 2012 12:14:46 -0500  Submitted:4  Active:96
Progress:  time: Thu, 13 Sep 2012 12:14:49 -0500  Submitted:4  Active:96
Progress:  time: Thu, 13 Sep 2012 12:14:52 -0500  Submitted:4  Active:96
...

(Why are so many apps considered submitted/active at once? I only want at most
8 apps running per node (each node has only 8 cores), and since maxNodes = 2 at
the moment, I want Active <= 16 at all times.)
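
If my guess about the throttle arithmetic above is right, then a pool more like
the following sketch is what I should be using (untested, and the values are
just my guesses: jobThrottle 0.15 because 0.15 * 100 + 1 = 16 concurrent app
calls, and slots 1 / nodeGranularity 2 because I want a single two-node PBS
submission):

<config>
  <pool handle="fusion">
    <execution jobmanager="local:pbs" provider="coaster" url="none"/>
    <filesystem provider="local" url="none" />
    <profile namespace="globus" key="maxtime">3600</profile>
    <profile namespace="globus" key="jobsPerNode">8</profile>
    <profile namespace="globus" key="queue">shared</profile>
    <profile namespace="globus" key="slots">1</profile>
    <profile namespace="globus" key="nodeGranularity">2</profile>
    <profile namespace="globus" key="maxNodes">2</profile>
    <profile namespace="karajan" key="jobThrottle">0.15</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <workdirectory>/home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403/swiftwork</workdirectory>
  </pool>
</config>

Would that cap me at 8 apps per node on 2 nodes, with a single PBS job in the
queue?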

-------
Output of showq -u $USER after swift has been killed manually: (Notice that
a bunch of jobs are still running. Why doesn't Swift shut them down
automatically when it quits?)

[jmargoliash at flogin3 my_SwiftSCE2_branch]$ showq -u $USER
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

1289476            jmargoliash    Running     1    00:58:44  Thu Sep 13 12:14:27
1289477            jmargoliash    Running     1    00:58:46  Thu Sep 13 12:14:29
1289478            jmargoliash    Running     1    00:58:46  Thu Sep 13 12:14:29
1289479            jmargoliash    Running     1    00:58:47  Thu Sep 13 12:14:30
1289480            jmargoliash    Running     1    00:58:47  Thu Sep 13 12:14:30
1289481            jmargoliash    Running     2    00:58:47  Thu Sep 13 12:14:30
1289482            jmargoliash    Running     2    00:58:47  Thu Sep 13 12:14:30
1289483            jmargoliash    Running     2    00:58:48  Thu Sep 13 12:14:31
1289484            jmargoliash    Running     1    00:58:48  Thu Sep 13 12:14:31

     9 Active Jobs    2860 of 3088 Processors Active (92.62%)
                       343 of  346 Nodes Active      (99.13%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 9   Active Jobs: 9   Idle Jobs: 0   Blocked Jobs: 0
[jmargoliash at flogin3 my_SwiftSCE2_branch]$


-------------------------
Output of ps -u $USER -H after swift has been killed:

[jmargoliash at flogin3 my_SwiftSCE2_branch]$ ps -u $USER -H
  PID TTY          TIME CMD
19603 ?        00:00:00 sshd
19604 pts/16   00:00:00   bash
17270 ?        00:00:00 sshd
17271 pts/25   00:00:00   bash
 6495 pts/25   00:00:00     vim
16825 ?        00:00:00 sshd
16826 pts/34   00:00:00   bash
25813 pts/34   00:00:00     ps
 4494 ?        00:00:00 sshd
 4495 pts/1    00:00:00   bash
31023 pts/1    00:00:00     vim
24727 pts/16   00:00:00 qdel <-----------
20792 pts/16   00:00:00 check_on_swift.
20793 pts/16   00:00:00   sleep
19755 pts/16   00:00:00 tee

You can see that a qdel command was started after Swift finished. (I'm fairly
sure this is not a leftover call still hanging around from when I ran qdel
manually earlier.) I assume this is Swift's attempt to shut down the jobs it
started as it exits. However, I expected qdel to return almost instantly. Why
is it hanging here? Is this a problem with Fusion, or with my code?