[Swift-user] Coasters submitting too many jobs simultaneously

Michael Wilde wilde at mcs.anl.gov
Thu Sep 13 15:11:30 CDT 2012


Hi Jonathan,

To clarify what your sites file below is requesting from coasters:
(note that I've changed the order of the XML tags to better explain them)

<profile namespace="globus" key="queue">shared</profile>

Send all jobs to the shared queue. On Fusion, I think PBS runs multiple jobs per node on this queue. So while the shared queue is good for faster testing, it may conflict with what Swift is requesting per the tags below.
 
<profile namespace="globus" key="slots">100</profile>

Submit up to 100 jobs to PBS at a time.

<profile namespace="globus" key="nodeGranularity">1</profile>
<profile namespace="globus" key="maxNodes">2</profile>

PBS jobs will request up to 2 nodes. This may interact in odd ways with the shared queue; I'm not sure and need to investigate that. Some jobs may be 1 node, others 2, depending on how many app requests your Swift script is making at the point when coasters decides to batch up the current requests into PBS jobs.

<profile namespace="globus" key="jobsPerNode">8</profile>

Run 8 swift apps concurrently on every PBS node. This will definitely be more than you want in the shared queue, where PBS is (I *think*) giving you just 1 *core*.
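
As a rough sanity check, the worker capacity implied by slots, maxNodes and jobsPerNode can simply be multiplied out. This is only a sketch based on the descriptions in this message: the function name is made up for illustration, and it assumes every PBS job is granted its full maxNodes, which (as noted above) will not always be the case.

```python
def coaster_app_capacity(slots, max_nodes, jobs_per_node):
    """Upper bound on concurrently running apps implied by the settings.

    Hypothetical helper, not part of Swift. Assumes each of the `slots`
    PBS jobs is granted `max_nodes` nodes and that every node runs
    `jobs_per_node` apps at once.
    """
    return slots * max_nodes * jobs_per_node


# Values from the sites file in this thread.
print(coaster_app_capacity(slots=100, max_nodes=2, jobs_per_node=8))  # 1600

# A single 2-node PBS job with 8 apps per node gives the 16 asked for.
print(coaster_app_capacity(slots=1, max_nodes=2, jobs_per_node=8))  # 16
```

So, under those assumptions, the file as written leaves room for far more than 16 concurrent apps, and slots is the setting that multiplies the number of PBS jobs.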

<profile namespace="karajan" key="jobThrottle">5.99</profile>
<profile namespace="karajan" key="initialScore">10000</profile>

Run up to 600 app calls at once on this site.
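
The 600 comes from the rule that Swift allows roughly (jobThrottle * 100) + 1 concurrent app calls on a site. I am assuming that rule here as I understand it from the Swift user guide, and the function names below are just for illustration. A high initialScore, again as I understand it, makes the site start at that full limit instead of ramping up to it.

```python
def max_concurrent_apps(job_throttle):
    """Concurrent app calls allowed on a site for a given jobThrottle.

    Assumes the rule (jobThrottle * 100) + 1; round() guards against
    floating-point noise in products such as 5.99 * 100.
    """
    return round(job_throttle * 100) + 1


def throttle_for(max_apps):
    """Inverse of the rule above: the jobThrottle for a desired limit."""
    return (max_apps - 1) / 100


print(max_concurrent_apps(5.99))  # 600
print(throttle_for(16))           # 0.15
```

Under that assumption, a jobThrottle of 0.15 would cap the site at 16 app calls at once.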

<profile namespace="globus" key="HighOverAllocation">100</profile>
<profile namespace="globus" key="LowOverAllocation">100</profile>
<profile namespace="globus" key="maxtime">3600</profile>

Make all job slots request the maximum amount of time (maxtime).

- Mike

----- Original Message -----
> From: "Jonathan Margoliash" <jmargolpeople at gmail.com>
> To: "Swift Language" <davidk at ci.uchicago.edu>, swift-user at ci.uchicago.edu, "Professor E. Yan" <eyan at anl.gov>, "Michael
> Wilde" <wilde at mcs.anl.gov>
> Sent: Thursday, September 13, 2012 12:50:04 PM
> Subject: Coasters submitting too many jobs simultaneously
> Hey David, another question:
> 
> 
> When I run Swift on Fusion using the sites.xml file you sent me, Swift
> is scheduling many jobs on Fusion. Why is that? The sites.xml
> specifies
> <execution jobmanager="local:pbs" provider="coaster" url="none"/>
> and I thought the point of using coasters as the execution provider
> was to wrap all of my separate app calls into a single job submission.
> With swift scheduling so many jobs, it's hard to track down and
> manually abort them when I need to.
> 
> 
> Maybe this stems from my lack of understanding of the coaster system.
> I thought jobsPerNode limited the number of app calls that would be
> sent to any node at a given time. However, in looking back at the web
> page, I'm now thinking that maybe it limits the number of swift
> coaster workers on each node, while each swift coaster worker can run
> many apps at once. If that is true, then how do I limit the number of
> apps run on each node simultaneously? And if each swift worker can run
> many apps at once, why would I ever want jobsPerNode > 1? Also, does
> the slots variable have anything to do with this? If so, what does it
> do?
> 
> 
> For reference, the workdirectory for the swift call is
> /home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403
> Here's the output of a bunch of tests I ran while swift was going:
> 
> 
> 
> --------------------------------
> Sitest.xml:
> 
> 
> 
> <config>
> <pool handle="fusion">
> <execution jobmanager="local:pbs" provider="coaster" url="none"/>
> <filesystem provider="local" url="none" />
> <profile namespace="globus" key="maxtime">3600</profile>
> <profile namespace="globus" key="jobsPerNode">8</profile>
> <profile namespace="globus" key="queue">shared</profile>
> <profile namespace="globus" key="slots">100</profile>
> <profile namespace="globus" key="nodeGranularity">1</profile>
> <profile namespace="globus" key="maxNodes">2</profile>
> <profile namespace="karajan" key="jobThrottle">5.99</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <profile namespace="globus" key="HighOverAllocation">100</profile>
> <profile namespace="globus" key="LowOverAllocation">100</profile>
> <workdirectory>/home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403/swiftwork</workdirectory>
> </pool>
> </config>
> 
> ---------------------------------
> Terminal output from running swift:
> 
> 
> 
> Entering swift from create_random_sample ---- text generated by my
> code
> Warning: Function toint is deprecated, at line 10
> Progress: time: Thu, 13 Sep 2012 12:14:22 -0500
> Progress: time: Thu, 13 Sep 2012 12:14:23 -0500 Initializing:1
> Progress: time: Thu, 13 Sep 2012 12:14:24 -0500 Stage in:99
> Submitting:1
> Progress: time: Thu, 13 Sep 2012 12:14:25 -0500 Stage in:86
> Submitting:1 Submitted:13
> Progress: time: Thu, 13 Sep 2012 12:14:27 -0500 Submitted:99 Active:1
> Progress: time: Thu, 13 Sep 2012 12:14:30 -0500 Submitted:91 Active:9
> Progress: time: Thu, 13 Sep 2012 12:14:31 -0500 Submitted:59 Active:41
> Progress: time: Thu, 13 Sep 2012 12:14:32 -0500 Submitted:27 Active:73
> Progress: time: Thu, 13 Sep 2012 12:14:34 -0500 Submitted:12 Active:88
> Progress: time: Thu, 13 Sep 2012 12:14:37 -0500 Submitted:12 Active:88
> Progress: time: Thu, 13 Sep 2012 12:14:39 -0500 Submitted:11 Active:89
> Progress: time: Thu, 13 Sep 2012 12:14:40 -0500 Submitted:4 Active:96
> Progress: time: Thu, 13 Sep 2012 12:14:43 -0500 Submitted:4 Active:96
> Progress: time: Thu, 13 Sep 2012 12:14:46 -0500 Submitted:4 Active:96
> Progress: time: Thu, 13 Sep 2012 12:14:49 -0500 Submitted:4 Active:96
> Progress: time: Thu, 13 Sep 2012 12:14:52 -0500 Submitted:4 Active:96
> ...
> 
> 
> (Why are so many apps considered submitted/active at once? I only want
> 8 apps working per node at maximum (because each node only has 8
> cores), and since maxNodes = 2 at the moment, I want active <= 16 at
> all times).
> 
> 
> -------
> Output of showq -u $USER after swift has been killed manually: (Notice
> that a bunch of jobs are still going. Why doesn't swift shut them down
> automatically when it quits?)
> 
> 
> [jmargoliash at flogin3 my_SwiftSCE2_branch]$ showq -u $USER
> ACTIVE JOBS--------------------
> JOBNAME USERNAME STATE PROC REMAINING STARTTIME
> 
> 
> 1289476 jmargoliash Running 1 00:58:44 Thu Sep 13 12:14:27
> 1289477 jmargoliash Running 1 00:58:46 Thu Sep 13 12:14:29
> 1289478 jmargoliash Running 1 00:58:46 Thu Sep 13 12:14:29
> 1289479 jmargoliash Running 1 00:58:47 Thu Sep 13 12:14:30
> 1289480 jmargoliash Running 1 00:58:47 Thu Sep 13 12:14:30
> 1289481 jmargoliash Running 2 00:58:47 Thu Sep 13 12:14:30
> 1289482 jmargoliash Running 2 00:58:47 Thu Sep 13 12:14:30
> 1289483 jmargoliash Running 2 00:58:48 Thu Sep 13 12:14:31
> 1289484 jmargoliash Running 1 00:58:48 Thu Sep 13 12:14:31
> 
> 
> 9 Active Jobs 2860 of 3088 Processors Active (92.62%)
> 343 of 346 Nodes Active (99.13%)
> 
> 
> IDLE JOBS----------------------
> JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
> 
> 
> 
> 
> 0 Idle Jobs
> 
> 
> BLOCKED JOBS----------------
> JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
> 
> 
> 
> 
> Total Jobs: 9 Active Jobs: 9 Idle Jobs: 0 Blocked Jobs: 0
> [jmargoliash at flogin3 my_SwiftSCE2_branch]$
> 
> 
> 
> 
> -------------------------
> Output of ps -u $USER -H after swift has been killed:
> 
> 
> 
> [jmargoliash at flogin3 my_SwiftSCE2_branch]$ ps -u $USER -H
> PID TTY TIME CMD
> 19603 ? 00:00:00 sshd
> 19604 pts/16 00:00:00 bash
> 17270 ? 00:00:00 sshd
> 17271 pts/25 00:00:00 bash
> 6495 pts/25 00:00:00 vim
> 16825 ? 00:00:00 sshd
> 16826 pts/34 00:00:00 bash
> 25813 pts/34 00:00:00 ps
> 4494 ? 00:00:00 sshd
> 4495 pts/1 00:00:00 bash
> 31023 pts/1 00:00:00 vim
> 24727 pts/16 00:00:00 qdel <-----------
> 20792 pts/16 00:00:00 check_on_swift.
> 20793 pts/16 00:00:00 sleep
> 19755 pts/16 00:00:00 tee
> 
> 
> You can see that a qdel command has been started after swift finished.
> (I'm pretty sure this is not a call that was left over hanging from
> when I called qdel earlier). I assume this is swift's attempt to shut
> down the processes it has started up as it exits. However, I presumed
> qdel would have a near-instantaneous return. Why is it hanging here?
> Is this a problem with fusion, or with my code?

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



