<div>Hey David, another question:</div><div><br></div><div>When I run Swift on Fusion using the sites.xml file you sent me, Swift is scheduling many jobs on Fusion. Why is that? The sites.xml specifies</div><div><execution jobmanager="local:pbs" provider="coaster" url="none"/></div>
<div>and I thought the point of using coasters as the execution provider was to wrap all of my separate app calls into a single job submission. With Swift scheduling so many jobs, it's hard to track down and manually abort them when I need to.</div>
<div><br></div><div>Maybe this stems from my lack of understanding of the coaster system. I thought jobsPerNode limited the number of app calls that would be sent to any node at a given time. However, looking back at the web page, I now think it may instead limit the number of Swift coaster workers on each node, while each coaster worker can run many apps at once. If that is true, how do I limit the number of apps run on each node simultaneously? And if each Swift worker can run many apps at once, why would I ever want jobsPerNode > 1? Also, does the slots variable have anything to do with this? If so, what does it do?</div>
<div><br></div><div>For reference, the workdirectory for the swift call is</div><div>/home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403</div><div>Here's the output of a bunch of tests I ran while swift was going:</div>
<div><br></div><div><div>--------------------------------</div><div>sites.xml:</div><div><br></div><div><div><config></div><div><pool handle="fusion"></div><div><execution jobmanager="local:pbs" provider="coaster" url="none"/></div>
<div> <filesystem provider="local" url="none" /></div><div> <profile namespace="globus" key="maxtime">3600</profile></div><div> <profile namespace="globus" key="jobsPerNode">8</profile></div>
<div> <profile namespace="globus" key="queue">shared</profile></div><div> <profile namespace="globus" key="slots">100</profile></div><div> <profile namespace="globus" key="nodeGranularity">1</profile></div>
<div> <profile namespace="globus" key="maxNodes">2</profile></div><div> <profile namespace="karajan" key="jobThrottle">5.99</profile></div><div> <profile namespace="karajan" key="initialScore">10000</profile></div>
<div> <profile namespace="globus" key="HighOverAllocation">100</profile></div><div> <profile namespace="globus" key="LowOverAllocation">100</profile></div><div>
<workdirectory>/home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403/swiftwork</workdirectory></div><div></pool></div><div></config></div></div><div><br class="Apple-interchange-newline">
---------------------------------</div><div>Terminal output from running swift:</div><div><br></div><div><div>Entering swift from create_random_sample ---- text generated by my code</div><div>Warning: Function toint is deprecated, at line 10</div>
<div>Progress: time: Thu, 13 Sep 2012 12:14:22 -0500</div><div>Progress: time: Thu, 13 Sep 2012 12:14:23 -0500 Initializing:1</div><div>Progress: time: Thu, 13 Sep 2012 12:14:24 -0500 Stage in:99 Submitting:1</div><div>
Progress: time: Thu, 13 Sep 2012 12:14:25 -0500 Stage in:86 Submitting:1 Submitted:13</div><div>Progress: time: Thu, 13 Sep 2012 12:14:27 -0500 Submitted:99 Active:1</div><div>Progress: time: Thu, 13 Sep 2012 12:14:30 -0500 Submitted:91 Active:9</div>
<div>Progress: time: Thu, 13 Sep 2012 12:14:31 -0500 Submitted:59 Active:41</div><div>Progress: time: Thu, 13 Sep 2012 12:14:32 -0500 Submitted:27 Active:73</div><div>Progress: time: Thu, 13 Sep 2012 12:14:34 -0500 Submitted:12 Active:88</div>
<div>Progress: time: Thu, 13 Sep 2012 12:14:37 -0500 Submitted:12 Active:88</div><div>Progress: time: Thu, 13 Sep 2012 12:14:39 -0500 Submitted:11 Active:89</div><div>Progress: time: Thu, 13 Sep 2012 12:14:40 -0500 Submitted:4 Active:96</div>
<div>Progress: time: Thu, 13 Sep 2012 12:14:43 -0500 Submitted:4 Active:96</div><div>Progress: time: Thu, 13 Sep 2012 12:14:46 -0500 Submitted:4 Active:96</div><div>Progress: time: Thu, 13 Sep 2012 12:14:49 -0500 Submitted:4 Active:96</div>
<div>Progress: time: Thu, 13 Sep 2012 12:14:52 -0500 Submitted:4 Active:96</div></div><div>...</div></div><div><br></div><div>(Why are so many apps considered submitted/active at once? I only want 8 apps working per node at maximum (because each node only has 8 cores), and since maxNodes = 2 at the moment, I want active <= 16 at all times).</div>
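<div><br></div><div>To spell out the arithmetic behind that expectation, here is a small Python sketch. The numbers are copied from my sites.xml and from the last progress line above; the function names are made up for illustration. It rests on two assumptions I'm not sure about: that jobsPerNode is meant to cap simultaneous apps per node, and (if I'm reading the Swift user guide right) that jobThrottle allows roughly jobThrottle * 100 + 1 concurrent jobs.</div>

```python
# Values copied from the sites.xml quoted in this message.
JOBS_PER_NODE = 8
MAX_NODES = 2
JOB_THROTTLE = 5.99

# From the last progress line quoted above: "Submitted:4 Active:96".
OBSERVED_ACTIVE = 96


def expected_active_cap(jobs_per_node: int, max_nodes: int) -> int:
    """Cap I expected, ASSUMING jobsPerNode limits simultaneous apps per node."""
    return jobs_per_node * max_nodes


def throttle_limit(job_throttle: float) -> int:
    """ASSUMPTION (my reading of the user guide, not verified):
    a site accepts about jobThrottle * 100 + 1 concurrent jobs."""
    return round(job_throttle * 100) + 1


cap = expected_active_cap(JOBS_PER_NODE, MAX_NODES)
print(f"expected cap on active apps:  {cap}")
print(f"observed active apps:         {OBSERVED_ACTIVE}")
print(f"limit implied by jobThrottle: {throttle_limit(JOB_THROTTLE)}")
```

<div>If that formula is right, a jobThrottle of 0.15 would correspond to my target of 16 active apps, but I don't know whether jobThrottle is the right knob for this.</div>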
<div><br></div><div>-------</div><div>Output of showq -u $USER after Swift was killed manually. (Notice that a bunch of jobs are still running. Why doesn't Swift shut them down automatically when it quits?)</div><div>
<br></div><div>[jmargoliash@flogin3 my_SwiftSCE2_branch]$ showq -u $USER</div><div>ACTIVE JOBS--------------------</div><div>JOBNAME USERNAME STATE PROC REMAINING STARTTIME</div><div><br></div>
<div>1289476 jmargoliash Running 1 00:58:44 Thu Sep 13 12:14:27</div>
<div>1289477 jmargoliash Running 1 00:58:46 Thu Sep 13 12:14:29</div><div>1289478 jmargoliash Running 1 00:58:46 Thu Sep 13 12:14:29</div><div>1289479 jmargoliash Running 1 00:58:47 Thu Sep 13 12:14:30</div>
<div>1289480 jmargoliash Running 1 00:58:47 Thu Sep 13 12:14:30</div><div>1289481 jmargoliash Running 2 00:58:47 Thu Sep 13 12:14:30</div><div>1289482 jmargoliash Running 2 00:58:47 Thu Sep 13 12:14:30</div>
<div>1289483 jmargoliash Running 2 00:58:48 Thu Sep 13 12:14:31</div><div>1289484 jmargoliash Running 1 00:58:48 Thu Sep 13 12:14:31</div><div><br></div><div> 9 Active Jobs 2860 of 3088 Processors Active (92.62%)</div>
<div> 343 of 346 Nodes Active (99.13%)</div><div><br></div><div>IDLE JOBS----------------------</div><div>JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME</div><div>
<br></div><div><br></div><div>0 Idle Jobs</div><div><br></div><div>BLOCKED JOBS----------------</div><div>JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME</div><div><br></div><div><br></div><div>
Total Jobs: 9 Active Jobs: 9 Idle Jobs: 0 Blocked Jobs: 0</div><div>[jmargoliash@flogin3 my_SwiftSCE2_branch]$ </div><div><br></div><div><br></div><div>
-------------------------</div><div>Output of ps -u $USER -H after swift has been killed:</div><div><br></div><div><div>[jmargoliash@flogin3 my_SwiftSCE2_branch]$ ps -u $USER -H</div><div> PID TTY TIME CMD</div>
<div>19603 ? 00:00:00 sshd</div>
<div>19604 pts/16 00:00:00 bash</div><div>17270 ? 00:00:00 sshd</div><div>17271 pts/25 00:00:00 bash</div><div> 6495 pts/25 00:00:00 vim</div><div>16825 ? 00:00:00 sshd</div><div>16826 pts/34 00:00:00 bash</div>
<div>25813 pts/34 00:00:00 ps</div><div> 4494 ? 00:00:00 sshd</div><div> 4495 pts/1 00:00:00 bash</div><div>31023 pts/1 00:00:00 vim</div><div>24727 pts/16 00:00:00 qdel <-----------</div><div>
20792 pts/16 00:00:00 check_on_swift.</div>
<div>20793 pts/16 00:00:00 sleep</div><div>19755 pts/16 00:00:00 tee</div></div><div><br></div><div>You can see that a qdel command was started after Swift finished. (I'm pretty sure this is not a leftover call still hanging from when I ran qdel earlier.) I assume this is Swift's attempt to shut down the processes it started as it exits. However, I presumed qdel would return almost instantly. Why is it hanging here? Is this a problem with Fusion, or with my code?</div>