[Swift-user] Re: [Swift-devel] Coasters Configuration Optimizations

Michael Wilde wilde at mcs.anl.gov
Sat Apr 2 08:05:46 CDT 2011


(replying to swift-user instead of devel)

Ketan,

Your main sites.xml coaster settings were:

    <execution provider="coaster" jobmanager="local:pbs"/>
    <profile namespace="globus" key="project">CI-CCR000013</profile>
    <profile namespace="globus" key="ppn">24:cray:pack</profile>
    <profile namespace="globus" key="workersPerNode">24</profile>
    <profile namespace="globus" key="maxTime">100000</profile>
    <profile namespace="globus" key="lowOverallocation">100</profile>
    <profile namespace="globus" key="highOverallocation">100</profile>
    <profile namespace="globus" key="slots">20</profile>
    <profile namespace="globus" key="nodeGranularity">5</profile>
    <profile namespace="globus" key="maxNodes">5</profile>
    <profile namespace="karajan" key="jobThrottle">20.00</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>

Your tc entry (shortened here) was:

pbs modftdock /.../modftdock.sh null null GLOBUS::maxwalltime="02:00:00"

And you said you saw in PBS: 13 jobs of 24 hours and 4 jobs of 22 hours. I suspect this was after the script had been running a while, and many jobs had been completed.

Based on your settings, I think you should have had at one time about 17 coaster block jobs running, because the throttle on your coaster pool was set to 20 (which would cause Swift to try to run about 2000 apps at once; 2001 to be precise). Since each job should have requested exactly 5 nodes (based on your maxNodes=nodeGranularity=5 setting above), Swift would have had to run 17 jobs to accommodate 2000 apps: 17 * (5*24) = 2040 apps. The 24 comes from your workersPerNode setting, which is a poorly named parameter that we are renaming to what it really specifies: appsPerNode, for concurrent application calls per node.
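
As a rough check of the block-count arithmetic above, here is a minimal Python sketch. The relation "concurrent apps = jobThrottle * 100 + 1" is inferred only from the 20 -> 2001 figure quoted above, and the function and variable names are made up for illustration; this is not coasters' actual allocation code.

```python
import math

def concurrent_app_limit(job_throttle):
    # Assumption: limit = jobThrottle * 100 + 1, inferred from 20 -> 2001 above.
    return int(job_throttle * 100) + 1

def blocks_needed(app_limit, nodes_per_block, apps_per_node):
    # Each block offers nodes_per_block * apps_per_node concurrent app slots.
    slots_per_block = nodes_per_block * apps_per_node
    return math.ceil(app_limit / slots_per_block)

# Values taken from the sites.xml settings quoted above.
limit = concurrent_app_limit(20.00)                                  # jobThrottle
blocks = blocks_needed(limit, nodes_per_block=5, apps_per_node=24)   # maxNodes, workersPerNode
capacity = blocks * 5 * 24                                           # app slots across all blocks

print(limit, blocks, capacity)
```

Changing nodes_per_block or apps_per_node shows how the number of block jobs scales with those two settings, under the same assumption about the throttle.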

I also suspect that when this workflow started, coasters was requesting blocks of time closer to the 100,000 seconds that you specified for maxTime (that's about 27.8 hours). I think the qstat snapshot you provided showed fewer than 17 jobs and job times shorter than that (24 and 22 hours) because there were no longer enough apps remaining to require those higher values. But it was still going to try to run all the remaining jobs; probably fewer than 2000 jobs remained when you ran the enclosed qstat. In fact, the number of jobs remaining at the time was likely no more than:
  
  13*5*(24/2) + 4*5*(22/2) = 65*12 + 20*11 = 780 + 220, signifying <= 1000 jobs remaining

Since the maxwalltime estimate for your app in tc.data was 2 hours, I think coasters will pick a wall time that is the min(time needed for jobs remaining, time needed for max throttle jobs based on maxtime and high/low overallocation settings).

A note here to Swift developers: we need to first clarify the behavior of coasters in detail in the User Guide; then we need to build suitable templates that *greatly* simplify the settings and end-user parameters, and explain those simpler settings for use by all but the most sophisticated users with complex needs.

We also need to do much more experimentation to see if coasters will run OK with far less parameter-override specification, and see if its automation and algorithmic intelligence will do the right thing in almost all cases.

Most of the time in current use we specify overrides for almost all settings so that we get a precise shape and number of jobs submitted. Doing that assumes we know better than coasters and forces the user to understand how to override all the settings.

It's a very interesting question, and a hard but critically important one to answer if we are to make usage simpler.

- Mike



----- Original Message -----
> Hello,
> 
> Today, I successfully ran an experiment with 5000 tasks on beagle with
> Coasters. These modFTdock tasks correspond to the production grade
> modFTdock parameters and each task takes around 20 minutes to
> complete.
> 
> After some discussions with Mike, I configured my sites.xml file to
> obtain necessary resources on beagle.
> 
> However, it seems that I did not configure my sites.xml optimally as
> the resources requested exceeded my requirements.
> 
> In summary, 13 blocks for 24 hours and 4 blocks for 22 hours (120
> nodes) were requested, whereas at most 4 hours on each block was
> needed, or more time on fewer blocks.
> 
> Attached are the following:
> 
> sites.xml
> tc.data
> A qstat snapshot at the completion of the experiment.
> 
> Suggestions and insights on how to optimize the configuration are
> welcome.
> 
> 
> Regards,
> Ketan
> 

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



