[Swift-user] Coaster provider is not allocating dedicated nodes

Michael Wilde wilde at mcs.anl.gov
Wed Jan 20 09:38:51 CST 2010


Using the sites entry below, I see that coasters is allocating 8 
*shared* nodes rather than *dedicated* nodes; hence it's running many more 
processes per node than it should, causing the jobs to run longer than 
expected and exceed their walltime.

The sites entry:

   <pool handle="pbs">
     <execution provider="coaster" url="none" jobManager="local:pbs"/>

     <profile namespace="globus" key="maxtime">7500</profile>
     <profile namespace="globus" key="workersPerNode">8</profile>

     <profile namespace="globus" key="slots">12</profile>
     <profile namespace="globus" key="nodeGranularity">1</profile>
     <profile namespace="globus" key="maxNodes">1</profile>

     <profile namespace="karajan" key="jobThrottle">1.27</profile>
     <profile namespace="karajan" key="initialScore">10000</profile>
     <filesystem provider="local"/>
     <workdirectory>$rundir</workdirectory>
   </pool>

qstat (below) shows the 12 coaster jobs I requested with "slots=12", but 
they are only using 2 different nodes, c45 and c46, between them, even 
though they are running 96 total coaster workers. (I can see that I have 
96 jobs active).

It seems like between coasters and the PBS provider, Swift is not telling 
PBS that each of these jobs should get a dedicated node of 8 cores.
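
To put a number on the oversubscription, here is a minimal Python sketch of the arithmetic. The inputs are taken from this message (slots=12, workersPerNode=8, 2 nodes in use, 8 cores per node); the function name and its parameters are my own, for illustration only, and are not part of Swift or coasters.

```python
def oversubscription(slots, workers_per_node, nodes_in_use, cores_per_node):
    """Return (total workers, average workers per core on the nodes in use)."""
    # Each coaster job (slot) starts workers_per_node workers.
    total_workers = slots * workers_per_node
    # Spread over the physical nodes actually used, relative to their core count.
    workers_per_core = total_workers / nodes_in_use / cores_per_node
    return total_workers, workers_per_core


total, factor = oversubscription(slots=12, workers_per_node=8,
                                 nodes_in_use=2, cores_per_node=8)
print(total, factor)  # 96 6.0
```

So on average each core is carrying about six workers instead of one, which would be consistent with the jobs running past their walltime.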


Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
1034.svc.pads.ci     wilde    extended null              13086     1  --     --  02:04 R 01:26
    c46
1035.svc.pads.ci     wilde    extended null              13168     1  --     --  02:04 R 01:26
    c46
1036.svc.pads.ci     wilde    extended null              13387     1  --     --  02:04 R 01:26
    c46
1037.svc.pads.ci     wilde    extended null              14060     1  --     --  02:04 R 01:26
    c46
1038.svc.pads.ci     wilde    extended null              14237     1  --     --  02:04 R 01:26
    c46
1039.svc.pads.ci     wilde    extended null              14640     1  --     --  02:04 R 01:26
    c46
1040.svc.pads.ci     wilde    extended null              15200     1  --     --  02:04 R 01:26
    c46
1041.svc.pads.ci     wilde    extended null              15753     1  --     --  02:04 R 01:26
    c46
1042.svc.pads.ci     wilde    extended null              23700     1  --     --  02:04 R 01:26
    c45
1043.svc.pads.ci     wilde    extended null              23781     1  --     --  02:04 R 01:26
    c45
1044.svc.pads.ci     wilde    extended null              24016     1  --     --  02:04 R 01:26
    c45
1045.svc.pads.ci     wilde    extended null              24796     1  --     --  02:04 R 01:26
    c45
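
For anyone who wants to tally this kind of listing rather than eyeball it, here is a rough Python sketch that counts jobs per node. It assumes the layout shown above (a job row starting with the job ID, followed by an indented line naming the node) and tolerates job rows that the mailer has wrapped onto a second line. The function name is hypothetical, and the handling of "host/cpu" and "+"-joined host lists is my assumption about other qstat layouts, not something shown in this output.

```python
import re
from collections import Counter

# A job row starts with a numeric job ID followed by a dot, e.g. "1034.svc.pads.ci".
JOB_ID = re.compile(r"^\d+\.\S+")


def jobs_per_node(qstat_text):
    """Count how many jobs list each node in qstat-style output."""
    counts = Counter()
    in_job = False
    for line in qstat_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if JOB_ID.match(line):
            in_job = True
        elif (in_job and len(tokens) == 1 and line[0].isspace()
              and not tokens[0].startswith("-")):
            # Indented single-token line: the node list for the current job.
            # Assumption: entries may look like "c46" or "c46/0+c46/1".
            hosts = {part.split("/")[0] for part in tokens[0].split("+")}
            counts.update(hosts)
            in_job = False
    return counts


SAMPLE = """\
1041.svc.pads.ci     wilde    extended null              15753     1  --
    --  02:04 R 01:26
    c46
1042.svc.pads.ci     wilde    extended null              23700     1  --     --  02:04 R 01:26
    c45
"""

print(sorted(jobs_per_node(SAMPLE).items()))  # [('c45', 1), ('c46', 1)]
```

Run over the full listing above, this gives 8 jobs on c46 and 4 on c45, i.e. 64 and 32 workers respectively at workersPerNode=8.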


