[Swift-user] Using > 1 CPU per compute node under GRAM

Mon Jul 21 18:57:26 CDT 2008

In the past (i.e. MolDyn), I don't think we ever found an easy solution 
to this when running straight through GRAM (if the LRM didn't support 
this policy). But, as JP said, it is site-specific: some sites, such as 
Teraport, let a job request just 1 CPU of a node, in which case GRAM 
should work just fine.

Ioan

Michael Wilde wrote:
> I'm asking this on behalf of Mike Kubal while I wait for more info on 
> his settings:
>
> Mike is running under Swift on teragrid/Abe which has 8-core nodes. 
> His jobs are all running 1-job-per-node, wasting 7 cores.
>
> I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM.
>
> In the meantime, does anyone know if there's a way to specify 
> compute-node-sharing between separate single-cpu jobs via both GRAMs?
>
> And does this depend on the local job manager code or settings 
> (i.e., might it work on some sites but not others)?
>
> On globus doc page:
> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes 
>
>
> I see:
>         <!-- *OR* an explicit number of processes per node... -->
>         <processesPerHost>...</processesPerHost>
>         </resourceAllocationGroup>
>         </extensions>
> but can't tell if this applies to single-core jobs or only to 
> multi-core jobs.
>
> This will ideally be handled as desired by Falkon or Coaster, but in 
> the meantime I was hoping there was a simple setting to give MikeK 
> better CPU yield on Abe.
>
> - Mike Wilde
>
> ---
>
> A sample of one of his jobs looks like this under qstat -ef:
>
> Job Id: 395980.abem5.ncsa.uiuc.edu
>     Job_Name = STDIN
>     Job_Owner = mkubal at abe1196
>     job_state = Q
>     queue = normal
>     server = abem5.ncsa.uiuc.edu
>     Account_Name = onm
>     Checkpoint = u
>     ctime = Mon Jul 21 17:43:47 2008
>     Error_Path = abe1196:/dev/null
>     Hold_Types = n
>     Join_Path = n
>     Keep_Files = n
>     Mail_Points = n
>     mtime = Mon Jul 21 17:43:47 2008
>     Output_Path = abe1196:/dev/null
>     Priority = 0
>     qtime = Mon Jul 21 17:43:47 2008
>     Rerunable = True
>     Resource_List.ncpus = 1
>     Resource_List.nodect = 1
>     Resource_List.nodes = 1
>     Resource_List.walltime = 00:10:00
>     Shell_Path_List = /bin/sh
>     etime = Mon Jul 21 17:43:47 2008
>     submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN
>
> And his jobs show up like this under qstat -n (i.e., all are on core /0):
>
> 395653.abem5.ncsa.ui mkubal   normal   STDIN        1767     1   1    --  00:10 R   --
>    abe0872/0
>
> While multi-core jobs use
>
> +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4
>    +abe0579/3+abe0579/2+abe0579/1+abe0579/0
>
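
The exec-host strings quoted above from qstat -n can be tallied per node to
see how many cores each job actually holds. Below is a minimal Python sketch,
assuming the "+host/core" format shown in the quoted output and the 8 cores
per node stated for Abe. The function names (cores_per_node, idle_cores) are
hypothetical and not part of Swift, GRAM, or PBS, and the idle count assumes
no other job shares the node.

```python
from collections import Counter

CORES_PER_NODE = 8  # assumption: Abe's 8-core nodes, per the quoted message


def cores_per_node(exec_host):
    """Count allocated cores per node in a PBS exec-host string.

    Expects entries such as "abe0872/0" joined by "+", as in the quoted
    qstat -n output. Whitespace and line breaks are ignored.
    """
    counts = Counter()
    for entry in "".join(exec_host.split()).split("+"):
        if not entry:
            continue  # empty piece produced by a leading "+"
        node, _, _core = entry.partition("/")
        counts[node] += 1
    return counts


def idle_cores(exec_host, total=CORES_PER_NODE):
    """Cores left unused on each node the job touches (assumes no sharing)."""
    return {node: total - used for node, used in cores_per_node(exec_host).items()}


# The two exec-host strings from the quoted qstat -n output.
single = "abe0872/0"
multi = (
    "+abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4"
    "\n   +abe0579/3+abe0579/2+abe0579/1+abe0579/0"
)

print(dict(cores_per_node(single)))  # {'abe0872': 1}
print(idle_cores(single))            # {'abe0872': 7}
print(dict(cores_per_node(multi)))   # {'abe0582': 3, 'abe0579': 8}
```

For the single-CPU job this reports 7 idle cores on abe0872, which matches
the "wasting 7 cores" observation in the quoted message.
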

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================