[Swift-user] Re: Using > 1 CPU per compute node under GRAM
JP Navarro
navarro at mcs.anl.gov
Mon Jul 21 18:28:28 CDT 2008
It's definitely subject to local resource manager/scheduling policy
configuration. At UC/ANL, for example, there's an explicit policy of
one job per node. Each job can of course run 1-n processes that share
the node's two processors. There's nothing GRAM can do to get around
that policy.
You'll need to ask NCSA whether their policies allow multiple jobs on
one node. If Abe allows only one job per node, then it's up to your
one job to spawn enough processes or threads to use all 8 cores.
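As an illustration only (nothing GRAM- or Swift-specific), a wrapper
of roughly this shape can fan one job out across the cores; the
worker command, core count, and output names are all placeholders:

    #!/bin/sh
    # Hypothetical fan-out wrapper: start one worker per core in the
    # background, then wait for all of them before the job exits.
    NCORES=8                     # Abe nodes have 8 cores
    i=0
    while [ $i -lt $NCORES ]; do
        ./worker input.$i > out.$i 2>&1 &
        i=`expr $i + 1`
    done
    wait    # the batch job ends only after every worker has finished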
JP
On Jul 21, 2008, at 6:10 PM, Michael Wilde wrote:
> I'm asking this on behalf of Mike Kubal while I wait for more info on
> his settings:
>
> Mike is running under Swift on teragrid/Abe which has 8-core nodes.
> His jobs are all running 1-job-per-node, wasting 7 cores.
>
> I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM.
>
> In the meantime, does anyone know if there's a way to specify
> compute-node sharing between separate single-CPU jobs via both GRAMs?
>
> And is this dependent on the local job manager code or settings
> (i.e., it might work on some sites but not others)?
>
> On globus doc page:
> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes
>
> I see:
> <!-- *OR* an explicit number of processes per node... -->
> <processesPerHost>...</processesPerHost>
> </resourceAllocationGroup>
> </extensions>
> but can't tell if this applies to single-core jobs or only to
> multi-core jobs.
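>
> For concreteness, I imagine the full construct looks roughly like
> this; only resourceAllocationGroup and processesPerHost are quoted
> from that page, the hostCount element and the values are my guesses:
>
> <extensions>
>   <resourceAllocationGroup>
>     <!-- guessed: one compute node -->
>     <hostCount>1</hostCount>
>     <!-- quoted construct: pack 8 processes onto that host -->
>     <processesPerHost>8</processesPerHost>
>   </resourceAllocationGroup>
> </extensions>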
>
> This will ideally be handled by Falkon or Coaster, but in the
> meantime I was hoping there was a simple setting to give MikeK
> better CPU yield on Abe.
>
> - Mike Wilde
>
> ---
>
> A sample of one of his jobs looks like this under qstat -ef:
>
> Job Id: 395980.abem5.ncsa.uiuc.edu
> Job_Name = STDIN
> Job_Owner = mkubal at abe1196
> job_state = Q
> queue = normal
> server = abem5.ncsa.uiuc.edu
> Account_Name = onm
> Checkpoint = u
> ctime = Mon Jul 21 17:43:47 2008
> Error_Path = abe1196:/dev/null
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = n
> mtime = Mon Jul 21 17:43:47 2008
> Output_Path = abe1196:/dev/null
> Priority = 0
> qtime = Mon Jul 21 17:43:47 2008
> Rerunable = True
> Resource_List.ncpus = 1
> Resource_List.nodect = 1
> Resource_List.nodes = 1
> Resource_List.walltime = 00:10:00
> Shell_Path_List = /bin/sh
> etime = Mon Jul 21 17:43:47 2008
> submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN
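>
> Note Resource_List.nodes = 1 with no ppn setting. On PBS/Torque a
> request like the one below asks for all 8 processors of one node,
> though it only helps if that one job then runs 8 workers itself
> (myjob.sh is a placeholder):
>
>   qsub -A onm -l nodes=1:ppn=8,walltime=00:10:00 myjob.sh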
>
> And his jobs show up like this under qstat -n (i.e., all on core /0
> of their node):
>
> 395653.abem5.ncsa.ui mkubal normal STDIN 1767 1 1 -- 00:10 R --
>    abe0872/0
>
> While multi-core jobs use node lists like:
>
> +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4
> +abe0579/3+abe0579/2+abe0579/1+abe0579/0