[Swift-user] Using > 1 CPU per compute node under GRAM

Mon Jul 21 18:10:45 CDT 2008

I'm asking this on behalf of Mike Kubal while I wait for more info on his 
settings:

Mike is running under Swift on TeraGrid/Abe, which has 8-core nodes. His 
jobs are all running one job per node, wasting 7 cores on each node.

I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM.

In the meantime, does anyone know if there's a way to specify 
compute-node-sharing between separate single-cpu jobs via both GRAMs?

And does this depend on the local job manager code or settings (i.e., 
might it work on some sites but not others)?

On the Globus doc page:
http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes

I see:
         <!-- *OR* an explicit number of processes per node... -->
         <processesPerHost>...</processesPerHost>
         </resourceAllocationGroup>
         </extensions>
but I can't tell whether this applies to single-core jobs or only to 
multi-core jobs.

This will ideally be handled as desired by Falkon or Coaster, but in the 
meantime I was hoping there was a simple setting to give MikeK better 
CPU yield on Abe.

- Mike Wilde

---

A sample of one of his jobs looks like this under qstat -ef:

Job Id: 395980.abem5.ncsa.uiuc.edu
     Job_Name = STDIN
     Job_Owner = mkubal@abe1196
     job_state = Q
     queue = normal
     server = abem5.ncsa.uiuc.edu
     Account_Name = onm
     Checkpoint = u
     ctime = Mon Jul 21 17:43:47 2008
     Error_Path = abe1196:/dev/null
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = n
     mtime = Mon Jul 21 17:43:47 2008
     Output_Path = abe1196:/dev/null
     Priority = 0
     qtime = Mon Jul 21 17:43:47 2008
     Rerunable = True
     Resource_List.ncpus = 1
     Resource_List.nodect = 1
     Resource_List.nodes = 1
     Resource_List.walltime = 00:10:00
     Shell_Path_List = /bin/sh
     etime = Mon Jul 21 17:43:47 2008
     submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN

And his jobs show up like this under qstat -n (i.e., each one is on core /0 of its own node):

395653.abem5.ncsa.ui mkubal   normal   STDIN        1767     1   1    --  00:10 R   --
    abe0872/0

Whereas multi-core jobs occupy several cores per node:

+abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4
    +abe0579/3+abe0579/2+abe0579/1+abe0579/0
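
To put a number on the waste, the node/core lists that qstat -n prints can be tallied per node. The following is only a rough sketch: the function names are hypothetical, the 8-core figure is taken from the description of Abe above, and the multi-core list is the partial excerpt shown in this message (it starts mid-list), so its counts cover only the entries quoted here. "Unclaimed" means not claimed by this particular job; it does not show whether the scheduler would hand those cores to another job.

```python
from collections import Counter

CORES_PER_NODE = 8  # assumption: Abe's 8-core nodes, as stated in the message


def cores_by_node(exec_host):
    """Count cores claimed per node in a qstat -n list like 'abe0579/1+abe0579/0'."""
    counts = Counter()
    # Drop whitespace introduced by line wrapping, then split on '+'.
    for entry in "".join(exec_host.split()).split("+"):
        if not entry:
            continue  # a leading '+' yields an empty first field
        node, _, _core = entry.partition("/")
        counts[node] += 1
    return counts


def unclaimed_cores(exec_host, cores_per_node=CORES_PER_NODE):
    """Cores on each listed node that this job did not claim."""
    return {node: cores_per_node - used
            for node, used in cores_by_node(exec_host).items()}


# Host lists copied from the qstat -n output above.
single = "abe0872/0"
multi = ("+abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4\n"
         "    +abe0579/3+abe0579/2+abe0579/1+abe0579/0")

print(unclaimed_cores(single))  # {'abe0872': 7}
print(unclaimed_cores(multi))   # {'abe0582': 5, 'abe0579': 0}
```

For the single-CPU job this gives 7 unclaimed cores on abe0872, which matches the "wasting 7 cores" observation.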