[Swift-user] Re: Using > 1 CPU per compute node under GRAM
Michael Wilde
wilde at mcs.anl.gov
Mon Jul 21 18:45:24 CDT 2008
Thanks, JP.
I'll forward this to the TeraGrid Help Desk and report back to this list.
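In the meantime, a rough sketch of the kind of wrapper JP describes below
(one PBS job that asks for a whole Abe node and fans the single-CPU tasks
out across the 8 cores itself) might look like the script that follows.
This is only a sketch: the application path, input/output names, and task
list are placeholders, and nodes=1:ppn=8 assumes Abe's PBS accepts the
usual Torque syntax for whole-node requests.

#!/bin/sh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -A onm

# Placeholder application; in practice Swift's wrapper would supply the
# real binary and per-task arguments.
APP=/path/to/mikek_app

cd "$PBS_O_WORKDIR"

# Start one single-CPU task per core, then wait for all of them to
# finish before the job exits.
for i in 1 2 3 4 5 6 7 8; do
    "$APP" input.$i > output.$i 2>&1 &
done
wait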
- Mike
On 7/21/08 6:28 PM, JP Navarro wrote:
> It's definitely subject to local resource manager/scheduling policy
> configuration.
> At UC/ANL, for example, there's an explicit policy that says 1 job per
> node. Each job can of course run 1-n processes that share the 2
> processors. There's nothing GRAM can do to get around that policy.
>
> You'll need to ask NCSA whether their policies allow multiple jobs on
> one node. If Abe allows only one job per node, then it's up to your one
> job to spawn off enough processes/threads to use the 8 cores.
>
> JP
>
> On Jul 21, 2008, at 6:10 PM, Michael Wilde wrote:
>
>> I'm asking this on behalf of Mike Kubal while I wait for more info on
>> his settings:
>>
>> Mike is running under Swift on teragrid/Abe which has 8-core nodes.
>> His jobs are all running 1-job-per-node, wasting 7 cores.
>>
>> I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM.
>>
>> In the meantime, does anyone know if there's a way to specify
>> compute-node-sharing between separate single-cpu jobs via both GRAMs?
>>
>> And is this dependent on the local job manager code or settings
>> (i.e., might it work on some sites but not others)?
>>
>> On the Globus doc page:
>> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes
>>
>>
>> I see:
>> <!-- *OR* an explicit number of processes per node... -->
>> <processesPerHost>...</processesPerHost>
>> </resourceAllocationGroup>
>> </extensions>
>> but can't tell if this applies to single-core jobs or only to
>> multi-core jobs.
>>
>> This will ideally be handled as desired by Falkon or Coaster, but in
>> the meantime I was hoping there was a simple setting to give MikeK
>> better CPU yield on Abe.
>>
>> - Mike Wilde
>>
>> ---
>>
>> A sample of one of his jobs looks like this under qstat -ef:
>>
>> Job Id: 395980.abem5.ncsa.uiuc.edu
>> Job_Name = STDIN
>> Job_Owner = mkubal at abe1196
>> job_state = Q
>> queue = normal
>> server = abem5.ncsa.uiuc.edu
>> Account_Name = onm
>> Checkpoint = u
>> ctime = Mon Jul 21 17:43:47 2008
>> Error_Path = abe1196:/dev/null
>> Hold_Types = n
>> Join_Path = n
>> Keep_Files = n
>> Mail_Points = n
>> mtime = Mon Jul 21 17:43:47 2008
>> Output_Path = abe1196:/dev/null
>> Priority = 0
>> qtime = Mon Jul 21 17:43:47 2008
>> Rerunable = True
>> Resource_List.ncpus = 1
>> Resource_List.nodect = 1
>> Resource_List.nodes = 1
>> Resource_List.walltime = 00:10:00
>> Shell_Path_List = /bin/sh
>> etime = Mon Jul 21 17:43:47 2008
>> submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN
>>
>> And his jobs show up like this under qstat -n (i.e., all on core /0):
>>
>> 395653.abem5.ncsa.ui mkubal   normal   STDIN    1767   1   1  --  00:10 R  --
>>    abe0872/0
>>
>> While multi-core jobs use
>>
>> +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4
>> +abe0579/3+abe0579/2+abe0579/1+abe0579/0
>
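Regarding the processesPerHost construct quoted above: if MikeK turns out
to be on WS-GRAM, the obvious experiment is to submit a tiny job
description carrying that extension and see what PBS request it turns
into on Abe. A sketch of that test is below; the factory contact and
executable are placeholders, the element names outside the quoted
fragment (job, executable, count) are the usual GT4 ones as I recall, and
whether processesPerHost applies to count=8 single-core work at all is
exactly the open question.

#!/bin/sh
# Sketch only: write a minimal GT4 job description carrying the
# resourceAllocationGroup extension from the WS-GRAM docs.
cat > packed-job.xml <<'EOF'
<job>
    <executable>/bin/hostname</executable>
    <count>8</count>
    <extensions>
        <resourceAllocationGroup>
            <!-- ask for all 8 processes to land on one host -->
            <processesPerHost>8</processesPerHost>
        </resourceAllocationGroup>
    </extensions>
</job>
EOF

# Submit to Abe's PBS factory (the contact string is a placeholder).
globusrun-ws -submit -Ft PBS \
    -F https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
    -f packed-job.xml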