[Swift-user] Re: Using > 1 CPU per compute node under GRAM

Michael Wilde wilde at mcs.anl.gov
Mon Jul 21 18:45:24 CDT 2008


Thanks, JP.

I'll forward this to the TeraGrid Help Desk and report back to this list.

- Mike



On 7/21/08 6:28 PM, JP Navarro wrote:
> It's definitely subject to local resource manager/scheduling policy 
> configuration.
> At UC/ANL, for example, there's an explicit policy that says 1 job per 
> node. Each
> job can of course run 1-n processes that share the 2 processors. There's
> nothing GRAM can do to get around that policy.
> 
> You'll need to ask NCSA whether their policies allow multiple jobs on 
> one node.
> If Abe allows only one job per node, then it's up to your one job to 
> spawn off
> enough processes/threads to use the 8 cores.
> 
> JP
> 
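A minimal sketch of the spawn-it-yourself approach JP describes, assuming a
plain /bin/sh wrapper submitted as the single PBS job (the script name
run8.sh, the worker command, and its arguments are placeholders, not MikeK's
actual app):

    #!/bin/sh
    # run8.sh: start one worker per core, then wait for all of them,
    # so the single PBS job keeps all 8 cores of the node busy.
    NCORES=8
    i=0
    while [ "$i" -lt "$NCORES" ]; do
        ./worker.sh "$i" > worker.$i.log 2>&1 &
        i=`expr $i + 1`
    done
    wait

Coasters or Falkon would do roughly this multiplexing automatically, but a
wrapper along these lines is the quick way to stop wasting the other 7 cores
while node sharing is unavailable.
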
> On Jul 21, 2008, at 6:10 PM, Michael Wilde wrote:
> 
>> I'm asking this on behalf of Mike Kubal while I wait for more info on 
>> his settings:
>>
>> Mike is running under Swift on teragrid/Abe which has 8-core nodes. 
>> His jobs are all running 1-job-per-node, wasting 7 cores.
>>
>> I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM.
>>
>> In the meantime, does anyone know if there's a way to specify 
>> compute-node-sharing between separate single-cpu jobs via both GRAMs?
>>
>> And is this dependent on the local job manager code or settings
>> (i.e., might it work on some sites but not others)?
>>
>> On globus doc page:
>> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes 
>>
>>
>> I see:
>>        <!-- *OR* an explicit number of processes per node... -->
>>        <processesPerHost>...</processesPerHost>
>>        </resourceAllocationGroup>
>>        </extensions>
>> but can't tell whether this applies to single-core jobs or only to
>> multi-core jobs.
>>
>> This will ideally be handled as desired by Falkon or Coaster, but in 
>> the meantime I was hoping there was a simple setting to give MikeK 
>> better CPU yield on Abe.
>>
>> - Mike Wilde
>>
>> ---
>>
>> A sample of one of his jobs looks like this under qstat -ef:
>>
>> Job Id: 395980.abem5.ncsa.uiuc.edu
>>    Job_Name = STDIN
>>    Job_Owner = mkubal at abe1196
>>    job_state = Q
>>    queue = normal
>>    server = abem5.ncsa.uiuc.edu
>>    Account_Name = onm
>>    Checkpoint = u
>>    ctime = Mon Jul 21 17:43:47 2008
>>    Error_Path = abe1196:/dev/null
>>    Hold_Types = n
>>    Join_Path = n
>>    Keep_Files = n
>>    Mail_Points = n
>>    mtime = Mon Jul 21 17:43:47 2008
>>    Output_Path = abe1196:/dev/null
>>    Priority = 0
>>    qtime = Mon Jul 21 17:43:47 2008
>>    Rerunable = True
>>    Resource_List.ncpus = 1
>>    Resource_List.nodect = 1
>>    Resource_List.nodes = 1
>>    Resource_List.walltime = 00:10:00
>>    Shell_Path_List = /bin/sh
>>    etime = Mon Jul 21 17:43:47 2008
>>    submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN
>>
>> And his jobs show up like this under qstat -n (i.e., all on core /0):
>>
>> 395653.abem5.ncsa.ui mkubal   normal   STDIN        1767     1   1    --  00:10 R   --
>>   abe0872/0
>>
>> While multi-core jobs show node assignments like:
>>
>> +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4
>>   +abe0579/3+abe0579/2+abe0579/1+abe0579/0
> 
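
And if the scheduler really does hand out whole nodes, the single job might
as well request all eight slots up front. I believe the usual Torque form is
something like the following (untested, and not verified against Abe's local
configuration; the account and walltime are copied from the sample job above,
and run8.sh is the placeholder wrapper sketched earlier):

    # placeholder submission: one full 8-core node for the wrapper script
    qsub -A onm -l nodes=1:ppn=8,walltime=00:10:00 run8.sh

Under qstat -n such a job should show up holding abeNNNN/7 through abeNNNN/0,
like the abe0579 portion of the multi-core listing above, rather than a lone /0.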


