[Swift-devel] Re: Falkon worker config params?

Ioan Raicu iraicu at cs.uchicago.edu
Wed Sep 5 17:38:18 CDT 2007


See below:

Michael Wilde wrote:
> Thanks, Ioan.  A few follow-ups, to clarify:
>
> Ioan Raicu wrote:
>> Hi,
>>
>> Michael Wilde wrote:
>>> Ioan, can you send/resend/point-me-to definitions of the critical 
>>> parameters to control the startup of falkon workers, and review the 
>>> attached file for anything I'm doing stupid here?
>>>
>>> Can you review/improve/fill in my comments with more (correct) details?
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>>
>>>
>>> #Provisioner config file
>>> #KEY=VALUE
>>> #if multiple lines have the same key, the previous value will be
>>> overwritten with the new value
>>> #all paths are relative
>>>
>>> #resources numbers
>>> MinNumExecutors=0      # min # of exec threads to keep extant
>>> MaxNumExecutors=250    # max # of exec threads to allow extant
>>> ExecutorsPerHost=2     # # of exec threads to run on each host
>>>
>>> #resources times
>>> MinResourceAllocationTime_min=60   # ??? re-assess allocations
>>> MaxResourceAllocationTime_min=60   # every this-many seconds? ???
>>>                                    # if so, why upper and lower
>>>                                    # settings?
>> This is the time that GRAM4 sets the max wall clock time to.  There 
>> is a lower and upper bound, but in reality I am not doing anything 
>> with both bounds, only one. In the future, the provisioner could make 
>> smarter allocation requests in terms of time based on the workload 
>> characteristics (not just number of jobs), and hence the existence of 
>> a lower and upper bound.
>
> Since you didn't say which one is in use, I'll assume it's best to set 
> both to the same value, as is done above. I'm assuming the _min means 
> these times are in minutes.
Yes, the times are in minutes.  I don't remember off the top of my head 
which one I use, but I think I first do a Math.min(xxx), and then do a 
Math.max(xxx).  Setting both to the same value is what I normally do. 
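In other words, the effective wall time ends up clamped between the two
bounds; something like this rough sketch (not the actual provisioner
code, the names are just illustrative):

    // Sketch only: clamp the requested wall clock time (in minutes) to the
    // [MinResourceAllocationTime_min, MaxResourceAllocationTime_min] range.
    static int clampWallTimeMin(int requestedMin, int lowerBoundMin, int upperBoundMin) {
        return Math.min(Math.max(requestedMin, lowerBoundMin), upperBoundMin);
    }

When both bounds are set to 60, the result is always 60, which is why
setting them equal behaves the way you'd expect.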
>
>>>
>>> #resources types
>>> HostType=any
>>> #HostType=ia32_compute
>>> #HostType=ia64_compute
>>>
>>> #allocation strategies            # please explain these
>
> This is a little muddy.  Is the way it works that these allocation 
> strategies are used when the service looks in the queue (every few 
> seconds?), 
FalkonStatePollTime_sec=15
> sees that there is work to do, and allocates that many workers via new 
> jobs, using the AllocationStrategy?
yes
>
> I.e., it wakes up, sees that it has 50 jobs to run, sees e.g. a max-worker 
> limit of 40, says "I need 40 workers at this point in time", and then starts 
> that many using the designated strategy?
yes, but this is assuming that there were 0 allocated workers.  If you 
already had 10 allocated workers, then it would only get 30 more.
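Roughly, each poll cycle the provisioner does something like this (a
sketch only; the real code has more bookkeeping, and these names are
made up):

    // Sketch: how many new workers to request on one poll cycle.
    static int workersToRequest(int queuedTasks, int allocatedWorkers,
                                int minExecutors, int maxExecutors) {
        int wanted = Math.min(queuedTasks, maxExecutors); // never exceed MaxNumExecutors
        wanted = Math.max(wanted, minExecutors);          // never drop below MinNumExecutors
        return Math.max(0, wanted - allocatedWorkers);    // only ask for the deficit
    }

With 50 queued tasks, 10 already-allocated workers, and a max of 40,
that comes out to 30 new workers, as in your example.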
>
>> If you want 20 workers with 2 per node, then it would go like this:
>>> #AllocationStrategy=one_at_a_time
>> GRAM job #1 of 1 node with 2 workers
>> GRAM job #2 of 1 node with 2 workers
>> GRAM job #3 of 1 node with 2 workers
>> GRAM job #4 of 1 node with 2 workers
>> GRAM job #5 of 1 node with 2 workers
>> GRAM job #6 of 1 node with 2 workers
>> GRAM job #7 of 1 node with 2 workers
>> GRAM job #8 of 1 node with 2 workers
>> GRAM job #9 of 1 node with 2 workers
>> GRAM job #10 of 1 node with 2 workers
>
> I.e., it always runs a GRAM job to get 1 worker node.
yes
>
>>
>>> #AllocationStrategy=additive
>> GRAM job #1 of 1 node with 2 workers
>> GRAM job #2 of 2 node with 2 workers
>> GRAM job #3 of 3 node with 2 workers
>> GRAM job #4 of 4 node with 2 workers
>
>
> I.e., it grows the # of nodes per job by 1 with each job it submits?
> E.g., to get 12 it would do 1,2,3,4,2 ???
yes!
>
>>
>>> #AllocationStrategy=exponential
>> GRAM job #1 of 1 node with 2 workers
>> GRAM job #2 of 2 node with 2 workers
>> GRAM job #3 of 4 node with 2 workers
>> GRAM job #4 of 3 node with 2 workers
>
> Ditto but exponential?
yes!
>
>>
>> #AllocationStrategy=all_at_a_time
>> GRAM job #1 of 10 node with 2 workers
>
> I.e., every time it needs nodes, it asks for all the nodes it needs with 
> one job?
yes!
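Put another way, the strategies only differ in how they chop the number
of nodes needed into GRAM jobs. A rough sketch of the idea (illustrative
only, not the actual code):

    // Sketch: print the per-GRAM-job allocation sizes for a strategy.
    static void printAllocationSizes(String strategy, int nodesNeeded) {
        int next = 1, jobNum = 1;
        while (nodesNeeded > 0) {
            int size;
            if (strategy.equals("one_at_a_time")) {
                size = 1;
            } else if (strategy.equals("additive")) {
                size = next++;                  // 1, 2, 3, 4, ...
            } else if (strategy.equals("exponential")) {
                size = next;
                next *= 2;                      // 1, 2, 4, 8, ...
            } else {                            // all_at_a_time
                size = nodesNeeded;
            }
            size = Math.min(size, nodesNeeded); // last job is capped at what is left
            System.out.println("GRAM job #" + jobNum++ + " of " + size + " node(s)");
            nodesNeeded -= size;
        }
    }

For 10 nodes (20 workers at 2 per node) that gives 1,1,1,...; 1,2,3,4;
1,2,4,3; and 10, matching the listings above.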
>
> So: you would use these to tune the worker requests to what you know 
> about a given site's scheduling policy?  E.g., if the site favors jobs 
> that ask for lots of nodes at once, use "all_at_a_time", but if that 
> would exceed a limit, use one of the other strategies?
yes.  Also remember that if you ask for say 100 nodes, and only 99 are 
available, your entire 100 node allocation will wait in the queue until 
that 100th node is also available.  Using smaller allocation sizes will 
likely give you some workers faster (primarily due to backfilling)!  If 
the allocations are too small (say 1 worker per GRAM job), then you have 
to be careful how many GRAM jobs you generate as you can quickly 
overwhelm the GRAM service and PBS with anything more than a few dozen 
jobs at a time.  Sending 100s of GRAM jobs at a time could bring the 
service or PBS down.
>
>>
>>> AllocationStrategy=additive
>>> MinNumHostsPerAllocation=10       # get at least this many nodes per
>>>                                   # alloc job?
>>>                                   # (doesn't match what I see)
>> This is not implemented yet.  The current MinNumHostsPerAllocation is 
>> set to 1.  This feature shouldn't be hard to implement; I just 
>> haven't had time to do it.
>>> MaxNumHostsPerAllocation=100
>> This is also not implemented yet.
>
> I don't understand the explanation on this.
That's because I didn't explain it; I just said it is not implemented.  
Essentially, once this is implemented, you could set 
MinNumHostsPerAllocation=10, and if you needed 100 nodes and wanted to 
use #AllocationStrategy=one_at_a_time, then you would get:
GRAM job #1 of 10 node with 2 workers
GRAM job #2 of 10 node with 2 workers
GRAM job #3 of 10 node with 2 workers
GRAM job #4 of 10 node with 2 workers
GRAM job #5 of 10 node with 2 workers
GRAM job #6 of 10 node with 2 workers
GRAM job #7 of 10 node with 2 workers
GRAM job #8 of 10 node with 2 workers
GRAM job #9 of 10 node with 2 workers
GRAM job #10 of 10 node with 2 workers

In essence, this value is currently set to 1 for the min and infinity 
for the max.  Once the max is implemented, if we set 
MaxNumHostsPerAllocation=50 and use 
#AllocationStrategy=all_at_a_time (again for 100 nodes), we would get:
GRAM job #1 of 50 node with 2 workers
GRAM job #2 of 50 node with 2 workers
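Once both are implemented, the per-job size would presumably just get
clamped to that range before each GRAM submission; something like this
(again only a sketch of the intent, since the feature does not exist yet):

    // Sketch: bound one GRAM job's node count once Min/MaxNumHostsPerAllocation exist.
    static int boundedJobSize(int proposedSize, int nodesLeft, int minHosts, int maxHosts) {
        int size = Math.max(proposedSize, minHosts); // at least MinNumHostsPerAllocation
        size = Math.min(size, maxHosts);             // at most MaxNumHostsPerAllocation
        return Math.min(size, nodesLeft);            // never more than what is still needed
    }

That is how 100 nodes with one_at_a_time and a min of 10 would become
ten 10-node jobs, and all_at_a_time with a max of 50 would become two
50-node jobs, as listed above.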
>
>>>
>>> #de-allocation strategies, 0 means never de-allocate due to idle time
>>> DeAllocationIdleTime_sec=300000
>>> # ^^^^ in msec 300,000 = 300 secs = 5 min  # Seems to work well.
>>>                                    # But I see a few stragglers that
>>>                                    # linger much longer (did last week)
>> Did you see them in the Falkon logs?  
>
> No.
>
>> Probably not.  Did you see them in
>> showq/qstat (PBS monitoring tools)?  Probably yes. 
>
> Yes.
>
OK, that is what I thought.  So it's exactly what I described below.
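To make the timeline below concrete: each worker independently runs
roughly the loop sketched here (illustrative only; the helpers are
stand-ins, not the real worker API), which is why workers from the same
GRAM job can exit at different times while the PBS/GRAM job itself only
finishes when the last one exits:

    // Sketch of the per-worker idle-timeout behavior.
    class WorkerSketch {
        long idleTimeoutMs;                        // DeAllocationIdleTime; 0 = never exit on idle
        boolean pollForTask() { return false; }    // stand-in: ask the Falkon service for work
        void deregister()     { }                  // stand-in: de-register this worker from Falkon

        void run() throws InterruptedException {
            long lastWork = System.currentTimeMillis();
            while (true) {
                if (pollForTask()) {
                    lastWork = System.currentTimeMillis();   // reset the idle timer on work
                } else if (idleTimeoutMs > 0
                           && System.currentTimeMillis() - lastWork >= idleTimeoutMs) {
                    deregister();   // leaves Falkon, but the PBS/GRAM job only ends once
                    return;         // every worker allocated in that job has exited
                }
                Thread.sleep(1000); // poll interval, illustrative
            }
        }
    }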

Ioan
>
> - Mike
>
>
>> If the first answer
>> is no, and the second is yes, then it has to do with the fact that 
>> there is no coordination among the workers when they de-allocate.  
>> This is OK as long as each worker is allocated in a separate job 
>> (i.e. #AllocationStrategy=one_at_a_time).  However, all the other 
>> strategies do allocate multiple workers per GRAM job, and hence the 
>> problem that you are seeing arises.  Let me give you an example.
>> The timeline is as follows:
>> Time 0: Task 1 submitted, 20 min long
>> Time 0: Task 2 submitted, 20 min long
>> Time 0: Task 3 submitted, 20 min long
>> Time 0: GRAM job allocates 2 workers for 60 min, with a 5 min idle time
>> Time 0: Worker 1 receives task 1
>> Time 0: Worker 2 receives task 2
>> Time 20: Worker 1 completes task 1
>> Time 20: Worker 2 completes task 2
>> Time 20: Worker 1 receives task 3
>> Time 25: Worker 2 de-allocates itself due to 5 min idle time reached
>> Time 40: Worker 1 completes task 3
>> Time 45: Worker 1 de-allocates itself due to 5 min idle time reached
>> Time 45: GRAM completes and resources are returned to the LRM pool
>>
>> Note that GRAM only completed when all workers exited.  Although 
>> Worker 2 de-allocated itself from Falkon at time 25, it only got 
>> released back into the LRM resource pool at time 45, when Worker 1 
>> also exited.  The only solution to this problem is to either 1) 
>> have a centralized control over the workers, which would know what 
>> workers were allocated together, and hence must be de-allocated 
>> together, or 2) have some coordination among the workers so they only 
>> de-allocate when they are all ready to de-allocate. One artifact of 
>> this is that for large runs that vary in the number of resources 
>> needed, the resources can become quite fragmented, and hence Falkon's 
>> registered workers can be fewer than the resources actually reserved 
>> from GRAM/PBS.
>> A short-term solution is to either use 
>> #AllocationStrategy=one_at_a_time, or set the idle time to 0, which 
>> would mean that the workers only de-register when the lease is up; 
>> since the lease is the same for all the workers, this problem would 
>> not appear.
>>
>> Ioan
>>>
>>> #Falkon information
>>> FalkonServiceURI=http://tg-viz-login1.uc.teragrid.org:50011/wsrf/services/GenericPortal/core/WS/GPFactoryService
>>> #FalkonServiceURI=http://viper.uchicago.edu:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService
>>> EPR_FileName=WorkerEPR.txt
>>> FalkonStatePollTime_sec=15
>>>
>>> #GRAM4 details
>>> GRAM4_Location=tg-grid1.uc.teragrid.org
>>> GRAM4_FactoryType=PBS
>>> #GRAM4_FactoryType=FORK
>>> #GRAM4_FactoryType=LSF
>>> #GRAM4_FactoryType=CONDOR
>>>
>>> #project accounting information
>>> Project=TG-STA040017N
>>> #Project=default
>>>
>>> #Executor script
>>> ExecutorScript=run.worker.sh
>>>
>>> #Security Descriptor File
>>> SecurityFile=etc/client-security-config.xml
>>>
>>> #logging
>>> DRP_Log=logs/drp-status.txt
>>>
>>> #enable debug statements
>>> #DEBUG=true
>>> DEBUG=false
>>> DIPERF=false
>>> #DIPERF=true
>>>
>>>
>>>
>>>
>>>
>>> -------- Original Message --------
>>> Subject: PBS JOB 1512406.tg-master.uc.teragrid.org
>>> Date: Wed,  5 Sep 2007 14:46:17 -0500 (CDT)
>>> From: adm at tg-master.uc.teragrid.org (root)
>>> To: wilde at tg-grid1.uc.teragrid.org
>>>
>>> PBS Job Id: 1512406.tg-master.uc.teragrid.org
>>> Job Name:   STDIN
>>> An error has occurred processing your job, see below.
>>> Post job file processing error; job 
>>> 1512406.tg-master.uc.teragrid.org on
>>> host tg-v082/0+tg-v076/0+tg-v053/0+tg-v040/0+tg-v034/0Unknown resource
>>> type  REJHOST=tg-v082.uc.teragrid.org MSG=invalid home directory
>>> '/home/wilde' specified, errno=2 (No such file or directory)
>>>
>>>
>>>
>>>
>>
>

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================



