[Swift-devel] Is there a site count limit?

Fri Apr 10 14:44:45 CDT 2009

Mihael, your suggestion of:

<profile namespace="karajan" key="jobThrottle">2.56</profile>
<profile namespace="karajan" key="initialScore">1000</profile>

Is *almost* right on:

int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c | awk '{ sum += $1} END {print sum}'
8131
int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c 

       3
     254 host=bgp000
     254 host=bgp001
     254 host=bgp002
     ...
     254 host=bgp030
     254 host=bgp031
int$
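
For anyone who wants the same tally without the awk pipeline, here is a rough Python sketch. It assumes, as the commands above do, that the host tag is the 19th whitespace-separated field of each JOB_START line. The function names and the sample lines are made up for illustration; they are not real Swift log output.

```python
from collections import Counter


def count_job_starts(lines, field=19):
    """Tally JOB_START lines by their Nth whitespace-separated field.

    Mirrors: grep JOB_START | awk '{print $19}' | sort | uniq -c
    Lines with fewer than `field` fields are counted under "" because
    awk prints an empty string for a missing field.
    """
    counts = Counter()
    for line in lines:
        if "JOB_START" not in line:
            continue
        parts = line.split()
        key = parts[field - 1] if len(parts) >= field else ""
        counts[key] += 1
    return counts


def make_line(host):
    # Hypothetical filler: 18 placeholder fields, then the host tag.
    fields = ["x"] * 18
    fields[3] = "JOB_START"
    return " ".join(fields + ["host=" + host])


sample = [
    make_line("bgp000"),
    make_line("bgp000"),
    make_line("bgp001"),
    "short JOB_START line",   # fewer than 19 fields, lands under ""
    "unrelated line",         # no JOB_START, skipped
]

counts = count_job_starts(sample)
total = sum(counts.values())
for key in sorted(counts):
    print(f"{counts[key]:7d} {key}")
print(total)
```

On a real log, passing an open file object (for example `count_job_starts(open(path))`) should give the same per-host counts as the pipeline, provided the field position really is 19 in that log.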

Can you suggest how to tweak it up to 256? Use jobThrottle=2.58 maybe? I 
will experiment, but if there's a precise way to hit it "just right" 
that would be great. If not, we will adjust as needed and reduce the 
total # of jobs.

Is this a roundoff issue, or does the formula subtract 2 somewhere from 
the throttle * score product?
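
To put numbers on the gap, here is plain arithmetic on the counts above. It makes no assumption about how Swift actually turns jobThrottle and initialScore into a per-site limit; the variable names are just for illustration.

```python
sites = 32        # bgp000 through bgp031, from the listing above
per_site = 254    # JOB_START count reported for each listed site
no_host = 3       # JOB_START lines with an empty host field
target = 256      # cores per site

total = sites * per_site + no_host
wanted = sites * target
shortfall = sites * (target - per_site)

print(total)      # 8131, matches the sum reported by awk
print(wanted)     # 8192
print(shortfall)  # 64
```

The total of 8131 is consistent with every one of the 32 sites showing 254, including the ones elided from the listing, so the shortfall looks like a uniform 2 per site rather than a few sites falling behind.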

- Mike

On 4/10/09 12:39 PM, Michael Wilde wrote:
> 
> 
> On 4/10/09 12:22 PM, Mihael Hategan wrote:
>> On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote:
>>> Increase foreach.max.threads to at least 4096.
> 
> it was set to 100000 (100K)
> 
>> That doesn't seem to be the cause though. Do you have all the
>> sites/executables properly in tc.data?
> 
> duh. of course not :)
> 
> That's the problem, thanks.
> 
>>
>>> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote:
>>>> They are in ci:/home/wilde/oops.1063.2
>>>>
>>>> I spotted the anomaly (if that's what it is) as below.
>>>>
>>>> Also: we discussed on the list way way back how to get the swift 
>>>> scheduler to send no more jobs to each "site" than there are cores 
>>>> in that site (for this bgp/falkon case) so that jobs don't get 
>>>> committed to busy sites while other sites have free cores.
>>>>
>>>> In this run, we are trying to send 32K jobs to 32K cores.
>>>> Each of the 128 "sites" has 256 cores.
>>>>
>>>> The #s below show about 19K of those jobs as having been dispatched 
>>>> to 32*256 = 8192 cores.
>>>>
>>>> int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c
>>>>       24
>>>>      365 host=bgp000
>>>>      790 host=bgp001
>>>>      371 host=bgp002
>>>>      383 host=bgp003
>>>>      365 host=bgp004
>>>>      791 host=bgp005
>>>>      415 host=bgp006
>>>>      775 host=bgp007
>>>>      790 host=bgp008
>>>>      791 host=bgp009
>>>>      369 host=bgp010
>>>>      790 host=bgp011
>>>>      359 host=bgp012
>>>>      791 host=bgp013
>>>>      394 host=bgp014
>>>>      402 host=bgp015
>>>>      358 host=bgp016
>>>>      595 host=bgp017
>>>>      790 host=bgp018
>>>>      790 host=bgp019
>>>>      791 host=bgp020
>>>>      790 host=bgp021
>>>>      370 host=bgp022
>>>>      790 host=bgp023
>>>>      790 host=bgp024
>>>>      674 host=bgp025
>>>>      567 host=bgp026
>>>>      389 host=bgp027
>>>>      778 host=bgp028
>>>>      366 host=bgp029
>>>>      787 host=bgp030
>>>>      695 host=bgp031
>>>> int$ pwd
>>>>
>>>>
>>>> On 4/10/09 11:42 AM, Mihael Hategan wrote:
>>>>> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We're trying to run an oops run on 8 racks of the BGP. It's 
>>>>>> possible this is larger than has been done to date with swift.
>>>>>>
>>>>>> Our sites.xml file has localhost plus 128 Falkon sites, one for 
>>>>>> each pset in the 8-rack partition.
>>>>>>
>>>>>> From what I can tell, Swift sees all 128 sites, but only sends 
>>>>>> jobs to exactly the first 32, bgp000-bgp031.
>>>>>>
>>>>>> While I debug this further, does anyone know of some hardwired 
>>>>>> limit that would cause swift to send to only the first 32 bgp sites?
>>>>> I can't think of anything that would make that the case. The sites
>>>>> file and a log would be useful.
>>>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>