[Swift-devel] Is there a site count limit?
Michael Wilde
wilde at mcs.anl.gov
Fri Apr 10 14:44:45 CDT 2009
Mihael, your suggestion of:
<profile namespace="karajan" key="jobThrottle">2.56</profile>
<profile namespace="karajan" key="initialScore">1000</profile>
Is *almost* right on:
int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c | awk '{ sum += $1} END {print sum}'
8131
int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c
3
254 host=bgp000
254 host=bgp001
254 host=bgp002
...
254 host=bgp030
254 host=bgp031
int$
Can you suggest how to tweak it up to 256? Use jobThrottle=2.58 maybe? I
will experiment, but if there's a precise way to hit it "just right"
that would be great. If not, we will adjust as needed and reduce the
total # of jobs.
Is this a roundoff issue, or does the formula subtract 2 somewhere from
the throttle * score product?
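
For what it's worth, the two pipelines above could probably be folded into a single awk pass, sketched below. It assumes, as those commands do, that field 19 of each JOB_START line carries the host=... token; the function name count_job_starts is made up here for illustration. The numbers above are at least self-consistent: 32 hosts at 254 each is 8128, and adding the 3 lines that presumably have no host field gives the 8131 total.

```shell
# Count JOB_START lines per host and print a grand total, in one awk pass.
# Assumption: field 19 holds the "host=..." token, as in the pipelines above;
# adjust the field number for other log layouts.
count_job_starts() {
    grep -h JOB_START "$@" |
        awk '{ count[$19]++; total++ }
             END {
                 # One "count host" line per distinct value of field 19.
                 for (h in count) print count[h], h
                 # "+ 0" forces a numeric 0 when no lines matched.
                 print total + 0, "total"
             }' |
        LC_ALL=C sort -k2   # "total" sorts after every "host=..." line
}

# Usage (hypothetical): count_job_starts *45.log
```

Lines whose field 19 is empty are counted under an empty host name and sort to the top, the same way the bare "3" shows up in the uniq -c listing.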
- Mike
On 4/10/09 12:39 PM, Michael Wilde wrote:
>
>
> On 4/10/09 12:22 PM, Mihael Hategan wrote:
>> On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote:
>>> Increase foreach.max.threads to at least 4096.
>
> it was set to 100000 (100K)
>
>> That doesn't seem to be the cause though. Do you have all the
>> sites/executables properly in tc.data?
>
> duh. of course not :)
>
> that's the problem, thanks.
>
>>
>>> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote:
>>>> They are in ci:/home/wilde/oops.1063.2
>>>>
>>>> I spotted the anomaly (if that's what it is) shown below.
>>>>
>>>> Also: we discussed on the list way way back how to get the Swift
>>>> scheduler to send no more jobs to each "site" than there are cores
>>>> in that site (for this bgp/falkon case) so that jobs don't get
>>>> committed to busy sites while other sites have free cores.
>>>>
>>>> In this run, we are trying to send 32K jobs to 32K cores.
>>>> Each of the 128 "sites" has 256 cores.
>>>>
>>>> The #s below show about 19K of those jobs as having been dispatched
>>>> to 32*256 = 8192 cores.
>>>>
>>>> int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c
>>>> 24
>>>> 365 host=bgp000
>>>> 790 host=bgp001
>>>> 371 host=bgp002
>>>> 383 host=bgp003
>>>> 365 host=bgp004
>>>> 791 host=bgp005
>>>> 415 host=bgp006
>>>> 775 host=bgp007
>>>> 790 host=bgp008
>>>> 791 host=bgp009
>>>> 369 host=bgp010
>>>> 790 host=bgp011
>>>> 359 host=bgp012
>>>> 791 host=bgp013
>>>> 394 host=bgp014
>>>> 402 host=bgp015
>>>> 358 host=bgp016
>>>> 595 host=bgp017
>>>> 790 host=bgp018
>>>> 790 host=bgp019
>>>> 791 host=bgp020
>>>> 790 host=bgp021
>>>> 370 host=bgp022
>>>> 790 host=bgp023
>>>> 790 host=bgp024
>>>> 674 host=bgp025
>>>> 567 host=bgp026
>>>> 389 host=bgp027
>>>> 778 host=bgp028
>>>> 366 host=bgp029
>>>> 787 host=bgp030
>>>> 695 host=bgp031
>>>> int$ pwd
>>>>
>>>>
>>>> On 4/10/09 11:42 AM, Mihael Hategan wrote:
>>>>> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We're trying to run an oops run on 8 racks of the BGP. It's
>>>>>> possible this is larger than has been done to date with Swift.
>>>>>>
>>>>>> Our sites.xml file has localhost plus 128 Falkon sites, one for
>>>>>> each pset in the 8-rack partition.
>>>>>>
>>>>>> From what I can tell, Swift sees all 128 sites, but only sends
>>>>>> jobs to exactly the first 32, bgp000-bgp031.
>>>>>>
>>>>>> While I debug this further, does anyone know of some hardwired
>>>>>> limit that would cause swift to send to only the first 32 bgp sites?
>>>>> I can't think of anything that would make that the case. The sites
>>>>> file and a log would be useful.
>>>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>