[Swift-devel] Is there a site count limit?

Michael Wilde wilde at mcs.anl.gov
Fri Apr 10 12:39:42 CDT 2009



On 4/10/09 12:22 PM, Mihael Hategan wrote:
> On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote:
>> Increase foreach.max.threads to at least 4096.

it was set to 100000 (100K)

> That doesn't seem to be the cause though. Do you have all the
> sites/executables properly in tc.data?

duh. of course not :)

That's the problem, thanks.
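
For anyone who hits the same symptom (jobs going to only a subset of the sites), here is a rough sketch of the cross-check that would have caught it. It is not part of Swift. It assumes tc.data lines are whitespace-separated with the site name in the first column and the executable name in the second. The names bgp032 through bgp127, the executable name "oops", and the path are placeholders for illustration, not taken from the actual run.

```python
def sites_missing_from_tc(site_names, tc_text, executable):
    """Return the sites that have no tc.data line for `executable`.

    Assumes each non-comment tc.data line is whitespace-separated,
    with the site in column 1 and the executable name in column 2.
    """
    covered = set()
    for line in tc_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2 and fields[1] == executable:
            covered.add(fields[0])
    return [s for s in site_names if s not in covered]


# Hypothetical data: 128 sites, but tc.data only lists the first 32.
sites = [f"bgp{i:03d}" for i in range(128)]
tc_text = "\n".join(
    f"bgp{i:03d}  oops  /path/to/oops  INSTALLED  INTEL32::LINUX  null"
    for i in range(32)
)

missing = sites_missing_from_tc(sites, tc_text, "oops")
print(len(missing), missing[0], missing[-1])
```

In practice the site list would come from the handles in sites.xml and tc_text from the tc.data file used for the run. Any site reported as missing cannot be selected for that executable.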

> 
>> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote:
>>> They are in ci:/home/wilde/oops.1063.2
>>>
>>> I spotted the anomaly (if that's what it is) as below.
>>>
>>> Also: we discussed on the list way way back how to get the swift 
>>> scheduler to send no more jobs to each "site" than there are cores in 
>>> that site (for this bgp/falkon case) so that jobs don't get committed to 
>>> busy sites while other sites have free cores.
>>>
>>> In this run, we are trying to send 32K jobs to 32K cores.
>>> Each of the 128 "sites" has 256 cores.
>>>
>>> The #s below show about 19K of those jobs as having been dispatched to 
>>> 32*256 = 8192 cores.
>>>
>>> int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c 
>>>
>>>       24
>>>      365 host=bgp000
>>>      790 host=bgp001
>>>      371 host=bgp002
>>>      383 host=bgp003
>>>      365 host=bgp004
>>>      791 host=bgp005
>>>      415 host=bgp006
>>>      775 host=bgp007
>>>      790 host=bgp008
>>>      791 host=bgp009
>>>      369 host=bgp010
>>>      790 host=bgp011
>>>      359 host=bgp012
>>>      791 host=bgp013
>>>      394 host=bgp014
>>>      402 host=bgp015
>>>      358 host=bgp016
>>>      595 host=bgp017
>>>      790 host=bgp018
>>>      790 host=bgp019
>>>      791 host=bgp020
>>>      790 host=bgp021
>>>      370 host=bgp022
>>>      790 host=bgp023
>>>      790 host=bgp024
>>>      674 host=bgp025
>>>      567 host=bgp026
>>>      389 host=bgp027
>>>      778 host=bgp028
>>>      366 host=bgp029
>>>      787 host=bgp030
>>>      695 host=bgp031
>>> int$ pwd
>>>
>>>
>>> On 4/10/09 11:42 AM, Mihael Hategan wrote:
>>>> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote:
>>>>> Hi,
>>>>>
>>>>> We're trying to run an oops run on 8 racks of the BGP. It's possible this 
>>>>> is larger than has been done to date with swift.
>>>>>
>>>>> Our sites.xml file has localhost plus 128 Falkon sites, one for each 
>>>>> pset in the 8-rack partition.
>>>>>
>>>>> From what I can tell, Swift sees all 128 sites, but only sends jobs to 
>>>>> exactly the first 32, bgp000-bgp031.
>>>>>
>>>>> While I debug this further, does anyone know of some hardwired limit 
>>>>> that would cause swift to send to only the first 32 bgp sites?
>>>> I can't think of anything that would make that the case. The sites file
>>>> and a log would be useful.
>>>>
> 


