[Swift-devel] Is there a site count limit?
Michael Wilde
wilde at mcs.anl.gov
Fri Apr 10 12:00:26 CDT 2009
They are in ci:/home/wilde/oops.1063.2
I spotted the anomaly (if that's what it is) as below.
Also: we discussed on the list a long time ago how to get the swift
scheduler to send no more jobs to each "site" than there are cores in
that site (for this bgp/falkon case) so that jobs don't get committed to
busy sites while other sites have free cores.
In this run, we are trying to send 32K jobs to 32K cores.
Each of the 128 "sites" has 256 cores.
The numbers below show about 19K of those jobs as having been dispatched to
only 32 sites, i.e. 32*256 = 8192 cores.
int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c
24
365 host=bgp000
790 host=bgp001
371 host=bgp002
383 host=bgp003
365 host=bgp004
791 host=bgp005
415 host=bgp006
775 host=bgp007
790 host=bgp008
791 host=bgp009
369 host=bgp010
790 host=bgp011
359 host=bgp012
791 host=bgp013
394 host=bgp014
402 host=bgp015
358 host=bgp016
595 host=bgp017
790 host=bgp018
790 host=bgp019
791 host=bgp020
790 host=bgp021
370 host=bgp022
790 host=bgp023
790 host=bgp024
674 host=bgp025
567 host=bgp026
389 host=bgp027
778 host=bgp028
366 host=bgp029
787 host=bgp030
695 host=bgp031
int$ pwd
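For reference, the tally above can be checked with a short Python sketch. It is only an illustration: the counts are copied from the `uniq -c` output in this message, the 256 cores per site and 128 sites come from the text, and the names (`parse_counts`, `CORES_PER_SITE`, and so on) are made up for the example. Note that a JOB_START count is the total number of dispatches to a site over the run, so a count above 256 does not by itself show how many jobs were on that site at one moment.

```python
# Figures stated in the message; everything else here is illustrative.
CORES_PER_SITE = 256
TOTAL_SITES = 128

# `uniq -c` output copied from the message (count, then field 19 of the log line).
UNIQ_OUTPUT = """\
24
365 host=bgp000
790 host=bgp001
371 host=bgp002
383 host=bgp003
365 host=bgp004
791 host=bgp005
415 host=bgp006
775 host=bgp007
790 host=bgp008
791 host=bgp009
369 host=bgp010
790 host=bgp011
359 host=bgp012
791 host=bgp013
394 host=bgp014
402 host=bgp015
358 host=bgp016
595 host=bgp017
790 host=bgp018
790 host=bgp019
791 host=bgp020
790 host=bgp021
370 host=bgp022
790 host=bgp023
790 host=bgp024
674 host=bgp025
567 host=bgp026
389 host=bgp027
778 host=bgp028
366 host=bgp029
787 host=bgp030
695 host=bgp031
"""


def parse_counts(text):
    """Return {host: count}; lines with no host= field are keyed as ''."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        count = int(parts[0])
        host = parts[1].split("=", 1)[1] if len(parts) > 1 else ""
        counts[host] = counts.get(host, 0) + count
    return counts


counts = parse_counts(UNIQ_OUTPUT)
site_counts = {h: c for h, c in counts.items() if h}  # drop the blank-host row
total_jobs = sum(site_counts.values())
sites_used = len(site_counts)
cores_used = sites_used * CORES_PER_SITE
# Sites whose total dispatch count exceeds their core count.
over = {h: c for h, c in site_counts.items() if c > CORES_PER_SITE}

print(f"sites used: {sites_used} of {TOTAL_SITES}")
print(f"jobs dispatched: {total_jobs} onto {cores_used} cores")
print(f"sites with more dispatches than cores: {len(over)}")
```

With the numbers above this reports 32 of 128 sites used and 19261 jobs dispatched onto 8192 cores, which matches the "about 19K" estimate.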
On 4/10/09 11:42 AM, Mihael Hategan wrote:
> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote:
>> Hi,
>>
>> We're trying to run an oops run on 8 racks of the BGP. It's possible this
>> is larger than has been done to date with swift.
>>
>> Our sites.xml file has localhost plus 128 Falkon sites, one for each
>> pset in the 8-rack partition.
>>
>> From what I can tell, Swift sees all 128 sites, but only sends jobs to
>> exactly the first 32, bgp000-bgp031.
>>
>> While I debug this further, does anyone know of some hardwired limit
>> that would cause swift to send to only the first 32 bgp sites?
>
> I can't think of anything that would make that the case. The sites file
> and a log would be useful.
>