[Swift-devel] Is there a site count limit?
Allan Espinosa
aespinosa at cs.uchicago.edu
Sat Jun 27 14:28:11 CDT 2009
OK, this part confuses me. In the Swift docs:
throttle.score.job.factor
Valid values: <int>, off
Default value: 4
The Swift scheduler has the ability to limit the number of
concurrent jobs allowed on a site based on the performance history of
that site. Each site is assigned a score (initially 1), which can
increase or decrease based on whether the site yields successful or
faulty job runs. The score for a site can take values in the (0.1,
100) interval. The number of allowed jobs is calculated using the
following formula:
2 + score*throttle.score.job.factor
So the score can exceed 100?
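To make sure I'm reading that right, here is a small worked example of the
documented formula, assuming the score really is clamped to the stated
(0.1, 100) interval (the clamping is my assumption, not something I've
verified against the scheduler code):

    # Sketch of the documented per-site job limit, assuming the site score
    # is clamped to the documented (0.1, 100) interval.
    def max_concurrent_jobs(score, factor=4.0):
        clamped = min(max(score, 0.1), 100.0)
        return 2 + clamped * factor

    print(max_concurrent_jobs(1))    # 6.0   -- initial score of 1, default factor
    print(max_concurrent_jobs(100))  # 402.0 -- score at the documented cap

Read that way, a score capped at 100 with the default factor of 4 would allow
at most 402 concurrent jobs per site.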
2009/4/10 Mihael Hategan <hategan at mcs.anl.gov>:
> On Fri, 2009-04-10 at 14:44 -0500, Michael Wilde wrote:
>> Mihael, your suggestion of:
>>
>> <profile namespace="karajan" key="jobThrottle">2.56</profile>
>> <profile namespace="karajan" key="initialScore">1000</profile>
>>
>> Is *almost* right on:
>>
>> int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c | awk
>> '{ sum += $1} END {print sum}'
>> 8131
>> int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c
>>
>> 3
>> 254 host=bgp000
>> 254 host=bgp001
>> 254 host=bgp002
>> ...
>> 254 host=bgp030
>> 254 host=bgp031
>> int$
>>
>> Can you suggest how to tweak it up to 256? Use jobThrottle=2.58 maybe?
>
> Make the initial score larger. 10000 should be enough. As it goes to
> +inf, you should have a max of 100*jobThrottle + 1 jobs.
>
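A quick arithmetic check of that ceiling, assuming the 100*jobThrottle + 1
limit above holds exactly (this is only a sketch of the stated formula, not
the scheduler's actual code):

    # Sketch: the per-site limit described above saturates, as the score
    # goes to +inf, at 100 * jobThrottle + 1 jobs (assumed exact here).
    def site_ceiling(job_throttle):
        return 100 * job_throttle + 1

    print(site_ceiling(2.56))  # ~257 -- above the 254 per host observed, not 256
    print(site_ceiling(2.55))  # ~256 -- would land on 256, if this formula holds
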
>> I will experiment, but if there's a precise way to hit it "just right"
>> that would be great. If not, we will adjust as needed and reduce the
>> total # of jobs.
>>
>> Is this a roundoff issue, or does the formula subtract 2 somewhere from
>> the throttle * score product?
>>
>> - Mike
>>
>>
>> On 4/10/09 12:39 PM, Michael Wilde wrote:
>> >
>> >
>> > On 4/10/09 12:22 PM, Mihael Hategan wrote:
>> >> On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote:
>> >>> Increase foreach.max.threads to at least 4096.
>> >
>> > it was set to 100000 (100K)
>> >
>> >> That doesn't seem to be the cause though. Do you have all the
>> >> sites/executables properly in tc.data?
>> >
>> > duh. of course not :)
>> >
>> > that's the problem, thanks.
>> >
>> >>
>> >>> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote:
>> >>>> They are in ci:/home/wilde/oops.1063.2
>> >>>>
>> >>>> I spotted the anomaly (if that's what it is) as below.
>> >>>>
>> >>>> Also: we discussed on the list a while back how to get the Swift
>> >>>> scheduler to send no more jobs to each "site" than there are cores
>> >>>> in that site (for this BG/P/Falkon case) so that jobs don't get
>> >>>> committed to busy sites while other sites have free cores.
>> >>>>
>> >>>> In this run, we are trying to send 32K jobs to 32K cores.
>> >>>> Each of the 128 "sites" have 256 cores.
>> >>>>
>> >>>> The #s below show about 19K of those jobs as having been dispatched
>> >>>> to 32*256 = 8192 cores.
>> >>>>
>> >>>> int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c
>> >>>> 24
>> >>>> 365 host=bgp000
>> >>>> 790 host=bgp001
>> >>>> 371 host=bgp002
>> >>>> 383 host=bgp003
>> >>>> 365 host=bgp004
>> >>>> 791 host=bgp005
>> >>>> 415 host=bgp006
>> >>>> 775 host=bgp007
>> >>>> 790 host=bgp008
>> >>>> 791 host=bgp009
>> >>>> 369 host=bgp010
>> >>>> 790 host=bgp011
>> >>>> 359 host=bgp012
>> >>>> 791 host=bgp013
>> >>>> 394 host=bgp014
>> >>>> 402 host=bgp015
>> >>>> 358 host=bgp016
>> >>>> 595 host=bgp017
>> >>>> 790 host=bgp018
>> >>>> 790 host=bgp019
>> >>>> 791 host=bgp020
>> >>>> 790 host=bgp021
>> >>>> 370 host=bgp022
>> >>>> 790 host=bgp023
>> >>>> 790 host=bgp024
>> >>>> 674 host=bgp025
>> >>>> 567 host=bgp026
>> >>>> 389 host=bgp027
>> >>>> 778 host=bgp028
>> >>>> 366 host=bgp029
>> >>>> 787 host=bgp030
>> >>>> 695 host=bgp031
>> >>>> int$ pwd
>> >>>>
>> >>>>
>> >>>> On 4/10/09 11:42 AM, Mihael Hategan wrote:
>> >>>>> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> We're trying to run an oops run on 8 racks of the BG/P. It's
>> >>>>>> possible this is larger than has been done to date with Swift.
>> >>>>>>
>> >>>>>> Our sites.xml file has localhost plus 128 Falkon sites, one for
>> >>>>>> each pset in the 8-rack partition.
>> >>>>>>
>> >>>>>> From what I can tell, Swift sees all 128 sites, but only sends
>> >>>>>> jobs to exactly the first 32, bgp000-bgp031.
>> >>>>>>
>> >>>>>> While I debug this further, does anyone know of some hardwired
>> >>>>>> limit that would cause Swift to send to only the first 32 bgp sites?
>> >>>>> I can't think of anything that would make that the case. The sites
>> >>>>> file and a log would be useful.
>> >>>>>
--
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>