[Swift-devel] Is there a site count limit?
Mihael Hategan
hategan at mcs.anl.gov
Fri Apr 10 15:15:09 CDT 2009
On Fri, 2009-04-10 at 14:44 -0500, Michael Wilde wrote:
> Mihael, your suggestion of:
>
> <profile namespace="karajan" key="jobThrottle">2.56</profile>
> <profile namespace="karajan" key="initialScore">1000</profile>
>
> Is *almost* right on:
>
> int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c | awk '{ sum += $1} END {print sum}'
> 8131
> int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c
>
> 3
> 254 host=bgp000
> 254 host=bgp001
> 254 host=bgp002
> ...
> 254 host=bgp030
> 254 host=bgp031
> int$
>
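For readers reproducing this count without awk, here is a minimal Python sketch of the same pipeline: grep for JOB_START, take field 19, tally, then sum. It assumes only what the shell command itself assumes, namely that the host is the 19th whitespace-separated field of each JOB_START line. Lines without that field are tallied under an empty key, matching the unlabeled "3" above. The names `tally_hosts` and `total_jobs` are illustrative. As a sanity check on the posted figures, 32 hosts × 254 jobs + 3 unlabeled lines = 8131, the total the first command printed.

```python
from collections import Counter
from typing import Iterable


def tally_hosts(log_lines: Iterable[str]) -> Counter:
    """Count JOB_START lines by their 19th whitespace-separated field.

    Mirrors: grep JOB_START | awk '{print $19}' | sort | uniq -c
    awk numbers fields from 1, so $19 is index 18 here; a line with fewer
    than 19 fields is counted under "" because awk prints an empty $19.
    """
    counts: Counter = Counter()
    for line in log_lines:
        if "JOB_START" not in line:  # the grep step
            continue
        fields = line.split()  # awk-style splitting on runs of whitespace
        counts[fields[18] if len(fields) >= 19 else ""] += 1
    return counts


def total_jobs(counts: Counter) -> int:
    """Equivalent of the trailing awk '{ sum += $1 } END { print sum }'."""
    return sum(counts.values())
```

Typical use would be `with open(log_path) as f: counts = tally_hosts(f)` followed by `total_jobs(counts)`, where `log_path` is a placeholder for the *45.log file.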
> Can you suggest how to tweak it up to 256? Use jobThrottle=2.58 maybe?
Make the initial score larger; 10000 should be enough. As the score goes to
+inf, the per-site limit approaches a maximum of 100*jobThrottle + 1 jobs.
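Taking the asymptotic formula above at face value (limit = 100*jobThrottle + 1 as the score grows without bound), the arithmetic for the values in this thread can be worked as below. This is a sketch of that one formula only: the thread does not say how Swift rounds or compares against the limit, and the observed 254 with initialScore=1000 shows that a finite score keeps the cap below the asymptote. Under the formula, 2.56 tends to 257 rather than 256, the suggested 2.58 would tend to 259, and a target of exactly 256 corresponds to jobThrottle = 2.55. The function names are illustrative.

```python
from decimal import Decimal


def asymptotic_cap(job_throttle: str) -> Decimal:
    """Per-site job limit as the score tends to +inf, using the formula
    100 * jobThrottle + 1 quoted in this thread (Swift's rounding unknown)."""
    # Decimal keeps inputs like "2.55" exact; with binary floats,
    # 2.55 * 100 evaluates to 254.99999999999997 rather than 255.
    return Decimal(job_throttle) * 100 + 1


def throttle_for(target_jobs: int) -> Decimal:
    """Invert the same formula: jobThrottle = (target - 1) / 100."""
    return (Decimal(target_jobs) - 1) / 100


for value in ("2.56", "2.58"):
    print(value, asymptotic_cap(value))
print("target 256 ->", throttle_for(256))
```

Whether 2.55 lands on exactly 256 in practice depends on how the scheduler turns the product into a job count, so it is worth confirming with the grep count above.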
> I will experiment, but if there's a precise way to hit it "just right"
> that would be great. If not, we will adjust as needed and reduce the
> total # of jobs.
>
> Is this a roundoff issue, or does the formula subtract 2 somewhere from
> the throttle * score product?
>
> - Mike
>
>
> On 4/10/09 12:39 PM, Michael Wilde wrote:
> >
> >
> > On 4/10/09 12:22 PM, Mihael Hategan wrote:
> >> On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote:
> >>> Increase foreach.max.threads to at least 4096.
> >
> > it was set to 100000 (100K)
> >
> >> That doesn't seem to be the cause though. Do you have all the
> >> sites/executables properly in tc.data?
> >
> > duh. of course not :)
> >
> > that's the problem, thanks.
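Since the root cause was tc.data lacking entries for most of the sites, a quick coverage check like the following can catch the problem before a run. It is a sketch under stated assumptions: tc.data rows are whitespace-separated with the site handle in the first column and the transformation name in the second, lines starting with '#' are comments, the site handles are bgp000 through bgp127 (inferred from the host names in the logs), and the app name `runoops` is hypothetical.

```python
from typing import Iterable, Set


def sites_missing_app(tc_lines: Iterable[str], sites: Iterable[str], app: str) -> Set[str]:
    """Return the sites that have no tc.data row for `app`.

    Assumes column 1 is the site handle and column 2 the transformation
    name; blank lines and '#' comment lines are skipped.
    """
    covered = set()
    for raw in tc_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        if len(cols) >= 2 and cols[1] == app:
            covered.add(cols[0])
    return set(sites) - covered


if __name__ == "__main__":
    # Hypothetical site handles: one per pset, bgp000..bgp127.
    all_sites = [f"bgp{i:03d}" for i in range(128)]
    with open("tc.data") as f:
        missing = sites_missing_app(f, all_sites, "runoops")
    print(f"{len(missing)} sites lack a runoops entry:", sorted(missing))
```

A non-empty result here would explain Swift ignoring those sites even though sites.xml lists them.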
> >
> >>
> >>> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote:
> >>>> They are in ci:/home/wilde/oops.1063.2
> >>>>
> >>>> I spotted the anomaly (if that's what it is) as below.
> >>>>
> >>>> Also: we discussed on the list way, way back how to get the Swift
> >>>> scheduler to send no more jobs to each "site" than there are cores
> >>>> in that site (for this BGP/Falkon case) so that jobs don't get
> >>>> committed to busy sites while other sites have free cores.
> >>>>
> >>>> In this run, we are trying to send 32K jobs to 32K cores.
> >>>> Each of the 128 "sites" has 256 cores.
> >>>>
> >>>> The numbers below show about 19K of those jobs as having been
> >>>> dispatched to only 32*256 = 8192 cores.
> >>>>
> >>>> int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c
> >>>> 24
> >>>> 365 host=bgp000
> >>>> 790 host=bgp001
> >>>> 371 host=bgp002
> >>>> 383 host=bgp003
> >>>> 365 host=bgp004
> >>>> 791 host=bgp005
> >>>> 415 host=bgp006
> >>>> 775 host=bgp007
> >>>> 790 host=bgp008
> >>>> 791 host=bgp009
> >>>> 369 host=bgp010
> >>>> 790 host=bgp011
> >>>> 359 host=bgp012
> >>>> 791 host=bgp013
> >>>> 394 host=bgp014
> >>>> 402 host=bgp015
> >>>> 358 host=bgp016
> >>>> 595 host=bgp017
> >>>> 790 host=bgp018
> >>>> 790 host=bgp019
> >>>> 791 host=bgp020
> >>>> 790 host=bgp021
> >>>> 370 host=bgp022
> >>>> 790 host=bgp023
> >>>> 790 host=bgp024
> >>>> 674 host=bgp025
> >>>> 567 host=bgp026
> >>>> 389 host=bgp027
> >>>> 778 host=bgp028
> >>>> 366 host=bgp029
> >>>> 787 host=bgp030
> >>>> 695 host=bgp031
> >>>> int$ pwd
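The figures in this message can be re-derived from the listing. The snippet below copies the 32 per-host counts, adds the 24 lines with no host field, and checks them against the core counts stated above (128 sites of 256 cores each). Note that JOB_START lines accumulate over the run rather than measuring concurrency, so a host exceeding 256 shows how unevenly work was committed, not its peak load.

```python
# JOB_START counts for bgp000..bgp031, in order, copied from the listing above.
HOST_COUNTS = [
    365, 790, 371, 383, 365, 791, 415, 775,  # bgp000-bgp007
    790, 791, 369, 790, 359, 791, 394, 402,  # bgp008-bgp015
    358, 595, 790, 790, 791, 790, 370, 790,  # bgp016-bgp023
    790, 674, 567, 389, 778, 366, 787, 695,  # bgp024-bgp031
]
NO_HOST = 24          # JOB_START lines with an empty host field
SITES = 128           # Falkon sites in sites.xml, one per pset
CORES_PER_SITE = 256  # cores per site

total_cores = SITES * CORES_PER_SITE            # whole 8-rack partition
used_cores = len(HOST_COUNTS) * CORES_PER_SITE  # cores on the 32 sites that got jobs
dispatched = sum(HOST_COUNTS) + NO_HOST         # every JOB_START line
over_cores = [f"bgp{i:03d}" for i, n in enumerate(HOST_COUNTS) if n > CORES_PER_SITE]

print(total_cores, used_cores, dispatched, len(over_cores))
```

With these inputs, `dispatched` comes to 19285 (the "about 19K" above), and even the least-loaded host, bgp016 at 358, received more jobs than its 256 cores.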
> >>>>
> >>>>
> >>>> On 4/10/09 11:42 AM, Mihael Hategan wrote:
> >>>>> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> We're trying to run an oops run on 8 racks of the BGP. It's
> >>>>>> possible this is larger than has been done to date with Swift.
> >>>>>>
> >>>>>> Our sites.xml file has localhost plus 128 Falkon sites, one for
> >>>>>> each pset in the 8-rack partition.
> >>>>>>
> >>>>>> From what I can tell, Swift sees all 128 sites, but only sends
> >>>>>> jobs to exactly the first 32, bgp000-bgp031.
> >>>>>>
> >>>>>> While I debug this further, does anyone know of some hardwired
> >>>>>> limit that would cause Swift to send to only the first 32 bgp sites?
> >>>>> I can't think of anything that would make that the case. The sites
> >>>>> file and a log would be useful.
> >>>>>
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>