[Swift-devel] Is there a site count limit?

Mihael Hategan hategan at mcs.anl.gov
Fri Apr 10 15:15:09 CDT 2009


On Fri, 2009-04-10 at 14:44 -0500, Michael Wilde wrote:
> Mihael, your suggestion of:
> 
> <profile namespace="karajan" key="jobThrottle">2.56</profile>
> <profile namespace="karajan" key="initialScore">1000</profile>
> 
> Is *almost* right on:
> 
> int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c | awk '{ sum += $1} END {print sum}'
> 8131
> int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c 
> 
>        3
>      254 host=bgp000
>      254 host=bgp001
>      254 host=bgp002
>      ...
>      254 host=bgp030
>      254 host=bgp031
> int$
> 
> Can you suggest how to tweak it up to 256? Use jobThrottle=2.58 maybe?

Make the initial score larger. 10000 should be enough. As the initial
score goes to +inf, the limit approaches a max of 100*jobThrottle + 1
jobs per site.
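
To make that bound concrete, here is a small sketch of the arithmetic in
Python. It is not Swift code: the function names are made up for
illustration, and it assumes the limit is exactly 100*jobThrottle + 1 with
no further rounding inside the scheduler.

```python
from fractions import Fraction


def max_jobs(job_throttle: str) -> int:
    """Upper bound on jobs per site as initialScore grows large.

    Assumes the bound is exactly 100*jobThrottle + 1 (hypothetical helper).
    The throttle is passed as a string so it is parsed exactly, not as a float.
    """
    return int(100 * Fraction(job_throttle) + 1)


def throttle_for(target_jobs: int) -> Fraction:
    """Invert the bound: jobThrottle = (target_jobs - 1) / 100."""
    return Fraction(target_jobs - 1, 100)


print(max_jobs("2.56"))          # 257
print(float(throttle_for(256)))  # 2.55
```

Under that assumption, jobThrottle=2.56 tops out at 257 jobs per site, and
2.55 would be the value to try for a cap of 256 once initialScore is large
enough.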

> I will experiment, but if there's a precise way to hit it "just right" 
> that would be great. If not, we will adjust as needed and reduce the 
> total # of jobs.
> 
> Is this a roundoff issue, or does the formula subtract 2 somewhere from 
> the throttle * score product?
> 
> - Mike
> 
> 
> On 4/10/09 12:39 PM, Michael Wilde wrote:
> > 
> > 
> > On 4/10/09 12:22 PM, Mihael Hategan wrote:
> >> On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote:
> >>> Increase foreach.max.threads to at least 4096.
> > 
> > it was set to 100000 (100K)
> > 
> >> That doesn't seem to be the cause though. Do you have all the
> >> sites/executables properly in tc.data?
> > 
> > duh. of course not :)
> > 
> > That's the problem, thanks.
> > 
> >>
> >>> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote:
> >>>> They are in ci:/home/wilde/oops.1063.2
> >>>>
> >>>> I spotted the anomaly (if that's what it is) as below.
> >>>>
> >>>> Also: we discussed on the list a long time ago how to get the Swift 
> >>>> scheduler to send no more jobs to each "site" than there are cores 
> >>>> in that site (for this bgp/falkon case) so that jobs don't get 
> >>>> committed to busy sites while other sites have free cores.
> >>>>
> >>>> In this run, we are trying to send 32K jobs to 32K cores.
> >>>> Each of the 128 "sites" has 256 cores.
> >>>>
> >>>> The #s below show about 19K of those jobs as having been dispatched 
> >>>> to 32*256 = 8192 cores.
> >>>>
> >>>> int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c
> >>>>       24
> >>>>      365 host=bgp000
> >>>>      790 host=bgp001
> >>>>      371 host=bgp002
> >>>>      383 host=bgp003
> >>>>      365 host=bgp004
> >>>>      791 host=bgp005
> >>>>      415 host=bgp006
> >>>>      775 host=bgp007
> >>>>      790 host=bgp008
> >>>>      791 host=bgp009
> >>>>      369 host=bgp010
> >>>>      790 host=bgp011
> >>>>      359 host=bgp012
> >>>>      791 host=bgp013
> >>>>      394 host=bgp014
> >>>>      402 host=bgp015
> >>>>      358 host=bgp016
> >>>>      595 host=bgp017
> >>>>      790 host=bgp018
> >>>>      790 host=bgp019
> >>>>      791 host=bgp020
> >>>>      790 host=bgp021
> >>>>      370 host=bgp022
> >>>>      790 host=bgp023
> >>>>      790 host=bgp024
> >>>>      674 host=bgp025
> >>>>      567 host=bgp026
> >>>>      389 host=bgp027
> >>>>      778 host=bgp028
> >>>>      366 host=bgp029
> >>>>      787 host=bgp030
> >>>>      695 host=bgp031
> >>>> int$ pwd
> >>>>
> >>>>
> >>>> On 4/10/09 11:42 AM, Mihael Hategan wrote:
> >>>>> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> We're trying to run an oops run on 8 racks of the BGP. It's 
> >>>>>> possible this is larger than has been done to date with Swift.
> >>>>>>
> >>>>>> Our sites.xml file has localhost plus 128 Falkon sites, one for 
> >>>>>> each pset in the 8-rack partition.
> >>>>>>
> >>>>>>  From what I can tell, Swift sees all 128 sites, but only sends 
> >>>>>> jobs to exactly the first 32, bgp000-bgp031.
> >>>>>>
> >>>>>> While I debug this further, does anyone know of some hardwired 
> >>>>>> limit that would cause swift to send to only the first 32 bgp sites?
> >>>>> I can't think of anything that would make that the case. The sites 
> >>>>> file
> >>>>> and a log would be useful.
> >>>>>
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>
