[Swift-devel] Is there a site count limit?

Mihael Hategan hategan at mcs.anl.gov
Fri Apr 10 12:22:10 CDT 2009


On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote:
> Increase foreach.max.threads to at least 4096.

That doesn't seem to be the cause, though. Are all the sites and their
executables properly listed in tc.data?
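
A quick way to check is to compare the site names against tc.data. This is only a rough sketch: it assumes tc.data is whitespace-separated with the site name in the first column and '#' starting a comment line. The helper name, the "app" entry and its path are made up for illustration; substitute the real site list and the real file contents.

```python
def sites_missing_from_tc(site_names, tc_text):
    """Return the site names that have no entry in the given tc.data text.

    Assumes (not verified here) that each non-comment line of tc.data
    starts with the site name, followed by whitespace-separated fields.
    """
    listed = set()
    for line in tc_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        listed.add(line.split()[0])
    return [s for s in site_names if s not in listed]


# Hypothetical data: localhost plus 128 sites named bgp000..bgp127,
# with tc.data entries for localhost and the first 32 bgp sites only.
sites = ["localhost"] + ["bgp%03d" % i for i in range(128)]
tc_text = "\n".join(
    "%s  app  /bin/app  INSTALLED  INTEL32::LINUX  null" % s
    for s in sites[:33]
)
missing = sites_missing_from_tc(sites, tc_text)
```

If the returned list is non-empty, those sites have no tc.data entry under the assumed format, which would be worth ruling out before looking for a limit elsewhere.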

> 
> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote:
> > They are in ci:/home/wilde/oops.1063.2
> > 
> > I spotted the anomaly (if that's what it is) as below.
> > 
> > Also: we discussed on the list a long time ago how to get the Swift 
> > scheduler to send no more jobs to each "site" than there are cores in 
> > that site (for this bgp/falkon case) so that jobs don't get committed to 
> > busy sites while other sites have free cores.
> > 
> > In this run, we are trying to send 32K jobs to 32K cores.
> > Each of the 128 "sites" has 256 cores.
> > 
> > The #s below show about 19K of those jobs as having been dispatched to 
> > only 32 sites (32*256 = 8192 cores).
> > 
> > int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c 
> > 
> >       24
> >      365 host=bgp000
> >      790 host=bgp001
> >      371 host=bgp002
> >      383 host=bgp003
> >      365 host=bgp004
> >      791 host=bgp005
> >      415 host=bgp006
> >      775 host=bgp007
> >      790 host=bgp008
> >      791 host=bgp009
> >      369 host=bgp010
> >      790 host=bgp011
> >      359 host=bgp012
> >      791 host=bgp013
> >      394 host=bgp014
> >      402 host=bgp015
> >      358 host=bgp016
> >      595 host=bgp017
> >      790 host=bgp018
> >      790 host=bgp019
> >      791 host=bgp020
> >      790 host=bgp021
> >      370 host=bgp022
> >      790 host=bgp023
> >      790 host=bgp024
> >      674 host=bgp025
> >      567 host=bgp026
> >      389 host=bgp027
> >      778 host=bgp028
> >      366 host=bgp029
> >      787 host=bgp030
> >      695 host=bgp031
> > int$ pwd
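
The same tally can be done without depending on the host being field 19 of the log line. The sketch below matches the host=... token directly and compares each count with the per-site core count. The function names are made up for illustration, and the defaults of 256 cores and 128 sites are just the figures quoted in this thread. Note that the counts are cumulative over the run, so a count above 256 does not by itself show that a site was oversubscribed at any one moment.

```python
import re
from collections import Counter

HOST_RE = re.compile(r"\bhost=(\S+)")
NO_HOST = "(no host)"


def jobs_per_host(log_lines):
    """Count JOB_START lines per host, keyed by the value after 'host='."""
    counts = Counter()
    for line in log_lines:
        if "JOB_START" not in line:
            continue
        match = HOST_RE.search(line)
        counts[match.group(1) if match else NO_HOST] += 1
    return counts


def summarize(counts, cores_per_site=256, total_sites=128):
    """Report how many sites received jobs and which exceed the core count."""
    used = [host for host in counts if host != NO_HOST]
    over = {
        host: n
        for host, n in counts.items()
        if host != NO_HOST and n > cores_per_site
    }
    return {
        "sites_used": len(used),
        "sites_unused": total_sites - len(used),
        "over_core_count": over,
    }


if __name__ == "__main__":
    import sys

    # Usage: python tally.py some.log
    with open(sys.argv[1]) as log:
        print(summarize(jobs_per_host(log)))
```

Applied to the numbers above, this would report 32 sites used and 96 unused, with every listed site above 256.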
> > 
> > 
> > On 4/10/09 11:42 AM, Mihael Hategan wrote:
> > > On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote:
> > >> Hi,
> > >>
> > >> We're trying to run an oops run on 8 racks of the BGP. It's possible this 
> > >> is larger than has been done to date with Swift.
> > >>
> > >> Our sites.xml file has localhost plus 128 Falkon sites, one for each 
> > >> pset in the 8-rack partition.
> > >>
> > >> From what I can tell, Swift sees all 128 sites, but only sends jobs to 
> > >> exactly the first 32, bgp000-bgp031.
> > >>
> > >> While I debug this further, does anyone know of some hardwired limit 
> > >> that would cause Swift to send to only the first 32 bgp sites?
> > > 
> > > I can't think of anything that would make that the case. The sites file
> > > and a log would be useful.
> > > 
> 



