[Swift-devel] Imbalanced scheduling with coasters and multiple sites

Mihael Hategan hategan at mcs.anl.gov
Mon Apr 6 23:45:45 CDT 2009


On Mon, 2009-04-06 at 23:25 -0500, Michael Wilde wrote:
> I tried this test and discovered some more things about coaster time
> management that I don't understand.
> 
> It seems that on Queenbee coasters were timing out, while on abe the 
> workers were getting queued, but abe's coasters.log showed lots of java 
> exceptions.

Yes. It still seems to have been run with the unfortunate version. I
can't tell which exceptions are legit and which ones are the result of
the coasters code being in that particular bad state.

> 
> If you're interested, all logs for this run, including the coasters.log
> files from the two sites' .globus dirs, are on the CI net at
> /home/wilde/swift/lab/20090406-2120-04ythaie
> 
> I will re-run with the latest cog/swift revs to see if the behavior 
> persists.
> 
> - Mike
> 
> 
> On 4/6/09 6:28 PM, Michael Wilde wrote:
> > Glen seems to have a good example of this in:
> >   /home/hockyg/oops/swift/output/teragridoutdir.1
> > 
> > com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' | sort | uniq -c
> >     159 host=abe
> >       8 host=localhost
> >      13 host=qb
> >      11 host=ranger
> > com$
> > 
> > ---
> > 
> > But then I looked in the log and I see that for qb and ranger, it tries 
> > to start jobs there and gets an exception on each of them, while jobs 
> > for abe keep on zipping through.
> > 
> > As far as I can tell, there is, e.g. on queenbee, no coaster boot log at
> > the time of the exception, and I can't glean any clues from the GRAM log
> > at the time of the exception (no obvious errors in it).
> > 
> > I am trying now to reproduce this with simple echo-like jobs under my 
> > own id & cert where I can see all the server-side logs.
> > 
> > I *think* that for the run above, Glen first tested each of the 3
> > sites.xml pool elements separately, one per site, before trying the
> > 3-site test.  I *think* he verified that all three sites worked separately.
> > 
> > But when put together, it *seems* that only the first one works, as if 
> > the ability to start coasters on 3 sites at once is broken.
> > 
> > I am not at all sure, and will try to isolate with a simpler test that
> > you can run as well, but at the moment that's a plausible theory.
> > 
> > Btw, this is still with the Mar 31 code rev. I need to catch up on mail
> > to see if I can now go back to testing on trunk.
> > 
> > 
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
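For comparing runs, the per-host tally quoted above (grep | awk | sort | uniq -c) could also be done in a few lines of Python. This is only a rough sketch: it assumes, as the awk command does, that the host is the 8th whitespace-separated field on each 'execute2 THREAD_ASSOCIATION' line. The names tally_hosts and format_tally are made up for this example and are not part of Swift or CoG.

```python
from collections import Counter

# Marker string taken from the grep command in the quoted message.
MARKER = "execute2 THREAD_ASSOCIATION"


def tally_hosts(lines, field=8):
    """Count the values of a 1-based whitespace field on marker lines.

    field=8 mirrors awk's $8 from the quoted pipeline; whether the host
    always sits in that position is an assumption about the log format.
    """
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue  # same filtering as the grep step
        parts = line.split()
        if len(parts) >= field:
            counts[parts[field - 1]] += 1
    return counts


def format_tally(counts):
    """Render counts sorted by key, in the style of `sort | uniq -c`."""
    return [f"{n:7d} {key}" for key, n in sorted(counts.items())]
```

To use it on a log, something like `print("\n".join(format_tally(tally_hosts(open(path)))))` would do, where path points at the Swift log. A heavily lopsided count toward one host then shows the same imbalance as the shell output.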



