[Swift-devel] Imbalanced scheduling with coasters and multiple sites
Mihael Hategan
hategan at mcs.anl.gov
Mon Apr 6 23:45:45 CDT 2009
On Mon, 2009-04-06 at 23:25 -0500, Michael Wilde wrote:
> I tried this test and discovered some more things about coaster time
> management that I don't understand.
>
> It seems that on Queenbee coasters were timing out, while on abe the
> workers were getting queued, but abe's coasters.log showed lots of java
> exceptions.
Yes. It still seems to have been run with the unfortunate version. I
can't tell which exceptions are legit and which ones are the result of
the coaster code being in that particular bad state.
>
> If you're interested, all logs for this run, including the coasters.log
> files from the two sites' .globus dirs, are on the ci net at
> /home/wilde/swift/lab/20090406-2120-04ythaie
>
> I will re-run with the latest cog/swift revs to see if the behavior
> persists.
>
> - Mike
>
>
> On 4/6/09 6:28 PM, Michael Wilde wrote:
> > Glen seems to have a good example of this in:
> > /home/hockyg/oops/swift/output/teragridoutdir.1
> >
> > com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' |
> > sort | uniq -c
> > 159 host=abe
> > 8 host=localhost
> > 13 host=qb
> > 11 host=ranger
> > com$
> >
> > ---
> >
> > But then I looked in the log and I see that for qb and ranger, it tries
> > to start jobs there and gets an exception on each of them, while jobs
> > for abe keep on zipping through.
> >
> > As far as I can tell, there is, e.g. on queenbee, no coaster boot log at
> > the time of the exception, and I can't glean any clues from the GRAM log
> > at the time of the exception (no obvious errors in it).
> >
> > I am trying now to reproduce this with simple echo-like jobs under my
> > own id & cert where I can see all the server-side logs.
> >
> > I *think* that for the run above, Glen first tested each of the 3
> > sites.xml pool elements separately, for the 3 sites, before trying the
> > 3-site test. I *think* he verified that all three sites worked separately.
> >
> > But when put together, it *seems* that only the first one works, as if
> > the ability to start coasters on 3 sites at once is broken.
> >
> > I am not at all sure, and will try to isolate with a simpler test that
> > you can run as well, but at the moment that's a plausible theory.
> >
> > Btw, this is still with the Mar 31 code rev. I need to catch up on mail
> > to see if I can now go back to testing on trunk.
> >
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
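For anyone repeating the per-host tally quoted above (the grep | awk | sort | uniq -c pipeline), here is a minimal Python sketch of the same count. It assumes only that each matching log line carries a host=NAME token somewhere on the line, rather than relying on it being the 8th whitespace-separated field. The function name count_jobs_per_host and the sample log lines are hypothetical, made up for illustration, and are not taken from the actual Swift logs.

```python
import re
from collections import Counter

# Matches a standalone "host=NAME" token (assumed format, not verified
# against real Swift logs).
HOST_RE = re.compile(r"\bhost=(\S+)")


def count_jobs_per_host(lines):
    """Tally 'execute2 THREAD_ASSOCIATION' lines by their host= value."""
    counts = Counter()
    for line in lines:
        if "execute2 THREAD_ASSOCIATION" not in line:
            continue  # same filter as the grep in the quoted pipeline
        match = HOST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, invented for illustration only.
    sample = [
        "DEBUG execute2 THREAD_ASSOCIATION jobid=j1 thread=0-1 host=abe",
        "DEBUG execute2 THREAD_ASSOCIATION jobid=j2 thread=0-2 host=qb",
        "DEBUG execute2 THREAD_ASSOCIATION jobid=j3 thread=0-3 host=abe",
        "DEBUG execute2 JOB_START jobid=j3 host=abe",
    ]
    # Print in a layout similar to `sort | uniq -c`.
    for host, n in sorted(count_jobs_per_host(sample).items()):
        print(f"{n:7d} host={host}")
```

To run it against a real log, pass an open file object instead of the sample list, e.g. count_jobs_per_host(open(path)) with path pointing at the Swift run log. A heavy skew toward one host, as in the quoted counts, is the symptom under discussion; the tally by itself does not say whether the other sites failed or were simply given fewer jobs.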