[Swift-devel] Imbalanced scheduling with coasters and multiple sites

Michael Wilde wilde at mcs.anl.gov
Mon Apr 6 23:25:45 CDT 2009


I tried this test and discovered some more things about coaster time 
management that I don't understand.

It seems that on Queenbee the coasters were timing out, while on abe the 
workers were getting queued, but abe's coasters.log showed lots of Java 
exceptions.

If you're interested, all logs for this run, including the coasters.log 
files from the two sites' .globus dirs, are on the ci net at 
/home/wilde/swift/lab/20090406-2120-04ythaie

I will re-run with the latest cog/swift revs to see if the behavior 
persists.

- Mike


On 4/6/09 6:28 PM, Michael Wilde wrote:
> Glen seems to have a good example of this in:
>   /home/hockyg/oops/swift/output/teragridoutdir.1
> 
> com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' | sort | uniq -c
>     159 host=abe
>       8 host=localhost
>      13 host=qb
>      11 host=ranger
> com$
> 
> ---
> 
> But then I looked in the log and I see that for qb and ranger, it tries 
> to start jobs there and gets an exception on each of them, while jobs 
> for abe keep on zipping through.
> 
> As far as I can tell, there is, e.g. on queenbee, no coaster boot log at 
> the time of the exception, and I can't glean any clues from the GRAM log 
> at the time of the exception (no obvious errors in it).
> 
> I am trying now to reproduce this with simple echo-like jobs under my 
> own id & cert where I can see all the server-side logs.
> 
> I *think* that for the run above, Glen first tested each of the 3 
> sites.xml pool elements separately, one per site, before trying the 
> 3-site test.  I *think* he verified that all three sites worked separately.
> 
> But when put together, it *seems* that only the first one works, as if 
> the ability to start coasters on 3 sites at once is broken.
> 
> I am not at all sure, and will try to isolate with a simpler test that 
> you can run as well, but at the moment that's a plausible theory.
> 
> Btw, this is still with the Mar 31 code rev. I need to catch up on mail 
> to see if I can now go back to testing on trunk.
> 
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
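
The per-host tally in the quoted message (grep, awk '{print $8}', sort, 
uniq -c) could also be scripted so it is easy to rerun against new logs. 
The following is only a rough sketch: the function name count_by_field and 
its parameters are made up for illustration, and it assumes, as the awk 
command does, that the host=... token is the 8th whitespace-separated 
field of each matching line. Unlike awk, it skips matching lines that have 
fewer than 8 fields instead of counting an empty value.

```python
import sys
from collections import Counter

# Marker string taken from the grep command in the quoted message.
MARKER = "execute2 THREAD_ASSOCIATION"


def count_by_field(lines, marker=MARKER, field=8):
    """Tally the Nth whitespace-separated field of lines containing marker.

    Hypothetical helper: mirrors grep MARKER | awk '{print $N}' | sort | uniq -c.
    """
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        parts = line.split()  # splits on runs of whitespace, like awk's default
        if len(parts) >= field:
            counts[parts[field - 1]] += 1
    return counts


if __name__ == "__main__":
    # Usage (assumed): python count_hosts.py *ji3.log
    totals = Counter()
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            totals.update(count_by_field(fh))
    for value, n in sorted(totals.items()):
        print(f"{n:7d} {value}")
```

Run against the same *ji3.log files, this should give the same kind of 
per-host counts as the shell pipeline, provided the log layout matches the 
field-position assumption above.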


