[Swift-devel] Imbalanced scheduling with coasters and multiple sites

Michael Wilde wilde at mcs.anl.gov
Mon Apr 6 18:28:04 CDT 2009


Glen seems to have a good example of this in:
   /home/hockyg/oops/swift/output/teragridoutdir.1

com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' | sort | uniq -c
     159 host=abe
       8 host=localhost
      13 host=qb
      11 host=ranger
com$
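
For anyone who wants to do the same tally from a script, here is a
minimal Python sketch of the pipeline above. It assumes, as the awk
command does, that the host=... token is the 8th whitespace-separated
field on each matching line. The function names (count_hosts, shares)
are made up for illustration and are not part of Swift.

```python
from collections import Counter


def count_hosts(lines, marker="execute2 THREAD_ASSOCIATION", field=8):
    """Tally the value of a 1-based whitespace field on lines containing marker.

    Mirrors: grep MARKER | awk '{print $8}' | sort | uniq -c
    Assumption: the host token sits in field 8, as in the awk command.
    Unlike awk, lines with fewer than `field` fields are skipped
    rather than counted as an empty value.
    """
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        parts = line.split()
        if len(parts) >= field:
            counts[parts[field - 1]] += 1
    return counts


def shares(counts):
    """Return each host's fraction of the total, or {} if nothing was counted."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {host: n / total for host, n in counts.items()}
```

Feeding the counts shown above into shares() puts abe at 159 of 191
jobs, roughly 83% of the total, which is the imbalance in question.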

---

But when I looked in the log, I saw that for qb and ranger it tries to 
start jobs there and gets an exception on each of them, while jobs for 
abe keep zipping through.

As far as I can tell, there is, e.g. on queenbee, no coaster boot log at 
the time of the exception, and I can't glean any clues from the GRAM log 
at the time of the exception (no obvious errors in it).

I am trying now to reproduce this with simple echo-like jobs under my 
own id & cert where I can see all the server-side logs.

I *think* that for the run above, Glen first tested each of the 3 
sites.xml pool elements separately, one per site, before trying the 
3-site test.  I *think* he verified that all three sites worked separately.

But when put together, it *seems* that only the first one works, as if 
the ability to start coasters on 3 sites at once is broken.

I am not at all sure, and will try to isolate with a simpler test that 
you can run as well, but at the moment that's a plausible theory.

Btw, this is still with the Mar 31 code rev. I need to catch up on mail 
to see if I can now go back to testing on trunk.





