[Swift-devel] Imbalanced scheduling with coasters and multiple sites

Michael Wilde wilde at mcs.anl.gov
Mon Apr 6 23:56:54 CDT 2009


The latest rev shows a similar failure on the surface, but I think there 
are different patterns in the coaster logs.

The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile.

This time 39 of 40 jobs ran on abe, and then the workflow lingered and 
finally failed, with 39 ok, 1 failure.
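
A per-site tally like this can be pulled from the Swift log with the same 
grep/awk pipeline quoted further down. As a rough sketch only, here is the 
same count in Python. It assumes each 'execute2 THREAD_ASSOCIATION' line 
carries a whitespace-delimited host=<site> token (inferred from the awk 
output below, not from any log format spec). The names count_jobs_per_host 
and main are made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: relevant log lines contain this marker text and a
# whitespace-delimited "host=<site>" token somewhere on the same line.
MARKER = "execute2 THREAD_ASSOCIATION"
HOST_TOKEN = re.compile(r"\bhost=(\S+)")


def count_jobs_per_host(lines):
    """Return a Counter mapping site name to the number of marker lines seen."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue  # ignore lines that are not thread-association records
        match = HOST_TOKEN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def main(paths):
    """Tally all given log files and print counts in 'uniq -c' style."""
    counts = Counter()
    for path in paths:
        with open(path, errors="replace") as handle:
            counts.update(count_jobs_per_host(handle))
    for host, n in counts.most_common():
        print(f"{n:7d} host={host}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run it with the log file names as arguments. Unlike awk '{print $8}', it 
matches the host= token by name, so it does not depend on the field 
position staying the same between revs.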

All the logs for this run are in
   /home/wilde/swift/lab/20090406-2330-72p9ale0

Below that are dirs for the abe and qb coaster and gram logs.
Abe had no gram log for this run.

I suspect this one is worth looking at.



On 4/6/09 11:25 PM, Michael Wilde wrote:
> I tried this test and discovered some more things about coaster time 
> management that I don't understand.
> 
> It seems that on Queenbee coasters were timing out, while on abe the 
> workers were getting queued, but abe's coasters.log showed lots of java 
> exceptions.
> 
> If you're interested, all logs for this run, including the coasters.log 
> files from the two sites' .globus dirs, are on the CI net at 
> /home/wilde/swift/lab/20090406-2120-04ythaie
> 
> I will re-run with the latest cog/swift revs to see if the behavior 
> persists.
> 
> - Mike
> 
> 
> On 4/6/09 6:28 PM, Michael Wilde wrote:
>> Glen seems to have a good example of this in:
>>   /home/hockyg/oops/swift/output/teragridoutdir.1
>>
>> com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' | sort | uniq -c
>>     159 host=abe
>>       8 host=localhost
>>      13 host=qb
>>      11 host=ranger
>> com$
>>
>> ---
>>
>> But then I looked in the log and I see that for qb and ranger, it 
>> tries to start jobs there and gets an exception on each of them, while 
>> jobs for abe keep on zipping through.
>>
>> As far as I can tell, there is, e.g. on queenbee, no coaster boot log at 
>> the time of the exception, and I can't glean any clues from the GRAM 
>> log at the time of the exception (no obvious errors in it).
>>
>> I am trying now to reproduce this with simple echo-like jobs under my 
>> own id & cert where I can see all the server-side logs.
>>
>> I *think* that for the run above, Glen first tested each of the 3 
>> sites.xml pool elements separately, for the 3 sites, before trying the 
>> 3-site test.  I *think* he verified that all three sites worked 
>> separately.
>>
>> But when put together, it *seems* that only the first one works, as if 
>> the ability to start coasters on 3 sites at once is broken.
>>
>> I am not at all sure, and will try to isolate with a simpler test that 
>> you can run as well, but at the moment that's a plausible theory.
>>
>> Btw, this is still with the Mar 31 code rev. I need to catch up on 
>> mail to see if I can now go back to testing on trunk.
>>
>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


