[Swift-devel] Imbalanced scheduling with coasters and multiple sites
Michael Wilde
wilde at mcs.anl.gov
Mon Apr 6 23:56:54 CDT 2009
The latest rev shows a failure that looks similar on the surface, but I
think the patterns in the coaster logs are different.
The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile.
This time 39 of 40 jobs ran on abe, and then the workflow lingered and
finally failed, with 39 ok, 1 failure.
All the logs for this run are in
/home/wilde/swift/lab/20090406-2330-72p9ale0
Below that are dirs for the abe and qb coaster and GRAM logs.
Abe had no gram log for this run.
I suspect this one is worth looking at.
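
For checking how jobs were spread across sites in a run like this, the grep/awk tally quoted further down in this thread can be sketched in Python. This is only a rough sketch: it assumes each relevant log line contains the text "execute2 THREAD_ASSOCIATION" and a whitespace-delimited host=<site> token, which is what the awk output suggests. The sample lines and the function name are made up for illustration and are not taken from a real Swift log.

```python
import re
from collections import Counter

# Assumption: relevant lines carry this marker plus a "host=<site>" token.
MARKER = "execute2 THREAD_ASSOCIATION"
HOST_RE = re.compile(r"\bhost=(\S+)")


def count_jobs_per_host(lines):
    """Return a Counter mapping site name to the number of marker lines seen."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue  # skip lines that are not job/host associations
        match = HOST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines; the real log layout may differ.
sample = [
    "DEBUG execute2 THREAD_ASSOCIATION jobid=j1 host=abe",
    "DEBUG execute2 THREAD_ASSOCIATION jobid=j2 host=abe",
    "DEBUG execute2 THREAD_ASSOCIATION jobid=j3 host=qb",
    "DEBUG some other event host=localhost",
]

if __name__ == "__main__":
    # Print in the same "count host=site" shape as `sort | uniq -c`.
    for host, n in sorted(count_jobs_per_host(sample).items()):
        print(f"{n:7d} host={host}")
```

To run it against a real log, pass an open file object instead of the sample list, since the function only iterates over lines.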
On 4/6/09 11:25 PM, Michael Wilde wrote:
> I tried this test and discovered some more things about coaster time
> management that I don't understand.
>
> It seems that on Queenbee coasters were timing out, while on abe the
> workers were getting queued, but abe's coasters.log showed lots of java
> exceptions.
>
> If you're interested, all logs for this run, including the coasters.log
> files from the two sites' .globus dirs, are on the CI net at
> /home/wilde/swift/lab/20090406-2120-04ythaie
>
> I will re-run with the latest cog/swift revs to see if the behavior
> persists.
>
> - Mike
>
>
> On 4/6/09 6:28 PM, Michael Wilde wrote:
>> Glen seems to have a good example of this in:
>> /home/hockyg/oops/swift/output/teragridoutdir.1
>>
>> com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' | sort | uniq -c
>> 159 host=abe
>> 8 host=localhost
>> 13 host=qb
>> 11 host=ranger
>> com$
>>
>> ---
>>
>> But then I looked in the log and I see that for qb and ranger, it
>> tries to start jobs there and gets an exception on each of them, while
>> jobs for abe keep on zipping through.
>>
>> As far as I can tell, there is, e.g. on queenbee, no coaster boot log at
>> the time of the exception, and I can't glean any clues from the GRAM
>> log at the time of the exception (no obvious errors in it).
>>
>> I am trying now to reproduce this with simple echo-like jobs under my
>> own id & cert where I can see all the server-side logs.
>>
>> I *think* that for the run above, Glen first tested each of the 3
>> sites.xml pool elements separately, for the 3 sites, before trying the
>> 3-site test. I *think* he verified that all three sites worked
>> separately.
>>
>> But when put together, it *seems* that only the first one works, as if
>> the ability to start coasters on 3 sites at once is broken.
>>
>> I am not at all sure, and will try to isolate with a simpler test that
>> you can run as well, but at the moment that's a plausible theory.
>>
>> Btw, this is still with the Mar 31 code rev. I need to catch up on
>> mail to see if I can now go back to testing on trunk.
>>
>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel