[Swift-devel] Some remote workers + provider staging logs (ReplyTimeouts on large workflows)

Allan Espinosa aespinosa at cs.uchicago.edu
Wed Jan 5 07:17:34 CST 2011


Ok, so the magic number that causes failures is somewhere between 55
and 62 workers.

-Allan

2011/1/4 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> I was finally able to replicate things in a non-OSG setting (well, most of
> it). I ran 100 workers on bridled and it produced the errors I was expecting.
> I'm binary searching to determine at what worker count it starts failing
> (10 workers worked).
>
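As an aside, the bisection itself can be scripted. The sketch below is only an
illustration of the search, assuming a hypothetical run-test.sh driver that
starts the given number of workers, runs the workflow, and exits non-zero on
failure; it also assumes the failure is monotone in the worker count, which
the 55-62 range above suggests is only approximately true.

  #!/usr/bin/perl
  # Sketch: find the smallest worker count that makes a run fail,
  # given that 10 workers succeed and 100 fail.
  use strict;
  use warnings;

  # Hypothetical driver: starts $n workers, runs the workflow,
  # exits 0 on success.  Replace with the real test harness.
  sub run_workflow {
      my ($n) = @_;
      return system("./run-test.sh", $n) == 0;
  }

  my ($good, $bad) = (10, 100);   # known-good and known-bad worker counts
  while ($bad - $good > 1) {
      my $mid = int(($good + $bad) / 2);
      if (run_workflow($mid)) {
          $good = $mid;
      } else {
          $bad = $mid;
      }
  }
  print "largest passing count: $good, smallest failing count: $bad\n";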
>
> Attached is the trace of the 100-worker run; the client and service logs are
> also included.
>
> -Allan
>
> 2010/12/30 Michael Wilde <wilde at mcs.anl.gov>:
>> Hi Allan,
>>
>> It would be good to get client, service and worker logs for a reasonably small
>> failing case - I suspect Mihael could diagnose the problem from that.
>>
>> I will try to join you by Skype at 2PM if that's convenient for you and Dan.
>>
>> - Mike
>>
>>
>> ----- Original Message -----
>>> I redid the OSG run with only one worker per coaster service, and the same
>>> workflow finished without problems. I'll investigate whether multiple workers
>>> cause problems by setting up a testbed case on PADS as well.
>>>
>>> 2010/12/30 Mihael Hategan <hategan at mcs.anl.gov>:
>>> > On Wed, 2010-12-29 at 15:28 -0600, Allan Espinosa wrote:
>>> >
>>> >> Does the timeout occur because the jobs spend too long in the coaster
>>> >> service queue?
>>> >
>>> > No. The coaster protocol requires each command sent on a channel to be
>>> > acknowledged (pretty much like TCP does). Either the worker was very busy
>>> > (unlikely by design), or it had a fault that disturbed its main event loop,
>>> > or there was an actual networking problem (also unlikely).
>>> >
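To make the acknowledgement scheme described here concrete: the sketch below is
not the coaster implementation (the subroutine names and the timeout value are
only illustrative), but it shows the idea that every command sent on a channel
is remembered until its reply arrives, and a periodic check turns any command
that has waited past the deadline into a reply timeout. A worker whose event
loop stalls stops acknowledging, so every command in flight to it eventually
trips this check.

  #!/usr/bin/perl
  # Minimal sketch of command acknowledgement with a reply timeout.
  use strict;
  use warnings;

  my $REPLY_TIMEOUT = 180;     # seconds until a missing ack counts as an error
  my %pending;                 # command tag -> time the command was sent

  sub send_command {
      my ($tag, $cmd) = @_;
      $pending{$tag} = time();
      # ... actually write $cmd to the channel here ...
  }

  sub reply_received {
      my ($tag) = @_;
      delete $pending{$tag};   # acknowledged in time
  }

  sub check_timeouts {
      my $now = time();
      for my $tag (keys %pending) {
          next if $now - $pending{$tag} <= $REPLY_TIMEOUT;
          delete $pending{$tag};
          warn "command $tag: no acknowledgement within ${REPLY_TIMEOUT}s\n";
      }
  }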
>>> >>
>>> >>
>>> >> I did the same workflow on PADS only (the site throttle means it receives
>>> >> a maximum of 400 jobs). I got the same errors at some point, when my
>>> >> workers failed in less time than the timeout period:
>>> >>
>>> >> The last line shows the worker.pl message when it exited:
>>> >>
>>> >> rmdir
>>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111/5
>>> >> rmdir
>>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111
>>> >> rmdir
>>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations
>>> >> unlink
>>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/wrapper.log
>>> >> unlink
>>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/stdout.txt
>>> >> rmdir
>>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k
>>> >> Failed to process data: at
>>> >> /home/aespinosa/swift/cogkit/modules/provider-coaster/resources/worker.pl
>>> >> line 639.
>>> >
>>> > I wish Perl had a stack trace. Can you enable TRACE on the worker,
>>> > re-run, and send me the log for the failing worker?
>>> >
>>> > Mihael
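On the stack-trace point, Perl can produce one: Carp's confess() dies with a
full backtrace, and installing it as a __DIE__ handler upgrades an existing
bare die() such as the one reported at worker.pl line 639. A small standalone
sketch, not a patch to worker.pl:

  #!/usr/bin/perl
  # Turn a plain die() into a backtrace by re-throwing through Carp::confess.
  # Perl disables the __DIE__ handler while it runs, so confess can die safely.
  use strict;
  use warnings;
  use Carp;

  $SIG{__DIE__} = sub { Carp::confess(@_) };

  sub inner { die "Failed to process data" }
  sub outer { inner() }

  outer();   # prints the message plus the full call stack to STDERR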


