[Swift-devel] Some remote workers + provider staging logs (ReplyTimeouts on large workflows)

Allan Espinosa aespinosa at cs.uchicago.edu
Tue Jan 4 04:53:18 CST 2011


I was finally able to replicate things in a non-OSG setting (well, most of it). I
ran 100 workers on bridled and it produced the errors I was expecting.  I'm now
binary searching to determine the worker count at which it starts failing (10 workers worked).
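The bisection over worker counts can be sketched as below. This is only an illustration of the search strategy; `run_fails` is a hypothetical stand-in for actually launching N workers and checking whether the run reproduces the error (it is not part of Swift or coasters).

```python
# Sketch of the bisection: find the smallest worker count at which the
# run fails, given a known-good count (10) and a known-bad count (100).
# `run_fails` is a hypothetical predicate standing in for a real run.

def find_failure_threshold(run_fails, lo=10, hi=100):
    """Return the smallest n in (lo, hi] for which run_fails(n) is True,
    assuming lo is known-good and hi is known-bad."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_fails(mid):
            hi = mid   # failure reproduced: threshold is <= mid
        else:
            lo = mid   # still works: threshold is > mid
    return hi

# Example with a synthetic predicate: pretend runs fail at >= 37 workers.
print(find_failure_threshold(lambda n: n >= 37))  # 37
```

Each probe halves the candidate range, so the threshold is found in about log2(90) ≈ 7 runs instead of 90.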


Attached is the trace of the 100-worker run; the client and service logs are also included.

-Allan

2010/12/30 Michael Wilde <wilde at mcs.anl.gov>:
> Hi Allan,
>
> It would be good to get client, service and worker logs for a reasonably small
> failing case - I suspect Mihael could diagnose the problem from that.
>
> I will try to join you by Skype at 2PM if that's convenient for you and Dan.
>
> - Mike
>
>
> ----- Original Message -----
>> I redid the OSG run with only 1 worker per coaster service, and the same
>> workflow finished without problems. I'll investigate whether the problem is
>> specific to multiple workers by building a testbed case on PADS as well.
>>
>> 2010/12/30 Mihael Hategan <hategan at mcs.anl.gov>:
>> > On Wed, 2010-12-29 at 15:28 -0600, Allan Espinosa wrote:
>> >
>> >> Does the timeout occur because the jobs sit too long in the coaster
>> >> service queue?
>> >
>> > No. The coaster protocol requires each command sent on a channel to be
>> > acknowledged (pretty much like TCP does). Either the worker was very busy
>> > (unlikely by design), or it had a fault that disturbed its main event loop,
>> > or there was an actual networking problem (also unlikely).
>> >
>> >>
>> >>
>> >> I ran the same workflow on PADS alone (the site throttle limits it to a
>> >> maximum of 400 jobs). I got the same errors at the point where my
>> >> workers failed, at a time shorter than the timeout period:
>> >>
>> >> The last line shows the worker.pl message when it exited:
>> >>
>> >> rmdir
>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111/5
>> >> rmdir
>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111
>> >> rmdir
>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations
>> >> unlink
>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/wrapper.log
>> >> unlink
>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/stdout.txt
>> >> rmdir
>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k
>> >> Failed to process data: at
>> >> /home/aespinosa/swift/cogkit/modules/provider-coaster/resources/worker.pl
>> >> line 639.
>> >
>> > I wish perl had a stack trace. Can you enable TRACE on the worker,
>> > re-run, and send me the log for the failing worker?
>> >
>> > Mihael
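For readers following the thread: the acknowledgement scheme Mihael describes above can be modeled roughly as below. This is an illustrative sketch only, assuming a per-command ack with a timeout; the `Channel` and `ReplyTimeout` names mirror the discussion, not the actual coaster source. A worker whose event loop stalls simply never sends the ack, which surfaces on the other side as a reply timeout.

```python
# Illustrative model (NOT the coaster implementation) of a channel where
# every command must be acknowledged, and a missing ack becomes a
# reply-timeout error on the sender's side.
import queue
import threading

class ReplyTimeout(Exception):
    pass

class Channel:
    def __init__(self):
        self._acks = queue.Queue()

    def ack(self, tag):
        # Called by the peer's event loop to acknowledge command `tag`.
        self._acks.put(tag)

    def send(self, tag, timeout=1.0):
        # In the real protocol the command goes over TCP; here we model
        # only the wait for the matching acknowledgement.
        try:
            got = self._acks.get(timeout=timeout)
        except queue.Empty:
            raise ReplyTimeout(f"no ack for command {tag}") from None
        assert got == tag
        return got

ch = Channel()

# Healthy worker: its event loop acknowledges promptly.
threading.Timer(0.1, ch.ack, args=(1,)).start()
print(ch.send(1))            # 1

# Worker with a stalled event loop: no ack ever arrives.
try:
    ch.send(2, timeout=0.3)
except ReplyTimeout as e:
    print(e)                 # no ack for command 2
```

In this model a busy or faulted worker is indistinguishable from a network failure, which matches the three possible causes listed in the reply.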



-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: timer-bug.tar.bz2
Type: application/octet-stream
Size: 6684316 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20110104/8f8dbad9/attachment.obj>

