[Swift-devel] Some remote workers + provider staging logs (ReplyTimeouts on large workflows)

Mihael Hategan hategan at mcs.anl.gov
Thu Dec 30 00:01:50 CST 2010


On Wed, 2010-12-29 at 15:28 -0600, Allan Espinosa wrote:

> Does the timeout occur because the jobs sit too long in the coaster
> service queue?

No. The coaster protocol requires each command sent on a channel to be
acknowledged (much like TCP does). Either the worker was very busy
(unlikely by design), or it hit a fault that disturbed its main event
loop, or there was an actual networking problem (also unlikely).
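To make the failure mode concrete, here is a rough sketch (hypothetical,
not the actual coaster code; the timeout value, tag names and sub names
are made up) of the sending side's bookkeeping: it remembers when each
command went out and flags a ReplyTimeout if no acknowledgment comes
back in time. If the worker's event loop stalls, acks stop flowing and
every outstanding command eventually times out this way.

  #!/usr/bin/env perl
  # Sketch only: track outstanding commands and flag a ReplyTimeout
  # when no acknowledgment arrives within the deadline.
  use strict;
  use warnings;

  my $REPLY_TIMEOUT = 180;   # seconds; hypothetical value
  my %pending;               # command tag => time the command was sent

  # Called when a command is written to the channel.
  sub command_sent {
      my ($tag) = @_;
      $pending{$tag} = time();
  }

  # Called when the matching acknowledgment comes back.
  sub reply_received {
      my ($tag) = @_;
      delete $pending{$tag};
  }

  # Run periodically from the event loop on the sending side.
  sub check_timeouts {
      my $now = time();
      for my $tag (keys %pending) {
          if ($now - $pending{$tag} > $REPLY_TIMEOUT) {
              warn "ReplyTimeout: no ack for command $tag\n";
              delete $pending{$tag};
          }
      }
  }

  # Example: a command goes out but is never acknowledged.
  command_sent("cmd-42");
  check_timeouts();   # silent now; warns once $REPLY_TIMEOUT has elapsed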

> 
> 
> I did the same workflow on PADS only (the site throttle limits it to
> a maximum of 400 jobs).  I got the same errors at the point where my
> workers failed, in less time than the timeout period:
> 
> The last line shows the worker.pl message when it exited:
> 
> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111/5
> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111
> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations
> unlink /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/wrapper.log
> unlink /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/stdout.txt
> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k
> Failed to process data:  at
> /home/aespinosa/swift/cogkit/modules/provider-coaster/resources/worker.pl
> line 639.

I wish Perl produced a stack trace here. Can you enable TRACE on the
worker, re-run, and send me the log from the failing worker?
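
For what it's worth, something like this near the top of worker.pl
(untested sketch, using the core Carp module) would turn that bare die
into a full backtrace:

  # Sketch: make any uncaught die print a backtrace instead of just
  # "Failed to process data: at worker.pl line 639."
  use Carp ();
  $SIG{__DIE__} = sub { Carp::confess(@_) };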

Mihael




