[Swift-devel] Some remote workers + provider staging logs (ReplyTimeouts on large workflows)
Mihael Hategan
hategan at mcs.anl.gov
Thu Dec 30 00:01:50 CST 2010
On Wed, 2010-12-29 at 15:28 -0600, Allan Espinosa wrote:
> Does the timeout occur because the jobs stay too long in the coaster
> service queue?
No. The coaster protocol requires each command sent on a channel to be
acknowledged (pretty much like TCP does). Either the worker was very
busy (unlikely by design), or it had a fault that disturbed its main
event loop, or there was an actual networking problem (also unlikely).
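
To illustrate the mechanism (a rough sketch only, not the actual coaster
code; the names REPLY_TIMEOUT, send_command, handle_ack and the tag
bookkeeping are made up for the example): the sender records every
outstanding command and raises a reply timeout if the acknowledgment does
not arrive within the deadline.

  use strict;
  use warnings;

  my $REPLY_TIMEOUT = 180;   # assumed deadline, in seconds
  my %pending;               # command tag -> time the command was sent

  # Sending a command registers it as awaiting an acknowledgment.
  sub send_command {
      my ($tag) = @_;
      $pending{$tag} = time();
  }

  # An acknowledgment for a tag clears the pending entry.
  sub handle_ack {
      my ($tag) = @_;
      delete $pending{$tag};
  }

  # Called periodically from the event loop: anything still pending past
  # the deadline is reported as a reply timeout.
  sub check_timeouts {
      my $now = time();
      for my $tag (keys %pending) {
          next if $now - $pending{$tag} <= $REPLY_TIMEOUT;
          delete $pending{$tag};
          warn "ReplyTimeout: command $tag was never acknowledged\n";
      }
  }

The point being that a worker whose event loop has stalled looks, to the
sender, exactly like a dead network link: the command goes out and the
acknowledgment never comes back.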
>
>
> I ran the same workflow on PADS only (the site throttle limits it to a
> maximum of 400 jobs). I got the same errors at the point where my
> workers failed in less time than the timeout period:
>
> The last line shows the worker.pl message when it exited:
>
> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111/5
> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111
> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations
> unlink /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/wrapper.log
> unlink /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/stdout.txt
> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k
> Failed to process data: at
> /home/aespinosa/swift/cogkit/modules/provider-coaster/resources/worker.pl
> line 639.
I wish Perl had a stack trace. Can you enable TRACE on the worker,
re-run, and send me the log for the failing worker?
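
If it helps, the worker log level can usually be raised from the site
definition; something along these lines (assuming your build supports the
workerLoggingLevel profile key):

  <profile namespace="globus" key="workerLoggingLevel">TRACE</profile>

The worker should then write its own log file, which is what I need to
see where line 639 gets hit.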
Mihael