[Swift-devel] Some remote workers + provider staging logs (ReplyTimeouts on large workflows)

Michael Wilde wilde at mcs.anl.gov
Thu Dec 30 12:51:43 CST 2010


Hi Allan,

It would be good to get client, service and worker logs for a reasonably small failing case - I suspect Mihael could diagnose the problem from that.

I will try to join you by Skype at 2 PM, if that's convenient for you and Dan.

- Mike


----- Original Message -----
> I redid the OSG run with only 1 worker per coaster service and the
> same workflow finished without problems. I'll investigate if there
> are problems on multiple workers by making a testbed case in PADS as
> well.
> 
> 2010/12/30 Mihael Hategan <hategan at mcs.anl.gov>:
> > On Wed, 2010-12-29 at 15:28 -0600, Allan Espinosa wrote:
> >
> >> Does the timeout occur from the jobs being too long in the coaster
> >> service queue?
> >
> > No. The coaster protocol requires each command sent on a channel to be
> > acknowledged (pretty much like TCP does). Either the worker was very busy
> > (unlikely by design), or it has a fault that disturbed its main event
> > loop, or there was an actual networking problem (also unlikely).
> >
> >>
> >>
> >> I ran the same workflow on PADS only (the site throttle limits it to a
> >> maximum of 400 jobs). I got the same errors at some point, when my
> >> workers failed before the timeout period had elapsed:
> >>
> >> The last line shows the worker.pl message when it exited:
> >>
> >> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111/5
> >> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111
> >> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations
> >> unlink /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/wrapper.log
> >> unlink /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/stdout.txt
> >> rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k
> >> Failed to process data: at /home/aespinosa/swift/cogkit/modules/provider-coaster/resources/worker.pl line 639.
> >
> > I wish perl had a stack trace. Can you enable TRACE on the worker,
> > re-run, and send me the log for the failing worker?
> >
> > Mihael
> >
> >
> >
> >
> 
> 
> 
> --
> Allan M. Espinosa <http://amespinosa.wordpress.com>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
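
For anyone following the thread, here is a rough sketch of the idea Mihael
describes above: every command sent on a channel has to be acknowledged, and a
missing acknowledgement shows up as a reply timeout. This is not the coaster
code. The class names, the "JOB" command, the tag scheme and the timeout
values are all made up for illustration, and the "worker" is just an
in-process stand-in.

```python
import queue


class ReplyTimeout(Exception):
    """Raised when no acknowledgement arrives within the timeout."""


class Channel:
    """Toy in-memory channel (hypothetical, not the coaster implementation).

    Each command gets a tag; the peer must acknowledge that tag.
    """

    def __init__(self, peer_responsive=True):
        self._acks = queue.Queue()
        self._peer_responsive = peer_responsive
        self._next_tag = 0

    def _peer_handle(self, tag):
        # Stand-in for the worker's event loop. A real peer would be a
        # remote process; here it is called inline. A stalled or crashed
        # worker is modelled by simply never acknowledging.
        if self._peer_responsive:
            self._acks.put(tag)

    def send_command(self, command, timeout=0.2):
        tag = self._next_tag
        self._next_tag += 1
        self._peer_handle(tag)
        try:
            # Block until the acknowledgement arrives or the timeout expires.
            ack = self._acks.get(timeout=timeout)
        except queue.Empty:
            raise ReplyTimeout(
                f"no reply to {command!r} (tag {tag}) within {timeout}s"
            )
        if ack != tag:
            raise RuntimeError(f"unexpected ack {ack} for tag {tag}")
        return ack


if __name__ == "__main__":
    healthy = Channel(peer_responsive=True)
    print("acknowledged tag", healthy.send_command("JOB"))

    stalled = Channel(peer_responsive=False)
    try:
        stalled.send_command("JOB", timeout=0.05)
    except ReplyTimeout as err:
        print("ReplyTimeout:", err)
```

The point of the sketch is only that the timeout is measured from the moment a
command is sent, not from how long a job sits in a queue, so a worker whose
event loop has stopped produces a timeout even on an otherwise idle channel.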

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
