[Swift-devel] Some remote workers + provider staging logs (ReplyTimeouts on large workflows)

Allan Espinosa aespinosa at cs.uchicago.edu
Wed Dec 29 15:28:08 CST 2010


I ran cat jobs (with a 2.3 MB data file) on 6 remote sites.  The coaster
service runs on communicado.  I attached the workflow log file and the
service log file of one of the services that shows the exception.
The sites file is coaster_osg.xml

provider.staging=true (default proxy)

Snippet of error messages (log):

10.000(0.039):2623/3 overload: 1, 0.1
2010-12-29 14:52:34,092-0600 INFO  vdl:execute Exception in cat:
Arguments: [RuptureVariations/100/5/100_5.txt.variation-s0004-h0005]
Host: USCMS-FNAL-WC1__cmsosgce3.fnal.gov
Directory: catsall-20101229-1449-7rs3j584/jobs/2/cat-239zrp3kTODO: outs
----

Caused by: Job failed with an exit code of 521
Caused by: org.globus.cog.abstraction.impl.common.execution.JobException:
Job failed with an exit code of 521
2010-12-29 14:52:34,092-0600 DEBUG WeightedHostScoreScheduler
multiplyScore(USCMS-FNAL-WC1__cmsosgce3.fnal.gov:-9.900(0.039):2623/3
overload: 1, -0.5)

from service log:

Congestion queue size: 0
Plan time: 1
Sender 315976503 queue size: 0
Command(5, GET): handling reply timeout;
sendReqTime=101229-144933.031, sendTime=101229-144933.033,
now=101229-145133.036
Command(5, GET): re-sending
Command(5, GET)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:283)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:288)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Sending Command(5, GET) on MetaChannel: 1286943672[832416103: {}] ->
GSSSChannel-null(3)[832416103: {}]
Command(28, GET): handling reply timeout;
sendReqTime=101229-144933.109, sendTime=101229-144933.126,
now=101229-145133.129
Command(28, GET): re-sending
Command(28, GET)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:283)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:288)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Sending Command(28, GET) on MetaChannel: 1286943672[832416103: {}] ->
GSSSChannel-null(3)[832416103: {}]
Command(29, GET): handling reply timeout;
sendReqTime=101229-144933.109, sendTime=101229-144933.127,
now=101229-145133.136
...
...
...
USCMS-FNAL-WC1__cmsosgce3.fnal.gov:101 pull
Sending Command(1, SUBMITJOB) on SC-USCMS-FNAL-WC1__cmsosgce3.fnal.gov-000101
USCMS-FNAL-WC1__cmsosgce3.fnal.gov:102 pull
java.lang.IllegalStateException: Timer already cancelled.
        at java.util.Timer.sched(Timer.java:354)
        at java.util.Timer.schedule(Timer.java:170)
        at org.globus.cog.karajan.workflow.service.commands.Command.setupReplyTimeoutChecker(Command.java:156)
        at org.globus.cog.karajan.workflow.service.commands.Command.dataSent(Command.java:150)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:253)
Sending Command(1, SUBMITJOB) on SC-USCMS-FNAL-WC1__cmsosgce3.fnal.gov-000102
Sending Command(682, JOBSTATUS) on GSSSChannel-null(3)[832416103: {}]
java.lang.IllegalStateException: Timer already cancelled.
        at java.util.Timer.sched(Timer.java:354)
        at java.util.Timer.schedule(Timer.java:170)
        at org.globus.cog.karajan.workflow.service.commands.Command.setupReplyTimeoutChecker(Command.java:156)
        at org.globus.cog.karajan.workflow.service.commands.Command.dataSent(Command.java:150)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:253)
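
As a quick check on the reply-timeout lines in the service log above,
here is a minimal Python sketch that differences sendTime and now for
the three commands shown.  The timestamp layout (yyMMdd-HHmmss.SSS) is
my reading of the log format, and the helper name elapsed_seconds is
made up for this example.

```python
from datetime import datetime

# Assumed layout of the service-log timestamps: yyMMdd-HHmmss.SSS
FMT = "%y%m%d-%H%M%S.%f"


def elapsed_seconds(send_time: str, now: str) -> float:
    """Seconds between a command's sendTime and the 'now' of its timeout."""
    start = datetime.strptime(send_time, FMT)
    end = datetime.strptime(now, FMT)
    return (end - start).total_seconds()


# (sendTime, now) pairs copied from the service log excerpt
samples = {
    5: ("101229-144933.033", "101229-145133.036"),
    28: ("101229-144933.126", "101229-145133.129"),
    29: ("101229-144933.127", "101229-145133.136"),
}

for cmd, (sent, now) in samples.items():
    print(f"Command({cmd}, GET): {elapsed_seconds(sent, now):.3f} s")
```

All three differences come out at roughly 120 seconds, which would be
consistent with a fixed two-minute reply timeout rather than a value
that depends on the individual job.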


The site USCMS-FNAL-WC1 has a throttle of 68.86, i.e. a capacity of
about 6888 jobs.  But currently it only has 100 workers available.  The
log reports that it received 2.6k jobs from the workflow.
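
To put those numbers side by side, here is a rough Python sketch.  It
assumes the usual Swift rule that a site accepts jobThrottle * 100 + 1
concurrent jobs (that formula is an assumption on my part, not something
taken from the logs), and it uses the approximate figure of 2600 jobs.
The function names are made up for this example.

```python
import math


def job_capacity(job_throttle: float) -> int:
    # Assumption: capacity = jobThrottle * 100 + 1
    return round(job_throttle * 100) + 1


def waves_needed(jobs: int, workers: int) -> int:
    # Number of rounds needed if every worker runs one job at a time
    return math.ceil(jobs / workers)


capacity = job_capacity(68.86)
waves = waves_needed(2600, 100)

print(f"throttle capacity: {capacity} jobs")
print(f"rounds to drain 2600 jobs on 100 workers: {waves}")
```

Under that assumed formula the capacity works out to 6887, within one
of the figure quoted above, and the jobs already sent to the site would
need about 26 rounds on 100 workers.  So the throttle lets far more
jobs reach the service than the workers can start right away.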

Does the timeout occur because the jobs sit too long in the coaster
service queue?


I ran the same workflow on PADS only (the site throttle makes it receive
a maximum of 400 jobs).  I got the same errors at some point, when
my workers failed before the timeout period had elapsed:

The last line shows the worker.pl message when it exited:

rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111/5
rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111
rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations
unlink /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/wrapper.log
unlink /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/stdout.txt
rmdir /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k
Failed to process data:  at
/home/aespinosa/swift/cogkit/modules/provider-coaster/resources/worker.pl
line 639.


-Allan

-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: catsall-20101229-1449-7rs3j584.log.gz
Type: application/x-gzip
Size: 1053260 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101229/b345d884/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: service-6.log.gz
Type: application/x-gzip
Size: 171561 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101229/b345d884/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: coaster_osg.xml
Type: text/xml
Size: 4368 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101229/b345d884/attachment.xml>

