[Swift-devel] Coaster provider staging timeout problems lingering
Yadu Nand
yadudoc1729 at gmail.com
Sat Jun 1 14:37:23 CDT 2013
Mike,
Yes!
I see the same error string in my logs as well:
org.globus.cog.karajan.workflow.service.TimeoutException: Channel timed out. lastTime=130601-180549.457, now=130601-180750.058, channel=GSSSChannel-0920050760(1)[service-60881]
at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:130)
at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:121)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
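
For reference, the gap between lastTime and now in these messages can be worked out directly. The following is a minimal Python sketch, assuming the timestamps use a yymmdd-HHMMSS.mmm layout (inferred from the log text in this thread, not from the Karajan source). The helper name idle_seconds is hypothetical. Both the Beagle message above and the March UC3 message quoted below show a gap of roughly two minutes.

```python
import re
from datetime import datetime

# Assumed timestamp layout (yymmdd-HHMMSS.mmm), inferred from the log lines
# in this thread rather than from the CoG/Karajan source.
STAMP = "%y%m%d-%H%M%S.%f"
PATTERN = re.compile(r"lastTime=(\d{6}-\d{6}\.\d{3}), now=(\d{6}-\d{6}\.\d{3})")


def idle_seconds(message):
    """Return the seconds between lastTime and now in a timeout message."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("no lastTime/now pair found")
    last, now = (datetime.strptime(s, STAMP) for s in match.groups())
    return (now - last).total_seconds()


# Timestamps copied from the two exceptions reported in this thread.
beagle = "Channel timed out. lastTime=130601-180549.457, now=130601-180750.058"
uc3 = "Channel timed out. lastTime=130312-084030.762, now=130312-084231.763"

print(idle_seconds(beagle))  # about 120.6 seconds
print(idle_seconds(uc3))     # about 121.0 seconds
```

The similar gaps suggest that both failures hit the same idle threshold, though the actual configured timeout value is not stated anywhere in these logs.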
What I find interesting is that this bug goes silent with fewer app
invocations.
When I requested just one worker on Beagle with debug logging enabled,
I did not see any timeouts. I think that is because a single worker does
not need to stage files in parallel. All of these details are in the comments
for Bug 1006 (https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=1006).
Mihael, I hope these logs and tests help. Let me know if you need anything more.
-Yadu
On Fri, May 31, 2013 at 11:29 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> I have heard from David a while back and Yadu this week that coaster provider staging timeouts remain a problem, albeit a far less frequent one.
>
> Yadu seems to have a test (which he is polishing) that can reproduce the problem readily.
>
> Yadu, do the coaster provider staging timeouts that you are seeing on the ex-search app match what I encountered back in March on OSG via UC3 (below)?
>
> Please either file a ticket asap and assign to Mihael or locate an existing ticket for this bug and add more logs and incidents to it.
>
> David, can you tell us what your impression of the remaining provider staging failure scenarios is, ideally via an existing ticket if one is open for this?
>
> Thanks,
>
> - Mike
>
> ----- Forwarded Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Tuesday, March 12, 2013 8:51:01 AM
> Subject: [Swift-devel] Coaster run to UC3 dies with channel timeout
>
> This demo (for OSG all-hands) was running fairly reliably, 100's to a few thousand 30-second tasks to UC3 with flocking to OSG and other pools.
>
> But just got a failure, so it looks like sporadic problems remain.
>
> Running Swift 0.94 latest rev.
>
> Log is on midway in:
>
> /home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
> -rw-rw-r-- 1 wilde wilde 11632001 Mar 12 08:42 saved/modis-20130312-1335-p30ylps9.log
>
> I'll file a ticket once we get a sense of the frequency.
>
> - Mike
>
>
>
> Progress: time: Tue, 12 Mar 2013 13:42:28 +0000 Selecting site:461 Stage in:10 Submitted:782 Active:204 Stage out:4 Finished successfully:1539
> Progress: time: Tue, 12 Mar 2013 13:42:29 +0000 Selecting site:453 Stage in:6 Submitted:779 Active:215 Finished successfully:1547
> Progress: time: Tue, 12 Mar 2013 13:42:30 +0000 Selecting site:439 Stage in:16 Submitting:1 Submitted:776 Active:204 Stage out:2 Finished successfully:1562
> Execution failed:
> Exception in perl:
> Arguments: [getlanduse.pl, input/h06v33.rgb]
> Host: uc3
> Directory: modis-20130312-1335-p30ylps9/jobs/7/perl-7s7qvh6l
>
> Caused by:
> Task failed: null
> org.globus.cog.karajan.workflow.service.TimeoutException: Channel timed out. lastTime=130312-084030.762, now=130312-084231.763, channel=TCP-0312-3508510-000259-000000
> at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
> at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:122)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
>
> getLandUse, modis.swift, line 24
> swift$ pwd
> /home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
> swift$ ls
> cf input/ modis-20130312-1335-p30ylps9.0.rlog modis-20130312-1335-p30ylps9.log saved/ tc
> getlanduse.pl* landuse/ modis-20130312-1335-p30ylps9.d/ run* swift.log uc3.xml
> swift$ e ../save
> swift$ save
> swift$ ls saved
> modis-20130312-1326-n9rofj6e.d/ modis-20130312-1329-f2a2eic4.log modis-20130312-1335-p30ylps9.log
> modis-20130312-1326-n9rofj6e.log modis-20130312-1335-p30ylps9.0.rlog swift.log
> modis-20130312-1329-f2a2eic4.d/ modis-20130312-1335-p30ylps9.d/
> swift$ ls saved/modis-20130312-1335-p30ylps9.log
> saved/modis-20130312-1335-p30ylps9.log
> swift$ pwd; ls -l saved/modis-20130312-1335-p30ylps9.log
> /home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
> -rw-rw-r-- 1 wilde wilde 11632001 Mar 12 08:42 saved/modis-20130312-1335-p30ylps9.log
> swift$
>
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
--
Yadu Nand B