[Swift-devel] Coaster provider staging timeout problems lingering

Yadu Nand yadudoc1729 at gmail.com
Sat Jun 1 14:37:23 CDT 2013


Mike,

Yes!

I see the same error string in my logs as well:
org.globus.cog.karajan.workflow.service.TimeoutException: Channel timed out. lastTime=130601-180549.457, now=130601-180750.058, channel=GSSSChannel-0920050760(1)[service-60881]
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:130)
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:121)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
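
As a quick check on how long the channel sat idle before the exception fired, the lastTime and now fields can be subtracted. The sketch below is only illustrative: it assumes the timestamps follow a yyMMdd-HHmmss.SSS layout (inferred from the values in the logs, not confirmed against the CoG source), and the helper name idle_seconds is made up for this example.

```python
from datetime import datetime

# Assumed layout of the lastTime/now fields: yyMMdd-HHmmss.SSS
# (inferred from the log values, not taken from the CoG source).
FMT = "%y%m%d-%H%M%S.%f"


def idle_seconds(last_time: str, now: str) -> float:
    """Return the seconds elapsed between the two timestamps in a TimeoutException message."""
    delta = datetime.strptime(now, FMT) - datetime.strptime(last_time, FMT)
    return delta.total_seconds()


if __name__ == "__main__":
    # Values from the exception in this message (Beagle run).
    print(idle_seconds("130601-180549.457", "130601-180750.058"))  # about 120.6
    # Values from Mike's March trace on UC3, quoted further down.
    print(idle_seconds("130312-084030.762", "130312-084231.763"))  # about 121.0
```

Both gaps come out at roughly two minutes, which would be consistent with a fixed idle timeout of about 120 seconds on the channel, though that is an inference from these two samples only.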

What I find interesting is that this bug goes silent with fewer app
invocations.
When I requested just one worker on Beagle with debug error logging,
I did not see any timeouts, presumably because with a single worker
there is no need to stage files in parallel. All these details are in
the comments on Bug 1006 (https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=1006)

Mihael, I hope these logs and tests help. Let me know if you need anything more.

-Yadu

On Fri, May 31, 2013 at 11:29 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> I heard from David a while back, and from Yadu this week, that coaster provider staging timeouts remain a problem, albeit a far less frequent one.
>
> Yadu seems to have a test (which he is polishing) that can reproduce the problem readily.
>
> Yadu, do the coaster provider staging timeouts that you are seeing on the ex-search app match what I encountered back in March on OSG via UC3 (below)?
>
> Please either file a ticket asap and assign to Mihael or locate an existing ticket for this bug and add more logs and incidents to it.
>
> David, can you tell us what your impression of the remaining provider staging failure scenarios is, ideally via an existing ticket if there's one open for this?
>
> Thanks,
>
> - Mike
>
> ----- Forwarded Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Tuesday, March 12, 2013 8:51:01 AM
> Subject: [Swift-devel] Coaster run to UC3 dies with channel timeout
>
> This demo (for OSG all-hands) was running fairly reliably, 100's to a few thousand 30-second tasks to UC3 with flocking to OSG and other pools.
>
> But just got a failure, so it looks like sporadic problems remain.
>
> Running Swift 0.94 latest rev.
>
> Log is on midway in:
>
> /home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
> -rw-rw-r-- 1 wilde wilde 11632001 Mar 12 08:42 saved/modis-20130312-1335-p30ylps9.log
>
> I'll file a ticket once we get a sense of the frequency.
>
> - Mike
>
>
>
> Progress:  time: Tue, 12 Mar 2013 13:42:28 +0000  Selecting site:461  Stage in:10  Submitted:782  Active:204  Stage out:4  Finished successfully:1539
> Progress:  time: Tue, 12 Mar 2013 13:42:29 +0000  Selecting site:453  Stage in:6  Submitted:779  Active:215  Finished successfully:1547
> Progress:  time: Tue, 12 Mar 2013 13:42:30 +0000  Selecting site:439  Stage in:16  Submitting:1  Submitted:776  Active:204  Stage out:2  Finished successfully:1562
> Execution failed:
>         Exception in perl:
>     Arguments: [getlanduse.pl, input/h06v33.rgb]
>     Host: uc3
>     Directory: modis-20130312-1335-p30ylps9/jobs/7/perl-7s7qvh6l
>
> Caused by:
>         Task failed: null
> org.globus.cog.karajan.workflow.service.TimeoutException: Channel timed out. lastTime=130312-084030.762, now=130312-084231.763, channel=TCP-0312-3508510-000259-000000
>         at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
>         at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:122)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
>
>         getLandUse, modis.swift, line 24
> swift$ pwd
> /home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
> swift$ ls
> cf              input/    modis-20130312-1335-p30ylps9.0.rlog  modis-20130312-1335-p30ylps9.log  saved/     tc
> getlanduse.pl*  landuse/  modis-20130312-1335-p30ylps9.d/      run*                              swift.log  uc3.xml
> swift$ e ../save
> swift$ save
> swift$ ls saved
> modis-20130312-1326-n9rofj6e.d/   modis-20130312-1329-f2a2eic4.log     modis-20130312-1335-p30ylps9.log
> modis-20130312-1326-n9rofj6e.log  modis-20130312-1335-p30ylps9.0.rlog  swift.log
> modis-20130312-1329-f2a2eic4.d/   modis-20130312-1335-p30ylps9.d/
> swift$ ls saved/modis-20130312-1335-p30ylps9.log
> saved/modis-20130312-1335-p30ylps9.log
> swift$ pwd; ls -l saved/modis-20130312-1335-p30ylps9.log
> /home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
> -rw-rw-r-- 1 wilde wilde 11632001 Mar 12 08:42 saved/modis-20130312-1335-p30ylps9.log
> swift$
>
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel



-- 
Yadu Nand B


