[Swift-devel] Coaster provider staging timeout problems lingering

Michael Wilde wilde at mcs.anl.gov
Fri May 31 11:29:36 CDT 2013


I heard from David a while back, and from Yadu this week, that coaster provider staging timeouts remain a problem, albeit a far less frequent one.

Yadu seems to have a test (which he is polishing) that can reproduce the problem readily.

Yadu, do the coaster provider staging timeouts that you are seeing on the ex-search app match what I encountered back in March on OSG via UC3 (below)?

Please either file a ticket asap and assign it to Mihael, or locate an existing ticket for this bug and add more logs and incidents to it.

David, can you tell us what your impression of the remaining provider staging failure scenarios is, ideally via an existing ticket if there's one open for this?

Thanks,

- Mike

----- Forwarded Message -----
From: "Michael Wilde" <wilde at mcs.anl.gov>
To: "Mihael Hategan" <hategan at mcs.anl.gov>
Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
Sent: Tuesday, March 12, 2013 8:51:01 AM
Subject: [Swift-devel] Coaster run to UC3 dies with channel timeout

This demo (for the OSG all-hands) was running fairly reliably, sending hundreds to a few thousand 30-second tasks to UC3 with flocking to OSG and other pools.

But I just got a failure, so it looks like sporadic problems remain.

I'm running the latest revision of Swift 0.94.

Log is on midway in:

/home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
-rw-rw-r-- 1 wilde wilde 11632001 Mar 12 08:42 saved/modis-20130312-1335-p30ylps9.log

I'll file a ticket once we get a sense of the frequency.

- Mike



Progress:  time: Tue, 12 Mar 2013 13:42:28 +0000  Selecting site:461  Stage in:10  Submitted:782  Active:204  Stage out:4  Finished successfully:1539
Progress:  time: Tue, 12 Mar 2013 13:42:29 +0000  Selecting site:453  Stage in:6  Submitted:779  Active:215  Finished successfully:1547
Progress:  time: Tue, 12 Mar 2013 13:42:30 +0000  Selecting site:439  Stage in:16  Submitting:1  Submitted:776  Active:204  Stage out:2  Finished successfully:1562
Execution failed:
	Exception in perl:
    Arguments: [getlanduse.pl, input/h06v33.rgb]
    Host: uc3
    Directory: modis-20130312-1335-p30ylps9/jobs/7/perl-7s7qvh6l

Caused by:
	Task failed: null
org.globus.cog.karajan.workflow.service.TimeoutException: Channel timed out. lastTime=130312-084030.762, now=130312-084231.763, channel=TCP-0312-3508510-000259-000000
	at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
	at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:122)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)

	getLandUse, modis.swift, line 24
swift$ pwd
/home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
swift$ ls
cf		input/	  modis-20130312-1335-p30ylps9.0.rlog  modis-20130312-1335-p30ylps9.log  saved/     tc
getlanduse.pl*	landuse/  modis-20130312-1335-p30ylps9.d/      run*				 swift.log  uc3.xml
swift$ e ../save
swift$ save
swift$ ls saved
modis-20130312-1326-n9rofj6e.d/   modis-20130312-1329-f2a2eic4.log     modis-20130312-1335-p30ylps9.log
modis-20130312-1326-n9rofj6e.log  modis-20130312-1335-p30ylps9.0.rlog  swift.log
modis-20130312-1329-f2a2eic4.d/   modis-20130312-1335-p30ylps9.d/
swift$ ls saved/modis-20130312-1335-p30ylps9.log
saved/modis-20130312-1335-p30ylps9.log
swift$ pwd; ls -l saved/modis-20130312-1335-p30ylps9.log
/home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
-rw-rw-r-- 1 wilde wilde 11632001 Mar 12 08:42 saved/modis-20130312-1335-p30ylps9.log
swift$ 
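
When comparing incidents like this one, it may help to pull the idle interval out of the "Channel timed out" message. The following is only a rough sketch in Python: the timestamp layout (yymmdd-HHMMSS.mmm) is inferred from how the message looks in the trace above, not from the Karajan source, and the function name idle_seconds is made up for illustration.

```python
import re
from datetime import datetime

# Message text copied from the trace above.
MESSAGE = (
    "Channel timed out. lastTime=130312-084030.762, "
    "now=130312-084231.763, channel=TCP-0312-3508510-000259-000000"
)

# Assumed timestamp layout: yymmdd-HHMMSS.mmm (inferred from the trace only).
STAMP_FORMAT = "%y%m%d-%H%M%S.%f"

PATTERN = re.compile(
    r"lastTime=(?P<last>[\d.-]+), now=(?P<now>[\d.-]+), channel=(?P<channel>\S+)"
)


def idle_seconds(message):
    """Return (channel, seconds between lastTime and now), or None if no match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    last = datetime.strptime(match.group("last"), STAMP_FORMAT)
    now = datetime.strptime(match.group("now"), STAMP_FORMAT)
    return match.group("channel"), (now - last).total_seconds()


if __name__ == "__main__":
    print(idle_seconds(MESSAGE))
```

For the message above, the gap between lastTime and now comes out to roughly 121 seconds, which would be consistent with a timeout threshold of about two minutes, though the configured value is not shown in this log.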


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

