[Swift-devel] Coaster provider staging timeout problems lingering
Michael Wilde
wilde at mcs.anl.gov
Fri May 31 11:29:36 CDT 2013
I heard from David a while back, and from Yadu this week, that coaster provider staging timeouts remain a problem, albeit a far less frequent one.
Yadu seems to have a test (which he is polishing) that can reproduce the problem readily.
Yadu, do the coaster provider staging timeouts that you are seeing on the ex-search app match what I encountered back in March on OSG via UC3 (below)?
Please either file a ticket as soon as possible and assign it to Mihael, or locate an existing ticket for this bug and add more logs and incidents to it.
David, can you tell us what your impression of the remaining provider staging failure scenarios is, ideally via an existing ticket if there's one open for this?
Thanks,
- Mike
----- Forwarded Message -----
From: "Michael Wilde" <wilde at mcs.anl.gov>
To: "Mihael Hategan" <hategan at mcs.anl.gov>
Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
Sent: Tuesday, March 12, 2013 8:51:01 AM
Subject: [Swift-devel] Coaster run to UC3 dies with channel timeout
This demo (for the OSG all-hands) was running fairly reliably, sending hundreds to a few thousand 30-second tasks to UC3 with flocking to OSG and other pools.
But I just got a failure, so it looks like sporadic problems remain.
I'm running the latest revision of Swift 0.94.
The log is on Midway in:
/home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
-rw-rw-r-- 1 wilde wilde 11632001 Mar 12 08:42 saved/modis-20130312-1335-p30ylps9.log
I'll file a ticket once we get a sense of the frequency.
- Mike
Progress: time: Tue, 12 Mar 2013 13:42:28 +0000 Selecting site:461 Stage in:10 Submitted:782 Active:204 Stage out:4 Finished successfully:1539
Progress: time: Tue, 12 Mar 2013 13:42:29 +0000 Selecting site:453 Stage in:6 Submitted:779 Active:215 Finished successfully:1547
Progress: time: Tue, 12 Mar 2013 13:42:30 +0000 Selecting site:439 Stage in:16 Submitting:1 Submitted:776 Active:204 Stage out:2 Finished successfully:1562
Execution failed:
Exception in perl:
Arguments: [getlanduse.pl, input/h06v33.rgb]
Host: uc3
Directory: modis-20130312-1335-p30ylps9/jobs/7/perl-7s7qvh6l
Caused by:
Task failed: null
org.globus.cog.karajan.workflow.service.TimeoutException: Channel timed out. lastTime=130312-084030.762, now=130312-084231.763, channel=TCP-0312-3508510-000259-000000
at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:122)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
getLandUse, modis.swift, line 24
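To see how long the channel had been quiet when the timeout fired, the two timestamps in the exception message can be subtracted. The following is a minimal sketch, not part of Swift or CoG; it assumes the stamps use a yymmdd-HHMMSS.mmm layout, which is inferred from the message above rather than confirmed, and the function name is made up for illustration.

```python
from datetime import datetime

# Assumed layout of lastTime/now in the exception text: yymmdd-HHMMSS.mmm
STAMP_FORMAT = "%y%m%d-%H%M%S.%f"


def channel_idle_seconds(last_time: str, now: str) -> float:
    """Return the seconds elapsed between the two timestamps (hypothetical helper)."""
    last = datetime.strptime(last_time, STAMP_FORMAT)
    current = datetime.strptime(now, STAMP_FORMAT)
    return (current - last).total_seconds()


# Values copied from the TimeoutException message above.
idle = channel_idle_seconds("130312-084030.762", "130312-084231.763")
print(f"channel idle for {idle:.3f} s")
```

Under that assumption the gap comes out at about 121 seconds, which suggests the channel saw no traffic for roughly two minutes before the check fired; the actual configured timeout value is not stated in this message.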
swift$ pwd
/home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
swift$ ls
cf input/ modis-20130312-1335-p30ylps9.0.rlog modis-20130312-1335-p30ylps9.log saved/ tc
getlanduse.pl* landuse/ modis-20130312-1335-p30ylps9.d/ run* swift.log uc3.xml
swift$ e ../save
swift$ save
swift$ ls saved
modis-20130312-1326-n9rofj6e.d/ modis-20130312-1329-f2a2eic4.log modis-20130312-1335-p30ylps9.log
modis-20130312-1326-n9rofj6e.log modis-20130312-1335-p30ylps9.0.rlog swift.log
modis-20130312-1329-f2a2eic4.d/ modis-20130312-1335-p30ylps9.d/
swift$ ls saved/modis-20130312-1335-p30ylps9.log
saved/modis-20130312-1335-p30ylps9.log
swift$ pwd; ls -l saved/modis-20130312-1335-p30ylps9.log
/home/wilde/osgdemo/modis/svn/swiftdemo/test.uc3
-rw-rw-r-- 1 wilde wilde 11632001 Mar 12 08:42 saved/modis-20130312-1335-p30ylps9.log
swift$
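To get a sense of the frequency across the saved runs, a small script could count how often the timeout text shows up in each log. This is only a sketch: it assumes the "Channel timed out" text from the console output also appears in the .log files, which has not been checked here, and the function names are hypothetical.

```python
from pathlib import Path

# Text taken from the exception message in the console output.
MARKER = "Channel timed out"


def count_timeouts(lines):
    """Count the lines that contain the timeout marker."""
    return sum(1 for line in lines if MARKER in line)


def timeouts_per_log(directory):
    """Map each *.log file name in the directory to its number of timeout lines."""
    counts = {}
    for path in sorted(Path(directory).glob("*.log")):
        with path.open(errors="replace") as handle:
            counts[path.name] = count_timeouts(handle)
    return counts


# Small in-memory check using a line shaped like the exception above.
sample = [
    "Progress: time: Tue, 12 Mar 2013 13:42:30 +0000 Active:204",
    "org.globus.cog.karajan.workflow.service.TimeoutException: Channel timed out.",
]
print(count_timeouts(sample))
```

Calling timeouts_per_log("saved") from the test.uc3 directory would then report one count per saved log.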
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel