<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi Jonathan,<br>
<br>
I believe some of the timeout issues seen in your logs
are fixed, or at least less likely, in trunk,<br>
so I recommend trying a run with it. I've also
converted your swift.properties to<br>
the new swift.conf format. You can get a tested .conf file, along
with a small test case, here:<br>
<br>
<a class="moz-txt-link-freetext" href="http://users.rcc.uchicago.edu/~yadunand/test_configs_package.tar.gz">http://users.rcc.uchicago.edu/~yadunand/test_configs_package.tar.gz</a><br>
<br>
Here are the changes I've made to the conf:<br>
Set lazyErrors: true and executionRetries: 0 so that long-running jobs
are not retried.<br>
Set staging to direct, since you are running on the shared FS.<br>
Added worker logging and an app definition for debugging.<br>
<br>
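As a rough sketch only, the settings listed above might look like this in swift.conf. The site name "midway" is an assumption and the execution block is omitted; the tested .conf file in the package linked above is the authoritative version:<br>
<pre>
# Sketch only: the site name is assumed, and the execution block is left out.
site.midway {
    staging: "direct"
}

lazyErrors: true
executionRetries: 0
</pre>
<br>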
You can get the latest trunk build here:
<a class="moz-txt-link-freetext" href="http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz">http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz</a><br>
<br>
Thanks,<br>
Yadu<br>
<br>
<div class="moz-cite-prefix">On 12/03/2014 01:16 PM, Jonathan Ozik
wrote:<br>
</div>
<blockquote
cite="mid:040074E2-ADC1-45C0-8580-9926B8E64535@gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div class="" style="word-wrap:break-word">Hi Yadu,
<div class=""><br class="">
</div>
<div class="">The tar.gz archive is here: <a
moz-do-not-send="true"
href="https://www.dropbox.com/s/tt3ewapzaf0ygac/run001.tar.gz?dl=0"
class="">https://www.dropbox.com/s/tt3ewapzaf0ygac/run001.tar.gz?dl=0</a></div>
<div class="">I’m also attaching the swift.properties file that
I used below.</div>
<div class=""><br class="">
</div>
<div class="">Thank you,</div>
<div class=""><br class="">
</div>
<div class="">Jonathan</div>
</div>
<div class="" style="word-wrap:break-word">
<div class=""><br class="">
<div>
<blockquote type="cite" class="">
<div class="">On Dec 3, 2014, at 11:04 AM, Yadu Nand
Babuji &lt;<a moz-do-not-send="true"
href="mailto:yadunand@uchicago.edu" class="">yadunand@uchicago.edu</a>&gt;
wrote:</div>
<br class="x_Apple-interchange-newline">
<div class="">Hi Jonathan,<br class="">
<br class="">
The issue you are seeing sounds pretty close to what
David reported a <br class="">
while back.<br class="">
Could you send us a tarball of your run directory from
a failed run?<br class="">
<br class="">
Could you also check whether you've set lowOverAllocation and
<br class="">
highOverAllocation in your sites definition?<br
class="">
<br class="">
Thanks,<br class="">
Yadu<br class="">
<br class="">
On 12/03/2014 10:50 AM, Ozik, Jonathan wrote:<br
class="">
<blockquote type="cite" class="">Hi all,<br class="">
<br class="">
I’m trying to run a large set of simulations on Midway
using Swift 0.95-RC5.<br class="">
768 of the 2187 tasks completed successfully and then
I got the exception:<br class="">
<br class="">
<span class="x_Apple-tab-span" style="white-space:pre"></span>exception
@ swift-int.k, line: 530<br class="">
Caused by: Block task failed: Connection to worker
lost<br class="">
org.globus.cog.coaster.TimeoutException: Channel timed
out. lastTime=141203-145449.325,
now=141203-145649.844, channel=TCPChannel [type:
server, contact: 1202-5410010-000072-000000]<br
class="">
<span class="x_Apple-tab-span" style="white-space:pre"></span>at
org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133)<br
class="">
<span class="x_Apple-tab-span" style="white-space:pre"></span>at
org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124)<br
class="">
<span class="x_Apple-tab-span" style="white-space:pre"></span>at
java.util.TimerThread.mainLoop(Timer.java:555)<br
class="">
<span class="x_Apple-tab-span" style="white-space:pre"></span>at
java.util.TimerThread.run(Timer.java:505)<br class="">
<br class="">
Progress: Wed, 03 Dec 2014 14:59:51+0000
Submitted:651 Failed:6 Finished successfully:768
Failed but can retry:762<br class="">
Progress: Wed, 03 Dec 2014 14:59:52+0000
Submitted:651 Failed:44 Finished successfully:768
Failed but can retry:724<br class="">
<br class="">
And the process seems to have stopped.<br class="">
<br class="">
What log file would be helpful for diagnosing this?<br
class="">
<br class="">
Jonathan<br class="">
<br class="">
<br class="">
_______________________________________________<br
class="">
Swift-user mailing list<br class="">
<a moz-do-not-send="true"
href="mailto:Swift-user@ci.uchicago.edu" class="">Swift-user@ci.uchicago.edu</a><br
class="">
<a class="moz-txt-link-freetext" href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user</a><br
class="">
</blockquote>
<br class="">
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>
<br>
</body>
</html>