[Swift-devel] Progress on Bug 690? - Re: timeout on OSG with coasters provider staging

Ketan Maheshwari ketancmaheshwari at gmail.com
Wed Jan 25 21:22:18 CST 2012


I could reproduce the bug going from bridled to mcs with the same
configuration. I am seeing two kinds of timeout: the first is the HEARTBEAT
timeout, along with other similar timeout messages; the second is the
register timeout, which appears when a worker is started after a gap of
about 5 minutes. This is very similar to the OSG scenario, since the workers
there only start after a delay (often a long one). The exact message is:

Failed to register (timeout)

So, Mihael, if you run the catsn example that I sent you from any machine
against the mcs workstations, you should be able to see the symptoms. Here
are the config and related files that you could use:

====config======
wrapperlog.always.transfer=false
sitedir.keep=true
execution.retries=0
lazy.errors=false
status.mode=provider
use.provider.staging=true
provider.staging.pin.swiftfiles=false
foreach.max.threads=200
==========

=====sites.xml=====
 <config>
    <pool handle="grid">
      <execution provider="coaster-persistent" url="http://localhost:50000" jobmanager="local:local"/>
      <profile namespace="globus" key="workerManager">passive</profile>
      <profile namespace="globus" key="jobsPerNode">1</profile>
      <profile key="jobThrottle" namespace="karajan">0.02</profile>
      <profile namespace="karajan" key="initialScore">10000</profile>
      <!-- <filesystem provider="local" url="none" /> -->
      <profile namespace="swift" key="stagingMethod">proxy</profile>
      <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile>
      <workdirectory>/tmp/ketan</workdirectory>
    </pool>
</config>
==============
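
To double-check that the sites.xml actually in use matches the one above, a small script like the following could list the profiles for the "grid" pool. This is only a sketch: the embedded XML is copied from above, the helper name pool_profiles is made up for illustration, and the "jobThrottle * 100 + 1" rule for concurrent jobs is the commonly documented rule of thumb, assumed here rather than verified against this Swift version.

```python
import xml.etree.ElementTree as ET

# Copy of the sites.xml posted above (commented-out filesystem line omitted).
SITES_XML = """
<config>
  <pool handle="grid">
    <execution provider="coaster-persistent" url="http://localhost:50000" jobmanager="local:local"/>
    <profile namespace="globus" key="workerManager">passive</profile>
    <profile namespace="globus" key="jobsPerNode">1</profile>
    <profile key="jobThrottle" namespace="karajan">0.02</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="swift" key="stagingMethod">proxy</profile>
    <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile>
    <workdirectory>/tmp/ketan</workdirectory>
  </pool>
</config>
"""


def pool_profiles(xml_text, handle):
    """Return {(namespace, key): value} for the pool with the given handle.

    Hypothetical helper for illustration; not part of Swift.
    """
    root = ET.fromstring(xml_text)
    for pool in root.iter("pool"):
        if pool.get("handle") == handle:
            return {
                (p.get("namespace"), p.get("key")): (p.text or "").strip()
                for p in pool.findall("profile")
            }
    raise KeyError(handle)


profiles = pool_profiles(SITES_XML, "grid")
throttle = float(profiles[("karajan", "jobThrottle")])

# Assumption: concurrent jobs = jobThrottle * 100 + 1 (rule of thumb).
max_jobs = round(throttle * 100) + 1

print("workerManager:", profiles[("globus", "workerManager")])
print("stagingMethod:", profiles[("swift", "stagingMethod")])
print("approx. concurrent jobs allowed by jobThrottle:", max_jobs)
```

Under that assumption the throttle would allow only a handful of jobs at a time, so the timeouts are probably not a matter of too many concurrent transfers.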

====tc======
grid cat /bin/cat null null null
======

The catsn example tarball is here:
http://ci.uchicago.edu/~ketan/catsn-exp-formihael.tgz


Regards,
Ketan

On Wed, Jan 25, 2012 at 1:15 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Sorry. I was with the sshcl provider and the merging. I'll have to look
> at it this weekend.
>
> On Wed, 2012-01-25 at 08:33 -0600, Michael Wilde wrote:
> > Mihael, Ketan, can you send an update on this, and escalate the priority of resolving this problem?
> >
> > A resolution is needed rather urgently for the ExTENCI project.
> >
> > Mihael, do you know where the problem lies, and have a strategy for a fix?
> >
> > Thanks,
> >
> > - Mike
> >
> > ----- Original Message -----
> > > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > To: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > Sent: Thursday, January 19, 2012 5:22:19 PM
> > > Subject: Re: [Swift-devel] timeout on OSG with coasters provider staging
> > > Here is another worker log; this one is from a real SCEC run:
> > >
> > >
> > > http://ci.uchicago.edu/~ketan/timeout_worker_log_scec.txt
> > >
> > >
> > > On Thu, Jan 19, 2012 at 1:54 PM, Ketan Maheshwari <ketancmaheshwari at gmail.com> wrote:
> > >
> > >
> > > Mihael,
> > >
> > >
> > > I have the logs now. Filed as bug 690:
> > >
> > >
> > > https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=690
> > >
> > > Regards,
> > > Ketan
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Jan 16, 2012 at 2:24 PM, Ketan Maheshwari <ketancmaheshwari at gmail.com> wrote:
> > >
> > >
> > > Mihael,
> > >
> > >
> > > Please find service log here:
> > > http://ci.uchicago.edu/~ketan/swift.log.tar.gz
> > >
> > > The worker logs seem to have been lost. I'll see if I can find them.
> > >
> > > Regards,
> > > Ketan
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Jan 16, 2012 at 1:38 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > >
> > >
> > > Nothing interesting there. Do you also happen to have the service and
> > > worker logs?
> > >
> > >
> > >
> > >
> > > On Mon, 2012-01-16 at 11:05 -0600, Ketan Maheshwari wrote:
> > > > Hi Mihael,
> > > >
> > > >
> > > > I could reproduce this timeout exception on OSG with catsn Swift
> > > > jobs.
> > > >
> > > >
> > > > These are 100 jobs with a data size of 10MB each. So, 2000MB of data
> > > > movement in all.
> > > >
> > > >
> > > > I tried with 1 worker running on a single OSG site. I tried three
> > > > different OSG sites: Nebraska, UChicago and RENCI.
> > > >
> > > >
> > > > In each of these cases, I run into the following timeout after ~4
> > > > minutes of running (15-70 jobs complete during this period):
> > > >
> > > >
> > > > Timeout
> > > > org.globus.cog.karajan.workflow.service.TimeoutException: Handler(562, PUT): timed out receiving request. Last time 940817-011255.807, now: 120115-194100.072
> > > > at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.handleTimeout(RequestHandler.java:124)
> > > > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
> > > > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:123)
> > > > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:116)
> > > > at java.util.TimerThread.mainLoop(Timer.java:512)
> > > > at java.util.TimerThread.run(Timer.java:462)
> > > > Command(168, SUBMITJOB): handling reply timeout; sendReqTime=120115-193900.255, sendTime=120115-193900.255, now=120115-194100.416, channel=SC-null
> > > >
> > > >
> > > > This is followed by more messages similar to the last line above,
> > > > and the workflow makes no further progress.
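
The timestamps in the two log messages above can be compared directly. The sketch below assumes they are in yymmdd-HHMMSS.mmm form (that reading of the format is a guess from how the values look, not something taken from the CoG source), and the values are copied verbatim from the quoted log.

```python
from datetime import datetime

# Assumed layout of the log timestamps: yymmdd-HHMMSS.mmm
FMT = "%y%m%d-%H%M%S.%f"


def parse(ts):
    """Parse one log timestamp under the assumed format."""
    return datetime.strptime(ts, FMT)


# Values copied from the SUBMITJOB "handling reply timeout" line.
send = parse("120115-193900.255")
now = parse("120115-194100.416")
reply_wait = (now - send).total_seconds()

# Values copied from the PUT handler "timed out receiving request" line.
last = parse("940817-011255.807")
put_now = parse("120115-194100.072")
put_gap_days = (put_now - last).days

print(f"SUBMITJOB waited {reply_wait:.3f} s for a reply")
print(f"PUT handler 'last time' is {put_gap_days} days before 'now'")
```

If the format assumption holds, the SUBMITJOB command gave up almost exactly two minutes after it was sent, which looks like a fixed reply timeout. The PUT handler's "last time" falls in 1994, so it may not be a real last-activity timestamp at all; that part is speculation and would need checking against the service code.
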
> > > >
> > > >
> > > > Here is the tarball of the
> > > > experiment: http://ci.uchicago.edu/~ketan/catsn-exp-formihael.tgz
> > > >
> > > >
> > > > It contains a README which has the steps to run: basically
> > > > start-service on localhost -> start worker on OSG site -> run swift
> > > >
> > > >
> > > > Regards,
> > > > --
> > > > Ketan
> > > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
>
>
>


-- 
Ketan