[Swift-devel] Progress on Bug 690? - Re: timeout on OSG with coasters provider staging

Wed Jan 25 23:57:30 CST 2012

I also ran a real SCEC workflow on all 10 mcs machines from bridled, and I
did get the timeout exception.

Please find the worker logs from all mcs machines here:
http://www.mcs.anl.gov/~ketan/mcsworkerlogs.tar.gz
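
A note for reading the logs: the timestamps in the trace quoted below (for
example sendReqTime=120115-193900.255) look like they follow a
yyMMdd-HHmmss.SSS layout. That layout is an assumption inferred from the
values, not something confirmed from the CoG source. Under that assumption,
here is a minimal Python sketch for turning two stamps into an elapsed time.
The helper names parse_stamp and elapsed_seconds are made up for illustration.

```python
from datetime import datetime

# Assumed layout of the timestamps in the trace: yyMMdd-HHmmss.SSS
# (inferred from the logged values, not taken from the CoG source).
STAMP_FORMAT = "%y%m%d-%H%M%S.%f"


def parse_stamp(stamp: str) -> datetime:
    """Parse one log timestamp such as '120115-193900.255'."""
    return datetime.strptime(stamp, STAMP_FORMAT)


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    return (parse_stamp(end) - parse_stamp(start)).total_seconds()


if __name__ == "__main__":
    # Values copied from the SUBMITJOB reply-timeout line in the trace.
    print(elapsed_seconds("120115-193900.255", "120115-194100.416"))
    # The "Last time" value from the PUT handler line.
    print(parse_stamp("940817-011255.807"))
```

With the SUBMITJOB values, the elapsed time comes out at about 120 seconds,
which would be consistent with a two-minute reply timeout. Under the same
assumption, the "Last time" value 940817-011255.807 parses to a date in 1994,
so it may not be a real timestamp.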

On Wed, Jan 25, 2012 at 9:22 PM, Ketan Maheshwari <
ketancmaheshwari at gmail.com> wrote:

> I could reproduce the bug going from bridled to mcs with the same
> configuration. I am seeing two kinds of timeout: the first is the HEARTBEAT
> and other similar timeout messages, and the second is the register timeout
> message when trying to start a worker after a gap of about 5 minutes. This
> is very similar to the OSG scenario, since the workers there only start
> after a delay (often a long one). The exact message is:
>
> Failed to register (timeout)
>
> So, Mihael, if you try the catsn example that I sent you from any machine
> to the mcs workstations, you should be able to see the symptoms. Here are
> the config and related files that you could use:
>
> ====config======
> wrapperlog.always.transfer=false
> sitedir.keep=true
> execution.retries=0
> lazy.errors=false
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
> foreach.max.threads=200
> ==========
>
> =====sites.xml=====
>  <config>
>     <pool handle="grid">
>       <execution provider="coaster-persistent" url="http://localhost:50000" jobmanager="local:local"/>
>       <profile namespace="globus" key="workerManager">passive</profile>
>       <profile namespace="globus" key="jobsPerNode">1</profile>
>       <profile key="jobThrottle" namespace="karajan">0.02</profile>
>       <profile namespace="karajan" key="initialScore">10000</profile>
>       <!-- <filesystem provider="local" url="none" /> -->
>       <profile namespace="swift" key="stagingMethod">proxy</profile>
>       <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile>
>       <workdirectory>/tmp/ketan</workdirectory>
>     </pool>
> </config>
> ==============
>
> ====tc======
> grid cat /bin/cat null null null
> ======
>
> The catsn example tarball is here:
> http://ci.uchicago.edu/~ketan/catsn-exp-formihael.tgz
>
>
> Regards,
> Ketan
>
>
> On Wed, Jan 25, 2012 at 1:15 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>
>> Sorry. I was busy with the sshcl provider and the merging. I'll have to
>> look at it this weekend.
>>
>> On Wed, 2012-01-25 at 08:33 -0600, Michael Wilde wrote:
>> > Mihael, Ketan, can you send an update on this, and escalate the
>> > priority of resolving this problem?
>> >
>> > A resolution is needed rather urgently for the ExTENCI project.
>> >
>> > Mihael, do you know where the problem lies, and have a strategy for a
>> > fix?
>> >
>> > Thanks,
>> >
>> > - Mike
>> >
>> > ----- Original Message -----
>> > > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
>> > > To: "Mihael Hategan" <hategan at mcs.anl.gov>
>> > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
>> > > Sent: Thursday, January 19, 2012 5:22:19 PM
>> > > Subject: Re: [Swift-devel] timeout on OSG with coasters provider staging
>> > >
>> > > Here is another worker log; this one is from a real SCEC run:
>> > >
>> > >
>> > > http://ci.uchicago.edu/~ketan/timeout_worker_log_scec.txt
>> > >
>> > >
>> > > On Thu, Jan 19, 2012 at 1:54 PM, Ketan Maheshwari <
>> > > ketancmaheshwari at gmail.com > wrote:
>> > >
>> > >
>> > > Mihael,
>> > >
>> > >
>> > > I have the logs now. Filed as bug 690:
>> > >
>> > >
>> > > https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=690
>> > >
>> > > Regards,
>> > > Ketan
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, Jan 16, 2012 at 2:24 PM, Ketan Maheshwari <
>> > > ketancmaheshwari at gmail.com > wrote:
>> > >
>> > >
>> > > Mihael,
>> > >
>> > >
>> > > Please find service log here:
>> > > http://ci.uchicago.edu/~ketan/swift.log.tar.gz
>> > >
>> > > The worker logs seem to have been lost. I'll see if I can find them.
>> > >
>> > > Regards,
>> > > Ketan
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, Jan 16, 2012 at 1:38 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>> > >
>> > >
>> > > Nothing interesting there. Do you also happen to have the service and
>> > > worker logs?
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, 2012-01-16 at 11:05 -0600, Ketan Maheshwari wrote:
>> > > > Hi Mihael,
>> > > >
>> > > >
>> > > > I could reproduce this timeout exception on OSG with catsn Swift
>> > > > jobs.
>> > > >
>> > > >
>> > > > These are 100 jobs with a data size of 10MB each. So, 2000MB of data
>> > > > movement in all.
>> > > >
>> > > >
>> > > > I tried with 1 worker running on a single OSG site. I tried three
>> > > > different OSG sites: Nebraska, UChicago and RENCI.
>> > > >
>> > > >
>> > > > In each of these cases, I run into the following timeout after ~4
>> > > > minutes of running (15-70 jobs complete during this period):
>> > > >
>> > > >
>> > > > Timeout
>> > > > org.globus.cog.karajan.workflow.service.TimeoutException: Handler(562, PUT): timed out receiving request. Last time 940817-011255.807, now: 120115-194100.072
>> > > > at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.handleTimeout(RequestHandler.java:124)
>> > > > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
>> > > > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:123)
>> > > > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:116)
>> > > > at java.util.TimerThread.mainLoop(Timer.java:512)
>> > > > at java.util.TimerThread.run(Timer.java:462)
>> > > > Command(168, SUBMITJOB): handling reply timeout; sendReqTime=120115-193900.255, sendTime=120115-193900.255, now=120115-194100.416, channel=SC-null
>> > > >
>> > > >
>> > > > This is followed by more messages similar to the last line above,
>> > > > and the workflow makes no further progress.
>> > > >
>> > > >
>> > > > Here is the tarball of the
>> > > > experiment: http://ci.uchicago.edu/~ketan/catsn-exp-formihael.tgz
>> > > >
>> > > >
>> > > > It contains a README which has the steps to run: basically
>> > > > start-service on localhost -> start worker on OSG site -> run swift
>> > > >
>> > > >
>> > > > Regards,
>> > > > --
>> > > > Ketan
>> > > >
>> > > >
>> > > >
>> > >
>> > > _______________________________________________
>> > > Swift-devel mailing list
>> > > Swift-devel at ci.uchicago.edu
>> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>> >
>>
>>
>>
>
>
> --
> Ketan
>
>
>

-- 
Ketan