[Swift-devel] Progress on Bug 690? - Re: timeout on OSG with coasters provider staging

Michael Wilde wilde at mcs.anl.gov
Wed Jan 25 08:33:41 CST 2012


Mihael, Ketan, can you send an update on this, and escalate the priority of resolving this problem?

A resolution is needed rather urgently for the ExTENCI project.

Mihael, do you know where the problem lies, and do you have a strategy for a fix?

Thanks,

- Mike

----- Original Message -----
> From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Thursday, January 19, 2012 5:22:19 PM
> Subject: Re: [Swift-devel] timeout on OSG with coasters provider staging
> Here is another worker log; this one is from a real SCEC run:
> 
> 
> ci.uchicago.edu/~ketan/timeout_worker_log_scec.txt
> 
> 
> On Thu, Jan 19, 2012 at 1:54 PM, Ketan Maheshwari < ketancmaheshwari at gmail.com > wrote:
> 
> 
> Mihael,
> 
> 
> I have the logs now. Filed as bug 690:
> 
> 
> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=690
> 
> Regards,
> Ketan
> 
> 
> 
> 
> 
> On Mon, Jan 16, 2012 at 2:24 PM, Ketan Maheshwari < ketancmaheshwari at gmail.com > wrote:
> 
> 
> Mihael,
> 
> 
> Please find service log here:
> http://ci.uchicago.edu/~ketan/swift.log.tar.gz
> 
> The worker logs seem to have been lost. I'll see if I can find them.
> 
> Regards,
> Ketan
> 
> 
> 
> 
> 
> On Mon, Jan 16, 2012 at 1:38 PM, Mihael Hategan < hategan at mcs.anl.gov > wrote:
> 
> 
> Nothing interesting there. Do you also happen to have the service and
> worker logs?
> 
> 
> 
> 
> On Mon, 2012-01-16 at 11:05 -0600, Ketan Maheshwari wrote:
> > Hi Mihael,
> >
> >
> > I could reproduce this timeout exception on OSG with catsn Swift
> > jobs.
> >
> >
> > These are 100 jobs with a data size of 10MB each. So, 2000MB of data
> > movement in all.
> >
> >
> > I tried with 1 worker running on a single OSG site. I tried three
> > different OSG sites: Nebraska, UChicago and RENCI.
> >
> >
> > In each of these cases, I ran into the following timeout after ~4
> > minutes of running (15-70 jobs complete during this period):
> >
> >
> > Timeout
> > org.globus.cog.karajan.workflow.service.TimeoutException: Handler(562, PUT): timed out receiving request. Last time 940817-011255.807, now: 120115-194100.072
> >     at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.handleTimeout(RequestHandler.java:124)
> >     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
> >     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:123)
> >     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:116)
> >     at java.util.TimerThread.mainLoop(Timer.java:512)
> >     at java.util.TimerThread.run(Timer.java:462)
> > Command(168, SUBMITJOB): handling reply timeout; sendReqTime=120115-193900.255, sendTime=120115-193900.255, now=120115-194100.416, channel=SC-null
> >
> >
> > This is followed by messages similar to the last line above, and the
> > workflow's progress halts.
> >
> >
> > Here is the tarball of the
> > experiment: http://ci.uchicago.edu/~ketan/catsn-exp-formihael.tgz
> >
> >
> > It contains a README which has the steps to run: basically
> > start-service on localhost -> start worker on OSG site -> run swift
> >
> >
> > Regards,
> > --
> > Ketan
> >
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
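
The timestamps in the quoted trace look like they follow a yyMMdd-HHmmss.SSS layout. That layout is an assumption inferred from the values themselves, not something the thread confirms. Under that assumption, a minimal Python sketch can check two things: how long the SUBMITJOB command waited before its reply timeout, and how far back the handler's "Last time" value lies. The helper name parse_ts is hypothetical.

```python
from datetime import datetime

# Assumed layout (inferred from the log values, not confirmed by the thread):
# two-digit year, month, day, then hour, minute, second, and milliseconds.
TS_FORMAT = "%y%m%d-%H%M%S.%f"


def parse_ts(text):
    """Parse a timestamp such as '120115-194100.416' (hypothetical helper)."""
    return datetime.strptime(text, TS_FORMAT)


# Values copied from the quoted trace.
send_req_time = parse_ts("120115-193900.255")
command_now = parse_ts("120115-194100.416")
handler_last = parse_ts("940817-011255.807")
handler_now = parse_ts("120115-194100.072")

# Elapsed time between sending the SUBMITJOB request and the reply timeout.
command_wait = (command_now - send_req_time).total_seconds()

# Distance between the handler's recorded "Last time" and its "now".
handler_gap = handler_now - handler_last

print(f"SUBMITJOB waited {command_wait:.3f} s before the reply timeout")
print(f"Handler 'Last time' is {handler_gap.days} days before 'now'")
```

Under the assumed layout, the command's wait comes out at roughly two minutes, and the handler's "Last time" value parses to a date in 1994, years before the run. The thread does not say whether that reflects an unset timestamp, so the sketch only exposes the gap.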

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



