[Swift-devel] timeout on OSG with coasters provider staging

Mihael Hategan hategan at mcs.anl.gov
Fri Jan 20 03:40:19 CST 2012


Thanks! Most of the coaster staging problems seem to be between the
worker and the service, so those are likely the most important logs
for these issues.

Mihael

On Thu, 2012-01-19 at 17:22 -0600, Ketan Maheshwari wrote:
> 
> Here is another worker log; this one is from a real SCEC run:
> 
> 
> ci.uchicago.edu/~ketan/timeout_worker_log_scec.txt
> 
> On Thu, Jan 19, 2012 at 1:54 PM, Ketan Maheshwari
> <ketancmaheshwari at gmail.com> wrote:
>         Mihael,
>         
>         
>         I have the logs now. Filed as bug 690:
>         
>         
>         https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=690
>         
>         Regards,
>         Ketan
>         
>         
>         On Mon, Jan 16, 2012 at 2:24 PM, Ketan Maheshwari
>         <ketancmaheshwari at gmail.com> wrote:
>                 Mihael,
>                 
>                 
>                 Please find service log here:
>                  http://ci.uchicago.edu/~ketan/swift.log.tar.gz
>                 
>                 The worker logs seem to have been lost. I'll see if I
>                 can find them.
>                 
>                 Regards,
>                 Ketan
>                 
>                 
>                 On Mon, Jan 16, 2012 at 1:38 PM, Mihael Hategan
>                 <hategan at mcs.anl.gov> wrote:
>                         Nothing interesting there. Do you also happen
>                         to have the service and
>                         worker logs?
>                         
>                         
>                         On Mon, 2012-01-16 at 11:05 -0600, Ketan
>                         Maheshwari wrote:
>                         > Hi Mihael,
>                         >
>                         >
>                         > I could reproduce this timeout exception on
>                         > OSG with catsn Swift jobs.
>                         >
>                         >
>                         > These are 100 jobs with a data size of 10MB
>                         > each. So, 2000MB of data movement in all.
>                         >
>                         >
>                         > I tried with 1 worker running on a single
>                         > OSG site. I tried three different OSG sites:
>                         > Nebraska, UChicago and RENCI.
>                         >
>                         >
>                         > In each of these cases, I run into the
>                         > following timeout after ~4 minutes of running
>                         > (15-70 jobs complete during this period):
>                         >
>                         >
>                         > Timeout
>                         > org.globus.cog.karajan.workflow.service.TimeoutException: Handler(562, PUT): timed out receiving request. Last time 940817-011255.807, now: 120115-194100.072
>                         > at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.handleTimeout(RequestHandler.java:124)
>                         > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
>                         > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:123)
>                         > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:116)
>                         > at java.util.TimerThread.mainLoop(Timer.java:512)
>                         > at java.util.TimerThread.run(Timer.java:462)
>                         > Command(168, SUBMITJOB): handling reply timeout; sendReqTime=120115-193900.255, sendTime=120115-193900.255, now=120115-194100.416, channel=SC-null
>                         >
>                         >
>                         > This is followed by messages similar to the
>                         > last line above, and the workflow stops
>                         > making progress.
>                         >
>                         >
>                         > Here is the tarball of the experiment:
>                         > http://ci.uchicago.edu/~ketan/catsn-exp-formihael.tgz
>                         >
>                         >
>                         > It contains a README which has the steps to
>                         > run: basically start-service on localhost ->
>                         > start worker on OSG site -> run swift
>                         >
>                         >
>                         > Regards,
>                         > --
>                         > Ketan
>                         >
>                         >
>                         >
>                         
>                         
>                         
>                 
>                 
>                 
>                 
>                 -- 
>                 Ketan
>                 
>                 
>                 
>         
>         
>         
>         
>         -- 
>         Ketan
>         
>         
>         
> 
> 
> 
> 
> -- 
> Ketan
> 
> 
> 
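
Note on the timestamps in the quoted log excerpt: they look like a
yyMMdd-HHmmss.SSS layout. That layout is an assumption read off the
values, not something confirmed from the CoG/Karajan source. Under that
assumption, a minimal Python sketch (the helper names parse_ts and
elapsed_seconds are hypothetical) shows how to work out the gap between
two logged times, for example between sendReqTime and now in the
SUBMITJOB line:

```python
from datetime import datetime

# Assumed layout of the timestamps in the log excerpt: yyMMdd-HHmmss.SSS.
# This is inferred from the values, not taken from the Karajan source.
TS_FORMAT = "%y%m%d-%H%M%S.%f"


def parse_ts(text):
    """Parse one log timestamp such as '120115-193900.255'."""
    return datetime.strptime(text, TS_FORMAT)


def elapsed_seconds(start, end):
    """Seconds between two log timestamps given as strings."""
    return (parse_ts(end) - parse_ts(start)).total_seconds()


if __name__ == "__main__":
    # Values copied from the SUBMITJOB reply-timeout line.
    print(elapsed_seconds("120115-193900.255", "120115-194100.416"))
    # "Last time" value from the PUT handler timeout message.
    print(parse_ts("940817-011255.807").isoformat())
```

Under this reading, the SUBMITJOB reply timeout fired about 120 seconds
after the request was sent, and the PUT handler's "Last time" value
decodes to a date in 1994 rather than a time close to the run, which may
be worth checking against the worker and service logs.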




