[Swift-devel] trunk coasters

Michael Wilde wilde at mcs.anl.gov
Fri Aug 5 10:00:58 CDT 2011


Mihael,

Persistent coasters work well so far in 0.93; the problem below seems to be in trunk.

I'm able to run on many remote OSG sites now, with good performance, using provider staging, with one coaster service.

I've seen one script of 100 jobs hang once, after 97 jobs had completed, but all other tests of up to 1000 jobs have succeeded. I'll try to recreate that hang and capture logs etc.

- Mike

----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Justin M Wozniak" <wozniak at mcs.anl.gov>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Thursday, August 4, 2011 11:22:31 PM
> Subject: Re: [Swift-devel] trunk coasters
> I'm still getting the errors below (which I think are what I reported
> prior to this fix). I'll double check that I got the latest fix in,
> but I think I do.
> 
> - Mike
> 
> 2011-08-04 23:17:47,757-0500 DEBUG Cpu workerStarted: swork:node016:0
> 2011-08-04 23:17:47,757-0500 DEBUG Cpu swork:0 pullLater
> 2011-08-04 23:17:47,758-0500 INFO Block Started CPU 0:1312517867s
> 2011-08-04 23:17:47,758-0500 INFO Block Started worker swork:000000
> 2011-08-04 23:17:47,758-0500 INFO Cpu swork:0 pull
> 2011-08-04 23:17:47,761-0500 WARN BlockQueueProcessor Failed to send worker status update to client
> java.lang.NullPointerException
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433)
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226)
>     at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72)
>     at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143)
>     at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64)
>     at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57)
>     at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375)
> 2011-08-04 23:17:47,764-0500 INFO LocalTCPService Received registration: blockid = swork, url = node016
> 2011-08-04 23:17:47,765-0500 INFO AbstractKarajanChannel MetaChannel: 467772424[15735326: {}] -> null: Disabling heartbeats (config is null)
> 2011-08-04 23:17:47,765-0500 INFO MetaChannel MetaChannel: 467772424[15735326: {}] -> null.bind -> SC-null
> 2011-08-04 23:17:47,765-0500 DEBUG Cpu workerStarted: swork:node016:1
> 2011-08-04 23:17:47,765-0500 DEBUG Cpu swork:1 pullLater
> 2011-08-04 23:17:47,765-0500 INFO Block Started CPU 1:1312517867s
> 2011-08-04 23:17:47,765-0500 INFO Cpu swork:1 pull
> 2011-08-04 23:17:47,765-0500 INFO Block Started worker swork:000001
> 2011-08-04 23:17:47,766-0500 WARN BlockQueueProcessor Failed to send worker status update to client
> java.lang.NullPointerException
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433)
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226)
>     at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72)
>     at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143)
>     at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64)
>     at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57)
>     at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375)
> 2011-08-04 23:17:48,568-0500 INFO TCPBufferManager Adjusting buffer size to 524288
> 
> 
> ----- Original Message -----
> > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > Cc: "Justin M Wozniak" <wozniak at mcs.anl.gov>, "Swift Devel"
> > <swift-devel at ci.uchicago.edu>
> > Sent: Thursday, August 4, 2011 6:09:56 PM
> > Subject: Re: [Swift-devel] trunk coasters
> > On Thu, 2011-08-04 at 15:29 -0500, Michael Wilde wrote:
> >
> > > So the other error - the failing service - is not happening on local
> > > tests on 0.93; next I'll try the remote cases.
> >
> > Ok. I committed a number of things to trunk, one of which is a fix for
> > the messed up channel lookup problem.
> >
> > I used it previously for auto-deployed services on ranger and pads, but
> > haven't tried it with the stand-alone service. So please test that if
> > you can and let me know.
> >
> > I'll now move to dealing with 0.93 issues.
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



