[Swift-devel] Re: Coaster error

Mihael Hategan hategan at mcs.anl.gov
Tue Aug 17 13:33:20 CDT 2010


The failure to shut down a channel is also ignorable.
Essentially, the worker shuts down before it gets to acknowledge the
shutdown command. I guess this could be fixed, but for now ignore it.

On Tue, 2010-08-17 at 13:21 -0500, Jonathan Monette wrote:
> So the qdel error I am seeing is ignorable?  And I am assuming that
> the shutdown failure has something to do with the jobs being run, because
> when I run a smaller data set (10 images instead of 1300 images) the
> shutdown error happens at the end of the workflow and I also get the error:
> 
> Failed to shut down channel
> org.globus.cog.karajan.workflow.service.channels.ChannelException: Invalid channel: 1338035062: {}
>      at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:442)
>      at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:422)
>      at org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411)
>      at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284)
>      at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83)
>      at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)
> 
> 
> On 8/17/10 12:43 PM, Mihael Hategan wrote:
> > On Tue, 2010-08-17 at 12:08 -0500, Jonathan Monette wrote:
> >    
> >> Ok.  I have run more tests on this problem.  I am running on both
> >> localhost and pads.  In the first stage of my workflow I run on
> >> localhost to collect some metadata.  I then use this metadata to
> >> reproject the images, submitting these jobs to pads.  All the images are
> >> reprojected and the stage completes without error.  After this, coasters is
> >> waiting for more jobs to submit to the workers while localhost is
> >> collecting more metadata.  I believe coasters starts to shut down some of
> >> the workers because they are idle and it wants to free the resources on the
> >> machine (am I correct so far?)
> >>      
> > You are.
> >
> >    
> >>    During the shutdown some workers are
> >> shut down successfully but there are always 1 or 2 that fail to shut down
> >> and I get the qdel error 153 I mentioned yesterday.  If coasters fails
> >> to shut down a job does the service terminate?
> >>      
> > No. The qdel part is not critical and is used when workers don't shut
> > down cleanly or on time.
> >
> >    
> >>    I ask this because after
> >> the job fails to shut down there are no more jobs being submitted in the
> >> queue and my script hangs since it is waiting for the next stage in my
> >> workflow to complete.  Is there a coaster parameter that tells coasters
> >> not to shut down the workers even if they become idle for a bit, or
> >> is this a legitimate error in coasters?
> >>      
> > You are assuming that the shutdown failure has something to do with jobs
> > not being run. I do not think that's necessarily right.