[Swift-devel] Re: Coaster error
Mihael Hategan
hategan at mcs.anl.gov
Tue Aug 17 13:33:20 CDT 2010
The failure to shut down a channel is also ignorable.
Essentially the worker shuts down before it gets to acknowledge the
shutdown command. I guess this could be fixed, but for now ignore it.
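
To illustrate the race described above, here is a minimal sketch. It is not the actual coaster code: the worker loop, the "SHUTDOWN"/"ACK" messages, and the function names are hypothetical and chosen only for illustration. It assumes a service that sends a shutdown command and waits a short time for an acknowledgement, and a worker that may exit before sending one.

```python
import queue
import threading


def worker(commands, acks, ack_before_exit):
    """Hypothetical worker loop: exits on SHUTDOWN, optionally acknowledging first."""
    while True:
        cmd = commands.get()
        if cmd == "SHUTDOWN":
            if ack_before_exit:
                acks.put("ACK")
            return  # the worker is gone; nothing is left to answer on its channel


def shut_down(ack_before_exit, timeout=0.5):
    """Send SHUTDOWN, wait briefly for an acknowledgement, report what happened."""
    commands, acks = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(commands, acks, ack_before_exit))
    t.start()
    commands.put("SHUTDOWN")
    try:
        acks.get(timeout=timeout)
        result = "clean"
    except queue.Empty:
        # The worker exited without acknowledging. It has still stopped,
        # so the missing acknowledgement is only noise in the log.
        result = "unacknowledged"
    t.join()
    return result, not t.is_alive()
```

In both cases the second element of the returned tuple is True, meaning the worker has stopped. Only the first element differs, which is why a missing acknowledgement of this kind can be ignored.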
On Tue, 2010-08-17 at 13:21 -0500, Jonathan Monette wrote:
> So the qdel error I am seeing is ignorable? I am assuming that the
> shutdown failure has something to do with the jobs being run, because
> when I run a smaller data set (10 images instead of 1300 images) the
> shutdown error happens at the end of the workflow, and I also get this error:
>
> Failed to shut down channel
> org.globus.cog.karajan.workflow.service.channels.ChannelException: Invalid channel: 1338035062: {}
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:442)
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:422)
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411)
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)
>
>
> On 8/17/10 12:43 PM, Mihael Hategan wrote:
> > On Tue, 2010-08-17 at 12:08 -0500, Jonathan Monette wrote:
> >
> >> Ok. I have run more tests on this problem. I am running on both
> >> localhost and pads. In the first stage of my workflow I run on
> >> localhost to collect some metadata. I then use this metadata to
> >> reproject the images, submitting these jobs to pads. All the images are
> >> reprojected and the stage completes without error. After this, coasters
> >> is waiting for more jobs to submit to the workers while localhost is
> >> collecting more metadata. I believe coasters starts to shut down some of
> >> the workers because they are idle and it wants to free the resources on
> >> the machine (am I correct so far?).
> >>
> > You are.
> >
> >
> >> During the shutdown some workers are
> >> shut down successfully, but there are always 1 or 2 that fail to shut
> >> down and I get the qdel error 153 I mentioned yesterday. If coasters
> >> fails to shut down a job, does the service terminate?
> >>
> > No. The qdel part is not critical and is used when workers don't shut
> > down cleanly or on time.
> >
> >
> >> I ask this because after
> >> the job fails to shut down there are no more jobs being submitted to the
> >> queue and my script hangs, since it is waiting for the next stage in my
> >> workflow to complete. Is there a coaster parameter that tells coasters
> >> not to shut down the workers even if they become idle for a bit, or
> >> is this a legitimate error in coasters?
> >>
> > You are assuming that the shutdown failure has something to do with jobs
> > not being run. I do not think that's necessarily right.
> >
> >
> >
> >
>