[Swift-devel] Re: Coaster error
Mihael Hategan
hategan at mcs.anl.gov
Tue Aug 17 14:37:01 CDT 2010
On Tue, 2010-08-17 at 13:37 -0500, Jonathan Monette wrote:
> Ok then. Do you have any ideas on why no more jobs are submitted
> through coasters after this error?
Nope. Do you have the coaster log?
> Here is my sites entry for pads
>
> <pool handle="pads">
> <execution jobmanager="local:pbs" provider="coaster"
> url="login.pads.ci.uchicago.edu" />
> <filesystem provider="local" />
> <profile key="maxtime" namespace="globus">3600</profile>
> <profile key="internalhostname" namespace="globus">192.5.86.6</profile>
> <profile key="workersPerNode" namespace="globus">1</profile>
> <profile key="slots" namespace="globus">10</profile>
> <profile key="nodeGranularity" namespace="globus">1</profile>
> <profile key="maxNodes" namespace="globus">1</profile>
> <profile key="queue" namespace="globus">fast</profile>
> <profile key="jobThrottle" namespace="karajan">1</profile>
> <profile key="initialScore" namespace="karajan">10000</profile>
> <workdirectory>/gpfs/pads/swift/jonmon/Swift/work/pads</workdirectory>
> </pool>
>
> I have slots set to 10. Does this mean that is the maximum number of
> jobs that will be submitted, and should this number be increased?
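
(For reference, a sketch of how these two knobs are usually read, assuming
the standard coaster/Karajan semantics: slots caps the number of coaster
blocks, i.e. PBS jobs, that the service keeps submitted, while jobThrottle
caps concurrent app jobs per site at roughly jobThrottle * 100 + 1, so the
existing value of 1 already allows about 101 jobs. The values below are
illustrative only.)

  <!-- Hypothetical adjustment, assuming slots limits concurrent coaster
       blocks (PBS jobs) rather than individual app jobs: raising it lets
       the service keep more blocks queued or running at once. -->
  <profile key="slots" namespace="globus">20</profile>
  <!-- Assuming the usual Karajan formula (~ jobThrottle * 100 + 1 concurrent
       app jobs per site), 1 already permits ~101 jobs, so this is unlikely
       to be the limiting factor here. -->
  <profile key="jobThrottle" namespace="karajan">1</profile>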
>
> On 8/17/10 1:33 PM, Mihael Hategan wrote:
> > The failure to shut down a channel is also ignorable.
> > Essentially the worker shuts down before it gets to acknowledge the
> > shutdown command. I guess this could be fixed, but for now ignore it.
> >
> > On Tue, 2010-08-17 at 13:21 -0500, Jonathan Monette wrote:
> >
> >> Ok, so the qdel error I am seeing is ignorable? And I am assuming that
> >> the shutdown failure has something to do with the jobs being run, because
> >> when I run a smaller data set (10 images instead of 1300) the shutdown
> >> error happens at the end of the workflow and I also get this error:
> >>
> >> Failed to shut down channel
> >> org.globus.cog.karajan.workflow.service.channels.ChannelException:
> >> Invalid channel: 1338035062: {}
> >> at
> >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:442)
> >> at
> >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:422)
> >> at
> >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411)
> >> at
> >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284)
> >> at
> >> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83)
> >> at
> >> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)
> >>
> >>
> >> On 8/17/10 12:43 PM, Mihael Hategan wrote:
> >>
> >>> On Tue, 2010-08-17 at 12:08 -0500, Jonathan Monette wrote:
> >>>
> >>>
> >>>> Ok. I have run more tests on this problem. I am running on both
> >>>> localhost and pads. In the first stage of my workflow I run on
> >>>> localhost to collect some metadata. I then use this metadata to
> >>>> reproject the images, submitting these jobs to pads. All the images are
> >>>> reprojected and the stage completes without error. After this, coasters
> >>>> is waiting for more jobs to submit to the workers while localhost is
> >>>> collecting more metadata. I believe coasters starts to shut down some of
> >>>> the workers because they are idle and it wants to free the resources on
> >>>> the machine (am I correct so far?)
> >>>>
> >>>>
> >>> You are.
> >>>
> >>>
> >>>
> >>>> During the shutdown some workers are
> >>>> shut down successfully, but there are always 1 or 2 that fail to shut
> >>>> down and I get the qdel error 153 I mentioned yesterday. If coasters
> >>>> fails to shut down a job, does the service terminate?
> >>>>
> >>>>
> >>> No. The qdel part is not critical and is used when workers don't shut
> >>> down cleanly or on time.
> >>>
> >>>
> >>>
> >>>> I ask this because after
> >>>> the job fails to shut down, no more jobs are submitted to the queue and
> >>>> my script hangs, since it is waiting for the next stage in my workflow
> >>>> to complete. Is there a coaster parameter that tells coasters not to
> >>>> shut down the workers even if they become idle for a bit, or is this a
> >>>> legitimate error in coasters?
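
(A sketch only, not a confirmed fix: assuming the lowOverallocation and
highOverallocation coaster profile keys apply to this setup, blocks can be
requested with walltimes well beyond the individual job walltimes, so
workers stay allocated between stages instead of being torn down as soon as
they go idle. The values below are illustrative; whether this avoids the
hang described above would need the coaster log.)

  <!-- Illustrative values: higher overallocation asks PBS for blocks that
       outlive individual jobs, keeping idle workers around between stages. -->
  <profile key="lowOverallocation" namespace="globus">100</profile>
  <profile key="highOverallocation" namespace="globus">100</profile>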
> >>>>
> >>>>
> >>> You are assuming that the shutdown failure has something to do with jobs
> >>> not being run. I do not think that's necessarily right.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >
> >
>