[Swift-devel] Re: Coaster error

Jonathan Monette jon.monette at gmail.com
Tue Aug 17 16:18:35 CDT 2010


The log in .globus/coasters/ doesn't get anything new written to it when
I do my runs.  Could that be because I have my jobmanager set to
local:pbs?  Does that put the coaster output in the Swift log instead?

On 8/17/10 2:37 PM, Mihael Hategan wrote:
> On Tue, 2010-08-17 at 13:37 -0500, Jonathan Monette wrote:
>    
>> Ok then.  Do you have any idea why no more jobs are submitted through
>> coasters after this error?
>>      
> Nope. Do you have the coaster log?
>
>    
>> Here is my sites entry for pads
>>
>> <pool handle="pads">
>>   <execution jobmanager="local:pbs" provider="coaster"
>>              url="login.pads.ci.uchicago.edu" />
>>   <filesystem provider="local" />
>>   <profile key="maxtime" namespace="globus">3600</profile>
>>   <profile key="internalhostname" namespace="globus">192.5.86.6</profile>
>>   <profile key="workersPerNode" namespace="globus">1</profile>
>>   <profile key="slots" namespace="globus">10</profile>
>>   <profile key="nodeGranularity" namespace="globus">1</profile>
>>   <profile key="maxNodes" namespace="globus">1</profile>
>>   <profile key="queue" namespace="globus">fast</profile>
>>   <profile key="jobThrottle" namespace="karajan">1</profile>
>>   <profile key="initialScore" namespace="karajan">10000</profile>
>>   <workdirectory>/gpfs/pads/swift/jonmon/Swift/work/pads</workdirectory>
>> </pool>
>>
>> I have slots set to 10.  Does that mean 10 is the maximum number of
>> jobs that will be submitted, and should I increase it?
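>>
>> If slots is indeed the maximum number of coaster blocks (PBS jobs) kept
>> queued or running at once, then with maxNodes=1 and workersPerNode=1 it
>> would also cap the concurrent app jobs at 10.  As an illustration only
>> (I have not verified the exact semantics), something like the following
>> should allow up to 20 concurrent one-node workers:
>>
>> <profile key="slots" namespace="globus">20</profile>
>> <profile key="maxNodes" namespace="globus">1</profile>
>> <profile key="workersPerNode" namespace="globus">1</profile>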
>>
>> On 8/17/10 1:33 PM, Mihael Hategan wrote:
>>      
>>> The failure to shut down a channel is also ignorable.
>>> Essentially the worker shuts down before it gets to acknowledge the
>>> shutdown command. I guess this could be fixed, but for now ignore it.
>>>
>>> On Tue, 2010-08-17 at 13:21 -0500, Jonathan Monette wrote:
>>>
>>>        
>>>> So the qdel error I am seeing is ignorable?  I am assuming that the
>>>> shutdown failure has something to do with the jobs being run, because
>>>> when I run a smaller data set (10 images instead of 1300) the shutdown
>>>> error happens at the end of the workflow, and I also get this error:
>>>>
>>>> Failed to shut down channel
>>>> org.globus.cog.karajan.workflow.service.channels.ChannelException:
>>>> Invalid channel: 1338035062: {}
>>>>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:442)
>>>>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:422)
>>>>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411)
>>>>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284)
>>>>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83)
>>>>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)
>>>>
>>>>
>>>> On 8/17/10 12:43 PM, Mihael Hategan wrote:
>>>>
>>>>          
>>>>> On Tue, 2010-08-17 at 12:08 -0500, Jonathan Monette wrote:
>>>>>
>>>>>> Ok.  I have run more tests on this problem.  I am running on both
>>>>>> localhost and pads.  In the first stage of my workflow I run on
>>>>>> localhost to collect some metadata.  I then use this metadata to
>>>>>> reproject the images, submitting those jobs to pads.  All the images
>>>>>> are reprojected and that stage completes without error.  After that,
>>>>>> coasters is waiting for more jobs to submit to the workers while
>>>>>> localhost is collecting more metadata.  I believe coasters starts to
>>>>>> shut down some of the workers because they are idle, in order to free
>>>>>> resources on the machine (am I correct so far?).
>>>>>>
>>>>> You are.
>>>>>
>>>>>> During the shutdown some workers are shut down successfully, but
>>>>>> there are always 1 or 2 that fail to shut down, and I get the qdel
>>>>>> error 153 I mentioned yesterday.  If coasters fails to shut down a
>>>>>> job, does the service terminate?
>>>>>>
>>>>> No. The qdel part is not critical and is used when workers don't shut
>>>>> down cleanly or on time.
>>>>>
>>>>>> I ask this because after the job fails to shut down, no more jobs
>>>>>> are submitted to the queue and my script hangs, since it is waiting
>>>>>> for the next stage in my workflow to complete.  Is there a coaster
>>>>>> parameter that tells coasters not to shut down the workers even if
>>>>>> they become idle for a bit, or is this a legitimate error in
>>>>>> coasters?
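>>>>>>
>>>>>> For what it's worth, the only related knob I see in my sites entry
>>>>>> is maxtime, which as far as I know caps the walltime requested for a
>>>>>> coaster block.  Raising it (illustrative value only) would give
>>>>>> blocks a longer potential lifetime, though I don't know whether it
>>>>>> keeps idle workers from being shut down:
>>>>>>
>>>>>> <profile key="maxtime" namespace="globus">7200</profile>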
>>>>>
>>>>> You are assuming that the shutdown failure has something to do with jobs
>>>>> not being run. I do not think that's necessarily right.

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein