[Swift-devel] Status of the Falkon bad-host-retry fix?

Michael Wilde wilde at mcs.anl.gov
Wed Sep 19 18:16:05 CDT 2007


[Subject Was Re: [Swift-devel] bug 53]

Nika, were you able to try Ioan's fix for the stale-file-handle host
problem, and do you have any updates to report on it?

- Mike


On 9/18/07 11:15 AM, Michael Wilde wrote:
> It's not clear when this happened, as Nika and Ioan's workflow submission
> from viper has, afaik, been mostly through Falkon for quite a while now.
> 
> Nika, perhaps you can shift back to trying the two Falkon approaches 
> (with higher prio on testing Ioan's retry code) in the meantime.
> 
> Ioan, is CI Support / Ti supporting viper, or are you the "sysadmin" Ben 
> is referring to?
> 
> I've also suggested in the past that we focus on using evitable and
> terminable (and swift03/04) as our main submit hosts, primarily for
> support and coordination reasons.  Is this a good time to try the
> GRAM/non-Falkon workflow there?
> 
> - Mike
> 
> 
> Ben Clifford wrote:
>> sounds like viper had its firewall configuration changed recently. viper
>> sysadmin needs to help debug basic job submission with simple globus
>> tools before that machine is worth using again.
>>
>> On Tue, 18 Sep 2007, Michael Wilde wrote:
>>
>>> does the cog equivalent of globus_tcp_source_range also need to be set?
>>> is that only for gridftp, or gram as well?  or could this be a 
>>> gridftp hang?
>>>
>>> - mike
>>>
>>> Ben Clifford wrote:
>>>> can you submit a job using globus-job-run?
>>>>
>>>> On Tue, 18 Sep 2007, Veronika Nefedova wrote:
>>>>
>>>>> I set tcp.port.range in swift properties but even a simple helloworld
>>>>> workflow hangs (the submit host doesn't receive the notification from
>>>>> the compute host that the job has finished).
>>>>> tcp.port.range=50000,60000
>>>>>
>>>>> Not sure what else has changed on viper? It used to be a very good
>>>>> submit host, I never had any problems with it );
>>>>>
>>>>> Nika
>>>>>
>>>>> On Sep 18, 2007, at 9:13 AM, Mihael Hategan wrote:
>>>>>
>>>>>> Should pick that one. If not ~/.globus/cog.properties ->
>>>>>> tcp.port.range=begin,end
>>>>>>
>>>>>> On Tue, 2007-09-18 at 07:42 +0000, Ben Clifford wrote:
>>>>>>> Not sure if cog picks up the GLOBUS_whatever environment variables.
>>>>>>> Mihael presumably knows.
>>>>>>>
>>>>>>> On Mon, 17 Sep 2007, Ioan Raicu wrote:
>>>>>>>
>>>>>>>> There is a firewall on viper.  Ports 50000 - 60000 are open for TCP.
>>>>>>>> You might want to set the TCP_PORT_RANGE (I am not sure this is the
>>>>>>>> exact environment variable, but something like that) to be between
>>>>>>>> 50K and 60K to ensure that GT4 uses one of these open ports.
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>> The same. You can check the job's status in its log on viper in
>>>>>>>>> ~nefedova/alamines/MolDyn-244-loops-20070917-1356-h95gxij8.log.
>>>>>>>>>
>>>>>>>>> The job is still running (i.e. hanging) with the same symptom as
>>>>>>>>> before: the first job is done and then nothing else gets submitted
>>>>>>>>> (the submit host doesn't receive any notification that the job has
>>>>>>>>> finished).
>>>>>>>>>
>>>>>>>>> Nika
>>>>>>>>>
>>>>>>>>> On Sep 17, 2007, at 9:51 AM, Mihael Hategan wrote:
>>>>>>>>>
>>>>>>>>>> On Mon, 2007-09-17 at 09:41 -0500, Veronika Nefedova wrote:
>>>>>>>>>>> I did 'svn up' in cog directory and then did 'ant dist' in the
>>>>>>>>>>> same directory.
>>>>>>>>>> 'ant dist' should be done in the swift directory.
>>>>>>>>>>
>>>>>>>>>>> My 'svn info' gives me r1740.
>>>>>>>>>>>
>>>>>>>>>>> On Sep 17, 2007, at 8:55 AM, Mihael Hategan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Did you update cog?
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, 2007-09-17 at 08:38 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>>> No, I've tried with r1740, it still hung (timed out).
>>>>>>>>>>>>> The log is on
>>>>>>>>>>>>> viper:/home/nefedova/alamines/MolDyn-244-loops-20070914-1834-pvhyji75.log
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nika
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 15, 2007, at 10:59 AM, Mihael Hategan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, 2007-09-15 at 09:06 +0000, Ben Clifford wrote:
>>>>>>>>>>>>>>> On Fri, 14 Sep 2007, Mihael Hategan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 2007-09-13 at 16:41 -0500, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>> Ok, so there's something in.
>>>>>>>>>>>>>>>> That something was throttling a bit too much (not just jobs,
>>>>>>>>>>>>>>>> but all tasks on that site). I need to take a second look at it.
>>>>>>>>>>>>>>> Is that fixed by cog r1740? It looks like that commit is
>>>>>>>>>>>>>>> intended to.
>>>>>>>>>>>>>> It's an attempt to fix it, but it needs to be confirmed by Nika.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>>>>>
>>>>
>>>
>>
>>
> 


