[Swift-devel] Re: Status of the Falkon bad-host-retry fix?

Veronika Nefedova nefedova at mcs.anl.gov
Thu Sep 20 07:49:00 CDT 2007


No. I am spending 100% of my effort now on debugging pure GRAM
submissions, which I had started before Ioan put his fix in SVN. I
am working closely with Ben and Mihael. We have already uncovered a
few problems in the course of our work, and they are being fixed as
we go. To list just a few: kickstart not handling stdout correctly,
Swift not reporting the job status correctly, etc. These problems
would have manifested themselves in Falkon tests as far as I can
tell, so this testing is beneficial to any future Falkon testing. We
are testing on both the ncsa and tg-cu clusters, so there is no way I
could do any Falkon testing at the same time.

Nika

On Sep 19, 2007, at 6:16 PM, Michael Wilde wrote:

> [Subject Was Re: [Swift-devel] bug 53]
>
> Nika, were you able to try Ioan's fix to the stale-file-handle host  
> problem, and have any updates to report on it?
>
> - Mike
>
>
> On 9/18/07 11:15 AM, Michael Wilde wrote:
>> It's not clear when this happened, as Nika and Ioan's workflow  
>> submission from viper has afaik been mostly through Falkon for  
>> quite a while now.
>> Nika, perhaps you can shift back to trying the two Falkon  
>> approaches (with higher prio on testing Ioan's retry code) in the  
>> meantime.
>> Ioan, is CI Support / Ti supporting viper, or are you the  
>> "sysadmin" Ben is referring to?
>> I've also suggested in the past that we focus on using evitable and  
>> terminable (and swift03/04) as our main submit hosts, primarily  
>> for support and coordination reasons.  Is this a good time to try  
>> the GRAM/non-Falkon workflow there?
>> - Mike
>> Ben Clifford wrote:
>>> sounds like viper had firewall configuration changed recently.  
>>> viper sysadmin needs to help debug basic job submission with  
>>> simple globus tools before that machine is worth using again.
>>>
>>> On Tue, 18 Sep 2007, Michael Wilde wrote:
>>>
>>>> does the cog equivalent of globus_tcp_source_range also need to  
>>>> be set?
>>>> is that only for gridftp, or gram as well?  or could this be a  
>>>> gridftp hang?
>>>>
>>>> - mike
>>>>
>>>> Ben Clifford wrote:
>>>>> can you submit a job using globus-job-run?
>>>>>
>>>>> On Tue, 18 Sep 2007, Veronika Nefedova wrote:
>>>>>
>>>>>> I set tcp.port.range in swift properties but even a simple
>>>>>> helloworld workflow hangs (the submit host doesn't receive the
>>>>>> notification from the compute host that the job has finished).
>>>>>> tcp.port.range=50000,60000
>>>>>>
>>>>>> Not sure what else has changed on viper? It used to be a very
>>>>>> good submit host, I never had any problems with it );
>>>>>>
>>>>>> Nika
>>>>>>
>>>>>> On Sep 18, 2007, at 9:13 AM, Mihael Hategan wrote:
>>>>>>
>>>>>>> Should pick that one. If not ~/.globus/cog.properties ->
>>>>>>> tcp.port.range=begin,end
>>>>>>>
>>>>>>> On Tue, 2007-09-18 at 07:42 +0000, Ben Clifford wrote:
>>>>>>>> Not sure if cog picks up the GLOBUS_whatever environment
>>>>>>>> variables. Mihael presumably knows.
>>>>>>>>
>>>>>>>> On Mon, 17 Sep 2007, Ioan Raicu wrote:
>>>>>>>>
>>>>>>>>> There is a firewall on viper.  Ports 50000 - 60000 are open
>>>>>>>>> for TCP. You might want to set the TCP_PORT_RANGE (I am not
>>>>>>>>> sure this is the exact environment variable, but something
>>>>>>>>> like that) to be between 50K and 60K ports to ensure that
>>>>>>>>> GT4 uses one of these open ports.
>>>>>>>>> Ioan
>>>>>>>>>
>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>> The same. You can check the job's status in its log on viper in
>>>>>>>>>> ~nefedova/alamines/MolDyn-244-loops-20070917-1356-h95gxij8.log.
>>>>>>>>>>
>>>>>>>>>> The job is still running (i.e. hanging) with the same symptom
>>>>>>>>>> as before: the first job is done and then nothing else gets
>>>>>>>>>> submitted (the submit host doesn't receive any notification
>>>>>>>>>> that the job has finished).
>>>>>>>>>>
>>>>>>>>>> Nika
>>>>>>>>>>
>>>>>>>>>> On Sep 17, 2007, at 9:51 AM, Mihael Hategan wrote:
>>>>>>>>>>
>>>>>>>>>>> On Mon, 2007-09-17 at 09:41 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>> I did 'svn up' in cog directory and then did 'ant dist' in
>>>>>>>>>>>> the same directory.
>>>>>>>>>>> 'ant dist' should be done in the swift directory.
>>>>>>>>>>>
>>>>>>>>>>>> My 'svn info' gives me r1740.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 17, 2007, at 8:55 AM, Mihael Hategan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Did you update cog?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 2007-09-17 at 08:38 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>>>> No, I've tried with r1740, it still hung (timed out).
>>>>>>>>>>>>>> the log is on
>>>>>>>>>>>>>> viper:/home/nefedova/alamines/MolDyn-244-loops-20070914-1834-pvhyji75.log
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nika
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 15, 2007, at 10:59 AM, Mihael Hategan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, 2007-09-15 at 09:06 +0000, Ben Clifford wrote:
>>>>>>>>>>>>>>>> On Fri, 14 Sep 2007, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 2007-09-13 at 16:41 -0500, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>>> Ok, so there's something in.
>>>>>>>>>>>>>>>>> That something was throttling a bit too much (not just jobs,
>>>>>>>>>>>>>>>>> but all tasks on that site). I need to take a second look at it.
>>>>>>>>>>>>>>>> Is that fixed by cog r1740? It looks like that commit is
>>>>>>>>>>>>>>>> intended to.
>>>>>>>>>>>>>>> It's an attempt to fix it, but it needs to be confirmed by Nika.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>>>>>>
>>>>>
>>>>
>>>
>>>
>
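
For reference, the port-range setting discussed in the quoted thread
(tcp.port.range=begin,end in ~/.globus/cog.properties) can be sanity
checked with a few lines of Python. This is only a minimal sketch: the
function name parse_port_range is made up for illustration and is not
part of cog or Swift, and the script only reads the file and checks
that the value is a well-formed range. It does not confirm that cog
actually honours the setting or that the firewall lets those ports
through.

```python
import os


def parse_port_range(properties_text):
    """Return (begin, end) from a 'tcp.port.range=begin,end' line, or None.

    Hypothetical helper for illustration only; not part of cog or Swift.
    """
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and '#' comments in the properties file.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "tcp.port.range":
            # Raises ValueError if the value is not exactly two integers.
            begin, end = (int(part) for part in value.split(","))
            if not (0 < begin <= end <= 65535):
                raise ValueError("invalid port range: %s" % value.strip())
            return begin, end
    return None


if __name__ == "__main__":
    # Path taken from the thread; the file may not exist on every host.
    path = os.path.expanduser("~/.globus/cog.properties")
    if os.path.exists(path):
        with open(path) as handle:
            print(parse_port_range(handle.read()))
    else:
        print("no cog.properties found at", path)
```

On a host configured as described in the thread, the expectation would
be that the parsed range falls inside the 50000 - 60000 window that the
viper firewall leaves open.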



