[Swift-devel] Re: Status of the Falkon bad-host-retry fix?

Veronika Nefedova nefedova at mcs.anl.gov
Thu Sep 20 09:52:33 CDT 2007


Ben and Mihael could provide you with the details of the fixes
they've made; below are the symptoms I've observed.

On Sep 20, 2007, at 9:41 AM, Michael Wilde wrote:

> When I ran angle jobs last week, I got kickstart records back.
> What was happening in your case?
>

One of my executables produces stdout. Kickstart was not handling it
correctly, so the stdout file was never created and the application
would fail (3rd stage of the workflow). It took a while to figure out
that kickstart was the reason for all the failures!

> What was the nature of the incorrect job status problem?
>
> Did both of these happen on all jobs, or only in special cases?
>

It happened on workflows where clustering was turned on. I am not
sure if clustering was the only cause, but the status of the failed
jobs was not set correctly, and thus the failed jobs were never
restarted. Mihael could comment on it in more detail.

Nika

> - Mike
>
> On 9/20/07 7:49 AM, Veronika Nefedova wrote:
>> No. I am spending 100% of my effort now on debugging pure GRAM
>> submissions, which had started before Ioan put his fix in SVN.
>> I am working closely with Ben and Mihael -- we've already
>> uncovered a few problems in the course of our work -- the problems
>> are being fixed as we go. To list just a few problems --
>> kickstart not handling the stdout correctly, swift not reporting
>> the job status correctly, etc. These problems would've manifested
>> themselves in Falkon tests as far as I can tell, so this testing
>> is beneficial to any future Falkon testing. We are testing on both
>> the ncsa and tg-cu clusters, so there is no way I could do any Falkon
>> testing at the same time.
>> Nika
>> On Sep 19, 2007, at 6:16 PM, Michael Wilde wrote:
>>> [Subject Was Re: [Swift-devel] bug 53]
>>>
>>> Nika, were you able to try Ioan's fix to the stale-file-handle  
>>> host problem, and have any updates to report on it?
>>>
>>> - Mike
>>>
>>>
>>> On 9/18/07 11:15 AM, Michael Wilde wrote:
>>>> It's not clear when this happened, as Nika and Ioan's workflow
>>>> submission from viper has afaik been mostly through Falkon for  
>>>> quite a while now.
>>>> Nika, perhaps you can shift back to trying the two Falkon  
>>>> approaches (with higher prio on testing Ioan's retry code) in  
>>>> the meantime.
>>>> Ioan, is CI Support / Ti supporting viper, or are you the  
>>>> "sysadmin" Ben is referring to?
>>>> I've also suggested in the past that we focus on using evitable
>>>> and terminable (and swift03/04) as our main submit hosts,
>>>> primarily for support and coordination reasons.  Is this a good
>>>> time to try the GRAM/non-Falkon workflow there?
>>>> - Mike
>>>> Ben Clifford wrote:
>>>>> sounds like viper had firewall configuration changed recently.  
>>>>> viper sysadmin needs to help debug basic job submission with  
>>>>> simple globus tools before that machine is worth using again.
>>>>>
>>>>> On Tue, 18 Sep 2007, Michael Wilde wrote:
>>>>>
>>>>>> does the cog equivalent of globus_tcp_source_range also need  
>>>>>> to be set?
>>>>>> is that only for gridftp, or gram as well?  or could this be a  
>>>>>> gridftp hang?
>>>>>>
>>>>>> - mike
>>>>>>
>>>>>> Ben Clifford wrote:
>>>>>>> can you submit a job using globus-job-run?
>>>>>>>
>>>>>>> On Tue, 18 Sep 2007, Veronika Nefedova wrote:
>>>>>>>
>>>>>>>> I set tcp.port.range in swift properties but even a simple  
>>>>>>>> helloworld
>>>>>>>> workflow
>>>>>>>> hangs (the  submit host doesn't receive the notification  
>>>>>>>> from the compute
>>>>>>>> host
>>>>>>>> that the job has finished).
>>>>>>>> tcp.port.range=50000,60000
>>>>>>>>
>>>>>>>> Not sure what else has changed on viper? It used to be a  
>>>>>>>> very good submit
>>>>>>>> host, I never had any problems with it );
>>>>>>>>
>>>>>>>> Nika
>>>>>>>>
>>>>>>>> On Sep 18, 2007, at 9:13 AM, Mihael Hategan wrote:
>>>>>>>>
>>>>>>>>> Should pick that one. If not ~/.globus/cog.properties ->
>>>>>>>>> tcp.port.range=begin,end
>>>>>>>>>
>>>>>>>>> On Tue, 2007-09-18 at 07:42 +0000, Ben Clifford wrote:
>>>>>>>>>> Not sure if cog picks up the GLOBUS_whatever environment  
>>>>>>>>>> variables.
>>>>>>>>>> Mihael
>>>>>>>>>> presumably knows.
>>>>>>>>>>
>>>>>>>>>> On Mon, 17 Sep 2007, Ioan Raicu wrote:
>>>>>>>>>>
>>>>>>>>>>> There is a firewall on viper.  Ports 50000 - 60000 are  
>>>>>>>>>>> open for TCP.
>>>>>>>>>>> You
>>>>>>>>>>> might want to set the TCP_PORT_RANGE (I am not sure this  
>>>>>>>>>>> is the
>>>>>>>>>>> exact
>>>>>>>>>>> environment variable, but something like that) to be  
>>>>>>>>>>> between 50K and
>>>>>>>>>>> 60K
>>>>>>>>>>> ports
>>>>>>>>>>> to ensure that GT4 uses one of these open ports.
>>>>>>>>>>> Ioan
>>>>>>>>>>>
>>>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>>>> The same. You can check the job's status in its log on  
>>>>>>>>>>>> viper in
>>>>>>>>>>>> ~nefedova/alamines/MolDyn-244-loops-20070917-1356-h95gxij8.log.
>>>>>>>>>>>>
>>>>>>>>>>>> The job is still running (i.e. hanging) with the same
>>>>>>>>>>>> symptom as
>>>>>>>>>>>> before:
>>>>>>>>>>>> the first job is done and then nothing else gets
>>>>>>>>>>>> submitted (the
>>>>>>>>>>>> submit host
>>>>>>>>>>>> doesn't receive any notification that the job has  
>>>>>>>>>>>> finished).
>>>>>>>>>>>>
>>>>>>>>>>>> NIka
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 17, 2007, at 9:51 AM, Mihael Hategan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 2007-09-17 at 09:41 -0500, Veronika Nefedova  
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> I did 'svn up' in cog directory and then did 'ant  
>>>>>>>>>>>>>> dist' in the
>>>>>>>>>>>>>> same
>>>>>>>>>>>>>> directory.
>>>>>>>>>>>>> 'ant dist' should be done in the swift directory.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> My 'svn info' gives me r1740.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 17, 2007, at 8:55 AM, Mihael Hategan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Did you update cog?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, 2007-09-17 at 08:38 -0500, Veronika Nefedova  
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> No, I've tried with r1740, it still hung (timed out).
>>>>>>>>>>>>>>>> the log is on viper:/home/nefedova/alamines/MolDyn-244-loops-20070914-1834-pvhyji75.log
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> NIka
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sep 15, 2007, at 10:59 AM, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, 2007-09-15 at 09:06 +0000, Ben Clifford wrote:
>>>>>>>>>>>>>>>>>> On Fri, 14 Sep 2007, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, 2007-09-13 at 16:41 -0500, Mihael Hategan
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> Ok, so there's something in.
>>>>>>>>>>>>>>>>>>> That something was throttling a bit too much (not
>>>>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>> jobs,
>>>>>>>>>>>>>>>>>>> but all
>>>>>>>>>>>>>>>>>>> tasks on that site). I need to take a second look at
>>>>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>>> Is that fixed by cog r1740? It looks like that commit
>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> intended to.
>>>>>>>>>>>>>>>>> It's an attempt to fix it, but it needs to be  
>>>>>>>>>>>>>>>>> confirmed
>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>> Nika.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>



