[Swift-devel] Re: Status of the Falkon bad-host-retry fix?

Michael Wilde wilde at mcs.anl.gov
Thu Sep 20 10:05:42 CDT 2007


I see - makes sense.  Both cases are feature interactions that we had 
not tried or tested before.

- Mike


On 9/20/07 9:52 AM, Veronika Nefedova wrote:
> Ben and Mihael could provide you with the details of the fixes 
> they've done; below are the symptoms I've observed.
> 
> On Sep 20, 2007, at 9:41 AM, Michael Wilde wrote:
> 
>> When I ran angle jobs last week, I got kickstart records back.
>> What was happening in your case?
>>
> 
> One of my executables produces stdout. Kickstart was not handling it 
> correctly, so the stdout file was never created, and the application 
> would fail (in the 3rd stage of the workflow). It took a while to figure 
> out that kickstart was the reason for all the failures!
> 
>> What was the nature of the incorrect job status problem?
>>
>> Did both of these happen on all jobs, or only in special cases?
>>
> 
> It happened on workflows with clustering turned on. Not sure 
> if clustering was the only reason, but the status of the failed jobs 
> was not set correctly, and thus restarts of the failed jobs never 
> happened. Mihael could comment on it in more detail.
> 
> Nika
> 
>> - Mike
>>
>> On 9/20/07 7:49 AM, Veronika Nefedova wrote:
>>> No. I am spending 100% of my effort now on debugging pure GRAM 
>>> submissions, which had started before Ioan put his fix in SVN. I 
>>> am working closely with Ben and Mihael -- we've already uncovered a 
>>> few problems in the course of our work -- and the problems are being 
>>> fixed as we go. To list just a few problems: kickstart not 
>>> handling stdout correctly, swift not reporting the job status 
>>> correctly, etc. These problems would've manifested themselves in 
>>> Falkon tests as far as I can tell, so this testing is beneficial to 
>>> any future Falkon testing. We are testing on both the ncsa and tg-cu 
>>> clusters, so there is no way I could do any Falkon testing at the 
>>> same time.
>>> Nika
>>> On Sep 19, 2007, at 6:16 PM, Michael Wilde wrote:
>>>> [Subject Was Re: [Swift-devel] bug 53]
>>>>
>>>> Nika, were you able to try Ioan's fix to the stale-file-handle host 
>>>> problem, and have any updates to report on it?
>>>>
>>>> - Mike
>>>>
>>>>
>>>> On 9/18/07 11:15 AM, Michael Wilde wrote:
>>>>> It's not clear when this happened, as Nika and Ioan's workflow 
>>>>> submission from viper has afaik been mostly through Falkon for 
>>>>> quite a while now.
>>>>> Nika, perhaps you can shift back to trying the two Falkon 
>>>>> approaches (with higher prio on testing Ioan's retry code) in the 
>>>>> meantime.
>>>>> Ioan, is CI Support / Ti supporting viper, or are you the 
>>>>> "sysadmin" Ben is referring to?
>>>>> I've also suggested in the past that we focus on using evitable and 
>>>>> terminable (and swift03/04) as our main submit hosts, primarily for 
>>>>> support and coordination reasons.  Is this a good time to try the 
>>>>> GRAM/non-Falkon workflow there?
>>>>> - Mike
>>>>> Ben Clifford wrote:
>>>>>> sounds like viper had firewall configuration changed recently. 
>>>>>> viper sysadmin needs to help debug basic job submission with 
>>>>>> simple globus tools before that machine is worth using again.
>>>>>>
>>>>>> On Tue, 18 Sep 2007, Michael Wilde wrote:
>>>>>>
>>>>>>> does the cog equivalent of globus_tcp_source_range also need to 
>>>>>>> be set?
>>>>>>> is that only for gridftp, or gram as well?  or could this be a 
>>>>>>> gridftp hang?
>>>>>>>
>>>>>>> - mike
>>>>>>>
>>>>>>> Ben Clifford wrote:
>>>>>>>> can you submit a job using globus-job-run?
>>>>>>>>
>>>>>>>> On Tue, 18 Sep 2007, Veronika Nefedova wrote:
>>>>>>>>
>>>>>>>>> I set tcp.port.range in swift properties but even a simple 
>>>>>>>>> helloworld workflow hangs (the submit host doesn't receive the 
>>>>>>>>> notification from the compute host that the job has finished).
>>>>>>>>> tcp.port.range=50000,60000
>>>>>>>>>
>>>>>>>>> Not sure what else has changed on viper? It used to be a very 
>>>>>>>>> good submit host, I never had any problems with it );
>>>>>>>>>
>>>>>>>>> Nika
>>>>>>>>>
>>>>>>>>> On Sep 18, 2007, at 9:13 AM, Mihael Hategan wrote:
>>>>>>>>>
>>>>>>>>>> It should pick that one up. If not, set it in ~/.globus/cog.properties:
>>>>>>>>>> tcp.port.range=begin,end
>>>>>>>>>>
>>>>>>>>>> On Tue, 2007-09-18 at 07:42 +0000, Ben Clifford wrote:
>>>>>>>>>>> Not sure if cog picks up the GLOBUS_whatever environment 
>>>>>>>>>>> variables. Mihael presumably knows.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 17 Sep 2007, Ioan Raicu wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There is a firewall on viper.  Ports 50000 - 60000 are open 
>>>>>>>>>>>> for TCP. You might want to set the TCP_PORT_RANGE (I am not 
>>>>>>>>>>>> sure this is the exact environment variable, but something 
>>>>>>>>>>>> like that) to ports between 50K and 60K to ensure that GT4 
>>>>>>>>>>>> uses one of these open ports.
>>>>>>>>>>>> Ioan
>>>>>>>>>>>>
>>>>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>>>>> The same. You can check the job's status in its log on 
>>>>>>>>>>>>> viper in
>>>>>>>>>>>>> ~nefedova/alamines/MolDyn-244-loops-20070917-1356-h95gxij8.log.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The job is still running (i.e. hanging) with the same 
>>>>>>>>>>>>> symptom as before: the first job is done and then nothing 
>>>>>>>>>>>>> else gets submitted (the submit host doesn't receive any 
>>>>>>>>>>>>> notification that the job has finished).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nika
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 17, 2007, at 9:51 AM, Mihael Hategan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, 2007-09-17 at 09:41 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>>>>> I did 'svn up' in the cog directory and then did 'ant dist' 
>>>>>>>>>>>>>>> in the same directory.
>>>>>>>>>>>>>> 'ant dist' should be done in the swift directory.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My 'svn info' gives me r1740.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 17, 2007, at 8:55 AM, Mihael Hategan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Did you update cog?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, 2007-09-17 at 08:38 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>>>>>>> No, I've tried with r1740, and it still hung (timed out).
>>>>>>>>>>>>>>>>> The log is on
>>>>>>>>>>>>>>>>> viper:/home/nefedova/alamines/MolDyn-244-loops-20070914-1834-pvhyji75.log
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Nika
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sep 15, 2007, at 10:59 AM, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, 2007-09-15 at 09:06 +0000, Ben Clifford wrote:
>>>>>>>>>>>>>>>>>>> On Fri, 14 Sep 2007, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, 2007-09-13 at 16:41 -0500, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>>>>>> Ok, so there's something in.
>>>>>>>>>>>>>>>>>>>> That something was throttling a bit too much (not just jobs,
>>>>>>>>>>>>>>>>>>>> but all tasks on that site). I need to take a second look at it.
>>>>>>>>>>>>>>>>>>> Is that fixed by cog r1740? It looks like that commit is
>>>>>>>>>>>>>>>>>>> intended to.
>>>>>>>>>>>>>>>>>> It's an attempt to fix it, but it needs to be confirmed by
>>>>>>>>>>>>>>>>>> Nika.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>
> 
> 


