[Swift-devel] Re: Status of the Falkon bad-host-retry fix?
Veronika Nefedova
nefedova at mcs.anl.gov
Thu Sep 20 07:49:00 CDT 2007
No. I am spending 100% of my effort now on debugging pure GRAM
submissions, which had started before Ioan put his fix in SVN. I
am working closely with Ben and Mihael -- we've already uncovered a
few problems in the course of our work -- the problems are being
fixed as we go. To list just a few problems -- kickstart not
handling stdout correctly, swift not reporting the job status
correctly, etc. These problems would've manifested themselves in
Falkon tests as far as I can tell, so this testing is beneficial to
any future Falkon testing. We are testing on both the ncsa and tg-cu
clusters, so there is no way I could do any Falkon testing at the
same time.
Nika
On Sep 19, 2007, at 6:16 PM, Michael Wilde wrote:
> [Subject Was Re: [Swift-devel] bug 53]
>
> Nika, were you able to try Ioan's fix to the stale-file-handle host
> problem, and have any updates to report on it?
>
> - Mike
>
>
> On 9/18/07 11:15 AM, Michael Wilde wrote:
>> It's not clear when this happened, as Nika and Ioan's workflow
>> submission from viper has afaik been mostly through Falkon for
>> quite a while now.
>> Nika, perhaps you can shift back to trying the two Falkon
>> approaches (with higher prio on testing Ioan's retry code) in the
>> meantime.
>> Ioan, is CI Support / Ti supporting viper, or are you the
>> "sysadmin" Ben is referring to?
>> I've also suggested in the past that we focus on using evitable and
>> terminable (and swift03/04) as our main submit hosts, primarily
>> for support and coordination reasons. Is this a good time to try
>> the GRAM/non-Falkon workflow there?
>> - Mike
>> Ben Clifford wrote:
>>> sounds like viper had firewall configuration changed recently.
>>> viper sysadmin needs to help debug basic job submission with
>>> simple globus tools before that machine is worth using again.
>>>
>>> On Tue, 18 Sep 2007, Michael Wilde wrote:
>>>
>>>> does the cog equivalent of globus_tcp_source_range also need to
>>>> be set?
>>>> is that only for gridftp, or gram as well? or could this be a
>>>> gridftp hang?
>>>>
>>>> - mike
>>>>
>>>> Ben Clifford wrote:
>>>>> can you submit a job using globus-job-run?
>>>>>
>>>>> On Tue, 18 Sep 2007, Veronika Nefedova wrote:
>>>>>
>>>>>> I set tcp.port.range in swift properties but even a simple
>>>>>> helloworld workflow hangs (the submit host doesn't receive the
>>>>>> notification from the compute host that the job has finished).
>>>>>> tcp.port.range=50000,60000
>>>>>>
>>>>>> Not sure what else has changed on viper? It used to be a very
>>>>>> good submit host, I never had any problems with it );
>>>>>>
>>>>>> Nika
>>>>>>
>>>>>> On Sep 18, 2007, at 9:13 AM, Mihael Hategan wrote:
>>>>>>
>>>>>>> Should pick that one. If not ~/.globus/cog.properties ->
>>>>>>> tcp.port.range=begin,end
>>>>>>>
>>>>>>> On Tue, 2007-09-18 at 07:42 +0000, Ben Clifford wrote:
>>>>>>>> Not sure if cog picks up the GLOBUS_whatever environment
>>>>>>>> variables. Mihael presumably knows.
>>>>>>>>
>>>>>>>> On Mon, 17 Sep 2007, Ioan Raicu wrote:
>>>>>>>>
>>>>>>>>> There is a firewall on viper. Ports 50000 - 60000 are open
>>>>>>>>> for TCP. You might want to set the TCP_PORT_RANGE (I am not
>>>>>>>>> sure this is the exact environment variable, but something
>>>>>>>>> like that) to be between 50K and 60K ports to ensure that
>>>>>>>>> GT4 uses one of these open ports.
>>>>>>>>> Ioan
>>>>>>>>>
>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>> The same. You can check the job's status in its log on viper in
>>>>>>>>>> ~nefedova/alamines/MolDyn-244-loops-20070917-1356-h95gxij8.log.
>>>>>>>>>>
>>>>>>>>>> The job is still running (i.e. hanging) with the same symptom as
>>>>>>>>>> before: the first job is done and then nothing else gets
>>>>>>>>>> submitted (the submit host doesn't receive any notification that
>>>>>>>>>> the job has finished).
>>>>>>>>>>
>>>>>>>>>> Nika
>>>>>>>>>>
>>>>>>>>>> On Sep 17, 2007, at 9:51 AM, Mihael Hategan wrote:
>>>>>>>>>>
>>>>>>>>>>> On Mon, 2007-09-17 at 09:41 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>> I did 'svn up' in cog directory and then did 'ant dist'
>>>>>>>>>>>> in the same directory.
>>>>>>>>>>> 'ant dist' should be done in the swift directory.
>>>>>>>>>>>
>>>>>>>>>>>> My 'svn info' gives me r1740.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 17, 2007, at 8:55 AM, Mihael Hategan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Did you update cog?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 2007-09-17 at 08:38 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>>>> No, I've tried with r1740, it still hung (timed out).
>>>>>>>>>>>>>> The log is on
>>>>>>>>>>>>>> viper:/home/nefedova/alamines/MolDyn-244-loops-20070914-1834-pvhyji75.log
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nika
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 15, 2007, at 10:59 AM, Mihael Hategan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, 2007-09-15 at 09:06 +0000, Ben Clifford wrote:
>>>>>>>>>>>>>>>> On Fri, 14 Sep 2007, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 2007-09-13 at 16:41 -0500, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>>> Ok, so there's something in.
>>>>>>>>>>>>>>>>> That something was throttling a bit too much (not just
>>>>>>>>>>>>>>>>> jobs, but all tasks on that site). I need to take a
>>>>>>>>>>>>>>>>> second look at it.
>>>>>>>>>>>>>>>> Is that fixed by cog r1740? It looks like that commit is
>>>>>>>>>>>>>>>> intended to.
>>>>>>>>>>>>>>> It's an attempt to fix it, but it needs to be confirmed by
>>>>>>>>>>>>>>> Nika.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>
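
For reference, the port-range setting discussed in the quoted thread (tcp.port.range=begin,end in ~/.globus/cog.properties, with ports 50000 - 60000 reported open on viper) can be sanity-checked offline. The following is only a minimal sketch: the function names parse_port_range and within_firewall are made up for illustration, the parsing is a simplification of the properties format, and it says nothing about how cog itself reads the file or whether the firewall actually passes the traffic.

```python
def parse_port_range(properties_text, key="tcp.port.range"):
    """Return (begin, end) from a 'key=begin,end' line, or None if absent.

    Hypothetical helper: handles only simple 'name=value' lines and
    '#' comments, not the full Java properties syntax.
    """
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            begin, end = (int(part) for part in value.split(","))
            return begin, end
    return None


def within_firewall(port_range, open_begin=50000, open_end=60000):
    """Check that the configured range lies inside the open window.

    The 50000-60000 defaults are the values quoted for viper in the thread.
    """
    begin, end = port_range
    return open_begin <= begin <= end <= open_end


if __name__ == "__main__":
    example = "tcp.port.range=50000,60000\n"
    configured = parse_port_range(example)
    print(configured, within_firewall(configured))
```

A range that passes this check only shows the setting is consistent with the reported firewall window; a hang like the one described above would still need to be debugged with the basic globus tools, as Ben suggests.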