[AG-TECH] Vanishing Bridges
John I. Quebedeaux, Jr
johnq at lsu.edu
Thu Oct 14 20:59:37 CDT 2010
Great on the updates. - JQ
> From: Christoph Willing <c.willing at uq.edu.au>
> Date: Fri, 15 Oct 2010 11:47:12 +1000
> To: "Thomas D. Uram" <turam at mcs.anl.gov>
> Cc: John Quebedeaux <johnq at lsu.edu>, Philippe d'Anfray
> <Philippe.d-Anfray at cea.fr>, "<Marcolino.Pires at ac-paris.fr>"
> <Marcolino.Pires at ac-paris.fr>, "ag-tech at mcs.anl.gov" <ag-tech at mcs.anl.gov>
> Subject: Re: [AG-TECH] Vanishing Bridges
>
>
> On 15/10/2010, at 6:10 AM, Thomas Uram wrote:
>
>> This has been fixed. I replicated the problem with a Bridge running
>> on Ubuntu Lucid, registered against the ANL bridge registry.
>>
>> This problem came down to a change in the request handling code in
>> Python 2.6. The change added a handle_timeout method to
>> SocketServer.BaseServer, which gets called instead of raising a
>> socket.timeout exception. The bridge code was relying on this
>> timeout exception to re-register with the registry. That
>> functionality has now been moved to the handle_timeout method.
>>
>> The change has been committed to the AG code here:
>> https://trac.ci.uchicago.edu/accessgrid/changeset/6820
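
For readers following the explanation above, here is a minimal sketch of the mechanism. It is not the code from changeset 6820. It assumes Python 3, where the module is named socketserver (SocketServer in Python 2.6). RegisteringServer, the register callback, and serve are hypothetical names; the real bridge re-registers with the registry where this sketch only invokes a callback.

```python
import socketserver  # named SocketServer in Python 2


class RegisteringServer(socketserver.TCPServer):
    """Hypothetical server that re-registers each time handle_request() times out."""

    timeout = 0.1  # seconds handle_request() waits for a request
    allow_reuse_address = True

    def __init__(self, address, handler, register):
        socketserver.TCPServer.__init__(self, address, handler)
        self.register = register  # assumed stand-in for the registry call
        self.registrations = 0

    def handle_timeout(self):
        # Since Python 2.6, handle_request() calls this method when no
        # request arrives within `timeout`; it no longer raises
        # socket.timeout, so an "except socket.timeout" block never runs.
        self.registrations += 1
        self.register()


def serve(server, iterations):
    """Handle (or time out on) a fixed number of requests."""
    for _ in range(iterations):
        server.handle_request()
```

Code that placed the re-registration in an exception handler around handle_request() would silently stop re-registering on Python 2.6 and later, which matches the symptom of bridges dropping out of the registry.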
>
>
> Thanks Tom,
>
> Local testing confirms the fix works and I've just uploaded patched AG
> packages for Ubuntu 10.10 & Slackware 13.1 to their respective repos.
> Patched packages for other Ubuntu & Slackware versions should appear
> later today.
>
>
>> The relevant Python report is here:
>> http://bugs.python.org/issue742598
>>
>> This does leave open the question of why more than one of us was
>> unable to replicate the problem in test setups using Python 2.6.
>
> I think there is additional aberrant behaviour under python2.6 in the
> registry itself which masks the issue fixed by the patch. You'll
> recall that with the APAG registry, the original fault wasn't seen
> i.e. bridges didn't disappear. It turns out that bridges aren't being
> removed at all in this case, even after they have been intentionally
> stopped, which means non-existent bridges are still being advertised.
> They can only be removed from the advertised list by restarting the
> registry. As an example, I had a bridge named SLTest2 registered with
> the APAG registry. I stopped that bridge over an hour ago and since
> then the machine has been rebooted twice while making new AG packages
> for different distros. Yet that same bridge still appears in the
> bridge list on another machine after a "Purge Bridge Cache". It's
> disabled and unreachable, so doesn't appear in a user's list under the
> Tools menu, but it's clearly still being advertised by the registry.
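
The behaviour one would expect from a registry instead can be sketched as follows. This is an illustration only and is not taken from the AG registry code: BridgeRegistry, its method names, and the TTL value are all assumptions. A bridge that stops re-registering is dropped from the advertised list once its last registration is older than the TTL, with no registry restart needed.

```python
import time


class BridgeRegistry:
    """Hypothetical registry that forgets bridges which stop re-registering."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the logic can be tested
        self._last_seen = {}        # bridge name -> time of last registration

    def register(self, name):
        # Bridges are expected to call this periodically while running.
        self._last_seen[name] = self.clock()

    def purge_expired(self):
        # Remove every bridge whose last registration is older than the TTL.
        cutoff = self.clock() - self.ttl
        stale = [n for n, seen in self._last_seen.items() if seen < cutoff]
        for name in stale:
            del self._last_seen[name]
        return stale

    def advertised(self):
        # Only bridges that registered recently are advertised.
        self.purge_expired()
        return sorted(self._last_seen)
```

With a scheme like this, a stopped bridge such as SLTest2 would vanish from the list after one TTL period rather than lingering until the registry is restarted.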
>
>
> chris
>
>
>
>> On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:
>>
>>> Chris,
>>>
>>> I can confirm that LSU is having to run an older version in order for
>>> our bridge not to disappear from the ANL registry. I haven't had time
>>> to figure out why it wasn't staying with our FC13 installation - so
>>> I've had to split the bridge and venueserver for the moment until I
>>> have time to pick it apart... I initially suspected it was a python
>>> version issue...
>>>
>>> -John Q.
>>> --
>>> John I. Quebedeaux, Jr.; Louisiana State University
>>> Computer Manager LBRN; 131 Life Sciences Bldg.
>>> e-mail: johnq at lsu.edu; web: http://lbrn.lsu.edu
>>> phone: 225-578-0062 / fax: 225-578-2597
>>>
>>>
>>>> From: Christoph Willing <c.willing at uq.edu.au>
>>>> Date: Thu, 14 Oct 2010 21:09:12 +1000
>>>> To: Philippe d'Anfray <Philippe.d-Anfray at cea.fr>
>>>> Cc: "<Marcolino.Pires at ac-paris.fr>" <Marcolino.Pires at ac-paris.fr>,
>>>> "ag-tech at mcs.anl.gov" <ag-tech at mcs.anl.gov>
>>>> Subject: Re: [AG-TECH] Vanishing Bridges
>>>>
>>>>
>>>> On 14/10/2010, at 7:12 AM, Christoph Willing wrote:
>>>>
>>>>>
>>>>> On 14/10/2010, at 2:13 AM, Thomas Uram wrote:
>>>>>
>>>>>> Last week I set up a test registry, registered a bridge with it,
>>>>>> and successively queried bridges from the registry all day with no
>>>>>> trouble. Granted, these were all local, but if the problem appears
>>>>>> as reliably as I've heard, I would have expected to see a problem
>>>>>> even in this case. We clearly need to narrow down the cause of the
>>>>>> problem some more. What details do we have about the failure
>>>>>> cases?
>>>>>
>>>>>
>>>>> We have very few details, unfortunately. I recall, nearly a year
>>>>> ago, I was able to replicate the problem and at that time I thought
>>>>> it may have something to do with newer python versions (since 2.6
>>>>> was implicated in another problem I'd seen and the replicable cases
>>>>> were on newer systems which included python2.6).
>>>>>
>>>>> However when I was retesting a Debian lenny system (which uses
>>>>> python2.5) just the night before last, I also ran a test with the
>>>>> new Ubuntu maverick (with python2.6). Both ran fine overnight, i.e.
>>>>> maverick seems OK despite using python2.6 (however note that other
>>>>> tests in France were not successful with maverick, so ....). Anyway,
>>>>> since maverick had run OK for me, I then started a test with Ubuntu
>>>>> lucid (also python2.6), one of the systems with which I'd previously
>>>>> been able to replicate the problem. This time it has run overnight
>>>>> without any bridge disappearances - I just tried a bridge cache
>>>>> purge from home and it showed up fine (still showing up as
>>>>> "LucidTest" in the bridge list if the www.ap-accessgrid.org registry
>>>>> is enabled).
>>>>
>>>>
>>>> On re-reading this last line, I wondered if the problem has something
>>>> to do with the registry itself. I guess all the failure instances so
>>>> far have been using the default ANL registryUrl at
>>>> www.accessgrid.org/registry/peers.txt, whereas my tests the last few
>>>> days, which produced no failures, all used the APAG registryUrl at
>>>> www.ap-accessgrid.org/registry/peers.txt. Obviously each points to a
>>>> different registry so could that be the problem?
>>>>
>>>> I spent all day today testing different _recent_ distros (Slackware
>>>> 13.1, Ubuntu lucid & maverick) against the different registries. In
>>>> all cases, bridges running against the ANL registry disappeared within
>>>> 10-15 minutes. In all cases except one (not repeatable), bridges
>>>> running against the APAG registry did not disappear.
>>>>
>>>> My theory therefore is that the ANL registry is running with an older
>>>> version of the AG toolkit that is not compatible with VenueClients
>>>> running newer AG versions. Tom's recent testing with a separate test
>>>> registry supports this theory (assuming the test registry is running
>>>> a recent version of the AG toolkit). Philippe's comment that tests
>>>> with maverick were unsuccessful also supports the theory (assuming
>>>> those tests used the default ANL registry).
>>>>
>>>>
>>>> Philippe and Tom (and anyone else interested),
>>>>
>>>> Could you try running (using the current AG release) a bridge against
>>>> the APAG registry - some command like:
>>>> Bridge3.py --name=Testing123 --location=wherever
>>>> --registryUrl=http://www.ap-accessgrid.org/registry/peers.txt
>>>>
>>>> Leave it running for about an hour or two to confirm it does not
>>>> disappear. Then stop it and run it again, this time against the ANL
>>>> registry with something like:
>>>> Bridge3.py --name=TestingXYZ --location=wherever
>>>> --registryUrl=http://www.accessgrid.org/registry/peers.txt
>>>>
>>>> Look for failure in the first 15 minutes.
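
The "look for failure" step could be automated along these lines. This is a rough sketch under stated assumptions: watch_bridge is a hypothetical helper, and list_bridges stands for whatever callable returns the bridge names a registry currently advertises (how to obtain that list from the AG toolkit is not shown here).

```python
import time


def watch_bridge(name, list_bridges, duration, interval,
                 clock=time.time, sleep=time.sleep):
    """Poll until `name` goes missing or `duration` seconds have passed.

    Returns the seconds elapsed when the bridge was first found missing,
    or None if it was still listed when the watch period ended.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if name not in list_bridges():
            return elapsed          # bridge has vanished from the list
        if elapsed >= duration:
            return None             # still listed at the end of the watch
        sleep(interval)
```

For the test described above, one might call watch_bridge("TestingXYZ", list_bridges, duration=15 * 60, interval=60) against the ANL registry and expect a number back, and run the same call with a longer duration against the APAG registry and expect None.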
>>>>
>>>>
>>>> If the fault is in the ANL registry, why do so many bridges _not_
>>>> disappear? Looking at the list of bridges, the names are becoming very
>>>> familiar i.e. they've been around a long time. I'm guessing that these
>>>> bridges are running on older versions of the AG toolkit - still
>>>> compatible with whatever version is running on the ANL registry
>>>> machine.
>>>>
>>>>
>>>> Of course, if the test results are in line with the theory, it still
>>>> doesn't explain the underlying cause. A quick look through bridge &
>>>> registry related AG code doesn't reveal any recent changes so the real
>>>> cause may actually be down in some of the supporting software (python,
>>>> m2crypto anyone?) which is constantly updated in each new Linux
>>>> release (typically every 6 months). If so, this issue will eventually
>>>> also bite Windows & Mac users as new OS versions introduce up-to-date
>>>> versions of python, m2crypto etc. for them too.
>>>>
>>>>
>>>> chris
>>>>
>>>>
>>>>> So we know very little about failure cases;
>>>>> - there are many in France
>>>>> - I was previously able to replicate but not now
>>>>> - I _think_ I recall that Todd Z reported that he had seen the
>>>>> problem too
>>>>>
>>>>> chris
>>>>>
>>>>>
>>>>>> On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:
>>>>>>
>>>>>>> Bonjour,
>>>>>>>
>>>>>>> I was not there yesterday and it's probably too late to "purge the
>>>>>>> cache" (there's just a Lucid test by now)
>>>>>>>
>>>>>>> In the meantime we decided to switch to debian because we have a
>>>>>>> seminar that will be transmitted tomorrow and really need the
>>>>>>> bridge to work (in fact to be visible to new users, and there it
>>>>>>> is).
>>>>>>>
>>>>>>> If it also works with "maverick" that is good news for other users
>>>>>>> in France (but in the first test we made the bridge disappeared
>>>>>>> too...)
>>>>>>>
>>>>>>> Merci pour tout!!
>>>>>>>
>>>>>>>
>>>>>>> Philippe d'Anfray
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Le 12/10/2010 12:56, Christoph Willing a écrit :
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We're still stuck with this bridge problem, we tried with
>>>>>>>>>> Ubuntu 10.10 this afternoon but it is still the same. If you
>>>>>>>>>> can confirm for us that it works fine with Debian, I'll
>>>>>>>>>> reconfigure our server and install Debian.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm just about to leave for a short holiday so I can't reconfirm
>>>>>>>>> that Debian still works correctly until late next week.
>>>>>>>>
>>>>>>>> I'm now running a test bridge with Debian "lenny". It has been
>>>>>>>> running nearly 5 hours without any problem so far. I'm also
>>>>>>>> running another test bridge using the new Ubuntu "maverick",
>>>>>>>> which has been running for over 4.5 hours - also no problem yet.
>>>>>>>> I will let them both run overnight here (your day time) and you
>>>>>>>> can check whether they're still running OK if you purge your
>>>>>>>> bridge cache (assuming you have www.ap-accessgrid.org as one of
>>>>>>>> your bridge registries) and look for the bridges named DebTest
>>>>>>>> (Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).
>>>>>>>>
>>>>>>
>>>>>
>>>>> Christoph Willing +61 7 3365 8316
>>>>> QCIF Access Grid Manager
>>>>> University of Queensland
>>>>>
>>>>
>>>> Christoph Willing +61 7 3365 8316
>>>> QCIF Access Grid Manager
>>>> University of Queensland
>>>>
>>>
>>
>
> Christoph Willing +61 7 3365 8316
> QCIF Access Grid Manager
> University of Queensland
>