[AG-TECH] Vanishing Bridges

John I. Quebedeaux, Jr johnq at lsu.edu
Thu Oct 14 20:59:37 CDT 2010


Great on the updates. - JQ


> From: Christoph Willing <c.willing at uq.edu.au>
> Date: Fri, 15 Oct 2010 11:47:12 +1000
> To: "Thomas D. Uram" <turam at mcs.anl.gov>
> Cc: John Quebedeaux <johnq at lsu.edu>, Philippe d'Anfray
> <Philippe.d-Anfray at cea.fr>, "<Marcolino.Pires at ac-paris.fr>"
> <Marcolino.Pires at ac-paris.fr>, "ag-tech at mcs.anl.gov" <ag-tech at mcs.anl.gov>
> Subject: Re: [AG-TECH] Vanishing Bridges
> 
> 
> On 15/10/2010, at 6:10 AM, Thomas Uram wrote:
> 
>> This has been fixed. I replicated the problem with a Bridge running
>> on Ubuntu Lucid, registered against the ANL bridge registry.
>> 
>> This problem came down to a change in the request handling code in
>> Python 2.6. The change added a handle_timeout method to
>> SocketServer.BaseServer, which gets called instead of raising a
>> socket.timeout exception. The bridge code was relying on this
>> timeout exception to re-register with the registry. That
>> functionality has now been moved to the handle_timeout method.
>> 
>> The change has been committed to the AG code here:
>> https://trac.ci.uchicago.edu/accessgrid/changeset/6820
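
For readers following along, here is a minimal sketch of the behaviour Tom describes. It is not the code from changeset 6820: the class name ReRegisteringServer, its register() method, the NullHandler class and the 0.1 second timeout are all invented for illustration. It relies only on the handle_timeout() hook and the timeout attribute of the standard SocketServer module (named socketserver in Python 3).

```python
try:
    import socketserver                      # Python 3 name
except ImportError:
    import SocketServer as socketserver      # Python 2.6+ name


class NullHandler(socketserver.BaseRequestHandler):
    """Placeholder request handler; this sketch never receives traffic."""

    def handle(self):
        pass


class ReRegisteringServer(socketserver.UDPServer):
    """Hypothetical bridge-like server, not the AG toolkit's actual class."""

    timeout = 0.1  # seconds handle_request() waits before giving up

    def __init__(self, addr, handler):
        socketserver.UDPServer.__init__(self, addr, handler)
        self.registrations = 0

    def register(self):
        # Stand-in for contacting the bridge registry.
        self.registrations += 1

    def handle_timeout(self):
        # From Python 2.6 on, handle_request() calls this hook when the
        # timeout expires, so periodic work has to live here.
        self.register()


server = ReRegisteringServer(("127.0.0.1", 0), NullHandler)
try:
    for _ in range(3):
        server.handle_request()   # nothing arrives, so each call times out
finally:
    server.server_close()

print(server.registrations)
```

Code that instead wraps handle_request() in "except socket.timeout:" would, per Tom's description, never reach its re-registration branch on Python 2.6 or later, which matches the symptom of bridges silently dropping out of the registry.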
> 
> 
> Thanks Tom,
> 
> Local testing confirms the fix works and I've just uploaded patched AG
> packages for Ubuntu 10.10 & Slackware 13.1 to their respective repos.
> Patched packages for other Ubuntu & Slackware versions should appear
> later today.
> 
> 
>> The relevant Python report is here:
>> http://bugs.python.org/issue742598
>> 
>> This does leave open the question of why the problem couldn't be
>> replicated in test setups using Python 2.6, as more than one of us
>> has done.
> 
> I think there is additional aberrant behaviour under python2.6 in the
> registry itself which masks the issue fixed by the patch. You'll
> recall that with the APAG registry, the original fault wasn't seen
> i.e. bridges didn't disappear. It turns out that bridges aren't being
> removed at all in this case, even after they have been intentionally
> stopped, which means non-existent bridges are still being advertised.
> They can only be removed from the advertised list by restarting the
> registry. As an example, I had a bridge named SLTest2 registered with
> the APAG registry. I stopped that bridge over an hour ago and since
> then the machine has been rebooted twice while making new AG packages
> for different distros. Yet that same bridge still appears in the
> bridge list on another machine after a "Purge Bridge Cache". It's
> disabled and unreachable, so it doesn't appear in a user's list under the
> Tools menu, but it's clearly still being advertised by the registry.
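
A toy model may help picture the two failure modes discussed in this thread. The thread does not show how the AG registry actually expires entries, so everything below is an assumption made for illustration: the ToyRegistry class, the 120 second lifetime, and the explicit "now" argument are hypothetical, and the bridge names are simply borrowed from the messages above.

```python
class ToyRegistry(object):
    """Toy model only; not the AG registry implementation."""

    def __init__(self, ttl=120.0):
        self.ttl = ttl               # assumed lifetime of one registration
        self.last_seen = {}          # bridge name -> time of last registration

    def register(self, name, now):
        self.last_seen[name] = now

    def advertised(self, now):
        # Drop entries whose last registration is older than the lifetime.
        expired = [n for n, t in self.last_seen.items() if now - t > self.ttl]
        for n in expired:
            del self.last_seen[n]
        return sorted(self.last_seen)


registry = ToyRegistry(ttl=120.0)
registry.register("SLTest2", now=0.0)      # registers once, then goes quiet
registry.register("LucidTest", now=0.0)
registry.register("LucidTest", now=100.0)  # keeps re-registering
print(registry.advertised(now=150.0))
```

In this model, a bridge that stops re-registering vanishes once its lifetime runs out, which is the original symptom. A registry whose expiry step never runs would keep advertising a stopped bridge until it is restarted, which is the behaviour described above for the APAG registry.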
> 
> 
> chris
> 
> 
> 
>> On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:
>> 
>>> Chris,
>>> 
>>> I can confirm that LSU is having to run an older version in order for
>>> our bridge not to disappear from the ANL registry. I haven't had time
>>> to figure out why it wasn't staying with our FC13 installation - so
>>> I've had to split the bridge and venueserver for the moment until I
>>> have time to pick it apart...
>>> I initially suspected it was a python version issue...
>>> 
>>> -John Q.
>>> -- 
>>> John I. Quebedeaux, Jr.; Louisiana State University
>>> Computer Manager LBRN; 131 Life Sciences Bldg.
>>> e-mail: johnq at lsu.edu; web: http://lbrn.lsu.edu
>>> phone: 225-578-0062 / fax: 225-578-2597
>>> 
>>> 
>>>> From: Christoph Willing <c.willing at uq.edu.au>
>>>> Date: Thu, 14 Oct 2010 21:09:12 +1000
>>>> To: Philippe d'Anfray <Philippe.d-Anfray at cea.fr>
>>>> Cc: "<Marcolino.Pires at ac-paris.fr>" <Marcolino.Pires at ac-paris.fr>,
>>>> "ag-tech at mcs.anl.gov" <ag-tech at mcs.anl.gov>
>>>> Subject: Re: [AG-TECH] Vanishing Bridges
>>>> 
>>>> 
>>>> On 14/10/2010, at 7:12 AM, Christoph Willing wrote:
>>>> 
>>>>> 
>>>>> On 14/10/2010, at 2:13 AM, Thomas Uram wrote:
>>>>> 
>>>>>> Last week I set up a test registry, registered a bridge with it,
>>>>>> and successively queried bridges from the registry all day with no
>>>>>> trouble. Granted, these were all local, but if the problem appears
>>>>>> as reliably as I've heard, I would have expected to see a problem
>>>>>> even in this case. We clearly need to narrow down the cause of the
>>>>>> problem some more. What details do we have about the failure
>>>>>> cases?
>>>>> 
>>>>> 
>>>>> We have very few details, unfortunately. I recall, nearly a year
>>>>> ago, I was able to replicate the problem and at that time I thought
>>>>> it may have something to do with newer python versions (since 2.6
>>>>> was implicated in another problem I'd seen and the replicable cases
>>>>> were on newer systems which included python2.6).
>>>>> 
>>>>> However when I was retesting a Debian lenny system (which uses
>>>>> python2.5) just the night before last, I also ran a test with the
>>>>> new Ubuntu maverick (with python2.6). Both ran fine overnight i.e.
>>>>> maverick seems OK despite using python2.6 (however note that other
>>>>> tests in France were not successful with maverick, so ....).
>>>>> Anyway, since maverick had run OK for me, I then started a test with
>>>>> Ubuntu lucid (also python2.6), one of the systems with which I'd
>>>>> previously been able to replicate the problem. This time it has run
>>>>> overnight without any bridge disappearances - I just tried a bridge
>>>>> cache purge from home and it showed up fine (still showing up as
>>>>> "LucidTest" in the bridge list if the www.ap-accessgrid.org registry
>>>>> is enabled).
>>>> 
>>>> 
>>>> On re-reading this last line, I wondered if the problem has something
>>>> to do with the registry itself. I guess all the failure instances so
>>>> far have been using the default ANL registryUrl at
>>>> www.accessgrid.org/registry/peers.txt, whereas my tests the last few
>>>> days, which produced no failures, all used the APAG registryUrl at
>>>> www.ap-accessgrid.org/registry/peers.txt. Obviously each points to a
>>>> different registry so could that be the problem?
>>>> 
>>>> I spent all day today testing different _recent_ distros (Slackware
>>>> 13.1, Ubuntu lucid & maverick) against the different registries. In
>>>> all cases, bridges running against the ANL registry disappeared within
>>>> 10-15 minutes. In all cases except one (not repeatable), bridges
>>>> running against the APAG registry did not disappear.
>>>> 
>>>> My theory therefore is that the ANL registry is running with an older
>>>> version of the AG toolkit that is not compatible with VenueClients
>>>> running newer AG versions. Tom's recent testing with a separate test
>>>> registry supports this theory (assuming the test registry is running a
>>>> recent version of the AG toolkit). Philippe's comment that tests with
>>>> maverick were unsuccessful also supports the theory (assuming those
>>>> tests used the default ANL registry).
>>>> 
>>>> 
>>>> Philippe and Tom (and anyone else interested),
>>>> 
>>>> Could you try running (using the current AG release) a bridge against
>>>> the APAG registry - some command like:
>>>>   Bridge3.py --name=Testing123 --location=wherever \
>>>>     --registryUrl=http://www.ap-accessgrid.org/registry/peers.txt
>>>> 
>>>> Leave it running for about an hour or two to confirm it does not
>>>> disappear. Then stop it and run it again, this time against the ANL
>>>> registry with something like:
>>>>   Bridge3.py --name=TestingXYZ --location=wherever \
>>>>     --registryUrl=http://www.accessgrid.org/registry/peers.txt
>>>> 
>>>> Look for failure in the first 15 minutes.
>>>> 
>>>> 
>>>> If the fault is in the ANL registry, why do so many bridges _not_
>>>> disappear? Looking at the list of bridges, the names are becoming very
>>>> familiar i.e. they've been around a long time. I'm guessing that these
>>>> bridges are running on older versions of the AG toolkit - still
>>>> compatible with whatever version is running on the ANL registry
>>>> machine.
>>>> 
>>>> 
>>>> Of course, if the test results are in line with the theory, it still
>>>> doesn't explain the underlying cause. A quick look through bridge &
>>>> registry related AG code doesn't reveal any recent changes so the real
>>>> cause may actually be down in some of the supporting software (python,
>>>> m2crypto anyone?) which are constantly updated in each new Linux
>>>> release (typically every 6 months). If so, this issue will eventually
>>>> also bite Windows & Mac users as new OS versions introduce up to date
>>>> versions of python, m2crypto etc. for them too.
>>>> 
>>>> 
>>>> chris
>>>> 
>>>> 
>>>>> So we know very little about failure cases:
>>>>> - there are many in France
>>>>> - I was previously able to replicate but not now
>>>>> - I _think_ I recall that Todd Z reported that he had seen the
>>>>>   problem too
>>>>> 
>>>>> chris
>>>>> 
>>>>> 
>>>>>> On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> I was not here yesterday and it's probably too late to "purge the
>>>>>>> cache" (there's just a Lucid test by now).
>>>>>>> 
>>>>>>> In the meantime we decided to switch to Debian because we have a
>>>>>>> seminar that will be transmitted tomorrow and we really need the
>>>>>>> bridge to work (in fact, to be visible to new users, and now it is).
>>>>>>> 
>>>>>>> If it also works with "maverick", that is good news for other users
>>>>>>> in France (but in the first test we made, the bridge disappeared
>>>>>>> too...)
>>>>>>> 
>>>>>>> Thanks for everything!!
>>>>>>> 
>>>>>>> 
>>>>>>> Philippe d'Anfray
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 12/10/2010 12:56, Christoph Willing wrote:
>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> We're still stuck with this bridge problem. We tried with
>>>>>>>>>> Ubuntu 10.10 this afternoon but it is still the same. If you
>>>>>>>>>> can confirm that it works fine with Debian, I'll reconfigure
>>>>>>>>>> our server and install Debian.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I'm just about to leave for a short holiday so I can't reconfirm
>>>>>>>>> that Debian still works correctly until late next week.
>>>>>>>> 
>>>>>>>> I'm now running a test bridge with Debian "lenny". It has been
>>>>>>>> running nearly 5 hours without any problem so far. I'm also
>>>>>>>> running another test bridge using the new Ubuntu "maverick",
>>>>>>>> which has been running for over 4.5 hours - also no problem yet.
>>>>>>>> I will let them both run overnight here (your day time) and you
>>>>>>>> can check whether they're still running OK if you purge your
>>>>>>>> bridge cache (assuming you have www.ap-accessgrid.org as one of
>>>>>>>> your bridge registries) and look for the bridges named DebTest
>>>>>>>> (Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).
>>>>>>>> 
>>>>>> 
>>>>> 
>>>>> Christoph Willing                       +61 7 3365 8316
>>>>> QCIF Access Grid Manager
>>>>> University of Queensland
>>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
> 


