<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7655.8">
<TITLE>RE: [AG-TECH] Vanishing Bridges</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->
<P><FONT SIZE=2>Andrew,<BR>
<BR>
The change was applied to the Ubuntu packages in version 3.2-3.<BR>
<BR>
<BR>
chris<BR>
<BR>
<BR>
-----Original Message-----<BR>
From: adhesionmusic@gmail.com on behalf of Andrew Ford<BR>
Sent: Thu 11/4/2010 2:18 AM<BR>
To: Chris Willing<BR>
Cc: Thomas Uram; &lt;Marcolino.Pires@ac-paris.fr&gt;; ag-tech@mcs.anl.gov<BR>
Subject: Re: [AG-TECH] Vanishing Bridges<BR>
<BR>
Hi Chris,<BR>
<BR>
Has this fix been pushed to all the different distribution repos? I'm still<BR>
getting the issue (bridge registers, accessible to clients, then apparently<BR>
disappears from registry after a few minutes) with bridges run on Ubuntu<BR>
9.04 and Fedora 13, and trying to run Bridge3.py on an Ubuntu 10.10 box just<BR>
hangs.<BR>
<BR>
--Andrew<BR>
<BR>
2010/10/14 Christoph Willing &lt;c.willing@uq.edu.au&gt;<BR>
<BR>
><BR>
> On 15/10/2010, at 1:23 PM, Thomas Uram wrote:<BR>
><BR>
> This fix also addresses the problem where bridges are not removed from the<BR>
>> registry. The RegistryPeer also uses the AGXMLRPCServer, and relies on the<BR>
>> timeout for cleaning up bridges that have timed out. I haven't confirmed the<BR>
>> fix in this case by testing, but it's clearly borne out in the code. I'll<BR>
>> test it tomorrow.<BR>
>><BR>
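<BR>
A minimal sketch of timeout-driven cleanup as described above, assuming the registry keeps a last-registration time per bridge and prunes entries older than a fixed lifetime. The class BridgeRegistry, its methods and the 120 second lifetime are hypothetical and are not taken from the RegistryPeer code.<BR>
<PRE>
```python
class BridgeRegistry:
    """Hypothetical registry model; not the actual RegistryPeer code."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.last_seen = {}  # bridge name -> time of last registration

    def register(self, name, now):
        # A bridge registers (or re-registers) at time `now`.
        self.last_seen[name] = now

    def prune(self, now):
        """Drop bridges whose last registration is older than the lifetime."""
        expired = [name for name, seen in self.last_seen.items()
                   if now - seen > self.ttl_seconds]
        for name in expired:
            del self.last_seen[name]
        return sorted(expired)

    def advertised(self):
        return sorted(self.last_seen)


registry = BridgeRegistry(ttl_seconds=120)
registry.register("SLTest2", now=0)
registry.register("LucidTest", now=0)
registry.register("LucidTest", now=100)  # only LucidTest re-registers
removed = registry.prune(now=150)
remaining = registry.advertised()
```
</PRE>
If prune() were only ever called from a handler for a timeout exception that is no longer raised, it would never run and stopped bridges would stay in the advertised list.<BR>
<BR>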
><BR>
><BR>
> Tom,<BR>
><BR>
> I just updated our registry machine and bridges are now being removed<BR>
> correctly. Looks like a good fix all round.<BR>
><BR>
> I can't help thinking there are other bits of AG code that would benefit<BR>
> from a similar fix - the ftps server springs to mind (can't currently upload<BR>
> data to a venue on a server running with python2.6).<BR>
><BR>
><BR>
><BR>
> chris<BR>
><BR>
><BR>
> On Oct 14, 2010, at 8:47 PM, Christoph Willing wrote:<BR>
>><BR>
>><BR>
>>> On 15/10/2010, at 6:10 AM, Thomas Uram wrote:<BR>
>>><BR>
>>> This has been fixed. I replicated the problem with a Bridge running on<BR>
>>>> Ubuntu Lucid, registered against the ANL bridge registry.<BR>
>>>><BR>
>>>> This problem came down to a change in the request handling code in<BR>
>>>> Python 2.6. The change added a handle_timeout method to<BR>
>>>> SocketServer.BaseServer, which gets called instead of raising a<BR>
>>>> socket.timeout exception. The bridge code was relying on this timeout<BR>
>>>> exception to re-register with the registry. That functionality has now been<BR>
>>>> moved to the handle_timeout method.<BR>
>>>><BR>
>>>> The change has been committed to the AG code here:<BR>
>>>> <A HREF="https://trac.ci.uchicago.edu/accessgrid/changeset/6820">https://trac.ci.uchicago.edu/accessgrid/changeset/6820</A><BR>
>>>><BR>
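<BR>
A minimal sketch of the behaviour described above, not the code from the changeset: when handle_request() sees no connection within the server's timeout, it calls handle_timeout() rather than raising socket.timeout, so periodic work has to live in that method. The class ReRegisteringServer, the method register_with_registry and the 0.2 second timeout are hypothetical stand-ins; the real bridge uses AGXMLRPCServer.<BR>
<PRE>
```python
try:
    import socketserver  # Python 3 name
except ImportError:
    import SocketServer as socketserver  # Python 2.6 name used in the thread


class ReRegisteringServer(socketserver.TCPServer):
    """Hypothetical stand-in for the bridge's server."""

    timeout = 0.2  # seconds handle_request() waits for a connection

    def __init__(self, address, handler):
        socketserver.TCPServer.__init__(self, address, handler)
        self.registrations = 0

    def handle_timeout(self):
        # Called by handle_request() when no request arrives within
        # self.timeout; no socket.timeout exception reaches the caller.
        self.register_with_registry()

    def register_with_registry(self):
        # Placeholder: a real bridge would contact the registry here.
        self.registrations += 1


def serve(server, iterations):
    """Poll for requests; each idle period ends in handle_timeout()."""
    for _ in range(iterations):
        server.handle_request()
    return server.registrations


# Port 0 asks the OS for any free port; nothing connects to it here.
server = ReRegisteringServer(("127.0.0.1", 0), socketserver.BaseRequestHandler)
try:
    count = serve(server, 3)
finally:
    server.server_close()
```
</PRE>
With no client connecting, each of the three handle_request() calls times out and re-registration runs once per call. Code that instead wrapped handle_request() in an except clause for socket.timeout would never reach its re-registration step.<BR>
<BR>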
>>><BR>
>>><BR>
>>> Thanks Tom,<BR>
>>><BR>
>>> Local testing confirms the fix works and I've just uploaded patched AG<BR>
>>> packages for Ubuntu 10.10 &amp; Slackware 13.1 to their respective repos.<BR>
>>> Patched packages for other Ubuntu &amp; Slackware versions should appear later<BR>
>>> today.<BR>
>>><BR>
>>><BR>
>>> The relevant Python report is here:<BR>
>>>> <A HREF="http://bugs.python.org/issue742598">http://bugs.python.org/issue742598</A><BR>
>>>><BR>
>>>> This does leave open the question of why the problem couldn't be<BR>
>>>> replicated in test setups using Python 2.6, as more than one of us has done.<BR>
>>>><BR>
>>><BR>
>>> I think there is additional aberrant behaviour under python2.6 in the<BR>
>>> registry itself which masks the issue fixed by the patch. You'll recall that<BR>
>>> with the APAG registry, the original fault wasn't seen i.e. bridges didn't<BR>
>>> disappear. It turns out that bridges aren't being removed at all in this<BR>
>>> case, even after they have been intentionally stopped, which means<BR>
>>> non-existent bridges are still being advertised. They can only be removed<BR>
>>> from the advertised list by restarting the registry. As an example, I had a<BR>
>>> bridge named SLTest2 registered with the APAG registry. I stopped that<BR>
>>> bridge over an hour ago and since then the machine has been rebooted twice<BR>
>>> while making new AG packages for different distros. Yet that same bridge<BR>
>>> still appears in the bridge list on another machine after a "Purge Bridge<BR>
>>> Cache". It's disabled and unreachable, so doesn't appear in a user's list<BR>
>>> under the Tools menu, but it's clearly still being advertised by the<BR>
>>> registry.<BR>
>>><BR>
>>><BR>
>>> chris<BR>
>>><BR>
>>><BR>
>>><BR>
>>> On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:<BR>
>>>><BR>
>>>> Chris,<BR>
>>>>><BR>
>>>>> I can confirm that LSU is having to run an older version in order for our<BR>
>>>>> bridge not to disappear from the ANL registry. I haven't had time to figure<BR>
>>>>> out why it wasn't staying with our FC13 installation - so I've had to split<BR>
>>>>> the bridge and venueserver for the moment until I have time to pick it<BR>
>>>>> apart...<BR>
>>>>> I initially suspected it was a python version issue...<BR>
>>>>><BR>
>>>>> -John Q.<BR>
>>>>> --<BR>
>>>>> John I. Quebedeaux, Jr.; Louisiana State University<BR>
>>>>> Computer Manager LBRN; 131 Life Sciences Bldg.<BR>
>>>>> e-mail: johnq@lsu.edu; web: <A HREF="http://lbrn.lsu.edu">http://lbrn.lsu.edu</A><BR>
>>>>> phone: 225-578-0062 / fax: 225-578-2597<BR>
>>>>><BR>
>>>>><BR>
>>>>> From: Christoph Willing &lt;c.willing@uq.edu.au&gt;<BR>
>>>>>> Date: Thu, 14 Oct 2010 21:09:12 +1000<BR>
>>>>>> To: Philippe d'Anfray &lt;Philippe.d-Anfray@cea.fr&gt;<BR>
>>>>>> Cc: "&lt;Marcolino.Pires@ac-paris.fr&gt;" &lt;Marcolino.Pires@ac-paris.fr&gt;,<BR>
>>>>>> "ag-tech@mcs.anl.gov" <ag-tech@mcs.anl.gov><BR>
>>>>>> Subject: Re: [AG-TECH] Vanishing Bridges<BR>
>>>>>><BR>
>>>>>><BR>
>>>>>> On 14/10/2010, at 7:12 AM, Christoph Willing wrote:<BR>
>>>>>><BR>
>>>>>><BR>
>>>>>>> On 14/10/2010, at 2:13 AM, Thomas Uram wrote:<BR>
>>>>>>><BR>
>>>>>>> Last week I set up a test registry, registered a bridge with it,<BR>
>>>>>>>> and successively queried bridges from the registry all day with no<BR>
>>>>>>>> trouble. Granted, these were all local, but if the problem appears<BR>
>>>>>>>> as reliably as I've heard, I would have expected to see a problem<BR>
>>>>>>>> even in this case. We clearly need to narrow down the cause of the<BR>
>>>>>>>> problem some more. What details do we have about the failure cases?<BR>
>>>>>>>><BR>
>>>>>>><BR>
>>>>>>><BR>
>>>>>>> We have very few details, unfortunately. I recall, nearly a year<BR>
>>>>>>> ago, I was able to replicate the problem and at that time I thought<BR>
>>>>>>> it may have something to do with newer python versions (since 2.6<BR>
>>>>>>> was implicated in another problem I'd seen and the replicable cases<BR>
>>>>>>> were on newer systems which included python2.6).<BR>
>>>>>>><BR>
>>>>>>> However when I was retesting a Debian lenny system (which uses<BR>
>>>>>>> python2.5) just the night before last, I also ran a test with the new<BR>
>>>>>>> Ubuntu maverick (with python2.6). Both ran fine overnight i.e.<BR>
>>>>>>> maverick seems OK despite using python2.6 (however note that other<BR>
>>>>>>> tests in France were not successful with maverick, so ....). Anyway,<BR>
>>>>>>> since maverick had run OK for me, I then started a test with Ubuntu<BR>
>>>>>>> lucid (also python2.6), one of the systems with which I'd previously<BR>
>>>>>>> been able to replicate the problem. This time it has run overnight<BR>
>>>>>>> without any bridge disappearances - I just tried a bridge cache<BR>
>>>>>>> purge from home and it showed up fine (still showing up as "LucidTest"<BR>
>>>>>>> in the bridge list if the www.ap-accessgrid.org registry is enabled).<BR>
>>>>>>><BR>
>>>>>><BR>
>>>>>><BR>
>>>>>> On re-reading this last line, I wondered if the problem has something<BR>
>>>>>> to do with the registry itself. I guess all the failure instances so<BR>
>>>>>> far have been using the default ANL registryUrl at<BR>
>>>>>> www.accessgrid.org/registry/peers.txt, whereas my tests the last few<BR>
>>>>>> days, which produced no failures, all used the APAG registryUrl at<BR>
>>>>>> www.ap-accessgrid.org/registry/peers.txt.<BR>
>>>>>> Obviously each points to a different registry so could that be the<BR>
>>>>>> problem?<BR>
>>>>>><BR>
>>>>>> I spent all day today testing different _recent_ distros (Slackware<BR>
>>>>>> 13.1, Ubuntu lucid & maverick) against the different registries. In<BR>
>>>>>> all cases, bridges running against the ANL registry disappeared within<BR>
>>>>>> 10-15 minutes. In all cases except one (not repeatable), bridges<BR>
>>>>>> running against the APAG registry did not disappear.<BR>
>>>>>><BR>
>>>>>> My theory therefore is that ANL registry is running with an older<BR>
>>>>>> version of the AG toolkit that is not compatible with VenueClients<BR>
>>>>>> running newer AG versions. Tom's recent testing with a separate test<BR>
>>>>>> registry supports this theory (assuming the test registry is running a<BR>
>>>>>> recent version of AG toolkit). Philippe's comment that tests with<BR>
>>>>>> maverick were unsuccessful also supports the theory (assuming those<BR>
>>>>>> tests used the default ANL registry).<BR>
>>>>>><BR>
>>>>>><BR>
>>>>>> Philippe and Tom (and anyone else interested),<BR>
>>>>>><BR>
>>>>>> Could you try running (using the current AG release) a bridge against<BR>
>>>>>> the APAG registry - some command like:<BR>
>>>>>> Bridge3.py --name=Testing123 --location=wherever<BR>
>>>>>> --registryUrl=<A HREF="http://www.ap-accessgrid.org/registry/peers.txt">http://www.ap-accessgrid.org/registry/peers.txt</A><BR>
>>>>>><BR>
>>>>>> Leave it running for about an hour or two to confirm it does not<BR>
>>>>>> disappear. Then stop it and run it again, this time against the ANL<BR>
>>>>>> registry with something like:<BR>
>>>>>> Bridge3.py --name=TestingXYZ --location=wherever<BR>
>>>>>> --registryUrl=<A HREF="http://www.accessgrid.org/registry/peers.txt">http://www.accessgrid.org/registry/peers.txt</A><BR>
>>>>>><BR>
>>>>>> Look for failure in the first 15 minutes.<BR>
>>>>>><BR>
>>>>>><BR>
>>>>>> If the fault is in the ANL registry, why do so many bridges _not_<BR>
>>>>>> disappear? Looking at the list of bridges, the names are becoming very<BR>
>>>>>> familiar i.e. they've been around a long time. I'm guessing that these<BR>
>>>>>> bridges are running on older versions of the AG toolkit - still<BR>
>>>>>> compatible with whatever version is running on the ANL registry<BR>
>>>>>> machine.<BR>
>>>>>><BR>
>>>>>><BR>
>>>>>> Of course, if the test results are in line with the theory, it still<BR>
>>>>>> doesn't explain the underlying cause. A quick look through bridge &<BR>
>>>>>> registry related AG code doesn't reveal any recent changes so the real<BR>
>>>>>> cause may actually be down in some of the supporting software (python,<BR>
>>>>>> m2crypto anyone?) which are constantly updated in each new Linux<BR>
>>>>>> release (typically every 6 months). If so, this issue will eventually<BR>
>>>>>> also bite Windows & Mac users as new OS versions introduce up to date<BR>
>>>>>> versions of python, m2crypto etc. for them too.<BR>
>>>>>><BR>
>>>>>><BR>
>>>>>> chris<BR>
>>>>>><BR>
>>>>>><BR>
>>>>>> So we know very little about failure cases;<BR>
>>>>>>> - there are many in France<BR>
>>>>>>> - I was previously able to replicate but not now<BR>
>>>>>>> - I _think_ I recall that Todd Z reported that he had seen the<BR>
>>>>>>> problem too<BR>
>>>>>>><BR>
>>>>>>> chris<BR>
>>>>>>><BR>
>>>>>>><BR>
>>>>>>> On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:<BR>
>>>>>>>><BR>
>>>>>>>> Hello,<BR>
>>>>>>>>><BR>
>>>>>>>>> I was not there yesterday and it's probably too late to "purge the<BR>
>>>>>>>>> cache" (there's just a Lucid test by now)<BR>
>>>>>>>>><BR>
>>>>>>>>> In the meantime we decided to switch to Debian, because we have a<BR>
>>>>>>>>> seminar that will be transmitted tomorrow and really need the<BR>
>>>>>>>>> bridge to work (in fact, to be visible to new users, and now it<BR>
>>>>>>>>> is).<BR>
>>>>>>>>><BR>
>>>>>>>>> If it also works with "maverick", that is good news for other users<BR>
>>>>>>>>> in France (but in the first test we made, the bridge disappeared<BR>
>>>>>>>>> too...)<BR>
>>>>>>>>><BR>
>>>>>>>>> Thank you for everything!!<BR>
>>>>>>>>><BR>
>>>>>>>>><BR>
>>>>>>>>> Philippe d'Anfray<BR>
>>>>>>>>><BR>
>>>>>>>>><BR>
>>>>>>>>><BR>
>>>>>>>>><BR>
>>>>>>>>> On 12/10/2010 12:56, Christoph Willing wrote:<BR>
>>>>>>>>><BR>
>>>>>>>>>><BR>
>>>>>>>>>><BR>
>>>>>>>>>>>> We're still stuck with this bridge problem, we tried with<BR>
>>>>>>>>>>>> Ubuntu 10.10 this afternoon but it is still the same. If you<BR>
>>>>>>>>>>>> can confirm to us that it works fine with Debian, I'll reconfigure<BR>
>>>>>>>>>>>> our server and install a Debian.<BR>
>>>>>>>>>>>><BR>
>>>>>>>>>>><BR>
>>>>>>>>>>><BR>
>>>>>>>>>>> I'm just about to leave for a short holiday so I can't reconfirm<BR>
>>>>>>>>>>> that Debian still works correctly until late next week.<BR>
>>>>>>>>>>><BR>
>>>>>>>>>><BR>
>>>>>>>>>> I'm now running a test bridge with Debian "lenny". It has been<BR>
>>>>>>>>>> running nearly 5 hours without any problem so far. I'm also<BR>
>>>>>>>>>> running another test bridge using the new Ubuntu "maverick",<BR>
>>>>>>>>>> which has been running for over 4.5 hours - also no problem yet.<BR>
>>>>>>>>>> I will let them both run overnight here (your day time) and you<BR>
>>>>>>>>>> can check whether they're still running OK if you purge your<BR>
>>>>>>>>>> bridge cache (assuming you have www.ap-accessgrid.org as one of<BR>
>>>>>>>>>> your bridge registries) and look for the bridges named DebTest<BR>
>>>>>>>>>> (Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).<BR>
>>>>>>>>>><BR>
>>>>>>>>><BR>
>>>>>>>><BR>
>>>>>>>><BR>
>>>>>>> Christoph Willing +61 7 3365 8316<BR>
>>>>>>> QCIF Access Grid Manager<BR>
>>>>>>> University of Queensland<BR>
>>>>>>><BR>
>>>>>>><BR>
>>>>>><BR>
>>>>>><BR>
>>>>><BR>
>>>><BR>
>>><BR>
>>><BR>
>><BR>
><BR>
><BR>
<BR>
</FONT>
</P>
</BODY>
</HTML>