[AG-TECH] Vanishing Bridges

Christoph Willing c.willing at uq.edu.au
Thu Oct 14 22:31:28 CDT 2010


On 15/10/2010, at 1:23 PM, Thomas Uram wrote:

> This fix also addresses the problem where bridges are not removed  
> from the registry. The RegistryPeer also uses the AGXMLRPCServer,  
> and relies on the timeout for cleaning up bridges that have timed  
> out. I haven't confirmed the fix in this case by testing, but it's  
> clearly borne out in the code. I'll test it tomorrow.
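
For anyone following along, the registry-side cleanup described above amounts to a periodic sweep that drops bridges which have not re-registered recently. Below is a minimal sketch of that idea only, not the actual RegistryPeer code: the class name BridgeTable, its methods and the TTL value are all made up for illustration.

```python
import time


class BridgeTable(object):
    """Hypothetical registry-side table of bridges and their last check-in."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.last_seen = {}  # bridge name -> time of last registration

    def register(self, name, now=None):
        # Called whenever a bridge registers or re-registers.
        self.last_seen[name] = time.time() if now is None else now

    def purge_expired(self, now=None):
        # Drop every bridge that has not checked in within the TTL and
        # return the names that were removed.
        now = time.time() if now is None else now
        expired = sorted(name for name, seen in self.last_seen.items()
                         if now - seen > self.ttl_seconds)
        for name in expired:
            del self.last_seen[name]
        return expired
```

If a sweep like purge_expired() is only ever triggered from a timeout path that never runs, stale bridges stay in the table until the registry is restarted.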


Tom,

I just updated our registry machine and bridges are now being removed  
correctly. Looks like a good fix all round.

I can't help thinking there are other bits of AG code that would  
benefit from a similar fix - the ftps server springs to mind (can't  
currently upload data to a venue on a server running with python2.6).


chris


> On Oct 14, 2010, at 8:47 PM, Christoph Willing wrote:
>
>>
>> On 15/10/2010, at 6:10 AM, Thomas Uram wrote:
>>
>>> This has been fixed. I replicated the problem with a Bridge  
>>> running on Ubuntu Lucid, registered against the ANL bridge registry.
>>>
>>> This problem came down to a change in the request handling code in  
>>> Python 2.6. The change added a handle_timeout method to  
>>> SocketServer.BaseServer, which gets called instead of raising a  
>>> socket.timeout exception. The bridge code was relying on this  
>>> timeout exception to re-register with the registry. That  
>>> functionality has now been moved to the handle_timeout method.
>>>
>>> The change has been committed to the AG code here:
>>> https://trac.ci.uchicago.edu/accessgrid/changeset/6820
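
For readers without the changeset to hand, here is a minimal sketch of the pattern described above. It is not the AG code: ReRegisteringServer, register() and run_idle_cycles() are made-up names, and register() merely counts calls where the real bridge would contact the registry. It is written against Python 3's socketserver module, which is named SocketServer in the Python 2.x that AG uses.

```python
import socketserver  # named SocketServer in Python 2.x


class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Trivial request handler; never exercised in the idle demo below.
        self.request.sendall(self.request.recv(1024))


class ReRegisteringServer(socketserver.TCPServer):
    timeout = 0.2  # seconds handle_request() waits for a request
    allow_reuse_address = True

    def __init__(self, address, handler):
        socketserver.TCPServer.__init__(self, address, handler)
        self.registrations = 0

    def register(self):
        # Hypothetical stand-in for re-registering with the registry.
        self.registrations += 1

    def handle_timeout(self):
        # Called by handle_request() when no request arrives in time;
        # no socket.timeout exception reaches the caller.
        self.register()


def run_idle_cycles(cycles):
    # Serve on an ephemeral local port with no clients, so every
    # handle_request() call ends in handle_timeout().
    server = ReRegisteringServer(("127.0.0.1", 0), EchoHandler)
    try:
        for _ in range(cycles):
            server.handle_request()
    finally:
        server.server_close()
    return server.registrations
```

Code that instead wraps handle_request() in "except socket.timeout" never reaches its except branch on such a server, which matches the silent failure to re-register.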
>>
>>
>> Thanks Tom,
>>
>> Local testing confirms the fix works and I've just uploaded patched
>> AG packages for Ubuntu 10.10 & Slackware 13.1 to their respective
>> repos. Patched packages for other Ubuntu & Slackware versions
>> should appear later today.
>>
>>
>>> The relevant Python report is here:
>>> http://bugs.python.org/issue742598
>>>
>>> This does leave open the question of why the problem couldn't be  
>>> replicated in test setups using Python 2.6, as more than one of us  
>>> has done.
>>
>> I think there is additional aberrant behaviour under python2.6 in
>> the registry itself which masks the issue fixed by the patch.
>> You'll recall that with the APAG registry, the original fault
>> wasn't seen, i.e. bridges didn't disappear. It turns out that
>> bridges aren't being removed at all in this case, even after they
>> have been intentionally stopped, which means non-existent bridges
>> are still being advertised. They can only be removed from the
>> advertised list by restarting the registry. As an example, I had a
>> bridge named SLTest2 registered with the APAG registry. I stopped
>> that bridge over an hour ago and since then the machine has been
>> rebooted twice while making new AG packages for different distros.
>> Yet that same bridge still appears in the bridge list on another
>> machine after a "Purge Bridge Cache". It's disabled and unreachable,
>> so doesn't appear in a user's list under the Tools menu, but it's
>> clearly still being advertised by the registry.
>>
>>
>> chris
>>
>>
>>
>>> On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:
>>>
>>>> Chris,
>>>>
>>>> I can confirm that LSU is having to run an older version in order for
>>>> our bridge not to disappear from the ANL registry. I haven't had time
>>>> to figure out why it wasn't staying with our FC13 installation - so
>>>> I've had to split the bridge and venueserver for the moment until I
>>>> have time to pick it apart...
>>>> I initially suspected it was a python version issue...
>>>>
>>>> -John Q.
>>>> -- 
>>>> John I. Quebedeaux, Jr.; Louisiana State University
>>>> Computer Manager LBRN; 131 Life Sciences Bldg.
>>>> e-mail: johnq at lsu.edu; web: http://lbrn.lsu.edu
>>>> phone: 225-578-0062 / fax: 225-578-2597
>>>>
>>>>
>>>>> From: Christoph Willing <c.willing at uq.edu.au>
>>>>> Date: Thu, 14 Oct 2010 21:09:12 +1000
>>>>> To: Philippe d'Anfray <Philippe.d-Anfray at cea.fr>
>>>>> Cc: "<Marcolino.Pires at ac-paris.fr>" <Marcolino.Pires at ac-paris.fr>,
>>>>> "ag-tech at mcs.anl.gov" <ag-tech at mcs.anl.gov>
>>>>> Subject: Re: [AG-TECH] Vanishing Bridges
>>>>>
>>>>>
>>>>> On 14/10/2010, at 7:12 AM, Christoph Willing wrote:
>>>>>
>>>>>>
>>>>>> On 14/10/2010, at 2:13 AM, Thomas Uram wrote:
>>>>>>
>>>>>>> Last week I set up a test registry, registered a bridge with it,
>>>>>>> and successively queried bridges from the registry all day with no
>>>>>>> trouble. Granted, these were all local, but if the problem appears
>>>>>>> as reliably as I've heard, I would have expected to see a problem
>>>>>>> even in this case. We clearly need to narrow down the cause of the
>>>>>>> problem some more. What details do we have about the failure cases?
>>>>>>
>>>>>>
>>>>>> We have very few details, unfortunately. I recall, nearly a year
>>>>>> ago, I was able to replicate the problem and at that time I thought
>>>>>> it may have something to do with newer python versions (since 2.6
>>>>>> was implicated in another problem I'd seen and the replicable cases
>>>>>> were on newer systems which included python2.6).
>>>>>>
>>>>>> However, when I was retesting a Debian lenny system (which uses
>>>>>> python2.5) just the night before last, I also ran a test with the new
>>>>>> Ubuntu maverick (with python2.6). Both ran fine overnight, i.e.
>>>>>> maverick seems OK despite using python2.6 (however, note that other
>>>>>> tests in France were not successful with maverick, so ...). Anyway,
>>>>>> since maverick had run OK for me, I then started a test with Ubuntu
>>>>>> lucid (also python2.6), one of the systems with which I'd previously
>>>>>> been able to replicate the problem. This time it has run overnight
>>>>>> without any bridge disappearances - I just tried a bridge cache
>>>>>> purge from home and it showed up fine (still showing up as
>>>>>> "LucidTest" in the bridge list if the www.ap-accessgrid.org registry
>>>>>> is enabled).
>>>>>
>>>>>
>>>>> On re-reading this last line, I wondered if the problem has something
>>>>> to do with the registry itself. I guess all the failure instances so
>>>>> far have been using the default ANL registryUrl at
>>>>> www.accessgrid.org/registry/peers.txt, whereas my tests the last few
>>>>> days, which produced no failures, all used the APAG registryUrl at
>>>>> www.ap-accessgrid.org/registry/peers.txt. Obviously each points to a
>>>>> different registry, so could that be the problem?
>>>>>
>>>>> I spent all day today testing different _recent_ distros (Slackware
>>>>> 13.1, Ubuntu lucid & maverick) against the different registries. In
>>>>> all cases, bridges running against the ANL registry disappeared within
>>>>> 10-15 minutes. In all cases except one (not repeatable), bridges
>>>>> running against the APAG registry did not disappear.
>>>>>
>>>>> My theory therefore is that the ANL registry is running with an older
>>>>> version of the AG toolkit that is not compatible with VenueClients
>>>>> running newer AG versions. Tom's recent testing with a separate test
>>>>> registry supports this theory (assuming the test registry is running a
>>>>> recent version of the AG toolkit). Philippe's comment that tests with
>>>>> maverick were unsuccessful also supports the theory (assuming those
>>>>> tests used the default ANL registry).
>>>>>
>>>>>
>>>>> Philippe and Tom (and anyone else interested),
>>>>>
>>>>> Could you try running (using the current AG release) a bridge against
>>>>> the APAG registry - some command like:
>>>>> Bridge3.py --name=Testing123 --location=wherever --registryUrl=http://www.ap-accessgrid.org/registry/peers.txt
>>>>>
>>>>> Leave it running for about an hour or two to confirm it does not
>>>>> disappear. Then stop it and run it again, this time against the ANL
>>>>> registry with something like:
>>>>> Bridge3.py --name=TestingXYZ --location=wherever --registryUrl=http://www.accessgrid.org/registry/peers.txt
>>>>>
>>>>> Look for failure in the first 15 minutes.
>>>>>
>>>>>
>>>>> If the fault is in the ANL registry, why do so many bridges _not_
>>>>> disappear? Looking at the list of bridges, the names are becoming very
>>>>> familiar, i.e. they've been around a long time. I'm guessing that these
>>>>> bridges are running on older versions of the AG toolkit - still
>>>>> compatible with whatever version is running on the ANL registry
>>>>> machine.
>>>>>
>>>>>
>>>>> Of course, if the test results are in line with the theory, it still
>>>>> doesn't explain the underlying cause. A quick look through bridge &
>>>>> registry related AG code doesn't reveal any recent changes, so the real
>>>>> cause may actually be down in some of the supporting software (python,
>>>>> m2crypto anyone?) which are constantly updated in each new Linux
>>>>> release (typically every 6 months). If so, this issue will eventually
>>>>> also bite Windows & Mac users as new OS versions introduce up-to-date
>>>>> versions of python, m2crypto etc. for them too.
>>>>>
>>>>>
>>>>> chris
>>>>>
>>>>>
>>>>>> So we know very little about failure cases:
>>>>>> - there are many in France
>>>>>> - I was previously able to replicate but not now
>>>>>> - I _think_ I recall that Todd Z reported that he had seen the
>>>>>>   problem too
>>>>>>
>>>>>> chris
>>>>>>
>>>>>>
>>>>>>> On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I was not here yesterday and it's probably too late to "purge the
>>>>>>>> cache" (there's just a Lucid test by now).
>>>>>>>>
>>>>>>>> In the meantime we decided to switch to Debian because we have a
>>>>>>>> seminar that will be transmitted tomorrow and really need the
>>>>>>>> bridge to work (in fact, to be visible to new users, and now it is).
>>>>>>>>
>>>>>>>> If it also works with "maverick", that is good news for other users
>>>>>>>> in France (but in the first test we made, the bridge disappeared
>>>>>>>> too...)
>>>>>>>>
>>>>>>>> Thanks for everything!!
>>>>>>>>
>>>>>>>>
>>>>>>>> Philippe d'Anfray
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12/10/2010 12:56, Christoph Willing wrote:
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We're still stuck with this bridge problem. We tried with
>>>>>>>>>>> Ubuntu 10.10 this afternoon but it is still the same. If you
>>>>>>>>>>> can confirm that it works fine with Debian, I'll reconfigure
>>>>>>>>>>> our server and install Debian.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm just about to leave for a short holiday so I can't reconfirm
>>>>>>>>>> that Debian still works correctly until late next week.
>>>>>>>>>
>>>>>>>>> I'm now running a test bridge with Debian "lenny". It has been
>>>>>>>>> running nearly 5 hours without any problem so far. I'm also
>>>>>>>>> running another test bridge using the new Ubuntu "maverick",
>>>>>>>>> which has been running for over 4.5 hours - also no problem yet.
>>>>>>>>> I will let them both run overnight here (your day time) and you
>>>>>>>>> can check whether they're still running OK if you purge your
>>>>>>>>> bridge cache (assuming you have www.ap-accessgrid.org as one of
>>>>>>>>> your bridge registries) and look for the bridges named DebTest
>>>>>>>>> (Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Christoph Willing                       +61 7 3365 8316
>>>>>> QCIF Access Grid Manager
>>>>>> University of Queensland

Christoph Willing                       +61 7 3365 8316
QCIF Access Grid Manager
University of Queensland


