[AG-TECH] Vanishing Bridges

John I. Quebedeaux, Jr johnq at lsu.edu
Thu Oct 14 08:48:00 CDT 2010


Chris,

I can confirm that LSU is having to run an older version in order for our
bridge not to disappear from the ANL registry. I haven't had time to figure
out why it wasn't staying with our FC13 installation - so I've had to split
the bridge and venueserver for the moment until I have time to pick it apart...
I initially suspected it was a python version issue...

-John Q.
-- 
John I. Quebedeaux, Jr.; Louisiana State University
Computer Manager LBRN; 131 Life Sciences Bldg.
e-mail: johnq at lsu.edu; web: http://lbrn.lsu.edu
phone: 225-578-0062 / fax: 225-578-2597


> From: Christoph Willing <c.willing at uq.edu.au>
> Date: Thu, 14 Oct 2010 21:09:12 +1000
> To: Philippe d'Anfray <Philippe.d-Anfray at cea.fr>
> Cc: "<Marcolino.Pires at ac-paris.fr>" <Marcolino.Pires at ac-paris.fr>,
> "ag-tech at mcs.anl.gov" <ag-tech at mcs.anl.gov>
> Subject: Re: [AG-TECH] Vanishing Bridges
> 
> 
> On 14/10/2010, at 7:12 AM, Christoph Willing wrote:
> 
>> 
>> On 14/10/2010, at 2:13 AM, Thomas Uram wrote:
>> 
>>> Last week I set up a test registry, registered a bridge with it,
>>> and successively queried bridges from the registry all day with no
>>> trouble. Granted, these were all local, but if the problem appears
>>> as reliably as I've heard, I would have expected to see a problem
>>> even in this case. We clearly need to narrow down the cause of the
>>> problem some more. What details do we have about the failure cases?
>> 
>> 
>> We have very few details, unfortunately. I recall, nearly a year
>> ago, I was able to replicate the problem and at that time I thought
>> it may have something to do with newer python versions (since 2.6
>> was implicated in another problem I'd seen and the replicable cases
>> were on newer systems which included python2.6).
>> 
>> However, when I was retesting a Debian lenny system (which uses
>> python2.5) just the night before last, I also ran a test with the new
>> Ubuntu maverick (with python2.6). Both ran fine overnight i.e.
>> maverick seems OK despite using python2.6 (however note that other
>> tests in France were not successful with maverick, so ....). Anyway,
>> since maverick had run OK for me, I then started a test with Ubuntu
>> lucid (also python2.6), one of the systems with which I'd previously
>> been able to replicate the problem. This time it has run overnight
>> without any bridge disappearances - I just tried a bridge cache
>> purge from home and it showed up fine (still showing up as
>> "LucidTest" in the bridge list if the www.ap-accessgrid.org
>> registry is enabled).
> 
> 
> On re-reading this last line, I wondered if the problem has something
> to do with the registry itself. I guess all the failure instances so
> far have been using the default ANL registryUrl at
> www.accessgrid.org/registry/peers.txt, whereas my tests the last few
> days, which produced no failures, all used the APAG registryUrl at
> www.ap-accessgrid.org/registry/peers.txt.
> Obviously each points to a different registry so could that be the
> problem?
> 
> I spent all day today testing different _recent_ distros (Slackware
> 13.1, Ubuntu lucid & maverick) against the different registries. In
> all cases, bridges running against the ANL registry disappeared within
> 10-15 minutes. In all cases except one (not repeatable), bridges
> running against the APAG registry did not disappear.
> 
> My theory therefore is that the ANL registry is running with an older
> version of the AG toolkit that is not compatible with VenueClients
> running newer AG versions. Tom's recent testing with a separate test
> registry supports this theory (assuming the test registry is running a
> recent version of AG toolkit). Philippe's comment that tests with
> maverick were unsuccessful also supports the theory (assuming those
> tests used the default ANL registry).
> 
> 
> Philippe and Tom (and anyone else interested),
> 
> Could you try running (using the current AG release) a bridge against
> the APAG registry - some command like:
>     Bridge3.py --name=Testing123 --location=wherever --registryUrl=http://www.ap-accessgrid.org/registry/peers.txt
> 
> Leave it running for about an hour or two to confirm it does not
> disappear. Then stop it and run it again, this time against the ANL
> registry with something like:
>     Bridge3.py --name=TestingXYZ --location=wherever --registryUrl=http://www.accessgrid.org/registry/peers.txt
> 
> Look for failure in the first 15 minutes.
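
The two-registry test described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the Bridge3.py options (--name, --location, --registryUrl) and the two registry URLs are taken from the message itself, but `bridge_command`, `watch_bridge` and the `list_bridges` callable are hypothetical names introduced here. In particular, how the current bridge list is actually obtained (for example by purging the bridge cache in a VenueClient) is not specified in this thread, so it is left as a function supplied by the caller.

```python
import time

# Registry URLs as quoted in the message above.
REGISTRIES = {
    "APAG": "http://www.ap-accessgrid.org/registry/peers.txt",
    "ANL": "http://www.accessgrid.org/registry/peers.txt",
}


def bridge_command(name, location, registry_url):
    """Build the Bridge3.py argument list using only the options shown in the thread."""
    return [
        "Bridge3.py",
        "--name=%s" % name,
        "--location=%s" % location,
        "--registryUrl=%s" % registry_url,
    ]


def watch_bridge(name, list_bridges, interval_s=60, max_checks=15,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll a bridge list until `name` goes missing.

    `list_bridges` is a hypothetical caller-supplied callable that returns
    the bridge names currently visible; this sketch does not know how to
    query a real registry. Returns the elapsed seconds at the first check
    where the bridge was absent, or None if it was present at every check.
    """
    start = clock()
    for _ in range(max_checks):
        if name not in list_bridges():
            return clock() - start
        sleep(interval_s)
    return None
```

With the defaults (one check per minute, 15 checks) this covers the 15-minute window mentioned for the ANL registry; for the APAG run of an hour or two, `max_checks` would need to be raised accordingly. The list returned by `bridge_command` can be passed to `subprocess.Popen` on a machine where Bridge3.py is on the PATH.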
> 
> 
> If the fault is in the ANL registry, why do so many bridges _not_
> disappear? Looking at the list of bridges, the names are becoming very
> familiar i.e. they've been around a long time. I'm guessing that these
> bridges are running on older versions of the AG toolkit - still
> compatible with whatever version is running on the ANL registry machine.
> 
> 
> Of course, if the test results are in line with the theory, it still
> doesn't explain the underlying cause. A quick look through bridge &
> registry related AG code doesn't reveal any recent changes so the real
> cause may actually be down in some of the supporting software (python,
> m2crypto anyone?) which are constantly updated in each new Linux
> release (typically every 6 months). If so, this issue will eventually
> also bite Windows & Mac users as new OS versions introduce up-to-date
> versions of python, m2crypto etc. for them too.
> 
> 
> chris
> 
> 
>> So we know very little about failure cases;
>>  - there are many in France
>>  - I was previously able to replicate but not now
>>  - I _think_ I recall that Todd Z reported that he had seen the
>> problem too
>> 
>> chris
>> 
>> 
>>> On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I was not in yesterday and it's probably too late to "purge the
>>>> cache" (there is only a Lucid test now).
>>>> 
>>>> In the meantime we decided to switch to Debian because we have a
>>>> seminar that will be transmitted tomorrow and we really need the
>>>> bridge to work (that is, to be visible to new users, and now it is).
>>>> 
>>>> If it also works with "maverick", that is good news for other users
>>>> in France (but in the first test we made, the bridge disappeared
>>>> too...)
>>>> 
>>>> Thanks for everything!!
>>>> 
>>>> 
>>>> Philippe d'Anfray
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 12/10/2010 12:56, Christoph Willing wrote:
>>>>> 
>>>>>>> 
>>>>>>> We're still stuck with this bridge problem. We tried with
>>>>>>> Ubuntu 10.10 this afternoon but it is still the same. If you
>>>>>>> can confirm to us that it works fine with Debian, I'll reconfigure
>>>>>>> our server and install Debian.
>>>>>> 
>>>>>> 
>>>>>> I'm just about to leave for a short holiday so I can't reconfirm
>>>>>> that Debian still works correctly until late next week.
>>>>> 
>>>>> I'm now running a test bridge with Debian "lenny". It has been
>>>>> running nearly 5 hours without any problem so far. I'm also
>>>>> running another test bridge using the new Ubuntu "maverick",
>>>>> which has been running for over 4.5 hours - also no problem yet.
>>>>> I will let them both run overnight here (your day time) and you
>>>>> can check whether they're still running OK if you purge your
>>>>> bridge cache (assuming you have www.ap-accessgrid.org as one of
>>>>> your bridge registries) and look for the bridges named DebTest
>>>>> (Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).
>>>>> 
>>> 
>> 
>> Christoph Willing                       +61 7 3365 8316
>> QCIF Access Grid Manager
>> University of Queensland
>> 
> 
> Christoph Willing                       +61 7 3365 8316
> QCIF Access Grid Manager
> University of Queensland
> 


