<HTML>
<HEAD>
<TITLE>Re: [AG-TECH] Vanishing Bridges</TITLE>
</HEAD>
<BODY>
<FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>This raises more questions for me too about what the heck was going on with my installation (that resolution was dated 2008, if I read it right). It shows how little time I’ve had lately for installing and developing our AG installations. I’ve been so busy with production work that I just needed to get it working. I was sure my install on FC13 was current; we had recently installed it along with our venueserver.<BR>
<BR>
I hope to look at it closely soon, once we’re done with several meetings with external people about our grant renewal (yay) this coming week.<BR>
<BR>
I don’t know about everyone else, but we’re still heavily using our Access Grid infrastructure. We’re slowly moving to a ‘higher’ state of video quality, and ensuring that quality by gradually implementing the QA process at sites as they upgrade.<BR>
<BR>
-John Q.<BR>
<BR>
<BR>
<HR ALIGN=CENTER SIZE="3" WIDTH="95%"><B>From: </B>"Thomas D. Uram" <<a href="turam@mcs.anl.gov">turam@mcs.anl.gov</a>><BR>
<B>Date: </B>Thu, 14 Oct 2010 15:10:15 -0500<BR>
<B>To: </B>John Quebedeaux <<a href="johnq@lsu.edu">johnq@lsu.edu</a>><BR>
<B>Cc: </B>Christoph Willing <<a href="c.willing@uq.edu.au">c.willing@uq.edu.au</a>>, Philippe d'Anfray <<a href="Philippe.d-Anfray@cea.fr">Philippe.d-Anfray@cea.fr</a>>, "<a href="Marcolino.Pires@ac-paris.fr">Marcolino.Pires@ac-paris.fr</a>" <<a href="Marcolino.Pires@ac-paris.fr">Marcolino.Pires@ac-paris.fr</a>>, "<a href="ag-tech@mcs.anl.gov">ag-tech@mcs.anl.gov</a>" <<a href="ag-tech@mcs.anl.gov">ag-tech@mcs.anl.gov</a>><BR>
<B>Subject: </B>Re: [AG-TECH] Vanishing Bridges<BR>
<BR>
This has been fixed. I replicated the problem with a Bridge running on Ubuntu Lucid, registered against the ANL bridge registry.<BR>
<BR>
This problem came down to a change in the request handling code in Python 2.6. The change added a handle_timeout method to SocketServer.BaseServer, which gets called instead of raising a socket.timeout exception. The bridge code was relying on this timeout exception to re-register with the registry. That functionality has now been moved to the handle_timeout method.<BR>
<BR>
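A minimal sketch of the mechanism described above, not the actual AG bridge code. It assumes Python 3, where the SocketServer module is named socketserver. The names ReRegisteringServer, QuietHandler, register and run_idle_cycles are hypothetical and exist only for this illustration. The callback stands in for whatever call re-registers the bridge with the registry.<BR>
<BR>
<PRE>
```python
import socketserver


class QuietHandler(socketserver.BaseRequestHandler):
    """Placeholder handler; never reached here because no client connects."""

    def handle(self):
        self.request.sendall(b"ok")


class ReRegisteringServer(socketserver.TCPServer):
    """Server that runs a periodic task whenever handle_request() times out."""

    # Seconds handle_request() waits for a connection before giving up.
    timeout = 0.1

    def __init__(self, address, register):
        super().__init__(address, QuietHandler)
        # Hypothetical callback standing in for the registry re-registration.
        self.register = register

    def handle_timeout(self):
        # handle_request() calls this method when the timeout expires.
        # No socket.timeout exception reaches the caller, so periodic
        # work has to be done here rather than in an except clause.
        self.register()


def run_idle_cycles(cycles):
    """Serve `cycles` idle request cycles and count the re-registrations."""
    calls = []
    server = ReRegisteringServer(("127.0.0.1", 0), lambda: calls.append("register"))
    try:
        for _ in range(cycles):
            server.handle_request()
    finally:
        server.server_close()
    return len(calls)


if __name__ == "__main__":
    print(run_idle_cycles(3))
```
</PRE>
With no clients connecting, each handle_request() call waits out the timeout and then runs handle_timeout, so the callback fires once per idle cycle. Code that instead wraps handle_request() in an except clause for socket.timeout would never see the exception and so would never re-register.<BR>
<BR>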
The change has been committed to the AG code here:<BR>
<a href="https://trac.ci.uchicago.edu/accessgrid/changeset/6820">https://trac.ci.uchicago.edu/accessgrid/changeset/6820</a><BR>
<BR>
The relevant Python report is here:<BR>
<a href="http://bugs.python.org/issue742598">http://bugs.python.org/issue742598</a><BR>
<BR>
This does leave open the question of why the problem couldn't be replicated in test setups using Python 2.6, as more than one of us has done.<BR>
<BR>
Tom<BR>
<BR>
On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:<BR>
<BR>
</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>Chris,<BR>
<BR>
I can confirm that LSU is having to run an older version in order for our<BR>
bridge not to disappear from the ANL registry. I haven't had time to figure<BR>
out why it wasn't staying with our FC13 installation - so I've had to split<BR>
the bridge and venueserver for the moment until I have time to pick it apart...<BR>
I initially suspected it was a python version issue...<BR>
<BR>
-John Q.<BR>
-- <BR>
John I. Quebedeaux, Jr.; Louisiana State University<BR>
Computer Manager LBRN; 131 Life Sciences Bldg.<BR>
e-mail: <a href="johnq@lsu.edu">johnq@lsu.edu</a>; web: <a href="http://lbrn.lsu.edu">http://lbrn.lsu.edu</a><BR>
phone: 225-578-0062 / fax: 225-578-2597<BR>
<BR>
<BR>
</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>From: Christoph Willing <<a href="c.willing@uq.edu.au">c.willing@uq.edu.au</a>><BR>
Date: Thu, 14 Oct 2010 21:09:12 +1000<BR>
To: Philippe d'Anfray <<a href="Philippe.d-Anfray@cea.fr">Philippe.d-Anfray@cea.fr</a>><BR>
Cc: "<a href="Marcolino.Pires@ac-paris.fr">Marcolino.Pires@ac-paris.fr</a>" <<a href="Marcolino.Pires@ac-paris.fr">Marcolino.Pires@ac-paris.fr</a>>,<BR>
"<a href="ag-tech@mcs.anl.gov">ag-tech@mcs.anl.gov</a>" <<a href="ag-tech@mcs.anl.gov">ag-tech@mcs.anl.gov</a>><BR>
Subject: Re: [AG-TECH] Vanishing Bridges<BR>
<BR>
<BR>
On 14/10/2010, at 7:12 AM, Christoph Willing wrote:<BR>
<BR>
</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
On 14/10/2010, at 2:13 AM, Thomas Uram wrote:<BR>
<BR>
</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>Last week I set up a test registry, registered a bridge with it,<BR>
and successively queried bridges from the registry all day with no<BR>
trouble. Granted, these were all local, but if the problem appears<BR>
as reliably as I've heard, I would have expected to see a problem<BR>
even in this case. We clearly need to narrow down the cause of the<BR>
problem some more. What details do we have about the failure cases?<BR>
</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
<BR>
We have very few details, unfortunately. I recall, nearly a year<BR>
ago, I was able to replicate the problem and at that time I thought<BR>
it may have something to do with newer python versions (since 2.6<BR>
was implicated in another problem I'd seen and the replicable cases<BR>
were on newer systems which included python2.6).<BR>
<BR>
However when I was retesting a Debian lenny system (which uses<BR>
python2.5) just the night before last, I also ran a test with the new<BR>
Ubuntu maverick (with python2.6). Both ran fine overnight i.e.<BR>
maverick seems OK despite using python2.6 (however note that other<BR>
tests in France were not successful with maverick, so ....). Anyway,<BR>
since maverick had run OK for me, I then started a test with Ubuntu<BR>
lucid (also python2.6), one of the systems with which I'd previously<BR>
been able to replicate the problem. This time it has run overnight<BR>
without any bridge disappearances - I just tried a bridge cache<BR>
purge from home and it showed up fine (still showing up as<BR>
"LucidTest" in the bridge list<BR>
if the www.ap-accessgrid.org <<a href="http://www.ap-accessgrid.org">http://www.ap-accessgrid.org</a>> registry is enabled).<BR>
</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
<BR>
On re-reading this last line, I wondered if the problem has something<BR>
to do with the registry itself. I guess all the failure instances so<BR>
far have been using the default ANL registryUrl at<BR>
www.accessgrid.org/registry/peers.txt <<a href="http://www.accessgrid.org/registry/peers.txt">http://www.accessgrid.org/registry/peers.txt</a>>,<BR>
whereas my tests the last few days, which produced no failures, all<BR>
used the APAG registryUrl at www.ap-accessgrid.org/registry/peers.txt <<a href="http://www.ap-accessgrid.org/registry/peers.txt">http://www.ap-accessgrid.org/registry/peers.txt</a>> .<BR>
Obviously each points to a different registry so could that be the<BR>
problem?<BR>
<BR>
I spent all day today testing different _recent_ distros (Slackware<BR>
13.1, Ubuntu lucid & maverick) against the different registries. In<BR>
all cases, bridges running against the ANL registry disappeared within<BR>
10-15 minutes. In all cases except one (not repeatable), bridges<BR>
running against the APAG registry did not disappear.<BR>
<BR>
My theory therefore is that the ANL registry is running with an older<BR>
version of the AG toolkit that is not compatible with VenueClients<BR>
running newer AG versions. Tom's recent testing with a separate test<BR>
registry supports this theory (assuming the test registry is running a<BR>
recent version of AG toolkit). Philippe's comment that tests with<BR>
maverick were unsuccessful also supports the theory (assuming those<BR>
tests used the default ANL registry).<BR>
<BR>
<BR>
Philippe and Tom (and anyone else interested),<BR>
<BR>
Could you try running (using the current AG release) a bridge against<BR>
the APAG registry - some command like:<BR>
Bridge3.py --name=Testing123 --location=wherever<BR>
--registryUrl=<a href="http://www.ap-accessgrid.org/registry/peers.txt">http://www.ap-accessgrid.org/registry/peers.txt</a><BR>
<BR>
Leave it running for about an hour or two to confirm it does not<BR>
disappear. Then stop it and run it again, this time against the ANL<BR>
registry with something like:<BR>
Bridge3.py --name=TestingXYZ --location=wherever<BR>
--registryUrl=<a href="http://www.accessgrid.org/registry/peers.txt">http://www.accessgrid.org/registry/peers.txt</a><BR>
<BR>
Look for failure in the first 15 minutes.<BR>
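<BR>
One way to record the outcome of such a test is sketched below. It assumes that a list of visible bridge names has been noted at intervals by some means. How the registry is queried is left out because the interface is not described in this thread. The function name first_disappearance and the sample data are hypothetical, made up purely to show the bookkeeping, and are not real test results.<BR>
<BR>
<PRE>
```python
def first_disappearance(observations, bridge_name):
    """Return the time at which bridge_name first goes missing.

    observations is an iterable of (minutes_since_start, names) pairs in
    chronological order, where names is the set of bridge names visible
    at that moment. Returns None if the bridge was never seen at all or
    never vanished after being seen.
    """
    seen = False
    for minutes, names in observations:
        if bridge_name in names:
            seen = True
        elif seen:
            return minutes
    return None


# Hypothetical observations, for illustration only.
sample = [
    (0, {"Testing123", "OtherBridge"}),
    (5, {"Testing123", "OtherBridge"}),
    (10, {"OtherBridge"}),
    (15, {"OtherBridge"}),
]

if __name__ == "__main__":
    print(first_disappearance(sample, "Testing123"))
```
</PRE>
In the made-up sample, Testing123 is present at 0 and 5 minutes and missing from 10 minutes on, so the function reports 10. A bridge that stays listed for the whole run yields None.<BR>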
<BR>
<BR>
If the fault is in the ANL registry, why do so many bridges _not_<BR>
disappear? Looking at the list of bridges, the names are becoming very<BR>
familiar i.e. they've been around a long time. I'm guessing that these<BR>
bridges are running on older versions of the AG toolkit - still<BR>
compatible with whatever version is running on the ANL registry machine.<BR>
<BR>
<BR>
Of course, if the test results are in line with the theory, it still<BR>
doesn't explain the underlying cause. A quick look through bridge &<BR>
registry related AG code doesn't reveal any recent changes so the real<BR>
cause may actually be down in some of the supporting software (python,<BR>
m2crypto anyone?) which are constantly updated in each new Linux<BR>
release (typically every 6 months). If so, this issue will eventually<BR>
also bite Windows & Mac users as new OS versions introduce up to date<BR>
versions of python, m2crypto etc. for them too.<BR>
<BR>
<BR>
chris<BR>
<BR>
<BR>
</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>So we know very little about failure cases:<BR>
- there are many in France<BR>
- I was previously able to replicate but not now<BR>
- I _think_ I recall that Todd Z reported that he had seen the<BR>
problem too<BR>
<BR>
chris<BR>
<BR>
<BR>
</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:<BR>
<BR>
</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>Hello,<BR>
<BR>
I was not there yesterday and it's probably too late to "purge the<BR>
cache" (there's just a Lucid test by now)<BR>
<BR>
In the meantime we decided to switch to Debian, because we have a<BR>
seminar that will be transmitted<BR>
tomorrow and we really need the bridge to work (in fact, to be visible<BR>
to new users, and now it is).<BR>
<BR>
If it also works with "maverick", that is good news for other users<BR>
in France (but in the first test we ran, the<BR>
bridge disappeared too...)<BR>
<BR>
Thanks for everything!!<BR>
<BR>
<BR>
Philippe d'Anfray<BR>
<BR>
<BR>
<BR>
<BR>
On 12/10/2010 12:56, Christoph Willing wrote:<BR>
</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
We're still stuck with this bridge problem, we tried with<BR>
Ubuntu 10.10 this afternoon but it is still the same. If you<BR>
can confirm for us that it works fine with Debian, I'll reconfigure<BR>
our server and install a Debian.<BR>
<BR>
<BR>
I'm just about to leave for a short holiday so I can't reconfirm<BR>
that Debian still works correctly until late next week.<BR>
</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
I'm now running a test bridge with Debian "lenny". It has been<BR>
running nearly 5 hours without any problem so far. I'm also<BR>
running another test bridge using the new Ubuntu "maverick",<BR>
which has been running for over 4.5 hours - also no problem yet.<BR>
I will let them both run overnight here (your day time) and you<BR>
can check whether they're still running OK if you purge your<BR>
bridge cache (assuming you have www.ap-accessgrid.org <<a href="http://www.ap-accessgrid.org">http://www.ap-accessgrid.org</a>> as one of<BR>
your bridge registries) and look for the bridges named DebTest<BR>
(Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).<BR>
<BR>
</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
Christoph Willing +61 7 3365 8316<BR>
QCIF Access Grid Manager<BR>
University of Queensland<BR>
<BR>
</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
Christoph Willing +61 7 3365 8316<BR>
QCIF Access Grid Manager<BR>
University of Queensland<BR>
<BR>
</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>
<BR>
</SPAN></FONT>
</BODY>
</HTML>