<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">This has been fixed. I replicated the problem with a Bridge running on Ubuntu Lucid, registered against the ANL bridge registry.<div><br></div><div>The problem came down to a change in the request handling code in Python 2.6. The change added a handle_timeout method to SocketServer.BaseServer; when a request times out, this method is now called instead of a socket.timeout exception being raised. The bridge code was relying on that exception to trigger re-registration with the registry, so under Python 2.6 the bridge never re-registered. That functionality has now been moved to the handle_timeout method.<div><br></div><div>The change has been committed to the AG code here:</div><div><a href="https://trac.ci.uchicago.edu/accessgrid/changeset/6820">https://trac.ci.uchicago.edu/accessgrid/changeset/6820</a></div><div><br></div><div>The relevant Python report is here:</div><div><a href="http://bugs.python.org/issue742598">http://bugs.python.org/issue742598</a></div><div><br></div><div>This does leave open the question of why the problem couldn't be replicated in test setups using Python 2.6, which more than one of us tried.</div><div><br></div><div>Tom</div><div><br></div><div><br><div><div>On Oct 14, 2010, at 8:48 AM, John I. Quebedeaux, Jr wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>Chris,<br><br>I can confirm that LSU is having to run an older version in order for our<br>bridge not to disappear from the ANL registry. I haven't had time to figure<br>out why it wasn't staying with our FC13 installation - so I've had to split<br>the bridge and venueserver for the moment until I have time to pick it apart...<br>I initially suspected it was a python version issue...<br><br>-John Q.<br>-- <br>John I. 
Quebedeaux, Jr.; Louisiana State University<br>Computer Manager LBRN; 131 Life Sciences Bldg.<br>e-mail: <a href="mailto:johnq@lsu.edu">johnq@lsu.edu</a>; web: <a href="http://lbrn.lsu.edu">http://lbrn.lsu.edu</a><br>phone: 225-578-0062 / fax: 225-578-2597<br><br><br><blockquote type="cite">From: Christoph Willing &lt;<a href="mailto:c.willing@uq.edu.au">c.willing@uq.edu.au</a>&gt;<br></blockquote><blockquote type="cite">Date: Thu, 14 Oct 2010 21:09:12 +1000<br></blockquote><blockquote type="cite">To: Philippe d'Anfray &lt;<a href="mailto:Philippe.d-Anfray@cea.fr">Philippe.d-Anfray@cea.fr</a>&gt;<br></blockquote><blockquote type="cite">Cc: "&lt;<a href="mailto:Marcolino.Pires@ac-paris.fr">Marcolino.Pires@ac-paris.fr</a>&gt;" &lt;<a href="mailto:Marcolino.Pires@ac-paris.fr">Marcolino.Pires@ac-paris.fr</a>&gt;,<br></blockquote><blockquote type="cite">"<a href="mailto:ag-tech@mcs.anl.gov">ag-tech@mcs.anl.gov</a>" &lt;<a href="mailto:ag-tech@mcs.anl.gov">ag-tech@mcs.anl.gov</a>&gt;<br></blockquote><blockquote type="cite">Subject: Re: [AG-TECH] Vanishing Bridges<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">On 14/10/2010, at 7:12 AM, Christoph Willing wrote:<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">On 14/10/2010, at 2:13 AM, Thomas Uram wrote:<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Last week I set up a test registry, registered a bridge with it,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">and successively queried bridges from the registry all day with no<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><blockquote type="cite">trouble. Granted, these were all local, but if the problem appears<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">as reliably as I've heard, I would have expected to see a problem<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">even in this case. We clearly need to narrow down the cause of the<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">problem some more. What details do we have about the failure cases?<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">We have very few details, unfortunately. I recall, nearly a year<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">ago, I was able to replicate the problem and at that time I thought<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">it may have something to do with newer python versions (since 2.6<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">was implicated in another problem I'd seen and the replicable cases<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">were on newer systems which included python2.6).<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">However when I was retesting a Debian lenny system (which uses<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">python2.5) just night before last, I also ran a test with the new<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Ubuntu maverick (with python2.6). 
Both ran fine overnight i.e.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">maverick seems OK despite using python2.6 (however note that other<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">tests in France were not successful with maverick, so ....). Anyway,<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">since maverick had run OK for me, I then started a test with Ubuntu<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">lucid (also python2.6), one of the systems with which I'd previously<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">been able to replicate the problem. This time it has run overnight<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">without any bridge disappearances - I just tried a bridge cache<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">purge from home and it showed up fine (still showing up as<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">"LucidTest" in the bridge list<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">if the <a href="http://www.ap-accessgrid.org">www.ap-accessgrid.org</a> registry is enabled).<br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">On re-reading this last line, I wondered if the problem has something<br></blockquote><blockquote type="cite">to do with the registry itself. 
I guess all the failure instances so<br></blockquote><blockquote type="cite">far have been using the default ANL registryUrl at<br></blockquote><blockquote type="cite"><a href="http://www.accessgrid.org/registry/peers.txt">www.accessgrid.org/registry/peers.txt</a><br></blockquote><blockquote type="cite">, whereas my tests the last few days, which produced no failures, all<br></blockquote><blockquote type="cite">used the APAG registryUrl at <a href="http://www.ap-accessgrid.org/registry/peers.txt">www.ap-accessgrid.org/registry/peers.txt</a>.<br></blockquote><blockquote type="cite">Obviously each points to a different registry so could that be the<br></blockquote><blockquote type="cite">problem?<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">I spent all day today testing different _recent_ distros (Slackware<br></blockquote><blockquote type="cite">13.1, Ubuntu lucid &amp; maverick) against the different registries. In<br></blockquote><blockquote type="cite">all cases, bridges running against the ANL registry disappeared within<br></blockquote><blockquote type="cite">10-15 minutes. In all cases except one (not repeatable), bridges<br></blockquote><blockquote type="cite">running against the APAG registry did not disappear.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">My theory therefore is that ANL registry is running with an older<br></blockquote><blockquote type="cite">version of the AG toolkit that is not compatible with VenueClients<br></blockquote><blockquote type="cite">running newer AG versions. Tom's recent testing with a separate test<br></blockquote><blockquote type="cite">registry supports this theory (assuming the test registry is running a<br></blockquote><blockquote type="cite">recent version of AG toolkit). 
Philippe's comment that tests with<br></blockquote><blockquote type="cite">maverick were unsuccessful also supports the theory (assuming those<br></blockquote><blockquote type="cite">tests used the default ANL registry).<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Philippe and Tom (and anyone else interested),<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Could you try running (using the current AG release) a bridge against<br></blockquote><blockquote type="cite">the APAG registry - some command like:<br></blockquote><blockquote type="cite"> &nbsp;&nbsp;Bridge3.py --name=Testing123 --location=wherever<br></blockquote><blockquote type="cite">--registryUrl=http://www.ap-accessgrid.org/registry/peers.txt<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Leave it running for about an hour or two to confirm it does not<br></blockquote><blockquote type="cite">disappear. Then stop it and run it again, this time against the ANL<br></blockquote><blockquote type="cite">registry with something like:<br></blockquote><blockquote type="cite"> &nbsp;&nbsp;&nbsp;Bridge3.py --name=TestingXYZ --location=wherever<br></blockquote><blockquote type="cite">--registryUrl=http://www.accessgrid.org/registry/peers.txt<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Look for failure in the first 15 minutes.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">If the fault is in the ANL registry, why do so many bridges _not_<br></blockquote><blockquote type="cite">disappear? Looking at the list of bridges, the names are becoming very<br></blockquote><blockquote type="cite">familiar i.e. they've been around a long time. 
I'm guessing that these<br></blockquote><blockquote type="cite">bridges are running on older versions of the AG toolkit - still<br></blockquote><blockquote type="cite">compatible with whatever version is running on the ANL registry machine.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Of course, if the test results are in line with the theory, it still<br></blockquote><blockquote type="cite">doesn't explain the underlying cause. A quick look through bridge &amp;<br></blockquote><blockquote type="cite">registry related AG code doesn't reveal any recent changes so the real<br></blockquote><blockquote type="cite">cause may actually be down in some of the supporting software (python,<br></blockquote><blockquote type="cite">m2crypto anyone?) which are constantly updated in each new Linux<br></blockquote><blockquote type="cite">release (typically every 6 months). If so, this issue will eventually<br></blockquote><blockquote type="cite">also bite Windows &amp; Mac users as new OS versions introduce up to date<br></blockquote><blockquote type="cite">versions of python, m2crypto etc. 
for them too.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">chris<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><blockquote type="cite">So we know very little about failure cases;<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"> - there are many in France<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"> - I was previously able to replicate but not now<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"> - I _think_ I recall that Todd Z reported that he had seen the<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">problem too<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">chris<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">On Oct 13, 2010, at 2:27 AM, Philippe d'Anfray wrote:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Bonjour,<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">I was not there yesterday and it's probably too late to "purge the<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><blockquote type="cite"><blockquote type="cite">cache" (there's just a Lucid test by now)<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">By the time we decided to switch to debian because we have a<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">seminar that will be transmitted<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">tomorrow and really need the bridge to work (in fact to be visible<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">to new users and there it is).<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">If it works also with "maverick" it is a good news for other users<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">in France (but in the first test we made the<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">bridge disappears too...)<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote 
type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Thanks for everything!!<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Philippe d'Anfray<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">On 12/10/2010 12:56, Christoph Willing wrote:<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote 
type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">We're still stuck with this bridge problem, we tried with<br></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Ubuntu 10.10 this afternoon but it is still the same. If you<br></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">can confirm us that it works fine with Debian, I'll reconfigure<br></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">our server and install a Debian.<br></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">I'm just about to leave for a short holiday so I can't 
reconfirm<br></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">that Debian still works correctly until late next week.<br></blockquote></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">I'm now running a test bridge with Debian "lenny". It has been<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">running nearly 5 hours without any problem so far. I'm also<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">running another test bridge using the new Ubuntu "maverick",<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">which has been running for over 4.5 hours - also no problem yet.<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">I will let them both run overnight here (your day time) and you<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">can check whether they're still running OK if you purge 
your<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">bridge cache (assuming you have <a href="http://www.ap-accessgrid.org">www.ap-accessgrid.org</a> as one of<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">your bridge registries) and look for the bridges named DebTest<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">(Debian lenny 64bit) and MaverickTest (Ubuntu maverick 32bit).<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">&lt;Philippe_d-Anfray.vcf&gt;<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Christoph Willing &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;+61 7 3365 8316<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">QCIF Access Grid Manager<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">University of Queensland<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote 
type="cite"><br></blockquote><blockquote type="cite">Christoph Willing &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;+61 7 3365 8316<br></blockquote><blockquote type="cite">QCIF Access Grid Manager<br></blockquote><blockquote type="cite">University of Queensland<br></blockquote><blockquote type="cite"><br></blockquote><br></div></blockquote></div><br></div></div></body></html>