Venue Server problems

Thomas D. Uram turam at mcs.anl.gov
Mon Feb 2 10:33:07 CST 2004


Here's a summary of my work on the venue server problems.  This has been 
done on my laptop and on ag2-test.  The venue server is still leaking 
memory, but has improved significantly.  I'll be bringing up a copy of 
the TVS on the machine that is to replace vv2 (ag-dev);  I'll mail again 
when it's up, and we can test it before setting the cname mapping to 
replace vv2.

Let me know if you have any questions or suggestions.

-------------------------------------------

The primary problem was that the server would hang [Bug 763, but 
well-known anyway].  It also leaked memory to the point of consuming 
everything available [Bug 717].

Here's what I've done to try to fix the problems:

* Use the new SOAPpy module, which includes a threaded server that will 
prevent the SOAP server hangs (it also improves performance and fixes 
memory leaks)

    Among the other fixes made in the SOAP module, the following
    remained:  If a call to sys.exc_info is made, and the return value
    (which includes a traceback object) is stored locally, a reference
    cycle is created.  To avoid this, the stored value should be
    explicitly deleted, preferably in a try...finally block, to prevent
    the leak from ever occurring.
    [http://mail.python.org/pipermail/python-bugs-list/2001-October/007864.html]

* Use new pyGlobus, which supports the threaded server
* Fix socket handling in AG code that could hang the server (event and 
text services)

    Because we re-register for listen in the acceptCallback, if
    anything goes wrong in this sequence, the {event,text} services
    stop listening.  Clients are unable to connect and simply hang.
    
* Use GT 2.4.3, which hopefully behaves better than the outdated amalgam 
of code we used before
* Use Python 2.2.3, which fixes several memory leaks (we've been using 
2.2.2 everywhere)
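
To illustrate the sys.exc_info point above, here is a minimal sketch of 
the try...finally pattern.  It is not the SOAPpy code itself; the 
function name describe_failure is made up for the example.

```python
import sys


def describe_failure(func):
    """Call func and return a short description of any exception it raises.

    Hypothetical helper, used only to illustrate the pattern.
    """
    exc_info = None
    try:
        try:
            func()
            return None
        except Exception:
            # The tuple includes a traceback object, which refers back to
            # this frame.  Holding it in a local creates a reference cycle.
            exc_info = sys.exc_info()
            return "%s: %s" % (exc_info[0].__name__, exc_info[1])
    finally:
        # Drop the local reference on every exit path to break the cycle.
        del exc_info
```

The del sits in the finally clause so that it runs whether the handler 
returns normally or raises again.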

These changes should prevent the server from ever hanging.  As it was, 
if the SOAP server was hung, it would hang only until the offending 
client died; with the threaded server, such clients won't prevent other 
users from connecting.  The other potential source of hangs, the event 
and text services, should now be fixed as well.
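
For the event and text service hang, one way to make the accept sequence 
robust is to re-register for listen in a finally clause, so that a 
failure while handling one connection cannot leave the service deaf.  
This is only a sketch of the idea, not necessarily the change that went 
into the AG code: FakeListener, register_listener and Service are 
hypothetical stand-ins and are not pyGlobus calls.

```python
class FakeListener:
    """Hypothetical stand-in for a one-shot listening socket wrapper."""

    def __init__(self):
        self.armed = False
        self.callback = None

    def register_listener(self, callback):
        self.callback = callback
        self.armed = True

    def simulate_connection(self, conn):
        # Each accepted connection consumes the registration.
        callback = self.callback
        self.callback = None
        self.armed = False
        callback(conn)


class Service:
    """Hypothetical service that re-arms the listener on every accept."""

    def __init__(self, listener, handle_connection):
        self.listener = listener
        self.handle_connection = handle_connection
        self.errors = []
        self.listener.register_listener(self.accept_callback)

    def accept_callback(self, conn):
        try:
            try:
                self.handle_connection(conn)
            except Exception as exc:
                # Record the failure instead of letting it escape the callback.
                self.errors.append(exc)
        finally:
            # Always re-register, even if handling this connection failed.
            self.listener.register_listener(self.accept_callback)
```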

Here's the problem that remains:

The event service is still leaking memory, and this is a real pain to 
debug, given a mixture of Python and C code from several different 
sources (AG, pyGlobus, globus, etc.); the text service leaks similarly.  
With a debug build of Python, I've looked at the list of allocs/frees 
between calls in which memory leaked.  As far as Python is concerned, 
only a single int is allocated and not freed during event receipt, but 
the process size occasionally grows by between 4k and 24k.  Because 
Python isn't indicating a build-up of objects alloc'd and not freed, I 
suspect either another reference cycle, or something below Python 
(Globus, maybe).
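
One way to test the reference-cycle theory is to ask the standard gc 
module how many unreachable objects it finds after a single operation.  
The sketch below assumes the operation can be wrapped in a callable; 
Node and count_cyclic_garbage are hypothetical names for the example.  
A result that stays at zero while the process keeps growing would point 
below Python rather than at a cycle.

```python
import gc


class Node:
    """Hypothetical object used to build a deliberate cycle."""

    def __init__(self):
        self.peer = None


def count_cyclic_garbage(action):
    """Run action and return the number of unreachable objects that
    only the cycle collector could reclaim afterwards."""
    gc.collect()  # clear out anything left over from earlier work
    was_enabled = gc.isenabled()
    gc.disable()  # keep automatic collection from running mid-measurement
    try:
        action()
        # gc.collect() returns the number of unreachable objects found.
        return gc.collect()
    finally:
        if was_enabled:
            gc.enable()


def make_cycle():
    a = Node()
    b = Node()
    a.peer = b
    b.peer = a  # a and b now keep each other alive after this returns


def no_cycle():
    data = [1, 2, 3]
    del data  # freed immediately by reference counting
```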



