Venue Server problems
Thomas D. Uram
turam at mcs.anl.gov
Mon Feb 2 10:33:07 CST 2004
Here's a summary of my work on the venue server problems. This has been
done on my laptop and on ag2-test. The venue server is still leaking
memory, but has improved significantly. I'll be bringing up a copy of
the TVS on the machine that is to replace vv2 (ag-dev); I'll mail again
when it's up, and we can test it before setting the cname mapping to
replace vv2.
Let me know if you have any questions or suggestions.
-------------------------------------------
The primary problem was that the server would hang [Bug 763, but
well-known anyway]. It also leaked memory to the point of consuming
everything available [Bug 717].
Here's what I've done to try to fix the problems:
* Use new SOAPpy module which includes a threaded server, which will
prevent the SOAP server hangs (it also improves performance and fixes
memory leaks)
    Beyond the other fixes made in the SOAP module, the following issue
    remained: if a call to sys.exc_info is made, and the return value
    (which includes a traceback object) is stored locally, a reference
    cycle is created. To avoid this, the stored value should be
    explicitly deleted, preferably in a try...finally block, so that
    the leak can never occur.
    [http://mail.python.org/pipermail/python-bugs-list/2001-October/007864.html]
* Use new pyGlobus, which supports the threaded server
* Fix socket handling in AG code that could hang the server (event and
text services)
    Because we re-register for listen in the acceptCallback, if anything
    goes wrong in this sequence, the {event,text}services stop
    listening. Clients are unable to connect and simply hang.
* Use GT 2.4.3, which hopefully behaves better than the outdated amalgam
of code we used before
* Use Python 2.2.3, which fixes several memory leaks (we've been using
2.2.2 everywhere)
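
For reference, the sys.exc_info note under the SOAPpy item above can be sketched roughly like this. This is only an illustration in current Python, not the actual SOAPpy change; handle_request and its handler argument are names I made up for the example.

```python
import sys

def handle_request(handler):
    """Call handler(); return None on success or a short error string.

    Hypothetical helper, not SOAPpy code. It only shows the pattern of
    storing sys.exc_info() locally and deleting it in a finally block.
    """
    exc_type = exc_value = tb = None
    try:
        try:
            handler()
            return None
        except Exception:
            # tb references this frame, and this frame references tb
            # through the local variable: that is the reference cycle.
            exc_type, exc_value, tb = sys.exc_info()
            return "%s: %s" % (exc_type.__name__, exc_value)
    finally:
        # Runs on every exit path, so the local references are always
        # dropped and the frame/traceback cycle cannot outlive the call.
        del exc_type, exc_value, tb
```

The names are bound to None up front so that the del in the finally block is valid even when no exception was raised.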
These changes should prevent the server from ever hanging. As it was,
if the SOAP server was hung, it would hang only until the offending
client died; with the threaded server, such clients won't prevent other
users from connecting. The other potential source of hangs--the event
and text services--should also be fixed now.
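
To make the event/text service failure mode concrete, here is a toy sketch of one way to guard the re-registration. It is not the AG or pyGlobus code and not necessarily the fix that was applied; OneShotListener, Service, register_listen and deliver are all invented for the example and only imitate a listener whose callback must be re-registered after each accept.

```python
class OneShotListener:
    """Toy stand-in for a listener whose accept callback fires once
    per registration (hypothetical, not the pyGlobus API)."""

    def __init__(self):
        self.callback = None

    def register_listen(self, callback):
        self.callback = callback

    def deliver(self, conn):
        """Simulate an incoming connection; False means nobody is listening."""
        cb, self.callback = self.callback, None
        if cb is None:
            return False  # a real client would simply hang here
        cb(conn)
        return True


class Service:
    """Toy event/text service that re-registers inside its accept callback."""

    def __init__(self, listener):
        self.listener = listener
        self.clients = []
        self.errors = []
        listener.register_listen(self.accept_callback)

    def accept_callback(self, conn):
        try:
            self.setup_connection(conn)
        except Exception as exc:
            # Record the failure instead of letting it end the sequence.
            self.errors.append(exc)
        finally:
            # Re-register no matter what happened above, so one bad
            # connection cannot leave the service deaf.
            self.listener.register_listen(self.accept_callback)

    def setup_connection(self, conn):
        if conn is None:
            raise ValueError("bad connection")
        self.clients.append(conn)
```

With this shape, a failed connection setup is recorded in errors and the next client can still connect, because the re-registration lives in the finally block.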
Here's the problem that remains:
The event service is still leaking memory, and this is a real pain to
debug, given a mixture of Python and C code from several different
sources (AG, pyGlobus, globus, etc.); the text service leaks similarly.
With a debug build of python, I've looked at the list of allocs/frees
between calls in which memory leaked. As far as Python is concerned,
only a single int is allocated and not freed during event receipt, but
the process size occasionally grows by between 4k and 24k. Because Python
isn't indicating a build-up of objects alloc'd and not freed, I suspect
either another reference cycle, or something below Python (Globus, maybe).
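
One rough way to test the reference-cycle theory from the Python side is to count what the cycle collector finds after repeating an operation. The sketch below uses current Python's gc module; count_cyclic_garbage and action are names made up for the example, and it says nothing about memory leaked below Python (in Globus, for instance), since the collector only sees Python container objects.

```python
import gc

def count_cyclic_garbage(action, repeats=100):
    """Run action() repeatedly with automatic collection disabled and
    return how many unreachable objects the cycle collector then finds.

    Hypothetical helper: a non-zero result means action() is leaving
    reference cycles behind; zero does not rule out a C-level leak.
    """
    was_enabled = gc.isenabled()
    gc.collect()   # start from a clean slate
    gc.disable()   # keep cycles around until we ask for them
    try:
        for _ in range(repeats):
            action()
        # gc.collect() returns the number of unreachable objects found.
        return gc.collect()
    finally:
        if was_enabled:
            gc.enable()
```

If an operation such as event receipt were wrapped as action and this returned zero while the process size still grew, that would point away from a reference cycle and toward something below Python.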