Document Process Update

Ivan R. Judson judson at mcs.anl.gov
Mon Nov 3 11:36:13 CST 2003


> >I'm anxious to have them done.
> 
> No, it won't be done then.

Darn. Can I get an estimate of when you think it will be done?
 
> I think that if we're declaring we can't make any more process until these
> are done, the ensuing spare time can go toward making what's there now
> solid.

That's what we're doing at some level, along with SC prep. The issue is that
we (historically) have pretty bad judgement about what improvements will
drastically alter the behavior of the system. That's why I'd like even
seemingly innocuous changes to be cast against the design documents ahead of
time so we have some analytical effort going into identifying ramifications
of modifications. This *could* (but might not) help us get better at
estimating the effect of our changes.
 
> I'm worried about the stability of the base system still; we're seeing a
> lot of TVS restarts being required, and folks still seem to be generally
> having problems using the AG2 software. I know I have a pile of things I
> want to do with the security/cert mgmt side of things before progressing on
> to anything drastically new. And these aren't deep design-related issues,
> they are detail-oriented engineering issues that I need to make right to
> make things work well for the users.

Agreed. We have one confirmed performance improvement; we're going to look
at applying that to the servers in place now so we can differentiate between
inordinately long SOAP conversions and real hangs.
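
One way to tell a slow call from a stuck one is to record when each call
starts and periodically ask which calls are still in flight past some
threshold. The sketch below is only an illustration under that assumption;
CallTracker, its method names, and the threshold are hypothetical and are not
part of the AG2 code or SOAP.py.

```python
import threading
import time


class CallTracker:
    """Hypothetical helper: tracks in-flight calls so that slow calls
    (which eventually finish) can be told apart from stuck ones."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so the logic can be tested
        self._lock = threading.Lock()
        self._in_flight = {}         # token -> (label, start time)
        self._next_token = 0

    def start(self, label):
        """Record that a call has begun; returns a token for finish()."""
        with self._lock:
            token = self._next_token
            self._next_token += 1
            self._in_flight[token] = (label, self._clock())
            return token

    def finish(self, token):
        """Record completion; returns (label, elapsed seconds)."""
        with self._lock:
            label, started = self._in_flight.pop(token)
            return label, self._clock() - started

    def overdue(self, threshold):
        """List (label, age) for calls still running after threshold seconds."""
        with self._lock:
            now = self._clock()
            return [(label, now - started)
                    for label, started in self._in_flight.values()
                    if now - started > threshold]
```

A call that shows up in overdue() and later returns from finish() was merely
slow; one that stays in overdue() indefinitely is a candidate for a real hang.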

The current hangs are not evoking any tracebacks or exceptions. They are
silent. That is a big problem. We'll be poking at that (and have been) to
find the problem.
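
When a hang raises nothing, one option is to have the process report where
every thread currently is. The following is a minimal sketch for a current
Python interpreter, using sys._current_frames() from the standard library;
the function name is hypothetical, and how it would be triggered in a running
server (a signal, a timer, an admin call) is left open.

```python
import sys
import threading
import traceback


def format_all_thread_stacks():
    """Return the current stack of every thread as one string.

    Intended for diagnosing hangs that produce no traceback of their own.
    """
    # Map thread idents to names so the report is readable.
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):\n" % (names.get(ident, "unknown"), ident))
        chunks.extend(traceback.format_stack(frame))
    return "".join(chunks)
```

Writing this output to the server log when a request is overdue would show
which call each thread is blocked in.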

> I also have concerns about the use of the event channel. Since everything
> depends on it, it really needs to be rock solid, and it apparently is not
> (text client hangs, etc). If it's being affected by SOAP.py-related
> slowdowns, perhaps we need to investigate moving the event service to its
> own process, and ensure that the code is dead simple and dead solid. Or
> perhaps we need to move away from relying on the event service for basic
> operation, using some notion of soft-state registration on the clients
> instead of the existence of active TCP connections via the event service.

Yeah, the asynchronous event service has some issues (as did the synchronous
one). We need to look at this. But what you are proposing is a major design
change. Therefore it needs to be written up and we need to chat about it as
a group so we can ensure the modification moves us in the direction of
improvement. 
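
To make the soft-state idea concrete for that discussion, here is a rough
sketch of what such a registration could look like: clients refresh their
entry periodically, and the server drops entries that have not been refreshed
within a time-to-live. This is an illustration only, not a proposed design;
SoftStateRegistry, its methods, and the ttl parameter are all hypothetical.

```python
import time


class SoftStateRegistry:
    """Hypothetical registry: a client counts as present only while it
    keeps refreshing its entry, not while a TCP connection is open."""

    def __init__(self, ttl, clock=time.monotonic):
        self._ttl = ttl              # seconds an entry lives without refresh
        self._clock = clock          # injectable so expiry can be tested
        self._last_seen = {}         # client id -> time of last refresh

    def refresh(self, client_id):
        """Register a client, or renew an existing registration."""
        self._last_seen[client_id] = self._clock()

    def expire(self):
        """Drop clients whose registration has lapsed; return their ids."""
        now = self._clock()
        stale = [c for c, seen in self._last_seen.items()
                 if now - seen > self._ttl]
        for c in stale:
            del self._last_seen[c]
        return stale

    def active(self):
        """Return the ids of clients with a live registration, sorted."""
        self.expire()
        return sorted(self._last_seen)
```

The trade-off is that a departed client lingers for up to ttl seconds, in
exchange for presence no longer depending on a long-lived connection.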

One point that I think is valuable, though, is that the current system has
been working well the past few weeks. There have been a few server hangs
(among all the servers), and text blocked (but didn't die) once last week.
But other than that it's been much more stable than previously. We're moving
uphill; sometimes it's slower than I'd like.

--Ivan



