[Swift-devel] examining the plots of a 65535 job CNARI run

Ben Clifford benc at hawaga.org.uk
Thu Sep 25 09:11:02 CDT 2008


On Wednesday, skenny ran a 65535-job run which mostly finished.

The plots are here:
http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7/

The rest of this email is rambling commentary on some of the things I see 
there.

The run mostly finishes, with some number of activities outstanding: 985 
according to the totals of unfinished procedure calls, 8 according to the 
execute2 chart, and 11 according to the karajan statuses.

Looking at this chart, which is karajan job submission tasks, 
http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7/karatasks.JOB_SUBMISSION.sorted-start.png

there are strange things with karajan job duration. The majority of tasks 
run very quickly (a few pixels wide, which is a few seconds). That's 
expected.

A large number, though, take what looks to be about 2000 seconds to end 
(and seemingly all are about the same duration, which maybe means it's a 
timeout on the task itself);

and a few (about 9?) never finish (those are the lines that extend from 
their respective start times all the way to the far right of the graph).

The tasks that take about 2000 seconds look like they're going into Queued 
state - looking at the plot of karajan job submission tasks in queued 
state, they appear there too:

http://www.ci.uchicago.edu/~benc/tmp/report-modelproc-20080924-1226-pkzripi7/karatasks.JOB_SUBMISSION.Queue.sorted-start.png


There are a couple of interesting things here that I haven't seen before:

1. stagein/stageout oscillation

Coasters are providing plenty of cores for running tasks, with very low 
scheduling latency.

In this run, the execution rate is limited by the rate at which files can 
be staged in.

There is a fixed limit on file staging load, which is shared between 
stageins and stageouts.

Once a file has been staged in, the corresponding task will be executed 
almost instantly, and two seconds later a stageout task will go on the 
queue.

This seems to be causing a pretty-looking oscillation in the stageout and 
stagein graphs. Maybe that's a bad thing, maybe it doesn't matter.
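To make the mechanism concrete, here is a small, self-contained toy model
(all of the numbers in it, and the slot-based throttle itself, are
assumptions for illustration, not measurements or the actual Swift
throttling code): a fixed pool of staging slots is shared between stageins
and stageouts, and each completed stagein leads to a stageout two seconds
later.

```python
# Toy model only: assumed numbers, not taken from this run.
from collections import deque

SLOTS = 8            # assumed shared staging concurrency
TRANSFER_TIME = 10   # assumed seconds per file transfer
EXEC_DELAY = 2       # stageout appears ~2s after its stagein completes
N_JOBS = 200         # assumed backlog of jobs waiting to be staged in

stagein_queue = deque(range(N_JOBS))
stageout_queue = deque()
pending_stageouts = []   # (time_ready, job) pairs waiting out the exec delay
active = []              # (finish_time, kind, job) transfers currently running

for t in range(0, 201):
    # Complete any transfers that finish at this tick.
    done = [a for a in active if a[0] == t]
    active = [a for a in active if a[0] != t]
    for _, kind, job in done:
        if kind == "in":
            # The task itself runs almost instantly; its stageout is
            # enqueued EXEC_DELAY seconds later.
            pending_stageouts.append((t + EXEC_DELAY, job))

    # Stageouts whose delay has elapsed join the queue.
    stageout_queue.extend(job for ready, job in pending_stageouts if ready <= t)
    pending_stageouts = [(ready, job) for ready, job in pending_stageouts if ready > t]

    # Fill free slots; stageouts and stageins compete for the same pool.
    while len(active) < SLOTS and (stagein_queue or stageout_queue):
        if stageout_queue:
            active.append((t + TRANSFER_TIME, "out", stageout_queue.popleft()))
        else:
            active.append((t + TRANSFER_TIME, "in", stagein_queue.popleft()))

    if t % 10 == 0:
        ins = sum(1 for a in active if a[1] == "in")
        outs = sum(1 for a in active if a[1] == "out")
        print(f"t={t:3d}s  active stageins={ins}  active stageouts={outs}")
```

Running this shows the slots alternating between blocks of stageins and
blocks of stageouts: the stageouts get delayed behind a full batch of
in-flight transfers, then win the slots back in a burst, which is the kind
of oscillation visible in the graphs.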

2. Execution peaks at coaster restart time.

When no coaster workers are running, stageins still happen. So when 
coaster workers start up after a period with none running, there are 
plenty of tasks ready to run. The coaster workers die every 1h45m (6300 
seconds) due to the wall time specification and are restarted, and the 
restart is subject to gram+sge scheduling delay.

So every 6300s in the run there is a section of the active tasks graph 
where the number of active tasks drops to 0 for a bit and then shoots up 
to around 400 tasks active at once for a very short period of time.

In the present run, I don't think this is causing any actual delay in the 
total runtime of the workflow, because coasters are not imposing any rate 
limit. In other runs with other applications, this might have a 
significant effect.
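As a rough back-of-the-envelope sketch of why the spike happens (both
numbers below are made up for illustration, not taken from the logs):
ready tasks accumulate for the whole length of the worker gap, so the
moment replacement workers arrive there is a burst of them to start at
once.

```python
# Toy numbers only (assumptions, not measured from this run).
stagein_rate = 1.5   # assumed: tasks becoming ready to run per second
gap_seconds = 240    # assumed: gram+sge delay before replacement workers arrive

backlog = stagein_rate * gap_seconds
print(f"tasks ready to start the moment workers return: {backlog:.0f}")
```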

Coasters are able to run 400 tasks at once because of what I regard as a 
bug in the way that multiple cores are supported in coasters: far too 
many cores are allocated (16x too many in this case), which means that 
when there is a sudden peak in job submissions there are plenty of cores 
available. This shouldn't happen.
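A hypothetical illustration of that over-allocation (the node width and
worker count below are assumptions for illustration, not taken from the
site configuration): if the allocation is requested in whole nodes rather
than cores, every requested worker drags a full node's worth of cores
along with it.

```python
# Hypothetical numbers for illustration only.
cores_per_node = 16    # assumed width of the site's nodes
workers_wanted = 25    # assumed number of workers the allocator decides it needs

cores_needed = workers_wanted                      # what should be requested
cores_allocated = workers_wanted * cores_per_node  # what a node-based request yields

print(f"needed {cores_needed} cores, allocated {cores_allocated} "
      f"({cores_allocated // cores_needed}x too many)")
```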

However, even if that were fixed so that the right number of cores was 
allocated, rather than the wrong number of nodes, I think that when there 
is a sudden peak in jobs, as happens when the coaster workers all die 
around the same time due to walltime, the worker manager will still end 
up trying to allocate enough workers to cover that peak, even though the 
peak is very unusual. So this will result in basically wasted coaster 
worker runs.
