<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Hi, Ben<br>
<br>
The graphs are in terminable:/home/zzhang/swift_file/summary.tar.
I will copy the -info files here soon.<br>
<br>
zhao<br>
<br>
Ben Clifford wrote:
<blockquote
cite="mid:Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk"
type="cite">
<blockquote type="cite">
<pre wrap="">Ben, can you point me to the graphs for this run? (Zhao's *99cy0z4g.log)
</pre>
</blockquote>
<pre wrap=""><!---->
<a class="moz-txt-link-freetext" href="http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g">http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g</a>
</pre>
<blockquote type="cite">
<pre wrap="">Once stage-ins start to complete, are the corresponding jobs initiated
quickly, or is Swift doing mostly stage-ins for some period?
</pre>
</blockquote>
<pre wrap=""><!---->
In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to falkon)
pretty much right as the corresponding stagein completes. I have no deeper
information about when the worker actually starts to run.
</pre>
<blockquote type="cite">
<pre wrap="">Zhao indicated he saw data indicating there was about a 700 second lag from
workflow start time till the first Falkon jobs started, if I understood
correctly. Do the graphs confirm this or say something different?
</pre>
</blockquote>
<pre wrap=""><!---->
There is a period of about 500s before anything starts to happen; I
haven't looked at it. That is before stage-ins start too, though, which
means that I think this...
</pre>
<blockquote type="cite">
<pre wrap="">If the 700-second delay figure is true, and stage-in was eliminated by copying
input files right to the /tmp workdir rather than first to /shared, then we'd
have:
1190260 / ( 1290 * 2048 ) = .45 efficiency
</pre>
</blockquote>
<pre wrap=""><!---->
calculation is not meaningful.
I have not looked at what is going on during that 500s startup time, but I
plan to.
</pre>
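<pre wrap="">
For reference, the arithmetic in the quoted estimate can be reproduced with
the short sketch below. The meaning of the three figures is an assumption
here, not something stated in this thread: 1190260 is read as total
CPU-seconds of useful work, 1290 as wall-clock seconds, and 2048 as the
number of CPUs. The function name is hypothetical.

```python
def efficiency(total_cpu_seconds, wall_seconds, cpus):
    """Fraction of the available CPU time that was spent on useful work.

    Assumed interpretation of the figures quoted in the thread.
    """
    return total_cpu_seconds / (wall_seconds * cpus)


# Figures copied from the quoted estimate.
print(round(efficiency(1190260, 1290, 2048), 2))  # prints 0.45
```

This only checks the division; it says nothing about whether the inputs
describe the run correctly.
</pre>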
<blockquote type="cite">
<pre wrap="">I assume we're paying the same staging price on the output side?
</pre>
</blockquote>
<pre wrap=""><!---->
Not really - the output stageouts go very fast, and because job endings
are staggered, they don't all happen at once.

This is the same with most of the large runs I've seen (of any
application) - stageout tends not to be a problem (or at least, nowhere
near the problem that stagein is).

All stageins happen fairly smoothly over the period t=400 to t=1100. The
rate limits on file operations (100 max) and file transfers (2000 max)
are still in place and are still being hit.

I think there are two directions to proceed in here that make sense for
actual use on single clusters running falkon (rather than trying to cut
out stuff randomly to push up numbers):
i) use some of the data placement features in falkon, rather than Swift's
relatively simple data management that was designed more for running
on the grid.
ii) do stage-ins using symlinks rather than file copying. This makes
sense when everything is living in a single filesystem, which again
is not what Swift's data management was originally optimised for.

I think option ii) is substantially easier to implement (on the order of
days) and is generally useful in the single-cluster, local-source-data
situation that appears to be what people want to do for running on the
BG/P and SiCortex (that is, pretty much ignoring anything grid-like at
all).

Option i) is much harder (on the order of months), needing a very
different interface between Swift and Falkon than exists at the moment.
</pre>
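<pre wrap="">
To illustrate the idea behind option ii), here is a minimal Python sketch of
a stage-in step that links an input file into the work directory instead of
copying it. It is not Swift or Falkon code: stage_in and its parameters are
hypothetical names, and it assumes the source file and the work directory
are visible through the same filesystem.

```python
import os
import shutil
import tempfile


def stage_in(source, workdir, use_symlink=True):
    """Make `source` available inside `workdir` and return the new path.

    Hypothetical helper for illustration only. Raises FileExistsError
    if a symlink is requested and the destination already exists.
    """
    os.makedirs(workdir, exist_ok=True)
    dest = os.path.join(workdir, os.path.basename(source))
    if use_symlink:
        # No data is moved; only valid when both paths share a filesystem view.
        os.symlink(os.path.abspath(source), dest)
    else:
        # Fallback: copy the file contents into the work directory.
        shutil.copy(source, dest)
    return dest


# Small demonstration in a throwaway directory.
base = tempfile.mkdtemp()
input_file = os.path.join(base, "input.dat")
with open(input_file, "w") as handle:
    handle.write("example data\n")

linked = stage_in(input_file, os.path.join(base, "work"))
print(os.path.islink(linked))
```

The copy branch is kept so that the same call still works when the input
lives somewhere the workers cannot see directly.
</pre>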
</blockquote>
</body>
</html>