<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Hi, Ben<br>

<br>

Check the graphs here: terminable:/home/zzhang/swift_file/summary.tar.

I will also copy the -info files here soon.<br>

<br>

zhao<br>

<br>

Ben Clifford wrote:

<blockquote

 cite="mid:Pine.LNX.4.64.0804131919290.31934@dildano.hawaga.org.uk"

 type="cite">

  <pre wrap="">

  </pre>

  <blockquote type="cite">

    <pre wrap="">Ben, can you point me to the graphs for this run? (Zhao's *99cy0z4g.log)

    </pre>

  </blockquote>

  <pre wrap=""><!---->

<a class="moz-txt-link-freetext" href="http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g">http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g</a>

  </pre>

  <blockquote type="cite">

    <pre wrap="">Once stage-ins start to complete, are the corresponding jobs initiated 

quickly, or is Swift doing mostly stage-ins for some period?

    </pre>

  </blockquote>

  <pre wrap=""><!---->

In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to falkon) 

pretty much right as the corresponding stagein completes. I have no deeper 

information about when the worker actually starts to run.

  </pre>

  <blockquote type="cite">

    <pre wrap="">Zhao indicated he saw data indicating there was about a 700 second lag from

workflow start time till the first Falkon jobs started, if I understood

correctly. Do the graphs confirm this or say something different?

    </pre>

  </blockquote>

  <pre wrap=""><!---->

There is a period of about 500s or so until stuff starts to happen; I 

haven't looked at it. That is before stage-ins start too, though, which 

means that I think this...

  </pre>

  <blockquote type="cite">

    <pre wrap="">If the 700-second delay figure is true, and stage-in was eliminated by copying

input files right to the /tmp workdir rather than first to /shared, then we'd

have:

1190260 / ( 1290 * 2048 ) = .45 efficiency

    </pre>

  </blockquote>

  <pre wrap=""><!---->

calculation is not meaningful.

I have not looked at what is going on during that 500s startup time, but I 

plan to.
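
For reference, the arithmetic in the quoted estimate can be reproduced with
a small Python sketch. The variable names are my own labels and are
assumptions: the thread does not say what each figure stands for, so I am
guessing that 1190260 is useful CPU-seconds, 1290 is wall-clock seconds and
2048 is the processor count. It only checks the division, not whether the
estimate itself is meaningful.

```python
# Assumed meanings (not stated in the thread): useful CPU-seconds,
# wall-clock seconds, and number of processors.
cpu_seconds = 1190260
wall_seconds = 1290
processors = 2048


def efficiency(cpu_seconds, wall_seconds, processors):
    """Fraction of the available processor time spent on useful work."""
    return cpu_seconds / (wall_seconds * processors)


print(round(efficiency(cpu_seconds, wall_seconds, processors), 2))  # 0.45
```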

  </pre>

  <blockquote type="cite">

    <pre wrap="">I assume we're paying the same staging price on the output side?

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Not really - the stageouts go very fast and, because job 

ending is staggered, they don't happen all at once.

This is the same with most of the large runs I've seen (of any 

application) - stageout tends not to be a problem (or at least, nowhere 

near the problems of stagein).

All stageins happen fairly smoothly over the period t=400 to t=1100. There's 

still rate limiting on file operations (100 max) and file transfers (2000 

max), and those limits are still being hit.
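
As an illustration of what a cap on concurrent operations does, here is a
minimal Python sketch built on a counting semaphore. It is not Swift's
throttling code; the Throttle class and its names are hypothetical, and only
the two limits (100 and 2000) come from this message.

```python
import threading


class Throttle:
    """Allow at most `limit` tasks to run at the same time."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed, for inspection

    def run(self, task, *args):
        with self._slots:  # blocks while `limit` tasks are already running
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return task(*args)
            finally:
                with self._lock:
                    self._active -= 1


# Limits quoted in this message.
file_operations = Throttle(100)
file_transfers = Throttle(2000)
```

When more callers arrive than there are slots, the extra ones wait inside
run(), which is one way a limit can end up being hit for the whole
stage-in period.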

I think there are two directions to proceed in here that make sense for 

actual use on single clusters running falkon (rather than trying to cut 

out stuff randomly to push up numbers):

 i) use some of the data placement features in falkon, rather than Swift's

    relatively simple data management that was designed more for running

    on the grid.

 ii) do stage-ins using symlinks rather than file copying. This makes

     sense when everything is living in a single filesystem, which again

     is not what Swift's data management was originally optimised for.

I think option ii) is substantially easier to implement (on the order of 

days) and is generally useful in the single-cluster, local-source-data 

situation that appears to be what people want to do for running on the 

BG/P and scicortex (that is, pretty much ignoring anything grid-like at 

all).

Option i) is much harder (on the order of months), needing a very 

different interface between Swift and Falkon than exists at the moment.
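
To make option ii) concrete, here is a rough Python sketch of stage-in by
symlink versus stage-in by copy. It is an illustration only and not taken
from the Swift code base; stage_in, its parameters and the file names are
hypothetical, and it assumes the source file and the job work directory are
visible in the same filesystem.

```python
import os
import shutil
import tempfile


def stage_in(source, workdir, use_symlink=True):
    """Make `source` available inside `workdir` and return the staged path.

    With use_symlink=True only a link is created, so no file data moves.
    This assumes source and workdir live in the same filesystem.
    """
    os.makedirs(workdir, exist_ok=True)
    target = os.path.join(workdir, os.path.basename(source))
    if use_symlink:
        os.symlink(os.path.abspath(source), target)
    else:
        shutil.copy(source, target)
    return target


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    src = os.path.join(base, "input.dat")
    with open(src, "w") as f:
        f.write("example input\n")
    linked = stage_in(src, os.path.join(base, "job-link"))
    copied = stage_in(src, os.path.join(base, "job-copy"), use_symlink=False)
    print(os.path.islink(linked), os.path.islink(copied))  # True False
```

The symlink path costs one metadata operation per input regardless of file
size, which is why it should help when stage-in is the bottleneck.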

  </pre>

</blockquote>

</body>

</html>