[Swift-devel] Re: provider staging stage-in rate on localhost and PADS

Mihael Hategan hategan at mcs.anl.gov
Wed Jan 19 12:10:11 CST 2011


On Wed, 2011-01-19 at 10:29 -0600, Michael Wilde wrote:
> I forgot to also state regarding the test below:
> - tried both stagingMethod proxy and file - no significant perf diff

Yes, but the memory overhead is considerably larger for proxy. So don't
use it if you can use file.
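
(From memory, so double-check the exact names against your swift.properties
and sites file, but the relevant knobs should be roughly:)

  # enable provider staging globally (swift.properties)
  echo "use.provider.staging=true" >> swift.properties
  # and per site in sites.xml, prefer file over proxy:
  #   <profile namespace="swift" key="stagingMethod">file</profile>
  # (proxy has the higher client-side memory overhead mentioned above)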

> - tried workersPerNode > 8 (10 seemed to cause slight degradation, so I went back to 8; this test may have just been noise)

Probably not noise. If you have more (actual) worker processes (rather
than a higher workersPerNode on a single worker), the combined TCP
buffer sizes are larger and there is somewhat more parallelization.
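
(On Linux you can sanity-check the per-connection defaults with the
standard /proc paths below; the values are site-specific, and each
separate worker connection gets its own buffers:)

  # min / default / max per-socket buffer sizes, in bytes
  cat /proc/sys/net/ipv4/tcp_rmem /proc/sys/net/ipv4/tcp_wmem
  # with N separate workers you get roughly N of these buffers in flight
  # per direction, vs one buffer for a single worker connection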

> - I used this recent swift and cog: Swift svn swift-r3997 cog-r3029 (cog modified locally)
> 
> I think the following tests would be good to do in the micro-series:
> 
> - re-do the above on a quiet dedicated PADS cluster node (qsub -I)
> 
> - try the test with just input and just output
> - try the test with a 1-byte file (to see what the protocol issues are)
> - try the test with a 30MB file (to try to replicate Mihael's results)
> 
> - try testing from one PADS node as the client to, say, 3-4 other PADS nodes (again, with either qsub -I or a swift run with auto coasters and maxNodes and nodeGranularity set to, say, 4 & 4, or 5 & 5, etc.; a config sketch follows below)
> 
> This last test will probe the ability of Swift to move more tasks/sec when there are more concurrent app-job endpoints (i.e. when Swift is driving more cores).  We *think* Swift trunk should be able to drive >100 tasks/sec - maybe even 200/sec - when the configuration is optimized: all local disk use, log settings tuned perhaps, throttles set right, etc.
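> 
> For the 4 & 4 PADS test above, the coaster pool entry would look
> roughly like this (a sketch from memory - double-check the exact
> profile key names; paths and values are just the example numbers):
> 
> cat > sites.pads.xml <<'EOF'
> <config>
>   <pool handle="pads">
>     <execution provider="coaster" jobmanager="local:pbs" url="none"/>
>     <profile namespace="globus" key="maxNodes">4</profile>
>     <profile namespace="globus" key="nodeGranularity">4</profile>
>     <profile namespace="globus" key="workersPerNode">8</profile>
>     <profile namespace="karajan" key="jobThrottle">2.00</profile>
>     <profile namespace="karajan" key="initialScore">10000</profile>
>     <profile namespace="swift" key="stagingMethod">file</profile>
>     <workdirectory>/scratch/swiftwork</workdirectory>
>   </pool>
> </config>
> EOF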
> 
> Then also try swift fast branch, but Mihael needs to post (or you need to check in svn) whether all the latest provider staging improvements have been, or could be, applied to fast branch.
> 
> Lastly, for the wide area test:
> 
> 10 OSG sites
> try to keep, say, 2 < N < 10 workers active per site (using the queue-N Condor script), with most sites having large numbers of workers. That should more closely mimic the load you will need to drive for the actual application.
> workersPerNode=1
> 
> The WAN test will likely require more thought.
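> 
> As a starting point, the queue-N Condor piece per site could be as
> simple as the sketch below (the worker.pl arguments are from memory,
> so check the script in your cog checkout; host, port, and paths are
> placeholders):
> 
> cat > workers.submit <<'EOF'
> universe   = vanilla
> executable = worker.pl
> arguments  = http://CLIENT_HOST:CLIENT_PORT osg-block /tmp/worker-logs
> output     = worker.$(Process).out
> error      = worker.$(Process).err
> log        = workers.log
> queue 5
> EOF
> condor_submit workers.submit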
> 
> - Mike
> 
> ----- Original Message -----
> > Continuing to work on resolving this problem.
> > 
> > I think the next step is to methodically test provider staging,
> > moving from the single-node test to multi-node local (PADS) and then
> > to multi-node WAN tests.
> > 
> > Now that the native coaster job rate to a single one-core worker is
> > better understood (and seems to be 4-5 jobs per second), we can
> > devise tests with a better understanding of the factors involved.
> > 
> > I tried a local test on the PADS login node (at a fairly quiet time,
> > unloaded) as follows:
> > - local coasters service (in the Swift JVM)
> > - app is "mv" (to avoid extra data movement)
> > - same input data file is used (so it's likely in the kernel block cache)
> > - unique output file is used
> > - swift and the cwd are on /scratch local disk
> > - file is 3MB (to be closer to Allan's 2.3 MB)
> > - the mv app stages the file to the worker and back (no app reads or writes)
> > - workers per node = 8 (on an 8 core host)
> > - throttle of 200 jobs (2.0)
> > - 100 jobs per swift script invocation
> > 
> > I get just over 5 apps/sec, or about 30MB/sec (3MB in plus 3MB out
> > per app), with this setup.
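> > 
> > For reference, the whole setup is roughly the following (an untested
> > sketch; file and mapper names are just illustrative, and tc.data
> > needs an entry for mv on the site):
> > 
> > dd if=/dev/urandom of=in.dat bs=1M count=3   # the ~3MB input
> > mkdir -p out
> > cat > movetest.swift <<'EOF'
> > type file;
> > 
> > app (file o) move (file i) {
> >   mv @i @o;
> > }
> > 
> > file input <"in.dat">;
> > foreach j in [1:100] {
> >   file out <single_file_mapper; file=@strcat("out/f", j, ".dat")>;
> >   out = move(input);
> > }
> > EOF
> > swift -sites.file sites.xml -tc.file tc.data movetest.swift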
> > 
> > Allan, I'd like to suggest you take it from here, but let's talk as
> > soon as possible this morning to make a plan.
> > 
> > One approach that may be fruitful is to re-design a remote test that
> > is closer to what a real SCEC workload would be (basically your prior
> > tests, with some adjustment to the concurrency: more workers per site
> > and more overall files going in parallel).
> > 
> > Then, every time we have a new insight or code change, re-run the
> > larger-scale WAN test in parallel with continuing down the micro-test
> > path. That way, as soon as we hit a breakthrough that reaches your
> > required WAN data transfer rate, you can restart the full SCEC
> > workflow, while we continue to analyze Swift behavior issues with the
> > simpler micro-benchmarks.
> > 
> > Regards,
> > 
> > Mike
> > 
> > 
> > ----- Original Message -----
> > > Ok, so I committed a fix to make the worker send files a bit faster
> > > and adjusted the buffer sizes a bit. There is a trade-off between
> > > per-worker performance and number of workers, so this should
> > > probably be a setting of some sort (since when there are many
> > > workers, the client bandwidth becomes the bottleneck).
> > >
> > > With a plain cat, 4 workers, 1 job/w, and 32M files I get this:
> > > [IN]: Total transferred: 7.99 GB, current rate: 23.6 MB/s, average rate: 16.47 MB/s
> > > [MEM] Heap total: 155.31 MMB, Heap used: 104.2 MMB
> > > [OUT] Total transferred: 8 GB, current rate: 0 B/s, average rate: 16.49 MB/s
> > > Final status: time:498988 Finished successfully:256
> > > Time: 500.653, rate: 0 j/s
> > >
> > > So the system probably sees 96 MB/s combined reads and writes. I'd
> > > be curious how this looks without caching, but during the run the
> > > computer became laggy, so it's saturating something in the OS and/or
> > > hardware.
> > >
> > > I'll test on a cluster next.
> > >
> > > On Sun, 2011-01-16 at 18:02 -0800, Mihael Hategan wrote:
> > > > On Sun, 2011-01-16 at 19:38 -0600, Allan Espinosa wrote:
> > > > > So for the measurement interface, are you measuring the total
> > > > > data received as the data arrives, or when the received file is
> > > > > completely written to the job directory?
> > > >
> > > > The average is all the bytes that go from the client to all the
> > > > workers, divided by the entire time spent running the jobs.
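> > > >
> > > > E.g., from the numbers above, roughly 8 GB over ~499 s:
> > > >
> > > > # matches the reported average rate (modulo rounding of the GB figure)
> > > > echo "scale=2; 7.99*1024/498.988" | bc   # prints ~16.4 (MB/s)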
> > > >
> > > > >
> > > > > I was measuring from the logs, from JOB_START to JOB_END. I
> > > > > assumed the actual job execution time to be 0. The 7MB/s probably
> > > > > corresponds to Mihael's stage-out results. The cat jobs dump to
> > > > > stdout (redirected to a file in the swift wrapper), which
> > > > > probably shows the same behavior as the stage-out.
> > > >
> > > > I'm becoming less surprised about 7MB/s in the local case. You
> > > > have to multiply that by 6 to get the real disk I/O bandwidth:
> > > > 1. client reads from disk
> > > > 2. worker writes to disk
> > > > 3. cat reads from disk
> > > > 4. cat writes to disk
> > > > 5. worker reads from disk
> > > > 6. client writes to disk
> > > >
> > > > If it all happens on a single disk, then it adds up to about 42
> > > > MB/s, which is a reasonable fraction of what a normal disk can do.
> > > > It would be useful to do a dd from /dev/zero to see what the
> > > > actual disk performance is.
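> > > >
> > > > Something along these lines (the /scratch path is just an example;
> > > > conv=fdatasync makes the write actually hit the disk):
> > > >
> > > > dd if=/dev/zero of=/scratch/ddtest bs=1M count=1024 conv=fdatasync
> > > > # drop the page cache first if the read-back should measure the
> > > > # disk rather than memory (needs root):
> > > > #   echo 3 > /proc/sys/vm/drop_caches
> > > > dd if=/scratch/ddtest of=/dev/null bs=1M
> > > > rm /scratch/ddtest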
> > > >
> > > >
> > 
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> > 
> 




