[Swift-devel] Re: provider staging stage-in rate on localhost and PADS

Allan Espinosa aespinosa at cs.uchicago.edu
Wed Jan 19 11:48:23 CST 2011


Hi Mike,

Here's the setup I tested:

18 services on communicado, with 1 worker.pl (workersPerNode=40)
connecting to each. Each worker runs on a compute host in the CS
Condor pool. Provider staging is on, but no data is moved: just a
bunch of strings passed to the app() function.

I got 58k jobs in 1 hr, so the rate is about 16.3 jobs/sec, which is
well within your 20 jobs/sec, I guess.
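
As a quick sanity check of that rate (a rough Python sketch; the 58k
figure is rounded, so it only approximately reproduces the 16.3):

  jobs = 58000           # approximate number of jobs completed
  elapsed = 3600.0       # one hour, in seconds
  print(jobs / elapsed)  # ~16.1 jobs/sec, same ballpark as the 16.3 above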

I haven't experimented with stuff that includes data though.


-Allan



2011/1/19 Michael Wilde <wilde at mcs.anl.gov>:
> A few more test results:
>
> moving 3-byte files: this runs at about 20 jobs/sec in the single-node 8-core test.
>
> moving 30MB files: 100 jobs run in 143 secs = about 40 MB/sec total in/out
>
> Both tests use a single input file going to all jobs and N unique output files coming back.
>
> So the latter test, I think, is in about the same ballpark as Mihael's latest results? And the former confirms that provider staging does not seem to slow down the job rate unacceptably.
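>
> For reference, rough Python arithmetic behind that number (assuming each
> job stages the 30MB input in and a 30MB output back):
>
>   jobs = 100
>   mb_per_job = 30 * 2        # 30MB staged in plus 30MB staged out
>   elapsed = 143.0            # seconds
>   print(jobs * mb_per_job / elapsed)  # ~42 MB/sec, i.e. "about 40 MB/sec"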
>
> - Mike
>
>
> ----- Original Message -----
>> I forgot to also state regarding the test below:
>> - tried both stagingMethod proxy and file - no significant perf diff
>> - tried workersPerNode > 8 (10 seemed to cause slight degradation, so
>> I went back to 8; this test may have just been noise)
>> - I used this recent swift and cog: Swift svn swift-r3997 cog-r3029
>> (cog modified locally)
>>
>> I think the following tests would be good to do in the micro-series:
>>
>> - re-do the above on a quiet dedicated PADS cluster node (qsub -I)
>>
>> - try the test with just input and just output
>> - try the test with a 1-byte file (to see what the protocol issues
>> are)
>> - try the test with a 30MB file (to try to replicate Mihael's results)
>>
>> - try testing from a client on one PADS node to, say, 3-4 other PADS
>> nodes (again, with either qsub -I or a Swift run with auto coasters
>> and maxNodes and nodeGranularity set to, say, 4 & 4, or 5 & 5, etc.)
>>
>> This last test will probe the ability of Swift to move more tasks/sec
>> when there are more concurrent app-job endpoints (i.e., when Swift is
>> driving more cores). We *think* Swift trunk should be able to drive
>> >100 tasks/sec - maybe even 200/sec - when the configuration is
>> optimized: all local disk use; log settings tuned, perhaps; throttles
>> set right; ...
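>>
>> As a rough illustration of that claim (a Python sketch, not a measurement;
>> it assumes the ~4-5 jobs/sec per single one-core worker noted in the
>> earlier message, quoted below, scales linearly and that the client side
>> does not become the bottleneck):
>>
>>   per_worker_rate = 4.5        # approx. jobs/sec for one one-core worker
>>   for target in (100, 200):    # desired tasks/sec
>>       cores = target / per_worker_rate
>>       print(target, "tasks/sec needs roughly", round(cores), "busy cores")
>>   # roughly 22 cores for 100 tasks/sec and 44 for 200, if nothing else saturates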
>>
>> Then also try the Swift fast branch, but Mihael needs to post (or you
>> need to check in svn) whether all the latest provider staging
>> improvements have been, or could be, applied to the fast branch.
>>
>> Lastly, for the wide area test:
>>
>> 10 OSG sites
>> try to keep, say, 2 < N < 10 workers active per site (using the
>> queue-N Condor script), with most sites having large numbers of
>> workers. That should more closely mimic the load you will need to
>> drive for the actual application.
>> workersPerNode=1
>>
>> The WAN test will likely require more thought.
>>
>> - Mike
>>
>> ----- Original Message -----
>> > Continuing to work on resolving this problem.
>> >
>> > I think the next step is to methodically test provider staging,
>> > moving from the single-node test to multi-node local (PADS) tests
>> > and then to multi-node WAN tests.
>> >
>> > Now that the native coaster job rate to a single one-core worker is
>> > better understood (it seems to be 4-5 jobs per second), we can
>> > devise tests with a better understanding of the factors involved.
>> >
>> > I tried a local test on the PADS login node (at a fairly quiet,
>> > unloaded time) as follows:
>> > - local coasters service (in the Swift JVM)
>> > - app is "mv" (to avoid extra data movement)
>> > - the same input data file is used for all jobs (so it's likely in the kernel block cache)
>> > - a unique output file is used per job
>> > - Swift and the cwd are on /scratch local disk
>> > - file is 3MB (to be closer to Allan's 2.3 MB)
>> > - the mv app stages the file to the worker and back (no app reads or writes)
>> > - workers per node = 8 (on an 8-core host)
>> > - throttle of 200 jobs (2.0)
>> > - 100 jobs per Swift script invocation
>> >
>> > I get just over 5 apps/sec or 30MB/sec with this setup.
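>> >
>> > Rough Python arithmetic behind those figures (assuming the 3MB file is
>> > staged both in and out for every app invocation):
>> >
>> >   apps_per_sec = 5.0
>> >   mb_per_app = 3 * 2                # 3MB in + 3MB out
>> >   print(apps_per_sec * mb_per_app)  # 30 MB/sec, matching the rate above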
>> >
>> > Allan, I'd like to suggest you take it from here, but let's talk as
>> > soon as possible this morning to make a plan.
>> >
>> > One approach that may be fruitful is to re-design a remote test that
>> > is closer to what a real SCEC workload would be (basically your
>> > prior tests, with some adjustment to the concurrency: more workers
>> > per site, and more files going in parallel overall).
>> >
>> > Then, every time we have a new insight or code change, re-run the
>> > larger-scale WAN test in parallel with continuing down the
>> > micro-test path. That way, as soon as we hit a breakthrough that
>> > reaches your required WAN data transfer rate, you can restart the
>> > full SCEC workflow, while we continue to analyze Swift behavior
>> > issues with the simpler micro-benchmarks.
>> >
>> > Regards,
>> >
>> > Mike
>> >
>> >
>> > ----- Original Message -----
>> > > Ok, so I committed a fix to make the worker send files a bit
>> > > faster and adjusted the buffer sizes a bit. There is a trade-off
>> > > between per worker performance and number of workers, so this
>> > > should probably be a setting of some sort (since when there are
>> > > many workers, the client bandwidth becomes the bottleneck).
>> > >
>> > > With a plain cat, 4 workers, 1 job/w, and 32M files I get this:
>> > > [IN]: Total transferred: 7.99 GB, current rate: 23.6 MB/s, average rate: 16.47 MB/s
>> > > [MEM] Heap total: 155.31 MMB, Heap used: 104.2 MMB
>> > > [OUT] Total transferred: 8 GB, current rate: 0 B/s, average rate: 16.49 MB/s
>> > > Final status: time:498988 Finished successfully:256
>> > > Time: 500.653, rate: 0 j/s
>> > >
>> > > So the system probably sees 96 MB/s combined reads and writes. I'd
>> > > be curious how this looks without caching, but during the run the
>> > > computer became laggy, so it's saturating something in the OS
>> > > and/or hardware.
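>> > >
>> > > (A rough Python check of those numbers; the factor of 6 is the
>> > > per-byte disk-touch count spelled out in the quoted message further
>> > > below:)
>> > >
>> > >   gb_each_way = 8.0
>> > >   secs = 500.0
>> > >   per_direction = gb_each_way * 1024 / secs   # ~16.4 MB/s each way
>> > >   print(per_direction, per_direction * 6)     # ~16.4 and ~98 MB/s of disk I/O,
>> > >                                               # close to the 96 MB/s estimate above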
>> > >
>> > > I'll test on a cluster next.
>> > >
>> > > On Sun, 2011-01-16 at 18:02 -0800, Mihael Hategan wrote:
>> > > > On Sun, 2011-01-16 at 19:38 -0600, Allan Espinosa wrote:
>> > > > > So for the measurement interface, are you measuring the total
>> > > > > data received as the data arrives, or when the received file
>> > > > > is completely written to the job directory?
>> > > >
>> > > > The average is all the bytes that go from client to all the
>> > > > workers over the entire time spent to run the jobs.
>> > > >
>> > > > >
>> > > > > I was measuring from the logs from JOB_START to JOB_END. I
>> > > > > assumed the actual job execution time to be 0. The 7MB/s
>> > > > > probably corresponds to Mihael's stage-out results. The cat
>> > > > > jobs dump to stdout (redirected to a file in the Swift
>> > > > > wrapper), which probably shows the same behavior as the
>> > > > > stage-out.
>> > > >
>> > > > I'm becoming less surprised about 7MB/s in the local case. You
>> > > > have to multiply that by 6 to get the real disk I/O bandwidth:
>> > > > 1. client reads from disk
>> > > > 2. worker writes to disk
>> > > > 3. cat reads from disk
>> > > > 4. cat writes to disk
>> > > > 5. worker reads from disk
>> > > > 6. client writes to disk
>> > > >
>> > > > If it all happens on a single disk, then it adds up to about
>> > > > 42 MB/s, which is a reasonable fraction of what a normal disk
>> > > > can do. It would be useful to do a dd from /dev/zero to see
>> > > > what the actual disk performance is.
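>> > > >
>> > > > (In case it helps, a rough pure-Python stand-in for that dd-style
>> > > > test; the /scratch path is just an example, and the read pass may
>> > > > be served from the page cache unless the file is larger than RAM:)
>> > > >
>> > > >   import os, time
>> > > >
>> > > >   path = "/scratch/disk_probe.tmp"   # example scratch-disk location
>> > > >   size_mb = 1024
>> > > >   block = b"\0" * (1 << 20)          # 1 MiB of zeros, like /dev/zero
>> > > >
>> > > >   t0 = time.time()
>> > > >   with open(path, "wb") as f:
>> > > >       for _ in range(size_mb):
>> > > >           f.write(block)
>> > > >       f.flush()
>> > > >       os.fsync(f.fileno())           # force the data out to disk
>> > > >   print("write:", size_mb / (time.time() - t0), "MB/s")
>> > > >
>> > > >   t0 = time.time()
>> > > >   with open(path, "rb") as f:
>> > > >       while f.read(1 << 20):
>> > > >           pass
>> > > >   print("read:", size_mb / (time.time() - t0), "MB/s")
>> > > >   os.remove(path)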
>> > > >
>> > > >


