[Swift-devel] persistent coasters and data staging

Ketan Maheshwari ketancmaheshwari at gmail.com
Mon Sep 12 11:58:44 CDT 2011


Hi Mihael,

Owing to the issues we were facing with the OSG persistent coasters setup, I
have been doing some experiments. Since the issues were apparently related
to data staging, I conducted a set of experiments aimed at studying the
staging of data from a local client to the OSG sites.

The description of my experiments is as follows:

I performed about 40 runs from the Bridled client to OSG sites using a
persistent coasters based setup.
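
For reference, the pool definition for passive persistent coasters is
roughly of the following shape; the host, port and work directory below are
placeholders, and the actual sites files for these runs are in the tarball
linked at the end:

  <pool handle="osg">
    <execution provider="coaster-persistent"
               url="http://bridled.ci.uchicago.edu:PORT"
               jobmanager="local:local"/>
    <profile namespace="globus" key="workerManager">passive</profile>
    <filesystem provider="local"/>
    <workdirectory>/tmp/swift.workdir</workdirectory>
  </pool>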

Each run (catsn) consisted of 100 tasks and a fixed data size per task.
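
catsn is essentially the standard "cat a file n times" test from the Swift
test suite; a minimal sketch of the script, close to but not necessarily
identical to the version in the tarball, is:

  type file;

  app (file o) cat (file i)
  {
    cat @i stdout=@o;
  }

  file data<"data.txt">;
  file out[]<simple_mapper; location="outdir", prefix="f.", suffix=".out">;

  int n = @toint(@arg("n", "100"));

  foreach j in [1:n] {
    out[j] = cat(data);
  }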

I increased the data size gradually from 0 MB (20 bytes) to 10 MB over
successive runs.

15 runs were successful; 11 were partially successful (up to 25% of tasks
completed and the rest failed owing to data staging timeouts).

14 runs failed completely; for these I had to lower the throttle values (the
jobThrottle and foreach throttle, which determine the number of data stagings
done in parallel), after which they succeeded.

The data size ranged from 0 to 10 MB per task.

12 runs were performed using the local /scratch directory as the source of
the data and the destination of the results.

14 runs used /gpfs/pads as the source and destination of the data and
results, respectively.

The results are summarized here:
https://docs.google.com/spreadsheet/ccc?key=0AmvYSwENKFY9dHpuM1NQQlZ5VS1idGs2M0hsbDFCa0E&hl=en_US

Sheet 2 contains a table summarizing the parameters used in each run. The
green rows correspond to successful runs, while the orange ones correspond to
partially successful or failed runs.

Sheet 3 shows a histogram of time versus data size for the successful runs
only.

The key trend that I observe from these runs is that data staging does not
cope very well as the data size increases relative to the throttle. At the
8 MB and 10 MB data sizes, I had to decrease the throttle to 10 in order to
get successful runs.
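
By "decrease the throttle to 10" I mean roughly the following combination of
settings; the exact values used for each run are recorded in the spreadsheet
and in the sources in the tarball, so treat the numbers here as illustrative.

  In sites.xml (about 10 concurrent jobs, since concurrency is roughly
  jobThrottle * 100 + 1):

    <profile namespace="karajan" key="jobThrottle">0.09</profile>

  In swift.properties (cap the foreach iterations, and hence the parallel
  stage-ins, that are in flight at once):

    foreach.max.threads=10
    throttle.transfers=4
    throttle.file.operations=8

Lowering these reduces the number of simultaneous stage-ins competing for
client bandwidth, which is presumably why the larger-file runs then succeed.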

After some discussion with Mike, our conclusion from these runs was that the
parallel data transfers are causing timeouts in worker.pl. Further, we were
unsure whether the timeout threshold is set too aggressively, how these
timeouts are determined, and whether a change in that value could resolve the
issue.

The runs, sources, swift and service logs, and the log ids shown in the last
column are all available at: http://mcs.anl.gov/~ketan/catsn-condor.tgz

The last 1000 lines of the worker logs are captured in the condor directory
of the above tarball (condor/n.err, condor/n.out). However, I do not think
worker errors are the issue here, since for each run I made sure a healthy
number of workers were running.



Regards,
-- 
Ketan

