Hi Mihael,

Owing to the issues we were facing with the OSG persistent coasters setup, I have been running some experiments. Since the issues appeared to be related to data staging, I conducted an experiment to study the staging of data from a local client to the OSG sites.

My experiments are described as follows:

I performed about 40 runs from the Bridled client to OSG sites using a persistent-coasters-based setup.

Each run (catsn) consisted of 100 tasks and a fixed data size per task.
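
For reference, the script is essentially the standard catsn test: one cat app applied to a single fixed-size input file, 100 times in a foreach loop. A minimal sketch of it (the file names, output directory, and mapper details here are illustrative, not the exact script I ran):

    type file;

    // cat app: copies the staged-in input file to an output file
    app (file o) cat (file i)
    {
      cat @i stdout=@o;
    }

    // one fixed-size input file, staged in for each of the 100 tasks
    file data <"data.txt">;

    foreach j in [1:100] {
      file out <single_file_mapper; file=@strcat("outdir/catsn.", j, ".out")>;
      out = cat(data);
    }

Since cat copies its input to its output, each task stages one file of the chosen size in and one file of the same size out, so a run exercises roughly 100 stage-ins and 100 stage-outs through the coaster service.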

I increased the data size gradually from 0 MB (20 bytes) to 10 MB over successive runs.

15 runs were successful; 11 were partially successful (up to 25% of the tasks completed and the rest failed owing to data staging timeouts).

14 runs failed completely; they succeeded after I lowered the throttle values (the jobThrottle and the foreach throttle, which determine how many data stagings are done in parallel).

The data ranged from 0 to 10 MB per task.

12 runs were performed using the local /scratch directory as the source of the data and the destination of the results.

14 runs involved /gpfs/pads as the source and destination of the data and results, respectively.

The results are summarized here: https://docs.google.com/spreadsheet/ccc?key=0AmvYSwENKFY9dHpuM1NQQlZ5VS1idGs2M0hsbDFCa0E&hl=en_US

Sheet 2 contains a table summarizing the parameters used in each run. The green rows correspond to successful runs, while the orange ones correspond to partially successful or failed runs.

Sheet 3 shows a histogram of time versus data size for the successful runs only.

The key trend I observe from these runs is that data staging does not hold up well as the data size grows relative to the throttle. At the 8 MB and 10 MB data sizes, I had to decrease the throttle to 10 in order to get successful runs.
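
For reference, by "throttle" I mean the usual Swift settings; the sketch below shows the kind of values involved (the property names and numbers are illustrative of a typical swift.properties and sites.xml for coaster provider staging, not my exact configuration):

    # swift.properties (cf)
    foreach.max.threads=10
    use.provider.staging=true

    # sites.xml, inside the coaster pool entry
    <profile namespace="karajan" key="jobThrottle">0.09</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>

Swift caps concurrent jobs per site at roughly jobThrottle x 100, so a value around 0.09 corresponds to the 10 parallel tasks (and hence about 10 parallel stage-ins) mentioned above.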

After some discussion with Mike, our conclusion from these runs was that the parallel data transfers are causing timeouts from worker.pl. Furthermore, we were undecided on whether the timeout threshold is set too aggressively, how those thresholds are determined, and whether a change in that value could resolve the issue.

The runs, sources, Swift and service logs, and the log IDs shown in the last column are all available at: http://mcs.anl.gov/~ketan/catsn-condor.tgz

The last 1000 lines of the worker logs are captured in the condor directory of the above tarball (condor/n.err, condor/n.out). However, I do not think worker errors are an issue here, since for each run I made sure a healthy number of workers were running.

Regards,
--
Ketan