[Swift-devel] Re: angle-1000 second run

Mihael Hategan hategan at mcs.anl.gov
Tue Nov 6 12:37:50 CST 2007


So I just spoke to Bill. We should not be seeing the errors that show up
when the number of transfers goes up. In tests they did a while ago,
hundreds of parallel transfers on typical machines were not a problem.

So we need to isolate the issue. Possible causes:
1. The Java GridFTP client
2. The CI network
3. Problems introduced in the server after the tests above.

Mihael

On Tue, 2007-11-06 at 18:20 +0000, Ben Clifford wrote:
> hitting the transfer throttle a lot according to this: 
> http://www.ci.uchicago.edu/~benc/log-processing/report-awf6-20071106-1101-yxipkgyg/
> 
> 
> On Tue, 6 Nov 2007, Michael Wilde wrote:
> 
> > It seems that the cluster problem is also due to the slow speed of input data
> > file stage-in.
> > 
> > It took 6 minutes to stage in 60 40MB input files to uc-tg
> > (this is to NFS; I will try GPFS as well).
> > 
> > > So at 10 files per minute, if we check the cluster queue every 30 seconds,
> > > that's about 5 jobs per cluster on average, which explains what we're seeing.
> > 
> > > 10 fpm = 400MB/min = about 6.7MB/sec.  Note that I'm submitting from the login node
> > > to the same cluster - seems very slow.
> > 
> > I will test further and try to calibrate the expected speeds on a big file.
> > 
> > - Mike
> > 
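
The stage-in arithmetic in the message above can be checked with a short sketch. The figures are the ones quoted in the message; the variable names are illustrative only, and the last step assumes each staged-in file makes exactly one job eligible for clustering, which the message implies but does not state.

```python
# Figures quoted in the message above; names are illustrative, not Swift settings.
files = 60                 # input files staged in
file_size_mb = 40          # size of each file, in MB
stage_in_minutes = 6       # observed time to stage in all files
queue_poll_seconds = 30    # how often the cluster queue is checked

files_per_minute = files / stage_in_minutes
mb_per_second = files_per_minute * file_size_mb / 60

# Assumption: one staged-in file makes one job ready for clustering.
jobs_per_cluster = files_per_minute * queue_poll_seconds / 60

print(f"{files_per_minute:.0f} files/min, {mb_per_second:.2f} MB/s, "
      f"about {jobs_per_cluster:.0f} jobs per cluster")
```

Under these assumptions the stage-in rate, not the queue delay, bounds the cluster size: a longer delay only helps in proportion to how many files arrive in that window.
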
> > 
> > On 11/6/07 10:19 AM, Michael Wilde wrote:
> > > 
> > > > > > 3. The cluster sizes were extremely small, about 4 - should have been
> > > > > > 10-20 by my calcs.
> > > > 
> > > > Increase the cluster queue delay parameter from 4 to about 30 (seconds).
> > > > This will make Swift wait much longer before putting clusters together,
> > > > which may allow more jobs to build up in the clustering queue.
> > > 
> > > Previous run had this set to 10 seconds. The logs confirm that this was the
> > > clustering period: the cluster size=4 message came out every 10 seconds.
> > > 
> > > > > Make sure that you have the cluster maximum time and maxwalltimes for jobs
> > > > set to sensible values, because large clusters will highlight
> > > > misconfigurations there. In particular, note that the maximum cluster time
> > > > in the config file needs to be (less than) half of the maxwalltime
> > > > > permitted for the site you submit to (so if you are allowed to run 15
> > > > minute jobs, set the cluster maximum time to 7*60, for example).
> > > 
> > > I set cluster max time to 1200 with a maxwalltime of 60 seconds.
> > > 
> > > I will fiddle with this part with smaller runs till it works.
> > > 
> > > > Likely I have a config issue somewhere, or there's a bug.
> > > 
> > > > Are you using the PBS provider or GRAM to submit?
> > > 
> > > GRAM, gt2.
> > > 
> > 
> > 
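
The rule of thumb quoted above (cluster maximum time less than half of the site's permitted walltime) can be written as a quick sanity check. This is a minimal sketch, not part of Swift: the function name is hypothetical, and it assumes both values have already been converted to seconds.

```python
def cluster_time_ok(cluster_max_seconds, site_walltime_seconds):
    """Return True if the cluster maximum time is less than half the site walltime.

    Hypothetical helper for illustration; both arguments must be in seconds.
    """
    return cluster_max_seconds < site_walltime_seconds / 2


# Example from the thread: a site allowing 15-minute jobs, cluster max time 7*60.
print(cluster_time_ok(7 * 60, 15 * 60))

# The settings reported in the thread, read literally as seconds
# (an assumption; the units actually in effect are not confirmed).
print(cluster_time_ok(1200, 60))
```

If the two settings use different units in the configuration, convert them before comparing; the check only makes sense when both are on the same scale.
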



