[Swift-devel] Re: angle-1000 second run

Michael Wilde wilde at mcs.anl.gov
Tue Nov 6 11:24:50 CST 2007


It seems that the cluster problem is also due to the slow speed of input 
data file stage-in.

It took 6 minutes to stage in 60 40MB input files to uc-tg
(this is to NFS; I will try GPFS as well).

So at 10 files per minute, if we check the cluster queue every 30 
seconds, that's about 5 jobs per cluster on average, which explains what 
we're seeing.

10 fpm = 400MB/min = about 6.7MB/sec.  Note that I'm submitting from the 
login node to the same cluster, so this seems very slow.
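
For anyone who wants to redo the arithmetic above with other numbers, here
is a minimal sketch in Python. The function names are made up for
illustration and are not part of Swift. It only assumes that jobs reach the
clustering queue at the stage-in rate and that the queue is drained once
per interval.

```python
def files_per_minute(n_files, minutes):
    """Stage-in rate in files per minute."""
    return n_files / minutes


def throughput_mb_per_sec(fpm, file_size_mb):
    """Data rate implied by a stage-in rate and a fixed file size."""
    return fpm * file_size_mb / 60.0


def jobs_per_cluster(fpm, queue_interval_sec):
    """Average jobs that accumulate between two cluster-queue checks,
    assuming each staged-in file releases one job."""
    return fpm * queue_interval_sec / 60.0


# Numbers from this run: 60 files of 40MB staged in over 6 minutes,
# cluster queue checked every 30 seconds.
fpm = files_per_minute(60, 6)
print("files/min:    ", fpm)
print("MB/sec:       ", round(throughput_mb_per_sec(fpm, 40), 2))
print("jobs/cluster: ", jobs_per_cluster(fpm, 30))
```

With faster stage-in (say on GPFS) the same 30-second interval would give
proportionally larger clusters.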

I will test further and try to calibrate the expected speeds on a big file.

- Mike


On 11/6/07 10:19 AM, Michael Wilde wrote:
> 
>>> 3. The cluster sizes were extremely small, about 4 - should have been 
>>> 10-20 by my calcs.
>>
>> Increase the cluster queue delay parameter from 4 to about 30 
>> (seconds). This will make Swift wait much longer before putting 
>> clusters together, which may allow more jobs to build up in the 
>> clustering queue.
> 
> Previous run had this set to 10 seconds. The logs confirm that this was 
> the clustering period: the cluster size=4 message came out every 10 
> seconds.
> 
>> Make sure that you have the cluster maximum time and maxwalltimes for 
>> jobs set to sensible values, because large clusters will highlight 
>> misconfigurations there. In particular, note that the maximum cluster 
>> time in the config file needs to be (less than) half of the 
>> maxwalltime permitted for the site you submit to (so if you are 
>> allowed to run 15 minute jobs, set the cluster maximum time to 7*60, 
>> for example).
> 
> I set cluster max time to 1200 with a maxwalltime of 60 seconds.
> 
> I will fiddle with this part with smaller runs till it works.
> 
> Likely I have a config issue somewhere, or there's a bug.
> 
>> Are you using the PBS provider or GRAM to submit?
> 
> GRAM, gt2.
> 
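
The rule of thumb quoted above (cluster maximum time less than half of the
site's permitted walltime) can be written as a quick sanity check. This is
only a sketch: the function name is hypothetical, it is not a Swift API,
and both values are assumed to be in seconds.

```python
def cluster_time_ok(cluster_max_sec, site_walltime_sec):
    """True if the cluster maximum time is less than half of the
    walltime the site permits (both in seconds)."""
    return cluster_max_sec < site_walltime_sec / 2.0


# Example from the quoted advice: a site allowing 15 minute jobs,
# with the cluster maximum time set to 7*60 seconds.
print(cluster_time_ok(7 * 60, 15 * 60))
```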


