[Swift-devel] Re: angle-1000 second run
Michael Wilde
wilde at mcs.anl.gov
Tue Nov 6 11:24:50 CST 2007
It seems that the cluster problem is also due to the slow speed of input
data file stage-in.
It took 6 minutes to stage in 60 input files of 40MB each to uc-tg
(this is to NFS; I will try GPFS as well).
So at 10 files per minute, if we check the cluster queue every 30
seconds, that's about 5 jobs per cluster on average, which explains what
we're seeing.
10 fpm = 400MB/min = about 6.7MB/sec. Note that I'm submitting from the
login node to the same cluster, so this seems very slow.
I will test further and try to calibrate the expected speeds on a big file.
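
For reference, the arithmetic above can be written out as a small sketch. It
assumes each job needs exactly one input file and that jobs reach the
clustering queue at the stage-in rate; the function names are made up for
illustration and are not part of Swift.

```python
# Back-of-the-envelope model of stage-in rate vs. cluster size.
# Assumption (not confirmed by the logs): one input file per job, and jobs
# become eligible for clustering as fast as their inputs are staged in.

def stage_in_rate(files, minutes):
    """Files staged in per minute."""
    return files / minutes

def throughput_mb_per_sec(files_per_min, file_mb):
    """Effective transfer rate in MB/sec."""
    return files_per_min * file_mb / 60.0

def expected_cluster_size(files_per_min, queue_delay_sec):
    """Jobs that accumulate in the queue during one clustering period."""
    return files_per_min * queue_delay_sec / 60.0

rate = stage_in_rate(60, 6)                 # 60 files in 6 minutes
mbps = throughput_mb_per_sec(rate, 40)      # 40MB per file
size30 = expected_cluster_size(rate, 30)    # 30 second queue check

print(f"{rate:.0f} files/min, {mbps:.1f} MB/sec, ~{size30:.0f} jobs per cluster")
```

Under this model, the cluster size is capped by the stage-in rate rather than
by the queue delay alone, so a faster filesystem should raise it directly.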
- Mike
On 11/6/07 10:19 AM, Michael Wilde wrote:
>
>>> 3. The cluster sizes were extremely small, about 4 - should have been
>>> 10-20 by my calcs.
>>
>> Increase the cluster queue delay parameter from 4 to about 30
>> (seconds). This will make Swift wait much longer before putting
>> clusters together, which may allow more jobs to build up in the
>> clustering queue.
>
> Previous run had this set to 10 seconds. The logs confirm that this was
> the clustering period: the cluster size=4 message came out every 10
> seconds.
>
>> Make sure that you have the cluster maximum time and maxwalltimes for
>> jobs set to sensible values, because large clusters will highlight
>> misconfigurations there. In particular, note that the maximum cluster
>> time in the config file needs to be (less than) half of the
>> maxwalltime permitted for the site you submit to (so if you are
>> allowed to run 15 minute jobs, set the cluster maximum time to 7*60,
>> for example).
>
> I set cluster max time to 1200 with a maxwalltime of 60 seconds.
>
> I will fiddle with this part with smaller runs till it works.
>
> Likely I have a config issue somewhere, or there's a bug.
>
>> Are you using the PBS provider or GRAM to submit?
>
> GRAM, gt2.
>
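
The rule of thumb quoted above (maximum cluster time less than half of the
site's maxwalltime) can be checked with a short sketch. The helper name is
hypothetical and not a Swift API, and it assumes both values are expressed in
seconds, which may not match how a given config file defines its units.

```python
# Sanity check for the "cluster max time < half of maxwalltime" rule.
# Assumption: both arguments are in seconds. The function name is invented
# for illustration and does not exist in Swift.

def cluster_time_ok(cluster_max_sec, site_maxwalltime_sec):
    """Return True if the cluster maximum time is under half the walltime."""
    return cluster_max_sec < site_maxwalltime_sec / 2.0

# Example from the quoted advice: 15 minute jobs, cluster max time of 7*60.
advice_ok = cluster_time_ok(7 * 60, 15 * 60)

# Values as stated in the reply, taken at face value as seconds.
reply_ok = cluster_time_ok(1200, 60)

print("advice example ok:", advice_ok)
print("reply settings ok:", reply_ok)
```

If the reply's figures really are both in seconds, the second check fails,
which would be consistent with a config issue; if the maxwalltime is in some
other unit, the inputs need converting before the comparison means anything.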