[Swift-devel] Re: angle-1000 second run

Tue Nov 6 08:16:11 CST 2007

On Mon, 5 Nov 2007, Michael Wilde wrote:

> 1. only 4 jobs max ran at a time (as seen by qstat over many many spot checks)

We can look at scoring from the run.

> 2. only ONE data file came back before I killed the run - yet hundreds were
> produced (as seen on the server side). Surely these should have started
> trickling in by now?

Not if jobs were still staging in - there's one file transfer throttle 
shared between all file transfers, and stageins submitted at the start are 
going to get serviced before stage outs. That should be apparent from a 
graph if I plot it.
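
To illustrate the ordering effect, here is a minimal sketch in Python. It is 
a toy model, not Swift's actual throttle code: it assumes the shared 
throttle behaves as a plain FIFO queue with a fixed number of slots, and 
that every transfer takes one tick. The function name and parameters are 
made up for this example.

```python
from collections import deque

def simulate_shared_throttle(stage_ins, stage_out_ticks, slots):
    """Toy model of one FIFO transfer throttle shared by all transfers.

    stage_ins:       number of stage-in transfers queued at tick 0
    stage_out_ticks: ticks at which stage-out transfers are submitted
    slots:           transfers the throttle lets through per tick
    Returns a list of (tick_started, kind) in service order.
    """
    queue = deque(("in", 0) for _ in range(stage_ins))
    pending_outs = sorted(stage_out_ticks)
    started = []
    tick = 0
    next_out = 0
    while queue or next_out < len(pending_outs):
        # Stage-outs join the back of the same queue when they are submitted.
        while next_out < len(pending_outs) and pending_outs[next_out] <= tick:
            queue.append(("out", pending_outs[next_out]))
            next_out += 1
        # Service at most `slots` transfers this tick, oldest first.
        for _ in range(min(slots, len(queue))):
            kind, _submitted = queue.popleft()
            started.append((tick, kind))
        tick += 1
    return started

# 8 stage-ins queued at the start, 2 stage-outs submitted at tick 1, 2 slots.
order = simulate_shared_throttle(stage_ins=8, stage_out_ticks=[1, 1], slots=2)
first_out = min(t for t, kind in order if kind == "out")
```

In this model the stage-outs are ready at tick 1 but do not start until 
tick 4, after every queued stage-in has gone through.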

> 3. The cluster sizes were extremely small about 4 - should have been 10-20 by
> my calcs.

Increase the cluster queue delay parameter from 4 to about 30 (seconds). 
This will make Swift wait much longer before putting clusters together, 
which may allow more jobs to build up in the clustering queue.
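
A rough sketch of why the delay matters, in Python. This is a simplified 
model rather than Swift's clustering code: it assumes the clustering queue 
is flushed into a single cluster once per delay interval, and it ignores 
the cluster maximum time cap. The function name and the arrival pattern are 
invented for the example.

```python
def cluster_sizes(arrival_times, queue_delay):
    """Group jobs into clusters, flushing the queue every `queue_delay` seconds.

    arrival_times: seconds at which each job enters the clustering queue
    queue_delay:   seconds between flushes of the queue
    Returns the cluster sizes in time order.
    """
    sizes = {}
    for t in arrival_times:
        # All jobs arriving within the same delay window share a cluster.
        window = int(t // queue_delay)
        sizes[window] = sizes.get(window, 0) + 1
    return [sizes[w] for w in sorted(sizes)]

# Hypothetical workload: one job becomes ready every second for a minute.
arrivals = range(60)
small = cluster_sizes(arrivals, 4)    # 15 clusters of 4 jobs each
large = cluster_sizes(arrivals, 30)   # 2 clusters of 30 jobs each
```

With jobs arriving at a steady rate, the cluster size in this model is 
simply the arrival rate multiplied by the delay, so a longer delay gives 
proportionally larger clusters.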

Make sure that you have the cluster maximum time and maxwalltimes for jobs 
set to sensible values, because large clusters will highlight 
misconfigurations there. In particular, note that the maximum cluster time 
in the config file needs to be (less than) half of the maxwalltime 
permitted for the site you submit to (so if you are allowed to run 15 
minute jobs, set the cluster maximum time to 7*60, for example).

Are you using the PBS provider or GRAM to submit?

> 
> 4. I still got over a dozen PBS job aborted messages
> 
> --
> 
> I'm going to start another run and let this one go till it finishes.
> 
> I'll use totally default throttles and increase my cluster params (but I don't
> understand why the current values didn't work).
> 
> One more note: this run is using executable script angle4.fast.sh which has a
> sleep 3 as its main action. It logs misc stuff to its 2 output files, but
> otherwise takes the same args as the real angle4.sh.
> 
> It's running out of ~wilde/angle/data on tg-login1.
> 
> - Mike