[Swift-user] Looking for the cause of failure

Andriy Fedorov fedorov at bwh.harvard.edu
Sat Jan 30 22:28:23 CST 2010


On Sat, Jan 30, 2010 at 23:14, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> What may happen is that the block (the actual PBS job submitted to run
> the workers) is longer than what the queue allows.
>
> For example, you may select the "short" queue, and that may have a limit
> of, say, 2 hours for the walltime. You want to set the maxtime
> accordingly in order to prevent coasters from submitting a job with a
> walltime higher than what the queue allows, which would cause the job to
> fail immediately.
> Even in the case you don't explicitly specify a queue, the default queue
> may itself have a limit.

This makes sense -- thank you for the explanation!
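
Restating the constraint in my own terms, as a minimal sketch: a block is
only accepted if its requested walltime does not exceed the queue limit.
The function name is hypothetical, and I am assuming both values are given
in seconds; the 2-hour limit is just the example figure from above.

```python
def block_fits_queue(maxtime_s, queue_limit_s):
    """Return True if a block's requested walltime fits within the queue limit.

    Both arguments are assumed to be in seconds (an assumption for this sketch).
    """
    return maxtime_s <= queue_limit_s


# Example "short" queue with a 2-hour walltime limit, as in the explanation.
QUEUE_LIMIT_S = 2 * 60 * 60

print(block_fits_queue(2 * 60 * 60, QUEUE_LIMIT_S))  # True: exactly at the limit
print(block_fits_queue(3 * 60 * 60, QUEUE_LIMIT_S))  # False: job would be rejected
```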

So I changed the number of workers per node to 8, and set the provider
to "local:pbs", as Mike suggested. I see 2 PBS jobs (20 and 40 nodes)
running, but from what Swift reports to me, only 16 (?) jobs are
active at a time.

Selecting site:664  Submitted:240  Active:16  Finished successfully:80

With the previous setup, it made more sense, because the number of
active jobs was <number of PBS nodes>*<number of workers per node>.
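
To make the expectation concrete, here is a rough sketch of that
arithmetic for the current run. It assumes both PBS blocks are fully up and
that every worker slot can take a job; the function name is hypothetical.

```python
def expected_active(nodes_per_block, workers_per_node):
    """Upper bound on concurrently active jobs: total nodes times workers per node."""
    return sum(nodes_per_block) * workers_per_node


blocks = [20, 40]      # the two PBS jobs I see running (20 and 40 nodes)
workers_per_node = 8   # the value I just configured

print(expected_active(blocks, workers_per_node))  # 480
```

So I would expect up to 480 active jobs, not 16. Note that 16 is exactly
2 x 8, which might mean only two nodes' worth of workers are receiving
jobs, but that is only a guess.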

Am I missing something simple? Maybe I should just try the stable
branch. I will do this next.



