[Swift-user] Looking for the cause of failure

Mihael Hategan hategan at mcs.anl.gov
Sat Jan 30 22:45:53 CST 2010


On Sat, 2010-01-30 at 23:28 -0500, Andriy Fedorov wrote:
> On Sat, Jan 30, 2010 at 23:14, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > What may happen is that the block (the actual PBS job submitted to run
> > the workers) is longer than what the queue allows.
> >
> > For example, you may select the "short" queue, and that may have a limit
> > of, say, 2 hours for the walltime. You want to set the maxtime
> > accordingly in order to prevent coasters from submitting a job with a
> > walltime higher than what the queue allows, which would cause the job to
> > fail immediately.
> > Even if you don't explicitly specify a queue, the default queue may
> > itself have a limit.
> 
> This makes sense -- thank you for the explanation!
> 
> So I changed the number of workers per node to 8 and set the provider
> to "local:pbs", as Mike suggested. I see 2 PBS jobs (20 and 40 nodes)
> running, but, from what Swift reports, only 16 (?) jobs are active at
> a time.
> 
> Selecting site:664  Submitted:240  Active:16  Finished successfully:80

It may be a strange variation on relativity. What Swift sees as the
number of concurrent jobs may not match what the cluster sees as the
number of concurrent jobs, because messages between the two take
varying amounts of time to travel from one place to the other. This is
especially visible when the jobs are short.

That, or it may be this patch I recently committed (cog branches/4.1.7
r2683) for the PBS provider. 16 is suspiciously equal to
number_of_jobs*workers_per_node, which may be a result of the PBS
provider starting only one worker executable per job, irrespective of
the number of nodes requested. The patch mentioned uses pdsh to start
the proper number of worker instances, one per node.
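
To make the arithmetic explicit (these numbers come from the earlier
message; treat them as an illustration rather than a diagnosis):

    observed:  2 PBS jobs x 8 workers/node      = 16 active
    expected:  (20 + 40) nodes x 8 workers/node = 480 active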

> 
> With the previous setup, it made more sense, because the number of
> active jobs was <number of PBS nodes>*<number of workers per node>.

Define "previous setup". If it's about one coaster job per node, yes.
Unfortunately that's also something that prevents scalability with gram2
or clusters that have limits on the number of jobs in the queue (like
the BG/P).

You can force that behavior, though, with maxnodes=1.
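
As a rough sketch only (the pool handle, work directory, and values
below are placeholders, and the exact profile key names may vary
between Swift/CoG versions), a coaster sites.xml entry combining the
settings discussed in this thread might look like:

    <pool handle="mycluster">
      <execution provider="coaster" jobmanager="local:pbs"/>
      <!-- pick a queue and keep maxtime (seconds) below its walltime
           limit, e.g. under 2 hours for a "short" queue -->
      <profile namespace="globus" key="queue">short</profile>
      <profile namespace="globus" key="maxtime">7000</profile>
      <!-- 8 workers per node, as above -->
      <profile namespace="globus" key="workersPerNode">8</profile>
      <!-- force one node per coaster block -->
      <profile namespace="globus" key="maxnodes">1</profile>
      <filesystem provider="local"/>
      <workdirectory>/path/to/swiftwork</workdirectory>
    </pool>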

> 
> Am I missing something simple? Maybe I should just try the stable
> branch. I will do this next.
> 

I would advise everybody, except perhaps the two or so people doing
research on I/O scalability with Swift, to use the stable branch. Not
only does it get fixes before trunk, but it also avoids the weird
changes that may cause random breakage.



