[Swift-devel] Re: Need precise throttle on local provider

Fri Oct 1 10:52:59 CDT 2010

----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:

> Try <profile namespace="karajan" key="jobsPerCpu">1</profile>

That did not seem to work.  What does the local provider think a CPU is?

Since I have several pools (5 for my current tests) on a 4-core host (communicado), is the provider thinking that it can do 4 jobs per pool?  I need it to do one job per pool (but see my question at the end about implementing the slot concept in the provider itself).

> > But it seems like even this is not sufficient: under heavy load, I'm
> > seeing a second job start on the same pool before the prior job has
> > completed (I use "mkdir" as a pseudo-mutex, and I'm running on a local
> > filesystem under /tmp).
> 
> Explain "seeing" in the above sentence.

What I see is that more than one job runs concurrently in the same local execution provider pool.

I observe this because the job is a shell script that creates a fixed-name directory before entering its critical section and removes that directory on exiting the critical section.

If the directory exists on entry to the critical section, the job does a long sleep (42 secs) so I can see it in ps.
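
A minimal sketch of that detection, written so a collision leaves a record on disk instead of depending on catching the sleep in ps. The function names (try_enter, leave) and the collisions.log file are illustrative assumptions, not part of my actual script:

```shell
# Sketch only: try_enter, leave and collisions.log are hypothetical names.

# try_enter DIR: attempt to take DIR/mutex; on failure, record evidence.
try_enter() {
  slotdir=$1
  if mkdir "$slotdir/mutex" 2>/dev/null; then
    # Record the owner so a later collision can name who held the lock.
    echo "$$" > "$slotdir/mutex/owner"
    return 0
  fi
  # The owner file may not be written yet; fall back to "unknown".
  owner=$(cat "$slotdir/mutex/owner" 2>/dev/null || echo unknown)
  echo "$(date '+%Y-%m-%d %H:%M:%S') pid $$ found mutex held by pid $owner" \
    >> "$slotdir/collisions.log"
  return 1
}

# leave DIR: release the mutex taken by try_enter.
leave() {
  rm -f "$1/mutex/owner"
  rmdir "$1/mutex"
}
```

Each line in collisions.log would then be one observed case of two jobs landing in the same pool at once.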

My ordinary locking logic (which is a successful workaround) is to sleep in 1-second intervals waiting for the directory lock to free up. Under stress tests, I occasionally see a job waiting on that lock, but the retry logic works.

The locking code is:
-----
# Ready to talk to the server: send request and read response

while true; do
  mkdir $SLOTDIR/mutex
  if [ $? != 0 ]; then
    sleep 42;  # <<<<<<<<<<<< I see this sleep occasionally. Ordinarily it's a sleep 1
  else
    break;
  fi
done

echo run $(pwd)/$callFile $(pwd)/$resultFile > $SLOTDIR/toR.fifo
touch $SLOTDIR/lastwrite

echo dummy stderr response 1>&2 # FIXME - testing if this is the provider staging problem (not xfering zero len stderr)

head -3 < $SLOTDIR/fromR.fifo # FIXME: Trim this down to 1 line for each call (or same # lines for each, in particular, for "quit")

rmdir $SLOTDIR/mutex
----
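
For comparison, here is a rough sketch of the ordinary (non-diagnostic) form of the same lock, with the 1-second retry and a separate release step. The function names and the optional retry limit are assumptions added for illustration; the only thing it relies on is that mkdir either creates the directory or fails, so only one caller can win:

```shell
# Sketch only: acquire_slot and release_slot are hypothetical names.

# acquire_slot DIR [MAX_TRIES]: retry "mkdir DIR/mutex" once per second.
# MAX_TRIES of 0 (the default) means retry forever.
acquire_slot() {
  slotdir=$1
  max_tries=${2:-0}
  tries=0
  until mkdir "$slotdir/mutex" 2>/dev/null; do
    tries=$((tries + 1))
    if [ "$max_tries" -gt 0 ] && [ "$tries" -ge "$max_tries" ]; then
      return 1   # gave up: someone else still holds the slot
    fi
    sleep 1
  done
  return 0
}

# release_slot DIR: free the slot; harmless if it is already free.
release_slot() {
  rmdir "$1/mutex" 2>/dev/null
}
```

A job would call acquire_slot "$SLOTDIR" before writing to the fifo, and could install trap 'release_slot "$SLOTDIR"' EXIT right after acquiring, so the mutex is removed even if the script dies partway through the critical section.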

> But before that, try
> jobsPerCpu.
> 

Can you comment on this second question, below?

> > 
> > Second question: If you can point me to the right place, Justin or I
> > could do this the "right" way by modifying the local execution
> > provider to set "SLOT" numbers.  I initially thought the current hack
> > would be easier, and it seemed to work under standalone testing, but
> > seems to be failing now in the live setting.
> 
> The right way, I would think, is to modify the relevant throttling
> parameters for the scheduler for that site. That is, the local provider
> should not have anything to do with this.

I disagree. This "slot" functionality is the primary goal. The throttling approach is a poor workaround, as it forces me to create one local pool per core, when what I really want is just one pool.

> Luckily there already is a
> parameter to limit the number of concurrent jobs (and I mentioned it
> before).
> 
> Mihael

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory