[Swift-devel] Re: Need precise throttle on local provider

Mihael Hategan hategan at mcs.anl.gov
Fri Oct 1 11:38:50 CDT 2010


On Fri, 2010-10-01 at 09:52 -0600, Michael Wilde wrote:
> ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> 
> > Try <profile namespace="karajan" key="jobsPerCpu">1</profile>
> 
> That did not seem to work.  What does the local provider think a CPU is?

The provider does not meddle with scheduling and task management. But
the default is 1 cpu per host.

Though, looking at the code, it may be that the Swift scheduler ignores
that parameter, so I will probably have to fix that.
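
For reference, that key goes inside the pool element in sites.xml. A
minimal sketch, assuming the usual sites file layout (the handle and
workdirectory here are placeholders for your setup):

-----
<config>
  <pool handle="localhost">
    <execution provider="local"/>
    <workdirectory>/tmp/swiftwork</workdirectory>
    <!-- intent: at most 1 concurrent job per CPU on this pool -->
    <profile namespace="karajan" key="jobsPerCpu">1</profile>
  </pool>
</config>
-----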

> 
> Since I have several pools (5 for my current tests) on a 4-core host
> (communicado), is the provider thinking that it can do 4 jobs per
> pool?

Should be one job per pool.

>   I need it to do one job per pool (but see my question at the end
> about implementing the slot concept in the provider itself).
> 
> > > But it seems like even this is not sufficient: under heavy load, I'm
> > > seeing a second job start on the same pool before the prior job has
> > > completed (I use "mkdir" as a pseudo-mutex, and I'm running on a
> > > local filesystem under /tmp).
> > 
> > Explain "seeing" in the above sentence.
> 
> What I see is that more than one job runs concurrently in the same local execution provider pool.
> 
> I observe this because the job is a shell script that creates a fixed-name directory before entering its critical section and removes that directory on exiting the critical section.
> 
> If the directory exists on entry to the critical section, the job does a long sleep (42 secs) so I can see it in ps.
> 
> My ordinary locking logic (which is a successful workaround) is to sleep in 1-second intervals waiting for the directory lock to free up. Under stress tests, I occasionally see a job waiting on that lock, but the retry logic works.
> 
> The locking code is:
> -----
> # Ready to talk to the server: send request and read response
> 
> while true; do
>   # mkdir is atomic, so it doubles as a lock; suppress its error chatter
>   if ! mkdir "$SLOTDIR/mutex" 2>/dev/null; then
>     sleep 42  # <<<<<<<<<<<< I see this sleep occasionally. Ordinarily it's a sleep 1
>   else
>     break
>   fi
> done
> 
> echo "run $(pwd)/$callFile $(pwd)/$resultFile" > "$SLOTDIR/toR.fifo"
> touch "$SLOTDIR/lastwrite"
> 
> echo dummy stderr response 1>&2 # FIXME - testing if this is the provider staging problem (not transferring zero-length stderr)
> 
> head -3 < "$SLOTDIR/fromR.fifo" # FIXME: Trim this down to 1 line for each call (or same # lines for each, in particular, for "quit")
> 
> rmdir "$SLOTDIR/mutex"
> ----
> 
> > But before that, try
> > jobsPerCpu.
> > 
> 
> Can you comment on this second question, below?
> 
> > > 
> > > Second question: If you can point me to the right place, Justin or I
> > > could do this the "right" way by modifying the local execution
> > > provider to set "SLOT" numbers. I initially thought the current hack
> > > would be easier, and it seemed to work under standalone testing, but
> > > it seems to be failing now in the live setting.
> > 
> > The right way, I would think, is to modify the relevant throttling
> > parameters for the scheduler for that site. That is, the local
> > provider
> > should not have anything to do with this.
> 
> I disagree. This "slot" functionality is the primary goal. The
> throttling approach is a poor workaround, as it forces me to create
> one local pool per core, when what I really want is just one pool.

I am unsure what one thing has to do with the other. If you say
poolCpus=4 and jobsPerCpu=1, you presumably get what you want. The issue
isn't how you express the throttling, but where in the code the
throttling is implemented. And I don't see what the mechanism used for
submission (e.g., gt2, local, or pbs; i.e., the provider) has to do with
how jobs are distributed to sites (the scheduler).
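
Concretely, that would be two profile entries in the pool definition; a
sketch, assuming poolCpus is read from the karajan namespace the same way
jobsPerCpu is:

-----
<!-- 4 CPUs on the host, at most 1 job per CPU: 4 concurrent jobs total -->
<profile namespace="karajan" key="poolCpus">4</profile>
<profile namespace="karajan" key="jobsPerCpu">1</profile>
-----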

Mihael
