[Swift-devel] Re: Need precise throttle on local provider
Michael Wilde
wilde at mcs.anl.gov
Fri Oct 1 10:52:59 CDT 2010
----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> Try <profile namespace="karajan" key="jobsPerCpu">1</profile>
That did not seem to work. What does the local provider think a CPU is?
Since I have several pools (5 for my current tests) on a 4-core host (communicado), is the provider thinking that it can do 4 jobs per pool? I need it to do one job per pool (but see my question at the end about implementing the slot concept in the provider itself).
> > But it seems like even this is not sufficient: under heavy load, I'm
> > seeing a second job start on the same pool before the prior job has
> > completed (I use "mkdir" as a pseudo-mutex, and I'm running on a local
> > filesystem under /tmp).
>
> Explain "seeing" in the above sentence.
What I see is that more than one job runs concurrently in the same local execution provider pool.
I observe this because the job is a shell script that creates a fixed-name directory before entering its critical section and removes that directory on exiting the critical section.
If the directory exists on entry to the critical section, the job does a long sleep (42 secs) so I can see it in ps.
My ordinary locking logic (which is a successful workaround) is to sleep in 1-second intervals waiting for the directory lock to free up. Under stress tests, I occasionally see a job waiting on that lock, but the retry logic works.
The locking code is:
-----
# Ready to talk to the server: send request and read response
while true; do
  mkdir $SLOTDIR/mutex
  if [ $? != 0 ]; then
    sleep 42; # <<<<<<<<<<<< I see this sleep occasionally. Ordinarily it's a sleep 1
  else
    break;
  fi
done
echo run $(pwd)/$callFile $(pwd)/$resultFile > $SLOTDIR/toR.fifo
touch $SLOTDIR/lastwrite
echo dummy stderr response 1>&2 # FIXME - testing if this is the provider staging problem (not xfering zero len stderr)
head -3 < $SLOTDIR/fromR.fifo # FIXME: Trim this down to 1 line for each call (or same # lines for each, in particular, for "quit")
rmdir $SLOTDIR/mutex
----
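For reference, a minimal sketch of the ordinary form of this lock (the 1-second retry rather than the diagnostic 42-second sleep), assuming a POSIX shell. The function names acquire_mutex and release_mutex and the mktemp default for SLOTDIR are hypothetical, for illustration only. The trap is an addition that is not in the script above; it releases the lock if the critical section exits early.

```shell
#!/bin/sh
# Assumption: SLOTDIR normally comes from the environment; a temp dir is used otherwise.
SLOTDIR=${SLOTDIR:-$(mktemp -d)}

acquire_mutex() {
    # mkdir either creates the directory or fails, so only one caller can win.
    until mkdir "$SLOTDIR/mutex" 2>/dev/null; do
        sleep 1   # lock is held by another job: retry in 1-second intervals
    done
}

release_mutex() {
    rmdir "$SLOTDIR/mutex"
}

acquire_mutex
trap release_mutex EXIT   # release the lock even if the critical section fails
# ... critical section: write the request to the FIFO and read the response ...
echo "in critical section"
release_mutex
trap - EXIT               # lock already released; clear the handler
```

The retry loop only serializes jobs that happen to overlap. It does not explain why two jobs were dispatched to the same pool in the first place.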
> But before that, try jobsPerCpu.
>
Can you comment on this second question, below?
> >
> > Second question: If you can point me to the right place, Justin or I
> > could do this the "right" way by modifying the local execution
> > provider to set "SLOT" numbers. I initially thought the current hack
> > would be easier, and it seemed to work under standalone testing, but
> > seems to be failing now in the live setting.
>
> The right way, I would think, is to modify the relevant throttling
> parameters for the scheduler for that site. That is, the local provider
> should not have anything to do with this.
I disagree. This "slot" functionality is the primary goal. The throttling approach is a poor workaround, as it forces me to create one local pool per core, when what I really want is just one pool.
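To make the "slot" idea concrete, here is a rough job-side sketch that uses the same mkdir technique: a single pool with N numbered slots, where each job claims the first free one. This is not provider code and not how the local provider works today. SLOTBASE, NSLOTS, claim_slot and release_slot are all hypothetical names, and the default of 4 slots simply matches the 4-core host.

```shell
#!/bin/sh
# Assumption: one directory per slot under SLOTBASE; a directory's existence means "busy".
SLOTBASE=${SLOTBASE:-$(mktemp -d)}
NSLOTS=${NSLOTS:-4}

claim_slot() {
    # Print the number of the first free slot and return 0; return 1 if all are busy.
    i=0
    while [ "$i" -lt "$NSLOTS" ]; do
        if mkdir "$SLOTBASE/slot.$i" 2>/dev/null; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

release_slot() {
    rmdir "$SLOTBASE/slot.$1"
}

slot=$(claim_slot) || { echo "no free slot" >&2; exit 1; }
echo "running in slot $slot"
# ... the job would talk to the server belonging to this slot here ...
release_slot "$slot"
```

If the provider handed each job its slot number directly, the job would not need to probe for a free slot at all.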
> Luckily there already is a
> parameter to limit the number of concurrent jobs (and I mentioned it
> before).
>
> Mihael
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory