[Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...]

Tue Mar 25 09:32:54 CDT 2008

The problem may be that bash opens and closes the info file on every
redirect, as a quick test shows.
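
A minimal sketch of the difference, assuming a hypothetical info file
path and hypothetical function names (log_reopen, log_fd); this is not
the actual wrapper.sh code. The first function redirects on every call,
so bash opens and closes the file each time. The second opens the file
once on descriptor 3 and reuses it:

```shell
#!/bin/bash
# Hypothetical info file; the real wrapper.sh uses its own path.
INFO=$(mktemp)

# Pattern 1: every append redirect makes bash open() and close() the file.
log_reopen() {
    echo "$1" >> "$INFO"
}

# Pattern 2: open the file once on descriptor 3, then write through it.
exec 3>> "$INFO"
log_fd() {
    echo "$1" >&3
}

log_reopen "LOG_START"
log_fd "CREATE_JOBDIR"
log_fd "EXECUTE_DONE"

# Close the descriptor once logging is finished.
exec 3>&-
```

Whether this helps on the shared filesystem would still need to be
measured.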

On Tue, 2008-03-25 at 08:44 -0500, Michael Wilde wrote:
> I did runs the day before with a modified wrapper that bypassed the INFO 
> logging. It saved a good amount - I recall about 30% but need to 
> re-check the numbers.
> 
> Yes, I came to the same conclusion on the mkdirs.  I'm looking at 
> reducing these, likely by moving the jobdir to /tmp.  I think I can do that 
> within the current structure.  wrapper.sh is very clear and nicely 
> written. (Ben: yes, eyeballing the log #s was easy and no problem).
> 
> First thing I want to do, though, is run some large scale tests on our 
> two science workflows, increasing the petro-modelling one (the 
> sub-second application) to a larger runtime through app-level batching.
> 
> Zhao's latest tests indicate that if we do batches of 40, bringing the 
> jobs from 0.5 sec to 20 sec, we can saturate the BGP's 4K cores and keep 
> it running efficiently. Given the extra wrapper.sh overhead, I might 
> need to increase that another 10X, but once the app is wrapped in a 
> loop, it makes little difference to the user how big we make that.
> 
> The other app is a molecule-docking app that can be batched similarly.
> 
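The app-level batching described above could look roughly like the
following sketch. run_one and run_batch are hypothetical names, and
run_one is only a placeholder for the real sub-second application. With
a batch of 40 inputs at about 0.5 sec each, one job lasts about 20 sec:

```shell
#!/bin/bash
# Placeholder for the real sub-second application (hypothetical).
run_one() {
    echo "result for $1"
}

# Run the app once per input inside a single job, so the per-job
# wrapper overhead is paid once per batch rather than once per input.
run_batch() {
    local input
    for input in "$@"; do
        run_one "$input"
    done
}

# Build a batch of 40 hypothetical inputs and run it as one job.
BATCH=40
inputs=()
for ((i = 1; i <= BATCH; i++)); do
    inputs+=("input-$i")
done
results=$(run_batch "${inputs[@]}")
```

Changing BATCH is the only thing needed to make the job longer.
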
> Once we get those running nicely at a larger, less brutal job time, I'll 
> come back to wrapper.sh tuning.  If you or Ben want to do this in the 
> meantime, though, that would be great.  We have the use-local-disk 
> scenario on our development stack anyways - this would be a good time to 
> do it.  If I do it, it will be only a prototype for measurement purposes.
> 
> Mike
> 
> 
> 
> 
> On 3/25/08 8:34 AM, Mihael Hategan wrote:
> > On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote:
> >> On 3/25/08 3:31 AM, Mihael Hategan wrote:
> >>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote:
> >>>> I eyeballed the wrapperlogs to get a rough idea of what was happening.
> >>>>
> >>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 
> >>>> and 500 jobs, to see how the exec time grew.  At 500 jobs it grew to 
> >>>> about 30+ seconds for a core app exec time of about 1 sec. (I'm just 
> >>>> recollecting the times, as at this point I didn't write much down).
> >>>>
> >>> I would personally like to see those logs.
> >> I listed all the runs in the previous mail (below), Mihael. They are on 
> >> CI NFS at ~benc/swift-logs/wilde/run{345-350}.
> > 
> > Sorry about that.
> > 
> >>  Let us know what you find.
> >>
> > 
> > It looks like this:
> > - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs:
> > mkdir -p $WFDIR/info/$JOBDIR
> > mkdir -p $WFDIR/status/$JOBDIR
> > and the creation of the info file.
> > - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem:
> > mkdir -p $DIR
> > (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5,
> > which seems to roughly fit the observed numbers).
> > - 3.5 seconds for COPYING_OUTPUTS
> > - 2.5 seconds for RM_JOBDIR
> > 
> > I'd be curious to know how much of the time is actually spent writing to
> > the logs. That's because I see one second between EXECUTE_DONE and
> > COPYING_OUTPUTS, a place where the only meaningful things that are done
> > are two log messages.
> > 
> > Perhaps it may be useful to run the whole thing through strace -T.
> > 
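Short of a full strace -T run, the suspected steps can be timed one at
a time. This is only a sketch: timed is a hypothetical helper built on
bash's time keyword, and WFDIR and JOBDIR here are scratch values, not
the ones wrapper.sh uses. Pointing WFDIR at the shared filesystem
instead would show what a single mkdir -p costs there:

```shell
#!/bin/bash
# Print the wall-clock seconds a command takes, using bash's `time` keyword.
timed() {
    local TIMEFORMAT=%R
    { time "$@" > /dev/null 2>&1; } 2>&1
}

# Scratch values for illustration only (assumptions, not wrapper.sh's).
WFDIR=$(mktemp -d)
JOBDIR=job-0001

# Time the two mkdir calls suspected between LOG_START and CREATE_JOBDIR.
t_info=$(timed mkdir -p "$WFDIR/info/$JOBDIR")
t_status=$(timed mkdir -p "$WFDIR/status/$JOBDIR")

echo "mkdir info:   ${t_info}s"
echo "mkdir status: ${t_status}s"
```
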
> > Mihael
> > 
> > 
>