[Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...]

Michael Wilde wilde at mcs.anl.gov
Mon Mar 24 12:59:31 CDT 2008



On 3/24/08 12:36 PM, Ioan Raicu wrote:
> 
> 
> Michael Wilde wrote:
>> > Now the real question is, what is the breakdown of the 100 sec
>> > invocation (108.645 sec on average to be exact), how much is due to
>> > wrapper.sh, and how much is due to the application itself?  Mike, can
>> > you comment on this?  I assume you are running amiga which should have
>> > 0.5 sec jobs, right?
>>
>> Amiga is about .5 secs and the script that runs (runam3) I think adds 
>> another .5 secs (from a quick scan of falkon logs on the actual task 
>> run time - but please verify, I think you have all the data from the 
>> task log).
> In the log with 1000 tasks, the shortest job was 72 secs, the average 
> 108, and the max 170 sec.  Is amiga working from RAM, or is it from NFS?  
> If it's from NFS, how big are the input data and script?  I thought it 
> was about 10KB?  The overall throughput was 6.6 jobs/sec, so that is 
> only 66KB/s, which seems quite small, assuming that each read is done 
> in large chunks, and not a few bytes at a time.
>>
>> I suspect, as you and I both agree, that hundreds of short jobs 
>> starting in some small interval causes heavy NFS activity. 
> Yes, but is the NFS activity due to the app, or due to wrapper.sh?

It's due to both. wrapper.sh fetches the app script from NFS, which 
fetches the app from NFS.  Then wrapper.sh does its setup, which causes 
more (synchronous) NFS activity, then the app output is copied, then 
fetched back to the run directory.

All this is dominated, I suspect, by NFS request overhead, most of which 
is not data transfer.
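
As a quick check of the data-rate estimate quoted above (6.6 jobs/sec at 
roughly 10KB read per job), here is a small Python sketch.  The function 
name is made up for illustration, and the 10KB per job is only the guess 
from the quoted message, not a measured value.

```python
def nfs_data_rate_kb_per_sec(jobs_per_sec, kb_per_job):
    """Aggregate data rate if every job reads kb_per_job from NFS."""
    return jobs_per_sec * kb_per_job


# Figures taken from the quoted message; kb_per_job is an assumption there too.
rate = nfs_data_rate_kb_per_sec(jobs_per_sec=6.6, kb_per_job=10)
print(f"{rate:.0f} KB/s")
```

A rate this low suggests raw bandwidth is unlikely to be the limit, which 
fits the suspicion that per-request overhead matters more than bytes moved.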

There's really nothing to discuss regarding this until I get some data 
from tests.

- Mike

> 
> I would replace the amiga app with a sleep 0.5, or sleep 1, just to see 
> if the graph looks much different or not.  That will surely show 
> whether the overhead comes from your app or from wrapper.sh.
> 
> Ioan
>> The next round of testing we'll do should start to pick this apart, 
>> determine causes and prototype improvements.
>>
>> - Mike
>>
>>
>> On 3/24/08 11:52 AM, Ioan Raicu wrote:
>>> Not sure if this email made it to the mailing list, due to the larger 
>>> size (128KB)...
>>>
>>> Ioan
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Subject:
>>> Re: [Swift-devel] Re: swift-falkon problem... plots to explain 
>>> plateaus...
>>> From:
>>> Ioan Raicu <iraicu at cs.uchicago.edu>
>>> Date:
>>> Mon, 24 Mar 2008 11:48:16 -0500
>>> To:
>>> Ben Clifford <benc at hawaga.org.uk>
>>> CC:
>>> swift-devel <swift-devel at ci.uchicago.edu>
>>>
>>>
>>> OK, here is my analysis of the plateaus, from Falkon's point of view.
>>>
>>> Notice the per task execution (green) is about 100 seconds per job, 
>>> where the job is some invocation of the wrapper.sh that Swift sent to 
>>> Falkon.  Things look normal so far.  See the 2nd graph for more...
>>>
>>>
>>> This shows that there are 600 workers (600 CPUs), which all get their 
>>> work within 10 seconds... then they all churn away until about 100 
>>> sec when jobs start completing, and new ones get dispatched.  At 
>>> around 132 seconds, the wait queue is empty, and some workers start 
>>> becoming idle (the red area)... by time 155, the initial 600 jobs 
>>> that started between time 0 and 10, have completed, and from 155 to 
>>> 211, the remaining 400 jobs all run to completion; they really only 
>>> start completing around 190 sec, and all finish by 211.  So the 
>>> plateau, which is evident here as well, is really when 400 workers are 
>>> executing 400 jobs in parallel, and since the jobs are taking around 
>>> 100 sec each to complete, the plateau of 50 seconds is completely 
>>> normal.  See more after the graph...
>>>
>>>
>>> Now the real question is, what is the breakdown of the 100 sec 
>>> invocation (108.645 sec on average to be exact), how much is due to 
>>> wrapper.sh, and how much is due to the application itself?  Mike, can 
>>> you comment on this?  I assume you are running amiga which should 
>>> have 0.5 sec jobs, right?
>>>
>>> Ioan
>>>
>>> Ioan Raicu wrote:
>>>> I see the plateau, but there are other graphs which seem to go crazy 
>>>> during those periods, such as
>>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png 
>>>>
>>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png 
>>>>
>>>>
>>>> Looking at the Falkon logs might reveal more about whether the 
>>>> plateau was due to Falkon or not.  Where would I find the Falkon 
>>>> logs that correlate to these graphs?
>>>>
>>>> Ioan
>>>>
>>>> Ben Clifford wrote:
>>>>> you can get plots for your 1000 job run here:
>>>>>
>>>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/
>>>>>
>>>>> you're hitting the file transfer and file operation limits (that 
>>>>> are 20 in your config) once jobs start staging out.
>>>>>
>>>>> There's a weird-looking plateau in graph 'number of execute2 tasks 
>>>>> at once:' around 170s .. 200s where no jobs complete for some time.
>>>>>
>>>>> Getting the falkon logs and/or the wrapper (.d) logs would be 
>>>>> interesting there.
>>>>>
>>>>> these were generated on my laptop with:
>>>>>
>>>>> make \
>>>>>  LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log \
>>>>>  clean webpage.weights webpage.kara webpage
>>>>>
>>>>> using the SVN log-processing code.
>>>>>   
>>>>
>>>
>>> -- 
>>> ===================================================
>>> Ioan Raicu
>>> Ph.D. Candidate
>>> ===================================================
>>> Distributed Systems Laboratory
>>> Computer Science Department
>>> University of Chicago
>>> 1100 E. 58th Street, Ryerson Hall
>>> Chicago, IL 60637
>>> ===================================================
>>> Email: iraicu at cs.uchicago.edu
>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>> http://dev.globus.org/wiki/Incubator/Falkon
>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>> ===================================================
>>> ===================================================
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
> 
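
The plateau explanation quoted above (600 workers, 1000 jobs, about 100 
sec per job) can be reproduced with a toy simulation.  This is only a 
sketch under simplifying assumptions: every job takes exactly 100 sec, 
dispatch is instantaneous and FIFO, and the function names are 
hypothetical.  It does not model Falkon or wrapper.sh.

```python
import heapq


def completion_times(num_workers, durations):
    """Give each job to the earliest-free worker; return sorted finish times."""
    free_at = [0.0] * num_workers  # min-heap of times when each worker is idle
    heapq.heapify(free_at)
    finished = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest moment any worker is free
        end = start + duration
        finished.append(end)
        heapq.heappush(free_at, end)
    return sorted(finished)


def longest_gap(times):
    """Longest stretch between two consecutive job completions."""
    return max(later - earlier for earlier, later in zip(times, times[1:]))


# 1000 jobs of a constant 100 sec each on 600 workers (assumed values).
times = completion_times(600, [100.0] * 1000)
print("makespan:", times[-1])
print("longest stretch with no completions:", longest_gap(times))
```

With constant durations, the second wave of 400 jobs cannot finish until 
a full job length after the first wave, so a stretch with no completions 
appears even though every worker is busy.  Feeding in the measured 
per-task durations from the Falkon log instead of a constant would likely 
shorten that stretch, since completions would then be spread out.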
