[Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...]

Michael Wilde wilde at mcs.anl.gov
Mon Mar 24 12:21:14 CDT 2008


 > Now the real question is, what is the breakdown of the 100 sec
 > invocation (108.645 sec on average to be exact), how much is due to
 > wrapper.sh, and how much is due to the application itself?  Mike, can
 > you comment on this?  I assume you are running amiga which should have
 > 0.5 sec jobs, right?

Amiga is about 0.5 secs and the script that runs (runam3) I think adds 
another 0.5 secs (from a quick scan of the Falkon logs for the actual task 
run time; please verify, since I think you have all the data from the task log).
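
As a rough sketch of what those numbers would imply: assuming both 0.5 sec 
estimates hold, and assuming everything else in the 108.645 sec average is 
wrapper.sh plus filesystem overhead (neither is verified yet), the split 
could be worked out like this. The function name is just for illustration.

```python
# Figures come from this thread; treating everything that is not
# amiga/runam3 time as "overhead" is an assumption, not a measurement.
AVG_INVOCATION_SEC = 108.645  # average per-task time reported from Falkon
AMIGA_SEC = 0.5               # approximate amiga run time
RUNAM3_SEC = 0.5              # approximate runam3 script time (unverified)


def overhead_breakdown(total, *payload_parts):
    """Return (payload, overhead, overhead_fraction) for one invocation."""
    payload = sum(payload_parts)
    overhead = total - payload
    return payload, overhead, overhead / total


payload, overhead, fraction = overhead_breakdown(
    AVG_INVOCATION_SEC, AMIGA_SEC, RUNAM3_SEC
)
print(f"payload  : {payload:.3f} s")
print(f"overhead : {overhead:.3f} s ({fraction:.1%} of the invocation)")
```

If the estimates are right, nearly all of each invocation is spent outside 
the application, which is what the next round of tests should confirm or refute.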

I suspect, as you and I both agree, that hundreds of short jobs starting 
within a small interval cause heavy NFS activity. The next round of 
testing we do should start to pick this apart, determine the causes, and 
prototype improvements.

- Mike


On 3/24/08 11:52 AM, Ioan Raicu wrote:
> Not sure if this email made it to the mailing list, due to the larger 
> size (128KB)...
> 
> Ioan
> 
> ------------------------------------------------------------------------
> 
> Subject:
> Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...
> From:
> Ioan Raicu <iraicu at cs.uchicago.edu>
> Date:
> Mon, 24 Mar 2008 11:48:16 -0500
> To:
> Ben Clifford <benc at hawaga.org.uk>
> CC:
> swift-devel <swift-devel at ci.uchicago.edu>
> 
> 
> OK, here is my analysis of the plateaus, from Falkon's point of view.
> 
> Notice the per task execution (green) is about 100 seconds per job, 
> where the job is some invocation of the wrapper.sh that Swift sent to 
> Falkon.  Things look normal so far.  See the 2nd graph for more...
> 
> 
> This shows that there are 600 workers (600 CPUs), which all get their 
> work within 10 seconds... then they all churn away until about 100 sec 
> when jobs start completing, and new ones get dispatched.  At around 132 
> seconds, the wait queue is empty, and some workers start becoming idle 
> (the red area)... by time 155, the initial 600 jobs that started between 
> time 0 and 10, have completed, and from 155 to 211, the remaining 400 
> jobs all run to completion; they really only start completing around 190 
> sec, and all finish by 211.  So, the plateau, that is evident here as 
> well, is really when 400 workers are executing 400 jobs in parallel, and 
> since the jobs are taking around 100 sec each to complete, the plateau 
> of 50 seconds is completely normal.  See more after the graph...
> 
> 
> Now the real question is, what is the breakdown of the 100 sec 
> invocation (108.645 sec on average to be exact), how much is due to 
> wrapper.sh, and how much is due to the application itself?  Mike, can 
> you comment on this?  I assume you are running amiga which should have 
> 0.5 sec jobs, right?
> 
> Ioan
> 
> Ioan Raicu wrote:
>> I see the plateau, but there are other graphs which seem to go crazy 
>> during those periods, such as
>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png 
>>
>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png 
>>
>>
>> Looking at the Falkon logs might reveal whether the plateau was due 
>> to Falkon or not.  Where would I find the Falkon logs that 
>> correlate to these graphs?
>>
>> Ioan
>>
>> Ben Clifford wrote:
>>> you can get plots for your 1000 job run here:
>>>
>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/
>>>
>>> you're hitting the file transfer and file operation limits (that are 
>>> 20 in your config) once jobs start staging out.
>>>
>>> There's a weird-looking plateau in graph 'number of execute2 tasks at 
>>> once:' around 170s .. 200s where no jobs complete for some time.
>>>
>>> Getting the falkon logs and/or the wrapper (.d) logs would be 
>>> interesting there.
>>>
>>> these were generated on my laptop with:
>>>
>>> make \
>>>  LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log clean \
>>>  webpage.weights webpage.kara webpage
>>>
>>> using the SVN log-processing code.
>>>   
>>
> 
> -- 
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


