[Swift-devel] Swift Performance Data
Ioan Raicu
iraicu at cs.uchicago.edu
Mon Jun 25 12:11:51 CDT 2007
Here is my 2c of experience in trying to draw up graphs of various
experiments. I make a clear distinction between 1) logs that will be
used for debugging/info, which are in a relatively human-readable
format, and 2) logs that will be used for plotting graphs. The
human-readable logs (1) are almost always written in response to events
in the system. On the other hand, the logs geared towards graphing (2)
are mostly written at fixed time intervals, with only a few based on
events.
For example, in Falkon, I have the following set of logs:
1) Falkon dispatcher log (1 for the entire Falkon system) with
debug/info-level human-readable entries; Falkon typically writes to
this log on events related to task dispatch and the notifications that
happen in the Falkon service. This log is currently used only for
debugging purposes.
2) Falkon provisioner log (1 for the entire Falkon system) with
debug/info-level human-readable entries; Falkon typically writes to
this log on events related to the allocation of resources. This log is
currently used only for debugging purposes.
3) Executor logs (1 per executor, separated into different files); these
are also for human consumption, and at the most detailed logging level
they even print out the STDOUT and STDERR of the task executions! These
logs are not currently aggregated in any way, and are mostly used for
debugging purposes.
4) Task description log (1 for the entire Falkon system), which stores
the description of each task executed (i.e. TIMESTAMP, APPLICATION_ID,
EXECUTABLE, ARGUMENTS, ENVIRONMENT); I have not used this log for
anything yet, but I envision we could use it for workload
characterization, studies involving replaying an entire workload, etc...
5) Summary log (1 for the entire Falkon system) with an easy-to-parse
format for automatic graph generation; this log is written at fixed
time intervals, summarizing some of the Falkon state over each period.
The state information that goes in this log is: TimeStamp_ms num_users
num_resources num_threads num_all_workers num_free_workers
num_pend_workers num_busy_workers waitQ_length waitNotQ_length
activeQ_length doneQ_length delivered_tasks throughput_tasks/sec. This
log can be used to plot the number of executors registered, active, and
idle, the queue lengths, the throughput of tasks delivered, etc... as
the experiment progresses. In my latest development branch, I actually
log a few more parameters, such as CPU utilization, free memory, data
caching hit rates, etc...
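As a rough illustration, a fixed-interval log like (5) can be turned
into plottable columns with a few lines of code. This is a hypothetical
Python sketch; the function name and the column-name shorthand are my
own assumptions based on the field list above, not Falkon's actual
output:

```python
# Hypothetical sketch: parse a fixed-interval summary log whose
# space-delimited columns follow the order described above. The column
# names below are my shorthand, not Falkon's actual identifiers.
COLUMNS = [
    "timestamp_ms", "num_users", "num_resources", "num_threads",
    "num_all_workers", "num_free_workers", "num_pend_workers",
    "num_busy_workers", "waitQ_length", "waitNotQ_length",
    "activeQ_length", "doneQ_length", "delivered_tasks",
    "throughput_tasks_per_sec",
]

def parse_summary_log(path):
    """Yield one dict per sampling interval, ready for plotting."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) != len(COLUMNS):
                continue  # skip malformed or partial lines
            yield dict(zip(COLUMNS, (float(v) for v in fields)))
```

Each yielded dict maps a column name to its value for that interval, so
plotting, say, queue length over time is just a matter of pulling two
columns out of the stream.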
6) Per-task log (1 for the entire Falkon system) that has information on
each task executed in Falkon; this log is used to plot the per-task info
as the experiment progresses. The information that is kept on each task
is: taskID workerID startTime endTime waitQueueTime execTime
resultsQueueTime totalTime exitCode. This log can also be used to plot
per-worker information, to see how the tasks were dispersed over the
workers...
7) User information log (1 for the entire Falkon system) that stores
information relevant to the end user, and is updated every time the
state (wait, active, done) changes for any task; the information that
this log contains is: Time_ms Users Resources JVM_Threads WaitingTasks
ActiveTasks DoneTasks DeliveredTasks. I have not used this log for
anything yet, but it has much finer-grained information than the
summary log (5), so more detailed graphs/analysis could be generated
from it.
8) Worker information log (1 for the entire Falkon system) that stores
information about worker state changes, and is updated every time the
state (free, pending, busy) changes for any worker; the information
that this log contains is: Time_ms RegisteredWorkers FreeWorkers
PendWorkers BusyWorkers. Again, I have not used this log for anything
yet, but it has much finer-grained information than the summary log
(5), so more detailed graphs/analysis could be generated from it.
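Since (7) and (8) are event-driven rather than interval-driven, plotting
them alongside the summary log usually means resampling the step
function onto a time grid. A minimal sketch of that idea in Python; the
function name and the sample data are made up for illustration:

```python
import bisect

def sample_at(times, values, t):
    """Return the most recent logged value at or before time t.

    `times` are the Time_ms stamps of the state-change records, sorted
    ascending; `values` holds the corresponding counter (e.g.
    RegisteredWorkers). Between events the value is a step function.
    """
    i = bisect.bisect_right(times, t) - 1
    return values[i] if i >= 0 else None

# Made-up example: registered-worker counts at four state changes.
times = [0, 150, 400, 900]
values = [1, 2, 3, 2]
grid = [sample_at(times, values, t) for t in range(0, 1000, 250)]
# grid == [1, 2, 3, 3]
```

The resampled `grid` can then be plotted on the same axis as the
fixed-interval summary log without losing the extra precision of the
event-driven records.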
Now, as a summary, I use (5) and (6) a lot to generate the graphs that I
do for Falkon. I have not used (7) and (8) yet, but might in the
future. It's also relatively easy to add new state information to these
existing logs since the logging calls are all localized in a few
places; with little effort, I can add new metrics to monitor, or create
a completely new log with other information that was not easy to
integrate into the existing logs. For simplicity, my perf logs (5-8)
are all simple logs that are just space-delimited...
> taskID workerID startTime endTime waitQueueTime execTime resultsQueueTime totalTime exitCode
> tg-viz-login1.uc.teragrid.org:50103:1_1326356873 tg-c058.uc.teragrid.org:50100 1182533457601 1182533985431 467599 60225 6 527830 0
> tg-viz-login1.uc.teragrid.org:50103:2_1124048393 tg-c052.uc.teragrid.org:50100 1182533457613 1182533985454 467735 60101 5 527841 0
> tg-viz-login1.uc.teragrid.org:50103:3_1648367237 tg-c053.uc.teragrid.org:50100 1182533457616 1182533985524 467760 60138 10 527908 0
They could be converted to XML or any other format you want, but this is
a nice format for programs like ploticus or gnuplot to understand easily.
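For instance, getting the per-worker task dispersion mentioned under (6)
out of such records takes only a split on whitespace. A hypothetical
Python sketch (the function name is mine, and I assume each record sits
on a single line as in the log itself):

```python
from collections import defaultdict

def tasks_per_worker(lines):
    """Count tasks per workerID from per-task records of the form:
    taskID workerID startTime endTime waitQueueTime execTime
    resultsQueueTime totalTime exitCode (9 whitespace-separated fields).
    """
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) != 9:
            continue  # skip malformed lines
        if not fields[2].isdigit():
            continue  # skip the header line (startTime is not numeric)
        counts[fields[1]] += 1  # fields[1] is the workerID
    return dict(counts)
```

The resulting mapping of workerID to task count is exactly the kind of
per-worker dispersion data that gnuplot or ploticus can plot directly.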
On the other hand, my debug logs (1-4) are all handled via log4j and
look more like the traditional logs that log4j generates and people are
accustomed to; from my point of view, though, these are tedious and
error-prone to parse for graphing purposes.
Does this distinction (human-readable vs. machine-readable) between logs
exist in Swift? If not, I would argue not to modify the debug/info
logs, but to create new logs that are specifically targeted at
automatic graph generation, such as my logs (5-8). If we are to use
tools that others have built, then we just need to make sure these new
logs conform to the appropriate format; if we are to write our own
tools (or already have them), then we have as much freedom as we want
over the format of these logs.
Ioan
Mihael Hategan wrote:
> On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote:
>
>> So who is going to do this?
>>
>> I've been asking about this for some time, and nothing has happened. The
>> result, I think, has been a lot of confusion and delay.
>>
>
> Are we still talking about collecting logs? I'm a bit confused.
>
>
>>> I agree fully with Mihael's point that we can and should start
>>> gathering all execution logs into a uniformly structured gathering
>>> place. Then we can organize the current log tools and determine whats
>>> needed next in that area.
>>>
>>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================