[Swift-devel] Swift Performance Data
Ioan Raicu
iraicu at cs.uchicago.edu
Mon Jun 25 12:11:51 CDT 2007
Here is my 2c of experience in trying to draw up graphs of various
experiments. I make a clear distinction between 1) logs that will be
used for debugging/info, which are in a relatively human-readable
format, and 2) logs that will be used for plotting graphs. The
human-readable logs (1) are almost always written in response to events
in the system. On the other hand, the logs geared towards graphing (2)
are mostly written at fixed time intervals, with only a few based on
events.
For example, in Falkon, I have the following set of logs:
1) Falkon dispatcher log (1 for the entire Falkon system) with
debug/info-level human-readable entries; Falkon typically writes to
this log on events related to task dispatch and the notifications that
happen in the Falkon service. This log is currently used only for
debugging purposes.
2) Falkon provisioner log (1 for the entire Falkon system) with
debug/info-level human-readable entries; Falkon typically writes to
this log on events related to the allocation of resources. This log is
currently used only for debugging purposes.
3) Executor logs (1 per executor, separated into different files); these
are also for human consumption, and at the most detailed logging level
they even print out the STDOUT and STDERR of the task executions! These
logs are not currently aggregated in any way, and are mostly used for
debugging purposes.
4) Task description log (1 for the entire Falkon system), which stores
the description of each task executed (i.e. TIMESTAMP, APPLICATION_ID,
EXECUTABLE, ARGUMENTS, ENVIRONMENT); I have not used this log for
anything yet, but I envision we could use it for workload
characterization, studies involving replaying an entire workload, etc...
5) Summary log (1 for the entire Falkon system) with an easy-to-parse
format for automatic graph generation; this log is written at fixed
time intervals, summarizing some of the Falkon state over each period.
The state information that goes in this log is: TimeStamp_ms num_users
num_resources num_threads num_all_workers num_free_workers
num_pend_workers num_busy_workers waitQ_length waitNotQ_length
activeQ_length doneQ_length delivered_tasks throughput_tasks/sec. This
log can be used to plot the number of executors registered, active, and
idle, the queue lengths, the throughput of tasks delivered, etc... as
the experiment progresses. In my latest development branch, I actually
log a few more parameters, such as CPU utilization, free memory, data
caching hit rates, etc...
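As a rough illustration, a fixed-interval log like (5) can be turned
into plottable columns with a few lines of code. This is a hypothetical
Python sketch; the function name and the column-name shorthand are my
own assumptions based on the field list above, not Falkon's actual
output:

```python
# Hypothetical sketch: parse a fixed-interval summary log whose
# space-delimited columns follow the order described above. The column
# names below are my shorthand, not Falkon's actual identifiers.
COLUMNS = [
    "timestamp_ms", "num_users", "num_resources", "num_threads",
    "num_all_workers", "num_free_workers", "num_pend_workers",
    "num_busy_workers", "waitQ_length", "waitNotQ_length",
    "activeQ_length", "doneQ_length", "delivered_tasks",
    "throughput_tasks_per_sec",
]

def parse_summary_log(path):
    """Yield one dict per sampling interval, ready for plotting."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) != len(COLUMNS):
                continue  # skip malformed or partial lines
            yield dict(zip(COLUMNS, (float(v) for v in fields)))
```

Each yielded dict maps a column name to its value for that interval, so
plotting, say, queue length over time is just a matter of pulling two
columns out of the stream.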
6) Per-task log (1 for the entire Falkon system) that has information on
each task executed in Falkon; this log is used to plot the per-task info
as the experiment progresses. The information that is kept on each task
is: taskID workerID startTime endTime waitQueueTime execTime
resultsQueueTime totalTime exitCode. This log can also be used to plot
per-worker information, to see how the tasks were dispersed over the
workers...
7) User information log (1 for the entire Falkon system) that stores
information relevant to the end user, and is updated every time the
state (wait, active, done) changes for any task; the information that
this log contains is: Time_ms Users Resources JVM_Threads WaitingTasks
ActiveTasks DoneTasks DeliveredTasks. I have not used this log for
anything yet, but it has much finer-grained information than the
summary log (5), so more detailed graphs/analysis could be generated
from it.
8) Worker information log (1 for the entire Falkon system) that stores
information about worker state changes, and is updated every time the
state (free, pending, busy) changes for any worker; the information
that this log contains is: Time_ms RegisteredWorkers FreeWorkers
PendWorkers BusyWorkers. Again, I have not used this log for anything
yet, but it has much finer-grained information than the summary log
(5), so more detailed graphs/analysis could be generated from it.
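Since (7) and (8) are event-driven rather than interval-driven, plotting
them alongside the summary log usually means resampling the step
function onto a time grid. A minimal sketch of that idea in Python; the
function name and the sample data are made up for illustration:

```python
import bisect

def sample_at(times, values, t):
    """Return the most recent logged value at or before time t.

    `times` are the Time_ms stamps of the state-change records, sorted
    ascending; `values` holds the corresponding counter (e.g.
    RegisteredWorkers). Between events the value is a step function.
    """
    i = bisect.bisect_right(times, t) - 1
    return values[i] if i >= 0 else None

# Made-up example: registered-worker counts at four state changes.
times = [0, 150, 400, 900]
values = [1, 2, 3, 2]
grid = [sample_at(times, values, t) for t in range(0, 1000, 250)]
# grid == [1, 2, 3, 3]
```

The resampled `grid` can then be plotted on the same axis as the
fixed-interval summary log without losing the extra precision of the
event-driven records.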
Now, as a summary, I use (5) and (6) a lot to generate the graphs that I
do for Falkon. I have not used (7) and (8) yet, but might in the
future. It's also relatively easy to add new state information to these
existing logs since the logging calls are all localized in a few
places; with little effort, I can add new metrics to monitor, or create
a completely new log with other information that was not easy to
integrate into the existing logs. For simplicity, my perf logs (5-8)
are all simple logs that are just space-delimited...
> taskID workerID startTime endTime waitQueueTime execTime resultsQueueTime totalTime exitCode
> tg-viz-login1.uc.teragrid.org:50103:1_1326356873 tg-c058.uc.teragrid.org:50100 1182533457601 1182533985431 467599 60225 6 527830 0
> tg-viz-login1.uc.teragrid.org:50103:2_1124048393 tg-c052.uc.teragrid.org:50100 1182533457613 1182533985454 467735 60101 5 527841 0
> tg-viz-login1.uc.teragrid.org:50103:3_1648367237 tg-c053.uc.teragrid.org:50100 1182533457616 1182533985524 467760 60138 10 527908 0
They could be converted to XML or any other format you want, but this is
a nice format for programs like ploticus or gnuplot to understand easily.
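For instance, getting the per-worker task dispersion mentioned under (6)
out of such records takes only a split on whitespace. A hypothetical
Python sketch (the function name is mine, and I assume each record sits
on a single line as in the log itself):

```python
from collections import defaultdict

def tasks_per_worker(lines):
    """Count tasks per workerID from per-task records of the form:
    taskID workerID startTime endTime waitQueueTime execTime
    resultsQueueTime totalTime exitCode (9 whitespace-separated fields).
    """
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) != 9:
            continue  # skip malformed lines
        if not fields[2].isdigit():
            continue  # skip the header line (startTime is not numeric)
        counts[fields[1]] += 1  # fields[1] is the workerID
    return dict(counts)
```

The resulting mapping of workerID to task count is exactly the kind of
per-worker dispersion data that gnuplot or ploticus can plot directly.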
On the other hand, my debug logs (1-4) are all handled via log4j and
look more like the traditional logs that log4j generates and people are
accustomed to; from my point of view, though, these are tedious and
error-prone to parse for graphing purposes.
Does this distinction (human-readable vs. machine-readable) between logs
exist in Swift? If not, I would argue not to modify the debug/info
logs, but to create new logs that are specifically targeted at
automatic graph generation, such as my logs (5-8). If we are to use
tools that others have built, then we just need to make sure these new
logs conform to the appropriate format; if we are to write our own
tools (or already have them), then we have as much freedom as we want
over the format of these logs.
Ioan
Mihael Hategan wrote:
> On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote:
>
>> So who is going to do this?
>>
>> I've been asking about this for some time, and nothing has happened. The
>> result, I think, has been a lot of confusion and delay.
>>
>
> Are we still talking about collecting logs? I'm a bit confused.
>
>
>>> I agree fully with Mihael's point that we can and should start
>>> gathering all execution logs into a uniformly structured gathering
>>> place. Then we can organize the current log tools and determine whats
>>> needed next in that area.
>>>
>>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================