[hpc-announce] Extended Paper Deadline 7/20 for the Workshop on Managing Systems Automatically and Dynamically (MAD)

Bronevetsky, Greg bronevetsky1 at llnl.gov
Mon Jul 9 11:01:03 CDT 2012

      Workshop on Managing Systems Automatically and Dynamically (MAD)
At the USENIX Symposium on Operating Systems Design and Implementation (OSDI)
                             October 8-10, 2012
                             Hollywood, CA, USA

* Full paper submission due: Friday, July 20, 2012
* Notification of acceptance: Friday, August 17, 2012
* Final papers due: Wednesday, September 12, 2012

The complexity of modern systems makes them extremely challenging to manage. From highly heterogeneous desktop environments to large-scale systems that consist of many thousands of software and hardware components, these systems exhibit a wide range of complex behaviors that are difficult to predict. As such, although the raw computational capability of these systems grows each year, much of it is lost to (i) complex failures that are difficult to localize and (ii) poor performance and efficiency that result from a system configuration that is inappropriate for the user's workload. The MAD workshop focuses on techniques to make complex systems manageable, addressing the problem's three major aspects:

System Monitoring
Systems report their state and behavior using a wide range of mechanisms. System and application logs include reports of key events that occur within software or hardware components. Performance counters measure various OS and hardware-level metrics (e.g. packets sent or cache misses) within a given time period. Further, information from source code version control systems or request traces can help identify the source of failures or poor performance.

Data Analysis
Data produced by monitoring can be analyzed using a variety of techniques to understand the system state and predict its behavior in various possible scenarios. Traditionally, this analysis consisted of system administrators manually inspecting system logs or using explicit pattern-matching rules to identify key events. Recent research has also focused on statistical and machine learning techniques to automatically identify behavioral patterns. Finally, the data can be presented directly to system administrators. Because of the data's large volume, such displays rely on aggregation techniques that show the maximal information in minimal space.

Informed Action
The analyses and visualizations are used by operators to select the best action to improve productivity or localize and resolve system failures. The possible actions include restarting processes, rebooting servers, rolling back application updates or reconfiguring system components. Since the choice of the best action is complex, it requires assistance from additional analysis tools to predict the productivity of any given configuration on the given workload.

MAD seeks original early work on system management, including position papers and work-in-progress reports that will mature to be published at high-quality conferences. Papers are expected to demonstrate a strong foundation in the needs of the system management community and be positioned within the broader context of related work. In addition to technical merit, papers will be selected to encourage discussion at the workshop and among members of the general system management community.

Topics include but are not limited to:
* Techniques to collect metric and log data, including tracing and statistical measurements
* Large-scale aggregation of metric and log data
* Reports on publicly available sources of sample logs of system metrics

* Automated analysis of system logs and metrics using statistical, machine learning, and natural language processing techniques
* Visualization of system information in a way that leads administrators to actionable insights
* Evaluation of the quality of learned models, including assessing the confidence/reliability of models and comparisons between different methods

* Applications of log and metric analysis to address reliability, performance, power management, security, fault diagnosis, scheduling, or manageability
* Challenges of scale in applying machine learning to large systems
* Integration of machine learning into real-world systems and processes

Peter Bodik, Microsoft Research (peterb at microsoft.com)
Greg Bronevetsky, Lawrence Livermore National Laboratory (bronevetsky at llnl.gov)
