[hpc-announce] CFP: Resilience 2015, the 8th Workshop on Resiliency in High Performance Computing

Mon Apr 6 14:43:48 CDT 2015

We apologize if you receive multiple copies of this notice.

------------------------------------------------------------------------------------------------------------------------

   8th Workshop on Resiliency in High Performance Computing (Resilience)
                   in Clusters, Clouds, and Grids

                        in conjunction with

  the 21st International European Conference on Parallel and Distributed
        Computing (Euro-Par), Vienna, Austria, August 24-28, 2015

Overview:

Clouds, Grids, and Clusters are three different computational paradigms with
the potential to support High Performance Computing (HPC) and enterprise IT
infrastructure.  Currently, they consist of hardware, management, and usage
models particular to different computational regimes. For example, high
performance cluster systems designed to support tightly coupled scientific
simulation codes typically utilize high-speed interconnects, whereas commercial
cloud systems designed to support software as a service (SaaS) typically do
not. However, in order to support HPC, all must at least utilize large numbers
of resources, and hence effective HPC in any of these paradigms must address
the same issue of resiliency at very large scale.

Recent trends in high-performance computing (HPC) systems have clearly indicated
that future increases in performance, in excess of those resulting from
improvements in single-processor performance, will be achieved through
corresponding increases in system scale, i.e., using a significantly larger
component count. As the raw computational performance of the world's fastest
HPC systems increases from today's multi-petascale to next-generation
exascale capability and beyond, their number of computational, networking, and
storage components will grow from the ten to one hundred thousand compute nodes
of today's systems to several hundreds of thousands of compute nodes in the
foreseeable future. This substantial growth in system scale, and the resulting
component count, poses a challenge for HPC system and application software with
respect to reliability, availability and serviceability (RAS).

The expected total component count of these HPC systems calls into question
many of today's HPC RAS assumptions. Although the mean time to failure (MTTF)
for each individual component, e.g., processor, memory module, and network
interface, may be above typical consumer product standards, the probability of
failure for the overall system scales proportionally to the number of
interdependent components and their combined probabilities of failure. Thus,
the enormous number of individual components results in a much lower system
mean time to failure (SMTTF), causing more frequent system-wide interruptions
than those seen on current HPC systems. This effect is not limited to hardware
components, but also extends to software components, e.g., operating system,
system software, and applications. Although software components, unlike
hardware components, do not become less reliable with age, they do contain other
sources of failures, such as design and implementation errors. Furthermore, the
health of software components also involves resource utilization, such as
processor, memory and network usage.
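
The scaling argument above can be illustrated with a minimal sketch. It
assumes the simplest textbook model, which is not stated in this call: all
components are identical, fail independently with exponentially distributed
lifetimes, and any single failure interrupts the whole system. Under those
assumptions the failure rates add up, so the SMTTF is the component MTTF
divided by the component count. The function name and the numbers below are
hypothetical and chosen only for illustration.

```python
def system_mttf_hours(component_mttf_hours: float, component_count: int) -> float:
    """Return the system MTTF in hours for a series system.

    Assumed model (illustrative only): identical, independent components
    with exponentially distributed lifetimes, where one component failure
    interrupts the entire system.
    """
    if component_mttf_hours <= 0 or component_count <= 0:
        raise ValueError("MTTF and component count must be positive")
    # Each component fails at a constant rate of 1 / MTTF per hour.
    component_failure_rate = 1.0 / component_mttf_hours
    # In a series system with independent components, the rates add.
    system_failure_rate = component_count * component_failure_rate
    return 1.0 / system_failure_rate


if __name__ == "__main__":
    # Hypothetical component MTTF of one million hours (over a century).
    component_mttf = 1_000_000.0
    for count in (10_000, 100_000, 500_000):
        smttf = system_mttf_hours(component_mttf, count)
        print(f"{count} components: SMTTF is about {smttf:.1f} hours")
```

Even with a very reliable hypothetical component, the system-level figure
drops from days to hours as the count grows, which is the effect described
above. Real systems have heterogeneous and correlated failures, so this
should be read as a rough lower-order estimate rather than a prediction.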

To address the issue of computing resiliency, fault tolerance and high
availability have become critical research topics. The goal of this workshop is
to bring together the community in an effort to facilitate resilient HPC in each
of these three computational paradigms -- Clouds, Grids, and Clusters. Their
respective differences in architecture, management, and usage models may lend
themselves to different approaches to resiliency. Knowledge of these approaches
in one may be used to enable resiliency in the others or to define new usage
models to enable HPC. This workshop targets fundamental solutions and issues in
resiliency for HPC.

Submission Guidelines:

Authors are invited to submit papers electronically in English in PDF format.
Submitted manuscripts should be structured as technical papers and may not
exceed 12 pages, including figures, tables and references, using Springer's
Lecture Notes in Computer Science (LNCS) format at
<http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>. Submissions
should include an abstract, keywords, and the e-mail address of the corresponding
author. Papers not conforming to these guidelines may be returned without
review. All manuscripts will be reviewed and will be judged on correctness,
originality, technical strength, significance, quality of presentation, and
interest and relevance to the conference attendees. Submitted papers must
represent original unpublished research that is not currently under review for
any other conference or journal. Papers not following these guidelines will be
rejected without review and further action may be taken, including (but not
limited to) notifications sent to the heads of the institutions of the authors
and sponsors of the conference. Submissions received after the due date,
exceeding the length limit, or not appropriately structured may also not be
considered. The proceedings will be published in Springer's LNCS as
post-conference proceedings. At least one author of an accepted paper must
register for and attend the workshop for inclusion in the proceedings. Authors
may contact the workshop program chairs for more information.

Important websites:
- Resilience 2015 Website: <http://www.csm.ornl.gov/srt/conferences/Resilience/2015>
- Resilience 2015 Submissions: <https://easychair.org/conferences/?conf=europar2015ws>
- Euro-Par 2015 website: <http://www.europar2015.org>

Topics of interest include, but are not limited to:
- Hardware for fault detection and resiliency
- System-level resiliency for HPC, Grid, Cluster, and Cloud
- Algorithm-based resiliency - generic, fundamental advances (not Hadoop)
- Statistical methods to improve system resiliency
- Fault tolerance mechanism experiments
- Resource management for system resiliency and availability
- Resilient systems based on hardware probes
- Monitoring mechanisms to support fault prediction and fault mitigation
- Application-level fault tolerance
- Fault prediction and failure modeling

Important Dates:
- Workshop papers due: May 22, 2015
- Workshop author notification: June 19, 2015
- Workshop early registration: July 17, 2015
- Workshop paper (for informal workshop proceedings): July 31, 2015
- Workshop camera-ready papers: October 2, 2015

General Co-Chairs:
- Stephen L. Scott
 Senior Research Scientist - Systems Research Team
 Tennessee Tech University and Oak Ridge National Laboratory, USA
 scottsl at ornl.gov
- Chokchai (Box) Leangsuksun
 SWEPCO Endowed Associate Professor of Computer Science
 Louisiana Tech University, USA
 box at latech.edu

Program Co-Chairs:
- Patrick G. Bridges
 University of New Mexico, USA
 bridges at cs.unm.edu
- Christian Engelmann
 Oak Ridge National Laboratory, USA
 engelmannc at ornl.gov

Program Committee:
- Ferrol Aderholdt, Tennessee Tech University, USA
- Dorian Arnold, University of New Mexico, USA
- Wesley Bland, Intel Corporation, USA
- Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
- Franck Cappello, Argonne National Laboratory and University of Illinois at
 Urbana-Champaign, USA
- Zizhong Chen, University of California at Riverside, USA
- Andrew A. Chien, University of Chicago and Argonne National Laboratory, USA
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, North Carolina State University, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Michael Heroux, Sandia National Laboratories, USA
- Larry Kaplan, Cray Inc., USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
- Sriram Krishnamoorthy, Pacific Northwest National Laboratory, USA
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Kathryn Mohror, Lawrence Livermore National Laboratory, USA
- Christine Morin, INRIA Rennes, France
- Nageswara Rao, Oak Ridge National Laboratory, USA
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Martin Schulz, Lawrence Livermore National Laboratory, USA
- Marc Snir, Argonne National Laboratory, USA
- Keita Teranishi, Sandia National Laboratories, USA

--

Christian Engelmann, Ph.D.

System Software Team Task Lead / R&D Staff Scientist
Computer Science Research Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory

Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
e-Mail: engelmannc at ornl.gov / Home: www.christian-engelmann.info