[hpc-announce] CFP: Resilience at Euro-Par 2016 - Papers due May 6

Christian Engelmann engelmannc at computer.org
Tue Mar 29 21:16:52 CDT 2016


We apologize if you receive multiple copies of this call for papers.

------------------------------------------------------------------------------------------------------------------------

 9th Workshop on Resiliency in High Performance Computing (Resilience)
                 in Clusters, Clouds, and Grids

                      in conjunction with

the 22nd International European Conference on Parallel and Distributed
      Computing (Euro-Par), Grenoble, France, August 22-26, 2016

Overview:

Resilience is a critical challenge as high performance computing (HPC)
systems continue to increase component counts, individual component
reliability decreases (such as due to shrinking process technology and
near-threshold voltage (NTV) operation), and software complexity increases.
Application correctness and execution efficiency, in spite of frequent
faults, errors, and failures, are essential to ensure the success of the
extreme-scale HPC systems, cluster computing environments, Grid computing
infrastructures, and Cloud computing services.

A fault (e.g., a bug or stuck bit) is the cause of an error, its
manifestation as a state change is considered an error (e.g., a bad value
or incorrect execution), and the transition to an incorrect service is
observed as a failure (e.g., an application abort or system crash). A
failure in a computing system is typically observed through an application
abort or a full/partial service or system outage. A detectable correctable
error is often transparently handled by hardware, such as a single bit flip
in memory that is protected with single-error correction double-error
detection (SECDED) error correcting code (ECC). A detectable uncorrectable
error (DUE) typically results in a failure, such as multiple bit flips in
the same addressable word that escape SECDED ECC correction, but not
detection, and ultimately cause an application abort. An undetectable error
(UE) may result in silent data corruption (SDC), e.g., an incorrect
application output. There are many other types of hardware and software
faults, errors, and failures in computing systems.

Resilience for HPC systems encompasses a wide spectrum of fundamental and
applied research and development, including theoretical foundations, fault
detection and prediction, monitoring and control, end-to-end data integrity,
enabling infrastructure, and resilient solvers and algorithm-based fault
tolerance. This workshop brings together experts in the community to further
research and development in HPC resilience and to facilitate exchanges
across the computational paradigms of extreme-scale HPC, cluster computing,
Grid computing, and Cloud computing.

Submission Guidelines:

Authors are invited to submit papers electronically in English in PDF
format. Submitted manuscripts should be structured as technical papers and
may not exceed 12 pages, including figures, tables and references, using
Springer's Lecture Notes in Computer Science (LNCS) format at
<http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>. Submissions
should include abstract, key words and the e-mail address of the
corresponding author. Papers not conforming to these guidelines may be
returned without review. All manuscripts will be reviewed and will be
judged on correctness, originality, technical strength, significance,
quality of presentation, and interest and relevance to the conference
attendees. Submitted papers must represent original unpublished research
that is not currently under review for any other conference or journal.
Papers not following these guidelines will be rejected without review and
further action may be taken, including (but not limited to) notifications
sent to the heads of the institutions of the authors and sponsors of the
conference. Submissions received after the due date, exceeding the length limit,
or not appropriately structured may also not be considered. The proceedings
will be published in Springer's LNCS as post-conference proceedings. At
least one author of an accepted paper must register for and attend the
workshop for inclusion in the proceedings. Authors may contact the workshop
program chairs for more information.

Important websites:

- Resilience 2016 Website: <http://www.csm.ornl.gov/srt/conferences/Resilience/2016>
- Resilience 2016 Submissions: <https://easychair.org/conferences/?conf=europar2016ws>
- Euro-Par 2016 website: <http://europar2016.inria.fr>

Topics of interest include, but are not limited to:

- Theoretical foundations for resilience:
 - Metrics and measurement
 - Statistics and optimization
 - Simulation and emulation
 - Formal methods
 - Efficiency modeling and uncertainty quantification

- Fault detection and prediction:
 - Statistical analyses
 - Machine learning
 - Anomaly detection
 - Data and information collection
 - Visualization

- Monitoring and control for resilience:
 - Platform and application monitoring
 - Response and recovery
 - RAS theory and performability
 - Application and platform knobs
 - Tunable fidelity and quality of service

- End-to-end data integrity:
 - Fault tolerant design
 - Degraded modes
 - Forward migration and verification
 - Fault injection
 - Soft errors
 - Silent data corruption

- Enabling infrastructure for resilience:
 - RAS systems
 - System software and middleware
 - Programming models
 - Tools
 - Next-generation architectures

- Resilient solvers and algorithm-based fault tolerance:
 - Algorithmic detection and correction of hard and soft faults
 - Resilient algorithms
 - Fault tolerant numerical methods
 - Robust iterative algorithms
 - Scalability of resilient solvers and algorithm-based fault tolerance

Important Dates:

- Workshop papers due: May 6, 2016
- Workshop author notification: June 17, 2016
- Workshop early registration: TBD
- Workshop paper (for informal workshop proceedings): July 31, 2016
- Workshop camera-ready papers: October 3, 2016

General Co-Chairs:

- Stephen L. Scott
 Senior Research Scientist - Systems Research Team
 Tennessee Tech University and Oak Ridge National Laboratory, USA
 scottsl at ornl.gov
- Chokchai (Box) Leangsuksun
 SWEPCO Endowed Associate Professor of Computer Science
 Louisiana Tech University, USA
 box at latech.edu

Program Co-Chairs:

- Patrick G. Bridges
 University of New Mexico, USA
 bridges at cs.unm.edu
- Christian Engelmann
 Oak Ridge National Laboratory, USA
 engelmannc at ornl.gov

Program Committee:

- Ferrol Aderholdt, Oak Ridge National Laboratory, USA
- Dorian Arnold, University of New Mexico, USA
- Wesley Bland, Intel Corporation, USA
- Hans-Joachim Bungartz, Technical University of Munich, Germany
- Franck Cappello, Argonne National Laboratory and
 University of Illinois at Urbana-Champaign, USA
- Zizhong Chen, University of California at Riverside, USA
- Robert Clay, Sandia National Laboratories, USA
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, Sandia National Laboratories, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Larry Kaplan, Cray Inc., USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
- Sriram Krishnamoorthy, Pacific Northwest National Laboratory, USA
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Kathryn Mohror, Lawrence Livermore National Laboratory, USA
- Christine Morin, INRIA Rennes, France
- Dirk Pflueger, University of Stuttgart, Germany
- Nageswara Rao, Oak Ridge National Laboratory, USA
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Yves Robert, ENS Lyon, France
- Martin Schulz, Lawrence Livermore National Laboratory, USA
- Keita Teranishi, Sandia National Laboratories, USA

--

Christian Engelmann, Ph.D.

System Software Team Task Lead / R&D Staff Scientist
Computer Science Research Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory

Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
e-Mail: engelmannc at ornl.gov / Home: www.christian-engelmann.info

