[hpc-announce] CFP: Resilience 2014, The 7th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

Patrick G. Bridges patrickb314 at gmail.com
Thu May 29 10:58:52 CDT 2014


7th Workshop on Resiliency in High Performance Computing (Resilience)
in Clusters, Clouds, and Grids in conjunction with the 20th
International European Conference on Parallel and Distributed
Computing (Euro-Par 2014), Porto, Portugal, August 25-29, 2014

Overview:

Clusters, Clouds, and Grids are three different computational
paradigms with the intent or potential to support High Performance
Computing (HPC). Currently, they consist of hardware, management, and
usage models particular to different computational regimes; e.g.,
high-performance cluster systems designed to support tightly coupled
scientific simulation codes typically utilize high-speed
interconnects, whereas commercial cloud systems designed to support
software as a service (SaaS) do not. However, in order to support HPC,
all must at least utilize large numbers of resources, and hence
effective HPC in any of these paradigms must address the issue of
resiliency at large scale.

Recent trends in high-performance computing (HPC) systems have clearly
indicated that future increases in performance, in excess of those
resulting from improvements in single-processor performance, will be
achieved through corresponding increases in system scale, i.e., using
a significantly larger component count. As the raw computational
performance of the world's fastest HPC systems increases from today's
multi-petascale to next-generation exascale capability and beyond,
their number of computational, networking, and storage components
will grow from the ten to one hundred thousand compute nodes of
today's systems to several hundred thousand compute nodes in the
foreseeable future. This substantial growth in system
scale, and the resulting component count, poses a challenge for HPC
system and application software with respect to reliability,
availability and serviceability (RAS).

The expected total component count of these HPC systems calls into
question many of today's HPC RAS assumptions. Although the mean time
to failure (MTTF) of each individual component, e.g., processor,
memory module, or network interface, may exceed typical consumer
product standards, the probability of failure for the overall system
scales with the number of interdependent components and their
combined probabilities of failure. Thus, the enormous number of
individual components results in a much lower system mean time to
failure (SMTTF), causing more frequent system-wide interruptions than
are observed on current HPC systems. This effect is not limited to
hardware components, but also extends to software components, e.g.,
the operating system, system software, and applications. Although
software components do not degrade with age as hardware components
do, they contain other sources of failures, such as design and
implementation errors. Furthermore, the health of software components
also depends on resource utilization, such as processor, memory, and
network usage.
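
To make the scaling argument concrete, the back-of-envelope sketch
below (illustrative only, not part of the formal call) assumes a
series-system model with independent, exponentially distributed
component failures, under which the system failure rate is the sum of
the per-component failure rates and the system MTTF is roughly the
component MTTF divided by the component count:

    # Illustrative sketch: series-system model with independent,
    # exponentially distributed component failures. The 5-year
    # component MTTF below is a hypothetical figure, not a measurement.
    def system_mttf(component_mttf_hours, component_count):
        """Approximate system MTTF for N independent components in series."""
        return component_mttf_hours / component_count

    component_mttf = 5 * 365 * 24  # hours (~43,800 h, i.e., 5 years)
    for nodes in (10_000, 100_000, 500_000):
        hours = system_mttf(component_mttf, nodes)
        print(f"{nodes:>7} nodes -> system MTTF ~ {hours:.2f} hours")

Under this simplified model, a machine with 100,000 such components
would see a system-wide interruption roughly every 26 minutes, which
is precisely the regime that the resilience techniques solicited
below are meant to address.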

Authors are invited to submit papers electronically in English in PDF
format. Submitted manuscripts should be structured as technical papers
and may not exceed 12 pages, including figures, tables and references,
using Springer's Lecture Notes in Computer Science (LNCS) format at
<http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>.
Submissions should include an abstract, keywords, and the e-mail address
of the corresponding author. Papers not conforming to these guidelines
may be returned without review. All manuscripts will be reviewed and
will be judged on correctness, originality, technical strength,
significance, quality of presentation, and interest and relevance to
the conference attendees. Submitted papers must represent original
unpublished research that is not currently under review for any other
conference or journal. Papers not following these guidelines will be
rejected without review and further action may be taken, including
(but not limited to) notifications sent to the heads of the
institutions of the authors and sponsors of the conference.
Submissions received after the due date, exceeding the length limit, or
not appropriately structured may also not be considered. The
proceedings will be published in Springer's LNCS as post-conference
proceedings. At least one author of an accepted paper must register
for and attend the workshop for inclusion in the proceedings. Authors
may contact the workshop program chairs for more information.

Important websites:
- Resilience 2014 Website: http://xcr.cenit.latech.edu/resilience2014
- Resilience 2014 Submissions:
https://www.easychair.org/conferences/?conf=europar2014ws
- Euro-Par 2014 website: http://europar2014.dcc.fc.up.pt/

Topics of interest include, but are not limited to:
- Hardware for fault detection and resiliency
- System-level resiliency for HPC, Grid, Cluster, and Cloud
- Algorithm-based resiliency - Generic, fundamental advances (not Hadoop)
- Statistical methods to improve system resiliency
- Experiments with fault tolerance mechanisms
- Resource management for system resiliency and availability
- Resilient systems based on hardware probes
- Monitoring mechanisms to support fault prediction and fault mitigation
- Application-level fault tolerance
- Fault prediction and failure modeling

Important Dates:
- Workshop papers due: June 9, 2014 (previously May 30, 2014)
- Workshop author notification: July 4, 2014
- Workshop early registration: July 25, 2014
- Workshop camera-ready papers due: October 3, 2014

General Co-Chairs:
Stephen L. Scott
Stonecipher/Boeing Distinguished Professor of Computing
Senior Research Scientist - Systems Research Team
Tennessee Tech University and Oak Ridge National Laboratory, USA
scottsl at ornl.gov

Chokchai (Box) Leangsuksun,
SWEPCO Endowed Associate Professor of Computer Science
Louisiana Tech University, USA
box at latech.edu

Program Co-Chairs:
Patrick G. Bridges
University of New Mexico, USA
bridges at cs.unm.edu

Christian Engelmann
Oak Ridge National Laboratory, USA
engelmannc at ornl.gov

Program Committee:
Ferrol Aderholdt, Tennessee Institute of Technology
Vassil Alexandrov, Barcelona Supercomputing Center
Wesley Bland, Argonne National Laboratory
Greg Bronevetsky, Lawrence Livermore National Laboratory
Franck Cappello, INRIA and University of Illinois at Urbana-Champaign
Zizhong Chen, University of California at Riverside
Nathan Debardeleben, Los Alamos National Laboratory
Kurt Ferreira, Sandia National Laboratories
Cecile Germain, Université Paris-Sud
Larry Kaplan, Cray Inc.
Dieter Kranzlmüller, Ludwig-Maximilians University of Munich
Sriram Krishnamoorthy, Pacific Northwest National Laboratory
Scott Levy, University of New Mexico
Celso Mendes, University of Illinois at Urbana-Champaign
Kathryn Mohror, Lawrence Livermore National Laboratory
Christine Morin, INRIA Rennes
Mihaela Paun, Louisiana Tech University
Alexander Reinefeld, Zuse Institute Berlin
Rolf Riesen, Intel Corporation

