[hpc-announce] The 4th Fault Tolerance for HPC at eXtreme Scale (FTXS) 2014 - June 23, 2014 - With DSN 2014 - Atlanta, GA, USA

Debardeleben, Nathan A ndebard at lanl.gov
Fri Jan 10 14:29:16 CST 2014

4th International Workshop on Fault-Tolerance for HPC at Extreme Scale
(FTXS 2014)

In conjunction with
The 44th Annual IEEE/IFIP International Conference on
Dependable Systems and Networks (DSN 2014)
Atlanta, Georgia, USA on June 23-26, 2014

For the HPC community, a new scaling in numbers of processing elements
has superseded the historical trend of Moore's Law scaling in processor
frequencies. This progression from single core to multi-core and
many-core will be further complicated by the community's imminent
migration from traditional homogeneous architectures to ones that are
heterogeneous in nature. As a consequence of these trends, the HPC
community is facing rapid increases in the number, variety, and
complexity of components, and must thus overcome increases in aggregate
fault rates, fault diversity, and the complexity of isolating root causes.

Recent analyses demonstrate that HPC systems experience simultaneous
(often correlated) failures. In addition, statistical analyses suggest
that silent soft errors can no longer be ignored, because the growth in
component counts, memory size, and data paths (including networks) makes
the probability of silent data corruption (SDC) non-negligible. The HPC
community has serious concerns regarding this issue and application
users are less confident that they can rely on a correct answer to their
computations. Other studies have indicated a growing divergence between
failure rates experienced by applications and rates seen by the system
hardware and software. At Exascale, some scenarios project failure rates
reaching one failure per hour. This conflicts with the current
checkpointing approach to fault tolerance that requires up to 30 minutes
to restart a parallel execution on the largest systems.  Lastly,
stabilization periods for the largest systems are already significant,
and the possibility that these could increase in length is of great
concern.  During the Approaching Exascale report at SC11, DOE program
managers identified resilience as a black swan - the most difficult
under-addressed issue facing HPC.

What does the fault-tolerance community need to do in order to be
prepared to face the challenges of extreme scale computing? What is
needed to keep applications with billions of threads of parallelism up
and running on systems that fail tens of times per day? As models
predict less than 50% efficiency of traditional checkpoint/restart
methods on future systems, are we ready to pay the cost of full
redundancy, effectively performing redundant multi-threading (RMT)
across entire systems? Do we even have the infrastructure necessary to
implement an RMT strategy?

How is the supercomputing community going to efficiently isolate
failures on enormously complex systems? Is it realistic to understand
these systems in such a way that some failures could be predicted with
enough accuracy and anticipation to trigger useful failure avoidance
actions? What can the community do to protect applications from SDC in
memory and logic? To what extent should users and programmers be
involved in managing faults? What are the most promising self-healing
numerical methods?  Is there an emerging framework for fault management
at extreme scale?

The goals of this workshop are to consider these complex questions, to
discuss the unique limitations that extreme scale and complexity impose
on traditional methods of fault-tolerance, and to explore new strategies
for dealing with those challenges.

Submissions are solicited in the following categories:
* Regular papers presenting innovative ideas improving the state of the art.
* Experience papers discussing the issues seen on existing extreme-scale
  systems, including some form of analysis and evaluation.
* Extended abstracts proposing disruptive ideas in the field, including
  some form of preliminary results.

Submissions shall be sent electronically, must conform to IEEE
conference proceedings style and should not exceed six pages including
all text, appendices, and figures.

Assuming hardware and software errors will be inescapable at extreme
scale, this workshop will consider aspects of fault tolerance peculiar
to extreme scale that include, but are not limited to:
* Quantitative assessments of cost in terms of power, performance, and
  resource impacts of fault-tolerant techniques, such as checkpoint
  restart, that are redundant in space, time or information
* Novel fault-tolerance techniques and implementations of emerging
  hardware and software technologies that guard against silent data
  corruption (SDC) in memory, logic, and storage and provide end-to-end
  data integrity for running applications
* Studies of hardware / software tradeoffs in error detection, failure
  prediction, error preemption, and
* Advances in monitoring, analysis, and control of highly complex systems
* Highly scalable fault-tolerant programming models
* Metrics and standards for measuring, improving and enforcing the need
  for and effectiveness of fault-tolerance
* Failure modeling and scalable methods of reliability, availability,
  performability and failure prediction for fault-tolerant HPC systems
* Scalable Byzantine fault tolerance and security from single-fault and
  fail-silent violations
* Benchmarks and experimental environments, including fault-injection
  and accelerated lifetime testing, for evaluating performance of
  resilience techniques under stress
* Frameworks and APIs for fault tolerance and fault management

Submission of papers: March 7th, 2014
Author notification: March 21st, 2014
Camera ready papers: April 2014
Workshop: June 23rd, 2014

Nathan DeBardeleben - Los Alamos National Laboratory
Franck Cappello - Argonne National Laboratory and the University of
  Illinois at Urbana-Champaign
Robert Clay - Sandia National Laboratories

Rob Aulwes - Los Alamos National Laboratory
Greg Bronevetsky - Lawrence Livermore National Laboratory
John Daly - Department of Defense
Christian Engelmann - Oak Ridge National Laboratory
Kurt Ferreira - Sandia National Laboratories
Ana Gainaru - University of Illinois at Urbana-Champaign
Leonardo Bautista Gomez - Tokyo Institute of Technology
Hideyuki Jitsumoto - The University of Tokyo
Zhiling Lan - Illinois Institute of Technology
Naoya Maruyama - Tokyo Institute of Technology
Kathryn Mohror - Lawrence Livermore National Laboratory
Bogdan Nicolae - IBM Research - Ireland
Rolf Riesen - IBM Research - Ireland
Yves Robert - ENS Lyon
Thomas Ropars - EPFL
Stephen Scott - Tennessee Tech University and Oak Ridge National Laboratory
Vilas Sridharan - AMD, Inc.
Abhinav Vishnu - Pacific Northwest National Laboratory
Roel Wuyts - Intel ExaScience Lab

See https://sites.google.com/site/ftxsworkshop/home/ftxs2014 and
http://2014.dsn.org/ for more information.