[hpc-announce] FTXS @ SC21

Mon Jul 19 10:17:02 CDT 2021

CALL FOR PAPERS
11th Workshop on Fault-Tolerance for HPC at eXtreme Scale (FTXS 2021)

In conjunction with The International Conference for
High Performance Computing, Networking, Storage, and Analysis (SC21)
St. Louis, Missouri, USA, November 14 - 19, 2021
https://sites.google.com/view/ftxs2021
twitter.com/ftxsworkshop

Important Dates
* Submission of papers: August 27, 2021
* Author notification: September 27, 2021
* Camera-ready papers: TBA
* Workshop: November 14, 2021

Authors are invited to submit original papers on the research and practice of fault-tolerance in extreme-scale distributed systems (primarily HPC systems, but including grid and cloud systems). Resilience and fault-tolerance remain major concerns for supercomputing, and advances in this area are needed.  Therefore, we are broadly interested in forward-looking papers that seek to characterize and mitigate the impact of faults.

We are particularly interested in papers that address issues related to the following developments in extreme-scale systems:

* Storage Devices: The storage hierarchy on HPC systems continues to increase in depth and complexity. SSDs and NVMe add high-speed node-local (or rack-local) persistent storage that can be used to improve the performance of checkpoint/restart or otherwise facilitate application resilience. Continuing to efficiently exploit these devices remains critical for extreme-scale HPC systems. Moreover, the recent availability of Non-Volatile Memory Modules (NVMMs) has begun to blur the line between memory and storage. The implications of this blurring for fault tolerance on extreme-scale systems are still being explored.

* System Heterogeneity: Modern HPC systems increasingly include GPUs, FPGAs, and other types of accelerators. New networking devices like Data Processing Units (DPUs) and SmartNICs are also starting to be deployed. However, there are many resilience and fault tolerance issues associated with these devices that still need to be resolved. Papers at prominent recent conferences (including SC20, ICS 2019, and IEEE Cluster 2018) demonstrate that understanding the fault tolerance implications of heterogeneous compute devices is an important and active area of research.

* Computing Paradigms: Novel non-von Neumann computing paradigms, including quantum and neuromorphic computing, have attracted significant research interest. Recent publications demonstrate that understanding the fault tolerance implications of these computing paradigms is also an area of active research.

* Machine Learning: Algorithms that rely on elements of machine learning are becoming more and more prevalent on HPC systems.  Understanding how these algorithms react and respond to the frequency and variety of faults that occur on HPC systems is critical to ensuring that they continue to provide accurate and timely answers.

Additional topics of interest include, but are not limited to:

*   Algorithm-Based Fault Tolerance (ABFT) techniques to address undetected (silent) errors
*   Silent data corruption (SDC) detection / correction techniques
*   Novel fault-tolerance techniques and implementations
*   Failure data analysis and field studies
*   Power, performance, resilience (PPR) assessments / tradeoffs
*   Emerging hardware and software technology for resilience
*   Advances in reliability monitoring, analysis, and control of highly complex systems
*   Failure prediction, error preemption, and recovery techniques
*   Fault-tolerant programming models
*   Models for software and hardware reliability
*   Metrics and standards for measuring, improving, and enforcing effective fault-tolerance
*   Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations
*   Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, voltage, etc.)
*   Near-threshold-voltage implications and evaluations for reliability
*   Benchmarks and experimental environments including fault injection
*   Frameworks and APIs for fault-tolerance and fault management

PAPER SUBMISSIONS
Submissions are solicited in the following categories:

* Regular papers presenting innovative ideas improving the state of the art or discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation.

* Extended abstracts proposing disruptive ideas and challenging assumptions in the field, including some form of preliminary results.  Extended abstracts will be evaluated separately and given shorter oral presentations, but will NOT be published.

Submissions shall be sent electronically and must conform to IEEE conference proceedings style.  Regular papers should be at least six (6) pages but should not exceed ten (10) pages including all text, appendices, figures, and references.  Accepted regular papers that meet these requirements will be published in cooperation with IEEE TCHPC (subject to publisher conditions regarding the number of papers accepted by the workshop).  Extended abstracts should not exceed three (3) pages.  Extended abstracts will be posted on our website but will NOT be published.

WORKSHOP CHAIR
Scott Levy - Sandia National Laboratories

ORGANIZING COMMITTEE
Keita Teranishi - Sandia National Laboratories
John Daly - Laboratory for Physical Sciences

Questions? Contact Scott Levy (sllevy at sandia.gov).