[hpc-announce] FTXS 2024 @ SC24: Call for papers (Deadline extended to Aug 8)

Thu Jul 25 12:09:30 CDT 2024

CALL FOR PAPERS
14th Workshop on Fault-Tolerance for HPC at eXtreme Scale (FTXS 2024)

In conjunction with The International Conference for
High Performance Computing, Networking, Storage, and Analysis (SC24)
Atlanta, Georgia, USA, November 17 - 22, 2024
https://sites.google.com/view/ftxs2024
twitter.com/ftxsworkshop

Important Dates
* Submissions open: July 1, 2024 (Submissions now open!)
* Submission of papers: August 1, 2024 (EXTENDED to August 8, 2024)
* Author notification: September 5, 2024
* Camera-ready papers: September 27, 2024
* Workshop: November 22, 2024

Featured Speaker: Karthik Pattabiraman, University of British Columbia
"Error-Resilient Machine Learning for HPC: Challenges and Opportunities"

Authors are invited to submit original papers on the research and practice of fault-tolerance in
extreme-scale distributed systems (primarily HPC systems, but including grid and cloud systems).
Resilience and fault-tolerance remain major concerns for supercomputing, and advances in this area
are needed.  Therefore, we are broadly interested in forward-looking papers that seek to
characterize and mitigate the impact of faults.

We are particularly interested in papers that address issues related to the following developments
in extreme-scale systems:

* Artificial Intelligence and Machine Learning (AI/ML): Significant research has recently been
  published (including at SC23) on how AI/ML can be leveraged to improve the performance of
  extreme-scale systems. In the context of fault tolerance and resilience, AI/ML applications
  have the potential to exhibit novel failure modes during both training and inference.
  Additionally, AI/ML may help to mitigate failures either by predicting when and where failures
  may occur or by reducing the impact of failures that do occur. Our understanding of AI/ML
  along these two dimensions of fault tolerance is developing rapidly and is an important area
  of research.

* System Heterogeneity: Modern HPC systems increasingly include GPUs, FPGAs, and other types
  of accelerators. New networking devices like Data Processing Units (DPUs) and SmartNICs are
  also starting to be deployed. However, there are many resilience and fault tolerance issues
  associated with these devices that still need to be resolved. Papers at prominent recent
  conferences (including SC20, ICS 2019, and IEEE Cluster 2018) demonstrate that understanding
  the fault tolerance implications of heterogeneous compute devices is an important and active
  area of research.

* Computing Paradigms: Novel non-von Neumann computing paradigms, including quantum and
  neuromorphic computing, have attracted significant research interest. Recent publications
  demonstrate that understanding the fault tolerance implications of these computing paradigms is
  also an area of active research.

* Machine Learning: Algorithms that rely on elements of machine learning are becoming more and
  more prevalent on HPC systems.  Understanding how these algorithms react and respond to the
  frequency and variety of faults that occur on HPC systems is critical to ensuring that they continue to
  provide accurate and timely answers.

Additional topics of interest include, but are not limited to:

*   Algorithm-Based Fault Tolerance (ABFT) techniques to address undetected (silent) errors
*   Silent data corruption (SDC) detection / correction techniques
*   Novel fault-tolerance techniques and implementations
*   Failure data analysis and field studies
*   Power, performance, resilience (PPR) assessments / tradeoffs
*   Emerging hardware and software technology for resilience
*   Advances in reliability monitoring, analysis, and control of highly complex systems
*   Failure prediction, error preemption, and recovery techniques
*   Fault-tolerant programming models
*   Models for software and hardware reliability
*   Metrics and standards for measuring, improving, and enforcing effective fault-tolerance
*   Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations
*   Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, etc.)
*   Near-threshold-voltage implications and evaluations for reliability
*   Benchmarks and experimental environments including fault injection
*   Frameworks and APIs for fault-tolerance and fault management

PAPER SUBMISSIONS
Submissions are solicited in the following categories:
* Regular papers presenting innovative ideas improving the state of the art or discussing the issues
  seen on existing extreme-scale systems, including some form of analysis and evaluation.   Regular
  papers should not exceed ten (10) pages including all text, appendices, and figures, but excluding
  references.
* Extended abstracts presenting preliminary results, proposing disruptive ideas, or challenging
  assumptions in the field.  The inclusion of some form of preliminary results is encouraged.
  Extended abstract papers should not exceed four (4) pages, including all text, appendices, and figures,
  but excluding references.  Extended abstracts will be evaluated separately and given shorter oral
  presentations.  Given minimum publication requirements imposed by SC24, extended abstracts WILL NOT be
  published.

Submissions must be made electronically at https://submissions.supercomputing.org and must conform
to IEEE conference proceedings style.  IEEE templates are available at:
https://www.ieee.org/conferences/publishing/templates.html

We do not have an upper limit on the number of papers that we will accept.  We will make every
effort to ensure that every high-quality submission is included in our workshop.

PUBLICATION
Subject to publisher constraints, we will publish all submissions accepted for inclusion in the
workshop.  The workshop has been approved to have its accepted papers included in the SC Workshop
Proceedings.

REPRODUCIBILITY
Reproducibility is an important component of extreme-scale system research.  However, the goal of
our workshop is to encourage and facilitate discussion of novel approaches and preliminary results.
As a result, it may not always be feasible to release reproducibility artifacts.  Therefore, while
we encourage authors to make their work as public and reproducible as possible, we do not explicitly
require it.

WORKSHOP CHAIRS
Scott Levy - Sandia National Laboratories
Bo Fang - Pacific Northwest National Laboratory

ORGANIZING COMMITTEE
Keita Teranishi - Sandia National Laboratories
John Daly - Laboratory for Physical Sciences

Questions? Contact Scott Levy (sllevy at sandia.gov) or Bo Fang (bo.fang at pnnl.gov)