[hpc-announce] FTXS 2025 @ ICPP 2025: Call for papers

Mon Apr 14 16:25:07 CDT 2025

CALL FOR PAPERS
15th Workshop on Fault-Tolerance for HPC at eXtreme Scale (FTXS 2025)

In conjunction with the International Conference on Parallel Processing (ICPP 2025)
San Diego, California, USA, September 8-11, 2025
https://sites.google.com/view/ftxs2025
twitter.com/ftxsworkshop

Important Dates
* Submissions open: TBD
* Submission of papers: mid-June 2025 (actual deadline will be announced soon)
* Author notification: TBD
* Camera-ready papers: July 31, 2025
* Workshop: September 2025

Authors are invited to submit original papers on the research and practice of 
fault-tolerance in extreme-scale distributed systems (primarily HPC systems, 
but including grid and cloud systems). Resilience and fault-tolerance remain 
major concerns for supercomputing, and advances in this area are needed.
Therefore, we are broadly interested in forward-looking papers that seek to 
characterize and mitigate the impact of faults.

We are particularly interested in papers that address issues related to the 
following developments in extreme-scale systems:

* Artificial Intelligence and Machine Learning (AI/ML): Significant research 
has recently been published (including at SC23) on how AI/ML can be 
leveraged to improve the performance of extreme-scale systems. In the 
context of fault tolerance and resilience, AI/ML applications have the 
potential to exhibit novel failure modes during both training and inference. 
Additionally, AI/ML may help to mitigate failures either by predicting when 
and where failures may occur or by reducing the impact of failures that do 
occur. Our understanding of AI/ML along these two dimensions of fault 
tolerance is developing rapidly and is an important area of research.

* System Heterogeneity: Modern HPC systems increasingly include GPUs, 
FPGAs, and other types of accelerators. New networking devices like Data 
Processing Units (DPUs) and SmartNICs are also starting to be deployed. 
However, there are many resilience and fault tolerance issues associated 
with these devices that still need to be resolved. Papers at prominent recent
conferences demonstrate that understanding the fault tolerance implications 
of heterogeneous compute devices is an important and active area of research.

* Computing Paradigms: Novel non-von Neumann computing paradigms, including 
quantum and neuromorphic computing, have attracted significant research interest. 
Recent publications demonstrate that understanding the fault tolerance implications 
of these computing paradigms is also an area of active research.

* Machine Learning: Algorithms that rely on elements of machine learning are becoming 
increasingly prevalent on HPC systems. Understanding how these algorithms respond 
to the frequency and variety of faults that occur on HPC systems is critical to
ensuring that they continue to provide accurate and timely answers.

Additional topics of interest include, but are not limited to:

*   Algorithm-Based Fault Tolerance (ABFT) techniques to address undetected (silent) errors
*   Silent data corruption (SDC) detection / correction techniques
*   Novel fault-tolerance techniques and implementations
*   Failure data analysis and field studies
*   Power, performance, resilience (PPR) assessments / tradeoffs
*   Emerging hardware and software technology for resilience
*   Advances in reliability monitoring, analysis, and control of highly complex systems
*   Failure prediction, error preemption, and recovery techniques
*   Fault-tolerant programming models
*   Models for software and hardware reliability
*   Metrics and standards for measuring, improving, and enforcing effective fault-tolerance
*   Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations
*   Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, etc.)
*   Near-threshold-voltage implications and evaluations for reliability
*   Benchmarks and experimental environments including fault injection
*   Frameworks and APIs for fault-tolerance and fault management

PAPER SUBMISSIONS
Page limits will be finalized shortly. We hope to accept both regular papers (~10 pages) and 
extended abstracts (~4 pages).

Papers must be submitted electronically at https://ssl.linklings.net/conferences/icpp and must conform to the ACM sigconf style (https://www.acm.org/publications/proceedings-template).
We do not have an upper limit on the number of papers that we will accept. We will make every
effort to include every high-quality submission in our workshop.

WORKSHOP CHAIRS
Scott Levy - Sandia National Laboratories
Bo Fang - Pacific Northwest National Laboratory

ORGANIZING COMMITTEE
Keita Teranishi - Sandia National Laboratories
John Daly - Laboratory for Physical Sciences

Questions? Contact Scott Levy (sllevy at sandia.gov) or Bo Fang (bo.fang at pnnl.gov)