[hpc-announce] CFP: Resilience at Euro-Par 2021 - Papers due May 7

Thomas Naughton naughtont at ornl.gov
Wed Mar 24 10:25:30 CDT 2021


We apologize if you receive multiple copies of this call for papers.

--------------------------------------------------------------------------------


  14th Workshop on Resiliency in High Performance Computing (Resilience)
                   in Clusters, Clouds, and Grids
      <https://www.csm.ornl.gov/srt/conferences/Resilience/2021>

                        in conjunction with

  the 27th International European Conference on Parallel and Distributed
                 Computing (Euro-Par), Lisbon, Portugal
                        August 30 - September 3, 2021
                     <http://2021.euro-par.org>


Overview:

Resilience is a critical challenge as high performance computing (HPC) systems
continue to increase component counts, individual component reliability
decreases (for example, due to shrinking process technology and near-threshold
voltage (NTV) operation), hardware complexity increases (for example, due to
heterogeneous computing), and software complexity increases (for example, due
to complex data flows and workflows, real-time requirements, and integration of
artificial intelligence (AI) technologies with traditional applications).

Correctness and execution efficiency, in spite of faults, errors, and failures,
are essential to ensure the success of HPC systems, cluster computing
environments, Grid computing infrastructures, and Cloud computing services. The
impact of faults, errors, and failures in such HPC systems can range from
financial losses due to system downtime (sometimes several tens of thousands of
dollars per lost system-hour), to financial losses due to unnecessary
overprovisioning (acquisition and operating costs), to financial losses and
legal liabilities due to erroneous or delayed output.

The emergence of AI technology opens up new possibilities, but also new 
problems. Using AI technology for operational intelligence that enables 
resilience in HPC systems and centers is a complex control problem, while 
designing resilient AI technology for HPC applications is a difficult 
algorithmic problem. Resilience for HPC systems encompasses a wide spectrum of 
fundamental and applied research and development, including theoretical 
foundations, error/failure and anomaly detection, monitoring and control, 
end-to-end data integrity, enabling infrastructure, and resilient algorithms.

This workshop brings together experts in the community to further research and 
development in HPC resilience and to facilitate exchanges across the 
computational paradigms of extreme-scale HPC, cluster computing, Grid 
computing, and Cloud computing.

Submission Guidelines:

Authors are invited to submit papers electronically in English in PDF format. 
Submitted manuscripts should be structured as technical papers and be BETWEEN 10
AND 12 PAGES, including figures, tables and references, using Springer's 
Lecture Notes in Computer Science (LNCS) format at 
<http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>. Papers with
fewer than 10 or more than 12 pages will not be accepted due to publisher
guidelines. Submissions should include an abstract, keywords, and the e-mail
address of the corresponding author. Papers not conforming to these guidelines 
may be returned without review. All manuscripts will be reviewed and will be 
judged on correctness, originality, technical strength, significance, quality 
of presentation, and interest and relevance to the conference attendees. 
Submitted papers must represent original unpublished research that is not 
currently under review for any other conference or journal. Papers not 
following these guidelines will be rejected without review and further action 
may be taken, including (but not limited to) notifications sent to the heads of 
the institutions of the authors and sponsors of the conference. Submissions 
received after the due date or not appropriately structured may also not be 
considered. The proceedings will be published in Springer's LNCS as 
post-conference proceedings. At least one author of an accepted paper must 
register for and attend the workshop for inclusion in the proceedings. Authors 
may contact the workshop program chairs for more information.

Important websites:

- Resilience 2021 Website: 
<https://www.csm.ornl.gov/srt/conferences/Resilience/2021>
- Resilience 2021 Submissions: TBD
- Euro-Par 2021 website: <http://2021.euro-par.org>

Topics of interest include, but are not limited to:

- Theoretical foundations for resilience:
   - Metrics and measurement
   - Statistics and optimization
   - Simulation and emulation
   - Formal methods
   - Efficiency modeling and uncertainty quantification
   - Experience reports

- Error/failure/anomaly detection and reliability/dependability modeling:
   - Statistical analyses
   - Machine learning and artificial intelligence
   - Digital twins
   - Data collection and aggregation
   - Information visualization

- Monitoring and control for resilience:
   - Center, system and application monitoring and control
   - Reliability, availability, serviceability and performability
   - Tunable fidelity and quality of service
   - Automated response and recovery
   - Operational intelligence to enable resilience

- End-to-end integrity:
   - Fault tolerant design of centers, systems and applications
   - Forward migration and verification
   - Degraded operation
   - Error propagation, failure cascades, and error/failure containment
   - Testing and evaluation, including fault/error/failure injection

- Enabling infrastructure for resilience:
   - Reliability, availability, serviceability systems
   - System software and middleware
   - Resilience extensions for programming models
   - Tools and frameworks
   - Support for resilience in heterogeneous architectures

- Resilient algorithms:
   - Algorithmic detection and correction
   - Resilient solvers and algorithm-based fault tolerance
   - Fault tolerant numerical methods
   - Robust iterative algorithms
   - Resilient artificial intelligence

Important Dates:

- Workshop papers due: May 7, 2021 (23:59 AoE)
- Workshop author notification: July 16, 2021
- Workshop author registration: TBD
- Workshop date: August 30 or 31, 2021
- Workshop camera-ready papers: TBD

General Co-Chairs:

- Stephen L. Scott
   Tennessee Tech University, USA
   scottsl at ornl.gov
- Christian Engelmann
   Oak Ridge National Laboratory, USA
   engelmannc at ornl.gov

Program Co-Chairs:

- Ferrol Aderholdt
   Middle Tennessee State University, USA
   ferrol.aderholdt at mtsu.edu
- Thomas Naughton
   Oak Ridge National Laboratory, USA
   naughtont at ornl.gov


Workshop Chair Emeritus:

- Chokchai (Box) Leangsuksun
   Louisiana Tech University, USA
   box at latech.edu



  _________________________________________________________________________
   Thomas Naughton                                      naughtont at ornl.gov
   Research Associate                                   (865) 576-4184

