[hpc-announce] [CFP] [Deadline: Aug 18th] SuperCheck'23: Fourth International Symposium on Checkpointing for Supercomputing (in conjunction with SC'23)
Bogdan Nicolae
bogdan.nicolae at acm.org
Tue Aug 15 17:33:09 CDT 2023
SuperCheck at SC'23: Fourth International Symposium on Checkpointing for
Supercomputing
Workshop Website: https://supercheck.lbl.gov
CALL FOR PAPERS
---------------
The Fourth International Symposium on Checkpointing for Supercomputing
will be held November 12, 2023 in Denver, Colorado, USA, in
conjunction with SC23: The International Conference for High
Performance Computing, Networking, Storage and Analysis. This workshop
will feature the latest work in checkpoint/restart research, tools
development and production use.
About the Workshop
------------------
As a primary approach to fault-tolerant computing, Checkpoint/Restart
(C/R) is essential to a wide range of HPC communities. While there has
been much C/R research and tools development, continued C/R research
is indispensable to keep pace with ever-changing HPC architectures,
technologies, and workloads. More effort is also needed to narrow the
gap between proof-of-concept C/R research codes and production-quality
codes capable of deployment in real-world workloads. In this workshop,
we will bring together C/R researchers and tools developers,
practitioners, application developers, and end users to focus on C/R
research and successes in production use, motivating the development
of usable C/R tools, the closing of the gap between state-of-the-art
research and production, and the harnessing of the full benefits of
C/R for the HPC community. Paper submissions will be peer-reviewed,
and the accepted papers will be published with IEEE Computer Society.
We especially encourage PhD students and HPC end users to participate.
Background
----------
Checkpointing is widely used in high performance computing (HPC). It
involves capturing key states during the runtime of a distributed
application (checkpointing), which are reused later during runtime.
Initially widely applied in the HPC community for resilience purposes
(checkpoint periodically, roll back and restart the application from a
previously known correct state in case of failures), it has seen
increasing adoption in many other scenarios: suspend-resume
(checkpoint as a response to an event, such as a reservation running
out of time or a job being preempted to make room for another job,
then resume at a later time when more resources are available),
migration (checkpoint on one machine, restart on another, potentially
on different hardware), and debugging (checkpoint close to a problematic
region of code and replay that region multiple times instead of
starting from the beginning). More recently, with an increasing
convergence of HPC, big data analytics and machine learning,
checkpointing is becoming an essential pattern in allowing
applications to progress with their computations. For example, it is
used to communicate states between tasks in a workflow, to revisit
previous states (e.g. adjoint computations), or to explore alternative
directions starting from a common ancestor (e.g. checkpoint models
and/or training data to explore variations of architecture and/or
training paths).
On the other hand, checkpointing is challenging: states are
distributed, which means the checkpoints require coordination to
capture globally consistent states; they incur high I/O overheads due
to their size and competition for I/O bandwidth; and they can be either
explicitly defined by users or transparently determined at the
system level. With increasing scale and heterogeneity of
supercomputing architectures, both from a computational and I/O
perspective, such challenges are becoming even more difficult to
overcome.
As a consequence, there is a need to form a community around this
essential yet difficult-to-address topic that is currently underserved
in the HPC community. This workshop aims to fill the aforementioned
gap. It encourages interaction and cross-pollination between
application developers that have both traditional and novel use cases
for checkpointing, researchers that develop checkpointing approaches
and runtimes/middlewares at all levels (system-level,
application-level, transparent, hybrid), storage and I/O experts that
need to manage massive data sizes generated by checkpointing,
and architecture experts that need to provide means of capturing the
state of devices and other subsystems (which are needed in addition to
user-level in-memory data structures). In this context, it aims to
become a forum where participants can (1) underline challenges,
opportunities and solutions for novel research directions; (2) share
their experience and best practices for production-runs; (3) engage in
co-design activities (users learn about approaches and new
capabilities of runtimes and middlewares, runtime developers learn
about the needs of users).
Workshop Scope
--------------
The workshop scope includes, but is not limited to:
- Application-level checkpointing: APIs to define critical states,
techniques to capture critical states (e.g. efficient serialization)
- Transparent/system-level checkpointing: techniques to capture state
of devices and accelerators (CPUs, GPUs, network interfaces, etc)
- I/O and storage solutions that leverage heterogeneous storage to
persist checkpoints at scale
- Checkpoint size reduction techniques (compression, deduplication)
- Alternative techniques that avoid persisting checkpoints to storage
(e.g. erasure coding)
- Synchronous vs. asynchronous checkpointing strategies
- Multi-level and hybrid strategies combining application-level,
system-level, transparent checkpointing on heterogeneous hardware
- Application-specific techniques combined with checkpointing (e.g. ABFT)
- Performance evaluation and reproducibility, study of real failures
and their recovery
- Research on optimal checkpointing interval, C/R-aware job scheduling
and resource management
Furthermore, contributions on C/R use in production are also welcome:
- Experience with traditional use cases of checkpointing on novel platforms
- New use cases of checkpointing beyond resilience
- Support on HPC systems (e.g., resource scheduling, system
utilization, batch system integration, best practice, etc.)
We offer two tracks of paper submissions within the workshop:
research and production. For the production track, we broaden the
definition of novelty for our workshop, to include the work of
incorporating novel research results into practice, resulting in a
real-life impact.
Submission Guidelines
---------------------
We invite authors to submit their original, high-quality work in the
following categories:
(a) Regular papers:
Intended for submissions describing original work and ideas that have
NOT appeared in another conference or journal, and are NOT currently
under review for any other conference or journal. Both research and
production tracks can submit regular papers. Regular paper submissions
must be at least six (6) pages and must not exceed eight (8) pages in
the IEEE format. The page limit will be increased to 10 for accepted
submissions.
Accepted regular papers (subject to post-review revisions) will be
published in the workshop proceedings in cooperation with IEEE
Computer Society.
(b) Short papers:
Intended for material that is not mature enough for a full paper,
allowing authors to present novel, interesting ideas or preliminary
results that will be formally submitted elsewhere later. Short papers
are also for authors sharing their new efforts on adopting C/R tools
in production use. Short paper submissions must not exceed two (2)
pages in the IEEE format. The page limit will be increased to 3 for
accepted submissions.
Accepted short papers will NOT be included in the workshop proceedings
published with the IEEE Computer Society; instead they will be
published on arXiv. We will provide links to those short papers on
arXiv on our workshop website, as we did for our previous workshop.
Note that the page limits above include figures and tables, but do
not include references, for which there is no page limit.
All submissions should be made electronically through the SC23
submission website and must follow the IEEE format. Submissions must
be double blind, i.e., authors should remove their names, institutions
or hints found in references to earlier work. When discussing past
work, they need to refer to themselves in the third person, as if they
were discussing another researcher’s work. Furthermore, authors can
identify any conflict of interest with the program committee members
(reviewers) at the SC23 submission site after their papers are
submitted (using the “My Conflicts” tab).
(c) Lightning talks:
In addition to the paper categories above, which require new and
unpublished work, authors can submit a short abstract (no more than
250 words) for a 5-minute lightning talk, for which both previously
published and unpublished work are welcome. Lightning talks are meant
to help the HPC community stay informed about existing C/R
libraries and tools, C/R needs, support, approaches, and challenges in
HPC applications and workflows, and to share experience on adopting
C/R tools and libraries in production. They are also for authors to
share ideas or proposals on addressing challenges in C/R to enable C/R
on fast-changing HPC architectures and workloads and to generate
real-life impacts. Authors will use the same SC23 submission website
(selecting Lightning Talks for the Submission Track option). The
workshop organizers will review the submissions based on the quality
of work and relevance to the intended purposes of the lightning talks.
The accepted abstracts will be made available on the SuperCheck-SC23
website.
Reproducibility Initiative
--------------------------
While an Artifact Description (AD) Appendix and the Artifact
Evaluation (AE) are optional, we encourage authors to follow the SC23
reproducibility and transparency initiative.
Important Dates
---------------
Paper Submission Deadline: August 18, 2023 AOE
Author Notification: September 8, 2023 AOE
Workshop Ready Deadline: September 29, 2023 AOE
Presentation Slides and Recordings Deadline: November 1, 2023 AOE
Workshop @SC23: Sunday, November 12, 1:30-5:00 pm (Mountain Standard Time)
Submissions are accepted at the SC23 Submissions Website:
https://submissions.supercomputing.org
Organizing Committee
--------------------
Gene Cooperman, Northeastern University, USA
Donglai Dai, X-Scale Solutions, USA
Rebecca Hartman-Baker, National Energy Research Scientific Computing
Center at Lawrence Berkeley National Laboratory (NERSC at LBNL), USA
Bogdan Nicolae, Argonne National Laboratory, USA
Program Committee
-----------------
Kapil Arya, Azure Systems Research, USA
Franck Cappello, Argonne National Laboratory (ANL), USA
Rohan Garg, Nutanix Inc, USA
Anjus George, Oak Ridge National Laboratory (ORNL), USA
Alfredo Goldman, University of São Paulo, Brazil
Twinkle Jain, Intel, USA
Jack Kosaian, NVIDIA, USA
Preeti Malakar, Indian Institute of Technology Kanpur, India
Rafael Mayo-García, CIEMAT, Spain
Dejan Milojicic, Hewlett Packard Labs, USA
Dhabaleswar K. (DK) Panda, Ohio State University, USA
Yves Robert, ENS Lyon, France
Kento Sato, RIKEN, Japan
Martin Schulz, Technical University Munich, Germany
Osman Unsal, Barcelona Supercomputing Center (BSC), Spain
Orcun Yildiz, Argonne National Laboratory, USA
--
Bogdan Nicolae
Computer Scientist
Argonne National Laboratory
Web: www.bnicolae.net
on behalf of the co-chairs