[hpc-announce] The 1st Workshop on Enhancing MPI for Resilience (EMPIRe): Call for Papers

Halim Amer aamer at anl.gov
Thu Jun 15 17:22:59 CDT 2017


======================================================================
      The 1st Workshop on Enhancing MPI for Resilience (EMPIRe)
              https://icl.utk.edu/workshops/EMPIRe2017
                        In conjunction with
   The 24th European MPI Users' Group Meeting, September 25-28, 2017
                https://www.mcs.anl.gov/eurompi2017
======================================================================

Overview
--------
The continuing trend in hardware architectures towards smaller, more
efficient and more cost-effective components, together with the
increase in scale of systems for computational science, opens the door
to a deeper and clearer understanding of the physical phenomena
governing our surroundings. On the other hand, the reduction in
feature sizes due to improvements in photolithography, combined with a
growing number of components and the volatility of computational
resources on some types of platforms, leads, from the application
perspective, to a decrease in the mean time to failure. Failures
manifest on all types and scales of execution platforms with
consistently dramatic consequences: the loss of data, computations and
results. Most of the parallel programming paradigms and runtimes used
in the high performance computing field have been impermeable to
notions of resilience, and provide little support for programmatically
dealing with any type of fault. Moreover, solutions widely used in
industry have been slow to make their way into the high performance
computing field.

This workshop targets cross-cutting research into resilience, to
ensure that scientific computations deliver correct results in a
timely manner on all execution platforms. Its scope is to highlight
the complexity of faults and isolate some of their root causes, to
investigate solutions that address the natural increase in fault
diversity and rates, and to provide efficient and portable solutions
that encompass all types of parallel execution platforms, runtimes
and applications. While the main focus of this workshop is on message
passing programming paradigms, we welcome other solutions that are
not bound to a particular parallel programming model or runtime
system, ideally portable across programming languages and paradigms
and capable of delivering on their promises at all execution platform
sizes.

Topics of interest
------------------
- Failure detection, prediction and characterization
- Checkpoint/Restart: optimal checkpoint interval, lossy compression
   of checkpoints, application level interface (SCR, FTI)
- SDC (silent data corruption) detection: predictors, auxiliary
   methods and recovery, ABFT
- Resilient software stack: Global OS, file system, runtimes (MPI,
   ULFM)
- Algorithms: Resilient numerical methods
- Methodology: failure and SDC injectors, detection (recall,
   precision) and reliability metrics
- Models for fault prediction, impact, management and application
   costs
- Resource management for system resiliency and availability
- Naturally fault tolerant, self-healing, or fault oblivious
   scientific algorithms
- Programming model and system software support for scalability and
   resilience

Submission Guidelines
---------------------
Authors are invited to submit manuscripts in English, structured as
technical papers not exceeding 10 letter-size (8.5in x 11in) pages or
as short papers limited to 4 pages of the same format, using the ACM
2017 Template. As with the main conference, EuroMPI/USA 2017, the
page limit includes figures, tables, and appendices, but does not
include references, for which there is no page limit. Margins and font
sizes should not be modified. Authors should submit their work through
the EMPIRe Submission Site.

In collaboration with the main conference, EuroMPI/USA 2017, authors
of selected papers will be invited to submit revised and extended
versions to be considered for inclusion in an invitation-only special
issue of the Elsevier Parallel Computing journal. The extended version
of the paper must have at least 30% additional content compared to the
version published at EuroMPI/USA 2017.

Important Dates
---------------
- Full paper submission:      July 01, 2017 AoE (firm)
- Notification of acceptance: July 10, 2017
- Final paper submission:     July 20, 2017
- Workshop/conference early registration: TBD
- Workshop: September 25, 2017

-- 
Halim
www.mcs.anl.gov/~aamer

