[hpc-announce] Deadline Extension: Resilience at Euro-Par 2011

Christian Engelmann engelmannc at computer.org
Thu Jun 2 10:37:50 CDT 2011

Due to multiple requests, we have extended the paper submission deadline 
to June 24, 2011. We apologize if you receive multiple copies of this 
announcement.


    4th Workshop on Resiliency in High Performance Computing (Resilience)
                      in Clusters, Clouds, and Grids
                         in conjunction with the
          17th International European Conference on Parallel and
                  Distributed Computing (Euro-Par 2011)
             Bordeaux, France, August 29 - September 2nd, 2011

Clusters, Clouds, and Grids are three different computational paradigms 
with the intent or potential to support High Performance Computing 
(HPC). Currently, they consist of hardware, management, and usage models 
particular to different computational regimes. For example, high 
performance cluster systems designed to support tightly coupled 
scientific simulation codes typically utilize high-speed interconnects, 
whereas commercial cloud systems designed to support software as a 
service (SaaS) do not. However, in order to support HPC, all must at 
least utilize large numbers of resources, and hence effective HPC in any 
of these paradigms must address the issue of resiliency at large scale.

Recent trends in HPC systems have clearly indicated that future 
increases in performance, in excess of those resulting from improvements 
in single-processor performance, will be achieved through corresponding 
increases in system scale, i.e., using a significantly larger component 
count. As the raw computational performance of these HPC systems 
increases from today's tera- and peta-scale to next-generation 
multi-peta-scale capability and beyond, their number of computational, 
networking, and storage components will grow from the ten-to-one-hundred 
thousand compute nodes of today's systems to several hundreds of 
thousands of compute nodes and more in the foreseeable future. This 
substantial growth in system scale, and the resulting component count, 
poses a challenge for HPC system and application software with respect 
to fault tolerance and resilience.

Furthermore, recent experiences on extreme-scale HPC systems with 
non-recoverable soft errors, i.e., bit flips in memory, cache, 
registers, and logic, have added another major source of concern. The 
probability of such errors not only grows with system size, but also 
with increasing architectural vulnerability caused by employing 
accelerators, such as FPGAs and GPUs, and by shrinking nanometer 
technology. Reactive fault tolerance technologies, such as 
checkpoint/restart, are unable to handle high failure rates due to 
associated overheads, while proactive resiliency technologies, such as 
migration, simply fail as random soft errors can't be predicted. 
Moreover, soft errors may even remain undetected, resulting in silent 
data corruption.

Important Web sites:
Resilience 2011 at http://xcr.cenit.latech.edu/resilience2011
Euro-Par 2011 at http://europar2011.bordeaux.inria.fr

Prior workshop Web sites:
Resilience 2010 at http://xcr.cenit.latech.edu/resilience2010
Resilience 2009 at http://xcr.cenit.latech.edu/resilience2009
Resilience 2008 at http://xcr.cenit.latech.edu/resilience2008

Important dates:
Paper submission deadline on June 24, 2011
Notification deadline on July 12, 2011
Resilience Workshop on August 30, 2011
Euro-Par conference on August 29 - September 2nd, 2011
Camera ready deadline is after the workshop

Submission guidelines:
Authors are invited to submit papers electronically in English in PDF 
format via EasyChair at 
<https://www.easychair.org/conferences/?conf=resilience20110>. Submitted 
manuscripts should be structured as technical papers and may not exceed 
10 pages, including figures, tables and references, using Springer's 
Lecture Notes in Computer Science (LNCS) format at 
Submissions should include an abstract, keywords, and the e-mail address 
of the corresponding author. Papers not conforming to these guidelines may 
be returned without review. All manuscripts will be reviewed and will be 
judged on correctness, originality, technical strength, significance, 
quality of presentation, and interest and relevance to the conference 
attendees. Submitted papers must represent original unpublished research 
that is not currently under review for any other conference or journal. 
Papers not following these guidelines will be rejected without review 
and further action may be taken, including (but not limited to) 
notifications sent to the heads of the institutions of the authors and 
sponsors of the conference. Submissions received after the due date, 
exceeding the length limit, or not appropriately structured may also not 
be considered. The proceedings will be published in Springer's LNCS as 
post-conference proceedings. At least one author of an accepted paper 
must register for and attend the workshop for inclusion in the 
proceedings. Authors may contact the workshop program chair for more 
information.

Topics of interest include, but are not limited to:

Reports on current HPC system and application resiliency
HPC resiliency metrics and standards
HPC system and application resiliency analysis
HPC system and application-level fault handling and anticipation
HPC system and application health monitoring
Resiliency for HPC file and storage systems
System-level checkpoint/restart for HPC
System-level migration for HPC
Algorithm-based resiliency fundamentals for HPC (not Hadoop)
Fault tolerant MPI concepts and solutions
Soft error detection and recovery in HPC systems
HPC system and application log analysis
Statistical methods to identify failure root causes
Fault injection studies in HPC environments
High availability solutions for HPC systems
Reliability and availability analysis
Hardware for fault detection and recovery
Resource management for system resiliency and availability

General Co-Chairs:
Stephen L. Scott, Oak Ridge National Laboratory, USA
Chokchai (Box) Leangsuksun, Louisiana Tech University, USA

Program Chair:
Christian Engelmann, Oak Ridge National Laboratory, USA

Publication Co-Chairs:
James Brandt, Sandia National Laboratories, USA
Ann Gentile, Sandia National Laboratories, USA

Program Committee:
Vassil Alexandrov, Barcelona Supercomputing Center, Spain
David E. Bernholdt, Oak Ridge National Laboratory, USA
George Bosilca, University of Tennessee, USA
Jim Brandt, Sandia National Laboratories, USA
Patrick G. Bridges, University of New Mexico, USA
Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
Franck Cappello, INRIA/UIUC, France/USA
Kasidit Chanchio, Thammasat University, Thailand
Zizhong Chen, Colorado School of Mines, USA
Nathan DeBardeleben, Los Alamos National Laboratory, USA
Jack Dongarra, University of Tennessee, USA
Christian Engelmann, Oak Ridge National Laboratory, USA
Yung-Chin Fang, Dell, USA
Kurt B. Ferreira, Sandia National Laboratories, USA
Ann Gentile, Sandia National Laboratories, USA
Cecile Germain, University Paris-Sud, France
Rinku Gupta, Argonne National Laboratory, USA
Paul Hargrove, Lawrence Berkeley National Laboratory, USA
Xubin He, Virginia Commonwealth University, USA
Larry Kaplan, Cray, USA
Daniel S. Katz, University of Chicago, USA
Thilo Kielmann, Vrije Universiteit Amsterdam, Netherlands
Dieter Kranzlmueller, LMU/LRZ Munich, Germany
Zhiling Lan, Illinois Institute of Technology, USA
Chokchai (Box) Leangsuksun, Louisiana Tech University, USA
Xiaosong Ma, North Carolina State University, USA
Celso Mendes, University of Illinois at Urbana-Champaign, USA
Christine Morin, INRIA Rennes, France
Thomas Naughton, Oak Ridge National Laboratory, USA
George Ostrouchov, Oak Ridge National Laboratory, USA
DK Panda, The Ohio State University, USA
Mihaela Paun, Louisiana Tech University, USA
Alexander Reinefeld, Zuse Institute Berlin, Germany
Rolf Riesen, IBM Research, Ireland
Eric Roman, Lawrence Berkeley National Laboratory, USA
Stephen L. Scott, Oak Ridge National Laboratory, USA
Jon Stearley, Sandia National Laboratories, USA
Gregory M. Thorson, SGI, USA
Geoffroy Vallee, Oak Ridge National Laboratory, USA
Sudharshan Vazhkudai, Oak Ridge National Laboratory, USA

Dr. Christian Engelmann                        Phone: +1 (865) 574-3132
Research and Development Staff Member            Fax: +1 (865) 576-5491
Oak Ridge National Laboratory                    One Bethel Valley Road
mailto:engelmannc at computer.org                   P.O. Box 2008, MS-6173
http://www.christian-engelmann.info            Oak Ridge, TN 37831, USA

