itaps-parallel FW: MPI Analysis and Reliability projects

Mark Shephard shephard at scorec.rpi.edu
Sat Mar 13 07:28:04 CST 2010


Steve,

We have been participating in some of the exascale workshops and looking 
at various tools that will be needed on the new machines as the on-node 
parallelism increases by a factor of 100 to 1000. It is clear that 
straight MPI is not likely to be the answer.

Most think it will be MPI, or something MPI-like, between nodes and 
something else on the node. Ken Jansen has already been looking at other 
on-node modes of parallelism.

Dealing with fault tolerance is also a substantial issue as we go 
forward, as discussed in the LLNL overview. I think tracking the work the 
LLNL team is proposing is a good idea. However, at least for some time 
into the future we need to focus our attention on getting the 3-D 
capabilities in place and increasing their scalability within the 
current MPI model. From there we can start working on taking better 
advantage of the multiple cores per node with more than MPI.
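
To make that last point concrete, here is a minimal sketch of the hybrid 
pattern being discussed, with MPI between nodes and threads within a 
node. OpenMP is only one possible on-node model and is used here purely 
for illustration; this is a sketch of the pattern, not code from any of 
the projects discussed.

  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      /* Request thread support so OpenMP threads can coexist with MPI. */
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

      int rank, nranks;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      /* One MPI rank per node; the cores on the node are driven by
         threads rather than by additional ranks.                    */
      double local = 0.0;
      #pragma omp parallel reduction(+:local)
      {
          local += omp_get_thread_num() + 1.0;   /* placeholder work */
      }

      /* Message passing is still used between nodes. */
      double global = 0.0;
      MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                 MPI_COMM_WORLD);

      if (rank == 0)
          printf("global sum over %d ranks = %f\n", nranks, global);

      MPI_Finalize();
      return 0;
  }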

As you know, Fabien and Fan are working on all the pieces of the 
infrastructure to support the 3-D code. In addition, to effectively 
support the ability to start from one 2-D plane mesh, build the other 
planes, and create the 3-D mesh, we have designed extensions to FMDB 
that Seegyoung is currently implementing.
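
For illustration only (the structures and names below are hypothetical 
and are not FMDB's API), the plane-replication idea amounts to copying 
the vertices of the 2-D plane mesh onto each plane and connecting 
matching triangles in adjacent planes into wedge elements:

  /* Hypothetical, simplified structures -- not the FMDB data model. */
  typedef struct { double x, y; }    Vtx2D;
  typedef struct { double x, y, z; } Vtx3D;
  typedef struct { int v[3]; } Tri;     /* triangle: 3 vertex ids      */
  typedef struct { int v[6]; } Wedge;   /* prism between two triangles */

  /* Replicate one 2-D plane mesh onto nPlanes planes, dz apart, and
     connect matching triangles in adjacent planes into wedges.       */
  void build_3d_from_plane(const Vtx2D *v2, int nv, const Tri *t2, int nt,
                           int nPlanes, double dz,
                           Vtx3D *v3  /* size nv*nPlanes     */,
                           Wedge *w3  /* size nt*(nPlanes-1) */)
  {
      for (int p = 0; p < nPlanes; ++p)
          for (int i = 0; i < nv; ++i) {
              v3[p*nv + i].x = v2[i].x;
              v3[p*nv + i].y = v2[i].y;
              v3[p*nv + i].z = p * dz;                /* plane offset */
          }

      for (int p = 0; p < nPlanes - 1; ++p)
          for (int t = 0; t < nt; ++t)
              for (int k = 0; k < 3; ++k) {
                  w3[p*nt + t].v[k]     = p*nv     + t2[t].v[k]; /* lower */
                  w3[p*nt + t].v[k + 3] = (p+1)*nv + t2[t].v[k]; /* upper */
              }
  }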

Mark

Stephen C. Jardin wrote:
> Fabien,
> 
>  
> 
> I got this email from LLNL regarding the development of new parallel 
> programming models (beyond MPI).   Since our use of MPI is largely 
> hidden within your utilities, I wonder if your team has any interest in 
> this?
> 
>  
> 
> Regards,
> 
>  
> 
> -steve
> 
>  
> 
> ------------------------------------------------------------------------
> 
> *From:* Greg Bronevetsky [mailto:bronevetsky1 at llnl.gov]
> *Sent:* Friday, March 12, 2010 5:25 PM
> *To:* Stephen C. Jardin
> *Subject:* MPI Analysis and Reliability projects
> 
>  
> 
> Steve,
> I am putting together a proposal to the Department of Energy on 
> application reliability. We're looking to make connections to 
> application teams who may find such tools useful, and I'd like to talk 
> to you about whether our proposed tool would be useful to you and, if 
> so, how we may be able to use CEMM to guide and evaluate our work. A 
> short summary of the proposed project is below. I look forward to 
> talking to you!
> 
> Greg Bronevetsky
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky at llnl.gov
> http://greg.bronevetsky.com
> 
> 
> _MPI Analysis_
> Future petascale and exascale hardware is expected to be significantly 
> more complex than current machines, forcing developers to re-implement 
> their current MPI applications in new parallel programming models such 
> as UPC and Chapel. This project will focus on an easier, more pragmatic 
> approach by developing a compiler infrastructure for MPI applications 
> that will use developer-specified annotations and a novel compiler 
> analysis to transform applications to run efficiently on future 
> hardware. In effect, the addition of compiler support will give MPI 
> many of the benefits of newer programming models without the drawback 
> of costly re-implementation.
>         Some of our prior work on this topic:
> 
>     * Complex compiler transformations of MPI code:
>           o "Optimizing Transformations of Stencil Operations for
>             Parallel Cache-based Architectures" (
>             http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.8903&rep=rep1&type=pdf
>             )
>           o "Using MPI Communication Patterns to Guide Source Code
>             Transformations" (
>             http://www.springerlink.com/content/04vuh553j30p12n8)
>     * Performance-oriented extensions to MPI: "Implementation and
>       Performance Analysis of Non-Blocking Collective Operations for
>       MPI" ( http://www.unixer.de/publications/img/hoefler-sc07.pdf)
>     * Topology-sensitive compiler analysis: "Communication-Sensitive
>       Static Dataflow for Parallel Message Passing Applications" (
>       http://greg.bronevetsky.com/papers/2009CGO.pdf)
> 
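> As a concrete illustration of the non-blocking collectives mentioned 
> above, here is a minimal sketch (not code from the cited work) of 
> overlapping a global reduction with independent local computation, 
> using MPI_Iallreduce (a non-blocking all-reduce). This overlap is the 
> kind of restructuring such compiler support could help automate.
> 
>   #include <mpi.h>
> 
>   /* Overlap a global reduction with independent local work. */
>   void overlapped_update(double *u, int n, double local_sq,
>                          double *global_sq, MPI_Comm comm)
>   {
>       MPI_Request req;
> 
>       /* Start the reduction, but do not wait for it yet. */
>       MPI_Iallreduce(&local_sq, global_sq, 1, MPI_DOUBLE, MPI_SUM,
>                      comm, &req);
> 
>       /* Local work that does not depend on the reduced value
>          proceeds while the collective is in flight.          */
>       for (int i = 0; i < n; ++i)
>           u[i] *= 0.5;                  /* placeholder computation */
> 
>       /* Block only when the reduced value is actually needed. */
>       MPI_Wait(&req, MPI_STATUS_IGNORE);
>   }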
> 
> _Application Reliability_
> HPC systems are growing increasingly vulnerable to soft faults. Caused 
> by physical phenomena such as radiation, high heat, and voltage 
> fluctuations, these faults corrupt application data and the results of 
> computations. Unfortunately, little is known about the effect of these 
> faults on applications, and today the only tools for measuring 
> application reliability are either very expensive (e.g. neutron beam 
> experiments) or lack physical accuracy. Our project will build a new 
> fault injection tool that will eliminate this tradeoff by providing 
> both high performance and physical accuracy. It will allow users to 
> analyze the effect of soft faults on their applications and identify 
> the portions of their code that are most critical to the reliability 
> of the entire application and thus need to be explicitly hardened 
> against errors.
>         Some of our prior work on this topic:
> 
>     * Application vulnerability to faults: "Soft Error Vulnerability of
>       Iterative Linear Algebra Methods" (
>       http://greg.bronevetsky.com/papers/2008ICS.pdf)
>     * System reliability analysis: "Terrestrial-Based Radiation Upsets:
>       A Cautionary Tale" (
>       http://www.rasr.lanl.gov/RadEffects/publications/quinn_terr_final.pdf
>       )
>     * Gate-level fault injection tools: "The STAR-C Truth: Analyzing
>       Reconfigurable Supercomputing" (
>       http://www.fermat.ece.vt.edu/old_site/Publications/online-papers/Nano/FCCM06.pdf
>       )
>     * Microarchitectural fault modeling: "Accurate
>       Microarchitecture-Level Fault Modeling for Studying Hardware
>       Faults" ( http://rsim.cs.illinois.edu/Pubs/09HPCA-Li.pdf)
>     * More information about the topic and techniques to detect/correct
>       errors in applications:
>       http://greg.bronevetsky.com/CS717FA2004/Lectures.html
> 
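> A minimal sketch of the general idea of software fault injection, 
> flipping one random bit in an array to mimic a soft fault in 
> application data (an illustration of the concept only, not the 
> proposed tool, which aims for physically accurate fault models):
> 
>   #include <stdlib.h>
> 
>   /* Flip one randomly chosen bit in a buffer of nbytes bytes,
>      mimicking a single-event upset in application data.      */
>   void inject_bit_flip(void *data, size_t nbytes)
>   {
>       size_t byte = (size_t)rand() % nbytes;
>       int    bit  = rand() % 8;
> 
>       unsigned char *p = (unsigned char *)data;
>       p[byte] ^= (unsigned char)(1u << bit);
>   }
> 
>   /* Example use: corrupt a solution vector, then rerun or check the
>      solver to see whether the fault is masked, detected, or fatal:
>          inject_bit_flip(x, n * sizeof(double));                     */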

