[mpich-discuss] Announcing the Release of MVAPICH2 1.7
Dhabaleswar Panda
panda at cse.ohio-state.edu
Sat Oct 15 00:17:43 CDT 2011
This release might be of interest to some MPICH users, so I am
posting it here.
Thanks,
DK
---------- Forwarded message ----------
Date: Fri, 14 Oct 2011 23:08:13 -0400 (EDT)
From: Dhabaleswar Panda <panda at cse.ohio-state.edu>
To: mvapich-discuss at cse.ohio-state.edu
Cc: Dhabaleswar Panda <panda at cse.ohio-state.edu>
Subject: [mvapich-discuss] Announcing the Release of MVAPICH2 1.7
The MVAPICH team is pleased to announce the release of MVAPICH2-1.7
and OSU Micro-Benchmarks (OMB) 3.4.
The complete set of Features, Enhancements, and Bug Fixes for MVAPICH2
1.7 (since the MVAPICH2-1.6 release) is listed here.
- Based on MPICH2-1.4.1p1
- Integrated Hybrid (UD-RC/XRC) design to get best performance
on large-scale systems with reduced/constant memory footprint
- CH3 shared memory channel for standalone hosts
(including laptops) without any InfiniBand adapters
- HugePage support
- Improved intra-node shared memory communication performance
- Shared memory backed windows for One-Sided Communication
- Support for truly passive locking for intra-node RMA in shared
memory and LIMIC-based windows (see the passive-locking sketch
after this list)
- Improved on-demand InfiniBand connection setup (CH3 and RoCE)
- Tuned RDMA Fast Path Buffer size to get better performance
with less memory footprint (CH3 and Nemesis)
- Support for large data transfers (>2GB) (see the large-message
sketch after this list)
- Integrated with enhanced LiMIC2 (v0.5.5) to support intra-node
large message (>2GB) transfers
- Optimized Fence synchronization (with and without
LIMIC2 support)
- Automatic intra-node communication parameter tuning
based on platform
- Efficient connection set-up for multi-core systems
- Enhanced designs and tuning for collectives
(bcast, reduce, barrier, gather, allreduce, allgather,
gatherv, allgatherv and alltoall)
- Support for shared-memory collectives for modern clusters
with up to 64 cores/node
- MPI_THREAD_SINGLE provided by default and
MPI_THREAD_MULTIPLE as an option (see the MPI_THREAD_MULTIPLE
sketch after this list)
- Fast process migration using RDMA
- Enabling Checkpoint/Restart support in pure SMP mode
- Compact and shorthand way to specify blocks of processes
on the same host with mpirun_rsh
- Support for latest stable version of HWLOC v1.2.2
- Enhanced mpirun_rsh design to avoid race conditions,
support for fault-tolerance functionality and
improved debug messages
- Enhanced debugging config options to generate
core files and back-traces
- Automatic inter-node communication parameter tuning
based on platform and adapter detection (Nemesis)
- Integrated with latest OSU Micro-benchmarks (3.4)
- Improved performance for medium sized messages (QLogic PSM interface)
- Multi-core-aware collective support (QLogic PSM interface)
- Performance optimization for QDR cards
- Support for Chelsio T4 Adapter
- Support for Ekopath Compiler
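To illustrate the passive-locking item above, here is a minimal sketch
(standard MPI-2 calls, two ranks assumed) of a passive-target epoch in
which rank 0 updates rank 1's window while rank 1 makes no matching MPI
calls during the epoch; this is the kind of intra-node pattern that
shared-memory and LiMIC2 backed windows accelerate. The window layout
and values are only for illustration.

/* Minimal sketch of passive-target RMA locking (standard MPI-2 calls). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes one integer through the window. */
    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        int payload = 42;
        /* Passive-target epoch: lock rank 1, put, unlock.
         * Rank 1 makes no matching MPI calls during this epoch. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(&payload, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* rank 0's epoch has completed */
    if (rank == 1) {
        /* Lock our own window to safely read the updated value
         * (required by the MPI-2 separate memory model). */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        printf("rank 1 received %d\n", value);
        MPI_Win_unlock(1, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Run it with two processes on the same node (for example,
mpirun_rsh -np 2 hostA hostA ./a.out) to exercise the intra-node path.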
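For the large-message item above, a common way to move more than 2GB in
a single operation (MPI send/receive counts are of type int) is to
describe the buffer with a derived datatype so that the element count
stays small. The sketch below is only illustrative; the 1 MiB block
size and 3 GiB payload are arbitrary, and error handling is omitted.

/* Illustrative sketch: transferring a >2GB buffer by packing it into a
 * derived datatype so the (int) element count stays within range. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK_BYTES (1 << 20)                 /* 1 MiB per element */

int main(int argc, char **argv)
{
    const size_t total_bytes = (size_t)3 << 30;   /* 3 GiB payload */
    const int    nblocks     = (int)(total_bytes / BLOCK_BYTES);
    int rank;
    char *buf;
    MPI_Datatype block;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(total_bytes);                /* assumes enough memory */

    /* One element of 'block' covers 1 MiB, so a count of 3072
     * describes the whole 3 GiB buffer. */
    MPI_Type_contiguous(BLOCK_BYTES, MPI_BYTE, &block);
    MPI_Type_commit(&block);

    if (rank == 0)
        MPI_Send(buf, nblocks, block, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, nblocks, block, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Type_free(&block);
    free(buf);
    MPI_Finalize();
    return 0;
}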
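For the threading item above, an application that needs to make MPI
calls from multiple threads should request MPI_THREAD_MULTIPLE
explicitly and check what the library actually grants, since a default
build provides MPI_THREAD_SINGLE. A minimal sketch:

/* Request MPI_THREAD_MULTIPLE and verify the granted thread level. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: requested MPI_THREAD_MULTIPLE, "
                        "got thread level %d\n", provided);

    /* ... only call MPI from multiple threads when
     *     provided == MPI_THREAD_MULTIPLE ... */

    MPI_Finalize();
    return 0;
}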
Bug Fixes:
- Fixes in Checkpoint/Restart and Migration support
- Fix Restart when using automatic checkpoint
- Thanks to Alexandr for reporting this
- Handling very large one-sided transfers using RDMA
- Fixes for memory leaks
- Graceful handling of unknown HCAs
- Better handling of shmem file creation errors
- Fix for a hang in intra-node transfer
- Fix for a build error with --disable-weak-symbols
- Thanks to Peter Willis for reporting this issue
- Fixes for one-sided communication with passive target
synchronization
- Better handling of memory allocation and registration failures
- Fixes for compilation warnings
- Fix a bug that disallowed '=' in mpirun_rsh arguments
- Handling of non-contiguous transfer in Nemesis interface
- Bug fix in gather collective when ranks are in cyclic order
- Fix for the ignore_locks bug in MPI-IO with Lustre
- Compiler preference lists reordered to avoid mixing GCC and Intel
compilers if both are found by configure
- Fix a bug in transferring very large messages (>2GB)
- Thanks to Tibor Pausz from Univ. of Frankfurt for reporting it
- Fix a hang with One-Sided Put operation
- Fix a bug in ptmalloc integration
- Avoid double-free crash with mpispawn
- Avoid crash and print an error message in mpirun_rsh when the
hostfile is empty
- Checking for error codes in PMI design
- Verify programs can link with LiMIC2 at runtime
- Fix for a compilation issue when BLCR or FTB is installed in
non-system paths
- Fix an issue with RDMA-Migration
- Fix a hang with RDMA CM
- Fix an issue in supporting RoCE with the second port available on the HCA
- Thanks to Jeffrey Konz from HP for reporting it
- Fix for a hang with passive RMA tests (QLogic PSM interface)
New Features, Enhancements, and Bug Fixes of OSU Micro-Benchmarks (OMB)
3.4 (since OMB 3.3 release) are listed here.
New Features & Enhancements
- Add passive one-sided communication benchmarks
- Update one-sided communication benchmarks to provide a shared
memory hint in MPI_Alloc_mem calls (see the sketch after this
list)
- Update one-sided communication benchmarks to use MPI_Alloc_mem
for buffer allocation
- Give default values to configure definitions (can now build
directly with mpicc)
- Update latency benchmarks to begin from 0 byte message
Bug Fixes
- Remove memory leaks in one-sided communication benchmarks
- Update benchmarks to touch buffers before using them for
communication
- Fix osu_get_bw test to use different buffers for concurrent
communication operations
- Fix compilation warnings
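As a rough illustration of the MPI_Alloc_mem items above, the sketch
below allocates a benchmark buffer with MPI_Alloc_mem and passes a
shared-memory hint through an MPI_Info object. Note that the info key
name used here ("alloc_shm") is an assumption for illustration, not
taken from the OMB sources; unrecognized info keys are ignored, and
MPI_INFO_NULL remains the portable fallback.

/* Sketch: allocate a communication buffer via MPI_Alloc_mem with an
 * implementation-specific shared-memory hint. */
#include <mpi.h>

int main(int argc, char **argv)
{
    char *buf;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Hint name assumed for illustration; unknown keys are ignored. */
    MPI_Info_set(info, "alloc_shm", "true");

    /* 4 MB buffer allocated by the MPI library; depending on the
     * implementation it may come from registered or shared memory. */
    MPI_Alloc_mem(4 * 1024 * 1024, info, &buf);

    /* ... use buf as the origin/target buffer of one-sided operations ... */

    MPI_Free_mem(buf);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}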
MVAPICH2 1.7 is being made available with OFED 1.5.4. It continues to
deliver excellent performance. Sample performance numbers include:
OpenFabrics/Gen2 on Westmere quad-core (2.53 GHz) with PCIe-Gen2
and ConnectX2-QDR (Two-sided Operations):
- 1.64 microsec one-way latency (4 bytes)
- 3394 MB/sec unidirectional bandwidth
- 6537 MB/sec bidirectional bandwidth
QLogic InfiniPath Support on Westmere quad-core (2.53 GHz) with
PCIe-Gen2 and QLogic-QDR (Two-sided Operations):
- 1.70 microsec one-way latency (4 bytes)
- 3265 MB/sec unidirectional bandwidth
- 4228 MB/sec bidirectional bandwidth
OpenFabrics/Gen2-RoCE (RDMA over Converged Ethernet) Support on
Westmere quad-core (2.53 GHz) with ConnectX-EN (10GigE)
(Two-sided operations):
- 1.96 microsec one-way latency (4 bytes)
- 1143 MB/sec unidirectional bandwidth
- 2284 MB/sec bidirectional bandwidth
Intra-node performance on Westmere quad-core (2.53 GHz)
(Two-sided operations, intra-socket)
- 0.29 microsec one-way latency (4 bytes)
- 9964 MB/sec unidirectional bandwidth with LiMIC2
- 17998 MB/sec bidirectional bandwidth with LiMIC2
Performance numbers for several other platforms and system
configurations can be viewed by visiting the `Performance' section of the
project's web page.
Starting with this release, a short `Quick Start Guide' is available
for new and novice MVAPICH2 users. The complete `User
Guide' is also available.
For downloading MVAPICH2-1.7, OSU Micro-Benchmarks (OMB) 3.4,
associated quick start guide/user guide and accessing the SVN,
please visit the following URL:
http://mvapich.cse.ohio-state.edu
All questions, feedback, bug reports, hints for performance tuning,
patches, and enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).
Thanks,
The MVAPICH Team