[mpich-discuss] [mvapich-discuss] Announcing the Release of MVAPICH2 1.8 and OSU Micro-Benchmarks (OMB) 3.6

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon Apr 30 22:44:03 CDT 2012


These releases might be of interest to some of the MPICH users, so I am
posting the announcement here.

Thanks,

DK


---------- Forwarded message ----------
Date: Mon, 30 Apr 2012 21:44:37 -0400 (EDT)
From: Dhabaleswar Panda <panda at cse.ohio-state.edu>
To: mvapich-discuss at cse.ohio-state.edu
Cc: Dhabaleswar Panda <panda at cse.ohio-state.edu>
Subject: [mvapich-discuss] Announcing the Release of MVAPICH2 1.8 and OSU
    Micro-Benchmarks (OMB) 3.6

The MVAPICH team is pleased to announce the release of MVAPICH2 1.8 and
OSU Micro-Benchmarks (OMB) 3.6.

Features, Enhancements, and Bug Fixes for MVAPICH2 1.8 are listed here.

* New Features and Enhancements (since 1.8RC1):

    - Introduced a unified run-time parameter MV2_USE_ONLY_UD to
      enable UD-only mode (see the usage sketch after this list)
    - Enhanced designs for Alltoall and Allgather collective communication
      from GPU device buffers
    - Tuned collective communication from GPU device buffers
    - Tuned Gather collective
    - Introduced a run-time parameter MV2_SHOW_CPU_BINDING to show current
      CPU bindings
    - Updated to hwloc v1.4.1
    - Removed dependency on LEX and YACC
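
For those who have not tried these parameters yet, below is a minimal
sketch of how they would typically be passed to mpirun_rsh as
environment variables. The host names and program are placeholders;
please consult the MVAPICH2 1.8 user guide for the authoritative
parameter descriptions.

    /*
     * Minimal MPI program for trying the new run-time parameters.
     * Assumed launch (host names and paths are placeholders):
     *
     *   mpirun_rsh -np 2 node01 node02 \
     *       MV2_USE_ONLY_UD=1 MV2_SHOW_CPU_BINDING=1 ./hello_mv2
     *
     * MV2_USE_ONLY_UD=1 requests UD-only transport, and
     * MV2_SHOW_CPU_BINDING=1 asks the library to print the CPU binding
     * of each process at startup.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }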

* Bug Fixes (since 1.8RC1):

    - Fix hang with multi-GPU configurations
        - Thanks to Jens Glaser from University of Minnesota
          for the report
    - Fix buffer alignment issues to improve intra-node performance
    - Fix a DPM multispawn behavior
    - Enhanced error reporting in DPM functionality
    - Quote environment variables in job startup to protect them from
      the shell
    - Fix hang when LIMIC is enabled
    - Fix hang in environments with heterogeneous HCAs
    - Fix issue when using multiple HCA ports in RDMA_CM mode
        - Thanks to Steve Wise from Open Grid Computing for the report
    - Fix hang during MPI_Finalize in Nemesis IB netmod
    - Fix for a start-up issue in Nemesis with heterogeneous architectures
    - Fix a few memory leaks and warnings

Features, Enhancements, and Bug Fixes for OSU Micro-Benchmarks (OMB) 3.6
are listed here.

* New Features & Enhancements (since OMB 3.5.1)
    - New collective benchmarks (a minimal timing sketch follows this list)
        * osu_allgather
        * osu_allgatherv
        * osu_allreduce
        * osu_alltoall
        * osu_alltoallv
        * osu_barrier
        * osu_bcast
        * osu_gather
        * osu_gatherv
        * osu_reduce
        * osu_reduce_scatter
        * osu_scatter
        * osu_scatterv
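
As a rough illustration of what these collective benchmarks measure,
the sketch below times MPI_Allreduce over a range of message sizes. It
is not the OMB source code, only an approximation of the measurement
pattern; the real benchmarks handle warm-up, iteration counts, and
reporting more carefully.

    /* Sketch of an allreduce timing loop in the spirit of
     * osu_allreduce. This is NOT the OMB implementation, only an
     * illustration of the measurement pattern. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ITERATIONS 100

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (size_t bytes = 4; bytes <= (1 << 20); bytes *= 2) {
            size_t count = bytes / sizeof(float);
            float *sendbuf = malloc(bytes);
            float *recvbuf = malloc(bytes);
            for (size_t i = 0; i < count; i++)
                sendbuf[i] = 1.0f;

            MPI_Barrier(MPI_COMM_WORLD);
            double start = MPI_Wtime();
            for (int i = 0; i < ITERATIONS; i++)
                MPI_Allreduce(sendbuf, recvbuf, (int)count, MPI_FLOAT,
                              MPI_SUM, MPI_COMM_WORLD);
            double elapsed = MPI_Wtime() - start;

            if (rank == 0)
                printf("%8zu bytes: %10.2f us per MPI_Allreduce\n",
                       bytes, 1.0e6 * elapsed / ITERATIONS);

            free(sendbuf);
            free(recvbuf);
        }

        MPI_Finalize();
        return 0;
    }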

* Bug Fixes (since OMB 3.5.1)
    - Fix GPU binding issue when running with HH mode

The complete set of features and enhancements for MVAPICH2 1.8 compared
to MVAPICH2 1.7 is as follows:

* Features & Enhancements:
    - Support for MPI communication from NVIDIA GPU device memory
      (a brief code sketch follows this list)
        - High performance RDMA-based inter-node point-to-point
          communication (GPU-GPU, GPU-Host and Host-GPU)
        - High performance intra-node point-to-point communication for
          multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
        - Taking advantage of CUDA IPC (available in CUDA 4.1) in
          intra-node communication for multiple GPU adapters/node
        - Enhanced designs for Alltoall and Allgather collective
          communication from GPU device buffers
        - Optimized and tuned collectives for GPU device buffers
        - MPI datatype support for point-to-point and collective
          communication from GPU device buffers
    - Support for running in UD-only mode
    - Support suspend/resume functionality with mpirun_rsh
    - Enhanced support for CPU binding with socket and numanode level
      granularity
    - Support for showing current CPU bindings
    - Exporting local rank, local size, global rank and global
      size through environment variables (both mpirun_rsh and hydra)
    - Update to hwloc v1.4.1
    - Checkpoint-Restart support in OFA-IB-Nemesis interface
    - Enabling run-through stabilization support to handle
      process failures in OFA-IB-Nemesis interface
    - Enhancing OFA-IB-Nemesis interface to handle IB errors gracefully
    - Performance tuning on various architecture clusters
    - Support for Mellanox IB FDR adapter
    - Adjust shared-memory communication block size at runtime
    - Enable XRC by default at configure time
    - New shared memory design for enhanced intra-node small message
      performance
    - Tuned inter-node and intra-node performance on different cluster
      architectures
    - Support for fallback to R3 rendezvous protocol if RGET fails
    - SLURM integration with mpiexec.mpirun_rsh to use SLURM-allocated
      hosts without specifying a hostfile
    - Support added to automatically use PBS_NODEFILE in Torque and PBS
      environments
    - Enable signal-triggered (SIGUSR2) migration
    - Reduced memory footprint of the library
    - Enhanced one-sided communication design with reduced
      memory requirement
    - Enhanced and tuned collectives (Bcast and Alltoallv)
    - Flexible HCA selection with Nemesis interface
        - Thanks to Grigori Inozemtsev, Queens University
    - Support iWARP interoperability between Intel NE020 and
      Chelsio T4 Adapters
    - Renamed the environment variable that enables RoCE from
      MV2_USE_RDMAOE to MV2_USE_RoCE
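
To illustrate the GPU device memory support listed above, here is a
minimal point-to-point sketch that passes CUDA device pointers directly
to MPI calls. It assumes MVAPICH2 1.8 was configured with CUDA support
and that GPU support is enabled at run time (the user guide documents
the relevant parameter, e.g. MV2_USE_CUDA=1); error checking is omitted
for brevity.

    /* Sketch: inter-process transfer directly between GPU device
     * buffers. Assumes MVAPICH2 built with CUDA support and GPU
     * support enabled at run time (e.g. MV2_USE_CUDA=1 per the user
     * guide). Compile with mpicc and link against the CUDA runtime
     * (-lcudart). Error checking omitted for brevity. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int count = 1 << 20;      /* 1M floats */
        float *d_buf;                   /* GPU device buffer */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&d_buf, count * sizeof(float));
        cudaMemset(d_buf, 0, count * sizeof(float));

        /* The device pointer is handed straight to MPI; the library
         * moves the data without an explicit cudaMemcpy staging step. */
        if (rank == 0) {
            MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d floats into GPU memory\n", count);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }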

MVAPICH2 1.8 continues to deliver excellent performance. Sample
performance numbers include:

  OpenFabrics/Gen2 on Sandy Bridge 8-core (2.6 GHz) with PCIe-Gen3
      and ConnectX-3 FDR (Two-sided Operations):
        - 1.05 microsec one-way latency (4 bytes)
        - 6344 MB/sec unidirectional bandwidth
        - 11994 MB/sec bidirectional bandwidth

  OpenFabrics/Gen2-RoCE (RDMA over Converged Ethernet) Support on
      Sandy Bridge 8-core (2.6 GHz) with ConnectX-3 EN (40GigE)
      (Two-sided operations):
        - 1.2 microsec one-way latency (4 bytes)
        - 4565 MB/sec unidirectional bandwidth
        - 9117 MB/sec bidirectional bandwidth

  Intra-node performance on Sandy Bridge 8-core (2.6 GHz)
      (Two-sided operations, intra-socket)
        - 0.19 microsec one-way latency (4 bytes)
        - 9643 MB/sec unidirectional bandwidth
        - 16941 MB/sec bidirectional bandwidth

Sample performance numbers for MPI communication from NVIDIA GPU memory
using MVAPICH2 1.8 and OMB 3.6 can be obtained from the following URL:

http://mvapich.cse.ohio-state.edu/performance/gpu.shtml

Performance numbers for several other platforms and system configurations
can be viewed by visiting the `Performance' section of the project's web page.

To download MVAPICH2 1.8 and OMB 3.6, or to access the associated user
guide, quick start guide, and SVN repository, please visit the
following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning,
patches, and enhancements are welcome. Please post them to the
mvapich-discuss mailing list (mvapich-discuss at cse.ohio-state.edu).

We are also happy to report that the number of organizations using
MVAPICH/MVAPICH2 (and registered at the MVAPICH site) has crossed 1,900
worldwide (in 67 countries). The MVAPICH team extends its thanks to all
of these organizations.

Thanks,

The MVAPICH Team



