[petsc-users] How to confirm the performance of asynchronous computations
Viet H.Q.H.
hqhviet at tohoku.ac.jp
Fri Jan 22 11:20:15 CST 2021
Dear Victor and Barry,
Thank you so much for your answers.
I fixed the bug in the PetscCommSplitReductionBegin call, as pointed out by Barry:
ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));
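With that change, the asynchronous region now reads as follows (names as in the test code quoted below; error checking added for completeness):

ierr = VecNormBegin(u,NORM_2,&norm1);CHKERRQ(ierr);
/* start the non-blocking reduction on the communicator of u */
ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));CHKERRQ(ierr);
/* overlap the matrix-vector product with the in-flight reduction */
ierr = KSP_MatMult(ksp,A,c,Ac);CHKERRQ(ierr);
ierr = VecNormEnd(u,NORM_2,&norm1);CHKERRQ(ierr);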
Setting the vector size too small was also a mistake. I set the vector size to 100000000 and ran the code on 4 nodes with 2 processes per node. The result is as follows:
The time used for the asynchronous calculation: 0.022043
+ |u| = 10000.
The time used for the synchronous calculation: 0.016188
+ |b| = 10000.
The asynchronous computation still takes longer than the synchronous one.
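Since the measured intervals are quite short, I am also thinking of placing a barrier in front of each timed region and reporting the maximum over all ranks, roughly as in the sketch below (the variable elapsed is only illustrative):

PetscLogDouble tt1,tt2,elapsed;

ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);  /* align ranks so start-up skew does not distort the short interval */
ierr = PetscTime(&tt1);CHKERRQ(ierr);
/* ... the asynchronous (or synchronous) variant being measured ... */
ierr = PetscTime(&tt2);CHKERRQ(ierr);
elapsed = tt2 - tt1;
/* report the slowest rank rather than whichever rank prints first */
ierr = MPI_Allreduce(MPI_IN_PLACE,&elapsed,1,MPI_DOUBLE,MPI_MAX,PETSC_COMM_WORLD);CHKERRQ(ierr);
ierr = PetscPrintf(PETSC_COMM_WORLD,"max elapsed time: %f\n",elapsed);CHKERRQ(ierr);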
I also confirmed that PETSC_HAVE_MPI_IALLREDUCE is defined in the
file $PETSC_DIR/include/petscconf.h
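As an additional check from within a program, a tiny PETSc main can report which branch of MPIPetsc_Iallreduce is compiled in (this only confirms the macro, not that the MPI library actually progresses the reduction in the background):

#include <petscsys.h>

int main(int argc,char **argv)
{
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,NULL,NULL);if (ierr) return ierr;
#if defined(PETSC_HAVE_MPI_IALLREDUCE)
  ierr = PetscPrintf(PETSC_COMM_WORLD,"PETSC_HAVE_MPI_IALLREDUCE is defined: MPI_Iallreduce will be used\n");CHKERRQ(ierr);
#else
  ierr = PetscPrintf(PETSC_COMM_WORLD,"PETSC_HAVE_MPI_IALLREDUCE is not defined: blocking MPIU_Allreduce fallback\n");CHKERRQ(ierr);
#endif
  ierr = PetscFinalize();
  return ierr;
}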
I built PETSc using the following script:
#!/usr/bin/bash
set -e
DATE="21.01.18"
MPIIT_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mpi/intel64"
MKL_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mkl"
INSTL_DIR="${HOME}/local/petsc-3.14.3"
BUILD_DIR="${HOME}/tmp/petsc/build_${DATE}"
PETSC_DIR="${HOME}/tmp/petsc"
cd ${PETSC_DIR}
./configure --force --prefix=${INSTL_DIR} --with-mpi-dir=${MPIIT_DIR} \
  --with-fortran-bindings=0 --with-mpiexec=${MPIIT_DIR}/bin/mpiexec \
  --with-valgrind-dir=${HOME}/local/valgrind --with-blaslapack-dir=${MKL_DIR} \
  --download-make --with-debugging=0 \
  COPTFLAGS='-O3 -march=native -mtune=native' \
  CXXOPTFLAGS='-O3 -march=native -mtune=native' \
  FOPTFLAGS='-O3 -march=native -mtune=native'
make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt all
make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt install
Intel MPI 2018 also complies with the MPI-3 standard.
Are there any specific settings for Intel MPI that are needed to obtain good performance from MPI_Iallreduce?
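To isolate the MPI side from PETSc, I can also run a small MPI-only test that overlaps MPI_Iallreduce with some local computation and compares it against the blocking MPI_Allreduce followed by the same work. The buffer length, the dummy work loop, and the variable names below are only illustrative assumptions:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Dummy local work to overlap with the reduction. */
static double local_work(const double *x, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; i++) s += x[i] * x[i];
  return s;
}

int main(int argc, char **argv)
{
  const int   n = 1000000;                 /* local work size (illustrative) */
  double     *x, s = 0.0;
  double      local, global = 0.0, t0, t_overlap, t_blocking;
  MPI_Request req;
  int         rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  x = (double *)malloc(n * sizeof(double));
  for (int i = 0; i < n; i++) x[i] = 1.0;
  local = (double)n;

  /* Non-blocking reduction overlapped with local work. */
  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
  s += local_work(x, n);
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  t_overlap = MPI_Wtime() - t0;

  /* Blocking reduction followed by the same local work. */
  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  s += local_work(x, n);
  t_blocking = MPI_Wtime() - t0;

  /* Report the maximum times over all ranks. */
  MPI_Allreduce(MPI_IN_PLACE, &t_overlap,  1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  MPI_Allreduce(MPI_IN_PLACE, &t_blocking, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  if (rank == 0)
    printf("overlapped: %g s, blocking: %g s (sum %g, checksum %g)\n",
           t_overlap, t_blocking, global, s);

  free(x);
  MPI_Finalize();
  return 0;
}

If the overlapped time is not clearly smaller than the blocking one, the library is presumably completing the reduction only inside MPI_Wait rather than in the background, and any further improvement would have to come from implementation-specific asynchronous-progress settings.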
Sincerely,
Viet.
On Fri, Jan 22, 2021 at 11:20 AM Barry Smith <bsmith at petsc.dev> wrote:
>
> ierr = VecNormBegin(u,NORM_2,&norm1);
> ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>
> How come you call this on Ax and not on u? For clarity, if nothing else, I
> think you should call it on u.
>
> comb.c has
>
> /*
> Split phase global vector reductions with support for combining the
> communication portion of several operations. Using MPI-1.1 support only
>
> The idea for this and much of the initial code is contributed by
> Victor Eijkhout.
>
> Usage:
> VecDotBegin(Vec,Vec,PetscScalar *);
> VecNormBegin(Vec,NormType,PetscReal *);
> ....
> VecDotEnd(Vec,Vec,PetscScalar *);
> VecNormEnd(Vec,NormType,PetscReal *);
>
> Limitations:
> - The order of the xxxEnd() functions MUST be in the same order
> as the xxxBegin(). There is extensive error checking to try to
> insure that the user calls the routines in the correct order
> */
>
> #include <petsc/private/vecimpl.h> /*I "petscvec.h" I*/
>
> static PetscErrorCode MPIPetsc_Iallreduce(void *sendbuf,void *recvbuf,PetscMPIInt count,MPI_Datatype datatype,MPI_Op op,MPI_Comm comm,MPI_Request *request)
> {
>   PETSC_UNUSED PetscErrorCode ierr;
>
>   PetscFunctionBegin;
> #if defined(PETSC_HAVE_MPI_IALLREDUCE)
>   ierr = MPI_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRMPI(ierr);
> #elif defined(PETSC_HAVE_MPIX_IALLREDUCE)
>   ierr = MPIX_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRQ(ierr);
> #else
>   ierr = MPIU_Allreduce(sendbuf,recvbuf,count,datatype,op,comm);CHKERRQ(ierr);
>   *request = MPI_REQUEST_NULL;
> #endif
>   PetscFunctionReturn(0);
> }
>
>
> So first check if $PETSC_DIR/include/petscconf.h has
>
> PETSC_HAVE_MPI_IALLREDUCE
>
> If it does not, then the standard blocking MPI reduction is called.
>
> If it is set, then any improvement depends on the implementation of
> MPI_Iallreduce inside the MPI library you are using.
>
> Barry
>
>
> On Jan 21, 2021, at 6:52 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>
>
> Hello Petsc developers and supporters,
>
> I would like to confirm the performance of asynchronous computation, in
> which an inner-product (norm) reduction is overlapped with a matrix-vector
> multiplication, using the code below.
>
>
> PetscLogDouble tt1,tt2;
> KSP ksp;
> //ierr = VecSet(c,one);
> ierr = VecSet(c,one);
> ierr = VecSet(u,one);
> ierr = VecSet(b,one);
>
> ierr = KSPCreate(PETSC_COMM_WORLD,&ksp); CHKERRQ(ierr);
> ierr = KSP_MatMult(ksp,A,x,Ax); CHKERRQ(ierr);
>
>
> ierr = PetscTime(&tt1);CHKERRQ(ierr);
> ierr = VecNormBegin(u,NORM_2,&norm1);
> ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
> ierr = KSP_MatMult(ksp,A,c,Ac);
> ierr = VecNormEnd(u,NORM_2,&norm1);
> ierr = PetscTime(&tt2);CHKERRQ(ierr);
>
> ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the
> asynchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
> ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |u| = %g\n",(double) norm1);
> CHKERRQ(ierr);
>
>
> ierr = PetscTime(&tt1);CHKERRQ(ierr);
> ierr = VecNorm(b,NORM_2,&norm2); CHKERRQ(ierr);
> ierr = KSP_MatMult(ksp,A,c,Ac);
> ierr = PetscTime(&tt2);CHKERRQ(ierr);
>
> ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the
> synchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
> ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |b| = %g\n",(double) norm2);
> CHKERRQ(ierr);
>
>
> On a cluster with two or four nodes, the asynchronous computation is
> always much slower than the synchronous computation.
>
> The time used for the asynchronous calculation: 0.000203
> + |u| = 100.
> The time used for the synchronous calculation: 0.000006
> + |b| = 100.
>
> Are there any necessary settings in MPI or PETSc to gain performance from
> asynchronous computation?
>
> Thank you very much for anything you can provide.
> Sincerely,
> Viet.