[petsc-users] How to confirm the performance of asynchronous computations
Viet H.Q.H.
hqhviet at tohoku.ac.jp
Fri Jan 22 11:20:15 CST 2021
Dear Victor and Barry,
Thank you so much for your answers.
I fixed the bug in the PetscCommSplitReductionBegin call, as pointed out by Barry:
ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));
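With that change, the asynchronous region now reads as follows (names as in the test code quoted below; error checking added for completeness):

ierr = VecNormBegin(u,NORM_2,&norm1);CHKERRQ(ierr);
/* start the non-blocking reduction on the communicator of u */
ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));CHKERRQ(ierr);
/* overlap the matrix-vector product with the in-flight reduction */
ierr = KSP_MatMult(ksp,A,c,Ac);CHKERRQ(ierr);
ierr = VecNormEnd(u,NORM_2,&norm1);CHKERRQ(ierr);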
Setting the vector size too small was also a mistake. I set the vector size to 100000000 and ran the code on 4 nodes with 2 processes per node. The result is as follows:
The time used for the asynchronous calculation: 0.022043
+ |u| = 10000.
The time used for the synchronous calculation: 0.016188
+ |b| = 10000.
The asynchronous computation still takes longer than the synchronous one.
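Since the measured intervals are quite short, I am also thinking of placing a barrier in front of each timed region and reporting the maximum over all ranks, roughly as in the sketch below (the variable elapsed is only illustrative):

PetscLogDouble tt1,tt2,elapsed;

ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);  /* align ranks so start-up skew does not distort the short interval */
ierr = PetscTime(&tt1);CHKERRQ(ierr);
/* ... the asynchronous (or synchronous) variant being measured ... */
ierr = PetscTime(&tt2);CHKERRQ(ierr);
elapsed = tt2 - tt1;
/* report the slowest rank rather than whichever rank prints first */
ierr = MPI_Allreduce(MPI_IN_PLACE,&elapsed,1,MPI_DOUBLE,MPI_MAX,PETSC_COMM_WORLD);CHKERRQ(ierr);
ierr = PetscPrintf(PETSC_COMM_WORLD,"max elapsed time: %f\n",elapsed);CHKERRQ(ierr);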
I also confirmed that PETSC_HAVE_MPI_IALLREDUCE is defined in the
file $PETSC_DIR/include/petscconf.h
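As an additional check from within a program, a tiny PETSc main can report which branch of MPIPetsc_Iallreduce is compiled in (this only confirms the macro, not that the MPI library actually progresses the reduction in the background):

#include <petscsys.h>

int main(int argc,char **argv)
{
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,NULL,NULL);if (ierr) return ierr;
#if defined(PETSC_HAVE_MPI_IALLREDUCE)
  ierr = PetscPrintf(PETSC_COMM_WORLD,"PETSC_HAVE_MPI_IALLREDUCE is defined: MPI_Iallreduce will be used\n");CHKERRQ(ierr);
#else
  ierr = PetscPrintf(PETSC_COMM_WORLD,"PETSC_HAVE_MPI_IALLREDUCE is not defined: blocking MPIU_Allreduce fallback\n");CHKERRQ(ierr);
#endif
  ierr = PetscFinalize();
  return ierr;
}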
I built PETSc using the following script:
#!/usr/bin/bash
set -e
DATE="21.01.18"
MPIIT_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mpi/intel64"
MKL_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mkl"
INSTL_DIR="${HOME}/local/petsc-3.14.3"
BUILD_DIR="${HOME}/tmp/petsc/build_${DATE}"
PETSC_DIR="${HOME}/tmp/petsc"
cd ${PETSC_DIR}
./configure --force --prefix=${INSTL_DIR} --with-mpi-dir=${MPIIT_DIR} \
  --with-fortran-bindings=0 --with-mpiexec=${MPIIT_DIR}/bin/mpiexec \
  --with-valgrind-dir=${HOME}/local/valgrind --with-blaslapack-dir=${MKL_DIR} \
  --download-make --with-debugging=0 \
  COPTFLAGS='-O3 -march=native -mtune=native' \
  CXXOPTFLAGS='-O3 -march=native -mtune=native' \
  FOPTFLAGS='-O3 -march=native -mtune=native'
make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt all
make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt install
Intel MPI 2018 also complies with the MPI-3 standard.
Are there any specific settings for Intel MPI that are needed to obtain good performance from MPI_Iallreduce?
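To isolate the MPI side from PETSc, I can also run a small MPI-only test that overlaps MPI_Iallreduce with some local computation and compares it against the blocking MPI_Allreduce followed by the same work. The buffer length, the dummy work loop, and the variable names below are only illustrative assumptions:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Dummy local work to overlap with the reduction. */
static double local_work(const double *x, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; i++) s += x[i] * x[i];
  return s;
}

int main(int argc, char **argv)
{
  const int   n = 1000000;                 /* local work size (illustrative) */
  double     *x, s = 0.0;
  double      local, global = 0.0, t0, t_overlap, t_blocking;
  MPI_Request req;
  int         rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  x = (double *)malloc(n * sizeof(double));
  for (int i = 0; i < n; i++) x[i] = 1.0;
  local = (double)n;

  /* Non-blocking reduction overlapped with local work. */
  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
  s += local_work(x, n);
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  t_overlap = MPI_Wtime() - t0;

  /* Blocking reduction followed by the same local work. */
  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  s += local_work(x, n);
  t_blocking = MPI_Wtime() - t0;

  /* Report the maximum times over all ranks. */
  MPI_Allreduce(MPI_IN_PLACE, &t_overlap,  1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  MPI_Allreduce(MPI_IN_PLACE, &t_blocking, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  if (rank == 0)
    printf("overlapped: %g s, blocking: %g s (sum %g, checksum %g)\n",
           t_overlap, t_blocking, global, s);

  free(x);
  MPI_Finalize();
  return 0;
}

If the overlapped time is not clearly smaller than the blocking one, the library is presumably completing the reduction only inside MPI_Wait rather than in the background, and any further improvement would have to come from implementation-specific asynchronous-progress settings.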
Sincerely,
Viet.
On Fri, Jan 22, 2021 at 11:20 AM Barry Smith <bsmith at petsc.dev> wrote:
>
> ierr = VecNormBegin(u,NORM_2,&norm1);
> ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>
> How come you call this on Ax and not on u? For clarity, if nothing else, I
> think you should call it on u.
>
> comb.c has
>
> /*
> Split phase global vector reductions with support for combining the
> communication portion of several operations. Using MPI-1.1 support only
>
> The idea for this and much of the initial code is contributed by
> Victor Eijkhout.
>
> Usage:
> VecDotBegin(Vec,Vec,PetscScalar *);
> VecNormBegin(Vec,NormType,PetscReal *);
> ....
> VecDotEnd(Vec,Vec,PetscScalar *);
> VecNormEnd(Vec,NormType,PetscReal *);
>
> Limitations:
> - The order of the xxxEnd() functions MUST be in the same order
> as the xxxBegin(). There is extensive error checking to try to
> insure that the user calls the routines in the correct order
> */
>
> #include <petsc/private/vecimpl.h> /*I "petscvec.h" I*/
>
> static PetscErrorCode MPIPetsc_Iallreduce(void *sendbuf,void *recvbuf,PetscMPIInt count,MPI_Datatype datatype,MPI_Op op,MPI_Comm comm,MPI_Request *request)
> {
>   PETSC_UNUSED PetscErrorCode ierr;
>
>   PetscFunctionBegin;
> #if defined(PETSC_HAVE_MPI_IALLREDUCE)
>   ierr = MPI_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRMPI(ierr);
> #elif defined(PETSC_HAVE_MPIX_IALLREDUCE)
>   ierr = MPIX_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRQ(ierr);
> #else
>   ierr = MPIU_Allreduce(sendbuf,recvbuf,count,datatype,op,comm);CHKERRQ(ierr);
>   *request = MPI_REQUEST_NULL;
> #endif
>   PetscFunctionReturn(0);
> }
>
>
> So first check if $PETSC_DIR/include/petscconf.h has
>
> PETSC_HAVE_MPI_IALLREDUCE
>
> If it does not, then the standard blocking MPI reduction is called.
>
> If it is set, then any improvement depends on the implementation of
> MPI_Iallreduce inside the MPI library you are using.
>
> Barry
>
>
> On Jan 21, 2021, at 6:52 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>
>
> Hello Petsc developers and supporters,
>
> I would like to confirm the performance of asynchronous computation, in
> which an inner-product (norm) reduction is overlapped with a matrix-vector
> multiplication, using the code below.
>
>
> PetscLogDouble tt1,tt2;
> KSP ksp;
> //ierr = VecSet(c,one);
> ierr = VecSet(c,one);
> ierr = VecSet(u,one);
> ierr = VecSet(b,one);
>
> ierr = KSPCreate(PETSC_COMM_WORLD,&ksp); CHKERRQ(ierr);
> ierr = KSP_MatMult(ksp,A,x,Ax); CHKERRQ(ierr);
>
>
> ierr = PetscTime(&tt1);CHKERRQ(ierr);
> ierr = VecNormBegin(u,NORM_2,&norm1);
> ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
> ierr = KSP_MatMult(ksp,A,c,Ac);
> ierr = VecNormEnd(u,NORM_2,&norm1);
> ierr = PetscTime(&tt2);CHKERRQ(ierr);
>
> ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the
> asynchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
> ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |u| = %g\n",(double) norm1);
> CHKERRQ(ierr);
>
>
> ierr = PetscTime(&tt1);CHKERRQ(ierr);
> ierr = VecNorm(b,NORM_2,&norm2); CHKERRQ(ierr);
> ierr = KSP_MatMult(ksp,A,c,Ac);
> ierr = PetscTime(&tt2);CHKERRQ(ierr);
>
> ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the
> synchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
> ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |b| = %g\n",(double) norm2);
> CHKERRQ(ierr);
>
>
> On a cluster with two or four nodes, the asynchronous computation is
> always much slower than the synchronous computation.
>
> The time used for the asynchronous calculation: 0.000203
> + |u| = 100.
> The time used for the synchronous calculation: 0.000006
> + |b| = 100.
>
> Are there any necessary settings in MPI or PETSc to gain performance from
> asynchronous computation?
>
> Thank you very much for anything you can provide.
> Sincerely,
> Viet.