[petsc-users] How to confirm the performance of asynchronous computations
Barry Smith
bsmith at petsc.dev
Fri Jan 22 12:01:08 CST 2021
https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/asynchronous-progress-control.html
It states "and a partial support for non-blocking collectives (MPI_Ibcast, MPI_Ireduce, and MPI_Iallreduce)." I do not know what partial support means but you can try setting the variables and see if that helps.
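For example, a possible launch line (just a sketch: the executable name and process count are placeholders, and depending on the Intel MPI version you may also need to select the multi-threaded release_mt library):

  export I_MPI_ASYNC_PROGRESS=1
  export I_MPI_ASYNC_PROGRESS_THREADS=1
  mpiexec -n 8 ./your_test_program

Whether this actually helps depends on whether the progress threads advance MPI_Iallreduce in the background on your system.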
> On Jan 22, 2021, at 11:20 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>
>
> Dear Victor and Barry,
>
> Thank you so much for your answers.
>
> I fixed the bug in the call to the PetscCommSplitReductionBegin function, as Barry pointed out.
>
> ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));
>
> Setting the vector size too small was also a mistake.
> I set the vector size to 100000000 and ran the code on 4 nodes with 2 processes per node. The results are as follows:
>
> The time used for the asynchronous calculation: 0.022043
> + |u| = 10000.
> The time used for the synchronous calculation: 0.016188
> + |b| = 10000.
>
> The asynchronous computation still takes longer.
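> To reduce timing noise I could also put a barrier in front of the timer and average over many repetitions, roughly as follows (a sketch reusing the variables from the code in my earlier message below; nits is an arbitrary repetition count):
>
>   PetscLogDouble t0,t1;
>   PetscInt       it,nits = 50;
>   ierr = MPI_Barrier(PetscObjectComm((PetscObject)u));CHKERRMPI(ierr);  /* start everyone together */
>   ierr = PetscTime(&t0);CHKERRQ(ierr);
>   for (it=0; it<nits; it++) {
>     ierr = VecNormBegin(u,NORM_2,&norm1);CHKERRQ(ierr);
>     ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));CHKERRQ(ierr);
>     ierr = KSP_MatMult(ksp,A,c,Ac);CHKERRQ(ierr);
>     ierr = VecNormEnd(u,NORM_2,&norm1);CHKERRQ(ierr);
>   }
>   ierr = PetscTime(&t1);CHKERRQ(ierr);
>   ierr = PetscPrintf(PETSC_COMM_WORLD,"average overlapped time: %f\n",(double)((t1-t0)/nits));CHKERRQ(ierr);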
>
> I also confirmed that PETSC_HAVE_MPI_IALLREDUCE is defined in the file $PETSC_DIR/include/petscconf.h
>
> I built PETSc using the following script:
>
> #!/usr/bin/bash
> set -e
> DATE="21.01.18"
> MPIIT_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mpi/intel64"
> MKL_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mkl"
> INSTL_DIR="${HOME}/local/petsc-3.14.3"
> BUILD_DIR="${HOME}/tmp/petsc/build_${DATE}"
> PETSC_DIR="${HOME}/tmp/petsc"
>
> cd ${PETSC_DIR}
> ./configure --force --prefix=${INSTL_DIR} --with-mpi-dir=${MPIIT_DIR} --with-fortran-bindings=0 --with-mpiexec=${MPIIT_DIR}/bin/mpiexec --with-valgrind-dir=${HOME}/local/valgrind --with-blaslapack-dir=${MKL_DIR} --download-make --with-debugging=0 COPTFLAGS='-O3 -march=native -mtune=native' CXXOPTFLAGS='-O3 -march=native -mtune=native' FOPTFLAGS='-O3 -march=native -mtune=native'
>
> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt all
> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt install
>
>
> Intel MPI 2018 also complies with the MPI-3 standard.
>
> Are there specific Intel MPI settings needed to obtain good performance from the MPI_Iallreduce function?
>
> Sincerely,
> Viet.
>
>
> On Fri, Jan 22, 2021 at 11:20 AM Barry Smith <bsmith at petsc.dev> wrote:
>
> ierr = VecNormBegin(u,NORM_2,&norm1);
> ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>
> How come you call this on Ax and not on u? For clarity, if nothing else, I think you should call it on u.
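> Roughly, the intended overlap pattern, as a sketch with your variable names, is
>
>    ierr = VecNormBegin(u,NORM_2,&norm1);CHKERRQ(ierr);
>    ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));CHKERRQ(ierr); /* start the nonblocking reduction */
>    ierr = KSP_MatMult(ksp,A,c,Ac);CHKERRQ(ierr);                                       /* local work that can overlap it */
>    ierr = VecNormEnd(u,NORM_2,&norm1);CHKERRQ(ierr);                                   /* wait for the reduction result */
>
> where the End call blocks until the reduction started by the Begin calls has completed.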
>
> comb.c has
>
> /*
> Split phase global vector reductions with support for combining the
> communication portion of several operations. Using MPI-1.1 support only
>
> The idea for this and much of the initial code is contributed by
> Victor Eijkhout.
>
> Usage:
> VecDotBegin(Vec,Vec,PetscScalar *);
> VecNormBegin(Vec,NormType,PetscReal *);
> ....
> VecDotEnd(Vec,Vec,PetscScalar *);
> VecNormEnd(Vec,NormType,PetscReal *);
>
> Limitations:
> - The order of the xxxEnd() functions MUST be in the same order
> as the xxxBegin(). There is extensive error checking to try to
> insure that the user calls the routines in the correct order
> */
>
> #include <petsc/private/vecimpl.h> /*I "petscvec.h" I*/
>
> static PetscErrorCode MPIPetsc_Iallreduce(void *sendbuf,void *recvbuf,PetscMPIInt count,MPI_Datatype datatype,MPI_Op op,MPI_Comm comm,MPI_Request *request)
> {
> PETSC_UNUSED PetscErrorCode ierr;
>
> PetscFunctionBegin;
> #if defined(PETSC_HAVE_MPI_IALLREDUCE)
> ierr = MPI_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRMPI(ierr);
> #elif defined(PETSC_HAVE_MPIX_IALLREDUCE)
> ierr = MPIX_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRQ(ierr);
> #else
> ierr = MPIU_Allreduce(sendbuf,recvbuf,count,datatype,op,comm);CHKERRQ(ierr);
> *request = MPI_REQUEST_NULL;
> #endif
> PetscFunctionReturn(0);
> }
>
>
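> As a sketch of the combining described in that comment (assuming two vectors x and y that live on the same communicator), several reductions queued with Begin calls are carried out by a single reduction (an MPI_Iallreduce when available):
>
>   PetscScalar dot;
>   PetscReal   nrm;
>   ierr = VecDotBegin(x,y,&dot);CHKERRQ(ierr);
>   ierr = VecNormBegin(x,NORM_2,&nrm);CHKERRQ(ierr);
>   ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)x));CHKERRQ(ierr); /* posts one reduction for both */
>   /* ... independent local work ... */
>   ierr = VecDotEnd(x,y,&dot);CHKERRQ(ierr);     /* the End calls must come in the same order as the Begins */
>   ierr = VecNormEnd(x,NORM_2,&nrm);CHKERRQ(ierr);
>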
> So first check if $PETSC_DIR/include/petscconf.h has
>
> PETSC_HAVE_MPI_IALLREDUCE
>
> If it does not, then the standard blocking MPIU_Allreduce is called instead.
>
> If it is set, then any improvement depends on the implementation of MPI_Iallreduce inside the MPI library you are using.
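>
> If you want to check what your MPI itself does, independent of PETSc, a small stand-alone test along these lines (a rough sketch; the amount of dummy local work is arbitrary) shows whether MPI_Iallreduce makes progress during local computation:
>
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc,char **argv)
>   {
>     double          local = 1.0,global,t0,t1;
>     volatile double work  = 0.0;
>     long            i;
>     MPI_Request     req;
>
>     MPI_Init(&argc,&argv);
>     t0 = MPI_Wtime();
>     MPI_Iallreduce(&local,&global,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD,&req);
>     for (i=0; i<100000000; i++) work += 1e-9;   /* local work that could hide the reduction */
>     MPI_Wait(&req,MPI_STATUS_IGNORE);
>     t1 = MPI_Wtime();
>     printf("overlapped time: %g\n",t1-t0);
>     MPI_Finalize();
>     return 0;
>   }
>
> If the overlapped time is close to the time of the local loop alone, the reduction is progressing in the background; if it is closer to the loop plus a blocking Allreduce, it is not.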
>
> Barry
>
>
>> On Jan 21, 2021, at 6:52 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>>
>>
>> Hello PETSc developers and supporters,
>>
>> I would like to confirm the performance of asynchronous computation, overlapping an inner-product (norm) calculation with a matrix-vector multiplication, using the code below.
>>
>>
>> PetscLogDouble tt1,tt2;
>> KSP ksp;
>> ierr = VecSet(c,one);
>> ierr = VecSet(u,one);
>> ierr = VecSet(b,one);
>>
>> ierr = KSPCreate(PETSC_COMM_WORLD,&ksp); CHKERRQ(ierr);
>> ierr = KSP_MatMult(ksp,A,x,Ax); CHKERRQ(ierr);
>>
>>
>> ierr = PetscTime(&tt1);CHKERRQ(ierr);
>> ierr = VecNormBegin(u,NORM_2,&norm1);
>> ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>> ierr = KSP_MatMult(ksp,A,c,Ac);
>> ierr = VecNormEnd(u,NORM_2,&norm1);
>> ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>
>> ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the asynchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>> ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |u| = %g\n",(double) norm1); CHKERRQ(ierr);
>>
>>
>> ierr = PetscTime(&tt1);CHKERRQ(ierr);
>> ierr = VecNorm(b,NORM_2,&norm2); CHKERRQ(ierr);
>> ierr = KSP_MatMult(ksp,A,c,Ac);
>> ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>
>> ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the synchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>> ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |b| = %g\n",(double) norm2); CHKERRQ(ierr);
>>
>>
>> On a cluster with two or four nodes, the asynchronous computation is always much slower than the synchronous computation.
>>
>> The time used for the asynchronous calculation: 0.000203
>> + |u| = 100.
>> The time used for the synchronous calculation: 0.000006
>> + |b| = 100.
>>
>> Are there any MPI or PETSc settings needed to gain performance from asynchronous computation?
>>
>> Thank you very much for anything you can provide.
>> Sincerely,
>> Viet.
>>
>