[petsc-users] How to confirm the performance of asynchronous computations
Barry Smith
bsmith at petsc.dev
Fri Jan 22 12:01:08 CST 2021
https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/asynchronous-progress-control.html
It states "and a partial support for non-blocking collectives (MPI_Ibcast, MPI_Ireduce, and MPI_Iallreduce)." I do not know what partial support means but you can try setting the variables and see if that helps.
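For example, a possible launch line (just a sketch: the executable name and process count are placeholders, and depending on the Intel MPI version you may also need to select the multi-threaded release_mt library):

  export I_MPI_ASYNC_PROGRESS=1
  export I_MPI_ASYNC_PROGRESS_THREADS=1
  mpiexec -n 8 ./your_test_program

Whether this actually helps depends on whether the progress threads advance MPI_Iallreduce in the background on your system.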
> On Jan 22, 2021, at 11:20 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>
>
> Dear Victor and Barry,
>
> Thank you so much for your answers.
>
> I fixed the bug in the call to the PetscCommSplitReductionBegin function, as Barry pointed out.
>
> ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));
>
> Setting the vector size too small was also a mistake.
> I set the vector size to 100000000 and ran the code on 4 nodes with 2 processes per node. The results are as follows:
>
> The time used for the asynchronous calculation: 0.022043
> + |u| = 10000.
> The time used for the synchronous calculation: 0.016188
> + |b| = 10000.
>
> The asynchronous computation still takes longer.
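> To reduce timing noise I could also put a barrier in front of the timer and average over many repetitions, roughly as follows (a sketch reusing the variables from the code in my earlier message below; nits is an arbitrary repetition count):
>
>   PetscLogDouble t0,t1;
>   PetscInt       it,nits = 50;
>   ierr = MPI_Barrier(PetscObjectComm((PetscObject)u));CHKERRMPI(ierr);  /* start everyone together */
>   ierr = PetscTime(&t0);CHKERRQ(ierr);
>   for (it=0; it<nits; it++) {
>     ierr = VecNormBegin(u,NORM_2,&norm1);CHKERRQ(ierr);
>     ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));CHKERRQ(ierr);
>     ierr = KSP_MatMult(ksp,A,c,Ac);CHKERRQ(ierr);
>     ierr = VecNormEnd(u,NORM_2,&norm1);CHKERRQ(ierr);
>   }
>   ierr = PetscTime(&t1);CHKERRQ(ierr);
>   ierr = PetscPrintf(PETSC_COMM_WORLD,"average overlapped time: %f\n",(double)((t1-t0)/nits));CHKERRQ(ierr);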
>
> I also confirmed that PETSC_HAVE_MPI_IALLREDUCE is defined in the file $PETSC_DIR/include/petscconf.h
>
> I built PETSc using the following script:
>
> #!/usr/bin/bash
> set -e
> DATE="21.01.18"
> MPIIT_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mpi/intel64"
> MKL_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mkl"
> INSTL_DIR="${HOME}/local/petsc-3.14.3"
> BUILD_DIR="${HOME}/tmp/petsc/build_${DATE}"
> PETSC_DIR="${HOME}/tmp/petsc"
>
> cd ${PETSC_DIR}
> ./configure --force --prefix=${INSTL_DIR} --with-mpi-dir=${MPIIT_DIR} --with-fortran-bindings=0 --with-mpiexec=${MPIIT_DIR}/bin/mpiexec --with-valgrind-dir=${HOME}/local/valgrind --with-blaslapack-dir=${MKL_DIR} --download-make --with-debugging=0 COPTFLAGS='-O3 -march=native -mtune=native' CXXOPTFLAGS='-O3 -march=native -mtune=native' FOPTFLAGS='-O3 -march=native -mtune=native'
>
> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt all
> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt install
>
>
> Intel MPI 2018 also complies with the MPI-3 standard.
>
> Are there specific Intel MPI settings needed to obtain good performance from the MPI_Iallreduce function?
>
> Sincerely,
> Viet.
>
>
> On Fri, Jan 22, 2021 at 11:20 AM Barry Smith <bsmith at petsc.dev> wrote:
>
> ierr = VecNormBegin(u,NORM_2,&norm1);
> ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>
> How come you call this on Ax and not on u? For clarity, if nothing else, I think you should call it on u.
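> Roughly, the intended overlap pattern, as a sketch with your variable names, is
>
>    ierr = VecNormBegin(u,NORM_2,&norm1);CHKERRQ(ierr);
>    ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));CHKERRQ(ierr); /* start the nonblocking reduction */
>    ierr = KSP_MatMult(ksp,A,c,Ac);CHKERRQ(ierr);                                       /* local work that can overlap it */
>    ierr = VecNormEnd(u,NORM_2,&norm1);CHKERRQ(ierr);                                   /* wait for the reduction result */
>
> where the End call blocks until the reduction started by the Begin calls has completed.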
>
> comb.c has
>
> /*
> Split phase global vector reductions with support for combining the
> communication portion of several operations. Using MPI-1.1 support only
>
> The idea for this and much of the initial code is contributed by
> Victor Eijkhout.
>
> Usage:
> VecDotBegin(Vec,Vec,PetscScalar *);
> VecNormBegin(Vec,NormType,PetscReal *);
> ....
> VecDotEnd(Vec,Vec,PetscScalar *);
> VecNormEnd(Vec,NormType,PetscReal *);
>
> Limitations:
> - The order of the xxxEnd() functions MUST be in the same order
> as the xxxBegin(). There is extensive error checking to try to
> insure that the user calls the routines in the correct order
> */
>
> #include <petsc/private/vecimpl.h> /*I "petscvec.h" I*/
>
> static PetscErrorCode MPIPetsc_Iallreduce(void *sendbuf,void *recvbuf,PetscMPIInt count,MPI_Datatype datatype,MPI_Op op,MPI_Comm comm,MPI_Request *request)
> {
> PETSC_UNUSED PetscErrorCode ierr;
>
> PetscFunctionBegin;
> #if defined(PETSC_HAVE_MPI_IALLREDUCE)
> ierr = MPI_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRMPI(ierr);
> #elif defined(PETSC_HAVE_MPIX_IALLREDUCE)
> ierr = MPIX_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRQ(ierr);
> #else
> ierr = MPIU_Allreduce(sendbuf,recvbuf,count,datatype,op,comm);CHKERRQ(ierr);
> *request = MPI_REQUEST_NULL;
> #endif
> PetscFunctionReturn(0);
> }
>
>
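> As a sketch of the combining described in that comment (assuming two vectors x and y that live on the same communicator), several reductions queued with Begin calls are carried out by a single reduction (an MPI_Iallreduce when available):
>
>   PetscScalar dot;
>   PetscReal   nrm;
>   ierr = VecDotBegin(x,y,&dot);CHKERRQ(ierr);
>   ierr = VecNormBegin(x,NORM_2,&nrm);CHKERRQ(ierr);
>   ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)x));CHKERRQ(ierr); /* posts one reduction for both */
>   /* ... independent local work ... */
>   ierr = VecDotEnd(x,y,&dot);CHKERRQ(ierr);     /* the End calls must come in the same order as the Begins */
>   ierr = VecNormEnd(x,NORM_2,&nrm);CHKERRQ(ierr);
>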
> So first check if $PETSC_DIR/include/petscconf.h has
>
> PETSC_HAVE_MPI_IALLREDUCE
>
> If it does not, then the standard blocking MPIU_Allreduce is called instead.
>
> If it is set, then any improvement depends on the implementation of MPI_Iallreduce inside the MPI library you are using.
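>
> If you want to check what your MPI itself does, independent of PETSc, a small stand-alone test along these lines (a rough sketch; the amount of dummy local work is arbitrary) shows whether MPI_Iallreduce makes progress during local computation:
>
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc,char **argv)
>   {
>     double          local = 1.0,global,t0,t1;
>     volatile double work  = 0.0;
>     long            i;
>     MPI_Request     req;
>
>     MPI_Init(&argc,&argv);
>     t0 = MPI_Wtime();
>     MPI_Iallreduce(&local,&global,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD,&req);
>     for (i=0; i<100000000; i++) work += 1e-9;   /* local work that could hide the reduction */
>     MPI_Wait(&req,MPI_STATUS_IGNORE);
>     t1 = MPI_Wtime();
>     printf("overlapped time: %g\n",t1-t0);
>     MPI_Finalize();
>     return 0;
>   }
>
> If the overlapped time is close to the time of the local loop alone, the reduction is progressing in the background; if it is closer to the loop plus a blocking Allreduce, it is not.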
>
> Barry
>
>
>> On Jan 21, 2021, at 6:52 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>>
>>
>> Hello PETSc developers and supporters,
>>
>> I would like to confirm the performance of asynchronous computation, overlapping an inner-product (norm) calculation with a matrix-vector multiplication, using the code below.
>>
>>
>> PetscLogDouble tt1,tt2;
>> KSP ksp;
>> ierr = VecSet(c,one);
>> ierr = VecSet(u,one);
>> ierr = VecSet(b,one);
>>
>> ierr = KSPCreate(PETSC_COMM_WORLD,&ksp); CHKERRQ(ierr);
>> ierr = KSP_MatMult(ksp,A,x,Ax); CHKERRQ(ierr);
>>
>>
>> ierr = PetscTime(&tt1);CHKERRQ(ierr);
>> ierr = VecNormBegin(u,NORM_2,&norm1);
>> ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>> ierr = KSP_MatMult(ksp,A,c,Ac);
>> ierr = VecNormEnd(u,NORM_2,&norm1);
>> ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>
>> ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the asynchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>> ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |u| = %g\n",(double) norm1); CHKERRQ(ierr);
>>
>>
>> ierr = PetscTime(&tt1);CHKERRQ(ierr);
>> ierr = VecNorm(b,NORM_2,&norm2); CHKERRQ(ierr);
>> ierr = KSP_MatMult(ksp,A,c,Ac);
>> ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>
>> ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the synchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>> ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |b| = %g\n",(double) norm2); CHKERRQ(ierr);
>>
>>
>> On a cluster with two or four nodes, the asynchronous computation is always much slower than the synchronous computation.
>>
>> The time used for the asynchronous calculation: 0.000203
>> + |u| = 100.
>> The time used for the synchronous calculation: 0.000006
>> + |b| = 100.
>>
>> Are there any MPI or PETSc settings needed to gain performance from asynchronous computation?
>>
>> Thank you very much for anything you can provide.
>> Sincerely,
>> Viet.
>>
>