[petsc-users] How to confirm the performance of asynchronous computations

Mon Jan 25 02:17:51 CST 2021

Dear Barry,

Thank you very much for your information.

It seems complicated to set environment variables to allow asynchronous
progress and pinning threads to cores when using Intel MPI.

$ export I_MPI_ASYNC_PROGRESS = 1
$ export I_MPI_ASYNC_PROGRESS_PIN = <CPU list>

https://techdecoded.intel.io/resources/hiding-communication-latency-using-mpi-3-non-blocking-collectives/

I'm still not sure how to get an appropriate "CPU list" when running MPI
with multiple nodes and multiple processes on one node.
Best,
Viet.

On Sat, Jan 23, 2021 at 3:01 AM Barry Smith <bsmith at petsc.dev> wrote:

>
>
> https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/asynchronous-progress-control.html
>
> It states "and a partial support for non-blocking collectives ( MPI_Ibcas
>  t, MPI_Ireduce , and MPI_Iallreduce )."  I do not know what partial
> support means but you can try setting the variables and see if that helps.
>
>
>
> On Jan 22, 2021, at 11:20 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>
>
> Dear Victor and Berry,
>
> Thank you so much for your answers.
>
> I fixed the code with the bug in the PetscCommSplitReductionBegin function
> as commented by Brave.
>
>      ierr = PetscCommSplitReductionBegin (PetscObjectComm ((PetscObject)
> u));
>
> It was also a mistake to set the vector size too small.
> I just set a vector size of 100000000 and ran the code on 4 nodes with 2
> processors per node. The result is as follows
>
> The time used for the asynchronous calculation: 0.022043
> + | u | = 10000.
> The time used for the synchronous calculation: 0.016188
> + | b | = 10000.
>
> Asynchronous computation still takes a longer time.
>
> I also confirmed that PETSC_HAVE_MPI_IALLREDUCE is defined in the
> file $PETSC_DIR/include/petscconf.h
>
> I built Petsc by using the following script
>
> #!/usr/bin/bash
> set -e
> DATE="21.01.18"
>
> MPIIT_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mpi/intel64"
>
> MKL_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mkl"
> INSTL_DIR="${HOME}/local/petsc-3.14.3"
> BUILD_DIR="${HOME}/tmp/petsc/build_${DATE}"
> PETSC_DIR="${HOME}/tmp/petsc"
>
> cd ${PETSC_DIR}
> ./configure --force --prefix=${INSTL_DIR} --with-mpi-dir=${MPIIT_DIR}
>  --with-fortran-bindings=0 --with-mpiexe=${MPIIT_DIR}/bin/mpiexec
> --with-valgrind-dir=${HOME}/local/valgrind --with-blaslapack-dir=${MKL_DIR}
> --download-make --with-debugging=0 COPTFLAGS='-O3 -march=native
> -mtune=native' CXXOPTFLAGS='-O3 -march=native -mtune=native' FOPTFLAGS='-O3
> -march=native -mtune=native'
>
> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt all
> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt install
>
>
> Intel 2018 also complies with the MPI-3 standard.
>
> Are there specific settings for Intel MPI to obtain the performance of the
> MPI_IALLREDUCE function?
>
> Sincerely,
> Viet.
>
>
> On Fri, Jan 22, 2021 at 11:20 AM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>   ierr = VecNormBegin(u,NORM_2,&norm1);
>>     ierr =
>> PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>>
>> How come you call this on Ax and not on u? For clarity, if nothing else,
>> I think you should call it on u.
>>
>> comb.c has
>>
>> /*
>>       Split phase global vector reductions with support for combining the
>>    communication portion of several operations. Using MPI-1.1 support only
>>
>>       The idea for this and much of the initial code is contributed by
>>    Victor Eijkhout.
>>
>>        Usage:
>>              VecDotBegin(Vec,Vec,PetscScalar *);
>>              VecNormBegin(Vec,NormType,PetscReal *);
>>              ....
>>              VecDotEnd(Vec,Vec,PetscScalar *);
>>              VecNormEnd(Vec,NormType,PetscReal *);
>>
>>        Limitations:
>>          - The order of the xxxEnd() functions MUST be in the same order
>>            as the xxxBegin(). There is extensive error checking to try to
>>            insure that the user calls the routines in the correct order
>> */
>>
>> #include <petsc/private/vecimpl.h>    /*I   "petscvec.h"    I*/
>>
>> static PetscErrorCode MPIPetsc_Iallreduce(void *sendbuf,void
>> *recvbuf,PetscMPIInt count,MPI_Datatype datatype,MPI_Op op,MPI_Comm
>> comm,MPI_Request *request)
>> {
>>   PETSC_UNUSED PetscErrorCode ierr;
>>
>>   PetscFunctionBegin;
>> #if defined(PETSC_HAVE_MPI_IALLREDUCE)
>>   ierr =
>> MPI_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRMPI(ierr);
>> #elif defined(PETSC_HAVE_MPIX_IALLREDUCE)
>>   ierr =
>> MPIX_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRQ(ierr);
>> #else
>>   ierr =
>> MPIU_Allreduce(sendbuf,recvbuf,count,datatype,op,comm);CHKERRQ(ierr);
>>   *request = MPI_REQUEST_NULL;
>> #endif
>>   PetscFunctionReturn(0);
>> }
>>
>>
>> So first check if $PETSC_DIR/include/petscconf.h has
>>
>> PETSC_HAVE_MPI_IALLREDUCE
>>
>> if it does not then the standard MPI reduce is called.
>>
>> If this is set then any improvement depends on the implementation of
>> iallreduce inside the MPI you are using.
>>
>> Barry
>>
>>
>> On Jan 21, 2021, at 6:52 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>>
>>
>> Hello Petsc developers and supporters,
>>
>> I would like to confirm the performance of asynchronous computations of
>> inner product computation overlapping with matrix-vector multiplication
>> computation by the below code.
>>
>>
>>  PetscLogDouble tt1,tt2;
>>     KSP ksp;
>>     //ierr = VecSet(c,one);
>>     ierr = VecSet(c,one);
>>     ierr = VecSet(u,one);
>>     ierr = VecSet(b,one);
>>
>>     ierr = KSPCreate(PETSC_COMM_WORLD,&ksp); CHKERRQ(ierr);
>>     ierr = KSP_MatMult(ksp,A,x,Ax); CHKERRQ(ierr);
>>
>>
>>     ierr = PetscTime(&tt1);CHKERRQ(ierr);
>>     ierr = VecNormBegin(u,NORM_2,&norm1);
>>     ierr =
>> PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>>     ierr = KSP_MatMult(ksp,A,c,Ac);
>>     ierr = VecNormEnd(u,NORM_2,&norm1);
>>     ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>
>>     ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the
>> asynchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>>     ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |u| =  %g\n",(double) norm1);
>> CHKERRQ(ierr);
>>
>>
>>     ierr = PetscTime(&tt1);CHKERRQ(ierr);
>>     ierr = VecNorm(b,NORM_2,&norm2); CHKERRQ(ierr);
>>     ierr = KSP_MatMult(ksp,A,c,Ac);
>>     ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>
>>     ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the
>> synchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>>     ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |b| =  %g\n",(double) norm2);
>> CHKERRQ(ierr);
>>
>>
>> On a cluster with two or four nodes, the asynchronous computation is
>> always much slower than synchronous computation.
>>
>> The time used for the asynchronous calculation: 0.000203
>> + |u| =  100.
>> The time used for the synchronous calculation: 0.000006
>> + |b| =  100.
>>
>> Are there any necessary settings on MPI or Petsc to gain performance of
>> asynchronous computation?
>>
>> Thank you very much for anything you can provide.
>> Sincerely,
>> Viet.
>>
>>
>>
>>
>>
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20210125/3228fd72/attachment.html>