[petsc-users] How to confirm the performance of asynchronous computations

Viet H.Q.H. hqhviet at tohoku.ac.jp
Mon Jan 25 22:31:18 CST 2021


Dear Patrick Sanan,

Thank you very much for your answer, especially for your code.
I was able to compile and run your code on 8 nodes with 20 processes per
node. Below is the result

Testing with 160 MPI ranks
reducing an array of size 32 (256 bytes)
Running 5 burnin runs and 100 tests ...  Done.
For 100 runs with 5 burnin runs, on 160 MPI processes, min/max times over
all ranks:
MPI timer resolution:         1.0000e-06 seconds
MPI timer resolution/#trials: 1.0000e-08 seconds
B.   Red. Only (min/max):    8.850098e-06 / 8.890629e-06 seconds
N.B. Red. Only (min/max):    1.725912e-05 / 1.733065e-05 seconds
Loc. Only (min/max):         2.364278e-04 / 2.374697e-04 seconds
Blocking (min/max):          2.650309e-04 / 2.650595e-04 seconds
Non-Blocking (min/max):      2.673984e-04 / 2.674508e-04 seconds
Observe to see if the local time is enough to hide the reduction, and see
if the reduction is indeed hidden


It appears that the non-blocking computation in this test is no faster than
the blocking one: the non-blocking time (about 2.67e-04 s) is essentially the
same as the blocking time (about 2.65e-04 s), and both are close to the local
work plus the reduction, so the reduction does not appear to be hidden.
I think I am missing some suitable Intel MPI environment settings.
I am now thinking about using MPICH, which does not require any environment
settings for non-blocking computation.
Could you please let me know which MPI (MPICH or OpenMPI) you used in your
tests?
Thanks again.
Viet


On Mon, Jan 25, 2021 at 7:47 PM Patrick Sanan <patrick.sanan at gmail.com>
wrote:

> Sorry about the delay in responding, but I'll add a couple of points here:
>
>
> 1) It's important to have some reason to believe that pipelining will
> actually help your problem. Pipelined Krylov methods work by overlapping
> reductions with operator and preconditioner applications. So, to see
> speedup, the time for a reduction needs to be comparable to the time for
> the operator/preconditioner application. This will only be true in some
> cases - typical cases are when you have a large number of ranks/nodes, a
> slow network, or very fast operator/preconditioner applications (assuming
> that these require the same time on each rank - it's an interesting case
> when they don't, but unless you say otherwise I'll assume this doesn't
> apply to your use case).
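>
> (As a rough way to state point 1, and only as a sketch: if $T_{\mathrm{op}}$
> is the time for an operator/preconditioner application and
> $T_{\mathrm{red}}$ the time for a global reduction, then ideally
>
>   $T_{\mathrm{blocking}} \approx T_{\mathrm{op}} + T_{\mathrm{red}}$,
>   $T_{\mathrm{pipelined}} \approx \max(T_{\mathrm{op}}, T_{\mathrm{red}})$,
>
> so the best possible saving is about $\min(T_{\mathrm{op}},
> T_{\mathrm{red}})$, which is only significant when the two times are
> comparable.)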
>
> 2) As you're discovering, simply ensuring that asynchronous progress works,
> at the pure MPI level, isn't as easy as it might seem, since it's so
> dependent on the MPI implementation.
>
>
> For both of these reasons, I suggest setting up a test that just directly
> uses MPI (which you can of course do from a PETSc-style code) and allows
> you to compare times for blocking and non-blocking reductions, overlapping
> some (useless) local work. You should make sure to run multiple iterations
> within the script, and also run the script multiple times on the cluster
> (bearing in mind that it's possible that the performance will be affected
> by other users of the system).
>
> I attach an old script I found that I used to test some of these things,
> to give a more concrete idea of what I mean. Note that this was used early
> on in our own exploration of these topics so I'm only offering it to give
> an idea, not as a meaningful benchmark in its own right.
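>
> In outline (this is only an illustrative sketch with arbitrary sizes, not
> the attached script itself), the comparison looks something like this:
>
> #include <mpi.h>
> #include <stdio.h>
>
> #define N      32      /* size of the reduced array      */
> #define NLOCAL 100000  /* amount of "useless" local work */
>
> int main(int argc, char **argv)
> {
>   double      in[N], out[N], local[NLOCAL];
>   double      t0, t_block, t_nonblock;
>   int         i, rank;
>   MPI_Request req;
>
>   MPI_Init(&argc, &argv);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   for (i = 0; i < N; i++)      in[i]    = (double)rank;
>   for (i = 0; i < NLOCAL; i++) local[i] = 1.0;
>
>   /* Blocking: do the reduction, then the local work */
>   MPI_Barrier(MPI_COMM_WORLD);
>   t0 = MPI_Wtime();
>   MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
>   for (i = 0; i < NLOCAL; i++) local[i] = local[i] * 1.000001 + 1.0e-6;
>   t_block = MPI_Wtime() - t0;
>
>   /* Non-blocking: start the reduction, do the local work, then wait */
>   MPI_Barrier(MPI_COMM_WORLD);
>   t0 = MPI_Wtime();
>   MPI_Iallreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
>   for (i = 0; i < NLOCAL; i++) local[i] = local[i] * 1.000001 + 1.0e-6;
>   MPI_Wait(&req, MPI_STATUS_IGNORE);
>   t_nonblock = MPI_Wtime() - t0;
>
>   /* Print local[0] too, so the local work cannot be optimized away */
>   if (rank == 0)
>     printf("blocking %g s, non-blocking %g s (check %g)\n",
>            t_block, t_nonblock, local[0]);
>   MPI_Finalize();
>   return 0;
> }
>
> If the non-blocking time stays close to the blocking time even though the
> local work takes much longer than the reduction, asynchronous progress is
> probably not actually happening.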
>
> On 25.01.2021, at 09:17, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>
>
> Dear Barry,
>
> Thank you very much for your information.
>
> It seems complicated to set environment variables to allow asynchronous
> progress and pinning threads to cores when using Intel MPI.
>
> $ export I_MPI_ASYNC_PROGRESS=1
> $ export I_MPI_ASYNC_PROGRESS_PIN=<CPU list>
>
>
>
> https://techdecoded.intel.io/resources/hiding-communication-latency-using-mpi-3-non-blocking-collectives/
>
> I'm still not sure how to get an appropriate "CPU list" when running MPI
> with multiple nodes and multiple processes on one node.
> Best,
> Viet.
>
>
>
>
> On Sat, Jan 23, 2021 at 3:01 AM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>
>> https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/asynchronous-progress-control.html
>>
>> It states "and a partial support for non-blocking collectives ( MPI_Ibcas
>>  t, MPI_Ireduce , and MPI_Iallreduce )."  I do not know what partial
>> support means but you can try setting the variables and see if that helps.
>>
>>
>>
>> On Jan 22, 2021, at 11:20 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>>
>>
>> Dear Victor and Barry,
>>
>> Thank you so much for your answers.
>>
>> I fixed the bug in the call to the PetscCommSplitReductionBegin
>> function, as pointed out by Barry.
>>
>>      ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));
>>
>> It was also a mistake to set the vector size too small.
>> This time I set the vector size to 100000000 and ran the code on 4 nodes
>> with 2 processes per node. The results are as follows:
>>
>> The time used for the asynchronous calculation: 0.022043
>> + | u | = 10000.
>> The time used for the synchronous calculation: 0.016188
>> + | b | = 10000.
>>
>> Asynchronous computation still takes a longer time.
>>
>> I also confirmed that PETSC_HAVE_MPI_IALLREDUCE is defined in the
>> file $PETSC_DIR/include/petscconf.h
>>
>> I built PETSc using the following script:
>>
>> #!/usr/bin/bash
>> set -e
>> DATE="21.01.18"
>>
>> MPIIT_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mpi/intel64"
>>
>> MKL_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mkl"
>> INSTL_DIR="${HOME}/local/petsc-3.14.3"
>> BUILD_DIR="${HOME}/tmp/petsc/build_${DATE}"
>> PETSC_DIR="${HOME}/tmp/petsc"
>>
>> cd ${PETSC_DIR}
>> ./configure --force --prefix=${INSTL_DIR} --with-mpi-dir=${MPIIT_DIR} \
>>   --with-fortran-bindings=0 --with-mpiexec=${MPIIT_DIR}/bin/mpiexec \
>>   --with-valgrind-dir=${HOME}/local/valgrind \
>>   --with-blaslapack-dir=${MKL_DIR} \
>>   --download-make --with-debugging=0 \
>>   COPTFLAGS='-O3 -march=native -mtune=native' \
>>   CXXOPTFLAGS='-O3 -march=native -mtune=native' \
>>   FOPTFLAGS='-O3 -march=native -mtune=native'
>>
>> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt all
>> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt install
>>
>>
>> Intel MPI 2018 also complies with the MPI-3 standard.
>>
>> Are there specific Intel MPI settings needed to obtain the expected
>> performance from MPI_Iallreduce?
>>
>> Sincerely,
>> Viet.
>>
>>
>> On Fri, Jan 22, 2021 at 11:20 AM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>     ierr = VecNormBegin(u,NORM_2,&norm1);
>>>     ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>>>
>>> How come you call this on Ax and not on u? For clarity, if nothing else,
>>> I think you should call it on u.
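>>>
>>> For example (just a sketch, reusing the names from your snippet), the
>>> overlapped norm would then read:
>>>
>>>     ierr = VecNormBegin(u,NORM_2,&norm1);CHKERRQ(ierr);
>>>     /* start the communication for the reduction on u's communicator */
>>>     ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));CHKERRQ(ierr);
>>>     /* local work to overlap with the reduction */
>>>     ierr = KSP_MatMult(ksp,A,c,Ac);CHKERRQ(ierr);
>>>     ierr = VecNormEnd(u,NORM_2,&norm1);CHKERRQ(ierr);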
>>>
>>> comb.c has
>>>
>>> /*
>>>       Split phase global vector reductions with support for combining the
>>>    communication portion of several operations. Using MPI-1.1 support
>>> only
>>>
>>>       The idea for this and much of the initial code is contributed by
>>>    Victor Eijkhout.
>>>
>>>        Usage:
>>>              VecDotBegin(Vec,Vec,PetscScalar *);
>>>              VecNormBegin(Vec,NormType,PetscReal *);
>>>              ....
>>>              VecDotEnd(Vec,Vec,PetscScalar *);
>>>              VecNormEnd(Vec,NormType,PetscReal *);
>>>
>>>        Limitations:
>>>          - The order of the xxxEnd() functions MUST be in the same order
>>>            as the xxxBegin(). There is extensive error checking to try to
>>>            insure that the user calls the routines in the correct order
>>> */
>>>
>>> #include <petsc/private/vecimpl.h>    /*I   "petscvec.h"    I*/
>>>
>>> static PetscErrorCode MPIPetsc_Iallreduce(void *sendbuf,void
>>> *recvbuf,PetscMPIInt count,MPI_Datatype datatype,MPI_Op op,MPI_Comm
>>> comm,MPI_Request *request)
>>> {
>>>   PETSC_UNUSED PetscErrorCode ierr;
>>>
>>>   PetscFunctionBegin;
>>> #if defined(PETSC_HAVE_MPI_IALLREDUCE)
>>>   ierr =
>>> MPI_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRMPI(ierr);
>>> #elif defined(PETSC_HAVE_MPIX_IALLREDUCE)
>>>   ierr =
>>> MPIX_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRQ(ierr);
>>> #else
>>>   ierr =
>>> MPIU_Allreduce(sendbuf,recvbuf,count,datatype,op,comm);CHKERRQ(ierr);
>>>   *request = MPI_REQUEST_NULL;
>>> #endif
>>>   PetscFunctionReturn(0);
>>> }
>>>
>>>
>>> So first check if $PETSC_DIR/include/petscconf.h has
>>>
>>> PETSC_HAVE_MPI_IALLREDUCE
>>>
>>> If it does not, then the standard blocking MPI reduce is called.
>>>
>>> If it is set, then any improvement depends on the implementation of
>>> MPI_Iallreduce inside the MPI library you are using.
>>>
>>> Barry
>>>
>>>
>>> On Jan 21, 2021, at 6:52 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>>>
>>>
>>> Hello Petsc developers and supporters,
>>>
>>> I would like to confirm the performance of asynchronous computation, in
>>> which an inner-product (norm) computation is overlapped with a
>>> matrix-vector multiplication, using the code below.
>>>
>>>
>>>  PetscLogDouble tt1,tt2;
>>>     KSP ksp;
>>>     //ierr = VecSet(c,one);
>>>     ierr = VecSet(c,one);
>>>     ierr = VecSet(u,one);
>>>     ierr = VecSet(b,one);
>>>
>>>     ierr = KSPCreate(PETSC_COMM_WORLD,&ksp); CHKERRQ(ierr);
>>>     ierr = KSP_MatMult(ksp,A,x,Ax); CHKERRQ(ierr);
>>>
>>>
>>>     ierr = PetscTime(&tt1);CHKERRQ(ierr);
>>>     ierr = VecNormBegin(u,NORM_2,&norm1);
>>>     ierr =
>>> PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>>>     ierr = KSP_MatMult(ksp,A,c,Ac);
>>>     ierr = VecNormEnd(u,NORM_2,&norm1);
>>>     ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>>
>>>     ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the
>>> asynchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>>>     ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |u| =  %g\n",(double) norm1);
>>> CHKERRQ(ierr);
>>>
>>>
>>>     ierr = PetscTime(&tt1);CHKERRQ(ierr);
>>>     ierr = VecNorm(b,NORM_2,&norm2); CHKERRQ(ierr);
>>>     ierr = KSP_MatMult(ksp,A,c,Ac);
>>>     ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>>
>>>     ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the
>>> synchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>>>     ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |b| =  %g\n",(double) norm2);
>>> CHKERRQ(ierr);
>>>
>>>
>>> On a cluster with two or four nodes, the asynchronous computation is
>>> always much slower than synchronous computation.
>>>
>>> The time used for the asynchronous calculation: 0.000203
>>> + |u| =  100.
>>> The time used for the synchronous calculation: 0.000006
>>> + |b| =  100.
>>>
>>> Are there any necessary settings in MPI or PETSc to gain performance from
>>> asynchronous computation?
>>>
>>> Thank you very much for anything you can provide.
>>> Sincerely,
>>> Viet.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>