[petsc-users] How to confirm the performance of asynchronous computations

Matthew Knepley knepley at gmail.com
Wed Jan 27 10:30:11 CST 2021


On Wed, Jan 27, 2021 at 10:51 AM Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:

> Dear Patrick and Matthew,
>
> Thank you very much for your answers.
> My test code just contains the inner-product calculation and the
> matrix-vector multiplication, so it is easy to measure the time of each
> calculation. I'm going to revise the code to measure the time as it's done
> in Patrick's code. I will also switch to MPICH so as not to worry about the
> MPI environment settings.
>
> With the current number of nodes (8), the reduction time for calculating
> the inner product is relatively small, but I think it would get bigger if I
> increased the number of nodes to 100 or more. In this case, the latency of
> the inner product calculation would increase, since the reduced values are
> transmitted via a complex network structure with cables and switches. Then,
> hopefully, the non-blocking communication can show its effectiveness.
>

This is very important to do _first_. It would probably only take you a day
to measure the Allreduce time on your target, say the whole machine you run
on.
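
For instance, a bare-bones pure-MPI timing loop along the lines of the
sketch below would do (only a sketch, not a tuned benchmark; the maximum
over ranks is the number to compare against your operator/preconditioner
time):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  const int ntrials = 1000;
  double    sendbuf = 1.0, recvbuf, t, tmax;
  int       rank, size, i;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* warm-up reduction so first-call setup costs are not measured */
  MPI_Allreduce(&sendbuf, &recvbuf, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  MPI_Barrier(MPI_COMM_WORLD);
  t = MPI_Wtime();
  for (i = 0; i < ntrials; ++i)
    MPI_Allreduce(&sendbuf, &recvbuf, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  t = (MPI_Wtime() - t) / ntrials;

  /* report the slowest rank */
  MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (!rank) printf("%d ranks: average MPI_Allreduce time %g s\n", size, tmax);

  MPI_Finalize();
  return 0;
}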

  Thanks

     Matt


> Best,
> Viet
>
>
> On Wed, Jan 27, 2021 at 2:53 AM Patrick Sanan <patrick.sanan at gmail.com>
> wrote:
>
>>
>>
>> On 26.01.2021, at 12:01, Matthew Knepley <knepley at gmail.com> wrote:
>>
>> On Mon, Jan 25, 2021 at 11:31 PM Viet H.Q.H. <hqhviet at tohoku.ac.jp>
>> wrote:
>>
>>> Dear Patrick Sanan,
>>>
>>> Thank you very much for your answer, especially for your code.
>>> I was able to compile and run your code on 8 nodes with 20 processes per
>>> node. Below is the result
>>>
>>> Testing with 160 MPI ranks
>>> reducing an array of size 32 (256 bytes)
>>> Running 5 burnin runs and 100 tests ...  Done.
>>> For 100 runs with 5 burnin runs, on 160 MPI processes, min/max times
>>> over all ranks:
>>> MPI timer resolution:         1.0000e-06 seconds
>>> MPI timer resolution/#trials: 1.0000e-08 seconds
>>> B.   Red. Only (min/max):    8.850098e-06 / 8.890629e-06 seconds
>>> N.B. Red. Only (min/max):    1.725912e-05 / 1.733065e-05 seconds
>>> Loc. Only (min/max):         2.364278e-04 / 2.374697e-04 seconds
>>> Blocking (min/max):          2.650309e-04 / 2.650595e-04 seconds
>>> Non-Blocking (min/max):      2.673984e-04 / 2.674508e-04 seconds
>>> Observe to see if the local time is enough to hide the reduction, and
>>> see if the reduction is indeed hidden
>>>
>>>
>>> It appears that the non-blocking computation with this test is no faster
>>> than the blocking computation.
>>> I think I am missing some suitable Intel MPI environment settings.
>>> I am now thinking about using MPICH, which does not require any
>>> environment settings for non-blocking computation.
>>> Could you please let me know which MPI (MPICH or OpenMPI) you used in
>>> your tests?
>>>
>> I used Cray MPI, which is based on MPICH.
>>
>>
>> Note that in the test, the cost of the reduction is only 3% of the total
>> cost. Is that worth putting time into? Is the cost definitely
>> larger in your application?
>>
>> This is very important! Note that the reductions are quite fast compared
>> to the local work. Even on a much larger number of ranks, they might still
>> be quite fast. The local work here is just some useless computation (here
>> tuned to be comparable to how long it took to do a reduction on a few
>> thousand ranks at the time). As Matt says, having some estimate of how long
>> it takes to apply your operator (matrix) and preconditioner is essential.
>> Do you know how to estimate that for your application?
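>>
>> A rough way to get that estimate is to look at the MatMult and PCApply
>> lines in the -log_view output, or to time the operator directly. The
>> fragment below is only a sketch and assumes an already assembled Mat A
>> and two compatible Vecs x and y:
>>
>>   PetscLogDouble t0, t1;
>>   PetscInt       i, nreps = 100;
>>
>>   ierr = MatMult(A, x, y);CHKERRQ(ierr);   /* warm-up */
>>   ierr = MPI_Barrier(PetscObjectComm((PetscObject)A));CHKERRMPI(ierr);
>>   ierr = PetscTime(&t0);CHKERRQ(ierr);
>>   for (i = 0; i < nreps; ++i) {
>>     ierr = MatMult(A, x, y);CHKERRQ(ierr);
>>   }
>>   ierr = PetscTime(&t1);CHKERRQ(ierr);
>>   ierr = PetscPrintf(PetscObjectComm((PetscObject)A),
>>                      "average MatMult time: %g s\n",(double)((t1 - t0)/nreps));CHKERRQ(ierr);
>>
>> Comparing that number against the measured reduction time tells you
>> whether there is anything worth hiding.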
>>
>>
>>    Thanks,
>>
>>       Matt
>>
>>
>>> Thanks again.
>>> Viet
>>>
>>>
>>> On Mon, Jan 25, 2021 at 7:47 PM Patrick Sanan <patrick.sanan at gmail.com>
>>> wrote:
>>>
>>>> Sorry about the delay in responding, but I'll add a couple of points
>>>> here:
>>>>
>>>>
>>>> 1) It's important to have some reason to believe that pipelining will
>>>> actually help your problem. Pipelined Krylov methods work by overlapping
>>>> reductions with operator and preconditioner applications. So, to see
>>>> speedup, the time for a reduction needs to be comparable to the time for
>>>> the operator/preconditioner application. This will only be true in some
>>>> cases - typical cases are when you have a large number of ranks/nodes, a
>>>> slow network, or very fast operator/preconditioner applications (assuming
>>>> that these require the same time on each rank - it's an interesting case
>>>> when they don't, but unless you say otherwise I'll assume this doesn't
>>>> apply to your use case). A minimal sketch of selecting such a pipelined
>>>> method in PETSc appears after point 2 below.
>>>>
>>>> 2) As you're discovering, simply ensuring that asynchronous progress
>>>> works at the pure MPI level isn't as easy as it might seem, since it is
>>>> so dependent on the MPI implementation.
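>>>>
>>>> (Sketch for point 1: if your measurements do show that a reduction costs
>>>> about as much as an operator application, the pipelined variants can be
>>>> selected in PETSc without rewriting your code, e.g. with -ksp_type pipecg
>>>> on the command line or programmatically as below; ksp is assumed to be an
>>>> already-created KSP.)
>>>>
>>>>   /* sketch only: pick a pipelined Krylov method so the reduction
>>>>      overlaps the operator application; KSPPIPECR, KSPPIPEGCR, etc.
>>>>      are other options */
>>>>   ierr = KSPSetType(ksp, KSPPIPECG);CHKERRQ(ierr);
>>>>   ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);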
>>>>
>>>>
>>>> For both of these reasons, I suggest setting up a test that just
>>>> directly uses MPI (which you can of course do from a PETSc-style code) and
>>>> allows you to compare times for blocking and non-blocking reductions,
>>>> overlapping some (useless) local work. You should make sure to run multiple
>>>> iterations within the script, and also run the script multiple times on the
>>>> cluster (bearing in mind that it's possible that the performance will be
>>>> affected by other users of the system).
>>>>
>>>> I attach an old script I found that I used to test some of these
>>>> things, to give a more concrete idea of what I mean. Note that this was
>>>> used early on in our own exploration of these topics so I'm only offering
>>>> it to give an idea, not as a meaningful benchmark in its own right.
>>>>
>>>> On 25.01.2021, at 09:17, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>>>>
>>>>
>>>> Dear Barry,
>>>>
>>>> Thank you very much for your information.
>>>>
>>>> It seems complicated to set the environment variables that enable
>>>> asynchronous progress and pin the progress threads to cores when using
>>>> Intel MPI.
>>>>
>>>> $ export I_MPI_ASYNC_PROGRESS=1
>>>> $ export I_MPI_ASYNC_PROGRESS_PIN=<CPU list>
>>>>
>>>>
>>>>
>>>> https://techdecoded.intel.io/resources/hiding-communication-latency-using-mpi-3-non-blocking-collectives/
>>>>
>>>> I'm still not sure how to get an appropriate "CPU list" when running
>>>> MPI with multiple nodes and multiple processes on one node.
>>>> Best,
>>>> Viet.
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Jan 23, 2021 at 3:01 AM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>>
>>>>> https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/asynchronous-progress-control.html
>>>>>
>>>>> It states "and a partial support for non-blocking collectives
>>>>> (MPI_Ibcast, MPI_Ireduce, and MPI_Iallreduce)." I do not know what
>>>>> partial support means, but you can try setting the variables and see
>>>>> if that helps.
>>>>>
>>>>>
>>>>>
>>>>> On Jan 22, 2021, at 11:20 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp>
>>>>> wrote:
>>>>>
>>>>>
>>>>> Dear Victor and Barry,
>>>>>
>>>>> Thank you so much for your answers.
>>>>>
>>>>> I fixed the bug in the call to the PetscCommSplitReductionBegin
>>>>> function, as commented by Barry.
>>>>>
>>>>>      ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));
>>>>>
>>>>> It was also a mistake to set the vector size too small. I have now set
>>>>> the vector size to 100000000 and run the code on 4 nodes with 2
>>>>> processes per node. The result is as follows:
>>>>>
>>>>> The time used for the asynchronous calculation: 0.022043
>>>>> + | u | = 10000.
>>>>> The time used for the synchronous calculation: 0.016188
>>>>> + | b | = 10000.
>>>>>
>>>>> Asynchronous computation still takes a longer time.
>>>>>
>>>>> I also confirmed that PETSC_HAVE_MPI_IALLREDUCE is defined in the
>>>>> file $PETSC_DIR/include/petscconf.h
>>>>>
>>>>> I built Petsc by using the following script
>>>>>
>>>>> #!/usr/bin/bash
>>>>> set -e
>>>>> DATE="21.01.18"
>>>>>
>>>>> MPIIT_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mpi/intel64"
>>>>>
>>>>> MKL_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mkl"
>>>>> INSTL_DIR="${HOME}/local/petsc-3.14.3"
>>>>> BUILD_DIR="${HOME}/tmp/petsc/build_${DATE}"
>>>>> PETSC_DIR="${HOME}/tmp/petsc"
>>>>>
>>>>> cd ${PETSC_DIR}
>>>>> ./configure --force --prefix=${INSTL_DIR} --with-mpi-dir=${MPIIT_DIR} \
>>>>>   --with-fortran-bindings=0 --with-mpiexec=${MPIIT_DIR}/bin/mpiexec \
>>>>>   --with-valgrind-dir=${HOME}/local/valgrind \
>>>>>   --with-blaslapack-dir=${MKL_DIR} --download-make --with-debugging=0 \
>>>>>   COPTFLAGS='-O3 -march=native -mtune=native' \
>>>>>   CXXOPTFLAGS='-O3 -march=native -mtune=native' \
>>>>>   FOPTFLAGS='-O3 -march=native -mtune=native'
>>>>>
>>>>> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt all
>>>>> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt install
>>>>>
>>>>>
>>>>> Intel MPI 2018 also complies with the MPI-3 standard.
>>>>>
>>>>> Are there specific settings for Intel MPI to obtain the expected
>>>>> performance from MPI_Iallreduce?
>>>>>
>>>>> Sincerely,
>>>>> Viet.
>>>>>
>>>>>
>>>>> On Fri, Jan 22, 2021 at 11:20 AM Barry Smith <bsmith at petsc.dev> wrote:
>>>>>
>>>>>>
>>>>>>   ierr = VecNormBegin(u,NORM_2,&norm1);
>>>>>>   ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>>>>>>
>>>>>> How come you call this on Ax and not on u? For clarity, if nothing
>>>>>> else, I think you should call it on u.
>>>>>>
>>>>>> comb.c has
>>>>>>
>>>>>> /*
>>>>>>       Split phase global vector reductions with support for combining
>>>>>>    the communication portion of several operations. Using MPI-1.1
>>>>>>    support only
>>>>>>
>>>>>>       The idea for this and much of the initial code is contributed by
>>>>>>    Victor Eijkhout.
>>>>>>
>>>>>>        Usage:
>>>>>>              VecDotBegin(Vec,Vec,PetscScalar *);
>>>>>>              VecNormBegin(Vec,NormType,PetscReal *);
>>>>>>              ....
>>>>>>              VecDotEnd(Vec,Vec,PetscScalar *);
>>>>>>              VecNormEnd(Vec,NormType,PetscReal *);
>>>>>>
>>>>>>        Limitations:
>>>>>>          - The order of the xxxEnd() functions MUST be in the same order
>>>>>>            as the xxxBegin(). There is extensive error checking to try
>>>>>>            to insure that the user calls the routines in the correct
>>>>>>            order
>>>>>> */
>>>>>>
>>>>>> #include <petsc/private/vecimpl.h>    /*I   "petscvec.h"    I*/
>>>>>>
>>>>>> static PetscErrorCode MPIPetsc_Iallreduce(void *sendbuf,void *recvbuf,PetscMPIInt count,MPI_Datatype datatype,MPI_Op op,MPI_Comm comm,MPI_Request *request)
>>>>>> {
>>>>>>   PETSC_UNUSED PetscErrorCode ierr;
>>>>>>
>>>>>>   PetscFunctionBegin;
>>>>>> #if defined(PETSC_HAVE_MPI_IALLREDUCE)
>>>>>>   ierr = MPI_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRMPI(ierr);
>>>>>> #elif defined(PETSC_HAVE_MPIX_IALLREDUCE)
>>>>>>   ierr = MPIX_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRQ(ierr);
>>>>>> #else
>>>>>>   ierr = MPIU_Allreduce(sendbuf,recvbuf,count,datatype,op,comm);CHKERRQ(ierr);
>>>>>>   *request = MPI_REQUEST_NULL;
>>>>>> #endif
>>>>>>   PetscFunctionReturn(0);
>>>>>> }
>>>>>>
>>>>>>
>>>>>> So first check whether $PETSC_DIR/include/petscconf.h defines
>>>>>>
>>>>>> PETSC_HAVE_MPI_IALLREDUCE
>>>>>>
>>>>>> If it does not, then the standard blocking MPI reduce is called instead.
>>>>>>
>>>>>> If it is defined, then any improvement depends on the implementation of
>>>>>> MPI_Iallreduce inside the MPI library you are using.
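>>>>>>
>>>>>> One way to check that directly, outside PETSc, is to overlap an
>>>>>> MPI_Iallreduce with some local work and compare against the blocking
>>>>>> case. The fragment below is only a sketch; do_local_work() is a
>>>>>> hypothetical placeholder for anything that takes at least as long as
>>>>>> one reduction.
>>>>>>
>>>>>>   double      send = 1.0, recv, t0, t_overlap;
>>>>>>   MPI_Request req;
>>>>>>
>>>>>>   MPI_Barrier(MPI_COMM_WORLD);
>>>>>>   t0 = MPI_Wtime();
>>>>>>   MPI_Iallreduce(&send, &recv, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
>>>>>>   do_local_work();                      /* placeholder local computation */
>>>>>>   MPI_Wait(&req, MPI_STATUS_IGNORE);
>>>>>>   t_overlap = MPI_Wtime() - t0;
>>>>>>   /* if t_overlap is roughly the blocking Allreduce time plus the local
>>>>>>      work time, the reduction is not progressing in the background */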
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>> On Jan 21, 2021, at 6:52 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hello Petsc developers and supporters,
>>>>>>
>>>>>> I would like to confirm the performance of asynchronous computation,
>>>>>> overlapping an inner-product computation with a matrix-vector
>>>>>> multiplication, using the code below.
>>>>>>
>>>>>>
>>>>>>     PetscLogDouble tt1,tt2;
>>>>>>     KSP ksp;
>>>>>>     //ierr = VecSet(c,one);
>>>>>>     ierr = VecSet(c,one);
>>>>>>     ierr = VecSet(u,one);
>>>>>>     ierr = VecSet(b,one);
>>>>>>
>>>>>>     ierr = KSPCreate(PETSC_COMM_WORLD,&ksp); CHKERRQ(ierr);
>>>>>>     ierr = KSP_MatMult(ksp,A,x,Ax); CHKERRQ(ierr);
>>>>>>
>>>>>>     ierr = PetscTime(&tt1);CHKERRQ(ierr);
>>>>>>     ierr = VecNormBegin(u,NORM_2,&norm1);
>>>>>>     ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>>>>>>     ierr = KSP_MatMult(ksp,A,c,Ac);
>>>>>>     ierr = VecNormEnd(u,NORM_2,&norm1);
>>>>>>     ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>>>>>
>>>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD,"The time used for the asynchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>>>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |u| =  %g\n",(double)norm1); CHKERRQ(ierr);
>>>>>>
>>>>>>     ierr = PetscTime(&tt1);CHKERRQ(ierr);
>>>>>>     ierr = VecNorm(b,NORM_2,&norm2); CHKERRQ(ierr);
>>>>>>     ierr = KSP_MatMult(ksp,A,c,Ac);
>>>>>>     ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>>>>>
>>>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD,"The time used for the synchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>>>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |b| =  %g\n",(double)norm2); CHKERRQ(ierr);
>>>>>>
>>>>>>
>>>>>> On a cluster with two or four nodes, the asynchronous computation is
>>>>>> always much slower than synchronous computation.
>>>>>>
>>>>>> The time used for the asynchronous calculation: 0.000203
>>>>>> + |u| =  100.
>>>>>> The time used for the synchronous calculation: 0.000006
>>>>>> + |b| =  100.
>>>>>>
>>>>>> Are there any necessary settings in MPI or PETSc to gain performance
>>>>>> from asynchronous computation?
>>>>>>
>>>>>> Thank you very much for anything you can provide.
>>>>>> Sincerely,
>>>>>> Viet.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> <http://www.cse.buffalo.edu/~knepley/>
>>
>>
>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>


More information about the petsc-users mailing list