[petsc-users] How to confirm the performance of asynchronous computations

Matthew Knepley knepley at gmail.com
Tue Jan 26 05:01:26 CST 2021


On Mon, Jan 25, 2021 at 11:31 PM Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:

> Dear Patrick Sanan,
>
> Thank you very much for your answer, especially for your code.
> I was able to compile and run your code on 8 nodes with 20 processes per
> node. Below are the results:
>
> Testing with 160 MPI ranks
> reducing an array of size 32 (256 bytes)
> Running 5 burnin runs and 100 tests ...  Done.
> For 100 runs with 5 burnin runs, on 160 MPI processes, min/max times over
> all ranks:
> MPI timer resolution:         1.0000e-06 seconds
> MPI timer resolution/#trials: 1.0000e-08 seconds
> B.   Red. Only (min/max):    8.850098e-06 / 8.890629e-06 seconds
> N.B. Red. Only (min/max):    1.725912e-05 / 1.733065e-05 seconds
> Loc. Only (min/max):         2.364278e-04 / 2.374697e-04 seconds
> Blocking (min/max):          2.650309e-04 / 2.650595e-04 seconds
> Non-Blocking (min/max):      2.673984e-04 / 2.674508e-04 seconds
> Observe to see if the local time is enough to hide the reduction, and see
> if the reduction is indeed hidden
>
>
> It appears that in this test the non-blocking reduction is no faster than
> the blocking one (2.674e-04 s vs. 2.650e-04 s, whereas a fully hidden
> reduction should approach the local-only time of about 2.37e-04 s).
> I suspect I am still missing some suitable Intel MPI environment settings.
> I am now thinking about using MPICH, which does not require any
> environment settings for non-blocking computation.
> Could you please let me know which MPI (MPICH or OpenMPI) you used in your
> tests?
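>
> If I do switch, I would probably just let PETSc download and build MPICH
> itself, roughly like this (only a sketch of the option I have in mind, not
> something I have tested yet):
>
>     ./configure --download-mpich --with-debugging=0 COPTFLAGS='-O3' CXXOPTFLAGS='-O3' FOPTFLAGS='-O3'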
>

Note that in the test, the cost of the reduction is only about 3% of the total
cost (roughly 8.9e-06 s out of 2.65e-04 s). Is that worth putting time into?
Is the reduction definitely a larger fraction of the cost in your application?

   Thanks,

      Matt


> Thanks again.
> Viet
>
>
> On Mon, Jan 25, 2021 at 7:47 PM Patrick Sanan <patrick.sanan at gmail.com>
> wrote:
>
>> Sorry about the delay in responding, but I'll add a couple of points here:
>>
>>
>> 1) It's important to have some reason to believe that pipelining will
>> actually help your problem. Pipelined Krylov methods work by overlapping
>> reductions with operator and preconditioner applications. So, to see
>> speedup, the time for a reduction needs to be comparable to the time for
>> the operator/preconditioner application. This will only be true in some
>> cases - typical cases are when you have a large number of ranks/nodes, a
>> slow network, or very fast operator/preconditioner applications (assuming
>> that these require the same time on each rank - it's an interesting case
>> when they don't, but unless you say otherwise I'll assume this doesn't
>> apply to your use case).
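>>
>> As a quick sanity check on the PETSc side, you could also just compare a
>> standard and a pipelined Krylov method on your actual problem (assuming a
>> CG-appropriate problem; "./your_app" below stands in for your own executable):
>>
>>     mpiexec -n 160 ./your_app -ksp_type cg     -log_view
>>     mpiexec -n 160 ./your_app -ksp_type pipecg -log_view
>>
>> Keep in mind, though, that whether the reduction is actually overlapped still
>> depends on the MPI-level asynchronous progress discussed in the next point.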
>>
>> 2) As you're discovering, simply ensuring that asynchronous progress
>> works, at the pure MPI level, isn't as easy as it might appear, since the
>> behavior depends heavily on the MPI implementation.
>>
>>
>> For both of these reasons, I suggest setting up a test that just directly
>> uses MPI (which you can of course do from a PETSc-style code) and allows
>> you to compare times for blocking and non-blocking reductions, overlapping
>> some (useless) local work. You should make sure to run multiple iterations
>> within the script, and also run the script multiple times on the cluster
>> (bearing in mind that it's possible that the performance will be affected
>> by other users of the system).
>>
>> I attach an old script I found that I used to test some of these things,
>> to give a more concrete idea of what I mean. Note that this was used early
>> on in our own exploration of these topics so I'm only offering it to give
>> an idea, not as a meaningful benchmark in its own right.
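>>
>> Just to make the pattern concrete, the core comparison in that kind of test
>> looks roughly like the following (a single trial with no burnin and an
>> arbitrary amount of fake local work, so treat it purely as an illustration,
>> not as the attached script itself):
>>
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv)
>> {
>>   double          send = 1.0, recv = 0.0, t0, t_block, t_nonblock;
>>   volatile double work = 0.0;   /* sink for useless local work */
>>   int             i, rank;
>>   MPI_Request     req;
>>
>>   MPI_Init(&argc, &argv);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>   /* Blocking: do the local work, then the reduction */
>>   MPI_Barrier(MPI_COMM_WORLD);
>>   t0 = MPI_Wtime();
>>   for (i = 0; i < 1000000; i++) work += 1e-9 * i;
>>   MPI_Allreduce(&send, &recv, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
>>   t_block = MPI_Wtime() - t0;
>>
>>   /* Non-blocking: start the reduction, do the same local work, then wait.
>>      If asynchronous progress really works, this should approach the
>>      local-work-only time rather than the sum of the two. */
>>   MPI_Barrier(MPI_COMM_WORLD);
>>   t0 = MPI_Wtime();
>>   MPI_Iallreduce(&send, &recv, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
>>   for (i = 0; i < 1000000; i++) work += 1e-9 * i;
>>   MPI_Wait(&req, MPI_STATUS_IGNORE);
>>   t_nonblock = MPI_Wtime() - t0;
>>
>>   if (!rank) printf("blocking: %g s  non-blocking: %g s  (sum = %g)\n", t_block, t_nonblock, recv);
>>   MPI_Finalize();
>>   return 0;
>> }
>>
>> In a real measurement you would repeat each timing many times, take min/max
>> over all ranks, and size the local work so it is at least comparable to the
>> reduction time you are trying to hide.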
>>
>> On 25.01.2021 at 09:17, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>>
>>
>> Dear Barry,
>>
>> Thank you very much for your information.
>>
>> It seems complicated to set the environment variables that enable asynchronous
>> progress and pin the progress threads to cores when using Intel MPI.
>>
>> $ export I_MPI_ASYNC_PROGRESS=1
>> $ export I_MPI_ASYNC_PROGRESS_PIN=<CPU list>
>>
>>
>>
>> https://techdecoded.intel.io/resources/hiding-communication-latency-using-mpi-3-non-blocking-collectives/
>>
>> I'm still not sure how to get an appropriate "CPU list" when running MPI
>> with multiple nodes and multiple processes on one node.
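>>
>> My current guess, for nodes with 20 cores, would be to run 19 ranks per node
>> and leave one core per node for the progress threads, something like
>>
>> $ export I_MPI_ASYNC_PROGRESS=1
>> $ export I_MPI_ASYNC_PROGRESS_PIN=19
>>
>> but the core index here is only a guess on my part and would have to match
>> the actual node topology and rank pinning.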
>> Best,
>> Viet.
>>
>>
>>
>>
>> On Sat, Jan 23, 2021 at 3:01 AM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>
>>> https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/asynchronous-progress-control.html
>>>
>>> It states "and a partial support for non-blocking collectives (MPI_Ibcast,
>>> MPI_Ireduce, and MPI_Iallreduce)."  I do not know what partial support means
>>> but you can try setting the variables and see if that helps.
>>>
>>>
>>>
>>> On Jan 22, 2021, at 11:20 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>>>
>>>
>>> Dear Victor and Barry,
>>>
>>> Thank you so much for your answers.
>>>
>>> I fixed the bug in the call to PetscCommSplitReductionBegin that Barry
>>> pointed out:
>>>
>>>      ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));
>>>
>>> It was also a mistake to set the vector size too small.
>>> I have now set the vector size to 100000000 and run the code on 4 nodes with
>>> 2 processes per node. The results are as follows
>>>
>>> The time used for the asynchronous calculation: 0.022043
>>> + | u | = 10000.
>>> The time used for the synchronous calculation: 0.016188
>>> + | b | = 10000.
>>>
>>> The asynchronous computation still takes longer.
>>>
>>> I also confirmed that PETSC_HAVE_MPI_IALLREDUCE is defined in the
>>> file $PETSC_DIR/include/petscconf.h
>>>
>>> I built Petsc by using the following script
>>>
>>> #!/usr/bin/bash
>>> set -e
>>> DATE="21.01.18"
>>>
>>> MPIIT_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mpi/intel64"
>>>
>>> MKL_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mkl"
>>> INSTL_DIR="${HOME}/local/petsc-3.14.3"
>>> BUILD_DIR="${HOME}/tmp/petsc/build_${DATE}"
>>> PETSC_DIR="${HOME}/tmp/petsc"
>>>
>>> cd ${PETSC_DIR}
>>> ./configure --force --prefix=${INSTL_DIR} --with-mpi-dir=${MPIIT_DIR} \
>>>   --with-fortran-bindings=0 --with-mpiexec=${MPIIT_DIR}/bin/mpiexec \
>>>   --with-valgrind-dir=${HOME}/local/valgrind --with-blaslapack-dir=${MKL_DIR} \
>>>   --download-make --with-debugging=0 \
>>>   COPTFLAGS='-O3 -march=native -mtune=native' \
>>>   CXXOPTFLAGS='-O3 -march=native -mtune=native' \
>>>   FOPTFLAGS='-O3 -march=native -mtune=native'
>>>
>>> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt all
>>> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt install
>>>
>>>
>>> Intel MPI 2018 also complies with the MPI-3 standard.
>>>
>>> Are there specific Intel MPI settings needed to obtain the expected
>>> performance from MPI_Iallreduce?
>>>
>>> Sincerely,
>>> Viet.
>>>
>>>
>>> On Fri, Jan 22, 2021 at 11:20 AM Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>>
>>>>   ierr = VecNormBegin(u,NORM_2,&norm1);
>>>>   ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>>>>
>>>> How come you call this on Ax and not on u? For clarity, if nothing
>>>> else, I think you should call it on u.
>>>>
>>>> comb.c has
>>>>
>>>> /*
>>>>       Split phase global vector reductions with support for combining the
>>>>    communication portion of several operations. Using MPI-1.1 support only
>>>>
>>>>       The idea for this and much of the initial code is contributed by
>>>>    Victor Eijkhout.
>>>>
>>>>        Usage:
>>>>              VecDotBegin(Vec,Vec,PetscScalar *);
>>>>              VecNormBegin(Vec,NormType,PetscReal *);
>>>>              ....
>>>>              VecDotEnd(Vec,Vec,PetscScalar *);
>>>>              VecNormEnd(Vec,NormType,PetscReal *);
>>>>
>>>>        Limitations:
>>>>          - The order of the xxxEnd() functions MUST be in the same order
>>>>            as the xxxBegin(). There is extensive error checking to try to
>>>>            insure that the user calls the routines in the correct order
>>>> */
>>>>
>>>> #include <petsc/private/vecimpl.h>    /*I   "petscvec.h"    I*/
>>>>
>>>> static PetscErrorCode MPIPetsc_Iallreduce(void *sendbuf,void *recvbuf,PetscMPIInt count,MPI_Datatype datatype,MPI_Op op,MPI_Comm comm,MPI_Request *request)
>>>> {
>>>>   PETSC_UNUSED PetscErrorCode ierr;
>>>>
>>>>   PetscFunctionBegin;
>>>> #if defined(PETSC_HAVE_MPI_IALLREDUCE)
>>>>   ierr = MPI_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRMPI(ierr);
>>>> #elif defined(PETSC_HAVE_MPIX_IALLREDUCE)
>>>>   ierr = MPIX_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRQ(ierr);
>>>> #else
>>>>   ierr = MPIU_Allreduce(sendbuf,recvbuf,count,datatype,op,comm);CHKERRQ(ierr);
>>>>   *request = MPI_REQUEST_NULL;
>>>> #endif
>>>>   PetscFunctionReturn(0);
>>>> }
>>>>
>>>>
>>>> So first check if $PETSC_DIR/include/petscconf.h has
>>>>
>>>> PETSC_HAVE_MPI_IALLREDUCE
>>>>
>>>> if it does not then the standard MPI reduce is called.
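>>>>
>>>> For example, a quick way to check from the command line (adjust the path if
>>>> you used a prefix install):
>>>>
>>>>    grep PETSC_HAVE_MPI_IALLREDUCE $PETSC_DIR/include/petscconf.h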
>>>>
>>>> If this is set then any improvement depends on the implementation of
>>>> iallreduce inside the MPI you are using.
>>>>
>>>> Barry
>>>>
>>>>
>>>> On Jan 21, 2021, at 6:52 AM, Viet H.Q.H. <hqhviet at tohoku.ac.jp> wrote:
>>>>
>>>>
>>>> Hello Petsc developers and supporters,
>>>>
>>>> I would like to confirm the performance of asynchronous computation, where
>>>> an inner-product (norm) reduction is overlapped with a matrix-vector
>>>> multiplication, using the code below.
>>>>
>>>>
>>>>     PetscLogDouble tt1,tt2;
>>>>     KSP ksp;
>>>>     //ierr = VecSet(c,one);
>>>>     ierr = VecSet(c,one);
>>>>     ierr = VecSet(u,one);
>>>>     ierr = VecSet(b,one);
>>>>
>>>>     ierr = KSPCreate(PETSC_COMM_WORLD,&ksp); CHKERRQ(ierr);
>>>>     ierr = KSP_MatMult(ksp,A,x,Ax); CHKERRQ(ierr);
>>>>
>>>>
>>>>     ierr = PetscTime(&tt1);CHKERRQ(ierr);
>>>>     ierr = VecNormBegin(u,NORM_2,&norm1);
>>>>     ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>>>>     ierr = KSP_MatMult(ksp,A,c,Ac);
>>>>     ierr = VecNormEnd(u,NORM_2,&norm1);
>>>>     ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>>>
>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the asynchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |u| =  %g\n",(double) norm1); CHKERRQ(ierr);
>>>>
>>>>
>>>>     ierr = PetscTime(&tt1);CHKERRQ(ierr);
>>>>     ierr = VecNorm(b,NORM_2,&norm2); CHKERRQ(ierr);
>>>>     ierr = KSP_MatMult(ksp,A,c,Ac);
>>>>     ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>>>
>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the synchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);
>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |b| =  %g\n",(double) norm2); CHKERRQ(ierr);
>>>>
>>>>
>>>> On a cluster with two or four nodes, the asynchronous computation is
>>>> always much slower than the synchronous computation.
>>>>
>>>> The time used for the asynchronous calculation: 0.000203
>>>> + |u| =  100.
>>>> The time used for the synchronous calculation: 0.000006
>>>> + |b| =  100.
>>>>
>>>> Are there any MPI or PETSc settings needed to gain performance from
>>>> asynchronous computation?
>>>>
>>>> Thank you very much for anything you can provide.
>>>> Sincerely,
>>>> Viet.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

