[petsc-users] MemCpy (HtoD and DtoH) in Krylov solver
Xiangdong
epscodes at gmail.com
Tue Jul 23 21:01:41 CDT 2019
Thanks, Barry and Karli! Your comments are very helpful. At this point, I
have just started learning how to offload the linear solve to the GPU.
Leaving a copy of the matrix on the CPU is fine for me at the moment.
If the GPUs I am going to work on are NVIDIA GPUs, do AIJVIENNACL and
AIJCUSPARSE provide similar performance in your experience?
Thank you.
Xiangdong
On Tue, Jul 23, 2019 at 6:50 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
>
> > On Jul 23, 2019, at 12:24 PM, Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
> >
> > Hi,
> >
> >> I have two quick questions related to running GPU solvers.
> >> 1) Number of MPI processes vs. number of GPUs. Is it true that we should
> >> set these two numbers equal if most of the computations are done on the
> >> GPU? For one case I tested, with only one GPU, running with np=2 is 15%
> >> slower than np=1 (probably due to MPI communication). I am curious in
> >> what case one would benefit from having more MPI processes than GPUs.
> >
> > Yes, usually you want to have the same number of MPI processes as the
> > number of GPUs on each node, unless you have a lot of work for the CPU
> > and only little work for the GPU. If you oversubscribe the GPU, it will
> > not get any faster (unlike multi-core CPUs); instead, the extra
> > management overhead will slow it down.
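The one-rank-per-GPU mapping described above is usually expressed as a
modulo assignment by local rank. A rough sketch (`assign_gpu` is a
hypothetical helper for illustration, not PETSc or MPI API):

```python
def assign_gpu(local_rank, gpus_per_node):
    """Map an MPI rank to a GPU on its node. With one rank per GPU this
    is a 1:1 mapping; with more ranks than GPUs, ranks share devices
    (oversubscription), which adds overhead without extra speed."""
    return local_rank % gpus_per_node

# 4 ranks, 4 GPUs: every rank gets its own device.
print([assign_gpu(r, 4) for r in range(4)])
# 8 ranks, 4 GPUs: pairs of ranks share each device.
print([assign_gpu(r, 4) for r in range(8)])
```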
> >
> >
> >> 2) When I use MatLoad(A,viewer) to load binary data into an
> >> AIJCUSPARSE matrix A, how many matrices are created? Does it involve
> >> creating an intermediate AIJ matrix on the CPU and then converting it
> >> to AIJCUSPARSE on the GPU? I am not sure whether such an intermediate
> >> AIJ matrix exists. If yes, what is the lifetime of this matrix? Is it
> >> destroyed right after the conversion?
> >
> > GPU matrices are essentially AIJ matrices with additional GPU data
> > members. That is, if you MatLoad(), the data will be copied into the
> > CPU buffers first and then pushed down to the GPU when needed. The CPU
> > buffers will never be freed, but they might be updated from the latest
> > GPU data to allow fallback for operations that have no GPU
> > implementation.
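The lifecycle described above can be sketched roughly as follows (a
conceptual plain-Python model, with a hypothetical `MirroredMatrix` class;
this is not the actual PETSc data structure):

```python
class MirroredMatrix:
    """Models a matrix with CPU buffers plus a lazily populated GPU
    mirror, and a flag tracking which side holds current data."""

    def __init__(self, cpu_data):
        self.cpu_data = cpu_data      # filled by e.g. MatLoad()
        self.gpu_data = None          # populated only when needed
        self.valid = "cpu"            # "cpu", "gpu", or "both"

    def to_gpu(self):
        # Push to the device lazily; the CPU buffer stays alive.
        if self.valid == "cpu":
            self.gpu_data = list(self.cpu_data)   # stands in for HtoD copy
            self.valid = "both"

    def to_cpu(self):
        # Refresh the CPU buffer from GPU data, so operations without a
        # GPU implementation can fall back to the CPU side.
        if self.valid == "gpu":
            self.cpu_data = list(self.gpu_data)   # stands in for DtoH copy
            self.valid = "both"

    def gpu_op(self):
        self.to_gpu()
        self.gpu_data = [2.0 * x for x in self.gpu_data]  # device work
        self.valid = "gpu"            # CPU copy now stale, but not freed

    def cpu_fallback_op(self):
        self.to_cpu()
        return sum(self.cpu_data)     # host work on refreshed buffers
```

In this model the CPU copy is never destroyed, matching Karli's
description: it only goes stale and is refreshed on demand.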
>
> Eventually we should provide a mechanism that allows users to free the
> CPU side of matrices (and vectors) if they know they will never need them
> again on the CPU. Honestly, we are still working on the basics, not yet
> at this level of optimization. If you would benefit from such a feature
> now, feel free to make a pull request.
>
> Barry
>
>
>
> >
> > Best regards,
> > Karli
> >
> >
> >
> >
> >
> >> On Fri, Jul 19, 2019 at 12:08 PM Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
> >> Hi Xiangdong,
> >> > I can understand some of the numbers, but not the HtoD case.
> >> >
> >> > In DtoH1, it is the data movement from VecMDot. The size of the data
> >> > is 8.192KB, which is sizeof(PetscScalar) * MDOT_WORKGROUP_NUM * 8 =
> >> > 8*128*8 = 8192. My question is: instead of calling cublasDdot nv
> >> > times, why do you implement your own kernels? I guess it must be for
> >> > performance, but can you explain a little more?
> >> Yes, this is a performance optimization. We have tried several
> >> dot-products (suffers from kernel launch latency) as well as
> >> matrix-vector products (suffers from extra matrix setup) in the past;
> >> in both cases, there was extra memory traffic, impacting performance.
> >> The reason why the data size is 8192 is to get around a separate
> >> reduction stage on the GPU (i.e. a second kernel launch). By moving the
> >> data to the CPU and doing the reduction there, one is faster than doing
> >> it on the GPU and then moving only a few numbers. This has to do with
> >> PCI-Express latency: it takes about the same time to send a single byte
> >> as to send a few kilobytes. Only beyond ~10 KB does the bandwidth
> >> become the limiting factor.
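The latency argument can be put into numbers with a simple transfer-time
model. The latency and bandwidth figures below are illustrative
assumptions, not measured values:

```python
# Simple model: transfer time = fixed latency + size / bandwidth.
LATENCY = 10e-6      # ~10 microseconds per PCIe transfer (assumed)
BANDWIDTH = 12e9     # ~12 GB/s over PCIe (assumed)

def transfer_time(nbytes):
    return LATENCY + nbytes / BANDWIDTH

# The VecMDot partial results: 8 partial sums per vector, 128
# workgroups, 8 bytes per PetscScalar.
payload = 8 * 128 * 8    # = 8192 bytes
# Moving 8 KB costs barely more than moving a single byte, so shipping
# the partial sums to the CPU avoids a second reduction kernel launch
# on the GPU at almost no extra transfer cost.
print(transfer_time(payload) / transfer_time(1))
```

Under these assumed numbers, the 8 KB transfer is still
latency-dominated; only well past ~10 KB does the bandwidth term take
over, which matches Karli's explanation.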
> >> > In DtoH2, it is the data movement from VecNorm. The size of the data
> >> > is 8B, which is just sizeof(PetscScalar).
> >> This is most likely the result required for the control flow on the
> >> CPU.
> >> > In DtoD1, it is the data movement from VecAXPY. The size of the data
> >> > is 17.952MB, which is exactly sizeof(PetscScalar)*length(b).
> >> This is a vector assignment. If I remember correctly, it uses the
> >> memcpy routines and hence shows up as a separate DtoD instead of just
> >> a kernel. It matches the time required for scal_kernel_val (scaling a
> >> vector by a scalar), so it runs at full bandwidth on the GPU.
> >> > However, I do not understand the number in HostToDevice in GMRES for
> >> > np=1. The size of the data movement is 1.032KB. I thought this was
> >> > related to the updated upper Hessenberg matrix, but the number does
> >> > not match. Can anyone help me understand the data movement of HtoD
> >> > in GMRES for np=1?
> >> 1032 = (128+1)*8, so this might be some auxiliary work information on
> >> the GPU. I could figure out the exact source of these transfers, but
> >> that is some effort. Let me know whether this is important information
> >> for you, then I can do it.
> >> Best regards,
> >> Karli
> >> >
> >> > Thank you.
> >> >
> >> > Best,
> >> > Xiangdong
> >> >
> >> > On Thu, Jul 18, 2019 at 1:14 PM Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
> >> >
> >> > Hi,
> >> >
> >> > as you can see from the screenshot, the communication is merely for
> >> > scalars from the dot-products and/or norms. These are needed on the
> >> > host for the control flow and convergence checks; this is true for
> >> > any iterative solver.
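The point about convergence checks can be made concrete with a toy
iterative solver. This is an illustrative plain-Python sketch, not PETSc
source, and it uses a simple Richardson iteration rather than GMRES for
brevity; the comment marks where the scalar DtoH transfer would occur:

```python
def solve_richardson(A, b, omega=0.25, tol=1e-8, max_it=1000):
    """Toy iterative solve of Ax=b: x <- x + omega*(b - A*x)."""
    n = len(b)
    x = [0.0] * n
    for it in range(max_it):
        # Residual and its norm: on a GPU these run as device kernels.
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        r = [b[i] - Ax[i] for i in range(n)]
        rnorm = sum(ri * ri for ri in r) ** 0.5
        # A real GPU solver must copy rnorm (one scalar, DtoH) here,
        # because the `if` below is ordinary host-side control flow:
        if rnorm < tol:
            return x, it
        x = [x[i] + omega * r[i] for i in range(n)]
    return x, max_it
```

However fast the device kernels are, this one-scalar-per-iteration
transfer cannot be avoided as long as the stopping test runs on the CPU.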
> >> >
> >> > Best regards,
> >> > Karli
> >> >
> >> >
> >> >
> >> > On 7/18/19 3:11 PM, Xiangdong via petsc-users wrote:
> >> > >
> >> > >
> >> > > On Thu, Jul 18, 2019 at 5:11 AM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> >> > >
> >> > >
> >> > > 1) What preconditioner are you using? If any.
> >> > >
> >> > > Currently I am using none, as I want to understand how GMRES works
> >> > > on the GPU.
> >> > >
> >> > >
> >> > > 2) Where/how are you getting this information about the
> >> > > MemCpy(HtoD) and MemCpy(DtoH) calls? We might like to utilize this
> >> > > same sort of information to plan future optimizations.
> >> > >
> >> > > I am using nvprof and nvvp from the CUDA toolkit. It looks like
> >> > > there are one MemCpy(HtoD) and three MemCpy(DtoH) calls per
> >> > > iteration for the np=1 case. See the attached snapshots.
> >> > >
> >> > > 3) Are you using more than 1 MPI rank?
> >> > >
> >> > >
> >> > > I tried both np=1 and np=2. Attached please find snapshots from
> >> > > nvvp for both the np=1 and np=2 cases. The figures show the GPU
> >> > > calls over two pure GMRES iterations.
> >> > >
> >> > > Thanks.
> >> > > Xiangdong
> >> > >
> >> > >
> >> > > If you use the master branch (which we highly recommend for
> >> > > anyone using GPUs and PETSc), the -log_view option will log
> >> > > communication between the CPU and GPU and display it in the
> >> > > summary table. This is useful for seeing exactly which operations
> >> > > are doing vector communication between the CPU and GPU.
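Such a run might look like the following. The executable name `./app` is
a placeholder for your own program; the options shown are standard PETSc
runtime options:

```shell
# Solve with GPU matrix/vector types, GMRES, no preconditioner,
# and print the performance summary (including CPU<->GPU traffic
# on the master branch) at the end of the run.
mpiexec -n 1 ./app \
    -mat_type aijcusparse -vec_type cuda \
    -ksp_type gmres -pc_type none \
    -log_view
```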
> >> > >
> >> > > We welcome all feedback on the GPU support, since it has
> >> > > previously only been lightly used.
> >> > >
> >> > > Barry
> >> > >
> >> > >
> >> > > > On Jul 16, 2019, at 9:05 PM, Xiangdong via petsc-users
> >> > > > <petsc-users at mcs.anl.gov> wrote:
> >> > > >
> >> > > > Hello everyone,
> >> > > >
> >> > > > I am new to petsc gpu and have a simple question.
> >> > > >
> >> > > > When I tried to solve Ax=b, where A is MATAIJCUSPARSE and b and
> >> > > > x are VECSEQCUDA, with GMRES (or GCR) and pcnone, I found that
> >> > > > during each Krylov iteration there are one call to MemCpy(HtoD)
> >> > > > and one call to MemCpy(DtoH). Does that mean the Krylov solve is
> >> > > > not 100% on the GPU and the solve still needs some work from the
> >> > > > CPU? What are these MemCpys for during each iteration?
> >> > > >
> >> > > > Thank you.
> >> > > >
> >> > > > Best,
> >> > > > Xiangdong
> >> > >
> >> >
>
>