[petsc-users] MemCpy (HtoD and DtoH) in Krylov solver
Xiangdong
epscodes at gmail.com
Tue Jul 23 21:01:41 CDT 2019
Thanks, Barry and Karli! Your comments are very helpful. At this point, I
have just started learning how to offload the linear solve to the GPU.
Leaving a copy of the matrix on the CPU is fine for me at the moment.
If the GPUs I am going to work on are NVIDIA GPUs, do AIJVIENNACL and
AIJCUSPARSE provide similar performance in your experience?
Thank you.
Xiangdong
On Tue, Jul 23, 2019 at 6:50 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
>
> > On Jul 23, 2019, at 12:24 PM, Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
> >
> > Hi,
> >
> >> I have two quick questions related to running GPU solvers.
> >> 1) Number of MPI processes vs. number of GPUs. Is it true that we should
> >> set these two numbers equal if most of the computations are done on the
> >> GPU? For one case I tested, with only one GPU, running with np=2 is 15%
> >> slower than np=1 (probably due to MPI communication). I am curious in
> >> what case one would benefit from having more MPI processes than GPUs.
> >
> > Yes, usually you want to have the same number of MPI processes as the
> > number of GPUs on each node, unless you have a lot of work for the CPU
> > and only little work for the GPU. If you oversubscribe the GPU, it will
> > not get any faster (unlike multi-core CPUs); instead, the extra
> > management overhead will slow it down.
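The one-rank-per-GPU mapping described above is usually expressed as a
modulo assignment by local rank. A rough sketch (`assign_gpu` is a
hypothetical helper for illustration, not PETSc or MPI API):

```python
def assign_gpu(local_rank, gpus_per_node):
    """Map an MPI rank to a GPU on its node. With one rank per GPU this
    is a 1:1 mapping; with more ranks than GPUs, ranks share devices
    (oversubscription), which adds overhead without extra speed."""
    return local_rank % gpus_per_node

# 4 ranks, 4 GPUs: every rank gets its own device.
print([assign_gpu(r, 4) for r in range(4)])
# 8 ranks, 4 GPUs: pairs of ranks share each device.
print([assign_gpu(r, 4) for r in range(8)])
```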
> >
> >
> >> 2) When I use MatLoad(A,viewer) to load binary data into an
> >> AIJCUSPARSE matrix A, how many matrices are created? Does it involve
> >> creating an intermediate AIJ matrix on the CPU and then converting it
> >> to AIJCUSPARSE on the GPU? I am not sure whether such an intermediate
> >> AIJ matrix exists. If yes, what is the lifetime of this matrix? Is it
> >> destroyed right after the conversion?
> >
> > GPU matrices are essentially AIJ matrices with additional GPU data
> > members. That is, if you MatLoad(), the data will be copied into the
> > CPU buffers first and then pushed down to the GPU when needed. The CPU
> > buffers will never be freed, but they might be updated from the latest
> > GPU data to allow fallback for operations that have no GPU
> > implementation.
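The lifecycle described above can be sketched roughly as follows (a
conceptual plain-Python model, with a hypothetical `MirroredMatrix` class;
this is not the actual PETSc data structure):

```python
class MirroredMatrix:
    """Models a matrix with CPU buffers plus a lazily populated GPU
    mirror, and a flag tracking which side holds current data."""

    def __init__(self, cpu_data):
        self.cpu_data = cpu_data      # filled by e.g. MatLoad()
        self.gpu_data = None          # populated only when needed
        self.valid = "cpu"            # "cpu", "gpu", or "both"

    def to_gpu(self):
        # Push to the device lazily; the CPU buffer stays alive.
        if self.valid == "cpu":
            self.gpu_data = list(self.cpu_data)   # stands in for HtoD copy
            self.valid = "both"

    def to_cpu(self):
        # Refresh the CPU buffer from GPU data, so operations without a
        # GPU implementation can fall back to the CPU side.
        if self.valid == "gpu":
            self.cpu_data = list(self.gpu_data)   # stands in for DtoH copy
            self.valid = "both"

    def gpu_op(self):
        self.to_gpu()
        self.gpu_data = [2.0 * x for x in self.gpu_data]  # device work
        self.valid = "gpu"            # CPU copy now stale, but not freed

    def cpu_fallback_op(self):
        self.to_cpu()
        return sum(self.cpu_data)     # host work on refreshed buffers
```

In this model the CPU copy is never destroyed, matching Karli's
description: it only goes stale and is refreshed on demand.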
>
> Eventually we should provide a mechanism that allows users to free the
> CPU side of matrices (and vectors) if they know they will never need them
> again on the CPU. Honestly, we are still working on the basics, not yet
> at this level of optimization. If you would benefit from such a feature
> now, feel free to make a pull request.
>
> Barry
>
>
>
> >
> > Best regards,
> > Karli
> >
> >
> >
> >
> >
> >> On Fri, Jul 19, 2019 at 12:08 PM Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
> >> Hi Xiangdong,
> >> > I can understand some of the numbers, but not the HtoD case.
> >> >
> >> > In DtoH1, it is the data movement from VecMDot. The size of the data
> >> > is 8.192KB, which is sizeof(PetscScalar) * MDOT_WORKGROUP_NUM * 8 =
> >> > 8*128*8 = 8192. My question is: instead of calling cublasDdot nv
> >> > times, why do you implement your own kernels? I guess it must be for
> >> > performance, but can you explain a little more?
> >> Yes, this is a performance optimization. We have tried several
> >> dot-products (suffers from kernel launch latency) as well as
> >> matrix-vector products (suffers from extra matrix setup) in the past;
> >> in both cases, there was extra memory traffic, impacting performance.
> >> The reason why the data size is 8192 is to get around a separate
> >> reduction stage on the GPU (i.e. a second kernel launch). By moving the
> >> data to the CPU and doing the reduction there, one is faster than doing
> >> it on the GPU and then moving only a few numbers. This has to do with
> >> PCI-Express latency: it takes about the same time to send a single byte
> >> as to send a few kilobytes. Only beyond ~10 KB does the bandwidth
> >> become the limiting factor.
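The latency argument can be put into numbers with a simple transfer-time
model. The latency and bandwidth figures below are illustrative
assumptions, not measured values:

```python
# Simple model: transfer time = fixed latency + size / bandwidth.
LATENCY = 10e-6      # ~10 microseconds per PCIe transfer (assumed)
BANDWIDTH = 12e9     # ~12 GB/s over PCIe (assumed)

def transfer_time(nbytes):
    return LATENCY + nbytes / BANDWIDTH

# The VecMDot partial results: 8 partial sums per vector, 128
# workgroups, 8 bytes per PetscScalar.
payload = 8 * 128 * 8    # = 8192 bytes
# Moving 8 KB costs barely more than moving a single byte, so shipping
# the partial sums to the CPU avoids a second reduction kernel launch
# on the GPU at almost no extra transfer cost.
print(transfer_time(payload) / transfer_time(1))
```

Under these assumed numbers, the 8 KB transfer is still
latency-dominated; only well past ~10 KB does the bandwidth term take
over, which matches Karli's explanation.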
> >> > In DtoH2, it is the data movement from VecNorm. The size of the data
> >> > is 8B, which is just sizeof(PetscScalar).
> >> This is most likely the result required for the control flow on the
> >> CPU.
> >> > In DtoD1, it is the data movement from VecAXPY. The size of the data
> >> > is 17.952MB, which is exactly sizeof(PetscScalar)*length(b).
> >> This is a vector assignment. If I remember correctly, it uses the
> >> memcpy routines and hence shows up as a separate DtoD instead of just
> >> a kernel. It matches the time required for scal_kernel_val (scaling a
> >> vector by a scalar), so it runs at full bandwidth on the GPU.
> >> > However, I do not understand the number in HostToDevice in GMRES for
> >> > np=1. The size of the data movement is 1.032KB. I thought this was
> >> > related to the updated upper Hessenberg matrix, but the number does
> >> > not match. Can anyone help me understand the data movement of HtoD
> >> > in GMRES for np=1?
> >> 1032 = (128+1)*8, so this might be some auxiliary work information on
> >> the GPU. I could figure out the exact source of these transfers, but
> >> that is some effort. Let me know whether this is important information
> >> for you, then I can do it.
> >> Best regards,
> >> Karli
> >> >
> >> > Thank you.
> >> >
> >> > Best,
> >> > Xiangdong
> >> >
> >> > On Thu, Jul 18, 2019 at 1:14 PM Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
> >> >
> >> > Hi,
> >> >
> >> > as you can see from the screenshot, the communication is merely for
> >> > scalars from the dot-products and/or norms. These are needed on the
> >> > host for the control flow and convergence checks; this is true for
> >> > any iterative solver.
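The point about convergence checks can be made concrete with a toy
iterative solver. This is an illustrative plain-Python sketch, not PETSc
source, and it uses a simple Richardson iteration rather than GMRES for
brevity; the comment marks where the scalar DtoH transfer would occur:

```python
def solve_richardson(A, b, omega=0.25, tol=1e-8, max_it=1000):
    """Toy iterative solve of Ax=b: x <- x + omega*(b - A*x)."""
    n = len(b)
    x = [0.0] * n
    for it in range(max_it):
        # Residual and its norm: on a GPU these run as device kernels.
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        r = [b[i] - Ax[i] for i in range(n)]
        rnorm = sum(ri * ri for ri in r) ** 0.5
        # A real GPU solver must copy rnorm (one scalar, DtoH) here,
        # because the `if` below is ordinary host-side control flow:
        if rnorm < tol:
            return x, it
        x = [x[i] + omega * r[i] for i in range(n)]
    return x, max_it
```

However fast the device kernels are, this one-scalar-per-iteration
transfer cannot be avoided as long as the stopping test runs on the CPU.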
> >> >
> >> > Best regards,
> >> > Karli
> >> >
> >> >
> >> >
> >> > On 7/18/19 3:11 PM, Xiangdong via petsc-users wrote:
> >> > >
> >> > >
> >> > > On Thu, Jul 18, 2019 at 5:11 AM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> >> > >
> >> > >
> >> > > 1) What preconditioner are you using? If any.
> >> > >
> >> > > Currently I am using none, as I want to understand how GMRES works
> >> > > on the GPU.
> >> > >
> >> > >
> >> > > 2) Where/how are you getting this information about the
> >> > > MemCpy(HtoD) and MemCpy(DtoH) calls? We might like to utilize this
> >> > > same sort of information to plan future optimizations.
> >> > >
> >> > > I am using nvprof and nvvp from the CUDA toolkit. It looks like
> >> > > there are one MemCpy(HtoD) and three MemCpy(DtoH) calls per
> >> > > iteration for the np=1 case. See the attached snapshots.
> >> > >
> >> > > 3) Are you using more than 1 MPI rank?
> >> > >
> >> > >
> >> > > I tried both np=1 and np=2. Attached please find snapshots from
> >> > > nvvp for both the np=1 and np=2 cases. The figures show the GPU
> >> > > calls over two pure GMRES iterations.
> >> > >
> >> > > Thanks.
> >> > > Xiangdong
> >> > >
> >> > >
> >> > > If you use the master branch (which we highly recommend for
> >> > > anyone using GPUs and PETSc), the -log_view option will log
> >> > > communication between the CPU and GPU and display it in the
> >> > > summary table. This is useful for seeing exactly which operations
> >> > > are doing vector communication between the CPU and GPU.
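Such a run might look like the following. The executable name `./app` is
a placeholder for your own program; the options shown are standard PETSc
runtime options:

```shell
# Solve with GPU matrix/vector types, GMRES, no preconditioner,
# and print the performance summary (including CPU<->GPU traffic
# on the master branch) at the end of the run.
mpiexec -n 1 ./app \
    -mat_type aijcusparse -vec_type cuda \
    -ksp_type gmres -pc_type none \
    -log_view
```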
> >> > >
> >> > > We welcome all feedback on the GPU support, since it has
> >> > > previously only been lightly used.
> >> > >
> >> > > Barry
> >> > >
> >> > >
> >> > > > On Jul 16, 2019, at 9:05 PM, Xiangdong via petsc-users
> >> > > > <petsc-users at mcs.anl.gov> wrote:
> >> > > >
> >> > > > Hello everyone,
> >> > > >
> >> > > > I am new to petsc gpu and have a simple question.
> >> > > >
> >> > > > When I tried to solve Ax=b, where A is MATAIJCUSPARSE and b and
> >> > > > x are VECSEQCUDA, with GMRES (or GCR) and pcnone, I found that
> >> > > > during each Krylov iteration there are one call to MemCpy(HtoD)
> >> > > > and one call to MemCpy(DtoH). Does that mean the Krylov solve is
> >> > > > not 100% on the GPU and the solve still needs some work from the
> >> > > > CPU? What are these MemCpys for during each iteration?
> >> > > >
> >> > > > Thank you.
> >> > > >
> >> > > > Best,
> >> > > > Xiangdong
> >> > >
> >> >
>
>