[petsc-users] MemCpy (HtoD and DtoH) in Krylov solver
Karl Rupp
rupp at iue.tuwien.ac.at
Fri Jul 19 11:08:04 CDT 2019
Hi Xiangdong,
> I can understand some of the numbers, but not the HtoD case.
>
> In DtoH1, it is the data movement from VecMDot. The size of the data is
> 8.192 KB, which is sizeof(PetscScalar) * MDOT_WORKGROUP_NUM * 8 = 8*128*8
> = 8192. My question is: instead of calling cublasDdot nv times, why do
> you implement your own kernels? I guess it must be for performance, but
> can you explain a little more?
Yes, this is a performance optimization. In the past we have tried issuing
several separate dot-products (which suffer from kernel launch latency) as
well as a matrix-vector product (which suffers from the extra matrix setup);
in both cases there was extra memory traffic, which hurt performance.
The data size of 8192 bytes comes from getting around a separate reduction
stage on the GPU (i.e. a second kernel launch). Moving the partial results
to the CPU and doing the final reduction there is faster than reducing on
the GPU and then moving only a few numbers. This has to do with PCI-Express
latency: it takes about the same time to send a single byte as to send a
few kilobytes. Only beyond roughly 10 KB does the bandwidth become the
limiting factor.
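
To make the idea concrete, here is a minimal CUDA sketch of such a fused
dot-product (names like partial_dot, dot_host_reduce and WORKGROUPS are made
up for this example, and it only handles a single dot-product, unlike the
actual VecMDot kernel): each workgroup writes one partial sum, the small
buffer of partial sums is copied to the host in one transfer, and the final
reduction happens on the CPU. With PCI-Express latencies of a few
microseconds and bandwidths of several GB/s, copying 8 KB costs essentially
the same as copying 8 bytes.

  #include <cuda_runtime.h>

  #define WORKGROUPS 128   /* plays the role of MDOT_WORKGROUP_NUM */

  /* Each block computes one partial sum; the real VecMDot kernel handles
     several vectors at once, which is where the factor 8 in
     8*128*8 bytes comes from. */
  __global__ void partial_dot(const double *x, const double *y,
                              double *partial, int n)
  {
    __shared__ double tmp[256];
    double sum = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
      sum += x[i] * y[i];
    tmp[threadIdx.x] = sum;
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
      __syncthreads();
      if (threadIdx.x < stride) tmp[threadIdx.x] += tmp[threadIdx.x + stride];
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = tmp[0];
  }

  /* Host side: a single small DtoH copy (WORKGROUPS * sizeof(double) bytes)
     plus a cheap loop on the CPU replaces a second kernel launch. */
  double dot_host_reduce(const double *d_x, const double *d_y,
                         double *d_partial, int n)
  {
    double h_partial[WORKGROUPS], result = 0.0;
    partial_dot<<<WORKGROUPS, 256>>>(d_x, d_y, d_partial, n);
    cudaMemcpy(h_partial, d_partial, WORKGROUPS * sizeof(double),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < WORKGROUPS; ++i) result += h_partial[i];
    return result;
  }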
> In DtoH2, it is the data movement from VecNorm. The size of the data is
> 8 B, which is just sizeof(PetscScalar).
This is most likely the scalar result required for the control flow on the CPU.
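
If it helps to see the pattern, here is a tiny sketch (not the actual PETSc
code; whether VecNorm goes through cublasDnrm2 exactly like this is an
assumption on my part): the norm is computed on the GPU, the single 8-byte
scalar lands in a host variable, and the CPU uses it for the convergence
check.

  #include <cublas_v2.h>

  /* Sketch: with the default CUBLAS_POINTER_MODE_HOST, the scalar norm is
     written to a host variable, i.e. an 8-byte DtoH transfer, so the CPU
     can use it for control flow. Error checking omitted. */
  static int converged(cublasHandle_t handle, const double *d_r, int n,
                       double tol)
  {
    double nrm = 0.0;
    cublasDnrm2(handle, n, d_r, 1, &nrm);  /* norm computed on the GPU */
    return nrm < tol;                      /* decision happens on the CPU */
  }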
> In DtoD1, it is the data movement from VecAXPY. The size of the data is
> 17.952 MB, which is exactly sizeof(PetscScalar)*length(b).
This is a vector assignment. If I remember correctly, it uses the
memcpy routines and hence shows up as a separate DtoD transfer instead of
just a kernel. It matches the time required for scal_kernel_val (scaling a
vector by a scalar), so it runs at full bandwidth on the GPU.
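
For illustration only (a sketch, not the actual PETSc code path), such an
assignment done via the CUDA memcpy API looks like this, and nvprof then
reports it as a DtoD transfer rather than as a kernel:

  #include <cuda_runtime.h>

  /* Sketch: assigning one device vector to another via the memcpy API.
     nvprof reports this as a DtoD transfer (here 17.952 MB = n * sizeof(double))
     instead of a kernel launch. d_dst and d_src are device pointers. */
  static void copy_device_vector(double *d_dst, const double *d_src, size_t n)
  {
    cudaMemcpy(d_dst, d_src, n * sizeof(double), cudaMemcpyDeviceToDevice);
  }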
> However, I do not understand the HostToDevice number in GMRES for
> np=1. The size of the data movement is 1.032 KB. I thought this was
> related to the updated upper Hessenberg matrix, but the number does not
> match. Can anyone help me understand the HtoD data movement in GMRES
> for np=1?
1032 = (128+1)*8, so this might be some auxiliary work information on
the GPU. I could track down the exact source of these transfers, but
that takes some effort. Let me know whether this information is
important to you, and I can do it.
Best regards,
Karli
>
> Thank you.
>
> Best,
> Xiangdong
>
> On Thu, Jul 18, 2019 at 1:14 PM Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
>
> Hi,
>
> as you can see from the screenshot, the communication is merely for
> scalars from the dot-products and/or norms. These are needed on the
> host for the control flow and convergence checks, and this is true for
> any iterative solver.
>
> Best regards,
> Karli
>
>
>
> On 7/18/19 3:11 PM, Xiangdong via petsc-users wrote:
> >
> >
> > On Thu, Jul 18, 2019 at 5:11 AM Smith, Barry F.
> > <bsmith at mcs.anl.gov> wrote:
> >
> >
> > 1) What preconditioner are you using? If any.
> >
> > Currently I am using none, as I want to understand how GMRES works
> > on the GPU.
> >
> >
> > 2) Where/how are you getting this information about the
> > MemCpy(HtoD) and MemCpy(DtoH) calls? We might like to utilize
> > this same sort of information to plan future optimizations.
> >
> > I am using nvprof and nvvp from the CUDA toolkit. It looks like there
> > is one MemCpy(HtoD) call and three MemCpy(DtoH) calls per iteration
> > for the np=1 case. See the attached snapshots.
> >
> > 3) Are you using more than 1 MPI rank?
> >
> >
> > I tried both np=1 and np=2. Attached please find snapshots from nvvp
> > for both the np=1 and np=2 cases. The figures show the GPU calls for
> > two pure GMRES iterations.
> >
> > Thanks.
> > Xiangdong
> >
> >
> > If you use the master branch (which we highly recommend for
> > anyone using GPUs and PETSc), the -log_view option will log
> > communication between the CPU and GPU and display it in the summary
> > table. This is useful for seeing exactly which operations are doing
> > vector communication between the CPU and GPU.
> >
> > We welcome all feedback on the GPUs, since the GPU support has
> > previously been only lightly used.
> >
> > Barry
> >
> >
> > > On Jul 16, 2019, at 9:05 PM, Xiangdong via petsc-users
> > > <petsc-users at mcs.anl.gov> wrote:
> > >
> > > Hello everyone,
> > >
> > > I am new to petsc gpu and have a simple question.
> > >
> > > When I tried to solve Ax=b, where A is MATAIJCUSPARSE and b and x
> > > are VECSEQCUDA, with GMRES (or GCR) and pcnone, I found that during
> > > each Krylov iteration there is one MemCpy(HtoD) call and one
> > > MemCpy(DtoH) call. Does that mean the Krylov solve is not 100% on
> > > the GPU and still needs some work from the CPU? What are these
> > > MemCpys for during each iteration?
> > >
> > > Thank you.
> > >
> > > Best,
> > > Xiangdong
> >
>