[petsc-users] MemCpy (HtoD and DtoH) in Krylov solver
Karl Rupp
rupp at iue.tuwien.ac.at
Fri Jul 19 11:08:04 CDT 2019
Hi Xiangdong,
> I can understand some of the numbers, but not the HtoD case.
>
> In DtoH1, it is the data movement from VecMDot. The size of the data is
> 8.192 KB, which is sizeof(PetscScalar) * MDOT_WORKGROUP_NUM * 8 = 8*128*8
> = 8192. My question is: instead of calling cublasDdot nv times, why do
> you implement your own kernels? I guess it must be for performance, but
> can you explain a little more?
Yes, this is a performance optimization. In the past we have tried issuing
several separate dot-products (which suffer from kernel launch latency) as
well as a matrix-vector product (which suffers from the extra matrix setup);
in both cases there was extra memory traffic, which hurt performance.
The data size of 8192 bytes comes from getting around a separate reduction
stage on the GPU (i.e. a second kernel launch). Moving the partial results
to the CPU and doing the final reduction there is faster than reducing on
the GPU and then moving only a few numbers. This has to do with PCI-Express
latency: it takes about the same time to send a single byte as to send a
few kilobytes. Only beyond roughly 10 KB does the bandwidth become the
limiting factor.
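
To make the idea concrete, here is a minimal CUDA sketch of such a fused
dot-product (names like partial_dot, dot_host_reduce and WORKGROUPS are made
up for this example, and it only handles a single dot-product, unlike the
actual VecMDot kernel): each workgroup writes one partial sum, the small
buffer of partial sums is copied to the host in one transfer, and the final
reduction happens on the CPU. With PCI-Express latencies of a few
microseconds and bandwidths of several GB/s, copying 8 KB costs essentially
the same as copying 8 bytes.

  #include <cuda_runtime.h>

  #define WORKGROUPS 128   /* plays the role of MDOT_WORKGROUP_NUM */

  /* Each block computes one partial sum; the real VecMDot kernel handles
     several vectors at once, which is where the factor 8 in
     8*128*8 bytes comes from. */
  __global__ void partial_dot(const double *x, const double *y,
                              double *partial, int n)
  {
    __shared__ double tmp[256];
    double sum = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
      sum += x[i] * y[i];
    tmp[threadIdx.x] = sum;
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
      __syncthreads();
      if (threadIdx.x < stride) tmp[threadIdx.x] += tmp[threadIdx.x + stride];
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = tmp[0];
  }

  /* Host side: a single small DtoH copy (WORKGROUPS * sizeof(double) bytes)
     plus a cheap loop on the CPU replaces a second kernel launch. */
  double dot_host_reduce(const double *d_x, const double *d_y,
                         double *d_partial, int n)
  {
    double h_partial[WORKGROUPS], result = 0.0;
    partial_dot<<<WORKGROUPS, 256>>>(d_x, d_y, d_partial, n);
    cudaMemcpy(h_partial, d_partial, WORKGROUPS * sizeof(double),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < WORKGROUPS; ++i) result += h_partial[i];
    return result;
  }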
> In DtoH2, it is the data movement from VecNorm. The size of the data is
> 8 B, which is just sizeof(PetscScalar).
This is most likely the scalar result required for the control flow on the CPU.
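
If it helps to see the pattern, here is a tiny sketch (not the actual PETSc
code; whether VecNorm goes through cublasDnrm2 exactly like this is an
assumption on my part): the norm is computed on the GPU, the single 8-byte
scalar lands in a host variable, and the CPU uses it for the convergence
check.

  #include <cublas_v2.h>

  /* Sketch: with the default CUBLAS_POINTER_MODE_HOST, the scalar norm is
     written to a host variable, i.e. an 8-byte DtoH transfer, so the CPU
     can use it for control flow. Error checking omitted. */
  static int converged(cublasHandle_t handle, const double *d_r, int n,
                       double tol)
  {
    double nrm = 0.0;
    cublasDnrm2(handle, n, d_r, 1, &nrm);  /* norm computed on the GPU */
    return nrm < tol;                      /* decision happens on the CPU */
  }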
> In DtoD1, it is the data movement from VecAXPY. The size of the data is
> 17.952 MB, which is exactly sizeof(PetscScalar)*length(b).
This is a vector assignment. If I remember correctly, it uses the
memcpy routines and hence shows up as a separate DtoD transfer instead of
just a kernel. It matches the time required for scal_kernel_val (scaling a
vector by a scalar), so it runs at full bandwidth on the GPU.
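
For illustration only (a sketch, not the actual PETSc code path), such an
assignment done via the CUDA memcpy API looks like this, and nvprof then
reports it as a DtoD transfer rather than as a kernel:

  #include <cuda_runtime.h>

  /* Sketch: assigning one device vector to another via the memcpy API.
     nvprof reports this as a DtoD transfer (here 17.952 MB = n * sizeof(double))
     instead of a kernel launch. d_dst and d_src are device pointers. */
  static void copy_device_vector(double *d_dst, const double *d_src, size_t n)
  {
    cudaMemcpy(d_dst, d_src, n * sizeof(double), cudaMemcpyDeviceToDevice);
  }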
> However, I do not understand the HostToDevice number in GMRES for
> np=1. The size of the data movement is 1.032 KB. I thought this was
> related to the updated upper Hessenberg matrix, but the number does not
> match. Can anyone help me understand the HtoD data movement in GMRES
> for np=1?
1032 = (128+1)*8, so this might be some auxiliary work information on
the GPU. I could track down the exact source of these transfers, but
that takes some effort. Let me know whether this information is
important to you, and I can do it.
Best regards,
Karli
>
> Thank you.
>
> Best,
> Xiangdong
>
> On Thu, Jul 18, 2019 at 1:14 PM Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
>
> Hi,
>
> as you can see from the screenshot, the communication is merely for
> scalars from the dot-products and/or norms. These are needed on the
> host for the control flow and convergence checks, and this is true for
> any iterative solver.
>
> Best regards,
> Karli
>
>
>
> On 7/18/19 3:11 PM, Xiangdong via petsc-users wrote:
> >
> >
> > On Thu, Jul 18, 2019 at 5:11 AM Smith, Barry F.
> > <bsmith at mcs.anl.gov> wrote:
> >
> >
> > 1) What preconditioner are you using? If any.
> >
> > Currently I am using none, as I want to understand how GMRES works
> > on the GPU.
> >
> >
> > 2) Where/how are you getting this information about the
> > MemCpy(HtoD) and MemCpy(DtoH) calls? We might like to utilize
> > this same sort of information to plan future optimizations.
> >
> > I am using nvprof and nvvp from the CUDA toolkit. It looks like there
> > is one MemCpy(HtoD) call and three MemCpy(DtoH) calls per iteration
> > for the np=1 case. See the attached snapshots.
> >
> > 3) Are you using more than 1 MPI rank?
> >
> >
> > I tried both np=1 and np=2. Attached please find snapshots from nvvp
> > for both the np=1 and np=2 cases. The figures show the GPU calls for
> > two pure GMRES iterations.
> >
> > Thanks.
> > Xiangdong
> >
> >
> > If you use the master branch (which we highly recommend for
> > anyone using GPUs and PETSc), the -log_view option will log
> > communication between the CPU and GPU and display it in the summary
> > table. This is useful for seeing exactly which operations are doing
> > vector communication between the CPU and GPU.
> >
> > We welcome all feedback on the GPUs, since the GPU support has
> > previously been only lightly used.
> >
> > Barry
> >
> >
> > > On Jul 16, 2019, at 9:05 PM, Xiangdong via petsc-users
> > > <petsc-users at mcs.anl.gov> wrote:
> > >
> > > Hello everyone,
> > >
> > > I am new to petsc gpu and have a simple question.
> > >
> > > When I tried to solve Ax=b, where A is MATAIJCUSPARSE and b and x
> > > are VECSEQCUDA, with GMRES (or GCR) and pcnone, I found that during
> > > each Krylov iteration there is one MemCpy(HtoD) call and one
> > > MemCpy(DtoH) call. Does that mean the Krylov solve is not 100% on
> > > the GPU and still needs some work from the CPU? What are these
> > > MemCpys for during each iteration?
> > >
> > > Thank you.
> > >
> > > Best,
> > > Xiangdong
> >
>