[petsc-users] MemCpy (HtoD and DtoH) in Krylov solver

Smith, Barry F. bsmith at mcs.anl.gov
Thu Jul 18 22:24:00 CDT 2019


  Thanks for your email.

> On Jul 18, 2019, at 8:45 PM, Xiangdong <epscodes at gmail.com> wrote:
> 
> Yes, nvprof can give the size of the data as well as the amount of time for data movement. See the attached snapshots.
> 
> I can understand some of the numbers, but not the HtoD case.
> 
> In DtoH1, it is the data movement from VecMDot. The size of data is 8.192KB, which is sizeof(PetscScalar) * MDOT_WORKGROUP_NUM * 8 = 8*128*8 = 8192. My question is: instead of calling cublasDdot nv times, why do you implement your own kernels? I guess it must be for performance, but can you explain a little more?

  It is twofold: reduce the number of kernel launches and reduce the number of times the x vector needs to be streamed from memory to the floating-point units. In some cases our code produces much faster times, but it has not been studied extensively and there may be better ways.

https://bitbucket.org/petsc/petsc/issues/319/revisit-optimizations-of-vecmdot_cuda
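
  To make that concrete, below is a minimal sketch of the fused idea (not the PETSc kernel itself; the name mdot4, the fixed group of four y vectors, and the NUM_GROUPS/GROUP_SIZE constants are illustrative stand-ins for MDOT_WORKGROUP_NUM/MDOT_WORKGROUP_SIZE). One launch computes four dot products against the same x, so x is streamed from memory once instead of four times, and the only DtoH traffic is the small array of per-block partial sums, which the CPU finishes reducing:

#include <cuda_runtime.h>
#include <stdio.h>

#define NUM_GROUPS 128   /* number of thread blocks (cf. MDOT_WORKGROUP_NUM)  */
#define GROUP_SIZE 128   /* threads per block       (cf. MDOT_WORKGROUP_SIZE) */

/* One launch computes x.y0, x.y1, x.y2, x.y3; each x[i] is loaded once and
   reused for all four products. Each block writes four partial sums. */
__global__ void mdot4(const double *x, const double *y0, const double *y1,
                      const double *y2, const double *y3, int n,
                      double *partial /* length 4*NUM_GROUPS */)
{
  __shared__ double s[4][GROUP_SIZE];
  double g0 = 0.0, g1 = 0.0, g2 = 0.0, g3 = 0.0;

  /* grid-stride loop over the vector entries */
  for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += gridDim.x*blockDim.x) {
    double xi = x[i];
    g0 += xi*y0[i]; g1 += xi*y1[i]; g2 += xi*y2[i]; g3 += xi*y3[i];
  }
  s[0][threadIdx.x] = g0; s[1][threadIdx.x] = g1;
  s[2][threadIdx.x] = g2; s[3][threadIdx.x] = g3;
  __syncthreads();

  /* tree reduction inside the block */
  for (int stride = blockDim.x/2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride)
      for (int k = 0; k < 4; k++) s[k][threadIdx.x] += s[k][threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0)
    for (int k = 0; k < 4; k++) partial[k*NUM_GROUPS + blockIdx.x] = s[k][0];
}

int main(void)
{
  const int n = 1 << 20;
  double *x, *y[4], *partial;
  double host_partial[4*NUM_GROUPS], result[4] = {0.0, 0.0, 0.0, 0.0};

  cudaMallocManaged((void **)&x, n*sizeof(double));
  cudaMallocManaged((void **)&partial, 4*NUM_GROUPS*sizeof(double));
  for (int k = 0; k < 4; k++) cudaMallocManaged((void **)&y[k], n*sizeof(double));
  for (int i = 0; i < n; i++) {
    x[i] = 1.0;
    for (int k = 0; k < 4; k++) y[k][i] = (double)(k + 1);
  }

  mdot4<<<NUM_GROUPS, GROUP_SIZE>>>(x, y[0], y[1], y[2], y[3], n, partial);

  /* the only DtoH traffic: 4*NUM_GROUPS*8 bytes of partial sums */
  cudaMemcpy(host_partial, partial, sizeof(host_partial), cudaMemcpyDeviceToHost);
  for (int k = 0; k < 4; k++)
    for (int g = 0; g < NUM_GROUPS; g++) result[k] += host_partial[k*NUM_GROUPS + g];

  printf("dots: %g %g %g %g (expect %d %d %d %d)\n",
         result[0], result[1], result[2], result[3], n, 2*n, 3*n, 4*n);
  return 0;
}

  Calling cublasDdot nv times instead would read x from memory nv times and launch nv kernels, each returning its scalar separately; the fused form is one launch plus one DtoH copy of the partial sums, which is where the 8*128*8 = 8192 bytes in your trace come from.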


> 
> In DtoH2, it is the data movement from VecNorm. The size of data is 8B, which is just the sizeof(PetscScalar).
> 
> In DtoD1, it is the data movement from VecAXPY. The size of data is 17.952MB, which is exactly sizeof(PetscScalar)*length(b).

> 
> However, I do not understand the HostToDevice number in GMRES for np=1. The size of the data movement is 1.032KB. I thought this was related to the updated upper Hessenberg matrix, but the number does not match. Can anyone help me understand the HtoD data movement in GMRES for np=1?

  You'll need to look at the GMRES code to get the numbers exactly right. 

  The results of the MDot are stored on the CPU and then copied back down to the GPU for the VecMAXPY(). In addition, KSPGMRESBuildSoln() possibly has another VecMAXPY() plus some vector operations, for example

  ierr = VecAXPY(vdest,1.0,VEC_TEMP);CHKERRQ(ierr);

Note that even though the 1.0 is a constant at compile time, VecAXPY_SeqCUDA() doesn't know this and has to bring the alpha value down from the CPU, which requires a memory transfer of 8 bytes.
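
  For what it's worth, here is a stripped-down illustration of that point using plain cuBLAS (not the PETSc wrappers; the zero-filled vectors are only there to make it self-contained):

#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
  const int    n     = 1 << 20;
  const double alpha = 1.0;   /* a literal to the caller, just a runtime double to the library */
  double *x, *y;
  cublasHandle_t handle;

  cudaMalloc((void **)&x, n*sizeof(double));
  cudaMalloc((void **)&y, n*sizeof(double));
  cudaMemset(x, 0, n*sizeof(double));   /* all-zero bytes == 0.0 for IEEE doubles */
  cudaMemset(y, 0, n*sizeof(double));

  cublasCreate(&handle);
  /* With the (default) host pointer mode the scalar sits in host memory at
     call time, so those 8 bytes have to reach the device one way or another;
     a profiler can show this as a small HtoD transfer attributed to the axpy,
     even though the caller wrote a literal 1.0. */
  cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
  cublasDaxpy(handle, n, &alpha, x, 1, y, 1);
  cudaDeviceSynchronize();

  cublasDestroy(handle);
  cudaFree(x); cudaFree(y);
  return 0;
}

  The point is simply that the axpy the library runs is the same whether alpha was the literal 1.0 or a value computed at run time; it only ever sees a runtime double in host memory.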

  How many iterations of GMRES did you run? Based on the MDot communication of 8192 bytes / 128 (MDOT_WORKGROUP_SIZE) = 64 bytes, it looks like 8 scalars are computed in the mdots. But the HtoD is 1032 bytes, which divided by 8 bytes is 129 scalars communicated. I cannot explain why the number of scalars going down seems so much larger than the number going up. Perhaps you can have your tool zoom in more to see exactly which routines the values are going down in? You'll need to dig around a bit deeper to attach each downward communication to its place in the code.
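
  (If it helps with the attribution, nvprof's per-call trace lists every individual memcpy with its size and direction; assuming the same executable and options as before, something along the lines of

    nvprof --print-gpu-trace ./your_app -ksp_type gmres -pc_type none

lets you line each small HtoD copy up with the kernel launches around it. The executable name here is of course just a placeholder.)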



  The Hessenberg matrix is never moved between the GPU and CPU; it just stays on the CPU. The assumption is that it is so small that doing its computation on the GPU would not be faster.


> 
> Thank you.
> 
> Best,
> Xiangdong
> 
> On Thu, Jul 18, 2019 at 1:14 PM Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
> Hi,
> 
> as you can see from the screenshot, the communication is merely for 
> scalars from the dot-products and/or norms. These are needed on the host 
> for the control flow and convergence checks; this is true for any 
> iterative solver.
> 
> Best regards,
> Karli
> 
> 
> 
> On 7/18/19 3:11 PM, Xiangdong via petsc-users wrote:
> > 
> > 
> > On Thu, Jul 18, 2019 at 5:11 AM Smith, Barry F. <bsmith at mcs.anl.gov 
> > <mailto:bsmith at mcs.anl.gov>> wrote:
> > 
> > 
> >         1) What preconditioner are you using? If any.
> > 
> > Currently I am using none, as I want to understand how GMRES works on the GPU.
> > 
> > 
> >         2) Where/how are you getting this information about the
> >     MemCpy(HtoD) and one call MemCpy(DtoH)? We might like to utilize
> >     this same sort of information to plan future optimizations.
> > 
> > I am using nvprof and nvvp from the CUDA toolkit. It looks like there are 
> > one MemCpy(HtoD) call and three MemCpy(DtoH) calls per iteration for the 
> > np=1 case. See the attached snapshots.
> > 
> >         3) Are you using more than 1 MPI rank?
> > 
> > 
> > I tried both np=1 and np=2. Attached please find snapshots from nvvp for 
> > both the np=1 and np=2 cases. The figures show GPU calls for two pure 
> > GMRES iterations.
> > 
> > Thanks.
> > Xiangdong
> > 
> > 
> >        If you use the master branch (which we highly recommend for
> >     anyone using GPUs and PETSc) the -log_view option will log
> >     communication between CPU and GPU and display it in the summary
> >     table. This is useful for seeing exactly what operations are doing
> >     vector communication between the CPU/GPU.
> > 
> >        We welcome all feedback on the GPU support since it has previously
> >     only been lightly used.
> > 
> >         Barry
> > 
> > 
> >      > On Jul 16, 2019, at 9:05 PM, Xiangdong via petsc-users
> >     <petsc-users at mcs.anl.gov <mailto:petsc-users at mcs.anl.gov>> wrote:
> >      >
> >      > Hello everyone,
> >      >
> >      > I am new to petsc gpu and have a simple question.
> >      >
> >      > When I tried to solve Ax=b where A is MATAIJCUSPARSE and b and x
> >     are VECSEQCUDA with GMRES (or GCR) and pcnone, I found that during
> >     each Krylov iteration there are one MemCpy(HtoD) call and one
> >     MemCpy(DtoH) call. Does that mean the Krylov solve is not 100% on the
> >     GPU and still needs some work from the CPU? What are these MemCpys for
> >     during each iteration?
> >      >
> >      > Thank you.
> >      >
> >      > Best,
> >      > Xiangdong
> > 
> <DtoH1.png><DtoH2.png><DtoD1.png><HtoD1.png>


