[petsc-users] MemCpy (HtoD and DtoH) in Krylov solver

Thu Jul 18 11:11:54 CDT 2019

  Thanks, these look like useful tools.

  Is there any way to get them to tell you where (in the function calls) the communication takes place and how much data is moved? Ideally they would also report the amount of time spent in the communication. If the amounts are small and the times are small, that is a very different story than if the amounts are large (like full vectors) and the times are large. We need to tackle the big weasels before the small ones.
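
  To make the small-versus-large distinction concrete, here is a minimal pure-Python sketch of the Arnoldi step at the heart of GMRES. It is not PETSc's implementation: it uses modified Gram-Schmidt, and the names `arnoldi_step` and `run_arnoldi` are made up for illustration. It only counts how many scalar reduction results (dot products and one norm) each iteration produces, since those are the values a host-side driver would need for the Hessenberg update and the convergence test. Whether these scalars account for the MemCpy calls seen in the profiler is an assumption that this thread does not confirm.

```python
import math


def dot(u, v):
    """Inner product; on a GPU this is a reduction that yields one scalar."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product; a full-vector operation."""
    return [dot(row, x) for row in A]


def arnoldi_step(A, V):
    """One Arnoldi step with modified Gram-Schmidt (illustrative only).

    V holds the orthonormal basis vectors built so far. Returns the new
    basis vector, the new Hessenberg column, and the number of scalars
    produced by reductions in this step.
    """
    w = matvec(A, V[-1])
    h = []
    for v in V:
        hij = dot(w, v)                                # one scalar per basis vector
        h.append(hij)
        w = [wi - hij * vi for wi, vi in zip(w, v)]    # full-vector update
    norm = math.sqrt(dot(w, w))                        # one more scalar
    h.append(norm)
    v_new = [wi / norm for wi in w]                    # assumes no breakdown (norm > 0)
    return v_new, h, len(h)


def run_arnoldi(A, b, steps):
    """Run a few Arnoldi steps and record the scalar count per step."""
    beta = math.sqrt(dot(b, b))
    V = [[bi / beta for bi in b]]
    scalars_per_step = []
    for _ in range(steps):
        v_new, _h, n = arnoldi_step(A, V)
        V.append(v_new)
        scalars_per_step.append(n)
    return V, scalars_per_step


if __name__ == "__main__":
    A = [[2.0, -1.0, 0.0, 0.0],
         [-1.0, 2.0, -1.0, 0.0],
         [0.0, -1.0, 2.0, -1.0],
         [0.0, 0.0, -1.0, 2.0]]
    b = [1.0, 0.0, 0.0, 0.0]
    _V, counts = run_arnoldi(A, b, steps=3)
    print(counts)
```

  In this sketch iteration j produces j+1 scalars, a few bytes each, whereas a full-vector transfer would move every entry of the vector. That is the difference the profiler output needs to show.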

  Barry

> On Jul 18, 2019, at 8:11 AM, Xiangdong <epscodes at gmail.com> wrote:
> 
> 
> 
> On Thu, Jul 18, 2019 at 5:11 AM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> 
>    1) What preconditioner are you using? If any.
>  
> Currently I am using none, as I want to understand how GMRES works on the GPU.
> 
> 
>    2) Where/how are you getting this information about the one MemCpy(HtoD) call and one MemCpy(DtoH) call? We might like to use the same sort of information to plan future optimizations.
> 
>  
> I am using nvprof and nvvp from the CUDA toolkit. It looks like there is one MemCpy(HtoD) call and three MemCpy(DtoH) calls per iteration for the np=1 case. See the attached snapshots.
>  
>    3) Are you using more than 1 MPI rank?
> 
> I tried both np=1 and np=2. Attached please find snapshots from nvvp for both the np=1 and np=2 cases. The figures show the GPU calls for two pure GMRES iterations.
> 
> Thanks.
> Xiangdong 
> 
> 
>   If you use the master branch (which we highly recommend for anyone using GPUs and PETSc), the -log_view option will log communication between the CPU and GPU and display it in the summary table. This is useful for seeing exactly which operations are doing vector communication between the CPU and GPU.
> 
>   We welcome all feedback on the GPU support since it has previously been only lightly used.
> 
>    Barry
> 
> 
> > On Jul 16, 2019, at 9:05 PM, Xiangdong via petsc-users <petsc-users at mcs.anl.gov> wrote:
> > 
> > Hello everyone,
> > 
> > I am new to PETSc on GPUs and have a simple question.
> > 
> > When I tried to solve Ax=b, where A is MATAIJCUSPARSE and b and x are VECSEQCUDA, with GMRES (or GCR) and pcnone, I found that during each Krylov iteration there is one MemCpy(HtoD) call and one MemCpy(DtoH) call. Does that mean the Krylov solve is not 100% on the GPU and still needs some work from the CPU? What are these MemCpys for during each iteration?
> > 
> > Thank you.
> > 
> > Best,
> > Xiangdong
> 
> <nvprof_gmres_np1.png><nvprof_gmres_np2.png>