<div dir="ltr"><div>Hello everyone,</div><div><br></div>I have two quick questions about running GPU solvers.<div><br></div><div>1) Number of MPI processes vs. number of GPUs. Is it true that we should set these two numbers equal if most of the computation is done on the GPU? In one case I tested with a single GPU, running with np=2 was 15% slower than np=1 (probably due to MPI communication). I am curious in which cases one would benefit from having more MPI processes than GPUs.</div><div><br></div><div>2) When I use MatLoad(A,viewer) to load binary-format data into an aijcusparse matrix A, how many matrices are created? Does it involve creating an intermediate aij matrix A' on the CPU and then converting it to an aijcusparse matrix A on the GPU? I am not sure whether such an intermediate aij matrix exists. If it does, what is the lifetime of this matrix? Is it destroyed right after the conversion?</div><div><br></div><div>Thanks for your help.</div><div><br></div><div>Best,</div><div>Xiangdong</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 19, 2019 at 12:08 PM Karl Rupp <<a href="mailto:rupp@iue.tuwien.ac.at">rupp@iue.tuwien.ac.at</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Xiangdong,<br>

<br>

<br>

> I can understand some of the numbers, but not the HtoD case.<br>

> <br>

> In DtoH1, it is the data movement from VecMDot. The size of data is <br>

> 8.192KB, which is sizeof(PetscScalar) * MDOT_WORKGROUP_NUM * 8 = 8*128*8 <br>

> = 8192. My question is: instead of calling cublasDdot nv times, why do <br>

> you implement your own kernels? I guess it must be for performance, but <br>

> can you explain a little more?<br>

<br>

Yes, this is a performance optimization. We've used several dot-products <br>

(which suffer from kernel launch latency) as well as matrix-vector-products <br>

(which suffer from extra matrix setup) in the past; in both cases, there was extra <br>

memory traffic, thus impacting performance.<br>

<br>

The reason why the data size is 8192 is to get around a separate <br>

reduction stage on the GPU (i.e. a second kernel launch). By moving the <br>

data to the CPU and doing the reduction there, one is faster than doing <br>

it on the GPU and then moving only a few numbers. This has to do with <br>

PCI-Express latency: It takes about the same time to send a single byte <br>

as sending a few kilobytes. Only beyond ~10 KB does bandwidth become the <br>

limiting factor.<br>
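<br>

A rough way to picture the latency argument above is a latency-plus-bandwidth model of a single transfer. This is only a sketch: the 5 microsecond latency and 10 GB/s bandwidth are illustrative placeholder values, not measurements from this thread, and transfer_time is a hypothetical helper name.<br>

```python
# Latency-plus-bandwidth model of a single host/device copy.
# Both constants are assumed placeholder values, not measured numbers.
LATENCY_S = 5e-6            # assumed fixed cost per transfer, in seconds
BANDWIDTH_B_PER_S = 10e9    # assumed sustained bandwidth, in bytes/second


def transfer_time(num_bytes):
    """Estimated time in seconds to copy num_bytes in one transfer."""
    return LATENCY_S + num_bytes / BANDWIDTH_B_PER_S


# Compare a single scalar, the 8192-byte VecMDot result, and a large copy.
for size in (8, 8192, 10_000_000):
    print(f"{size:10d} bytes: {transfer_time(size) * 1e6:10.2f} us")
```

Under these assumed numbers the 8-byte and 8192-byte copies cost nearly the same, because the fixed latency term dominates both, while the large copy is dominated by the bandwidth term.<br>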

<br>

<br>

<br>

> In DtoH2, it is the data movement from VecNorm. The size of data is 8B, <br>

> which is just the sizeof(PetscScalar).<br>

<br>

This is most likely the result required for the control flow on the CPU.<br>

<br>

<br>

> In DtoD1, it is the data movement from VecAXPY. The size of data is <br>

> 17.952MB, which is exactly sizeof(PetscScalar)*length(b).<br>

<br>

This is a vector assignment. If I remember correctly, it uses the <br>

memcpy-routines and hence shows up as a separate DtoD instead of just a <br>

kernel. It matches the time required for scal_kernel_val (scaling a <br>

vector by a scalar), so it runs at full bandwidth on the GPU.<br>

<br>

<br>

> However, I do not understand the number in HostToDevice in gmres for <br>

> np=1. The size of data movement is 1.032KB. I thought this is related to <br>

> the updated upper Hessenberg matrix, but the number does not match. Can <br>

> anyone help me understand the data movement of HToD in GMRES for np=1?<br>

<br>

1032 = (128+1)*8, so this might be some auxiliary work information on <br>

the GPU. I could figure out the exact source of these transfers, but <br>

that is some effort. Let me know whether this is important information <br>

for you, then I can do it.<br>
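<br>

The byte counts discussed in this thread can be checked with a few lines of arithmetic. A minimal sketch, assuming an 8-byte PetscScalar (real, double precision) and MDOT_WORKGROUP_NUM = 128 as quoted above; the variable names are only for illustration.<br>

```python
SIZEOF_PETSC_SCALAR = 8   # bytes; assumes real, double-precision scalars
MDOT_WORKGROUP_NUM = 128  # value quoted in the thread

# DtoH1 (VecMDot): sizeof(PetscScalar) * MDOT_WORKGROUP_NUM * 8
dtoh_mdot_bytes = SIZEOF_PETSC_SCALAR * MDOT_WORKGROUP_NUM * 8

# DtoH2 (VecNorm): a single scalar
dtoh_norm_bytes = SIZEOF_PETSC_SCALAR

# HtoD in GMRES: the size matches (128 + 1) scalars; the exact source
# of this transfer is not identified in the thread.
htod_bytes = (128 + 1) * SIZEOF_PETSC_SCALAR

print(dtoh_mdot_bytes, dtoh_norm_bytes, htod_bytes)
```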

<br>

Best regards,<br>

Karli<br>

<br>

<br>

<br>

<br>

> <br>

> Thank you.<br>

> <br>

> Best,<br>

> Xiangdong<br>

> <br>

> On Thu, Jul 18, 2019 at 1:14 PM Karl Rupp <<a href="mailto:rupp@iue.tuwien.ac.at" target="_blank">rupp@iue.tuwien.ac.at</a> <br>

> <mailto:<a href="mailto:rupp@iue.tuwien.ac.at" target="_blank">rupp@iue.tuwien.ac.at</a>>> wrote:<br>

> <br>

>     Hi,<br>

> <br>

>     as you can see from the screenshot, the communication is merely for<br>

>     scalars from the dot-products and/or norms. These are needed on the<br>

>     host<br>

>     for the control flow and convergence checks and is true for any<br>

>     iterative solver.<br>

> <br>

>     Best regards,<br>

>     Karli<br>

> <br>

> <br>

> <br>

>     On 7/18/19 3:11 PM, Xiangdong via petsc-users wrote:<br>

>      ><br>

>      ><br>

>      > On Thu, Jul 18, 2019 at 5:11 AM Smith, Barry F.<br>

>     <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a> <mailto:<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>><br>

>      > <mailto:<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a> <mailto:<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>>>> wrote:<br>

>      ><br>

>      ><br>

>      >         1) What preconditioner are you using? If any.<br>

>      ><br>

>      > Currently I am using none as I want to understand how gmres works<br>

>     on GPU.<br>

>      ><br>

>      ><br>

>      >         2) Where/how are you getting this information about the<br>

>      >     MemCpy(HtoD) and one call MemCpy(DtoH)? We might like to utilize<br>

>      >     this same sort of information to plan future optimizations.<br>

>      ><br>

>      > I am using nvprof and nvvp from cuda toolkit. It looks like there<br>

>     are<br>

>      > one MemCpy(HtoD) and three MemCpy(DtoH) calls per iteration for np=1<br>

>      > case. See the attached snapshots.<br>

>      ><br>

>      >         3) Are you using more than 1 MPI rank?<br>

>      ><br>

>      ><br>

>      > I tried both np=1 and np=2. Attached please find snapshots from<br>

>     nvvp for<br>

>      > both np=1 and np=2 cases. The figures showing gpu calls with two<br>

>     pure<br>

>      > gmres iterations.<br>

>      ><br>

>      > Thanks.<br>

>      > Xiangdong<br>

>      ><br>

>      ><br>

>      >        If you use the master branch (which we highly recommend for<br>

>      >     anyone using GPUs and PETSc) the -log_view option will log<br>

>      >     communication between CPU and GPU and display it in the summary<br>

>      >     table. This is useful for seeing exactly what operations are<br>

>     doing<br>

>      >     vector communication between the CPU/GPU.<br>

>      ><br>

>      >        We welcome all feedback on the GPUs since it previously<br>

>     has only<br>

>      >     been lightly used.<br>

>      ><br>

>      >         Barry<br>

>      ><br>

>      ><br>

>      >      > On Jul 16, 2019, at 9:05 PM, Xiangdong via petsc-users<br>

>      >     <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a> <mailto:<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>><br>

>     <mailto:<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a> <mailto:<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>>>><br>

>     wrote:<br>

>      >      ><br>

>      >      > Hello everyone,<br>

>      >      ><br>

>      >      > I am new to petsc gpu and have a simple question.<br>

>      >      ><br>

>      >      > When I tried to solve Ax=b where A is MATAIJCUSPARSE and b<br>

>     and x<br>

>      >     are VECSEQCUDA  with GMRES(or GCR) and pcnone, I found that<br>

>     during<br>

>      >     each krylov iteration, there are one call MemCpy(HtoD) and<br>

>     one call<br>

>      >     MemCpy(DtoH). Does that mean the Krylov solve is not 100% on<br>

>     GPU and<br>

>      >     the solve still needs some work from CPU? What are these<br>

>     MemCpys for<br>

>      >     during the each iteration?<br>

>      >      ><br>

>      >      > Thank you.<br>

>      >      ><br>

>      >      > Best,<br>

>      >      > Xiangdong<br>

>      ><br>

> <br>

</blockquote></div>