<div dir="ltr">There are a few things:<div>* GPU have higher latencies and so you basically need a large enough problem to get GPU speedup</div><div>* I assume you are assembling the matrix on the CPU. The copy of data to the GPU takes time and you really should be creating the matrix on the GPU</div><div>* I agree with Barry, Roughly 1M / GPU is around where you start seeing a win but this depends on a lot of things.</div><div>* There are startup costs, like the CPU-GPU copy. It is best to run one mat-vec, or whatever, push a new stage and then run the benchmark. The timing for this new stage will be separate in the log view data. Look at that.</div><div> - You can fake this by running your benchmark many times to amortize any setup costs.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 14, 2022 at 4:27 PM Rohan Yadav <<a href="mailto:rohany@alumni.cmu.edu">rohany@alumni.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi,<div><br></div><div>I'm looking to use PETSc with GPUs to do some linear algebra operations, like SpMV, SPMM etc. Building PETSc with `--with-cuda=1` and running with `-mat_type aijcusparse -vec_type cuda` gives me a large slowdown from the same code running on the CPU. This is not entirely unexpected, as things like data transfer costs across the PCIE might erroneously be included in my timing. Are there some examples of benchmarking GPU computations with PETSc, or just the proper way to write code in PETSc that will work for CPUs and GPUs?</div><div><br></div><div>Rohan</div></div>

</blockquote></div>