<div dir="ltr">Thanks for your explanation. <div><br></div><div>Is it correct to say that ONLY the VecNorm, VecDot, VecMDot, and MatMult operations in GMRES are done on the GPU, and everything else is done on the CPU?<div><br></div><div>Thanks.</div><div><br></div><div>Xiangdong </div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jul 17, 2019 at 7:34 AM Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Also, MPI communication is done from the host, so every mat-vec will do a "CopySome" call from the device, do the MPI comms, and then the next time you do GPU work it will copy from the host to get the update.<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jul 16, 2019 at 10:22 PM Matthew Knepley via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Tue, Jul 16, 2019 at 9:07 PM Xiangdong via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hello everyone,<div><br></div><div>I am new to PETSc's GPU support and have a simple question. </div><div><br></div><div>When I tried to solve Ax=b, where A is MATAIJCUSPARSE and b and x are VECSEQCUDA, with GMRES (or GCR) and PCNONE, I found that during each Krylov iteration there is one MemCpy(HtoD) call and one MemCpy(DtoH) call. Does that mean the Krylov solve is not 100% on the GPU and still needs some work from the CPU? 
What are these MemCpys for during each iteration?</div></div></blockquote><div><br></div><div>We have GPU experts on the list, but there is definitely communication, because we do not do the orthogonalization on the GPU,</div><div>just the BLAS ops. This is a very small amount of data, so it just contributes latency, and I would guess that it is less than kernel</div><div>launch latency.</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Thank you.</div><div><br></div><div>Best,</div><div>Xiangdong</div></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail-m_-1183555328547073044gmail-m_3910969314595676455gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>
</blockquote></div>
</blockquote></div>
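<div><br></div><div>[Editor's note] The setup described in the thread can be reproduced from the command line with PETSc's runtime options; a CUDA-enabled PETSc build then reports per-operation times, flops, and host/device copy counts in the log. A minimal sketch (the executable name <code>ex1</code> is a placeholder for any KSP tutorial example; the option set mirrors the configuration in the question):</div>

```shell
# Sketch: GMRES with no preconditioner on GPU-resident data.
# -mat_type aijcusparse / -vec_type cuda put A, b, x on the device;
# -log_view prints a profile (VecNorm, VecDot, VecMDot, MatMult, ...)
# from which host<->device traffic per iteration can be inspected.
./ex1 -mat_type aijcusparse -vec_type cuda \
      -ksp_type gmres -pc_type none \
      -log_view
```

As the replies note, the per-iteration copies come from the orthogonalization being done on the host and from MPI communication staged through the host, not from the BLAS kernels themselves.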