<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;"><br id="lineBreakAtBeginningOfMessage"><div><br><blockquote type="cite"><div>On 20 Dec 2023, at 8:42 AM, Sreeram R Venkat <srvenkat@utexas.edu> wrote:</div><br class="Apple-interchange-newline"><div><div dir="ltr">Ok, I think the error I'm getting has something to do with how the multiple solves are being done in succession. I'll try to see if there's anything I'm doing wrong there. <div><br></div><div>One question about the -pc_type lu -ksp_type preonly method: do you know which parts of the solve (factorization/triangular solves) are done on host and which are done on device?</div></div></div></blockquote><div><br></div><div>I think only the triangular solves can be done on device.</div><div>Since you have many right-hand sides, it may not be that bad.</div><div>GPU people will hopefully give you a more insightful answer.</div><div><br></div><div>Thanks,</div><div>Pierre</div><br><blockquote type="cite"><div><div dir="ltr"><div>Thanks,</div><div>Sreeram</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Dec 16, 2023 at 10:56 PM Pierre Jolivet <<a href="mailto:pierre@joliv.et">pierre@joliv.et</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Unfortunately, I am not able to reproduce such a failure with your input matrix.<div>I’ve used ex79 that I linked previously and the system is properly solved.</div><div><div>$ ./ex79 -pc_type hypre -ksp_type hpddm -ksp_hpddm_type cg -ksp_converged_reason -ksp_view_mat ascii::ascii_info -ksp_view_rhs ascii::ascii_info</div><div>Linear solve converged due to CONVERGED_RTOL iterations 6</div><div>Mat Object: 1 MPI process</div><div>  type: seqaijcusparse</div><div>  rows=289, cols=289</div><div>  total: nonzeros=2401, allocated nonzeros=2401</div><div>  total number of mallocs used during MatSetValues calls=0</div><div>    not using I-node routines</div><div>Mat Object: 1 MPI process</div><div>  type: seqdensecuda</div><div>  rows=289, cols=10</div><div>  total: nonzeros=2890, allocated nonzeros=2890</div><div>  total number of mallocs used during MatSetValues calls=0</div><div><br></div><div>You mentioned in a subsequent email that you are interested in systems with at most 1E6 unknowns, and up to 1E4 right-hand sides.</div><div>I’m not sure you can expect significant gains from using GPU for such systems.</div><div>Probably, the fastest approach would indeed be -pc_type lu -ksp_type preonly -ksp_matsolve_batch_size 100 or something, depending on the memory available on your host.</div><div><br></div><div>Thanks,</div><div>Pierre</div><div><br><blockquote type="cite"><div>On 15 Dec 2023, at 9:52 PM, Sreeram R Venkat <<a href="mailto:srvenkat@utexas.edu" target="_blank">srvenkat@utexas.edu</a>> wrote:</div><br><div><div dir="ltr">Here are the ksp_view files. 


I set the options -ksp_error_if_not_converged to try to get the vectors that caused the error. I noticed that some of the KSPMatSolves converge while others don't. In the code, the solves are called as:<div><br></div><div>input vector v --> insert data of v into a dense mat --> KSPMatSolve() --> MatMatMult() --> KSPMatSolve() --> insert data of dense mat into output vector w -- output w</div><div><br></div><div>The operator used in the KSP is a Laplacian-like operator, and the MatMatMult is with a Mass Matrix. The whole thing is supposed to be a solve with a biharmonic-like operator. I can also run it with only the first KSPMatSolve (i.e. just a Laplacian-like operator). In that case, the KSP reportedly converges after 0 iterations (see the next line), but this causes problems in other parts of the code later on. </div><div><br></div><div>I saw that sometimes the first KSPMatSolve "converges" after 0 iterations due to CONVERGED_RTOL. Then, the second KSPMatSolve produces a NaN/Inf. I tried setting ksp_min_it, but that didn't seem to do anything. </div><div><br></div><div>I'll keep trying different options and also try to get the MWE made (this KSPMatSolve is pretty performance critical for us). </div><div><br></div><div>Thanks for all your help,</div><div>Sreeram</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Dec 15, 2023 at 1:01 AM Pierre Jolivet <<a href="mailto:pierre@joliv.et" target="_blank">pierre@joliv.et</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><br><div><blockquote type="cite"><div>On 14 Dec 2023, at 11:45 PM, Sreeram R Venkat <<a href="mailto:srvenkat@utexas.edu" target="_blank">srvenkat@utexas.edu</a>> wrote:</div><br><div><div dir="auto">Thanks, I will try to create a minimal reproducible example. This may take me some time though, as I need to figure out how to extract only the relevant parts (the full program this solve is used in is getting quite complex).</div></div></blockquote><div><br></div><div>You could just do -ksp_view_mat binary:Amat.bin -ksp_view_pmat binary:Pmat.bin -ksp_view_rhs binary:rhs.bin and send me those three files (I’m guessing your are using double-precision scalars with 32-bit PetscInt).</div><br><blockquote type="cite"><div><div dir="auto"><div dir="auto">I'll also try out some of the BoomerAMG options to see if that helps.</div></div></div></blockquote><div><br></div><div>These should work (this is where all “PCMatApply()-ready” PC are being tested): <a href="https://petsc.org/release/src/ksp/ksp/tutorials/ex79.c.html#line215" target="_blank">https://petsc.org/release/src/ksp/ksp/tutorials/ex79.c.html#line215</a></div><div>You can see it’s also testing PCHYPRE + KSPHPDDM on device (but not with HIP).</div><div>I’m aware the performance should not be optimal (see your comment about host/device copies), I’ve money to hire someone to work on this but: a) I need to find the correct engineer/post-doc, b) I currently don’t have good use cases (of course, I could generate a synthetic benchmark, for science).</div><div>So even if you send me the three Mat, a MWE would be appreciated if the KSPMatSolve() is performance-critical for you (see point b) from above).</div><div><br></div><div>Thanks,</div><div>Pierre</div><br><blockquote type="cite"><div><div dir="auto"><div dir="auto">Thanks,</div><div dir="auto">Sreeram</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 14, 2023, 1:12 PM Pierre Jolivet <<a href="mailto:pierre@joliv.et" target="_blank">pierre@joliv.et</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><br id="m_-4214890537289386006m_-7017557633373574702m_-233179979615078302lineBreakAtBeginningOfMessage"><div><br><blockquote type="cite"><div>On 14 Dec 2023, at 8:02 PM, Sreeram R Venkat <<a href="mailto:srvenkat@utexas.edu" rel="noreferrer" target="_blank">srvenkat@utexas.edu</a>> wrote:</div><br><div><div dir="ltr">Hello Pierre,<div><br></div><div>Thank you for your reply. I tried out the HPDDM CG as you said, and it seems to be doing the batched solves, but the KSP is not converging due to a NaN or Inf being generated. I also noticed there are a lot of host-to-device and device-to-host copies of the matrices (the non-batched KSP solve did not have any memcopies). I have attached dump.0 again. Could you please take a look?</div></div></div></blockquote><div><br></div><div>Yes, but you’d need to send me something I can run with your set of options (if you are more confident doing this in private, you can remove the list from c/c).</div><div>Not all BoomerAMG smoothers handle blocks of right-hand sides, and there is not much error checking, so instead of erroring out, this may be the reason why you are getting garbage.</div><div><br></div><div>Thanks,</div><div>Pierre</div><br><blockquote type="cite"><div><div dir="ltr"><div>Thanks,</div><div>Sreeram</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 14, 2023 at 12:42 AM Pierre Jolivet <<a href="mailto:pierre@joliv.et" rel="noreferrer" target="_blank">pierre@joliv.et</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Hello Sreeram,<div>KSPCG (PETSc implementation of CG) does not handle solves with multiple columns at once.</div><div>There is only a single native PETSc KSP implementation which handles solves with multiple columns at once: KSPPREONLY.</div><div>If you use --download-hpddm, you can use a CG (or GMRES, or more advanced methods) implementation which handles solves with multiple columns at once (via -ksp_type hpddm -ksp_hpddm_type cg or KSPSetType(ksp, KSPHPDDM); KSPHPDDMSetType(ksp, KSP_HPDDM_TYPE_CG);).</div><div>I’m the main author of HPDDM, there is preliminary support for device matrices, but if it’s not working as intended/not faster than column by column, I’d be happy to have a deeper look (maybe in private), because most (if not all) of my users interested in (pseudo-)block Krylov solvers (i.e., solvers that treat right-hand sides in a single go) are using plain host matrices.</div><div><br></div><div>Thanks,</div><div>Pierre</div><div><br></div><div>PS: you could have a look at <a href="https://www.sciencedirect.com/science/article/abs/pii/S0898122121000055" rel="noreferrer" target="_blank">https://www.sciencedirect.com/science/article/abs/pii/S0898122121000055</a> to understand the philosophy behind block iterative methods in PETSc (and in HPDDM), src/mat/tests/ex237.c, the benchmark I mentioned earlier, was developed in the context of this paper to produce Figures 2-3. Note that this paper is now slightly outdated, since then, PCHYPRE and PCMG (among others) have been made “PCMatApply()-ready”.</div><div><div><br><blockquote type="cite"><div>On 13 Dec 2023, at 11:05 PM, Sreeram R Venkat <<a href="mailto:srvenkat@utexas.edu" rel="noreferrer" target="_blank">srvenkat@utexas.edu</a>> wrote:</div><br><div><div dir="ltr">Hello Pierre,<div><br></div><div>I am trying out the KSPMatSolve with the BoomerAMG preconditioner. However, I am noticing that it is still solving column by column (this is stated explicitly in the info dump attached). I looked at the code for KSPMatSolve_Private() and saw that as long as <span>ksp->ops->matsolve is true, it should do the batched solve, though I'm not sure where that gets set. </span></div><div><span><br></span></div><div><span>I am using the options -pc_type hypre -pc_hypre_type boomeramg when running the code.</span></div><div><span><br></span></div><div><span>Can you please help me with this?</span></div><div><span><br></span></div><div><span>Thanks,</span></div><div><span>Sreeram</span></div><div><span><br></span></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 7, 2023 at 4:04 PM Mark Adams <<a href="mailto:mfadams@lbl.gov" rel="noreferrer" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">N.B., AMGX interface is a bit experimental.<div><div>Mark</div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 7, 2023 at 4:11 PM Sreeram R Venkat <<a href="mailto:srvenkat@utexas.edu" rel="noreferrer" target="_blank">srvenkat@utexas.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">Oh, in that case I will try out BoomerAMG. Getting AMGX to build correctly was also tricky so hopefully the HYPRE build will be easier.<div dir="auto"><br></div><div dir="auto">Thanks,</div><div dir="auto">Sreeram</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 7, 2023, 3:03 PM Pierre Jolivet <<a href="mailto:pierre@joliv.et" rel="noreferrer noreferrer" target="_blank">pierre@joliv.et</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><br id="m_-4214890537289386006m_-7017557633373574702m_-233179979615078302m_-3044900305059802604m_4661579025182849266m_858150001532199206m_7082342255457086237m_-8161104843020054705lineBreakAtBeginningOfMessage"><div><br><blockquote type="cite"><div>On 7 Dec 2023, at 9:37 PM, Sreeram R Venkat <<a href="mailto:srvenkat@utexas.edu" rel="noreferrer noreferrer noreferrer" target="_blank">srvenkat@utexas.edu</a>> wrote:</div><br><div><div dir="ltr">Thank you Barry and Pierre; I will proceed with the first option. <div><br></div><div>I want to use the AMGX preconditioner for the KSP. I will try it out and see how it performs.</div></div></div></blockquote><div><br></div><div>Just FYI, AMGX does not handle systems with multiple RHS, and thus has no PCMatApply() implementation.</div><div>BoomerAMG does, and there is a PCMatApply_HYPRE_BoomerAMG() implementation.</div><div>But let us know if you need assistance figuring things out.</div><div><br></div><div>Thanks,</div><div>Pierre</div><br><blockquote type="cite"><div><div dir="ltr"><div>Thanks,</div><div>Sreeram</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 7, 2023 at 2:02 PM Pierre Jolivet <<a href="mailto:pierre@joliv.et" rel="noreferrer noreferrer noreferrer" target="_blank">pierre@joliv.et</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>To expand on Barry’s answer, we have observed repeatedly that MatMatMult with MatAIJ performs better than MatMult with MatMAIJ, you can reproduce this on your own with <a href="https://petsc.org/release/src/mat/tests/ex237.c.html" rel="noreferrer noreferrer noreferrer" target="_blank">https://petsc.org/release/src/mat/tests/ex237.c.html</a>.<div>Also, I’m guessing you are using some sort of preconditioner within your KSP.</div><div>Not all are “KSPMatSolve-ready”, i.e., they may treat blocks of right-hand sides column by column, which is very inefficient.</div><div>You could run your code with -info dump and send us dump.0 to see what needs to be done on our end to make things more efficient, should you not be satisfied with the current performance of the code.</div><div><br></div><div>Thanks,</div><div>Pierre<br><div><br><div><blockquote type="cite"><div>On 7 Dec 2023, at 8:34 PM, Barry Smith <<a href="mailto:bsmith@petsc.dev" rel="noreferrer noreferrer noreferrer" target="_blank">bsmith@petsc.dev</a>> wrote:</div><br><div><div style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br><br><blockquote type="cite"><div>On Dec 7, 2023, at 1:17 PM, Sreeram R Venkat <<a href="mailto:srvenkat@utexas.edu" rel="noreferrer noreferrer noreferrer" target="_blank">srvenkat@utexas.edu</a>> wrote:</div><br><div><div dir="ltr"><div dir="ltr" class="gmail_signature"><div dir="ltr"><div style="color:rgb(34,34,34)">I have 2 sequential matrices M and R (both MATSEQAIJCUSPARSE of size n x n) and a vector v of size n*m. v = [v_1 , v_2 ,... , v_m] where v_i has size n. The data for v can be stored either in column-major or row-major order.  Now, I want to do 2 types of operations:</div><div style="color:rgb(34,34,34)"><br></div><div style="color:rgb(34,34,34)">1. Matvecs of the form M*v_i = w_i, for i = 1..m. </div><div style="color:rgb(34,34,34)">2. KSPSolves of the form R*x_i = v_i, for i = 1..m.</div><div style="color:rgb(34,34,34)"><br></div><div style="color:rgb(34,34,34)">From what I have read on the documentation, I can think of 2 approaches. </div><div style="color:rgb(34,34,34)"><br></div><div style="color:rgb(34,34,34)">1. Get the pointer to the data in v (column-major) and use it to create a dense matrix V. Then do a MatMatMult with M*V = W, and take the data pointer of W to create the vector w. For KSPSolves, use KSPMatSolve with R and V.</div><div style="color:rgb(34,34,34)"><br></div><div style="color:rgb(34,34,34)">2. Create a MATMAIJ using M/R and use that for matvecs directly with the vector v. I don't know if KSPSolve with the MATMAIJ will know that it is a multiple RHS system and act accordingly.</div><div style="color:rgb(34,34,34)"><br></div><div style="color:rgb(34,34,34)">Which would be the more efficient option?</div></div></div></div></div></blockquote><div><br></div>Use 1. <br><blockquote type="cite"><div><div dir="ltr"><div dir="ltr" class="gmail_signature"><div dir="ltr"><div style="color:rgb(34,34,34)"><br></div><div style="color:rgb(34,34,34)">As a side-note, I am also wondering if there is a way to use row-major storage of the vector v.</div></div></div></div></div></blockquote><div><br></div>No</div><div style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br><blockquote type="cite"><div><div dir="ltr"><div dir="ltr" class="gmail_signature"><div dir="ltr"><div style="color:rgb(34,34,34)">The reason is that this could allow for more coalesced memory access when doing matvecs.</div></div></div></div></div></blockquote><div><br></div>  PETSc matrix-vector products use BLAS GMEV matrix-vector products for the computation so in theory they should already be well-optimized</div><div style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br><blockquote type="cite"><div><div dir="ltr"><div dir="ltr" class="gmail_signature"><div dir="ltr"><div style="color:rgb(34,34,34)"><br></div><div style="color:rgb(34,34,34)">Thanks,</div><div style="color:rgb(34,34,34)">Sreeram</div></div></div></div></div></blockquote></div></div></blockquote></div><br></div></div></div></blockquote></div>

</div></blockquote></div><br></div></blockquote></div>

</blockquote></div>

</blockquote></div>

<span id="m_-4214890537289386006m_-7017557633373574702m_-233179979615078302m_-3044900305059802604cid:f_lq4bjh4n0"><dump.0></span></div></blockquote></div><br></div></div></blockquote></div>

<span id="m_-4214890537289386006m_-7017557633373574702m_-233179979615078302cid:f_lq5kd6ap0"><dump.0></span></div></blockquote></div><br></div></blockquote></div>

</div></blockquote></div><br></div></blockquote></div>

<span id="m_-4214890537289386006cid:f_lq73gysm1"><Pmat.bin></span><span id="m_-4214890537289386006cid:f_lq73gys90"><Amat.bin></span><span id="m_-4214890537289386006cid:f_lq73gyso2"><rhs.bin></span></div></blockquote></div><br></div></div></blockquote></div>

</div></blockquote></div><br></body></html>