[petsc-users] Matvecs and KSPSolves with multiple vectors

Junchao Zhang junchao.zhang at gmail.com
Thu Dec 21 12:38:24 CST 2023


On Thu, Dec 21, 2023 at 5:54 AM Matthew Knepley <knepley at gmail.com> wrote:

> On Thu, Dec 21, 2023 at 6:46 AM Sreeram R Venkat <srvenkat at utexas.edu>
> wrote:
>
>> Ok, I think the error I'm getting has something to do with how the
>> multiple solves are being done in succession. I'll try to see if there's
>> anything I'm doing wrong there.
>>
>> One question about the -pc_type lu -ksp_type preonly method: do you know
>> which parts of the solve (factorization/triangular solves) are done on host
>> and which are done on device?
>>
>
> For SEQDENSE, I believe both the factorization and the solve are done on
> device. It is hard to see, but I believe the dispatch code is here:
>
Yes, that is correct.

>
>
> https://gitlab.com/petsc/petsc/-/blob/main/src/mat/impls/dense/seq/cupm/matseqdensecupm.hpp?ref_type=heads#L368
>
>   Thanks,
>
>      Matt
>
>
>> Thanks,
>> Sreeram
>>
>> On Sat, Dec 16, 2023 at 10:56 PM Pierre Jolivet <pierre at joliv.et> wrote:
>>
>>> Unfortunately, I am not able to reproduce such a failure with your input
>>> matrix.
>>> I’ve used ex79, which I linked previously, and the system is properly
>>> solved.
>>> $ ./ex79 -pc_type hypre -ksp_type hpddm -ksp_hpddm_type cg
>>> -ksp_converged_reason -ksp_view_mat ascii::ascii_info -ksp_view_rhs
>>> ascii::ascii_info
>>> Linear solve converged due to CONVERGED_RTOL iterations 6
>>> Mat Object: 1 MPI process
>>>   type: seqaijcusparse
>>>   rows=289, cols=289
>>>   total: nonzeros=2401, allocated nonzeros=2401
>>>   total number of mallocs used during MatSetValues calls=0
>>>     not using I-node routines
>>> Mat Object: 1 MPI process
>>>   type: seqdensecuda
>>>   rows=289, cols=10
>>>   total: nonzeros=2890, allocated nonzeros=2890
>>>   total number of mallocs used during MatSetValues calls=0
>>>
>>> You mentioned in a subsequent email that you are interested in systems
>>> with at most 1E6 unknowns, and up to 1E4 right-hand sides.
>>> I’m not sure you can expect significant gains from using GPUs for such
>>> systems.
>>> Probably, the fastest approach would indeed be -pc_type lu -ksp_type
>>> preonly -ksp_matsolve_batch_size 100 or something, depending on the memory
>>> available on your host.
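A minimal sketch of such a direct-solve setup, assuming a recent PETSc with
PetscCall()/PETSC_SUCCESS, a square AIJ operator R, a dense block of
right-hand sides B, and a matching dense block of solutions X (the batch size
value is only illustrative):

  #include <petscksp.h>

  /* Sketch only: LU + preonly applied to all right-hand sides at once. */
  static PetscErrorCode BatchedDirectSolve(Mat R, Mat B, Mat X)
  {
    KSP ksp;
    PC  pc;

    PetscFunctionBeginUser;
    PetscCall(KSPCreate(PetscObjectComm((PetscObject)R), &ksp));
    PetscCall(KSPSetOperators(ksp, R, R));
    PetscCall(KSPSetType(ksp, KSPPREONLY));       /* no Krylov iterations           */
    PetscCall(KSPGetPC(ksp, &pc));
    PetscCall(PCSetType(pc, PCLU));               /* LU factorization               */
    PetscCall(KSPSetMatSolveBatchSize(ksp, 100)); /* -ksp_matsolve_batch_size 100   */
    PetscCall(KSPSetFromOptions(ksp));
    PetscCall(KSPMatSolve(ksp, B, X));            /* triangular solves, all columns */
    PetscCall(KSPDestroy(&ksp));
    PetscFunctionReturn(PETSC_SUCCESS);
  }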
>>>
>>> Thanks,
>>> Pierre
>>>
>>> On 15 Dec 2023, at 9:52 PM, Sreeram R Venkat <srvenkat at utexas.edu>
>>> wrote:
>>>
>>> Here are the ksp_view files. I set the option
>>> -ksp_error_if_not_converged to try to get the vectors that caused the
>>> error. I noticed that some of the KSPMatSolves converge while others don't.
>>> In the code, the solves are called as:
>>>
>>> input vector v --> insert data of v into a dense mat --> KSPMatSolve()
>>> --> MatMatMult() --> KSPMatSolve() --> insert data of dense mat into output
>>> vector w -- output w
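A rough sketch of that pipeline; the names are illustrative: ksp is the
Laplacian-like solver, Mass the mass matrix, V the dense matrix wrapping the
data of v, and W the dense matrix whose data is copied back into w.

  Mat T, MT, W;
  PetscCall(MatDuplicate(V, MAT_DO_NOT_COPY_VALUES, &T));
  PetscCall(KSPMatSolve(ksp, V, T));                                      /* first Laplacian-like solve  */
  PetscCall(MatMatMult(Mass, T, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &MT)); /* apply the mass matrix       */
  PetscCall(MatDuplicate(V, MAT_DO_NOT_COPY_VALUES, &W));
  PetscCall(KSPMatSolve(ksp, MT, W));                                     /* second Laplacian-like solve */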
>>>
>>> The operator used in the KSP is a Laplacian-like operator, and the
>>> MatMatMult is with a Mass Matrix. The whole thing is supposed to be a solve
>>> with a biharmonic-like operator. I can also run it with only the first
>>> KSPMatSolve (i.e. just a Laplacian-like operator). In that case, the KSP
>>> reportedly converges after 0 iterations (see the next paragraph), but this
>>> causes problems in other parts of the code later on.
>>>
>>> I saw that sometimes the first KSPMatSolve "converges" after 0
>>> iterations due to CONVERGED_RTOL. Then, the second KSPMatSolve produces a
>>> NaN/Inf. I tried setting -ksp_min_it, but that didn't seem to do anything.
>>>
>>> I'll keep trying different options and also try to get the MWE made
>>> (this KSPMatSolve is pretty performance critical for us).
>>>
>>> Thanks for all your help,
>>> Sreeram
>>>
>>> On Fri, Dec 15, 2023 at 1:01 AM Pierre Jolivet <pierre at joliv.et> wrote:
>>>
>>>>
>>>> On 14 Dec 2023, at 11:45 PM, Sreeram R Venkat <srvenkat at utexas.edu>
>>>> wrote:
>>>>
>>>> Thanks, I will try to create a minimal reproducible example. This may
>>>> take me some time though, as I need to figure out how to extract only the
>>>> relevant parts (the full program this solve is used in is getting quite
>>>> complex).
>>>>
>>>>
>>>> You could just do -ksp_view_mat binary:Amat.bin -ksp_view_pmat
>>>> binary:Pmat.bin -ksp_view_rhs binary:rhs.bin and send me those three files
>>>> (I’m guessing you are using double-precision scalars with 32-bit PetscInt).
>>>>
>>>> I'll also try out some of the BoomerAMG options to see if that helps.
>>>>
>>>>
>>>> These should work (this is where all “PCMatApply()-ready” PCs are being
>>>> tested):
>>>> https://petsc.org/release/src/ksp/ksp/tutorials/ex79.c.html#line215
>>>> You can see it’s also testing PCHYPRE + KSPHPDDM on device (but not
>>>> with HIP).
>>>> I’m aware the performance is likely not optimal (see your comment about
>>>> host/device copies). I have money to hire someone to work on this, but: a) I
>>>> need to find the right engineer/post-doc, b) I currently don’t have good
>>>> use cases (of course, I could generate a synthetic benchmark, for science).
>>>> So even if you send me the three Mat, a MWE would be appreciated if the
>>>> KSPMatSolve() is performance-critical for you (see point b) from above).
>>>>
>>>> Thanks,
>>>> Pierre
>>>>
>>>> Thanks,
>>>> Sreeram
>>>>
>>>> On Thu, Dec 14, 2023, 1:12 PM Pierre Jolivet <pierre at joliv.et> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 14 Dec 2023, at 8:02 PM, Sreeram R Venkat <srvenkat at utexas.edu>
>>>>> wrote:
>>>>>
>>>>> Hello Pierre,
>>>>>
>>>>> Thank you for your reply. I tried out the HPDDM CG as you said, and it
>>>>> seems to be doing the batched solves, but the KSP is not converging due to
>>>>> a NaN or Inf being generated. I also noticed there are a lot of
>>>>> host-to-device and device-to-host copies of the matrices (the non-batched
>>>>> KSP solve did not have any memory copies). I have attached dump.0 again. Could
>>>>> you please take a look?
>>>>>
>>>>>
>>>>> Yes, but you’d need to send me something I can run with your set of
>>>>> options (if you are more comfortable doing this in private, you can drop
>>>>> the list from Cc).
>>>>> Not all BoomerAMG smoothers handle blocks of right-hand sides, and
>>>>> there is not much error checking, so instead of erroring out, this may be
>>>>> why you are getting garbage.
>>>>>
>>>>> Thanks,
>>>>> Pierre
>>>>>
>>>>> Thanks,
>>>>> Sreeram
>>>>>
>>>>> On Thu, Dec 14, 2023 at 12:42 AM Pierre Jolivet <pierre at joliv.et>
>>>>> wrote:
>>>>>
>>>>>> Hello Sreeram,
>>>>>> KSPCG (PETSc implementation of CG) does not handle solves with
>>>>>> multiple columns at once.
>>>>>> There is only a single native PETSc KSP implementation which handles
>>>>>> solves with multiple columns at once: KSPPREONLY.
>>>>>> If you use --download-hpddm, you can use a CG (or GMRES, or more
>>>>>> advanced methods) implementation which handles solves with multiple columns
>>>>>> at once (via -ksp_type hpddm -ksp_hpddm_type cg or KSPSetType(ksp,
>>>>>> KSPHPDDM); KSPHPDDMSetType(ksp, KSP_HPDDM_TYPE_CG);).
>>>>>> I’m the main author of HPDDM; there is preliminary support for device
>>>>>> matrices, but if it’s not working as intended or not faster than column by
>>>>>> column, I’d be happy to take a deeper look (maybe in private), because most
>>>>>> (if not all) of my users interested in (pseudo-)block Krylov solvers (i.e.,
>>>>>> solvers that treat right-hand sides in a single go) are using plain host
>>>>>> matrices.
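For reference, a minimal programmatic version of the HPDDM option mentioned
above, assuming PETSc was configured with --download-hpddm and that ksp, its
operator, and the dense blocks B and X already exist:

  PetscCall(KSPSetType(ksp, KSPHPDDM));
  PetscCall(KSPHPDDMSetType(ksp, KSP_HPDDM_TYPE_CG)); /* (pseudo-)block CG from HPDDM        */
  PetscCall(KSPSetFromOptions(ksp));                  /* -ksp_hpddm_type ... still overrides */
  PetscCall(KSPMatSolve(ksp, B, X));                  /* all right-hand sides in one call    */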
>>>>>>
>>>>>> Thanks,
>>>>>> Pierre
>>>>>>
>>>>>> PS: you could have a look at
>>>>>> https://www.sciencedirect.com/science/article/abs/pii/S0898122121000055 to
>>>>>> understand the philosophy behind block iterative methods in PETSc (and in
>>>>>> HPDDM), src/mat/tests/ex237.c, the benchmark I mentioned earlier, was
>>>>>> developed in the context of this paper to produce Figures 2-3. Note that
>>>>>> this paper is now slightly outdated, since then, PCHYPRE and PCMG (among
>>>>>> others) have been made “PCMatApply()-ready”.
>>>>>>
>>>>>> On 13 Dec 2023, at 11:05 PM, Sreeram R Venkat <srvenkat at utexas.edu>
>>>>>> wrote:
>>>>>>
>>>>>> Hello Pierre,
>>>>>>
>>>>>> I am trying out the KSPMatSolve with the BoomerAMG preconditioner.
>>>>>> However, I am noticing that it is still solving column by column (this is
>>>>>> stated explicitly in the info dump attached). I looked at the code for
>>>>>> KSPMatSolve_Private() and saw that as long as ksp->ops->matsolve is
>>>>>> set, it should do the batched solve, though I'm not sure where that gets
>>>>>> set.
>>>>>>
>>>>>> I am using the options -pc_type hypre -pc_hypre_type boomeramg when
>>>>>> running the code.
>>>>>>
>>>>>> Can you please help me with this?
>>>>>>
>>>>>> Thanks,
>>>>>> Sreeram
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 7, 2023 at 4:04 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>
>>>>>>> N.B., the AMGX interface is a bit experimental.
>>>>>>> Mark
>>>>>>>
>>>>>>> On Thu, Dec 7, 2023 at 4:11 PM Sreeram R Venkat <srvenkat at utexas.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Oh, in that case I will try out BoomerAMG. Getting AMGX to build
>>>>>>>> correctly was also tricky, so hopefully the HYPRE build will be easier.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sreeram
>>>>>>>>
>>>>>>>> On Thu, Dec 7, 2023, 3:03 PM Pierre Jolivet <pierre at joliv.et>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 7 Dec 2023, at 9:37 PM, Sreeram R Venkat <srvenkat at utexas.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Thank you Barry and Pierre; I will proceed with the first option.
>>>>>>>>>
>>>>>>>>> I want to use the AMGX preconditioner for the KSP. I will try it
>>>>>>>>> out and see how it performs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Just FYI, AMGX does not handle systems with multiple RHS, and thus
>>>>>>>>> has no PCMatApply() implementation.
>>>>>>>>> BoomerAMG does, and there is a PCMatApply_HYPRE_BoomerAMG()
>>>>>>>>> implementation.
>>>>>>>>> But let us know if you need assistance figuring things out.
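A minimal sketch of that configuration, equivalent to the command-line
options -pc_type hypre -pc_hypre_type boomeramg (ksp is assumed to already
exist):

  PC pc;
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCHYPRE));
  PetscCall(PCHYPRESetType(pc, "boomeramg")); /* BoomerAMG has a PCMatApply() implementation */
  PetscCall(KSPSetFromOptions(ksp));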
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Pierre
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Sreeram
>>>>>>>>>
>>>>>>>>> On Thu, Dec 7, 2023 at 2:02 PM Pierre Jolivet <pierre at joliv.et>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> To expand on Barry’s answer, we have observed repeatedly that
>>>>>>>>>> MatMatMult with MatAIJ performs better than MatMult with MatMAIJ; you can
>>>>>>>>>> reproduce this on your own with
>>>>>>>>>> https://petsc.org/release/src/mat/tests/ex237.c.html.
>>>>>>>>>> Also, I’m guessing you are using some sort of preconditioner
>>>>>>>>>> within your KSP.
>>>>>>>>>> Not all are “KSPMatSolve-ready”, i.e., they may treat blocks of
>>>>>>>>>> right-hand sides column by column, which is very inefficient.
>>>>>>>>>> You could run your code with -info dump and send us dump.0 to see
>>>>>>>>>> what needs to be done on our end to make things more efficient, should you
>>>>>>>>>> not be satisfied with the current performance of the code.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Pierre
>>>>>>>>>>
>>>>>>>>>> On 7 Dec 2023, at 8:34 PM, Barry Smith <bsmith at petsc.dev> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Dec 7, 2023, at 1:17 PM, Sreeram R Venkat <srvenkat at utexas.edu>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I have 2 sequential matrices M and R (both MATSEQAIJCUSPARSE of
>>>>>>>>>> size n x n) and a vector v of size n*m, where v = [v_1, v_2, ..., v_m] and
>>>>>>>>>> each v_i has size n. The data for v can be stored either in column-major or
>>>>>>>>>> row-major order.  Now, I want to do 2 types of operations:
>>>>>>>>>>
>>>>>>>>>> 1. Matvecs of the form M*v_i = w_i, for i = 1..m.
>>>>>>>>>> 2. KSPSolves of the form R*x_i = v_i, for i = 1..m.
>>>>>>>>>>
>>>>>>>>>> From what I have read on the documentation, I can think of 2
>>>>>>>>>> approaches.
>>>>>>>>>>
>>>>>>>>>> 1. Get the pointer to the data in v (column-major) and use it to
>>>>>>>>>> create a dense matrix V. Then do a MatMatMult with M*V = W, and take the
>>>>>>>>>> data pointer of W to create the vector w. For KSPSolves, use KSPMatSolve
>>>>>>>>>> with R and V.
>>>>>>>>>>
>>>>>>>>>> 2. Create a MATMAIJ using M/R and use that for matvecs directly
>>>>>>>>>> with the vector v. I don't know if KSPSolve with the MATMAIJ will know that
>>>>>>>>>> it is a multiple RHS system and act accordingly.
>>>>>>>>>>
>>>>>>>>>> Which would be the more efficient option?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Use 1.
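A minimal host-side sketch of option 1, using the sizes and names from the
question (on device, the analogous MatCreateDenseCUDA()/MatDenseCUDAGetArray()
routines would be used instead; ksp is assumed to have R as its operator):

  PetscScalar *a;
  Mat          V, W, X;
  PetscCall(VecGetArray(v, &a));                                      /* v holds m columns of size n   */
  PetscCall(MatCreateSeqDense(PETSC_COMM_SELF, n, m, a, &V));         /* V wraps v's data, no copy     */
  PetscCall(MatMatMult(M, V, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &W)); /* w_i = M v_i, all i at once    */
  PetscCall(MatDuplicate(V, MAT_DO_NOT_COPY_VALUES, &X));
  PetscCall(KSPMatSolve(ksp, V, X));                                  /* R x_i = v_i, all i at once    */
  PetscCall(VecRestoreArray(v, &a));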
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As a side-note, I am also wondering if there is a way to use
>>>>>>>>>> row-major storage of the vector v.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> No
>>>>>>>>>>
>>>>>>>>>> The reason is that this could allow for more coalesced memory
>>>>>>>>>> access when doing matvecs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>   PETSc matrix-vector products use BLAS GEMV for the computation,
>>>>>>>>>> so in theory they should already be well-optimized.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Sreeram
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> <dump.0>
>>>>>>
>>>>>>
>>>>>> <dump.0>
>>>>>
>>>>>
>>>>>
>>>> <Pmat.bin><Amat.bin><rhs.bin>
>>>
>>>
>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>