From varunhiremath at gmail.com Fri Oct 1 00:12:18 2021 From: varunhiremath at gmail.com (Varun Hiremath) Date: Thu, 30 Sep 2021 22:12:18 -0700 Subject: [petsc-users] SLEPc: smallest eigenvalues In-Reply-To: <4FC17DE7-B910-43D8-9EC5-816285FD52F4@dsic.upv.es> References: <179BDB69-1EC0-4334-A964-ABE29E33EFF8@dsic.upv.es> <5B1750B3-E05F-45D7-929B-A5CF816B4A75@dsic.upv.es> <7031EC8B-A238-45AD-B4C2-FA8988022864@dsic.upv.es> <6B968AE2-8325-4E20-B94A-16ECDD0FBA90@dsic.upv.es> <4BB88AB3-410E-493C-9161-97775747936D@dsic.upv.es> <32B34038-7E1A-42CA-A55D-9AF9D41D1697@dsic.upv.es> <4FC17DE7-B910-43D8-9EC5-816285FD52F4@dsic.upv.es> Message-ID: Hi Jose, Thanks again for your valuable suggestions. I am still working on this but wanted to give you a quick update. For the linear problem, I tried different KSP solvers, and finally, I'm getting good convergence using CGS with LU (using MUMPS) inexact inverse. So thank you very much for your help! But for the quadratic problem, I'm still struggling. As you suggested, I have now started using the PEP solver. For the simple case where the K matrix is explicitly known, everything works fine. But for the case where K is a shell matrix, it struggles to converge. I am yet to try the scaling option and some other preconditioning options. I will get back to you on this if I have any questions. Appreciate your help! Thanks, Varun On Tue, Sep 28, 2021 at 8:09 AM Jose E. Roman wrote: > > > > El 28 sept 2021, a las 7:50, Varun Hiremath > escribi?: > > > > Hi Jose, > > > > I implemented the LU factorized preconditioner and tested it using > PREONLY + LU, but that actually is converging to the wrong eigenvalues, > compared to just using BICGS + BJACOBI, or simply computing > EPS_SMALLEST_MAGNITUDE without any preconditioning. My preconditioning > matrix is only a 1st order approximation, and the off-diagonal terms are > not very accurate, so I'm guessing this is why the LU factorization doesn't > help much? Nonetheless, using BICGS + BJACOBI with slightly relaxed > tolerances seems to be working fine. > > If your PCMAT is not an exact inverse, then you have to iterate, i.e. not > use KSPPREONLY but KSPBCGS or another. > > > > > I now want to test the same preconditioning idea for a quadratic > problem. I am solving a quadratic equation similar to Eqn.(5.1) in the > SLEPc manual: > > (K + lambda*C + lambda^2*M)*x = 0, > > I don't use the PEP package directly, but solve this by linearizing > similar to Eqn.(5.3) and calling EPS. Without explicitly forming the full > matrix, I just use the block matrix structure as explained in the below > example and that works nicely for my case: > > https://slepc.upv.es/documentation/current/src/eps/tutorials/ex9.c.html > > Using PEP is generally recommended. The default solver TOAR is > memory-efficient and performs less computation than a trivial > linearization. In addition, PEP allows you to do scaling, which is often > very important to get accurate results in some problems, depending on > conditioning. > > In your case K is a shell matrix, so things may not be trivial. If I am > not wrong, you should be able to use STSetPreconditionerMat() for a PEP, > where the preconditioner in this case should be built to approximate > Q(sigma), where Q(.) is the quadratic polynomial and sigma is the target. > > > > > In my case, K is not explicitly known, and for linear problems, where C > = 0, I am using a 1st order approximation of K as the preconditioner. 
Now > could you please tell me if there is a way to conveniently set the > preconditioner for the quadratic problem, which will be of the form [-K 0; > 0 I]? Note that K is constructed in parallel (the rows are distributed), so > I wasn't sure how to construct this preconditioner matrix which will be > compatible with the shell matrix structure that I'm using to define the > MatMult function as in ex9. > > The shell matrix of ex9.c interleaves the local parts of the first block > and the second block. In other words, a process' local part consists of the > local rows of the first block followed by the local rows of the second > block. In your case, the local rows of K followed by the local rows of the > identity (appropriately padded with zeros). > > Jose > > > > > > Thanks, > > Varun > > > > On Fri, Sep 24, 2021 at 11:50 PM Varun Hiremath > wrote: > > Ok, great! I will give that a try, thanks for your help! > > > > On Fri, Sep 24, 2021 at 11:12 PM Jose E. Roman > wrote: > > Yes, you can use PCMAT > https://petsc.org/release/docs/manualpages/PC/PCMAT.html then pass a > preconditioner matrix that performs the inverse via a shell matrix. > > > > > El 25 sept 2021, a las 8:07, Varun Hiremath > escribi?: > > > > > > Hi Jose, > > > > > > Thanks for checking my code and providing suggestions. > > > > > > In my particular case, I don't know the matrix A explicitly, I compute > A*x in a matrix-free way within a shell matrix, so I can't use any of the > direct factorization methods. But just a question regarding your suggestion > to compute a (parallel) LU factorization. In our work, we do use MUMPS to > compute the parallel factorization. For solving the generalized problem, > A*x = lambda*B*x, we are computing inv(B)*A*x within a shell matrix, where > factorization of B is computed using MUMPS. (We don't call MUMPS through > SLEPc as we have our own MPI wrapper and other user settings to handle.) > > > > > > So for the preconditioning, instead of using the iterative solvers, > can I provide a shell matrix that computes inv(P)*x corrections (where P is > the preconditioner matrix) using MUMPS direct solver? > > > > > > And yes, thanks, #define PETSC_USE_COMPLEX 1 is not needed, it works > without it. > > > > > > Regards, > > > Varun > > > > > > On Fri, Sep 24, 2021 at 9:14 AM Jose E. Roman > wrote: > > > If you do > > > $ ./acoustic_matrix_test.o -shell 0 -st_type sinvert -deflate 1 > > > then it is using an LU factorization (the default), which is fast. > > > > > > Use -eps_view to see which solver settings are you using. > > > > > > BiCGStab with block Jacobi does not work for you matrix, it exceeds > the maximum 10000 iterations. So this is not viable unless you can find a > better preconditioner for your problem. If not, just using > EPS_SMALLEST_MAGNITUDE will be faster. > > > > > > Computing smallest magnitude eigenvalues is a difficult task. The most > robust way is to compute a (parallel) LU factorization if you can afford it. > > > > > > > > > A side note: don't add this to your source code > > > #define PETSC_USE_COMPLEX 1 > > > This define is taken from PETSc's include files, you should not mess > with it. 
Instead, you probably want to add something like this AFTER > #include : > > > #if !defined(PETSC_USE_COMPLEX) > > > #error "Requires complex scalars" > > > #endif > > > > > > Jose > > > > > > > > > > El 22 sept 2021, a las 19:38, Varun Hiremath < > varunhiremath at gmail.com> escribi?: > > > > > > > > Hi Jose, > > > > > > > > Thank you, that explains it and my example code works now without > specifying "-eps_target 0" in the command line. > > > > > > > > However, both the Krylov inexact shift-invert and JD solvers are > struggling to converge for some of my actual problems. The issue seems to > be related to non-symmetric general matrices. I have extracted one such > matrix attached here as MatA.gz (size 100k), and have also included a short > program that loads this matrix and then computes the smallest eigenvalues > as I described earlier. > > > > > > > > For this matrix, if I compute the eigenvalues directly (without > using the shell matrix) using shift-and-invert (as below) then it converges > in less than a minute. > > > > $ ./acoustic_matrix_test.o -shell 0 -st_type sinvert -deflate 1 > > > > > > > > However, if I use the shell matrix and use any of the preconditioned > solvers JD or Krylov shift-invert (as shown below) with the same matrix as > the preconditioner, then they struggle to converge. > > > > $ ./acoustic_matrix_test.o -usejd 1 -deflate 1 > > > > $ ./acoustic_matrix_test.o -sinvert 1 -deflate 1 > > > > > > > > Could you please check the attached code and suggest any changes in > settings that might help with convergence for these kinds of matrices? I > appreciate your help! > > > > > > > > Thanks, > > > > Varun > > > > > > > > On Tue, Sep 21, 2021 at 11:14 AM Jose E. Roman > wrote: > > > > I will have a look at your code when I have more time. Meanwhile, I > am answering 3) below... > > > > > > > > > El 21 sept 2021, a las 0:23, Varun Hiremath < > varunhiremath at gmail.com> escribi?: > > > > > > > > > > Hi Jose, > > > > > > > > > > Sorry, it took me a while to test these settings in the new > builds. I am getting good improvement in performance using the > preconditioned solvers, so thanks for the suggestions! But I have some > questions related to the usage. > > > > > > > > > > We are using SLEPc to solve the acoustic modal eigenvalue problem. > Attached is a simple standalone program that computes acoustic modes in a > simple rectangular box. This program illustrates the general setup I am > using, though here the shell matrix and the preconditioner matrix are the > same, while in my actual program the shell matrix computes A*x without > explicitly forming A, and the preconditioner is a 0th order approximation > of A. > > > > > > > > > > In the attached program I have tested both > > > > > 1) the Krylov-Schur with inexact shift-and-invert (implemented > under the option sinvert); > > > > > 2) the JD solver with preconditioner (implemented under the option > usejd) > > > > > > > > > > Both the solvers seem to work decently, compared to no > preconditioning. This is how I run the two solvers (for a mesh size of > 1600x400): > > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -usejd 1 -deflate 1 > -eps_target 0 > > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -sinvert 1 -deflate 1 > -eps_target 0 > > > > > Both finish in about ~10 minutes on my system in serial. JD seems > to be slightly faster and more accurate (for the imaginary part of > eigenvalue). > > > > > The program also runs in parallel using mpiexec. 
I use complex > builds, as in my main program the matrix can be complex. > > > > > > > > > > Now here are my questions: > > > > > 1) For this particular problem type, could you please check if > these are the best settings that one could use? I have tried different > combinations of KSP/PC types e.g. GMRES, GAMG, etc, but BCGSL + BJACOBI > seems to work the best in serial and parallel. > > > > > > > > > > 2) When I tested these settings in my main program, for some > reason the JD solver was not converging. After further testing, I found the > issue was related to the setting of "-eps_target 0". I have included > "EPSSetTarget(eps,0.0);" in the program and I assumed this is equivalent to > passing "-eps_target 0" from the command line, but that doesn't seem to be > the case. For instance, if I run the attached program without "-eps_target > 0" in the command line then it doesn't converge. > > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -usejd 1 -deflate 1 > -eps_target 0 > > > > > the above finishes in about 10 minutes > > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -usejd 1 -deflate 1 > > > > > the above doesn't converge even though "EPSSetTarget(eps,0.0);" > is included in the code > > > > > > > > > > This only seems to affect the JD solver, not the Krylov > shift-and-invert (-sinvert 1) option. So is there any difference between > passing "-eps_target 0" from the command line vs using > "EPSSetTarget(eps,0.0);" in the code? I cannot pass any command line > arguments in my actual program, so need to set everything internally. > > > > > > > > > > 3) Also, another minor related issue. While using the inexact > shift-and-invert option, I was running into the following error: > > > > > > > > > > "" > > > > > Missing or incorrect user input > > > > > Shift-and-invert requires a target 'which' (see > EPSSetWhichEigenpairs), for instance -st_type sinvert -eps_target 0 > -eps_target_magnitude > > > > > "" > > > > > > > > > > I already have the below two lines in the code: > > > > > EPSSetWhichEigenpairs(eps,EPS_SMALLEST_MAGNITUDE); > > > > > EPSSetTarget(eps,0.0); > > > > > > > > > > so shouldn't these be enough? If I comment out the first line > "EPSSetWhichEigenpairs", then the code works fine. > > > > > > > > You should either do > > > > > > > > EPSSetWhichEigenpairs(eps,EPS_SMALLEST_MAGNITUDE); > > > > > > > > without shift-and-invert or > > > > > > > > EPSSetWhichEigenpairs(eps,EPS_TARGET_MAGNITUDE); > > > > EPSSetTarget(eps,0.0); > > > > > > > > with shift-and-invert. The latter can also be used without > shift-and-invert (e.g. in JD). > > > > > > > > I have to check, but a possible explanation why in your comment > above (2) the command-line option -eps_target 0 works differently is that > it also sets -eps_target_magnitude if omitted, so to be equivalent in > source code you have to call both > > > > EPSSetWhichEigenpairs(eps,EPS_TARGET_MAGNITUDE); > > > > EPSSetTarget(eps,0.0); > > > > > > > > Jose > > > > > > > > > I have some more questions regarding setting the preconditioner > for a quadratic eigenvalue problem, which I will ask in a follow-up email. > > > > > > > > > > Thanks for your help! > > > > > > > > > > -Varun > > > > > > > > > > > > > > > On Thu, Jul 1, 2021 at 5:01 AM Varun Hiremath < > varunhiremath at gmail.com> wrote: > > > > > Thank you very much for these suggestions! We are currently using > version 3.12, so I'll try to update to the latest version and try your > suggestions. Let me get back to you, thanks! 
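For reference, a minimal sketch of doing in source code what "-eps_target 0" does on the command line, as described above (eps is assumed to be an already created EPS object; the function name is illustrative):

#include <slepceps.h>

/* Sketch: settings equivalent to running the JD solver with "-eps_target 0"
   when no command-line options can be passed. The command-line option also
   sets the "which" criterion, so both calls below are needed. */
PetscErrorCode ConfigureJDTargetZero(EPS eps)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = EPSSetType(eps, EPSJD);CHKERRQ(ierr);
  ierr = EPSSetWhichEigenpairs(eps, EPS_TARGET_MAGNITUDE);CHKERRQ(ierr);
  ierr = EPSSetTarget(eps, 0.0);CHKERRQ(ierr);
  /* Without a target (e.g. plain Krylov-Schur, no shift-and-invert), use
     EPSSetWhichEigenpairs(eps, EPS_SMALLEST_MAGNITUDE) instead. */
  PetscFunctionReturn(0);
}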
> > > > > > > > > > On Thu, Jul 1, 2021, 4:45 AM Jose E. Roman > wrote: > > > > > Then I would try Davidson methods https://doi.org/10.1145/2543696 > > > > > You can also try Krylov-Schur with "inexact" shift-and-invert, for > instance, with preconditioned BiCGStab or GMRES, see section 3.4.1 of the > users manual. > > > > > > > > > > In both cases, you have to pass matrix A in the call to > EPSSetOperators() and the preconditioner matrix via > STSetPreconditionerMat() - note this function was introduced in version > 3.15. > > > > > > > > > > Jose > > > > > > > > > > > > > > > > > > > > > El 1 jul 2021, a las 13:36, Varun Hiremath < > varunhiremath at gmail.com> escribi?: > > > > > > > > > > > > Thanks. I actually do have a 1st order approximation of matrix > A, that I can explicitly compute and also invert. Can I use that matrix as > preconditioner to speed things up? Is there some example that explains how > to setup and call SLEPc for this scenario? > > > > > > > > > > > > On Thu, Jul 1, 2021, 4:29 AM Jose E. Roman > wrote: > > > > > > For smallest real parts one could adapt ex34.c, but it is going > to be costly > https://slepc.upv.es/documentation/current/src/eps/tutorials/ex36.c.html > > > > > > Also, if eigenvalues are clustered around the origin, > convergence may still be very slow. > > > > > > > > > > > > It is a tough problem, unless you are able to compute a good > preconditioner of A (no need to compute the exact inverse). > > > > > > > > > > > > Jose > > > > > > > > > > > > > > > > > > > El 1 jul 2021, a las 13:23, Varun Hiremath < > varunhiremath at gmail.com> escribi?: > > > > > > > > > > > > > > I'm solving for the smallest eigenvalues in magnitude. Though > is it cheaper to solve smallest in real part, as that might also work in my > case? Thanks for your help. > > > > > > > > > > > > > > On Thu, Jul 1, 2021, 4:08 AM Jose E. Roman > wrote: > > > > > > > Smallest eigenvalue in magnitude or real part? > > > > > > > > > > > > > > > > > > > > > > El 1 jul 2021, a las 11:58, Varun Hiremath < > varunhiremath at gmail.com> escribi?: > > > > > > > > > > > > > > > > Sorry, no both A and B are general sparse matrices > (non-hermitian). So is there anything else I could try? > > > > > > > > > > > > > > > > On Thu, Jul 1, 2021 at 2:43 AM Jose E. Roman < > jroman at dsic.upv.es> wrote: > > > > > > > > Is the problem symmetric (GHEP)? In that case, you can try > LOBPCG on the pair (A,B). But this will likely be slow as well, unless you > can provide a good preconditioner. > > > > > > > > > > > > > > > > Jose > > > > > > > > > > > > > > > > > > > > > > > > > El 1 jul 2021, a las 11:37, Varun Hiremath < > varunhiremath at gmail.com> escribi?: > > > > > > > > > > > > > > > > > > Hi All, > > > > > > > > > > > > > > > > > > I am trying to compute the smallest eigenvalues of a > generalized system A*x= lambda*B*x. I don't explicitly know the matrix A > (so I am using a shell matrix with a custom matmult function) however, the > matrix B is explicitly known so I compute inv(B)*A within the shell matrix > and solve inv(B)*A*x = lambda*x. > > > > > > > > > > > > > > > > > > To compute the smallest eigenvalues it is recommended to > solve the inverted system, but since matrix A is not explicitly known I > can't invert the system. Moreover, the size of the system can be really > big, and with the default Krylov solver, it is extremely slow. So is there > a better way for me to compute the smallest eigenvalues of this system? 
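A minimal sketch of the setup suggested above, namely a shell matrix for the operator passed to EPSSetOperators() and an explicitly assembled approximation passed through STSetPreconditionerMat() (available in SLEPc 3.15 and later). The names AppCtx, MyMatMult and P are illustrative, not taken from the attached codes:

#include <slepceps.h>

/* Application context for the matrix-free operator; contents are problem
   specific (e.g. a MUMPS factorization of B for computing inv(B)*A*x). */
typedef struct {
  void *userdata;
} AppCtx;

/* User MatMult for the shell operator: y = inv(B)*A*x, computed matrix-free. */
PetscErrorCode MyMatMult(Mat Ashell, Vec x, Vec y)
{
  AppCtx        *ctx;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatShellGetContext(Ashell, &ctx);CHKERRQ(ierr);
  /* ... apply the operator to x and store the result in y ... */
  PetscFunctionReturn(0);
}

/* P is an explicitly assembled approximation of the operator (e.g. a
   1st-order approximation of A); SLEPc builds the preconditioner used by
   JD or by inexact shift-and-invert from it. */
PetscErrorCode SolveSmallest(MPI_Comm comm, PetscInt nloc, PetscInt N, AppCtx *ctx, Mat P)
{
  Mat            Ashell;
  EPS            eps;
  ST             st;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatCreateShell(comm, nloc, nloc, N, N, ctx, &Ashell);CHKERRQ(ierr);
  ierr = MatShellSetOperation(Ashell, MATOP_MULT, (void (*)(void))MyMatMult);CHKERRQ(ierr);

  ierr = EPSCreate(comm, &eps);CHKERRQ(ierr);
  ierr = EPSSetOperators(eps, Ashell, NULL);CHKERRQ(ierr);  /* standard problem inv(B)*A*x = lambda*x */
  ierr = EPSSetWhichEigenpairs(eps, EPS_TARGET_MAGNITUDE);CHKERRQ(ierr);
  ierr = EPSSetTarget(eps, 0.0);CHKERRQ(ierr);

  ierr = EPSGetST(eps, &st);CHKERRQ(ierr);
  ierr = STSetPreconditionerMat(st, P);CHKERRQ(ierr);

  ierr = EPSSetFromOptions(eps);CHKERRQ(ierr);
  ierr = EPSSolve(eps);CHKERRQ(ierr);

  ierr = EPSDestroy(&eps);CHKERRQ(ierr);
  ierr = MatDestroy(&Ashell);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}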
> > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Varun > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rthirumalaisam1857 at sdsu.edu Fri Oct 1 01:10:16 2021 From: rthirumalaisam1857 at sdsu.edu (Ramakrishnan Thirumalaisamy) Date: Thu, 30 Sep 2021 23:10:16 -0700 Subject: [petsc-users] Convergence rate for spatially varying Helmholtz system In-Reply-To: References: <00A92945-C009-4A92-B7E2-909B1783CCF4@petsc.dev> Message-ID: We fixed the issue. The linear operator in the matrix-free solver was set up correctly, whereas the linear operator in the preconditioner was not set up correctly (it was lagging in time). After setting the linear operators correctly, we see that the diagonal system is solved in 1 iteration. Thank you. On Thu, Sep 30, 2021 at 5:49 PM Matthew Knepley wrote: > On Thu, Sep 30, 2021 at 6:58 PM Amneet Bhalla > wrote: > >> >> >>> >>> For a diagonal system with this modest range of values Jacobi should >>> converge in a single iteration. >>> >> >> This is what I wanted to confirm (and my expectation also). There could >> be a bug in the way we are setting up the linear operators in the >> preconditioner and the matrix-free solver. We need to do some debugging. >> >> (with regard to the diagonal). >> >> We have printed the matrix and viewed it in Matlab. It is a diagonal >> matrix. >> > > Can you send us the matrix? This definitely should converge in 1 iterate > now, so something I do not understand is going on. > I will take any format you've got :) > > Thanks, > > Matt > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marco.cisternino at optimad.it Fri Oct 1 05:38:35 2021 From: marco.cisternino at optimad.it (Marco Cisternino) Date: Fri, 1 Oct 2021 10:38:35 +0000 Subject: [petsc-users] Disconnected domains and Poisson equation In-Reply-To: <3A2F7686-44AA-47A5-B996-461E057F4EC3@petsc.dev> References: <448CEBF7-5B16-4E1C-8D1D-9CC067BD38BB@petsc.dev> <10EA28EF-AD98-4F59-A78D-7DE3D4B585DE@petsc.dev> <3A2F7686-44AA-47A5-B996-461E057F4EC3@petsc.dev> Message-ID: Thank you Barry. I added a custom atoll = 1.0e-12 and this makes the CFD stable with all the linear solver types. CFD solution is good and pressure is a good ?zero? field at every CFD iteration. I did the same test using ASM+ILU+FGMRES(BCGS and GMRES) and the behaviour is the same. During some CFD iteration the residual of linear system starts slightly higher than atol and the linear solver makes some iteration (2/3 iterations) before it stops because of atol. The pressure is still different in the 2 sub-domains (order 1.0e-14 because of those few linear solver iterations), therefore no symmetry of the solution In the 2 sub-domains. I think it is a matter of round-off, do you agree on this? Or do I need to take care of this difference as a symptom of something wrong? Thank you for your support. Marco Cisternino From: Barry Smith Sent: gioved? 
30 settembre 2021 16:39 To: Marco Cisternino Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Disconnected domains and Poisson equation It looks like the initial solution (guess) is to round-off the solution to the linear system 9.010260489109e-14 0 KSP unpreconditioned resid norm 9.010260489109e-14 true resid norm 9.010260489109e-14 ||r(i)||/||b|| 2.021559024868e+00 0 KSP Residual norm 9.010260489109e-14 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 1 KSP unpreconditioned resid norm 4.918108339808e-15 true resid norm 4.918171792537e-15 ||r(i)||/||b|| 1.103450292594e-01 1 KSP Residual norm 4.918108339808e-15 % max 9.566256813737e-01 min 9.566256813737e-01 max/min 1.000000000000e+00 2 KSP unpreconditioned resid norm 1.443599554690e-15 true resid norm 1.444867143493e-15 ||r(i)||/||b|| 3.241731154382e-02 2 KSP Residual norm 1.443599554690e-15 % max 9.614019380614e-01 min 7.360950481750e-01 max/min 1.306083963538e+00 Thus the Krylov solver will not be able to improve the solution, it then gets stuck trying to improve the solution but cannot because of round off. In other words the algorithm has converged (even at the initial solution (guess) and should stop immediately. You can use -ksp_atol 1.e-12 to get it to stop immediately without iterating if the initial residual is less than 1e-12. Barry On Sep 30, 2021, at 4:16 AM, Marco Cisternino > wrote: Hello Barry. This is the output of ksp_view using fgmres and gamg. It has to be said that the solution of the linear system should be a zero values field. As you can see both unpreconditioned residual and r/b converge at this iteration of the CFD solver. During the time integration of the CFD, I can observe pressure linear solver residuals behaving in a different way: unpreconditioned residual stil converges but r/b stalls. After the output of ksp_view I add the output of ksp_monitor_true_residual for one of these iteration where r/b stalls. Thanks, KSP Object: 1 MPI processes type: fgmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=100, nonzero initial guess tolerances: relative=1e-05, absolute=1e-50, divergence=10000. right preconditioning using UNPRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: gamg type is MULTIPLICATIVE, levels=4 cycles=v Cycles per PCApply=1 Using externally compute Galerkin coarse grid matrices GAMG specific options Threshold for dropping small values in graph on each level = 0.02 0.02 Threshold scaling factor for each level not specified = 1. AGG specific options Symmetric graph true Number of levels to square graph 1 Number smoothing steps 0 Coarse grid solver -- level ------------------------------- KSP Object: (mg_coarse_) 1 MPI processes type: preonly maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (mg_coarse_) 1 MPI processes type: bjacobi number of blocks = 1 Local solve is same for all blocks, in the following KSP and PC objects: KSP Object: (mg_coarse_sub_) 1 MPI processes type: preonly maximum iterations=1, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
left preconditioning using DEFAULT norm type for convergence test PC Object: (mg_coarse_sub_) 1 MPI processes type: lu PC has not been set up so information may be incomplete out-of-place factorization tolerance for zero pivot 2.22045e-14 using diagonal shift on blocks to prevent zero pivot [INBLOCKS] matrix ordering: nd linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=18, cols=18 total: nonzeros=104, allocated nonzeros=104 total number of mallocs used during MatSetValues calls =0 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=18, cols=18 total: nonzeros=104, allocated nonzeros=104 total number of mallocs used during MatSetValues calls =0 not using I-node routines Down solver (pre-smoother) on level 1 ------------------------------- KSP Object: (mg_levels_1_) 1 MPI processes type: chebyshev eigenvalue estimates used: min = 0., max = 0. eigenvalues estimate via gmres min 0., max 0. eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1] KSP Object: (mg_levels_1_esteig_) 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10, initial guess is zero tolerances: relative=1e-12, absolute=1e-50, divergence=10000. left preconditioning using DEFAULT norm type for convergence test estimating eigenvalues using noisy right hand side maximum iterations=2, nonzero initial guess tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (mg_levels_1_) 1 MPI processes type: sor type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=67, cols=67 total: nonzeros=675, allocated nonzeros=675 total number of mallocs used during MatSetValues calls =0 not using I-node routines Up solver (post-smoother) same as down solver (pre-smoother) Down solver (pre-smoother) on level 2 ------------------------------- KSP Object: (mg_levels_2_) 1 MPI processes type: chebyshev eigenvalue estimates used: min = 0., max = 0. eigenvalues estimate via gmres min 0., max 0. eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1] KSP Object: (mg_levels_2_esteig_) 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10, initial guess is zero tolerances: relative=1e-12, absolute=1e-50, divergence=10000. left preconditioning using DEFAULT norm type for convergence test estimating eigenvalues using noisy right hand side maximum iterations=2, nonzero initial guess tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (mg_levels_2_) 1 MPI processes type: sor type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=348, cols=348 total: nonzeros=3928, allocated nonzeros=3928 total number of mallocs used during MatSetValues calls =0 not using I-node routines Up solver (post-smoother) same as down solver (pre-smoother) Down solver (pre-smoother) on level 3 ------------------------------- KSP Object: (mg_levels_3_) 1 MPI processes type: chebyshev eigenvalue estimates used: min = 0., max = 0. 
eigenvalues estimate via gmres min 0., max 0. eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1] KSP Object: (mg_levels_3_esteig_) 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10, initial guess is zero tolerances: relative=1e-12, absolute=1e-50, divergence=10000. left preconditioning using DEFAULT norm type for convergence test estimating eigenvalues using noisy right hand side maximum iterations=2, nonzero initial guess tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (mg_levels_3_) 1 MPI processes type: sor type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=3584, cols=3584 total: nonzeros=23616, allocated nonzeros=23616 total number of mallocs used during MatSetValues calls =0 has attached null space not using I-node routines Up solver (post-smoother) same as down solver (pre-smoother) linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=3584, cols=3584 total: nonzeros=23616, allocated nonzeros=23616 total number of mallocs used during MatSetValues calls =0 has attached null space not using I-node routines Pressure system has reached convergence in 0 iterations with reason 3. 0 KSP unpreconditioned resid norm 4.798763170703e-16 true resid norm 4.798763170703e-16 ||r(i)||/||b|| 1.000000000000e+00 0 KSP Residual norm 4.798763170703e-16 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 1 KSP unpreconditioned resid norm 1.648749109132e-17 true resid norm 1.648749109132e-17 ||r(i)||/||b|| 3.435779284125e-02 1 KSP Residual norm 1.648749109132e-17 % max 9.561792537103e-01 min 9.561792537103e-01 max/min 1.000000000000e+00 2 KSP unpreconditioned resid norm 4.737880600040e-19 true resid norm 4.737880600040e-19 ||r(i)||/||b|| 9.873128619820e-04 2 KSP Residual norm 4.737880600040e-19 % max 9.828636644296e-01 min 9.293131521763e-01 max/min 1.057623753767e+00 3 KSP unpreconditioned resid norm 2.542212716830e-20 true resid norm 2.542212716830e-20 ||r(i)||/||b|| 5.297641551371e-05 3 KSP Residual norm 2.542212716830e-20 % max 9.933572357920e-01 min 9.158303248850e-01 max/min 1.084652046127e+00 4 KSP unpreconditioned resid norm 6.614510286263e-21 true resid norm 6.614510286269e-21 ||r(i)||/||b|| 1.378378146822e-05 4 KSP Residual norm 6.614510286263e-21 % max 9.950912550705e-01 min 6.296575800237e-01 max/min 1.580368896747e+00 5 KSP unpreconditioned resid norm 1.981505525281e-22 true resid norm 1.981505525272e-22 ||r(i)||/||b|| 4.129200493513e-07 5 KSP Residual norm 1.981505525281e-22 % max 9.984097962703e-01 min 5.316259535293e-01 max/min 1.878030577029e+00 Linear solve converged due to CONVERGED_RTOL iterations 5 Ksp_monitor_true_residual output for stalling r/b CFD iteration 0 KSP unpreconditioned resid norm 9.010260489109e-14 true resid norm 9.010260489109e-14 ||r(i)||/||b|| 2.021559024868e+00 0 KSP Residual norm 9.010260489109e-14 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 1 KSP unpreconditioned resid norm 4.918108339808e-15 true resid norm 4.918171792537e-15 ||r(i)||/||b|| 1.103450292594e-01 1 KSP Residual norm 4.918108339808e-15 % max 9.566256813737e-01 min 9.566256813737e-01 max/min 1.000000000000e+00 2 KSP unpreconditioned resid norm 1.443599554690e-15 true resid norm 
1.444867143493e-15 ||r(i)||/||b|| 3.241731154382e-02 2 KSP Residual norm 1.443599554690e-15 % max 9.614019380614e-01 min 7.360950481750e-01 max/min 1.306083963538e+00 3 KSP unpreconditioned resid norm 6.623206616803e-16 true resid norm 6.654132553541e-16 ||r(i)||/||b|| 1.492933720678e-02 3 KSP Residual norm 6.623206616803e-16 % max 9.764112945239e-01 min 4.911485418014e-01 max/min 1.988016274960e+00 4 KSP unpreconditioned resid norm 6.551896936698e-16 true resid norm 6.646157296305e-16 ||r(i)||/||b|| 1.491144376933e-02 4 KSP Residual norm 6.551896936698e-16 % max 9.883425885532e-01 min 1.461270778833e-01 max/min 6.763582786091e+00 5 KSP unpreconditioned resid norm 6.222297644887e-16 true resid norm 1.720560536914e-15 ||r(i)||/||b|| 3.860282047823e-02 5 KSP Residual norm 6.222297644887e-16 % max 1.000409371755e+00 min 4.989767363560e-03 max/min 2.004921870829e+02 6 KSP unpreconditioned resid norm 6.496945794974e-17 true resid norm 2.031914800253e-14 ||r(i)||/||b|| 4.558842341106e-01 6 KSP Residual norm 6.496945794974e-17 % max 1.004914985753e+00 min 1.459258738706e-03 max/min 6.886475709192e+02 7 KSP unpreconditioned resid norm 1.965237342540e-17 true resid norm 1.684522207337e-14 ||r(i)||/||b|| 3.779425772373e-01 7 KSP Residual norm 1.965237342540e-17 % max 1.005737762541e+00 min 1.452603803766e-03 max/min 6.923689446035e+02 8 KSP unpreconditioned resid norm 1.627718951285e-17 true resid norm 1.958642967520e-14 ||r(i)||/||b|| 4.394448276241e-01 8 KSP Residual norm 1.627718951285e-17 % max 1.006364278765e+00 min 1.452081813014e-03 max/min 6.930492963590e+02 9 KSP unpreconditioned resid norm 1.616577677764e-17 true resid norm 2.019110946644e-14 ||r(i)||/||b|| 4.530115373837e-01 9 KSP Residual norm 1.616577677764e-17 % max 1.006648747131e+00 min 1.452031376577e-03 max/min 6.932692801059e+02 10 KSP unpreconditioned resid norm 1.285788988203e-17 true resid norm 2.065082694477e-14 ||r(i)||/||b|| 4.633258453698e-01 10 KSP Residual norm 1.285788988203e-17 % max 1.007469033514e+00 min 1.433291867068e-03 max/min 7.029057072477e+02 11 KSP unpreconditioned resid norm 5.490854431580e-19 true resid norm 1.798071628891e-14 ||r(i)||/||b|| 4.034187394623e-01 11 KSP Residual norm 5.490854431580e-19 % max 1.008058905554e+00 min 1.369401685301e-03 max/min 7.361309076612e+02 12 KSP unpreconditioned resid norm 1.371754802104e-20 true resid norm 1.965688920064e-14 ||r(i)||/||b|| 4.410256708163e-01 12 KSP Residual norm 1.371754802104e-20 % max 1.008409402214e+00 min 1.369243011779e-03 max/min 7.364721919624e+02 Linear solve converged due to CONVERGED_RTOL iterations 12 Marco Cisternino From: Barry Smith > Sent: mercoled? 29 settembre 2021 18:34 To: Marco Cisternino > Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Disconnected domains and Poisson equation On Sep 29, 2021, at 11:59 AM, Marco Cisternino > wrote: For sake of completeness, explicitly building the null space using a vector per sub-domain make s the CFD runs using BCGS and GMRES more stable, but still slower than FGMRES. Something is strange. Please run with -ksp_view and send the output on the solver details. I had divergence using BCGS and GMRES setting the null space with only one constant. Thanks Marco Cisternino From: Marco Cisternino Sent: mercoled? 29 settembre 2021 17:54 To: Barry Smith > Cc: petsc-users at mcs.anl.gov Subject: RE: [petsc-users] Disconnected domains and Poisson equation Thank you Barry for the quick reply. 
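A minimal sketch of the two-vector null space construction discussed in this thread, one vector per sub-domain, constant on that sub-domain and zero on the other. The routine domainOfRow() is an assumed application-side helper returning the sub-domain, 0 or 1, of a locally owned row; with disjoint supports the two vectors are orthogonal, so normalizing each one satisfies the orthonormality expected by MatNullSpaceCreate():

#include <petscmat.h>

/* Assumed application helper: returns 0 or 1, the sub-domain owning this row. */
extern PetscInt domainOfRow(PetscInt row);

PetscErrorCode AttachTwoDomainNullSpace(Mat A)
{
  Vec            constants[2];
  MatNullSpace   nullspace;
  PetscInt       d, row, rstart, rend;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  for (d = 0; d < 2; d++) {
    ierr = MatCreateVecs(A, &constants[d], NULL);CHKERRQ(ierr);
    ierr = VecSet(constants[d], 0.0);CHKERRQ(ierr);
    ierr = VecGetOwnershipRange(constants[d], &rstart, &rend);CHKERRQ(ierr);
    for (row = rstart; row < rend; row++) {
      if (domainOfRow(row) == d) {
        ierr = VecSetValue(constants[d], row, 1.0, INSERT_VALUES);CHKERRQ(ierr);
      }
    }
    ierr = VecAssemblyBegin(constants[d]);CHKERRQ(ierr);
    ierr = VecAssemblyEnd(constants[d]);CHKERRQ(ierr);
    ierr = VecNormalize(constants[d], NULL);CHKERRQ(ierr);
  }
  ierr = MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_FALSE, 2, constants, &nullspace);CHKERRQ(ierr);
  ierr = MatSetNullSpace(A, nullspace);CHKERRQ(ierr);
  ierr = MatNullSpaceDestroy(&nullspace);CHKERRQ(ierr);
  ierr = VecDestroy(&constants[0]);CHKERRQ(ierr);
  ierr = VecDestroy(&constants[1]);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}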
About the null space: I already tried what you suggest, building 2 Vec (constants) with 0 and 1 chosen by sub-domain, normalizing them and setting the null space like this MatNullSpaceCreate(PETSC_COMM_WORLD,PETSC_FALSE,nconstants,constants,&nullspace); The solution is slightly different in values but it is still different in the two sub-domains. About the solver: I tried BCGS, GMRES and FGMRES. The linear system is a pressure system in a navier-stokes solver and only solving with FGMRES makes the CFD stable, with BCGS and GMRES the CFD solution diverges. Moreover, in the same case but with a single domain, CFD solution is stable using all the solvers, but FGMRES converges in much less iterations than the others. Marco Cisternino From: Barry Smith > Sent: mercoled? 29 settembre 2021 15:59 To: Marco Cisternino > Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Disconnected domains and Poisson equation The problem actually has a two dimensional null space; constant on each domain but possibly different constants. I think you need to build the MatNullSpace by explicitly constructing two vectors, one with 0 on one domain and constant value on the other and one with 0 on the other domain and constant on the first. Separate note: why use FGMRES instead of just GMRES? If the problem is linear and the preconditioner is linear (no GMRES inside the smoother) then you can just use GMRES and it will save a little space/work and be conceptually clearer. Barry On Sep 29, 2021, at 8:46 AM, Marco Cisternino > wrote: Good morning, I want to solve the Poisson equation on a 3D domain with 2 non-connected sub-domains. I am using FGMRES+GAMG and I have no problem if the two sub-domains see a Dirichlet boundary condition each. On the same domain I would like to solve the Poisson equation imposing periodic boundary condition in one direction and homogenous Neumann boundary conditions in the other two directions. The two sub-domains are symmetric with respect to the separation between them and the operator discretization and the right hand side are symmetric as well. It would be nice to have the same solution in both the sub-domains. Setting the null space to the constant, the solver converges to a solution having the same gradients in both sub-domains but different values. Am I doing some wrong with the null space? I?m not setting a block matrix (one block for each sub-domain), should I? I tested the null space against the matrix using MatNullSpaceTest and the answer is true. Can I do something more to have a symmetric solution as outcome of the solver? Thank you in advance for any comments and hints. Best regards, Marco Cisternino -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Fri Oct 1 05:53:48 2021 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 1 Oct 2021 06:53:48 -0400 Subject: [petsc-users] Convergence rate for spatially varying Helmholtz system In-Reply-To: References: <00A92945-C009-4A92-B7E2-909B1783CCF4@petsc.dev> Message-ID: On Fri, Oct 1, 2021 at 2:10 AM Ramakrishnan Thirumalaisamy < rthirumalaisam1857 at sdsu.edu> wrote: > We fixed the issue. The linear operator in the matrix-free solver was set > up correctly, whereas the linear operator in the preconditioner was not set > up correctly (it was lagging in time). After setting the linear operators > correctly, we see that the diagonal system is solved in 1 iteration. Thank > you. > Cool. 
Thanks, Matt > On Thu, Sep 30, 2021 at 5:49 PM Matthew Knepley wrote: > >> On Thu, Sep 30, 2021 at 6:58 PM Amneet Bhalla >> wrote: >> >>> >>> >>>> >>>> For a diagonal system with this modest range of values Jacobi should >>>> converge in a single iteration. >>>> >>> >>> This is what I wanted to confirm (and my expectation also). There could >>> be a bug in the way we are setting up the linear operators in the >>> preconditioner and the matrix-free solver. We need to do some debugging. >>> >>> (with regard to the diagonal). >>> >>> We have printed the matrix and viewed it in Matlab. It is a diagonal >>> matrix. >>> >> >> Can you send us the matrix? This definitely should converge in 1 iterate >> now, so something I do not understand is going on. >> I will take any format you've got :) >> >> Thanks, >> >> Matt >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From k.sagiyama at imperial.ac.uk Fri Oct 1 07:07:04 2021 From: k.sagiyama at imperial.ac.uk (Sagiyama, Koki) Date: Fri, 1 Oct 2021 12:07:04 +0000 Subject: [petsc-users] DMView and DMLoad In-Reply-To: <45d209e2-ecab-ead7-7229-a819736b91df@ovgu.de> References: <56ce2135-9757-4292-e33b-c7eea8fb7b2e@ovgu.de> <056E066F-D596-4254-A44A-60BFFD30FE82@erdw.ethz.ch> <45d209e2-ecab-ead7-7229-a819736b91df@ovgu.de> Message-ID: Hi Berend, DMPlexCreateFromfile(), in your case, internally calls DMLoad(), which calls DMPlexTopologyLoad(), DMPlexLabelsLoad(), and DMPlexCoordinatesLoad(), so, to get what you want, you would need to do something like: DMPlexTopologyView(dm, viewer); DMPlexLabelsView(dm, viewer); DMPlexCoordinatesView(dm, viewer); for saving, and: DMPlexTopologyLoad(dm, viewer, &sfO); DMPlexLabelsLoad(dm, viewer); DMPlexCoordinatesLoad(dm, viewer); DMPlexDistribute(..., &sfDist, ...); for loading. Please note that the interface for DMPlexCoordinatesLoad() may change in the near future so that it would take an SF as argument (This is to view coordinates just like other fields). After the change, you will need to do: DMPlexTopologyLoad(dm, viewer, &sfO); DMPlexLabelsLoad(dm, viewer); DMPlexCoordinatesLoad(dm, viewer, sfO); DMPlexDistribute(..., &sfDist, ...); or: DMPlexTopologyLoad(dm, viewer, &sfO); DMPlexLabelsLoad(dm, viewer); DMPlexDistribute(..., &sfDist, ...); PetscSFCompose(sfO, sfDist, &sf); DMPlexCoordinatesLoad(..., viewer, sf); The latter will load coordinates in parallel directly on the redistributed DMPlex, so will be preferred. The example (`src/dm/impls/plex/tutorials/ex12.c`) will also be updated accordingly. Thanks, Koki ________________________________ From: petsc-users on behalf of Berend van Wachem Sent: Thursday, September 30, 2021 12:02 PM To: Hapla Vaclav Cc: PETSc users list Subject: Re: [petsc-users] DMView and DMLoad ******************* This email originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders list https://spam.ic.ac.uk/SpamConsole/Senders.aspx to disable email stamping for this address. 
******************* Dear Vaclav, Lawrence, following your example, we have managed to save the DM with a wrapped Vector in h5 format (PETSC_VIEWER_HDF5_PETSC) with: DMPlexTopologyView(dm, viewer); DMClone(dm, &sdm); ... DMPlexSectionView(dm, viewer, sdm); DMGetLocalVector(sdm, &vec); ... DMPlexLocalVectorView(dm, viewer, sdm, vec); The problem comes with the loading of the "DM+Vec.h5" with: DMCreate(PETSC_COMM_WORLD, &dm); DMSetType(dm, DMPLEX); ... DMPlexTopologyLoad(dm, viewer, &sfO); ... PetscSFCompose(sfO, sfDist, &sf); ... DMClone(dm, &sdm); DMPlexSectionLoad(dm, viewer, sdm, sf, &globalDataSF, &localDataSF); DMGetLocalVector(sdm, &vec); ... DMPlexLocalVectorLoad(dm, H5Viewer, sdm, localDataSF, vec); The loaded DM is different to the one created with DMPlexCreateFromfile (for instance, no "coordinates" are recovered with the use of DMGetCoordinatesLocal). This conflicts with our code, which relies on features of the DM as delivered by the DMPlexCreateFromfile function. We have also noticed that the "DM+Vec.h5" can not be loaded directly with DMPlexCreateFromfile because it contains only the groups "topology" and "topologies" while the groups "geometry" and "labels" are missing (and probably other conflicts). Is this something which can be changed? We would need to reload a DM similar to the one created with DMPlexCreateFromfile. Best regards, Berend. On 9/22/21 8:59 PM, Hapla Vaclav wrote: > To avoid confusions here, Berend seems to be specifically demanding XDMF > (PETSC_VIEWER_HDF5_XDMF). The stuff we are now working on is parallel > checkpointing in our own HDF5 format (PETSC_VIEWER_HDF5_PETSC), I will > make a series of MRs on this topic in the following days. > > For XDMF, we are specifically missing the ability to write/load DMLabels > properly. XDMF uses specific cell-local numbering for faces for > specification of face sets, and face-local numbering for specification > of edge sets, which is not great wrt DMPlex design. And ParaView doesn't > show any of these properly so it's hard to debug. Matt, we should talk > about this soon. > > Berend, for now, could you just load the mesh initially from XDMF and > then use our PETSC_VIEWER_HDF5_PETSC format for subsequent saving/loading? > > Thanks, > > Vaclav > >> On 17 Sep 2021, at 15:46, Lawrence Mitchell > > wrote: >> >> Hi Berend, >> >>> On 14 Sep 2021, at 12:23, Matthew Knepley >> > wrote: >>> >>> On Tue, Sep 14, 2021 at 5:15 AM Berend van Wachem >>> > wrote: >>> Dear PETSc-team, >>> >>> We are trying to save and load distributed DMPlex and its associated >>> physical fields (created with DMCreateGlobalVector) (Uvelocity, >>> VVelocity, ...) in HDF5_XDMF format. To achieve this, we do the >>> following: >>> >>> 1) save in the same xdmf.h5 file: >>> DMView( DM , H5_XDMF_Viewer ); >>> VecView( UVelocity, H5_XDMF_Viewer ); >>> >>> 2) load the dm: >>> DMPlexCreateFromfile(PETSC_COMM_WORLD, Filename, PETSC_TRUE, DM); >>> >>> 3) load the physical field: >>> VecLoad( UVelocity, H5_XDMF_Viewer ); >>> >>> There are no errors in the execution, but the loaded DM is distributed >>> differently to the original one, which results in the incorrect >>> placement of the values of the physical fields (UVelocity etc.) in the >>> domain. >>> >>> This approach is used to restart the simulation with the last saved DM. >>> Is there something we are missing, or there exists alternative routes to >>> this goal? Can we somehow get the IS of the redistribution, so we can >>> re-distribute the vector data as well? 
>>> >>> Many thanks, best regards, >>> >>> Hi Berend, >>> >>> We are in the midst of rewriting this. We want to support saving >>> multiple meshes, with fields attached to each, >>> and preserving the discretization (section) information, and allowing >>> us to load up on a different number of >>> processes. We plan to be done by October. Vaclav and I are doing this >>> in collaboration with Koki Sagiyama, >>> David Ham, and Lawrence Mitchell from the Firedrake team. >> >> The core load/save cycle functionality is now in PETSc main. So if >> you're using main rather than a release, you can get access to it now. >> This section of the manual shows an example of how to do >> thingshttps://petsc.org/main/docs/manual/dmplex/#saving-and-loading-data-with-hdf5 >> >> >> Let us know if things aren't clear! >> >> Thanks, >> >> Lawrence > -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyan.chockalingam at stfc.ac.uk Fri Oct 1 07:51:00 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Fri, 1 Oct 2021 12:51:00 +0000 Subject: [petsc-users] (percent time in this phase) In-Reply-To: <3B13EDB4-A22B-421B-9B5C-F95BA9CF9705@petsc.dev> References: <6295C9A3-0EC7-4D6A-8F62-88EC8651D207@stfc.ac.uk> <3B13EDB4-A22B-421B-9B5C-F95BA9CF9705@petsc.dev> Message-ID: <7B8AE6C6-D949-4D65-916C-0F00628DA9AA@stfc.ac.uk> Hi Barry, Yes, looks like it is computationally faster using GPUs. I used block jacobi as the preconditioner. I have attached the output data for cpu and gpu using -ksp_view. I am not sure; what information I should be looking at using -ksp_view? I have an outstanding question, event time T% cal = (event/max)*100 max time 2.87E+02 KSPSolve 1.58E+02 53 55.2 MatMult 1.08E+01 4 3.76 PCApply 1.31E+02 37 45.6 VecNorm 6.23E+01 11 21.7 Matt couple of days back helped breakdown KSPSolve (53 %) ~ PCApply (37%) + VecNorm (11%) + MatMul (4%) However, when I calculate T% manually using max time, the numbers for PCApply and VecNorm are way off as you can see from the above table. As a result, the cumulative sum of event time don?t match up to KSPSolve. Can you please let me know what I might be doing wrong? I will be performing extensive benchmarking of various preconditioners and comparing their performance on cpus and gpus, so this information is critical. Many thanks! Karthik. From: Barry Smith Date: Thursday, 30 September 2021 at 15:47 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] (percent time in this phase) The MatSolve is no better on the GPUs then on the CPU; while other parts of the computation seem to speed up nicely. What is the result of -ksp_view ? Are you using ILU(0) as the preconditioner, this will not solve well on the GPU, its solve is essentially sequential. You won't want to use ILU(0) in this way on GPUs. Barry On Sep 30, 2021, at 9:41 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: Based on your feedback from yesterday. I was trying to breakdown KSPSolve. Please find the attached bar plot. The numbers are not adding up at least for GPUs. Your feedback from yesterday were based on T%. I plotted the time spend on each event, hoping that the cumulative sum would add up to KSPSolve time. Kind regards, Karthik. 
From: Matthew Knepley > Date: Thursday, 30 September 2021 at 13:52 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: When comparing the MatSolve data for GPU MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 and CPU MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 the time spent is almost the same for this preconditioner. Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? Looks like that. Thanks Matt KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 16:29 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you! Just to summarize KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? Yes. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 11:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you Mathew. Now, it is all making sense to me. From data file ksp_ex45_N511_gpu_2.txt KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one process and using Block-Jacobi . Half the time is spent in the solve (53%) KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. 
PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 4) The significant vector time is all in norms (11%) since they are really slow on the GPU. VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 So the solve time is: 53% ~ 37% + 4% + 11% and the setup time is about 16%. I was wrong about the SetUp time being included, as it is outside the event: https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 It looks like the remainder of the time (23%) is spent preallocating the matrix. Thanks, Matt The MalMult event is 4 %. How does this event figure into the above equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 10:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI > wrote: That was helpful. I would like to provide some additional details of my run on cpus and gpus. Please find the following attachments: 1. graph.pdf a plot showing overall time and various petsc events. 2. ksp_ex45_N511_cpu_6.txt data file of the log_summary 3. ksp_ex45_N511_gpu_2.txt data file of the log_summary I used the following petsc options for cpu mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor and for gpus mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor to run the following problem https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html From the above code, I see is there no individual function called KSPSetUp(), so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is KSPSetUp time and KSPSolve time mutually exclusive? 
No, KSPSetUp() will be contained in KSPSolve() if it is called automatically. In your response you said that ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used.? I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for this particular preconditioner (bjacobi) how can I tell how they are timed? They are all inside KSPSolve(). If you have a preconditioned linear solve, the oreconditioning happens during the iteration. So an iteration would mostly consist of MatMult + PCApply, with some vector work. I am hoping to time KSP solving and preconditioning mutually exclusively. I am not sure that concept makes sense here. See above. Thanks, Matt Kind regards, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 19:19 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: Thanks for Barry for your response. I was just benchmarking the problem with various preconditioner on cpu and gpu. I understand, it is not possible to get mutually exclusive timing. However, can you tell if KSPSolve time includes both PCSetup and PCApply? And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and PCApply. If you do not call KSPSetUp() separately from KSPSolve() then its time is included with KSPSolve(). PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used. So yes the answer is not totally satisfying. The one thing I would recommend is to not call KSPSetUp() directly and then KSPSolve() will always include the total time of the solve plus all setup time. PCApply will contain all the time to apply the preconditioner but may also include some setup time. Barry Best, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 16:56 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson problem. I noticed from the output from using the flag -log_summary that for various events their respective %T (percent time in this phase) do not add up to 100 but rather exceeds 100. So, I gather there is some overlap among these events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and PCSolve. Is it possible to get a mutually exclusive %T or Time for these individual events? I have attached the log_summary output file from my run for your reference. For nested solvers it is tricky to get the times to be mutually exclusive because some parts of the building of the preconditioner is for some preconditioners delayed until the solve has started. It looks like you are using the default preconditioner options which for this example are taking more or less no time since so many iterations are needed. It is best to use -pc_type mg to use geometric multigrid on this problem. Barry Thanks! Karthik. This email and any attachments are intended solely for the use of the named recipients. 
If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ksp_ex45_N511_cpu_6.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ksp_ex45_N511_gpu_2.txt URL: From knepley at gmail.com Fri Oct 1 08:50:37 2021 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 1 Oct 2021 09:50:37 -0400 Subject: [petsc-users] (percent time in this phase) In-Reply-To: References: Message-ID: On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI < karthikeyan.chockalingam at stfc.ac.uk> wrote: > When comparing the MatSolve data for > > > > GPU > > > > MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 > 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 > 0.00e+00 100 > > MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 > 0.00e+00 0 > > > > and CPU > > > > MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 > 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 > > > > the time spent is almost the same for this preconditioner. Look like > MatCUSPARSSolAnl is called only *twice* (since I am running on two cores) > > > > mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z > 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type > bjacobi -ksp_monitor > > > > So would it be fair to assume MatCUSPARSSolAnl is *not *accounted for in > MatSolve and it is an exclusive event? > > KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) > ~ 100 % > I am getting so old. We have a different kind of log output if you are really concerned about inclusion. You can run with -log_view :foo.txt:ascii_flamegraph and then there are tools for plotting that output, described here https://firedrakeproject.org/optimising.html This output _guarantees_ strict inclusion, so you will not have the problems you have above adding things up. Thanks, Matt Best, > > Karthik. 
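A minimal sketch of that flame-graph workflow for the GPU run above (the output file name is arbitrary, and the plotting step assumes a folded-stack viewer such as Brendan Gregg's flamegraph.pl; speedscope should also read this format):

mpirun -n 2 ./ex45 -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 \
       -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda \
       -ksp_type cg -pc_type bjacobi \
       -log_view :ex45_flame.txt:ascii_flamegraph
./flamegraph.pl ex45_flame.txt > ex45_flame.svg

Each entry in ex45_flame.txt carries the full nesting path of the event together with its time, so summing the entries should no longer double-count nested events.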
> > > > > > *From: *Matthew Knepley > *Date: *Wednesday, 29 September 2021 at 16:29 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *Barry Smith , "petsc-users at mcs.anl.gov" < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] %T (percent time in this phase) > > > > On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > Thank you! > > > > Just to summarize > > > > KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) > ~ 100 % > > > > You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I > right in accounting for it as above? > > > > I am not sure.I thought it might be the GPU part of MatSolve(). I will > have to look in the code. I am not as familiar with the GPU part. > > > > MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 > 0.00e+00 0 > > > > Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and > VecAYPX are mutually exclusive? > > > > Yes. > > > > Thanks, > > > > Matt > > > > Best, > > > > Karthik. > > > > *From: *Matthew Knepley > *Date: *Wednesday, 29 September 2021 at 11:58 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *Barry Smith , "petsc-users at mcs.anl.gov" < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] %T (percent time in this phase) > > > > On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > Thank you Mathew. Now, it is all making sense to me. > > > > From data file ksp_ex45_N511_gpu_2.txt > > > > KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). > > > > However, you said ?So an iteration would mostly consist of MatMult + > PCApply, with some vector work? > > > > 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than > one process and using Block-Jacobi . Half the time is spent in the solve > (53%) > > > > KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 > > KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 > > > > 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which > is all setup of the individual blocks, and this is all used by the > numerical ILU factorization. > > > > PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 > 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 > 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 > 6.93e+03 0 0.00e+00 0 > > MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 > > MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > > > 3) The preconditioner application takes 37% of the time, which is all > solving the factors and recorded in MatSolve(). Matrix multiplication takes > 4%. 
> > > > PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 > 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 > > MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 > > MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 > > > > 4) The significant vector time is all in norms (11%) since they are really > slow on the GPU. > > > > VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 > > VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 > > VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 > > VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 > > > > So the solve time is: > > > > 53% ~ 37% + 4% + 11% > > > > and the setup time is about 16%. I was wrong about the SetUp time being > included, as it is outside the event: > > > > > https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 > > > > It looks like the remainder of the time (23%) is spent preallocating the > matrix. > > > > Thanks, > > > > Matt > > > > The MalMult event is 4 %. How does this event figure into the above > equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? > > > > Best, > > Karthik. > > > > *From: *Matthew Knepley > *Date: *Wednesday, 29 September 2021 at 10:58 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *Barry Smith , "petsc-users at mcs.anl.gov" < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] %T (percent time in this phase) > > > > On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > That was helpful. I would like to provide some additional details of my > run on cpus and gpus. Please find the following attachments: > > > > 1. graph.pdf a plot showing overall time and various petsc events. > 2. ksp_ex45_N511_cpu_6.txt data file of the log_summary > 3. ksp_ex45_N511_gpu_2.txt data file of the log_summary > > > > I used the following petsc options for cpu > > > > mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z > 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi > -ksp_monitor > > > > and for gpus > > > > mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z > 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type > bjacobi -ksp_monitor > > > > to run the following problem > > > > https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html > > > > From the above code, I see is there no individual function called KSPSetUp(), > so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, > kSPSetComputeOperators all are timed together as KSPSetUp. For this > example, is KSPSetUp time and KSPSolve time mutually exclusive? > > > > No, KSPSetUp() will be contained in KSPSolve() if it is called > automatically. > > > > In your response you said that > > > > ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it > depends on how much of the preconditioner construction can take place > early, so depends exactly on the preconditioner used.? 
> > > > I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for > this particular preconditioner (bjacobi) how can I tell how they are timed? > > > > They are all inside KSPSolve(). If you have a preconditioned linear solve, > the oreconditioning happens during the iteration. So an iteration would > mostly > > consist of MatMult + PCApply, with some vector work. > > > > I am hoping to time KSP solving and preconditioning mutually exclusively. > > > > I am not sure that concept makes sense here. See above. > > > > Thanks, > > > > Matt > > > > > > Kind regards, > > Karthik. > > > > > > *From: *Barry Smith > *Date: *Tuesday, 28 September 2021 at 19:19 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] %T (percent time in this phase) > > > > > > > > On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > > > Thanks for Barry for your response. > > > > I was just benchmarking the problem with various preconditioner on cpu and > gpu. I understand, it is not possible to get mutually exclusive timing. > > However, can you tell if KSPSolve time includes both PCSetup and PCApply? > And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp > and PCApply. > > > > If you do not call KSPSetUp() separately from KSPSolve() then its time > is included with KSPSolve(). > > > > PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends > on how much of the preconditioner construction can take place early, so > depends exactly on the preconditioner used. > > > > So yes the answer is not totally satisfying. The one thing I would > recommend is to not call KSPSetUp() directly and then KSPSolve() will > always include the total time of the solve plus all setup time. PCApply > will contain all the time to apply the preconditioner but may also include > some setup time. > > > > Barry > > > > > > Best, > > Karthik. > > > > > > > > > > *From: *Barry Smith > *Date: *Tuesday, 28 September 2021 at 16:56 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] %T (percent time in this phase) > > > > > > > > On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > > > Hello, > > > > I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson > problem. I noticed from the output from using the flag -log_summary that > for various events their respective %T (percent time in this phase) do not > add up to 100 but rather exceeds 100. So, I gather there is some overlap > among these events. I am primarily looking at the events KSPSetUp, > KSPSolve, PCSetUp and PCSolve. Is it possible to get a mutually exclusive > %T or Time for these individual events? I have attached the log_summary > output file from my run for your reference. > > > > > > For nested solvers it is tricky to get the times to be mutually > exclusive because some parts of the building of the preconditioner is for > some preconditioners delayed until the solve has started. > > > > It looks like you are using the default preconditioner options which for > this example are taking more or less no time since so many iterations are > needed. It is best to use -pc_type mg to use geometric multigrid on this > problem. > > > > Barry > > > > > > > > Thanks! > > Karthik. 
> > > > This email and any attachments are intended solely for the use of the > named recipients. If you are not the intended recipient you must not use, > disclose, copy or distribute this email or any of its attachments and > should notify the sender immediately and delete this email from your > system. UK Research and Innovation (UKRI) has taken every reasonable > precaution to minimise risk of this email or any attachments containing > viruses or malware but the recipient should carry out its own virus and > malware checks before opening the attachments. UKRI does not accept any > liability for any losses or damages which the recipient may sustain due to > presence of any viruses. > > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://www.cse.buffalo.edu/~knepley/ > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://www.cse.buffalo.edu/~knepley/ > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://www.cse.buffalo.edu/~knepley/ > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Fri Oct 1 09:56:29 2021 From: bsmith at petsc.dev (Barry Smith) Date: Fri, 1 Oct 2021 10:56:29 -0400 Subject: [petsc-users] Disconnected domains and Poisson equation In-Reply-To: References: <448CEBF7-5B16-4E1C-8D1D-9CC067BD38BB@petsc.dev> <10EA28EF-AD98-4F59-A78D-7DE3D4B585DE@petsc.dev> <3A2F7686-44AA-47A5-B996-461E057F4EC3@petsc.dev> Message-ID: <5E2505EA-9665-49DF-9D8D-DE6CCF1E0972@petsc.dev> > On Oct 1, 2021, at 6:38 AM, Marco Cisternino wrote: > > Thank you Barry. > I added a custom atoll = 1.0e-12 and this makes the CFD stable with all the linear solver types. CFD solution is good and pressure is a good ?zero? field at every CFD iteration. > I did the same test using ASM+ILU+FGMRES(BCGS and GMRES) and the behaviour is the same. > During some CFD iteration the residual of linear system starts slightly higher than atol and the linear solver makes some iteration (2/3 iterations) before it stops because of atol. > The pressure is still different in the 2 sub-domains (order 1.0e-14 because of those few linear solver iterations), therefore no symmetry of the solution In the 2 sub-domains. > I think it is a matter of round-off, do you agree on this? Or do I need to take care of this difference as a symptom of something wrong? Yes, if the differences in the two solutions are order 1.e-14 that is very good, one cannot expect them to be identical. > > Thank you for your support. > > Marco Cisternino > > From: Barry Smith > > Sent: gioved? 
30 settembre 2021 16:39 > To: Marco Cisternino > > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] Disconnected domains and Poisson equation > > > It looks like the initial solution (guess) is to round-off the solution to the linear system 9.010260489109e-14 > > 0 KSP unpreconditioned resid norm 9.010260489109e-14 true resid norm 9.010260489109e-14 ||r(i)||/||b|| 2.021559024868e+00 > 0 KSP Residual norm 9.010260489109e-14 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 > 1 KSP unpreconditioned resid norm 4.918108339808e-15 true resid norm 4.918171792537e-15 ||r(i)||/||b|| 1.103450292594e-01 > 1 KSP Residual norm 4.918108339808e-15 % max 9.566256813737e-01 min 9.566256813737e-01 max/min 1.000000000000e+00 > 2 KSP unpreconditioned resid norm 1.443599554690e-15 true resid norm 1.444867143493e-15 ||r(i)||/||b|| 3.241731154382e-02 > 2 KSP Residual norm 1.443599554690e-15 % max 9.614019380614e-01 min 7.360950481750e-01 max/min 1.306083963538e+00 > > Thus the Krylov solver will not be able to improve the solution, it then gets stuck trying to improve the solution but cannot because of round off. > > In other words the algorithm has converged (even at the initial solution (guess) and should stop immediately. > > You can use -ksp_atol 1.e-12 to get it to stop immediately without iterating if the initial residual is less than 1e-12. > > Barry > > > > > On Sep 30, 2021, at 4:16 AM, Marco Cisternino > wrote: > > Hello Barry. > This is the output of ksp_view using fgmres and gamg. It has to be said that the solution of the linear system should be a zero values field. As you can see both unpreconditioned residual and r/b converge at this iteration of the CFD solver. During the time integration of the CFD, I can observe pressure linear solver residuals behaving in a different way: unpreconditioned residual stil converges but r/b stalls. After the output of ksp_view I add the output of ksp_monitor_true_residual for one of these iteration where r/b stalls. > Thanks, > > KSP Object: 1 MPI processes > type: fgmres > restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement > happy breakdown tolerance 1e-30 > maximum iterations=100, nonzero initial guess > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > right preconditioning > using UNPRECONDITIONED norm type for convergence test > PC Object: 1 MPI processes > type: gamg > type is MULTIPLICATIVE, levels=4 cycles=v > Cycles per PCApply=1 > Using externally compute Galerkin coarse grid matrices > GAMG specific options > Threshold for dropping small values in graph on each level = 0.02 0.02 > Threshold scaling factor for each level not specified = 1. > AGG specific options > Symmetric graph true > Number of levels to square graph 1 > Number smoothing steps 0 > Coarse grid solver -- level ------------------------------- > KSP Object: (mg_coarse_) 1 MPI processes > type: preonly > maximum iterations=10000, initial guess is zero > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > left preconditioning > using NONE norm type for convergence test > PC Object: (mg_coarse_) 1 MPI processes > type: bjacobi > number of blocks = 1 > Local solve is same for all blocks, in the following KSP and PC objects: > KSP Object: (mg_coarse_sub_) 1 MPI processes > type: preonly > maximum iterations=1, initial guess is zero > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
> left preconditioning > using DEFAULT norm type for convergence test > PC Object: (mg_coarse_sub_) 1 MPI processes > type: lu > PC has not been set up so information may be incomplete > out-of-place factorization > tolerance for zero pivot 2.22045e-14 > using diagonal shift on blocks to prevent zero pivot [INBLOCKS] > matrix ordering: nd > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=18, cols=18 > total: nonzeros=104, allocated nonzeros=104 > total number of mallocs used during MatSetValues calls =0 > not using I-node routines > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=18, cols=18 > total: nonzeros=104, allocated nonzeros=104 > total number of mallocs used during MatSetValues calls =0 > not using I-node routines > Down solver (pre-smoother) on level 1 ------------------------------- > KSP Object: (mg_levels_1_) 1 MPI processes > type: chebyshev > eigenvalue estimates used: min = 0., max = 0. > eigenvalues estimate via gmres min 0., max 0. > eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1] > KSP Object: (mg_levels_1_esteig_) 1 MPI processes > type: gmres > restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement > happy breakdown tolerance 1e-30 > maximum iterations=10, initial guess is zero > tolerances: relative=1e-12, absolute=1e-50, divergence=10000. > left preconditioning > using DEFAULT norm type for convergence test > estimating eigenvalues using noisy right hand side > maximum iterations=2, nonzero initial guess > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > left preconditioning > using NONE norm type for convergence test > PC Object: (mg_levels_1_) 1 MPI processes > type: sor > type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=67, cols=67 > total: nonzeros=675, allocated nonzeros=675 > total number of mallocs used during MatSetValues calls =0 > not using I-node routines > Up solver (post-smoother) same as down solver (pre-smoother) > Down solver (pre-smoother) on level 2 ------------------------------- > KSP Object: (mg_levels_2_) 1 MPI processes > type: chebyshev > eigenvalue estimates used: min = 0., max = 0. > eigenvalues estimate via gmres min 0., max 0. > eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1] > KSP Object: (mg_levels_2_esteig_) 1 MPI processes > type: gmres > restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement > happy breakdown tolerance 1e-30 > maximum iterations=10, initial guess is zero > tolerances: relative=1e-12, absolute=1e-50, divergence=10000. > left preconditioning > using DEFAULT norm type for convergence test > estimating eigenvalues using noisy right hand side > maximum iterations=2, nonzero initial guess > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > left preconditioning > using NONE norm type for convergence test > PC Object: (mg_levels_2_) 1 MPI processes > type: sor > type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. 
> linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=348, cols=348 > total: nonzeros=3928, allocated nonzeros=3928 > total number of mallocs used during MatSetValues calls =0 > not using I-node routines > Up solver (post-smoother) same as down solver (pre-smoother) > Down solver (pre-smoother) on level 3 ------------------------------- > KSP Object: (mg_levels_3_) 1 MPI processes > type: chebyshev > eigenvalue estimates used: min = 0., max = 0. > eigenvalues estimate via gmres min 0., max 0. > eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1] > KSP Object: (mg_levels_3_esteig_) 1 MPI processes > type: gmres > restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement > happy breakdown tolerance 1e-30 > maximum iterations=10, initial guess is zero > tolerances: relative=1e-12, absolute=1e-50, divergence=10000. > left preconditioning > using DEFAULT norm type for convergence test > estimating eigenvalues using noisy right hand side > maximum iterations=2, nonzero initial guess > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > left preconditioning > using NONE norm type for convergence test > PC Object: (mg_levels_3_) 1 MPI processes > type: sor > type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=3584, cols=3584 > total: nonzeros=23616, allocated nonzeros=23616 > total number of mallocs used during MatSetValues calls =0 > has attached null space > not using I-node routines > Up solver (post-smoother) same as down solver (pre-smoother) > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=3584, cols=3584 > total: nonzeros=23616, allocated nonzeros=23616 > total number of mallocs used during MatSetValues calls =0 > has attached null space > not using I-node routines > Pressure system has reached convergence in 0 iterations with reason 3. 
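A side note on the absolute tolerance: the view above still shows the default absolute=1e-50 on the outer KSP. If you would rather hard-code the custom atol than pass -ksp_atol, something along these lines should do it, where ksp is the pressure solver (a sketch, not taken from the actual code):

  ierr = KSPSetTolerances(ksp, 1.e-5, 1.e-12, PETSC_DEFAULT, PETSC_DEFAULT);CHKERRQ(ierr);  /* rtol, atol, dtol, maxits */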
> 0 KSP unpreconditioned resid norm 4.798763170703e-16 true resid norm 4.798763170703e-16 ||r(i)||/||b|| 1.000000000000e+00 > 0 KSP Residual norm 4.798763170703e-16 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 > 1 KSP unpreconditioned resid norm 1.648749109132e-17 true resid norm 1.648749109132e-17 ||r(i)||/||b|| 3.435779284125e-02 > 1 KSP Residual norm 1.648749109132e-17 % max 9.561792537103e-01 min 9.561792537103e-01 max/min 1.000000000000e+00 > 2 KSP unpreconditioned resid norm 4.737880600040e-19 true resid norm 4.737880600040e-19 ||r(i)||/||b|| 9.873128619820e-04 > 2 KSP Residual norm 4.737880600040e-19 % max 9.828636644296e-01 min 9.293131521763e-01 max/min 1.057623753767e+00 > 3 KSP unpreconditioned resid norm 2.542212716830e-20 true resid norm 2.542212716830e-20 ||r(i)||/||b|| 5.297641551371e-05 > 3 KSP Residual norm 2.542212716830e-20 % max 9.933572357920e-01 min 9.158303248850e-01 max/min 1.084652046127e+00 > 4 KSP unpreconditioned resid norm 6.614510286263e-21 true resid norm 6.614510286269e-21 ||r(i)||/||b|| 1.378378146822e-05 > 4 KSP Residual norm 6.614510286263e-21 % max 9.950912550705e-01 min 6.296575800237e-01 max/min 1.580368896747e+00 > 5 KSP unpreconditioned resid norm 1.981505525281e-22 true resid norm 1.981505525272e-22 ||r(i)||/||b|| 4.129200493513e-07 > 5 KSP Residual norm 1.981505525281e-22 % max 9.984097962703e-01 min 5.316259535293e-01 max/min 1.878030577029e+00 > Linear solve converged due to CONVERGED_RTOL iterations 5 > > Ksp_monitor_true_residual output for stalling r/b CFD iteration > 0 KSP unpreconditioned resid norm 9.010260489109e-14 true resid norm 9.010260489109e-14 ||r(i)||/||b|| 2.021559024868e+00 > 0 KSP Residual norm 9.010260489109e-14 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 > 1 KSP unpreconditioned resid norm 4.918108339808e-15 true resid norm 4.918171792537e-15 ||r(i)||/||b|| 1.103450292594e-01 > 1 KSP Residual norm 4.918108339808e-15 % max 9.566256813737e-01 min 9.566256813737e-01 max/min 1.000000000000e+00 > 2 KSP unpreconditioned resid norm 1.443599554690e-15 true resid norm 1.444867143493e-15 ||r(i)||/||b|| 3.241731154382e-02 > 2 KSP Residual norm 1.443599554690e-15 % max 9.614019380614e-01 min 7.360950481750e-01 max/min 1.306083963538e+00 > 3 KSP unpreconditioned resid norm 6.623206616803e-16 true resid norm 6.654132553541e-16 ||r(i)||/||b|| 1.492933720678e-02 > 3 KSP Residual norm 6.623206616803e-16 % max 9.764112945239e-01 min 4.911485418014e-01 max/min 1.988016274960e+00 > 4 KSP unpreconditioned resid norm 6.551896936698e-16 true resid norm 6.646157296305e-16 ||r(i)||/||b|| 1.491144376933e-02 > 4 KSP Residual norm 6.551896936698e-16 % max 9.883425885532e-01 min 1.461270778833e-01 max/min 6.763582786091e+00 > 5 KSP unpreconditioned resid norm 6.222297644887e-16 true resid norm 1.720560536914e-15 ||r(i)||/||b|| 3.860282047823e-02 > 5 KSP Residual norm 6.222297644887e-16 % max 1.000409371755e+00 min 4.989767363560e-03 max/min 2.004921870829e+02 > 6 KSP unpreconditioned resid norm 6.496945794974e-17 true resid norm 2.031914800253e-14 ||r(i)||/||b|| 4.558842341106e-01 > 6 KSP Residual norm 6.496945794974e-17 % max 1.004914985753e+00 min 1.459258738706e-03 max/min 6.886475709192e+02 > 7 KSP unpreconditioned resid norm 1.965237342540e-17 true resid norm 1.684522207337e-14 ||r(i)||/||b|| 3.779425772373e-01 > 7 KSP Residual norm 1.965237342540e-17 % max 1.005737762541e+00 min 1.452603803766e-03 max/min 6.923689446035e+02 > 8 KSP unpreconditioned resid norm 1.627718951285e-17 true resid 
norm 1.958642967520e-14 ||r(i)||/||b|| 4.394448276241e-01 > 8 KSP Residual norm 1.627718951285e-17 % max 1.006364278765e+00 min 1.452081813014e-03 max/min 6.930492963590e+02 > 9 KSP unpreconditioned resid norm 1.616577677764e-17 true resid norm 2.019110946644e-14 ||r(i)||/||b|| 4.530115373837e-01 > 9 KSP Residual norm 1.616577677764e-17 % max 1.006648747131e+00 min 1.452031376577e-03 max/min 6.932692801059e+02 > 10 KSP unpreconditioned resid norm 1.285788988203e-17 true resid norm 2.065082694477e-14 ||r(i)||/||b|| 4.633258453698e-01 > 10 KSP Residual norm 1.285788988203e-17 % max 1.007469033514e+00 min 1.433291867068e-03 max/min 7.029057072477e+02 > 11 KSP unpreconditioned resid norm 5.490854431580e-19 true resid norm 1.798071628891e-14 ||r(i)||/||b|| 4.034187394623e-01 > 11 KSP Residual norm 5.490854431580e-19 % max 1.008058905554e+00 min 1.369401685301e-03 max/min 7.361309076612e+02 > 12 KSP unpreconditioned resid norm 1.371754802104e-20 true resid norm 1.965688920064e-14 ||r(i)||/||b|| 4.410256708163e-01 > 12 KSP Residual norm 1.371754802104e-20 % max 1.008409402214e+00 min 1.369243011779e-03 max/min 7.364721919624e+02 > Linear solve converged due to CONVERGED_RTOL iterations 12 > > > > Marco Cisternino > > From: Barry Smith > > Sent: mercoled? 29 settembre 2021 18:34 > To: Marco Cisternino > > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] Disconnected domains and Poisson equation > > > > > > On Sep 29, 2021, at 11:59 AM, Marco Cisternino > wrote: > > For sake of completeness, explicitly building the null space using a vector per sub-domain make s the CFD runs using BCGS and GMRES more stable, but still slower than FGMRES. > > Something is strange. Please run with -ksp_view and send the output on the solver details. > > > > I had divergence using BCGS and GMRES setting the null space with only one constant. > Thanks > > Marco Cisternino > > From: Marco Cisternino > Sent: mercoled? 29 settembre 2021 17:54 > To: Barry Smith > > Cc: petsc-users at mcs.anl.gov > Subject: RE: [petsc-users] Disconnected domains and Poisson equation > > Thank you Barry for the quick reply. > About the null space: I already tried what you suggest, building 2 Vec (constants) with 0 and 1 chosen by sub-domain, normalizing them and setting the null space like this > MatNullSpaceCreate(PETSC_COMM_WORLD,PETSC_FALSE,nconstants,constants,&nullspace); > The solution is slightly different in values but it is still different in the two sub-domains. > About the solver: I tried BCGS, GMRES and FGMRES. The linear system is a pressure system in a navier-stokes solver and only solving with FGMRES makes the CFD stable, with BCGS and GMRES the CFD solution diverges. Moreover, in the same case but with a single domain, CFD solution is stable using all the solvers, but FGMRES converges in much less iterations than the others. > > Marco Cisternino > > From: Barry Smith > > Sent: mercoled? 29 settembre 2021 15:59 > To: Marco Cisternino > > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] Disconnected domains and Poisson equation > > > The problem actually has a two dimensional null space; constant on each domain but possibly different constants. I think you need to build the MatNullSpace by explicitly constructing two vectors, one with 0 on one domain and constant value on the other and one with 0 on the other domain and constant on the first. > > Separate note: why use FGMRES instead of just GMRES? 
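Going back to the null space for a moment, a rough sketch of the two-vector construction just described. Names are invented: A stands for the assembled pressure operator, ierr is a PetscErrorCode, and RowIsInFirstSubdomain() is a hypothetical application helper that tells which locally owned rows belong to which sub-domain.

  Vec          vecs[2];
  MatNullSpace nullspace;
  PetscInt     i, rstart, rend;

  ierr = MatCreateVecs(A, &vecs[0], NULL);CHKERRQ(ierr);
  ierr = VecDuplicate(vecs[0], &vecs[1]);CHKERRQ(ierr);
  ierr = VecSet(vecs[0], 0.0);CHKERRQ(ierr);
  ierr = VecSet(vecs[1], 0.0);CHKERRQ(ierr);
  ierr = VecGetOwnershipRange(vecs[0], &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    if (RowIsInFirstSubdomain(i)) {   /* hypothetical helper */
      ierr = VecSetValue(vecs[0], i, 1.0, INSERT_VALUES);CHKERRQ(ierr);
    } else {
      ierr = VecSetValue(vecs[1], i, 1.0, INSERT_VALUES);CHKERRQ(ierr);
    }
  }
  ierr = VecAssemblyBegin(vecs[0]);CHKERRQ(ierr); ierr = VecAssemblyEnd(vecs[0]);CHKERRQ(ierr);
  ierr = VecAssemblyBegin(vecs[1]);CHKERRQ(ierr); ierr = VecAssemblyEnd(vecs[1]);CHKERRQ(ierr);
  ierr = VecNormalize(vecs[0], NULL);CHKERRQ(ierr);
  ierr = VecNormalize(vecs[1], NULL);CHKERRQ(ierr);
  ierr = MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_FALSE, 2, vecs, &nullspace);CHKERRQ(ierr);
  ierr = MatSetNullSpace(A, nullspace);CHKERRQ(ierr);
  ierr = MatNullSpaceDestroy(&nullspace);CHKERRQ(ierr);
  ierr = VecDestroy(&vecs[0]);CHKERRQ(ierr);
  ierr = VecDestroy(&vecs[1]);CHKERRQ(ierr);

The two vectors have disjoint support, so after normalization they form an orthonormal basis of the two-dimensional null space.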
If the problem is linear and the preconditioner is linear (no GMRES inside the smoother) then you can just use GMRES and it will save a little space/work and be conceptually clearer. > > Barry > > > On Sep 29, 2021, at 8:46 AM, Marco Cisternino > wrote: > > Good morning, > I want to solve the Poisson equation on a 3D domain with 2 non-connected sub-domains. > I am using FGMRES+GAMG and I have no problem if the two sub-domains see a Dirichlet boundary condition each. > On the same domain I would like to solve the Poisson equation imposing periodic boundary condition in one direction and homogenous Neumann boundary conditions in the other two directions. The two sub-domains are symmetric with respect to the separation between them and the operator discretization and the right hand side are symmetric as well. It would be nice to have the same solution in both the sub-domains. > Setting the null space to the constant, the solver converges to a solution having the same gradients in both sub-domains but different values. > Am I doing some wrong with the null space? I?m not setting a block matrix (one block for each sub-domain), should I? > I tested the null space against the matrix using MatNullSpaceTest and the answer is true. Can I do something more to have a symmetric solution as outcome of the solver? > Thank you in advance for any comments and hints. > > Best regards, > > Marco Cisternino -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Fri Oct 1 10:03:32 2021 From: bsmith at petsc.dev (Barry Smith) Date: Fri, 1 Oct 2021 11:03:32 -0400 Subject: [petsc-users] (percent time in this phase) In-Reply-To: <7B8AE6C6-D949-4D65-916C-0F00628DA9AA@stfc.ac.uk> References: <6295C9A3-0EC7-4D6A-8F62-88EC8651D207@stfc.ac.uk> <3B13EDB4-A22B-421B-9B5C-F95BA9CF9705@petsc.dev> <7B8AE6C6-D949-4D65-916C-0F00628DA9AA@stfc.ac.uk> Message-ID: <54B15B76-8815-4E07-BB59-F1EA9908274B@petsc.dev> What is "max time"? It is best to gather timings with a stage PetscLogStagePush() to get a separate subtable for exactly the part of the code you want timing for. For example if you are studying linear solver time you want only the solver part of the code in the stage, not the time to build the matrix and right hand side. It is very difficult to get really correct publishable reliable performance numbers when comparing solvers with similar timings on parallel machines and especially GPUs. Values can be very dependent on particular compilers used, the specific hardware used, generation of memory used etc. Barry > On Oct 1, 2021, at 8:51 AM, Karthikeyan Chockalingam - STFC UKRI wrote: > > Hi Barry, > > Yes, looks like it is computationally faster using GPUs. I used block jacobi as the preconditioner. > I have attached the output data for cpu and gpu using -ksp_view. I am not sure; what information I should be looking at using -ksp_view? > > I have an outstanding question, > > event time > T% > cal = (event/max)*100 > max time > 2.87E+02 > KSPSolve > 1.58E+02 > 53 > 55.2 > MatMult > 1.08E+01 > 4 > 3.76 > PCApply > 1.31E+02 > 37 > 45.6 > VecNorm > 6.23E+01 > 11 > 21.7 > > Matt couple of days back helped breakdown KSPSolve (53 %) ~ PCApply (37%) + VecNorm (11%) + MatMul (4%) > > However, when I calculate T% manually using max time, the numbers for PCApply and VecNorm are way off as you can see from the above table. > As a result, the cumulative sum of event time don?t match up to KSPSolve. Can you please let me know what I might be doing wrong? 
> > I will be performing extensive benchmarking of various preconditioners and comparing their performance on cpus and gpus, so this information is critical. > > Many thanks! > Karthik. > > From: Barry Smith > > Date: Thursday, 30 September 2021 at 15:47 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] (percent time in this phase) > > > The MatSolve is no better on the GPUs then on the CPU; while other parts of the computation seem to speed up nicely. What is the result of -ksp_view ? Are you using ILU(0) as the preconditioner, this will not solve well on the GPU, its solve is essentially sequential. You won't want to use ILU(0) in this way on GPUs. > > Barry > > > > On Sep 30, 2021, at 9:41 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Based on your feedback from yesterday. I was trying to breakdown KSPSolve. > Please find the attached bar plot. The numbers are not adding up at least for GPUs. > Your feedback from yesterday were based on T%. > I plotted the time spend on each event, hoping that the cumulative sum would add up to KSPSolve time. > > Kind regards, > Karthik. > > From: Matthew Knepley > > Date: Thursday, 30 September 2021 at 13:52 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] (percent time in this phase) > > On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > When comparing the MatSolve data for > > GPU > > MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 > MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > and CPU > > MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 > > the time spent is almost the same for this preconditioner. Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) > > mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor > > So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? > > Looks like that. > > Thanks > > Matt > > KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % > > Best, > Karthik. > > > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 16:29 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > Thank you! > > Just to summarize > > KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % > > You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? > > I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. > > MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? > > Yes. > > Thanks, > > Matt > > Best, > > Karthik. 
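On the ILU(0) point above: a comparison run that keeps the preconditioner application on the GPU could use algebraic multigrid with Jacobi-smoothed levels instead, for example along the lines below. The options are sketched from memory and not verified on this particular example, so treat them only as a starting point:

mpirun -n 2 ./ex45 -log_view -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 \
       -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda \
       -ksp_type cg -pc_type gamg -mg_levels_pc_type jacobi -ksp_monitor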
> > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 11:58 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > Thank you Mathew. Now, it is all making sense to me. > > From data file ksp_ex45_N511_gpu_2.txt > > KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). > > However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? > > 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one process and using Block-Jacobi . Half the time is spent in the solve (53%) > > KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 > KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 > > > 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. > > PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 > MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 > MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. > > PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 > MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 > MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 > > 4) The significant vector time is all in norms (11%) since they are really slow on the GPU. > > > VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 > VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 > VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 > VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 > > So the solve time is: > > 53% ~ 37% + 4% + 11% > > and the setup time is about 16%. I was wrong about the SetUp time being included, as it is outside the event: > > https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 > > It looks like the remainder of the time (23%) is spent preallocating the matrix. > > Thanks, > > Matt > > The MalMult event is 4 %. How does this event figure into the above equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? > > Best, > Karthik. 
> > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 10:58 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > That was helpful. I would like to provide some additional details of my run on cpus and gpus. Please find the following attachments: > > graph.pdf a plot showing overall time and various petsc events. > ksp_ex45_N511_cpu_6.txt data file of the log_summary > ksp_ex45_N511_gpu_2.txt data file of the log_summary > > I used the following petsc options for cpu > > mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor > > and for gpus > > mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor > > to run the following problem > > https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html > > From the above code, I see is there no individual function called KSPSetUp(), so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is KSPSetUp time and KSPSolve time mutually exclusive? > > No, KSPSetUp() will be contained in KSPSolve() if it is called automatically. > > In your response you said that > > ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used.? > > I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for this particular preconditioner (bjacobi) how can I tell how they are timed? > > They are all inside KSPSolve(). If you have a preconditioned linear solve, the oreconditioning happens during the iteration. So an iteration would mostly > consist of MatMult + PCApply, with some vector work. > > I am hoping to time KSP solving and preconditioning mutually exclusively. > > I am not sure that concept makes sense here. See above. > > Thanks, > > Matt > > > Kind regards, > Karthik. > > > From: Barry Smith > > Date: Tuesday, 28 September 2021 at 19:19 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > > > > On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Thanks for Barry for your response. > > I was just benchmarking the problem with various preconditioner on cpu and gpu. I understand, it is not possible to get mutually exclusive timing. > However, can you tell if KSPSolve time includes both PCSetup and PCApply? And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and PCApply. > > If you do not call KSPSetUp() separately from KSPSolve() then its time is included with KSPSolve(). > > PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used. > > So yes the answer is not totally satisfying. The one thing I would recommend is to not call KSPSetUp() directly and then KSPSolve() will always include the total time of the solve plus all setup time. PCApply will contain all the time to apply the preconditioner but may also include some setup time. 
> > Barry > > > Best, > Karthik. > > > > > From: Barry Smith > > Date: Tuesday, 28 September 2021 at 16:56 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > > > > On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Hello, > > I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson problem. I noticed from the output from using the flag -log_summary that for various events their respective %T (percent time in this phase) do not add up to 100 but rather exceeds 100. So, I gather there is some overlap among these events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and PCSolve. Is it possible to get a mutually exclusive %T or Time for these individual events? I have attached the log_summary output file from my run for your reference. > > > For nested solvers it is tricky to get the times to be mutually exclusive because some parts of the building of the preconditioner is for some preconditioners delayed until the solve has started. > > It looks like you are using the default preconditioner options which for this example are taking more or less no time since so many iterations are needed. It is best to use -pc_type mg to use geometric multigrid on this problem. > > Barry > > > > > Thanks! > Karthik. > > This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. > > > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From karthikeyan.chockalingam at stfc.ac.uk Fri Oct 1 11:52:40 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Fri, 1 Oct 2021 16:52:40 +0000 Subject: [petsc-users] (percent time in this phase) In-Reply-To: <54B15B76-8815-4E07-BB59-F1EA9908274B@petsc.dev> References: <6295C9A3-0EC7-4D6A-8F62-88EC8651D207@stfc.ac.uk> <3B13EDB4-A22B-421B-9B5C-F95BA9CF9705@petsc.dev> <7B8AE6C6-D949-4D65-916C-0F00628DA9AA@stfc.ac.uk> <54B15B76-8815-4E07-BB59-F1EA9908274B@petsc.dev> Message-ID: <24B0DB73-D49B-4CAB-B237-9217AF99A2C0@stfc.ac.uk> Hi Barry, Thanks for your comment. I took max time as 2.868e+02 sec from the below table, as the total time taken to solve the entire problem. Was I correct in my assumption? Using this max time, I manually tried to calculate the individual event percentage to see if it matched up to T%. Max Max/Min Avg Total Time (sec): 2.868e+02 1.000 2.868e+02 Objects: 3.800e+01 1.000 3.800e+01 Flop: 8.659e+11 1.004 8.642e+11 1.728e+12 Flop/sec: 3.019e+09 1.004 3.013e+09 6.026e+09 Memory: 1.764e+10 1.004 1.760e+10 3.521e+10 MPI Messages: 3.430e+02 1.000 3.430e+02 6.860e+02 MPI Message Lengths: 7.134e+08 1.000 2.080e+06 1.427e+09 MPI Reductions: 4.637e+03 1.000 Best, Karthik. From: Barry Smith Date: Friday, 1 October 2021 at 16:03 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] (percent time in this phase) What is "max time"? It is best to gather timings with a stage PetscLogStagePush() to get a separate subtable for exactly the part of the code you want timing for. For example if you are studying linear solver time you want only the solver part of the code in the stage, not the time to build the matrix and right hand side. It is very difficult to get really correct publishable reliable performance numbers when comparing solvers with similar timings on parallel machines and especially GPUs. Values can be very dependent on particular compilers used, the specific hardware used, generation of memory used etc. Barry On Oct 1, 2021, at 8:51 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: Hi Barry, Yes, looks like it is computationally faster using GPUs. I used block jacobi as the preconditioner. I have attached the output data for cpu and gpu using -ksp_view. I am not sure; what information I should be looking at using -ksp_view? I have an outstanding question, event time T% cal = (event/max)*100 max time 2.87E+02 KSPSolve 1.58E+02 53 55.2 MatMult 1.08E+01 4 3.76 PCApply 1.31E+02 37 45.6 VecNorm 6.23E+01 11 21.7 Matt couple of days back helped breakdown KSPSolve (53 %) ~ PCApply (37%) + VecNorm (11%) + MatMul (4%) However, when I calculate T% manually using max time, the numbers for PCApply and VecNorm are way off as you can see from the above table. As a result, the cumulative sum of event time don?t match up to KSPSolve. Can you please let me know what I might be doing wrong? I will be performing extensive benchmarking of various preconditioners and comparing their performance on cpus and gpus, so this information is critical. Many thanks! Karthik. From: Barry Smith > Date: Thursday, 30 September 2021 at 15:47 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) The MatSolve is no better on the GPUs then on the CPU; while other parts of the computation seem to speed up nicely. What is the result of -ksp_view ? 
Are you using ILU(0) as the preconditioner, this will not solve well on the GPU, its solve is essentially sequential. You won't want to use ILU(0) in this way on GPUs. Barry On Sep 30, 2021, at 9:41 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: Based on your feedback from yesterday. I was trying to breakdown KSPSolve. Please find the attached bar plot. The numbers are not adding up at least for GPUs. Your feedback from yesterday were based on T%. I plotted the time spend on each event, hoping that the cumulative sum would add up to KSPSolve time. Kind regards, Karthik. From: Matthew Knepley > Date: Thursday, 30 September 2021 at 13:52 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: When comparing the MatSolve data for GPU MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 and CPU MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 the time spent is almost the same for this preconditioner. Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? Looks like that. Thanks Matt KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 16:29 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you! Just to summarize KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? Yes. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 11:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you Mathew. Now, it is all making sense to me. From data file ksp_ex45_N511_gpu_2.txt KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one process and using Block-Jacobi . 
Half the time is spent in the solve (53%) KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 4) The significant vector time is all in norms (11%) since they are really slow on the GPU. VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 So the solve time is: 53% ~ 37% + 4% + 11% and the setup time is about 16%. I was wrong about the SetUp time being included, as it is outside the event: https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 It looks like the remainder of the time (23%) is spent preallocating the matrix. Thanks, Matt The MalMult event is 4 %. How does this event figure into the above equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 10:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI > wrote: That was helpful. I would like to provide some additional details of my run on cpus and gpus. Please find the following attachments: 1. graph.pdf a plot showing overall time and various petsc events. 2. ksp_ex45_N511_cpu_6.txt data file of the log_summary 3. 
ksp_ex45_N511_gpu_2.txt data file of the log_summary I used the following petsc options for cpu mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor and for gpus mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor to run the following problem https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html From the above code, I see is there no individual function called KSPSetUp(), so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is KSPSetUp time and KSPSolve time mutually exclusive? No, KSPSetUp() will be contained in KSPSolve() if it is called automatically. In your response you said that ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used.? I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for this particular preconditioner (bjacobi) how can I tell how they are timed? They are all inside KSPSolve(). If you have a preconditioned linear solve, the oreconditioning happens during the iteration. So an iteration would mostly consist of MatMult + PCApply, with some vector work. I am hoping to time KSP solving and preconditioning mutually exclusively. I am not sure that concept makes sense here. See above. Thanks, Matt Kind regards, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 19:19 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: Thanks for Barry for your response. I was just benchmarking the problem with various preconditioner on cpu and gpu. I understand, it is not possible to get mutually exclusive timing. However, can you tell if KSPSolve time includes both PCSetup and PCApply? And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and PCApply. If you do not call KSPSetUp() separately from KSPSolve() then its time is included with KSPSolve(). PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used. So yes the answer is not totally satisfying. The one thing I would recommend is to not call KSPSetUp() directly and then KSPSolve() will always include the total time of the solve plus all setup time. PCApply will contain all the time to apply the preconditioner but may also include some setup time. Barry Best, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 16:56 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson problem. I noticed from the output from using the flag -log_summary that for various events their respective %T (percent time in this phase) do not add up to 100 but rather exceeds 100. So, I gather there is some overlap among these events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and PCSolve. 
Is it possible to get a mutually exclusive %T or Time for these individual events? I have attached the log_summary output file from my run for your reference. For nested solvers it is tricky to get the times to be mutually exclusive because some parts of the building of the preconditioner is for some preconditioners delayed until the solve has started. It looks like you are using the default preconditioner options which for this example are taking more or less no time since so many iterations are needed. It is best to use -pc_type mg to use geometric multigrid on this problem. Barry Thanks! Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Fri Oct 1 16:52:45 2021 From: bsmith at petsc.dev (Barry Smith) Date: Fri, 1 Oct 2021 17:52:45 -0400 Subject: [petsc-users] (percent time in this phase) In-Reply-To: <24B0DB73-D49B-4CAB-B237-9217AF99A2C0@stfc.ac.uk> References: <6295C9A3-0EC7-4D6A-8F62-88EC8651D207@stfc.ac.uk> <3B13EDB4-A22B-421B-9B5C-F95BA9CF9705@petsc.dev> <7B8AE6C6-D949-4D65-916C-0F00628DA9AA@stfc.ac.uk> <54B15B76-8815-4E07-BB59-F1EA9908274B@petsc.dev> <24B0DB73-D49B-4CAB-B237-9217AF99A2C0@stfc.ac.uk> Message-ID: <6809F24A-7A54-48D9-90C0-3BB53004EACA@petsc.dev> That max_time is the maximum over all processes from PetscInitialize() to PetscFinalize(), it is not a good number to use to compute the percentages since it also includes the time to get all the MPI ranks up and running. This is why I recommend using a PetscStage and the percentages it reports for time, these will reflect exactly the times relevant for your computations. Barry > On Oct 1, 2021, at 12:52 PM, Karthikeyan Chockalingam - STFC UKRI wrote: > > Hi Barry, > > Thanks for your comment. > > I took max time as 2.868e+02 sec from the below table, as the total time taken to solve the entire problem. Was I correct in my assumption? > Using this max time, I manually tried to calculate the individual event percentage to see if it matched up to T%. 
> > Max Max/Min Avg Total > Time (sec): 2.868e+02 1.000 2.868e+02 > Objects: 3.800e+01 1.000 3.800e+01 > Flop: 8.659e+11 1.004 8.642e+11 1.728e+12 > Flop/sec: 3.019e+09 1.004 3.013e+09 6.026e+09 > Memory: 1.764e+10 1.004 1.760e+10 3.521e+10 > MPI Messages: 3.430e+02 1.000 3.430e+02 6.860e+02 > MPI Message Lengths: 7.134e+08 1.000 2.080e+06 1.427e+09 > MPI Reductions: 4.637e+03 1.000 > > Best, > Karthik. > > From: Barry Smith > > Date: Friday, 1 October 2021 at 16:03 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] (percent time in this phase) > > > What is "max time"? It is best to gather timings with a stage PetscLogStagePush() to get a separate subtable for exactly the part of the code you want timing for. For example if you are studying linear solver time you want only the solver part of the code in the stage, not the time to build the matrix and right hand side. > > It is very difficult to get really correct publishable reliable performance numbers when comparing solvers with similar timings on parallel machines and especially GPUs. Values can be very dependent on particular compilers used, the specific hardware used, generation of memory used etc. > > Barry > > > > On Oct 1, 2021, at 8:51 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Hi Barry, > > Yes, looks like it is computationally faster using GPUs. I used block jacobi as the preconditioner. > I have attached the output data for cpu and gpu using -ksp_view. I am not sure; what information I should be looking at using -ksp_view? > > I have an outstanding question, > > event time > T% > cal = (event/max)*100 > max time > 2.87E+02 > KSPSolve > 1.58E+02 > 53 > 55.2 > MatMult > 1.08E+01 > 4 > 3.76 > PCApply > 1.31E+02 > 37 > 45.6 > VecNorm > 6.23E+01 > 11 > 21.7 > > Matt couple of days back helped breakdown KSPSolve (53 %) ~ PCApply (37%) + VecNorm (11%) + MatMul (4%) > > However, when I calculate T% manually using max time, the numbers for PCApply and VecNorm are way off as you can see from the above table. > As a result, the cumulative sum of event time don?t match up to KSPSolve. Can you please let me know what I might be doing wrong? > > I will be performing extensive benchmarking of various preconditioners and comparing their performance on cpus and gpus, so this information is critical. > > Many thanks! > Karthik. > > From: Barry Smith > > Date: Thursday, 30 September 2021 at 15:47 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] (percent time in this phase) > > > The MatSolve is no better on the GPUs then on the CPU; while other parts of the computation seem to speed up nicely. What is the result of -ksp_view ? Are you using ILU(0) as the preconditioner, this will not solve well on the GPU, its solve is essentially sequential. You won't want to use ILU(0) in this way on GPUs. > > Barry > > > > > On Sep 30, 2021, at 9:41 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Based on your feedback from yesterday. I was trying to breakdown KSPSolve. > Please find the attached bar plot. The numbers are not adding up at least for GPUs. > Your feedback from yesterday were based on T%. > I plotted the time spend on each event, hoping that the cumulative sum would add up to KSPSolve time. > > Kind regards, > Karthik. 
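As a concrete illustration of the stage-based timing Barry describes above, here is a minimal C sketch (the diagonal stand-in system, the stage name, and the overall structure are illustrative assumptions, not taken from the thread; only the PetscLogStageRegister/Push/Pop calls around KSPSolve are the point):

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PetscLogStage  solveStage;
  PetscInt       i, Istart, Iend, n = 100;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Stand-in assembly: a diagonal system; a real code would build its operator here */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
  for (i = Istart; i < Iend; i++) {
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);

  /* Everything between Push and Pop is logged in its own subtable of -log_view,
     so the event percentages there are relative to the solve alone, not to the
     whole run (which also contains startup, assembly, etc.). */
  ierr = PetscLogStageRegister("KSPSolve only", &solveStage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(solveStage);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Running this with -log_view (or -log_summary) then prints a separate "KSPSolve only" stage, so KSPSolve, PCApply, MatSolve and the vector events are reported as percentages of the solve itself rather than of the time from PetscInitialize() to PetscFinalize().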
> [...]
> > Barry > > > > > Thanks! > Karthik. > > This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. > > > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyan.chockalingam at stfc.ac.uk Sun Oct 3 04:43:40 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Sun, 3 Oct 2021 09:43:40 +0000 Subject: [petsc-users] (percent time in this phase) In-Reply-To: References: Message-ID: <6C41B1AA-ABF0-4BE8-885D-53F22535E032@stfc.ac.uk> Hi Matt, Thank you. The flamegraph tool is helpful. Please find the attached screen shoot and foo.txt which generated that graph (using https://www.speedscope.app). I find the following call sequence from the graph KSPSolve -> PCApply -> KSPSolve -> PCApply -> MatSolve I have a couple of questions 1. The KSPSolve time listed in the file using -log_summary (or -log_view), is it the time taken by the first KSPSolve (in the above call sequence)? 2. What is the unit of measurement in flamegrah? Thanks, Karthik. From: Matthew Knepley Date: Friday, 1 October 2021 at 14:51 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: Barry Smith , "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] (percent time in this phase) On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: When comparing the MatSolve data for GPU MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 and CPU MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 the time spent is almost the same for this preconditioner. 
Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % I am getting so old. We have a different kind of log output if you are really concerned about inclusion. You can run with -log_view :foo.txt:ascii_flamegraph and then there are tools for plotting that output, described here https://firedrakeproject.org/optimising.html This output _guarantees_ strict inclusion, so you will not have the problems you have above adding things up. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 16:29 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you! Just to summarize KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? Yes. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 11:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you Mathew. Now, it is all making sense to me. From data file ksp_ex45_N511_gpu_2.txt KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one process and using Block-Jacobi . Half the time is spent in the solve (53%) KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. 
[...]
-- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Screenshot 2021-10-03 at 10.33.21 3.png Type: image/png Size: 74759 bytes Desc: Screenshot 2021-10-03 at 10.33.21 3.png URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: foo.txt URL: From knepley at gmail.com Sun Oct 3 06:54:28 2021 From: knepley at gmail.com (Matthew Knepley) Date: Sun, 3 Oct 2021 07:54:28 -0400 Subject: [petsc-users] (percent time in this phase) In-Reply-To: <6C41B1AA-ABF0-4BE8-885D-53F22535E032@stfc.ac.uk> References: <6C41B1AA-ABF0-4BE8-885D-53F22535E032@stfc.ac.uk> Message-ID: On Sun, Oct 3, 2021 at 5:43 AM Karthikeyan Chockalingam - STFC UKRI < karthikeyan.chockalingam at stfc.ac.uk> wrote: > Hi Matt, > > > > Thank you. The flamegraph tool is helpful. > > Please find the attached screen shoot and foo.txt which generated that > graph (using https://www.speedscope.app). > > I find the following call sequence from the graph > > KSPSolve -> PCApply -> KSPSolve -> PCApply -> MatSolve > > > > I have a couple of questions > > 1. The KSPSolve time listed in the file using -log_summary (or > -log_view), is it the time taken by the first KSPSolve (in the above call > sequence)? > > Yes. When calls are nested, we just do not record the time for the nested call in log_view. > > 1. > 2. What is the unit of measurement in flamegrah? > > I believe it is microseconds, but I am not sure. Thanks, Matt > > 1. > > Thanks, > > Karthik. > > > > *From: *Matthew Knepley > *Date: *Friday, 1 October 2021 at 14:51 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *Barry Smith , "petsc-users at mcs.anl.gov" < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] (percent time in this phase) > > > > On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > When comparing the MatSolve data for > > > > GPU > > > > MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 > 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 > 0.00e+00 100 > > MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 > 0.00e+00 0 > > > > and CPU > > > > MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 > 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 > > > > the time spent is almost the same for this preconditioner. 
[...]
-------------- next part -------------- An HTML attachment was scrubbed...
URL: From velizhaninae at gmail.com Sun Oct 3 15:45:46 2021 From: velizhaninae at gmail.com (Yelyzaveta Velizhanina) Date: Sun, 3 Oct 2021 22:45:46 +0200 Subject: [petsc-users] Eigenvalues always converge to zero when using slepc4py-complex Message-ID: Dear all, I am having a problem to get EPS run properly with PETSc and SLEPc build with scalar_value=complex. I am using petsc4py and slepc4py. Installed everything, including PETSc and SLEPc, with conda. While real scalar value build works well, when using the complex one, all the eigenvalues always converge to 0 for any matrix and any solver. I?ve tried running examples given in this repo https://github.com/myousefi2016/slepc4py as well - same outcome, only zero eigenvalues. I am running MacOSX BigSur. Will appreciate any help, Best regards, Yelyzaveta Velizhanina. From jroman at dsic.upv.es Mon Oct 4 08:10:21 2021 From: jroman at dsic.upv.es (Jose E. Roman) Date: Mon, 4 Oct 2021 15:10:21 +0200 Subject: [petsc-users] Eigenvalues always converge to zero when using slepc4py-complex In-Reply-To: References: Message-ID: <3AE681B3-3351-4324-93BE-A2F847831DC0@dsic.upv.es> Conda supports complex scalars for petsc4py. However, this is not implemented in slepc4py. Lisandro is trying to get this fixed, so if no issues arise this will be available in a couple of days, with slepc4py-3.16.0. Jose > El 3 oct 2021, a las 22:45, Yelyzaveta Velizhanina escribi?: > > Dear all, > > I am having a problem to get EPS run properly with PETSc and SLEPc build with scalar_value=complex. I am using petsc4py and slepc4py. Installed everything, including PETSc and SLEPc, with conda. While real scalar value build works well, when using the complex one, all the eigenvalues always converge to 0 for any matrix and any solver. I?ve tried running examples given in this repo https://github.com/myousefi2016/slepc4py as well - same outcome, only zero eigenvalues. I am running MacOSX BigSur. > > Will appreciate any help, > > Best regards, > Yelyzaveta Velizhanina. From velizhaninae at gmail.com Mon Oct 4 08:13:32 2021 From: velizhaninae at gmail.com (Yelyzaveta Velizhanina) Date: Mon, 4 Oct 2021 13:13:32 +0000 Subject: [petsc-users] Eigenvalues always converge to zero when using slepc4py-complex In-Reply-To: <3AE681B3-3351-4324-93BE-A2F847831DC0@dsic.upv.es> References: <3AE681B3-3351-4324-93BE-A2F847831DC0@dsic.upv.es> Message-ID: I see. Thanks, much appreciated. Best regards, Yelyzaveta Velizhanina. From: Jose E. Roman Date: Monday, 4 October 2021 at 15:10 To: Yelyzaveta Velizhanina Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Eigenvalues always converge to zero when using slepc4py-complex Conda supports complex scalars for petsc4py. However, this is not implemented in slepc4py. Lisandro is trying to get this fixed, so if no issues arise this will be available in a couple of days, with slepc4py-3.16.0. Jose > El 3 oct 2021, a las 22:45, Yelyzaveta Velizhanina escribi?: > > Dear all, > > I am having a problem to get EPS run properly with PETSc and SLEPc build with scalar_value=complex. I am using petsc4py and slepc4py. Installed everything, including PETSc and SLEPc, with conda. While real scalar value build works well, when using the complex one, all the eigenvalues always converge to 0 for any matrix and any solver. I?ve tried running examples given in this repo https://github.com/myousefi2016/slepc4py as well - same outcome, only zero eigenvalues. I am running MacOSX BigSur. > > Will appreciate any help, > > Best regards, > Yelyzaveta Velizhanina. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From pjool at mek.dtu.dk Mon Oct 4 08:26:42 2021 From: pjool at mek.dtu.dk (=?iso-8859-1?Q?Peder_J=F8rgensgaard_Olesen?=) Date: Mon, 4 Oct 2021 13:26:42 +0000 Subject: [petsc-users] Skipping data when reading from binary file Message-ID: Hello I have a binary file in which a mix of different objects is stored (Vecs, Mats, PetscInts). I can read each element just fine using VecLoad, MatLoad, and PetscIntView, provided they're read in the order in which they were put in the binary. What I would like to do is to skip the reading of any unneeded element, instead proceeding directly to the next one. I tried using PetscBinarySeek() for this, as shown in the attached code. This produces segmentation faults, suggesting that the file pointer isn't going where I want it to. Any suggestions as to what I'm doing wrong here? Best regards, Peder [http://www.dtu.dk/-/media/DTU_Generelt/Andet/mail-signature-logo.png] Peder J?rgensgaard Olesen PhD student Department of Mechanical Engineering pjool at mek.dtu.dk Koppels All? Building 403, room 105 2800 Kgs. Lyngby www.dtu.dk/english -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: skip_mwe.c Type: text/x-csrc Size: 3880 bytes Desc: skip_mwe.c URL: From knepley at gmail.com Mon Oct 4 08:37:08 2021 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 4 Oct 2021 09:37:08 -0400 Subject: [petsc-users] Skipping data when reading from binary file In-Reply-To: References: Message-ID: On Mon, Oct 4, 2021 at 9:26 AM Peder J?rgensgaard Olesen via petsc-users < petsc-users at mcs.anl.gov> wrote: > Hello > > I have a binary file in which a mix of different objects is stored (Vecs, > Mats, PetscInts). I can read each element just fine using VecLoad, MatLoad, > and PetscIntView, provided they're read in the order in which they were put > in the binary. What I would like to do is to skip the reading of any > unneeded element, instead proceeding directly to the next one. I tried > using PetscBinarySeek() for this, as shown in the attached code. This > produces segmentation faults, suggesting that the file pointer isn't going > where I want it to. > There is header information you also have to skip for each object. We can go over the sizes for that (it is best just to look at the code), but that is fragile. A more robust way to achieve this random access is to use HDF5 and name the objects. Thanks, Matt > Any suggestions as to what I'm doing wrong here? > > > Best regards, > > Peder > > > Peder J?rgensgaard Olesen > PhD student > Department of Mechanical Engineering > > pjool at mek.dtu.dk > Koppels All? > Building 403, room 105 > 2800 Kgs. Lyngby > www.dtu.dk/english > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From pjool at mek.dtu.dk Mon Oct 4 08:58:42 2021 From: pjool at mek.dtu.dk (=?iso-8859-1?Q?Peder_J=F8rgensgaard_Olesen?=) Date: Mon, 4 Oct 2021 13:58:42 +0000 Subject: [petsc-users] Skipping data when reading from binary file In-Reply-To: References: , Message-ID: <20c73a37aa9d421e9b5023b062043aa0@mek.dtu.dk> Thank you for your quick reply. 
I've had to change away from HDF5 to Binary format at an earlier stage in my work due to the former not working well with what I needed, so I would prefer to stick with the binary format. I had a quick view at the code for some of the Viewer-routines, but I'm not well versed in gleaning information about header sizes from that. Hints about what I'm looking for there would be appreciated. Best regards Peder ________________________________ Fra: Matthew Knepley Sendt: 4. oktober 2021 15:37:08 Til: Peder J?rgensgaard Olesen Cc: petsc-users at mcs.anl.gov Emne: Re: [petsc-users] Skipping data when reading from binary file On Mon, Oct 4, 2021 at 9:26 AM Peder J?rgensgaard Olesen via petsc-users > wrote: Hello I have a binary file in which a mix of different objects is stored (Vecs, Mats, PetscInts). I can read each element just fine using VecLoad, MatLoad, and PetscIntView, provided they're read in the order in which they were put in the binary. What I would like to do is to skip the reading of any unneeded element, instead proceeding directly to the next one. I tried using PetscBinarySeek() for this, as shown in the attached code. This produces segmentation faults, suggesting that the file pointer isn't going where I want it to. There is header information you also have to skip for each object. We can go over the sizes for that (it is best just to look at the code), but that is fragile. A more robust way to achieve this random access is to use HDF5 and name the objects. Thanks, Matt Any suggestions as to what I'm doing wrong here? Best regards, Peder [http://www.dtu.dk/-/media/DTU_Generelt/Andet/mail-signature-logo.png] Peder J?rgensgaard Olesen PhD student Department of Mechanical Engineering pjool at mek.dtu.dk Koppels All? Building 403, room 105 2800 Kgs. Lyngby www.dtu.dk/english -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Mon Oct 4 09:38:10 2021 From: bsmith at petsc.dev (Barry Smith) Date: Mon, 4 Oct 2021 10:38:10 -0400 Subject: [petsc-users] Skipping data when reading from binary file In-Reply-To: <20c73a37aa9d421e9b5023b062043aa0@mek.dtu.dk> References: <20c73a37aa9d421e9b5023b062043aa0@mek.dtu.dk> Message-ID: <759DAB32-2D48-47EE-B9F6-64C53701E885@petsc.dev> To minimize code changes you could add a PETSc viewer format that caused skipping reading in an object. Then each object load would need a skip-read method that mimicked the reading but actually just skipped over the parts of the data of the object (using the correct sizes). For vectors this is trivial since you just skip the single known array. For sparse matrices it is not difficult but you will need to read in the number of nonzeros so you know how much to skip etc. Barry > On Oct 4, 2021, at 9:58 AM, Peder J?rgensgaard Olesen via petsc-users wrote: > > Thank you for your quick reply. > > I've had to change away from HDF5 to Binary format at an earlier stage in my work due to the former not working well with what I needed, so I would prefer to stick with the binary format. > I had a quick view at the code for some of the Viewer-routines, but I'm not well versed in gleaning information about header sizes from that. Hints about what I'm looking for there would be appreciated. > > Best regards > Peder > Fra: Matthew Knepley > Sendt: 4. 
oktober 2021 15:37:08 > Til: Peder J?rgensgaard Olesen > Cc: petsc-users at mcs.anl.gov > Emne: Re: [petsc-users] Skipping data when reading from binary file > > On Mon, Oct 4, 2021 at 9:26 AM Peder J?rgensgaard Olesen via petsc-users > wrote: > Hello > I have a binary file in which a mix of different objects is stored (Vecs, Mats, PetscInts). I can read each element just fine using VecLoad, MatLoad, and PetscIntView, provided they're read in the order in which they were put in the binary. What I would like to do is to skip the reading of any unneeded element, instead proceeding directly to the next one. I tried using PetscBinarySeek() for this, as shown in the attached code. This produces segmentation faults, suggesting that the file pointer isn't going where I want it to. > There is header information you also have to skip for each object. > > We can go over the sizes for that (it is best just to look at the code), but that is fragile. A more robust way to achieve this random > access is to use HDF5 and name the objects. > > Thanks, > > Matt > Any suggestions as to what I'm doing wrong here? > > Best regards, > Peder > > > > Peder J?rgensgaard Olesen > PhD student > Department of Mechanical Engineering > > pjool at mek.dtu.dk > Koppels All? > Building 403, room 105 > 2800 Kgs. Lyngby > www.dtu.dk/english > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From pjool at mek.dtu.dk Mon Oct 4 10:27:15 2021 From: pjool at mek.dtu.dk (=?iso-8859-1?Q?Peder_J=F8rgensgaard_Olesen?=) Date: Mon, 4 Oct 2021 15:27:15 +0000 Subject: [petsc-users] Skipping data when reading from binary file In-Reply-To: <759DAB32-2D48-47EE-B9F6-64C53701E885@petsc.dev> References: <20c73a37aa9d421e9b5023b062043aa0@mek.dtu.dk>, <759DAB32-2D48-47EE-B9F6-64C53701E885@petsc.dev> Message-ID: In theory it should be relatively simple to write up a routine to skip a given number of objects. I'm not getting PetscBinarySeek() to work even on a header-less array of integers, however. I suppose that > PetscBinarySeek(fd, PETSC_BINARY_INT_SIZE*array_size, PETSC_BINARY_SEEK_CUR, NULL); ought to do the trick, but that does not seem to be the case. Am I somehow using this routine incorrectly? Peder ________________________________ Fra: Barry Smith Sendt: 4. oktober 2021 16:38:10 Til: Peder J?rgensgaard Olesen Cc: petsc-users at mcs.anl.gov Emne: Re: [petsc-users] Skipping data when reading from binary file To minimize code changes you could add a PETSc viewer format that caused skipping reading in an object. Then each object load would need a skip-read method that mimicked the reading but actually just skipped over the parts of the data of the object (using the correct sizes). For vectors this is trivial since you just skip the single known array. For sparse matrices it is not difficult but you will need to read in the number of nonzeros so you know how much to skip etc. Barry On Oct 4, 2021, at 9:58 AM, Peder J?rgensgaard Olesen via petsc-users > wrote: Thank you for your quick reply. I've had to change away from HDF5 to Binary format at an earlier stage in my work due to the former not working well with what I needed, so I would prefer to stick with the binary format. 
I had a quick view at the code for some of the Viewer-routines, but I'm not well versed in gleaning information about header sizes from that. Hints about what I'm looking for there would be appreciated. Best regards Peder ________________________________ Fra: Matthew Knepley > Sendt: 4. oktober 2021 15:37:08 Til: Peder J?rgensgaard Olesen Cc: petsc-users at mcs.anl.gov Emne: Re: [petsc-users] Skipping data when reading from binary file On Mon, Oct 4, 2021 at 9:26 AM Peder J?rgensgaard Olesen via petsc-users > wrote: Hello I have a binary file in which a mix of different objects is stored (Vecs, Mats, PetscInts). I can read each element just fine using VecLoad, MatLoad, and PetscIntView, provided they're read in the order in which they were put in the binary. What I would like to do is to skip the reading of any unneeded element, instead proceeding directly to the next one. I tried using PetscBinarySeek() for this, as shown in the attached code. This produces segmentation faults, suggesting that the file pointer isn't going where I want it to. There is header information you also have to skip for each object. We can go over the sizes for that (it is best just to look at the code), but that is fragile. A more robust way to achieve this random access is to use HDF5 and name the objects. Thanks, Matt Any suggestions as to what I'm doing wrong here? Best regards, Peder [http://www.dtu.dk/-/media/DTU_Generelt/Andet/mail-signature-logo.png] Peder J?rgensgaard Olesen PhD student Department of Mechanical Engineering pjool at mek.dtu.dk Koppels All? Building 403, room 105 2800 Kgs. Lyngby www.dtu.dk/english -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Mon Oct 4 11:05:57 2021 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 4 Oct 2021 12:05:57 -0400 Subject: [petsc-users] Skipping data when reading from binary file In-Reply-To: References: <20c73a37aa9d421e9b5023b062043aa0@mek.dtu.dk> <759DAB32-2D48-47EE-B9F6-64C53701E885@petsc.dev> Message-ID: On Mon, Oct 4, 2021 at 11:27 AM Peder J?rgensgaard Olesen via petsc-users < petsc-users at mcs.anl.gov> wrote: > In theory it should be relatively simple to write up a routine to skip a > given number of objects. I'm not getting PetscBinarySeek() to work even on > a header-less array of integers, however. I suppose that > > > PetscBinarySeek(fd, PETSC_BINARY_INT_SIZE*array_size, > PETSC_BINARY_SEEK_CUR, NULL); > > The VecView code is here: https://gitlab.com/petsc/petsc/-/blob/main/src/vec/vec/utils/vecio.c#L32 you can see that it writes 2 integers and then the scalar data. So, first you skip the 2 integers, and then you skip the array_size scalars (so you want PETSC_BINARY_SCALAR_SIZE). Thanks, Matt > ought to do the trick, but that does not seem to be the case. > > > Am I somehow using this routine incorrectly? > > > Peder > ------------------------------ > *Fra:* Barry Smith > *Sendt:* 4. oktober 2021 16:38:10 > *Til:* Peder J?rgensgaard Olesen > *Cc:* petsc-users at mcs.anl.gov > *Emne:* Re: [petsc-users] Skipping data when reading from binary file > > > To minimize code changes you could add a PETSc viewer format that caused > skipping reading in an object. 
Then each object load would need a skip-read > method that mimicked the reading but actually just skipped over the parts > of the data of the object (using the correct sizes). For vectors this is > trivial since you just skip the single known array. For sparse matrices it > is not difficult but you will need to read in the number of nonzeros so you > know how much to skip etc. > > Barry > > > On Oct 4, 2021, at 9:58 AM, Peder J?rgensgaard Olesen via petsc-users < > petsc-users at mcs.anl.gov> wrote: > > Thank you for your quick reply. > > > I've had to change away from HDF5 to Binary format at an earlier stage in > my work due to the former not working well with what I needed, so I would > prefer to stick with the binary format. > > I had a quick view at the code for some of the Viewer-routines, but I'm > not well versed in gleaning information about header sizes from that. Hints > about what I'm looking for there would be appreciated. > > > Best regards > > Peder > ------------------------------ > *Fra:* Matthew Knepley > *Sendt:* 4. oktober 2021 15:37:08 > *Til:* Peder J?rgensgaard Olesen > *Cc:* petsc-users at mcs.anl.gov > *Emne:* Re: [petsc-users] Skipping data when reading from binary file > > On Mon, Oct 4, 2021 at 9:26 AM Peder J?rgensgaard Olesen via petsc-users < > petsc-users at mcs.anl.gov> wrote: > >> Hello >> >> I have a binary file in which a mix of different objects is stored (Vecs, >> Mats, PetscInts). I can read each element just fine using VecLoad, MatLoad, >> and PetscIntView, provided they're read in the order in which they were put >> in the binary. What I would like to do is to skip the reading of any >> unneeded element, instead proceeding directly to the next one. I tried >> using PetscBinarySeek() for this, as shown in the attached code. This >> produces segmentation faults, suggesting that the file pointer isn't going >> where I want it to. >> > There is header information you also have to skip for each object. > > We can go over the sizes for that (it is best just to look at the code), > but that is fragile. A more robust way to achieve this random > access is to use HDF5 and name the objects. > > Thanks, > > Matt > >> Any suggestions as to what I'm doing wrong here? >> >> >> Best regards, >> >> Peder >> >> >> >> Peder J?rgensgaard Olesen >> PhD student >> Department of Mechanical Engineering >> >> pjool at mek.dtu.dk >> Koppels All? >> Building 403, room 105 >> 2800 Kgs. Lyngby >> www.dtu.dk/english >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From pjool at mek.dtu.dk Mon Oct 4 11:19:31 2021 From: pjool at mek.dtu.dk (=?iso-8859-1?Q?Peder_J=F8rgensgaard_Olesen?=) Date: Mon, 4 Oct 2021 16:19:31 +0000 Subject: [petsc-users] Skipping data when reading from binary file In-Reply-To: References: <20c73a37aa9d421e9b5023b062043aa0@mek.dtu.dk> <759DAB32-2D48-47EE-B9F6-64C53701E885@petsc.dev> , Message-ID: I believe figured out how to get it to work, though it seems that in order to skip two integers I must use PETSC_BINARY_INT_SIZE*4; in other words, an integer appears to take twice the number of bytes given by PETSC_BINARY_INT_SIZE. Not sure what could be causing this. Also, I had somehow led myself to believe that the output parameter of the seek routine could be set to NULL. Writing output to the null pointer didn't produce the desired effects, much to the astonishment of no one whatsoever. -Peder ________________________________ Fra: Matthew Knepley Sendt: 4. oktober 2021 18:05:57 Til: Peder J?rgensgaard Olesen Cc: Barry Smith; petsc-users at mcs.anl.gov Emne: Re: [petsc-users] Skipping data when reading from binary file On Mon, Oct 4, 2021 at 11:27 AM Peder J?rgensgaard Olesen via petsc-users > wrote: In theory it should be relatively simple to write up a routine to skip a given number of objects. I'm not getting PetscBinarySeek() to work even on a header-less array of integers, however. I suppose that > PetscBinarySeek(fd, PETSC_BINARY_INT_SIZE*array_size, PETSC_BINARY_SEEK_CUR, NULL); The VecView code is here: https://gitlab.com/petsc/petsc/-/blob/main/src/vec/vec/utils/vecio.c#L32 you can see that it writes 2 integers and then the scalar data. So, first you skip the 2 integers, and then you skip the array_size scalars (so you want PETSC_BINARY_SCALAR_SIZE). Thanks, Matt ought to do the trick, but that does not seem to be the case. Am I somehow using this routine incorrectly? Peder ________________________________ Fra: Barry Smith > Sendt: 4. oktober 2021 16:38:10 Til: Peder J?rgensgaard Olesen Cc: petsc-users at mcs.anl.gov Emne: Re: [petsc-users] Skipping data when reading from binary file To minimize code changes you could add a PETSc viewer format that caused skipping reading in an object. Then each object load would need a skip-read method that mimicked the reading but actually just skipped over the parts of the data of the object (using the correct sizes). For vectors this is trivial since you just skip the single known array. For sparse matrices it is not difficult but you will need to read in the number of nonzeros so you know how much to skip etc. Barry On Oct 4, 2021, at 9:58 AM, Peder J?rgensgaard Olesen via petsc-users > wrote: Thank you for your quick reply. I've had to change away from HDF5 to Binary format at an earlier stage in my work due to the former not working well with what I needed, so I would prefer to stick with the binary format. I had a quick view at the code for some of the Viewer-routines, but I'm not well versed in gleaning information about header sizes from that. Hints about what I'm looking for there would be appreciated. Best regards Peder ________________________________ Fra: Matthew Knepley > Sendt: 4. oktober 2021 15:37:08 Til: Peder J?rgensgaard Olesen Cc: petsc-users at mcs.anl.gov Emne: Re: [petsc-users] Skipping data when reading from binary file On Mon, Oct 4, 2021 at 9:26 AM Peder J?rgensgaard Olesen via petsc-users > wrote: Hello I have a binary file in which a mix of different objects is stored (Vecs, Mats, PetscInts). 
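For concreteness, a minimal sketch of the skip described in this thread might look as follows. It assumes the descriptor fd comes from PetscViewerBinaryGetDescriptor() and that the vector length n is known in advance; viewer, n and header_bytes are illustrative names, not anything from skip_mwe.c. As found above, the byte count of the two header integers can depend on the build (PETSC_BINARY_INT_SIZE*4 was what worked here), and the last argument of PetscBinarySeek() must point to a valid off_t rather than NULL.

    #include <petscviewer.h>

    off_t  offset;
    int    fd;
    /* nominally 2*PETSC_BINARY_INT_SIZE for the class id and the length;
       PETSC_BINARY_INT_SIZE*4 was what worked above, so verify against the build */
    off_t  header_bytes = 2*PETSC_BINARY_INT_SIZE;

    PetscViewerBinaryGetDescriptor(viewer, &fd);
    PetscBinarySeek(fd, header_bytes, PETSC_BINARY_SEEK_CUR, &offset);                      /* skip the Vec header */
    PetscBinarySeek(fd, (off_t)n*PETSC_BINARY_SCALAR_SIZE, PETSC_BINARY_SEEK_CUR, &offset); /* skip the scalar data */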
I can read each element just fine using VecLoad, MatLoad, and PetscIntView, provided they're read in the order in which they were put in the binary. What I would like to do is to skip the reading of any unneeded element, instead proceeding directly to the next one. I tried using PetscBinarySeek() for this, as shown in the attached code. This produces segmentation faults, suggesting that the file pointer isn't going where I want it to. There is header information you also have to skip for each object. We can go over the sizes for that (it is best just to look at the code), but that is fragile. A more robust way to achieve this random access is to use HDF5 and name the objects. Thanks, Matt Any suggestions as to what I'm doing wrong here? Best regards, Peder [http://www.dtu.dk/-/media/DTU_Generelt/Andet/mail-signature-logo.png] Peder J?rgensgaard Olesen PhD student Department of Mechanical Engineering pjool at mek.dtu.dk Koppels All? Building 403, room 105 2800 Kgs. Lyngby www.dtu.dk/english -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From varunhiremath at gmail.com Tue Oct 5 01:04:19 2021 From: varunhiremath at gmail.com (Varun Hiremath) Date: Mon, 4 Oct 2021 23:04:19 -0700 Subject: [petsc-users] SLEPc: smallest eigenvalues In-Reply-To: References: <179BDB69-1EC0-4334-A964-ABE29E33EFF8@dsic.upv.es> <5B1750B3-E05F-45D7-929B-A5CF816B4A75@dsic.upv.es> <7031EC8B-A238-45AD-B4C2-FA8988022864@dsic.upv.es> <6B968AE2-8325-4E20-B94A-16ECDD0FBA90@dsic.upv.es> <4BB88AB3-410E-493C-9161-97775747936D@dsic.upv.es> <32B34038-7E1A-42CA-A55D-9AF9D41D1697@dsic.upv.es> <4FC17DE7-B910-43D8-9EC5-816285FD52F4@dsic.upv.es> Message-ID: Hi Jose, I have now gotten the quadratic problem working decently using the PEP package with appropriate scaling and preconditioning, so thanks for all the suggestions! For the case where K is a shell matrix, I used a scaling based on an approximation of K, and that seems to be working well. So now that both linear and quadratic problems are working, I wanted to get your suggestions on solving a non-linear problem. In some of our cases, we have a non-linear source term S(lambda) on the right-hand side of the equation as follows: (K + lambda*C + lambda^2*M)*x = S(lambda)*x, where the source can sometimes be simplified as S(lambda) = exp(lambda*t)*A, where A is a constant matrix. I am currently solving this non-linear problem iteratively. For each eigenvalue, I compute the source and add it into the K matrix, and then iterate until convergence. For this reason, I end up solving the system multiple times which makes it very slow. I saw some examples of non-linear problems included in the NEP package. I just wanted to get your thoughts if I would benefit from using the NEP package for this particular problem? Will I be able to use preconditioning and scaling as with the PEP package to speed up the computation for the case where K is a shell matrix? Thanks for your help. Regards, Varun On Thu, Sep 30, 2021 at 10:12 PM Varun Hiremath wrote: > Hi Jose, > > Thanks again for your valuable suggestions. 
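For readers following the quadratic discussion above, a minimal sketch of a PEP setup along the lines suggested earlier, with STSetPreconditionerMat() supplying an approximation of Q(sigma), might look like this. K, C, M and the assembled approximation Qsigma are assumed to exist already; all names are illustrative rather than taken from the code being discussed.

    #include <slepcpep.h>

    PEP pep;
    ST  st;
    Mat A[3];

    PEPCreate(PETSC_COMM_WORLD, &pep);
    A[0] = K; A[1] = C; A[2] = M;            /* Q(lambda) = K + lambda*C + lambda^2*M */
    PEPSetOperators(pep, 3, A);
    PEPSetTarget(pep, 0.0);
    PEPSetWhichEigenpairs(pep, PEP_TARGET_MAGNITUDE);
    PEPGetST(pep, &st);
    STSetType(st, STSINVERT);
    STSetPreconditionerMat(st, Qsigma);      /* approximation of Q(sigma) at the target */
    PEPSetFromOptions(pep);                  /* scaling can be requested with e.g. -pep_scale scalar */
    PEPSolve(pep);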
I am still working on this but > wanted to give you a quick update. > > For the linear problem, I tried different KSP solvers, and finally, I'm > getting good convergence using CGS with LU (using MUMPS) inexact inverse. > So thank you very much for your help! > > But for the quadratic problem, I'm still struggling. As you suggested, I > have now started using the PEP solver. For the simple case where the K > matrix is explicitly known, everything works fine. But for the case where K > is a shell matrix, it struggles to converge. I am yet to try the scaling > option and some other preconditioning options. I will get back to you on > this if I have any questions. Appreciate your help! > > Thanks, > Varun > > On Tue, Sep 28, 2021 at 8:09 AM Jose E. Roman wrote: > >> >> >> > El 28 sept 2021, a las 7:50, Varun Hiremath >> escribi?: >> > >> > Hi Jose, >> > >> > I implemented the LU factorized preconditioner and tested it using >> PREONLY + LU, but that actually is converging to the wrong eigenvalues, >> compared to just using BICGS + BJACOBI, or simply computing >> EPS_SMALLEST_MAGNITUDE without any preconditioning. My preconditioning >> matrix is only a 1st order approximation, and the off-diagonal terms are >> not very accurate, so I'm guessing this is why the LU factorization doesn't >> help much? Nonetheless, using BICGS + BJACOBI with slightly relaxed >> tolerances seems to be working fine. >> >> If your PCMAT is not an exact inverse, then you have to iterate, i.e. not >> use KSPPREONLY but KSPBCGS or another. >> >> > >> > I now want to test the same preconditioning idea for a quadratic >> problem. I am solving a quadratic equation similar to Eqn.(5.1) in the >> SLEPc manual: >> > (K + lambda*C + lambda^2*M)*x = 0, >> > I don't use the PEP package directly, but solve this by linearizing >> similar to Eqn.(5.3) and calling EPS. Without explicitly forming the full >> matrix, I just use the block matrix structure as explained in the below >> example and that works nicely for my case: >> > https://slepc.upv.es/documentation/current/src/eps/tutorials/ex9.c.html >> >> Using PEP is generally recommended. The default solver TOAR is >> memory-efficient and performs less computation than a trivial >> linearization. In addition, PEP allows you to do scaling, which is often >> very important to get accurate results in some problems, depending on >> conditioning. >> >> In your case K is a shell matrix, so things may not be trivial. If I am >> not wrong, you should be able to use STSetPreconditionerMat() for a PEP, >> where the preconditioner in this case should be built to approximate >> Q(sigma), where Q(.) is the quadratic polynomial and sigma is the target. >> >> > >> > In my case, K is not explicitly known, and for linear problems, where C >> = 0, I am using a 1st order approximation of K as the preconditioner. Now >> could you please tell me if there is a way to conveniently set the >> preconditioner for the quadratic problem, which will be of the form [-K 0; >> 0 I]? Note that K is constructed in parallel (the rows are distributed), so >> I wasn't sure how to construct this preconditioner matrix which will be >> compatible with the shell matrix structure that I'm using to define the >> MatMult function as in ex9. >> >> The shell matrix of ex9.c interleaves the local parts of the first block >> and the second block. In other words, a process' local part consists of the >> local rows of the first block followed by the local rows of the second >> block. 
In your case, the local rows of K followed by the local rows of the >> identity (appropriately padded with zeros). >> >> Jose >> >> >> > >> > Thanks, >> > Varun >> > >> > On Fri, Sep 24, 2021 at 11:50 PM Varun Hiremath < >> varunhiremath at gmail.com> wrote: >> > Ok, great! I will give that a try, thanks for your help! >> > >> > On Fri, Sep 24, 2021 at 11:12 PM Jose E. Roman >> wrote: >> > Yes, you can use PCMAT >> https://petsc.org/release/docs/manualpages/PC/PCMAT.html then pass a >> preconditioner matrix that performs the inverse via a shell matrix. >> > >> > > El 25 sept 2021, a las 8:07, Varun Hiremath >> escribi?: >> > > >> > > Hi Jose, >> > > >> > > Thanks for checking my code and providing suggestions. >> > > >> > > In my particular case, I don't know the matrix A explicitly, I >> compute A*x in a matrix-free way within a shell matrix, so I can't use any >> of the direct factorization methods. But just a question regarding your >> suggestion to compute a (parallel) LU factorization. In our work, we do use >> MUMPS to compute the parallel factorization. For solving the generalized >> problem, A*x = lambda*B*x, we are computing inv(B)*A*x within a shell >> matrix, where factorization of B is computed using MUMPS. (We don't call >> MUMPS through SLEPc as we have our own MPI wrapper and other user settings >> to handle.) >> > > >> > > So for the preconditioning, instead of using the iterative solvers, >> can I provide a shell matrix that computes inv(P)*x corrections (where P is >> the preconditioner matrix) using MUMPS direct solver? >> > > >> > > And yes, thanks, #define PETSC_USE_COMPLEX 1 is not needed, it works >> without it. >> > > >> > > Regards, >> > > Varun >> > > >> > > On Fri, Sep 24, 2021 at 9:14 AM Jose E. Roman >> wrote: >> > > If you do >> > > $ ./acoustic_matrix_test.o -shell 0 -st_type sinvert -deflate 1 >> > > then it is using an LU factorization (the default), which is fast. >> > > >> > > Use -eps_view to see which solver settings are you using. >> > > >> > > BiCGStab with block Jacobi does not work for you matrix, it exceeds >> the maximum 10000 iterations. So this is not viable unless you can find a >> better preconditioner for your problem. If not, just using >> EPS_SMALLEST_MAGNITUDE will be faster. >> > > >> > > Computing smallest magnitude eigenvalues is a difficult task. The >> most robust way is to compute a (parallel) LU factorization if you can >> afford it. >> > > >> > > >> > > A side note: don't add this to your source code >> > > #define PETSC_USE_COMPLEX 1 >> > > This define is taken from PETSc's include files, you should not mess >> with it. Instead, you probably want to add something like this AFTER >> #include : >> > > #if !defined(PETSC_USE_COMPLEX) >> > > #error "Requires complex scalars" >> > > #endif >> > > >> > > Jose >> > > >> > > >> > > > El 22 sept 2021, a las 19:38, Varun Hiremath < >> varunhiremath at gmail.com> escribi?: >> > > > >> > > > Hi Jose, >> > > > >> > > > Thank you, that explains it and my example code works now without >> specifying "-eps_target 0" in the command line. >> > > > >> > > > However, both the Krylov inexact shift-invert and JD solvers are >> struggling to converge for some of my actual problems. The issue seems to >> be related to non-symmetric general matrices. I have extracted one such >> matrix attached here as MatA.gz (size 100k), and have also included a short >> program that loads this matrix and then computes the smallest eigenvalues >> as I described earlier. 
>> > > > >> > > > For this matrix, if I compute the eigenvalues directly (without >> using the shell matrix) using shift-and-invert (as below) then it converges >> in less than a minute. >> > > > $ ./acoustic_matrix_test.o -shell 0 -st_type sinvert -deflate 1 >> > > > >> > > > However, if I use the shell matrix and use any of the >> preconditioned solvers JD or Krylov shift-invert (as shown below) with the >> same matrix as the preconditioner, then they struggle to converge. >> > > > $ ./acoustic_matrix_test.o -usejd 1 -deflate 1 >> > > > $ ./acoustic_matrix_test.o -sinvert 1 -deflate 1 >> > > > >> > > > Could you please check the attached code and suggest any changes in >> settings that might help with convergence for these kinds of matrices? I >> appreciate your help! >> > > > >> > > > Thanks, >> > > > Varun >> > > > >> > > > On Tue, Sep 21, 2021 at 11:14 AM Jose E. Roman >> wrote: >> > > > I will have a look at your code when I have more time. Meanwhile, I >> am answering 3) below... >> > > > >> > > > > El 21 sept 2021, a las 0:23, Varun Hiremath < >> varunhiremath at gmail.com> escribi?: >> > > > > >> > > > > Hi Jose, >> > > > > >> > > > > Sorry, it took me a while to test these settings in the new >> builds. I am getting good improvement in performance using the >> preconditioned solvers, so thanks for the suggestions! But I have some >> questions related to the usage. >> > > > > >> > > > > We are using SLEPc to solve the acoustic modal eigenvalue >> problem. Attached is a simple standalone program that computes acoustic >> modes in a simple rectangular box. This program illustrates the general >> setup I am using, though here the shell matrix and the preconditioner >> matrix are the same, while in my actual program the shell matrix computes >> A*x without explicitly forming A, and the preconditioner is a 0th order >> approximation of A. >> > > > > >> > > > > In the attached program I have tested both >> > > > > 1) the Krylov-Schur with inexact shift-and-invert (implemented >> under the option sinvert); >> > > > > 2) the JD solver with preconditioner (implemented under the >> option usejd) >> > > > > >> > > > > Both the solvers seem to work decently, compared to no >> preconditioning. This is how I run the two solvers (for a mesh size of >> 1600x400): >> > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -usejd 1 -deflate 1 >> -eps_target 0 >> > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -sinvert 1 -deflate 1 >> -eps_target 0 >> > > > > Both finish in about ~10 minutes on my system in serial. JD seems >> to be slightly faster and more accurate (for the imaginary part of >> eigenvalue). >> > > > > The program also runs in parallel using mpiexec. I use complex >> builds, as in my main program the matrix can be complex. >> > > > > >> > > > > Now here are my questions: >> > > > > 1) For this particular problem type, could you please check if >> these are the best settings that one could use? I have tried different >> combinations of KSP/PC types e.g. GMRES, GAMG, etc, but BCGSL + BJACOBI >> seems to work the best in serial and parallel. >> > > > > >> > > > > 2) When I tested these settings in my main program, for some >> reason the JD solver was not converging. After further testing, I found the >> issue was related to the setting of "-eps_target 0". I have included >> "EPSSetTarget(eps,0.0);" in the program and I assumed this is equivalent to >> passing "-eps_target 0" from the command line, but that doesn't seem to be >> the case. 
For instance, if I run the attached program without "-eps_target >> 0" in the command line then it doesn't converge. >> > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -usejd 1 -deflate 1 >> -eps_target 0 >> > > > > the above finishes in about 10 minutes >> > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -usejd 1 -deflate 1 >> > > > > the above doesn't converge even though "EPSSetTarget(eps,0.0);" >> is included in the code >> > > > > >> > > > > This only seems to affect the JD solver, not the Krylov >> shift-and-invert (-sinvert 1) option. So is there any difference between >> passing "-eps_target 0" from the command line vs using >> "EPSSetTarget(eps,0.0);" in the code? I cannot pass any command line >> arguments in my actual program, so need to set everything internally. >> > > > > >> > > > > 3) Also, another minor related issue. While using the inexact >> shift-and-invert option, I was running into the following error: >> > > > > >> > > > > "" >> > > > > Missing or incorrect user input >> > > > > Shift-and-invert requires a target 'which' (see >> EPSSetWhichEigenpairs), for instance -st_type sinvert -eps_target 0 >> -eps_target_magnitude >> > > > > "" >> > > > > >> > > > > I already have the below two lines in the code: >> > > > > EPSSetWhichEigenpairs(eps,EPS_SMALLEST_MAGNITUDE); >> > > > > EPSSetTarget(eps,0.0); >> > > > > >> > > > > so shouldn't these be enough? If I comment out the first line >> "EPSSetWhichEigenpairs", then the code works fine. >> > > > >> > > > You should either do >> > > > >> > > > EPSSetWhichEigenpairs(eps,EPS_SMALLEST_MAGNITUDE); >> > > > >> > > > without shift-and-invert or >> > > > >> > > > EPSSetWhichEigenpairs(eps,EPS_TARGET_MAGNITUDE); >> > > > EPSSetTarget(eps,0.0); >> > > > >> > > > with shift-and-invert. The latter can also be used without >> shift-and-invert (e.g. in JD). >> > > > >> > > > I have to check, but a possible explanation why in your comment >> above (2) the command-line option -eps_target 0 works differently is that >> it also sets -eps_target_magnitude if omitted, so to be equivalent in >> source code you have to call both >> > > > EPSSetWhichEigenpairs(eps,EPS_TARGET_MAGNITUDE); >> > > > EPSSetTarget(eps,0.0); >> > > > >> > > > Jose >> > > > >> > > > > I have some more questions regarding setting the preconditioner >> for a quadratic eigenvalue problem, which I will ask in a follow-up email. >> > > > > >> > > > > Thanks for your help! >> > > > > >> > > > > -Varun >> > > > > >> > > > > >> > > > > On Thu, Jul 1, 2021 at 5:01 AM Varun Hiremath < >> varunhiremath at gmail.com> wrote: >> > > > > Thank you very much for these suggestions! We are currently using >> version 3.12, so I'll try to update to the latest version and try your >> suggestions. Let me get back to you, thanks! >> > > > > >> > > > > On Thu, Jul 1, 2021, 4:45 AM Jose E. Roman >> wrote: >> > > > > Then I would try Davidson methods https://doi.org/10.1145/2543696 >> > > > > You can also try Krylov-Schur with "inexact" shift-and-invert, >> for instance, with preconditioned BiCGStab or GMRES, see section 3.4.1 of >> the users manual. >> > > > > >> > > > > In both cases, you have to pass matrix A in the call to >> EPSSetOperators() and the preconditioner matrix via >> STSetPreconditionerMat() - note this function was introduced in version >> 3.15. >> > > > > >> > > > > Jose >> > > > > >> > > > > >> > > > > >> > > > > > El 1 jul 2021, a las 13:36, Varun Hiremath < >> varunhiremath at gmail.com> escribi?: >> > > > > > >> > > > > > Thanks. 
I actually do have a 1st order approximation of matrix >> A, that I can explicitly compute and also invert. Can I use that matrix as >> preconditioner to speed things up? Is there some example that explains how >> to setup and call SLEPc for this scenario? >> > > > > > >> > > > > > On Thu, Jul 1, 2021, 4:29 AM Jose E. Roman >> wrote: >> > > > > > For smallest real parts one could adapt ex34.c, but it is going >> to be costly >> https://slepc.upv.es/documentation/current/src/eps/tutorials/ex36.c.html >> > > > > > Also, if eigenvalues are clustered around the origin, >> convergence may still be very slow. >> > > > > > >> > > > > > It is a tough problem, unless you are able to compute a good >> preconditioner of A (no need to compute the exact inverse). >> > > > > > >> > > > > > Jose >> > > > > > >> > > > > > >> > > > > > > El 1 jul 2021, a las 13:23, Varun Hiremath < >> varunhiremath at gmail.com> escribi?: >> > > > > > > >> > > > > > > I'm solving for the smallest eigenvalues in magnitude. Though >> is it cheaper to solve smallest in real part, as that might also work in my >> case? Thanks for your help. >> > > > > > > >> > > > > > > On Thu, Jul 1, 2021, 4:08 AM Jose E. Roman < >> jroman at dsic.upv.es> wrote: >> > > > > > > Smallest eigenvalue in magnitude or real part? >> > > > > > > >> > > > > > > >> > > > > > > > El 1 jul 2021, a las 11:58, Varun Hiremath < >> varunhiremath at gmail.com> escribi?: >> > > > > > > > >> > > > > > > > Sorry, no both A and B are general sparse matrices >> (non-hermitian). So is there anything else I could try? >> > > > > > > > >> > > > > > > > On Thu, Jul 1, 2021 at 2:43 AM Jose E. Roman < >> jroman at dsic.upv.es> wrote: >> > > > > > > > Is the problem symmetric (GHEP)? In that case, you can try >> LOBPCG on the pair (A,B). But this will likely be slow as well, unless you >> can provide a good preconditioner. >> > > > > > > > >> > > > > > > > Jose >> > > > > > > > >> > > > > > > > >> > > > > > > > > El 1 jul 2021, a las 11:37, Varun Hiremath < >> varunhiremath at gmail.com> escribi?: >> > > > > > > > > >> > > > > > > > > Hi All, >> > > > > > > > > >> > > > > > > > > I am trying to compute the smallest eigenvalues of a >> generalized system A*x= lambda*B*x. I don't explicitly know the matrix A >> (so I am using a shell matrix with a custom matmult function) however, the >> matrix B is explicitly known so I compute inv(B)*A within the shell matrix >> and solve inv(B)*A*x = lambda*x. >> > > > > > > > > >> > > > > > > > > To compute the smallest eigenvalues it is recommended to >> solve the inverted system, but since matrix A is not explicitly known I >> can't invert the system. Moreover, the size of the system can be really >> big, and with the default Krylov solver, it is extremely slow. So is there >> a better way for me to compute the smallest eigenvalues of this system? >> > > > > > > > > >> > > > > > > > > Thanks, >> > > > > > > > > Varun >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > >> > > > >> > > > >> > > >> > >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From jroman at dsic.upv.es Tue Oct 5 02:55:05 2021 From: jroman at dsic.upv.es (Jose E. 
Roman) Date: Tue, 5 Oct 2021 09:55:05 +0200 Subject: [petsc-users] SLEPc: smallest eigenvalues In-Reply-To: References: <179BDB69-1EC0-4334-A964-ABE29E33EFF8@dsic.upv.es> <5B1750B3-E05F-45D7-929B-A5CF816B4A75@dsic.upv.es> <7031EC8B-A238-45AD-B4C2-FA8988022864@dsic.upv.es> <6B968AE2-8325-4E20-B94A-16ECDD0FBA90@dsic.upv.es> <4BB88AB3-410E-493C-9161-97775747936D@dsic.upv.es> <32B34038-7E1A-42CA-A55D-9AF9D41D1697@dsic.upv.es> <4FC17DE7-B910-43D8-9EC5-816285FD52F4@dsic.upv.es> Message-ID: Nonlinear eigenvalue problems can still be considered a research topic. The NEP package is more or less "finished", cf. https://doi.org/10.1145/3447544 , but your use case may require changes. I would suggest that you write another email to me (not the list) and we can discuss the details. Jose > El 5 oct 2021, a las 8:04, Varun Hiremath escribi?: > > Hi Jose, > > I have now gotten the quadratic problem working decently using the PEP package with appropriate scaling and preconditioning, so thanks for all the suggestions! For the case where K is a shell matrix, I used a scaling based on an approximation of K, and that seems to be working well. > > So now that both linear and quadratic problems are working, I wanted to get your suggestions on solving a non-linear problem. In some of our cases, we have a non-linear source term S(lambda) on the right-hand side of the equation as follows: > (K + lambda*C + lambda^2*M)*x = S(lambda)*x, > where the source can sometimes be simplified as S(lambda) = exp(lambda*t)*A, where A is a constant matrix. > > I am currently solving this non-linear problem iteratively. For each eigenvalue, I compute the source and add it into the K matrix, and then iterate until convergence. For this reason, I end up solving the system multiple times which makes it very slow. I saw some examples of non-linear problems included in the NEP package. I just wanted to get your thoughts if I would benefit from using the NEP package for this particular problem? Will I be able to use preconditioning and scaling as with the PEP package to speed up the computation for the case where K is a shell matrix? Thanks for your help. > > Regards, > Varun > > > On Thu, Sep 30, 2021 at 10:12 PM Varun Hiremath wrote: > Hi Jose, > > Thanks again for your valuable suggestions. I am still working on this but wanted to give you a quick update. > > For the linear problem, I tried different KSP solvers, and finally, I'm getting good convergence using CGS with LU (using MUMPS) inexact inverse. So thank you very much for your help! > > But for the quadratic problem, I'm still struggling. As you suggested, I have now started using the PEP solver. For the simple case where the K matrix is explicitly known, everything works fine. But for the case where K is a shell matrix, it struggles to converge. I am yet to try the scaling option and some other preconditioning options. I will get back to you on this if I have any questions. Appreciate your help! > > Thanks, > Varun > > On Tue, Sep 28, 2021 at 8:09 AM Jose E. Roman wrote: > > > > El 28 sept 2021, a las 7:50, Varun Hiremath escribi?: > > > > Hi Jose, > > > > I implemented the LU factorized preconditioner and tested it using PREONLY + LU, but that actually is converging to the wrong eigenvalues, compared to just using BICGS + BJACOBI, or simply computing EPS_SMALLEST_MAGNITUDE without any preconditioning. 
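Returning to the nonlinear case raised above: if the source really does reduce to S(lambda) = exp(lambda*t)*A with constant A and t, one way the problem (K + lambda*C + lambda^2*M - exp(lambda*t)*A)*x = 0 could be expressed for NEP is the split form sketched below. This is only an illustration under that assumption, with K, C, M, A and t taken as given and all names illustrative; whether it pays off when K is a shell matrix is precisely the kind of detail best discussed off-list as suggested.

    #include <slepcnep.h>

    NEP         nep;
    Mat         T[4];
    FN          f[4];
    PetscScalar c0[1] = {1.0};               /* f0(x) = 1   */
    PetscScalar c1[2] = {1.0, 0.0};          /* f1(x) = x   */
    PetscScalar c2[3] = {1.0, 0.0, 0.0};     /* f2(x) = x^2 */
    PetscInt    i;

    NEPCreate(PETSC_COMM_WORLD, &nep);
    for (i = 0; i < 4; i++) FNCreate(PETSC_COMM_WORLD, &f[i]);
    FNSetType(f[0], FNRATIONAL); FNRationalSetNumerator(f[0], 1, c0);
    FNSetType(f[1], FNRATIONAL); FNRationalSetNumerator(f[1], 2, c1);
    FNSetType(f[2], FNRATIONAL); FNRationalSetNumerator(f[2], 3, c2);
    FNSetType(f[3], FNEXP); FNSetScale(f[3], t, -1.0);   /* f3(x) = -exp(t*x) */
    T[0] = K; T[1] = C; T[2] = M; T[3] = A;
    NEPSetSplitOperator(nep, 4, T, f, DIFFERENT_NONZERO_PATTERN);
    NEPSetFromOptions(nep);
    NEPSolve(nep);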
My preconditioning matrix is only a 1st order approximation, and the off-diagonal terms are not very accurate, so I'm guessing this is why the LU factorization doesn't help much? Nonetheless, using BICGS + BJACOBI with slightly relaxed tolerances seems to be working fine. > > If your PCMAT is not an exact inverse, then you have to iterate, i.e. not use KSPPREONLY but KSPBCGS or another. > > > > > I now want to test the same preconditioning idea for a quadratic problem. I am solving a quadratic equation similar to Eqn.(5.1) in the SLEPc manual: > > (K + lambda*C + lambda^2*M)*x = 0, > > I don't use the PEP package directly, but solve this by linearizing similar to Eqn.(5.3) and calling EPS. Without explicitly forming the full matrix, I just use the block matrix structure as explained in the below example and that works nicely for my case: > > https://slepc.upv.es/documentation/current/src/eps/tutorials/ex9.c.html > > Using PEP is generally recommended. The default solver TOAR is memory-efficient and performs less computation than a trivial linearization. In addition, PEP allows you to do scaling, which is often very important to get accurate results in some problems, depending on conditioning. > > In your case K is a shell matrix, so things may not be trivial. If I am not wrong, you should be able to use STSetPreconditionerMat() for a PEP, where the preconditioner in this case should be built to approximate Q(sigma), where Q(.) is the quadratic polynomial and sigma is the target. > > > > > In my case, K is not explicitly known, and for linear problems, where C = 0, I am using a 1st order approximation of K as the preconditioner. Now could you please tell me if there is a way to conveniently set the preconditioner for the quadratic problem, which will be of the form [-K 0; 0 I]? Note that K is constructed in parallel (the rows are distributed), so I wasn't sure how to construct this preconditioner matrix which will be compatible with the shell matrix structure that I'm using to define the MatMult function as in ex9. > > The shell matrix of ex9.c interleaves the local parts of the first block and the second block. In other words, a process' local part consists of the local rows of the first block followed by the local rows of the second block. In your case, the local rows of K followed by the local rows of the identity (appropriately padded with zeros). > > Jose > > > > > > Thanks, > > Varun > > > > On Fri, Sep 24, 2021 at 11:50 PM Varun Hiremath wrote: > > Ok, great! I will give that a try, thanks for your help! > > > > On Fri, Sep 24, 2021 at 11:12 PM Jose E. Roman wrote: > > Yes, you can use PCMAT https://petsc.org/release/docs/manualpages/PC/PCMAT.html then pass a preconditioner matrix that performs the inverse via a shell matrix. > > > > > El 25 sept 2021, a las 8:07, Varun Hiremath escribi?: > > > > > > Hi Jose, > > > > > > Thanks for checking my code and providing suggestions. > > > > > > In my particular case, I don't know the matrix A explicitly, I compute A*x in a matrix-free way within a shell matrix, so I can't use any of the direct factorization methods. But just a question regarding your suggestion to compute a (parallel) LU factorization. In our work, we do use MUMPS to compute the parallel factorization. For solving the generalized problem, A*x = lambda*B*x, we are computing inv(B)*A*x within a shell matrix, where factorization of B is computed using MUMPS. (We don't call MUMPS through SLEPc as we have our own MPI wrapper and other user settings to handle.) 
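To make the PCMAT suggestion above a little more concrete, a rough sketch of a shell matrix that applies inv(P)*x through an existing MUMPS factorization F might look like the following; F, eps, nloc, N and the helper names are all illustrative assumptions, not the actual code under discussion.

    #include <slepceps.h>

    typedef struct { Mat F; } PinvCtx;        /* F holds the MUMPS factors of P */

    static PetscErrorCode PinvMult(Mat M, Vec x, Vec y)
    {
      PinvCtx *ctx;
      MatShellGetContext(M, &ctx);
      MatSolve(ctx->F, x, y);                 /* y = inv(P)*x via the factorization */
      return 0;
    }

    /* later, when configuring the eigensolver */
    PinvCtx ctx = {F};
    Mat     Pinv;
    ST      st;
    KSP     ksp;
    PC      pc;

    MatCreateShell(PETSC_COMM_WORLD, nloc, nloc, N, N, &ctx, &Pinv);
    MatShellSetOperation(Pinv, MATOP_MULT, (void (*)(void))PinvMult);
    EPSGetST(eps, &st);
    STSetPreconditionerMat(st, Pinv);         /* Pinv becomes the Pmat of the ST's KSP */
    STGetKSP(st, &ksp);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCMAT);                     /* PCApply then amounts to MatMult(Pinv, x, y) */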
> > > > > > So for the preconditioning, instead of using the iterative solvers, can I provide a shell matrix that computes inv(P)*x corrections (where P is the preconditioner matrix) using MUMPS direct solver? > > > > > > And yes, thanks, #define PETSC_USE_COMPLEX 1 is not needed, it works without it. > > > > > > Regards, > > > Varun > > > > > > On Fri, Sep 24, 2021 at 9:14 AM Jose E. Roman wrote: > > > If you do > > > $ ./acoustic_matrix_test.o -shell 0 -st_type sinvert -deflate 1 > > > then it is using an LU factorization (the default), which is fast. > > > > > > Use -eps_view to see which solver settings are you using. > > > > > > BiCGStab with block Jacobi does not work for you matrix, it exceeds the maximum 10000 iterations. So this is not viable unless you can find a better preconditioner for your problem. If not, just using EPS_SMALLEST_MAGNITUDE will be faster. > > > > > > Computing smallest magnitude eigenvalues is a difficult task. The most robust way is to compute a (parallel) LU factorization if you can afford it. > > > > > > > > > A side note: don't add this to your source code > > > #define PETSC_USE_COMPLEX 1 > > > This define is taken from PETSc's include files, you should not mess with it. Instead, you probably want to add something like this AFTER #include : > > > #if !defined(PETSC_USE_COMPLEX) > > > #error "Requires complex scalars" > > > #endif > > > > > > Jose > > > > > > > > > > El 22 sept 2021, a las 19:38, Varun Hiremath escribi?: > > > > > > > > Hi Jose, > > > > > > > > Thank you, that explains it and my example code works now without specifying "-eps_target 0" in the command line. > > > > > > > > However, both the Krylov inexact shift-invert and JD solvers are struggling to converge for some of my actual problems. The issue seems to be related to non-symmetric general matrices. I have extracted one such matrix attached here as MatA.gz (size 100k), and have also included a short program that loads this matrix and then computes the smallest eigenvalues as I described earlier. > > > > > > > > For this matrix, if I compute the eigenvalues directly (without using the shell matrix) using shift-and-invert (as below) then it converges in less than a minute. > > > > $ ./acoustic_matrix_test.o -shell 0 -st_type sinvert -deflate 1 > > > > > > > > However, if I use the shell matrix and use any of the preconditioned solvers JD or Krylov shift-invert (as shown below) with the same matrix as the preconditioner, then they struggle to converge. > > > > $ ./acoustic_matrix_test.o -usejd 1 -deflate 1 > > > > $ ./acoustic_matrix_test.o -sinvert 1 -deflate 1 > > > > > > > > Could you please check the attached code and suggest any changes in settings that might help with convergence for these kinds of matrices? I appreciate your help! > > > > > > > > Thanks, > > > > Varun > > > > > > > > On Tue, Sep 21, 2021 at 11:14 AM Jose E. Roman wrote: > > > > I will have a look at your code when I have more time. Meanwhile, I am answering 3) below... > > > > > > > > > El 21 sept 2021, a las 0:23, Varun Hiremath escribi?: > > > > > > > > > > Hi Jose, > > > > > > > > > > Sorry, it took me a while to test these settings in the new builds. I am getting good improvement in performance using the preconditioned solvers, so thanks for the suggestions! But I have some questions related to the usage. > > > > > > > > > > We are using SLEPc to solve the acoustic modal eigenvalue problem. Attached is a simple standalone program that computes acoustic modes in a simple rectangular box. 
This program illustrates the general setup I am using, though here the shell matrix and the preconditioner matrix are the same, while in my actual program the shell matrix computes A*x without explicitly forming A, and the preconditioner is a 0th order approximation of A. > > > > > > > > > > In the attached program I have tested both > > > > > 1) the Krylov-Schur with inexact shift-and-invert (implemented under the option sinvert); > > > > > 2) the JD solver with preconditioner (implemented under the option usejd) > > > > > > > > > > Both the solvers seem to work decently, compared to no preconditioning. This is how I run the two solvers (for a mesh size of 1600x400): > > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -usejd 1 -deflate 1 -eps_target 0 > > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -sinvert 1 -deflate 1 -eps_target 0 > > > > > Both finish in about ~10 minutes on my system in serial. JD seems to be slightly faster and more accurate (for the imaginary part of eigenvalue). > > > > > The program also runs in parallel using mpiexec. I use complex builds, as in my main program the matrix can be complex. > > > > > > > > > > Now here are my questions: > > > > > 1) For this particular problem type, could you please check if these are the best settings that one could use? I have tried different combinations of KSP/PC types e.g. GMRES, GAMG, etc, but BCGSL + BJACOBI seems to work the best in serial and parallel. > > > > > > > > > > 2) When I tested these settings in my main program, for some reason the JD solver was not converging. After further testing, I found the issue was related to the setting of "-eps_target 0". I have included "EPSSetTarget(eps,0.0);" in the program and I assumed this is equivalent to passing "-eps_target 0" from the command line, but that doesn't seem to be the case. For instance, if I run the attached program without "-eps_target 0" in the command line then it doesn't converge. > > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -usejd 1 -deflate 1 -eps_target 0 > > > > > the above finishes in about 10 minutes > > > > > $ ./acoustic_box_test.o -nx 1600 -ny 400 -usejd 1 -deflate 1 > > > > > the above doesn't converge even though "EPSSetTarget(eps,0.0);" is included in the code > > > > > > > > > > This only seems to affect the JD solver, not the Krylov shift-and-invert (-sinvert 1) option. So is there any difference between passing "-eps_target 0" from the command line vs using "EPSSetTarget(eps,0.0);" in the code? I cannot pass any command line arguments in my actual program, so need to set everything internally. > > > > > > > > > > 3) Also, another minor related issue. While using the inexact shift-and-invert option, I was running into the following error: > > > > > > > > > > "" > > > > > Missing or incorrect user input > > > > > Shift-and-invert requires a target 'which' (see EPSSetWhichEigenpairs), for instance -st_type sinvert -eps_target 0 -eps_target_magnitude > > > > > "" > > > > > > > > > > I already have the below two lines in the code: > > > > > EPSSetWhichEigenpairs(eps,EPS_SMALLEST_MAGNITUDE); > > > > > EPSSetTarget(eps,0.0); > > > > > > > > > > so shouldn't these be enough? If I comment out the first line "EPSSetWhichEigenpairs", then the code works fine. > > > > > > > > You should either do > > > > > > > > EPSSetWhichEigenpairs(eps,EPS_SMALLEST_MAGNITUDE); > > > > > > > > without shift-and-invert or > > > > > > > > EPSSetWhichEigenpairs(eps,EPS_TARGET_MAGNITUDE); > > > > EPSSetTarget(eps,0.0); > > > > > > > > with shift-and-invert. 
The latter can also be used without shift-and-invert (e.g. in JD). > > > > > > > > I have to check, but a possible explanation why in your comment above (2) the command-line option -eps_target 0 works differently is that it also sets -eps_target_magnitude if omitted, so to be equivalent in source code you have to call both > > > > EPSSetWhichEigenpairs(eps,EPS_TARGET_MAGNITUDE); > > > > EPSSetTarget(eps,0.0); > > > > > > > > Jose > > > > > > > > > I have some more questions regarding setting the preconditioner for a quadratic eigenvalue problem, which I will ask in a follow-up email. > > > > > > > > > > Thanks for your help! > > > > > > > > > > -Varun > > > > > > > > > > > > > > > On Thu, Jul 1, 2021 at 5:01 AM Varun Hiremath wrote: > > > > > Thank you very much for these suggestions! We are currently using version 3.12, so I'll try to update to the latest version and try your suggestions. Let me get back to you, thanks! > > > > > > > > > > On Thu, Jul 1, 2021, 4:45 AM Jose E. Roman wrote: > > > > > Then I would try Davidson methods https://doi.org/10.1145/2543696 > > > > > You can also try Krylov-Schur with "inexact" shift-and-invert, for instance, with preconditioned BiCGStab or GMRES, see section 3.4.1 of the users manual. > > > > > > > > > > In both cases, you have to pass matrix A in the call to EPSSetOperators() and the preconditioner matrix via STSetPreconditionerMat() - note this function was introduced in version 3.15. > > > > > > > > > > Jose > > > > > > > > > > > > > > > > > > > > > El 1 jul 2021, a las 13:36, Varun Hiremath escribi?: > > > > > > > > > > > > Thanks. I actually do have a 1st order approximation of matrix A, that I can explicitly compute and also invert. Can I use that matrix as preconditioner to speed things up? Is there some example that explains how to setup and call SLEPc for this scenario? > > > > > > > > > > > > On Thu, Jul 1, 2021, 4:29 AM Jose E. Roman wrote: > > > > > > For smallest real parts one could adapt ex34.c, but it is going to be costly https://slepc.upv.es/documentation/current/src/eps/tutorials/ex36.c.html > > > > > > Also, if eigenvalues are clustered around the origin, convergence may still be very slow. > > > > > > > > > > > > It is a tough problem, unless you are able to compute a good preconditioner of A (no need to compute the exact inverse). > > > > > > > > > > > > Jose > > > > > > > > > > > > > > > > > > > El 1 jul 2021, a las 13:23, Varun Hiremath escribi?: > > > > > > > > > > > > > > I'm solving for the smallest eigenvalues in magnitude. Though is it cheaper to solve smallest in real part, as that might also work in my case? Thanks for your help. > > > > > > > > > > > > > > On Thu, Jul 1, 2021, 4:08 AM Jose E. Roman wrote: > > > > > > > Smallest eigenvalue in magnitude or real part? > > > > > > > > > > > > > > > > > > > > > > El 1 jul 2021, a las 11:58, Varun Hiremath escribi?: > > > > > > > > > > > > > > > > Sorry, no both A and B are general sparse matrices (non-hermitian). So is there anything else I could try? > > > > > > > > > > > > > > > > On Thu, Jul 1, 2021 at 2:43 AM Jose E. Roman wrote: > > > > > > > > Is the problem symmetric (GHEP)? In that case, you can try LOBPCG on the pair (A,B). But this will likely be slow as well, unless you can provide a good preconditioner. 
> > > > > > > > > > > > > > > > Jose > > > > > > > > > > > > > > > > > > > > > > > > > El 1 jul 2021, a las 11:37, Varun Hiremath escribi?: > > > > > > > > > > > > > > > > > > Hi All, > > > > > > > > > > > > > > > > > > I am trying to compute the smallest eigenvalues of a generalized system A*x= lambda*B*x. I don't explicitly know the matrix A (so I am using a shell matrix with a custom matmult function) however, the matrix B is explicitly known so I compute inv(B)*A within the shell matrix and solve inv(B)*A*x = lambda*x. > > > > > > > > > > > > > > > > > > To compute the smallest eigenvalues it is recommended to solve the inverted system, but since matrix A is not explicitly known I can't invert the system. Moreover, the size of the system can be really big, and with the default Krylov solver, it is extremely slow. So is there a better way for me to compute the smallest eigenvalues of this system? > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Varun > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > From karthikeyan.chockalingam at stfc.ac.uk Tue Oct 5 11:02:50 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Tue, 5 Oct 2021 16:02:50 +0000 Subject: [petsc-users] (percent time in this phase) In-Reply-To: References: <6C41B1AA-ABF0-4BE8-885D-53F22535E032@stfc.ac.uk> Message-ID: Hi Matt, I have a couple of questions; 1. Weather I run on single core or on multiple cores I find that KSPSetUP (ksp/tutorial/ex.45.c) is always called twice. But why? Is setup not done once? 2. I find PCSetUpOnBlocks is calling PCSetup and not the other way around. Can you shed some light? The preconditioner used is block jacobi. Thanks, Karthik. From: Matthew Knepley Date: Sunday, 3 October 2021 at 12:54 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] (percent time in this phase) On Sun, Oct 3, 2021 at 5:43 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Hi Matt, Thank you. The flamegraph tool is helpful. Please find the attached screen shoot and foo.txt which generated that graph (using https://www.speedscope.app). I find the following call sequence from the graph KSPSolve -> PCApply -> KSPSolve -> PCApply -> MatSolve I have a couple of questions 1. The KSPSolve time listed in the file using -log_summary (or -log_view), is it the time taken by the first KSPSolve (in the above call sequence)? Yes. When calls are nested, we just do not record the time for the nested call in log_view. 1. 2. What is the unit of measurement in flamegrah? I believe it is microseconds, but I am not sure. Thanks, Matt 1. Thanks, Karthik. From: Matthew Knepley > Date: Friday, 1 October 2021 at 14:51 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: When comparing the MatSolve data for GPU MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 and CPU MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 the time spent is almost the same for this preconditioner. 
Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % I am getting so old. We have a different kind of log output if you are really concerned about inclusion. You can run with -log_view :foo.txt:ascii_flamegraph and then there are tools for plotting that output, described here https://firedrakeproject.org/optimising.html This output _guarantees_ strict inclusion, so you will not have the problems you have above adding things up. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 16:29 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you! Just to summarize KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? Yes. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 11:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you Mathew. Now, it is all making sense to me. From data file ksp_ex45_N511_gpu_2.txt KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one process and using Block-Jacobi . Half the time is spent in the solve (53%) KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. 
PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 4) The significant vector time is all in norms (11%) since they are really slow on the GPU. VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 So the solve time is: 53% ~ 37% + 4% + 11% and the setup time is about 16%. I was wrong about the SetUp time being included, as it is outside the event: https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 It looks like the remainder of the time (23%) is spent preallocating the matrix. Thanks, Matt The MalMult event is 4 %. How does this event figure into the above equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 10:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI > wrote: That was helpful. I would like to provide some additional details of my run on cpus and gpus. Please find the following attachments: 1. graph.pdf a plot showing overall time and various petsc events. 2. ksp_ex45_N511_cpu_6.txt data file of the log_summary 3. ksp_ex45_N511_gpu_2.txt data file of the log_summary I used the following petsc options for cpu mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor and for gpus mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor to run the following problem https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html From the above code, I see is there no individual function called KSPSetUp(), so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is KSPSetUp time and KSPSolve time mutually exclusive? No, KSPSetUp() will be contained in KSPSolve() if it is called automatically. In your response you said that ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used.? I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for this particular preconditioner (bjacobi) how can I tell how they are timed? They are all inside KSPSolve(). If you have a preconditioned linear solve, the oreconditioning happens during the iteration. So an iteration would mostly consist of MatMult + PCApply, with some vector work. 
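Schematically, one iteration of the solver used here (-ksp_type cg with block Jacobi) maps onto the logged events roughly as in the following sketch. This is an illustrative outline, not the actual PETSc CG source, and the variable names are placeholders.

    MatMult(A,p,Ap);              /* logged under MatMult                                  */
    VecTDot(p,Ap,&pAp);           /* logged under VecTDot                                  */
    alpha = rz/pAp;
    VecAXPY(x, alpha,p);          /* logged under VecAXPY                                  */
    VecAXPY(r,-alpha,Ap);         /* logged under VecAXPY                                  */
    VecNorm(r,NORM_2,&rnorm);     /* logged under VecNorm, used for the convergence test   */
    PCApply(pc,r,z);              /* logged under PCApply; for bjacobi-ILU mostly MatSolve */
    VecTDot(r,z,&rznew);          /* logged under VecTDot                                  */
    beta = rznew/rz; rz = rznew;
    VecAYPX(p,beta,z);            /* logged under VecAYPX                                  */

This is why the KSPSolve percentage is roughly the sum of MatMult, PCApply/MatSolve and the vector events, plus whatever setup is deferred into the solve.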
I am hoping to time KSP solving and preconditioning mutually exclusively. I am not sure that concept makes sense here. See above. Thanks, Matt Kind regards, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 19:19 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: Thanks for Barry for your response. I was just benchmarking the problem with various preconditioner on cpu and gpu. I understand, it is not possible to get mutually exclusive timing. However, can you tell if KSPSolve time includes both PCSetup and PCApply? And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and PCApply. If you do not call KSPSetUp() separately from KSPSolve() then its time is included with KSPSolve(). PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used. So yes the answer is not totally satisfying. The one thing I would recommend is to not call KSPSetUp() directly and then KSPSolve() will always include the total time of the solve plus all setup time. PCApply will contain all the time to apply the preconditioner but may also include some setup time. Barry Best, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 16:56 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson problem. I noticed from the output from using the flag -log_summary that for various events their respective %T (percent time in this phase) do not add up to 100 but rather exceeds 100. So, I gather there is some overlap among these events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and PCSolve. Is it possible to get a mutually exclusive %T or Time for these individual events? I have attached the log_summary output file from my run for your reference. For nested solvers it is tricky to get the times to be mutually exclusive because some parts of the building of the preconditioner is for some preconditioners delayed until the solve has started. It looks like you are using the default preconditioner options which for this example are taking more or less no time since so many iterations are needed. It is best to use -pc_type mg to use geometric multigrid on this problem. Barry Thanks! Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener https://www.cse.buffalo.edu/~knepley/
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From bsmith at petsc.dev Tue Oct 5 11:14:48 2021 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 5 Oct 2021 12:14:48 -0400 Subject: [petsc-users] (percent time in this phase) In-Reply-To: References: <6C41B1AA-ABF0-4BE8-885D-53F22535E032@stfc.ac.uk> Message-ID: PCSetUpOnBlocks() calls KSPSetUp() on each block which calls PCSetUp() on each block. This is why you are getting the confusing logging results. So the " extra" KSPSetUp() is the one on the blocks. > On Oct 5, 2021, at 12:02 PM, Karthikeyan Chockalingam - STFC UKRI wrote: > > Hi Matt, > > I have a couple of questions; > > Weather I run on single core or on multiple cores I find that KSPSetUP (ksp/tutorial/ex.45.c) is always called twice. But why? Is setup not done once? > I find PCSetUpOnBlocks is calling PCSetup and not the other way around. Can you shed some light? The preconditioner used is block jacobi. > > Thanks, > Karthik. > > From: Matthew Knepley > > Date: Sunday, 3 October 2021 at 12:54 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] (percent time in this phase) > > On Sun, Oct 3, 2021 at 5:43 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > Hi Matt, > > Thank you. The flamegraph tool is helpful. > Please find the attached screen shoot and foo.txt which generated that graph (using https://www.speedscope.app ). > I find the following call sequence from the graph > KSPSolve -> PCApply -> KSPSolve -> PCApply -> MatSolve > > I have a couple of questions > The KSPSolve time listed in the file using -log_summary (or -log_view), is it the time taken by the first KSPSolve (in the above call sequence)? > Yes. When calls are nested, we just do not record the time for the nested call in log_view. > > > What is the unit of measurement in flamegrah? > I believe it is microseconds, but I am not sure. > > Thanks, > > Matt > > > Thanks, > Karthik.
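Schematically, the nesting Barry describes for -pc_type bjacobi is roughly the following (an illustrative sketch of the call chain, not the exact PETSc source):

    KSPSolve(ksp,b,x)                    /* outer solve                                    */
      KSPSetUpOnBlocks(ksp)              /* not logged as a separate event                 */
        PCSetUpOnBlocks(pc)              /* the PCSetUpOnBlocks entry in -log_view         */
          KSPSetUp(subksp)               /* the "extra" KSPSetUp, once per local block     */
            PCSetUp(subpc)               /* block ILU: MatILUFactorSym / MatLUFactorNum    */
      /* ...then the iterations: MatMult + PCApply, where PCApply calls MatSolve per block */

So even with the default of one block per rank there are two KSPSetUp counts: one for the outer KSP and one for the sub-KSP on the block.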
> > From: Matthew Knepley > > Date: Friday, 1 October 2021 at 14:51 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] (percent time in this phase) > > On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > When comparing the MatSolve data for > > GPU > > MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 > MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > and CPU > > MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 > > the time spent is almost the same for this preconditioner. Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) > > mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor > > So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? > KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % > > I am getting so old. We have a different kind of log output if you are really concerned about inclusion. You can run with > > -log_view :foo.txt:ascii_flamegraph > > and then there are tools for plotting that output, described here > > https://firedrakeproject.org/optimising.html > > This output _guarantees_ strict inclusion, so you will not have the problems you have above adding things up. > > Thanks, > > Matt > > Best, > Karthik. > > > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 16:29 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > Thank you! > > Just to summarize > > KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % > > You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? > > I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. > > MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? > > Yes. > > Thanks, > > Matt > > Best, > > Karthik. > > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 11:58 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > Thank you Mathew. Now, it is all making sense to me. > > From data file ksp_ex45_N511_gpu_2.txt > > KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). > > However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? > > 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one process and using Block-Jacobi . 
Half the time is spent in the solve (53%) > > KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 > KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 > > > 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. > > PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 > MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 > MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. > > PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 > MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 > MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 > > 4) The significant vector time is all in norms (11%) since they are really slow on the GPU. > > > VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 > VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 > VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 > VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 > > So the solve time is: > > 53% ~ 37% + 4% + 11% > > and the setup time is about 16%. I was wrong about the SetUp time being included, as it is outside the event: > > https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 > > It looks like the remainder of the time (23%) is spent preallocating the matrix. > > Thanks, > > Matt > > The MalMult event is 4 %. How does this event figure into the above equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? > > Best, > Karthik. > > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 10:58 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > That was helpful. I would like to provide some additional details of my run on cpus and gpus. Please find the following attachments: > > graph.pdf a plot showing overall time and various petsc events. 
> ksp_ex45_N511_cpu_6.txt data file of the log_summary > ksp_ex45_N511_gpu_2.txt data file of the log_summary > > I used the following petsc options for cpu > > mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor > > and for gpus > > mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor > > to run the following problem > > https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html > > From the above code, I see is there no individual function called KSPSetUp(), so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is KSPSetUp time and KSPSolve time mutually exclusive? > > No, KSPSetUp() will be contained in KSPSolve() if it is called automatically. > > In your response you said that > > ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used.? > > I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for this particular preconditioner (bjacobi) how can I tell how they are timed? > > They are all inside KSPSolve(). If you have a preconditioned linear solve, the oreconditioning happens during the iteration. So an iteration would mostly > consist of MatMult + PCApply, with some vector work. > > I am hoping to time KSP solving and preconditioning mutually exclusively. > > I am not sure that concept makes sense here. See above. > > Thanks, > > Matt > > > Kind regards, > Karthik. > > > From: Barry Smith > > Date: Tuesday, 28 September 2021 at 19:19 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > > > > On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Thanks for Barry for your response. > > I was just benchmarking the problem with various preconditioner on cpu and gpu. I understand, it is not possible to get mutually exclusive timing. > However, can you tell if KSPSolve time includes both PCSetup and PCApply? And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and PCApply. > > If you do not call KSPSetUp() separately from KSPSolve() then its time is included with KSPSolve(). > > PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used. > > So yes the answer is not totally satisfying. The one thing I would recommend is to not call KSPSetUp() directly and then KSPSolve() will always include the total time of the solve plus all setup time. PCApply will contain all the time to apply the preconditioner but may also include some setup time. > > Barry > > > Best, > Karthik. > > > > > From: Barry Smith > > Date: Tuesday, 28 September 2021 at 16:56 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > > > > On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Hello, > > I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson problem. 
I noticed from the output from using the flag -log_summary that for various events their respective %T (percent time in this phase) do not add up to 100 but rather exceeds 100. So, I gather there is some overlap among these events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and PCSolve. Is it possible to get a mutually exclusive %T or Time for these individual events? I have attached the log_summary output file from my run for your reference. > > > For nested solvers it is tricky to get the times to be mutually exclusive because some parts of the building of the preconditioner is for some preconditioners delayed until the solve has started. > > It looks like you are using the default preconditioner options which for this example are taking more or less no time since so many iterations are needed. It is best to use -pc_type mg to use geometric multigrid on this problem. > > Barry > > > > > Thanks! > Karthik. > > This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. > > > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyan.chockalingam at stfc.ac.uk Tue Oct 5 11:28:27 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Tue, 5 Oct 2021 16:28:27 +0000 Subject: [petsc-users] (percent time in this phase) In-Reply-To: References: <6C41B1AA-ABF0-4BE8-885D-53F22535E032@stfc.ac.uk> Message-ID: Thanks Barry. Please find the attached screen shoot (flamegraph) and foo.txt which generated that graph (using https://www.speedscope.app). If you look at the far right of the flamegraph PCSetUpOnBlock() calls PCSetUp() and not KSPSetup(). Unless, I am not reading the graph right? Secondly how can I know, how many blocks are being setup? 
Is there a default flag on the number of blocks being SetUp? From: Barry Smith Date: Tuesday, 5 October 2021 at 17:15 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: Matthew Knepley , "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] (percent time in this phase) PCSetUpOnBlocks() calls KSPSetUp() on each block which calls PCSetUp() on each block. This is why you are getting the confusing logging results. So the " extra" KSPSetUp() is the one on the blocks. On Oct 5, 2021, at 12:02 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: Hi Matt, I have a couple of questions; 1. Weather I run on single core or on multiple cores I find that KSPSetUP (ksp/tutorial/ex.45.c) is always called twice. But why? Is setup not done once? 2. I find PCSetUpOnBlocks is calling PCSetup and not the other way around. Can you shed some light? The preconditioner used is block jacobi. Thanks, Karthik. From: Matthew Knepley > Date: Sunday, 3 October 2021 at 12:54 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) On Sun, Oct 3, 2021 at 5:43 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Hi Matt, Thank you. The flamegraph tool is helpful. Please find the attached screen shoot and foo.txt which generated that graph (using https://www.speedscope.app). I find the following call sequence from the graph KSPSolve -> PCApply -> KSPSolve -> PCApply -> MatSolve I have a couple of questions 1. The KSPSolve time listed in the file using -log_summary (or -log_view), is it the time taken by the first KSPSolve (in the above call sequence)? Yes. When calls are nested, we just do not record the time for the nested call in log_view. 1. 2. What is the unit of measurement in flamegrah? I believe it is microseconds, but I am not sure. Thanks, Matt 1. Thanks, Karthik. From: Matthew Knepley > Date: Friday, 1 October 2021 at 14:51 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: When comparing the MatSolve data for GPU MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 and CPU MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 the time spent is almost the same for this preconditioner. Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % I am getting so old. We have a different kind of log output if you are really concerned about inclusion. You can run with -log_view :foo.txt:ascii_flamegraph and then there are tools for plotting that output, described here https://firedrakeproject.org/optimising.html This output _guarantees_ strict inclusion, so you will not have the problems you have above adding things up. Thanks, Matt Best, Karthik. 
From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 16:29 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you! Just to summarize KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? Yes. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 11:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you Mathew. Now, it is all making sense to me. From data file ksp_ex45_N511_gpu_2.txt KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one process and using Block-Jacobi . Half the time is spent in the solve (53%) KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 4) The significant vector time is all in norms (11%) since they are really slow on the GPU. 
VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 So the solve time is: 53% ~ 37% + 4% + 11% and the setup time is about 16%. I was wrong about the SetUp time being included, as it is outside the event: https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 It looks like the remainder of the time (23%) is spent preallocating the matrix. Thanks, Matt The MalMult event is 4 %. How does this event figure into the above equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 10:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI > wrote: That was helpful. I would like to provide some additional details of my run on cpus and gpus. Please find the following attachments: 1. graph.pdf a plot showing overall time and various petsc events. 2. ksp_ex45_N511_cpu_6.txt data file of the log_summary 3. ksp_ex45_N511_gpu_2.txt data file of the log_summary I used the following petsc options for cpu mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor and for gpus mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor to run the following problem https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html From the above code, I see is there no individual function called KSPSetUp(), so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is KSPSetUp time and KSPSolve time mutually exclusive? No, KSPSetUp() will be contained in KSPSolve() if it is called automatically. In your response you said that ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used.? I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for this particular preconditioner (bjacobi) how can I tell how they are timed? They are all inside KSPSolve(). If you have a preconditioned linear solve, the oreconditioning happens during the iteration. So an iteration would mostly consist of MatMult + PCApply, with some vector work. I am hoping to time KSP solving and preconditioning mutually exclusively. I am not sure that concept makes sense here. See above. Thanks, Matt Kind regards, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 19:19 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: Thanks for Barry for your response. 
I was just benchmarking the problem with various preconditioner on cpu and gpu. I understand, it is not possible to get mutually exclusive timing. However, can you tell if KSPSolve time includes both PCSetup and PCApply? And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and PCApply. If you do not call KSPSetUp() separately from KSPSolve() then its time is included with KSPSolve(). PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used. So yes the answer is not totally satisfying. The one thing I would recommend is to not call KSPSetUp() directly and then KSPSolve() will always include the total time of the solve plus all setup time. PCApply will contain all the time to apply the preconditioner but may also include some setup time. Barry Best, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 16:56 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson problem. I noticed from the output from using the flag -log_summary that for various events their respective %T (percent time in this phase) do not add up to 100 but rather exceeds 100. So, I gather there is some overlap among these events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and PCSolve. Is it possible to get a mutually exclusive %T or Time for these individual events? I have attached the log_summary output file from my run for your reference. For nested solvers it is tricky to get the times to be mutually exclusive because some parts of the building of the preconditioner is for some preconditioners delayed until the solve has started. It looks like you are using the default preconditioner options which for this example are taking more or less no time since so many iterations are needed. It is best to use -pc_type mg to use geometric multigrid on this problem. Barry Thanks! Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: flamegrah.png Type: image/png Size: 74759 bytes Desc: flamegrah.png URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: foo.txt URL: From karthikeyan.chockalingam at stfc.ac.uk Tue Oct 5 11:33:43 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Tue, 5 Oct 2021 16:33:43 +0000 Subject: [petsc-users] (percent time in this phase) In-Reply-To: References: <6C41B1AA-ABF0-4BE8-885D-53F22535E032@stfc.ac.uk> Message-ID: <4A58AE1B-DAA4-462F-8ACD-3186A6C7AD52@stfc.ac.uk> The graph was generated using the flag -log_view :foo.txt:ascii_flamegraph From: "Chockalingam, Karthikeyan (STFC,DL,HC)" Date: Tuesday, 5 October 2021 at 17:28 To: Barry Smith Cc: Matthew Knepley , "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] (percent time in this phase) Thanks Barry. Please find the attached screen shoot (flamegraph) and foo.txt which generated that graph (using https://www.speedscope.app). If you look at the far right of the flamegraph PCSetUpOnBlock() calls PCSetUp() and not KSPSetup(). Unless, I am not reading the graph right? Secondly how can I know, how many blocks are being setup? Is there a default flag on the number of blocks being SetUp? From: Barry Smith Date: Tuesday, 5 October 2021 at 17:15 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: Matthew Knepley , "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] (percent time in this phase) PCSetUpOnBlocks() calls KSPSetUp() on each block which calls PCSetUp() on each block. This is why you are getting the confusing logging results. So the " extra" KSPSetUp() is the one on the blocks. On Oct 5, 2021, at 12:02 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: Hi Matt, I have a couple of questions; 1. Weather I run on single core or on multiple cores I find that KSPSetUP (ksp/tutorial/ex.45.c) is always called twice. But why? Is setup not done once? 2. I find PCSetUpOnBlocks is calling PCSetup and not the other way around. Can you shed some light? The preconditioner used is block jacobi. Thanks, Karthik. From: Matthew Knepley > Date: Sunday, 3 October 2021 at 12:54 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) On Sun, Oct 3, 2021 at 5:43 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Hi Matt, Thank you. The flamegraph tool is helpful. Please find the attached screen shoot and foo.txt which generated that graph (using https://www.speedscope.app). I find the following call sequence from the graph KSPSolve -> PCApply -> KSPSolve -> PCApply -> MatSolve I have a couple of questions 1. The KSPSolve time listed in the file using -log_summary (or -log_view), is it the time taken by the first KSPSolve (in the above call sequence)? Yes. When calls are nested, we just do not record the time for the nested call in log_view. 
1. 2. What is the unit of measurement in flamegrah? I believe it is microseconds, but I am not sure. Thanks, Matt 1. Thanks, Karthik. From: Matthew Knepley > Date: Friday, 1 October 2021 at 14:51 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: When comparing the MatSolve data for GPU MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 and CPU MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 the time spent is almost the same for this preconditioner. Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % I am getting so old. We have a different kind of log output if you are really concerned about inclusion. You can run with -log_view :foo.txt:ascii_flamegraph and then there are tools for plotting that output, described here https://firedrakeproject.org/optimising.html This output _guarantees_ strict inclusion, so you will not have the problems you have above adding things up. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 16:29 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you! Just to summarize KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? Yes. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 11:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you Mathew. Now, it is all making sense to me. From data file ksp_ex45_N511_gpu_2.txt KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one process and using Block-Jacobi . 
Half the time is spent in the solve (53%) KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 4) The significant vector time is all in norms (11%) since they are really slow on the GPU. VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 So the solve time is: 53% ~ 37% + 4% + 11% and the setup time is about 16%. I was wrong about the SetUp time being included, as it is outside the event: https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 It looks like the remainder of the time (23%) is spent preallocating the matrix. Thanks, Matt The MalMult event is 4 %. How does this event figure into the above equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 10:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI > wrote: That was helpful. I would like to provide some additional details of my run on cpus and gpus. Please find the following attachments: 1. graph.pdf a plot showing overall time and various petsc events. 2. ksp_ex45_N511_cpu_6.txt data file of the log_summary 3. 
ksp_ex45_N511_gpu_2.txt data file of the log_summary I used the following petsc options for cpu mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor and for gpus mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor to run the following problem https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html From the above code, I see is there no individual function called KSPSetUp(), so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is KSPSetUp time and KSPSolve time mutually exclusive? No, KSPSetUp() will be contained in KSPSolve() if it is called automatically. In your response you said that ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used.? I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for this particular preconditioner (bjacobi) how can I tell how they are timed? They are all inside KSPSolve(). If you have a preconditioned linear solve, the oreconditioning happens during the iteration. So an iteration would mostly consist of MatMult + PCApply, with some vector work. I am hoping to time KSP solving and preconditioning mutually exclusively. I am not sure that concept makes sense here. See above. Thanks, Matt Kind regards, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 19:19 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: Thanks for Barry for your response. I was just benchmarking the problem with various preconditioner on cpu and gpu. I understand, it is not possible to get mutually exclusive timing. However, can you tell if KSPSolve time includes both PCSetup and PCApply? And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and PCApply. If you do not call KSPSetUp() separately from KSPSolve() then its time is included with KSPSolve(). PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used. So yes the answer is not totally satisfying. The one thing I would recommend is to not call KSPSetUp() directly and then KSPSolve() will always include the total time of the solve plus all setup time. PCApply will contain all the time to apply the preconditioner but may also include some setup time. Barry Best, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 16:56 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson problem. I noticed from the output from using the flag -log_summary that for various events their respective %T (percent time in this phase) do not add up to 100 but rather exceeds 100. So, I gather there is some overlap among these events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and PCSolve. 
Is it possible to get a mutually exclusive %T or Time for these individual events? I have attached the log_summary output file from my run for your reference. For nested solvers it is tricky to get the times to be mutually exclusive because some parts of the building of the preconditioner is for some preconditioners delayed until the solve has started. It looks like you are using the default preconditioner options which for this example are taking more or less no time since so many iterations are needed. It is best to use -pc_type mg to use geometric multigrid on this problem. Barry Thanks! Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Tue Oct 5 11:47:07 2021 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 5 Oct 2021 12:47:07 -0400 Subject: [petsc-users] (percent time in this phase) In-Reply-To: References: <6C41B1AA-ABF0-4BE8-885D-53F22535E032@stfc.ac.uk> Message-ID: > On Oct 5, 2021, at 12:28 PM, Karthikeyan Chockalingam - STFC UKRI wrote: > > Thanks Barry. > > Please find the attached screen shoot (flamegraph) and foo.txt which generated that graph (using https://www.speedscope.app ). > If you look at the far right of the flamegraph PCSetUpOnBlock() calls PCSetUp() and not KSPSetup(). Unless, I am not reading the graph right? 
For block Jacobi:

static PetscErrorCode PCSetUpOnBlocks_BJacobi_Singleblock(PC pc)
{
  PetscErrorCode     ierr;
  PC_BJacobi         *jac = (PC_BJacobi*)pc->data;
  KSP                subksp = jac->ksp[0];
  KSPConvergedReason reason;

  PetscFunctionBegin;
  ierr = KSPSetUp(subksp);CHKERRQ(ierr);
  ierr = KSPGetConvergedReason(subksp,&reason);CHKERRQ(ierr);
  if (reason == KSP_DIVERGED_PC_FAILED) {
    pc->failedreason = PC_SUBPC_ERROR;
  }
  PetscFunctionReturn(0);
}

I am not sure why the KSPSetUp does not appear in the flame logging; there may be something that ensures it does not get logged. KSPSetUpOnBlocks() is not logged, so it does not appear in the logging.

Sometimes you may need to run in the debugger with breakpoints on certain functions to determine if they are called and when.

> Secondly, how can I know how many blocks are being set up? Is there a default flag for the number of blocks being set up?

For ASM and block Jacobi the default is one block per MPI rank. -pc_bjacobi_local_blocks 2 indicates you want 2 blocks per rank.

> From: Barry Smith
> Date: Tuesday, 5 October 2021 at 17:15
> To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
> Cc: Matthew Knepley, "petsc-users at mcs.anl.gov"
> Subject: Re: [petsc-users] (percent time in this phase)
>
> PCSetUpOnBlocks() calls KSPSetUp() on each block which calls PCSetUp() on each block. This is why you are getting the confusing logging results. So the "extra" KSPSetUp() is the one on the blocks.
>
> On Oct 5, 2021, at 12:02 PM, Karthikeyan Chockalingam - STFC UKRI wrote:
>
> Hi Matt,
>
> I have a couple of questions;
>
> 1. Whether I run on a single core or on multiple cores, I find that KSPSetUp (ksp/tutorial/ex45.c) is always called twice. But why? Is setup not done once?
> 2. I find PCSetUpOnBlocks is calling PCSetUp and not the other way around. Can you shed some light? The preconditioner used is block Jacobi.
>
> Thanks,
> Karthik.
>
> From: Matthew Knepley
> Date: Sunday, 3 October 2021 at 12:54
> To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
> Cc: "petsc-users at mcs.anl.gov"
> Subject: Re: [petsc-users] (percent time in this phase)
>
> On Sun, Oct 3, 2021 at 5:43 AM Karthikeyan Chockalingam - STFC UKRI wrote:
>
> Hi Matt,
>
> Thank you. The flamegraph tool is helpful.
> Please find the attached screenshot and foo.txt which generated that graph (using https://www.speedscope.app).
> I find the following call sequence from the graph
> KSPSolve -> PCApply -> KSPSolve -> PCApply -> MatSolve
>
> I have a couple of questions
> 1. The KSPSolve time listed in the file using -log_summary (or -log_view), is it the time taken by the first KSPSolve (in the above call sequence)?
> Yes. When calls are nested, we just do not record the time for the nested call in log_view.
>
> 2. What is the unit of measurement in flamegraph?
> I believe it is microseconds, but I am not sure.
>
> Thanks,
>
>   Matt
>
> Thanks,
> Karthik.
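As a cross-check on the block count, the sub-solvers can also be queried in code rather than inferred from the number of KSPSetUp entries in the log. A small sketch, assuming an outer KSP named ksp whose preconditioner is PCBJACOBI and which has already been set up (the sub-KSPs only exist after KSPSetUp()/PCSetUp() or the first KSPSolve()); error checking omitted:

  PC       pc;
  KSP      *subksp;
  PetscInt nlocal, firstlocal;

  KSPGetPC(ksp, &pc);
  PCBJacobiGetSubKSP(pc, &nlocal, &firstlocal, &subksp);   /* local block count and their KSPs */
  PetscPrintf(PETSC_COMM_SELF, "this rank owns %D block(s), first global block %D\n", nlocal, firstlocal);

Each entry of subksp is one of the block solvers whose KSPSetUp() shows up in the counts discussed above.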
> > From: Matthew Knepley > > Date: Friday, 1 October 2021 at 14:51 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] (percent time in this phase) > > On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > When comparing the MatSolve data for > > GPU > > MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 > MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > and CPU > > MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 > > the time spent is almost the same for this preconditioner. Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) > > mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor > > So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? > KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % > > I am getting so old. We have a different kind of log output if you are really concerned about inclusion. You can run with > > -log_view :foo.txt:ascii_flamegraph > > and then there are tools for plotting that output, described here > > https://firedrakeproject.org/optimising.html > > This output _guarantees_ strict inclusion, so you will not have the problems you have above adding things up. > > Thanks, > > Matt > > Best, > Karthik. > > > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 16:29 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > Thank you! > > Just to summarize > > KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % > > You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? > > I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. > > MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? > > Yes. > > Thanks, > > Matt > > Best, > > Karthik. > > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 11:58 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > Thank you Mathew. Now, it is all making sense to me. > > From data file ksp_ex45_N511_gpu_2.txt > > KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). > > However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? > > 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one process and using Block-Jacobi . 
Half the time is spent in the solve (53%) > > KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 > KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 > > > 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. > > PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 > MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 > MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. > > PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 > MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 > MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 > > 4) The significant vector time is all in norms (11%) since they are really slow on the GPU. > > > VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 > VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 > VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 > VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 > > So the solve time is: > > 53% ~ 37% + 4% + 11% > > and the setup time is about 16%. I was wrong about the SetUp time being included, as it is outside the event: > > https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 > > It looks like the remainder of the time (23%) is spent preallocating the matrix. > > Thanks, > > Matt > > The MalMult event is 4 %. How does this event figure into the above equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? > > Best, > Karthik. > > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 10:58 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > That was helpful. I would like to provide some additional details of my run on cpus and gpus. Please find the following attachments: > > graph.pdf a plot showing overall time and various petsc events. 
> ksp_ex45_N511_cpu_6.txt data file of the log_summary > ksp_ex45_N511_gpu_2.txt data file of the log_summary > > I used the following petsc options for cpu > > mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor > > and for gpus > > mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor > > to run the following problem > > https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html > > From the above code, I see is there no individual function called KSPSetUp(), so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is KSPSetUp time and KSPSolve time mutually exclusive? > > No, KSPSetUp() will be contained in KSPSolve() if it is called automatically. > > In your response you said that > > ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used.? > > I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for this particular preconditioner (bjacobi) how can I tell how they are timed? > > They are all inside KSPSolve(). If you have a preconditioned linear solve, the oreconditioning happens during the iteration. So an iteration would mostly > consist of MatMult + PCApply, with some vector work. > > I am hoping to time KSP solving and preconditioning mutually exclusively. > > I am not sure that concept makes sense here. See above. > > Thanks, > > Matt > > > Kind regards, > Karthik. > > > From: Barry Smith > > Date: Tuesday, 28 September 2021 at 19:19 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > > > > On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Thanks for Barry for your response. > > I was just benchmarking the problem with various preconditioner on cpu and gpu. I understand, it is not possible to get mutually exclusive timing. > However, can you tell if KSPSolve time includes both PCSetup and PCApply? And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and PCApply. > > If you do not call KSPSetUp() separately from KSPSolve() then its time is included with KSPSolve(). > > PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used. > > So yes the answer is not totally satisfying. The one thing I would recommend is to not call KSPSetUp() directly and then KSPSolve() will always include the total time of the solve plus all setup time. PCApply will contain all the time to apply the preconditioner but may also include some setup time. > > Barry > > > Best, > Karthik. > > > > > From: Barry Smith > > Date: Tuesday, 28 September 2021 at 16:56 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > > > > On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Hello, > > I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson problem. 
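For orientation, the driver in ex45 is, schematically, something like the following (paraphrased rather than the verbatim example; ComputeInitialGuess, ComputeRHS and ComputeMatrix are the callbacks defined in the example, default grid sizes are guesses, and error checking is omitted). The point relevant to the timing question is that the KSPSet* calls only register callbacks; the matrix preallocation (DMCreateMat), the operator assembly and the preconditioner setup are all triggered later, from inside KSPSolve() when KSPSetUp() has not been called explicitly.

  KSP ksp;
  DM  da;

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
               DMDA_STENCIL_STAR, 7, 7, 7, PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
               1, 1, NULL, NULL, NULL, &da);
  DMSetFromOptions(da);
  DMSetUp(da);
  KSPSetDM(ksp, da);
  KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL);  /* registration only            */
  KSPSetComputeRHS(ksp, ComputeRHS, NULL);                    /* registration only            */
  KSPSetComputeOperators(ksp, ComputeMatrix, NULL);           /* invoked later by KSPSetUp()  */
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, NULL, NULL);   /* setup, and hence the assembly above, happens under here     */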
I noticed from the output from using the flag -log_summary that for various events their respective %T (percent time in this phase) do not add up to 100 but rather exceeds 100. So, I gather there is some overlap among these events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and PCSolve. Is it possible to get a mutually exclusive %T or Time for these individual events? I have attached the log_summary output file from my run for your reference. > > > For nested solvers it is tricky to get the times to be mutually exclusive because some parts of the building of the preconditioner is for some preconditioners delayed until the solve has started. > > It looks like you are using the default preconditioner options which for this example are taking more or less no time since so many iterations are needed. It is best to use -pc_type mg to use geometric multigrid on this problem. > > Barry > > > > > Thanks! > Karthik. > > This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. > > > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyan.chockalingam at stfc.ac.uk Tue Oct 5 13:17:17 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Tue, 5 Oct 2021 18:17:17 +0000 Subject: [petsc-users] (percent time in this phase) In-Reply-To: References: <6C41B1AA-ABF0-4BE8-885D-53F22535E032@stfc.ac.uk> Message-ID: <6A73EDD4-CA0D-45B8-B714-707E875BCCD1@stfc.ac.uk> Thank you very much! KPS Setup does appear in the flame login it in the far left (and you have to zoom to see it using the browser) I will look into the debugger option. You said ?. -pc_bjacobi_local_blocks 2 indicates you want to 2 blocks per rank? does it imply on mpi -n 2 that KSPSetUP should be called 4 times? 
The below shows a different behaviour:

mpirun -n 6 ... -pc_bjacobi_local_blocks 1 -> KSPSetUp is called 2 times
mpirun -n 6 ... -pc_bjacobi_local_blocks 2 -> KSPSetUp is called 3 times
mpirun -n 6 ... -pc_bjacobi_local_blocks 3 -> KSPSetUp is called 4 times

mpirun -n 3 ... -pc_bjacobi_local_blocks 1 -> KSPSetUp is called 2 times
mpirun -n 3 ... -pc_bjacobi_local_blocks 2 -> KSPSetUp is called 3 times
mpirun -n 3 ... -pc_bjacobi_local_blocks 3 -> KSPSetUp is called 4 times

Thanks,
Karthik.

From: Barry Smith
Date: Tuesday, 5 October 2021 at 17:47
To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
Cc: Matthew Knepley, "petsc-users at mcs.anl.gov"
Subject: Re: [petsc-users] (percent time in this phase)

On Oct 5, 2021, at 12:28 PM, Karthikeyan Chockalingam - STFC UKRI wrote:

Thanks Barry.

Please find the attached screenshot (flamegraph) and foo.txt which generated that graph (using https://www.speedscope.app).
If you look at the far right of the flamegraph, PCSetUpOnBlock() calls PCSetUp() and not KSPSetUp(). Unless I am not reading the graph right?

For block Jacobi:

static PetscErrorCode PCSetUpOnBlocks_BJacobi_Singleblock(PC pc)
{
  PetscErrorCode     ierr;
  PC_BJacobi         *jac = (PC_BJacobi*)pc->data;
  KSP                subksp = jac->ksp[0];
  KSPConvergedReason reason;

  PetscFunctionBegin;
  ierr = KSPSetUp(subksp);CHKERRQ(ierr);
  ierr = KSPGetConvergedReason(subksp,&reason);CHKERRQ(ierr);
  if (reason == KSP_DIVERGED_PC_FAILED) {
    pc->failedreason = PC_SUBPC_ERROR;
  }
  PetscFunctionReturn(0);
}

I am not sure why the KSPSetUp does not appear in the flame logging; there may be something that ensures it does not get logged. KSPSetUpOnBlocks() is not logged, so it does not appear in the logging.

Sometimes you may need to run in the debugger with breakpoints on certain functions to determine if they are called and when.

Secondly, how can I know how many blocks are being set up? Is there a default flag for the number of blocks being set up?

For ASM and block Jacobi the default is one block per MPI rank. -pc_bjacobi_local_blocks 2 indicates you want 2 blocks per rank.

From: Barry Smith
Date: Tuesday, 5 October 2021 at 17:15
To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
Cc: Matthew Knepley, "petsc-users at mcs.anl.gov"
Subject: Re: [petsc-users] (percent time in this phase)

PCSetUpOnBlocks() calls KSPSetUp() on each block which calls PCSetUp() on each block. This is why you are getting the confusing logging results. So the "extra" KSPSetUp() is the one on the blocks.

On Oct 5, 2021, at 12:02 PM, Karthikeyan Chockalingam - STFC UKRI wrote:

Hi Matt,

I have a couple of questions;

1. Whether I run on a single core or on multiple cores, I find that KSPSetUp (ksp/tutorial/ex45.c) is always called twice. But why? Is setup not done once?
2. I find PCSetUpOnBlocks is calling PCSetUp and not the other way around. Can you shed some light? The preconditioner used is block Jacobi.

Thanks,
Karthik.

From: Matthew Knepley
Date: Sunday, 3 October 2021 at 12:54
To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
Cc: "petsc-users at mcs.anl.gov"
Subject: Re: [petsc-users] (percent time in this phase)

On Sun, Oct 3, 2021 at 5:43 AM Karthikeyan Chockalingam - STFC UKRI wrote:

Hi Matt,

Thank you. The flamegraph tool is helpful.
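For completeness, the flamegraph data itself comes from swapping the logging option on an otherwise unchanged run, along the lines of (options as in the earlier runs; the exact option set does not matter for the logging):

  mpirun -n 2 ./ex45 -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -log_view :foo.txt:ascii_flamegraph

after which foo.txt can be loaded into https://www.speedscope.app, as described on the Firedrake optimising page linked above.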
The KSPSolve time listed in the file using -log_summary (or -log_view), is it the time taken by the first KSPSolve (in the above call sequence)? Yes. When calls are nested, we just do not record the time for the nested call in log_view. 1. 2. What is the unit of measurement in flamegrah? I believe it is microseconds, but I am not sure. Thanks, Matt 1. Thanks, Karthik. From: Matthew Knepley > Date: Friday, 1 October 2021 at 14:51 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: When comparing the MatSolve data for GPU MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 and CPU MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 the time spent is almost the same for this preconditioner. Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % I am getting so old. We have a different kind of log output if you are really concerned about inclusion. You can run with -log_view :foo.txt:ascii_flamegraph and then there are tools for plotting that output, described here https://firedrakeproject.org/optimising.html This output _guarantees_ strict inclusion, so you will not have the problems you have above adding things up. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 16:29 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you! Just to summarize KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? Yes. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 11:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you Mathew. Now, it is all making sense to me. From data file ksp_ex45_N511_gpu_2.txt KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? 1) You do one solve, but 2 KSPSetUp()s. 
You must be running on more than one process and using Block-Jacobi . Half the time is spent in the solve (53%) KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 4) The significant vector time is all in norms (11%) since they are really slow on the GPU. VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 So the solve time is: 53% ~ 37% + 4% + 11% and the setup time is about 16%. I was wrong about the SetUp time being included, as it is outside the event: https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 It looks like the remainder of the time (23%) is spent preallocating the matrix. Thanks, Matt The MalMult event is 4 %. How does this event figure into the above equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 10:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI > wrote: That was helpful. I would like to provide some additional details of my run on cpus and gpus. Please find the following attachments: 1. graph.pdf a plot showing overall time and various petsc events. 2. ksp_ex45_N511_cpu_6.txt data file of the log_summary 3. 
ksp_ex45_N511_gpu_2.txt data file of the log_summary I used the following petsc options for cpu mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor and for gpus mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor to run the following problem https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html From the above code, I see is there no individual function called KSPSetUp(), so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is KSPSetUp time and KSPSolve time mutually exclusive? No, KSPSetUp() will be contained in KSPSolve() if it is called automatically. In your response you said that ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used.? I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for this particular preconditioner (bjacobi) how can I tell how they are timed? They are all inside KSPSolve(). If you have a preconditioned linear solve, the oreconditioning happens during the iteration. So an iteration would mostly consist of MatMult + PCApply, with some vector work. I am hoping to time KSP solving and preconditioning mutually exclusively. I am not sure that concept makes sense here. See above. Thanks, Matt Kind regards, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 19:19 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: Thanks for Barry for your response. I was just benchmarking the problem with various preconditioner on cpu and gpu. I understand, it is not possible to get mutually exclusive timing. However, can you tell if KSPSolve time includes both PCSetup and PCApply? And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and PCApply. If you do not call KSPSetUp() separately from KSPSolve() then its time is included with KSPSolve(). PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used. So yes the answer is not totally satisfying. The one thing I would recommend is to not call KSPSetUp() directly and then KSPSolve() will always include the total time of the solve plus all setup time. PCApply will contain all the time to apply the preconditioner but may also include some setup time. Barry Best, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 16:56 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson problem. I noticed from the output from using the flag -log_summary that for various events their respective %T (percent time in this phase) do not add up to 100 but rather exceeds 100. So, I gather there is some overlap among these events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and PCSolve. 
Is it possible to get a mutually exclusive %T or Time for these individual events? I have attached the log_summary output file from my run for your reference. For nested solvers it is tricky to get the times to be mutually exclusive because some parts of the building of the preconditioner is for some preconditioners delayed until the solve has started. It looks like you are using the default preconditioner options which for this example are taking more or less no time since so many iterations are needed. It is best to use -pc_type mg to use geometric multigrid on this problem. Barry Thanks! Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyan.chockalingam at stfc.ac.uk Tue Oct 5 13:27:35 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Tue, 5 Oct 2021 18:27:35 +0000 Subject: [petsc-users] (percent time in this phase) In-Reply-To: <6A73EDD4-CA0D-45B8-B714-707E875BCCD1@stfc.ac.uk> References: <6C41B1AA-ABF0-4BE8-885D-53F22535E032@stfc.ac.uk> <6A73EDD4-CA0D-45B8-B714-707E875BCCD1@stfc.ac.uk> Message-ID: <4C9DB5C9-94F0-48E1-A7A6-E10CB2652EAF@stfc.ac.uk> Just to clarify, I am referring to the below -log_view output, where KPSSetUp is called two times KSPSetUp 2 1.0 1.3278e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 1 0 0 0 0 1 0 By using the following option: mpirun -n 6 ??? -pc_bjacobi_local_blocks 1 ? KSPSetUP is called 2 times From: "Chockalingam, Karthikeyan (STFC,DL,HC)" Date: Tuesday, 5 October 2021 at 19:17 To: Barry Smith Cc: Matthew Knepley , "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] (percent time in this phase) Thank you very much! 
KPS Setup does appear in the flame login it in the far left (and you have to zoom to see it using the browser) I will look into the debugger option. You said ?. -pc_bjacobi_local_blocks 2 indicates you want to 2 blocks per rank? does it imply on mpi -n 2 that KSPSetUP should be called 4 times? The below shows a different behaviour? mpirun -n 6 ??? -pc_bjacobi_local_blocks 1 ? KSPSetUP is called 2 times mpirun -n 6 ??? -pc_bjacobi_local_blocks 2 ? KSPSetUP is called 3 times mpirun -n 6 ??? -pc_bjacobi_local_blocks 3 ? KSPSetUP is called 4 times mpirun -n 3 ??? -pc_bjacobi_local_blocks 1 ? KSPSetUP is called 2 times mpirun -n 3 ??? -pc_bjacobi_local_blocks 2 ? KSPSetUP is called 3 times mpirun -n 3 ??? -pc_bjacobi_local_blocks 3 ? KSPSetUP is called 4 times Thanks, Karthik. From: Barry Smith Date: Tuesday, 5 October 2021 at 17:47 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: Matthew Knepley , "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] (percent time in this phase) On Oct 5, 2021, at 12:28 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: Thanks Barry. Please find the attached screen shoot (flamegraph) and foo.txt which generated that graph (using https://www.speedscope.app). If you look at the far right of the flamegraph PCSetUpOnBlock() calls PCSetUp() and not KSPSetup(). Unless, I am not reading the graph right? For block Jacobi static PetscErrorCode PCSetUpOnBlocks_BJacobi_Singleblock(PC pc) { PetscErrorCode ierr; PC_BJacobi *jac = (PC_BJacobi*)pc->data; KSP subksp = jac->ksp[0]; KSPConvergedReason reason; PetscFunctionBegin; ierr = KSPSetUp(subksp);CHKERRQ(ierr); ierr = KSPGetConvergedReason(subksp,&reason);CHKERRQ(ierr); if (reason == KSP_DIVERGED_PC_FAILED) { pc->failedreason = PC_SUBPC_ERROR; } PetscFunctionReturn(0); } I am not sure why the KSPSetUp does not appear in the Flame logging, there may be something that ensures it does not get logged. KSPSetUpOnBlocks() is not logged so does not appear in the logging. Sometimes you may need to run in the debugger with break points on certain functions to indicate if they are called and when Secondly how can I know, how many blocks are being setup? Is there a default flag on the number of blocks being SetUp? For ASM and block Jacobi the default blocks is one per MPI rank. -pc_bjacobi_local_blocks 2 indicates you want to 2 blocks per rank From: Barry Smith > Date: Tuesday, 5 October 2021 at 17:15 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Matthew Knepley >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) PCSetUpOnBlocks() calls KSPSetUp() on each block which calls PCSetUp() on each block. This is why you are getting the confusing logging results. So the " extra" KSPSetUp() is the one on the blocks. On Oct 5, 2021, at 12:02 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: Hi Matt, I have a couple of questions; 1. Weather I run on single core or on multiple cores I find that KSPSetUP (ksp/tutorial/ex.45.c) is always called twice. But why? Is setup not done once? 2. I find PCSetUpOnBlocks is calling PCSetup and not the other way around. Can you shed some light? The preconditioner used is block jacobi. Thanks, Karthik. From: Matthew Knepley > Date: Sunday, 3 October 2021 at 12:54 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) On Sun, Oct 3, 2021 at 5:43 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Hi Matt, Thank you. The flamegraph tool is helpful. 
Please find the attached screen shoot and foo.txt which generated that graph (using https://www.speedscope.app). I find the following call sequence from the graph KSPSolve -> PCApply -> KSPSolve -> PCApply -> MatSolve I have a couple of questions 1. The KSPSolve time listed in the file using -log_summary (or -log_view), is it the time taken by the first KSPSolve (in the above call sequence)? Yes. When calls are nested, we just do not record the time for the nested call in log_view. 1. 2. What is the unit of measurement in flamegrah? I believe it is microseconds, but I am not sure. Thanks, Matt 1. Thanks, Karthik. From: Matthew Knepley > Date: Friday, 1 October 2021 at 14:51 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] (percent time in this phase) On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: When comparing the MatSolve data for GPU MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 and CPU MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 the time spent is almost the same for this preconditioner. Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % I am getting so old. We have a different kind of log output if you are really concerned about inclusion. You can run with -log_view :foo.txt:ascii_flamegraph and then there are tools for plotting that output, described here https://firedrakeproject.org/optimising.html This output _guarantees_ strict inclusion, so you will not have the problems you have above adding things up. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 16:29 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you! Just to summarize KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? Yes. Thanks, Matt Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 11:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Thank you Mathew. Now, it is all making sense to me. 
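As an aside to the percentage bookkeeping discussed above, the same event numbers can also be pulled out of the log programmatically instead of being read off the -log_view table. A sketch, assuming the PetscLogEventGetId()/PetscLogEventGetPerfInfo() interface of the PETSc version used here and that logging is active (e.g. -log_view on the command line or a PetscLogDefaultBegin() call); error checking omitted:

  PetscLogEvent      solveEvent;
  PetscEventPerfInfo info;

  PetscLogEventGetId("KSPSolve", &solveEvent);
  PetscLogEventGetPerfInfo(0, solveEvent, &info);   /* stage 0 is the main stage */
  PetscPrintf(PETSC_COMM_SELF, "KSPSolve: %d call(s), %g seconds on this rank\n",
              info.count, (double)info.time);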
From data file ksp_ex45_N511_gpu_2.txt KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one process and using Block-Jacobi . Half the time is spent in the solve (53%) KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 4) The significant vector time is all in norms (11%) since they are really slow on the GPU. VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 So the solve time is: 53% ~ 37% + 4% + 11% and the setup time is about 16%. I was wrong about the SetUp time being included, as it is outside the event: https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 It looks like the remainder of the time (23%) is spent preallocating the matrix. Thanks, Matt The MalMult event is 4 %. How does this event figure into the above equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? Best, Karthik. From: Matthew Knepley > Date: Wednesday, 29 September 2021 at 10:58 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: Barry Smith >, "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI > wrote: That was helpful. I would like to provide some additional details of my run on cpus and gpus. Please find the following attachments: 1. graph.pdf a plot showing overall time and various petsc events. 2. ksp_ex45_N511_cpu_6.txt data file of the log_summary 3. 
ksp_ex45_N511_gpu_2.txt data file of the log_summary I used the following petsc options for cpu mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor and for gpus mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor to run the following problem https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html From the above code, I see is there no individual function called KSPSetUp(), so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is KSPSetUp time and KSPSolve time mutually exclusive? No, KSPSetUp() will be contained in KSPSolve() if it is called automatically. In your response you said that ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used.? I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for this particular preconditioner (bjacobi) how can I tell how they are timed? They are all inside KSPSolve(). If you have a preconditioned linear solve, the oreconditioning happens during the iteration. So an iteration would mostly consist of MatMult + PCApply, with some vector work. I am hoping to time KSP solving and preconditioning mutually exclusively. I am not sure that concept makes sense here. See above. Thanks, Matt Kind regards, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 19:19 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: Thanks for Barry for your response. I was just benchmarking the problem with various preconditioner on cpu and gpu. I understand, it is not possible to get mutually exclusive timing. However, can you tell if KSPSolve time includes both PCSetup and PCApply? And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and PCApply. If you do not call KSPSetUp() separately from KSPSolve() then its time is included with KSPSolve(). PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used. So yes the answer is not totally satisfying. The one thing I would recommend is to not call KSPSetUp() directly and then KSPSolve() will always include the total time of the solve plus all setup time. PCApply will contain all the time to apply the preconditioner but may also include some setup time. Barry Best, Karthik. From: Barry Smith > Date: Tuesday, 28 September 2021 at 16:56 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] %T (percent time in this phase) On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson problem. I noticed from the output from using the flag -log_summary that for various events their respective %T (percent time in this phase) do not add up to 100 but rather exceeds 100. So, I gather there is some overlap among these events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and PCSolve. 
Is it possible to get a mutually exclusive %T or Time for these individual events? I have attached the log_summary output file from my run for your reference. For nested solvers it is tricky to get the times to be mutually exclusive because some parts of the building of the preconditioner is for some preconditioners delayed until the solve has started. It looks like you are using the default preconditioner options which for this example are taking more or less no time since so many iterations are needed. It is best to use -pc_type mg to use geometric multigrid on this problem. Barry Thanks! Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Tue Oct 5 14:27:36 2021 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 5 Oct 2021 15:27:36 -0400 Subject: [petsc-users] (percent time in this phase) In-Reply-To: <6A73EDD4-CA0D-45B8-B714-707E875BCCD1@stfc.ac.uk> References: <6C41B1AA-ABF0-4BE8-885D-53F22535E032@stfc.ac.uk> <6A73EDD4-CA0D-45B8-B714-707E875BCCD1@stfc.ac.uk> Message-ID: <36013448-A581-4674-9FFD-655EF51063C9@petsc.dev> > On Oct 5, 2021, at 2:17 PM, Karthikeyan Chockalingam - STFC UKRI wrote: > > Thank you very much! > KPS Setup does appear in the flame login it in the far left (and you have to zoom to see it using the browser) > I will look into the debugger option. > > You said ?. -pc_bjacobi_local_blocks 2 indicates you want to 2 blocks per rank? does it imply on mpi -n 2 that KSPSetUP should be called 4 times? > > The below shows a different behaviour? > mpirun -n 6 ??? -pc_bjacobi_local_blocks 1 ? KSPSetUP is called 2 times This is correct, it is called once on the entire problem the outer KSP and then once on the block KSP > mpirun -n 6 ??? -pc_bjacobi_local_blocks 2 ? 
KSPSetUP is called 3 times This is correct, it is called once on the entire problem the outer KSP and then once on each of the two blocks > mpirun -n 6 ??? -pc_bjacobi_local_blocks 3 ? KSPSetUP is called 4 times This is correct etc > > mpirun -n 3 ??? -pc_bjacobi_local_blocks 1 ? KSPSetUP is called 2 times > mpirun -n 3 ??? -pc_bjacobi_local_blocks 2 ? KSPSetUP is called 3 times > mpirun -n 3 ??? -pc_bjacobi_local_blocks 3 ? KSPSetUP is called 4 times > > Thanks, > Karthik. > > From: Barry Smith > > Date: Tuesday, 5 October 2021 at 17:47 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Matthew Knepley >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] (percent time in this phase) > > > > > On Oct 5, 2021, at 12:28 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Thanks Barry. > > Please find the attached screen shoot (flamegraph) and foo.txt which generated that graph (using https://www.speedscope.app ). > If you look at the far right of the flamegraph PCSetUpOnBlock() calls PCSetUp() and not KSPSetup(). Unless, I am not reading the graph right? > > For block Jacobi > > static PetscErrorCode PCSetUpOnBlocks_BJacobi_Singleblock(PC pc) > { > PetscErrorCode ierr; > PC_BJacobi *jac = (PC_BJacobi*)pc->data; > KSP subksp = jac->ksp[0]; > KSPConvergedReason reason; > > PetscFunctionBegin; > ierr = KSPSetUp(subksp);CHKERRQ(ierr); > ierr = KSPGetConvergedReason(subksp,&reason);CHKERRQ(ierr); > if (reason == KSP_DIVERGED_PC_FAILED) { > pc->failedreason = PC_SUBPC_ERROR; > } > PetscFunctionReturn(0); > } > > I am not sure why the KSPSetUp does not appear in the Flame logging, there may be something that ensures it does not get logged. > > KSPSetUpOnBlocks() is not logged so does not appear in the logging. > > Sometimes you may need to run in the debugger with break points on certain functions to indicate if they are called and when > > Secondly how can I know, how many blocks are being setup? Is there a default flag on the number of blocks being SetUp? > > For ASM and block Jacobi the default blocks is one per MPI rank. -pc_bjacobi_local_blocks 2 indicates you want to 2 blocks per rank > > > > > > > From: Barry Smith > > Date: Tuesday, 5 October 2021 at 17:15 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Matthew Knepley >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] (percent time in this phase) > > > PCSetUpOnBlocks() calls KSPSetUp() on each block which calls PCSetUp() on each block. This is why you are getting the confusing logging results. So the " > extra" KSPSetUp() is the one on the blocks. > > > > On Oct 5, 2021, at 12:02 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Hi Matt, > > I have a couple of questions; > > Weather I run on single core or on multiple cores I find that KSPSetUP (ksp/tutorial/ex.45.c) is always called twice. But why? Is setup not done once? > I find PCSetUpOnBlocks is calling PCSetup and not the other way around. Can you shed some light? The preconditioner used is block jacobi. > > Thanks, > Karthik. > > From: Matthew Knepley > > Date: Sunday, 3 October 2021 at 12:54 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] (percent time in this phase) > > On Sun, Oct 3, 2021 at 5:43 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > Hi Matt, > > Thank you. The flamegraph tool is helpful. > Please find the attached screen shoot and foo.txt which generated that graph (using https://www.speedscope.app ). 
> I find the following call sequence from the graph > KSPSolve -> PCApply -> KSPSolve -> PCApply -> MatSolve > > I have a couple of questions > The KSPSolve time listed in the file using -log_summary (or -log_view), is it the time taken by the first KSPSolve (in the above call sequence)? > Yes. When calls are nested, we just do not record the time for the nested call in log_view. > > > What is the unit of measurement in flamegrah? > I believe it is microseconds, but I am not sure. > > Thanks, > > Matt > > > Thanks, > Karthik. > > From: Matthew Knepley > > Date: Friday, 1 October 2021 at 14:51 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] (percent time in this phase) > > On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > When comparing the MatSolve data for > > GPU > > MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 > MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > and CPU > > MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 35 34 0 0 0 35 34 0 0 0 4489 > > the time spent is almost the same for this preconditioner. Look like MatCUSPARSSolAnl is called only twice (since I am running on two cores) > > mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor > > So would it be fair to assume MatCUSPARSSolAnl is not accounted for in MatSolve and it is an exclusive event? > KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % > > I am getting so old. We have a different kind of log output if you are really concerned about inclusion. You can run with > > -log_view :foo.txt:ascii_flamegraph > > and then there are tools for plotting that output, described here > > https://firedrakeproject.org/optimising.html > > This output _guarantees_ strict inclusion, so you will not have the problems you have above adding things up. > > Thanks, > > Matt > > Best, > Karthik. > > > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 16:29 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > Thank you! > > Just to summarize > > KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~ 100 % > > You didn?t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right in accounting for it as above? > > I am not sure.I thought it might be the GPU part of MatSolve(). I will have to look in the code. I am not as familiar with the GPU part. > > MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX are mutually exclusive? > > Yes. > > Thanks, > > Matt > > Best, > > Karthik. > > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 11:58 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > Thank you Mathew. 
Now, it is all making sense to me. > > From data file ksp_ex45_N511_gpu_2.txt > > KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%). > > However, you said ?So an iteration would mostly consist of MatMult + PCApply, with some vector work? > > 1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one process and using Block-Jacobi . Half the time is spent in the solve (53%) > > KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0 > KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06 4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021 8.17e-03 100 > > > 2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is all setup of the individual blocks, and this is all used by the numerical ILU factorization. > > PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0 0.00e+00 0 > MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0 0.00e+00 0 > MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 > > 3) The preconditioner application takes 37% of the time, which is all solving the factors and recorded in MatSolve(). Matrix multiplication takes 4%. > > PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0 37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100 > MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0 0.00e+00 100 > MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06 2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0 0.00e+00 100 > > 4) The significant vector time is all in norms (11%) since they are really slow on the GPU. > > > VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00 6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342 2.74e-03 100 > VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00 1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680 5.44e-03 100 > VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0 0.00e+00 100 > VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0 0.00e+00 100 > > So the solve time is: > > 53% ~ 37% + 4% + 11% > > and the setup time is about 16%. I was wrong about the SetUp time being included, as it is outside the event: > > https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852 > > It looks like the remainder of the time (23%) is spent preallocating the matrix. > > Thanks, > > Matt > > The MalMult event is 4 %. How does this event figure into the above equation; if preconditioning (MatMult + PCApply) is included in KSPSolve? > > Best, > Karthik. > > From: Matthew Knepley > > Date: Wednesday, 29 September 2021 at 10:58 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: Barry Smith >, "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI > wrote: > That was helpful. I would like to provide some additional details of my run on cpus and gpus. 
Please find the following attachments: > > graph.pdf a plot showing overall time and various petsc events. > ksp_ex45_N511_cpu_6.txt data file of the log_summary > ksp_ex45_N511_gpu_2.txt data file of the log_summary > > I used the following petsc options for cpu > > mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor > > and for gpus > > mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi -ksp_monitor > > to run the following problem > > https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html > > From the above code, I see is there no individual function called KSPSetUp(), so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS, kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is KSPSetUp time and KSPSolve time mutually exclusive? > > No, KSPSetUp() will be contained in KSPSolve() if it is called automatically. > > In your response you said that > > ?PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used.? > > I don?t see a explicit call to PCSetUp() or PCApply() in ex45; so for this particular preconditioner (bjacobi) how can I tell how they are timed? > > They are all inside KSPSolve(). If you have a preconditioned linear solve, the oreconditioning happens during the iteration. So an iteration would mostly > consist of MatMult + PCApply, with some vector work. > > I am hoping to time KSP solving and preconditioning mutually exclusively. > > I am not sure that concept makes sense here. See above. > > Thanks, > > Matt > > > Kind regards, > Karthik. > > > From: Barry Smith > > Date: Tuesday, 28 September 2021 at 19:19 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > > > > On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Thanks for Barry for your response. > > I was just benchmarking the problem with various preconditioner on cpu and gpu. I understand, it is not possible to get mutually exclusive timing. > However, can you tell if KSPSolve time includes both PCSetup and PCApply? And if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and PCApply. > > If you do not call KSPSetUp() separately from KSPSolve() then its time is included with KSPSolve(). > > PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on how much of the preconditioner construction can take place early, so depends exactly on the preconditioner used. > > So yes the answer is not totally satisfying. The one thing I would recommend is to not call KSPSetUp() directly and then KSPSolve() will always include the total time of the solve plus all setup time. PCApply will contain all the time to apply the preconditioner but may also include some setup time. > > Barry > > > Best, > Karthik. > > > > > From: Barry Smith > > Date: Tuesday, 28 September 2021 at 16:56 > To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > > Cc: "petsc-users at mcs.anl.gov " > > Subject: Re: [petsc-users] %T (percent time in this phase) > > > > > On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI > wrote: > > Hello, > > I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson problem. 
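On the wish to separate setup time from solve time more cleanly: one standard option, offered here only as a hedged sketch rather than something recommended in this thread, is to wrap the two phases in user-defined logging stages so that -log_view reports them in separate sections. Assuming a KSP named ksp with right-hand side b and solution x:

    PetscLogStage  stage_setup, stage_solve;
    PetscErrorCode ierr;

    ierr = PetscLogStageRegister("MySetUp",&stage_setup);CHKERRQ(ierr);
    ierr = PetscLogStageRegister("MySolve",&stage_solve);CHKERRQ(ierr);

    ierr = PetscLogStagePush(stage_setup);CHKERRQ(ierr);
    ierr = KSPSetUp(ksp);CHKERRQ(ierr);            /* outer setup forced to happen in this stage */
    ierr = PetscLogStagePop();CHKERRQ(ierr);

    ierr = PetscLogStagePush(stage_solve);CHKERRQ(ierr);
    ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);        /* per-block setup (PCSetUpOnBlocks) may still run here */
    ierr = PetscLogStagePop();CHKERRQ(ierr);

Note that Barry recommends not calling KSPSetUp() separately when a single all-inclusive KSPSolve time is wanted; this sketch deliberately does the opposite, to pull the outer setup out of the solve stage, and as Barry explains, some preconditioner construction is still deferred into the solve, so the split remains approximate.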
I noticed from the output from using the flag -log_summary that for various events their respective %T (percent time in this phase) do not add up to 100 but rather exceeds 100. So, I gather there is some overlap among these events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and PCSolve. Is it possible to get a mutually exclusive %T or Time for these individual events? I have attached the log_summary output file from my run for your reference. > > > For nested solvers it is tricky to get the times to be mutually exclusive because some parts of the building of the preconditioner is for some preconditioners delayed until the solve has started. > > It looks like you are using the default preconditioner options which for this example are taking more or less no time since so many iterations are needed. It is best to use -pc_type mg to use geometric multigrid on this problem. > > Barry > > > > > Thanks! > Karthik. > > This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. > > > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marco.cisternino at optimad.it Wed Oct 6 12:20:29 2021 From: marco.cisternino at optimad.it (Marco Cisternino) Date: Wed, 6 Oct 2021 17:20:29 +0000 Subject: [petsc-users] Disconnected domains and Poisson equation In-Reply-To: <5E2505EA-9665-49DF-9D8D-DE6CCF1E0972@petsc.dev> References: <448CEBF7-5B16-4E1C-8D1D-9CC067BD38BB@petsc.dev> <10EA28EF-AD98-4F59-A78D-7DE3D4B585DE@petsc.dev> <3A2F7686-44AA-47A5-B996-461E057F4EC3@petsc.dev> <5E2505EA-9665-49DF-9D8D-DE6CCF1E0972@petsc.dev> Message-ID: Hello Barry. I tried to force the solver to start from an initial guess which is not the solution of the problem. For sake of completeness, the solution has to be a constant field. 
With this initial condition, the solver iterates to a solution which is constant in the 2 sub-domains but * the constants have not the same value * they are not close to zero (minimal norm solution) * they are not opposite (zero-average solution over the whole domain, like 3 and -3) After 20 CFD iterations my pressure is 32 in one sub-domain and 2.2 in the other one. And their norm is increasing. How can I force the solver to give me minimal norm solution, or in other words the zero constant? I can do it by myself, anchoring domain-by-domain the solution removing its local average, but I was wondering if the solver can do this for me. In some way, giving a null space made of 2 vectors (1 on dofs living in the sub-domain and zero elsewhere), I would expect a solution with zero average in the 2 sub-domains, separately, but I?m wrong, probably. Finally, which is the closure of the problem defining the value of the constant? Zero-average condition, minimal norm condition, or none of them? Thanks! Bests, Marco Cisternino, PhD marco.cisternino at optimad.it ______________________ Optimad Engineering Srl Via Bligny 5, Torino, Italia. +3901119719782 www.optimad.it From: Barry Smith Sent: venerd? 1 ottobre 2021 16:56 To: Marco Cisternino Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Disconnected domains and Poisson equation On Oct 1, 2021, at 6:38 AM, Marco Cisternino > wrote: Thank you Barry. I added a custom atoll = 1.0e-12 and this makes the CFD stable with all the linear solver types. CFD solution is good and pressure is a good ?zero? field at every CFD iteration. I did the same test using ASM+ILU+FGMRES(BCGS and GMRES) and the behaviour is the same. During some CFD iteration the residual of linear system starts slightly higher than atol and the linear solver makes some iteration (2/3 iterations) before it stops because of atol. The pressure is still different in the 2 sub-domains (order 1.0e-14 because of those few linear solver iterations), therefore no symmetry of the solution In the 2 sub-domains. I think it is a matter of round-off, do you agree on this? Or do I need to take care of this difference as a symptom of something wrong? Yes, if the differences in the two solutions are order 1.e-14 that is very good, one cannot expect them to be identical. Thank you for your support. Marco Cisternino From: Barry Smith > Sent: gioved? 30 settembre 2021 16:39 To: Marco Cisternino > Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Disconnected domains and Poisson equation It looks like the initial solution (guess) is to round-off the solution to the linear system 9.010260489109e-14 0 KSP unpreconditioned resid norm 9.010260489109e-14 true resid norm 9.010260489109e-14 ||r(i)||/||b|| 2.021559024868e+00 0 KSP Residual norm 9.010260489109e-14 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 1 KSP unpreconditioned resid norm 4.918108339808e-15 true resid norm 4.918171792537e-15 ||r(i)||/||b|| 1.103450292594e-01 1 KSP Residual norm 4.918108339808e-15 % max 9.566256813737e-01 min 9.566256813737e-01 max/min 1.000000000000e+00 2 KSP unpreconditioned resid norm 1.443599554690e-15 true resid norm 1.444867143493e-15 ||r(i)||/||b|| 3.241731154382e-02 2 KSP Residual norm 1.443599554690e-15 % max 9.614019380614e-01 min 7.360950481750e-01 max/min 1.306083963538e+00 Thus the Krylov solver will not be able to improve the solution, it then gets stuck trying to improve the solution but cannot because of round off. 
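For reference, the custom absolute tolerance mentioned above can be set either from the command line (-ksp_atol, as Barry suggests just below) or in code; a minimal sketch, assuming a KSP named ksp:

    /* change only atol; PETSC_DEFAULT leaves rtol, dtol and maxits untouched */
    ierr = KSPSetTolerances(ksp,PETSC_DEFAULT,1.0e-12,PETSC_DEFAULT,PETSC_DEFAULT);CHKERRQ(ierr);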
In other words the algorithm has converged (even at the initial solution (guess) and should stop immediately. You can use -ksp_atol 1.e-12 to get it to stop immediately without iterating if the initial residual is less than 1e-12. Barry On Sep 30, 2021, at 4:16 AM, Marco Cisternino > wrote: Hello Barry. This is the output of ksp_view using fgmres and gamg. It has to be said that the solution of the linear system should be a zero values field. As you can see both unpreconditioned residual and r/b converge at this iteration of the CFD solver. During the time integration of the CFD, I can observe pressure linear solver residuals behaving in a different way: unpreconditioned residual stil converges but r/b stalls. After the output of ksp_view I add the output of ksp_monitor_true_residual for one of these iteration where r/b stalls. Thanks, KSP Object: 1 MPI processes type: fgmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=100, nonzero initial guess tolerances: relative=1e-05, absolute=1e-50, divergence=10000. right preconditioning using UNPRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: gamg type is MULTIPLICATIVE, levels=4 cycles=v Cycles per PCApply=1 Using externally compute Galerkin coarse grid matrices GAMG specific options Threshold for dropping small values in graph on each level = 0.02 0.02 Threshold scaling factor for each level not specified = 1. AGG specific options Symmetric graph true Number of levels to square graph 1 Number smoothing steps 0 Coarse grid solver -- level ------------------------------- KSP Object: (mg_coarse_) 1 MPI processes type: preonly maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (mg_coarse_) 1 MPI processes type: bjacobi number of blocks = 1 Local solve is same for all blocks, in the following KSP and PC objects: KSP Object: (mg_coarse_sub_) 1 MPI processes type: preonly maximum iterations=1, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using DEFAULT norm type for convergence test PC Object: (mg_coarse_sub_) 1 MPI processes type: lu PC has not been set up so information may be incomplete out-of-place factorization tolerance for zero pivot 2.22045e-14 using diagonal shift on blocks to prevent zero pivot [INBLOCKS] matrix ordering: nd linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=18, cols=18 total: nonzeros=104, allocated nonzeros=104 total number of mallocs used during MatSetValues calls =0 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=18, cols=18 total: nonzeros=104, allocated nonzeros=104 total number of mallocs used during MatSetValues calls =0 not using I-node routines Down solver (pre-smoother) on level 1 ------------------------------- KSP Object: (mg_levels_1_) 1 MPI processes type: chebyshev eigenvalue estimates used: min = 0., max = 0. eigenvalues estimate via gmres min 0., max 0. eigenvalues estimated using gmres with translations [0. 0.1; 0. 
1.1] KSP Object: (mg_levels_1_esteig_) 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10, initial guess is zero tolerances: relative=1e-12, absolute=1e-50, divergence=10000. left preconditioning using DEFAULT norm type for convergence test estimating eigenvalues using noisy right hand side maximum iterations=2, nonzero initial guess tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (mg_levels_1_) 1 MPI processes type: sor type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=67, cols=67 total: nonzeros=675, allocated nonzeros=675 total number of mallocs used during MatSetValues calls =0 not using I-node routines Up solver (post-smoother) same as down solver (pre-smoother) Down solver (pre-smoother) on level 2 ------------------------------- KSP Object: (mg_levels_2_) 1 MPI processes type: chebyshev eigenvalue estimates used: min = 0., max = 0. eigenvalues estimate via gmres min 0., max 0. eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1] KSP Object: (mg_levels_2_esteig_) 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10, initial guess is zero tolerances: relative=1e-12, absolute=1e-50, divergence=10000. left preconditioning using DEFAULT norm type for convergence test estimating eigenvalues using noisy right hand side maximum iterations=2, nonzero initial guess tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (mg_levels_2_) 1 MPI processes type: sor type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=348, cols=348 total: nonzeros=3928, allocated nonzeros=3928 total number of mallocs used during MatSetValues calls =0 not using I-node routines Up solver (post-smoother) same as down solver (pre-smoother) Down solver (pre-smoother) on level 3 ------------------------------- KSP Object: (mg_levels_3_) 1 MPI processes type: chebyshev eigenvalue estimates used: min = 0., max = 0. eigenvalues estimate via gmres min 0., max 0. eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1] KSP Object: (mg_levels_3_esteig_) 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10, initial guess is zero tolerances: relative=1e-12, absolute=1e-50, divergence=10000. left preconditioning using DEFAULT norm type for convergence test estimating eigenvalues using noisy right hand side maximum iterations=2, nonzero initial guess tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (mg_levels_3_) 1 MPI processes type: sor type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. 
linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=3584, cols=3584 total: nonzeros=23616, allocated nonzeros=23616 total number of mallocs used during MatSetValues calls =0 has attached null space not using I-node routines Up solver (post-smoother) same as down solver (pre-smoother) linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=3584, cols=3584 total: nonzeros=23616, allocated nonzeros=23616 total number of mallocs used during MatSetValues calls =0 has attached null space not using I-node routines Pressure system has reached convergence in 0 iterations with reason 3. 0 KSP unpreconditioned resid norm 4.798763170703e-16 true resid norm 4.798763170703e-16 ||r(i)||/||b|| 1.000000000000e+00 0 KSP Residual norm 4.798763170703e-16 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 1 KSP unpreconditioned resid norm 1.648749109132e-17 true resid norm 1.648749109132e-17 ||r(i)||/||b|| 3.435779284125e-02 1 KSP Residual norm 1.648749109132e-17 % max 9.561792537103e-01 min 9.561792537103e-01 max/min 1.000000000000e+00 2 KSP unpreconditioned resid norm 4.737880600040e-19 true resid norm 4.737880600040e-19 ||r(i)||/||b|| 9.873128619820e-04 2 KSP Residual norm 4.737880600040e-19 % max 9.828636644296e-01 min 9.293131521763e-01 max/min 1.057623753767e+00 3 KSP unpreconditioned resid norm 2.542212716830e-20 true resid norm 2.542212716830e-20 ||r(i)||/||b|| 5.297641551371e-05 3 KSP Residual norm 2.542212716830e-20 % max 9.933572357920e-01 min 9.158303248850e-01 max/min 1.084652046127e+00 4 KSP unpreconditioned resid norm 6.614510286263e-21 true resid norm 6.614510286269e-21 ||r(i)||/||b|| 1.378378146822e-05 4 KSP Residual norm 6.614510286263e-21 % max 9.950912550705e-01 min 6.296575800237e-01 max/min 1.580368896747e+00 5 KSP unpreconditioned resid norm 1.981505525281e-22 true resid norm 1.981505525272e-22 ||r(i)||/||b|| 4.129200493513e-07 5 KSP Residual norm 1.981505525281e-22 % max 9.984097962703e-01 min 5.316259535293e-01 max/min 1.878030577029e+00 Linear solve converged due to CONVERGED_RTOL iterations 5 Ksp_monitor_true_residual output for stalling r/b CFD iteration 0 KSP unpreconditioned resid norm 9.010260489109e-14 true resid norm 9.010260489109e-14 ||r(i)||/||b|| 2.021559024868e+00 0 KSP Residual norm 9.010260489109e-14 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 1 KSP unpreconditioned resid norm 4.918108339808e-15 true resid norm 4.918171792537e-15 ||r(i)||/||b|| 1.103450292594e-01 1 KSP Residual norm 4.918108339808e-15 % max 9.566256813737e-01 min 9.566256813737e-01 max/min 1.000000000000e+00 2 KSP unpreconditioned resid norm 1.443599554690e-15 true resid norm 1.444867143493e-15 ||r(i)||/||b|| 3.241731154382e-02 2 KSP Residual norm 1.443599554690e-15 % max 9.614019380614e-01 min 7.360950481750e-01 max/min 1.306083963538e+00 3 KSP unpreconditioned resid norm 6.623206616803e-16 true resid norm 6.654132553541e-16 ||r(i)||/||b|| 1.492933720678e-02 3 KSP Residual norm 6.623206616803e-16 % max 9.764112945239e-01 min 4.911485418014e-01 max/min 1.988016274960e+00 4 KSP unpreconditioned resid norm 6.551896936698e-16 true resid norm 6.646157296305e-16 ||r(i)||/||b|| 1.491144376933e-02 4 KSP Residual norm 6.551896936698e-16 % max 9.883425885532e-01 min 1.461270778833e-01 max/min 6.763582786091e+00 5 KSP unpreconditioned resid norm 6.222297644887e-16 true resid norm 1.720560536914e-15 ||r(i)||/||b|| 3.860282047823e-02 5 KSP Residual norm 6.222297644887e-16 % max 1.000409371755e+00 min 
4.989767363560e-03 max/min 2.004921870829e+02 6 KSP unpreconditioned resid norm 6.496945794974e-17 true resid norm 2.031914800253e-14 ||r(i)||/||b|| 4.558842341106e-01 6 KSP Residual norm 6.496945794974e-17 % max 1.004914985753e+00 min 1.459258738706e-03 max/min 6.886475709192e+02 7 KSP unpreconditioned resid norm 1.965237342540e-17 true resid norm 1.684522207337e-14 ||r(i)||/||b|| 3.779425772373e-01 7 KSP Residual norm 1.965237342540e-17 % max 1.005737762541e+00 min 1.452603803766e-03 max/min 6.923689446035e+02 8 KSP unpreconditioned resid norm 1.627718951285e-17 true resid norm 1.958642967520e-14 ||r(i)||/||b|| 4.394448276241e-01 8 KSP Residual norm 1.627718951285e-17 % max 1.006364278765e+00 min 1.452081813014e-03 max/min 6.930492963590e+02 9 KSP unpreconditioned resid norm 1.616577677764e-17 true resid norm 2.019110946644e-14 ||r(i)||/||b|| 4.530115373837e-01 9 KSP Residual norm 1.616577677764e-17 % max 1.006648747131e+00 min 1.452031376577e-03 max/min 6.932692801059e+02 10 KSP unpreconditioned resid norm 1.285788988203e-17 true resid norm 2.065082694477e-14 ||r(i)||/||b|| 4.633258453698e-01 10 KSP Residual norm 1.285788988203e-17 % max 1.007469033514e+00 min 1.433291867068e-03 max/min 7.029057072477e+02 11 KSP unpreconditioned resid norm 5.490854431580e-19 true resid norm 1.798071628891e-14 ||r(i)||/||b|| 4.034187394623e-01 11 KSP Residual norm 5.490854431580e-19 % max 1.008058905554e+00 min 1.369401685301e-03 max/min 7.361309076612e+02 12 KSP unpreconditioned resid norm 1.371754802104e-20 true resid norm 1.965688920064e-14 ||r(i)||/||b|| 4.410256708163e-01 12 KSP Residual norm 1.371754802104e-20 % max 1.008409402214e+00 min 1.369243011779e-03 max/min 7.364721919624e+02 Linear solve converged due to CONVERGED_RTOL iterations 12 Marco Cisternino From: Barry Smith > Sent: mercoled? 29 settembre 2021 18:34 To: Marco Cisternino > Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Disconnected domains and Poisson equation On Sep 29, 2021, at 11:59 AM, Marco Cisternino > wrote: For sake of completeness, explicitly building the null space using a vector per sub-domain make s the CFD runs using BCGS and GMRES more stable, but still slower than FGMRES. Something is strange. Please run with -ksp_view and send the output on the solver details. I had divergence using BCGS and GMRES setting the null space with only one constant. Thanks Marco Cisternino From: Marco Cisternino Sent: mercoled? 29 settembre 2021 17:54 To: Barry Smith > Cc: petsc-users at mcs.anl.gov Subject: RE: [petsc-users] Disconnected domains and Poisson equation Thank you Barry for the quick reply. About the null space: I already tried what you suggest, building 2 Vec (constants) with 0 and 1 chosen by sub-domain, normalizing them and setting the null space like this MatNullSpaceCreate(PETSC_COMM_WORLD,PETSC_FALSE,nconstants,constants,&nullspace); The solution is slightly different in values but it is still different in the two sub-domains. About the solver: I tried BCGS, GMRES and FGMRES. The linear system is a pressure system in a navier-stokes solver and only solving with FGMRES makes the CFD stable, with BCGS and GMRES the CFD solution diverges. Moreover, in the same case but with a single domain, CFD solution is stable using all the solvers, but FGMRES converges in much less iterations than the others. Marco Cisternino From: Barry Smith > Sent: mercoled? 
29 settembre 2021 15:59 To: Marco Cisternino > Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Disconnected domains and Poisson equation The problem actually has a two dimensional null space; constant on each domain but possibly different constants. I think you need to build the MatNullSpace by explicitly constructing two vectors, one with 0 on one domain and constant value on the other and one with 0 on the other domain and constant on the first. Separate note: why use FGMRES instead of just GMRES? If the problem is linear and the preconditioner is linear (no GMRES inside the smoother) then you can just use GMRES and it will save a little space/work and be conceptually clearer. Barry On Sep 29, 2021, at 8:46 AM, Marco Cisternino > wrote: Good morning, I want to solve the Poisson equation on a 3D domain with 2 non-connected sub-domains. I am using FGMRES+GAMG and I have no problem if the two sub-domains see a Dirichlet boundary condition each. On the same domain I would like to solve the Poisson equation imposing periodic boundary condition in one direction and homogenous Neumann boundary conditions in the other two directions. The two sub-domains are symmetric with respect to the separation between them and the operator discretization and the right hand side are symmetric as well. It would be nice to have the same solution in both the sub-domains. Setting the null space to the constant, the solver converges to a solution having the same gradients in both sub-domains but different values. Am I doing some wrong with the null space? I?m not setting a block matrix (one block for each sub-domain), should I? I tested the null space against the matrix using MatNullSpaceTest and the answer is true. Can I do something more to have a symmetric solution as outcome of the solver? Thank you in advance for any comments and hints. Best regards, Marco Cisternino -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Wed Oct 6 13:08:02 2021 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 6 Oct 2021 14:08:02 -0400 Subject: [petsc-users] Disconnected domains and Poisson equation In-Reply-To: References: <448CEBF7-5B16-4E1C-8D1D-9CC067BD38BB@petsc.dev> <10EA28EF-AD98-4F59-A78D-7DE3D4B585DE@petsc.dev> <3A2F7686-44AA-47A5-B996-461E057F4EC3@petsc.dev> <5E2505EA-9665-49DF-9D8D-DE6CCF1E0972@petsc.dev> Message-ID: On Wed, Oct 6, 2021 at 1:20 PM Marco Cisternino wrote: > Hello Barry. > > I tried to force the solver to start from an initial guess which is not > the solution of the problem. For sake of completeness, the solution has to > be a constant field. > With this initial condition, the solver iterates to a solution which is > constant in the 2 sub-domains but > > - the constants have not the same value > - they are not close to zero (minimal norm solution) > - they are not opposite (zero-average solution over the whole domain, > like 3 and -3) > > After 20 CFD iterations my pressure is 32 in one sub-domain and 2.2 in the > other one. And their norm is increasing. > > How can I force the solver to give me minimal norm solution, or in other > words the zero constant? > > I can do it by myself, anchoring domain-by-domain the solution removing > its local average, but I was wondering if the solver can do this for me. > This is the point of providing that null space to the solver. If you give the constant vector on each subdomain, then the average of the pressure on each domain will be 0 if that is consistent with your forcing. 
MatSetNullSpace() can take any number of vectors. Thanks, Matt > In some way, giving a null space made of 2 vectors (1 on dofs living in > the sub-domain and zero elsewhere), I would expect a solution with zero > average in the 2 sub-domains, separately, but I?m wrong, probably. > Finally, which is the closure of the problem defining the value of the > constant? Zero-average condition, minimal norm condition, or none of them? > > > > Thanks! > > > > Bests, > > > > Marco Cisternino, PhD > marco.cisternino at optimad.it > > ______________________ > > Optimad Engineering Srl > > Via Bligny 5, Torino, Italia. > +3901119719782 > www.optimad.it > > > > *From:* Barry Smith > *Sent:* venerd? 1 ottobre 2021 16:56 > *To:* Marco Cisternino > *Cc:* petsc-users at mcs.anl.gov > *Subject:* Re: [petsc-users] Disconnected domains and Poisson equation > > > > > > > > On Oct 1, 2021, at 6:38 AM, Marco Cisternino > wrote: > > > > Thank you Barry. > > I added a custom atoll = 1.0e-12 and this makes the CFD stable with all > the linear solver types. CFD solution is good and pressure is a good ?zero? > field at every CFD iteration. > > I did the same test using ASM+ILU+FGMRES(BCGS and GMRES) and the behaviour > is the same. > > During some CFD iteration the residual of linear system starts slightly > higher than atol and the linear solver makes some iteration (2/3 > iterations) before it stops because of atol. > > The pressure is still different in the 2 sub-domains (order 1.0e-14 > because of those few linear solver iterations), therefore no symmetry of > the solution In the 2 sub-domains. > > I think it is a matter of round-off, do you agree on this? Or do I need to > take care of this difference as a symptom of something wrong? > > > > Yes, if the differences in the two solutions are order 1.e-14 that is > very good, one cannot expect them to be identical. > > > > Thank you for your support. > > > > Marco Cisternino > > > > *From:* Barry Smith > *Sent:* gioved? 30 settembre 2021 16:39 > *To:* Marco Cisternino > *Cc:* petsc-users at mcs.anl.gov > *Subject:* Re: [petsc-users] Disconnected domains and Poisson equation > > > > > > It looks like the initial solution (guess) is to round-off the solution > to the linear system 9.010260489109e-14 > > > > 0 KSP unpreconditioned resid norm 9.010260489109e-14 true resid norm > 9.010260489109e-14 ||r(i)||/||b|| 2.021559024868e+00 > > 0 KSP Residual norm 9.010260489109e-14 % max 1.000000000000e+00 min > 1.000000000000e+00 max/min 1.000000000000e+00 > > 1 KSP unpreconditioned resid norm 4.918108339808e-15 true resid norm > 4.918171792537e-15 ||r(i)||/||b|| 1.103450292594e-01 > > 1 KSP Residual norm 4.918108339808e-15 % max 9.566256813737e-01 min > 9.566256813737e-01 max/min 1.000000000000e+00 > > 2 KSP unpreconditioned resid norm 1.443599554690e-15 true resid norm > 1.444867143493e-15 ||r(i)||/||b|| 3.241731154382e-02 > > 2 KSP Residual norm 1.443599554690e-15 % max 9.614019380614e-01 min > 7.360950481750e-01 max/min 1.306083963538e+00 > > > > Thus the Krylov solver will not be able to improve the solution, it then > gets stuck trying to improve the solution but cannot because of round off. > > > > In other words the algorithm has converged (even at the initial solution > (guess) and should stop immediately. > > > > You can use -ksp_atol 1.e-12 to get it to stop immediately without > iterating if the initial residual is less than 1e-12. > > > > Barry > > > > > > > > > On Sep 30, 2021, at 4:16 AM, Marco Cisternino > wrote: > > > > Hello Barry. 
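To make that concrete, here is a minimal sketch of building the two-vector null space described above. A stands for the pressure matrix and InFirstSubdomain() is a hypothetical placeholder for whatever test the application uses to decide which sub-domain a local dof belongs to; neither name comes from the thread. The two indicator vectors have disjoint supports, so after normalization they form the orthonormal set MatNullSpaceCreate() expects; the same call, with nconstants vectors, appears in the message quoted further down.

    Vec            constants[2];
    MatNullSpace   nullspace;
    PetscInt       row, rstart, rend;
    PetscErrorCode ierr;

    ierr = MatCreateVecs(A,&constants[0],NULL);CHKERRQ(ierr);
    ierr = VecDuplicate(constants[0],&constants[1]);CHKERRQ(ierr);
    ierr = VecSet(constants[0],0.0);CHKERRQ(ierr);
    ierr = VecSet(constants[1],0.0);CHKERRQ(ierr);
    ierr = VecGetOwnershipRange(constants[0],&rstart,&rend);CHKERRQ(ierr);
    for (row = rstart; row < rend; row++) {
      if (InFirstSubdomain(row)) {   /* hypothetical application-provided test */
        ierr = VecSetValue(constants[0],row,1.0,INSERT_VALUES);CHKERRQ(ierr);
      } else {
        ierr = VecSetValue(constants[1],row,1.0,INSERT_VALUES);CHKERRQ(ierr);
      }
    }
    ierr = VecAssemblyBegin(constants[0]);CHKERRQ(ierr); ierr = VecAssemblyEnd(constants[0]);CHKERRQ(ierr);
    ierr = VecAssemblyBegin(constants[1]);CHKERRQ(ierr); ierr = VecAssemblyEnd(constants[1]);CHKERRQ(ierr);
    ierr = VecNormalize(constants[0],NULL);CHKERRQ(ierr);  /* null space vectors must be orthonormal */
    ierr = VecNormalize(constants[1],NULL);CHKERRQ(ierr);
    ierr = MatNullSpaceCreate(PETSC_COMM_WORLD,PETSC_FALSE,2,constants,&nullspace);CHKERRQ(ierr);
    ierr = MatSetNullSpace(A,nullspace);CHKERRQ(ierr);
    ierr = MatNullSpaceDestroy(&nullspace);CHKERRQ(ierr);  /* the matrix keeps its own reference */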
> > This is the output of ksp_view using fgmres and gamg. It has to be said > that the solution of the linear system should be a zero values field. As > you can see both unpreconditioned residual and r/b converge at this > iteration of the CFD solver. During the time integration of the CFD, I can > observe pressure linear solver residuals behaving in a different way: > unpreconditioned residual stil converges but r/b stalls. After the output > of ksp_view I add the output of ksp_monitor_true_residual for one of these > iteration where r/b stalls. > Thanks, > > > > KSP Object: 1 MPI processes > > type: fgmres > > restart=30, using Classical (unmodified) Gram-Schmidt > Orthogonalization with no iterative refinement > > happy breakdown tolerance 1e-30 > > maximum iterations=100, nonzero initial guess > > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > > right preconditioning > > using UNPRECONDITIONED norm type for convergence test > > PC Object: 1 MPI processes > > type: gamg > > type is MULTIPLICATIVE, levels=4 cycles=v > > Cycles per PCApply=1 > > Using externally compute Galerkin coarse grid matrices > > GAMG specific options > > Threshold for dropping small values in graph on each level = > 0.02 0.02 > > Threshold scaling factor for each level not specified = 1. > > AGG specific options > > Symmetric graph true > > Number of levels to square graph 1 > > Number smoothing steps 0 > > Coarse grid solver -- level ------------------------------- > > KSP Object: (mg_coarse_) 1 MPI processes > > type: preonly > > maximum iterations=10000, initial guess is zero > > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > > left preconditioning > > using NONE norm type for convergence test > > PC Object: (mg_coarse_) 1 MPI processes > > type: bjacobi > > number of blocks = 1 > > Local solve is same for all blocks, in the following KSP and PC > objects: > > KSP Object: (mg_coarse_sub_) 1 MPI processes > > type: preonly > > maximum iterations=1, initial guess is zero > > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > > left preconditioning > > using DEFAULT norm type for convergence test > > PC Object: (mg_coarse_sub_) 1 MPI processes > > type: lu > > PC has not been set up so information may be incomplete > > out-of-place factorization > > tolerance for zero pivot 2.22045e-14 > > using diagonal shift on blocks to prevent zero pivot [INBLOCKS] > > matrix ordering: nd > > linear system matrix = precond matrix: > > Mat Object: 1 MPI processes > > type: seqaij > > rows=18, cols=18 > > total: nonzeros=104, allocated nonzeros=104 > > total number of mallocs used during MatSetValues calls =0 > > not using I-node routines > > linear system matrix = precond matrix: > > Mat Object: 1 MPI processes > > type: seqaij > > rows=18, cols=18 > > total: nonzeros=104, allocated nonzeros=104 > > total number of mallocs used during MatSetValues calls =0 > > not using I-node routines > > Down solver (pre-smoother) on level 1 ------------------------------- > > KSP Object: (mg_levels_1_) 1 MPI processes > > type: chebyshev > > eigenvalue estimates used: min = 0., max = 0. > > eigenvalues estimate via gmres min 0., max 0. > > eigenvalues estimated using gmres with translations [0. 0.1; 0. 
> 1.1] > > KSP Object: (mg_levels_1_esteig_) 1 MPI processes > > type: gmres > > restart=30, using Classical (unmodified) Gram-Schmidt > Orthogonalization with no iterative refinement > > happy breakdown tolerance 1e-30 > > maximum iterations=10, initial guess is zero > > tolerances: relative=1e-12, absolute=1e-50, divergence=10000. > > left preconditioning > > using DEFAULT norm type for convergence test > > estimating eigenvalues using noisy right hand side > > maximum iterations=2, nonzero initial guess > > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > > left preconditioning > > using NONE norm type for convergence test > > PC Object: (mg_levels_1_) 1 MPI processes > > type: sor > > type = local_symmetric, iterations = 1, local iterations = 1, > omega = 1. > > linear system matrix = precond matrix: > > Mat Object: 1 MPI processes > > type: seqaij > > rows=67, cols=67 > > total: nonzeros=675, allocated nonzeros=675 > > total number of mallocs used during MatSetValues calls =0 > > not using I-node routines > > Up solver (post-smoother) same as down solver (pre-smoother) > > Down solver (pre-smoother) on level 2 ------------------------------- > > KSP Object: (mg_levels_2_) 1 MPI processes > > type: chebyshev > > eigenvalue estimates used: min = 0., max = 0. > > eigenvalues estimate via gmres min 0., max 0. > > eigenvalues estimated using gmres with translations [0. 0.1; 0. > 1.1] > > KSP Object: (mg_levels_2_esteig_) 1 MPI processes > > type: gmres > > restart=30, using Classical (unmodified) Gram-Schmidt > Orthogonalization with no iterative refinement > > happy breakdown tolerance 1e-30 > > maximum iterations=10, initial guess is zero > > tolerances: relative=1e-12, absolute=1e-50, divergence=10000. > > left preconditioning > > using DEFAULT norm type for convergence test > > estimating eigenvalues using noisy right hand side > > maximum iterations=2, nonzero initial guess > > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > > left preconditioning > > using NONE norm type for convergence test > > PC Object: (mg_levels_2_) 1 MPI processes > > type: sor > > type = local_symmetric, iterations = 1, local iterations = 1, > omega = 1. > > linear system matrix = precond matrix: > > Mat Object: 1 MPI processes > > type: seqaij > > rows=348, cols=348 > > total: nonzeros=3928, allocated nonzeros=3928 > > total number of mallocs used during MatSetValues calls =0 > > not using I-node routines > > Up solver (post-smoother) same as down solver (pre-smoother) > > Down solver (pre-smoother) on level 3 ------------------------------- > > KSP Object: (mg_levels_3_) 1 MPI processes > > type: chebyshev > > eigenvalue estimates used: min = 0., max = 0. > > eigenvalues estimate via gmres min 0., max 0. > > eigenvalues estimated using gmres with translations [0. 0.1; 0. > 1.1] > > KSP Object: (mg_levels_3_esteig_) 1 MPI processes > > type: gmres > > restart=30, using Classical (unmodified) Gram-Schmidt > Orthogonalization with no iterative refinement > > happy breakdown tolerance 1e-30 > > maximum iterations=10, initial guess is zero > > tolerances: relative=1e-12, absolute=1e-50, divergence=10000. > > left preconditioning > > using DEFAULT norm type for convergence test > > estimating eigenvalues using noisy right hand side > > maximum iterations=2, nonzero initial guess > > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
> > left preconditioning > > using NONE norm type for convergence test > > PC Object: (mg_levels_3_) 1 MPI processes > > type: sor > > type = local_symmetric, iterations = 1, local iterations = 1, > omega = 1. > > linear system matrix = precond matrix: > > Mat Object: 1 MPI processes > > type: seqaij > > rows=3584, cols=3584 > > total: nonzeros=23616, allocated nonzeros=23616 > > total number of mallocs used during MatSetValues calls =0 > > has attached null space > > not using I-node routines > > Up solver (post-smoother) same as down solver (pre-smoother) > > linear system matrix = precond matrix: > > Mat Object: 1 MPI processes > > type: seqaij > > rows=3584, cols=3584 > > total: nonzeros=23616, allocated nonzeros=23616 > > total number of mallocs used during MatSetValues calls =0 > > has attached null space > > not using I-node routines > > Pressure system has reached convergence in 0 iterations with reason 3. > > 0 KSP unpreconditioned resid norm 4.798763170703e-16 true resid norm > 4.798763170703e-16 ||r(i)||/||b|| 1.000000000000e+00 > > 0 KSP Residual norm 4.798763170703e-16 % max 1.000000000000e+00 min > 1.000000000000e+00 max/min 1.000000000000e+00 > > 1 KSP unpreconditioned resid norm 1.648749109132e-17 true resid norm > 1.648749109132e-17 ||r(i)||/||b|| 3.435779284125e-02 > > 1 KSP Residual norm 1.648749109132e-17 % max 9.561792537103e-01 min > 9.561792537103e-01 max/min 1.000000000000e+00 > > 2 KSP unpreconditioned resid norm 4.737880600040e-19 true resid norm > 4.737880600040e-19 ||r(i)||/||b|| 9.873128619820e-04 > > 2 KSP Residual norm 4.737880600040e-19 % max 9.828636644296e-01 min > 9.293131521763e-01 max/min 1.057623753767e+00 > > 3 KSP unpreconditioned resid norm 2.542212716830e-20 true resid norm > 2.542212716830e-20 ||r(i)||/||b|| 5.297641551371e-05 > > 3 KSP Residual norm 2.542212716830e-20 % max 9.933572357920e-01 min > 9.158303248850e-01 max/min 1.084652046127e+00 > > 4 KSP unpreconditioned resid norm 6.614510286263e-21 true resid norm > 6.614510286269e-21 ||r(i)||/||b|| 1.378378146822e-05 > > 4 KSP Residual norm 6.614510286263e-21 % max 9.950912550705e-01 min > 6.296575800237e-01 max/min 1.580368896747e+00 > > 5 KSP unpreconditioned resid norm 1.981505525281e-22 true resid norm > 1.981505525272e-22 ||r(i)||/||b|| 4.129200493513e-07 > > 5 KSP Residual norm 1.981505525281e-22 % max 9.984097962703e-01 min > 5.316259535293e-01 max/min 1.878030577029e+00 > > Linear solve converged due to CONVERGED_RTOL iterations 5 > > > > Ksp_monitor_true_residual output for stalling r/b CFD iteration > 0 KSP unpreconditioned resid norm 9.010260489109e-14 true resid norm > 9.010260489109e-14 ||r(i)||/||b|| 2.021559024868e+00 > > 0 KSP Residual norm 9.010260489109e-14 % max 1.000000000000e+00 min > 1.000000000000e+00 max/min 1.000000000000e+00 > > 1 KSP unpreconditioned resid norm 4.918108339808e-15 true resid norm > 4.918171792537e-15 ||r(i)||/||b|| 1.103450292594e-01 > > 1 KSP Residual norm 4.918108339808e-15 % max 9.566256813737e-01 min > 9.566256813737e-01 max/min 1.000000000000e+00 > > 2 KSP unpreconditioned resid norm 1.443599554690e-15 true resid norm > 1.444867143493e-15 ||r(i)||/||b|| 3.241731154382e-02 > > 2 KSP Residual norm 1.443599554690e-15 % max 9.614019380614e-01 min > 7.360950481750e-01 max/min 1.306083963538e+00 > > 3 KSP unpreconditioned resid norm 6.623206616803e-16 true resid norm > 6.654132553541e-16 ||r(i)||/||b|| 1.492933720678e-02 > > 3 KSP Residual norm 6.623206616803e-16 % max 9.764112945239e-01 min > 4.911485418014e-01 max/min 1.988016274960e+00 > > 4 KSP 
unpreconditioned resid norm 6.551896936698e-16 true resid norm > 6.646157296305e-16 ||r(i)||/||b|| 1.491144376933e-02 > > 4 KSP Residual norm 6.551896936698e-16 % max 9.883425885532e-01 min > 1.461270778833e-01 max/min 6.763582786091e+00 > > 5 KSP unpreconditioned resid norm 6.222297644887e-16 true resid norm > 1.720560536914e-15 ||r(i)||/||b|| 3.860282047823e-02 > > 5 KSP Residual norm 6.222297644887e-16 % max 1.000409371755e+00 min > 4.989767363560e-03 max/min 2.004921870829e+02 > > 6 KSP unpreconditioned resid norm 6.496945794974e-17 true resid norm > 2.031914800253e-14 ||r(i)||/||b|| 4.558842341106e-01 > > 6 KSP Residual norm 6.496945794974e-17 % max 1.004914985753e+00 min > 1.459258738706e-03 max/min 6.886475709192e+02 > > 7 KSP unpreconditioned resid norm 1.965237342540e-17 true resid norm > 1.684522207337e-14 ||r(i)||/||b|| 3.779425772373e-01 > > 7 KSP Residual norm 1.965237342540e-17 % max 1.005737762541e+00 min > 1.452603803766e-03 max/min 6.923689446035e+02 > > 8 KSP unpreconditioned resid norm 1.627718951285e-17 true resid norm > 1.958642967520e-14 ||r(i)||/||b|| 4.394448276241e-01 > > 8 KSP Residual norm 1.627718951285e-17 % max 1.006364278765e+00 min > 1.452081813014e-03 max/min 6.930492963590e+02 > > 9 KSP unpreconditioned resid norm 1.616577677764e-17 true resid norm > 2.019110946644e-14 ||r(i)||/||b|| 4.530115373837e-01 > > 9 KSP Residual norm 1.616577677764e-17 % max 1.006648747131e+00 min > 1.452031376577e-03 max/min 6.932692801059e+02 > > 10 KSP unpreconditioned resid norm 1.285788988203e-17 true resid norm > 2.065082694477e-14 ||r(i)||/||b|| 4.633258453698e-01 > > 10 KSP Residual norm 1.285788988203e-17 % max 1.007469033514e+00 min > 1.433291867068e-03 max/min 7.029057072477e+02 > > 11 KSP unpreconditioned resid norm 5.490854431580e-19 true resid norm > 1.798071628891e-14 ||r(i)||/||b|| 4.034187394623e-01 > > 11 KSP Residual norm 5.490854431580e-19 % max 1.008058905554e+00 min > 1.369401685301e-03 max/min 7.361309076612e+02 > > 12 KSP unpreconditioned resid norm 1.371754802104e-20 true resid norm > 1.965688920064e-14 ||r(i)||/||b|| 4.410256708163e-01 > > 12 KSP Residual norm 1.371754802104e-20 % max 1.008409402214e+00 min > 1.369243011779e-03 max/min 7.364721919624e+02 > > Linear solve converged due to CONVERGED_RTOL iterations 12 > > > > > > > > Marco Cisternino > > > > *From:* Barry Smith > *Sent:* mercoled? 29 settembre 2021 18:34 > *To:* Marco Cisternino > *Cc:* petsc-users at mcs.anl.gov > *Subject:* Re: [petsc-users] Disconnected domains and Poisson equation > > > > > > > > > > On Sep 29, 2021, at 11:59 AM, Marco Cisternino < > marco.cisternino at optimad.it> wrote: > > > > For sake of completeness, explicitly building the null space using a > vector per sub-domain make s the CFD runs using BCGS and GMRES more stable, > but still slower than FGMRES. > > > > Something is strange. Please run with -ksp_view and send the output on > the solver details. > > > > > > I had divergence using BCGS and GMRES setting the null space with only one > constant. > > Thanks > > > > Marco Cisternino > > > > *From:* Marco Cisternino > *Sent:* mercoled? 29 settembre 2021 17:54 > *To:* Barry Smith > *Cc:* petsc-users at mcs.anl.gov > *Subject:* RE: [petsc-users] Disconnected domains and Poisson equation > > > > Thank you Barry for the quick reply. 
> > About the null space: I already tried what you suggest, building 2 Vec > (constants) with 0 and 1 chosen by sub-domain, normalizing them and setting > the null space like this > > > MatNullSpaceCreate(PETSC_COMM_WORLD,PETSC_FALSE,nconstants,constants,&nullspace); > > The solution is slightly different in values but it is still different in > the two sub-domains. > > About the solver: I tried BCGS, GMRES and FGMRES. The linear system is a > pressure system in a navier-stokes solver and only solving with FGMRES > makes the CFD stable, with BCGS and GMRES the CFD solution diverges. > Moreover, in the same case but with a single domain, CFD solution is stable > using all the solvers, but FGMRES converges in much less iterations than > the others. > > > > Marco Cisternino > > > > *From:* Barry Smith > *Sent:* mercoled? 29 settembre 2021 15:59 > *To:* Marco Cisternino > *Cc:* petsc-users at mcs.anl.gov > *Subject:* Re: [petsc-users] Disconnected domains and Poisson equation > > > > > > The problem actually has a two dimensional null space; constant on each > domain but possibly different constants. I think you need to build the > MatNullSpace by explicitly constructing two vectors, one with 0 on one > domain and constant value on the other and one with 0 on the other domain > and constant on the first. > > > > Separate note: why use FGMRES instead of just GMRES? If the problem is > linear and the preconditioner is linear (no GMRES inside the smoother) then > you can just use GMRES and it will save a little space/work and be > conceptually clearer. > > > > Barry > > > > On Sep 29, 2021, at 8:46 AM, Marco Cisternino > wrote: > > > > Good morning, > > I want to solve the Poisson equation on a 3D domain with 2 non-connected > sub-domains. > > I am using FGMRES+GAMG and I have no problem if the two sub-domains see a > Dirichlet boundary condition each. > > On the same domain I would like to solve the Poisson equation imposing > periodic boundary condition in one direction and homogenous Neumann > boundary conditions in the other two directions. The two sub-domains are > symmetric with respect to the separation between them and the operator > discretization and the right hand side are symmetric as well. It would be > nice to have the same solution in both the sub-domains. > > Setting the null space to the constant, the solver converges to a solution > having the same gradients in both sub-domains but different values. > > Am I doing some wrong with the null space? I?m not setting a block matrix > (one block for each sub-domain), should I? > > I tested the null space against the matrix using MatNullSpaceTest and the > answer is true. Can I do something more to have a symmetric solution as > outcome of the solver? > > Thank you in advance for any comments and hints. > > > > Best regards, > > > > Marco Cisternino > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bsmith at petsc.dev Wed Oct 6 13:13:54 2021 From: bsmith at petsc.dev (Barry Smith) Date: Wed, 6 Oct 2021 14:13:54 -0400 Subject: [petsc-users] Disconnected domains and Poisson equation In-Reply-To: References: <448CEBF7-5B16-4E1C-8D1D-9CC067BD38BB@petsc.dev> <10EA28EF-AD98-4F59-A78D-7DE3D4B585DE@petsc.dev> <3A2F7686-44AA-47A5-B996-461E057F4EC3@petsc.dev> <5E2505EA-9665-49DF-9D8D-DE6CCF1E0972@petsc.dev> Message-ID: > On Oct 6, 2021, at 1:20 PM, Marco Cisternino wrote: > > Hello Barry. > I tried to force the solver to start from an initial guess which is not the solution of the problem. For sake of completeness, the solution has to be a constant field. > With this initial condition, the solver iterates to a solution which is constant in the 2 sub-domains but > the constants have not the same value > they are not close to zero (minimal norm solution) > they are not opposite (zero-average solution over the whole domain, like 3 and -3) > After 20 CFD iterations my pressure is 32 in one sub-domain and 2.2 in the other one. And their norm is increasing. > How can I force the solver to give me minimal norm solution, or in other words the zero constant? Providing the appropriate null space should result in GMRES giving you the minimal norm solution which corresponds to the average of the solution on each domain being zero. For a general right hand side the residual will not decrease to zero because the right hand side is inconsistent. You can still run GMRES and the solution will converge but you will have a nonzero residual at the end (this makes the stopping criteria harder so it is useful to remove the inconsistent part of the right hand side out from the right hand side before calling GMRES.) Barry > I can do it by myself, anchoring domain-by-domain the solution removing its local average, but I was wondering if the solver can do this for me. > In some way, giving a null space made of 2 vectors (1 on dofs living in the sub-domain and zero elsewhere), I would expect a solution with zero average in the 2 sub-domains, separately, but I?m wrong, probably. > Finally, which is the closure of the problem defining the value of the constant? Zero-average condition, minimal norm condition, or none of them? > > Thanks! > > Bests, > > Marco Cisternino, PhD > marco.cisternino at optimad.it > ______________________ > Optimad Engineering Srl > Via Bligny 5, Torino, Italia. > +3901119719782 > www.optimad.it > > From: Barry Smith > > Sent: venerd? 1 ottobre 2021 16:56 > To: Marco Cisternino > > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] Disconnected domains and Poisson equation > > > > > On Oct 1, 2021, at 6:38 AM, Marco Cisternino > wrote: > > Thank you Barry. > I added a custom atoll = 1.0e-12 and this makes the CFD stable with all the linear solver types. CFD solution is good and pressure is a good ?zero? field at every CFD iteration. > I did the same test using ASM+ILU+FGMRES(BCGS and GMRES) and the behaviour is the same. > During some CFD iteration the residual of linear system starts slightly higher than atol and the linear solver makes some iteration (2/3 iterations) before it stops because of atol. > The pressure is still different in the 2 sub-domains (order 1.0e-14 because of those few linear solver iterations), therefore no symmetry of the solution In the 2 sub-domains. > I think it is a matter of round-off, do you agree on this? Or do I need to take care of this difference as a symptom of something wrong? 
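A short sketch of Barry's suggestion above, removing the inconsistent component of the right-hand side before the solve; it assumes the null space built earlier is attached to the pressure matrix A, and that b and x are the right-hand side and solution vectors:

    MatNullSpace   nullspace;
    PetscErrorCode ierr;

    ierr = MatGetNullSpace(A,&nullspace);CHKERRQ(ierr);        /* null space attached with MatSetNullSpace */
    if (nullspace) {
      ierr = MatNullSpaceRemove(nullspace,b);CHKERRQ(ierr);    /* project the inconsistent part out of b */
    }
    ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);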
> > Yes, if the differences in the two solutions are order 1.e-14 that is very good, one cannot expect them to be identical. > > > Thank you for your support. > > Marco Cisternino > > From: Barry Smith > > Sent: gioved? 30 settembre 2021 16:39 > To: Marco Cisternino > > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] Disconnected domains and Poisson equation > > > It looks like the initial solution (guess) is to round-off the solution to the linear system 9.010260489109e-14 > > 0 KSP unpreconditioned resid norm 9.010260489109e-14 true resid norm 9.010260489109e-14 ||r(i)||/||b|| 2.021559024868e+00 > 0 KSP Residual norm 9.010260489109e-14 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 > 1 KSP unpreconditioned resid norm 4.918108339808e-15 true resid norm 4.918171792537e-15 ||r(i)||/||b|| 1.103450292594e-01 > 1 KSP Residual norm 4.918108339808e-15 % max 9.566256813737e-01 min 9.566256813737e-01 max/min 1.000000000000e+00 > 2 KSP unpreconditioned resid norm 1.443599554690e-15 true resid norm 1.444867143493e-15 ||r(i)||/||b|| 3.241731154382e-02 > 2 KSP Residual norm 1.443599554690e-15 % max 9.614019380614e-01 min 7.360950481750e-01 max/min 1.306083963538e+00 > > Thus the Krylov solver will not be able to improve the solution, it then gets stuck trying to improve the solution but cannot because of round off. > > In other words the algorithm has converged (even at the initial solution (guess) and should stop immediately. > > You can use -ksp_atol 1.e-12 to get it to stop immediately without iterating if the initial residual is less than 1e-12. > > Barry > > > > > > On Sep 30, 2021, at 4:16 AM, Marco Cisternino > wrote: > > Hello Barry. > This is the output of ksp_view using fgmres and gamg. It has to be said that the solution of the linear system should be a zero values field. As you can see both unpreconditioned residual and r/b converge at this iteration of the CFD solver. During the time integration of the CFD, I can observe pressure linear solver residuals behaving in a different way: unpreconditioned residual stil converges but r/b stalls. After the output of ksp_view I add the output of ksp_monitor_true_residual for one of these iteration where r/b stalls. > Thanks, > > KSP Object: 1 MPI processes > type: fgmres > restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement > happy breakdown tolerance 1e-30 > maximum iterations=100, nonzero initial guess > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > right preconditioning > using UNPRECONDITIONED norm type for convergence test > PC Object: 1 MPI processes > type: gamg > type is MULTIPLICATIVE, levels=4 cycles=v > Cycles per PCApply=1 > Using externally compute Galerkin coarse grid matrices > GAMG specific options > Threshold for dropping small values in graph on each level = 0.02 0.02 > Threshold scaling factor for each level not specified = 1. > AGG specific options > Symmetric graph true > Number of levels to square graph 1 > Number smoothing steps 0 > Coarse grid solver -- level ------------------------------- > KSP Object: (mg_coarse_) 1 MPI processes > type: preonly > maximum iterations=10000, initial guess is zero > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
> left preconditioning > using NONE norm type for convergence test > PC Object: (mg_coarse_) 1 MPI processes > type: bjacobi > number of blocks = 1 > Local solve is same for all blocks, in the following KSP and PC objects: > KSP Object: (mg_coarse_sub_) 1 MPI processes > type: preonly > maximum iterations=1, initial guess is zero > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > left preconditioning > using DEFAULT norm type for convergence test > PC Object: (mg_coarse_sub_) 1 MPI processes > type: lu > PC has not been set up so information may be incomplete > out-of-place factorization > tolerance for zero pivot 2.22045e-14 > using diagonal shift on blocks to prevent zero pivot [INBLOCKS] > matrix ordering: nd > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=18, cols=18 > total: nonzeros=104, allocated nonzeros=104 > total number of mallocs used during MatSetValues calls =0 > not using I-node routines > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=18, cols=18 > total: nonzeros=104, allocated nonzeros=104 > total number of mallocs used during MatSetValues calls =0 > not using I-node routines > Down solver (pre-smoother) on level 1 ------------------------------- > KSP Object: (mg_levels_1_) 1 MPI processes > type: chebyshev > eigenvalue estimates used: min = 0., max = 0. > eigenvalues estimate via gmres min 0., max 0. > eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1] > KSP Object: (mg_levels_1_esteig_) 1 MPI processes > type: gmres > restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement > happy breakdown tolerance 1e-30 > maximum iterations=10, initial guess is zero > tolerances: relative=1e-12, absolute=1e-50, divergence=10000. > left preconditioning > using DEFAULT norm type for convergence test > estimating eigenvalues using noisy right hand side > maximum iterations=2, nonzero initial guess > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > left preconditioning > using NONE norm type for convergence test > PC Object: (mg_levels_1_) 1 MPI processes > type: sor > type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=67, cols=67 > total: nonzeros=675, allocated nonzeros=675 > total number of mallocs used during MatSetValues calls =0 > not using I-node routines > Up solver (post-smoother) same as down solver (pre-smoother) > Down solver (pre-smoother) on level 2 ------------------------------- > KSP Object: (mg_levels_2_) 1 MPI processes > type: chebyshev > eigenvalue estimates used: min = 0., max = 0. > eigenvalues estimate via gmres min 0., max 0. > eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1] > KSP Object: (mg_levels_2_esteig_) 1 MPI processes > type: gmres > restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement > happy breakdown tolerance 1e-30 > maximum iterations=10, initial guess is zero > tolerances: relative=1e-12, absolute=1e-50, divergence=10000. > left preconditioning > using DEFAULT norm type for convergence test > estimating eigenvalues using noisy right hand side > maximum iterations=2, nonzero initial guess > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
> left preconditioning > using NONE norm type for convergence test > PC Object: (mg_levels_2_) 1 MPI processes > type: sor > type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=348, cols=348 > total: nonzeros=3928, allocated nonzeros=3928 > total number of mallocs used during MatSetValues calls =0 > not using I-node routines > Up solver (post-smoother) same as down solver (pre-smoother) > Down solver (pre-smoother) on level 3 ------------------------------- > KSP Object: (mg_levels_3_) 1 MPI processes > type: chebyshev > eigenvalue estimates used: min = 0., max = 0. > eigenvalues estimate via gmres min 0., max 0. > eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1] > KSP Object: (mg_levels_3_esteig_) 1 MPI processes > type: gmres > restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement > happy breakdown tolerance 1e-30 > maximum iterations=10, initial guess is zero > tolerances: relative=1e-12, absolute=1e-50, divergence=10000. > left preconditioning > using DEFAULT norm type for convergence test > estimating eigenvalues using noisy right hand side > maximum iterations=2, nonzero initial guess > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > left preconditioning > using NONE norm type for convergence test > PC Object: (mg_levels_3_) 1 MPI processes > type: sor > type = local_symmetric, iterations = 1, local iterations = 1, omega = 1. > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=3584, cols=3584 > total: nonzeros=23616, allocated nonzeros=23616 > total number of mallocs used during MatSetValues calls =0 > has attached null space > not using I-node routines > Up solver (post-smoother) same as down solver (pre-smoother) > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaij > rows=3584, cols=3584 > total: nonzeros=23616, allocated nonzeros=23616 > total number of mallocs used during MatSetValues calls =0 > has attached null space > not using I-node routines > Pressure system has reached convergence in 0 iterations with reason 3. 
> 0 KSP unpreconditioned resid norm 4.798763170703e-16 true resid norm 4.798763170703e-16 ||r(i)||/||b|| 1.000000000000e+00 > 0 KSP Residual norm 4.798763170703e-16 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 > 1 KSP unpreconditioned resid norm 1.648749109132e-17 true resid norm 1.648749109132e-17 ||r(i)||/||b|| 3.435779284125e-02 > 1 KSP Residual norm 1.648749109132e-17 % max 9.561792537103e-01 min 9.561792537103e-01 max/min 1.000000000000e+00 > 2 KSP unpreconditioned resid norm 4.737880600040e-19 true resid norm 4.737880600040e-19 ||r(i)||/||b|| 9.873128619820e-04 > 2 KSP Residual norm 4.737880600040e-19 % max 9.828636644296e-01 min 9.293131521763e-01 max/min 1.057623753767e+00 > 3 KSP unpreconditioned resid norm 2.542212716830e-20 true resid norm 2.542212716830e-20 ||r(i)||/||b|| 5.297641551371e-05 > 3 KSP Residual norm 2.542212716830e-20 % max 9.933572357920e-01 min 9.158303248850e-01 max/min 1.084652046127e+00 > 4 KSP unpreconditioned resid norm 6.614510286263e-21 true resid norm 6.614510286269e-21 ||r(i)||/||b|| 1.378378146822e-05 > 4 KSP Residual norm 6.614510286263e-21 % max 9.950912550705e-01 min 6.296575800237e-01 max/min 1.580368896747e+00 > 5 KSP unpreconditioned resid norm 1.981505525281e-22 true resid norm 1.981505525272e-22 ||r(i)||/||b|| 4.129200493513e-07 > 5 KSP Residual norm 1.981505525281e-22 % max 9.984097962703e-01 min 5.316259535293e-01 max/min 1.878030577029e+00 > Linear solve converged due to CONVERGED_RTOL iterations 5 > > Ksp_monitor_true_residual output for stalling r/b CFD iteration > 0 KSP unpreconditioned resid norm 9.010260489109e-14 true resid norm 9.010260489109e-14 ||r(i)||/||b|| 2.021559024868e+00 > 0 KSP Residual norm 9.010260489109e-14 % max 1.000000000000e+00 min 1.000000000000e+00 max/min 1.000000000000e+00 > 1 KSP unpreconditioned resid norm 4.918108339808e-15 true resid norm 4.918171792537e-15 ||r(i)||/||b|| 1.103450292594e-01 > 1 KSP Residual norm 4.918108339808e-15 % max 9.566256813737e-01 min 9.566256813737e-01 max/min 1.000000000000e+00 > 2 KSP unpreconditioned resid norm 1.443599554690e-15 true resid norm 1.444867143493e-15 ||r(i)||/||b|| 3.241731154382e-02 > 2 KSP Residual norm 1.443599554690e-15 % max 9.614019380614e-01 min 7.360950481750e-01 max/min 1.306083963538e+00 > 3 KSP unpreconditioned resid norm 6.623206616803e-16 true resid norm 6.654132553541e-16 ||r(i)||/||b|| 1.492933720678e-02 > 3 KSP Residual norm 6.623206616803e-16 % max 9.764112945239e-01 min 4.911485418014e-01 max/min 1.988016274960e+00 > 4 KSP unpreconditioned resid norm 6.551896936698e-16 true resid norm 6.646157296305e-16 ||r(i)||/||b|| 1.491144376933e-02 > 4 KSP Residual norm 6.551896936698e-16 % max 9.883425885532e-01 min 1.461270778833e-01 max/min 6.763582786091e+00 > 5 KSP unpreconditioned resid norm 6.222297644887e-16 true resid norm 1.720560536914e-15 ||r(i)||/||b|| 3.860282047823e-02 > 5 KSP Residual norm 6.222297644887e-16 % max 1.000409371755e+00 min 4.989767363560e-03 max/min 2.004921870829e+02 > 6 KSP unpreconditioned resid norm 6.496945794974e-17 true resid norm 2.031914800253e-14 ||r(i)||/||b|| 4.558842341106e-01 > 6 KSP Residual norm 6.496945794974e-17 % max 1.004914985753e+00 min 1.459258738706e-03 max/min 6.886475709192e+02 > 7 KSP unpreconditioned resid norm 1.965237342540e-17 true resid norm 1.684522207337e-14 ||r(i)||/||b|| 3.779425772373e-01 > 7 KSP Residual norm 1.965237342540e-17 % max 1.005737762541e+00 min 1.452603803766e-03 max/min 6.923689446035e+02 > 8 KSP unpreconditioned resid norm 1.627718951285e-17 true resid 
norm 1.958642967520e-14 ||r(i)||/||b|| 4.394448276241e-01 > 8 KSP Residual norm 1.627718951285e-17 % max 1.006364278765e+00 min 1.452081813014e-03 max/min 6.930492963590e+02 > 9 KSP unpreconditioned resid norm 1.616577677764e-17 true resid norm 2.019110946644e-14 ||r(i)||/||b|| 4.530115373837e-01 > 9 KSP Residual norm 1.616577677764e-17 % max 1.006648747131e+00 min 1.452031376577e-03 max/min 6.932692801059e+02 > 10 KSP unpreconditioned resid norm 1.285788988203e-17 true resid norm 2.065082694477e-14 ||r(i)||/||b|| 4.633258453698e-01 > 10 KSP Residual norm 1.285788988203e-17 % max 1.007469033514e+00 min 1.433291867068e-03 max/min 7.029057072477e+02 > 11 KSP unpreconditioned resid norm 5.490854431580e-19 true resid norm 1.798071628891e-14 ||r(i)||/||b|| 4.034187394623e-01 > 11 KSP Residual norm 5.490854431580e-19 % max 1.008058905554e+00 min 1.369401685301e-03 max/min 7.361309076612e+02 > 12 KSP unpreconditioned resid norm 1.371754802104e-20 true resid norm 1.965688920064e-14 ||r(i)||/||b|| 4.410256708163e-01 > 12 KSP Residual norm 1.371754802104e-20 % max 1.008409402214e+00 min 1.369243011779e-03 max/min 7.364721919624e+02 > Linear solve converged due to CONVERGED_RTOL iterations 12 > > > > Marco Cisternino > > From: Barry Smith > > Sent: mercoled? 29 settembre 2021 18:34 > To: Marco Cisternino > > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] Disconnected domains and Poisson equation > > > > > > > On Sep 29, 2021, at 11:59 AM, Marco Cisternino > wrote: > > For sake of completeness, explicitly building the null space using a vector per sub-domain make s the CFD runs using BCGS and GMRES more stable, but still slower than FGMRES. > > Something is strange. Please run with -ksp_view and send the output on the solver details. > > > > > I had divergence using BCGS and GMRES setting the null space with only one constant. > Thanks > > Marco Cisternino > > From: Marco Cisternino > Sent: mercoled? 29 settembre 2021 17:54 > To: Barry Smith > > Cc: petsc-users at mcs.anl.gov > Subject: RE: [petsc-users] Disconnected domains and Poisson equation > > Thank you Barry for the quick reply. > About the null space: I already tried what you suggest, building 2 Vec (constants) with 0 and 1 chosen by sub-domain, normalizing them and setting the null space like this > MatNullSpaceCreate(PETSC_COMM_WORLD,PETSC_FALSE,nconstants,constants,&nullspace); > The solution is slightly different in values but it is still different in the two sub-domains. > About the solver: I tried BCGS, GMRES and FGMRES. The linear system is a pressure system in a navier-stokes solver and only solving with FGMRES makes the CFD stable, with BCGS and GMRES the CFD solution diverges. Moreover, in the same case but with a single domain, CFD solution is stable using all the solvers, but FGMRES converges in much less iterations than the others. > > Marco Cisternino > > From: Barry Smith > > Sent: mercoled? 29 settembre 2021 15:59 > To: Marco Cisternino > > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] Disconnected domains and Poisson equation > > > The problem actually has a two dimensional null space; constant on each domain but possibly different constants. I think you need to build the MatNullSpace by explicitly constructing two vectors, one with 0 on one domain and constant value on the other and one with 0 on the other domain and constant on the first. > > Separate note: why use FGMRES instead of just GMRES? 
If the problem is linear and the preconditioner is linear (no GMRES inside the smoother) then you can just use GMRES and it will save a little space/work and be conceptually clearer. > > Barry > > > On Sep 29, 2021, at 8:46 AM, Marco Cisternino > wrote: > > Good morning, > I want to solve the Poisson equation on a 3D domain with 2 non-connected sub-domains. > I am using FGMRES+GAMG and I have no problem if the two sub-domains see a Dirichlet boundary condition each. > On the same domain I would like to solve the Poisson equation imposing periodic boundary condition in one direction and homogenous Neumann boundary conditions in the other two directions. The two sub-domains are symmetric with respect to the separation between them and the operator discretization and the right hand side are symmetric as well. It would be nice to have the same solution in both the sub-domains. > Setting the null space to the constant, the solver converges to a solution having the same gradients in both sub-domains but different values. > Am I doing some wrong with the null space? I?m not setting a block matrix (one block for each sub-domain), should I? > I tested the null space against the matrix using MatNullSpaceTest and the answer is true. Can I do something more to have a symmetric solution as outcome of the solver? > Thank you in advance for any comments and hints. > > Best regards, > > Marco Cisternino -------------- next part -------------- An HTML attachment was scrubbed... URL: From dalcinl at gmail.com Wed Oct 6 13:52:36 2021 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Wed, 6 Oct 2021 21:52:36 +0300 Subject: [petsc-users] Eigenvalues always converge to zero when using slepc4py-complex In-Reply-To: References: <3AE681B3-3351-4324-93BE-A2F847831DC0@dsic.upv.es> Message-ID: The new builds are up. Jose added instructions in the SLEPc FAQ. Could you try again? Regards, On Mon, 4 Oct 2021 at 16:13 Yelyzaveta Velizhanina wrote: > I see. Thanks, much appreciated. > > > > Best regards, > > Yelyzaveta Velizhanina. > > > > *From: *Jose E. Roman > *Date: *Monday, 4 October 2021 at 15:10 > *To: *Yelyzaveta Velizhanina > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] Eigenvalues always converge to zero when > using slepc4py-complex > > Conda supports complex scalars for petsc4py. However, this is not > implemented in slepc4py. Lisandro is trying to get this fixed, so if no > issues arise this will be available in a couple of days, with > slepc4py-3.16.0. > > Jose > > > > El 3 oct 2021, a las 22:45, Yelyzaveta Velizhanina < > velizhaninae at gmail.com> escribi?: > > > > Dear all, > > > > I am having a problem to get EPS run properly with PETSc and SLEPc build > with scalar_value=complex. I am using petsc4py and slepc4py. Installed > everything, including PETSc and SLEPc, with conda. While real scalar value > build works well, when using the complex one, all the eigenvalues always > converge to 0 for any matrix and any solver. I?ve tried running examples > given in this repo https://github.com/myousefi2016/slepc4py as well - > same outcome, only zero eigenvalues. I am running MacOSX BigSur. > > > > Will appreciate any help, > > > > Best regards, > > Yelyzaveta Velizhanina. > -- Lisandro Dalcin ============ Senior Research Scientist Extreme Computing Research Center (ECRC) King Abdullah University of Science and Technology (KAUST) http://ecrc.kaust.edu.sa/ -------------- next part -------------- An HTML attachment was scrubbed... 
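Related to the real-versus-complex build question in this thread, here is a minimal petsc4py check of which scalar type an installed build actually uses; nothing in it is specific to the conda packages.

import numpy as np
from petsc4py import PETSc

print("PETSc scalar type:", PETSc.ScalarType)
if not np.issubdtype(PETSc.ScalarType, np.complexfloating):
    print("this build uses real scalars; complex eigenproblems need a "
          "petsc4py/slepc4py build configured with complex scalars")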
URL: From milan.pelletier at protonmail.com Wed Oct 6 14:31:35 2021 From: milan.pelletier at protonmail.com (Milan Pelletier) Date: Wed, 06 Oct 2021 19:31:35 +0000 Subject: [petsc-users] Hypre runtime switch CPU/GPU Message-ID: Dear PETSc users, Is there a way to switch a runtime setting for PETSc+Hypre to run on CPU, even when it has been compiled to allow for GPU support? I looks like setting the matrix and vector types to respectively "seqaij" and "seq" results in GPU computation when Hypre is used as a preconditioner. I thought GPU would be used only when mat_type is set to "hypre", following the examples provided with the last release. Thanks for the help, Best regards, Milan -------------- next part -------------- An HTML attachment was scrubbed... URL: From Eric.Chamberland at giref.ulaval.ca Wed Oct 6 16:43:07 2021 From: Eric.Chamberland at giref.ulaval.ca (Eric Chamberland) Date: Wed, 6 Oct 2021 17:43:07 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? In-Reply-To: References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> Message-ID: Hi Matthew, we tried to use that.? Now, we discovered that: 1- even if we "ask" for sfNatural creation with DMSetUseNatural, it is not created because DMPlexCreateGlobalToNaturalSF looks for a "section": this is not documented in DMSetUseNaturalso we are asking ourselfs: "is this a permanent feature or a temporary situation?" 2- We then tried to create a "section" in different manners: we took the code into the example petsc/src/dm/impls/plex/tests/ex15.c.? However, we ended up with a segfault: corrupted size vs. prev_size [rohan:07297] *** Process received signal *** [rohan:07297] Signal: Aborted (6) [rohan:07297] Signal code:? (-6) [rohan:07297] [ 0] /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] [rohan:07297] [ 1] /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] [rohan:07297] [ 2] /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] [rohan:07297] [ 3] /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] [rohan:07297] [ 4] /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] [rohan:07297] [ 5] /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] [rohan:07297] [ 6] /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] [rohan:07297] [ 7] /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] [rohan:07297] [ 8] /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] [rohan:07297] [ 9] /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] [rohan:07297] [10] /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] [rohan:07297] [11] /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] [rohan:07297] [12] /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] [rohan:07297] [13] /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] [rohan:07297] [14] /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] [rohan:07297] [15] /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] [rohan:07297] [16] /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] [rohan:07297] [17] /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] [rohan:07297] [18] /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] If we do not create a section, the call to DMPlexDistribute is 
successful, but DMPlexGetGlobalToNaturalSF return a null SF pointer... Here are the operations we are calling ( this is almost the code we are using, I just removed verifications and creation of the connectivity which use our parallel structure and code): =========== ? PetscInt* lCells????? = 0; ? PetscInt? lNumCorners = 0; ? PetscInt? lDimMail??? = 0; ? PetscInt? lnumCells?? = 0; ? //At this point we create the cells for PETSc expected input for DMPlexBuildFromCellListParallel and set lNumCorners, lDimMail and lnumCells to correct values. ? ... ? DM?????? lDMBete = 0 ? DMPlexCreate(lMPIComm,&lDMBete); ? DMSetDimension(lDMBete, lDimMail); ? DMPlexBuildFromCellListParallel(lDMBete, ????????????????????????????????? lnumCells, ????????????????????????????????? PETSC_DECIDE, pLectureElementsLocaux.reqNbTotalSommets(), ????????????????????????????????? lNumCorners, ????????????????????????????????? lCells, ????????????????????????????????? PETSC_NULL); ? DM lDMBeteInterp = 0; ? DMPlexInterpolate(lDMBete, &lDMBeteInterp); ? DMDestroy(&lDMBete); ? lDMBete = lDMBeteInterp; ? DMSetUseNatural(lDMBete,PETSC_TRUE); ? PetscSF lSFMigrationSansOvl = 0; ? PetscSF lSFMigrationOvl = 0; ? DM lDMDistribueSansOvl = 0; ? DM lDMAvecOverlap = 0; ? PetscPartitioner lPart; ? DMPlexGetPartitioner(lDMBete, &lPart); ? PetscPartitionerSetFromOptions(lPart); ? PetscSection?? section; ? PetscInt?????? numFields?? = 1; ? PetscInt?????? numBC?????? = 0; ? PetscInt?????? numComp[1]? = {1}; ? PetscInt?????? numDof[4]?? = {1, 0, 0, 0}; ? PetscInt?????? bcFields[1] = {0}; ? IS???????????? bcPoints[1] = {NULL}; ? DMSetNumFields(lDMBete, numFields); ? DMPlexCreateSection(lDMBete, NULL, numComp, numDof, numBC, bcFields, bcPoints, NULL, NULL, §ion); ? DMSetLocalSection(lDMBete, section); ? DMPlexDistribute(lDMBete, 0, &lSFMigrationSansOvl, &lDMDistribueSansOvl); // segfault! =========== So we have other question/remarks: 3- Maybe PETSc expect something specific that is missing/not verified: for example, we didn't gave any coordinates since we just want to partition and compute overlap for the mesh... and then recover our element numbers in a "simple way" 4- We are telling ourselves it is somewhat a "big price to pay" to have to build an unused section to have the global to natural ordering set ?? Could this requirement be avoided? 5- Are there any improvement towards our usages in 3.16 release? Thanks, Eric On 2021-09-29 7:39 p.m., Matthew Knepley wrote: > On Wed, Sep 29, 2021 at 5:18 PM Eric Chamberland > > wrote: > > Hi, > > I come back with _almost_ the original question: > > I would like to add an integer information (*our* original element > number, not petsc one) on each element of the DMPlex I create with > DMPlexBuildFromCellListParallel. > > I would like this interger to be distribruted by or the same way > DMPlexDistribute distribute the mesh. > > Is it possible to do this? > > > I think we already have support for what you want. If you call > > https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html > > > before DMPlexDistribute(), it will compute a PetscSF encoding the > global to natural map. You > can get it with > > https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html > > > and use it with > > https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html > > > Is this sufficient? > > ? Thanks, > > ? ? 
?Matt > > Thanks, > > Eric > > On 2021-07-14 1:18 p.m., Eric Chamberland wrote: > > Hi, > > > > I want to use DMPlexDistribute from PETSc for computing overlapping > > and play with the different partitioners supported. > > > > However, after calling DMPlexDistribute, I noticed the elements are > > renumbered and then the original number is lost. > > > > What would be the best way to keep track of the element renumbering? > > > > a) Adding an optional parameter to let the user retrieve a > vector or > > "IS" giving the old number? > > > > b) Adding a DMLabel (seems a wrong good solution) > > > > c) Other idea? > > > > Of course, I don't want to loose performances with the need of this > > "mapping"... > > > > Thanks, > > > > Eric > > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -- Eric Chamberland, ing., M. Ing Professionnel de recherche GIREF/Universit? Laval (418) 656-2131 poste 41 22 42 -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Wed Oct 6 20:23:05 2021 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 6 Oct 2021 21:23:05 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? In-Reply-To: References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> Message-ID: On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland < Eric.Chamberland at giref.ulaval.ca> wrote: > Hi Matthew, > > we tried to use that. Now, we discovered that: > > 1- even if we "ask" for sfNatural creation with DMSetUseNatural, it is not > created because DMPlexCreateGlobalToNaturalSF looks for a "section": this > is not documented in DMSetUseNaturalso we are asking ourselfs: "is this a > permanent feature or a temporary situation?" > > I think explaining this will help clear up a lot. What the Natural2Global map does is permute a solution vector into the ordering that it would have had prior to mesh distribution. Now, in order to do this permutation, I need to know the original (global) data layout. If it is not specified _before_ distribution, we cannot build the permutation. The section describes the data layout, so I need it before distribution. I cannot think of another way that you would implement this, but if you want something else, let me know. > 2- We then tried to create a "section" in different manners: we took the > code into the example petsc/src/dm/impls/plex/tests/ex15.c. However, we > ended up with a segfault: > > corrupted size vs. 
prev_size > [rohan:07297] *** Process received signal *** > [rohan:07297] Signal: Aborted (6) > [rohan:07297] Signal code: (-6) > [rohan:07297] [ 0] /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] > [rohan:07297] [ 1] /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] > [rohan:07297] [ 2] /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] > [rohan:07297] [ 3] /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] > [rohan:07297] [ 4] /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] > [rohan:07297] [ 5] /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] > [rohan:07297] [ 6] /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] > [rohan:07297] [ 7] /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] > [rohan:07297] [ 8] /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] > [rohan:07297] [ 9] > /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] > [rohan:07297] [10] > /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] > [rohan:07297] [11] > /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] > [rohan:07297] [12] > /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] > [rohan:07297] [13] /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] > > [rohan:07297] [14] > /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] > [rohan:07297] [15] > /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] > [rohan:07297] [16] > /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] > [rohan:07297] [17] > /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] > [rohan:07297] [18] > /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] > > I am not sure what happened here, but if you could send a sample code, I will figure it out. > If we do not create a section, the call to DMPlexDistribute is successful, > but DMPlexGetGlobalToNaturalSF return a null SF pointer... > Yes, it just ignores it in this case because it does not have a global layout. > Here are the operations we are calling ( this is almost the code we are > using, I just removed verifications and creation of the connectivity which > use our parallel structure and code): > > =========== > > PetscInt* lCells = 0; > PetscInt lNumCorners = 0; > PetscInt lDimMail = 0; > PetscInt lnumCells = 0; > > //At this point we create the cells for PETSc expected input for > DMPlexBuildFromCellListParallel and set lNumCorners, lDimMail and lnumCells > to correct values. > ... 
> > DM lDMBete = 0 > DMPlexCreate(lMPIComm,&lDMBete); > > DMSetDimension(lDMBete, lDimMail); > > DMPlexBuildFromCellListParallel(lDMBete, > lnumCells, > PETSC_DECIDE, > > pLectureElementsLocaux.reqNbTotalSommets(), > lNumCorners, > lCells, > PETSC_NULL); > > DM lDMBeteInterp = 0; > DMPlexInterpolate(lDMBete, &lDMBeteInterp); > DMDestroy(&lDMBete); > lDMBete = lDMBeteInterp; > > DMSetUseNatural(lDMBete,PETSC_TRUE); > > PetscSF lSFMigrationSansOvl = 0; > PetscSF lSFMigrationOvl = 0; > DM lDMDistribueSansOvl = 0; > DM lDMAvecOverlap = 0; > > PetscPartitioner lPart; > DMPlexGetPartitioner(lDMBete, &lPart); > PetscPartitionerSetFromOptions(lPart); > > PetscSection section; > PetscInt numFields = 1; > PetscInt numBC = 0; > PetscInt numComp[1] = {1}; > PetscInt numDof[4] = {1, 0, 0, 0}; > PetscInt bcFields[1] = {0}; > IS bcPoints[1] = {NULL}; > > DMSetNumFields(lDMBete, numFields); > > DMPlexCreateSection(lDMBete, NULL, numComp, numDof, numBC, bcFields, > bcPoints, NULL, NULL, §ion); > DMSetLocalSection(lDMBete, section); > > DMPlexDistribute(lDMBete, 0, &lSFMigrationSansOvl, > &lDMDistribueSansOvl); // segfault! > > =========== > > So we have other question/remarks: > > 3- Maybe PETSc expect something specific that is missing/not verified: for > example, we didn't gave any coordinates since we just want to partition and > compute overlap for the mesh... and then recover our element numbers in a > "simple way" > > 4- We are telling ourselves it is somewhat a "big price to pay" to have to > build an unused section to have the global to natural ordering set ? Could > this requirement be avoided? > I don't think so. There would have to be _some_ way of describing your data layout in terms of mesh points, and I do not see how you could use less memory doing that. > 5- Are there any improvement towards our usages in 3.16 release? > Let me try and run the code above. Thanks, Matt > Thanks, > > Eric > > > On 2021-09-29 7:39 p.m., Matthew Knepley wrote: > > On Wed, Sep 29, 2021 at 5:18 PM Eric Chamberland < > Eric.Chamberland at giref.ulaval.ca> wrote: > >> Hi, >> >> I come back with _almost_ the original question: >> >> I would like to add an integer information (*our* original element >> number, not petsc one) on each element of the DMPlex I create with >> DMPlexBuildFromCellListParallel. >> >> I would like this interger to be distribruted by or the same way >> DMPlexDistribute distribute the mesh. >> >> Is it possible to do this? >> > > I think we already have support for what you want. If you call > > https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html > > before DMPlexDistribute(), it will compute a PetscSF encoding the global > to natural map. You > can get it with > > > https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html > > and use it with > > > https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html > > Is this sufficient? > > Thanks, > > Matt > > >> Thanks, >> >> Eric >> >> On 2021-07-14 1:18 p.m., Eric Chamberland wrote: >> > Hi, >> > >> > I want to use DMPlexDistribute from PETSc for computing overlapping >> > and play with the different partitioners supported. >> > >> > However, after calling DMPlexDistribute, I noticed the elements are >> > renumbered and then the original number is lost. >> > >> > What would be the best way to keep track of the element renumbering? >> > >> > a) Adding an optional parameter to let the user retrieve a vector or >> > "IS" giving the old number? 
>> > >> > b) Adding a DMLabel (seems a wrong good solution) >> > >> > c) Other idea? >> > >> > Of course, I don't want to loose performances with the need of this >> > "mapping"... >> > >> > Thanks, >> > >> > Eric >> > >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.croucher at auckland.ac.nz Wed Oct 6 22:05:10 2021 From: a.croucher at auckland.ac.nz (Adrian Croucher) Date: Thu, 7 Oct 2021 16:05:10 +1300 Subject: [petsc-users] HDF5 corruption Message-ID: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> hi, One of the users of my PETSc-based code has reported that HDF5 output files can be corrupted and unusable if e.g. the run is killed. I've just done a bit of reading about this and it appears to be a known issue with HDF5. Some people suggest flushing the HDF5 file periodically to help prevent data loss. I had a look at PetscViewerFlush() but it doesn't seem to be implemented for the HDF5 viewer- is that correct? I am currently calling PetscViewerHDF5Open() once at the start of the run and closing it at the end. Some other people suggest doing this before and after each write instead. Is there likely to be a significant performance penalty in doing that? I gather HDF5 journaling has been promised for a while to get around this problem but as far as I can see it hasn't materialised yet... Regards, Adrian -- Dr Adrian Croucher Senior Research Fellow Department of Engineering Science University of Auckland, New Zealand email: a.croucher at auckland.ac.nz tel: +64 (0)9 923 4611 From velizhaninae at gmail.com Thu Oct 7 04:07:06 2021 From: velizhaninae at gmail.com (Yelyzaveta Velizhanina) Date: Thu, 7 Oct 2021 09:07:06 +0000 Subject: [petsc-users] Eigenvalues always converge to zero when using slepc4py-complex In-Reply-To: References: <3AE681B3-3351-4324-93BE-A2F847831DC0@dsic.upv.es> Message-ID: Hello Lisandro, I am testing EPS for a random non-Hermitian complex matrix in shift-invert mode and comparing with SciPy?s wrapper of Lapack ? seems to be working perfectly. Thanks. Regards, Yelyzaveta. From: Lisandro Dalcin Date: Wednesday, 6 October 2021 at 20:52 To: Yelyzaveta Velizhanina Cc: Jose E. Roman , petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Eigenvalues always converge to zero when using slepc4py-complex The new builds are up. Jose added instructions in the SLEPc FAQ. Could you try again? Regards, On Mon, 4 Oct 2021 at 16:13 Yelyzaveta Velizhanina > wrote: I see. Thanks, much appreciated. Best regards, Yelyzaveta Velizhanina. From: Jose E. Roman > Date: Monday, 4 October 2021 at 15:10 To: Yelyzaveta Velizhanina > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] Eigenvalues always converge to zero when using slepc4py-complex Conda supports complex scalars for petsc4py. However, this is not implemented in slepc4py. 
Lisandro is trying to get this fixed, so if no issues arise this will be available in a couple of days, with slepc4py-3.16.0. Jose > El 3 oct 2021, a las 22:45, Yelyzaveta Velizhanina > escribi?: > > Dear all, > > I am having a problem to get EPS run properly with PETSc and SLEPc build with scalar_value=complex. I am using petsc4py and slepc4py. Installed everything, including PETSc and SLEPc, with conda. While real scalar value build works well, when using the complex one, all the eigenvalues always converge to 0 for any matrix and any solver. I?ve tried running examples given in this repo https://github.com/myousefi2016/slepc4py as well - same outcome, only zero eigenvalues. I am running MacOSX BigSur. > > Will appreciate any help, > > Best regards, > Yelyzaveta Velizhanina. -- Lisandro Dalcin ============ Senior Research Scientist Extreme Computing Research Center (ECRC) King Abdullah University of Science and Technology (KAUST) http://ecrc.kaust.edu.sa/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Thu Oct 7 06:09:35 2021 From: mfadams at lbl.gov (Mark Adams) Date: Thu, 7 Oct 2021 07:09:35 -0400 Subject: [petsc-users] Hypre runtime switch CPU/GPU In-Reply-To: References: Message-ID: I'm not sure, but I suspect that Hypre does not support runtime switching and our model is that you can switch at runtime. This leads to an inconsistency. If we remove -mat_type hypre then your issue would go away but 1) we would have to add it back if hypre supports runtime switching in the future, and break everyone's input decks, and 2) it would be inconsistent with the PETSc model. I could see throwing an error if you do not use -mat_type hypre and are configured for GPUs. Mark On Wed, Oct 6, 2021 at 3:31 PM Milan Pelletier via petsc-users < petsc-users at mcs.anl.gov> wrote: > Dear PETSc users, > > Is there a way to switch a runtime setting for PETSc+Hypre to run on CPU, > even when it has been compiled to allow for GPU support? > I looks like setting the matrix and vector types to respectively "seqaij" > and "seq" results in GPU computation when Hypre is used as a > preconditioner. I thought GPU would be used only when mat_type is set to > "hypre", following the examples provided with the last release. > > Thanks for the help, > Best regards, > > Milan > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.zampini at gmail.com Thu Oct 7 06:22:28 2021 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Thu, 7 Oct 2021 14:22:28 +0300 Subject: [petsc-users] Hypre runtime switch CPU/GPU In-Reply-To: References: Message-ID: We have discussed full runtime switch in HYPRE with Ruipeng few weeks ago, I'm not sure what's the status. cc'ing him Il giorno gio 7 ott 2021 alle ore 14:10 Mark Adams ha scritto: > I'm not sure, but I suspect that Hypre does not support runtime switching > and our model is that you can switch at runtime. This leads to an > inconsistency. > > If we remove -mat_type hypre then your issue would go away but 1) we would > have to add it back if hypre supports runtime switching in the future, and > break everyone's input decks, and 2) it would be inconsistent with the > PETSc model. > > I could see throwing an error if you do not use -mat_type hypre and are > configured for GPUs. 
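As a reference point for the runtime-switch model mentioned above, here is a minimal petsc4py sketch in which the matrix backend and the hypre preconditioner are chosen purely from the options database. The option values shown are only illustrative, and this does not by itself address hypre's build-time GPU behaviour discussed in this thread.

# CPU run:  python solve.py -mat_type aij         -ksp_type cg -pc_type hypre
# GPU run:  python solve.py -mat_type aijcusparse -ksp_type cg -pc_type hypre
import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

n = 100
A = PETSc.Mat().create(comm=PETSc.COMM_WORLD)
A.setSizes(((PETSc.DECIDE, n), (PETSc.DECIDE, n)))
A.setFromOptions()                      # honours -mat_type at runtime
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):           # simple SPD test matrix (1D Laplacian)
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemblyBegin()
A.assemblyEnd()

b = A.createVecRight()
b.set(1.0)
x = A.createVecRight()

ksp = PETSc.KSP().create(comm=PETSc.COMM_WORLD)
ksp.setOperators(A)
ksp.setFromOptions()                    # honours -ksp_type / -pc_type at runtime
ksp.solve(b, x)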
> > Mark > > On Wed, Oct 6, 2021 at 3:31 PM Milan Pelletier via petsc-users < > petsc-users at mcs.anl.gov> wrote: > >> Dear PETSc users, >> >> Is there a way to switch a runtime setting for PETSc+Hypre to run on CPU, >> even when it has been compiled to allow for GPU support? >> I looks like setting the matrix and vector types to respectively "seqaij" >> and "seq" results in GPU computation when Hypre is used as a >> preconditioner. I thought GPU would be used only when mat_type is set to >> "hypre", following the examples provided with the last release. >> >> Thanks for the help, >> Best regards, >> >> Milan >> >> -- Stefano -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Thu Oct 7 06:39:39 2021 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 7 Oct 2021 07:39:39 -0400 Subject: [petsc-users] HDF5 corruption In-Reply-To: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> References: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> Message-ID: On Wed, Oct 6, 2021 at 11:05 PM Adrian Croucher wrote: > hi, > > One of the users of my PETSc-based code has reported that HDF5 output > files can be corrupted and unusable if e.g. the run is killed. I've just > done a bit of reading about this and it appears to be a known issue with > HDF5. > > Some people suggest flushing the HDF5 file periodically to help prevent > data loss. I had a look at PetscViewerFlush() but it doesn't seem to be > implemented for the HDF5 viewer- is that correct? > > I am currently calling PetscViewerHDF5Open() once at the start of the > run and closing it at the end. Some other people suggest doing this > before and after each write instead. Is there likely to be a significant > performance penalty in doing that? > I don't think so. This is what I did on another project to guard against the HDF5 issue. > I gather HDF5 journaling has been promised for a while to get around > this problem but as far as I can see it hasn't materialised yet... > Don't worry. It will happen right after fusion power and world peace :) Thanks, Matt > Regards, Adrian > > -- > Dr Adrian Croucher > Senior Research Fellow > Department of Engineering Science > University of Auckland, New Zealand > email: a.croucher at auckland.ac.nz > tel: +64 (0)9 923 4611 > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From milan.pelletier at protonmail.com Thu Oct 7 06:44:35 2021 From: milan.pelletier at protonmail.com (Milan Pelletier) Date: Thu, 07 Oct 2021 11:44:35 +0000 Subject: [petsc-users] Hypre runtime switch CPU/GPU In-Reply-To: References: Message-ID: <-Ui2ZgIIWiVFFv_mrSUmqaYYyZvuCRc7k7mM2ZO8vVpMYweHPRdztk76GMwU9-kvIDTrF92eG4ERFW2fiBdkLI24JrdGi1-H5swLcEZVqV4=@protonmail.com> Ok thanks for the answers, this would definitely be a super useful feature to have. Milan ??????? Original Message ??????? Le jeudi 7 octobre 2021 ? 1:22 PM, Stefano Zampini a ?crit : > We have discussed full runtime switch in HYPRE with Ruipeng few weeks ago, I'm not sure what's the status. cc'ing him > > Il giorno gio 7 ott 2021 alle ore 14:10 Mark Adams ha scritto: > >> I'm not sure, but I suspect that Hypre does not support runtime switching and our model is that you can switch at runtime. This leads to an inconsistency. 
>> >> If we remove -mat_type hypre then your issue would go away but 1) we would have to add it back if hypre supports runtime switching in the future, and break everyone's input decks, and 2) it would be inconsistent with the PETSc model. >> >> I could see throwing an error if you do not use -mat_type hypre and are configured for GPUs. >> >> Mark >> >> On Wed, Oct 6, 2021 at 3:31 PM Milan Pelletier via petsc-users wrote: >> >>> Dear PETSc users, >>> >>> Is there a way to switch a runtime setting for PETSc+Hypre to run on CPU, even when it has been compiled to allow for GPU support? >>> I looks like setting the matrix and vector types to respectively "seqaij" and "seq" results in GPU computation when Hypre is used as a preconditioner. I thought GPU would be used only when mat_type is set to "hypre", following the examples provided with the last release. >>> >>> Thanks for the help, >>> Best regards, >>> >>> Milan > > -- > > Stefano -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.werner at dlr.de Thu Oct 7 08:50:12 2021 From: michael.werner at dlr.de (Michael Werner) Date: Thu, 7 Oct 2021 15:50:12 +0200 Subject: [petsc-users] petsc4py - Spike in memory usage when loading a matrix in parallel Message-ID: <97889b6d-e7ce-5366-7e49-e4cd42ac0b1d@dlr.de> Hello, I noticed that there is a peak in memory consumption when I load an existing matrix into PETSc. The matrix is previously created by an external program and saved in the PETSc binary format. The code I'm using in petsc4py is simple: viewer = PETSc.Viewer().createBinary(, "r", comm=PETSc.COMM_WORLD) A = PETSc.Mat().create(comm=PETSc.COMM_WORLD) A.load(viewer) When I run this code in serial, the memory consumption of the process is about 50GB RAM, similar to the file size of the saved matrix. However, if I run the code in parallel, for a few seconds the memory consumption of the process doubles to around 100GB RAM, before dropping back down to around 50GB RAM. So it seems as if, for some reason, the matrix is copied after it is read into memory. Is there a way to avoid this behaviour? Currently, it is a clear bottleneck in my code. I tried setting the size of the matrix and to explicitly preallocate the necessary NNZ (with A.setSizes(dim) and A.setPreallocationNNZ(nnz), respectively) before loading, but that didn't help. As mentioned above, I'm using petsc4py together with PETSc-3.16 on a Linux workstation. Best regards, Michael Werner -- ____________________________________________________ Deutsches Zentrum f?r Luft- und Raumfahrt e.V. (DLR) Institut f?r Aerodynamik und Str?mungstechnik | Bunsenstr. 10 | 37073 G?ttingen Michael Werner Telefon 0551 709-2627 | Telefax 0551 709-2811 | Michael.Werner at dlr.de DLR.de From bsmith at petsc.dev Thu Oct 7 09:03:05 2021 From: bsmith at petsc.dev (Barry Smith) Date: Thu, 7 Oct 2021 10:03:05 -0400 Subject: [petsc-users] petsc4py - Spike in memory usage when loading a matrix in parallel In-Reply-To: <97889b6d-e7ce-5366-7e49-e4cd42ac0b1d@dlr.de> References: <97889b6d-e7ce-5366-7e49-e4cd42ac0b1d@dlr.de> Message-ID: <07CEDA45-CCCB-4EA7-8AAC-F2FB9E69A654@petsc.dev> How many ranks are you using? Is it a sparse matrix with MPIAIJ? The intention is that for parallel runs the first rank reads in its own part of the matrix, then reads in the part of the next rank and sends it, then reads the part of the third rank and sends it etc. So there should not be too much of a blip in memory usage. 
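For reference, a minimal petsc4py sketch of fixing the parallel layout before the load, so that each rank owns a known block of rows while the file is streamed through rank 0 as described above. The file name and global size are placeholders, and the thread indicates the transient can remain even with preset sizes.

from petsc4py import PETSc

comm = PETSc.COMM_WORLD
N = 1000                         # global size of the saved matrix (placeholder)

viewer = PETSc.Viewer().createBinary("matrix.dat", "r", comm=comm)
A = PETSc.Mat().create(comm=comm)
A.setType(PETSc.Mat.Type.MPIAIJ)
# ((local_rows, global_rows), (local_cols, global_cols)); PETSc.DECIDE lets
# PETSc split the rows evenly across the ranks
A.setSizes(((PETSc.DECIDE, N), (PETSc.DECIDE, N)))
A.load(viewer)

rstart, rend = A.getOwnershipRange()
PETSc.Sys.Print("owns rows %d to %d" % (rstart, rend), comm=PETSc.COMM_SELF)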
You can run valgrind with the option for tracking memory usage to see exactly where in the code the blip occurs; it could be a regression occurred in the code making it require more memory. But internal MPI buffers might explain some blip. Barry > On Oct 7, 2021, at 9:50 AM, Michael Werner wrote: > > Hello, > > I noticed that there is a peak in memory consumption when I load an > existing matrix into PETSc. The matrix is previously created by an > external program and saved in the PETSc binary format. > The code I'm using in petsc4py is simple: > > viewer = PETSc.Viewer().createBinary(, "r", > comm=PETSc.COMM_WORLD) > A = PETSc.Mat().create(comm=PETSc.COMM_WORLD) > A.load(viewer) > > When I run this code in serial, the memory consumption of the process is > about 50GB RAM, similar to the file size of the saved matrix. However, > if I run the code in parallel, for a few seconds the memory consumption > of the process doubles to around 100GB RAM, before dropping back down to > around 50GB RAM. So it seems as if, for some reason, the matrix is > copied after it is read into memory. Is there a way to avoid this > behaviour? Currently, it is a clear bottleneck in my code. > > I tried setting the size of the matrix and to explicitly preallocate the > necessary NNZ (with A.setSizes(dim) and A.setPreallocationNNZ(nnz), > respectively) before loading, but that didn't help. > > As mentioned above, I'm using petsc4py together with PETSc-3.16 on a > Linux workstation. > > Best regards, > Michael Werner > > -- > > ____________________________________________________ > > Deutsches Zentrum f?r Luft- und Raumfahrt e.V. (DLR) > Institut f?r Aerodynamik und Str?mungstechnik | Bunsenstr. 10 | 37073 G?ttingen > > Michael Werner > Telefon 0551 709-2627 | Telefax 0551 709-2811 | Michael.Werner at dlr.de > DLR.de > > > > > > > > > From knepley at gmail.com Thu Oct 7 09:09:15 2021 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 7 Oct 2021 10:09:15 -0400 Subject: [petsc-users] petsc4py - Spike in memory usage when loading a matrix in parallel In-Reply-To: <07CEDA45-CCCB-4EA7-8AAC-F2FB9E69A654@petsc.dev> References: <97889b6d-e7ce-5366-7e49-e4cd42ac0b1d@dlr.de> <07CEDA45-CCCB-4EA7-8AAC-F2FB9E69A654@petsc.dev> Message-ID: On Thu, Oct 7, 2021 at 10:03 AM Barry Smith wrote: > > How many ranks are you using? Is it a sparse matrix with MPIAIJ? > > The intention is that for parallel runs the first rank reads in its own > part of the matrix, then reads in the part of the next rank and sends it, > then reads the part of the third rank and sends it etc. So there should not > be too much of a blip in memory usage. You can run valgrind with the option > for tracking memory usage to see exactly where in the code the blip occurs; > it could be a regression occurred in the code making it require more > memory. But internal MPI buffers might explain some blip. > Is it possible that we free the memory, but the OS has just not given back that memory for use yet? How are you measuring memory usage? Thanks, Matt > Barry > > > > On Oct 7, 2021, at 9:50 AM, Michael Werner > wrote: > > > > Hello, > > > > I noticed that there is a peak in memory consumption when I load an > > existing matrix into PETSc. The matrix is previously created by an > > external program and saved in the PETSc binary format. 
> > The code I'm using in petsc4py is simple: > > > > viewer = PETSc.Viewer().createBinary(, "r", > > comm=PETSc.COMM_WORLD) > > A = PETSc.Mat().create(comm=PETSc.COMM_WORLD) > > A.load(viewer) > > > > When I run this code in serial, the memory consumption of the process is > > about 50GB RAM, similar to the file size of the saved matrix. However, > > if I run the code in parallel, for a few seconds the memory consumption > > of the process doubles to around 100GB RAM, before dropping back down to > > around 50GB RAM. So it seems as if, for some reason, the matrix is > > copied after it is read into memory. Is there a way to avoid this > > behaviour? Currently, it is a clear bottleneck in my code. > > > > I tried setting the size of the matrix and to explicitly preallocate the > > necessary NNZ (with A.setSizes(dim) and A.setPreallocationNNZ(nnz), > > respectively) before loading, but that didn't help. > > > > As mentioned above, I'm using petsc4py together with PETSc-3.16 on a > > Linux workstation. > > > > Best regards, > > Michael Werner > > > > -- > > > > ____________________________________________________ > > > > Deutsches Zentrum f?r Luft- und Raumfahrt e.V. (DLR) > > Institut f?r Aerodynamik und Str?mungstechnik | Bunsenstr. 10 | 37073 > G?ttingen > > > > Michael Werner > > Telefon 0551 709-2627 | Telefax 0551 709-2811 | Michael.Werner at dlr.de > > DLR.de > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.werner at dlr.de Thu Oct 7 10:35:44 2021 From: michael.werner at dlr.de (Michael Werner) Date: Thu, 7 Oct 2021 17:35:44 +0200 Subject: [petsc-users] petsc4py - Spike in memory usage when loading a matrix in parallel In-Reply-To: References: <97889b6d-e7ce-5366-7e49-e4cd42ac0b1d@dlr.de> <07CEDA45-CCCB-4EA7-8AAC-F2FB9E69A654@petsc.dev> Message-ID: Currently I'm using psutil to query every process for its memory usage and sum it up. However, the spike was only visible in top (I had a call to psutil right before and after A.load(viewer), and both reported only 50 GB of RAM usage). That's why I thought it might be directly tied to loading the matrix. However, I also had the problem that the computation crashed due to running out of memory while loading a matrix that should in theory fit into memory. In that case I would expect the OS to free unused meory immediatly, right? Concerning Barry's questions: the matrix is a sparse matrix and is originally created sequentially as SEQAIJ. However, it is then loaded as MPIAIJ, and if I look at the memory usage of the various processes, they fill up one after another, just as described. Is the origin of the matrix somehow preserved in the binary file? I was under the impression that the binary format was agnostic to the number of processes? I also varied the number of processes between 1 and 60, as soon as I use more than one process I can observe the spike (and its always twice the memory, no matter how many processes I'm using). I also tried running Valgrind with the --tool=massif option. However, I don't know what to look for. I can send you the output file separately, if it helps. Best regards, Michael On 07.10.21 16:09, Matthew Knepley wrote: > On Thu, Oct 7, 2021 at 10:03 AM Barry Smith > wrote: > > > ? ?How many ranks are you using? 
Is it a sparse matrix with MPIAIJ? > > ? ?The intention is that for parallel runs the first rank reads in > its own part of the matrix, then reads in the part of the next > rank and sends it, then reads the part of the third rank and sends > it etc. So there should not be too much of a blip in memory usage. > You can run valgrind with the option for tracking memory usage to > see exactly where in the code the blip occurs; it could be a > regression occurred in the code making it require more memory. But > internal MPI buffers might explain some blip. > > > Is it possible that we free the memory, but the OS has just not given > back that memory for use yet? How are you measuring memory usage? > > ? Thanks, > > ? ? ?Matt > ? > > ? Barry > > > > On Oct 7, 2021, at 9:50 AM, Michael Werner > > wrote: > > > > Hello, > > > > I noticed that there is a peak in memory consumption when I load an > > existing matrix into PETSc. The matrix is previously created by an > > external program and saved in the PETSc binary format. > > The code I'm using in petsc4py is simple: > > > > viewer = PETSc.Viewer().createBinary(, "r", > > comm=PETSc.COMM_WORLD) > > A = PETSc.Mat().create(comm=PETSc.COMM_WORLD) > > A.load(viewer) > > > > When I run this code in serial, the memory consumption of the > process is > > about 50GB RAM, similar to the file size of the saved matrix. > However, > > if I run the code in parallel, for a few seconds the memory > consumption > > of the process doubles to around 100GB RAM, before dropping back > down to > > around 50GB RAM. So it seems as if, for some reason, the matrix is > > copied after it is read into memory. Is there a way to avoid this > > behaviour? Currently, it is a clear bottleneck in my code. > > > > I tried setting the size of the matrix and to explicitly > preallocate the > > necessary NNZ (with A.setSizes(dim) and A.setPreallocationNNZ(nnz), > > respectively) before loading, but that didn't help. > > > > As mentioned above, I'm using petsc4py together with PETSc-3.16 on a > > Linux workstation. > > > > Best regards, > > Michael Werner > > > > -- > > > > ____________________________________________________ > > > > Deutsches Zentrum f?r Luft- und Raumfahrt e.V. (DLR) > > Institut f?r Aerodynamik und Str?mungstechnik | Bunsenstr. 10 | > 37073 G?ttingen > > > > Michael Werner > > Telefon 0551 709-2627 | Telefax 0551 709-2811 | > Michael.Werner at dlr.de > > DLR.de > > > > > > > > > > > > > > > > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Thu Oct 7 10:55:00 2021 From: bsmith at petsc.dev (Barry Smith) Date: Thu, 7 Oct 2021 11:55:00 -0400 Subject: [petsc-users] petsc4py - Spike in memory usage when loading a matrix in parallel In-Reply-To: References: <97889b6d-e7ce-5366-7e49-e4cd42ac0b1d@dlr.de> <07CEDA45-CCCB-4EA7-8AAC-F2FB9E69A654@petsc.dev> Message-ID: <0EAF1EE7-C34D-4118-BF74-78E1D983EFFD@petsc.dev> > On Oct 7, 2021, at 11:35 AM, Michael Werner wrote: > > Currently I'm using psutil to query every process for its memory usage and sum it up. However, the spike was only visible in top (I had a call to psutil right before and after A.load(viewer), and both reported only 50 GB of RAM usage). That's why I thought it might be directly tied to loading the matrix. 
However, I also had the problem that the computation crashed due to running out of memory while loading a matrix that should in theory fit into memory. In that case I would expect the OS to free unused meory immediatly, right? > > Concerning Barry's questions: the matrix is a sparse matrix and is originally created sequentially as SEQAIJ. However, it is then loaded as MPIAIJ, and if I look at the memory usage of the various processes, they fill up one after another, just as described. Is the origin of the matrix somehow preserved in the binary file? I was under the impression that the binary format was agnostic to the number of processes? The file format is independent of the number of processes that created it. > I also varied the number of processes between 1 and 60, as soon as I use more than one process I can observe the spike (and its always twice the memory, no matter how many processes I'm using). Twice the size of the entire matrix (when stored on one process) or twice the size of the resulting matrix stored on the first rank? The latter is exactly as expected, since rank 0 has to load the part of the matrix destined for the next rank and hence for a short time contains its own part of the matrix and the part of one other rank. Barry > > I also tried running Valgrind with the --tool=massif option. However, I don't know what to look for. I can send you the output file separately, if it helps. > > Best regards, > Michael > > On 07.10.21 16:09, Matthew Knepley wrote: >> On Thu, Oct 7, 2021 at 10:03 AM Barry Smith > wrote: >> >> How many ranks are you using? Is it a sparse matrix with MPIAIJ? >> >> The intention is that for parallel runs the first rank reads in its own part of the matrix, then reads in the part of the next rank and sends it, then reads the part of the third rank and sends it etc. So there should not be too much of a blip in memory usage. You can run valgrind with the option for tracking memory usage to see exactly where in the code the blip occurs; it could be a regression occurred in the code making it require more memory. But internal MPI buffers might explain some blip. >> >> Is it possible that we free the memory, but the OS has just not given back that memory for use yet? How are you measuring memory usage? >> >> Thanks, >> >> Matt >> >> Barry >> >> >> > On Oct 7, 2021, at 9:50 AM, Michael Werner > wrote: >> > >> > Hello, >> > >> > I noticed that there is a peak in memory consumption when I load an >> > existing matrix into PETSc. The matrix is previously created by an >> > external program and saved in the PETSc binary format. >> > The code I'm using in petsc4py is simple: >> > >> > viewer = PETSc.Viewer().createBinary(, "r", >> > comm=PETSc.COMM_WORLD) >> > A = PETSc.Mat().create(comm=PETSc.COMM_WORLD) >> > A.load(viewer) >> > >> > When I run this code in serial, the memory consumption of the process is >> > about 50GB RAM, similar to the file size of the saved matrix. However, >> > if I run the code in parallel, for a few seconds the memory consumption >> > of the process doubles to around 100GB RAM, before dropping back down to >> > around 50GB RAM. So it seems as if, for some reason, the matrix is >> > copied after it is read into memory. Is there a way to avoid this >> > behaviour? Currently, it is a clear bottleneck in my code. >> > >> > I tried setting the size of the matrix and to explicitly preallocate the >> > necessary NNZ (with A.setSizes(dim) and A.setPreallocationNNZ(nnz), >> > respectively) before loading, but that didn't help. 
>> > >> > As mentioned above, I'm using petsc4py together with PETSc-3.16 on a >> > Linux workstation. >> > >> > Best regards, >> > Michael Werner >> > >> > -- >> > >> > ____________________________________________________ >> > >> > Deutsches Zentrum f?r Luft- und Raumfahrt e.V. (DLR) >> > Institut f?r Aerodynamik und Str?mungstechnik | Bunsenstr. 10 | 37073 G?ttingen >> > >> > Michael Werner >> > Telefon 0551 709-2627 | Telefax 0551 709-2811 | Michael.Werner at dlr.de >> > DLR.de >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >> >> >> -- >> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.werner at dlr.de Thu Oct 7 10:59:57 2021 From: michael.werner at dlr.de (Michael Werner) Date: Thu, 7 Oct 2021 17:59:57 +0200 Subject: [petsc-users] petsc4py - Spike in memory usage when loading a matrix in parallel In-Reply-To: <0EAF1EE7-C34D-4118-BF74-78E1D983EFFD@petsc.dev> References: <97889b6d-e7ce-5366-7e49-e4cd42ac0b1d@dlr.de> <07CEDA45-CCCB-4EA7-8AAC-F2FB9E69A654@petsc.dev> <0EAF1EE7-C34D-4118-BF74-78E1D983EFFD@petsc.dev> Message-ID: <646573ca-cd86-48a0-cbaa-2c8d0b6eb704@dlr.de> Its twice the memory of the entire matrix (when stored on one process). I also just sent you the valgrind results, both for a serial run and a parallel run. The size on disk of the matrix I used is 20 GB. In the serial run, valgrind shows a peak memory usage of 21GB, while in the parallel run (with 4 processes) each process shows a peak memory usage of 10.8GB Best regards, Michael On 07.10.21 17:55, Barry Smith wrote: > > >> On Oct 7, 2021, at 11:35 AM, Michael Werner > > wrote: >> >> Currently I'm using psutil to query every process for its memory >> usage and sum it up. However, the spike was only visible in top (I >> had a call to psutil right before and after A.load(viewer), and both >> reported only 50 GB of RAM usage). That's why I thought it might be >> directly tied to loading the matrix. However, I also had the problem >> that the computation crashed due to running out of memory while >> loading a matrix that should in theory fit into memory. In that case >> I would expect the OS to free unused meory immediatly, right? >> >> Concerning Barry's questions: the matrix is a sparse matrix and is >> originally created sequentially as SEQAIJ. However, it is then loaded >> as MPIAIJ, and if I look at the memory usage of the various >> processes, they fill up one after another, just as described. Is the >> origin of the matrix somehow preserved in the binary file? I was >> under the impression that the binary format was agnostic to the >> number of processes? > > ?The file format is independent of the number of processes that > created it. > >> I also varied the number of processes between 1 and 60, as soon as I >> use more than one process I can observe the spike (and its always >> twice the memory, no matter how many processes I'm using). > > ? Twice the size of the entire matrix (when stored on one process) or > twice the size of the resulting matrix stored on the first rank? The > latter is exactly as expected, since rank 0 has to load the part of > the matrix destined for the next rank and hence for a short time > contains its own part of the matrix and the part of one other rank. > > ? 
Barry > >> >> I also tried running Valgrind with the --tool=massif option. However, >> I don't know what to look for. I can send you the output file >> separately, if it helps. >> >> Best regards, >> Michael >> >> On 07.10.21 16:09, Matthew Knepley wrote: >>> On Thu, Oct 7, 2021 at 10:03 AM Barry Smith >> > wrote: >>> >>> >>> ? ?How many ranks are you using? Is it a sparse matrix with MPIAIJ? >>> >>> ? ?The intention is that for parallel runs the first rank reads >>> in its own part of the matrix, then reads in the part of the >>> next rank and sends it, then reads the part of the third rank >>> and sends it etc. So there should not be too much of a blip in >>> memory usage. You can run valgrind with the option for tracking >>> memory usage to see exactly where in the code the blip occurs; >>> it could be a regression occurred in the code making it require >>> more memory. But internal MPI buffers might explain some blip. >>> >>> >>> Is it possible that we free the memory, but the OS has just not >>> given back that memory for use yet? How are you measuring memory usage? >>> >>> ? Thanks, >>> >>> ? ? ?Matt >>> ? >>> >>> ? Barry >>> >>> >>> > On Oct 7, 2021, at 9:50 AM, Michael Werner >>> > wrote: >>> > >>> > Hello, >>> > >>> > I noticed that there is a peak in memory consumption when I >>> load an >>> > existing matrix into PETSc. The matrix is previously created by an >>> > external program and saved in the PETSc binary format. >>> > The code I'm using in petsc4py is simple: >>> > >>> > viewer = >>> PETSc.Viewer().createBinary(, "r", >>> > comm=PETSc.COMM_WORLD) >>> > A = PETSc.Mat().create(comm=PETSc.COMM_WORLD) >>> > A.load(viewer) >>> > >>> > When I run this code in serial, the memory consumption of the >>> process is >>> > about 50GB RAM, similar to the file size of the saved matrix. >>> However, >>> > if I run the code in parallel, for a few seconds the memory >>> consumption >>> > of the process doubles to around 100GB RAM, before dropping >>> back down to >>> > around 50GB RAM. So it seems as if, for some reason, the matrix is >>> > copied after it is read into memory. Is there a way to avoid this >>> > behaviour? Currently, it is a clear bottleneck in my code. >>> > >>> > I tried setting the size of the matrix and to explicitly >>> preallocate the >>> > necessary NNZ (with A.setSizes(dim) and >>> A.setPreallocationNNZ(nnz), >>> > respectively) before loading, but that didn't help. >>> > >>> > As mentioned above, I'm using petsc4py together with >>> PETSc-3.16 on a >>> > Linux workstation. >>> > >>> > Best regards, >>> > Michael Werner >>> > >>> > -- >>> > >>> > ____________________________________________________ >>> > >>> > Deutsches Zentrum f?r Luft- und Raumfahrt e.V. (DLR) >>> > Institut f?r Aerodynamik und Str?mungstechnik | Bunsenstr. 10 >>> | 37073 G?ttingen >>> > >>> > Michael Werner >>> > Telefon 0551 709-2627 | Telefax 0551 709-2811 | >>> Michael.Werner at dlr.de >>> > DLR.de >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which >>> their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From knepley at gmail.com Thu Oct 7 11:32:49 2021 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 7 Oct 2021 12:32:49 -0400 Subject: [petsc-users] petsc4py - Spike in memory usage when loading a matrix in parallel In-Reply-To: <646573ca-cd86-48a0-cbaa-2c8d0b6eb704@dlr.de> References: <97889b6d-e7ce-5366-7e49-e4cd42ac0b1d@dlr.de> <07CEDA45-CCCB-4EA7-8AAC-F2FB9E69A654@petsc.dev> <0EAF1EE7-C34D-4118-BF74-78E1D983EFFD@petsc.dev> <646573ca-cd86-48a0-cbaa-2c8d0b6eb704@dlr.de> Message-ID: On Thu, Oct 7, 2021 at 11:59 AM Michael Werner wrote: > Its twice the memory of the entire matrix (when stored on one process). I > also just sent you the valgrind results, both for a serial run and a > parallel run. The size on disk of the matrix I used is 20 GB. > In the serial run, valgrind shows a peak memory usage of 21GB, while in > the parallel run (with 4 processes) each process shows a peak memory usage > of 10.8GB > Barry is right that at least proc 0 must have twice its own memory, since it loads the other pieces. That makes 10GB sounds correct. Thanks, Matt > Best regards, > Michael > > On 07.10.21 17:55, Barry Smith wrote: > > > > On Oct 7, 2021, at 11:35 AM, Michael Werner wrote: > > Currently I'm using psutil to query every process for its memory usage and > sum it up. However, the spike was only visible in top (I had a call to > psutil right before and after A.load(viewer), and both reported only 50 GB > of RAM usage). That's why I thought it might be directly tied to loading > the matrix. However, I also had the problem that the computation crashed > due to running out of memory while loading a matrix that should in theory > fit into memory. In that case I would expect the OS to free unused meory > immediatly, right? > > Concerning Barry's questions: the matrix is a sparse matrix and is > originally created sequentially as SEQAIJ. However, it is then loaded as > MPIAIJ, and if I look at the memory usage of the various processes, they > fill up one after another, just as described. Is the origin of the matrix > somehow preserved in the binary file? I was under the impression that the > binary format was agnostic to the number of processes? > > > The file format is independent of the number of processes that created it. > > I also varied the number of processes between 1 and 60, as soon as I use > more than one process I can observe the spike (and its always twice the > memory, no matter how many processes I'm using). > > > Twice the size of the entire matrix (when stored on one process) or > twice the size of the resulting matrix stored on the first rank? The latter > is exactly as expected, since rank 0 has to load the part of the matrix > destined for the next rank and hence for a short time contains its own part > of the matrix and the part of one other rank. > > Barry > > > I also tried running Valgrind with the --tool=massif option. However, I > don't know what to look for. I can send you the output file separately, if > it helps. > > Best regards, > Michael > > On 07.10.21 16:09, Matthew Knepley wrote: > > On Thu, Oct 7, 2021 at 10:03 AM Barry Smith wrote: > >> >> How many ranks are you using? Is it a sparse matrix with MPIAIJ? >> >> The intention is that for parallel runs the first rank reads in its >> own part of the matrix, then reads in the part of the next rank and sends >> it, then reads the part of the third rank and sends it etc. So there should >> not be too much of a blip in memory usage. 
You can run valgrind with the >> option for tracking memory usage to see exactly where in the code the blip >> occurs; it could be a regression occurred in the code making it require >> more memory. But internal MPI buffers might explain some blip. >> > > Is it possible that we free the memory, but the OS has just not given back > that memory for use yet? How are you measuring memory usage? > > Thanks, > > Matt > > >> Barry >> >> >> > On Oct 7, 2021, at 9:50 AM, Michael Werner >> wrote: >> > >> > Hello, >> > >> > I noticed that there is a peak in memory consumption when I load an >> > existing matrix into PETSc. The matrix is previously created by an >> > external program and saved in the PETSc binary format. >> > The code I'm using in petsc4py is simple: >> > >> > viewer = PETSc.Viewer().createBinary(, "r", >> > comm=PETSc.COMM_WORLD) >> > A = PETSc.Mat().create(comm=PETSc.COMM_WORLD) >> > A.load(viewer) >> > >> > When I run this code in serial, the memory consumption of the process is >> > about 50GB RAM, similar to the file size of the saved matrix. However, >> > if I run the code in parallel, for a few seconds the memory consumption >> > of the process doubles to around 100GB RAM, before dropping back down to >> > around 50GB RAM. So it seems as if, for some reason, the matrix is >> > copied after it is read into memory. Is there a way to avoid this >> > behaviour? Currently, it is a clear bottleneck in my code. >> > >> > I tried setting the size of the matrix and to explicitly preallocate the >> > necessary NNZ (with A.setSizes(dim) and A.setPreallocationNNZ(nnz), >> > respectively) before loading, but that didn't help. >> > >> > As mentioned above, I'm using petsc4py together with PETSc-3.16 on a >> > Linux workstation. >> > >> > Best regards, >> > Michael Werner >> > >> > -- >> > >> > ____________________________________________________ >> > >> > Deutsches Zentrum f?r Luft- und Raumfahrt e.V. (DLR) >> > Institut f?r Aerodynamik und Str?mungstechnik | Bunsenstr. 10 | 37073 >> G?ttingen >> > >> > Michael Werner >> > Telefon 0551 709-2627 | Telefax 0551 709-2811 | Michael.Werner at dlr.de >> > DLR.de >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jed at jedbrown.org Thu Oct 7 15:36:35 2021 From: jed at jedbrown.org (Jed Brown) Date: Thu, 07 Oct 2021 14:36:35 -0600 Subject: [petsc-users] HDF5 corruption In-Reply-To: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> References: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> Message-ID: <8735pcd12k.fsf@jedbrown.org> Adrian Croucher writes: > hi, > > One of the users of my PETSc-based code has reported that HDF5 output > files can be corrupted and unusable if e.g. the run is killed. I've just > done a bit of reading about this and it appears to be a known issue with > HDF5. > > Some people suggest flushing the HDF5 file periodically to help prevent > data loss. I had a look at PetscViewerFlush() but it doesn't seem to be > implemented for the HDF5 viewer- is that correct? 
Correct, but I think we can and should implement it. In your research just now, were there subtleties beyond this call? https://portal.hdfgroup.org/display/HDF5/H5F_FLUSH From a.croucher at auckland.ac.nz Thu Oct 7 17:43:14 2021 From: a.croucher at auckland.ac.nz (Adrian Croucher) Date: Fri, 8 Oct 2021 11:43:14 +1300 Subject: [petsc-users] HDF5 corruption In-Reply-To: <8735pcd12k.fsf@jedbrown.org> References: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> <8735pcd12k.fsf@jedbrown.org> Message-ID: hi Jed, It looked to me like a call to h5f_flush() is all that is required. Some people said there would be a performance hit (maybe ~ 10% slower), which would be the trade-off for increased reliability. So if this were made available via PetscViewerFlush(), I'd probably make it optional in my code so the user could decide for themselves if it was worth it for them. Do you think flushing would be a better option than closing/opening the file between writes? Regards, Adrian On 10/8/21 9:36 AM, Jed Brown wrote: > Adrian Croucher writes: > > > hi, > > > > One of the users of my PETSc-based code has reported that HDF5 output > > files can be corrupted and unusable if e.g. the run is killed. I've > just > > done a bit of reading about this and it appears to be a known issue > with > > HDF5. > > > > Some people suggest flushing the HDF5 file periodically to help prevent > > data loss. I had a look at PetscViewerFlush() but it doesn't seem to be > > implemented for the HDF5 viewer- is that correct? > > Correct, but I think we can and should implement it. In your research > just now, were there subtleties beyond this call? > > https://portal.hdfgroup.org/display/HDF5/H5F_FLUSH > -- Dr Adrian Croucher Senior Research Fellow Department of Engineering Science University of Auckland, New Zealand email: a.croucher at auckland.ac.nz tel: +64 (0)9 923 4611 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jed at jedbrown.org Thu Oct 7 17:48:27 2021 From: jed at jedbrown.org (Jed Brown) Date: Thu, 07 Oct 2021 16:48:27 -0600 Subject: [petsc-users] HDF5 corruption In-Reply-To: References: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> <8735pcd12k.fsf@jedbrown.org> Message-ID: <87k0iobgec.fsf@jedbrown.org> Adrian Croucher writes: > hi Jed, > > It looked to me like a call to h5f_flush() is all that is required. > > Some people said there would be a performance hit (maybe ~ 10% slower), > which would be the trade-off for increased reliability. So if this were > made available via PetscViewerFlush(), I'd probably make it optional in > my code so the user could decide for themselves if it was worth it for them. > > Do you think flushing would be a better option than closing/opening the > file between writes? Yes, less costly at scale (metadata like opening files can be expensive on parallel file systems), and simpler to manage from your code. From michael.werner at dlr.de Fri Oct 8 02:14:26 2021 From: michael.werner at dlr.de (Michael Werner) Date: Fri, 8 Oct 2021 09:14:26 +0200 Subject: [petsc-users] petsc4py - Spike in memory usage when loading a matrix in parallel In-Reply-To: References: <97889b6d-e7ce-5366-7e49-e4cd42ac0b1d@dlr.de> <07CEDA45-CCCB-4EA7-8AAC-F2FB9E69A654@petsc.dev> <0EAF1EE7-C34D-4118-BF74-78E1D983EFFD@petsc.dev> <646573ca-cd86-48a0-cbaa-2c8d0b6eb704@dlr.de> Message-ID: I can understand that process 0 needs to have twice its own memory due to the process Barry explained. However, in my case every process has twice the "necessary" memory. 
That doesn't seem to be correct to me. Especially with Barry's explanation in mind it seems strange that all processes have the same peak memory usage. If it were only process 0 then it wouldn't matter, because with enough processes the overhead would be negligible. Best regards, Michael On 07.10.21 18:32, Matthew Knepley wrote: > On Thu, Oct 7, 2021 at 11:59 AM Michael Werner > wrote: > > Its twice the memory of the entire matrix (when stored on one > process). I also just sent you the valgrind results, both for a > serial run and a parallel run. The size on disk of the matrix I > used is 20 GB. > In the serial run, valgrind shows a peak memory usage of 21GB, > while in the parallel run (with 4 processes) each process shows a > peak memory usage of 10.8GB > > > Barry is right that at least proc 0 must have twice its own memory, > since it loads the other pieces. That makes 10GB sounds correct. > > ? Thanks, > > ? ? ?Matt > ? > > Best regards, > Michael > > On 07.10.21 17:55, Barry Smith wrote: >> >> >>> On Oct 7, 2021, at 11:35 AM, Michael Werner >>> > wrote: >>> >>> Currently I'm using psutil to query every process for its memory >>> usage and sum it up. However, the spike was only visible in top >>> (I had a call to psutil right before and after A.load(viewer), >>> and both reported only 50 GB of RAM usage). That's why I thought >>> it might be directly tied to loading the matrix. However, I also >>> had the problem that the computation crashed due to running out >>> of memory while loading a matrix that should in theory fit into >>> memory. In that case I would expect the OS to free unused meory >>> immediatly, right? >>> >>> Concerning Barry's questions: the matrix is a sparse matrix and >>> is originally created sequentially as SEQAIJ. However, it is >>> then loaded as MPIAIJ, and if I look at the memory usage of the >>> various processes, they fill up one after another, just as >>> described. Is the origin of the matrix somehow preserved in the >>> binary file? I was under the impression that the binary format >>> was agnostic to the number of processes? >> >> ?The file format is independent of the number of processes that >> created it. >> >>> I also varied the number of processes between 1 and 60, as soon >>> as I use more than one process I can observe the spike (and its >>> always twice the memory, no matter how many processes I'm using). >> >> ? Twice the size of the entire matrix (when stored on one >> process) or twice the size of the resulting matrix stored on the >> first rank? The latter is exactly as expected, since rank 0 has >> to load the part of the matrix destined for the next rank and >> hence for a short time contains its own part of the matrix and >> the part of one other rank. >> >> ? Barry >> >>> >>> I also tried running Valgrind with the --tool=massif option. >>> However, I don't know what to look for. I can send you the >>> output file separately, if it helps. >>> >>> Best regards, >>> Michael >>> >>> On 07.10.21 16:09, Matthew Knepley wrote: >>>> On Thu, Oct 7, 2021 at 10:03 AM Barry Smith >>> > wrote: >>>> >>>> >>>> ? ?How many ranks are you using? Is it a sparse matrix with >>>> MPIAIJ? >>>> >>>> ? ?The intention is that for parallel runs the first rank >>>> reads in its own part of the matrix, then reads in the part >>>> of the next rank and sends it, then reads the part of the >>>> third rank and sends it etc. So there should not be too >>>> much of a blip in memory usage. 
You can run valgrind with >>>> the option for tracking memory usage to see exactly where >>>> in the code the blip occurs; it could be a regression >>>> occurred in the code making it require more memory. But >>>> internal MPI buffers might explain some blip. >>>> >>>> >>>> Is it possible that we free the memory, but the OS has just not >>>> given back that memory for use yet? How are you measuring >>>> memory usage? >>>> >>>> ? Thanks, >>>> >>>> ? ? ?Matt >>>> ? >>>> >>>> ? Barry >>>> >>>> >>>> > On Oct 7, 2021, at 9:50 AM, Michael Werner >>>> > wrote: >>>> > >>>> > Hello, >>>> > >>>> > I noticed that there is a peak in memory consumption when >>>> I load an >>>> > existing matrix into PETSc. The matrix is previously >>>> created by an >>>> > external program and saved in the PETSc binary format. >>>> > The code I'm using in petsc4py is simple: >>>> > >>>> > viewer = >>>> PETSc.Viewer().createBinary(, "r", >>>> > comm=PETSc.COMM_WORLD) >>>> > A = PETSc.Mat().create(comm=PETSc.COMM_WORLD) >>>> > A.load(viewer) >>>> > >>>> > When I run this code in serial, the memory consumption of >>>> the process is >>>> > about 50GB RAM, similar to the file size of the saved >>>> matrix. However, >>>> > if I run the code in parallel, for a few seconds the >>>> memory consumption >>>> > of the process doubles to around 100GB RAM, before >>>> dropping back down to >>>> > around 50GB RAM. So it seems as if, for some reason, the >>>> matrix is >>>> > copied after it is read into memory. Is there a way to >>>> avoid this >>>> > behaviour? Currently, it is a clear bottleneck in my code. >>>> > >>>> > I tried setting the size of the matrix and to explicitly >>>> preallocate the >>>> > necessary NNZ (with A.setSizes(dim) and >>>> A.setPreallocationNNZ(nnz), >>>> > respectively) before loading, but that didn't help. >>>> > >>>> > As mentioned above, I'm using petsc4py together with >>>> PETSc-3.16 on a >>>> > Linux workstation. >>>> > >>>> > Best regards, >>>> > Michael Werner >>>> > >>>> > -- >>>> > >>>> > ____________________________________________________ >>>> > >>>> > Deutsches Zentrum f?r Luft- und Raumfahrt e.V. (DLR) >>>> > Institut f?r Aerodynamik und Str?mungstechnik | >>>> Bunsenstr. 10 | 37073 G?ttingen >>>> > >>>> > Michael Werner >>>> > Telefon 0551 709-2627 | Telefax 0551 709-2811 | >>>> Michael.Werner at dlr.de >>>> > DLR.de >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> >>>> >>>> >>>> -- >>>> What most experimenters take for granted before they begin >>>> their experiments is infinitely more interesting than any >>>> results to which their experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>> >> > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From koziccla at fit.cvut.cz Fri Oct 8 05:29:33 2021 From: koziccla at fit.cvut.cz (Claudio =?iso-8859-1?Q?Kozick=FD?=) Date: Fri, 8 Oct 2021 12:29:33 +0200 Subject: [petsc-users] MatMult with one sequential and one parallel vector Message-ID: Hello, I am using PETSc in a performance comparison that evaluates the performance of parallel sparse matrix-vector multiplication (SpMV). 
For this purpose I have implemented a simple SpMV operation using PETSc, which multiplies parallel matrix?A (type MatAIJ) with parallel vector?x1 and stores the result in parallel vector?y1. Thus I perform SpMV using PETSc as MatMult(A, x1, y1). This part works without any problems. I would also like to implement SpMV operation y2 = A * x2, where x2?is a sequential vector (i.e.?created using VecCreateSeq) but where A and?y2 are still parallel. The resulting implementation would be something like: MatCreateAIJ(..., &A); // a parallel matrix VecCreateSeq(..., &x2); // a per-process _sequential_ vector VecCreateMPI(..., &y2); // a parallel vector MatMult(A, x2, y2); The motivation of storing all of?x2 in each process to is remove the need of broadcasting any elements of?x2 (this approach makes sense in the context of what I am benchmarking). However I cannot seem to get this approach to work in PETSc. For example when I try this approach with a 4-by-4 matrix, with two order-4 vectors and using two MPI processes, then PETSc prints: Nonconforming object sizes Mat mat,Vec x: local dim 2 4 I have attached a minimal working example that demonstrates what I am attempting to perform. Could it be that PETSc does not support combining a parallel and sequential vector in a single MatMult call? I have found functions for scattering and gathering vectors in the documentation of PETSc, but these do not seem to be a good match for what I am trying to benchmark. My intention is for each process to keep an identical copy of vector?x2 and therefore the necessity to scatter or gather values in?x2 should never arise. I would appreciate if somebody could help point me in the right direction regarding my failing MatMult call. Thanks! -- Claudio Kozick? -------------- next part -------------- #include int main(int argc, char **argv) { PetscErrorCode ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr != 0) { return 1; } Mat A; ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, 4, 4, 1, NULL, 0, NULL, &A); CHKERRQ(ierr); PetscInt rows[] = {0, 1, 2, 3}; PetscInt cols[] = {0, 1, 2, 3}; PetscScalar vals[] = {10, 20, 30, 40}; for (int i = 0; i < 4; ++i) { ierr = MatSetValues(A, 1, &rows[i], 1, &cols[i], &vals[i], INSERT_VALUES); CHKERRQ(ierr); } ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr); ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr); Vec x1; ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, 4, &x1); CHKERRQ(ierr); ierr = VecSet(x1, 10); CHKERRQ(ierr); Vec y1; ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, 4, &y1); CHKERRQ(ierr); ierr = MatMult(A, x1, y1); CHKERRQ(ierr); ierr = VecView(y1, PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr); Vec x2; ierr = VecCreateSeq(PETSC_COMM_SELF, 4, &x2); CHKERRQ(ierr); ierr = VecSet(x2, 10); CHKERRQ(ierr); Vec y2; ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, 4, &y2); CHKERRQ(ierr); ierr = MatMult(A, x2, y2); CHKERRQ(ierr); ierr = VecView(y2, PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr); return PetscFinalize(); } From knepley at gmail.com Fri Oct 8 05:36:53 2021 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 8 Oct 2021 06:36:53 -0400 Subject: [petsc-users] MatMult with one sequential and one parallel vector In-Reply-To: References: Message-ID: On Fri, Oct 8, 2021 at 6:29 AM Claudio Kozick? wrote: > Hello, > > I am using PETSc in a performance comparison that evaluates the > performance of parallel sparse matrix-vector multiplication (SpMV). 
For > this purpose I have implemented a simple SpMV operation using PETSc, > which multiplies parallel matrix A (type MatAIJ) with parallel vector x1 > and stores the result in parallel vector y1. Thus I perform SpMV using > PETSc as MatMult(A, x1, y1). This part works without any problems. > > I would also like to implement SpMV operation y2 = A * x2, where x2 is a > sequential vector (i.e. created using VecCreateSeq) but where A and y2 > are still parallel. The resulting implementation would be something > like: > > MatCreateAIJ(..., &A); // a parallel matrix > VecCreateSeq(..., &x2); // a per-process _sequential_ vector > VecCreateMPI(..., &y2); // a parallel vector > MatMult(A, x2, y2); > You are correct. MatMult() is not intended to be used with different communicators. You can get the effect you want by making a block diagonal matrix A with P*N columns. Thanks, Matt > The motivation of storing all of x2 in each process to is remove the > need of broadcasting any elements of x2 (this approach makes sense in > the context of what I am benchmarking). However I cannot seem to get > this approach to work in PETSc. For example when I try this approach > with a 4-by-4 matrix, with two order-4 vectors and using two MPI > processes, then PETSc prints: > > Nonconforming object sizes > Mat mat,Vec x: local dim 2 4 > > I have attached a minimal working example that demonstrates what I am > attempting to perform. Could it be that PETSc does not support > combining a parallel and sequential vector in a single MatMult call? > > I have found functions for scattering and gathering vectors in the > documentation of PETSc, but these do not seem to be a good match for > what I am trying to benchmark. My intention is for each process to keep > an identical copy of vector x2 and therefore the necessity to scatter or > gather values in x2 should never arise. > > I would appreciate if somebody could help point me in the right > direction regarding my failing MatMult call. Thanks! > > -- > Claudio Kozick? > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyan.chockalingam at stfc.ac.uk Fri Oct 8 05:59:05 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Fri, 8 Oct 2021 10:59:05 +0000 Subject: [petsc-users] hypre on gpus Message-ID: Hello, I am trying to run ex45 (in KSP tutorial) using hypre on gpus. I have attached the python configuration file and -log_view output from running the below command options mpirun -n 2 ./ex45 -log_view -da_grid_x 169 -da_grid_y 169 -da_grid_z 169 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg -ksp_gmres_restart 31 -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor The problem was solved and converged but from the output file I suspect hypre is not running on gpus as PCApply and DMCreate does not record any gpu Mflop/s. However, some events such KSPSolve, MatMult etc are running on gpus. Can you please let me know if I need to add any extra flag to the attached arch-ci-linux-cuda11-double-xx.py script file to get hypre working on gpus? Thanks, Karthik. This email and any attachments are intended solely for the use of the named recipients. 
If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: arch-ci-linux-cuda11-double-xx.py Type: text/x-python-script Size: 779 bytes Desc: arch-ci-linux-cuda11-double-xx.py URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ksp_gmres_pc_hypre_ex45_N169_gpu_2.txt URL: From mfadams at lbl.gov Fri Oct 8 08:33:31 2021 From: mfadams at lbl.gov (Mark Adams) Date: Fri, 8 Oct 2021 09:33:31 -0400 Subject: [petsc-users] hypre on gpus In-Reply-To: References: Message-ID: Hypre does not record its flops with PETSc's timers. Configure with and without CUDA and see if the timings change in PCApply. Hypre does not dynamically switch between CUDA and CPU solves at this time, but you want to use -dm_mat_type hypre. Mark On Fri, Oct 8, 2021 at 6:59 AM Karthikeyan Chockalingam - STFC UKRI < karthikeyan.chockalingam at stfc.ac.uk> wrote: > Hello, > > > > I am trying to run ex45 (in KSP tutorial) using hypre on gpus. I have > attached the python configuration file and -log_view output from running > the below command options > > > > mpirun -n 2 ./ex45 -log_view -da_grid_x 169 -da_grid_y 169 -da_grid_z 169 > -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type gmres -pc_type > hypre -pc_hypre_type boomeramg -ksp_gmres_restart 31 > -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor > > > > The problem was solved and converged but from the output file I suspect > hypre is not running on gpus as PCApply and DMCreate does *not* record > any gpu Mflop/s. However, some events such KSPSolve, MatMult etc are > running on gpus. > > > > Can you please let me know if I need to add any extra flag to the attached > arch-ci-linux-cuda11-double-xx.py script file to get hypre working on gpus? > > > > Thanks, > > Karthik. > > > > > > This email and any attachments are intended solely for the use of the > named recipients. If you are not the intended recipient you must not use, > disclose, copy or distribute this email or any of its attachments and > should notify the sender immediately and delete this email from your > system. UK Research and Innovation (UKRI) has taken every reasonable > precaution to minimise risk of this email or any attachments containing > viruses or malware but the recipient should carry out its own virus and > malware checks before opening the attachments. UKRI does not accept any > liability for any losses or damages which the recipient may sustain due to > presence of any viruses. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From karthikeyan.chockalingam at stfc.ac.uk Fri Oct 8 08:55:14 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Fri, 8 Oct 2021 13:55:14 +0000 Subject: [petsc-users] hypre on gpus In-Reply-To: References: Message-ID: <35051739-B062-4DF3-B4E2-2C1297453609@stfc.ac.uk> Thanks Mark, I will try your recommendations. Should I also change -dm_vec_type to hypre currently I have it as mpicuda? Karthik. From: Mark Adams Date: Friday, 8 October 2021 at 14:33 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] hypre on gpus Hypre does not record its flops with PETSc's timers. Configure with and without CUDA and see if the timings change in PCApply. Hypre does not dynamically switch between CUDA and CPU solves at this time, but you want to use -dm_mat_type hypre. Mark On Fri, Oct 8, 2021 at 6:59 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I am trying to run ex45 (in KSP tutorial) using hypre on gpus. I have attached the python configuration file and -log_view output from running the below command options mpirun -n 2 ./ex45 -log_view -da_grid_x 169 -da_grid_y 169 -da_grid_z 169 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg -ksp_gmres_restart 31 -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor The problem was solved and converged but from the output file I suspect hypre is not running on gpus as PCApply and DMCreate does not record any gpu Mflop/s. However, some events such KSPSolve, MatMult etc are running on gpus. Can you please let me know if I need to add any extra flag to the attached arch-ci-linux-cuda11-double-xx.py script file to get hypre working on gpus? Thanks, Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyan.chockalingam at stfc.ac.uk Fri Oct 8 09:29:16 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Fri, 8 Oct 2021 14:29:16 +0000 Subject: [petsc-users] hypre on gpus In-Reply-To: <35051739-B062-4DF3-B4E2-2C1297453609@stfc.ac.uk> References: <35051739-B062-4DF3-B4E2-2C1297453609@stfc.ac.uk> Message-ID: <39D1801E-DC81-4F1E-912C-DBD78BDD01DB@stfc.ac.uk> The PCApply timing on gpu PCApply 6 1.0 1.0235e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 39 0 0 0 0 39 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 and cpu PCApply 6 1.0 1.0242e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 41 0 0 0 0 41 0 0 0 0 0 are close. It is hard for me tell if hypre on gpu is on or not. Best, Karthik. From: "Chockalingam, Karthikeyan (STFC,DL,HC)" Date: Friday, 8 October 2021 at 14:55 To: Mark Adams Cc: "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] hypre on gpus Thanks Mark, I will try your recommendations. Should I also change -dm_vec_type to hypre currently I have it as mpicuda? 
Karthik. From: Mark Adams Date: Friday, 8 October 2021 at 14:33 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] hypre on gpus Hypre does not record its flops with PETSc's timers. Configure with and without CUDA and see if the timings change in PCApply. Hypre does not dynamically switch between CUDA and CPU solves at this time, but you want to use -dm_mat_type hypre. Mark On Fri, Oct 8, 2021 at 6:59 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I am trying to run ex45 (in KSP tutorial) using hypre on gpus. I have attached the python configuration file and -log_view output from running the below command options mpirun -n 2 ./ex45 -log_view -da_grid_x 169 -da_grid_y 169 -da_grid_z 169 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg -ksp_gmres_restart 31 -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor The problem was solved and converged but from the output file I suspect hypre is not running on gpus as PCApply and DMCreate does not record any gpu Mflop/s. However, some events such KSPSolve, MatMult etc are running on gpus. Can you please let me know if I need to add any extra flag to the attached arch-ci-linux-cuda11-double-xx.py script file to get hypre working on gpus? Thanks, Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Fri Oct 8 10:08:46 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Fri, 8 Oct 2021 10:08:46 -0500 Subject: [petsc-users] MatMult with one sequential and one parallel vector In-Reply-To: References: Message-ID: Hi, Claudio, You might be aware that petsc internally splits a MATMPIAIJ matrix A in two MATSEQAIJ matrices Ad and Ao, for the *diagonal* portion and the *off-diagonal *portion respectively, see https://petsc.org/release/docs/manualpages/Mat/MatCreateAIJ.html#MatCreateAIJ There are tricks in MatMult(). Ao is a 'reduced' matrix, i.e., using local column indices. Ao's column size (as returned by MatGetLocalSize) is the number of nonzero columns in the off-diagonal portion. petsc kind of implements MatMult(A,x, y) as y = Ad*x + Ao*lvec, where lvec is a sequential vector with entries gathered/communicated from other processes. petsc only communicates entries corresponding to nonzero columns. For your experimental purpose, you can un-split the matrix with MatMPIAIJGetLocalMat(A,scall,&Aloc), and do y2 = Aloc * x2. Note y2 is a sequential vector, since Aloc and x2 are sequential. You can alias y2 with y, by letting them share the data array. Basically, VecGetArray(y,&a) and VecCreateSeqWithArray(..,a,&y2). That is easy and you can search petsc doc to find info. For benchmarking, you need to consider the MatMPIAIJGetLocalMat() cost. It does memory copying to merge Ad and Ao into Aloc. Hope that helps. 
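For reference, a minimal C sketch of the steps described above (error checking omitted; it assumes A is the MPIAIJ matrix, x2 the fully replicated sequential input of global length N, and y a parallel vector of matching global size):

  Mat          Aloc;      /* local rows of A, with Ad and Ao merged     */
  Vec          y2;        /* sequential view of this rank's part of y   */
  PetscScalar *a;
  PetscInt     nlocal;

  MatMPIAIJGetLocalMat(A, MAT_INITIAL_MATRIX, &Aloc);
  VecGetLocalSize(y, &nlocal);
  VecGetArray(y, &a);
  VecCreateSeqWithArray(PETSC_COMM_SELF, 1, nlocal, a, &y2);
  MatMult(Aloc, x2, y2);  /* no communication: x2 is fully replicated   */
  VecDestroy(&y2);
  VecRestoreArray(y, &a);
  MatDestroy(&Aloc);

Since this MatMult() touches only local data, any timing difference against the plain MatMult(A, x1, y1) then comes from the merged storage layout plus the one-time copy inside MatMPIAIJGetLocalMat().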
--Junchao Zhang On Fri, Oct 8, 2021 at 5:29 AM Claudio Kozick? wrote: > Hello, > > I am using PETSc in a performance comparison that evaluates the > performance of parallel sparse matrix-vector multiplication (SpMV). For > this purpose I have implemented a simple SpMV operation using PETSc, > which multiplies parallel matrix A (type MatAIJ) with parallel vector x1 > and stores the result in parallel vector y1. Thus I perform SpMV using > PETSc as MatMult(A, x1, y1). This part works without any problems. > > I would also like to implement SpMV operation y2 = A * x2, where x2 is a > sequential vector (i.e. created using VecCreateSeq) but where A and y2 > are still parallel. The resulting implementation would be something > like: > > MatCreateAIJ(..., &A); // a parallel matrix > VecCreateSeq(..., &x2); // a per-process _sequential_ vector > VecCreateMPI(..., &y2); // a parallel vector > MatMult(A, x2, y2); > > The motivation of storing all of x2 in each process to is remove the > need of broadcasting any elements of x2 (this approach makes sense in > the context of what I am benchmarking). However I cannot seem to get > this approach to work in PETSc. For example when I try this approach > with a 4-by-4 matrix, with two order-4 vectors and using two MPI > processes, then PETSc prints: > > Nonconforming object sizes > Mat mat,Vec x: local dim 2 4 > > I have attached a minimal working example that demonstrates what I am > attempting to perform. Could it be that PETSc does not support > combining a parallel and sequential vector in a single MatMult call? > > I have found functions for scattering and gathering vectors in the > documentation of PETSc, but these do not seem to be a good match for > what I am trying to benchmark. My intention is for each process to keep > an identical copy of vector x2 and therefore the necessity to scatter or > gather values in x2 should never arise. > > I would appreciate if somebody could help point me in the right > direction regarding my failing MatMult call. Thanks! > > -- > Claudio Kozick? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Fri Oct 8 10:36:22 2021 From: mfadams at lbl.gov (Mark Adams) Date: Fri, 8 Oct 2021 11:36:22 -0400 Subject: [petsc-users] hypre on gpus In-Reply-To: <39D1801E-DC81-4F1E-912C-DBD78BDD01DB@stfc.ac.uk> References: <35051739-B062-4DF3-B4E2-2C1297453609@stfc.ac.uk> <39D1801E-DC81-4F1E-912C-DBD78BDD01DB@stfc.ac.uk> Message-ID: On Fri, Oct 8, 2021 at 10:29 AM Karthikeyan Chockalingam - STFC UKRI < karthikeyan.chockalingam at stfc.ac.uk> wrote: > The PCApply timing on > > > > gpu > > > > PCApply 6 1.0 1.0235e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 39 0 0 0 0 39 0 0 0 0 0 0 0 0.00e+00 0 > 0.00e+00 0 > > > > and cpu > > > > PCApply 6 1.0 1.0242e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 41 0 0 0 0 41 0 0 0 0 0 > > > You don't have GPUs. probably. Use -dm_mat_type hypre. > are close. It is hard for me tell if hypre on gpu is on or not. > > > > Best, > > Karthik. > > > > > > *From: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Date: *Friday, 8 October 2021 at 14:55 > *To: *Mark Adams > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] hypre on gpus > > > > Thanks Mark, I will try your recommendations. > > Should I also change -dm_vec_type to hypre currently I have it as mpicuda? > > > > Karthik. 
> > > > > > *From: *Mark Adams > *Date: *Friday, 8 October 2021 at 14:33 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] hypre on gpus > > > > Hypre does not record its flops with PETSc's timers. > > Configure with and without CUDA and see if the timings change in PCApply. > > Hypre does not dynamically switch between CUDA and CPU solves at > this time, but you want to use -dm_mat_type hypre. > > Mark > > > > On Fri, Oct 8, 2021 at 6:59 AM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > Hello, > > > > I am trying to run ex45 (in KSP tutorial) using hypre on gpus. I have > attached the python configuration file and -log_view output from running > the below command options > > > > mpirun -n 2 ./ex45 -log_view -da_grid_x 169 -da_grid_y 169 -da_grid_z 169 > -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type gmres -pc_type > hypre -pc_hypre_type boomeramg -ksp_gmres_restart 31 > -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor > > > > The problem was solved and converged but from the output file I suspect > hypre is not running on gpus as PCApply and DMCreate does *not* record > any gpu Mflop/s. However, some events such KSPSolve, MatMult etc are > running on gpus. > > > > Can you please let me know if I need to add any extra flag to the attached > arch-ci-linux-cuda11-double-xx.py script file to get hypre working on gpus? > > > > Thanks, > > Karthik. > > > > > > This email and any attachments are intended solely for the use of the > named recipients. If you are not the intended recipient you must not use, > disclose, copy or distribute this email or any of its attachments and > should notify the sender immediately and delete this email from your > system. UK Research and Innovation (UKRI) has taken every reasonable > precaution to minimise risk of this email or any attachments containing > viruses or malware but the recipient should carry out its own virus and > malware checks before opening the attachments. UKRI does not accept any > liability for any losses or damages which the recipient may sustain due to > presence of any viruses. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyan.chockalingam at stfc.ac.uk Fri Oct 8 11:19:38 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Fri, 8 Oct 2021 16:19:38 +0000 Subject: [petsc-users] hypre on gpus In-Reply-To: References: <35051739-B062-4DF3-B4E2-2C1297453609@stfc.ac.uk> <39D1801E-DC81-4F1E-912C-DBD78BDD01DB@stfc.ac.uk> Message-ID: I tried a different exercise ran the same problem on two cpu cores and on two gpu: On gpu PCApply 6 1.0 6.0335e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 15 0 0 0 1 15 0 0 0 1 0 0 0 0.00e+00 5 9.65e+01 0 and on cpu PCApply 6 1.0 5.6348e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 0 timings again are close but gpu version did a reduction 6.0e+00 but the cpu version did not 0.0e+00. I am not sure if that is any indication if hypre ran on gpus? Thanks, Karthik. 
From: Mark Adams Date: Friday, 8 October 2021 at 16:36 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] hypre on gpus On Fri, Oct 8, 2021 at 10:29 AM Karthikeyan Chockalingam - STFC UKRI > wrote: The PCApply timing on gpu PCApply 6 1.0 1.0235e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 39 0 0 0 0 39 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 and cpu PCApply 6 1.0 1.0242e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 41 0 0 0 0 41 0 0 0 0 0 You don't have GPUs. probably. Use -dm_mat_type hypre. are close. It is hard for me tell if hypre on gpu is on or not. Best, Karthik. From: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Date: Friday, 8 October 2021 at 14:55 To: Mark Adams > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] hypre on gpus Thanks Mark, I will try your recommendations. Should I also change -dm_vec_type to hypre currently I have it as mpicuda? Karthik. From: Mark Adams > Date: Friday, 8 October 2021 at 14:33 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] hypre on gpus Hypre does not record its flops with PETSc's timers. Configure with and without CUDA and see if the timings change in PCApply. Hypre does not dynamically switch between CUDA and CPU solves at this time, but you want to use -dm_mat_type hypre. Mark On Fri, Oct 8, 2021 at 6:59 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I am trying to run ex45 (in KSP tutorial) using hypre on gpus. I have attached the python configuration file and -log_view output from running the below command options mpirun -n 2 ./ex45 -log_view -da_grid_x 169 -da_grid_y 169 -da_grid_z 169 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg -ksp_gmres_restart 31 -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor The problem was solved and converged but from the output file I suspect hypre is not running on gpus as PCApply and DMCreate does not record any gpu Mflop/s. However, some events such KSPSolve, MatMult etc are running on gpus. Can you please let me know if I need to add any extra flag to the attached arch-ci-linux-cuda11-double-xx.py script file to get hypre working on gpus? Thanks, Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Fri Oct 8 12:27:45 2021 From: mfadams at lbl.gov (Mark Adams) Date: Fri, 8 Oct 2021 13:27:45 -0400 Subject: [petsc-users] hypre on gpus In-Reply-To: References: <35051739-B062-4DF3-B4E2-2C1297453609@stfc.ac.uk> <39D1801E-DC81-4F1E-912C-DBD78BDD01DB@stfc.ac.uk> Message-ID: Did you use -dm_mat_type hypre on the GPU case ? 
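If not, a run line along these lines would be worth trying (same options as earlier in the thread, with the mat type switched to hypre; -dm_vec_type cuda is a guess based on the 'cuda' vec type, not something I have verified with this example):

mpirun -n 2 ./ex45 -da_grid_x 169 -da_grid_y 169 -da_grid_z 169 -dm_mat_type hypre -dm_vec_type cuda -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor -log_view

A check that does not depend on PETSc's flop counters is to watch nvidia-smi in another terminal while the solve runs: if BoomerAMG is actually on the device, GPU memory use and utilization should climb well beyond what the CUDA vectors alone account for.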
On Fri, Oct 8, 2021 at 12:19 PM Karthikeyan Chockalingam - STFC UKRI < karthikeyan.chockalingam at stfc.ac.uk> wrote: > I tried a different exercise ran the same problem on two cpu cores and on > two gpu: > > > > On gpu > > > > PCApply 6 1.0 6.0335e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 15 0 0 0 1 15 0 0 0 1 0 0 0 0.00e+00 5 > 9.65e+01 0 > > > > and on cpu > > > > PCApply 6 1.0 5.6348e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 16 0 0 0 0 16 0 0 0 0 0 > > > > timings again are close but gpu version did a reduction 6.0e+00 but the > cpu version did not 0.0e+00. > > I am not sure if that is any indication if hypre ran on gpus? > > > > Thanks, > > Karthik. > > > > > > *From: *Mark Adams > *Date: *Friday, 8 October 2021 at 16:36 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] hypre on gpus > > > > > > > > On Fri, Oct 8, 2021 at 10:29 AM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > The PCApply timing on > > > > gpu > > > > PCApply 6 1.0 1.0235e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 39 0 0 0 0 39 0 0 0 0 0 0 0 0.00e+00 0 > 0.00e+00 0 > > > > and cpu > > > > PCApply 6 1.0 1.0242e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 41 0 0 0 0 41 0 0 0 0 0 > > > > > > You don't have GPUs. probably. > > Use -dm_mat_type hypre. > > > > are close. It is hard for me tell if hypre on gpu is on or not. > > > > Best, > > Karthik. > > > > > > *From: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Date: *Friday, 8 October 2021 at 14:55 > *To: *Mark Adams > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] hypre on gpus > > > > Thanks Mark, I will try your recommendations. > > Should I also change -dm_vec_type to hypre currently I have it as mpicuda? > > > > Karthik. > > > > > > *From: *Mark Adams > *Date: *Friday, 8 October 2021 at 14:33 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] hypre on gpus > > > > Hypre does not record its flops with PETSc's timers. > > Configure with and without CUDA and see if the timings change in PCApply. > > Hypre does not dynamically switch between CUDA and CPU solves at > this time, but you want to use -dm_mat_type hypre. > > Mark > > > > On Fri, Oct 8, 2021 at 6:59 AM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > Hello, > > > > I am trying to run ex45 (in KSP tutorial) using hypre on gpus. I have > attached the python configuration file and -log_view output from running > the below command options > > > > mpirun -n 2 ./ex45 -log_view -da_grid_x 169 -da_grid_y 169 -da_grid_z 169 > -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type gmres -pc_type > hypre -pc_hypre_type boomeramg -ksp_gmres_restart 31 > -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor > > > > The problem was solved and converged but from the output file I suspect > hypre is not running on gpus as PCApply and DMCreate does *not* record > any gpu Mflop/s. However, some events such KSPSolve, MatMult etc are > running on gpus. > > > > Can you please let me know if I need to add any extra flag to the attached > arch-ci-linux-cuda11-double-xx.py script file to get hypre working on gpus? > > > > Thanks, > > Karthik. > > > > > > This email and any attachments are intended solely for the use of the > named recipients. 
If you are not the intended recipient you must not use, > disclose, copy or distribute this email or any of its attachments and > should notify the sender immediately and delete this email from your > system. UK Research and Innovation (UKRI) has taken every reasonable > precaution to minimise risk of this email or any attachments containing > viruses or malware but the recipient should carry out its own virus and > malware checks before opening the attachments. UKRI does not accept any > liability for any losses or damages which the recipient may sustain due to > presence of any viruses. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyan.chockalingam at stfc.ac.uk Fri Oct 8 12:35:55 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Fri, 8 Oct 2021 17:35:55 +0000 Subject: [petsc-users] hypre on gpus In-Reply-To: References: <35051739-B062-4DF3-B4E2-2C1297453609@stfc.ac.uk> <39D1801E-DC81-4F1E-912C-DBD78BDD01DB@stfc.ac.uk> Message-ID: <6215BB2D-14CB-42A4-9A22-31AE98B2237C@stfc.ac.uk> Yes, I used it for both cpu and gpu. Is that not okay? For gpu: -dm_mat_type hypre -dm_vec_type mpicuda For cpu: -dm_mat_type hypre -dm_vec_type mpi From: Mark Adams Date: Friday, 8 October 2021 at 18:28 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] hypre on gpus Did you use -dm_mat_type hypre on the GPU case ? On Fri, Oct 8, 2021 at 12:19 PM Karthikeyan Chockalingam - STFC UKRI > wrote: I tried a different exercise ran the same problem on two cpu cores and on two gpu: On gpu PCApply 6 1.0 6.0335e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 15 0 0 0 1 15 0 0 0 1 0 0 0 0.00e+00 5 9.65e+01 0 and on cpu PCApply 6 1.0 5.6348e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 0 timings again are close but gpu version did a reduction 6.0e+00 but the cpu version did not 0.0e+00. I am not sure if that is any indication if hypre ran on gpus? Thanks, Karthik. From: Mark Adams > Date: Friday, 8 October 2021 at 16:36 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] hypre on gpus On Fri, Oct 8, 2021 at 10:29 AM Karthikeyan Chockalingam - STFC UKRI > wrote: The PCApply timing on gpu PCApply 6 1.0 1.0235e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 39 0 0 0 0 39 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 and cpu PCApply 6 1.0 1.0242e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 41 0 0 0 0 41 0 0 0 0 0 You don't have GPUs. probably. Use -dm_mat_type hypre. are close. It is hard for me tell if hypre on gpu is on or not. Best, Karthik. From: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Date: Friday, 8 October 2021 at 14:55 To: Mark Adams > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] hypre on gpus Thanks Mark, I will try your recommendations. Should I also change -dm_vec_type to hypre currently I have it as mpicuda? Karthik. From: Mark Adams > Date: Friday, 8 October 2021 at 14:33 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" > Cc: "petsc-users at mcs.anl.gov" > Subject: Re: [petsc-users] hypre on gpus Hypre does not record its flops with PETSc's timers. Configure with and without CUDA and see if the timings change in PCApply. Hypre does not dynamically switch between CUDA and CPU solves at this time, but you want to use -dm_mat_type hypre. Mark On Fri, Oct 8, 2021 at 6:59 AM Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I am trying to run ex45 (in KSP tutorial) using hypre on gpus. 
I have attached the python configuration file and -log_view output from running the below command options mpirun -n 2 ./ex45 -log_view -da_grid_x 169 -da_grid_y 169 -da_grid_z 169 -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg -ksp_gmres_restart 31 -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor The problem was solved and converged but from the output file I suspect hypre is not running on gpus as PCApply and DMCreate does not record any gpu Mflop/s. However, some events such KSPSolve, MatMult etc are running on gpus. Can you please let me know if I need to add any extra flag to the attached arch-ci-linux-cuda11-double-xx.py script file to get hypre working on gpus? Thanks, Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Fri Oct 8 12:46:47 2021 From: mfadams at lbl.gov (Mark Adams) Date: Fri, 8 Oct 2021 13:46:47 -0400 Subject: [petsc-users] hypre on gpus In-Reply-To: <6215BB2D-14CB-42A4-9A22-31AE98B2237C@stfc.ac.uk> References: <35051739-B062-4DF3-B4E2-2C1297453609@stfc.ac.uk> <39D1801E-DC81-4F1E-912C-DBD78BDD01DB@stfc.ac.uk> <6215BB2D-14CB-42A4-9A22-31AE98B2237C@stfc.ac.uk> Message-ID: I think you would want to use 'cuda' vec_type, but I . You might ask Hypre how one verifies that the GPU is used. Mark On Fri, Oct 8, 2021 at 1:35 PM Karthikeyan Chockalingam - STFC UKRI < karthikeyan.chockalingam at stfc.ac.uk> wrote: > Yes, I used it for both cpu and gpu. Is that not okay? > > > > For gpu: -dm_mat_type hypre -dm_vec_type mpicuda > > > > For cpu: -dm_mat_type hypre -dm_vec_type mpi > > > > *From: *Mark Adams > *Date: *Friday, 8 October 2021 at 18:28 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] hypre on gpus > > > > Did you use -dm_mat_type hypre on the GPU case ? > > > > On Fri, Oct 8, 2021 at 12:19 PM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > I tried a different exercise ran the same problem on two cpu cores and on > two gpu: > > > > On gpu > > > > PCApply 6 1.0 6.0335e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 15 0 0 0 1 15 0 0 0 1 0 0 0 0.00e+00 5 > 9.65e+01 0 > > > > and on cpu > > > > PCApply 6 1.0 5.6348e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 16 0 0 0 0 16 0 0 0 0 0 > > > > timings again are close but gpu version did a reduction 6.0e+00 but the > cpu version did not 0.0e+00. > > I am not sure if that is any indication if hypre ran on gpus? > > > > Thanks, > > Karthik. 
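[Editor's note: a sketch of the configure and run lines implied by the advice in this thread, i.e. letting hypre hold the matrix (and vectors) on the GPU. The assumption that --download-hypre builds hypre with CUDA once PETSc itself is configured --with-cuda should be checked against your PETSc version; the remaining options are taken directly from the thread.]

  ./configure --with-cuda=1 --download-hypre
  mpirun -n 2 ./ex45 -log_view -da_grid_x 169 -da_grid_y 169 -da_grid_z 169 \
    -dm_mat_type hypre -dm_vec_type cuda \
    -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg \
    -ksp_gmres_restart 31 -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor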
> > > > > > *From: *Mark Adams > *Date: *Friday, 8 October 2021 at 16:36 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] hypre on gpus > > > > > > > > On Fri, Oct 8, 2021 at 10:29 AM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > The PCApply timing on > > > > gpu > > > > PCApply 6 1.0 1.0235e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 39 0 0 0 0 39 0 0 0 0 0 0 0 0.00e+00 0 > 0.00e+00 0 > > > > and cpu > > > > PCApply 6 1.0 1.0242e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 41 0 0 0 0 41 0 0 0 0 0 > > > > > > You don't have GPUs. probably. > > Use -dm_mat_type hypre. > > > > are close. It is hard for me tell if hypre on gpu is on or not. > > > > Best, > > Karthik. > > > > > > *From: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Date: *Friday, 8 October 2021 at 14:55 > *To: *Mark Adams > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] hypre on gpus > > > > Thanks Mark, I will try your recommendations. > > Should I also change -dm_vec_type to hypre currently I have it as mpicuda? > > > > Karthik. > > > > > > *From: *Mark Adams > *Date: *Friday, 8 October 2021 at 14:33 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] hypre on gpus > > > > Hypre does not record its flops with PETSc's timers. > > Configure with and without CUDA and see if the timings change in PCApply. > > Hypre does not dynamically switch between CUDA and CPU solves at > this time, but you want to use -dm_mat_type hypre. > > Mark > > > > On Fri, Oct 8, 2021 at 6:59 AM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > Hello, > > > > I am trying to run ex45 (in KSP tutorial) using hypre on gpus. I have > attached the python configuration file and -log_view output from running > the below command options > > > > mpirun -n 2 ./ex45 -log_view -da_grid_x 169 -da_grid_y 169 -da_grid_z 169 > -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type gmres -pc_type > hypre -pc_hypre_type boomeramg -ksp_gmres_restart 31 > -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor > > > > The problem was solved and converged but from the output file I suspect > hypre is not running on gpus as PCApply and DMCreate does *not* record > any gpu Mflop/s. However, some events such KSPSolve, MatMult etc are > running on gpus. > > > > Can you please let me know if I need to add any extra flag to the attached > arch-ci-linux-cuda11-double-xx.py script file to get hypre working on gpus? > > > > Thanks, > > Karthik. > > > > > > This email and any attachments are intended solely for the use of the > named recipients. If you are not the intended recipient you must not use, > disclose, copy or distribute this email or any of its attachments and > should notify the sender immediately and delete this email from your > system. UK Research and Innovation (UKRI) has taken every reasonable > precaution to minimise risk of this email or any attachments containing > viruses or malware but the recipient should carry out its own virus and > malware checks before opening the attachments. UKRI does not accept any > liability for any losses or damages which the recipient may sustain due to > presence of any viruses. > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From a.croucher at auckland.ac.nz Sun Oct 10 17:50:45 2021 From: a.croucher at auckland.ac.nz (Adrian Croucher) Date: Mon, 11 Oct 2021 11:50:45 +1300 Subject: [petsc-users] HDF5 time step count In-Reply-To: References: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> <8735pcd12k.fsf@jedbrown.org> Message-ID: <7639e2f6-d326-e04b-bd8e-53cecb8d794e@auckland.ac.nz> hi Is there any way to query the PETSc HDF5 viewer to find the number of time steps in the file? A common use case I have is that an HDF5 file from a previous simulation is used to get initial conditions for a subsequent run. The most common thing you want to do is restart from the last set of results in the previous output. To do that you need to know how many time steps there are, so you can set the output index to be the last one. I thought maybe I could just query the size of the "time" dataset, but I can't even see any obvious way to do that using the viewer functions. Regards, Adrian -- Dr Adrian Croucher Senior Research Fellow Department of Engineering Science University of Auckland, New Zealand email: a.croucher at auckland.ac.nz tel: +64 (0)9 923 4611 From knepley at gmail.com Sun Oct 10 17:59:55 2021 From: knepley at gmail.com (Matthew Knepley) Date: Sun, 10 Oct 2021 18:59:55 -0400 Subject: [petsc-users] HDF5 time step count In-Reply-To: <7639e2f6-d326-e04b-bd8e-53cecb8d794e@auckland.ac.nz> References: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> <8735pcd12k.fsf@jedbrown.org> <7639e2f6-d326-e04b-bd8e-53cecb8d794e@auckland.ac.nz> Message-ID: On Sun, Oct 10, 2021 at 6:51 PM Adrian Croucher wrote: > hi > > Is there any way to query the PETSc HDF5 viewer to find the number of > time steps in the file? > > A common use case I have is that an HDF5 file from a previous simulation > is used to get initial conditions for a subsequent run. The most common > thing you want to do is restart from the last set of results in the > previous output. To do that you need to know how many time steps there > are, so you can set the output index to be the last one. > > I thought maybe I could just query the size of the "time" dataset, but I > can't even see any obvious way to do that using the viewer functions. > There is nothing in there that does it right now. Do you know how to do it in HDF5? If so, I can put it in. Otherwise, I will have to learn more HDF5 :) Thanks, Matt > Regards, Adrian > > -- > Dr Adrian Croucher > Senior Research Fellow > Department of Engineering Science > University of Auckland, New Zealand > email: a.croucher at auckland.ac.nz > tel: +64 (0)9 923 4611 > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.croucher at auckland.ac.nz Sun Oct 10 20:08:58 2021 From: a.croucher at auckland.ac.nz (Adrian Croucher) Date: Mon, 11 Oct 2021 14:08:58 +1300 Subject: [petsc-users] HDF5 time step count In-Reply-To: References: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> <8735pcd12k.fsf@jedbrown.org> <7639e2f6-d326-e04b-bd8e-53cecb8d794e@auckland.ac.nz> Message-ID: <220daca5-21ea-00df-bc51-ba6124d020ad@auckland.ac.nz> hi Matt, On 10/11/21 11:59 AM, Matthew Knepley wrote: > On Sun, Oct 10, 2021 at 6:51 PM Adrian Croucher > > wrote: > > hi > > Is there any way to query the PETSc HDF5 viewer to find the number of > time steps in the file? 
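[Editor's note: the HDF5 calls Adrian spells out further down this thread can be collected into the following sketch. It assumes the file is already open (file_id) and that the time values sit in a 1-D dataset named "/time"; that path is an assumption to verify against the files PETSc actually writes.]

  hid_t   dset   = H5Dopen2(file_id, "/time", H5P_DEFAULT);
  hid_t   dspace = H5Dget_space(dset);
  int     rank   = H5Sget_simple_extent_ndims(dspace);  /* expected to be 1 here */
  hsize_t dims[1];
  H5Sget_simple_extent_dims(dspace, dims, NULL);         /* dims[0] = number of time steps */
  H5Sclose(dspace);
  H5Dclose(dset);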
> > A common use case I have is that an HDF5 file from a previous > simulation > is used to get initial conditions for a subsequent run. The most > common > thing you want to do is restart from the last set of results in the > previous output. To do that you need to know how many time steps > there > are, so you can set the output index to be the last one. > > I thought maybe I could just query the size of the "time" dataset, > but I > can't even see any obvious way to do that using the viewer functions. > > > There is nothing in there that does it right now. Do you know how to > do it in HDF5? > If so, I can put it in. Otherwise, I will have to learn more HDF5 :) I haven't actually tried this myself but it looks like what you do is: 1) get the dataspace for the dataset (in our case the "time" dataset): hid_t dspace = H5Dget_space(dset); 2) Get the dimensions of the dataspace: const int ndims = 1; hsize_t dims[ndims]; H5Sget_simple_extent_dims(dspace, dims, NULL); The first element of dims should be the number of time steps. Here I've assumed the number of dimensions of the time dataset is 1. In general you can instead query the rank of the dataspace using H5Sget_simple_extent_ndims() to get the rank ndims. Regards, Adrian -- Dr Adrian Croucher Senior Research Fellow Department of Engineering Science University of Auckland, New Zealand email: a.croucher at auckland.ac.nz tel: +64 (0)9 923 4611 -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyan.chockalingam at stfc.ac.uk Mon Oct 11 04:04:25 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Mon, 11 Oct 2021 09:04:25 +0000 Subject: [petsc-users] Building PETSc with Hypre GPU Message-ID: <6C7EE81D-CF16-4B7E-9AAB-60606322B926@stfc.ac.uk> Dear all, I would like to confirm if I have successfully build petsc with hypre gpu. I have attached my build python script file. Please let me know if I have it right? Kind regards, Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: arch-ci-linux-cuda11-double-xx.py Type: text/x-python-script Size: 779 bytes Desc: arch-ci-linux-cuda11-double-xx.py URL: From roland.richter at ntnu.no Mon Oct 11 04:23:52 2021 From: roland.richter at ntnu.no (Roland Richter) Date: Mon, 11 Oct 2021 11:23:52 +0200 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC Message-ID: Hei, I compiled PETSc with Intel MPI (MPICH) and GCC as compiler (i.e. using Intel OneAPI together with the supplied mpicxx-compiler). Compilation and installation worked fine, but running the tests resulted in the error "Attempting to use an MPI routine before initializing MPICH". 
A simple test program (attached) worked fine with the same combination. What could be the reason for that? Thanks! Regards, Roland Richter -------------- next part -------------- A non-text attachment was scrubbed... Name: main.cpp Type: text/x-c++src Size: 1292 bytes Desc: not available URL: From stefano.zampini at gmail.com Mon Oct 11 05:08:46 2021 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Mon, 11 Oct 2021 13:08:46 +0300 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: References: Message-ID: <335E6D79-7378-4A44-8B84-1CC7EB925F03@gmail.com> Try removing line 15 boost_procs = boost::thread::physical_concurrency(); Usually these errors are caused by destructors called when objects go out of scope > On Oct 11, 2021, at 12:23 PM, Roland Richter wrote: > > From roland.richter at ntnu.no Mon Oct 11 05:11:02 2021 From: roland.richter at ntnu.no (Roland Richter) Date: Mon, 11 Oct 2021 12:11:02 +0200 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: <335E6D79-7378-4A44-8B84-1CC7EB925F03@gmail.com> References: <335E6D79-7378-4A44-8B84-1CC7EB925F03@gmail.com> Message-ID: <11b5647c-d613-2629-4b9c-b67a4559f9c3@ntnu.no> Hei, the attached test case works fine when compiled with Intel MPI and g++, thereby confirming that the compiler should work in general. But all examples provided by PETSc fail with the compiler mentioned above. Thus, I don't think that that line is related to those issues. Regards, Roland Richter Am 11.10.21 um 12:08 schrieb Stefano Zampini: > Try removing line 15 > > boost_procs = boost::thread::physical_concurrency(); > > Usually these errors are caused by destructors called when objects go out of scope > >> On Oct 11, 2021, at 12:23 PM, Roland Richter wrote: >> >> From knepley at gmail.com Mon Oct 11 06:57:31 2021 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 11 Oct 2021 07:57:31 -0400 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: References: Message-ID: On Mon, Oct 11, 2021 at 5:24 AM Roland Richter wrote: > Hei, > > I compiled PETSc with Intel MPI (MPICH) and GCC as compiler (i.e. using > Intel OneAPI together with the supplied mpicxx-compiler). Compilation > and installation worked fine, but running the tests resulted in the > error "Attempting to use an MPI routine before initializing MPICH". A > simple test program (attached) worked fine with the same combination. > > What could be the reason for that? > Hi Roland, Can you get a stack trace for this error using the debugger? Thanks, Matt > Thanks! > > Regards, > > Roland Richter > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From roland.richter at ntnu.no Mon Oct 11 07:07:45 2021 From: roland.richter at ntnu.no (Roland Richter) Date: Mon, 11 Oct 2021 14:07:45 +0200 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: References: Message-ID: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> Hei, at least in gdb it fails with Attempting to use an MPI routine before initializing MPICH [Inferior 1 (process 7854) exited with code 01] (gdb) backtrace No stack. Regards, Roland Am 11.10.21 um 13:57 schrieb Matthew Knepley: > On Mon, Oct 11, 2021 at 5:24 AM Roland Richter > wrote: > > Hei, > > I compiled PETSc with Intel MPI (MPICH) and GCC as compiler (i.e. > using > Intel OneAPI together with the supplied mpicxx-compiler). Compilation > and installation worked fine, but running the tests resulted in the > error "Attempting to use an MPI routine before initializing MPICH". A > simple test program (attached) worked fine with the same combination. > > What could be the reason for that? > > > Hi Roland, > > Can you get a stack trace for this error using the debugger? > > ? Thanks, > > ? ? ?Matt > ? > > Thanks! > > Regards, > > Roland Richter > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Mon Oct 11 07:22:21 2021 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 11 Oct 2021 08:22:21 -0400 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> Message-ID: On Mon, Oct 11, 2021 at 8:07 AM Roland Richter wrote: > Hei, > > at least in gdb it fails with > > Attempting to use an MPI routine before initializing MPICH > [Inferior 1 (process 7854) exited with code 01] > (gdb) backtrace > No stack. > What were you running? If it never makes it into PETSc code, I am not sure what we are doing to cause this. Thanks, Matt > Regards, > > Roland > Am 11.10.21 um 13:57 schrieb Matthew Knepley: > > On Mon, Oct 11, 2021 at 5:24 AM Roland Richter > wrote: > >> Hei, >> >> I compiled PETSc with Intel MPI (MPICH) and GCC as compiler (i.e. using >> Intel OneAPI together with the supplied mpicxx-compiler). Compilation >> and installation worked fine, but running the tests resulted in the >> error "Attempting to use an MPI routine before initializing MPICH". A >> simple test program (attached) worked fine with the same combination. >> >> What could be the reason for that? >> > > Hi Roland, > > Can you get a stack trace for this error using the debugger? > > Thanks, > > Matt > > >> Thanks! >> >> Regards, >> >> Roland Richter >> > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... 
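[Editor's note: a sketch of two quick checks for this situation. The MPICH message is printed just before the process exits, which is why gdb reports "No stack"; breaking on exit keeps the stack alive. The second check verifies that the MPI library and launcher found at run time are the ones PETSc was built against.]

  gdb ./ex19
  (gdb) break exit        # stop before the process terminates so a backtrace exists
  (gdb) run
  (gdb) backtrace

  which mpiexec           # should live in the same MPI installation used to build PETSc
  ldd ./ex19 | grep -i mpi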
URL: From roland.richter at ntnu.no Mon Oct 11 07:23:37 2021 From: roland.richter at ntnu.no (Roland Richter) Date: Mon, 11 Oct 2021 14:23:37 +0200 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> Message-ID: <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> I tried either /./ex19/ (SNES-example), /mpirun ./ex19/ or /mpirun -n 1 ./ex19/, all with the same result. Regards, Roland Am 11.10.21 um 14:22 schrieb Matthew Knepley: > On Mon, Oct 11, 2021 at 8:07 AM Roland Richter > wrote: > > Hei, > > at least in gdb it fails with > > Attempting to use an MPI routine before initializing MPICH > [Inferior 1 (process 7854) exited with code 01] > (gdb) backtrace > No stack. > > > What were you running? If it never makes it into PETSc code, I am not > sure what we are > doing to cause this. > > ? Thanks, > > ? ? ?Matt > ? > > Regards, > > Roland > > Am 11.10.21 um 13:57 schrieb Matthew Knepley: >> On Mon, Oct 11, 2021 at 5:24 AM Roland Richter >> wrote: >> >> Hei, >> >> I compiled PETSc with Intel MPI (MPICH) and GCC as compiler >> (i.e. using >> Intel OneAPI together with the supplied mpicxx-compiler). >> Compilation >> and installation worked fine, but running the tests resulted >> in the >> error "Attempting to use an MPI routine before initializing >> MPICH". A >> simple test program (attached) worked fine with the same >> combination. >> >> What could be the reason for that? >> >> >> Hi Roland, >> >> Can you get a stack trace for this error using the debugger? >> >> ? Thanks, >> >> ? ? ?Matt >> ? >> >> Thanks! >> >> Regards, >> >> Roland Richter >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to >> which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.zampini at gmail.com Mon Oct 11 07:24:44 2021 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Mon, 11 Oct 2021 15:24:44 +0300 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> Message-ID: <1280E270-AAA1-44CE-AE2A-93A2B70462E2@gmail.com> You are most probably using a different mpiexec then the one used to compile petsc. > On Oct 11, 2021, at 3:23 PM, Roland Richter wrote: > > I tried either ./ex19 (SNES-example), mpirun ./ex19 or mpirun -n 1 ./ex19, all with the same result. > > Regards, > > Roland > > Am 11.10.21 um 14:22 schrieb Matthew Knepley: >> On Mon, Oct 11, 2021 at 8:07 AM Roland Richter > wrote: >> Hei, >> >> at least in gdb it fails with >> >> Attempting to use an MPI routine before initializing MPICH >> [Inferior 1 (process 7854) exited with code 01] >> (gdb) backtrace >> No stack. >> >> >> What were you running? If it never makes it into PETSc code, I am not sure what we are >> doing to cause this. 
>> >> Thanks, >> >> Matt >> >> Regards, >> >> Roland >> >> Am 11.10.21 um 13:57 schrieb Matthew Knepley: >>> On Mon, Oct 11, 2021 at 5:24 AM Roland Richter > wrote: >>> Hei, >>> >>> I compiled PETSc with Intel MPI (MPICH) and GCC as compiler (i.e. using >>> Intel OneAPI together with the supplied mpicxx-compiler). Compilation >>> and installation worked fine, but running the tests resulted in the >>> error "Attempting to use an MPI routine before initializing MPICH". A >>> simple test program (attached) worked fine with the same combination. >>> >>> What could be the reason for that? >>> >>> Hi Roland, >>> >>> Can you get a stack trace for this error using the debugger? >>> >>> Thanks, >>> >>> Matt >>> >>> Thanks! >>> >>> Regards, >>> >>> Roland Richter >>> >>> >>> -- >>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >> >> >> -- >> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland.richter at ntnu.no Mon Oct 11 07:30:39 2021 From: roland.richter at ntnu.no (Roland Richter) Date: Mon, 11 Oct 2021 14:30:39 +0200 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: <1280E270-AAA1-44CE-AE2A-93A2B70462E2@gmail.com> References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> <1280E270-AAA1-44CE-AE2A-93A2B70462E2@gmail.com> Message-ID: <7935b6e2-4319-6444-c0f9-a3ba34e8694a@ntnu.no> At least according to configure.log mpiexec was defined as Checking for program /opt/intel/oneapi/mpi/2021.4.0//bin/mpiexec...found ????????????????? Defined make macro "MPIEXECEXECUTABLE" to "/opt/intel/oneapi/mpi/2021.4.0/bin/mpiexec" When running ex19 with this mpiexec it fails with the usual error, even though all configuration steps worked fine. I attached the configuration log. Regards, Roland Am 11.10.21 um 14:24 schrieb Stefano Zampini: > You are most probably using a different mpiexec then the one used to > compile petsc. > > > >> On Oct 11, 2021, at 3:23 PM, Roland Richter >> wrote: >> >> I tried either /./ex19/ (SNES-example), /mpirun ./ex19/ or /mpirun -n >> 1 ./ex19/, all with the same result. >> >> Regards, >> >> Roland >> >> Am 11.10.21 um 14:22 schrieb Matthew Knepley: >>> On Mon, Oct 11, 2021 at 8:07 AM Roland Richter >>> wrote: >>> >>> Hei, >>> >>> at least in gdb it fails with >>> >>> Attempting to use an MPI routine before initializing MPICH >>> [Inferior 1 (process 7854) exited with code 01] >>> (gdb) backtrace >>> No stack. >>> >>> >>> What were you running? If it never makes it into PETSc code, I am >>> not sure what we are >>> doing to cause this. >>> >>> ? Thanks, >>> >>> ? ? ?Matt >>> ? >>> >>> Regards, >>> >>> Roland >>> >>> Am 11.10.21 um 13:57 schrieb Matthew Knepley: >>>> On Mon, Oct 11, 2021 at 5:24 AM Roland Richter >>>> wrote: >>>> >>>> Hei, >>>> >>>> I compiled PETSc with Intel MPI (MPICH) and GCC as compiler >>>> (i.e. using >>>> Intel OneAPI together with the supplied mpicxx-compiler). 
>>>> Compilation >>>> and installation worked fine, but running the tests >>>> resulted in the >>>> error "Attempting to use an MPI routine before initializing >>>> MPICH". A >>>> simple test program (attached) worked fine with the same >>>> combination. >>>> >>>> What could be the reason for that? >>>> >>>> >>>> Hi Roland, >>>> >>>> Can you get a stack trace for this error using the debugger? >>>> >>>> ? Thanks, >>>> >>>> ? ? ?Matt >>>> ? >>>> >>>> Thanks! >>>> >>>> Regards, >>>> >>>> Roland Richter >>>> >>>> >>>> >>>> -- >>>> What most experimenters take for granted before they begin >>>> their experiments is infinitely more interesting than any >>>> results to which their experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which >>> their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: configure.log Type: text/x-log Size: 120379 bytes Desc: not available URL: From stefano.zampini at gmail.com Mon Oct 11 07:34:35 2021 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Mon, 11 Oct 2021 15:34:35 +0300 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: <7935b6e2-4319-6444-c0f9-a3ba34e8694a@ntnu.no> References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> <1280E270-AAA1-44CE-AE2A-93A2B70462E2@gmail.com> <7935b6e2-4319-6444-c0f9-a3ba34e8694a@ntnu.no> Message-ID: <383F2AEB-4CEB-4407-A1B3-E294ACDFD91C@gmail.com> Can you try with a simple call that only calls PetscInitialize/Finalize? > On Oct 11, 2021, at 3:30 PM, Roland Richter wrote: > > At least according to configure.log mpiexec was defined as > > Checking for program /opt/intel/oneapi/mpi/2021.4.0//bin/mpiexec...found > Defined make macro "MPIEXECEXECUTABLE" to "/opt/intel/oneapi/mpi/2021.4.0/bin/mpiexec" > > When running ex19 with this mpiexec it fails with the usual error, even though all configuration steps worked fine. I attached the configuration log. > > Regards, > > Roland > > Am 11.10.21 um 14:24 schrieb Stefano Zampini: >> You are most probably using a different mpiexec then the one used to compile petsc. >> >> >> >>> On Oct 11, 2021, at 3:23 PM, Roland Richter > wrote: >>> >>> I tried either ./ex19 (SNES-example), mpirun ./ex19 or mpirun -n 1 ./ex19, all with the same result. >>> >>> Regards, >>> >>> Roland >>> >>> Am 11.10.21 um 14:22 schrieb Matthew Knepley: >>>> On Mon, Oct 11, 2021 at 8:07 AM Roland Richter > wrote: >>>> Hei, >>>> >>>> at least in gdb it fails with >>>> >>>> Attempting to use an MPI routine before initializing MPICH >>>> [Inferior 1 (process 7854) exited with code 01] >>>> (gdb) backtrace >>>> No stack. >>>> >>>> >>>> What were you running? If it never makes it into PETSc code, I am not sure what we are >>>> doing to cause this. >>>> >>>> Thanks, >>>> >>>> Matt >>>> >>>> Regards, >>>> >>>> Roland >>>> >>>> Am 11.10.21 um 13:57 schrieb Matthew Knepley: >>>>> On Mon, Oct 11, 2021 at 5:24 AM Roland Richter > wrote: >>>>> Hei, >>>>> >>>>> I compiled PETSc with Intel MPI (MPICH) and GCC as compiler (i.e. 
using >>>>> Intel OneAPI together with the supplied mpicxx-compiler). Compilation >>>>> and installation worked fine, but running the tests resulted in the >>>>> error "Attempting to use an MPI routine before initializing MPICH". A >>>>> simple test program (attached) worked fine with the same combination. >>>>> >>>>> What could be the reason for that? >>>>> >>>>> Hi Roland, >>>>> >>>>> Can you get a stack trace for this error using the debugger? >>>>> >>>>> Thanks, >>>>> >>>>> Matt >>>>> >>>>> Thanks! >>>>> >>>>> Regards, >>>>> >>>>> Roland Richter >>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>>> >>>> -- >>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland.richter at ntnu.no Mon Oct 11 08:13:38 2021 From: roland.richter at ntnu.no (Roland Richter) Date: Mon, 11 Oct 2021 15:13:38 +0200 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: <383F2AEB-4CEB-4407-A1B3-E294ACDFD91C@gmail.com> References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> <1280E270-AAA1-44CE-AE2A-93A2B70462E2@gmail.com> <7935b6e2-4319-6444-c0f9-a3ba34e8694a@ntnu.no> <383F2AEB-4CEB-4407-A1B3-E294ACDFD91C@gmail.com> Message-ID: Hei, the following code works fine: #include #include static char help[] = "Solves 2D Poisson equation using multigrid.\n\n"; int main(int argc,char **argv) { ??? PetscInitialize(&argc,&argv,(char*)0,help); ??? std::cout << "Hello World\n"; ??? PetscFinalize(); ??? return 0; } Regards, Roland Am 11.10.21 um 14:34 schrieb Stefano Zampini: > Can you try with a simple call that only calls PetscInitialize/Finalize? > > >> On Oct 11, 2021, at 3:30 PM, Roland Richter >> wrote: >> >> At least according to configure.log mpiexec was defined as >> >> Checking for program /opt/intel/oneapi/mpi/2021.4.0//bin/mpiexec...found >> ????????????????? Defined make macro "MPIEXECEXECUTABLE" to >> "/opt/intel/oneapi/mpi/2021.4.0/bin/mpiexec" >> >> When running ex19 with this mpiexec it fails with the usual error, >> even though all configuration steps worked fine. I attached the >> configuration log. >> >> Regards, >> >> Roland >> >> Am 11.10.21 um 14:24 schrieb Stefano Zampini: >>> You are most probably using a different mpiexec then the one used to >>> compile petsc. >>> >>> >>> >>>> On Oct 11, 2021, at 3:23 PM, Roland Richter >>>> wrote: >>>> >>>> I tried either /./ex19/ (SNES-example), /mpirun ./ex19/ or /mpirun >>>> -n 1 ./ex19/, all with the same result. >>>> >>>> Regards, >>>> >>>> Roland >>>> >>>> Am 11.10.21 um 14:22 schrieb Matthew Knepley: >>>>> On Mon, Oct 11, 2021 at 8:07 AM Roland Richter >>>>> wrote: >>>>> >>>>> Hei, >>>>> >>>>> at least in gdb it fails with >>>>> >>>>> Attempting to use an MPI routine before initializing MPICH >>>>> [Inferior 1 (process 7854) exited with code 01] >>>>> (gdb) backtrace >>>>> No stack. >>>>> >>>>> >>>>> What were you running? If it never makes it into PETSc code, I am >>>>> not sure what we are >>>>> doing to cause this. >>>>> >>>>> ? 
Thanks, >>>>> >>>>> ? ? ?Matt >>>>> ? >>>>> >>>>> Regards, >>>>> >>>>> Roland >>>>> >>>>> Am 11.10.21 um 13:57 schrieb Matthew Knepley: >>>>>> On Mon, Oct 11, 2021 at 5:24 AM Roland Richter >>>>>> wrote: >>>>>> >>>>>> Hei, >>>>>> >>>>>> I compiled PETSc with Intel MPI (MPICH) and GCC as >>>>>> compiler (i.e. using >>>>>> Intel OneAPI together with the supplied mpicxx-compiler). >>>>>> Compilation >>>>>> and installation worked fine, but running the tests >>>>>> resulted in the >>>>>> error "Attempting to use an MPI routine before >>>>>> initializing MPICH". A >>>>>> simple test program (attached) worked fine with the same >>>>>> combination. >>>>>> >>>>>> What could be the reason for that? >>>>>> >>>>>> >>>>>> Hi Roland, >>>>>> >>>>>> Can you get a stack trace for this error using the debugger? >>>>>> >>>>>> ? Thanks, >>>>>> >>>>>> ? ? ?Matt >>>>>> ? >>>>>> >>>>>> Thanks! >>>>>> >>>>>> Regards, >>>>>> >>>>>> Roland Richter >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> What most experimenters take for granted before they begin >>>>>> their experiments is infinitely more interesting than any >>>>>> results to which their experiments lead. >>>>>> -- Norbert Wiener >>>>>> >>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before they begin their >>>>> experiments is infinitely more interesting than any results to >>>>> which their experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://www.cse.buffalo.edu/~knepley/ >>>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aduarteg at utexas.edu Mon Oct 11 11:26:28 2021 From: aduarteg at utexas.edu (Alfredo J Duarte Gomez) Date: Mon, 11 Oct 2021 11:26:28 -0500 Subject: [petsc-users] TS initial guess Message-ID: Good morning PETSC team, I have a working algorithm for my implicit TS integrator with a system of ODE/DAE's, but I am observing a rather high number of iterations I am currently using the simplest settings of a TSBEULER and setting a constant time step. My question right now is whether the default settings use any sort of initial guess algorithm before every time step. Since I have seen that the time step adapter calculates the Local Truncation Error, it should be possible to use an extrapolation of arbitrary order of accuracy as an initial guess for every time step right? Can someone indicate how I would be able to use that? Additionally, it would be very helpful to take a look at that initial guess, is it possible to use any existing function to calculate it either in the PreStep or PostStep function to visualize it? Thank you, -- Alfredo Duarte Graduate Research Assistant The University of Texas at Austin -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Mon Oct 11 15:39:51 2021 From: bsmith at petsc.dev (Barry Smith) Date: Mon, 11 Oct 2021 16:39:51 -0400 Subject: [petsc-users] TS initial guess In-Reply-To: References: Message-ID: <1B573129-FF26-43A5-BC4C-6B51DB6D1047@petsc.dev> For TSBEULER (the theta method) see https://petsc.org/release/docs/manualpages/TS/TSTHETA.html and look at the source code src/ts/impls/implicit/theta/teta.c for TSStep_Theta. You can use -snes_monitor_solution OPTIONS to see what the solutions are the nonlinear system look like as it solves the system. 
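[Editor's note: a minimal sketch of the pre-step hook asked about above. It only inspects the solution held by the TS at the start of each step; whether that vector is also what the SNES uses as its initial guess depends on the TS settings, so treat that as something to confirm rather than a statement of fact. The command-line monitors Barry mentions are the simpler route.]

  static PetscErrorCode MyPreStep(TS ts)
  {
    Vec            u;
    PetscInt       step;
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    ierr = TSGetStepNumber(ts, &step);CHKERRQ(ierr);
    ierr = TSGetSolution(ts, &u);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "Solution entering step %D\n", step);CHKERRQ(ierr);
    ierr = VecView(u, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

  /* after creating the TS: */
  ierr = TSSetPreStep(ts, MyPreStep);CHKERRQ(ierr);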
Barry > On Oct 11, 2021, at 12:26 PM, Alfredo J Duarte Gomez wrote: > > Good morning PETSC team, > > I have a working algorithm for my implicit TS integrator with a system of ODE/DAE's, but I am observing a rather high number of iterations > > I am currently using the simplest settings of a TSBEULER and setting a constant time step. > > My question right now is whether the default settings use any sort of initial guess algorithm before every time step. > > Since I have seen that the time step adapter calculates the Local Truncation Error, it should be possible to use an extrapolation of arbitrary order of accuracy as an initial guess for every time step right? Can someone indicate how I would be able to use that? > > Additionally, it would be very helpful to take a look at that initial guess, is it possible to use any existing function to calculate it either in the PreStep or PostStep function to visualize it? > > Thank you, > > -- > Alfredo Duarte > Graduate Research Assistant > The University of Texas at Austin -------------- next part -------------- An HTML attachment was scrubbed... URL: From cliu at pppl.gov Mon Oct 11 22:24:58 2021 From: cliu at pppl.gov (Chang Liu) Date: Mon, 11 Oct 2021 23:24:58 -0400 Subject: [petsc-users] request to add an option similar to use_omp_threads for mumps to cusparse solver Message-ID: Hi, Currently, it is possible to use mumps solver in PETSC with -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then master rank will call mumps with OpenMP to solve the matrix. I wonder if someone can develop similar option for cusparse solver. Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to GPU to solve. In this way, one can use cusparse solver for a MPI program. Chang -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From vaclav.hapla at erdw.ethz.ch Tue Oct 12 04:56:28 2021 From: vaclav.hapla at erdw.ethz.ch (Hapla Vaclav) Date: Tue, 12 Oct 2021 09:56:28 +0000 Subject: [petsc-users] HDF5 corruption In-Reply-To: <87k0iobgec.fsf@jedbrown.org> References: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> <8735pcd12k.fsf@jedbrown.org> <87k0iobgec.fsf@jedbrown.org> Message-ID: <5D877CCC-46CD-4569-98E0-A726ECD43C91@erdw.ethz.ch> > On 8 Oct 2021, at 00:48, Jed Brown wrote: > > Adrian Croucher writes: > >> hi Jed, >> >> It looked to me like a call to h5f_flush() is all that is required. >> >> Some people said there would be a performance hit (maybe ~ 10% slower), >> which would be the trade-off for increased reliability. So if this were >> made available via PetscViewerFlush(), I'd probably make it optional in >> my code so the user could decide for themselves if it was worth it for them. >> >> Do you think flushing would be a better option than closing/opening the >> file between writes? > > Yes, less costly at scale (metadata like opening files can be expensive on parallel file systems), and simpler to manage from your code. I have just come across this a couple of days ago. I think PetscViewerFlush() [no-op for HDF5 currently] should call H5Fflush() for sure. I can do it now. I agree with Jed that closing/opening can have significant overhead on large number of processes due to metadata processing. 
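[Editor's note: from the application side, the usage being discussed would look roughly like the sketch below, assuming the change that makes PetscViewerFlush() call H5Fflush() for HDF5 viewers is in place (it is only proposed at this point in the thread) and that viewer came from PetscViewerHDF5Open().]

  ierr = VecView(u, viewer);CHKERRQ(ierr);        /* write this output/checkpoint            */
  ierr = PetscViewerFlush(viewer);CHKERRQ(ierr);  /* push buffered HDF5 data to disk without
                                                     closing and reopening the file           */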
Thanks, Vaclav From nabw91 at gmail.com Tue Oct 12 05:54:09 2021 From: nabw91 at gmail.com (=?UTF-8?Q?Nicol=C3=A1s_Barnafi?=) Date: Tue, 12 Oct 2021 12:54:09 +0200 Subject: [petsc-users] On QN + Fieldsplit Message-ID: Hello PETSc users, first email sent! I am creating a SNES solver using fenics, my example runs smoothly with 'newtonls', but gives a strange missing function error (error 83): these are the relevant lines of code where I setup the solver: > problem = SNESProblem(Res, sol, bcs) > b = PETScVector() # same as b = PETSc.Vec() > J_mat = PETScMatrix() > snes = PETSc.SNES().create(MPI.COMM_WORLD) > snes.setFunction(problem.F, b.vec()) > snes.setJacobian(problem.J, J_mat.mat()) > # Set up fieldsplit > ksp = snes.ksp > ksp.setOperators(J_mat.mat()) > pc = ksp.pc > pc.setType('fieldsplit') > dofmap_s = V.sub(0).dofmap().dofs() > dofmap_p = V.sub(1).dofmap().dofs() > is_s = PETSc.IS().createGeneral(dofmap_s) > is_p = PETSc.IS().createGeneral(dofmap_p) > pc.setFieldSplitIS((None, is_s), (None, is_p)) > pc.setFromOptions() > snes.setFromOptions() > snes.setUp() If it can be useful, this are the outputs of snes.view(), ksp.view() and pc.view(): > type: qn > SNES has not been set up so information may be incomplete > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN > Stored subspace size: 10 > Using the single reduction variant. > maximum iterations=10000, maximum function evaluations=30000 > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 > total number of function evaluations=0 > norm schedule ALWAYS > SNESLineSearch Object: 4 MPI processes > type: basic > maxstep=1.000000e+08, minlambda=1.000000e-12 > tolerances: relative=1.000000e-08, absolute=1.000000e-15, lambda=1.000000e-08 > maximum iterations=1 > KSP Object: 4 MPI processes > type: gmres > restart=1000, using Modified Gram-Schmidt Orthogonalization > happy breakdown tolerance 1e-30 > maximum iterations=1000, initial guess is zero > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > left preconditioning > using UNPRECONDITIONED norm type for convergence test > PC Object: 4 MPI processes > type: fieldsplit > PC has not been set up so information may be incomplete > FieldSplit with Schur preconditioner, factorization FULL I know that PC is not setup, but if I do it before setting up the SNES, the error persists. Thanks in advance for your help. Best, Nicolas -- Nicol?s Alejandro Barnafi Wittwer -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaclav.hapla at erdw.ethz.ch Tue Oct 12 06:25:53 2021 From: vaclav.hapla at erdw.ethz.ch (Hapla Vaclav) Date: Tue, 12 Oct 2021 11:25:53 +0000 Subject: [petsc-users] HDF5 corruption In-Reply-To: <5D877CCC-46CD-4569-98E0-A726ECD43C91@erdw.ethz.ch> References: <425d849a-eba3-6739-5e71-251e0f29b5fb@auckland.ac.nz> <8735pcd12k.fsf@jedbrown.org> <87k0iobgec.fsf@jedbrown.org> <5D877CCC-46CD-4569-98E0-A726ECD43C91@erdw.ethz.ch> Message-ID: On 12 Oct 2021, at 11:56, Hapla Vaclav > wrote: On 8 Oct 2021, at 00:48, Jed Brown > wrote: Adrian Croucher > writes: hi Jed, It looked to me like a call to h5f_flush() is all that is required. Some people said there would be a performance hit (maybe ~ 10% slower), which would be the trade-off for increased reliability. So if this were made available via PetscViewerFlush(), I'd probably make it optional in my code so the user could decide for themselves if it was worth it for them. Do you think flushing would be a better option than closing/opening the file between writes? 
Yes, less costly at scale (metadata like opening files can be expensive on parallel file systems), and simpler to manage from your code. I have just come across this a couple of days ago. I think PetscViewerFlush() [no-op for HDF5 currently] should call H5Fflush() for sure. I can do it now. https://gitlab.com/petsc/petsc/-/merge_requests/4445 I agree with Jed that closing/opening can have significant overhead on large number of processes due to metadata processing. Thanks, Vaclav -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.zampini at gmail.com Tue Oct 12 07:06:57 2021 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Tue, 12 Oct 2021 15:06:57 +0300 Subject: [petsc-users] On QN + Fieldsplit In-Reply-To: References: Message-ID: Il giorno mar 12 ott 2021 alle ore 13:56 Nicol?s Barnafi ha scritto: > Hello PETSc users, > > first email sent! > I am creating a SNES solver using fenics, my example runs smoothly with > 'newtonls', but gives a strange missing function error (error 83): > > Dolphin swallows any useful error information returned from PETSc. You can try using the below code snippet at the beginning of your script from petsc4py import PETSc from dolfin import * # Remove the dolfin error handler PETSc.Sys.pushErrorHandler('python') > > these are the relevant lines of code where I setup the solver: > > > problem = SNESProblem(Res, sol, bcs) > > b = PETScVector() # same as b = PETSc.Vec() > > J_mat = PETScMatrix() > > snes = PETSc.SNES().create(MPI.COMM_WORLD) > > snes.setFunction(problem.F, b.vec()) > > snes.setJacobian(problem.J, J_mat.mat()) > > # Set up fieldsplit > > ksp = snes.ksp > > ksp.setOperators(J_mat.mat()) > > pc = ksp.pc > > pc.setType('fieldsplit') > > dofmap_s = V.sub(0).dofmap().dofs() > > dofmap_p = V.sub(1).dofmap().dofs() > > is_s = PETSc.IS().createGeneral(dofmap_s) > > is_p = PETSc.IS().createGeneral(dofmap_p) > > pc.setFieldSplitIS((None, is_s), (None, is_p)) > > pc.setFromOptions() > > snes.setFromOptions() > > snes.setUp() > > If it can be useful, this are the outputs of snes.view(), ksp.view() and > pc.view(): > > > type: qn > > SNES has not been set up so information may be incomplete > > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN > > Stored subspace size: 10 > > Using the single reduction variant. > > maximum iterations=10000, maximum function evaluations=30000 > > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 > > total number of function evaluations=0 > > norm schedule ALWAYS > > SNESLineSearch Object: 4 MPI processes > > type: basic > > maxstep=1.000000e+08, minlambda=1.000000e-12 > > tolerances: relative=1.000000e-08, absolute=1.000000e-15, > lambda=1.000000e-08 > > maximum iterations=1 > > KSP Object: 4 MPI processes > > type: gmres > > restart=1000, using Modified Gram-Schmidt Orthogonalization > > happy breakdown tolerance 1e-30 > > maximum iterations=1000, initial guess is zero > > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > > left preconditioning > > using UNPRECONDITIONED norm type for convergence test > > PC Object: 4 MPI processes > > type: fieldsplit > > PC has not been set up so information may be incomplete > > FieldSplit with Schur preconditioner, factorization FULL > > I know that PC is not setup, but if I do it before setting up the SNES, > the error persists. Thanks in advance for your help. 
> > Best, > Nicolas > -- > Nicol?s Alejandro Barnafi Wittwer > -- Stefano -------------- next part -------------- An HTML attachment was scrubbed... URL: From nabw91 at gmail.com Tue Oct 12 07:37:31 2021 From: nabw91 at gmail.com (=?UTF-8?Q?Nicol=C3=A1s_Barnafi?=) Date: Tue, 12 Oct 2021 14:37:31 +0200 Subject: [petsc-users] On QN + Fieldsplit In-Reply-To: References: Message-ID: Thank you Stefano for the help. I added the lines you indicated, but the error remains the same, here goes snes.view() + error > SNES Object: 1 MPI processes > type: qn > SNES has not been set up so information may be incomplete > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN > Stored subspace size: 10 > Using the single reduction variant. > maximum iterations=10000, maximum function evaluations=30000 > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 > total number of function evaluations=0 > norm schedule ALWAYS > SNESLineSearch Object: 1 MPI processes > type: basic > maxstep=1.000000e+08, minlambda=1.000000e-12 > tolerances: relative=1.000000e-08, absolute=1.000000e-15, lambda=1.000000e-08 > maximum iterations=1 > Traceback (most recent call last): > File "Twist.py", line 234, in > snes.setUp() > File "PETSc/SNES.pyx", line 530, in petsc4py.PETSc.SNES.setUp > petsc4py.PETSc.Error: error code 83 On Tue, Oct 12, 2021 at 2:07 PM Stefano Zampini wrote: > > > Il giorno mar 12 ott 2021 alle ore 13:56 Nicol?s Barnafi > ha scritto: > >> Hello PETSc users, >> >> first email sent! >> I am creating a SNES solver using fenics, my example runs smoothly with >> 'newtonls', but gives a strange missing function error (error 83): >> >> > Dolphin swallows any useful error information returned from PETSc. You can > try using the below code snippet at the beginning of your script > > from petsc4py import PETSc > from dolfin import * > # Remove the dolfin error handler > PETSc.Sys.pushErrorHandler('python') > > > >> >> these are the relevant lines of code where I setup the solver: >> >> > problem = SNESProblem(Res, sol, bcs) >> > b = PETScVector() # same as b = PETSc.Vec() >> > J_mat = PETScMatrix() >> > snes = PETSc.SNES().create(MPI.COMM_WORLD) >> > snes.setFunction(problem.F, b.vec()) >> > snes.setJacobian(problem.J, J_mat.mat()) >> > # Set up fieldsplit >> > ksp = snes.ksp >> > ksp.setOperators(J_mat.mat()) >> > pc = ksp.pc >> > pc.setType('fieldsplit') >> > dofmap_s = V.sub(0).dofmap().dofs() >> > dofmap_p = V.sub(1).dofmap().dofs() >> > is_s = PETSc.IS().createGeneral(dofmap_s) >> > is_p = PETSc.IS().createGeneral(dofmap_p) >> > pc.setFieldSplitIS((None, is_s), (None, is_p)) >> > pc.setFromOptions() >> > snes.setFromOptions() >> > snes.setUp() >> >> > If it can be useful, this are the outputs of snes.view(), ksp.view() and >> pc.view(): >> >> > type: qn >> > SNES has not been set up so information may be incomplete >> > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN >> > Stored subspace size: 10 >> > Using the single reduction variant. 
>> > maximum iterations=10000, maximum function evaluations=30000 >> > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 >> > total number of function evaluations=0 >> > norm schedule ALWAYS >> > SNESLineSearch Object: 4 MPI processes >> > type: basic >> > maxstep=1.000000e+08, minlambda=1.000000e-12 >> > tolerances: relative=1.000000e-08, absolute=1.000000e-15, >> lambda=1.000000e-08 >> > maximum iterations=1 >> > KSP Object: 4 MPI processes >> > type: gmres >> > restart=1000, using Modified Gram-Schmidt Orthogonalization >> > happy breakdown tolerance 1e-30 >> > maximum iterations=1000, initial guess is zero >> > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >> > left preconditioning >> > using UNPRECONDITIONED norm type for convergence test >> > PC Object: 4 MPI processes >> > type: fieldsplit >> > PC has not been set up so information may be incomplete >> > FieldSplit with Schur preconditioner, factorization FULL >> >> I know that PC is not setup, but if I do it before setting up the SNES, >> the error persists. Thanks in advance for your help. >> >> Best, >> Nicolas >> -- >> Nicol?s Alejandro Barnafi Wittwer >> > > > -- > Stefano > -- Nicol?s Alejandro Barnafi Wittwer -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.seize at onera.fr Tue Oct 12 08:58:54 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Tue, 12 Oct 2021 15:58:54 +0200 Subject: [petsc-users] Still reachable memory in valgrind Message-ID: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> Hello petsc-users I am using Valgrind with my PETSc application, and I noticed something: ?1 #include ?2 ?3 int main(int argc, char **argv){ ?4 ? PetscErrorCode ierr = 0; ?5 ?6 ? ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) return ierr; ?7 ? PetscReal *foo; ?8 ? malloc(sizeof(PetscReal)); ?9 ? ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr); 10 ? ierr = PetscFinalize(); 11?? return ierr; 12 } With this example, with today's release branch, I've got this Valgrind result (--leak-check=full --show-leak-kinds=all): ==2036== Memcheck, a memory error detector ==2036== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. ==2036== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info ==2036== Command: ./build/bin/yanss data/box.yaml ==2036== ==2036== ==2036== HEAP SUMMARY: ==2036==???? in use at exit: 1,746 bytes in 4 blocks ==2036==?? total heap usage: 2,172 allocs, 2,168 frees, 9,624,690 bytes allocated ==2036== ==2036== 8 bytes in 1 blocks are definitely lost in loss record 1 of 4 ==2036==??? at 0x4C29BE3: malloc (vg_replace_malloc.c:299) ==2036==??? by 0x41A4FD: main (main.c:8) ==2036== ==2036== 32 bytes in 1 blocks are still reachable in loss record 2 of 4 ==2036==??? at 0x4C2B975: calloc (vg_replace_malloc.c:711) ==2036==??? by 0xACF461F: _dlerror_run (in /usr/lib64/libdl-2.17.so) ==2036==??? by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so) ==2036==??? by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) ==2036==??? by 0x56EF325: PetscInitialize (pinit.c:1203) ==2036==??? by 0x41A4E2: main (main.c:6) ==2036== ==2036== 70 bytes in 1 blocks are still reachable in loss record 3 of 4 ==2036==??? at 0x4C29BE3: malloc (vg_replace_malloc.c:299) ==2036==??? by 0x400F0D0: _dl_signal_error (in /usr/lib64/ld-2.17.so) ==2036==??? by 0x400F26D: _dl_signal_cerror (in /usr/lib64/ld-2.17.so) ==2036==??? by 0x400A4BC: _dl_lookup_symbol_x (in /usr/lib64/ld-2.17.so) ==2036==??? by 0x83B9F02: do_sym (in /usr/lib64/libc-2.17.so) ==2036==??? 
by 0xACF40D3: dlsym_doit (in /usr/lib64/libdl-2.17.so) ==2036==??? by 0x400F2D3: _dl_catch_error (in /usr/lib64/ld-2.17.so) ==2036==??? by 0xACF45BC: _dlerror_run (in /usr/lib64/libdl-2.17.so) ==2036==??? by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so) ==2036==??? by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) ==2036==??? by 0x56EF325: PetscInitialize (pinit.c:1203) ==2036==??? by 0x41A4E2: main (main.c:6) ==2036== ==2036== 1,636 bytes in 1 blocks are still reachable in loss record 4 of 4 ==2036==??? at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) ==2036==??? by 0x54AC0CB: PetscMallocAlign (mal.c:54) ==2036==??? by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) ==2036==??? by 0x54ADDD2: PetscMallocA (mal.c:423) ==2036==??? by 0x41A52F: main (main.c:9) ==2036== ==2036== LEAK SUMMARY: ==2036==??? definitely lost: 8 bytes in 1 blocks ==2036==??? indirectly lost: 0 bytes in 0 blocks ==2036==????? possibly lost: 0 bytes in 0 blocks ==2036==??? still reachable: 1,738 bytes in 3 blocks ==2036==???????? suppressed: 0 bytes in 0 blocks ==2036== ==2036== For counts of detected and suppressed errors, rerun with: -v ==2036== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0) The first report is the malloc on line 8, fine. The second and the third correspond to still reachable memory from PetscInitialize on line 6, I often got these so I usually discard it. The fourth and last is the one that worries me : the memory from PetscMalloc1 on line 9 is reported as "still reachable", but I don't think it should. Is there something I do not understand, or is this a bug ? Thanks in advance, Pierre From wence at gmx.li Tue Oct 12 09:01:37 2021 From: wence at gmx.li (Lawrence Mitchell) Date: Tue, 12 Oct 2021 15:01:37 +0100 Subject: [petsc-users] Still reachable memory in valgrind In-Reply-To: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> References: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> Message-ID: <90A555DE-77CE-49BD-9D6D-F8BC2AF64524@gmx.li> Hi Pierre, > On 12 Oct 2021, at 14:58, Pierre Seize wrote: > > > 1 #include > 2 > 3 int main(int argc, char **argv){ > 4 PetscErrorCode ierr = 0; > 5 > 6 ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) return ierr; > 7 PetscReal *foo; > 8 malloc(sizeof(PetscReal)); > 9 ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr); You need to call PetscFree on any arrays that you allocate with PetscMalloc, otherwise this is indeed a memory leak. Similarly, you should call free on any pointers that you get from malloc to deallocate them. Thanks, Lawrence > 10 ierr = PetscFinalize(); > 11 return ierr; > 12 } From pierre.seize at onera.fr Tue Oct 12 09:16:47 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Tue, 12 Oct 2021 16:16:47 +0200 Subject: [petsc-users] Still reachable memory in valgrind In-Reply-To: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> References: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> Message-ID: <3192c2ee-c99a-71a4-798f-e374e00da84a@onera.fr> Sorry, I should have tried this before: I checked out to v3.14, and now both malloc and PetscMalloc1 are reported as definitely lost, so I would say it's a bug. Pierre On 12/10/21 15:58, Pierre Seize wrote: > Hello petsc-users > > I am using Valgrind with my PETSc application, and I noticed something: > > ?1 #include > ?2 > ?3 int main(int argc, char **argv){ > ?4 ? PetscErrorCode ierr = 0; > ?5 > ?6 ? ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) return > ierr; > ?7 ? PetscReal *foo; > ?8 ? malloc(sizeof(PetscReal)); > ?9 ? 
ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr); > 10 ? ierr = PetscFinalize(); > 11?? return ierr; > 12 } > > With this example, with today's release branch, I've got this Valgrind > result (--leak-check=full --show-leak-kinds=all): > > ==2036== Memcheck, a memory error detector > ==2036== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. > ==2036== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright > info > ==2036== Command: ./build/bin/yanss data/box.yaml > ==2036== > ==2036== > ==2036== HEAP SUMMARY: > ==2036==???? in use at exit: 1,746 bytes in 4 blocks > ==2036==?? total heap usage: 2,172 allocs, 2,168 frees, 9,624,690 > bytes allocated > ==2036== > ==2036== 8 bytes in 1 blocks are definitely lost in loss record 1 of 4 > ==2036==??? at 0x4C29BE3: malloc (vg_replace_malloc.c:299) > ==2036==??? by 0x41A4FD: main (main.c:8) > ==2036== > ==2036== 32 bytes in 1 blocks are still reachable in loss record 2 of 4 > ==2036==??? at 0x4C2B975: calloc (vg_replace_malloc.c:711) > ==2036==??? by 0xACF461F: _dlerror_run (in /usr/lib64/libdl-2.17.so) > ==2036==??? by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so) > ==2036==??? by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) > ==2036==??? by 0x56EF325: PetscInitialize (pinit.c:1203) > ==2036==??? by 0x41A4E2: main (main.c:6) > ==2036== > ==2036== 70 bytes in 1 blocks are still reachable in loss record 3 of 4 > ==2036==??? at 0x4C29BE3: malloc (vg_replace_malloc.c:299) > ==2036==??? by 0x400F0D0: _dl_signal_error (in /usr/lib64/ld-2.17.so) > ==2036==??? by 0x400F26D: _dl_signal_cerror (in /usr/lib64/ld-2.17.so) > ==2036==??? by 0x400A4BC: _dl_lookup_symbol_x (in /usr/lib64/ld-2.17.so) > ==2036==??? by 0x83B9F02: do_sym (in /usr/lib64/libc-2.17.so) > ==2036==??? by 0xACF40D3: dlsym_doit (in /usr/lib64/libdl-2.17.so) > ==2036==??? by 0x400F2D3: _dl_catch_error (in /usr/lib64/ld-2.17.so) > ==2036==??? by 0xACF45BC: _dlerror_run (in /usr/lib64/libdl-2.17.so) > ==2036==??? by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so) > ==2036==??? by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) > ==2036==??? by 0x56EF325: PetscInitialize (pinit.c:1203) > ==2036==??? by 0x41A4E2: main (main.c:6) > ==2036== > ==2036== 1,636 bytes in 1 blocks are still reachable in loss record 4 > of 4 > ==2036==??? at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) > ==2036==??? by 0x54AC0CB: PetscMallocAlign (mal.c:54) > ==2036==??? by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) > ==2036==??? by 0x54ADDD2: PetscMallocA (mal.c:423) > ==2036==??? by 0x41A52F: main (main.c:9) > ==2036== > ==2036== LEAK SUMMARY: > ==2036==??? definitely lost: 8 bytes in 1 blocks > ==2036==??? indirectly lost: 0 bytes in 0 blocks > ==2036==????? possibly lost: 0 bytes in 0 blocks > ==2036==??? still reachable: 1,738 bytes in 3 blocks > ==2036==???????? suppressed: 0 bytes in 0 blocks > ==2036== > ==2036== For counts of detected and suppressed errors, rerun with: -v > ==2036== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0) > > > The first report is the malloc on line 8, fine. > The second and the third correspond to still reachable memory from > PetscInitialize on line 6, I often got these so I usually discard it. > The fourth and last is the one that worries me : the memory from > PetscMalloc1 on line 9 is reported as "still reachable", but I don't > think it should. > Is there something I do not understand, or is this a bug ? 
> > Thanks in advance, > > Pierre From knepley at gmail.com Tue Oct 12 09:23:18 2021 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 12 Oct 2021 10:23:18 -0400 Subject: [petsc-users] On QN + Fieldsplit In-Reply-To: References: Message-ID: I looked over every place we use that error code. I do not think it is coming from PETSc, but rather from petsc4py. However, something is eating the error message, and I think Stefano indicated. My first step would be to get the FEniCS folks to display the error message. Another option is to just run it in Firedrake since I think we can see the stack properly there. Thanks, Matt On Tue, Oct 12, 2021 at 8:37 AM Nicol?s Barnafi wrote: > Thank you Stefano for the help. I added the lines you indicated, but the > error remains the same, here goes snes.view() + error > > > SNES Object: 1 MPI processes > > type: qn > > SNES has not been set up so information may be incomplete > > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN > > Stored subspace size: 10 > > Using the single reduction variant. > > maximum iterations=10000, maximum function evaluations=30000 > > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 > > total number of function evaluations=0 > > norm schedule ALWAYS > > SNESLineSearch Object: 1 MPI processes > > type: basic > > maxstep=1.000000e+08, minlambda=1.000000e-12 > > tolerances: relative=1.000000e-08, absolute=1.000000e-15, > lambda=1.000000e-08 > > maximum iterations=1 > > Traceback (most recent call last): > > File "Twist.py", line 234, in > > snes.setUp() > > File "PETSc/SNES.pyx", line 530, in petsc4py.PETSc.SNES.setUp > > petsc4py.PETSc.Error: error code 83 > > On Tue, Oct 12, 2021 at 2:07 PM Stefano Zampini > wrote: > >> >> >> Il giorno mar 12 ott 2021 alle ore 13:56 Nicol?s Barnafi < >> nabw91 at gmail.com> ha scritto: >> >>> Hello PETSc users, >>> >>> first email sent! >>> I am creating a SNES solver using fenics, my example runs smoothly with >>> 'newtonls', but gives a strange missing function error (error 83): >>> >>> >> Dolphin swallows any useful error information returned from PETSc. You >> can try using the below code snippet at the beginning of your script >> >> from petsc4py import PETSc >> from dolfin import * >> # Remove the dolfin error handler >> PETSc.Sys.pushErrorHandler('python') >> >> >> >>> >>> these are the relevant lines of code where I setup the solver: >>> >>> > problem = SNESProblem(Res, sol, bcs) >>> > b = PETScVector() # same as b = PETSc.Vec() >>> > J_mat = PETScMatrix() >>> > snes = PETSc.SNES().create(MPI.COMM_WORLD) >>> > snes.setFunction(problem.F, b.vec()) >>> > snes.setJacobian(problem.J, J_mat.mat()) >>> > # Set up fieldsplit >>> > ksp = snes.ksp >>> > ksp.setOperators(J_mat.mat()) >>> > pc = ksp.pc >>> > pc.setType('fieldsplit') >>> > dofmap_s = V.sub(0).dofmap().dofs() >>> > dofmap_p = V.sub(1).dofmap().dofs() >>> > is_s = PETSc.IS().createGeneral(dofmap_s) >>> > is_p = PETSc.IS().createGeneral(dofmap_p) >>> > pc.setFieldSplitIS((None, is_s), (None, is_p)) >>> > pc.setFromOptions() >>> > snes.setFromOptions() >>> > snes.setUp() >>> >>> >> If it can be useful, this are the outputs of snes.view(), ksp.view() and >>> pc.view(): >>> >>> > type: qn >>> > SNES has not been set up so information may be incomplete >>> > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN >>> > Stored subspace size: 10 >>> > Using the single reduction variant. 
>>> > maximum iterations=10000, maximum function evaluations=30000 >>> > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 >>> > total number of function evaluations=0 >>> > norm schedule ALWAYS >>> > SNESLineSearch Object: 4 MPI processes >>> > type: basic >>> > maxstep=1.000000e+08, minlambda=1.000000e-12 >>> > tolerances: relative=1.000000e-08, absolute=1.000000e-15, >>> lambda=1.000000e-08 >>> > maximum iterations=1 >>> > KSP Object: 4 MPI processes >>> > type: gmres >>> > restart=1000, using Modified Gram-Schmidt Orthogonalization >>> > happy breakdown tolerance 1e-30 >>> > maximum iterations=1000, initial guess is zero >>> > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>> > left preconditioning >>> > using UNPRECONDITIONED norm type for convergence test >>> > PC Object: 4 MPI processes >>> > type: fieldsplit >>> > PC has not been set up so information may be incomplete >>> > FieldSplit with Schur preconditioner, factorization FULL >>> >>> I know that PC is not setup, but if I do it before setting up the SNES, >>> the error persists. Thanks in advance for your help. >>> >>> Best, >>> Nicolas >>> -- >>> Nicol?s Alejandro Barnafi Wittwer >>> >> >> >> -- >> Stefano >> > > > -- > Nicol?s Alejandro Barnafi Wittwer > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Oct 12 09:24:08 2021 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 12 Oct 2021 10:24:08 -0400 Subject: [petsc-users] Still reachable memory in valgrind In-Reply-To: <3192c2ee-c99a-71a4-798f-e374e00da84a@onera.fr> References: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> <3192c2ee-c99a-71a4-798f-e374e00da84a@onera.fr> Message-ID: On Tue, Oct 12, 2021 at 10:16 AM Pierre Seize wrote: > Sorry, I should have tried this before: > > I checked out to v3.14, and now both malloc and PetscMalloc1 are > reported as definitely lost, so I would say it's a bug. > I am not sure what would be the bug. This is correctly reporting that you did not free the memory. Thanks, Matt > Pierre > > > On 12/10/21 15:58, Pierre Seize wrote: > > Hello petsc-users > > > > I am using Valgrind with my PETSc application, and I noticed something: > > > > 1 #include > > 2 > > 3 int main(int argc, char **argv){ > > 4 PetscErrorCode ierr = 0; > > 5 > > 6 ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) return > > ierr; > > 7 PetscReal *foo; > > 8 malloc(sizeof(PetscReal)); > > 9 ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr); > > 10 ierr = PetscFinalize(); > > 11 return ierr; > > 12 } > > > > With this example, with today's release branch, I've got this Valgrind > > result (--leak-check=full --show-leak-kinds=all): > > > > ==2036== Memcheck, a memory error detector > > ==2036== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. 
> > ==2036== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright > > info > > ==2036== Command: ./build/bin/yanss data/box.yaml > > ==2036== > > ==2036== > > ==2036== HEAP SUMMARY: > > ==2036== in use at exit: 1,746 bytes in 4 blocks > > ==2036== total heap usage: 2,172 allocs, 2,168 frees, 9,624,690 > > bytes allocated > > ==2036== > > ==2036== 8 bytes in 1 blocks are definitely lost in loss record 1 of 4 > > ==2036== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) > > ==2036== by 0x41A4FD: main (main.c:8) > > ==2036== > > ==2036== 32 bytes in 1 blocks are still reachable in loss record 2 of 4 > > ==2036== at 0x4C2B975: calloc (vg_replace_malloc.c:711) > > ==2036== by 0xACF461F: _dlerror_run (in /usr/lib64/libdl-2.17.so) > > ==2036== by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so) > > ==2036== by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) > > ==2036== by 0x56EF325: PetscInitialize (pinit.c:1203) > > ==2036== by 0x41A4E2: main (main.c:6) > > ==2036== > > ==2036== 70 bytes in 1 blocks are still reachable in loss record 3 of 4 > > ==2036== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) > > ==2036== by 0x400F0D0: _dl_signal_error (in /usr/lib64/ld-2.17.so) > > ==2036== by 0x400F26D: _dl_signal_cerror (in /usr/lib64/ld-2.17.so) > > ==2036== by 0x400A4BC: _dl_lookup_symbol_x (in /usr/lib64/ld-2.17.so) > > ==2036== by 0x83B9F02: do_sym (in /usr/lib64/libc-2.17.so) > > ==2036== by 0xACF40D3: dlsym_doit (in /usr/lib64/libdl-2.17.so) > > ==2036== by 0x400F2D3: _dl_catch_error (in /usr/lib64/ld-2.17.so) > > ==2036== by 0xACF45BC: _dlerror_run (in /usr/lib64/libdl-2.17.so) > > ==2036== by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so) > > ==2036== by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) > > ==2036== by 0x56EF325: PetscInitialize (pinit.c:1203) > > ==2036== by 0x41A4E2: main (main.c:6) > > ==2036== > > ==2036== 1,636 bytes in 1 blocks are still reachable in loss record 4 > > of 4 > > ==2036== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) > > ==2036== by 0x54AC0CB: PetscMallocAlign (mal.c:54) > > ==2036== by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) > > ==2036== by 0x54ADDD2: PetscMallocA (mal.c:423) > > ==2036== by 0x41A52F: main (main.c:9) > > ==2036== > > ==2036== LEAK SUMMARY: > > ==2036== definitely lost: 8 bytes in 1 blocks > > ==2036== indirectly lost: 0 bytes in 0 blocks > > ==2036== possibly lost: 0 bytes in 0 blocks > > ==2036== still reachable: 1,738 bytes in 3 blocks > > ==2036== suppressed: 0 bytes in 0 blocks > > ==2036== > > ==2036== For counts of detected and suppressed errors, rerun with: -v > > ==2036== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0) > > > > > > The first report is the malloc on line 8, fine. > > The second and the third correspond to still reachable memory from > > PetscInitialize on line 6, I often got these so I usually discard it. > > The fourth and last is the one that worries me : the memory from > > PetscMalloc1 on line 9 is reported as "still reachable", but I don't > > think it should. > > Is there something I do not understand, or is this a bug ? > > > > Thanks in advance, > > > > Pierre > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... 
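For reference, a minimal sketch of the corrected test program from this thread. The header name is an assumption (the archive stripped the original #include target), and the bar pointer is added only so the raw malloc() can be released; the point made by Lawrence and Matthew above is simply that the malloc() on line 8 needs a matching free() and the PetscMalloc1() on line 9 a matching PetscFree() before PetscFinalize():

  #include <stdlib.h>
  #include <petsc.h>   /* assumed header; the original #include was stripped by the archive */

  int main(int argc, char **argv)
  {
    PetscErrorCode ierr = 0;
    PetscReal      *foo;
    double         *bar;

    ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) return ierr;
    bar  = (double *) malloc(sizeof(double));     /* raw malloc: released below with free()        */
    ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr);  /* PETSc allocation: released with PetscFree()   */
    ierr = PetscFree(foo); CHKERRQ(ierr);
    free(bar);
    ierr = PetscFinalize();
    return ierr;
  }

With both calls added, valgrind should no longer flag the two blocks allocated in main(), whichever PETSc malloc backend happens to be active.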
URL: From junchao.zhang at gmail.com Tue Oct 12 09:24:17 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Tue, 12 Oct 2021 09:24:17 -0500 Subject: [petsc-users] request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: Message-ID: Hi, Chang, For the mumps solver, we usually transfers matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right? Mark, I remember you said cusparse solve is slow and you would rather do it on CPU. Is it right? --Junchao Zhang On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users < petsc-users at mcs.anl.gov> wrote: > Hi, > > Currently, it is possible to use mumps solver in PETSC with > -mat_mumps_use_omp_threads option, so that multiple MPI processes will > transfer the matrix and rhs data to the master rank, and then master > rank will call mumps with OpenMP to solve the matrix. > > I wonder if someone can develop similar option for cusparse solver. > Right now, this solver does not work with mpiaijcusparse. I think a > possible workaround is to transfer all the matrix data to one MPI > process, and then upload the data to GPU to solve. In this way, one can > use cusparse solver for a MPI program. > > Chang > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nabw91 at gmail.com Tue Oct 12 09:27:37 2021 From: nabw91 at gmail.com (=?UTF-8?Q?Nicol=C3=A1s_Barnafi?=) Date: Tue, 12 Oct 2021 16:27:37 +0200 Subject: [petsc-users] On QN + Fieldsplit In-Reply-To: References: Message-ID: Thank you for the support. I rewrote the initialization in a simpler way, now it works as expected: > dofmap_s = V.sub(0).dofmap().dofs(); is_s = PETSc.IS().createGeneral(dofmap_s) > dofmap_p = V.sub(1).dofmap().dofs(); is_p = PETSc.IS().createGeneral(dofmap_p) > snes = PETSc.SNES().create(MPI.COMM_WORLD) > snes.setFunction(problem.F, b.vec()); snes.setJacobian(problem.J, J_mat.mat()) > pc = snes.ksp.getPC() > pc.setType('fieldsplit') > pc.setFieldSplitIS((None, is_s), (None, is_p)) > snes.setFromOptions() > snes.solve(None, problem.u.vector().vec()) Apparently trying to setup the solver's internals is not recommended. As a side note, I tried also setting up the KSP using 'SNESSetKSP', but this solution is not so good as giving the command 'snes_ksp_ew' does nothing, even though it gets correctly read as shown by snes.view(). Thanks for the help! Best, Nicolas On Tue, Oct 12, 2021 at 4:23 PM Matthew Knepley wrote: > I looked over every place we use that error code. I do not think it is > coming from PETSc, but rather from petsc4py. However, something > is eating the error message, and I think Stefano indicated. My first step > would be to get the FEniCS folks to display the error message. > > Another option is to just run it in Firedrake since I think we can see the > stack properly there. > > Thanks, > > Matt > > On Tue, Oct 12, 2021 at 8:37 AM Nicol?s Barnafi wrote: > >> Thank you Stefano for the help. I added the lines you indicated, but the >> error remains the same, here goes snes.view() + error >> >> > SNES Object: 1 MPI processes >> > type: qn >> > SNES has not been set up so information may be incomplete >> > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN >> > Stored subspace size: 10 >> > Using the single reduction variant. 
>> > maximum iterations=10000, maximum function evaluations=30000 >> > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 >> > total number of function evaluations=0 >> > norm schedule ALWAYS >> > SNESLineSearch Object: 1 MPI processes >> > type: basic >> > maxstep=1.000000e+08, minlambda=1.000000e-12 >> > tolerances: relative=1.000000e-08, absolute=1.000000e-15, >> lambda=1.000000e-08 >> > maximum iterations=1 >> > Traceback (most recent call last): >> > File "Twist.py", line 234, in >> > snes.setUp() >> > File "PETSc/SNES.pyx", line 530, in petsc4py.PETSc.SNES.setUp >> > petsc4py.PETSc.Error: error code 83 >> >> On Tue, Oct 12, 2021 at 2:07 PM Stefano Zampini < >> stefano.zampini at gmail.com> wrote: >> >>> >>> >>> Il giorno mar 12 ott 2021 alle ore 13:56 Nicol?s Barnafi < >>> nabw91 at gmail.com> ha scritto: >>> >>>> Hello PETSc users, >>>> >>>> first email sent! >>>> I am creating a SNES solver using fenics, my example runs smoothly with >>>> 'newtonls', but gives a strange missing function error (error 83): >>>> >>>> >>> Dolphin swallows any useful error information returned from PETSc. You >>> can try using the below code snippet at the beginning of your script >>> >>> from petsc4py import PETSc >>> from dolfin import * >>> # Remove the dolfin error handler >>> PETSc.Sys.pushErrorHandler('python') >>> >>> >>> >>>> >>>> these are the relevant lines of code where I setup the solver: >>>> >>>> > problem = SNESProblem(Res, sol, bcs) >>>> > b = PETScVector() # same as b = PETSc.Vec() >>>> > J_mat = PETScMatrix() >>>> > snes = PETSc.SNES().create(MPI.COMM_WORLD) >>>> > snes.setFunction(problem.F, b.vec()) >>>> > snes.setJacobian(problem.J, J_mat.mat()) >>>> > # Set up fieldsplit >>>> > ksp = snes.ksp >>>> > ksp.setOperators(J_mat.mat()) >>>> > pc = ksp.pc >>>> > pc.setType('fieldsplit') >>>> > dofmap_s = V.sub(0).dofmap().dofs() >>>> > dofmap_p = V.sub(1).dofmap().dofs() >>>> > is_s = PETSc.IS().createGeneral(dofmap_s) >>>> > is_p = PETSc.IS().createGeneral(dofmap_p) >>>> > pc.setFieldSplitIS((None, is_s), (None, is_p)) >>>> > pc.setFromOptions() >>>> > snes.setFromOptions() >>>> > snes.setUp() >>>> >>>> >>> If it can be useful, this are the outputs of snes.view(), ksp.view() and >>>> pc.view(): >>>> >>>> > type: qn >>>> > SNES has not been set up so information may be incomplete >>>> > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN >>>> > Stored subspace size: 10 >>>> > Using the single reduction variant. >>>> > maximum iterations=10000, maximum function evaluations=30000 >>>> > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 >>>> > total number of function evaluations=0 >>>> > norm schedule ALWAYS >>>> > SNESLineSearch Object: 4 MPI processes >>>> > type: basic >>>> > maxstep=1.000000e+08, minlambda=1.000000e-12 >>>> > tolerances: relative=1.000000e-08, absolute=1.000000e-15, >>>> lambda=1.000000e-08 >>>> > maximum iterations=1 >>>> > KSP Object: 4 MPI processes >>>> > type: gmres >>>> > restart=1000, using Modified Gram-Schmidt Orthogonalization >>>> > happy breakdown tolerance 1e-30 >>>> > maximum iterations=1000, initial guess is zero >>>> > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
>>>> > left preconditioning >>>> > using UNPRECONDITIONED norm type for convergence test >>>> > PC Object: 4 MPI processes >>>> > type: fieldsplit >>>> > PC has not been set up so information may be incomplete >>>> > FieldSplit with Schur preconditioner, factorization FULL >>>> >>>> I know that PC is not setup, but if I do it before setting up the SNES, >>>> the error persists. Thanks in advance for your help. >>>> >>>> Best, >>>> Nicolas >>>> -- >>>> Nicol?s Alejandro Barnafi Wittwer >>>> >>> >>> >>> -- >>> Stefano >>> >> >> >> -- >> Nicol?s Alejandro Barnafi Wittwer >> > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -- Nicol?s Alejandro Barnafi Wittwer -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Tue Oct 12 09:34:28 2021 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 12 Oct 2021 10:34:28 -0400 Subject: [petsc-users] Still reachable memory in valgrind In-Reply-To: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> References: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> Message-ID: <4576DB17-7C80-495F-AD7A-13739675C013@petsc.dev> I think that 4) is normal. PetscMalloc() is just a wrapper for malloc, PETSc does not free the space obtained with PetscMalloc() at PetscFinalize() so that memory is still available and usable after PetscFinalize() (Of course we do not recommend using it). PETSc has an option -malloc_dump that will print out all the memory obtained with PetscMalloc() that has not been freed at PetscFinalize() which is a quick way to find any PetscMalloc() without free. Barry > On Oct 12, 2021, at 9:58 AM, Pierre Seize wrote: > > Hello petsc-users > > I am using Valgrind with my PETSc application, and I noticed something: > > 1 #include > 2 > 3 int main(int argc, char **argv){ > 4 PetscErrorCode ierr = 0; > 5 > 6 ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) return ierr; > 7 PetscReal *foo; > 8 malloc(sizeof(PetscReal)); > 9 ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr); > 10 ierr = PetscFinalize(); > 11 return ierr; > 12 } > > With this example, with today's release branch, I've got this Valgrind result (--leak-check=full --show-leak-kinds=all): > > ==2036== Memcheck, a memory error detector > ==2036== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. 
> ==2036== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info > ==2036== Command: ./build/bin/yanss data/box.yaml > ==2036== > ==2036== > ==2036== HEAP SUMMARY: > ==2036== in use at exit: 1,746 bytes in 4 blocks > ==2036== total heap usage: 2,172 allocs, 2,168 frees, 9,624,690 bytes allocated > ==2036== > ==2036== 8 bytes in 1 blocks are definitely lost in loss record 1 of 4 > ==2036== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) > ==2036== by 0x41A4FD: main (main.c:8) > ==2036== > ==2036== 32 bytes in 1 blocks are still reachable in loss record 2 of 4 > ==2036== at 0x4C2B975: calloc (vg_replace_malloc.c:711) > ==2036== by 0xACF461F: _dlerror_run (in /usr/lib64/libdl-2.17.so) > ==2036== by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so) > ==2036== by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) > ==2036== by 0x56EF325: PetscInitialize (pinit.c:1203) > ==2036== by 0x41A4E2: main (main.c:6) > ==2036== > ==2036== 70 bytes in 1 blocks are still reachable in loss record 3 of 4 > ==2036== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) > ==2036== by 0x400F0D0: _dl_signal_error (in /usr/lib64/ld-2.17.so) > ==2036== by 0x400F26D: _dl_signal_cerror (in /usr/lib64/ld-2.17.so) > ==2036== by 0x400A4BC: _dl_lookup_symbol_x (in /usr/lib64/ld-2.17.so) > ==2036== by 0x83B9F02: do_sym (in /usr/lib64/libc-2.17.so) > ==2036== by 0xACF40D3: dlsym_doit (in /usr/lib64/libdl-2.17.so) > ==2036== by 0x400F2D3: _dl_catch_error (in /usr/lib64/ld-2.17.so) > ==2036== by 0xACF45BC: _dlerror_run (in /usr/lib64/libdl-2.17.so) > ==2036== by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so) > ==2036== by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) > ==2036== by 0x56EF325: PetscInitialize (pinit.c:1203) > ==2036== by 0x41A4E2: main (main.c:6) > ==2036== > ==2036== 1,636 bytes in 1 blocks are still reachable in loss record 4 of 4 > ==2036== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) > ==2036== by 0x54AC0CB: PetscMallocAlign (mal.c:54) > ==2036== by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) > ==2036== by 0x54ADDD2: PetscMallocA (mal.c:423) > ==2036== by 0x41A52F: main (main.c:9) > ==2036== > ==2036== LEAK SUMMARY: > ==2036== definitely lost: 8 bytes in 1 blocks > ==2036== indirectly lost: 0 bytes in 0 blocks > ==2036== possibly lost: 0 bytes in 0 blocks > ==2036== still reachable: 1,738 bytes in 3 blocks > ==2036== suppressed: 0 bytes in 0 blocks > ==2036== > ==2036== For counts of detected and suppressed errors, rerun with: -v > ==2036== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0) > > > The first report is the malloc on line 8, fine. > The second and the third correspond to still reachable memory from PetscInitialize on line 6, I often got these so I usually discard it. > The fourth and last is the one that worries me : the memory from PetscMalloc1 on line 9 is reported as "still reachable", but I don't think it should. > Is there something I do not understand, or is this a bug ? > > Thanks in advance, > > Pierre From pierre.seize at onera.fr Tue Oct 12 09:38:07 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Tue, 12 Oct 2021 16:38:07 +0200 Subject: [petsc-users] Still reachable memory in valgrind In-Reply-To: References: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> <3192c2ee-c99a-71a4-798f-e374e00da84a@onera.fr> Message-ID: <25fb8a1d-8c33-5cf8-21e1-2c597c3a0de7@onera.fr> The "bug" is that memory from PetscMalloc1 that is not freed is reported as "definitely lost" in v3.14 (OK) but as "still reachable" in today's release (not OK). 
Here I forget to free the memory on purpose, I would like valgrind to report it's lost and not still reachable. Pierre On 12/10/21 16:24, Matthew Knepley wrote: > On Tue, Oct 12, 2021 at 10:16 AM Pierre Seize > wrote: > > Sorry, I should have tried this before: > > I checked out to v3.14, and now both malloc and PetscMalloc1 are > reported as definitely lost, so I would say it's a bug. > > > I am not sure what would be the bug. This is correctly reporting that > you did not free the memory. > > ? Thanks, > > ? ? Matt > > Pierre > > > On 12/10/21 15:58, Pierre Seize wrote: > > Hello petsc-users > > > > I am using Valgrind with my PETSc application, and I noticed > something: > > > > ?1 #include > > ?2 > > ?3 int main(int argc, char **argv){ > > ?4 ? PetscErrorCode ierr = 0; > > ?5 > > ?6 ? ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) > return > > ierr; > > ?7 ? PetscReal *foo; > > ?8 ? malloc(sizeof(PetscReal)); > > ?9 ? ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr); > > 10 ? ierr = PetscFinalize(); > > 11?? return ierr; > > 12 } > > > > With this example, with today's release branch, I've got this > Valgrind > > result (--leak-check=full --show-leak-kinds=all): > > > > ==2036== Memcheck, a memory error detector > > ==2036== Copyright (C) 2002-2015, and GNU GPL'd, by Julian > Seward et al. > > ==2036== Using Valgrind-3.12.0 and LibVEX; rerun with -h for > copyright > > info > > ==2036== Command: ./build/bin/yanss data/box.yaml > > ==2036== > > ==2036== > > ==2036== HEAP SUMMARY: > > ==2036==???? in use at exit: 1,746 bytes in 4 blocks > > ==2036==?? total heap usage: 2,172 allocs, 2,168 frees, 9,624,690 > > bytes allocated > > ==2036== > > ==2036== 8 bytes in 1 blocks are definitely lost in loss record > 1 of 4 > > ==2036==??? at 0x4C29BE3: malloc (vg_replace_malloc.c:299) > > ==2036==??? by 0x41A4FD: main (main.c:8) > > ==2036== > > ==2036== 32 bytes in 1 blocks are still reachable in loss record > 2 of 4 > > ==2036==??? at 0x4C2B975: calloc (vg_replace_malloc.c:711) > > ==2036==??? by 0xACF461F: _dlerror_run (in > /usr/lib64/libdl-2.17.so ) > > ==2036==??? by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so > ) > > ==2036==??? by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) > > ==2036==??? by 0x56EF325: PetscInitialize (pinit.c:1203) > > ==2036==??? by 0x41A4E2: main (main.c:6) > > ==2036== > > ==2036== 70 bytes in 1 blocks are still reachable in loss record > 3 of 4 > > ==2036==??? at 0x4C29BE3: malloc (vg_replace_malloc.c:299) > > ==2036==??? by 0x400F0D0: _dl_signal_error (in > /usr/lib64/ld-2.17.so ) > > ==2036==??? by 0x400F26D: _dl_signal_cerror (in > /usr/lib64/ld-2.17.so ) > > ==2036==??? by 0x400A4BC: _dl_lookup_symbol_x (in > /usr/lib64/ld-2.17.so ) > > ==2036==??? by 0x83B9F02: do_sym (in /usr/lib64/libc-2.17.so > ) > > ==2036==??? by 0xACF40D3: dlsym_doit (in > /usr/lib64/libdl-2.17.so ) > > ==2036==??? by 0x400F2D3: _dl_catch_error (in > /usr/lib64/ld-2.17.so ) > > ==2036==??? by 0xACF45BC: _dlerror_run (in > /usr/lib64/libdl-2.17.so ) > > ==2036==??? by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so > ) > > ==2036==??? by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) > > ==2036==??? by 0x56EF325: PetscInitialize (pinit.c:1203) > > ==2036==??? by 0x41A4E2: main (main.c:6) > > ==2036== > > ==2036== 1,636 bytes in 1 blocks are still reachable in loss > record 4 > > of 4 > > ==2036==??? at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) > > ==2036==??? by 0x54AC0CB: PetscMallocAlign (mal.c:54) > > ==2036==??? 
by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) > > ==2036==??? by 0x54ADDD2: PetscMallocA (mal.c:423) > > ==2036==??? by 0x41A52F: main (main.c:9) > > ==2036== > > ==2036== LEAK SUMMARY: > > ==2036==??? definitely lost: 8 bytes in 1 blocks > > ==2036==??? indirectly lost: 0 bytes in 0 blocks > > ==2036==????? possibly lost: 0 bytes in 0 blocks > > ==2036==??? still reachable: 1,738 bytes in 3 blocks > > ==2036==???????? suppressed: 0 bytes in 0 blocks > > ==2036== > > ==2036== For counts of detected and suppressed errors, rerun > with: -v > > ==2036== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 > from 0) > > > > > > The first report is the malloc on line 8, fine. > > The second and the third correspond to still reachable memory from > > PetscInitialize on line 6, I often got these so I usually > discard it. > > The fourth and last is the one that worries me : the memory from > > PetscMalloc1 on line 9 is reported as "still reachable", but I > don't > > think it should. > > Is there something I do not understand, or is this a bug ? > > > > Thanks in advance, > > > > Pierre > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Tue Oct 12 09:43:41 2021 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 12 Oct 2021 10:43:41 -0400 Subject: [petsc-users] On QN + Fieldsplit In-Reply-To: References: Message-ID: <95A9CBB6-E0D8-4F28-A7CA-23EF0470AF5F@petsc.dev> So you removed ksp.setOperators(J_mat.mat()) and the problem went away? This could make sense, this is not an intended usage. I am not sure how cleanly we can provide an error checker for this inappropriate usage. Barry > On Oct 12, 2021, at 10:27 AM, Nicol?s Barnafi wrote: > > Thank you for the support. I rewrote the initialization in a simpler way, now it works as expected: > > > dofmap_s = V.sub(0).dofmap().dofs(); is_s = PETSc.IS().createGeneral(dofmap_s) > > dofmap_p = V.sub(1).dofmap().dofs(); is_p = PETSc.IS().createGeneral(dofmap_p) > > snes = PETSc.SNES().create(MPI.COMM_WORLD) > > snes.setFunction(problem.F, b.vec()); snes.setJacobian(problem.J, J_mat.mat()) > > pc = snes.ksp.getPC() > > pc.setType('fieldsplit') > > pc.setFieldSplitIS((None, is_s), (None, is_p)) > > snes.setFromOptions() > > snes.solve(None, problem.u.vector().vec()) > > Apparently trying to setup the solver's internals is not recommended. As a side note, I tried also setting up the KSP using 'SNESSetKSP', but this solution is not so good as giving the command 'snes_ksp_ew' does nothing, even though it gets correctly read as shown by snes.view(). > > Thanks for the help! > Best, > Nicolas > > On Tue, Oct 12, 2021 at 4:23 PM Matthew Knepley > wrote: > I looked over every place we use that error code. I do not think it is coming from PETSc, but rather from petsc4py. However, something > is eating the error message, and I think Stefano indicated. My first step would be to get the FEniCS folks to display the error message. > > Another option is to just run it in Firedrake since I think we can see the stack properly there. > > Thanks, > > Matt > > On Tue, Oct 12, 2021 at 8:37 AM Nicol?s Barnafi > wrote: > Thank you Stefano for the help. 
I added the lines you indicated, but the error remains the same, here goes snes.view() + error > > > SNES Object: 1 MPI processes > > type: qn > > SNES has not been set up so information may be incomplete > > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN > > Stored subspace size: 10 > > Using the single reduction variant. > > maximum iterations=10000, maximum function evaluations=30000 > > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 > > total number of function evaluations=0 > > norm schedule ALWAYS > > SNESLineSearch Object: 1 MPI processes > > type: basic > > maxstep=1.000000e+08, minlambda=1.000000e-12 > > tolerances: relative=1.000000e-08, absolute=1.000000e-15, lambda=1.000000e-08 > > maximum iterations=1 > > Traceback (most recent call last): > > File "Twist.py", line 234, in > > snes.setUp() > > File "PETSc/SNES.pyx", line 530, in petsc4py.PETSc.SNES.setUp > > petsc4py.PETSc.Error: error code 83 > > On Tue, Oct 12, 2021 at 2:07 PM Stefano Zampini > wrote: > > > Il giorno mar 12 ott 2021 alle ore 13:56 Nicol?s Barnafi > ha scritto: > Hello PETSc users, > > first email sent! > I am creating a SNES solver using fenics, my example runs smoothly with 'newtonls', but gives a strange missing function error (error 83): > > > Dolphin swallows any useful error information returned from PETSc. You can try using the below code snippet at the beginning of your script > > from petsc4py import PETSc > from dolfin import * > # Remove the dolfin error handler > PETSc.Sys.pushErrorHandler('python') > > > > these are the relevant lines of code where I setup the solver: > > > problem = SNESProblem(Res, sol, bcs) > > b = PETScVector() # same as b = PETSc.Vec() > > J_mat = PETScMatrix() > > snes = PETSc.SNES().create(MPI.COMM_WORLD) > > snes.setFunction(problem.F, b.vec()) > > snes.setJacobian(problem.J, J_mat.mat()) > > # Set up fieldsplit > > ksp = snes.ksp > > ksp.setOperators(J_mat.mat()) > > pc = ksp.pc > > pc.setType('fieldsplit') > > dofmap_s = V.sub(0).dofmap().dofs() > > dofmap_p = V.sub(1).dofmap().dofs() > > is_s = PETSc.IS().createGeneral(dofmap_s) > > is_p = PETSc.IS().createGeneral(dofmap_p) > > pc.setFieldSplitIS((None, is_s), (None, is_p)) > > pc.setFromOptions() > > snes.setFromOptions() > > snes.setUp() > > If it can be useful, this are the outputs of snes.view(), ksp.view() and pc.view(): > > > type: qn > > SNES has not been set up so information may be incomplete > > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN > > Stored subspace size: 10 > > Using the single reduction variant. > > maximum iterations=10000, maximum function evaluations=30000 > > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 > > total number of function evaluations=0 > > norm schedule ALWAYS > > SNESLineSearch Object: 4 MPI processes > > type: basic > > maxstep=1.000000e+08, minlambda=1.000000e-12 > > tolerances: relative=1.000000e-08, absolute=1.000000e-15, lambda=1.000000e-08 > > maximum iterations=1 > > KSP Object: 4 MPI processes > > type: gmres > > restart=1000, using Modified Gram-Schmidt Orthogonalization > > happy breakdown tolerance 1e-30 > > maximum iterations=1000, initial guess is zero > > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
> > left preconditioning > > using UNPRECONDITIONED norm type for convergence test > > PC Object: 4 MPI processes > > type: fieldsplit > > PC has not been set up so information may be incomplete > > FieldSplit with Schur preconditioner, factorization FULL > > I know that PC is not setup, but if I do it before setting up the SNES, the error persists. Thanks in advance for your help. > > Best, > Nicolas > -- > Nicol?s Alejandro Barnafi Wittwer > > > -- > Stefano > > > -- > Nicol?s Alejandro Barnafi Wittwer > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > Nicol?s Alejandro Barnafi Wittwer -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Tue Oct 12 09:51:57 2021 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 12 Oct 2021 10:51:57 -0400 Subject: [petsc-users] Still reachable memory in valgrind In-Reply-To: <25fb8a1d-8c33-5cf8-21e1-2c597c3a0de7@onera.fr> References: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> <3192c2ee-c99a-71a4-798f-e374e00da84a@onera.fr> <25fb8a1d-8c33-5cf8-21e1-2c597c3a0de7@onera.fr> Message-ID: <1DA582BF-515E-48B2-93DA-5BD1B3B7070D@petsc.dev> Do you have the valgrind output from 3.14 ? > 1,636 bytes in 1 blocks are still reachable in loss record 4 > > of 4 > > ==2036== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) > > ==2036== by 0x54AC0CB: PetscMallocAlign (mal.c:54) > > ==2036== by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) > > ==2036== by 0x54ADDD2: PetscMallocA (mal.c:423) > > ==2036== by 0x41A52F: main (main.c:9) > > ==2036== Given the large amount of memory in the block I think tracing of PETSc's memory allocation is turned on with this run, this may mean the memory is reachable but with your 3.14 run I would guess the memory size is 8 bytes and tracing is not turned on so the memory is listed as "lost". But I do not understand the subtleties of reachable. Barry > On Oct 12, 2021, at 10:38 AM, Pierre Seize wrote: > > The "bug" is that memory from PetscMalloc1 that is not freed is reported as "definitely lost" in v3.14 (OK) but as "still reachable" in today's release (not OK). > > Here I forget to free the memory on purpose, I would like valgrind to report it's lost and not still reachable. > > > > Pierre > > > On 12/10/21 16:24, Matthew Knepley wrote: >> On Tue, Oct 12, 2021 at 10:16 AM Pierre Seize > wrote: >> Sorry, I should have tried this before: >> >> I checked out to v3.14, and now both malloc and PetscMalloc1 are >> reported as definitely lost, so I would say it's a bug. >> >> I am not sure what would be the bug. This is correctly reporting that you did not free the memory. 
>> >> Thanks, >> >> Matt >> >> Pierre >> >> >> On 12/10/21 15:58, Pierre Seize wrote: >> > Hello petsc-users >> > >> > I am using Valgrind with my PETSc application, and I noticed something: >> > >> > 1 #include >> > 2 >> > 3 int main(int argc, char **argv){ >> > 4 PetscErrorCode ierr = 0; >> > 5 >> > 6 ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) return >> > ierr; >> > 7 PetscReal *foo; >> > 8 malloc(sizeof(PetscReal)); >> > 9 ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr); >> > 10 ierr = PetscFinalize(); >> > 11 return ierr; >> > 12 } >> > >> > With this example, with today's release branch, I've got this Valgrind >> > result (--leak-check=full --show-leak-kinds=all): >> > >> > ==2036== Memcheck, a memory error detector >> > ==2036== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. >> > ==2036== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright >> > info >> > ==2036== Command: ./build/bin/yanss data/box.yaml >> > ==2036== >> > ==2036== >> > ==2036== HEAP SUMMARY: >> > ==2036== in use at exit: 1,746 bytes in 4 blocks >> > ==2036== total heap usage: 2,172 allocs, 2,168 frees, 9,624,690 >> > bytes allocated >> > ==2036== >> > ==2036== 8 bytes in 1 blocks are definitely lost in loss record 1 of 4 >> > ==2036== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) >> > ==2036== by 0x41A4FD: main (main.c:8) >> > ==2036== >> > ==2036== 32 bytes in 1 blocks are still reachable in loss record 2 of 4 >> > ==2036== at 0x4C2B975: calloc (vg_replace_malloc.c:711) >> > ==2036== by 0xACF461F: _dlerror_run (in /usr/lib64/libdl-2.17.so ) >> > ==2036== by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so ) >> > ==2036== by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) >> > ==2036== by 0x56EF325: PetscInitialize (pinit.c:1203) >> > ==2036== by 0x41A4E2: main (main.c:6) >> > ==2036== >> > ==2036== 70 bytes in 1 blocks are still reachable in loss record 3 of 4 >> > ==2036== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) >> > ==2036== by 0x400F0D0: _dl_signal_error (in /usr/lib64/ld-2.17.so ) >> > ==2036== by 0x400F26D: _dl_signal_cerror (in /usr/lib64/ld-2.17.so ) >> > ==2036== by 0x400A4BC: _dl_lookup_symbol_x (in /usr/lib64/ld-2.17.so ) >> > ==2036== by 0x83B9F02: do_sym (in /usr/lib64/libc-2.17.so ) >> > ==2036== by 0xACF40D3: dlsym_doit (in /usr/lib64/libdl-2.17.so ) >> > ==2036== by 0x400F2D3: _dl_catch_error (in /usr/lib64/ld-2.17.so ) >> > ==2036== by 0xACF45BC: _dlerror_run (in /usr/lib64/libdl-2.17.so ) >> > ==2036== by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so ) >> > ==2036== by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) >> > ==2036== by 0x56EF325: PetscInitialize (pinit.c:1203) >> > ==2036== by 0x41A4E2: main (main.c:6) >> > ==2036== >> > ==2036== 1,636 bytes in 1 blocks are still reachable in loss record 4 >> > of 4 >> > ==2036== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >> > ==2036== by 0x54AC0CB: PetscMallocAlign (mal.c:54) >> > ==2036== by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) >> > ==2036== by 0x54ADDD2: PetscMallocA (mal.c:423) >> > ==2036== by 0x41A52F: main (main.c:9) >> > ==2036== >> > ==2036== LEAK SUMMARY: >> > ==2036== definitely lost: 8 bytes in 1 blocks >> > ==2036== indirectly lost: 0 bytes in 0 blocks >> > ==2036== possibly lost: 0 bytes in 0 blocks >> > ==2036== still reachable: 1,738 bytes in 3 blocks >> > ==2036== suppressed: 0 bytes in 0 blocks >> > ==2036== >> > ==2036== For counts of detected and suppressed errors, rerun with: -v >> > ==2036== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0) >> > >> > >> > The 
first report is the malloc on line 8, fine. >> > The second and the third correspond to still reachable memory from >> > PetscInitialize on line 6, I often got these so I usually discard it. >> > The fourth and last is the one that worries me : the memory from >> > PetscMalloc1 on line 9 is reported as "still reachable", but I don't >> > think it should. >> > Is there something I do not understand, or is this a bug ? >> > >> > Thanks in advance, >> > >> > Pierre >> >> >> >> -- >> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.seize at onera.fr Tue Oct 12 10:06:50 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Tue, 12 Oct 2021 17:06:50 +0200 Subject: [petsc-users] Still reachable memory in valgrind In-Reply-To: <1DA582BF-515E-48B2-93DA-5BD1B3B7070D@petsc.dev> References: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> <3192c2ee-c99a-71a4-798f-e374e00da84a@onera.fr> <25fb8a1d-8c33-5cf8-21e1-2c597c3a0de7@onera.fr> <1DA582BF-515E-48B2-93DA-5BD1B3B7070D@petsc.dev> Message-ID: <7111ed62-77c8-f389-aa43-ff9867c21765@onera.fr> With 3.14 : both malloc and PetscMalloc1 are definitely lost, which is what I want: ==5463== Memcheck, a memory error detector ==5463== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. ==5463== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info ==5463== Command: ./build/bin/yanss data/box.yaml ==5463== ==5463== ==5463== HEAP SUMMARY: ==5463==???? in use at exit: 48 bytes in 3 blocks ==5463==?? total heap usage: 2,092 allocs, 2,089 frees, 9,139,664 bytes allocated ==5463== ==5463== 8 bytes in 1 blocks are definitely lost in loss record 1 of 3 ==5463==??? at 0x4C29BE3: malloc (vg_replace_malloc.c:299) ==5463==??? by 0x4191A1: main (main.c:62) ==5463== ==5463== 8 bytes in 1 blocks are definitely lost in loss record 2 of 3 ==5463==??? at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) ==5463==??? by 0x5655AEF: PetscMallocAlign (mal.c:52) ==5463==??? by 0x5657465: PetscMallocA (mal.c:425) ==5463==??? by 0x4191D3: main (main.c:63) ==5463== ==5463== LEAK SUMMARY: ==5463==??? definitely lost: 16 bytes in 2 blocks ==5463==??? indirectly lost: 0 bytes in 0 blocks ==5463==????? possibly lost: 0 bytes in 0 blocks ==5463==??? still reachable: 32 bytes in 1 blocks ==5463==???????? suppressed: 0 bytes in 0 blocks ==5463== Reachable blocks (those to which a pointer was found) are not shown. ==5463== To see them, rerun with: --leak-check=full --show-leak-kinds=all ==5463== ==5463== For counts of detected and suppressed errors, rerun with: -v ==5463== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0) but on a more recent version, the lost memory from PetscMalloc1 is marked ad reachable. It bothers me as I use valgrind to make sure I free everything. Usually the lost memory would be reported right away, but now it isn't. If I understand Barry's answer, this is because the memory block is large ("1,636 bytes in 1 blocks") and valgrind gives up on this block tracing ? Then out of curiosity, why is this block 8 bytes in 3.14 and 1636 bytes today ? Thank you for your time Pierre On 12/10/21 16:51, Barry Smith wrote: > > ? Do you have the valgrind output from 3.14 ? > >> 1,636 bytes in 1 blocks are still reachable in loss record 4 >> > of 4 >> > ==2036==??? 
at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >> > ==2036==??? by 0x54AC0CB: PetscMallocAlign (mal.c:54) >> > ==2036==??? by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) >> > ==2036==??? by 0x54ADDD2: PetscMallocA (mal.c:423) >> > ==2036==??? by 0x41A52F: main (main.c:9) >> > ==2036== >> > > Given the large amount of memory in the block I think tracing of > PETSc's memory allocation is turned on with this run, this may mean > the memory is reachable but with your 3.14 run I would guess the > memory size is 8 bytes and tracing is not turned on so the memory is > listed as "lost". But I do not understand the subtleties of reachable. > > Barry > > > >> On Oct 12, 2021, at 10:38 AM, Pierre Seize > > wrote: >> >> The "bug" is that memory from PetscMalloc1 that is not freed is >> reported as "definitely lost" in v3.14 (OK) but as "still reachable" >> in today's release (not OK). >> >> Here I forget to free the memory on purpose, I would like valgrind to >> report it's lost and not still reachable. >> >> >> Pierre >> >> >> On 12/10/21 16:24, Matthew Knepley wrote: >>> On Tue, Oct 12, 2021 at 10:16 AM Pierre Seize >> > wrote: >>> >>> Sorry, I should have tried this before: >>> >>> I checked out to v3.14, and now both malloc and PetscMalloc1 are >>> reported as definitely lost, so I would say it's a bug. >>> >>> >>> I am not sure what would be the bug. This is correctly reporting >>> that you did not free the memory. >>> >>> ? Thanks, >>> >>> ? ? Matt >>> >>> Pierre >>> >>> >>> On 12/10/21 15:58, Pierre Seize wrote: >>> > Hello petsc-users >>> > >>> > I am using Valgrind with my PETSc application, and I noticed >>> something: >>> > >>> > ?1 #include >>> > ?2 >>> > ?3 int main(int argc, char **argv){ >>> > ?4 ? PetscErrorCode ierr = 0; >>> > ?5 >>> > ?6 ? ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) >>> return >>> > ierr; >>> > ?7 ? PetscReal *foo; >>> > ?8 ? malloc(sizeof(PetscReal)); >>> > ?9 ? ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr); >>> > 10 ? ierr = PetscFinalize(); >>> > 11?? return ierr; >>> > 12 } >>> > >>> > With this example, with today's release branch, I've got this >>> Valgrind >>> > result (--leak-check=full --show-leak-kinds=all): >>> > >>> > ==2036== Memcheck, a memory error detector >>> > ==2036== Copyright (C) 2002-2015, and GNU GPL'd, by Julian >>> Seward et al. >>> > ==2036== Using Valgrind-3.12.0 and LibVEX; rerun with -h for >>> copyright >>> > info >>> > ==2036== Command: ./build/bin/yanss data/box.yaml >>> > ==2036== >>> > ==2036== >>> > ==2036== HEAP SUMMARY: >>> > ==2036==???? in use at exit: 1,746 bytes in 4 blocks >>> > ==2036==?? total heap usage: 2,172 allocs, 2,168 frees, 9,624,690 >>> > bytes allocated >>> > ==2036== >>> > ==2036== 8 bytes in 1 blocks are definitely lost in loss >>> record 1 of 4 >>> > ==2036==??? at 0x4C29BE3: malloc (vg_replace_malloc.c:299) >>> > ==2036==??? by 0x41A4FD: main (main.c:8) >>> > ==2036== >>> > ==2036== 32 bytes in 1 blocks are still reachable in loss >>> record 2 of 4 >>> > ==2036==??? at 0x4C2B975: calloc (vg_replace_malloc.c:711) >>> > ==2036==??? by 0xACF461F: _dlerror_run (in >>> /usr/lib64/libdl-2.17.so ) >>> > ==2036==??? by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so >>> ) >>> > ==2036==??? by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) >>> > ==2036==??? by 0x56EF325: PetscInitialize (pinit.c:1203) >>> > ==2036==??? by 0x41A4E2: main (main.c:6) >>> > ==2036== >>> > ==2036== 70 bytes in 1 blocks are still reachable in loss >>> record 3 of 4 >>> > ==2036==??? 
at 0x4C29BE3: malloc (vg_replace_malloc.c:299) >>> > ==2036==??? by 0x400F0D0: _dl_signal_error (in >>> /usr/lib64/ld-2.17.so ) >>> > ==2036==??? by 0x400F26D: _dl_signal_cerror (in >>> /usr/lib64/ld-2.17.so ) >>> > ==2036==??? by 0x400A4BC: _dl_lookup_symbol_x (in >>> /usr/lib64/ld-2.17.so ) >>> > ==2036==??? by 0x83B9F02: do_sym (in /usr/lib64/libc-2.17.so >>> ) >>> > ==2036==??? by 0xACF40D3: dlsym_doit (in >>> /usr/lib64/libdl-2.17.so ) >>> > ==2036==??? by 0x400F2D3: _dl_catch_error (in >>> /usr/lib64/ld-2.17.so ) >>> > ==2036==??? by 0xACF45BC: _dlerror_run (in >>> /usr/lib64/libdl-2.17.so ) >>> > ==2036==??? by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so >>> ) >>> > ==2036==??? by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) >>> > ==2036==??? by 0x56EF325: PetscInitialize (pinit.c:1203) >>> > ==2036==??? by 0x41A4E2: main (main.c:6) >>> > ==2036== >>> > ==2036== 1,636 bytes in 1 blocks are still reachable in loss >>> record 4 >>> > of 4 >>> > ==2036==??? at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >>> > ==2036==??? by 0x54AC0CB: PetscMallocAlign (mal.c:54) >>> > ==2036==??? by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) >>> > ==2036==??? by 0x54ADDD2: PetscMallocA (mal.c:423) >>> > ==2036==??? by 0x41A52F: main (main.c:9) >>> > ==2036== >>> > ==2036== LEAK SUMMARY: >>> > ==2036==??? definitely lost: 8 bytes in 1 blocks >>> > ==2036==??? indirectly lost: 0 bytes in 0 blocks >>> > ==2036==????? possibly lost: 0 bytes in 0 blocks >>> > ==2036==??? still reachable: 1,738 bytes in 3 blocks >>> > ==2036==???????? suppressed: 0 bytes in 0 blocks >>> > ==2036== >>> > ==2036== For counts of detected and suppressed errors, rerun >>> with: -v >>> > ==2036== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: >>> 0 from 0) >>> > >>> > >>> > The first report is the malloc on line 8, fine. >>> > The second and the third correspond to still reachable memory >>> from >>> > PetscInitialize on line 6, I often got these so I usually >>> discard it. >>> > The fourth and last is the one that worries me : the memory from >>> > PetscMalloc1 on line 9 is reported as "still reachable", but I >>> don't >>> > think it should. >>> > Is there something I do not understand, or is this a bug ? >>> > >>> > Thanks in advance, >>> > >>> > Pierre >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which >>> their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bsmith at petsc.dev Tue Oct 12 10:16:38 2021 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 12 Oct 2021 11:16:38 -0400 Subject: [petsc-users] Still reachable memory in valgrind In-Reply-To: <7111ed62-77c8-f389-aa43-ff9867c21765@onera.fr> References: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> <3192c2ee-c99a-71a4-798f-e374e00da84a@onera.fr> <25fb8a1d-8c33-5cf8-21e1-2c597c3a0de7@onera.fr> <1DA582BF-515E-48B2-93DA-5BD1B3B7070D@petsc.dev> <7111ed62-77c8-f389-aa43-ff9867c21765@onera.fr> Message-ID: Notice with your 3.14 > 8 bytes in 1 blocks are definitely lost in loss record 2 of 3 > ==5463== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) > ==5463== by 0x5655AEF: PetscMallocAlign (mal.c:52) > ==5463== by 0x5657465: PetscMallocA (mal.c:425) > ==5463== by 0x4191D3: main (main.c:63) > > but with your 3.15 >>>> > ==2036== 1,636 bytes in 1 blocks are still reachable in loss record 4 >>>> > of 4 >>>> > ==2036== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >>>> > ==2036== by 0x54AC0CB: PetscMallocAlign (mal.c:54) >>>> > ==2036== by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) >>>> > ==2036== by 0x54ADDD2: PetscMallocA (mal.c:423) >>>> > ==2036== by 0x41A52F: main (main.c:9) note the >>>> PetscTrMallocDefault so with 3.15 it is using the "tracing" version of PETSc malloc which keeps a list of unfreeded memory but with 3.14 it is not using the tracing version. This could happen because 3.14 was configured with --with-debugging=0 while 3.15 was not. Or having -malloc_debug in the environmental variable PETSC_OPTIONS. But I don't think it is due to any changes in the PETSc source code. Barry > On Oct 12, 2021, at 11:06 AM, Pierre Seize wrote: > > With 3.14 : both malloc and PetscMalloc1 are definitely lost, which is what I want: > > ==5463== Memcheck, a memory error detector > ==5463== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. > ==5463== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info > ==5463== Command: ./build/bin/yanss data/box.yaml > ==5463== > ==5463== > ==5463== HEAP SUMMARY: > ==5463== in use at exit: 48 bytes in 3 blocks > ==5463== total heap usage: 2,092 allocs, 2,089 frees, 9,139,664 bytes allocated > ==5463== > ==5463== 8 bytes in 1 blocks are definitely lost in loss record 1 of 3 > ==5463== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) > ==5463== by 0x4191A1: main (main.c:62) > ==5463== > ==5463== 8 bytes in 1 blocks are definitely lost in loss record 2 of 3 > ==5463== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) > ==5463== by 0x5655AEF: PetscMallocAlign (mal.c:52) > ==5463== by 0x5657465: PetscMallocA (mal.c:425) > ==5463== by 0x4191D3: main (main.c:63) > ==5463== > ==5463== LEAK SUMMARY: > ==5463== definitely lost: 16 bytes in 2 blocks > ==5463== indirectly lost: 0 bytes in 0 blocks > ==5463== possibly lost: 0 bytes in 0 blocks > ==5463== still reachable: 32 bytes in 1 blocks > ==5463== suppressed: 0 bytes in 0 blocks > ==5463== Reachable blocks (those to which a pointer was found) are not shown. > ==5463== To see them, rerun with: --leak-check=full --show-leak-kinds=all > ==5463== > ==5463== For counts of detected and suppressed errors, rerun with: -v > ==5463== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0) > > but on a more recent version, the lost memory from PetscMalloc1 is marked ad reachable. It bothers me as I use valgrind to make sure I free everything. Usually the lost memory would be reported right away, but now it isn't. 
> If I understand Barry's answer, this is because the memory block is large ("1,636 bytes in 1 blocks") and valgrind gives up on this block tracing ? Then out of curiosity, why is this block 8 bytes in 3.14 and 1636 bytes today ? > > Thank you for your time > Pierre > > On 12/10/21 16:51, Barry Smith wrote: >> >> Do you have the valgrind output from 3.14 ? >> >>> 1,636 bytes in 1 blocks are still reachable in loss record 4 >>> > of 4 >>> > ==2036== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >>> > ==2036== by 0x54AC0CB: PetscMallocAlign (mal.c:54) >>> > ==2036== by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) >>> > ==2036== by 0x54ADDD2: PetscMallocA (mal.c:423) >>> > ==2036== by 0x41A52F: main (main.c:9) >>> > ==2036== >> >> Given the large amount of memory in the block I think tracing of PETSc's memory allocation is turned on with this run, this may mean the memory is reachable but with your 3.14 run I would guess the memory size is 8 bytes and tracing is not turned on so the memory is listed as "lost". But I do not understand the subtleties of reachable. >> >> Barry >> >> >> >>> On Oct 12, 2021, at 10:38 AM, Pierre Seize > wrote: >>> >>> The "bug" is that memory from PetscMalloc1 that is not freed is reported as "definitely lost" in v3.14 (OK) but as "still reachable" in today's release (not OK). >>> >>> Here I forget to free the memory on purpose, I would like valgrind to report it's lost and not still reachable. >>> >>> >>> >>> Pierre >>> >>> >>> On 12/10/21 16:24, Matthew Knepley wrote: >>>> On Tue, Oct 12, 2021 at 10:16 AM Pierre Seize > wrote: >>>> Sorry, I should have tried this before: >>>> >>>> I checked out to v3.14, and now both malloc and PetscMalloc1 are >>>> reported as definitely lost, so I would say it's a bug. >>>> >>>> I am not sure what would be the bug. This is correctly reporting that you did not free the memory. >>>> >>>> Thanks, >>>> >>>> Matt >>>> >>>> Pierre >>>> >>>> >>>> On 12/10/21 15:58, Pierre Seize wrote: >>>> > Hello petsc-users >>>> > >>>> > I am using Valgrind with my PETSc application, and I noticed something: >>>> > >>>> > 1 #include >>>> > 2 >>>> > 3 int main(int argc, char **argv){ >>>> > 4 PetscErrorCode ierr = 0; >>>> > 5 >>>> > 6 ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) return >>>> > ierr; >>>> > 7 PetscReal *foo; >>>> > 8 malloc(sizeof(PetscReal)); >>>> > 9 ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr); >>>> > 10 ierr = PetscFinalize(); >>>> > 11 return ierr; >>>> > 12 } >>>> > >>>> > With this example, with today's release branch, I've got this Valgrind >>>> > result (--leak-check=full --show-leak-kinds=all): >>>> > >>>> > ==2036== Memcheck, a memory error detector >>>> > ==2036== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. 
>>>> > ==2036== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright >>>> > info >>>> > ==2036== Command: ./build/bin/yanss data/box.yaml >>>> > ==2036== >>>> > ==2036== >>>> > ==2036== HEAP SUMMARY: >>>> > ==2036== in use at exit: 1,746 bytes in 4 blocks >>>> > ==2036== total heap usage: 2,172 allocs, 2,168 frees, 9,624,690 >>>> > bytes allocated >>>> > ==2036== >>>> > ==2036== 8 bytes in 1 blocks are definitely lost in loss record 1 of 4 >>>> > ==2036== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) >>>> > ==2036== by 0x41A4FD: main (main.c:8) >>>> > ==2036== >>>> > ==2036== 32 bytes in 1 blocks are still reachable in loss record 2 of 4 >>>> > ==2036== at 0x4C2B975: calloc (vg_replace_malloc.c:711) >>>> > ==2036== by 0xACF461F: _dlerror_run (in /usr/lib64/libdl-2.17.so ) >>>> > ==2036== by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so ) >>>> > ==2036== by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) >>>> > ==2036== by 0x56EF325: PetscInitialize (pinit.c:1203) >>>> > ==2036== by 0x41A4E2: main (main.c:6) >>>> > ==2036== >>>> > ==2036== 70 bytes in 1 blocks are still reachable in loss record 3 of 4 >>>> > ==2036== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) >>>> > ==2036== by 0x400F0D0: _dl_signal_error (in /usr/lib64/ld-2.17.so ) >>>> > ==2036== by 0x400F26D: _dl_signal_cerror (in /usr/lib64/ld-2.17.so ) >>>> > ==2036== by 0x400A4BC: _dl_lookup_symbol_x (in /usr/lib64/ld-2.17.so ) >>>> > ==2036== by 0x83B9F02: do_sym (in /usr/lib64/libc-2.17.so ) >>>> > ==2036== by 0xACF40D3: dlsym_doit (in /usr/lib64/libdl-2.17.so ) >>>> > ==2036== by 0x400F2D3: _dl_catch_error (in /usr/lib64/ld-2.17.so ) >>>> > ==2036== by 0xACF45BC: _dlerror_run (in /usr/lib64/libdl-2.17.so ) >>>> > ==2036== by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so ) >>>> > ==2036== by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) >>>> > ==2036== by 0x56EF325: PetscInitialize (pinit.c:1203) >>>> > ==2036== by 0x41A4E2: main (main.c:6) >>>> > ==2036== >>>> > ==2036== 1,636 bytes in 1 blocks are still reachable in loss record 4 >>>> > of 4 >>>> > ==2036== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >>>> > ==2036== by 0x54AC0CB: PetscMallocAlign (mal.c:54) >>>> > ==2036== by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) >>>> > ==2036== by 0x54ADDD2: PetscMallocA (mal.c:423) >>>> > ==2036== by 0x41A52F: main (main.c:9) >>>> > ==2036== >>>> > ==2036== LEAK SUMMARY: >>>> > ==2036== definitely lost: 8 bytes in 1 blocks >>>> > ==2036== indirectly lost: 0 bytes in 0 blocks >>>> > ==2036== possibly lost: 0 bytes in 0 blocks >>>> > ==2036== still reachable: 1,738 bytes in 3 blocks >>>> > ==2036== suppressed: 0 bytes in 0 blocks >>>> > ==2036== >>>> > ==2036== For counts of detected and suppressed errors, rerun with: -v >>>> > ==2036== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0) >>>> > >>>> > >>>> > The first report is the malloc on line 8, fine. >>>> > The second and the third correspond to still reachable memory from >>>> > PetscInitialize on line 6, I often got these so I usually discard it. >>>> > The fourth and last is the one that worries me : the memory from >>>> > PetscMalloc1 on line 9 is reported as "still reachable", but I don't >>>> > think it should. >>>> > Is there something I do not understand, or is this a bug ? >>>> > >>>> > Thanks in advance, >>>> > >>>> > Pierre >>>> >>>> >>>> >>>> -- >>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
>>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.zampini at gmail.com Tue Oct 12 10:17:29 2021 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Tue, 12 Oct 2021 18:17:29 +0300 Subject: [petsc-users] Still reachable memory in valgrind In-Reply-To: <7111ed62-77c8-f389-aa43-ff9867c21765@onera.fr> References: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> <3192c2ee-c99a-71a4-798f-e374e00da84a@onera.fr> <25fb8a1d-8c33-5cf8-21e1-2c597c3a0de7@onera.fr> <1DA582BF-515E-48B2-93DA-5BD1B3B7070D@petsc.dev> <7111ed62-77c8-f389-aa43-ff9867c21765@onera.fr> Message-ID: Your are using two different mallocs in PETSc. For your 3.14 test, PetscMallocAlign is used, while for 3.16, PetscTrMallocDefault is called, which uses much more memory to trace memory corruption previous allocated PETSc data. Il giorno mar 12 ott 2021 alle ore 18:07 Pierre Seize ha scritto: > With 3.14 : both malloc and PetscMalloc1 are definitely lost, which is > what I want: > > ==5463== Memcheck, a memory error detector > ==5463== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. > ==5463== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info > ==5463== Command: ./build/bin/yanss data/box.yaml > ==5463== > ==5463== > ==5463== HEAP SUMMARY: > ==5463== in use at exit: 48 bytes in 3 blocks > ==5463== total heap usage: 2,092 allocs, 2,089 frees, 9,139,664 bytes > allocated > ==5463== > ==5463== 8 bytes in 1 blocks are definitely lost in loss record 1 of 3 > ==5463== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) > ==5463== by 0x4191A1: main (main.c:62) > ==5463== > ==5463== 8 bytes in 1 blocks are definitely lost in loss record 2 of 3 > ==5463== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) > ==5463== by 0x5655AEF: PetscMallocAlign (mal.c:52) > ==5463== by 0x5657465: PetscMallocA (mal.c:425) > ==5463== by 0x4191D3: main (main.c:63) > ==5463== > ==5463== LEAK SUMMARY: > ==5463== definitely lost: 16 bytes in 2 blocks > ==5463== indirectly lost: 0 bytes in 0 blocks > ==5463== possibly lost: 0 bytes in 0 blocks > ==5463== still reachable: 32 bytes in 1 blocks > ==5463== suppressed: 0 bytes in 0 blocks > ==5463== Reachable blocks (those to which a pointer was found) are not > shown. > ==5463== To see them, rerun with: --leak-check=full --show-leak-kinds=all > ==5463== > ==5463== For counts of detected and suppressed errors, rerun with: -v > ==5463== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0) > but on a more recent version, the lost memory from PetscMalloc1 is marked > ad reachable. It bothers me as I use valgrind to make sure I free > everything. Usually the lost memory would be reported right away, but now > it isn't. > If I understand Barry's answer, this is because the memory block is large > ("1,636 bytes in 1 blocks") and valgrind gives up on this block tracing ? > Then out of curiosity, why is this block 8 bytes in 3.14 and 1636 bytes > today ? > > Thank you for your time > Pierre > > On 12/10/21 16:51, Barry Smith wrote: > > > Do you have the valgrind output from 3.14 ? 
> > 1,636 bytes in 1 blocks are still reachable in loss record 4 >> > of 4 >> > ==2036== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >> > ==2036== by 0x54AC0CB: PetscMallocAlign (mal.c:54) >> > ==2036== by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) >> > ==2036== by 0x54ADDD2: PetscMallocA (mal.c:423) >> > ==2036== by 0x41A52F: main (main.c:9) >> > ==2036== > > > Given the large amount of memory in the block I think tracing of PETSc's > memory allocation is turned on with this run, this may mean the memory is > reachable but with your 3.14 run I would guess the memory size is 8 bytes > and tracing is not turned on so the memory is listed as "lost". But I do > not understand the subtleties of reachable. > > Barry > > > > On Oct 12, 2021, at 10:38 AM, Pierre Seize wrote: > > The "bug" is that memory from PetscMalloc1 that is not freed is reported > as "definitely lost" in v3.14 (OK) but as "still reachable" in today's > release (not OK). > > Here I forget to free the memory on purpose, I would like valgrind to > report it's lost and not still reachable. > > > Pierre > > On 12/10/21 16:24, Matthew Knepley wrote: > > On Tue, Oct 12, 2021 at 10:16 AM Pierre Seize > wrote: > >> Sorry, I should have tried this before: >> >> I checked out to v3.14, and now both malloc and PetscMalloc1 are >> reported as definitely lost, so I would say it's a bug. >> > > I am not sure what would be the bug. This is correctly reporting that you > did not free the memory. > > Thanks, > > Matt > > >> Pierre >> >> >> On 12/10/21 15:58, Pierre Seize wrote: >> > Hello petsc-users >> > >> > I am using Valgrind with my PETSc application, and I noticed something: >> > >> > 1 #include >> > 2 >> > 3 int main(int argc, char **argv){ >> > 4 PetscErrorCode ierr = 0; >> > 5 >> > 6 ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) return >> > ierr; >> > 7 PetscReal *foo; >> > 8 malloc(sizeof(PetscReal)); >> > 9 ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr); >> > 10 ierr = PetscFinalize(); >> > 11 return ierr; >> > 12 } >> > >> > With this example, with today's release branch, I've got this Valgrind >> > result (--leak-check=full --show-leak-kinds=all): >> > >> > ==2036== Memcheck, a memory error detector >> > ==2036== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. 
>> > ==2036== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright >> > info >> > ==2036== Command: ./build/bin/yanss data/box.yaml >> > ==2036== >> > ==2036== >> > ==2036== HEAP SUMMARY: >> > ==2036== in use at exit: 1,746 bytes in 4 blocks >> > ==2036== total heap usage: 2,172 allocs, 2,168 frees, 9,624,690 >> > bytes allocated >> > ==2036== >> > ==2036== 8 bytes in 1 blocks are definitely lost in loss record 1 of 4 >> > ==2036== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) >> > ==2036== by 0x41A4FD: main (main.c:8) >> > ==2036== >> > ==2036== 32 bytes in 1 blocks are still reachable in loss record 2 of 4 >> > ==2036== at 0x4C2B975: calloc (vg_replace_malloc.c:711) >> > ==2036== by 0xACF461F: _dlerror_run (in /usr/lib64/libdl-2.17.so) >> > ==2036== by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so) >> > ==2036== by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) >> > ==2036== by 0x56EF325: PetscInitialize (pinit.c:1203) >> > ==2036== by 0x41A4E2: main (main.c:6) >> > ==2036== >> > ==2036== 70 bytes in 1 blocks are still reachable in loss record 3 of 4 >> > ==2036== at 0x4C29BE3: malloc (vg_replace_malloc.c:299) >> > ==2036== by 0x400F0D0: _dl_signal_error (in /usr/lib64/ld-2.17.so) >> > ==2036== by 0x400F26D: _dl_signal_cerror (in /usr/lib64/ld-2.17.so) >> > ==2036== by 0x400A4BC: _dl_lookup_symbol_x (in /usr/lib64/ld-2.17.so >> ) >> > ==2036== by 0x83B9F02: do_sym (in /usr/lib64/libc-2.17.so) >> > ==2036== by 0xACF40D3: dlsym_doit (in /usr/lib64/libdl-2.17.so) >> > ==2036== by 0x400F2D3: _dl_catch_error (in /usr/lib64/ld-2.17.so) >> > ==2036== by 0xACF45BC: _dlerror_run (in /usr/lib64/libdl-2.17.so) >> > ==2036== by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so) >> > ==2036== by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) >> > ==2036== by 0x56EF325: PetscInitialize (pinit.c:1203) >> > ==2036== by 0x41A4E2: main (main.c:6) >> > ==2036== >> > ==2036== 1,636 bytes in 1 blocks are still reachable in loss record 4 >> > of 4 >> > ==2036== at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >> > ==2036== by 0x54AC0CB: PetscMallocAlign (mal.c:54) >> > ==2036== by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) >> > ==2036== by 0x54ADDD2: PetscMallocA (mal.c:423) >> > ==2036== by 0x41A52F: main (main.c:9) >> > ==2036== >> > ==2036== LEAK SUMMARY: >> > ==2036== definitely lost: 8 bytes in 1 blocks >> > ==2036== indirectly lost: 0 bytes in 0 blocks >> > ==2036== possibly lost: 0 bytes in 0 blocks >> > ==2036== still reachable: 1,738 bytes in 3 blocks >> > ==2036== suppressed: 0 bytes in 0 blocks >> > ==2036== >> > ==2036== For counts of detected and suppressed errors, rerun with: -v >> > ==2036== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0) >> > >> > >> > The first report is the malloc on line 8, fine. >> > The second and the third correspond to still reachable memory from >> > PetscInitialize on line 6, I often got these so I usually discard it. >> > The fourth and last is the one that worries me : the memory from >> > PetscMalloc1 on line 9 is reported as "still reachable", but I don't >> > think it should. >> > Is there something I do not understand, or is this a bug ? >> > >> > Thanks in advance, >> > >> > Pierre >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > > > > -- Stefano -------------- next part -------------- An HTML attachment was scrubbed... 
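A minimal, self-contained version of the reproducer discussed in this thread is sketched below. The header names and the commented-out PetscFree are assumptions added here for completeness (the include arguments were lost in the archive and the original test deliberately does not free); everything else follows the snippet quoted above. With PETSc's tracing malloc active (PetscTrMallocDefault, i.e. a --with-debugging build or -malloc_debug), an unfreed PetscMalloc1 block is presumably still pointed to by PETSc's internal list of outstanding allocations, so valgrind classifies it as "still reachable"; without the tracing version it is reported as "definitely lost".

#include <stdlib.h>
#include <petscsys.h>   /* assumed header; the original #include argument was scrubbed */

int main(int argc, char **argv)
{
  PetscErrorCode ierr = 0;
  PetscReal     *foo;

  ierr = PetscInitialize(&argc, &argv, NULL, ""); if (ierr) return ierr;
  (void)malloc(sizeof(PetscReal));                /* plain malloc, never freed -> "definitely lost"      */
  ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr);    /* PETSc malloc, intentionally not freed in this test  */
  /* ierr = PetscFree(foo); CHKERRQ(ierr); */     /* uncomment to make the program actually leak-free    */
  ierr = PetscFinalize();
  return ierr;
}

Running this under valgrind with --leak-check=full --show-leak-kinds=all reproduces the reports quoted above; adding the -malloc_dump option mentioned later in this thread makes PETSc itself list any unfreed PetscMalloc memory at PetscFinalize(), independently of valgrind.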
URL: From pierre.seize at onera.fr Tue Oct 12 10:19:04 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Tue, 12 Oct 2021 17:19:04 +0200 Subject: [petsc-users] Still reachable memory in valgrind In-Reply-To: References: <432739db-ca34-4a3e-cbf6-d28cdbb10c32@onera.fr> <3192c2ee-c99a-71a4-798f-e374e00da84a@onera.fr> <25fb8a1d-8c33-5cf8-21e1-2c597c3a0de7@onera.fr> <1DA582BF-515E-48B2-93DA-5BD1B3B7070D@petsc.dev> <7111ed62-77c8-f389-aa43-ff9867c21765@onera.fr> Message-ID: Thank you ! I configured both with the same options, but maybe the default have changed between versions. Now I understand. And thank you for the -malloc_dump option, I forgot about it. Pierre On 12/10/21 17:16, Barry Smith wrote: > > ? ? Notice with your 3.14 > >> 8 bytes in 1 blocks are definitely lost in loss record 2 of 3 >> ==5463==??? at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >> ==5463==??? by 0x5655AEF: PetscMallocAlign (mal.c:52) >> ==5463==??? by 0x5657465: PetscMallocA (mal.c:425) >> ==5463==??? by 0x4191D3: main (main.c:63) >> >> > > but with your 3.15 > >>>>> > ==2036== 1,636 bytes in 1 blocks are still reachable in loss >>>>> record 4 >>>>> > of 4 >>>>> > ==2036==??? at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >>>>> > ==2036==??? by 0x54AC0CB: PetscMallocAlign (mal.c:54) >>>>> > ==2036==??? by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) >>>>> > ==2036==??? by 0x54ADDD2: PetscMallocA (mal.c:423) >>>>> > ==2036==??? by 0x41A52F: main (main.c:9) >>>>> > > note the > >>>>> PetscTrMallocDefault >>>>> > > > so with 3.15 it is using the "tracing" version of PETSc malloc which > keeps a list of unfreeded memory but with 3.14 it is not using the > tracing version. This could happen because 3.14 was configured with > --with-debugging=0 while 3.15 was not. Or having -malloc_debug in the > environmental variable PETSC_OPTIONS. But I don't think it is due to > any changes in the PETSc source code. > > ? Barry > > >> On Oct 12, 2021, at 11:06 AM, Pierre Seize > > wrote: >> >> With 3.14 : both malloc and PetscMalloc1 are definitely lost, which >> is what I want: >> >> ==5463== Memcheck, a memory error detector >> ==5463== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. >> ==5463== Using Valgrind-3.12.0 and LibVEX; rerun with -h for >> copyright info >> ==5463== Command: ./build/bin/yanss data/box.yaml >> ==5463== >> ==5463== >> ==5463== HEAP SUMMARY: >> ==5463==???? in use at exit: 48 bytes in 3 blocks >> ==5463==?? total heap usage: 2,092 allocs, 2,089 frees, 9,139,664 >> bytes allocated >> ==5463== >> ==5463== 8 bytes in 1 blocks are definitely lost in loss record 1 of 3 >> ==5463==??? at 0x4C29BE3: malloc (vg_replace_malloc.c:299) >> ==5463==??? by 0x4191A1: main (main.c:62) >> ==5463== >> ==5463== 8 bytes in 1 blocks are definitely lost in loss record 2 of 3 >> ==5463==??? at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >> ==5463==??? by 0x5655AEF: PetscMallocAlign (mal.c:52) >> ==5463==??? by 0x5657465: PetscMallocA (mal.c:425) >> ==5463==??? by 0x4191D3: main (main.c:63) >> ==5463== >> ==5463== LEAK SUMMARY: >> ==5463==??? definitely lost: 16 bytes in 2 blocks >> ==5463==??? indirectly lost: 0 bytes in 0 blocks >> ==5463==????? possibly lost: 0 bytes in 0 blocks >> ==5463==??? still reachable: 32 bytes in 1 blocks >> ==5463==???????? suppressed: 0 bytes in 0 blocks >> ==5463== Reachable blocks (those to which a pointer was found) are >> not shown. 
>> ==5463== To see them, rerun with: --leak-check=full --show-leak-kinds=all >> ==5463== >> ==5463== For counts of detected and suppressed errors, rerun with: -v >> ==5463== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0) >> >> but on a more recent version, the lost memory from PetscMalloc1 is >> marked ad reachable. It bothers me as I use valgrind to make sure I >> free everything. Usually the lost memory would be reported right >> away, but now it isn't. >> If I understand Barry's answer, this is because the memory block is >> large ("1,636 bytes in 1 blocks") and valgrind gives up on this block >> tracing ? Then out of curiosity, why is this block 8 bytes in 3.14 >> and 1636 bytes today ? >> >> Thank you for your time >> Pierre >> >> On 12/10/21 16:51, Barry Smith wrote: >>> >>> ? Do you have the valgrind output from 3.14 ? >>> >>>> 1,636 bytes in 1 blocks are still reachable in loss record 4 >>>> > of 4 >>>> > ==2036==??? at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >>>> > ==2036==??? by 0x54AC0CB: PetscMallocAlign (mal.c:54) >>>> > ==2036==??? by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) >>>> > ==2036==??? by 0x54ADDD2: PetscMallocA (mal.c:423) >>>> > ==2036==??? by 0x41A52F: main (main.c:9) >>>> > ==2036== >>>> >>> >>> Given the large amount of memory in the block I think tracing of >>> PETSc's memory allocation is turned on with this run, this may mean >>> the memory is reachable but with your 3.14 run I would guess the >>> memory size is 8 bytes and tracing is not turned on so the memory is >>> listed as "lost". But I do not understand the subtleties of reachable. >>> >>> Barry >>> >>> >>> >>>> On Oct 12, 2021, at 10:38 AM, Pierre Seize >>> > wrote: >>>> >>>> The "bug" is that memory from PetscMalloc1 that is not freed is >>>> reported as "definitely lost" in v3.14 (OK) but as "still >>>> reachable" in today's release (not OK). >>>> >>>> Here I forget to free the memory on purpose, I would like valgrind >>>> to report it's lost and not still reachable. >>>> >>>> >>>> Pierre >>>> >>>> >>>> On 12/10/21 16:24, Matthew Knepley wrote: >>>>> On Tue, Oct 12, 2021 at 10:16 AM Pierre Seize >>>>> > wrote: >>>>> >>>>> Sorry, I should have tried this before: >>>>> >>>>> I checked out to v3.14, and now both malloc and PetscMalloc1 are >>>>> reported as definitely lost, so I would say it's a bug. >>>>> >>>>> >>>>> I am not sure what would be the bug. This is correctly reporting >>>>> that you did not free the memory. >>>>> >>>>> ? Thanks, >>>>> >>>>> ? ? Matt >>>>> >>>>> Pierre >>>>> >>>>> >>>>> On 12/10/21 15:58, Pierre Seize wrote: >>>>> > Hello petsc-users >>>>> > >>>>> > I am using Valgrind with my PETSc application, and I noticed >>>>> something: >>>>> > >>>>> > ?1 #include >>>>> > ?2 >>>>> > ?3 int main(int argc, char **argv){ >>>>> > ?4 ? PetscErrorCode ierr = 0; >>>>> > ?5 >>>>> > ?6 ? ierr = PetscInitialize(&argc, &argv, NULL, ""); if >>>>> (ierr) return >>>>> > ierr; >>>>> > ?7 ? PetscReal *foo; >>>>> > ?8 ? malloc(sizeof(PetscReal)); >>>>> > ?9 ? ierr = PetscMalloc1(1, &foo); CHKERRQ(ierr); >>>>> > 10 ? ierr = PetscFinalize(); >>>>> > 11?? return ierr; >>>>> > 12 } >>>>> > >>>>> > With this example, with today's release branch, I've got >>>>> this Valgrind >>>>> > result (--leak-check=full --show-leak-kinds=all): >>>>> > >>>>> > ==2036== Memcheck, a memory error detector >>>>> > ==2036== Copyright (C) 2002-2015, and GNU GPL'd, by Julian >>>>> Seward et al. 
>>>>> > ==2036== Using Valgrind-3.12.0 and LibVEX; rerun with -h for >>>>> copyright >>>>> > info >>>>> > ==2036== Command: ./build/bin/yanss data/box.yaml >>>>> > ==2036== >>>>> > ==2036== >>>>> > ==2036== HEAP SUMMARY: >>>>> > ==2036==???? in use at exit: 1,746 bytes in 4 blocks >>>>> > ==2036==?? total heap usage: 2,172 allocs, 2,168 frees, >>>>> 9,624,690 >>>>> > bytes allocated >>>>> > ==2036== >>>>> > ==2036== 8 bytes in 1 blocks are definitely lost in loss >>>>> record 1 of 4 >>>>> > ==2036==??? at 0x4C29BE3: malloc (vg_replace_malloc.c:299) >>>>> > ==2036==??? by 0x41A4FD: main (main.c:8) >>>>> > ==2036== >>>>> > ==2036== 32 bytes in 1 blocks are still reachable in loss >>>>> record 2 of 4 >>>>> > ==2036==??? at 0x4C2B975: calloc (vg_replace_malloc.c:711) >>>>> > ==2036==??? by 0xACF461F: _dlerror_run (in >>>>> /usr/lib64/libdl-2.17.so ) >>>>> > ==2036==??? by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so >>>>> ) >>>>> > ==2036==??? by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) >>>>> > ==2036==??? by 0x56EF325: PetscInitialize (pinit.c:1203) >>>>> > ==2036==??? by 0x41A4E2: main (main.c:6) >>>>> > ==2036== >>>>> > ==2036== 70 bytes in 1 blocks are still reachable in loss >>>>> record 3 of 4 >>>>> > ==2036==??? at 0x4C29BE3: malloc (vg_replace_malloc.c:299) >>>>> > ==2036==??? by 0x400F0D0: _dl_signal_error (in >>>>> /usr/lib64/ld-2.17.so ) >>>>> > ==2036==??? by 0x400F26D: _dl_signal_cerror (in >>>>> /usr/lib64/ld-2.17.so ) >>>>> > ==2036==??? by 0x400A4BC: _dl_lookup_symbol_x (in >>>>> /usr/lib64/ld-2.17.so ) >>>>> > ==2036==??? by 0x83B9F02: do_sym (in /usr/lib64/libc-2.17.so >>>>> ) >>>>> > ==2036==??? by 0xACF40D3: dlsym_doit (in >>>>> /usr/lib64/libdl-2.17.so ) >>>>> > ==2036==??? by 0x400F2D3: _dl_catch_error (in >>>>> /usr/lib64/ld-2.17.so ) >>>>> > ==2036==??? by 0xACF45BC: _dlerror_run (in >>>>> /usr/lib64/libdl-2.17.so ) >>>>> > ==2036==??? by 0xACF4127: dlsym (in /usr/lib64/libdl-2.17.so >>>>> ) >>>>> > ==2036==??? by 0x56ECBB5: PetscInitialize_Common (pinit.c:785) >>>>> > ==2036==??? by 0x56EF325: PetscInitialize (pinit.c:1203) >>>>> > ==2036==??? by 0x41A4E2: main (main.c:6) >>>>> > ==2036== >>>>> > ==2036== 1,636 bytes in 1 blocks are still reachable in loss >>>>> record 4 >>>>> > of 4 >>>>> > ==2036==??? at 0x4C2BE2D: memalign (vg_replace_malloc.c:858) >>>>> > ==2036==??? by 0x54AC0CB: PetscMallocAlign (mal.c:54) >>>>> > ==2036==??? by 0x54AFBA9: PetscTrMallocDefault (mtr.c:183) >>>>> > ==2036==??? by 0x54ADDD2: PetscMallocA (mal.c:423) >>>>> > ==2036==??? by 0x41A52F: main (main.c:9) >>>>> > ==2036== >>>>> > ==2036== LEAK SUMMARY: >>>>> > ==2036==??? definitely lost: 8 bytes in 1 blocks >>>>> > ==2036==??? indirectly lost: 0 bytes in 0 blocks >>>>> > ==2036==????? possibly lost: 0 bytes in 0 blocks >>>>> > ==2036==??? still reachable: 1,738 bytes in 3 blocks >>>>> > ==2036==???????? suppressed: 0 bytes in 0 blocks >>>>> > ==2036== >>>>> > ==2036== For counts of detected and suppressed errors, rerun >>>>> with: -v >>>>> > ==2036== ERROR SUMMARY: 1 errors from 1 contexts >>>>> (suppressed: 0 from 0) >>>>> > >>>>> > >>>>> > The first report is the malloc on line 8, fine. >>>>> > The second and the third correspond to still reachable >>>>> memory from >>>>> > PetscInitialize on line 6, I often got these so I usually >>>>> discard it. >>>>> > The fourth and last is the one that worries me : the memory >>>>> from >>>>> > PetscMalloc1 on line 9 is reported as "still reachable", but >>>>> I don't >>>>> > think it should. 
>>>>> > Is there something I do not understand, or is this a bug ? >>>>> > >>>>> > Thanks in advance, >>>>> > >>>>> > Pierre >>>>> >>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before they begin their >>>>> experiments is infinitely more interesting than any results to >>>>> which their experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://www.cse.buffalo.edu/~knepley/ >>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cliu at pppl.gov Tue Oct 12 10:19:25 2021 From: cliu at pppl.gov (Chang Liu) Date: Tue, 12 Oct 2021 11:19:25 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: Message-ID: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> Hi Junchao, No I only needs it to be transferred within a node. I use block-Jacobi method and GMRES to solve the sparse matrix, so each direct solver will take care of a sub-block of the whole matrix. In this way, I can use one GPU to solve one sub-block, which is stored within one node. It was stated in the documentation that cusparse solver is slow. However, in my test using ex72.c, the cusparse solver is faster than mumps or superlu_dist on CPUs. Chang On 10/12/21 10:24 AM, Junchao Zhang wrote: > Hi, Chang, > ? ?For the mumps solver, we usually transfers matrix and vector data > within a compute node.? For the idea you propose, it looks like we need > to gather data within MPI_COMM_WORLD, right? > > ? ?Mark, I remember you said cusparse solve is slow and you would > rather do it on CPU. Is it right? > > --Junchao Zhang > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users > > wrote: > > Hi, > > Currently, it is possible to use mumps solver in PETSC with > -mat_mumps_use_omp_threads option, so that multiple MPI processes will > transfer the matrix and rhs data to the master rank, and then master > rank will call mumps with OpenMP to solve the matrix. > > I wonder if someone can develop similar option for cusparse solver. > Right now, this solver does not work with mpiaijcusparse. I think a > possible workaround is to transfer all the matrix data to one MPI > process, and then upload the data to GPU to solve. In this way, one can > use cusparse solver for a MPI program. > > Chang > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From mfadams at lbl.gov Tue Oct 12 12:05:01 2021 From: mfadams at lbl.gov (Mark Adams) Date: Tue, 12 Oct 2021 13:05:01 -0400 Subject: [petsc-users] request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: Message-ID: On Tue, Oct 12, 2021 at 10:24 AM Junchao Zhang wrote: > Hi, Chang, > For the mumps solver, we usually transfers matrix and vector data > within a compute node. For the idea you propose, it looks like we need to > gather data within MPI_COMM_WORLD, right? > > Mark, I remember you said cusparse solve is slow and you would rather > do it on CPU. Is it right? > Yes, I find that cuSparse solve is slower on our sparse CPU lu factorization solves than the (old) CPU solve. I have an MR to allow the use of the CPU solve with LU and cusparse. 
I am running many fairly small problems and the factorization is on the CPU so a CPU solve keeps the factors on the CPU. I would imagine cuSparse solves would be faster at some point as you scale up. > > --Junchao Zhang > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users < > petsc-users at mcs.anl.gov> wrote: > >> Hi, >> >> Currently, it is possible to use mumps solver in PETSC with >> -mat_mumps_use_omp_threads option, so that multiple MPI processes will >> transfer the matrix and rhs data to the master rank, and then master >> rank will call mumps with OpenMP to solve the matrix. >> >> I wonder if someone can develop similar option for cusparse solver. >> Right now, this solver does not work with mpiaijcusparse. I think a >> possible workaround is to transfer all the matrix data to one MPI >> process, and then upload the data to GPU to solve. In this way, one can >> use cusparse solver for a MPI program. >> >> Chang >> -- >> Chang Liu >> Staff Research Physicist >> +1 609 243 3438 >> cliu at pppl.gov >> Princeton Plasma Physics Laboratory >> 100 Stellarator Rd, Princeton NJ 08540, USA >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Tue Oct 12 12:17:29 2021 From: mfadams at lbl.gov (Mark Adams) Date: Tue, 12 Oct 2021 13:17:29 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> Message-ID: On Tue, Oct 12, 2021 at 11:19 AM Chang Liu wrote: > Hi Junchao, > > No I only needs it to be transferred within a node. I use block-Jacobi > method and GMRES to solve the sparse matrix, so each direct solver will > take care of a sub-block of the whole matrix. In this way, I can use one > GPU to solve one sub-block, which is stored within one node. > > It was stated in the documentation that cusparse solver is slow. > However, in my test using ex72.c, the cusparse solver is faster than > mumps or superlu_dist on CPUs. > Are we talking about the factorization, the solve, or both? We do not have an interface to cuSparse's LU factorization (I just learned that it exists a few weeks ago). Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse' ? This would be the CPU factorization, which is the dominant cost. > Chang > > On 10/12/21 10:24 AM, Junchao Zhang wrote: > > Hi, Chang, > > For the mumps solver, we usually transfers matrix and vector data > > within a compute node. For the idea you propose, it looks like we need > > to gather data within MPI_COMM_WORLD, right? > > > > Mark, I remember you said cusparse solve is slow and you would > > rather do it on CPU. Is it right? > > > > --Junchao Zhang > > > > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users > > > wrote: > > > > Hi, > > > > Currently, it is possible to use mumps solver in PETSC with > > -mat_mumps_use_omp_threads option, so that multiple MPI processes > will > > transfer the matrix and rhs data to the master rank, and then master > > rank will call mumps with OpenMP to solve the matrix. > > > > I wonder if someone can develop similar option for cusparse solver. > > Right now, this solver does not work with mpiaijcusparse. I think a > > possible workaround is to transfer all the matrix data to one MPI > > process, and then upload the data to GPU to solve. In this way, one > can > > use cusparse solver for a MPI program. 
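For reference, the block-Jacobi arrangement Chang describes (an outer (F)GMRES with one direct sub-solve per local block) can also be configured in code rather than on the command line; the exact runtime options used appear in Chang's next message. The following is only a sketch under the assumption of a CUDA-enabled PETSc build with an already assembled AIJCUSPARSE matrix and matching vectors; it is not code from this thread.

#include <petscksp.h>

/* Sketch: outer FGMRES, block Jacobi, LU sub-solves handed to cuSPARSE.
   A is assumed to be an assembled (MPI)AIJCUSPARSE matrix; b and x match its layout. */
static PetscErrorCode SolveWithBJacobiCusparse(Mat A, Vec b, Vec x)
{
  KSP            ksp, *subksp;
  PC             pc, subpc;
  PetscInt       nlocal, first, i;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = KSPCreate(PetscObjectComm((PetscObject)A), &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPFGMRES);CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCBJACOBI);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);                        /* the sub-KSPs exist only after setup */
  ierr = PCBJacobiGetSubKSP(pc, &nlocal, &first, &subksp);CHKERRQ(ierr);
  for (i = 0; i < nlocal; i++) {                             /* one direct solve per local block    */
    ierr = KSPSetType(subksp[i], KSPPREONLY);CHKERRQ(ierr);
    ierr = KSPGetPC(subksp[i], &subpc);CHKERRQ(ierr);
    ierr = PCSetType(subpc, PCLU);CHKERRQ(ierr);
    ierr = PCFactorSetMatSolverType(subpc, MATSOLVERCUSPARSE);CHKERRQ(ierr);
  }
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}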
> > > > Chang > > -- > > Chang Liu > > Staff Research Physicist > > +1 609 243 3438 > > cliu at pppl.gov > > Princeton Plasma Physics Laboratory > > 100 Stellarator Rd, Princeton NJ 08540, USA > > > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cliu at pppl.gov Tue Oct 12 12:45:08 2021 From: cliu at pppl.gov (Chang Liu) Date: Tue, 12 Oct 2021 13:45:08 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> Message-ID: <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> Hi Mark, The option I use is like -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse -sub_ksp_type preonly -sub_pc_type lu -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300 I think this one do both factorization and solve on gpu. You can check the runex72_aijcusparse.sh file in petsc install directory, and try it your self (this is only lu factorization without iterative solve). Chang On 10/12/21 1:17 PM, Mark Adams wrote: > > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu > wrote: > > Hi Junchao, > > No I only needs it to be transferred within a node. I use block-Jacobi > method and GMRES to solve the sparse matrix, so each direct solver will > take care of a sub-block of the whole matrix. In this way, I can use > one > GPU to solve one sub-block, which is stored within one node. > > It was stated in the documentation that cusparse solver is slow. > However, in my test using ex72.c, the cusparse solver is faster than > mumps or superlu_dist on CPUs. > > > Are we talking about the factorization, the solve, or both? > > We do not have an interface?to cuSparse's?LU factorization (I just > learned that it exists a few weeks ago). > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type > aijcusparse' ? This would be the CPU factorization, which is the > dominant?cost. > > > Chang > > On 10/12/21 10:24 AM, Junchao Zhang wrote: > > Hi, Chang, > >? ? ?For the mumps solver, we usually transfers matrix and vector > data > > within a compute node.? For the idea you propose, it looks like > we need > > to gather data within MPI_COMM_WORLD, right? > > > >? ? ?Mark, I remember you said cusparse solve is slow and you would > > rather do it on CPU. Is it right? > > > > --Junchao Zhang > > > > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users > > > >> > wrote: > > > >? ? ?Hi, > > > >? ? ?Currently, it is possible to use mumps solver in PETSC with > >? ? ?-mat_mumps_use_omp_threads option, so that multiple MPI > processes will > >? ? ?transfer the matrix and rhs data to the master rank, and then > master > >? ? ?rank will call mumps with OpenMP to solve the matrix. > > > >? ? ?I wonder if someone can develop similar option for cusparse > solver. > >? ? ?Right now, this solver does not work with mpiaijcusparse. I > think a > >? ? ?possible workaround is to transfer all the matrix data to one MPI > >? ? ?process, and then upload the data to GPU to solve. In this > way, one can > >? ? ?use cusparse solver for a MPI program. > > > >? ? ?Chang > >? ? ?-- > >? ? ?Chang Liu > >? ? ?Staff Research Physicist > >? ? ?+1 609 243 3438 > > cliu at pppl.gov > > >? ? ?Princeton Plasma Physics Laboratory > >? ? 
?100 Stellarator Rd, Princeton NJ 08540, USA > > > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From mfadams at lbl.gov Tue Oct 12 13:06:52 2021 From: mfadams at lbl.gov (Mark Adams) Date: Tue, 12 Oct 2021 14:06:52 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> Message-ID: On Tue, Oct 12, 2021 at 1:45 PM Chang Liu wrote: > Hi Mark, > > The option I use is like > > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type > aijcusparse *-sub_pc_factor_mat_solver_type cusparse *-sub_ksp_type > preonly *-sub_pc_type lu* -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol > 1.e-300 > > Note, If you use -log_view the last column (rows are the method like MatFactorNumeric) has the percent of work in the GPU. Junchau: *This* implies that we have a cuSparse LU factorization. Is that correct? (I don't think we do) I think this one do both factorization and solve on gpu. > > You can check the runex72_aijcusparse.sh file in petsc install > directory, and try it your self (this is only lu factorization without > iterative solve). > > Chang > > On 10/12/21 1:17 PM, Mark Adams wrote: > > > > > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu > > wrote: > > > > Hi Junchao, > > > > No I only needs it to be transferred within a node. I use > block-Jacobi > > method and GMRES to solve the sparse matrix, so each direct solver > will > > take care of a sub-block of the whole matrix. In this way, I can use > > one > > GPU to solve one sub-block, which is stored within one node. > > > > It was stated in the documentation that cusparse solver is slow. > > However, in my test using ex72.c, the cusparse solver is faster than > > mumps or superlu_dist on CPUs. > > > > > > Are we talking about the factorization, the solve, or both? > > > > We do not have an interface to cuSparse's LU factorization (I just > > learned that it exists a few weeks ago). > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type > > aijcusparse' ? This would be the CPU factorization, which is the > > dominant cost. > > > > > > Chang > > > > On 10/12/21 10:24 AM, Junchao Zhang wrote: > > > Hi, Chang, > > > For the mumps solver, we usually transfers matrix and vector > > data > > > within a compute node. For the idea you propose, it looks like > > we need > > > to gather data within MPI_COMM_WORLD, right? > > > > > > Mark, I remember you said cusparse solve is slow and you would > > > rather do it on CPU. Is it right? > > > > > > --Junchao Zhang > > > > > > > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users > > > > > >> > > wrote: > > > > > > Hi, > > > > > > Currently, it is possible to use mumps solver in PETSC with > > > -mat_mumps_use_omp_threads option, so that multiple MPI > > processes will > > > transfer the matrix and rhs data to the master rank, and then > > master > > > rank will call mumps with OpenMP to solve the matrix. > > > > > > I wonder if someone can develop similar option for cusparse > > solver. > > > Right now, this solver does not work with mpiaijcusparse. 
I > > think a > > > possible workaround is to transfer all the matrix data to one > MPI > > > process, and then upload the data to GPU to solve. In this > > way, one can > > > use cusparse solver for a MPI program. > > > > > > Chang > > > -- > > > Chang Liu > > > Staff Research Physicist > > > +1 609 243 3438 > > > cliu at pppl.gov > > > > > Princeton Plasma Physics Laboratory > > > 100 Stellarator Rd, Princeton NJ 08540, USA > > > > > > > -- > > Chang Liu > > Staff Research Physicist > > +1 609 243 3438 > > cliu at pppl.gov > > Princeton Plasma Physics Laboratory > > 100 Stellarator Rd, Princeton NJ 08540, USA > > > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Oct 12 13:24:33 2021 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 12 Oct 2021 14:24:33 -0400 Subject: [petsc-users] On QN + Fieldsplit In-Reply-To: References: Message-ID: On Tue, Oct 12, 2021 at 10:27 AM Nicol?s Barnafi wrote: > Thank you for the support. I rewrote the initialization in a simpler way, > now it works as expected: > > > dofmap_s = V.sub(0).dofmap().dofs(); is_s = > PETSc.IS().createGeneral(dofmap_s) > > dofmap_p = V.sub(1).dofmap().dofs(); is_p = > PETSc.IS().createGeneral(dofmap_p) > > snes = PETSc.SNES().create(MPI.COMM_WORLD) > > snes.setFunction(problem.F, b.vec()); snes.setJacobian(problem.J, > J_mat.mat()) > > pc = snes.ksp.getPC() > > pc.setType('fieldsplit') > > pc.setFieldSplitIS((None, is_s), (None, is_p)) > > snes.setFromOptions() > > snes.solve(None, problem.u.vector().vec()) > > Apparently trying to setup the solver's internals is not recommended. As a > side note, I tried also setting up the KSP using 'SNESSetKSP', but this > solution is not so good as giving the command 'snes_ksp_ew' does nothing, > even though it gets correctly read as shown by snes.view(). > I think I can explain this. The Eisenstat-Walker scheme is a way to set tolerances for the linear solves inside a Newton iteration. The goal is to avoid over-solving the linear systems, meaning that far away from the solution accurate linear solves have no advantage over inaccurate ones. Implementing this involves coordination with the linear solver since we are setting the convergence tolerance. When you replace the linear solve, that setup is discarded. Thus, when configuring things, we recommend that you pull out the existing object SNESGetKSP() and customize it, rather than creating a new object and setting it. Thanks, Matt > Thanks for the help! > Best, > Nicolas > > On Tue, Oct 12, 2021 at 4:23 PM Matthew Knepley wrote: > >> I looked over every place we use that error code. I do not think it is >> coming from PETSc, but rather from petsc4py. However, something >> is eating the error message, and I think Stefano indicated. My first step >> would be to get the FEniCS folks to display the error message. >> >> Another option is to just run it in Firedrake since I think we can see >> the stack properly there. >> >> Thanks, >> >> Matt >> >> On Tue, Oct 12, 2021 at 8:37 AM Nicol?s Barnafi wrote: >> >>> Thank you Stefano for the help. 
I added the lines you indicated, but the >>> error remains the same, here goes snes.view() + error >>> >>> > SNES Object: 1 MPI processes >>> > type: qn >>> > SNES has not been set up so information may be incomplete >>> > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN >>> > Stored subspace size: 10 >>> > Using the single reduction variant. >>> > maximum iterations=10000, maximum function evaluations=30000 >>> > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 >>> > total number of function evaluations=0 >>> > norm schedule ALWAYS >>> > SNESLineSearch Object: 1 MPI processes >>> > type: basic >>> > maxstep=1.000000e+08, minlambda=1.000000e-12 >>> > tolerances: relative=1.000000e-08, absolute=1.000000e-15, >>> lambda=1.000000e-08 >>> > maximum iterations=1 >>> > Traceback (most recent call last): >>> > File "Twist.py", line 234, in >>> > snes.setUp() >>> > File "PETSc/SNES.pyx", line 530, in petsc4py.PETSc.SNES.setUp >>> > petsc4py.PETSc.Error: error code 83 >>> >>> On Tue, Oct 12, 2021 at 2:07 PM Stefano Zampini < >>> stefano.zampini at gmail.com> wrote: >>> >>>> >>>> >>>> Il giorno mar 12 ott 2021 alle ore 13:56 Nicol?s Barnafi < >>>> nabw91 at gmail.com> ha scritto: >>>> >>>>> Hello PETSc users, >>>>> >>>>> first email sent! >>>>> I am creating a SNES solver using fenics, my example runs smoothly >>>>> with 'newtonls', but gives a strange missing function error (error 83): >>>>> >>>>> >>>> Dolphin swallows any useful error information returned from PETSc. You >>>> can try using the below code snippet at the beginning of your script >>>> >>>> from petsc4py import PETSc >>>> from dolfin import * >>>> # Remove the dolfin error handler >>>> PETSc.Sys.pushErrorHandler('python') >>>> >>>> >>>> >>>>> >>>>> these are the relevant lines of code where I setup the solver: >>>>> >>>>> > problem = SNESProblem(Res, sol, bcs) >>>>> > b = PETScVector() # same as b = PETSc.Vec() >>>>> > J_mat = PETScMatrix() >>>>> > snes = PETSc.SNES().create(MPI.COMM_WORLD) >>>>> > snes.setFunction(problem.F, b.vec()) >>>>> > snes.setJacobian(problem.J, J_mat.mat()) >>>>> > # Set up fieldsplit >>>>> > ksp = snes.ksp >>>>> > ksp.setOperators(J_mat.mat()) >>>>> > pc = ksp.pc >>>>> > pc.setType('fieldsplit') >>>>> > dofmap_s = V.sub(0).dofmap().dofs() >>>>> > dofmap_p = V.sub(1).dofmap().dofs() >>>>> > is_s = PETSc.IS().createGeneral(dofmap_s) >>>>> > is_p = PETSc.IS().createGeneral(dofmap_p) >>>>> > pc.setFieldSplitIS((None, is_s), (None, is_p)) >>>>> > pc.setFromOptions() >>>>> > snes.setFromOptions() >>>>> > snes.setUp() >>>>> >>>>> >>>> If it can be useful, this are the outputs of snes.view(), ksp.view() >>>>> and pc.view(): >>>>> >>>>> > type: qn >>>>> > SNES has not been set up so information may be incomplete >>>>> > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN >>>>> > Stored subspace size: 10 >>>>> > Using the single reduction variant. 
>>>>> > maximum iterations=10000, maximum function evaluations=30000 >>>>> > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 >>>>> > total number of function evaluations=0 >>>>> > norm schedule ALWAYS >>>>> > SNESLineSearch Object: 4 MPI processes >>>>> > type: basic >>>>> > maxstep=1.000000e+08, minlambda=1.000000e-12 >>>>> > tolerances: relative=1.000000e-08, absolute=1.000000e-15, >>>>> lambda=1.000000e-08 >>>>> > maximum iterations=1 >>>>> > KSP Object: 4 MPI processes >>>>> > type: gmres >>>>> > restart=1000, using Modified Gram-Schmidt Orthogonalization >>>>> > happy breakdown tolerance 1e-30 >>>>> > maximum iterations=1000, initial guess is zero >>>>> > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>>>> > left preconditioning >>>>> > using UNPRECONDITIONED norm type for convergence test >>>>> > PC Object: 4 MPI processes >>>>> > type: fieldsplit >>>>> > PC has not been set up so information may be incomplete >>>>> > FieldSplit with Schur preconditioner, factorization FULL >>>>> >>>>> I know that PC is not setup, but if I do it before setting up the >>>>> SNES, the error persists. Thanks in advance for your help. >>>>> >>>>> Best, >>>>> Nicolas >>>>> -- >>>>> Nicol?s Alejandro Barnafi Wittwer >>>>> >>>> >>>> >>>> -- >>>> Stefano >>>> >>> >>> >>> -- >>> Nicol?s Alejandro Barnafi Wittwer >>> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> > > > -- > Nicol?s Alejandro Barnafi Wittwer > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Tue Oct 12 14:00:16 2021 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 12 Oct 2021 15:00:16 -0400 Subject: [petsc-users] On QN + Fieldsplit In-Reply-To: References: Message-ID: <8B085C99-087F-4A46-AA8F-5E0D47363D22@petsc.dev> SNESSetKSP() (and friends) was a mistake, it was originally introduced for symmetry reasons but doesn't seem to have a good reason to exist. Barry > On Oct 12, 2021, at 2:24 PM, Matthew Knepley wrote: > > On Tue, Oct 12, 2021 at 10:27 AM Nicol?s Barnafi > wrote: > Thank you for the support. I rewrote the initialization in a simpler way, now it works as expected: > > > dofmap_s = V.sub(0).dofmap().dofs(); is_s = PETSc.IS().createGeneral(dofmap_s) > > dofmap_p = V.sub(1).dofmap().dofs(); is_p = PETSc.IS().createGeneral(dofmap_p) > > snes = PETSc.SNES().create(MPI.COMM_WORLD) > > snes.setFunction(problem.F, b.vec()); snes.setJacobian(problem.J, J_mat.mat()) > > pc = snes.ksp.getPC() > > pc.setType('fieldsplit') > > pc.setFieldSplitIS((None, is_s), (None, is_p)) > > snes.setFromOptions() > > snes.solve(None, problem.u.vector().vec()) > > Apparently trying to setup the solver's internals is not recommended. As a side note, I tried also setting up the KSP using 'SNESSetKSP', but this solution is not so good as giving the command 'snes_ksp_ew' does nothing, even though it gets correctly read as shown by snes.view(). > > I think I can explain this. The Eisenstat-Walker scheme is a way to set tolerances for the linear solves inside a Newton iteration. 
The goal is > to avoid over-solving the linear systems, meaning that far away from the solution accurate linear solves have no advantage over inaccurate ones. > Implementing this involves coordination with the linear solver since we are setting the convergence tolerance. When you replace the linear solve, > that setup is discarded. Thus, when configuring things, we recommend that you pull out the existing object > > SNESGetKSP() > > and customize it, rather than creating a new object and setting it. > > Thanks, > > Matt > > Thanks for the help! > Best, > Nicolas > > On Tue, Oct 12, 2021 at 4:23 PM Matthew Knepley > wrote: > I looked over every place we use that error code. I do not think it is coming from PETSc, but rather from petsc4py. However, something > is eating the error message, and I think Stefano indicated. My first step would be to get the FEniCS folks to display the error message. > > Another option is to just run it in Firedrake since I think we can see the stack properly there. > > Thanks, > > Matt > > On Tue, Oct 12, 2021 at 8:37 AM Nicol?s Barnafi > wrote: > Thank you Stefano for the help. I added the lines you indicated, but the error remains the same, here goes snes.view() + error > > > SNES Object: 1 MPI processes > > type: qn > > SNES has not been set up so information may be incomplete > > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN > > Stored subspace size: 10 > > Using the single reduction variant. > > maximum iterations=10000, maximum function evaluations=30000 > > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 > > total number of function evaluations=0 > > norm schedule ALWAYS > > SNESLineSearch Object: 1 MPI processes > > type: basic > > maxstep=1.000000e+08, minlambda=1.000000e-12 > > tolerances: relative=1.000000e-08, absolute=1.000000e-15, lambda=1.000000e-08 > > maximum iterations=1 > > Traceback (most recent call last): > > File "Twist.py", line 234, in > > snes.setUp() > > File "PETSc/SNES.pyx", line 530, in petsc4py.PETSc.SNES.setUp > > petsc4py.PETSc.Error: error code 83 > > On Tue, Oct 12, 2021 at 2:07 PM Stefano Zampini > wrote: > > > Il giorno mar 12 ott 2021 alle ore 13:56 Nicol?s Barnafi > ha scritto: > Hello PETSc users, > > first email sent! > I am creating a SNES solver using fenics, my example runs smoothly with 'newtonls', but gives a strange missing function error (error 83): > > > Dolphin swallows any useful error information returned from PETSc. 
You can try using the below code snippet at the beginning of your script > > from petsc4py import PETSc > from dolfin import * > # Remove the dolfin error handler > PETSc.Sys.pushErrorHandler('python') > > > > these are the relevant lines of code where I setup the solver: > > > problem = SNESProblem(Res, sol, bcs) > > b = PETScVector() # same as b = PETSc.Vec() > > J_mat = PETScMatrix() > > snes = PETSc.SNES().create(MPI.COMM_WORLD) > > snes.setFunction(problem.F, b.vec()) > > snes.setJacobian(problem.J, J_mat.mat()) > > # Set up fieldsplit > > ksp = snes.ksp > > ksp.setOperators(J_mat.mat()) > > pc = ksp.pc > > pc.setType('fieldsplit') > > dofmap_s = V.sub(0).dofmap().dofs() > > dofmap_p = V.sub(1).dofmap().dofs() > > is_s = PETSc.IS().createGeneral(dofmap_s) > > is_p = PETSc.IS().createGeneral(dofmap_p) > > pc.setFieldSplitIS((None, is_s), (None, is_p)) > > pc.setFromOptions() > > snes.setFromOptions() > > snes.setUp() > > If it can be useful, this are the outputs of snes.view(), ksp.view() and pc.view(): > > > type: qn > > SNES has not been set up so information may be incomplete > > type is BROYDEN, restart type is DEFAULT, scale type is JACOBIAN > > Stored subspace size: 10 > > Using the single reduction variant. > > maximum iterations=10000, maximum function evaluations=30000 > > tolerances: relative=1e-08, absolute=1e-50, solution=1e-08 > > total number of function evaluations=0 > > norm schedule ALWAYS > > SNESLineSearch Object: 4 MPI processes > > type: basic > > maxstep=1.000000e+08, minlambda=1.000000e-12 > > tolerances: relative=1.000000e-08, absolute=1.000000e-15, lambda=1.000000e-08 > > maximum iterations=1 > > KSP Object: 4 MPI processes > > type: gmres > > restart=1000, using Modified Gram-Schmidt Orthogonalization > > happy breakdown tolerance 1e-30 > > maximum iterations=1000, initial guess is zero > > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > > left preconditioning > > using UNPRECONDITIONED norm type for convergence test > > PC Object: 4 MPI processes > > type: fieldsplit > > PC has not been set up so information may be incomplete > > FieldSplit with Schur preconditioner, factorization FULL > > I know that PC is not setup, but if I do it before setting up the SNES, the error persists. Thanks in advance for your help. > > Best, > Nicolas > -- > Nicol?s Alejandro Barnafi Wittwer > > > -- > Stefano > > > -- > Nicol?s Alejandro Barnafi Wittwer > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > Nicol?s Alejandro Barnafi Wittwer > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... 
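To make the recommendation in the thread above concrete: the pattern is to fetch and configure the KSP the SNES already owns (SNESGetKSP) rather than attach a new one with SNESSetKSP, so that options such as -snes_ksp_ew keep working. Below is a sketch of the corresponding C calls (the petsc4py calls used above map onto the same routines); the split names and index sets are placeholders, and the residual/Jacobian setup is omitted.

#include <petscsnes.h>

/* Sketch: configure fieldsplit and Eisenstat-Walker on the KSP owned by the SNES.
   is_s and is_p are assumed to be the two field index sets built by the application. */
static PetscErrorCode ConfigureSolver(SNES snes, IS is_s, IS is_p)
{
  KSP            ksp;
  PC             pc;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = SNESGetKSP(snes, &ksp);CHKERRQ(ierr);              /* reuse the SNES's own KSP        */
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCFIELDSPLIT);CHKERRQ(ierr);
  ierr = PCFieldSplitSetIS(pc, "s", is_s);CHKERRQ(ierr);    /* split names are placeholders    */
  ierr = PCFieldSplitSetIS(pc, "p", is_p);CHKERRQ(ierr);
  ierr = SNESKSPSetUseEW(snes, PETSC_TRUE);CHKERRQ(ierr);   /* same effect as -snes_ksp_ew     */
  ierr = SNESSetFromOptions(snes);CHKERRQ(ierr);            /* command-line options still win  */
  PetscFunctionReturn(0);
}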
URL: From roland.richter at ntnu.no Wed Oct 13 04:53:24 2021 From: roland.richter at ntnu.no (Roland Richter) Date: Wed, 13 Oct 2021 11:53:24 +0200 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> <1280E270-AAA1-44CE-AE2A-93A2B70462E2@gmail.com> <7935b6e2-4319-6444-c0f9-a3ba34e8694a@ntnu.no> <383F2AEB-4CEB-4407-A1B3-E294ACDFD91C@gmail.com> Message-ID: Hei, I noticed a difference in when the program is running, and when not. The code works fine if I compile it via a CMake-file and load PETSc there. If I use the compilation line which is included in the Makefiles, then the code will fail with the mentioned error. The cmake-generated compilation line (including armadillo, because my test sample contained armadillo-code) is //opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -D__INSDIR__="" -I/include -I/opt/petsc/include -I/opt/armadillo/include -std=c++0x -g -MD -MT CMakeFiles/main.dir/source/main.cpp.o -MF CMakeFiles/main.dir/source/main.cpp.o.d -o CMakeFiles/main.dir/source/main.cpp.o -c source/main.cpp// ///opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -rdynamic CMakeFiles/main.dir/source/main.cpp.o -o main_short? -Wl,-rpath,/opt/petsc/lib:/opt/armadillo/lib64 /opt/petsc/lib/libpetsc.so /opt/armadillo/lib64/libarmadillo.so / Meanwhile, the original compilation line from PETSc is /mpicxx -mavx2 -march=native -O3 -fPIC -fopenmp??? -I/opt/petsc/include -I/opt/armadillo/include -I/opt/intel/oneapi/mkl/latest/include -I/opt/fftw3/include -I/opt/hdf5/include -I/opt/boost/include source/main.cpp -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib -L/opt/intel/oneapi/mkl/latest/lib/intel64 -Wl,-rpath,/opt/fftw3/lib64 -L/opt/fftw3/lib64 -Wl,-rpath,/opt/armadillo/lib64 -L/opt/armadillo/lib64 -Wl,-rpath,/opt/intel/oneapi/mkl/latest/lib/intel64 -Wl,-rpath,/opt/hdf5/lib -L/opt/hdf5/lib -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib/release -L/opt/intel/oneapi/mpi/2021.4.0/lib/release -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib -L/opt/intel/oneapi/mpi/2021.4.0/lib -Wl,-rpath,/usr/lib64/gcc/x86_64-suse-linux/11 -L/usr/lib64/gcc/x86_64-suse-linux/11 -Wl,-rpath,/opt/intel/oneapi/vpl/2021.6.0/lib -L/opt/intel/oneapi/vpl/2021.6.0/lib -Wl,-rpath,/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 -L/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib -L/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib -Wl,-rpath,/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 -L/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 -Wl,-rpath,/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 -L/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 -Wl,-rpath,/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 -L/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 -Wl,-rpath,/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib -L/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib -Wl,-rpath,/opt/intel/oneapi/dal/2021.4.0/lib/intel64 -L/opt/intel/oneapi/dal/2021.4.0/lib/intel64 -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin -L/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/lib -L/opt/intel/oneapi/compiler/2021.4.0/linux/lib -Wl,-rpath,/opt/intel/oneapi/clck/2021.4.0/lib/intel64 -L/opt/intel/oneapi/clck/2021.4.0/lib/intel64 -Wl,-rpath,/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp 
-L/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp -Wl,-rpath,/usr/x86_64-suse-linux/lib -L/usr/x86_64-suse-linux/lib -larmadillo -lpetsc -lHYPRE -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lspqr -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig -lsuperlu -lsuperlu_dist -lEl -lElSuiteSparse -lpmrrr -lfftw3_mpi -lfftw3 -lp4est -lsc -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -ldl -lpthread -lptesmumps -lptscotchparmetis -lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lhdf5_hl -lhdf5 -lparmetis -lmetis -lm -lz -lmuparser -lX11 -lstdc++ -ldl -lmpifort -lmpi -lrt -lpthread -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lrt -lquadmath -lstdc++ -ldl -o main_long/ Both executables have the same libraries linked to them, but in a different order when comparing them with ldd. Does that explain the observed behavior? Thanks, regards, Roland Am 11.10.21 um 15:13 schrieb Roland Richter: > > Hei, > > the following code works fine: > > #include > #include > > static char help[] = "Solves 2D Poisson equation using multigrid.\n\n"; > int main(int argc,char **argv) { > ??? PetscInitialize(&argc,&argv,(char*)0,help); > ??? std::cout << "Hello World\n"; > ??? PetscFinalize(); > ??? return 0; > } > > Regards, > > Roland > > Am 11.10.21 um 14:34 schrieb Stefano Zampini: >> Can you try with a simple call that only calls PetscInitialize/Finalize? >> >> >>> On Oct 11, 2021, at 3:30 PM, Roland Richter >>> wrote: >>> >>> At least according to configure.log mpiexec was defined as >>> >>> Checking for program /opt/intel/oneapi/mpi/2021.4.0//bin/mpiexec...found >>> ????????????????? Defined make macro "MPIEXECEXECUTABLE" to >>> "/opt/intel/oneapi/mpi/2021.4.0/bin/mpiexec" >>> >>> When running ex19 with this mpiexec it fails with the usual error, >>> even though all configuration steps worked fine. I attached the >>> configuration log. >>> >>> Regards, >>> >>> Roland >>> >>> Am 11.10.21 um 14:24 schrieb Stefano Zampini: >>>> You are most probably using a different mpiexec then the one used >>>> to compile petsc. >>>> >>>> >>>> >>>>> On Oct 11, 2021, at 3:23 PM, Roland Richter >>>>> wrote: >>>>> >>>>> I tried either /./ex19/ (SNES-example), /mpirun ./ex19/ or /mpirun >>>>> -n 1 ./ex19/, all with the same result. >>>>> >>>>> Regards, >>>>> >>>>> Roland >>>>> >>>>> Am 11.10.21 um 14:22 schrieb Matthew Knepley: >>>>>> On Mon, Oct 11, 2021 at 8:07 AM Roland Richter >>>>>> wrote: >>>>>> >>>>>> Hei, >>>>>> >>>>>> at least in gdb it fails with >>>>>> >>>>>> Attempting to use an MPI routine before initializing MPICH >>>>>> [Inferior 1 (process 7854) exited with code 01] >>>>>> (gdb) backtrace >>>>>> No stack. >>>>>> >>>>>> >>>>>> What were you running? If it never makes it into PETSc code, I am >>>>>> not sure what we are >>>>>> doing to cause this. >>>>>> >>>>>> ? Thanks, >>>>>> >>>>>> ? ? ?Matt >>>>>> ? >>>>>> >>>>>> Regards, >>>>>> >>>>>> Roland >>>>>> >>>>>> Am 11.10.21 um 13:57 schrieb Matthew Knepley: >>>>>>> On Mon, Oct 11, 2021 at 5:24 AM Roland Richter >>>>>>> wrote: >>>>>>> >>>>>>> Hei, >>>>>>> >>>>>>> I compiled PETSc with Intel MPI (MPICH) and GCC as >>>>>>> compiler (i.e. using >>>>>>> Intel OneAPI together with the supplied >>>>>>> mpicxx-compiler). Compilation >>>>>>> and installation worked fine, but running the tests >>>>>>> resulted in the >>>>>>> error "Attempting to use an MPI routine before >>>>>>> initializing MPICH". 
A >>>>>>> simple test program (attached) worked fine with the same >>>>>>> combination. >>>>>>> >>>>>>> What could be the reason for that? >>>>>>> >>>>>>> >>>>>>> Hi Roland, >>>>>>> >>>>>>> Can you get a stack trace for this error using the debugger? >>>>>>> >>>>>>> ? Thanks, >>>>>>> >>>>>>> ? ? ?Matt >>>>>>> ? >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> Roland Richter >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> What most experimenters take for granted before they begin >>>>>>> their experiments is infinitely more interesting than any >>>>>>> results to which their experiments lead. >>>>>>> -- Norbert Wiener >>>>>>> >>>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> What most experimenters take for granted before they begin their >>>>>> experiments is infinitely more interesting than any results to >>>>>> which their experiments lead. >>>>>> -- Norbert Wiener >>>>>> >>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>> >>>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Wed Oct 13 05:26:31 2021 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 13 Oct 2021 06:26:31 -0400 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> <1280E270-AAA1-44CE-AE2A-93A2B70462E2@gmail.com> <7935b6e2-4319-6444-c0f9-a3ba34e8694a@ntnu.no> <383F2AEB-4CEB-4407-A1B3-E294ACDFD91C@gmail.com> Message-ID: On Wed, Oct 13, 2021 at 5:53 AM Roland Richter wrote: > Hei, > > I noticed a difference in when the program is running, and when not. The > code works fine if I compile it via a CMake-file and load PETSc there. If I > use the compilation line which is included in the Makefiles, then the code > will fail with the mentioned error. The cmake-generated compilation line > (including armadillo, because my test sample contained armadillo-code) is > > One of these is a compile command and the other is a link command. 
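For illustration, a minimal sketch of the same build split into an explicit compile step and an explicit link step (flags and paths taken from the commands quoted in this thread; the output names are placeholders, not from the original mails):

    # compile only: produces an object file, no libraries are resolved yet
    /opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -std=c++0x -g -I/opt/petsc/include -c source/main.cpp -o main.o
    # link only: this is the step where libpetsc.so and the MPI runtime are actually chosen
    /opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -rdynamic main.o -o main_split -Wl,-rpath,/opt/petsc/lib /opt/petsc/lib/libpetsc.so

The PETSc makefile command quoted below performs both steps in a single invocation, which is why the library order it passes to the linker ends up baked into the resulting executable.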
Matt > */opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -D__INSDIR__="" -I/include > -I/opt/petsc/include -I/opt/armadillo/include -std=c++0x -g -MD -MT > CMakeFiles/main.dir/source/main.cpp.o -MF > CMakeFiles/main.dir/source/main.cpp.o.d -o > CMakeFiles/main.dir/source/main.cpp.o -c source/main.cpp* > */opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -rdynamic > CMakeFiles/main.dir/source/main.cpp.o -o main_short > -Wl,-rpath,/opt/petsc/lib:/opt/armadillo/lib64 /opt/petsc/lib/libpetsc.so > /opt/armadillo/lib64/libarmadillo.so * > > Meanwhile, the original compilation line from PETSc is > > *mpicxx -mavx2 -march=native -O3 -fPIC -fopenmp -I/opt/petsc/include > -I/opt/armadillo/include -I/opt/intel/oneapi/mkl/latest/include > -I/opt/fftw3/include -I/opt/hdf5/include -I/opt/boost/include > source/main.cpp -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib > -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib > -L/opt/intel/oneapi/mkl/latest/lib/intel64 -Wl,-rpath,/opt/fftw3/lib64 > -L/opt/fftw3/lib64 -Wl,-rpath,/opt/armadillo/lib64 -L/opt/armadillo/lib64 > -Wl,-rpath,/opt/intel/oneapi/mkl/latest/lib/intel64 > -Wl,-rpath,/opt/hdf5/lib -L/opt/hdf5/lib > -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib/release > -L/opt/intel/oneapi/mpi/2021.4.0/lib/release > -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib > -L/opt/intel/oneapi/mpi/2021.4.0/lib > -Wl,-rpath,/usr/lib64/gcc/x86_64-suse-linux/11 > -L/usr/lib64/gcc/x86_64-suse-linux/11 > -Wl,-rpath,/opt/intel/oneapi/vpl/2021.6.0/lib > -L/opt/intel/oneapi/vpl/2021.6.0/lib > -Wl,-rpath,/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 > -L/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 > -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib > -L/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib > -Wl,-rpath,/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 > -L/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 > -Wl,-rpath,/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 > -L/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 > -Wl,-rpath,/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 > -L/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 > -Wl,-rpath,/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib > -L/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib > -Wl,-rpath,/opt/intel/oneapi/dal/2021.4.0/lib/intel64 > -L/opt/intel/oneapi/dal/2021.4.0/lib/intel64 > -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin > -L/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin > -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/lib > -L/opt/intel/oneapi/compiler/2021.4.0/linux/lib > -Wl,-rpath,/opt/intel/oneapi/clck/2021.4.0/lib/intel64 > -L/opt/intel/oneapi/clck/2021.4.0/lib/intel64 > -Wl,-rpath,/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp > -L/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp > -Wl,-rpath,/usr/x86_64-suse-linux/lib -L/usr/x86_64-suse-linux/lib > -larmadillo -lpetsc -lHYPRE -lcmumps -ldmumps -lsmumps -lzmumps > -lmumps_common -lpord -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lspqr > -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd > -lsuitesparseconfig -lsuperlu -lsuperlu_dist -lEl -lElSuiteSparse -lpmrrr > -lfftw3_mpi -lfftw3 -lp4est -lsc -lmkl_intel_lp64 -lmkl_core > -lmkl_intel_thread -liomp5 -ldl -lpthread -lptesmumps -lptscotchparmetis > -lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lhdf5_hl -lhdf5 > -lparmetis -lmetis -lm -lz -lmuparser -lX11 -lstdc++ -ldl -lmpifort -lmpi > -lrt -lpthread -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lrt > -lquadmath -lstdc++ -ldl -o main_long* > > Both executables have the same libraries 
linked to them, but in a > different order when comparing them with ldd. > > Does that explain the observed behavior? > > Thanks, > > regards, > > Roland > Am 11.10.21 um 15:13 schrieb Roland Richter: > > Hei, > > the following code works fine: > > #include > #include > > static char help[] = "Solves 2D Poisson equation using multigrid.\n\n"; > int main(int argc,char **argv) { > PetscInitialize(&argc,&argv,(char*)0,help); > std::cout << "Hello World\n"; > PetscFinalize(); > return 0; > } > > Regards, > > Roland > Am 11.10.21 um 14:34 schrieb Stefano Zampini: > > Can you try with a simple call that only calls PetscInitialize/Finalize? > > > On Oct 11, 2021, at 3:30 PM, Roland Richter > wrote: > > At least according to configure.log mpiexec was defined as > > Checking for program /opt/intel/oneapi/mpi/2021.4.0//bin/mpiexec...found > Defined make macro "MPIEXECEXECUTABLE" to > "/opt/intel/oneapi/mpi/2021.4.0/bin/mpiexec" > > When running ex19 with this mpiexec it fails with the usual error, even > though all configuration steps worked fine. I attached the configuration > log. > > Regards, > > Roland > Am 11.10.21 um 14:24 schrieb Stefano Zampini: > > You are most probably using a different mpiexec then the one used to > compile petsc. > > > > On Oct 11, 2021, at 3:23 PM, Roland Richter > wrote: > > I tried either *./ex19* (SNES-example), *mpirun ./ex19* or *mpirun -n 1 > ./ex19*, all with the same result. > > Regards, > > Roland > Am 11.10.21 um 14:22 schrieb Matthew Knepley: > > On Mon, Oct 11, 2021 at 8:07 AM Roland Richter > wrote: > >> Hei, >> >> at least in gdb it fails with >> >> Attempting to use an MPI routine before initializing MPICH >> [Inferior 1 (process 7854) exited with code 01] >> (gdb) backtrace >> No stack. >> > > What were you running? If it never makes it into PETSc code, I am not sure > what we are > doing to cause this. > > Thanks, > > Matt > > >> Regards, >> >> Roland >> Am 11.10.21 um 13:57 schrieb Matthew Knepley: >> >> On Mon, Oct 11, 2021 at 5:24 AM Roland Richter >> wrote: >> >>> Hei, >>> >>> I compiled PETSc with Intel MPI (MPICH) and GCC as compiler (i.e. using >>> Intel OneAPI together with the supplied mpicxx-compiler). Compilation >>> and installation worked fine, but running the tests resulted in the >>> error "Attempting to use an MPI routine before initializing MPICH". A >>> simple test program (attached) worked fine with the same combination. >>> >>> What could be the reason for that? >>> >> >> Hi Roland, >> >> Can you get a stack trace for this error using the debugger? >> >> Thanks, >> >> Matt >> >> >>> Thanks! >>> >>> Regards, >>> >>> Roland Richter >>> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... 
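Since the working hypothesis in this thread is that the executable picks up a different MPI than the one PETSc was built against, a quick check (a sketch; the executable names and paths are the ones quoted above) is to compare which MPI shared objects each binary and libpetsc.so itself resolve at run time:

    ldd ./main_short | grep -i mpi
    ldd ./main_long  | grep -i mpi
    ldd /opt/petsc/lib/libpetsc.so | grep -i mpi

If these do not all point into /opt/intel/oneapi/mpi/2021.4.0, for example if one of them resolves a system MPICH instead, the "Attempting to use an MPI routine before initializing MPICH" failure is consistent with that mismatch.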
URL: From roland.richter at ntnu.no Wed Oct 13 05:32:44 2021 From: roland.richter at ntnu.no (Roland Richter) Date: Wed, 13 Oct 2021 12:32:44 +0200 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> <1280E270-AAA1-44CE-AE2A-93A2B70462E2@gmail.com> <7935b6e2-4319-6444-c0f9-a3ba34e8694a@ntnu.no> <383F2AEB-4CEB-4407-A1B3-E294ACDFD91C@gmail.com> Message-ID: Yes, the first part (which works) consists out of a compilation line and a linking line, while the second command is a combination of compile- and linking line. Am 13.10.21 um 12:26 schrieb Matthew Knepley: > On Wed, Oct 13, 2021 at 5:53 AM Roland Richter > wrote: > > Hei, > > I noticed a difference in when the program is running, and when > not. The code works fine if I compile it via a CMake-file and load > PETSc there. If I use the compilation line which is included in > the Makefiles, then the code will fail with the mentioned error. > The cmake-generated compilation line (including armadillo, because > my test sample contained armadillo-code) is > > One of these is a compile command and the other is a link command. > > ? ?Matt > > //opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -D__INSDIR__="" > -I/include -I/opt/petsc/include -I/opt/armadillo/include > -std=c++0x -g -MD -MT CMakeFiles/main.dir/source/main.cpp.o -MF > CMakeFiles/main.dir/source/main.cpp.o.d -o > CMakeFiles/main.dir/source/main.cpp.o -c source/main.cpp// > ///opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -rdynamic > CMakeFiles/main.dir/source/main.cpp.o -o main_short? > -Wl,-rpath,/opt/petsc/lib:/opt/armadillo/lib64 > /opt/petsc/lib/libpetsc.so /opt/armadillo/lib64/libarmadillo.so / > > Meanwhile, the original compilation line from PETSc is > > /mpicxx -mavx2 -march=native -O3 -fPIC -fopenmp??? 
> -I/opt/petsc/include -I/opt/armadillo/include > -I/opt/intel/oneapi/mkl/latest/include -I/opt/fftw3/include > -I/opt/hdf5/include -I/opt/boost/include source/main.cpp > -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib > -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib > -L/opt/intel/oneapi/mkl/latest/lib/intel64 > -Wl,-rpath,/opt/fftw3/lib64 -L/opt/fftw3/lib64 > -Wl,-rpath,/opt/armadillo/lib64 -L/opt/armadillo/lib64 > -Wl,-rpath,/opt/intel/oneapi/mkl/latest/lib/intel64 > -Wl,-rpath,/opt/hdf5/lib -L/opt/hdf5/lib > -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib/release > -L/opt/intel/oneapi/mpi/2021.4.0/lib/release > -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib > -L/opt/intel/oneapi/mpi/2021.4.0/lib > -Wl,-rpath,/usr/lib64/gcc/x86_64-suse-linux/11 > -L/usr/lib64/gcc/x86_64-suse-linux/11 > -Wl,-rpath,/opt/intel/oneapi/vpl/2021.6.0/lib > -L/opt/intel/oneapi/vpl/2021.6.0/lib > -Wl,-rpath,/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 > -L/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 > -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib > -L/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib > -Wl,-rpath,/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 > -L/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 > -Wl,-rpath,/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 > -L/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 > -Wl,-rpath,/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 > -L/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 > -Wl,-rpath,/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib > -L/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib > -Wl,-rpath,/opt/intel/oneapi/dal/2021.4.0/lib/intel64 > -L/opt/intel/oneapi/dal/2021.4.0/lib/intel64 > -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin > -L/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin > -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/lib > -L/opt/intel/oneapi/compiler/2021.4.0/linux/lib > -Wl,-rpath,/opt/intel/oneapi/clck/2021.4.0/lib/intel64 > -L/opt/intel/oneapi/clck/2021.4.0/lib/intel64 > -Wl,-rpath,/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp > -L/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp > -Wl,-rpath,/usr/x86_64-suse-linux/lib -L/usr/x86_64-suse-linux/lib > -larmadillo -lpetsc -lHYPRE -lcmumps -ldmumps -lsmumps -lzmumps > -lmumps_common -lpord -lmkl_scalapack_lp64 > -lmkl_blacs_intelmpi_lp64 -lspqr -lumfpack -lklu -lcholmod -lbtf > -lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig -lsuperlu > -lsuperlu_dist -lEl -lElSuiteSparse -lpmrrr -lfftw3_mpi -lfftw3 > -lp4est -lsc -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread > -liomp5 -ldl -lpthread -lptesmumps -lptscotchparmetis -lptscotch > -lptscotcherr -lesmumps -lscotch -lscotcherr -lhdf5_hl -lhdf5 > -lparmetis -lmetis -lm -lz -lmuparser -lX11 -lstdc++ -ldl > -lmpifort -lmpi -lrt -lpthread -lgfortran -lm -lgfortran -lm > -lgcc_s -lquadmath -lrt -lquadmath -lstdc++ -ldl -o main_long/ > > Both executables have the same libraries linked to them, but in a > different order when comparing them with ldd. > > Does that explain the observed behavior? > > Thanks, > > regards, > > Roland > > Am 11.10.21 um 15:13 schrieb Roland Richter: >> >> Hei, >> >> the following code works fine: >> >> #include >> #include >> >> static char help[] = "Solves 2D Poisson equation using >> multigrid.\n\n"; >> int main(int argc,char **argv) { >> ??? PetscInitialize(&argc,&argv,(char*)0,help); >> ??? std::cout << "Hello World\n"; >> ??? PetscFinalize(); >> ??? 
return 0; >> } >> >> Regards, >> >> Roland >> >> Am 11.10.21 um 14:34 schrieb Stefano Zampini: >>> Can you try with a simple call that only calls >>> PetscInitialize/Finalize? >>> >>> >>>> On Oct 11, 2021, at 3:30 PM, Roland Richter >>>> wrote: >>>> >>>> At least according to configure.log mpiexec was defined as >>>> >>>> Checking for program >>>> /opt/intel/oneapi/mpi/2021.4.0//bin/mpiexec...found >>>> ????????????????? Defined make macro "MPIEXECEXECUTABLE" to >>>> "/opt/intel/oneapi/mpi/2021.4.0/bin/mpiexec" >>>> >>>> When running ex19 with this mpiexec it fails with the usual >>>> error, even though all configuration steps worked fine. I >>>> attached the configuration log. >>>> >>>> Regards, >>>> >>>> Roland >>>> >>>> Am 11.10.21 um 14:24 schrieb Stefano Zampini: >>>>> You are most probably using a different mpiexec then the one >>>>> used to compile petsc. >>>>> >>>>> >>>>> >>>>>> On Oct 11, 2021, at 3:23 PM, Roland Richter >>>>>> wrote: >>>>>> >>>>>> I tried either /./ex19/ (SNES-example), /mpirun ./ex19/ or >>>>>> /mpirun -n 1 ./ex19/, all with the same result. >>>>>> >>>>>> Regards, >>>>>> >>>>>> Roland >>>>>> >>>>>> Am 11.10.21 um 14:22 schrieb Matthew Knepley: >>>>>>> On Mon, Oct 11, 2021 at 8:07 AM Roland Richter >>>>>>> wrote: >>>>>>> >>>>>>> Hei, >>>>>>> >>>>>>> at least in gdb it fails with >>>>>>> >>>>>>> Attempting to use an MPI routine before initializing MPICH >>>>>>> [Inferior 1 (process 7854) exited with code 01] >>>>>>> (gdb) backtrace >>>>>>> No stack. >>>>>>> >>>>>>> >>>>>>> What were you running? If it never makes it into PETSc code, >>>>>>> I am not sure what we are >>>>>>> doing to cause this. >>>>>>> >>>>>>> ? Thanks, >>>>>>> >>>>>>> ? ? ?Matt >>>>>>> ? >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> Roland >>>>>>> >>>>>>> Am 11.10.21 um 13:57 schrieb Matthew Knepley: >>>>>>>> On Mon, Oct 11, 2021 at 5:24 AM Roland Richter >>>>>>>> wrote: >>>>>>>> >>>>>>>> Hei, >>>>>>>> >>>>>>>> I compiled PETSc with Intel MPI (MPICH) and GCC as >>>>>>>> compiler (i.e. using >>>>>>>> Intel OneAPI together with the supplied >>>>>>>> mpicxx-compiler). Compilation >>>>>>>> and installation worked fine, but running the tests >>>>>>>> resulted in the >>>>>>>> error "Attempting to use an MPI routine before >>>>>>>> initializing MPICH". A >>>>>>>> simple test program (attached) worked fine with the >>>>>>>> same combination. >>>>>>>> >>>>>>>> What could be the reason for that? >>>>>>>> >>>>>>>> >>>>>>>> Hi Roland, >>>>>>>> >>>>>>>> Can you get a stack trace for this error using the >>>>>>>> debugger? >>>>>>>> >>>>>>>> ? Thanks, >>>>>>>> >>>>>>>> ? ? ?Matt >>>>>>>> ? >>>>>>>> >>>>>>>> Thanks! >>>>>>>> >>>>>>>> Regards, >>>>>>>> >>>>>>>> Roland Richter >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> What most experimenters take for granted before they >>>>>>>> begin their experiments is infinitely more interesting >>>>>>>> than any results to which their experiments lead. >>>>>>>> -- Norbert Wiener >>>>>>>> >>>>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> What most experimenters take for granted before they begin >>>>>>> their experiments is infinitely more interesting than any >>>>>>> results to which their experiments lead. >>>>>>> -- Norbert Wiener >>>>>>> >>>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>>> >>>>> >>>> >>> > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. 
> -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Wed Oct 13 05:36:01 2021 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 13 Oct 2021 06:36:01 -0400 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> <1280E270-AAA1-44CE-AE2A-93A2B70462E2@gmail.com> <7935b6e2-4319-6444-c0f9-a3ba34e8694a@ntnu.no> <383F2AEB-4CEB-4407-A1B3-E294ACDFD91C@gmail.com> Message-ID: On Wed, Oct 13, 2021 at 6:32 AM Roland Richter wrote: > Yes, the first part (which works) consists out of a compilation line and a > linking line, while the second command is a combination of compile- and > linking line. > The link line in the first does not tell us anything because MPI is not even present. It is being pulled in I presume from libarmadillo, which we cannot see. It still seems most likely, as Stefano said, that you are mixing versions of MPI. Thanks, Matt > Am 13.10.21 um 12:26 schrieb Matthew Knepley: > > On Wed, Oct 13, 2021 at 5:53 AM Roland Richter > wrote: > >> Hei, >> >> I noticed a difference in when the program is running, and when not. The >> code works fine if I compile it via a CMake-file and load PETSc there. If I >> use the compilation line which is included in the Makefiles, then the code >> will fail with the mentioned error. The cmake-generated compilation line >> (including armadillo, because my test sample contained armadillo-code) is >> > One of these is a compile command and the other is a link command. > > Matt > >> */opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -D__INSDIR__="" -I/include >> -I/opt/petsc/include -I/opt/armadillo/include -std=c++0x -g -MD -MT >> CMakeFiles/main.dir/source/main.cpp.o -MF >> CMakeFiles/main.dir/source/main.cpp.o.d -o >> CMakeFiles/main.dir/source/main.cpp.o -c source/main.cpp* >> */opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -rdynamic >> CMakeFiles/main.dir/source/main.cpp.o -o main_short >> -Wl,-rpath,/opt/petsc/lib:/opt/armadillo/lib64 /opt/petsc/lib/libpetsc.so >> /opt/armadillo/lib64/libarmadillo.so * >> >> Meanwhile, the original compilation line from PETSc is >> >> *mpicxx -mavx2 -march=native -O3 -fPIC -fopenmp -I/opt/petsc/include >> -I/opt/armadillo/include -I/opt/intel/oneapi/mkl/latest/include >> -I/opt/fftw3/include -I/opt/hdf5/include -I/opt/boost/include >> source/main.cpp -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib >> -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib >> -L/opt/intel/oneapi/mkl/latest/lib/intel64 -Wl,-rpath,/opt/fftw3/lib64 >> -L/opt/fftw3/lib64 -Wl,-rpath,/opt/armadillo/lib64 -L/opt/armadillo/lib64 >> -Wl,-rpath,/opt/intel/oneapi/mkl/latest/lib/intel64 >> -Wl,-rpath,/opt/hdf5/lib -L/opt/hdf5/lib >> -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib/release >> -L/opt/intel/oneapi/mpi/2021.4.0/lib/release >> -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib >> -L/opt/intel/oneapi/mpi/2021.4.0/lib >> -Wl,-rpath,/usr/lib64/gcc/x86_64-suse-linux/11 >> -L/usr/lib64/gcc/x86_64-suse-linux/11 >> -Wl,-rpath,/opt/intel/oneapi/vpl/2021.6.0/lib >> -L/opt/intel/oneapi/vpl/2021.6.0/lib >> -Wl,-rpath,/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 >> -L/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 >> -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib >> -L/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib >> 
-Wl,-rpath,/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 >> -L/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 >> -Wl,-rpath,/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 >> -L/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 >> -Wl,-rpath,/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 >> -L/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 >> -Wl,-rpath,/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib >> -L/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib >> -Wl,-rpath,/opt/intel/oneapi/dal/2021.4.0/lib/intel64 >> -L/opt/intel/oneapi/dal/2021.4.0/lib/intel64 >> -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin >> -L/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin >> -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/lib >> -L/opt/intel/oneapi/compiler/2021.4.0/linux/lib >> -Wl,-rpath,/opt/intel/oneapi/clck/2021.4.0/lib/intel64 >> -L/opt/intel/oneapi/clck/2021.4.0/lib/intel64 >> -Wl,-rpath,/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp >> -L/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp >> -Wl,-rpath,/usr/x86_64-suse-linux/lib -L/usr/x86_64-suse-linux/lib >> -larmadillo -lpetsc -lHYPRE -lcmumps -ldmumps -lsmumps -lzmumps >> -lmumps_common -lpord -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lspqr >> -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd >> -lsuitesparseconfig -lsuperlu -lsuperlu_dist -lEl -lElSuiteSparse -lpmrrr >> -lfftw3_mpi -lfftw3 -lp4est -lsc -lmkl_intel_lp64 -lmkl_core >> -lmkl_intel_thread -liomp5 -ldl -lpthread -lptesmumps -lptscotchparmetis >> -lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lhdf5_hl -lhdf5 >> -lparmetis -lmetis -lm -lz -lmuparser -lX11 -lstdc++ -ldl -lmpifort -lmpi >> -lrt -lpthread -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lrt >> -lquadmath -lstdc++ -ldl -o main_long* >> >> Both executables have the same libraries linked to them, but in a >> different order when comparing them with ldd. >> >> Does that explain the observed behavior? >> >> Thanks, >> >> regards, >> >> Roland >> Am 11.10.21 um 15:13 schrieb Roland Richter: >> >> Hei, >> >> the following code works fine: >> >> #include >> #include >> >> static char help[] = "Solves 2D Poisson equation using multigrid.\n\n"; >> int main(int argc,char **argv) { >> PetscInitialize(&argc,&argv,(char*)0,help); >> std::cout << "Hello World\n"; >> PetscFinalize(); >> return 0; >> } >> >> Regards, >> >> Roland >> Am 11.10.21 um 14:34 schrieb Stefano Zampini: >> >> Can you try with a simple call that only calls PetscInitialize/Finalize? >> >> >> On Oct 11, 2021, at 3:30 PM, Roland Richter >> wrote: >> >> At least according to configure.log mpiexec was defined as >> >> Checking for program /opt/intel/oneapi/mpi/2021.4.0//bin/mpiexec...found >> Defined make macro "MPIEXECEXECUTABLE" to >> "/opt/intel/oneapi/mpi/2021.4.0/bin/mpiexec" >> >> When running ex19 with this mpiexec it fails with the usual error, even >> though all configuration steps worked fine. I attached the configuration >> log. >> >> Regards, >> >> Roland >> Am 11.10.21 um 14:24 schrieb Stefano Zampini: >> >> You are most probably using a different mpiexec then the one used to >> compile petsc. >> >> >> >> On Oct 11, 2021, at 3:23 PM, Roland Richter >> wrote: >> >> I tried either *./ex19* (SNES-example), *mpirun ./ex19* or *mpirun -n 1 >> ./ex19*, all with the same result. 
>> >> Regards, >> >> Roland >> Am 11.10.21 um 14:22 schrieb Matthew Knepley: >> >> On Mon, Oct 11, 2021 at 8:07 AM Roland Richter >> wrote: >> >>> Hei, >>> >>> at least in gdb it fails with >>> >>> Attempting to use an MPI routine before initializing MPICH >>> [Inferior 1 (process 7854) exited with code 01] >>> (gdb) backtrace >>> No stack. >>> >> >> What were you running? If it never makes it into PETSc code, I am not >> sure what we are >> doing to cause this. >> >> Thanks, >> >> Matt >> >> >>> Regards, >>> >>> Roland >>> Am 11.10.21 um 13:57 schrieb Matthew Knepley: >>> >>> On Mon, Oct 11, 2021 at 5:24 AM Roland Richter >>> wrote: >>> >>>> Hei, >>>> >>>> I compiled PETSc with Intel MPI (MPICH) and GCC as compiler (i.e. using >>>> Intel OneAPI together with the supplied mpicxx-compiler). Compilation >>>> and installation worked fine, but running the tests resulted in the >>>> error "Attempting to use an MPI routine before initializing MPICH". A >>>> simple test program (attached) worked fine with the same combination. >>>> >>>> What could be the reason for that? >>>> >>> >>> Hi Roland, >>> >>> Can you get a stack trace for this error using the debugger? >>> >>> Thanks, >>> >>> Matt >>> >>> >>>> Thanks! >>>> >>>> Regards, >>>> >>>> Roland Richter >>>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >>> >>> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> >> >> >> >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland.richter at ntnu.no Wed Oct 13 05:43:48 2021 From: roland.richter at ntnu.no (Roland Richter) Date: Wed, 13 Oct 2021 12:43:48 +0200 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> <1280E270-AAA1-44CE-AE2A-93A2B70462E2@gmail.com> <7935b6e2-4319-6444-c0f9-a3ba34e8694a@ntnu.no> <383F2AEB-4CEB-4407-A1B3-E294ACDFD91C@gmail.com> Message-ID: <79bf023a-566a-d4fd-6ca1-bbfb28c543c3@ntnu.no> Shouldn't I see a mixture of MPI-libraries when calling ldd if I mix versions of MPI? I also removed all calls to armadillo, and accordingly all references to it in the compilation, but the behavior is still unchanged. Regards, Roland Am 13.10.21 um 12:36 schrieb Matthew Knepley: > On Wed, Oct 13, 2021 at 6:32 AM Roland Richter > wrote: > > Yes, the first part (which works) consists out of a compilation > line and a linking line, while the second command is a combination > of compile- and linking line. > > The link line in the first does not tell us anything because MPI is > not even present. 
It is being pulled in I presume from libarmadillo, > which we cannot see. It still > seems most likely, as Stefano said, that you are mixing versions of MPI. > > ? Thanks, > > ? ? ?Matt > ? > > Am 13.10.21 um 12:26 schrieb Matthew Knepley: >> On Wed, Oct 13, 2021 at 5:53 AM Roland Richter >> wrote: >> >> Hei, >> >> I noticed a difference in when the program is running, and >> when not. The code works fine if I compile it via a >> CMake-file and load PETSc there. If I use the compilation >> line which is included in the Makefiles, then the code will >> fail with the mentioned error. The cmake-generated >> compilation line (including armadillo, because my test sample >> contained armadillo-code) is >> >> One of these is a compile command and the other is a link command. >> >> ? ?Matt >> >> //opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -D__INSDIR__="" >> -I/include -I/opt/petsc/include -I/opt/armadillo/include >> -std=c++0x -g -MD -MT CMakeFiles/main.dir/source/main.cpp.o >> -MF CMakeFiles/main.dir/source/main.cpp.o.d -o >> CMakeFiles/main.dir/source/main.cpp.o -c source/main.cpp// >> ///opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -rdynamic >> CMakeFiles/main.dir/source/main.cpp.o -o main_short? >> -Wl,-rpath,/opt/petsc/lib:/opt/armadillo/lib64 >> /opt/petsc/lib/libpetsc.so /opt/armadillo/lib64/libarmadillo.so / >> >> Meanwhile, the original compilation line from PETSc is >> >> /mpicxx -mavx2 -march=native -O3 -fPIC -fopenmp??? >> -I/opt/petsc/include -I/opt/armadillo/include >> -I/opt/intel/oneapi/mkl/latest/include -I/opt/fftw3/include >> -I/opt/hdf5/include -I/opt/boost/include source/main.cpp >> -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib >> -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib >> -L/opt/intel/oneapi/mkl/latest/lib/intel64 >> -Wl,-rpath,/opt/fftw3/lib64 -L/opt/fftw3/lib64 >> -Wl,-rpath,/opt/armadillo/lib64 -L/opt/armadillo/lib64 >> -Wl,-rpath,/opt/intel/oneapi/mkl/latest/lib/intel64 >> -Wl,-rpath,/opt/hdf5/lib -L/opt/hdf5/lib >> -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib/release >> -L/opt/intel/oneapi/mpi/2021.4.0/lib/release >> -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib >> -L/opt/intel/oneapi/mpi/2021.4.0/lib >> -Wl,-rpath,/usr/lib64/gcc/x86_64-suse-linux/11 >> -L/usr/lib64/gcc/x86_64-suse-linux/11 >> -Wl,-rpath,/opt/intel/oneapi/vpl/2021.6.0/lib >> -L/opt/intel/oneapi/vpl/2021.6.0/lib >> -Wl,-rpath,/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 >> -L/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 >> -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib >> -L/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib >> -Wl,-rpath,/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 >> -L/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 >> -Wl,-rpath,/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 >> -L/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 >> -Wl,-rpath,/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 >> -L/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 >> -Wl,-rpath,/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib >> -L/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib >> -Wl,-rpath,/opt/intel/oneapi/dal/2021.4.0/lib/intel64 >> -L/opt/intel/oneapi/dal/2021.4.0/lib/intel64 >> -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin >> -L/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin >> -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/lib >> -L/opt/intel/oneapi/compiler/2021.4.0/linux/lib >> -Wl,-rpath,/opt/intel/oneapi/clck/2021.4.0/lib/intel64 >> -L/opt/intel/oneapi/clck/2021.4.0/lib/intel64 >> -Wl,-rpath,/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp >> 
-L/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp >> -Wl,-rpath,/usr/x86_64-suse-linux/lib >> -L/usr/x86_64-suse-linux/lib -larmadillo -lpetsc -lHYPRE >> -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord >> -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lspqr >> -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd >> -lamd -lsuitesparseconfig -lsuperlu -lsuperlu_dist -lEl >> -lElSuiteSparse -lpmrrr -lfftw3_mpi -lfftw3 -lp4est -lsc >> -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -ldl >> -lpthread -lptesmumps -lptscotchparmetis -lptscotch >> -lptscotcherr -lesmumps -lscotch -lscotcherr -lhdf5_hl -lhdf5 >> -lparmetis -lmetis -lm -lz -lmuparser -lX11 -lstdc++ -ldl >> -lmpifort -lmpi -lrt -lpthread -lgfortran -lm -lgfortran -lm >> -lgcc_s -lquadmath -lrt -lquadmath -lstdc++ -ldl -o main_long/ >> >> Both executables have the same libraries linked to them, but >> in a different order when comparing them with ldd. >> >> Does that explain the observed behavior? >> >> Thanks, >> >> regards, >> >> Roland >> >> Am 11.10.21 um 15:13 schrieb Roland Richter: >>> >>> Hei, >>> >>> the following code works fine: >>> >>> #include >>> #include >>> >>> static char help[] = "Solves 2D Poisson equation using >>> multigrid.\n\n"; >>> int main(int argc,char **argv) { >>> ??? PetscInitialize(&argc,&argv,(char*)0,help); >>> ??? std::cout << "Hello World\n"; >>> ??? PetscFinalize(); >>> ??? return 0; >>> } >>> >>> Regards, >>> >>> Roland >>> >>> Am 11.10.21 um 14:34 schrieb Stefano Zampini: >>>> Can you try with a simple call that only calls >>>> PetscInitialize/Finalize? >>>> >>>> >>>>> On Oct 11, 2021, at 3:30 PM, Roland Richter >>>>> wrote: >>>>> >>>>> At least according to configure.log mpiexec was defined as >>>>> >>>>> Checking for program >>>>> /opt/intel/oneapi/mpi/2021.4.0//bin/mpiexec...found >>>>> ????????????????? Defined make macro "MPIEXECEXECUTABLE" >>>>> to "/opt/intel/oneapi/mpi/2021.4.0/bin/mpiexec" >>>>> >>>>> When running ex19 with this mpiexec it fails with the >>>>> usual error, even though all configuration steps worked >>>>> fine. I attached the configuration log. >>>>> >>>>> Regards, >>>>> >>>>> Roland >>>>> >>>>> Am 11.10.21 um 14:24 schrieb Stefano Zampini: >>>>>> You are most probably using a different mpiexec then the >>>>>> one used to compile petsc. >>>>>> >>>>>> >>>>>> >>>>>>> On Oct 11, 2021, at 3:23 PM, Roland Richter >>>>>>> wrote: >>>>>>> >>>>>>> I tried either /./ex19/ (SNES-example), /mpirun ./ex19/ >>>>>>> or /mpirun -n 1 ./ex19/, all with the same result. >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> Roland >>>>>>> >>>>>>> Am 11.10.21 um 14:22 schrieb Matthew Knepley: >>>>>>>> On Mon, Oct 11, 2021 at 8:07 AM Roland Richter >>>>>>>> wrote: >>>>>>>> >>>>>>>> Hei, >>>>>>>> >>>>>>>> at least in gdb it fails with >>>>>>>> >>>>>>>> Attempting to use an MPI routine before >>>>>>>> initializing MPICH >>>>>>>> [Inferior 1 (process 7854) exited with code 01] >>>>>>>> (gdb) backtrace >>>>>>>> No stack. >>>>>>>> >>>>>>>> >>>>>>>> What were you running? If it never makes it into PETSc >>>>>>>> code, I am not sure what we are >>>>>>>> doing to cause this. >>>>>>>> >>>>>>>> ? Thanks, >>>>>>>> >>>>>>>> ? ? ?Matt >>>>>>>> ? >>>>>>>> >>>>>>>> Regards, >>>>>>>> >>>>>>>> Roland >>>>>>>> >>>>>>>> Am 11.10.21 um 13:57 schrieb Matthew Knepley: >>>>>>>>> On Mon, Oct 11, 2021 at 5:24 AM Roland Richter >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hei, >>>>>>>>> >>>>>>>>> I compiled PETSc with Intel MPI (MPICH) and >>>>>>>>> GCC as compiler (i.e. 
using >>>>>>>>> Intel OneAPI together with the supplied >>>>>>>>> mpicxx-compiler). Compilation >>>>>>>>> and installation worked fine, but running the >>>>>>>>> tests resulted in the >>>>>>>>> error "Attempting to use an MPI routine before >>>>>>>>> initializing MPICH". A >>>>>>>>> simple test program (attached) worked fine >>>>>>>>> with the same combination. >>>>>>>>> >>>>>>>>> What could be the reason for that? >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi Roland, >>>>>>>>> >>>>>>>>> Can you get a stack trace for this error using the >>>>>>>>> debugger? >>>>>>>>> >>>>>>>>> ? Thanks, >>>>>>>>> >>>>>>>>> ? ? ?Matt >>>>>>>>> ? >>>>>>>>> >>>>>>>>> Thanks! >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> >>>>>>>>> Roland Richter >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> What most experimenters take for granted before >>>>>>>>> they begin their experiments is infinitely more >>>>>>>>> interesting than any results to which their >>>>>>>>> experiments lead. >>>>>>>>> -- Norbert Wiener >>>>>>>>> >>>>>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> What most experimenters take for granted before they >>>>>>>> begin their experiments is infinitely more interesting >>>>>>>> than any results to which their experiments lead. >>>>>>>> -- Norbert Wiener >>>>>>>> >>>>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>>>> >>>>>> >>>>> >>>> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to >> which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Wed Oct 13 06:01:12 2021 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 13 Oct 2021 07:01:12 -0400 Subject: [petsc-users] Error "Attempting to use an MPI routine before initializing MPICH" after compiling PETSc with Intel MPI and GCC In-Reply-To: <79bf023a-566a-d4fd-6ca1-bbfb28c543c3@ntnu.no> References: <8bb8147d-05ad-71a1-286e-2a650c4324fb@ntnu.no> <73bc42b8-00f2-d9ae-6850-06c3c459cf9d@ntnu.no> <1280E270-AAA1-44CE-AE2A-93A2B70462E2@gmail.com> <7935b6e2-4319-6444-c0f9-a3ba34e8694a@ntnu.no> <383F2AEB-4CEB-4407-A1B3-E294ACDFD91C@gmail.com> <79bf023a-566a-d4fd-6ca1-bbfb28c543c3@ntnu.no> Message-ID: On Wed, Oct 13, 2021 at 6:43 AM Roland Richter wrote: > Shouldn't I see a mixture of MPI-libraries when calling ldd if I mix > versions of MPI? > > Not if the compilation uses one and the link another. > I also removed all calls to armadillo, and accordingly all references to > it in the compilation, but the behavior is still unchanged. > I do not understand what you mean here. If you remove the armadiillo library, how are you linking MPI? Thanks, Matt > Regards, > > Roland > Am 13.10.21 um 12:36 schrieb Matthew Knepley: > > On Wed, Oct 13, 2021 at 6:32 AM Roland Richter > wrote: > >> Yes, the first part (which works) consists out of a compilation line and >> a linking line, while the second command is a combination of compile- and >> linking line. >> > The link line in the first does not tell us anything because MPI is not > even present. It is being pulled in I presume from libarmadillo, which we > cannot see. 
It still > seems most likely, as Stefano said, that you are mixing versions of MPI. > > Thanks, > > Matt > > >> Am 13.10.21 um 12:26 schrieb Matthew Knepley: >> >> On Wed, Oct 13, 2021 at 5:53 AM Roland Richter >> wrote: >> >>> Hei, >>> >>> I noticed a difference in when the program is running, and when not. The >>> code works fine if I compile it via a CMake-file and load PETSc there. If I >>> use the compilation line which is included in the Makefiles, then the code >>> will fail with the mentioned error. The cmake-generated compilation line >>> (including armadillo, because my test sample contained armadillo-code) is >>> >> One of these is a compile command and the other is a link command. >> >> Matt >> >>> */opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -D__INSDIR__="" -I/include >>> -I/opt/petsc/include -I/opt/armadillo/include -std=c++0x -g -MD -MT >>> CMakeFiles/main.dir/source/main.cpp.o -MF >>> CMakeFiles/main.dir/source/main.cpp.o.d -o >>> CMakeFiles/main.dir/source/main.cpp.o -c source/main.cpp* >>> */opt/intel/oneapi/mpi/2021.4.0/bin/mpicxx -rdynamic >>> CMakeFiles/main.dir/source/main.cpp.o -o main_short >>> -Wl,-rpath,/opt/petsc/lib:/opt/armadillo/lib64 /opt/petsc/lib/libpetsc.so >>> /opt/armadillo/lib64/libarmadillo.so * >>> >>> Meanwhile, the original compilation line from PETSc is >>> >>> *mpicxx -mavx2 -march=native -O3 -fPIC -fopenmp -I/opt/petsc/include >>> -I/opt/armadillo/include -I/opt/intel/oneapi/mkl/latest/include >>> -I/opt/fftw3/include -I/opt/hdf5/include -I/opt/boost/include >>> source/main.cpp -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib >>> -Wl,-rpath,/opt/petsc/lib -L/opt/petsc/lib >>> -L/opt/intel/oneapi/mkl/latest/lib/intel64 -Wl,-rpath,/opt/fftw3/lib64 >>> -L/opt/fftw3/lib64 -Wl,-rpath,/opt/armadillo/lib64 -L/opt/armadillo/lib64 >>> -Wl,-rpath,/opt/intel/oneapi/mkl/latest/lib/intel64 >>> -Wl,-rpath,/opt/hdf5/lib -L/opt/hdf5/lib >>> -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib/release >>> -L/opt/intel/oneapi/mpi/2021.4.0/lib/release >>> -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/lib >>> -L/opt/intel/oneapi/mpi/2021.4.0/lib >>> -Wl,-rpath,/usr/lib64/gcc/x86_64-suse-linux/11 >>> -L/usr/lib64/gcc/x86_64-suse-linux/11 >>> -Wl,-rpath,/opt/intel/oneapi/vpl/2021.6.0/lib >>> -L/opt/intel/oneapi/vpl/2021.6.0/lib >>> -Wl,-rpath,/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 >>> -L/opt/intel/oneapi/tbb/2021.4.0/lib/intel64/gcc4.8 >>> -Wl,-rpath,/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib >>> -L/opt/intel/oneapi/mpi/2021.4.0/libfabric/lib >>> -Wl,-rpath,/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 >>> -L/opt/intel/oneapi/mkl/2021.4.0/lib/intel64 >>> -Wl,-rpath,/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 >>> -L/opt/intel/oneapi/ipp/2021.4.0/lib/intel64 >>> -Wl,-rpath,/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 >>> -L/opt/intel/oneapi/ippcp/2021.4.0/lib/intel64 >>> -Wl,-rpath,/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib >>> -L/opt/intel/oneapi/dnnl/2021.4.0/cpu_dpcpp_gpu_dpcpp/lib >>> -Wl,-rpath,/opt/intel/oneapi/dal/2021.4.0/lib/intel64 >>> -L/opt/intel/oneapi/dal/2021.4.0/lib/intel64 >>> -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin >>> -L/opt/intel/oneapi/compiler/2021.4.0/linux/compiler/lib/intel64_lin >>> -Wl,-rpath,/opt/intel/oneapi/compiler/2021.4.0/linux/lib >>> -L/opt/intel/oneapi/compiler/2021.4.0/linux/lib >>> -Wl,-rpath,/opt/intel/oneapi/clck/2021.4.0/lib/intel64 >>> -L/opt/intel/oneapi/clck/2021.4.0/lib/intel64 >>> -Wl,-rpath,/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp >>> 
-L/opt/intel/oneapi/ccl/2021.4.0/lib/cpu_gpu_dpcpp >>> -Wl,-rpath,/usr/x86_64-suse-linux/lib -L/usr/x86_64-suse-linux/lib >>> -larmadillo -lpetsc -lHYPRE -lcmumps -ldmumps -lsmumps -lzmumps >>> -lmumps_common -lpord -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lspqr >>> -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd >>> -lsuitesparseconfig -lsuperlu -lsuperlu_dist -lEl -lElSuiteSparse -lpmrrr >>> -lfftw3_mpi -lfftw3 -lp4est -lsc -lmkl_intel_lp64 -lmkl_core >>> -lmkl_intel_thread -liomp5 -ldl -lpthread -lptesmumps -lptscotchparmetis >>> -lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lhdf5_hl -lhdf5 >>> -lparmetis -lmetis -lm -lz -lmuparser -lX11 -lstdc++ -ldl -lmpifort -lmpi >>> -lrt -lpthread -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lrt >>> -lquadmath -lstdc++ -ldl -o main_long* >>> >>> Both executables have the same libraries linked to them, but in a >>> different order when comparing them with ldd. >>> >>> Does that explain the observed behavior? >>> >>> Thanks, >>> >>> regards, >>> >>> Roland >>> Am 11.10.21 um 15:13 schrieb Roland Richter: >>> >>> Hei, >>> >>> the following code works fine: >>> >>> #include >>> #include >>> >>> static char help[] = "Solves 2D Poisson equation using multigrid.\n\n"; >>> int main(int argc,char **argv) { >>> PetscInitialize(&argc,&argv,(char*)0,help); >>> std::cout << "Hello World\n"; >>> PetscFinalize(); >>> return 0; >>> } >>> >>> Regards, >>> >>> Roland >>> Am 11.10.21 um 14:34 schrieb Stefano Zampini: >>> >>> Can you try with a simple call that only calls PetscInitialize/Finalize? >>> >>> >>> On Oct 11, 2021, at 3:30 PM, Roland Richter >>> wrote: >>> >>> At least according to configure.log mpiexec was defined as >>> >>> Checking for program /opt/intel/oneapi/mpi/2021.4.0//bin/mpiexec...found >>> Defined make macro "MPIEXECEXECUTABLE" to >>> "/opt/intel/oneapi/mpi/2021.4.0/bin/mpiexec" >>> >>> When running ex19 with this mpiexec it fails with the usual error, even >>> though all configuration steps worked fine. I attached the configuration >>> log. >>> >>> Regards, >>> >>> Roland >>> Am 11.10.21 um 14:24 schrieb Stefano Zampini: >>> >>> You are most probably using a different mpiexec then the one used to >>> compile petsc. >>> >>> >>> >>> On Oct 11, 2021, at 3:23 PM, Roland Richter >>> wrote: >>> >>> I tried either *./ex19* (SNES-example), *mpirun ./ex19* or *mpirun -n 1 >>> ./ex19*, all with the same result. >>> >>> Regards, >>> >>> Roland >>> Am 11.10.21 um 14:22 schrieb Matthew Knepley: >>> >>> On Mon, Oct 11, 2021 at 8:07 AM Roland Richter >>> wrote: >>> >>>> Hei, >>>> >>>> at least in gdb it fails with >>>> >>>> Attempting to use an MPI routine before initializing MPICH >>>> [Inferior 1 (process 7854) exited with code 01] >>>> (gdb) backtrace >>>> No stack. >>>> >>> >>> What were you running? If it never makes it into PETSc code, I am not >>> sure what we are >>> doing to cause this. >>> >>> Thanks, >>> >>> Matt >>> >>> >>>> Regards, >>>> >>>> Roland >>>> Am 11.10.21 um 13:57 schrieb Matthew Knepley: >>>> >>>> On Mon, Oct 11, 2021 at 5:24 AM Roland Richter >>>> wrote: >>>> >>>>> Hei, >>>>> >>>>> I compiled PETSc with Intel MPI (MPICH) and GCC as compiler (i.e. using >>>>> Intel OneAPI together with the supplied mpicxx-compiler). Compilation >>>>> and installation worked fine, but running the tests resulted in the >>>>> error "Attempting to use an MPI routine before initializing MPICH". A >>>>> simple test program (attached) worked fine with the same combination. >>>>> >>>>> What could be the reason for that? 
>>>>> >>>> >>>> Hi Roland, >>>> >>>> Can you get a stack trace for this error using the debugger? >>>> >>>> Thanks, >>>> >>>> Matt >>>> >>>> >>>>> Thanks! >>>>> >>>>> Regards, >>>>> >>>>> Roland Richter >>>>> >>>> >>>> >>>> -- >>>> What most experimenters take for granted before they begin their >>>> experiments is infinitely more interesting than any results to which their >>>> experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>>> >>>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >>> >>> >>> >>> >>> >>> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Wed Oct 13 09:18:24 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Wed, 13 Oct 2021 09:18:24 -0500 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> Message-ID: On Tue, Oct 12, 2021 at 1:07 PM Mark Adams wrote: > > > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu wrote: > >> Hi Mark, >> >> The option I use is like >> >> -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type >> aijcusparse *-sub_pc_factor_mat_solver_type cusparse *-sub_ksp_type >> preonly *-sub_pc_type lu* -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol >> 1.e-300 >> >> > Note, If you use -log_view the last column (rows are the method like > MatFactorNumeric) has the percent of work in the GPU. > > Junchao: *This* implies that we have a cuSparse LU factorization. Is > that correct? (I don't think we do) > No, we don't have cuSparse LU factorization. If you check MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will find it calls MatLUFactorSymbolic_SeqAIJ() instead. So I don't understand Chang's idea. Do you want to make bigger blocks? > > I think this one do both factorization and solve on gpu. >> >> You can check the runex72_aijcusparse.sh file in petsc install >> directory, and try it your self (this is only lu factorization without >> iterative solve). >> >> Chang >> >> On 10/12/21 1:17 PM, Mark Adams wrote: >> > >> > >> > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu > > > wrote: >> > >> > Hi Junchao, >> > >> > No I only needs it to be transferred within a node. I use >> block-Jacobi >> > method and GMRES to solve the sparse matrix, so each direct solver >> will >> > take care of a sub-block of the whole matrix. In this way, I can use >> > one >> > GPU to solve one sub-block, which is stored within one node. 
>> > >> > It was stated in the documentation that cusparse solver is slow. >> > However, in my test using ex72.c, the cusparse solver is faster than >> > mumps or superlu_dist on CPUs. >> > >> > >> > Are we talking about the factorization, the solve, or both? >> > >> > We do not have an interface to cuSparse's LU factorization (I just >> > learned that it exists a few weeks ago). >> > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type >> > aijcusparse' ? This would be the CPU factorization, which is the >> > dominant cost. >> > >> > >> > Chang >> > >> > On 10/12/21 10:24 AM, Junchao Zhang wrote: >> > > Hi, Chang, >> > > For the mumps solver, we usually transfers matrix and vector >> > data >> > > within a compute node. For the idea you propose, it looks like >> > we need >> > > to gather data within MPI_COMM_WORLD, right? >> > > >> > > Mark, I remember you said cusparse solve is slow and you >> would >> > > rather do it on CPU. Is it right? >> > > >> > > --Junchao Zhang >> > > >> > > >> > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users >> > > >> > >> >> > wrote: >> > > >> > > Hi, >> > > >> > > Currently, it is possible to use mumps solver in PETSC with >> > > -mat_mumps_use_omp_threads option, so that multiple MPI >> > processes will >> > > transfer the matrix and rhs data to the master rank, and then >> > master >> > > rank will call mumps with OpenMP to solve the matrix. >> > > >> > > I wonder if someone can develop similar option for cusparse >> > solver. >> > > Right now, this solver does not work with mpiaijcusparse. I >> > think a >> > > possible workaround is to transfer all the matrix data to >> one MPI >> > > process, and then upload the data to GPU to solve. In this >> > way, one can >> > > use cusparse solver for a MPI program. >> > > >> > > Chang >> > > -- >> > > Chang Liu >> > > Staff Research Physicist >> > > +1 609 243 3438 >> > > cliu at pppl.gov > > > >> > > Princeton Plasma Physics Laboratory >> > > 100 Stellarator Rd, Princeton NJ 08540, USA >> > > >> > >> > -- >> > Chang Liu >> > Staff Research Physicist >> > +1 609 243 3438 >> > cliu at pppl.gov >> > Princeton Plasma Physics Laboratory >> > 100 Stellarator Rd, Princeton NJ 08540, USA >> > >> >> -- >> Chang Liu >> Staff Research Physicist >> +1 609 243 3438 >> cliu at pppl.gov >> Princeton Plasma Physics Laboratory >> 100 Stellarator Rd, Princeton NJ 08540, USA >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cliu at pppl.gov Wed Oct 13 10:10:21 2021 From: cliu at pppl.gov (Chang Liu) Date: Wed, 13 Oct 2021 11:10:21 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> Message-ID: Thank you Junchao for explaining this. I guess in my case the code is just calling a seq solver like superlu to do factorization on GPUs. My idea is that I want to have a traditional MPI code to utilize GPUs with cusparse. Right now cusparse does not support mpiaij matrix, so I want the code to have a mpiaij matrix when adding all the matrix terms, and then transform the matrix to seqaij when doing the factorization and solve. This involves sending the data to the master process, and I think the petsc mumps solver have something similar already. 
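For reference, the existing MUMPS path used as the model here is driven purely by run-time options; a sketch, assuming a PETSc build configured with MUMPS and OpenMP support (the executable name and the process/thread counts are placeholders):

    mpirun -n 16 ./my_petsc_app -ksp_type preonly -pc_type lu \
        -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads 4

With this option the matrix and right-hand side are gathered from the MPI ranks onto master rank(s), which then call MUMPS with OpenMP threads; the request in this thread is essentially the same gather step, but handing the resulting sequential matrix to the cusparse factorization instead.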
Chang On 10/13/21 10:18 AM, Junchao Zhang wrote: > > > > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams > wrote: > > > > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu > wrote: > > Hi Mark, > > The option I use is like > > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type > aijcusparse *-sub_pc_factor_mat_solver_type cusparse *-sub_ksp_type > preonly *-sub_pc_type lu* -ksp_max_it 2000 -ksp_rtol 1.e-300 > -ksp_atol 1.e-300 > > > Note, If you use -log_view the last column (rows are the method like > MatFactorNumeric) has the percent of work in the GPU. > > Junchao: *This* implies that we have a cuSparse LU factorization. Is > that correct? (I don't think we do) > > No, we don't have cuSparse LU factorization.? If you check > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will find it calls > MatLUFactorSymbolic_SeqAIJ() instead. > So I don't understand Chang's idea. Do you want to make bigger blocks? > > > I think this one do both factorization and solve on gpu. > > You can check the runex72_aijcusparse.sh file in petsc install > directory, and try it your self (this is only lu factorization > without > iterative solve). > > Chang > > On 10/12/21 1:17 PM, Mark Adams wrote: > > > > > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu > > >> wrote: > > > >? ? ?Hi Junchao, > > > >? ? ?No I only needs it to be transferred within a node. I use > block-Jacobi > >? ? ?method and GMRES to solve the sparse matrix, so each > direct solver will > >? ? ?take care of a sub-block of the whole matrix. In this > way, I can use > >? ? ?one > >? ? ?GPU to solve one sub-block, which is stored within one node. > > > >? ? ?It was stated in the documentation that cusparse solver > is slow. > >? ? ?However, in my test using ex72.c, the cusparse solver is > faster than > >? ? ?mumps or superlu_dist on CPUs. > > > > > > Are we talking about the factorization, the solve, or both? > > > > We do not have an interface?to cuSparse's?LU factorization (I > just > > learned that it exists a few weeks ago). > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type > > aijcusparse' ? This would be the CPU factorization, which is the > > dominant?cost. > > > > > >? ? ?Chang > > > >? ? ?On 10/12/21 10:24 AM, Junchao Zhang wrote: > >? ? ? > Hi, Chang, > >? ? ? >? ? ?For the mumps solver, we usually transfers matrix > and vector > >? ? ?data > >? ? ? > within a compute node.? For the idea you propose, it > looks like > >? ? ?we need > >? ? ? > to gather data within MPI_COMM_WORLD, right? > >? ? ? > > >? ? ? >? ? ?Mark, I remember you said cusparse solve is slow > and you would > >? ? ? > rather do it on CPU. Is it right? > >? ? ? > > >? ? ? > --Junchao Zhang > >? ? ? > > >? ? ? > > >? ? ? > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users > >? ? ? > > > >? ? ? >>> > >? ? ?wrote: > >? ? ? > > >? ? ? >? ? ?Hi, > >? ? ? > > >? ? ? >? ? ?Currently, it is possible to use mumps solver in > PETSC with > >? ? ? >? ? ?-mat_mumps_use_omp_threads option, so that > multiple MPI > >? ? ?processes will > >? ? ? >? ? ?transfer the matrix and rhs data to the master > rank, and then > >? ? ?master > >? ? ? >? ? ?rank will call mumps with OpenMP to solve the matrix. > >? ? ? > > >? ? ? >? ? ?I wonder if someone can develop similar option for > cusparse > >? ? ?solver. > >? ? ? >? ? ?Right now, this solver does not work with > mpiaijcusparse. I > >? ? ?think a > >? ? ? >? ? ?possible workaround is to transfer all the matrix > data to one MPI > >? ? ? >? ? ?process, and then upload the data to GPU to solve. > In this > >? ? 
?way, one can > >? ? ? >? ? ?use cusparse solver for a MPI program. > >? ? ? > > >? ? ? >? ? ?Chang > >? ? ? >? ? ?-- > >? ? ? >? ? ?Chang Liu > >? ? ? >? ? ?Staff Research Physicist > >? ? ? >? ? ?+1 609 243 3438 > >? ? ? > cliu at pppl.gov > > > > >? ? ?>> > >? ? ? >? ? ?Princeton Plasma Physics Laboratory > >? ? ? >? ? ?100 Stellarator Rd, Princeton NJ 08540, USA > >? ? ? > > > > >? ? ?-- > >? ? ?Chang Liu > >? ? ?Staff Research Physicist > >? ? ?+1 609 243 3438 > > cliu at pppl.gov > > >? ? ?Princeton Plasma Physics Laboratory > >? ? ?100 Stellarator Rd, Princeton NJ 08540, USA > > > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From mfadams at lbl.gov Wed Oct 13 11:03:39 2021 From: mfadams at lbl.gov (Mark Adams) Date: Wed, 13 Oct 2021 12:03:39 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> Message-ID: On Wed, Oct 13, 2021 at 11:10 AM Chang Liu wrote: > Thank you Junchao for explaining this. I guess in my case the code is > just calling a seq solver like superlu to do factorization on GPUs. > > My idea is that I want to have a traditional MPI code to utilize GPUs > with cusparse. Right now cusparse does not support mpiaij matrix, Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse matrix with > 1 processes. (-mat_type mpiaijcusparse might also work with >1 proc). However, I see in grepping the repo that all the mumps and superlu tests use aij or sell matrix type. MUMPS and SuperLU provide their own solves, I assume .... but you might want to do other matrix operations on the GPU. Is that the issue? Did you try -mat_type aijcusparse with MUMPS and/or SuperLU have a problem? (no test with it so it probably does not work) Thanks, Mark > so I > want the code to have a mpiaij matrix when adding all the matrix terms, > and then transform the matrix to seqaij when doing the factorization and > solve. This involves sending the data to the master process, and I think > the petsc mumps solver have something similar already. > > Chang > > On 10/13/21 10:18 AM, Junchao Zhang wrote: > > > > > > > > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams > > wrote: > > > > > > > > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu > > wrote: > > > > Hi Mark, > > > > The option I use is like > > > > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type > > aijcusparse *-sub_pc_factor_mat_solver_type cusparse > *-sub_ksp_type > > preonly *-sub_pc_type lu* -ksp_max_it 2000 -ksp_rtol 1.e-300 > > -ksp_atol 1.e-300 > > > > > > Note, If you use -log_view the last column (rows are the method like > > MatFactorNumeric) has the percent of work in the GPU. > > > > Junchao: *This* implies that we have a cuSparse LU factorization. Is > > that correct? (I don't think we do) > > > > No, we don't have cuSparse LU factorization. If you check > > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will find it calls > > MatLUFactorSymbolic_SeqAIJ() instead. > > So I don't understand Chang's idea. Do you want to make bigger blocks? > > > > > > I think this one do both factorization and solve on gpu. 
> > > > You can check the runex72_aijcusparse.sh file in petsc install > > directory, and try it your self (this is only lu factorization > > without > > iterative solve). > > > > Chang > > > > On 10/12/21 1:17 PM, Mark Adams wrote: > > > > > > > > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu > > > > >> wrote: > > > > > > Hi Junchao, > > > > > > No I only needs it to be transferred within a node. I use > > block-Jacobi > > > method and GMRES to solve the sparse matrix, so each > > direct solver will > > > take care of a sub-block of the whole matrix. In this > > way, I can use > > > one > > > GPU to solve one sub-block, which is stored within one > node. > > > > > > It was stated in the documentation that cusparse solver > > is slow. > > > However, in my test using ex72.c, the cusparse solver is > > faster than > > > mumps or superlu_dist on CPUs. > > > > > > > > > Are we talking about the factorization, the solve, or both? > > > > > > We do not have an interface to cuSparse's LU factorization (I > > just > > > learned that it exists a few weeks ago). > > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type > > > aijcusparse' ? This would be the CPU factorization, which is > the > > > dominant cost. > > > > > > > > > Chang > > > > > > On 10/12/21 10:24 AM, Junchao Zhang wrote: > > > > Hi, Chang, > > > > For the mumps solver, we usually transfers matrix > > and vector > > > data > > > > within a compute node. For the idea you propose, it > > looks like > > > we need > > > > to gather data within MPI_COMM_WORLD, right? > > > > > > > > Mark, I remember you said cusparse solve is slow > > and you would > > > > rather do it on CPU. Is it right? > > > > > > > > --Junchao Zhang > > > > > > > > > > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via > petsc-users > > > > > > > > > > > > >>> > > > wrote: > > > > > > > > Hi, > > > > > > > > Currently, it is possible to use mumps solver in > > PETSC with > > > > -mat_mumps_use_omp_threads option, so that > > multiple MPI > > > processes will > > > > transfer the matrix and rhs data to the master > > rank, and then > > > master > > > > rank will call mumps with OpenMP to solve the > matrix. > > > > > > > > I wonder if someone can develop similar option for > > cusparse > > > solver. > > > > Right now, this solver does not work with > > mpiaijcusparse. I > > > think a > > > > possible workaround is to transfer all the matrix > > data to one MPI > > > > process, and then upload the data to GPU to solve. > > In this > > > way, one can > > > > use cusparse solver for a MPI program. > > > > > > > > Chang > > > > -- > > > > Chang Liu > > > > Staff Research Physicist > > > > +1 609 243 3438 > > > > cliu at pppl.gov > > > > > > > > >> > > > > Princeton Plasma Physics Laboratory > > > > 100 Stellarator Rd, Princeton NJ 08540, USA > > > > > > > > > > -- > > > Chang Liu > > > Staff Research Physicist > > > +1 609 243 3438 > > > cliu at pppl.gov > > > > > Princeton Plasma Physics Laboratory > > > 100 Stellarator Rd, Princeton NJ 08540, USA > > > > > > > -- > > Chang Liu > > Staff Research Physicist > > +1 609 243 3438 > > cliu at pppl.gov > > Princeton Plasma Physics Laboratory > > 100 Stellarator Rd, Princeton NJ 08540, USA > > > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -------------- next part -------------- An HTML attachment was scrubbed... 
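For reference, the solver configuration this thread keeps coming back to, namely FGMRES over the whole system with block Jacobi and one LU-factorized cusparse block per process, can also be set up in code rather than through the command-line options quoted above. The fragment below is only a rough sketch under stated assumptions: the matrix A and the vectors b and x are assumed to have been created elsewhere with -mat_type aijcusparse, the block count 16 is just an example value, and error checking is omitted; it is not taken from any of the codes discussed in this thread.

  /* requires petscksp.h; Mat A and Vec b, x are assumed to exist already */
  KSP      ksp, *subksp;
  PC       pc, subpc;
  PetscInt nlocal, first, i;

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);              /* A is an aijcusparse matrix */
  KSPSetType(ksp, KSPFGMRES);
  KSPGetPC(ksp, &pc);
  PCSetType(pc, PCBJACOBI);
  PCBJacobiSetTotalBlocks(pc, 16, NULL);   /* 16 blocks, example value only */
  KSPSetFromOptions(ksp);
  KSPSetUp(ksp);                           /* the blocks must exist before they can be queried */
  PCBJacobiGetSubKSP(pc, &nlocal, &first, &subksp);
  for (i = 0; i < nlocal; i++) {           /* one direct solve per local block */
    KSPSetType(subksp[i], KSPPREONLY);
    KSPGetPC(subksp[i], &subpc);
    PCSetType(subpc, PCLU);
    PCFactorSetMatSolverType(subpc, MATSOLVERCUSPARSE);
  }
  KSPSolve(ksp, b, x);

Whether the numerical factorization and solve of each block then actually run on the GPU is exactly the question being discussed here; as Mark notes above, the last column of -log_view shows how much of that work ended up on the device.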
URL: From cliu at pppl.gov Wed Oct 13 11:16:47 2021 From: cliu at pppl.gov (Chang Liu) Date: Wed, 13 Oct 2021 12:16:47 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> Message-ID: <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> Hi Mark, '-mat_type aijcusparse' works with mpiaijcusparse with other solvers, but with -pc_factor_mat_solver_type cusparse, it will give an error. Yes what I want is to have mumps or superlu to do the factorization, and then do the rest, including GMRES solver, on gpu. Is that possible? I have tried to use aijcusparse with superlu_dist, it runs but the iterative solver is still running on CPUs. I have contacted the superlu group and they confirmed that is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver is running on GPU. Chang On 10/13/21 12:03 PM, Mark Adams wrote: > > > On Wed, Oct 13, 2021 at 11:10 AM Chang Liu > wrote: > > Thank you Junchao for explaining this. I guess in my case the code is > just calling a seq solver like superlu to do factorization on GPUs. > > My idea is that I want to have a traditional MPI code to utilize GPUs > with cusparse. Right now cusparse does not support mpiaij matrix, > > > Sure it does: '-mat_type aijcusparse' will give you an > mpiaijcusparse?matrix with > 1 processes. > (-mat_type mpiaijcusparse?might also work with >1 proc). > > However, I see in grepping?the repo that all the mumps and superlu tests > use aij or sell matrix type. > MUMPS and SuperLU provide their?own solves, I assume .... but you might > want to do other matrix operations on the GPU. Is that the issue? > Did you try -mat_type aijcusparse?with MUMPS and/or SuperLU have a > problem? (no test with it so it probably?does not work) > > Thanks, > Mark > > so I > want the code to have a mpiaij matrix when adding all the matrix terms, > and then transform the matrix to seqaij when doing the factorization > and > solve. This involves sending the data to the master process, and I > think > the petsc mumps solver have something similar already. > > Chang > > On 10/13/21 10:18 AM, Junchao Zhang wrote: > > > > > > > > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams > > >> wrote: > > > > > > > >? ? ?On Tue, Oct 12, 2021 at 1:45 PM Chang Liu > >? ? ?>> wrote: > > > >? ? ? ? ?Hi Mark, > > > >? ? ? ? ?The option I use is like > > > >? ? ? ? ?-pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres > -mat_type > >? ? ? ? ?aijcusparse *-sub_pc_factor_mat_solver_type cusparse > *-sub_ksp_type > >? ? ? ? ?preonly *-sub_pc_type lu* -ksp_max_it 2000 -ksp_rtol 1.e-300 > >? ? ? ? ?-ksp_atol 1.e-300 > > > > > >? ? ?Note, If you use -log_view the last column (rows are the > method like > >? ? ?MatFactorNumeric) has the percent of work in the GPU. > > > >? ? ?Junchao: *This* implies that we have a cuSparse LU > factorization. Is > >? ? ?that correct? (I don't think we do) > > > > No, we don't have cuSparse LU factorization.? If you check > > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will find it calls > > MatLUFactorSymbolic_SeqAIJ() instead. > > So I don't understand Chang's idea. Do you want to make bigger > blocks? > > > > > >? ? ? ? ?I think this one do both factorization and solve on gpu. > > > >? ? ? ? ?You can check the runex72_aijcusparse.sh file in petsc > install > >? ? ? ? 
?directory, and try it your self (this is only lu > factorization > >? ? ? ? ?without > >? ? ? ? ?iterative solve). > > > >? ? ? ? ?Chang > > > >? ? ? ? ?On 10/12/21 1:17 PM, Mark Adams wrote: > >? ? ? ? ? > > >? ? ? ? ? > > >? ? ? ? ? > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu > > >? ? ? ? ?> > >? ? ? ? ? > > >>> wrote: > >? ? ? ? ? > > >? ? ? ? ? >? ? ?Hi Junchao, > >? ? ? ? ? > > >? ? ? ? ? >? ? ?No I only needs it to be transferred within a > node. I use > >? ? ? ? ?block-Jacobi > >? ? ? ? ? >? ? ?method and GMRES to solve the sparse matrix, so each > >? ? ? ? ?direct solver will > >? ? ? ? ? >? ? ?take care of a sub-block of the whole matrix. In this > >? ? ? ? ?way, I can use > >? ? ? ? ? >? ? ?one > >? ? ? ? ? >? ? ?GPU to solve one sub-block, which is stored within > one node. > >? ? ? ? ? > > >? ? ? ? ? >? ? ?It was stated in the documentation that cusparse > solver > >? ? ? ? ?is slow. > >? ? ? ? ? >? ? ?However, in my test using ex72.c, the cusparse > solver is > >? ? ? ? ?faster than > >? ? ? ? ? >? ? ?mumps or superlu_dist on CPUs. > >? ? ? ? ? > > >? ? ? ? ? > > >? ? ? ? ? > Are we talking about the factorization, the solve, or > both? > >? ? ? ? ? > > >? ? ? ? ? > We do not have an interface?to cuSparse's?LU > factorization (I > >? ? ? ? ?just > >? ? ? ? ? > learned that it exists a few weeks ago). > >? ? ? ? ? > Perhaps your fast "cusparse solver" is '-pc_type lu > -mat_type > >? ? ? ? ? > aijcusparse' ? This would be the CPU factorization, > which is the > >? ? ? ? ? > dominant?cost. > >? ? ? ? ? > > >? ? ? ? ? > > >? ? ? ? ? >? ? ?Chang > >? ? ? ? ? > > >? ? ? ? ? >? ? ?On 10/12/21 10:24 AM, Junchao Zhang wrote: > >? ? ? ? ? >? ? ? > Hi, Chang, > >? ? ? ? ? >? ? ? >? ? ?For the mumps solver, we usually transfers > matrix > >? ? ? ? ?and vector > >? ? ? ? ? >? ? ?data > >? ? ? ? ? >? ? ? > within a compute node.? For the idea you > propose, it > >? ? ? ? ?looks like > >? ? ? ? ? >? ? ?we need > >? ? ? ? ? >? ? ? > to gather data within MPI_COMM_WORLD, right? > >? ? ? ? ? >? ? ? > > >? ? ? ? ? >? ? ? >? ? ?Mark, I remember you said cusparse solve is > slow > >? ? ? ? ?and you would > >? ? ? ? ? >? ? ? > rather do it on CPU. Is it right? > >? ? ? ? ? >? ? ? > > >? ? ? ? ? >? ? ? > --Junchao Zhang > >? ? ? ? ? >? ? ? > > >? ? ? ? ? >? ? ? > > >? ? ? ? ? >? ? ? > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via > petsc-users > >? ? ? ? ? >? ? ? > > >? ? ? ? ? > > >? ? ? ? ? >> > >? ? ? ? ? >? ? ? > >? ? ? ? ? > > >? ? ? ? ? >>>> > >? ? ? ? ? >? ? ?wrote: > >? ? ? ? ? >? ? ? > > >? ? ? ? ? >? ? ? >? ? ?Hi, > >? ? ? ? ? >? ? ? > > >? ? ? ? ? >? ? ? >? ? ?Currently, it is possible to use mumps > solver in > >? ? ? ? ?PETSC with > >? ? ? ? ? >? ? ? >? ? ?-mat_mumps_use_omp_threads option, so that > >? ? ? ? ?multiple MPI > >? ? ? ? ? >? ? ?processes will > >? ? ? ? ? >? ? ? >? ? ?transfer the matrix and rhs data to the master > >? ? ? ? ?rank, and then > >? ? ? ? ? >? ? ?master > >? ? ? ? ? >? ? ? >? ? ?rank will call mumps with OpenMP to solve > the matrix. > >? ? ? ? ? >? ? ? > > >? ? ? ? ? >? ? ? >? ? ?I wonder if someone can develop similar > option for > >? ? ? ? ?cusparse > >? ? ? ? ? >? ? ?solver. > >? ? ? ? ? >? ? ? >? ? ?Right now, this solver does not work with > >? ? ? ? ?mpiaijcusparse. I > >? ? ? ? ? >? ? ?think a > >? ? ? ? ? >? ? ? >? ? ?possible workaround is to transfer all the > matrix > >? ? ? ? ?data to one MPI > >? ? ? ? ? >? ? ? >? ? ?process, and then upload the data to GPU to > solve. > >? ? ? ? ?In this > >? ? ? ? ? >? ? ?way, one can > >? ? ? ? ? >? ? ? 
>? ? ?use cusparse solver for a MPI program. > >? ? ? ? ? >? ? ? > > >? ? ? ? ? >? ? ? >? ? ?Chang > >? ? ? ? ? >? ? ? >? ? ?-- > >? ? ? ? ? >? ? ? >? ? ?Chang Liu > >? ? ? ? ? >? ? ? >? ? ?Staff Research Physicist > >? ? ? ? ? >? ? ? >? ? ?+1 609 243 3438 > >? ? ? ? ? >? ? ? > cliu at pppl.gov > > > >? ? ? ? ? > >> > >? ? ? ? ? > > > >? ? ? ? ? >? ? ? > >>> > >? ? ? ? ? >? ? ? >? ? ?Princeton Plasma Physics Laboratory > >? ? ? ? ? >? ? ? >? ? ?100 Stellarator Rd, Princeton NJ 08540, USA > >? ? ? ? ? >? ? ? > > >? ? ? ? ? > > >? ? ? ? ? >? ? ?-- > >? ? ? ? ? >? ? ?Chang Liu > >? ? ? ? ? >? ? ?Staff Research Physicist > >? ? ? ? ? >? ? ?+1 609 243 3438 > >? ? ? ? ? > cliu at pppl.gov > > > >? ? ? ? ?>> > >? ? ? ? ? >? ? ?Princeton Plasma Physics Laboratory > >? ? ? ? ? >? ? ?100 Stellarator Rd, Princeton NJ 08540, USA > >? ? ? ? ? > > > > >? ? ? ? ?-- > >? ? ? ? ?Chang Liu > >? ? ? ? ?Staff Research Physicist > >? ? ? ? ?+1 609 243 3438 > > cliu at pppl.gov > > >? ? ? ? ?Princeton Plasma Physics Laboratory > >? ? ? ? ?100 Stellarator Rd, Princeton NJ 08540, USA > > > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From celestechevali at gmail.com Wed Oct 13 11:18:46 2021 From: celestechevali at gmail.com (Tianchi Li) Date: Wed, 13 Oct 2021 18:18:46 +0200 Subject: [petsc-users] About hardware limitation/recommendation for PETSc (new user) Message-ID: Hello, I?m planning to parallelize my C++ simulation code with PETSc. And I wish to buy a local workstation to perform the parallel code development and later productive runs. Before purchasing the workstation, I wish to know if there is any limitation or recommendation for PETSc implementation on the hardware side ? For example, is there any hardware limitation about the MPI or GPU parallelism ? (e.g. Nvidia or AMD graphics cards, Intel or AMD CPUs) Concerning cost-effectiveness, does a hybrid CPU-GPU machine have more advantages than an all-CPU one ? (if using GPU acceleration) Thank you so much in advance. I appreciate any advice that you provide. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Wed Oct 13 12:47:38 2021 From: bsmith at petsc.dev (Barry Smith) Date: Wed, 13 Oct 2021 13:47:38 -0400 Subject: [petsc-users] About hardware limitation/recommendation for PETSc (new user) In-Reply-To: References: Message-ID: <723D4B5C-4F73-4833-BA37-3FAED6996CFB@petsc.dev> This is a very complex question. But most PETSc simulations are memory bandwidth limited so to first order you want to purchase something that delivers the highest possible memory bandwidth for your price tag. For a pure CPU system it is the cumulative memory bandwidth you can utilize over multiple cores that matters, https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup GPUs are designed to be high memory bandwidth and hence can offer for large enough problems an higher overall bandwidth than a CPU can alone, but it may be difficult to take advantage of the bandwidth. In turns of mature software support in PETSc for GPUs the best choice is NVIDIA. 
If you have a good test case for your class of problems coded you might consider using some cloud services to benchmark some hardware choices for your problem before purchasing anything. Barry > On Oct 13, 2021, at 12:18 PM, Tianchi Li wrote: > > Hello, > > I?m planning to parallelize my C++ simulation code with PETSc. > > And I wish to buy a local workstation to perform the parallel code development and later productive runs. > > Before purchasing the workstation, I wish to know if there is any limitation or recommendation for PETSc implementation on the hardware side ? > > For example, is there any hardware limitation about the MPI or GPU parallelism ? (e.g. Nvidia or AMD graphics cards, Intel or AMD CPUs) > > Concerning cost-effectiveness, does a hybrid CPU-GPU machine have more advantages than an all-CPU one ? (if using GPU acceleration) > > Thank you so much in advance. > > I appreciate any advice that you provide. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Wed Oct 13 12:53:27 2021 From: bsmith at petsc.dev (Barry Smith) Date: Wed, 13 Oct 2021 13:53:27 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> Message-ID: <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> Chang, You are correct there is no MPI + GPU direct solvers that currently do the triangular solves with MPI + GPU parallelism that I am aware of. You are limited that individual triangular solves be done on a single GPU. I can only suggest making each subdomain as big as possible to utilize each GPU as much as possible for the direct triangular solves. Barry > On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users wrote: > > Hi Mark, > > '-mat_type aijcusparse' works with mpiaijcusparse with other solvers, but with -pc_factor_mat_solver_type cusparse, it will give an error. > > Yes what I want is to have mumps or superlu to do the factorization, and then do the rest, including GMRES solver, on gpu. Is that possible? > > I have tried to use aijcusparse with superlu_dist, it runs but the iterative solver is still running on CPUs. I have contacted the superlu group and they confirmed that is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver is running on GPU. > > Chang > > On 10/13/21 12:03 PM, Mark Adams wrote: >> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu > wrote: >> Thank you Junchao for explaining this. I guess in my case the code is >> just calling a seq solver like superlu to do factorization on GPUs. >> My idea is that I want to have a traditional MPI code to utilize GPUs >> with cusparse. Right now cusparse does not support mpiaij matrix, Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse matrix with > 1 processes. >> (-mat_type mpiaijcusparse might also work with >1 proc). >> However, I see in grepping the repo that all the mumps and superlu tests use aij or sell matrix type. >> MUMPS and SuperLU provide their own solves, I assume .... but you might want to do other matrix operations on the GPU. Is that the issue? >> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU have a problem? 
(no test with it so it probably does not work) >> Thanks, >> Mark >> so I >> want the code to have a mpiaij matrix when adding all the matrix terms, >> and then transform the matrix to seqaij when doing the factorization >> and >> solve. This involves sending the data to the master process, and I >> think >> the petsc mumps solver have something similar already. >> Chang >> On 10/13/21 10:18 AM, Junchao Zhang wrote: >> > >> > >> > >> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams > >> > >> wrote: >> > >> > >> > >> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu > >> > >> wrote: >> > >> > Hi Mark, >> > >> > The option I use is like >> > >> > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres >> -mat_type >> > aijcusparse *-sub_pc_factor_mat_solver_type cusparse >> *-sub_ksp_type >> > preonly *-sub_pc_type lu* -ksp_max_it 2000 -ksp_rtol 1.e-300 >> > -ksp_atol 1.e-300 >> > >> > >> > Note, If you use -log_view the last column (rows are the >> method like >> > MatFactorNumeric) has the percent of work in the GPU. >> > >> > Junchao: *This* implies that we have a cuSparse LU >> factorization. Is >> > that correct? (I don't think we do) >> > >> > No, we don't have cuSparse LU factorization. If you check >> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will find it calls >> > MatLUFactorSymbolic_SeqAIJ() instead. >> > So I don't understand Chang's idea. Do you want to make bigger >> blocks? >> > >> > >> > I think this one do both factorization and solve on gpu. >> > >> > You can check the runex72_aijcusparse.sh file in petsc >> install >> > directory, and try it your self (this is only lu >> factorization >> > without >> > iterative solve). >> > >> > Chang >> > >> > On 10/12/21 1:17 PM, Mark Adams wrote: >> > > >> > > >> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu >> >> > > >> > > >> >>> wrote: >> > > >> > > Hi Junchao, >> > > >> > > No I only needs it to be transferred within a >> node. I use >> > block-Jacobi >> > > method and GMRES to solve the sparse matrix, so each >> > direct solver will >> > > take care of a sub-block of the whole matrix. In this >> > way, I can use >> > > one >> > > GPU to solve one sub-block, which is stored within >> one node. >> > > >> > > It was stated in the documentation that cusparse >> solver >> > is slow. >> > > However, in my test using ex72.c, the cusparse >> solver is >> > faster than >> > > mumps or superlu_dist on CPUs. >> > > >> > > >> > > Are we talking about the factorization, the solve, or >> both? >> > > >> > > We do not have an interface to cuSparse's LU >> factorization (I >> > just >> > > learned that it exists a few weeks ago). >> > > Perhaps your fast "cusparse solver" is '-pc_type lu >> -mat_type >> > > aijcusparse' ? This would be the CPU factorization, >> which is the >> > > dominant cost. >> > > >> > > >> > > Chang >> > > >> > > On 10/12/21 10:24 AM, Junchao Zhang wrote: >> > > > Hi, Chang, >> > > > For the mumps solver, we usually transfers >> matrix >> > and vector >> > > data >> > > > within a compute node. For the idea you >> propose, it >> > looks like >> > > we need >> > > > to gather data within MPI_COMM_WORLD, right? >> > > > >> > > > Mark, I remember you said cusparse solve is >> slow >> > and you would >> > > > rather do it on CPU. Is it right? 
>> > > > >> > > > --Junchao Zhang >> > > > >> > > > >> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via >> petsc-users >> > > > > >> > > > > >> > > >> >> > > > >> > > > > >> > > >>>> >> > > wrote: >> > > > >> > > > Hi, >> > > > >> > > > Currently, it is possible to use mumps >> solver in >> > PETSC with >> > > > -mat_mumps_use_omp_threads option, so that >> > multiple MPI >> > > processes will >> > > > transfer the matrix and rhs data to the master >> > rank, and then >> > > master >> > > > rank will call mumps with OpenMP to solve >> the matrix. >> > > > >> > > > I wonder if someone can develop similar >> option for >> > cusparse >> > > solver. >> > > > Right now, this solver does not work with >> > mpiaijcusparse. I >> > > think a >> > > > possible workaround is to transfer all the >> matrix >> > data to one MPI >> > > > process, and then upload the data to GPU to >> solve. >> > In this >> > > way, one can >> > > > use cusparse solver for a MPI program. >> > > > >> > > > Chang >> > > > -- >> > > > Chang Liu >> > > > Staff Research Physicist >> > > > +1 609 243 3438 >> > > > cliu at pppl.gov >> > >> > >> >> >> > >> > >> > > >> >>> >> > > > Princeton Plasma Physics Laboratory >> > > > 100 Stellarator Rd, Princeton NJ 08540, USA >> > > > >> > > >> > > -- >> > > Chang Liu >> > > Staff Research Physicist >> > > +1 609 243 3438 >> > > cliu at pppl.gov >> > > >> > >> >> > > Princeton Plasma Physics Laboratory >> > > 100 Stellarator Rd, Princeton NJ 08540, USA >> > > >> > >> > -- >> > Chang Liu >> > Staff Research Physicist >> > +1 609 243 3438 >> > cliu at pppl.gov > > >> > Princeton Plasma Physics Laboratory >> > 100 Stellarator Rd, Princeton NJ 08540, USA >> > >> -- Chang Liu >> Staff Research Physicist >> +1 609 243 3438 >> cliu at pppl.gov >> Princeton Plasma Physics Laboratory >> 100 Stellarator Rd, Princeton NJ 08540, USA > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA From cliu at pppl.gov Wed Oct 13 14:50:30 2021 From: cliu at pppl.gov (Chang Liu) Date: Wed, 13 Oct 2021 15:50:30 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> Message-ID: Hi Barry, That is exactly what I want. Back to my original question, I am looking for an approach to transfer matrix data from many MPI processes to "master" MPI processes, each of which taking care of one GPU, and then upload the data to GPU to solve. One can just grab some codes from mumps.c to aijcusparse.cu. Chang On 10/13/21 1:53 PM, Barry Smith wrote: > > Chang, > > You are correct there is no MPI + GPU direct solvers that currently do the triangular solves with MPI + GPU parallelism that I am aware of. You are limited that individual triangular solves be done on a single GPU. I can only suggest making each subdomain as big as possible to utilize each GPU as much as possible for the direct triangular solves. > > Barry > > >> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users wrote: >> >> Hi Mark, >> >> '-mat_type aijcusparse' works with mpiaijcusparse with other solvers, but with -pc_factor_mat_solver_type cusparse, it will give an error. 
>> >> Yes what I want is to have mumps or superlu to do the factorization, and then do the rest, including GMRES solver, on gpu. Is that possible? >> >> I have tried to use aijcusparse with superlu_dist, it runs but the iterative solver is still running on CPUs. I have contacted the superlu group and they confirmed that is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver is running on GPU. >> >> Chang >> >> On 10/13/21 12:03 PM, Mark Adams wrote: >>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu > wrote: >>> Thank you Junchao for explaining this. I guess in my case the code is >>> just calling a seq solver like superlu to do factorization on GPUs. >>> My idea is that I want to have a traditional MPI code to utilize GPUs >>> with cusparse. Right now cusparse does not support mpiaij matrix, Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse matrix with > 1 processes. >>> (-mat_type mpiaijcusparse might also work with >1 proc). >>> However, I see in grepping the repo that all the mumps and superlu tests use aij or sell matrix type. >>> MUMPS and SuperLU provide their own solves, I assume .... but you might want to do other matrix operations on the GPU. Is that the issue? >>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU have a problem? (no test with it so it probably does not work) >>> Thanks, >>> Mark >>> so I >>> want the code to have a mpiaij matrix when adding all the matrix terms, >>> and then transform the matrix to seqaij when doing the factorization >>> and >>> solve. This involves sending the data to the master process, and I >>> think >>> the petsc mumps solver have something similar already. >>> Chang >>> On 10/13/21 10:18 AM, Junchao Zhang wrote: >>> > >>> > >>> > >>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >> >>> > >> wrote: >>> > >>> > >>> > >>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >> >>> > >> wrote: >>> > >>> > Hi Mark, >>> > >>> > The option I use is like >>> > >>> > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres >>> -mat_type >>> > aijcusparse *-sub_pc_factor_mat_solver_type cusparse >>> *-sub_ksp_type >>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 -ksp_rtol 1.e-300 >>> > -ksp_atol 1.e-300 >>> > >>> > >>> > Note, If you use -log_view the last column (rows are the >>> method like >>> > MatFactorNumeric) has the percent of work in the GPU. >>> > >>> > Junchao: *This* implies that we have a cuSparse LU >>> factorization. Is >>> > that correct? (I don't think we do) >>> > >>> > No, we don't have cuSparse LU factorization. If you check >>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will find it calls >>> > MatLUFactorSymbolic_SeqAIJ() instead. >>> > So I don't understand Chang's idea. Do you want to make bigger >>> blocks? >>> > >>> > >>> > I think this one do both factorization and solve on gpu. >>> > >>> > You can check the runex72_aijcusparse.sh file in petsc >>> install >>> > directory, and try it your self (this is only lu >>> factorization >>> > without >>> > iterative solve). >>> > >>> > Chang >>> > >>> > On 10/12/21 1:17 PM, Mark Adams wrote: >>> > > >>> > > >>> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu >>> >>> > > >>> > > >>> >>> wrote: >>> > > >>> > > Hi Junchao, >>> > > >>> > > No I only needs it to be transferred within a >>> node. I use >>> > block-Jacobi >>> > > method and GMRES to solve the sparse matrix, so each >>> > direct solver will >>> > > take care of a sub-block of the whole matrix. 
In this >>> > way, I can use >>> > > one >>> > > GPU to solve one sub-block, which is stored within >>> one node. >>> > > >>> > > It was stated in the documentation that cusparse >>> solver >>> > is slow. >>> > > However, in my test using ex72.c, the cusparse >>> solver is >>> > faster than >>> > > mumps or superlu_dist on CPUs. >>> > > >>> > > >>> > > Are we talking about the factorization, the solve, or >>> both? >>> > > >>> > > We do not have an interface to cuSparse's LU >>> factorization (I >>> > just >>> > > learned that it exists a few weeks ago). >>> > > Perhaps your fast "cusparse solver" is '-pc_type lu >>> -mat_type >>> > > aijcusparse' ? This would be the CPU factorization, >>> which is the >>> > > dominant cost. >>> > > >>> > > >>> > > Chang >>> > > >>> > > On 10/12/21 10:24 AM, Junchao Zhang wrote: >>> > > > Hi, Chang, >>> > > > For the mumps solver, we usually transfers >>> matrix >>> > and vector >>> > > data >>> > > > within a compute node. For the idea you >>> propose, it >>> > looks like >>> > > we need >>> > > > to gather data within MPI_COMM_WORLD, right? >>> > > > >>> > > > Mark, I remember you said cusparse solve is >>> slow >>> > and you would >>> > > > rather do it on CPU. Is it right? >>> > > > >>> > > > --Junchao Zhang >>> > > > >>> > > > >>> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via >>> petsc-users >>> > > > >> >>> > >> > >> >>> > >> >> >>> > > >> >>> > >> > >> >>> > >> >>>> >>> > > wrote: >>> > > > >>> > > > Hi, >>> > > > >>> > > > Currently, it is possible to use mumps >>> solver in >>> > PETSC with >>> > > > -mat_mumps_use_omp_threads option, so that >>> > multiple MPI >>> > > processes will >>> > > > transfer the matrix and rhs data to the master >>> > rank, and then >>> > > master >>> > > > rank will call mumps with OpenMP to solve >>> the matrix. >>> > > > >>> > > > I wonder if someone can develop similar >>> option for >>> > cusparse >>> > > solver. >>> > > > Right now, this solver does not work with >>> > mpiaijcusparse. I >>> > > think a >>> > > > possible workaround is to transfer all the >>> matrix >>> > data to one MPI >>> > > > process, and then upload the data to GPU to >>> solve. >>> > In this >>> > > way, one can >>> > > > use cusparse solver for a MPI program. 
>>> > > > >>> > > > Chang >>> > > > -- >>> > > > Chang Liu >>> > > > Staff Research Physicist >>> > > > +1 609 243 3438 >>> > > > cliu at pppl.gov >>> > >>> > >>> >> >>> > >>> > >>> > > >>> >>> >>> > > > Princeton Plasma Physics Laboratory >>> > > > 100 Stellarator Rd, Princeton NJ 08540, USA >>> > > > >>> > > >>> > > -- >>> > > Chang Liu >>> > > Staff Research Physicist >>> > > +1 609 243 3438 >>> > > cliu at pppl.gov >>> > >> >>> > >> >>> > > Princeton Plasma Physics Laboratory >>> > > 100 Stellarator Rd, Princeton NJ 08540, USA >>> > > >>> > >>> > -- >>> > Chang Liu >>> > Staff Research Physicist >>> > +1 609 243 3438 >>> > cliu at pppl.gov >> > >>> > Princeton Plasma Physics Laboratory >>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>> > >>> -- Chang Liu >>> Staff Research Physicist >>> +1 609 243 3438 >>> cliu at pppl.gov >>> Princeton Plasma Physics Laboratory >>> 100 Stellarator Rd, Princeton NJ 08540, USA >> >> -- >> Chang Liu >> Staff Research Physicist >> +1 609 243 3438 >> cliu at pppl.gov >> Princeton Plasma Physics Laboratory >> 100 Stellarator Rd, Princeton NJ 08540, USA > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From gsabhishek1ags at gmail.com Wed Oct 13 17:29:44 2021 From: gsabhishek1ags at gmail.com (Abhishek G.S.) Date: Thu, 14 Oct 2021 03:59:44 +0530 Subject: [petsc-users] VecView DMDA and HDF5 - Unable to write out files properly Message-ID: Hi, I need some help with getting the file output working right. I am using a DMDACreate3D to initialize my DM. This is my write function void write(){ PetscViewer viewer; PetscViewerHDF5Open(PETSC_COMM_WORLD,filename.c_str(),FILE_MODE_WRITE,&viewer); DMDAVecRestoreArray(dm,global_vector,global_array) VecView(global_vec, viewer); DMDAVecGetArray(dm,global_vector,global_array); PetscViewerDestroy(&viewer); } 1) I have 2 PDE's to solve. Still, I went ahead creating a single DM with dof=1 and creating two vectors using the DMCreateGlobalVector(). I want to write the file out periodically. Should I perform DMDAVecRestoreArray and DMDAVecGetArray every time is write out the global_vector? (I know that it is just indexing the pointers and there is no copying of values. But I am not sure) 2) I am writing out to HDF5 format. I see that the vecview is supposed to reorder the global_vector based on the DM. However, when I read the H5 files, I get an error on ViSIT and my output image becomes a 1D image rather than a 2D/3D. What might be the reason for this ?. Error Msg : "In domain 0, your zonal variable "avtGhostZones" has 25600 values, but it should have 160. Some values were removed to ensure VisIt runs smoothly" I was using a 160x160x1 DM 3) I tried using the "petsc_gen_xdmf.py" to generate the xdmf files for use in Paraview. Here the key ["viz/geometry"] is missing. The keys present in the output H5 file are just the two vectors I am writing and has no info about mesh. Isn't this supposed to come automatically since the vector is attached to the DM? How do I sort this out? 4) Can I have multiple vectors attached to the DM by DMCreateGlobalVector() even though I created the DMDA using dof=1. thanks, Abhishek -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bsmith at petsc.dev Wed Oct 13 18:53:28 2021 From: bsmith at petsc.dev (Barry Smith) Date: Wed, 13 Oct 2021 19:53:28 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> Message-ID: <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> > On Oct 13, 2021, at 3:50 PM, Chang Liu wrote: > > Hi Barry, > > That is exactly what I want. > > Back to my original question, I am looking for an approach to transfer > matrix > data from many MPI processes to "master" MPI > processes, each of which taking care of one GPU, and then upload the data to GPU to > solve. > One can just grab some codes from mumps.c to aijcusparse.cu. mumps.c doesn't actually do that. It never needs to copy the entire matrix to a single MPI rank. It would be possible to write such a code that you suggest but it is not clear that it makes sense 1) For normal PETSc GPU usage there is one GPU per MPI rank, so while your one GPU per big domain is solving its systems the other GPUs (with the other MPI ranks that share that domain) are doing nothing. 2) For each triangular solve you would have to gather the right hand side from the multiple ranks to the single GPU to pass it to the GPU solver and then scatter the resulting solution back to all of its subdomain ranks. What I was suggesting was assign an entire subdomain to a single MPI rank, thus it does everything on one GPU and can use the GPU solver directly. If all the major computations of a subdomain can fit and be done on a single GPU then you would be utilizing all the GPUs you are using effectively. Barry > > Chang > > On 10/13/21 1:53 PM, Barry Smith wrote: >> Chang, >> You are correct there is no MPI + GPU direct solvers that currently do the triangular solves with MPI + GPU parallelism that I am aware of. You are limited that individual triangular solves be done on a single GPU. I can only suggest making each subdomain as big as possible to utilize each GPU as much as possible for the direct triangular solves. >> Barry >>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users wrote: >>> >>> Hi Mark, >>> >>> '-mat_type aijcusparse' works with mpiaijcusparse with other solvers, but with -pc_factor_mat_solver_type cusparse, it will give an error. >>> >>> Yes what I want is to have mumps or superlu to do the factorization, and then do the rest, including GMRES solver, on gpu. Is that possible? >>> >>> I have tried to use aijcusparse with superlu_dist, it runs but the iterative solver is still running on CPUs. I have contacted the superlu group and they confirmed that is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver is running on GPU. >>> >>> Chang >>> >>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu > wrote: >>>> Thank you Junchao for explaining this. I guess in my case the code is >>>> just calling a seq solver like superlu to do factorization on GPUs. >>>> My idea is that I want to have a traditional MPI code to utilize GPUs >>>> with cusparse. Right now cusparse does not support mpiaij matrix, Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse matrix with > 1 processes. >>>> (-mat_type mpiaijcusparse might also work with >1 proc). 
>>>> However, I see in grepping the repo that all the mumps and superlu tests use aij or sell matrix type. >>>> MUMPS and SuperLU provide their own solves, I assume .... but you might want to do other matrix operations on the GPU. Is that the issue? >>>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU have a problem? (no test with it so it probably does not work) >>>> Thanks, >>>> Mark >>>> so I >>>> want the code to have a mpiaij matrix when adding all the matrix terms, >>>> and then transform the matrix to seqaij when doing the factorization >>>> and >>>> solve. This involves sending the data to the master process, and I >>>> think >>>> the petsc mumps solver have something similar already. >>>> Chang >>>> On 10/13/21 10:18 AM, Junchao Zhang wrote: >>>> > >>>> > >>>> > >>>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >>> >>>> > >> wrote: >>>> > >>>> > >>>> > >>>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >>> >>>> > >> wrote: >>>> > >>>> > Hi Mark, >>>> > >>>> > The option I use is like >>>> > >>>> > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres >>>> -mat_type >>>> > aijcusparse *-sub_pc_factor_mat_solver_type cusparse >>>> *-sub_ksp_type >>>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 -ksp_rtol 1.e-300 >>>> > -ksp_atol 1.e-300 >>>> > >>>> > >>>> > Note, If you use -log_view the last column (rows are the >>>> method like >>>> > MatFactorNumeric) has the percent of work in the GPU. >>>> > >>>> > Junchao: *This* implies that we have a cuSparse LU >>>> factorization. Is >>>> > that correct? (I don't think we do) >>>> > >>>> > No, we don't have cuSparse LU factorization. If you check >>>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will find it calls >>>> > MatLUFactorSymbolic_SeqAIJ() instead. >>>> > So I don't understand Chang's idea. Do you want to make bigger >>>> blocks? >>>> > >>>> > >>>> > I think this one do both factorization and solve on gpu. >>>> > >>>> > You can check the runex72_aijcusparse.sh file in petsc >>>> install >>>> > directory, and try it your self (this is only lu >>>> factorization >>>> > without >>>> > iterative solve). >>>> > >>>> > Chang >>>> > >>>> > On 10/12/21 1:17 PM, Mark Adams wrote: >>>> > > >>>> > > >>>> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu >>>> >>>> > > >>>> > > >>>> >>> wrote: >>>> > > >>>> > > Hi Junchao, >>>> > > >>>> > > No I only needs it to be transferred within a >>>> node. I use >>>> > block-Jacobi >>>> > > method and GMRES to solve the sparse matrix, so each >>>> > direct solver will >>>> > > take care of a sub-block of the whole matrix. In this >>>> > way, I can use >>>> > > one >>>> > > GPU to solve one sub-block, which is stored within >>>> one node. >>>> > > >>>> > > It was stated in the documentation that cusparse >>>> solver >>>> > is slow. >>>> > > However, in my test using ex72.c, the cusparse >>>> solver is >>>> > faster than >>>> > > mumps or superlu_dist on CPUs. >>>> > > >>>> > > >>>> > > Are we talking about the factorization, the solve, or >>>> both? >>>> > > >>>> > > We do not have an interface to cuSparse's LU >>>> factorization (I >>>> > just >>>> > > learned that it exists a few weeks ago). >>>> > > Perhaps your fast "cusparse solver" is '-pc_type lu >>>> -mat_type >>>> > > aijcusparse' ? This would be the CPU factorization, >>>> which is the >>>> > > dominant cost. 
>>>> > > >>>> > > >>>> > > Chang >>>> > > >>>> > > On 10/12/21 10:24 AM, Junchao Zhang wrote: >>>> > > > Hi, Chang, >>>> > > > For the mumps solver, we usually transfers >>>> matrix >>>> > and vector >>>> > > data >>>> > > > within a compute node. For the idea you >>>> propose, it >>>> > looks like >>>> > > we need >>>> > > > to gather data within MPI_COMM_WORLD, right? >>>> > > > >>>> > > > Mark, I remember you said cusparse solve is >>>> slow >>>> > and you would >>>> > > > rather do it on CPU. Is it right? >>>> > > > >>>> > > > --Junchao Zhang >>>> > > > >>>> > > > >>>> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via >>>> petsc-users >>>> > > > >>> >>>> > >>> > >>> >>>> > >>> >> >>>> > > >>> >>>> > >>> > >>> >>>> > >>> >>>> >>>> > > wrote: >>>> > > > >>>> > > > Hi, >>>> > > > >>>> > > > Currently, it is possible to use mumps >>>> solver in >>>> > PETSC with >>>> > > > -mat_mumps_use_omp_threads option, so that >>>> > multiple MPI >>>> > > processes will >>>> > > > transfer the matrix and rhs data to the master >>>> > rank, and then >>>> > > master >>>> > > > rank will call mumps with OpenMP to solve >>>> the matrix. >>>> > > > >>>> > > > I wonder if someone can develop similar >>>> option for >>>> > cusparse >>>> > > solver. >>>> > > > Right now, this solver does not work with >>>> > mpiaijcusparse. I >>>> > > think a >>>> > > > possible workaround is to transfer all the >>>> matrix >>>> > data to one MPI >>>> > > > process, and then upload the data to GPU to >>>> solve. >>>> > In this >>>> > > way, one can >>>> > > > use cusparse solver for a MPI program. >>>> > > > >>>> > > > Chang >>>> > > > -- >>>> > > > Chang Liu >>>> > > > Staff Research Physicist >>>> > > > +1 609 243 3438 >>>> > > > cliu at pppl.gov >>>> > >>>> > >>>> >> >>>> > >>>> > >>>> > > >>>> >>> >>>> > > > Princeton Plasma Physics Laboratory >>>> > > > 100 Stellarator Rd, Princeton NJ 08540, USA >>>> > > > >>>> > > >>>> > > -- >>>> > > Chang Liu >>>> > > Staff Research Physicist >>>> > > +1 609 243 3438 >>>> > > cliu at pppl.gov >>>> > >>> >>>> > >> >>>> > > Princeton Plasma Physics Laboratory >>>> > > 100 Stellarator Rd, Princeton NJ 08540, USA >>>> > > >>>> > >>>> > -- >>>> > Chang Liu >>>> > Staff Research Physicist >>>> > +1 609 243 3438 >>>> > cliu at pppl.gov >>> > >>>> > Princeton Plasma Physics Laboratory >>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>>> > >>>> -- Chang Liu >>>> Staff Research Physicist >>>> +1 609 243 3438 >>>> cliu at pppl.gov >>>> Princeton Plasma Physics Laboratory >>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>> >>> -- >>> Chang Liu >>> Staff Research Physicist >>> +1 609 243 3438 >>> cliu at pppl.gov >>> Princeton Plasma Physics Laboratory >>> 100 Stellarator Rd, Princeton NJ 08540, USA > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA From mfadams at lbl.gov Wed Oct 13 19:29:41 2021 From: mfadams at lbl.gov (Mark Adams) Date: Wed, 13 Oct 2021 20:29:41 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> Message-ID: On Wed, Oct 13, 2021 at 1:53 PM Barry Smith wrote: > > Chang, > > You are correct there is no MPI + GPU direct 
solvers that currently do > the triangular solves with MPI + GPU parallelism that I am aware of. So SuperLU and MUMPS do MPI solves on the CPU. That is reasonable. I have not been able to get decent performance with GPU solves. Complex code and low AI is not a good fit for GPUs. No work and all latency. Chang, you would find that GPU solves suck and, anyway, machines these days are configured with significant (high quality) CPU resources. I think you would find that you can't get GPU solves to beat CPU solves, except if you have enormous problems to solve, perhaps. > You are limited that individual triangular solves be done on a single GPU. > I can only suggest making each subdomain as big as possible to utilize each > GPU as much as possible for the direct triangular solves. > > Barry > > > > On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users < > petsc-users at mcs.anl.gov> wrote: > > > > Hi Mark, > > > > '-mat_type aijcusparse' works with mpiaijcusparse with other solvers, > but with -pc_factor_mat_solver_type cusparse, it will give an error. > > > > Yes what I want is to have mumps or superlu to do the factorization, and > then do the rest, including GMRES solver, on gpu. Is that possible? > > > > I have tried to use aijcusparse with superlu_dist, it runs but the > iterative solver is still running on CPUs. I have contacted the superlu > group and they confirmed that is the case right now. But if I set > -pc_factor_mat_solver_type cusparse, it seems that the iterative solver is > running on GPU. > > > > Chang > > > > On 10/13/21 12:03 PM, Mark Adams wrote: > >> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu cliu at pppl.gov>> wrote: > >> Thank you Junchao for explaining this. I guess in my case the code is > >> just calling a seq solver like superlu to do factorization on GPUs. > >> My idea is that I want to have a traditional MPI code to utilize GPUs > >> with cusparse. Right now cusparse does not support mpiaij matrix, > Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse > matrix with > 1 processes. > >> (-mat_type mpiaijcusparse might also work with >1 proc). > >> However, I see in grepping the repo that all the mumps and superlu > tests use aij or sell matrix type. > >> MUMPS and SuperLU provide their own solves, I assume .... but you might > want to do other matrix operations on the GPU. Is that the issue? > >> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU have a > problem? (no test with it so it probably does not work) > >> Thanks, > >> Mark > >> so I > >> want the code to have a mpiaij matrix when adding all the matrix > terms, > >> and then transform the matrix to seqaij when doing the factorization > >> and > >> solve. This involves sending the data to the master process, and I > >> think > >> the petsc mumps solver have something similar already. 
> >> Chang > >> On 10/13/21 10:18 AM, Junchao Zhang wrote: > >> > > >> > > >> > > >> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >> > >> > >> wrote: > >> > > >> > > >> > > >> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >> > >> > >> wrote: > >> > > >> > Hi Mark, > >> > > >> > The option I use is like > >> > > >> > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres > >> -mat_type > >> > aijcusparse *-sub_pc_factor_mat_solver_type cusparse > >> *-sub_ksp_type > >> > preonly *-sub_pc_type lu* -ksp_max_it 2000 -ksp_rtol > 1.e-300 > >> > -ksp_atol 1.e-300 > >> > > >> > > >> > Note, If you use -log_view the last column (rows are the > >> method like > >> > MatFactorNumeric) has the percent of work in the GPU. > >> > > >> > Junchao: *This* implies that we have a cuSparse LU > >> factorization. Is > >> > that correct? (I don't think we do) > >> > > >> > No, we don't have cuSparse LU factorization. If you check > >> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will find it calls > >> > MatLUFactorSymbolic_SeqAIJ() instead. > >> > So I don't understand Chang's idea. Do you want to make bigger > >> blocks? > >> > > >> > > >> > I think this one do both factorization and solve on gpu. > >> > > >> > You can check the runex72_aijcusparse.sh file in petsc > >> install > >> > directory, and try it your self (this is only lu > >> factorization > >> > without > >> > iterative solve). > >> > > >> > Chang > >> > > >> > On 10/12/21 1:17 PM, Mark Adams wrote: > >> > > > >> > > > >> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu > >> > >> > > > >> > > > >> >>> wrote: > >> > > > >> > > Hi Junchao, > >> > > > >> > > No I only needs it to be transferred within a > >> node. I use > >> > block-Jacobi > >> > > method and GMRES to solve the sparse matrix, so > each > >> > direct solver will > >> > > take care of a sub-block of the whole matrix. In > this > >> > way, I can use > >> > > one > >> > > GPU to solve one sub-block, which is stored within > >> one node. > >> > > > >> > > It was stated in the documentation that cusparse > >> solver > >> > is slow. > >> > > However, in my test using ex72.c, the cusparse > >> solver is > >> > faster than > >> > > mumps or superlu_dist on CPUs. > >> > > > >> > > > >> > > Are we talking about the factorization, the solve, or > >> both? > >> > > > >> > > We do not have an interface to cuSparse's LU > >> factorization (I > >> > just > >> > > learned that it exists a few weeks ago). > >> > > Perhaps your fast "cusparse solver" is '-pc_type lu > >> -mat_type > >> > > aijcusparse' ? This would be the CPU factorization, > >> which is the > >> > > dominant cost. > >> > > > >> > > > >> > > Chang > >> > > > >> > > On 10/12/21 10:24 AM, Junchao Zhang wrote: > >> > > > Hi, Chang, > >> > > > For the mumps solver, we usually transfers > >> matrix > >> > and vector > >> > > data > >> > > > within a compute node. For the idea you > >> propose, it > >> > looks like > >> > > we need > >> > > > to gather data within MPI_COMM_WORLD, right? > >> > > > > >> > > > Mark, I remember you said cusparse solve is > >> slow > >> > and you would > >> > > > rather do it on CPU. Is it right? 
> >> > > > > >> > > > --Junchao Zhang > >> > > > > >> > > > > >> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via > >> petsc-users > >> > > > >> > >> > >> > >> > >> > >> >> > >> > > >> > >> > >> > >> > >> > >> >>>> > >> > > wrote: > >> > > > > >> > > > Hi, > >> > > > > >> > > > Currently, it is possible to use mumps > >> solver in > >> > PETSC with > >> > > > -mat_mumps_use_omp_threads option, so that > >> > multiple MPI > >> > > processes will > >> > > > transfer the matrix and rhs data to the > master > >> > rank, and then > >> > > master > >> > > > rank will call mumps with OpenMP to solve > >> the matrix. > >> > > > > >> > > > I wonder if someone can develop similar > >> option for > >> > cusparse > >> > > solver. > >> > > > Right now, this solver does not work with > >> > mpiaijcusparse. I > >> > > think a > >> > > > possible workaround is to transfer all the > >> matrix > >> > data to one MPI > >> > > > process, and then upload the data to GPU to > >> solve. > >> > In this > >> > > way, one can > >> > > > use cusparse solver for a MPI program. > >> > > > > >> > > > Chang > >> > > > -- > >> > > > Chang Liu > >> > > > Staff Research Physicist > >> > > > +1 609 243 3438 > >> > > > cliu at pppl.gov > >> > > >> > > >> >> > >> > > >> > > >> > > > >> >>> > >> > > > Princeton Plasma Physics Laboratory > >> > > > 100 Stellarator Rd, Princeton NJ 08540, USA > >> > > > > >> > > > >> > > -- > >> > > Chang Liu > >> > > Staff Research Physicist > >> > > +1 609 243 3438 > >> > > cliu at pppl.gov > >> > >> > >> > >> > >> > > Princeton Plasma Physics Laboratory > >> > > 100 Stellarator Rd, Princeton NJ 08540, USA > >> > > > >> > > >> > -- > >> > Chang Liu > >> > Staff Research Physicist > >> > +1 609 243 3438 > >> > cliu at pppl.gov >> > > >> > Princeton Plasma Physics Laboratory > >> > 100 Stellarator Rd, Princeton NJ 08540, USA > >> > > >> -- Chang Liu > >> Staff Research Physicist > >> +1 609 243 3438 > >> cliu at pppl.gov > >> Princeton Plasma Physics Laboratory > >> 100 Stellarator Rd, Princeton NJ 08540, USA > > > > -- > > Chang Liu > > Staff Research Physicist > > +1 609 243 3438 > > cliu at pppl.gov > > Princeton Plasma Physics Laboratory > > 100 Stellarator Rd, Princeton NJ 08540, USA > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cliu at pppl.gov Wed Oct 13 19:47:05 2021 From: cliu at pppl.gov (Chang Liu) Date: Wed, 13 Oct 2021 20:47:05 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> Message-ID: <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> Hi Barry, I think mumps solver in petsc does support that. You can check the documentation on "-mat_mumps_use_omp_threads" at https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html and the code enclosed by #if defined(PETSC_HAVE_OPENMP_SUPPORT) in functions MatMumpsSetUpDistRHSInfo and MatMumpsGatherNonzerosOnMaster in mumps.c 1. I understand it is ideal to do one MPI rank per GPU. However, I am working on an existing code that was developed based on MPI and the the # of mpi ranks is typically equal to # of cpu cores. We don't want to change the whole structure of the code. 2. 
What you have suggested has been coded in mumps.c. See function MatMumpsSetUpDistRHSInfo. Regards, Chang On 10/13/21 7:53 PM, Barry Smith wrote: > > >> On Oct 13, 2021, at 3:50 PM, Chang Liu wrote: >> >> Hi Barry, >> >> That is exactly what I want. >> >> Back to my original question, I am looking for an approach to transfer >> matrix >> data from many MPI processes to "master" MPI >> processes, each of which taking care of one GPU, and then upload the data to GPU to >> solve. >> One can just grab some codes from mumps.c to aijcusparse.cu. > > mumps.c doesn't actually do that. It never needs to copy the entire matrix to a single MPI rank. > > It would be possible to write such a code that you suggest but it is not clear that it makes sense > > 1) For normal PETSc GPU usage there is one GPU per MPI rank, so while your one GPU per big domain is solving its systems the other GPUs (with the other MPI ranks that share that domain) are doing nothing. > > 2) For each triangular solve you would have to gather the right hand side from the multiple ranks to the single GPU to pass it to the GPU solver and then scatter the resulting solution back to all of its subdomain ranks. > > What I was suggesting was assign an entire subdomain to a single MPI rank, thus it does everything on one GPU and can use the GPU solver directly. If all the major computations of a subdomain can fit and be done on a single GPU then you would be utilizing all the GPUs you are using effectively. > > Barry > > > >> >> Chang >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >>> Chang, >>> You are correct there is no MPI + GPU direct solvers that currently do the triangular solves with MPI + GPU parallelism that I am aware of. You are limited that individual triangular solves be done on a single GPU. I can only suggest making each subdomain as big as possible to utilize each GPU as much as possible for the direct triangular solves. >>> Barry >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users wrote: >>>> >>>> Hi Mark, >>>> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with other solvers, but with -pc_factor_mat_solver_type cusparse, it will give an error. >>>> >>>> Yes what I want is to have mumps or superlu to do the factorization, and then do the rest, including GMRES solver, on gpu. Is that possible? >>>> >>>> I have tried to use aijcusparse with superlu_dist, it runs but the iterative solver is still running on CPUs. I have contacted the superlu group and they confirmed that is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver is running on GPU. >>>> >>>> Chang >>>> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu > wrote: >>>>> Thank you Junchao for explaining this. I guess in my case the code is >>>>> just calling a seq solver like superlu to do factorization on GPUs. >>>>> My idea is that I want to have a traditional MPI code to utilize GPUs >>>>> with cusparse. Right now cusparse does not support mpiaij matrix, Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse matrix with > 1 processes. >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). >>>>> However, I see in grepping the repo that all the mumps and superlu tests use aij or sell matrix type. >>>>> MUMPS and SuperLU provide their own solves, I assume .... but you might want to do other matrix operations on the GPU. Is that the issue? >>>>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU have a problem? 
Regards,

Chang

-- 
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

From cliu at pppl.gov  Wed Oct 13 20:04:30 2021
From: cliu at pppl.gov (Chang Liu)
Date: Wed, 13 Oct 2021 21:04:30 -0400
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: 
References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov>
 <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov>
 <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov>
 <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev>
Message-ID: 

Hi Mark,

Thank you for sharing this. I totally agree that factorization and triangular solve can be slow on GPUs.
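For context, the kind of option set under discussion in this thread is roughly the following (the block count here is only illustrative):

    -ksp_type fgmres -pc_type bjacobi -pc_bjacobi_blocks 16 \
    -mat_type aijcusparse \
    -sub_ksp_type preonly -sub_pc_type lu \
    -sub_pc_factor_mat_solver_type cusparse

Adding -log_view to such a run prints, for each event such as MatFactorNumeric or MatSolve, an estimate of how much of the work ran on the GPU.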
However, I also find that other operations, such as matrix-matrix multiplication, can be very fast on a GPU, so some iterative solvers may perform well on GPUs, depending on the density and structure of the matrix.

In my tests, I found that sometimes the GPU can give a 2-3x speedup for GMRES.

Also, I think the SuperLU group has made significant progress on porting their code to GPUs recently, with impressive speedups (not yet published).

Chang

-- 
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

From junchao.zhang at gmail.com  Wed Oct 13 20:24:38 2021
From: junchao.zhang at gmail.com (Junchao Zhang)
Date: Wed, 13 Oct 2021 20:24:38 -0500
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov>
References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov>
 <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov>
 <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov>
 <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev>
 <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev>
 <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov>
Message-ID: 

Hi Chang,

I did the work in mumps. It is easy for me to understand gathering matrix rows to one process. But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that?

Thanks
--Junchao Zhang
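For reference, one rough way to gather a diagonal block of an MPIAIJ matrix onto a single "master" rank with existing PETSc calls might look like the sketch below (untested; A, master, blockStart, and blockEnd are placeholders for the assembled matrix, the chosen rank, and the global row range of the block):

    PetscErrorCode ierr;
    PetscMPIInt    rank;
    PetscInt       n;
    IS             is;
    Mat           *subA;

    ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
    /* only the master rank requests rows/columns; the call is collective on all ranks */
    n    = (rank == master) ? (blockEnd - blockStart) : 0;
    ierr = ISCreateStride(PETSC_COMM_SELF, n, blockStart, 1, &is);CHKERRQ(ierr);
    ierr = MatCreateSubMatrices(A, 1, &is, &is, MAT_INITIAL_MATRIX, &subA);CHKERRQ(ierr);
    /* on the master rank, subA[0] is a sequential copy of the block; it could be
       converted with MatConvert() to SEQAIJCUSPARSE and factored on that rank's GPU */
    ierr = ISDestroy(&is);CHKERRQ(ierr);

Moving the corresponding pieces of the right-hand side to the master rank and scattering the solution back would be a separate step, closer to what the OpenMP path in mumps.c does.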
From mfadams at lbl.gov  Wed Oct 13 20:25:17 2021
From: mfadams at lbl.gov (Mark Adams)
Date: Wed, 13 Oct 2021 21:25:17 -0400
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: 
References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov>
 <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov>
 <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov>
 <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev>
Message-ID: 

On Wed, Oct 13, 2021 at 9:04 PM Chang Liu wrote:

> Hi Mark,
>
> Thank you for sharing this. I totally agree that factorization and
> triangular solve can be slow on GPUs.

Note that factorizations have much more potential on a GPU, because there is much more work and arithmetic intensity (BLAS3 versus BLAS2 or BLAS1) than in the forward and backward solve (the solve) phases. The work complexity of a sparse PDE factorization is about O(N^2), versus about O(N^(3/2)) for the solve. That is a big difference.
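To put rough numbers on those exponents (taking them at face value): for a system with N = 10^6 unknowns, N^2 is 10^12 while N^(3/2) is 10^9, i.e. on the order of a thousand times more work in the factorization than in a single solve, which is why the factorization has far more to gain from a GPU.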
From cliu at pppl.gov  Wed Oct 13 20:32:29 2021
From: cliu at pppl.gov (Chang Liu)
Date: Wed, 13 Oct 2021 21:32:29 -0400
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: 
References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov>
 <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov>
 <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov>
 <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev>
 <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev>
 <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov>
Message-ID: <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov>

Sorry, I am not familiar with the details either. Can you please check the code in MatMumpsGatherNonzerosOnMaster in mumps.c?

Chang
-- 
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

From bsmith at petsc.dev  Wed Oct 13 20:57:55 2021
From: bsmith at petsc.dev (Barry Smith)
Date: Wed, 13 Oct 2021 21:57:55 -0400
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov>
References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov>
 <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov>
 <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov>
 <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev>
 <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev>
 <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov>
 <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov>
Message-ID: <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev>

>> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that? >> Thanks >> --Junchao Zhang >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users > wrote: >> Hi Barry, >> I think mumps solver in petsc does support that. You can check the >> documentation on "-mat_mumps_use_omp_threads" at >> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >> >> and the code enclosed by #if defined(PETSC_HAVE_OPENMP_SUPPORT) in >> functions MatMumpsSetUpDistRHSInfo and >> MatMumpsGatherNonzerosOnMaster in >> mumps.c >> 1. I understand it is ideal to do one MPI rank per GPU. However, I am >> working on an existing code that was developed based on MPI and the the >> # of mpi ranks is typically equal to # of cpu cores. We don't want to >> change the whole structure of the code. >> 2. What you have suggested has been coded in mumps.c. See function >> MatMumpsSetUpDistRHSInfo. >> Regards, >> Chang >> On 10/13/21 7:53 PM, Barry Smith wrote: >> > >> > >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu > > wrote: >> >> >> >> Hi Barry, >> >> >> >> That is exactly what I want. >> >> >> >> Back to my original question, I am looking for an approach to >> transfer >> >> matrix >> >> data from many MPI processes to "master" MPI >> >> processes, each of which taking care of one GPU, and then upload >> the data to GPU to >> >> solve. >> >> One can just grab some codes from mumps.c to aijcusparse.cu >> . >> > >> > mumps.c doesn't actually do that. It never needs to copy the >> entire matrix to a single MPI rank. >> > >> > It would be possible to write such a code that you suggest but >> it is not clear that it makes sense >> > >> > 1) For normal PETSc GPU usage there is one GPU per MPI rank, so >> while your one GPU per big domain is solving its systems the other >> GPUs (with the other MPI ranks that share that domain) are doing >> nothing. >> > >> > 2) For each triangular solve you would have to gather the right >> hand side from the multiple ranks to the single GPU to pass it to >> the GPU solver and then scatter the resulting solution back to all >> of its subdomain ranks. >> > >> > What I was suggesting was assign an entire subdomain to a >> single MPI rank, thus it does everything on one GPU and can use the >> GPU solver directly. If all the major computations of a subdomain >> can fit and be done on a single GPU then you would be utilizing all >> the GPUs you are using effectively. >> > >> > Barry >> > >> > >> > >> >> >> >> Chang >> >> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >> >>> Chang, >> >>> You are correct there is no MPI + GPU direct solvers that >> currently do the triangular solves with MPI + GPU parallelism that I >> am aware of. You are limited that individual triangular solves be >> done on a single GPU. I can only suggest making each subdomain as >> big as possible to utilize each GPU as much as possible for the >> direct triangular solves. >> >>> Barry >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users >> > wrote: >> >>>> >> >>>> Hi Mark, >> >>>> >> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with other >> solvers, but with -pc_factor_mat_solver_type cusparse, it will give >> an error. >> >>>> >> >>>> Yes what I want is to have mumps or superlu to do the >> factorization, and then do the rest, including GMRES solver, on gpu. >> Is that possible? >> >>>> >> >>>> I have tried to use aijcusparse with superlu_dist, it runs but >> the iterative solver is still running on CPUs. 
I have contacted the >> superlu group and they confirmed that is the case right now. But if >> I set -pc_factor_mat_solver_type cusparse, it seems that the >> iterative solver is running on GPU. >> >>>> >> >>>> Chang >> >>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu > > >> wrote: >> >>>>> Thank you Junchao for explaining this. I guess in my case >> the code is >> >>>>> just calling a seq solver like superlu to do >> factorization on GPUs. >> >>>>> My idea is that I want to have a traditional MPI code to >> utilize GPUs >> >>>>> with cusparse. Right now cusparse does not support mpiaij >> matrix, Sure it does: '-mat_type aijcusparse' will give you an >> mpiaijcusparse matrix with > 1 processes. >> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). >> >>>>> However, I see in grepping the repo that all the mumps and >> superlu tests use aij or sell matrix type. >> >>>>> MUMPS and SuperLU provide their own solves, I assume .... but >> you might want to do other matrix operations on the GPU. Is that the >> issue? >> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU >> have a problem? (no test with it so it probably does not work) >> >>>>> Thanks, >> >>>>> Mark >> >>>>> so I >> >>>>> want the code to have a mpiaij matrix when adding all the >> matrix terms, >> >>>>> and then transform the matrix to seqaij when doing the >> factorization >> >>>>> and >> >>>>> solve. This involves sending the data to the master >> process, and I >> >>>>> think >> >>>>> the petsc mumps solver have something similar already. >> >>>>> Chang >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote: >> >>>>> > >> >>>>> > >> >>>>> > >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >> >> >>>>> > >> >>>>> > >> >>> wrote: >> >>>>> > >> >>>>> > >> >>>>> > >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >> >> >>>>> > >> >>>>> > >> >>> wrote: >> >>>>> > >> >>>>> > Hi Mark, >> >>>>> > >> >>>>> > The option I use is like >> >>>>> > >> >>>>> > -pc_type bjacobi -pc_bjacobi_blocks 16 >> -ksp_type fgmres >> >>>>> -mat_type >> >>>>> > aijcusparse *-sub_pc_factor_mat_solver_type >> cusparse >> >>>>> *-sub_ksp_type >> >>>>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 >> -ksp_rtol 1.e-300 >> >>>>> > -ksp_atol 1.e-300 >> >>>>> > >> >>>>> > >> >>>>> > Note, If you use -log_view the last column (rows >> are the >> >>>>> method like >> >>>>> > MatFactorNumeric) has the percent of work in the GPU. >> >>>>> > >> >>>>> > Junchao: *This* implies that we have a cuSparse LU >> >>>>> factorization. Is >> >>>>> > that correct? (I don't think we do) >> >>>>> > >> >>>>> > No, we don't have cuSparse LU factorization. If you check >> >>>>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will find it >> calls >> >>>>> > MatLUFactorSymbolic_SeqAIJ() instead. >> >>>>> > So I don't understand Chang's idea. Do you want to >> make bigger >> >>>>> blocks? >> >>>>> > >> >>>>> > >> >>>>> > I think this one do both factorization and >> solve on gpu. >> >>>>> > >> >>>>> > You can check the runex72_aijcusparse.sh file >> in petsc >> >>>>> install >> >>>>> > directory, and try it your self (this is only lu >> >>>>> factorization >> >>>>> > without >> >>>>> > iterative solve). 
>> >>>>> > >> >>>>> > Chang >> >>>>> > >> >>>>> > On 10/12/21 1:17 PM, Mark Adams wrote: >> >>>>> > > >> >>>>> > > >> >>>>> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu >> >>>>> >> > >> >>>>> > >> >> >> >>>>> > > > > >> >>>>> >> >>>> wrote: >> >>>>> > > >> >>>>> > > Hi Junchao, >> >>>>> > > >> >>>>> > > No I only needs it to be transferred >> within a >> >>>>> node. I use >> >>>>> > block-Jacobi >> >>>>> > > method and GMRES to solve the sparse >> matrix, so each >> >>>>> > direct solver will >> >>>>> > > take care of a sub-block of the whole >> matrix. In this >> >>>>> > way, I can use >> >>>>> > > one >> >>>>> > > GPU to solve one sub-block, which is >> stored within >> >>>>> one node. >> >>>>> > > >> >>>>> > > It was stated in the documentation that >> cusparse >> >>>>> solver >> >>>>> > is slow. >> >>>>> > > However, in my test using ex72.c, the >> cusparse >> >>>>> solver is >> >>>>> > faster than >> >>>>> > > mumps or superlu_dist on CPUs. >> >>>>> > > >> >>>>> > > >> >>>>> > > Are we talking about the factorization, the >> solve, or >> >>>>> both? >> >>>>> > > >> >>>>> > > We do not have an interface to cuSparse's LU >> >>>>> factorization (I >> >>>>> > just >> >>>>> > > learned that it exists a few weeks ago). >> >>>>> > > Perhaps your fast "cusparse solver" is >> '-pc_type lu >> >>>>> -mat_type >> >>>>> > > aijcusparse' ? This would be the CPU >> factorization, >> >>>>> which is the >> >>>>> > > dominant cost. >> >>>>> > > >> >>>>> > > >> >>>>> > > Chang >> >>>>> > > >> >>>>> > > On 10/12/21 10:24 AM, Junchao Zhang wrote: >> >>>>> > > > Hi, Chang, >> >>>>> > > > For the mumps solver, we usually >> transfers >> >>>>> matrix >> >>>>> > and vector >> >>>>> > > data >> >>>>> > > > within a compute node. For the idea you >> >>>>> propose, it >> >>>>> > looks like >> >>>>> > > we need >> >>>>> > > > to gather data within >> MPI_COMM_WORLD, right? >> >>>>> > > > >> >>>>> > > > Mark, I remember you said >> cusparse solve is >> >>>>> slow >> >>>>> > and you would >> >>>>> > > > rather do it on CPU. Is it right? >> >>>>> > > > >> >>>>> > > > --Junchao Zhang >> >>>>> > > > >> >>>>> > > > >> >>>>> > > > On Mon, Oct 11, 2021 at 10:25 PM >> Chang Liu via >> >>>>> petsc-users >> >>>>> > > > > >> >>>>> > > >> >>>>> > > >> >>>>> > >> > >> >>>>> > > >> >>>>> > > >> >>>>> > >>> >> >>>>> > > > >> >>>>> > > >> >>>>> > > >> >>>>> > >> > >> >>>>> > > >> >>>>> > > >> >>>>> > >>>>> >> >>>>> > > wrote: >> >>>>> > > > >> >>>>> > > > Hi, >> >>>>> > > > >> >>>>> > > > Currently, it is possible to use >> mumps >> >>>>> solver in >> >>>>> > PETSC with >> >>>>> > > > -mat_mumps_use_omp_threads >> option, so that >> >>>>> > multiple MPI >> >>>>> > > processes will >> >>>>> > > > transfer the matrix and rhs data >> to the master >> >>>>> > rank, and then >> >>>>> > > master >> >>>>> > > > rank will call mumps with OpenMP >> to solve >> >>>>> the matrix. >> >>>>> > > > >> >>>>> > > > I wonder if someone can develop >> similar >> >>>>> option for >> >>>>> > cusparse >> >>>>> > > solver. >> >>>>> > > > Right now, this solver does not >> work with >> >>>>> > mpiaijcusparse. I >> >>>>> > > think a >> >>>>> > > > possible workaround is to >> transfer all the >> >>>>> matrix >> >>>>> > data to one MPI >> >>>>> > > > process, and then upload the >> data to GPU to >> >>>>> solve. >> >>>>> > In this >> >>>>> > > way, one can >> >>>>> > > > use cusparse solver for a MPI >> program. 
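[For concreteness, the multi-rank-block setup Barry describes above (block Jacobi with one block shared by several MPI ranks, each block handled by a parallel direct solver through PCSetUp_BJacobi_Multiproc()) corresponds to options along the following lines. This is only a sketch: the rank count, block count, the executable name, and the choice of MUMPS as the per-block solver are illustrative.

    # 64 ranks, 16 blocks -> each block is shared by 4 ranks and is factored
    # and solved by a parallel direct solver (here MUMPS) on those ranks.
    mpiexec -n 64 ./app -ksp_type fgmres \
      -pc_type bjacobi -pc_bjacobi_blocks 16 \
      -sub_ksp_type preonly \
      -sub_pc_type lu -sub_pc_factor_mat_solver_type mumps

The GPU variant under discussion would keep this layout but hand each block's factorization and triangular solves to a single GPU, which is the piece that does not exist today.]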
>> >>>>> > > > >> >>>>> > > > Chang >> >>>>> > > > -- >> >>>>> > > > Chang Liu >> >>>>> > > > Staff Research Physicist >> >>>>> > > > +1 609 243 3438 >> >>>>> > > > cliu at pppl.gov >> > >> >>>>> >> >> >> >>>>> > >> > >> >>>>> >> >>> >> >>>>> > >> > >> >>>>> >> >> >> >>>>> > > > > >> >>>>> >> >>>> >> >>>>> > > > Princeton Plasma Physics Laboratory >> >>>>> > > > 100 Stellarator Rd, Princeton NJ >> 08540, USA >> >>>>> > > > >> >>>>> > > >> >>>>> > > -- >> >>>>> > > Chang Liu >> >>>>> > > Staff Research Physicist >> >>>>> > > +1 609 243 3438 >> >>>>> > > cliu at pppl.gov >> > >> >>>>> >> >> > >> >>>>> > >> >>>>> > >> >>> >> >>>>> > > Princeton Plasma Physics Laboratory >> >>>>> > > 100 Stellarator Rd, Princeton NJ 08540, USA >> >>>>> > > >> >>>>> > >> >>>>> > -- >> >>>>> > Chang Liu >> >>>>> > Staff Research Physicist >> >>>>> > +1 609 243 3438 >> >>>>> > cliu at pppl.gov >> > > >> >>>>> >> >> >>>>> > Princeton Plasma Physics Laboratory >> >>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >> >>>>> > >> >>>>> -- Chang Liu >> >>>>> Staff Research Physicist >> >>>>> +1 609 243 3438 >> >>>>> cliu at pppl.gov > > >> >>>>> Princeton Plasma Physics Laboratory >> >>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >> >>>> >> >>>> -- >> >>>> Chang Liu >> >>>> Staff Research Physicist >> >>>> +1 609 243 3438 >> >>>> cliu at pppl.gov >> >>>> Princeton Plasma Physics Laboratory >> >>>> 100 Stellarator Rd, Princeton NJ 08540, USA >> >> >> >> -- >> >> Chang Liu >> >> Staff Research Physicist >> >> +1 609 243 3438 >> >> cliu at pppl.gov >> >> Princeton Plasma Physics Laboratory >> >> 100 Stellarator Rd, Princeton NJ 08540, USA >> > >> -- Chang Liu >> Staff Research Physicist >> +1 609 243 3438 >> cliu at pppl.gov >> Princeton Plasma Physics Laboratory >> 100 Stellarator Rd, Princeton NJ 08540, USA > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA From a.croucher at auckland.ac.nz Wed Oct 13 22:19:34 2021 From: a.croucher at auckland.ac.nz (Adrian Croucher) Date: Thu, 14 Oct 2021 16:19:34 +1300 Subject: [petsc-users] HDF5 timestepping in PETSc 3.16 Message-ID: hi I am just testing out PETSc 3.16 and making the necessary changes to my code. Amongst other things I now have to add a PetscViewerHDF5PushTimestepping() call before starting to output time-dependent results to HDF5 using a PetscViewer. I now also have to add this call before reading in sets of previously computed time-dependent results (for restarting a simulation from the results of a previous run). The problem with this is that if I try to read in the results of any previous run, computed with an earlier version of PETSc (< 3.16), an error is raised because the time-dependent datasets in the file do not have the 'timestepping' attribute. Is there something else I need to do to make this work? 
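For reference, the read side of the restart code now looks roughly like this (a sketch only: the file name, dataset name, Vec and timestep index are placeholders, and error checking is omitted):

    PetscViewer viewer;
    /* u is the solution Vec, restart_step the index of the results to restart from */
    PetscViewerHDF5Open(PETSC_COMM_WORLD, "previous_run.h5", FILE_MODE_READ, &viewer);
    /* new in 3.16: timestepping must be pushed before touching time-dependent datasets */
    PetscViewerHDF5PushTimestepping(viewer);
    PetscViewerHDF5SetTimestep(viewer, restart_step);
    PetscObjectSetName((PetscObject) u, "solution");  /* dataset name written by the old run */
    VecLoad(u, viewer);
    PetscViewerHDF5PopTimestepping(viewer);
    PetscViewerDestroy(&viewer);

This works for files written with 3.16, but fails on files written by older PETSc versions because their datasets lack the 'timestepping' attribute, as described above.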
- Adrian -- Dr Adrian Croucher Senior Research Fellow Department of Engineering Science University of Auckland, New Zealand email: a.croucher at auckland.ac.nz tel: +64 (0)9 923 4611 From junchao.zhang at gmail.com Wed Oct 13 22:42:05 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Wed, 13 Oct 2021 22:42:05 -0500 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> Message-ID: On Wed, Oct 13, 2021 at 8:58 PM Barry Smith wrote: > > Junchao, > > If I understand correctly Chang is using the block Jacobi method with > a single block for a number of MPI ranks and a direct solver for each block > so it uses PCSetUp_BJacobi_Multiproc() which is code Hong Zhang wrote a > number of years ago for CPUs. For their particular problems this > preconditioner works well, but using an iterative solver on the blocks does > not work well. > > If we had complete MPI-GPU direct solvers he could just use the > current code with MPIAIJCUSPARSE on each block but since we do not he would > like to use a single GPU for each block, this means that diagonal blocks > of the global parallel MPI matrix needs to be sent to a subset of the GPUs > (one GPU per block, which has multiple MPI ranks associated with the > blocks). Similarly for the triangular solves the blocks of the right hand > side needs to be shipped to the appropriate GPU and the resulting solution > shipped back to the multiple GPUs. So Chang is absolutely correct, this is > somewhat like your code for MUMPS with OpenMP. OK, I now understand the background.. One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the MPI > ranks and then shrink each block down to a single GPU but this would be > pretty inefficient, ideally one would go directly from the big MPI matrix > on all the GPUs to the sub matrices on the subset of GPUs. But this may be > a large coding project. > I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. > > Barry > > Since the matrices being factored and solved directly are relatively large > it is possible that the cusparse code could be reasonably efficient (they > are not the tiny problems one gets at the coarse level of multigrid). Of > course, this is speculation, I don't actually know how much better the > cusparse code would be on the direct solver than a good CPU direct sparse > solver. > > > On Oct 13, 2021, at 9:32 PM, Chang Liu wrote: > > > > Sorry I am not familiar with the details either. Can you please check > the code in MatMumpsGatherNonzerosOnMaster in mumps.c? 
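[As a concrete illustration of the split Junchao describes above (numeric LU factorization on the CPU, triangular solves through cuSPARSE on the GPU), a single rank, or a block owned by a single rank, can already be run with options like the following (a sketch; the executable name is a placeholder):

    ./app -mat_type aijcusparse -vec_type cuda \
      -ksp_type gmres -pc_type lu

Moving a block that currently lives on several ranks onto one of them, so that this path can be used per block, is the missing piece being discussed.]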
> > cliu at pppl.gov > > Princeton Plasma Physics Laboratory > > 100 Stellarator Rd, Princeton NJ 08540, USA > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cliu at pppl.gov Wed Oct 13 23:00:57 2021 From: cliu at pppl.gov (Chang Liu) Date: Thu, 14 Oct 2021 00:00:57 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> Message-ID: Hi Junchao, Yes that is what I want. Chang On 10/13/21 11:42 PM, Junchao Zhang wrote: > > > > On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: > > > ? Junchao, > > ? ? ?If I understand correctly Chang is using the block Jacobi > method with a single block for a number of MPI ranks and a direct > solver for each block so it uses PCSetUp_BJacobi_Multiproc() which > is code Hong Zhang wrote a number of years ago for CPUs. For their > particular problems this preconditioner works well, but using an > iterative solver on the blocks does not work well. > > ? ? ?If we had complete MPI-GPU direct solvers he could just use > the current code with MPIAIJCUSPARSE on each block but since we do > not he would like to use a single GPU for each block, this means > that diagonal blocks of? the global parallel MPI matrix needs to be > sent to a subset of the GPUs (one GPU per block, which has multiple > MPI ranks associated with the blocks). Similarly for the triangular > solves the blocks of the right hand side needs to be shipped to the > appropriate GPU and the resulting solution shipped back to the > multiple GPUs. So Chang is absolutely correct, this is somewhat like > your code for MUMPS with OpenMP. > > OK, I now?understand the?background.. > > One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the > MPI ranks and then shrink each block down to a single GPU but this > would be pretty inefficient, ideally one would go directly from the > big MPI matrix on all the GPUs to the sub matrices on the subset of > GPUs. But this may be a large coding project. > > I don't understand these sentences. Why do you say "shrink"? In my mind, > we just need to move each block (submatrix) living over multiple MPI > ranks to one of them and solve directly there.? In other?words, we keep > blocks' size, no shrinking or expanding. > As mentioned before, cusparse does not provide LU factorization. So the > LU factorization would be done on CPU, and the solve?be done on GPU. I > assume Chang wants to gain from the (potential) faster solve (instead of > factorization) on GPU. > > > ? Barry > > Since the matrices being factored and solved directly are relatively > large it is possible that the cusparse code could be reasonably > efficient (they are not the tiny problems one gets at the coarse > level of multigrid). Of course, this is speculation, I don't > actually know how much better the cusparse code would be on the > direct solver than a good CPU direct sparse solver. > > > On Oct 13, 2021, at 9:32 PM, Chang Liu > wrote: > > > > Sorry I am not familiar with the details either. Can you please > check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? 
100 Stellarator Rd, Princeton NJ 08540, USA > > > > -- > > Chang Liu > > Staff Research Physicist > > +1 609 243 3438 > > cliu at pppl.gov > > Princeton Plasma Physics Laboratory > > 100 Stellarator Rd, Princeton NJ 08540, USA > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From pierre at joliv.et Thu Oct 14 00:35:58 2021 From: pierre at joliv.et (Pierre Jolivet) Date: Thu, 14 Oct 2021 07:35:58 +0200 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <6f19664b-4dc6-33d6-83a4-fdd5c08d4649@pppl.gov> <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> Message-ID: Maybe I?m missing something, but can?t you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block? -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu This does not work with MUMPS -mat_mumps_use_omp_threads because not only do the Mat needs to be redistributed, the secondary processes also need to be ?converted? to OpenMP threads. Thus the need for specific code in mumps.c. Thanks, Pierre > On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote: > > Hi Junchao, > > Yes that is what I want. > > Chang > > On 10/13/21 11:42 PM, Junchao Zhang wrote: >> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: >> Junchao, >> If I understand correctly Chang is using the block Jacobi >> method with a single block for a number of MPI ranks and a direct >> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which >> is code Hong Zhang wrote a number of years ago for CPUs. For their >> particular problems this preconditioner works well, but using an >> iterative solver on the blocks does not work well. >> If we had complete MPI-GPU direct solvers he could just use >> the current code with MPIAIJCUSPARSE on each block but since we do >> not he would like to use a single GPU for each block, this means >> that diagonal blocks of the global parallel MPI matrix needs to be >> sent to a subset of the GPUs (one GPU per block, which has multiple >> MPI ranks associated with the blocks). Similarly for the triangular >> solves the blocks of the right hand side needs to be shipped to the >> appropriate GPU and the resulting solution shipped back to the >> multiple GPUs. So Chang is absolutely correct, this is somewhat like >> your code for MUMPS with OpenMP. OK, I now understand the background.. >> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the >> MPI ranks and then shrink each block down to a single GPU but this >> would be pretty inefficient, ideally one would go directly from the >> big MPI matrix on all the GPUs to the sub matrices on the subset of >> GPUs. But this may be a large coding project. >> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. >> As mentioned before, cusparse does not provide LU factorization. 
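[Spelled out together with the block Jacobi options used earlier in the thread, the PCTELESCOPE suggestion would look roughly like this. This is only a sketch: the rank count, block count and reduction factor are illustrative, and whether the gathered block can then be handed to the cusparse factorization/solve on the GPU is exactly what would need to be tested.

    # 64 ranks, 4 blocks -> 16 ranks per block; within each block PCTELESCOPE
    # gathers the block matrix onto one rank, which factors and solves it there.
    mpiexec -n 64 ./app -ksp_type fgmres \
      -pc_type bjacobi -pc_bjacobi_blocks 4 \
      -sub_ksp_type preonly \
      -sub_pc_type telescope -sub_pc_telescope_reduction_factor 16 \
      -sub_telescope_ksp_type preonly \
      -sub_telescope_pc_type lu]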
So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU.

> Barry
>
> Since the matrices being factored and solved directly are relatively large it is possible that the cusparse code could be reasonably efficient (they are not the tiny problems one gets at the coarse level of multigrid). Of course, this is speculation, I don't actually know how much better the cusparse code would be on the direct solver than a good CPU direct sparse solver.
>
>> On Oct 13, 2021, at 9:32 PM, Chang Liu wrote:
>>
>> Sorry I am not familiar with the details either. Can you please check the code in MatMumpsGatherNonzerosOnMaster in mumps.c?
>>
>> Chang
>>
>> On 10/13/21 9:24 PM, Junchao Zhang wrote:
>>> Hi Chang,
>>>   I did the work in mumps. It is easy for me to understand gathering matrix rows to one process.
>>>   But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that?
>>>   Thanks
>>>   --Junchao Zhang
>>>
>>> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users wrote:
>>>> Hi Barry,
>>>> I think mumps solver in petsc does support that. You can check the documentation on "-mat_mumps_use_omp_threads" at https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html and the code enclosed by #if defined(PETSC_HAVE_OPENMP_SUPPORT) in functions MatMumpsSetUpDistRHSInfo and MatMumpsGatherNonzerosOnMaster in mumps.c
>>>> 1. I understand it is ideal to do one MPI rank per GPU. However, I am working on an existing code that was developed based on MPI and the # of mpi ranks is typically equal to # of cpu cores. We don't want to change the whole structure of the code.
>>>> 2. What you have suggested has been coded in mumps.c. See function MatMumpsSetUpDistRHSInfo.
>>>> Regards,
>>>> Chang
>>>>
>>>> On 10/13/21 7:53 PM, Barry Smith wrote:
>>>>> On Oct 13, 2021, at 3:50 PM, Chang Liu wrote:
>>>>>> Hi Barry,
>>>>>> That is exactly what I want.
>>>>>> Back to my original question, I am looking for an approach to transfer matrix data from many MPI processes to "master" MPI processes, each of which taking care of one GPU, and then upload the data to GPU to solve. One can just grab some codes from mumps.c to aijcusparse.cu.
>>>>>
>>>>> mumps.c doesn't actually do that. It never needs to copy the entire matrix to a single MPI rank.
>>>>>
>>>>> It would be possible to write such a code that you suggest but it is not clear that it makes sense.
>>>>>
>>>>> 1) For normal PETSc GPU usage there is one GPU per MPI rank, so while your one GPU per big domain is solving its systems the other GPUs (with the other MPI ranks that share that domain) are doing nothing.
>>>>>
>>>>> 2) For each triangular solve you would have to gather the right hand side from the multiple ranks to the single GPU to pass it to the GPU solver and then scatter the resulting solution back to all of its subdomain ranks.
>>>>>
>>>>> What I was suggesting was assign an entire subdomain to a single MPI rank, thus it does everything on one GPU and can use the GPU solver directly. If all the major computations of a subdomain can fit and be done on a single GPU then you would be utilizing all the GPUs you are using effectively.
>>>>>
>>>>> Barry
>>>>>
>>>>> On 10/13/21 1:53 PM, Barry Smith wrote:
>>>>>> Chang,
>>>>>> You are correct there is no MPI + GPU direct solvers that currently do the triangular solves with MPI + GPU parallelism that I am aware of. You are limited that individual triangular solves be done on a single GPU. I can only suggest making each subdomain as big as possible to utilize each GPU as much as possible for the direct triangular solves.
>>>>>> Barry
>>>>>>
>>>>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users wrote:
>>>>>> Hi Mark,
>>>>>> '-mat_type aijcusparse' works with mpiaijcusparse with other solvers, but with -pc_factor_mat_solver_type cusparse, it will give an error.
>>>>>> Yes what I want is to have mumps or superlu to do the factorization, and then do the rest, including GMRES solver, on gpu. Is that possible?
>>>>>> I have tried to use aijcusparse with superlu_dist, it runs but the iterative solver is still running on CPUs. I have contacted the superlu group and they confirmed that is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver is running on GPU.
>>>>>> Chang
>>>>>>
>>>>>> On 10/13/21 12:03 PM, Mark Adams wrote:
>>>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu wrote:
>>>>>> Thank you Junchao for explaining this. I guess in my case the code is just calling a seq solver like superlu to do factorization on GPUs.
>>>>>> My idea is that I want to have a traditional MPI code to utilize GPUs with cusparse. Right now cusparse does not support mpiaij matrix,
>>>>>>
>>>>>> Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse matrix with > 1 processes. (-mat_type mpiaijcusparse might also work with >1 proc).
>>>>>> However, I see in grepping the repo that all the mumps and superlu tests use aij or sell matrix type.
>>>>>> MUMPS and SuperLU provide their own solves, I assume .... but you might want to do other matrix operations on the GPU. Is that the issue?
>>>>>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU have a problem? (no test with it so it probably does not work)
>>>>>> Thanks,
>>>>>> Mark
>>>>>>
>>>>>> so I want the code to have a mpiaij matrix when adding all the matrix terms, and then transform the matrix to seqaij when doing the factorization and solve. This involves sending the data to the master process, and I think the petsc mumps solver have something similar already.
>>>>>> Chang
>>>>>>
>>>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote:
>>>>>> On Tue, Oct 12, 2021 at 1:07 PM Mark Adams wrote:
>>>>>> On Tue, Oct 12, 2021 at 1:45 PM Chang Liu wrote:
>>>>>> Hi Mark,
>>>>>> The option I use is like
>>>>>> -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse -sub_ksp_type preonly -sub_pc_type lu -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300
>>>>>>
>>>>>> Note, If you use -log_view the last column (rows are the method like MatFactorNumeric) has the percent of work in the GPU.
>>>>>> Junchao: This implies that we have a cuSparse LU factorization. Is that correct? (I don't think we do)
>>>>>>
>>>>>> No, we don't have cuSparse LU factorization. If you check MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls MatLUFactorSymbolic_SeqAIJ() instead.
>>>>>> So I don't understand Chang's idea. Do you want to make bigger blocks?
>>>>>>
>>>>>> I think this one do both factorization and solve on gpu.
>>>>>> You can check the runex72_aijcusparse.sh file in petsc install directory, and try it your self (this is only lu factorization without iterative solve).
>>>>>> Chang
>>>>>>
>>>>>> On 10/12/21 1:17 PM, Mark Adams wrote:
>>>>>> On Tue, Oct 12, 2021 at 11:19 AM Chang Liu wrote:
>>>>>> Hi Junchao,
>>>>>> No I only needs it to be transferred within a node. I use block-Jacobi method and GMRES to solve the sparse matrix, so each direct solver will take care of a sub-block of the whole matrix. In this way, I can use one GPU to solve one sub-block, which is stored within one node.
>>>>>> It was stated in the documentation that cusparse solver is slow. However, in my test using ex72.c, the cusparse solver is faster than mumps or superlu_dist on CPUs.
>>>>>>
>>>>>> Are we talking about the factorization, the solve, or both?
>>>>>> We do not have an interface to cuSparse's LU factorization (I just learned that it exists a few weeks ago).
>>>>>> Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse'? This would be the CPU factorization, which is the dominant cost.
>>>>>> Chang
>>>>>>
>>>>>> On 10/12/21 10:24 AM, Junchao Zhang wrote:
>>>>>> Hi, Chang,
>>>>>> For the mumps solver, we usually transfers matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right?
>>>>>> Mark, I remember you said cusparse solve is slow and you would rather do it on CPU. Is it right?
>>>>>> --Junchao Zhang
>>>>>>
>>>>>> On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users wrote:
>>>>>> Hi,
>>>>>> Currently, it is possible to use mumps solver in PETSC with -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then master rank will call mumps with OpenMP to solve the matrix.
>>>>>> I wonder if someone can develop similar option for cusparse solver. Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to GPU to solve. In this way, one can use cusparse solver for a MPI program.
>>>>>> Chang
>>>>>> --
>>>>>> Chang Liu
>>>>>> Staff Research Physicist
>>>>>> +1 609 243 3438
>>>>>> cliu at pppl.gov
>>>>>> Princeton Plasma Physics Laboratory
>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA
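For reference, a minimal sketch of how the option set quoted above maps onto the PETSc API. This is untested illustration code, not taken from the thread: the matrix A and the vectors b, x are assumed to be assembled elsewhere, and all of the settings shown can equally be left to the command line via KSPSetFromOptions().

    #include <petscksp.h>
    /* ... A assembled as MATAIJCUSPARSE, b and x created ... */
    KSP ksp;
    PC  pc;
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPFGMRES);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCBJACOBI);
    PCBJacobiSetTotalBlocks(pc, 16, NULL);                     /* 16 blocks in total, as in the quoted options */
    KSPSetTolerances(ksp, 1.e-300, 1.e-300, PETSC_DEFAULT, 2000);
    KSPSetFromOptions(ksp);  /* picks up -sub_ksp_type preonly -sub_pc_type lu -sub_pc_factor_mat_solver_type cusparse */
    KSPSolve(ksp, b, x);

With -log_view, the last columns of rows such as MatSolve and MatLUFactorNumeric then show how much of that work actually ran on the GPU, which is the check Mark suggests above.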
From knepley at gmail.com  Thu Oct 14 07:37:35 2021
From: knepley at gmail.com (Matthew Knepley)
Date: Thu, 14 Oct 2021 08:37:35 -0400
Subject: [petsc-users] VecView DMDA and HDF5 - Unable to write out files properly
In-Reply-To: 
References: 
Message-ID: 

On Wed, Oct 13, 2021 at 6:30 PM Abhishek G.S. wrote:

> Hi,
> I need some help with getting the file output working right.
>
> I am using a DMDACreate3D to initialize my DM. This is my write function
>
> void write(){
>     PetscViewer viewer;
>     PetscViewerHDF5Open(PETSC_COMM_WORLD,filename.c_str(),FILE_MODE_WRITE,&viewer);
>     DMDAVecRestoreArray(dm,global_vector,global_array);
>     VecView(global_vec, viewer);
>     DMDAVecGetArray(dm,global_vector,global_array);
>     PetscViewerDestroy(&viewer);
> }
>
> 1) I have 2 PDE's to solve. Still, I went ahead creating a single DM with
> dof=1 and creating two vectors using the DMCreateGlobalVector(). I want to
> write the file out periodically.
> Should I perform DMDAVecRestoreArray and
> DMDAVecGetArray every time I write out the global_vector? (I know that it
> is just indexing the pointers and there is no copying of values. But I am
> not sure)

I don't think you need the Get/RestoreArray() calls here.

> 2) I am writing out to HDF5 format. I see that the vecview is supposed to
> reorder the global_vector based on the DM. However, when I read the H5
> files, I get an error on VisIt and my output image becomes a 1D image
> rather than a 2D/3D.
> What might be the reason for this?
> Error Msg : "In domain 0, your zonal variable "avtGhostZones" has 25600
> values, but it should have 160. Some values were removed to ensure VisIt
> runs smoothly"
> I was using a 160x160x1 DM

I do not believe we support HDF5 <--> Visit/Paraview for DMDA. The VecView() is just writing out the vector as a linear array without mesh details. For interfacing with the visualization, I think we use .vtu files. You should be able to get this effect using

  VecViewFromOptions(global_vec, NULL, "-vec_view");

in your code, and then

  -vec_view vtk:sol.vtu

on the command line.

> 3) I tried using the "petsc_gen_xdmf.py" to generate the xdmf files for
> use in Paraview. Here the key ["viz/geometry"] is missing. The keys present
> in the output H5 file are just the two vectors I am writing and has no info
> about mesh. Isn't this supposed to come automatically since the vector is
> attached to the DM? How do I sort this out?

This support is for unstructured grids, DMPlex and DMForest.

> 4) Can I have multiple vectors attached to the DM by
> DMCreateGlobalVector() even though I created the DMDA using dof=1.

Yes.

  Thanks,

     Matt

> thanks,
> Abhishek

-- 
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

From matteo.semplice at uninsubria.it  Thu Oct 14 08:21:49 2021
From: matteo.semplice at uninsubria.it (Matteo Semplice)
Date: Thu, 14 Oct 2021 15:21:49 +0200
Subject: [petsc-users] VecView DMDA and HDF5 - Unable to write out files properly
In-Reply-To: 
References: 
Message-ID: 

Hi.

If you want to stick with HDF5, you can also write an XDMF file with the grid information and open that in Paraview.

I am attaching some routines that I have written to do that in a solver that deals with a time dependent PDE system with 2 variables; with them I end up with a single XDMF file that Paraview can load and which contains references to all timesteps in my simulations, with each timestep being contained in an HDF5 file on its own. The idea is to call writeDomain at the beginning of the simulation, writeHDF5 for each timestep that I want to save and writeSimulationXDMF at the end. (Warning: 3D is in use, while 2D is almost untested...)

It's not the optimal solution since (1) all timesteps could be in the same HDF5 and (2) in each HDF5 I write the vectors separately and it would be better to dump the entire data in one go and interpret them as a Nx*Ny*Nz*Nvariables data from the XDMF. Nevertheless they might be a starting point for you if you want to try this approach.

    Matteo

-- 
---
Professore Associato in Analisi Numerica
Dipartimento di Scienza e Alta Tecnologia
Università degli Studi dell'Insubria
Via Valleggio, 11 - Como

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hdf5Output.cpp
Type: text/x-c++src
Size: 9482 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hdf5Output.h
Type: text/x-chdr
Size: 1085 bytes
Desc: not available
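For concreteness, here is a minimal, untested sketch of the two output routes discussed above for a DMDA vector: one .vts/.vtu file per step for Paraview/VisIt, or a single HDF5 file with a time dimension. The function names (write_step_vtk, write_step_hdf5), the dataset name "phi", and the availability of the HDF5 timestepping calls (PetscViewerHDF5PushTimestepping()/PetscViewerHDF5IncrementTimestep(), present in recent PETSc releases) are assumptions, not taken from the attached routines.

    /* Option A: one VTK file per output step (for structured DMDA grids the
       .vts extension is usually the right one; .vtu is for unstructured DMs) */
    PetscErrorCode write_step_vtk(Vec global_vec, PetscInt step)
    {
      PetscViewer viewer;
      char        fname[PETSC_MAX_PATH_LEN];
      PetscSNPrintf(fname, sizeof(fname), "sol_%04d.vts", (int)step);
      PetscViewerVTKOpen(PETSC_COMM_WORLD, fname, FILE_MODE_WRITE, &viewer);
      VecView(global_vec, viewer);   /* mesh layout is taken from the Vec's DMDA */
      PetscViewerDestroy(&viewer);
      return 0;
    }

    /* Option B: append every step into one HDF5 dataset with a time dimension.
       Setup, done once:
         PetscViewerHDF5Open(PETSC_COMM_WORLD, "sol.h5", FILE_MODE_WRITE, &viewer);
         PetscViewerHDF5PushTimestepping(viewer);                                  */
    PetscErrorCode write_step_hdf5(PetscViewer viewer, Vec global_vec)
    {
      PetscObjectSetName((PetscObject)global_vec, "phi"); /* dataset name in the .h5 file */
      VecView(global_vec, viewer);                        /* writes at the current timestep */
      PetscViewerHDF5IncrementTimestep(viewer);
      return 0;
    }

An XDMF file (hand-written or generated as in the attached routines) can then point at the "phi" dataset to give Paraview the grid information.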
From cliu at pppl.gov  Thu Oct 14 08:50:04 2021
From: cliu at pppl.gov (Chang Liu)
Date: Thu, 14 Oct 2021 09:50:04 -0400
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: 
References: <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev>
Message-ID: <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov>

Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds like exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaijcusparse? Or do I have to do it manually?

Chang

On 10/14/21 1:35 AM, Pierre Jolivet wrote:
> Maybe I'm missing something, but can't you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block?
> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu
> This does not work with MUMPS -mat_mumps_use_omp_threads because not only does the Mat need to be redistributed, the secondary processes also need to be "converted" to OpenMP threads.
> Thus the need for specific code in mumps.c.
>
> Thanks,
> Pierre
>
>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote:
>>
>> Hi Junchao,
>>
>> Yes that is what I want.
>>
>> Chang
>>
>> On 10/13/21 11:42 PM, Junchao Zhang wrote:
>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith wrote:
>>> Junchao,
>>> If I understand correctly Chang is using the block Jacobi method with a single block for a number of MPI ranks and a direct solver for each block so it uses PCSetUp_BJacobi_Multiproc() which is code Hong Zhang wrote a number of years ago for CPUs. For their particular problems this preconditioner works well, but using an iterative solver on the blocks does not work well.
>>> If we had complete MPI-GPU direct solvers he could just use the current code with MPIAIJCUSPARSE on each block but since we do not he would like to use a single GPU for each block, this means that diagonal blocks of the global parallel MPI matrix needs to be sent to a subset of the GPUs (one GPU per block, which has multiple MPI ranks associated with the blocks). Similarly for the triangular solves the blocks of the right hand side needs to be shipped to the appropriate GPU and the resulting solution shipped back to the multiple GPUs. So Chang is absolutely correct, this is somewhat like your code for MUMPS with OpenMP.
>>> OK, I now understand the background.
>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the MPI ranks and then shrink each block down to a single GPU but this would be pretty inefficient, ideally one would go directly from the big MPI matrix on all the GPUs to the sub matrices on the subset of GPUs. But this may be a large coding project.
>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding.
>>> As mentioned before, cusparse does not provide LU factorization.
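For concreteness, the kind of option composition being discussed here might look like the following. This is an untested sketch; the -sub_telescope_pc_factor_mat_solver_type name simply assumes PETSc's usual option-prefix composition for the inner factorization and is not confirmed anywhere in this thread.

    -ksp_type fgmres -mat_type aijcusparse \
    -pc_type bjacobi -pc_bjacobi_blocks 4 \
    -sub_ksp_type preonly -sub_pc_type telescope \
    -sub_pc_telescope_reduction_factor 8 \
    -sub_telescope_pc_type lu \
    -sub_telescope_pc_factor_mat_solver_type cusparse

With 4 blocks and a reduction factor of 8 this would correspond to 32 MPI ranks, each block's factorization and solve being collapsed onto a single rank (and hence a single GPU).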
-- 
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

From pierre at joliv.et  Thu Oct 14 09:04:03 2021
From: pierre at joliv.et (Pierre Jolivet)
Date: Thu, 14 Oct 2021 16:04:03 +0200
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov>
References: <7b014fb2-1115-525e-29a9-18c7bb4a0afb@pppl.gov> <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev>
 <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov>
Message-ID: 

> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote:
>
> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds like exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaijcusparse? Or do I have to do it manually?

PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat().
1) I'm not sure this is implemented for cuSparse matrices, but it should be;
2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually "smart" enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning.
If you try this out and this does not work, please provide the backtrace (probably something like "Operation XYZ not implemented for MatType ABC"), and hopefully someone can add the missing plumbing.
I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve.

Thanks,
Pierre
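Should the redistributed block come back as a plain MATSEQAIJ rather than the CUDA variant, one possible manual fallback is an in-place conversion on the sub-communicator. A minimal, untested sketch (the variable name Asub and the way it is obtained are illustrative assumptions, not something stated in this thread):

    Mat Asub;   /* the matrix PCTELESCOPE hands to the inner solver,
                   e.g. obtained via PCTelescopeGetKSP() + KSPGetOperators() */
    MatConvert(Asub, MATSEQAIJCUSPARSE, MAT_INPLACE_MATRIX, &Asub);

Whether this is needed at all depends on what MatCreateMPIMatConcatenateSeqMat() returns for cuSparse input, as discussed above.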
From knepley at gmail.com  Thu Oct 14 09:27:46 2021
From: knepley at gmail.com (Matthew Knepley)
Date: Thu, 14 Oct 2021 10:27:46 -0400
Subject: [petsc-users] VecView DMDA and HDF5 - Unable to write out files properly
In-Reply-To: 
References: 
Message-ID: 

On Thu, Oct 14, 2021 at 9:21 AM Matteo Semplice < matteo.semplice at uninsubria.it> wrote:
>
> Il 14/10/21 14:37, Matthew Knepley ha scritto:
>
> On Wed, Oct 13, 2021 at 6:30 PM Abhishek G.S. wrote:
>
>> Hi,
>> I need some help with getting the file output working right.
>> >> I am using a DMDACreate3D to initialize my DM. This is my write function >> >> void write(){ >> PetscViewer viewer; >> >> PetscViewerHDF5Open(PETSC_COMM_WORLD,filename.c_str(),FILE_MODE_WRITE,&viewer); >> DMDAVecRestoreArray(dm,global_vector,global_array) >> VecView(global_vec, viewer); >> DMDAVecGetArray(dm,global_vector,global_array); >> PetscViewerDestroy(&viewer); >> } >> >> 1) I have 2 PDE's to solve. Still, I went ahead creating a single DM with >> dof=1 and creating two vectors using the DMCreateGlobalVector(). I want to >> write the file out periodically. Should I perform DMDAVecRestoreArray and >> DMDAVecGetArray every time is write out the global_vector? (I know that it >> is just indexing the pointers and there is no copying of values. But I am >> not sure) >> > > I don't think you need the Get/RestoreArray() calls here. > > >> 2) I am writing out to HDF5 format. I see that the vecview is supposed to >> reorder the global_vector based on the DM. However, when I read the H5 >> files, I get an error on ViSIT and my output image becomes a 1D image >> rather than a 2D/3D. What might be the reason for this ?. >> Error Msg : "In domain 0, your zonal variable "avtGhostZones" has 25600 >> values, but it should have 160. Some values were removed to ensure VisIt >> runs smoothly" >> I was using a 160x160x1 DM >> > > I do not believe we support HDF5 <--> Visit/Paraview for DMDA. The > VecView() is just writing out the vector as a linear array without mesh > details. For > interfacing with the visualization, I think we use .vtu files. You should > be able to get this effect using > > VecViewFromOptions(global_vec, NULL, "-vec_view"); > > in your code, and then > > -vec_view vtk:sol.vtu > > on the command line. > > Hi. > > If you want to stick with HDF5, you can also write a XDMF file with the > grid information and open that in Paraview. > > I am attaching some routines that I have written to do that in a solver > that deals with a time dependent PDE system with 2 variables; with them I > end up with a single XDMF file that Paraview can load and which contains > references to all timesteps in my simulations, with each timestep being > contained in an HDF5 file on its own. The idea is to call writeDomain at > the beginning of the simulation, writeHDF5 for each timestep that I want to > save and writeSimulationXDMF at the end. (Warning: 3D is in use, while 2D > ia almost untested...) > > It's not the optimal solution since (1) all timesteps could be in the same > HDF5 and (2) in each HDF5 i write the vectors separately and it would be > better to dump the entire data in one go and interpret them as a > Nx*Ny*Nz*Nvariables data from the XDMF. Nevertheless they might be a > starting point for you if you wan to try this approach. > > You can have HDF5 put the vectors in a single array with a time dimension now. Then you just alter the xdmf to point into that array. I do this with the unstructured code. Thanks, Matt > Matteo > > > >> 3) I tried using the "petsc_gen_xdmf.py" to generate the xdmf files for >> use in Paraview. Here the key ["viz/geometry"] is missing. The keys present >> in the output H5 file are just the two vectors I am writing and has no info >> about mesh. Isn't this supposed to come automatically since the vector is >> attached to the DM? How do I sort this out? >> > > This support is for unstructured grids, DMPlex and DMForest. > > >> 4) Can I have multiple vectors attached to the DM by >> DMCreateGlobalVector() even though I created the DMDA using dof=1. >> > > Yes. 
> > Thanks, > > Matt > > >> thanks, >> Abhishek >> > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > --- > Professore Associato in Analisi Numerica > Dipartimento di Scienza e Alta Tecnologia > Universit? degli Studi dell'Insubria > Via Valleggio, 11 - Como > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From gsabhishek1ags at gmail.com Thu Oct 14 10:58:00 2021 From: gsabhishek1ags at gmail.com (Abhishek G.S.) Date: Thu, 14 Oct 2021 21:28:00 +0530 Subject: [petsc-users] VecView DMDA and HDF5 - Unable to write out files properly In-Reply-To: References: Message-ID: Thanks, Matthew for the clarification/ suggestion. Thanks, Matteo for the scripts, I'll give this a try and get back with an update On Thu, 14 Oct 2021 at 19:57, Matthew Knepley wrote: > On Thu, Oct 14, 2021 at 9:21 AM Matteo Semplice < > matteo.semplice at uninsubria.it> wrote: > >> >> Il 14/10/21 14:37, Matthew Knepley ha scritto: >> >> On Wed, Oct 13, 2021 at 6:30 PM Abhishek G.S. >> wrote: >> >>> Hi, >>> I need some help with getting the file output working right. >>> >>> I am using a DMDACreate3D to initialize my DM. This is my write function >>> >>> void write(){ >>> PetscViewer viewer; >>> >>> PetscViewerHDF5Open(PETSC_COMM_WORLD,filename.c_str(),FILE_MODE_WRITE,&viewer); >>> DMDAVecRestoreArray(dm,global_vector,global_array) >>> VecView(global_vec, viewer); >>> DMDAVecGetArray(dm,global_vector,global_array); >>> PetscViewerDestroy(&viewer); >>> } >>> >>> 1) I have 2 PDE's to solve. Still, I went ahead creating a single DM >>> with dof=1 and creating two vectors using the DMCreateGlobalVector(). I >>> want to write the file out periodically. Should I perform >>> DMDAVecRestoreArray and DMDAVecGetArray every time is write out the >>> global_vector? (I know that it is just indexing the pointers and there is >>> no copying of values. But I am not sure) >>> >> >> I don't think you need the Get/RestoreArray() calls here. >> >> >>> 2) I am writing out to HDF5 format. I see that the vecview is supposed >>> to reorder the global_vector based on the DM. However, when I read the H5 >>> files, I get an error on ViSIT and my output image becomes a 1D image >>> rather than a 2D/3D. What might be the reason for this ?. >>> Error Msg : "In domain 0, your zonal variable "avtGhostZones" has 25600 >>> values, but it should have 160. Some values were removed to ensure VisIt >>> runs smoothly" >>> I was using a 160x160x1 DM >>> >> >> I do not believe we support HDF5 <--> Visit/Paraview for DMDA. The >> VecView() is just writing out the vector as a linear array without mesh >> details. For >> interfacing with the visualization, I think we use .vtu files. You should >> be able to get this effect using >> >> VecViewFromOptions(global_vec, NULL, "-vec_view"); >> >> in your code, and then >> >> -vec_view vtk:sol.vtu >> >> on the command line. >> >> Hi. >> >> If you want to stick with HDF5, you can also write a XDMF file with the >> grid information and open that in Paraview. 
>> >> I am attaching some routines that I have written to do that in a solver >> that deals with a time dependent PDE system with 2 variables; with them I >> end up with a single XDMF file that Paraview can load and which contains >> references to all timesteps in my simulations, with each timestep being >> contained in an HDF5 file on its own. The idea is to call writeDomain at >> the beginning of the simulation, writeHDF5 for each timestep that I want to >> save and writeSimulationXDMF at the end. (Warning: 3D is in use, while 2D >> ia almost untested...) >> >> It's not the optimal solution since (1) all timesteps could be in the >> same HDF5 and (2) in each HDF5 i write the vectors separately and it would >> be better to dump the entire data in one go and interpret them as a >> Nx*Ny*Nz*Nvariables data from the XDMF. Nevertheless they might be a >> starting point for you if you wan to try this approach. >> >> You can have HDF5 put the vectors in a single array with a time dimension > now. Then you just alter the xdmf to point into that array. I do this > with the unstructured code. > > Thanks, > > Matt > >> Matteo >> >> >> >>> 3) I tried using the "petsc_gen_xdmf.py" to generate the xdmf files for >>> use in Paraview. Here the key ["viz/geometry"] is missing. The keys present >>> in the output H5 file are just the two vectors I am writing and has no info >>> about mesh. Isn't this supposed to come automatically since the vector is >>> attached to the DM? How do I sort this out? >>> >> >> This support is for unstructured grids, DMPlex and DMForest. >> >> >>> 4) Can I have multiple vectors attached to the DM by >>> DMCreateGlobalVector() even though I created the DMDA using dof=1. >>> >> >> Yes. >> >> Thanks, >> >> Matt >> >> >>> thanks, >>> Abhishek >>> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> >> -- >> --- >> Professore Associato in Analisi Numerica >> Dipartimento di Scienza e Alta Tecnologia >> Universit? degli Studi dell'Insubria >> Via Valleggio, 11 - Como >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cliu at pppl.gov Thu Oct 14 15:14:41 2021 From: cliu at pppl.gov (Chang Liu) Date: Thu, 14 Oct 2021 16:14:41 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> Message-ID: Hi Pierre, I wonder if the trick of PCTELESCOPE only works for preconditioner and not for the solver. I have done some tests, and find that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work. 
The command line options I used for small matrix is like mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4 which gives the correct output. For iterative solver, I tried mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20 for large matrix. The output is like 0 KSP Residual norm 40.1497 1 KSP Residual norm < 1.e-11 Norm of error 400.999 iterations 1 So it seems to call a direct solver instead of an iterative one. Can you please help check these options? Chang On 10/14/21 10:04 AM, Pierre Jolivet wrote: > > >> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote: >> >> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaircusparse? Or I have to do it manually? > > PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). > 1) I?m not sure this is implemented for cuSparse matrices, but it should be; > 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning. > If you try this out and this does not work, please provide the backtrace (probably something like ?Operation XYZ not implemented for MatType ABC?), and hopefully someone can add the missing plumbing. > I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve. > > Thanks, > Pierre > >> Chang >> >> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >>> Maybe I?m missing something, but can?t you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block? >>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu >>> This does not work with MUMPS -mat_mumps_use_omp_threads because not only do the Mat needs to be redistributed, the secondary processes also need to be ?converted? to OpenMP threads. >>> Thus the need for specific code in mumps.c. >>> Thanks, >>> Pierre >>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote: >>>> >>>> Hi Junchao, >>>> >>>> Yes that is what I want. >>>> >>>> Chang >>>> >>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: >>>>> Junchao, >>>>> If I understand correctly Chang is using the block Jacobi >>>>> method with a single block for a number of MPI ranks and a direct >>>>> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which >>>>> is code Hong Zhang wrote a number of years ago for CPUs. For their >>>>> particular problems this preconditioner works well, but using an >>>>> iterative solver on the blocks does not work well. 
>>>>> If we had complete MPI-GPU direct solvers he could just use >>>>> the current code with MPIAIJCUSPARSE on each block but since we do >>>>> not he would like to use a single GPU for each block, this means >>>>> that diagonal blocks of the global parallel MPI matrix needs to be >>>>> sent to a subset of the GPUs (one GPU per block, which has multiple >>>>> MPI ranks associated with the blocks). Similarly for the triangular >>>>> solves the blocks of the right hand side needs to be shipped to the >>>>> appropriate GPU and the resulting solution shipped back to the >>>>> multiple GPUs. So Chang is absolutely correct, this is somewhat like >>>>> your code for MUMPS with OpenMP. OK, I now understand the background.. >>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the >>>>> MPI ranks and then shrink each block down to a single GPU but this >>>>> would be pretty inefficient, ideally one would go directly from the >>>>> big MPI matrix on all the GPUs to the sub matrices on the subset of >>>>> GPUs. But this may be a large coding project. >>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. >>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. >>>>> Barry >>>>> Since the matrices being factored and solved directly are relatively >>>>> large it is possible that the cusparse code could be reasonably >>>>> efficient (they are not the tiny problems one gets at the coarse >>>>> level of multigrid). Of course, this is speculation, I don't >>>>> actually know how much better the cusparse code would be on the >>>>> direct solver than a good CPU direct sparse solver. >>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>> > wrote: >>>>> > >>>>> > Sorry I am not familiar with the details either. Can you please >>>>> check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>> > >>>>> > Chang >>>>> > >>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>> >> Hi Chang, >>>>> >> I did the work in mumps. It is easy for me to understand >>>>> gathering matrix rows to one process. >>>>> >> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that? >>>>> >> Thanks >>>>> >> --Junchao Zhang >>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users >>>>> >>>>> >> >>>>> wrote: >>>>> >> Hi Barry, >>>>> >> I think mumps solver in petsc does support that. You can >>>>> check the >>>>> >> documentation on "-mat_mumps_use_omp_threads" at >>>>> >> >>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>> >>>>> >> >>>> > >>>>> >> and the code enclosed by #if >>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>> >> functions MatMumpsSetUpDistRHSInfo and >>>>> >> MatMumpsGatherNonzerosOnMaster in >>>>> >> mumps.c >>>>> >> 1. I understand it is ideal to do one MPI rank per GPU. >>>>> However, I am >>>>> >> working on an existing code that was developed based on MPI >>>>> and the the >>>>> >> # of mpi ranks is typically equal to # of cpu cores. We don't >>>>> want to >>>>> >> change the whole structure of the code. >>>>> >> 2. What you have suggested has been coded in mumps.c. See >>>>> function >>>>> >> MatMumpsSetUpDistRHSInfo. 
>>>>> >> Regards, >>>>> >> Chang >>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: >>>>> >> > >>>>> >> > >>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>> >>>>> >> >> wrote: >>>>> >> >> >>>>> >> >> Hi Barry, >>>>> >> >> >>>>> >> >> That is exactly what I want. >>>>> >> >> >>>>> >> >> Back to my original question, I am looking for an approach to >>>>> >> transfer >>>>> >> >> matrix >>>>> >> >> data from many MPI processes to "master" MPI >>>>> >> >> processes, each of which taking care of one GPU, and then >>>>> upload >>>>> >> the data to GPU to >>>>> >> >> solve. >>>>> >> >> One can just grab some codes from mumps.c to >>>>> aijcusparse.cu >>>>> >> >. >>>>> >> > >>>>> >> > mumps.c doesn't actually do that. It never needs to >>>>> copy the >>>>> >> entire matrix to a single MPI rank. >>>>> >> > >>>>> >> > It would be possible to write such a code that you >>>>> suggest but >>>>> >> it is not clear that it makes sense >>>>> >> > >>>>> >> > 1) For normal PETSc GPU usage there is one GPU per MPI >>>>> rank, so >>>>> >> while your one GPU per big domain is solving its systems the >>>>> other >>>>> >> GPUs (with the other MPI ranks that share that domain) are doing >>>>> >> nothing. >>>>> >> > >>>>> >> > 2) For each triangular solve you would have to gather the >>>>> right >>>>> >> hand side from the multiple ranks to the single GPU to pass it to >>>>> >> the GPU solver and then scatter the resulting solution back >>>>> to all >>>>> >> of its subdomain ranks. >>>>> >> > >>>>> >> > What I was suggesting was assign an entire subdomain to a >>>>> >> single MPI rank, thus it does everything on one GPU and can >>>>> use the >>>>> >> GPU solver directly. If all the major computations of a subdomain >>>>> >> can fit and be done on a single GPU then you would be >>>>> utilizing all >>>>> >> the GPUs you are using effectively. >>>>> >> > >>>>> >> > Barry >>>>> >> > >>>>> >> > >>>>> >> > >>>>> >> >> >>>>> >> >> Chang >>>>> >> >> >>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>> >> >>> Chang, >>>>> >> >>> You are correct there is no MPI + GPU direct >>>>> solvers that >>>>> >> currently do the triangular solves with MPI + GPU parallelism >>>>> that I >>>>> >> am aware of. You are limited that individual triangular solves be >>>>> >> done on a single GPU. I can only suggest making each subdomain as >>>>> >> big as possible to utilize each GPU as much as possible for the >>>>> >> direct triangular solves. >>>>> >> >>> Barry >>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users >>>>> >> >>>>> >> >>>>> wrote: >>>>> >> >>>> >>>>> >> >>>> Hi Mark, >>>>> >> >>>> >>>>> >> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with >>>>> other >>>>> >> solvers, but with -pc_factor_mat_solver_type cusparse, it >>>>> will give >>>>> >> an error. >>>>> >> >>>> >>>>> >> >>>> Yes what I want is to have mumps or superlu to do the >>>>> >> factorization, and then do the rest, including GMRES solver, >>>>> on gpu. >>>>> >> Is that possible? >>>>> >> >>>> >>>>> >> >>>> I have tried to use aijcusparse with superlu_dist, it >>>>> runs but >>>>> >> the iterative solver is still running on CPUs. I have >>>>> contacted the >>>>> >> superlu group and they confirmed that is the case right now. >>>>> But if >>>>> >> I set -pc_factor_mat_solver_type cusparse, it seems that the >>>>> >> iterative solver is running on GPU. 
>>>>> >> >>>> >>>>> >> >>>> Chang >>>>> >> >>>> >>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >>>>> >>>>> >> > >>>>> >>>>> >> >>> wrote: >>>>> >> >>>>> Thank you Junchao for explaining this. I guess in >>>>> my case >>>>> >> the code is >>>>> >> >>>>> just calling a seq solver like superlu to do >>>>> >> factorization on GPUs. >>>>> >> >>>>> My idea is that I want to have a traditional MPI >>>>> code to >>>>> >> utilize GPUs >>>>> >> >>>>> with cusparse. Right now cusparse does not support >>>>> mpiaij >>>>> >> matrix, Sure it does: '-mat_type aijcusparse' will give you an >>>>> >> mpiaijcusparse matrix with > 1 processes. >>>>> >> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). >>>>> >> >>>>> However, I see in grepping the repo that all the mumps and >>>>> >> superlu tests use aij or sell matrix type. >>>>> >> >>>>> MUMPS and SuperLU provide their own solves, I assume >>>>> .... but >>>>> >> you might want to do other matrix operations on the GPU. Is >>>>> that the >>>>> >> issue? >>>>> >> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or >>>>> SuperLU >>>>> >> have a problem? (no test with it so it probably does not work) >>>>> >> >>>>> Thanks, >>>>> >> >>>>> Mark >>>>> >> >>>>> so I >>>>> >> >>>>> want the code to have a mpiaij matrix when adding >>>>> all the >>>>> >> matrix terms, >>>>> >> >>>>> and then transform the matrix to seqaij when doing the >>>>> >> factorization >>>>> >> >>>>> and >>>>> >> >>>>> solve. This involves sending the data to the master >>>>> >> process, and I >>>>> >> >>>>> think >>>>> >> >>>>> the petsc mumps solver have something similar already. >>>>> >> >>>>> Chang >>>>> >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote: >>>>> >> >>>>> > >>>>> >> >>>>> > >>>>> >> >>>>> > >>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >>>>> >> >>>>> > >>>>> >> >>>>> >>>>> >> >>>>> >> >>>>> > >>>> >>>> > >>>>> >> >>>>> >>>> wrote: >>>>> >> >>>>> > >>>>> >> >>>>> > >>>>> >> >>>>> > >>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >>>>> >> >>>> > >>>>> >> >>>>> >>>>> >> >>>>> >> >>>>> > >>>> > >>>>> >> >>>>> >>>> wrote: >>>>> >> >>>>> > >>>>> >> >>>>> > Hi Mark, >>>>> >> >>>>> > >>>>> >> >>>>> > The option I use is like >>>>> >> >>>>> > >>>>> >> >>>>> > -pc_type bjacobi -pc_bjacobi_blocks 16 >>>>> >> -ksp_type fgmres >>>>> >> >>>>> -mat_type >>>>> >> >>>>> > aijcusparse *-sub_pc_factor_mat_solver_type >>>>> >> cusparse >>>>> >> >>>>> *-sub_ksp_type >>>>> >> >>>>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 >>>>> >> -ksp_rtol 1.e-300 >>>>> >> >>>>> > -ksp_atol 1.e-300 >>>>> >> >>>>> > >>>>> >> >>>>> > >>>>> >> >>>>> > Note, If you use -log_view the last column >>>>> (rows >>>>> >> are the >>>>> >> >>>>> method like >>>>> >> >>>>> > MatFactorNumeric) has the percent of work >>>>> in the GPU. >>>>> >> >>>>> > >>>>> >> >>>>> > Junchao: *This* implies that we have a >>>>> cuSparse LU >>>>> >> >>>>> factorization. Is >>>>> >> >>>>> > that correct? (I don't think we do) >>>>> >> >>>>> > >>>>> >> >>>>> > No, we don't have cuSparse LU factorization. If you check >>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >>>>> find it >>>>> >> calls >>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJ() instead. >>>>> >> >>>>> > So I don't understand Chang's idea. Do you want to >>>>> >> make bigger >>>>> >> >>>>> blocks? >>>>> >> >>>>> > >>>>> >> >>>>> > >>>>> >> >>>>> > I think this one do both factorization and >>>>> >> solve on gpu. 
>>>>> >> >>>>> > >>>>> >> >>>>> > You can check the >>>>> runex72_aijcusparse.sh file >>>>> >> in petsc >>>>> >> >>>>> install >>>>> >> >>>>> > directory, and try it your self (this >>>>> is only lu >>>>> >> >>>>> factorization >>>>> >> >>>>> > without >>>>> >> >>>>> > iterative solve). >>>>> >> >>>>> > >>>>> >> >>>>> > Chang >>>>> >> >>>>> > >>>>> >> >>>>> > On 10/12/21 1:17 PM, Mark Adams wrote: >>>>> >> >>>>> > > >>>>> >> >>>>> > > >>>>> >> >>>>> > > On Tue, Oct 12, 2021 at 11:19 AM >>>>> Chang Liu >>>>> >> >>>>> >>>>> > >>>>> >> >>>>> >> >>>>> >> >>>>> > >>>> > >>>>> >> >>>>> >>> >>>>> >> >>>>> > > >>>> >>>>> >> > >>>>> >>>> >> >>>>> >> >>>>> >>>>> > >>>>> >> >>>>> >>>>> wrote: >>>>> >> >>>>> > > >>>>> >> >>>>> > > Hi Junchao, >>>>> >> >>>>> > > >>>>> >> >>>>> > > No I only needs it to be transferred >>>>> >> within a >>>>> >> >>>>> node. I use >>>>> >> >>>>> > block-Jacobi >>>>> >> >>>>> > > method and GMRES to solve the sparse >>>>> >> matrix, so each >>>>> >> >>>>> > direct solver will >>>>> >> >>>>> > > take care of a sub-block of the >>>>> whole >>>>> >> matrix. In this >>>>> >> >>>>> > way, I can use >>>>> >> >>>>> > > one >>>>> >> >>>>> > > GPU to solve one sub-block, which is >>>>> >> stored within >>>>> >> >>>>> one node. >>>>> >> >>>>> > > >>>>> >> >>>>> > > It was stated in the >>>>> documentation that >>>>> >> cusparse >>>>> >> >>>>> solver >>>>> >> >>>>> > is slow. >>>>> >> >>>>> > > However, in my test using >>>>> ex72.c, the >>>>> >> cusparse >>>>> >> >>>>> solver is >>>>> >> >>>>> > faster than >>>>> >> >>>>> > > mumps or superlu_dist on CPUs. >>>>> >> >>>>> > > >>>>> >> >>>>> > > >>>>> >> >>>>> > > Are we talking about the >>>>> factorization, the >>>>> >> solve, or >>>>> >> >>>>> both? >>>>> >> >>>>> > > >>>>> >> >>>>> > > We do not have an interface to >>>>> cuSparse's LU >>>>> >> >>>>> factorization (I >>>>> >> >>>>> > just >>>>> >> >>>>> > > learned that it exists a few weeks ago). >>>>> >> >>>>> > > Perhaps your fast "cusparse solver" is >>>>> >> '-pc_type lu >>>>> >> >>>>> -mat_type >>>>> >> >>>>> > > aijcusparse' ? This would be the CPU >>>>> >> factorization, >>>>> >> >>>>> which is the >>>>> >> >>>>> > > dominant cost. >>>>> >> >>>>> > > >>>>> >> >>>>> > > >>>>> >> >>>>> > > Chang >>>>> >> >>>>> > > >>>>> >> >>>>> > > On 10/12/21 10:24 AM, Junchao >>>>> Zhang wrote: >>>>> >> >>>>> > > > Hi, Chang, >>>>> >> >>>>> > > > For the mumps solver, we >>>>> usually >>>>> >> transfers >>>>> >> >>>>> matrix >>>>> >> >>>>> > and vector >>>>> >> >>>>> > > data >>>>> >> >>>>> > > > within a compute node. For >>>>> the idea you >>>>> >> >>>>> propose, it >>>>> >> >>>>> > looks like >>>>> >> >>>>> > > we need >>>>> >> >>>>> > > > to gather data within >>>>> >> MPI_COMM_WORLD, right? >>>>> >> >>>>> > > > >>>>> >> >>>>> > > > Mark, I remember you said >>>>> >> cusparse solve is >>>>> >> >>>>> slow >>>>> >> >>>>> > and you would >>>>> >> >>>>> > > > rather do it on CPU. Is it right? 
>>>>> >> >>>>> > > > >>>>> >> >>>>> > > > --Junchao Zhang >>>>> >> >>>>> > > > >>>>> >> >>>>> > > > >>>>> >> >>>>> > > > On Mon, Oct 11, 2021 at 10:25 PM >>>>> >> Chang Liu via >>>>> >> >>>>> petsc-users >>>>> >> >>>>> > > > >>>> >>>>> >> > >>>>> >> >>>>> >>>> >>>>> >> >>>> >> >>>>> >> >>>>> > >>>> >>>>> >> > >>>>> >> >>>>> >>>> >>>>> >> >>>> >>> >>>> >>>>> >> > >>>>> >> >>>>> >>>> >>>>> >> >>>> >> >>>>> >> >>>>> > >>>> >>>>> >> > >>>>> >> >>>>> >>>> >>>>> >> >>>> >>>> >>>>> >> >>>>> > > >>>> >>>>> >> > >>>>> >> >>>>> >>>> >>>>> >> >>>> >> >>>>> >> >>>>> > >>>> >>>>> >> > >>>>> >> >>>>> >>>> >>>>> >> >>>> >>> >>>> >>>>> >> > >>>>> >> >>>>> >>>> >>>>> >> >>>> >> >>>>> >> >>>>> > >>>> >>>>> >> > >>>>> >> >>>>> >>>> >>>>> >> >>>> >>>>>> >>>>> >> >>>>> > > wrote: >>>>> >> >>>>> > > > >>>>> >> >>>>> > > > Hi, >>>>> >> >>>>> > > > >>>>> >> >>>>> > > > Currently, it is possible >>>>> to use >>>>> >> mumps >>>>> >> >>>>> solver in >>>>> >> >>>>> > PETSC with >>>>> >> >>>>> > > > -mat_mumps_use_omp_threads >>>>> >> option, so that >>>>> >> >>>>> > multiple MPI >>>>> >> >>>>> > > processes will >>>>> >> >>>>> > > > transfer the matrix and >>>>> rhs data >>>>> >> to the master >>>>> >> >>>>> > rank, and then >>>>> >> >>>>> > > master >>>>> >> >>>>> > > > rank will call mumps with >>>>> OpenMP >>>>> >> to solve >>>>> >> >>>>> the matrix. >>>>> >> >>>>> > > > >>>>> >> >>>>> > > > I wonder if someone can >>>>> develop >>>>> >> similar >>>>> >> >>>>> option for >>>>> >> >>>>> > cusparse >>>>> >> >>>>> > > solver. >>>>> >> >>>>> > > > Right now, this solver >>>>> does not >>>>> >> work with >>>>> >> >>>>> > mpiaijcusparse. I >>>>> >> >>>>> > > think a >>>>> >> >>>>> > > > possible workaround is to >>>>> >> transfer all the >>>>> >> >>>>> matrix >>>>> >> >>>>> > data to one MPI >>>>> >> >>>>> > > > process, and then upload the >>>>> >> data to GPU to >>>>> >> >>>>> solve. >>>>> >> >>>>> > In this >>>>> >> >>>>> > > way, one can >>>>> >> >>>>> > > > use cusparse solver for a MPI >>>>> >> program. 
-- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From bsmith at petsc.dev Thu Oct 14 16:15:56 2021 From: bsmith at petsc.dev (Barry Smith) Date: Thu, 14 Oct 2021 17:15:56 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to
cusparse solver In-Reply-To: References: <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> Message-ID: You need to use the PCTELESCOPE inside the block Jacobi, not outside it. So something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu > On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: > > Hi Pierre, > > I wonder if the trick of PCTELESCOPE only works for preconditioner and not for the solver. I have done some tests, and find that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work. > > The command line options I used for small matrix is like > > mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4 > > which gives the correct output. For iterative solver, I tried > > mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20 > > for large matrix. The output is like > > 0 KSP Residual norm 40.1497 > 1 KSP Residual norm < 1.e-11 > Norm of error 400.999 iterations 1 > > So it seems to call a direct solver instead of an iterative one. > > Can you please help check these options? > > Chang > > On 10/14/21 10:04 AM, Pierre Jolivet wrote: >>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote: >>> >>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaircusparse? Or I have to do it manually? >> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). >> 1) I?m not sure this is implemented for cuSparse matrices, but it should be; >> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning. >> If you try this out and this does not work, please provide the backtrace (probably something like ?Operation XYZ not implemented for MatType ABC?), and hopefully someone can add the missing plumbing. >> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve. >> Thanks, >> Pierre >>> Chang >>> >>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block? 
>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu >>>> This does not work with MUMPS -mat_mumps_use_omp_threads because not only do the Mat needs to be redistributed, the secondary processes also need to be ?converted? to OpenMP threads. >>>> Thus the need for specific code in mumps.c. >>>> Thanks, >>>> Pierre >>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote: >>>>> >>>>> Hi Junchao, >>>>> >>>>> Yes that is what I want. >>>>> >>>>> Chang >>>>> >>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: >>>>>> Junchao, >>>>>> If I understand correctly Chang is using the block Jacobi >>>>>> method with a single block for a number of MPI ranks and a direct >>>>>> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which >>>>>> is code Hong Zhang wrote a number of years ago for CPUs. For their >>>>>> particular problems this preconditioner works well, but using an >>>>>> iterative solver on the blocks does not work well. >>>>>> If we had complete MPI-GPU direct solvers he could just use >>>>>> the current code with MPIAIJCUSPARSE on each block but since we do >>>>>> not he would like to use a single GPU for each block, this means >>>>>> that diagonal blocks of the global parallel MPI matrix needs to be >>>>>> sent to a subset of the GPUs (one GPU per block, which has multiple >>>>>> MPI ranks associated with the blocks). Similarly for the triangular >>>>>> solves the blocks of the right hand side needs to be shipped to the >>>>>> appropriate GPU and the resulting solution shipped back to the >>>>>> multiple GPUs. So Chang is absolutely correct, this is somewhat like >>>>>> your code for MUMPS with OpenMP. OK, I now understand the background.. >>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the >>>>>> MPI ranks and then shrink each block down to a single GPU but this >>>>>> would be pretty inefficient, ideally one would go directly from the >>>>>> big MPI matrix on all the GPUs to the sub matrices on the subset of >>>>>> GPUs. But this may be a large coding project. >>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. >>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. >>>>>> Barry >>>>>> Since the matrices being factored and solved directly are relatively >>>>>> large it is possible that the cusparse code could be reasonably >>>>>> efficient (they are not the tiny problems one gets at the coarse >>>>>> level of multigrid). Of course, this is speculation, I don't >>>>>> actually know how much better the cusparse code would be on the >>>>>> direct solver than a good CPU direct sparse solver. >>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>> > wrote: >>>>>> > >>>>>> > Sorry I am not familiar with the details either. Can you please >>>>>> check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>> > >>>>>> > Chang >>>>>> > >>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>> >> Hi Chang, >>>>>> >> I did the work in mumps. It is easy for me to understand >>>>>> gathering matrix rows to one process. 
>>>>>> >> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that? >>>>>> >> Thanks >>>>>> >> --Junchao Zhang >>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users >>>>>> >>>>>> >> >>>>>> wrote: >>>>>> >> Hi Barry, >>>>>> >> I think mumps solver in petsc does support that. You can >>>>>> check the >>>>>> >> documentation on "-mat_mumps_use_omp_threads" at >>>>>> >> >>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>> >>>>>> >> >>>>> > >>>>>> >> and the code enclosed by #if >>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>> >> functions MatMumpsSetUpDistRHSInfo and >>>>>> >> MatMumpsGatherNonzerosOnMaster in >>>>>> >> mumps.c >>>>>> >> 1. I understand it is ideal to do one MPI rank per GPU. >>>>>> However, I am >>>>>> >> working on an existing code that was developed based on MPI >>>>>> and the the >>>>>> >> # of mpi ranks is typically equal to # of cpu cores. We don't >>>>>> want to >>>>>> >> change the whole structure of the code. >>>>>> >> 2. What you have suggested has been coded in mumps.c. See >>>>>> function >>>>>> >> MatMumpsSetUpDistRHSInfo. >>>>>> >> Regards, >>>>>> >> Chang >>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>> >> > >>>>>> >> > >>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>> >>>>>> >> >> wrote: >>>>>> >> >> >>>>>> >> >> Hi Barry, >>>>>> >> >> >>>>>> >> >> That is exactly what I want. >>>>>> >> >> >>>>>> >> >> Back to my original question, I am looking for an approach to >>>>>> >> transfer >>>>>> >> >> matrix >>>>>> >> >> data from many MPI processes to "master" MPI >>>>>> >> >> processes, each of which taking care of one GPU, and then >>>>>> upload >>>>>> >> the data to GPU to >>>>>> >> >> solve. >>>>>> >> >> One can just grab some codes from mumps.c to >>>>>> aijcusparse.cu >>>>>> >> >. >>>>>> >> > >>>>>> >> > mumps.c doesn't actually do that. It never needs to >>>>>> copy the >>>>>> >> entire matrix to a single MPI rank. >>>>>> >> > >>>>>> >> > It would be possible to write such a code that you >>>>>> suggest but >>>>>> >> it is not clear that it makes sense >>>>>> >> > >>>>>> >> > 1) For normal PETSc GPU usage there is one GPU per MPI >>>>>> rank, so >>>>>> >> while your one GPU per big domain is solving its systems the >>>>>> other >>>>>> >> GPUs (with the other MPI ranks that share that domain) are doing >>>>>> >> nothing. >>>>>> >> > >>>>>> >> > 2) For each triangular solve you would have to gather the >>>>>> right >>>>>> >> hand side from the multiple ranks to the single GPU to pass it to >>>>>> >> the GPU solver and then scatter the resulting solution back >>>>>> to all >>>>>> >> of its subdomain ranks. >>>>>> >> > >>>>>> >> > What I was suggesting was assign an entire subdomain to a >>>>>> >> single MPI rank, thus it does everything on one GPU and can >>>>>> use the >>>>>> >> GPU solver directly. If all the major computations of a subdomain >>>>>> >> can fit and be done on a single GPU then you would be >>>>>> utilizing all >>>>>> >> the GPUs you are using effectively. >>>>>> >> > >>>>>> >> > Barry >>>>>> >> > >>>>>> >> > >>>>>> >> > >>>>>> >> >> >>>>>> >> >> Chang >>>>>> >> >> >>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>>> >> >>> Chang, >>>>>> >> >>> You are correct there is no MPI + GPU direct >>>>>> solvers that >>>>>> >> currently do the triangular solves with MPI + GPU parallelism >>>>>> that I >>>>>> >> am aware of. You are limited that individual triangular solves be >>>>>> >> done on a single GPU. 
I can only suggest making each subdomain as >>>>>> >> big as possible to utilize each GPU as much as possible for the >>>>>> >> direct triangular solves. >>>>>> >> >>> Barry >>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users >>>>>> >> >>>>>> >> >>>>>> wrote: >>>>>> >> >>>> >>>>>> >> >>>> Hi Mark, >>>>>> >> >>>> >>>>>> >> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with >>>>>> other >>>>>> >> solvers, but with -pc_factor_mat_solver_type cusparse, it >>>>>> will give >>>>>> >> an error. >>>>>> >> >>>> >>>>>> >> >>>> Yes what I want is to have mumps or superlu to do the >>>>>> >> factorization, and then do the rest, including GMRES solver, >>>>>> on gpu. >>>>>> >> Is that possible? >>>>>> >> >>>> >>>>>> >> >>>> I have tried to use aijcusparse with superlu_dist, it >>>>>> runs but >>>>>> >> the iterative solver is still running on CPUs. I have >>>>>> contacted the >>>>>> >> superlu group and they confirmed that is the case right now. >>>>>> But if >>>>>> >> I set -pc_factor_mat_solver_type cusparse, it seems that the >>>>>> >> iterative solver is running on GPU. >>>>>> >> >>>> >>>>>> >> >>>> Chang >>>>>> >> >>>> >>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >>>>>> >>>>>> >> > >>>>>> >>>>>> >> >>> wrote: >>>>>> >> >>>>> Thank you Junchao for explaining this. I guess in >>>>>> my case >>>>>> >> the code is >>>>>> >> >>>>> just calling a seq solver like superlu to do >>>>>> >> factorization on GPUs. >>>>>> >> >>>>> My idea is that I want to have a traditional MPI >>>>>> code to >>>>>> >> utilize GPUs >>>>>> >> >>>>> with cusparse. Right now cusparse does not support >>>>>> mpiaij >>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' will give you an >>>>>> >> mpiaijcusparse matrix with > 1 processes. >>>>>> >> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). >>>>>> >> >>>>> However, I see in grepping the repo that all the mumps and >>>>>> >> superlu tests use aij or sell matrix type. >>>>>> >> >>>>> MUMPS and SuperLU provide their own solves, I assume >>>>>> .... but >>>>>> >> you might want to do other matrix operations on the GPU. Is >>>>>> that the >>>>>> >> issue? >>>>>> >> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or >>>>>> SuperLU >>>>>> >> have a problem? (no test with it so it probably does not work) >>>>>> >> >>>>> Thanks, >>>>>> >> >>>>> Mark >>>>>> >> >>>>> so I >>>>>> >> >>>>> want the code to have a mpiaij matrix when adding >>>>>> all the >>>>>> >> matrix terms, >>>>>> >> >>>>> and then transform the matrix to seqaij when doing the >>>>>> >> factorization >>>>>> >> >>>>> and >>>>>> >> >>>>> solve. This involves sending the data to the master >>>>>> >> process, and I >>>>>> >> >>>>> think >>>>>> >> >>>>> the petsc mumps solver have something similar already. 
>>>>>> >> >>>>> Chang >>>>>> >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote: >>>>>> >> >>>>> > >>>>>> >> >>>>> > >>>>>> >> >>>>> > >>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >>>>>> >> >>>>>> > >>>>>> >> >>>>> >>>>>> >> >>>>>> >> >>>>> > >>>>> >>>>> > >>>>>> >> >>>>>> >>>> wrote: >>>>>> >> >>>>> > >>>>>> >> >>>>> > >>>>>> >> >>>>> > >>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >>>>>> >> >>>>> > >>>>>> >> >>>>> >>>>>> >> >>>>>> >> >>>>> > >>>>> > >>>>>> >> >>>>>> >>>> wrote: >>>>>> >> >>>>> > >>>>>> >> >>>>> > Hi Mark, >>>>>> >> >>>>> > >>>>>> >> >>>>> > The option I use is like >>>>>> >> >>>>> > >>>>>> >> >>>>> > -pc_type bjacobi -pc_bjacobi_blocks 16 >>>>>> >> -ksp_type fgmres >>>>>> >> >>>>> -mat_type >>>>>> >> >>>>> > aijcusparse *-sub_pc_factor_mat_solver_type >>>>>> >> cusparse >>>>>> >> >>>>> *-sub_ksp_type >>>>>> >> >>>>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 >>>>>> >> -ksp_rtol 1.e-300 >>>>>> >> >>>>> > -ksp_atol 1.e-300 >>>>>> >> >>>>> > >>>>>> >> >>>>> > >>>>>> >> >>>>> > Note, If you use -log_view the last column >>>>>> (rows >>>>>> >> are the >>>>>> >> >>>>> method like >>>>>> >> >>>>> > MatFactorNumeric) has the percent of work >>>>>> in the GPU. >>>>>> >> >>>>> > >>>>>> >> >>>>> > Junchao: *This* implies that we have a >>>>>> cuSparse LU >>>>>> >> >>>>> factorization. Is >>>>>> >> >>>>> > that correct? (I don't think we do) >>>>>> >> >>>>> > >>>>>> >> >>>>> > No, we don't have cuSparse LU factorization. If you check >>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >>>>>> find it >>>>>> >> calls >>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJ() instead. >>>>>> >> >>>>> > So I don't understand Chang's idea. Do you want to >>>>>> >> make bigger >>>>>> >> >>>>> blocks? >>>>>> >> >>>>> > >>>>>> >> >>>>> > >>>>>> >> >>>>> > I think this one do both factorization and >>>>>> >> solve on gpu. >>>>>> >> >>>>> > >>>>>> >> >>>>> > You can check the >>>>>> runex72_aijcusparse.sh file >>>>>> >> in petsc >>>>>> >> >>>>> install >>>>>> >> >>>>> > directory, and try it your self (this >>>>>> is only lu >>>>>> >> >>>>> factorization >>>>>> >> >>>>> > without >>>>>> >> >>>>> > iterative solve). >>>>>> >> >>>>> > >>>>>> >> >>>>> > Chang >>>>>> >> >>>>> > >>>>>> >> >>>>> > On 10/12/21 1:17 PM, Mark Adams wrote: >>>>>> >> >>>>> > > >>>>>> >> >>>>> > > >>>>>> >> >>>>> > > On Tue, Oct 12, 2021 at 11:19 AM >>>>>> Chang Liu >>>>>> >> >>>>> >>>>>> > >>>>>> >> >>>>>> >> >>>>>> >> >>>>> > >>>>> > >>>>>> >> >>>>>> >>> >>>>>> >> >>>>> > > >>>>> >>>>>> >> > >>>>>> >>>>> >> >>>>>> >> >>>>> >>>>>> > >>>>>> >> >>>>>> >>>>> wrote: >>>>>> >> >>>>> > > >>>>>> >> >>>>> > > Hi Junchao, >>>>>> >> >>>>> > > >>>>>> >> >>>>> > > No I only needs it to be transferred >>>>>> >> within a >>>>>> >> >>>>> node. I use >>>>>> >> >>>>> > block-Jacobi >>>>>> >> >>>>> > > method and GMRES to solve the sparse >>>>>> >> matrix, so each >>>>>> >> >>>>> > direct solver will >>>>>> >> >>>>> > > take care of a sub-block of the >>>>>> whole >>>>>> >> matrix. In this >>>>>> >> >>>>> > way, I can use >>>>>> >> >>>>> > > one >>>>>> >> >>>>> > > GPU to solve one sub-block, which is >>>>>> >> stored within >>>>>> >> >>>>> one node. >>>>>> >> >>>>> > > >>>>>> >> >>>>> > > It was stated in the >>>>>> documentation that >>>>>> >> cusparse >>>>>> >> >>>>> solver >>>>>> >> >>>>> > is slow. >>>>>> >> >>>>> > > However, in my test using >>>>>> ex72.c, the >>>>>> >> cusparse >>>>>> >> >>>>> solver is >>>>>> >> >>>>> > faster than >>>>>> >> >>>>> > > mumps or superlu_dist on CPUs. 
>>>>>> >> >>>>> > > >>>>>> >> >>>>> > > >>>>>> >> >>>>> > > Are we talking about the >>>>>> factorization, the >>>>>> >> solve, or >>>>>> >> >>>>> both? >>>>>> >> >>>>> > > >>>>>> >> >>>>> > > We do not have an interface to >>>>>> cuSparse's LU >>>>>> >> >>>>> factorization (I >>>>>> >> >>>>> > just >>>>>> >> >>>>> > > learned that it exists a few weeks ago). >>>>>> >> >>>>> > > Perhaps your fast "cusparse solver" is >>>>>> >> '-pc_type lu >>>>>> >> >>>>> -mat_type >>>>>> >> >>>>> > > aijcusparse' ? This would be the CPU >>>>>> >> factorization, >>>>>> >> >>>>> which is the >>>>>> >> >>>>> > > dominant cost. >>>>>> >> >>>>> > > >>>>>> >> >>>>> > > >>>>>> >> >>>>> > > Chang >>>>>> >> >>>>> > > >>>>>> >> >>>>> > > On 10/12/21 10:24 AM, Junchao >>>>>> Zhang wrote: >>>>>> >> >>>>> > > > Hi, Chang, >>>>>> >> >>>>> > > > For the mumps solver, we >>>>>> usually >>>>>> >> transfers >>>>>> >> >>>>> matrix >>>>>> >> >>>>> > and vector >>>>>> >> >>>>> > > data >>>>>> >> >>>>> > > > within a compute node. For >>>>>> the idea you >>>>>> >> >>>>> propose, it >>>>>> >> >>>>> > looks like >>>>>> >> >>>>> > > we need >>>>>> >> >>>>> > > > to gather data within >>>>>> >> MPI_COMM_WORLD, right? >>>>>> >> >>>>> > > > >>>>>> >> >>>>> > > > Mark, I remember you said >>>>>> >> cusparse solve is >>>>>> >> >>>>> slow >>>>>> >> >>>>> > and you would >>>>>> >> >>>>> > > > rather do it on CPU. Is it right? >>>>>> >> >>>>> > > > >>>>>> >> >>>>> > > > --Junchao Zhang >>>>>> >> >>>>> > > > >>>>>> >> >>>>> > > > >>>>>> >> >>>>> > > > On Mon, Oct 11, 2021 at 10:25 PM >>>>>> >> Chang Liu via >>>>>> >> >>>>> petsc-users >>>>>> >> >>>>> > > > >>>>> >>>>>> >> > >>>>>> >> >>>>> >>>>> >>>>>> >> >>>>> >> >>>>>> >> >>>>> > >>>>> >>>>>> >> > >>>>>> >> >>>>> >>>>> >>>>>> >> >>>>> >>> >>>>> >>>>>> >> > >>>>>> >> >>>>> >>>>> >>>>>> >> >>>>> >> >>>>>> >> >>>>> > >>>>> >>>>>> >> > >>>>>> >> >>>>> >>>>> >>>>>> >> >>>>> >>>> >>>>>> >> >>>>> > > >>>>> >>>>>> >> > >>>>>> >> >>>>> >>>>> >>>>>> >> >>>>> >> >>>>>> >> >>>>> > >>>>> >>>>>> >> > >>>>>> >> >>>>> >>>>> >>>>>> >> >>>>> >>> >>>>> >>>>>> >> > >>>>>> >> >>>>> >>>>> >>>>>> >> >>>>> >> >>>>>> >> >>>>> > >>>>> >>>>>> >> > >>>>>> >> >>>>> >>>>> >>>>>> >> >>>>> >>>>>> >>>>>> >> >>>>> > > wrote: >>>>>> >> >>>>> > > > >>>>>> >> >>>>> > > > Hi, >>>>>> >> >>>>> > > > >>>>>> >> >>>>> > > > Currently, it is possible >>>>>> to use >>>>>> >> mumps >>>>>> >> >>>>> solver in >>>>>> >> >>>>> > PETSC with >>>>>> >> >>>>> > > > -mat_mumps_use_omp_threads >>>>>> >> option, so that >>>>>> >> >>>>> > multiple MPI >>>>>> >> >>>>> > > processes will >>>>>> >> >>>>> > > > transfer the matrix and >>>>>> rhs data >>>>>> >> to the master >>>>>> >> >>>>> > rank, and then >>>>>> >> >>>>> > > master >>>>>> >> >>>>> > > > rank will call mumps with >>>>>> OpenMP >>>>>> >> to solve >>>>>> >> >>>>> the matrix. >>>>>> >> >>>>> > > > >>>>>> >> >>>>> > > > I wonder if someone can >>>>>> develop >>>>>> >> similar >>>>>> >> >>>>> option for >>>>>> >> >>>>> > cusparse >>>>>> >> >>>>> > > solver. >>>>>> >> >>>>> > > > Right now, this solver >>>>>> does not >>>>>> >> work with >>>>>> >> >>>>> > mpiaijcusparse. I >>>>>> >> >>>>> > > think a >>>>>> >> >>>>> > > > possible workaround is to >>>>>> >> transfer all the >>>>>> >> >>>>> matrix >>>>>> >> >>>>> > data to one MPI >>>>>> >> >>>>> > > > process, and then upload the >>>>>> >> data to GPU to >>>>>> >> >>>>> solve. >>>>>> >> >>>>> > In this >>>>>> >> >>>>> > > way, one can >>>>>> >> >>>>> > > > use cusparse solver for a MPI >>>>>> >> program. 
From cliu at pppl.gov Thu Oct 14 17:02:10 2021 From: cliu at pppl.gov (Chang
Liu) Date: Thu, 14 Oct 2021 18:02:10 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> Message-ID: <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> Hi Barry, That is exactly what I was doing in the second example, in which the preconditioner works but the GMRES does not. Chang On 10/14/21 5:15 PM, Barry Smith wrote: > > You need to use the PCTELESCOPE inside the block Jacobi, not outside it. So something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu > >> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: >> >> Hi Pierre, >> >> I wonder if the trick of PCTELESCOPE only works for preconditioner and not for the solver. I have done some tests, and find that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work. >> >> The command line options I used for small matrix is like >> >> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4 >> >> which gives the correct output. For iterative solver, I tried >> >> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20 >> >> for large matrix. The output is like >> >> 0 KSP Residual norm 40.1497 >> 1 KSP Residual norm < 1.e-11 >> Norm of error 400.999 iterations 1 >> >> So it seems to call a direct solver instead of an iterative one. >> >> Can you please help check these options? >> >> Chang >> >> On 10/14/21 10:04 AM, Pierre Jolivet wrote: >>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote: >>>> >>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaircusparse? Or I have to do it manually? >>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). >>> 1) I?m not sure this is implemented for cuSparse matrices, but it should be; >>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning. >>> If you try this out and this does not work, please provide the backtrace (probably something like ?Operation XYZ not implemented for MatType ABC?), and hopefully someone can add the missing plumbing. >>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve. 
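(For reference, the nesting being discussed in this sub-thread, an outer Krylov solve with block Jacobi, PCTELESCOPE inside each block to gather the sub-block onto one rank, and a cuSPARSE LU on the gathered block, corresponds to an option layout along the following lines. This is only a sketch assembled from the options already posted above; whether the outer FGMRES then actually iterates instead of behaving like a direct solve is precisely what is being debugged here.)

mpiexec -n 16 ./ex7 -m 400 -ksp_type fgmres -ksp_monitor_short \
  -ksp_rtol 1.e-9 -ksp_atol 1.e-20 -ksp_max_it 2000 \
  -mat_type aijcusparse \
  -pc_type bjacobi -pc_bjacobi_blocks 4 \
  -sub_ksp_type preonly -sub_pc_type telescope -sub_pc_telescope_reduction_factor 4 \
  -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu \
  -sub_telescope_pc_factor_mat_solver_type cusparse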
>>> Thanks, >>> Pierre >>>> Chang >>>> >>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >>>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block? >>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu >>>>> This does not work with MUMPS -mat_mumps_use_omp_threads because not only do the Mat needs to be redistributed, the secondary processes also need to be ?converted? to OpenMP threads. >>>>> Thus the need for specific code in mumps.c. >>>>> Thanks, >>>>> Pierre >>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote: >>>>>> >>>>>> Hi Junchao, >>>>>> >>>>>> Yes that is what I want. >>>>>> >>>>>> Chang >>>>>> >>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: >>>>>>> Junchao, >>>>>>> If I understand correctly Chang is using the block Jacobi >>>>>>> method with a single block for a number of MPI ranks and a direct >>>>>>> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which >>>>>>> is code Hong Zhang wrote a number of years ago for CPUs. For their >>>>>>> particular problems this preconditioner works well, but using an >>>>>>> iterative solver on the blocks does not work well. >>>>>>> If we had complete MPI-GPU direct solvers he could just use >>>>>>> the current code with MPIAIJCUSPARSE on each block but since we do >>>>>>> not he would like to use a single GPU for each block, this means >>>>>>> that diagonal blocks of the global parallel MPI matrix needs to be >>>>>>> sent to a subset of the GPUs (one GPU per block, which has multiple >>>>>>> MPI ranks associated with the blocks). Similarly for the triangular >>>>>>> solves the blocks of the right hand side needs to be shipped to the >>>>>>> appropriate GPU and the resulting solution shipped back to the >>>>>>> multiple GPUs. So Chang is absolutely correct, this is somewhat like >>>>>>> your code for MUMPS with OpenMP. OK, I now understand the background.. >>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the >>>>>>> MPI ranks and then shrink each block down to a single GPU but this >>>>>>> would be pretty inefficient, ideally one would go directly from the >>>>>>> big MPI matrix on all the GPUs to the sub matrices on the subset of >>>>>>> GPUs. But this may be a large coding project. >>>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. >>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. >>>>>>> Barry >>>>>>> Since the matrices being factored and solved directly are relatively >>>>>>> large it is possible that the cusparse code could be reasonably >>>>>>> efficient (they are not the tiny problems one gets at the coarse >>>>>>> level of multigrid). Of course, this is speculation, I don't >>>>>>> actually know how much better the cusparse code would be on the >>>>>>> direct solver than a good CPU direct sparse solver. >>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>> > wrote: >>>>>>> > >>>>>>> > Sorry I am not familiar with the details either. 
Can you please >>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>>> > >>>>>>> > Chang >>>>>>> > >>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>>> >> Hi Chang, >>>>>>> >> I did the work in mumps. It is easy for me to understand >>>>>>> gathering matrix rows to one process. >>>>>>> >> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that? >>>>>>> >> Thanks >>>>>>> >> --Junchao Zhang >>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users >>>>>>> >>>>>>> >> >>>>>>> wrote: >>>>>>> >> Hi Barry, >>>>>>> >> I think mumps solver in petsc does support that. You can >>>>>>> check the >>>>>>> >> documentation on "-mat_mumps_use_omp_threads" at >>>>>>> >> >>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>>> >>>>>>> >> >>>>>> > >>>>>>> >> and the code enclosed by #if >>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>>> >> functions MatMumpsSetUpDistRHSInfo and >>>>>>> >> MatMumpsGatherNonzerosOnMaster in >>>>>>> >> mumps.c >>>>>>> >> 1. I understand it is ideal to do one MPI rank per GPU. >>>>>>> However, I am >>>>>>> >> working on an existing code that was developed based on MPI >>>>>>> and the the >>>>>>> >> # of mpi ranks is typically equal to # of cpu cores. We don't >>>>>>> want to >>>>>>> >> change the whole structure of the code. >>>>>>> >> 2. What you have suggested has been coded in mumps.c. See >>>>>>> function >>>>>>> >> MatMumpsSetUpDistRHSInfo. >>>>>>> >> Regards, >>>>>>> >> Chang >>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>>> >> > >>>>>>> >> > >>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>>> >>>>>>> >> >> wrote: >>>>>>> >> >> >>>>>>> >> >> Hi Barry, >>>>>>> >> >> >>>>>>> >> >> That is exactly what I want. >>>>>>> >> >> >>>>>>> >> >> Back to my original question, I am looking for an approach to >>>>>>> >> transfer >>>>>>> >> >> matrix >>>>>>> >> >> data from many MPI processes to "master" MPI >>>>>>> >> >> processes, each of which taking care of one GPU, and then >>>>>>> upload >>>>>>> >> the data to GPU to >>>>>>> >> >> solve. >>>>>>> >> >> One can just grab some codes from mumps.c to >>>>>>> aijcusparse.cu >>>>>>> >> >. >>>>>>> >> > >>>>>>> >> > mumps.c doesn't actually do that. It never needs to >>>>>>> copy the >>>>>>> >> entire matrix to a single MPI rank. >>>>>>> >> > >>>>>>> >> > It would be possible to write such a code that you >>>>>>> suggest but >>>>>>> >> it is not clear that it makes sense >>>>>>> >> > >>>>>>> >> > 1) For normal PETSc GPU usage there is one GPU per MPI >>>>>>> rank, so >>>>>>> >> while your one GPU per big domain is solving its systems the >>>>>>> other >>>>>>> >> GPUs (with the other MPI ranks that share that domain) are doing >>>>>>> >> nothing. >>>>>>> >> > >>>>>>> >> > 2) For each triangular solve you would have to gather the >>>>>>> right >>>>>>> >> hand side from the multiple ranks to the single GPU to pass it to >>>>>>> >> the GPU solver and then scatter the resulting solution back >>>>>>> to all >>>>>>> >> of its subdomain ranks. >>>>>>> >> > >>>>>>> >> > What I was suggesting was assign an entire subdomain to a >>>>>>> >> single MPI rank, thus it does everything on one GPU and can >>>>>>> use the >>>>>>> >> GPU solver directly. If all the major computations of a subdomain >>>>>>> >> can fit and be done on a single GPU then you would be >>>>>>> utilizing all >>>>>>> >> the GPUs you are using effectively. 
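For reference, the MUMPS gathering path referred to above is driven entirely from the command line; an illustrative invocation (the binary name and thread count are placeholders, and the feature needs a PETSc build with OpenMP support, i.e. the PETSC_HAVE_OPENMP_SUPPORT code paths mentioned earlier) looks roughly like:

    mpiexec -n 16 ./app -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads 4

With that option, groups of MPI ranks hand their matrix and right-hand-side data to one "master" rank per group, which then calls MUMPS with OpenMP threads; the request in this thread is essentially the analogous data movement with a GPU, rather than OpenMP threads, on the receiving end.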
>>>>>>> >> > >>>>>>> >> > Barry >>>>>>> >> > >>>>>>> >> > >>>>>>> >> > >>>>>>> >> >> >>>>>>> >> >> Chang >>>>>>> >> >> >>>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>>>> >> >>> Chang, >>>>>>> >> >>> You are correct there is no MPI + GPU direct >>>>>>> solvers that >>>>>>> >> currently do the triangular solves with MPI + GPU parallelism >>>>>>> that I >>>>>>> >> am aware of. You are limited that individual triangular solves be >>>>>>> >> done on a single GPU. I can only suggest making each subdomain as >>>>>>> >> big as possible to utilize each GPU as much as possible for the >>>>>>> >> direct triangular solves. >>>>>>> >> >>> Barry >>>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users >>>>>>> >> >>>>>>> >> >>>>>>> wrote: >>>>>>> >> >>>> >>>>>>> >> >>>> Hi Mark, >>>>>>> >> >>>> >>>>>>> >> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with >>>>>>> other >>>>>>> >> solvers, but with -pc_factor_mat_solver_type cusparse, it >>>>>>> will give >>>>>>> >> an error. >>>>>>> >> >>>> >>>>>>> >> >>>> Yes what I want is to have mumps or superlu to do the >>>>>>> >> factorization, and then do the rest, including GMRES solver, >>>>>>> on gpu. >>>>>>> >> Is that possible? >>>>>>> >> >>>> >>>>>>> >> >>>> I have tried to use aijcusparse with superlu_dist, it >>>>>>> runs but >>>>>>> >> the iterative solver is still running on CPUs. I have >>>>>>> contacted the >>>>>>> >> superlu group and they confirmed that is the case right now. >>>>>>> But if >>>>>>> >> I set -pc_factor_mat_solver_type cusparse, it seems that the >>>>>>> >> iterative solver is running on GPU. >>>>>>> >> >>>> >>>>>>> >> >>>> Chang >>>>>>> >> >>>> >>>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >>>>>>> >>>>>>> >> > >>>>>>> >>>>>>> >> >>> wrote: >>>>>>> >> >>>>> Thank you Junchao for explaining this. I guess in >>>>>>> my case >>>>>>> >> the code is >>>>>>> >> >>>>> just calling a seq solver like superlu to do >>>>>>> >> factorization on GPUs. >>>>>>> >> >>>>> My idea is that I want to have a traditional MPI >>>>>>> code to >>>>>>> >> utilize GPUs >>>>>>> >> >>>>> with cusparse. Right now cusparse does not support >>>>>>> mpiaij >>>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' will give you an >>>>>>> >> mpiaijcusparse matrix with > 1 processes. >>>>>>> >> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). >>>>>>> >> >>>>> However, I see in grepping the repo that all the mumps and >>>>>>> >> superlu tests use aij or sell matrix type. >>>>>>> >> >>>>> MUMPS and SuperLU provide their own solves, I assume >>>>>>> .... but >>>>>>> >> you might want to do other matrix operations on the GPU. Is >>>>>>> that the >>>>>>> >> issue? >>>>>>> >> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or >>>>>>> SuperLU >>>>>>> >> have a problem? (no test with it so it probably does not work) >>>>>>> >> >>>>> Thanks, >>>>>>> >> >>>>> Mark >>>>>>> >> >>>>> so I >>>>>>> >> >>>>> want the code to have a mpiaij matrix when adding >>>>>>> all the >>>>>>> >> matrix terms, >>>>>>> >> >>>>> and then transform the matrix to seqaij when doing the >>>>>>> >> factorization >>>>>>> >> >>>>> and >>>>>>> >> >>>>> solve. This involves sending the data to the master >>>>>>> >> process, and I >>>>>>> >> >>>>> think >>>>>>> >> >>>>> the petsc mumps solver have something similar already. 
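As a pointer for the mpiaij-to-seqaij step described above: PETSc already has a routine that gives ranks a redundant copy of a distributed matrix, MatCreateRedundantMatrix(), which is what PCREDUNDANT builds on. It gathers the whole matrix rather than one block-Jacobi diagonal block, so it is only a partial match for what is wanted here, but it shows the flavor of the data movement; a hedged sketch, where A stands for the assembled MPIAIJ matrix:

    Mat            Aseq;
    PetscMPIInt    size;
    PetscErrorCode ierr;
    MPI_Comm_size(PETSC_COMM_WORLD, &size);
    /* one subcommunicator per rank => each rank receives a sequential copy of A */
    ierr = MatCreateRedundantMatrix(A, size, MPI_COMM_NULL, MAT_INITIAL_MATRIX, &Aseq);CHKERRQ(ierr);
    /* optionally move the copy to the GPU format before factoring/solving;
       as noted above, the numeric LU factorization itself still runs on the CPU */
    ierr = MatConvert(Aseq, MATSEQAIJCUSPARSE, MAT_INPLACE_MATRIX, &Aseq);CHKERRQ(ierr);

Gathering only a per-block submatrix onto one rank per block, rather than the whole matrix, is what the PCTELESCOPE approach discussed above automates.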
>>>>>>> >> >>>>> Chang >>>>>>> >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote: >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >>>>>>> >> >>>>>>> > >>>>>>> >> >>>>> >>>>>>> >> >>>>>>> >> >>>>> > >>>>>> >>>>>> > >>>>>>> >> >>>>>>> >>>> wrote: >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >>>>>>> >> >>>>>> > >>>>>>> >> >>>>> >>>>>>> >> >>>>>>> >> >>>>> > >>>>>> > >>>>>>> >> >>>>>>> >>>> wrote: >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > Hi Mark, >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > The option I use is like >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > -pc_type bjacobi -pc_bjacobi_blocks 16 >>>>>>> >> -ksp_type fgmres >>>>>>> >> >>>>> -mat_type >>>>>>> >> >>>>> > aijcusparse *-sub_pc_factor_mat_solver_type >>>>>>> >> cusparse >>>>>>> >> >>>>> *-sub_ksp_type >>>>>>> >> >>>>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 >>>>>>> >> -ksp_rtol 1.e-300 >>>>>>> >> >>>>> > -ksp_atol 1.e-300 >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > Note, If you use -log_view the last column >>>>>>> (rows >>>>>>> >> are the >>>>>>> >> >>>>> method like >>>>>>> >> >>>>> > MatFactorNumeric) has the percent of work >>>>>>> in the GPU. >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > Junchao: *This* implies that we have a >>>>>>> cuSparse LU >>>>>>> >> >>>>> factorization. Is >>>>>>> >> >>>>> > that correct? (I don't think we do) >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > No, we don't have cuSparse LU factorization. If you check >>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >>>>>>> find it >>>>>>> >> calls >>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJ() instead. >>>>>>> >> >>>>> > So I don't understand Chang's idea. Do you want to >>>>>>> >> make bigger >>>>>>> >> >>>>> blocks? >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > I think this one do both factorization and >>>>>>> >> solve on gpu. >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > You can check the >>>>>>> runex72_aijcusparse.sh file >>>>>>> >> in petsc >>>>>>> >> >>>>> install >>>>>>> >> >>>>> > directory, and try it your self (this >>>>>>> is only lu >>>>>>> >> >>>>> factorization >>>>>>> >> >>>>> > without >>>>>>> >> >>>>> > iterative solve). >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > Chang >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > On 10/12/21 1:17 PM, Mark Adams wrote: >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > On Tue, Oct 12, 2021 at 11:19 AM >>>>>>> Chang Liu >>>>>>> >> >>>>> >>>>>>> > >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>> > >>>>>> > >>>>>>> >> >>>>>>> >>> >>>>>>> >> >>>>> > > >>>>>> >>>>>>> >> > >>>>>>> >>>>>> >> >>>>>>> >> >>>>> >>>>>>> > >>>>>>> >> >>>>>>> >>>>> wrote: >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > Hi Junchao, >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > No I only needs it to be transferred >>>>>>> >> within a >>>>>>> >> >>>>> node. I use >>>>>>> >> >>>>> > block-Jacobi >>>>>>> >> >>>>> > > method and GMRES to solve the sparse >>>>>>> >> matrix, so each >>>>>>> >> >>>>> > direct solver will >>>>>>> >> >>>>> > > take care of a sub-block of the >>>>>>> whole >>>>>>> >> matrix. In this >>>>>>> >> >>>>> > way, I can use >>>>>>> >> >>>>> > > one >>>>>>> >> >>>>> > > GPU to solve one sub-block, which is >>>>>>> >> stored within >>>>>>> >> >>>>> one node. >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > It was stated in the >>>>>>> documentation that >>>>>>> >> cusparse >>>>>>> >> >>>>> solver >>>>>>> >> >>>>> > is slow. 
>>>>>>> >> >>>>> > > However, in my test using >>>>>>> ex72.c, the >>>>>>> >> cusparse >>>>>>> >> >>>>> solver is >>>>>>> >> >>>>> > faster than >>>>>>> >> >>>>> > > mumps or superlu_dist on CPUs. >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > Are we talking about the >>>>>>> factorization, the >>>>>>> >> solve, or >>>>>>> >> >>>>> both? >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > We do not have an interface to >>>>>>> cuSparse's LU >>>>>>> >> >>>>> factorization (I >>>>>>> >> >>>>> > just >>>>>>> >> >>>>> > > learned that it exists a few weeks ago). >>>>>>> >> >>>>> > > Perhaps your fast "cusparse solver" is >>>>>>> >> '-pc_type lu >>>>>>> >> >>>>> -mat_type >>>>>>> >> >>>>> > > aijcusparse' ? This would be the CPU >>>>>>> >> factorization, >>>>>>> >> >>>>> which is the >>>>>>> >> >>>>> > > dominant cost. >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > Chang >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > On 10/12/21 10:24 AM, Junchao >>>>>>> Zhang wrote: >>>>>>> >> >>>>> > > > Hi, Chang, >>>>>>> >> >>>>> > > > For the mumps solver, we >>>>>>> usually >>>>>>> >> transfers >>>>>>> >> >>>>> matrix >>>>>>> >> >>>>> > and vector >>>>>>> >> >>>>> > > data >>>>>>> >> >>>>> > > > within a compute node. For >>>>>>> the idea you >>>>>>> >> >>>>> propose, it >>>>>>> >> >>>>> > looks like >>>>>>> >> >>>>> > > we need >>>>>>> >> >>>>> > > > to gather data within >>>>>>> >> MPI_COMM_WORLD, right? >>>>>>> >> >>>>> > > > >>>>>>> >> >>>>> > > > Mark, I remember you said >>>>>>> >> cusparse solve is >>>>>>> >> >>>>> slow >>>>>>> >> >>>>> > and you would >>>>>>> >> >>>>> > > > rather do it on CPU. Is it right? >>>>>>> >> >>>>> > > > >>>>>>> >> >>>>> > > > --Junchao Zhang >>>>>>> >> >>>>> > > > >>>>>>> >> >>>>> > > > >>>>>>> >> >>>>> > > > On Mon, Oct 11, 2021 at 10:25 PM >>>>>>> >> Chang Liu via >>>>>>> >> >>>>> petsc-users >>>>>>> >> >>>>> > > > >>>>>> >>>>>>> >> > >>>>>>> >> >>>>> >>>>>> >>>>>>> >> >>>>>> >> >>>>>>> >> >>>>> > >>>>>> >>>>>>> >> > >>>>>>> >> >>>>> >>>>>> >>>>>>> >> >>>>>> >>> >>>>>> >>>>>>> >> > >>>>>>> >> >>>>> >>>>>> >>>>>>> >> >>>>>> >> >>>>>>> >> >>>>> > >>>>>> >>>>>>> >> > >>>>>>> >> >>>>> >>>>>> >>>>>>> >> >>>>>> >>>> >>>>>>> >> >>>>> > > >>>>>> >>>>>>> >> > >>>>>>> >> >>>>> >>>>>> >>>>>>> >> >>>>>> >> >>>>>>> >> >>>>> > >>>>>> >>>>>>> >> > >>>>>>> >> >>>>> >>>>>> >>>>>>> >> >>>>>> >>> >>>>>> >>>>>>> >> > >>>>>>> >> >>>>> >>>>>> >>>>>>> >> >>>>>> >> >>>>>>> >> >>>>> > >>>>>> >>>>>>> >> > >>>>>>> >> >>>>> >>>>>> >>>>>>> >> >>>>>> >>>>>> >>>>>>> >> >>>>> > > wrote: >>>>>>> >> >>>>> > > > >>>>>>> >> >>>>> > > > Hi, >>>>>>> >> >>>>> > > > >>>>>>> >> >>>>> > > > Currently, it is possible >>>>>>> to use >>>>>>> >> mumps >>>>>>> >> >>>>> solver in >>>>>>> >> >>>>> > PETSC with >>>>>>> >> >>>>> > > > -mat_mumps_use_omp_threads >>>>>>> >> option, so that >>>>>>> >> >>>>> > multiple MPI >>>>>>> >> >>>>> > > processes will >>>>>>> >> >>>>> > > > transfer the matrix and >>>>>>> rhs data >>>>>>> >> to the master >>>>>>> >> >>>>> > rank, and then >>>>>>> >> >>>>> > > master >>>>>>> >> >>>>> > > > rank will call mumps with >>>>>>> OpenMP >>>>>>> >> to solve >>>>>>> >> >>>>> the matrix. >>>>>>> >> >>>>> > > > >>>>>>> >> >>>>> > > > I wonder if someone can >>>>>>> develop >>>>>>> >> similar >>>>>>> >> >>>>> option for >>>>>>> >> >>>>> > cusparse >>>>>>> >> >>>>> > > solver. >>>>>>> >> >>>>> > > > Right now, this solver >>>>>>> does not >>>>>>> >> work with >>>>>>> >> >>>>> > mpiaijcusparse. 
I >>>>>>> >> >>>>> > > think a >>>>>>> >> >>>>> > > > possible workaround is to >>>>>>> >> transfer all the >>>>>>> >> >>>>> matrix >>>>>>> >> >>>>> > data to one MPI >>>>>>> >> >>>>> > > > process, and then upload the >>>>>>> >> data to GPU to >>>>>>> >> >>>>> solve. >>>>>>> >> >>>>> > In this >>>>>>> >> >>>>> > > way, one can >>>>>>> >> >>>>> > > > use cusparse solver for a MPI >>>>>>> >> program. >>>>>>> >> >>>>> > > > >>>>>>> >> >>>>> > > > Chang >>>>>>> >> >>>>> > > > -- >>>>>>> >> >>>>> > > > Chang Liu >>>>>>> >> >>>>> > > > Staff Research Physicist >>>>>>> >> >>>>> > > > +1 609 243 3438 >>>>>>> >> >>>>> > > > cliu at pppl.gov >>>>>>> > >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>> >>>>>>> > >>>>>>> >> >>>>>>> >>> >>>>>>> >> >>>>> > >>>>>> > >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>> >>>>>>> > >>>>>>> >> >>>>>>> >>>> >>>>>>> >> >>>>> > >>>>>> > >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>> >>>>>>> > >>>>>>> >> >>>>>>> >>> >>>>>>> >> >>>>> > > >>>>>> >>>>>>> >> > >>>>>>> >>>>>> >> >>>>>>> >> >>>>> >>>>>>> > >>>>>>> >> >>>>>>> >>>>> >>>>>>> >> >>>>> > > > Princeton Plasma Physics >>>>>>> Laboratory >>>>>>> >> >>>>> > > > 100 Stellarator Rd, >>>>>>> Princeton NJ >>>>>>> >> 08540, USA >>>>>>> >> >>>>> > > > >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > > -- >>>>>>> >> >>>>> > > Chang Liu >>>>>>> >> >>>>> > > Staff Research Physicist >>>>>>> >> >>>>> > > +1 609 243 3438 >>>>>>> >> >>>>> > > cliu at pppl.gov >>>>>>> > >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>> >>>>>>> > >>>>>>> >> >>>>>>> >>> >>>>>>> >>>>>>> >> > >>>>>>> >> >>>>> >>>>>>> >> >>>>>>> >> >>>>> > >>>>>> > >>>>>>> >> >>>>>>> >>>> >>>>>>> >> >>>>> > > Princeton Plasma Physics Laboratory >>>>>>> >> >>>>> > > 100 Stellarator Rd, Princeton NJ >>>>>>> 08540, USA >>>>>>> >> >>>>> > > >>>>>>> >> >>>>> > >>>>>>> >> >>>>> > -- >>>>>>> >> >>>>> > Chang Liu >>>>>>> >> >>>>> > Staff Research Physicist >>>>>>> >> >>>>> > +1 609 243 3438 >>>>>>> >> >>>>> > cliu at pppl.gov >>>>>>> > >>>>>>> >> >>>>>>> >> >>>>>> >>>>>>> >> > >>>>>>> >> >>>>> >>>>>>> >>> >>>>>>> >> >>>>> > Princeton Plasma Physics Laboratory >>>>>>> >> >>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>> >> >>>>> > >>>>>>> >> >>>>> -- Chang Liu >>>>>>> >> >>>>> Staff Research Physicist >>>>>>> >> >>>>> +1 609 243 3438 >>>>>>> >> >>>>> cliu at pppl.gov >>>>>>> > >>>>>> >>>>>>> >> >> >>>>>>> >> >>>>> Princeton Plasma Physics Laboratory >>>>>>> >> >>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>> >> >>>> >>>>>>> >> >>>> -- >>>>>>> >> >>>> Chang Liu >>>>>>> >> >>>> Staff Research Physicist >>>>>>> >> >>>> +1 609 243 3438 >>>>>>> >> >>>> cliu at pppl.gov >>>>>>> > >>>>>>> >> >>>> Princeton Plasma Physics Laboratory >>>>>>> >> >>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>> >> >> >>>>>>> >> >> -- >>>>>>> >> >> Chang Liu >>>>>>> >> >> Staff Research Physicist >>>>>>> >> >> +1 609 243 3438 >>>>>>> >> >> cliu at pppl.gov >>>>>>> > >>>>>>> >> >> Princeton Plasma Physics Laboratory >>>>>>> >> >> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>> >> > >>>>>>> >> -- Chang Liu >>>>>>> >> Staff Research Physicist >>>>>>> >> +1 609 243 3438 >>>>>>> >> cliu at pppl.gov >>>>>> > >>>>>>> >> Princeton Plasma Physics Laboratory >>>>>>> >> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>> > >>>>>>> > -- >>>>>>> > Chang Liu >>>>>>> > Staff Research Physicist >>>>>>> > +1 609 243 3438 >>>>>>> > cliu at pppl.gov >>>>>>> > Princeton Plasma Physics Laboratory >>>>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>> >>>>>> -- >>>>>> Chang Liu >>>>>> Staff Research Physicist >>>>>> +1 609 243 3438 
>>>>>> cliu at pppl.gov >>>>>> Princeton Plasma Physics Laboratory >>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>> >>>> -- >>>> Chang Liu >>>> Staff Research Physicist >>>> +1 609 243 3438 >>>> cliu at pppl.gov >>>> Princeton Plasma Physics Laboratory >>>> 100 Stellarator Rd, Princeton NJ 08540, USA >> >> -- >> Chang Liu >> Staff Research Physicist >> +1 609 243 3438 >> cliu at pppl.gov >> Princeton Plasma Physics Laboratory >> 100 Stellarator Rd, Princeton NJ 08540, USA > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From bsmith at petsc.dev Thu Oct 14 20:47:23 2021 From: bsmith at petsc.dev (Barry Smith) Date: Thu, 14 Oct 2021 21:47:23 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> References: <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> Message-ID: Chang, Sorry I did not notice that one. Please run that with -ksp_view -ksp_monitor_true_residual so we can see exactly how options are interpreted and solver used. At a glance it looks ok but something must be wrong to get the wrong answer. Barry > On Oct 14, 2021, at 6:02 PM, Chang Liu wrote: > > Hi Barry, > > That is exactly what I was doing in the second example, in which the preconditioner works but the GMRES does not. > > Chang > > On 10/14/21 5:15 PM, Barry Smith wrote: >> You need to use the PCTELESCOPE inside the block Jacobi, not outside it. So something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu >>> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: >>> >>> Hi Pierre, >>> >>> I wonder if the trick of PCTELESCOPE only works for preconditioner and not for the solver. I have done some tests, and find that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work. >>> >>> The command line options I used for small matrix is like >>> >>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4 >>> >>> which gives the correct output. For iterative solver, I tried >>> >>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20 >>> >>> for large matrix. The output is like >>> >>> 0 KSP Residual norm 40.1497 >>> 1 KSP Residual norm < 1.e-11 >>> Norm of error 400.999 iterations 1 >>> >>> So it seems to call a direct solver instead of an iterative one. >>> >>> Can you please help check these options? >>> >>> Chang >>> >>> On 10/14/21 10:04 AM, Pierre Jolivet wrote: >>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote: >>>>> >>>>> Thank you Pierre. 
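Two more stock PETSc diagnostics often help when a run appears to "converge" in one iteration to a wrong answer (these are generic suggestions, not options requested in the thread): -ksp_converged_reason, which prints why the outer Krylov loop stopped, and -options_left, which lists any option that was set but never queried and so catches a mistyped prefix (for example a -sub_telescope_ option that nothing consumed). They can simply be appended after the solver options already shown above:

    -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -options_left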
I was not aware of PCTELESCOPE before. This sounds exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaircusparse? Or I have to do it manually? >>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). >>>> 1) I?m not sure this is implemented for cuSparse matrices, but it should be; >>>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning. >>>> If you try this out and this does not work, please provide the backtrace (probably something like ?Operation XYZ not implemented for MatType ABC?), and hopefully someone can add the missing plumbing. >>>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve. >>>> Thanks, >>>> Pierre >>>>> Chang >>>>> >>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >>>>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block? >>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu >>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads because not only do the Mat needs to be redistributed, the secondary processes also need to be ?converted? to OpenMP threads. >>>>>> Thus the need for specific code in mumps.c. >>>>>> Thanks, >>>>>> Pierre >>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote: >>>>>>> >>>>>>> Hi Junchao, >>>>>>> >>>>>>> Yes that is what I want. >>>>>>> >>>>>>> Chang >>>>>>> >>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: >>>>>>>> Junchao, >>>>>>>> If I understand correctly Chang is using the block Jacobi >>>>>>>> method with a single block for a number of MPI ranks and a direct >>>>>>>> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which >>>>>>>> is code Hong Zhang wrote a number of years ago for CPUs. For their >>>>>>>> particular problems this preconditioner works well, but using an >>>>>>>> iterative solver on the blocks does not work well. >>>>>>>> If we had complete MPI-GPU direct solvers he could just use >>>>>>>> the current code with MPIAIJCUSPARSE on each block but since we do >>>>>>>> not he would like to use a single GPU for each block, this means >>>>>>>> that diagonal blocks of the global parallel MPI matrix needs to be >>>>>>>> sent to a subset of the GPUs (one GPU per block, which has multiple >>>>>>>> MPI ranks associated with the blocks). Similarly for the triangular >>>>>>>> solves the blocks of the right hand side needs to be shipped to the >>>>>>>> appropriate GPU and the resulting solution shipped back to the >>>>>>>> multiple GPUs. So Chang is absolutely correct, this is somewhat like >>>>>>>> your code for MUMPS with OpenMP. OK, I now understand the background.. >>>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the >>>>>>>> MPI ranks and then shrink each block down to a single GPU but this >>>>>>>> would be pretty inefficient, ideally one would go directly from the >>>>>>>> big MPI matrix on all the GPUs to the sub matrices on the subset of >>>>>>>> GPUs. But this may be a large coding project. 
>>>>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. >>>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. >>>>>>>> Barry >>>>>>>> Since the matrices being factored and solved directly are relatively >>>>>>>> large it is possible that the cusparse code could be reasonably >>>>>>>> efficient (they are not the tiny problems one gets at the coarse >>>>>>>> level of multigrid). Of course, this is speculation, I don't >>>>>>>> actually know how much better the cusparse code would be on the >>>>>>>> direct solver than a good CPU direct sparse solver. >>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>>> > wrote: >>>>>>>> > >>>>>>>> > Sorry I am not familiar with the details either. Can you please >>>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>>>> > >>>>>>>> > Chang >>>>>>>> > >>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>>>> >> Hi Chang, >>>>>>>> >> I did the work in mumps. It is easy for me to understand >>>>>>>> gathering matrix rows to one process. >>>>>>>> >> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that? >>>>>>>> >> Thanks >>>>>>>> >> --Junchao Zhang >>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users >>>>>>>> >>>>>>>> >> >>>>>>>> wrote: >>>>>>>> >> Hi Barry, >>>>>>>> >> I think mumps solver in petsc does support that. You can >>>>>>>> check the >>>>>>>> >> documentation on "-mat_mumps_use_omp_threads" at >>>>>>>> >> >>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>>>> >>>>>>>> >> >>>>>>> > >>>>>>>> >> and the code enclosed by #if >>>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>>>> >> functions MatMumpsSetUpDistRHSInfo and >>>>>>>> >> MatMumpsGatherNonzerosOnMaster in >>>>>>>> >> mumps.c >>>>>>>> >> 1. I understand it is ideal to do one MPI rank per GPU. >>>>>>>> However, I am >>>>>>>> >> working on an existing code that was developed based on MPI >>>>>>>> and the the >>>>>>>> >> # of mpi ranks is typically equal to # of cpu cores. We don't >>>>>>>> want to >>>>>>>> >> change the whole structure of the code. >>>>>>>> >> 2. What you have suggested has been coded in mumps.c. See >>>>>>>> function >>>>>>>> >> MatMumpsSetUpDistRHSInfo. >>>>>>>> >> Regards, >>>>>>>> >> Chang >>>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>>>> >>>>>>>> >> >> wrote: >>>>>>>> >> >> >>>>>>>> >> >> Hi Barry, >>>>>>>> >> >> >>>>>>>> >> >> That is exactly what I want. >>>>>>>> >> >> >>>>>>>> >> >> Back to my original question, I am looking for an approach to >>>>>>>> >> transfer >>>>>>>> >> >> matrix >>>>>>>> >> >> data from many MPI processes to "master" MPI >>>>>>>> >> >> processes, each of which taking care of one GPU, and then >>>>>>>> upload >>>>>>>> >> the data to GPU to >>>>>>>> >> >> solve. >>>>>>>> >> >> One can just grab some codes from mumps.c to >>>>>>>> aijcusparse.cu >>>>>>>> >> >. >>>>>>>> >> > >>>>>>>> >> > mumps.c doesn't actually do that. It never needs to >>>>>>>> copy the >>>>>>>> >> entire matrix to a single MPI rank. 
>>>>>>>> >> > >>>>>>>> >> > It would be possible to write such a code that you >>>>>>>> suggest but >>>>>>>> >> it is not clear that it makes sense >>>>>>>> >> > >>>>>>>> >> > 1) For normal PETSc GPU usage there is one GPU per MPI >>>>>>>> rank, so >>>>>>>> >> while your one GPU per big domain is solving its systems the >>>>>>>> other >>>>>>>> >> GPUs (with the other MPI ranks that share that domain) are doing >>>>>>>> >> nothing. >>>>>>>> >> > >>>>>>>> >> > 2) For each triangular solve you would have to gather the >>>>>>>> right >>>>>>>> >> hand side from the multiple ranks to the single GPU to pass it to >>>>>>>> >> the GPU solver and then scatter the resulting solution back >>>>>>>> to all >>>>>>>> >> of its subdomain ranks. >>>>>>>> >> > >>>>>>>> >> > What I was suggesting was assign an entire subdomain to a >>>>>>>> >> single MPI rank, thus it does everything on one GPU and can >>>>>>>> use the >>>>>>>> >> GPU solver directly. If all the major computations of a subdomain >>>>>>>> >> can fit and be done on a single GPU then you would be >>>>>>>> utilizing all >>>>>>>> >> the GPUs you are using effectively. >>>>>>>> >> > >>>>>>>> >> > Barry >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> >> >>>>>>>> >> >> Chang >>>>>>>> >> >> >>>>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>>>>> >> >>> Chang, >>>>>>>> >> >>> You are correct there is no MPI + GPU direct >>>>>>>> solvers that >>>>>>>> >> currently do the triangular solves with MPI + GPU parallelism >>>>>>>> that I >>>>>>>> >> am aware of. You are limited that individual triangular solves be >>>>>>>> >> done on a single GPU. I can only suggest making each subdomain as >>>>>>>> >> big as possible to utilize each GPU as much as possible for the >>>>>>>> >> direct triangular solves. >>>>>>>> >> >>> Barry >>>>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users >>>>>>>> >> >>>>>>>> >> >>>>>>>> wrote: >>>>>>>> >> >>>> >>>>>>>> >> >>>> Hi Mark, >>>>>>>> >> >>>> >>>>>>>> >> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with >>>>>>>> other >>>>>>>> >> solvers, but with -pc_factor_mat_solver_type cusparse, it >>>>>>>> will give >>>>>>>> >> an error. >>>>>>>> >> >>>> >>>>>>>> >> >>>> Yes what I want is to have mumps or superlu to do the >>>>>>>> >> factorization, and then do the rest, including GMRES solver, >>>>>>>> on gpu. >>>>>>>> >> Is that possible? >>>>>>>> >> >>>> >>>>>>>> >> >>>> I have tried to use aijcusparse with superlu_dist, it >>>>>>>> runs but >>>>>>>> >> the iterative solver is still running on CPUs. I have >>>>>>>> contacted the >>>>>>>> >> superlu group and they confirmed that is the case right now. >>>>>>>> But if >>>>>>>> >> I set -pc_factor_mat_solver_type cusparse, it seems that the >>>>>>>> >> iterative solver is running on GPU. >>>>>>>> >> >>>> >>>>>>>> >> >>>> Chang >>>>>>>> >> >>>> >>>>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >>>>>>>> >>>>>>>> >> > >>>>>>>> >>>>>>>> >> >>> wrote: >>>>>>>> >> >>>>> Thank you Junchao for explaining this. I guess in >>>>>>>> my case >>>>>>>> >> the code is >>>>>>>> >> >>>>> just calling a seq solver like superlu to do >>>>>>>> >> factorization on GPUs. >>>>>>>> >> >>>>> My idea is that I want to have a traditional MPI >>>>>>>> code to >>>>>>>> >> utilize GPUs >>>>>>>> >> >>>>> with cusparse. Right now cusparse does not support >>>>>>>> mpiaij >>>>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' will give you an >>>>>>>> >> mpiaijcusparse matrix with > 1 processes. 
>>>>>>>> >> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). >>>>>>>> >> >>>>> However, I see in grepping the repo that all the mumps and >>>>>>>> >> superlu tests use aij or sell matrix type. >>>>>>>> >> >>>>> MUMPS and SuperLU provide their own solves, I assume >>>>>>>> .... but >>>>>>>> >> you might want to do other matrix operations on the GPU. Is >>>>>>>> that the >>>>>>>> >> issue? >>>>>>>> >> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or >>>>>>>> SuperLU >>>>>>>> >> have a problem? (no test with it so it probably does not work) >>>>>>>> >> >>>>> Thanks, >>>>>>>> >> >>>>> Mark >>>>>>>> >> >>>>> so I >>>>>>>> >> >>>>> want the code to have a mpiaij matrix when adding >>>>>>>> all the >>>>>>>> >> matrix terms, >>>>>>>> >> >>>>> and then transform the matrix to seqaij when doing the >>>>>>>> >> factorization >>>>>>>> >> >>>>> and >>>>>>>> >> >>>>> solve. This involves sending the data to the master >>>>>>>> >> process, and I >>>>>>>> >> >>>>> think >>>>>>>> >> >>>>> the petsc mumps solver have something similar already. >>>>>>>> >> >>>>> Chang >>>>>>>> >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote: >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >>>>>>>> >> >>>>>>>> > >>>>>>>> >> >>>>> >>>>>>>> >> >>>>>>>> >> >>>>> > >>>>>>> >>>>>>> > >>>>>>>> >> >>>>>>>> >>>> wrote: >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >>>>>>>> >> >>>>>>> > >>>>>>>> >> >>>>> >>>>>>>> >> >>>>>>>> >> >>>>> > >>>>>>> > >>>>>>>> >> >>>>>>>> >>>> wrote: >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > Hi Mark, >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > The option I use is like >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > -pc_type bjacobi -pc_bjacobi_blocks 16 >>>>>>>> >> -ksp_type fgmres >>>>>>>> >> >>>>> -mat_type >>>>>>>> >> >>>>> > aijcusparse *-sub_pc_factor_mat_solver_type >>>>>>>> >> cusparse >>>>>>>> >> >>>>> *-sub_ksp_type >>>>>>>> >> >>>>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 >>>>>>>> >> -ksp_rtol 1.e-300 >>>>>>>> >> >>>>> > -ksp_atol 1.e-300 >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > Note, If you use -log_view the last column >>>>>>>> (rows >>>>>>>> >> are the >>>>>>>> >> >>>>> method like >>>>>>>> >> >>>>> > MatFactorNumeric) has the percent of work >>>>>>>> in the GPU. >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > Junchao: *This* implies that we have a >>>>>>>> cuSparse LU >>>>>>>> >> >>>>> factorization. Is >>>>>>>> >> >>>>> > that correct? (I don't think we do) >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > No, we don't have cuSparse LU factorization. If you check >>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >>>>>>>> find it >>>>>>>> >> calls >>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJ() instead. >>>>>>>> >> >>>>> > So I don't understand Chang's idea. Do you want to >>>>>>>> >> make bigger >>>>>>>> >> >>>>> blocks? >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > I think this one do both factorization and >>>>>>>> >> solve on gpu. >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > You can check the >>>>>>>> runex72_aijcusparse.sh file >>>>>>>> >> in petsc >>>>>>>> >> >>>>> install >>>>>>>> >> >>>>> > directory, and try it your self (this >>>>>>>> is only lu >>>>>>>> >> >>>>> factorization >>>>>>>> >> >>>>> > without >>>>>>>> >> >>>>> > iterative solve). 
>>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > Chang >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > On 10/12/21 1:17 PM, Mark Adams wrote: >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > On Tue, Oct 12, 2021 at 11:19 AM >>>>>>>> Chang Liu >>>>>>>> >> >>>>> >>>>>>>> > >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>> > >>>>>>> > >>>>>>>> >> >>>>>>>> >>> >>>>>>>> >> >>>>> > > >>>>>>> >>>>>>>> >> > >>>>>>>> >>>>>>> >> >>>>>>>> >> >>>>> >>>>>>>> > >>>>>>>> >> >>>>>>>> >>>>> wrote: >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > Hi Junchao, >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > No I only needs it to be transferred >>>>>>>> >> within a >>>>>>>> >> >>>>> node. I use >>>>>>>> >> >>>>> > block-Jacobi >>>>>>>> >> >>>>> > > method and GMRES to solve the sparse >>>>>>>> >> matrix, so each >>>>>>>> >> >>>>> > direct solver will >>>>>>>> >> >>>>> > > take care of a sub-block of the >>>>>>>> whole >>>>>>>> >> matrix. In this >>>>>>>> >> >>>>> > way, I can use >>>>>>>> >> >>>>> > > one >>>>>>>> >> >>>>> > > GPU to solve one sub-block, which is >>>>>>>> >> stored within >>>>>>>> >> >>>>> one node. >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > It was stated in the >>>>>>>> documentation that >>>>>>>> >> cusparse >>>>>>>> >> >>>>> solver >>>>>>>> >> >>>>> > is slow. >>>>>>>> >> >>>>> > > However, in my test using >>>>>>>> ex72.c, the >>>>>>>> >> cusparse >>>>>>>> >> >>>>> solver is >>>>>>>> >> >>>>> > faster than >>>>>>>> >> >>>>> > > mumps or superlu_dist on CPUs. >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > Are we talking about the >>>>>>>> factorization, the >>>>>>>> >> solve, or >>>>>>>> >> >>>>> both? >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > We do not have an interface to >>>>>>>> cuSparse's LU >>>>>>>> >> >>>>> factorization (I >>>>>>>> >> >>>>> > just >>>>>>>> >> >>>>> > > learned that it exists a few weeks ago). >>>>>>>> >> >>>>> > > Perhaps your fast "cusparse solver" is >>>>>>>> >> '-pc_type lu >>>>>>>> >> >>>>> -mat_type >>>>>>>> >> >>>>> > > aijcusparse' ? This would be the CPU >>>>>>>> >> factorization, >>>>>>>> >> >>>>> which is the >>>>>>>> >> >>>>> > > dominant cost. >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > Chang >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > On 10/12/21 10:24 AM, Junchao >>>>>>>> Zhang wrote: >>>>>>>> >> >>>>> > > > Hi, Chang, >>>>>>>> >> >>>>> > > > For the mumps solver, we >>>>>>>> usually >>>>>>>> >> transfers >>>>>>>> >> >>>>> matrix >>>>>>>> >> >>>>> > and vector >>>>>>>> >> >>>>> > > data >>>>>>>> >> >>>>> > > > within a compute node. For >>>>>>>> the idea you >>>>>>>> >> >>>>> propose, it >>>>>>>> >> >>>>> > looks like >>>>>>>> >> >>>>> > > we need >>>>>>>> >> >>>>> > > > to gather data within >>>>>>>> >> MPI_COMM_WORLD, right? >>>>>>>> >> >>>>> > > > >>>>>>>> >> >>>>> > > > Mark, I remember you said >>>>>>>> >> cusparse solve is >>>>>>>> >> >>>>> slow >>>>>>>> >> >>>>> > and you would >>>>>>>> >> >>>>> > > > rather do it on CPU. Is it right? 
>>>>>>>> >> >>>>> > > > >>>>>>>> >> >>>>> > > > --Junchao Zhang >>>>>>>> >> >>>>> > > > >>>>>>>> >> >>>>> > > > >>>>>>>> >> >>>>> > > > On Mon, Oct 11, 2021 at 10:25 PM >>>>>>>> >> Chang Liu via >>>>>>>> >> >>>>> petsc-users >>>>>>>> >> >>>>> > > > >>>>>>> >>>>>>>> >> > >>>>>>>> >> >>>>> >>>>>>> >>>>>>>> >> >>>>>>> >> >>>>>>>> >> >>>>> > >>>>>>> >>>>>>>> >> > >>>>>>>> >> >>>>> >>>>>>> >>>>>>>> >> >>>>>>> >>> >>>>>>> >>>>>>>> >> > >>>>>>>> >> >>>>> >>>>>>> >>>>>>>> >> >>>>>>> >> >>>>>>>> >> >>>>> > >>>>>>> >>>>>>>> >> > >>>>>>>> >> >>>>> >>>>>>> >>>>>>>> >> >>>>>>> >>>> >>>>>>>> >> >>>>> > > >>>>>>> >>>>>>>> >> > >>>>>>>> >> >>>>> >>>>>>> >>>>>>>> >> >>>>>>> >> >>>>>>>> >> >>>>> > >>>>>>> >>>>>>>> >> > >>>>>>>> >> >>>>> >>>>>>> >>>>>>>> >> >>>>>>> >>> >>>>>>> >>>>>>>> >> > >>>>>>>> >> >>>>> >>>>>>> >>>>>>>> >> >>>>>>> >> >>>>>>>> >> >>>>> > >>>>>>> >>>>>>>> >> > >>>>>>>> >> >>>>> >>>>>>> >>>>>>>> >> >>>>>>> >>>>>> >>>>>>>> >> >>>>> > > wrote: >>>>>>>> >> >>>>> > > > >>>>>>>> >> >>>>> > > > Hi, >>>>>>>> >> >>>>> > > > >>>>>>>> >> >>>>> > > > Currently, it is possible >>>>>>>> to use >>>>>>>> >> mumps >>>>>>>> >> >>>>> solver in >>>>>>>> >> >>>>> > PETSC with >>>>>>>> >> >>>>> > > > -mat_mumps_use_omp_threads >>>>>>>> >> option, so that >>>>>>>> >> >>>>> > multiple MPI >>>>>>>> >> >>>>> > > processes will >>>>>>>> >> >>>>> > > > transfer the matrix and >>>>>>>> rhs data >>>>>>>> >> to the master >>>>>>>> >> >>>>> > rank, and then >>>>>>>> >> >>>>> > > master >>>>>>>> >> >>>>> > > > rank will call mumps with >>>>>>>> OpenMP >>>>>>>> >> to solve >>>>>>>> >> >>>>> the matrix. >>>>>>>> >> >>>>> > > > >>>>>>>> >> >>>>> > > > I wonder if someone can >>>>>>>> develop >>>>>>>> >> similar >>>>>>>> >> >>>>> option for >>>>>>>> >> >>>>> > cusparse >>>>>>>> >> >>>>> > > solver. >>>>>>>> >> >>>>> > > > Right now, this solver >>>>>>>> does not >>>>>>>> >> work with >>>>>>>> >> >>>>> > mpiaijcusparse. I >>>>>>>> >> >>>>> > > think a >>>>>>>> >> >>>>> > > > possible workaround is to >>>>>>>> >> transfer all the >>>>>>>> >> >>>>> matrix >>>>>>>> >> >>>>> > data to one MPI >>>>>>>> >> >>>>> > > > process, and then upload the >>>>>>>> >> data to GPU to >>>>>>>> >> >>>>> solve. >>>>>>>> >> >>>>> > In this >>>>>>>> >> >>>>> > > way, one can >>>>>>>> >> >>>>> > > > use cusparse solver for a MPI >>>>>>>> >> program. 
>>>>>>>> >> >>>>> > > > >>>>>>>> >> >>>>> > > > Chang >>>>>>>> >> >>>>> > > > -- >>>>>>>> >> >>>>> > > > Chang Liu >>>>>>>> >> >>>>> > > > Staff Research Physicist >>>>>>>> >> >>>>> > > > +1 609 243 3438 >>>>>>>> >> >>>>> > > > cliu at pppl.gov >>>>>>>> > >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>> >>>>>>>> > >>>>>>>> >> >>>>>>>> >>> >>>>>>>> >> >>>>> > >>>>>>> > >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>> >>>>>>>> > >>>>>>>> >> >>>>>>>> >>>> >>>>>>>> >> >>>>> > >>>>>>> > >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>> >>>>>>>> > >>>>>>>> >> >>>>>>>> >>> >>>>>>>> >> >>>>> > > >>>>>>> >>>>>>>> >> > >>>>>>>> >>>>>>> >> >>>>>>>> >> >>>>> >>>>>>>> > >>>>>>>> >> >>>>>>>> >>>>> >>>>>>>> >> >>>>> > > > Princeton Plasma Physics >>>>>>>> Laboratory >>>>>>>> >> >>>>> > > > 100 Stellarator Rd, >>>>>>>> Princeton NJ >>>>>>>> >> 08540, USA >>>>>>>> >> >>>>> > > > >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > > -- >>>>>>>> >> >>>>> > > Chang Liu >>>>>>>> >> >>>>> > > Staff Research Physicist >>>>>>>> >> >>>>> > > +1 609 243 3438 >>>>>>>> >> >>>>> > > cliu at pppl.gov >>>>>>>> > >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>> >>>>>>>> > >>>>>>>> >> >>>>>>>> >>> >>>>>>>> >>>>>>>> >> > >>>>>>>> >> >>>>> >>>>>>>> >> >>>>>>>> >> >>>>> > >>>>>>> > >>>>>>>> >> >>>>>>>> >>>> >>>>>>>> >> >>>>> > > Princeton Plasma Physics Laboratory >>>>>>>> >> >>>>> > > 100 Stellarator Rd, Princeton NJ >>>>>>>> 08540, USA >>>>>>>> >> >>>>> > > >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> > -- >>>>>>>> >> >>>>> > Chang Liu >>>>>>>> >> >>>>> > Staff Research Physicist >>>>>>>> >> >>>>> > +1 609 243 3438 >>>>>>>> >> >>>>> > cliu at pppl.gov >>>>>>>> > >>>>>>>> >> >>>>>>>> >> >>>>>>> >>>>>>>> >> > >>>>>>>> >> >>>>> >>>>>>>> >>> >>>>>>>> >> >>>>> > Princeton Plasma Physics Laboratory >>>>>>>> >> >>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>> >> >>>>> > >>>>>>>> >> >>>>> -- Chang Liu >>>>>>>> >> >>>>> Staff Research Physicist >>>>>>>> >> >>>>> +1 609 243 3438 >>>>>>>> >> >>>>> cliu at pppl.gov >>>>>>>> > >>>>>>> >>>>>>>> >> >> >>>>>>>> >> >>>>> Princeton Plasma Physics Laboratory >>>>>>>> >> >>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>> >> >>>> >>>>>>>> >> >>>> -- >>>>>>>> >> >>>> Chang Liu >>>>>>>> >> >>>> Staff Research Physicist >>>>>>>> >> >>>> +1 609 243 3438 >>>>>>>> >> >>>> cliu at pppl.gov >>>>>>>> > >>>>>>>> >> >>>> Princeton Plasma Physics Laboratory >>>>>>>> >> >>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>> >> >> >>>>>>>> >> >> -- >>>>>>>> >> >> Chang Liu >>>>>>>> >> >> Staff Research Physicist >>>>>>>> >> >> +1 609 243 3438 >>>>>>>> >> >> cliu at pppl.gov >>>>>>>> > >>>>>>>> >> >> Princeton Plasma Physics Laboratory >>>>>>>> >> >> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>> >> > >>>>>>>> >> -- Chang Liu >>>>>>>> >> Staff Research Physicist >>>>>>>> >> +1 609 243 3438 >>>>>>>> >> cliu at pppl.gov >>>>>>> > >>>>>>>> >> Princeton Plasma Physics Laboratory >>>>>>>> >> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>> > >>>>>>>> > -- >>>>>>>> > Chang Liu >>>>>>>> > Staff Research Physicist >>>>>>>> > +1 609 243 3438 >>>>>>>> > cliu at pppl.gov >>>>>>>> > Princeton Plasma Physics Laboratory >>>>>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>> >>>>>>> -- >>>>>>> Chang Liu >>>>>>> Staff Research Physicist >>>>>>> +1 609 243 3438 >>>>>>> cliu at pppl.gov >>>>>>> Princeton Plasma Physics Laboratory >>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>> >>>>> -- >>>>> Chang Liu >>>>> Staff Research Physicist >>>>> +1 609 243 3438 >>>>> cliu at pppl.gov >>>>> Princeton Plasma Physics Laboratory 
>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>> >>> -- >>> Chang Liu >>> Staff Research Physicist >>> +1 609 243 3438 >>> cliu at pppl.gov >>> Princeton Plasma Physics Laboratory >>> 100 Stellarator Rd, Princeton NJ 08540, USA > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA From cliu at pppl.gov Thu Oct 14 21:10:16 2021 From: cliu at pppl.gov (Chang Liu) Date: Thu, 14 Oct 2021 22:10:16 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> Message-ID: Hi Barry, No problem. Here is the output. It seems that the resid norm calculation is incorrect. $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 KSP Object: 16 MPI processes type: fgmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=2000, initial guess is zero tolerances: relative=1e-20, absolute=1e-09, divergence=10000. right preconditioning using UNPRECONDITIONED norm type for convergence test PC Object: 16 MPI processes type: bjacobi number of blocks = 4 Local solver information for first block is in the following KSP and PC objects on rank 0: Use -ksp_view ::ascii_info_detail to display information for all blocks KSP Object: (sub_) 4 MPI processes type: preonly maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (sub_) 4 MPI processes type: telescope petsc subcomm: parent comm size reduction factor = 4 petsc subcomm: parent_size = 4 , subcomm_size = 1 petsc subcomm type = contiguous linear system matrix = precond matrix: Mat Object: (sub_) 4 MPI processes type: mpiaij rows=40200, cols=40200 total: nonzeros=199996, allocated nonzeros=203412 total number of mallocs used during MatSetValues calls=0 not using I-node (on process 0) routines setup type: default Parent DM object: NULL Sub DM object: NULL KSP Object: (sub_telescope_) 1 MPI processes type: preonly maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
left preconditioning using NONE norm type for convergence test PC Object: (sub_telescope_) 1 MPI processes type: lu out-of-place factorization tolerance for zero pivot 2.22045e-14 matrix ordering: nd factor fill ratio given 5., needed 8.62558 Factored matrix follows: Mat Object: 1 MPI processes type: seqaijcusparse rows=40200, cols=40200 package used to perform factorization: cusparse total: nonzeros=1725082, allocated nonzeros=1725082 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaijcusparse rows=40200, cols=40200 total: nonzeros=199996, allocated nonzeros=199996 total number of mallocs used during MatSetValues calls=0 not using I-node routines linear system matrix = precond matrix: Mat Object: 16 MPI processes type: mpiaijcusparse rows=160800, cols=160800 total: nonzeros=802396, allocated nonzeros=1608000 total number of mallocs used during MatSetValues calls=0 not using I-node (on process 0) routines Norm of error 400.999 iterations 1 Chang On 10/14/21 9:47 PM, Barry Smith wrote: > > Chang, > > Sorry I did not notice that one. Please run that with -ksp_view -ksp_monitor_true_residual so we can see exactly how options are interpreted and solver used. At a glance it looks ok but something must be wrong to get the wrong answer. > > Barry > >> On Oct 14, 2021, at 6:02 PM, Chang Liu wrote: >> >> Hi Barry, >> >> That is exactly what I was doing in the second example, in which the preconditioner works but the GMRES does not. >> >> Chang >> >> On 10/14/21 5:15 PM, Barry Smith wrote: >>> You need to use the PCTELESCOPE inside the block Jacobi, not outside it. So something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu >>>> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: >>>> >>>> Hi Pierre, >>>> >>>> I wonder if the trick of PCTELESCOPE only works for preconditioner and not for the solver. I have done some tests, and find that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work. >>>> >>>> The command line options I used for small matrix is like >>>> >>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4 >>>> >>>> which gives the correct output. For iterative solver, I tried >>>> >>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20 >>>> >>>> for large matrix. The output is like >>>> >>>> 0 KSP Residual norm 40.1497 >>>> 1 KSP Residual norm < 1.e-11 >>>> Norm of error 400.999 iterations 1 >>>> >>>> So it seems to call a direct solver instead of an iterative one. >>>> >>>> Can you please help check these options? >>>> >>>> Chang >>>> >>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote: >>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote: >>>>>> >>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaircusparse? Or I have to do it manually? >>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). 
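One way to interpret the run shown above, where the reported unpreconditioned residual drops to zero after one iteration while the reported error norm is 400.999, is to recompute the true residual outside KSP and so separate a genuinely wrong solution from a misreported norm. A minimal sketch (A, x and b stand for the operator, the computed solution and the right-hand side of the example; the names are illustrative, not taken from ex7):

    Vec            r;
    PetscReal      rnorm;
    PetscErrorCode ierr;
    ierr = VecDuplicate(b, &r);CHKERRQ(ierr);
    ierr = MatMult(A, x, r);CHKERRQ(ierr);       /* r = A*x     */
    ierr = VecAYPX(r, -1.0, b);CHKERRQ(ierr);    /* r = b - A*x */
    ierr = VecNorm(r, NORM_2, &rnorm);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "||b - A x|| = %g\n", (double)rnorm);CHKERRQ(ierr);
    ierr = VecDestroy(&r);CHKERRQ(ierr);

If this norm comes out large, the block solves themselves are producing a wrong result (for example an inner LU applied to the wrong sub-operator), not just a misleading monitor line.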
>>>>> 1) I?m not sure this is implemented for cuSparse matrices, but it should be; >>>>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning. >>>>> If you try this out and this does not work, please provide the backtrace (probably something like ?Operation XYZ not implemented for MatType ABC?), and hopefully someone can add the missing plumbing. >>>>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve. >>>>> Thanks, >>>>> Pierre >>>>>> Chang >>>>>> >>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >>>>>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block? >>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu >>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads because not only do the Mat needs to be redistributed, the secondary processes also need to be ?converted? to OpenMP threads. >>>>>>> Thus the need for specific code in mumps.c. >>>>>>> Thanks, >>>>>>> Pierre >>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote: >>>>>>>> >>>>>>>> Hi Junchao, >>>>>>>> >>>>>>>> Yes that is what I want. >>>>>>>> >>>>>>>> Chang >>>>>>>> >>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: >>>>>>>>> Junchao, >>>>>>>>> If I understand correctly Chang is using the block Jacobi >>>>>>>>> method with a single block for a number of MPI ranks and a direct >>>>>>>>> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which >>>>>>>>> is code Hong Zhang wrote a number of years ago for CPUs. For their >>>>>>>>> particular problems this preconditioner works well, but using an >>>>>>>>> iterative solver on the blocks does not work well. >>>>>>>>> If we had complete MPI-GPU direct solvers he could just use >>>>>>>>> the current code with MPIAIJCUSPARSE on each block but since we do >>>>>>>>> not he would like to use a single GPU for each block, this means >>>>>>>>> that diagonal blocks of the global parallel MPI matrix needs to be >>>>>>>>> sent to a subset of the GPUs (one GPU per block, which has multiple >>>>>>>>> MPI ranks associated with the blocks). Similarly for the triangular >>>>>>>>> solves the blocks of the right hand side needs to be shipped to the >>>>>>>>> appropriate GPU and the resulting solution shipped back to the >>>>>>>>> multiple GPUs. So Chang is absolutely correct, this is somewhat like >>>>>>>>> your code for MUMPS with OpenMP. OK, I now understand the background.. >>>>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the >>>>>>>>> MPI ranks and then shrink each block down to a single GPU but this >>>>>>>>> would be pretty inefficient, ideally one would go directly from the >>>>>>>>> big MPI matrix on all the GPUs to the sub matrices on the subset of >>>>>>>>> GPUs. But this may be a large coding project. >>>>>>>>> I don't understand these sentences. Why do you say "shrink"? 
In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. >>>>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. >>>>>>>>> Barry >>>>>>>>> Since the matrices being factored and solved directly are relatively >>>>>>>>> large it is possible that the cusparse code could be reasonably >>>>>>>>> efficient (they are not the tiny problems one gets at the coarse >>>>>>>>> level of multigrid). Of course, this is speculation, I don't >>>>>>>>> actually know how much better the cusparse code would be on the >>>>>>>>> direct solver than a good CPU direct sparse solver. >>>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>>>> > wrote: >>>>>>>>> > >>>>>>>>> > Sorry I am not familiar with the details either. Can you please >>>>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>>>>> > >>>>>>>>> > Chang >>>>>>>>> > >>>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>>>>> >> Hi Chang, >>>>>>>>> >> I did the work in mumps. It is easy for me to understand >>>>>>>>> gathering matrix rows to one process. >>>>>>>>> >> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that? >>>>>>>>> >> Thanks >>>>>>>>> >> --Junchao Zhang >>>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users >>>>>>>>> >>>>>>>>> >> >>>>>>>>> wrote: >>>>>>>>> >> Hi Barry, >>>>>>>>> >> I think mumps solver in petsc does support that. You can >>>>>>>>> check the >>>>>>>>> >> documentation on "-mat_mumps_use_omp_threads" at >>>>>>>>> >> >>>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>>>>> >>>>>>>>> >> >>>>>>>> > >>>>>>>>> >> and the code enclosed by #if >>>>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>>>>> >> functions MatMumpsSetUpDistRHSInfo and >>>>>>>>> >> MatMumpsGatherNonzerosOnMaster in >>>>>>>>> >> mumps.c >>>>>>>>> >> 1. I understand it is ideal to do one MPI rank per GPU. >>>>>>>>> However, I am >>>>>>>>> >> working on an existing code that was developed based on MPI >>>>>>>>> and the the >>>>>>>>> >> # of mpi ranks is typically equal to # of cpu cores. We don't >>>>>>>>> want to >>>>>>>>> >> change the whole structure of the code. >>>>>>>>> >> 2. What you have suggested has been coded in mumps.c. See >>>>>>>>> function >>>>>>>>> >> MatMumpsSetUpDistRHSInfo. >>>>>>>>> >> Regards, >>>>>>>>> >> Chang >>>>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>>>>> >>>>>>>>> >> >> wrote: >>>>>>>>> >> >> >>>>>>>>> >> >> Hi Barry, >>>>>>>>> >> >> >>>>>>>>> >> >> That is exactly what I want. >>>>>>>>> >> >> >>>>>>>>> >> >> Back to my original question, I am looking for an approach to >>>>>>>>> >> transfer >>>>>>>>> >> >> matrix >>>>>>>>> >> >> data from many MPI processes to "master" MPI >>>>>>>>> >> >> processes, each of which taking care of one GPU, and then >>>>>>>>> upload >>>>>>>>> >> the data to GPU to >>>>>>>>> >> >> solve. >>>>>>>>> >> >> One can just grab some codes from mumps.c to >>>>>>>>> aijcusparse.cu >>>>>>>>> >> >. >>>>>>>>> >> > >>>>>>>>> >> > mumps.c doesn't actually do that. It never needs to >>>>>>>>> copy the >>>>>>>>> >> entire matrix to a single MPI rank. 
>>>>>>>>> >> > >>>>>>>>> >> > It would be possible to write such a code that you >>>>>>>>> suggest but >>>>>>>>> >> it is not clear that it makes sense >>>>>>>>> >> > >>>>>>>>> >> > 1) For normal PETSc GPU usage there is one GPU per MPI >>>>>>>>> rank, so >>>>>>>>> >> while your one GPU per big domain is solving its systems the >>>>>>>>> other >>>>>>>>> >> GPUs (with the other MPI ranks that share that domain) are doing >>>>>>>>> >> nothing. >>>>>>>>> >> > >>>>>>>>> >> > 2) For each triangular solve you would have to gather the >>>>>>>>> right >>>>>>>>> >> hand side from the multiple ranks to the single GPU to pass it to >>>>>>>>> >> the GPU solver and then scatter the resulting solution back >>>>>>>>> to all >>>>>>>>> >> of its subdomain ranks. >>>>>>>>> >> > >>>>>>>>> >> > What I was suggesting was assign an entire subdomain to a >>>>>>>>> >> single MPI rank, thus it does everything on one GPU and can >>>>>>>>> use the >>>>>>>>> >> GPU solver directly. If all the major computations of a subdomain >>>>>>>>> >> can fit and be done on a single GPU then you would be >>>>>>>>> utilizing all >>>>>>>>> >> the GPUs you are using effectively. >>>>>>>>> >> > >>>>>>>>> >> > Barry >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> >> >> >>>>>>>>> >> >> Chang >>>>>>>>> >> >> >>>>>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>>>>>> >> >>> Chang, >>>>>>>>> >> >>> You are correct there is no MPI + GPU direct >>>>>>>>> solvers that >>>>>>>>> >> currently do the triangular solves with MPI + GPU parallelism >>>>>>>>> that I >>>>>>>>> >> am aware of. You are limited that individual triangular solves be >>>>>>>>> >> done on a single GPU. I can only suggest making each subdomain as >>>>>>>>> >> big as possible to utilize each GPU as much as possible for the >>>>>>>>> >> direct triangular solves. >>>>>>>>> >> >>> Barry >>>>>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> wrote: >>>>>>>>> >> >>>> >>>>>>>>> >> >>>> Hi Mark, >>>>>>>>> >> >>>> >>>>>>>>> >> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with >>>>>>>>> other >>>>>>>>> >> solvers, but with -pc_factor_mat_solver_type cusparse, it >>>>>>>>> will give >>>>>>>>> >> an error. >>>>>>>>> >> >>>> >>>>>>>>> >> >>>> Yes what I want is to have mumps or superlu to do the >>>>>>>>> >> factorization, and then do the rest, including GMRES solver, >>>>>>>>> on gpu. >>>>>>>>> >> Is that possible? >>>>>>>>> >> >>>> >>>>>>>>> >> >>>> I have tried to use aijcusparse with superlu_dist, it >>>>>>>>> runs but >>>>>>>>> >> the iterative solver is still running on CPUs. I have >>>>>>>>> contacted the >>>>>>>>> >> superlu group and they confirmed that is the case right now. >>>>>>>>> But if >>>>>>>>> >> I set -pc_factor_mat_solver_type cusparse, it seems that the >>>>>>>>> >> iterative solver is running on GPU. >>>>>>>>> >> >>>> >>>>>>>>> >> >>>> Chang >>>>>>>>> >> >>>> >>>>>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >>>>>>>>> >>>>>>>>> >> > >>>>>>>>> >>>>>>>>> >> >>> wrote: >>>>>>>>> >> >>>>> Thank you Junchao for explaining this. I guess in >>>>>>>>> my case >>>>>>>>> >> the code is >>>>>>>>> >> >>>>> just calling a seq solver like superlu to do >>>>>>>>> >> factorization on GPUs. >>>>>>>>> >> >>>>> My idea is that I want to have a traditional MPI >>>>>>>>> code to >>>>>>>>> >> utilize GPUs >>>>>>>>> >> >>>>> with cusparse. 
Right now cusparse does not support >>>>>>>>> mpiaij >>>>>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' will give you an >>>>>>>>> >> mpiaijcusparse matrix with > 1 processes. >>>>>>>>> >> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). >>>>>>>>> >> >>>>> However, I see in grepping the repo that all the mumps and >>>>>>>>> >> superlu tests use aij or sell matrix type. >>>>>>>>> >> >>>>> MUMPS and SuperLU provide their own solves, I assume >>>>>>>>> .... but >>>>>>>>> >> you might want to do other matrix operations on the GPU. Is >>>>>>>>> that the >>>>>>>>> >> issue? >>>>>>>>> >> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or >>>>>>>>> SuperLU >>>>>>>>> >> have a problem? (no test with it so it probably does not work) >>>>>>>>> >> >>>>> Thanks, >>>>>>>>> >> >>>>> Mark >>>>>>>>> >> >>>>> so I >>>>>>>>> >> >>>>> want the code to have a mpiaij matrix when adding >>>>>>>>> all the >>>>>>>>> >> matrix terms, >>>>>>>>> >> >>>>> and then transform the matrix to seqaij when doing the >>>>>>>>> >> factorization >>>>>>>>> >> >>>>> and >>>>>>>>> >> >>>>> solve. This involves sending the data to the master >>>>>>>>> >> process, and I >>>>>>>>> >> >>>>> think >>>>>>>>> >> >>>>> the petsc mumps solver have something similar already. >>>>>>>>> >> >>>>> Chang >>>>>>>>> >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote: >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >>>>>>>>> >> >>>>>>>>> > >>>>>>>>> >> >>>>> >>>>>>>>> >> >>>>>>>>> >> >>>>> > >>>>>>>> >>>>>>>> > >>>>>>>>> >> >>>>>>>>> >>>> wrote: >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >>>>>>>>> >> >>>>>>>> > >>>>>>>>> >> >>>>> >>>>>>>>> >> >>>>>>>>> >> >>>>> > >>>>>>>> > >>>>>>>>> >> >>>>>>>>> >>>> wrote: >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > Hi Mark, >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > The option I use is like >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > -pc_type bjacobi -pc_bjacobi_blocks 16 >>>>>>>>> >> -ksp_type fgmres >>>>>>>>> >> >>>>> -mat_type >>>>>>>>> >> >>>>> > aijcusparse *-sub_pc_factor_mat_solver_type >>>>>>>>> >> cusparse >>>>>>>>> >> >>>>> *-sub_ksp_type >>>>>>>>> >> >>>>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 >>>>>>>>> >> -ksp_rtol 1.e-300 >>>>>>>>> >> >>>>> > -ksp_atol 1.e-300 >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > Note, If you use -log_view the last column >>>>>>>>> (rows >>>>>>>>> >> are the >>>>>>>>> >> >>>>> method like >>>>>>>>> >> >>>>> > MatFactorNumeric) has the percent of work >>>>>>>>> in the GPU. >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > Junchao: *This* implies that we have a >>>>>>>>> cuSparse LU >>>>>>>>> >> >>>>> factorization. Is >>>>>>>>> >> >>>>> > that correct? (I don't think we do) >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > No, we don't have cuSparse LU factorization. If you check >>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >>>>>>>>> find it >>>>>>>>> >> calls >>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJ() instead. >>>>>>>>> >> >>>>> > So I don't understand Chang's idea. Do you want to >>>>>>>>> >> make bigger >>>>>>>>> >> >>>>> blocks? >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > >>>>>>>>> >> >>>>> > I think this one do both factorization and >>>>>>>>> >> solve on gpu. 
>>>>>>>>> >> >>>>> > You can check the runex72_aijcusparse.sh file in the petsc install directory, and try it yourself (this is only the lu factorization without an iterative solve).
>>>>>>>>> >> >>>>> > Chang
>>>>>>>>> >> >>>>> > On 10/12/21 1:17 PM, Mark Adams wrote:
>>>>>>>>> >> >>>>> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu wrote:
>>>>>>>>> >> >>>>> > > Hi Junchao,
>>>>>>>>> >> >>>>> > > No, I only need it to be transferred within a node. I use the block-Jacobi method and GMRES to solve the sparse matrix, so each direct solver will take care of a sub-block of the whole matrix. In this way, I can use one GPU to solve one sub-block, which is stored within one node.
>>>>>>>>> >> >>>>> > > It was stated in the documentation that the cusparse solver is slow. However, in my test using ex72.c, the cusparse solver is faster than mumps or superlu_dist on CPUs.
>>>>>>>>> >> >>>>> > > Are we talking about the factorization, the solve, or both?
>>>>>>>>> >> >>>>> > > We do not have an interface to cuSparse's LU factorization (I just learned that it exists a few weeks ago). Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse'? This would be the CPU factorization, which is the dominant cost.
>>>>>>>>> >> >>>>> > > Chang
>>>>>>>>> >> >>>>> > > On 10/12/21 10:24 AM, Junchao Zhang wrote:
>>>>>>>>> >> >>>>> > > > Hi, Chang,
>>>>>>>>> >> >>>>> > > > For the mumps solver, we usually transfer matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right?
>>>>>>>>> >> >>>>> > > > Mark, I remember you said cusparse solve is slow and you would rather do it on CPU. Is it right?
>>>>>>>>> >> >>>>> > > > --Junchao Zhang
>>>>>>>>> >> >>>>> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users wrote:
>>>>>>>>> >> >>>>> > > > Hi,
>>>>>>>>> >> >>>>> > > > Currently, it is possible to use the mumps solver in PETSc with the -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then the master rank will call mumps with OpenMP to solve the matrix.
>>>>>>>>> >> >>>>> > > > I wonder if someone can develop a similar option for the cusparse solver. Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to the GPU to solve. In this way, one can use the cusparse solver for an MPI program.
>>>>>>>>> >> >>>>> > > > Chang
>>>>>>>>> >> >>>>> > > > --
>>>>>>>>> >> >>>>> > > > Chang Liu
>>>>>>>>> >> >>>>> > > > Staff Research Physicist
>>>>>>>>> >> >>>>> > > > +1 609 243 3438
>>>>>>>>> >> >>>>> > > > cliu at pppl.gov
>>>>>>>>> >> >>>>> > > > Princeton Plasma Physics Laboratory
>>>>>>>>> >> >>>>> > > > 100 Stellarator Rd, Princeton NJ 08540, USA
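A minimal sketch of the nested solver discussed above (FGMRES with block Jacobi, one block per group of ranks, PCTELESCOPE gathering each block onto a single rank, and LU on that rank), set up programmatically instead of on the command line. It is assembled only from the options quoted in this thread and is not code from any of the participants; the function name solve_block_jacobi_telescope(), the hard-coded block count, and the reduction factor are illustrative only.

#include <petscksp.h>

/* Sketch: solve A x = b with block Jacobi + PCTELESCOPE + LU, mirroring the
   command-line options discussed in this thread. Assumes A was created as
   aijcusparse (e.g. via -mat_type aijcusparse) and b, x are compatible Vecs. */
PetscErrorCode solve_block_jacobi_telescope(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PetscErrorCode ierr;

  /* One block per group of ranks; telescope gathers each block onto one rank. */
  ierr = PetscOptionsSetValue(NULL, "-ksp_type", "fgmres");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-pc_type", "bjacobi");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-pc_bjacobi_blocks", "4");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-sub_ksp_type", "preonly");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-sub_pc_type", "telescope");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-sub_pc_telescope_reduction_factor", "4");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-sub_pc_telescope_subcomm_type", "contiguous");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-sub_telescope_ksp_type", "preonly");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-sub_telescope_pc_type", "lu");CHKERRQ(ierr);
  /* "cusparse" keeps the triangular solves on the GPU (the factorization itself
     stays on the CPU, since cuSparse has no LU factorization interface here);
     "mumps" keeps everything on the CPU. */
  ierr = PetscOptionsSetValue(NULL, "-sub_telescope_pc_factor_mat_solver_type", "cusparse");CHKERRQ(ierr);

  ierr = KSPCreate(PetscObjectComm((PetscObject)A), &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* picks up the options set above */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  return 0;
}

As the surrounding messages show, the cusparse variant of this setup was still producing an incorrect answer for the poster while the mumps variant converged, so treat this sketch as a starting point rather than a verified recipe.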
-- >>>>>> Chang Liu >>>>>> Staff Research Physicist >>>>>> +1 609 243 3438 >>>>>> cliu at pppl.gov >>>>>> Princeton Plasma Physics Laboratory >>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>> >>>> -- >>>> Chang Liu >>>> Staff Research Physicist >>>> +1 609 243 3438 >>>> cliu at pppl.gov >>>> Princeton Plasma Physics Laboratory >>>> 100 Stellarator Rd, Princeton NJ 08540, USA >> >> -- >> Chang Liu >> Staff Research Physicist >> +1 609 243 3438 >> cliu at pppl.gov >> Princeton Plasma Physics Laboratory >> 100 Stellarator Rd, Princeton NJ 08540, USA > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From cliu at pppl.gov Thu Oct 14 21:11:57 2021 From: cliu at pppl.gov (Chang Liu) Date: Thu, 14 Oct 2021 22:11:57 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> Message-ID: For comparison, here is the output using mumps instead of cusparse $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type mumps -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03 9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid norm 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 13 KSP unpreconditioned resid norm 1.885892030223e-01 true resid 
norm 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid norm 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 15 KSP unpreconditioned resid norm 1.638219366731e-01 true resid norm 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid norm 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid norm 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 18 KSP unpreconditioned resid norm 1.289673236836e-01 true resid norm 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid norm 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid norm 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid norm 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid norm 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 23 KSP unpreconditioned resid norm 7.563496650115e-02 true resid norm 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid norm 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid norm 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid norm 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid norm 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid norm 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid norm 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid norm 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid norm 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid norm 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid norm 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid norm 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid norm 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid norm 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 37 KSP unpreconditioned resid norm 4.982347757923e-03 true resid norm 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid norm 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid norm 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid norm 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid norm 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 42 KSP unpreconditioned resid norm 
2.772928845284e-03 true resid norm 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid norm 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid norm 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 45 KSP unpreconditioned resid norm 2.086800523919e-03 true resid norm 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid norm 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid norm 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid norm 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid norm 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid norm 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 51 KSP unpreconditioned resid norm 1.375985053014e-03 true resid norm 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid norm 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid norm 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid norm 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid norm 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid norm 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid norm 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid norm 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid norm 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid norm 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid norm 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid norm 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid norm 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 64 KSP unpreconditioned resid norm 2.961034697265e-04 true resid norm 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid norm 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid norm 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid norm 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 68 KSP unpreconditioned resid norm 2.008438265031e-04 true resid norm 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 69 KSP unpreconditioned resid norm 1.838732863386e-04 true resid norm 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 70 KSP unpreconditioned resid norm 1.723786027645e-04 true resid norm 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 71 KSP 
unpreconditioned resid norm 1.580945192204e-04 true resid norm 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid norm 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid norm 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid norm 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid norm 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid norm 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid norm 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid norm 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 79 KSP unpreconditioned resid norm 8.896901670359e-05 true resid norm 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid norm 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid norm 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid norm 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid norm 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid norm 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid norm 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid norm 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid norm 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid norm 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid norm 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid norm 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid norm 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 92 KSP unpreconditioned resid norm 1.707071292484e-05 true resid norm 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid norm 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 94 KSP unpreconditioned resid norm 1.219393542993e-05 true resid norm 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid norm 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 96 KSP unpreconditioned resid norm 9.099659872548e-06 true resid norm 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid norm 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 98 KSP unpreconditioned resid norm 7.167226146744e-06 true resid norm 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid norm 6.552540209577e-06 ||r(i)||/||b|| 
1.632026385802e-07 100 KSP unpreconditioned resid norm 5.767783600111e-06 true resid norm 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 101 KSP unpreconditioned resid norm 5.261057430584e-06 true resid norm 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid norm 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid norm 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid norm 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid norm 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid norm 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 107 KSP unpreconditioned resid norm 3.234218880987e-06 true resid norm 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid norm 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid norm 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid norm 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid norm 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid norm 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid norm 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid norm 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid norm 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid norm 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid norm 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid norm 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid norm 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid norm 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid norm 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 122 KSP unpreconditioned resid norm 7.945760150897e-07 true resid norm 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 123 KSP unpreconditioned resid norm 7.141240839013e-07 true resid norm 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid norm 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid norm 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid norm 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid norm 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 128 KSP unpreconditioned resid norm 4.253264691112e-07 true 
resid norm 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid norm 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid norm 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid norm 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid norm 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid norm 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid norm 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid norm 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid norm 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid norm 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid norm 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid norm 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid norm 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid norm 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid norm 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid norm 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid norm 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid norm 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid norm 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid norm 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 148 KSP unpreconditioned resid norm 5.690132597004e-08 true resid norm 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid norm 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 150 KSP unpreconditioned resid norm 4.625371062660e-08 true resid norm 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 151 KSP unpreconditioned resid norm 4.349049084805e-08 true resid norm 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid norm 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid norm 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid norm 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid norm 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 156 KSP unpreconditioned resid norm 2.128504065757e-08 true resid norm 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 157 KSP 
unpreconditioned resid norm 1.879248809429e-08 true resid norm 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid norm 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid norm 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid norm 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid norm 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid norm 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid norm 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid norm 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid norm 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid norm 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid norm 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid norm 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid norm 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid norm 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid norm 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid norm 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid norm 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid norm 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 175 KSP unpreconditioned resid norm 3.924177535665e-09 true resid norm 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid norm 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid norm 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid norm 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid norm 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid norm 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid norm 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 182 KSP unpreconditioned resid norm 1.983545827983e-09 true resid norm 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid norm 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid norm 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid norm 1.412659832191e-09 
||r(i)||/||b|| 3.518479927722e-11 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid norm 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid norm 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid norm 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid norm 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 KSP Object: 16 MPI processes type: fgmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=2000, initial guess is zero tolerances: relative=1e-20, absolute=1e-09, divergence=10000. right preconditioning using UNPRECONDITIONED norm type for convergence test PC Object: 16 MPI processes type: bjacobi number of blocks = 4 Local solver information for first block is in the following KSP and PC objects on rank 0: Use -ksp_view ::ascii_info_detail to display information for all blocks KSP Object: (sub_) 4 MPI processes type: preonly maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (sub_) 4 MPI processes type: telescope petsc subcomm: parent comm size reduction factor = 4 petsc subcomm: parent_size = 4 , subcomm_size = 1 petsc subcomm type = contiguous linear system matrix = precond matrix: Mat Object: (sub_) 4 MPI processes type: mpiaij rows=40200, cols=40200 total: nonzeros=199996, allocated nonzeros=203412 total number of mallocs used during MatSetValues calls=0 not using I-node (on process 0) routines setup type: default Parent DM object: NULL Sub DM object: NULL KSP Object: (sub_telescope_) 1 MPI processes type: preonly maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (sub_telescope_) 1 MPI processes type: lu out-of-place factorization tolerance for zero pivot 2.22045e-14 matrix ordering: external factor fill ratio given 0., needed 0. 
Factored matrix follows: Mat Object: 1 MPI processes type: mumps rows=40200, cols=40200 package used to perform factorization: mumps total: nonzeros=1849788, allocated nonzeros=1849788 MUMPS run parameters: SYM (matrix type): 0 PAR (host participation): 1 ICNTL(1) (output for error): 6 ICNTL(2) (output of diagnostic msg): 0 ICNTL(3) (output for global info): 0 ICNTL(4) (level of printing): 0 ICNTL(5) (input mat struct): 0 ICNTL(6) (matrix prescaling): 7 ICNTL(7) (sequential matrix ordering):7 ICNTL(8) (scaling strategy): 77 ICNTL(10) (max num of refinements): 0 ICNTL(11) (error analysis): 0 ICNTL(12) (efficiency control): 1 ICNTL(13) (sequential factorization of the root node): 0 ICNTL(14) (percentage of estimated workspace increase): 20 ICNTL(18) (input mat struct): 0 ICNTL(19) (Schur complement info): 0 ICNTL(20) (RHS sparse pattern): 0 ICNTL(21) (solution struct): 0 ICNTL(22) (in-core/out-of-core facility): 0 ICNTL(23) (max size of memory can be allocated locally):0 ICNTL(24) (detection of null pivot rows): 0 ICNTL(25) (computation of a null space basis): 0 ICNTL(26) (Schur options for RHS or solution): 0 ICNTL(27) (blocking size for multiple RHS): -32 ICNTL(28) (use parallel or sequential ordering): 1 ICNTL(29) (parallel ordering): 0 ICNTL(30) (user-specified set of entries in inv(A)): 0 ICNTL(31) (factors is discarded in the solve phase): 0 ICNTL(33) (compute determinant): 0 ICNTL(35) (activate BLR based factorization): 0 ICNTL(36) (choice of BLR factorization variant): 0 ICNTL(38) (estimated compression rate of LU factors): 333 CNTL(1) (relative pivoting threshold): 0.01 CNTL(2) (stopping criterion of refinement): 1.49012e-08 CNTL(3) (absolute pivoting threshold): 0. CNTL(4) (value of static pivoting): -1. CNTL(5) (fixation for null pivots): 0. CNTL(7) (dropping parameter for BLR): 0. 
RINFO(1) (local estimated flops for the elimination after analysis): [0] 1.45525e+08 RINFO(2) (local estimated flops for the assembly after factorization): [0] 2.89397e+06 RINFO(3) (local estimated flops for the elimination after factorization): [0] 1.45525e+08 INFO(15) (estimated size of (in MB) MUMPS internal data for running numerical factorization): [0] 29 INFO(16) (size of (in MB) MUMPS internal data used during numerical factorization): [0] 29 INFO(23) (num of pivots eliminated on this processor after factorization): [0] 40200 RINFOG(1) (global estimated flops for the elimination after analysis): 1.45525e+08 RINFOG(2) (global estimated flops for the assembly after factorization): 2.89397e+06 RINFOG(3) (global estimated flops for the elimination after factorization): 1.45525e+08 (RINFOG(12) RINFOG(13))*2^INFOG(34) (determinant): (0.,0.)*(2^0) INFOG(3) (estimated real workspace for factors on all processors after analysis): 1849788 INFOG(4) (estimated integer workspace for factors on all processors after analysis): 879986 INFOG(5) (estimated maximum front size in the complete tree): 282 INFOG(6) (number of nodes in the complete tree): 23709 INFOG(7) (ordering option effectively used after analysis): 5 INFOG(8) (structural symmetry in percent of the permuted matrix after analysis): 100 INFOG(9) (total real/complex workspace to store the matrix factors after factorization): 1849788 INFOG(10) (total integer space store the matrix factors after factorization): 879986 INFOG(11) (order of largest frontal matrix after factorization): 282 INFOG(12) (number of off-diagonal pivots): 0 INFOG(13) (number of delayed pivots after factorization): 0 INFOG(14) (number of memory compress after factorization): 0 INFOG(15) (number of steps of iterative refinement after solution): 0 INFOG(16) (estimated size (in MB) of all MUMPS internal data for factorization after analysis: value on the most memory consuming processor): 29 INFOG(17) (estimated size of all MUMPS internal data for factorization after analysis: sum over all processors): 29 INFOG(18) (size of all MUMPS internal data allocated during factorization: value on the most memory consuming processor): 29 INFOG(19) (size of all MUMPS internal data allocated during factorization: sum over all processors): 29 INFOG(20) (estimated number of entries in the factors): 1849788 INFOG(21) (size in MB of memory effectively used during factorization - value on the most memory consuming processor): 26 INFOG(22) (size in MB of memory effectively used during factorization - sum over all processors): 26 INFOG(23) (after analysis: value of ICNTL(6) effectively used): 0 INFOG(24) (after analysis: value of ICNTL(12) effectively used): 1 INFOG(25) (after factorization: number of pivots modified by static pivoting): 0 INFOG(28) (after factorization: number of null pivots encountered): 0 INFOG(29) (after factorization: effective number of entries in the factors (sum over all processors)): 1849788 INFOG(30, 31) (after solution: size in Mbytes of memory used during solution phase): 29, 29 INFOG(32) (after analysis: type of analysis done): 1 INFOG(33) (value used for ICNTL(8)): 7 INFOG(34) (exponent of the determinant if determinant is requested): 0 INFOG(35) (after factorization: number of entries taking into account BLR factor compression - sum over all processors): 1849788 INFOG(36) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - value on the most memory consuming processor): 0 INFOG(37) (after analysis: estimated size of all MUMPS 
internal data for running BLR in-core - sum over all processors): 0 INFOG(38) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - value on the most memory consuming processor): 0 INFOG(39) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - sum over all processors): 0 linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaijcusparse rows=40200, cols=40200 total: nonzeros=199996, allocated nonzeros=199996 total number of mallocs used during MatSetValues calls=0 not using I-node routines linear system matrix = precond matrix: Mat Object: 16 MPI processes type: mpiaijcusparse rows=160800, cols=160800 total: nonzeros=802396, allocated nonzeros=1608000 total number of mallocs used during MatSetValues calls=0 not using I-node (on process 0) routines Norm of error 9.11684e-07 iterations 189 Chang On 10/14/21 10:10 PM, Chang Liu wrote: > Hi Barry, > > No problem. Here is the output. It seems that the resid norm calculation > is incorrect. > > $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 > -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks > 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope > -sub_ksp_type preonly -sub_telescope_ksp_type preonly > -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type > cusparse -sub_pc_telescope_reduction_factor 4 > -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol > 1.e-20 -ksp_atol 1.e-9 > ? 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm > 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > ? 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid norm > 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > KSP Object: 16 MPI processes > ? type: fgmres > ??? restart=30, using Classical (unmodified) Gram-Schmidt > Orthogonalization with no iterative refinement > ??? happy breakdown tolerance 1e-30 > ? maximum iterations=2000, initial guess is zero > ? tolerances:? relative=1e-20, absolute=1e-09, divergence=10000. > ? right preconditioning > ? using UNPRECONDITIONED norm type for convergence test > PC Object: 16 MPI processes > ? type: bjacobi > ??? number of blocks = 4 > ??? Local solver information for first block is in the following KSP > and PC objects on rank 0: > ??? Use -ksp_view ::ascii_info_detail to display information for all > blocks > ? KSP Object: (sub_) 4 MPI processes > ??? type: preonly > ??? maximum iterations=10000, initial guess is zero > ??? tolerances:? relative=1e-05, absolute=1e-50, divergence=10000. > ??? left preconditioning > ??? using NONE norm type for convergence test > ? PC Object: (sub_) 4 MPI processes > ??? type: telescope > ????? petsc subcomm: parent comm size reduction factor = 4 > ????? petsc subcomm: parent_size = 4 , subcomm_size = 1 > ????? petsc subcomm type = contiguous > ??? linear system matrix = precond matrix: > ??? Mat Object: (sub_) 4 MPI processes > ????? type: mpiaij > ????? rows=40200, cols=40200 > ????? total: nonzeros=199996, allocated nonzeros=203412 > ????? total number of mallocs used during MatSetValues calls=0 > ??????? not using I-node (on process 0) routines > ??????? setup type: default > ??????? Parent DM object: NULL > ??????? Sub DM object: NULL > ??????? KSP Object:?? (sub_telescope_)?? 1 MPI processes > ????????? type: preonly > ????????? maximum iterations=10000, initial guess is zero > ????????? tolerances:? relative=1e-05, absolute=1e-50, divergence=10000. > ????????? 
left preconditioning > ????????? using NONE norm type for convergence test > ??????? PC Object:?? (sub_telescope_)?? 1 MPI processes > ????????? type: lu > ??????????? out-of-place factorization > ??????????? tolerance for zero pivot 2.22045e-14 > ??????????? matrix ordering: nd > ??????????? factor fill ratio given 5., needed 8.62558 > ????????????? Factored matrix follows: > ??????????????? Mat Object:?? 1 MPI processes > ????????????????? type: seqaijcusparse > ????????????????? rows=40200, cols=40200 > ????????????????? package used to perform factorization: cusparse > ????????????????? total: nonzeros=1725082, allocated nonzeros=1725082 > ??????????????????? not using I-node routines > ????????? linear system matrix = precond matrix: > ????????? Mat Object:?? 1 MPI processes > ??????????? type: seqaijcusparse > ??????????? rows=40200, cols=40200 > ??????????? total: nonzeros=199996, allocated nonzeros=199996 > ??????????? total number of mallocs used during MatSetValues calls=0 > ????????????? not using I-node routines > ? linear system matrix = precond matrix: > ? Mat Object: 16 MPI processes > ??? type: mpiaijcusparse > ??? rows=160800, cols=160800 > ??? total: nonzeros=802396, allocated nonzeros=1608000 > ??? total number of mallocs used during MatSetValues calls=0 > ????? not using I-node (on process 0) routines > Norm of error 400.999 iterations 1 > > Chang > > > On 10/14/21 9:47 PM, Barry Smith wrote: >> >> ?? Chang, >> >> ??? Sorry I did not notice that one. Please run that with -ksp_view >> -ksp_monitor_true_residual so we can see exactly how options are >> interpreted and solver used. At a glance it looks ok but something >> must be wrong to get the wrong answer. >> >> ?? Barry >> >>> On Oct 14, 2021, at 6:02 PM, Chang Liu wrote: >>> >>> Hi Barry, >>> >>> That is exactly what I was doing in the second example, in which the >>> preconditioner works but the GMRES does not. >>> >>> Chang >>> >>> On 10/14/21 5:15 PM, Barry Smith wrote: >>>> ?? You need to use the PCTELESCOPE inside the block Jacobi, not >>>> outside it. So something like -pc_type bjacobi -sub_pc_type >>>> telescope -sub_telescope_pc_type lu >>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: >>>>> >>>>> Hi Pierre, >>>>> >>>>> I wonder if the trick of PCTELESCOPE only works for preconditioner >>>>> and not for the solver. I have done some tests, and find that for >>>>> solving a small matrix using -telescope_ksp_type preonly, it does >>>>> work for GPU with multiple MPI processes. However, for bjacobi and >>>>> gmres, it does not work. >>>>> >>>>> The command line options I used for small matrix is like >>>>> >>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short >>>>> -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu >>>>> -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type >>>>> preonly -pc_telescope_reduction_factor 4 >>>>> >>>>> which gives the correct output. For iterative solver, I tried >>>>> >>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short >>>>> -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type >>>>> aijcusparse -sub_pc_type telescope -sub_ksp_type preonly >>>>> -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu >>>>> -sub_telescope_pc_factor_mat_solver_type cusparse >>>>> -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol >>>>> 1.e-9 -ksp_atol 1.e-20 >>>>> >>>>> for large matrix. The output is like >>>>> >>>>> ? 0 KSP Residual norm 40.1497 >>>>> ? 
1 KSP Residual norm < 1.e-11 >>>>> Norm of error 400.999 iterations 1 >>>>> >>>>> So it seems to call a direct solver instead of an iterative one. >>>>> >>>>> Can you please help check these options? >>>>> >>>>> Chang >>>>> >>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote: >>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote: >>>>>>> >>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This >>>>>>> sounds exactly what I need. I wonder if PCTELESCOPE can transform >>>>>>> a mpiaijcusparse to seqaircusparse? Or I have to do it manually? >>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). >>>>>> 1) I?m not sure this is implemented for cuSparse matrices, but it >>>>>> should be; >>>>>> 2) at least for the implementations >>>>>> MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and >>>>>> MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType >>>>>> is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? enough >>>>>> to detect if the MPI communicator on which the Mat lives is of >>>>>> size 1 (your case), and then the resulting Mat is of type MatSeqX >>>>>> instead of MatMPIX, so you would not need to worry about the >>>>>> transformation you are mentioning. >>>>>> If you try this out and this does not work, please provide the >>>>>> backtrace (probably something like ?Operation XYZ not implemented >>>>>> for MatType ABC?), and hopefully someone can add the missing >>>>>> plumbing. >>>>>> I do not claim that this will be efficient, but I think this goes >>>>>> in the direction of what you want to achieve. >>>>>> Thanks, >>>>>> Pierre >>>>>>> Chang >>>>>>> >>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >>>>>>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as a >>>>>>>> subdomain solver, with a reduction factor equal to the number of >>>>>>>> MPI processes you have per block? >>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X >>>>>>>> -sub_telescope_pc_type lu >>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads because >>>>>>>> not only do the Mat needs to be redistributed, the secondary >>>>>>>> processes also need to be ?converted? to OpenMP threads. >>>>>>>> Thus the need for specific code in mumps.c. >>>>>>>> Thanks, >>>>>>>> Pierre >>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi Junchao, >>>>>>>>> >>>>>>>>> Yes that is what I want. >>>>>>>>> >>>>>>>>> Chang >>>>>>>>> >>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith >>>>>>>>> > wrote: >>>>>>>>>> ?????? Junchao, >>>>>>>>>> ????????? If I understand correctly Chang is using the block >>>>>>>>>> Jacobi >>>>>>>>>> ??? method with a single block for a number of MPI ranks and a >>>>>>>>>> direct >>>>>>>>>> ??? solver for each block so it uses >>>>>>>>>> PCSetUp_BJacobi_Multiproc() which >>>>>>>>>> ??? is code Hong Zhang wrote a number of years ago for CPUs. >>>>>>>>>> For their >>>>>>>>>> ??? particular problems this preconditioner works well, but >>>>>>>>>> using an >>>>>>>>>> ??? iterative solver on the blocks does not work well. >>>>>>>>>> ????????? If we had complete MPI-GPU direct solvers he could >>>>>>>>>> just use >>>>>>>>>> ??? the current code with MPIAIJCUSPARSE on each block but >>>>>>>>>> since we do >>>>>>>>>> ??? not he would like to use a single GPU for each block, this >>>>>>>>>> means >>>>>>>>>> ??? that diagonal blocks of? the global parallel MPI matrix >>>>>>>>>> needs to be >>>>>>>>>> ??? 
sent to a subset of the GPUs (one GPU per block, which has >>>>>>>>>> multiple >>>>>>>>>> ??? MPI ranks associated with the blocks). Similarly for the >>>>>>>>>> triangular >>>>>>>>>> ??? solves the blocks of the right hand side needs to be >>>>>>>>>> shipped to the >>>>>>>>>> ??? appropriate GPU and the resulting solution shipped back to >>>>>>>>>> the >>>>>>>>>> ??? multiple GPUs. So Chang is absolutely correct, this is >>>>>>>>>> somewhat like >>>>>>>>>> ??? your code for MUMPS with OpenMP. OK, I now understand the >>>>>>>>>> background.. >>>>>>>>>> ??? One could use PCSetUp_BJacobi_Multiproc() and get the >>>>>>>>>> blocks on the >>>>>>>>>> ??? MPI ranks and then shrink each block down to a single GPU >>>>>>>>>> but this >>>>>>>>>> ??? would be pretty inefficient, ideally one would go directly >>>>>>>>>> from the >>>>>>>>>> ??? big MPI matrix on all the GPUs to the sub matrices on the >>>>>>>>>> subset of >>>>>>>>>> ??? GPUs. But this may be a large coding project. >>>>>>>>>> I don't understand these sentences. Why do you say "shrink"? >>>>>>>>>> In my mind, we just need to move each block (submatrix) living >>>>>>>>>> over multiple MPI ranks to one of them and solve directly >>>>>>>>>> there.? In other words, we keep blocks' size, no shrinking or >>>>>>>>>> expanding. >>>>>>>>>> As mentioned before, cusparse does not provide LU >>>>>>>>>> factorization. So the LU factorization would be done on CPU, >>>>>>>>>> and the solve be done on GPU. I assume Chang wants to gain >>>>>>>>>> from the (potential) faster solve (instead of factorization) >>>>>>>>>> on GPU. >>>>>>>>>> ?????? Barry >>>>>>>>>> ??? Since the matrices being factored and solved directly are >>>>>>>>>> relatively >>>>>>>>>> ??? large it is possible that the cusparse code could be >>>>>>>>>> reasonably >>>>>>>>>> ??? efficient (they are not the tiny problems one gets at the >>>>>>>>>> coarse >>>>>>>>>> ??? level of multigrid). Of course, this is speculation, I don't >>>>>>>>>> ??? actually know how much better the cusparse code would be >>>>>>>>>> on the >>>>>>>>>> ??? direct solver than a good CPU direct sparse solver. >>>>>>>>>> ???? > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>>>>> ??? > wrote: >>>>>>>>>> ???? > >>>>>>>>>> ???? > Sorry I am not familiar with the details either. Can >>>>>>>>>> you please >>>>>>>>>> ??? check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>>>>>> ???? > >>>>>>>>>> ???? > Chang >>>>>>>>>> ???? > >>>>>>>>>> ???? > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>>>>>> ???? >> Hi Chang, >>>>>>>>>> ???? >>?? I did the work in mumps. It is easy for me to >>>>>>>>>> understand >>>>>>>>>> ??? gathering matrix rows to one process. >>>>>>>>>> ???? >>?? But how to gather blocks (submatrices) to form a >>>>>>>>>> large block????? Can you draw a picture of that? >>>>>>>>>> ???? >>?? Thanks >>>>>>>>>> ???? >> --Junchao Zhang >>>>>>>>>> ???? >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users >>>>>>>>>> ??? >>>>>>>>>> ??? >>>>>>>>> >> >>>>>>>>>> ??? wrote: >>>>>>>>>> ???? >>??? Hi Barry, >>>>>>>>>> ???? >>??? I think mumps solver in petsc does support that. >>>>>>>>>> You can >>>>>>>>>> ??? check the >>>>>>>>>> ???? >>??? documentation on "-mat_mumps_use_omp_threads" at >>>>>>>>>> ???? >> >>>>>>>>>> >>>>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> ???? >> >>>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> > >>>>>>>>>> >>>>>>>>>> ???? >>??? and the code enclosed by #if >>>>>>>>>> ??? 
defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>>>>>> ???? >>??? functions MatMumpsSetUpDistRHSInfo and >>>>>>>>>> ???? >>??? MatMumpsGatherNonzerosOnMaster in >>>>>>>>>> ???? >>??? mumps.c >>>>>>>>>> ???? >>??? 1. I understand it is ideal to do one MPI rank per >>>>>>>>>> GPU. >>>>>>>>>> ??? However, I am >>>>>>>>>> ???? >>??? working on an existing code that was developed >>>>>>>>>> based on MPI >>>>>>>>>> ??? and the the >>>>>>>>>> ???? >>??? # of mpi ranks is typically equal to # of cpu >>>>>>>>>> cores. We don't >>>>>>>>>> ??? want to >>>>>>>>>> ???? >>??? change the whole structure of the code. >>>>>>>>>> ???? >>??? 2. What you have suggested has been coded in >>>>>>>>>> mumps.c. See >>>>>>>>>> ??? function >>>>>>>>>> ???? >>??? MatMumpsSetUpDistRHSInfo. >>>>>>>>>> ???? >>??? Regards, >>>>>>>>>> ???? >>??? Chang >>>>>>>>>> ???? >>??? On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>>>>>> ???? >>???? > >>>>>>>>>> ???? >>???? > >>>>>>>>>> ???? >>???? >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>>>>>>> >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >> wrote: >>>>>>>>>> ???? >>???? >> >>>>>>>>>> ???? >>???? >> Hi Barry, >>>>>>>>>> ???? >>???? >> >>>>>>>>>> ???? >>???? >> That is exactly what I want. >>>>>>>>>> ???? >>???? >> >>>>>>>>>> ???? >>???? >> Back to my original question, I am looking for >>>>>>>>>> an approach to >>>>>>>>>> ???? >>??? transfer >>>>>>>>>> ???? >>???? >> matrix >>>>>>>>>> ???? >>???? >> data from many MPI processes to "master" MPI >>>>>>>>>> ???? >>???? >> processes, each of which taking care of one >>>>>>>>>> GPU, and then >>>>>>>>>> ??? upload >>>>>>>>>> ???? >>??? the data to GPU to >>>>>>>>>> ???? >>???? >> solve. >>>>>>>>>> ???? >>???? >> One can just grab some codes from mumps.c to >>>>>>>>>> ??? aijcusparse.cu >>>>>>>>>> ???? >>??? >. >>>>>>>>>> ???? >>???? > >>>>>>>>>> ???? >>???? >??? mumps.c doesn't actually do that. It never >>>>>>>>>> needs to >>>>>>>>>> ??? copy the >>>>>>>>>> ???? >>??? entire matrix to a single MPI rank. >>>>>>>>>> ???? >>???? > >>>>>>>>>> ???? >>???? >??? It would be possible to write such a code >>>>>>>>>> that you >>>>>>>>>> ??? suggest but >>>>>>>>>> ???? >>??? it is not clear that it makes sense >>>>>>>>>> ???? >>???? > >>>>>>>>>> ???? >>???? > 1)? For normal PETSc GPU usage there is one GPU >>>>>>>>>> per MPI >>>>>>>>>> ??? rank, so >>>>>>>>>> ???? >>??? while your one GPU per big domain is solving its >>>>>>>>>> systems the >>>>>>>>>> ??? other >>>>>>>>>> ???? >>??? GPUs (with the other MPI ranks that share that >>>>>>>>>> domain) are doing >>>>>>>>>> ???? >>??? nothing. >>>>>>>>>> ???? >>???? > >>>>>>>>>> ???? >>???? > 2) For each triangular solve you would have to >>>>>>>>>> gather the >>>>>>>>>> ??? right >>>>>>>>>> ???? >>??? hand side from the multiple ranks to the single GPU >>>>>>>>>> to pass it to >>>>>>>>>> ???? >>??? the GPU solver and then scatter the resulting >>>>>>>>>> solution back >>>>>>>>>> ??? to all >>>>>>>>>> ???? >>??? of its subdomain ranks. >>>>>>>>>> ???? >>???? > >>>>>>>>>> ???? >>???? >??? What I was suggesting was assign an entire >>>>>>>>>> subdomain to a >>>>>>>>>> ???? >>??? single MPI rank, thus it does everything on one GPU >>>>>>>>>> and can >>>>>>>>>> ??? use the >>>>>>>>>> ???? >>??? GPU solver directly. If all the major computations >>>>>>>>>> of a subdomain >>>>>>>>>> ???? >>??? can fit and be done on a single GPU then you would be >>>>>>>>>> ??? utilizing all >>>>>>>>>> ???? >>??? the GPUs you are using effectively. >>>>>>>>>> ???? >>???? > >>>>>>>>>> ???? >>???? >??? Barry >>>>>>>>>> ???? >>???? 
> >>>>>>>>>> ???? >>???? > >>>>>>>>>> ???? >>???? > >>>>>>>>>> ???? >>???? >> >>>>>>>>>> ???? >>???? >> Chang >>>>>>>>>> ???? >>???? >> >>>>>>>>>> ???? >>???? >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>>>>>>> ???? >>???? >>>??? Chang, >>>>>>>>>> ???? >>???? >>>????? You are correct there is no MPI + GPU direct >>>>>>>>>> ??? solvers that >>>>>>>>>> ???? >>??? currently do the triangular solves with MPI + GPU >>>>>>>>>> parallelism >>>>>>>>>> ??? that I >>>>>>>>>> ???? >>??? am aware of. You are limited that individual >>>>>>>>>> triangular solves be >>>>>>>>>> ???? >>??? done on a single GPU. I can only suggest making >>>>>>>>>> each subdomain as >>>>>>>>>> ???? >>??? big as possible to utilize each GPU as much as >>>>>>>>>> possible for the >>>>>>>>>> ???? >>??? direct triangular solves. >>>>>>>>>> ???? >>???? >>>???? Barry >>>>>>>>>> ???? >>???? >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via >>>>>>>>>> petsc-users >>>>>>>>>> ???? >>??? >>>>>>>>> >>>>>>>>>> ??? >>>>>>>>> >> >>>>>>>>>> ??? wrote: >>>>>>>>>> ???? >>???? >>>> >>>>>>>>>> ???? >>???? >>>> Hi Mark, >>>>>>>>>> ???? >>???? >>>> >>>>>>>>>> ???? >>???? >>>> '-mat_type aijcusparse' works with >>>>>>>>>> mpiaijcusparse with >>>>>>>>>> ??? other >>>>>>>>>> ???? >>??? solvers, but with -pc_factor_mat_solver_type >>>>>>>>>> cusparse, it >>>>>>>>>> ??? will give >>>>>>>>>> ???? >>??? an error. >>>>>>>>>> ???? >>???? >>>> >>>>>>>>>> ???? >>???? >>>> Yes what I want is to have mumps or superlu >>>>>>>>>> to do the >>>>>>>>>> ???? >>??? factorization, and then do the rest, including >>>>>>>>>> GMRES solver, >>>>>>>>>> ??? on gpu. >>>>>>>>>> ???? >>??? Is that possible? >>>>>>>>>> ???? >>???? >>>> >>>>>>>>>> ???? >>???? >>>> I have tried to use aijcusparse with >>>>>>>>>> superlu_dist, it >>>>>>>>>> ??? runs but >>>>>>>>>> ???? >>??? the iterative solver is still running on CPUs. I have >>>>>>>>>> ??? contacted the >>>>>>>>>> ???? >>??? superlu group and they confirmed that is the case >>>>>>>>>> right now. >>>>>>>>>> ??? But if >>>>>>>>>> ???? >>??? I set -pc_factor_mat_solver_type cusparse, it seems >>>>>>>>>> that the >>>>>>>>>> ???? >>??? iterative solver is running on GPU. >>>>>>>>>> ???? >>???? >>>> >>>>>>>>>> ???? >>???? >>>> Chang >>>>>>>>>> ???? >>???? >>>> >>>>>>>>>> ???? >>???? >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>>>>>>> ???? >>???? >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >>>>>>>>>> ??? >>>>>>>>>> ???? >>??? > >>>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>> wrote: >>>>>>>>>> ???? >>???? >>>>>???? Thank you Junchao for explaining this. I >>>>>>>>>> guess in >>>>>>>>>> ??? my case >>>>>>>>>> ???? >>??? the code is >>>>>>>>>> ???? >>???? >>>>>???? just calling a seq solver like superlu >>>>>>>>>> to do >>>>>>>>>> ???? >>??? factorization on GPUs. >>>>>>>>>> ???? >>???? >>>>>???? My idea is that I want to have a >>>>>>>>>> traditional MPI >>>>>>>>>> ??? code to >>>>>>>>>> ???? >>??? utilize GPUs >>>>>>>>>> ???? >>???? >>>>>???? with cusparse. Right now cusparse does >>>>>>>>>> not support >>>>>>>>>> ??? mpiaij >>>>>>>>>> ???? >>??? matrix, Sure it does: '-mat_type aijcusparse' will >>>>>>>>>> give you an >>>>>>>>>> ???? >>??? mpiaijcusparse matrix with > 1 processes. >>>>>>>>>> ???? >>???? >>>>> (-mat_type mpiaijcusparse might also work >>>>>>>>>> with >1 proc). >>>>>>>>>> ???? >>???? >>>>> However, I see in grepping the repo that all >>>>>>>>>> the mumps and >>>>>>>>>> ???? >>??? superlu tests use aij or sell matrix type. >>>>>>>>>> ???? >>???? 
>>>>> MUMPS and SuperLU provide their own solves, >>>>>>>>>> I assume >>>>>>>>>> ??? .... but >>>>>>>>>> ???? >>??? you might want to do other matrix operations on the >>>>>>>>>> GPU. Is >>>>>>>>>> ??? that the >>>>>>>>>> ???? >>??? issue? >>>>>>>>>> ???? >>???? >>>>> Did you try -mat_type aijcusparse with MUMPS >>>>>>>>>> and/or >>>>>>>>>> ??? SuperLU >>>>>>>>>> ???? >>??? have a problem? (no test with it so it probably >>>>>>>>>> does not work) >>>>>>>>>> ???? >>???? >>>>> Thanks, >>>>>>>>>> ???? >>???? >>>>> Mark >>>>>>>>>> ???? >>???? >>>>>???? so I >>>>>>>>>> ???? >>???? >>>>>???? want the code to have a mpiaij matrix >>>>>>>>>> when adding >>>>>>>>>> ??? all the >>>>>>>>>> ???? >>??? matrix terms, >>>>>>>>>> ???? >>???? >>>>>???? and then transform the matrix to seqaij >>>>>>>>>> when doing the >>>>>>>>>> ???? >>??? factorization >>>>>>>>>> ???? >>???? >>>>>???? and >>>>>>>>>> ???? >>???? >>>>>???? solve. This involves sending the data to >>>>>>>>>> the master >>>>>>>>>> ???? >>??? process, and I >>>>>>>>>> ???? >>???? >>>>>???? think >>>>>>>>>> ???? >>???? >>>>>???? the petsc mumps solver have something >>>>>>>>>> similar already. >>>>>>>>>> ???? >>???? >>>>>???? Chang >>>>>>>>>> ???? >>???? >>>>>???? On 10/13/21 10:18 AM, Junchao Zhang wrote: >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? > On Tue, Oct 12, 2021 at 1:07 PM Mark >>>>>>>>>> Adams >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> >>>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>> ??? >>>>>>>>> ??? > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>>> wrote: >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? >???? On Tue, Oct 12, 2021 at 1:45 PM >>>>>>>>>> Chang Liu >>>>>>>>>> ???? >>??? >>>>>>>>>> >>>>>>>>> ??? > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> >>>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>????? >???? >>>>>>>>> ??? >>>>>>>>> > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>>> wrote: >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? >???????? Hi Mark, >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? >???????? The option I use is like >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? >???????? -pc_type bjacobi >>>>>>>>>> -pc_bjacobi_blocks 16 >>>>>>>>>> ???? >>??? -ksp_type fgmres >>>>>>>>>> ???? >>???? >>>>>???? -mat_type >>>>>>>>>> ???? >>???? >>>>>????? >???????? aijcusparse >>>>>>>>>> *-sub_pc_factor_mat_solver_type >>>>>>>>>> ???? >>??? cusparse >>>>>>>>>> ???? >>???? >>>>>???? *-sub_ksp_type >>>>>>>>>> ???? >>???? >>>>>????? >???????? preonly *-sub_pc_type lu* >>>>>>>>>> -ksp_max_it 2000 >>>>>>>>>> ???? >>??? -ksp_rtol 1.e-300 >>>>>>>>>> ???? >>???? >>>>>????? >???????? -ksp_atol 1.e-300 >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? >???? Note, If you use -log_view the >>>>>>>>>> last column >>>>>>>>>> ??? (rows >>>>>>>>>> ???? >>??? are the >>>>>>>>>> ???? >>???? >>>>>???? method like >>>>>>>>>> ???? >>???? >>>>>????? >???? MatFactorNumeric) has the percent >>>>>>>>>> of work >>>>>>>>>> ??? in the GPU. >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? >???? Junchao: *This* implies that we >>>>>>>>>> have a >>>>>>>>>> ??? cuSparse LU >>>>>>>>>> ???? >>???? >>>>>???? factorization. Is >>>>>>>>>> ???? >>???? >>>>>????? >???? that correct? 
(I don't think we do) >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? > No, we don't have cuSparse LU >>>>>>>>>> factorization.???? If you check >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >>>>>>>>>> ??? find it >>>>>>>>>> ???? >>??? calls >>>>>>>>>> ???? >>???? >>>>>????? > MatLUFactorSymbolic_SeqAIJ() instead. >>>>>>>>>> ???? >>???? >>>>>????? > So I don't understand Chang's idea. >>>>>>>>>> Do you want to >>>>>>>>>> ???? >>??? make bigger >>>>>>>>>> ???? >>???? >>>>>???? blocks? >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? >???????? I think this one do both >>>>>>>>>> factorization and >>>>>>>>>> ???? >>??? solve on gpu. >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? >???????? You can check the >>>>>>>>>> ??? runex72_aijcusparse.sh file >>>>>>>>>> ???? >>??? in petsc >>>>>>>>>> ???? >>???? >>>>>???? install >>>>>>>>>> ???? >>???? >>>>>????? >???????? directory, and try it your >>>>>>>>>> self (this >>>>>>>>>> ??? is only lu >>>>>>>>>> ???? >>???? >>>>>???? factorization >>>>>>>>>> ???? >>???? >>>>>????? >???????? without >>>>>>>>>> ???? >>???? >>>>>????? >???????? iterative solve). >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? >???????? Chang >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? >???????? On 10/12/21 1:17 PM, Mark >>>>>>>>>> Adams wrote: >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? > On Tue, Oct 12, 2021 at >>>>>>>>>> 11:19 AM >>>>>>>>>> ??? Chang Liu >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> ??? > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>> ??? >>>>>>>>> > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>> >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>> ??? >>>>>>>>>> ???? >>??? > >>>>>>>>>> ??? >>>>>>>>>> >>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> >>>>>>>>>> ??? > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>>>> wrote: >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? Hi Junchao, >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? No I only needs it to >>>>>>>>>> be transferred >>>>>>>>>> ???? >>??? within a >>>>>>>>>> ???? >>???? >>>>>???? node. I use >>>>>>>>>> ???? >>???? >>>>>????? >???????? block-Jacobi >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? method and GMRES to >>>>>>>>>> solve the sparse >>>>>>>>>> ???? >>??? matrix, so each >>>>>>>>>> ???? >>???? >>>>>????? >???????? direct solver will >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? take care of a >>>>>>>>>> sub-block of the >>>>>>>>>> ??? whole >>>>>>>>>> ???? >>??? matrix. In this >>>>>>>>>> ???? >>???? >>>>>????? >???????? way, I can use >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? one >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? GPU to solve one >>>>>>>>>> sub-block, which is >>>>>>>>>> ???? >>??? stored within >>>>>>>>>> ???? >>???? >>>>>???? one node. >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? It was stated in the >>>>>>>>>> ??? documentation that >>>>>>>>>> ???? >>??? cusparse >>>>>>>>>> ???? >>???? >>>>>???? solver >>>>>>>>>> ???? >>???? >>>>>????? >???????? is slow. >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? However, in my test using >>>>>>>>>> ??? ex72.c, the >>>>>>>>>> ???? 
>>??? cusparse >>>>>>>>>> ???? >>???? >>>>>???? solver is >>>>>>>>>> ???? >>???? >>>>>????? >???????? faster than >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? mumps or superlu_dist >>>>>>>>>> on CPUs. >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? > Are we talking about the >>>>>>>>>> ??? factorization, the >>>>>>>>>> ???? >>??? solve, or >>>>>>>>>> ???? >>???? >>>>>???? both? >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? > We do not have an >>>>>>>>>> interface to >>>>>>>>>> ??? cuSparse's LU >>>>>>>>>> ???? >>???? >>>>>???? factorization (I >>>>>>>>>> ???? >>???? >>>>>????? >???????? just >>>>>>>>>> ???? >>???? >>>>>????? >????????? > learned that it exists a >>>>>>>>>> few weeks ago). >>>>>>>>>> ???? >>???? >>>>>????? >????????? > Perhaps your fast >>>>>>>>>> "cusparse solver" is >>>>>>>>>> ???? >>??? '-pc_type lu >>>>>>>>>> ???? >>???? >>>>>???? -mat_type >>>>>>>>>> ???? >>???? >>>>>????? >????????? > aijcusparse' ? This would >>>>>>>>>> be the CPU >>>>>>>>>> ???? >>??? factorization, >>>>>>>>>> ???? >>???? >>>>>???? which is the >>>>>>>>>> ???? >>???? >>>>>????? >????????? > dominant cost. >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? Chang >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? On 10/12/21 10:24 AM, >>>>>>>>>> Junchao >>>>>>>>>> ??? Zhang wrote: >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > Hi, Chang, >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? For the mumps >>>>>>>>>> solver, we >>>>>>>>>> ??? usually >>>>>>>>>> ???? >>??? transfers >>>>>>>>>> ???? >>???? >>>>>???? matrix >>>>>>>>>> ???? >>???? >>>>>????? >???????? and vector >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? data >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > within a compute >>>>>>>>>> node.? For >>>>>>>>>> ??? the idea you >>>>>>>>>> ???? >>???? >>>>>???? propose, it >>>>>>>>>> ???? >>???? >>>>>????? >???????? looks like >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? we need >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > to gather data within >>>>>>>>>> ???? >>??? MPI_COMM_WORLD, right? >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Mark, I >>>>>>>>>> remember you said >>>>>>>>>> ???? >>??? cusparse solve is >>>>>>>>>> ???? >>???? >>>>>???? slow >>>>>>>>>> ???? >>???? >>>>>????? >???????? and you would >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > rather do it on >>>>>>>>>> CPU. Is it right? >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > --Junchao Zhang >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > On Mon, Oct 11, >>>>>>>>>> 2021 at 10:25 PM >>>>>>>>>> ???? >>??? Chang Liu via >>>>>>>>>> ???? >>???? >>>>>???? petsc-users >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>> >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> ??? >>> >>>>>>>>>> >>>>>>>>> ??? >>>>>>>>>> ???? 
>>??? >>>>>>>>> > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> ??? >>>> >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> ??? >>> >>>>>>>>>> >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >>>>>>>>> ??? >>>>>> >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? wrote: >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Hi, >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Currently, it >>>>>>>>>> is possible >>>>>>>>>> ??? to use >>>>>>>>>> ???? >>??? mumps >>>>>>>>>> ???? >>???? >>>>>???? solver in >>>>>>>>>> ???? >>???? >>>>>????? >???????? PETSC with >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>> -mat_mumps_use_omp_threads >>>>>>>>>> ???? >>??? option, so that >>>>>>>>>> ???? >>???? >>>>>????? >???????? multiple MPI >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? processes will >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? transfer the >>>>>>>>>> matrix and >>>>>>>>>> ??? rhs data >>>>>>>>>> ???? >>??? to the master >>>>>>>>>> ???? >>???? >>>>>????? >???????? rank, and then >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? master >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? rank will call >>>>>>>>>> mumps with >>>>>>>>>> ??? OpenMP >>>>>>>>>> ???? >>??? to solve >>>>>>>>>> ???? >>???? >>>>>???? the matrix. >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? I wonder if >>>>>>>>>> someone can >>>>>>>>>> ??? develop >>>>>>>>>> ???? >>??? similar >>>>>>>>>> ???? >>???? >>>>>???? option for >>>>>>>>>> ???? >>???? >>>>>????? >???????? cusparse >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? solver. >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Right now, this >>>>>>>>>> solver >>>>>>>>>> ??? does not >>>>>>>>>> ???? >>??? work with >>>>>>>>>> ???? >>???? >>>>>????? >???????? mpiaijcusparse. I >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? think a >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? possible >>>>>>>>>> workaround is to >>>>>>>>>> ???? >>??? transfer all the >>>>>>>>>> ???? >>???? >>>>>???? matrix >>>>>>>>>> ???? >>???? >>>>>????? >???????? data to one MPI >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? process, and >>>>>>>>>> then upload the >>>>>>>>>> ???? >>??? data to GPU to >>>>>>>>>> ???? >>???? >>>>>???? solve. >>>>>>>>>> ???? >>???? >>>>>????? >???????? In this >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? way, one can >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? use cusparse >>>>>>>>>> solver for a MPI >>>>>>>>>> ???? >>??? program. >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? 
>???? Chang >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? -- >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Chang Liu >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Staff Research >>>>>>>>>> Physicist >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? +1 609 243 3438 >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > cliu at pppl.gov >>>>>>>>>> ??? >>>>>>>>> > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> >>>>>>>>>> ??? > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>> >>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>> ??? >>>>>>>>> > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> >>>>>>>>>> ??? > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>>> >>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>> ??? >>>>>>>>> > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> >>>>>>>>>> ??? > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>> >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? >>>>>>>>> ??? >>>>>>>>>> ???? >>??? > >>>>>>>>>> ??? >>>>>>>>>> >>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> >>>>>>>>>> ??? > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>>>> >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Princeton >>>>>>>>>> Plasma Physics >>>>>>>>>> ??? Laboratory >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? 100 Stellarator >>>>>>>>>> Rd, >>>>>>>>>> ??? Princeton NJ >>>>>>>>>> ???? >>??? 08540, USA >>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? -- >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? Chang Liu >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? Staff Research Physicist >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? +1 609 243 3438 >>>>>>>>>> ???? >>???? >>>>>????? >????????? > cliu at pppl.gov >>>>>>>>>> >>>>>>>>>> ??? > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> >>>>>>>>>> ??? > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>> >>>>>>>>>> ??? >>>>>>>>>> ???? >>??? > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> >>>>>>>>>> ??? >> >>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>> ??? >>>>>>>>> > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>>> >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? Princeton Plasma >>>>>>>>>> Physics Laboratory >>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? 100 Stellarator Rd, >>>>>>>>>> Princeton NJ >>>>>>>>>> ??? 08540, USA >>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>????? >???????? -- >>>>>>>>>> ???? >>???? >>>>>????? >???????? Chang Liu >>>>>>>>>> ???? >>???? >>>>>????? >???????? Staff Research Physicist >>>>>>>>>> ???? >>???? >>>>>????? >???????? +1 609 243 3438 >>>>>>>>>> ???? >>???? >>>>>????? > cliu at pppl.gov >>>>>>>>>> ??? > >>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >> >>>>>>>>>> >>>>>>>>> ??? >>>>>>>>>> ???? >>??? > >>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>> >>>>>>>>>> ??? >>> >>>>>>>>>> ???? >>???? >>>>>????? >???????? Princeton Plasma Physics >>>>>>>>>> Laboratory >>>>>>>>>> ???? >>???? >>>>>????? >???????? 100 Stellarator Rd, Princeton >>>>>>>>>> NJ 08540, USA >>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ???? >>???? >>>>>???? --???? Chang Liu >>>>>>>>>> ???? >>???? >>>>>???? Staff Research Physicist >>>>>>>>>> ???? >>???? >>>>>???? +1 609 243 3438 >>>>>>>>>> ???? >>???? >>>>> cliu at pppl.gov >>>>>>>>>> ??? > >>>>>>>>>> >>>>>>>>> ??? >>>>>>>>>> ???? >>??? >> >>>>>>>>>> ???? >>???? 
>>>>>???? Princeton Plasma Physics Laboratory >>>>>>>>>> ???? >>???? >>>>>???? 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>>>> ???? >>???? >>>> >>>>>>>>>> ???? >>???? >>>> -- >>>>>>>>>> ???? >>???? >>>> Chang Liu >>>>>>>>>> ???? >>???? >>>> Staff Research Physicist >>>>>>>>>> ???? >>???? >>>> +1 609 243 3438 >>>>>>>>>> ???? >>???? >>>> cliu at pppl.gov >>>>>>>>>> ??? > >>>>>>>>>> ???? >>???? >>>> Princeton Plasma Physics Laboratory >>>>>>>>>> ???? >>???? >>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>>>> ???? >>???? >> >>>>>>>>>> ???? >>???? >> -- >>>>>>>>>> ???? >>???? >> Chang Liu >>>>>>>>>> ???? >>???? >> Staff Research Physicist >>>>>>>>>> ???? >>???? >> +1 609 243 3438 >>>>>>>>>> ???? >>???? >> cliu at pppl.gov >>>>>>>>>> ??? > >>>>>>>>>> ???? >>???? >> Princeton Plasma Physics Laboratory >>>>>>>>>> ???? >>???? >> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>>>> ???? >>???? > >>>>>>>>>> ???? >>??? --???? Chang Liu >>>>>>>>>> ???? >>??? Staff Research Physicist >>>>>>>>>> ???? >>??? +1 609 243 3438 >>>>>>>>>> ???? >> cliu at pppl.gov >>>>>>>>>> >>>>>>>>> ??? > >>>>>>>>>> ???? >>??? Princeton Plasma Physics Laboratory >>>>>>>>>> ???? >>??? 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>>>> ???? > >>>>>>>>>> ???? > -- >>>>>>>>>> ???? > Chang Liu >>>>>>>>>> ???? > Staff Research Physicist >>>>>>>>>> ???? > +1 609 243 3438 >>>>>>>>>> ???? > cliu at pppl.gov >>>>>>>>>> ???? > Princeton Plasma Physics Laboratory >>>>>>>>>> ???? > 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Chang Liu >>>>>>>>> Staff Research Physicist >>>>>>>>> +1 609 243 3438 >>>>>>>>> cliu at pppl.gov >>>>>>>>> Princeton Plasma Physics Laboratory >>>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>> >>>>>>> -- >>>>>>> Chang Liu >>>>>>> Staff Research Physicist >>>>>>> +1 609 243 3438 >>>>>>> cliu at pppl.gov >>>>>>> Princeton Plasma Physics Laboratory >>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>> >>>>> -- >>>>> Chang Liu >>>>> Staff Research Physicist >>>>> +1 609 243 3438 >>>>> cliu at pppl.gov >>>>> Princeton Plasma Physics Laboratory >>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>> >>> -- >>> Chang Liu >>> Staff Research Physicist >>> +1 609 243 3438 >>> cliu at pppl.gov >>> Princeton Plasma Physics Laboratory >>> 100 Stellarator Rd, Princeton NJ 08540, USA >> > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From cliu at pppl.gov Thu Oct 14 21:39:34 2021 From: cliu at pppl.gov (Chang Liu) Date: Thu, 14 Oct 2021 22:39:34 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> Message-ID: <879c30a1-ea85-1c24-4139-268925d511da@pppl.gov> Hi Pierre and Barry, I think maybe I should use telescope outside bjacobi? 
like this: mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type telescope -pc_telescope_reduction_factor 4 -telescope_pc_type bjacobi -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4 -mat_type aijcusparse -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu -telescope_sub_pc_factor_mat_solver_type cusparse -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 But then I got an error: [0]PETSC ERROR: MatSolverType cusparse does not support matrix type seqaij But the mat type should be aijcusparse. I think telescope changes the mat type. Chang On 10/14/21 10:11 PM, Chang Liu wrote: > For comparison, here is the output using mumps instead of cusparse > > $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 > -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks > 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope > -sub_ksp_type preonly -sub_telescope_ksp_type preonly > -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type mumps > -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type > contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 > ? 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm > 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > ? 1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm > 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 > ? 2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm > 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 > ? 3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm > 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 > ? 4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm > 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 > ? 5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm > 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 > ? 6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm > 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 > ? 7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm > 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 > ? 8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm > 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03 > ? 
9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm > 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 > ?10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm > 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 > ?11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm > 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 > ?12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid norm > 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 > ?13 KSP unpreconditioned resid norm 1.885892030223e-01 true resid norm > 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 > ?14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid norm > 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 > ?15 KSP unpreconditioned resid norm 1.638219366731e-01 true resid norm > 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 > ?16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid norm > 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 > ?17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid norm > 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 > ?18 KSP unpreconditioned resid norm 1.289673236836e-01 true resid norm > 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 > ?19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid norm > 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 > ?20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid norm > 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 > ?21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid norm > 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 > ?22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid norm > 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 > ?23 KSP unpreconditioned resid norm 7.563496650115e-02 true resid norm > 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 > ?24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid norm > 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 > ?25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid norm > 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 > ?26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid norm > 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 > ?27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid norm > 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 > ?28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid norm > 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 > ?29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid norm > 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 > ?30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid norm > 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 > ?31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid norm > 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 > ?32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid norm > 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 > ?33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid norm > 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 > ?34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid norm > 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 > ?35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid norm > 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 > ?36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid norm > 5.533741607932e-03 
||r(i)||/||b|| 1.378276519869e-04 > ?37 KSP unpreconditioned resid norm 4.982347757923e-03 true resid norm > 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 > ?38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid norm > 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 > ?39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid norm > 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 > ?40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid norm > 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 > ?41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid norm > 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 > ?42 KSP unpreconditioned resid norm 2.772928845284e-03 true resid norm > 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 > ?43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid norm > 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 > ?44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid norm > 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 > ?45 KSP unpreconditioned resid norm 2.086800523919e-03 true resid norm > 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 > ?46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid norm > 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 > ?47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid norm > 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 > ?48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid norm > 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 > ?49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid norm > 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 > ?50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid norm > 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 > ?51 KSP unpreconditioned resid norm 1.375985053014e-03 true resid norm > 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 > ?52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid norm > 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 > ?53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid norm > 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 > ?54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid norm > 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 > ?55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid norm > 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 > ?56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid norm > 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 > ?57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid norm > 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 > ?58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid norm > 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 > ?59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid norm > 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 > ?60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid norm > 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 > ?61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid norm > 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 > ?62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid norm > 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 > ?63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid norm > 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 > ?64 KSP unpreconditioned resid norm 2.961034697265e-04 true 
resid norm > 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 > ?65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid norm > 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 > ?66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid norm > 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 > ?67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid norm > 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 > ?68 KSP unpreconditioned resid norm 2.008438265031e-04 true resid norm > 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 > ?69 KSP unpreconditioned resid norm 1.838732863386e-04 true resid norm > 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 > ?70 KSP unpreconditioned resid norm 1.723786027645e-04 true resid norm > 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 > ?71 KSP unpreconditioned resid norm 1.580945192204e-04 true resid norm > 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 > ?72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid norm > 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 > ?73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid norm > 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 > ?74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid norm > 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 > ?75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid norm > 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 > ?76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid norm > 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 > ?77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid norm > 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 > ?78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid norm > 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 > ?79 KSP unpreconditioned resid norm 8.896901670359e-05 true resid norm > 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 > ?80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid norm > 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 > ?81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid norm > 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 > ?82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid norm > 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 > ?83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid norm > 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 > ?84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid norm > 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 > ?85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid norm > 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 > ?86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid norm > 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 > ?87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid norm > 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 > ?88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid norm > 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 > ?89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid norm > 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 > ?90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid norm > 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 > ?91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid norm > 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 > ?92 KSP unpreconditioned resid 
norm 1.707071292484e-05 true resid norm > 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 > ?93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid norm > 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 > ?94 KSP unpreconditioned resid norm 1.219393542993e-05 true resid norm > 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 > ?95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid norm > 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 > ?96 KSP unpreconditioned resid norm 9.099659872548e-06 true resid norm > 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 > ?97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid norm > 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 > ?98 KSP unpreconditioned resid norm 7.167226146744e-06 true resid norm > 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 > ?99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid norm > 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 > 100 KSP unpreconditioned resid norm 5.767783600111e-06 true resid norm > 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 > 101 KSP unpreconditioned resid norm 5.261057430584e-06 true resid norm > 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 > 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid norm > 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 > 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid norm > 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 > 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid norm > 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 > 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid norm > 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 > 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid norm > 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 > 107 KSP unpreconditioned resid norm 3.234218880987e-06 true resid norm > 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 > 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid norm > 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 > 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid norm > 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 > 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid norm > 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 > 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid norm > 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 > 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid norm > 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 > 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid norm > 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 > 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid norm > 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 > 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid norm > 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 > 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid norm > 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 > 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid norm > 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 > 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid norm > 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 > 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid norm > 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 > 
120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid norm > 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 > 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid norm > 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 > 122 KSP unpreconditioned resid norm 7.945760150897e-07 true resid norm > 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 > 123 KSP unpreconditioned resid norm 7.141240839013e-07 true resid norm > 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 > 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid norm > 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 > 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid norm > 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 > 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid norm > 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 > 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid norm > 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 > 128 KSP unpreconditioned resid norm 4.253264691112e-07 true resid norm > 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 > 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid norm > 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 > 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid norm > 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 > 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid norm > 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 > 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid norm > 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 > 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid norm > 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 > 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid norm > 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 > 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid norm > 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 > 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid norm > 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 > 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid norm > 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 > 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid norm > 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 > 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid norm > 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 > 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid norm > 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 > 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid norm > 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 > 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid norm > 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 > 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid norm > 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 > 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid norm > 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 > 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid norm > 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 > 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid norm > 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 > 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid norm > 6.483393931383e-08 
||r(i)||/||b|| 1.614804278515e-09 > 148 KSP unpreconditioned resid norm 5.690132597004e-08 true resid norm > 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 > 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid norm > 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 > 150 KSP unpreconditioned resid norm 4.625371062660e-08 true resid norm > 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 > 151 KSP unpreconditioned resid norm 4.349049084805e-08 true resid norm > 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 > 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid norm > 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 > 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid norm > 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 > 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid norm > 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 > 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid norm > 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 > 156 KSP unpreconditioned resid norm 2.128504065757e-08 true resid norm > 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 > 157 KSP unpreconditioned resid norm 1.879248809429e-08 true resid norm > 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 > 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid norm > 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 > 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid norm > 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 > 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid norm > 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 > 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid norm > 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 > 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid norm > 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 > 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid norm > 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 > 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid norm > 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 > 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid norm > 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 > 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid norm > 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 > 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid norm > 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 > 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid norm > 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 > 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid norm > 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 > 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid norm > 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 > 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid norm > 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 > 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid norm > 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 > 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid norm > 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 > 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid norm > 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 > 175 KSP unpreconditioned resid norm 3.924177535665e-09 true 
resid norm > 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 > 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid norm > 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 > 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid norm > 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 > 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid norm > 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 > 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid norm > 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 > 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid norm > 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 > 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid norm > 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 > 182 KSP unpreconditioned resid norm 1.983545827983e-09 true resid norm > 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 > 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid norm > 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 > 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid norm > 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 > 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid norm > 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 > 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid norm > 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 > 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid norm > 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 > 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid norm > 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 > 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid norm > 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 > KSP Object: 16 MPI processes > ? type: fgmres > ??? restart=30, using Classical (unmodified) Gram-Schmidt > Orthogonalization with no iterative refinement > ??? happy breakdown tolerance 1e-30 > ? maximum iterations=2000, initial guess is zero > ? tolerances:? relative=1e-20, absolute=1e-09, divergence=10000. > ? right preconditioning > ? using UNPRECONDITIONED norm type for convergence test > PC Object: 16 MPI processes > ? type: bjacobi > ??? number of blocks = 4 > ??? Local solver information for first block is in the following KSP > and PC objects on rank 0: > ??? Use -ksp_view ::ascii_info_detail to display information for all > blocks > ? KSP Object: (sub_) 4 MPI processes > ??? type: preonly > ??? maximum iterations=10000, initial guess is zero > ??? tolerances:? relative=1e-05, absolute=1e-50, divergence=10000. > ??? left preconditioning > ??? using NONE norm type for convergence test > ? PC Object: (sub_) 4 MPI processes > ??? type: telescope > ????? petsc subcomm: parent comm size reduction factor = 4 > ????? petsc subcomm: parent_size = 4 , subcomm_size = 1 > ????? petsc subcomm type = contiguous > ??? linear system matrix = precond matrix: > ??? Mat Object: (sub_) 4 MPI processes > ????? type: mpiaij > ????? rows=40200, cols=40200 > ????? total: nonzeros=199996, allocated nonzeros=203412 > ????? total number of mallocs used during MatSetValues calls=0 > ??????? not using I-node (on process 0) routines > ??????? setup type: default > ??????? Parent DM object: NULL > ??????? Sub DM object: NULL > ??????? KSP Object:?? (sub_telescope_)?? 1 MPI processes > ????????? type: preonly > ????????? maximum iterations=10000, initial guess is zero > ????????? 
tolerances:? relative=1e-05, absolute=1e-50, divergence=10000. > ????????? left preconditioning > ????????? using NONE norm type for convergence test > ??????? PC Object:?? (sub_telescope_)?? 1 MPI processes > ????????? type: lu > ??????????? out-of-place factorization > ??????????? tolerance for zero pivot 2.22045e-14 > ??????????? matrix ordering: external > ??????????? factor fill ratio given 0., needed 0. > ????????????? Factored matrix follows: > ??????????????? Mat Object:?? 1 MPI processes > ????????????????? type: mumps > ????????????????? rows=40200, cols=40200 > ????????????????? package used to perform factorization: mumps > ????????????????? total: nonzeros=1849788, allocated nonzeros=1849788 > ??????????????????? MUMPS run parameters: > ????????????????????? SYM (matrix type):?????????????????? 0 > ????????????????????? PAR (host participation):??????????? 1 > ????????????????????? ICNTL(1) (output for error):???????? 6 > ????????????????????? ICNTL(2) (output of diagnostic msg): 0 > ????????????????????? ICNTL(3) (output for global info):?? 0 > ????????????????????? ICNTL(4) (level of printing):??????? 0 > ????????????????????? ICNTL(5) (input mat struct):???????? 0 > ????????????????????? ICNTL(6) (matrix prescaling):??????? 7 > ????????????????????? ICNTL(7) (sequential matrix ordering):7 > ????????????????????? ICNTL(8) (scaling strategy):??????? 77 > ????????????????????? ICNTL(10) (max num of refinements):? 0 > ????????????????????? ICNTL(11) (error analysis):????????? 0 > ????????????????????? ICNTL(12) (efficiency control): ?????? 1 > ????????????????????? ICNTL(13) (sequential factorization of the root > node):? 0 > ????????????????????? ICNTL(14) (percentage of estimated workspace > increase): 20 > ????????????????????? ICNTL(18) (input mat struct): ?????? 0 > ????????????????????? ICNTL(19) (Schur complement info): ?????? 0 > ????????????????????? ICNTL(20) (RHS sparse pattern): ?????? 0 > ????????????????????? ICNTL(21) (solution struct): ?????? 0 > ????????????????????? ICNTL(22) (in-core/out-of-core facility): ?????? 0 > ????????????????????? ICNTL(23) (max size of memory can be allocated > locally):0 > ????????????????????? ICNTL(24) (detection of null pivot rows): ?????? 0 > ????????????????????? ICNTL(25) (computation of a null space basis): > ?????? 0 > ????????????????????? ICNTL(26) (Schur options for RHS or solution): > ?????? 0 > ????????????????????? ICNTL(27) (blocking size for multiple RHS): > ?????? -32 > ????????????????????? ICNTL(28) (use parallel or sequential ordering): > ?????? 1 > ????????????????????? ICNTL(29) (parallel ordering): ?????? 0 > ????????????????????? ICNTL(30) (user-specified set of entries in > inv(A)):??? 0 > ????????????????????? ICNTL(31) (factors is discarded in the solve > phase):??? 0 > ????????????????????? ICNTL(33) (compute determinant): ?????? 0 > ????????????????????? ICNTL(35) (activate BLR based factorization): > ?????? 0 > ????????????????????? ICNTL(36) (choice of BLR factorization variant): > ?????? 0 > ????????????????????? ICNTL(38) (estimated compression rate of LU > factors):?? 333 > ????????????????????? CNTL(1) (relative pivoting threshold):????? 0.01 > ????????????????????? CNTL(2) (stopping criterion of refinement): > 1.49012e-08 > ????????????????????? CNTL(3) (absolute pivoting threshold):????? 0. > ????????????????????? CNTL(4) (value of static pivoting):???????? -1. > ????????????????????? CNTL(5) (fixation for null pivots):???????? 0. > ????????????????????? 
CNTL(7) (dropping parameter for BLR):?????? 0. > ????????????????????? RINFO(1) (local estimated flops for the > elimination after analysis): > ??????????????????????? [0] 1.45525e+08 > ????????????????????? RINFO(2) (local estimated flops for the assembly > after factorization): > ??????????????????????? [0]? 2.89397e+06 > ????????????????????? RINFO(3) (local estimated flops for the > elimination after factorization): > ??????????????????????? [0]? 1.45525e+08 > ????????????????????? INFO(15) (estimated size of (in MB) MUMPS > internal data for running numerical factorization): > ????????????????????? [0] 29 > ????????????????????? INFO(16) (size of (in MB) MUMPS internal data > used during numerical factorization): > ??????????????????????? [0] 29 > ????????????????????? INFO(23) (num of pivots eliminated on this > processor after factorization): > ??????????????????????? [0] 40200 > ????????????????????? RINFOG(1) (global estimated flops for the > elimination after analysis): 1.45525e+08 > ????????????????????? RINFOG(2) (global estimated flops for the > assembly after factorization): 2.89397e+06 > ????????????????????? RINFOG(3) (global estimated flops for the > elimination after factorization): 1.45525e+08 > ????????????????????? (RINFOG(12) RINFOG(13))*2^INFOG(34) > (determinant): (0.,0.)*(2^0) > ????????????????????? INFOG(3) (estimated real workspace for factors on > all processors after analysis): 1849788 > ????????????????????? INFOG(4) (estimated integer workspace for factors > on all processors after analysis): 879986 > ????????????????????? INFOG(5) (estimated maximum front size in the > complete tree): 282 > ????????????????????? INFOG(6) (number of nodes in the complete tree): > 23709 > ????????????????????? INFOG(7) (ordering option effectively used after > analysis): 5 > ????????????????????? INFOG(8) (structural symmetry in percent of the > permuted matrix after analysis): 100 > ????????????????????? INFOG(9) (total real/complex workspace to store > the matrix factors after factorization): 1849788 > ????????????????????? INFOG(10) (total integer space store the matrix > factors after factorization): 879986 > ????????????????????? INFOG(11) (order of largest frontal matrix after > factorization): 282 > ????????????????????? INFOG(12) (number of off-diagonal pivots): 0 > ????????????????????? INFOG(13) (number of delayed pivots after > factorization): 0 > ????????????????????? INFOG(14) (number of memory compress after > factorization): 0 > ????????????????????? INFOG(15) (number of steps of iterative > refinement after solution): 0 > ????????????????????? INFOG(16) (estimated size (in MB) of all MUMPS > internal data for factorization after analysis: value on the most memory > consuming processor): 29 > ????????????????????? INFOG(17) (estimated size of all MUMPS internal > data for factorization after analysis: sum over all processors): 29 > ????????????????????? INFOG(18) (size of all MUMPS internal data > allocated during factorization: value on the most memory consuming > processor): 29 > ????????????????????? INFOG(19) (size of all MUMPS internal data > allocated during factorization: sum over all processors): 29 > ????????????????????? INFOG(20) (estimated number of entries in the > factors): 1849788 > ????????????????????? INFOG(21) (size in MB of memory effectively used > during factorization - value on the most memory consuming processor): 26 > ????????????????????? 
INFOG(22) (size in MB of memory effectively used > during factorization - sum over all processors): 26 > ????????????????????? INFOG(23) (after analysis: value of ICNTL(6) > effectively used): 0 > ????????????????????? INFOG(24) (after analysis: value of ICNTL(12) > effectively used): 1 > ????????????????????? INFOG(25) (after factorization: number of pivots > modified by static pivoting): 0 > ????????????????????? INFOG(28) (after factorization: number of null > pivots encountered): 0 > ????????????????????? INFOG(29) (after factorization: effective number > of entries in the factors (sum over all processors)): 1849788 > ????????????????????? INFOG(30, 31) (after solution: size in Mbytes of > memory used during solution phase): 29, 29 > ????????????????????? INFOG(32) (after analysis: type of analysis done): 1 > ????????????????????? INFOG(33) (value used for ICNTL(8)): 7 > ????????????????????? INFOG(34) (exponent of the determinant if > determinant is requested): 0 > ????????????????????? INFOG(35) (after factorization: number of entries > taking into account BLR factor compression - sum over all processors): > 1849788 > ????????????????????? INFOG(36) (after analysis: estimated size of all > MUMPS internal data for running BLR in-core - value on the most memory > consuming processor): 0 > ????????????????????? INFOG(37) (after analysis: estimated size of all > MUMPS internal data for running BLR in-core - sum over all processors): 0 > ????????????????????? INFOG(38) (after analysis: estimated size of all > MUMPS internal data for running BLR out-of-core - value on the most > memory consuming processor): 0 > ????????????????????? INFOG(39) (after analysis: estimated size of all > MUMPS internal data for running BLR out-of-core - sum over all > processors): 0 > ????????? linear system matrix = precond matrix: > ????????? Mat Object:?? 1 MPI processes > ??????????? type: seqaijcusparse > ??????????? rows=40200, cols=40200 > ??????????? total: nonzeros=199996, allocated nonzeros=199996 > ??????????? total number of mallocs used during MatSetValues calls=0 > ????????????? not using I-node routines > ? linear system matrix = precond matrix: > ? Mat Object: 16 MPI processes > ??? type: mpiaijcusparse > ??? rows=160800, cols=160800 > ??? total: nonzeros=802396, allocated nonzeros=1608000 > ??? total number of mallocs used during MatSetValues calls=0 > ????? not using I-node (on process 0) routines > Norm of error 9.11684e-07 iterations 189 > > Chang > > > > On 10/14/21 10:10 PM, Chang Liu wrote: >> Hi Barry, >> >> No problem. Here is the output. It seems that the resid norm >> calculation is incorrect. >> >> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 >> -ksp_view -ksp_monitor_true_residual -pc_type bjacobi >> -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse >> -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type >> preonly -sub_telescope_pc_type lu >> -sub_telescope_pc_factor_mat_solver_type cusparse >> -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type >> contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >> ?? 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid >> norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >> ?? 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid >> norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >> KSP Object: 16 MPI processes >> ?? type: fgmres >> ???? 
restart=30, using Classical (unmodified) Gram-Schmidt >> Orthogonalization with no iterative refinement >> ???? happy breakdown tolerance 1e-30 >> ?? maximum iterations=2000, initial guess is zero >> ?? tolerances:? relative=1e-20, absolute=1e-09, divergence=10000. >> ?? right preconditioning >> ?? using UNPRECONDITIONED norm type for convergence test >> PC Object: 16 MPI processes >> ?? type: bjacobi >> ???? number of blocks = 4 >> ???? Local solver information for first block is in the following KSP >> and PC objects on rank 0: >> ???? Use -ksp_view ::ascii_info_detail to display information for all >> blocks >> ?? KSP Object: (sub_) 4 MPI processes >> ???? type: preonly >> ???? maximum iterations=10000, initial guess is zero >> ???? tolerances:? relative=1e-05, absolute=1e-50, divergence=10000. >> ???? left preconditioning >> ???? using NONE norm type for convergence test >> ?? PC Object: (sub_) 4 MPI processes >> ???? type: telescope >> ?????? petsc subcomm: parent comm size reduction factor = 4 >> ?????? petsc subcomm: parent_size = 4 , subcomm_size = 1 >> ?????? petsc subcomm type = contiguous >> ???? linear system matrix = precond matrix: >> ???? Mat Object: (sub_) 4 MPI processes >> ?????? type: mpiaij >> ?????? rows=40200, cols=40200 >> ?????? total: nonzeros=199996, allocated nonzeros=203412 >> ?????? total number of mallocs used during MatSetValues calls=0 >> ???????? not using I-node (on process 0) routines >> ???????? setup type: default >> ???????? Parent DM object: NULL >> ???????? Sub DM object: NULL >> ???????? KSP Object:?? (sub_telescope_)?? 1 MPI processes >> ?????????? type: preonly >> ?????????? maximum iterations=10000, initial guess is zero >> ?????????? tolerances:? relative=1e-05, absolute=1e-50, divergence=10000. >> ?????????? left preconditioning >> ?????????? using NONE norm type for convergence test >> ???????? PC Object:?? (sub_telescope_)?? 1 MPI processes >> ?????????? type: lu >> ???????????? out-of-place factorization >> ???????????? tolerance for zero pivot 2.22045e-14 >> ???????????? matrix ordering: nd >> ???????????? factor fill ratio given 5., needed 8.62558 >> ?????????????? Factored matrix follows: >> ???????????????? Mat Object:?? 1 MPI processes >> ?????????????????? type: seqaijcusparse >> ?????????????????? rows=40200, cols=40200 >> ?????????????????? package used to perform factorization: cusparse >> ?????????????????? total: nonzeros=1725082, allocated nonzeros=1725082 >> ???????????????????? not using I-node routines >> ?????????? linear system matrix = precond matrix: >> ?????????? Mat Object:?? 1 MPI processes >> ???????????? type: seqaijcusparse >> ???????????? rows=40200, cols=40200 >> ???????????? total: nonzeros=199996, allocated nonzeros=199996 >> ???????????? total number of mallocs used during MatSetValues calls=0 >> ?????????????? not using I-node routines >> ?? linear system matrix = precond matrix: >> ?? Mat Object: 16 MPI processes >> ???? type: mpiaijcusparse >> ???? rows=160800, cols=160800 >> ???? total: nonzeros=802396, allocated nonzeros=1608000 >> ???? total number of mallocs used during MatSetValues calls=0 >> ?????? not using I-node (on process 0) routines >> Norm of error 400.999 iterations 1 >> >> Chang >> >> >> On 10/14/21 9:47 PM, Barry Smith wrote: >>> >>> ?? Chang, >>> >>> ??? Sorry I did not notice that one. Please run that with -ksp_view >>> -ksp_monitor_true_residual so we can see exactly how options are >>> interpreted and solver used. 
At a glance it looks ok but something >>> must be wrong to get the wrong answer. >>> >>> ?? Barry >>> >>>> On Oct 14, 2021, at 6:02 PM, Chang Liu wrote: >>>> >>>> Hi Barry, >>>> >>>> That is exactly what I was doing in the second example, in which the >>>> preconditioner works but the GMRES does not. >>>> >>>> Chang >>>> >>>> On 10/14/21 5:15 PM, Barry Smith wrote: >>>>> ?? You need to use the PCTELESCOPE inside the block Jacobi, not >>>>> outside it. So something like -pc_type bjacobi -sub_pc_type >>>>> telescope -sub_telescope_pc_type lu >>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: >>>>>> >>>>>> Hi Pierre, >>>>>> >>>>>> I wonder if the trick of PCTELESCOPE only works for preconditioner >>>>>> and not for the solver. I have done some tests, and find that for >>>>>> solving a small matrix using -telescope_ksp_type preonly, it does >>>>>> work for GPU with multiple MPI processes. However, for bjacobi and >>>>>> gmres, it does not work. >>>>>> >>>>>> The command line options I used for small matrix is like >>>>>> >>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short >>>>>> -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu >>>>>> -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type >>>>>> preonly -pc_telescope_reduction_factor 4 >>>>>> >>>>>> which gives the correct output. For iterative solver, I tried >>>>>> >>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short >>>>>> -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type >>>>>> aijcusparse -sub_pc_type telescope -sub_ksp_type preonly >>>>>> -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu >>>>>> -sub_telescope_pc_factor_mat_solver_type cusparse >>>>>> -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol >>>>>> 1.e-9 -ksp_atol 1.e-20 >>>>>> >>>>>> for large matrix. The output is like >>>>>> >>>>>> ? 0 KSP Residual norm 40.1497 >>>>>> ? 1 KSP Residual norm < 1.e-11 >>>>>> Norm of error 400.999 iterations 1 >>>>>> >>>>>> So it seems to call a direct solver instead of an iterative one. >>>>>> >>>>>> Can you please help check these options? >>>>>> >>>>>> Chang >>>>>> >>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote: >>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote: >>>>>>>> >>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This >>>>>>>> sounds exactly what I need. I wonder if PCTELESCOPE can >>>>>>>> transform a mpiaijcusparse to seqaircusparse? Or I have to do it >>>>>>>> manually? >>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). >>>>>>> 1) I?m not sure this is implemented for cuSparse matrices, but it >>>>>>> should be; >>>>>>> 2) at least for the implementations >>>>>>> MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and >>>>>>> MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType >>>>>>> is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? >>>>>>> enough to detect if the MPI communicator on which the Mat lives >>>>>>> is of size 1 (your case), and then the resulting Mat is of type >>>>>>> MatSeqX instead of MatMPIX, so you would not need to worry about >>>>>>> the transformation you are mentioning. >>>>>>> If you try this out and this does not work, please provide the >>>>>>> backtrace (probably something like ?Operation XYZ not implemented >>>>>>> for MatType ABC?), and hopefully someone can add the missing >>>>>>> plumbing. >>>>>>> I do not claim that this will be efficient, but I think this goes >>>>>>> in the direction of what you want to achieve. 
>>>>>>> Thanks, >>>>>>> Pierre >>>>>>>> Chang >>>>>>>> >>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >>>>>>>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as a >>>>>>>>> subdomain solver, with a reduction factor equal to the number >>>>>>>>> of MPI processes you have per block? >>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X >>>>>>>>> -sub_telescope_pc_type lu >>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads >>>>>>>>> because not only do the Mat needs to be redistributed, the >>>>>>>>> secondary processes also need to be ?converted? to OpenMP threads. >>>>>>>>> Thus the need for specific code in mumps.c. >>>>>>>>> Thanks, >>>>>>>>> Pierre >>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Hi Junchao, >>>>>>>>>> >>>>>>>>>> Yes that is what I want. >>>>>>>>>> >>>>>>>>>> Chang >>>>>>>>>> >>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith >>>>>>>>>> > wrote: >>>>>>>>>>> ?????? Junchao, >>>>>>>>>>> ????????? If I understand correctly Chang is using the block >>>>>>>>>>> Jacobi >>>>>>>>>>> ??? method with a single block for a number of MPI ranks and >>>>>>>>>>> a direct >>>>>>>>>>> ??? solver for each block so it uses >>>>>>>>>>> PCSetUp_BJacobi_Multiproc() which >>>>>>>>>>> ??? is code Hong Zhang wrote a number of years ago for CPUs. >>>>>>>>>>> For their >>>>>>>>>>> ??? particular problems this preconditioner works well, but >>>>>>>>>>> using an >>>>>>>>>>> ??? iterative solver on the blocks does not work well. >>>>>>>>>>> ????????? If we had complete MPI-GPU direct solvers he could >>>>>>>>>>> just use >>>>>>>>>>> ??? the current code with MPIAIJCUSPARSE on each block but >>>>>>>>>>> since we do >>>>>>>>>>> ??? not he would like to use a single GPU for each block, >>>>>>>>>>> this means >>>>>>>>>>> ??? that diagonal blocks of? the global parallel MPI matrix >>>>>>>>>>> needs to be >>>>>>>>>>> ??? sent to a subset of the GPUs (one GPU per block, which >>>>>>>>>>> has multiple >>>>>>>>>>> ??? MPI ranks associated with the blocks). Similarly for the >>>>>>>>>>> triangular >>>>>>>>>>> ??? solves the blocks of the right hand side needs to be >>>>>>>>>>> shipped to the >>>>>>>>>>> ??? appropriate GPU and the resulting solution shipped back >>>>>>>>>>> to the >>>>>>>>>>> ??? multiple GPUs. So Chang is absolutely correct, this is >>>>>>>>>>> somewhat like >>>>>>>>>>> ??? your code for MUMPS with OpenMP. OK, I now understand the >>>>>>>>>>> background.. >>>>>>>>>>> ??? One could use PCSetUp_BJacobi_Multiproc() and get the >>>>>>>>>>> blocks on the >>>>>>>>>>> ??? MPI ranks and then shrink each block down to a single GPU >>>>>>>>>>> but this >>>>>>>>>>> ??? would be pretty inefficient, ideally one would go >>>>>>>>>>> directly from the >>>>>>>>>>> ??? big MPI matrix on all the GPUs to the sub matrices on the >>>>>>>>>>> subset of >>>>>>>>>>> ??? GPUs. But this may be a large coding project. >>>>>>>>>>> I don't understand these sentences. Why do you say "shrink"? >>>>>>>>>>> In my mind, we just need to move each block (submatrix) >>>>>>>>>>> living over multiple MPI ranks to one of them and solve >>>>>>>>>>> directly there.? In other words, we keep blocks' size, no >>>>>>>>>>> shrinking or expanding. >>>>>>>>>>> As mentioned before, cusparse does not provide LU >>>>>>>>>>> factorization. So the LU factorization would be done on CPU, >>>>>>>>>>> and the solve be done on GPU. 
I assume Chang wants to gain >>>>>>>>>>> from the (potential) faster solve (instead of factorization) >>>>>>>>>>> on GPU. >>>>>>>>>>> ?????? Barry >>>>>>>>>>> ??? Since the matrices being factored and solved directly are >>>>>>>>>>> relatively >>>>>>>>>>> ??? large it is possible that the cusparse code could be >>>>>>>>>>> reasonably >>>>>>>>>>> ??? efficient (they are not the tiny problems one gets at the >>>>>>>>>>> coarse >>>>>>>>>>> ??? level of multigrid). Of course, this is speculation, I don't >>>>>>>>>>> ??? actually know how much better the cusparse code would be >>>>>>>>>>> on the >>>>>>>>>>> ??? direct solver than a good CPU direct sparse solver. >>>>>>>>>>> ???? > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>>>>>> ??? > wrote: >>>>>>>>>>> ???? > >>>>>>>>>>> ???? > Sorry I am not familiar with the details either. Can >>>>>>>>>>> you please >>>>>>>>>>> ??? check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>>>>>>> ???? > >>>>>>>>>>> ???? > Chang >>>>>>>>>>> ???? > >>>>>>>>>>> ???? > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>>>>>>> ???? >> Hi Chang, >>>>>>>>>>> ???? >>?? I did the work in mumps. It is easy for me to >>>>>>>>>>> understand >>>>>>>>>>> ??? gathering matrix rows to one process. >>>>>>>>>>> ???? >>?? But how to gather blocks (submatrices) to form a >>>>>>>>>>> large block????? Can you draw a picture of that? >>>>>>>>>>> ???? >>?? Thanks >>>>>>>>>>> ???? >> --Junchao Zhang >>>>>>>>>>> ???? >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via >>>>>>>>>>> petsc-users >>>>>>>>>>> ??? >>>>>>>>>>> ??? >>>>>>>>>> >> >>>>>>>>>>> ??? wrote: >>>>>>>>>>> ???? >>??? Hi Barry, >>>>>>>>>>> ???? >>??? I think mumps solver in petsc does support that. >>>>>>>>>>> You can >>>>>>>>>>> ??? check the >>>>>>>>>>> ???? >>??? documentation on "-mat_mumps_use_omp_threads" at >>>>>>>>>>> ???? >> >>>>>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> ???? >> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> > >>>>>>>>>>> >>>>>>>>>>> ???? >>??? and the code enclosed by #if >>>>>>>>>>> ??? defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>>>>>>> ???? >>??? functions MatMumpsSetUpDistRHSInfo and >>>>>>>>>>> ???? >>??? MatMumpsGatherNonzerosOnMaster in >>>>>>>>>>> ???? >>??? mumps.c >>>>>>>>>>> ???? >>??? 1. I understand it is ideal to do one MPI rank per >>>>>>>>>>> GPU. >>>>>>>>>>> ??? However, I am >>>>>>>>>>> ???? >>??? working on an existing code that was developed >>>>>>>>>>> based on MPI >>>>>>>>>>> ??? and the the >>>>>>>>>>> ???? >>??? # of mpi ranks is typically equal to # of cpu >>>>>>>>>>> cores. We don't >>>>>>>>>>> ??? want to >>>>>>>>>>> ???? >>??? change the whole structure of the code. >>>>>>>>>>> ???? >>??? 2. What you have suggested has been coded in >>>>>>>>>>> mumps.c. See >>>>>>>>>>> ??? function >>>>>>>>>>> ???? >>??? MatMumpsSetUpDistRHSInfo. >>>>>>>>>>> ???? >>??? Regards, >>>>>>>>>>> ???? >>??? Chang >>>>>>>>>>> ???? >>??? On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>???? >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>>>>>>>> >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >> wrote: >>>>>>>>>>> ???? >>???? >> >>>>>>>>>>> ???? >>???? >> Hi Barry, >>>>>>>>>>> ???? >>???? >> >>>>>>>>>>> ???? >>???? >> That is exactly what I want. >>>>>>>>>>> ???? >>???? >> >>>>>>>>>>> ???? >>???? >> Back to my original question, I am looking for >>>>>>>>>>> an approach to >>>>>>>>>>> ???? >>??? transfer >>>>>>>>>>> ???? >>???? >> matrix >>>>>>>>>>> ???? >>???? 
>> data from many MPI processes to "master" MPI >>>>>>>>>>> ???? >>???? >> processes, each of which taking care of one >>>>>>>>>>> GPU, and then >>>>>>>>>>> ??? upload >>>>>>>>>>> ???? >>??? the data to GPU to >>>>>>>>>>> ???? >>???? >> solve. >>>>>>>>>>> ???? >>???? >> One can just grab some codes from mumps.c to >>>>>>>>>>> ??? aijcusparse.cu >>>>>>>>>>> ???? >>??? >. >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>???? >??? mumps.c doesn't actually do that. It never >>>>>>>>>>> needs to >>>>>>>>>>> ??? copy the >>>>>>>>>>> ???? >>??? entire matrix to a single MPI rank. >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>???? >??? It would be possible to write such a code >>>>>>>>>>> that you >>>>>>>>>>> ??? suggest but >>>>>>>>>>> ???? >>??? it is not clear that it makes sense >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>???? > 1)? For normal PETSc GPU usage there is one GPU >>>>>>>>>>> per MPI >>>>>>>>>>> ??? rank, so >>>>>>>>>>> ???? >>??? while your one GPU per big domain is solving its >>>>>>>>>>> systems the >>>>>>>>>>> ??? other >>>>>>>>>>> ???? >>??? GPUs (with the other MPI ranks that share that >>>>>>>>>>> domain) are doing >>>>>>>>>>> ???? >>??? nothing. >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>???? > 2) For each triangular solve you would have to >>>>>>>>>>> gather the >>>>>>>>>>> ??? right >>>>>>>>>>> ???? >>??? hand side from the multiple ranks to the single >>>>>>>>>>> GPU to pass it to >>>>>>>>>>> ???? >>??? the GPU solver and then scatter the resulting >>>>>>>>>>> solution back >>>>>>>>>>> ??? to all >>>>>>>>>>> ???? >>??? of its subdomain ranks. >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>???? >??? What I was suggesting was assign an entire >>>>>>>>>>> subdomain to a >>>>>>>>>>> ???? >>??? single MPI rank, thus it does everything on one >>>>>>>>>>> GPU and can >>>>>>>>>>> ??? use the >>>>>>>>>>> ???? >>??? GPU solver directly. If all the major computations >>>>>>>>>>> of a subdomain >>>>>>>>>>> ???? >>??? can fit and be done on a single GPU then you would be >>>>>>>>>>> ??? utilizing all >>>>>>>>>>> ???? >>??? the GPUs you are using effectively. >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>???? >??? Barry >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>???? >> >>>>>>>>>>> ???? >>???? >> Chang >>>>>>>>>>> ???? >>???? >> >>>>>>>>>>> ???? >>???? >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>>>>>>>> ???? >>???? >>>??? Chang, >>>>>>>>>>> ???? >>???? >>>????? You are correct there is no MPI + GPU >>>>>>>>>>> direct >>>>>>>>>>> ??? solvers that >>>>>>>>>>> ???? >>??? currently do the triangular solves with MPI + GPU >>>>>>>>>>> parallelism >>>>>>>>>>> ??? that I >>>>>>>>>>> ???? >>??? am aware of. You are limited that individual >>>>>>>>>>> triangular solves be >>>>>>>>>>> ???? >>??? done on a single GPU. I can only suggest making >>>>>>>>>>> each subdomain as >>>>>>>>>>> ???? >>??? big as possible to utilize each GPU as much as >>>>>>>>>>> possible for the >>>>>>>>>>> ???? >>??? direct triangular solves. >>>>>>>>>>> ???? >>???? >>>???? Barry >>>>>>>>>>> ???? >>???? >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via >>>>>>>>>>> petsc-users >>>>>>>>>>> ???? >>??? >>>>>>>>>> >>>>>>>>>>> ??? >>>>>>>>>> >> >>>>>>>>>>> ??? wrote: >>>>>>>>>>> ???? >>???? >>>> >>>>>>>>>>> ???? >>???? >>>> Hi Mark, >>>>>>>>>>> ???? >>???? >>>> >>>>>>>>>>> ???? >>???? >>>> '-mat_type aijcusparse' works with >>>>>>>>>>> mpiaijcusparse with >>>>>>>>>>> ??? other >>>>>>>>>>> ???? >>??? 
solvers, but with -pc_factor_mat_solver_type >>>>>>>>>>> cusparse, it >>>>>>>>>>> ??? will give >>>>>>>>>>> ???? >>??? an error. >>>>>>>>>>> ???? >>???? >>>> >>>>>>>>>>> ???? >>???? >>>> Yes what I want is to have mumps or superlu >>>>>>>>>>> to do the >>>>>>>>>>> ???? >>??? factorization, and then do the rest, including >>>>>>>>>>> GMRES solver, >>>>>>>>>>> ??? on gpu. >>>>>>>>>>> ???? >>??? Is that possible? >>>>>>>>>>> ???? >>???? >>>> >>>>>>>>>>> ???? >>???? >>>> I have tried to use aijcusparse with >>>>>>>>>>> superlu_dist, it >>>>>>>>>>> ??? runs but >>>>>>>>>>> ???? >>??? the iterative solver is still running on CPUs. I have >>>>>>>>>>> ??? contacted the >>>>>>>>>>> ???? >>??? superlu group and they confirmed that is the case >>>>>>>>>>> right now. >>>>>>>>>>> ??? But if >>>>>>>>>>> ???? >>??? I set -pc_factor_mat_solver_type cusparse, it >>>>>>>>>>> seems that the >>>>>>>>>>> ???? >>??? iterative solver is running on GPU. >>>>>>>>>>> ???? >>???? >>>> >>>>>>>>>>> ???? >>???? >>>> Chang >>>>>>>>>>> ???? >>???? >>>> >>>>>>>>>>> ???? >>???? >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>>>>>>>> ???? >>???? >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >>>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? > >>>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>> >>>>>>>>>>> wrote: >>>>>>>>>>> ???? >>???? >>>>>???? Thank you Junchao for explaining this. >>>>>>>>>>> I guess in >>>>>>>>>>> ??? my case >>>>>>>>>>> ???? >>??? the code is >>>>>>>>>>> ???? >>???? >>>>>???? just calling a seq solver like superlu >>>>>>>>>>> to do >>>>>>>>>>> ???? >>??? factorization on GPUs. >>>>>>>>>>> ???? >>???? >>>>>???? My idea is that I want to have a >>>>>>>>>>> traditional MPI >>>>>>>>>>> ??? code to >>>>>>>>>>> ???? >>??? utilize GPUs >>>>>>>>>>> ???? >>???? >>>>>???? with cusparse. Right now cusparse does >>>>>>>>>>> not support >>>>>>>>>>> ??? mpiaij >>>>>>>>>>> ???? >>??? matrix, Sure it does: '-mat_type aijcusparse' will >>>>>>>>>>> give you an >>>>>>>>>>> ???? >>??? mpiaijcusparse matrix with > 1 processes. >>>>>>>>>>> ???? >>???? >>>>> (-mat_type mpiaijcusparse might also work >>>>>>>>>>> with >1 proc). >>>>>>>>>>> ???? >>???? >>>>> However, I see in grepping the repo that >>>>>>>>>>> all the mumps and >>>>>>>>>>> ???? >>??? superlu tests use aij or sell matrix type. >>>>>>>>>>> ???? >>???? >>>>> MUMPS and SuperLU provide their own solves, >>>>>>>>>>> I assume >>>>>>>>>>> ??? .... but >>>>>>>>>>> ???? >>??? you might want to do other matrix operations on >>>>>>>>>>> the GPU. Is >>>>>>>>>>> ??? that the >>>>>>>>>>> ???? >>??? issue? >>>>>>>>>>> ???? >>???? >>>>> Did you try -mat_type aijcusparse with >>>>>>>>>>> MUMPS and/or >>>>>>>>>>> ??? SuperLU >>>>>>>>>>> ???? >>??? have a problem? (no test with it so it probably >>>>>>>>>>> does not work) >>>>>>>>>>> ???? >>???? >>>>> Thanks, >>>>>>>>>>> ???? >>???? >>>>> Mark >>>>>>>>>>> ???? >>???? >>>>>???? so I >>>>>>>>>>> ???? >>???? >>>>>???? want the code to have a mpiaij matrix >>>>>>>>>>> when adding >>>>>>>>>>> ??? all the >>>>>>>>>>> ???? >>??? matrix terms, >>>>>>>>>>> ???? >>???? >>>>>???? and then transform the matrix to seqaij >>>>>>>>>>> when doing the >>>>>>>>>>> ???? >>??? factorization >>>>>>>>>>> ???? >>???? >>>>>???? and >>>>>>>>>>> ???? >>???? >>>>>???? solve. This involves sending the data >>>>>>>>>>> to the master >>>>>>>>>>> ???? >>??? process, and I >>>>>>>>>>> ???? >>???? >>>>>???? think >>>>>>>>>>> ???? >>???? >>>>>???? the petsc mumps solver have something >>>>>>>>>>> similar already. >>>>>>>>>>> ???? >>???? >>>>>???? Chang >>>>>>>>>>> ???? >>???? 
>>>>>???? On 10/13/21 10:18 AM, Junchao Zhang wrote: >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? > On Tue, Oct 12, 2021 at 1:07 PM Mark >>>>>>>>>>> Adams >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> >>>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>> ??? >>>>>>>>>> ??? > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >>>> wrote: >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? >???? On Tue, Oct 12, 2021 at 1:45 PM >>>>>>>>>>> Chang Liu >>>>>>>>>>> ???? >>??? >>>>>>>>>>> >>>>>>>>>> ??? > >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> >>>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>????? >???? >>>>>>>>>> ??? >>>>>>>>>> > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >>>> wrote: >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? >???????? Hi Mark, >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? >???????? The option I use is like >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? >???????? -pc_type bjacobi >>>>>>>>>>> -pc_bjacobi_blocks 16 >>>>>>>>>>> ???? >>??? -ksp_type fgmres >>>>>>>>>>> ???? >>???? >>>>>???? -mat_type >>>>>>>>>>> ???? >>???? >>>>>????? >???????? aijcusparse >>>>>>>>>>> *-sub_pc_factor_mat_solver_type >>>>>>>>>>> ???? >>??? cusparse >>>>>>>>>>> ???? >>???? >>>>>???? *-sub_ksp_type >>>>>>>>>>> ???? >>???? >>>>>????? >???????? preonly *-sub_pc_type lu* >>>>>>>>>>> -ksp_max_it 2000 >>>>>>>>>>> ???? >>??? -ksp_rtol 1.e-300 >>>>>>>>>>> ???? >>???? >>>>>????? >???????? -ksp_atol 1.e-300 >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? >???? Note, If you use -log_view the >>>>>>>>>>> last column >>>>>>>>>>> ??? (rows >>>>>>>>>>> ???? >>??? are the >>>>>>>>>>> ???? >>???? >>>>>???? method like >>>>>>>>>>> ???? >>???? >>>>>????? >???? MatFactorNumeric) has the >>>>>>>>>>> percent of work >>>>>>>>>>> ??? in the GPU. >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? >???? Junchao: *This* implies that we >>>>>>>>>>> have a >>>>>>>>>>> ??? cuSparse LU >>>>>>>>>>> ???? >>???? >>>>>???? factorization. Is >>>>>>>>>>> ???? >>???? >>>>>????? >???? that correct? (I don't think we do) >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? > No, we don't have cuSparse LU >>>>>>>>>>> factorization.???? If you check >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >>>>>>>>>>> ??? find it >>>>>>>>>>> ???? >>??? calls >>>>>>>>>>> ???? >>???? >>>>>????? > MatLUFactorSymbolic_SeqAIJ() instead. >>>>>>>>>>> ???? >>???? >>>>>????? > So I don't understand Chang's idea. >>>>>>>>>>> Do you want to >>>>>>>>>>> ???? >>??? make bigger >>>>>>>>>>> ???? >>???? >>>>>???? blocks? >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? >???????? I think this one do both >>>>>>>>>>> factorization and >>>>>>>>>>> ???? >>??? solve on gpu. >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? >???????? You can check the >>>>>>>>>>> ??? runex72_aijcusparse.sh file >>>>>>>>>>> ???? >>??? in petsc >>>>>>>>>>> ???? >>???? >>>>>???? install >>>>>>>>>>> ???? >>???? >>>>>????? >???????? directory, and try it your >>>>>>>>>>> self (this >>>>>>>>>>> ??? is only lu >>>>>>>>>>> ???? >>???? >>>>>???? 
factorization >>>>>>>>>>> ???? >>???? >>>>>????? >???????? without >>>>>>>>>>> ???? >>???? >>>>>????? >???????? iterative solve). >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? >???????? Chang >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? >???????? On 10/12/21 1:17 PM, Mark >>>>>>>>>>> Adams wrote: >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > On Tue, Oct 12, 2021 at >>>>>>>>>>> 11:19 AM >>>>>>>>>>> ??? Chang Liu >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>>> ??? >>>>>>>>>> > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >>> >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? > >>>>>>>>>>> ??? >>>>>>>>>>> >>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >>>>> wrote: >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? Hi Junchao, >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? No I only needs it to >>>>>>>>>>> be transferred >>>>>>>>>>> ???? >>??? within a >>>>>>>>>>> ???? >>???? >>>>>???? node. I use >>>>>>>>>>> ???? >>???? >>>>>????? >???????? block-Jacobi >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? method and GMRES to >>>>>>>>>>> solve the sparse >>>>>>>>>>> ???? >>??? matrix, so each >>>>>>>>>>> ???? >>???? >>>>>????? >???????? direct solver will >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? take care of a >>>>>>>>>>> sub-block of the >>>>>>>>>>> ??? whole >>>>>>>>>>> ???? >>??? matrix. In this >>>>>>>>>>> ???? >>???? >>>>>????? >???????? way, I can use >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? one >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? GPU to solve one >>>>>>>>>>> sub-block, which is >>>>>>>>>>> ???? >>??? stored within >>>>>>>>>>> ???? >>???? >>>>>???? one node. >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? It was stated in the >>>>>>>>>>> ??? documentation that >>>>>>>>>>> ???? >>??? cusparse >>>>>>>>>>> ???? >>???? >>>>>???? solver >>>>>>>>>>> ???? >>???? >>>>>????? >???????? is slow. >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? However, in my test >>>>>>>>>>> using >>>>>>>>>>> ??? ex72.c, the >>>>>>>>>>> ???? >>??? cusparse >>>>>>>>>>> ???? >>???? >>>>>???? solver is >>>>>>>>>>> ???? >>???? >>>>>????? >???????? faster than >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? mumps or superlu_dist >>>>>>>>>>> on CPUs. >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > Are we talking about the >>>>>>>>>>> ??? factorization, the >>>>>>>>>>> ???? >>??? solve, or >>>>>>>>>>> ???? >>???? >>>>>???? both? >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > We do not have an >>>>>>>>>>> interface to >>>>>>>>>>> ??? cuSparse's LU >>>>>>>>>>> ???? >>???? >>>>>???? factorization (I >>>>>>>>>>> ???? >>???? >>>>>????? >???????? just >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > learned that it exists a >>>>>>>>>>> few weeks ago). >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > Perhaps your fast >>>>>>>>>>> "cusparse solver" is >>>>>>>>>>> ???? >>??? '-pc_type lu >>>>>>>>>>> ???? >>???? >>>>>???? -mat_type >>>>>>>>>>> ???? 
>>???? >>>>>????? >????????? > aijcusparse' ? This would >>>>>>>>>>> be the CPU >>>>>>>>>>> ???? >>??? factorization, >>>>>>>>>>> ???? >>???? >>>>>???? which is the >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > dominant cost. >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? Chang >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? On 10/12/21 10:24 AM, >>>>>>>>>>> Junchao >>>>>>>>>>> ??? Zhang wrote: >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > Hi, Chang, >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? For the mumps >>>>>>>>>>> solver, we >>>>>>>>>>> ??? usually >>>>>>>>>>> ???? >>??? transfers >>>>>>>>>>> ???? >>???? >>>>>???? matrix >>>>>>>>>>> ???? >>???? >>>>>????? >???????? and vector >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? data >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > within a compute >>>>>>>>>>> node.? For >>>>>>>>>>> ??? the idea you >>>>>>>>>>> ???? >>???? >>>>>???? propose, it >>>>>>>>>>> ???? >>???? >>>>>????? >???????? looks like >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? we need >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > to gather data within >>>>>>>>>>> ???? >>??? MPI_COMM_WORLD, right? >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Mark, I >>>>>>>>>>> remember you said >>>>>>>>>>> ???? >>??? cusparse solve is >>>>>>>>>>> ???? >>???? >>>>>???? slow >>>>>>>>>>> ???? >>???? >>>>>????? >???????? and you would >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > rather do it on >>>>>>>>>>> CPU. Is it right? >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > --Junchao Zhang >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > On Mon, Oct 11, >>>>>>>>>>> 2021 at 10:25 PM >>>>>>>>>>> ???? >>??? Chang Liu via >>>>>>>>>>> ???? >>???? >>>>>???? petsc-users >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>>> >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> > >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> > >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>> >>>>>>>>>>> >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> > >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> > >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>>> >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> > >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> > >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>> >>>>>>>>>>> >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> > >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> > >>>>>>>>>>> ???? >>???? >>>>>???? 
>>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >>>>>>>>>> ??? >>>>>> >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? wrote: >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Hi, >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Currently, it >>>>>>>>>>> is possible >>>>>>>>>>> ??? to use >>>>>>>>>>> ???? >>??? mumps >>>>>>>>>>> ???? >>???? >>>>>???? solver in >>>>>>>>>>> ???? >>???? >>>>>????? >???????? PETSC with >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>>> -mat_mumps_use_omp_threads >>>>>>>>>>> ???? >>??? option, so that >>>>>>>>>>> ???? >>???? >>>>>????? >???????? multiple MPI >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? processes will >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? transfer the >>>>>>>>>>> matrix and >>>>>>>>>>> ??? rhs data >>>>>>>>>>> ???? >>??? to the master >>>>>>>>>>> ???? >>???? >>>>>????? >???????? rank, and then >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? master >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? rank will call >>>>>>>>>>> mumps with >>>>>>>>>>> ??? OpenMP >>>>>>>>>>> ???? >>??? to solve >>>>>>>>>>> ???? >>???? >>>>>???? the matrix. >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? I wonder if >>>>>>>>>>> someone can >>>>>>>>>>> ??? develop >>>>>>>>>>> ???? >>??? similar >>>>>>>>>>> ???? >>???? >>>>>???? option for >>>>>>>>>>> ???? >>???? >>>>>????? >???????? cusparse >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? solver. >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Right now, >>>>>>>>>>> this solver >>>>>>>>>>> ??? does not >>>>>>>>>>> ???? >>??? work with >>>>>>>>>>> ???? >>???? >>>>>????? >???????? mpiaijcusparse. I >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? think a >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? possible >>>>>>>>>>> workaround is to >>>>>>>>>>> ???? >>??? transfer all the >>>>>>>>>>> ???? >>???? >>>>>???? matrix >>>>>>>>>>> ???? >>???? >>>>>????? >???????? data to one MPI >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? process, and >>>>>>>>>>> then upload the >>>>>>>>>>> ???? >>??? data to GPU to >>>>>>>>>>> ???? >>???? >>>>>???? solve. >>>>>>>>>>> ???? >>???? >>>>>????? >???????? In this >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? way, one can >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? use cusparse >>>>>>>>>>> solver for a MPI >>>>>>>>>>> ???? >>??? program. >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Chang >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? -- >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Chang Liu >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Staff Research >>>>>>>>>>> Physicist >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? +1 609 243 3438 >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > cliu at pppl.gov >>>>>>>>>>> ??? >>>>>>>>>> > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >>> >>>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>>> ??? >>>>>>>>>> > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >>>> >>>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>>> ??? 
>>>>>>>>>> > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >>> >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? > >>>>>>>>>>> ??? >>>>>>>>>>> >>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >>>>> >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? Princeton >>>>>>>>>>> Plasma Physics >>>>>>>>>>> ??? Laboratory >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? >???? 100 >>>>>>>>>>> Stellarator Rd, >>>>>>>>>>> ??? Princeton NJ >>>>>>>>>>> ???? >>??? 08540, USA >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? -- >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? Chang Liu >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? Staff Research Physicist >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? +1 609 243 3438 >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > cliu at pppl.gov >>>>>>>>>>> >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >>> >>>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? > >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> >>>>>>>>>>> ??? >> >>>>>>>>>>> ???? >>???? >>>>>????? >???????? >>>>>>>>>> ??? >>>>>>>>>> > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >>>> >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? Princeton Plasma >>>>>>>>>>> Physics Laboratory >>>>>>>>>>> ???? >>???? >>>>>????? >????????? >???? 100 Stellarator Rd, >>>>>>>>>>> Princeton NJ >>>>>>>>>>> ??? 08540, USA >>>>>>>>>>> ???? >>???? >>>>>????? >????????? > >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>????? >???????? -- >>>>>>>>>>> ???? >>???? >>>>>????? >???????? Chang Liu >>>>>>>>>>> ???? >>???? >>>>>????? >???????? Staff Research Physicist >>>>>>>>>>> ???? >>???? >>>>>????? >???????? +1 609 243 3438 >>>>>>>>>>> ???? >>???? >>>>>????? > cliu at pppl.gov >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>??? >>>>>>>>>>> ??? >> >>>>>>>>>>> >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? > >>>>>>>>>>> ???? >>???? >>>>>???? >>>>>>>>>> >>>>>>>>>>> ??? >>> >>>>>>>>>>> ???? >>???? >>>>>????? >???????? Princeton Plasma Physics >>>>>>>>>>> Laboratory >>>>>>>>>>> ???? >>???? >>>>>????? >???????? 100 Stellarator Rd, >>>>>>>>>>> Princeton NJ 08540, USA >>>>>>>>>>> ???? >>???? >>>>>????? > >>>>>>>>>>> ???? >>???? >>>>>???? --???? Chang Liu >>>>>>>>>>> ???? >>???? >>>>>???? Staff Research Physicist >>>>>>>>>>> ???? >>???? >>>>>???? +1 609 243 3438 >>>>>>>>>>> ???? >>???? >>>>> cliu at pppl.gov >>>>>>>>>>> ??? > >>>>>>>>>>> >>>>>>>>>> ??? >>>>>>>>>>> ???? >>??? >> >>>>>>>>>>> ???? >>???? >>>>>???? Princeton Plasma Physics Laboratory >>>>>>>>>>> ???? >>???? >>>>>???? 100 Stellarator Rd, Princeton NJ 08540, >>>>>>>>>>> USA >>>>>>>>>>> ???? >>???? >>>> >>>>>>>>>>> ???? >>???? >>>> -- >>>>>>>>>>> ???? >>???? >>>> Chang Liu >>>>>>>>>>> ???? >>???? >>>> Staff Research Physicist >>>>>>>>>>> ???? >>???? >>>> +1 609 243 3438 >>>>>>>>>>> ???? >>???? >>>> cliu at pppl.gov >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>???? >>>> Princeton Plasma Physics Laboratory >>>>>>>>>>> ???? >>???? >>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>>>>> ???? >>???? >> >>>>>>>>>>> ???? >>???? >> -- >>>>>>>>>>> ???? >>???? >> Chang Liu >>>>>>>>>>> ???? >>???? >> Staff Research Physicist >>>>>>>>>>> ???? >>???? 
>> +1 609 243 3438 >>>>>>>>>>> ???? >>???? >> cliu at pppl.gov >>>>>>>>>>> ??? > >>>>>>>>>>> ???? >>???? >> Princeton Plasma Physics Laboratory >>>>>>>>>>> ???? >>???? >> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>>>>> ???? >>???? > >>>>>>>>>>> ???? >>??? --???? Chang Liu >>>>>>>>>>> ???? >>??? Staff Research Physicist >>>>>>>>>>> ???? >>??? +1 609 243 3438 >>>>>>>>>>> ???? >> cliu at pppl.gov >>>>>>>>>>> >>>>>>>>>> ??? > >>>>>>>>>>> ???? >>??? Princeton Plasma Physics Laboratory >>>>>>>>>>> ???? >>??? 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>>>>> ???? > >>>>>>>>>>> ???? > -- >>>>>>>>>>> ???? > Chang Liu >>>>>>>>>>> ???? > Staff Research Physicist >>>>>>>>>>> ???? > +1 609 243 3438 >>>>>>>>>>> ???? > cliu at pppl.gov >>>>>>>>>>> ???? > Princeton Plasma Physics Laboratory >>>>>>>>>>> ???? > 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Chang Liu >>>>>>>>>> Staff Research Physicist >>>>>>>>>> +1 609 243 3438 >>>>>>>>>> cliu at pppl.gov >>>>>>>>>> Princeton Plasma Physics Laboratory >>>>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>> >>>>>>>> -- >>>>>>>> Chang Liu >>>>>>>> Staff Research Physicist >>>>>>>> +1 609 243 3438 >>>>>>>> cliu at pppl.gov >>>>>>>> Princeton Plasma Physics Laboratory >>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>> >>>>>> -- >>>>>> Chang Liu >>>>>> Staff Research Physicist >>>>>> +1 609 243 3438 >>>>>> cliu at pppl.gov >>>>>> Princeton Plasma Physics Laboratory >>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>> >>>> -- >>>> Chang Liu >>>> Staff Research Physicist >>>> +1 609 243 3438 >>>> cliu at pppl.gov >>>> Princeton Plasma Physics Laboratory >>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>> >> > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From wangyijia at lsec.cc.ac.cn Thu Oct 14 22:47:44 2021 From: wangyijia at lsec.cc.ac.cn (=?UTF-8?B?546L5LiA55Sy?=) Date: Fri, 15 Oct 2021 11:47:44 +0800 (GMT+08:00) Subject: [petsc-users] Issue on Block Jacobi Preconditioner Reuse Message-ID: <9c62be9.5da4.17c82101668.Coremail.wangyijia@lsec.cc.ac.cn> Hi!Everyone: Glad to join the mailing list.Recently I 've been working on a program using block jacobi preconditioner for a sequence solve of two linear system A_1x_1=b_1,A_2x_2=b_2. 
From pierre at joliv.et Fri Oct 15 04:29:19 2021
From: pierre at joliv.et (Pierre Jolivet)
Date: Fri, 15 Oct 2021 11:29:19 +0200
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: <879c30a1-ea85-1c24-4139-268925d511da@pppl.gov>
References: <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> <879c30a1-ea85-1c24-4139-268925d511da@pppl.gov>
Message-ID: <53D4EDD7-E05C-4485-B7AE-23AB10DD81B1@joliv.et>

Hi Chang,

The output you sent with MUMPS looks alright to me, you can see that the MatType is properly set to seqaijcusparse (and not mpiaijcusparse).
I don't know what is wrong with -sub_telescope_pc_factor_mat_solver_type cusparse, and I don't have a PETSc installation for testing this; hopefully Barry or Junchao can confirm this wrong behavior and get it fixed.
As for permuting PCTELESCOPE and PCBJACOBI, in your case, the outer PC will be equivalent, yes. However, it would be more efficient to do PCBJACOBI and then PCTELESCOPE. PCBJACOBI prunes the operator by basically removing all coefficients outside of the diagonal blocks. Then, PCTELESCOPE "groups everything together". If you do it the other way around, PCTELESCOPE will "group everything together" and then PCBJACOBI will prune the operator. So the PCTELESCOPE SetUp will be costly for nothing, since some coefficients will be thrown out afterwards in the PCBJACOBI SetUp.
I hope I'm clear enough; otherwise I can try to draw some pictures.

Thanks,
Pierre

> On 15 Oct 2021, at 4:39 AM, Chang Liu wrote:
>
> Hi Pierre and Barry,
>
> I think maybe I should use telescope outside bjacobi, like this
>
> mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type telescope -pc_telescope_reduction_factor 4 -telescope_pc_type bjacobi -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4 -mat_type aijcusparse -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu -telescope_sub_pc_factor_mat_solver_type cusparse -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9
>
> But then I got an error that
>
> [0]PETSC ERROR: MatSolverType cusparse does not support matrix type seqaij
>
> But the mat type should be aijcusparse. I think telescope changes the mat type.
> > Chang > > On 10/14/21 10:11 PM, Chang Liu wrote: >> For comparison, here is the output using mumps instead of cusparse >> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type mumps -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >> 1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 >> 2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 >> 3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 >> 4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 >> 5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 >> 6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 >> 7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 >> 8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03 >> 9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 >> 10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 >> 11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 >> 12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid norm 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 >> 13 KSP unpreconditioned resid norm 1.885892030223e-01 true resid norm 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 >> 14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid norm 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 >> 15 KSP unpreconditioned resid norm 1.638219366731e-01 true resid norm 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 >> 16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid norm 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 >> 17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid norm 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 >> 18 KSP unpreconditioned resid norm 1.289673236836e-01 true resid norm 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 >> 19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid norm 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 >> 20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid norm 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 >> 21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid norm 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 >> 22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid norm 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 >> 23 KSP unpreconditioned resid norm 7.563496650115e-02 true resid norm 
7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 >> 24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid norm 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 >> 25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid norm 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 >> 26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid norm 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 >> 27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid norm 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 >> 28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid norm 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 >> 29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid norm 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 >> 30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid norm 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 >> 31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid norm 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 >> 32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid norm 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 >> 33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid norm 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 >> 34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid norm 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 >> 35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid norm 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 >> 36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid norm 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 >> 37 KSP unpreconditioned resid norm 4.982347757923e-03 true resid norm 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 >> 38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid norm 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 >> 39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid norm 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 >> 40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid norm 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 >> 41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid norm 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 >> 42 KSP unpreconditioned resid norm 2.772928845284e-03 true resid norm 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 >> 43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid norm 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 >> 44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid norm 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 >> 45 KSP unpreconditioned resid norm 2.086800523919e-03 true resid norm 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 >> 46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid norm 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 >> 47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid norm 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 >> 48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid norm 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 >> 49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid norm 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 >> 50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid norm 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 >> 51 KSP unpreconditioned resid norm 1.375985053014e-03 true resid norm 1.375985053014e-03 
||r(i)||/||b|| 3.427134883820e-05 >> 52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid norm 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 >> 53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid norm 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 >> 54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid norm 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 >> 55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid norm 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 >> 56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid norm 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 >> 57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid norm 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 >> 58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid norm 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 >> 59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid norm 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 >> 60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid norm 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 >> 61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid norm 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 >> 62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid norm 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 >> 63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid norm 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 >> 64 KSP unpreconditioned resid norm 2.961034697265e-04 true resid norm 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 >> 65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid norm 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 >> 66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid norm 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 >> 67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid norm 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 >> 68 KSP unpreconditioned resid norm 2.008438265031e-04 true resid norm 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 >> 69 KSP unpreconditioned resid norm 1.838732863386e-04 true resid norm 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 >> 70 KSP unpreconditioned resid norm 1.723786027645e-04 true resid norm 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 >> 71 KSP unpreconditioned resid norm 1.580945192204e-04 true resid norm 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 >> 72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid norm 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 >> 73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid norm 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 >> 74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid norm 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 >> 75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid norm 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 >> 76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid norm 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 >> 77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid norm 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 >> 78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid norm 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 >> 79 KSP unpreconditioned resid norm 8.896901670359e-05 true resid norm 8.896901670365e-05 ||r(i)||/||b|| 
2.215931198209e-06 >> 80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid norm 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 >> 81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid norm 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 >> 82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid norm 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 >> 83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid norm 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 >> 84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid norm 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 >> 85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid norm 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 >> 86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid norm 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 >> 87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid norm 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 >> 88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid norm 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 >> 89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid norm 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 >> 90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid norm 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 >> 91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid norm 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 >> 92 KSP unpreconditioned resid norm 1.707071292484e-05 true resid norm 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 >> 93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid norm 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 >> 94 KSP unpreconditioned resid norm 1.219393542993e-05 true resid norm 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 >> 95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid norm 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 >> 96 KSP unpreconditioned resid norm 9.099659872548e-06 true resid norm 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 >> 97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid norm 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 >> 98 KSP unpreconditioned resid norm 7.167226146744e-06 true resid norm 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 >> 99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid norm 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 >> 100 KSP unpreconditioned resid norm 5.767783600111e-06 true resid norm 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 >> 101 KSP unpreconditioned resid norm 5.261057430584e-06 true resid norm 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 >> 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid norm 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 >> 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid norm 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 >> 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid norm 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 >> 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid norm 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 >> 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid norm 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 >> 107 KSP unpreconditioned resid norm 3.234218880987e-06 true resid norm 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 
>> 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid norm 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 >> 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid norm 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 >> 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid norm 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 >> 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid norm 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 >> 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid norm 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 >> 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid norm 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 >> 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid norm 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 >> 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid norm 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 >> 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid norm 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 >> 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid norm 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 >> 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid norm 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 >> 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid norm 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 >> 120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid norm 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 >> 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid norm 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 >> 122 KSP unpreconditioned resid norm 7.945760150897e-07 true resid norm 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 >> 123 KSP unpreconditioned resid norm 7.141240839013e-07 true resid norm 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 >> 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid norm 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 >> 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid norm 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 >> 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid norm 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 >> 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid norm 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 >> 128 KSP unpreconditioned resid norm 4.253264691112e-07 true resid norm 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 >> 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid norm 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 >> 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid norm 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 >> 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid norm 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 >> 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid norm 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 >> 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid norm 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 >> 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid norm 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 >> 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid norm 2.534825557469e-07 ||r(i)||/||b|| 
6.313432746507e-09 >> 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid norm 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 >> 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid norm 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 >> 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid norm 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 >> 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid norm 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 >> 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid norm 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 >> 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid norm 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 >> 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid norm 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 >> 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid norm 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 >> 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid norm 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 >> 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid norm 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 >> 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid norm 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 >> 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid norm 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 >> 148 KSP unpreconditioned resid norm 5.690132597004e-08 true resid norm 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 >> 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid norm 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 >> 150 KSP unpreconditioned resid norm 4.625371062660e-08 true resid norm 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 >> 151 KSP unpreconditioned resid norm 4.349049084805e-08 true resid norm 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 >> 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid norm 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 >> 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid norm 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 >> 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid norm 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 >> 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid norm 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 >> 156 KSP unpreconditioned resid norm 2.128504065757e-08 true resid norm 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 >> 157 KSP unpreconditioned resid norm 1.879248809429e-08 true resid norm 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 >> 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid norm 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 >> 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid norm 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 >> 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid norm 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 >> 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid norm 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 >> 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid norm 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 >> 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid norm 9.969290432333e-09 
||r(i)||/||b|| 2.483028644297e-10 >> 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid norm 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 >> 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid norm 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 >> 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid norm 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 >> 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid norm 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 >> 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid norm 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 >> 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid norm 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 >> 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid norm 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 >> 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid norm 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 >> 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid norm 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 >> 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid norm 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 >> 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid norm 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 >> 175 KSP unpreconditioned resid norm 3.924177535665e-09 true resid norm 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 >> 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid norm 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 >> 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid norm 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 >> 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid norm 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 >> 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid norm 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 >> 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid norm 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 >> 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid norm 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 >> 182 KSP unpreconditioned resid norm 1.983545827983e-09 true resid norm 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 >> 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid norm 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 >> 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid norm 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 >> 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid norm 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 >> 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid norm 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 >> 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid norm 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 >> 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid norm 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 >> 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid norm 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 >> KSP Object: 16 MPI processes >> type: fgmres >> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement >> happy breakdown tolerance 1e-30 >> maximum iterations=2000, 
initial guess is zero >> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. >> right preconditioning >> using UNPRECONDITIONED norm type for convergence test >> PC Object: 16 MPI processes >> type: bjacobi >> number of blocks = 4 >> Local solver information for first block is in the following KSP and PC objects on rank 0: >> Use -ksp_view ::ascii_info_detail to display information for all blocks >> KSP Object: (sub_) 4 MPI processes >> type: preonly >> maximum iterations=10000, initial guess is zero >> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >> left preconditioning >> using NONE norm type for convergence test >> PC Object: (sub_) 4 MPI processes >> type: telescope >> petsc subcomm: parent comm size reduction factor = 4 >> petsc subcomm: parent_size = 4 , subcomm_size = 1 >> petsc subcomm type = contiguous >> linear system matrix = precond matrix: >> Mat Object: (sub_) 4 MPI processes >> type: mpiaij >> rows=40200, cols=40200 >> total: nonzeros=199996, allocated nonzeros=203412 >> total number of mallocs used during MatSetValues calls=0 >> not using I-node (on process 0) routines >> setup type: default >> Parent DM object: NULL >> Sub DM object: NULL >> KSP Object: (sub_telescope_) 1 MPI processes >> type: preonly >> maximum iterations=10000, initial guess is zero >> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >> left preconditioning >> using NONE norm type for convergence test >> PC Object: (sub_telescope_) 1 MPI processes >> type: lu >> out-of-place factorization >> tolerance for zero pivot 2.22045e-14 >> matrix ordering: external >> factor fill ratio given 0., needed 0. >> Factored matrix follows: >> Mat Object: 1 MPI processes >> type: mumps >> rows=40200, cols=40200 >> package used to perform factorization: mumps >> total: nonzeros=1849788, allocated nonzeros=1849788 >> MUMPS run parameters: >> SYM (matrix type): 0 >> PAR (host participation): 1 >> ICNTL(1) (output for error): 6 >> ICNTL(2) (output of diagnostic msg): 0 >> ICNTL(3) (output for global info): 0 >> ICNTL(4) (level of printing): 0 >> ICNTL(5) (input mat struct): 0 >> ICNTL(6) (matrix prescaling): 7 >> ICNTL(7) (sequential matrix ordering):7 >> ICNTL(8) (scaling strategy): 77 >> ICNTL(10) (max num of refinements): 0 >> ICNTL(11) (error analysis): 0 >> ICNTL(12) (efficiency control): 1 >> ICNTL(13) (sequential factorization of the root node): 0 >> ICNTL(14) (percentage of estimated workspace increase): 20 >> ICNTL(18) (input mat struct): 0 >> ICNTL(19) (Schur complement info): 0 >> ICNTL(20) (RHS sparse pattern): 0 >> ICNTL(21) (solution struct): 0 >> ICNTL(22) (in-core/out-of-core facility): 0 >> ICNTL(23) (max size of memory can be allocated locally):0 >> ICNTL(24) (detection of null pivot rows): 0 >> ICNTL(25) (computation of a null space basis): 0 >> ICNTL(26) (Schur options for RHS or solution): 0 >> ICNTL(27) (blocking size for multiple RHS): -32 >> ICNTL(28) (use parallel or sequential ordering): 1 >> ICNTL(29) (parallel ordering): 0 >> ICNTL(30) (user-specified set of entries in inv(A)): 0 >> ICNTL(31) (factors is discarded in the solve phase): 0 >> ICNTL(33) (compute determinant): 0 >> ICNTL(35) (activate BLR based factorization): 0 >> ICNTL(36) (choice of BLR factorization variant): 0 >> ICNTL(38) (estimated compression rate of LU factors): 333 >> CNTL(1) (relative pivoting threshold): 0.01 >> CNTL(2) (stopping criterion of refinement): 1.49012e-08 >> CNTL(3) (absolute pivoting threshold): 0. >> CNTL(4) (value of static pivoting): -1. 
>> CNTL(5) (fixation for null pivots): 0. >> CNTL(7) (dropping parameter for BLR): 0. >> RINFO(1) (local estimated flops for the elimination after analysis): >> [0] 1.45525e+08 >> RINFO(2) (local estimated flops for the assembly after factorization): >> [0] 2.89397e+06 >> RINFO(3) (local estimated flops for the elimination after factorization): >> [0] 1.45525e+08 >> INFO(15) (estimated size of (in MB) MUMPS internal data for running numerical factorization): >> [0] 29 >> INFO(16) (size of (in MB) MUMPS internal data used during numerical factorization): >> [0] 29 >> INFO(23) (num of pivots eliminated on this processor after factorization): >> [0] 40200 >> RINFOG(1) (global estimated flops for the elimination after analysis): 1.45525e+08 >> RINFOG(2) (global estimated flops for the assembly after factorization): 2.89397e+06 >> RINFOG(3) (global estimated flops for the elimination after factorization): 1.45525e+08 >> (RINFOG(12) RINFOG(13))*2^INFOG(34) (determinant): (0.,0.)*(2^0) >> INFOG(3) (estimated real workspace for factors on all processors after analysis): 1849788 >> INFOG(4) (estimated integer workspace for factors on all processors after analysis): 879986 >> INFOG(5) (estimated maximum front size in the complete tree): 282 >> INFOG(6) (number of nodes in the complete tree): 23709 >> INFOG(7) (ordering option effectively used after analysis): 5 >> INFOG(8) (structural symmetry in percent of the permuted matrix after analysis): 100 >> INFOG(9) (total real/complex workspace to store the matrix factors after factorization): 1849788 >> INFOG(10) (total integer space store the matrix factors after factorization): 879986 >> INFOG(11) (order of largest frontal matrix after factorization): 282 >> INFOG(12) (number of off-diagonal pivots): 0 >> INFOG(13) (number of delayed pivots after factorization): 0 >> INFOG(14) (number of memory compress after factorization): 0 >> INFOG(15) (number of steps of iterative refinement after solution): 0 >> INFOG(16) (estimated size (in MB) of all MUMPS internal data for factorization after analysis: value on the most memory consuming processor): 29 >> INFOG(17) (estimated size of all MUMPS internal data for factorization after analysis: sum over all processors): 29 >> INFOG(18) (size of all MUMPS internal data allocated during factorization: value on the most memory consuming processor): 29 >> INFOG(19) (size of all MUMPS internal data allocated during factorization: sum over all processors): 29 >> INFOG(20) (estimated number of entries in the factors): 1849788 >> INFOG(21) (size in MB of memory effectively used during factorization - value on the most memory consuming processor): 26 >> INFOG(22) (size in MB of memory effectively used during factorization - sum over all processors): 26 >> INFOG(23) (after analysis: value of ICNTL(6) effectively used): 0 >> INFOG(24) (after analysis: value of ICNTL(12) effectively used): 1 >> INFOG(25) (after factorization: number of pivots modified by static pivoting): 0 >> INFOG(28) (after factorization: number of null pivots encountered): 0 >> INFOG(29) (after factorization: effective number of entries in the factors (sum over all processors)): 1849788 >> INFOG(30, 31) (after solution: size in Mbytes of memory used during solution phase): 29, 29 >> INFOG(32) (after analysis: type of analysis done): 1 >> INFOG(33) (value used for ICNTL(8)): 7 >> INFOG(34) (exponent of the determinant if determinant is requested): 0 >> INFOG(35) (after factorization: number of entries taking into account BLR factor compression - sum over 
all processors): 1849788 >> INFOG(36) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - value on the most memory consuming processor): 0 >> INFOG(37) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - sum over all processors): 0 >> INFOG(38) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - value on the most memory consuming processor): 0 >> INFOG(39) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - sum over all processors): 0 >> linear system matrix = precond matrix: >> Mat Object: 1 MPI processes >> type: seqaijcusparse >> rows=40200, cols=40200 >> total: nonzeros=199996, allocated nonzeros=199996 >> total number of mallocs used during MatSetValues calls=0 >> not using I-node routines >> linear system matrix = precond matrix: >> Mat Object: 16 MPI processes >> type: mpiaijcusparse >> rows=160800, cols=160800 >> total: nonzeros=802396, allocated nonzeros=1608000 >> total number of mallocs used during MatSetValues calls=0 >> not using I-node (on process 0) routines >> Norm of error 9.11684e-07 iterations 189 >> Chang >> On 10/14/21 10:10 PM, Chang Liu wrote: >>> Hi Barry, >>> >>> No problem. Here is the output. It seems that the resid norm calculation is incorrect. >>> >>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>> 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>> KSP Object: 16 MPI processes >>> type: fgmres >>> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement >>> happy breakdown tolerance 1e-30 >>> maximum iterations=2000, initial guess is zero >>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. >>> right preconditioning >>> using UNPRECONDITIONED norm type for convergence test >>> PC Object: 16 MPI processes >>> type: bjacobi >>> number of blocks = 4 >>> Local solver information for first block is in the following KSP and PC objects on rank 0: >>> Use -ksp_view ::ascii_info_detail to display information for all blocks >>> KSP Object: (sub_) 4 MPI processes >>> type: preonly >>> maximum iterations=10000, initial guess is zero >>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
>>> left preconditioning >>> using NONE norm type for convergence test >>> PC Object: (sub_) 4 MPI processes >>> type: telescope >>> petsc subcomm: parent comm size reduction factor = 4 >>> petsc subcomm: parent_size = 4 , subcomm_size = 1 >>> petsc subcomm type = contiguous >>> linear system matrix = precond matrix: >>> Mat Object: (sub_) 4 MPI processes >>> type: mpiaij >>> rows=40200, cols=40200 >>> total: nonzeros=199996, allocated nonzeros=203412 >>> total number of mallocs used during MatSetValues calls=0 >>> not using I-node (on process 0) routines >>> setup type: default >>> Parent DM object: NULL >>> Sub DM object: NULL >>> KSP Object: (sub_telescope_) 1 MPI processes >>> type: preonly >>> maximum iterations=10000, initial guess is zero >>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>> left preconditioning >>> using NONE norm type for convergence test >>> PC Object: (sub_telescope_) 1 MPI processes >>> type: lu >>> out-of-place factorization >>> tolerance for zero pivot 2.22045e-14 >>> matrix ordering: nd >>> factor fill ratio given 5., needed 8.62558 >>> Factored matrix follows: >>> Mat Object: 1 MPI processes >>> type: seqaijcusparse >>> rows=40200, cols=40200 >>> package used to perform factorization: cusparse >>> total: nonzeros=1725082, allocated nonzeros=1725082 >>> not using I-node routines >>> linear system matrix = precond matrix: >>> Mat Object: 1 MPI processes >>> type: seqaijcusparse >>> rows=40200, cols=40200 >>> total: nonzeros=199996, allocated nonzeros=199996 >>> total number of mallocs used during MatSetValues calls=0 >>> not using I-node routines >>> linear system matrix = precond matrix: >>> Mat Object: 16 MPI processes >>> type: mpiaijcusparse >>> rows=160800, cols=160800 >>> total: nonzeros=802396, allocated nonzeros=1608000 >>> total number of mallocs used during MatSetValues calls=0 >>> not using I-node (on process 0) routines >>> Norm of error 400.999 iterations 1 >>> >>> Chang >>> >>> >>> On 10/14/21 9:47 PM, Barry Smith wrote: >>>> >>>> Chang, >>>> >>>> Sorry I did not notice that one. Please run that with -ksp_view -ksp_monitor_true_residual so we can see exactly how options are interpreted and solver used. At a glance it looks ok but something must be wrong to get the wrong answer. >>>> >>>> Barry >>>> >>>>> On Oct 14, 2021, at 6:02 PM, Chang Liu wrote: >>>>> >>>>> Hi Barry, >>>>> >>>>> That is exactly what I was doing in the second example, in which the preconditioner works but the GMRES does not. >>>>> >>>>> Chang >>>>> >>>>> On 10/14/21 5:15 PM, Barry Smith wrote: >>>>>> You need to use the PCTELESCOPE inside the block Jacobi, not outside it. So something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu >>>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: >>>>>>> >>>>>>> Hi Pierre, >>>>>>> >>>>>>> I wonder if the trick of PCTELESCOPE only works for preconditioner and not for the solver. I have done some tests, and find that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work. >>>>>>> >>>>>>> The command line options I used for small matrix is like >>>>>>> >>>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4 >>>>>>> >>>>>>> which gives the correct output. 
For iterative solver, I tried >>>>>>> >>>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20 >>>>>>> >>>>>>> for large matrix. The output is like >>>>>>> >>>>>>> 0 KSP Residual norm 40.1497 >>>>>>> 1 KSP Residual norm < 1.e-11 >>>>>>> Norm of error 400.999 iterations 1 >>>>>>> >>>>>>> So it seems to call a direct solver instead of an iterative one. >>>>>>> >>>>>>> Can you please help check these options? >>>>>>> >>>>>>> Chang >>>>>>> >>>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote: >>>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote: >>>>>>>>> >>>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaircusparse? Or I have to do it manually? >>>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). >>>>>>>> 1) I?m not sure this is implemented for cuSparse matrices, but it should be; >>>>>>>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning. >>>>>>>> If you try this out and this does not work, please provide the backtrace (probably something like ?Operation XYZ not implemented for MatType ABC?), and hopefully someone can add the missing plumbing. >>>>>>>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve. >>>>>>>> Thanks, >>>>>>>> Pierre >>>>>>>>> Chang >>>>>>>>> >>>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >>>>>>>>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block? >>>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu >>>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads because not only do the Mat needs to be redistributed, the secondary processes also need to be ?converted? to OpenMP threads. >>>>>>>>>> Thus the need for specific code in mumps.c. >>>>>>>>>> Thanks, >>>>>>>>>> Pierre >>>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Junchao, >>>>>>>>>>> >>>>>>>>>>> Yes that is what I want. >>>>>>>>>>> >>>>>>>>>>> Chang >>>>>>>>>>> >>>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: >>>>>>>>>>>> Junchao, >>>>>>>>>>>> If I understand correctly Chang is using the block Jacobi >>>>>>>>>>>> method with a single block for a number of MPI ranks and a direct >>>>>>>>>>>> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which >>>>>>>>>>>> is code Hong Zhang wrote a number of years ago for CPUs. For their >>>>>>>>>>>> particular problems this preconditioner works well, but using an >>>>>>>>>>>> iterative solver on the blocks does not work well. 
>>>>>>>>>>>> If we had complete MPI-GPU direct solvers he could just use >>>>>>>>>>>> the current code with MPIAIJCUSPARSE on each block but since we do >>>>>>>>>>>> not he would like to use a single GPU for each block, this means >>>>>>>>>>>> that diagonal blocks of the global parallel MPI matrix needs to be >>>>>>>>>>>> sent to a subset of the GPUs (one GPU per block, which has multiple >>>>>>>>>>>> MPI ranks associated with the blocks). Similarly for the triangular >>>>>>>>>>>> solves the blocks of the right hand side needs to be shipped to the >>>>>>>>>>>> appropriate GPU and the resulting solution shipped back to the >>>>>>>>>>>> multiple GPUs. So Chang is absolutely correct, this is somewhat like >>>>>>>>>>>> your code for MUMPS with OpenMP. OK, I now understand the background.. >>>>>>>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the >>>>>>>>>>>> MPI ranks and then shrink each block down to a single GPU but this >>>>>>>>>>>> would be pretty inefficient, ideally one would go directly from the >>>>>>>>>>>> big MPI matrix on all the GPUs to the sub matrices on the subset of >>>>>>>>>>>> GPUs. But this may be a large coding project. >>>>>>>>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. >>>>>>>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. >>>>>>>>>>>> Barry >>>>>>>>>>>> Since the matrices being factored and solved directly are relatively >>>>>>>>>>>> large it is possible that the cusparse code could be reasonably >>>>>>>>>>>> efficient (they are not the tiny problems one gets at the coarse >>>>>>>>>>>> level of multigrid). Of course, this is speculation, I don't >>>>>>>>>>>> actually know how much better the cusparse code would be on the >>>>>>>>>>>> direct solver than a good CPU direct sparse solver. >>>>>>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>>>>>>> > wrote: >>>>>>>>>>>> > >>>>>>>>>>>> > Sorry I am not familiar with the details either. Can you please >>>>>>>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>>>>>>>> > >>>>>>>>>>>> > Chang >>>>>>>>>>>> > >>>>>>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>>>>>>>> >> Hi Chang, >>>>>>>>>>>> >> I did the work in mumps. It is easy for me to understand >>>>>>>>>>>> gathering matrix rows to one process. >>>>>>>>>>>> >> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that? >>>>>>>>>>>> >> Thanks >>>>>>>>>>>> >> --Junchao Zhang >>>>>>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users >>>>>>>>>>>> >>>>>>>>>>>> >> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >> Hi Barry, >>>>>>>>>>>> >> I think mumps solver in petsc does support that. You can >>>>>>>>>>>> check the >>>>>>>>>>>> >> documentation on "-mat_mumps_use_omp_threads" at >>>>>>>>>>>> >> >>>>>>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>>>>>>>> >>>>>>>>>>>> >> >>>>>>>>>>> > >>>>>>>>>>>> >> and the code enclosed by #if >>>>>>>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>>>>>>>> >> functions MatMumpsSetUpDistRHSInfo and >>>>>>>>>>>> >> MatMumpsGatherNonzerosOnMaster in >>>>>>>>>>>> >> mumps.c >>>>>>>>>>>> >> 1. 
I understand it is ideal to do one MPI rank per GPU. >>>>>>>>>>>> However, I am >>>>>>>>>>>> >> working on an existing code that was developed based on MPI >>>>>>>>>>>> and the the >>>>>>>>>>>> >> # of mpi ranks is typically equal to # of cpu cores. We don't >>>>>>>>>>>> want to >>>>>>>>>>>> >> change the whole structure of the code. >>>>>>>>>>>> >> 2. What you have suggested has been coded in mumps.c. See >>>>>>>>>>>> function >>>>>>>>>>>> >> MatMumpsSetUpDistRHSInfo. >>>>>>>>>>>> >> Regards, >>>>>>>>>>>> >> Chang >>>>>>>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>>>>>>>> >>>>>>>>>>>> >> >> wrote: >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> Hi Barry, >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> That is exactly what I want. >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> Back to my original question, I am looking for an approach to >>>>>>>>>>>> >> transfer >>>>>>>>>>>> >> >> matrix >>>>>>>>>>>> >> >> data from many MPI processes to "master" MPI >>>>>>>>>>>> >> >> processes, each of which taking care of one GPU, and then >>>>>>>>>>>> upload >>>>>>>>>>>> >> the data to GPU to >>>>>>>>>>>> >> >> solve. >>>>>>>>>>>> >> >> One can just grab some codes from mumps.c to >>>>>>>>>>>> aijcusparse.cu >>>>>>>>>>>> >> >. >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > mumps.c doesn't actually do that. It never needs to >>>>>>>>>>>> copy the >>>>>>>>>>>> >> entire matrix to a single MPI rank. >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > It would be possible to write such a code that you >>>>>>>>>>>> suggest but >>>>>>>>>>>> >> it is not clear that it makes sense >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > 1) For normal PETSc GPU usage there is one GPU per MPI >>>>>>>>>>>> rank, so >>>>>>>>>>>> >> while your one GPU per big domain is solving its systems the >>>>>>>>>>>> other >>>>>>>>>>>> >> GPUs (with the other MPI ranks that share that domain) are doing >>>>>>>>>>>> >> nothing. >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > 2) For each triangular solve you would have to gather the >>>>>>>>>>>> right >>>>>>>>>>>> >> hand side from the multiple ranks to the single GPU to pass it to >>>>>>>>>>>> >> the GPU solver and then scatter the resulting solution back >>>>>>>>>>>> to all >>>>>>>>>>>> >> of its subdomain ranks. >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > What I was suggesting was assign an entire subdomain to a >>>>>>>>>>>> >> single MPI rank, thus it does everything on one GPU and can >>>>>>>>>>>> use the >>>>>>>>>>>> >> GPU solver directly. If all the major computations of a subdomain >>>>>>>>>>>> >> can fit and be done on a single GPU then you would be >>>>>>>>>>>> utilizing all >>>>>>>>>>>> >> the GPUs you are using effectively. >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > Barry >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> Chang >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>>>>>>>>> >> >>> Chang, >>>>>>>>>>>> >> >>> You are correct there is no MPI + GPU direct >>>>>>>>>>>> solvers that >>>>>>>>>>>> >> currently do the triangular solves with MPI + GPU parallelism >>>>>>>>>>>> that I >>>>>>>>>>>> >> am aware of. You are limited that individual triangular solves be >>>>>>>>>>>> >> done on a single GPU. I can only suggest making each subdomain as >>>>>>>>>>>> >> big as possible to utilize each GPU as much as possible for the >>>>>>>>>>>> >> direct triangular solves. 
>>>>>>>>>>>> >> >>> Barry >>>>>>>>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >> >>>> >>>>>>>>>>>> >> >>>> Hi Mark, >>>>>>>>>>>> >> >>>> >>>>>>>>>>>> >> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with >>>>>>>>>>>> other >>>>>>>>>>>> >> solvers, but with -pc_factor_mat_solver_type cusparse, it >>>>>>>>>>>> will give >>>>>>>>>>>> >> an error. >>>>>>>>>>>> >> >>>> >>>>>>>>>>>> >> >>>> Yes what I want is to have mumps or superlu to do the >>>>>>>>>>>> >> factorization, and then do the rest, including GMRES solver, >>>>>>>>>>>> on gpu. >>>>>>>>>>>> >> Is that possible? >>>>>>>>>>>> >> >>>> >>>>>>>>>>>> >> >>>> I have tried to use aijcusparse with superlu_dist, it >>>>>>>>>>>> runs but >>>>>>>>>>>> >> the iterative solver is still running on CPUs. I have >>>>>>>>>>>> contacted the >>>>>>>>>>>> >> superlu group and they confirmed that is the case right now. >>>>>>>>>>>> But if >>>>>>>>>>>> >> I set -pc_factor_mat_solver_type cusparse, it seems that the >>>>>>>>>>>> >> iterative solver is running on GPU. >>>>>>>>>>>> >> >>>> >>>>>>>>>>>> >> >>>> Chang >>>>>>>>>>>> >> >>>> >>>>>>>>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>>>>>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >>>>>>>>>>>> >>>>>>>>>>>> >> > >>>>>>>>>>>> >>>>>>>>>>>> >> >>> wrote: >>>>>>>>>>>> >> >>>>> Thank you Junchao for explaining this. I guess in >>>>>>>>>>>> my case >>>>>>>>>>>> >> the code is >>>>>>>>>>>> >> >>>>> just calling a seq solver like superlu to do >>>>>>>>>>>> >> factorization on GPUs. >>>>>>>>>>>> >> >>>>> My idea is that I want to have a traditional MPI >>>>>>>>>>>> code to >>>>>>>>>>>> >> utilize GPUs >>>>>>>>>>>> >> >>>>> with cusparse. Right now cusparse does not support >>>>>>>>>>>> mpiaij >>>>>>>>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' will give you an >>>>>>>>>>>> >> mpiaijcusparse matrix with > 1 processes. >>>>>>>>>>>> >> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). >>>>>>>>>>>> >> >>>>> However, I see in grepping the repo that all the mumps and >>>>>>>>>>>> >> superlu tests use aij or sell matrix type. >>>>>>>>>>>> >> >>>>> MUMPS and SuperLU provide their own solves, I assume >>>>>>>>>>>> .... but >>>>>>>>>>>> >> you might want to do other matrix operations on the GPU. Is >>>>>>>>>>>> that the >>>>>>>>>>>> >> issue? >>>>>>>>>>>> >> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or >>>>>>>>>>>> SuperLU >>>>>>>>>>>> >> have a problem? (no test with it so it probably does not work) >>>>>>>>>>>> >> >>>>> Thanks, >>>>>>>>>>>> >> >>>>> Mark >>>>>>>>>>>> >> >>>>> so I >>>>>>>>>>>> >> >>>>> want the code to have a mpiaij matrix when adding >>>>>>>>>>>> all the >>>>>>>>>>>> >> matrix terms, >>>>>>>>>>>> >> >>>>> and then transform the matrix to seqaij when doing the >>>>>>>>>>>> >> factorization >>>>>>>>>>>> >> >>>>> and >>>>>>>>>>>> >> >>>>> solve. This involves sending the data to the master >>>>>>>>>>>> >> process, and I >>>>>>>>>>>> >> >>>>> think >>>>>>>>>>>> >> >>>>> the petsc mumps solver have something similar already. 
> Chang
>
> On 10/13/21 10:18 AM, Junchao Zhang wrote:
> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams wrote:
> > > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu wrote:
> > > > Hi Mark,
> > > >
> > > > The option I use is like
> > > >
> > > > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type aijcusparse *-sub_pc_factor_mat_solver_type cusparse* *-sub_ksp_type preonly* *-sub_pc_type lu* -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300
> > >
> > > Note, If you use -log_view the last column (rows are the method like MatFactorNumeric) has the percent of work in the GPU.
> > >
> > > Junchao: *This* implies that we have a cuSparse LU factorization. Is that correct? (I don't think we do)
> >
> > No, we don't have cuSparse LU factorization. If you check MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls MatLUFactorSymbolic_SeqAIJ() instead.
> > So I don't understand Chang's idea. Do you want to make bigger blocks?
> >
> > > > I think this one do both factorization and solve on gpu.
> > > > You can check the runex72_aijcusparse.sh file in petsc install directory, and try it your self (this is only lu factorization without iterative solve).
> > > >
> > > > Chang
> > > >
> > > > On 10/12/21 1:17 PM, Mark Adams wrote:
> > > > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu wrote:
> > > > > > Hi Junchao,
> > > > > >
> > > > > > No I only needs it to be transferred within a node. I use block-Jacobi method and GMRES to solve the sparse matrix, so each direct solver will take care of a sub-block of the whole matrix. In this way, I can use one GPU to solve one sub-block, which is stored within one node.
> > > > > >
> > > > > > It was stated in the documentation that cusparse solver is slow. However, in my test using ex72.c, the cusparse solver is faster than mumps or superlu_dist on CPUs.
> > > > >
> > > > > Are we talking about the factorization, the solve, or both?
> > > > >
> > > > > We do not have an interface to cuSparse's LU factorization (I just learned that it exists a few weeks ago).
> > > > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse' ? This would be the CPU factorization, which is the dominant cost.
> > > > >
> > > > > > Chang
> > > > > >
> > > > > > On 10/12/21 10:24 AM, Junchao Zhang wrote:
> > > > > > > Hi, Chang,
> > > > > > > For the mumps solver, we usually transfers matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right?
> > > > > > >
> > > > > > > Mark, I remember you said cusparse solve is slow and you would rather do it on CPU. Is it right?
> > > > > > >
> > > > > > > --Junchao Zhang
> > > > > > >
> > > > > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Currently, it is possible to use mumps solver in PETSC with -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then master rank will call mumps with OpenMP to solve the matrix.
> > > > > > > >
> > > > > > > > I wonder if someone can develop similar option for cusparse solver. Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to GPU to solve. In this way, one can use cusparse solver for a MPI program.
> > > > > > > >
> > > > > > > > Chang
> > > > > > > > --
> > > > > > > > Chang Liu
> > > > > > > > Staff Research Physicist
> > > > > > > > +1 609 243 3438
> > > > > > > > cliu at pppl.gov
> > > > > > > > Princeton Plasma Physics Laboratory
> > > > > > > > 100 Stellarator Rd, Princeton NJ 08540, USA
>
> --
> Chang Liu
> Staff Research Physicist
> +1 609
From gsabhishek1ags at gmail.com  Fri Oct 15 05:11:20 2021
From: gsabhishek1ags at gmail.com (Abhishek G.S.)
Date: Fri, 15 Oct 2021 15:41:20 +0530
Subject: [petsc-users] VecView DMDA and HDF5 - Unable to write out files properly
In-Reply-To: 
References: 
Message-ID: 

I finally was able to output HDF5 files that read in Paraview. The following
functions make this possible. Here I create the domain using DMDACreate3d
(for 2D I just use NZ=1 - you'll just end up with a thickness during
visualization - set your DELTAZ accordingly).

void write_xdmf_header(const std::string &filename){
    int rank;
    MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
    if(rank==0){
        std::size_t name_split = filename.find_last_of(".");
        std::string xdmf_name = filename.substr(0,name_split) + ".xdmf";
        std::ofstream fp(xdmf_name,std::ios::out);
        fp << "\n";
        fp << "\n";
        fp << " \n\n";
        fp << " \n";
        fp << " \n";
        fp << " \n";
        fp << " "<<"\n";
        fp << " "<<0<<" "<<0<<" "<<0<<" \n";
        fp << " "<\n";
        fp << " \n\n";
        fp.close();
    }
}

template void Field::write_xdmf(const std::string &filename){
    int rank;
    MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
    if(rank==0){
        std::size_t name_split = filename.find_last_of(".");
        std::string xdmf_name = filename.substr(0,name_split) + ".xdmf";
        std::ofstream fp(xdmf_name,std::ios::app);
        fp << " \n";
        fp << " \n";
        fp << " "<\n";
        fp << " \n\n";
        fp.close();
    }
}

void write_xdmf_footer(const std::string &filename){
    int rank;
    MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
    if(rank==0){
        std::size_t name_split = filename.find_last_of(".");
        std::string xdmf_name = filename.substr(0,name_split) + ".xdmf";
        std::ofstream fp(xdmf_name,std::ios::app);
        fp << " \n";
        fp << " \n";
        fp << "\n";
        fp.close();
    }
}

//For HDF file writing
template void Field::write_to_file(const std::string &filename){
    PetscViewer viewer;
    PetscViewerHDF5Open(PETSC_COMM_WORLD,filename.c_str(),FILE_MODE_APPEND,&viewer);
    PetscViewerHDF5SetBaseDimension2(viewer, PETSC_FALSE);
    VecView(global_vec, viewer);
    PetscViewerDestroy(&viewer);
    write_xdmf(filename);
}

Thanks to Matteo (most of the above are from his scripts).

On Thu, 14 Oct 2021 at 21:28, Abhishek G.S. wrote:

> Thanks, Matthew for the clarification/ suggestion.
> Thanks, Matteo for the scripts, I'll give this a try and get back with an > update > > On Thu, 14 Oct 2021 at 19:57, Matthew Knepley wrote: > >> On Thu, Oct 14, 2021 at 9:21 AM Matteo Semplice < >> matteo.semplice at uninsubria.it> wrote: >> >>> >>> Il 14/10/21 14:37, Matthew Knepley ha scritto: >>> >>> On Wed, Oct 13, 2021 at 6:30 PM Abhishek G.S. >>> wrote: >>> >>>> Hi, >>>> I need some help with getting the file output working right. >>>> >>>> I am using a DMDACreate3D to initialize my DM. This is my write >>>> function >>>> >>>> void write(){ >>>> PetscViewer viewer; >>>> >>>> PetscViewerHDF5Open(PETSC_COMM_WORLD,filename.c_str(),FILE_MODE_WRITE,&viewer); >>>> DMDAVecRestoreArray(dm,global_vector,global_array) >>>> VecView(global_vec, viewer); >>>> DMDAVecGetArray(dm,global_vector,global_array); >>>> PetscViewerDestroy(&viewer); >>>> } >>>> >>>> 1) I have 2 PDE's to solve. Still, I went ahead creating a single DM >>>> with dof=1 and creating two vectors using the DMCreateGlobalVector(). I >>>> want to write the file out periodically. Should I perform >>>> DMDAVecRestoreArray and DMDAVecGetArray every time is write out the >>>> global_vector? (I know that it is just indexing the pointers and there is >>>> no copying of values. But I am not sure) >>>> >>> >>> I don't think you need the Get/RestoreArray() calls here. >>> >>> >>>> 2) I am writing out to HDF5 format. I see that the vecview is supposed >>>> to reorder the global_vector based on the DM. However, when I read the H5 >>>> files, I get an error on ViSIT and my output image becomes a 1D image >>>> rather than a 2D/3D. What might be the reason for this ?. >>>> Error Msg : "In domain 0, your zonal variable "avtGhostZones" has 25600 >>>> values, but it should have 160. Some values were removed to ensure VisIt >>>> runs smoothly" >>>> I was using a 160x160x1 DM >>>> >>> >>> I do not believe we support HDF5 <--> Visit/Paraview for DMDA. The >>> VecView() is just writing out the vector as a linear array without mesh >>> details. For >>> interfacing with the visualization, I think we use .vtu files. You >>> should be able to get this effect using >>> >>> VecViewFromOptions(global_vec, NULL, "-vec_view"); >>> >>> in your code, and then >>> >>> -vec_view vtk:sol.vtu >>> >>> on the command line. >>> >>> Hi. >>> >>> If you want to stick with HDF5, you can also write a XDMF file with the >>> grid information and open that in Paraview. >>> >>> I am attaching some routines that I have written to do that in a solver >>> that deals with a time dependent PDE system with 2 variables; with them I >>> end up with a single XDMF file that Paraview can load and which contains >>> references to all timesteps in my simulations, with each timestep being >>> contained in an HDF5 file on its own. The idea is to call writeDomain at >>> the beginning of the simulation, writeHDF5 for each timestep that I want to >>> save and writeSimulationXDMF at the end. (Warning: 3D is in use, while 2D >>> ia almost untested...) >>> >>> It's not the optimal solution since (1) all timesteps could be in the >>> same HDF5 and (2) in each HDF5 i write the vectors separately and it would >>> be better to dump the entire data in one go and interpret them as a >>> Nx*Ny*Nz*Nvariables data from the XDMF. Nevertheless they might be a >>> starting point for you if you wan to try this approach. >>> >>> You can have HDF5 put the vectors in a single array with a time >> dimension now. Then you just alter the xdmf to point into that array. I do >> this >> with the unstructured code. 
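A minimal sketch of that single-file layout from the PETSc side, assuming a
build with HDF5 and a PETSc version that provides
PetscViewerHDF5PushTimestepping(); the file name sol.h5, the Vec u, its
dataset name and nsteps are all placeholders, and error checking is omitted:

    PetscViewer viewer;
    PetscViewerHDF5Open(PETSC_COMM_WORLD, "sol.h5", FILE_MODE_WRITE, &viewer);
    PetscObjectSetName((PetscObject)u, "u");   /* dataset name inside the HDF5 file */
    PetscViewerHDF5PushTimestepping(viewer);   /* subsequent VecViews are indexed by timestep */
    for (PetscInt step = 0; step < nsteps; ++step) {
      /* ... advance the solution u ... */
      VecView(u, viewer);                      /* appends u along the leading time dimension */
      PetscViewerHDF5IncrementTimestep(viewer);
    }
    PetscViewerHDF5PopTimestepping(viewer);
    PetscViewerDestroy(&viewer);

The XDMF would then point into that single "u" array (e.g. with hyperslabs
over the time dimension) rather than into one dataset per step.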
>> >> Thanks, >> >> Matt >> >>> Matteo >>> >>> >>> >>>> 3) I tried using the "petsc_gen_xdmf.py" to generate the xdmf files for >>>> use in Paraview. Here the key ["viz/geometry"] is missing. The keys present >>>> in the output H5 file are just the two vectors I am writing and has no info >>>> about mesh. Isn't this supposed to come automatically since the vector is >>>> attached to the DM? How do I sort this out? >>>> >>> >>> This support is for unstructured grids, DMPlex and DMForest. >>> >>> >>>> 4) Can I have multiple vectors attached to the DM by >>>> DMCreateGlobalVector() even though I created the DMDA using dof=1. >>>> >>> >>> Yes. >>> >>> Thanks, >>> >>> Matt >>> >>> >>>> thanks, >>>> Abhishek >>>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >>> >>> -- >>> --- >>> Professore Associato in Analisi Numerica >>> Dipartimento di Scienza e Alta Tecnologia >>> Universit? degli Studi dell'Insubria >>> Via Valleggio, 11 - Como >>> >>> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Fri Oct 15 05:59:13 2021 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 15 Oct 2021 06:59:13 -0400 Subject: [petsc-users] Issue on Block Jacobi Preconditioner Reuse In-Reply-To: <9c62be9.5da4.17c82101668.Coremail.wangyijia@lsec.cc.ac.cn> References: <9c62be9.5da4.17c82101668.Coremail.wangyijia@lsec.cc.ac.cn> Message-ID: On Thu, Oct 14, 2021 at 11:48 PM ??? via petsc-users < petsc-users at mcs.anl.gov> wrote: > Hi!Everyone: > > Glad to join the mailing list.Recently I 've been working on a program > using block jacobi preconditioner for a sequence solve of two linear system > A_1x_1=b_1,A_2x_2=b_2. > > Since A1 and A2 have same nonzero pattern and their element values > are quite close, we hope to reuse the preconditoners constructed when > solving A_1x_1=b_1 , > > however after calling KSPSetReusePreconditioner, though the iteration > number of second ksp solve is very small but the time used did not decrease > much, these are my code: > We can use some diagnostic output to help see what is going on. Can you run with -ksp_view -ksp_monitor_true_residual -info :pc -log_view and send all the output? 
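For reference, such a run might look like the line below, where the
executable name ./app and the number of ranks are placeholders:

    mpiexec -n 4 ./app -ksp_view -ksp_monitor_true_residual -info :pc -log_view > run.log 2>&1

The -log_view summary is printed at PetscFinalize(), so it appears at the
very end of the output.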
Thanks, Matt > ierr = KSPSetUp(ksp);CHKERRQ(ierr); > ierr = PCBJacobiGetSubKSP(pc, &num_local, &idx_first_local, > &subksp);CHKERRQ(ierr); > > > for (i=0; i { > ierr = KSPGetPC(subksp[i], &subpc);CHKERRQ(ierr); > if (i==0) > { > ierr = > KSPSetType(subksp[i],"gmres");CHKERRQ(ierr); > ierr = PCSetType(subpc, > PCLU);CHKERRQ(ierr); > ierr = PCFactorSetMatSolverType(subpc, > "mkl_pardiso");CHKERRQ(ierr); > ierr = KSPSetOptionsPrefix(subksp[i], > "Blk0_");CHKERRQ(ierr); > ierr = > KSPSetReusePreconditioner(subksp[i],PETSC_TRUE);CHKERRQ(ierr); > ierr = > KSPSetFromOptions(subksp[i]);CHKERRQ(ierr); > } > if (i==1) > { > ierr = > KSPSetType(subksp[i],"gmres");CHKERRQ(ierr); > ierr = PCSetType(subpc, > PCHYPRE);CHKERRQ(ierr); > ierr = PCHYPRESetType(pc, > "boomeramg");CHKERRQ(ierr); > ierr = > KSPSetReusePreconditioner(subksp[i],PETSC_TRUE);CHKERRQ(ierr); > ierr = KSPSetOptionsPrefix(subksp[i], > "Blk1_");CHKERRQ(ierr); > } > }//nlocal > }//isbjacobi > > > //Solve the Linear System > ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr); > //Solve the Linear System > t0 = MPI_Wtime(); > ierr = KSPSolve(ksp, b1, x1);CHKERRQ(ierr); > t0 = MPI_Wtime()-t0; > ierr = PetscPrintf(PETSC_COMM_SELF,"First KSP Solve time: %g > s\n",t0);CHKERRQ(ierr); > //Preconditioner Reuse > ierr = KSPSetReusePreconditioner(ksp, PETSC_TRUE);CHKERRQ(ierr); > ierr = KSPSetOperators(ksp,A2,A1);CHKERRQ(ierr); > t0 = MPI_Wtime(); > ierr = KSPSolve(ksp,b2,x2);CHKERRQ(ierr); > t0 = MPI_Wtime()-t0; > ierr = PetscPrintf(PETSC_COMM_SELF,"Second KSP Solve time: %g > s\n",t0);CHKERRQ(ierr); > > > The total block number is 2, and the first block is solved using direct > method and the second block is solved using hypre's boomeramg, so I hope to > reuse the factor of the first block and the set up phase of boomeramg as > preconditioner of the next solve, is there anything wrong with the reuse > code? > > > Best Wishes > > > WANG Yijia > > > 2021/10/15 > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.seize at onera.fr Fri Oct 15 06:08:00 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Fri, 15 Oct 2021 13:08:00 +0200 Subject: [petsc-users] Periodic boundary conditions in DMPlex Message-ID: <9a6f1a40-142e-f2a5-2101-2b60074b705e@onera.fr> Hi, I'm writing a code using PETSc to solve NS equations with FV on an unstructured mesh. Therefore I use DMPlex. Regarding periodicity, I manage to implement it this way: ? - for each couple of boundaries that is linked with periodicity, I create a buffer vector with an ISLocalToGlobalMapping ? - then, when I need to fill the ghost cells corresponding to the periodicity, the i "true" cell of the local vector fills the buffer vector on location i with VecSetValuesBlockedLocal, then VecAssemblyBegin/VecAssemblyEnd ensure each value is send to the correct location thanks to the mapping, then the i "ghost" cell of the local vector reads the vector on location i to get it's value. It works, but it seems to me there is a better way, with maybe PetscSF, VecScatter, or something I don't know yet. Does anyone have any advice ? 
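For reference, a minimal sketch of the buffer-vector scheme described above,
for one couple of periodic boundaries; every name here (nDonor, nGhost,
donorToGhostSlot, donorCell, ghostCell, locArray, Nc) is a placeholder for
the corresponding piece of the actual code, and error checking is omitted:

    Vec                    buf;                   /* one block per periodic ghost cell owned locally      */
    ISLocalToGlobalMapping l2g;                   /* local donor slot -> global buffer slot of its image  */
    PetscInt               nDonor, nGhost;        /* locally owned donor ("true") / periodic ghost cells  */
    PetscInt              *donorToGhostSlot;      /* for each local donor, global slot read by its ghost  */
    const PetscInt        *donorCell, *ghostCell; /* local cell ids                                       */
    PetscScalar           *locArray;              /* raw array of the ghosted local vector, Nc dofs/cell  */
    PetscInt               bs = Nc;

    VecCreateMPI(PETSC_COMM_WORLD, nGhost*bs, PETSC_DETERMINE, &buf);
    VecSetBlockSize(buf, bs);
    ISLocalToGlobalMappingCreate(PETSC_COMM_WORLD, bs, nDonor, donorToGhostSlot, PETSC_COPY_VALUES, &l2g);
    VecSetLocalToGlobalMapping(buf, l2g);

    /* scatter: donor cell d writes its state into the slot that its periodic image will read */
    for (PetscInt d = 0; d < nDonor; ++d)
      VecSetValuesBlockedLocal(buf, 1, &d, &locArray[donorCell[d]*bs], INSERT_VALUES);
    VecAssemblyBegin(buf);
    VecAssemblyEnd(buf);

    /* gather: ghost cell g copies its locally owned buffer slot g back into the local vector */
    {
      const PetscScalar *bufArray;
      VecGetArrayRead(buf, &bufArray);
      for (PetscInt g = 0; g < nGhost; ++g)
        for (PetscInt c = 0; c < bs; ++c) locArray[ghostCell[g]*bs + c] = bufArray[g*bs + c];
      VecRestoreArrayRead(buf, &bufArray);
    }

A PetscSF or VecScatter built once from the same donor-to-ghost pairing
could do the same exchange without the VecSetValues/VecAssembly machinery.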
Pierre Seize From knepley at gmail.com Fri Oct 15 06:25:22 2021 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 15 Oct 2021 07:25:22 -0400 Subject: [petsc-users] Periodic boundary conditions in DMPlex In-Reply-To: <9a6f1a40-142e-f2a5-2101-2b60074b705e@onera.fr> References: <9a6f1a40-142e-f2a5-2101-2b60074b705e@onera.fr> Message-ID: On Fri, Oct 15, 2021 at 7:08 AM Pierre Seize wrote: > Hi, > > I'm writing a code using PETSc to solve NS equations with FV on an > unstructured mesh. Therefore I use DMPlex. > > Regarding periodicity, I manage to implement it this way: > > - for each couple of boundaries that is linked with periodicity, I > create a buffer vector with an ISLocalToGlobalMapping > > - then, when I need to fill the ghost cells corresponding to the > periodicity, the i "true" cell of the local vector fills the buffer > vector on location i with VecSetValuesBlockedLocal, then > VecAssemblyBegin/VecAssemblyEnd ensure each value is send to the correct > location thanks to the mapping, then the i "ghost" cell of the local > vector reads the vector on location i to get it's value. > > > It works, but it seems to me there is a better way, with maybe PetscSF, > VecScatter, or something I don't know yet. Does anyone have any advice ? > There are at least two other ways to handle this. First, the method that is advocated in Plex is to actually make a periodic geometry, meaning connect the cells that are meant to be connected. Then, if you partition with overlap = 1, PetscGlobalToLocal() will fill in these cell values automatically. Second, you could use a non-periodic geometry, but alter the LocalToGlobal map such that the cells gets filled in anyway. Many codes use this scheme and it is straightforward with Plex just by augmenting the map it makes automatically. Does this make sense? Thanks, Matt > Pierre Seize > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.seize at onera.fr Fri Oct 15 06:31:06 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Fri, 15 Oct 2021 13:31:06 +0200 Subject: [petsc-users] Periodic boundary conditions in DMPlex In-Reply-To: References: <9a6f1a40-142e-f2a5-2101-2b60074b705e@onera.fr> Message-ID: <71784fff-35eb-a129-3609-004e5e596575@onera.fr> It makes sense, thank you. In fact, both ways seems better than my way. The first one looks the most straightforward. Unfortunately I do not know how to implement either of them. Could you please direct me to the corresponding PETSc functions ? Pierre On 15/10/21 13:25, Matthew Knepley wrote: > On Fri, Oct 15, 2021 at 7:08 AM Pierre Seize > wrote: > > Hi, > > I'm writing a code using PETSc to solve NS equations with FV on an > unstructured mesh. Therefore I use DMPlex. > > Regarding periodicity, I manage to implement it this way: > > ?? - for each couple of boundaries that is linked with periodicity, I > create a buffer vector with an ISLocalToGlobalMapping > > ?? 
- then, when I need to fill the ghost cells corresponding to the > periodicity, the i "true" cell of the local vector fills the buffer > vector on location i with VecSetValuesBlockedLocal, then > VecAssemblyBegin/VecAssemblyEnd ensure each value is send to the > correct > location thanks to the mapping, then the i "ghost" cell of the local > vector reads the vector on location i to get it's value. > > > It works, but it seems to me there is a better way, with maybe > PetscSF, > VecScatter, or something I don't know yet. Does anyone have any > advice ? > > > There are at least two other ways to handle this. First, the method > that is advocated in > Plex is to actually make a periodic geometry, meaning connect the > cells that are meant > to be connected. Then, if you partition with overlap = 1, > PetscGlobalToLocal() will fill in > these cell values automatically. > > Second, you could use a non-periodic geometry, but alter the > LocalToGlobal map such > that the cells gets filled in anyway. Many codes use this scheme and > it is straightforward > with Plex just by augmenting the map it makes automatically. > > Does this make sense? > > ? Thanks, > > ? ? ?Matt > > Pierre Seize > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Fri Oct 15 07:03:52 2021 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 15 Oct 2021 08:03:52 -0400 Subject: [petsc-users] Periodic boundary conditions in DMPlex In-Reply-To: <71784fff-35eb-a129-3609-004e5e596575@onera.fr> References: <9a6f1a40-142e-f2a5-2101-2b60074b705e@onera.fr> <71784fff-35eb-a129-3609-004e5e596575@onera.fr> Message-ID: On Fri, Oct 15, 2021 at 7:31 AM Pierre Seize wrote: > It makes sense, thank you. In fact, both ways seems better than my way. > The first one looks the most straightforward. Unfortunately I do not know > how to implement either of them. Could you please direct me to the > corresponding PETSc functions ? > > The first way is implemented for example in DMPlexCreateBoxMesh() and DMPlexCreateCylinderMesh(). The second is not implemented since there did not seem to be a general way to do it. I would help if you wanted to try coding it up. Thanks, Matt > Pierre > > On 15/10/21 13:25, Matthew Knepley wrote: > > On Fri, Oct 15, 2021 at 7:08 AM Pierre Seize > wrote: > >> Hi, >> >> I'm writing a code using PETSc to solve NS equations with FV on an >> unstructured mesh. Therefore I use DMPlex. >> >> Regarding periodicity, I manage to implement it this way: >> >> - for each couple of boundaries that is linked with periodicity, I >> create a buffer vector with an ISLocalToGlobalMapping >> >> - then, when I need to fill the ghost cells corresponding to the >> periodicity, the i "true" cell of the local vector fills the buffer >> vector on location i with VecSetValuesBlockedLocal, then >> VecAssemblyBegin/VecAssemblyEnd ensure each value is send to the correct >> location thanks to the mapping, then the i "ghost" cell of the local >> vector reads the vector on location i to get it's value. >> >> >> It works, but it seems to me there is a better way, with maybe PetscSF, >> VecScatter, or something I don't know yet. Does anyone have any advice ? >> > > There are at least two other ways to handle this. 
First, the method that > is advocated in > Plex is to actually make a periodic geometry, meaning connect the cells > that are meant > to be connected. Then, if you partition with overlap = 1, > PetscGlobalToLocal() will fill in > these cell values automatically. > > Second, you could use a non-periodic geometry, but alter the LocalToGlobal > map such > that the cells gets filled in anyway. Many codes use this scheme and it is > straightforward > with Plex just by augmenting the map it makes automatically. > > Does this make sense? > > Thanks, > > Matt > > >> Pierre Seize >> > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.seize at onera.fr Fri Oct 15 08:33:44 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Fri, 15 Oct 2021 15:33:44 +0200 Subject: [petsc-users] Periodic boundary conditions in DMPlex In-Reply-To: References: <9a6f1a40-142e-f2a5-2101-2b60074b705e@onera.fr> <71784fff-35eb-a129-3609-004e5e596575@onera.fr> Message-ID: <8ff9d951-6958-aa1a-b875-e7488bb6b30b@onera.fr> When I first tried to handle the periodicity, I found the DMPlexCreateBoxMesh function (I cannot find the cylinder one). From reading the sources, I understand that we do some work either in DMPlexCreateCubeMesh_Internal or with DMSetPeriodicity. I tried to use DMSetPeriodicity before, for example with a 2x2 box on length 10. I did something like: const PetscReal maxCell[] = {2, 2}; const PetscReal L[] = {10, 10}; const DMBoundaryType bd[] = {DM_BOUNDARY_PERIODIC, DM_BOUNDARY_PERIODIC}; DMSetPeriodicity(dm, PETSC_TRUE, maxCell, L, bd); // or: DMSetPeriodicity(dm, PETSC_TRUE, NULL, L, bd); but it did not work: VecSet(X, 1); DMGetLocalVector(dm, &locX); VecZeroEntries(locX); DMGlobalToLocalBegin(dm, X, INSERT_VALUES, locX); DMGlobalToLocalEnd(dm, X, INSERT_VALUES, locX); VecView(locX, PETSC_VIEWER_STDOUT_WORLD); but the ghost cells values are all 0, only the real cells are 1. So I guess DMSetPeriodicity alone is not sufficient to handle the periodicity. Is there a way to do what I want ? That is set up my DMPlex in a way that DMGlobalToLocalBegin/DMGlobalToLocalEnd do exchange values between procs AND exchange the periodic values? Thanks for the help Pierre On 15/10/21 14:03, Matthew Knepley wrote: > On Fri, Oct 15, 2021 at 7:31 AM Pierre Seize > wrote: > > It makes sense, thank you. In fact, both ways seems better than my > way. The first one looks the most straightforward. Unfortunately I > do not know how to implement either of them. Could you please > direct me to the corresponding PETSc functions ? > > The first way is implemented for example in DMPlexCreateBoxMesh() and > DMPlexCreateCylinderMesh(). The second is not implemented since > there did not seem to be a general way to do it. I would help if you > wanted to try coding it up. > > ? Thanks, > > ? ? Matt > > Pierre > > > On 15/10/21 13:25, Matthew Knepley wrote: >> On Fri, Oct 15, 2021 at 7:08 AM Pierre Seize >> > wrote: >> >> Hi, >> >> I'm writing a code using PETSc to solve NS equations with FV >> on an >> unstructured mesh. Therefore I use DMPlex. 
>> >> Regarding periodicity, I manage to implement it this way: >> >> ?? - for each couple of boundaries that is linked with >> periodicity, I >> create a buffer vector with an ISLocalToGlobalMapping >> >> ?? - then, when I need to fill the ghost cells corresponding >> to the >> periodicity, the i "true" cell of the local vector fills the >> buffer >> vector on location i with VecSetValuesBlockedLocal, then >> VecAssemblyBegin/VecAssemblyEnd ensure each value is send to >> the correct >> location thanks to the mapping, then the i "ghost" cell of >> the local >> vector reads the vector on location i to get it's value. >> >> >> It works, but it seems to me there is a better way, with >> maybe PetscSF, >> VecScatter, or something I don't know yet. Does anyone have >> any advice ? >> >> >> There are at least two other ways to handle this. First, the >> method that is advocated in >> Plex is to actually make a periodic geometry, meaning connect the >> cells that are meant >> to be connected. Then, if you partition with overlap = 1, >> PetscGlobalToLocal() will fill in >> these cell values automatically. >> >> Second, you could use a non-periodic geometry, but alter the >> LocalToGlobal map such >> that the cells gets filled in anyway. Many codes use this scheme >> and it is straightforward >> with Plex just by augmenting the map it makes automatically. >> >> Does this make sense? >> >> ? Thanks, >> >> ? ? ?Matt >> >> Pierre Seize >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to >> which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.seize at onera.fr Fri Oct 15 09:16:23 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Fri, 15 Oct 2021 16:16:23 +0200 Subject: [petsc-users] Periodic boundary conditions in DMPlex In-Reply-To: <8ff9d951-6958-aa1a-b875-e7488bb6b30b@onera.fr> References: <9a6f1a40-142e-f2a5-2101-2b60074b705e@onera.fr> <71784fff-35eb-a129-3609-004e5e596575@onera.fr> <8ff9d951-6958-aa1a-b875-e7488bb6b30b@onera.fr> Message-ID: <64787e00-820b-0c83-02f8-854569e4df9e@onera.fr> I read everything again, I think I did not understand you at first. The first solution is to modify the DAG, so that the rightmost cell is linked to the leftmost face, right ? To do that, do I have to manually edit the DAG (the mesh is read from a file) ? If so, the mesh connectivity is like the one of a torus, then how does it work with the cells/faces coordinates ? Now I think the second method may be more straightforward. What's the idea ? Get the mapping with DMGetLocalToGlobalMapping, then create the mapping corresponding to the periodicity with ISLocalToGlobalMappingCreate, and finally ISLocalToGlobalMappingConcatenate ? I'm not sure this is the way, and I did not find something like DMSetLocalToGlobalMapping to restore the modified mapping. Pierre On 15/10/21 15:33, Pierre Seize wrote: > > When I first tried to handle the periodicity, I found the > DMPlexCreateBoxMesh function (I cannot find the cylinder one). > > From reading the sources, I understand that we do some work either in > DMPlexCreateCubeMesh_Internal or with DMSetPeriodicity. 
> > I tried to use DMSetPeriodicity before, for example with a 2x2 box on > length 10. I did something like: > > const PetscReal maxCell[] = {2, 2}; > const PetscReal L[] = {10, 10}; > const DMBoundaryType bd[] = {DM_BOUNDARY_PERIODIC, DM_BOUNDARY_PERIODIC}; > DMSetPeriodicity(dm, PETSC_TRUE, maxCell, L, bd); > // or: > DMSetPeriodicity(dm, PETSC_TRUE, NULL, L, bd); > > but it did not work: > > VecSet(X, 1); > DMGetLocalVector(dm, &locX); > VecZeroEntries(locX); > DMGlobalToLocalBegin(dm, X, INSERT_VALUES, locX); > DMGlobalToLocalEnd(dm, X, INSERT_VALUES, locX); > VecView(locX, PETSC_VIEWER_STDOUT_WORLD); > > but the ghost cells values are all 0, only the real cells are 1. So I > guess DMSetPeriodicity alone is not sufficient to handle the > periodicity. Is there a way to do what I want ? That is set up my > DMPlex in a way that DMGlobalToLocalBegin/DMGlobalToLocalEnd do > exchange values between procs AND exchange the periodic values? > > > Thanks for the help > > > Pierre > > > On 15/10/21 14:03, Matthew Knepley wrote: >> On Fri, Oct 15, 2021 at 7:31 AM Pierre Seize > > wrote: >> >> It makes sense, thank you. In fact, both ways seems better than >> my way. The first one looks the most straightforward. >> Unfortunately I do not know how to implement either of them. >> Could you please direct me to the corresponding PETSc functions ? >> >> The first way is implemented for example in DMPlexCreateBoxMesh() and >> DMPlexCreateCylinderMesh(). The second is not implemented since >> there did not seem to be a general way to do it. I would help if you >> wanted to try coding it up. >> >> ? Thanks, >> >> ? ? Matt >> >> Pierre >> >> >> On 15/10/21 13:25, Matthew Knepley wrote: >>> On Fri, Oct 15, 2021 at 7:08 AM Pierre Seize >>> > wrote: >>> >>> Hi, >>> >>> I'm writing a code using PETSc to solve NS equations with FV >>> on an >>> unstructured mesh. Therefore I use DMPlex. >>> >>> Regarding periodicity, I manage to implement it this way: >>> >>> ?? - for each couple of boundaries that is linked with >>> periodicity, I >>> create a buffer vector with an ISLocalToGlobalMapping >>> >>> ?? - then, when I need to fill the ghost cells corresponding >>> to the >>> periodicity, the i "true" cell of the local vector fills the >>> buffer >>> vector on location i with VecSetValuesBlockedLocal, then >>> VecAssemblyBegin/VecAssemblyEnd ensure each value is send to >>> the correct >>> location thanks to the mapping, then the i "ghost" cell of >>> the local >>> vector reads the vector on location i to get it's value. >>> >>> >>> It works, but it seems to me there is a better way, with >>> maybe PetscSF, >>> VecScatter, or something I don't know yet. Does anyone have >>> any advice ? >>> >>> >>> There are at least two other ways to handle this. First, the >>> method that is advocated in >>> Plex is to actually make a periodic geometry, meaning connect >>> the cells that are meant >>> to be connected. Then, if you partition with overlap = 1, >>> PetscGlobalToLocal() will fill in >>> these cell values automatically. >>> >>> Second, you could use a non-periodic geometry, but alter the >>> LocalToGlobal map such >>> that the cells gets filled in anyway. Many codes use this scheme >>> and it is straightforward >>> with Plex just by augmenting the map it makes automatically. >>> >>> Does this make sense? >>> >>> ? Thanks, >>> >>> ? ? 
?Matt >>> >>> Pierre Seize >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to >>> which their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which >> their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Fri Oct 15 12:26:01 2021 From: mfadams at lbl.gov (Mark Adams) Date: Fri, 15 Oct 2021 13:26:01 -0400 Subject: [petsc-users] Issue on Block Jacobi Preconditioner Reuse In-Reply-To: <9c62be9.5da4.17c82101668.Coremail.wangyijia@lsec.cc.ac.cn> References: <9c62be9.5da4.17c82101668.Coremail.wangyijia@lsec.cc.ac.cn> Message-ID: You seem to be trying to use the setup of one solver (lu) for a different solver (hypre). You can't do that in general and PCSetType(subpc, PCHYPRE); will delete the old solvers data. On Thu, Oct 14, 2021 at 11:48 PM ??? via petsc-users < petsc-users at mcs.anl.gov> wrote: > Hi!Everyone: > > Glad to join the mailing list.Recently I 've been working on a program > using block jacobi preconditioner for a sequence solve of two linear system > A_1x_1=b_1,A_2x_2=b_2. > > Since A1 and A2 have same nonzero pattern and their element values > are quite close, we hope to reuse the preconditoners constructed when > solving A_1x_1=b_1 , > > however after calling KSPSetReusePreconditioner, though the iteration > number of second ksp solve is very small but the time used did not decrease > much, these are my code: > > ierr = KSPSetUp(ksp);CHKERRQ(ierr); > ierr = PCBJacobiGetSubKSP(pc, &num_local, &idx_first_local, > &subksp);CHKERRQ(ierr); > > > for (i=0; i { > ierr = KSPGetPC(subksp[i], &subpc);CHKERRQ(ierr); > if (i==0) > { > ierr = > KSPSetType(subksp[i],"gmres");CHKERRQ(ierr); > ierr = PCSetType(subpc, > PCLU);CHKERRQ(ierr); > ierr = PCFactorSetMatSolverType(subpc, > "mkl_pardiso");CHKERRQ(ierr); > ierr = KSPSetOptionsPrefix(subksp[i], > "Blk0_");CHKERRQ(ierr); > ierr = > KSPSetReusePreconditioner(subksp[i],PETSC_TRUE);CHKERRQ(ierr); > ierr = > KSPSetFromOptions(subksp[i]);CHKERRQ(ierr); > } > if (i==1) > { > ierr = > KSPSetType(subksp[i],"gmres");CHKERRQ(ierr); > ierr = PCSetType(subpc, > PCHYPRE);CHKERRQ(ierr); > ierr = PCHYPRESetType(pc, > "boomeramg");CHKERRQ(ierr); > ierr = > KSPSetReusePreconditioner(subksp[i],PETSC_TRUE);CHKERRQ(ierr); > ierr = KSPSetOptionsPrefix(subksp[i], > "Blk1_");CHKERRQ(ierr); > } > }//nlocal > }//isbjacobi > > > //Solve the Linear System > ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr); > //Solve the Linear System > t0 = MPI_Wtime(); > ierr = KSPSolve(ksp, b1, x1);CHKERRQ(ierr); > t0 = MPI_Wtime()-t0; > ierr = PetscPrintf(PETSC_COMM_SELF,"First KSP Solve time: %g > s\n",t0);CHKERRQ(ierr); > //Preconditioner Reuse > ierr = KSPSetReusePreconditioner(ksp, PETSC_TRUE);CHKERRQ(ierr); > ierr = KSPSetOperators(ksp,A2,A1);CHKERRQ(ierr); > t0 = MPI_Wtime(); > ierr = KSPSolve(ksp,b2,x2);CHKERRQ(ierr); > t0 = MPI_Wtime()-t0; > ierr = PetscPrintf(PETSC_COMM_SELF,"Second KSP Solve time: %g > s\n",t0);CHKERRQ(ierr); > > > The total block number is 2, and the first block is solved using direct > method and the second block is solved using hypre's boomeramg, so I hope to > reuse the factor of the first block and the set up phase of boomeramg as 
> preconditioner of the next solve, is there anything wrong with the reuse > code? > > > Best Wishes > > > WANG Yijia > > > 2021/10/15 > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Fri Oct 15 12:27:42 2021 From: bsmith at petsc.dev (Barry Smith) Date: Fri, 15 Oct 2021 13:27:42 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <40fd95d0-c025-e5fb-da32-1c6037b87e53@pppl.gov> <08C573D5-5883-464B-B0EC-496FD12C0504@petsc.dev> <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> Message-ID: <6D4D8741-3F52-41BF-B2A3-AFBA09443755@petsc.dev> So the only difference is between -sub_telescope_pc_factor_mat_solver_type cusparse and -sub_telescope_pc_factor_mat_solver_type mumps ? Try without the -sub_telescope_pc_factor_mat_solver_type cusparse and then PETSc will just use the CPU solvers, I want to see if that works, it should. If it works then there is perhaps something specific about the PCTELESCOPE and the cusparse solver, for example the right hand side array values may never get to the GPU. Barry > On Oct 14, 2021, at 10:11 PM, Chang Liu wrote: > > For comparison, here is the output using mumps instead of cusparse > > $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type mumps -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 > 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > 1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 > 2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 > 3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 > 4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 > 5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 > 6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 > 7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 > 8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm 3.379580577269e-01 ||r(i)||/||b|| 
8.417444988714e-03 > 9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 > 10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 > 11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 > 12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid norm 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 > 13 KSP unpreconditioned resid norm 1.885892030223e-01 true resid norm 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 > 14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid norm 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 > 15 KSP unpreconditioned resid norm 1.638219366731e-01 true resid norm 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 > 16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid norm 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 > 17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid norm 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 > 18 KSP unpreconditioned resid norm 1.289673236836e-01 true resid norm 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 > 19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid norm 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 > 20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid norm 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 > 21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid norm 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 > 22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid norm 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 > 23 KSP unpreconditioned resid norm 7.563496650115e-02 true resid norm 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 > 24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid norm 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 > 25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid norm 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 > 26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid norm 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 > 27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid norm 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 > 28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid norm 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 > 29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid norm 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 > 30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid norm 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 > 31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid norm 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 > 32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid norm 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 > 33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid norm 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 > 34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid norm 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 > 35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid norm 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 > 36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid norm 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 > 37 KSP unpreconditioned resid norm 
4.982347757923e-03 true resid norm 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 > 38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid norm 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 > 39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid norm 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 > 40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid norm 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 > 41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid norm 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 > 42 KSP unpreconditioned resid norm 2.772928845284e-03 true resid norm 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 > 43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid norm 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 > 44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid norm 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 > 45 KSP unpreconditioned resid norm 2.086800523919e-03 true resid norm 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 > 46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid norm 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 > 47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid norm 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 > 48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid norm 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 > 49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid norm 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 > 50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid norm 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 > 51 KSP unpreconditioned resid norm 1.375985053014e-03 true resid norm 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 > 52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid norm 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 > 53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid norm 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 > 54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid norm 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 > 55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid norm 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 > 56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid norm 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 > 57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid norm 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 > 58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid norm 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 > 59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid norm 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 > 60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid norm 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 > 61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid norm 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 > 62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid norm 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 > 63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid norm 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 > 64 KSP unpreconditioned resid norm 2.961034697265e-04 true resid norm 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 > 65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid norm 2.640593092767e-04 
||r(i)||/||b|| 6.576865557059e-06 > 66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid norm 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 > 67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid norm 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 > 68 KSP unpreconditioned resid norm 2.008438265031e-04 true resid norm 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 > 69 KSP unpreconditioned resid norm 1.838732863386e-04 true resid norm 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 > 70 KSP unpreconditioned resid norm 1.723786027645e-04 true resid norm 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 > 71 KSP unpreconditioned resid norm 1.580945192204e-04 true resid norm 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 > 72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid norm 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 > 73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid norm 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 > 74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid norm 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 > 75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid norm 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 > 76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid norm 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 > 77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid norm 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 > 78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid norm 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 > 79 KSP unpreconditioned resid norm 8.896901670359e-05 true resid norm 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 > 80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid norm 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 > 81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid norm 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 > 82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid norm 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 > 83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid norm 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 > 84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid norm 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 > 85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid norm 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 > 86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid norm 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 > 87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid norm 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 > 88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid norm 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 > 89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid norm 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 > 90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid norm 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 > 91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid norm 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 > 92 KSP unpreconditioned resid norm 1.707071292484e-05 true resid norm 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 > 93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid norm 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 > 94 KSP 
unpreconditioned resid norm 1.219393542993e-05 true resid norm 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 > 95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid norm 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 > 96 KSP unpreconditioned resid norm 9.099659872548e-06 true resid norm 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 > 97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid norm 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 > 98 KSP unpreconditioned resid norm 7.167226146744e-06 true resid norm 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 > 99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid norm 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 > 100 KSP unpreconditioned resid norm 5.767783600111e-06 true resid norm 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 > 101 KSP unpreconditioned resid norm 5.261057430584e-06 true resid norm 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 > 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid norm 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 > 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid norm 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 > 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid norm 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 > 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid norm 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 > 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid norm 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 > 107 KSP unpreconditioned resid norm 3.234218880987e-06 true resid norm 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 > 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid norm 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 > 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid norm 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 > 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid norm 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 > 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid norm 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 > 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid norm 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 > 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid norm 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 > 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid norm 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 > 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid norm 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 > 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid norm 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 > 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid norm 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 > 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid norm 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 > 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid norm 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 > 120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid norm 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 > 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid norm 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 > 122 KSP unpreconditioned resid norm 
7.945760150897e-07 true resid norm 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 > 123 KSP unpreconditioned resid norm 7.141240839013e-07 true resid norm 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 > 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid norm 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 > 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid norm 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 > 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid norm 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 > 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid norm 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 > 128 KSP unpreconditioned resid norm 4.253264691112e-07 true resid norm 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 > 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid norm 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 > 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid norm 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 > 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid norm 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 > 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid norm 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 > 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid norm 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 > 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid norm 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 > 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid norm 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 > 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid norm 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 > 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid norm 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 > 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid norm 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 > 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid norm 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 > 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid norm 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 > 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid norm 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 > 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid norm 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 > 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid norm 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 > 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid norm 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 > 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid norm 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 > 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid norm 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 > 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid norm 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 > 148 KSP unpreconditioned resid norm 5.690132597004e-08 true resid norm 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 > 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid norm 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 > 150 KSP unpreconditioned resid norm 4.625371062660e-08 true 
resid norm 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 > 151 KSP unpreconditioned resid norm 4.349049084805e-08 true resid norm 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 > 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid norm 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 > 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid norm 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 > 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid norm 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 > 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid norm 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 > 156 KSP unpreconditioned resid norm 2.128504065757e-08 true resid norm 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 > 157 KSP unpreconditioned resid norm 1.879248809429e-08 true resid norm 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 > 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid norm 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 > 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid norm 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 > 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid norm 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 > 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid norm 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 > 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid norm 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 > 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid norm 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 > 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid norm 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 > 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid norm 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 > 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid norm 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 > 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid norm 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 > 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid norm 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 > 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid norm 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 > 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid norm 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 > 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid norm 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 > 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid norm 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 > 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid norm 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 > 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid norm 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 > 175 KSP unpreconditioned resid norm 3.924177535665e-09 true resid norm 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 > 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid norm 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 > 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid norm 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 > 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid norm 
2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 > 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid norm 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 > 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid norm 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 > 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid norm 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 > 182 KSP unpreconditioned resid norm 1.983545827983e-09 true resid norm 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 > 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid norm 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 > 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid norm 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 > 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid norm 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 > 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid norm 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 > 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid norm 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 > 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid norm 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 > 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid norm 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 > KSP Object: 16 MPI processes > type: fgmres > restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement > happy breakdown tolerance 1e-30 > maximum iterations=2000, initial guess is zero > tolerances: relative=1e-20, absolute=1e-09, divergence=10000. > right preconditioning > using UNPRECONDITIONED norm type for convergence test > PC Object: 16 MPI processes > type: bjacobi > number of blocks = 4 > Local solver information for first block is in the following KSP and PC objects on rank 0: > Use -ksp_view ::ascii_info_detail to display information for all blocks > KSP Object: (sub_) 4 MPI processes > type: preonly > maximum iterations=10000, initial guess is zero > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > left preconditioning > using NONE norm type for convergence test > PC Object: (sub_) 4 MPI processes > type: telescope > petsc subcomm: parent comm size reduction factor = 4 > petsc subcomm: parent_size = 4 , subcomm_size = 1 > petsc subcomm type = contiguous > linear system matrix = precond matrix: > Mat Object: (sub_) 4 MPI processes > type: mpiaij > rows=40200, cols=40200 > total: nonzeros=199996, allocated nonzeros=203412 > total number of mallocs used during MatSetValues calls=0 > not using I-node (on process 0) routines > setup type: default > Parent DM object: NULL > Sub DM object: NULL > KSP Object: (sub_telescope_) 1 MPI processes > type: preonly > maximum iterations=10000, initial guess is zero > tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > left preconditioning > using NONE norm type for convergence test > PC Object: (sub_telescope_) 1 MPI processes > type: lu > out-of-place factorization > tolerance for zero pivot 2.22045e-14 > matrix ordering: external > factor fill ratio given 0., needed 0. 
> Factored matrix follows: > Mat Object: 1 MPI processes > type: mumps > rows=40200, cols=40200 > package used to perform factorization: mumps > total: nonzeros=1849788, allocated nonzeros=1849788 > MUMPS run parameters: > SYM (matrix type): 0 > PAR (host participation): 1 > ICNTL(1) (output for error): 6 > ICNTL(2) (output of diagnostic msg): 0 > ICNTL(3) (output for global info): 0 > ICNTL(4) (level of printing): 0 > ICNTL(5) (input mat struct): 0 > ICNTL(6) (matrix prescaling): 7 > ICNTL(7) (sequential matrix ordering):7 > ICNTL(8) (scaling strategy): 77 > ICNTL(10) (max num of refinements): 0 > ICNTL(11) (error analysis): 0 > ICNTL(12) (efficiency control): 1 > ICNTL(13) (sequential factorization of the root node): 0 > ICNTL(14) (percentage of estimated workspace increase): 20 > ICNTL(18) (input mat struct): 0 > ICNTL(19) (Schur complement info): 0 > ICNTL(20) (RHS sparse pattern): 0 > ICNTL(21) (solution struct): 0 > ICNTL(22) (in-core/out-of-core facility): 0 > ICNTL(23) (max size of memory can be allocated locally):0 > ICNTL(24) (detection of null pivot rows): 0 > ICNTL(25) (computation of a null space basis): 0 > ICNTL(26) (Schur options for RHS or solution): 0 > ICNTL(27) (blocking size for multiple RHS): -32 > ICNTL(28) (use parallel or sequential ordering): 1 > ICNTL(29) (parallel ordering): 0 > ICNTL(30) (user-specified set of entries in inv(A)): 0 > ICNTL(31) (factors is discarded in the solve phase): 0 > ICNTL(33) (compute determinant): 0 > ICNTL(35) (activate BLR based factorization): 0 > ICNTL(36) (choice of BLR factorization variant): 0 > ICNTL(38) (estimated compression rate of LU factors): 333 > CNTL(1) (relative pivoting threshold): 0.01 > CNTL(2) (stopping criterion of refinement): 1.49012e-08 > CNTL(3) (absolute pivoting threshold): 0. > CNTL(4) (value of static pivoting): -1. > CNTL(5) (fixation for null pivots): 0. > CNTL(7) (dropping parameter for BLR): 0. 
> RINFO(1) (local estimated flops for the elimination after analysis): > [0] 1.45525e+08 > RINFO(2) (local estimated flops for the assembly after factorization): > [0] 2.89397e+06 > RINFO(3) (local estimated flops for the elimination after factorization): > [0] 1.45525e+08 > INFO(15) (estimated size of (in MB) MUMPS internal data for running numerical factorization): > [0] 29 > INFO(16) (size of (in MB) MUMPS internal data used during numerical factorization): > [0] 29 > INFO(23) (num of pivots eliminated on this processor after factorization): > [0] 40200 > RINFOG(1) (global estimated flops for the elimination after analysis): 1.45525e+08 > RINFOG(2) (global estimated flops for the assembly after factorization): 2.89397e+06 > RINFOG(3) (global estimated flops for the elimination after factorization): 1.45525e+08 > (RINFOG(12) RINFOG(13))*2^INFOG(34) (determinant): (0.,0.)*(2^0) > INFOG(3) (estimated real workspace for factors on all processors after analysis): 1849788 > INFOG(4) (estimated integer workspace for factors on all processors after analysis): 879986 > INFOG(5) (estimated maximum front size in the complete tree): 282 > INFOG(6) (number of nodes in the complete tree): 23709 > INFOG(7) (ordering option effectively used after analysis): 5 > INFOG(8) (structural symmetry in percent of the permuted matrix after analysis): 100 > INFOG(9) (total real/complex workspace to store the matrix factors after factorization): 1849788 > INFOG(10) (total integer space store the matrix factors after factorization): 879986 > INFOG(11) (order of largest frontal matrix after factorization): 282 > INFOG(12) (number of off-diagonal pivots): 0 > INFOG(13) (number of delayed pivots after factorization): 0 > INFOG(14) (number of memory compress after factorization): 0 > INFOG(15) (number of steps of iterative refinement after solution): 0 > INFOG(16) (estimated size (in MB) of all MUMPS internal data for factorization after analysis: value on the most memory consuming processor): 29 > INFOG(17) (estimated size of all MUMPS internal data for factorization after analysis: sum over all processors): 29 > INFOG(18) (size of all MUMPS internal data allocated during factorization: value on the most memory consuming processor): 29 > INFOG(19) (size of all MUMPS internal data allocated during factorization: sum over all processors): 29 > INFOG(20) (estimated number of entries in the factors): 1849788 > INFOG(21) (size in MB of memory effectively used during factorization - value on the most memory consuming processor): 26 > INFOG(22) (size in MB of memory effectively used during factorization - sum over all processors): 26 > INFOG(23) (after analysis: value of ICNTL(6) effectively used): 0 > INFOG(24) (after analysis: value of ICNTL(12) effectively used): 1 > INFOG(25) (after factorization: number of pivots modified by static pivoting): 0 > INFOG(28) (after factorization: number of null pivots encountered): 0 > INFOG(29) (after factorization: effective number of entries in the factors (sum over all processors)): 1849788 > INFOG(30, 31) (after solution: size in Mbytes of memory used during solution phase): 29, 29 > INFOG(32) (after analysis: type of analysis done): 1 > INFOG(33) (value used for ICNTL(8)): 7 > INFOG(34) (exponent of the determinant if determinant is requested): 0 > INFOG(35) (after factorization: number of entries taking into account BLR factor compression - sum over all processors): 1849788 > INFOG(36) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - value on the 
most memory consuming processor): 0 > INFOG(37) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - sum over all processors): 0 > INFOG(38) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - value on the most memory consuming processor): 0 > INFOG(39) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - sum over all processors): 0 > linear system matrix = precond matrix: > Mat Object: 1 MPI processes > type: seqaijcusparse > rows=40200, cols=40200 > total: nonzeros=199996, allocated nonzeros=199996 > total number of mallocs used during MatSetValues calls=0 > not using I-node routines > linear system matrix = precond matrix: > Mat Object: 16 MPI processes > type: mpiaijcusparse > rows=160800, cols=160800 > total: nonzeros=802396, allocated nonzeros=1608000 > total number of mallocs used during MatSetValues calls=0 > not using I-node (on process 0) routines > Norm of error 9.11684e-07 iterations 189 > > Chang > > > > On 10/14/21 10:10 PM, Chang Liu wrote: >> Hi Barry, >> No problem. Here is the output. It seems that the resid norm calculation is incorrect. >> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >> 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >> KSP Object: 16 MPI processes >> type: fgmres >> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement >> happy breakdown tolerance 1e-30 >> maximum iterations=2000, initial guess is zero >> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. >> right preconditioning >> using UNPRECONDITIONED norm type for convergence test >> PC Object: 16 MPI processes >> type: bjacobi >> number of blocks = 4 >> Local solver information for first block is in the following KSP and PC objects on rank 0: >> Use -ksp_view ::ascii_info_detail to display information for all blocks >> KSP Object: (sub_) 4 MPI processes >> type: preonly >> maximum iterations=10000, initial guess is zero >> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >> left preconditioning >> using NONE norm type for convergence test >> PC Object: (sub_) 4 MPI processes >> type: telescope >> petsc subcomm: parent comm size reduction factor = 4 >> petsc subcomm: parent_size = 4 , subcomm_size = 1 >> petsc subcomm type = contiguous >> linear system matrix = precond matrix: >> Mat Object: (sub_) 4 MPI processes >> type: mpiaij >> rows=40200, cols=40200 >> total: nonzeros=199996, allocated nonzeros=203412 >> total number of mallocs used during MatSetValues calls=0 >> not using I-node (on process 0) routines >> setup type: default >> Parent DM object: NULL >> Sub DM object: NULL >> KSP Object: (sub_telescope_) 1 MPI processes >> type: preonly >> maximum iterations=10000, initial guess is zero >> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
>> left preconditioning >> using NONE norm type for convergence test >> PC Object: (sub_telescope_) 1 MPI processes >> type: lu >> out-of-place factorization >> tolerance for zero pivot 2.22045e-14 >> matrix ordering: nd >> factor fill ratio given 5., needed 8.62558 >> Factored matrix follows: >> Mat Object: 1 MPI processes >> type: seqaijcusparse >> rows=40200, cols=40200 >> package used to perform factorization: cusparse >> total: nonzeros=1725082, allocated nonzeros=1725082 >> not using I-node routines >> linear system matrix = precond matrix: >> Mat Object: 1 MPI processes >> type: seqaijcusparse >> rows=40200, cols=40200 >> total: nonzeros=199996, allocated nonzeros=199996 >> total number of mallocs used during MatSetValues calls=0 >> not using I-node routines >> linear system matrix = precond matrix: >> Mat Object: 16 MPI processes >> type: mpiaijcusparse >> rows=160800, cols=160800 >> total: nonzeros=802396, allocated nonzeros=1608000 >> total number of mallocs used during MatSetValues calls=0 >> not using I-node (on process 0) routines >> Norm of error 400.999 iterations 1 >> Chang >> On 10/14/21 9:47 PM, Barry Smith wrote: >>> >>> Chang, >>> >>> Sorry I did not notice that one. Please run that with -ksp_view -ksp_monitor_true_residual so we can see exactly how options are interpreted and solver used. At a glance it looks ok but something must be wrong to get the wrong answer. >>> >>> Barry >>> >>>> On Oct 14, 2021, at 6:02 PM, Chang Liu wrote: >>>> >>>> Hi Barry, >>>> >>>> That is exactly what I was doing in the second example, in which the preconditioner works but the GMRES does not. >>>> >>>> Chang >>>> >>>> On 10/14/21 5:15 PM, Barry Smith wrote: >>>>> You need to use the PCTELESCOPE inside the block Jacobi, not outside it. So something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu >>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: >>>>>> >>>>>> Hi Pierre, >>>>>> >>>>>> I wonder if the trick of PCTELESCOPE only works for preconditioner and not for the solver. I have done some tests, and find that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work. >>>>>> >>>>>> The command line options I used for small matrix is like >>>>>> >>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4 >>>>>> >>>>>> which gives the correct output. For iterative solver, I tried >>>>>> >>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20 >>>>>> >>>>>> for large matrix. The output is like >>>>>> >>>>>> 0 KSP Residual norm 40.1497 >>>>>> 1 KSP Residual norm < 1.e-11 >>>>>> Norm of error 400.999 iterations 1 >>>>>> >>>>>> So it seems to call a direct solver instead of an iterative one. >>>>>> >>>>>> Can you please help check these options? >>>>>> >>>>>> Chang >>>>>> >>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote: >>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote: >>>>>>>> >>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. 
This sounds exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaircusparse? Or I have to do it manually? >>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). >>>>>>> 1) I?m not sure this is implemented for cuSparse matrices, but it should be; >>>>>>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning. >>>>>>> If you try this out and this does not work, please provide the backtrace (probably something like ?Operation XYZ not implemented for MatType ABC?), and hopefully someone can add the missing plumbing. >>>>>>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve. >>>>>>> Thanks, >>>>>>> Pierre >>>>>>>> Chang >>>>>>>> >>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >>>>>>>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block? >>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu >>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads because not only do the Mat needs to be redistributed, the secondary processes also need to be ?converted? to OpenMP threads. >>>>>>>>> Thus the need for specific code in mumps.c. >>>>>>>>> Thanks, >>>>>>>>> Pierre >>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote: >>>>>>>>>> >>>>>>>>>> Hi Junchao, >>>>>>>>>> >>>>>>>>>> Yes that is what I want. >>>>>>>>>> >>>>>>>>>> Chang >>>>>>>>>> >>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: >>>>>>>>>>> Junchao, >>>>>>>>>>> If I understand correctly Chang is using the block Jacobi >>>>>>>>>>> method with a single block for a number of MPI ranks and a direct >>>>>>>>>>> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which >>>>>>>>>>> is code Hong Zhang wrote a number of years ago for CPUs. For their >>>>>>>>>>> particular problems this preconditioner works well, but using an >>>>>>>>>>> iterative solver on the blocks does not work well. >>>>>>>>>>> If we had complete MPI-GPU direct solvers he could just use >>>>>>>>>>> the current code with MPIAIJCUSPARSE on each block but since we do >>>>>>>>>>> not he would like to use a single GPU for each block, this means >>>>>>>>>>> that diagonal blocks of the global parallel MPI matrix needs to be >>>>>>>>>>> sent to a subset of the GPUs (one GPU per block, which has multiple >>>>>>>>>>> MPI ranks associated with the blocks). Similarly for the triangular >>>>>>>>>>> solves the blocks of the right hand side needs to be shipped to the >>>>>>>>>>> appropriate GPU and the resulting solution shipped back to the >>>>>>>>>>> multiple GPUs. So Chang is absolutely correct, this is somewhat like >>>>>>>>>>> your code for MUMPS with OpenMP. OK, I now understand the background.. 
>>>>>>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the >>>>>>>>>>> MPI ranks and then shrink each block down to a single GPU but this >>>>>>>>>>> would be pretty inefficient, ideally one would go directly from the >>>>>>>>>>> big MPI matrix on all the GPUs to the sub matrices on the subset of >>>>>>>>>>> GPUs. But this may be a large coding project. >>>>>>>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. >>>>>>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. >>>>>>>>>>> Barry >>>>>>>>>>> Since the matrices being factored and solved directly are relatively >>>>>>>>>>> large it is possible that the cusparse code could be reasonably >>>>>>>>>>> efficient (they are not the tiny problems one gets at the coarse >>>>>>>>>>> level of multigrid). Of course, this is speculation, I don't >>>>>>>>>>> actually know how much better the cusparse code would be on the >>>>>>>>>>> direct solver than a good CPU direct sparse solver. >>>>>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>>>>>> > wrote: >>>>>>>>>>> > >>>>>>>>>>> > Sorry I am not familiar with the details either. Can you please >>>>>>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>>>>>>> > >>>>>>>>>>> > Chang >>>>>>>>>>> > >>>>>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>>>>>>> >> Hi Chang, >>>>>>>>>>> >> I did the work in mumps. It is easy for me to understand >>>>>>>>>>> gathering matrix rows to one process. >>>>>>>>>>> >> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that? >>>>>>>>>>> >> Thanks >>>>>>>>>>> >> --Junchao Zhang >>>>>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users >>>>>>>>>>> >>>>>>>>>>> >> >>>>>>>>>>> wrote: >>>>>>>>>>> >> Hi Barry, >>>>>>>>>>> >> I think mumps solver in petsc does support that. You can >>>>>>>>>>> check the >>>>>>>>>>> >> documentation on "-mat_mumps_use_omp_threads" at >>>>>>>>>>> >> >>>>>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>>>>>>> >>>>>>>>>>> >> >>>>>>>>>> > >>>>>>>>>>> >> and the code enclosed by #if >>>>>>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>>>>>>> >> functions MatMumpsSetUpDistRHSInfo and >>>>>>>>>>> >> MatMumpsGatherNonzerosOnMaster in >>>>>>>>>>> >> mumps.c >>>>>>>>>>> >> 1. I understand it is ideal to do one MPI rank per GPU. >>>>>>>>>>> However, I am >>>>>>>>>>> >> working on an existing code that was developed based on MPI >>>>>>>>>>> and the the >>>>>>>>>>> >> # of mpi ranks is typically equal to # of cpu cores. We don't >>>>>>>>>>> want to >>>>>>>>>>> >> change the whole structure of the code. >>>>>>>>>>> >> 2. What you have suggested has been coded in mumps.c. See >>>>>>>>>>> function >>>>>>>>>>> >> MatMumpsSetUpDistRHSInfo. >>>>>>>>>>> >> Regards, >>>>>>>>>>> >> Chang >>>>>>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>>>>>>> >> > >>>>>>>>>>> >> > >>>>>>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>>>>>>> >>>>>>>>>>> >> >> wrote: >>>>>>>>>>> >> >> >>>>>>>>>>> >> >> Hi Barry, >>>>>>>>>>> >> >> >>>>>>>>>>> >> >> That is exactly what I want. 
>>>>>>>>>>> >> >> >>>>>>>>>>> >> >> Back to my original question, I am looking for an approach to >>>>>>>>>>> >> transfer >>>>>>>>>>> >> >> matrix >>>>>>>>>>> >> >> data from many MPI processes to "master" MPI >>>>>>>>>>> >> >> processes, each of which taking care of one GPU, and then >>>>>>>>>>> upload >>>>>>>>>>> >> the data to GPU to >>>>>>>>>>> >> >> solve. >>>>>>>>>>> >> >> One can just grab some codes from mumps.c to >>>>>>>>>>> aijcusparse.cu >>>>>>>>>>> >> >. >>>>>>>>>>> >> > >>>>>>>>>>> >> > mumps.c doesn't actually do that. It never needs to >>>>>>>>>>> copy the >>>>>>>>>>> >> entire matrix to a single MPI rank. >>>>>>>>>>> >> > >>>>>>>>>>> >> > It would be possible to write such a code that you >>>>>>>>>>> suggest but >>>>>>>>>>> >> it is not clear that it makes sense >>>>>>>>>>> >> > >>>>>>>>>>> >> > 1) For normal PETSc GPU usage there is one GPU per MPI >>>>>>>>>>> rank, so >>>>>>>>>>> >> while your one GPU per big domain is solving its systems the >>>>>>>>>>> other >>>>>>>>>>> >> GPUs (with the other MPI ranks that share that domain) are doing >>>>>>>>>>> >> nothing. >>>>>>>>>>> >> > >>>>>>>>>>> >> > 2) For each triangular solve you would have to gather the >>>>>>>>>>> right >>>>>>>>>>> >> hand side from the multiple ranks to the single GPU to pass it to >>>>>>>>>>> >> the GPU solver and then scatter the resulting solution back >>>>>>>>>>> to all >>>>>>>>>>> >> of its subdomain ranks. >>>>>>>>>>> >> > >>>>>>>>>>> >> > What I was suggesting was assign an entire subdomain to a >>>>>>>>>>> >> single MPI rank, thus it does everything on one GPU and can >>>>>>>>>>> use the >>>>>>>>>>> >> GPU solver directly. If all the major computations of a subdomain >>>>>>>>>>> >> can fit and be done on a single GPU then you would be >>>>>>>>>>> utilizing all >>>>>>>>>>> >> the GPUs you are using effectively. >>>>>>>>>>> >> > >>>>>>>>>>> >> > Barry >>>>>>>>>>> >> > >>>>>>>>>>> >> > >>>>>>>>>>> >> > >>>>>>>>>>> >> >> >>>>>>>>>>> >> >> Chang >>>>>>>>>>> >> >> >>>>>>>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>>>>>>>> >> >>> Chang, >>>>>>>>>>> >> >>> You are correct there is no MPI + GPU direct >>>>>>>>>>> solvers that >>>>>>>>>>> >> currently do the triangular solves with MPI + GPU parallelism >>>>>>>>>>> that I >>>>>>>>>>> >> am aware of. You are limited that individual triangular solves be >>>>>>>>>>> >> done on a single GPU. I can only suggest making each subdomain as >>>>>>>>>>> >> big as possible to utilize each GPU as much as possible for the >>>>>>>>>>> >> direct triangular solves. >>>>>>>>>>> >> >>> Barry >>>>>>>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> wrote: >>>>>>>>>>> >> >>>> >>>>>>>>>>> >> >>>> Hi Mark, >>>>>>>>>>> >> >>>> >>>>>>>>>>> >> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with >>>>>>>>>>> other >>>>>>>>>>> >> solvers, but with -pc_factor_mat_solver_type cusparse, it >>>>>>>>>>> will give >>>>>>>>>>> >> an error. >>>>>>>>>>> >> >>>> >>>>>>>>>>> >> >>>> Yes what I want is to have mumps or superlu to do the >>>>>>>>>>> >> factorization, and then do the rest, including GMRES solver, >>>>>>>>>>> on gpu. >>>>>>>>>>> >> Is that possible? >>>>>>>>>>> >> >>>> >>>>>>>>>>> >> >>>> I have tried to use aijcusparse with superlu_dist, it >>>>>>>>>>> runs but >>>>>>>>>>> >> the iterative solver is still running on CPUs. I have >>>>>>>>>>> contacted the >>>>>>>>>>> >> superlu group and they confirmed that is the case right now. 
>>>>>>>>>>> But if >>>>>>>>>>> >> I set -pc_factor_mat_solver_type cusparse, it seems that the >>>>>>>>>>> >> iterative solver is running on GPU. >>>>>>>>>>> >> >>>> >>>>>>>>>>> >> >>>> Chang >>>>>>>>>>> >> >>>> >>>>>>>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>>>>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >>>>>>>>>>> >>>>>>>>>>> >> > >>>>>>>>>>> >>>>>>>>>>> >> >>> wrote: >>>>>>>>>>> >> >>>>> Thank you Junchao for explaining this. I guess in >>>>>>>>>>> my case >>>>>>>>>>> >> the code is >>>>>>>>>>> >> >>>>> just calling a seq solver like superlu to do >>>>>>>>>>> >> factorization on GPUs. >>>>>>>>>>> >> >>>>> My idea is that I want to have a traditional MPI >>>>>>>>>>> code to >>>>>>>>>>> >> utilize GPUs >>>>>>>>>>> >> >>>>> with cusparse. Right now cusparse does not support >>>>>>>>>>> mpiaij >>>>>>>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' will give you an >>>>>>>>>>> >> mpiaijcusparse matrix with > 1 processes. >>>>>>>>>>> >> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). >>>>>>>>>>> >> >>>>> However, I see in grepping the repo that all the mumps and >>>>>>>>>>> >> superlu tests use aij or sell matrix type. >>>>>>>>>>> >> >>>>> MUMPS and SuperLU provide their own solves, I assume >>>>>>>>>>> .... but >>>>>>>>>>> >> you might want to do other matrix operations on the GPU. Is >>>>>>>>>>> that the >>>>>>>>>>> >> issue? >>>>>>>>>>> >> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or >>>>>>>>>>> SuperLU >>>>>>>>>>> >> have a problem? (no test with it so it probably does not work) >>>>>>>>>>> >> >>>>> Thanks, >>>>>>>>>>> >> >>>>> Mark >>>>>>>>>>> >> >>>>> so I >>>>>>>>>>> >> >>>>> want the code to have a mpiaij matrix when adding >>>>>>>>>>> all the >>>>>>>>>>> >> matrix terms, >>>>>>>>>>> >> >>>>> and then transform the matrix to seqaij when doing the >>>>>>>>>>> >> factorization >>>>>>>>>>> >> >>>>> and >>>>>>>>>>> >> >>>>> solve. This involves sending the data to the master >>>>>>>>>>> >> process, and I >>>>>>>>>>> >> >>>>> think >>>>>>>>>>> >> >>>>> the petsc mumps solver have something similar already. 
>>>>>>>>>>> >> >>>>> Chang >>>>>>>>>>> >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote: >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >>>>>>>>>>> >> >>>>>>>>>>> > >>>>>>>>>>> >> >>>>> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>> > >>>>>>>>>> >>>>>>>>>> > >>>>>>>>>>> >> >>>>>>>>>>> >>>> wrote: >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >>>>>>>>>>> >> >>>>>>>>>> > >>>>>>>>>>> >> >>>>> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>> > >>>>>>>>>> > >>>>>>>>>>> >> >>>>>>>>>>> >>>> wrote: >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > Hi Mark, >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > The option I use is like >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > -pc_type bjacobi -pc_bjacobi_blocks 16 >>>>>>>>>>> >> -ksp_type fgmres >>>>>>>>>>> >> >>>>> -mat_type >>>>>>>>>>> >> >>>>> > aijcusparse *-sub_pc_factor_mat_solver_type >>>>>>>>>>> >> cusparse >>>>>>>>>>> >> >>>>> *-sub_ksp_type >>>>>>>>>>> >> >>>>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 >>>>>>>>>>> >> -ksp_rtol 1.e-300 >>>>>>>>>>> >> >>>>> > -ksp_atol 1.e-300 >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > Note, If you use -log_view the last column >>>>>>>>>>> (rows >>>>>>>>>>> >> are the >>>>>>>>>>> >> >>>>> method like >>>>>>>>>>> >> >>>>> > MatFactorNumeric) has the percent of work >>>>>>>>>>> in the GPU. >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > Junchao: *This* implies that we have a >>>>>>>>>>> cuSparse LU >>>>>>>>>>> >> >>>>> factorization. Is >>>>>>>>>>> >> >>>>> > that correct? (I don't think we do) >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > No, we don't have cuSparse LU factorization. If you check >>>>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >>>>>>>>>>> find it >>>>>>>>>>> >> calls >>>>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJ() instead. >>>>>>>>>>> >> >>>>> > So I don't understand Chang's idea. Do you want to >>>>>>>>>>> >> make bigger >>>>>>>>>>> >> >>>>> blocks? >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > I think this one do both factorization and >>>>>>>>>>> >> solve on gpu. >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > You can check the >>>>>>>>>>> runex72_aijcusparse.sh file >>>>>>>>>>> >> in petsc >>>>>>>>>>> >> >>>>> install >>>>>>>>>>> >> >>>>> > directory, and try it your self (this >>>>>>>>>>> is only lu >>>>>>>>>>> >> >>>>> factorization >>>>>>>>>>> >> >>>>> > without >>>>>>>>>>> >> >>>>> > iterative solve). >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > Chang >>>>>>>>>>> >> >>>>> > >>>>>>>>>>> >> >>>>> > On 10/12/21 1:17 PM, Mark Adams wrote: >>>>>>>>>>> >> >>>>> > > >>>>>>>>>>> >> >>>>> > > >>>>>>>>>>> >> >>>>> > > On Tue, Oct 12, 2021 at 11:19 AM >>>>>>>>>>> Chang Liu >>>>>>>>>>> >> >>>>> >>>>>>>>>>> > >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>> > >>>>>>>>>> > >>>>>>>>>>> >> >>>>>>>>>>> >>> >>>>>>>>>>> >> >>>>> > > >>>>>>>>>> >>>>>>>>>>> >> > >>>>>>>>>>> >>>>>>>>>> >> >>>>>>>>>>> >> >>>>> >>>>>>>>>>> > >>>>>>>>>>> >> >>>>>>>>>>> >>>>> wrote: >>>>>>>>>>> >> >>>>> > > >>>>>>>>>>> >> >>>>> > > Hi Junchao, >>>>>>>>>>> >> >>>>> > > >>>>>>>>>>> >> >>>>> > > No I only needs it to be transferred >>>>>>>>>>> >> within a >>>>>>>>>>> >> >>>>> node. 
I use >>>>>>>>>>> >> >>>>> > block-Jacobi >>>>>>>>>>> >> >>>>> > > method and GMRES to solve the sparse >>>>>>>>>>> >> matrix, so each >>>>>>>>>>> >> >>>>> > direct solver will >>>>>>>>>>> >> >>>>> > > take care of a sub-block of the >>>>>>>>>>> whole >>>>>>>>>>> >> matrix. In this >>>>>>>>>>> >> >>>>> > way, I can use >>>>>>>>>>> >> >>>>> > > one >>>>>>>>>>> >> >>>>> > > GPU to solve one sub-block, which is >>>>>>>>>>> >> stored within >>>>>>>>>>> >> >>>>> one node. >>>>>>>>>>> >> >>>>> > > >>>>>>>>>>> >> >>>>> > > It was stated in the >>>>>>>>>>> documentation that >>>>>>>>>>> >> cusparse >>>>>>>>>>> >> >>>>> solver >>>>>>>>>>> >> >>>>> > is slow. >>>>>>>>>>> >> >>>>> > > However, in my test using >>>>>>>>>>> ex72.c, the >>>>>>>>>>> >> cusparse >>>>>>>>>>> >> >>>>> solver is >>>>>>>>>>> >> >>>>> > faster than >>>>>>>>>>> >> >>>>> > > mumps or superlu_dist on CPUs. >>>>>>>>>>> >> >>>>> > > >>>>>>>>>>> >> >>>>> > > >>>>>>>>>>> >> >>>>> > > Are we talking about the >>>>>>>>>>> factorization, the >>>>>>>>>>> >> solve, or >>>>>>>>>>> >> >>>>> both? >>>>>>>>>>> >> >>>>> > > >>>>>>>>>>> >> >>>>> > > We do not have an interface to >>>>>>>>>>> cuSparse's LU >>>>>>>>>>> >> >>>>> factorization (I >>>>>>>>>>> >> >>>>> > just >>>>>>>>>>> >> >>>>> > > learned that it exists a few weeks ago). >>>>>>>>>>> >> >>>>> > > Perhaps your fast "cusparse solver" is >>>>>>>>>>> >> '-pc_type lu >>>>>>>>>>> >> >>>>> -mat_type >>>>>>>>>>> >> >>>>> > > aijcusparse' ? This would be the CPU >>>>>>>>>>> >> factorization, >>>>>>>>>>> >> >>>>> which is the >>>>>>>>>>> >> >>>>> > > dominant cost. >>>>>>>>>>> >> >>>>> > > >>>>>>>>>>> >> >>>>> > > >>>>>>>>>>> >> >>>>> > > Chang >>>>>>>>>>> >> >>>>> > > >>>>>>>>>>> >> >>>>> > > On 10/12/21 10:24 AM, Junchao >>>>>>>>>>> Zhang wrote: >>>>>>>>>>> >> >>>>> > > > Hi, Chang, >>>>>>>>>>> >> >>>>> > > > For the mumps solver, we >>>>>>>>>>> usually >>>>>>>>>>> >> transfers >>>>>>>>>>> >> >>>>> matrix >>>>>>>>>>> >> >>>>> > and vector >>>>>>>>>>> >> >>>>> > > data >>>>>>>>>>> >> >>>>> > > > within a compute node. For >>>>>>>>>>> the idea you >>>>>>>>>>> >> >>>>> propose, it >>>>>>>>>>> >> >>>>> > looks like >>>>>>>>>>> >> >>>>> > > we need >>>>>>>>>>> >> >>>>> > > > to gather data within >>>>>>>>>>> >> MPI_COMM_WORLD, right? >>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>> >> >>>>> > > > Mark, I remember you said >>>>>>>>>>> >> cusparse solve is >>>>>>>>>>> >> >>>>> slow >>>>>>>>>>> >> >>>>> > and you would >>>>>>>>>>> >> >>>>> > > > rather do it on CPU. Is it right? 
>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>> >> >>>>> > > > --Junchao Zhang >>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>> >> >>>>> > > > On Mon, Oct 11, 2021 at 10:25 PM >>>>>>>>>>> >> Chang Liu via >>>>>>>>>>> >> >>>>> petsc-users >>>>>>>>>>> >> >>>>> > > > >>>>>>>>>> >>>>>>>>>>> >> > >>>>>>>>>>> >> >>>>> >>>>>>>>>> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>>> >> >>>>> > >>>>>>>>>> >>>>>>>>>>> >> > >>>>>>>>>>> >> >>>>> >>>>>>>>>> >>>>>>>>>>> >> >>>>>>>>>> >>> >>>>>>>>>> >>>>>>>>>>> >> > >>>>>>>>>>> >> >>>>> >>>>>>>>>> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>>> >> >>>>> > >>>>>>>>>> >>>>>>>>>>> >> > >>>>>>>>>>> >> >>>>> >>>>>>>>>> >>>>>>>>>>> >> >>>>>>>>>> >>>> >>>>>>>>>>> >> >>>>> > > >>>>>>>>>> >>>>>>>>>>> >> > >>>>>>>>>>> >> >>>>> >>>>>>>>>> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>>> >> >>>>> > >>>>>>>>>> >>>>>>>>>>> >> > >>>>>>>>>>> >> >>>>> >>>>>>>>>> >>>>>>>>>>> >> >>>>>>>>>> >>> >>>>>>>>>> >>>>>>>>>>> >> > >>>>>>>>>>> >> >>>>> >>>>>>>>>> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>>> >> >>>>> > >>>>>>>>>> >>>>>>>>>>> >> > >>>>>>>>>>> >> >>>>> >>>>>>>>>> >>>>>>>>>>> >> >>>>>>>>>> >>>>>> >>>>>>>>>>> >> >>>>> > > wrote: >>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>> >> >>>>> > > > Hi, >>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>> >> >>>>> > > > Currently, it is possible >>>>>>>>>>> to use >>>>>>>>>>> >> mumps >>>>>>>>>>> >> >>>>> solver in >>>>>>>>>>> >> >>>>> > PETSC with >>>>>>>>>>> >> >>>>> > > > -mat_mumps_use_omp_threads >>>>>>>>>>> >> option, so that >>>>>>>>>>> >> >>>>> > multiple MPI >>>>>>>>>>> >> >>>>> > > processes will >>>>>>>>>>> >> >>>>> > > > transfer the matrix and >>>>>>>>>>> rhs data >>>>>>>>>>> >> to the master >>>>>>>>>>> >> >>>>> > rank, and then >>>>>>>>>>> >> >>>>> > > master >>>>>>>>>>> >> >>>>> > > > rank will call mumps with >>>>>>>>>>> OpenMP >>>>>>>>>>> >> to solve >>>>>>>>>>> >> >>>>> the matrix. >>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>> >> >>>>> > > > I wonder if someone can >>>>>>>>>>> develop >>>>>>>>>>> >> similar >>>>>>>>>>> >> >>>>> option for >>>>>>>>>>> >> >>>>> > cusparse >>>>>>>>>>> >> >>>>> > > solver. >>>>>>>>>>> >> >>>>> > > > Right now, this solver >>>>>>>>>>> does not >>>>>>>>>>> >> work with >>>>>>>>>>> >> >>>>> > mpiaijcusparse. I >>>>>>>>>>> >> >>>>> > > think a >>>>>>>>>>> >> >>>>> > > > possible workaround is to >>>>>>>>>>> >> transfer all the >>>>>>>>>>> >> >>>>> matrix >>>>>>>>>>> >> >>>>> > data to one MPI >>>>>>>>>>> >> >>>>> > > > process, and then upload the >>>>>>>>>>> >> data to GPU to >>>>>>>>>>> >> >>>>> solve. >>>>>>>>>>> >> >>>>> > In this >>>>>>>>>>> >> >>>>> > > way, one can >>>>>>>>>>> >> >>>>> > > > use cusparse solver for a MPI >>>>>>>>>>> >> program. 
>>>>>>>>>>> >> >>>>> > > > Chang
>>>>>>>>>>> >> >>>>> > > > --
>>>>>>>>>>> >> >>>>> > > > Chang Liu
>>>>>>>>>>> >> >>>>> > > > Staff Research Physicist
>>>>>>>>>>> >> >>>>> > > > +1 609 243 3438
>>>>>>>>>>> >> >>>>> > > > cliu at pppl.gov
>>>>>>>>>>> >> >>>>> > > > Princeton Plasma Physics Laboratory
>>>>>>>>>>> >> >>>>> > > > 100 Stellarator Rd, Princeton NJ 08540, USA
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> cliu at pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA

From jed at jedbrown.org Fri Oct 15 13:50:49 2021
From: jed at jedbrown.org (Jed Brown)
Date: Fri, 15 Oct 2021 12:50:49 -0600
Subject: [petsc-users] TS initial guess
In-Reply-To: <1B573129-FF26-43A5-BC4C-6B51DB6D1047@petsc.dev>
References: <1B573129-FF26-43A5-BC4C-6B51DB6D1047@petsc.dev>
Message-ID: <87czo6m8ae.fsf@jedbrown.org>

Some methods have extrapolation options. For THETA (which includes backward Euler as theta=1), you can use -ts_theta_initial_guess_extrapolate. This type of extrapolation is sometimes unstable or may produce an invalid state, such as negative density. I'm assuming in your question that you're concerned about the number of Newton iterations rather than the number of Krylov iterations.

Barry Smith writes:

> For TSBEULER (the theta method) see https://petsc.org/release/docs/manualpages/TS/TSTHETA.html and look at the source code src/ts/impls/implicit/theta/theta.c for TSStep_Theta. You can use -snes_monitor_solution OPTIONS to see what the solutions of the nonlinear system look like as it solves the system.
>
> Barry
>
>> On Oct 11, 2021, at 12:26 PM, Alfredo J Duarte Gomez wrote:
>>
>> Good morning PETSC team,
>>
>> I have a working algorithm for my implicit TS integrator with a system of ODE/DAEs, but I am observing a rather high number of iterations.
>>
>> I am currently using the simplest settings of a TSBEULER and setting a constant time step.
>>
>> My question right now is whether the default settings use any sort of initial guess algorithm before every time step.
>>
>> Since I have seen that the time step adapter calculates the Local Truncation Error, it should be possible to use an extrapolation of arbitrary order of accuracy as an initial guess for every time step, right? Can someone indicate how I would be able to use that?
>>
>> Additionally, it would be very helpful to take a look at that initial guess, is it possible to use any existing function to calculate it either in the PreStep or PostStep function to visualize it?
>> >> Thank you, >> >> -- >> Alfredo Duarte >> Graduate Research Assistant >> The University of Texas at Austin From knepley at gmail.com Fri Oct 15 15:30:12 2021 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 15 Oct 2021 16:30:12 -0400 Subject: [petsc-users] Periodic boundary conditions in DMPlex In-Reply-To: <64787e00-820b-0c83-02f8-854569e4df9e@onera.fr> References: <9a6f1a40-142e-f2a5-2101-2b60074b705e@onera.fr> <71784fff-35eb-a129-3609-004e5e596575@onera.fr> <8ff9d951-6958-aa1a-b875-e7488bb6b30b@onera.fr> <64787e00-820b-0c83-02f8-854569e4df9e@onera.fr> Message-ID: On Fri, Oct 15, 2021 at 10:16 AM Pierre Seize wrote: > I read everything again, I think I did not understand you at first. The > first solution is to modify the DAG, so that the rightmost cell is linked > to the leftmost face, right ? To do that, do I have to manually edit the > DAG (the mesh is read from a file) ? > Yes, the DAG would be modified if you want it for some particular mesh that we cannot read automatically. For example, we can read periodic GMsh meshes. > If so, the mesh connectivity is like the one of a torus, then how does it > work with the cells/faces coordinates ? > You let the coordinate field be in a DG space, so that it can have jumps. > Now I think the second method may be more straightforward. What's the idea > ? Get the mapping with DMGetLocalToGlobalMapping, then create the mapping > corresponding to the periodicity with ISLocalToGlobalMappingCreate, and > finally ISLocalToGlobalMappingConcatenate ? I'm not sure this is the way, > and I did not find something like DMSetLocalToGlobalMapping to restore the > modified mapping. > It is more complicated. We make the LocalToGlobalMap by looking at the PetscSection (essentially if gives function space information) and deciding which unknowns are removed from the global space. You would need to decide that unknowns constrained by periodicity are not present in the global space. Actually, this is not hard. You just mark them as constrained in the PetscSection, and all the layout functions will function correctly. However, then the LocalToGlobalMap will not be exactly right because the constrained unknowns will not be filled in (just like Dirichlet conditions). You would augment the map so that it fills those in by looking up their periodic counterparts. Jed has argued for this type of periodicity. To me, the first kind is much more straightforward, but maybe this is because I find the topology code more clear. Thanks, Matt > Pierre > > On 15/10/21 15:33, Pierre Seize wrote: > > When I first tried to handle the periodicity, I found the > DMPlexCreateBoxMesh function (I cannot find the cylinder one). > > From reading the sources, I understand that we do some work either in > DMPlexCreateCubeMesh_Internal or with DMSetPeriodicity. > > I tried to use DMSetPeriodicity before, for example with a 2x2 box on > length 10. I did something like: > const PetscReal maxCell[] = {2, 2}; > const PetscReal L[] = {10, 10}; > const DMBoundaryType bd[] = {DM_BOUNDARY_PERIODIC, DM_BOUNDARY_PERIODIC}; > DMSetPeriodicity(dm, PETSC_TRUE, maxCell, L, bd); > // or: > DMSetPeriodicity(dm, PETSC_TRUE, NULL, L, bd); > > but it did not work: > VecSet(X, 1); > DMGetLocalVector(dm, &locX); > VecZeroEntries(locX); > DMGlobalToLocalBegin(dm, X, INSERT_VALUES, locX); > DMGlobalToLocalEnd(dm, X, INSERT_VALUES, locX); > VecView(locX, PETSC_VIEWER_STDOUT_WORLD); > > but the ghost cells values are all 0, only the real cells are 1. 
So I > guess DMSetPeriodicity alone is not sufficient to handle the periodicity. > Is there a way to do what I want ? That is set up my DMPlex in a way that > DMGlobalToLocalBegin/DMGlobalToLocalEnd do exchange values between procs > AND exchange the periodic values? > > > Thanks for the help > > > Pierre > > On 15/10/21 14:03, Matthew Knepley wrote: > > On Fri, Oct 15, 2021 at 7:31 AM Pierre Seize > wrote: > >> It makes sense, thank you. In fact, both ways seems better than my way. >> The first one looks the most straightforward. Unfortunately I do not know >> how to implement either of them. Could you please direct me to the >> corresponding PETSc functions ? >> > The first way is implemented for example in DMPlexCreateBoxMesh() and > DMPlexCreateCylinderMesh(). The second is not implemented since > there did not seem to be a general way to do it. I would help if you > wanted to try coding it up. > > Thanks, > > Matt > >> Pierre >> >> On 15/10/21 13:25, Matthew Knepley wrote: >> >> On Fri, Oct 15, 2021 at 7:08 AM Pierre Seize >> wrote: >> >>> Hi, >>> >>> I'm writing a code using PETSc to solve NS equations with FV on an >>> unstructured mesh. Therefore I use DMPlex. >>> >>> Regarding periodicity, I manage to implement it this way: >>> >>> - for each couple of boundaries that is linked with periodicity, I >>> create a buffer vector with an ISLocalToGlobalMapping >>> >>> - then, when I need to fill the ghost cells corresponding to the >>> periodicity, the i "true" cell of the local vector fills the buffer >>> vector on location i with VecSetValuesBlockedLocal, then >>> VecAssemblyBegin/VecAssemblyEnd ensure each value is send to the correct >>> location thanks to the mapping, then the i "ghost" cell of the local >>> vector reads the vector on location i to get it's value. >>> >>> >>> It works, but it seems to me there is a better way, with maybe PetscSF, >>> VecScatter, or something I don't know yet. Does anyone have any advice ? >>> >> >> There are at least two other ways to handle this. First, the method that >> is advocated in >> Plex is to actually make a periodic geometry, meaning connect the cells >> that are meant >> to be connected. Then, if you partition with overlap = 1, >> PetscGlobalToLocal() will fill in >> these cell values automatically. >> >> Second, you could use a non-periodic geometry, but alter the >> LocalToGlobal map such >> that the cells gets filled in anyway. Many codes use this scheme and it >> is straightforward >> with Plex just by augmenting the map it makes automatically. >> >> Does this make sense? >> >> Thanks, >> >> Matt >> >> >>> Pierre Seize >>> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... 
From zjorti at lanl.gov Fri Oct 15 16:07:15 2021
From: zjorti at lanl.gov (Jorti, Zakariae)
Date: Fri, 15 Oct 2021 21:07:15 +0000
Subject: [petsc-users] Finite difference approximation of Jacobian
Message-ID: <231abd15aab544f9850826cb437366f7@lanl.gov>

Hello,

Does the Jacobian approximation using coloring and finite differencing of the function evaluation work in DMStag?
Thank you.
Best regards,

Zakariae

From cliu at pppl.gov Sat Oct 16 20:12:47 2021
From: cliu at pppl.gov (Chang Liu)
Date: Sat, 16 Oct 2021 21:12:47 -0400
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: <6D4D8741-3F52-41BF-B2A3-AFBA09443755@petsc.dev>
References: <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> <6D4D8741-3F52-41BF-B2A3-AFBA09443755@petsc.dev>
Message-ID: 

Hi Barry, Pierre and Junchao,

I spent some time to find the reason for the error. I think it is caused by some compatibility issues between telescope and cusparse.

1. In PCTelescopeMatCreate_default in telescope.c, it calls MatCreateMPIMatConcatenateSeqMat to concat seqmat to mpimat, but this function is from mpiaij.c and will set the mat type to mpiaij, even if the original matrix is mpiaijcusparse.

2. Similar issue exists in PCTelescopeSetUp_default, where the vector is set to type mpi rather than mpicuda.

I have fixed the issue using the following patch. After applying it, telescope and cusparse work as expected.

diff --git a/src/ksp/pc/impls/telescope/telescope.c b/src/ksp/pc/impls/telescope/telescope.c
index 893febb055..d3f687eff9 100644
--- a/src/ksp/pc/impls/telescope/telescope.c
+++ b/src/ksp/pc/impls/telescope/telescope.c
@@ -159,6 +159,7 @@ PetscErrorCode PCTelescopeSetUp_default(PC pc,PC_Telescope sred)
     ierr = VecCreate(subcomm,&xred);CHKERRQ(ierr);
     ierr = VecSetSizes(xred,PETSC_DECIDE,M);CHKERRQ(ierr);
     ierr = VecSetBlockSize(xred,bs);CHKERRQ(ierr);
+    ierr = VecSetType(xred,((PetscObject)x)->type_name);CHKERRQ(ierr);
     ierr = VecSetFromOptions(xred);CHKERRQ(ierr);
     ierr = VecGetLocalSize(xred,&m);CHKERRQ(ierr);
   }
diff --git a/src/mat/impls/aij/mpi/mpiaij.c b/src/mat/impls/aij/mpi/mpiaij.c
index 36077002db..ac374e07eb 100644
--- a/src/mat/impls/aij/mpi/mpiaij.c
+++ b/src/mat/impls/aij/mpi/mpiaij.c
@@ -4486,6 +4486,7 @@ PetscErrorCode MatCreateMPIMatConcatenateSeqMat_MPIAIJ(MPI_Comm comm,Mat inmat,P
   PetscInt    m,N,i,rstart,nnz,Ii;
   PetscInt    *indx;
   PetscScalar *values;
+  PetscBool   isseqaijcusparse;

   PetscFunctionBegin;
   ierr = MatGetSize(inmat,&m,&N);CHKERRQ(ierr);
@@ -4513,7 +4514,12 @@ PetscErrorCode MatCreateMPIMatConcatenateSeqMat_MPIAIJ(MPI_Comm comm,Mat inmat,P
     ierr = MatSetSizes(*outmat,m,n,PETSC_DETERMINE,PETSC_DETERMINE);CHKERRQ(ierr);
     ierr = MatGetBlockSizes(inmat,&bs,&cbs);CHKERRQ(ierr);
     ierr = MatSetBlockSizes(*outmat,bs,cbs);CHKERRQ(ierr);
-    ierr = MatSetType(*outmat,MATAIJ);CHKERRQ(ierr);
+    ierr = PetscObjectBaseTypeCompare((PetscObject)inmat,MATSEQAIJCUSPARSE,&isseqaijcusparse);CHKERRQ(ierr);
+    if (isseqaijcusparse) {
+      ierr = MatSetType(*outmat,MATAIJCUSPARSE);CHKERRQ(ierr);
+    } else {
+      ierr = MatSetType(*outmat,MATAIJ);CHKERRQ(ierr);
+    }
     ierr = MatSeqAIJSetPreallocation(*outmat,0,dnz);CHKERRQ(ierr);
     ierr =
Please help review it and merge it into master if possible.

Regards,

Chang

On 10/15/21 1:27 PM, Barry Smith wrote:
>
>    So the only difference is between
> -sub_telescope_pc_factor_mat_solver_type cusparse and
> -sub_telescope_pc_factor_mat_solver_type mumps ?
>
>    Try without the -sub_telescope_pc_factor_mat_solver_type cusparse
> and then PETSc will just use the CPU solvers, I want to see if that
> works, it should. If it works then there is perhaps something specific
> about the PCTELESCOPE and the cusparse solver, for example the right
> hand side array values may never get to the GPU.
>
>    Barry
>
>> On Oct 14, 2021, at 10:11 PM, Chang Liu wrote:
>>
>> For comparison, here is the output using mumps instead of cusparse
>>
>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400
>> -ksp_view -ksp_monitor_true_residual -pc_type bjacobi
>> -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse
>> -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type
>> preonly -sub_telescope_pc_type lu
>> -sub_telescope_pc_factor_mat_solver_type mumps
>> -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type
>> contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9
>
> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400
> -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks
> 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope
> -sub_ksp_type preonly -sub_telescope_ksp_type preonly
> -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type
> cusparse -sub_pc_telescope_reduction_factor 4
> -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol
> 1.e-20 -ksp_atol 1.e-9
>
>
>>  0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm
>> 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00
>>  1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm
>> 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02
>>  2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm
>> 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02
>>  3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm
>> 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02
>>  4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm
>> 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02
>>  5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm
>> 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02
>>  6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm
>> 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02
>>  7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm
>> 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03
>>  8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm
>> 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03
>>  9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm
>> 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03
>> 10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm
>> 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03
>> 11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm
>> 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03
>> 12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid norm
>> 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03
>> 13 KSP unpreconditioned resid norm
1.885892030223e-01 true resid norm >> 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 >> 14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid norm >> 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 >> 15 KSP unpreconditioned resid norm 1.638219366731e-01 true resid norm >> 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 >> 16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid norm >> 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 >> 17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid norm >> 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 >> 18 KSP unpreconditioned resid norm 1.289673236836e-01 true resid norm >> 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 >> 19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid norm >> 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 >> 20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid norm >> 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 >> 21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid norm >> 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 >> 22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid norm >> 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 >> 23 KSP unpreconditioned resid norm 7.563496650115e-02 true resid norm >> 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 >> 24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid norm >> 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 >> 25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid norm >> 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 >> 26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid norm >> 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 >> 27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid norm >> 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 >> 28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid norm >> 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 >> 29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid norm >> 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 >> 30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid norm >> 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 >> 31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid norm >> 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 >> 32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid norm >> 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 >> 33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid norm >> 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 >> 34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid norm >> 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 >> 35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid norm >> 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 >> 36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid norm >> 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 >> 37 KSP unpreconditioned resid norm 4.982347757923e-03 true resid norm >> 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 >> 38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid norm >> 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 >> 39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid norm >> 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 >> 40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid norm >> 3.490003351128e-03 ||r(i)||/||b|| 
8.692472496776e-05 >> 41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid norm >> 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 >> 42 KSP unpreconditioned resid norm 2.772928845284e-03 true resid norm >> 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 >> 43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid norm >> 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 >> 44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid norm >> 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 >> 45 KSP unpreconditioned resid norm 2.086800523919e-03 true resid norm >> 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 >> 46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid norm >> 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 >> 47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid norm >> 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 >> 48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid norm >> 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 >> 49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid norm >> 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 >> 50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid norm >> 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 >> 51 KSP unpreconditioned resid norm 1.375985053014e-03 true resid norm >> 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 >> 52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid norm >> 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 >> 53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid norm >> 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 >> 54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid norm >> 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 >> 55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid norm >> 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 >> 56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid norm >> 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 >> 57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid norm >> 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 >> 58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid norm >> 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 >> 59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid norm >> 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 >> 60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid norm >> 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 >> 61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid norm >> 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 >> 62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid norm >> 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 >> 63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid norm >> 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 >> 64 KSP unpreconditioned resid norm 2.961034697265e-04 true resid norm >> 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 >> 65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid norm >> 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 >> 66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid norm >> 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 >> 67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid norm >> 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 >> 68 KSP unpreconditioned resid norm 
2.008438265031e-04 true resid norm >> 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 >> 69 KSP unpreconditioned resid norm 1.838732863386e-04 true resid norm >> 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 >> 70 KSP unpreconditioned resid norm 1.723786027645e-04 true resid norm >> 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 >> 71 KSP unpreconditioned resid norm 1.580945192204e-04 true resid norm >> 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 >> 72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid norm >> 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 >> 73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid norm >> 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 >> 74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid norm >> 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 >> 75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid norm >> 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 >> 76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid norm >> 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 >> 77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid norm >> 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 >> 78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid norm >> 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 >> 79 KSP unpreconditioned resid norm 8.896901670359e-05 true resid norm >> 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 >> 80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid norm >> 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 >> 81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid norm >> 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 >> 82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid norm >> 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 >> 83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid norm >> 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 >> 84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid norm >> 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 >> 85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid norm >> 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 >> 86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid norm >> 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 >> 87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid norm >> 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 >> 88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid norm >> 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 >> 89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid norm >> 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 >> 90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid norm >> 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 >> 91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid norm >> 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 >> 92 KSP unpreconditioned resid norm 1.707071292484e-05 true resid norm >> 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 >> 93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid norm >> 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 >> 94 KSP unpreconditioned resid norm 1.219393542993e-05 true resid norm >> 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 >> 95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid norm >> 1.059996963303e-05 ||r(i)||/||b|| 
2.640110487917e-07 >> 96 KSP unpreconditioned resid norm 9.099659872548e-06 true resid norm >> 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 >> 97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid norm >> 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 >> 98 KSP unpreconditioned resid norm 7.167226146744e-06 true resid norm >> 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 >> 99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid norm >> 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 >> 100 KSP unpreconditioned resid norm 5.767783600111e-06 true resid norm >> 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 >> 101 KSP unpreconditioned resid norm 5.261057430584e-06 true resid norm >> 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 >> 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid norm >> 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 >> 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid norm >> 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 >> 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid norm >> 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 >> 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid norm >> 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 >> 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid norm >> 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 >> 107 KSP unpreconditioned resid norm 3.234218880987e-06 true resid norm >> 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 >> 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid norm >> 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 >> 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid norm >> 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 >> 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid norm >> 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 >> 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid norm >> 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 >> 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid norm >> 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 >> 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid norm >> 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 >> 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid norm >> 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 >> 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid norm >> 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 >> 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid norm >> 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 >> 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid norm >> 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 >> 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid norm >> 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 >> 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid norm >> 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 >> 120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid norm >> 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 >> 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid norm >> 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 >> 122 KSP unpreconditioned resid norm 7.945760150897e-07 true resid norm >> 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 >> 123 KSP unpreconditioned 
resid norm 7.141240839013e-07 true resid norm >> 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 >> 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid norm >> 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 >> 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid norm >> 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 >> 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid norm >> 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 >> 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid norm >> 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 >> 128 KSP unpreconditioned resid norm 4.253264691112e-07 true resid norm >> 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 >> 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid norm >> 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 >> 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid norm >> 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 >> 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid norm >> 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 >> 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid norm >> 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 >> 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid norm >> 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 >> 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid norm >> 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 >> 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid norm >> 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 >> 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid norm >> 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 >> 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid norm >> 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 >> 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid norm >> 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 >> 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid norm >> 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 >> 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid norm >> 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 >> 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid norm >> 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 >> 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid norm >> 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 >> 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid norm >> 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 >> 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid norm >> 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 >> 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid norm >> 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 >> 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid norm >> 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 >> 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid norm >> 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 >> 148 KSP unpreconditioned resid norm 5.690132597004e-08 true resid norm >> 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 >> 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid norm >> 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 >> 150 KSP unpreconditioned resid norm 4.625371062660e-08 true resid 
norm >> 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 >> 151 KSP unpreconditioned resid norm 4.349049084805e-08 true resid norm >> 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 >> 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid norm >> 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 >> 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid norm >> 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 >> 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid norm >> 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 >> 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid norm >> 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 >> 156 KSP unpreconditioned resid norm 2.128504065757e-08 true resid norm >> 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 >> 157 KSP unpreconditioned resid norm 1.879248809429e-08 true resid norm >> 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 >> 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid norm >> 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 >> 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid norm >> 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 >> 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid norm >> 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 >> 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid norm >> 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 >> 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid norm >> 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 >> 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid norm >> 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 >> 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid norm >> 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 >> 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid norm >> 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 >> 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid norm >> 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 >> 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid norm >> 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 >> 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid norm >> 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 >> 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid norm >> 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 >> 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid norm >> 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 >> 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid norm >> 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 >> 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid norm >> 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 >> 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid norm >> 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 >> 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid norm >> 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 >> 175 KSP unpreconditioned resid norm 3.924177535665e-09 true resid norm >> 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 >> 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid norm >> 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 >> 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid norm >> 3.083873352938e-09 ||r(i)||/||b|| 
7.680933686007e-11 >> 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid norm >> 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 >> 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid norm >> 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 >> 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid norm >> 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 >> 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid norm >> 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 >> 182 KSP unpreconditioned resid norm 1.983545827983e-09 true resid norm >> 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 >> 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid norm >> 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 >> 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid norm >> 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 >> 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid norm >> 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 >> 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid norm >> 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 >> 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid norm >> 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 >> 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid norm >> 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 >> 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid norm >> 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 >> KSP Object: 16 MPI processes >> ?type: fgmres >> ???restart=30, using Classical (unmodified) Gram-Schmidt >> Orthogonalization with no iterative refinement >> ???happy breakdown tolerance 1e-30 >> ?maximum iterations=2000, initial guess is zero >> ?tolerances: ?relative=1e-20, absolute=1e-09, divergence=10000. >> ?right preconditioning >> ?using UNPRECONDITIONED norm type for convergence test >> PC Object: 16 MPI processes >> ?type: bjacobi >> ???number of blocks = 4 >> ???Local solver information for first block is in the following KSP >> and PC objects on rank 0: >> ???Use -ksp_view ::ascii_info_detail to display information for all blocks >> ?KSP Object: (sub_) 4 MPI processes >> ???type: preonly >> ???maximum iterations=10000, initial guess is zero >> ???tolerances: ?relative=1e-05, absolute=1e-50, divergence=10000. >> ???left preconditioning >> ???using NONE norm type for convergence test >> ?PC Object: (sub_) 4 MPI processes >> ???type: telescope >> ?????petsc subcomm: parent comm size reduction factor = 4 >> ?????petsc subcomm: parent_size = 4 , subcomm_size = 1 >> ?????petsc subcomm type = contiguous >> ???linear system matrix = precond matrix: >> ???Mat Object: (sub_) 4 MPI processes >> ?????type: mpiaij >> ?????rows=40200, cols=40200 >> ?????total: nonzeros=199996, allocated nonzeros=203412 >> ?????total number of mallocs used during MatSetValues calls=0 >> ???????not using I-node (on process 0) routines >> ???????setup type: default >> ???????Parent DM object: NULL >> ???????Sub DM object: NULL >> ???????KSP Object: ??(sub_telescope_) ??1 MPI processes >> ?????????type: preonly >> ?????????maximum iterations=10000, initial guess is zero >> ?????????tolerances: ?relative=1e-05, absolute=1e-50, divergence=10000. 
>> ?????????left preconditioning >> ?????????using NONE norm type for convergence test >> ???????PC Object: ??(sub_telescope_) ??1 MPI processes >> ?????????type: lu >> ???????????out-of-place factorization >> ???????????tolerance for zero pivot 2.22045e-14 >> ???????????matrix ordering: external >> ???????????factor fill ratio given 0., needed 0. >> ?????????????Factored matrix follows: >> ???????????????Mat Object: ??1 MPI processes >> ?????????????????type: mumps >> ?????????????????rows=40200, cols=40200 >> ?????????????????package used to perform factorization: mumps >> ?????????????????total: nonzeros=1849788, allocated nonzeros=1849788 >> ???????????????????MUMPS run parameters: >> ?????????????????????SYM (matrix type): ??????????????????0 >> ?????????????????????PAR (host participation): ???????????1 >> ?????????????????????ICNTL(1) (output for error): ????????6 >> ?????????????????????ICNTL(2) (output of diagnostic msg): 0 >> ?????????????????????ICNTL(3) (output for global info): ??0 >> ?????????????????????ICNTL(4) (level of printing): ???????0 >> ?????????????????????ICNTL(5) (input mat struct): ????????0 >> ?????????????????????ICNTL(6) (matrix prescaling): ???????7 >> ?????????????????????ICNTL(7) (sequential matrix ordering):7 >> ?????????????????????ICNTL(8) (scaling strategy): ???????77 >> ?????????????????????ICNTL(10) (max num of refinements): ?0 >> ?????????????????????ICNTL(11) (error analysis): ?????????0 >> ?????????????????????ICNTL(12) (efficiency control): ???????1 >> ?????????????????????ICNTL(13) (sequential factorization of the root >> node): ?0 >> ?????????????????????ICNTL(14) (percentage of estimated workspace >> increase): 20 >> ?????????????????????ICNTL(18) (input mat struct): ???????0 >> ?????????????????????ICNTL(19) (Schur complement info): ???????0 >> ?????????????????????ICNTL(20) (RHS sparse pattern): ???????0 >> ?????????????????????ICNTL(21) (solution struct): ???????0 >> ?????????????????????ICNTL(22) (in-core/out-of-core facility): ???????0 >> ?????????????????????ICNTL(23) (max size of memory can be allocated >> locally):0 >> ?????????????????????ICNTL(24) (detection of null pivot rows): ???????0 >> ?????????????????????ICNTL(25) (computation of a null space basis): >> ???????0 >> ?????????????????????ICNTL(26) (Schur options for RHS or solution): >> ???????0 >> ?????????????????????ICNTL(27) (blocking size for multiple RHS): >> ???????-32 >> ?????????????????????ICNTL(28) (use parallel or sequential ordering): >> ???????1 >> ?????????????????????ICNTL(29) (parallel ordering): ???????0 >> ?????????????????????ICNTL(30) (user-specified set of entries in >> inv(A)): ???0 >> ?????????????????????ICNTL(31) (factors is discarded in the solve >> phase): ???0 >> ?????????????????????ICNTL(33) (compute determinant): ???????0 >> ?????????????????????ICNTL(35) (activate BLR based factorization): >> ???????0 >> ?????????????????????ICNTL(36) (choice of BLR factorization variant): >> ???????0 >> ?????????????????????ICNTL(38) (estimated compression rate of LU >> factors): ??333 >> ?????????????????????CNTL(1) (relative pivoting threshold): ?????0.01 >> ?????????????????????CNTL(2) (stopping criterion of refinement): >> 1.49012e-08 >> ?????????????????????CNTL(3) (absolute pivoting threshold): ?????0. >> ?????????????????????CNTL(4) (value of static pivoting): ????????-1. >> ?????????????????????CNTL(5) (fixation for null pivots): ????????0. >> ?????????????????????CNTL(7) (dropping parameter for BLR): ??????0. 
>> ?????????????????????RINFO(1) (local estimated flops for the >> elimination after analysis): >> ???????????????????????[0] 1.45525e+08 >> ?????????????????????RINFO(2) (local estimated flops for the assembly >> after factorization): >> ???????????????????????[0] ?2.89397e+06 >> ?????????????????????RINFO(3) (local estimated flops for the >> elimination after factorization): >> ???????????????????????[0] ?1.45525e+08 >> ?????????????????????INFO(15) (estimated size of (in MB) MUMPS >> internal data for running numerical factorization): >> ?????????????????????[0] 29 >> ?????????????????????INFO(16) (size of (in MB) MUMPS internal data >> used during numerical factorization): >> ???????????????????????[0] 29 >> ?????????????????????INFO(23) (num of pivots eliminated on this >> processor after factorization): >> ???????????????????????[0] 40200 >> ?????????????????????RINFOG(1) (global estimated flops for the >> elimination after analysis): 1.45525e+08 >> ?????????????????????RINFOG(2) (global estimated flops for the >> assembly after factorization): 2.89397e+06 >> ?????????????????????RINFOG(3) (global estimated flops for the >> elimination after factorization): 1.45525e+08 >> ?????????????????????(RINFOG(12) RINFOG(13))*2^INFOG(34) >> (determinant): (0.,0.)*(2^0) >> ?????????????????????INFOG(3) (estimated real workspace for factors on >> all processors after analysis): 1849788 >> ?????????????????????INFOG(4) (estimated integer workspace for factors >> on all processors after analysis): 879986 >> ?????????????????????INFOG(5) (estimated maximum front size in the >> complete tree): 282 >> ?????????????????????INFOG(6) (number of nodes in the complete tree): >> 23709 >> ?????????????????????INFOG(7) (ordering option effectively used after >> analysis): 5 >> ?????????????????????INFOG(8) (structural symmetry in percent of the >> permuted matrix after analysis): 100 >> ?????????????????????INFOG(9) (total real/complex workspace to store >> the matrix factors after factorization): 1849788 >> ?????????????????????INFOG(10) (total integer space store the matrix >> factors after factorization): 879986 >> ?????????????????????INFOG(11) (order of largest frontal matrix after >> factorization): 282 >> ?????????????????????INFOG(12) (number of off-diagonal pivots): 0 >> ?????????????????????INFOG(13) (number of delayed pivots after >> factorization): 0 >> ?????????????????????INFOG(14) (number of memory compress after >> factorization): 0 >> ?????????????????????INFOG(15) (number of steps of iterative >> refinement after solution): 0 >> ?????????????????????INFOG(16) (estimated size (in MB) of all MUMPS >> internal data for factorization after analysis: value on the most >> memory consuming processor): 29 >> ?????????????????????INFOG(17) (estimated size of all MUMPS internal >> data for factorization after analysis: sum over all processors): 29 >> ?????????????????????INFOG(18) (size of all MUMPS internal data >> allocated during factorization: value on the most memory consuming >> processor): 29 >> ?????????????????????INFOG(19) (size of all MUMPS internal data >> allocated during factorization: sum over all processors): 29 >> ?????????????????????INFOG(20) (estimated number of entries in the >> factors): 1849788 >> ?????????????????????INFOG(21) (size in MB of memory effectively used >> during factorization - value on the most memory consuming processor): 26 >> ?????????????????????INFOG(22) (size in MB of memory effectively used >> during factorization - sum over all processors): 26 >> 
?????????????????????INFOG(23) (after analysis: value of ICNTL(6) >> effectively used): 0 >> ?????????????????????INFOG(24) (after analysis: value of ICNTL(12) >> effectively used): 1 >> ?????????????????????INFOG(25) (after factorization: number of pivots >> modified by static pivoting): 0 >> ?????????????????????INFOG(28) (after factorization: number of null >> pivots encountered): 0 >> ?????????????????????INFOG(29) (after factorization: effective number >> of entries in the factors (sum over all processors)): 1849788 >> ?????????????????????INFOG(30, 31) (after solution: size in Mbytes of >> memory used during solution phase): 29, 29 >> ?????????????????????INFOG(32) (after analysis: type of analysis done): 1 >> ?????????????????????INFOG(33) (value used for ICNTL(8)): 7 >> ?????????????????????INFOG(34) (exponent of the determinant if >> determinant is requested): 0 >> ?????????????????????INFOG(35) (after factorization: number of entries >> taking into account BLR factor compression - sum over all processors): >> 1849788 >> ?????????????????????INFOG(36) (after analysis: estimated size of all >> MUMPS internal data for running BLR in-core - value on the most memory >> consuming processor): 0 >> ?????????????????????INFOG(37) (after analysis: estimated size of all >> MUMPS internal data for running BLR in-core - sum over all processors): 0 >> ?????????????????????INFOG(38) (after analysis: estimated size of all >> MUMPS internal data for running BLR out-of-core - value on the most >> memory consuming processor): 0 >> ?????????????????????INFOG(39) (after analysis: estimated size of all >> MUMPS internal data for running BLR out-of-core - sum over all >> processors): 0 >> ?????????linear system matrix = precond matrix: >> ?????????Mat Object: ??1 MPI processes >> ???????????type: seqaijcusparse >> ???????????rows=40200, cols=40200 >> ???????????total: nonzeros=199996, allocated nonzeros=199996 >> ???????????total number of mallocs used during MatSetValues calls=0 >> ?????????????not using I-node routines >> ?linear system matrix = precond matrix: >> ?Mat Object: 16 MPI processes >> ???type: mpiaijcusparse >> ???rows=160800, cols=160800 >> ???total: nonzeros=802396, allocated nonzeros=1608000 >> ???total number of mallocs used during MatSetValues calls=0 >> ?????not using I-node (on process 0) routines >> Norm of error 9.11684e-07 iterations 189 >> >> Chang >> >> >> >> On 10/14/21 10:10 PM, Chang Liu wrote: >>> Hi Barry, >>> No problem. Here is the output. It seems that the resid norm >>> calculation is incorrect. >>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 >>> -ksp_view -ksp_monitor_true_residual -pc_type bjacobi >>> -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse >>> -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type >>> preonly -sub_telescope_pc_type lu >>> -sub_telescope_pc_factor_mat_solver_type cusparse >>> -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type >>> contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >>> ? 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid >>> norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>> ? 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid >>> norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>> KSP Object: 16 MPI processes >>> ? type: fgmres >>> ??? restart=30, using Classical (unmodified) Gram-Schmidt >>> Orthogonalization with no iterative refinement >>> ??? happy breakdown tolerance 1e-30 >>> ? 
maximum iterations=2000, initial guess is zero >>> ? tolerances:? relative=1e-20, absolute=1e-09, divergence=10000. >>> ? right preconditioning >>> ? using UNPRECONDITIONED norm type for convergence test >>> PC Object: 16 MPI processes >>> ? type: bjacobi >>> ??? number of blocks = 4 >>> ??? Local solver information for first block is in the following KSP >>> and PC objects on rank 0: >>> ??? Use -ksp_view ::ascii_info_detail to display information for all >>> blocks >>> ? KSP Object: (sub_) 4 MPI processes >>> ??? type: preonly >>> ??? maximum iterations=10000, initial guess is zero >>> ??? tolerances:? relative=1e-05, absolute=1e-50, divergence=10000. >>> ??? left preconditioning >>> ??? using NONE norm type for convergence test >>> ? PC Object: (sub_) 4 MPI processes >>> ??? type: telescope >>> ????? petsc subcomm: parent comm size reduction factor = 4 >>> ????? petsc subcomm: parent_size = 4 , subcomm_size = 1 >>> ????? petsc subcomm type = contiguous >>> ??? linear system matrix = precond matrix: >>> ??? Mat Object: (sub_) 4 MPI processes >>> ????? type: mpiaij >>> ????? rows=40200, cols=40200 >>> ????? total: nonzeros=199996, allocated nonzeros=203412 >>> ????? total number of mallocs used during MatSetValues calls=0 >>> ??????? not using I-node (on process 0) routines >>> ??????? setup type: default >>> ??????? Parent DM object: NULL >>> ??????? Sub DM object: NULL >>> ??????? KSP Object:?? (sub_telescope_)?? 1 MPI processes >>> ????????? type: preonly >>> ????????? maximum iterations=10000, initial guess is zero >>> ????????? tolerances:? relative=1e-05, absolute=1e-50, divergence=10000. >>> ????????? left preconditioning >>> ????????? using NONE norm type for convergence test >>> ??????? PC Object:?? (sub_telescope_)?? 1 MPI processes >>> ????????? type: lu >>> ??????????? out-of-place factorization >>> ??????????? tolerance for zero pivot 2.22045e-14 >>> ??????????? matrix ordering: nd >>> ??????????? factor fill ratio given 5., needed 8.62558 >>> ????????????? Factored matrix follows: >>> ??????????????? Mat Object:?? 1 MPI processes >>> ????????????????? type: seqaijcusparse >>> ????????????????? rows=40200, cols=40200 >>> ????????????????? package used to perform factorization: cusparse >>> ????????????????? total: nonzeros=1725082, allocated nonzeros=1725082 >>> ??????????????????? not using I-node routines >>> ????????? linear system matrix = precond matrix: >>> ????????? Mat Object:?? 1 MPI processes >>> ??????????? type: seqaijcusparse >>> ??????????? rows=40200, cols=40200 >>> ??????????? total: nonzeros=199996, allocated nonzeros=199996 >>> ??????????? total number of mallocs used during MatSetValues calls=0 >>> ????????????? not using I-node routines >>> ? linear system matrix = precond matrix: >>> ? Mat Object: 16 MPI processes >>> ??? type: mpiaijcusparse >>> ??? rows=160800, cols=160800 >>> ??? total: nonzeros=802396, allocated nonzeros=1608000 >>> ??? total number of mallocs used during MatSetValues calls=0 >>> ????? not using I-node (on process 0) routines >>> Norm of error 400.999 iterations 1 >>> Chang >>> On 10/14/21 9:47 PM, Barry Smith wrote: >>>> >>>> ?? Chang, >>>> >>>> ??? Sorry I did not notice that one. Please run that with -ksp_view >>>> -ksp_monitor_true_residual so we can see exactly how options are >>>> interpreted and solver used. At a glance it looks ok but something >>>> must be wrong to get the wrong answer. >>>> >>>> ?? 
Barry >>>> >>>>> On Oct 14, 2021, at 6:02 PM, Chang Liu >>>> > wrote: >>>>> >>>>> Hi Barry, >>>>> >>>>> That is exactly what I was doing in the second example, in which >>>>> the preconditioner works but the GMRES does not. >>>>> >>>>> Chang >>>>> >>>>> On 10/14/21 5:15 PM, Barry Smith wrote: >>>>>> ?? You need to use the PCTELESCOPE inside the block Jacobi, not >>>>>> outside it. So something like -pc_type bjacobi -sub_pc_type >>>>>> telescope -sub_telescope_pc_type lu >>>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu >>>>>> > wrote: >>>>>>> >>>>>>> Hi Pierre, >>>>>>> >>>>>>> I wonder if the trick of PCTELESCOPE only works for >>>>>>> preconditioner and not for the solver. I have done some tests, >>>>>>> and find that for solving a small matrix using >>>>>>> -telescope_ksp_type preonly, it does work for GPU with multiple >>>>>>> MPI processes. However, for bjacobi and gmres, it does not work. >>>>>>> >>>>>>> The command line options I used for small matrix is like >>>>>>> >>>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short >>>>>>> -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu >>>>>>> -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type >>>>>>> preonly -pc_telescope_reduction_factor 4 >>>>>>> >>>>>>> which gives the correct output. For iterative solver, I tried >>>>>>> >>>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short >>>>>>> -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type >>>>>>> aijcusparse -sub_pc_type telescope -sub_ksp_type preonly >>>>>>> -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu >>>>>>> -sub_telescope_pc_factor_mat_solver_type cusparse >>>>>>> -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol >>>>>>> 1.e-9 -ksp_atol 1.e-20 >>>>>>> >>>>>>> for large matrix. The output is like >>>>>>> >>>>>>> ? 0 KSP Residual norm 40.1497 >>>>>>> ? 1 KSP Residual norm < 1.e-11 >>>>>>> Norm of error 400.999 iterations 1 >>>>>>> >>>>>>> So it seems to call a direct solver instead of an iterative one. >>>>>>> >>>>>>> Can you please help check these options? >>>>>>> >>>>>>> Chang >>>>>>> >>>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote: >>>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu >>>>>>>> > wrote: >>>>>>>>> >>>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This >>>>>>>>> sounds exactly what I need. I wonder if PCTELESCOPE can >>>>>>>>> transform a mpiaijcusparse to seqaircusparse? Or I have to do >>>>>>>>> it manually? >>>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). >>>>>>>> 1) I?m not sure this is implemented for cuSparse matrices, but >>>>>>>> it should be; >>>>>>>> 2) at least for the implementations >>>>>>>> MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and >>>>>>>> MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType >>>>>>>> is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? >>>>>>>> enough to detect if the MPI communicator on which the Mat lives >>>>>>>> is of size 1 (your case), and then the resulting Mat is of type >>>>>>>> MatSeqX instead of MatMPIX, so you would not need to worry about >>>>>>>> the transformation you are mentioning. >>>>>>>> If you try this out and this does not work, please provide the >>>>>>>> backtrace (probably something like ?Operation XYZ not >>>>>>>> implemented for MatType ABC?), and hopefully someone can add the >>>>>>>> missing plumbing. >>>>>>>> I do not claim that this will be efficient, but I think this >>>>>>>> goes in the direction of what you want to achieve. 
>>>>>>>> Thanks, >>>>>>>> Pierre >>>>>>>>> Chang >>>>>>>>> >>>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >>>>>>>>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as >>>>>>>>>> a subdomain solver, with a reduction factor equal to the >>>>>>>>>> number of MPI processes you have per block? >>>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X >>>>>>>>>> -sub_telescope_pc_type lu >>>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads >>>>>>>>>> because not only do the Mat needs to be redistributed, the >>>>>>>>>> secondary processes also need to be ?converted? to OpenMP threads. >>>>>>>>>> Thus the need for specific code in mumps.c. >>>>>>>>>> Thanks, >>>>>>>>>> Pierre >>>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users >>>>>>>>>>> > wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Junchao, >>>>>>>>>>> >>>>>>>>>>> Yes that is what I want. >>>>>>>>>>> >>>>>>>>>>> Chang >>>>>>>>>>> >>>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith >>>>>>>>>>>> >>>>>>>>>>>> >> wrote: >>>>>>>>>>>> ?????? Junchao, >>>>>>>>>>>> ????????? If I understand correctly Chang is using the block >>>>>>>>>>>> Jacobi >>>>>>>>>>>> ??? method with a single block for a number of MPI ranks and >>>>>>>>>>>> a direct >>>>>>>>>>>> ??? solver for each block so it uses >>>>>>>>>>>> PCSetUp_BJacobi_Multiproc() which >>>>>>>>>>>> ??? is code Hong Zhang wrote a number of years ago for CPUs. >>>>>>>>>>>> For their >>>>>>>>>>>> ??? particular problems this preconditioner works well, but >>>>>>>>>>>> using an >>>>>>>>>>>> ??? iterative solver on the blocks does not work well. >>>>>>>>>>>> ????????? If we had complete MPI-GPU direct solvers he could >>>>>>>>>>>> just use >>>>>>>>>>>> ??? the current code with MPIAIJCUSPARSE on each block but >>>>>>>>>>>> since we do >>>>>>>>>>>> ??? not he would like to use a single GPU for each block, >>>>>>>>>>>> this means >>>>>>>>>>>> ??? that diagonal blocks of? the global parallel MPI matrix >>>>>>>>>>>> needs to be >>>>>>>>>>>> ??? sent to a subset of the GPUs (one GPU per block, which >>>>>>>>>>>> has multiple >>>>>>>>>>>> ??? MPI ranks associated with the blocks). Similarly for the >>>>>>>>>>>> triangular >>>>>>>>>>>> ??? solves the blocks of the right hand side needs to be >>>>>>>>>>>> shipped to the >>>>>>>>>>>> ??? appropriate GPU and the resulting solution shipped back >>>>>>>>>>>> to the >>>>>>>>>>>> ??? multiple GPUs. So Chang is absolutely correct, this is >>>>>>>>>>>> somewhat like >>>>>>>>>>>> ??? your code for MUMPS with OpenMP. OK, I now understand >>>>>>>>>>>> the background.. >>>>>>>>>>>> ??? One could use PCSetUp_BJacobi_Multiproc() and get the >>>>>>>>>>>> blocks on the >>>>>>>>>>>> ??? MPI ranks and then shrink each block down to a single >>>>>>>>>>>> GPU but this >>>>>>>>>>>> ??? would be pretty inefficient, ideally one would go >>>>>>>>>>>> directly from the >>>>>>>>>>>> ??? big MPI matrix on all the GPUs to the sub matrices on >>>>>>>>>>>> the subset of >>>>>>>>>>>> ??? GPUs. But this may be a large coding project. >>>>>>>>>>>> I don't understand these sentences. Why do you say "shrink"? >>>>>>>>>>>> In my mind, we just need to move each block (submatrix) >>>>>>>>>>>> living over multiple MPI ranks to one of them and solve >>>>>>>>>>>> directly there.? In other words, we keep blocks' size, no >>>>>>>>>>>> shrinking or expanding. >>>>>>>>>>>> As mentioned before, cusparse does not provide LU >>>>>>>>>>>> factorization. 
So the LU factorization would be done on CPU, >>>>>>>>>>>> and the solve be done on GPU. I assume Chang wants to gain >>>>>>>>>>>> from the (potential) faster solve (instead of factorization) >>>>>>>>>>>> on GPU. >>>>>>>>>>>> ?????? Barry >>>>>>>>>>>> ??? Since the matrices being factored and solved directly >>>>>>>>>>>> are relatively >>>>>>>>>>>> ??? large it is possible that the cusparse code could be >>>>>>>>>>>> reasonably >>>>>>>>>>>> ??? efficient (they are not the tiny problems one gets at >>>>>>>>>>>> the coarse >>>>>>>>>>>> ??? level of multigrid). Of course, this is speculation, I don't >>>>>>>>>>>> ??? actually know how much better the cusparse code would be >>>>>>>>>>>> on the >>>>>>>>>>>> ??? direct solver than a good CPU direct sparse solver. >>>>>>>>>>>> ???? > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>>>>>>> >>>>>>>>>>>> ??? >> wrote: >>>>>>>>>>>> ???? > >>>>>>>>>>>> ???? > Sorry I am not familiar with the details either. Can >>>>>>>>>>>> you please >>>>>>>>>>>> ??? check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>>>>>>>> ???? > >>>>>>>>>>>> ???? > Chang >>>>>>>>>>>> ???? > >>>>>>>>>>>> ???? > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>>>>>>>> ???? >> Hi Chang, >>>>>>>>>>>> ???? >>?? I did the work in mumps. It is easy for me to >>>>>>>>>>>> understand >>>>>>>>>>>> ??? gathering matrix rows to one process. >>>>>>>>>>>> ???? >>?? But how to gather blocks (submatrices) to form a >>>>>>>>>>>> large block????? Can you draw a picture of that? >>>>>>>>>>>> ???? >>?? Thanks >>>>>>>>>>>> ???? >> --Junchao Zhang >>>>>>>>>>>> ???? >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via >>>>>>>>>>>> petsc-users >>>>>>>>>>>> ??? >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> > >>>>>>>>>>>> ??? >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>> >>>>>>>>>>>> ??? wrote: >>>>>>>>>>>> ???? >>??? Hi Barry, >>>>>>>>>>>> ???? >>??? I think mumps solver in petsc does support that. >>>>>>>>>>>> You can >>>>>>>>>>>> ??? check the >>>>>>>>>>>> ???? >>??? documentation on "-mat_mumps_use_omp_threads" at >>>>>>>>>>>> ???? >> >>>>>>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> ???>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> ???? >> >>>>>>>>>>>> ??????>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> ???>>>>>>>>>>> >> >>>>>>>>>>>> >>>>>>>>>>>> ???? >>??? and the code enclosed by #if >>>>>>>>>>>> ??? defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>>>>>>>> ???? >>??? functions MatMumpsSetUpDistRHSInfo and >>>>>>>>>>>> ???? >>??? MatMumpsGatherNonzerosOnMaster in >>>>>>>>>>>> ???? >>??? mumps.c >>>>>>>>>>>> ???? >>??? 1. I understand it is ideal to do one MPI rank >>>>>>>>>>>> per GPU. >>>>>>>>>>>> ??? However, I am >>>>>>>>>>>> ???? >>??? working on an existing code that was developed >>>>>>>>>>>> based on MPI >>>>>>>>>>>> ??? and the the >>>>>>>>>>>> ???? >>??? # of mpi ranks is typically equal to # of cpu >>>>>>>>>>>> cores. We don't >>>>>>>>>>>> ??? want to >>>>>>>>>>>> ???? >>??? change the whole structure of the code. >>>>>>>>>>>> ???? >>??? 2. What you have suggested has been coded in >>>>>>>>>>>> mumps.c. See >>>>>>>>>>>> ??? function >>>>>>>>>>>> ???? >>??? MatMumpsSetUpDistRHSInfo. >>>>>>>>>>>> ???? >>??? Regards, >>>>>>>>>>>> ???? >>??? Chang >>>>>>>>>>>> ???? >>??? On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>>>>>>>> ???? >>???? > >>>>>>>>>>>> ???? >>???? > >>>>>>>>>>>> ???? >>???? >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>>>>>>>>> >>>>>>>>>>>> ??? > >>>>>>>>>>>> ???? >>??? >>>>>>>>>>>> >>> wrote: >>>>>>>>>>>> ???? >>???? >> >>>>>>>>>>>> ???? 
>>???? >> Hi Barry, >>>>>>>>>>>> ???? >>???? >> >>>>>>>>>>>> ???? >>???? >> That is exactly what I want. >>>>>>>>>>>> ???? >>???? >> >>>>>>>>>>>> ???? >>???? >> Back to my original question, I am looking >>>>>>>>>>>> for an approach to >>>>>>>>>>>> ???? >>??? transfer >>>>>>>>>>>> ???? >>???? >> matrix >>>>>>>>>>>> ???? >>???? >> data from many MPI processes to "master" MPI >>>>>>>>>>>> ???? >>???? >> processes, each of which taking care of one >>>>>>>>>>>> GPU, and then >>>>>>>>>>>> ??? upload >>>>>>>>>>>> ???? >>??? the data to GPU to >>>>>>>>>>>> ???? >>???? >> solve. >>>>>>>>>>>> ???? >>???? >> One can just grab some codes from mumps.c to >>>>>>>>>>>> ??? aijcusparse.cu >>>>>>>>>>> > >>>>>>>>>>>> ???? >>??? >>>>>>>>>>>> >>. >>>>>>>>>>>> ???? >>???? > >>>>>>>>>>>> ???? >>???? >??? mumps.c doesn't actually do that. It never >>>>>>>>>>>> needs to >>>>>>>>>>>> ??? copy the >>>>>>>>>>>> ???? >>??? entire matrix to a single MPI rank. >>>>>>>>>>>> ???? >>???? > >>>>>>>>>>>> ???? >>???? >??? It would be possible to write such a code >>>>>>>>>>>> that you >>>>>>>>>>>> ??? suggest but >>>>>>>>>>>> ???? >>??? it is not clear that it makes sense >>>>>>>>>>>> ???? >>???? > >>>>>>>>>>>> ???? >>???? > 1)? For normal PETSc GPU usage there is one >>>>>>>>>>>> GPU per MPI >>>>>>>>>>>> ??? rank, so >>>>>>>>>>>> ???? >>??? while your one GPU per big domain is solving its >>>>>>>>>>>> systems the >>>>>>>>>>>> ??? other >>>>>>>>>>>> ???? >>??? GPUs (with the other MPI ranks that share that >>>>>>>>>>>> domain) are doing >>>>>>>>>>>> ???? >>??? nothing. >>>>>>>>>>>> ???? >>???? > >>>>>>>>>>>> ???? >>???? > 2) For each triangular solve you would have to >>>>>>>>>>>> gather the >>>>>>>>>>>> ??? right >>>>>>>>>>>> ???? >>??? hand side from the multiple ranks to the single >>>>>>>>>>>> GPU to pass it to >>>>>>>>>>>> ???? >>??? the GPU solver and then scatter the resulting >>>>>>>>>>>> solution back >>>>>>>>>>>> ??? to all >>>>>>>>>>>> ???? >>??? of its subdomain ranks. >>>>>>>>>>>> ???? >>???? > >>>>>>>>>>>> ???? >>???? >??? What I was suggesting was assign an entire >>>>>>>>>>>> subdomain to a >>>>>>>>>>>> ???? >>??? single MPI rank, thus it does everything on one >>>>>>>>>>>> GPU and can >>>>>>>>>>>> ??? use the >>>>>>>>>>>> ???? >>??? GPU solver directly. If all the major >>>>>>>>>>>> computations of a subdomain >>>>>>>>>>>> ???? >>??? can fit and be done on a single GPU then you would be >>>>>>>>>>>> ??? utilizing all >>>>>>>>>>>> ???? >>??? the GPUs you are using effectively. >>>>>>>>>>>> ???? >>???? > >>>>>>>>>>>> ???? >>???? >??? Barry >>>>>>>>>>>> ???? >>???? > >>>>>>>>>>>> ???? >>???? > >>>>>>>>>>>> ???? >>???? > >>>>>>>>>>>> ???? >>???? >> >>>>>>>>>>>> ???? >>???? >> Chang >>>>>>>>>>>> ???? >>???? >> >>>>>>>>>>>> ???? >>???? >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>>>>>>>>> ???? >>???? >>>??? Chang, >>>>>>>>>>>> ???? >>???? >>>????? You are correct there is no MPI + GPU >>>>>>>>>>>> direct >>>>>>>>>>>> ??? solvers that >>>>>>>>>>>> ???? >>??? currently do the triangular solves with MPI + GPU >>>>>>>>>>>> parallelism >>>>>>>>>>>> ??? that I >>>>>>>>>>>> ???? >>??? am aware of. You are limited that individual >>>>>>>>>>>> triangular solves be >>>>>>>>>>>> ???? >>??? done on a single GPU. I can only suggest making >>>>>>>>>>>> each subdomain as >>>>>>>>>>>> ???? >>??? big as possible to utilize each GPU as much as >>>>>>>>>>>> possible for the >>>>>>>>>>>> ???? >>??? direct triangular solves. >>>>>>>>>>>> ???? >>???? >>>???? Barry >>>>>>>>>>>> ???? >>???? 
>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> '-mat_type aijcusparse' works with mpiaijcusparse with other solvers, but with -pc_factor_mat_solver_type cusparse it will give an error.
>>>>
>>>> Yes, what I want is to have mumps or superlu do the factorization, and then do the rest, including the GMRES solve, on the GPU. Is that possible?
>>>>
>>>> I have tried to use aijcusparse with superlu_dist; it runs, but the iterative solver is still running on CPUs. I have contacted the superlu group and they confirmed that is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver is running on the GPU.
>>>>
>>>> Chang
>>>>
>>>> On 10/13/21 12:03 PM, Mark Adams wrote:
>>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu wrote:
>>>>>
>>>>>> Thank you Junchao for explaining this. I guess in my case the code is just calling a seq solver like superlu to do factorization on GPUs.
>>>>>>
>>>>>> My idea is that I want to have a traditional MPI code utilize GPUs with cusparse. Right now cusparse does not support the mpiaij matrix,
>>>>>
>>>>> Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse matrix with > 1 processes (-mat_type mpiaijcusparse might also work with > 1 proc).
>>>>>
>>>>> However, I see in grepping the repo that all the mumps and superlu tests use the aij or sell matrix type. MUMPS and SuperLU provide their own solves, I assume ... but you might want to do other matrix operations on the GPU. Is that the issue?
>>>>>
>>>>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and have a problem? (There is no test with it, so it probably does not work.)
>>>>>
>>>>> Thanks,
>>>>> Mark
>>>>>
>>>>>> so I want the code to have an mpiaij matrix when adding all the matrix terms, and then transform the matrix to seqaij when doing the factorization and solve. This involves sending the data to the master process, and I think the petsc mumps solver has something similar already.
>>>>>>
>>>>>> Chang
>>>>>>
>>>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote:
>>>>>>> On Tue, Oct 12, 2021 at 1:07 PM Mark Adams wrote:
>>>>>>>
>>>>>>>> On Tue, Oct 12, 2021 at 1:45 PM Chang Liu wrote:
>>>>>>>>
>>>>>>>>> Hi Mark,
>>>>>>>>>
>>>>>>>>> The option I use is like
>>>>>>>>>
>>>>>>>>> -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse -sub_ksp_type preonly -sub_pc_type lu -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300
>>>>>>>>
>>>>>>>> Note, if you use -log_view the last column (rows are the method, like MatFactorNumeric) has the percent of work in the GPU.
>>>>>>>>
>>>>>>>> Junchao: *this* implies that we have a cuSparse LU factorization. Is that correct? (I don't think we do.)
>>>>>>>
>>>>>>> No, we don't have cuSparse LU factorization. If you check MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls MatLUFactorSymbolic_SeqAIJ() instead.
>>>>>>>
>>>>>>> So I don't understand Chang's idea. Do you want to make bigger blocks?
>>>>>>>
>>>>>>>>> I think this one does both factorization and solve on the gpu.
>>>>>>>>>
>>>>>>>>> You can check the runex72_aijcusparse.sh file in the petsc install directory, and try it yourself (this is only lu factorization without an iterative solve).
>>>>>>>>>
>>>>>>>>> Chang
>>>>>>>>>
>>>>>>>>> On 10/12/21 1:17 PM, Mark Adams wrote:
>>>>>>>>>> On Tue, Oct 12, 2021 at 11:19 AM Chang Liu wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Junchao,
>>>>>>>>>>>
>>>>>>>>>>> No, I only need it to be transferred within a node. I use the block-Jacobi method and GMRES to solve the sparse matrix, so each direct solver will take care of a sub-block of the whole matrix. In this way, I can use one GPU to solve one sub-block, which is stored within one node.
>>>>>>>>>>>
>>>>>>>>>>> It was stated in the documentation that the cusparse solver is slow. However, in my test using ex72.c, the cusparse solver is faster than mumps or superlu_dist on CPUs.
>>>>>>>>>>
>>>>>>>>>> Are we talking about the factorization, the solve, or both?
>>>>>>>>>>
>>>>>>>>>> We do not have an interface to cuSparse's LU factorization (I just learned that it exists a few weeks ago).
>>>>>>>>>>
>>>>>>>>>> Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse'? This would be the CPU factorization, which is the dominant cost.
>>>>>>>>>>
>>>>>>>>>>> Chang
>>>>>>>>>>>
>>>>>>>>>>> On 10/12/21 10:24 AM, Junchao Zhang wrote:
>>>>>>>>>>>> Hi, Chang,
>>>>>>>>>>>>
>>>>>>>>>>>> For the mumps solver, we usually transfer matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right?
>>>>>>>>>>>>
>>>>>>>>>>>> Mark, I remember you said the cusparse solve is slow and you would rather do it on the CPU. Is that right?
>>>>>>>>>>>>
>>>>>>>>>>>> --Junchao Zhang
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently, it is possible to use the mumps solver in PETSc with the -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then the master rank will call mumps with OpenMP to solve the matrix.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I wonder if someone can develop a similar option for the cusparse solver. Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to the GPU to solve. In this way, one can use the cusparse solver for an MPI program.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Chang
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Chang Liu
>>>>>>>>>>>>> Staff Research Physicist
>>>>>>>>>>>>> +1 609 243 3438
>>>>>>>>>>>>> cliu at pppl.gov
>>>>>>>>>>>>> Princeton Plasma Physics Laboratory
>>>>>>>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA

--
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA
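
For reference, the per-block GPU factorization options quoted in the thread above, collected into a single run line; the executable name, process count, and any problem-size arguments are placeholders (chosen so that there is one Jacobi block per rank), while the solver options are the ones Chang posted:

    $ mpiexec -n 16 ./your_app \
        -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres \
        -mat_type aijcusparse \
        -sub_ksp_type preonly -sub_pc_type lu \
        -sub_pc_factor_mat_solver_type cusparse \
        -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300

Each Jacobi block is then factored and solved through the cusparse interface on the rank that owns it, which is the configuration the later messages compare against MUMPS.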

From junchao.zhang at gmail.com  Sat Oct 16 20:59:07 2021
From: junchao.zhang at gmail.com (Junchao Zhang)
Date: Sat, 16 Oct 2021 20:59:07 -0500
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: 
References: <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev>
 <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov>
 <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov>
 <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev>
 <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov>
 <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov>
 <6D4D8741-3F52-41BF-B2A3-AFBA09443755@petsc.dev>
Message-ID: 

Hi, Chang,
  Thanks a lot for the fix. I will create an MR for it.
--Junchao Zhang

On Sat, Oct 16, 2021 at 8:12 PM Chang Liu wrote:

> Hi Barry, Pierre and Junchao,
>
> I spent some time finding the reason for the error. I think it is caused by some compatibility issues between telescope and cusparse.
>
> 1. In PCTelescopeMatCreate_default in telescope.c, it calls MatCreateMPIMatConcatenateSeqMat to concatenate the seqmats into an mpimat, but this function is from mpiaij.c and will set the mat type to mpiaij, even if the original matrix is mpiaijcusparse.
>
> 2. A similar issue exists in PCTelescopeSetUp_default, where the vector is set to type mpi rather than mpicuda.
>
> I have fixed the issue using the patch below. After applying it, telescope and cusparse work as expected.
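
A quick, hypothetical way to observe the mismatch described above (not part of the patch; subcomm_mat and subcomm_vec stand for the objects PCTELESCOPE builds on the subcommunicator) is to query their types; Chang's patch follows below.

    /* Hypothetical diagnostic: without the fix, the subcommunicator objects
       come back as plain "mpiaij" / "mpi" even when the parent matrix and
       vectors are cusparse / cuda based. Error handling is abbreviated. */
    PetscErrorCode ierr;
    MatType        mtype;
    VecType        vtype;
    ierr = MatGetType(subcomm_mat,&mtype);CHKERRQ(ierr);
    ierr = VecGetType(subcomm_vec,&vtype);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD,"sub mat type: %s, sub vec type: %s\n",mtype,vtype);CHKERRQ(ierr);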
>
> diff --git a/src/ksp/pc/impls/telescope/telescope.c b/src/ksp/pc/impls/telescope/telescope.c
> index 893febb055..d3f687eff9 100644
> --- a/src/ksp/pc/impls/telescope/telescope.c
> +++ b/src/ksp/pc/impls/telescope/telescope.c
> @@ -159,6 +159,7 @@ PetscErrorCode PCTelescopeSetUp_default(PC pc,PC_Telescope sred)
>      ierr = VecCreate(subcomm,&xred);CHKERRQ(ierr);
>      ierr = VecSetSizes(xred,PETSC_DECIDE,M);CHKERRQ(ierr);
>      ierr = VecSetBlockSize(xred,bs);CHKERRQ(ierr);
> +    ierr = VecSetType(xred,((PetscObject)x)->type_name);CHKERRQ(ierr);
>      ierr = VecSetFromOptions(xred);CHKERRQ(ierr);
>      ierr = VecGetLocalSize(xred,&m);CHKERRQ(ierr);
>    }
> diff --git a/src/mat/impls/aij/mpi/mpiaij.c b/src/mat/impls/aij/mpi/mpiaij.c
> index 36077002db..ac374e07eb 100644
> --- a/src/mat/impls/aij/mpi/mpiaij.c
> +++ b/src/mat/impls/aij/mpi/mpiaij.c
> @@ -4486,6 +4486,7 @@ PetscErrorCode MatCreateMPIMatConcatenateSeqMat_MPIAIJ(MPI_Comm comm,Mat inmat,P
>    PetscInt       m,N,i,rstart,nnz,Ii;
>    PetscInt       *indx;
>    PetscScalar    *values;
> +  PetscBool      isseqaijcusparse;
>
>    PetscFunctionBegin;
>    ierr = MatGetSize(inmat,&m,&N);CHKERRQ(ierr);
> @@ -4513,7 +4514,12 @@ PetscErrorCode MatCreateMPIMatConcatenateSeqMat_MPIAIJ(MPI_Comm comm,Mat inmat,P
>      ierr = MatSetSizes(*outmat,m,n,PETSC_DETERMINE,PETSC_DETERMINE);CHKERRQ(ierr);
>      ierr = MatGetBlockSizes(inmat,&bs,&cbs);CHKERRQ(ierr);
>      ierr = MatSetBlockSizes(*outmat,bs,cbs);CHKERRQ(ierr);
> -    ierr = MatSetType(*outmat,MATAIJ);CHKERRQ(ierr);
> +    ierr = PetscObjectBaseTypeCompare((PetscObject)inmat,MATSEQAIJCUSPARSE,&isseqaijcusparse);CHKERRQ(ierr);
> +    if (isseqaijcusparse) {
> +      ierr = MatSetType(*outmat,MATAIJCUSPARSE);CHKERRQ(ierr);
> +    } else {
> +      ierr = MatSetType(*outmat,MATAIJ);CHKERRQ(ierr);
> +    }
>      ierr = MatSeqAIJSetPreallocation(*outmat,0,dnz);CHKERRQ(ierr);
>      ierr = MatMPIAIJSetPreallocation(*outmat,0,dnz,0,onz);CHKERRQ(ierr);
>      ierr = MatPreallocateFinalize(dnz,onz);CHKERRQ(ierr);
>
> Please review it and merge it to master if possible.
>
> Regards,
>
> Chang
>
> On 10/15/21 1:27 PM, Barry Smith wrote:
> >
> >    So the only difference is between -sub_telescope_pc_factor_mat_solver_type cusparse and -sub_telescope_pc_factor_mat_solver_type mumps?
> >
> >    Try without -sub_telescope_pc_factor_mat_solver_type cusparse; then PETSc will just use the CPU solvers. I want to see if that works, and it should. If it works, then there is perhaps something specific about PCTELESCOPE and the cusparse solver; for example, the right hand side array values may never get to the GPU.
> > > > Barry > > > >> On Oct 14, 2021, at 10:11 PM, Chang Liu >> > wrote: > >> > >> For comparison, here is the output using mumps instead of cusparse > >> > >> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 > >> -ksp_view -ksp_monitor_true_residual -pc_type bjacobi > >> -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse > >> -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type > >> preonly -sub_telescope_pc_type lu > >> -sub_telescope_pc_factor_mat_solver_type mumps > >> -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type > >> contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 > > > > $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 > > -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks > > 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope > > -sub_ksp_type preonly -sub_telescope_ksp_type preonly > > -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type > > cusparse -sub_pc_telescope_reduction_factor 4 > > -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol > > 1.e-20 -ksp_atol 1.e-9 > > > > > >> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm > >> 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > >> 1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm > >> 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 > >> 2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm > >> 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 > >> 3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm > >> 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 > >> 4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm > >> 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 > >> 5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm > >> 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 > >> 6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm > >> 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 > >> 7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm > >> 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 > >> 8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm > >> 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03 > >> 9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm > >> 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 > >> 10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm > >> 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 > >> 11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm > >> 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 > >> 12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid norm > >> 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 > >> 13 KSP unpreconditioned resid norm 1.885892030223e-01 true resid norm > >> 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 > >> 14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid norm > >> 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 > >> 15 KSP unpreconditioned resid norm 1.638219366731e-01 true resid norm > >> 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 > >> 16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid norm > >> 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 > >> 17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid norm > >> 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 > >> 18 
KSP unpreconditioned resid norm 1.289673236836e-01 true resid norm > >> 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 > >> 19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid norm > >> 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 > >> 20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid norm > >> 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 > >> 21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid norm > >> 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 > >> 22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid norm > >> 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 > >> 23 KSP unpreconditioned resid norm 7.563496650115e-02 true resid norm > >> 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 > >> 24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid norm > >> 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 > >> 25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid norm > >> 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 > >> 26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid norm > >> 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 > >> 27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid norm > >> 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 > >> 28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid norm > >> 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 > >> 29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid norm > >> 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 > >> 30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid norm > >> 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 > >> 31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid norm > >> 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 > >> 32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid norm > >> 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 > >> 33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid norm > >> 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 > >> 34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid norm > >> 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 > >> 35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid norm > >> 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 > >> 36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid norm > >> 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 > >> 37 KSP unpreconditioned resid norm 4.982347757923e-03 true resid norm > >> 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 > >> 38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid norm > >> 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 > >> 39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid norm > >> 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 > >> 40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid norm > >> 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 > >> 41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid norm > >> 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 > >> 42 KSP unpreconditioned resid norm 2.772928845284e-03 true resid norm > >> 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 > >> 43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid norm > >> 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 > >> 44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid norm > >> 2.253662762802e-03 
||r(i)||/||b|| 5.613146926159e-05 > >> 45 KSP unpreconditioned resid norm 2.086800523919e-03 true resid norm > >> 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 > >> 46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid norm > >> 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 > >> 47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid norm > >> 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 > >> 48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid norm > >> 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 > >> 49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid norm > >> 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 > >> 50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid norm > >> 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 > >> 51 KSP unpreconditioned resid norm 1.375985053014e-03 true resid norm > >> 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 > >> 52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid norm > >> 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 > >> 53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid norm > >> 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 > >> 54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid norm > >> 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 > >> 55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid norm > >> 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 > >> 56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid norm > >> 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 > >> 57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid norm > >> 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 > >> 58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid norm > >> 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 > >> 59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid norm > >> 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 > >> 60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid norm > >> 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 > >> 61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid norm > >> 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 > >> 62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid norm > >> 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 > >> 63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid norm > >> 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 > >> 64 KSP unpreconditioned resid norm 2.961034697265e-04 true resid norm > >> 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 > >> 65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid norm > >> 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 > >> 66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid norm > >> 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 > >> 67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid norm > >> 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 > >> 68 KSP unpreconditioned resid norm 2.008438265031e-04 true resid norm > >> 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 > >> 69 KSP unpreconditioned resid norm 1.838732863386e-04 true resid norm > >> 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 > >> 70 KSP unpreconditioned resid norm 1.723786027645e-04 true resid norm > >> 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 > >> 71 KSP unpreconditioned resid norm 1.580945192204e-04 
true resid norm > >> 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 > >> 72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid norm > >> 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 > >> 73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid norm > >> 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 > >> 74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid norm > >> 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 > >> 75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid norm > >> 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 > >> 76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid norm > >> 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 > >> 77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid norm > >> 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 > >> 78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid norm > >> 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 > >> 79 KSP unpreconditioned resid norm 8.896901670359e-05 true resid norm > >> 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 > >> 80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid norm > >> 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 > >> 81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid norm > >> 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 > >> 82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid norm > >> 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 > >> 83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid norm > >> 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 > >> 84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid norm > >> 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 > >> 85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid norm > >> 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 > >> 86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid norm > >> 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 > >> 87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid norm > >> 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 > >> 88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid norm > >> 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 > >> 89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid norm > >> 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 > >> 90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid norm > >> 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 > >> 91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid norm > >> 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 > >> 92 KSP unpreconditioned resid norm 1.707071292484e-05 true resid norm > >> 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 > >> 93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid norm > >> 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 > >> 94 KSP unpreconditioned resid norm 1.219393542993e-05 true resid norm > >> 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 > >> 95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid norm > >> 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 > >> 96 KSP unpreconditioned resid norm 9.099659872548e-06 true resid norm > >> 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 > >> 97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid norm > >> 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 > >> 98 KSP 
unpreconditioned resid norm 7.167226146744e-06 true resid norm > >> 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 > >> 99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid norm > >> 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 > >> 100 KSP unpreconditioned resid norm 5.767783600111e-06 true resid norm > >> 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 > >> 101 KSP unpreconditioned resid norm 5.261057430584e-06 true resid norm > >> 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 > >> 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid norm > >> 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 > >> 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid norm > >> 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 > >> 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid norm > >> 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 > >> 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid norm > >> 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 > >> 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid norm > >> 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 > >> 107 KSP unpreconditioned resid norm 3.234218880987e-06 true resid norm > >> 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 > >> 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid norm > >> 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 > >> 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid norm > >> 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 > >> 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid norm > >> 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 > >> 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid norm > >> 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 > >> 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid norm > >> 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 > >> 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid norm > >> 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 > >> 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid norm > >> 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 > >> 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid norm > >> 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 > >> 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid norm > >> 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 > >> 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid norm > >> 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 > >> 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid norm > >> 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 > >> 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid norm > >> 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 > >> 120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid norm > >> 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 > >> 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid norm > >> 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 > >> 122 KSP unpreconditioned resid norm 7.945760150897e-07 true resid norm > >> 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 > >> 123 KSP unpreconditioned resid norm 7.141240839013e-07 true resid norm > >> 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 > >> 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid norm > >> 
6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 > >> 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid norm > >> 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 > >> 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid norm > >> 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 > >> 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid norm > >> 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 > >> 128 KSP unpreconditioned resid norm 4.253264691112e-07 true resid norm > >> 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 > >> 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid norm > >> 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 > >> 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid norm > >> 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 > >> 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid norm > >> 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 > >> 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid norm > >> 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 > >> 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid norm > >> 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 > >> 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid norm > >> 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 > >> 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid norm > >> 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 > >> 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid norm > >> 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 > >> 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid norm > >> 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 > >> 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid norm > >> 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 > >> 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid norm > >> 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 > >> 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid norm > >> 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 > >> 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid norm > >> 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 > >> 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid norm > >> 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 > >> 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid norm > >> 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 > >> 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid norm > >> 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 > >> 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid norm > >> 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 > >> 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid norm > >> 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 > >> 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid norm > >> 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 > >> 148 KSP unpreconditioned resid norm 5.690132597004e-08 true resid norm > >> 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 > >> 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid norm > >> 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 > >> 150 KSP unpreconditioned resid norm 4.625371062660e-08 true resid norm > >> 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 > >> 151 KSP 
unpreconditioned resid norm 4.349049084805e-08 true resid norm > >> 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 > >> 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid norm > >> 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 > >> 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid norm > >> 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 > >> 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid norm > >> 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 > >> 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid norm > >> 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 > >> 156 KSP unpreconditioned resid norm 2.128504065757e-08 true resid norm > >> 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 > >> 157 KSP unpreconditioned resid norm 1.879248809429e-08 true resid norm > >> 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 > >> 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid norm > >> 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 > >> 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid norm > >> 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 > >> 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid norm > >> 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 > >> 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid norm > >> 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 > >> 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid norm > >> 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 > >> 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid norm > >> 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 > >> 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid norm > >> 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 > >> 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid norm > >> 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 > >> 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid norm > >> 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 > >> 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid norm > >> 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 > >> 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid norm > >> 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 > >> 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid norm > >> 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 > >> 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid norm > >> 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 > >> 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid norm > >> 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 > >> 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid norm > >> 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 > >> 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid norm > >> 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 > >> 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid norm > >> 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 > >> 175 KSP unpreconditioned resid norm 3.924177535665e-09 true resid norm > >> 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 > >> 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid norm > >> 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 > >> 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid norm > >> 
3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 > >> 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid norm > >> 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 > >> 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid norm > >> 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 > >> 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid norm > >> 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 > >> 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid norm > >> 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 > >> 182 KSP unpreconditioned resid norm 1.983545827983e-09 true resid norm > >> 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 > >> 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid norm > >> 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 > >> 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid norm > >> 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 > >> 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid norm > >> 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 > >> 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid norm > >> 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 > >> 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid norm > >> 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 > >> 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid norm > >> 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 > >> 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid norm > >> 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 > >> KSP Object: 16 MPI processes > >> type: fgmres > >> restart=30, using Classical (unmodified) Gram-Schmidt > >> Orthogonalization with no iterative refinement > >> happy breakdown tolerance 1e-30 > >> maximum iterations=2000, initial guess is zero > >> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. > >> right preconditioning > >> using UNPRECONDITIONED norm type for convergence test > >> PC Object: 16 MPI processes > >> type: bjacobi > >> number of blocks = 4 > >> Local solver information for first block is in the following KSP > >> and PC objects on rank 0: > >> Use -ksp_view ::ascii_info_detail to display information for all > blocks > >> KSP Object: (sub_) 4 MPI processes > >> type: preonly > >> maximum iterations=10000, initial guess is zero > >> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > >> left preconditioning > >> using NONE norm type for convergence test > >> PC Object: (sub_) 4 MPI processes > >> type: telescope > >> petsc subcomm: parent comm size reduction factor = 4 > >> petsc subcomm: parent_size = 4 , subcomm_size = 1 > >> petsc subcomm type = contiguous > >> linear system matrix = precond matrix: > >> Mat Object: (sub_) 4 MPI processes > >> type: mpiaij > >> rows=40200, cols=40200 > >> total: nonzeros=199996, allocated nonzeros=203412 > >> total number of mallocs used during MatSetValues calls=0 > >> not using I-node (on process 0) routines > >> setup type: default > >> Parent DM object: NULL > >> Sub DM object: NULL > >> KSP Object: (sub_telescope_) 1 MPI processes > >> type: preonly > >> maximum iterations=10000, initial guess is zero > >> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
> >> left preconditioning > >> using NONE norm type for convergence test > >> PC Object: (sub_telescope_) 1 MPI processes > >> type: lu > >> out-of-place factorization > >> tolerance for zero pivot 2.22045e-14 > >> matrix ordering: external > >> factor fill ratio given 0., needed 0. > >> Factored matrix follows: > >> Mat Object: 1 MPI processes > >> type: mumps > >> rows=40200, cols=40200 > >> package used to perform factorization: mumps > >> total: nonzeros=1849788, allocated nonzeros=1849788 > >> MUMPS run parameters: > >> SYM (matrix type): 0 > >> PAR (host participation): 1 > >> ICNTL(1) (output for error): 6 > >> ICNTL(2) (output of diagnostic msg): 0 > >> ICNTL(3) (output for global info): 0 > >> ICNTL(4) (level of printing): 0 > >> ICNTL(5) (input mat struct): 0 > >> ICNTL(6) (matrix prescaling): 7 > >> ICNTL(7) (sequential matrix ordering):7 > >> ICNTL(8) (scaling strategy): 77 > >> ICNTL(10) (max num of refinements): 0 > >> ICNTL(11) (error analysis): 0 > >> ICNTL(12) (efficiency control): 1 > >> ICNTL(13) (sequential factorization of the root > >> node): 0 > >> ICNTL(14) (percentage of estimated workspace > >> increase): 20 > >> ICNTL(18) (input mat struct): 0 > >> ICNTL(19) (Schur complement info): 0 > >> ICNTL(20) (RHS sparse pattern): 0 > >> ICNTL(21) (solution struct): 0 > >> ICNTL(22) (in-core/out-of-core facility): 0 > >> ICNTL(23) (max size of memory can be allocated > >> locally):0 > >> ICNTL(24) (detection of null pivot rows): 0 > >> ICNTL(25) (computation of a null space basis): > >> 0 > >> ICNTL(26) (Schur options for RHS or solution): > >> 0 > >> ICNTL(27) (blocking size for multiple RHS): > >> -32 > >> ICNTL(28) (use parallel or sequential ordering): > >> 1 > >> ICNTL(29) (parallel ordering): 0 > >> ICNTL(30) (user-specified set of entries in > >> inv(A)): 0 > >> ICNTL(31) (factors is discarded in the solve > >> phase): 0 > >> ICNTL(33) (compute determinant): 0 > >> ICNTL(35) (activate BLR based factorization): > >> 0 > >> ICNTL(36) (choice of BLR factorization variant): > >> 0 > >> ICNTL(38) (estimated compression rate of LU > >> factors): 333 > >> CNTL(1) (relative pivoting threshold): 0.01 > >> CNTL(2) (stopping criterion of refinement): > >> 1.49012e-08 > >> CNTL(3) (absolute pivoting threshold): 0. > >> CNTL(4) (value of static pivoting): -1. > >> CNTL(5) (fixation for null pivots): 0. > >> CNTL(7) (dropping parameter for BLR): 0. 
> >> RINFO(1) (local estimated flops for the > >> elimination after analysis): > >> [0] 1.45525e+08 > >> RINFO(2) (local estimated flops for the assembly > >> after factorization): > >> [0] 2.89397e+06 > >> RINFO(3) (local estimated flops for the > >> elimination after factorization): > >> [0] 1.45525e+08 > >> INFO(15) (estimated size of (in MB) MUMPS > >> internal data for running numerical factorization): > >> [0] 29 > >> INFO(16) (size of (in MB) MUMPS internal data > >> used during numerical factorization): > >> [0] 29 > >> INFO(23) (num of pivots eliminated on this > >> processor after factorization): > >> [0] 40200 > >> RINFOG(1) (global estimated flops for the > >> elimination after analysis): 1.45525e+08 > >> RINFOG(2) (global estimated flops for the > >> assembly after factorization): 2.89397e+06 > >> RINFOG(3) (global estimated flops for the > >> elimination after factorization): 1.45525e+08 > >> (RINFOG(12) RINFOG(13))*2^INFOG(34) > >> (determinant): (0.,0.)*(2^0) > >> INFOG(3) (estimated real workspace for factors on > >> all processors after analysis): 1849788 > >> INFOG(4) (estimated integer workspace for factors > >> on all processors after analysis): 879986 > >> INFOG(5) (estimated maximum front size in the > >> complete tree): 282 > >> INFOG(6) (number of nodes in the complete tree): > >> 23709 > >> INFOG(7) (ordering option effectively used after > >> analysis): 5 > >> INFOG(8) (structural symmetry in percent of the > >> permuted matrix after analysis): 100 > >> INFOG(9) (total real/complex workspace to store > >> the matrix factors after factorization): 1849788 > >> INFOG(10) (total integer space store the matrix > >> factors after factorization): 879986 > >> INFOG(11) (order of largest frontal matrix after > >> factorization): 282 > >> INFOG(12) (number of off-diagonal pivots): 0 > >> INFOG(13) (number of delayed pivots after > >> factorization): 0 > >> INFOG(14) (number of memory compress after > >> factorization): 0 > >> INFOG(15) (number of steps of iterative > >> refinement after solution): 0 > >> INFOG(16) (estimated size (in MB) of all MUMPS > >> internal data for factorization after analysis: value on the most > >> memory consuming processor): 29 > >> INFOG(17) (estimated size of all MUMPS internal > >> data for factorization after analysis: sum over all processors): 29 > >> INFOG(18) (size of all MUMPS internal data > >> allocated during factorization: value on the most memory consuming > >> processor): 29 > >> INFOG(19) (size of all MUMPS internal data > >> allocated during factorization: sum over all processors): 29 > >> INFOG(20) (estimated number of entries in the > >> factors): 1849788 > >> INFOG(21) (size in MB of memory effectively used > >> during factorization - value on the most memory consuming processor): 26 > >> INFOG(22) (size in MB of memory effectively used > >> during factorization - sum over all processors): 26 > >> INFOG(23) (after analysis: value of ICNTL(6) > >> effectively used): 0 > >> INFOG(24) (after analysis: value of ICNTL(12) > >> effectively used): 1 > >> INFOG(25) (after factorization: number of pivots > >> modified by static pivoting): 0 > >> INFOG(28) (after factorization: number of null > >> pivots encountered): 0 > >> INFOG(29) (after factorization: effective number > >> of entries in the factors (sum over all processors)): 1849788 > >> INFOG(30, 31) (after solution: size in Mbytes of > >> memory used during solution phase): 29, 29 > >> INFOG(32) (after analysis: type of analysis done): > 1 > >> INFOG(33) (value used for 
ICNTL(8)): 7 > >> INFOG(34) (exponent of the determinant if > >> determinant is requested): 0 > >> INFOG(35) (after factorization: number of entries > >> taking into account BLR factor compression - sum over all processors): > >> 1849788 > >> INFOG(36) (after analysis: estimated size of all > >> MUMPS internal data for running BLR in-core - value on the most memory > >> consuming processor): 0 > >> INFOG(37) (after analysis: estimated size of all > >> MUMPS internal data for running BLR in-core - sum over all processors): > 0 > >> INFOG(38) (after analysis: estimated size of all > >> MUMPS internal data for running BLR out-of-core - value on the most > >> memory consuming processor): 0 > >> INFOG(39) (after analysis: estimated size of all > >> MUMPS internal data for running BLR out-of-core - sum over all > >> processors): 0 > >> linear system matrix = precond matrix: > >> Mat Object: 1 MPI processes > >> type: seqaijcusparse > >> rows=40200, cols=40200 > >> total: nonzeros=199996, allocated nonzeros=199996 > >> total number of mallocs used during MatSetValues calls=0 > >> not using I-node routines > >> linear system matrix = precond matrix: > >> Mat Object: 16 MPI processes > >> type: mpiaijcusparse > >> rows=160800, cols=160800 > >> total: nonzeros=802396, allocated nonzeros=1608000 > >> total number of mallocs used during MatSetValues calls=0 > >> not using I-node (on process 0) routines > >> Norm of error 9.11684e-07 iterations 189 > >> > >> Chang > >> > >> > >> > >> On 10/14/21 10:10 PM, Chang Liu wrote: > >>> Hi Barry, > >>> No problem. Here is the output. It seems that the resid norm > >>> calculation is incorrect. > >>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 > >>> -ksp_view -ksp_monitor_true_residual -pc_type bjacobi > >>> -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse > >>> -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type > >>> preonly -sub_telescope_pc_type lu > >>> -sub_telescope_pc_factor_mat_solver_type cusparse > >>> -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type > >>> contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 > >>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid > >>> norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > >>> 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid > >>> norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > >>> KSP Object: 16 MPI processes > >>> type: fgmres > >>> restart=30, using Classical (unmodified) Gram-Schmidt > >>> Orthogonalization with no iterative refinement > >>> happy breakdown tolerance 1e-30 > >>> maximum iterations=2000, initial guess is zero > >>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. > >>> right preconditioning > >>> using UNPRECONDITIONED norm type for convergence test > >>> PC Object: 16 MPI processes > >>> type: bjacobi > >>> number of blocks = 4 > >>> Local solver information for first block is in the following KSP > >>> and PC objects on rank 0: > >>> Use -ksp_view ::ascii_info_detail to display information for all > >>> blocks > >>> KSP Object: (sub_) 4 MPI processes > >>> type: preonly > >>> maximum iterations=10000, initial guess is zero > >>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
> >>> left preconditioning > >>> using NONE norm type for convergence test > >>> PC Object: (sub_) 4 MPI processes > >>> type: telescope > >>> petsc subcomm: parent comm size reduction factor = 4 > >>> petsc subcomm: parent_size = 4 , subcomm_size = 1 > >>> petsc subcomm type = contiguous > >>> linear system matrix = precond matrix: > >>> Mat Object: (sub_) 4 MPI processes > >>> type: mpiaij > >>> rows=40200, cols=40200 > >>> total: nonzeros=199996, allocated nonzeros=203412 > >>> total number of mallocs used during MatSetValues calls=0 > >>> not using I-node (on process 0) routines > >>> setup type: default > >>> Parent DM object: NULL > >>> Sub DM object: NULL > >>> KSP Object: (sub_telescope_) 1 MPI processes > >>> type: preonly > >>> maximum iterations=10000, initial guess is zero > >>> tolerances: relative=1e-05, absolute=1e-50, > divergence=10000. > >>> left preconditioning > >>> using NONE norm type for convergence test > >>> PC Object: (sub_telescope_) 1 MPI processes > >>> type: lu > >>> out-of-place factorization > >>> tolerance for zero pivot 2.22045e-14 > >>> matrix ordering: nd > >>> factor fill ratio given 5., needed 8.62558 > >>> Factored matrix follows: > >>> Mat Object: 1 MPI processes > >>> type: seqaijcusparse > >>> rows=40200, cols=40200 > >>> package used to perform factorization: cusparse > >>> total: nonzeros=1725082, allocated nonzeros=1725082 > >>> not using I-node routines > >>> linear system matrix = precond matrix: > >>> Mat Object: 1 MPI processes > >>> type: seqaijcusparse > >>> rows=40200, cols=40200 > >>> total: nonzeros=199996, allocated nonzeros=199996 > >>> total number of mallocs used during MatSetValues calls=0 > >>> not using I-node routines > >>> linear system matrix = precond matrix: > >>> Mat Object: 16 MPI processes > >>> type: mpiaijcusparse > >>> rows=160800, cols=160800 > >>> total: nonzeros=802396, allocated nonzeros=1608000 > >>> total number of mallocs used during MatSetValues calls=0 > >>> not using I-node (on process 0) routines > >>> Norm of error 400.999 iterations 1 > >>> Chang > >>> On 10/14/21 9:47 PM, Barry Smith wrote: > >>>> > >>>> Chang, > >>>> > >>>> Sorry I did not notice that one. Please run that with -ksp_view > >>>> -ksp_monitor_true_residual so we can see exactly how options are > >>>> interpreted and solver used. At a glance it looks ok but something > >>>> must be wrong to get the wrong answer. > >>>> > >>>> Barry > >>>> > >>>>> On Oct 14, 2021, at 6:02 PM, Chang Liu >>>>> > wrote: > >>>>> > >>>>> Hi Barry, > >>>>> > >>>>> That is exactly what I was doing in the second example, in which > >>>>> the preconditioner works but the GMRES does not. > >>>>> > >>>>> Chang > >>>>> > >>>>> On 10/14/21 5:15 PM, Barry Smith wrote: > >>>>>> You need to use the PCTELESCOPE inside the block Jacobi, not > >>>>>> outside it. So something like -pc_type bjacobi -sub_pc_type > >>>>>> telescope -sub_telescope_pc_type lu > >>>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu >>>>>>> > wrote: > >>>>>>> > >>>>>>> Hi Pierre, > >>>>>>> > >>>>>>> I wonder if the trick of PCTELESCOPE only works for > >>>>>>> preconditioner and not for the solver. I have done some tests, > >>>>>>> and find that for solving a small matrix using > >>>>>>> -telescope_ksp_type preonly, it does work for GPU with multiple > >>>>>>> MPI processes. However, for bjacobi and gmres, it does not work. 
> >>>>>>> > >>>>>>> The command line options I used for small matrix is like > >>>>>>> > >>>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short > >>>>>>> -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu > >>>>>>> -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type > >>>>>>> preonly -pc_telescope_reduction_factor 4 > >>>>>>> > >>>>>>> which gives the correct output. For iterative solver, I tried > >>>>>>> > >>>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short > >>>>>>> -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type > >>>>>>> aijcusparse -sub_pc_type telescope -sub_ksp_type preonly > >>>>>>> -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu > >>>>>>> -sub_telescope_pc_factor_mat_solver_type cusparse > >>>>>>> -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol > >>>>>>> 1.e-9 -ksp_atol 1.e-20 > >>>>>>> > >>>>>>> for large matrix. The output is like > >>>>>>> > >>>>>>> 0 KSP Residual norm 40.1497 > >>>>>>> 1 KSP Residual norm < 1.e-11 > >>>>>>> Norm of error 400.999 iterations 1 > >>>>>>> > >>>>>>> So it seems to call a direct solver instead of an iterative one. > >>>>>>> > >>>>>>> Can you please help check these options? > >>>>>>> > >>>>>>> Chang > >>>>>>> > >>>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote: > >>>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu >>>>>>>>> > wrote: > >>>>>>>>> > >>>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This > >>>>>>>>> sounds exactly what I need. I wonder if PCTELESCOPE can > >>>>>>>>> transform a mpiaijcusparse to seqaircusparse? Or I have to do > >>>>>>>>> it manually? > >>>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). > >>>>>>>> 1) I?m not sure this is implemented for cuSparse matrices, but > >>>>>>>> it should be; > >>>>>>>> 2) at least for the implementations > >>>>>>>> MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and > >>>>>>>> MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType > >>>>>>>> is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? > >>>>>>>> enough to detect if the MPI communicator on which the Mat lives > >>>>>>>> is of size 1 (your case), and then the resulting Mat is of type > >>>>>>>> MatSeqX instead of MatMPIX, so you would not need to worry about > >>>>>>>> the transformation you are mentioning. > >>>>>>>> If you try this out and this does not work, please provide the > >>>>>>>> backtrace (probably something like ?Operation XYZ not > >>>>>>>> implemented for MatType ABC?), and hopefully someone can add the > >>>>>>>> missing plumbing. > >>>>>>>> I do not claim that this will be efficient, but I think this > >>>>>>>> goes in the direction of what you want to achieve. > >>>>>>>> Thanks, > >>>>>>>> Pierre > >>>>>>>>> Chang > >>>>>>>>> > >>>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: > >>>>>>>>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as > >>>>>>>>>> a subdomain solver, with a reduction factor equal to the > >>>>>>>>>> number of MPI processes you have per block? > >>>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X > >>>>>>>>>> -sub_telescope_pc_type lu > >>>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads > >>>>>>>>>> because not only do the Mat needs to be redistributed, the > >>>>>>>>>> secondary processes also need to be ?converted? to OpenMP > threads. > >>>>>>>>>> Thus the need for specific code in mumps.c. 
> >>>>>>>>>> Thanks, > >>>>>>>>>> Pierre > >>>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users > >>>>>>>>>>> > > wrote: > >>>>>>>>>>> > >>>>>>>>>>> Hi Junchao, > >>>>>>>>>>> > >>>>>>>>>>> Yes that is what I want. > >>>>>>>>>>> > >>>>>>>>>>> Chang > >>>>>>>>>>> > >>>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: > >>>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > >>>>>>>>>>>> > >>>>>>>>>>>> >> wrote: > >>>>>>>>>>>> Junchao, > >>>>>>>>>>>> If I understand correctly Chang is using the block > >>>>>>>>>>>> Jacobi > >>>>>>>>>>>> method with a single block for a number of MPI ranks and > >>>>>>>>>>>> a direct > >>>>>>>>>>>> solver for each block so it uses > >>>>>>>>>>>> PCSetUp_BJacobi_Multiproc() which > >>>>>>>>>>>> is code Hong Zhang wrote a number of years ago for CPUs. > >>>>>>>>>>>> For their > >>>>>>>>>>>> particular problems this preconditioner works well, but > >>>>>>>>>>>> using an > >>>>>>>>>>>> iterative solver on the blocks does not work well. > >>>>>>>>>>>> If we had complete MPI-GPU direct solvers he could > >>>>>>>>>>>> just use > >>>>>>>>>>>> the current code with MPIAIJCUSPARSE on each block but > >>>>>>>>>>>> since we do > >>>>>>>>>>>> not he would like to use a single GPU for each block, > >>>>>>>>>>>> this means > >>>>>>>>>>>> that diagonal blocks of the global parallel MPI matrix > >>>>>>>>>>>> needs to be > >>>>>>>>>>>> sent to a subset of the GPUs (one GPU per block, which > >>>>>>>>>>>> has multiple > >>>>>>>>>>>> MPI ranks associated with the blocks). Similarly for the > >>>>>>>>>>>> triangular > >>>>>>>>>>>> solves the blocks of the right hand side needs to be > >>>>>>>>>>>> shipped to the > >>>>>>>>>>>> appropriate GPU and the resulting solution shipped back > >>>>>>>>>>>> to the > >>>>>>>>>>>> multiple GPUs. So Chang is absolutely correct, this is > >>>>>>>>>>>> somewhat like > >>>>>>>>>>>> your code for MUMPS with OpenMP. OK, I now understand > >>>>>>>>>>>> the background.. > >>>>>>>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the > >>>>>>>>>>>> blocks on the > >>>>>>>>>>>> MPI ranks and then shrink each block down to a single > >>>>>>>>>>>> GPU but this > >>>>>>>>>>>> would be pretty inefficient, ideally one would go > >>>>>>>>>>>> directly from the > >>>>>>>>>>>> big MPI matrix on all the GPUs to the sub matrices on > >>>>>>>>>>>> the subset of > >>>>>>>>>>>> GPUs. But this may be a large coding project. > >>>>>>>>>>>> I don't understand these sentences. Why do you say "shrink"? > >>>>>>>>>>>> In my mind, we just need to move each block (submatrix) > >>>>>>>>>>>> living over multiple MPI ranks to one of them and solve > >>>>>>>>>>>> directly there. In other words, we keep blocks' size, no > >>>>>>>>>>>> shrinking or expanding. > >>>>>>>>>>>> As mentioned before, cusparse does not provide LU > >>>>>>>>>>>> factorization. So the LU factorization would be done on CPU, > >>>>>>>>>>>> and the solve be done on GPU. I assume Chang wants to gain > >>>>>>>>>>>> from the (potential) faster solve (instead of factorization) > >>>>>>>>>>>> on GPU. > >>>>>>>>>>>> Barry > >>>>>>>>>>>> Since the matrices being factored and solved directly > >>>>>>>>>>>> are relatively > >>>>>>>>>>>> large it is possible that the cusparse code could be > >>>>>>>>>>>> reasonably > >>>>>>>>>>>> efficient (they are not the tiny problems one gets at > >>>>>>>>>>>> the coarse > >>>>>>>>>>>> level of multigrid). 
Of course, this is speculation, I > don't > >>>>>>>>>>>> actually know how much better the cusparse code would be > >>>>>>>>>>>> on the > >>>>>>>>>>>> direct solver than a good CPU direct sparse solver. > >>>>>>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>>>>>>>> > >>>>>>>>>>>> >> wrote: > >>>>>>>>>>>> > > >>>>>>>>>>>> > Sorry I am not familiar with the details either. Can > >>>>>>>>>>>> you please > >>>>>>>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in > mumps.c? > >>>>>>>>>>>> > > >>>>>>>>>>>> > Chang > >>>>>>>>>>>> > > >>>>>>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: > >>>>>>>>>>>> >> Hi Chang, > >>>>>>>>>>>> >> I did the work in mumps. It is easy for me to > >>>>>>>>>>>> understand > >>>>>>>>>>>> gathering matrix rows to one process. > >>>>>>>>>>>> >> But how to gather blocks (submatrices) to form a > >>>>>>>>>>>> large block? Can you draw a picture of that? > >>>>>>>>>>>> >> Thanks > >>>>>>>>>>>> >> --Junchao Zhang > >>>>>>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via > >>>>>>>>>>>> petsc-users > >>>>>>>>>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >>> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> >> Hi Barry, > >>>>>>>>>>>> >> I think mumps solver in petsc does support that. > >>>>>>>>>>>> You can > >>>>>>>>>>>> check the > >>>>>>>>>>>> >> documentation on "-mat_mumps_use_omp_threads" at > >>>>>>>>>>>> >> > >>>>>>>>>>>> > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html > >>>>>>>>>>>> < > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html> > >>>>>>>>>>>> > >>>>>>>>>>>> < > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html > >>>>>>>>>>>> < > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html>> > >>>>>>>>>>>> > >>>>>>>>>>>> >> > >>>>>>>>>>>> < > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html > >>>>>>>>>>>> < > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html> > >>>>>>>>>>>> > >>>>>>>>>>>> < > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html > >>>>>>>>>>>> < > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html>>> > >>>>>>>>>>>> > >>>>>>>>>>>> >> and the code enclosed by #if > >>>>>>>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in > >>>>>>>>>>>> >> functions MatMumpsSetUpDistRHSInfo and > >>>>>>>>>>>> >> MatMumpsGatherNonzerosOnMaster in > >>>>>>>>>>>> >> mumps.c > >>>>>>>>>>>> >> 1. I understand it is ideal to do one MPI rank > >>>>>>>>>>>> per GPU. > >>>>>>>>>>>> However, I am > >>>>>>>>>>>> >> working on an existing code that was developed > >>>>>>>>>>>> based on MPI > >>>>>>>>>>>> and the the > >>>>>>>>>>>> >> # of mpi ranks is typically equal to # of cpu > >>>>>>>>>>>> cores. We don't > >>>>>>>>>>>> want to > >>>>>>>>>>>> >> change the whole structure of the code. > >>>>>>>>>>>> >> 2. What you have suggested has been coded in > >>>>>>>>>>>> mumps.c. See > >>>>>>>>>>>> function > >>>>>>>>>>>> >> MatMumpsSetUpDistRHSInfo. > >>>>>>>>>>>> >> Regards, > >>>>>>>>>>>> >> Chang > >>>>>>>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: > >>>>>>>>>>>> >> > > >>>>>>>>>>>> >> > > >>>>>>>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu > >>>>>>>>>>>> > >>>>>>>>>>>> > > >>>>>>>>>>>> >> > >>>>>>>>>>>> >>> wrote: > >>>>>>>>>>>> >> >> > >>>>>>>>>>>> >> >> Hi Barry, > >>>>>>>>>>>> >> >> > >>>>>>>>>>>> >> >> That is exactly what I want. 
> >>>>>>>>>>>> >> >> > >>>>>>>>>>>> >> >> Back to my original question, I am looking > >>>>>>>>>>>> for an approach to > >>>>>>>>>>>> >> transfer > >>>>>>>>>>>> >> >> matrix > >>>>>>>>>>>> >> >> data from many MPI processes to "master" MPI > >>>>>>>>>>>> >> >> processes, each of which taking care of one > >>>>>>>>>>>> GPU, and then > >>>>>>>>>>>> upload > >>>>>>>>>>>> >> the data to GPU to > >>>>>>>>>>>> >> >> solve. > >>>>>>>>>>>> >> >> One can just grab some codes from mumps.c to > >>>>>>>>>>>> aijcusparse.cu >>>>>>>>>>>> > > >>>>>>>>>>>> >> > >>>>>>>>>>>> >>. > >>>>>>>>>>>> >> > > >>>>>>>>>>>> >> > mumps.c doesn't actually do that. It never > >>>>>>>>>>>> needs to > >>>>>>>>>>>> copy the > >>>>>>>>>>>> >> entire matrix to a single MPI rank. > >>>>>>>>>>>> >> > > >>>>>>>>>>>> >> > It would be possible to write such a code > >>>>>>>>>>>> that you > >>>>>>>>>>>> suggest but > >>>>>>>>>>>> >> it is not clear that it makes sense > >>>>>>>>>>>> >> > > >>>>>>>>>>>> >> > 1) For normal PETSc GPU usage there is one > >>>>>>>>>>>> GPU per MPI > >>>>>>>>>>>> rank, so > >>>>>>>>>>>> >> while your one GPU per big domain is solving its > >>>>>>>>>>>> systems the > >>>>>>>>>>>> other > >>>>>>>>>>>> >> GPUs (with the other MPI ranks that share that > >>>>>>>>>>>> domain) are doing > >>>>>>>>>>>> >> nothing. > >>>>>>>>>>>> >> > > >>>>>>>>>>>> >> > 2) For each triangular solve you would have to > >>>>>>>>>>>> gather the > >>>>>>>>>>>> right > >>>>>>>>>>>> >> hand side from the multiple ranks to the single > >>>>>>>>>>>> GPU to pass it to > >>>>>>>>>>>> >> the GPU solver and then scatter the resulting > >>>>>>>>>>>> solution back > >>>>>>>>>>>> to all > >>>>>>>>>>>> >> of its subdomain ranks. > >>>>>>>>>>>> >> > > >>>>>>>>>>>> >> > What I was suggesting was assign an entire > >>>>>>>>>>>> subdomain to a > >>>>>>>>>>>> >> single MPI rank, thus it does everything on one > >>>>>>>>>>>> GPU and can > >>>>>>>>>>>> use the > >>>>>>>>>>>> >> GPU solver directly. If all the major > >>>>>>>>>>>> computations of a subdomain > >>>>>>>>>>>> >> can fit and be done on a single GPU then you would > be > >>>>>>>>>>>> utilizing all > >>>>>>>>>>>> >> the GPUs you are using effectively. > >>>>>>>>>>>> >> > > >>>>>>>>>>>> >> > Barry > >>>>>>>>>>>> >> > > >>>>>>>>>>>> >> > > >>>>>>>>>>>> >> > > >>>>>>>>>>>> >> >> > >>>>>>>>>>>> >> >> Chang > >>>>>>>>>>>> >> >> > >>>>>>>>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: > >>>>>>>>>>>> >> >>> Chang, > >>>>>>>>>>>> >> >>> You are correct there is no MPI + GPU > >>>>>>>>>>>> direct > >>>>>>>>>>>> solvers that > >>>>>>>>>>>> >> currently do the triangular solves with MPI + GPU > >>>>>>>>>>>> parallelism > >>>>>>>>>>>> that I > >>>>>>>>>>>> >> am aware of. You are limited that individual > >>>>>>>>>>>> triangular solves be > >>>>>>>>>>>> >> done on a single GPU. I can only suggest making > >>>>>>>>>>>> each subdomain as > >>>>>>>>>>>> >> big as possible to utilize each GPU as much as > >>>>>>>>>>>> possible for the > >>>>>>>>>>>> >> direct triangular solves. 
> >>>>>>>>>>>> >> >>> Barry > >>>>>>>>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via > >>>>>>>>>>>> petsc-users > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >>> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> >> >>>> > >>>>>>>>>>>> >> >>>> Hi Mark, > >>>>>>>>>>>> >> >>>> > >>>>>>>>>>>> >> >>>> '-mat_type aijcusparse' works with > >>>>>>>>>>>> mpiaijcusparse with > >>>>>>>>>>>> other > >>>>>>>>>>>> >> solvers, but with -pc_factor_mat_solver_type > >>>>>>>>>>>> cusparse, it > >>>>>>>>>>>> will give > >>>>>>>>>>>> >> an error. > >>>>>>>>>>>> >> >>>> > >>>>>>>>>>>> >> >>>> Yes what I want is to have mumps or superlu > >>>>>>>>>>>> to do the > >>>>>>>>>>>> >> factorization, and then do the rest, including > >>>>>>>>>>>> GMRES solver, > >>>>>>>>>>>> on gpu. > >>>>>>>>>>>> >> Is that possible? > >>>>>>>>>>>> >> >>>> > >>>>>>>>>>>> >> >>>> I have tried to use aijcusparse with > >>>>>>>>>>>> superlu_dist, it > >>>>>>>>>>>> runs but > >>>>>>>>>>>> >> the iterative solver is still running on CPUs. I > have > >>>>>>>>>>>> contacted the > >>>>>>>>>>>> >> superlu group and they confirmed that is the case > >>>>>>>>>>>> right now. > >>>>>>>>>>>> But if > >>>>>>>>>>>> >> I set -pc_factor_mat_solver_type cusparse, it > >>>>>>>>>>>> seems that the > >>>>>>>>>>>> >> iterative solver is running on GPU. > >>>>>>>>>>>> >> >>>> > >>>>>>>>>>>> >> >>>> Chang > >>>>>>>>>>>> >> >>>> > >>>>>>>>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: > >>>>>>>>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu > >>>>>>>>>>>> > >>>>>>>>>>>> > > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> > >>>>>>>>>>>> > > >>>>>>>>>>>> >> > >>>>>>>>>>>> >>>> wrote: > >>>>>>>>>>>> >> >>>>> Thank you Junchao for explaining this. > >>>>>>>>>>>> I guess in > >>>>>>>>>>>> my case > >>>>>>>>>>>> >> the code is > >>>>>>>>>>>> >> >>>>> just calling a seq solver like superlu > >>>>>>>>>>>> to do > >>>>>>>>>>>> >> factorization on GPUs. > >>>>>>>>>>>> >> >>>>> My idea is that I want to have a > >>>>>>>>>>>> traditional MPI > >>>>>>>>>>>> code to > >>>>>>>>>>>> >> utilize GPUs > >>>>>>>>>>>> >> >>>>> with cusparse. Right now cusparse does > >>>>>>>>>>>> not support > >>>>>>>>>>>> mpiaij > >>>>>>>>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' > >>>>>>>>>>>> will give you an > >>>>>>>>>>>> >> mpiaijcusparse matrix with > 1 processes. > >>>>>>>>>>>> >> >>>>> (-mat_type mpiaijcusparse might also work > >>>>>>>>>>>> with >1 proc). > >>>>>>>>>>>> >> >>>>> However, I see in grepping the repo that > >>>>>>>>>>>> all the mumps and > >>>>>>>>>>>> >> superlu tests use aij or sell matrix type. > >>>>>>>>>>>> >> >>>>> MUMPS and SuperLU provide their own > >>>>>>>>>>>> solves, I assume > >>>>>>>>>>>> .... but > >>>>>>>>>>>> >> you might want to do other matrix operations on > >>>>>>>>>>>> the GPU. Is > >>>>>>>>>>>> that the > >>>>>>>>>>>> >> issue? > >>>>>>>>>>>> >> >>>>> Did you try -mat_type aijcusparse with > >>>>>>>>>>>> MUMPS and/or > >>>>>>>>>>>> SuperLU > >>>>>>>>>>>> >> have a problem? (no test with it so it probably > >>>>>>>>>>>> does not work) > >>>>>>>>>>>> >> >>>>> Thanks, > >>>>>>>>>>>> >> >>>>> Mark > >>>>>>>>>>>> >> >>>>> so I > >>>>>>>>>>>> >> >>>>> want the code to have a mpiaij matrix > >>>>>>>>>>>> when adding > >>>>>>>>>>>> all the > >>>>>>>>>>>> >> matrix terms, > >>>>>>>>>>>> >> >>>>> and then transform the matrix to > >>>>>>>>>>>> seqaij when doing the > >>>>>>>>>>>> >> factorization > >>>>>>>>>>>> >> >>>>> and > >>>>>>>>>>>> >> >>>>> solve. 
This involves sending the data > >>>>>>>>>>>> to the master > >>>>>>>>>>>> >> process, and I > >>>>>>>>>>>> >> >>>>> think > >>>>>>>>>>>> >> >>>>> the petsc mumps solver have something > >>>>>>>>>>>> similar already. > >>>>>>>>>>>> >> >>>>> Chang > >>>>>>>>>>>> >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang > wrote: > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM > >>>>>>>>>>>> Mark Adams > >>>>>>>>>>>> >> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >>>>> >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >>> > >>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >>>>> wrote: > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM > >>>>>>>>>>>> Chang Liu > >>>>>>>>>>>> >> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >>>>> >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >>> > >>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >>>>> wrote: > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > Hi Mark, > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > The option I use is like > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > -pc_type bjacobi > >>>>>>>>>>>> -pc_bjacobi_blocks 16 > >>>>>>>>>>>> >> -ksp_type fgmres > >>>>>>>>>>>> >> >>>>> -mat_type > >>>>>>>>>>>> >> >>>>> > aijcusparse > >>>>>>>>>>>> *-sub_pc_factor_mat_solver_type > >>>>>>>>>>>> >> cusparse > >>>>>>>>>>>> >> >>>>> *-sub_ksp_type > >>>>>>>>>>>> >> >>>>> > preonly *-sub_pc_type lu* > >>>>>>>>>>>> -ksp_max_it 2000 > >>>>>>>>>>>> >> -ksp_rtol 1.e-300 > >>>>>>>>>>>> >> >>>>> > -ksp_atol 1.e-300 > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > Note, If you use -log_view the > >>>>>>>>>>>> last column > >>>>>>>>>>>> (rows > >>>>>>>>>>>> >> are the > >>>>>>>>>>>> >> >>>>> method like > >>>>>>>>>>>> >> >>>>> > MatFactorNumeric) has the > >>>>>>>>>>>> percent of work > >>>>>>>>>>>> in the GPU. > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > Junchao: *This* implies that we > >>>>>>>>>>>> have a > >>>>>>>>>>>> cuSparse LU > >>>>>>>>>>>> >> >>>>> factorization. Is > >>>>>>>>>>>> >> >>>>> > that correct? (I don't think we > do) > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > No, we don't have cuSparse LU > >>>>>>>>>>>> factorization. If you check > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will > >>>>>>>>>>>> find it > >>>>>>>>>>>> >> calls > >>>>>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJ() instead. > >>>>>>>>>>>> >> >>>>> > So I don't understand Chang's idea. > >>>>>>>>>>>> Do you want to > >>>>>>>>>>>> >> make bigger > >>>>>>>>>>>> >> >>>>> blocks? > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > I think this one do both > >>>>>>>>>>>> factorization and > >>>>>>>>>>>> >> solve on gpu. > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > You can check the > >>>>>>>>>>>> runex72_aijcusparse.sh file > >>>>>>>>>>>> >> in petsc > >>>>>>>>>>>> >> >>>>> install > >>>>>>>>>>>> >> >>>>> > directory, and try it your > >>>>>>>>>>>> self (this > >>>>>>>>>>>> is only lu > >>>>>>>>>>>> >> >>>>> factorization > >>>>>>>>>>>> >> >>>>> > without > >>>>>>>>>>>> >> >>>>> > iterative solve). 
> >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > Chang > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>> > On 10/12/21 1:17 PM, Mark > >>>>>>>>>>>> Adams wrote: > >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>> > > On Tue, Oct 12, 2021 at > >>>>>>>>>>>> 11:19 AM > >>>>>>>>>>>> Chang Liu > >>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >>> > >>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >>>> > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> > > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >>> > >>>>>>>>>>>> >> >>>>> >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> > > >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>> wrote: > >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>> > > Hi Junchao, > >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>> > > No I only needs it > >>>>>>>>>>>> to be transferred > >>>>>>>>>>>> >> within a > >>>>>>>>>>>> >> >>>>> node. I use > >>>>>>>>>>>> >> >>>>> > block-Jacobi > >>>>>>>>>>>> >> >>>>> > > method and GMRES to > >>>>>>>>>>>> solve the sparse > >>>>>>>>>>>> >> matrix, so each > >>>>>>>>>>>> >> >>>>> > direct solver will > >>>>>>>>>>>> >> >>>>> > > take care of a > >>>>>>>>>>>> sub-block of the > >>>>>>>>>>>> whole > >>>>>>>>>>>> >> matrix. In this > >>>>>>>>>>>> >> >>>>> > way, I can use > >>>>>>>>>>>> >> >>>>> > > one > >>>>>>>>>>>> >> >>>>> > > GPU to solve one > >>>>>>>>>>>> sub-block, which is > >>>>>>>>>>>> >> stored within > >>>>>>>>>>>> >> >>>>> one node. > >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>> > > It was stated in the > >>>>>>>>>>>> documentation that > >>>>>>>>>>>> >> cusparse > >>>>>>>>>>>> >> >>>>> solver > >>>>>>>>>>>> >> >>>>> > is slow. > >>>>>>>>>>>> >> >>>>> > > However, in my test > >>>>>>>>>>>> using > >>>>>>>>>>>> ex72.c, the > >>>>>>>>>>>> >> cusparse > >>>>>>>>>>>> >> >>>>> solver is > >>>>>>>>>>>> >> >>>>> > faster than > >>>>>>>>>>>> >> >>>>> > > mumps or > >>>>>>>>>>>> superlu_dist on CPUs. > >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>> > > Are we talking about the > >>>>>>>>>>>> factorization, the > >>>>>>>>>>>> >> solve, or > >>>>>>>>>>>> >> >>>>> both? > >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>> > > We do not have an > >>>>>>>>>>>> interface to > >>>>>>>>>>>> cuSparse's LU > >>>>>>>>>>>> >> >>>>> factorization (I > >>>>>>>>>>>> >> >>>>> > just > >>>>>>>>>>>> >> >>>>> > > learned that it exists a > >>>>>>>>>>>> few weeks ago). > >>>>>>>>>>>> >> >>>>> > > Perhaps your fast > >>>>>>>>>>>> "cusparse solver" is > >>>>>>>>>>>> >> '-pc_type lu > >>>>>>>>>>>> >> >>>>> -mat_type > >>>>>>>>>>>> >> >>>>> > > aijcusparse' ? This > >>>>>>>>>>>> would be the CPU > >>>>>>>>>>>> >> factorization, > >>>>>>>>>>>> >> >>>>> which is the > >>>>>>>>>>>> >> >>>>> > > dominant cost. 
> >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>> > > Chang > >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>> > > On 10/12/21 10:24 > >>>>>>>>>>>> AM, Junchao > >>>>>>>>>>>> Zhang wrote: > >>>>>>>>>>>> >> >>>>> > > > Hi, Chang, > >>>>>>>>>>>> >> >>>>> > > > For the mumps > >>>>>>>>>>>> solver, we > >>>>>>>>>>>> usually > >>>>>>>>>>>> >> transfers > >>>>>>>>>>>> >> >>>>> matrix > >>>>>>>>>>>> >> >>>>> > and vector > >>>>>>>>>>>> >> >>>>> > > data > >>>>>>>>>>>> >> >>>>> > > > within a compute > >>>>>>>>>>>> node. For > >>>>>>>>>>>> the idea you > >>>>>>>>>>>> >> >>>>> propose, it > >>>>>>>>>>>> >> >>>>> > looks like > >>>>>>>>>>>> >> >>>>> > > we need > >>>>>>>>>>>> >> >>>>> > > > to gather data > within > >>>>>>>>>>>> >> MPI_COMM_WORLD, right? > >>>>>>>>>>>> >> >>>>> > > > > >>>>>>>>>>>> >> >>>>> > > > Mark, I > >>>>>>>>>>>> remember you said > >>>>>>>>>>>> >> cusparse solve is > >>>>>>>>>>>> >> >>>>> slow > >>>>>>>>>>>> >> >>>>> > and you would > >>>>>>>>>>>> >> >>>>> > > > rather do it on > >>>>>>>>>>>> CPU. Is it right? > >>>>>>>>>>>> >> >>>>> > > > > >>>>>>>>>>>> >> >>>>> > > > --Junchao Zhang > >>>>>>>>>>>> >> >>>>> > > > > >>>>>>>>>>>> >> >>>>> > > > > >>>>>>>>>>>> >> >>>>> > > > On Mon, Oct 11, > >>>>>>>>>>>> 2021 at 10:25 PM > >>>>>>>>>>>> >> Chang Liu via > >>>>>>>>>>>> >> >>>>> petsc-users > >>>>>>>>>>>> >> >>>>> > > > > >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >>> > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> petsc-users at mcs.anl.gov> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >>>> > >>>>>>>>>>>> petsc-users at mcs.anl.gov> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >>> > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> petsc-users at mcs.anl.gov> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >>>>> > >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >>> > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> petsc-users at mcs.anl.gov> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >>>> > >>>>>>>>>>>> petsc-users at mcs.anl.gov> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >>> > >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> petsc-users at mcs.anl.gov> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >>>>> >>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > > >>>>>>>>>>>> >> 
>>>>>>>>>>>> > >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>> > >>>>>>>>>>>> >> >>>>> > > wrote: > >>>>>>>>>>>> >> >>>>> > > > > >>>>>>>>>>>> >> >>>>> > > > Hi, > >>>>>>>>>>>> >> >>>>> > > > > >>>>>>>>>>>> >> >>>>> > > > Currently, it > >>>>>>>>>>>> is possible > >>>>>>>>>>>> to use > >>>>>>>>>>>> >> mumps > >>>>>>>>>>>> >> >>>>> solver in > >>>>>>>>>>>> >> >>>>> > PETSC with > >>>>>>>>>>>> >> >>>>> > > > > >>>>>>>>>>>> -mat_mumps_use_omp_threads > >>>>>>>>>>>> >> option, so that > >>>>>>>>>>>> >> >>>>> > multiple MPI > >>>>>>>>>>>> >> >>>>> > > processes will > >>>>>>>>>>>> >> >>>>> > > > transfer the > >>>>>>>>>>>> matrix and > >>>>>>>>>>>> rhs data > >>>>>>>>>>>> >> to the master > >>>>>>>>>>>> >> >>>>> > rank, and then > >>>>>>>>>>>> >> >>>>> > > master > >>>>>>>>>>>> >> >>>>> > > > rank will > >>>>>>>>>>>> call mumps with > >>>>>>>>>>>> OpenMP > >>>>>>>>>>>> >> to solve > >>>>>>>>>>>> >> >>>>> the matrix. > >>>>>>>>>>>> >> >>>>> > > > > >>>>>>>>>>>> >> >>>>> > > > I wonder if > >>>>>>>>>>>> someone can > >>>>>>>>>>>> develop > >>>>>>>>>>>> >> similar > >>>>>>>>>>>> >> >>>>> option for > >>>>>>>>>>>> >> >>>>> > cusparse > >>>>>>>>>>>> >> >>>>> > > solver. > >>>>>>>>>>>> >> >>>>> > > > Right now, > >>>>>>>>>>>> this solver > >>>>>>>>>>>> does not > >>>>>>>>>>>> >> work with > >>>>>>>>>>>> >> >>>>> > mpiaijcusparse. I > >>>>>>>>>>>> >> >>>>> > > think a > >>>>>>>>>>>> >> >>>>> > > > possible > >>>>>>>>>>>> workaround is to > >>>>>>>>>>>> >> transfer all the > >>>>>>>>>>>> >> >>>>> matrix > >>>>>>>>>>>> >> >>>>> > data to one MPI > >>>>>>>>>>>> >> >>>>> > > > process, and > >>>>>>>>>>>> then upload the > >>>>>>>>>>>> >> data to GPU to > >>>>>>>>>>>> >> >>>>> solve. > >>>>>>>>>>>> >> >>>>> > In this > >>>>>>>>>>>> >> >>>>> > > way, one can > >>>>>>>>>>>> >> >>>>> > > > use cusparse > >>>>>>>>>>>> solver for a MPI > >>>>>>>>>>>> >> program. 
> >>>>>>>>>>>> >> >>>>> > > > Chang
> >>>>>>>>>>>> >> >>>>> > > > --
> >>>>>>>>>>>> >> >>>>> > > > Chang Liu
> >>>>>>>>>>>> >> >>>>> > > > Staff Research Physicist
> >>>>>>>>>>>> >> >>>>> > > > +1 609 243 3438
> >>>>>>>>>>>> >> >>>>> > > > cliu at pppl.gov
> >>>>>>>>>>>> >> >>>>> > > > Princeton Plasma Physics Laboratory
> >>>>>>>>>>>> >> >>>>> > > > 100 Stellarator Rd, Princeton NJ 08540, USA
>
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> cliu at pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From mfadams at lbl.gov Sun Oct 17 08:04:33 2021 From: mfadams at lbl.gov (Mark Adams) Date: Sun, 17 Oct 2021 09:04:33 -0400 Subject: [petsc-users] gamg student questions In-Reply-To: <1634439002116.86375@mit.edu> References: <1634439002116.86375@mit.edu> Message-ID: Hi Daniel, [this is a PETSc users list question so let me move it there] The behavior that you are seeing is a bit odd but not surprising. First, you should start with simple problems and get AMG (you might want to try this exercise with hypre as well: --download-hypre and use -pc_type hypre, or BDDC, see below). There are, alas, a lot of tuning parameters in AMG/DD and I recommend a homotopy process: you can start with issues that deal with your discretization on a simple cube, linear elasticity, cube elements, modest Posson ratio, etc., and first get "textbook multigrid efficiency" (TME), which for elasticity and a V(2,2) cycle in GAMG is about one digit of error reduction per iteration and perfectly monotonic until it hits floating point precision. I would set this problem up and I would hope it runs OK, but the problems that you want to do are probably pretty hard (high order FE, plasticity, incompressibility) so there will be more work to do. That said, PETSc has nice domain decomposition solvers that are more optimized and maintained for elasticity. Now that I think about it, you should probably look at these ( https://petsc.org/release/docs/manualpages/PC/PCBDDC.html https://petsc.org/release/docs/manual/ksp/#balancing-domain-decomposition-by-constraints). I think they prefer, but do not require, that you do not assemble your element matrices, but let them do it. The docs will make that clear. BSSC is great but it is not magic, and it is no less complex, so I would still recommend the same process of getting TME and then moving to the problems that you want to solve. Good luck, Mark On Sat, Oct 16, 2021 at 10:50 PM Daniel N Pickard wrote: > Hi Dr Adams, > > > I am using the gamg in petsc to solve some elasticity problems for > modeling bones. I am new to profiling with petsc, but I am observing that > around a thousand iterations my norm has gone down 3 orders of magnitude > but the solver slows down and progress sort of stalls. The norm > also doesn't decrease monotonically, but jumps around a bit. I also notice > that if I request to only use 1 multigrid level, the preconditioner is > much cheaper and not as powerful so the code takes more iterations, but > runs 2-3x faster. Is this expected that large models require lots of > iterations and convergence slows down as we get more accurate? What exactly > should I be looking for when I am profiling to try to understand how to run > faster? I see that a lot of my ratio's are 2.7, but I think that is because > my mesh partitioner is not doing a great job making equal domains. What are > the giveaways in the log_view that tell you that petsc could be optimized > more? > > > Also when I look at the solution with just 4 orders of magnitude of > convergence I can see that the solver has not made much progress in the > interior of the domain, but seems to have smoothed out the boundary where > forces where applied very well. Does this mean I should use a larger > threshold to get more course grids that can fix the low frequency error? > > > Thanks, > > Daniel Pickard > -------------- next part -------------- An HTML attachment was scrubbed... 
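[Editorial note, not part of Mark's message: a minimal sketch of the near-nullspace setup that GAMG elasticity advice of this kind usually assumes. A and coords are placeholder names for the assembled stiffness matrix and a nodal-coordinate vector whose block size has been set to the spatial dimension.]

    MatNullSpace rbm;

    /* coords: coordinates of each node; VecSetBlockSize(coords, dim) must already have been called */
    MatNullSpaceCreateRigidBody(coords, &rbm);
    MatSetNearNullSpace(A, rbm);   /* GAMG builds its coarse spaces from these rigid-body modes */
    MatNullSpaceDestroy(&rbm);

With that in place, a run such as ./app -ksp_type cg -pc_type gamg -ksp_monitor_true_residual (or -pc_type hypre for comparison) on the simple cube problem is where to look for the roughly one-digit-per-iteration, monotone convergence described above, before moving on to harder discretizations.
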
URL: From knepley at gmail.com Sun Oct 17 10:31:02 2021 From: knepley at gmail.com (Matthew Knepley) Date: Sun, 17 Oct 2021 11:31:02 -0400 Subject: [petsc-users] gamg student questions In-Reply-To: References: <1634439002116.86375@mit.edu> Message-ID: On Sun, Oct 17, 2021 at 9:04 AM Mark Adams wrote: > Hi Daniel, [this is a PETSc users list question so let me move it there] > > The behavior that you are seeing is a bit odd but not surprising. > > First, you should start with simple problems and get AMG (you might want > to try this exercise with hypre as well: --download-hypre and use -pc_type > hypre, or BDDC, see below). > We have two examples that do this: 1) SNES ex56: This shows good performance of GAMG on Q1 and Q2 elasticity 2) SNES ex17: This sets up a lot of finite element elasticity problems where you can experiment with GAMG, ML, Hypre, BDDC, and other preconditioners As a rule of thumb, if my solver is taking more than 100 iterations (usually for 1e-8 tolerance), something is very wrong. Either the problem is setup incorrectly, the solver is configured incorrectly, or I need to switch solvers. Thanks, Matt > There are, alas, a lot of tuning parameters in AMG/DD and I recommend a > homotopy process: you can start with issues that deal with your > discretization on a simple cube, linear elasticity, cube elements, modest > Posson ratio, etc., and first get "textbook multigrid efficiency" (TME), > which for elasticity and a V(2,2) cycle in GAMG is about one digit of error > reduction per iteration and perfectly monotonic until it hits floating > point precision. > > I would set this problem up and I would hope it runs OK, but the > problems that you want to do are probably pretty hard (high order FE, > plasticity, incompressibility) so there will be more work to do. > > That said, PETSc has nice domain decomposition solvers that are more > optimized and maintained for elasticity. Now that I think about it, you > should probably look at these ( > https://petsc.org/release/docs/manualpages/PC/PCBDDC.html > https://petsc.org/release/docs/manual/ksp/#balancing-domain-decomposition-by-constraints). > I think they prefer, but do not require, that you do not assemble your > element matrices, but let them do it. The docs will make that clear. > > BSSC is great but it is not magic, and it is no less complex, so I would > still recommend the same process of getting TME and then moving to the > problems that you want to solve. > > Good luck, > Mark > > > > On Sat, Oct 16, 2021 at 10:50 PM Daniel N Pickard wrote: > >> Hi Dr Adams, >> >> >> I am using the gamg in petsc to solve some elasticity problems for >> modeling bones. I am new to profiling with petsc, but I am observing that >> around a thousand iterations my norm has gone down 3 orders of magnitude >> but the solver slows down and progress sort of stalls. The norm >> also doesn't decrease monotonically, but jumps around a bit. I also notice >> that if I request to only use 1 multigrid level, the preconditioner is >> much cheaper and not as powerful so the code takes more iterations, but >> runs 2-3x faster. Is this expected that large models require lots of >> iterations and convergence slows down as we get more accurate? What exactly >> should I be looking for when I am profiling to try to understand how to run >> faster? I see that a lot of my ratio's are 2.7, but I think that is because >> my mesh partitioner is not doing a great job making equal domains. 
What are >> the giveaways in the log_view that tell you that petsc could be optimized >> more? >> >> >> Also when I look at the solution with just 4 orders of magnitude of >> convergence I can see that the solver has not made much progress in the >> interior of the domain, but seems to have smoothed out the boundary where >> forces where applied very well. Does this mean I should use a larger >> threshold to get more course grids that can fix the low frequency error? >> >> >> Thanks, >> >> Daniel Pickard >> > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From swarnava89 at gmail.com Sun Oct 17 17:32:23 2021 From: swarnava89 at gmail.com (Swarnava Ghosh) Date: Sun, 17 Oct 2021 18:32:23 -0400 Subject: [petsc-users] MatVec on GPUs Message-ID: Dear Petsc team, I had a query regarding using CUDA to accelerate a matrix vector product. I have a sequential sparse matrix (MATSEQBAIJ type). I want to port a MatVec call onto GPUs. Is there any code/example I can look at? Sincerely, SG -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Sun Oct 17 18:07:11 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Sun, 17 Oct 2021 18:07:11 -0500 Subject: [petsc-users] MatVec on GPUs In-Reply-To: References: Message-ID: You can do that with command line options -mat_type aijcusparse -vec_type cuda On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh wrote: > Dear Petsc team, > > I had a query regarding using CUDA to accelerate a matrix vector product. > I have a sequential sparse matrix (MATSEQBAIJ type). I want to port a > MatVec call onto GPUs. Is there any code/example I can look at? > > Sincerely, > SG > -------------- next part -------------- An HTML attachment was scrubbed... URL: From swarnava89 at gmail.com Sun Oct 17 18:12:17 2021 From: swarnava89 at gmail.com (Swarnava Ghosh) Date: Sun, 17 Oct 2021 19:12:17 -0400 Subject: [petsc-users] MatVec on GPUs In-Reply-To: References: Message-ID: Do I need convert the MATSEQBAIJ to a cuda matrix in code? If I do it from command line, then are the other MatVec calls are ported onto CUDA? I have many MatVec calls in my code, but I specifically want to port just one call. Sincerely, Swarnava On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang wrote: > You can do that with command line options -mat_type aijcusparse -vec_type > cuda > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh wrote: > >> Dear Petsc team, >> >> I had a query regarding using CUDA to accelerate a matrix vector product. >> I have a sequential sparse matrix (MATSEQBAIJ type). I want to port a >> MatVec call onto GPUs. Is there any code/example I can look at? >> >> Sincerely, >> SG >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Sun Oct 17 18:50:14 2021 From: knepley at gmail.com (Matthew Knepley) Date: Sun, 17 Oct 2021 19:50:14 -0400 Subject: [petsc-users] MatVec on GPUs In-Reply-To: References: Message-ID: On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh wrote: > Do I need convert the MATSEQBAIJ to a cuda matrix in code? > You would need a call to MatSetFromOptions() to take that type from the command line, and not have the type hard-coded in your application. It is generally a bad idea to hard code the implementation type. 
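[Editorial note, not part of the original message: a minimal sketch of the run-time type selection described above, with n and the assembly calls as placeholders. The type set in code is only a default, so the same source runs on the CPU unless the cuSPARSE/CUDA types are requested on the command line; the options-prefix suggestion in the next answer can then restrict this to the one matrix of interest.]

    MatCreate(PETSC_COMM_SELF, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetType(A, MATSEQBAIJ);   /* default when no -mat_type option is given */
    MatSetFromOptions(A);        /* -mat_type aijcusparse overrides the default */
    /* ... preallocate and assemble A as before ... */

    VecCreate(PETSC_COMM_SELF, &x);
    VecSetSizes(x, PETSC_DECIDE, n);
    VecSetFromOptions(x);        /* -vec_type cuda places the vector on the GPU */

Run with, e.g., ./app -mat_type aijcusparse -vec_type cuda and the MatMult with this matrix is performed by cuSPARSE.
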
> If I do it from command line, then are the other MatVec calls are ported > onto CUDA? I have many MatVec calls in my code, but I specifically want to > port just one call. > You can give that one matrix an options prefix to isolate it. Thanks, Matt > Sincerely, > Swarnava > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang > wrote: > >> You can do that with command line options -mat_type aijcusparse -vec_type >> cuda >> >> On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh >> wrote: >> >>> Dear Petsc team, >>> >>> I had a query regarding using CUDA to accelerate a matrix vector >>> product. >>> I have a sequential sparse matrix (MATSEQBAIJ type). I want to port a >>> MatVec call onto GPUs. Is there any code/example I can look at? >>> >>> Sincerely, >>> SG >>> >> -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From swarnava89 at gmail.com Sun Oct 17 19:09:58 2021 From: swarnava89 at gmail.com (Swarnava Ghosh) Date: Sun, 17 Oct 2021 20:09:58 -0400 Subject: [petsc-users] MatVec on GPUs In-Reply-To: References: Message-ID: Thanks Matt and Junchao. Sincerely, Swarnava On Sun, Oct 17, 2021 at 7:50 PM Matthew Knepley wrote: > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh > wrote: > >> Do I need convert the MATSEQBAIJ to a cuda matrix in code? >> > > You would need a call to MatSetFromOptions() to take that type from the > command line, and not have > the type hard-coded in your application. It is generally a bad idea to > hard code the implementation type. > > >> If I do it from command line, then are the other MatVec calls are ported >> onto CUDA? I have many MatVec calls in my code, but I specifically want to >> port just one call. >> > > You can give that one matrix an options prefix to isolate it. > > Thanks, > > Matt > > >> Sincerely, >> Swarnava >> >> On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang >> wrote: >> >>> You can do that with command line options -mat_type aijcusparse >>> -vec_type cuda >>> >>> On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh >>> wrote: >>> >>>> Dear Petsc team, >>>> >>>> I had a query regarding using CUDA to accelerate a matrix vector >>>> product. >>>> I have a sequential sparse matrix (MATSEQBAIJ type). I want to port a >>>> MatVec call onto GPUs. Is there any code/example I can look at? >>>> >>>> Sincerely, >>>> SG >>>> >>> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yc17470 at connect.um.edu.mo Sat Oct 16 22:31:37 2021 From: yc17470 at connect.um.edu.mo (Gong Yujie) Date: Sun, 17 Oct 2021 03:31:37 +0000 Subject: [petsc-users] Question about DMPlex for parallel computing Message-ID: Hi, I'm learning to use DMPlex to write a parallel program. I've tried to write a sequential code earlier successfully, but when to write a parallel code, there are many things different. There are some questions I'm curious about. 1. Are the functions as DMPlexCreateGmshFromFile() and other read from file functions reading in the mesh in parallel? Or just the root node read in the mesh? 2. 
Are there some examples available for distributing the mesh and creating the corresponding local to global node (or position) mapping? I'm grateful for your kind help! Best Regards, Gong -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.seize at onera.fr Mon Oct 18 02:27:05 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Mon, 18 Oct 2021 09:27:05 +0200 Subject: [petsc-users] Periodic boundary conditions in DMPlex In-Reply-To: References: <9a6f1a40-142e-f2a5-2101-2b60074b705e@onera.fr> <71784fff-35eb-a129-3609-004e5e596575@onera.fr> <8ff9d951-6958-aa1a-b875-e7488bb6b30b@onera.fr> <64787e00-820b-0c83-02f8-854569e4df9e@onera.fr> Message-ID: <17ca8a59-bb0b-ff7e-b34e-7666aa1931b4@onera.fr> On 15/10/21 22:30, Matthew Knepley wrote: > On Fri, Oct 15, 2021 at 10:16 AM Pierre Seize > wrote: > > I read everything again, I think I did not understand you at > first. The first solution is to modify the DAG, so that the > rightmost cell is linked to the leftmost face, right ? To do that, > do I have to manually edit the DAG (the mesh is read from a file) ? > > Yes, the DAG would be modified if you want it for some particular mesh > that we cannot read automatically. For example, we can read periodic > GMsh meshes. > > If so, the mesh connectivity is like the one of a torus, then how > does it work with the cells/faces coordinates ? > > You let the coordinate field be in a DG space, so that it can have jumps. I'm not sure I fully understand this, what I'll do is experiment with a periodic GMSH mesh. > Now I think the second method may be more straightforward. What's > the idea ? Get the mapping with DMGetLocalToGlobalMapping, then > create the mapping corresponding to the periodicity with > ISLocalToGlobalMappingCreate, and finally > ISLocalToGlobalMappingConcatenate ? I'm not sure this is the way, > and I did not find something like DMSetLocalToGlobalMapping to > restore the modified mapping. > > It is more complicated. We make the LocalToGlobalMap by looking at the > PetscSection (essentially if gives function space information)? and > deciding which unknowns are removed from the global space. > You would need to decide that unknowns constrained by periodicity are > not present in the global space. Actually, this is not hard. You just > mark them as constrained in the PetscSection, and all the layout > functions will function correctly. However, then the LocalToGlobalMap > will not be exactly right because the constrained unknowns will not be > filled in (just like Dirichlet conditions). You would augment the > map so that it fills those in by looking up their periodic > counterparts. Jed has argued for this type of periodicity. I also don't understand. On one side of my mesh I have: | ghost1 | cell1 | cell2 | ... and on the other | cell_n-1 | cell_n | ghost_n |. Are not the ghosts (from DMPlexConstructGhostCells) already constrained ? I experimented on that too, I did: DMGetSectionSF PetscSFGetGraph augment the graph by adding the local ghost cells to the leaves, and the correct remote "true" cells to the roots PetscSFSetGraph and it seems to do what I want. Is this what you meant ? Is this a correct way to use the PETSc objects ? Or is this just hacky and I'm lucky it works ? Pierre > > To me, the first kind is much more straightforward, but maybe this is > because I find the topology code more clear. > > ? Thanks, > > ? ? ? 
Matt > > Pierre > > > On 15/10/21 15:33, Pierre Seize wrote: >> >> When I first tried to handle the periodicity, I found the >> DMPlexCreateBoxMesh function (I cannot find the cylinder one). >> >> From reading the sources, I understand that we do some work >> either in DMPlexCreateCubeMesh_Internal or with DMSetPeriodicity. >> >> I tried to use DMSetPeriodicity before, for example with a 2x2 >> box on length 10. I did something like: >> >> const PetscReal maxCell[] = {2, 2}; >> const PetscReal L[] = {10, 10}; >> const DMBoundaryType bd[] = {DM_BOUNDARY_PERIODIC, >> DM_BOUNDARY_PERIODIC}; >> DMSetPeriodicity(dm, PETSC_TRUE, maxCell, L, bd); >> // or: >> DMSetPeriodicity(dm, PETSC_TRUE, NULL, L, bd); >> >> but it did not work: >> >> VecSet(X, 1); >> DMGetLocalVector(dm, &locX); >> VecZeroEntries(locX); >> DMGlobalToLocalBegin(dm, X, INSERT_VALUES, locX); >> DMGlobalToLocalEnd(dm, X, INSERT_VALUES, locX); >> VecView(locX, PETSC_VIEWER_STDOUT_WORLD); >> >> but the ghost cells values are all 0, only the real cells are 1. >> So I guess DMSetPeriodicity alone is not sufficient to handle the >> periodicity. Is there a way to do what I want ? That is set up my >> DMPlex in a way that DMGlobalToLocalBegin/DMGlobalToLocalEnd do >> exchange values between procs AND exchange the periodic values? >> >> >> Thanks for the help >> >> >> Pierre >> >> >> On 15/10/21 14:03, Matthew Knepley wrote: >>> On Fri, Oct 15, 2021 at 7:31 AM Pierre Seize >>> > wrote: >>> >>> It makes sense, thank you. In fact, both ways seems better >>> than my way. The first one looks the most straightforward. >>> Unfortunately I do not know how to implement either of them. >>> Could you please direct me to the corresponding PETSc >>> functions ? >>> >>> The first way is implemented for example in >>> DMPlexCreateBoxMesh() and DMPlexCreateCylinderMesh(). The second >>> is not implemented since >>> there did not seem to be a general way to do it. I would help if >>> you wanted to try coding it up. >>> >>> ? Thanks, >>> >>> ? ? Matt >>> >>> Pierre >>> >>> >>> On 15/10/21 13:25, Matthew Knepley wrote: >>>> On Fri, Oct 15, 2021 at 7:08 AM Pierre Seize >>>> > wrote: >>>> >>>> Hi, >>>> >>>> I'm writing a code using PETSc to solve NS equations >>>> with FV on an >>>> unstructured mesh. Therefore I use DMPlex. >>>> >>>> Regarding periodicity, I manage to implement it this way: >>>> >>>> ?? - for each couple of boundaries that is linked with >>>> periodicity, I >>>> create a buffer vector with an ISLocalToGlobalMapping >>>> >>>> ?? - then, when I need to fill the ghost cells >>>> corresponding to the >>>> periodicity, the i "true" cell of the local vector >>>> fills the buffer >>>> vector on location i with VecSetValuesBlockedLocal, then >>>> VecAssemblyBegin/VecAssemblyEnd ensure each value is >>>> send to the correct >>>> location thanks to the mapping, then the i "ghost" cell >>>> of the local >>>> vector reads the vector on location i to get it's value. >>>> >>>> >>>> It works, but it seems to me there is a better way, >>>> with maybe PetscSF, >>>> VecScatter, or something I don't know yet. Does anyone >>>> have any advice ? >>>> >>>> >>>> There are at least two other ways to handle this. First, >>>> the method that is advocated in >>>> Plex is to actually make a periodic geometry, meaning >>>> connect the cells that are meant >>>> to be connected. Then, if you partition with overlap = 1, >>>> PetscGlobalToLocal() will fill in >>>> these cell values automatically. 
>>>> >>>> Second, you could use a non-periodic geometry, but alter >>>> the LocalToGlobal map such >>>> that the cells gets filled in anyway. Many codes use this >>>> scheme and it is straightforward >>>> with Plex just by augmenting the map it makes automatically. >>>> >>>> Does this make sense? >>>> >>>> ? Thanks, >>>> >>>> ? ? ?Matt >>>> >>>> Pierre Seize >>>> >>>> -- >>>> What most experimenters take for granted before they begin >>>> their experiments is infinitely more interesting than any >>>> results to which their experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to >>> which their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >> > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Mon Oct 18 05:35:52 2021 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 18 Oct 2021 06:35:52 -0400 Subject: [petsc-users] Periodic boundary conditions in DMPlex In-Reply-To: <17ca8a59-bb0b-ff7e-b34e-7666aa1931b4@onera.fr> References: <9a6f1a40-142e-f2a5-2101-2b60074b705e@onera.fr> <71784fff-35eb-a129-3609-004e5e596575@onera.fr> <8ff9d951-6958-aa1a-b875-e7488bb6b30b@onera.fr> <64787e00-820b-0c83-02f8-854569e4df9e@onera.fr> <17ca8a59-bb0b-ff7e-b34e-7666aa1931b4@onera.fr> Message-ID: On Mon, Oct 18, 2021 at 3:27 AM Pierre Seize wrote: > On 15/10/21 22:30, Matthew Knepley wrote: > > On Fri, Oct 15, 2021 at 10:16 AM Pierre Seize > wrote: > >> I read everything again, I think I did not understand you at first. The >> first solution is to modify the DAG, so that the rightmost cell is linked >> to the leftmost face, right ? To do that, do I have to manually edit the >> DAG (the mesh is read from a file) ? >> > Yes, the DAG would be modified if you want it for some particular mesh > that we cannot read automatically. For example, we can read periodic GMsh > meshes. > >> If so, the mesh connectivity is like the one of a torus, then how does it >> work with the cells/faces coordinates ? >> > You let the coordinate field be in a DG space, so that it can have jumps. > > I'm not sure I fully understand this, what I'll do is that I will > experiment with a periodic GMSH mesh. > We normally assume that the coordinate field on a mesh is continuous, which is why we associate it with vertices. However you could define the field on cells, representing it by discontinuous polynomials of degree 1. > Now I think the second method may be more straightforward. What's the idea >> ? Get the mapping with DMGetLocalToGlobalMapping, then create the mapping >> corresponding to the periodicity with ISLocalToGlobalMappingCreate, and >> finally ISLocalToGlobalMappingConcatenate ? I'm not sure this is the way, >> and I did not find something like DMSetLocalToGlobalMapping to restore the >> modified mapping. >> > It is more complicated. We make the LocalToGlobalMap by looking at the > PetscSection (essentially if gives function space information) and > deciding which unknowns are removed from the global space. 
> You would need to decide that unknowns constrained by periodicity are not > present in the global space. Actually, this is not hard. You just mark them > as constrained in the PetscSection, and all the layout > functions will function correctly. However, then the LocalToGlobalMap will > not be exactly right because the constrained unknowns will not be filled in > (just like Dirichlet conditions). You would augment the > map so that it fills those in by looking up their periodic counterparts. > Jed has argued for this type of periodicity. > > I also don't understand. One one side of my mesh I have : | ghost1 | > cell1 | cell2 | ... and on the other | cell_n-1 | cell_n | ghost_n |. Are > not the ghosts (from DMPlexConstructGhostCells) already constrained ? > I experimented on that too, I did: > > DMGetSectionSF > PetscSFGetGraph > augment the graph by adding the local ghost cells to the leaves, and the > correct remote "true" cells to the roots > PetscSFSetGraph > > and it seems to do what I want. Is this what you meant ? Is this a correct > way to use the PETSc objects ? Or is this just hacky and I'm lucky it works > ? > Yes, this will do what you want for field values, but not for coordinates. This is exactly what you would get if you just connected the topology. Thanks, Matt > Pierre > > > To me, the first kind is much more straightforward, but maybe this is > because I find the topology code more clear. > > Thanks, > > Matt > >> Pierre >> >> On 15/10/21 15:33, Pierre Seize wrote: >> >> When I first tried to handle the periodicity, I found the >> DMPlexCreateBoxMesh function (I cannot find the cylinder one). >> >> From reading the sources, I understand that we do some work either in >> DMPlexCreateCubeMesh_Internal or with DMSetPeriodicity. >> >> I tried to use DMSetPeriodicity before, for example with a 2x2 box on >> length 10. I did something like: >> const PetscReal maxCell[] = {2, 2}; >> const PetscReal L[] = {10, 10}; >> const DMBoundaryType bd[] = {DM_BOUNDARY_PERIODIC, DM_BOUNDARY_PERIODIC}; >> DMSetPeriodicity(dm, PETSC_TRUE, maxCell, L, bd); >> // or: >> DMSetPeriodicity(dm, PETSC_TRUE, NULL, L, bd); >> >> but it did not work: >> VecSet(X, 1); >> DMGetLocalVector(dm, &locX); >> VecZeroEntries(locX); >> DMGlobalToLocalBegin(dm, X, INSERT_VALUES, locX); >> DMGlobalToLocalEnd(dm, X, INSERT_VALUES, locX); >> VecView(locX, PETSC_VIEWER_STDOUT_WORLD); >> >> but the ghost cells values are all 0, only the real cells are 1. So I >> guess DMSetPeriodicity alone is not sufficient to handle the >> periodicity. Is there a way to do what I want ? That is set up my DMPlex in >> a way that DMGlobalToLocalBegin/DMGlobalToLocalEnd do exchange values >> between procs AND exchange the periodic values? >> >> >> Thanks for the help >> >> >> Pierre >> >> On 15/10/21 14:03, Matthew Knepley wrote: >> >> On Fri, Oct 15, 2021 at 7:31 AM Pierre Seize >> wrote: >> >>> It makes sense, thank you. In fact, both ways seems better than my way. >>> The first one looks the most straightforward. Unfortunately I do not know >>> how to implement either of them. Could you please direct me to the >>> corresponding PETSc functions ? >>> >> The first way is implemented for example in DMPlexCreateBoxMesh() and >> DMPlexCreateCylinderMesh(). The second is not implemented since >> there did not seem to be a general way to do it. I would help if you >> wanted to try coding it up. 
>> >> Thanks, >> >> Matt >> >>> Pierre >>> >>> On 15/10/21 13:25, Matthew Knepley wrote: >>> >>> On Fri, Oct 15, 2021 at 7:08 AM Pierre Seize >>> wrote: >>> >>>> Hi, >>>> >>>> I'm writing a code using PETSc to solve NS equations with FV on an >>>> unstructured mesh. Therefore I use DMPlex. >>>> >>>> Regarding periodicity, I manage to implement it this way: >>>> >>>> - for each couple of boundaries that is linked with periodicity, I >>>> create a buffer vector with an ISLocalToGlobalMapping >>>> >>>> - then, when I need to fill the ghost cells corresponding to the >>>> periodicity, the i "true" cell of the local vector fills the buffer >>>> vector on location i with VecSetValuesBlockedLocal, then >>>> VecAssemblyBegin/VecAssemblyEnd ensure each value is send to the >>>> correct >>>> location thanks to the mapping, then the i "ghost" cell of the local >>>> vector reads the vector on location i to get it's value. >>>> >>>> >>>> It works, but it seems to me there is a better way, with maybe PetscSF, >>>> VecScatter, or something I don't know yet. Does anyone have any advice ? >>>> >>> >>> There are at least two other ways to handle this. First, the method that >>> is advocated in >>> Plex is to actually make a periodic geometry, meaning connect the cells >>> that are meant >>> to be connected. Then, if you partition with overlap = 1, >>> PetscGlobalToLocal() will fill in >>> these cell values automatically. >>> >>> Second, you could use a non-periodic geometry, but alter the >>> LocalToGlobal map such >>> that the cells gets filled in anyway. Many codes use this scheme and it >>> is straightforward >>> with Plex just by augmenting the map it makes automatically. >>> >>> Does this make sense? >>> >>> Thanks, >>> >>> Matt >>> >>> >>>> Pierre Seize >>>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >>> >>> >>> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> >> >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Mon Oct 18 05:43:51 2021 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 18 Oct 2021 06:43:51 -0400 Subject: [petsc-users] Question about DMPlex for parallel computing In-Reply-To: References: Message-ID: On Mon, Oct 18, 2021 at 12:26 AM Gong Yujie wrote: > Hi, > > I'm learning to use DMPlex to write a parallel program. I've tried to > write a sequential code earlier successfully, but when to write a parallel > code, there are many things different. There are some questions I'm > curious about. > > > 1. Are the functions as DMPlexCreateGmshFromFile() and other read from > file functions reading in the mesh in parallel? Or just the root node read > in the mesh? 
> > This function works in parallel, meaning it will correctly read a file if run from a parallel program. However, it is read only by proc 0. The only format that we have truly reading in parallel is the PETSc HDF5 format, which relies on the parallel reads from HDF5. > > 1. Are there some examples available for distribute the mesh and > create the correspondingly local to global node(or position) mapping ? > > Yes, hundreds of examples. For example, in SNES ex12, we have many parallel examples, such as https://gitlab.com/petsc/petsc/-/blob/main/src/snes/tutorials/ex12.c#L1182 You will notice it has '-dm_distribute' in the arguments, which distributes the mesh automatically which was read in or created serially. You can also call DMPlexDistribute() yourself instead of using the option. Thanks, Matt > I'm grateful for your kindly help! > > Best Regards, > Gong > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhugp01 at nus.edu.sg Mon Oct 18 09:03:58 2021 From: zhugp01 at nus.edu.sg (Guangpu Zhu) Date: Mon, 18 Oct 2021 14:03:58 +0000 Subject: [petsc-users] Questions about matrix operation in petsc4py Message-ID: Dear Sir/Madam, My name is Guangpu Zhu, I met a problem when I tested the matrix operation in petsc4py. As the following code shows, I first set all elements of the matrix to 2.0. Then, I try to set the diagonal elements of the matrix to 1.0. However, I surprisingly found that the diagonal elements are still 2.0. This question has confused me for a few days, so I am writing to you for help or suggestions. Attached is the test code I used. I would greatly appreciate it if you can kindly reply to me. Thank you in ad [cid:8c058bba-e6bb-4dc5-ad0a-596e3150b21d] --- Guangpu Zhu Research Associate, Department of Mechanical Engineering National University of Singapore Personal E-mail: zhugpupc at gmail.com Phone: (+65) 87581879 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 18308 bytes Desc: image.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Petsc_Test.tar.gz Type: application/gzip Size: 1248 bytes Desc: Petsc_Test.tar.gz URL: From mfadams at lbl.gov Mon Oct 18 09:29:04 2021 From: mfadams at lbl.gov (Mark Adams) Date: Mon, 18 Oct 2021 10:29:04 -0400 Subject: [petsc-users] Questions about matrix operation in petsc4py In-Reply-To: References: Message-ID: see this example: https://github.com/JesseLu/petsc4py-tutorial/blob/master/mat_serial.py You need to add the assembly calls On Mon, Oct 18, 2021 at 10:06 AM Guangpu Zhu wrote: > Dear Sir/Madam, > > My name is Guangpu Zhu, I met a problem when I tested the matrix > operation in petsc4py. As the following code shows, I first set all > elements of the matrix to 2.0. Then, I try to set the diagonal elements of > the matrix to 1.0. However, I surprisingly found that the diagonal > elements are still 2.0. This question has confused me for a few days, so I > am writing to you for help or suggestions. Attached is the test code I > used. I would greatly appreciate it if you can kindly reply to me. 
Thank > you in ad > > > > --- > Guangpu Zhu > > Research Associate, Department of Mechanical Engineering > > National University of Singapore > > Personal E-mail: zhugpupc at gmail.com > > Phone: (+65) 87581879 > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 18308 bytes Desc: not available URL: From cliu at pppl.gov Mon Oct 18 15:42:27 2021 From: cliu at pppl.gov (Chang Liu) Date: Mon, 18 Oct 2021 16:42:27 -0400 Subject: [petsc-users] MatVec on GPUs In-Reply-To: References: Message-ID: Hi Matt, I have a related question. In my code I have many matrices and I only want to have one living on GPU, the others still staying on CPU mem. I wonder if there is an easier way to copy a mpiaij matrix to mpiaijcusparse (in other words, copy data to GPUs). I can think of creating a new mpiaijcusparse matrix, and copying the data line by line. But I wonder if there is a better option. I have tried MatCopy and MatConvert but neither work. Chang On 10/17/21 7:50 PM, Matthew Knepley wrote: > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh > wrote: > > Do I need convert the MATSEQBAIJ?to a cuda matrix in code? > > > You would need a call to MatSetFromOptions() to take that type from the > command line, and not have > the type hard-coded in your application. It is generally a bad idea to > hard code the implementation type. > > If I do it from command line, then are the other MatVec calls are > ported onto CUDA? I have many MatVec calls in my code, but I > specifically want to port just one call. > > > You can give that one matrix an options prefix to isolate it. > > ? Thanks, > > ? ? ?Matt > > Sincerely, > Swarnava > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang > > wrote: > > You can do that with command line options -mat_type aijcusparse > -vec_type cuda > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh > > wrote: > > Dear Petsc team, > > I had a query regarding using CUDA to accelerate a matrix > vector product. > I have a sequential sparse matrix (MATSEQBAIJ?type). I want > to port a MatVec?call onto GPUs. Is there any code/example I > can look at? > > Sincerely, > SG > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From junchao.zhang at gmail.com Mon Oct 18 16:23:46 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Mon, 18 Oct 2021 16:23:46 -0500 Subject: [petsc-users] MatVec on GPUs In-Reply-To: References: Message-ID: On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users < petsc-users at mcs.anl.gov> wrote: > Hi Matt, > > I have a related question. In my code I have many matrices and I only > want to have one living on GPU, the others still staying on CPU mem. > > I wonder if there is an easier way to copy a mpiaij matrix to > mpiaijcusparse (in other words, copy data to GPUs). I can think of > creating a new mpiaijcusparse matrix, and copying the data line by line. > But I wonder if there is a better option. > > I have tried MatCopy and MatConvert but neither work. > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? 
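A small sketch of that in-place conversion (not from the thread; the function name and variables are placeholders, B is assumed to be an already assembled MPIAIJ matrix from the application, and the snippet needs #include <petscmat.h>):

/* Convert an assembled MPIAIJ matrix to the cuSPARSE format in place and do
   one product there; MatCreateVecs() then returns vectors of a matching
   (CUDA) type automatically. */
static PetscErrorCode PushMatToGPUAndMult(Mat B)
{
  Vec            x,y;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatConvert(B,MATMPIAIJCUSPARSE,MAT_INPLACE_MATRIX,&B);CHKERRQ(ierr); /* in-place conversion */
  ierr = MatCreateVecs(B,&x,&y);CHKERRQ(ierr);
  ierr = VecSet(x,1.0);CHKERRQ(ierr);
  ierr = MatMult(B,x,y);CHKERRQ(ierr);   /* this product runs on the GPU */
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}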
> > Chang > > On 10/17/21 7:50 PM, Matthew Knepley wrote: > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh > > wrote: > > > > Do I need convert the MATSEQBAIJ to a cuda matrix in code? > > > > > > You would need a call to MatSetFromOptions() to take that type from the > > command line, and not have > > the type hard-coded in your application. It is generally a bad idea to > > hard code the implementation type. > > > > If I do it from command line, then are the other MatVec calls are > > ported onto CUDA? I have many MatVec calls in my code, but I > > specifically want to port just one call. > > > > > > You can give that one matrix an options prefix to isolate it. > > > > Thanks, > > > > Matt > > > > Sincerely, > > Swarnava > > > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang > > > wrote: > > > > You can do that with command line options -mat_type aijcusparse > > -vec_type cuda > > > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh > > > wrote: > > > > Dear Petsc team, > > > > I had a query regarding using CUDA to accelerate a matrix > > vector product. > > I have a sequential sparse matrix (MATSEQBAIJ type). I want > > to port a MatVec call onto GPUs. Is there any code/example I > > can look at? > > > > Sincerely, > > SG > > > > > > > > -- > > What most experimenters take for granted before they begin their > > experiments is infinitely more interesting than any results to which > > their experiments lead. > > -- Norbert Wiener > > > > https://www.cse.buffalo.edu/~knepley/ < > http://www.cse.buffalo.edu/~knepley/> > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cliu at pppl.gov Mon Oct 18 20:04:27 2021 From: cliu at pppl.gov (Chang Liu) Date: Mon, 18 Oct 2021 21:04:27 -0400 Subject: [petsc-users] [External] Re: MatVec on GPUs In-Reply-To: References: Message-ID: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov> Hi Junchao, Thank you for your answer. I tried MatConvert and it works. I didn't make it before because I forgot to convert a vector from mpi to mpicuda previously. For vector, there is no VecConvert to use, so I have to do VecDuplicate, VecSetType and VecCopy. Is there an easier option? Chang On 10/18/21 5:23 PM, Junchao Zhang wrote: > > > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users > > wrote: > > Hi Matt, > > I have a related question. In my code I have many matrices and I only > want to have one living on GPU, the others still staying on CPU mem. > > I wonder if there is an easier way to copy a mpiaij matrix to > mpiaijcusparse (in other words, copy data to GPUs). I can think of > creating a new mpiaijcusparse matrix, and copying the data line by > line. > But I wonder if there is a better option. > > I have tried MatCopy and MatConvert but neither work. > > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? > > > Chang > > On 10/17/21 7:50 PM, Matthew Knepley wrote: > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh > > > >> wrote: > > > >? ? ?Do I need convert the MATSEQBAIJ?to a cuda matrix in code? > > > > > > You would need a call to MatSetFromOptions() to take that type > from the > > command line, and not have > > the type hard-coded in your application. It is generally a bad > idea to > > hard code the implementation type. > > > >? ? ?If I do it from command line, then are the other MatVec calls are > >? ? ?ported onto CUDA? 
I have many MatVec calls in my code, but I > >? ? ?specifically want to port just one call. > > > > > > You can give that one matrix an options prefix to isolate it. > > > >? ? Thanks, > > > >? ? ? ?Matt > > > >? ? ?Sincerely, > >? ? ?Swarnava > > > >? ? ?On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang > >? ? ? > >> > wrote: > > > >? ? ? ? ?You can do that with command line options -mat_type > aijcusparse > >? ? ? ? ?-vec_type cuda > > > >? ? ? ? ?On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh > >? ? ? ? ? > >> wrote: > > > >? ? ? ? ? ? ?Dear Petsc team, > > > >? ? ? ? ? ? ?I had a query regarding using CUDA to accelerate a matrix > >? ? ? ? ? ? ?vector product. > >? ? ? ? ? ? ?I have a sequential sparse matrix (MATSEQBAIJ?type). > I want > >? ? ? ? ? ? ?to port a MatVec?call onto GPUs. Is there any > code/example I > >? ? ? ? ? ? ?can look at? > > > >? ? ? ? ? ? ?Sincerely, > >? ? ? ? ? ? ?SG > > > > > > > > -- > > What most experimenters take for granted before they begin their > > experiments is infinitely more interesting than any results to which > > their experiments lead. > > -- Norbert Wiener > > > > https://www.cse.buffalo.edu/~knepley/ > > > > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From junchao.zhang at gmail.com Mon Oct 18 20:23:18 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Mon, 18 Oct 2021 20:23:18 -0500 Subject: [petsc-users] [External] Re: MatVec on GPUs In-Reply-To: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov> References: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov> Message-ID: MatSetOptionsPrefix(A,"mymat") VecSetOptionsPrefix(v,"myvec") --Junchao Zhang On Mon, Oct 18, 2021 at 8:04 PM Chang Liu wrote: > Hi Junchao, > > Thank you for your answer. I tried MatConvert and it works. I didn't > make it before because I forgot to convert a vector from mpi to mpicuda > previously. > > For vector, there is no VecConvert to use, so I have to do VecDuplicate, > VecSetType and VecCopy. Is there an easier option? > As Matt suggested, you could single out the matrix and vector with options prefix and set their type on command line MatSetOptionsPrefix(A,"mymat"); VecSetOptionsPrefix(v,"myvec"); Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda A simpler code is to have the vector type automatically set by MatCreateVecs(A,&v,NULL) > Chang > > On 10/18/21 5:23 PM, Junchao Zhang wrote: > > > > > > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users > > > wrote: > > > > Hi Matt, > > > > I have a related question. In my code I have many matrices and I only > > want to have one living on GPU, the others still staying on CPU mem. > > > > I wonder if there is an easier way to copy a mpiaij matrix to > > mpiaijcusparse (in other words, copy data to GPUs). I can think of > > creating a new mpiaijcusparse matrix, and copying the data line by > > line. > > But I wonder if there is a better option. > > > > I have tried MatCopy and MatConvert but neither work. > > > > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? > > > > > > Chang > > > > On 10/17/21 7:50 PM, Matthew Knepley wrote: > > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh > > > > > >> > wrote: > > > > > > Do I need convert the MATSEQBAIJ to a cuda matrix in code? 
> > > > > > > > > You would need a call to MatSetFromOptions() to take that type > > from the > > > command line, and not have > > > the type hard-coded in your application. It is generally a bad > > idea to > > > hard code the implementation type. > > > > > > If I do it from command line, then are the other MatVec calls > are > > > ported onto CUDA? I have many MatVec calls in my code, but I > > > specifically want to port just one call. > > > > > > > > > You can give that one matrix an options prefix to isolate it. > > > > > > Thanks, > > > > > > Matt > > > > > > Sincerely, > > > Swarnava > > > > > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang > > > > > >> > > wrote: > > > > > > You can do that with command line options -mat_type > > aijcusparse > > > -vec_type cuda > > > > > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh > > > > > >> wrote: > > > > > > Dear Petsc team, > > > > > > I had a query regarding using CUDA to accelerate a > matrix > > > vector product. > > > I have a sequential sparse matrix (MATSEQBAIJ type). > > I want > > > to port a MatVec call onto GPUs. Is there any > > code/example I > > > can look at? > > > > > > Sincerely, > > > SG > > > > > > > > > > > > -- > > > What most experimenters take for granted before they begin their > > > experiments is infinitely more interesting than any results to > which > > > their experiments lead. > > > -- Norbert Wiener > > > > > > https://www.cse.buffalo.edu/~knepley/ > > > > > > > > > > -- > > Chang Liu > > Staff Research Physicist > > +1 609 243 3438 > > cliu at pppl.gov > > Princeton Plasma Physics Laboratory > > 100 Stellarator Rd, Princeton NJ 08540, USA > > > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Mon Oct 18 20:24:59 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Mon, 18 Oct 2021 20:24:59 -0500 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> <6D4D8741-3F52-41BF-B2A3-AFBA09443755@petsc.dev> Message-ID: Hi, Chang, I revised your patch and made an MR https://gitlab.com/petsc/petsc/-/merge_requests/4471 with branch jczhang/fix-PCTelescope-GPU Could you check if it works on your end? Thanks. --Junchao Zhang On Sat, Oct 16, 2021 at 8:59 PM Junchao Zhang wrote: > Hi, Chang, > Thanks a lot for the fix. I will create an MR for it. > --Junchao Zhang > > > On Sat, Oct 16, 2021 at 8:12 PM Chang Liu wrote: > >> Hi Barry, Pierre and Junchao, >> >> I spent some time to find the reason for the error. I think it is caused >> by some compability issues between telescope and cusparse. >> >> 1. In PCTelescopeMatCreate_default in telescope.c, it calls >> MatCreateMPIMatConcatenateSeqMat to concat seqmat to mpimat, but this >> function is from mpiaij.c and will set the mat type to mpiaij, even if >> the original matrix is mpiaijcusparse. >> >> 2. Simiar issue exists in PCTelescopeSetUp_default, where the vector is >> set to type mpi rather than mpicuda. >> >> I have fixed the issue using the following patch. 
After applying it, >> telescope and cusparse work as expected. >> >> diff --git a/src/ksp/pc/impls/telescope/telescope.c >> b/src/ksp/pc/impls/telescope/telescope.c >> index 893febb055..d3f687eff9 100644 >> --- a/src/ksp/pc/impls/telescope/telescope.c >> +++ b/src/ksp/pc/impls/telescope/telescope.c >> @@ -159,6 +159,7 @@ PetscErrorCode PCTelescopeSetUp_default(PC >> pc,PC_Telescope sred) >> ierr = VecCreate(subcomm,&xred);CHKERRQ(ierr); >> ierr = VecSetSizes(xred,PETSC_DECIDE,M);CHKERRQ(ierr); >> ierr = VecSetBlockSize(xred,bs);CHKERRQ(ierr); >> + ierr = VecSetType(xred,((PetscObject)x)->type_name);CHKERRQ(ierr); >> ierr = VecSetFromOptions(xred);CHKERRQ(ierr); >> ierr = VecGetLocalSize(xred,&m);CHKERRQ(ierr); >> } >> diff --git a/src/mat/impls/aij/mpi/mpiaij.c >> b/src/mat/impls/aij/mpi/mpiaij.c >> index 36077002db..ac374e07eb 100644 >> --- a/src/mat/impls/aij/mpi/mpiaij.c >> +++ b/src/mat/impls/aij/mpi/mpiaij.c >> @@ -4486,6 +4486,7 @@ PetscErrorCode >> MatCreateMPIMatConcatenateSeqMat_MPIAIJ(MPI_Comm comm,Mat inmat,P >> PetscInt m,N,i,rstart,nnz,Ii; >> PetscInt *indx; >> PetscScalar *values; >> + PetscBool isseqaijcusparse; >> >> PetscFunctionBegin; >> ierr = MatGetSize(inmat,&m,&N);CHKERRQ(ierr); >> @@ -4513,7 +4514,12 @@ PetscErrorCode >> MatCreateMPIMatConcatenateSeqMat_MPIAIJ(MPI_Comm comm,Mat inmat,P >> ierr = >> MatSetSizes(*outmat,m,n,PETSC_DETERMINE,PETSC_DETERMINE);CHKERRQ(ierr); >> ierr = MatGetBlockSizes(inmat,&bs,&cbs);CHKERRQ(ierr); >> ierr = MatSetBlockSizes(*outmat,bs,cbs);CHKERRQ(ierr); >> - ierr = MatSetType(*outmat,MATAIJ);CHKERRQ(ierr); >> + ierr = >> >> PetscObjectBaseTypeCompare((PetscObject)inmat,MATSEQAIJCUSPARSE,&isseqaijcusparse);CHKERRQ(ierr); >> + if (isseqaijcusparse) { >> + ierr = MatSetType(*outmat,MATAIJCUSPARSE);CHKERRQ(ierr); >> + } else { >> + ierr = MatSetType(*outmat,MATAIJ);CHKERRQ(ierr); >> + } >> ierr = MatSeqAIJSetPreallocation(*outmat,0,dnz);CHKERRQ(ierr); >> ierr = MatMPIAIJSetPreallocation(*outmat,0,dnz,0,onz);CHKERRQ(ierr); >> ierr = MatPreallocateFinalize(dnz,onz);CHKERRQ(ierr); >> >> Please help view it and merge to master if possible. >> >> Regards, >> >> Chang >> >> On 10/15/21 1:27 PM, Barry Smith wrote: >> > >> > So the only difference is between >> > -sub_telescope_pc_factor_mat_solver_type cusparse and >> > -sub_telescope_pc_factor_mat_solver_type mumps ? >> > >> > Try without the -sub_telescope_pc_factor_mat_solver_type cusparse >> > and then PETSc will just use the CPU solvers, I want to see if that >> > works, it should. If it works then there is perhaps something specific >> > about the PCTELESCOPE and the cusparse solver, for example the right >> > hand side array values may never get to the GPU. 
>> > >> > Barry >> > >> >> On Oct 14, 2021, at 10:11 PM, Chang Liu > >> > wrote: >> >> >> >> For comparison, here is the output using mumps instead of cusparse >> >> >> >> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 >> >> -ksp_view -ksp_monitor_true_residual -pc_type bjacobi >> >> -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse >> >> -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type >> >> preonly -sub_telescope_pc_type lu >> >> -sub_telescope_pc_factor_mat_solver_type mumps >> >> -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type >> >> contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >> > >> > $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 >> > -ksp_view -ksp_monitor_true_residual -pc_type bjacobi >> -pc_bjacobi_blocks >> > 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope >> > -sub_ksp_type preonly -sub_telescope_ksp_type preonly >> > -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type >> > cusparse -sub_pc_telescope_reduction_factor 4 >> > -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol >> > 1.e-20 -ksp_atol 1.e-9 >> > >> > >> >> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm >> >> 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >> >> 1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm >> >> 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 >> >> 2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm >> >> 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 >> >> 3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm >> >> 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 >> >> 4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm >> >> 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 >> >> 5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm >> >> 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 >> >> 6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm >> >> 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 >> >> 7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm >> >> 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 >> >> 8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm >> >> 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03 >> >> 9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm >> >> 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 >> >> 10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm >> >> 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 >> >> 11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm >> >> 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 >> >> 12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid norm >> >> 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 >> >> 13 KSP unpreconditioned resid norm 1.885892030223e-01 true resid norm >> >> 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 >> >> 14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid norm >> >> 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 >> >> 15 KSP unpreconditioned resid norm 1.638219366731e-01 true resid norm >> >> 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 >> >> 16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid norm >> >> 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 >> >> 17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid norm >> 
>> 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 >> >> 18 KSP unpreconditioned resid norm 1.289673236836e-01 true resid norm >> >> 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 >> >> 19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid norm >> >> 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 >> >> 20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid norm >> >> 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 >> >> 21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid norm >> >> 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 >> >> 22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid norm >> >> 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 >> >> 23 KSP unpreconditioned resid norm 7.563496650115e-02 true resid norm >> >> 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 >> >> 24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid norm >> >> 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 >> >> 25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid norm >> >> 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 >> >> 26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid norm >> >> 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 >> >> 27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid norm >> >> 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 >> >> 28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid norm >> >> 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 >> >> 29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid norm >> >> 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 >> >> 30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid norm >> >> 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 >> >> 31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid norm >> >> 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 >> >> 32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid norm >> >> 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 >> >> 33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid norm >> >> 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 >> >> 34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid norm >> >> 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 >> >> 35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid norm >> >> 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 >> >> 36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid norm >> >> 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 >> >> 37 KSP unpreconditioned resid norm 4.982347757923e-03 true resid norm >> >> 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 >> >> 38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid norm >> >> 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 >> >> 39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid norm >> >> 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 >> >> 40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid norm >> >> 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 >> >> 41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid norm >> >> 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 >> >> 42 KSP unpreconditioned resid norm 2.772928845284e-03 true resid norm >> >> 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 >> >> 43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid norm >> >> 2.561454192398e-03 ||r(i)||/||b|| 
6.379756085902e-05 >> >> 44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid norm >> >> 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 >> >> 45 KSP unpreconditioned resid norm 2.086800523919e-03 true resid norm >> >> 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 >> >> 46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid norm >> >> 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 >> >> 47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid norm >> >> 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 >> >> 48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid norm >> >> 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 >> >> 49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid norm >> >> 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 >> >> 50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid norm >> >> 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 >> >> 51 KSP unpreconditioned resid norm 1.375985053014e-03 true resid norm >> >> 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 >> >> 52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid norm >> >> 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 >> >> 53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid norm >> >> 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 >> >> 54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid norm >> >> 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 >> >> 55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid norm >> >> 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 >> >> 56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid norm >> >> 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 >> >> 57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid norm >> >> 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 >> >> 58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid norm >> >> 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 >> >> 59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid norm >> >> 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 >> >> 60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid norm >> >> 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 >> >> 61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid norm >> >> 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 >> >> 62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid norm >> >> 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 >> >> 63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid norm >> >> 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 >> >> 64 KSP unpreconditioned resid norm 2.961034697265e-04 true resid norm >> >> 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 >> >> 65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid norm >> >> 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 >> >> 66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid norm >> >> 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 >> >> 67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid norm >> >> 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 >> >> 68 KSP unpreconditioned resid norm 2.008438265031e-04 true resid norm >> >> 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 >> >> 69 KSP unpreconditioned resid norm 1.838732863386e-04 true resid norm >> >> 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 >> >> 70 KSP 
unpreconditioned resid norm 1.723786027645e-04 true resid norm >> >> 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 >> >> 71 KSP unpreconditioned resid norm 1.580945192204e-04 true resid norm >> >> 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 >> >> 72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid norm >> >> 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 >> >> 73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid norm >> >> 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 >> >> 74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid norm >> >> 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 >> >> 75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid norm >> >> 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 >> >> 76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid norm >> >> 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 >> >> 77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid norm >> >> 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 >> >> 78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid norm >> >> 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 >> >> 79 KSP unpreconditioned resid norm 8.896901670359e-05 true resid norm >> >> 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 >> >> 80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid norm >> >> 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 >> >> 81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid norm >> >> 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 >> >> 82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid norm >> >> 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 >> >> 83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid norm >> >> 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 >> >> 84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid norm >> >> 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 >> >> 85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid norm >> >> 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 >> >> 86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid norm >> >> 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 >> >> 87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid norm >> >> 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 >> >> 88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid norm >> >> 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 >> >> 89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid norm >> >> 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 >> >> 90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid norm >> >> 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 >> >> 91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid norm >> >> 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 >> >> 92 KSP unpreconditioned resid norm 1.707071292484e-05 true resid norm >> >> 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 >> >> 93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid norm >> >> 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 >> >> 94 KSP unpreconditioned resid norm 1.219393542993e-05 true resid norm >> >> 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 >> >> 95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid norm >> >> 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 >> >> 96 KSP unpreconditioned resid norm 
9.099659872548e-06 true resid norm >> >> 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 >> >> 97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid norm >> >> 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 >> >> 98 KSP unpreconditioned resid norm 7.167226146744e-06 true resid norm >> >> 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 >> >> 99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid norm >> >> 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 >> >> 100 KSP unpreconditioned resid norm 5.767783600111e-06 true resid norm >> >> 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 >> >> 101 KSP unpreconditioned resid norm 5.261057430584e-06 true resid norm >> >> 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 >> >> 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid norm >> >> 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 >> >> 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid norm >> >> 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 >> >> 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid norm >> >> 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 >> >> 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid norm >> >> 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 >> >> 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid norm >> >> 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 >> >> 107 KSP unpreconditioned resid norm 3.234218880987e-06 true resid norm >> >> 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 >> >> 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid norm >> >> 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 >> >> 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid norm >> >> 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 >> >> 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid norm >> >> 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 >> >> 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid norm >> >> 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 >> >> 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid norm >> >> 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 >> >> 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid norm >> >> 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 >> >> 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid norm >> >> 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 >> >> 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid norm >> >> 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 >> >> 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid norm >> >> 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 >> >> 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid norm >> >> 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 >> >> 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid norm >> >> 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 >> >> 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid norm >> >> 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 >> >> 120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid norm >> >> 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 >> >> 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid norm >> >> 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 >> >> 122 KSP unpreconditioned resid norm 7.945760150897e-07 
true resid norm >> >> 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 >> >> 123 KSP unpreconditioned resid norm 7.141240839013e-07 true resid norm >> >> 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 >> >> 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid norm >> >> 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 >> >> 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid norm >> >> 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 >> >> 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid norm >> >> 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 >> >> 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid norm >> >> 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 >> >> 128 KSP unpreconditioned resid norm 4.253264691112e-07 true resid norm >> >> 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 >> >> 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid norm >> >> 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 >> >> 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid norm >> >> 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 >> >> 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid norm >> >> 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 >> >> 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid norm >> >> 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 >> >> 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid norm >> >> 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 >> >> 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid norm >> >> 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 >> >> 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid norm >> >> 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 >> >> 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid norm >> >> 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 >> >> 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid norm >> >> 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 >> >> 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid norm >> >> 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 >> >> 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid norm >> >> 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 >> >> 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid norm >> >> 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 >> >> 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid norm >> >> 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 >> >> 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid norm >> >> 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 >> >> 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid norm >> >> 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 >> >> 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid norm >> >> 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 >> >> 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid norm >> >> 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 >> >> 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid norm >> >> 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 >> >> 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid norm >> >> 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 >> >> 148 KSP unpreconditioned resid norm 5.690132597004e-08 true resid norm >> 
>> 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 >> >> 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid norm >> >> 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 >> >> 150 KSP unpreconditioned resid norm 4.625371062660e-08 true resid norm >> >> 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 >> >> 151 KSP unpreconditioned resid norm 4.349049084805e-08 true resid norm >> >> 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 >> >> 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid norm >> >> 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 >> >> 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid norm >> >> 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 >> >> 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid norm >> >> 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 >> >> 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid norm >> >> 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 >> >> 156 KSP unpreconditioned resid norm 2.128504065757e-08 true resid norm >> >> 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 >> >> 157 KSP unpreconditioned resid norm 1.879248809429e-08 true resid norm >> >> 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 >> >> 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid norm >> >> 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 >> >> 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid norm >> >> 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 >> >> 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid norm >> >> 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 >> >> 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid norm >> >> 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 >> >> 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid norm >> >> 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 >> >> 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid norm >> >> 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 >> >> 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid norm >> >> 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 >> >> 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid norm >> >> 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 >> >> 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid norm >> >> 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 >> >> 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid norm >> >> 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 >> >> 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid norm >> >> 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 >> >> 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid norm >> >> 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 >> >> 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid norm >> >> 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 >> >> 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid norm >> >> 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 >> >> 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid norm >> >> 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 >> >> 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid norm >> >> 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 >> >> 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid norm >> >> 
4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 >> >> 175 KSP unpreconditioned resid norm 3.924177535665e-09 true resid norm >> >> 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 >> >> 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid norm >> >> 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 >> >> 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid norm >> >> 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 >> >> 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid norm >> >> 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 >> >> 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid norm >> >> 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 >> >> 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid norm >> >> 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 >> >> 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid norm >> >> 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 >> >> 182 KSP unpreconditioned resid norm 1.983545827983e-09 true resid norm >> >> 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 >> >> 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid norm >> >> 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 >> >> 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid norm >> >> 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 >> >> 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid norm >> >> 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 >> >> 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid norm >> >> 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 >> >> 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid norm >> >> 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 >> >> 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid norm >> >> 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 >> >> 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid norm >> >> 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 >> >> KSP Object: 16 MPI processes >> >> type: fgmres >> >> restart=30, using Classical (unmodified) Gram-Schmidt >> >> Orthogonalization with no iterative refinement >> >> happy breakdown tolerance 1e-30 >> >> maximum iterations=2000, initial guess is zero >> >> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. >> >> right preconditioning >> >> using UNPRECONDITIONED norm type for convergence test >> >> PC Object: 16 MPI processes >> >> type: bjacobi >> >> number of blocks = 4 >> >> Local solver information for first block is in the following KSP >> >> and PC objects on rank 0: >> >> Use -ksp_view ::ascii_info_detail to display information for all >> blocks >> >> KSP Object: (sub_) 4 MPI processes >> >> type: preonly >> >> maximum iterations=10000, initial guess is zero >> >> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
>> >> left preconditioning >> >> using NONE norm type for convergence test >> >> PC Object: (sub_) 4 MPI processes >> >> type: telescope >> >> petsc subcomm: parent comm size reduction factor = 4 >> >> petsc subcomm: parent_size = 4 , subcomm_size = 1 >> >> petsc subcomm type = contiguous >> >> linear system matrix = precond matrix: >> >> Mat Object: (sub_) 4 MPI processes >> >> type: mpiaij >> >> rows=40200, cols=40200 >> >> total: nonzeros=199996, allocated nonzeros=203412 >> >> total number of mallocs used during MatSetValues calls=0 >> >> not using I-node (on process 0) routines >> >> setup type: default >> >> Parent DM object: NULL >> >> Sub DM object: NULL >> >> KSP Object: (sub_telescope_) 1 MPI processes >> >> type: preonly >> >> maximum iterations=10000, initial guess is zero >> >> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >> >> left preconditioning >> >> using NONE norm type for convergence test >> >> PC Object: (sub_telescope_) 1 MPI processes >> >> type: lu >> >> out-of-place factorization >> >> tolerance for zero pivot 2.22045e-14 >> >> matrix ordering: external >> >> factor fill ratio given 0., needed 0. >> >> Factored matrix follows: >> >> Mat Object: 1 MPI processes >> >> type: mumps >> >> rows=40200, cols=40200 >> >> package used to perform factorization: mumps >> >> total: nonzeros=1849788, allocated nonzeros=1849788 >> >> MUMPS run parameters: >> >> SYM (matrix type): 0 >> >> PAR (host participation): 1 >> >> ICNTL(1) (output for error): 6 >> >> ICNTL(2) (output of diagnostic msg): 0 >> >> ICNTL(3) (output for global info): 0 >> >> ICNTL(4) (level of printing): 0 >> >> ICNTL(5) (input mat struct): 0 >> >> ICNTL(6) (matrix prescaling): 7 >> >> ICNTL(7) (sequential matrix ordering):7 >> >> ICNTL(8) (scaling strategy): 77 >> >> ICNTL(10) (max num of refinements): 0 >> >> ICNTL(11) (error analysis): 0 >> >> ICNTL(12) (efficiency control): 1 >> >> ICNTL(13) (sequential factorization of the root >> >> node): 0 >> >> ICNTL(14) (percentage of estimated workspace >> >> increase): 20 >> >> ICNTL(18) (input mat struct): 0 >> >> ICNTL(19) (Schur complement info): 0 >> >> ICNTL(20) (RHS sparse pattern): 0 >> >> ICNTL(21) (solution struct): 0 >> >> ICNTL(22) (in-core/out-of-core facility): 0 >> >> ICNTL(23) (max size of memory can be allocated >> >> locally):0 >> >> ICNTL(24) (detection of null pivot rows): 0 >> >> ICNTL(25) (computation of a null space basis): >> >> 0 >> >> ICNTL(26) (Schur options for RHS or solution): >> >> 0 >> >> ICNTL(27) (blocking size for multiple RHS): >> >> -32 >> >> ICNTL(28) (use parallel or sequential ordering): >> >> 1 >> >> ICNTL(29) (parallel ordering): 0 >> >> ICNTL(30) (user-specified set of entries in >> >> inv(A)): 0 >> >> ICNTL(31) (factors is discarded in the solve >> >> phase): 0 >> >> ICNTL(33) (compute determinant): 0 >> >> ICNTL(35) (activate BLR based factorization): >> >> 0 >> >> ICNTL(36) (choice of BLR factorization variant): >> >> 0 >> >> ICNTL(38) (estimated compression rate of LU >> >> factors): 333 >> >> CNTL(1) (relative pivoting threshold): 0.01 >> >> CNTL(2) (stopping criterion of refinement): >> >> 1.49012e-08 >> >> CNTL(3) (absolute pivoting threshold): 0. >> >> CNTL(4) (value of static pivoting): -1. >> >> CNTL(5) (fixation for null pivots): 0. >> >> CNTL(7) (dropping parameter for BLR): 0. 
>> >> RINFO(1) (local estimated flops for the >> >> elimination after analysis): >> >> [0] 1.45525e+08 >> >> RINFO(2) (local estimated flops for the assembly >> >> after factorization): >> >> [0] 2.89397e+06 >> >> RINFO(3) (local estimated flops for the >> >> elimination after factorization): >> >> [0] 1.45525e+08 >> >> INFO(15) (estimated size of (in MB) MUMPS >> >> internal data for running numerical factorization): >> >> [0] 29 >> >> INFO(16) (size of (in MB) MUMPS internal data >> >> used during numerical factorization): >> >> [0] 29 >> >> INFO(23) (num of pivots eliminated on this >> >> processor after factorization): >> >> [0] 40200 >> >> RINFOG(1) (global estimated flops for the >> >> elimination after analysis): 1.45525e+08 >> >> RINFOG(2) (global estimated flops for the >> >> assembly after factorization): 2.89397e+06 >> >> RINFOG(3) (global estimated flops for the >> >> elimination after factorization): 1.45525e+08 >> >> (RINFOG(12) RINFOG(13))*2^INFOG(34) >> >> (determinant): (0.,0.)*(2^0) >> >> INFOG(3) (estimated real workspace for factors on >> >> all processors after analysis): 1849788 >> >> INFOG(4) (estimated integer workspace for factors >> >> on all processors after analysis): 879986 >> >> INFOG(5) (estimated maximum front size in the >> >> complete tree): 282 >> >> INFOG(6) (number of nodes in the complete tree): >> >> 23709 >> >> INFOG(7) (ordering option effectively used after >> >> analysis): 5 >> >> INFOG(8) (structural symmetry in percent of the >> >> permuted matrix after analysis): 100 >> >> INFOG(9) (total real/complex workspace to store >> >> the matrix factors after factorization): 1849788 >> >> INFOG(10) (total integer space store the matrix >> >> factors after factorization): 879986 >> >> INFOG(11) (order of largest frontal matrix after >> >> factorization): 282 >> >> INFOG(12) (number of off-diagonal pivots): 0 >> >> INFOG(13) (number of delayed pivots after >> >> factorization): 0 >> >> INFOG(14) (number of memory compress after >> >> factorization): 0 >> >> INFOG(15) (number of steps of iterative >> >> refinement after solution): 0 >> >> INFOG(16) (estimated size (in MB) of all MUMPS >> >> internal data for factorization after analysis: value on the most >> >> memory consuming processor): 29 >> >> INFOG(17) (estimated size of all MUMPS internal >> >> data for factorization after analysis: sum over all processors): 29 >> >> INFOG(18) (size of all MUMPS internal data >> >> allocated during factorization: value on the most memory consuming >> >> processor): 29 >> >> INFOG(19) (size of all MUMPS internal data >> >> allocated during factorization: sum over all processors): 29 >> >> INFOG(20) (estimated number of entries in the >> >> factors): 1849788 >> >> INFOG(21) (size in MB of memory effectively used >> >> during factorization - value on the most memory consuming processor): >> 26 >> >> INFOG(22) (size in MB of memory effectively used >> >> during factorization - sum over all processors): 26 >> >> INFOG(23) (after analysis: value of ICNTL(6) >> >> effectively used): 0 >> >> INFOG(24) (after analysis: value of ICNTL(12) >> >> effectively used): 1 >> >> INFOG(25) (after factorization: number of pivots >> >> modified by static pivoting): 0 >> >> INFOG(28) (after factorization: number of null >> >> pivots encountered): 0 >> >> INFOG(29) (after factorization: effective number >> >> of entries in the factors (sum over all processors)): 1849788 >> >> INFOG(30, 31) (after solution: size in Mbytes of >> >> memory used during solution phase): 29, 29 >> >> INFOG(32) 
(after analysis: type of analysis >> done): 1 >> >> INFOG(33) (value used for ICNTL(8)): 7 >> >> INFOG(34) (exponent of the determinant if >> >> determinant is requested): 0 >> >> INFOG(35) (after factorization: number of entries >> >> taking into account BLR factor compression - sum over all processors): >> >> 1849788 >> >> INFOG(36) (after analysis: estimated size of all >> >> MUMPS internal data for running BLR in-core - value on the most memory >> >> consuming processor): 0 >> >> INFOG(37) (after analysis: estimated size of all >> >> MUMPS internal data for running BLR in-core - sum over all >> processors): 0 >> >> INFOG(38) (after analysis: estimated size of all >> >> MUMPS internal data for running BLR out-of-core - value on the most >> >> memory consuming processor): 0 >> >> INFOG(39) (after analysis: estimated size of all >> >> MUMPS internal data for running BLR out-of-core - sum over all >> >> processors): 0 >> >> linear system matrix = precond matrix: >> >> Mat Object: 1 MPI processes >> >> type: seqaijcusparse >> >> rows=40200, cols=40200 >> >> total: nonzeros=199996, allocated nonzeros=199996 >> >> total number of mallocs used during MatSetValues calls=0 >> >> not using I-node routines >> >> linear system matrix = precond matrix: >> >> Mat Object: 16 MPI processes >> >> type: mpiaijcusparse >> >> rows=160800, cols=160800 >> >> total: nonzeros=802396, allocated nonzeros=1608000 >> >> total number of mallocs used during MatSetValues calls=0 >> >> not using I-node (on process 0) routines >> >> Norm of error 9.11684e-07 iterations 189 >> >> >> >> Chang >> >> >> >> >> >> >> >> On 10/14/21 10:10 PM, Chang Liu wrote: >> >>> Hi Barry, >> >>> No problem. Here is the output. It seems that the resid norm >> >>> calculation is incorrect. >> >>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 >> >>> -ksp_view -ksp_monitor_true_residual -pc_type bjacobi >> >>> -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse >> >>> -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type >> >>> preonly -sub_telescope_pc_type lu >> >>> -sub_telescope_pc_factor_mat_solver_type cusparse >> >>> -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type >> >>> contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >> >>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid >> >>> norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >> >>> 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid >> >>> norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >> >>> KSP Object: 16 MPI processes >> >>> type: fgmres >> >>> restart=30, using Classical (unmodified) Gram-Schmidt >> >>> Orthogonalization with no iterative refinement >> >>> happy breakdown tolerance 1e-30 >> >>> maximum iterations=2000, initial guess is zero >> >>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. >> >>> right preconditioning >> >>> using UNPRECONDITIONED norm type for convergence test >> >>> PC Object: 16 MPI processes >> >>> type: bjacobi >> >>> number of blocks = 4 >> >>> Local solver information for first block is in the following KSP >> >>> and PC objects on rank 0: >> >>> Use -ksp_view ::ascii_info_detail to display information for all >> >>> blocks >> >>> KSP Object: (sub_) 4 MPI processes >> >>> type: preonly >> >>> maximum iterations=10000, initial guess is zero >> >>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
>> >>> left preconditioning >> >>> using NONE norm type for convergence test >> >>> PC Object: (sub_) 4 MPI processes >> >>> type: telescope >> >>> petsc subcomm: parent comm size reduction factor = 4 >> >>> petsc subcomm: parent_size = 4 , subcomm_size = 1 >> >>> petsc subcomm type = contiguous >> >>> linear system matrix = precond matrix: >> >>> Mat Object: (sub_) 4 MPI processes >> >>> type: mpiaij >> >>> rows=40200, cols=40200 >> >>> total: nonzeros=199996, allocated nonzeros=203412 >> >>> total number of mallocs used during MatSetValues calls=0 >> >>> not using I-node (on process 0) routines >> >>> setup type: default >> >>> Parent DM object: NULL >> >>> Sub DM object: NULL >> >>> KSP Object: (sub_telescope_) 1 MPI processes >> >>> type: preonly >> >>> maximum iterations=10000, initial guess is zero >> >>> tolerances: relative=1e-05, absolute=1e-50, >> divergence=10000. >> >>> left preconditioning >> >>> using NONE norm type for convergence test >> >>> PC Object: (sub_telescope_) 1 MPI processes >> >>> type: lu >> >>> out-of-place factorization >> >>> tolerance for zero pivot 2.22045e-14 >> >>> matrix ordering: nd >> >>> factor fill ratio given 5., needed 8.62558 >> >>> Factored matrix follows: >> >>> Mat Object: 1 MPI processes >> >>> type: seqaijcusparse >> >>> rows=40200, cols=40200 >> >>> package used to perform factorization: cusparse >> >>> total: nonzeros=1725082, allocated nonzeros=1725082 >> >>> not using I-node routines >> >>> linear system matrix = precond matrix: >> >>> Mat Object: 1 MPI processes >> >>> type: seqaijcusparse >> >>> rows=40200, cols=40200 >> >>> total: nonzeros=199996, allocated nonzeros=199996 >> >>> total number of mallocs used during MatSetValues calls=0 >> >>> not using I-node routines >> >>> linear system matrix = precond matrix: >> >>> Mat Object: 16 MPI processes >> >>> type: mpiaijcusparse >> >>> rows=160800, cols=160800 >> >>> total: nonzeros=802396, allocated nonzeros=1608000 >> >>> total number of mallocs used during MatSetValues calls=0 >> >>> not using I-node (on process 0) routines >> >>> Norm of error 400.999 iterations 1 >> >>> Chang >> >>> On 10/14/21 9:47 PM, Barry Smith wrote: >> >>>> >> >>>> Chang, >> >>>> >> >>>> Sorry I did not notice that one. Please run that with -ksp_view >> >>>> -ksp_monitor_true_residual so we can see exactly how options are >> >>>> interpreted and solver used. At a glance it looks ok but something >> >>>> must be wrong to get the wrong answer. >> >>>> >> >>>> Barry >> >>>> >> >>>>> On Oct 14, 2021, at 6:02 PM, Chang Liu > >>>>> > wrote: >> >>>>> >> >>>>> Hi Barry, >> >>>>> >> >>>>> That is exactly what I was doing in the second example, in which >> >>>>> the preconditioner works but the GMRES does not. >> >>>>> >> >>>>> Chang >> >>>>> >> >>>>> On 10/14/21 5:15 PM, Barry Smith wrote: >> >>>>>> You need to use the PCTELESCOPE inside the block Jacobi, not >> >>>>>> outside it. So something like -pc_type bjacobi -sub_pc_type >> >>>>>> telescope -sub_telescope_pc_type lu >> >>>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu > >>>>>>> > wrote: >> >>>>>>> >> >>>>>>> Hi Pierre, >> >>>>>>> >> >>>>>>> I wonder if the trick of PCTELESCOPE only works for >> >>>>>>> preconditioner and not for the solver. I have done some tests, >> >>>>>>> and find that for solving a small matrix using >> >>>>>>> -telescope_ksp_type preonly, it does work for GPU with multiple >> >>>>>>> MPI processes. However, for bjacobi and gmres, it does not work. 
>> >>>>>>> >> >>>>>>> The command line options I used for small matrix is like >> >>>>>>> >> >>>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short >> >>>>>>> -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu >> >>>>>>> -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type >> >>>>>>> preonly -pc_telescope_reduction_factor 4 >> >>>>>>> >> >>>>>>> which gives the correct output. For iterative solver, I tried >> >>>>>>> >> >>>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short >> >>>>>>> -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type >> >>>>>>> aijcusparse -sub_pc_type telescope -sub_ksp_type preonly >> >>>>>>> -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu >> >>>>>>> -sub_telescope_pc_factor_mat_solver_type cusparse >> >>>>>>> -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol >> >>>>>>> 1.e-9 -ksp_atol 1.e-20 >> >>>>>>> >> >>>>>>> for large matrix. The output is like >> >>>>>>> >> >>>>>>> 0 KSP Residual norm 40.1497 >> >>>>>>> 1 KSP Residual norm < 1.e-11 >> >>>>>>> Norm of error 400.999 iterations 1 >> >>>>>>> >> >>>>>>> So it seems to call a direct solver instead of an iterative one. >> >>>>>>> >> >>>>>>> Can you please help check these options? >> >>>>>>> >> >>>>>>> Chang >> >>>>>>> >> >>>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote: >> >>>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu > >>>>>>>>> > wrote: >> >>>>>>>>> >> >>>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This >> >>>>>>>>> sounds exactly what I need. I wonder if PCTELESCOPE can >> >>>>>>>>> transform a mpiaijcusparse to seqaircusparse? Or I have to do >> >>>>>>>>> it manually? >> >>>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). >> >>>>>>>> 1) I?m not sure this is implemented for cuSparse matrices, but >> >>>>>>>> it should be; >> >>>>>>>> 2) at least for the implementations >> >>>>>>>> MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and >> >>>>>>>> MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType >> >>>>>>>> is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? >> >>>>>>>> enough to detect if the MPI communicator on which the Mat lives >> >>>>>>>> is of size 1 (your case), and then the resulting Mat is of type >> >>>>>>>> MatSeqX instead of MatMPIX, so you would not need to worry about >> >>>>>>>> the transformation you are mentioning. >> >>>>>>>> If you try this out and this does not work, please provide the >> >>>>>>>> backtrace (probably something like ?Operation XYZ not >> >>>>>>>> implemented for MatType ABC?), and hopefully someone can add the >> >>>>>>>> missing plumbing. >> >>>>>>>> I do not claim that this will be efficient, but I think this >> >>>>>>>> goes in the direction of what you want to achieve. >> >>>>>>>> Thanks, >> >>>>>>>> Pierre >> >>>>>>>>> Chang >> >>>>>>>>> >> >>>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >> >>>>>>>>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as >> >>>>>>>>>> a subdomain solver, with a reduction factor equal to the >> >>>>>>>>>> number of MPI processes you have per block? >> >>>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X >> >>>>>>>>>> -sub_telescope_pc_type lu >> >>>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads >> >>>>>>>>>> because not only do the Mat needs to be redistributed, the >> >>>>>>>>>> secondary processes also need to be ?converted? to OpenMP >> threads. >> >>>>>>>>>> Thus the need for specific code in mumps.c. 
>> >>>>>>>>>> Thanks, >> >>>>>>>>>> Pierre >> >>>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users >> >>>>>>>>>>> > >> wrote: >> >>>>>>>>>>> >> >>>>>>>>>>> Hi Junchao, >> >>>>>>>>>>> >> >>>>>>>>>>> Yes that is what I want. >> >>>>>>>>>>> >> >>>>>>>>>>> Chang >> >>>>>>>>>>> >> >>>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >> >>>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> wrote: >> >>>>>>>>>>>> Junchao, >> >>>>>>>>>>>> If I understand correctly Chang is using the block >> >>>>>>>>>>>> Jacobi >> >>>>>>>>>>>> method with a single block for a number of MPI ranks and >> >>>>>>>>>>>> a direct >> >>>>>>>>>>>> solver for each block so it uses >> >>>>>>>>>>>> PCSetUp_BJacobi_Multiproc() which >> >>>>>>>>>>>> is code Hong Zhang wrote a number of years ago for CPUs. >> >>>>>>>>>>>> For their >> >>>>>>>>>>>> particular problems this preconditioner works well, but >> >>>>>>>>>>>> using an >> >>>>>>>>>>>> iterative solver on the blocks does not work well. >> >>>>>>>>>>>> If we had complete MPI-GPU direct solvers he could >> >>>>>>>>>>>> just use >> >>>>>>>>>>>> the current code with MPIAIJCUSPARSE on each block but >> >>>>>>>>>>>> since we do >> >>>>>>>>>>>> not he would like to use a single GPU for each block, >> >>>>>>>>>>>> this means >> >>>>>>>>>>>> that diagonal blocks of the global parallel MPI matrix >> >>>>>>>>>>>> needs to be >> >>>>>>>>>>>> sent to a subset of the GPUs (one GPU per block, which >> >>>>>>>>>>>> has multiple >> >>>>>>>>>>>> MPI ranks associated with the blocks). Similarly for the >> >>>>>>>>>>>> triangular >> >>>>>>>>>>>> solves the blocks of the right hand side needs to be >> >>>>>>>>>>>> shipped to the >> >>>>>>>>>>>> appropriate GPU and the resulting solution shipped back >> >>>>>>>>>>>> to the >> >>>>>>>>>>>> multiple GPUs. So Chang is absolutely correct, this is >> >>>>>>>>>>>> somewhat like >> >>>>>>>>>>>> your code for MUMPS with OpenMP. OK, I now understand >> >>>>>>>>>>>> the background.. >> >>>>>>>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the >> >>>>>>>>>>>> blocks on the >> >>>>>>>>>>>> MPI ranks and then shrink each block down to a single >> >>>>>>>>>>>> GPU but this >> >>>>>>>>>>>> would be pretty inefficient, ideally one would go >> >>>>>>>>>>>> directly from the >> >>>>>>>>>>>> big MPI matrix on all the GPUs to the sub matrices on >> >>>>>>>>>>>> the subset of >> >>>>>>>>>>>> GPUs. But this may be a large coding project. >> >>>>>>>>>>>> I don't understand these sentences. Why do you say "shrink"? >> >>>>>>>>>>>> In my mind, we just need to move each block (submatrix) >> >>>>>>>>>>>> living over multiple MPI ranks to one of them and solve >> >>>>>>>>>>>> directly there. In other words, we keep blocks' size, no >> >>>>>>>>>>>> shrinking or expanding. >> >>>>>>>>>>>> As mentioned before, cusparse does not provide LU >> >>>>>>>>>>>> factorization. So the LU factorization would be done on CPU, >> >>>>>>>>>>>> and the solve be done on GPU. I assume Chang wants to gain >> >>>>>>>>>>>> from the (potential) faster solve (instead of factorization) >> >>>>>>>>>>>> on GPU. >> >>>>>>>>>>>> Barry >> >>>>>>>>>>>> Since the matrices being factored and solved directly >> >>>>>>>>>>>> are relatively >> >>>>>>>>>>>> large it is possible that the cusparse code could be >> >>>>>>>>>>>> reasonably >> >>>>>>>>>>>> efficient (they are not the tiny problems one gets at >> >>>>>>>>>>>> the coarse >> >>>>>>>>>>>> level of multigrid). 
Of course, this is speculation, I >> don't >> >>>>>>>>>>>> actually know how much better the cusparse code would be >> >>>>>>>>>>>> on the >> >>>>>>>>>>>> direct solver than a good CPU direct sparse solver. >> >>>>>>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu > >>>>>>>>>>>> >> >>>>>>>>>>>> >> wrote: >> >>>>>>>>>>>> > >> >>>>>>>>>>>> > Sorry I am not familiar with the details either. Can >> >>>>>>>>>>>> you please >> >>>>>>>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in >> mumps.c? >> >>>>>>>>>>>> > >> >>>>>>>>>>>> > Chang >> >>>>>>>>>>>> > >> >>>>>>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: >> >>>>>>>>>>>> >> Hi Chang, >> >>>>>>>>>>>> >> I did the work in mumps. It is easy for me to >> >>>>>>>>>>>> understand >> >>>>>>>>>>>> gathering matrix rows to one process. >> >>>>>>>>>>>> >> But how to gather blocks (submatrices) to form a >> >>>>>>>>>>>> large block? Can you draw a picture of that? >> >>>>>>>>>>>> >> Thanks >> >>>>>>>>>>>> >> --Junchao Zhang >> >>>>>>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via >> >>>>>>>>>>>> petsc-users >> >>>>>>>>>>>> > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> > >> >>>>>>>>>>>> > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>> >> >>>>>>>>>>>> wrote: >> >>>>>>>>>>>> >> Hi Barry, >> >>>>>>>>>>>> >> I think mumps solver in petsc does support that. >> >>>>>>>>>>>> You can >> >>>>>>>>>>>> check the >> >>>>>>>>>>>> >> documentation on "-mat_mumps_use_omp_threads" at >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >> >>>>>>>>>>>> < >> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html> >> >>>>>>>>>>>> >> >>>>>>>>>>>> < >> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >> >>>>>>>>>>>> < >> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> < >> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >> >>>>>>>>>>>> < >> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html> >> >>>>>>>>>>>> >> >>>>>>>>>>>> < >> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >> >>>>>>>>>>>> < >> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> and the code enclosed by #if >> >>>>>>>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in >> >>>>>>>>>>>> >> functions MatMumpsSetUpDistRHSInfo and >> >>>>>>>>>>>> >> MatMumpsGatherNonzerosOnMaster in >> >>>>>>>>>>>> >> mumps.c >> >>>>>>>>>>>> >> 1. I understand it is ideal to do one MPI rank >> >>>>>>>>>>>> per GPU. >> >>>>>>>>>>>> However, I am >> >>>>>>>>>>>> >> working on an existing code that was developed >> >>>>>>>>>>>> based on MPI >> >>>>>>>>>>>> and the the >> >>>>>>>>>>>> >> # of mpi ranks is typically equal to # of cpu >> >>>>>>>>>>>> cores. We don't >> >>>>>>>>>>>> want to >> >>>>>>>>>>>> >> change the whole structure of the code. >> >>>>>>>>>>>> >> 2. What you have suggested has been coded in >> >>>>>>>>>>>> mumps.c. See >> >>>>>>>>>>>> function >> >>>>>>>>>>>> >> MatMumpsSetUpDistRHSInfo. >> >>>>>>>>>>>> >> Regards, >> >>>>>>>>>>>> >> Chang >> >>>>>>>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: >> >>>>>>>>>>>> >> > >> >>>>>>>>>>>> >> > >> >>>>>>>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu >> >>>>>>>>>>>> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >>> wrote: >> >>>>>>>>>>>> >> >> >> >>>>>>>>>>>> >> >> Hi Barry, >> >>>>>>>>>>>> >> >> >> >>>>>>>>>>>> >> >> That is exactly what I want. 
>> >>>>>>>>>>>> >> >> >> >>>>>>>>>>>> >> >> Back to my original question, I am looking >> >>>>>>>>>>>> for an approach to >> >>>>>>>>>>>> >> transfer >> >>>>>>>>>>>> >> >> matrix >> >>>>>>>>>>>> >> >> data from many MPI processes to "master" MPI >> >>>>>>>>>>>> >> >> processes, each of which taking care of one >> >>>>>>>>>>>> GPU, and then >> >>>>>>>>>>>> upload >> >>>>>>>>>>>> >> the data to GPU to >> >>>>>>>>>>>> >> >> solve. >> >>>>>>>>>>>> >> >> One can just grab some codes from mumps.c to >> >>>>>>>>>>>> aijcusparse.cu > >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >>. >> >>>>>>>>>>>> >> > >> >>>>>>>>>>>> >> > mumps.c doesn't actually do that. It never >> >>>>>>>>>>>> needs to >> >>>>>>>>>>>> copy the >> >>>>>>>>>>>> >> entire matrix to a single MPI rank. >> >>>>>>>>>>>> >> > >> >>>>>>>>>>>> >> > It would be possible to write such a code >> >>>>>>>>>>>> that you >> >>>>>>>>>>>> suggest but >> >>>>>>>>>>>> >> it is not clear that it makes sense >> >>>>>>>>>>>> >> > >> >>>>>>>>>>>> >> > 1) For normal PETSc GPU usage there is one >> >>>>>>>>>>>> GPU per MPI >> >>>>>>>>>>>> rank, so >> >>>>>>>>>>>> >> while your one GPU per big domain is solving its >> >>>>>>>>>>>> systems the >> >>>>>>>>>>>> other >> >>>>>>>>>>>> >> GPUs (with the other MPI ranks that share that >> >>>>>>>>>>>> domain) are doing >> >>>>>>>>>>>> >> nothing. >> >>>>>>>>>>>> >> > >> >>>>>>>>>>>> >> > 2) For each triangular solve you would have to >> >>>>>>>>>>>> gather the >> >>>>>>>>>>>> right >> >>>>>>>>>>>> >> hand side from the multiple ranks to the single >> >>>>>>>>>>>> GPU to pass it to >> >>>>>>>>>>>> >> the GPU solver and then scatter the resulting >> >>>>>>>>>>>> solution back >> >>>>>>>>>>>> to all >> >>>>>>>>>>>> >> of its subdomain ranks. >> >>>>>>>>>>>> >> > >> >>>>>>>>>>>> >> > What I was suggesting was assign an entire >> >>>>>>>>>>>> subdomain to a >> >>>>>>>>>>>> >> single MPI rank, thus it does everything on one >> >>>>>>>>>>>> GPU and can >> >>>>>>>>>>>> use the >> >>>>>>>>>>>> >> GPU solver directly. If all the major >> >>>>>>>>>>>> computations of a subdomain >> >>>>>>>>>>>> >> can fit and be done on a single GPU then you >> would be >> >>>>>>>>>>>> utilizing all >> >>>>>>>>>>>> >> the GPUs you are using effectively. >> >>>>>>>>>>>> >> > >> >>>>>>>>>>>> >> > Barry >> >>>>>>>>>>>> >> > >> >>>>>>>>>>>> >> > >> >>>>>>>>>>>> >> > >> >>>>>>>>>>>> >> >> >> >>>>>>>>>>>> >> >> Chang >> >>>>>>>>>>>> >> >> >> >>>>>>>>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >> >>>>>>>>>>>> >> >>> Chang, >> >>>>>>>>>>>> >> >>> You are correct there is no MPI + GPU >> >>>>>>>>>>>> direct >> >>>>>>>>>>>> solvers that >> >>>>>>>>>>>> >> currently do the triangular solves with MPI + GPU >> >>>>>>>>>>>> parallelism >> >>>>>>>>>>>> that I >> >>>>>>>>>>>> >> am aware of. You are limited that individual >> >>>>>>>>>>>> triangular solves be >> >>>>>>>>>>>> >> done on a single GPU. I can only suggest making >> >>>>>>>>>>>> each subdomain as >> >>>>>>>>>>>> >> big as possible to utilize each GPU as much as >> >>>>>>>>>>>> possible for the >> >>>>>>>>>>>> >> direct triangular solves. 
>> >>>>>>>>>>>> >> >>> Barry >> >>>>>>>>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via >> >>>>>>>>>>>> petsc-users >> >>>>>>>>>>>> >> > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> > >> >>>>>>>>>>>> > >>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>> >>> >> >>>>>>>>>>>> wrote: >> >>>>>>>>>>>> >> >>>> >> >>>>>>>>>>>> >> >>>> Hi Mark, >> >>>>>>>>>>>> >> >>>> >> >>>>>>>>>>>> >> >>>> '-mat_type aijcusparse' works with >> >>>>>>>>>>>> mpiaijcusparse with >> >>>>>>>>>>>> other >> >>>>>>>>>>>> >> solvers, but with -pc_factor_mat_solver_type >> >>>>>>>>>>>> cusparse, it >> >>>>>>>>>>>> will give >> >>>>>>>>>>>> >> an error. >> >>>>>>>>>>>> >> >>>> >> >>>>>>>>>>>> >> >>>> Yes what I want is to have mumps or superlu >> >>>>>>>>>>>> to do the >> >>>>>>>>>>>> >> factorization, and then do the rest, including >> >>>>>>>>>>>> GMRES solver, >> >>>>>>>>>>>> on gpu. >> >>>>>>>>>>>> >> Is that possible? >> >>>>>>>>>>>> >> >>>> >> >>>>>>>>>>>> >> >>>> I have tried to use aijcusparse with >> >>>>>>>>>>>> superlu_dist, it >> >>>>>>>>>>>> runs but >> >>>>>>>>>>>> >> the iterative solver is still running on CPUs. I >> have >> >>>>>>>>>>>> contacted the >> >>>>>>>>>>>> >> superlu group and they confirmed that is the case >> >>>>>>>>>>>> right now. >> >>>>>>>>>>>> But if >> >>>>>>>>>>>> >> I set -pc_factor_mat_solver_type cusparse, it >> >>>>>>>>>>>> seems that the >> >>>>>>>>>>>> >> iterative solver is running on GPU. >> >>>>>>>>>>>> >> >>>> >> >>>>>>>>>>>> >> >>>> Chang >> >>>>>>>>>>>> >> >>>> >> >>>>>>>>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >> >>>>>>>>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >> >>>>>>>>>>>> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >>>> wrote: >> >>>>>>>>>>>> >> >>>>> Thank you Junchao for explaining this. >> >>>>>>>>>>>> I guess in >> >>>>>>>>>>>> my case >> >>>>>>>>>>>> >> the code is >> >>>>>>>>>>>> >> >>>>> just calling a seq solver like superlu >> >>>>>>>>>>>> to do >> >>>>>>>>>>>> >> factorization on GPUs. >> >>>>>>>>>>>> >> >>>>> My idea is that I want to have a >> >>>>>>>>>>>> traditional MPI >> >>>>>>>>>>>> code to >> >>>>>>>>>>>> >> utilize GPUs >> >>>>>>>>>>>> >> >>>>> with cusparse. Right now cusparse does >> >>>>>>>>>>>> not support >> >>>>>>>>>>>> mpiaij >> >>>>>>>>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' >> >>>>>>>>>>>> will give you an >> >>>>>>>>>>>> >> mpiaijcusparse matrix with > 1 processes. >> >>>>>>>>>>>> >> >>>>> (-mat_type mpiaijcusparse might also work >> >>>>>>>>>>>> with >1 proc). >> >>>>>>>>>>>> >> >>>>> However, I see in grepping the repo that >> >>>>>>>>>>>> all the mumps and >> >>>>>>>>>>>> >> superlu tests use aij or sell matrix type. >> >>>>>>>>>>>> >> >>>>> MUMPS and SuperLU provide their own >> >>>>>>>>>>>> solves, I assume >> >>>>>>>>>>>> .... but >> >>>>>>>>>>>> >> you might want to do other matrix operations on >> >>>>>>>>>>>> the GPU. Is >> >>>>>>>>>>>> that the >> >>>>>>>>>>>> >> issue? >> >>>>>>>>>>>> >> >>>>> Did you try -mat_type aijcusparse with >> >>>>>>>>>>>> MUMPS and/or >> >>>>>>>>>>>> SuperLU >> >>>>>>>>>>>> >> have a problem? 
(no test with it so it probably >> >>>>>>>>>>>> does not work) >> >>>>>>>>>>>> >> >>>>> Thanks, >> >>>>>>>>>>>> >> >>>>> Mark >> >>>>>>>>>>>> >> >>>>> so I >> >>>>>>>>>>>> >> >>>>> want the code to have a mpiaij matrix >> >>>>>>>>>>>> when adding >> >>>>>>>>>>>> all the >> >>>>>>>>>>>> >> matrix terms, >> >>>>>>>>>>>> >> >>>>> and then transform the matrix to >> >>>>>>>>>>>> seqaij when doing the >> >>>>>>>>>>>> >> factorization >> >>>>>>>>>>>> >> >>>>> and >> >>>>>>>>>>>> >> >>>>> solve. This involves sending the data >> >>>>>>>>>>>> to the master >> >>>>>>>>>>>> >> process, and I >> >>>>>>>>>>>> >> >>>>> think >> >>>>>>>>>>>> >> >>>>> the petsc mumps solver have something >> >>>>>>>>>>>> similar already. >> >>>>>>>>>>>> >> >>>>> Chang >> >>>>>>>>>>>> >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang >> wrote: >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM >> >>>>>>>>>>>> Mark Adams >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >>> >> >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >>>>> wrote: >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM >> >>>>>>>>>>>> Chang Liu >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >>> >> >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >>>>> wrote: >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > Hi Mark, >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > The option I use is like >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > -pc_type bjacobi >> >>>>>>>>>>>> -pc_bjacobi_blocks 16 >> >>>>>>>>>>>> >> -ksp_type fgmres >> >>>>>>>>>>>> >> >>>>> -mat_type >> >>>>>>>>>>>> >> >>>>> > aijcusparse >> >>>>>>>>>>>> *-sub_pc_factor_mat_solver_type >> >>>>>>>>>>>> >> cusparse >> >>>>>>>>>>>> >> >>>>> *-sub_ksp_type >> >>>>>>>>>>>> >> >>>>> > preonly *-sub_pc_type lu* >> >>>>>>>>>>>> -ksp_max_it 2000 >> >>>>>>>>>>>> >> -ksp_rtol 1.e-300 >> >>>>>>>>>>>> >> >>>>> > -ksp_atol 1.e-300 >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > Note, If you use -log_view the >> >>>>>>>>>>>> last column >> >>>>>>>>>>>> (rows >> >>>>>>>>>>>> >> are the >> >>>>>>>>>>>> >> >>>>> method like >> >>>>>>>>>>>> >> >>>>> > MatFactorNumeric) has the >> >>>>>>>>>>>> percent of work >> >>>>>>>>>>>> in the GPU. >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > Junchao: *This* implies that we >> >>>>>>>>>>>> have a >> >>>>>>>>>>>> cuSparse LU >> >>>>>>>>>>>> >> >>>>> factorization. Is >> >>>>>>>>>>>> >> >>>>> > that correct? (I don't think we >> do) >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > No, we don't have cuSparse LU >> >>>>>>>>>>>> factorization. If you check >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >> >>>>>>>>>>>> find it >> >>>>>>>>>>>> >> calls >> >>>>>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJ() >> instead. >> >>>>>>>>>>>> >> >>>>> > So I don't understand Chang's idea. 
>> >>>>>>>>>>>> Do you want to >> >>>>>>>>>>>> >> make bigger >> >>>>>>>>>>>> >> >>>>> blocks? >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > I think this one do both >> >>>>>>>>>>>> factorization and >> >>>>>>>>>>>> >> solve on gpu. >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > You can check the >> >>>>>>>>>>>> runex72_aijcusparse.sh file >> >>>>>>>>>>>> >> in petsc >> >>>>>>>>>>>> >> >>>>> install >> >>>>>>>>>>>> >> >>>>> > directory, and try it your >> >>>>>>>>>>>> self (this >> >>>>>>>>>>>> is only lu >> >>>>>>>>>>>> >> >>>>> factorization >> >>>>>>>>>>>> >> >>>>> > without >> >>>>>>>>>>>> >> >>>>> > iterative solve). >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > Chang >> >>>>>>>>>>>> >> >>>>> > >> >>>>>>>>>>>> >> >>>>> > On 10/12/21 1:17 PM, Mark >> >>>>>>>>>>>> Adams wrote: >> >>>>>>>>>>>> >> >>>>> > > >> >>>>>>>>>>>> >> >>>>> > > >> >>>>>>>>>>>> >> >>>>> > > On Tue, Oct 12, 2021 at >> >>>>>>>>>>>> 11:19 AM >> >>>>>>>>>>>> Chang Liu >> >>>>>>>>>>>> >> >>>>> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >>> >> >>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >>>> >> >>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >>> >> >>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> >> >> >>>>>>>>>>>> > >> >>>>>>>>>>>> >> >>>>>>>>>>>> >>>>>> wrote: >> >>>>>>>>>>>> >> >>>>> > > >> >>>>>>>>>>>> >> >>>>> > > Hi Junchao, >> >>>>>>>>>>>> >> >>>>> > > >> >>>>>>>>>>>> >> >>>>> > > No I only needs it >> >>>>>>>>>>>> to be transferred >> >>>>>>>>>>>> >> within a >> >>>>>>>>>>>> >> >>>>> node. I use >> >>>>>>>>>>>> >> >>>>> > block-Jacobi >> >>>>>>>>>>>> >> >>>>> > > method and GMRES to >> >>>>>>>>>>>> solve the sparse >> >>>>>>>>>>>> >> matrix, so each >> >>>>>>>>>>>> >> >>>>> > direct solver will >> >>>>>>>>>>>> >> >>>>> > > take care of a >> >>>>>>>>>>>> sub-block of the >> >>>>>>>>>>>> whole >> >>>>>>>>>>>> >> matrix. In this >> >>>>>>>>>>>> >> >>>>> > way, I can use >> >>>>>>>>>>>> >> >>>>> > > one >> >>>>>>>>>>>> >> >>>>> > > GPU to solve one >> >>>>>>>>>>>> sub-block, which is >> >>>>>>>>>>>> >> stored within >> >>>>>>>>>>>> >> >>>>> one node. >> >>>>>>>>>>>> >> >>>>> > > >> >>>>>>>>>>>> >> >>>>> > > It was stated in the >> >>>>>>>>>>>> documentation that >> >>>>>>>>>>>> >> cusparse >> >>>>>>>>>>>> >> >>>>> solver >> >>>>>>>>>>>> >> >>>>> > is slow. >> >>>>>>>>>>>> >> >>>>> > > However, in my test >> >>>>>>>>>>>> using >> >>>>>>>>>>>> ex72.c, the >> >>>>>>>>>>>> >> cusparse >> >>>>>>>>>>>> >> >>>>> solver is >> >>>>>>>>>>>> >> >>>>> > faster than >> >>>>>>>>>>>> >> >>>>> > > mumps or >> >>>>>>>>>>>> superlu_dist on CPUs. >> >>>>>>>>>>>> >> >>>>> > > >> >>>>>>>>>>>> >> >>>>> > > >> >>>>>>>>>>>> >> >>>>> > > Are we talking about the >> >>>>>>>>>>>> factorization, the >> >>>>>>>>>>>> >> solve, or >> >>>>>>>>>>>> >> >>>>> both? >> >>>>>>>>>>>> >> >>>>> > > >> >>>>>>>>>>>> >> >>>>> > > We do not have an >> >>>>>>>>>>>> interface to >> >>>>>>>>>>>> cuSparse's LU >> >>>>>>>>>>>> >> >>>>> factorization (I >> >>>>>>>>>>>> >> >>>>> > just >> >>>>>>>>>>>> >> >>>>> > > learned that it exists a >> >>>>>>>>>>>> few weeks ago). 
>> >>>>> > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse' ? This would be the CPU factorization, which is the dominant cost.
>> >>>>> > >
>> >>>>> > > Chang
>> >>>>> > >
>> >>>>> > > On 10/12/21 10:24 AM, Junchao Zhang wrote:
>> >>>>> > > > Hi, Chang,
>> >>>>> > > > For the mumps solver, we usually transfers matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right?
>> >>>>> > > >
>> >>>>> > > > Mark, I remember you said cusparse solve is slow and you would rather do it on CPU. Is it right?
>> >>>>> > > >
>> >>>>> > > > --Junchao Zhang
>> >>>>> > > >
>> >>>>> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users <petsc-users at mcs.anl.gov> wrote:
>> >>>>> > > >
>> >>>>> > > > Hi,
>> >>>>> > > >
>> >>>>> > > > Currently, it is possible to use mumps solver in PETSC with -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then master rank will call mumps with OpenMP to solve the matrix.
>> >>>>> > > >
>> >>>>> > > > I wonder if someone can develop similar option for cusparse solver. Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to GPU to solve. In this way, one can use cusparse solver for a MPI program.
>> >>>>> > > >
>> >>>>> > > > Chang
>> >>>>> > > > --
>> >>>>> > > > Chang Liu
>> >>>>> > > > Staff Research Physicist
>> >>>>> > > > +1 609 243 3438
>> >>>>> > > > cliu at pppl.gov
>> >>>>> > > > Princeton Plasma Physics Laboratory
>> >>>>> > > > 100 Stellarator Rd, Princeton NJ 08540, USA
>> >>>>>
>> --
>> Chang Liu
>> Staff Research Physicist
>> +1 609 243 3438
>> cliu at pppl.gov
>> Princeton Plasma Physics Laboratory
>> 100 Stellarator Rd, Princeton NJ 08540, USA
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From swarnava89 at gmail.com Mon Oct 18 20:47:29 2021 From: swarnava89 at gmail.com (Swarnava Ghosh) Date: Mon, 18 Oct 2021 21:47:29 -0400 Subject: [petsc-users] [External] Re: MatVec on GPUs In-Reply-To: References: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov> Message-ID: Hi Junchao, If I want to pass command line options as -mymat_mat_type aijcusparse, should it be MatSetOptionsPrefix(A,"mymat"); or MatSetOptionsPrefix(A,"mymat_"); ? Could you please clarify? Sincerely, Swarnava On Mon, Oct 18, 2021 at 9:23 PM Junchao Zhang wrote: > MatSetOptionsPrefix(A,"mymat") > VecSetOptionsPrefix(v,"myvec") > > --Junchao Zhang > > > On Mon, Oct 18, 2021 at 8:04 PM Chang Liu wrote: > >> Hi Junchao, >> >> Thank you for your answer. I tried MatConvert and it works. I didn't >> make it before because I forgot to convert a vector from mpi to mpicuda >> previously. >> >> For vector, there is no VecConvert to use, so I have to do VecDuplicate, >> VecSetType and VecCopy. Is there an easier option? >> > As Matt suggested, you could single out the matrix and vector with > options prefix and set their type on command line > > MatSetOptionsPrefix(A,"mymat"); > VecSetOptionsPrefix(v,"myvec"); > > Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda > > A simpler code is to have the vector type automatically set by > MatCreateVecs(A,&v,NULL) > > >> Chang >> >> On 10/18/21 5:23 PM, Junchao Zhang wrote: >> > >> > >> > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users >> > > wrote: >> > >> > Hi Matt, >> > >> > I have a related question. In my code I have many matrices and I >> only >> > want to have one living on GPU, the others still staying on CPU mem. >> > >> > I wonder if there is an easier way to copy a mpiaij matrix to >> > mpiaijcusparse (in other words, copy data to GPUs). I can think of >> > creating a new mpiaijcusparse matrix, and copying the data line by >> > line. >> > But I wonder if there is a better option. >> > >> > I have tried MatCopy and MatConvert but neither work. >> > >> > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? >> > >> > >> > Chang >> > >> > On 10/17/21 7:50 PM, Matthew Knepley wrote: >> > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh >> > >> > > >> >> wrote: >> > > >> > > Do I need convert the MATSEQBAIJ to a cuda matrix in code? >> > > >> > > >> > > You would need a call to MatSetFromOptions() to take that type >> > from the >> > > command line, and not have >> > > the type hard-coded in your application. It is generally a bad >> > idea to >> > > hard code the implementation type. >> > > >> > > If I do it from command line, then are the other MatVec >> calls are >> > > ported onto CUDA? I have many MatVec calls in my code, but I >> > > specifically want to port just one call. >> > > >> > > >> > > You can give that one matrix an options prefix to isolate it. >> > > >> > > Thanks, >> > > >> > > Matt >> > > >> > > Sincerely, >> > > Swarnava >> > > >> > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang >> > > >> > >> >> > wrote: >> > > >> > > You can do that with command line options -mat_type >> > aijcusparse >> > > -vec_type cuda >> > > >> > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh >> > > >> > >> wrote: >> > > >> > > Dear Petsc team, >> > > >> > > I had a query regarding using CUDA to accelerate a >> matrix >> > > vector product. >> > > I have a sequential sparse matrix (MATSEQBAIJ type). >> > I want >> > > to port a MatVec call onto GPUs. Is there any >> > code/example I >> > > can look at? 
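A minimal sketch of the in-place conversion and matching-vector approach quoted above; the matrix name A, the assumption that it is already assembled, and the omission of error checking are illustrative, with only MatConvert() using MAT_INPLACE_MATRIX, the MATMPIAIJCUSPARSE type, and MatCreateVecs() taken from the thread:

  Mat A;   /* an already assembled MATMPIAIJ matrix (assumed) */
  Vec v;

  /* convert only this matrix in place so its data lives on the GPU */
  MatConvert(A, MATMPIAIJCUSPARSE, MAT_INPLACE_MATRIX, &A);

  /* there is no VecConvert, but a compatible vector can be taken from the
     matrix itself; it automatically gets the matching CUDA vector type,
     so VecDuplicate/VecSetType/VecCopy are not needed */
  MatCreateVecs(A, &v, NULL);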
>> > > >> > > Sincerely, >> > > SG >> > > >> > > >> > > >> > > -- >> > > What most experimenters take for granted before they begin their >> > > experiments is infinitely more interesting than any results to >> which >> > > their experiments lead. >> > > -- Norbert Wiener >> > > >> > > https://www.cse.buffalo.edu/~knepley/ >> > >> > > > > >> > >> > -- >> > Chang Liu >> > Staff Research Physicist >> > +1 609 243 3438 >> > cliu at pppl.gov >> > Princeton Plasma Physics Laboratory >> > 100 Stellarator Rd, Princeton NJ 08540, USA >> > >> >> -- >> Chang Liu >> Staff Research Physicist >> +1 609 243 3438 >> cliu at pppl.gov >> Princeton Plasma Physics Laboratory >> 100 Stellarator Rd, Princeton NJ 08540, USA >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Mon Oct 18 21:08:03 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Mon, 18 Oct 2021 21:08:03 -0500 Subject: [petsc-users] [External] Re: MatVec on GPUs In-Reply-To: References: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov> Message-ID: On Mon, Oct 18, 2021 at 8:47 PM Swarnava Ghosh wrote: > Hi Junchao, > > If I want to pass command line options as -mymat_mat_type aijcusparse, > should it be MatSetOptionsPrefix(A,"mymat"); or > MatSetOptionsPrefix(A,"mymat_"); ? Could you please clarify? > my fault, it should be MatSetOptionsPrefix(A,"mymat_"), as seen in mat/tests/ex62.c Thanks > > Sincerely, > Swarnava > > On Mon, Oct 18, 2021 at 9:23 PM Junchao Zhang > wrote: > >> MatSetOptionsPrefix(A,"mymat") >> VecSetOptionsPrefix(v,"myvec") >> >> --Junchao Zhang >> >> >> On Mon, Oct 18, 2021 at 8:04 PM Chang Liu wrote: >> >>> Hi Junchao, >>> >>> Thank you for your answer. I tried MatConvert and it works. I didn't >>> make it before because I forgot to convert a vector from mpi to mpicuda >>> previously. >>> >>> For vector, there is no VecConvert to use, so I have to do VecDuplicate, >>> VecSetType and VecCopy. Is there an easier option? >>> >> As Matt suggested, you could single out the matrix and vector with >> options prefix and set their type on command line >> >> MatSetOptionsPrefix(A,"mymat"); >> VecSetOptionsPrefix(v,"myvec"); >> >> Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda >> >> A simpler code is to have the vector type automatically set by >> MatCreateVecs(A,&v,NULL) >> >> >>> Chang >>> >>> On 10/18/21 5:23 PM, Junchao Zhang wrote: >>> > >>> > >>> > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users >>> > > wrote: >>> > >>> > Hi Matt, >>> > >>> > I have a related question. In my code I have many matrices and I >>> only >>> > want to have one living on GPU, the others still staying on CPU >>> mem. >>> > >>> > I wonder if there is an easier way to copy a mpiaij matrix to >>> > mpiaijcusparse (in other words, copy data to GPUs). I can think of >>> > creating a new mpiaijcusparse matrix, and copying the data line by >>> > line. >>> > But I wonder if there is a better option. >>> > >>> > I have tried MatCopy and MatConvert but neither work. >>> > >>> > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? >>> > >>> > >>> > Chang >>> > >>> > On 10/17/21 7:50 PM, Matthew Knepley wrote: >>> > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh >>> > >>> > > >> >>> wrote: >>> > > >>> > > Do I need convert the MATSEQBAIJ to a cuda matrix in code? >>> > > >>> > > >>> > > You would need a call to MatSetFromOptions() to take that type >>> > from the >>> > > command line, and not have >>> > > the type hard-coded in your application. 
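To make the corrected prefix form concrete, a small sketch follows; the object names, the "myvec_" prefix, and the run line are illustrative assumptions, while MatSetOptionsPrefix()/VecSetOptionsPrefix() with a trailing underscore and the -mymat_mat_type aijcusparse / -myvec_vec_type cuda options come from the thread:

  /* note the trailing underscore, as clarified above */
  MatSetOptionsPrefix(A, "mymat_");
  MatSetFromOptions(A);    /* picks up -mymat_mat_type from the command line */

  VecSetOptionsPrefix(v, "myvec_");
  VecSetFromOptions(v);    /* picks up -myvec_vec_type from the command line */

  /* run with, for example:
       ./app -mymat_mat_type aijcusparse -myvec_vec_type cuda
     only the objects carrying these prefixes change type; all other
     matrices and vectors keep their CPU types */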
It is generally a bad >>> > idea to >>> > > hard code the implementation type. >>> > > >>> > > If I do it from command line, then are the other MatVec >>> calls are >>> > > ported onto CUDA? I have many MatVec calls in my code, but I >>> > > specifically want to port just one call. >>> > > >>> > > >>> > > You can give that one matrix an options prefix to isolate it. >>> > > >>> > > Thanks, >>> > > >>> > > Matt >>> > > >>> > > Sincerely, >>> > > Swarnava >>> > > >>> > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang >>> > > >>> > >> >>> > wrote: >>> > > >>> > > You can do that with command line options -mat_type >>> > aijcusparse >>> > > -vec_type cuda >>> > > >>> > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh >>> > > >>> > >> >>> wrote: >>> > > >>> > > Dear Petsc team, >>> > > >>> > > I had a query regarding using CUDA to accelerate a >>> matrix >>> > > vector product. >>> > > I have a sequential sparse matrix (MATSEQBAIJ type). >>> > I want >>> > > to port a MatVec call onto GPUs. Is there any >>> > code/example I >>> > > can look at? >>> > > >>> > > Sincerely, >>> > > SG >>> > > >>> > > >>> > > >>> > > -- >>> > > What most experimenters take for granted before they begin their >>> > > experiments is infinitely more interesting than any results to >>> which >>> > > their experiments lead. >>> > > -- Norbert Wiener >>> > > >>> > > https://www.cse.buffalo.edu/~knepley/ >>> > >>> > >> > > >>> > >>> > -- >>> > Chang Liu >>> > Staff Research Physicist >>> > +1 609 243 3438 >>> > cliu at pppl.gov >>> > Princeton Plasma Physics Laboratory >>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>> > >>> >>> -- >>> Chang Liu >>> Staff Research Physicist >>> +1 609 243 3438 >>> cliu at pppl.gov >>> Princeton Plasma Physics Laboratory >>> 100 Stellarator Rd, Princeton NJ 08540, USA >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From swarnava89 at gmail.com Mon Oct 18 21:13:31 2021 From: swarnava89 at gmail.com (Swarnava Ghosh) Date: Mon, 18 Oct 2021 22:13:31 -0400 Subject: [petsc-users] [External] Re: MatVec on GPUs In-Reply-To: References: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov> Message-ID: Thanks for the clarification, Junchao. Sincerely, Swarnava On Mon, Oct 18, 2021 at 10:08 PM Junchao Zhang wrote: > > > > On Mon, Oct 18, 2021 at 8:47 PM Swarnava Ghosh > wrote: > >> Hi Junchao, >> >> If I want to pass command line options as -mymat_mat_type aijcusparse, >> should it be MatSetOptionsPrefix(A,"mymat"); or >> MatSetOptionsPrefix(A,"mymat_"); ? Could you please clarify? >> > my fault, it should be MatSetOptionsPrefix(A,"mymat_"), as seen in > mat/tests/ex62.c > Thanks > > >> >> Sincerely, >> Swarnava >> >> On Mon, Oct 18, 2021 at 9:23 PM Junchao Zhang >> wrote: >> >>> MatSetOptionsPrefix(A,"mymat") >>> VecSetOptionsPrefix(v,"myvec") >>> >>> --Junchao Zhang >>> >>> >>> On Mon, Oct 18, 2021 at 8:04 PM Chang Liu wrote: >>> >>>> Hi Junchao, >>>> >>>> Thank you for your answer. I tried MatConvert and it works. I didn't >>>> make it before because I forgot to convert a vector from mpi to mpicuda >>>> previously. >>>> >>>> For vector, there is no VecConvert to use, so I have to do >>>> VecDuplicate, >>>> VecSetType and VecCopy. Is there an easier option? 
>>>> >>> As Matt suggested, you could single out the matrix and vector with >>> options prefix and set their type on command line >>> >>> MatSetOptionsPrefix(A,"mymat"); >>> VecSetOptionsPrefix(v,"myvec"); >>> >>> Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda >>> >>> A simpler code is to have the vector type automatically set by >>> MatCreateVecs(A,&v,NULL) >>> >>> >>>> Chang >>>> >>>> On 10/18/21 5:23 PM, Junchao Zhang wrote: >>>> > >>>> > >>>> > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users >>>> > > wrote: >>>> > >>>> > Hi Matt, >>>> > >>>> > I have a related question. In my code I have many matrices and I >>>> only >>>> > want to have one living on GPU, the others still staying on CPU >>>> mem. >>>> > >>>> > I wonder if there is an easier way to copy a mpiaij matrix to >>>> > mpiaijcusparse (in other words, copy data to GPUs). I can think of >>>> > creating a new mpiaijcusparse matrix, and copying the data line by >>>> > line. >>>> > But I wonder if there is a better option. >>>> > >>>> > I have tried MatCopy and MatConvert but neither work. >>>> > >>>> > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? >>>> > >>>> > >>>> > Chang >>>> > >>>> > On 10/17/21 7:50 PM, Matthew Knepley wrote: >>>> > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh >>>> > >>>> > > >> >>>> wrote: >>>> > > >>>> > > Do I need convert the MATSEQBAIJ to a cuda matrix in code? >>>> > > >>>> > > >>>> > > You would need a call to MatSetFromOptions() to take that type >>>> > from the >>>> > > command line, and not have >>>> > > the type hard-coded in your application. It is generally a bad >>>> > idea to >>>> > > hard code the implementation type. >>>> > > >>>> > > If I do it from command line, then are the other MatVec >>>> calls are >>>> > > ported onto CUDA? I have many MatVec calls in my code, but >>>> I >>>> > > specifically want to port just one call. >>>> > > >>>> > > >>>> > > You can give that one matrix an options prefix to isolate it. >>>> > > >>>> > > Thanks, >>>> > > >>>> > > Matt >>>> > > >>>> > > Sincerely, >>>> > > Swarnava >>>> > > >>>> > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang >>>> > > >>>> > >>> >>> >>>> > wrote: >>>> > > >>>> > > You can do that with command line options -mat_type >>>> > aijcusparse >>>> > > -vec_type cuda >>>> > > >>>> > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh >>>> > > >>>> > >> >>>> wrote: >>>> > > >>>> > > Dear Petsc team, >>>> > > >>>> > > I had a query regarding using CUDA to accelerate a >>>> matrix >>>> > > vector product. >>>> > > I have a sequential sparse matrix >>>> (MATSEQBAIJ type). >>>> > I want >>>> > > to port a MatVec call onto GPUs. Is there any >>>> > code/example I >>>> > > can look at? >>>> > > >>>> > > Sincerely, >>>> > > SG >>>> > > >>>> > > >>>> > > >>>> > > -- >>>> > > What most experimenters take for granted before they begin >>>> their >>>> > > experiments is infinitely more interesting than any results to >>>> which >>>> > > their experiments lead. 
>>>> > > -- Norbert Wiener >>>> > > >>>> > > https://www.cse.buffalo.edu/~knepley/ >>>> > >>>> > >>> > > >>>> > >>>> > -- >>>> > Chang Liu >>>> > Staff Research Physicist >>>> > +1 609 243 3438 >>>> > cliu at pppl.gov >>>> > Princeton Plasma Physics Laboratory >>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>>> > >>>> >>>> -- >>>> Chang Liu >>>> Staff Research Physicist >>>> +1 609 243 3438 >>>> cliu at pppl.gov >>>> Princeton Plasma Physics Laboratory >>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.croucher at auckland.ac.nz Mon Oct 18 21:35:08 2021 From: a.croucher at auckland.ac.nz (Adrian Croucher) Date: Tue, 19 Oct 2021 15:35:08 +1300 Subject: [petsc-users] HDF5 timestepping in PETSc 3.16 In-Reply-To: References: Message-ID: <18b4e68c-2524-932d-9aa4-c1a28ea44158@auckland.ac.nz> Any response on this? This is a bit of a showstopper for me - I can't upgrade to PETSc 3.16 if it does not allow my users to read their HDF5 files created using earlier versions of PETSc. So far I can't see a workaround. Possibly the timestepping functions need some kind of optional parameter to specify what the default timestepping attribute should be, if it's not present in the file (rather than just assuming it's false)? Regards, Adrian On 10/14/21 4:19 PM, Adrian Croucher wrote: > hi > > I am just testing out PETSc 3.16 and making the necessary changes to > my code. Amongst other things I now have to add a > PetscViewerHDF5PushTimestepping() call before starting to output > time-dependent results to HDF5 using a PetscViewer. > > I now also have to add this call before reading in sets of previously > computed time-dependent results (for restarting a simulation from the > results of a previous run). > > The problem with this is that if I try to read in the results of any > previous run, computed with an earlier version of PETSc (< 3.16), an > error is raised because the time-dependent datasets in the file do not > have the 'timestepping' attribute. > > Is there something else I need to do to make this work? > > - Adrian > -- Dr Adrian Croucher Senior Research Fellow Department of Engineering Science University of Auckland, New Zealand email: a.croucher at auckland.ac.nz tel: +64 (0)9 923 4611 From swarnava89 at gmail.com Mon Oct 18 22:56:26 2021 From: swarnava89 at gmail.com (Swarnava Ghosh) Date: Mon, 18 Oct 2021 23:56:26 -0400 Subject: [petsc-users] [External] Re: MatVec on GPUs In-Reply-To: References: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov> Message-ID: I am trying the port parts of the following function on GPUs. Essentially, the lines of codes between the two "TODO..." comments should be executed on the device. 
Here is the function: PetscScalar CalculateSpectralNodesAndWeights(LSDFT_OBJ *pLsdft, int p, int LIp) { PetscInt N_qp; N_qp = pLsdft->N_qp; int k; PetscScalar *a, *b; k=0; PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &a); PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &b); /* * TODO: COPY a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1, pLsdft->LapPlusVeffOprloc, k,p,N_qp from HOST to DEVICE * DO THE FOLLOWING OPERATIONS ON DEVICE */ //zero out vectors VecZeroEntries(pLsdft->Vk); VecZeroEntries(pLsdft->Vkm1); VecZeroEntries(pLsdft->Vkp1); VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES); MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vkm1,pLsdft->Vk); VecDot(pLsdft->Vkm1, pLsdft->Vk, &a[0]); VecAXPY(pLsdft->Vk, -a[0], pLsdft->Vkm1); VecNorm(pLsdft->Vk, NORM_2, &b[0]); VecScale(pLsdft->Vk, 1.0 / b[0]); for (k = 0; k < N_qp; k++) { MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vk,pLsdft->Vkp1); VecDot(pLsdft->Vk, pLsdft->Vkp1, &a[k + 1]); VecAXPY(pLsdft->Vkp1, -a[k + 1], pLsdft->Vk); VecAXPY(pLsdft->Vkp1, -b[k], pLsdft->Vkm1); VecCopy(pLsdft->Vk, pLsdft->Vkm1); VecNorm(pLsdft->Vkp1, NORM_2, &b[k + 1]); VecCopy(pLsdft->Vkp1, pLsdft->Vk); VecScale(pLsdft->Vk, 1.0 / b[k + 1]); } /* * TODO: Copy back a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1, pLsdft->LapPlusVeffOprloc, k,p,N_qp from DEVICE to HOST */ /* * Some operation with a, and b on HOST * */ TridiagEigenVecSolve_NodesAndWeights(pLsdft, a, b, N_qp, LIp); // operation on the host // free a,b PetscFree(a); PetscFree(b); return 0; } If I just use the command line options to set vectors Vk,Vkp1 and Vkm1 as cuda vectors and the matrix LapPlusVeffOprloc as aijcusparse, will the lines of code between the two "TODO" comments be entirely executed on the device? Sincerely, Swarnava On Mon, Oct 18, 2021 at 10:13 PM Swarnava Ghosh wrote: > Thanks for the clarification, Junchao. > > Sincerely, > Swarnava > > On Mon, Oct 18, 2021 at 10:08 PM Junchao Zhang > wrote: > >> >> >> >> On Mon, Oct 18, 2021 at 8:47 PM Swarnava Ghosh >> wrote: >> >>> Hi Junchao, >>> >>> If I want to pass command line options as -mymat_mat_type aijcusparse, >>> should it be MatSetOptionsPrefix(A,"mymat"); or >>> MatSetOptionsPrefix(A,"mymat_"); ? Could you please clarify? >>> >> my fault, it should be MatSetOptionsPrefix(A,"mymat_"), as seen in >> mat/tests/ex62.c >> Thanks >> >> >>> >>> Sincerely, >>> Swarnava >>> >>> On Mon, Oct 18, 2021 at 9:23 PM Junchao Zhang >>> wrote: >>> >>>> MatSetOptionsPrefix(A,"mymat") >>>> VecSetOptionsPrefix(v,"myvec") >>>> >>>> --Junchao Zhang >>>> >>>> >>>> On Mon, Oct 18, 2021 at 8:04 PM Chang Liu wrote: >>>> >>>>> Hi Junchao, >>>>> >>>>> Thank you for your answer. I tried MatConvert and it works. I didn't >>>>> make it before because I forgot to convert a vector from mpi to >>>>> mpicuda >>>>> previously. >>>>> >>>>> For vector, there is no VecConvert to use, so I have to do >>>>> VecDuplicate, >>>>> VecSetType and VecCopy. Is there an easier option? >>>>> >>>> As Matt suggested, you could single out the matrix and vector with >>>> options prefix and set their type on command line >>>> >>>> MatSetOptionsPrefix(A,"mymat"); >>>> VecSetOptionsPrefix(v,"myvec"); >>>> >>>> Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda >>>> >>>> A simpler code is to have the vector type automatically set by >>>> MatCreateVecs(A,&v,NULL) >>>> >>>> >>>>> Chang >>>>> >>>>> On 10/18/21 5:23 PM, Junchao Zhang wrote: >>>>> > >>>>> > >>>>> > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users >>>>> > > wrote: >>>>> > >>>>> > Hi Matt, >>>>> > >>>>> > I have a related question. 
In my code I have many matrices and I >>>>> only >>>>> > want to have one living on GPU, the others still staying on CPU >>>>> mem. >>>>> > >>>>> > I wonder if there is an easier way to copy a mpiaij matrix to >>>>> > mpiaijcusparse (in other words, copy data to GPUs). I can think >>>>> of >>>>> > creating a new mpiaijcusparse matrix, and copying the data line >>>>> by >>>>> > line. >>>>> > But I wonder if there is a better option. >>>>> > >>>>> > I have tried MatCopy and MatConvert but neither work. >>>>> > >>>>> > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? >>>>> > >>>>> > >>>>> > Chang >>>>> > >>>>> > On 10/17/21 7:50 PM, Matthew Knepley wrote: >>>>> > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh >>>>> > >>>>> > > >> >>>>> wrote: >>>>> > > >>>>> > > Do I need convert the MATSEQBAIJ to a cuda matrix in code? >>>>> > > >>>>> > > >>>>> > > You would need a call to MatSetFromOptions() to take that type >>>>> > from the >>>>> > > command line, and not have >>>>> > > the type hard-coded in your application. It is generally a bad >>>>> > idea to >>>>> > > hard code the implementation type. >>>>> > > >>>>> > > If I do it from command line, then are the other MatVec >>>>> calls are >>>>> > > ported onto CUDA? I have many MatVec calls in my code, >>>>> but I >>>>> > > specifically want to port just one call. >>>>> > > >>>>> > > >>>>> > > You can give that one matrix an options prefix to isolate it. >>>>> > > >>>>> > > Thanks, >>>>> > > >>>>> > > Matt >>>>> > > >>>>> > > Sincerely, >>>>> > > Swarnava >>>>> > > >>>>> > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang >>>>> > > >>>>> > >>>> >>> >>>>> > wrote: >>>>> > > >>>>> > > You can do that with command line options -mat_type >>>>> > aijcusparse >>>>> > > -vec_type cuda >>>>> > > >>>>> > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh >>>>> > > >>>>> > >> >>>>> wrote: >>>>> > > >>>>> > > Dear Petsc team, >>>>> > > >>>>> > > I had a query regarding using CUDA to accelerate >>>>> a matrix >>>>> > > vector product. >>>>> > > I have a sequential sparse matrix >>>>> (MATSEQBAIJ type). >>>>> > I want >>>>> > > to port a MatVec call onto GPUs. Is there any >>>>> > code/example I >>>>> > > can look at? >>>>> > > >>>>> > > Sincerely, >>>>> > > SG >>>>> > > >>>>> > > >>>>> > > >>>>> > > -- >>>>> > > What most experimenters take for granted before they begin >>>>> their >>>>> > > experiments is infinitely more interesting than any results >>>>> to which >>>>> > > their experiments lead. >>>>> > > -- Norbert Wiener >>>>> > > >>>>> > > https://www.cse.buffalo.edu/~knepley/ >>>>> > >>>>> > >>>> > > >>>>> > >>>>> > -- >>>>> > Chang Liu >>>>> > Staff Research Physicist >>>>> > +1 609 243 3438 >>>>> > cliu at pppl.gov >>>>> > Princeton Plasma Physics Laboratory >>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>>>> > >>>>> >>>>> -- >>>>> Chang Liu >>>>> Staff Research Physicist >>>>> +1 609 243 3438 >>>>> cliu at pppl.gov >>>>> Princeton Plasma Physics Laboratory >>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Mon Oct 18 23:28:46 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Mon, 18 Oct 2021 23:28:46 -0500 Subject: [petsc-users] [External] Re: MatVec on GPUs In-Reply-To: References: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov> Message-ID: On Mon, Oct 18, 2021 at 10:56 PM Swarnava Ghosh wrote: > I am trying the port parts of the following function on GPUs. 
Essentially, > the lines of codes between the two "TODO..." comments should be executed on > the device. Here is the function: > > PetscScalar CalculateSpectralNodesAndWeights(LSDFT_OBJ *pLsdft, int p, int > LIp) > { > > PetscInt N_qp; > N_qp = pLsdft->N_qp; > > int k; > PetscScalar *a, *b; > k=0; > > PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &a); > PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &b); > > /* > * TODO: COPY a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1, > pLsdft->LapPlusVeffOprloc, k,p,N_qp from HOST to DEVICE > * DO THE FOLLOWING OPERATIONS ON DEVICE > */ > > //zero out vectors > VecZeroEntries(pLsdft->Vk); > VecZeroEntries(pLsdft->Vkm1); > VecZeroEntries(pLsdft->Vkp1); > > VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES); > MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vkm1,pLsdft->Vk); > VecDot(pLsdft->Vkm1, pLsdft->Vk, &a[0]); > VecAXPY(pLsdft->Vk, -a[0], pLsdft->Vkm1); > VecNorm(pLsdft->Vk, NORM_2, &b[0]); > VecScale(pLsdft->Vk, 1.0 / b[0]); > > for (k = 0; k < N_qp; k++) { > MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vk,pLsdft->Vkp1); > VecDot(pLsdft->Vk, pLsdft->Vkp1, &a[k + 1]); > VecAXPY(pLsdft->Vkp1, -a[k + 1], pLsdft->Vk); > VecAXPY(pLsdft->Vkp1, -b[k], pLsdft->Vkm1); > VecCopy(pLsdft->Vk, pLsdft->Vkm1); > VecNorm(pLsdft->Vkp1, NORM_2, &b[k + 1]); > VecCopy(pLsdft->Vkp1, pLsdft->Vk); > VecScale(pLsdft->Vk, 1.0 / b[k + 1]); > } > > /* > * TODO: Copy back a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1, > pLsdft->LapPlusVeffOprloc, k,p,N_qp from DEVICE to HOST > */ > > /* > * Some operation with a, and b on HOST > * > */ > TridiagEigenVecSolve_NodesAndWeights(pLsdft, a, b, N_qp, LIp); // > operation on the host > > // free a,b > PetscFree(a); > PetscFree(b); > > return 0; > } > > If I just use the command line options to set vectors Vk,Vkp1 and Vkm1 as > cuda vectors and the matrix LapPlusVeffOprloc as aijcusparse, will the > lines of code between the two "TODO" comments be entirely executed on the > device? > yes, except VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES); which is done on CPU, by pulling down vector data from GPU to CPU and setting the value. Subsequent vector operations will push the updated vector data to GPU again. > > Sincerely, > Swarnava > > > On Mon, Oct 18, 2021 at 10:13 PM Swarnava Ghosh > wrote: > >> Thanks for the clarification, Junchao. >> >> Sincerely, >> Swarnava >> >> On Mon, Oct 18, 2021 at 10:08 PM Junchao Zhang >> wrote: >> >>> >>> >>> >>> On Mon, Oct 18, 2021 at 8:47 PM Swarnava Ghosh >>> wrote: >>> >>>> Hi Junchao, >>>> >>>> If I want to pass command line options as -mymat_mat_type aijcusparse, >>>> should it be MatSetOptionsPrefix(A,"mymat"); or >>>> MatSetOptionsPrefix(A,"mymat_"); ? Could you please clarify? >>>> >>> my fault, it should be MatSetOptionsPrefix(A,"mymat_"), as seen in >>> mat/tests/ex62.c >>> Thanks >>> >>> >>>> >>>> Sincerely, >>>> Swarnava >>>> >>>> On Mon, Oct 18, 2021 at 9:23 PM Junchao Zhang >>>> wrote: >>>> >>>>> MatSetOptionsPrefix(A,"mymat") >>>>> VecSetOptionsPrefix(v,"myvec") >>>>> >>>>> --Junchao Zhang >>>>> >>>>> >>>>> On Mon, Oct 18, 2021 at 8:04 PM Chang Liu wrote: >>>>> >>>>>> Hi Junchao, >>>>>> >>>>>> Thank you for your answer. I tried MatConvert and it works. I didn't >>>>>> make it before because I forgot to convert a vector from mpi to >>>>>> mpicuda >>>>>> previously. >>>>>> >>>>>> For vector, there is no VecConvert to use, so I have to do >>>>>> VecDuplicate, >>>>>> VecSetType and VecCopy. Is there an easier option? 
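One detail worth spelling out from the VecSetValue() remark above: the value is written through the host copy of the vector, so the data is pulled back to the CPU once and pushed to the GPU again by the next device operation; it is also good practice to finish the insertion with the assembly calls before using the vector. A short sketch (it mirrors the function quoted above but is not code from the thread):

  VecZeroEntries(pLsdft->Vkm1);
  VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES);  /* host-side write */
  VecAssemblyBegin(pLsdft->Vkm1);                    /* complete the insertion */
  VecAssemblyEnd(pLsdft->Vkm1);
  /* subsequent MatMult/VecDot/VecAXPY calls copy the updated data back to
     the GPU and then run there */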
>>>>>> >>>>> As Matt suggested, you could single out the matrix and vector with >>>>> options prefix and set their type on command line >>>>> >>>>> MatSetOptionsPrefix(A,"mymat"); >>>>> VecSetOptionsPrefix(v,"myvec"); >>>>> >>>>> Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda >>>>> >>>>> A simpler code is to have the vector type automatically set by >>>>> MatCreateVecs(A,&v,NULL) >>>>> >>>>> >>>>>> Chang >>>>>> >>>>>> On 10/18/21 5:23 PM, Junchao Zhang wrote: >>>>>> > >>>>>> > >>>>>> > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users >>>>>> > > wrote: >>>>>> > >>>>>> > Hi Matt, >>>>>> > >>>>>> > I have a related question. In my code I have many matrices and >>>>>> I only >>>>>> > want to have one living on GPU, the others still staying on CPU >>>>>> mem. >>>>>> > >>>>>> > I wonder if there is an easier way to copy a mpiaij matrix to >>>>>> > mpiaijcusparse (in other words, copy data to GPUs). I can think >>>>>> of >>>>>> > creating a new mpiaijcusparse matrix, and copying the data line >>>>>> by >>>>>> > line. >>>>>> > But I wonder if there is a better option. >>>>>> > >>>>>> > I have tried MatCopy and MatConvert but neither work. >>>>>> > >>>>>> > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? >>>>>> > >>>>>> > >>>>>> > Chang >>>>>> > >>>>>> > On 10/17/21 7:50 PM, Matthew Knepley wrote: >>>>>> > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh >>>>>> > >>>>>> > > >> >>>>>> wrote: >>>>>> > > >>>>>> > > Do I need convert the MATSEQBAIJ to a cuda matrix in >>>>>> code? >>>>>> > > >>>>>> > > >>>>>> > > You would need a call to MatSetFromOptions() to take that >>>>>> type >>>>>> > from the >>>>>> > > command line, and not have >>>>>> > > the type hard-coded in your application. It is generally a >>>>>> bad >>>>>> > idea to >>>>>> > > hard code the implementation type. >>>>>> > > >>>>>> > > If I do it from command line, then are the other MatVec >>>>>> calls are >>>>>> > > ported onto CUDA? I have many MatVec calls in my code, >>>>>> but I >>>>>> > > specifically want to port just one call. >>>>>> > > >>>>>> > > >>>>>> > > You can give that one matrix an options prefix to isolate it. >>>>>> > > >>>>>> > > Thanks, >>>>>> > > >>>>>> > > Matt >>>>>> > > >>>>>> > > Sincerely, >>>>>> > > Swarnava >>>>>> > > >>>>>> > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang >>>>>> > > >>>>> > >>>>>> > >>>>> >>> >>>>>> > wrote: >>>>>> > > >>>>>> > > You can do that with command line options -mat_type >>>>>> > aijcusparse >>>>>> > > -vec_type cuda >>>>>> > > >>>>>> > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh >>>>>> > > >>>>>> > >> >>>>>> wrote: >>>>>> > > >>>>>> > > Dear Petsc team, >>>>>> > > >>>>>> > > I had a query regarding using CUDA to accelerate >>>>>> a matrix >>>>>> > > vector product. >>>>>> > > I have a sequential sparse matrix >>>>>> (MATSEQBAIJ type). >>>>>> > I want >>>>>> > > to port a MatVec call onto GPUs. Is there any >>>>>> > code/example I >>>>>> > > can look at? >>>>>> > > >>>>>> > > Sincerely, >>>>>> > > SG >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > -- >>>>>> > > What most experimenters take for granted before they begin >>>>>> their >>>>>> > > experiments is infinitely more interesting than any results >>>>>> to which >>>>>> > > their experiments lead. 
>>>>>> > > -- Norbert Wiener >>>>>> > > >>>>>> > > https://www.cse.buffalo.edu/~knepley/ >>>>>> > >>>>>> > >>>>> > > >>>>>> > >>>>>> > -- >>>>>> > Chang Liu >>>>>> > Staff Research Physicist >>>>>> > +1 609 243 3438 >>>>>> > cliu at pppl.gov >>>>>> > Princeton Plasma Physics Laboratory >>>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>> > >>>>>> >>>>>> -- >>>>>> Chang Liu >>>>>> Staff Research Physicist >>>>>> +1 609 243 3438 >>>>>> cliu at pppl.gov >>>>>> Princeton Plasma Physics Laboratory >>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>> >>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Oct 19 05:12:22 2021 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 19 Oct 2021 06:12:22 -0400 Subject: [petsc-users] HDF5 timestepping in PETSc 3.16 In-Reply-To: <18b4e68c-2524-932d-9aa4-c1a28ea44158@auckland.ac.nz> References: <18b4e68c-2524-932d-9aa4-c1a28ea44158@auckland.ac.nz> Message-ID: On Mon, Oct 18, 2021 at 10:35 PM Adrian Croucher wrote: > Any response on this? > > This is a bit of a showstopper for me - I can't upgrade to PETSc 3.16 if > it does not allow my users to read their HDF5 files created using > earlier versions of PETSc. > > So far I can't see a workaround. Possibly the timestepping functions > need some kind of optional parameter to specify what the default > timestepping attribute should be, if it's not present in the file > (rather than just assuming it's false)? > I will fix it. I think I can do it tomorrow. Class just started this week do it is hectic :) I think you are right. We should always write the attribute, but have it be false. We should interpret a missing attribute as an old file. Thanks, Matt > Regards, Adrian > > On 10/14/21 4:19 PM, Adrian Croucher wrote: > > hi > > > > I am just testing out PETSc 3.16 and making the necessary changes to > > my code. Amongst other things I now have to add a > > PetscViewerHDF5PushTimestepping() call before starting to output > > time-dependent results to HDF5 using a PetscViewer. > > > > I now also have to add this call before reading in sets of previously > > computed time-dependent results (for restarting a simulation from the > > results of a previous run). > > > > The problem with this is that if I try to read in the results of any > > previous run, computed with an earlier version of PETSc (< 3.16), an > > error is raised because the time-dependent datasets in the file do not > > have the 'timestepping' attribute. > > > > Is there something else I need to do to make this work? > > > > - Adrian > > > -- > Dr Adrian Croucher > Senior Research Fellow > Department of Engineering Science > University of Auckland, New Zealand > email: a.croucher at auckland.ac.nz > tel: +64 (0)9 923 4611 > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From swarnava89 at gmail.com Tue Oct 19 20:17:30 2021 From: swarnava89 at gmail.com (Swarnava Ghosh) Date: Tue, 19 Oct 2021 21:17:30 -0400 Subject: [petsc-users] [External] Re: MatVec on GPUs In-Reply-To: References: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov> Message-ID: Thank you Junchao! Is it possible to determine how much time is being spent on data transfer from the CPU mem to the GPU mem from the log? 
************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- /ccsopen/home/swarnava/MiniApp_xl_cu/bin/sq on a named h49n15 with 4 processors, by swarnava Tue Oct 19 21:10:56 2021 Using Petsc Release Version 3.15.0, Mar 30, 2021 Max Max/Min Avg Total Time (sec): 1.172e+02 1.000 1.172e+02 Objects: 1.160e+02 1.000 1.160e+02 Flop: 5.832e+10 1.125 5.508e+10 2.203e+11 Flop/sec: 4.974e+08 1.125 4.698e+08 1.879e+09 MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00 MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00 MPI Reductions: 1.320e+02 1.000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flop and VecAXPY() for complex vectors of length N --> 8N flop Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total Count %Total Avg %Total Count %Total 0: Main Stage: 1.1725e+02 100.0% 2.2033e+11 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 1.140e+02 86.4% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flop: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent AvgLen: average message length (bytes) Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). 
%T - percent time in this phase %F - percent flop in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors) GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors) CpuToGpu Count: total number of CPU to GPU copies per processor CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor) GpuToCpu Count: total number of GPU to CPU copies per processor GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor) GPU %F: percent flops on GPU in this event ------------------------------------------------------------------------------------------------------------------------ Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F --------------------------------------------------------------------------------------------------------------------------------------------------------------- --- Event Stage 0: Main Stage BuildTwoSided 2 1.0 6.2501e-03145.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 2 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0 BuildTwoSidedF 2 1.0 6.2628e-03123.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 2 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0 VecDot 89991 1.1 3.4663e+00 1.2 1.67e+09 1.1 0.0e+00 0.0e+00 0.0e+00 3 3 0 0 0 3 3 0 0 0 1816 1841 0 0.00e+00 84992 6.80e-01 100 VecNorm 89991 1.1 5.5282e+00 1.2 1.67e+09 1.1 0.0e+00 0.0e+00 0.0e+00 4 3 0 0 0 4 3 0 0 0 1139 1148 0 0.00e+00 84992 6.80e-01 100 VecScale 89991 1.1 1.3902e+00 1.2 8.33e+08 1.1 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 2265 2343 84992 6.80e-01 0 0.00e+00 100 VecCopy 178201 1.1 2.9825e+00 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 VecSet 3589 1.1 1.0195e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 VecAXPY 179091 1.1 2.7456e+00 1.2 3.32e+09 1.1 0.0e+00 0.0e+00 0.0e+00 2 6 0 0 0 2 6 0 0 0 4564 4739 169142 1.35e+00 0 0.00e+00 100 VecCUDACopyTo 891 1.1 1.5322e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 842 6.23e+01 0 0.00e+00 0 VecCUDACopyFrom 891 1.1 1.5837e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 842 6.23e+01 0 DMCreateMat 5 1.0 7.3491e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00 1 0 0 0 5 1 0 0 0 6 0 0 0 0.00e+00 0 0.00e+00 0 SFSetGraph 5 1.0 3.5016e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 MatMult 89991 1.1 2.0423e+00 1.2 5.08e+10 1.1 0.0e+00 0.0e+00 0.0e+00 2 87 0 0 0 2 87 0 0 0 94039 105680 1683 2.00e+03 0 0.00e+00 100 MatCopy 891 1.1 1.3600e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 MatConvert 2 1.0 1.0489e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 MatScale 2 1.0 2.7950e-04 1.3 3.18e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 4530 0 0 0.00e+00 0 0.00e+00 0 MatAssemblyBegin 7 1.0 6.3768e-0368.8 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 2 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0 MatAssemblyEnd 7 1.0 7.9870e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 3 0 0 0 0 4 0 0 0 0.00e+00 0 0.00e+00 0 MatCUSPARSCopyTo 891 1.1 1.5229e-01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 842 1.93e+03 
0 0.00e+00 0 --------------------------------------------------------------------------------------------------------------------------------------------------------------- Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. Reports information only for process 0. --- Event Stage 0: Main Stage Vector 69 11 19112 0. Distributed Mesh 3 0 0 0. Index Set 12 10 187512 0. IS L to G Mapping 3 0 0 0. Star Forest Graph 11 0 0 0. Discrete System 3 0 0 0. Weak Form 3 0 0 0. Application Order 1 0 0 0. Matrix 8 0 0 0. Krylov Solver 1 0 0 0. Preconditioner 1 0 0 0. Viewer 1 0 0 0. ======================================================================================================================== Average time to get PetscTime(): 4.32e-08 Average time for MPI_Barrier(): 9.94e-07 Average time for zero size MPI_Send(): 4.20135e-05 Sincerely, SG On Tue, Oct 19, 2021 at 12:28 AM Junchao Zhang wrote: > > > > On Mon, Oct 18, 2021 at 10:56 PM Swarnava Ghosh > wrote: > >> I am trying the port parts of the following function on GPUs. >> Essentially, the lines of codes between the two "TODO..." comments should >> be executed on the device. Here is the function: >> >> PetscScalar CalculateSpectralNodesAndWeights(LSDFT_OBJ *pLsdft, int p, >> int LIp) >> { >> >> PetscInt N_qp; >> N_qp = pLsdft->N_qp; >> >> int k; >> PetscScalar *a, *b; >> k=0; >> >> PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &a); >> PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &b); >> >> /* >> * TODO: COPY a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1, >> pLsdft->LapPlusVeffOprloc, k,p,N_qp from HOST to DEVICE >> * DO THE FOLLOWING OPERATIONS ON DEVICE >> */ >> >> //zero out vectors >> VecZeroEntries(pLsdft->Vk); >> VecZeroEntries(pLsdft->Vkm1); >> VecZeroEntries(pLsdft->Vkp1); >> >> VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES); >> MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vkm1,pLsdft->Vk); >> VecDot(pLsdft->Vkm1, pLsdft->Vk, &a[0]); >> VecAXPY(pLsdft->Vk, -a[0], pLsdft->Vkm1); >> VecNorm(pLsdft->Vk, NORM_2, &b[0]); >> VecScale(pLsdft->Vk, 1.0 / b[0]); >> >> for (k = 0; k < N_qp; k++) { >> MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vk,pLsdft->Vkp1); >> VecDot(pLsdft->Vk, pLsdft->Vkp1, &a[k + 1]); >> VecAXPY(pLsdft->Vkp1, -a[k + 1], pLsdft->Vk); >> VecAXPY(pLsdft->Vkp1, -b[k], pLsdft->Vkm1); >> VecCopy(pLsdft->Vk, pLsdft->Vkm1); >> VecNorm(pLsdft->Vkp1, NORM_2, &b[k + 1]); >> VecCopy(pLsdft->Vkp1, pLsdft->Vk); >> VecScale(pLsdft->Vk, 1.0 / b[k + 1]); >> } >> >> /* >> * TODO: Copy back a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1, >> pLsdft->LapPlusVeffOprloc, k,p,N_qp from DEVICE to HOST >> */ >> >> /* >> * Some operation with a, and b on HOST >> * >> */ >> TridiagEigenVecSolve_NodesAndWeights(pLsdft, a, b, N_qp, LIp); // >> operation on the host >> >> // free a,b >> PetscFree(a); >> PetscFree(b); >> >> return 0; >> } >> >> If I just use the command line options to set vectors Vk,Vkp1 and Vkm1 as >> cuda vectors and the matrix LapPlusVeffOprloc as aijcusparse, will the >> lines of code between the two "TODO" comments be entirely executed on the >> device? >> > yes, except VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES); which is > done on CPU, by pulling down vector data from GPU to CPU and setting the > value. Subsequent vector operations will push the updated vector data to > GPU again. > > >> >> Sincerely, >> Swarnava >> >> >> On Mon, Oct 18, 2021 at 10:13 PM Swarnava Ghosh >> wrote: >> >>> Thanks for the clarification, Junchao. 
>>> >>> Sincerely, >>> Swarnava >>> >>> On Mon, Oct 18, 2021 at 10:08 PM Junchao Zhang >>> wrote: >>> >>>> >>>> >>>> >>>> On Mon, Oct 18, 2021 at 8:47 PM Swarnava Ghosh >>>> wrote: >>>> >>>>> Hi Junchao, >>>>> >>>>> If I want to pass command line options as -mymat_mat_type >>>>> aijcusparse, should it be MatSetOptionsPrefix(A,"mymat"); or >>>>> MatSetOptionsPrefix(A,"mymat_"); ? Could you please clarify? >>>>> >>>> my fault, it should be MatSetOptionsPrefix(A,"mymat_"), as seen in >>>> mat/tests/ex62.c >>>> Thanks >>>> >>>> >>>>> >>>>> Sincerely, >>>>> Swarnava >>>>> >>>>> On Mon, Oct 18, 2021 at 9:23 PM Junchao Zhang >>>>> wrote: >>>>> >>>>>> MatSetOptionsPrefix(A,"mymat") >>>>>> VecSetOptionsPrefix(v,"myvec") >>>>>> >>>>>> --Junchao Zhang >>>>>> >>>>>> >>>>>> On Mon, Oct 18, 2021 at 8:04 PM Chang Liu wrote: >>>>>> >>>>>>> Hi Junchao, >>>>>>> >>>>>>> Thank you for your answer. I tried MatConvert and it works. I didn't >>>>>>> make it before because I forgot to convert a vector from mpi to >>>>>>> mpicuda >>>>>>> previously. >>>>>>> >>>>>>> For vector, there is no VecConvert to use, so I have to do >>>>>>> VecDuplicate, >>>>>>> VecSetType and VecCopy. Is there an easier option? >>>>>>> >>>>>> As Matt suggested, you could single out the matrix and vector with >>>>>> options prefix and set their type on command line >>>>>> >>>>>> MatSetOptionsPrefix(A,"mymat"); >>>>>> VecSetOptionsPrefix(v,"myvec"); >>>>>> >>>>>> Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda >>>>>> >>>>>> A simpler code is to have the vector type automatically set by >>>>>> MatCreateVecs(A,&v,NULL) >>>>>> >>>>>> >>>>>>> Chang >>>>>>> >>>>>>> On 10/18/21 5:23 PM, Junchao Zhang wrote: >>>>>>> > >>>>>>> > >>>>>>> > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users >>>>>>> > > wrote: >>>>>>> > >>>>>>> > Hi Matt, >>>>>>> > >>>>>>> > I have a related question. In my code I have many matrices and >>>>>>> I only >>>>>>> > want to have one living on GPU, the others still staying on >>>>>>> CPU mem. >>>>>>> > >>>>>>> > I wonder if there is an easier way to copy a mpiaij matrix to >>>>>>> > mpiaijcusparse (in other words, copy data to GPUs). I can >>>>>>> think of >>>>>>> > creating a new mpiaijcusparse matrix, and copying the data >>>>>>> line by >>>>>>> > line. >>>>>>> > But I wonder if there is a better option. >>>>>>> > >>>>>>> > I have tried MatCopy and MatConvert but neither work. >>>>>>> > >>>>>>> > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? >>>>>>> > >>>>>>> > >>>>>>> > Chang >>>>>>> > >>>>>>> > On 10/17/21 7:50 PM, Matthew Knepley wrote: >>>>>>> > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh >>>>>>> > >>>>>>> > > >> >>>>>>> wrote: >>>>>>> > > >>>>>>> > > Do I need convert the MATSEQBAIJ to a cuda matrix in >>>>>>> code? >>>>>>> > > >>>>>>> > > >>>>>>> > > You would need a call to MatSetFromOptions() to take that >>>>>>> type >>>>>>> > from the >>>>>>> > > command line, and not have >>>>>>> > > the type hard-coded in your application. It is generally a >>>>>>> bad >>>>>>> > idea to >>>>>>> > > hard code the implementation type. >>>>>>> > > >>>>>>> > > If I do it from command line, then are the other MatVec >>>>>>> calls are >>>>>>> > > ported onto CUDA? I have many MatVec calls in my code, >>>>>>> but I >>>>>>> > > specifically want to port just one call. >>>>>>> > > >>>>>>> > > >>>>>>> > > You can give that one matrix an options prefix to isolate >>>>>>> it. 
>>>>>>> > > >>>>>>> > > Thanks, >>>>>>> > > >>>>>>> > > Matt >>>>>>> > > >>>>>>> > > Sincerely, >>>>>>> > > Swarnava >>>>>>> > > >>>>>>> > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang >>>>>>> > > >>>>>> junchao.zhang at gmail.com> >>>>>>> > >>>>>> junchao.zhang at gmail.com>>> >>>>>>> > wrote: >>>>>>> > > >>>>>>> > > You can do that with command line options -mat_type >>>>>>> > aijcusparse >>>>>>> > > -vec_type cuda >>>>>>> > > >>>>>>> > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh >>>>>>> > > >>>>>>> > >> >>>>>>> wrote: >>>>>>> > > >>>>>>> > > Dear Petsc team, >>>>>>> > > >>>>>>> > > I had a query regarding using CUDA to >>>>>>> accelerate a matrix >>>>>>> > > vector product. >>>>>>> > > I have a sequential sparse matrix >>>>>>> (MATSEQBAIJ type). >>>>>>> > I want >>>>>>> > > to port a MatVec call onto GPUs. Is there any >>>>>>> > code/example I >>>>>>> > > can look at? >>>>>>> > > >>>>>>> > > Sincerely, >>>>>>> > > SG >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > -- >>>>>>> > > What most experimenters take for granted before they begin >>>>>>> their >>>>>>> > > experiments is infinitely more interesting than any results >>>>>>> to which >>>>>>> > > their experiments lead. >>>>>>> > > -- Norbert Wiener >>>>>>> > > >>>>>>> > > https://www.cse.buffalo.edu/~knepley/ >>>>>>> > >>>>>>> > >>>>>> > > >>>>>>> > >>>>>>> > -- >>>>>>> > Chang Liu >>>>>>> > Staff Research Physicist >>>>>>> > +1 609 243 3438 >>>>>>> > cliu at pppl.gov >>>>>>> > Princeton Plasma Physics Laboratory >>>>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>> > >>>>>>> >>>>>>> -- >>>>>>> Chang Liu >>>>>>> Staff Research Physicist >>>>>>> +1 609 243 3438 >>>>>>> cliu at pppl.gov >>>>>>> Princeton Plasma Physics Laboratory >>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>> >>>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Oct 19 20:34:28 2021 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 19 Oct 2021 21:34:28 -0400 Subject: [petsc-users] [External] Re: MatVec on GPUs In-Reply-To: References: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov> Message-ID: On Tue, Oct 19, 2021 at 9:18 PM Swarnava Ghosh wrote: > Thank you Junchao! Is it possible to determine how much time is being > spent on data transfer from the CPU mem to the GPU mem from the log? > It looks like VecCUDACopyTo 891 1.1 1.5322e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 842 6.23e+01 0 0.00e+00 0 VecCUDACopyFrom 891 1.1 1.5837e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 842 6.23e+01 0 MatCUSPARSCopyTo 891 1.1 1.5229e-01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 842 1.93e+03 0 0.00e+00 0 Thanks, Matt > > ************************************************************************************************************************ > > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript -r > -fCourier9' to print this document *** > > > ************************************************************************************************************************ > > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > > /ccsopen/home/swarnava/MiniApp_xl_cu/bin/sq on a named h49n15 with 4 > processors, by swarnava Tue Oct 19 21:10:56 2021 > > Using Petsc Release Version 3.15.0, Mar 30, 2021 > > > Max Max/Min Avg Total > > Time (sec): 1.172e+02 1.000 1.172e+02 > > Objects: 1.160e+02 1.000 1.160e+02 > > Flop: 5.832e+10 1.125 5.508e+10 2.203e+11 > > Flop/sec: 4.974e+08 1.125 4.698e+08 1.879e+09 > > MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00 > > MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00 > > MPI Reductions: 1.320e+02 1.000 > > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > > e.g., VecAXPY() for real vectors of length N > --> 2N flop > > and VecAXPY() for complex vectors of length N > --> 8N flop > > > Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages > --- -- Message Lengths -- -- Reductions -- > > Avg %Total Avg %Total Count %Total > Avg %Total Count %Total > > 0: Main Stage: 1.1725e+02 100.0% 2.2033e+11 100.0% 0.000e+00 > 0.0% 0.000e+00 0.0% 1.140e+02 86.4% > > > > ------------------------------------------------------------------------------------------------------------------------ > > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > > Phase summary info: > > Count: number of times phase was executed > > Time and Flop: Max - maximum over all processors > > Ratio - ratio of maximum to minimum over all processors > > Mess: number of messages sent > > AvgLen: average message length (bytes) > > Reduct: number of global reductions > > Global: entire computation > > Stage: stages of a computation. Set stages with PetscLogStagePush() > and PetscLogStagePop(). 
> > %T - percent time in this phase %F - percent flop in this > phase > > %M - percent messages in this phase %L - percent message > lengths in this phase > > %R - percent reductions in this phase > > Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time > over all processors) > > GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU > time over all processors) > > CpuToGpu Count: total number of CPU to GPU copies per processor > > CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per > processor) > > GpuToCpu Count: total number of GPU to CPU copies per processor > > GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per > processor) > > GPU %F: percent flops on GPU in this event > > > ------------------------------------------------------------------------------------------------------------------------ > > Event Count Time (sec) Flop > --- Global --- --- Stage ---- Total GPU - CpuToGpu - - > GpuToCpu - GPU > > Max Ratio Max Ratio Max Ratio Mess AvgLen > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count > Size %F > > > --------------------------------------------------------------------------------------------------------------------------------------------------------------- > > > --- Event Stage 0: Main Stage > > > BuildTwoSided 2 1.0 6.2501e-03145.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 2 0 0 0 0 2 0 0 0 0.00e+00 0 > 0.00e+00 0 > > BuildTwoSidedF 2 1.0 6.2628e-03123.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 2 0 0 0 0 2 0 0 0 0.00e+00 0 > 0.00e+00 0 > > VecDot 89991 1.1 3.4663e+00 1.2 1.67e+09 1.1 0.0e+00 0.0e+00 > 0.0e+00 3 3 0 0 0 3 3 0 0 0 1816 1841 0 0.00e+00 > 84992 6.80e-01 100 > > VecNorm 89991 1.1 5.5282e+00 1.2 1.67e+09 1.1 0.0e+00 0.0e+00 > 0.0e+00 4 3 0 0 0 4 3 0 0 0 1139 1148 0 0.00e+00 > 84992 6.80e-01 100 > > VecScale 89991 1.1 1.3902e+00 1.2 8.33e+08 1.1 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 2265 2343 84992 6.80e-01 0 > 0.00e+00 100 > > VecCopy 178201 1.1 2.9825e+00 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 0 0 0.00e+00 0 > 0.00e+00 0 > > VecSet 3589 1.1 1.0195e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 > 0.00e+00 0 > > VecAXPY 179091 1.1 2.7456e+00 1.2 3.32e+09 1.1 0.0e+00 0.0e+00 > 0.0e+00 2 6 0 0 0 2 6 0 0 0 4564 4739 169142 1.35e+00 0 > 0.00e+00 100 > > VecCUDACopyTo 891 1.1 1.5322e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 842 6.23e+01 0 > 0.00e+00 0 > > VecCUDACopyFrom 891 1.1 1.5837e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 842 > 6.23e+01 0 > > DMCreateMat 5 1.0 7.3491e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 7.0e+00 1 0 0 0 5 1 0 0 0 6 0 0 0 0.00e+00 0 > 0.00e+00 0 > > SFSetGraph 5 1.0 3.5016e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 > 0.00e+00 0 > > MatMult 89991 1.1 2.0423e+00 1.2 5.08e+10 1.1 0.0e+00 0.0e+00 > 0.0e+00 2 87 0 0 0 2 87 0 0 0 94039 105680 1683 2.00e+03 0 > 0.00e+00 100 > > MatCopy 891 1.1 1.3600e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 > 0.00e+00 0 > > MatConvert 2 1.0 1.0489e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 > 0.00e+00 0 > > MatScale 2 1.0 2.7950e-04 1.3 3.18e+05 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 4530 0 0 0.00e+00 0 > 0.00e+00 0 > > MatAssemblyBegin 7 1.0 6.3768e-0368.8 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 2 0 0 0 0 2 0 0 0 0.00e+00 0 > 0.00e+00 0 > > 
MatAssemblyEnd 7 1.0 7.9870e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 4.0e+00 0 0 0 0 3 0 0 0 0 4 0 0 0 0.00e+00 0 > 0.00e+00 0 > > MatCUSPARSCopyTo 891 1.1 1.5229e-01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 842 1.93e+03 0 > 0.00e+00 0 > > > --------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Memory usage is given in bytes: > > > Object Type Creations Destructions Memory Descendants' > Mem. > > Reports information only for process 0. > > > --- Event Stage 0: Main Stage > > > Vector 69 11 19112 0. > > Distributed Mesh 3 0 0 0. > > Index Set 12 10 187512 0. > > IS L to G Mapping 3 0 0 0. > > Star Forest Graph 11 0 0 0. > > Discrete System 3 0 0 0. > > Weak Form 3 0 0 0. > > Application Order 1 0 0 0. > > Matrix 8 0 0 0. > > Krylov Solver 1 0 0 0. > > Preconditioner 1 0 0 0. > > Viewer 1 0 0 0. > > > ======================================================================================================================== > > Average time to get PetscTime(): 4.32e-08 > > Average time for MPI_Barrier(): 9.94e-07 > > Average time for zero size MPI_Send(): 4.20135e-05 > > > Sincerely, > > SG > > On Tue, Oct 19, 2021 at 12:28 AM Junchao Zhang > wrote: > >> >> >> >> On Mon, Oct 18, 2021 at 10:56 PM Swarnava Ghosh >> wrote: >> >>> I am trying the port parts of the following function on GPUs. >>> Essentially, the lines of codes between the two "TODO..." comments should >>> be executed on the device. Here is the function: >>> >>> PetscScalar CalculateSpectralNodesAndWeights(LSDFT_OBJ *pLsdft, int p, >>> int LIp) >>> { >>> >>> PetscInt N_qp; >>> N_qp = pLsdft->N_qp; >>> >>> int k; >>> PetscScalar *a, *b; >>> k=0; >>> >>> PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &a); >>> PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &b); >>> >>> /* >>> * TODO: COPY a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1, >>> pLsdft->LapPlusVeffOprloc, k,p,N_qp from HOST to DEVICE >>> * DO THE FOLLOWING OPERATIONS ON DEVICE >>> */ >>> >>> //zero out vectors >>> VecZeroEntries(pLsdft->Vk); >>> VecZeroEntries(pLsdft->Vkm1); >>> VecZeroEntries(pLsdft->Vkp1); >>> >>> VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES); >>> MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vkm1,pLsdft->Vk); >>> VecDot(pLsdft->Vkm1, pLsdft->Vk, &a[0]); >>> VecAXPY(pLsdft->Vk, -a[0], pLsdft->Vkm1); >>> VecNorm(pLsdft->Vk, NORM_2, &b[0]); >>> VecScale(pLsdft->Vk, 1.0 / b[0]); >>> >>> for (k = 0; k < N_qp; k++) { >>> MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vk,pLsdft->Vkp1); >>> VecDot(pLsdft->Vk, pLsdft->Vkp1, &a[k + 1]); >>> VecAXPY(pLsdft->Vkp1, -a[k + 1], pLsdft->Vk); >>> VecAXPY(pLsdft->Vkp1, -b[k], pLsdft->Vkm1); >>> VecCopy(pLsdft->Vk, pLsdft->Vkm1); >>> VecNorm(pLsdft->Vkp1, NORM_2, &b[k + 1]); >>> VecCopy(pLsdft->Vkp1, pLsdft->Vk); >>> VecScale(pLsdft->Vk, 1.0 / b[k + 1]); >>> } >>> >>> /* >>> * TODO: Copy back a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1, >>> pLsdft->LapPlusVeffOprloc, k,p,N_qp from DEVICE to HOST >>> */ >>> >>> /* >>> * Some operation with a, and b on HOST >>> * >>> */ >>> TridiagEigenVecSolve_NodesAndWeights(pLsdft, a, b, N_qp, LIp); // >>> operation on the host >>> >>> // free a,b >>> PetscFree(a); >>> PetscFree(b); >>> >>> return 0; >>> } >>> >>> If I just use the command line options to set vectors Vk,Vkp1 and Vkm1 >>> as cuda vectors and the matrix LapPlusVeffOprloc as aijcusparse, will the >>> lines of code between the two "TODO" comments be entirely executed on the >>> device? 
>>> >> yes, except VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES); which is >> done on CPU, by pulling down vector data from GPU to CPU and setting the >> value. Subsequent vector operations will push the updated vector data to >> GPU again. >> >> >>> >>> Sincerely, >>> Swarnava >>> >>> >>> On Mon, Oct 18, 2021 at 10:13 PM Swarnava Ghosh >>> wrote: >>> >>>> Thanks for the clarification, Junchao. >>>> >>>> Sincerely, >>>> Swarnava >>>> >>>> On Mon, Oct 18, 2021 at 10:08 PM Junchao Zhang >>>> wrote: >>>> >>>>> >>>>> >>>>> >>>>> On Mon, Oct 18, 2021 at 8:47 PM Swarnava Ghosh >>>>> wrote: >>>>> >>>>>> Hi Junchao, >>>>>> >>>>>> If I want to pass command line options as -mymat_mat_type >>>>>> aijcusparse, should it be MatSetOptionsPrefix(A,"mymat"); or >>>>>> MatSetOptionsPrefix(A,"mymat_"); ? Could you please clarify? >>>>>> >>>>> my fault, it should be MatSetOptionsPrefix(A,"mymat_"), as seen in >>>>> mat/tests/ex62.c >>>>> Thanks >>>>> >>>>> >>>>>> >>>>>> Sincerely, >>>>>> Swarnava >>>>>> >>>>>> On Mon, Oct 18, 2021 at 9:23 PM Junchao Zhang < >>>>>> junchao.zhang at gmail.com> wrote: >>>>>> >>>>>>> MatSetOptionsPrefix(A,"mymat") >>>>>>> VecSetOptionsPrefix(v,"myvec") >>>>>>> >>>>>>> --Junchao Zhang >>>>>>> >>>>>>> >>>>>>> On Mon, Oct 18, 2021 at 8:04 PM Chang Liu wrote: >>>>>>> >>>>>>>> Hi Junchao, >>>>>>>> >>>>>>>> Thank you for your answer. I tried MatConvert and it works. I >>>>>>>> didn't >>>>>>>> make it before because I forgot to convert a vector from mpi to >>>>>>>> mpicuda >>>>>>>> previously. >>>>>>>> >>>>>>>> For vector, there is no VecConvert to use, so I have to do >>>>>>>> VecDuplicate, >>>>>>>> VecSetType and VecCopy. Is there an easier option? >>>>>>>> >>>>>>> As Matt suggested, you could single out the matrix and vector with >>>>>>> options prefix and set their type on command line >>>>>>> >>>>>>> MatSetOptionsPrefix(A,"mymat"); >>>>>>> VecSetOptionsPrefix(v,"myvec"); >>>>>>> >>>>>>> Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda >>>>>>> >>>>>>> A simpler code is to have the vector type automatically set by >>>>>>> MatCreateVecs(A,&v,NULL) >>>>>>> >>>>>>> >>>>>>>> Chang >>>>>>>> >>>>>>>> On 10/18/21 5:23 PM, Junchao Zhang wrote: >>>>>>>> > >>>>>>>> > >>>>>>>> > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users >>>>>>>> > > wrote: >>>>>>>> > >>>>>>>> > Hi Matt, >>>>>>>> > >>>>>>>> > I have a related question. In my code I have many matrices >>>>>>>> and I only >>>>>>>> > want to have one living on GPU, the others still staying on >>>>>>>> CPU mem. >>>>>>>> > >>>>>>>> > I wonder if there is an easier way to copy a mpiaij matrix to >>>>>>>> > mpiaijcusparse (in other words, copy data to GPUs). I can >>>>>>>> think of >>>>>>>> > creating a new mpiaijcusparse matrix, and copying the data >>>>>>>> line by >>>>>>>> > line. >>>>>>>> > But I wonder if there is a better option. >>>>>>>> > >>>>>>>> > I have tried MatCopy and MatConvert but neither work. >>>>>>>> > >>>>>>>> > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? >>>>>>>> > >>>>>>>> > >>>>>>>> > Chang >>>>>>>> > >>>>>>>> > On 10/17/21 7:50 PM, Matthew Knepley wrote: >>>>>>>> > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh >>>>>>>> > >>>>>>>> > > >> >>>>>>>> wrote: >>>>>>>> > > >>>>>>>> > > Do I need convert the MATSEQBAIJ to a cuda matrix in >>>>>>>> code? >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > You would need a call to MatSetFromOptions() to take that >>>>>>>> type >>>>>>>> > from the >>>>>>>> > > command line, and not have >>>>>>>> > > the type hard-coded in your application. 
It is generally a >>>>>>>> bad >>>>>>>> > idea to >>>>>>>> > > hard code the implementation type. >>>>>>>> > > >>>>>>>> > > If I do it from command line, then are the other >>>>>>>> MatVec calls are >>>>>>>> > > ported onto CUDA? I have many MatVec calls in my code, >>>>>>>> but I >>>>>>>> > > specifically want to port just one call. >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > You can give that one matrix an options prefix to isolate >>>>>>>> it. >>>>>>>> > > >>>>>>>> > > Thanks, >>>>>>>> > > >>>>>>>> > > Matt >>>>>>>> > > >>>>>>>> > > Sincerely, >>>>>>>> > > Swarnava >>>>>>>> > > >>>>>>>> > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang >>>>>>>> > > >>>>>>> junchao.zhang at gmail.com> >>>>>>>> > >>>>>>> junchao.zhang at gmail.com>>> >>>>>>>> > wrote: >>>>>>>> > > >>>>>>>> > > You can do that with command line options -mat_type >>>>>>>> > aijcusparse >>>>>>>> > > -vec_type cuda >>>>>>>> > > >>>>>>>> > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh >>>>>>>> > > >>>>>>> > >>>>>>>> > >> >>>>>>>> wrote: >>>>>>>> > > >>>>>>>> > > Dear Petsc team, >>>>>>>> > > >>>>>>>> > > I had a query regarding using CUDA to >>>>>>>> accelerate a matrix >>>>>>>> > > vector product. >>>>>>>> > > I have a sequential sparse matrix >>>>>>>> (MATSEQBAIJ type). >>>>>>>> > I want >>>>>>>> > > to port a MatVec call onto GPUs. Is there any >>>>>>>> > code/example I >>>>>>>> > > can look at? >>>>>>>> > > >>>>>>>> > > Sincerely, >>>>>>>> > > SG >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > -- >>>>>>>> > > What most experimenters take for granted before they begin >>>>>>>> their >>>>>>>> > > experiments is infinitely more interesting than any >>>>>>>> results to which >>>>>>>> > > their experiments lead. >>>>>>>> > > -- Norbert Wiener >>>>>>>> > > >>>>>>>> > > https://www.cse.buffalo.edu/~knepley/ >>>>>>>> > >>>>>>>> > >>>>>>> > > >>>>>>>> > >>>>>>>> > -- >>>>>>>> > Chang Liu >>>>>>>> > Staff Research Physicist >>>>>>>> > +1 609 243 3438 >>>>>>>> > cliu at pppl.gov >>>>>>>> > Princeton Plasma Physics Laboratory >>>>>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>> > >>>>>>>> >>>>>>>> -- >>>>>>>> Chang Liu >>>>>>>> Staff Research Physicist >>>>>>>> +1 609 243 3438 >>>>>>>> cliu at pppl.gov >>>>>>>> Princeton Plasma Physics Laboratory >>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>> >>>>>>> -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From swarnava89 at gmail.com Tue Oct 19 21:01:14 2021 From: swarnava89 at gmail.com (Swarnava Ghosh) Date: Tue, 19 Oct 2021 22:01:14 -0400 Subject: [petsc-users] [External] Re: MatVec on GPUs In-Reply-To: References: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov> Message-ID: Thanks, Matt! Sincerely, SG On Tue, Oct 19, 2021 at 9:34 PM Matthew Knepley wrote: > On Tue, Oct 19, 2021 at 9:18 PM Swarnava Ghosh > wrote: > >> Thank you Junchao! Is it possible to determine how much time is being >> spent on data transfer from the CPU mem to the GPU mem from the log? 
>> > > It looks like > > VecCUDACopyTo 891 1.1 1.5322e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 842 6.23e+01 0 > 0.00e+00 0 > > VecCUDACopyFrom 891 1.1 1.5837e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 842 > 6.23e+01 0 > > MatCUSPARSCopyTo 891 1.1 1.5229e-01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 842 1.93e+03 0 > 0.00e+00 0 > > Thanks, > > Matt > > >> >> ************************************************************************************************************************ >> >> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r >> -fCourier9' to print this document *** >> >> >> ************************************************************************************************************************ >> >> >> ---------------------------------------------- PETSc Performance Summary: >> ---------------------------------------------- >> >> >> /ccsopen/home/swarnava/MiniApp_xl_cu/bin/sq on a named h49n15 with 4 >> processors, by swarnava Tue Oct 19 21:10:56 2021 >> >> Using Petsc Release Version 3.15.0, Mar 30, 2021 >> >> >> Max Max/Min Avg Total >> >> Time (sec): 1.172e+02 1.000 1.172e+02 >> >> Objects: 1.160e+02 1.000 1.160e+02 >> >> Flop: 5.832e+10 1.125 5.508e+10 2.203e+11 >> >> Flop/sec: 4.974e+08 1.125 4.698e+08 1.879e+09 >> >> MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00 >> >> MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00 >> >> MPI Reductions: 1.320e+02 1.000 >> >> >> Flop counting convention: 1 flop = 1 real number operation of type >> (multiply/divide/add/subtract) >> >> e.g., VecAXPY() for real vectors of length N >> --> 2N flop >> >> and VecAXPY() for complex vectors of length >> N --> 8N flop >> >> >> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages >> --- -- Message Lengths -- -- Reductions -- >> >> Avg %Total Avg %Total Count %Total >> Avg %Total Count %Total >> >> 0: Main Stage: 1.1725e+02 100.0% 2.2033e+11 100.0% 0.000e+00 >> 0.0% 0.000e+00 0.0% 1.140e+02 86.4% >> >> >> >> ------------------------------------------------------------------------------------------------------------------------ >> >> See the 'Profiling' chapter of the users' manual for details on >> interpreting output. >> >> Phase summary info: >> >> Count: number of times phase was executed >> >> Time and Flop: Max - maximum over all processors >> >> Ratio - ratio of maximum to minimum over all processors >> >> Mess: number of messages sent >> >> AvgLen: average message length (bytes) >> >> Reduct: number of global reductions >> >> Global: entire computation >> >> Stage: stages of a computation. Set stages with PetscLogStagePush() >> and PetscLogStagePop(). 
>> >> %T - percent time in this phase %F - percent flop in this >> phase >> >> %M - percent messages in this phase %L - percent message >> lengths in this phase >> >> %R - percent reductions in this phase >> >> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time >> over all processors) >> >> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max >> GPU time over all processors) >> >> CpuToGpu Count: total number of CPU to GPU copies per processor >> >> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per >> processor) >> >> GpuToCpu Count: total number of GPU to CPU copies per processor >> >> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per >> processor) >> >> GPU %F: percent flops on GPU in this event >> >> >> ------------------------------------------------------------------------------------------------------------------------ >> >> Event Count Time (sec) Flop >> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - >> GpuToCpu - GPU >> >> Max Ratio Max Ratio Max Ratio Mess AvgLen >> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count >> Size %F >> >> >> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >> >> >> --- Event Stage 0: Main Stage >> >> >> BuildTwoSided 2 1.0 6.2501e-03145.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 2.0e+00 0 0 0 0 2 0 0 0 0 2 0 0 0 >> 0.00e+00 0 0.00e+00 0 >> >> BuildTwoSidedF 2 1.0 6.2628e-03123.2 0.00e+00 0.0 0.0e+00 >> 0.0e+00 2.0e+00 0 0 0 0 2 0 0 0 0 2 0 0 0 >> 0.00e+00 0 0.00e+00 0 >> >> VecDot 89991 1.1 3.4663e+00 1.2 1.67e+09 1.1 0.0e+00 0.0e+00 >> 0.0e+00 3 3 0 0 0 3 3 0 0 0 1816 1841 0 0.00e+00 >> 84992 6.80e-01 100 >> >> VecNorm 89991 1.1 5.5282e+00 1.2 1.67e+09 1.1 0.0e+00 0.0e+00 >> 0.0e+00 4 3 0 0 0 4 3 0 0 0 1139 1148 0 0.00e+00 >> 84992 6.80e-01 100 >> >> VecScale 89991 1.1 1.3902e+00 1.2 8.33e+08 1.1 0.0e+00 0.0e+00 >> 0.0e+00 1 1 0 0 0 1 1 0 0 0 2265 2343 84992 6.80e-01 0 >> 0.00e+00 100 >> >> VecCopy 178201 1.1 2.9825e+00 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> >> VecSet 3589 1.1 1.0195e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> >> VecAXPY 179091 1.1 2.7456e+00 1.2 3.32e+09 1.1 0.0e+00 0.0e+00 >> 0.0e+00 2 6 0 0 0 2 6 0 0 0 4564 4739 169142 1.35e+00 >> 0 0.00e+00 100 >> >> VecCUDACopyTo 891 1.1 1.5322e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 842 6.23e+01 0 >> 0.00e+00 0 >> >> VecCUDACopyFrom 891 1.1 1.5837e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 842 >> 6.23e+01 0 >> >> DMCreateMat 5 1.0 7.3491e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 7.0e+00 1 0 0 0 5 1 0 0 0 6 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> >> SFSetGraph 5 1.0 3.5016e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> >> MatMult 89991 1.1 2.0423e+00 1.2 5.08e+10 1.1 0.0e+00 0.0e+00 >> 0.0e+00 2 87 0 0 0 2 87 0 0 0 94039 105680 1683 2.00e+03 0 >> 0.00e+00 100 >> >> MatCopy 891 1.1 1.3600e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> >> MatConvert 2 1.0 1.0489e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> >> MatScale 2 1.0 2.7950e-04 1.3 3.18e+05 1.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 4530 0 0 0.00e+00 0 >> 0.00e+00 0 >> >> MatAssemblyBegin 7 1.0 
6.3768e-0368.8 0.00e+00 0.0 0.0e+00 0.0e+00 >> 2.0e+00 0 0 0 0 2 0 0 0 0 2 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> >> MatAssemblyEnd 7 1.0 7.9870e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 4.0e+00 0 0 0 0 3 0 0 0 0 4 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> >> MatCUSPARSCopyTo 891 1.1 1.5229e-01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 842 1.93e+03 0 >> 0.00e+00 0 >> >> >> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >> >> Memory usage is given in bytes: >> >> >> Object Type Creations Destructions Memory Descendants' >> Mem. >> >> Reports information only for process 0. >> >> >> --- Event Stage 0: Main Stage >> >> >> Vector 69 11 19112 0. >> >> Distributed Mesh 3 0 0 0. >> >> Index Set 12 10 187512 0. >> >> IS L to G Mapping 3 0 0 0. >> >> Star Forest Graph 11 0 0 0. >> >> Discrete System 3 0 0 0. >> >> Weak Form 3 0 0 0. >> >> Application Order 1 0 0 0. >> >> Matrix 8 0 0 0. >> >> Krylov Solver 1 0 0 0. >> >> Preconditioner 1 0 0 0. >> >> Viewer 1 0 0 0. >> >> >> ======================================================================================================================== >> >> Average time to get PetscTime(): 4.32e-08 >> >> Average time for MPI_Barrier(): 9.94e-07 >> >> Average time for zero size MPI_Send(): 4.20135e-05 >> >> >> Sincerely, >> >> SG >> >> On Tue, Oct 19, 2021 at 12:28 AM Junchao Zhang >> wrote: >> >>> >>> >>> >>> On Mon, Oct 18, 2021 at 10:56 PM Swarnava Ghosh >>> wrote: >>> >>>> I am trying the port parts of the following function on GPUs. >>>> Essentially, the lines of codes between the two "TODO..." comments should >>>> be executed on the device. Here is the function: >>>> >>>> PetscScalar CalculateSpectralNodesAndWeights(LSDFT_OBJ *pLsdft, int p, >>>> int LIp) >>>> { >>>> >>>> PetscInt N_qp; >>>> N_qp = pLsdft->N_qp; >>>> >>>> int k; >>>> PetscScalar *a, *b; >>>> k=0; >>>> >>>> PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &a); >>>> PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &b); >>>> >>>> /* >>>> * TODO: COPY a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1, >>>> pLsdft->LapPlusVeffOprloc, k,p,N_qp from HOST to DEVICE >>>> * DO THE FOLLOWING OPERATIONS ON DEVICE >>>> */ >>>> >>>> //zero out vectors >>>> VecZeroEntries(pLsdft->Vk); >>>> VecZeroEntries(pLsdft->Vkm1); >>>> VecZeroEntries(pLsdft->Vkp1); >>>> >>>> VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES); >>>> MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vkm1,pLsdft->Vk); >>>> VecDot(pLsdft->Vkm1, pLsdft->Vk, &a[0]); >>>> VecAXPY(pLsdft->Vk, -a[0], pLsdft->Vkm1); >>>> VecNorm(pLsdft->Vk, NORM_2, &b[0]); >>>> VecScale(pLsdft->Vk, 1.0 / b[0]); >>>> >>>> for (k = 0; k < N_qp; k++) { >>>> MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vk,pLsdft->Vkp1); >>>> VecDot(pLsdft->Vk, pLsdft->Vkp1, &a[k + 1]); >>>> VecAXPY(pLsdft->Vkp1, -a[k + 1], pLsdft->Vk); >>>> VecAXPY(pLsdft->Vkp1, -b[k], pLsdft->Vkm1); >>>> VecCopy(pLsdft->Vk, pLsdft->Vkm1); >>>> VecNorm(pLsdft->Vkp1, NORM_2, &b[k + 1]); >>>> VecCopy(pLsdft->Vkp1, pLsdft->Vk); >>>> VecScale(pLsdft->Vk, 1.0 / b[k + 1]); >>>> } >>>> >>>> /* >>>> * TODO: Copy back a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1, >>>> pLsdft->LapPlusVeffOprloc, k,p,N_qp from DEVICE to HOST >>>> */ >>>> >>>> /* >>>> * Some operation with a, and b on HOST >>>> * >>>> */ >>>> TridiagEigenVecSolve_NodesAndWeights(pLsdft, a, b, N_qp, LIp); // >>>> operation on the host >>>> >>>> // free a,b >>>> PetscFree(a); >>>> PetscFree(b); >>>> >>>> return 0; >>>> } >>>> >>>> If I just use the 
command line options to set vectors Vk,Vkp1 and Vkm1 >>>> as cuda vectors and the matrix LapPlusVeffOprloc as aijcusparse, will the >>>> lines of code between the two "TODO" comments be entirely executed on the >>>> device? >>>> >>> yes, except VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES); which is >>> done on CPU, by pulling down vector data from GPU to CPU and setting the >>> value. Subsequent vector operations will push the updated vector data to >>> GPU again. >>> >>> >>>> >>>> Sincerely, >>>> Swarnava >>>> >>>> >>>> On Mon, Oct 18, 2021 at 10:13 PM Swarnava Ghosh >>>> wrote: >>>> >>>>> Thanks for the clarification, Junchao. >>>>> >>>>> Sincerely, >>>>> Swarnava >>>>> >>>>> On Mon, Oct 18, 2021 at 10:08 PM Junchao Zhang < >>>>> junchao.zhang at gmail.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Mon, Oct 18, 2021 at 8:47 PM Swarnava Ghosh >>>>>> wrote: >>>>>> >>>>>>> Hi Junchao, >>>>>>> >>>>>>> If I want to pass command line options as -mymat_mat_type >>>>>>> aijcusparse, should it be MatSetOptionsPrefix(A,"mymat"); or >>>>>>> MatSetOptionsPrefix(A,"mymat_"); ? Could you please clarify? >>>>>>> >>>>>> my fault, it should be MatSetOptionsPrefix(A,"mymat_"), as seen in >>>>>> mat/tests/ex62.c >>>>>> Thanks >>>>>> >>>>>> >>>>>>> >>>>>>> Sincerely, >>>>>>> Swarnava >>>>>>> >>>>>>> On Mon, Oct 18, 2021 at 9:23 PM Junchao Zhang < >>>>>>> junchao.zhang at gmail.com> wrote: >>>>>>> >>>>>>>> MatSetOptionsPrefix(A,"mymat") >>>>>>>> VecSetOptionsPrefix(v,"myvec") >>>>>>>> >>>>>>>> --Junchao Zhang >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Oct 18, 2021 at 8:04 PM Chang Liu wrote: >>>>>>>> >>>>>>>>> Hi Junchao, >>>>>>>>> >>>>>>>>> Thank you for your answer. I tried MatConvert and it works. I >>>>>>>>> didn't >>>>>>>>> make it before because I forgot to convert a vector from mpi to >>>>>>>>> mpicuda >>>>>>>>> previously. >>>>>>>>> >>>>>>>>> For vector, there is no VecConvert to use, so I have to do >>>>>>>>> VecDuplicate, >>>>>>>>> VecSetType and VecCopy. Is there an easier option? >>>>>>>>> >>>>>>>> As Matt suggested, you could single out the matrix and vector with >>>>>>>> options prefix and set their type on command line >>>>>>>> >>>>>>>> MatSetOptionsPrefix(A,"mymat"); >>>>>>>> VecSetOptionsPrefix(v,"myvec"); >>>>>>>> >>>>>>>> Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda >>>>>>>> >>>>>>>> A simpler code is to have the vector type automatically set by >>>>>>>> MatCreateVecs(A,&v,NULL) >>>>>>>> >>>>>>>> >>>>>>>>> Chang >>>>>>>>> >>>>>>>>> On 10/18/21 5:23 PM, Junchao Zhang wrote: >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users >>>>>>>>> > > >>>>>>>>> wrote: >>>>>>>>> > >>>>>>>>> > Hi Matt, >>>>>>>>> > >>>>>>>>> > I have a related question. In my code I have many matrices >>>>>>>>> and I only >>>>>>>>> > want to have one living on GPU, the others still staying on >>>>>>>>> CPU mem. >>>>>>>>> > >>>>>>>>> > I wonder if there is an easier way to copy a mpiaij matrix to >>>>>>>>> > mpiaijcusparse (in other words, copy data to GPUs). I can >>>>>>>>> think of >>>>>>>>> > creating a new mpiaijcusparse matrix, and copying the data >>>>>>>>> line by >>>>>>>>> > line. >>>>>>>>> > But I wonder if there is a better option. >>>>>>>>> > >>>>>>>>> > I have tried MatCopy and MatConvert but neither work. >>>>>>>>> > >>>>>>>>> > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? 
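To make the two approaches discussed above concrete, a brief sketch (A is assumed to be an existing, assembled mpiaij matrix; B and the size N are illustrative placeholders):

  Mat      A;                 /* assumed: already created and assembled as mpiaij        */
  Mat      B;
  Vec      x, y;
  PetscInt N = 100000;        /* global size: placeholder value                          */

  /* (a) Convert an existing matrix to the cuSPARSE type in place, as asked above.       */
  MatConvert(A, MATMPIAIJCUSPARSE, MAT_INPLACE_MATRIX, &A);
  MatCreateVecs(A, &x, &y);   /* work vectors automatically get a compatible (cuda) type */

  /* (b) Or let the command line decide the type for just this one matrix, via an
         options prefix (note the trailing underscore): -mymat_mat_type aijcusparse      */
  MatCreate(PETSC_COMM_WORLD, &B);
  MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, N, N);
  MatSetOptionsPrefix(B, "mymat_");
  MatSetFromOptions(B);       /* no hard-coded MatSetType()                              */
  /* ... preallocate, set values, assemble ...                                            */

Approach (a) moves an already-assembled matrix to the GPU, while approach (b) leaves the choice to the run-time options so only the prefixed matrix is affected.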
>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > Chang >>>>>>>>> > >>>>>>>>> > On 10/17/21 7:50 PM, Matthew Knepley wrote: >>>>>>>>> > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh >>>>>>>>> > >>>>>>>>> > > >> >>>>>>>>> wrote: >>>>>>>>> > > >>>>>>>>> > > Do I need convert the MATSEQBAIJ to a cuda matrix in >>>>>>>>> code? >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > You would need a call to MatSetFromOptions() to take that >>>>>>>>> type >>>>>>>>> > from the >>>>>>>>> > > command line, and not have >>>>>>>>> > > the type hard-coded in your application. It is generally >>>>>>>>> a bad >>>>>>>>> > idea to >>>>>>>>> > > hard code the implementation type. >>>>>>>>> > > >>>>>>>>> > > If I do it from command line, then are the other >>>>>>>>> MatVec calls are >>>>>>>>> > > ported onto CUDA? I have many MatVec calls in my >>>>>>>>> code, but I >>>>>>>>> > > specifically want to port just one call. >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > You can give that one matrix an options prefix to isolate >>>>>>>>> it. >>>>>>>>> > > >>>>>>>>> > > Thanks, >>>>>>>>> > > >>>>>>>>> > > Matt >>>>>>>>> > > >>>>>>>>> > > Sincerely, >>>>>>>>> > > Swarnava >>>>>>>>> > > >>>>>>>>> > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang >>>>>>>>> > > >>>>>>>> junchao.zhang at gmail.com> >>>>>>>>> > >>>>>>>> junchao.zhang at gmail.com>>> >>>>>>>>> > wrote: >>>>>>>>> > > >>>>>>>>> > > You can do that with command line options >>>>>>>>> -mat_type >>>>>>>>> > aijcusparse >>>>>>>>> > > -vec_type cuda >>>>>>>>> > > >>>>>>>>> > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh >>>>>>>>> > > >>>>>>>> swarnava89 at gmail.com> >>>>>>>>> > >> >>>>>>>>> wrote: >>>>>>>>> > > >>>>>>>>> > > Dear Petsc team, >>>>>>>>> > > >>>>>>>>> > > I had a query regarding using CUDA to >>>>>>>>> accelerate a matrix >>>>>>>>> > > vector product. >>>>>>>>> > > I have a sequential sparse matrix >>>>>>>>> (MATSEQBAIJ type). >>>>>>>>> > I want >>>>>>>>> > > to port a MatVec call onto GPUs. Is there any >>>>>>>>> > code/example I >>>>>>>>> > > can look at? >>>>>>>>> > > >>>>>>>>> > > Sincerely, >>>>>>>>> > > SG >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > -- >>>>>>>>> > > What most experimenters take for granted before they >>>>>>>>> begin their >>>>>>>>> > > experiments is infinitely more interesting than any >>>>>>>>> results to which >>>>>>>>> > > their experiments lead. >>>>>>>>> > > -- Norbert Wiener >>>>>>>>> > > >>>>>>>>> > > https://www.cse.buffalo.edu/~knepley/ >>>>>>>>> > >>>>>>>>> > >>>>>>>> > > >>>>>>>>> > >>>>>>>>> > -- >>>>>>>>> > Chang Liu >>>>>>>>> > Staff Research Physicist >>>>>>>>> > +1 609 243 3438 >>>>>>>>> > cliu at pppl.gov >>>>>>>>> > Princeton Plasma Physics Laboratory >>>>>>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>>> > >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Chang Liu >>>>>>>>> Staff Research Physicist >>>>>>>>> +1 609 243 3438 >>>>>>>>> cliu at pppl.gov >>>>>>>>> Princeton Plasma Physics Laboratory >>>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA >>>>>>>>> >>>>>>>> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From cliu at pppl.gov Wed Oct 20 11:48:09 2021 From: cliu at pppl.gov (Chang Liu) Date: Wed, 20 Oct 2021 12:48:09 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: <53D4EDD7-E05C-4485-B7AE-23AB10DD81B1@joliv.et> References: <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> <879c30a1-ea85-1c24-4139-268925d511da@pppl.gov> <53D4EDD7-E05C-4485-B7AE-23AB10DD81B1@joliv.et> Message-ID: Hi Pierre, I have another suggestion for telescope. I have achieved my goal by putting telescope outside bjacobi, but the code still does not work if I use telescope as a pc for the subblock. I think the reason is that I want to use cusparse as the solver, which can only deal with a seqaij matrix and not an mpiaij matrix. The telescope pc can gather the matrix onto one MPI rank, making it seqaij for the factorization stage, but after factorization it gives the data back to the original communicator. This turns the matrix back into mpiaij, and then cusparse cannot solve it. I think a better option is to do the factorization on the CPU with the mpiaij matrix, then transform the preconditioner matrix to seqaij and do the MatSolve on the GPU. But I am not sure whether this can be achieved using telescope. Regards, Chang On 10/15/21 5:29 AM, Pierre Jolivet wrote: > Hi Chang, > The output you sent with MUMPS looks alright to me, you can see that the MatType is properly set to seqaijcusparse (and not mpiaijcusparse). > I don't know what is wrong with -sub_telescope_pc_factor_mat_solver_type cusparse, I don't have a PETSc installation for testing this, hopefully Barry or Junchao can confirm this wrong behavior and get this fixed. > As for permuting PCTELESCOPE and PCBJACOBI, in your case, the outer PC will be equivalent, yes. > However, it would be more efficient to do PCBJACOBI and then PCTELESCOPE. > PCBJACOBI prunes the operator by basically removing all coefficients outside of the diagonal blocks. > Then, PCTELESCOPE "groups everything together". > If you do it the other way around, PCTELESCOPE will "group everything together" and then PCBJACOBI will prune the operator. > So the PCTELESCOPE SetUp will be costly for nothing since some coefficients will be thrown out afterwards in the PCBJACOBI SetUp. > I hope I'm clear enough, otherwise I can try to draw some pictures. > > Thanks, > Pierre > >> On 15 Oct 2021, at 4:39 AM, Chang Liu wrote: >> >> Hi Pierre and Barry, >> >> I think maybe I should use telescope outside bjacobi? Like this: >> >> mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type telescope -pc_telescope_reduction_factor 4 -telescope_pc_type bjacobi -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4 -mat_type aijcusparse -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu -telescope_sub_pc_factor_mat_solver_type cusparse -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >> >> But then I got an error: >> >> [0]PETSC ERROR: MatSolverType cusparse does not support matrix type seqaij >> >> But the mat type should be aijcusparse. I think telescope changes the mat type.
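For completeness, the solver layout discussed in this thread (bjacobi outside, telescope inside each block, direct LU on the gathered block) can also be requested programmatically instead of on the command line. The sketch below simply mirrors the options used in the runs quoted further down; it assumes an existing KSP named ksp on which KSPSetFromOptions() has not yet been called:

  PetscOptionsSetValue(NULL, "-pc_type", "bjacobi");
  PetscOptionsSetValue(NULL, "-pc_bjacobi_blocks", "4");
  PetscOptionsSetValue(NULL, "-sub_ksp_type", "preonly");
  PetscOptionsSetValue(NULL, "-sub_pc_type", "telescope");
  PetscOptionsSetValue(NULL, "-sub_pc_telescope_reduction_factor", "4");
  PetscOptionsSetValue(NULL, "-sub_pc_telescope_subcomm_type", "contiguous");
  PetscOptionsSetValue(NULL, "-sub_telescope_ksp_type", "preonly");
  PetscOptionsSetValue(NULL, "-sub_telescope_pc_type", "lu");
  PetscOptionsSetValue(NULL, "-sub_telescope_pc_factor_mat_solver_type", "mumps"); /* "mumps" works in the run below; "cusparse" is the problematic case */
  KSPSetFromOptions(ksp);     /* the nested solvers pick these options up during setup */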
>> >> Chang >> >> On 10/14/21 10:11 PM, Chang Liu wrote: >>> For comparison, here is the output using mumps instead of cusparse >>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type mumps -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>> 1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 >>> 2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 >>> 3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 >>> 4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 >>> 5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 >>> 6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 >>> 7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 >>> 8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03 >>> 9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 >>> 10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 >>> 11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 >>> 12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid norm 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 >>> 13 KSP unpreconditioned resid norm 1.885892030223e-01 true resid norm 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 >>> 14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid norm 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 >>> 15 KSP unpreconditioned resid norm 1.638219366731e-01 true resid norm 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 >>> 16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid norm 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 >>> 17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid norm 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 >>> 18 KSP unpreconditioned resid norm 1.289673236836e-01 true resid norm 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 >>> 19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid norm 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 >>> 20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid norm 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 >>> 21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid norm 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 >>> 22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid norm 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 >>> 23 KSP unpreconditioned resid norm 
7.563496650115e-02 true resid norm 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 >>> 24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid norm 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 >>> 25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid norm 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 >>> 26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid norm 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 >>> 27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid norm 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 >>> 28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid norm 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 >>> 29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid norm 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 >>> 30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid norm 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 >>> 31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid norm 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 >>> 32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid norm 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 >>> 33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid norm 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 >>> 34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid norm 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 >>> 35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid norm 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 >>> 36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid norm 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 >>> 37 KSP unpreconditioned resid norm 4.982347757923e-03 true resid norm 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 >>> 38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid norm 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 >>> 39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid norm 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 >>> 40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid norm 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 >>> 41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid norm 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 >>> 42 KSP unpreconditioned resid norm 2.772928845284e-03 true resid norm 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 >>> 43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid norm 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 >>> 44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid norm 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 >>> 45 KSP unpreconditioned resid norm 2.086800523919e-03 true resid norm 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 >>> 46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid norm 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 >>> 47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid norm 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 >>> 48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid norm 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 >>> 49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid norm 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 >>> 50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid norm 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 >>> 51 KSP unpreconditioned resid 
norm 1.375985053014e-03 true resid norm 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 >>> 52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid norm 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 >>> 53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid norm 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 >>> 54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid norm 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 >>> 55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid norm 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 >>> 56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid norm 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 >>> 57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid norm 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 >>> 58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid norm 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 >>> 59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid norm 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 >>> 60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid norm 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 >>> 61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid norm 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 >>> 62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid norm 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 >>> 63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid norm 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 >>> 64 KSP unpreconditioned resid norm 2.961034697265e-04 true resid norm 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 >>> 65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid norm 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 >>> 66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid norm 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 >>> 67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid norm 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 >>> 68 KSP unpreconditioned resid norm 2.008438265031e-04 true resid norm 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 >>> 69 KSP unpreconditioned resid norm 1.838732863386e-04 true resid norm 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 >>> 70 KSP unpreconditioned resid norm 1.723786027645e-04 true resid norm 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 >>> 71 KSP unpreconditioned resid norm 1.580945192204e-04 true resid norm 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 >>> 72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid norm 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 >>> 73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid norm 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 >>> 74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid norm 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 >>> 75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid norm 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 >>> 76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid norm 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 >>> 77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid norm 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 >>> 78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid norm 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 >>> 79 KSP unpreconditioned 
resid norm 8.896901670359e-05 true resid norm 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 >>> 80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid norm 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 >>> 81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid norm 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 >>> 82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid norm 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 >>> 83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid norm 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 >>> 84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid norm 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 >>> 85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid norm 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 >>> 86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid norm 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 >>> 87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid norm 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 >>> 88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid norm 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 >>> 89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid norm 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 >>> 90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid norm 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 >>> 91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid norm 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 >>> 92 KSP unpreconditioned resid norm 1.707071292484e-05 true resid norm 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 >>> 93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid norm 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 >>> 94 KSP unpreconditioned resid norm 1.219393542993e-05 true resid norm 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 >>> 95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid norm 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 >>> 96 KSP unpreconditioned resid norm 9.099659872548e-06 true resid norm 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 >>> 97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid norm 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 >>> 98 KSP unpreconditioned resid norm 7.167226146744e-06 true resid norm 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 >>> 99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid norm 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 >>> 100 KSP unpreconditioned resid norm 5.767783600111e-06 true resid norm 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 >>> 101 KSP unpreconditioned resid norm 5.261057430584e-06 true resid norm 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 >>> 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid norm 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 >>> 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid norm 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 >>> 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid norm 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 >>> 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid norm 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 >>> 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid norm 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 >>> 107 KSP 
unpreconditioned resid norm 3.234218880987e-06 true resid norm 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 >>> 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid norm 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 >>> 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid norm 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 >>> 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid norm 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 >>> 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid norm 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 >>> 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid norm 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 >>> 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid norm 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 >>> 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid norm 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 >>> 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid norm 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 >>> 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid norm 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 >>> 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid norm 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 >>> 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid norm 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 >>> 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid norm 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 >>> 120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid norm 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 >>> 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid norm 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 >>> 122 KSP unpreconditioned resid norm 7.945760150897e-07 true resid norm 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 >>> 123 KSP unpreconditioned resid norm 7.141240839013e-07 true resid norm 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 >>> 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid norm 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 >>> 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid norm 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 >>> 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid norm 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 >>> 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid norm 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 >>> 128 KSP unpreconditioned resid norm 4.253264691112e-07 true resid norm 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 >>> 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid norm 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 >>> 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid norm 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 >>> 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid norm 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 >>> 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid norm 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 >>> 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid norm 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 >>> 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid norm 2.694708309500e-07 ||r(i)||/||b|| 
6.711649104748e-09 >>> 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid norm 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 >>> 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid norm 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 >>> 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid norm 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 >>> 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid norm 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 >>> 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid norm 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 >>> 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid norm 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 >>> 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid norm 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 >>> 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid norm 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 >>> 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid norm 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 >>> 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid norm 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 >>> 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid norm 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 >>> 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid norm 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 >>> 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid norm 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 >>> 148 KSP unpreconditioned resid norm 5.690132597004e-08 true resid norm 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 >>> 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid norm 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 >>> 150 KSP unpreconditioned resid norm 4.625371062660e-08 true resid norm 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 >>> 151 KSP unpreconditioned resid norm 4.349049084805e-08 true resid norm 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 >>> 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid norm 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 >>> 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid norm 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 >>> 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid norm 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 >>> 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid norm 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 >>> 156 KSP unpreconditioned resid norm 2.128504065757e-08 true resid norm 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 >>> 157 KSP unpreconditioned resid norm 1.879248809429e-08 true resid norm 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 >>> 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid norm 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 >>> 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid norm 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 >>> 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid norm 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 >>> 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid norm 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 >>> 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid norm 
1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 >>> 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid norm 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 >>> 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid norm 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 >>> 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid norm 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 >>> 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid norm 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 >>> 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid norm 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 >>> 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid norm 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 >>> 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid norm 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 >>> 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid norm 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 >>> 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid norm 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 >>> 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid norm 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 >>> 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid norm 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 >>> 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid norm 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 >>> 175 KSP unpreconditioned resid norm 3.924177535665e-09 true resid norm 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 >>> 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid norm 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 >>> 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid norm 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 >>> 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid norm 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 >>> 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid norm 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 >>> 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid norm 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 >>> 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid norm 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 >>> 182 KSP unpreconditioned resid norm 1.983545827983e-09 true resid norm 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 >>> 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid norm 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 >>> 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid norm 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 >>> 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid norm 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 >>> 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid norm 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 >>> 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid norm 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 >>> 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid norm 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 >>> 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid norm 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 >>> KSP Object: 16 MPI processes >>> type: 
fgmres >>> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement >>> happy breakdown tolerance 1e-30 >>> maximum iterations=2000, initial guess is zero >>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. >>> right preconditioning >>> using UNPRECONDITIONED norm type for convergence test >>> PC Object: 16 MPI processes >>> type: bjacobi >>> number of blocks = 4 >>> Local solver information for first block is in the following KSP and PC objects on rank 0: >>> Use -ksp_view ::ascii_info_detail to display information for all blocks >>> KSP Object: (sub_) 4 MPI processes >>> type: preonly >>> maximum iterations=10000, initial guess is zero >>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>> left preconditioning >>> using NONE norm type for convergence test >>> PC Object: (sub_) 4 MPI processes >>> type: telescope >>> petsc subcomm: parent comm size reduction factor = 4 >>> petsc subcomm: parent_size = 4 , subcomm_size = 1 >>> petsc subcomm type = contiguous >>> linear system matrix = precond matrix: >>> Mat Object: (sub_) 4 MPI processes >>> type: mpiaij >>> rows=40200, cols=40200 >>> total: nonzeros=199996, allocated nonzeros=203412 >>> total number of mallocs used during MatSetValues calls=0 >>> not using I-node (on process 0) routines >>> setup type: default >>> Parent DM object: NULL >>> Sub DM object: NULL >>> KSP Object: (sub_telescope_) 1 MPI processes >>> type: preonly >>> maximum iterations=10000, initial guess is zero >>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>> left preconditioning >>> using NONE norm type for convergence test >>> PC Object: (sub_telescope_) 1 MPI processes >>> type: lu >>> out-of-place factorization >>> tolerance for zero pivot 2.22045e-14 >>> matrix ordering: external >>> factor fill ratio given 0., needed 0. 
>>> Factored matrix follows: >>> Mat Object: 1 MPI processes >>> type: mumps >>> rows=40200, cols=40200 >>> package used to perform factorization: mumps >>> total: nonzeros=1849788, allocated nonzeros=1849788 >>> MUMPS run parameters: >>> SYM (matrix type): 0 >>> PAR (host participation): 1 >>> ICNTL(1) (output for error): 6 >>> ICNTL(2) (output of diagnostic msg): 0 >>> ICNTL(3) (output for global info): 0 >>> ICNTL(4) (level of printing): 0 >>> ICNTL(5) (input mat struct): 0 >>> ICNTL(6) (matrix prescaling): 7 >>> ICNTL(7) (sequential matrix ordering):7 >>> ICNTL(8) (scaling strategy): 77 >>> ICNTL(10) (max num of refinements): 0 >>> ICNTL(11) (error analysis): 0 >>> ICNTL(12) (efficiency control): 1 >>> ICNTL(13) (sequential factorization of the root node): 0 >>> ICNTL(14) (percentage of estimated workspace increase): 20 >>> ICNTL(18) (input mat struct): 0 >>> ICNTL(19) (Schur complement info): 0 >>> ICNTL(20) (RHS sparse pattern): 0 >>> ICNTL(21) (solution struct): 0 >>> ICNTL(22) (in-core/out-of-core facility): 0 >>> ICNTL(23) (max size of memory can be allocated locally):0 >>> ICNTL(24) (detection of null pivot rows): 0 >>> ICNTL(25) (computation of a null space basis): 0 >>> ICNTL(26) (Schur options for RHS or solution): 0 >>> ICNTL(27) (blocking size for multiple RHS): -32 >>> ICNTL(28) (use parallel or sequential ordering): 1 >>> ICNTL(29) (parallel ordering): 0 >>> ICNTL(30) (user-specified set of entries in inv(A)): 0 >>> ICNTL(31) (factors is discarded in the solve phase): 0 >>> ICNTL(33) (compute determinant): 0 >>> ICNTL(35) (activate BLR based factorization): 0 >>> ICNTL(36) (choice of BLR factorization variant): 0 >>> ICNTL(38) (estimated compression rate of LU factors): 333 >>> CNTL(1) (relative pivoting threshold): 0.01 >>> CNTL(2) (stopping criterion of refinement): 1.49012e-08 >>> CNTL(3) (absolute pivoting threshold): 0. >>> CNTL(4) (value of static pivoting): -1. >>> CNTL(5) (fixation for null pivots): 0. >>> CNTL(7) (dropping parameter for BLR): 0. 
>>> RINFO(1) (local estimated flops for the elimination after analysis): >>> [0] 1.45525e+08 >>> RINFO(2) (local estimated flops for the assembly after factorization): >>> [0] 2.89397e+06 >>> RINFO(3) (local estimated flops for the elimination after factorization): >>> [0] 1.45525e+08 >>> INFO(15) (estimated size of (in MB) MUMPS internal data for running numerical factorization): >>> [0] 29 >>> INFO(16) (size of (in MB) MUMPS internal data used during numerical factorization): >>> [0] 29 >>> INFO(23) (num of pivots eliminated on this processor after factorization): >>> [0] 40200 >>> RINFOG(1) (global estimated flops for the elimination after analysis): 1.45525e+08 >>> RINFOG(2) (global estimated flops for the assembly after factorization): 2.89397e+06 >>> RINFOG(3) (global estimated flops for the elimination after factorization): 1.45525e+08 >>> (RINFOG(12) RINFOG(13))*2^INFOG(34) (determinant): (0.,0.)*(2^0) >>> INFOG(3) (estimated real workspace for factors on all processors after analysis): 1849788 >>> INFOG(4) (estimated integer workspace for factors on all processors after analysis): 879986 >>> INFOG(5) (estimated maximum front size in the complete tree): 282 >>> INFOG(6) (number of nodes in the complete tree): 23709 >>> INFOG(7) (ordering option effectively used after analysis): 5 >>> INFOG(8) (structural symmetry in percent of the permuted matrix after analysis): 100 >>> INFOG(9) (total real/complex workspace to store the matrix factors after factorization): 1849788 >>> INFOG(10) (total integer space store the matrix factors after factorization): 879986 >>> INFOG(11) (order of largest frontal matrix after factorization): 282 >>> INFOG(12) (number of off-diagonal pivots): 0 >>> INFOG(13) (number of delayed pivots after factorization): 0 >>> INFOG(14) (number of memory compress after factorization): 0 >>> INFOG(15) (number of steps of iterative refinement after solution): 0 >>> INFOG(16) (estimated size (in MB) of all MUMPS internal data for factorization after analysis: value on the most memory consuming processor): 29 >>> INFOG(17) (estimated size of all MUMPS internal data for factorization after analysis: sum over all processors): 29 >>> INFOG(18) (size of all MUMPS internal data allocated during factorization: value on the most memory consuming processor): 29 >>> INFOG(19) (size of all MUMPS internal data allocated during factorization: sum over all processors): 29 >>> INFOG(20) (estimated number of entries in the factors): 1849788 >>> INFOG(21) (size in MB of memory effectively used during factorization - value on the most memory consuming processor): 26 >>> INFOG(22) (size in MB of memory effectively used during factorization - sum over all processors): 26 >>> INFOG(23) (after analysis: value of ICNTL(6) effectively used): 0 >>> INFOG(24) (after analysis: value of ICNTL(12) effectively used): 1 >>> INFOG(25) (after factorization: number of pivots modified by static pivoting): 0 >>> INFOG(28) (after factorization: number of null pivots encountered): 0 >>> INFOG(29) (after factorization: effective number of entries in the factors (sum over all processors)): 1849788 >>> INFOG(30, 31) (after solution: size in Mbytes of memory used during solution phase): 29, 29 >>> INFOG(32) (after analysis: type of analysis done): 1 >>> INFOG(33) (value used for ICNTL(8)): 7 >>> INFOG(34) (exponent of the determinant if determinant is requested): 0 >>> INFOG(35) (after factorization: number of entries taking into account BLR factor compression - sum over all processors): 1849788 >>> INFOG(36) 
(after analysis: estimated size of all MUMPS internal data for running BLR in-core - value on the most memory consuming processor): 0 >>> INFOG(37) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - sum over all processors): 0 >>> INFOG(38) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - value on the most memory consuming processor): 0 >>> INFOG(39) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - sum over all processors): 0 >>> linear system matrix = precond matrix: >>> Mat Object: 1 MPI processes >>> type: seqaijcusparse >>> rows=40200, cols=40200 >>> total: nonzeros=199996, allocated nonzeros=199996 >>> total number of mallocs used during MatSetValues calls=0 >>> not using I-node routines >>> linear system matrix = precond matrix: >>> Mat Object: 16 MPI processes >>> type: mpiaijcusparse >>> rows=160800, cols=160800 >>> total: nonzeros=802396, allocated nonzeros=1608000 >>> total number of mallocs used during MatSetValues calls=0 >>> not using I-node (on process 0) routines >>> Norm of error 9.11684e-07 iterations 189 >>> Chang >>> On 10/14/21 10:10 PM, Chang Liu wrote: >>>> Hi Barry, >>>> >>>> No problem. Here is the output. It seems that the resid norm calculation is incorrect. >>>> >>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >>>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>>> 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>>> KSP Object: 16 MPI processes >>>> type: fgmres >>>> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement >>>> happy breakdown tolerance 1e-30 >>>> maximum iterations=2000, initial guess is zero >>>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. >>>> right preconditioning >>>> using UNPRECONDITIONED norm type for convergence test >>>> PC Object: 16 MPI processes >>>> type: bjacobi >>>> number of blocks = 4 >>>> Local solver information for first block is in the following KSP and PC objects on rank 0: >>>> Use -ksp_view ::ascii_info_detail to display information for all blocks >>>> KSP Object: (sub_) 4 MPI processes >>>> type: preonly >>>> maximum iterations=10000, initial guess is zero >>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
>>>> left preconditioning >>>> using NONE norm type for convergence test >>>> PC Object: (sub_) 4 MPI processes >>>> type: telescope >>>> petsc subcomm: parent comm size reduction factor = 4 >>>> petsc subcomm: parent_size = 4 , subcomm_size = 1 >>>> petsc subcomm type = contiguous >>>> linear system matrix = precond matrix: >>>> Mat Object: (sub_) 4 MPI processes >>>> type: mpiaij >>>> rows=40200, cols=40200 >>>> total: nonzeros=199996, allocated nonzeros=203412 >>>> total number of mallocs used during MatSetValues calls=0 >>>> not using I-node (on process 0) routines >>>> setup type: default >>>> Parent DM object: NULL >>>> Sub DM object: NULL >>>> KSP Object: (sub_telescope_) 1 MPI processes >>>> type: preonly >>>> maximum iterations=10000, initial guess is zero >>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>>> left preconditioning >>>> using NONE norm type for convergence test >>>> PC Object: (sub_telescope_) 1 MPI processes >>>> type: lu >>>> out-of-place factorization >>>> tolerance for zero pivot 2.22045e-14 >>>> matrix ordering: nd >>>> factor fill ratio given 5., needed 8.62558 >>>> Factored matrix follows: >>>> Mat Object: 1 MPI processes >>>> type: seqaijcusparse >>>> rows=40200, cols=40200 >>>> package used to perform factorization: cusparse >>>> total: nonzeros=1725082, allocated nonzeros=1725082 >>>> not using I-node routines >>>> linear system matrix = precond matrix: >>>> Mat Object: 1 MPI processes >>>> type: seqaijcusparse >>>> rows=40200, cols=40200 >>>> total: nonzeros=199996, allocated nonzeros=199996 >>>> total number of mallocs used during MatSetValues calls=0 >>>> not using I-node routines >>>> linear system matrix = precond matrix: >>>> Mat Object: 16 MPI processes >>>> type: mpiaijcusparse >>>> rows=160800, cols=160800 >>>> total: nonzeros=802396, allocated nonzeros=1608000 >>>> total number of mallocs used during MatSetValues calls=0 >>>> not using I-node (on process 0) routines >>>> Norm of error 400.999 iterations 1 >>>> >>>> Chang >>>> >>>> >>>> On 10/14/21 9:47 PM, Barry Smith wrote: >>>>> >>>>> Chang, >>>>> >>>>> Sorry I did not notice that one. Please run that with -ksp_view -ksp_monitor_true_residual so we can see exactly how options are interpreted and solver used. At a glance it looks ok but something must be wrong to get the wrong answer. >>>>> >>>>> Barry >>>>> >>>>>> On Oct 14, 2021, at 6:02 PM, Chang Liu wrote: >>>>>> >>>>>> Hi Barry, >>>>>> >>>>>> That is exactly what I was doing in the second example, in which the preconditioner works but the GMRES does not. >>>>>> >>>>>> Chang >>>>>> >>>>>> On 10/14/21 5:15 PM, Barry Smith wrote: >>>>>>> You need to use the PCTELESCOPE inside the block Jacobi, not outside it. So something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu >>>>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: >>>>>>>> >>>>>>>> Hi Pierre, >>>>>>>> >>>>>>>> I wonder if the trick of PCTELESCOPE only works for preconditioner and not for the solver. I have done some tests, and find that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work. 
>>>>>>>> >>>>>>>> The command line options I used for small matrix is like >>>>>>>> >>>>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4 >>>>>>>> >>>>>>>> which gives the correct output. For iterative solver, I tried >>>>>>>> >>>>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20 >>>>>>>> >>>>>>>> for large matrix. The output is like >>>>>>>> >>>>>>>> 0 KSP Residual norm 40.1497 >>>>>>>> 1 KSP Residual norm < 1.e-11 >>>>>>>> Norm of error 400.999 iterations 1 >>>>>>>> >>>>>>>> So it seems to call a direct solver instead of an iterative one. >>>>>>>> >>>>>>>> Can you please help check these options? >>>>>>>> >>>>>>>> Chang >>>>>>>> >>>>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote: >>>>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote: >>>>>>>>>> >>>>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaircusparse? Or I have to do it manually? >>>>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). >>>>>>>>> 1) I?m not sure this is implemented for cuSparse matrices, but it should be; >>>>>>>>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning. >>>>>>>>> If you try this out and this does not work, please provide the backtrace (probably something like ?Operation XYZ not implemented for MatType ABC?), and hopefully someone can add the missing plumbing. >>>>>>>>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve. >>>>>>>>> Thanks, >>>>>>>>> Pierre >>>>>>>>>> Chang >>>>>>>>>> >>>>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >>>>>>>>>>> Maybe I?m missing something, but can?t you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block? >>>>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu >>>>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads because not only do the Mat needs to be redistributed, the secondary processes also need to be ?converted? to OpenMP threads. >>>>>>>>>>> Thus the need for specific code in mumps.c. >>>>>>>>>>> Thanks, >>>>>>>>>>> Pierre >>>>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Junchao, >>>>>>>>>>>> >>>>>>>>>>>> Yes that is what I want. 
>>>>>>>>>>>> >>>>>>>>>>>> Chang >>>>>>>>>>>> >>>>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: >>>>>>>>>>>>> Junchao, >>>>>>>>>>>>> If I understand correctly Chang is using the block Jacobi >>>>>>>>>>>>> method with a single block for a number of MPI ranks and a direct >>>>>>>>>>>>> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which >>>>>>>>>>>>> is code Hong Zhang wrote a number of years ago for CPUs. For their >>>>>>>>>>>>> particular problems this preconditioner works well, but using an >>>>>>>>>>>>> iterative solver on the blocks does not work well. >>>>>>>>>>>>> If we had complete MPI-GPU direct solvers he could just use >>>>>>>>>>>>> the current code with MPIAIJCUSPARSE on each block but since we do >>>>>>>>>>>>> not he would like to use a single GPU for each block, this means >>>>>>>>>>>>> that diagonal blocks of the global parallel MPI matrix needs to be >>>>>>>>>>>>> sent to a subset of the GPUs (one GPU per block, which has multiple >>>>>>>>>>>>> MPI ranks associated with the blocks). Similarly for the triangular >>>>>>>>>>>>> solves the blocks of the right hand side needs to be shipped to the >>>>>>>>>>>>> appropriate GPU and the resulting solution shipped back to the >>>>>>>>>>>>> multiple GPUs. So Chang is absolutely correct, this is somewhat like >>>>>>>>>>>>> your code for MUMPS with OpenMP. OK, I now understand the background.. >>>>>>>>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the >>>>>>>>>>>>> MPI ranks and then shrink each block down to a single GPU but this >>>>>>>>>>>>> would be pretty inefficient, ideally one would go directly from the >>>>>>>>>>>>> big MPI matrix on all the GPUs to the sub matrices on the subset of >>>>>>>>>>>>> GPUs. But this may be a large coding project. >>>>>>>>>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. >>>>>>>>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. >>>>>>>>>>>>> Barry >>>>>>>>>>>>> Since the matrices being factored and solved directly are relatively >>>>>>>>>>>>> large it is possible that the cusparse code could be reasonably >>>>>>>>>>>>> efficient (they are not the tiny problems one gets at the coarse >>>>>>>>>>>>> level of multigrid). Of course, this is speculation, I don't >>>>>>>>>>>>> actually know how much better the cusparse code would be on the >>>>>>>>>>>>> direct solver than a good CPU direct sparse solver. >>>>>>>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>>>>>>>> > wrote: >>>>>>>>>>>>> > >>>>>>>>>>>>> > Sorry I am not familiar with the details either. Can you please >>>>>>>>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>>>>>>>>> > >>>>>>>>>>>>> > Chang >>>>>>>>>>>>> > >>>>>>>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>>>>>>>>> >> Hi Chang, >>>>>>>>>>>>> >> I did the work in mumps. It is easy for me to understand >>>>>>>>>>>>> gathering matrix rows to one process. >>>>>>>>>>>>> >> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that? 
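[Illustrative sketch, not part of the original exchange: the block layout being described above, using the 16-rank / 4-block configuration from the runs earlier in this thread, one GPU per block.]

    ranks  0- 3 : rows of diagonal block 0  -->  block 0 on a 4-rank subcomm  -->  gathered to 1 rank, factored/solved on GPU 0
    ranks  4- 7 : rows of diagonal block 1  -->  block 1 on a 4-rank subcomm  -->  gathered to 1 rank, factored/solved on GPU 1
    ranks  8-11 : rows of diagonal block 2  -->  block 2 on a 4-rank subcomm  -->  gathered to 1 rank, factored/solved on GPU 2
    ranks 12-15 : rows of diagonal block 3  -->  block 3 on a 4-rank subcomm  -->  gathered to 1 rank, factored/solved on GPU 3

For each outer Krylov iteration, the block's piece of the right-hand side is gathered onto its single "GPU rank", the triangular solves run there, and the solution is scattered back to the block's four ranks.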
>>>>>>>>>>>>> >> Thanks >>>>>>>>>>>>> >> --Junchao Zhang >>>>>>>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users >>>>>>>>>>>>> >>>>>>>>>>>>> >> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >> Hi Barry, >>>>>>>>>>>>> >> I think mumps solver in petsc does support that. You can >>>>>>>>>>>>> check the >>>>>>>>>>>>> >> documentation on "-mat_mumps_use_omp_threads" at >>>>>>>>>>>>> >> >>>>>>>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>>>>>>>>> >>>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>>> >> and the code enclosed by #if >>>>>>>>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>>>>>>>>> >> functions MatMumpsSetUpDistRHSInfo and >>>>>>>>>>>>> >> MatMumpsGatherNonzerosOnMaster in >>>>>>>>>>>>> >> mumps.c >>>>>>>>>>>>> >> 1. I understand it is ideal to do one MPI rank per GPU. >>>>>>>>>>>>> However, I am >>>>>>>>>>>>> >> working on an existing code that was developed based on MPI >>>>>>>>>>>>> and the the >>>>>>>>>>>>> >> # of mpi ranks is typically equal to # of cpu cores. We don't >>>>>>>>>>>>> want to >>>>>>>>>>>>> >> change the whole structure of the code. >>>>>>>>>>>>> >> 2. What you have suggested has been coded in mumps.c. See >>>>>>>>>>>>> function >>>>>>>>>>>>> >> MatMumpsSetUpDistRHSInfo. >>>>>>>>>>>>> >> Regards, >>>>>>>>>>>>> >> Chang >>>>>>>>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>>>>>>>>> >>>>>>>>>>>>> >> >> wrote: >>>>>>>>>>>>> >> >> >>>>>>>>>>>>> >> >> Hi Barry, >>>>>>>>>>>>> >> >> >>>>>>>>>>>>> >> >> That is exactly what I want. >>>>>>>>>>>>> >> >> >>>>>>>>>>>>> >> >> Back to my original question, I am looking for an approach to >>>>>>>>>>>>> >> transfer >>>>>>>>>>>>> >> >> matrix >>>>>>>>>>>>> >> >> data from many MPI processes to "master" MPI >>>>>>>>>>>>> >> >> processes, each of which taking care of one GPU, and then >>>>>>>>>>>>> upload >>>>>>>>>>>>> >> the data to GPU to >>>>>>>>>>>>> >> >> solve. >>>>>>>>>>>>> >> >> One can just grab some codes from mumps.c to >>>>>>>>>>>>> aijcusparse.cu >>>>>>>>>>>>> >> >. >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> > mumps.c doesn't actually do that. It never needs to >>>>>>>>>>>>> copy the >>>>>>>>>>>>> >> entire matrix to a single MPI rank. >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> > It would be possible to write such a code that you >>>>>>>>>>>>> suggest but >>>>>>>>>>>>> >> it is not clear that it makes sense >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> > 1) For normal PETSc GPU usage there is one GPU per MPI >>>>>>>>>>>>> rank, so >>>>>>>>>>>>> >> while your one GPU per big domain is solving its systems the >>>>>>>>>>>>> other >>>>>>>>>>>>> >> GPUs (with the other MPI ranks that share that domain) are doing >>>>>>>>>>>>> >> nothing. >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> > 2) For each triangular solve you would have to gather the >>>>>>>>>>>>> right >>>>>>>>>>>>> >> hand side from the multiple ranks to the single GPU to pass it to >>>>>>>>>>>>> >> the GPU solver and then scatter the resulting solution back >>>>>>>>>>>>> to all >>>>>>>>>>>>> >> of its subdomain ranks. >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> > What I was suggesting was assign an entire subdomain to a >>>>>>>>>>>>> >> single MPI rank, thus it does everything on one GPU and can >>>>>>>>>>>>> use the >>>>>>>>>>>>> >> GPU solver directly. If all the major computations of a subdomain >>>>>>>>>>>>> >> can fit and be done on a single GPU then you would be >>>>>>>>>>>>> utilizing all >>>>>>>>>>>>> >> the GPUs you are using effectively. 
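[A minimal code sketch of the gather / GPU-solve / scatter pattern described in point 2) above, assuming a factored sequential matrix F already held by the first rank of the block, and a parallel right-hand side b and solution x on the block's communicator; error checking omitted, illustrative only.]

    Vec         bseq, xseq;
    VecScatter  tozero;
    PetscMPIInt rank;

    MPI_Comm_rank(PetscObjectComm((PetscObject)b), &rank);
    VecScatterCreateToZero(b, &tozero, &bseq);              /* bseq holds all entries on rank 0, none elsewhere */
    VecDuplicate(bseq, &xseq);
    VecScatterBegin(tozero, b, bseq, INSERT_VALUES, SCATTER_FORWARD);  /* gather the block's rhs */
    VecScatterEnd(tozero, b, bseq, INSERT_VALUES, SCATTER_FORWARD);
    if (rank == 0) MatSolve(F, bseq, xseq);                 /* triangular solves on this rank's GPU */
    VecScatterBegin(tozero, xseq, x, INSERT_VALUES, SCATTER_REVERSE);  /* scatter the solution back (arguments swapped for reverse) */
    VecScatterEnd(tozero, xseq, x, INSERT_VALUES, SCATTER_REVERSE);
    VecScatterDestroy(&tozero); VecDestroy(&bseq); VecDestroy(&xseq);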
>>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> > Barry >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> >> >>>>>>>>>>>>> >> >> Chang >>>>>>>>>>>>> >> >> >>>>>>>>>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>>>>>>>>>> >> >>> Chang, >>>>>>>>>>>>> >> >>> You are correct there is no MPI + GPU direct >>>>>>>>>>>>> solvers that >>>>>>>>>>>>> >> currently do the triangular solves with MPI + GPU parallelism >>>>>>>>>>>>> that I >>>>>>>>>>>>> >> am aware of. You are limited that individual triangular solves be >>>>>>>>>>>>> >> done on a single GPU. I can only suggest making each subdomain as >>>>>>>>>>>>> >> big as possible to utilize each GPU as much as possible for the >>>>>>>>>>>>> >> direct triangular solves. >>>>>>>>>>>>> >> >>> Barry >>>>>>>>>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>> >> >>>> Hi Mark, >>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>> >> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with >>>>>>>>>>>>> other >>>>>>>>>>>>> >> solvers, but with -pc_factor_mat_solver_type cusparse, it >>>>>>>>>>>>> will give >>>>>>>>>>>>> >> an error. >>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>> >> >>>> Yes what I want is to have mumps or superlu to do the >>>>>>>>>>>>> >> factorization, and then do the rest, including GMRES solver, >>>>>>>>>>>>> on gpu. >>>>>>>>>>>>> >> Is that possible? >>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>> >> >>>> I have tried to use aijcusparse with superlu_dist, it >>>>>>>>>>>>> runs but >>>>>>>>>>>>> >> the iterative solver is still running on CPUs. I have >>>>>>>>>>>>> contacted the >>>>>>>>>>>>> >> superlu group and they confirmed that is the case right now. >>>>>>>>>>>>> But if >>>>>>>>>>>>> >> I set -pc_factor_mat_solver_type cusparse, it seems that the >>>>>>>>>>>>> >> iterative solver is running on GPU. >>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>> >> >>>> Chang >>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>>>>>>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >>>>>>>>>>>>> >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >>>>>>>>>>>>> >> >>> wrote: >>>>>>>>>>>>> >> >>>>> Thank you Junchao for explaining this. I guess in >>>>>>>>>>>>> my case >>>>>>>>>>>>> >> the code is >>>>>>>>>>>>> >> >>>>> just calling a seq solver like superlu to do >>>>>>>>>>>>> >> factorization on GPUs. >>>>>>>>>>>>> >> >>>>> My idea is that I want to have a traditional MPI >>>>>>>>>>>>> code to >>>>>>>>>>>>> >> utilize GPUs >>>>>>>>>>>>> >> >>>>> with cusparse. Right now cusparse does not support >>>>>>>>>>>>> mpiaij >>>>>>>>>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' will give you an >>>>>>>>>>>>> >> mpiaijcusparse matrix with > 1 processes. >>>>>>>>>>>>> >> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). >>>>>>>>>>>>> >> >>>>> However, I see in grepping the repo that all the mumps and >>>>>>>>>>>>> >> superlu tests use aij or sell matrix type. >>>>>>>>>>>>> >> >>>>> MUMPS and SuperLU provide their own solves, I assume >>>>>>>>>>>>> .... but >>>>>>>>>>>>> >> you might want to do other matrix operations on the GPU. Is >>>>>>>>>>>>> that the >>>>>>>>>>>>> >> issue? >>>>>>>>>>>>> >> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or >>>>>>>>>>>>> SuperLU >>>>>>>>>>>>> >> have a problem? 
(no test with it so it probably does not work) >>>>>>>>>>>>> >> >>>>> Thanks, >>>>>>>>>>>>> >> >>>>> Mark >>>>>>>>>>>>> >> >>>>> so I >>>>>>>>>>>>> >> >>>>> want the code to have a mpiaij matrix when adding >>>>>>>>>>>>> all the >>>>>>>>>>>>> >> matrix terms, >>>>>>>>>>>>> >> >>>>> and then transform the matrix to seqaij when doing the >>>>>>>>>>>>> >> factorization >>>>>>>>>>>>> >> >>>>> and >>>>>>>>>>>>> >> >>>>> solve. This involves sending the data to the master >>>>>>>>>>>>> >> process, and I >>>>>>>>>>>>> >> >>>>> think >>>>>>>>>>>>> >> >>>>> the petsc mumps solver have something similar already. >>>>>>>>>>>>> >> >>>>> Chang >>>>>>>>>>>>> >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote: >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >>>>>>>>>>>>> >> >>>>>>>>>>>>> > >>>>>>>>>>>>> >> >>>>> >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>> >>>>>>>>>>>> > >>>>>>>>>>>>> >> >>>>>>>>>>>>> >>>> wrote: >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >>>>>>>>>>>>> >> >>>>>>>>>>>> > >>>>>>>>>>>>> >> >>>>> >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> >> >>>>>>>>>>>>> >>>> wrote: >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > Hi Mark, >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > The option I use is like >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > -pc_type bjacobi -pc_bjacobi_blocks 16 >>>>>>>>>>>>> >> -ksp_type fgmres >>>>>>>>>>>>> >> >>>>> -mat_type >>>>>>>>>>>>> >> >>>>> > aijcusparse *-sub_pc_factor_mat_solver_type >>>>>>>>>>>>> >> cusparse >>>>>>>>>>>>> >> >>>>> *-sub_ksp_type >>>>>>>>>>>>> >> >>>>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 >>>>>>>>>>>>> >> -ksp_rtol 1.e-300 >>>>>>>>>>>>> >> >>>>> > -ksp_atol 1.e-300 >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > Note, If you use -log_view the last column >>>>>>>>>>>>> (rows >>>>>>>>>>>>> >> are the >>>>>>>>>>>>> >> >>>>> method like >>>>>>>>>>>>> >> >>>>> > MatFactorNumeric) has the percent of work >>>>>>>>>>>>> in the GPU. >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > Junchao: *This* implies that we have a >>>>>>>>>>>>> cuSparse LU >>>>>>>>>>>>> >> >>>>> factorization. Is >>>>>>>>>>>>> >> >>>>> > that correct? (I don't think we do) >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > No, we don't have cuSparse LU factorization. If you check >>>>>>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >>>>>>>>>>>>> find it >>>>>>>>>>>>> >> calls >>>>>>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJ() instead. >>>>>>>>>>>>> >> >>>>> > So I don't understand Chang's idea. Do you want to >>>>>>>>>>>>> >> make bigger >>>>>>>>>>>>> >> >>>>> blocks? >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > I think this one do both factorization and >>>>>>>>>>>>> >> solve on gpu. >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > You can check the >>>>>>>>>>>>> runex72_aijcusparse.sh file >>>>>>>>>>>>> >> in petsc >>>>>>>>>>>>> >> >>>>> install >>>>>>>>>>>>> >> >>>>> > directory, and try it your self (this >>>>>>>>>>>>> is only lu >>>>>>>>>>>>> >> >>>>> factorization >>>>>>>>>>>>> >> >>>>> > without >>>>>>>>>>>>> >> >>>>> > iterative solve). 
>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > Chang >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >> >>>>> > On 10/12/21 1:17 PM, Mark Adams wrote: >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>> >> >>>>> > > On Tue, Oct 12, 2021 at 11:19 AM >>>>>>>>>>>>> Chang Liu >>>>>>>>>>>>> >> >>>>> >>>>>>>>>>>>> > >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> >> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>> >>>>>>>>>>>>> >> > >>>>>>>>>>>>> >>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>> >>>>>>>>>>>>> > >>>>>>>>>>>>> >> >>>>>>>>>>>>> >>>>> wrote: >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>> >> >>>>> > > Hi Junchao, >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>> >> >>>>> > > No I only needs it to be transferred >>>>>>>>>>>>> >> within a >>>>>>>>>>>>> >> >>>>> node. I use >>>>>>>>>>>>> >> >>>>> > block-Jacobi >>>>>>>>>>>>> >> >>>>> > > method and GMRES to solve the sparse >>>>>>>>>>>>> >> matrix, so each >>>>>>>>>>>>> >> >>>>> > direct solver will >>>>>>>>>>>>> >> >>>>> > > take care of a sub-block of the >>>>>>>>>>>>> whole >>>>>>>>>>>>> >> matrix. In this >>>>>>>>>>>>> >> >>>>> > way, I can use >>>>>>>>>>>>> >> >>>>> > > one >>>>>>>>>>>>> >> >>>>> > > GPU to solve one sub-block, which is >>>>>>>>>>>>> >> stored within >>>>>>>>>>>>> >> >>>>> one node. >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>> >> >>>>> > > It was stated in the >>>>>>>>>>>>> documentation that >>>>>>>>>>>>> >> cusparse >>>>>>>>>>>>> >> >>>>> solver >>>>>>>>>>>>> >> >>>>> > is slow. >>>>>>>>>>>>> >> >>>>> > > However, in my test using >>>>>>>>>>>>> ex72.c, the >>>>>>>>>>>>> >> cusparse >>>>>>>>>>>>> >> >>>>> solver is >>>>>>>>>>>>> >> >>>>> > faster than >>>>>>>>>>>>> >> >>>>> > > mumps or superlu_dist on CPUs. >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>> >> >>>>> > > Are we talking about the >>>>>>>>>>>>> factorization, the >>>>>>>>>>>>> >> solve, or >>>>>>>>>>>>> >> >>>>> both? >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>> >> >>>>> > > We do not have an interface to >>>>>>>>>>>>> cuSparse's LU >>>>>>>>>>>>> >> >>>>> factorization (I >>>>>>>>>>>>> >> >>>>> > just >>>>>>>>>>>>> >> >>>>> > > learned that it exists a few weeks ago). >>>>>>>>>>>>> >> >>>>> > > Perhaps your fast "cusparse solver" is >>>>>>>>>>>>> >> '-pc_type lu >>>>>>>>>>>>> >> >>>>> -mat_type >>>>>>>>>>>>> >> >>>>> > > aijcusparse' ? This would be the CPU >>>>>>>>>>>>> >> factorization, >>>>>>>>>>>>> >> >>>>> which is the >>>>>>>>>>>>> >> >>>>> > > dominant cost. >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>> >> >>>>> > > Chang >>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>> >> >>>>> > > On 10/12/21 10:24 AM, Junchao >>>>>>>>>>>>> Zhang wrote: >>>>>>>>>>>>> >> >>>>> > > > Hi, Chang, >>>>>>>>>>>>> >> >>>>> > > > For the mumps solver, we >>>>>>>>>>>>> usually >>>>>>>>>>>>> >> transfers >>>>>>>>>>>>> >> >>>>> matrix >>>>>>>>>>>>> >> >>>>> > and vector >>>>>>>>>>>>> >> >>>>> > > data >>>>>>>>>>>>> >> >>>>> > > > within a compute node. For >>>>>>>>>>>>> the idea you >>>>>>>>>>>>> >> >>>>> propose, it >>>>>>>>>>>>> >> >>>>> > looks like >>>>>>>>>>>>> >> >>>>> > > we need >>>>>>>>>>>>> >> >>>>> > > > to gather data within >>>>>>>>>>>>> >> MPI_COMM_WORLD, right? >>>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>>> >> >>>>> > > > Mark, I remember you said >>>>>>>>>>>>> >> cusparse solve is >>>>>>>>>>>>> >> >>>>> slow >>>>>>>>>>>>> >> >>>>> > and you would >>>>>>>>>>>>> >> >>>>> > > > rather do it on CPU. Is it right? 
> --Junchao Zhang
>
> On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users wrote:
>
> > Hi,
> >
> > Currently, it is possible to use mumps solver in PETSC with
> > -mat_mumps_use_omp_threads option, so that multiple MPI processes will
> > transfer the matrix and rhs data to the master rank, and then master
> > rank will call mumps with OpenMP to solve the matrix.
> >
> > I wonder if someone can develop similar option for cusparse solver.
> > Right now, this solver does not work with mpiaijcusparse. I think a
> > possible workaround is to transfer all the matrix data to one MPI
> > process, and then upload the data to GPU to solve. In this way, one can
> > use cusparse solver for a MPI program.
> >
> > Chang
> > --
> > Chang Liu
> > Staff Research Physicist
> > +1 609 243 3438
> > cliu at pppl.gov
> > Princeton Plasma Physics Laboratory
> > 100 Stellarator Rd, Princeton NJ 08540, USA

-- 
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA
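[The next message in the archive turns to a related question, moving a single matrix and its vectors over to the GPU. A minimal sketch of the two routes discussed there is collected here for reference; it assumes an already assembled MPIAIJ matrix A, is illustrative only, and omits error checking.]

    Vec x, b;

    /* Route 1: convert just this matrix in place to the CUDA-enabled type,
       then let it create vectors of the matching (CUDA) type */
    MatConvert(A, MATAIJCUSPARSE, MAT_INPLACE_MATRIX, &A);
    MatCreateVecs(A, &x, &b);

    /* Route 2 (alternative): give only this matrix/vector an options prefix and
       pick the types at run time, e.g.
           -mymat_mat_type aijcusparse -myvec_vec_type cuda
       MatSetFromOptions()/VecSetFromOptions() must be called so the
       command-line types take effect */
    MatSetOptionsPrefix(A, "mymat_");
    MatSetFromOptions(A);
    VecSetOptionsPrefix(x, "myvec_");
    VecSetFromOptions(x);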
From cliu at pppl.gov Wed Oct 20 11:55:47 2021
From: cliu at pppl.gov (Chang Liu)
Date: Wed, 20 Oct 2021 12:55:47 -0400
Subject: [petsc-users] [External] Re: MatVec on GPUs
In-Reply-To: 
References: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov>
Message-ID: <07787336-5a69-d6f8-45ca-b2f4223f9311@pppl.gov>

Hi Junchao,

Thank you for the suggestion. I did some more tests and found that MatConvert does not always work. In one of my tests, I did MatConvert to convert the matrix to aijcusparse, then ran a preonly KSP solve and it works well. But then I tried a fgmres solver and it gave an error. It only happens when the matrix is mpiaijcusparse; for seqaijcusparse it works. So I tried to create a new aijcusparse matrix and copy the data line by line, and then both solvers work. So I guess there are some tricky things with MatConvert.

Chang

On 10/18/21 9:23 PM, Junchao Zhang wrote:
> MatSetOptionsPrefix(A,"mymat")
> VecSetOptionsPrefix(v,"myvec")
>
> --Junchao Zhang
>
> On Mon, Oct 18, 2021 at 8:04 PM Chang Liu wrote:
>
>     Hi Junchao,
>
>     Thank you for your answer. I tried MatConvert and it works. I didn't
>     make it work before because I forgot to convert a vector from mpi to
>     mpicuda first.
>
>     For vector, there is no VecConvert to use, so I have to do VecDuplicate,
>     VecSetType and VecCopy. Is there an easier option?
>
> As Matt suggested, you could single out the matrix and vector with
> options prefix and set their type on command line
>
> MatSetOptionsPrefix(A,"mymat");
> VecSetOptionsPrefix(v,"myvec");
>
> Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda
> A simpler code is to have the vector type automatically set by
> MatCreateVecs(A,&v,NULL)
>
>     Chang
>
>     On 10/18/21 5:23 PM, Junchao Zhang wrote:
>     >
>     > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users wrote:
>     >
>     >     Hi Matt,
>     >
>     >     I have a related question. In my code I have many matrices and I only
?want to have one living on GPU, the others still staying on > CPU mem. > > > >? ? ?I wonder if there is an easier way to copy a mpiaij matrix to > >? ? ?mpiaijcusparse (in other words, copy data to GPUs). I can > think of > >? ? ?creating a new mpiaijcusparse matrix, and copying the data > line by > >? ? ?line. > >? ? ?But I wonder if there is a better option. > > > >? ? ?I have tried MatCopy and MatConvert but neither work. > > > > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? > > > > > >? ? ?Chang > > > >? ? ?On 10/17/21 7:50 PM, Matthew Knepley wrote: > >? ? ? > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh > >? ? ? > > > >? ? ? > > >>> wrote: > >? ? ? > > >? ? ? >? ? ?Do I need convert the MATSEQBAIJ?to a cuda matrix in code? > >? ? ? > > >? ? ? > > >? ? ? > You would need a call to MatSetFromOptions() to take that type > >? ? ?from the > >? ? ? > command line, and not have > >? ? ? > the type hard-coded in your application. It is generally a bad > >? ? ?idea to > >? ? ? > hard code the implementation type. > >? ? ? > > >? ? ? >? ? ?If I do it from command line, then are the other > MatVec calls are > >? ? ? >? ? ?ported onto CUDA? I have many MatVec calls in my code, > but I > >? ? ? >? ? ?specifically want to port just one call. > >? ? ? > > >? ? ? > > >? ? ? > You can give that one matrix an options prefix to isolate it. > >? ? ? > > >? ? ? >? ? Thanks, > >? ? ? > > >? ? ? >? ? ? ?Matt > >? ? ? > > >? ? ? >? ? ?Sincerely, > >? ? ? >? ? ?Swarnava > >? ? ? > > >? ? ? >? ? ?On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang > >? ? ? >? ? ? > > >? ? ? >>> > >? ? ?wrote: > >? ? ? > > >? ? ? >? ? ? ? ?You can do that with command line options -mat_type > >? ? ?aijcusparse > >? ? ? >? ? ? ? ?-vec_type cuda > >? ? ? > > >? ? ? >? ? ? ? ?On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh > >? ? ? >? ? ? ? ? > > >? ? ? > >>> wrote: > >? ? ? > > >? ? ? >? ? ? ? ? ? ?Dear Petsc team, > >? ? ? > > >? ? ? >? ? ? ? ? ? ?I had a query regarding using CUDA to > accelerate a matrix > >? ? ? >? ? ? ? ? ? ?vector product. > >? ? ? >? ? ? ? ? ? ?I have a sequential sparse matrix > (MATSEQBAIJ?type). > >? ? ?I want > >? ? ? >? ? ? ? ? ? ?to port a MatVec?call onto GPUs. Is there any > >? ? ?code/example I > >? ? ? >? ? ? ? ? ? ?can look at? > >? ? ? > > >? ? ? >? ? ? ? ? ? ?Sincerely, > >? ? ? >? ? ? ? ? ? ?SG > >? ? ? > > >? ? ? > > >? ? ? > > >? ? ? > -- > >? ? ? > What most experimenters take for granted before they begin > their > >? ? ? > experiments is infinitely more interesting than any > results to which > >? ? ? > their experiments lead. > >? ? ? > -- Norbert Wiener > >? ? ? > > >? ? ? > https://www.cse.buffalo.edu/~knepley/ > > >? ? ? > > >? ? ? > >? ? ? >> > > > >? ? ?-- > >? ? ?Chang Liu > >? ? ?Staff Research Physicist > >? ? ?+1 609 243 3438 > > cliu at pppl.gov > > >? ? ?Princeton Plasma Physics Laboratory > >? ? 
?100 Stellarator Rd, Princeton NJ 08540, USA > > > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -- Chang Liu Staff Research Physicist +1 609 243 3438 cliu at pppl.gov Princeton Plasma Physics Laboratory 100 Stellarator Rd, Princeton NJ 08540, USA From bsmith at petsc.dev Wed Oct 20 12:14:39 2021 From: bsmith at petsc.dev (Barry Smith) Date: Wed, 20 Oct 2021 13:14:39 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <1121151E-D090-42BF-8599-9DF9CCF7DB11@petsc.dev> <4336955e-e338-b503-67eb-c1900e48b593@pppl.gov> <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> <879c30a1-ea85-1c24-4139-268925d511da@pppl.gov> <53D4EDD7-E05C-4485-B7AE-23AB10DD81B1@joliv.et> Message-ID: <968434BC-E8DC-49B0-9119-F208DB1E01B0@petsc.dev> > On Oct 20, 2021, at 12:48 PM, Chang Liu wrote: > > Hi Pierre, > > I have another suggestion for telescope. I have achieved my goal by putting telescope outside bjacobi. But the code still does not work if I use telescope as a pc for subblock. I think the reason is that I want to use cusparse as the solver, which can only deal with seqaij matrix and not mpiaij matrix. This is suppose to work with the recent fixes. The telescope should produce a seq matrix and for each solve map the parallel vector (over the subdomain) automatically down to the one rank with the GPU to solve it on the GPU. It is not clear to me where the process is going wrong. Barry > However, for telescope pc, it can put the matrix into one mpi rank, thus making it a seqaij for factorization stage, but then after factorization it will give the data back to the original comminicator. This will make the matrix back to mpiaij, and then cusparse cannot solve it. > > I think a better option is to do the factorization on CPU with mpiaij, then then transform the preconditioner matrix to seqaij and do the matsolve GPU. But I am not sure if it can be achieved using telescope. > > Regads, > > Chang > > On 10/15/21 5:29 AM, Pierre Jolivet wrote: >> Hi Chang, >> The output you sent with MUMPS looks alright to me, you can see that the MatType is properly set to seqaijcusparse (and not mpiaijcusparse). >> I don?t know what is wrong with -sub_telescope_pc_factor_mat_solver_type cusparse, I don?t have a PETSc installation for testing this, hopefully Barry or Junchao can confirm this wrong behavior and get this fixed. >> As for permuting PCTELESCOPE and PCBJACOBI, in your case, the outer PC will be equivalent, yes. >> However, it would be more efficient to do PCBJACOBI and then PCTELESCOPE. >> PCBJACOBI prunes the operator by basically removing all coefficients outside of the diagonal blocks. >> Then, PCTELESCOPE "groups everything together?. >> If you do it the other way around, PCTELESCOPE will ?group everything together? and then PCBJACOBI will prune the operator. >> So the PCTELESCOPE SetUp will be costly for nothing since some coefficients will be thrown out afterwards in the PCBJACOBI SetUp. >> I hope I?m clear enough, otherwise I can try do draw some pictures. >> Thanks, >> Pierre >>> On 15 Oct 2021, at 4:39 AM, Chang Liu wrote: >>> >>> Hi Pierre and Barry, >>> >>> I think maybe I should use telescope outside bjacobi? 
like this >>> >>> mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type telescope -pc_telescope_reduction_factor 4 -t >>> elescope_pc_type bjacobi -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4 -mat_type aijcusparse -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu -telescope_sub_pc_factor_mat_solve >>> r_type cusparse -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >>> >>> But then I got an error that >>> >>> [0]PETSC ERROR: MatSolverType cusparse does not support matrix type seqaij >>> >>> But the mat type should be aijcusparse. I think telescope change the mat type. >>> >>> Chang >>> >>> On 10/14/21 10:11 PM, Chang Liu wrote: >>>> For comparison, here is the output using mumps instead of cusparse >>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type mumps -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >>>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>>> 1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 >>>> 2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 >>>> 3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 >>>> 4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 >>>> 5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 >>>> 6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 >>>> 7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 >>>> 8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03 >>>> 9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 >>>> 10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 >>>> 11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 >>>> 12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid norm 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 >>>> 13 KSP unpreconditioned resid norm 1.885892030223e-01 true resid norm 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 >>>> 14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid norm 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 >>>> 15 KSP unpreconditioned resid norm 1.638219366731e-01 true resid norm 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 >>>> 16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid norm 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 >>>> 17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid norm 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 >>>> 18 KSP 
unpreconditioned resid norm 1.289673236836e-01 true resid norm 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 >>>> 19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid norm 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 >>>> 20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid norm 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 >>>> 21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid norm 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 >>>> 22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid norm 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 >>>> 23 KSP unpreconditioned resid norm 7.563496650115e-02 true resid norm 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 >>>> 24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid norm 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 >>>> 25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid norm 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 >>>> 26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid norm 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 >>>> 27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid norm 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 >>>> 28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid norm 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 >>>> 29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid norm 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 >>>> 30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid norm 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 >>>> 31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid norm 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 >>>> 32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid norm 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 >>>> 33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid norm 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 >>>> 34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid norm 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 >>>> 35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid norm 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 >>>> 36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid norm 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 >>>> 37 KSP unpreconditioned resid norm 4.982347757923e-03 true resid norm 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 >>>> 38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid norm 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 >>>> 39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid norm 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 >>>> 40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid norm 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 >>>> 41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid norm 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 >>>> 42 KSP unpreconditioned resid norm 2.772928845284e-03 true resid norm 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 >>>> 43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid norm 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 >>>> 44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid norm 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 >>>> 45 KSP unpreconditioned resid norm 2.086800523919e-03 true resid norm 2.086800523919e-03 ||r(i)||/||b|| 
5.197546917701e-05 >>>> 46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid norm 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 >>>> 47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid norm 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 >>>> 48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid norm 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 >>>> 49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid norm 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 >>>> 50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid norm 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 >>>> 51 KSP unpreconditioned resid norm 1.375985053014e-03 true resid norm 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 >>>> 52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid norm 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 >>>> 53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid norm 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 >>>> 54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid norm 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 >>>> 55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid norm 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 >>>> 56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid norm 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 >>>> 57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid norm 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 >>>> 58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid norm 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 >>>> 59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid norm 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 >>>> 60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid norm 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 >>>> 61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid norm 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 >>>> 62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid norm 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 >>>> 63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid norm 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 >>>> 64 KSP unpreconditioned resid norm 2.961034697265e-04 true resid norm 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 >>>> 65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid norm 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 >>>> 66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid norm 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 >>>> 67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid norm 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 >>>> 68 KSP unpreconditioned resid norm 2.008438265031e-04 true resid norm 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 >>>> 69 KSP unpreconditioned resid norm 1.838732863386e-04 true resid norm 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 >>>> 70 KSP unpreconditioned resid norm 1.723786027645e-04 true resid norm 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 >>>> 71 KSP unpreconditioned resid norm 1.580945192204e-04 true resid norm 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 >>>> 72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid norm 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 >>>> 73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid norm 
1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 >>>> 74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid norm 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 >>>> 75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid norm 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 >>>> 76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid norm 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 >>>> 77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid norm 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 >>>> 78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid norm 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 >>>> 79 KSP unpreconditioned resid norm 8.896901670359e-05 true resid norm 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 >>>> 80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid norm 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 >>>> 81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid norm 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 >>>> 82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid norm 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 >>>> 83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid norm 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 >>>> 84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid norm 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 >>>> 85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid norm 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 >>>> 86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid norm 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 >>>> 87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid norm 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 >>>> 88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid norm 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 >>>> 89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid norm 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 >>>> 90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid norm 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 >>>> 91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid norm 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 >>>> 92 KSP unpreconditioned resid norm 1.707071292484e-05 true resid norm 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 >>>> 93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid norm 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 >>>> 94 KSP unpreconditioned resid norm 1.219393542993e-05 true resid norm 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 >>>> 95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid norm 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 >>>> 96 KSP unpreconditioned resid norm 9.099659872548e-06 true resid norm 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 >>>> 97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid norm 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 >>>> 98 KSP unpreconditioned resid norm 7.167226146744e-06 true resid norm 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 >>>> 99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid norm 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 >>>> 100 KSP unpreconditioned resid norm 5.767783600111e-06 true resid norm 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 >>>> 101 KSP unpreconditioned resid norm 
5.261057430584e-06 true resid norm 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 >>>> 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid norm 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 >>>> 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid norm 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 >>>> 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid norm 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 >>>> 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid norm 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 >>>> 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid norm 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 >>>> 107 KSP unpreconditioned resid norm 3.234218880987e-06 true resid norm 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 >>>> 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid norm 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 >>>> 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid norm 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 >>>> 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid norm 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 >>>> 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid norm 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 >>>> 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid norm 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 >>>> 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid norm 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 >>>> 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid norm 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 >>>> 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid norm 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 >>>> 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid norm 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 >>>> 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid norm 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 >>>> 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid norm 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 >>>> 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid norm 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 >>>> 120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid norm 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 >>>> 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid norm 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 >>>> 122 KSP unpreconditioned resid norm 7.945760150897e-07 true resid norm 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 >>>> 123 KSP unpreconditioned resid norm 7.141240839013e-07 true resid norm 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 >>>> 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid norm 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 >>>> 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid norm 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 >>>> 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid norm 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 >>>> 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid norm 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 >>>> 128 KSP unpreconditioned resid norm 4.253264691112e-07 true resid norm 4.253264691948e-07 ||r(i)||/||b|| 
1.059351027394e-08 >>>> 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid norm 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 >>>> 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid norm 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 >>>> 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid norm 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 >>>> 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid norm 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 >>>> 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid norm 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 >>>> 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid norm 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 >>>> 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid norm 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 >>>> 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid norm 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 >>>> 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid norm 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 >>>> 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid norm 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 >>>> 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid norm 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 >>>> 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid norm 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 >>>> 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid norm 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 >>>> 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid norm 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 >>>> 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid norm 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 >>>> 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid norm 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 >>>> 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid norm 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 >>>> 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid norm 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 >>>> 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid norm 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 >>>> 148 KSP unpreconditioned resid norm 5.690132597004e-08 true resid norm 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 >>>> 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid norm 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 >>>> 150 KSP unpreconditioned resid norm 4.625371062660e-08 true resid norm 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 >>>> 151 KSP unpreconditioned resid norm 4.349049084805e-08 true resid norm 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 >>>> 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid norm 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 >>>> 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid norm 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 >>>> 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid norm 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 >>>> 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid norm 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 >>>> 156 KSP unpreconditioned resid norm 
2.128504065757e-08 true resid norm 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 >>>> 157 KSP unpreconditioned resid norm 1.879248809429e-08 true resid norm 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 >>>> 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid norm 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 >>>> 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid norm 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 >>>> 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid norm 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 >>>> 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid norm 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 >>>> 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid norm 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 >>>> 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid norm 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 >>>> 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid norm 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 >>>> 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid norm 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 >>>> 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid norm 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 >>>> 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid norm 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 >>>> 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid norm 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 >>>> 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid norm 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 >>>> 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid norm 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 >>>> 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid norm 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 >>>> 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid norm 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 >>>> 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid norm 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 >>>> 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid norm 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 >>>> 175 KSP unpreconditioned resid norm 3.924177535665e-09 true resid norm 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 >>>> 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid norm 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 >>>> 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid norm 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 >>>> 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid norm 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 >>>> 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid norm 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 >>>> 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid norm 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 >>>> 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid norm 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 >>>> 182 KSP unpreconditioned resid norm 1.983545827983e-09 true resid norm 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 >>>> 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid norm 1.794576224361e-09 ||r(i)||/||b|| 
4.469710457036e-11 >>>> 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid norm 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 >>>> 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid norm 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 >>>> 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid norm 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 >>>> 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid norm 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 >>>> 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid norm 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 >>>> 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid norm 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 >>>> KSP Object: 16 MPI processes >>>> type: fgmres >>>> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement >>>> happy breakdown tolerance 1e-30 >>>> maximum iterations=2000, initial guess is zero >>>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. >>>> right preconditioning >>>> using UNPRECONDITIONED norm type for convergence test >>>> PC Object: 16 MPI processes >>>> type: bjacobi >>>> number of blocks = 4 >>>> Local solver information for first block is in the following KSP and PC objects on rank 0: >>>> Use -ksp_view ::ascii_info_detail to display information for all blocks >>>> KSP Object: (sub_) 4 MPI processes >>>> type: preonly >>>> maximum iterations=10000, initial guess is zero >>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>>> left preconditioning >>>> using NONE norm type for convergence test >>>> PC Object: (sub_) 4 MPI processes >>>> type: telescope >>>> petsc subcomm: parent comm size reduction factor = 4 >>>> petsc subcomm: parent_size = 4 , subcomm_size = 1 >>>> petsc subcomm type = contiguous >>>> linear system matrix = precond matrix: >>>> Mat Object: (sub_) 4 MPI processes >>>> type: mpiaij >>>> rows=40200, cols=40200 >>>> total: nonzeros=199996, allocated nonzeros=203412 >>>> total number of mallocs used during MatSetValues calls=0 >>>> not using I-node (on process 0) routines >>>> setup type: default >>>> Parent DM object: NULL >>>> Sub DM object: NULL >>>> KSP Object: (sub_telescope_) 1 MPI processes >>>> type: preonly >>>> maximum iterations=10000, initial guess is zero >>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>>> left preconditioning >>>> using NONE norm type for convergence test >>>> PC Object: (sub_telescope_) 1 MPI processes >>>> type: lu >>>> out-of-place factorization >>>> tolerance for zero pivot 2.22045e-14 >>>> matrix ordering: external >>>> factor fill ratio given 0., needed 0. 
>>>> Factored matrix follows: >>>> Mat Object: 1 MPI processes >>>> type: mumps >>>> rows=40200, cols=40200 >>>> package used to perform factorization: mumps >>>> total: nonzeros=1849788, allocated nonzeros=1849788 >>>> MUMPS run parameters: >>>> SYM (matrix type): 0 >>>> PAR (host participation): 1 >>>> ICNTL(1) (output for error): 6 >>>> ICNTL(2) (output of diagnostic msg): 0 >>>> ICNTL(3) (output for global info): 0 >>>> ICNTL(4) (level of printing): 0 >>>> ICNTL(5) (input mat struct): 0 >>>> ICNTL(6) (matrix prescaling): 7 >>>> ICNTL(7) (sequential matrix ordering):7 >>>> ICNTL(8) (scaling strategy): 77 >>>> ICNTL(10) (max num of refinements): 0 >>>> ICNTL(11) (error analysis): 0 >>>> ICNTL(12) (efficiency control): 1 >>>> ICNTL(13) (sequential factorization of the root node): 0 >>>> ICNTL(14) (percentage of estimated workspace increase): 20 >>>> ICNTL(18) (input mat struct): 0 >>>> ICNTL(19) (Schur complement info): 0 >>>> ICNTL(20) (RHS sparse pattern): 0 >>>> ICNTL(21) (solution struct): 0 >>>> ICNTL(22) (in-core/out-of-core facility): 0 >>>> ICNTL(23) (max size of memory can be allocated locally):0 >>>> ICNTL(24) (detection of null pivot rows): 0 >>>> ICNTL(25) (computation of a null space basis): 0 >>>> ICNTL(26) (Schur options for RHS or solution): 0 >>>> ICNTL(27) (blocking size for multiple RHS): -32 >>>> ICNTL(28) (use parallel or sequential ordering): 1 >>>> ICNTL(29) (parallel ordering): 0 >>>> ICNTL(30) (user-specified set of entries in inv(A)): 0 >>>> ICNTL(31) (factors is discarded in the solve phase): 0 >>>> ICNTL(33) (compute determinant): 0 >>>> ICNTL(35) (activate BLR based factorization): 0 >>>> ICNTL(36) (choice of BLR factorization variant): 0 >>>> ICNTL(38) (estimated compression rate of LU factors): 333 >>>> CNTL(1) (relative pivoting threshold): 0.01 >>>> CNTL(2) (stopping criterion of refinement): 1.49012e-08 >>>> CNTL(3) (absolute pivoting threshold): 0. >>>> CNTL(4) (value of static pivoting): -1. >>>> CNTL(5) (fixation for null pivots): 0. >>>> CNTL(7) (dropping parameter for BLR): 0. 
>>>> RINFO(1) (local estimated flops for the elimination after analysis): >>>> [0] 1.45525e+08 >>>> RINFO(2) (local estimated flops for the assembly after factorization): >>>> [0] 2.89397e+06 >>>> RINFO(3) (local estimated flops for the elimination after factorization): >>>> [0] 1.45525e+08 >>>> INFO(15) (estimated size of (in MB) MUMPS internal data for running numerical factorization): >>>> [0] 29 >>>> INFO(16) (size of (in MB) MUMPS internal data used during numerical factorization): >>>> [0] 29 >>>> INFO(23) (num of pivots eliminated on this processor after factorization): >>>> [0] 40200 >>>> RINFOG(1) (global estimated flops for the elimination after analysis): 1.45525e+08 >>>> RINFOG(2) (global estimated flops for the assembly after factorization): 2.89397e+06 >>>> RINFOG(3) (global estimated flops for the elimination after factorization): 1.45525e+08 >>>> (RINFOG(12) RINFOG(13))*2^INFOG(34) (determinant): (0.,0.)*(2^0) >>>> INFOG(3) (estimated real workspace for factors on all processors after analysis): 1849788 >>>> INFOG(4) (estimated integer workspace for factors on all processors after analysis): 879986 >>>> INFOG(5) (estimated maximum front size in the complete tree): 282 >>>> INFOG(6) (number of nodes in the complete tree): 23709 >>>> INFOG(7) (ordering option effectively used after analysis): 5 >>>> INFOG(8) (structural symmetry in percent of the permuted matrix after analysis): 100 >>>> INFOG(9) (total real/complex workspace to store the matrix factors after factorization): 1849788 >>>> INFOG(10) (total integer space store the matrix factors after factorization): 879986 >>>> INFOG(11) (order of largest frontal matrix after factorization): 282 >>>> INFOG(12) (number of off-diagonal pivots): 0 >>>> INFOG(13) (number of delayed pivots after factorization): 0 >>>> INFOG(14) (number of memory compress after factorization): 0 >>>> INFOG(15) (number of steps of iterative refinement after solution): 0 >>>> INFOG(16) (estimated size (in MB) of all MUMPS internal data for factorization after analysis: value on the most memory consuming processor): 29 >>>> INFOG(17) (estimated size of all MUMPS internal data for factorization after analysis: sum over all processors): 29 >>>> INFOG(18) (size of all MUMPS internal data allocated during factorization: value on the most memory consuming processor): 29 >>>> INFOG(19) (size of all MUMPS internal data allocated during factorization: sum over all processors): 29 >>>> INFOG(20) (estimated number of entries in the factors): 1849788 >>>> INFOG(21) (size in MB of memory effectively used during factorization - value on the most memory consuming processor): 26 >>>> INFOG(22) (size in MB of memory effectively used during factorization - sum over all processors): 26 >>>> INFOG(23) (after analysis: value of ICNTL(6) effectively used): 0 >>>> INFOG(24) (after analysis: value of ICNTL(12) effectively used): 1 >>>> INFOG(25) (after factorization: number of pivots modified by static pivoting): 0 >>>> INFOG(28) (after factorization: number of null pivots encountered): 0 >>>> INFOG(29) (after factorization: effective number of entries in the factors (sum over all processors)): 1849788 >>>> INFOG(30, 31) (after solution: size in Mbytes of memory used during solution phase): 29, 29 >>>> INFOG(32) (after analysis: type of analysis done): 1 >>>> INFOG(33) (value used for ICNTL(8)): 7 >>>> INFOG(34) (exponent of the determinant if determinant is requested): 0 >>>> INFOG(35) (after factorization: number of entries taking into account BLR factor compression - sum 
over all processors): 1849788 >>>> INFOG(36) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - value on the most memory consuming processor): 0 >>>> INFOG(37) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - sum over all processors): 0 >>>> INFOG(38) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - value on the most memory consuming processor): 0 >>>> INFOG(39) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - sum over all processors): 0 >>>> linear system matrix = precond matrix: >>>> Mat Object: 1 MPI processes >>>> type: seqaijcusparse >>>> rows=40200, cols=40200 >>>> total: nonzeros=199996, allocated nonzeros=199996 >>>> total number of mallocs used during MatSetValues calls=0 >>>> not using I-node routines >>>> linear system matrix = precond matrix: >>>> Mat Object: 16 MPI processes >>>> type: mpiaijcusparse >>>> rows=160800, cols=160800 >>>> total: nonzeros=802396, allocated nonzeros=1608000 >>>> total number of mallocs used during MatSetValues calls=0 >>>> not using I-node (on process 0) routines >>>> Norm of error 9.11684e-07 iterations 189 >>>> Chang >>>> On 10/14/21 10:10 PM, Chang Liu wrote: >>>>> Hi Barry, >>>>> >>>>> No problem. Here is the output. It seems that the resid norm calculation is incorrect. >>>>> >>>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >>>>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>>>> 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>>>> KSP Object: 16 MPI processes >>>>> type: fgmres >>>>> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement >>>>> happy breakdown tolerance 1e-30 >>>>> maximum iterations=2000, initial guess is zero >>>>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. >>>>> right preconditioning >>>>> using UNPRECONDITIONED norm type for convergence test >>>>> PC Object: 16 MPI processes >>>>> type: bjacobi >>>>> number of blocks = 4 >>>>> Local solver information for first block is in the following KSP and PC objects on rank 0: >>>>> Use -ksp_view ::ascii_info_detail to display information for all blocks >>>>> KSP Object: (sub_) 4 MPI processes >>>>> type: preonly >>>>> maximum iterations=10000, initial guess is zero >>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
>>>>> left preconditioning >>>>> using NONE norm type for convergence test >>>>> PC Object: (sub_) 4 MPI processes >>>>> type: telescope >>>>> petsc subcomm: parent comm size reduction factor = 4 >>>>> petsc subcomm: parent_size = 4 , subcomm_size = 1 >>>>> petsc subcomm type = contiguous >>>>> linear system matrix = precond matrix: >>>>> Mat Object: (sub_) 4 MPI processes >>>>> type: mpiaij >>>>> rows=40200, cols=40200 >>>>> total: nonzeros=199996, allocated nonzeros=203412 >>>>> total number of mallocs used during MatSetValues calls=0 >>>>> not using I-node (on process 0) routines >>>>> setup type: default >>>>> Parent DM object: NULL >>>>> Sub DM object: NULL >>>>> KSP Object: (sub_telescope_) 1 MPI processes >>>>> type: preonly >>>>> maximum iterations=10000, initial guess is zero >>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>>>> left preconditioning >>>>> using NONE norm type for convergence test >>>>> PC Object: (sub_telescope_) 1 MPI processes >>>>> type: lu >>>>> out-of-place factorization >>>>> tolerance for zero pivot 2.22045e-14 >>>>> matrix ordering: nd >>>>> factor fill ratio given 5., needed 8.62558 >>>>> Factored matrix follows: >>>>> Mat Object: 1 MPI processes >>>>> type: seqaijcusparse >>>>> rows=40200, cols=40200 >>>>> package used to perform factorization: cusparse >>>>> total: nonzeros=1725082, allocated nonzeros=1725082 >>>>> not using I-node routines >>>>> linear system matrix = precond matrix: >>>>> Mat Object: 1 MPI processes >>>>> type: seqaijcusparse >>>>> rows=40200, cols=40200 >>>>> total: nonzeros=199996, allocated nonzeros=199996 >>>>> total number of mallocs used during MatSetValues calls=0 >>>>> not using I-node routines >>>>> linear system matrix = precond matrix: >>>>> Mat Object: 16 MPI processes >>>>> type: mpiaijcusparse >>>>> rows=160800, cols=160800 >>>>> total: nonzeros=802396, allocated nonzeros=1608000 >>>>> total number of mallocs used during MatSetValues calls=0 >>>>> not using I-node (on process 0) routines >>>>> Norm of error 400.999 iterations 1 >>>>> >>>>> Chang >>>>> >>>>> >>>>> On 10/14/21 9:47 PM, Barry Smith wrote: >>>>>> >>>>>> Chang, >>>>>> >>>>>> Sorry I did not notice that one. Please run that with -ksp_view -ksp_monitor_true_residual so we can see exactly how options are interpreted and solver used. At a glance it looks ok but something must be wrong to get the wrong answer. >>>>>> >>>>>> Barry >>>>>> >>>>>>> On Oct 14, 2021, at 6:02 PM, Chang Liu wrote: >>>>>>> >>>>>>> Hi Barry, >>>>>>> >>>>>>> That is exactly what I was doing in the second example, in which the preconditioner works but the GMRES does not. >>>>>>> >>>>>>> Chang >>>>>>> >>>>>>> On 10/14/21 5:15 PM, Barry Smith wrote: >>>>>>>> You need to use the PCTELESCOPE inside the block Jacobi, not outside it. So something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu >>>>>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: >>>>>>>>> >>>>>>>>> Hi Pierre, >>>>>>>>> >>>>>>>>> I wonder if the trick of PCTELESCOPE only works for preconditioner and not for the solver. I have done some tests, and find that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work. 
>>>>>>>>> The command line options I used for the small matrix are
>>>>>>>>>
>>>>>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4
>>>>>>>>>
>>>>>>>>> which gives the correct output. For the iterative solver, I tried
>>>>>>>>>
>>>>>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20
>>>>>>>>>
>>>>>>>>> for the large matrix. The output is
>>>>>>>>>
>>>>>>>>> 0 KSP Residual norm 40.1497
>>>>>>>>> 1 KSP Residual norm < 1.e-11
>>>>>>>>> Norm of error 400.999 iterations 1
>>>>>>>>>
>>>>>>>>> So it seems to call a direct solver instead of an iterative one.
>>>>>>>>>
>>>>>>>>> Can you please help check these options?
>>>>>>>>>
>>>>>>>>> Chang
>>>>>>>>>
>>>>>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote:
>>>>>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds like exactly what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaijcusparse? Or do I have to do it manually?
>>>>>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat().
>>>>>>>>>> 1) I'm not sure this is implemented for cuSparse matrices, but it should be;
>>>>>>>>>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually "smart" enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning.
>>>>>>>>>> If you try this out and it does not work, please provide the backtrace (probably something like "Operation XYZ not implemented for MatType ABC"), and hopefully someone can add the missing plumbing.
>>>>>>>>>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve.
>>>>>>>>>> Thanks,
>>>>>>>>>> Pierre
>>>>>>>>>>> Chang
>>>>>>>>>>>
>>>>>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote:
>>>>>>>>>>>> Maybe I'm missing something, but can't you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block?
>>>>>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu
>>>>>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads, because not only does the Mat need to be redistributed, the secondary processes also need to be "converted" to OpenMP threads.
>>>>>>>>>>>> Thus the need for specific code in mumps.c.
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Pierre
>>>>>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Junchao,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes that is what I want.
>>>>>>>>>>>>> >>>>>>>>>>>>> Chang >>>>>>>>>>>>> >>>>>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: >>>>>>>>>>>>>> Junchao, >>>>>>>>>>>>>> If I understand correctly Chang is using the block Jacobi >>>>>>>>>>>>>> method with a single block for a number of MPI ranks and a direct >>>>>>>>>>>>>> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which >>>>>>>>>>>>>> is code Hong Zhang wrote a number of years ago for CPUs. For their >>>>>>>>>>>>>> particular problems this preconditioner works well, but using an >>>>>>>>>>>>>> iterative solver on the blocks does not work well. >>>>>>>>>>>>>> If we had complete MPI-GPU direct solvers he could just use >>>>>>>>>>>>>> the current code with MPIAIJCUSPARSE on each block but since we do >>>>>>>>>>>>>> not he would like to use a single GPU for each block, this means >>>>>>>>>>>>>> that diagonal blocks of the global parallel MPI matrix needs to be >>>>>>>>>>>>>> sent to a subset of the GPUs (one GPU per block, which has multiple >>>>>>>>>>>>>> MPI ranks associated with the blocks). Similarly for the triangular >>>>>>>>>>>>>> solves the blocks of the right hand side needs to be shipped to the >>>>>>>>>>>>>> appropriate GPU and the resulting solution shipped back to the >>>>>>>>>>>>>> multiple GPUs. So Chang is absolutely correct, this is somewhat like >>>>>>>>>>>>>> your code for MUMPS with OpenMP. OK, I now understand the background.. >>>>>>>>>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the >>>>>>>>>>>>>> MPI ranks and then shrink each block down to a single GPU but this >>>>>>>>>>>>>> would be pretty inefficient, ideally one would go directly from the >>>>>>>>>>>>>> big MPI matrix on all the GPUs to the sub matrices on the subset of >>>>>>>>>>>>>> GPUs. But this may be a large coding project. >>>>>>>>>>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. >>>>>>>>>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. >>>>>>>>>>>>>> Barry >>>>>>>>>>>>>> Since the matrices being factored and solved directly are relatively >>>>>>>>>>>>>> large it is possible that the cusparse code could be reasonably >>>>>>>>>>>>>> efficient (they are not the tiny problems one gets at the coarse >>>>>>>>>>>>>> level of multigrid). Of course, this is speculation, I don't >>>>>>>>>>>>>> actually know how much better the cusparse code would be on the >>>>>>>>>>>>>> direct solver than a good CPU direct sparse solver. >>>>>>>>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>>>>>>>>> > wrote: >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > Sorry I am not familiar with the details either. Can you please >>>>>>>>>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > Chang >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>>>>>>>>>> >> Hi Chang, >>>>>>>>>>>>>> >> I did the work in mumps. It is easy for me to understand >>>>>>>>>>>>>> gathering matrix rows to one process. >>>>>>>>>>>>>> >> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that? 
>>>>>>>>>>>>>> >> Thanks >>>>>>>>>>>>>> >> --Junchao Zhang >>>>>>>>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users >>>>>>>>>>>>>> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >> Hi Barry, >>>>>>>>>>>>>> >> I think mumps solver in petsc does support that. You can >>>>>>>>>>>>>> check the >>>>>>>>>>>>>> >> documentation on "-mat_mumps_use_omp_threads" at >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>>>>>>>>>> >>>>>>>>>>>>>> >> >>>>>>>>>>>>> > >>>>>>>>>>>>>> >> and the code enclosed by #if >>>>>>>>>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>>>>>>>>>> >> functions MatMumpsSetUpDistRHSInfo and >>>>>>>>>>>>>> >> MatMumpsGatherNonzerosOnMaster in >>>>>>>>>>>>>> >> mumps.c >>>>>>>>>>>>>> >> 1. I understand it is ideal to do one MPI rank per GPU. >>>>>>>>>>>>>> However, I am >>>>>>>>>>>>>> >> working on an existing code that was developed based on MPI >>>>>>>>>>>>>> and the the >>>>>>>>>>>>>> >> # of mpi ranks is typically equal to # of cpu cores. We don't >>>>>>>>>>>>>> want to >>>>>>>>>>>>>> >> change the whole structure of the code. >>>>>>>>>>>>>> >> 2. What you have suggested has been coded in mumps.c. See >>>>>>>>>>>>>> function >>>>>>>>>>>>>> >> MatMumpsSetUpDistRHSInfo. >>>>>>>>>>>>>> >> Regards, >>>>>>>>>>>>>> >> Chang >>>>>>>>>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>>>>>>>>>> >>>>>>>>>>>>>> >> >> wrote: >>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>> >> >> Hi Barry, >>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>> >> >> That is exactly what I want. >>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>> >> >> Back to my original question, I am looking for an approach to >>>>>>>>>>>>>> >> transfer >>>>>>>>>>>>>> >> >> matrix >>>>>>>>>>>>>> >> >> data from many MPI processes to "master" MPI >>>>>>>>>>>>>> >> >> processes, each of which taking care of one GPU, and then >>>>>>>>>>>>>> upload >>>>>>>>>>>>>> >> the data to GPU to >>>>>>>>>>>>>> >> >> solve. >>>>>>>>>>>>>> >> >> One can just grab some codes from mumps.c to >>>>>>>>>>>>>> aijcusparse.cu >>>>>>>>>>>>>> >> >. >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > mumps.c doesn't actually do that. It never needs to >>>>>>>>>>>>>> copy the >>>>>>>>>>>>>> >> entire matrix to a single MPI rank. >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > It would be possible to write such a code that you >>>>>>>>>>>>>> suggest but >>>>>>>>>>>>>> >> it is not clear that it makes sense >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > 1) For normal PETSc GPU usage there is one GPU per MPI >>>>>>>>>>>>>> rank, so >>>>>>>>>>>>>> >> while your one GPU per big domain is solving its systems the >>>>>>>>>>>>>> other >>>>>>>>>>>>>> >> GPUs (with the other MPI ranks that share that domain) are doing >>>>>>>>>>>>>> >> nothing. >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > 2) For each triangular solve you would have to gather the >>>>>>>>>>>>>> right >>>>>>>>>>>>>> >> hand side from the multiple ranks to the single GPU to pass it to >>>>>>>>>>>>>> >> the GPU solver and then scatter the resulting solution back >>>>>>>>>>>>>> to all >>>>>>>>>>>>>> >> of its subdomain ranks. >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > What I was suggesting was assign an entire subdomain to a >>>>>>>>>>>>>> >> single MPI rank, thus it does everything on one GPU and can >>>>>>>>>>>>>> use the >>>>>>>>>>>>>> >> GPU solver directly. 
If all the major computations of a subdomain >>>>>>>>>>>>>> >> can fit and be done on a single GPU then you would be >>>>>>>>>>>>>> utilizing all >>>>>>>>>>>>>> >> the GPUs you are using effectively. >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > Barry >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>> >> >> Chang >>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: >>>>>>>>>>>>>> >> >>> Chang, >>>>>>>>>>>>>> >> >>> You are correct there is no MPI + GPU direct >>>>>>>>>>>>>> solvers that >>>>>>>>>>>>>> >> currently do the triangular solves with MPI + GPU parallelism >>>>>>>>>>>>>> that I >>>>>>>>>>>>>> >> am aware of. You are limited that individual triangular solves be >>>>>>>>>>>>>> >> done on a single GPU. I can only suggest making each subdomain as >>>>>>>>>>>>>> >> big as possible to utilize each GPU as much as possible for the >>>>>>>>>>>>>> >> direct triangular solves. >>>>>>>>>>>>>> >> >>> Barry >>>>>>>>>>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>> >> >>>> Hi Mark, >>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>> >> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with >>>>>>>>>>>>>> other >>>>>>>>>>>>>> >> solvers, but with -pc_factor_mat_solver_type cusparse, it >>>>>>>>>>>>>> will give >>>>>>>>>>>>>> >> an error. >>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>> >> >>>> Yes what I want is to have mumps or superlu to do the >>>>>>>>>>>>>> >> factorization, and then do the rest, including GMRES solver, >>>>>>>>>>>>>> on gpu. >>>>>>>>>>>>>> >> Is that possible? >>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>> >> >>>> I have tried to use aijcusparse with superlu_dist, it >>>>>>>>>>>>>> runs but >>>>>>>>>>>>>> >> the iterative solver is still running on CPUs. I have >>>>>>>>>>>>>> contacted the >>>>>>>>>>>>>> >> superlu group and they confirmed that is the case right now. >>>>>>>>>>>>>> But if >>>>>>>>>>>>>> >> I set -pc_factor_mat_solver_type cusparse, it seems that the >>>>>>>>>>>>>> >> iterative solver is running on GPU. >>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>> >> >>>> Chang >>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: >>>>>>>>>>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu >>>>>>>>>>>>>> >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >>>>>>>>>>>>>> >> >>> wrote: >>>>>>>>>>>>>> >> >>>>> Thank you Junchao for explaining this. I guess in >>>>>>>>>>>>>> my case >>>>>>>>>>>>>> >> the code is >>>>>>>>>>>>>> >> >>>>> just calling a seq solver like superlu to do >>>>>>>>>>>>>> >> factorization on GPUs. >>>>>>>>>>>>>> >> >>>>> My idea is that I want to have a traditional MPI >>>>>>>>>>>>>> code to >>>>>>>>>>>>>> >> utilize GPUs >>>>>>>>>>>>>> >> >>>>> with cusparse. Right now cusparse does not support >>>>>>>>>>>>>> mpiaij >>>>>>>>>>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' will give you an >>>>>>>>>>>>>> >> mpiaijcusparse matrix with > 1 processes. >>>>>>>>>>>>>> >> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). >>>>>>>>>>>>>> >> >>>>> However, I see in grepping the repo that all the mumps and >>>>>>>>>>>>>> >> superlu tests use aij or sell matrix type. >>>>>>>>>>>>>> >> >>>>> MUMPS and SuperLU provide their own solves, I assume >>>>>>>>>>>>>> .... but >>>>>>>>>>>>>> >> you might want to do other matrix operations on the GPU. Is >>>>>>>>>>>>>> that the >>>>>>>>>>>>>> >> issue? 
>>>>>>>>>>>>>> >> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or >>>>>>>>>>>>>> SuperLU >>>>>>>>>>>>>> >> have a problem? (no test with it so it probably does not work) >>>>>>>>>>>>>> >> >>>>> Thanks, >>>>>>>>>>>>>> >> >>>>> Mark >>>>>>>>>>>>>> >> >>>>> so I >>>>>>>>>>>>>> >> >>>>> want the code to have a mpiaij matrix when adding >>>>>>>>>>>>>> all the >>>>>>>>>>>>>> >> matrix terms, >>>>>>>>>>>>>> >> >>>>> and then transform the matrix to seqaij when doing the >>>>>>>>>>>>>> >> factorization >>>>>>>>>>>>>> >> >>>>> and >>>>>>>>>>>>>> >> >>>>> solve. This involves sending the data to the master >>>>>>>>>>>>>> >> process, and I >>>>>>>>>>>>>> >> >>>>> think >>>>>>>>>>>>>> >> >>>>> the petsc mumps solver have something similar already. >>>>>>>>>>>>>> >> >>>>> Chang >>>>>>>>>>>>>> >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote: >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> >> >>>>> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> >>>>>>>>>>>>> > >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >>>> wrote: >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu >>>>>>>>>>>>>> >> >>>>>>>>>>>>> > >>>>>>>>>>>>>> >> >>>>> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >>>> wrote: >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > Hi Mark, >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > The option I use is like >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > -pc_type bjacobi -pc_bjacobi_blocks 16 >>>>>>>>>>>>>> >> -ksp_type fgmres >>>>>>>>>>>>>> >> >>>>> -mat_type >>>>>>>>>>>>>> >> >>>>> > aijcusparse *-sub_pc_factor_mat_solver_type >>>>>>>>>>>>>> >> cusparse >>>>>>>>>>>>>> >> >>>>> *-sub_ksp_type >>>>>>>>>>>>>> >> >>>>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 >>>>>>>>>>>>>> >> -ksp_rtol 1.e-300 >>>>>>>>>>>>>> >> >>>>> > -ksp_atol 1.e-300 >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > Note, If you use -log_view the last column >>>>>>>>>>>>>> (rows >>>>>>>>>>>>>> >> are the >>>>>>>>>>>>>> >> >>>>> method like >>>>>>>>>>>>>> >> >>>>> > MatFactorNumeric) has the percent of work >>>>>>>>>>>>>> in the GPU. >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > Junchao: *This* implies that we have a >>>>>>>>>>>>>> cuSparse LU >>>>>>>>>>>>>> >> >>>>> factorization. Is >>>>>>>>>>>>>> >> >>>>> > that correct? (I don't think we do) >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > No, we don't have cuSparse LU factorization. If you check >>>>>>>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >>>>>>>>>>>>>> find it >>>>>>>>>>>>>> >> calls >>>>>>>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJ() instead. >>>>>>>>>>>>>> >> >>>>> > So I don't understand Chang's idea. Do you want to >>>>>>>>>>>>>> >> make bigger >>>>>>>>>>>>>> >> >>>>> blocks? >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > I think this one do both factorization and >>>>>>>>>>>>>> >> solve on gpu. >>>>>>>>>>>>>> >> >>>>> > >>>>>>>>>>>>>> >> >>>>> > You can check the >>>>>>>>>>>>>> runex72_aijcusparse.sh file >>>>>>>>>>>>>> >> in petsc >>>>>>>>>>>>>> >> >>>>> install >>>>>>>>>>>>>> >> >>>>> > directory, and try it your self (this >>>>>>>>>>>>>> is only lu >>>>>>>>>>>>>> >> >>>>> factorization >>>>>>>>>>>>>> >> >>>>> > without >>>>>>>>>>>>>> >> >>>>> > iterative solve). 
>>>>>>>>>>>>>> >> >>>>> >
>>>>>>>>>>>>>> >> >>>>> > Chang
>>>>>>>>>>>>>> >> >>>>> >
>>>>>>>>>>>>>> >> >>>>> > On 10/12/21 1:17 PM, Mark Adams wrote:
>>>>>>>>>>>>>> >> >>>>> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu wrote:
>>>>>>>>>>>>>> >> >>>>> > > > Hi Junchao,
>>>>>>>>>>>>>> >> >>>>> > > >
>>>>>>>>>>>>>> >> >>>>> > > > No, I only need it to be transferred within a node. I use the block-Jacobi method and GMRES to solve the sparse matrix, so each direct solver will take care of a sub-block of the whole matrix. In this way, I can use one GPU to solve one sub-block, which is stored within one node.
>>>>>>>>>>>>>> >> >>>>> > > >
>>>>>>>>>>>>>> >> >>>>> > > > It was stated in the documentation that the cusparse solver is slow. However, in my test using ex72.c, the cusparse solver is faster than mumps or superlu_dist on CPUs.
>>>>>>>>>>>>>> >> >>>>> > >
>>>>>>>>>>>>>> >> >>>>> > > Are we talking about the factorization, the solve, or both?
>>>>>>>>>>>>>> >> >>>>> > >
>>>>>>>>>>>>>> >> >>>>> > > We do not have an interface to cuSparse's LU factorization (I just learned that it exists a few weeks ago).
>>>>>>>>>>>>>> >> >>>>> > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse'? This would be the CPU factorization, which is the dominant cost.
>>>>>>>>>>>>>> >> >>>>> > > >
>>>>>>>>>>>>>> >> >>>>> > > > Chang
>>>>>>>>>>>>>> >> >>>>> > > >
>>>>>>>>>>>>>> >> >>>>> > > > On 10/12/21 10:24 AM, Junchao Zhang wrote:
>>>>>>>>>>>>>> >> >>>>> > > > > Hi, Chang,
>>>>>>>>>>>>>> >> >>>>> > > > > For the mumps solver, we usually transfer matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right?
>>>>>>>>>>>>>> >> >>>>> > > > >
>>>>>>>>>>>>>> >> >>>>> > > > > Mark, I remember you said the cusparse solve is slow and you would rather do it on CPU. Is that right?
>>>>>>>>>>>>>> >> >>>>> > > > >
>>>>>>>>>>>>>> >> >>>>> > > > > --Junchao Zhang
>>>>>>>>>>>>>> >> >>>>> > > > >
>>>>>>>>>>>>>> >> >>>>> > > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users wrote:
>>>>>>>>>>>>>> >> >>>>> > > > > > Hi,
>>>>>>>>>>>>>> >> >>>>> > > > > >
>>>>>>>>>>>>>> >> >>>>> > > > > > Currently, it is possible to use the mumps solver in PETSc with the -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then the master rank will call mumps with OpenMP to solve the matrix.
>>>>>>>>>>>>>> >> >>>>> > > > > >
>>>>>>>>>>>>>> >> >>>>> > > > > > I wonder if someone can develop a similar option for the cusparse solver. Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to GPU to solve. In this way, one can use the cusparse solver for an MPI program.
>>>>>>>>>>>>>> >> >>>>> > > > > >
>>>>>>>>>>>>>> >> >>>>> > > > > > Chang
>>>>>>>>>>>>>> >> >>>>> > > > > > --
>>>>>>>>>>>>>> >> >>>>> > > > > > Chang Liu
>>>>>>>>>>>>>> >> >>>>> > > > > > Staff Research Physicist
>>>>>>>>>>>>>> >> >>>>> > > > > > +1 609 243 3438
>>>>>>>>>>>>>> >> >>>>> > > > > > cliu at pppl.gov
>>>>>>>>>>>>>> >> >>>>> > > > > > Princeton Plasma Physics Laboratory
>>>>>>>>>>>>>> >> >>>>> > > > > > 100 Stellarator Rd, Princeton NJ 08540, USA
>
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> cliu at pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA

From cliu at pppl.gov Wed Oct 20 13:47:12 2021
From: cliu at pppl.gov (Chang Liu)
Date: Wed, 20 Oct 2021 14:47:12 -0400
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: <968434BC-E8DC-49B0-9119-F208DB1E01B0@petsc.dev>
References: <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov>
 <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev>
 <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov>
 <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov>
 <879c30a1-ea85-1c24-4139-268925d511da@pppl.gov>
 <53D4EDD7-E05C-4485-B7AE-23AB10DD81B1@joliv.et>
 <968434BC-E8DC-49B0-9119-F208DB1E01B0@petsc.dev>
Message-ID: <7a3d5347-f2da-b4a9-f44a-aa534a314c7f@pppl.gov>

Hi Barry,

Are the fixes merged into master? I was using bjacobi as a preconditioner. Using the latest version of petsc, I found that by calling

mpiexec -n 32 --oversubscribe ./ex7 -m 1000 -ksp_view -ksp_monitor_true_residual -ksp_type fgmres -pc_type bjacobi -pc_bjacobi_blocks 4 -sub_ksp_type preonly -sub_pc_type telescope -sub_pc_telescope_reduction_factor 8 -sub_pc_telescope_subcomm_type contiguous -sub_telescope_pc_type lu -sub_telescope_ksp_type preonly -sub_telescope_pc_factor_mat_solver_type mumps -ksp_max_it 2000 -ksp_rtol 1.e-30 -ksp_atol 1.e-30

the code is calling PCApply_BJacobi_Multiproc.
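For reference, the bjacobi-outside/telescope-inside layout in that first command can also be set up in code, following the sub-KSP pattern of ex7.c. This is only an untested sketch of what I believe those options correspond to; the helper name solve_with_bjacobi_telescope is made up, error checking is minimal, and the inner LU/MUMPS choice is still expected to come from the -sub_telescope_* command line options.

#include <petscksp.h>

/* Sketch only (untested): "solve_with_bjacobi_telescope" is a made-up helper,
   not part of PETSc or ex7.c. A is assumed to be an assembled parallel Mat. */
PetscErrorCode solve_with_bjacobi_telescope(Mat A, Vec b, Vec x)
{
  KSP            ksp, *subksp;
  PC             pc, subpc;
  PetscInt       i, nlocal, first;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = KSPCreate(PetscObjectComm((PetscObject)A), &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPFGMRES);CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCBJACOBI);CHKERRQ(ierr);
  ierr = PCBJacobiSetTotalBlocks(pc, 4, NULL);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* -sub_telescope_* options still apply */
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);           /* sub KSPs exist only after setup */

  /* Each (possibly multi-rank) block gets preonly + PCTELESCOPE; telescope
     gathers the block onto one rank of its sub-communicator, where the
     -sub_telescope_pc_type lu (mumps) factorization and solves are done. */
  ierr = PCBJacobiGetSubKSP(pc, &nlocal, &first, &subksp);CHKERRQ(ierr);
  for (i = 0; i < nlocal; i++) {
    ierr = KSPSetType(subksp[i], KSPPREONLY);CHKERRQ(ierr);
    ierr = KSPGetPC(subksp[i], &subpc);CHKERRQ(ierr);
    ierr = PCSetType(subpc, PCTELESCOPE);CHKERRQ(ierr);
    ierr = PCTelescopeSetReductionFactor(subpc, 8);CHKERRQ(ierr);
  }
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

With 32 ranks, 4 blocks and a reduction factor of 8, each block lives on 8 ranks and PCTELESCOPE gathers it onto one of them for the factorization and the triangular solves.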
If I use

mpiexec -n 32 --oversubscribe ./ex7 -m 1000 -ksp_view -ksp_monitor_true_residual -telescope_ksp_monitor_true_residual -ksp_type preonly -pc_type telescope -pc_telescope_reduction_factor 8 -pc_telescope_subcomm_type contiguous -telescope_pc_type bjacobi -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4 -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu -telescope_sub_pc_factor_mat_solver_type mumps -telescope_ksp_max_it 2000 -telescope_ksp_rtol 1.e-30 -telescope_ksp_atol 1.e-30

the code is calling PCApply_BJacobi_Singleblock. You can test it yourself.

Regards,

Chang

On 10/20/21 1:14 PM, Barry Smith wrote:
>
>
>> On Oct 20, 2021, at 12:48 PM, Chang Liu wrote:
>>
>> Hi Pierre,
>>
>> I have another suggestion for telescope. I have achieved my goal by putting telescope outside bjacobi. But the code still does not work if I use telescope as a pc for the subblock. I think the reason is that I want to use cusparse as the solver, which can only deal with a seqaij matrix and not an mpiaij matrix.
>
> This is supposed to work with the recent fixes. The telescope should produce a seq matrix and, for each solve, map the parallel vector (over the subdomain) automatically down to the one rank with the GPU to solve it on the GPU. It is not clear to me where the process is going wrong.
>
>    Barry
>
>> However, for the telescope pc, it can put the matrix onto one mpi rank, thus making it a seqaij for the factorization stage, but then after factorization it will give the data back to the original communicator. This will make the matrix mpiaij again, and then cusparse cannot solve it.
>>
>> I think a better option is to do the factorization on CPU with mpiaij, and then transform the preconditioner matrix to seqaij and do the matsolve on the GPU. But I am not sure if it can be achieved using telescope.
>>
>> Regards,
>>
>> Chang
>>
>> On 10/15/21 5:29 AM, Pierre Jolivet wrote:
>>> Hi Chang,
>>> The output you sent with MUMPS looks alright to me, you can see that the MatType is properly set to seqaijcusparse (and not mpiaijcusparse).
>>> I don't know what is wrong with -sub_telescope_pc_factor_mat_solver_type cusparse, I don't have a PETSc installation for testing this, hopefully Barry or Junchao can confirm this wrong behavior and get it fixed.
>>> As for permuting PCTELESCOPE and PCBJACOBI, in your case, the outer PC will be equivalent, yes.
>>> However, it would be more efficient to do PCBJACOBI and then PCTELESCOPE.
>>> PCBJACOBI prunes the operator by basically removing all coefficients outside of the diagonal blocks.
>>> Then, PCTELESCOPE "groups everything together".
>>> If you do it the other way around, PCTELESCOPE will "group everything together" and then PCBJACOBI will prune the operator.
>>> So the PCTELESCOPE SetUp will be costly for nothing, since some coefficients will be thrown out afterwards in the PCBJACOBI SetUp.
>>> I hope I'm clear enough, otherwise I can try to draw some pictures.
>>> Thanks,
>>> Pierre
>>>> On 15 Oct 2021, at 4:39 AM, Chang Liu wrote:
>>>>
>>>> Hi Pierre and Barry,
>>>>
>>>> I think maybe I should use telescope outside bjacobi?
like this >>>> >>>> mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type telescope -pc_telescope_reduction_factor 4 -t >>>> elescope_pc_type bjacobi -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4 -mat_type aijcusparse -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu -telescope_sub_pc_factor_mat_solve >>>> r_type cusparse -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >>>> >>>> But then I got an error that >>>> >>>> [0]PETSC ERROR: MatSolverType cusparse does not support matrix type seqaij >>>> >>>> But the mat type should be aijcusparse. I think telescope change the mat type. >>>> >>>> Chang >>>> >>>> On 10/14/21 10:11 PM, Chang Liu wrote: >>>>> For comparison, here is the output using mumps instead of cusparse >>>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type mumps -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >>>>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>>>> 1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 >>>>> 2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 >>>>> 3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 >>>>> 4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 >>>>> 5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 >>>>> 6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 >>>>> 7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 >>>>> 8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03 >>>>> 9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 >>>>> 10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 >>>>> 11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 >>>>> 12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid norm 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 >>>>> 13 KSP unpreconditioned resid norm 1.885892030223e-01 true resid norm 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 >>>>> 14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid norm 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 >>>>> 15 KSP unpreconditioned resid norm 1.638219366731e-01 true resid norm 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 >>>>> 16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid norm 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 >>>>> 17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid norm 1.349906937321e-01 
||r(i)||/||b|| 3.362182710248e-03 >>>>> 18 KSP unpreconditioned resid norm 1.289673236836e-01 true resid norm 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 >>>>> 19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid norm 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 >>>>> 20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid norm 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 >>>>> 21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid norm 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 >>>>> 22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid norm 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 >>>>> 23 KSP unpreconditioned resid norm 7.563496650115e-02 true resid norm 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 >>>>> 24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid norm 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 >>>>> 25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid norm 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 >>>>> 26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid norm 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 >>>>> 27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid norm 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 >>>>> 28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid norm 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 >>>>> 29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid norm 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 >>>>> 30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid norm 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 >>>>> 31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid norm 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 >>>>> 32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid norm 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 >>>>> 33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid norm 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 >>>>> 34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid norm 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 >>>>> 35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid norm 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 >>>>> 36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid norm 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 >>>>> 37 KSP unpreconditioned resid norm 4.982347757923e-03 true resid norm 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 >>>>> 38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid norm 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 >>>>> 39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid norm 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 >>>>> 40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid norm 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 >>>>> 41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid norm 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 >>>>> 42 KSP unpreconditioned resid norm 2.772928845284e-03 true resid norm 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 >>>>> 43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid norm 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 >>>>> 44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid norm 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 >>>>> 45 KSP unpreconditioned resid 
norm 2.086800523919e-03 true resid norm 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 >>>>> 46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid norm 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 >>>>> 47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid norm 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 >>>>> 48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid norm 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 >>>>> 49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid norm 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 >>>>> 50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid norm 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 >>>>> 51 KSP unpreconditioned resid norm 1.375985053014e-03 true resid norm 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 >>>>> 52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid norm 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 >>>>> 53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid norm 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 >>>>> 54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid norm 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 >>>>> 55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid norm 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 >>>>> 56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid norm 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 >>>>> 57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid norm 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 >>>>> 58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid norm 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 >>>>> 59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid norm 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 >>>>> 60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid norm 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 >>>>> 61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid norm 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 >>>>> 62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid norm 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 >>>>> 63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid norm 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 >>>>> 64 KSP unpreconditioned resid norm 2.961034697265e-04 true resid norm 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 >>>>> 65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid norm 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 >>>>> 66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid norm 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 >>>>> 67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid norm 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 >>>>> 68 KSP unpreconditioned resid norm 2.008438265031e-04 true resid norm 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 >>>>> 69 KSP unpreconditioned resid norm 1.838732863386e-04 true resid norm 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 >>>>> 70 KSP unpreconditioned resid norm 1.723786027645e-04 true resid norm 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 >>>>> 71 KSP unpreconditioned resid norm 1.580945192204e-04 true resid norm 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 >>>>> 72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid norm 1.476687469671e-04 
||r(i)||/||b|| 3.677952117812e-06 >>>>> 73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid norm 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 >>>>> 74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid norm 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 >>>>> 75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid norm 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 >>>>> 76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid norm 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 >>>>> 77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid norm 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 >>>>> 78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid norm 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 >>>>> 79 KSP unpreconditioned resid norm 8.896901670359e-05 true resid norm 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 >>>>> 80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid norm 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 >>>>> 81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid norm 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 >>>>> 82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid norm 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 >>>>> 83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid norm 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 >>>>> 84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid norm 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 >>>>> 85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid norm 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 >>>>> 86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid norm 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 >>>>> 87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid norm 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 >>>>> 88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid norm 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 >>>>> 89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid norm 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 >>>>> 90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid norm 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 >>>>> 91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid norm 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 >>>>> 92 KSP unpreconditioned resid norm 1.707071292484e-05 true resid norm 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 >>>>> 93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid norm 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 >>>>> 94 KSP unpreconditioned resid norm 1.219393542993e-05 true resid norm 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 >>>>> 95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid norm 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 >>>>> 96 KSP unpreconditioned resid norm 9.099659872548e-06 true resid norm 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 >>>>> 97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid norm 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 >>>>> 98 KSP unpreconditioned resid norm 7.167226146744e-06 true resid norm 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 >>>>> 99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid norm 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 >>>>> 100 KSP unpreconditioned resid 
norm 5.767783600111e-06 true resid norm 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 >>>>> 101 KSP unpreconditioned resid norm 5.261057430584e-06 true resid norm 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 >>>>> 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid norm 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 >>>>> 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid norm 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 >>>>> 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid norm 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 >>>>> 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid norm 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 >>>>> 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid norm 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 >>>>> 107 KSP unpreconditioned resid norm 3.234218880987e-06 true resid norm 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 >>>>> 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid norm 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 >>>>> 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid norm 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 >>>>> 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid norm 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 >>>>> 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid norm 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 >>>>> 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid norm 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 >>>>> 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid norm 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 >>>>> 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid norm 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 >>>>> 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid norm 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 >>>>> 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid norm 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 >>>>> 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid norm 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 >>>>> 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid norm 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 >>>>> 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid norm 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 >>>>> 120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid norm 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 >>>>> 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid norm 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 >>>>> 122 KSP unpreconditioned resid norm 7.945760150897e-07 true resid norm 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 >>>>> 123 KSP unpreconditioned resid norm 7.141240839013e-07 true resid norm 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 >>>>> 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid norm 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 >>>>> 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid norm 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 >>>>> 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid norm 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 >>>>> 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid norm 
4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 >>>>> 128 KSP unpreconditioned resid norm 4.253264691112e-07 true resid norm 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 >>>>> 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid norm 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 >>>>> 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid norm 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 >>>>> 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid norm 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 >>>>> 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid norm 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 >>>>> 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid norm 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 >>>>> 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid norm 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 >>>>> 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid norm 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 >>>>> 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid norm 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 >>>>> 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid norm 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 >>>>> 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid norm 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 >>>>> 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid norm 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 >>>>> 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid norm 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 >>>>> 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid norm 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 >>>>> 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid norm 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 >>>>> 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid norm 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 >>>>> 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid norm 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 >>>>> 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid norm 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 >>>>> 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid norm 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 >>>>> 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid norm 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 >>>>> 148 KSP unpreconditioned resid norm 5.690132597004e-08 true resid norm 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 >>>>> 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid norm 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 >>>>> 150 KSP unpreconditioned resid norm 4.625371062660e-08 true resid norm 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 >>>>> 151 KSP unpreconditioned resid norm 4.349049084805e-08 true resid norm 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 >>>>> 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid norm 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 >>>>> 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid norm 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 >>>>> 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid norm 2.892726348583e-08 ||r(i)||/||b|| 
7.204848160858e-10 >>>>> 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid norm 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 >>>>> 156 KSP unpreconditioned resid norm 2.128504065757e-08 true resid norm 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 >>>>> 157 KSP unpreconditioned resid norm 1.879248809429e-08 true resid norm 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 >>>>> 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid norm 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 >>>>> 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid norm 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 >>>>> 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid norm 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 >>>>> 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid norm 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 >>>>> 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid norm 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 >>>>> 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid norm 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 >>>>> 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid norm 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 >>>>> 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid norm 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 >>>>> 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid norm 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 >>>>> 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid norm 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 >>>>> 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid norm 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 >>>>> 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid norm 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 >>>>> 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid norm 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 >>>>> 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid norm 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 >>>>> 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid norm 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 >>>>> 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid norm 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 >>>>> 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid norm 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 >>>>> 175 KSP unpreconditioned resid norm 3.924177535665e-09 true resid norm 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 >>>>> 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid norm 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 >>>>> 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid norm 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 >>>>> 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid norm 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 >>>>> 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid norm 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 >>>>> 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid norm 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 >>>>> 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid norm 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 >>>>> 182 KSP 
unpreconditioned resid norm 1.983545827983e-09 true resid norm 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 >>>>> 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid norm 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 >>>>> 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid norm 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 >>>>> 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid norm 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 >>>>> 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid norm 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 >>>>> 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid norm 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 >>>>> 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid norm 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 >>>>> 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid norm 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 >>>>> KSP Object: 16 MPI processes >>>>> type: fgmres >>>>> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement >>>>> happy breakdown tolerance 1e-30 >>>>> maximum iterations=2000, initial guess is zero >>>>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. >>>>> right preconditioning >>>>> using UNPRECONDITIONED norm type for convergence test >>>>> PC Object: 16 MPI processes >>>>> type: bjacobi >>>>> number of blocks = 4 >>>>> Local solver information for first block is in the following KSP and PC objects on rank 0: >>>>> Use -ksp_view ::ascii_info_detail to display information for all blocks >>>>> KSP Object: (sub_) 4 MPI processes >>>>> type: preonly >>>>> maximum iterations=10000, initial guess is zero >>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>>>> left preconditioning >>>>> using NONE norm type for convergence test >>>>> PC Object: (sub_) 4 MPI processes >>>>> type: telescope >>>>> petsc subcomm: parent comm size reduction factor = 4 >>>>> petsc subcomm: parent_size = 4 , subcomm_size = 1 >>>>> petsc subcomm type = contiguous >>>>> linear system matrix = precond matrix: >>>>> Mat Object: (sub_) 4 MPI processes >>>>> type: mpiaij >>>>> rows=40200, cols=40200 >>>>> total: nonzeros=199996, allocated nonzeros=203412 >>>>> total number of mallocs used during MatSetValues calls=0 >>>>> not using I-node (on process 0) routines >>>>> setup type: default >>>>> Parent DM object: NULL >>>>> Sub DM object: NULL >>>>> KSP Object: (sub_telescope_) 1 MPI processes >>>>> type: preonly >>>>> maximum iterations=10000, initial guess is zero >>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>>>> left preconditioning >>>>> using NONE norm type for convergence test >>>>> PC Object: (sub_telescope_) 1 MPI processes >>>>> type: lu >>>>> out-of-place factorization >>>>> tolerance for zero pivot 2.22045e-14 >>>>> matrix ordering: external >>>>> factor fill ratio given 0., needed 0. 
>>>>> Factored matrix follows: >>>>> Mat Object: 1 MPI processes >>>>> type: mumps >>>>> rows=40200, cols=40200 >>>>> package used to perform factorization: mumps >>>>> total: nonzeros=1849788, allocated nonzeros=1849788 >>>>> MUMPS run parameters: >>>>> SYM (matrix type): 0 >>>>> PAR (host participation): 1 >>>>> ICNTL(1) (output for error): 6 >>>>> ICNTL(2) (output of diagnostic msg): 0 >>>>> ICNTL(3) (output for global info): 0 >>>>> ICNTL(4) (level of printing): 0 >>>>> ICNTL(5) (input mat struct): 0 >>>>> ICNTL(6) (matrix prescaling): 7 >>>>> ICNTL(7) (sequential matrix ordering):7 >>>>> ICNTL(8) (scaling strategy): 77 >>>>> ICNTL(10) (max num of refinements): 0 >>>>> ICNTL(11) (error analysis): 0 >>>>> ICNTL(12) (efficiency control): 1 >>>>> ICNTL(13) (sequential factorization of the root node): 0 >>>>> ICNTL(14) (percentage of estimated workspace increase): 20 >>>>> ICNTL(18) (input mat struct): 0 >>>>> ICNTL(19) (Schur complement info): 0 >>>>> ICNTL(20) (RHS sparse pattern): 0 >>>>> ICNTL(21) (solution struct): 0 >>>>> ICNTL(22) (in-core/out-of-core facility): 0 >>>>> ICNTL(23) (max size of memory can be allocated locally):0 >>>>> ICNTL(24) (detection of null pivot rows): 0 >>>>> ICNTL(25) (computation of a null space basis): 0 >>>>> ICNTL(26) (Schur options for RHS or solution): 0 >>>>> ICNTL(27) (blocking size for multiple RHS): -32 >>>>> ICNTL(28) (use parallel or sequential ordering): 1 >>>>> ICNTL(29) (parallel ordering): 0 >>>>> ICNTL(30) (user-specified set of entries in inv(A)): 0 >>>>> ICNTL(31) (factors is discarded in the solve phase): 0 >>>>> ICNTL(33) (compute determinant): 0 >>>>> ICNTL(35) (activate BLR based factorization): 0 >>>>> ICNTL(36) (choice of BLR factorization variant): 0 >>>>> ICNTL(38) (estimated compression rate of LU factors): 333 >>>>> CNTL(1) (relative pivoting threshold): 0.01 >>>>> CNTL(2) (stopping criterion of refinement): 1.49012e-08 >>>>> CNTL(3) (absolute pivoting threshold): 0. >>>>> CNTL(4) (value of static pivoting): -1. >>>>> CNTL(5) (fixation for null pivots): 0. >>>>> CNTL(7) (dropping parameter for BLR): 0. 
>>>>> RINFO(1) (local estimated flops for the elimination after analysis): >>>>> [0] 1.45525e+08 >>>>> RINFO(2) (local estimated flops for the assembly after factorization): >>>>> [0] 2.89397e+06 >>>>> RINFO(3) (local estimated flops for the elimination after factorization): >>>>> [0] 1.45525e+08 >>>>> INFO(15) (estimated size of (in MB) MUMPS internal data for running numerical factorization): >>>>> [0] 29 >>>>> INFO(16) (size of (in MB) MUMPS internal data used during numerical factorization): >>>>> [0] 29 >>>>> INFO(23) (num of pivots eliminated on this processor after factorization): >>>>> [0] 40200 >>>>> RINFOG(1) (global estimated flops for the elimination after analysis): 1.45525e+08 >>>>> RINFOG(2) (global estimated flops for the assembly after factorization): 2.89397e+06 >>>>> RINFOG(3) (global estimated flops for the elimination after factorization): 1.45525e+08 >>>>> (RINFOG(12) RINFOG(13))*2^INFOG(34) (determinant): (0.,0.)*(2^0) >>>>> INFOG(3) (estimated real workspace for factors on all processors after analysis): 1849788 >>>>> INFOG(4) (estimated integer workspace for factors on all processors after analysis): 879986 >>>>> INFOG(5) (estimated maximum front size in the complete tree): 282 >>>>> INFOG(6) (number of nodes in the complete tree): 23709 >>>>> INFOG(7) (ordering option effectively used after analysis): 5 >>>>> INFOG(8) (structural symmetry in percent of the permuted matrix after analysis): 100 >>>>> INFOG(9) (total real/complex workspace to store the matrix factors after factorization): 1849788 >>>>> INFOG(10) (total integer space store the matrix factors after factorization): 879986 >>>>> INFOG(11) (order of largest frontal matrix after factorization): 282 >>>>> INFOG(12) (number of off-diagonal pivots): 0 >>>>> INFOG(13) (number of delayed pivots after factorization): 0 >>>>> INFOG(14) (number of memory compress after factorization): 0 >>>>> INFOG(15) (number of steps of iterative refinement after solution): 0 >>>>> INFOG(16) (estimated size (in MB) of all MUMPS internal data for factorization after analysis: value on the most memory consuming processor): 29 >>>>> INFOG(17) (estimated size of all MUMPS internal data for factorization after analysis: sum over all processors): 29 >>>>> INFOG(18) (size of all MUMPS internal data allocated during factorization: value on the most memory consuming processor): 29 >>>>> INFOG(19) (size of all MUMPS internal data allocated during factorization: sum over all processors): 29 >>>>> INFOG(20) (estimated number of entries in the factors): 1849788 >>>>> INFOG(21) (size in MB of memory effectively used during factorization - value on the most memory consuming processor): 26 >>>>> INFOG(22) (size in MB of memory effectively used during factorization - sum over all processors): 26 >>>>> INFOG(23) (after analysis: value of ICNTL(6) effectively used): 0 >>>>> INFOG(24) (after analysis: value of ICNTL(12) effectively used): 1 >>>>> INFOG(25) (after factorization: number of pivots modified by static pivoting): 0 >>>>> INFOG(28) (after factorization: number of null pivots encountered): 0 >>>>> INFOG(29) (after factorization: effective number of entries in the factors (sum over all processors)): 1849788 >>>>> INFOG(30, 31) (after solution: size in Mbytes of memory used during solution phase): 29, 29 >>>>> INFOG(32) (after analysis: type of analysis done): 1 >>>>> INFOG(33) (value used for ICNTL(8)): 7 >>>>> INFOG(34) (exponent of the determinant if determinant is requested): 0 >>>>> INFOG(35) (after factorization: number of entries 
taking into account BLR factor compression - sum over all processors): 1849788 >>>>> INFOG(36) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - value on the most memory consuming processor): 0 >>>>> INFOG(37) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - sum over all processors): 0 >>>>> INFOG(38) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - value on the most memory consuming processor): 0 >>>>> INFOG(39) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - sum over all processors): 0 >>>>> linear system matrix = precond matrix: >>>>> Mat Object: 1 MPI processes >>>>> type: seqaijcusparse >>>>> rows=40200, cols=40200 >>>>> total: nonzeros=199996, allocated nonzeros=199996 >>>>> total number of mallocs used during MatSetValues calls=0 >>>>> not using I-node routines >>>>> linear system matrix = precond matrix: >>>>> Mat Object: 16 MPI processes >>>>> type: mpiaijcusparse >>>>> rows=160800, cols=160800 >>>>> total: nonzeros=802396, allocated nonzeros=1608000 >>>>> total number of mallocs used during MatSetValues calls=0 >>>>> not using I-node (on process 0) routines >>>>> Norm of error 9.11684e-07 iterations 189 >>>>> Chang >>>>> On 10/14/21 10:10 PM, Chang Liu wrote: >>>>>> Hi Barry, >>>>>> >>>>>> No problem. Here is the output. It seems that the resid norm calculation is incorrect. >>>>>> >>>>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >>>>>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>>>>> 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>>>>> KSP Object: 16 MPI processes >>>>>> type: fgmres >>>>>> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement >>>>>> happy breakdown tolerance 1e-30 >>>>>> maximum iterations=2000, initial guess is zero >>>>>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. >>>>>> right preconditioning >>>>>> using UNPRECONDITIONED norm type for convergence test >>>>>> PC Object: 16 MPI processes >>>>>> type: bjacobi >>>>>> number of blocks = 4 >>>>>> Local solver information for first block is in the following KSP and PC objects on rank 0: >>>>>> Use -ksp_view ::ascii_info_detail to display information for all blocks >>>>>> KSP Object: (sub_) 4 MPI processes >>>>>> type: preonly >>>>>> maximum iterations=10000, initial guess is zero >>>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
>>>>>> left preconditioning >>>>>> using NONE norm type for convergence test >>>>>> PC Object: (sub_) 4 MPI processes >>>>>> type: telescope >>>>>> petsc subcomm: parent comm size reduction factor = 4 >>>>>> petsc subcomm: parent_size = 4 , subcomm_size = 1 >>>>>> petsc subcomm type = contiguous >>>>>> linear system matrix = precond matrix: >>>>>> Mat Object: (sub_) 4 MPI processes >>>>>> type: mpiaij >>>>>> rows=40200, cols=40200 >>>>>> total: nonzeros=199996, allocated nonzeros=203412 >>>>>> total number of mallocs used during MatSetValues calls=0 >>>>>> not using I-node (on process 0) routines >>>>>> setup type: default >>>>>> Parent DM object: NULL >>>>>> Sub DM object: NULL >>>>>> KSP Object: (sub_telescope_) 1 MPI processes >>>>>> type: preonly >>>>>> maximum iterations=10000, initial guess is zero >>>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>>>>> left preconditioning >>>>>> using NONE norm type for convergence test >>>>>> PC Object: (sub_telescope_) 1 MPI processes >>>>>> type: lu >>>>>> out-of-place factorization >>>>>> tolerance for zero pivot 2.22045e-14 >>>>>> matrix ordering: nd >>>>>> factor fill ratio given 5., needed 8.62558 >>>>>> Factored matrix follows: >>>>>> Mat Object: 1 MPI processes >>>>>> type: seqaijcusparse >>>>>> rows=40200, cols=40200 >>>>>> package used to perform factorization: cusparse >>>>>> total: nonzeros=1725082, allocated nonzeros=1725082 >>>>>> not using I-node routines >>>>>> linear system matrix = precond matrix: >>>>>> Mat Object: 1 MPI processes >>>>>> type: seqaijcusparse >>>>>> rows=40200, cols=40200 >>>>>> total: nonzeros=199996, allocated nonzeros=199996 >>>>>> total number of mallocs used during MatSetValues calls=0 >>>>>> not using I-node routines >>>>>> linear system matrix = precond matrix: >>>>>> Mat Object: 16 MPI processes >>>>>> type: mpiaijcusparse >>>>>> rows=160800, cols=160800 >>>>>> total: nonzeros=802396, allocated nonzeros=1608000 >>>>>> total number of mallocs used during MatSetValues calls=0 >>>>>> not using I-node (on process 0) routines >>>>>> Norm of error 400.999 iterations 1 >>>>>> >>>>>> Chang >>>>>> >>>>>> >>>>>> On 10/14/21 9:47 PM, Barry Smith wrote: >>>>>>> >>>>>>> Chang, >>>>>>> >>>>>>> Sorry I did not notice that one. Please run that with -ksp_view -ksp_monitor_true_residual so we can see exactly how options are interpreted and solver used. At a glance it looks ok but something must be wrong to get the wrong answer. >>>>>>> >>>>>>> Barry >>>>>>> >>>>>>>> On Oct 14, 2021, at 6:02 PM, Chang Liu wrote: >>>>>>>> >>>>>>>> Hi Barry, >>>>>>>> >>>>>>>> That is exactly what I was doing in the second example, in which the preconditioner works but the GMRES does not. >>>>>>>> >>>>>>>> Chang >>>>>>>> >>>>>>>> On 10/14/21 5:15 PM, Barry Smith wrote: >>>>>>>>> You need to use the PCTELESCOPE inside the block Jacobi, not outside it. So something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu >>>>>>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: >>>>>>>>>> >>>>>>>>>> Hi Pierre, >>>>>>>>>> >>>>>>>>>> I wonder if the trick of PCTELESCOPE only works for preconditioner and not for the solver. I have done some tests, and find that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work. 
>>>>>>>>>>
>>>>>>>>>> The command line options I used for the small matrix are like
>>>>>>>>>>
>>>>>>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4
>>>>>>>>>>
>>>>>>>>>> which gives the correct output. For the iterative solver, I tried
>>>>>>>>>>
>>>>>>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20
>>>>>>>>>>
>>>>>>>>>> for the large matrix. The output is like
>>>>>>>>>>
>>>>>>>>>> 0 KSP Residual norm 40.1497
>>>>>>>>>> 1 KSP Residual norm < 1.e-11
>>>>>>>>>> Norm of error 400.999 iterations 1
>>>>>>>>>>
>>>>>>>>>> So it seems to call a direct solver instead of an iterative one.
>>>>>>>>>>
>>>>>>>>>> Can you please help check these options?
>>>>>>>>>>
>>>>>>>>>> Chang
>>>>>>>>>>
>>>>>>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote:
>>>>>>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds like exactly what I need. I wonder if PCTELESCOPE can transform an mpiaijcusparse into a seqaijcusparse? Or do I have to do it manually?
>>>>>>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat().
>>>>>>>>>>> 1) I'm not sure this is implemented for cuSparse matrices, but it should be;
>>>>>>>>>>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually "smart" enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning.
>>>>>>>>>>> If you try this out and it does not work, please provide the backtrace (probably something like "Operation XYZ not implemented for MatType ABC"), and hopefully someone can add the missing plumbing.
>>>>>>>>>>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve.
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Pierre
>>>>>>>>>>>> Chang
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote:
>>>>>>>>>>>>> Maybe I'm missing something, but can't you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block?
>>>>>>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu
>>>>>>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads, because not only does the Mat need to be redistributed, the secondary processes also need to be "converted" to OpenMP threads.
>>>>>>>>>>>>> Thus the need for specific code in mumps.c.
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Pierre
>>>>>>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Junchao,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes that is what I want.
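Regarding the question above about transforming an mpiaijcusparse into a seqaijcusparse manually: once the gathered matrix lives on a one-rank communicator, a plain MatConvert() to the cuSPARSE type should be all that is needed. Below is a minimal sketch, not taken from this thread, assuming a CUDA-enabled PETSc build; the 4x4 diagonal matrix is only a stand-in.

/* Hedged sketch of the "manual" transformation: on a one-rank communicator,
 * convert an assembled AIJ matrix in place to the cuSPARSE format so later
 * MatMult/MatSolve calls can run through cuSPARSE.  Assumes PETSc was
 * configured with CUDA. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat      A;
  PetscInt i, n = 4;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 1, NULL, &A));
  for (i = 0; i < n; i++) PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  /* In-place conversion to the GPU matrix type. */
  PetscCall(MatConvert(A, MATSEQAIJCUSPARSE, MAT_INPLACE_MATRIX, &A));

  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}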
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Chang >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >>>>>>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith > wrote: >>>>>>>>>>>>>>> Junchao, >>>>>>>>>>>>>>> If I understand correctly Chang is using the block Jacobi >>>>>>>>>>>>>>> method with a single block for a number of MPI ranks and a direct >>>>>>>>>>>>>>> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which >>>>>>>>>>>>>>> is code Hong Zhang wrote a number of years ago for CPUs. For their >>>>>>>>>>>>>>> particular problems this preconditioner works well, but using an >>>>>>>>>>>>>>> iterative solver on the blocks does not work well. >>>>>>>>>>>>>>> If we had complete MPI-GPU direct solvers he could just use >>>>>>>>>>>>>>> the current code with MPIAIJCUSPARSE on each block but since we do >>>>>>>>>>>>>>> not he would like to use a single GPU for each block, this means >>>>>>>>>>>>>>> that diagonal blocks of the global parallel MPI matrix needs to be >>>>>>>>>>>>>>> sent to a subset of the GPUs (one GPU per block, which has multiple >>>>>>>>>>>>>>> MPI ranks associated with the blocks). Similarly for the triangular >>>>>>>>>>>>>>> solves the blocks of the right hand side needs to be shipped to the >>>>>>>>>>>>>>> appropriate GPU and the resulting solution shipped back to the >>>>>>>>>>>>>>> multiple GPUs. So Chang is absolutely correct, this is somewhat like >>>>>>>>>>>>>>> your code for MUMPS with OpenMP. OK, I now understand the background.. >>>>>>>>>>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the >>>>>>>>>>>>>>> MPI ranks and then shrink each block down to a single GPU but this >>>>>>>>>>>>>>> would be pretty inefficient, ideally one would go directly from the >>>>>>>>>>>>>>> big MPI matrix on all the GPUs to the sub matrices on the subset of >>>>>>>>>>>>>>> GPUs. But this may be a large coding project. >>>>>>>>>>>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. >>>>>>>>>>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. >>>>>>>>>>>>>>> Barry >>>>>>>>>>>>>>> Since the matrices being factored and solved directly are relatively >>>>>>>>>>>>>>> large it is possible that the cusparse code could be reasonably >>>>>>>>>>>>>>> efficient (they are not the tiny problems one gets at the coarse >>>>>>>>>>>>>>> level of multigrid). Of course, this is speculation, I don't >>>>>>>>>>>>>>> actually know how much better the cusparse code would be on the >>>>>>>>>>>>>>> direct solver than a good CPU direct sparse solver. >>>>>>>>>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu >>>>>>>>>>>>>> > wrote: >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > Sorry I am not familiar with the details either. Can you please >>>>>>>>>>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > Chang >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: >>>>>>>>>>>>>>> >> Hi Chang, >>>>>>>>>>>>>>> >> I did the work in mumps. It is easy for me to understand >>>>>>>>>>>>>>> gathering matrix rows to one process. >>>>>>>>>>>>>>> >> But how to gather blocks (submatrices) to form a large block? 
Can you draw a picture of that? >>>>>>>>>>>>>>> >> Thanks >>>>>>>>>>>>>>> >> --Junchao Zhang >>>>>>>>>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >> Hi Barry, >>>>>>>>>>>>>>> >> I think mumps solver in petsc does support that. You can >>>>>>>>>>>>>>> check the >>>>>>>>>>>>>>> >> documentation on "-mat_mumps_use_omp_threads" at >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> >> and the code enclosed by #if >>>>>>>>>>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in >>>>>>>>>>>>>>> >> functions MatMumpsSetUpDistRHSInfo and >>>>>>>>>>>>>>> >> MatMumpsGatherNonzerosOnMaster in >>>>>>>>>>>>>>> >> mumps.c >>>>>>>>>>>>>>> >> 1. I understand it is ideal to do one MPI rank per GPU. >>>>>>>>>>>>>>> However, I am >>>>>>>>>>>>>>> >> working on an existing code that was developed based on MPI >>>>>>>>>>>>>>> and the the >>>>>>>>>>>>>>> >> # of mpi ranks is typically equal to # of cpu cores. We don't >>>>>>>>>>>>>>> want to >>>>>>>>>>>>>>> >> change the whole structure of the code. >>>>>>>>>>>>>>> >> 2. What you have suggested has been coded in mumps.c. See >>>>>>>>>>>>>>> function >>>>>>>>>>>>>>> >> MatMumpsSetUpDistRHSInfo. >>>>>>>>>>>>>>> >> Regards, >>>>>>>>>>>>>>> >> Chang >>>>>>>>>>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >> >> wrote: >>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>> >> >> Hi Barry, >>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>> >> >> That is exactly what I want. >>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>> >> >> Back to my original question, I am looking for an approach to >>>>>>>>>>>>>>> >> transfer >>>>>>>>>>>>>>> >> >> matrix >>>>>>>>>>>>>>> >> >> data from many MPI processes to "master" MPI >>>>>>>>>>>>>>> >> >> processes, each of which taking care of one GPU, and then >>>>>>>>>>>>>>> upload >>>>>>>>>>>>>>> >> the data to GPU to >>>>>>>>>>>>>>> >> >> solve. >>>>>>>>>>>>>>> >> >> One can just grab some codes from mumps.c to >>>>>>>>>>>>>>> aijcusparse.cu >>>>>>>>>>>>>>> >> >. >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> > mumps.c doesn't actually do that. It never needs to >>>>>>>>>>>>>>> copy the >>>>>>>>>>>>>>> >> entire matrix to a single MPI rank. >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> > It would be possible to write such a code that you >>>>>>>>>>>>>>> suggest but >>>>>>>>>>>>>>> >> it is not clear that it makes sense >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> > 1) For normal PETSc GPU usage there is one GPU per MPI >>>>>>>>>>>>>>> rank, so >>>>>>>>>>>>>>> >> while your one GPU per big domain is solving its systems the >>>>>>>>>>>>>>> other >>>>>>>>>>>>>>> >> GPUs (with the other MPI ranks that share that domain) are doing >>>>>>>>>>>>>>> >> nothing. >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> > 2) For each triangular solve you would have to gather the >>>>>>>>>>>>>>> right >>>>>>>>>>>>>>> >> hand side from the multiple ranks to the single GPU to pass it to >>>>>>>>>>>>>>> >> the GPU solver and then scatter the resulting solution back >>>>>>>>>>>>>>> to all >>>>>>>>>>>>>>> >> of its subdomain ranks. >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> > What I was suggesting was assign an entire subdomain to a >>>>>>>>>>>>>>> >> single MPI rank, thus it does everything on one GPU and can >>>>>>>>>>>>>>> use the >>>>>>>>>>>>>>> >> GPU solver directly. 
If all the major computations of a subdomain can fit and be done on a
single GPU then you would be utilizing all the GPUs you are using
effectively.

   Barry

Chang

On 10/13/21 1:53 PM, Barry Smith wrote:

  Chang,

    You are correct, there are no MPI + GPU direct solvers that currently
  do the triangular solves with MPI + GPU parallelism that I am aware of.
  You are limited in that individual triangular solves must be done on a
  single GPU. I can only suggest making each subdomain as big as possible
  to utilize each GPU as much as possible for the direct triangular solves.

    Barry

On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users wrote:

  Hi Mark,

  '-mat_type aijcusparse' works with mpiaijcusparse with other solvers,
  but with -pc_factor_mat_solver_type cusparse it will give an error.

  Yes, what I want is to have mumps or superlu do the factorization, and
  then do the rest, including the GMRES solver, on the GPU. Is that
  possible?

  I have tried to use aijcusparse with superlu_dist; it runs, but the
  iterative solver is still running on CPUs. I have contacted the superlu
  group and they confirmed that is the case right now. But if I set
  -pc_factor_mat_solver_type cusparse, it seems that the iterative solver
  is running on the GPU.

  Chang

On 10/13/21 12:03 PM, Mark Adams wrote:

  On Wed, Oct 13, 2021 at 11:10 AM Chang Liu wrote:

    Thank you Junchao for explaining this. I guess in my case the code is
    just calling a seq solver like superlu to do the factorization on GPUs.

    My idea is that I want to have a traditional MPI code utilize GPUs
    with cusparse. Right now cusparse does not support the mpiaij matrix,

  Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse
  matrix with > 1 processes. (-mat_type mpiaijcusparse might also work
  with > 1 proc.) However, I see in grepping the repo that all the mumps
  and superlu tests use the aij or sell matrix type. MUMPS and SuperLU
  provide their own solves, I assume ... but you might want to do other
  matrix operations on the GPU. Is that the issue? Did you try -mat_type
  aijcusparse with MUMPS and/or SuperLU and have a problem? (There is no
  test with it, so it probably does not work.)

  Thanks,
  Mark

    so I want the code to have an mpiaij matrix when adding all the
    matrix terms, and then transform the matrix to seqaij when doing the
    factorization and solve. This involves sending the data to the master
    process, and I think the petsc mumps solver has something similar
    already.

    Chang

On 10/13/21 10:18 AM, Junchao Zhang wrote:

  On Tue, Oct 12, 2021 at 1:07 PM Mark Adams wrote:

    On Tue, Oct 12, 2021 at 1:45 PM Chang Liu wrote:

      Hi Mark,

      The option I use is like

      -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type
      aijcusparse -sub_pc_factor_mat_solver_type cusparse -sub_ksp_type
      preonly -sub_pc_type lu -ksp_max_it 2000 -ksp_rtol 1.e-300
      -ksp_atol 1.e-300

    Note, if you use -log_view the last column (rows are the method,
    like MatFactorNumeric) has the percent of work in the GPU.

    Junchao: *this* implies that we have a cuSparse LU factorization. Is
    that correct? (I don't think we do.)

  No, we don't have a cuSparse LU factorization. If you check
  MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
  MatLUFactorSymbolic_SeqAIJ() instead. So I don't understand Chang's
  idea. Do you want to make bigger blocks?

      I think this one does both factorization and solve on the GPU.

      You can check the runex72_aijcusparse.sh file in the petsc install
      directory and try it yourself (this is only the LU factorization,
      without an iterative solve).

      Chang

    On 10/12/21 1:17 PM, Mark Adams wrote:

      On Tue, Oct 12, 2021 at 11:19 AM Chang Liu wrote:

        Hi Junchao,

        No, I only need it to be transferred within a node. I use the
        block-Jacobi method and GMRES to solve the sparse matrix, so each
        direct solver will take care of a sub-block of the whole matrix.
        In this way, I can use one GPU to solve one sub-block, which is
        stored within one node.

        It was stated in the documentation that the cusparse solver is
        slow. However, in my test using ex72.c, the cusparse solver is
        faster than mumps or superlu_dist on CPUs.

      Are we talking about the factorization, the solve, or both?

      We do not have an interface to cuSparse's LU factorization (I just
      learned that it exists a few weeks ago). Perhaps your fast
      "cusparse solver" is '-pc_type lu -mat_type aijcusparse'? This
      would be the CPU factorization, which is the dominant cost.

        Chang

      On 10/12/21 10:24 AM, Junchao Zhang wrote:

        Hi, Chang,

        For the mumps solver, we usually transfer matrix and vector data
        within a compute node. For the idea you propose, it looks like we
        need to gather data within MPI_COMM_WORLD, right?

        Mark, I remember you said the cusparse solve is slow and you
        would rather do it on the CPU. Is that right?

        --Junchao Zhang

        On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users wrote:

          Hi,

          Currently, it is possible to use the mumps solver in PETSc with
          the -mat_mumps_use_omp_threads option, so that multiple MPI
          processes will transfer the matrix and rhs data to the master
          rank, and then the master rank will call mumps with OpenMP to
          solve the matrix.

          I wonder if someone can develop a similar option for the
          cusparse solver. Right now, this solver does not work with
          mpiaijcusparse. I think a possible workaround is to transfer
          all the matrix data to one MPI process, and then upload the
          data to the GPU to solve. In this way, one can use the cusparse
          solver for an MPI program.

          Chang
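For reference, a minimal sketch of how the MUMPS option mentioned above is
typically invoked on the command line (the executable name, process count,
and thread count are placeholders; this assumes a PETSc build configured
with MUMPS and OpenMP support):

    # gather the matrix to the master rank and factor/solve with OpenMP threads
    mpiexec -n 16 ./app -ksp_type preonly -pc_type lu \
      -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads 4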
--
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

From junchao.zhang at gmail.com  Wed Oct 20 13:59:12 2021
From: junchao.zhang at gmail.com (Junchao Zhang)
Date: Wed, 20 Oct 2021 13:59:12 -0500
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: <7a3d5347-f2da-b4a9-f44a-aa534a314c7f@pppl.gov>
References: <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov>
 <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev>
 <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov>
 <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov>
 <879c30a1-ea85-1c24-4139-268925d511da@pppl.gov>
 <53D4EDD7-E05C-4485-B7AE-23AB10DD81B1@joliv.et>
 <968434BC-E8DC-49B0-9119-F208DB1E01B0@petsc.dev>
 <7a3d5347-f2da-b4a9-f44a-aa534a314c7f@pppl.gov>
Message-ID: 

The MR https://gitlab.com/petsc/petsc/-/merge_requests/4471 has not been
merged yet.

--Junchao Zhang


On Wed, Oct 20, 2021 at 1:47 PM Chang Liu via petsc-users
<petsc-users at mcs.anl.gov> wrote:

> Hi Barry,
>
> Are the fixes merged in the master? I was using bjacobi as a
> preconditioner.
> Using the latest version of petsc, I found that by calling
>
> mpiexec -n 32 --oversubscribe ./ex7 -m 1000 -ksp_view
> -ksp_monitor_true_residual -ksp_type fgmres -pc_type bjacobi
> -pc_bjacobi_blocks 4 -sub_ksp_type preonly -sub_pc_type telescope
> -sub_pc_telescope_reduction_factor 8 -sub_pc_telescope_subcomm_type
> contiguous -sub_telescope_pc_type lu -sub_telescope_ksp_type preonly
> -sub_telescope_pc_factor_mat_solver_type mumps -ksp_max_it 2000
> -ksp_rtol 1.e-30 -ksp_atol 1.e-30
>
> the code is calling PCApply_BJacobi_Multiproc. If I use
>
> mpiexec -n 32 --oversubscribe ./ex7 -m 1000 -ksp_view
> -ksp_monitor_true_residual -telescope_ksp_monitor_true_residual
> -ksp_type preonly -pc_type telescope -pc_telescope_reduction_factor 8
> -pc_telescope_subcomm_type contiguous -telescope_pc_type bjacobi
> -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4
> -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu
> -telescope_sub_pc_factor_mat_solver_type mumps -telescope_ksp_max_it
> 2000 -telescope_ksp_rtol 1.e-30 -telescope_ksp_atol 1.e-30
>
> the code is calling PCApply_BJacobi_Singleblock. You can test it yourself.
>
> Regards,
>
> Chang
>
> On 10/20/21 1:14 PM, Barry Smith wrote:
> >
> >> On Oct 20, 2021, at 12:48 PM, Chang Liu wrote:
> >>
> >> Hi Pierre,
> >>
> >> I have another suggestion for telescope. I have achieved my goal by
> >> putting telescope outside bjacobi. But the code still does not work if
> >> I use telescope as a pc for the subblock. I think the reason is that I
> >> want to use cusparse as the solver, which can only deal with a seqaij
> >> matrix and not an mpiaij matrix.
> >
> >    This is supposed to work with the recent fixes. The telescope should
> > produce a seq matrix and for each solve map the parallel vector (over
> > the subdomain) automatically down to the one rank with the GPU to solve
> > it on the GPU. It is not clear to me where the process is going wrong.
> >
> >    Barry
> >
> >> However, for the telescope pc, it can put the matrix into one mpi rank,
> >> thus making it a seqaij for the factorization stage, but then after
> >> factorization it will give the data back to the original communicator.
> >> This will make the matrix back to mpiaij, and then cusparse cannot
> >> solve it.
> >>
> >> I think a better option is to do the factorization on CPU with mpiaij,
> >> then transform the preconditioner matrix to seqaij and do the matsolve
> >> on GPU. But I am not sure if it can be achieved using telescope.
> >>
> >> Regards,
> >>
> >> Chang
> >>
> >> On 10/15/21 5:29 AM, Pierre Jolivet wrote:
> >>> Hi Chang,
> >>> The output you sent with MUMPS looks alright to me, you can see that
> >>> the MatType is properly set to seqaijcusparse (and not mpiaijcusparse).
> >>> I don't know what is wrong with
> >>> -sub_telescope_pc_factor_mat_solver_type cusparse, I don't have a
> >>> PETSc installation for testing this; hopefully Barry or Junchao can
> >>> confirm this wrong behavior and get this fixed.
> >>> As for permuting PCTELESCOPE and PCBJACOBI, in your case, the outer PC
> >>> will be equivalent, yes.
> >>> However, it would be more efficient to do PCBJACOBI and then
> >>> PCTELESCOPE.
> >>> PCBJACOBI prunes the operator by basically removing all coefficients
> >>> outside of the diagonal blocks.
> >>> Then, PCTELESCOPE "groups everything together".
> >>> If you do it the other way around, PCTELESCOPE will "group everything
> >>> together" and then PCBJACOBI will prune the operator.
> >>> So the PCTELESCOPE SetUp will be costly for nothing since some
> >>> coefficients will be thrown out afterwards in the PCBJACOBI SetUp.
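To make the two orderings concrete, the option sets being compared look
roughly like this (a sketch only; the block count and reduction factor are
illustrative, and the option prefixes follow the examples used elsewhere in
this thread):

    # PCBJACOBI outside, PCTELESCOPE inside each block (the cheaper setup)
    -pc_type bjacobi -pc_bjacobi_blocks 4 \
      -sub_pc_type telescope -sub_pc_telescope_reduction_factor 4 \
      -sub_telescope_pc_type lu

    # PCTELESCOPE outside, PCBJACOBI inside (the more costly setup)
    -pc_type telescope -pc_telescope_reduction_factor 4 \
      -telescope_pc_type bjacobi -telescope_pc_bjacobi_blocks 4 \
      -telescope_sub_pc_type lu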
> >>> I hope I?m clear enough, otherwise I can try do draw some pictures. > >>> Thanks, > >>> Pierre > >>>> On 15 Oct 2021, at 4:39 AM, Chang Liu wrote: > >>>> > >>>> Hi Pierre and Barry, > >>>> > >>>> I think maybe I should use telescope outside bjacobi? like this > >>>> > >>>> mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 > -ksp_view -ksp_monitor_true_residual -pc_type telescope > -pc_telescope_reduction_factor 4 -t > >>>> elescope_pc_type bjacobi -telescope_ksp_type fgmres > -telescope_pc_bjacobi_blocks 4 -mat_type aijcusparse > -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu > -telescope_sub_pc_factor_mat_solve > >>>> r_type cusparse -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 > >>>> > >>>> But then I got an error that > >>>> > >>>> [0]PETSC ERROR: MatSolverType cusparse does not support matrix type > seqaij > >>>> > >>>> But the mat type should be aijcusparse. I think telescope change the > mat type. > >>>> > >>>> Chang > >>>> > >>>> On 10/14/21 10:11 PM, Chang Liu wrote: > >>>>> For comparison, here is the output using mumps instead of cusparse > >>>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 > -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 > -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type > preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu > -sub_telescope_pc_factor_mat_solver_type mumps > -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type > contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 > >>>>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid > norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > >>>>> 1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid > norm 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 > >>>>> 2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid > norm 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 > >>>>> 3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid > norm 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 > >>>>> 4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid > norm 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 > >>>>> 5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid > norm 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 > >>>>> 6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid > norm 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 > >>>>> 7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid > norm 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 > >>>>> 8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid > norm 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03 > >>>>> 9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid > norm 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 > >>>>> 10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid > norm 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 > >>>>> 11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid > norm 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 > >>>>> 12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid > norm 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 > >>>>> 13 KSP unpreconditioned resid norm 1.885892030223e-01 true resid > norm 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 > >>>>> 14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid > norm 1.763510666948e-01 ||r(i)||/||b|| 
4.392336175055e-03 > >>>>> 15 KSP unpreconditioned resid norm 1.638219366731e-01 true resid > norm 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 > >>>>> 16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid > norm 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 > >>>>> 17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid > norm 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 > >>>>> 18 KSP unpreconditioned resid norm 1.289673236836e-01 true resid > norm 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 > >>>>> 19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid > norm 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 > >>>>> 20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid > norm 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 > >>>>> 21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid > norm 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 > >>>>> 22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid > norm 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 > >>>>> 23 KSP unpreconditioned resid norm 7.563496650115e-02 true resid > norm 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 > >>>>> 24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid > norm 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 > >>>>> 25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid > norm 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 > >>>>> 26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid > norm 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 > >>>>> 27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid > norm 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 > >>>>> 28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid > norm 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 > >>>>> 29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid > norm 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 > >>>>> 30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid > norm 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 > >>>>> 31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid > norm 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 > >>>>> 32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid > norm 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 > >>>>> 33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid > norm 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 > >>>>> 34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid > norm 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 > >>>>> 35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid > norm 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 > >>>>> 36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid > norm 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 > >>>>> 37 KSP unpreconditioned resid norm 4.982347757923e-03 true resid > norm 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 > >>>>> 38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid > norm 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 > >>>>> 39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid > norm 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 > >>>>> 40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid > norm 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 > >>>>> 41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid > 
norm 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 > >>>>> 42 KSP unpreconditioned resid norm 2.772928845284e-03 true resid > norm 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 > >>>>> 43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid > norm 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 > >>>>> 44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid > norm 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 > >>>>> 45 KSP unpreconditioned resid norm 2.086800523919e-03 true resid > norm 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 > >>>>> 46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid > norm 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 > >>>>> 47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid > norm 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 > >>>>> 48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid > norm 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 > >>>>> 49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid > norm 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 > >>>>> 50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid > norm 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 > >>>>> 51 KSP unpreconditioned resid norm 1.375985053014e-03 true resid > norm 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 > >>>>> 52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid > norm 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 > >>>>> 53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid > norm 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 > >>>>> 54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid > norm 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 > >>>>> 55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid > norm 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 > >>>>> 56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid > norm 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 > >>>>> 57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid > norm 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 > >>>>> 58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid > norm 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 > >>>>> 59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid > norm 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 > >>>>> 60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid > norm 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 > >>>>> 61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid > norm 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 > >>>>> 62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid > norm 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 > >>>>> 63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid > norm 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 > >>>>> 64 KSP unpreconditioned resid norm 2.961034697265e-04 true resid > norm 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 > >>>>> 65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid > norm 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 > >>>>> 66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid > norm 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 > >>>>> 67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid > norm 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 > >>>>> 68 KSP unpreconditioned resid 
norm 2.008438265031e-04 true resid > norm 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 > >>>>> 69 KSP unpreconditioned resid norm 1.838732863386e-04 true resid > norm 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 > >>>>> 70 KSP unpreconditioned resid norm 1.723786027645e-04 true resid > norm 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 > >>>>> 71 KSP unpreconditioned resid norm 1.580945192204e-04 true resid > norm 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 > >>>>> 72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid > norm 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 > >>>>> 73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid > norm 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 > >>>>> 74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid > norm 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 > >>>>> 75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid > norm 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 > >>>>> 76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid > norm 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 > >>>>> 77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid > norm 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 > >>>>> 78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid > norm 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 > >>>>> 79 KSP unpreconditioned resid norm 8.896901670359e-05 true resid > norm 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 > >>>>> 80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid > norm 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 > >>>>> 81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid > norm 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 > >>>>> 82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid > norm 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 > >>>>> 83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid > norm 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 > >>>>> 84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid > norm 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 > >>>>> 85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid > norm 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 > >>>>> 86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid > norm 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 > >>>>> 87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid > norm 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 > >>>>> 88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid > norm 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 > >>>>> 89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid > norm 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 > >>>>> 90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid > norm 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 > >>>>> 91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid > norm 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 > >>>>> 92 KSP unpreconditioned resid norm 1.707071292484e-05 true resid > norm 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 > >>>>> 93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid > norm 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 > >>>>> 94 KSP unpreconditioned resid norm 1.219393542993e-05 true resid > norm 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 > 
>>>>> 95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid > norm 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 > >>>>> 96 KSP unpreconditioned resid norm 9.099659872548e-06 true resid > norm 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 > >>>>> 97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid > norm 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 > >>>>> 98 KSP unpreconditioned resid norm 7.167226146744e-06 true resid > norm 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 > >>>>> 99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid > norm 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 > >>>>> 100 KSP unpreconditioned resid norm 5.767783600111e-06 true resid > norm 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 > >>>>> 101 KSP unpreconditioned resid norm 5.261057430584e-06 true resid > norm 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 > >>>>> 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid > norm 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 > >>>>> 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid > norm 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 > >>>>> 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid > norm 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 > >>>>> 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid > norm 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 > >>>>> 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid > norm 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 > >>>>> 107 KSP unpreconditioned resid norm 3.234218880987e-06 true resid > norm 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 > >>>>> 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid > norm 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 > >>>>> 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid > norm 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 > >>>>> 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid > norm 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 > >>>>> 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid > norm 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 > >>>>> 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid > norm 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 > >>>>> 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid > norm 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 > >>>>> 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid > norm 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 > >>>>> 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid > norm 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 > >>>>> 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid > norm 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 > >>>>> 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid > norm 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 > >>>>> 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid > norm 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 > >>>>> 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid > norm 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 > >>>>> 120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid > norm 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 > >>>>> 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid > 
norm 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 > >>>>> 122 KSP unpreconditioned resid norm 7.945760150897e-07 true resid > norm 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 > >>>>> 123 KSP unpreconditioned resid norm 7.141240839013e-07 true resid > norm 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 > >>>>> 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid > norm 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 > >>>>> 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid > norm 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 > >>>>> 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid > norm 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 > >>>>> 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid > norm 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 > >>>>> 128 KSP unpreconditioned resid norm 4.253264691112e-07 true resid > norm 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 > >>>>> 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid > norm 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 > >>>>> 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid > norm 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 > >>>>> 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid > norm 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 > >>>>> 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid > norm 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 > >>>>> 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid > norm 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 > >>>>> 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid > norm 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 > >>>>> 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid > norm 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 > >>>>> 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid > norm 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 > >>>>> 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid > norm 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 > >>>>> 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid > norm 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 > >>>>> 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid > norm 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 > >>>>> 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid > norm 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 > >>>>> 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid > norm 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 > >>>>> 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid > norm 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 > >>>>> 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid > norm 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 > >>>>> 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid > norm 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 > >>>>> 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid > norm 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 > >>>>> 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid > norm 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 > >>>>> 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid > norm 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 > >>>>> 148 
KSP unpreconditioned resid norm 5.690132597004e-08 true resid > norm 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 > >>>>> 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid > norm 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 > >>>>> 150 KSP unpreconditioned resid norm 4.625371062660e-08 true resid > norm 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 > >>>>> 151 KSP unpreconditioned resid norm 4.349049084805e-08 true resid > norm 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 > >>>>> 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid > norm 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 > >>>>> 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid > norm 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 > >>>>> 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid > norm 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 > >>>>> 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid > norm 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 > >>>>> 156 KSP unpreconditioned resid norm 2.128504065757e-08 true resid > norm 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 > >>>>> 157 KSP unpreconditioned resid norm 1.879248809429e-08 true resid > norm 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 > >>>>> 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid > norm 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 > >>>>> 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid > norm 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 > >>>>> 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid > norm 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 > >>>>> 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid > norm 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 > >>>>> 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid > norm 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 > >>>>> 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid > norm 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 > >>>>> 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid > norm 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 > >>>>> 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid > norm 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 > >>>>> 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid > norm 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 > >>>>> 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid > norm 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 > >>>>> 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid > norm 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 > >>>>> 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid > norm 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 > >>>>> 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid > norm 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 > >>>>> 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid > norm 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 > >>>>> 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid > norm 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 > >>>>> 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid > norm 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 > >>>>> 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid > norm 
4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 > >>>>> 175 KSP unpreconditioned resid norm 3.924177535665e-09 true resid > norm 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 > >>>>> 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid > norm 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 > >>>>> 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid > norm 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 > >>>>> 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid > norm 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 > >>>>> 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid > norm 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 > >>>>> 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid > norm 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 > >>>>> 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid > norm 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 > >>>>> 182 KSP unpreconditioned resid norm 1.983545827983e-09 true resid > norm 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 > >>>>> 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid > norm 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 > >>>>> 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid > norm 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 > >>>>> 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid > norm 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 > >>>>> 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid > norm 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 > >>>>> 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid > norm 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 > >>>>> 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid > norm 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 > >>>>> 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid > norm 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 > >>>>> KSP Object: 16 MPI processes > >>>>> type: fgmres > >>>>> restart=30, using Classical (unmodified) Gram-Schmidt > Orthogonalization with no iterative refinement > >>>>> happy breakdown tolerance 1e-30 > >>>>> maximum iterations=2000, initial guess is zero > >>>>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. > >>>>> right preconditioning > >>>>> using UNPRECONDITIONED norm type for convergence test > >>>>> PC Object: 16 MPI processes > >>>>> type: bjacobi > >>>>> number of blocks = 4 > >>>>> Local solver information for first block is in the following > KSP and PC objects on rank 0: > >>>>> Use -ksp_view ::ascii_info_detail to display information for > all blocks > >>>>> KSP Object: (sub_) 4 MPI processes > >>>>> type: preonly > >>>>> maximum iterations=10000, initial guess is zero > >>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
> >>>>> left preconditioning > >>>>> using NONE norm type for convergence test > >>>>> PC Object: (sub_) 4 MPI processes > >>>>> type: telescope > >>>>> petsc subcomm: parent comm size reduction factor = 4 > >>>>> petsc subcomm: parent_size = 4 , subcomm_size = 1 > >>>>> petsc subcomm type = contiguous > >>>>> linear system matrix = precond matrix: > >>>>> Mat Object: (sub_) 4 MPI processes > >>>>> type: mpiaij > >>>>> rows=40200, cols=40200 > >>>>> total: nonzeros=199996, allocated nonzeros=203412 > >>>>> total number of mallocs used during MatSetValues calls=0 > >>>>> not using I-node (on process 0) routines > >>>>> setup type: default > >>>>> Parent DM object: NULL > >>>>> Sub DM object: NULL > >>>>> KSP Object: (sub_telescope_) 1 MPI processes > >>>>> type: preonly > >>>>> maximum iterations=10000, initial guess is zero > >>>>> tolerances: relative=1e-05, absolute=1e-50, > divergence=10000. > >>>>> left preconditioning > >>>>> using NONE norm type for convergence test > >>>>> PC Object: (sub_telescope_) 1 MPI processes > >>>>> type: lu > >>>>> out-of-place factorization > >>>>> tolerance for zero pivot 2.22045e-14 > >>>>> matrix ordering: external > >>>>> factor fill ratio given 0., needed 0. > >>>>> Factored matrix follows: > >>>>> Mat Object: 1 MPI processes > >>>>> type: mumps > >>>>> rows=40200, cols=40200 > >>>>> package used to perform factorization: mumps > >>>>> total: nonzeros=1849788, allocated > nonzeros=1849788 > >>>>> MUMPS run parameters: > >>>>> SYM (matrix type): 0 > >>>>> PAR (host participation): 1 > >>>>> ICNTL(1) (output for error): 6 > >>>>> ICNTL(2) (output of diagnostic msg): 0 > >>>>> ICNTL(3) (output for global info): 0 > >>>>> ICNTL(4) (level of printing): 0 > >>>>> ICNTL(5) (input mat struct): 0 > >>>>> ICNTL(6) (matrix prescaling): 7 > >>>>> ICNTL(7) (sequential matrix ordering):7 > >>>>> ICNTL(8) (scaling strategy): 77 > >>>>> ICNTL(10) (max num of refinements): 0 > >>>>> ICNTL(11) (error analysis): 0 > >>>>> ICNTL(12) (efficiency control): 1 > >>>>> ICNTL(13) (sequential factorization of the > root node): 0 > >>>>> ICNTL(14) (percentage of estimated workspace > increase): 20 > >>>>> ICNTL(18) (input mat struct): 0 > >>>>> ICNTL(19) (Schur complement info): 0 > >>>>> ICNTL(20) (RHS sparse pattern): 0 > >>>>> ICNTL(21) (solution struct): 0 > >>>>> ICNTL(22) (in-core/out-of-core facility): > 0 > >>>>> ICNTL(23) (max size of memory can be > allocated locally):0 > >>>>> ICNTL(24) (detection of null pivot rows): > 0 > >>>>> ICNTL(25) (computation of a null space > basis): 0 > >>>>> ICNTL(26) (Schur options for RHS or > solution): 0 > >>>>> ICNTL(27) (blocking size for multiple RHS): > -32 > >>>>> ICNTL(28) (use parallel or sequential > ordering): 1 > >>>>> ICNTL(29) (parallel ordering): 0 > >>>>> ICNTL(30) (user-specified set of entries in > inv(A)): 0 > >>>>> ICNTL(31) (factors is discarded in the solve > phase): 0 > >>>>> ICNTL(33) (compute determinant): 0 > >>>>> ICNTL(35) (activate BLR based > factorization): 0 > >>>>> ICNTL(36) (choice of BLR factorization > variant): 0 > >>>>> ICNTL(38) (estimated compression rate of LU > factors): 333 > >>>>> CNTL(1) (relative pivoting threshold): > 0.01 > >>>>> CNTL(2) (stopping criterion of refinement): > 1.49012e-08 > >>>>> CNTL(3) (absolute pivoting threshold): 0. > >>>>> CNTL(4) (value of static pivoting): > -1. > >>>>> CNTL(5) (fixation for null pivots): 0. > >>>>> CNTL(7) (dropping parameter for BLR): 0. 
> >>>>> RINFO(1) (local estimated flops for the > elimination after analysis): > >>>>> [0] 1.45525e+08 > >>>>> RINFO(2) (local estimated flops for the > assembly after factorization): > >>>>> [0] 2.89397e+06 > >>>>> RINFO(3) (local estimated flops for the > elimination after factorization): > >>>>> [0] 1.45525e+08 > >>>>> INFO(15) (estimated size of (in MB) MUMPS > internal data for running numerical factorization): > >>>>> [0] 29 > >>>>> INFO(16) (size of (in MB) MUMPS internal data > used during numerical factorization): > >>>>> [0] 29 > >>>>> INFO(23) (num of pivots eliminated on this > processor after factorization): > >>>>> [0] 40200 > >>>>> RINFOG(1) (global estimated flops for the > elimination after analysis): 1.45525e+08 > >>>>> RINFOG(2) (global estimated flops for the > assembly after factorization): 2.89397e+06 > >>>>> RINFOG(3) (global estimated flops for the > elimination after factorization): 1.45525e+08 > >>>>> (RINFOG(12) RINFOG(13))*2^INFOG(34) > (determinant): (0.,0.)*(2^0) > >>>>> INFOG(3) (estimated real workspace for > factors on all processors after analysis): 1849788 > >>>>> INFOG(4) (estimated integer workspace for > factors on all processors after analysis): 879986 > >>>>> INFOG(5) (estimated maximum front size in the > complete tree): 282 > >>>>> INFOG(6) (number of nodes in the complete > tree): 23709 > >>>>> INFOG(7) (ordering option effectively used > after analysis): 5 > >>>>> INFOG(8) (structural symmetry in percent of > the permuted matrix after analysis): 100 > >>>>> INFOG(9) (total real/complex workspace to > store the matrix factors after factorization): 1849788 > >>>>> INFOG(10) (total integer space store the > matrix factors after factorization): 879986 > >>>>> INFOG(11) (order of largest frontal matrix > after factorization): 282 > >>>>> INFOG(12) (number of off-diagonal pivots): 0 > >>>>> INFOG(13) (number of delayed pivots after > factorization): 0 > >>>>> INFOG(14) (number of memory compress after > factorization): 0 > >>>>> INFOG(15) (number of steps of iterative > refinement after solution): 0 > >>>>> INFOG(16) (estimated size (in MB) of all > MUMPS internal data for factorization after analysis: value on the most > memory consuming processor): 29 > >>>>> INFOG(17) (estimated size of all MUMPS > internal data for factorization after analysis: sum over all processors): 29 > >>>>> INFOG(18) (size of all MUMPS internal data > allocated during factorization: value on the most memory consuming > processor): 29 > >>>>> INFOG(19) (size of all MUMPS internal data > allocated during factorization: sum over all processors): 29 > >>>>> INFOG(20) (estimated number of entries in the > factors): 1849788 > >>>>> INFOG(21) (size in MB of memory effectively > used during factorization - value on the most memory consuming processor): > 26 > >>>>> INFOG(22) (size in MB of memory effectively > used during factorization - sum over all processors): 26 > >>>>> INFOG(23) (after analysis: value of ICNTL(6) > effectively used): 0 > >>>>> INFOG(24) (after analysis: value of ICNTL(12) > effectively used): 1 > >>>>> INFOG(25) (after factorization: number of > pivots modified by static pivoting): 0 > >>>>> INFOG(28) (after factorization: number of > null pivots encountered): 0 > >>>>> INFOG(29) (after factorization: effective > number of entries in the factors (sum over all processors)): 1849788 > >>>>> INFOG(30, 31) (after solution: size in Mbytes > of memory used during solution phase): 29, 29 > >>>>> INFOG(32) (after analysis: type of analysis > done): 1 > >>>>> INFOG(33) 
(value used for ICNTL(8)): 7 > >>>>> INFOG(34) (exponent of the determinant if > determinant is requested): 0 > >>>>> INFOG(35) (after factorization: number of > entries taking into account BLR factor compression - sum over all > processors): 1849788 > >>>>> INFOG(36) (after analysis: estimated size of > all MUMPS internal data for running BLR in-core - value on the most memory > consuming processor): 0 > >>>>> INFOG(37) (after analysis: estimated size of > all MUMPS internal data for running BLR in-core - sum over all processors): > 0 > >>>>> INFOG(38) (after analysis: estimated size of > all MUMPS internal data for running BLR out-of-core - value on the most > memory consuming processor): 0 > >>>>> INFOG(39) (after analysis: estimated size of > all MUMPS internal data for running BLR out-of-core - sum over all > processors): 0 > >>>>> linear system matrix = precond matrix: > >>>>> Mat Object: 1 MPI processes > >>>>> type: seqaijcusparse > >>>>> rows=40200, cols=40200 > >>>>> total: nonzeros=199996, allocated nonzeros=199996 > >>>>> total number of mallocs used during MatSetValues calls=0 > >>>>> not using I-node routines > >>>>> linear system matrix = precond matrix: > >>>>> Mat Object: 16 MPI processes > >>>>> type: mpiaijcusparse > >>>>> rows=160800, cols=160800 > >>>>> total: nonzeros=802396, allocated nonzeros=1608000 > >>>>> total number of mallocs used during MatSetValues calls=0 > >>>>> not using I-node (on process 0) routines > >>>>> Norm of error 9.11684e-07 iterations 189 > >>>>> Chang > >>>>> On 10/14/21 10:10 PM, Chang Liu wrote: > >>>>>> Hi Barry, > >>>>>> > >>>>>> No problem. Here is the output. It seems that the resid norm > calculation is incorrect. > >>>>>> > >>>>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 > -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 > -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type > preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu > -sub_telescope_pc_factor_mat_solver_type cusparse > -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type > contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 > >>>>>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid > norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > >>>>>> 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid > norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > >>>>>> KSP Object: 16 MPI processes > >>>>>> type: fgmres > >>>>>> restart=30, using Classical (unmodified) Gram-Schmidt > Orthogonalization with no iterative refinement > >>>>>> happy breakdown tolerance 1e-30 > >>>>>> maximum iterations=2000, initial guess is zero > >>>>>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. > >>>>>> right preconditioning > >>>>>> using UNPRECONDITIONED norm type for convergence test > >>>>>> PC Object: 16 MPI processes > >>>>>> type: bjacobi > >>>>>> number of blocks = 4 > >>>>>> Local solver information for first block is in the following > KSP and PC objects on rank 0: > >>>>>> Use -ksp_view ::ascii_info_detail to display information for > all blocks > >>>>>> KSP Object: (sub_) 4 MPI processes > >>>>>> type: preonly > >>>>>> maximum iterations=10000, initial guess is zero > >>>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
> >>>>>> left preconditioning > >>>>>> using NONE norm type for convergence test > >>>>>> PC Object: (sub_) 4 MPI processes > >>>>>> type: telescope > >>>>>> petsc subcomm: parent comm size reduction factor = 4 > >>>>>> petsc subcomm: parent_size = 4 , subcomm_size = 1 > >>>>>> petsc subcomm type = contiguous > >>>>>> linear system matrix = precond matrix: > >>>>>> Mat Object: (sub_) 4 MPI processes > >>>>>> type: mpiaij > >>>>>> rows=40200, cols=40200 > >>>>>> total: nonzeros=199996, allocated nonzeros=203412 > >>>>>> total number of mallocs used during MatSetValues calls=0 > >>>>>> not using I-node (on process 0) routines > >>>>>> setup type: default > >>>>>> Parent DM object: NULL > >>>>>> Sub DM object: NULL > >>>>>> KSP Object: (sub_telescope_) 1 MPI processes > >>>>>> type: preonly > >>>>>> maximum iterations=10000, initial guess is zero > >>>>>> tolerances: relative=1e-05, absolute=1e-50, > divergence=10000. > >>>>>> left preconditioning > >>>>>> using NONE norm type for convergence test > >>>>>> PC Object: (sub_telescope_) 1 MPI processes > >>>>>> type: lu > >>>>>> out-of-place factorization > >>>>>> tolerance for zero pivot 2.22045e-14 > >>>>>> matrix ordering: nd > >>>>>> factor fill ratio given 5., needed 8.62558 > >>>>>> Factored matrix follows: > >>>>>> Mat Object: 1 MPI processes > >>>>>> type: seqaijcusparse > >>>>>> rows=40200, cols=40200 > >>>>>> package used to perform factorization: cusparse > >>>>>> total: nonzeros=1725082, allocated > nonzeros=1725082 > >>>>>> not using I-node routines > >>>>>> linear system matrix = precond matrix: > >>>>>> Mat Object: 1 MPI processes > >>>>>> type: seqaijcusparse > >>>>>> rows=40200, cols=40200 > >>>>>> total: nonzeros=199996, allocated nonzeros=199996 > >>>>>> total number of mallocs used during MatSetValues > calls=0 > >>>>>> not using I-node routines > >>>>>> linear system matrix = precond matrix: > >>>>>> Mat Object: 16 MPI processes > >>>>>> type: mpiaijcusparse > >>>>>> rows=160800, cols=160800 > >>>>>> total: nonzeros=802396, allocated nonzeros=1608000 > >>>>>> total number of mallocs used during MatSetValues calls=0 > >>>>>> not using I-node (on process 0) routines > >>>>>> Norm of error 400.999 iterations 1 > >>>>>> > >>>>>> Chang > >>>>>> > >>>>>> > >>>>>> On 10/14/21 9:47 PM, Barry Smith wrote: > >>>>>>> > >>>>>>> Chang, > >>>>>>> > >>>>>>> Sorry I did not notice that one. Please run that with > -ksp_view -ksp_monitor_true_residual so we can see exactly how options are > interpreted and solver used. At a glance it looks ok but something must be > wrong to get the wrong answer. > >>>>>>> > >>>>>>> Barry > >>>>>>> > >>>>>>>> On Oct 14, 2021, at 6:02 PM, Chang Liu wrote: > >>>>>>>> > >>>>>>>> Hi Barry, > >>>>>>>> > >>>>>>>> That is exactly what I was doing in the second example, in which > the preconditioner works but the GMRES does not. > >>>>>>>> > >>>>>>>> Chang > >>>>>>>> > >>>>>>>> On 10/14/21 5:15 PM, Barry Smith wrote: > >>>>>>>>> You need to use the PCTELESCOPE inside the block Jacobi, not > outside it. So something like -pc_type bjacobi -sub_pc_type telescope > -sub_telescope_pc_type lu > >>>>>>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote: > >>>>>>>>>> > >>>>>>>>>> Hi Pierre, > >>>>>>>>>> > >>>>>>>>>> I wonder if the trick of PCTELESCOPE only works for > preconditioner and not for the solver. I have done some tests, and find > that for solving a small matrix using -telescope_ksp_type preonly, it does > work for GPU with multiple MPI processes. However, for bjacobi and gmres, > it does not work. 
> >>>>>>>>>> The command line options I used for small matrix is like
> >>>>>>>>>>
> >>>>>>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4
> >>>>>>>>>>
> >>>>>>>>>> which gives the correct output. For the iterative solver, I tried
> >>>>>>>>>>
> >>>>>>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20
> >>>>>>>>>>
> >>>>>>>>>> for the large matrix. The output is like
> >>>>>>>>>>
> >>>>>>>>>> 0 KSP Residual norm 40.1497
> >>>>>>>>>> 1 KSP Residual norm < 1.e-11
> >>>>>>>>>> Norm of error 400.999 iterations 1
> >>>>>>>>>>
> >>>>>>>>>> So it seems to call a direct solver instead of an iterative one.
> >>>>>>>>>>
> >>>>>>>>>> Can you please help check these options?
> >>>>>>>>>>
> >>>>>>>>>> Chang
> >>>>>>>>>>
> >>>>>>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote:
> >>>>>>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds exactly like what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaijcusparse? Or do I have to do it manually?
> >>>>>>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat().
> >>>>>>>>>>> 1) I'm not sure this is implemented for cuSparse matrices, but it should be;
> >>>>>>>>>>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually "smart" enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning.
> >>>>>>>>>>> If you try this out and this does not work, please provide the backtrace (probably something like "Operation XYZ not implemented for MatType ABC"), and hopefully someone can add the missing plumbing.
> >>>>>>>>>>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve.
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Pierre
> >>>>>>>>>>>> Chang
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote:
> >>>>>>>>>>>>> Maybe I'm missing something, but can't you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block?
> >>>>>>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu
> >>>>>>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads because not only does the Mat need to be redistributed, the secondary processes also need to be "converted" to OpenMP threads.
> >>>>>>>>>>>>> Thus the need for specific code in mumps.c.
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Pierre
> >>>>>>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users <petsc-users at mcs.anl.gov> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Junchao,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yes that is what I want.
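A minimal sketch of the Seq-versus-MPI type resolution Pierre describes (my own illustration, not code from this thread; MATAIJ is used so no CUDA build is required, and -mat_type aijcusparse is expected to resolve the same way to seqaijcusparse on a single rank):

/* Sketch: a generic "aij" MatType resolves to a MatSeqX on a
 * communicator of size 1, and to a MatMPIX otherwise. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  MatType        type;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = MatCreate(PETSC_COMM_SELF, &A);CHKERRQ(ierr);   /* size-1 communicator */
  ierr = MatSetSizes(A, 10, 10, 10, 10);CHKERRQ(ierr);
  ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);            /* generic type, not MATSEQAIJ/MATMPIAIJ */
  ierr = MatSetUp(A);CHKERRQ(ierr);
  ierr = MatGetType(A, &type);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_SELF, "resolved MatType: %s\n", type);CHKERRQ(ierr); /* prints seqaij */
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}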
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Chang > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: > >>>>>>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith < > bsmith at petsc.dev > wrote: > >>>>>>>>>>>>>>> Junchao, > >>>>>>>>>>>>>>> If I understand correctly Chang is using the > block Jacobi > >>>>>>>>>>>>>>> method with a single block for a number of MPI ranks > and a direct > >>>>>>>>>>>>>>> solver for each block so it uses > PCSetUp_BJacobi_Multiproc() which > >>>>>>>>>>>>>>> is code Hong Zhang wrote a number of years ago for > CPUs. For their > >>>>>>>>>>>>>>> particular problems this preconditioner works well, > but using an > >>>>>>>>>>>>>>> iterative solver on the blocks does not work well. > >>>>>>>>>>>>>>> If we had complete MPI-GPU direct solvers he > could just use > >>>>>>>>>>>>>>> the current code with MPIAIJCUSPARSE on each block > but since we do > >>>>>>>>>>>>>>> not he would like to use a single GPU for each block, > this means > >>>>>>>>>>>>>>> that diagonal blocks of the global parallel MPI > matrix needs to be > >>>>>>>>>>>>>>> sent to a subset of the GPUs (one GPU per block, > which has multiple > >>>>>>>>>>>>>>> MPI ranks associated with the blocks). Similarly for > the triangular > >>>>>>>>>>>>>>> solves the blocks of the right hand side needs to be > shipped to the > >>>>>>>>>>>>>>> appropriate GPU and the resulting solution shipped > back to the > >>>>>>>>>>>>>>> multiple GPUs. So Chang is absolutely correct, this > is somewhat like > >>>>>>>>>>>>>>> your code for MUMPS with OpenMP. OK, I now understand > the background.. > >>>>>>>>>>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the > blocks on the > >>>>>>>>>>>>>>> MPI ranks and then shrink each block down to a single > GPU but this > >>>>>>>>>>>>>>> would be pretty inefficient, ideally one would go > directly from the > >>>>>>>>>>>>>>> big MPI matrix on all the GPUs to the sub matrices on > the subset of > >>>>>>>>>>>>>>> GPUs. But this may be a large coding project. > >>>>>>>>>>>>>>> I don't understand these sentences. Why do you say > "shrink"? In my mind, we just need to move each block (submatrix) living > over multiple MPI ranks to one of them and solve directly there. In other > words, we keep blocks' size, no shrinking or expanding. > >>>>>>>>>>>>>>> As mentioned before, cusparse does not provide LU > factorization. So the LU factorization would be done on CPU, and the solve > be done on GPU. I assume Chang wants to gain from the (potential) faster > solve (instead of factorization) on GPU. > >>>>>>>>>>>>>>> Barry > >>>>>>>>>>>>>>> Since the matrices being factored and solved directly > are relatively > >>>>>>>>>>>>>>> large it is possible that the cusparse code could be > reasonably > >>>>>>>>>>>>>>> efficient (they are not the tiny problems one gets at > the coarse > >>>>>>>>>>>>>>> level of multigrid). Of course, this is speculation, > I don't > >>>>>>>>>>>>>>> actually know how much better the cusparse code would > be on the > >>>>>>>>>>>>>>> direct solver than a good CPU direct sparse solver. > >>>>>>>>>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu < > cliu at pppl.gov > >>>>>>>>>>>>>>> > wrote: > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > Sorry I am not familiar with the details either. > Can you please > >>>>>>>>>>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in > mumps.c? 
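The option nesting that realizes the layout Barry describes above (one block per group of MPI ranks, each block gathered onto the single rank driving a GPU) is the one exercised elsewhere in this thread; as a sketch, with 16 ranks, 4 blocks and a reduction factor of 4 chosen purely for illustration:

mpiexec -n 16 ./ex7 -m 400 -mat_type aijcusparse -ksp_type fgmres \
  -pc_type bjacobi -pc_bjacobi_blocks 4 \
  -sub_ksp_type preonly -sub_pc_type telescope \
  -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous \
  -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu \
  -sub_telescope_pc_factor_mat_solver_type cusparse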
> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > Chang > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: > >>>>>>>>>>>>>>> >> Hi Chang, > >>>>>>>>>>>>>>> >> I did the work in mumps. It is easy for me to > understand > >>>>>>>>>>>>>>> gathering matrix rows to one process. > >>>>>>>>>>>>>>> >> But how to gather blocks (submatrices) to form > a large block? Can you draw a picture of that? > >>>>>>>>>>>>>>> >> Thanks > >>>>>>>>>>>>>>> >> --Junchao Zhang > >>>>>>>>>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via > petsc-users > >>>>>>>>>>>>>>> petsc-users at mcs.anl.gov> > >>>>>>>>>>>>>>> petsc-users at mcs.anl.gov>>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> >> Hi Barry, > >>>>>>>>>>>>>>> >> I think mumps solver in petsc does support > that. You can > >>>>>>>>>>>>>>> check the > >>>>>>>>>>>>>>> >> documentation on "-mat_mumps_use_omp_threads" > at > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html > >>>>>>>>>>>>>>> < > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html> > >>>>>>>>>>>>>>> >> < > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html > >>>>>>>>>>>>>>> < > https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html>> > >>>>>>>>>>>>>>> >> and the code enclosed by #if > >>>>>>>>>>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in > >>>>>>>>>>>>>>> >> functions MatMumpsSetUpDistRHSInfo and > >>>>>>>>>>>>>>> >> MatMumpsGatherNonzerosOnMaster in > >>>>>>>>>>>>>>> >> mumps.c > >>>>>>>>>>>>>>> >> 1. I understand it is ideal to do one MPI rank > per GPU. > >>>>>>>>>>>>>>> However, I am > >>>>>>>>>>>>>>> >> working on an existing code that was developed > based on MPI > >>>>>>>>>>>>>>> and the the > >>>>>>>>>>>>>>> >> # of mpi ranks is typically equal to # of cpu > cores. We don't > >>>>>>>>>>>>>>> want to > >>>>>>>>>>>>>>> >> change the whole structure of the code. > >>>>>>>>>>>>>>> >> 2. What you have suggested has been coded in > mumps.c. See > >>>>>>>>>>>>>>> function > >>>>>>>>>>>>>>> >> MatMumpsSetUpDistRHSInfo. > >>>>>>>>>>>>>>> >> Regards, > >>>>>>>>>>>>>>> >> Chang > >>>>>>>>>>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu < > cliu at pppl.gov > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> >> >> > wrote: > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >> Hi Barry, > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >> That is exactly what I want. > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >> Back to my original question, I am looking > for an approach to > >>>>>>>>>>>>>>> >> transfer > >>>>>>>>>>>>>>> >> >> matrix > >>>>>>>>>>>>>>> >> >> data from many MPI processes to "master" > MPI > >>>>>>>>>>>>>>> >> >> processes, each of which taking care of > one GPU, and then > >>>>>>>>>>>>>>> upload > >>>>>>>>>>>>>>> >> the data to GPU to > >>>>>>>>>>>>>>> >> >> solve. > >>>>>>>>>>>>>>> >> >> One can just grab some codes from mumps.c > to > >>>>>>>>>>>>>>> aijcusparse.cu > >>>>>>>>>>>>>>> >> >>. > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > mumps.c doesn't actually do that. It > never needs to > >>>>>>>>>>>>>>> copy the > >>>>>>>>>>>>>>> >> entire matrix to a single MPI rank. 
> >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > It would be possible to write such a > code that you > >>>>>>>>>>>>>>> suggest but > >>>>>>>>>>>>>>> >> it is not clear that it makes sense > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > 1) For normal PETSc GPU usage there is one > GPU per MPI > >>>>>>>>>>>>>>> rank, so > >>>>>>>>>>>>>>> >> while your one GPU per big domain is solving > its systems the > >>>>>>>>>>>>>>> other > >>>>>>>>>>>>>>> >> GPUs (with the other MPI ranks that share that > domain) are doing > >>>>>>>>>>>>>>> >> nothing. > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > 2) For each triangular solve you would have > to gather the > >>>>>>>>>>>>>>> right > >>>>>>>>>>>>>>> >> hand side from the multiple ranks to the > single GPU to pass it to > >>>>>>>>>>>>>>> >> the GPU solver and then scatter the resulting > solution back > >>>>>>>>>>>>>>> to all > >>>>>>>>>>>>>>> >> of its subdomain ranks. > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > What I was suggesting was assign an > entire subdomain to a > >>>>>>>>>>>>>>> >> single MPI rank, thus it does everything on > one GPU and can > >>>>>>>>>>>>>>> use the > >>>>>>>>>>>>>>> >> GPU solver directly. If all the major > computations of a subdomain > >>>>>>>>>>>>>>> >> can fit and be done on a single GPU then you > would be > >>>>>>>>>>>>>>> utilizing all > >>>>>>>>>>>>>>> >> the GPUs you are using effectively. > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > Barry > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >> Chang > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: > >>>>>>>>>>>>>>> >> >>> Chang, > >>>>>>>>>>>>>>> >> >>> You are correct there is no MPI + > GPU direct > >>>>>>>>>>>>>>> solvers that > >>>>>>>>>>>>>>> >> currently do the triangular solves with MPI + > GPU parallelism > >>>>>>>>>>>>>>> that I > >>>>>>>>>>>>>>> >> am aware of. You are limited that individual > triangular solves be > >>>>>>>>>>>>>>> >> done on a single GPU. I can only suggest > making each subdomain as > >>>>>>>>>>>>>>> >> big as possible to utilize each GPU as much as > possible for the > >>>>>>>>>>>>>>> >> direct triangular solves. > >>>>>>>>>>>>>>> >> >>> Barry > >>>>>>>>>>>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu > via petsc-users > >>>>>>>>>>>>>>> >> petsc-users at mcs.anl.gov> > >>>>>>>>>>>>>>> petsc-users at mcs.anl.gov>>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> Hi Mark, > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> '-mat_type aijcusparse' works with > mpiaijcusparse with > >>>>>>>>>>>>>>> other > >>>>>>>>>>>>>>> >> solvers, but with -pc_factor_mat_solver_type > cusparse, it > >>>>>>>>>>>>>>> will give > >>>>>>>>>>>>>>> >> an error. > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> Yes what I want is to have mumps or > superlu to do the > >>>>>>>>>>>>>>> >> factorization, and then do the rest, including > GMRES solver, > >>>>>>>>>>>>>>> on gpu. > >>>>>>>>>>>>>>> >> Is that possible? > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> I have tried to use aijcusparse with > superlu_dist, it > >>>>>>>>>>>>>>> runs but > >>>>>>>>>>>>>>> >> the iterative solver is still running on CPUs. > I have > >>>>>>>>>>>>>>> contacted the > >>>>>>>>>>>>>>> >> superlu group and they confirmed that is the > case right now. > >>>>>>>>>>>>>>> But if > >>>>>>>>>>>>>>> >> I set -pc_factor_mat_solver_type cusparse, it > seems that the > >>>>>>>>>>>>>>> >> iterative solver is running on GPU. 
> >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> Chang > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: > >>>>>>>>>>>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang > Liu > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> >> >>> > wrote: > >>>>>>>>>>>>>>> >> >>>>> Thank you Junchao for explaining > this. I guess in > >>>>>>>>>>>>>>> my case > >>>>>>>>>>>>>>> >> the code is > >>>>>>>>>>>>>>> >> >>>>> just calling a seq solver like > superlu to do > >>>>>>>>>>>>>>> >> factorization on GPUs. > >>>>>>>>>>>>>>> >> >>>>> My idea is that I want to have a > traditional MPI > >>>>>>>>>>>>>>> code to > >>>>>>>>>>>>>>> >> utilize GPUs > >>>>>>>>>>>>>>> >> >>>>> with cusparse. Right now cusparse > does not support > >>>>>>>>>>>>>>> mpiaij > >>>>>>>>>>>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' > will give you an > >>>>>>>>>>>>>>> >> mpiaijcusparse matrix with > 1 processes. > >>>>>>>>>>>>>>> >> >>>>> (-mat_type mpiaijcusparse might also > work with >1 proc). > >>>>>>>>>>>>>>> >> >>>>> However, I see in grepping the repo > that all the mumps and > >>>>>>>>>>>>>>> >> superlu tests use aij or sell matrix type. > >>>>>>>>>>>>>>> >> >>>>> MUMPS and SuperLU provide their own > solves, I assume > >>>>>>>>>>>>>>> .... but > >>>>>>>>>>>>>>> >> you might want to do other matrix operations > on the GPU. Is > >>>>>>>>>>>>>>> that the > >>>>>>>>>>>>>>> >> issue? > >>>>>>>>>>>>>>> >> >>>>> Did you try -mat_type aijcusparse with > MUMPS and/or > >>>>>>>>>>>>>>> SuperLU > >>>>>>>>>>>>>>> >> have a problem? (no test with it so it > probably does not work) > >>>>>>>>>>>>>>> >> >>>>> Thanks, > >>>>>>>>>>>>>>> >> >>>>> Mark > >>>>>>>>>>>>>>> >> >>>>> so I > >>>>>>>>>>>>>>> >> >>>>> want the code to have a mpiaij > matrix when adding > >>>>>>>>>>>>>>> all the > >>>>>>>>>>>>>>> >> matrix terms, > >>>>>>>>>>>>>>> >> >>>>> and then transform the matrix to > seqaij when doing the > >>>>>>>>>>>>>>> >> factorization > >>>>>>>>>>>>>>> >> >>>>> and > >>>>>>>>>>>>>>> >> >>>>> solve. This involves sending the > data to the master > >>>>>>>>>>>>>>> >> process, and I > >>>>>>>>>>>>>>> >> >>>>> think > >>>>>>>>>>>>>>> >> >>>>> the petsc mumps solver have > something similar already. 
Chang

On 10/13/21 10:18 AM, Junchao Zhang wrote:

On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfadams at lbl.gov> wrote:

On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <cliu at pppl.gov> wrote:

Hi Mark,

The option I use is like

-pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type aijcusparse *-sub_pc_factor_mat_solver_type cusparse* -sub_ksp_type preonly *-sub_pc_type lu* -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300

Note, if you use -log_view the last column (rows are the method like MatFactorNumeric) has the percent of work in the GPU.

Junchao: *This* implies that we have a cuSparse LU factorization. Is that correct? (I don't think we do)

No, we don't have cuSparse LU factorization. If you check MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls MatLUFactorSymbolic_SeqAIJ() instead.

So I don't understand Chang's idea. Do you want to make bigger blocks?

I think this one does both factorization and solve on gpu.

You can check the runex72_aijcusparse.sh file in the petsc install directory, and try it yourself (this is only lu factorization without iterative solve).

Chang

On 10/12/21 1:17 PM, Mark Adams wrote:

On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <cliu at pppl.gov> wrote:

Hi Junchao,

No, I only need it to be transferred within a node. I use the block-Jacobi method and GMRES to solve the sparse matrix, so each direct solver will take care of a sub-block of the whole matrix. In this way, I can use one GPU to solve one sub-block, which is stored within one node.

It was stated in the documentation that the cusparse solver is slow. However, in my test using ex72.c, the cusparse solver is faster than mumps or superlu_dist on CPUs.

Are we talking about the factorization, the solve, or both?

We do not have an interface to cuSparse's LU factorization (I just learned that it exists a few weeks ago).

Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse'? This would be the CPU factorization, which is the dominant cost.

Chang

On 10/12/21 10:24 AM, Junchao Zhang wrote:

Hi, Chang,

For the mumps solver, we usually transfer matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right?

Mark, I remember you said cusparse solve is slow and you would rather do it on CPU. Is it right?

--Junchao Zhang

On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users <petsc-users at mcs.anl.gov> wrote:

Hi,

Currently, it is possible to use the mumps solver in PETSc with the -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then the master rank will call mumps with OpenMP to solve the matrix.

I wonder if someone can develop a similar option for the cusparse solver. Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to GPU to solve. In this way, one can use the cusparse solver for an MPI program.

Chang
--
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

--
Chang Liu
Staff Research
Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Eric.Chamberland at giref.ulaval.ca Wed Oct 20 14:06:36 2021 From: Eric.Chamberland at giref.ulaval.ca (Eric Chamberland) Date: Wed, 20 Oct 2021 15:06:36 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? In-Reply-To: References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> Message-ID: <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca> Hi Matthew, we tried to reproduce the error in a simple example. The context is the following: We hard coded the mesh and initial partition into the code (see sConnectivity and sInitialPartition) for 2 ranks and try to create a section in order to use the DMPlexNaturalToGlobalBegin function to retreive our initial element numbers. Now the call to DMPlexDistribute give different errors depending on what type of component we ask the field to be created.? For our objective, we would like a global field to be created on elements only (like a P0 interpolation). We now have the following error generated: [0]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [0]PETSC ERROR: Petsc has generated inconsistent data [0]PETSC ERROR: Inconsistency in indices, 18 should be 17 [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting. [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar 30, 2021 [0]PETSC ERROR: ./bug on a? named rohan by ericc Wed Oct 20 14:52:36 2021 [0]PETSC ERROR: Configure options --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 --with-mpi-compilers=1 --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 --with-cxx-dialect=C++14 --with-make-np=12 --with-shared-libraries=1 --with-debugging=yes --with-memalign=64 --with-visibility=0 --with-64-bit-indices=0 --download-ml=yes --download-mumps=yes --download-superlu=yes --download-hpddm=yes --download-slepc=yes --download-superlu_dist=yes --download-parmetis=yes --download-ptscotch=yes --download-metis=yes --download-strumpack=yes --download-suitesparse=yes --download-hypre=yes --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. --with-scalapack=1 --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 [0]PETSC ERROR: #2 DMPlexCreateGlobalToNaturalSF() at /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 [0]PETSC ERROR: #3 DMPlexDistribute() at /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 [0]PETSC ERROR: #4 main() at bug_section.cc:159 [0]PETSC ERROR: No PETSc Option Table entries [0]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov---------- Hope the attached code is self-explaining, note that to make it short, we have not included the final part of it, just the buggy part we are encountering right now... 
Thanks for your insights, Eric On 2021-10-06 9:23 p.m., Matthew Knepley wrote: > On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland > > wrote: > > Hi Matthew, > > we tried to use that.? Now, we discovered that: > > 1- even if we "ask" for sfNatural creation with DMSetUseNatural, > it is not created because DMPlexCreateGlobalToNaturalSF looks for > a "section": this is not documented in DMSetUseNaturalso we are > asking ourselfs: "is this a permanent feature or a temporary > situation?" > > I think explaining this will help clear up a lot. > > What the Natural2Global?map does is permute a solution vector into the > ordering that it would have had prior to mesh distribution. > Now, in order to do this permutation, I need to know the original > (global) data layout. If it is not specified _before_ distribution, we > cannot build the permutation.? The section describes the data layout, > so I need it before distribution. > > I cannot think of another way that you would implement this, but if > you want something else, let me know. > > 2- We then tried to create a "section" in different manners: we > took the code into the example > petsc/src/dm/impls/plex/tests/ex15.c.? However, we ended up with a > segfault: > > corrupted size vs. prev_size > [rohan:07297] *** Process received signal *** > [rohan:07297] Signal: Aborted (6) > [rohan:07297] Signal code:? (-6) > [rohan:07297] [ 0] /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] > [rohan:07297] [ 1] /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] > [rohan:07297] [ 2] /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] > [rohan:07297] [ 3] /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] > [rohan:07297] [ 4] /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] > [rohan:07297] [ 5] /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] > [rohan:07297] [ 6] /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] > [rohan:07297] [ 7] /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] > [rohan:07297] [ 8] > /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] > [rohan:07297] [ 9] > /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] > [rohan:07297] [10] > /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] > [rohan:07297] [11] > /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] > [rohan:07297] [12] > /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] > [rohan:07297] [13] /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] > > [rohan:07297] [14] > /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] > [rohan:07297] [15] > /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] > [rohan:07297] [16] > /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] > [rohan:07297] [17] > /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] > [rohan:07297] [18] > /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] > > I am not sure what happened here, but if you could send a sample code, > I will figure it out. > > If we do not create a section, the call to DMPlexDistribute is > successful, but DMPlexGetGlobalToNaturalSF return a null SF pointer... > > Yes, it just ignores it in this case because it does not have a global > layout. 
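As an aside, a compressed sketch of the sequence Matthew describes (illustrative only, not the attached bug_section.cc, and not a claim about what fixes the error above): a one-field section is laid out before distribution so the natural ordering can be built, with the single dof placed at depth dim, i.e. on the cells, for a cell-centered (P0-like) field; dm is assumed to be an interpolated DMPlex.

  PetscSection   section;
  PetscSF        sfMigration = NULL;
  DM             dmDist      = NULL;
  Vec            global, natural;
  PetscInt       dim, numComp[1] = {1}, numDof[4] = {0, 0, 0, 0};
  PetscErrorCode ierr;

  ierr = DMGetDimension(dm, &dim);CHKERRQ(ierr);
  numDof[dim] = 1;                                   /* one dof per cell, none on vertices/edges/faces */
  ierr = DMSetNumFields(dm, 1);CHKERRQ(ierr);
  ierr = DMPlexCreateSection(dm, NULL, numComp, numDof, 0, NULL, NULL, NULL, NULL, &section);CHKERRQ(ierr);
  ierr = DMSetLocalSection(dm, section);CHKERRQ(ierr);
  ierr = PetscSectionDestroy(&section);CHKERRQ(ierr);

  ierr = DMSetUseNatural(dm, PETSC_TRUE);CHKERRQ(ierr);            /* must precede distribution */
  ierr = DMPlexDistribute(dm, 0, &sfMigration, &dmDist);CHKERRQ(ierr);

  /* Permute a distributed (global) vector back into the pre-distribution
     (natural) ordering; DMPlexNaturalToGlobalBegin/End go the other way. */
  ierr = DMCreateGlobalVector(dmDist, &global);CHKERRQ(ierr);
  ierr = VecDuplicate(global, &natural);CHKERRQ(ierr);
  ierr = DMPlexGlobalToNaturalBegin(dmDist, global, natural);CHKERRQ(ierr);
  ierr = DMPlexGlobalToNaturalEnd(dmDist, global, natural);CHKERRQ(ierr);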
> > Here are the operations we are calling (this is almost the code we are using, I just removed verifications and creation of the connectivity which use our parallel structure and code):
> >
> > ===========
> >
> >   PetscInt* lCells      = 0;
> >   PetscInt  lNumCorners = 0;
> >   PetscInt  lDimMail    = 0;
> >   PetscInt  lnumCells   = 0;
> >
> >   // At this point we create the cells for the PETSc expected input for
> >   // DMPlexBuildFromCellListParallel and set lNumCorners, lDimMail and
> >   // lnumCells to correct values.
> >   ...
> >
> >   DM        lDMBete = 0;
> >   DMPlexCreate(lMPIComm, &lDMBete);
> >
> >   DMSetDimension(lDMBete, lDimMail);
> >
> >   DMPlexBuildFromCellListParallel(lDMBete,
> >                                   lnumCells,
> >                                   PETSC_DECIDE,
> >                                   pLectureElementsLocaux.reqNbTotalSommets(),
> >                                   lNumCorners,
> >                                   lCells,
> >                                   PETSC_NULL);
> >
> >   DM lDMBeteInterp = 0;
> >   DMPlexInterpolate(lDMBete, &lDMBeteInterp);
> >   DMDestroy(&lDMBete);
> >   lDMBete = lDMBeteInterp;
> >
> >   DMSetUseNatural(lDMBete, PETSC_TRUE);
> >
> >   PetscSF lSFMigrationSansOvl = 0;
> >   PetscSF lSFMigrationOvl = 0;
> >   DM lDMDistribueSansOvl = 0;
> >   DM lDMAvecOverlap = 0;
> >
> >   PetscPartitioner lPart;
> >   DMPlexGetPartitioner(lDMBete, &lPart);
> >   PetscPartitionerSetFromOptions(lPart);
> >
> >   PetscSection   section;
> >   PetscInt       numFields   = 1;
> >   PetscInt       numBC       = 0;
> >   PetscInt       numComp[1]  = {1};
> >   PetscInt       numDof[4]   = {1, 0, 0, 0};
> >   PetscInt       bcFields[1] = {0};
> >   IS             bcPoints[1] = {NULL};
> >
> >   DMSetNumFields(lDMBete, numFields);
> >
> >   DMPlexCreateSection(lDMBete, NULL, numComp, numDof, numBC, bcFields, bcPoints, NULL, NULL, &section);
> >   DMSetLocalSection(lDMBete, section);
> >
> >   DMPlexDistribute(lDMBete, 0, &lSFMigrationSansOvl, &lDMDistribueSansOvl); // segfault!
> >
> > ===========
> >
> > So we have other questions/remarks:
> >
> > 3- Maybe PETSc expects something specific that is missing/not verified: for example, we didn't give any coordinates since we just want to partition and compute overlap for the mesh... and then recover our element numbers in a "simple way"
> >
> > 4- We are telling ourselves it is somewhat a "big price to pay" to have to build an unused section to have the global to natural ordering set? Could this requirement be avoided?
> >
> I don't think so. There would have to be _some_ way of describing your data layout in terms of mesh points, and I do not see how you could use less memory doing that.
>
> 5- Are there any improvements towards our usages in the 3.16 release?
>
> Let me try and run the code above.
>
>   Thanks,
>
>      Matt
>
> Thanks,
>
> Eric
>
>
> On 2021-09-29 7:39 p.m., Matthew Knepley wrote:
>> On Wed, Sep 29, 2021 at 5:18 PM Eric Chamberland wrote:
>>
>> Hi,
>>
>> I come back with _almost_ the original question:
>>
>> I would like to add an integer information (*our* original element number, not petsc one) on each element of the DMPlex I create with DMPlexBuildFromCellListParallel.
>>
>> I would like this integer to be distributed the same way DMPlexDistribute distributes the mesh.
>>
>> Is it possible to do this?
>>
>>
>> I think we already have support for what you want. If you call
>>
>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html
>>
>>
>> before DMPlexDistribute(), it will compute a PetscSF encoding the global to natural map.
You >> can get it with >> >> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html >> >> >> and use it with >> >> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html >> >> >> Is this sufficient? >> >> ? Thanks, >> >> ? ? ?Matt >> >> Thanks, >> >> Eric >> >> On 2021-07-14 1:18 p.m., Eric Chamberland wrote: >> > Hi, >> > >> > I want to use DMPlexDistribute from PETSc for computing >> overlapping >> > and play with the different partitioners supported. >> > >> > However, after calling DMPlexDistribute, I noticed the >> elements are >> > renumbered and then the original number is lost. >> > >> > What would be the best way to keep track of the element >> renumbering? >> > >> > a) Adding an optional parameter to let the user retrieve a >> vector or >> > "IS" giving the old number? >> > >> > b) Adding a DMLabel (seems a wrong good solution) >> > >> > c) Other idea? >> > >> > Of course, I don't want to loose performances with the need >> of this >> > "mapping"... >> > >> > Thanks, >> > >> > Eric >> > >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to >> which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -- Eric Chamberland, ing., M. Ing Professionnel de recherche GIREF/Universit? Laval (418) 656-2131 poste 41 22 42 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: bug_section.cc Type: text/x-c++src Size: 4728 bytes Desc: not available URL: From bsmith at petsc.dev Wed Oct 20 15:40:07 2021 From: bsmith at petsc.dev (Barry Smith) Date: Wed, 20 Oct 2021 16:40:07 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: References: <8b905166-b8b0-5886-9a46-096a086a9797@pppl.gov> <44BF9AE6-628A-406E-95D9-8AFEF86A1FA6@petsc.dev> <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> <879c30a1-ea85-1c24-4139-268925d511da@pppl.gov> <53D4EDD7-E05C-4485-B7AE-23AB10DD81B1@joliv.et> <968434BC-E8DC-49B0-9119-F208DB1E01B0@petsc.dev> <7a3d5347-f2da-b4a9-f44a-aa534a314c7f@pppl.gov> Message-ID: <144089C5-D011-4A94-9AC1-F4AD5A66257C@petsc.dev> Yes, but the branch can be used to do telescoping inside the bjacobi as needed. > On Oct 20, 2021, at 2:59 PM, Junchao Zhang wrote: > > The MR https://gitlab.com/petsc/petsc/-/merge_requests/4471 has not been merged yet. > > --Junchao Zhang > > > On Wed, Oct 20, 2021 at 1:47 PM Chang Liu via petsc-users > wrote: > Hi Barry, > > Are the fixes merged in the master? I was using bjacobi as a > preconditioner. 
Using the latest version of petsc, I found that by calling > > mpiexec -n 32 --oversubscribe ./ex7 -m 1000 -ksp_view > -ksp_monitor_true_residual -ksp_type fgmres -pc_type bjacobi -pc_bjacobi > _blocks 4 -sub_ksp_type preonly -sub_pc_type telescope > -sub_pc_telescope_reduction_factor 8 -sub_pc_telescope_subcomm_type > contiguous -sub_telescope_pc_type lu -sub_telescope_ksp_type preonly > -sub_telescope_pc_factor_mat_solver_type mumps -ksp_max_it 2000 > -ksp_rtol 1.e-30 -ksp_atol 1.e-30 > > The code is calling PCApply_BJacobi_Multiproc. If I use > > mpiexec -n 32 --oversubscribe ./ex7 -m 1000 -ksp_view > -ksp_monitor_true_residual -telescope_ksp_monitor_true_residual > -ksp_type preonly -pc_type telescope -pc_telescope_reduction_factor 8 > -pc_telescope_subcomm_type contiguous -telescope_pc_type bjacobi > -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4 > -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu > -telescope_sub_pc_factor_mat_solver_type mumps -telescope_ksp_max_it > 2000 -telescope_ksp_rtol 1.e-30 -telescope_ksp_atol 1.e-30 > > The code is calling PCApply_BJacobi_Singleblock. You can test it yourself. > > Regards, > > Chang > > On 10/20/21 1:14 PM, Barry Smith wrote: > > > > > >> On Oct 20, 2021, at 12:48 PM, Chang Liu > wrote: > >> > >> Hi Pierre, > >> > >> I have another suggestion for telescope. I have achieved my goal by putting telescope outside bjacobi. But the code still does not work if I use telescope as a pc for subblock. I think the reason is that I want to use cusparse as the solver, which can only deal with seqaij matrix and not mpiaij matrix. > > > > > > This is suppose to work with the recent fixes. The telescope should produce a seq matrix and for each solve map the parallel vector (over the subdomain) automatically down to the one rank with the GPU to solve it on the GPU. It is not clear to me where the process is going wrong. > > > > Barry > > > > > > > >> However, for telescope pc, it can put the matrix into one mpi rank, thus making it a seqaij for factorization stage, but then after factorization it will give the data back to the original comminicator. This will make the matrix back to mpiaij, and then cusparse cannot solve it. > >> > >> I think a better option is to do the factorization on CPU with mpiaij, then then transform the preconditioner matrix to seqaij and do the matsolve GPU. But I am not sure if it can be achieved using telescope. > >> > >> Regads, > >> > >> Chang > >> > >> On 10/15/21 5:29 AM, Pierre Jolivet wrote: > >>> Hi Chang, > >>> The output you sent with MUMPS looks alright to me, you can see that the MatType is properly set to seqaijcusparse (and not mpiaijcusparse). > >>> I don?t know what is wrong with -sub_telescope_pc_factor_mat_solver_type cusparse, I don?t have a PETSc installation for testing this, hopefully Barry or Junchao can confirm this wrong behavior and get this fixed. > >>> As for permuting PCTELESCOPE and PCBJACOBI, in your case, the outer PC will be equivalent, yes. > >>> However, it would be more efficient to do PCBJACOBI and then PCTELESCOPE. > >>> PCBJACOBI prunes the operator by basically removing all coefficients outside of the diagonal blocks. > >>> Then, PCTELESCOPE "groups everything together?. > >>> If you do it the other way around, PCTELESCOPE will ?group everything together? and then PCBJACOBI will prune the operator. > >>> So the PCTELESCOPE SetUp will be costly for nothing since some coefficients will be thrown out afterwards in the PCBJACOBI SetUp. 
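For completeness, the nesting can also be inspected programmatically once KSPSetUp() has run; below is a rough sketch (mine, not from the thread) using PCBJacobiGetSubKSP() and PCTelescopeGetKSP() to print the MatType the innermost factorization operates on. It assumes ksp is the outer solver configured as bjacobi -> telescope -> lu, and that the telescope's inner KSP is only non-NULL on ranks retained in the reduced communicator.

  /* Sketch: drill into  bjacobi -> telescope -> lu  after KSPSetUp(ksp) and
     print the type of the matrix the inner factorization sees (expected to
     be a sequential type such as seqaijcusparse, per the -ksp_view output
     earlier in this thread). */
  PC             pc, subpc;
  KSP           *subksp, tksp;
  PetscInt       nlocal, first;
  Mat            pmat;
  MatType        mtype;
  PetscErrorCode ierr;

  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCBJacobiGetSubKSP(pc, &nlocal, &first, &subksp);CHKERRQ(ierr); /* valid only after setup */
  ierr = KSPGetPC(subksp[0], &subpc);CHKERRQ(ierr);                      /* the PCTELESCOPE of this block */
  ierr = PCTelescopeGetKSP(subpc, &tksp);CHKERRQ(ierr);
  if (tksp) {                                      /* NULL on ranks outside the sub-communicator */
    ierr = KSPGetOperators(tksp, NULL, &pmat);CHKERRQ(ierr);
    ierr = MatGetType(pmat, &mtype);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_SELF, "inner telescope Pmat type: %s\n", mtype);CHKERRQ(ierr);
  }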
> >>> I hope I?m clear enough, otherwise I can try do draw some pictures. > >>> Thanks, > >>> Pierre > >>>> On 15 Oct 2021, at 4:39 AM, Chang Liu > wrote: > >>>> > >>>> Hi Pierre and Barry, > >>>> > >>>> I think maybe I should use telescope outside bjacobi? like this > >>>> > >>>> mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type telescope -pc_telescope_reduction_factor 4 -t > >>>> elescope_pc_type bjacobi -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4 -mat_type aijcusparse -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu -telescope_sub_pc_factor_mat_solve > >>>> r_type cusparse -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 > >>>> > >>>> But then I got an error that > >>>> > >>>> [0]PETSC ERROR: MatSolverType cusparse does not support matrix type seqaij > >>>> > >>>> But the mat type should be aijcusparse. I think telescope change the mat type. > >>>> > >>>> Chang > >>>> > >>>> On 10/14/21 10:11 PM, Chang Liu wrote: > >>>>> For comparison, here is the output using mumps instead of cusparse > >>>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type mumps -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 > >>>>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > >>>>> 1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 > >>>>> 2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 > >>>>> 3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 > >>>>> 4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 > >>>>> 5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 > >>>>> 6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 > >>>>> 7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 > >>>>> 8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03 > >>>>> 9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 > >>>>> 10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 > >>>>> 11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 > >>>>> 12 KSP unpreconditioned resid norm 2.118961639089e-01 true resid norm 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 > >>>>> 13 KSP unpreconditioned resid norm 1.885892030223e-01 true resid norm 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 > >>>>> 14 KSP unpreconditioned resid norm 1.763510666948e-01 true resid norm 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 > >>>>> 15 KSP unpreconditioned resid norm 
1.638219366731e-01 true resid norm 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 > >>>>> 16 KSP unpreconditioned resid norm 1.476792766432e-01 true resid norm 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 > >>>>> 17 KSP unpreconditioned resid norm 1.349906937321e-01 true resid norm 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 > >>>>> 18 KSP unpreconditioned resid norm 1.289673236836e-01 true resid norm 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 > >>>>> 19 KSP unpreconditioned resid norm 1.167505658153e-01 true resid norm 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 > >>>>> 20 KSP unpreconditioned resid norm 1.046037988999e-01 true resid norm 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 > >>>>> 21 KSP unpreconditioned resid norm 9.832660514331e-02 true resid norm 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 > >>>>> 22 KSP unpreconditioned resid norm 8.835618950141e-02 true resid norm 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 > >>>>> 23 KSP unpreconditioned resid norm 7.563496650115e-02 true resid norm 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 > >>>>> 24 KSP unpreconditioned resid norm 6.651291376834e-02 true resid norm 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 > >>>>> 25 KSP unpreconditioned resid norm 5.890393227906e-02 true resid norm 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 > >>>>> 26 KSP unpreconditioned resid norm 4.661992782780e-02 true resid norm 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 > >>>>> 27 KSP unpreconditioned resid norm 3.690705358716e-02 true resid norm 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 > >>>>> 28 KSP unpreconditioned resid norm 3.209680460188e-02 true resid norm 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 > >>>>> 29 KSP unpreconditioned resid norm 2.354337626000e-02 true resid norm 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 > >>>>> 30 KSP unpreconditioned resid norm 1.701296561785e-02 true resid norm 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 > >>>>> 31 KSP unpreconditioned resid norm 1.509942937258e-02 true resid norm 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 > >>>>> 32 KSP unpreconditioned resid norm 1.258274688515e-02 true resid norm 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 > >>>>> 33 KSP unpreconditioned resid norm 9.805748771638e-03 true resid norm 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 > >>>>> 34 KSP unpreconditioned resid norm 8.596552678160e-03 true resid norm 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 > >>>>> 35 KSP unpreconditioned resid norm 6.936406707500e-03 true resid norm 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 > >>>>> 36 KSP unpreconditioned resid norm 5.533741607932e-03 true resid norm 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 > >>>>> 37 KSP unpreconditioned resid norm 4.982347757923e-03 true resid norm 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 > >>>>> 38 KSP unpreconditioned resid norm 4.309608348059e-03 true resid norm 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 > >>>>> 39 KSP unpreconditioned resid norm 3.729408303186e-03 true resid norm 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 > >>>>> 40 KSP unpreconditioned resid norm 3.490003351128e-03 true resid norm 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 > >>>>> 41 KSP unpreconditioned resid norm 3.069012426454e-03 true resid norm 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 > >>>>> 42 KSP unpreconditioned resid norm 
2.772928845284e-03 true resid norm 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 > >>>>> 43 KSP unpreconditioned resid norm 2.561454192399e-03 true resid norm 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 > >>>>> 44 KSP unpreconditioned resid norm 2.253662762802e-03 true resid norm 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 > >>>>> 45 KSP unpreconditioned resid norm 2.086800523919e-03 true resid norm 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 > >>>>> 46 KSP unpreconditioned resid norm 1.926028182896e-03 true resid norm 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 > >>>>> 47 KSP unpreconditioned resid norm 1.769243808622e-03 true resid norm 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 > >>>>> 48 KSP unpreconditioned resid norm 1.656654905964e-03 true resid norm 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 > >>>>> 49 KSP unpreconditioned resid norm 1.572052627273e-03 true resid norm 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 > >>>>> 50 KSP unpreconditioned resid norm 1.454960682355e-03 true resid norm 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 > >>>>> 51 KSP unpreconditioned resid norm 1.375985053014e-03 true resid norm 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 > >>>>> 52 KSP unpreconditioned resid norm 1.269325501087e-03 true resid norm 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 > >>>>> 53 KSP unpreconditioned resid norm 1.184791772965e-03 true resid norm 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 > >>>>> 54 KSP unpreconditioned resid norm 1.064535156080e-03 true resid norm 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 > >>>>> 55 KSP unpreconditioned resid norm 9.639036688120e-04 true resid norm 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 > >>>>> 56 KSP unpreconditioned resid norm 8.632359780260e-04 true resid norm 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 > >>>>> 57 KSP unpreconditioned resid norm 7.613605783850e-04 true resid norm 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 > >>>>> 58 KSP unpreconditioned resid norm 6.681073248348e-04 true resid norm 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 > >>>>> 59 KSP unpreconditioned resid norm 5.656127908544e-04 true resid norm 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 > >>>>> 60 KSP unpreconditioned resid norm 4.850863370767e-04 true resid norm 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 > >>>>> 61 KSP unpreconditioned resid norm 4.374055762320e-04 true resid norm 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 > >>>>> 62 KSP unpreconditioned resid norm 3.874398257079e-04 true resid norm 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 > >>>>> 63 KSP unpreconditioned resid norm 3.364908694427e-04 true resid norm 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 > >>>>> 64 KSP unpreconditioned resid norm 2.961034697265e-04 true resid norm 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 > >>>>> 65 KSP unpreconditioned resid norm 2.640593092764e-04 true resid norm 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 > >>>>> 66 KSP unpreconditioned resid norm 2.423231125743e-04 true resid norm 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 > >>>>> 67 KSP unpreconditioned resid norm 2.182349471179e-04 true resid norm 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 > >>>>> 68 KSP unpreconditioned resid norm 2.008438265031e-04 true resid norm 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 > >>>>> 69 KSP unpreconditioned resid norm 
1.838732863386e-04 true resid norm 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 > >>>>> 70 KSP unpreconditioned resid norm 1.723786027645e-04 true resid norm 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 > >>>>> 71 KSP unpreconditioned resid norm 1.580945192204e-04 true resid norm 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 > >>>>> 72 KSP unpreconditioned resid norm 1.476687469671e-04 true resid norm 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 > >>>>> 73 KSP unpreconditioned resid norm 1.385018526182e-04 true resid norm 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 > >>>>> 74 KSP unpreconditioned resid norm 1.279712893541e-04 true resid norm 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 > >>>>> 75 KSP unpreconditioned resid norm 1.202010411772e-04 true resid norm 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 > >>>>> 76 KSP unpreconditioned resid norm 1.113459414198e-04 true resid norm 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 > >>>>> 77 KSP unpreconditioned resid norm 1.042523036036e-04 true resid norm 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 > >>>>> 78 KSP unpreconditioned resid norm 9.565176453232e-05 true resid norm 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 > >>>>> 79 KSP unpreconditioned resid norm 8.896901670359e-05 true resid norm 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 > >>>>> 80 KSP unpreconditioned resid norm 8.119298425803e-05 true resid norm 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 > >>>>> 81 KSP unpreconditioned resid norm 7.544528309154e-05 true resid norm 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 > >>>>> 82 KSP unpreconditioned resid norm 6.755385041138e-05 true resid norm 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 > >>>>> 83 KSP unpreconditioned resid norm 6.158629300870e-05 true resid norm 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 > >>>>> 84 KSP unpreconditioned resid norm 5.358756885754e-05 true resid norm 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 > >>>>> 85 KSP unpreconditioned resid norm 4.774852370380e-05 true resid norm 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 > >>>>> 86 KSP unpreconditioned resid norm 3.919358737908e-05 true resid norm 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 > >>>>> 87 KSP unpreconditioned resid norm 3.434042319950e-05 true resid norm 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 > >>>>> 88 KSP unpreconditioned resid norm 2.813699436281e-05 true resid norm 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 > >>>>> 89 KSP unpreconditioned resid norm 2.462248069068e-05 true resid norm 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 > >>>>> 90 KSP unpreconditioned resid norm 2.040558789626e-05 true resid norm 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 > >>>>> 91 KSP unpreconditioned resid norm 1.888523204468e-05 true resid norm 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 > >>>>> 92 KSP unpreconditioned resid norm 1.707071292484e-05 true resid norm 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 > >>>>> 93 KSP unpreconditioned resid norm 1.498636454665e-05 true resid norm 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 > >>>>> 94 KSP unpreconditioned resid norm 1.219393542993e-05 true resid norm 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 > >>>>> 95 KSP unpreconditioned resid norm 1.059996963300e-05 true resid norm 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 > >>>>> 96 KSP unpreconditioned resid norm 
9.099659872548e-06 true resid norm 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 > >>>>> 97 KSP unpreconditioned resid norm 8.147347587295e-06 true resid norm 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 > >>>>> 98 KSP unpreconditioned resid norm 7.167226146744e-06 true resid norm 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 > >>>>> 99 KSP unpreconditioned resid norm 6.552540209538e-06 true resid norm 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 > >>>>> 100 KSP unpreconditioned resid norm 5.767783600111e-06 true resid norm 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 > >>>>> 101 KSP unpreconditioned resid norm 5.261057430584e-06 true resid norm 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 > >>>>> 102 KSP unpreconditioned resid norm 4.715498525786e-06 true resid norm 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 > >>>>> 103 KSP unpreconditioned resid norm 4.380052669622e-06 true resid norm 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 > >>>>> 104 KSP unpreconditioned resid norm 3.911664470060e-06 true resid norm 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 > >>>>> 105 KSP unpreconditioned resid norm 3.652211458315e-06 true resid norm 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 > >>>>> 106 KSP unpreconditioned resid norm 3.387532128049e-06 true resid norm 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 > >>>>> 107 KSP unpreconditioned resid norm 3.234218880987e-06 true resid norm 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 > >>>>> 108 KSP unpreconditioned resid norm 3.016905196388e-06 true resid norm 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 > >>>>> 109 KSP unpreconditioned resid norm 2.858246441921e-06 true resid norm 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 > >>>>> 110 KSP unpreconditioned resid norm 2.637118810847e-06 true resid norm 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 > >>>>> 111 KSP unpreconditioned resid norm 2.494976088717e-06 true resid norm 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 > >>>>> 112 KSP unpreconditioned resid norm 2.270639574272e-06 true resid norm 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 > >>>>> 113 KSP unpreconditioned resid norm 2.104988663813e-06 true resid norm 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 > >>>>> 114 KSP unpreconditioned resid norm 1.889361127301e-06 true resid norm 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 > >>>>> 115 KSP unpreconditioned resid norm 1.732367008052e-06 true resid norm 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 > >>>>> 116 KSP unpreconditioned resid norm 1.509288268391e-06 true resid norm 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 > >>>>> 117 KSP unpreconditioned resid norm 1.359169217644e-06 true resid norm 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 > >>>>> 118 KSP unpreconditioned resid norm 1.180146337735e-06 true resid norm 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 > >>>>> 119 KSP unpreconditioned resid norm 1.067757039683e-06 true resid norm 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 > >>>>> 120 KSP unpreconditioned resid norm 9.435833073736e-07 true resid norm 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 > >>>>> 121 KSP unpreconditioned resid norm 8.749457237613e-07 true resid norm 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 > >>>>> 122 KSP unpreconditioned resid norm 7.945760150897e-07 true resid norm 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 > >>>>> 123 KSP unpreconditioned 
resid norm 7.141240839013e-07 true resid norm 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 > >>>>> 124 KSP unpreconditioned resid norm 6.300566936733e-07 true resid norm 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 > >>>>> 125 KSP unpreconditioned resid norm 5.628986997544e-07 true resid norm 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 > >>>>> 126 KSP unpreconditioned resid norm 5.119018951602e-07 true resid norm 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 > >>>>> 127 KSP unpreconditioned resid norm 4.664670343748e-07 true resid norm 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 > >>>>> 128 KSP unpreconditioned resid norm 4.253264691112e-07 true resid norm 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 > >>>>> 129 KSP unpreconditioned resid norm 3.868921150516e-07 true resid norm 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 > >>>>> 130 KSP unpreconditioned resid norm 3.558445658540e-07 true resid norm 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 > >>>>> 131 KSP unpreconditioned resid norm 3.268710273840e-07 true resid norm 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 > >>>>> 132 KSP unpreconditioned resid norm 3.041273897592e-07 true resid norm 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 > >>>>> 133 KSP unpreconditioned resid norm 2.851926677922e-07 true resid norm 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 > >>>>> 134 KSP unpreconditioned resid norm 2.694708315072e-07 true resid norm 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 > >>>>> 135 KSP unpreconditioned resid norm 2.534825559099e-07 true resid norm 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 > >>>>> 136 KSP unpreconditioned resid norm 2.387342352458e-07 true resid norm 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 > >>>>> 137 KSP unpreconditioned resid norm 2.200861667617e-07 true resid norm 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 > >>>>> 138 KSP unpreconditioned resid norm 2.051415370616e-07 true resid norm 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 > >>>>> 139 KSP unpreconditioned resid norm 1.887376429396e-07 true resid norm 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 > >>>>> 140 KSP unpreconditioned resid norm 1.729743133005e-07 true resid norm 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 > >>>>> 141 KSP unpreconditioned resid norm 1.541021130781e-07 true resid norm 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 > >>>>> 142 KSP unpreconditioned resid norm 1.384631628565e-07 true resid norm 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 > >>>>> 143 KSP unpreconditioned resid norm 1.223114405626e-07 true resid norm 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 > >>>>> 144 KSP unpreconditioned resid norm 1.087313066223e-07 true resid norm 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 > >>>>> 145 KSP unpreconditioned resid norm 9.181901998734e-08 true resid norm 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 > >>>>> 146 KSP unpreconditioned resid norm 7.885850510808e-08 true resid norm 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 > >>>>> 147 KSP unpreconditioned resid norm 6.483393946950e-08 true resid norm 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 > >>>>> 148 KSP unpreconditioned resid norm 5.690132597004e-08 true resid norm 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 > >>>>> 149 KSP unpreconditioned resid norm 5.023671521579e-08 true resid norm 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 > >>>>> 150 KSP 
unpreconditioned resid norm 4.625371062660e-08 true resid norm 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 > >>>>> 151 KSP unpreconditioned resid norm 4.349049084805e-08 true resid norm 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 > >>>>> 152 KSP unpreconditioned resid norm 3.932593324498e-08 true resid norm 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 > >>>>> 153 KSP unpreconditioned resid norm 3.504167649202e-08 true resid norm 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 > >>>>> 154 KSP unpreconditioned resid norm 2.892726347747e-08 true resid norm 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 > >>>>> 155 KSP unpreconditioned resid norm 2.477647033202e-08 true resid norm 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 > >>>>> 156 KSP unpreconditioned resid norm 2.128504065757e-08 true resid norm 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 > >>>>> 157 KSP unpreconditioned resid norm 1.879248809429e-08 true resid norm 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 > >>>>> 158 KSP unpreconditioned resid norm 1.673649140073e-08 true resid norm 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 > >>>>> 159 KSP unpreconditioned resid norm 1.497123388109e-08 true resid norm 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 > >>>>> 160 KSP unpreconditioned resid norm 1.315982130162e-08 true resid norm 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 > >>>>> 161 KSP unpreconditioned resid norm 1.182395864938e-08 true resid norm 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 > >>>>> 162 KSP unpreconditioned resid norm 1.070204481679e-08 true resid norm 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 > >>>>> 163 KSP unpreconditioned resid norm 9.969290307649e-09 true resid norm 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 > >>>>> 164 KSP unpreconditioned resid norm 9.134440883306e-09 true resid norm 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 > >>>>> 165 KSP unpreconditioned resid norm 8.593316427292e-09 true resid norm 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 > >>>>> 166 KSP unpreconditioned resid norm 8.042173048464e-09 true resid norm 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 > >>>>> 167 KSP unpreconditioned resid norm 7.655518522782e-09 true resid norm 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 > >>>>> 168 KSP unpreconditioned resid norm 7.210283391815e-09 true resid norm 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 > >>>>> 169 KSP unpreconditioned resid norm 6.793967416271e-09 true resid norm 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 > >>>>> 170 KSP unpreconditioned resid norm 6.249160304588e-09 true resid norm 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 > >>>>> 171 KSP unpreconditioned resid norm 5.794936438798e-09 true resid norm 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 > >>>>> 172 KSP unpreconditioned resid norm 5.222337397128e-09 true resid norm 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 > >>>>> 173 KSP unpreconditioned resid norm 4.755359110447e-09 true resid norm 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 > >>>>> 174 KSP unpreconditioned resid norm 4.317537007873e-09 true resid norm 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 > >>>>> 175 KSP unpreconditioned resid norm 3.924177535665e-09 true resid norm 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 > >>>>> 176 KSP unpreconditioned resid norm 3.502843065115e-09 true resid norm 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 > 
>>>>> 177 KSP unpreconditioned resid norm 3.083873232869e-09 true resid norm 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 > >>>>> 178 KSP unpreconditioned resid norm 2.758980676473e-09 true resid norm 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 > >>>>> 179 KSP unpreconditioned resid norm 2.510978240429e-09 true resid norm 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 > >>>>> 180 KSP unpreconditioned resid norm 2.323000193205e-09 true resid norm 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 > >>>>> 181 KSP unpreconditioned resid norm 2.167480159274e-09 true resid norm 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 > >>>>> 182 KSP unpreconditioned resid norm 1.983545827983e-09 true resid norm 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 > >>>>> 183 KSP unpreconditioned resid norm 1.794576286774e-09 true resid norm 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 > >>>>> 184 KSP unpreconditioned resid norm 1.583490590644e-09 true resid norm 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 > >>>>> 185 KSP unpreconditioned resid norm 1.412659866247e-09 true resid norm 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 > >>>>> 186 KSP unpreconditioned resid norm 1.285613344939e-09 true resid norm 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 > >>>>> 187 KSP unpreconditioned resid norm 1.168115133929e-09 true resid norm 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 > >>>>> 188 KSP unpreconditioned resid norm 1.063377926053e-09 true resid norm 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 > >>>>> 189 KSP unpreconditioned resid norm 9.548967728122e-10 true resid norm 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 > >>>>> KSP Object: 16 MPI processes > >>>>> type: fgmres > >>>>> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement > >>>>> happy breakdown tolerance 1e-30 > >>>>> maximum iterations=2000, initial guess is zero > >>>>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. > >>>>> right preconditioning > >>>>> using UNPRECONDITIONED norm type for convergence test > >>>>> PC Object: 16 MPI processes > >>>>> type: bjacobi > >>>>> number of blocks = 4 > >>>>> Local solver information for first block is in the following KSP and PC objects on rank 0: > >>>>> Use -ksp_view ::ascii_info_detail to display information for all blocks > >>>>> KSP Object: (sub_) 4 MPI processes > >>>>> type: preonly > >>>>> maximum iterations=10000, initial guess is zero > >>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > >>>>> left preconditioning > >>>>> using NONE norm type for convergence test > >>>>> PC Object: (sub_) 4 MPI processes > >>>>> type: telescope > >>>>> petsc subcomm: parent comm size reduction factor = 4 > >>>>> petsc subcomm: parent_size = 4 , subcomm_size = 1 > >>>>> petsc subcomm type = contiguous > >>>>> linear system matrix = precond matrix: > >>>>> Mat Object: (sub_) 4 MPI processes > >>>>> type: mpiaij > >>>>> rows=40200, cols=40200 > >>>>> total: nonzeros=199996, allocated nonzeros=203412 > >>>>> total number of mallocs used during MatSetValues calls=0 > >>>>> not using I-node (on process 0) routines > >>>>> setup type: default > >>>>> Parent DM object: NULL > >>>>> Sub DM object: NULL > >>>>> KSP Object: (sub_telescope_) 1 MPI processes > >>>>> type: preonly > >>>>> maximum iterations=10000, initial guess is zero > >>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
> >>>>> left preconditioning > >>>>> using NONE norm type for convergence test > >>>>> PC Object: (sub_telescope_) 1 MPI processes > >>>>> type: lu > >>>>> out-of-place factorization > >>>>> tolerance for zero pivot 2.22045e-14 > >>>>> matrix ordering: external > >>>>> factor fill ratio given 0., needed 0. > >>>>> Factored matrix follows: > >>>>> Mat Object: 1 MPI processes > >>>>> type: mumps > >>>>> rows=40200, cols=40200 > >>>>> package used to perform factorization: mumps > >>>>> total: nonzeros=1849788, allocated nonzeros=1849788 > >>>>> MUMPS run parameters: > >>>>> SYM (matrix type): 0 > >>>>> PAR (host participation): 1 > >>>>> ICNTL(1) (output for error): 6 > >>>>> ICNTL(2) (output of diagnostic msg): 0 > >>>>> ICNTL(3) (output for global info): 0 > >>>>> ICNTL(4) (level of printing): 0 > >>>>> ICNTL(5) (input mat struct): 0 > >>>>> ICNTL(6) (matrix prescaling): 7 > >>>>> ICNTL(7) (sequential matrix ordering):7 > >>>>> ICNTL(8) (scaling strategy): 77 > >>>>> ICNTL(10) (max num of refinements): 0 > >>>>> ICNTL(11) (error analysis): 0 > >>>>> ICNTL(12) (efficiency control): 1 > >>>>> ICNTL(13) (sequential factorization of the root node): 0 > >>>>> ICNTL(14) (percentage of estimated workspace increase): 20 > >>>>> ICNTL(18) (input mat struct): 0 > >>>>> ICNTL(19) (Schur complement info): 0 > >>>>> ICNTL(20) (RHS sparse pattern): 0 > >>>>> ICNTL(21) (solution struct): 0 > >>>>> ICNTL(22) (in-core/out-of-core facility): 0 > >>>>> ICNTL(23) (max size of memory can be allocated locally):0 > >>>>> ICNTL(24) (detection of null pivot rows): 0 > >>>>> ICNTL(25) (computation of a null space basis): 0 > >>>>> ICNTL(26) (Schur options for RHS or solution): 0 > >>>>> ICNTL(27) (blocking size for multiple RHS): -32 > >>>>> ICNTL(28) (use parallel or sequential ordering): 1 > >>>>> ICNTL(29) (parallel ordering): 0 > >>>>> ICNTL(30) (user-specified set of entries in inv(A)): 0 > >>>>> ICNTL(31) (factors is discarded in the solve phase): 0 > >>>>> ICNTL(33) (compute determinant): 0 > >>>>> ICNTL(35) (activate BLR based factorization): 0 > >>>>> ICNTL(36) (choice of BLR factorization variant): 0 > >>>>> ICNTL(38) (estimated compression rate of LU factors): 333 > >>>>> CNTL(1) (relative pivoting threshold): 0.01 > >>>>> CNTL(2) (stopping criterion of refinement): 1.49012e-08 > >>>>> CNTL(3) (absolute pivoting threshold): 0. > >>>>> CNTL(4) (value of static pivoting): -1. > >>>>> CNTL(5) (fixation for null pivots): 0. > >>>>> CNTL(7) (dropping parameter for BLR): 0. 
> >>>>> RINFO(1) (local estimated flops for the elimination after analysis): > >>>>> [0] 1.45525e+08 > >>>>> RINFO(2) (local estimated flops for the assembly after factorization): > >>>>> [0] 2.89397e+06 > >>>>> RINFO(3) (local estimated flops for the elimination after factorization): > >>>>> [0] 1.45525e+08 > >>>>> INFO(15) (estimated size of (in MB) MUMPS internal data for running numerical factorization): > >>>>> [0] 29 > >>>>> INFO(16) (size of (in MB) MUMPS internal data used during numerical factorization): > >>>>> [0] 29 > >>>>> INFO(23) (num of pivots eliminated on this processor after factorization): > >>>>> [0] 40200 > >>>>> RINFOG(1) (global estimated flops for the elimination after analysis): 1.45525e+08 > >>>>> RINFOG(2) (global estimated flops for the assembly after factorization): 2.89397e+06 > >>>>> RINFOG(3) (global estimated flops for the elimination after factorization): 1.45525e+08 > >>>>> (RINFOG(12) RINFOG(13))*2^INFOG(34) (determinant): (0.,0.)*(2^0) > >>>>> INFOG(3) (estimated real workspace for factors on all processors after analysis): 1849788 > >>>>> INFOG(4) (estimated integer workspace for factors on all processors after analysis): 879986 > >>>>> INFOG(5) (estimated maximum front size in the complete tree): 282 > >>>>> INFOG(6) (number of nodes in the complete tree): 23709 > >>>>> INFOG(7) (ordering option effectively used after analysis): 5 > >>>>> INFOG(8) (structural symmetry in percent of the permuted matrix after analysis): 100 > >>>>> INFOG(9) (total real/complex workspace to store the matrix factors after factorization): 1849788 > >>>>> INFOG(10) (total integer space store the matrix factors after factorization): 879986 > >>>>> INFOG(11) (order of largest frontal matrix after factorization): 282 > >>>>> INFOG(12) (number of off-diagonal pivots): 0 > >>>>> INFOG(13) (number of delayed pivots after factorization): 0 > >>>>> INFOG(14) (number of memory compress after factorization): 0 > >>>>> INFOG(15) (number of steps of iterative refinement after solution): 0 > >>>>> INFOG(16) (estimated size (in MB) of all MUMPS internal data for factorization after analysis: value on the most memory consuming processor): 29 > >>>>> INFOG(17) (estimated size of all MUMPS internal data for factorization after analysis: sum over all processors): 29 > >>>>> INFOG(18) (size of all MUMPS internal data allocated during factorization: value on the most memory consuming processor): 29 > >>>>> INFOG(19) (size of all MUMPS internal data allocated during factorization: sum over all processors): 29 > >>>>> INFOG(20) (estimated number of entries in the factors): 1849788 > >>>>> INFOG(21) (size in MB of memory effectively used during factorization - value on the most memory consuming processor): 26 > >>>>> INFOG(22) (size in MB of memory effectively used during factorization - sum over all processors): 26 > >>>>> INFOG(23) (after analysis: value of ICNTL(6) effectively used): 0 > >>>>> INFOG(24) (after analysis: value of ICNTL(12) effectively used): 1 > >>>>> INFOG(25) (after factorization: number of pivots modified by static pivoting): 0 > >>>>> INFOG(28) (after factorization: number of null pivots encountered): 0 > >>>>> INFOG(29) (after factorization: effective number of entries in the factors (sum over all processors)): 1849788 > >>>>> INFOG(30, 31) (after solution: size in Mbytes of memory used during solution phase): 29, 29 > >>>>> INFOG(32) (after analysis: type of analysis done): 1 > >>>>> INFOG(33) (value used for ICNTL(8)): 7 > >>>>> INFOG(34) (exponent of the determinant if 
determinant is requested): 0 > >>>>> INFOG(35) (after factorization: number of entries taking into account BLR factor compression - sum over all processors): 1849788 > >>>>> INFOG(36) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - value on the most memory consuming processor): 0 > >>>>> INFOG(37) (after analysis: estimated size of all MUMPS internal data for running BLR in-core - sum over all processors): 0 > >>>>> INFOG(38) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - value on the most memory consuming processor): 0 > >>>>> INFOG(39) (after analysis: estimated size of all MUMPS internal data for running BLR out-of-core - sum over all processors): 0 > >>>>> linear system matrix = precond matrix: > >>>>> Mat Object: 1 MPI processes > >>>>> type: seqaijcusparse > >>>>> rows=40200, cols=40200 > >>>>> total: nonzeros=199996, allocated nonzeros=199996 > >>>>> total number of mallocs used during MatSetValues calls=0 > >>>>> not using I-node routines > >>>>> linear system matrix = precond matrix: > >>>>> Mat Object: 16 MPI processes > >>>>> type: mpiaijcusparse > >>>>> rows=160800, cols=160800 > >>>>> total: nonzeros=802396, allocated nonzeros=1608000 > >>>>> total number of mallocs used during MatSetValues calls=0 > >>>>> not using I-node (on process 0) routines > >>>>> Norm of error 9.11684e-07 iterations 189 > >>>>> Chang > >>>>> On 10/14/21 10:10 PM, Chang Liu wrote: > >>>>>> Hi Barry, > >>>>>> > >>>>>> No problem. Here is the output. It seems that the resid norm calculation is incorrect. > >>>>>> > >>>>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9 > >>>>>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > >>>>>> 1 KSP unpreconditioned resid norm 0.000000000000e+00 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 > >>>>>> KSP Object: 16 MPI processes > >>>>>> type: fgmres > >>>>>> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement > >>>>>> happy breakdown tolerance 1e-30 > >>>>>> maximum iterations=2000, initial guess is zero > >>>>>> tolerances: relative=1e-20, absolute=1e-09, divergence=10000. > >>>>>> right preconditioning > >>>>>> using UNPRECONDITIONED norm type for convergence test > >>>>>> PC Object: 16 MPI processes > >>>>>> type: bjacobi > >>>>>> number of blocks = 4 > >>>>>> Local solver information for first block is in the following KSP and PC objects on rank 0: > >>>>>> Use -ksp_view ::ascii_info_detail to display information for all blocks > >>>>>> KSP Object: (sub_) 4 MPI processes > >>>>>> type: preonly > >>>>>> maximum iterations=10000, initial guess is zero > >>>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
> >>>>>> left preconditioning > >>>>>> using NONE norm type for convergence test > >>>>>> PC Object: (sub_) 4 MPI processes > >>>>>> type: telescope > >>>>>> petsc subcomm: parent comm size reduction factor = 4 > >>>>>> petsc subcomm: parent_size = 4 , subcomm_size = 1 > >>>>>> petsc subcomm type = contiguous > >>>>>> linear system matrix = precond matrix: > >>>>>> Mat Object: (sub_) 4 MPI processes > >>>>>> type: mpiaij > >>>>>> rows=40200, cols=40200 > >>>>>> total: nonzeros=199996, allocated nonzeros=203412 > >>>>>> total number of mallocs used during MatSetValues calls=0 > >>>>>> not using I-node (on process 0) routines > >>>>>> setup type: default > >>>>>> Parent DM object: NULL > >>>>>> Sub DM object: NULL > >>>>>> KSP Object: (sub_telescope_) 1 MPI processes > >>>>>> type: preonly > >>>>>> maximum iterations=10000, initial guess is zero > >>>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. > >>>>>> left preconditioning > >>>>>> using NONE norm type for convergence test > >>>>>> PC Object: (sub_telescope_) 1 MPI processes > >>>>>> type: lu > >>>>>> out-of-place factorization > >>>>>> tolerance for zero pivot 2.22045e-14 > >>>>>> matrix ordering: nd > >>>>>> factor fill ratio given 5., needed 8.62558 > >>>>>> Factored matrix follows: > >>>>>> Mat Object: 1 MPI processes > >>>>>> type: seqaijcusparse > >>>>>> rows=40200, cols=40200 > >>>>>> package used to perform factorization: cusparse > >>>>>> total: nonzeros=1725082, allocated nonzeros=1725082 > >>>>>> not using I-node routines > >>>>>> linear system matrix = precond matrix: > >>>>>> Mat Object: 1 MPI processes > >>>>>> type: seqaijcusparse > >>>>>> rows=40200, cols=40200 > >>>>>> total: nonzeros=199996, allocated nonzeros=199996 > >>>>>> total number of mallocs used during MatSetValues calls=0 > >>>>>> not using I-node routines > >>>>>> linear system matrix = precond matrix: > >>>>>> Mat Object: 16 MPI processes > >>>>>> type: mpiaijcusparse > >>>>>> rows=160800, cols=160800 > >>>>>> total: nonzeros=802396, allocated nonzeros=1608000 > >>>>>> total number of mallocs used during MatSetValues calls=0 > >>>>>> not using I-node (on process 0) routines > >>>>>> Norm of error 400.999 iterations 1 > >>>>>> > >>>>>> Chang > >>>>>> > >>>>>> > >>>>>> On 10/14/21 9:47 PM, Barry Smith wrote: > >>>>>>> > >>>>>>> Chang, > >>>>>>> > >>>>>>> Sorry I did not notice that one. Please run that with -ksp_view -ksp_monitor_true_residual so we can see exactly how options are interpreted and solver used. At a glance it looks ok but something must be wrong to get the wrong answer. > >>>>>>> > >>>>>>> Barry > >>>>>>> > >>>>>>>> On Oct 14, 2021, at 6:02 PM, Chang Liu > wrote: > >>>>>>>> > >>>>>>>> Hi Barry, > >>>>>>>> > >>>>>>>> That is exactly what I was doing in the second example, in which the preconditioner works but the GMRES does not. > >>>>>>>> > >>>>>>>> Chang > >>>>>>>> > >>>>>>>> On 10/14/21 5:15 PM, Barry Smith wrote: > >>>>>>>>> You need to use the PCTELESCOPE inside the block Jacobi, not outside it. So something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu > >>>>>>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu > wrote: > >>>>>>>>>> > >>>>>>>>>> Hi Pierre, > >>>>>>>>>> > >>>>>>>>>> I wonder if the trick of PCTELESCOPE only works for preconditioner and not for the solver. I have done some tests, and find that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work. 
> >>>>>>>>>>
> >>>>>>>>>> The command line options I used for the small matrix are like
> >>>>>>>>>>
> >>>>>>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4
> >>>>>>>>>>
> >>>>>>>>>> which gives the correct output. For the iterative solver, I tried
> >>>>>>>>>>
> >>>>>>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20
> >>>>>>>>>>
> >>>>>>>>>> for the large matrix. The output is like
> >>>>>>>>>>
> >>>>>>>>>> 0 KSP Residual norm 40.1497
> >>>>>>>>>> 1 KSP Residual norm < 1.e-11
> >>>>>>>>>> Norm of error 400.999 iterations 1
> >>>>>>>>>>
> >>>>>>>>>> So it seems to call a direct solver instead of an iterative one.
> >>>>>>>>>>
> >>>>>>>>>> Can you please help check these options?
> >>>>>>>>>>
> >>>>>>>>>> Chang
> >>>>>>>>>>
> >>>>>>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote:
> >>>>>>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds exactly like what I need. I wonder if PCTELESCOPE can transform a mpiaijcusparse to seqaijcusparse? Or do I have to do it manually?
> >>>>>>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat().
> >>>>>>>>>>> 1) I'm not sure this is implemented for cuSparse matrices, but it should be;
> >>>>>>>>>>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually "smart" enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning.
> >>>>>>>>>>> If you try this out and this does not work, please provide the backtrace (probably something like "Operation XYZ not implemented for MatType ABC"), and hopefully someone can add the missing plumbing.
> >>>>>>>>>>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve.
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Pierre
> >>>>>>>>>>>> Chang
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote:
> >>>>>>>>>>>>> Maybe I'm missing something, but can't you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block?
> >>>>>>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu
> >>>>>>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads, because not only does the Mat need to be redistributed, the secondary processes also need to be "converted" to OpenMP threads.
> >>>>>>>>>>>>> Thus the need for specific code in mumps.c.
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Pierre
> >>>>>>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Junchao,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yes that is what I want.
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Chang > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: > >>>>>>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith >> wrote: > >>>>>>>>>>>>>>> Junchao, > >>>>>>>>>>>>>>> If I understand correctly Chang is using the block Jacobi > >>>>>>>>>>>>>>> method with a single block for a number of MPI ranks and a direct > >>>>>>>>>>>>>>> solver for each block so it uses PCSetUp_BJacobi_Multiproc() which > >>>>>>>>>>>>>>> is code Hong Zhang wrote a number of years ago for CPUs. For their > >>>>>>>>>>>>>>> particular problems this preconditioner works well, but using an > >>>>>>>>>>>>>>> iterative solver on the blocks does not work well. > >>>>>>>>>>>>>>> If we had complete MPI-GPU direct solvers he could just use > >>>>>>>>>>>>>>> the current code with MPIAIJCUSPARSE on each block but since we do > >>>>>>>>>>>>>>> not he would like to use a single GPU for each block, this means > >>>>>>>>>>>>>>> that diagonal blocks of the global parallel MPI matrix needs to be > >>>>>>>>>>>>>>> sent to a subset of the GPUs (one GPU per block, which has multiple > >>>>>>>>>>>>>>> MPI ranks associated with the blocks). Similarly for the triangular > >>>>>>>>>>>>>>> solves the blocks of the right hand side needs to be shipped to the > >>>>>>>>>>>>>>> appropriate GPU and the resulting solution shipped back to the > >>>>>>>>>>>>>>> multiple GPUs. So Chang is absolutely correct, this is somewhat like > >>>>>>>>>>>>>>> your code for MUMPS with OpenMP. OK, I now understand the background.. > >>>>>>>>>>>>>>> One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the > >>>>>>>>>>>>>>> MPI ranks and then shrink each block down to a single GPU but this > >>>>>>>>>>>>>>> would be pretty inefficient, ideally one would go directly from the > >>>>>>>>>>>>>>> big MPI matrix on all the GPUs to the sub matrices on the subset of > >>>>>>>>>>>>>>> GPUs. But this may be a large coding project. > >>>>>>>>>>>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep blocks' size, no shrinking or expanding. > >>>>>>>>>>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on CPU, and the solve be done on GPU. I assume Chang wants to gain from the (potential) faster solve (instead of factorization) on GPU. > >>>>>>>>>>>>>>> Barry > >>>>>>>>>>>>>>> Since the matrices being factored and solved directly are relatively > >>>>>>>>>>>>>>> large it is possible that the cusparse code could be reasonably > >>>>>>>>>>>>>>> efficient (they are not the tiny problems one gets at the coarse > >>>>>>>>>>>>>>> level of multigrid). Of course, this is speculation, I don't > >>>>>>>>>>>>>>> actually know how much better the cusparse code would be on the > >>>>>>>>>>>>>>> direct solver than a good CPU direct sparse solver. > >>>>>>>>>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu > >>>>>>>>>>>>>>> >> wrote: > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > Sorry I am not familiar with the details either. Can you please > >>>>>>>>>>>>>>> check the code in MatMumpsGatherNonzerosOnMaster in mumps.c? > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > Chang > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote: > >>>>>>>>>>>>>>> >> Hi Chang, > >>>>>>>>>>>>>>> >> I did the work in mumps. It is easy for me to understand > >>>>>>>>>>>>>>> gathering matrix rows to one process. 
> >>>>>>>>>>>>>>> >> But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that? > >>>>>>>>>>>>>>> >> Thanks > >>>>>>>>>>>>>>> >> --Junchao Zhang > >>>>>>>>>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> >>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> >> Hi Barry, > >>>>>>>>>>>>>>> >> I think mumps solver in petsc does support that. You can > >>>>>>>>>>>>>>> check the > >>>>>>>>>>>>>>> >> documentation on "-mat_mumps_use_omp_threads" at > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> and the code enclosed by #if > >>>>>>>>>>>>>>> defined(PETSC_HAVE_OPENMP_SUPPORT) in > >>>>>>>>>>>>>>> >> functions MatMumpsSetUpDistRHSInfo and > >>>>>>>>>>>>>>> >> MatMumpsGatherNonzerosOnMaster in > >>>>>>>>>>>>>>> >> mumps.c > >>>>>>>>>>>>>>> >> 1. I understand it is ideal to do one MPI rank per GPU. > >>>>>>>>>>>>>>> However, I am > >>>>>>>>>>>>>>> >> working on an existing code that was developed based on MPI > >>>>>>>>>>>>>>> and the the > >>>>>>>>>>>>>>> >> # of mpi ranks is typically equal to # of cpu cores. We don't > >>>>>>>>>>>>>>> want to > >>>>>>>>>>>>>>> >> change the whole structure of the code. > >>>>>>>>>>>>>>> >> 2. What you have suggested has been coded in mumps.c. See > >>>>>>>>>>>>>>> function > >>>>>>>>>>>>>>> >> MatMumpsSetUpDistRHSInfo. > >>>>>>>>>>>>>>> >> Regards, > >>>>>>>>>>>>>>> >> Chang > >>>>>>>>>>>>>>> >> On 10/13/21 7:53 PM, Barry Smith wrote: > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> >> On Oct 13, 2021, at 3:50 PM, Chang Liu > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> >> >>> wrote: > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >> Hi Barry, > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >> That is exactly what I want. > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >> Back to my original question, I am looking for an approach to > >>>>>>>>>>>>>>> >> transfer > >>>>>>>>>>>>>>> >> >> matrix > >>>>>>>>>>>>>>> >> >> data from many MPI processes to "master" MPI > >>>>>>>>>>>>>>> >> >> processes, each of which taking care of one GPU, and then > >>>>>>>>>>>>>>> upload > >>>>>>>>>>>>>>> >> the data to GPU to > >>>>>>>>>>>>>>> >> >> solve. > >>>>>>>>>>>>>>> >> >> One can just grab some codes from mumps.c to > >>>>>>>>>>>>>>> aijcusparse.cu > > >>>>>>>>>>>>>>> >> >>. > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > mumps.c doesn't actually do that. It never needs to > >>>>>>>>>>>>>>> copy the > >>>>>>>>>>>>>>> >> entire matrix to a single MPI rank. > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > It would be possible to write such a code that you > >>>>>>>>>>>>>>> suggest but > >>>>>>>>>>>>>>> >> it is not clear that it makes sense > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > 1) For normal PETSc GPU usage there is one GPU per MPI > >>>>>>>>>>>>>>> rank, so > >>>>>>>>>>>>>>> >> while your one GPU per big domain is solving its systems the > >>>>>>>>>>>>>>> other > >>>>>>>>>>>>>>> >> GPUs (with the other MPI ranks that share that domain) are doing > >>>>>>>>>>>>>>> >> nothing. > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > 2) For each triangular solve you would have to gather the > >>>>>>>>>>>>>>> right > >>>>>>>>>>>>>>> >> hand side from the multiple ranks to the single GPU to pass it to > >>>>>>>>>>>>>>> >> the GPU solver and then scatter the resulting solution back > >>>>>>>>>>>>>>> to all > >>>>>>>>>>>>>>> >> of its subdomain ranks. 
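As a rough illustration (not from the thread): in PETSc terms, the gather/scatter Barry describes could be sketched with VecScatterCreateToZero(). The helper below and its seqsolve callback are hypothetical placeholders for the sequential (GPU) triangular solves that would run on the gathered block.

  #include <petscvec.h>

  /* Hypothetical sketch: gather a distributed right-hand side onto rank 0 of the
     block's communicator, let the caller solve there, then scatter the solution
     back to the owning ranks. */
  static PetscErrorCode GatherSolveScatter(Vec b, Vec x, PetscErrorCode (*seqsolve)(Vec))
  {
    Vec            b0;      /* full-length vector on rank 0, length 0 elsewhere */
    VecScatter     tozero;
    PetscMPIInt    rank;
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    ierr = MPI_Comm_rank(PetscObjectComm((PetscObject)b),&rank);CHKERRQ(ierr);
    ierr = VecScatterCreateToZero(b,&tozero,&b0);CHKERRQ(ierr);
    ierr = VecScatterBegin(tozero,b,b0,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
    ierr = VecScatterEnd(tozero,b,b0,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
    if (rank == 0) {ierr = (*seqsolve)(b0);CHKERRQ(ierr);}   /* overwrite b0 with the block solution */
    ierr = VecScatterBegin(tozero,b0,x,INSERT_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
    ierr = VecScatterEnd(tozero,b0,x,INSERT_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
    ierr = VecScatterDestroy(&tozero);CHKERRQ(ierr);
    ierr = VecDestroy(&b0);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

Every application of the block solve would pay this gather and scatter, which is the communication cost, and the idle GPUs, that Barry points out next.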
> >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > What I was suggesting was assign an entire subdomain to a > >>>>>>>>>>>>>>> >> single MPI rank, thus it does everything on one GPU and can > >>>>>>>>>>>>>>> use the > >>>>>>>>>>>>>>> >> GPU solver directly. If all the major computations of a subdomain > >>>>>>>>>>>>>>> >> can fit and be done on a single GPU then you would be > >>>>>>>>>>>>>>> utilizing all > >>>>>>>>>>>>>>> >> the GPUs you are using effectively. > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > Barry > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >> Chang > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >> On 10/13/21 1:53 PM, Barry Smith wrote: > >>>>>>>>>>>>>>> >> >>> Chang, > >>>>>>>>>>>>>>> >> >>> You are correct there is no MPI + GPU direct > >>>>>>>>>>>>>>> solvers that > >>>>>>>>>>>>>>> >> currently do the triangular solves with MPI + GPU parallelism > >>>>>>>>>>>>>>> that I > >>>>>>>>>>>>>>> >> am aware of. You are limited that individual triangular solves be > >>>>>>>>>>>>>>> >> done on a single GPU. I can only suggest making each subdomain as > >>>>>>>>>>>>>>> >> big as possible to utilize each GPU as much as possible for the > >>>>>>>>>>>>>>> >> direct triangular solves. > >>>>>>>>>>>>>>> >> >>> Barry > >>>>>>>>>>>>>>> >> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> Hi Mark, > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with > >>>>>>>>>>>>>>> other > >>>>>>>>>>>>>>> >> solvers, but with -pc_factor_mat_solver_type cusparse, it > >>>>>>>>>>>>>>> will give > >>>>>>>>>>>>>>> >> an error. > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> Yes what I want is to have mumps or superlu to do the > >>>>>>>>>>>>>>> >> factorization, and then do the rest, including GMRES solver, > >>>>>>>>>>>>>>> on gpu. > >>>>>>>>>>>>>>> >> Is that possible? > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> I have tried to use aijcusparse with superlu_dist, it > >>>>>>>>>>>>>>> runs but > >>>>>>>>>>>>>>> >> the iterative solver is still running on CPUs. I have > >>>>>>>>>>>>>>> contacted the > >>>>>>>>>>>>>>> >> superlu group and they confirmed that is the case right now. > >>>>>>>>>>>>>>> But if > >>>>>>>>>>>>>>> >> I set -pc_factor_mat_solver_type cusparse, it seems that the > >>>>>>>>>>>>>>> >> iterative solver is running on GPU. > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> Chang > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> On 10/13/21 12:03 PM, Mark Adams wrote: > >>>>>>>>>>>>>>> >> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> >> >>>> wrote: > >>>>>>>>>>>>>>> >> >>>>> Thank you Junchao for explaining this. I guess in > >>>>>>>>>>>>>>> my case > >>>>>>>>>>>>>>> >> the code is > >>>>>>>>>>>>>>> >> >>>>> just calling a seq solver like superlu to do > >>>>>>>>>>>>>>> >> factorization on GPUs. > >>>>>>>>>>>>>>> >> >>>>> My idea is that I want to have a traditional MPI > >>>>>>>>>>>>>>> code to > >>>>>>>>>>>>>>> >> utilize GPUs > >>>>>>>>>>>>>>> >> >>>>> with cusparse. Right now cusparse does not support > >>>>>>>>>>>>>>> mpiaij > >>>>>>>>>>>>>>> >> matrix, Sure it does: '-mat_type aijcusparse' will give you an > >>>>>>>>>>>>>>> >> mpiaijcusparse matrix with > 1 processes. 
> >>>>>>>>>>>>>>> >> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc). > >>>>>>>>>>>>>>> >> >>>>> However, I see in grepping the repo that all the mumps and > >>>>>>>>>>>>>>> >> superlu tests use aij or sell matrix type. > >>>>>>>>>>>>>>> >> >>>>> MUMPS and SuperLU provide their own solves, I assume > >>>>>>>>>>>>>>> .... but > >>>>>>>>>>>>>>> >> you might want to do other matrix operations on the GPU. Is > >>>>>>>>>>>>>>> that the > >>>>>>>>>>>>>>> >> issue? > >>>>>>>>>>>>>>> >> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or > >>>>>>>>>>>>>>> SuperLU > >>>>>>>>>>>>>>> >> have a problem? (no test with it so it probably does not work) > >>>>>>>>>>>>>>> >> >>>>> Thanks, > >>>>>>>>>>>>>>> >> >>>>> Mark > >>>>>>>>>>>>>>> >> >>>>> so I > >>>>>>>>>>>>>>> >> >>>>> want the code to have a mpiaij matrix when adding > >>>>>>>>>>>>>>> all the > >>>>>>>>>>>>>>> >> matrix terms, > >>>>>>>>>>>>>>> >> >>>>> and then transform the matrix to seqaij when doing the > >>>>>>>>>>>>>>> >> factorization > >>>>>>>>>>>>>>> >> >>>>> and > >>>>>>>>>>>>>>> >> >>>>> solve. This involves sending the data to the master > >>>>>>>>>>>>>>> >> process, and I > >>>>>>>>>>>>>>> >> >>>>> think > >>>>>>>>>>>>>>> >> >>>>> the petsc mumps solver have something similar already. > >>>>>>>>>>>>>>> >> >>>>> Chang > >>>>>>>>>>>>>>> >> >>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote: > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >>> > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >>>>> wrote: > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >>> > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> > >> > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >>>>> wrote: > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > Hi Mark, > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > The option I use is like > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > -pc_type bjacobi -pc_bjacobi_blocks 16 > >>>>>>>>>>>>>>> >> -ksp_type fgmres > >>>>>>>>>>>>>>> >> >>>>> -mat_type > >>>>>>>>>>>>>>> >> >>>>> > aijcusparse *-sub_pc_factor_mat_solver_type > >>>>>>>>>>>>>>> >> cusparse > >>>>>>>>>>>>>>> >> >>>>> *-sub_ksp_type > >>>>>>>>>>>>>>> >> >>>>> > preonly *-sub_pc_type lu* -ksp_max_it 2000 > >>>>>>>>>>>>>>> >> -ksp_rtol 1.e-300 > >>>>>>>>>>>>>>> >> >>>>> > -ksp_atol 1.e-300 > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > Note, If you use -log_view the last column > >>>>>>>>>>>>>>> (rows > >>>>>>>>>>>>>>> >> are the > >>>>>>>>>>>>>>> >> >>>>> method like > >>>>>>>>>>>>>>> >> >>>>> > MatFactorNumeric) has the percent of work > >>>>>>>>>>>>>>> in the GPU. > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > Junchao: *This* implies that we have a > >>>>>>>>>>>>>>> cuSparse LU > >>>>>>>>>>>>>>> >> >>>>> factorization. Is > >>>>>>>>>>>>>>> >> >>>>> > that correct? (I don't think we do) > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > No, we don't have cuSparse LU factorization. 
If you check > >>>>>>>>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will > >>>>>>>>>>>>>>> find it > >>>>>>>>>>>>>>> >> calls > >>>>>>>>>>>>>>> >> >>>>> > MatLUFactorSymbolic_SeqAIJ() instead. > >>>>>>>>>>>>>>> >> >>>>> > So I don't understand Chang's idea. Do you want to > >>>>>>>>>>>>>>> >> make bigger > >>>>>>>>>>>>>>> >> >>>>> blocks? > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > I think this one do both factorization and > >>>>>>>>>>>>>>> >> solve on gpu. > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > You can check the > >>>>>>>>>>>>>>> runex72_aijcusparse.sh file > >>>>>>>>>>>>>>> >> in petsc > >>>>>>>>>>>>>>> >> >>>>> install > >>>>>>>>>>>>>>> >> >>>>> > directory, and try it your self (this > >>>>>>>>>>>>>>> is only lu > >>>>>>>>>>>>>>> >> >>>>> factorization > >>>>>>>>>>>>>>> >> >>>>> > without > >>>>>>>>>>>>>>> >> >>>>> > iterative solve). > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > Chang > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> > On 10/12/21 1:17 PM, Mark Adams wrote: > >>>>>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>>>>> >> >>>>> > > On Tue, Oct 12, 2021 at 11:19 AM > >>>>>>>>>>>>>>> Chang Liu > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >>> > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> > >> > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >>>> > >>>>>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> >>> > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >>>>>> wrote: > >>>>>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>>>>> >> >>>>> > > Hi Junchao, > >>>>>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>>>>> >> >>>>> > > No I only needs it to be transferred > >>>>>>>>>>>>>>> >> within a > >>>>>>>>>>>>>>> >> >>>>> node. I use > >>>>>>>>>>>>>>> >> >>>>> > block-Jacobi > >>>>>>>>>>>>>>> >> >>>>> > > method and GMRES to solve the sparse > >>>>>>>>>>>>>>> >> matrix, so each > >>>>>>>>>>>>>>> >> >>>>> > direct solver will > >>>>>>>>>>>>>>> >> >>>>> > > take care of a sub-block of the > >>>>>>>>>>>>>>> whole > >>>>>>>>>>>>>>> >> matrix. In this > >>>>>>>>>>>>>>> >> >>>>> > way, I can use > >>>>>>>>>>>>>>> >> >>>>> > > one > >>>>>>>>>>>>>>> >> >>>>> > > GPU to solve one sub-block, which is > >>>>>>>>>>>>>>> >> stored within > >>>>>>>>>>>>>>> >> >>>>> one node. > >>>>>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>>>>> >> >>>>> > > It was stated in the > >>>>>>>>>>>>>>> documentation that > >>>>>>>>>>>>>>> >> cusparse > >>>>>>>>>>>>>>> >> >>>>> solver > >>>>>>>>>>>>>>> >> >>>>> > is slow. > >>>>>>>>>>>>>>> >> >>>>> > > However, in my test using > >>>>>>>>>>>>>>> ex72.c, the > >>>>>>>>>>>>>>> >> cusparse > >>>>>>>>>>>>>>> >> >>>>> solver is > >>>>>>>>>>>>>>> >> >>>>> > faster than > >>>>>>>>>>>>>>> >> >>>>> > > mumps or superlu_dist on CPUs. > >>>>>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>>>>> >> >>>>> > > Are we talking about the > >>>>>>>>>>>>>>> factorization, the > >>>>>>>>>>>>>>> >> solve, or > >>>>>>>>>>>>>>> >> >>>>> both? > >>>>>>>>>>>>>>> >> >>>>> > > > >>>>>>>>>>>>>>> >> >>>>> > > We do not have an interface to > >>>>>>>>>>>>>>> cuSparse's LU > >>>>>>>>>>>>>>> >> >>>>> factorization (I > >>>>>>>>>>>>>>> >> >>>>> > just > >>>>>>>>>>>>>>> >> >>>>> > > learned that it exists a few weeks ago). 
> >>>>>>>>>>>>>>> >> >>>>> > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse'? This would be the CPU factorization, which is the dominant cost.
> >>>>>>>>>>>>>>> >> >>>>> > >
> >>>>>>>>>>>>>>> >> >>>>> > > Chang
> >>>>>>>>>>>>>>> >> >>>>> > >
> >>>>>>>>>>>>>>> >> >>>>> > > On 10/12/21 10:24 AM, Junchao Zhang wrote:
> >>>>>>>>>>>>>>> >> >>>>> > > > Hi, Chang,
> >>>>>>>>>>>>>>> >> >>>>> > > > For the mumps solver, we usually transfer matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right?
> >>>>>>>>>>>>>>> >> >>>>> > > >
> >>>>>>>>>>>>>>> >> >>>>> > > > Mark, I remember you said cusparse solve is slow and you would rather do it on CPU. Is that right?
> >>>>>>>>>>>>>>> >> >>>>> > > >
> >>>>>>>>>>>>>>> >> >>>>> > > > --Junchao Zhang
> >>>>>>>>>>>>>>> >> >>>>> > > >
> >>>>>>>>>>>>>>> >> >>>>> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users wrote:
> >>>>>>>>>>>>>>> >> >>>>> > > >
> >>>>>>>>>>>>>>> >> >>>>> > > > Hi,
> >>>>>>>>>>>>>>> >> >>>>> > > >
> >>>>>>>>>>>>>>> >> >>>>> > > > Currently, it is possible to use the mumps solver in PETSc with the -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then the master rank will call mumps with OpenMP to solve the matrix.
> >>>>>>>>>>>>>>> >> >>>>> > > >
> >>>>>>>>>>>>>>> >> >>>>> > > > I wonder if someone can develop a similar option for the cusparse solver. Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to GPU to solve. In this way, one can use the cusparse solver for an MPI program.
> >>>>>>>>>>>>>>> >> >>>>> > > >
> >>>>>>>>>>>>>>> >> >>>>> > > > Chang
> >>>>>>>>>>>>>>> >> >>>>> > > > --
> >>>>>>>>>>>>>>> >> >>>>> > > > Chang Liu
> >>>>>>>>>>>>>>> >> >>>>> > > > Staff Research Physicist
> >>>>>>>>>>>>>>> >> >>>>> > > > +1 609 243 3438
> >>>>>>>>>>>>>>> >> >>>>> > > > cliu at pppl.gov
> >>>>>>>>>>>>>>> >> >>>>> > > > Princeton Plasma Physics Laboratory
> >>>>>>>>>>>>>>> >> >>>>> > > > 100 Stellarator Rd, Princeton NJ 08540, USA
> >>>>>>>>>>>>>>> >> >>>>> >
> >>>>>>>>>>>>>>> >> >>>>> > --
> >>>>>>>>>>>>>>> >> >>>>> > Chang Liu
> >>>>>>>>>>>>>>> >> >>>>> > Staff Research Physicist
> >>>>>>>>>>>>>>> >> >>>>> > +1 609 243 3438
> >>>>>>>>>>>>>>> >> >>>>> >
cliu at pppl.gov > > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >>> > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >>>> > >>>>>>>>>>>>>>> >> >>>>> > Princeton Plasma Physics Laboratory > >>>>>>>>>>>>>>> >> >>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA > >>>>>>>>>>>>>>> >> >>>>> > > >>>>>>>>>>>>>>> >> >>>>> -- Chang Liu > >>>>>>>>>>>>>>> >> >>>>> Staff Research Physicist > >>>>>>>>>>>>>>> >> >>>>> +1 609 243 3438 > >>>>>>>>>>>>>>> >> >>>>> cliu at pppl.gov > > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> >> >>> > >>>>>>>>>>>>>>> >> >>>>> Princeton Plasma Physics Laboratory > >>>>>>>>>>>>>>> >> >>>>> 100 Stellarator Rd, Princeton NJ 08540, USA > >>>>>>>>>>>>>>> >> >>>> > >>>>>>>>>>>>>>> >> >>>> -- > >>>>>>>>>>>>>>> >> >>>> Chang Liu > >>>>>>>>>>>>>>> >> >>>> Staff Research Physicist > >>>>>>>>>>>>>>> >> >>>> +1 609 243 3438 > >>>>>>>>>>>>>>> >> >>>> cliu at pppl.gov > > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> >>>> Princeton Plasma Physics Laboratory > >>>>>>>>>>>>>>> >> >>>> 100 Stellarator Rd, Princeton NJ 08540, USA > >>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>> >> >> -- > >>>>>>>>>>>>>>> >> >> Chang Liu > >>>>>>>>>>>>>>> >> >> Staff Research Physicist > >>>>>>>>>>>>>>> >> >> +1 609 243 3438 > >>>>>>>>>>>>>>> >> >> cliu at pppl.gov > > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> >> Princeton Plasma Physics Laboratory > >>>>>>>>>>>>>>> >> >> 100 Stellarator Rd, Princeton NJ 08540, USA > >>>>>>>>>>>>>>> >> > > >>>>>>>>>>>>>>> >> -- Chang Liu > >>>>>>>>>>>>>>> >> Staff Research Physicist > >>>>>>>>>>>>>>> >> +1 609 243 3438 > >>>>>>>>>>>>>>> >> cliu at pppl.gov > > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> >> Princeton Plasma Physics Laboratory > >>>>>>>>>>>>>>> >> 100 Stellarator Rd, Princeton NJ 08540, USA > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > -- > >>>>>>>>>>>>>>> > Chang Liu > >>>>>>>>>>>>>>> > Staff Research Physicist > >>>>>>>>>>>>>>> > +1 609 243 3438 > >>>>>>>>>>>>>>> > cliu at pppl.gov > > >>>>>>>>>>>>>>> > Princeton Plasma Physics Laboratory > >>>>>>>>>>>>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> -- > >>>>>>>>>>>>>> Chang Liu > >>>>>>>>>>>>>> Staff Research Physicist > >>>>>>>>>>>>>> +1 609 243 3438 > >>>>>>>>>>>>>> cliu at pppl.gov > >>>>>>>>>>>>>> Princeton Plasma Physics Laboratory > >>>>>>>>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA > >>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> Chang Liu > >>>>>>>>>>>> Staff Research Physicist > >>>>>>>>>>>> +1 609 243 3438 > >>>>>>>>>>>> cliu at pppl.gov > >>>>>>>>>>>> Princeton Plasma Physics Laboratory > >>>>>>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA > >>>>>>>>>> > >>>>>>>>>> -- > >>>>>>>>>> Chang Liu > >>>>>>>>>> Staff Research Physicist > >>>>>>>>>> +1 609 243 3438 > >>>>>>>>>> cliu at pppl.gov > >>>>>>>>>> Princeton Plasma Physics Laboratory > >>>>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA > >>>>>>>> > >>>>>>>> -- > >>>>>>>> Chang Liu > >>>>>>>> Staff Research Physicist > >>>>>>>> +1 609 243 3438 > >>>>>>>> cliu at pppl.gov > >>>>>>>> Princeton Plasma Physics Laboratory > >>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA > >>>>>>> > >>>>>> > >>>> > >>>> -- > >>>> Chang Liu > >>>> Staff Research Physicist > >>>> +1 609 243 3438 > >>>> cliu at pppl.gov > >>>> Princeton Plasma Physics Laboratory > >>>> 100 Stellarator Rd, Princeton NJ 08540, USA > >> > >> -- > >> Chang Liu > >> Staff Research Physicist > >> +1 609 243 3438 > >> cliu at pppl.gov > >> Princeton Plasma Physics Laboratory > >> 100 Stellarator 
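For context, the MUMPS feature referred to in the request above is driven
purely by run-time options. A minimal sketch of such a run is shown below;
the executable and problem size follow the ex7 runs used later in this
thread, the thread count is only illustrative, and it assumes a PETSc build
configured with OpenMP support:

mpiexec -n 16 ./ex7 -m 400 -ksp_type preonly \
    -pc_type lu -pc_factor_mat_solver_type mumps \
    -mat_mumps_use_omp_threads 4

With this option, groups of MPI ranks ship their portion of the matrix and
right-hand side to a master rank, which then calls MUMPS with 4 OpenMP
threads. The request above asks for the analogous gather-to-one-rank step,
but with the gathered block handed to a single GPU via cusparse instead.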
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cliu at pppl.gov  Wed Oct 20 17:14:04 2021
From: cliu at pppl.gov (Chang Liu)
Date: Wed, 20 Oct 2021 18:14:04 -0400
Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
In-Reply-To: <144089C5-D011-4A94-9AC1-F4AD5A66257C@petsc.dev>
References: <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov>
 <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov>
 <879c30a1-ea85-1c24-4139-268925d511da@pppl.gov>
 <53D4EDD7-E05C-4485-B7AE-23AB10DD81B1@joliv.et>
 <968434BC-E8DC-49B0-9119-F208DB1E01B0@petsc.dev>
 <7a3d5347-f2da-b4a9-f44a-aa534a314c7f@pppl.gov>
 <144089C5-D011-4A94-9AC1-F4AD5A66257C@petsc.dev>
Message-ID: <234f9bc5-cdcc-2253-69b6-7a09ab915661@pppl.gov>

Hi Barry,

Wait, by "branch" are you talking about the MR Junchao submitted? That
fix (proposed by me) only addresses getting telescope to work on
mpiaijcusparse when it is used outside bjacobi. It has nothing to do
with the issue of telescope inside bjacobi, and it does not help in my
tests.

If my emails made you think the other way, I apologize for that.

Regards,

Chang

On 10/20/21 4:40 PM, Barry Smith wrote:
>
>    Yes, but the branch can be used to do telescoping inside the bjacobi
> as needed.
>
>> On Oct 20, 2021, at 2:59 PM, Junchao Zhang wrote:
>>
>> The MR https://gitlab.com/petsc/petsc/-/merge_requests/4471 has not
>> been merged yet.
>>
>> --Junchao Zhang
>>
>> On Wed, Oct 20, 2021 at 1:47 PM Chang Liu via petsc-users wrote:
>>
>> Hi Barry,
>>
>> Are the fixes merged in the master? I was using bjacobi as a
>> preconditioner. Using the latest version of petsc, I found that by
>> calling
>>
>> mpiexec -n 32 --oversubscribe ./ex7 -m 1000 -ksp_view
>> -ksp_monitor_true_residual -ksp_type fgmres -pc_type bjacobi
>> -pc_bjacobi_blocks 4 -sub_ksp_type preonly -sub_pc_type telescope
>> -sub_pc_telescope_reduction_factor 8 -sub_pc_telescope_subcomm_type
>> contiguous -sub_telescope_pc_type lu -sub_telescope_ksp_type preonly
>> -sub_telescope_pc_factor_mat_solver_type mumps -ksp_max_it 2000
>> -ksp_rtol 1.e-30 -ksp_atol 1.e-30
>>
>> the code is calling PCApply_BJacobi_Multiproc. If I use
>>
>> mpiexec -n 32 --oversubscribe ./ex7 -m 1000 -ksp_view
>> -ksp_monitor_true_residual -telescope_ksp_monitor_true_residual
>> -ksp_type preonly -pc_type telescope -pc_telescope_reduction_factor 8
>> -pc_telescope_subcomm_type contiguous -telescope_pc_type bjacobi
>> -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4
>> -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu
>> -telescope_sub_pc_factor_mat_solver_type mumps -telescope_ksp_max_it
>> 2000 -telescope_ksp_rtol 1.e-30 -telescope_ksp_atol 1.e-30
>>
>> the code is calling PCApply_BJacobi_Singleblock. You can test it
>> yourself.
>>
>> Regards,
>>
>> Chang
>>
>> On 10/20/21 1:14 PM, Barry Smith wrote:
>> >
>> >> On Oct 20, 2021, at 12:48 PM, Chang Liu wrote:
>> >>
>> >> Hi Pierre,
>> >>
>> >> I have another suggestion for telescope. I have achieved my goal by
>> >> putting telescope outside bjacobi. But the code still does not work
>> >> if I use telescope as a pc for the subblock. I think the reason is
>> >> that I want to use cusparse as the solver, which can only deal with
>> >> a seqaij matrix and not an mpiaij matrix.
>> >
>> >    This is supposed to work with the recent fixes. The telescope
>> > should produce a seq matrix and, for each solve, map the parallel
>> > vector (over the subdomain) automatically down to the one rank with
>> > the GPU to solve it on the GPU. It is not clear to me where the
>> > process is going wrong.
>> >
>> >    Barry
>> >
>> >> However, for the telescope pc, it can put the matrix onto one MPI
>> >> rank, thus making it a seqaij for the factorization stage, but then
>> >> after factorization it gives the data back to the original
>> >> communicator. This makes the matrix mpiaij again, and then cusparse
>> >> cannot solve it.
>> >>
>> >> I think a better option is to do the factorization on the CPU with
>> >> mpiaij, then transform the preconditioner matrix to seqaij and do
>> >> the MatSolve on the GPU. But I am not sure if this can be achieved
>> >> using telescope.
>> >>
>> >> Regards,
>> >>
>> >> Chang
>> >>
>> >> On 10/15/21 5:29 AM, Pierre Jolivet wrote:
>> >>> Hi Chang,
>> >>> The output you sent with MUMPS looks alright to me; you can see
>> >>> that the MatType is properly set to seqaijcusparse (and not
>> >>> mpiaijcusparse).
>> >>> I don't know what is wrong with
>> >>> -sub_telescope_pc_factor_mat_solver_type cusparse; I don't have a
>> >>> PETSc installation for testing this, but hopefully Barry or Junchao
>> >>> can confirm this wrong behavior and get it fixed.
>> >>> As for permuting PCTELESCOPE and PCBJACOBI: in your case, the outer
>> >>> PC will be equivalent, yes.
>> >>> However, it would be more efficient to do PCBJACOBI and then
>> >>> PCTELESCOPE.
>> >>> PCBJACOBI prunes the operator by basically removing all
>> >>> coefficients outside of the diagonal blocks.
>> >>> Then, PCTELESCOPE "groups everything together".
>> >>> If you do it the other way around, PCTELESCOPE will "group
>> >>> everything together" and then PCBJACOBI will prune the operator.
>> >>> So the PCTELESCOPE SetUp will be costly for nothing, since some
>> >>> coefficients will be thrown out afterwards in the PCBJACOBI SetUp.
>> >>> I hope I'm clear enough; otherwise I can try to draw some pictures.
>> >>> Thanks,
>> >>> Pierre
>> >>>> On 15 Oct 2021, at 4:39 AM, Chang Liu wrote:
>> >>>>
>> >>>> Hi Pierre and Barry,
>> >>>>
>> >>>> I think maybe I should use telescope outside bjacobi, like this:
>> >>>>
>> >>>> mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400
>> >>>> -ksp_view -ksp_monitor_true_residual -pc_type telescope
>> >>>> -pc_telescope_reduction_factor 4 -telescope_pc_type bjacobi
>> >>>> -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4
>> >>>> -mat_type aijcusparse -telescope_sub_ksp_type preonly
>> >>>> -telescope_sub_pc_type lu
>> >>>> -telescope_sub_pc_factor_mat_solver_type cusparse
>> >>>> -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9
>> >>>>
>> >>>> But then I got an error:
>> >>>>
>> >>>> [0]PETSC ERROR: MatSolverType cusparse does not support matrix
>> >>>> type seqaij
>> >>>>
>> >>>> But the mat type should be aijcusparse. I think telescope changes
>> >>>> the mat type.
>> >>>> >> >>>> Chang >> >>>> >> >>>> On 10/14/21 10:11 PM, Chang Liu wrote: >> >>>>> For comparison, here is the output using mumps instead of >> cusparse >> >>>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m >> 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi >> -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse >> -sub_pc_type telescope -sub_ksp_type preonly >> -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu >> -sub_telescope_pc_factor_mat_solver_type mumps >> -sub_pc_telescope_reduction_factor 4 >> -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 >> -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >> >>>>>? ? 0 KSP unpreconditioned resid norm 4.014971979977e+01 true >> resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >> >>>>>? ? 1 KSP unpreconditioned resid norm 2.439995191694e+00 true >> resid norm 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02 >> >>>>>? ? 2 KSP unpreconditioned resid norm 1.280694102588e+00 true >> resid norm 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02 >> >>>>>? ? 3 KSP unpreconditioned resid norm 1.041100266810e+00 true >> resid norm 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02 >> >>>>>? ? 4 KSP unpreconditioned resid norm 7.274347137268e-01 true >> resid norm 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02 >> >>>>>? ? 5 KSP unpreconditioned resid norm 5.429229329787e-01 true >> resid norm 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02 >> >>>>>? ? 6 KSP unpreconditioned resid norm 4.332970410353e-01 true >> resid norm 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02 >> >>>>>? ? 7 KSP unpreconditioned resid norm 3.948206050950e-01 true >> resid norm 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03 >> >>>>>? ? 8 KSP unpreconditioned resid norm 3.379580577269e-01 true >> resid norm 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03 >> >>>>>? ? 9 KSP unpreconditioned resid norm 2.875593971410e-01 true >> resid norm 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03 >> >>>>>? ?10 KSP unpreconditioned resid norm 2.533983363244e-01 true >> resid norm 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03 >> >>>>>? ?11 KSP unpreconditioned resid norm 2.389169921094e-01 true >> resid norm 2.389169921094e-01 ||r(i)||/||b|| 5.950651543793e-03 >> >>>>>? ?12 KSP unpreconditioned resid norm 2.118961639089e-01 true >> resid norm 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 >> >>>>>? ?13 KSP unpreconditioned resid norm 1.885892030223e-01 true >> resid norm 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 >> >>>>>? ?14 KSP unpreconditioned resid norm 1.763510666948e-01 true >> resid norm 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 >> >>>>>? ?15 KSP unpreconditioned resid norm 1.638219366731e-01 true >> resid norm 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 >> >>>>>? ?16 KSP unpreconditioned resid norm 1.476792766432e-01 true >> resid norm 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 >> >>>>>? ?17 KSP unpreconditioned resid norm 1.349906937321e-01 true >> resid norm 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 >> >>>>>? ?18 KSP unpreconditioned resid norm 1.289673236836e-01 true >> resid norm 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 >> >>>>>? ?19 KSP unpreconditioned resid norm 1.167505658153e-01 true >> resid norm 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 >> >>>>>? ?20 KSP unpreconditioned resid norm 1.046037988999e-01 true >> resid norm 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 >> >>>>>? 
?21 KSP unpreconditioned resid norm 9.832660514331e-02 true >> resid norm 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 >> >>>>>? ?22 KSP unpreconditioned resid norm 8.835618950141e-02 true >> resid norm 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 >> >>>>>? ?23 KSP unpreconditioned resid norm 7.563496650115e-02 true >> resid norm 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 >> >>>>>? ?24 KSP unpreconditioned resid norm 6.651291376834e-02 true >> resid norm 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 >> >>>>>? ?25 KSP unpreconditioned resid norm 5.890393227906e-02 true >> resid norm 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 >> >>>>>? ?26 KSP unpreconditioned resid norm 4.661992782780e-02 true >> resid norm 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 >> >>>>>? ?27 KSP unpreconditioned resid norm 3.690705358716e-02 true >> resid norm 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 >> >>>>>? ?28 KSP unpreconditioned resid norm 3.209680460188e-02 true >> resid norm 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 >> >>>>>? ?29 KSP unpreconditioned resid norm 2.354337626000e-02 true >> resid norm 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 >> >>>>>? ?30 KSP unpreconditioned resid norm 1.701296561785e-02 true >> resid norm 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 >> >>>>>? ?31 KSP unpreconditioned resid norm 1.509942937258e-02 true >> resid norm 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 >> >>>>>? ?32 KSP unpreconditioned resid norm 1.258274688515e-02 true >> resid norm 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 >> >>>>>? ?33 KSP unpreconditioned resid norm 9.805748771638e-03 true >> resid norm 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 >> >>>>>? ?34 KSP unpreconditioned resid norm 8.596552678160e-03 true >> resid norm 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 >> >>>>>? ?35 KSP unpreconditioned resid norm 6.936406707500e-03 true >> resid norm 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 >> >>>>>? ?36 KSP unpreconditioned resid norm 5.533741607932e-03 true >> resid norm 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 >> >>>>>? ?37 KSP unpreconditioned resid norm 4.982347757923e-03 true >> resid norm 4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 >> >>>>>? ?38 KSP unpreconditioned resid norm 4.309608348059e-03 true >> resid norm 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 >> >>>>>? ?39 KSP unpreconditioned resid norm 3.729408303186e-03 true >> resid norm 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 >> >>>>>? ?40 KSP unpreconditioned resid norm 3.490003351128e-03 true >> resid norm 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 >> >>>>>? ?41 KSP unpreconditioned resid norm 3.069012426454e-03 true >> resid norm 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 >> >>>>>? ?42 KSP unpreconditioned resid norm 2.772928845284e-03 true >> resid norm 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 >> >>>>>? ?43 KSP unpreconditioned resid norm 2.561454192399e-03 true >> resid norm 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 >> >>>>>? ?44 KSP unpreconditioned resid norm 2.253662762802e-03 true >> resid norm 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 >> >>>>>? ?45 KSP unpreconditioned resid norm 2.086800523919e-03 true >> resid norm 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 >> >>>>>? 
?46 KSP unpreconditioned resid norm 1.926028182896e-03 true >> resid norm 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 >> >>>>>? ?47 KSP unpreconditioned resid norm 1.769243808622e-03 true >> resid norm 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 >> >>>>>? ?48 KSP unpreconditioned resid norm 1.656654905964e-03 true >> resid norm 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 >> >>>>>? ?49 KSP unpreconditioned resid norm 1.572052627273e-03 true >> resid norm 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 >> >>>>>? ?50 KSP unpreconditioned resid norm 1.454960682355e-03 true >> resid norm 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 >> >>>>>? ?51 KSP unpreconditioned resid norm 1.375985053014e-03 true >> resid norm 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 >> >>>>>? ?52 KSP unpreconditioned resid norm 1.269325501087e-03 true >> resid norm 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 >> >>>>>? ?53 KSP unpreconditioned resid norm 1.184791772965e-03 true >> resid norm 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 >> >>>>>? ?54 KSP unpreconditioned resid norm 1.064535156080e-03 true >> resid norm 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 >> >>>>>? ?55 KSP unpreconditioned resid norm 9.639036688120e-04 true >> resid norm 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 >> >>>>>? ?56 KSP unpreconditioned resid norm 8.632359780260e-04 true >> resid norm 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 >> >>>>>? ?57 KSP unpreconditioned resid norm 7.613605783850e-04 true >> resid norm 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 >> >>>>>? ?58 KSP unpreconditioned resid norm 6.681073248348e-04 true >> resid norm 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 >> >>>>>? ?59 KSP unpreconditioned resid norm 5.656127908544e-04 true >> resid norm 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 >> >>>>>? ?60 KSP unpreconditioned resid norm 4.850863370767e-04 true >> resid norm 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 >> >>>>>? ?61 KSP unpreconditioned resid norm 4.374055762320e-04 true >> resid norm 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 >> >>>>>? ?62 KSP unpreconditioned resid norm 3.874398257079e-04 true >> resid norm 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 >> >>>>>? ?63 KSP unpreconditioned resid norm 3.364908694427e-04 true >> resid norm 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 >> >>>>>? ?64 KSP unpreconditioned resid norm 2.961034697265e-04 true >> resid norm 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 >> >>>>>? ?65 KSP unpreconditioned resid norm 2.640593092764e-04 true >> resid norm 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 >> >>>>>? ?66 KSP unpreconditioned resid norm 2.423231125743e-04 true >> resid norm 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 >> >>>>>? ?67 KSP unpreconditioned resid norm 2.182349471179e-04 true >> resid norm 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 >> >>>>>? ?68 KSP unpreconditioned resid norm 2.008438265031e-04 true >> resid norm 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 >> >>>>>? ?69 KSP unpreconditioned resid norm 1.838732863386e-04 true >> resid norm 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 >> >>>>>? ?70 KSP unpreconditioned resid norm 1.723786027645e-04 true >> resid norm 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 >> >>>>>? 
?71 KSP unpreconditioned resid norm 1.580945192204e-04 true >> resid norm 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 >> >>>>>? ?72 KSP unpreconditioned resid norm 1.476687469671e-04 true >> resid norm 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 >> >>>>>? ?73 KSP unpreconditioned resid norm 1.385018526182e-04 true >> resid norm 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 >> >>>>>? ?74 KSP unpreconditioned resid norm 1.279712893541e-04 true >> resid norm 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 >> >>>>>? ?75 KSP unpreconditioned resid norm 1.202010411772e-04 true >> resid norm 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 >> >>>>>? ?76 KSP unpreconditioned resid norm 1.113459414198e-04 true >> resid norm 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 >> >>>>>? ?77 KSP unpreconditioned resid norm 1.042523036036e-04 true >> resid norm 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 >> >>>>>? ?78 KSP unpreconditioned resid norm 9.565176453232e-05 true >> resid norm 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 >> >>>>>? ?79 KSP unpreconditioned resid norm 8.896901670359e-05 true >> resid norm 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 >> >>>>>? ?80 KSP unpreconditioned resid norm 8.119298425803e-05 true >> resid norm 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 >> >>>>>? ?81 KSP unpreconditioned resid norm 7.544528309154e-05 true >> resid norm 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 >> >>>>>? ?82 KSP unpreconditioned resid norm 6.755385041138e-05 true >> resid norm 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 >> >>>>>? ?83 KSP unpreconditioned resid norm 6.158629300870e-05 true >> resid norm 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 >> >>>>>? ?84 KSP unpreconditioned resid norm 5.358756885754e-05 true >> resid norm 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 >> >>>>>? ?85 KSP unpreconditioned resid norm 4.774852370380e-05 true >> resid norm 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 >> >>>>>? ?86 KSP unpreconditioned resid norm 3.919358737908e-05 true >> resid norm 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 >> >>>>>? ?87 KSP unpreconditioned resid norm 3.434042319950e-05 true >> resid norm 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 >> >>>>>? ?88 KSP unpreconditioned resid norm 2.813699436281e-05 true >> resid norm 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 >> >>>>>? ?89 KSP unpreconditioned resid norm 2.462248069068e-05 true >> resid norm 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 >> >>>>>? ?90 KSP unpreconditioned resid norm 2.040558789626e-05 true >> resid norm 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 >> >>>>>? ?91 KSP unpreconditioned resid norm 1.888523204468e-05 true >> resid norm 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 >> >>>>>? ?92 KSP unpreconditioned resid norm 1.707071292484e-05 true >> resid norm 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 >> >>>>>? ?93 KSP unpreconditioned resid norm 1.498636454665e-05 true >> resid norm 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 >> >>>>>? ?94 KSP unpreconditioned resid norm 1.219393542993e-05 true >> resid norm 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 >> >>>>>? ?95 KSP unpreconditioned resid norm 1.059996963300e-05 true >> resid norm 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 >> >>>>>? 
?96 KSP unpreconditioned resid norm 9.099659872548e-06 true >> resid norm 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 >> >>>>>? ?97 KSP unpreconditioned resid norm 8.147347587295e-06 true >> resid norm 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 >> >>>>>? ?98 KSP unpreconditioned resid norm 7.167226146744e-06 true >> resid norm 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 >> >>>>>? ?99 KSP unpreconditioned resid norm 6.552540209538e-06 true >> resid norm 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 >> >>>>> 100 KSP unpreconditioned resid norm 5.767783600111e-06 true >> resid norm 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 >> >>>>> 101 KSP unpreconditioned resid norm 5.261057430584e-06 true >> resid norm 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 >> >>>>> 102 KSP unpreconditioned resid norm 4.715498525786e-06 true >> resid norm 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 >> >>>>> 103 KSP unpreconditioned resid norm 4.380052669622e-06 true >> resid norm 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 >> >>>>> 104 KSP unpreconditioned resid norm 3.911664470060e-06 true >> resid norm 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 >> >>>>> 105 KSP unpreconditioned resid norm 3.652211458315e-06 true >> resid norm 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 >> >>>>> 106 KSP unpreconditioned resid norm 3.387532128049e-06 true >> resid norm 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 >> >>>>> 107 KSP unpreconditioned resid norm 3.234218880987e-06 true >> resid norm 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 >> >>>>> 108 KSP unpreconditioned resid norm 3.016905196388e-06 true >> resid norm 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 >> >>>>> 109 KSP unpreconditioned resid norm 2.858246441921e-06 true >> resid norm 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 >> >>>>> 110 KSP unpreconditioned resid norm 2.637118810847e-06 true >> resid norm 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 >> >>>>> 111 KSP unpreconditioned resid norm 2.494976088717e-06 true >> resid norm 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 >> >>>>> 112 KSP unpreconditioned resid norm 2.270639574272e-06 true >> resid norm 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 >> >>>>> 113 KSP unpreconditioned resid norm 2.104988663813e-06 true >> resid norm 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 >> >>>>> 114 KSP unpreconditioned resid norm 1.889361127301e-06 true >> resid norm 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 >> >>>>> 115 KSP unpreconditioned resid norm 1.732367008052e-06 true >> resid norm 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 >> >>>>> 116 KSP unpreconditioned resid norm 1.509288268391e-06 true >> resid norm 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 >> >>>>> 117 KSP unpreconditioned resid norm 1.359169217644e-06 true >> resid norm 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 >> >>>>> 118 KSP unpreconditioned resid norm 1.180146337735e-06 true >> resid norm 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 >> >>>>> 119 KSP unpreconditioned resid norm 1.067757039683e-06 true >> resid norm 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 >> >>>>> 120 KSP unpreconditioned resid norm 9.435833073736e-07 true >> resid norm 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 >> >>>>> 121 KSP unpreconditioned resid norm 8.749457237613e-07 true >> resid norm 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 >> >>>>> 122 KSP 
unpreconditioned resid norm 7.945760150897e-07 true >> resid norm 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 >> >>>>> 123 KSP unpreconditioned resid norm 7.141240839013e-07 true >> resid norm 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 >> >>>>> 124 KSP unpreconditioned resid norm 6.300566936733e-07 true >> resid norm 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 >> >>>>> 125 KSP unpreconditioned resid norm 5.628986997544e-07 true >> resid norm 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 >> >>>>> 126 KSP unpreconditioned resid norm 5.119018951602e-07 true >> resid norm 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 >> >>>>> 127 KSP unpreconditioned resid norm 4.664670343748e-07 true >> resid norm 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 >> >>>>> 128 KSP unpreconditioned resid norm 4.253264691112e-07 true >> resid norm 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 >> >>>>> 129 KSP unpreconditioned resid norm 3.868921150516e-07 true >> resid norm 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 >> >>>>> 130 KSP unpreconditioned resid norm 3.558445658540e-07 true >> resid norm 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 >> >>>>> 131 KSP unpreconditioned resid norm 3.268710273840e-07 true >> resid norm 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 >> >>>>> 132 KSP unpreconditioned resid norm 3.041273897592e-07 true >> resid norm 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 >> >>>>> 133 KSP unpreconditioned resid norm 2.851926677922e-07 true >> resid norm 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 >> >>>>> 134 KSP unpreconditioned resid norm 2.694708315072e-07 true >> resid norm 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 >> >>>>> 135 KSP unpreconditioned resid norm 2.534825559099e-07 true >> resid norm 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 >> >>>>> 136 KSP unpreconditioned resid norm 2.387342352458e-07 true >> resid norm 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 >> >>>>> 137 KSP unpreconditioned resid norm 2.200861667617e-07 true >> resid norm 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 >> >>>>> 138 KSP unpreconditioned resid norm 2.051415370616e-07 true >> resid norm 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 >> >>>>> 139 KSP unpreconditioned resid norm 1.887376429396e-07 true >> resid norm 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 >> >>>>> 140 KSP unpreconditioned resid norm 1.729743133005e-07 true >> resid norm 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 >> >>>>> 141 KSP unpreconditioned resid norm 1.541021130781e-07 true >> resid norm 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 >> >>>>> 142 KSP unpreconditioned resid norm 1.384631628565e-07 true >> resid norm 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 >> >>>>> 143 KSP unpreconditioned resid norm 1.223114405626e-07 true >> resid norm 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 >> >>>>> 144 KSP unpreconditioned resid norm 1.087313066223e-07 true >> resid norm 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 >> >>>>> 145 KSP unpreconditioned resid norm 9.181901998734e-08 true >> resid norm 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 >> >>>>> 146 KSP unpreconditioned resid norm 7.885850510808e-08 true >> resid norm 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 >> >>>>> 147 KSP unpreconditioned resid norm 6.483393946950e-08 true >> resid norm 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 >> >>>>> 148 KSP unpreconditioned 
resid norm 5.690132597004e-08 true >> resid norm 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 >> >>>>> 149 KSP unpreconditioned resid norm 5.023671521579e-08 true >> resid norm 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 >> >>>>> 150 KSP unpreconditioned resid norm 4.625371062660e-08 true >> resid norm 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 >> >>>>> 151 KSP unpreconditioned resid norm 4.349049084805e-08 true >> resid norm 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 >> >>>>> 152 KSP unpreconditioned resid norm 3.932593324498e-08 true >> resid norm 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 >> >>>>> 153 KSP unpreconditioned resid norm 3.504167649202e-08 true >> resid norm 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 >> >>>>> 154 KSP unpreconditioned resid norm 2.892726347747e-08 true >> resid norm 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 >> >>>>> 155 KSP unpreconditioned resid norm 2.477647033202e-08 true >> resid norm 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 >> >>>>> 156 KSP unpreconditioned resid norm 2.128504065757e-08 true >> resid norm 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 >> >>>>> 157 KSP unpreconditioned resid norm 1.879248809429e-08 true >> resid norm 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 >> >>>>> 158 KSP unpreconditioned resid norm 1.673649140073e-08 true >> resid norm 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 >> >>>>> 159 KSP unpreconditioned resid norm 1.497123388109e-08 true >> resid norm 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 >> >>>>> 160 KSP unpreconditioned resid norm 1.315982130162e-08 true >> resid norm 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 >> >>>>> 161 KSP unpreconditioned resid norm 1.182395864938e-08 true >> resid norm 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 >> >>>>> 162 KSP unpreconditioned resid norm 1.070204481679e-08 true >> resid norm 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 >> >>>>> 163 KSP unpreconditioned resid norm 9.969290307649e-09 true >> resid norm 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 >> >>>>> 164 KSP unpreconditioned resid norm 9.134440883306e-09 true >> resid norm 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 >> >>>>> 165 KSP unpreconditioned resid norm 8.593316427292e-09 true >> resid norm 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 >> >>>>> 166 KSP unpreconditioned resid norm 8.042173048464e-09 true >> resid norm 8.042173332848e-09 ||r(i)||/||b|| 2.003045942277e-10 >> >>>>> 167 KSP unpreconditioned resid norm 7.655518522782e-09 true >> resid norm 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 >> >>>>> 168 KSP unpreconditioned resid norm 7.210283391815e-09 true >> resid norm 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 >> >>>>> 169 KSP unpreconditioned resid norm 6.793967416271e-09 true >> resid norm 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 >> >>>>> 170 KSP unpreconditioned resid norm 6.249160304588e-09 true >> resid norm 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 >> >>>>> 171 KSP unpreconditioned resid norm 5.794936438798e-09 true >> resid norm 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 >> >>>>> 172 KSP unpreconditioned resid norm 5.222337397128e-09 true >> resid norm 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 >> >>>>> 173 KSP unpreconditioned resid norm 4.755359110447e-09 true >> resid norm 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 >> >>>>> 174 KSP unpreconditioned resid norm 
4.317537007873e-09 true >> resid norm 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 >> >>>>> 175 KSP unpreconditioned resid norm 3.924177535665e-09 true >> resid norm 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 >> >>>>> 176 KSP unpreconditioned resid norm 3.502843065115e-09 true >> resid norm 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 >> >>>>> 177 KSP unpreconditioned resid norm 3.083873232869e-09 true >> resid norm 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 >> >>>>> 178 KSP unpreconditioned resid norm 2.758980676473e-09 true >> resid norm 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 >> >>>>> 179 KSP unpreconditioned resid norm 2.510978240429e-09 true >> resid norm 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 >> >>>>> 180 KSP unpreconditioned resid norm 2.323000193205e-09 true >> resid norm 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 >> >>>>> 181 KSP unpreconditioned resid norm 2.167480159274e-09 true >> resid norm 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 >> >>>>> 182 KSP unpreconditioned resid norm 1.983545827983e-09 true >> resid norm 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 >> >>>>> 183 KSP unpreconditioned resid norm 1.794576286774e-09 true >> resid norm 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 >> >>>>> 184 KSP unpreconditioned resid norm 1.583490590644e-09 true >> resid norm 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 >> >>>>> 185 KSP unpreconditioned resid norm 1.412659866247e-09 true >> resid norm 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 >> >>>>> 186 KSP unpreconditioned resid norm 1.285613344939e-09 true >> resid norm 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 >> >>>>> 187 KSP unpreconditioned resid norm 1.168115133929e-09 true >> resid norm 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 >> >>>>> 188 KSP unpreconditioned resid norm 1.063377926053e-09 true >> resid norm 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 >> >>>>> 189 KSP unpreconditioned resid norm 9.548967728122e-10 true >> resid norm 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 >> >>>>> KSP Object: 16 MPI processes >> >>>>>? ? type: fgmres >> >>>>>? ? ? restart=30, using Classical (unmodified) Gram-Schmidt >> Orthogonalization with no iterative refinement >> >>>>>? ? ? happy breakdown tolerance 1e-30 >> >>>>>? ? maximum iterations=2000, initial guess is zero >> >>>>>? ? tolerances:? relative=1e-20, absolute=1e-09, >> divergence=10000. >> >>>>>? ? right preconditioning >> >>>>>? ? using UNPRECONDITIONED norm type for convergence test >> >>>>> PC Object: 16 MPI processes >> >>>>>? ? type: bjacobi >> >>>>>? ? ? number of blocks = 4 >> >>>>>? ? ? Local solver information for first block is in the >> following KSP and PC objects on rank 0: >> >>>>>? ? ? Use -ksp_view ::ascii_info_detail to display >> information for all blocks >> >>>>>? ? KSP Object: (sub_) 4 MPI processes >> >>>>>? ? ? type: preonly >> >>>>>? ? ? maximum iterations=10000, initial guess is zero >> >>>>>? ? ? tolerances:? relative=1e-05, absolute=1e-50, >> divergence=10000. >> >>>>>? ? ? left preconditioning >> >>>>>? ? ? using NONE norm type for convergence test >> >>>>>? ? PC Object: (sub_) 4 MPI processes >> >>>>>? ? ? type: telescope >> >>>>>? ? ? ? petsc subcomm: parent comm size reduction factor = 4 >> >>>>>? ? ? ? petsc subcomm: parent_size = 4 , subcomm_size = 1 >> >>>>>? ? ? ? petsc subcomm type = contiguous >> >>>>>? ? ? linear system matrix = precond matrix: >> >>>>>? ? ? 
Mat Object: (sub_) 4 MPI processes >> >>>>>? ? ? ? type: mpiaij >> >>>>>? ? ? ? rows=40200, cols=40200 >> >>>>>? ? ? ? total: nonzeros=199996, allocated nonzeros=203412 >> >>>>>? ? ? ? total number of mallocs used during MatSetValues calls=0 >> >>>>>? ? ? ? ? not using I-node (on process 0) routines >> >>>>>? ? ? ? ? setup type: default >> >>>>>? ? ? ? ? Parent DM object: NULL >> >>>>>? ? ? ? ? Sub DM object: NULL >> >>>>>? ? ? ? ? KSP Object:? ?(sub_telescope_)? ?1 MPI processes >> >>>>>? ? ? ? ? ? type: preonly >> >>>>>? ? ? ? ? ? maximum iterations=10000, initial guess is zero >> >>>>>? ? ? ? ? ? tolerances:? relative=1e-05, absolute=1e-50, >> divergence=10000. >> >>>>>? ? ? ? ? ? left preconditioning >> >>>>>? ? ? ? ? ? using NONE norm type for convergence test >> >>>>>? ? ? ? ? PC Object:? ?(sub_telescope_)? ?1 MPI processes >> >>>>>? ? ? ? ? ? type: lu >> >>>>>? ? ? ? ? ? ? out-of-place factorization >> >>>>>? ? ? ? ? ? ? tolerance for zero pivot 2.22045e-14 >> >>>>>? ? ? ? ? ? ? matrix ordering: external >> >>>>>? ? ? ? ? ? ? factor fill ratio given 0., needed 0. >> >>>>>? ? ? ? ? ? ? ? Factored matrix follows: >> >>>>>? ? ? ? ? ? ? ? ? Mat Object:? ?1 MPI processes >> >>>>>? ? ? ? ? ? ? ? ? ? type: mumps >> >>>>>? ? ? ? ? ? ? ? ? ? rows=40200, cols=40200 >> >>>>>? ? ? ? ? ? ? ? ? ? package used to perform factorization: mumps >> >>>>>? ? ? ? ? ? ? ? ? ? total: nonzeros=1849788, allocated >> nonzeros=1849788 >> >>>>>? ? ? ? ? ? ? ? ? ? ? MUMPS run parameters: >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? SYM (matrix type):? ? ? ? ? ? ? ? ? ?0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? PAR (host participation):? ? ? ? ? ? 1 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(1) (output for error):? ? ? ? ?6 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(2) (output of diagnostic msg): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(3) (output for global info):? ?0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(4) (level of printing):? ? ? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(5) (input mat struct):? ? ? ? ?0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(6) (matrix prescaling):? ? ? ? 7 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(7) (sequential matrix ordering):7 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(8) (scaling strategy):? ? ? ? 77 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(10) (max num of refinements):? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(11) (error analysis):? ? ? ? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(12) (efficiency control):? ? ? ? 1 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(13) (sequential factorization >> of the root node):? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(14) (percentage of estimated >> workspace increase): 20 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(18) (input mat struct):? ? ? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(19) (Schur complement info): >> ? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(20) (RHS sparse pattern):? ? ? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(21) (solution struct):? ? ? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(22) (in-core/out-of-core >> facility):? ? ? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(23) (max size of memory can be >> allocated locally):0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(24) (detection of null pivot >> rows):? ? ? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(25) (computation of a null >> space basis):? ? ? ? ?0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(26) (Schur options for RHS or >> solution):? ? ? ? ?0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(27) (blocking size for multiple >> RHS):? ? ? ? ?-32 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(28) (use parallel or sequential >> ordering):? ? ? ? ?1 >> >>>>>? ? ? ? ? ? ? ? ? 
? ? ? ICNTL(29) (parallel ordering):? ? ? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(30) (user-specified set of >> entries in inv(A)):? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(31) (factors is discarded in >> the solve phase):? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(33) (compute determinant):? ? ? ? 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(35) (activate BLR based >> factorization):? ? ? ? ?0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(36) (choice of BLR >> factorization variant):? ? ? ? ?0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ICNTL(38) (estimated compression rate >> of LU factors):? ?333 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? CNTL(1) (relative pivoting >> threshold):? ? ? 0.01 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? CNTL(2) (stopping criterion of >> refinement): 1.49012e-08 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? CNTL(3) (absolute pivoting >> threshold):? ? ? 0. >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? CNTL(4) (value of static pivoting): >> ? ? ? ?-1. >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? CNTL(5) (fixation for null pivots): >> ? ? ? ?0. >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? CNTL(7) (dropping parameter for >> BLR):? ? ? ?0. >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? RINFO(1) (local estimated flops for >> the elimination after analysis): >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ? [0] 1.45525e+08 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? RINFO(2) (local estimated flops for >> the assembly after factorization): >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ? [0]? 2.89397e+06 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? RINFO(3) (local estimated flops for >> the elimination after factorization): >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ? [0]? 1.45525e+08 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFO(15) (estimated size of (in MB) >> MUMPS internal data for running numerical factorization): >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? [0] 29 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFO(16) (size of (in MB) MUMPS >> internal data used during numerical factorization): >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ? [0] 29 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFO(23) (num of pivots eliminated on >> this processor after factorization): >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? ? [0] 40200 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? RINFOG(1) (global estimated flops for >> the elimination after analysis): 1.45525e+08 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? RINFOG(2) (global estimated flops for >> the assembly after factorization): 2.89397e+06 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? RINFOG(3) (global estimated flops for >> the elimination after factorization): 1.45525e+08 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? (RINFOG(12) RINFOG(13))*2^INFOG(34) >> (determinant): (0.,0.)*(2^0) >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(3) (estimated real workspace >> for factors on all processors after analysis): 1849788 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(4) (estimated integer workspace >> for factors on all processors after analysis): 879986 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(5) (estimated maximum front >> size in the complete tree): 282 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(6) (number of nodes in the >> complete tree): 23709 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(7) (ordering option effectively >> used after analysis): 5 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(8) (structural symmetry in >> percent of the permuted matrix after analysis): 100 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(9) (total real/complex >> workspace to store the matrix factors after factorization): 1849788 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(10) (total integer space store >> the matrix factors after factorization): 879986 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(11) (order of largest frontal >> matrix after factorization): 282 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? 
INFOG(12) (number of off-diagonal >> pivots): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(13) (number of delayed pivots >> after factorization): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(14) (number of memory compress >> after factorization): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(15) (number of steps of >> iterative refinement after solution): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(16) (estimated size (in MB) of >> all MUMPS internal data for factorization after analysis: value on >> the most memory consuming processor): 29 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(17) (estimated size of all >> MUMPS internal data for factorization after analysis: sum over all >> processors): 29 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(18) (size of all MUMPS internal >> data allocated during factorization: value on the most memory >> consuming processor): 29 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(19) (size of all MUMPS internal >> data allocated during factorization: sum over all processors): 29 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(20) (estimated number of >> entries in the factors): 1849788 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(21) (size in MB of memory >> effectively used during factorization - value on the most memory >> consuming processor): 26 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(22) (size in MB of memory >> effectively used during factorization - sum over all processors): 26 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(23) (after analysis: value of >> ICNTL(6) effectively used): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(24) (after analysis: value of >> ICNTL(12) effectively used): 1 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(25) (after factorization: >> number of pivots modified by static pivoting): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(28) (after factorization: >> number of null pivots encountered): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(29) (after factorization: >> effective number of entries in the factors (sum over all >> processors)): 1849788 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(30, 31) (after solution: size >> in Mbytes of memory used during solution phase): 29, 29 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(32) (after analysis: type of >> analysis done): 1 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(33) (value used for ICNTL(8)): 7 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(34) (exponent of the >> determinant if determinant is requested): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(35) (after factorization: >> number of entries taking into account BLR factor compression - sum >> over all processors): 1849788 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(36) (after analysis: estimated >> size of all MUMPS internal data for running BLR in-core - value on >> the most memory consuming processor): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(37) (after analysis: estimated >> size of all MUMPS internal data for running BLR in-core - sum over >> all processors): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(38) (after analysis: estimated >> size of all MUMPS internal data for running BLR out-of-core - >> value on the most memory consuming processor): 0 >> >>>>>? ? ? ? ? ? ? ? ? ? ? ? INFOG(39) (after analysis: estimated >> size of all MUMPS internal data for running BLR out-of-core - sum >> over all processors): 0 >> >>>>>? ? ? ? ? ? linear system matrix = precond matrix: >> >>>>>? ? ? ? ? ? Mat Object:? ?1 MPI processes >> >>>>>? ? ? ? ? ? ? type: seqaijcusparse >> >>>>>? ? ? ? ? ? ? rows=40200, cols=40200 >> >>>>>? ? ? ? ? ? ? total: nonzeros=199996, allocated nonzeros=199996 >> >>>>>? ? ? ? ? ? ? 
total number of mallocs used during >> MatSetValues calls=0 >> >>>>>? ? ? ? ? ? ? ? not using I-node routines >> >>>>>? ? linear system matrix = precond matrix: >> >>>>>? ? Mat Object: 16 MPI processes >> >>>>>? ? ? type: mpiaijcusparse >> >>>>>? ? ? rows=160800, cols=160800 >> >>>>>? ? ? total: nonzeros=802396, allocated nonzeros=1608000 >> >>>>>? ? ? total number of mallocs used during MatSetValues calls=0 >> >>>>>? ? ? ? not using I-node (on process 0) routines >> >>>>> Norm of error 9.11684e-07 iterations 189 >> >>>>> Chang >> >>>>> On 10/14/21 10:10 PM, Chang Liu wrote: >> >>>>>> Hi Barry, >> >>>>>> >> >>>>>> No problem. Here is the output. It seems that the resid >> norm calculation is incorrect. >> >>>>>> >> >>>>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 >> -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi >> -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse >> -sub_pc_type telescope -sub_ksp_type preonly >> -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu >> -sub_telescope_pc_factor_mat_solver_type cusparse >> -sub_pc_telescope_reduction_factor 4 >> -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 >> -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >> >>>>>>? ? ?0 KSP unpreconditioned resid norm 4.014971979977e+01 >> true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >> >>>>>>? ? ?1 KSP unpreconditioned resid norm 0.000000000000e+00 >> true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >> >>>>>> KSP Object: 16 MPI processes >> >>>>>>? ? ?type: fgmres >> >>>>>>? ? ? ?restart=30, using Classical (unmodified) Gram-Schmidt >> Orthogonalization with no iterative refinement >> >>>>>>? ? ? ?happy breakdown tolerance 1e-30 >> >>>>>>? ? ?maximum iterations=2000, initial guess is zero >> >>>>>>? ? ?tolerances:? relative=1e-20, absolute=1e-09, >> divergence=10000. >> >>>>>>? ? ?right preconditioning >> >>>>>>? ? ?using UNPRECONDITIONED norm type for convergence test >> >>>>>> PC Object: 16 MPI processes >> >>>>>>? ? ?type: bjacobi >> >>>>>>? ? ? ?number of blocks = 4 >> >>>>>>? ? ? ?Local solver information for first block is in the >> following KSP and PC objects on rank 0: >> >>>>>>? ? ? ?Use -ksp_view ::ascii_info_detail to display >> information for all blocks >> >>>>>>? ? ?KSP Object: (sub_) 4 MPI processes >> >>>>>>? ? ? ?type: preonly >> >>>>>>? ? ? ?maximum iterations=10000, initial guess is zero >> >>>>>>? ? ? ?tolerances:? relative=1e-05, absolute=1e-50, >> divergence=10000. >> >>>>>>? ? ? ?left preconditioning >> >>>>>>? ? ? ?using NONE norm type for convergence test >> >>>>>>? ? ?PC Object: (sub_) 4 MPI processes >> >>>>>>? ? ? ?type: telescope >> >>>>>>? ? ? ? ?petsc subcomm: parent comm size reduction factor = 4 >> >>>>>>? ? ? ? ?petsc subcomm: parent_size = 4 , subcomm_size = 1 >> >>>>>>? ? ? ? ?petsc subcomm type = contiguous >> >>>>>>? ? ? ?linear system matrix = precond matrix: >> >>>>>>? ? ? ?Mat Object: (sub_) 4 MPI processes >> >>>>>>? ? ? ? ?type: mpiaij >> >>>>>>? ? ? ? ?rows=40200, cols=40200 >> >>>>>>? ? ? ? ?total: nonzeros=199996, allocated nonzeros=203412 >> >>>>>>? ? ? ? ?total number of mallocs used during MatSetValues >> calls=0 >> >>>>>>? ? ? ? ? ?not using I-node (on process 0) routines >> >>>>>>? ? ? ? ? ?setup type: default >> >>>>>>? ? ? ? ? ?Parent DM object: NULL >> >>>>>>? ? ? ? ? ?Sub DM object: NULL >> >>>>>>? ? ? ? ? ?KSP Object:? ?(sub_telescope_)? ?1 MPI processes >> >>>>>>? ? ? ? ? ? ?type: preonly >> >>>>>>? ? ? ? ? ? 
?maximum iterations=10000, initial guess is zero >> >>>>>>? ? ? ? ? ? ?tolerances:? relative=1e-05, absolute=1e-50, >> divergence=10000. >> >>>>>>? ? ? ? ? ? ?left preconditioning >> >>>>>>? ? ? ? ? ? ?using NONE norm type for convergence test >> >>>>>>? ? ? ? ? ?PC Object:? ?(sub_telescope_)? ?1 MPI processes >> >>>>>>? ? ? ? ? ? ?type: lu >> >>>>>>? ? ? ? ? ? ? ?out-of-place factorization >> >>>>>>? ? ? ? ? ? ? ?tolerance for zero pivot 2.22045e-14 >> >>>>>>? ? ? ? ? ? ? ?matrix ordering: nd >> >>>>>>? ? ? ? ? ? ? ?factor fill ratio given 5., needed 8.62558 >> >>>>>>? ? ? ? ? ? ? ? ?Factored matrix follows: >> >>>>>>? ? ? ? ? ? ? ? ? ?Mat Object:? ?1 MPI processes >> >>>>>>? ? ? ? ? ? ? ? ? ? ?type: seqaijcusparse >> >>>>>>? ? ? ? ? ? ? ? ? ? ?rows=40200, cols=40200 >> >>>>>>? ? ? ? ? ? ? ? ? ? ?package used to perform factorization: >> cusparse >> >>>>>>? ? ? ? ? ? ? ? ? ? ?total: nonzeros=1725082, allocated >> nonzeros=1725082 >> >>>>>>? ? ? ? ? ? ? ? ? ? ? ?not using I-node routines >> >>>>>>? ? ? ? ? ? ?linear system matrix = precond matrix: >> >>>>>>? ? ? ? ? ? ?Mat Object:? ?1 MPI processes >> >>>>>>? ? ? ? ? ? ? ?type: seqaijcusparse >> >>>>>>? ? ? ? ? ? ? ?rows=40200, cols=40200 >> >>>>>>? ? ? ? ? ? ? ?total: nonzeros=199996, allocated nonzeros=199996 >> >>>>>>? ? ? ? ? ? ? ?total number of mallocs used during >> MatSetValues calls=0 >> >>>>>>? ? ? ? ? ? ? ? ?not using I-node routines >> >>>>>>? ? ?linear system matrix = precond matrix: >> >>>>>>? ? ?Mat Object: 16 MPI processes >> >>>>>>? ? ? ?type: mpiaijcusparse >> >>>>>>? ? ? ?rows=160800, cols=160800 >> >>>>>>? ? ? ?total: nonzeros=802396, allocated nonzeros=1608000 >> >>>>>>? ? ? ?total number of mallocs used during MatSetValues calls=0 >> >>>>>>? ? ? ? ?not using I-node (on process 0) routines >> >>>>>> Norm of error 400.999 iterations 1 >> >>>>>> >> >>>>>> Chang >> >>>>>> >> >>>>>> >> >>>>>> On 10/14/21 9:47 PM, Barry Smith wrote: >> >>>>>>> >> >>>>>>>? ? ?Chang, >> >>>>>>> >> >>>>>>>? ? ? Sorry I did not notice that one. Please run that with >> -ksp_view -ksp_monitor_true_residual so we can see exactly how >> options are interpreted and solver used. At a glance it looks ok >> but something must be wrong to get the wrong answer. >> >>>>>>> >> >>>>>>>? ? ?Barry >> >>>>>>> >> >>>>>>>> On Oct 14, 2021, at 6:02 PM, Chang Liu > > wrote: >> >>>>>>>> >> >>>>>>>> Hi Barry, >> >>>>>>>> >> >>>>>>>> That is exactly what I was doing in the second example, >> in which the preconditioner works but the GMRES does not. >> >>>>>>>> >> >>>>>>>> Chang >> >>>>>>>> >> >>>>>>>> On 10/14/21 5:15 PM, Barry Smith wrote: >> >>>>>>>>>? ? ?You need to use the PCTELESCOPE inside the block >> Jacobi, not outside it. So something like -pc_type bjacobi >> -sub_pc_type telescope -sub_telescope_pc_type lu >> >>>>>>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu > > wrote: >> >>>>>>>>>> >> >>>>>>>>>> Hi Pierre, >> >>>>>>>>>> >> >>>>>>>>>> I wonder if the trick of PCTELESCOPE only works for >> preconditioner and not for the solver. I have done some tests, and >> find that for solving a small matrix using -telescope_ksp_type >> preonly, it does work for GPU with multiple MPI processes. >> However, for bjacobi and gmres, it does not work. 
>> >>>>>>>>>> >> >>>>>>>>>> The command line options I used for small matrix is like >> >>>>>>>>>> >> >>>>>>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 >> -ksp_monitor_short -pc_type telescope -mat_type aijcusparse >> -telescope_pc_type lu -telescope_pc_factor_mat_solver_type >> cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4 >> >>>>>>>>>> >> >>>>>>>>>> which gives the correct output. For iterative solver, I >> tried >> >>>>>>>>>> >> >>>>>>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 >> -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type >> fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type >> preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu >> -sub_telescope_pc_factor_mat_solver_type cusparse >> -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol >> 1.e-9 -ksp_atol 1.e-20 >> >>>>>>>>>> >> >>>>>>>>>> for large matrix. The output is like >> >>>>>>>>>> >> >>>>>>>>>>? ? 0 KSP Residual norm 40.1497 >> >>>>>>>>>>? ? 1 KSP Residual norm < 1.e-11 >> >>>>>>>>>> Norm of error 400.999 iterations 1 >> >>>>>>>>>> >> >>>>>>>>>> So it seems to call a direct solver instead of an >> iterative one. >> >>>>>>>>>> >> >>>>>>>>>> Can you please help check these options? >> >>>>>>>>>> >> >>>>>>>>>> Chang >> >>>>>>>>>> >> >>>>>>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote: >> >>>>>>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu > > wrote: >> >>>>>>>>>>>> >> >>>>>>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE >> before. This sounds exactly what I need. I wonder if PCTELESCOPE >> can transform a mpiaijcusparse to seqaircusparse? Or I have to do >> it manually? >> >>>>>>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat(). >> >>>>>>>>>>> 1) I?m not sure this is implemented for cuSparse >> matrices, but it should be; >> >>>>>>>>>>> 2) at least for the implementations >> MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and >> MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType >> is MATBAIJ (resp. MATAIJ). Constructors are usually ?smart? enough >> to detect if the MPI communicator on which the Mat lives is of >> size 1 (your case), and then the resulting Mat is of type MatSeqX >> instead of MatMPIX, so you would not need to worry about the >> transformation you are mentioning. >> >>>>>>>>>>> If you try this out and this does not work, please >> provide the backtrace (probably something like ?Operation XYZ not >> implemented for MatType ABC?), and hopefully someone can add the >> missing plumbing. >> >>>>>>>>>>> I do not claim that this will be efficient, but I >> think this goes in the direction of what you want to achieve. >> >>>>>>>>>>> Thanks, >> >>>>>>>>>>> Pierre >> >>>>>>>>>>>> Chang >> >>>>>>>>>>>> >> >>>>>>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote: >> >>>>>>>>>>>>> Maybe I?m missing something, but can?t you use >> PCTELESCOPE as a subdomain solver, with a reduction factor equal >> to the number of MPI processes you have per block? >> >>>>>>>>>>>>> -sub_pc_type telescope >> -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu >> >>>>>>>>>>>>> This does not work with MUMPS >> -mat_mumps_use_omp_threads because not only do the Mat needs to be >> redistributed, the secondary processes also need to be ?converted? >> to OpenMP threads. >> >>>>>>>>>>>>> Thus the need for specific code in mumps.c. 
>> >>>>>>>>>>>>> Thanks, >> >>>>>>>>>>>>> Pierre >> >>>>>>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via >> petsc-users > > wrote: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Hi Junchao, >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Yes that is what I want. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Chang >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote: >> >>>>>>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith >> >> >> wrote: >> >>>>>>>>>>>>>>>? ? ? ? ?Junchao, >> >>>>>>>>>>>>>>>? ? ? ? ? ? If I understand correctly Chang is >> using the block Jacobi >> >>>>>>>>>>>>>>>? ? ? method with a single block for a number of >> MPI ranks and a direct >> >>>>>>>>>>>>>>>? ? ? solver for each block so it uses >> PCSetUp_BJacobi_Multiproc() which >> >>>>>>>>>>>>>>>? ? ? is code Hong Zhang wrote a number of years >> ago for CPUs. For their >> >>>>>>>>>>>>>>>? ? ? particular problems this preconditioner works >> well, but using an >> >>>>>>>>>>>>>>>? ? ? iterative solver on the blocks does not work >> well. >> >>>>>>>>>>>>>>>? ? ? ? ? ? If we had complete MPI-GPU direct >> solvers he could just use >> >>>>>>>>>>>>>>>? ? ? the current code with MPIAIJCUSPARSE on each >> block but since we do >> >>>>>>>>>>>>>>>? ? ? not he would like to use a single GPU for >> each block, this means >> >>>>>>>>>>>>>>>? ? ? that diagonal blocks of? the global parallel >> MPI matrix needs to be >> >>>>>>>>>>>>>>>? ? ? sent to a subset of the GPUs (one GPU per >> block, which has multiple >> >>>>>>>>>>>>>>>? ? ? MPI ranks associated with the blocks). >> Similarly for the triangular >> >>>>>>>>>>>>>>>? ? ? solves the blocks of the right hand side >> needs to be shipped to the >> >>>>>>>>>>>>>>>? ? ? appropriate GPU and the resulting solution >> shipped back to the >> >>>>>>>>>>>>>>>? ? ? multiple GPUs. So Chang is absolutely >> correct, this is somewhat like >> >>>>>>>>>>>>>>>? ? ? your code for MUMPS with OpenMP. OK, I now >> understand the background.. >> >>>>>>>>>>>>>>>? ? ? One could use PCSetUp_BJacobi_Multiproc() and >> get the blocks on the >> >>>>>>>>>>>>>>>? ? ? MPI ranks and then shrink each block down to >> a single GPU but this >> >>>>>>>>>>>>>>>? ? ? would be pretty inefficient, ideally one >> would go directly from the >> >>>>>>>>>>>>>>>? ? ? big MPI matrix on all the GPUs to the sub >> matrices on the subset of >> >>>>>>>>>>>>>>>? ? ? GPUs. But this may be a large coding project. >> >>>>>>>>>>>>>>> I don't understand these sentences. Why do you say >> "shrink"? In my mind, we just need to move each block (submatrix) >> living over multiple MPI ranks to one of them and solve directly >> there.? In other words, we keep blocks' size, no shrinking or >> expanding. >> >>>>>>>>>>>>>>> As mentioned before, cusparse does not provide LU >> factorization. So the LU factorization would be done on CPU, and >> the solve be done on GPU. I assume Chang wants to gain from the >> (potential) faster solve (instead of factorization) on GPU. >> >>>>>>>>>>>>>>>? ? ? ? ?Barry >> >>>>>>>>>>>>>>>? ? ? Since the matrices being factored and solved >> directly are relatively >> >>>>>>>>>>>>>>>? ? ? large it is possible that the cusparse code >> could be reasonably >> >>>>>>>>>>>>>>>? ? ? efficient (they are not the tiny problems one >> gets at the coarse >> >>>>>>>>>>>>>>>? ? ? level of multigrid). Of course, this is >> speculation, I don't >> >>>>>>>>>>>>>>>? ? ? actually know how much better the cusparse >> code would be on the >> >>>>>>>>>>>>>>>? ? ? direct solver than a good CPU direct sparse >> solver. 
>> >>>>>>>>>>>>>>>? ? ? ?> On Oct 13, 2021, at 9:32 PM, Chang Liu >> >> >>>>>>>>>>>>>>>? ? ? > >> wrote: >> >>>>>>>>>>>>>>>? ? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?> Sorry I am not familiar with the details >> either. Can you please >> >>>>>>>>>>>>>>>? ? ? check the code in >> MatMumpsGatherNonzerosOnMaster in mumps.c? >> >>>>>>>>>>>>>>>? ? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?> Chang >> >>>>>>>>>>>>>>>? ? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?> On 10/13/21 9:24 PM, Junchao Zhang wrote: >> >>>>>>>>>>>>>>>? ? ? ?>> Hi Chang, >> >>>>>>>>>>>>>>>? ? ? ?>>? ?I did the work in mumps. It is easy for >> me to understand >> >>>>>>>>>>>>>>>? ? ? gathering matrix rows to one process. >> >>>>>>>>>>>>>>>? ? ? ?>>? ?But how to gather blocks (submatrices) >> to form a large block?? ? ?Can you draw a picture of that? >> >>>>>>>>>>>>>>>? ? ? ?>>? ?Thanks >> >>>>>>>>>>>>>>>? ? ? ?>> --Junchao Zhang >> >>>>>>>>>>>>>>>? ? ? ?>> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu >> via petsc-users >> >>>>>>>>>>>>>>>? ? ? > > > >> >>>>>>>>>>>>>>>? ? ? > > >>> >> >>>>>>>>>>>>>>>? ? ? wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? Hi Barry, >> >>>>>>>>>>>>>>>? ? ? ?>>? ? I think mumps solver in petsc does >> support that. You can >> >>>>>>>>>>>>>>>? ? ? check the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? documentation on >> "-mat_mumps_use_omp_threads" at >> >>>>>>>>>>>>>>>? ? ? ?>> >> >>>>>>>>>>>>>>> >> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html >> >> >>>>>>>>>>>>>>> >> > >> >>>>>>>>>>>>>>>? ? ? ?>> >> >> >>>>>>>>>>>>>>> >> >> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? and the code enclosed by #if >> >>>>>>>>>>>>>>>? ? ? defined(PETSC_HAVE_OPENMP_SUPPORT) in >> >>>>>>>>>>>>>>>? ? ? ?>>? ? functions MatMumpsSetUpDistRHSInfo and >> >>>>>>>>>>>>>>>? ? ? ?>>? ? MatMumpsGatherNonzerosOnMaster in >> >>>>>>>>>>>>>>>? ? ? ?>>? ? mumps.c >> >>>>>>>>>>>>>>>? ? ? ?>>? ? 1. I understand it is ideal to do one >> MPI rank per GPU. >> >>>>>>>>>>>>>>>? ? ? However, I am >> >>>>>>>>>>>>>>>? ? ? ?>>? ? working on an existing code that was >> developed based on MPI >> >>>>>>>>>>>>>>>? ? ? and the the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? # of mpi ranks is typically equal to # >> of cpu cores. We don't >> >>>>>>>>>>>>>>>? ? ? want to >> >>>>>>>>>>>>>>>? ? ? ?>>? ? change the whole structure of the code. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? 2. What you have suggested has been >> coded in mumps.c. See >> >>>>>>>>>>>>>>>? ? ? function >> >>>>>>>>>>>>>>>? ? ? ?>>? ? MatMumpsSetUpDistRHSInfo. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? Regards, >> >>>>>>>>>>>>>>>? ? ? ?>>? ? Chang >> >>>>>>>>>>>>>>>? ? ? ?>>? ? On 10/13/21 7:53 PM, Barry Smith wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> On Oct 13, 2021, at 3:50 PM, Chang >> Liu >> >>>>>>>>>>>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > > >>> wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> Hi Barry, >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> That is exactly what I want. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> Back to my original question, I am >> looking for an approach to >> >>>>>>>>>>>>>>>? ? ? ?>>? ? transfer >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> matrix >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> data from many MPI processes to >> "master" MPI >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> processes, each of which taking >> care of one GPU, and then >> >>>>>>>>>>>>>>>? ? ? upload >> >>>>>>>>>>>>>>>? ? ? ?>>? ? the data to GPU to >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> solve. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? 
?>> One can just grab some codes from >> mumps.c to >> >>>>>>>>>>>>>>> aijcusparse.cu >> > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > > >>. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>? ? mumps.c doesn't actually do >> that. It never needs to >> >>>>>>>>>>>>>>>? ? ? copy the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? entire matrix to a single MPI rank. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>? ? It would be possible to write >> such a code that you >> >>>>>>>>>>>>>>>? ? ? suggest but >> >>>>>>>>>>>>>>>? ? ? ?>>? ? it is not clear that it makes sense >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> 1)? For normal PETSc GPU usage >> there is one GPU per MPI >> >>>>>>>>>>>>>>>? ? ? rank, so >> >>>>>>>>>>>>>>>? ? ? ?>>? ? while your one GPU per big domain is >> solving its systems the >> >>>>>>>>>>>>>>>? ? ? other >> >>>>>>>>>>>>>>>? ? ? ?>>? ? GPUs (with the other MPI ranks that >> share that domain) are doing >> >>>>>>>>>>>>>>>? ? ? ?>>? ? nothing. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> 2) For each triangular solve you >> would have to gather the >> >>>>>>>>>>>>>>>? ? ? right >> >>>>>>>>>>>>>>>? ? ? ?>>? ? hand side from the multiple ranks to >> the single GPU to pass it to >> >>>>>>>>>>>>>>>? ? ? ?>>? ? the GPU solver and then scatter the >> resulting solution back >> >>>>>>>>>>>>>>>? ? ? to all >> >>>>>>>>>>>>>>>? ? ? ?>>? ? of its subdomain ranks. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>? ? What I was suggesting was assign >> an entire subdomain to a >> >>>>>>>>>>>>>>>? ? ? ?>>? ? single MPI rank, thus it does >> everything on one GPU and can >> >>>>>>>>>>>>>>>? ? ? use the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? GPU solver directly. If all the major >> computations of a subdomain >> >>>>>>>>>>>>>>>? ? ? ?>>? ? can fit and be done on a single GPU >> then you would be >> >>>>>>>>>>>>>>>? ? ? utilizing all >> >>>>>>>>>>>>>>>? ? ? ?>>? ? the GPUs you are using effectively. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>? ? Barry >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> Chang >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>> On 10/13/21 1:53 PM, Barry Smith >> wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>? ? Chang, >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>? ? ? You are correct there is no >> MPI + GPU direct >> >>>>>>>>>>>>>>>? ? ? solvers that >> >>>>>>>>>>>>>>>? ? ? ?>>? ? currently do the triangular solves >> with MPI + GPU parallelism >> >>>>>>>>>>>>>>>? ? ? that I >> >>>>>>>>>>>>>>>? ? ? ?>>? ? am aware of. You are limited that >> individual triangular solves be >> >>>>>>>>>>>>>>>? ? ? ?>>? ? done on a single GPU. I can only >> suggest making each subdomain as >> >>>>>>>>>>>>>>>? ? ? ?>>? ? big as possible to utilize each GPU as >> much as possible for the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? direct triangular solves. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>? ? ?Barry >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> On Oct 13, 2021, at 12:16 PM, >> Chang Liu via petsc-users >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > > > >> >>>>>>>>>>>>>>>? ? ? > > >>> >> >>>>>>>>>>>>>>>? ? ? wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> Hi Mark, >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> '-mat_type aijcusparse' works >> with mpiaijcusparse with >> >>>>>>>>>>>>>>>? ? ? other >> >>>>>>>>>>>>>>>? ? ? ?>>? ? 
solvers, but with >> -pc_factor_mat_solver_type cusparse, it >> >>>>>>>>>>>>>>>? ? ? will give >> >>>>>>>>>>>>>>>? ? ? ?>>? ? an error. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> Yes what I want is to have mumps >> or superlu to do the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? factorization, and then do the rest, >> including GMRES solver, >> >>>>>>>>>>>>>>>? ? ? on gpu. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? Is that possible? >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> I have tried to use aijcusparse >> with superlu_dist, it >> >>>>>>>>>>>>>>>? ? ? runs but >> >>>>>>>>>>>>>>>? ? ? ?>>? ? the iterative solver is still running >> on CPUs. I have >> >>>>>>>>>>>>>>>? ? ? contacted the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? superlu group and they confirmed that >> is the case right now. >> >>>>>>>>>>>>>>>? ? ? But if >> >>>>>>>>>>>>>>>? ? ? ?>>? ? I set -pc_factor_mat_solver_type >> cusparse, it seems that the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? iterative solver is running on GPU. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> Chang >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>> On 10/13/21 12:03 PM, Mark Adams >> wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>> On Wed, Oct 13, 2021 at 11:10 >> AM Chang Liu >> >>>>>>>>>>>>>>>? ? ? >> > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > >> >> >>>>>>>>>>>>>>>? ? ? >> > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > > >>>> wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?Thank you Junchao for >> explaining this. I guess in >> >>>>>>>>>>>>>>>? ? ? my case >> >>>>>>>>>>>>>>>? ? ? ?>>? ? the code is >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?just calling a seq solver >> like superlu to do >> >>>>>>>>>>>>>>>? ? ? ?>>? ? factorization on GPUs. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?My idea is that I want to >> have a traditional MPI >> >>>>>>>>>>>>>>>? ? ? code to >> >>>>>>>>>>>>>>>? ? ? ?>>? ? utilize GPUs >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?with cusparse. Right now >> cusparse does not support >> >>>>>>>>>>>>>>>? ? ? mpiaij >> >>>>>>>>>>>>>>>? ? ? ?>>? ? matrix, Sure it does: '-mat_type >> aijcusparse' will give you an >> >>>>>>>>>>>>>>>? ? ? ?>>? ? mpiaijcusparse matrix with > 1 processes. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>> (-mat_type mpiaijcusparse might >> also work with >1 proc). >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>> However, I see in grepping the >> repo that all the mumps and >> >>>>>>>>>>>>>>>? ? ? ?>>? ? superlu tests use aij or sell matrix type. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>> MUMPS and SuperLU provide their >> own solves, I assume >> >>>>>>>>>>>>>>>? ? ? .... but >> >>>>>>>>>>>>>>>? ? ? ?>>? ? you might want to do other matrix >> operations on the GPU. Is >> >>>>>>>>>>>>>>>? ? ? that the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? issue? >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>> Did you try -mat_type >> aijcusparse with MUMPS and/or >> >>>>>>>>>>>>>>>? ? ? SuperLU >> >>>>>>>>>>>>>>>? ? ? ?>>? ? have a problem? (no test with it so it >> probably does not work) >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>> Thanks, >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>> Mark >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?so I >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?want the code to have a >> mpiaij matrix when adding >> >>>>>>>>>>>>>>>? ? ? all the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? matrix terms, >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?and then transform the >> matrix to seqaij when doing the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? factorization >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?and >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? 
?solve. This involves >> sending the data to the master >> >>>>>>>>>>>>>>>? ? ? ?>>? ? process, and I >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?think >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?the petsc mumps solver have >> something similar already. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?Chang >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?On 10/13/21 10:18 AM, >> Junchao Zhang wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > On Tue, Oct 12, 2021 at >> 1:07 PM Mark Adams >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > > > >> >>>>>>>>>>>>>>>? ? ? > > >> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?> > > >> >>>>>>>>>>>>>>>? ? ? > > >>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > > >> >>>>>>>>>>>>>>>? ? ? > > > >> >>>>>>>>>>>>>>>? ? ? > >> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > > > >> >>>>>>>>>>>>>>>? ? ? > > >>>>> wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ?On Tue, Oct 12, 2021 >> at 1:45 PM Chang Liu >> >>>>>>>>>>>>>>>? ? ? ?>>? ? >> > >> >> >>>>>>>>>>>>>>>? ? ? >> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?> > >> >>>>>>>>>>>>>>>? ? ? >> >>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> ? >> >>>>>>>>>>>>>>>? ? ? > >> > >> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > > >> >>>>>>>>>>>>>>>? ? ? >> >>>>> wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?Hi Mark, >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?The option I use >> is like >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?-pc_type bjacobi >> -pc_bjacobi_blocks 16 >> >>>>>>>>>>>>>>>? ? ? ?>>? ? -ksp_type fgmres >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?-mat_type >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?aijcusparse >> *-sub_pc_factor_mat_solver_type >> >>>>>>>>>>>>>>>? ? ? ?>>? ? cusparse >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?*-sub_ksp_type >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?preonly >> *-sub_pc_type lu* -ksp_max_it 2000 >> >>>>>>>>>>>>>>>? ? ? ?>>? ? -ksp_rtol 1.e-300 >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?-ksp_atol 1.e-300 >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ?Note, If you use >> -log_view the last column >> >>>>>>>>>>>>>>>? ? ? (rows >> >>>>>>>>>>>>>>>? ? ? ?>>? ? are the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?method like >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ?MatFactorNumeric) >> has the percent of work >> >>>>>>>>>>>>>>>? ? ? in the GPU. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ?Junchao: *This* >> implies that we have a >> >>>>>>>>>>>>>>>? ? ? cuSparse LU >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?factorization. Is >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ?that correct? (I >> don't think we do) >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > No, we don't have >> cuSparse LU factorization.? ? ?If you check >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> MatLUFactorSymbolic_SeqAIJCUSPARSE(),you will >> >>>>>>>>>>>>>>>? ? ? find it >> >>>>>>>>>>>>>>>? ? ? ?>>? ? calls >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> MatLUFactorSymbolic_SeqAIJ() instead. >> >>>>>>>>>>>>>>>? ? ? ?>>? 
? ?>>>>>? ? ? > So I don't understand >> Chang's idea. Do you want to >> >>>>>>>>>>>>>>>? ? ? ?>>? ? make bigger >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?blocks? >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?I think this one >> do both factorization and >> >>>>>>>>>>>>>>>? ? ? ?>>? ? solve on gpu. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?You can check the >> >>>>>>>>>>>>>>>? ? ? runex72_aijcusparse.sh file >> >>>>>>>>>>>>>>>? ? ? ?>>? ? in petsc >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?install >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?directory, and >> try it your self (this >> >>>>>>>>>>>>>>>? ? ? is only lu >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?factorization >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?without >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?iterative solve). >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?Chang >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?On 10/12/21 1:17 >> PM, Mark Adams wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > On Tue, Oct >> 12, 2021 at 11:19 AM >> >>>>>>>>>>>>>>>? ? ? Chang Liu >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?> > >> >>>>>>>>>>>>>>>? ? ? >> >> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > > >> >>>>>>>>>>>>>>>? ? ? >> >>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? > >> ? >> >>>>>>>>>>>>>>>? ? ? > >> > >> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > > >> >>>>>>>>>>>>>>>? ? ? >> >>>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >> >>>>>>>>>>>>>>>? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > >> >> >>>>>>>>>>>>>>>? ? ? >> > >> >> >>>>>>>>>>>>>>>? ? ? >>> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?> > >> >>>>>>>>>>>>>>>? ? ? >> >> >> >>>>>>>>>>>>>>>? ? ? ?>>? ? > > >> >>>>>>>>>>>>>>>? ? ? >> >>>>>> wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?Hi Junchao, >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?No I only >> needs it to be transferred >> >>>>>>>>>>>>>>>? ? ? ?>>? ? within a >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?node. I use >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?block-Jacobi >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?method >> and GMRES to solve the sparse >> >>>>>>>>>>>>>>>? ? ? ?>>? ? matrix, so each >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?direct solver will >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?take care >> of a sub-block of the >> >>>>>>>>>>>>>>>? ? ? whole >> >>>>>>>>>>>>>>>? ? ? ?>>? ? matrix. In this >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?way, I can use >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?one >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?GPU to >> solve one sub-block, which is >> >>>>>>>>>>>>>>>? ? ? ?>>? ? stored within >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?one node. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?It was >> stated in the >> >>>>>>>>>>>>>>>? ? ? documentation that >> >>>>>>>>>>>>>>>? ? ? ?>>? ? cusparse >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?solver >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? 
? ? >? ? ? ? ?is slow. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?However, >> in my test using >> >>>>>>>>>>>>>>>? ? ? ex72.c, the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? cusparse >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?solver is >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?faster than >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?mumps or >> superlu_dist on CPUs. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > Are we >> talking about the >> >>>>>>>>>>>>>>>? ? ? factorization, the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? solve, or >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?both? >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > We do not >> have an interface to >> >>>>>>>>>>>>>>>? ? ? cuSparse's LU >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?factorization (I >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?just >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > learned that >> it exists a few weeks ago). >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > Perhaps your >> fast "cusparse solver" is >> >>>>>>>>>>>>>>>? ? ? ?>>? ? '-pc_type lu >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?-mat_type >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > aijcusparse' >> ? This would be the CPU >> >>>>>>>>>>>>>>>? ? ? ?>>? ? factorization, >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?which is the >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > dominant cost. >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?Chang >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?On >> 10/12/21 10:24 AM, Junchao >> >>>>>>>>>>>>>>>? ? ? Zhang wrote: >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ? > Hi, Chang, >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ? > >> ?For the mumps solver, we >> >>>>>>>>>>>>>>>? ? ? usually >> >>>>>>>>>>>>>>>? ? ? ?>>? ? transfers >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?matrix >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?and vector >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?data >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ? > within >> a compute node.? For >> >>>>>>>>>>>>>>>? ? ? the idea you >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?propose, it >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?looks like >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ?we need >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ? > to >> gather data within >> >>>>>>>>>>>>>>>? ? ? ?>>? ? MPI_COMM_WORLD, right? >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ? > >> ?Mark, I remember you said >> >>>>>>>>>>>>>>>? ? ? ?>>? ? cusparse solve is >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ?slow >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ?and you would >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ? > rather >> do it on CPU. Is it right? >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ? > >> --Junchao Zhang >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? ?>>>>>? ? ? >? ? ? ? ? >? ? ? > >> >>>>>>>>>>>>>>>? ? ? ?>>? ? 
>> >>>>>>>>>>>>>>> > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users wrote:
>> >>>>>>>>>>>>>>> > >
>> >>>>>>>>>>>>>>> > > > Hi,
>> >>>>>>>>>>>>>>> > > >
>> >>>>>>>>>>>>>>> > > > Currently, it is possible to use mumps solver in PETSC with
>> >>>>>>>>>>>>>>> > > > -mat_mumps_use_omp_threads option, so that multiple MPI processes will
>> >>>>>>>>>>>>>>> > > > transfer the matrix and rhs data to the master rank, and then master
>> >>>>>>>>>>>>>>> > > > rank will call mumps with OpenMP to solve the matrix.
>> >>>>>>>>>>>>>>> > > >
>> >>>>>>>>>>>>>>> > > > I wonder if someone can develop similar option for cusparse solver.
>> >>>>>>>>>>>>>>> > > > Right now, this solver does not work with mpiaijcusparse. I think a
>> >>>>>>>>>>>>>>> > > > possible workaround is to transfer all the matrix data to one MPI
>> >>>>>>>>>>>>>>> > > > process, and then upload the data to GPU to solve. In this way, one can
>> >>>>>>>>>>>>>>> > > > use cusparse solver for a MPI program.
>> >>>>>>>>>>>>>>> > > >
>> >>>>>>>>>>>>>>> > > > Chang
>> >>>>>>>>>>>>>>> > > > --
>> >>>>>>>>>>>>>>> > > > Chang Liu
>> >>>>>>>>>>>>>>> > > > Staff Research Physicist
>> >>>>>>>>>>>>>>> > > > +1 609 243 3438
>> >>>>>>>>>>>>>>> > > > cliu at pppl.gov
>> >>>>>>>>>>>>>>> > > > Princeton Plasma Physics Laboratory
>> >>>>>>>>>>>>>>> > > > 100 Stellarator Rd, Princeton NJ 08540, USA
>>
>> --
>> Chang Liu
>> Staff Research Physicist
>> +1 609 243 3438
>> cliu at pppl.gov
>> Princeton Plasma Physics Laboratory
>> 100 Stellarator Rd, Princeton NJ 08540, USA
>>

--
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

From knepley at gmail.com  Wed Oct 20 17:55:58 2021
From: knepley at gmail.com (Matthew Knepley)
Date: Wed, 20 Oct 2021 18:55:58 -0400
Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ?
In-Reply-To: <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca>
References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca>
 <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca>
Message-ID: 

On Wed, Oct 20, 2021 at 3:06 PM Eric Chamberland <
Eric.Chamberland at giref.ulaval.ca> wrote:

> Hi Matthew,
>
> we tried to reproduce the error in a simple example.
>
> The context is the following: we hard-coded the mesh and initial partition
> into the code (see sConnectivity and sInitialPartition) for 2 ranks and try
> to create a section in order to use the DMPlexNaturalToGlobalBegin function
> to retrieve our initial element numbers.
>
> Now the call to DMPlexDistribute gives different errors depending on what
> type of component we ask the field to be created.
For our objective, we > would like a global field to be created on elements only (like a P0 > interpolation). > > We now have the following error generated: > > [0]PETSC ERROR: --------------------- Error Message > -------------------------------------------------------------- > [0]PETSC ERROR: Petsc has generated inconsistent data > [0]PETSC ERROR: Inconsistency in indices, 18 should be 17 > [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html > for trouble shooting. > [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar 30, 2021 > [0]PETSC ERROR: ./bug on a named rohan by ericc Wed Oct 20 14:52:36 2021 > [0]PETSC ERROR: Configure options > --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 --with-mpi-compilers=1 > --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 --with-cxx-dialect=C++14 > --with-make-np=12 --with-shared-libraries=1 --with-debugging=yes > --with-memalign=64 --with-visibility=0 --with-64-bit-indices=0 > --download-ml=yes --download-mumps=yes --download-superlu=yes > --download-hpddm=yes --download-slepc=yes --download-superlu_dist=yes > --download-parmetis=yes --download-ptscotch=yes --download-metis=yes > --download-strumpack=yes --download-suitesparse=yes --download-hypre=yes > --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 > --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. > --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. > --with-scalapack=1 > --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include > --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 > -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" > [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at > /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 > [0]PETSC ERROR: #2 DMPlexCreateGlobalToNaturalSF() at > /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 > [0]PETSC ERROR: #3 DMPlexDistribute() at > /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 > [0]PETSC ERROR: #4 main() at bug_section.cc:159 > [0]PETSC ERROR: No PETSc Option Table entries > [0]PETSC ERROR: ----------------End of Error Message -------send entire > error message to petsc-maint at mcs.anl.gov---------- > > Hope the attached code is self-explaining, note that to make it short, we > have not included the final part of it, just the buggy part we are > encountering right now... > > Thanks for your insights, > > Thanks for making the example. I tweaked it slightly. I put in a test case that just makes a parallel 7 x 10 quad mesh. This works fine. Thus I think it must be something connected with the original mesh. It is hard to get a handle on it without the coordinates. Do you think you could put the coordinate array in? I have added the code to load them (see attached file). Thanks, Matt > Eric > On 2021-10-06 9:23 p.m., Matthew Knepley wrote: > > On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland < > Eric.Chamberland at giref.ulaval.ca> wrote: > >> Hi Matthew, >> >> we tried to use that. Now, we discovered that: >> >> 1- even if we "ask" for sfNatural creation with DMSetUseNatural, it is >> not created because DMPlexCreateGlobalToNaturalSF looks for a "section": >> this is not documented in DMSetUseNaturalso we are asking ourselfs: "is >> this a permanent feature or a temporary situation?" >> > I think explaining this will help clear up a lot. > > What the Natural2Global map does is permute a solution vector into the > ordering that it would have had prior to mesh distribution. 
> Now, in order to do this permutation, I need to know the original (global) > data layout. If it is not specified _before_ distribution, we > cannot build the permutation. The section describes the data layout, so I > need it before distribution. > > I cannot think of another way that you would implement this, but if you > want something else, let me know. > >> 2- We then tried to create a "section" in different manners: we took the >> code into the example petsc/src/dm/impls/plex/tests/ex15.c. However, we >> ended up with a segfault: >> >> corrupted size vs. prev_size >> [rohan:07297] *** Process received signal *** >> [rohan:07297] Signal: Aborted (6) >> [rohan:07297] Signal code: (-6) >> [rohan:07297] [ 0] /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] >> [rohan:07297] [ 1] /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] >> [rohan:07297] [ 2] /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] >> [rohan:07297] [ 3] /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] >> [rohan:07297] [ 4] /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] >> [rohan:07297] [ 5] /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] >> [rohan:07297] [ 6] /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] >> [rohan:07297] [ 7] /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] >> [rohan:07297] [ 8] /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] >> [rohan:07297] [ 9] >> /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] >> [rohan:07297] [10] >> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] >> [rohan:07297] [11] >> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] >> [rohan:07297] [12] >> /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] >> [rohan:07297] [13] /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] >> >> [rohan:07297] [14] >> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] >> [rohan:07297] [15] >> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] >> [rohan:07297] [16] >> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] >> [rohan:07297] [17] >> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] >> [rohan:07297] [18] >> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] >> > I am not sure what happened here, but if you could send a sample code, I > will figure it out. > >> If we do not create a section, the call to DMPlexDistribute is >> successful, but DMPlexGetGlobalToNaturalSF return a null SF pointer... >> > Yes, it just ignores it in this case because it does not have a global > layout. > >> Here are the operations we are calling ( this is almost the code we are >> using, I just removed verifications and creation of the connectivity which >> use our parallel structure and code): >> >> =========== >> >> PetscInt* lCells = 0; >> PetscInt lNumCorners = 0; >> PetscInt lDimMail = 0; >> PetscInt lnumCells = 0; >> >> //At this point we create the cells for PETSc expected input for >> DMPlexBuildFromCellListParallel and set lNumCorners, lDimMail and lnumCells >> to correct values. >> ... 
>>
>> DM lDMBete = 0;
>> DMPlexCreate(lMPIComm,&lDMBete);
>>
>> DMSetDimension(lDMBete, lDimMail);
>>
>> DMPlexBuildFromCellListParallel(lDMBete,
>>                                 lnumCells,
>>                                 PETSC_DECIDE,
>>                                 pLectureElementsLocaux.reqNbTotalSommets(),
>>                                 lNumCorners,
>>                                 lCells,
>>                                 PETSC_NULL);
>>
>> DM lDMBeteInterp = 0;
>> DMPlexInterpolate(lDMBete, &lDMBeteInterp);
>> DMDestroy(&lDMBete);
>> lDMBete = lDMBeteInterp;
>>
>> DMSetUseNatural(lDMBete,PETSC_TRUE);
>>
>> PetscSF lSFMigrationSansOvl = 0;
>> PetscSF lSFMigrationOvl = 0;
>> DM lDMDistribueSansOvl = 0;
>> DM lDMAvecOverlap = 0;
>>
>> PetscPartitioner lPart;
>> DMPlexGetPartitioner(lDMBete, &lPart);
>> PetscPartitionerSetFromOptions(lPart);
>>
>> PetscSection section;
>> PetscInt numFields = 1;
>> PetscInt numBC = 0;
>> PetscInt numComp[1] = {1};
>> PetscInt numDof[4] = {1, 0, 0, 0};
>> PetscInt bcFields[1] = {0};
>> IS bcPoints[1] = {NULL};
>>
>> DMSetNumFields(lDMBete, numFields);
>>
>> DMPlexCreateSection(lDMBete, NULL, numComp, numDof, numBC, bcFields,
>> bcPoints, NULL, NULL, &section);
>> DMSetLocalSection(lDMBete, section);
>>
>> DMPlexDistribute(lDMBete, 0, &lSFMigrationSansOvl,
>> &lDMDistribueSansOvl); // segfault!
>>
>> ===========
>>
>> So we have other questions/remarks:
>>
>> 3- Maybe PETSc expects something specific that is missing/not verified:
>> for example, we didn't give any coordinates since we just want to partition
>> and compute overlap for the mesh... and then recover our element numbers in
>> a "simple way"
>>
>> 4- We are telling ourselves it is somewhat a "big price to pay" to have
>> to build an unused section to have the global to natural ordering set?
>> Could this requirement be avoided?
>>
> I don't think so. There would have to be _some_ way of describing your
> data layout in terms of mesh points, and I do not see how you could use
> less memory doing that.
>
>> 5- Are there any improvements toward our usage in the 3.16 release?
>>
> Let me try and run the code above.
>
>   Thanks,
>
>      Matt
>
>> Thanks,
>>
>> Eric
>>
>>
>> On 2021-09-29 7:39 p.m., Matthew Knepley wrote:
>>
>> On Wed, Sep 29, 2021 at 5:18 PM Eric Chamberland <
>> Eric.Chamberland at giref.ulaval.ca> wrote:
>>
>>> Hi,
>>>
>>> I come back with _almost_ the original question:
>>>
>>> I would like to add an integer information (*our* original element
>>> number, not petsc one) on each element of the DMPlex I create with
>>> DMPlexBuildFromCellListParallel.
>>>
>>> I would like this integer to be distributed by, or in the same way as,
>>> DMPlexDistribute distributes the mesh.
>>>
>>> Is it possible to do this?
>>>
>>
>> I think we already have support for what you want. If you call
>>
>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html
>>
>> before DMPlexDistribute(), it will compute a PetscSF encoding the global
>> to natural map. You can get it with
>>
>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html
>>
>> and use it with
>>
>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html
>>
>> Is this sufficient?
>>
>>   Thanks,
>>
>>      Matt
>>
>>> Thanks,
>>>
>>> Eric
>>>
>>> On 2021-07-14 1:18 p.m., Eric Chamberland wrote:
>>> > Hi,
>>> >
>>> > I want to use DMPlexDistribute from PETSc for computing overlapping
>>> > and play with the different partitioners supported.
>>> >
>>> > However, after calling DMPlexDistribute, I noticed the elements are
>>> > renumbered and then the original number is lost.
>>> >
>>> > What would be the best way to keep track of the element renumbering?
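[A minimal sketch of the workflow discussed above: declare a P0 (one value per cell) section, call DMSetUseNatural() before DMPlexDistribute(), then move the application's original cell numbers forward with the natural-to-global SF. This is an editor's illustration, not code from this thread or from the attached ex44.c: GetOriginalCellNumber() is a hypothetical application helper, error checking (CHKERRQ) is omitted as in the listing above, and it assumes a non-overlapping pre-distribution mesh whose global vector carries the natural layout.]

#include <petscdmplex.h>

extern PetscInt GetOriginalCellNumber(PetscInt localCell); /* hypothetical helper */

void DistributeWithOriginalNumbers(DM dm /* pre-distribution DMPlex */)
{
  PetscSection sec;
  PetscSF      migrationSF = NULL;
  DM           dmDist      = NULL;
  Vec          natVec, globVec;
  PetscScalar *a;
  PetscInt     cStart, cEnd, c;

  DMSetUseNatural(dm, PETSC_TRUE);               /* must precede DMPlexDistribute() */

  /* P0 layout: one dof on every cell, nothing on vertices/edges/faces */
  DMPlexGetHeightStratum(dm, 0, &cStart, &cEnd);
  PetscSectionCreate(PetscObjectComm((PetscObject)dm), &sec);
  PetscSectionSetChart(sec, cStart, cEnd);
  for (c = cStart; c < cEnd; ++c) PetscSectionSetDof(sec, c, 1);
  PetscSectionSetUp(sec);
  DMSetLocalSection(dm, sec);
  PetscSectionDestroy(&sec);

  /* Distribution also builds the global-to-natural SF on the new DM */
  DMPlexDistribute(dm, 0, &migrationSF, &dmDist);
  if (!dmDist) return;                           /* single-rank run: nothing was distributed */

  /* Fill a vector in the original (pre-distribution) layout with our numbers;
     with a P0 section and no overlap, owned cell c sits at local offset c - cStart */
  DMCreateGlobalVector(dm, &natVec);
  VecGetArray(natVec, &a);
  for (c = cStart; c < cEnd; ++c) a[c - cStart] = (PetscScalar)GetOriginalCellNumber(c);
  VecRestoreArray(natVec, &a);

  /* Push the numbers forward into the distributed (global) ordering */
  DMCreateGlobalVector(dmDist, &globVec);
  DMPlexNaturalToGlobalBegin(dmDist, natVec, globVec);
  DMPlexNaturalToGlobalEnd(dmDist, natVec, globVec);
  /* each local cell of dmDist can now read back its original number from globVec */

  VecDestroy(&natVec);
  VecDestroy(&globVec);
  PetscSFDestroy(&migrationSF);
  /* keep dmDist for the rest of the computation; destroy dm when done */
}

[The reverse direction, mapping a distributed field back to the original ordering, would use DMPlexGlobalToNaturalBegin/End; the underlying SF can also be retrieved explicitly with DMPlexGetGlobalToNaturalSF().]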
>>> > >>> > a) Adding an optional parameter to let the user retrieve a vector or >>> > "IS" giving the old number? >>> > >>> > b) Adding a DMLabel (seems a wrong good solution) >>> > >>> > c) Other idea? >>> > >>> > Of course, I don't want to loose performances with the need of this >>> > "mapping"... >>> > >>> > Thanks, >>> > >>> > Eric >>> > >>> -- >>> Eric Chamberland, ing., M. Ing >>> Professionnel de recherche >>> GIREF/Universit? Laval >>> (418) 656-2131 poste 41 22 42 >>> >>> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ex44.c Type: application/octet-stream Size: 5243 bytes Desc: not available URL: From bsmith at petsc.dev Wed Oct 20 18:56:19 2021 From: bsmith at petsc.dev (Barry Smith) Date: Wed, 20 Oct 2021 19:56:19 -0400 Subject: [petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver In-Reply-To: <234f9bc5-cdcc-2253-69b6-7a09ab915661@pppl.gov> References: <49ed3af6-a2e9-a55b-f196-988f9cf13e2b@pppl.gov> <280f385e-c242-0497-7b08-171246e0f5ad@pppl.gov> <879c30a1-ea85-1c24-4139-268925d511da@pppl.gov> <53D4EDD7-E05C-4485-B7AE-23AB10DD81B1@joliv.et> <968434BC-E8DC-49B0-9119-F208DB1E01B0@petsc.dev> <7a3d5347-f2da-b4a9-f44a-aa534a314c7f@pppl.gov> <144089C5-D011-4A94-9AC1-F4AD5A66257C@petsc.dev> <234f9bc5-cdcc-2253-69b6-7a09ab915661@pppl.gov> Message-ID: <53F8E30F-5A8D-47AF-BBE2-9FA928047FBD@petsc.dev> Hmm. A fix should work (almost exactly the same) with or without the block Jacobi on subdomains level, I had assumed that Junchao's branch would handle this. Have you looked at it? Barry > On Oct 20, 2021, at 6:14 PM, Chang Liu wrote: > > Hi Barry, > > Wait, by "branch" are you talking about the MR Junchao submitted? > > That fix (proposed by me) is only to fix the issue for telescope to work on mpiaijcusparse, when using outside bjacobi. It has nothing to do with the issue for telescope inside bjacobi. It does not help in my tests. > > If my emails made you think the other way, I apologize for that. > > Regards, > > Chang > > On 10/20/21 4:40 PM, Barry Smith wrote: >> Yes, but the branch can be used to do telescoping inside the bjacobi as needed. >>> On Oct 20, 2021, at 2:59 PM, Junchao Zhang > wrote: >>> >>> The MR https://gitlab.com/petsc/petsc/-/merge_requests/4471 has not been merged yet. >>> >>> --Junchao Zhang >>> >>> >>> On Wed, Oct 20, 2021 at 1:47 PM Chang Liu via petsc-users > wrote: >>> >>> Hi Barry, >>> >>> Are the fixes merged in the master? 
I was using bjacobi as a >>> preconditioner. Using the latest version of petsc, I found that by >>> calling >>> >>> mpiexec -n 32 --oversubscribe ./ex7 -m 1000 -ksp_view >>> -ksp_monitor_true_residual -ksp_type fgmres -pc_type bjacobi >>> -pc_bjacobi >>> _blocks 4 -sub_ksp_type preonly -sub_pc_type telescope >>> -sub_pc_telescope_reduction_factor 8 -sub_pc_telescope_subcomm_type >>> contiguous -sub_telescope_pc_type lu -sub_telescope_ksp_type preonly >>> -sub_telescope_pc_factor_mat_solver_type mumps -ksp_max_it 2000 >>> -ksp_rtol 1.e-30 -ksp_atol 1.e-30 >>> >>> The code is calling PCApply_BJacobi_Multiproc. If I use >>> >>> mpiexec -n 32 --oversubscribe ./ex7 -m 1000 -ksp_view >>> -ksp_monitor_true_residual -telescope_ksp_monitor_true_residual >>> -ksp_type preonly -pc_type telescope -pc_telescope_reduction_factor 8 >>> -pc_telescope_subcomm_type contiguous -telescope_pc_type bjacobi >>> -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4 >>> -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu >>> -telescope_sub_pc_factor_mat_solver_type mumps -telescope_ksp_max_it >>> 2000 -telescope_ksp_rtol 1.e-30 -telescope_ksp_atol 1.e-30 >>> >>> The code is calling PCApply_BJacobi_Singleblock. You can test it >>> yourself. >>> >>> Regards, >>> >>> Chang >>> >>> On 10/20/21 1:14 PM, Barry Smith wrote: >>> > >>> > >>> >> On Oct 20, 2021, at 12:48 PM, Chang Liu >> > wrote: >>> >> >>> >> Hi Pierre, >>> >> >>> >> I have another suggestion for telescope. I have achieved my >>> goal by putting telescope outside bjacobi. But the code still does >>> not work if I use telescope as a pc for subblock. I think the >>> reason is that I want to use cusparse as the solver, which can >>> only deal with seqaij matrix and not mpiaij matrix. >>> > >>> > >>> > This is suppose to work with the recent fixes. The >>> telescope should produce a seq matrix and for each solve map the >>> parallel vector (over the subdomain) automatically down to the one >>> rank with the GPU to solve it on the GPU. It is not clear to me >>> where the process is going wrong. >>> > >>> > Barry >>> > >>> > >>> > >>> >> However, for telescope pc, it can put the matrix into one mpi >>> rank, thus making it a seqaij for factorization stage, but then >>> after factorization it will give the data back to the original >>> comminicator. This will make the matrix back to mpiaij, and then >>> cusparse cannot solve it. >>> >> >>> >> I think a better option is to do the factorization on CPU with >>> mpiaij, then then transform the preconditioner matrix to seqaij >>> and do the matsolve GPU. But I am not sure if it can be achieved >>> using telescope. >>> >> >>> >> Regads, >>> >> >>> >> Chang >>> >> >>> >> On 10/15/21 5:29 AM, Pierre Jolivet wrote: >>> >>> Hi Chang, >>> >>> The output you sent with MUMPS looks alright to me, you can >>> see that the MatType is properly set to seqaijcusparse (and not >>> mpiaijcusparse). >>> >>> I don?t know what is wrong with >>> -sub_telescope_pc_factor_mat_solver_type cusparse, I don?t have a >>> PETSc installation for testing this, hopefully Barry or Junchao >>> can confirm this wrong behavior and get this fixed. >>> >>> As for permuting PCTELESCOPE and PCBJACOBI, in your case, the >>> outer PC will be equivalent, yes. >>> >>> However, it would be more efficient to do PCBJACOBI and then >>> PCTELESCOPE. >>> >>> PCBJACOBI prunes the operator by basically removing all >>> coefficients outside of the diagonal blocks. >>> >>> Then, PCTELESCOPE "groups everything together?. 
>>> >>> If you do it the other way around, PCTELESCOPE will "group everything together" and then PCBJACOBI will prune the operator.
>>> >>> So the PCTELESCOPE SetUp will be costly for nothing, since some coefficients will be thrown out afterwards in the PCBJACOBI SetUp.
>>> >>> I hope I'm clear enough; otherwise I can try to draw some pictures.
>>> >>> Thanks,
>>> >>> Pierre
>>> >>>> On 15 Oct 2021, at 4:39 AM, Chang Liu wrote:
>>> >>>>
>>> >>>> Hi Pierre and Barry,
>>> >>>>
>>> >>>> I think maybe I should use telescope outside bjacobi, like this:
>>> >>>>
>>> >>>> mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type telescope -pc_telescope_reduction_factor 4 -telescope_pc_type bjacobi -telescope_ksp_type fgmres -telescope_pc_bjacobi_blocks 4 -mat_type aijcusparse -telescope_sub_ksp_type preonly -telescope_sub_pc_type lu -telescope_sub_pc_factor_mat_solver_type cusparse -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9
>>> >>>>
>>> >>>> But then I got an error:
>>> >>>>
>>> >>>> [0]PETSC ERROR: MatSolverType cusparse does not support matrix type seqaij
>>> >>>>
>>> >>>> But the mat type should be aijcusparse. I think telescope changes the mat type.
>>> >>>>
>>> >>>> Chang
>>> >>>>
>>> >>>> On 10/14/21 10:11 PM, Chang Liu wrote:
>>> >>>>> For comparison, here is the output using mumps instead of cusparse
>>> >>>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type mumps -sub_pc_telescope_reduction_factor 4 -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 -ksp_rtol 1.e-20 -ksp_atol 1.e-9
>>> >>>>>   0 KSP unpreconditioned resid norm 4.014971979977e+01 true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00
>>> >>>>>   1 KSP unpreconditioned resid norm 2.439995191694e+00 true resid norm 2.439995191694e+00 ||r(i)||/||b|| 6.077240896978e-02
>>> >>>>>   2 KSP unpreconditioned resid norm 1.280694102588e+00 true resid norm 1.280694102588e+00 ||r(i)||/||b|| 3.189795866509e-02
>>> >>>>>   3 KSP unpreconditioned resid norm 1.041100266810e+00 true resid norm 1.041100266810e+00 ||r(i)||/||b|| 2.593044912896e-02
>>> >>>>>   4 KSP unpreconditioned resid norm 7.274347137268e-01 true resid norm 7.274347137268e-01 ||r(i)||/||b|| 1.811805206499e-02
>>> >>>>>   5 KSP unpreconditioned resid norm 5.429229329787e-01 true resid norm 5.429229329787e-01 ||r(i)||/||b|| 1.352245882876e-02
>>> >>>>>   6 KSP unpreconditioned resid norm 4.332970410353e-01 true resid norm 4.332970410353e-01 ||r(i)||/||b|| 1.079203150598e-02
>>> >>>>>   7 KSP unpreconditioned resid norm 3.948206050950e-01 true resid norm 3.948206050950e-01 ||r(i)||/||b|| 9.833707609019e-03
>>> >>>>>   8 KSP unpreconditioned resid norm 3.379580577269e-01 true resid norm 3.379580577269e-01 ||r(i)||/||b|| 8.417444988714e-03
>>> >>>>>   9 KSP unpreconditioned resid norm 2.875593971410e-01 true resid norm 2.875593971410e-01 ||r(i)||/||b|| 7.162176936105e-03
>>> >>>>>  10 KSP unpreconditioned resid norm 2.533983363244e-01 true resid norm 2.533983363244e-01 ||r(i)||/||b|| 6.311335112378e-03
>>> >>>>>  11 KSP unpreconditioned resid norm 2.389169921094e-01 true resid norm 2.389169921094e-01
||r(i)||/||b|| 5.950651543793e-03 >>> >>>>> 12 KSP unpreconditioned resid norm 2.118961639089e-01 true >>> resid norm 2.118961639089e-01 ||r(i)||/||b|| 5.277649880637e-03 >>> >>>>> 13 KSP unpreconditioned resid norm 1.885892030223e-01 true >>> resid norm 1.885892030223e-01 ||r(i)||/||b|| 4.697148671593e-03 >>> >>>>> 14 KSP unpreconditioned resid norm 1.763510666948e-01 true >>> resid norm 1.763510666948e-01 ||r(i)||/||b|| 4.392336175055e-03 >>> >>>>> 15 KSP unpreconditioned resid norm 1.638219366731e-01 true >>> resid norm 1.638219366731e-01 ||r(i)||/||b|| 4.080275964317e-03 >>> >>>>> 16 KSP unpreconditioned resid norm 1.476792766432e-01 true >>> resid norm 1.476792766432e-01 ||r(i)||/||b|| 3.678214378076e-03 >>> >>>>> 17 KSP unpreconditioned resid norm 1.349906937321e-01 true >>> resid norm 1.349906937321e-01 ||r(i)||/||b|| 3.362182710248e-03 >>> >>>>> 18 KSP unpreconditioned resid norm 1.289673236836e-01 true >>> resid norm 1.289673236836e-01 ||r(i)||/||b|| 3.212159993314e-03 >>> >>>>> 19 KSP unpreconditioned resid norm 1.167505658153e-01 true >>> resid norm 1.167505658153e-01 ||r(i)||/||b|| 2.907879965230e-03 >>> >>>>> 20 KSP unpreconditioned resid norm 1.046037988999e-01 true >>> resid norm 1.046037988999e-01 ||r(i)||/||b|| 2.605343185995e-03 >>> >>>>> 21 KSP unpreconditioned resid norm 9.832660514331e-02 true >>> resid norm 9.832660514331e-02 ||r(i)||/||b|| 2.448998539309e-03 >>> >>>>> 22 KSP unpreconditioned resid norm 8.835618950141e-02 true >>> resid norm 8.835618950142e-02 ||r(i)||/||b|| 2.200667649539e-03 >>> >>>>> 23 KSP unpreconditioned resid norm 7.563496650115e-02 true >>> resid norm 7.563496650116e-02 ||r(i)||/||b|| 1.883823022386e-03 >>> >>>>> 24 KSP unpreconditioned resid norm 6.651291376834e-02 true >>> resid norm 6.651291376834e-02 ||r(i)||/||b|| 1.656622115921e-03 >>> >>>>> 25 KSP unpreconditioned resid norm 5.890393227906e-02 true >>> resid norm 5.890393227906e-02 ||r(i)||/||b|| 1.467106933070e-03 >>> >>>>> 26 KSP unpreconditioned resid norm 4.661992782780e-02 true >>> resid norm 4.661992782780e-02 ||r(i)||/||b|| 1.161152009536e-03 >>> >>>>> 27 KSP unpreconditioned resid norm 3.690705358716e-02 true >>> resid norm 3.690705358716e-02 ||r(i)||/||b|| 9.192356452602e-04 >>> >>>>> 28 KSP unpreconditioned resid norm 3.209680460188e-02 true >>> resid norm 3.209680460188e-02 ||r(i)||/||b|| 7.994278605666e-04 >>> >>>>> 29 KSP unpreconditioned resid norm 2.354337626000e-02 true >>> resid norm 2.354337626001e-02 ||r(i)||/||b|| 5.863895533373e-04 >>> >>>>> 30 KSP unpreconditioned resid norm 1.701296561785e-02 true >>> resid norm 1.701296561785e-02 ||r(i)||/||b|| 4.237380908932e-04 >>> >>>>> 31 KSP unpreconditioned resid norm 1.509942937258e-02 true >>> resid norm 1.509942937258e-02 ||r(i)||/||b|| 3.760780759588e-04 >>> >>>>> 32 KSP unpreconditioned resid norm 1.258274688515e-02 true >>> resid norm 1.258274688515e-02 ||r(i)||/||b|| 3.133956338402e-04 >>> >>>>> 33 KSP unpreconditioned resid norm 9.805748771638e-03 true >>> resid norm 9.805748771638e-03 ||r(i)||/||b|| 2.442295692359e-04 >>> >>>>> 34 KSP unpreconditioned resid norm 8.596552678160e-03 true >>> resid norm 8.596552678160e-03 ||r(i)||/||b|| 2.141123953301e-04 >>> >>>>> 35 KSP unpreconditioned resid norm 6.936406707500e-03 true >>> resid norm 6.936406707500e-03 ||r(i)||/||b|| 1.727635147167e-04 >>> >>>>> 36 KSP unpreconditioned resid norm 5.533741607932e-03 true >>> resid norm 5.533741607932e-03 ||r(i)||/||b|| 1.378276519869e-04 >>> >>>>> 37 KSP unpreconditioned resid norm 4.982347757923e-03 true >>> resid norm 
4.982347757923e-03 ||r(i)||/||b|| 1.240942099414e-04 >>> >>>>> 38 KSP unpreconditioned resid norm 4.309608348059e-03 true >>> resid norm 4.309608348059e-03 ||r(i)||/||b|| 1.073384414524e-04 >>> >>>>> 39 KSP unpreconditioned resid norm 3.729408303186e-03 true >>> resid norm 3.729408303185e-03 ||r(i)||/||b|| 9.288753001974e-05 >>> >>>>> 40 KSP unpreconditioned resid norm 3.490003351128e-03 true >>> resid norm 3.490003351128e-03 ||r(i)||/||b|| 8.692472496776e-05 >>> >>>>> 41 KSP unpreconditioned resid norm 3.069012426454e-03 true >>> resid norm 3.069012426453e-03 ||r(i)||/||b|| 7.643919912166e-05 >>> >>>>> 42 KSP unpreconditioned resid norm 2.772928845284e-03 true >>> resid norm 2.772928845284e-03 ||r(i)||/||b|| 6.906471225983e-05 >>> >>>>> 43 KSP unpreconditioned resid norm 2.561454192399e-03 true >>> resid norm 2.561454192398e-03 ||r(i)||/||b|| 6.379756085902e-05 >>> >>>>> 44 KSP unpreconditioned resid norm 2.253662762802e-03 true >>> resid norm 2.253662762802e-03 ||r(i)||/||b|| 5.613146926159e-05 >>> >>>>> 45 KSP unpreconditioned resid norm 2.086800523919e-03 true >>> resid norm 2.086800523919e-03 ||r(i)||/||b|| 5.197546917701e-05 >>> >>>>> 46 KSP unpreconditioned resid norm 1.926028182896e-03 true >>> resid norm 1.926028182896e-03 ||r(i)||/||b|| 4.797114880257e-05 >>> >>>>> 47 KSP unpreconditioned resid norm 1.769243808622e-03 true >>> resid norm 1.769243808622e-03 ||r(i)||/||b|| 4.406615581492e-05 >>> >>>>> 48 KSP unpreconditioned resid norm 1.656654905964e-03 true >>> resid norm 1.656654905964e-03 ||r(i)||/||b|| 4.126192945371e-05 >>> >>>>> 49 KSP unpreconditioned resid norm 1.572052627273e-03 true >>> resid norm 1.572052627273e-03 ||r(i)||/||b|| 3.915475961260e-05 >>> >>>>> 50 KSP unpreconditioned resid norm 1.454960682355e-03 true >>> resid norm 1.454960682355e-03 ||r(i)||/||b|| 3.623837699518e-05 >>> >>>>> 51 KSP unpreconditioned resid norm 1.375985053014e-03 true >>> resid norm 1.375985053014e-03 ||r(i)||/||b|| 3.427134883820e-05 >>> >>>>> 52 KSP unpreconditioned resid norm 1.269325501087e-03 true >>> resid norm 1.269325501087e-03 ||r(i)||/||b|| 3.161480347603e-05 >>> >>>>> 53 KSP unpreconditioned resid norm 1.184791772965e-03 true >>> resid norm 1.184791772965e-03 ||r(i)||/||b|| 2.950934100844e-05 >>> >>>>> 54 KSP unpreconditioned resid norm 1.064535156080e-03 true >>> resid norm 1.064535156080e-03 ||r(i)||/||b|| 2.651413662135e-05 >>> >>>>> 55 KSP unpreconditioned resid norm 9.639036688120e-04 true >>> resid norm 9.639036688117e-04 ||r(i)||/||b|| 2.400773090370e-05 >>> >>>>> 56 KSP unpreconditioned resid norm 8.632359780260e-04 true >>> resid norm 8.632359780260e-04 ||r(i)||/||b|| 2.150042347322e-05 >>> >>>>> 57 KSP unpreconditioned resid norm 7.613605783850e-04 true >>> resid norm 7.613605783850e-04 ||r(i)||/||b|| 1.896303591113e-05 >>> >>>>> 58 KSP unpreconditioned resid norm 6.681073248348e-04 true >>> resid norm 6.681073248349e-04 ||r(i)||/||b|| 1.664039819373e-05 >>> >>>>> 59 KSP unpreconditioned resid norm 5.656127908544e-04 true >>> resid norm 5.656127908545e-04 ||r(i)||/||b|| 1.408758999254e-05 >>> >>>>> 60 KSP unpreconditioned resid norm 4.850863370767e-04 true >>> resid norm 4.850863370767e-04 ||r(i)||/||b|| 1.208193580169e-05 >>> >>>>> 61 KSP unpreconditioned resid norm 4.374055762320e-04 true >>> resid norm 4.374055762316e-04 ||r(i)||/||b|| 1.089436186387e-05 >>> >>>>> 62 KSP unpreconditioned resid norm 3.874398257079e-04 true >>> resid norm 3.874398257077e-04 ||r(i)||/||b|| 9.649876204364e-06 >>> >>>>> 63 KSP unpreconditioned resid norm 3.364908694427e-04 true >>> 
resid norm 3.364908694429e-04 ||r(i)||/||b|| 8.380902061609e-06 >>> >>>>> 64 KSP unpreconditioned resid norm 2.961034697265e-04 true >>> resid norm 2.961034697268e-04 ||r(i)||/||b|| 7.374982221632e-06 >>> >>>>> 65 KSP unpreconditioned resid norm 2.640593092764e-04 true >>> resid norm 2.640593092767e-04 ||r(i)||/||b|| 6.576865557059e-06 >>> >>>>> 66 KSP unpreconditioned resid norm 2.423231125743e-04 true >>> resid norm 2.423231125745e-04 ||r(i)||/||b|| 6.035487016671e-06 >>> >>>>> 67 KSP unpreconditioned resid norm 2.182349471179e-04 true >>> resid norm 2.182349471179e-04 ||r(i)||/||b|| 5.435528521898e-06 >>> >>>>> 68 KSP unpreconditioned resid norm 2.008438265031e-04 true >>> resid norm 2.008438265028e-04 ||r(i)||/||b|| 5.002371809927e-06 >>> >>>>> 69 KSP unpreconditioned resid norm 1.838732863386e-04 true >>> resid norm 1.838732863388e-04 ||r(i)||/||b|| 4.579690400226e-06 >>> >>>>> 70 KSP unpreconditioned resid norm 1.723786027645e-04 true >>> resid norm 1.723786027645e-04 ||r(i)||/||b|| 4.293394913444e-06 >>> >>>>> 71 KSP unpreconditioned resid norm 1.580945192204e-04 true >>> resid norm 1.580945192205e-04 ||r(i)||/||b|| 3.937624471826e-06 >>> >>>>> 72 KSP unpreconditioned resid norm 1.476687469671e-04 true >>> resid norm 1.476687469671e-04 ||r(i)||/||b|| 3.677952117812e-06 >>> >>>>> 73 KSP unpreconditioned resid norm 1.385018526182e-04 true >>> resid norm 1.385018526184e-04 ||r(i)||/||b|| 3.449634351350e-06 >>> >>>>> 74 KSP unpreconditioned resid norm 1.279712893541e-04 true >>> resid norm 1.279712893541e-04 ||r(i)||/||b|| 3.187351991305e-06 >>> >>>>> 75 KSP unpreconditioned resid norm 1.202010411772e-04 true >>> resid norm 1.202010411774e-04 ||r(i)||/||b|| 2.993820175504e-06 >>> >>>>> 76 KSP unpreconditioned resid norm 1.113459414198e-04 true >>> resid norm 1.113459414200e-04 ||r(i)||/||b|| 2.773268206485e-06 >>> >>>>> 77 KSP unpreconditioned resid norm 1.042523036036e-04 true >>> resid norm 1.042523036037e-04 ||r(i)||/||b|| 2.596588572066e-06 >>> >>>>> 78 KSP unpreconditioned resid norm 9.565176453232e-05 true >>> resid norm 9.565176453227e-05 ||r(i)||/||b|| 2.382376888539e-06 >>> >>>>> 79 KSP unpreconditioned resid norm 8.896901670359e-05 true >>> resid norm 8.896901670365e-05 ||r(i)||/||b|| 2.215931198209e-06 >>> >>>>> 80 KSP unpreconditioned resid norm 8.119298425803e-05 true >>> resid norm 8.119298425824e-05 ||r(i)||/||b|| 2.022255314935e-06 >>> >>>>> 81 KSP unpreconditioned resid norm 7.544528309154e-05 true >>> resid norm 7.544528309154e-05 ||r(i)||/||b|| 1.879098620558e-06 >>> >>>>> 82 KSP unpreconditioned resid norm 6.755385041138e-05 true >>> resid norm 6.755385041176e-05 ||r(i)||/||b|| 1.682548489719e-06 >>> >>>>> 83 KSP unpreconditioned resid norm 6.158629300870e-05 true >>> resid norm 6.158629300835e-05 ||r(i)||/||b|| 1.533915885727e-06 >>> >>>>> 84 KSP unpreconditioned resid norm 5.358756885754e-05 true >>> resid norm 5.358756885765e-05 ||r(i)||/||b|| 1.334693470462e-06 >>> >>>>> 85 KSP unpreconditioned resid norm 4.774852370380e-05 true >>> resid norm 4.774852370387e-05 ||r(i)||/||b|| 1.189261692037e-06 >>> >>>>> 86 KSP unpreconditioned resid norm 3.919358737908e-05 true >>> resid norm 3.919358737930e-05 ||r(i)||/||b|| 9.761858258229e-07 >>> >>>>> 87 KSP unpreconditioned resid norm 3.434042319950e-05 true >>> resid norm 3.434042319947e-05 ||r(i)||/||b|| 8.553091620745e-07 >>> >>>>> 88 KSP unpreconditioned resid norm 2.813699436281e-05 true >>> resid norm 2.813699436302e-05 ||r(i)||/||b|| 7.008017615898e-07 >>> >>>>> 89 KSP unpreconditioned resid norm 2.462248069068e-05 
true >>> resid norm 2.462248069051e-05 ||r(i)||/||b|| 6.132665635851e-07 >>> >>>>> 90 KSP unpreconditioned resid norm 2.040558789626e-05 true >>> resid norm 2.040558789626e-05 ||r(i)||/||b|| 5.082373674841e-07 >>> >>>>> 91 KSP unpreconditioned resid norm 1.888523204468e-05 true >>> resid norm 1.888523204470e-05 ||r(i)||/||b|| 4.703702077842e-07 >>> >>>>> 92 KSP unpreconditioned resid norm 1.707071292484e-05 true >>> resid norm 1.707071292474e-05 ||r(i)||/||b|| 4.251763900191e-07 >>> >>>>> 93 KSP unpreconditioned resid norm 1.498636454665e-05 true >>> resid norm 1.498636454672e-05 ||r(i)||/||b|| 3.732619958859e-07 >>> >>>>> 94 KSP unpreconditioned resid norm 1.219393542993e-05 true >>> resid norm 1.219393543006e-05 ||r(i)||/||b|| 3.037115947725e-07 >>> >>>>> 95 KSP unpreconditioned resid norm 1.059996963300e-05 true >>> resid norm 1.059996963303e-05 ||r(i)||/||b|| 2.640110487917e-07 >>> >>>>> 96 KSP unpreconditioned resid norm 9.099659872548e-06 true >>> resid norm 9.099659873214e-06 ||r(i)||/||b|| 2.266431725699e-07 >>> >>>>> 97 KSP unpreconditioned resid norm 8.147347587295e-06 true >>> resid norm 8.147347587584e-06 ||r(i)||/||b|| 2.029241456283e-07 >>> >>>>> 98 KSP unpreconditioned resid norm 7.167226146744e-06 true >>> resid norm 7.167226146783e-06 ||r(i)||/||b|| 1.785124823418e-07 >>> >>>>> 99 KSP unpreconditioned resid norm 6.552540209538e-06 true >>> resid norm 6.552540209577e-06 ||r(i)||/||b|| 1.632026385802e-07 >>> >>>>> 100 KSP unpreconditioned resid norm 5.767783600111e-06 true >>> resid norm 5.767783600320e-06 ||r(i)||/||b|| 1.436568830140e-07 >>> >>>>> 101 KSP unpreconditioned resid norm 5.261057430584e-06 true >>> resid norm 5.261057431144e-06 ||r(i)||/||b|| 1.310359688033e-07 >>> >>>>> 102 KSP unpreconditioned resid norm 4.715498525786e-06 true >>> resid norm 4.715498525947e-06 ||r(i)||/||b|| 1.174478564100e-07 >>> >>>>> 103 KSP unpreconditioned resid norm 4.380052669622e-06 true >>> resid norm 4.380052669825e-06 ||r(i)||/||b|| 1.090929822591e-07 >>> >>>>> 104 KSP unpreconditioned resid norm 3.911664470060e-06 true >>> resid norm 3.911664470226e-06 ||r(i)||/||b|| 9.742694319496e-08 >>> >>>>> 105 KSP unpreconditioned resid norm 3.652211458315e-06 true >>> resid norm 3.652211458259e-06 ||r(i)||/||b|| 9.096480564430e-08 >>> >>>>> 106 KSP unpreconditioned resid norm 3.387532128049e-06 true >>> resid norm 3.387532128358e-06 ||r(i)||/||b|| 8.437249737363e-08 >>> >>>>> 107 KSP unpreconditioned resid norm 3.234218880987e-06 true >>> resid norm 3.234218880798e-06 ||r(i)||/||b|| 8.055395895481e-08 >>> >>>>> 108 KSP unpreconditioned resid norm 3.016905196388e-06 true >>> resid norm 3.016905196492e-06 ||r(i)||/||b|| 7.514137611763e-08 >>> >>>>> 109 KSP unpreconditioned resid norm 2.858246441921e-06 true >>> resid norm 2.858246441975e-06 ||r(i)||/||b|| 7.118969836476e-08 >>> >>>>> 110 KSP unpreconditioned resid norm 2.637118810847e-06 true >>> resid norm 2.637118810750e-06 ||r(i)||/||b|| 6.568212241336e-08 >>> >>>>> 111 KSP unpreconditioned resid norm 2.494976088717e-06 true >>> resid norm 2.494976088700e-06 ||r(i)||/||b|| 6.214180574966e-08 >>> >>>>> 112 KSP unpreconditioned resid norm 2.270639574272e-06 true >>> resid norm 2.270639574200e-06 ||r(i)||/||b|| 5.655430686750e-08 >>> >>>>> 113 KSP unpreconditioned resid norm 2.104988663813e-06 true >>> resid norm 2.104988664169e-06 ||r(i)||/||b|| 5.242847707696e-08 >>> >>>>> 114 KSP unpreconditioned resid norm 1.889361127301e-06 true >>> resid norm 1.889361127526e-06 ||r(i)||/||b|| 4.705789073868e-08 >>> >>>>> 115 KSP unpreconditioned resid 
norm 1.732367008052e-06 true >>> resid norm 1.732367007971e-06 ||r(i)||/||b|| 4.314767367271e-08 >>> >>>>> 116 KSP unpreconditioned resid norm 1.509288268391e-06 true >>> resid norm 1.509288268645e-06 ||r(i)||/||b|| 3.759150191264e-08 >>> >>>>> 117 KSP unpreconditioned resid norm 1.359169217644e-06 true >>> resid norm 1.359169217445e-06 ||r(i)||/||b|| 3.385252062089e-08 >>> >>>>> 118 KSP unpreconditioned resid norm 1.180146337735e-06 true >>> resid norm 1.180146337908e-06 ||r(i)||/||b|| 2.939363820703e-08 >>> >>>>> 119 KSP unpreconditioned resid norm 1.067757039683e-06 true >>> resid norm 1.067757039924e-06 ||r(i)||/||b|| 2.659438335433e-08 >>> >>>>> 120 KSP unpreconditioned resid norm 9.435833073736e-07 true >>> resid norm 9.435833073736e-07 ||r(i)||/||b|| 2.350161625235e-08 >>> >>>>> 121 KSP unpreconditioned resid norm 8.749457237613e-07 true >>> resid norm 8.749457236791e-07 ||r(i)||/||b|| 2.179207546261e-08 >>> >>>>> 122 KSP unpreconditioned resid norm 7.945760150897e-07 true >>> resid norm 7.945760150444e-07 ||r(i)||/||b|| 1.979032528762e-08 >>> >>>>> 123 KSP unpreconditioned resid norm 7.141240839013e-07 true >>> resid norm 7.141240838682e-07 ||r(i)||/||b|| 1.778652721438e-08 >>> >>>>> 124 KSP unpreconditioned resid norm 6.300566936733e-07 true >>> resid norm 6.300566936607e-07 ||r(i)||/||b|| 1.569267971988e-08 >>> >>>>> 125 KSP unpreconditioned resid norm 5.628986997544e-07 true >>> resid norm 5.628986995849e-07 ||r(i)||/||b|| 1.401999073448e-08 >>> >>>>> 126 KSP unpreconditioned resid norm 5.119018951602e-07 true >>> resid norm 5.119018951837e-07 ||r(i)||/||b|| 1.274982484900e-08 >>> >>>>> 127 KSP unpreconditioned resid norm 4.664670343748e-07 true >>> resid norm 4.664670344042e-07 ||r(i)||/||b|| 1.161818903670e-08 >>> >>>>> 128 KSP unpreconditioned resid norm 4.253264691112e-07 true >>> resid norm 4.253264691948e-07 ||r(i)||/||b|| 1.059351027394e-08 >>> >>>>> 129 KSP unpreconditioned resid norm 3.868921150516e-07 true >>> resid norm 3.868921150517e-07 ||r(i)||/||b|| 9.636234498800e-09 >>> >>>>> 130 KSP unpreconditioned resid norm 3.558445658540e-07 true >>> resid norm 3.558445660061e-07 ||r(i)||/||b|| 8.862940209315e-09 >>> >>>>> 131 KSP unpreconditioned resid norm 3.268710273840e-07 true >>> resid norm 3.268710272455e-07 ||r(i)||/||b|| 8.141302825416e-09 >>> >>>>> 132 KSP unpreconditioned resid norm 3.041273897592e-07 true >>> resid norm 3.041273896694e-07 ||r(i)||/||b|| 7.574832182794e-09 >>> >>>>> 133 KSP unpreconditioned resid norm 2.851926677922e-07 true >>> resid norm 2.851926674248e-07 ||r(i)||/||b|| 7.103229333782e-09 >>> >>>>> 134 KSP unpreconditioned resid norm 2.694708315072e-07 true >>> resid norm 2.694708309500e-07 ||r(i)||/||b|| 6.711649104748e-09 >>> >>>>> 135 KSP unpreconditioned resid norm 2.534825559099e-07 true >>> resid norm 2.534825557469e-07 ||r(i)||/||b|| 6.313432746507e-09 >>> >>>>> 136 KSP unpreconditioned resid norm 2.387342352458e-07 true >>> resid norm 2.387342351804e-07 ||r(i)||/||b|| 5.946099658254e-09 >>> >>>>> 137 KSP unpreconditioned resid norm 2.200861667617e-07 true >>> resid norm 2.200861665255e-07 ||r(i)||/||b|| 5.481636425438e-09 >>> >>>>> 138 KSP unpreconditioned resid norm 2.051415370616e-07 true >>> resid norm 2.051415370614e-07 ||r(i)||/||b|| 5.109413915824e-09 >>> >>>>> 139 KSP unpreconditioned resid norm 1.887376429396e-07 true >>> resid norm 1.887376426682e-07 ||r(i)||/||b|| 4.700845824315e-09 >>> >>>>> 140 KSP unpreconditioned resid norm 1.729743133005e-07 true >>> resid norm 1.729743128342e-07 ||r(i)||/||b|| 4.308232129561e-09 >>> 
>>>>> 141 KSP unpreconditioned resid norm 1.541021130781e-07 true >>> resid norm 1.541021128364e-07 ||r(i)||/||b|| 3.838186508023e-09 >>> >>>>> 142 KSP unpreconditioned resid norm 1.384631628565e-07 true >>> resid norm 1.384631627735e-07 ||r(i)||/||b|| 3.448670712125e-09 >>> >>>>> 143 KSP unpreconditioned resid norm 1.223114405626e-07 true >>> resid norm 1.223114403883e-07 ||r(i)||/||b|| 3.046383411846e-09 >>> >>>>> 144 KSP unpreconditioned resid norm 1.087313066223e-07 true >>> resid norm 1.087313065117e-07 ||r(i)||/||b|| 2.708146085550e-09 >>> >>>>> 145 KSP unpreconditioned resid norm 9.181901998734e-08 true >>> resid norm 9.181901984268e-08 ||r(i)||/||b|| 2.286915582489e-09 >>> >>>>> 146 KSP unpreconditioned resid norm 7.885850510808e-08 true >>> resid norm 7.885850531446e-08 ||r(i)||/||b|| 1.964110975313e-09 >>> >>>>> 147 KSP unpreconditioned resid norm 6.483393946950e-08 true >>> resid norm 6.483393931383e-08 ||r(i)||/||b|| 1.614804278515e-09 >>> >>>>> 148 KSP unpreconditioned resid norm 5.690132597004e-08 true >>> resid norm 5.690132577518e-08 ||r(i)||/||b|| 1.417228465328e-09 >>> >>>>> 149 KSP unpreconditioned resid norm 5.023671521579e-08 true >>> resid norm 5.023671502186e-08 ||r(i)||/||b|| 1.251234511035e-09 >>> >>>>> 150 KSP unpreconditioned resid norm 4.625371062660e-08 true >>> resid norm 4.625371062660e-08 ||r(i)||/||b|| 1.152030720445e-09 >>> >>>>> 151 KSP unpreconditioned resid norm 4.349049084805e-08 true >>> resid norm 4.349049089337e-08 ||r(i)||/||b|| 1.083207830846e-09 >>> >>>>> 152 KSP unpreconditioned resid norm 3.932593324498e-08 true >>> resid norm 3.932593376918e-08 ||r(i)||/||b|| 9.794821474546e-10 >>> >>>>> 153 KSP unpreconditioned resid norm 3.504167649202e-08 true >>> resid norm 3.504167638113e-08 ||r(i)||/||b|| 8.727751166356e-10 >>> >>>>> 154 KSP unpreconditioned resid norm 2.892726347747e-08 true >>> resid norm 2.892726348583e-08 ||r(i)||/||b|| 7.204848160858e-10 >>> >>>>> 155 KSP unpreconditioned resid norm 2.477647033202e-08 true >>> resid norm 2.477647041570e-08 ||r(i)||/||b|| 6.171019508795e-10 >>> >>>>> 156 KSP unpreconditioned resid norm 2.128504065757e-08 true >>> resid norm 2.128504067423e-08 ||r(i)||/||b|| 5.301416991298e-10 >>> >>>>> 157 KSP unpreconditioned resid norm 1.879248809429e-08 true >>> resid norm 1.879248818928e-08 ||r(i)||/||b|| 4.680602575310e-10 >>> >>>>> 158 KSP unpreconditioned resid norm 1.673649140073e-08 true >>> resid norm 1.673649134005e-08 ||r(i)||/||b|| 4.168520085200e-10 >>> >>>>> 159 KSP unpreconditioned resid norm 1.497123388109e-08 true >>> resid norm 1.497123365569e-08 ||r(i)||/||b|| 3.728851342016e-10 >>> >>>>> 160 KSP unpreconditioned resid norm 1.315982130162e-08 true >>> resid norm 1.315982149329e-08 ||r(i)||/||b|| 3.277687007261e-10 >>> >>>>> 161 KSP unpreconditioned resid norm 1.182395864938e-08 true >>> resid norm 1.182395868430e-08 ||r(i)||/||b|| 2.944966675550e-10 >>> >>>>> 162 KSP unpreconditioned resid norm 1.070204481679e-08 true >>> resid norm 1.070204466432e-08 ||r(i)||/||b|| 2.665534085342e-10 >>> >>>>> 163 KSP unpreconditioned resid norm 9.969290307649e-09 true >>> resid norm 9.969290432333e-09 ||r(i)||/||b|| 2.483028644297e-10 >>> >>>>> 164 KSP unpreconditioned resid norm 9.134440883306e-09 true >>> resid norm 9.134440980976e-09 ||r(i)||/||b|| 2.275094577628e-10 >>> >>>>> 165 KSP unpreconditioned resid norm 8.593316427292e-09 true >>> resid norm 8.593316413360e-09 ||r(i)||/||b|| 2.140317904139e-10 >>> >>>>> 166 KSP unpreconditioned resid norm 8.042173048464e-09 true >>> resid norm 8.042173332848e-09 
||r(i)||/||b|| 2.003045942277e-10 >>> >>>>> 167 KSP unpreconditioned resid norm 7.655518522782e-09 true >>> resid norm 7.655518879144e-09 ||r(i)||/||b|| 1.906742791064e-10 >>> >>>>> 168 KSP unpreconditioned resid norm 7.210283391815e-09 true >>> resid norm 7.210283220312e-09 ||r(i)||/||b|| 1.795848951442e-10 >>> >>>>> 169 KSP unpreconditioned resid norm 6.793967416271e-09 true >>> resid norm 6.793967448832e-09 ||r(i)||/||b|| 1.692158122825e-10 >>> >>>>> 170 KSP unpreconditioned resid norm 6.249160304588e-09 true >>> resid norm 6.249160382647e-09 ||r(i)||/||b|| 1.556464257736e-10 >>> >>>>> 171 KSP unpreconditioned resid norm 5.794936438798e-09 true >>> resid norm 5.794936332552e-09 ||r(i)||/||b|| 1.443331699811e-10 >>> >>>>> 172 KSP unpreconditioned resid norm 5.222337397128e-09 true >>> resid norm 5.222337443277e-09 ||r(i)||/||b|| 1.300715788135e-10 >>> >>>>> 173 KSP unpreconditioned resid norm 4.755359110447e-09 true >>> resid norm 4.755358888996e-09 ||r(i)||/||b|| 1.184406494668e-10 >>> >>>>> 174 KSP unpreconditioned resid norm 4.317537007873e-09 true >>> resid norm 4.317537267718e-09 ||r(i)||/||b|| 1.075359252630e-10 >>> >>>>> 175 KSP unpreconditioned resid norm 3.924177535665e-09 true >>> resid norm 3.924177629720e-09 ||r(i)||/||b|| 9.773860563138e-11 >>> >>>>> 176 KSP unpreconditioned resid norm 3.502843065115e-09 true >>> resid norm 3.502843126359e-09 ||r(i)||/||b|| 8.724452234855e-11 >>> >>>>> 177 KSP unpreconditioned resid norm 3.083873232869e-09 true >>> resid norm 3.083873352938e-09 ||r(i)||/||b|| 7.680933686007e-11 >>> >>>>> 178 KSP unpreconditioned resid norm 2.758980676473e-09 true >>> resid norm 2.758980618096e-09 ||r(i)||/||b|| 6.871730691658e-11 >>> >>>>> 179 KSP unpreconditioned resid norm 2.510978240429e-09 true >>> resid norm 2.510978327392e-09 ||r(i)||/||b|| 6.254036989334e-11 >>> >>>>> 180 KSP unpreconditioned resid norm 2.323000193205e-09 true >>> resid norm 2.323000193205e-09 ||r(i)||/||b|| 5.785844097519e-11 >>> >>>>> 181 KSP unpreconditioned resid norm 2.167480159274e-09 true >>> resid norm 2.167480113693e-09 ||r(i)||/||b|| 5.398493749153e-11 >>> >>>>> 182 KSP unpreconditioned resid norm 1.983545827983e-09 true >>> resid norm 1.983546404840e-09 ||r(i)||/||b|| 4.940374216139e-11 >>> >>>>> 183 KSP unpreconditioned resid norm 1.794576286774e-09 true >>> resid norm 1.794576224361e-09 ||r(i)||/||b|| 4.469710457036e-11 >>> >>>>> 184 KSP unpreconditioned resid norm 1.583490590644e-09 true >>> resid norm 1.583490380603e-09 ||r(i)||/||b|| 3.943963715064e-11 >>> >>>>> 185 KSP unpreconditioned resid norm 1.412659866247e-09 true >>> resid norm 1.412659832191e-09 ||r(i)||/||b|| 3.518479927722e-11 >>> >>>>> 186 KSP unpreconditioned resid norm 1.285613344939e-09 true >>> resid norm 1.285612984761e-09 ||r(i)||/||b|| 3.202047215205e-11 >>> >>>>> 187 KSP unpreconditioned resid norm 1.168115133929e-09 true >>> resid norm 1.168114766904e-09 ||r(i)||/||b|| 2.909397058634e-11 >>> >>>>> 188 KSP unpreconditioned resid norm 1.063377926053e-09 true >>> resid norm 1.063377647554e-09 ||r(i)||/||b|| 2.648530681802e-11 >>> >>>>> 189 KSP unpreconditioned resid norm 9.548967728122e-10 true >>> resid norm 9.548964523410e-10 ||r(i)||/||b|| 2.378339019807e-11 >>> >>>>> KSP Object: 16 MPI processes >>> >>>>> type: fgmres >>> >>>>> restart=30, using Classical (unmodified) Gram-Schmidt >>> Orthogonalization with no iterative refinement >>> >>>>> happy breakdown tolerance 1e-30 >>> >>>>> maximum iterations=2000, initial guess is zero >>> >>>>> tolerances: relative=1e-20, absolute=1e-09, >>> 
divergence=10000. >>> >>>>> right preconditioning >>> >>>>> using UNPRECONDITIONED norm type for convergence test >>> >>>>> PC Object: 16 MPI processes >>> >>>>> type: bjacobi >>> >>>>> number of blocks = 4 >>> >>>>> Local solver information for first block is in the >>> following KSP and PC objects on rank 0: >>> >>>>> Use -ksp_view ::ascii_info_detail to display >>> information for all blocks >>> >>>>> KSP Object: (sub_) 4 MPI processes >>> >>>>> type: preonly >>> >>>>> maximum iterations=10000, initial guess is zero >>> >>>>> tolerances: relative=1e-05, absolute=1e-50, >>> divergence=10000. >>> >>>>> left preconditioning >>> >>>>> using NONE norm type for convergence test >>> >>>>> PC Object: (sub_) 4 MPI processes >>> >>>>> type: telescope >>> >>>>> petsc subcomm: parent comm size reduction factor = 4 >>> >>>>> petsc subcomm: parent_size = 4 , subcomm_size = 1 >>> >>>>> petsc subcomm type = contiguous >>> >>>>> linear system matrix = precond matrix: >>> >>>>> Mat Object: (sub_) 4 MPI processes >>> >>>>> type: mpiaij >>> >>>>> rows=40200, cols=40200 >>> >>>>> total: nonzeros=199996, allocated nonzeros=203412 >>> >>>>> total number of mallocs used during MatSetValues calls=0 >>> >>>>> not using I-node (on process 0) routines >>> >>>>> setup type: default >>> >>>>> Parent DM object: NULL >>> >>>>> Sub DM object: NULL >>> >>>>> KSP Object: (sub_telescope_) 1 MPI processes >>> >>>>> type: preonly >>> >>>>> maximum iterations=10000, initial guess is zero >>> >>>>> tolerances: relative=1e-05, absolute=1e-50, >>> divergence=10000. >>> >>>>> left preconditioning >>> >>>>> using NONE norm type for convergence test >>> >>>>> PC Object: (sub_telescope_) 1 MPI processes >>> >>>>> type: lu >>> >>>>> out-of-place factorization >>> >>>>> tolerance for zero pivot 2.22045e-14 >>> >>>>> matrix ordering: external >>> >>>>> factor fill ratio given 0., needed 0. 
>>> >>>>> Factored matrix follows: >>> >>>>> Mat Object: 1 MPI processes >>> >>>>> type: mumps >>> >>>>> rows=40200, cols=40200 >>> >>>>> package used to perform factorization: mumps >>> >>>>> total: nonzeros=1849788, allocated >>> nonzeros=1849788 >>> >>>>> MUMPS run parameters: >>> >>>>> SYM (matrix type): 0 >>> >>>>> PAR (host participation): 1 >>> >>>>> ICNTL(1) (output for error): 6 >>> >>>>> ICNTL(2) (output of diagnostic msg): 0 >>> >>>>> ICNTL(3) (output for global info): 0 >>> >>>>> ICNTL(4) (level of printing): 0 >>> >>>>> ICNTL(5) (input mat struct): 0 >>> >>>>> ICNTL(6) (matrix prescaling): 7 >>> >>>>> ICNTL(7) (sequential matrix ordering):7 >>> >>>>> ICNTL(8) (scaling strategy): 77 >>> >>>>> ICNTL(10) (max num of refinements): 0 >>> >>>>> ICNTL(11) (error analysis): 0 >>> >>>>> ICNTL(12) (efficiency control): 1 >>> >>>>> ICNTL(13) (sequential factorization >>> of the root node): 0 >>> >>>>> ICNTL(14) (percentage of estimated >>> workspace increase): 20 >>> >>>>> ICNTL(18) (input mat struct): 0 >>> >>>>> ICNTL(19) (Schur complement info): 0 >>> >>>>> ICNTL(20) (RHS sparse pattern): 0 >>> >>>>> ICNTL(21) (solution struct): 0 >>> >>>>> ICNTL(22) (in-core/out-of-core >>> facility): 0 >>> >>>>> ICNTL(23) (max size of memory can be >>> allocated locally):0 >>> >>>>> ICNTL(24) (detection of null pivot >>> rows): 0 >>> >>>>> ICNTL(25) (computation of a null >>> space basis): 0 >>> >>>>> ICNTL(26) (Schur options for RHS or >>> solution): 0 >>> >>>>> ICNTL(27) (blocking size for multiple >>> RHS): -32 >>> >>>>> ICNTL(28) (use parallel or sequential >>> ordering): 1 >>> >>>>> ICNTL(29) (parallel ordering): 0 >>> >>>>> ICNTL(30) (user-specified set of >>> entries in inv(A)): 0 >>> >>>>> ICNTL(31) (factors is discarded in >>> the solve phase): 0 >>> >>>>> ICNTL(33) (compute determinant): 0 >>> >>>>> ICNTL(35) (activate BLR based >>> factorization): 0 >>> >>>>> ICNTL(36) (choice of BLR >>> factorization variant): 0 >>> >>>>> ICNTL(38) (estimated compression rate >>> of LU factors): 333 >>> >>>>> CNTL(1) (relative pivoting >>> threshold): 0.01 >>> >>>>> CNTL(2) (stopping criterion of >>> refinement): 1.49012e-08 >>> >>>>> CNTL(3) (absolute pivoting >>> threshold): 0. >>> >>>>> CNTL(4) (value of static pivoting): -1. >>> >>>>> CNTL(5) (fixation for null pivots): 0. >>> >>>>> CNTL(7) (dropping parameter for >>> BLR): 0. 
>>> >>>>> RINFO(1) (local estimated flops for >>> the elimination after analysis): >>> >>>>> [0] 1.45525e+08 >>> >>>>> RINFO(2) (local estimated flops for >>> the assembly after factorization): >>> >>>>> [0] 2.89397e+06 >>> >>>>> RINFO(3) (local estimated flops for >>> the elimination after factorization): >>> >>>>> [0] 1.45525e+08 >>> >>>>> INFO(15) (estimated size of (in MB) >>> MUMPS internal data for running numerical factorization): >>> >>>>> [0] 29 >>> >>>>> INFO(16) (size of (in MB) MUMPS >>> internal data used during numerical factorization): >>> >>>>> [0] 29 >>> >>>>> INFO(23) (num of pivots eliminated on >>> this processor after factorization): >>> >>>>> [0] 40200 >>> >>>>> RINFOG(1) (global estimated flops for >>> the elimination after analysis): 1.45525e+08 >>> >>>>> RINFOG(2) (global estimated flops for >>> the assembly after factorization): 2.89397e+06 >>> >>>>> RINFOG(3) (global estimated flops for >>> the elimination after factorization): 1.45525e+08 >>> >>>>> (RINFOG(12) RINFOG(13))*2^INFOG(34) >>> (determinant): (0.,0.)*(2^0) >>> >>>>> INFOG(3) (estimated real workspace >>> for factors on all processors after analysis): 1849788 >>> >>>>> INFOG(4) (estimated integer workspace >>> for factors on all processors after analysis): 879986 >>> >>>>> INFOG(5) (estimated maximum front >>> size in the complete tree): 282 >>> >>>>> INFOG(6) (number of nodes in the >>> complete tree): 23709 >>> >>>>> INFOG(7) (ordering option effectively >>> used after analysis): 5 >>> >>>>> INFOG(8) (structural symmetry in >>> percent of the permuted matrix after analysis): 100 >>> >>>>> INFOG(9) (total real/complex >>> workspace to store the matrix factors after factorization): 1849788 >>> >>>>> INFOG(10) (total integer space store >>> the matrix factors after factorization): 879986 >>> >>>>> INFOG(11) (order of largest frontal >>> matrix after factorization): 282 >>> >>>>> INFOG(12) (number of off-diagonal >>> pivots): 0 >>> >>>>> INFOG(13) (number of delayed pivots >>> after factorization): 0 >>> >>>>> INFOG(14) (number of memory compress >>> after factorization): 0 >>> >>>>> INFOG(15) (number of steps of >>> iterative refinement after solution): 0 >>> >>>>> INFOG(16) (estimated size (in MB) of >>> all MUMPS internal data for factorization after analysis: value on >>> the most memory consuming processor): 29 >>> >>>>> INFOG(17) (estimated size of all >>> MUMPS internal data for factorization after analysis: sum over all >>> processors): 29 >>> >>>>> INFOG(18) (size of all MUMPS internal >>> data allocated during factorization: value on the most memory >>> consuming processor): 29 >>> >>>>> INFOG(19) (size of all MUMPS internal >>> data allocated during factorization: sum over all processors): 29 >>> >>>>> INFOG(20) (estimated number of >>> entries in the factors): 1849788 >>> >>>>> INFOG(21) (size in MB of memory >>> effectively used during factorization - value on the most memory >>> consuming processor): 26 >>> >>>>> INFOG(22) (size in MB of memory >>> effectively used during factorization - sum over all processors): 26 >>> >>>>> INFOG(23) (after analysis: value of >>> ICNTL(6) effectively used): 0 >>> >>>>> INFOG(24) (after analysis: value of >>> ICNTL(12) effectively used): 1 >>> >>>>> INFOG(25) (after factorization: >>> number of pivots modified by static pivoting): 0 >>> >>>>> INFOG(28) (after factorization: >>> number of null pivots encountered): 0 >>> >>>>> INFOG(29) (after factorization: >>> effective number of entries in the factors (sum over all >>> processors)): 1849788 >>> >>>>> 
INFOG(30, 31) (after solution: size >>> in Mbytes of memory used during solution phase): 29, 29 >>> >>>>> INFOG(32) (after analysis: type of >>> analysis done): 1 >>> >>>>> INFOG(33) (value used for ICNTL(8)): 7 >>> >>>>> INFOG(34) (exponent of the >>> determinant if determinant is requested): 0 >>> >>>>> INFOG(35) (after factorization: >>> number of entries taking into account BLR factor compression - sum >>> over all processors): 1849788 >>> >>>>> INFOG(36) (after analysis: estimated >>> size of all MUMPS internal data for running BLR in-core - value on >>> the most memory consuming processor): 0 >>> >>>>> INFOG(37) (after analysis: estimated >>> size of all MUMPS internal data for running BLR in-core - sum over >>> all processors): 0 >>> >>>>> INFOG(38) (after analysis: estimated >>> size of all MUMPS internal data for running BLR out-of-core - >>> value on the most memory consuming processor): 0 >>> >>>>> INFOG(39) (after analysis: estimated >>> size of all MUMPS internal data for running BLR out-of-core - sum >>> over all processors): 0 >>> >>>>> linear system matrix = precond matrix: >>> >>>>> Mat Object: 1 MPI processes >>> >>>>> type: seqaijcusparse >>> >>>>> rows=40200, cols=40200 >>> >>>>> total: nonzeros=199996, allocated nonzeros=199996 >>> >>>>> total number of mallocs used during >>> MatSetValues calls=0 >>> >>>>> not using I-node routines >>> >>>>> linear system matrix = precond matrix: >>> >>>>> Mat Object: 16 MPI processes >>> >>>>> type: mpiaijcusparse >>> >>>>> rows=160800, cols=160800 >>> >>>>> total: nonzeros=802396, allocated nonzeros=1608000 >>> >>>>> total number of mallocs used during MatSetValues calls=0 >>> >>>>> not using I-node (on process 0) routines >>> >>>>> Norm of error 9.11684e-07 iterations 189 >>> >>>>> Chang >>> >>>>> On 10/14/21 10:10 PM, Chang Liu wrote: >>> >>>>>> Hi Barry, >>> >>>>>> >>> >>>>>> No problem. Here is the output. It seems that the resid >>> norm calculation is incorrect. >>> >>>>>> >>> >>>>>> $ mpiexec -n 16 --hostfile hostfile --oversubscribe ./ex7 >>> -m 400 -ksp_view -ksp_monitor_true_residual -pc_type bjacobi >>> -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse >>> -sub_pc_type telescope -sub_ksp_type preonly >>> -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu >>> -sub_telescope_pc_factor_mat_solver_type cusparse >>> -sub_pc_telescope_reduction_factor 4 >>> -sub_pc_telescope_subcomm_type contiguous -ksp_max_it 2000 >>> -ksp_rtol 1.e-20 -ksp_atol 1.e-9 >>> >>>>>> 0 KSP unpreconditioned resid norm 4.014971979977e+01 >>> true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>> >>>>>> 1 KSP unpreconditioned resid norm 0.000000000000e+00 >>> true resid norm 4.014971979977e+01 ||r(i)||/||b|| 1.000000000000e+00 >>> >>>>>> KSP Object: 16 MPI processes >>> >>>>>> type: fgmres >>> >>>>>> restart=30, using Classical (unmodified) Gram-Schmidt >>> Orthogonalization with no iterative refinement >>> >>>>>> happy breakdown tolerance 1e-30 >>> >>>>>> maximum iterations=2000, initial guess is zero >>> >>>>>> tolerances: relative=1e-20, absolute=1e-09, >>> divergence=10000. 
>>> >>>>>> right preconditioning >>> >>>>>> using UNPRECONDITIONED norm type for convergence test >>> >>>>>> PC Object: 16 MPI processes >>> >>>>>> type: bjacobi >>> >>>>>> number of blocks = 4 >>> >>>>>> Local solver information for first block is in the >>> following KSP and PC objects on rank 0: >>> >>>>>> Use -ksp_view ::ascii_info_detail to display >>> information for all blocks >>> >>>>>> KSP Object: (sub_) 4 MPI processes >>> >>>>>> type: preonly >>> >>>>>> maximum iterations=10000, initial guess is zero >>> >>>>>> tolerances: relative=1e-05, absolute=1e-50, >>> divergence=10000. >>> >>>>>> left preconditioning >>> >>>>>> using NONE norm type for convergence test >>> >>>>>> PC Object: (sub_) 4 MPI processes >>> >>>>>> type: telescope >>> >>>>>> petsc subcomm: parent comm size reduction factor = 4 >>> >>>>>> petsc subcomm: parent_size = 4 , subcomm_size = 1 >>> >>>>>> petsc subcomm type = contiguous >>> >>>>>> linear system matrix = precond matrix: >>> >>>>>> Mat Object: (sub_) 4 MPI processes >>> >>>>>> type: mpiaij >>> >>>>>> rows=40200, cols=40200 >>> >>>>>> total: nonzeros=199996, allocated nonzeros=203412 >>> >>>>>> total number of mallocs used during MatSetValues >>> calls=0 >>> >>>>>> not using I-node (on process 0) routines >>> >>>>>> setup type: default >>> >>>>>> Parent DM object: NULL >>> >>>>>> Sub DM object: NULL >>> >>>>>> KSP Object: (sub_telescope_) 1 MPI processes >>> >>>>>> type: preonly >>> >>>>>> maximum iterations=10000, initial guess is zero >>> >>>>>> tolerances: relative=1e-05, absolute=1e-50, >>> divergence=10000. >>> >>>>>> left preconditioning >>> >>>>>> using NONE norm type for convergence test >>> >>>>>> PC Object: (sub_telescope_) 1 MPI processes >>> >>>>>> type: lu >>> >>>>>> out-of-place factorization >>> >>>>>> tolerance for zero pivot 2.22045e-14 >>> >>>>>> matrix ordering: nd >>> >>>>>> factor fill ratio given 5., needed 8.62558 >>> >>>>>> Factored matrix follows: >>> >>>>>> Mat Object: 1 MPI processes >>> >>>>>> type: seqaijcusparse >>> >>>>>> rows=40200, cols=40200 >>> >>>>>> package used to perform factorization: >>> cusparse >>> >>>>>> total: nonzeros=1725082, allocated >>> nonzeros=1725082 >>> >>>>>> not using I-node routines >>> >>>>>> linear system matrix = precond matrix: >>> >>>>>> Mat Object: 1 MPI processes >>> >>>>>> type: seqaijcusparse >>> >>>>>> rows=40200, cols=40200 >>> >>>>>> total: nonzeros=199996, allocated nonzeros=199996 >>> >>>>>> total number of mallocs used during >>> MatSetValues calls=0 >>> >>>>>> not using I-node routines >>> >>>>>> linear system matrix = precond matrix: >>> >>>>>> Mat Object: 16 MPI processes >>> >>>>>> type: mpiaijcusparse >>> >>>>>> rows=160800, cols=160800 >>> >>>>>> total: nonzeros=802396, allocated nonzeros=1608000 >>> >>>>>> total number of mallocs used during MatSetValues calls=0 >>> >>>>>> not using I-node (on process 0) routines >>> >>>>>> Norm of error 400.999 iterations 1 >>> >>>>>> >>> >>>>>> Chang >>> >>>>>> >>> >>>>>> >>> >>>>>> On 10/14/21 9:47 PM, Barry Smith wrote: >>> >>>>>>> >>> >>>>>>> Chang, >>> >>>>>>> >>> >>>>>>> Sorry I did not notice that one. Please run that with >>> -ksp_view -ksp_monitor_true_residual so we can see exactly how >>> options are interpreted and solver used. At a glance it looks ok >>> but something must be wrong to get the wrong answer. 
>>> >>>>>>>
>>> >>>>>>>    Barry
>>> >>>>>>>
>>> >>>>>>>> On Oct 14, 2021, at 6:02 PM, Chang Liu wrote:
>>> >>>>>>>>
>>> >>>>>>>> Hi Barry,
>>> >>>>>>>>
>>> >>>>>>>> That is exactly what I was doing in the second example, in which the preconditioner works but the GMRES does not.
>>> >>>>>>>>
>>> >>>>>>>> Chang
>>> >>>>>>>>
>>> >>>>>>>> On 10/14/21 5:15 PM, Barry Smith wrote:
>>> >>>>>>>>> You need to use the PCTELESCOPE inside the block Jacobi, not outside it. So something like -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu
>>> >>>>>>>>>> On Oct 14, 2021, at 4:14 PM, Chang Liu wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> Hi Pierre,
>>> >>>>>>>>>>
>>> >>>>>>>>>> I wonder if the trick of PCTELESCOPE only works for the preconditioner and not for the solver. I have done some tests, and found that for solving a small matrix using -telescope_ksp_type preonly, it does work for GPU with multiple MPI processes. However, for bjacobi and gmres, it does not work.
>>> >>>>>>>>>>
>>> >>>>>>>>>> The command line options I used for the small matrix are like
>>> >>>>>>>>>>
>>> >>>>>>>>>> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4
>>> >>>>>>>>>>
>>> >>>>>>>>>> which gives the correct output. For the iterative solver, I tried
>>> >>>>>>>>>>
>>> >>>>>>>>>> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20
>>> >>>>>>>>>>
>>> >>>>>>>>>> for the large matrix. The output is like
>>> >>>>>>>>>>
>>> >>>>>>>>>>   0 KSP Residual norm 40.1497
>>> >>>>>>>>>>   1 KSP Residual norm < 1.e-11
>>> >>>>>>>>>> Norm of error 400.999 iterations 1
>>> >>>>>>>>>>
>>> >>>>>>>>>> So it seems to call a direct solver instead of an iterative one.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Can you please help check these options?
>>> >>>>>>>>>>
>>> >>>>>>>>>> Chang
>>> >>>>>>>>>>
>>> >>>>>>>>>> On 10/14/21 10:04 AM, Pierre Jolivet wrote:
>>> >>>>>>>>>>>> On 14 Oct 2021, at 3:50 PM, Chang Liu wrote:
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds like exactly what I need. I wonder if PCTELESCOPE can transform an mpiaijcusparse matrix to seqaijcusparse? Or do I have to do it manually?
>>> >>>>>>>>>>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat().
>>> >>>>>>>>>>> 1) I'm not sure this is implemented for cuSparse matrices, but it should be;
>>> >>>>>>>>>>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually "smart" enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning.
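A minimal sketch of the behavior described just above, assuming the standard MatCreateMPIMatConcatenateSeqMat() and MatGetType() calls; the names GatherExample, subcomm and Aseq are placeholders and not from the thread:

  #include <petscmat.h>

  /* Concatenate each rank's sequential block onto "subcomm". When subcomm
     has a single rank, the constructor hands back a MatSeqX (seqaij, or
     seqaijcusparse if that path is supported), which is the form a
     cuSPARSE LU can be applied to. Aseq stands for the local block. */
  PetscErrorCode GatherExample(MPI_Comm subcomm, Mat Aseq)
  {
    Mat            Agathered;
    MatType        mtype;
    PetscMPIInt    size;
    PetscErrorCode ierr;

    ierr = MPI_Comm_size(subcomm, &size);CHKERRMPI(ierr);
    ierr = MatCreateMPIMatConcatenateSeqMat(subcomm, Aseq, PETSC_DECIDE,
                                            MAT_INITIAL_MATRIX, &Agathered);CHKERRQ(ierr);
    ierr = MatGetType(Agathered, &mtype);CHKERRQ(ierr);  /* MatSeqX when size == 1 */
    ierr = PetscPrintf(subcomm, "gathered Mat type on %d rank(s): %s\n", size, mtype);CHKERRQ(ierr);
    ierr = MatDestroy(&Agathered);CHKERRQ(ierr);
    return 0;
  }
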
>>> >>>>>>>>>>> If you try this out and this does not work, please provide the backtrace (probably something like "Operation XYZ not implemented for MatType ABC"), and hopefully someone can add the missing plumbing.
>>> >>>>>>>>>>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve.
>>> >>>>>>>>>>> Thanks,
>>> >>>>>>>>>>> Pierre
>>> >>>>>>>>>>>> Chang
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote:
>>> >>>>>>>>>>>>> Maybe I'm missing something, but can't you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block?
>>> >>>>>>>>>>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu
>>> >>>>>>>>>>>>> This does not work with MUMPS -mat_mumps_use_omp_threads, because not only does the Mat need to be redistributed, the secondary processes also need to be "converted" to OpenMP threads.
>>> >>>>>>>>>>>>> Thus the need for specific code in mumps.c.
>>> >>>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>>> Pierre
>>> >>>>>>>>>>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users wrote:
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Hi Junchao,
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Yes that is what I want.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Chang
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote:
>>> >>>>>>>>>>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith wrote:
>>> >>>>>>>>>>>>>>>     Junchao,
>>> >>>>>>>>>>>>>>>        If I understand correctly, Chang is using the block Jacobi method with a single block for a number of MPI ranks and a direct solver for each block, so it uses PCSetUp_BJacobi_Multiproc(), which is code Hong Zhang wrote a number of years ago for CPUs. For their particular problems this preconditioner works well, but using an iterative solver on the blocks does not work well.
>>> >>>>>>>>>>>>>>>        If we had complete MPI-GPU direct solvers he could just use the current code with MPIAIJCUSPARSE on each block, but since we do not, he would like to use a single GPU for each block. This means that the diagonal blocks of the global parallel MPI matrix need to be sent to a subset of the GPUs (one GPU per block, which has multiple MPI ranks associated with the block). Similarly, for the triangular solves the blocks of the right-hand side need to be shipped to the appropriate GPU and the resulting solution shipped back to the multiple GPUs. So Chang is absolutely correct, this is somewhat like your code for MUMPS with OpenMP.
>>> >>>>>>>>>>>>>>> OK, I now understand the background.
>>> >>>>>>>>>>>>>>>     One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the MPI ranks and then shrink each block down to a single GPU, but this would be pretty inefficient; ideally one would go directly from the big MPI matrix on all the GPUs to the sub matrices on the subset of GPUs. But this may be a large coding project.
>>> >>>>>>>>>>>>>>> I don't understand these sentences. Why do you say "shrink"?
>>> >>>>>>>>>>>>>>> In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep the blocks' size; no shrinking or expanding.
>>> >>>>>>>>>>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on the CPU, and the solve done on the GPU. I assume Chang wants to gain from the (potentially) faster solve (instead of factorization) on the GPU.
>>> >>>>>>>>>>>>>>>     Barry
>>> >>>>>>>>>>>>>>>     Since the matrices being factored and solved directly are relatively large, it is possible that the cusparse code could be reasonably efficient (they are not the tiny problems one gets at the coarse level of multigrid). Of course, this is speculation; I don't actually know how much better the cusparse code would be on the direct solver than a good CPU direct sparse solver.
>>> >>>>>>>>>>>>>>> > On Oct 13, 2021, at 9:32 PM, Chang Liu wrote:
>>> >>>>>>>>>>>>>>> >
>>> >>>>>>>>>>>>>>> > Sorry I am not familiar with the details either. Can you please check the code in MatMumpsGatherNonzerosOnMaster in mumps.c?
>>> >>>>>>>>>>>>>>> >
>>> >>>>>>>>>>>>>>> > Chang
>>> >>>>>>>>>>>>>>> >
>>> >>>>>>>>>>>>>>> > On 10/13/21 9:24 PM, Junchao Zhang wrote:
>>> >>>>>>>>>>>>>>> >> Hi Chang,
>>> >>>>>>>>>>>>>>> >>   I did the work in mumps. It is easy for me to understand gathering matrix rows to one process.
>>> >>>>>>>>>>>>>>> >>   But how to gather blocks (submatrices) to form a large block? Can you draw a picture of that?
>>> >>>>>>>>>>>>>>> >>   Thanks
>>> >>>>>>>>>>>>>>> >> --Junchao Zhang
>>> >>>>>>>>>>>>>>> >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users wrote:
>>> >>>>>>>>>>>>>>> >>     Hi Barry,
>>> >>>>>>>>>>>>>>> >>     I think the mumps solver in petsc does support that. You can check the documentation on "-mat_mumps_use_omp_threads" at
>>> >>>>>>>>>>>>>>> >>     https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html
>>> >>>>>>>>>>>>>>> >>     and the code enclosed by #if defined(PETSC_HAVE_OPENMP_SUPPORT) in the functions MatMumpsSetUpDistRHSInfo and MatMumpsGatherNonzerosOnMaster in mumps.c
>>> >>>>>>>>>>>>>>> >>     1. I understand it is ideal to do one MPI rank per GPU. However, I am working on an existing code that was developed based on MPI, and the # of MPI ranks is typically equal to the # of CPU cores. We don't want to change the whole structure of the code.
>>> >>>>>>>>>>>>>>> >>     2. What you have suggested has been coded in mumps.c. See function MatMumpsSetUpDistRHSInfo.
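For reference, a hedged sketch of how the MUMPS option cited above is typically driven (assuming a PETSc build configured --with-openmp; the executable name and problem size are placeholders, not taken from the thread):

  mpiexec -n 16 ./ex7 -m 400 -ksp_type fgmres \
    -pc_type lu -pc_factor_mat_solver_type mumps \
    -mat_mumps_use_omp_threads 4

With 16 ranks and a thread count of 4, the matrix and right-hand side are gathered onto a subset of "master" ranks, each running MUMPS with 4 OpenMP threads; appropriate CPU binding options for the MPI launcher are usually needed for this to pay off. This is the CPU analogue of the one-GPU-per-block gathering being requested here for the cusparse solver.
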
Regards,
Chang

On 10/13/21 7:53 PM, Barry Smith wrote:

> On Oct 13, 2021, at 3:50 PM, Chang Liu wrote:
>
> Hi Barry,
>
> That is exactly what I want.
>
> Back to my original question, I am looking for an approach to transfer
> matrix data from many MPI processes to "master" MPI processes, each of
> which taking care of one GPU, and then upload the data to GPU to solve.
> One can just grab some codes from mumps.c to aijcusparse.cu.

   mumps.c doesn't actually do that. It never needs to copy the entire
matrix to a single MPI rank.

   It would be possible to write such a code that you suggest but it is
not clear that it makes sense.

   1) For normal PETSc GPU usage there is one GPU per MPI rank, so while
your one GPU per big domain is solving its systems the other GPUs (with
the other MPI ranks that share that domain) are doing nothing.

   2) For each triangular solve you would have to gather the right hand
side from the multiple ranks to the single GPU to pass it to the GPU
solver and then scatter the resulting solution back to all of its
subdomain ranks.

   What I was suggesting was assign an entire subdomain to a single MPI
rank, thus it does everything on one GPU and can use the GPU solver
directly. If all the major computations of a subdomain can fit and be
done on a single GPU then you would be utilizing all the GPUs you are
using effectively.

   Barry

On 10/13/21 1:53 PM, Barry Smith wrote:

   Chang,

   You are correct there is no MPI + GPU direct solvers that currently
do the triangular solves with MPI + GPU parallelism that I am aware of.
You are limited that individual triangular solves be done on a single
GPU. I can only suggest making each subdomain as big as possible to
utilize each GPU as much as possible for the direct triangular solves.

   Barry

> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users wrote:
>
> Hi Mark,
>
> '-mat_type aijcusparse' works with mpiaijcusparse with other solvers,
> but with -pc_factor_mat_solver_type cusparse, it will give an error.
>
> Yes what I want is to have mumps or superlu to do the factorization,
> and then do the rest, including GMRES solver, on gpu. Is that possible?
>
> I have tried to use aijcusparse with superlu_dist, it runs but the
> iterative solver is still running on CPUs. I have contacted the superlu
> group and they confirmed that is the case right now. But if I set
> -pc_factor_mat_solver_type cusparse, it seems that the iterative solver
> is running on GPU.
>
> Chang

On 10/13/21 12:03 PM, Mark Adams wrote:

On Wed, Oct 13, 2021 at 11:10 AM Chang Liu wrote:

> Thank you Junchao for explaining this. I guess in my case the code is
> just calling a seq solver like superlu to do factorization on GPUs.
>
> My idea is that I want to have a traditional MPI code to utilize GPUs
> with cusparse. Right now cusparse does not support mpiaij matrix, so I
> want the code to have a mpiaij matrix when adding all the matrix terms,
> and then transform the matrix to seqaij when doing the factorization
> and solve. This involves sending the data to the master process, and I
> think the petsc mumps solver have something similar already.
>
> Chang

Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse
matrix with > 1 processes. (-mat_type mpiaijcusparse might also work
with >1 proc).

However, I see in grepping the repo that all the mumps and superlu
tests use aij or sell matrix type. MUMPS and SuperLU provide their own
solves, I assume .... but you might want to do other matrix operations
on the GPU. Is that the issue? Did you try -mat_type aijcusparse with
MUMPS and/or SuperLU have a problem? (no test with it so it probably
does not work)

Thanks,
Mark

On 10/13/21 10:18 AM, Junchao Zhang wrote:

On Tue, Oct 12, 2021 at 1:07 PM Mark Adams wrote:

On Tue, Oct 12, 2021 at 1:45 PM Chang Liu wrote:

> Hi Mark,
>
> The option I use is like
>
> -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type
> aijcusparse -sub_pc_factor_mat_solver_type cusparse -sub_ksp_type
> preonly -sub_pc_type lu -ksp_max_it 2000 -ksp_rtol 1.e-300
> -ksp_atol 1.e-300
   Note, If you use -log_view the last column (rows are the method like
MatFactorNumeric) has the percent of work in the GPU.

   Junchao: This implies that we have a cuSparse LU factorization. Is
that correct? (I don't think we do)

No, we don't have cuSparse LU factorization. If you check
MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
MatLUFactorSymbolic_SeqAIJ() instead.
So I don't understand Chang's idea. Do you want to make bigger blocks?

> I think this one do both factorization and solve on gpu.
>
> You can check the runex72_aijcusparse.sh file in petsc install
> directory, and try it your self (this is only lu factorization without
> iterative solve).
>
> Chang

On 10/12/21 1:17 PM, Mark Adams wrote:

On Tue, Oct 12, 2021 at 11:19 AM Chang Liu wrote:

> Hi Junchao,
>
> No I only needs it to be transferred within a node. I use block-Jacobi
> method and GMRES to solve the sparse matrix, so each direct solver will
> take care of a sub-block of the whole matrix. In this way, I can use
> one GPU to solve one sub-block, which is stored within one node.
>
> It was stated in the documentation that cusparse solver is slow.
> However, in my test using ex72.c, the cusparse solver is faster than
> mumps or superlu_dist on CPUs.
>
> Chang

Are we talking about the factorization, the solve, or both?

We do not have an interface to cuSparse's LU factorization (I just
learned that it exists a few weeks ago).
Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type
aijcusparse'? This would be the CPU factorization, which is the
dominant cost.

On 10/12/21 10:24 AM, Junchao Zhang wrote:

   Hi, Chang,
   For the mumps solver, we usually transfers matrix and vector data
within a compute node. For the idea you propose, it looks like we need
to gather data within MPI_COMM_WORLD, right?

   Mark, I remember you said cusparse solve is slow and you would
rather do it on CPU. Is it right?

   --Junchao Zhang

On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users wrote:

> Hi,
>
> Currently, it is possible to use mumps solver in PETSC with
> -mat_mumps_use_omp_threads option, so that multiple MPI processes will
> transfer the matrix and rhs data to the master rank, and then master
> rank will call mumps with OpenMP to solve the matrix.
>
> I wonder if someone can develop similar option for cusparse solver.
> Right now, this solver does not work with mpiaijcusparse. I think a
> possible workaround is to transfer all the matrix data to one MPI
> process, and then upload the data to GPU to solve. In this way, one can
> use cusparse solver for a MPI program.
>
> Chang
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> cliu at pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA

-- 
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA
From junchao.zhang at gmail.com  Wed Oct 20 19:24:53 2021
From: junchao.zhang at gmail.com (Junchao Zhang)
Date: Wed, 20 Oct 2021 19:24:53 -0500
Subject: [petsc-users] [External] Re: MatVec on GPUs
In-Reply-To: <07787336-5a69-d6f8-45ca-b2f4223f9311@pppl.gov>
References: <7621a2fb-0ae6-b98d-54c5-a968affd09c2@pppl.gov>
 <07787336-5a69-d6f8-45ca-b2f4223f9311@pppl.gov>
Message-ID: 

Hi, Chang,
  Do you have the error stack message?  And yes, petsc gpu code is tricky,
since we have to carefully sync data on GPU and CPU.
  Thanks.

On Wed, Oct 20, 2021 at 11:55 AM Chang Liu wrote:

> Hi Junchao,
>
> Thank you for the suggestion. I did some more tests and found that
> MatConvert does not always work. In one of my tests, I did MatConvert to
> convert the matrix to aijcusparse, then did a preonly ksp solver and it
> works well. But then I tried a fgmres solver and it gave an error. It
> only happen when the matrix is mpiaijcusparse and for seqaijcusparse it
> works.
>
> So I tried to create a new aijcusparse matrix and copy the data line by
> line, then both solvers works. So I guess there are some tricky things
> with MatConvert.
>
> Chang
>
> On 10/18/21 9:23 PM, Junchao Zhang wrote:
> > MatSetOptionsPrefix(A,"mymat")
> > VecSetOptionsPrefix(v,"myvec")
> >
> > --Junchao Zhang
> >
> > On Mon, Oct 18, 2021 at 8:04 PM Chang Liu wrote:
> >
> >     Hi Junchao,
> >
> >     Thank you for your answer. I tried MatConvert and it works. I didn't
> >     make it before because I forgot to convert a vector from mpi to
> >     mpicuda previously.
> >
> >     For vector, there is no VecConvert to use, so I have to do
> >     VecDuplicate, VecSetType and VecCopy. Is there an easier option?
> > > > As Matt suggested, you could single out the matrix and vector with > > options prefix and set their type on command line > > > > MatSetOptionsPrefix(A,"mymat"); > > VecSetOptionsPrefix(v,"myvec"); > > > > Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda > > A simpler code is to have the vector type automatically set by > > MatCreateVecs(A,&v,NULL) > > > > > > Chang > > > > On 10/18/21 5:23 PM, Junchao Zhang wrote: > > > > > > > > > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users > > > > > >> > > wrote: > > > > > > Hi Matt, > > > > > > I have a related question. In my code I have many matrices > > and I only > > > want to have one living on GPU, the others still staying on > > CPU mem. > > > > > > I wonder if there is an easier way to copy a mpiaij matrix to > > > mpiaijcusparse (in other words, copy data to GPUs). I can > > think of > > > creating a new mpiaijcusparse matrix, and copying the data > > line by > > > line. > > > But I wonder if there is a better option. > > > > > > I have tried MatCopy and MatConvert but neither work. > > > > > > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)? > > > > > > > > > Chang > > > > > > On 10/17/21 7:50 PM, Matthew Knepley wrote: > > > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh > > > > > > > > > > > > >>> wrote: > > > > > > > > Do I need convert the MATSEQBAIJ to a cuda matrix in > code? > > > > > > > > > > > > You would need a call to MatSetFromOptions() to take that > type > > > from the > > > > command line, and not have > > > > the type hard-coded in your application. It is generally a > bad > > > idea to > > > > hard code the implementation type. > > > > > > > > If I do it from command line, then are the other > > MatVec calls are > > > > ported onto CUDA? I have many MatVec calls in my code, > > but I > > > > specifically want to port just one call. > > > > > > > > > > > > You can give that one matrix an options prefix to isolate > it. > > > > > > > > Thanks, > > > > > > > > Matt > > > > > > > > Sincerely, > > > > Swarnava > > > > > > > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang > > > > > > > > > > > > >>> > > > wrote: > > > > > > > > You can do that with command line options -mat_type > > > aijcusparse > > > > -vec_type cuda > > > > > > > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh > > > > > > > > > > > > >>> wrote: > > > > > > > > Dear Petsc team, > > > > > > > > I had a query regarding using CUDA to > > accelerate a matrix > > > > vector product. > > > > I have a sequential sparse matrix > > (MATSEQBAIJ type). > > > I want > > > > to port a MatVec call onto GPUs. Is there any > > > code/example I > > > > can look at? > > > > > > > > Sincerely, > > > > SG > > > > > > > > > > > > > > > > -- > > > > What most experimenters take for granted before they begin > > their > > > > experiments is infinitely more interesting than any > > results to which > > > > their experiments lead. 
> > > > -- Norbert Wiener > > > > > > > > https://www.cse.buffalo.edu/~knepley/ > > > > > > > > > > > > > > > >> > > > > > > -- > > > Chang Liu > > > Staff Research Physicist > > > +1 609 243 3438 > > > cliu at pppl.gov > > > > > Princeton Plasma Physics Laboratory > > > 100 Stellarator Rd, Princeton NJ 08540, USA > > > > > > > -- > > Chang Liu > > Staff Research Physicist > > +1 609 243 3438 > > cliu at pppl.gov > > Princeton Plasma Physics Laboratory > > 100 Stellarator Rd, Princeton NJ 08540, USA > > > > -- > Chang Liu > Staff Research Physicist > +1 609 243 3438 > cliu at pppl.gov > Princeton Plasma Physics Laboratory > 100 Stellarator Rd, Princeton NJ 08540, USA > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Thu Oct 21 12:04:57 2021 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 21 Oct 2021 13:04:57 -0400 Subject: [petsc-users] HDF5 timestepping in PETSc 3.16 In-Reply-To: References: <18b4e68c-2524-932d-9aa4-c1a28ea44158@auckland.ac.nz> Message-ID: On Tue, Oct 19, 2021 at 6:12 AM Matthew Knepley wrote: > On Mon, Oct 18, 2021 at 10:35 PM Adrian Croucher < > a.croucher at auckland.ac.nz> wrote: > >> Any response on this? >> >> This is a bit of a showstopper for me - I can't upgrade to PETSc 3.16 if >> it does not allow my users to read their HDF5 files created using >> earlier versions of PETSc. >> >> So far I can't see a workaround. Possibly the timestepping functions >> need some kind of optional parameter to specify what the default >> timestepping attribute should be, if it's not present in the file >> (rather than just assuming it's false)? >> > > I will fix it. I think I can do it tomorrow. Class just started this week > do it is hectic :) > > I think you are right. We should always write the attribute, but have it > be false. We should > interpret a missing attribute as an old file. > Okay, I think I have it. Can you look at this branch? https://gitlab.com/petsc/petsc/-/merge_requests/4483 There is now an option that lets you set the default timestepping behavior -viewer_hdf5_default_timestepping I think that is what you want. Thanks, Matt > Thanks, > > Matt > > >> Regards, Adrian >> >> On 10/14/21 4:19 PM, Adrian Croucher wrote: >> > hi >> > >> > I am just testing out PETSc 3.16 and making the necessary changes to >> > my code. Amongst other things I now have to add a >> > PetscViewerHDF5PushTimestepping() call before starting to output >> > time-dependent results to HDF5 using a PetscViewer. >> > >> > I now also have to add this call before reading in sets of previously >> > computed time-dependent results (for restarting a simulation from the >> > results of a previous run). >> > >> > The problem with this is that if I try to read in the results of any >> > previous run, computed with an earlier version of PETSc (< 3.16), an >> > error is raised because the time-dependent datasets in the file do not >> > have the 'timestepping' attribute. >> > >> > Is there something else I need to do to make this work? >> > >> > - Adrian >> > >> -- >> Dr Adrian Croucher >> Senior Research Fellow >> Department of Engineering Science >> University of Auckland, New Zealand >> email: a.croucher at auckland.ac.nz >> tel: +64 (0)9 923 4611 >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. 
> -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From amartin at cimne.upc.edu Fri Oct 22 20:21:56 2021 From: amartin at cimne.upc.edu (=?UTF-8?Q?Alberto_F=2e_Mart=c3=adn?=) Date: Sat, 23 Oct 2021 12:21:56 +1100 Subject: [petsc-users] Why PetscDestroy global collective semantics? Message-ID: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> Dear PETSc users, What is the main reason underlying PetscDestroy subroutines having global collective semantics? Is this actually true for all PETSc objects? Can this be relaxed/deactivated by, e.g., compilation macros/configuration options? Thanks in advance! Best regards, ?Alberto. From bsmith at petsc.dev Fri Oct 22 21:13:11 2021 From: bsmith at petsc.dev (Barry Smith) Date: Fri, 22 Oct 2021 22:13:11 -0400 Subject: [petsc-users] Why PetscDestroy global collective semantics? In-Reply-To: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> Message-ID: <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> One technical reason is that PetscHeaderDestroy_Private() may call PetscCommDestroy() which may call MPI_Comm_free() which is defined by the standard to be collective. Though PETSc tries to limit its use of new MPI communicators (for example generally many objects shared the same communicator) if we did not free those we no longer need when destroying objects we could run out. I cannot off-hand think of another specific technical reason they must be collective besides this good housekeeping. In what use case can you not call them collectively? Barry > On Oct 22, 2021, at 9:21 PM, Alberto F. Mart?n wrote: > > Dear PETSc users, > > What is the main reason underlying PetscDestroy subroutines having global collective semantics? Is this actually true for all PETSc objects? Can this be relaxed/deactivated by, e.g., compilation macros/configuration options? > > Thanks in advance! > > Best regards, > > Alberto. > From junchao.zhang at gmail.com Fri Oct 22 22:57:40 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Fri, 22 Oct 2021 22:57:40 -0500 Subject: [petsc-users] Why PetscDestroy global collective semantics? In-Reply-To: <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> Message-ID: On Fri, Oct 22, 2021 at 9:13 PM Barry Smith wrote: > > One technical reason is that PetscHeaderDestroy_Private() may call > PetscCommDestroy() which may call MPI_Comm_free() which is defined by the > standard to be collective. Though PETSc tries to limit its use of new MPI > communicators (for example generally many objects shared the same > communicator) if we did not free those we no longer need when destroying > objects we could run out. > PetscCommDestroy() might call MPI_Comm_free() , but it is very unlikely. Petsc uses reference counting on communicators, so in PetscCommDestroy(), it likely just decreases the count. In other words, PetscCommDestroy() is cheap and in effect not collective. > > I cannot off-hand think of another specific technical reason they must > be collective besides this good housekeeping. > > In what use case can you not call them collectively? 
> > Barry > > > > On Oct 22, 2021, at 9:21 PM, Alberto F. Mart?n > wrote: > > > > Dear PETSc users, > > > > What is the main reason underlying PetscDestroy subroutines having > global collective semantics? Is this actually true for all PETSc objects? > Can this be relaxed/deactivated by, e.g., compilation macros/configuration > options? > > > > Thanks in advance! > > > > Best regards, > > > > Alberto. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jed at jedbrown.org Fri Oct 22 23:33:48 2021 From: jed at jedbrown.org (Jed Brown) Date: Fri, 22 Oct 2021 22:33:48 -0600 Subject: [petsc-users] Why PetscDestroy global collective semantics? In-Reply-To: References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> Message-ID: <87zgr0fjgz.fsf@jedbrown.org> Junchao Zhang writes: > On Fri, Oct 22, 2021 at 9:13 PM Barry Smith wrote: > >> >> One technical reason is that PetscHeaderDestroy_Private() may call >> PetscCommDestroy() which may call MPI_Comm_free() which is defined by the >> standard to be collective. Though PETSc tries to limit its use of new MPI >> communicators (for example generally many objects shared the same >> communicator) if we did not free those we no longer need when destroying >> objects we could run out. >> > PetscCommDestroy() might call MPI_Comm_free() , but it is very unlikely. > Petsc uses reference counting on communicators, so in PetscCommDestroy(), > it likely just decreases the count. In other words, PetscCommDestroy() is > cheap and in effect not collective. Unless it's the last reference to a given communicator, which is a risky/difficult thing for a user to guarantee and the consequences are potentially dire (deadlock being way worse than a crash) when the user's intent is to relax ordering for destruction. Alberto, what is the use case in which deterministic destruction is problematic? If you relax it for individual objects, is there a place you can be collective to collect any stale communicators? From zhugp01 at nus.edu.sg Sat Oct 23 03:46:03 2021 From: zhugp01 at nus.edu.sg (Guangpu Zhu) Date: Sat, 23 Oct 2021 08:46:03 +0000 Subject: [petsc-users] Questions on Petsc4py with PyCUDA Message-ID: Dear Sir/Madam, I am using the Petsc4py with PyCUDA. According to the following link https://www.mcs.anl.gov/petsc/petsc4py-current/docs/apiref/petsc4py.PETSc.Vec.Type-class.html I set the vector type as 'cuda', the simple code is as follows: import sys import petsc4py from petsc4py import PETSc petsc4py.init(sys.argv) from pycuda import autoinit import pycuda.driver as drv import pycuda.compiler as compiler import pycuda.gpuarray as gpuarray a = PETSc.Vec().create() a.setType('cuda') a.setSizes(8) But when I run this code, it always shows that "Unknown vector type: cuda". I have tried: (a) petsc4py 3.15.0 with PyCUDA 2020.1 (b) petsc4py 3.15.1 with PyCUDA 2021.1 (c) petsc4py 3.16.0 with PyCUDA 2021.1 but it always shows the same message: Unknown vector type: cuda The CUDA version on my computer is CUDA 11.3. So I am writing this e-mail to ask for your help and advice. Thank you in advance. Best, Guangpu Zhu --- Guangpu Zhu Research Associate, Department of Mechanical Engineering National University of Singapore Personal E-mail: zhugpupc at gmail.com Phone: (+65) 87581879 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stefano.zampini at gmail.com Sat Oct 23 04:33:46 2021 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Sat, 23 Oct 2021 12:33:46 +0300 Subject: [petsc-users] Questions on Petsc4py with PyCUDA In-Reply-To: References: Message-ID: Use v.setType('veccuda')? Or v.setType(PETSc.Vec.Type.VECCUDA) Il Sab 23 Ott 2021, 11:46 Guangpu Zhu ha scritto: > Dear Sir/Madam, > > I am using the Petsc4py with PyCUDA. According to the following > link > > > https://www.mcs.anl.gov/petsc/petsc4py-current/docs/apiref/petsc4py.PETSc.Vec.Type-class.html > > I set the vector type as 'cuda', the simple code is as follows: > > import sys > import petsc4py > from petsc4py import PETSc > petsc4py.init(sys.argv) > from pycuda import autoinit > import pycuda.driver as drv > import pycuda.compiler as compiler > import pycuda.gpuarray as gpuarray > > a = PETSc.Vec().create() > a.setType('cuda') > a.setSizes(8) > > But when I run this code, it always shows that "Unknown vector type: cuda > ". > > I have tried: > (a) petsc4py 3.15.0 with PyCUDA 2020.1 > (b) petsc4py 3.15.1 with PyCUDA 2021.1 > (c) petsc4py 3.16.0 with PyCUDA 2021.1 > > but it always shows the same message: Unknown vector type: cuda > > The CUDA version on my computer is CUDA 11.3. > > So I am writing this e-mail to ask for your help and advice. Thank you in > advance. > > > Best, > > Guangpu Zhu > > > --- > Guangpu Zhu > > Research Associate, Department of Mechanical Engineering > > National University of Singapore > > Personal E-mail: zhugpupc at gmail.com > > Phone: (+65) 87581879 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jroman at dsic.upv.es Sat Oct 23 05:53:15 2021 From: jroman at dsic.upv.es (Jose E. Roman) Date: Sat, 23 Oct 2021 12:53:15 +0200 Subject: [petsc-users] Questions on Petsc4py with PyCUDA In-Reply-To: References: Message-ID: The correct ways are a.setType('cuda') or a.setType(PETSc.Vec.Type.CUDA) Probably what is happening is that PETSc has not been configured --with-cuda - that's why it complains with "Unknown vector type: cuda" Jose > El 23 oct 2021, a las 11:33, Stefano Zampini escribi?: > > Use v.setType('veccuda')? Or v.setType(PETSc.Vec.Type.VECCUDA) > > Il Sab 23 Ott 2021, 11:46 Guangpu Zhu ha scritto: > Dear Sir/Madam, > > I am using the Petsc4py with PyCUDA. According to the following link > > https://www.mcs.anl.gov/petsc/petsc4py-current/docs/apiref/petsc4py.PETSc.Vec.Type-class.html > > I set the vector type as 'cuda', the simple code is as follows: > > import sys > import petsc4py > from petsc4py import PETSc > petsc4py.init(sys.argv) > from pycuda import autoinit > import pycuda.driver as drv > import pycuda.compiler as compiler > import pycuda.gpuarray as gpuarray > > a = PETSc.Vec().create() > a.setType('cuda') > a.setSizes(8) > But when I run this code, it always shows that "Unknown vector type: cuda". > > I have tried: > (a) petsc4py 3.15.0 with PyCUDA 2020.1 > (b) petsc4py 3.15.1 with PyCUDA 2021.1 > (c) petsc4py 3.16.0 with PyCUDA 2021.1 > > but it always shows the same message: Unknown vector type: cuda > > The CUDA version on my computer is CUDA 11.3. > > So I am writing this e-mail to ask for your help and advice. Thank you in advance. 
> > > Best, > > Guangpu Zhu > > > --- > Guangpu Zhu > > Research Associate, Department of Mechanical Engineering > > National University of Singapore > > Personal E-mail: zhugpupc at gmail.com > > Phone: (+65) 87581879 From jacob.fai at gmail.com Sat Oct 23 09:10:34 2021 From: jacob.fai at gmail.com (Jacob Faibussowitsch) Date: Sat, 23 Oct 2021 09:10:34 -0500 Subject: [petsc-users] Why PetscDestroy global collective semantics? In-Reply-To: <87zgr0fjgz.fsf@jedbrown.org> References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> <87zgr0fjgz.fsf@jedbrown.org> Message-ID: <171199C7-42C9-4D5C-96FD-83F08CFA33A8@gmail.com> Depending on the use-case you may also find PetscObjectRegisterDestroy() useful. If you can?t guarantee your PetscObjectDestroy() calls are collective, but have some other collective section you may call it then to punt the destruction of your object to PetscFinalize() which is guaranteed to be collective. https://petsc.org/main/docs/manualpages/Sys/PetscObjectRegisterDestroy.html Best regards, Jacob Faibussowitsch (Jacob Fai - booss - oh - vitch) > On Oct 22, 2021, at 23:33, Jed Brown wrote: > > Junchao Zhang > writes: > >> On Fri, Oct 22, 2021 at 9:13 PM Barry Smith wrote: >> >>> >>> One technical reason is that PetscHeaderDestroy_Private() may call >>> PetscCommDestroy() which may call MPI_Comm_free() which is defined by the >>> standard to be collective. Though PETSc tries to limit its use of new MPI >>> communicators (for example generally many objects shared the same >>> communicator) if we did not free those we no longer need when destroying >>> objects we could run out. >>> >> PetscCommDestroy() might call MPI_Comm_free() , but it is very unlikely. >> Petsc uses reference counting on communicators, so in PetscCommDestroy(), >> it likely just decreases the count. In other words, PetscCommDestroy() is >> cheap and in effect not collective. > > Unless it's the last reference to a given communicator, which is a risky/difficult thing for a user to guarantee and the consequences are potentially dire (deadlock being way worse than a crash) when the user's intent is to relax ordering for destruction. > > Alberto, what is the use case in which deterministic destruction is problematic? If you relax it for individual objects, is there a place you can be collective to collect any stale communicators? -------------- next part -------------- An HTML attachment was scrubbed... URL: From amartin at cimne.upc.edu Sat Oct 23 21:40:45 2021 From: amartin at cimne.upc.edu (=?UTF-8?Q?Alberto_F=2e_Mart=c3=adn?=) Date: Sun, 24 Oct 2021 13:40:45 +1100 Subject: [petsc-users] Why PetscDestroy global collective semantics? In-Reply-To: <171199C7-42C9-4D5C-96FD-83F08CFA33A8@gmail.com> References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> <87zgr0fjgz.fsf@jedbrown.org> <171199C7-42C9-4D5C-96FD-83F08CFA33A8@gmail.com> Message-ID: <04f4d613-da5c-c321-35ea-ff6ece8e59ef@cimne.upc.edu> Thanks all for your very insightful answers. We are leveraging PETSc from Julia in a parallel distributed memory context (several MPI tasks running the Julia REPL each). Julia uses Garbage Collection (GC), and we would like to destroy the PETSc objects automatically when the GC decides so along the simulation. In this context, we cannot guarantee deterministic destruction on all MPI tasks as the GC decisions are local to each task, no global semantics guaranteed. 
As far as I understand from your answers, there seems to be the possibility to defer the destruction of objects till points in the parallel program in which you can guarantee collective semantics, correct? If yes I guess that this may occur at any point in the simulation, not necessarily at shut down via PetscFinalize(), right? Best regards, ?Alberto. On 24/10/21 1:10 am, Jacob Faibussowitsch wrote: > Depending on the use-case you may also find > PetscObjectRegisterDestroy() useful. If you can?t guarantee your > PetscObjectDestroy() calls are collective, but have some other > collective section you may call it then to punt the destruction of > your object to PetscFinalize() which is guaranteed to be collective. > > https://petsc.org/main/docs/manualpages/Sys/PetscObjectRegisterDestroy.html > > > Best regards, > > Jacob Faibussowitsch > (Jacob Fai - booss - oh - vitch) > >> On Oct 22, 2021, at 23:33, Jed Brown > > wrote: >> >> Junchao Zhang > > writes: >> >>> On Fri, Oct 22, 2021 at 9:13 PM Barry Smith >> > wrote: >>> >>>> >>>> ?One technical reason is that PetscHeaderDestroy_Private() may call >>>> PetscCommDestroy() which may call MPI_Comm_free() which is defined >>>> by the >>>> standard to be collective. Though PETSc tries to limit its use of >>>> new MPI >>>> communicators (for example generally many objects shared the same >>>> communicator) if we did not free those we no longer need when >>>> destroying >>>> objects we could run out. >>>> >>> PetscCommDestroy() might call MPI_Comm_free() , but it is very unlikely. >>> Petsc uses reference counting on communicators, so in >>> PetscCommDestroy(), >>> it likely just decreases the count. In other words, >>> PetscCommDestroy() is >>> cheap and in effect not collective. >> >> Unless it's the last reference to a given communicator, which is a >> risky/difficult thing for a user to guarantee and the consequences >> are potentially dire (deadlock being way worse than a crash) when the >> user's intent is to relax ordering for destruction. >> >> Alberto, what is the use case in which deterministic destruction is >> problematic? If you relax it for individual objects, is there a place >> you can be collective to collect any stale communicators? > -- Alberto F. Mart?n-Huertas Senior Researcher, PhD. Computational Science Centre Internacional de M?todes Num?rics a l'Enginyeria (CIMNE) Parc Mediterrani de la Tecnologia, UPC Esteve Terradas 5, Building C3, Office 215, 08860 Castelldefels (Barcelona, Spain) Tel.: (+34) 9341 34223 e-mail:amartin at cimne.upc.edu FEMPAR project co-founder web: http://www.fempar.org ********************** IMPORTANT ANNOUNCEMENT The information contained in this message and / or attached file (s), sent from CENTRO INTERNACIONAL DE METODES NUMERICS EN ENGINYERIA-CIMNE, is confidential / privileged and is intended to be read only by the person (s) to the one (s) that is directed. Your data has been incorporated into the treatment system of CENTRO INTERNACIONAL DE METODES NUMERICS EN ENGINYERIA-CIMNE by virtue of its status as client, user of the website, provider and / or collaborator in order to contact you and send you information that may be of your interest and resolve your queries. 
You can exercise your rights of access, rectification, limitation of treatment, deletion, and opposition / revocation, in the terms established by the current regulations on data protection, directing your request to the postal address C / Gran Capit?, s / n Building C1 - 2nd Floor - Office C15 -Campus Nord - UPC 08034 Barcelona or via email to dpo at cimne.upc.edu If you read this message and it is not the designated recipient, or you have received this communication in error, we inform you that it is totally prohibited, and may be illegal, any disclosure, distribution or reproduction of this communication, and please notify us immediately. and return the original message to the address mentioned above. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Sat Oct 23 21:58:44 2021 From: bsmith at petsc.dev (Barry Smith) Date: Sat, 23 Oct 2021 22:58:44 -0400 Subject: [petsc-users] Why PetscDestroy global collective semantics? In-Reply-To: <04f4d613-da5c-c321-35ea-ff6ece8e59ef@cimne.upc.edu> References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> <87zgr0fjgz.fsf@jedbrown.org> <171199C7-42C9-4D5C-96FD-83F08CFA33A8@gmail.com> <04f4d613-da5c-c321-35ea-ff6ece8e59ef@cimne.upc.edu> Message-ID: <45846287-9326-4CA4-8C00-7121801DA01B@petsc.dev> Ahh, this makes perfect sense. The code for PetscObjectRegisterDestroy() and the actual destruction (called in PetscFinalize()) is very simply and can be found in src/sys/objects/destroy.c PetscObjectRegisterDestroy(), PetscObjectRegisterDestroyAll(). You could easily maintain a new array like PetscObjectRegisterGCDestroy_Objects[] and add objects with PetscObjectRegisterGCDestroy() and then destroy them with PetscObjectRegisterDestroyGCAll(). The only tricky part is that you have to have, in the context of your Julia MPI, make sure that PetscObjectRegisterDestroyGCAll() is called collectively over all the MPI ranks (that is it has to be called where all the ranks have made the same progress on MPI communication) that have registered objects to destroy, generally PETSC_COMM_ALL. We would be happy to incorporate such a system into the PETSc source with a merge request. Barry > On Oct 23, 2021, at 10:40 PM, Alberto F. Mart?n wrote: > > Thanks all for your very insightful answers. > > We are leveraging PETSc from Julia in a parallel distributed memory context (several MPI tasks running the Julia REPL each). > > Julia uses Garbage Collection (GC), and we would like to destroy the PETSc objects automatically when the GC decides so along the simulation. > > In this context, we cannot guarantee deterministic destruction on all MPI tasks as the GC decisions are local to each task, no global semantics guaranteed. > > As far as I understand from your answers, there seems to be the possibility to defer the destruction of objects till points in the parallel program in which you can guarantee collective semantics, correct? If yes I guess that this may occur at any point in the simulation, not necessarily at shut down via PetscFinalize(), right? > > Best regards, > > Alberto. > > > > On 24/10/21 1:10 am, Jacob Faibussowitsch wrote: >> Depending on the use-case you may also find PetscObjectRegisterDestroy() useful. If you can?t guarantee your PetscObjectDestroy() calls are collective, but have some other collective section you may call it then to punt the destruction of your object to PetscFinalize() which is guaranteed to be collective. 
>> >> https://petsc.org/main/docs/manualpages/Sys/PetscObjectRegisterDestroy.html >> >> Best regards, >> >> Jacob Faibussowitsch >> (Jacob Fai - booss - oh - vitch) >> >>> On Oct 22, 2021, at 23:33, Jed Brown > wrote: >>> >>> Junchao Zhang > writes: >>> >>>> On Fri, Oct 22, 2021 at 9:13 PM Barry Smith > wrote: >>>> >>>>> >>>>> One technical reason is that PetscHeaderDestroy_Private() may call >>>>> PetscCommDestroy() which may call MPI_Comm_free() which is defined by the >>>>> standard to be collective. Though PETSc tries to limit its use of new MPI >>>>> communicators (for example generally many objects shared the same >>>>> communicator) if we did not free those we no longer need when destroying >>>>> objects we could run out. >>>>> >>>> PetscCommDestroy() might call MPI_Comm_free() , but it is very unlikely. >>>> Petsc uses reference counting on communicators, so in PetscCommDestroy(), >>>> it likely just decreases the count. In other words, PetscCommDestroy() is >>>> cheap and in effect not collective. >>> >>> Unless it's the last reference to a given communicator, which is a risky/difficult thing for a user to guarantee and the consequences are potentially dire (deadlock being way worse than a crash) when the user's intent is to relax ordering for destruction. >>> >>> Alberto, what is the use case in which deterministic destruction is problematic? If you relax it for individual objects, is there a place you can be collective to collect any stale communicators? >> > -- > Alberto F. Mart?n-Huertas > Senior Researcher, PhD. Computational Science > Centre Internacional de M?todes Num?rics a l'Enginyeria (CIMNE) > Parc Mediterrani de la Tecnologia, UPC > Esteve Terradas 5, Building C3, Office 215, > 08860 Castelldefels (Barcelona, Spain) > Tel.: (+34) 9341 34223 > e-mail:amartin at cimne.upc.edu > > FEMPAR project co-founder > web: http://www.fempar.org > > ********************** > IMPORTANT ANNOUNCEMENT > > The information contained in this message and / or attached file (s), sent from CENTRO INTERNACIONAL DE METODES NUMERICS EN ENGINYERIA-CIMNE, > is confidential / privileged and is intended to be read only by the person (s) to the one (s) that is directed. Your data has been incorporated > into the treatment system of CENTRO INTERNACIONAL DE METODES NUMERICS EN ENGINYERIA-CIMNE by virtue of its status as client, user of the website, > provider and / or collaborator in order to contact you and send you information that may be of your interest and resolve your queries. > You can exercise your rights of access, rectification, limitation of treatment, deletion, and opposition / revocation, in the terms established > by the current regulations on data protection, directing your request to the postal address C / Gran Capit?, s / n Building C1 - 2nd Floor - > Office C15 -Campus Nord - UPC 08034 Barcelona or via email to dpo at cimne.upc.edu > > If you read this message and it is not the designated recipient, or you have received this communication in error, we inform you that it is > totally prohibited, and may be illegal, any disclosure, distribution or reproduction of this communication, and please notify us immediately. > and return the original message to the address mentioned above. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.zampini at gmail.com Sun Oct 24 00:51:51 2021 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Sun, 24 Oct 2021 08:51:51 +0300 Subject: [petsc-users] Why PetscDestroy global collective semantics? 
In-Reply-To: <45846287-9326-4CA4-8C00-7121801DA01B@petsc.dev> References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> <87zgr0fjgz.fsf@jedbrown.org> <171199C7-42C9-4D5C-96FD-83F08CFA33A8@gmail.com> <04f4d613-da5c-c321-35ea-ff6ece8e59ef@cimne.upc.edu> <45846287-9326-4CA4-8C00-7121801DA01B@petsc.dev> Message-ID: Non-deterministic garbage collection is an issue from Python too, and firedrake folks are also working on that. We may consider deferring all calls to MPI_Comm_free done on communicators with 1 as ref count (i.e., the call will actually wipe out some internal MPI data) in a collective call that can be either run by the user (on PETSC_COMM_WORLD), or at PetscFinalize() stage. I.e., something like that #define MPI_Comm_free(comm) PutCommInAList(comm) Comm creation is collective by definition, and thus collectiveness of the order of the destruction can be easily enforced. I don't see problems with 3rd party libraries using comms, since we always duplicate the comm we passed them Lawrence, do you think this may help you? Thanks Stefano Il giorno dom 24 ott 2021 alle ore 05:58 Barry Smith ha scritto: > > Ahh, this makes perfect sense. > > The code for PetscObjectRegisterDestroy() and the actual destruction > (called in PetscFinalize()) is very simply and can be found in > src/sys/objects/destroy.c PetscObjectRegisterDestroy(), PetscObjectRegisterDestroyAll(). > > You could easily maintain a new array > like PetscObjectRegisterGCDestroy_Objects[] and add objects > with PetscObjectRegisterGCDestroy() and then destroy them > with PetscObjectRegisterDestroyGCAll(). The only tricky part is that you > have to have, in the context of your Julia MPI, make sure > that PetscObjectRegisterDestroyGCAll() is called collectively over all the > MPI ranks (that is it has to be called where all the ranks have made the > same progress on MPI communication) that have registered objects to > destroy, generally PETSC_COMM_ALL. We would be happy to incorporate such a > system into the PETSc source with a merge request. > > Barry > > On Oct 23, 2021, at 10:40 PM, Alberto F. Mart?n > wrote: > > Thanks all for your very insightful answers. > > We are leveraging PETSc from Julia in a parallel distributed memory > context (several MPI tasks running the Julia REPL each). > > Julia uses Garbage Collection (GC), and we would like to destroy the PETSc > objects automatically when the GC decides so along the simulation. > > In this context, we cannot guarantee deterministic destruction on all MPI > tasks as the GC decisions are local to each task, no global semantics > guaranteed. > > As far as I understand from your answers, there seems to be the > possibility to defer the destruction of objects till points in the parallel > program in which you can guarantee collective semantics, correct? If yes I > guess that this may occur at any point in the simulation, not necessarily > at shut down via PetscFinalize(), right? > > Best regards, > > Alberto. > > > On 24/10/21 1:10 am, Jacob Faibussowitsch wrote: > > Depending on the use-case you may also find PetscObjectRegisterDestroy() > useful. If you can?t guarantee your PetscObjectDestroy() calls are > collective, but have some other collective section you may call it then to > punt the destruction of your object to PetscFinalize() which is guaranteed > to be collective. 
> > https://petsc.org/main/docs/manualpages/Sys/PetscObjectRegisterDestroy.html > > Best regards, > > Jacob Faibussowitsch > (Jacob Fai - booss - oh - vitch) > > On Oct 22, 2021, at 23:33, Jed Brown wrote: > > Junchao Zhang writes: > > On Fri, Oct 22, 2021 at 9:13 PM Barry Smith wrote: > > > One technical reason is that PetscHeaderDestroy_Private() may call > PetscCommDestroy() which may call MPI_Comm_free() which is defined by the > standard to be collective. Though PETSc tries to limit its use of new MPI > communicators (for example generally many objects shared the same > communicator) if we did not free those we no longer need when destroying > objects we could run out. > > PetscCommDestroy() might call MPI_Comm_free() , but it is very unlikely. > Petsc uses reference counting on communicators, so in PetscCommDestroy(), > it likely just decreases the count. In other words, PetscCommDestroy() is > cheap and in effect not collective. > > > Unless it's the last reference to a given communicator, which is a > risky/difficult thing for a user to guarantee and the consequences are > potentially dire (deadlock being way worse than a crash) when the user's > intent is to relax ordering for destruction. > > Alberto, what is the use case in which deterministic destruction is > problematic? If you relax it for individual objects, is there a place you > can be collective to collect any stale communicators? > > > -- > Alberto F. Mart?n-Huertas > Senior Researcher, PhD. Computational Science > Centre Internacional de M?todes Num?rics a l'Enginyeria (CIMNE) > Parc Mediterrani de la Tecnologia, UPC > Esteve Terradas 5, Building C3, Office 215, > 08860 Castelldefels (Barcelona, Spain) > Tel.: (+34) 9341 34223e-mail:amartin at cimne.upc.edu > > FEMPAR project co-founder > web: http://www.fempar.org > > ********************** > IMPORTANT ANNOUNCEMENT > > The information contained in this message and / or attached file (s), sent from CENTRO INTERNACIONAL DE METODES NUMERICS EN ENGINYERIA-CIMNE, > is confidential / privileged and is intended to be read only by the person (s) to the one (s) that is directed. Your data has been incorporated > into the treatment system of CENTRO INTERNACIONAL DE METODES NUMERICS EN ENGINYERIA-CIMNE by virtue of its status as client, user of the website, > provider and / or collaborator in order to contact you and send you information that may be of your interest and resolve your queries. > You can exercise your rights of access, rectification, limitation of treatment, deletion, and opposition / revocation, in the terms established > by the current regulations on data protection, directing your request to the postal address C / Gran Capit?, s / n Building C1 - 2nd Floor - > Office C15 -Campus Nord - UPC 08034 Barcelona or via email to dpo at cimne.upc.edu > > If you read this message and it is not the designated recipient, or you have received this communication in error, we inform you that it is > totally prohibited, and may be illegal, any disclosure, distribution or reproduction of this communication, and please notify us immediately. > and return the original message to the address mentioned above. > > > -- Stefano -------------- next part -------------- An HTML attachment was scrubbed... URL: From patrick.sanan at gmail.com Sun Oct 24 01:29:59 2021 From: patrick.sanan at gmail.com (Patrick Sanan) Date: Sun, 24 Oct 2021 08:29:59 +0200 Subject: [petsc-users] Why PetscDestroy global collective semantics? 
In-Reply-To: References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> <87zgr0fjgz.fsf@jedbrown.org> <171199C7-42C9-4D5C-96FD-83F08CFA33A8@gmail.com> <04f4d613-da5c-c321-35ea-ff6ece8e59ef@cimne.upc.edu> <45846287-9326-4CA4-8C00-7121801DA01B@petsc.dev> Message-ID: I think Jeremy (cc?d) has also been thinking about this in the context of PETSc.jl Stefano Zampini schrieb am So. 24. Okt. 2021 um 07:52: > Non-deterministic garbage collection is an issue from Python too, and > firedrake folks are also working on that. > > We may consider deferring all calls to MPI_Comm_free done on communicators > with 1 as ref count (i.e., the call will actually wipe out some internal > MPI data) in a collective call that can be either run by the user (on > PETSC_COMM_WORLD), or at PetscFinalize() stage. > I.e., something like that > > #define MPI_Comm_free(comm) PutCommInAList(comm) > > Comm creation is collective by definition, and thus collectiveness of the > order of the destruction can be easily enforced. > I don't see problems with 3rd party libraries using comms, since we always > duplicate the comm we passed them > > Lawrence, do you think this may help you? > > Thanks > Stefano > > Il giorno dom 24 ott 2021 alle ore 05:58 Barry Smith > ha scritto: > >> >> Ahh, this makes perfect sense. >> >> The code for PetscObjectRegisterDestroy() and the actual destruction >> (called in PetscFinalize()) is very simply and can be found in >> src/sys/objects/destroy.c PetscObjectRegisterDestroy(), PetscObjectRegisterDestroyAll(). >> >> You could easily maintain a new array >> like PetscObjectRegisterGCDestroy_Objects[] and add objects >> with PetscObjectRegisterGCDestroy() and then destroy them >> with PetscObjectRegisterDestroyGCAll(). The only tricky part is that you >> have to have, in the context of your Julia MPI, make sure >> that PetscObjectRegisterDestroyGCAll() is called collectively over all the >> MPI ranks (that is it has to be called where all the ranks have made the >> same progress on MPI communication) that have registered objects to >> destroy, generally PETSC_COMM_ALL. We would be happy to incorporate such a >> system into the PETSc source with a merge request. >> >> Barry >> >> On Oct 23, 2021, at 10:40 PM, Alberto F. Mart?n >> wrote: >> >> Thanks all for your very insightful answers. >> >> We are leveraging PETSc from Julia in a parallel distributed memory >> context (several MPI tasks running the Julia REPL each). >> >> Julia uses Garbage Collection (GC), and we would like to destroy the >> PETSc objects automatically when the GC decides so along the simulation. >> >> In this context, we cannot guarantee deterministic destruction on all MPI >> tasks as the GC decisions are local to each task, no global semantics >> guaranteed. >> >> As far as I understand from your answers, there seems to be the >> possibility to defer the destruction of objects till points in the parallel >> program in which you can guarantee collective semantics, correct? If yes I >> guess that this may occur at any point in the simulation, not necessarily >> at shut down via PetscFinalize(), right? >> >> Best regards, >> >> Alberto. >> >> >> On 24/10/21 1:10 am, Jacob Faibussowitsch wrote: >> >> Depending on the use-case you may also find PetscObjectRegisterDestroy() >> useful. 
If you can?t guarantee your PetscObjectDestroy() calls are >> collective, but have some other collective section you may call it then to >> punt the destruction of your object to PetscFinalize() which is guaranteed >> to be collective. >> >> >> https://petsc.org/main/docs/manualpages/Sys/PetscObjectRegisterDestroy.html >> >> Best regards, >> >> Jacob Faibussowitsch >> (Jacob Fai - booss - oh - vitch) >> >> On Oct 22, 2021, at 23:33, Jed Brown wrote: >> >> Junchao Zhang writes: >> >> On Fri, Oct 22, 2021 at 9:13 PM Barry Smith wrote: >> >> >> One technical reason is that PetscHeaderDestroy_Private() may call >> PetscCommDestroy() which may call MPI_Comm_free() which is defined by the >> standard to be collective. Though PETSc tries to limit its use of new MPI >> communicators (for example generally many objects shared the same >> communicator) if we did not free those we no longer need when destroying >> objects we could run out. >> >> PetscCommDestroy() might call MPI_Comm_free() , but it is very unlikely. >> Petsc uses reference counting on communicators, so in PetscCommDestroy(), >> it likely just decreases the count. In other words, PetscCommDestroy() is >> cheap and in effect not collective. >> >> >> Unless it's the last reference to a given communicator, which is a >> risky/difficult thing for a user to guarantee and the consequences are >> potentially dire (deadlock being way worse than a crash) when the user's >> intent is to relax ordering for destruction. >> >> Alberto, what is the use case in which deterministic destruction is >> problematic? If you relax it for individual objects, is there a place you >> can be collective to collect any stale communicators? >> >> >> -- >> Alberto F. Mart?n-Huertas >> Senior Researcher, PhD. Computational Science >> Centre Internacional de M?todes Num?rics a l'Enginyeria (CIMNE) >> Parc Mediterrani de la Tecnologia, UPCEsteve Terradas 5, Building C3, Office 215 , >> 08860 Castelldefels (Barcelona, Spain) >> Tel.: (+34) 9341 34223e-mail:amartin at cimne.upc.edu >> >> FEMPAR project co-founder >> web: http://www.fempar.org >> >> ********************** >> IMPORTANT ANNOUNCEMENT >> >> The information contained in this message and / or attached file (s), sent from CENTRO INTERNACIONAL DE METODES NUMERICS EN ENGINYERIA-CIMNE, >> is confidential / privileged and is intended to be read only by the person (s) to the one (s) that is directed. Your data has been incorporated >> into the treatment system of CENTRO INTERNACIONAL DE METODES NUMERICS EN ENGINYERIA-CIMNE by virtue of its status as client, user of the website, >> provider and / or collaborator in order to contact you and send you information that may be of your interest and resolve your queries. >> You can exercise your rights of access, rectification, limitation of treatment, deletion, and opposition / revocation, in the terms established >> by the current regulations on data protection, directing your request to the postal address C / Gran Capit?, s / n Building C1 - 2nd Floor - >> Office C15 -Campus Nord - UPC 08034 Barcelona or via email to dpo at cimne.upc.edu >> >> If you read this message and it is not the designated recipient, or you have received this communication in error, we inform you that it is >> totally prohibited, and may be illegal, any disclosure, distribution or reproduction of this communication, and please notify us immediately. >> and return the original message to the address mentioned above. 
>> >> >> > > -- > Stefano > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cheng at cerfacs.fr Sun Oct 24 18:00:14 2021 From: cheng at cerfacs.fr (Lionel CHENG) Date: Mon, 25 Oct 2021 01:00:14 +0200 (CEST) Subject: [petsc-users] Convergence on Axisymmetric Poisson matrix Message-ID: <726522925.59047959.1635116414656.JavaMail.zimbra@cerfacs.fr>
Hello everyone, I have some questions regarding a linear system that I am solving in my plasma simulations. We have in this case a strongly non-symmetric matrix due to the cylindrical coordinates, for which the Laplacian cell is given by Fig. 2 for two kinds of triangles. The different unstructured grids have from 300 000 nodes to 7 000 000 nodes. To my understanding, CG should not work properly on this matrix but BiCGStab(1) should. When using the SOR preconditioner it is indeed the case: -ksp_type cg -pc_type sor yields solutions in 10 to 20 times more iterations than -ksp_type bcgs -pc_type sor. However, when switching to -ksp_type cg -pc_type gamg the convergence is great and even slightly better than -ksp_type bcgs. I do not understand how CG is able to make the system converge when using GAMG although the matrix is non-symmetric? Is GAMG able to somehow symmetrize the system? I have the impression that when using -pc_type gamg the Krylov solver is actually the pre-relaxation and post-relaxation of the initial grid, is that right? For GAMG, since the matrix is non-symmetric, -mg_levels_pc_type sor and -mg_levels_ksp_type richardson have been used and yield better results than the original Chebyshev solver. Sincerely yours, Lionel Cheng -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: main.pdf Type: application/pdf Size: 133390 bytes Desc: not available URL:
From Eric.Chamberland at giref.ulaval.ca Sun Oct 24 22:49:38 2021 From: Eric.Chamberland at giref.ulaval.ca (Eric Chamberland) Date: Sun, 24 Oct 2021 23:49:38 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? In-Reply-To: References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca> Message-ID: Hi Matthew, ok, I started back from your ex44.c example and added the global array of coordinates. I just have to code the creation of the local coordinates now. Eric On 2021-10-20 6:55 p.m., Matthew Knepley wrote: > On Wed, Oct 20, 2021 at 3:06 PM Eric Chamberland > > wrote: > > Hi Matthew, > > we tried to reproduce the error in a simple example. > > The context is the following: We hard coded the mesh and initial > partition into the code (see sConnectivity and sInitialPartition) > for 2 ranks and try to create a section in order to use the > DMPlexNaturalToGlobalBegin function to retrieve our initial > element numbers. > > Now the call to DMPlexDistribute gives different errors depending > on what type of component we ask the field to be created. For our > objective, we would like a global field to be created on elements > only (like a P0 interpolation).
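For reference, the workflow being attempted here (a P0 field attached before distribution so that the natural ordering can be recovered afterwards) can be sketched roughly as below; the variable names are illustrative, dm is assumed to be the interpolated DMPlex built from the parallel cell list, and error checking is abbreviated.

  /* assumes #include <petscdmplex.h> */
  PetscErrorCode ierr;
  PetscSection   section;
  PetscSF        migrationSF;
  DM             dmDist;
  Vec            global, natural;
  PetscInt       numComp[1] = {1};          /* one scalar component             */
  PetscInt       numDof[4]  = {0, 0, 0, 1}; /* 3D mesh: dofs on cells only (P0) */

  ierr = DMSetUseNatural(dm, PETSC_TRUE);CHKERRQ(ierr);   /* must be set before distribution */
  ierr = DMSetNumFields(dm, 1);CHKERRQ(ierr);
  ierr = DMPlexCreateSection(dm, NULL, numComp, numDof, 0, NULL, NULL, NULL, NULL, &section);CHKERRQ(ierr);
  ierr = DMSetLocalSection(dm, section);CHKERRQ(ierr);
  ierr = PetscSectionDestroy(&section);CHKERRQ(ierr);

  ierr = DMPlexDistribute(dm, 0, &migrationSF, &dmDist);CHKERRQ(ierr);

  /* Afterwards, a distributed (global) vector can be permuted back to the
     original (natural) ordering; see the DMPlexGlobalToNatural examples for
     the exact layout expected of the natural vector. */
  ierr = DMCreateGlobalVector(dmDist, &global);CHKERRQ(ierr);
  ierr = DMCreateGlobalVector(dmDist, &natural);CHKERRQ(ierr);
  ierr = DMPlexGlobalToNaturalBegin(dmDist, global, natural);CHKERRQ(ierr);
  ierr = DMPlexGlobalToNaturalEnd(dmDist, global, natural);CHKERRQ(ierr);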
> > We now have the following error generated: > > [0]PETSC ERROR: --------------------- Error Message > -------------------------------------------------------------- > [0]PETSC ERROR: Petsc has generated inconsistent data > [0]PETSC ERROR: Inconsistency in indices, 18 should be 17 > [0]PETSC ERROR: See > https://www.mcs.anl.gov/petsc/documentation/faq.html > for trouble > shooting. > [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar 30, 2021 > [0]PETSC ERROR: ./bug on a? named rohan by ericc Wed Oct 20 > 14:52:36 2021 > [0]PETSC ERROR: Configure options > --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 > --with-mpi-compilers=1 --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 > --with-cxx-dialect=C++14 --with-make-np=12 > --with-shared-libraries=1 --with-debugging=yes --with-memalign=64 > --with-visibility=0 --with-64-bit-indices=0 --download-ml=yes > --download-mumps=yes --download-superlu=yes --download-hpddm=yes > --download-slepc=yes --download-superlu_dist=yes > --download-parmetis=yes --download-ptscotch=yes > --download-metis=yes --download-strumpack=yes > --download-suitesparse=yes --download-hypre=yes > --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 > --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. > --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. > --with-scalapack=1 > --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include > --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 > -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" > [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at > /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 > [0]PETSC ERROR: #2 DMPlexCreateGlobalToNaturalSF() at > /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 > [0]PETSC ERROR: #3 DMPlexDistribute() at > /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 > [0]PETSC ERROR: #4 main() at bug_section.cc:159 > [0]PETSC ERROR: No PETSc Option Table entries > [0]PETSC ERROR: ----------------End of Error Message -------send > entire error message to petsc-maint at mcs.anl.gov > ---------- > > Hope the attached code is self-explaining, note that to make it > short, we have not included the final part of it, just the buggy > part we are encountering right now... > > Thanks for your insights, > > Thanks for making the example. I tweaked it slightly. I put in a test > case that just makes a parallel 7 x 10 quad mesh. This works > fine. Thus I think it must be something connected with the original > mesh. It is hard to get a handle on it without the coordinates. > Do you think you could put the coordinate array in? I have added the > code to load them (see attached file). > > ? Thanks, > > ? ? ?Matt > > Eric > > On 2021-10-06 9:23 p.m., Matthew Knepley wrote: >> On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland >> > > wrote: >> >> Hi Matthew, >> >> we tried to use that.? Now, we discovered that: >> >> 1- even if we "ask" for sfNatural creation with >> DMSetUseNatural, it is not created because >> DMPlexCreateGlobalToNaturalSF looks for a "section": this is >> not documented in DMSetUseNaturalso we are asking ourselfs: >> "is this a permanent feature or a temporary situation?" >> >> I think explaining this will help clear up a lot. >> >> What the Natural2Global?map does is permute a solution vector >> into the ordering that it would have had prior to mesh distribution. >> Now, in order to do this permutation, I need to know the original >> (global) data layout. 
If it is not specified _before_ >> distribution, we >> cannot build the permutation.? The section describes the data >> layout, so I need it before distribution. >> >> I cannot think of another way that you would implement this, but >> if you want something else, let me know. >> >> 2- We then tried to create a "section" in different manners: >> we took the code into the example >> petsc/src/dm/impls/plex/tests/ex15.c. However, we ended up >> with a segfault: >> >> corrupted size vs. prev_size >> [rohan:07297] *** Process received signal *** >> [rohan:07297] Signal: Aborted (6) >> [rohan:07297] Signal code:? (-6) >> [rohan:07297] [ 0] >> /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] >> [rohan:07297] [ 1] >> /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] >> [rohan:07297] [ 2] /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] >> [rohan:07297] [ 3] /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] >> [rohan:07297] [ 4] /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] >> [rohan:07297] [ 5] /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] >> [rohan:07297] [ 6] /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] >> [rohan:07297] [ 7] /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] >> [rohan:07297] [ 8] >> /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] >> [rohan:07297] [ 9] >> /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] >> [rohan:07297] [10] >> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] >> [rohan:07297] [11] >> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] >> [rohan:07297] [12] >> /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] >> [rohan:07297] [13] /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] >> >> [rohan:07297] [14] >> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] >> [rohan:07297] [15] >> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] >> [rohan:07297] [16] >> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] >> [rohan:07297] [17] >> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] >> [rohan:07297] [18] >> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] >> >> I am not sure what happened here, but if you could send a sample >> code, I will figure it out. >> >> If we do not create a section, the call to DMPlexDistribute >> is successful, but DMPlexGetGlobalToNaturalSF return a null >> SF pointer... >> >> Yes, it just ignores it in this case because it does not have a >> global layout. >> >> Here are the operations we are calling ( this is almost the >> code we are using, I just removed verifications and creation >> of the connectivity which use our parallel structure and code): >> >> =========== >> >> ? PetscInt* lCells????? = 0; >> ? PetscInt? lNumCorners = 0; >> ? PetscInt? lDimMail??? = 0; >> ? PetscInt? lnumCells?? = 0; >> >> ? //At this point we create the cells for PETSc expected >> input for DMPlexBuildFromCellListParallel and set >> lNumCorners, lDimMail and lnumCells to correct values. >> ? ... >> >> ? DM?????? lDMBete = 0 >> ? DMPlexCreate(lMPIComm,&lDMBete); >> >> ? DMSetDimension(lDMBete, lDimMail); >> >> ? DMPlexBuildFromCellListParallel(lDMBete, >> ????????????????????????????????? 
lnumCells, >> PETSC_DECIDE, >> pLectureElementsLocaux.reqNbTotalSommets(), >> ????????????????????????????????? lNumCorners, >> ????????????????????????????????? lCells, >> ????????????????????????????????? PETSC_NULL); >> >> ? DM lDMBeteInterp = 0; >> ? DMPlexInterpolate(lDMBete, &lDMBeteInterp); >> ? DMDestroy(&lDMBete); >> ? lDMBete = lDMBeteInterp; >> >> ? DMSetUseNatural(lDMBete,PETSC_TRUE); >> >> ? PetscSF lSFMigrationSansOvl = 0; >> ? PetscSF lSFMigrationOvl = 0; >> ? DM lDMDistribueSansOvl = 0; >> ? DM lDMAvecOverlap = 0; >> >> ? PetscPartitioner lPart; >> ? DMPlexGetPartitioner(lDMBete, &lPart); >> ? PetscPartitionerSetFromOptions(lPart); >> >> ? PetscSection?? section; >> ? PetscInt?????? numFields?? = 1; >> ? PetscInt?????? numBC?????? = 0; >> ? PetscInt?????? numComp[1]? = {1}; >> ? PetscInt?????? numDof[4]?? = {1, 0, 0, 0}; >> ? PetscInt?????? bcFields[1] = {0}; >> ? IS???????????? bcPoints[1] = {NULL}; >> >> ? DMSetNumFields(lDMBete, numFields); >> >> ? DMPlexCreateSection(lDMBete, NULL, numComp, numDof, numBC, >> bcFields, bcPoints, NULL, NULL, §ion); >> ? DMSetLocalSection(lDMBete, section); >> >> ? DMPlexDistribute(lDMBete, 0, &lSFMigrationSansOvl, >> &lDMDistribueSansOvl); // segfault! >> >> =========== >> >> So we have other question/remarks: >> >> 3- Maybe PETSc expect something specific that is missing/not >> verified: for example, we didn't gave any coordinates since >> we just want to partition and compute overlap for the mesh... >> and then recover our element numbers in a "simple way" >> >> 4- We are telling ourselves it is somewhat a "big price to >> pay" to have to build an unused section to have the global to >> natural ordering set ?? Could this requirement be avoided? >> >> I don't think so. There would have to be _some_ way of describing >> your data layout in terms of mesh points, and I do not see how >> you could use less memory doing that. >> >> 5- Are there any improvement towards our usages in 3.16 release? >> >> Let me try and run the code above. >> >> ? Thanks, >> >> ? ? ?Matt >> >> Thanks, >> >> Eric >> >> >> On 2021-09-29 7:39 p.m., Matthew Knepley wrote: >>> On Wed, Sep 29, 2021 at 5:18 PM Eric Chamberland >>> >> > wrote: >>> >>> Hi, >>> >>> I come back with _almost_ the original question: >>> >>> I would like to add an integer information (*our* >>> original element >>> number, not petsc one) on each element of the DMPlex I >>> create with >>> DMPlexBuildFromCellListParallel. >>> >>> I would like this interger to be distribruted by or the >>> same way >>> DMPlexDistribute distribute the mesh. >>> >>> Is it possible to do this? >>> >>> >>> I think we already have support for what you want. If you call >>> >>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html >>> >>> >>> before DMPlexDistribute(), it will compute a PetscSF >>> encoding the global to natural map. You >>> can get it with >>> >>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html >>> >>> >>> and use it with >>> >>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html >>> >>> >>> Is this sufficient? >>> >>> ? Thanks, >>> >>> ? ? ?Matt >>> >>> Thanks, >>> >>> Eric >>> >>> On 2021-07-14 1:18 p.m., Eric Chamberland wrote: >>> > Hi, >>> > >>> > I want to use DMPlexDistribute from PETSc for >>> computing overlapping >>> > and play with the different partitioners supported. >>> > >>> > However, after calling DMPlexDistribute, I noticed the >>> elements are >>> > renumbered and then the original number is lost. 
>>> > >>> > What would be the best way to keep track of the >>> element renumbering? >>> > >>> > a) Adding an optional parameter to let the user >>> retrieve a vector or >>> > "IS" giving the old number? >>> > >>> > b) Adding a DMLabel (seems a wrong good solution) >>> > >>> > c) Other idea? >>> > >>> > Of course, I don't want to loose performances with the >>> need of this >>> > "mapping"... >>> > >>> > Thanks, >>> > >>> > Eric >>> > >>> -- >>> Eric Chamberland, ing., M. Ing >>> Professionnel de recherche >>> GIREF/Universit? Laval >>> (418) 656-2131 poste 41 22 42 >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin >>> their experiments is infinitely more interesting than any >>> results to which their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >> >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to >> which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -- Eric Chamberland, ing., M. Ing Professionnel de recherche GIREF/Universit? Laval (418) 656-2131 poste 41 22 42 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ex44.c Type: text/x-csrc Size: 9586 bytes Desc: not available URL: From wence at gmx.li Mon Oct 25 06:34:36 2021 From: wence at gmx.li (Lawrence Mitchell) Date: Mon, 25 Oct 2021 12:34:36 +0100 Subject: [petsc-users] Why PetscDestroy global collective semantics? In-Reply-To: References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> <87zgr0fjgz.fsf@jedbrown.org> <171199C7-42C9-4D5C-96FD-83F08CFA33A8@gmail.com> <04f4d613-da5c-c321-35ea-ff6ece8e59ef@cimne.upc.edu> <45846287-9326-4CA4-8C00-7121801DA01B@petsc.dev> Message-ID: <621B3D93-E4C2-43B6-B6A1-6F8324CB7E8D@gmx.li> Hi all, (I cc Jack who is doing the implementation in the petsc4py setting) > On 24 Oct 2021, at 06:51, Stefano Zampini wrote: > > Non-deterministic garbage collection is an issue from Python too, and firedrake folks are also working on that. > > We may consider deferring all calls to MPI_Comm_free done on communicators with 1 as ref count (i.e., the call will actually wipe out some internal MPI data) in a collective call that can be either run by the user (on PETSC_COMM_WORLD), or at PetscFinalize() stage. > I.e., something like that > > #define MPI_Comm_free(comm) PutCommInAList(comm) > > Comm creation is collective by definition, and thus collectiveness of the order of the destruction can be easily enforced. > I don't see problems with 3rd party libraries using comms, since we always duplicate the comm we passed them > Lawrence, do you think this may help you? I think that it is not just MPI_Comm_free that is potentially problematic. Here are some additional areas off the top of my head: 1. PetscSF with -sf_type window. 
Destroy (when the refcount drops to zero) calls MPI_Win_free (which is collective over comm) 2. Deallocation of MUMPS objects is tremendously collective. In general the solution of just punting MPI_Comm_free to PetscFinalize (or some user-defined time) is, I think, insufficient since it requires us to audit the collectiveness of all `XXX_Destroy` functions (including in third-party packages). Barry's suggestion of resurrecting objects in finalisation using PetscObjectRegisterDestroy and then collectively clearing that array periodically is pretty close to the proposal that we cooked up I think. Jack can correct any missteps I make in explanation, but perhaps this is helpful for Alberto: 1. Each PETSc communicator gets two new attributes "creation_index" [an int64], "resurrected_objects" [a set-like thing] 2. PetscHeaderCreate grabs the next creation_index out of the input communicator and stashes it on the object. Since object creation is collective this is guaranteed to agree on any given communicator across processes. 3. When the Python garbage collector tries to destroy PETSc objects we resurrect the _C_ object in finalisation and stash it in "resurrected_objects" on the communicator. 4. Periodically (as a result of user intervention in the first instance), we do garbage collection collectively on these resurrected objects by performing a set intersection of the creation_indices across the communicator's processes, and then calling XXXDestroy in order on the sorted_by_creation_index set intersection. I think that most of this infrastructure is agnostic of the managed language, so Jack was doing implementation in PETSc (rather than petsc4py). This wasn't a perfect solution (I recall that we could still cook up situations in which objects would not be collected), but it did seem to (in theory) solve any potential deadlock issues. Lawrence From j.betteridge at imperial.ac.uk Mon Oct 25 07:12:52 2021 From: j.betteridge at imperial.ac.uk (Betteridge, Jack D) Date: Mon, 25 Oct 2021 12:12:52 +0000 Subject: [petsc-users] Why PetscDestroy global collective semantics? In-Reply-To: <621B3D93-E4C2-43B6-B6A1-6F8324CB7E8D@gmx.li> References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> <87zgr0fjgz.fsf@jedbrown.org> <171199C7-42C9-4D5C-96FD-83F08CFA33A8@gmail.com> <04f4d613-da5c-c321-35ea-ff6ece8e59ef@cimne.upc.edu> <45846287-9326-4CA4-8C00-7121801DA01B@petsc.dev> <621B3D93-E4C2-43B6-B6A1-6F8324CB7E8D@gmx.li> Message-ID: Hi Everyone, I cannot fault Lawrence's explanation, that is precisely what I'm implementing. The only difference is I was adding most of the logic for the "resurrected objects map" to petsc4py rather than PETSc. Given that this solution is truly Python agnostic, I will move what I have written to C and merely add the interface to the functionality to petsc4py. Indeed, this works out better for me as I was not enjoying writing all the code in Cython! I'll post an update once there is a working prototype in my PETSc fork, and the code is ready for testing. Cheers, Jack ________________________________ From: Lawrence Mitchell Sent: 25 October 2021 12:34 To: Stefano Zampini Cc: Barry Smith ; "Alberto F. Mart?n" ; PETSc users list ; Francesc Verdugo ; Betteridge, Jack D Subject: Re: [petsc-users] Why PetscDestroy global collective semantics? ******************* This email originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. 
If you trust the sender, add them to your safe senders list https://spam.ic.ac.uk/SpamConsole/Senders.aspx to disable email stamping for this address. ******************* Hi all, (I cc Jack who is doing the implementation in the petsc4py setting) > On 24 Oct 2021, at 06:51, Stefano Zampini wrote: > > Non-deterministic garbage collection is an issue from Python too, and firedrake folks are also working on that. > > We may consider deferring all calls to MPI_Comm_free done on communicators with 1 as ref count (i.e., the call will actually wipe out some internal MPI data) in a collective call that can be either run by the user (on PETSC_COMM_WORLD), or at PetscFinalize() stage. > I.e., something like that > > #define MPI_Comm_free(comm) PutCommInAList(comm) > > Comm creation is collective by definition, and thus collectiveness of the order of the destruction can be easily enforced. > I don't see problems with 3rd party libraries using comms, since we always duplicate the comm we passed them > Lawrence, do you think this may help you? I think that it is not just MPI_Comm_free that is potentially problematic. Here are some additional areas off the top of my head: 1. PetscSF with -sf_type window. Destroy (when the refcount drops to zero) calls MPI_Win_free (which is collective over comm) 2. Deallocation of MUMPS objects is tremendously collective. In general the solution of just punting MPI_Comm_free to PetscFinalize (or some user-defined time) is, I think, insufficient since it requires us to audit the collectiveness of all `XXX_Destroy` functions (including in third-party packages). Barry's suggestion of resurrecting objects in finalisation using PetscObjectRegisterDestroy and then collectively clearing that array periodically is pretty close to the proposal that we cooked up I think. Jack can correct any missteps I make in explanation, but perhaps this is helpful for Alberto: 1. Each PETSc communicator gets two new attributes "creation_index" [an int64], "resurrected_objects" [a set-like thing] 2. PetscHeaderCreate grabs the next creation_index out of the input communicator and stashes it on the object. Since object creation is collective this is guaranteed to agree on any given communicator across processes. 3. When the Python garbage collector tries to destroy PETSc objects we resurrect the _C_ object in finalisation and stash it in "resurrected_objects" on the communicator. 4. Periodically (as a result of user intervention in the first instance), we do garbage collection collectively on these resurrected objects by performing a set intersection of the creation_indices across the communicator's processes, and then calling XXXDestroy in order on the sorted_by_creation_index set intersection. I think that most of this infrastructure is agnostic of the managed language, so Jack was doing implementation in PETSc (rather than petsc4py). This wasn't a perfect solution (I recall that we could still cook up situations in which objects would not be collected), but it did seem to (in theory) solve any potential deadlock issues. Lawrence -------------- next part -------------- An HTML attachment was scrubbed... 
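A sketch of how the per-communicator "creation_index" in step 2 of the quoted proposal could be kept with standard MPI attribute caching; the keyval and helper below are purely illustrative, and a real implementation would also free the counter in a delete callback.

#include <mpi.h>
#include <stdlib.h>

static int creation_keyval = MPI_KEYVAL_INVALID;

/* Return the next creation index cached on comm. Because object creation is
   collective, every rank calls this in the same order on a given communicator,
   so the returned values agree across ranks without any communication. */
long long NextCreationIndex(MPI_Comm comm)
{
  long long *counter;
  int        flag;

  if (creation_keyval == MPI_KEYVAL_INVALID)
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN, &creation_keyval, NULL);

  MPI_Comm_get_attr(comm, creation_keyval, &counter, &flag);
  if (!flag) {                     /* first object created on this communicator */
    counter  = (long long *)malloc(sizeof(*counter));
    *counter = 0;
    MPI_Comm_set_attr(comm, creation_keyval, counter);
  }
  return (*counter)++;
}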
URL: From bsmith at petsc.dev Mon Oct 25 08:33:50 2021 From: bsmith at petsc.dev (Barry Smith) Date: Mon, 25 Oct 2021 09:33:50 -0400 Subject: [petsc-users] Convergence on Axisymmetric Poisson matrix In-Reply-To: <726522925.59047959.1635116414656.JavaMail.zimbra@cerfacs.fr> References: <726522925.59047959.1635116414656.JavaMail.zimbra@cerfacs.fr> Message-ID: <18054E4C-B03B-4821-9D30-619E9B6BB173@petsc.dev> Are you running with -ksp_monitor_true_residual to track the b - A*x residual instead of just the preconditioned residual? GAMG definitely does not symmetrize the system but it is possible the preconditioner results in the solve "not seeing" the unsymmetry during the solution process and hence CG still converging; it would be dangerous to rely on this in general I think. You could also run this case with GMRES to see if that is better than the CG iterations. Barry > On Oct 24, 2021, at 7:00 PM, Lionel CHENG wrote: > > Hello everyone, > > I have some questions regarding a linear system that I am solving in my plasma simulations. We have in this case a strongly non-symmetric matrix due to the cylindrical coordinates for which the Laplacian cell is given by Fig. 2 for two kinds of triangles. The different unstructured grids have from 300 000 nodes to 7 000 000 nodes. > > To my understanding, CG should not work properly on this matrix but BiCGStab(1) should. When using SOR preconditioner it is indeed the case: -ksp_type cg -pc_type sor yields solutions in 10 to 20 times more iterations than -ksp_type bcgs -pc_type sor. > > However, when switching to -ksp_type cg -pc_type gamg the convergence is great and even slightly better than -ksp_type bcgs. I do not understand how CG is able to make the system converge when using GAMG although the matrix is non-symmetric ? Is GAMG able to somehow symmetrize the system? I have the impression that when using -pc_type gamg the Krylov solver is actually the Pre-relaxation and post-relaxation of the initial grid, is that right? > > For GAMG since the matrix is non-symmetric -mg_levels_pc_type sor for and -mg_levels_ksp_type richardson have been used and yields better results than the original chebychev solver. > > Sincerely yours, > > Lionel Cheng > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yhcy1993 at gmail.com Mon Oct 25 08:53:12 2021 From: yhcy1993 at gmail.com (=?UTF-8?B?5LuT5a6H?=) Date: Mon, 25 Oct 2021 21:53:12 +0800 Subject: [petsc-users] Strange behavior of TS after setting hand-coded Jacobian Message-ID: I'm using TS to solve a set of DAE, which originates from a one-dimensional problem. The grid points are uniformly distributed. For simplicity, the DMDA is not employed for discretization. At first, only the residual function is prescribed through 'TSSetIFunction', and PETSC produces converged results. However, after providing hand-coded Jacobian through 'TSSetIJacobian', the internal SNES object fails (residual norm does not change), and TS reports 'DIVERGED_STEP_REJECTED'. I have tried to add the option '-snes_test_jacobian' to see if the hand-coded jacobian is somewhere wrong, but it shows '||J - Jfd||_F/||J||_F = 1.07488e-10, ||J - Jfd||_F = 2.14458e-07', indicating that the hand-coded jacobian is correct. Then, I added a monitor for the internal SNES object through 'SNESMonitorSet', in which the solution vector will be displayed at each iteration. 
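Such a monitor can be attached to the SNES inside the TS roughly as in the sketch below; the callback name and output format are illustrative only.

#include <petscsnes.h>

static PetscErrorCode MySolutionMonitor(SNES snes, PetscInt it, PetscReal fnorm, void *ctx)
{
  Vec            x;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = SNESGetSolution(snes, &x);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "SNES iteration %D, ||F|| = %g\n", it, (double)fnorm);CHKERRQ(ierr);
  ierr = VecView(x, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

/* Attach it to the TS's inner SNES:
     TSGetSNES(ts, &snes);
     SNESMonitorSet(snes, MySolutionMonitor, NULL, NULL);  */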
It is interesting to find that, if the jacobian is not provided, meaning finite-difference is utilized for jacobian evaluation internally, the solution vector converges to steady solution and the SNES residual norm is reduced continuously. However, it turns out that, as long as the jacobian is provided, the solution vector will NEVER get changed! So the solution procedure stucked! This is quite strange! Hope to get some advice. PETSC version=3.14.6, program run in serial mode. Regards Yu Cang From bsmith at petsc.dev Mon Oct 25 09:50:55 2021 From: bsmith at petsc.dev (Barry Smith) Date: Mon, 25 Oct 2021 10:50:55 -0400 Subject: [petsc-users] Strange behavior of TS after setting hand-coded Jacobian In-Reply-To: References: Message-ID: <0C6ACBF3-F457-4BFD-AD19-8C455444748F@petsc.dev> It is definitely unexpected that -snes_test_jacobian verifies the Jacobian as matching but the solve process is completely different. Please run with -snes_monitor -snes_converged_reason -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian and send all the output Barry > On Oct 25, 2021, at 9:53 AM, ?? wrote: > > I'm using TS to solve a set of DAE, which originates from a > one-dimensional problem. The grid points are uniformly distributed. > For simplicity, the DMDA is not employed for discretization. > > At first, only the residual function is prescribed through > 'TSSetIFunction', and PETSC produces converged results. However, after > providing hand-coded Jacobian through 'TSSetIJacobian', the internal > SNES object fails (residual norm does not change), and TS reports > 'DIVERGED_STEP_REJECTED'. > > I have tried to add the option '-snes_test_jacobian' to see if the > hand-coded jacobian is somewhere wrong, but it shows '||J - > Jfd||_F/||J||_F = 1.07488e-10, ||J - Jfd||_F = 2.14458e-07', > indicating that the hand-coded jacobian is correct. > > Then, I added a monitor for the internal SNES object through > 'SNESMonitorSet', in which the solution vector will be displayed at > each iteration. It is interesting to find that, if the jacobian is not > provided, meaning finite-difference is utilized for jacobian > evaluation internally, the solution vector converges to steady > solution and the SNES residual norm is reduced continuously. However, > it turns out that, as long as the jacobian is provided, the solution > vector will NEVER get changed! So the solution procedure stucked! > > This is quite strange! Hope to get some advice. > PETSC version=3.14.6, program run in serial mode. > > Regards > > Yu Cang From bsmith at petsc.dev Mon Oct 25 09:55:12 2021 From: bsmith at petsc.dev (Barry Smith) Date: Mon, 25 Oct 2021 10:55:12 -0400 Subject: [petsc-users] [petsc-maint] PETSc () -- MPI -- Versions Conflict on Mac 1 In-Reply-To: References: Message-ID: <8DE6DAA6-0CB5-4F5A-87DA-ABFC8F1E271F@petsc.dev> Send the output from otool -L libdeal_II.g.9.3.0.dylib otool -L libslepc.3.15.dylib otool -L libpetsc.3.15.dylib You will need to find the directories of these libraries and include them in the otool command. > On Oct 25, 2021, at 10:00 AM, Ahmed Galal wrote: > > Hello, > > I tried to run Step-17 in Dealii, a PETSc Dependenant software, but I got the following error. Given I have two versions of PETSc. I use the other one for another software. > > I googled and found that the issue is solved by commenting out the line "ADD_FLAGS(DEAL_II_LINKER_FLAGS "-fuse-ld=gold")" in "cmake/checks/check_01_compiler_features.cmake", make test, worked. 
> --- > https://groups.google.com/u/1/g/dealii/c/JxUOyd_4eFM?pli=1https://groups.google.com/u/1/g/dealii/c/JxUOyd_4eFM?pli=1 > --- > > Any suggestions for how to get around that error "PETSc Error --- Application was linked against both OpenMPI and MPICH based MPI libraries and will not run correctly" in the new version of DealII.9.3 as I could not find a .cmake similar file to edit? > > Thanks in advance! > Ahmed > > Error: > ------------------------ > > PETSc Error --- Application was linked against both OpenMPI and MPICH based MPI libraries and will not run correctly > > [----] *** Process received signal *** > > [----] Signal: Segmentation fault: 11 (11) > > [----] Signal code: Address not mapped (1) > > [----] Failing at address: 0x68 > > [----] [ 0] 0 libsystem_platform.dylib 0x00007fff2041cd7d _sigtramp + 29 > > [----] [ 1] 0 ??? 0x0000000060b9d791 0x0 + 1622792081 > > [----] [ 2] 0 libsystem_c.dylib 0x00007fff202edfcc vfprintf_l + 28 > > [----] [ 3] 0 libsystem_c.dylib 0x00007fff202e69a2 fprintf + 160 > > [----] [ 4] 0 libpetsc.3.15.dylib 0x000000014c224f0d PetscVFPrintfDefault + 685 > > [ 5] 0 libpetsc.3.15.dylib 0x000000014c2280d6 PetscFPrintf + 726 > > [ 6] 0 libpetsc.3.15.dylib 0x000000014c21d537 PetscErrorPrintfDefault + 375 > > [ 7] 0 libpetsc.3.15.dylib 0x000000014c21d66a PetscTraceBackErrorHandler + 154 > > [ 8] 0 libpetsc.3.15.dylib 0x000000014c21713c PetscError + 716 > > [ 9] 0 libslepc.3.15.dylib 0x000000014bc9cc8c SlepcInitialize + 428 > > [10] 0 libdeal_II.g.9.3.0.dylib 0x00000001151ae627 _ZN6dealii9Utilities3MPI16MPI_InitFinalizeC2ERiRPPcj + 135 > > [11] 0 libdeal_II.g.9.3.0.dylib 0x00000001151aece9 _ZN6dealii9Utilities3MPI16MPI_InitFinalizeC1ERiRPPcj + 9 > > [12] 0 step-17 0x000000010f244892 main + 82 > > [13] 0 libdyld.dylib 0x00007fff203f2f3d start + 1 > > [14] 0 ??? 0x0000000000000001 0x0 + 1 > > *** End of error message *** > > make[3]: *** [CMakeFiles/run] Segmentation fault: 11 > > make[2]: *** [CMakeFiles/run.dir/all] Error 2 > > make[1]: *** [CMakeFiles/run.dir/rule] Error 2 > > make: *** [run] Error 2 > > bash-3.2$ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jekozdon at nps.edu Mon Oct 25 11:37:51 2021 From: jekozdon at nps.edu (Kozdon, Jeremy (CIV)) Date: Mon, 25 Oct 2021 16:37:51 +0000 Subject: [petsc-users] Why PetscDestroy global collective semantics? In-Reply-To: References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> <87zgr0fjgz.fsf@jedbrown.org> <171199C7-42C9-4D5C-96FD-83F08CFA33A8@gmail.com> <04f4d613-da5c-c321-35ea-ff6ece8e59ef@cimne.upc.edu> <45846287-9326-4CA4-8C00-7121801DA01B@petsc.dev> Message-ID: <43DEA1A5-C36F-4F13-B732-834124A3CDBA@nps.edu> I the PETSc.jl stuff I?ve worked on, I punted on the issue and only register a finalizer when there is 1 MPI rank, so something like this when objects are created: if MPI.Comm_size(comm) == 1 finalizer(destroy, mat) end see: https://github.com/JuliaParallel/PETSc.jl/blob/581f37990b6e54fd31cf2bec8e938d51a73dbc92/src/mat.jl#L210-L212 (warning this is a WIP branch, but shouldn?t change any in the next few weeks). I?ve tried to think about some other ways of handling this, such as some sort of collective clean up routine that could be called or exploring using a thread to handle the destroy, but not given it a ton of thought and everything could involve collective communication which I would like to avoid. 
I'll have to dig a little bit more into petsc4py discussed elsewhere in the thread, but it seems even there you don't get to completely forget about cleanup. I have been wondering if you could use threading to handle this, but then I think a collective barrier would be needed which would not be so nice.
Another garbage collection issue I found was that if you rely on the garbage collector for serial objects and you allow PETSc to be finalized and reinitialized, you can end up with the garbage collector trying to clean up objects from previous runs. To get around this, I introduced a petsc age to my global petsc object and each object also knew what petsc age it was created during. And an object was only destroyed when it was created in the current petsc age.

function destroy(M::AbstractMat{PetscLib}) where {PetscLib}
    if !(finalized(PetscLib)) && M.age == getlib(PetscLib).age && M.ptr != C_NULL
        LibPETSc.MatDestroy(PetscLib, M)
    end
    M.ptr = C_NULL
    return nothing
end

see: https://github.com/JuliaParallel/PETSc.jl/blob/581f37990b6e54fd31cf2bec8e938d51a73dbc92/src/mat.jl#L14-L22

> On Oct 23, 2021, at 11:29 PM, Patrick Sanan wrote: > > > NPS WARNING: *external sender* verify before acting. > > > I think Jeremy (cc'd) has also been thinking about this in the context of PETSc.jl > > Stefano Zampini wrote on Sun. 24 Oct. 2021 at 07:52: > Non-deterministic garbage collection is an issue from Python too, and firedrake folks are also working on that. > > We may consider deferring all calls to MPI_Comm_free done on communicators with 1 as ref count (i.e., the call will actually wipe out some internal MPI data) in a collective call that can be either run by the user (on PETSC_COMM_WORLD), or at PetscFinalize() stage. > I.e., something like that > > #define MPI_Comm_free(comm) PutCommInAList(comm) > > Comm creation is collective by definition, and thus collectiveness of the order of the destruction can be easily enforced. > I don't see problems with 3rd party libraries using comms, since we always duplicate the comm we passed them > > Lawrence, do you think this may help you? > > Thanks > Stefano > > On Sun 24 Oct 2021 at 05:58 Barry Smith wrote: > > Ahh, this makes perfect sense. > > The code for PetscObjectRegisterDestroy() and the actual destruction (called in PetscFinalize()) is very simple and can be found in src/sys/objects/destroy.c PetscObjectRegisterDestroy(), PetscObjectRegisterDestroyAll(). > > You could easily maintain a new array like PetscObjectRegisterGCDestroy_Objects[] and add objects with PetscObjectRegisterGCDestroy() and then destroy them with PetscObjectRegisterDestroyGCAll(). The only tricky part is that you have to make sure, in the context of your Julia MPI, that PetscObjectRegisterDestroyGCAll() is called collectively over all the MPI ranks (that is it has to be called where all the ranks have made the same progress on MPI communication) that have registered objects to destroy, generally PETSC_COMM_ALL. We would be happy to incorporate such a system into the PETSc source with a merge request. > > Barry > >> On Oct 23, 2021, at 10:40 PM, Alberto F. Martín wrote: >> >> Thanks all for your very insightful answers. >> >> We are leveraging PETSc from Julia in a parallel distributed memory context (several MPI tasks running the Julia REPL each). >> >> Julia uses Garbage Collection (GC), and we would like to destroy the PETSc objects automatically when the GC decides so along the simulation.
>> >> In this context, we cannot guarantee deterministic destruction on all MPI tasks as the GC decisions are local to each task, no global semantics guaranteed. >> >> As far as I understand from your answers, there seems to be the possibility to defer the destruction of objects till points in the parallel program in which you can guarantee collective semantics, correct? If yes I guess that this may occur at any point in the simulation, not necessarily at shut down via PetscFinalize(), right? >> >> Best regards, >> >> Alberto. >> >> >> >> On 24/10/21 1:10 am, Jacob Faibussowitsch wrote: >>> Depending on the use-case you may also find PetscObjectRegisterDestroy() useful. If you can?t guarantee your PetscObjectDestroy() calls are collective, but have some other collective section you may call it then to punt the destruction of your object to PetscFinalize() which is guaranteed to be collective. >>> >>> https://petsc.org/main/docs/manualpages/Sys/PetscObjectRegisterDestroy.html >>> >>> Best regards, >>> >>> Jacob Faibussowitsch >>> (Jacob Fai - booss - oh - vitch) >>> >>>> On Oct 22, 2021, at 23:33, Jed Brown wrote: >>>> >>>> Junchao Zhang writes: >>>> >>>>> On Fri, Oct 22, 2021 at 9:13 PM Barry Smith wrote: >>>>> >>>>>> >>>>>> One technical reason is that PetscHeaderDestroy_Private() may call >>>>>> PetscCommDestroy() which may call MPI_Comm_free() which is defined by the >>>>>> standard to be collective. Though PETSc tries to limit its use of new MPI >>>>>> communicators (for example generally many objects shared the same >>>>>> communicator) if we did not free those we no longer need when destroying >>>>>> objects we could run out. >>>>>> >>>>> PetscCommDestroy() might call MPI_Comm_free() , but it is very unlikely. >>>>> Petsc uses reference counting on communicators, so in PetscCommDestroy(), >>>>> it likely just decreases the count. In other words, PetscCommDestroy() is >>>>> cheap and in effect not collective. >>>> >>>> Unless it's the last reference to a given communicator, which is a risky/difficult thing for a user to guarantee and the consequences are potentially dire (deadlock being way worse than a crash) when the user's intent is to relax ordering for destruction. >>>> >>>> Alberto, what is the use case in which deterministic destruction is problematic? If you relax it for individual objects, is there a place you can be collective to collect any stale communicators? >>> >> -- >> Alberto F. Mart?n-Huertas >> Senior Researcher, PhD. Computational Science >> Centre Internacional de M?todes Num?rics a l'Enginyeria (CIMNE) >> Parc Mediterrani de la Tecnologia, UPC >> >> Esteve Terradas 5, Building C3, Office 215 >> , >> 08860 Castelldefels (Barcelona, Spain) >> Tel.: (+34) 9341 34223 >> >> e-mail:amartin at cimne.upc.edu >> >> >> FEMPAR project co-founder >> web: >> http://www.fempar.org >> >> >> ********************** >> IMPORTANT ANNOUNCEMENT >> >> The information contained in this message and / or attached file (s), sent from CENTRO INTERNACIONAL DE METODES NUMERICS EN ENGINYERIA-CIMNE, >> is confidential / privileged and is intended to be read only by the person (s) to the one (s) that is directed. Your data has been incorporated >> into the treatment system of CENTRO INTERNACIONAL DE METODES NUMERICS EN ENGINYERIA-CIMNE by virtue of its status as client, user of the website, >> provider and / or collaborator in order to contact you and send you information that may be of your interest and resolve your queries. 
>> You can exercise your rights of access, rectification, limitation of treatment, deletion, and opposition / revocation, in the terms established >> by the current regulations on data protection, directing your request to the postal address C / Gran Capit?, s / n Building C1 - 2nd Floor - >> Office C15 -Campus Nord - UPC 08034 Barcelona or via email to >> dpo at cimne.upc.edu >> >> >> If you read this message and it is not the designated recipient, or you have received this communication in error, we inform you that it is >> totally prohibited, and may be illegal, any disclosure, distribution or reproduction of this communication, and please notify us immediately. >> and return the original message to the address mentioned above. >> > > > > -- > Stefano From cheng at cerfacs.fr Mon Oct 25 12:01:42 2021 From: cheng at cerfacs.fr (Lionel CHENG) Date: Mon, 25 Oct 2021 19:01:42 +0200 (CEST) Subject: [petsc-users] Convergence on Axisymmetric Poisson matrix In-Reply-To: <18054E4C-B03B-4821-9D30-619E9B6BB173@petsc.dev> References: <726522925.59047959.1635116414656.JavaMail.zimbra@cerfacs.fr> <18054E4C-B03B-4821-9D30-619E9B6BB173@petsc.dev> Message-ID: <2045838682.60349186.1635181302892.JavaMail.zimbra@cerfacs.fr> We are running with the -ksp_norm_type unpreconditioned so the convergence is done with the true residual for all the previous tests. I have a case with 800 000 nodes that I have run for 200 iterations on 36 CPU cor es (Intel Xeon Gold 6140 - Skylake) and the Poisson solver gives me | Krylov Solver | Poisson running time [s] | | `cg` | 3.9150E+00 | | `gmres` | 4.6527E+00 | | `bcgs` | 5.4416E+00 | Only the ksp_type has been changed in the following line: mpirun -np $nb_cpu $exec -ksp_initial_guess_nonzero true \ -ksp_type bcgs -ksp_norm_type unpreconditioned \ -ksp_rtol 1e-10 \ -pc_type gamg -mg_levels_pc_type sor -mg_levels_ksp_type richardson \ So CG is better than gmres (I have included the BiCGStab runs as well as I have talked about them earlier). I find it really weird that it behaves well with the preconditioner gamg I can't really find an explanation why, it is really against my intuition. Apart from that I have also played around with the number of multi-grid levels (-pc_mg_levels): | Number of MG levels | Poisson running time [s] | | ------------------------------- | ------------------------ | | 2 | 1.0385E+01 | | 3 | 5.0015E+00 | | 4 | 3.9150E+00 | | 5 | 4.5015E+00 | | 6 (default petsc for this case) | 4.5510E+00 | So that I find an optimum for 4 and not 6 as in the default PETSc configuration and not specifying anything. How should I choose the number of multi grid level depending on my problem? How does GAMG evaluate the number of grid levels required? Lionel De: "Barry Smith" ?: "cheng" Cc: "petsc-users" Envoy?: Lundi 25 Octobre 2021 15:33:50 Objet: Re: [petsc-users] Convergence on Axisymmetric Poisson matrix Are you running with -ksp_monitor_true_residual to track the b - A*x residual instead of just the preconditioned residual? GAMG definitely does not symmetrize the system but it is possible the preconditioner results in the solve "not seeing" the unsymmetry during the solution process and hence CG still converging; it would be dangerous to rely on this in general I think. You could also run this case with GMRES to see if that is better than the CG iterations. 
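For reference on the level-count question: GAMG keeps coarsening until the coarsest problem falls below a size threshold, which can be moved with -pc_gamg_coarse_eq_limit, and -pc_mg_levels caps the number of levels directly. The hierarchy that was actually built can be checked after setup (the same information also appears in -ksp_view); a sketch with illustrative variable names:

  /* assumes ksp has already been configured with -pc_type gamg */
  PC       pc;
  PetscInt nlevels;

  ierr = KSPSetUp(ksp);CHKERRQ(ierr);                 /* builds the GAMG hierarchy */
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCMGGetLevels(pc, &nlevels);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "GAMG built %D levels\n", nlevels);CHKERRQ(ierr);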
Barry On Oct 24, 2021, at 7:00 PM, Lionel CHENG < [ mailto:cheng at cerfacs.fr | cheng at cerfacs.fr ] > wrote: Hello everyone, I have some questions regarding a linear system that I am solving in my plasma simulations. We have in this case a strongly non-symmetric matrix due to the cylindrical coordinates for which the Laplacian cell is given by Fig. 2 for two kinds of triangles. The different unstructured grids have from 300 000 nodes to 7 000 000 nodes. To my understanding, CG should not work properly on this matrix but BiCGStab(1) should. When using SOR preconditioner it is indeed the case: -ksp_type cg -pc_type sor yields solutions in 10 to 20 times more iterations than -ksp_type bcgs -pc_type sor. However, when switching to -ksp_type cg -pc_type gamg the convergence is great and even slightly better than -ksp_type bcgs. I do not understand how CG is able to make the system converge when using GAMG although the matrix is non-symmetric ? Is GAMG able to somehow symmetrize the system? I have the impression that when using -pc_type gamg the Krylov solver is actually the Pre-relaxation and post-relaxation of the initial grid, is that right? For GAMG since the matrix is non-symmetric -mg_levels_pc_type sor for and -mg_levels_ksp_type richardson have been used and yields better results than the original chebychev solver. Sincerely yours, Lionel Cheng -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Mon Oct 25 12:52:26 2021 From: bsmith at petsc.dev (Barry Smith) Date: Mon, 25 Oct 2021 13:52:26 -0400 Subject: [petsc-users] Convergence on Axisymmetric Poisson matrix In-Reply-To: <2045838682.60349186.1635181302892.JavaMail.zimbra@cerfacs.fr> References: <726522925.59047959.1635116414656.JavaMail.zimbra@cerfacs.fr> <18054E4C-B03B-4821-9D30-619E9B6BB173@petsc.dev> <2045838682.60349186.1635181302892.JavaMail.zimbra@cerfacs.fr> Message-ID: <9795AF8B-8480-4EDF-93BF-E4BBAF7479CF@petsc.dev> > On Oct 25, 2021, at 1:01 PM, Lionel CHENG wrote: > > We are running with the -ksp_norm_type unpreconditioned so the convergence is done with the true residual for all the previous tests. I have a case with 800 000 nodes that I have run for 200 iterations on 36 CPU cores (Intel Xeon Gold 6140 - Skylake) and the Poisson solver gives me > > | Krylov Solver | Poisson running time [s] | > | `cg` | 3.9150E+00 | > | `gmres` | 4.6527E+00 | > | `bcgs` | 5.4416E+00 | > > Only the ksp_type has been changed in the following line: > mpirun -np $nb_cpu $exec -ksp_initial_guess_nonzero true \ > -ksp_type bcgs -ksp_norm_type unpreconditioned \ > -ksp_rtol 1e-10 \ > -pc_type gamg -mg_levels_pc_type sor -mg_levels_ksp_type richardson \ > > So CG is better than gmres (I have included the BiCGStab runs as well as I have talked about them earlier). I was not interested in the runtime, I was interested in the convergence behavior of CG vs GMRES for this problem. If CG is "faking it" then one would see the GMRES converging faster (its residual would get smaller with fewer iterations). Barry > I find it really weird that it behaves well with the preconditioner gamg I can't really find an explanation why, it is really against my intuition. 
> > Apart from that I have also played around with the number of multi-grid levels (-pc_mg_levels): > > | Number of MG levels | Poisson running time [s] | > | ------------------------------- | ------------------------ | > | 2 | 1.0385E+01 | > | 3 | 5.0015E+00 | > | 4 | 3.9150E+00 | > | 5 | 4.5015E+00 | > | 6 (default petsc for this case) | 4.5510E+00 | > > So that I find an optimum for 4 and not 6 as in the default PETSc configuration and not specifying anything. How should I choose the number of multi grid level depending on my problem? How does GAMG evaluate the number of grid levels required? > > Lionel > > De: "Barry Smith" > ?: "cheng" > Cc: "petsc-users" > Envoy?: Lundi 25 Octobre 2021 15:33:50 > Objet: Re: [petsc-users] Convergence on Axisymmetric Poisson matrix > > > Are you running with -ksp_monitor_true_residual to track the b - A*x residual instead of just the preconditioned residual? > > GAMG definitely does not symmetrize the system but it is possible the preconditioner results in the solve "not seeing" the unsymmetry during the solution process and hence CG still converging; it would be dangerous to rely on this in general I think. You could also run this case with GMRES to see if that is better than the CG iterations. > > Barry > > On Oct 24, 2021, at 7:00 PM, Lionel CHENG > wrote: > > Hello everyone, > > I have some questions regarding a linear system that I am solving in my plasma simulations. We have in this case a strongly non-symmetric matrix due to the cylindrical coordinates for which the Laplacian cell is given by Fig. 2 for two kinds of triangles. The different unstructured grids have from 300 000 nodes to 7 000 000 nodes. > > To my understanding, CG should not work properly on this matrix but BiCGStab(1) should. When using SOR preconditioner it is indeed the case: -ksp_type cg -pc_type sor yields solutions in 10 to 20 times more iterations than -ksp_type bcgs -pc_type sor. > > However, when switching to -ksp_type cg -pc_type gamg the convergence is great and even slightly better than -ksp_type bcgs. I do not understand how CG is able to make the system converge when using GAMG although the matrix is non-symmetric ? Is GAMG able to somehow symmetrize the system? I have the impression that when using -pc_type gamg the Krylov solver is actually the Pre-relaxation and post-relaxation of the initial grid, is that right? > > For GAMG since the matrix is non-symmetric -mg_levels_pc_type sor for and -mg_levels_ksp_type richardson have been used and yields better results than the original chebychev solver. > > Sincerely yours, > > Lionel Cheng > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cheng at cerfacs.fr Mon Oct 25 15:14:14 2021 From: cheng at cerfacs.fr (Lionel CHENG) Date: Mon, 25 Oct 2021 22:14:14 +0200 (CEST) Subject: [petsc-users] Convergence on Axisymmetric Poisson matrix In-Reply-To: <9795AF8B-8480-4EDF-93BF-E4BBAF7479CF@petsc.dev> References: <726522925.59047959.1635116414656.JavaMail.zimbra@cerfacs.fr> <18054E4C-B03B-4821-9D30-619E9B6BB173@petsc.dev> <2045838682.60349186.1635181302892.JavaMail.zimbra@cerfacs.fr> <9795AF8B-8480-4EDF-93BF-E4BBAF7479CF@petsc.dev> Message-ID: <305315019.60519677.1635192854702.JavaMail.zimbra@cerfacs.fr> The number of iterations at initialization (with rtol=1e-10) for both cg is 11 for gmres and 12 for cg so roughly the same. Switching to bcgs the number of iterations goes down to 6. So CG does not seem to fake it. 
Going back to the number of multi grid levels: how should I choose the number of multi grid level depending on the problem at hand? How does GAMG evaluate the number of grid levels required? Lionel De: "Barry Smith" ?: "cheng" Cc: "petsc-users" Envoy?: Lundi 25 Octobre 2021 19:52:26 Objet: Re: [petsc-users] Convergence on Axisymmetric Poisson matrix On Oct 25, 2021, at 1:01 PM, Lionel CHENG < [ mailto:cheng at cerfacs.fr | cheng at cerfacs.fr ] > wrote: We are running with the -ksp_norm_type unpreconditioned so the convergence is done with the true residual for all the previous tests. I have a case with 800 000 nodes that I have run for 200 iterations on 36 CPU cor es (Intel Xeon Gold 6140 - Skylake) and the Poisson solver gives me | Krylov Solver | Poisson running time [s] | | `cg` | 3.9150E+00 | | `gmres` | 4.6527E+00 | | `bcgs` | 5.4416E+00 | Only the ksp_type has been changed in the following line: mpirun -np $nb_cpu $exec -ksp_initial_guess_nonzero true \ -ksp_type bcgs -ksp_norm_type unpreconditioned \ -ksp_rtol 1e-10 \ -pc_type gamg -mg_levels_pc_type sor -mg_levels_ksp_type richardson \ So CG is better than gmres (I have included the BiCGStab runs as well as I have talked about them earlier). I was not interested in the runtime, I was interested in the convergence behavior of CG vs GMRES for this problem. If CG is "faking it" then one would see the GMRES converging faster (its residual would get smaller with fewer iterations). Barry BQ_BEGIN I find it really weird that it behaves well with the preconditioner gamg I can't really find an explanation why, it is really against my intuition. Apart from that I have also played around with the number of multi-grid levels (-pc_mg_levels): | Number of MG levels | Poisson running time [s] | | ------------------------------- | ------------------------ | | 2 | 1.0385E+01 | | 3 | 5.0015E+00 | | 4 | 3.9150E+00 | | 5 | 4.5015E+00 | | 6 (default petsc for this case) | 4.5510E+00 | So that I find an optimum for 4 and not 6 as in the default PETSc configuration and not specifying anything. How should I choose the number of multi grid level depending on my problem? How does GAMG evaluate the number of grid levels required? Lionel De: "Barry Smith" < [ mailto:bsmith at petsc.dev | bsmith at petsc.dev ] > ?: "cheng" < [ mailto:cheng at cerfacs.fr | cheng at cerfacs.fr ] > Cc: "petsc-users" < [ mailto:petsc-users at mcs.anl.gov | petsc-users at mcs.anl.gov ] > Envoy?: Lundi 25 Octobre 2021 15:33:50 Objet: Re: [petsc-users] Convergence on Axisymmetric Poisson matrix Are you running with -ksp_monitor_true_residual to track the b - A*x residual instead of just the preconditioned residual? GAMG definitely does not symmetrize the system but it is possible the preconditioner results in the solve "not seeing" the unsymmetry during the solution process and hence CG still converging; it would be dangerous to rely on this in general I think. You could also run this case with GMRES to see if that is better than the CG iterations. Barry BQ_BEGIN On Oct 24, 2021, at 7:00 PM, Lionel CHENG < [ mailto:cheng at cerfacs.fr | cheng at cerfacs.fr ] > wrote: Hello everyone, I have some questions regarding a linear system that I am solving in my plasma simulations. We have in this case a strongly non-symmetric matrix due to the cylindrical coordinates for which the Laplacian cell is given by Fig. 2 for two kinds of triangles. The different unstructured grids have from 300 000 nodes to 7 000 000 nodes. 
To my understanding, CG should not work properly on this matrix but BiCGStab(1) should. When using SOR preconditioner it is indeed the case: -ksp_type cg -pc_type sor yields solutions in 10 to 20 times more iterations than -ksp_type bcgs -pc_type sor. However, when switching to -ksp_type cg -pc_type gamg the convergence is great and even slightly better than -ksp_type bcgs. I do not understand how CG is able to make the system converge when using GAMG although the matrix is non-symmetric ? Is GAMG able to somehow symmetrize the system? I have the impression that when using -pc_type gamg the Krylov solver is actually the Pre-relaxation and post-relaxation of the initial grid, is that right? For GAMG since the matrix is non-symmetric -mg_levels_pc_type sor for and -mg_levels_ksp_type richardson have been used and yields better results than the original chebychev solver. Sincerely yours, Lionel Cheng BQ_END BQ_END -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Mon Oct 25 16:23:47 2021 From: bsmith at petsc.dev (Barry Smith) Date: Mon, 25 Oct 2021 17:23:47 -0400 Subject: [petsc-users] Convergence on Axisymmetric Poisson matrix In-Reply-To: <305315019.60519677.1635192854702.JavaMail.zimbra@cerfacs.fr> References: <726522925.59047959.1635116414656.JavaMail.zimbra@cerfacs.fr> <18054E4C-B03B-4821-9D30-619E9B6BB173@petsc.dev> <2045838682.60349186.1635181302892.JavaMail.zimbra@cerfacs.fr> <9795AF8B-8480-4EDF-93BF-E4BBAF7479CF@petsc.dev> <305315019.60519677.1635192854702.JavaMail.zimbra@cerfacs.fr> Message-ID: <3AA65142-0DAF-4797-90DE-8D376B752DE8@petsc.dev> > On Oct 25, 2021, at 4:14 PM, Lionel CHENG wrote: > > The number of iterations at initialization (with rtol=1e-10) for both cg is 11 for gmres and 12 for cg so roughly the same. Switching to bcgs the number of iterations goes down to 6. So CG does not seem to fake it. Yes, this sounds reasonable enough, maybe a touch of loss of orthogonality with the CG. > > Going back to the number of multi grid levels: how should I choose the number of multi grid level depending on the problem at hand? How does GAMG evaluate the number of grid levels required? Usually we just stick with the heuristic that GAMG comes up with. Maybe Mark has some better advice. Barry > > Lionel > > De: "Barry Smith" > ?: "cheng" > Cc: "petsc-users" > Envoy?: Lundi 25 Octobre 2021 19:52:26 > Objet: Re: [petsc-users] Convergence on Axisymmetric Poisson matrix > > > > On Oct 25, 2021, at 1:01 PM, Lionel CHENG > wrote: > > We are running with the -ksp_norm_type unpreconditioned so the convergence is done with the true residual for all the previous tests. I have a case with 800 000 nodes that I have run for 200 iterations on 36 CPU cores (Intel Xeon Gold 6140 - Skylake) and the Poisson solver gives me > > | Krylov Solver | Poisson running time [s] | > | `cg` | 3.9150E+00 | > | `gmres` | 4.6527E+00 | > | `bcgs` | 5.4416E+00 | > > Only the ksp_type has been changed in the following line: > mpirun -np $nb_cpu $exec -ksp_initial_guess_nonzero true \ > -ksp_type bcgs -ksp_norm_type unpreconditioned \ > -ksp_rtol 1e-10 \ > -pc_type gamg -mg_levels_pc_type sor -mg_levels_ksp_type richardson \ > > So CG is better than gmres (I have included the BiCGStab runs as well as I have talked about them earlier). > > I was not interested in the runtime, I was interested in the convergence behavior of CG vs GMRES for this problem. 
If CG is "faking it" then one would see the GMRES converging faster (its residual would get smaller with fewer iterations). > > Barry > > > I find it really weird that it behaves well with the preconditioner gamg I can't really find an explanation why, it is really against my intuition. > > Apart from that I have also played around with the number of multi-grid levels (-pc_mg_levels): > > | Number of MG levels | Poisson running time [s] | > | ------------------------------- | ------------------------ | > | 2 | 1.0385E+01 | > | 3 | 5.0015E+00 | > | 4 | 3.9150E+00 | > | 5 | 4.5015E+00 | > | 6 (default petsc for this case) | 4.5510E+00 | > > So that I find an optimum for 4 and not 6 as in the default PETSc configuration and not specifying anything. How should I choose the number of multi grid level depending on my problem? How does GAMG evaluate the number of grid levels required? > > Lionel > > De: "Barry Smith" > > ?: "cheng" > > Cc: "petsc-users" > > Envoy?: Lundi 25 Octobre 2021 15:33:50 > Objet: Re: [petsc-users] Convergence on Axisymmetric Poisson matrix > > > Are you running with -ksp_monitor_true_residual to track the b - A*x residual instead of just the preconditioned residual? > > GAMG definitely does not symmetrize the system but it is possible the preconditioner results in the solve "not seeing" the unsymmetry during the solution process and hence CG still converging; it would be dangerous to rely on this in general I think. You could also run this case with GMRES to see if that is better than the CG iterations. > > Barry > > On Oct 24, 2021, at 7:00 PM, Lionel CHENG > wrote: > > Hello everyone, > > I have some questions regarding a linear system that I am solving in my plasma simulations. We have in this case a strongly non-symmetric matrix due to the cylindrical coordinates for which the Laplacian cell is given by Fig. 2 for two kinds of triangles. The different unstructured grids have from 300 000 nodes to 7 000 000 nodes. > > To my understanding, CG should not work properly on this matrix but BiCGStab(1) should. When using SOR preconditioner it is indeed the case: -ksp_type cg -pc_type sor yields solutions in 10 to 20 times more iterations than -ksp_type bcgs -pc_type sor. > > However, when switching to -ksp_type cg -pc_type gamg the convergence is great and even slightly better than -ksp_type bcgs. I do not understand how CG is able to make the system converge when using GAMG although the matrix is non-symmetric ? Is GAMG able to somehow symmetrize the system? I have the impression that when using -pc_type gamg the Krylov solver is actually the Pre-relaxation and post-relaxation of the initial grid, is that right? > > For GAMG since the matrix is non-symmetric -mg_levels_pc_type sor for and -mg_levels_ksp_type richardson have been used and yields better results than the original chebychev solver. > > Sincerely yours, > > Lionel Cheng > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amartin at cimne.upc.edu Tue Oct 26 06:48:58 2021 From: amartin at cimne.upc.edu (=?UTF-8?Q?Alberto_F=2e_Mart=c3=adn?=) Date: Tue, 26 Oct 2021 22:48:58 +1100 Subject: [petsc-users] Why PetscDestroy global collective semantics? 
In-Reply-To: References: <9578e720-178a-6515-1e6d-99043fb20c39@cimne.upc.edu> <00B7FB22-AFC3-43BB-8C12-3A5F4AEDDF53@petsc.dev> <87zgr0fjgz.fsf@jedbrown.org> <171199C7-42C9-4D5C-96FD-83F08CFA33A8@gmail.com> <04f4d613-da5c-c321-35ea-ff6ece8e59ef@cimne.upc.edu> <45846287-9326-4CA4-8C00-7121801DA01B@petsc.dev> <621B3D93-E4C2-43B6-B6A1-6F8324CB7E8D@gmx.li> Message-ID: <72deae68-86ac-b391-1594-3a625d939e44@cimne.upc.edu> Thanks all for this second round of detailed responses. Highly appreciated! I think that I have enough material to continue exploring a solution in our particular context. Best regards, ?Alberto. On 25/10/21 11:12 pm, Betteridge, Jack D wrote: > Hi Everyone, > > I cannot fault Lawrence's explanation, that is precisely what I'm > implementing. The only difference is I was adding most of the logic > for the "resurrected objects map" to petsc4py rather than PETSc. Given > that this solution is truly Python agnostic, I will move what I have > written to C and merely add the interface to the functionality to > petsc4py. > > Indeed, this works out better for me as I was not enjoying writing all > the code in Cython! I'll post an update once there is a working > prototype in my PETSc fork, and the code is ready for testing. > > Cheers, > Jack > > > ------------------------------------------------------------------------ > *From:* Lawrence Mitchell > *Sent:* 25 October 2021 12:34 > *To:* Stefano Zampini > *Cc:* Barry Smith ; "Alberto F. Mart?n" > ; PETSc users list ; > Francesc Verdugo ; Betteridge, Jack D > > *Subject:* Re: [petsc-users] Why PetscDestroy global collective > semantics? > > ******************* > This email originates from outside Imperial. Do not click on links and > attachments unless you recognise the sender. > If you trust the sender, add them to your safe senders list > https://spam.ic.ac.uk/SpamConsole/Senders.aspx > to disable email > stamping for this address. > ******************* > Hi all, > > (I cc Jack who is doing the implementation in the petsc4py setting) > > > On 24 Oct 2021, at 06:51, Stefano Zampini > wrote: > > > > Non-deterministic garbage collection is an issue from Python too, > and firedrake folks are also working on that. > > > > We may consider deferring all calls to MPI_Comm_free done on > communicators with 1 as ref count (i.e., the call will actually wipe > out some internal MPI data) in a collective call that can be either > run by the user (on PETSC_COMM_WORLD), or at PetscFinalize() stage. > > I.e., something like that > > > > #define MPI_Comm_free(comm) PutCommInAList(comm) > > > > Comm creation is collective by definition, and thus collectiveness > of the order of the destruction can be easily enforced. > > I don't see problems with 3rd party libraries using comms, since we > always duplicate the comm we passed them > > > Lawrence, do you think this may help you? > > I think that it is not just MPI_Comm_free that is potentially problematic. > > Here are some additional areas off the top of my head: > > 1. PetscSF with -sf_type window. Destroy (when the refcount drops to > zero) calls MPI_Win_free (which is collective over comm) > 2. Deallocation of MUMPS objects is tremendously collective. > > In general the solution of just punting MPI_Comm_free to PetscFinalize > (or some user-defined time) is, I think, insufficient since it > requires us to audit the collectiveness of all `XXX_Destroy` functions > (including in third-party packages). 
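For concreteness, the deferral hook PETSc already ships, which the next paragraph refers to, looks roughly like this when driven from a managed-language finalizer. This is only a sketch of the mechanism, not the proposal itself, and the wrapper names are hypothetical:

#include <petscsys.h>

/* Sketch only: instead of destroying the object from a (non-collective)
   garbage-collector finalizer, hand it to PETSc to be destroyed later. */
static PetscErrorCode FinalizerHook(Mat A)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscObjectRegisterDestroy((PetscObject)A);CHKERRQ(ierr);  /* defer, do not destroy now */
  PetscFunctionReturn(0);
}

/* At a point that every rank is known to reach, flush the deferred
   objects collectively (PetscFinalize() also does this). */
static PetscErrorCode FlushDeferred(void)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscObjectRegisterDestroyAll();CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

As the discussion above points out, this only stays deadlock-free if every rank reaches the flush with the same set of registered objects, which is exactly what the creation-index bookkeeping described below is meant to guarantee.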
> > Barry's suggestion of resurrecting objects in finalisation using > PetscObjectRegisterDestroy and then collectively clearing that array > periodically is pretty close to the proposal that we cooked up I think. > > Jack can correct any missteps I make in explanation, but perhaps this > is helpful for Alberto: > > 1. Each PETSc communicator gets two new attributes "creation_index" > [an int64], "resurrected_objects" [a set-like thing] > 2. PetscHeaderCreate grabs the next creation_index out of the input > communicator and stashes it on the object. Since object creation is > collective this is guaranteed to agree on any given communicator > across processes. > 3. When the Python garbage collector tries to destroy PETSc objects we > resurrect the _C_ object in finalisation and stash it in > "resurrected_objects" on the communicator. > 4. Periodically (as a result of user intervention in the first > instance), we do garbage collection collectively on these resurrected > objects by performing a set intersection of the creation_indices > across the communicator's processes, and then calling XXXDestroy in > order on the sorted_by_creation_index set intersection. > > > I think that most of this infrastructure is agnostic of the managed > language, so Jack was doing implementation in PETSc (rather than > petsc4py). > > This wasn't a perfect solution (I recall that we could still cook up > situations in which objects would not be collected), but it did seem > to (in theory) solve any potential deadlock issues. > > Lawrence -------------- next part -------------- An HTML attachment was scrubbed... URL: From s6hsbran at uni-bonn.de Tue Oct 26 07:41:41 2021 From: s6hsbran at uni-bonn.de (Hannes Phil Niklas Brandt) Date: Tue, 26 Oct 2021 14:41:41 +0200 Subject: [petsc-users] Possibilities to run further computations based on intermediate results of VecScatter Message-ID: Hello, I am interested in the non-blocking, collective communication of Petsc-Vecs. Right now I am using VecScatterBegin and VecScatterEnd to scatter different entries of a parallel distributed MPI-Vec to local sequential vectors on each process. After the call to VecScatterEnd I perform separate computations on each block of the sequential Vec corresponding to a process. However, I would prefer to use each block of the local sequential Vec for those further computations as soon as I receive it from the corresponding process (so I do not want to wait for the whole scattering to finish). Are there functionalities in Petsc capable of this? I am trying to compute the matrix-vector-product for a parallel distributed MPI-Vec and a parallel distributed sparse matrix format I implemented myself. Each process needs entries from the whole MPI-Vec for the product, but does not have enough storage capacities to store those entries all at once, not even in a sparse format. Therefore, I need to process the entries in small blocks and add the results onto a local result vector. Best Regards Hannes p { margin-bottom: 0.25cm; line-height: 115%; background: transparent } From knepley at gmail.com Tue Oct 26 08:46:30 2021 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 26 Oct 2021 09:46:30 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? In-Reply-To: References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca> Message-ID: Okay, I ran it. Something seems off with the mesh. First, I cannot simply explain the partition. 
The number of shared vertices and edges does not seem to come from a straight cut. Second, the mesh look scrambled on output. Thanks, Matt On Sun, Oct 24, 2021 at 11:49 PM Eric Chamberland < Eric.Chamberland at giref.ulaval.ca> wrote: > Hi Matthew, > > ok, I started back from your ex44.c example and added the global array of > coordinates. I just have to code the creation of the local coordinates now. > > Eric > On 2021-10-20 6:55 p.m., Matthew Knepley wrote: > > On Wed, Oct 20, 2021 at 3:06 PM Eric Chamberland < > Eric.Chamberland at giref.ulaval.ca> wrote: > >> Hi Matthew, >> >> we tried to reproduce the error in a simple example. >> >> The context is the following: We hard coded the mesh and initial >> partition into the code (see sConnectivity and sInitialPartition) for 2 >> ranks and try to create a section in order to use the >> DMPlexNaturalToGlobalBegin function to retreive our initial element numbers. >> >> Now the call to DMPlexDistribute give different errors depending on what >> type of component we ask the field to be created. For our objective, we >> would like a global field to be created on elements only (like a P0 >> interpolation). >> >> We now have the following error generated: >> >> [0]PETSC ERROR: --------------------- Error Message >> -------------------------------------------------------------- >> [0]PETSC ERROR: Petsc has generated inconsistent data >> [0]PETSC ERROR: Inconsistency in indices, 18 should be 17 >> [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html >> for trouble shooting. >> [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar 30, 2021 >> [0]PETSC ERROR: ./bug on a named rohan by ericc Wed Oct 20 14:52:36 2021 >> [0]PETSC ERROR: Configure options >> --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 --with-mpi-compilers=1 >> --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 --with-cxx-dialect=C++14 >> --with-make-np=12 --with-shared-libraries=1 --with-debugging=yes >> --with-memalign=64 --with-visibility=0 --with-64-bit-indices=0 >> --download-ml=yes --download-mumps=yes --download-superlu=yes >> --download-hpddm=yes --download-slepc=yes --download-superlu_dist=yes >> --download-parmetis=yes --download-ptscotch=yes --download-metis=yes >> --download-strumpack=yes --download-suitesparse=yes --download-hypre=yes >> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >> --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >> --with-scalapack=1 >> --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include >> --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >> -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" >> [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at >> /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 >> [0]PETSC ERROR: #2 DMPlexCreateGlobalToNaturalSF() at >> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 >> [0]PETSC ERROR: #3 DMPlexDistribute() at >> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 >> [0]PETSC ERROR: #4 main() at bug_section.cc:159 >> [0]PETSC ERROR: No PETSc Option Table entries >> [0]PETSC ERROR: ----------------End of Error Message -------send entire >> error message to petsc-maint at mcs.anl.gov---------- >> >> Hope the attached code is self-explaining, note that to make it short, we >> have not included the final part of it, just the buggy part we are >> encountering right now... 
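The parallel test mesh mentioned in the reply just below can be built in a few lines, which also gives a small reproducer independent of the application mesh. A sketch, using the DMPlexCreateBoxMesh signature of PETSc 3.15:

#include <petscdmplex.h>

/* Sketch only: a 7 x 10 quadrilateral box mesh on the unit square. */
static PetscErrorCode CreateTestMesh(MPI_Comm comm, DM *dm)
{
  PetscInt       faces[2] = {7, 10};
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = DMPlexCreateBoxMesh(comm, 2, PETSC_FALSE /* quads */, faces, NULL, NULL, NULL, PETSC_TRUE, dm);CHKERRQ(ierr);
  ierr = DMSetFromOptions(*dm);CHKERRQ(ierr);
  ierr = DMViewFromOptions(*dm, NULL, "-dm_view");CHKERRQ(ierr);
  PetscFunctionReturn(0);
}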
>> >> Thanks for your insights, >> > Thanks for making the example. I tweaked it slightly. I put in a test case > that just makes a parallel 7 x 10 quad mesh. This works > fine. Thus I think it must be something connected with the original mesh. > It is hard to get a handle on it without the coordinates. > Do you think you could put the coordinate array in? I have added the code > to load them (see attached file). > > Thanks, > > Matt > >> Eric >> On 2021-10-06 9:23 p.m., Matthew Knepley wrote: >> >> On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland < >> Eric.Chamberland at giref.ulaval.ca> wrote: >> >>> Hi Matthew, >>> >>> we tried to use that. Now, we discovered that: >>> >>> 1- even if we "ask" for sfNatural creation with DMSetUseNatural, it is >>> not created because DMPlexCreateGlobalToNaturalSF looks for a "section": >>> this is not documented in DMSetUseNaturalso we are asking ourselfs: "is >>> this a permanent feature or a temporary situation?" >>> >> I think explaining this will help clear up a lot. >> >> What the Natural2Global map does is permute a solution vector into the >> ordering that it would have had prior to mesh distribution. >> Now, in order to do this permutation, I need to know the original >> (global) data layout. If it is not specified _before_ distribution, we >> cannot build the permutation. The section describes the data layout, so >> I need it before distribution. >> >> I cannot think of another way that you would implement this, but if you >> want something else, let me know. >> >>> 2- We then tried to create a "section" in different manners: we took the >>> code into the example petsc/src/dm/impls/plex/tests/ex15.c. However, we >>> ended up with a segfault: >>> >>> corrupted size vs. prev_size >>> [rohan:07297] *** Process received signal *** >>> [rohan:07297] Signal: Aborted (6) >>> [rohan:07297] Signal code: (-6) >>> [rohan:07297] [ 0] /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] >>> [rohan:07297] [ 1] /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] >>> [rohan:07297] [ 2] /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] >>> [rohan:07297] [ 3] /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] >>> [rohan:07297] [ 4] /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] >>> [rohan:07297] [ 5] /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] >>> [rohan:07297] [ 6] /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] >>> [rohan:07297] [ 7] /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] >>> [rohan:07297] [ 8] /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] >>> [rohan:07297] [ 9] >>> /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] >>> [rohan:07297] [10] >>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] >>> [rohan:07297] [11] >>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] >>> [rohan:07297] [12] >>> /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] >>> [rohan:07297] [13] /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] >>> >>> [rohan:07297] [14] >>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] >>> [rohan:07297] [15] >>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] >>> [rohan:07297] [16] >>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] >>> [rohan:07297] [17] >>> 
/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] >>> [rohan:07297] [18] >>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] >>> >> I am not sure what happened here, but if you could send a sample code, I >> will figure it out. >> >>> If we do not create a section, the call to DMPlexDistribute is >>> successful, but DMPlexGetGlobalToNaturalSF return a null SF pointer... >>> >> Yes, it just ignores it in this case because it does not have a global >> layout. >> >>> Here are the operations we are calling ( this is almost the code we are >>> using, I just removed verifications and creation of the connectivity which >>> use our parallel structure and code): >>> >>> =========== >>> >>> PetscInt* lCells = 0; >>> PetscInt lNumCorners = 0; >>> PetscInt lDimMail = 0; >>> PetscInt lnumCells = 0; >>> >>> //At this point we create the cells for PETSc expected input for >>> DMPlexBuildFromCellListParallel and set lNumCorners, lDimMail and lnumCells >>> to correct values. >>> ... >>> >>> DM lDMBete = 0 >>> DMPlexCreate(lMPIComm,&lDMBete); >>> >>> DMSetDimension(lDMBete, lDimMail); >>> >>> DMPlexBuildFromCellListParallel(lDMBete, >>> lnumCells, >>> PETSC_DECIDE, >>> >>> pLectureElementsLocaux.reqNbTotalSommets(), >>> lNumCorners, >>> lCells, >>> PETSC_NULL); >>> >>> DM lDMBeteInterp = 0; >>> DMPlexInterpolate(lDMBete, &lDMBeteInterp); >>> DMDestroy(&lDMBete); >>> lDMBete = lDMBeteInterp; >>> >>> DMSetUseNatural(lDMBete,PETSC_TRUE); >>> >>> PetscSF lSFMigrationSansOvl = 0; >>> PetscSF lSFMigrationOvl = 0; >>> DM lDMDistribueSansOvl = 0; >>> DM lDMAvecOverlap = 0; >>> >>> PetscPartitioner lPart; >>> DMPlexGetPartitioner(lDMBete, &lPart); >>> PetscPartitionerSetFromOptions(lPart); >>> >>> PetscSection section; >>> PetscInt numFields = 1; >>> PetscInt numBC = 0; >>> PetscInt numComp[1] = {1}; >>> PetscInt numDof[4] = {1, 0, 0, 0}; >>> PetscInt bcFields[1] = {0}; >>> IS bcPoints[1] = {NULL}; >>> >>> DMSetNumFields(lDMBete, numFields); >>> >>> DMPlexCreateSection(lDMBete, NULL, numComp, numDof, numBC, bcFields, >>> bcPoints, NULL, NULL, §ion); >>> DMSetLocalSection(lDMBete, section); >>> >>> DMPlexDistribute(lDMBete, 0, &lSFMigrationSansOvl, >>> &lDMDistribueSansOvl); // segfault! >>> >>> =========== >>> >>> So we have other question/remarks: >>> >>> 3- Maybe PETSc expect something specific that is missing/not verified: >>> for example, we didn't gave any coordinates since we just want to partition >>> and compute overlap for the mesh... and then recover our element numbers in >>> a "simple way" >>> >>> 4- We are telling ourselves it is somewhat a "big price to pay" to have >>> to build an unused section to have the global to natural ordering set ? >>> Could this requirement be avoided? >>> >> I don't think so. There would have to be _some_ way of describing your >> data layout in terms of mesh points, and I do not see how you could use >> less memory doing that. >> >>> 5- Are there any improvement towards our usages in 3.16 release? >>> >> Let me try and run the code above. 
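Putting the pieces of this exchange together, a minimal way to give the DM the cell-wise (P0) layout it needs before distribution is sketched below. The variable names are illustrative, not taken from the attached code; the point is only that the local section must exist, and DMSetUseNatural must be called, before DMPlexDistribute:

#include <petscdmplex.h>

/* Sketch only: one dof per cell, set up before distribution so the
   global-to-natural SF can be built. */
static PetscErrorCode SetCellSectionAndDistribute(DM dm, DM *dmDist)
{
  PetscSection   s;
  PetscSF        migrationSF;
  PetscInt       c, cStart, cEnd, pStart, pEnd;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = DMSetUseNatural(dm, PETSC_TRUE);CHKERRQ(ierr);                 /* before distribution */
  ierr = DMPlexGetChart(dm, &pStart, &pEnd);CHKERRQ(ierr);
  ierr = DMPlexGetHeightStratum(dm, 0, &cStart, &cEnd);CHKERRQ(ierr);   /* height 0 = cells */
  ierr = PetscSectionCreate(PetscObjectComm((PetscObject)dm), &s);CHKERRQ(ierr);
  ierr = PetscSectionSetChart(s, pStart, pEnd);CHKERRQ(ierr);
  for (c = cStart; c < cEnd; ++c) {
    ierr = PetscSectionSetDof(s, c, 1);CHKERRQ(ierr);                   /* one value per cell */
  }
  ierr = PetscSectionSetUp(s);CHKERRQ(ierr);
  ierr = DMSetLocalSection(dm, s);CHKERRQ(ierr);
  ierr = PetscSectionDestroy(&s);CHKERRQ(ierr);
  ierr = DMPlexDistribute(dm, 0, &migrationSF, dmDist);CHKERRQ(ierr);
  ierr = PetscSFDestroy(&migrationSF);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

With a layout like this in place, DMPlexGetGlobalToNaturalSF on the distributed DM should return a non-null SF, and DMPlexGlobalToNaturalBegin/End (or the NaturalToGlobal pair) can then move cell-wise data between the two orderings.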
>> >> Thanks, >> >> Matt >> >>> Thanks, >>> >>> Eric >>> >>> >>> On 2021-09-29 7:39 p.m., Matthew Knepley wrote: >>> >>> On Wed, Sep 29, 2021 at 5:18 PM Eric Chamberland < >>> Eric.Chamberland at giref.ulaval.ca> wrote: >>> >>>> Hi, >>>> >>>> I come back with _almost_ the original question: >>>> >>>> I would like to add an integer information (*our* original element >>>> number, not petsc one) on each element of the DMPlex I create with >>>> DMPlexBuildFromCellListParallel. >>>> >>>> I would like this interger to be distribruted by or the same way >>>> DMPlexDistribute distribute the mesh. >>>> >>>> Is it possible to do this? >>>> >>> >>> I think we already have support for what you want. If you call >>> >>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html >>> >>> before DMPlexDistribute(), it will compute a PetscSF encoding the global >>> to natural map. You >>> can get it with >>> >>> >>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html >>> >>> and use it with >>> >>> >>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html >>> >>> Is this sufficient? >>> >>> Thanks, >>> >>> Matt >>> >>> >>>> Thanks, >>>> >>>> Eric >>>> >>>> On 2021-07-14 1:18 p.m., Eric Chamberland wrote: >>>> > Hi, >>>> > >>>> > I want to use DMPlexDistribute from PETSc for computing overlapping >>>> > and play with the different partitioners supported. >>>> > >>>> > However, after calling DMPlexDistribute, I noticed the elements are >>>> > renumbered and then the original number is lost. >>>> > >>>> > What would be the best way to keep track of the element renumbering? >>>> > >>>> > a) Adding an optional parameter to let the user retrieve a vector or >>>> > "IS" giving the old number? >>>> > >>>> > b) Adding a DMLabel (seems a wrong good solution) >>>> > >>>> > c) Other idea? >>>> > >>>> > Of course, I don't want to loose performances with the need of this >>>> > "mapping"... >>>> > >>>> > Thanks, >>>> > >>>> > Eric >>>> > >>>> -- >>>> Eric Chamberland, ing., M. Ing >>>> Professionnel de recherche >>>> GIREF/Universit? Laval >>>> (418) 656-2131 poste 41 22 42 >>>> >>>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >>> >>> -- >>> Eric Chamberland, ing., M. Ing >>> Professionnel de recherche >>> GIREF/Universit? Laval >>> (418) 656-2131 poste 41 22 42 >>> >>> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mesh.png Type: image/png Size: 130060 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ex44.c Type: application/octet-stream Size: 9780 bytes Desc: not available URL: From pierre.seize at onera.fr Tue Oct 26 09:17:04 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Tue, 26 Oct 2021 16:17:04 +0200 Subject: [petsc-users] Question regarding DMPlex reordering Message-ID: <01e3a622-8561-93b4-4ebd-25331bd93486@onera.fr> Hi, I had the idea to try and renumber my mesh cells, as I've heard it's better: "neighbouring cells are stored next to one another, and memory access are faster". Right now, I load the mesh then I distribute it over the processes. I thought I'd try to permute the numbering between those two steps : DMPlexCreateFromFile DMPlexGetOrdering DMPlexPermute DMPlexDistribute but that gives me an error when it runs on more than one process: [0]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [0]PETSC ERROR: No support for this operation for this object type [0]PETSC ERROR: Number of dofs for point 0 in the local section should be positive [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting. [0]PETSC ERROR: Petsc Release Version 3.16.0, unknown [0]PETSC ERROR: ./build/bin/yanss on a? named ldmpe202z.onera by pseize Tue Oct 26 16:03:33 2021 [0]PETSC ERROR: Configure options --PETSC_ARCH=arch-ld-gcc --download-metis --download-parmetis --prefix=~/.local --with-cgns [0]PETSC ERROR: #1 PetscPartitionerDMPlexPartition() at /stck/pseize/softwares/petsc/src/dm/impls/plex/plexpartition.c:720 [0]PETSC ERROR: #2 DMPlexDistribute() at /stck/pseize/softwares/petsc/src/dm/impls/plex/plexdistribute.c:1630 [0]PETSC ERROR: #3 MeshLoadFromFile() at src/spatial.c:689 [0]PETSC ERROR: #4 main() at src/main.c:22 [0]PETSC ERROR: PETSc Option Table entries: [0]PETSC ERROR: -draw_comp 0 [0]PETSC ERROR: -mesh data/box.msh [0]PETSC ERROR: -mesh_view draw [0]PETSC ERROR: -riemann anrs [0]PETSC ERROR: -ts_max_steps 100 [0]PETSC ERROR: -vec_view_partition [0]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov---------- I checked and before I tried to reorder the mesh, the dm->localSection was NULL before entering DMPlexDistribute, and I was able to fix the error with DMSetLocalSection(dm, NULL) after DMPlexPermute, but it doesn't seems it's the right way to do what I want. Does someone have any advice ? Thanks in advance Pierre Seize -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahmed.galal8591 at gmail.com Mon Oct 25 09:00:53 2021 From: ahmed.galal8591 at gmail.com (Ahmed Galal) Date: Mon, 25 Oct 2021 09:00:53 -0500 Subject: [petsc-users] PETSc () -- MPI -- Versions Conflict on Mac 1 Message-ID: Hello, I tried to run Step-17 in Dealii, a PETSc Dependenant software, but I got the following error. Given I have two versions of PETSc. I use the other one for another software. I googled and found that the issue is solved by commenting out the line "ADD_FLAGS(DEAL_II_LINKER_FLAGS "-fuse-ld=gold")" in "cmake/checks/check_01_compiler_features.cmake", make test, worked. 
--- https://groups.google.com/u/1/g/dealii/c/JxUOyd_4eFM?pli=1https://groups.google.com/u/1/g/dealii/c/JxUOyd_4eFM?pli=1 --- Any suggestions for how to get around that error "PETSc Error --- Application was linked against both OpenMPI and MPICH based MPI libraries and will not run correctly" in the new version of DealII.9.3 as I could not find a .cmake similar file to edit? Thanks in advance! Ahmed Error: ------------------------ PETSc Error --- Application was linked against both OpenMPI and MPICH based MPI libraries and will not run correctly [----] *** Process received signal *** [----] Signal: Segmentation fault: 11 (11) [----] Signal code: Address not mapped (1) [----] Failing at address: 0x68 [----] [ 0] 0 libsystem_platform.dylib 0x00007fff2041cd7d _sigtramp + 29 [----] [ 1] 0 ??? 0x0000000060b9d791 0x0 + 1622792081 [----] [ 2] 0 libsystem_c.dylib 0x00007fff202edfcc vfprintf_l + 28 [----] [ 3] 0 libsystem_c.dylib 0x00007fff202e69a2 fprintf + 160 [----] [ 4] 0 libpetsc.3.15.dylib 0x000000014c224f0d PetscVFPrintfDefault + 685 [ 5] 0 libpetsc.3.15.dylib 0x000000014c2280d6 PetscFPrintf + 726 [ 6] 0 libpetsc.3.15.dylib 0x000000014c21d537 PetscErrorPrintfDefault + 375 [ 7] 0 libpetsc.3.15.dylib 0x000000014c21d66a PetscTraceBackErrorHandler + 154 [ 8] 0 libpetsc.3.15.dylib 0x000000014c21713c PetscError + 716 [ 9] 0 libslepc.3.15.dylib 0x000000014bc9cc8c SlepcInitialize + 428 [10] 0 libdeal_II.g.9.3.0.dylib 0x00000001151ae627 _ZN6dealii9Utilities3MPI16MPI_InitFinalizeC2ERiRPPcj + 135 [11] 0 libdeal_II.g.9.3.0.dylib 0x00000001151aece9 _ZN6dealii9Utilities3MPI16MPI_InitFinalizeC1ERiRPPcj + 9 [12] 0 step-17 0x000000010f244892 main + 82 [13] 0 libdyld.dylib 0x00007fff203f2f3d start + 1 [14] 0 ??? 0x0000000000000001 0x0 + 1 *** End of error message *** make[3]: *** [CMakeFiles/run] Segmentation fault: 11 make[2]: *** [CMakeFiles/run.dir/all] Error 2 make[1]: *** [CMakeFiles/run.dir/rule] Error 2 make: *** [run] Error 2 bash-3.2$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahmed.galal8591 at gmail.com Mon Oct 25 10:18:24 2021 From: ahmed.galal8591 at gmail.com (Ahmed Galal) Date: Mon, 25 Oct 2021 10:18:24 -0500 Subject: [petsc-users] [petsc-maint] PETSc () -- MPI -- Versions Conflict on Mac 1 In-Reply-To: <8DE6DAA6-0CB5-4F5A-87DA-ABFC8F1E271F@petsc.dev> References: <8DE6DAA6-0CB5-4F5A-87DA-ABFC8F1E271F@petsc.dev> Message-ID: Hello Barry, Not existent. Here is what I got: bash-3.2$ otool -L libdeal_II.g.9.3.0.dylib error: otool: can't open file: libdeal_II.g.9.3.0.dylib (No such file or directory) bash-3.2$ otool -L libslepc.3.15.dylib error: otool: can't open file: libslepc.3.15.dylib (No such file or directory) bash-3.2$ otool -L libpetsc.3.15.dylib error: otool: can't open file: libpetsc.3.15.dylib (No such file or directory) bash-3.2$ ls Kind regards, Ahmed On Mon, Oct 25, 2021 at 9:55 AM Barry Smith wrote: > > Send the output from > > otool -L libdeal_II.g.9.3.0.dylib > > otool -L libslepc.3.15.dylib > > otool -L libpetsc.3.15.dylib > > You will need to find the directories of these libraries and include > them in the otool command. > > > > On Oct 25, 2021, at 10:00 AM, Ahmed Galal > wrote: > > Hello, > > I tried to run Step-17 in Dealii, a PETSc Dependenant software, but I got > the following error. Given I have two versions of PETSc. I use the other > one for another software. 
> > I googled and found that the issue is solved by commenting out the line > "ADD_FLAGS(DEAL_II_LINKER_FLAGS "-fuse-ld=gold")" in > "cmake/checks/check_01_compiler_features.cmake", make test, worked. > --- > > https://groups.google.com/u/1/g/dealii/c/JxUOyd_4eFM?pli=1https://groups.google.com/u/1/g/dealii/c/JxUOyd_4eFM?pli=1 > > --- > > Any suggestions for how to get around that error "PETSc Error --- > Application was linked against both OpenMPI and MPICH based MPI libraries > and will not run correctly" in the new version of DealII.9.3 as I could > not find a .cmake similar file to edit? > > Thanks in advance! > Ahmed > > Error: > ------------------------ > > PETSc Error --- Application was linked against both OpenMPI and MPICH > based MPI libraries and will not run correctly > > [----] *** Process received signal *** > > [----] Signal: Segmentation fault: 11 (11) > > [----] Signal code: Address not mapped (1) > > [----] Failing at address: 0x68 > > [----] [ 0] 0 libsystem_platform.dylib 0x00007fff2041cd7d > _sigtramp + 29 > > [----] [ 1] 0 ??? 0x0000000060b9d791 > 0x0 + 1622792081 > > [----] [ 2] 0 libsystem_c.dylib 0x00007fff202edfcc > vfprintf_l + 28 > > [----] [ 3] 0 libsystem_c.dylib 0x00007fff202e69a2 > fprintf + 160 > > [----] [ 4] 0 libpetsc.3.15.dylib 0x000000014c224f0d > PetscVFPrintfDefault + 685 > > [ 5] 0 libpetsc.3.15.dylib 0x000000014c2280d6 > PetscFPrintf + 726 > > [ 6] 0 libpetsc.3.15.dylib 0x000000014c21d537 > PetscErrorPrintfDefault + 375 > > [ 7] 0 libpetsc.3.15.dylib 0x000000014c21d66a > PetscTraceBackErrorHandler + 154 > > [ 8] 0 libpetsc.3.15.dylib 0x000000014c21713c > PetscError + 716 > > [ 9] 0 libslepc.3.15.dylib 0x000000014bc9cc8c > SlepcInitialize + 428 > > [10] 0 libdeal_II.g.9.3.0.dylib 0x00000001151ae627 > _ZN6dealii9Utilities3MPI16MPI_InitFinalizeC2ERiRPPcj + 135 > > [11] 0 libdeal_II.g.9.3.0.dylib 0x00000001151aece9 > _ZN6dealii9Utilities3MPI16MPI_InitFinalizeC1ERiRPPcj + 9 > > [12] 0 step-17 0x000000010f244892 main + 82 > > [13] 0 libdyld.dylib 0x00007fff203f2f3d start + 1 > > [14] 0 ??? 0x0000000000000001 0x0 + 1 > > *** End of error message *** > > make[3]: *** [CMakeFiles/run] Segmentation fault: 11 > > make[2]: *** [CMakeFiles/run.dir/all] Error 2 > > make[1]: *** [CMakeFiles/run.dir/rule] Error 2 > > make: *** [run] Error 2 > > bash-3.2$ > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Tue Oct 26 10:02:13 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Tue, 26 Oct 2021 10:02:13 -0500 Subject: [petsc-users] Possibilities to run further computations based on intermediate results of VecScatter In-Reply-To: References: Message-ID: Hi, Hannes, It looks your concern is no enough memory to store vector entries needed in SpMV (instead of the performance one might gain by doing computation immediately upon arrival of data from neighbors). Please note in petsc SpMV, it does not need to store the whole vector locally, it only needs some entries (e.g., corresponding to nonzero columns). If storing these sparse entries causes memory consumption problems for you, then I wonder how you would store your matrix, which supposedly needs more memory. With that said, petsc vecscatter does not have something like VecScatterWaitAny(). For your experiment, you can leverage info provided by vecscatter, since the communication analysis part is hard. 
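The receive-and-process loop this enables might look roughly like the sketch below. It is heavily simplified: the matching sends, the packing of buffers, and the PetscSF queries that produce nranks, ranks[] and roffset[] are spelled out in the rest of this message, and ProcessBlock is a hypothetical user routine standing in for the per-block part of the matrix-vector product.

#include <petscsf.h>

/* Rough sketch only: consume each incoming block as soon as it arrives
   instead of waiting for the whole scatter to finish. */
static PetscErrorCode ReceiveAndProcess(MPI_Comm comm, PetscInt nranks, const PetscMPIInt ranks[],
                                        const PetscInt roffset[], PetscScalar *recvbuf,
                                        PetscErrorCode (*ProcessBlock)(PetscInt, const PetscScalar *, PetscInt))
{
  MPI_Request   *reqs;
  PetscMPIInt    tag, idx;
  PetscInt       i;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscCommGetNewTag(comm, &tag);CHKERRQ(ierr);
  ierr = PetscMalloc1(nranks, &reqs);CHKERRQ(ierr);
  for (i = 0; i < nranks; ++i) {
    PetscInt len = roffset[i+1] - roffset[i];
    ierr = MPI_Irecv(recvbuf + roffset[i], (PetscMPIInt)len, MPIU_SCALAR, ranks[i], tag, comm, &reqs[i]);CHKERRMPI(ierr);
  }
  /* ... the matching MPI_Isend calls, driven by PetscSFGetLeafRanks(), go here ... */
  for (i = 0; i < nranks; ++i) {
    ierr = MPI_Waitany((PetscMPIInt)nranks, reqs, &idx, MPI_STATUS_IGNORE);CHKERRMPI(ierr);
    ierr = (*ProcessBlock)((PetscInt)idx, recvbuf + roffset[idx], roffset[idx+1] - roffset[idx]);CHKERRQ(ierr);
  }
  ierr = PetscFree(reqs);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}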
Let's say you created VecScatter *sf* to scattering MPI Vec x to sequential Vec y (Note PetscSF and Vecscatter are the same type) Call PetscSFGetLeafRanks (PetscSF sf,PetscInt *niranks,const PetscMPIInt **iranks,const PetscInt **ioffset,const PetscInt **irootloc) to get send info niranks: number of MPI ranks to which this rank wants to send entries of x iranks[]: of length niranks, storing MPI ranks mentioned above ioffset[]: of length niranks+1. ioffset[] stores indices to irootloc[]. irootloc[]: irootloc[ioffset[i]..ioffset[i+1]] stores (local) indices of entries of x that should be sent to iranks[i] Call PetscSFGetRootRanks (PetscSF sf,PetscInt *nranks,const PetscMPIInt **ranks,const PetscInt **roffset,const PetscInt **rmine,const PetscInt **rremote) to get receive info nranks: number of MPI ranks from which this rank will receive entries of x ranks[]: of length nranks, storing MPI ranks mentioned above roffset[]: of length nranks+1. roffset[] stores indices to rmine[] rmine[]: rmine[roffset[i]..roffset[i+1]] stores (local) indices of entries of y that should receive data from ranks[i] Using above info, you can allocate send/recv buffers, post MPI_Isend/Irecv, and do MPI_Waitany() on MPI_Requests returned from MPI_Irecv. You can use PetscCommGetNewTag(PetscObjectComm((PetscObject)sf), &newtag) to get a good MPI tag for your own MPI_Isend/Irecv. --Junchao Zhang On Tue, Oct 26, 2021 at 7:41 AM Hannes Phil Niklas Brandt < s6hsbran at uni-bonn.de> wrote: > Hello, > > I > am interested in the non-blocking, collective communication of > Petsc-Vecs. > Right > now I am using VecScatterBegin and > VecScatterEnd to scatter different entries of a parallel distributed > MPI-Vec to local sequential vectors on each process. > After the call to VecScatterEnd I perform > separate > computations on each block of the > sequential Vec > corresponding to a process. > However, > I would prefer to use each block of the local sequential Vec for > those further > computations as soon as I receive it from the > corresponding process (so I do not want to > wait for the whole scattering > to finish). Are there functionalities in Petsc capable > of this? > > I > am trying to compute the matrix-vector-product for a parallel > distributed MPI-Vec and a parallel distributed sparse matrix format I > implemented myself. Each process needs entries from the whole MPI-Vec > for the product, but does not have enough storage capacities to store > those entries all at once, not even in a sparse format. Therefore, I need > to > process the entries in small blocks and add the results onto a local > result vector. > > Best > Regards > Hannes > p { margin-bottom: 0.25cm; line-height: 115%; background: transparent } > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Eric.Chamberland at giref.ulaval.ca Tue Oct 26 12:35:01 2021 From: Eric.Chamberland at giref.ulaval.ca (Eric Chamberland) Date: Tue, 26 Oct 2021 13:35:01 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? In-Reply-To: References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca> Message-ID: Here is a screenshot of the partition I hard coded (top) and vertices/element numbers (down): I have not yet modified the ex44.c example to properly assign the coordinates... 
(but I would not have done it like it is in the last version because the sCoords array is the global array with global vertices number) I will have time to do this tomorrow... Maybe I can first try to reproduce all this with a smaller mesh? Eric On 2021-10-26 9:46 a.m., Matthew Knepley wrote: > Okay, I ran it. Something seems off with the mesh. First, I cannot > simply explain the partition. The number of shared vertices and edges > does not seem to come from a straight cut. Second, the mesh look > scrambled on output. > > ? Thanks, > > ? ? Matt > > On Sun, Oct 24, 2021 at 11:49 PM Eric Chamberland > > wrote: > > Hi Matthew, > > ok, I started back from your ex44.c example and added the global > array of coordinates.? I just have to code the creation of the > local coordinates now. > > Eric > > On 2021-10-20 6:55 p.m., Matthew Knepley wrote: >> On Wed, Oct 20, 2021 at 3:06 PM Eric Chamberland >> > > wrote: >> >> Hi Matthew, >> >> we tried to reproduce the error in a simple example. >> >> The context is the following: We hard coded the mesh and >> initial partition into the code (see sConnectivity and >> sInitialPartition) for 2 ranks and try to create a section in >> order to use the DMPlexNaturalToGlobalBegin function to >> retreive our initial element numbers. >> >> Now the call to DMPlexDistribute give different errors >> depending on what type of component we ask the field to be >> created.? For our objective, we would like a global field to >> be created on elements only (like a P0 interpolation). >> >> We now have the following error generated: >> >> [0]PETSC ERROR: --------------------- Error Message >> -------------------------------------------------------------- >> [0]PETSC ERROR: Petsc has generated inconsistent data >> [0]PETSC ERROR: Inconsistency in indices, 18 should be 17 >> [0]PETSC ERROR: See >> https://www.mcs.anl.gov/petsc/documentation/faq.html >> for >> trouble shooting. >> [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar 30, 2021 >> [0]PETSC ERROR: ./bug on a? named rohan by ericc Wed Oct 20 >> 14:52:36 2021 >> [0]PETSC ERROR: Configure options >> --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 >> --with-mpi-compilers=1 --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 >> --with-cxx-dialect=C++14 --with-make-np=12 >> --with-shared-libraries=1 --with-debugging=yes >> --with-memalign=64 --with-visibility=0 >> --with-64-bit-indices=0 --download-ml=yes >> --download-mumps=yes --download-superlu=yes >> --download-hpddm=yes --download-slepc=yes >> --download-superlu_dist=yes --download-parmetis=yes >> --download-ptscotch=yes --download-metis=yes >> --download-strumpack=yes --download-suitesparse=yes >> --download-hypre=yes >> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >> --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. 
>> --with-scalapack=1 >> --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include >> --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >> -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" >> [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at >> /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 >> [0]PETSC ERROR: #2 DMPlexCreateGlobalToNaturalSF() at >> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 >> [0]PETSC ERROR: #3 DMPlexDistribute() at >> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 >> [0]PETSC ERROR: #4 main() at bug_section.cc:159 >> [0]PETSC ERROR: No PETSc Option Table entries >> [0]PETSC ERROR: ----------------End of Error Message >> -------send entire error message to petsc-maint at mcs.anl.gov >> ---------- >> >> Hope the attached code is self-explaining, note that to make >> it short, we have not included the final part of it, just the >> buggy part we are encountering right now... >> >> Thanks for your insights, >> >> Thanks for making the example. I tweaked it slightly. I put in a >> test case that just makes a parallel 7 x 10 quad mesh. This works >> fine. Thus I think it must be something connected with the >> original mesh. It is hard to get a handle on it without the >> coordinates. >> Do you think you could put the coordinate array in? I have added >> the code to load them (see attached file). >> >> ? Thanks, >> >> ? ? ?Matt >> >> Eric >> >> On 2021-10-06 9:23 p.m., Matthew Knepley wrote: >>> On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland >>> >> > wrote: >>> >>> Hi Matthew, >>> >>> we tried to use that.? Now, we discovered that: >>> >>> 1- even if we "ask" for sfNatural creation with >>> DMSetUseNatural, it is not created because >>> DMPlexCreateGlobalToNaturalSF looks for a "section": >>> this is not documented in DMSetUseNaturalso we are >>> asking ourselfs: "is this a permanent feature or a >>> temporary situation?" >>> >>> I think explaining this will help clear up a lot. >>> >>> What the Natural2Global?map does is permute a solution >>> vector into the ordering that it would have had prior to >>> mesh distribution. >>> Now, in order to do this permutation, I need to know the >>> original (global) data layout. If it is not specified >>> _before_ distribution, we >>> cannot build the permutation.? The section describes the >>> data layout, so I need it before distribution. >>> >>> I cannot think of another way that you would implement this, >>> but if you want something else, let me know. >>> >>> 2- We then tried to create a "section" in different >>> manners: we took the code into the example >>> petsc/src/dm/impls/plex/tests/ex15.c. However, we ended >>> up with a segfault: >>> >>> corrupted size vs. prev_size >>> [rohan:07297] *** Process received signal *** >>> [rohan:07297] Signal: Aborted (6) >>> [rohan:07297] Signal code:? 
(-6) >>> [rohan:07297] [ 0] >>> /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] >>> [rohan:07297] [ 1] >>> /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] >>> [rohan:07297] [ 2] >>> /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] >>> [rohan:07297] [ 3] >>> /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] >>> [rohan:07297] [ 4] >>> /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] >>> [rohan:07297] [ 5] >>> /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] >>> [rohan:07297] [ 6] >>> /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] >>> [rohan:07297] [ 7] >>> /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] >>> [rohan:07297] [ 8] >>> /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] >>> [rohan:07297] [ 9] >>> /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] >>> [rohan:07297] [10] >>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] >>> [rohan:07297] [11] >>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] >>> [rohan:07297] [12] >>> /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] >>> [rohan:07297] [13] >>> /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] >>> >>> [rohan:07297] [14] >>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] >>> [rohan:07297] [15] >>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] >>> [rohan:07297] [16] >>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] >>> [rohan:07297] [17] >>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] >>> [rohan:07297] [18] >>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] >>> >>> I am not sure what happened here, but if you could send a >>> sample code, I will figure it out. >>> >>> If we do not create a section, the call to >>> DMPlexDistribute is successful, but >>> DMPlexGetGlobalToNaturalSF return a null SF pointer... >>> >>> Yes, it just ignores it in this case because it does not >>> have a global layout. >>> >>> Here are the operations we are calling ( this is almost >>> the code we are using, I just removed verifications and >>> creation of the connectivity which use our parallel >>> structure and code): >>> >>> =========== >>> >>> ? PetscInt* lCells????? = 0; >>> ? PetscInt? lNumCorners = 0; >>> ? PetscInt? lDimMail??? = 0; >>> ? PetscInt? lnumCells?? = 0; >>> >>> ? //At this point we create the cells for PETSc expected >>> input for DMPlexBuildFromCellListParallel and set >>> lNumCorners, lDimMail and lnumCells to correct values. >>> ? ... >>> >>> ? DM?????? lDMBete = 0 >>> ? DMPlexCreate(lMPIComm,&lDMBete); >>> >>> ? DMSetDimension(lDMBete, lDimMail); >>> >>> DMPlexBuildFromCellListParallel(lDMBete, >>> lnumCells, >>> PETSC_DECIDE, >>> pLectureElementsLocaux.reqNbTotalSommets(), >>> lNumCorners, >>> lCells, >>> PETSC_NULL); >>> >>> ? DM lDMBeteInterp = 0; >>> ? DMPlexInterpolate(lDMBete, &lDMBeteInterp); >>> ? DMDestroy(&lDMBete); >>> ? lDMBete = lDMBeteInterp; >>> >>> ? DMSetUseNatural(lDMBete,PETSC_TRUE); >>> >>> ? PetscSF lSFMigrationSansOvl = 0; >>> ? PetscSF lSFMigrationOvl = 0; >>> ? DM lDMDistribueSansOvl = 0; >>> ? DM lDMAvecOverlap = 0; >>> >>> ? PetscPartitioner lPart; >>> ? DMPlexGetPartitioner(lDMBete, &lPart); >>> PetscPartitionerSetFromOptions(lPart); >>> >>> ? PetscSection?? section; >>> ? 
PetscInt?????? numFields?? = 1; >>> ? PetscInt?????? numBC?????? = 0; >>> ? PetscInt?????? numComp[1]? = {1}; >>> ? PetscInt?????? numDof[4]?? = {1, 0, 0, 0}; >>> ? PetscInt?????? bcFields[1] = {0}; >>> ? IS???????????? bcPoints[1] = {NULL}; >>> >>> ? DMSetNumFields(lDMBete, numFields); >>> >>> ? DMPlexCreateSection(lDMBete, NULL, numComp, numDof, >>> numBC, bcFields, bcPoints, NULL, NULL, §ion); >>> ? DMSetLocalSection(lDMBete, section); >>> >>> ? DMPlexDistribute(lDMBete, 0, &lSFMigrationSansOvl, >>> &lDMDistribueSansOvl); // segfault! >>> >>> =========== >>> >>> So we have other question/remarks: >>> >>> 3- Maybe PETSc expect something specific that is >>> missing/not verified: for example, we didn't gave any >>> coordinates since we just want to partition and compute >>> overlap for the mesh... and then recover our element >>> numbers in a "simple way" >>> >>> 4- We are telling ourselves it is somewhat a "big price >>> to pay" to have to build an unused section to have the >>> global to natural ordering set ? Could this requirement >>> be avoided? >>> >>> I don't think so. There would have to be _some_ way of >>> describing your data layout in terms of mesh points, and I >>> do not see how you could use less memory doing that. >>> >>> 5- Are there any improvement towards our usages in 3.16 >>> release? >>> >>> Let me try and run the code above. >>> >>> ? Thanks, >>> >>> ? ? ?Matt >>> >>> Thanks, >>> >>> Eric >>> >>> >>> On 2021-09-29 7:39 p.m., Matthew Knepley wrote: >>>> On Wed, Sep 29, 2021 at 5:18 PM Eric Chamberland >>>> >>> > wrote: >>>> >>>> Hi, >>>> >>>> I come back with _almost_ the original question: >>>> >>>> I would like to add an integer information (*our* >>>> original element >>>> number, not petsc one) on each element of the >>>> DMPlex I create with >>>> DMPlexBuildFromCellListParallel. >>>> >>>> I would like this interger to be distribruted by or >>>> the same way >>>> DMPlexDistribute distribute the mesh. >>>> >>>> Is it possible to do this? >>>> >>>> >>>> I think we already have support for what you want. If >>>> you call >>>> >>>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html >>>> >>>> >>>> before DMPlexDistribute(), it will compute a PetscSF >>>> encoding the global to natural map. You >>>> can get it with >>>> >>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html >>>> >>>> >>>> and use it with >>>> >>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html >>>> >>>> >>>> Is this sufficient? >>>> >>>> ? Thanks, >>>> >>>> ? ? ?Matt >>>> >>>> Thanks, >>>> >>>> Eric >>>> >>>> On 2021-07-14 1:18 p.m., Eric Chamberland wrote: >>>> > Hi, >>>> > >>>> > I want to use DMPlexDistribute from PETSc for >>>> computing overlapping >>>> > and play with the different partitioners supported. >>>> > >>>> > However, after calling DMPlexDistribute, I >>>> noticed the elements are >>>> > renumbered and then the original number is lost. >>>> > >>>> > What would be the best way to keep track of the >>>> element renumbering? >>>> > >>>> > a) Adding an optional parameter to let the user >>>> retrieve a vector or >>>> > "IS" giving the old number? >>>> > >>>> > b) Adding a DMLabel (seems a wrong good solution) >>>> > >>>> > c) Other idea? >>>> > >>>> > Of course, I don't want to loose performances >>>> with the need of this >>>> > "mapping"... >>>> > >>>> > Thanks, >>>> > >>>> > Eric >>>> > >>>> -- >>>> Eric Chamberland, ing., M. Ing >>>> Professionnel de recherche >>>> GIREF/Universit? 
Laval >>>> (418) 656-2131 poste 41 22 42 >>>> >>>> >>>> >>>> -- >>>> What most experimenters take for granted before they >>>> begin their experiments is infinitely more interesting >>>> than any results to which their experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>> >>> -- >>> Eric Chamberland, ing., M. Ing >>> Professionnel de recherche >>> GIREF/Universit? Laval >>> (418) 656-2131 poste 41 22 42 >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin >>> their experiments is infinitely more interesting than any >>> results to which their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >> >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to >> which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -- Eric Chamberland, ing., M. Ing Professionnel de recherche GIREF/Universit? Laval (418) 656-2131 poste 41 22 42 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: eejjfmbjimlkboec.png Type: image/png Size: 87901 bytes Desc: not available URL: From knepley at gmail.com Tue Oct 26 15:28:27 2021 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 26 Oct 2021 16:28:27 -0400 Subject: [petsc-users] Question regarding DMPlex reordering In-Reply-To: <01e3a622-8561-93b4-4ebd-25331bd93486@onera.fr> References: <01e3a622-8561-93b4-4ebd-25331bd93486@onera.fr> Message-ID: On Tue, Oct 26, 2021 at 10:17 AM Pierre Seize wrote: > Hi, I had the idea to try and renumber my mesh cells, as I've heard it's > better: "neighbouring cells are stored next to one another, and memory > access are faster". > Right now, I load the mesh then I distribute it over the processes. I > thought I'd try to permute the numbering between those two steps : > > DMPlexCreateFromFile > DMPlexGetOrdering > DMPlexPermute > DMPlexDistribute > > but that gives me an error when it runs on more than one process: > > [0]PETSC ERROR: --------------------- Error Message > -------------------------------------------------------------- > [0]PETSC ERROR: No support for this operation for this object type > [0]PETSC ERROR: Number of dofs for point 0 in the local section should be > positive > [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting. 
> [0]PETSC ERROR: Petsc Release Version 3.16.0, unknown > [0]PETSC ERROR: ./build/bin/yanss on a named ldmpe202z.onera by pseize > Tue Oct 26 16:03:33 2021 > [0]PETSC ERROR: Configure options --PETSC_ARCH=arch-ld-gcc > --download-metis --download-parmetis --prefix=~/.local --with-cgns > [0]PETSC ERROR: #1 PetscPartitionerDMPlexPartition() at > /stck/pseize/softwares/petsc/src/dm/impls/plex/plexpartition.c:720 > [0]PETSC ERROR: #2 DMPlexDistribute() at > /stck/pseize/softwares/petsc/src/dm/impls/plex/plexdistribute.c:1630 > [0]PETSC ERROR: #3 MeshLoadFromFile() at src/spatial.c:689 > [0]PETSC ERROR: #4 main() at src/main.c:22 > [0]PETSC ERROR: PETSc Option Table entries: > [0]PETSC ERROR: -draw_comp 0 > [0]PETSC ERROR: -mesh data/box.msh > [0]PETSC ERROR: -mesh_view draw > [0]PETSC ERROR: -riemann anrs > [0]PETSC ERROR: -ts_max_steps 100 > [0]PETSC ERROR: -vec_view_partition > [0]PETSC ERROR: ----------------End of Error Message -------send entire > error message to petsc-maint at mcs.anl.gov---------- > > I checked and before I tried to reorder the mesh, the dm->localSection > was NULL before entering DMPlexDistribute, and I was able to fix the > error with DMSetLocalSection(dm, NULL) after DMPlexPermute, but it > doesn't seems it's the right way to do what I want. Does someone have any > advice ? > > Oh, this is probably me trying to be too clever. If a local section is defined, then I try to use the number of dofs in it to load balance better. There should never be a negative number of dofs in the local section (a global section uses this to indicate a dof owned by another process). So eliminating the local section will definitely fix that error. Now the question of how you got a local section. DMPlexPermute() does not create one, so it seems like you had one ahead of time, and that the values were not valid. Note that you can probably get rid of some of the loading code using DMCreate(comm, &dm); DMSetType(dm, DMPLEX); DMSetFromOptions(dm); DMViewFromOptions(dm, NULL, "-mesh_view"); and use -dm_plex_filename databox,msh -mesh_view Thanks, Matt > Thanks in advance > > Pierre Seize > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Oct 26 15:35:42 2021 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 26 Oct 2021 16:35:42 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? In-Reply-To: References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca> Message-ID: On Tue, Oct 26, 2021 at 1:35 PM Eric Chamberland < Eric.Chamberland at giref.ulaval.ca> wrote: > Here is a screenshot of the partition I hard coded (top) and > vertices/element numbers (down): > > I have not yet modified the ex44.c example to properly assign the > coordinates... > > (but I would not have done it like it is in the last version because the > sCoords array is the global array with global vertices number) > > I will have time to do this tomorrow... > > Maybe I can first try to reproduce all this with a smaller mesh? > That might make it easier to find a problem. Thanks! Matt > Eric > On 2021-10-26 9:46 a.m., Matthew Knepley wrote: > > Okay, I ran it. Something seems off with the mesh. 
First, I cannot simply > explain the partition. The number of shared vertices and edges > does not seem to come from a straight cut. Second, the mesh look scrambled > on output. > > Thanks, > > Matt > > On Sun, Oct 24, 2021 at 11:49 PM Eric Chamberland < > Eric.Chamberland at giref.ulaval.ca> wrote: > >> Hi Matthew, >> >> ok, I started back from your ex44.c example and added the global array of >> coordinates. I just have to code the creation of the local coordinates now. >> >> Eric >> On 2021-10-20 6:55 p.m., Matthew Knepley wrote: >> >> On Wed, Oct 20, 2021 at 3:06 PM Eric Chamberland < >> Eric.Chamberland at giref.ulaval.ca> wrote: >> >>> Hi Matthew, >>> >>> we tried to reproduce the error in a simple example. >>> >>> The context is the following: We hard coded the mesh and initial >>> partition into the code (see sConnectivity and sInitialPartition) for 2 >>> ranks and try to create a section in order to use the >>> DMPlexNaturalToGlobalBegin function to retreive our initial element numbers. >>> >>> Now the call to DMPlexDistribute give different errors depending on what >>> type of component we ask the field to be created. For our objective, we >>> would like a global field to be created on elements only (like a P0 >>> interpolation). >>> >>> We now have the following error generated: >>> >>> [0]PETSC ERROR: --------------------- Error Message >>> -------------------------------------------------------------- >>> [0]PETSC ERROR: Petsc has generated inconsistent data >>> [0]PETSC ERROR: Inconsistency in indices, 18 should be 17 >>> [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html >>> for trouble shooting. >>> [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar 30, 2021 >>> [0]PETSC ERROR: ./bug on a named rohan by ericc Wed Oct 20 14:52:36 2021 >>> [0]PETSC ERROR: Configure options >>> --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 --with-mpi-compilers=1 >>> --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 --with-cxx-dialect=C++14 >>> --with-make-np=12 --with-shared-libraries=1 --with-debugging=yes >>> --with-memalign=64 --with-visibility=0 --with-64-bit-indices=0 >>> --download-ml=yes --download-mumps=yes --download-superlu=yes >>> --download-hpddm=yes --download-slepc=yes --download-superlu_dist=yes >>> --download-parmetis=yes --download-ptscotch=yes --download-metis=yes >>> --download-strumpack=yes --download-suitesparse=yes --download-hypre=yes >>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >>> --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. 
>>> --with-scalapack=1 >>> --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include >>> --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>> -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" >>> [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at >>> /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 >>> [0]PETSC ERROR: #2 DMPlexCreateGlobalToNaturalSF() at >>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 >>> [0]PETSC ERROR: #3 DMPlexDistribute() at >>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 >>> [0]PETSC ERROR: #4 main() at bug_section.cc:159 >>> [0]PETSC ERROR: No PETSc Option Table entries >>> [0]PETSC ERROR: ----------------End of Error Message -------send entire >>> error message to petsc-maint at mcs.anl.gov---------- >>> >>> Hope the attached code is self-explaining, note that to make it short, >>> we have not included the final part of it, just the buggy part we are >>> encountering right now... >>> >>> Thanks for your insights, >>> >> Thanks for making the example. I tweaked it slightly. I put in a test >> case that just makes a parallel 7 x 10 quad mesh. This works >> fine. Thus I think it must be something connected with the original mesh. >> It is hard to get a handle on it without the coordinates. >> Do you think you could put the coordinate array in? I have added the code >> to load them (see attached file). >> >> Thanks, >> >> Matt >> >>> Eric >>> On 2021-10-06 9:23 p.m., Matthew Knepley wrote: >>> >>> On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland < >>> Eric.Chamberland at giref.ulaval.ca> wrote: >>> >>>> Hi Matthew, >>>> >>>> we tried to use that. Now, we discovered that: >>>> >>>> 1- even if we "ask" for sfNatural creation with DMSetUseNatural, it is >>>> not created because DMPlexCreateGlobalToNaturalSF looks for a "section": >>>> this is not documented in DMSetUseNaturalso we are asking ourselfs: "is >>>> this a permanent feature or a temporary situation?" >>>> >>> I think explaining this will help clear up a lot. >>> >>> What the Natural2Global map does is permute a solution vector into the >>> ordering that it would have had prior to mesh distribution. >>> Now, in order to do this permutation, I need to know the original >>> (global) data layout. If it is not specified _before_ distribution, we >>> cannot build the permutation. The section describes the data layout, so >>> I need it before distribution. >>> >>> I cannot think of another way that you would implement this, but if you >>> want something else, let me know. >>> >>>> 2- We then tried to create a "section" in different manners: we took >>>> the code into the example petsc/src/dm/impls/plex/tests/ex15.c. However, >>>> we ended up with a segfault: >>>> >>>> corrupted size vs. 
prev_size >>>> [rohan:07297] *** Process received signal *** >>>> [rohan:07297] Signal: Aborted (6) >>>> [rohan:07297] Signal code: (-6) >>>> [rohan:07297] [ 0] /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] >>>> [rohan:07297] [ 1] /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] >>>> [rohan:07297] [ 2] /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] >>>> [rohan:07297] [ 3] /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] >>>> [rohan:07297] [ 4] /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] >>>> [rohan:07297] [ 5] /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] >>>> [rohan:07297] [ 6] /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] >>>> [rohan:07297] [ 7] /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] >>>> [rohan:07297] [ 8] /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] >>>> [rohan:07297] [ 9] >>>> /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] >>>> [rohan:07297] [10] >>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] >>>> [rohan:07297] [11] >>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] >>>> [rohan:07297] [12] >>>> /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] >>>> [rohan:07297] [13] /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] >>>> >>>> [rohan:07297] [14] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] >>>> [rohan:07297] [15] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] >>>> [rohan:07297] [16] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] >>>> [rohan:07297] [17] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] >>>> [rohan:07297] [18] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] >>>> >>> I am not sure what happened here, but if you could send a sample code, I >>> will figure it out. >>> >>>> If we do not create a section, the call to DMPlexDistribute is >>>> successful, but DMPlexGetGlobalToNaturalSF return a null SF pointer... >>>> >>> Yes, it just ignores it in this case because it does not have a global >>> layout. >>> >>>> Here are the operations we are calling ( this is almost the code we are >>>> using, I just removed verifications and creation of the connectivity which >>>> use our parallel structure and code): >>>> >>>> =========== >>>> >>>> PetscInt* lCells = 0; >>>> PetscInt lNumCorners = 0; >>>> PetscInt lDimMail = 0; >>>> PetscInt lnumCells = 0; >>>> >>>> //At this point we create the cells for PETSc expected input for >>>> DMPlexBuildFromCellListParallel and set lNumCorners, lDimMail and lnumCells >>>> to correct values. >>>> ... 
>>>> >>>> DM lDMBete = 0 >>>> DMPlexCreate(lMPIComm,&lDMBete); >>>> >>>> DMSetDimension(lDMBete, lDimMail); >>>> >>>> DMPlexBuildFromCellListParallel(lDMBete, >>>> lnumCells, >>>> PETSC_DECIDE, >>>> >>>> pLectureElementsLocaux.reqNbTotalSommets(), >>>> lNumCorners, >>>> lCells, >>>> PETSC_NULL); >>>> >>>> DM lDMBeteInterp = 0; >>>> DMPlexInterpolate(lDMBete, &lDMBeteInterp); >>>> DMDestroy(&lDMBete); >>>> lDMBete = lDMBeteInterp; >>>> >>>> DMSetUseNatural(lDMBete,PETSC_TRUE); >>>> >>>> PetscSF lSFMigrationSansOvl = 0; >>>> PetscSF lSFMigrationOvl = 0; >>>> DM lDMDistribueSansOvl = 0; >>>> DM lDMAvecOverlap = 0; >>>> >>>> PetscPartitioner lPart; >>>> DMPlexGetPartitioner(lDMBete, &lPart); >>>> PetscPartitionerSetFromOptions(lPart); >>>> >>>> PetscSection section; >>>> PetscInt numFields = 1; >>>> PetscInt numBC = 0; >>>> PetscInt numComp[1] = {1}; >>>> PetscInt numDof[4] = {1, 0, 0, 0}; >>>> PetscInt bcFields[1] = {0}; >>>> IS bcPoints[1] = {NULL}; >>>> >>>> DMSetNumFields(lDMBete, numFields); >>>> >>>> DMPlexCreateSection(lDMBete, NULL, numComp, numDof, numBC, bcFields, >>>> bcPoints, NULL, NULL, §ion); >>>> DMSetLocalSection(lDMBete, section); >>>> >>>> DMPlexDistribute(lDMBete, 0, &lSFMigrationSansOvl, >>>> &lDMDistribueSansOvl); // segfault! >>>> >>>> =========== >>>> >>>> So we have other question/remarks: >>>> >>>> 3- Maybe PETSc expect something specific that is missing/not verified: >>>> for example, we didn't gave any coordinates since we just want to partition >>>> and compute overlap for the mesh... and then recover our element numbers in >>>> a "simple way" >>>> >>>> 4- We are telling ourselves it is somewhat a "big price to pay" to have >>>> to build an unused section to have the global to natural ordering set ? >>>> Could this requirement be avoided? >>>> >>> I don't think so. There would have to be _some_ way of describing your >>> data layout in terms of mesh points, and I do not see how you could use >>> less memory doing that. >>> >>>> 5- Are there any improvement towards our usages in 3.16 release? >>>> >>> Let me try and run the code above. >>> >>> Thanks, >>> >>> Matt >>> >>>> Thanks, >>>> >>>> Eric >>>> >>>> >>>> On 2021-09-29 7:39 p.m., Matthew Knepley wrote: >>>> >>>> On Wed, Sep 29, 2021 at 5:18 PM Eric Chamberland < >>>> Eric.Chamberland at giref.ulaval.ca> wrote: >>>> >>>>> Hi, >>>>> >>>>> I come back with _almost_ the original question: >>>>> >>>>> I would like to add an integer information (*our* original element >>>>> number, not petsc one) on each element of the DMPlex I create with >>>>> DMPlexBuildFromCellListParallel. >>>>> >>>>> I would like this interger to be distribruted by or the same way >>>>> DMPlexDistribute distribute the mesh. >>>>> >>>>> Is it possible to do this? >>>>> >>>> >>>> I think we already have support for what you want. If you call >>>> >>>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html >>>> >>>> before DMPlexDistribute(), it will compute a PetscSF encoding the >>>> global to natural map. You >>>> can get it with >>>> >>>> >>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html >>>> >>>> and use it with >>>> >>>> >>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html >>>> >>>> Is this sufficient? 
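A minimal sketch of the call order described in the links above, assuming a local PetscSection describing the data layout has already been attached to the serial DM before distribution; the function and variable names are illustrative, the natural vector nvec is assumed to have been created by the caller with the original (natural) layout, and error checking is omitted:

===========

#include <petscdmplex.h>

static PetscErrorCode DistributeAndRecoverNatural(DM dm, Vec nvec, DM *dmDist)
{
  PetscSF migrationSF = NULL, g2nSF = NULL;
  Vec     gvec;

  /* Request the natural map; this must happen before distribution. */
  DMSetUseNatural(dm, PETSC_TRUE);
  DMPlexDistribute(dm, 0, &migrationSF, dmDist);

  /* The SF is NULL if no local section was attached before distributing. */
  DMPlexGetGlobalToNaturalSF(*dmDist, &g2nSF);

  /* Permute a global vector on the distributed DM back into the original
     (natural) ordering. */
  DMCreateGlobalVector(*dmDist, &gvec);
  /* ... fill gvec with the distributed field ... */
  DMPlexGlobalToNaturalBegin(*dmDist, gvec, nvec);
  DMPlexGlobalToNaturalEnd(*dmDist, gvec, nvec);

  VecDestroy(&gvec);
  PetscSFDestroy(&migrationSF);
  return 0;
}

===========

The same SF also supports the reverse direction through DMPlexNaturalToGlobalBegin/End, which is what later messages in this thread exercise.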
>>>> >>>> Thanks, >>>> >>>> Matt >>>> >>>> >>>>> Thanks, >>>>> >>>>> Eric >>>>> >>>>> On 2021-07-14 1:18 p.m., Eric Chamberland wrote: >>>>> > Hi, >>>>> > >>>>> > I want to use DMPlexDistribute from PETSc for computing overlapping >>>>> > and play with the different partitioners supported. >>>>> > >>>>> > However, after calling DMPlexDistribute, I noticed the elements are >>>>> > renumbered and then the original number is lost. >>>>> > >>>>> > What would be the best way to keep track of the element renumbering? >>>>> > >>>>> > a) Adding an optional parameter to let the user retrieve a vector or >>>>> > "IS" giving the old number? >>>>> > >>>>> > b) Adding a DMLabel (seems a wrong good solution) >>>>> > >>>>> > c) Other idea? >>>>> > >>>>> > Of course, I don't want to loose performances with the need of this >>>>> > "mapping"... >>>>> > >>>>> > Thanks, >>>>> > >>>>> > Eric >>>>> > >>>>> -- >>>>> Eric Chamberland, ing., M. Ing >>>>> Professionnel de recherche >>>>> GIREF/Universit? Laval >>>>> (418) 656-2131 poste 41 22 42 >>>>> >>>>> >>>> >>>> -- >>>> What most experimenters take for granted before they begin their >>>> experiments is infinitely more interesting than any results to which their >>>> experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>>> >>>> -- >>>> Eric Chamberland, ing., M. Ing >>>> Professionnel de recherche >>>> GIREF/Universit? Laval >>>> (418) 656-2131 poste 41 22 42 >>>> >>>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >>> >>> -- >>> Eric Chamberland, ing., M. Ing >>> Professionnel de recherche >>> GIREF/Universit? Laval >>> (418) 656-2131 poste 41 22 42 >>> >>> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: eejjfmbjimlkboec.png Type: image/png Size: 87901 bytes Desc: not available URL: From Eric.Chamberland at giref.ulaval.ca Tue Oct 26 23:17:36 2021 From: Eric.Chamberland at giref.ulaval.ca (Eric Chamberland) Date: Wed, 27 Oct 2021 00:17:36 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? 
In-Reply-To: References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca> Message-ID: Hi Matthew, the smallest mesh which crashes the code is a 2x5 mesh: See the modified ex44.c With smaller meshes(2x2, 2x4, etc), it passes...? But it bugs latter when I try to use DMPlexNaturalToGlobalBegin but let's keep that other problem for later... Thanks a lot for helping digging into this! :) Eric On 2021-10-26 4:35 p.m., Matthew Knepley wrote: > On Tue, Oct 26, 2021 at 1:35 PM Eric Chamberland > > wrote: > > Here is a screenshot of the partition I hard coded (top) and > vertices/element numbers (down): > > I have not yet modified the ex44.c example to properly assign the > coordinates... > > (but I would not have done it like it is in the last version > because the sCoords array is the global array with global vertices > number) > > I will have time to do this tomorrow... > > Maybe I can first try to reproduce all this with a smaller mesh? > > > That might make it easier to find a problem. > > ? Thanks! > > ? ? ?Matt > > Eric > > On 2021-10-26 9:46 a.m., Matthew Knepley wrote: >> Okay, I ran it. Something seems off with the mesh. First, I >> cannot simply explain the partition. The number of shared >> vertices and edges >> does not seem to come from a straight cut. Second, the mesh look >> scrambled on output. >> >> ? Thanks, >> >> ? ? Matt >> >> On Sun, Oct 24, 2021 at 11:49 PM Eric Chamberland >> > > wrote: >> >> Hi Matthew, >> >> ok, I started back from your ex44.c example and added the >> global array of coordinates.? I just have to code the >> creation of the local coordinates now. >> >> Eric >> >> On 2021-10-20 6:55 p.m., Matthew Knepley wrote: >>> On Wed, Oct 20, 2021 at 3:06 PM Eric Chamberland >>> >> > wrote: >>> >>> Hi Matthew, >>> >>> we tried to reproduce the error in a simple example. >>> >>> The context is the following: We hard coded the mesh and >>> initial partition into the code (see sConnectivity and >>> sInitialPartition) for 2 ranks and try to create a >>> section in order to use the DMPlexNaturalToGlobalBegin >>> function to retreive our initial element numbers. >>> >>> Now the call to DMPlexDistribute give different errors >>> depending on what type of component we ask the field to >>> be created.? For our objective, we would like a global >>> field to be created on elements only (like a P0 >>> interpolation). >>> >>> We now have the following error generated: >>> >>> [0]PETSC ERROR: --------------------- Error Message >>> -------------------------------------------------------------- >>> [0]PETSC ERROR: Petsc has generated inconsistent data >>> [0]PETSC ERROR: Inconsistency in indices, 18 should be 17 >>> [0]PETSC ERROR: See >>> https://www.mcs.anl.gov/petsc/documentation/faq.html >>> >>> for trouble shooting. >>> [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar 30, 2021 >>> [0]PETSC ERROR: ./bug on a? 
named rohan by ericc Wed Oct >>> 20 14:52:36 2021 >>> [0]PETSC ERROR: Configure options >>> --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 >>> --with-mpi-compilers=1 >>> --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 >>> --with-cxx-dialect=C++14 --with-make-np=12 >>> --with-shared-libraries=1 --with-debugging=yes >>> --with-memalign=64 --with-visibility=0 >>> --with-64-bit-indices=0 --download-ml=yes >>> --download-mumps=yes --download-superlu=yes >>> --download-hpddm=yes --download-slepc=yes >>> --download-superlu_dist=yes --download-parmetis=yes >>> --download-ptscotch=yes --download-metis=yes >>> --download-strumpack=yes --download-suitesparse=yes >>> --download-hypre=yes >>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >>> --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >>> --with-scalapack=1 >>> --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include >>> --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>> -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" >>> [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at >>> /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 >>> [0]PETSC ERROR: #2 DMPlexCreateGlobalToNaturalSF() at >>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 >>> [0]PETSC ERROR: #3 DMPlexDistribute() at >>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 >>> [0]PETSC ERROR: #4 main() at bug_section.cc:159 >>> [0]PETSC ERROR: No PETSc Option Table entries >>> [0]PETSC ERROR: ----------------End of Error Message >>> -------send entire error message to >>> petsc-maint at mcs.anl.gov >>> ---------- >>> >>> Hope the attached code is self-explaining, note that to >>> make it short, we have not included the final part of >>> it, just the buggy part we are encountering right now... >>> >>> Thanks for your insights, >>> >>> Thanks for making the example. I tweaked it slightly. I put >>> in a test case that just makes a parallel 7 x 10 quad mesh. >>> This works >>> fine. Thus I think it must be something connected with the >>> original mesh. It is hard to get a handle on it without the >>> coordinates. >>> Do you think you could put the coordinate array in? I have >>> added the code to load them (see attached file). >>> >>> ? Thanks, >>> >>> ? ? ?Matt >>> >>> Eric >>> >>> On 2021-10-06 9:23 p.m., Matthew Knepley wrote: >>>> On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland >>>> >>> > wrote: >>>> >>>> Hi Matthew, >>>> >>>> we tried to use that.? Now, we discovered that: >>>> >>>> 1- even if we "ask" for sfNatural creation with >>>> DMSetUseNatural, it is not created because >>>> DMPlexCreateGlobalToNaturalSF looks for a >>>> "section": this is not documented in >>>> DMSetUseNaturalso we are asking ourselfs: "is this >>>> a permanent feature or a temporary situation?" >>>> >>>> I think explaining this will help clear up a lot. >>>> >>>> What the Natural2Global?map does is permute a solution >>>> vector into the ordering that it would have had prior >>>> to mesh distribution. >>>> Now, in order to do this permutation, I need to know >>>> the original (global) data layout. If it is not >>>> specified _before_ distribution, we >>>> cannot build the permutation.? The section describes >>>> the data layout, so I need it before distribution. >>>> >>>> I cannot think of another way that you would implement >>>> this, but if you want something else, let me know. 
>>>> >>>> 2- We then tried to create a "section" in different >>>> manners: we took the code into the example >>>> petsc/src/dm/impls/plex/tests/ex15.c. However, we >>>> ended up with a segfault: >>>> >>>> corrupted size vs. prev_size >>>> [rohan:07297] *** Process received signal *** >>>> [rohan:07297] Signal: Aborted (6) >>>> [rohan:07297] Signal code: (-6) >>>> [rohan:07297] [ 0] >>>> /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] >>>> [rohan:07297] [ 1] >>>> /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] >>>> [rohan:07297] [ 2] >>>> /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] >>>> [rohan:07297] [ 3] >>>> /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] >>>> [rohan:07297] [ 4] >>>> /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] >>>> [rohan:07297] [ 5] >>>> /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] >>>> [rohan:07297] [ 6] >>>> /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] >>>> [rohan:07297] [ 7] >>>> /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] >>>> [rohan:07297] [ 8] >>>> /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] >>>> [rohan:07297] [ 9] >>>> /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] >>>> [rohan:07297] [10] >>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] >>>> [rohan:07297] [11] >>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] >>>> [rohan:07297] [12] >>>> /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] >>>> [rohan:07297] [13] >>>> /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] >>>> >>>> [rohan:07297] [14] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] >>>> [rohan:07297] [15] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] >>>> [rohan:07297] [16] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] >>>> [rohan:07297] [17] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] >>>> [rohan:07297] [18] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] >>>> >>>> I am not sure what happened here, but if you could send >>>> a sample code, I will figure it out. >>>> >>>> If we do not create a section, the call to >>>> DMPlexDistribute is successful, but >>>> DMPlexGetGlobalToNaturalSF return a null SF pointer... >>>> >>>> Yes, it just ignores it in this case because it does >>>> not have a global layout. >>>> >>>> Here are the operations we are calling ( this is >>>> almost the code we are using, I just removed >>>> verifications and creation of the connectivity >>>> which use our parallel structure and code): >>>> >>>> =========== >>>> >>>> ? PetscInt* lCells????? = 0; >>>> ? PetscInt? lNumCorners = 0; >>>> ? PetscInt? lDimMail??? = 0; >>>> ? PetscInt? lnumCells?? = 0; >>>> >>>> ? //At this point we create the cells for PETSc >>>> expected input for DMPlexBuildFromCellListParallel >>>> and set lNumCorners, lDimMail and lnumCells to >>>> correct values. >>>> ? ... >>>> >>>> ? DM?????? lDMBete = 0 >>>> DMPlexCreate(lMPIComm,&lDMBete); >>>> >>>> ? DMSetDimension(lDMBete, lDimMail); >>>> >>>> DMPlexBuildFromCellListParallel(lDMBete, >>>> ????????????????????????????????? lnumCells, >>>> ????????????????????????????????? PETSC_DECIDE, >>>> pLectureElementsLocaux.reqNbTotalSommets(), >>>> ????????????????????????????????? 
lNumCorners, >>>> ????????????????????????????????? lCells, >>>> ????????????????????????????????? PETSC_NULL); >>>> >>>> ? DM lDMBeteInterp = 0; >>>> ? DMPlexInterpolate(lDMBete, &lDMBeteInterp); >>>> ? DMDestroy(&lDMBete); >>>> ? lDMBete = lDMBeteInterp; >>>> >>>> DMSetUseNatural(lDMBete,PETSC_TRUE); >>>> >>>> ? PetscSF lSFMigrationSansOvl = 0; >>>> ? PetscSF lSFMigrationOvl = 0; >>>> ? DM lDMDistribueSansOvl = 0; >>>> ? DM lDMAvecOverlap = 0; >>>> >>>> ? PetscPartitioner lPart; >>>> DMPlexGetPartitioner(lDMBete, &lPart); >>>> PetscPartitionerSetFromOptions(lPart); >>>> >>>> ? PetscSection?? section; >>>> ? PetscInt?????? numFields = 1; >>>> ? PetscInt?????? numBC = 0; >>>> ? PetscInt?????? numComp[1] = {1}; >>>> ? PetscInt?????? numDof[4] = {1, 0, 0, 0}; >>>> ? PetscInt?????? bcFields[1] = {0}; >>>> ? IS???????????? bcPoints[1] = {NULL}; >>>> >>>> ? DMSetNumFields(lDMBete, numFields); >>>> >>>> DMPlexCreateSection(lDMBete, NULL, numComp, numDof, >>>> numBC, bcFields, bcPoints, NULL, NULL, §ion); >>>> ? DMSetLocalSection(lDMBete, section); >>>> >>>> ? DMPlexDistribute(lDMBete, 0, >>>> &lSFMigrationSansOvl, &lDMDistribueSansOvl); // >>>> segfault! >>>> >>>> =========== >>>> >>>> So we have other question/remarks: >>>> >>>> 3- Maybe PETSc expect something specific that is >>>> missing/not verified: for example, we didn't gave >>>> any coordinates since we just want to partition and >>>> compute overlap for the mesh... and then recover >>>> our element numbers in a "simple way" >>>> >>>> 4- We are telling ourselves it is somewhat a "big >>>> price to pay" to have to build an unused section to >>>> have the global to natural ordering set ?? Could >>>> this requirement be avoided? >>>> >>>> I don't think so. There would have to be _some_ way of >>>> describing your data layout in terms of mesh points, >>>> and I do not see how you could use less memory doing that. >>>> >>>> 5- Are there any improvement towards our usages in >>>> 3.16 release? >>>> >>>> Let me try and run the code above. >>>> >>>> ? Thanks, >>>> >>>> ? ? ?Matt >>>> >>>> Thanks, >>>> >>>> Eric >>>> >>>> >>>> On 2021-09-29 7:39 p.m., Matthew Knepley wrote: >>>>> On Wed, Sep 29, 2021 at 5:18 PM Eric Chamberland >>>>> >>>> > wrote: >>>>> >>>>> Hi, >>>>> >>>>> I come back with _almost_ the original question: >>>>> >>>>> I would like to add an integer information >>>>> (*our* original element >>>>> number, not petsc one) on each element of the >>>>> DMPlex I create with >>>>> DMPlexBuildFromCellListParallel. >>>>> >>>>> I would like this interger to be distribruted >>>>> by or the same way >>>>> DMPlexDistribute distribute the mesh. >>>>> >>>>> Is it possible to do this? >>>>> >>>>> >>>>> I think we already have support for what you want. >>>>> If you call >>>>> >>>>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html >>>>> >>>>> >>>>> before DMPlexDistribute(), it will compute a >>>>> PetscSF encoding the global to natural map. You >>>>> can get it with >>>>> >>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html >>>>> >>>>> >>>>> and use it with >>>>> >>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html >>>>> >>>>> >>>>> Is this sufficient? >>>>> >>>>> ? Thanks, >>>>> >>>>> ? ? ?Matt >>>>> >>>>> Thanks, >>>>> >>>>> Eric >>>>> >>>>> On 2021-07-14 1:18 p.m., Eric Chamberland wrote: >>>>> > Hi, >>>>> > >>>>> > I want to use DMPlexDistribute from PETSc >>>>> for computing overlapping >>>>> > and play with the different partitioners >>>>> supported. 
>>>>> > >>>>> > However, after calling DMPlexDistribute, I >>>>> noticed the elements are >>>>> > renumbered and then the original number is lost. >>>>> > >>>>> > What would be the best way to keep track of >>>>> the element renumbering? >>>>> > >>>>> > a) Adding an optional parameter to let the >>>>> user retrieve a vector or >>>>> > "IS" giving the old number? >>>>> > >>>>> > b) Adding a DMLabel (seems a wrong good >>>>> solution) >>>>> > >>>>> > c) Other idea? >>>>> > >>>>> > Of course, I don't want to loose >>>>> performances with the need of this >>>>> > "mapping"... >>>>> > >>>>> > Thanks, >>>>> > >>>>> > Eric >>>>> > >>>>> -- >>>>> Eric Chamberland, ing., M. Ing >>>>> Professionnel de recherche >>>>> GIREF/Universit? Laval >>>>> (418) 656-2131 poste 41 22 42 >>>>> >>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before >>>>> they begin their experiments is infinitely more >>>>> interesting than any results to which their >>>>> experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://www.cse.buffalo.edu/~knepley/ >>>>> >>>> >>>> -- >>>> Eric Chamberland, ing., M. Ing >>>> Professionnel de recherche >>>> GIREF/Universit? Laval >>>> (418) 656-2131 poste 41 22 42 >>>> >>>> >>>> >>>> -- >>>> What most experimenters take for granted before they >>>> begin their experiments is infinitely more interesting >>>> than any results to which their experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>> >>> -- >>> Eric Chamberland, ing., M. Ing >>> Professionnel de recherche >>> GIREF/Universit? Laval >>> (418) 656-2131 poste 41 22 42 >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin >>> their experiments is infinitely more interesting than any >>> results to which their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >> >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to >> which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -- Eric Chamberland, ing., M. Ing Professionnel de recherche GIREF/Universit? Laval (418) 656-2131 poste 41 22 42 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: hbnbhlbilhmjdpfg.png Type: image/png Size: 42972 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: eejjfmbjimlkboec.png Type: image/png Size: 87901 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ex44.c
Type: text/x-csrc
Size: 11542 bytes
Desc: not available
URL: 

From yhcy1993 at gmail.com  Wed Oct 27 00:47:43 2021
From: yhcy1993 at gmail.com (Yu Cang)
Date: Wed, 27 Oct 2021 13:47:43 +0800
Subject: [petsc-users] Strange behavior of TS after setting hand-coded Jacobian
In-Reply-To: <0C6ACBF3-F457-4BFD-AD19-8C455444748F@petsc.dev>
References: <0C6ACBF3-F457-4BFD-AD19-8C455444748F@petsc.dev>
Message-ID: 

Thanks for your kind reply.

Several comparison tests have been performed. Attached are the execution
output files; the corresponding descriptions are below.

good.txt -- Run without the hand-coded Jacobian, solution converged, with options
'-ts_monitor -snes_monitor -snes_converged_reason -ksp_monitor_true_residual -ksp_converged_reason';

jac1.txt -- Run with the hand-coded Jacobian, does not converge, with options
'-ts_monitor -snes_monitor -snes_converged_reason -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian';

jac2.txt -- Run with the hand-coded Jacobian, does not converge, with options
'-ts_monitor -snes_monitor -snes_converged_reason -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian -ksp_view';

jac3.txt -- Run with the hand-coded Jacobian, does not converge, with options
'-ts_monitor -snes_monitor -snes_converged_reason -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian -ksp_view -ts_max_snes_failures -1';

The problem under consideration contains an eigenvalue to be solved for,
which makes the first diagonal element of the Jacobian matrix zero.
From these outputs, it seems that the PC fails to factorize because of this
zero diagonal element. But I'm wondering why it works with the Jacobian
matrix generated by finite differences? Would employing DMDA for the
discretization be helpful?

Regards

Yu Cang

Barry Smith wrote on Mon, Oct 25, 2021 at 10:50 PM:
>
>
>   It is definitely unexpected that -snes_test_jacobian verifies the Jacobian as matching but the solve process is completely different.
>
>   Please run with -snes_monitor -snes_converged_reason -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian and send all the output
>
>   Barry
>
>
> > On Oct 25, 2021, at 9:53 AM, Yu Cang wrote:
> >
> > I'm using TS to solve a set of DAEs, which originates from a
> > one-dimensional problem. The grid points are uniformly distributed.
> > For simplicity, DMDA is not employed for the discretization.
> >
> > At first, only the residual function is prescribed through
> > 'TSSetIFunction', and PETSc produces converged results. However, after
> > providing the hand-coded Jacobian through 'TSSetIJacobian', the internal
> > SNES object fails (the residual norm does not change), and TS reports
> > 'DIVERGED_STEP_REJECTED'.
> >
> > I have tried adding the option '-snes_test_jacobian' to see if the
> > hand-coded Jacobian is somewhere wrong, but it shows '||J -
> > Jfd||_F/||J||_F = 1.07488e-10, ||J - Jfd||_F = 2.14458e-07',
> > indicating that the hand-coded Jacobian is correct.
> >
> > Then, I added a monitor for the internal SNES object through
> > 'SNESMonitorSet', in which the solution vector is displayed at
> > each iteration. It is interesting to find that, if the Jacobian is not
> > provided, meaning finite differences are used internally for the Jacobian
> > evaluation, the solution vector converges to the steady
> > solution and the SNES residual norm is reduced continuously. However,
> > it turns out that, as long as the Jacobian is provided, the solution
> > vector NEVER changes! So the solution procedure is stuck!
> >
> > This is quite strange!
Hope to get some advice. > > PETSC version=3.14.6, program run in serial mode. > > > > Regards > > > > Yu Cang > -------------- next part -------------- ============================================================ = OPPDIFF = = 1D counterflow solver for incompressible fluid = ============================================================ Parameters: Domain gap = 3.(cm) Grid points: 21 Spacing = 0.05, 1.42857(mm) Reference density = 1.225(kg/m^3) Reference velocity = 0.01(m/s) Re = 100.000000 Dynamic viscosity = 3.675e-06(kg/m/s) Kinetic viscosity = 3e-06(m^2/s) Inlet velocity @left = 5., 0.05(m/s) Inlet velocity @right = -5., -0.05(m/s) Starting time = 0., 0.(s) Ending time = 100., 33.3333(s) Creating vectors and matrices ... Done! Setting I.C. ... Done! Setting TS ... Done! Time-Stepping ... 0 TS dt 0.1 time 0. 0 SNES Function norm 4.472135955000e+01 ---------- Testing Jacobian ------------- Run with -snes_test_jacobian_view and optionally -snes_test_jacobian to show difference of hand-coded and finite difference Jacobian entries greater than . Testing hand-coded Jacobian, if (for double precision runs) ||J - Jfd||_F/||J||_F is O(1.e-8), the hand-coded Jacobian is probably correct. ||J - Jfd||_F/||J||_F = 7.5633e-10, ||J - Jfd||_F = 3.17683e-07 Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0 PC failed due to FACTOR_NUMERIC_ZEROPIVOT KSP Object: 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using PRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: ilu out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 matrix ordering: natural factor fill ratio given 1., needed 1. Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 package used to perform factorization: petsc total: nonzeros=180, allocated nonzeros=180 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 total: nonzeros=180, allocated nonzeros=215 total number of mallocs used during MatSetValues calls=0 not using I-node routines Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0 0 SNES Function norm 4.472135955000e+01 ---------- Testing Jacobian ------------- ||J - Jfd||_F/||J||_F = 6.13604e-10, ||J - Jfd||_F = 3.17683e-07 Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0 PC failed due to FACTOR_NUMERIC_ZEROPIVOT KSP Object: 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using PRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: ilu out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 matrix ordering: natural factor fill ratio given 1., needed 1. 
Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 package used to perform factorization: petsc total: nonzeros=180, allocated nonzeros=180 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 total: nonzeros=180, allocated nonzeros=215 total number of mallocs used during MatSetValues calls=0 not using I-node routines Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0 0 SNES Function norm 4.472135955000e+01 ---------- Testing Jacobian ------------- ||J - Jfd||_F/||J||_F = 3.21478e-10, ||J - Jfd||_F = 3.17683e-07 Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0 PC failed due to FACTOR_NUMERIC_ZEROPIVOT KSP Object: 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using PRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: ilu out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 matrix ordering: natural factor fill ratio given 1., needed 1. Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 package used to perform factorization: petsc total: nonzeros=180, allocated nonzeros=180 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 total: nonzeros=180, allocated nonzeros=215 total number of mallocs used during MatSetValues calls=0 not using I-node routines Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0 0 SNES Function norm 4.472135955000e+01 ---------- Testing Jacobian ------------- ||J - Jfd||_F/||J||_F = 1.04271e-10, ||J - Jfd||_F = 3.17683e-07 Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0 PC failed due to FACTOR_NUMERIC_ZEROPIVOT KSP Object: 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using PRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: ilu out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 matrix ordering: natural factor fill ratio given 1., needed 1. 
Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 package used to perform factorization: petsc total: nonzeros=180, allocated nonzeros=180 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 total: nonzeros=180, allocated nonzeros=215 total number of mallocs used during MatSetValues calls=0 not using I-node routines Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0 0 SNES Function norm 4.472135955000e+01 ---------- Testing Jacobian ------------- ||J - Jfd||_F/||J||_F = 2.78563e-11, ||J - Jfd||_F = 3.17682e-07 Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0 PC failed due to FACTOR_NUMERIC_ZEROPIVOT KSP Object: 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using PRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: ilu out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 matrix ordering: natural factor fill ratio given 1., needed 1. Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 package used to perform factorization: petsc total: nonzeros=180, allocated nonzeros=180 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 total: nonzeros=180, allocated nonzeros=215 total number of mallocs used during MatSetValues calls=0 not using I-node routines Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0 0 SNES Function norm 4.472135955000e+01 ---------- Testing Jacobian ------------- ||J - Jfd||_F/||J||_F = 7.07886e-12, ||J - Jfd||_F = 3.17682e-07 Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0 PC failed due to FACTOR_NUMERIC_ZEROPIVOT KSP Object: 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using PRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: ilu out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 matrix ordering: natural factor fill ratio given 1., needed 1. 
Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 package used to perform factorization: petsc total: nonzeros=180, allocated nonzeros=180 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 total: nonzeros=180, allocated nonzeros=215 total number of mallocs used during MatSetValues calls=0 not using I-node routines Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0 0 SNES Function norm 4.472135955000e+01 ---------- Testing Jacobian ------------- ||J - Jfd||_F/||J||_F = 1.77683e-12, ||J - Jfd||_F = 3.17665e-07 Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0 PC failed due to FACTOR_NUMERIC_ZEROPIVOT KSP Object: 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using PRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: ilu out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 matrix ordering: natural factor fill ratio given 1., needed 1. Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 package used to perform factorization: petsc total: nonzeros=180, allocated nonzeros=180 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 total: nonzeros=180, allocated nonzeros=215 total number of mallocs used during MatSetValues calls=0 not using I-node routines Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0 0 SNES Function norm 4.472135955000e+01 ---------- Testing Jacobian ------------- ||J - Jfd||_F/||J||_F = 4.44717e-13, ||J - Jfd||_F = 3.17707e-07 Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0 PC failed due to FACTOR_NUMERIC_ZEROPIVOT KSP Object: 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using PRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: ilu out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 matrix ordering: natural factor fill ratio given 1., needed 1. 
Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 package used to perform factorization: petsc total: nonzeros=180, allocated nonzeros=180 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 total: nonzeros=180, allocated nonzeros=215 total number of mallocs used during MatSetValues calls=0 not using I-node routines Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0 0 SNES Function norm 4.472135955000e+01 ---------- Testing Jacobian ------------- ||J - Jfd||_F/||J||_F = 1.11165e-13, ||J - Jfd||_F = 3.17586e-07 Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0 PC failed due to FACTOR_NUMERIC_ZEROPIVOT KSP Object: 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using PRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: ilu out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 matrix ordering: natural factor fill ratio given 1., needed 1. Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 package used to perform factorization: petsc total: nonzeros=180, allocated nonzeros=180 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 total: nonzeros=180, allocated nonzeros=215 total number of mallocs used during MatSetValues calls=0 not using I-node routines Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0 0 SNES Function norm 4.472135955000e+01 ---------- Testing Jacobian ------------- ||J - Jfd||_F/||J||_F = 2.77876e-14, ||J - Jfd||_F = 3.17524e-07 Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0 PC failed due to FACTOR_NUMERIC_ZEROPIVOT KSP Object: 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using PRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: ilu out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 matrix ordering: natural factor fill ratio given 1., needed 1. 
Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 package used to perform factorization: petsc total: nonzeros=180, allocated nonzeros=180 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 total: nonzeros=180, allocated nonzeros=215 total number of mallocs used during MatSetValues calls=0 not using I-node routines Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0 0 SNES Function norm 4.472135955000e+01 ---------- Testing Jacobian ------------- ||J - Jfd||_F/||J||_F = 6.89057e-15, ||J - Jfd||_F = 3.14944e-07 Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0 PC failed due to FACTOR_NUMERIC_ZEROPIVOT KSP Object: 1 MPI processes type: gmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using PRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: ilu out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 matrix ordering: natural factor fill ratio given 1., needed 1. Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 package used to perform factorization: petsc total: nonzeros=180, allocated nonzeros=180 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=43, cols=43 total: nonzeros=180, allocated nonzeros=215 total number of mallocs used during MatSetValues calls=0 not using I-node routines Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0 [0]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [0]PETSC ERROR: [0]PETSC ERROR: TSStep has failed due to DIVERGED_STEP_REJECTED [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting. [0]PETSC ERROR: Petsc Release Version 3.14.6, unknown [0]PETSC ERROR: ./incompressible_jac on a named LARGE by sun Wed Oct 27 13:24:28 2021 [0]PETSC ERROR: Configure options --prefix=/usr/local --download-hypre=/home/sun/Downloads/hypre-2.20.0.zip --download-zlib --download-p4est=/home/sun/Downloads/p4est-v2.0.zip --with-debugging=0 --COPTFLAGS="-O3 -march=native -mtune=native" --CXXOPTFLAGS="-O3 -march=native -mtune=native" --FOPTFLAGS="-O3 -march=native -mtune=native" [0]PETSC ERROR: #1 TSStep() line 3775 in /home/sun/Desktop/SOFTWARE/petsc/3.14.6/src/ts/interface/ts.c [0]PETSC ERROR: #2 TSSolve() line 4156 in /home/sun/Desktop/SOFTWARE/petsc/3.14.6/src/ts/interface/ts.c [0]PETSC ERROR: #3 main() line 295 in /home/sun/Desktop/TFM/incompressible_jac.cc [0]PETSC ERROR: PETSc Option Table entries: [0]PETSC ERROR: -ksp_converged_reason [0]PETSC ERROR: -ksp_monitor_true_residual [0]PETSC ERROR: -ksp_view [0]PETSC ERROR: -snes_converged_reason [0]PETSC ERROR: -snes_monitor [0]PETSC ERROR: -snes_test_jacobian [0]PETSC ERROR: -ts_max_snes_failures -1 [0]PETSC ERROR: -ts_monitor [0]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov---------- -------------------------------------------------------------------------- MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_SELF with errorcode 295091. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. 
You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--------------------------------------------------------------------------
-------------- next part --------------
============================================================
=                         OPPDIFF                          =
=      1D counterflow solver for incompressible fluid      =
============================================================
Parameters:
  Domain gap = 3.(cm)
  Grid points: 21
  Spacing = 0.05, 1.42857(mm)
  Reference density = 1.225(kg/m^3)
  Reference velocity = 0.01(m/s)
  Re = 100.000000
  Dynamic viscosity = 3.675e-06(kg/m/s)
  Kinetic viscosity = 3e-06(m^2/s)
  Inlet velocity @left = 5., 0.05(m/s)
  Inlet velocity @right = -5., -0.05(m/s)
  Starting time = 0., 0.(s)
  Ending time = 100., 33.3333(s)
Creating vectors and matrices ... Done!
Setting I.C. ... Done!
Setting TS ... Done!
Time-Stepping ...
0 TS dt 0.1 time 0.
    0 SNES Function norm 4.472135955000e+01
      ---------- Testing Jacobian -------------
      Run with -snes_test_jacobian_view and optionally -snes_test_jacobian <threshold> to show difference of hand-coded and finite difference Jacobian entries greater than <threshold>.
      Testing hand-coded Jacobian, if (for double precision runs) ||J - Jfd||_F/||J||_F is O(1.e-8), the hand-coded Jacobian is probably correct.
      ||J - Jfd||_F/||J||_F = 7.5633e-10, ||J - Jfd||_F = 3.17683e-07
    Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0
               PC failed due to FACTOR_NUMERIC_ZEROPIVOT
  KSP Object: 1 MPI processes
    type: gmres
      restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
      happy breakdown tolerance 1e-30
    maximum iterations=10000, initial guess is zero
    tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
    left preconditioning
    using PRECONDITIONED norm type for convergence test
  PC Object: 1 MPI processes
    type: ilu
      out-of-place factorization
      0 levels of fill
      tolerance for zero pivot 2.22045e-14
      matrix ordering: natural
      factor fill ratio given 1., needed 1.
        Factored matrix follows:
          Mat Object: 1 MPI processes
            type: seqaij
            rows=43, cols=43
            package used to perform factorization: petsc
            total: nonzeros=180, allocated nonzeros=180
              not using I-node routines
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
    type: seqaij
    rows=43, cols=43
    total: nonzeros=180, allocated nonzeros=215
    total number of mallocs used during MatSetValues calls=0
      not using I-node routines
Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR:
[0]PETSC ERROR: TSStep has failed due to DIVERGED_NONLINEAR_SOLVE, increase -ts_max_snes_failures or make negative to attempt recovery
[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.14.6, unknown
[0]PETSC ERROR: ./incompressible_jac on a named LARGE by sun Wed Oct 27 13:23:34 2021
[0]PETSC ERROR: Configure options --prefix=/usr/local --download-hypre=/home/sun/Downloads/hypre-2.20.0.zip --download-zlib --download-p4est=/home/sun/Downloads/p4est-v2.0.zip --with-debugging=0 --COPTFLAGS="-O3 -march=native -mtune=native" --CXXOPTFLAGS="-O3 -march=native -mtune=native" --FOPTFLAGS="-O3 -march=native -mtune=native"
[0]PETSC ERROR: #1 TSStep() line 3774 in /home/sun/Desktop/SOFTWARE/petsc/3.14.6/src/ts/interface/ts.c
[0]PETSC ERROR: #2 TSSolve() line 4156 in /home/sun/Desktop/SOFTWARE/petsc/3.14.6/src/ts/interface/ts.c
[0]PETSC ERROR: #3 main() line 295 in /home/sun/Desktop/TFM/incompressible_jac.cc
[0]PETSC ERROR: PETSc Option Table entries:
[0]PETSC ERROR: -ksp_converged_reason
[0]PETSC ERROR: -ksp_monitor_true_residual
[0]PETSC ERROR: -ksp_view
[0]PETSC ERROR: -snes_converged_reason
[0]PETSC ERROR: -snes_monitor
[0]PETSC ERROR: -snes_test_jacobian
[0]PETSC ERROR: -ts_monitor
[0]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov----------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_SELF with errorcode 295091.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--------------------------------------------------------------------------
-------------- next part --------------
============================================================
=                         OPPDIFF                          =
=      1D counterflow solver for incompressible fluid      =
============================================================
Parameters:
  Domain gap = 3.(cm)
  Grid points: 21
  Spacing = 0.05, 1.42857(mm)
  Reference density = 1.225(kg/m^3)
  Reference velocity = 0.01(m/s)
  Re = 100.000000
  Dynamic viscosity = 3.675e-06(kg/m/s)
  Kinetic viscosity = 3e-06(m^2/s)
  Inlet velocity @left = 5., 0.05(m/s)
  Inlet velocity @right = -5., -0.05(m/s)
  Starting time = 0., 0.(s)
  Ending time = 100., 33.3333(s)
Creating vectors and matrices ... Done!
Setting I.C. ... Done!
Setting TS ... Done!
Time-Stepping ...
0 TS dt 0.1 time 0.
    0 SNES Function norm 4.472135955000e+01
      ---------- Testing Jacobian -------------
      Run with -snes_test_jacobian_view and optionally -snes_test_jacobian <threshold> to show difference of hand-coded and finite difference Jacobian entries greater than <threshold>.
      Testing hand-coded Jacobian, if (for double precision runs) ||J - Jfd||_F/||J||_F is O(1.e-8), the hand-coded Jacobian is probably correct.
      ||J - Jfd||_F/||J||_F = 7.5633e-10, ||J - Jfd||_F = 3.17683e-07
    Linear solve did not converge due to DIVERGED_PC_FAILED iterations 0
               PC failed due to FACTOR_NUMERIC_ZEROPIVOT
Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR:
[0]PETSC ERROR: TSStep has failed due to DIVERGED_NONLINEAR_SOLVE, increase -ts_max_snes_failures or make negative to attempt recovery
[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.14.6, unknown
[0]PETSC ERROR: ./incompressible_jac on a named LARGE by sun Wed Oct 27 13:23:07 2021
[0]PETSC ERROR: Configure options --prefix=/usr/local --download-hypre=/home/sun/Downloads/hypre-2.20.0.zip --download-zlib --download-p4est=/home/sun/Downloads/p4est-v2.0.zip --with-debugging=0 --COPTFLAGS="-O3 -march=native -mtune=native" --CXXOPTFLAGS="-O3 -march=native -mtune=native" --FOPTFLAGS="-O3 -march=native -mtune=native"
[0]PETSC ERROR: #1 TSStep() line 3774 in /home/sun/Desktop/SOFTWARE/petsc/3.14.6/src/ts/interface/ts.c
[0]PETSC ERROR: #2 TSSolve() line 4156 in /home/sun/Desktop/SOFTWARE/petsc/3.14.6/src/ts/interface/ts.c
[0]PETSC ERROR: #3 main() line 295 in /home/sun/Desktop/TFM/incompressible_jac.cc
[0]PETSC ERROR: PETSc Option Table entries:
[0]PETSC ERROR: -ksp_converged_reason
[0]PETSC ERROR: -ksp_monitor_true_residual
[0]PETSC ERROR: -snes_converged_reason
[0]PETSC ERROR: -snes_monitor
[0]PETSC ERROR: -snes_test_jacobian
[0]PETSC ERROR: -ts_monitor
[0]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov----------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_SELF with errorcode 295091.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--------------------------------------------------------------------------
-------------- next part --------------
============================================================
=                         OPPDIFF                          =
=      1D counterflow solver for incompressible fluid      =
============================================================
Parameters:
  Domain gap = 3.(cm)
  Grid points: 21
  Spacing = 0.05, 1.42857(mm)
  Reference density = 1.225(kg/m^3)
  Reference velocity = 0.01(m/s)
  Re = 100.000000
  Dynamic viscosity = 3.675e-06(kg/m/s)
  Kinetic viscosity = 3e-06(m^2/s)
  Inlet velocity @left = 5., 0.05(m/s)
  Inlet velocity @right = -5., -0.05(m/s)
  Starting time = 0., 0.(s)
  Ending time = 100., 33.3333(s)
Creating vectors and matrices ... Done!
Setting I.C. ... Done!
Setting TS ... Done!
Time-Stepping ...
0 TS dt 0.1 time 0.
0 SNES Function norm 4.472135955000e+01 0 KSP preconditioned resid norm 1.039786297622e+02 true resid norm 4.472135955000e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.274255218218e-13 true resid norm 7.808054931715e-13 ||r(i)||/||b|| 1.745934159937e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.033045523509e+01 0 KSP preconditioned resid norm 1.032123726781e+02 true resid norm 4.033045523509e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.644630031141e-14 true resid norm 3.044025711296e-13 ||r(i)||/||b|| 7.547709773054e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 3.647084550217e+01 0 KSP preconditioned resid norm 1.002205980268e+02 true resid norm 3.647084550217e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 9.146773490752e-14 true resid norm 3.867971263170e-13 ||r(i)||/||b|| 1.060565284383e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 3 SNES Function norm 3.302987397637e+01 0 KSP preconditioned resid norm 9.588715518038e+01 true resid norm 3.302987397637e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.733726433179e-14 true resid norm 2.146878075769e-13 ||r(i)||/||b|| 6.499807045299e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 4 SNES Function norm 2.993583488032e+01 0 KSP preconditioned resid norm 9.076494989978e+01 true resid norm 2.993583488032e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 5.859560435859e-14 true resid norm 4.878395102318e-13 ||r(i)||/||b|| 1.629617186833e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 5 SNES Function norm 2.713914093755e+01 0 KSP preconditioned resid norm 8.522166977424e+01 true resid norm 2.713914093755e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.108184839803e-13 true resid norm 5.364124140451e-13 ||r(i)||/||b|| 1.976526874153e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 6 SNES Function norm 2.411351569164e+01 0 KSP preconditioned resid norm 7.824190976367e+01 true resid norm 2.411351569164e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 7.275103061298e-14 true resid norm 3.526740473154e-13 ||r(i)||/||b|| 1.462557562428e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 7 SNES Function norm 2.067788533862e+01 0 KSP preconditioned resid norm 6.899643046015e+01 true resid norm 2.067788533862e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.974878875731e-14 true resid norm 1.477492467331e-13 ||r(i)||/||b|| 7.145278364476e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 8 SNES Function norm 1.673094982754e+01 0 KSP preconditioned resid norm 5.639802573789e+01 true resid norm 1.673094982754e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.726172589271e-14 true resid norm 2.345889463959e-13 ||r(i)||/||b|| 1.402125694082e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 9 SNES Function norm 1.221764259446e+01 0 KSP preconditioned resid norm 3.860709295211e+01 true resid norm 1.221764259446e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.754950987785e-14 true resid norm 2.701359181474e-13 ||r(i)||/||b|| 2.211031433101e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 10 SNES Function norm 8.744164020698e+00 0 KSP preconditioned resid norm 1.801723834183e+00 true resid norm 8.744164020698e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP 
preconditioned resid norm 3.866152022490e-16 true resid norm 3.526925342092e-15 ||r(i)||/||b|| 4.033462013914e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 11 SNES Function norm 2.295191397811e-03 0 KSP preconditioned resid norm 3.908782878818e-04 true resid norm 2.295191397811e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.683593598286e-19 true resid norm 1.349117963796e-18 ||r(i)||/||b|| 5.878019432640e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 12 SNES Function norm 5.560443537355e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 12 1 TS dt 0.1 time 0.1 0 SNES Function norm 2.481890584294e+02 0 KSP preconditioned resid norm 4.814575218086e+01 true resid norm 2.481890584294e+02 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 7.978258231227e-15 true resid norm 1.352195787188e-13 ||r(i)||/||b|| 5.448248991091e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.266021973228e+00 0 KSP preconditioned resid norm 6.295150151503e-01 true resid norm 4.266021973228e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.645871455459e-16 true resid norm 1.499784027057e-15 ||r(i)||/||b|| 3.515650028221e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 3.166231966289e-03 0 KSP preconditioned resid norm 4.239093106351e-04 true resid norm 3.166231966289e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.125115632561e-19 true resid norm 9.826291929518e-18 ||r(i)||/||b|| 3.103465581215e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 3 SNES Function norm 2.069014172426e-09 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 3 0 SNES Function norm 2.481890584294e+02 0 KSP preconditioned resid norm 5.087493331112e+01 true resid norm 2.481890584294e+02 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.313547762114e-15 true resid norm 5.407063359255e-14 ||r(i)||/||b|| 2.178606661177e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.642481941904e-01 0 KSP preconditioned resid norm 7.305695402277e-02 true resid norm 3.642481941904e-01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.268326858204e-17 true resid norm 9.661537133731e-17 ||r(i)||/||b|| 2.652459857819e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.024529803087e-06 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 0 SNES Function norm 2.481890584294e+02 0 KSP preconditioned resid norm 5.239714364534e+01 true resid norm 2.481890584294e+02 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.330119296858e-14 true resid norm 6.280430957932e-13 ||r(i)||/||b|| 2.530502753698e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 5.654112168029e-03 0 KSP preconditioned resid norm 1.187690198844e-03 true resid norm 5.654112168029e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 6.616189272861e-19 true resid norm 2.185254248509e-17 ||r(i)||/||b|| 3.864893697838e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.308525067597e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 0 SNES Function norm 2.481890584294e+02 0 KSP preconditioned resid norm 5.259449826109e+01 true resid norm 2.481890584294e+02 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 7.225264457386e-15 
true resid norm 7.898672548273e-12 ||r(i)||/||b|| 3.182522468258e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.284252616442e-04 0 KSP preconditioned resid norm 2.617251794092e-05 true resid norm 1.284252616442e-04 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.863905206556e-21 true resid norm 5.671523044901e-18 ||r(i)||/||b|| 4.416205170455e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.002897683626e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 0 SNES Function norm 2.481890584294e+02 0 KSP preconditioned resid norm 5.261690098838e+01 true resid norm 2.481890584294e+02 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.757659105582e-14 true resid norm 1.733398935306e-11 ||r(i)||/||b|| 6.984187563606e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.078387080477e-05 0 KSP preconditioned resid norm 1.239471400584e-06 true resid norm 1.078387080477e-05 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.847067383530e-22 true resid norm 6.198963351121e-19 ||r(i)||/||b|| 5.748365743015e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 3.213177482058e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 2 TS dt 7.6238e-05 time 0.100061 0 SNES Function norm 9.448753356532e+01 0 KSP preconditioned resid norm 1.987605682962e-02 true resid norm 9.448753356532e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.226471750824e-14 true resid norm 4.938240530548e-10 ||r(i)||/||b|| 5.226340813663e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.491295456374e-05 0 KSP preconditioned resid norm 7.269039400395e-06 true resid norm 3.491295456374e-05 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 9.535030356460e-22 true resid norm 5.245837518513e-18 ||r(i)||/||b|| 1.502547574121e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.495520007081e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 0 SNES Function norm 9.448753356524e+01 0 KSP preconditioned resid norm 1.992148519035e-03 true resid norm 9.448753356524e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.176962073470e-11 true resid norm 2.537712187769e-06 ||r(i)||/||b|| 2.685764028349e-08 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.626423438851e-06 0 KSP preconditioned resid norm 7.162226053898e-08 true resid norm 2.626423438851e-06 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.094114490484e-22 true resid norm 5.178780977332e-17 ||r(i)||/||b|| 1.971799710864e-11 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.626676019134e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 0 SNES Function norm 9.448753356515e+01 0 KSP preconditioned resid norm 2.614583697436e-04 true resid norm 9.448753356515e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.281757288914e-11 true resid norm 2.914906640221e-05 ||r(i)||/||b|| 3.084964259557e-07 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.915317138429e-05 0 KSP preconditioned resid norm 8.009034597489e-09 true resid norm 2.915317138429e-05 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.424297252435e-20 true resid norm 3.283437574654e-14 
||r(i)||/||b|| 1.126271146069e-09 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.273764760818e-09 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 3 TS dt 1.89392e-06 time 0.100062 0 SNES Function norm 9.448420198559e+01 0 KSP preconditioned resid norm 4.950065075557e-04 true resid norm 9.448420198559e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 9.957254999716e-12 true resid norm 4.671480815308e-06 ||r(i)||/||b|| 4.944192486296e-08 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.586494110169e-06 0 KSP preconditioned resid norm 2.869834662879e-08 true resid norm 4.586494110169e-06 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.638393548524e-21 true resid norm 1.345346182329e-15 ||r(i)||/||b|| 2.933277902497e-10 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 6.870452719991e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 4 TS dt 1.89392e-05 time 0.100064 0 SNES Function norm 9.447789556025e+01 0 KSP preconditioned resid norm 4.946713243483e-03 true resid norm 9.447789556025e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.336781394036e-12 true resid norm 1.566413045865e-07 ||r(i)||/||b|| 1.657967756983e-09 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.167131003946e-06 0 KSP preconditioned resid norm 4.609216552367e-07 true resid norm 2.167131003946e-06 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.032900446192e-22 true resid norm 1.077636382835e-19 ||r(i)||/||b|| 4.972640698106e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 7.228995203392e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 5 TS dt 0.000189392 time 0.100082 0 SNES Function norm 9.441492975737e+01 0 KSP preconditioned resid norm 4.916485042654e-02 true resid norm 9.441492975737e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 7.102215497961e-13 true resid norm 3.354787492100e-09 ||r(i)||/||b|| 3.553238349826e-11 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.132909828152e-04 0 KSP preconditioned resid norm 4.457085873508e-05 true resid norm 2.132909828152e-04 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 5.262954492816e-21 true resid norm 7.494475973218e-19 ||r(i)||/||b|| 3.513733151912e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 9.417796226665e-12 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 6 TS dt 0.000470304 time 0.100272 0 SNES Function norm 9.379479121406e+01 0 KSP preconditioned resid norm 1.203930740553e-01 true resid norm 9.379479121406e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.730115663252e-14 true resid norm 3.303112014119e-11 ||r(i)||/||b|| 3.521636938858e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.274374966127e-03 0 KSP preconditioned resid norm 2.674315055104e-04 true resid norm 1.274374966127e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.990683235506e-20 true resid norm 1.859987379741e-17 ||r(i)||/||b|| 1.459529125399e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 5.979604035441e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 7 TS dt 0.000543268 time 0.100742 0 SNES Function norm 
9.231871046585e+01 0 KSP preconditioned resid norm 1.371331036236e-01 true resid norm 9.231871046585e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.414880100118e-13 true resid norm 4.022616350531e-10 ||r(i)||/||b|| 4.357314275982e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.644932065981e-03 0 KSP preconditioned resid norm 3.459339199337e-04 true resid norm 1.644932065981e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.060567287014e-20 true resid norm 2.269035642567e-18 ||r(i)||/||b|| 1.379409940078e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.495928558153e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 8 TS dt 0.000642798 time 0.101285 0 SNES Function norm 9.070764347681e+01 0 KSP preconditioned resid norm 1.595391992596e-01 true resid norm 9.070764347681e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.169273498298e-13 true resid norm 1.651130548029e-10 ||r(i)||/||b|| 1.820277194667e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.222083595253e-03 0 KSP preconditioned resid norm 4.672059124694e-04 true resid norm 2.222083595253e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.066232714517e-19 true resid norm 3.464166293781e-17 ||r(i)||/||b|| 1.558972084210e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 3.184860737285e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 9 TS dt 0.000658539 time 0.101928 0 SNES Function norm 8.891383101296e+01 0 KSP preconditioned resid norm 1.607234709312e-01 true resid norm 8.891383101296e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.373524345955e-14 true resid norm 1.153674595360e-10 ||r(i)||/||b|| 1.297519837147e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.254688159200e-03 0 KSP preconditioned resid norm 4.731246859854e-04 true resid norm 2.254688159200e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.641568792644e-19 true resid norm 2.979639026671e-17 ||r(i)||/||b|| 1.321530436266e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.953109347577e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 10 TS dt 0.000702176 time 0.102587 0 SNES Function norm 8.718319934422e+01 0 KSP preconditioned resid norm 1.682512702165e-01 true resid norm 8.718319934422e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 6.270122458095e-14 true resid norm 8.105014079367e-11 ||r(i)||/||b|| 9.296532061603e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.477652837217e-03 0 KSP preconditioned resid norm 5.180321420845e-04 true resid norm 2.477652837217e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 5.865367075029e-20 true resid norm 1.097199972480e-17 ||r(i)||/||b|| 4.428384622730e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 3.629749949144e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 11 TS dt 0.000717437 time 0.103289 0 SNES Function norm 8.543971302100e+01 0 KSP preconditioned resid norm 1.687013530854e-01 true resid norm 8.543971302100e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.995814874966e-13 true resid norm 2.527048874850e-10 ||r(i)||/||b|| 2.957698224278e-12 Linear 
solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.502653637478e-03 0 KSP preconditioned resid norm 5.208120444490e-04 true resid norm 2.502653637478e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.893667677102e-19 true resid norm 1.443047106659e-17 ||r(i)||/||b|| 5.766068004971e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 8.161913663693e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 12 TS dt 0.000748038 time 0.104006 0 SNES Function norm 8.375063723852e+01 0 KSP preconditioned resid norm 1.724051968235e-01 true resid norm 8.375063723852e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 5.354930086583e-14 true resid norm 6.514015645990e-11 ||r(i)||/||b|| 7.777869949142e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.631609341940e-03 0 KSP preconditioned resid norm 5.447112937552e-04 true resid norm 2.631609341940e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 7.098280889535e-20 true resid norm 1.972334648190e-17 ||r(i)||/||b|| 7.494785098825e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 6.145990072717e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 13 TS dt 0.000768527 time 0.104754 0 SNES Function norm 8.207410973637e+01 0 KSP preconditioned resid norm 1.734749856068e-01 true resid norm 8.207410973637e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.635671538909e-14 true resid norm 1.020730806582e-10 ||r(i)||/||b|| 1.243669666185e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.686666220104e-03 0 KSP preconditioned resid norm 5.528722222494e-04 true resid norm 2.686666220104e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.310155204269e-19 true resid norm 4.830483216350e-17 ||r(i)||/||b|| 1.797946905427e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 6.801792135099e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 14 TS dt 0.000796547 time 0.105523 0 SNES Function norm 8.042848982172e+01 0 KSP preconditioned resid norm 1.758892775036e-01 true resid norm 8.042848982172e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.858879945257e-14 true resid norm 5.545956074167e-11 ||r(i)||/||b|| 6.895511884483e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.789812422277e-03 0 KSP preconditioned resid norm 5.705277461147e-04 true resid norm 2.789812422277e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.913142296618e-19 true resid norm 3.998556107728e-18 ||r(i)||/||b|| 1.433270594037e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 4.297468971878e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 15 TS dt 0.000821178 time 0.10632 0 SNES Function norm 7.879324106887e+01 0 KSP preconditioned resid norm 1.772055737326e-01 true resid norm 7.879324106887e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.000729750695e-13 true resid norm 1.107346109733e-10 ||r(i)||/||b|| 1.405382104748e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.864437597595e-03 0 KSP preconditioned resid norm 5.819802840658e-04 true resid norm 2.864437597595e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 6.570245323942e-20 true 
resid norm 9.116555987359e-18 ||r(i)||/||b|| 3.182668735745e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 8.304415356637e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 16 TS dt 0.000849708 time 0.107141 0 SNES Function norm 7.717202202766e+01 0 KSP preconditioned resid norm 1.789783897696e-01 true resid norm 7.717202202766e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.916608907545e-14 true resid norm 2.042013501437e-11 ||r(i)||/||b|| 2.646054162874e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.960122998770e-03 0 KSP preconditioned resid norm 5.972781997712e-04 true resid norm 2.960122998770e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.215077830152e-20 true resid norm 3.599074567107e-18 ||r(i)||/||b|| 1.215853046851e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.564690943417e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 17 TS dt 0.000877541 time 0.10799 0 SNES Function norm 7.555424443422e+01 0 KSP preconditioned resid norm 1.802089975312e-01 true resid norm 7.555424443422e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.811452014208e-14 true resid norm 1.880469120282e-11 ||r(i)||/||b|| 2.488899378670e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.044565370974e-03 0 KSP preconditioned resid norm 6.099484511065e-04 true resid norm 3.044565370974e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.386018977644e-19 true resid norm 2.881122259642e-17 ||r(i)||/||b|| 9.463164388289e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 8.739837085594e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 18 TS dt 0.000907727 time 0.108868 0 SNES Function norm 7.393914980812e+01 0 KSP preconditioned resid norm 1.815000001057e-01 true resid norm 7.393914980812e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.784008470994e-13 true resid norm 1.788267374402e-10 ||r(i)||/||b|| 2.418566319795e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.137962421776e-03 0 KSP preconditioned resid norm 6.240826864483e-04 true resid norm 3.137962421776e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.032203146415e-20 true resid norm 5.530962097641e-18 ||r(i)||/||b|| 1.762596664402e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.127871129670e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 19 TS dt 0.000938355 time 0.109776 0 SNES Function norm 7.232082451423e+01 0 KSP preconditioned resid norm 1.824396838374e-01 true resid norm 7.232082451423e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.933890699233e-15 true resid norm 8.580595300574e-12 ||r(i)||/||b|| 1.186462593341e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.226339091245e-03 0 KSP preconditioned resid norm 6.368425083840e-04 true resid norm 3.226339091245e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.871496403390e-20 true resid norm 2.538923370684e-17 ||r(i)||/||b|| 7.869363073378e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 3.272154252713e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 20 TS dt 0.000970716 time 0.110714 
0 SNES Function norm 7.069752274292e+01 0 KSP preconditioned resid norm 1.832538626878e-01 true resid norm 7.069752274292e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.110765712265e-14 true resid norm 1.039607282967e-11 ||r(i)||/||b|| 1.470500298501e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.317729064783e-03 0 KSP preconditioned resid norm 6.496875723688e-04 true resid norm 3.317729064783e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.175454028470e-19 true resid norm 3.471612084273e-18 ||r(i)||/||b|| 1.046382033157e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 7.922088979970e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 21 TS dt 0.00100403 time 0.111685 0 SNES Function norm 6.906576770788e+01 0 KSP preconditioned resid norm 1.837669368174e-01 true resid norm 6.906576770788e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 5.220944148489e-14 true resid norm 4.739489134723e-11 ||r(i)||/||b|| 6.862284011335e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.405853725318e-03 0 KSP preconditioned resid norm 6.615462514101e-04 true resid norm 3.405853725318e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.472969378355e-19 true resid norm 2.674920574736e-18 ||r(i)||/||b|| 7.853891536361e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 6.478730795773e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 22 TS dt 0.00103887 time 0.112689 0 SNES Function norm 6.742398683218e+01 0 KSP preconditioned resid norm 1.840565102966e-01 true resid norm 6.742398683218e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 5.129358524756e-14 true resid norm 4.495373798069e-11 ||r(i)||/||b|| 6.667321244675e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.493756411309e-03 0 KSP preconditioned resid norm 6.729142804055e-04 true resid norm 3.493756411309e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.250072100387e-19 true resid norm 2.474260037478e-18 ||r(i)||/||b|| 7.081947755343e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 4.809516791341e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 23 TS dt 0.00107493 time 0.113728 0 SNES Function norm 6.577016205918e+01 0 KSP preconditioned resid norm 1.840430915784e-01 true resid norm 6.577016205918e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.692404099180e-14 true resid norm 2.271356589599e-11 ||r(i)||/||b|| 3.453475738064e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.578175999661e-03 0 KSP preconditioned resid norm 6.831979012321e-04 true resid norm 3.578175999661e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.643946865147e-19 true resid norm 1.506535397178e-17 ||r(i)||/||b|| 4.210344592667e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 6.664493188340e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 24 TS dt 0.00111249 time 0.114803 0 SNES Function norm 6.410324565742e+01 0 KSP preconditioned resid norm 1.837479744571e-01 true resid norm 6.410324565742e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.024166583807e-14 true resid norm 3.289874249015e-11 ||r(i)||/||b|| 
5.132149262140e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.660269437685e-03 0 KSP preconditioned resid norm 6.924908608255e-04 true resid norm 3.660269437685e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.102586703435e-19 true resid norm 5.776909107619e-17 ||r(i)||/||b|| 1.578274278976e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.189722567212e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 25 TS dt 0.00115146 time 0.115915 0 SNES Function norm 6.242220974044e+01 0 KSP preconditioned resid norm 1.831321862998e-01 true resid norm 6.242220974044e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 5.784659764084e-14 true resid norm 4.567955909760e-11 ||r(i)||/||b|| 7.317837559345e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.737974402477e-03 0 KSP preconditioned resid norm 7.005633544762e-04 true resid norm 3.737974402477e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.135871084928e-19 true resid norm 2.205112955719e-17 ||r(i)||/||b|| 5.899218984106e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 3.500063882493e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 26 TS dt 0.00119202 time 0.117067 0 SNES Function norm 6.072659080397e+01 0 KSP preconditioned resid norm 1.821977702436e-01 true resid norm 6.072659080397e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.649450945618e-15 true resid norm 6.484928032696e-12 ||r(i)||/||b|| 1.067889362278e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.811380158393e-03 0 KSP preconditioned resid norm 7.073032403448e-04 true resid norm 3.811380158393e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.611940683353e-19 true resid norm 7.622915736938e-18 ||r(i)||/||b|| 2.000040777919e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 6.615265306981e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 27 TS dt 0.00123419 time 0.118259 0 SNES Function norm 5.901609663706e+01 0 KSP preconditioned resid norm 1.809247115736e-01 true resid norm 5.901609663706e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.104367388907e-13 true resid norm 8.124204151411e-11 ||r(i)||/||b|| 1.376608182234e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.879166712963e-03 0 KSP preconditioned resid norm 7.126046375357e-04 true resid norm 3.879166712963e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.987497893592e-19 true resid norm 1.148153465287e-17 ||r(i)||/||b|| 2.959794074976e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.085651432805e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 28 TS dt 0.00127811 time 0.119493 0 SNES Function norm 5.729080957253e+01 0 KSP preconditioned resid norm 1.793104552447e-01 true resid norm 5.729080957253e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.392999707037e-15 true resid norm 1.411967987557e-12 ||r(i)||/||b|| 2.464562812241e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.941207272783e-03 0 KSP preconditioned resid norm 7.163891818081e-04 true resid norm 3.941207272783e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 
1.703899868095e-19 true resid norm 5.177584205515e-17 ||r(i)||/||b|| 1.313705128190e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.064483083775e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 29 TS dt 0.00132387 time 0.120771 0 SNES Function norm 5.555101902924e+01 0 KSP preconditioned resid norm 1.773454575890e-01 true resid norm 5.555101902924e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.972064780790e-14 true resid norm 6.146665917864e-11 ||r(i)||/||b|| 1.106490218411e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.996684389404e-03 0 KSP preconditioned resid norm 7.185434007291e-04 true resid norm 3.996684389404e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.465721333410e-19 true resid norm 1.631168355970e-17 ||r(i)||/||b|| 4.081303893534e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.283229430523e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 30 TS dt 0.00137162 time 0.122095 0 SNES Function norm 5.379729340554e+01 0 KSP preconditioned resid norm 1.750283358023e-01 true resid norm 5.379729340554e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.811597060842e-14 true resid norm 1.825982097628e-11 ||r(i)||/||b|| 3.394189525229e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.045111296257e-03 0 KSP preconditioned resid norm 7.189930688439e-04 true resid norm 4.045111296257e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.311083818389e-19 true resid norm 9.181589259933e-19 ||r(i)||/||b|| 2.269798922079e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 9.408608380963e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 31 TS dt 0.00142151 time 0.123466 0 SNES Function norm 5.203040668229e+01 0 KSP preconditioned resid norm 1.723565049165e-01 true resid norm 5.203040668229e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 9.938126289337e-14 true resid norm 6.328232306949e-11 ||r(i)||/||b|| 1.216256552748e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.085494349684e-03 0 KSP preconditioned resid norm 7.176026261847e-04 true resid norm 4.085494349684e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.638387173343e-19 true resid norm 1.710908942186e-18 ||r(i)||/||b|| 4.187764798447e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 4.153579340933e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 32 TS dt 0.00147373 time 0.124888 0 SNES Function norm 5.025136211091e+01 0 KSP preconditioned resid norm 1.693317456694e-01 true resid norm 5.025136211091e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.106016005333e-15 true resid norm 4.915227000456e-12 ||r(i)||/||b|| 9.781281131459e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.117640472250e-03 0 KSP preconditioned resid norm 7.144394783544e-04 true resid norm 4.117640472250e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.530419793902e-19 true resid norm 1.326671827479e-17 ||r(i)||/||b|| 3.221922449083e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 9.263376730208e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 33 TS dt 
0.00152845 time 0.126362 0 SNES Function norm 4.846135996532e+01 0 KSP preconditioned resid norm 1.659569984577e-01 true resid norm 4.846135996532e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.436476651685e-14 true resid norm 4.980799829233e-11 ||r(i)||/||b|| 1.027787877352e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.140830713226e-03 0 KSP preconditioned resid norm 7.093467854890e-04 true resid norm 4.140830713226e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.501317005013e-19 true resid norm 7.071505221069e-19 ||r(i)||/||b|| 1.707750379286e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.148278019285e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 34 TS dt 0.0015859 time 0.12789 0 SNES Function norm 4.666180524544e+01 0 KSP preconditioned resid norm 1.622382143288e-01 true resid norm 4.666180524544e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.667125473967e-14 true resid norm 9.402151340091e-12 ||r(i)||/||b|| 2.014956620438e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.154658034367e-03 0 KSP preconditioned resid norm 7.023795747495e-04 true resid norm 4.154658034367e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.938670869707e-19 true resid norm 6.966533905652e-17 ||r(i)||/||b|| 1.676800797569e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 6.216377104899e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 35 TS dt 0.00164633 time 0.129476 0 SNES Function norm 4.485429437695e+01 0 KSP preconditioned resid norm 1.581834566910e-01 true resid norm 4.485429437695e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.964743204615e-14 true resid norm 2.162539833426e-11 ||r(i)||/||b|| 4.821254828472e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.158463601977e-03 0 KSP preconditioned resid norm 6.933805584414e-04 true resid norm 4.158463601977e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.318408894905e-19 true resid norm 1.764824811177e-17 ||r(i)||/||b|| 4.243934731900e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 5.456729815151e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 36 TS dt 0.00171001 time 0.131122 0 SNES Function norm 4.304061849637e+01 0 KSP preconditioned resid norm 1.538034361741e-01 true resid norm 4.304061849637e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.376475633699e-15 true resid norm 2.261323269048e-12 ||r(i)||/||b|| 5.253928377537e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.151851542096e-03 0 KSP preconditioned resid norm 6.824445119600e-04 true resid norm 4.151851542096e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.425885470347e-19 true resid norm 6.765881539538e-18 ||r(i)||/||b|| 1.629605844751e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 4.907082171738e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 37 TS dt 0.00177725 time 0.132832 0 SNES Function norm 4.122275891406e+01 0 KSP preconditioned resid norm 1.491114728610e-01 true resid norm 4.122275891406e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 9.879683010818e-14 true resid norm 4.953097338782e-11 
||r(i)||/||b|| 1.201544357841e-12 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.134246528095e-03 0 KSP preconditioned resid norm 6.695777997764e-04 true resid norm 4.134246528095e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.938654180469e-19 true resid norm 3.802548863157e-17 ||r(i)||/||b|| 9.197682908641e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 7.512192105957e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 38 TS dt 0.00184841 time 0.134609 0 SNES Function norm 3.940288923310e+01 0 KSP preconditioned resid norm 1.441239669552e-01 true resid norm 3.940288923310e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.921749369609e-15 true resid norm 1.878506381996e-12 ||r(i)||/||b|| 4.767433095789e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.105386911109e-03 0 KSP preconditioned resid norm 6.547408316757e-04 true resid norm 4.105386911109e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.099997430517e-19 true resid norm 2.326129550499e-17 ||r(i)||/||b|| 5.666042204705e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 3.891555812789e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 39 TS dt 0.00192387 time 0.136458 0 SNES Function norm 3.758337419822e+01 0 KSP preconditioned resid norm 1.388603566832e-01 true resid norm 3.758337419822e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.014704471424e-14 true resid norm 4.583873045369e-12 ||r(i)||/||b|| 1.219654473064e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.064868073942e-03 0 KSP preconditioned resid norm 6.379428735820e-04 true resid norm 4.064868073942e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.911677916926e-19 true resid norm 4.441807574955e-17 ||r(i)||/||b|| 1.092731054036e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 4.439258388149e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 40 TS dt 0.00200407 time 0.138382 0 SNES Function norm 3.576677069546e+01 0 KSP preconditioned resid norm 1.333434778652e-01 true resid norm 3.576677069546e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.413357869609e-14 true resid norm 1.491936326800e-11 ||r(i)||/||b|| 4.171291670427e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.012090399066e-03 0 KSP preconditioned resid norm 6.192427197756e-04 true resid norm 4.012090399066e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.541488553138e-19 true resid norm 4.629529274763e-17 ||r(i)||/||b|| 1.153894557271e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.439774244093e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 41 TS dt 0.00208952 time 0.140386 0 SNES Function norm 3.395582676376e+01 0 KSP preconditioned resid norm 1.275996978903e-01 true resid norm 3.395582676376e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.972608198890e-14 true resid norm 1.651953896466e-11 ||r(i)||/||b|| 4.865008612394e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.947285643748e-03 0 KSP preconditioned resid norm 5.988107592318e-04 true resid norm 3.947285643748e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP 
preconditioned resid norm 2.019377490130e-19 true resid norm 1.896150833255e-17 ||r(i)||/||b|| 4.803682845345e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 8.980558247126e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 42 TS dt 0.0021808 time 0.142475 0 SNES Function norm 3.215348048566e+01 0 KSP preconditioned resid norm 1.216595009417e-01 true resid norm 3.215348048566e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.961934042537e-14 true resid norm 1.166803100820e-11 ||r(i)||/||b|| 3.628854740440e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.869587284369e-03 0 KSP preconditioned resid norm 5.765667069190e-04 true resid norm 3.869587284369e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.478861694077e-19 true resid norm 5.618495549418e-17 ||r(i)||/||b|| 1.451962479852e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 5.225800773743e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 43 TS dt 0.00227856 time 0.144656 0 SNES Function norm 3.036285693864e+01 0 KSP preconditioned resid norm 1.155572657223e-01 true resid norm 3.036285693864e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 6.773601838534e-15 true resid norm 2.538341718583e-12 ||r(i)||/||b|| 8.360022654366e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.779178319861e-03 0 KSP preconditioned resid norm 5.526633444314e-04 true resid norm 3.779178319861e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.431212750593e-19 true resid norm 2.744847316321e-17 ||r(i)||/||b|| 7.263079653837e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 5.269116807128e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 44 TS dt 0.00238356 time 0.146935 0 SNES Function norm 2.858726374370e+01 0 KSP preconditioned resid norm 1.093316282536e-01 true resid norm 2.858726374370e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.905253540379e-14 true resid norm 6.679589159154e-12 ||r(i)||/||b|| 2.336561210978e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.676067400328e-03 0 KSP preconditioned resid norm 5.272605141006e-04 true resid norm 3.676067400328e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.896877172859e-19 true resid norm 2.905273604835e-18 ||r(i)||/||b|| 7.903210927460e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.597158065852e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 45 TS dt 0.00249669 time 0.149318 0 SNES Function norm 2.683018435110e+01 0 KSP preconditioned resid norm 1.030256040720e-01 true resid norm 2.683018435110e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.267508658200e-14 true resid norm 4.180982721886e-12 ||r(i)||/||b|| 1.558313080213e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.560322764375e-03 0 KSP preconditioned resid norm 5.004521328076e-04 true resid norm 3.560322764375e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.166343257656e-19 true resid norm 9.909003964766e-18 ||r(i)||/||b|| 2.783175745726e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 3.472525569279e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE 
iterations 2 46 TS dt 0.00261899 time 0.151815 0 SNES Function norm 2.509526913408e+01 0 KSP preconditioned resid norm 9.668643676422e-02 true resid norm 2.509526913408e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.146288647649e-14 true resid norm 6.609305556959e-12 ||r(i)||/||b|| 2.633685863916e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.432214768438e-03 0 KSP preconditioned resid norm 4.723926942241e-04 true resid norm 3.432214768438e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.237223266951e-19 true resid norm 1.725890051207e-17 ||r(i)||/||b|| 5.028502490805e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.516272487751e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 47 TS dt 0.00275165 time 0.154434 0 SNES Function norm 2.338632383021e+01 0 KSP preconditioned resid norm 9.036537337732e-02 true resid norm 2.338632383021e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.341333492884e-14 true resid norm 9.509239568656e-12 ||r(i)||/||b|| 4.066154064100e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.291769810485e-03 0 KSP preconditioned resid norm 4.432844230175e-04 true resid norm 3.291769810485e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.073759784149e-19 true resid norm 1.980471144303e-17 ||r(i)||/||b|| 6.016432673981e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 8.952431611147e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 48 TS dt 0.00289613 time 0.157186 0 SNES Function norm 2.170729533126e+01 0 KSP preconditioned resid norm 8.411722708891e-02 true resid norm 2.170729533126e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.162495775354e-14 true resid norm 1.089668370292e-11 ||r(i)||/||b|| 5.019825610067e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.139851147417e-03 0 KSP preconditioned resid norm 4.133124605661e-04 true resid norm 3.139851147417e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.433212032625e-19 true resid norm 8.447065853874e-18 ||r(i)||/||b|| 2.690275894391e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.140224718289e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 49 TS dt 0.00305411 time 0.160082 0 SNES Function norm 2.006225462307e+01 0 KSP preconditioned resid norm 7.799945416890e-02 true resid norm 2.006225462307e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.701286170172e-14 true resid norm 4.046252981304e-12 ||r(i)||/||b|| 2.016848583235e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.976962965743e-03 0 KSP preconditioned resid norm 3.827457235829e-04 true resid norm 2.976962965743e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 9.937590588351e-20 true resid norm 1.622244350506e-17 ||r(i)||/||b|| 5.449326609615e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.736153035227e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 50 TS dt 0.00322766 time 0.163136 0 SNES Function norm 1.845537683280e+01 0 KSP preconditioned resid norm 7.207092681280e-02 true resid norm 1.845537683280e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.384624042607e-14 true resid 
norm 2.956604164574e-12 ||r(i)||/||b|| 1.602028607359e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.804276773421e-03 0 KSP preconditioned resid norm 3.517864257410e-04 true resid norm 2.804276773421e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.858297704886e-20 true resid norm 3.321537709925e-18 ||r(i)||/||b|| 1.184454309719e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 4.507890529838e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 51 TS dt 0.00341925 time 0.166363 0 SNES Function norm 1.689091828562e+01 0 KSP preconditioned resid norm 6.638984442798e-02 true resid norm 1.689091828562e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.157676701455e-14 true resid norm 5.965462317149e-12 ||r(i)||/||b|| 3.531757253380e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.622595393251e-03 0 KSP preconditioned resid norm 3.207041377302e-04 true resid norm 2.622595393251e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.158566462061e-19 true resid norm 1.300688949561e-17 ||r(i)||/||b|| 4.959548670405e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 9.122703788973e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 52 TS dt 0.0036319 time 0.169783 0 SNES Function norm 1.537319051967e+01 0 KSP preconditioned resid norm 6.101098059921e-02 true resid norm 1.537319051967e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.559892513249e-14 true resid norm 2.546643601844e-12 ||r(i)||/||b|| 1.656548521001e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.433493660026e-03 0 KSP preconditioned resid norm 2.897503342787e-04 true resid norm 2.433493660026e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.052507755130e-19 true resid norm 1.691019879489e-17 ||r(i)||/||b|| 6.948938915546e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 7.338712903651e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 53 TS dt 0.00386931 time 0.173415 0 SNES Function norm 1.390653112094e+01 0 KSP preconditioned resid norm 5.598187570756e-02 true resid norm 1.390653112094e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 7.830661622063e-15 true resid norm 1.081743029539e-12 ||r(i)||/||b|| 7.778669030625e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.238300115941e-03 0 KSP preconditioned resid norm 2.592387164739e-04 true resid norm 2.238300115941e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.598332398216e-19 true resid norm 1.271999357011e-17 ||r(i)||/||b|| 5.682881164825e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 8.518721735734e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 54 TS dt 0.00413608 time 0.177284 0 SNES Function norm 1.249527116146e+01 0 KSP preconditioned resid norm 5.133823523900e-02 true resid norm 1.249527116146e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.516687839476e-14 true resid norm 1.705228329645e-12 ||r(i)||/||b|| 1.364698938990e-13 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.039009524594e-03 0 KSP preconditioned resid norm 2.294361895217e-04 true resid norm 2.039009524594e-03 ||r(i)||/||b|| 
1.000000000000e+00 1 KSP preconditioned resid norm 1.336979964578e-19 true resid norm 2.684812172840e-17 ||r(i)||/||b|| 1.316723703571e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 5.222766661714e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 55 TS dt 0.00443797 time 0.18142 0 SNES Function norm 1.114369887796e+01 0 KSP preconditioned resid norm 4.709866884497e-02 true resid norm 1.114369887796e+01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 5.419400969193e-16 true resid norm 4.237126420545e-14 ||r(i)||/||b|| 3.802262127638e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.837527167418e-03 0 KSP preconditioned resid norm 2.006226791308e-04 true resid norm 1.837527167418e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.276455077933e-20 true resid norm 7.361573122474e-18 ||r(i)||/||b|| 4.006239065744e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 7.160817239367e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 56 TS dt 0.00478225 time 0.185858 0 SNES Function norm 9.856019247786e+00 0 KSP preconditioned resid norm 4.325975288364e-02 true resid norm 9.856019247786e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.374768273953e-14 true resid norm 8.762926950683e-13 ||r(i)||/||b|| 8.890939364441e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.636177258488e-03 0 KSP preconditioned resid norm 1.731037368735e-04 true resid norm 1.636177258488e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.171229523435e-20 true resid norm 1.241780574218e-18 ||r(i)||/||b|| 7.589523493105e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 3.792682661019e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 57 TS dt 0.00517824 time 0.19064 0 SNES Function norm 8.636308870812e+00 0 KSP preconditioned resid norm 3.979273021195e-02 true resid norm 8.636308870812e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.861363656288e-15 true resid norm 3.642223275540e-13 ||r(i)||/||b|| 4.217337904449e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.437422283677e-03 0 KSP preconditioned resid norm 1.471231468375e-04 true resid norm 1.437422283677e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.267993245473e-20 true resid norm 1.667989907412e-17 ||r(i)||/||b|| 1.160403540666e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 7.188693069763e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 58 TS dt 0.00563801 time 0.195818 0 SNES Function norm 7.488465594377e+00 0 KSP preconditioned resid norm 3.664329969779e-02 true resid norm 7.488465594377e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.674924374000e-15 true resid norm 7.806320331340e-14 ||r(i)||/||b|| 1.042445910041e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.244001212822e-03 0 KSP preconditioned resid norm 1.229480052855e-04 true resid norm 1.244001212822e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 7.166683435594e-20 true resid norm 1.947901126557e-18 ||r(i)||/||b|| 1.565835391862e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.995271673731e-11 Nonlinear solve converged due to 
CONVERGED_FNORM_RELATIVE iterations 2 59 TS dt 0.00617738 time 0.201456 0 SNES Function norm 6.416152382772e+00 0 KSP preconditioned resid norm 3.373572096240e-02 true resid norm 6.416152382772e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 7.237647103180e-15 true resid norm 3.376877578683e-14 ||r(i)||/||b|| 5.263088183115e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.058687426920e-03 0 KSP preconditioned resid norm 1.007867459623e-04 true resid norm 1.058687426920e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 7.957646690236e-20 true resid norm 9.797741821467e-18 ||r(i)||/||b|| 9.254612430762e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.114057033741e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 60 TS dt 0.00681743 time 0.207634 0 SNES Function norm 5.422735240105e+00 0 KSP preconditioned resid norm 3.098121468871e-02 true resid norm 5.422735240105e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.039149956249e-15 true resid norm 3.469696715088e-14 ||r(i)||/||b|| 6.398425446677e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 8.841297823670e-04 0 KSP preconditioned resid norm 8.082490574174e-05 true resid norm 8.841297823670e-04 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.523898442649e-20 true resid norm 1.203075254732e-18 ||r(i)||/||b|| 1.360745083727e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.588771819762e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 61 TS dt 0.00758669 time 0.214451 0 SNES Function norm 4.511215295964e+00 0 KSP preconditioned resid norm 2.828947721106e-02 true resid norm 4.511215295964e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.146461216638e-15 true resid norm 2.019847489533e-14 ||r(i)||/||b|| 4.477391028844e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 7.227818718433e-04 0 KSP preconditioned resid norm 6.319780499619e-05 true resid norm 7.227818718433e-04 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.544606931453e-20 true resid norm 6.172085956531e-18 ||r(i)||/||b|| 8.539348034269e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.693373617113e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 62 TS dt 0.00852442 time 0.222038 0 SNES Function norm 3.684155065880e+00 0 KSP preconditioned resid norm 2.558122719191e-02 true resid norm 3.684155065880e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.421399607295e-15 true resid norm 3.542724819742e-14 ||r(i)||/||b|| 9.616112124465e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 5.765794619693e-04 0 KSP preconditioned resid norm 4.799275039593e-05 true resid norm 5.765794619693e-04 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.917210017870e-20 true resid norm 2.275114665599e-18 ||r(i)||/||b|| 3.945882251560e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 4.080956893516e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 63 TS dt 0.00968588 time 0.230562 0 SNES Function norm 2.943597558153e+00 0 KSP preconditioned resid norm 2.279968946137e-02 true resid norm 2.943597558153e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 
1.688891915436e-15 true resid norm 4.658056432078e-14 ||r(i)||/||b|| 1.582436572954e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.469347227783e-04 0 KSP preconditioned resid norm 3.522459990639e-05 true resid norm 4.469347227783e-04 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.913271436278e-20 true resid norm 1.788470624725e-18 ||r(i)||/||b|| 4.001637227037e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.327601328128e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 64 TS dt 0.0111507 time 0.240248 0 SNES Function norm 2.290972258273e+00 0 KSP preconditioned resid norm 1.991943474887e-02 true resid norm 2.290972258273e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.041387078411e-16 true resid norm 4.490497948641e-15 ||r(i)||/||b|| 1.960083947951e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 3.344242266320e-04 0 KSP preconditioned resid norm 2.483429299262e-05 true resid norm 3.344242266320e-04 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.502871759649e-20 true resid norm 1.297541432592e-18 ||r(i)||/||b|| 3.879926540190e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 8.744519240953e-12 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 65 TS dt 0.0130373 time 0.251399 0 SNES Function norm 1.726972038604e+00 0 KSP preconditioned resid norm 1.695158091906e-02 true resid norm 1.726972038604e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.975791012103e-16 true resid norm 1.111729584800e-14 ||r(i)||/||b|| 6.437449825178e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.392592155875e-04 0 KSP preconditioned resid norm 1.668555019510e-05 true resid norm 2.392592155875e-04 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 7.088396216954e-21 true resid norm 5.496514286129e-19 ||r(i)||/||b|| 2.297305151918e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 6.579404416439e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 66 TS dt 0.0155288 time 0.264436 0 SNES Function norm 1.251369830009e+00 0 KSP preconditioned resid norm 1.394470950975e-02 true resid norm 1.251369830009e+00 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 4.815828028815e-16 true resid norm 1.213627164980e-14 ||r(i)||/||b|| 9.698389204178e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.613502496331e-04 0 KSP preconditioned resid norm 1.057030278717e-05 true resid norm 1.613502496331e-04 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 8.401765334436e-21 true resid norm 6.944496575239e-19 ||r(i)||/||b|| 4.303988739422e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.002960512866e-10 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 67 TS dt 0.0189222 time 0.279965 0 SNES Function norm 8.627296280253e-01 0 KSP preconditioned resid norm 1.098115744970e-02 true resid norm 8.627296280253e-01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.040524997250e-15 true resid norm 2.276783270205e-14 ||r(i)||/||b|| 2.639046111603e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.004787404905e-04 0 KSP preconditioned resid norm 6.217867528402e-06 true resid norm 1.004787404905e-04 
||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.769316671015e-21 true resid norm 8.855525476854e-20 ||r(i)||/||b|| 8.813332485679e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 3.760414731078e-12 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 68 TS dt 0.023731 time 0.298887 0 SNES Function norm 5.579721026980e-01 0 KSP preconditioned resid norm 8.168954839547e-03 true resid norm 5.579721026980e-01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.093014577302e-16 true resid norm 3.691584819710e-15 ||r(i)||/||b|| 6.616074176218e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 5.612090762488e-05 0 KSP preconditioned resid norm 3.316790793448e-06 true resid norm 5.612090762488e-05 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.389783265668e-21 true resid norm 1.405828116276e-19 ||r(i)||/||b|| 2.504998895728e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.107601957670e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 69 TS dt 0.0309094 time 0.322618 0 SNES Function norm 3.318078807622e-01 0 KSP preconditioned resid norm 5.630577405307e-03 true resid norm 3.318078807622e-01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 3.063267331490e-17 true resid norm 4.870469979815e-16 ||r(i)||/||b|| 1.467858439235e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.695020096883e-05 0 KSP preconditioned resid norm 1.546011182166e-06 true resid norm 2.695020096883e-05 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.793117635794e-21 true resid norm 1.357050217767e-19 ||r(i)||/||b|| 5.035399251146e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 9.864668651466e-11 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 70 TS dt 0.0423973 time 0.353528 0 SNES Function norm 1.761612327941e-01 0 KSP preconditioned resid norm 3.489583339177e-03 true resid norm 1.761612327941e-01 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.316658452387e-17 true resid norm 1.485569644921e-16 ||r(i)||/||b|| 8.433011175945e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.045148098580e-05 0 KSP preconditioned resid norm 5.929484721294e-07 true resid norm 1.045148098580e-05 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.828363770364e-21 true resid norm 7.752112884319e-20 ||r(i)||/||b|| 7.417238662014e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.326602314501e-13 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 71 TS dt 0.0625913 time 0.395925 0 SNES Function norm 7.983581767239e-02 0 KSP preconditioned resid norm 1.852676934474e-03 true resid norm 7.983581767239e-02 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.090500314767e-17 true resid norm 2.310106089883e-16 ||r(i)||/||b|| 2.893571027684e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 2.964796049083e-06 0 KSP preconditioned resid norm 1.693900381352e-07 true resid norm 2.964796049083e-06 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 6.407935088581e-22 true resid norm 2.488733292029e-20 ||r(i)||/||b|| 8.394281599232e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.264007744504e-13 Nonlinear solve 
converged due to CONVERGED_FNORM_RELATIVE iterations 2 72 TS dt 0.102961 time 0.458516 0 SNES Function norm 2.877453190320e-02 0 KSP preconditioned resid norm 7.782387666098e-04 true resid norm 2.877453190320e-02 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 9.420699403324e-18 true resid norm 6.243727619192e-17 ||r(i)||/||b|| 2.169879822961e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 5.249450920420e-07 0 KSP preconditioned resid norm 3.065901539711e-08 true resid norm 5.249450920420e-07 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 6.507185626458e-22 true resid norm 1.958685721627e-20 ||r(i)||/||b|| 3.731220181539e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 2.763153167418e-13 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 73 TS dt 0.199589 time 0.561477 0 SNES Function norm 7.357133041852e-03 0 KSP preconditioned resid norm 2.272827711081e-04 true resid norm 7.357133041852e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.560769044703e-18 true resid norm 1.055220659460e-17 ||r(i)||/||b|| 1.434282421505e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 4.484151797034e-08 0 KSP preconditioned resid norm 2.698822392308e-09 true resid norm 4.484151797034e-08 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 1.024441048047e-23 true resid norm 2.694593621840e-22 ||r(i)||/||b|| 6.009148984702e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.410929782356e-13 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 74 TS dt 0.499313 time 0.761066 0 SNES Function norm 1.108959982369e-03 0 KSP preconditioned resid norm 3.768770633683e-05 true resid norm 1.108959982369e-03 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.160988763768e-19 true resid norm 1.513119898651e-18 ||r(i)||/||b|| 1.364449504677e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.231447237523e-09 0 KSP preconditioned resid norm 7.687789000002e-11 true resid norm 1.231447237523e-09 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.618392523208e-25 true resid norm 6.085315417000e-24 ||r(i)||/||b|| 4.941596547199e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.013971887346e-13 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 75 TS dt 1.86355 time 1.26038 0 SNES Function norm 7.351115437016e-05 0 KSP preconditioned resid norm 2.626294379687e-06 true resid norm 7.351115437016e-05 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 6.595461681967e-20 true resid norm 4.399208838926e-19 ||r(i)||/||b|| 5.984409953317e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 5.250183361938e-12 0 KSP preconditioned resid norm 3.428226842422e-13 true resid norm 5.250183361938e-12 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.851316643091e-27 true resid norm 6.153997546811e-26 ||r(i)||/||b|| 1.172149070340e-14 Linear solve converged due to CONVERGED_RTOL iterations 1 2 SNES Function norm 1.260177193847e-13 Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 2 76 TS dt 12.9794 time 3.12393 0 SNES Function norm 1.372480987605e-06 0 KSP preconditioned resid norm 4.983377937858e-08 true resid norm 1.372480987605e-06 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 
1.459814669632e-22 true resid norm 5.715488068778e-22 ||r(i)||/||b|| 4.164347718035e-16 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.071720032080e-13 Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 1 77 TS dt 129.794 time 16.1033 0 SNES Function norm 3.738924773290e-09 0 KSP preconditioned resid norm 1.360931234224e-10 true resid norm 3.738924773290e-09 ||r(i)||/||b|| 1.000000000000e+00 1 KSP preconditioned resid norm 2.483867357913e-24 true resid norm 1.443180452160e-23 ||r(i)||/||b|| 3.859880954197e-15 Linear solve converged due to CONVERGED_RTOL iterations 1 1 SNES Function norm 1.423043453910e-13 Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 1 78 TS dt 1297.94 time 145.897 Post-processing ... Done! Checking results ... Pressure eigenvalue = -100.375, -13.6622(kg/m^3/s^2) Estimated strain-rate = 20.0375, 6.67916(s^-1) Steady-state profile: Axial velocity: Vec Object: 1 MPI processes type: seq 5. 4.94981 4.79925 4.54831 4.197 3.74531 3.19325 2.54082 1.78811 0.936514 -6.10577e-17 -0.936514 -1.78811 -2.54082 -3.19325 -3.74531 -4.197 -4.54831 -4.79925 -4.94981 -5. Spread rate: Vec Object: 1 MPI processes type: seq 3.61321e-25 1.00375 2.0075 3.01125 4.015 5.01875 6.0225 7.02613 8.0279 9.00411 9.72618 9.00411 8.0279 7.02613 6.0225 5.01875 4.015 3.01125 2.0075 1.00375 0. From pierre.seize at onera.fr Wed Oct 27 02:15:53 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Wed, 27 Oct 2021 09:15:53 +0200 Subject: [petsc-users] Question regarding DMPlex reordering In-Reply-To: References: <01e3a622-8561-93b4-4ebd-25331bd93486@onera.fr> Message-ID: On 26/10/21 22:28, Matthew Knepley wrote: > On Tue, Oct 26, 2021 at 10:17 AM Pierre Seize > wrote: > > Hi, I had the idea to try and renumber my mesh cells, as I've > heard it's better: "neighbouring cells are stored next to one > another, and memory access are faster". > > Right now, I load the mesh then I distribute it over the > processes. I thought I'd try to permute the numbering between > those two steps : > > DMPlexCreateFromFile > DMPlexGetOrdering > DMPlexPermute > DMPlexDistribute > > but that gives me an error when it runs on more than one process: > > [0]PETSC ERROR: --------------------- Error Message > -------------------------------------------------------------- > [0]PETSC ERROR: No support for this operation for this object type > [0]PETSC ERROR: Number of dofs for point 0 in the local section > should be positive > [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble > shooting. > [0]PETSC ERROR: Petsc Release Version 3.16.0, unknown > [0]PETSC ERROR: ./build/bin/yanss on a? 
named ldmpe202z.onera by > pseize Tue Oct 26 16:03:33 2021 > [0]PETSC ERROR: Configure options --PETSC_ARCH=arch-ld-gcc > --download-metis --download-parmetis --prefix=~/.local --with-cgns > [0]PETSC ERROR: #1 PetscPartitionerDMPlexPartition() at > /stck/pseize/softwares/petsc/src/dm/impls/plex/plexpartition.c:720 > [0]PETSC ERROR: #2 DMPlexDistribute() at > /stck/pseize/softwares/petsc/src/dm/impls/plex/plexdistribute.c:1630 > [0]PETSC ERROR: #3 MeshLoadFromFile() at src/spatial.c:689 > [0]PETSC ERROR: #4 main() at src/main.c:22 > [0]PETSC ERROR: PETSc Option Table entries: > [0]PETSC ERROR: -draw_comp 0 > [0]PETSC ERROR: -mesh data/box.msh > [0]PETSC ERROR: -mesh_view draw > [0]PETSC ERROR: -riemann anrs > [0]PETSC ERROR: -ts_max_steps 100 > [0]PETSC ERROR: -vec_view_partition > [0]PETSC ERROR: ----------------End of Error Message -------send > entire error message to petsc-maint at mcs.anl.gov > ---------- > > I checked and before I tried to reorder the mesh, the > dm->localSection was NULL before entering DMPlexDistribute, and I > was able to fix the error with DMSetLocalSection(dm, NULL) after > DMPlexPermute, but it doesn't seems it's the right way to do what > I want. Does someone have any advice ? > > Oh, this is probably me trying to be too clever. If a local section is > defined, then I try to use the number of dofs in it to load balance > better. > There should never be a negative number of dofs in the local section > (a global section uses this to indicate a dof?owned by another process). > So eliminating the local section will definitely fix that error. > > Now the question of how you got a local section. DMPlexPermute()? does > not create one, so it seems like you had one ahead of time, and that > the values were not valid. DMPlexPermute calls DMGetLocalSection, which creates dm->localSection if it's NULL, so before DMPlexPermute my dm->localSection is NULL, and after it is set. Because of that I enter the if in src/dm/impls/plex/plexpartition.c:707 and then I got the error. If i have a "wrong" dm->localSection, I think it has to come from DMPlexPermute. > Note that you can probably get rid of some of the loading code using > > ? DMCreate(comm, &dm); > ? DMSetType(dm, DMPLEX); > ? DMSetFromOptions(dm); > ? DMViewFromOptions(dm, NULL, "-mesh_view"); > > and use > > ? -dm_plex_filename databox,msh -mesh_view My loading code is already small, but just to make sure I wrote this minimal example: int main(int argc, char **argv){ ? PetscErrorCode ierr; ? ierr = PetscInitialize(&argc, &argv, NULL, help); if (ierr) return ierr; ? DM dm, foo_dm; ? ierr = DMCreate(PETSC_COMM_WORLD, &dm); CHKERRQ(ierr); ? ierr = DMSetType(dm, DMPLEX); CHKERRQ(ierr); ? ierr = DMSetFromOptions(dm); CHKERRQ(ierr); ? IS perm; ? ierr = DMPlexGetOrdering(dm, NULL, NULL, &perm); CHKERRQ(ierr); ? ierr = DMPlexPermute(dm, perm, &foo_dm); CHKERRQ(ierr); ? if (foo_dm) { ??? ierr = DMDestroy(&dm); CHKERRQ(ierr); ??? dm = foo_dm; ? } ? ierr = DMPlexDistribute(dm, 2, NULL, &foo_dm); CHKERRQ(ierr); ? if (foo_dm) { ??? ierr = DMDestroy(&dm); CHKERRQ(ierr); ??? dm = foo_dm; ? } ? ierr = ISDestroy(&perm); CHKERRQ(ierr); ? ierr = DMDestroy(&dm); CHKERRQ(ierr); ? ierr = PetscFinalize(); ? return ierr; } ran with mpiexec -n 2 ./build/bin/yanss -dm_plex_filename data/box.msh. The mesh is a 2D box from GMSH but I've got the same result with any mesh I've tried. It runs fine with 1 process but gives the previous error for more processes. Pierre -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From yuanxi at advancesoft.jp Wed Oct 27 03:49:39 2021 From: yuanxi at advancesoft.jp (=?UTF-8?B?6KKB54WV?=) Date: Wed, 27 Oct 2021 17:49:39 +0900 Subject: [petsc-users] How to construct DMPlex of cells with different topological dimension? Message-ID: Hi, I am trying to parallelize my serial FEM program using PETSc. This program calculates structure deformation by using various types of elements such as solid, shell, beam, and truss. At the very beginning, I found it was hard for me to put such kinds of elements into DMPlex. Because solid elements are topologically three dimensional, shell element two, and beam or truss are topologically one-dimensional elements. After reading chapter 2.10: "DMPlex: Unstructured Grids in PETSc" of users manual carefully, I found the provided functions, such as DMPlexSetCone, cannot declare those topological differences. My question is : Is it possible and how to define all those topologically different elements into a DMPlex struct? Thanks in advance! Best regards, Yuan. -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Wed Oct 27 05:27:43 2021 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 27 Oct 2021 06:27:43 -0400 Subject: [petsc-users] How to construct DMPlex of cells with different topological dimension? In-Reply-To: References: Message-ID: On Wed, Oct 27, 2021 at 4:50 AM ?? wrote: > Hi, > > I am trying to parallelize my serial FEM program using PETSc. This program > calculates structure deformation by using various types of elements such as > solid, shell, beam, and truss. At the very beginning, I found it was hard > for me to put such kinds of elements into DMPlex. Because solid elements > are topologically three dimensional, shell element two, and beam or truss > are topologically one-dimensional elements. After reading chapter 2.10: > "DMPlex: Unstructured Grids in PETSc" of users manual carefully, I found > the provided functions, such as DMPlexSetCone, cannot declare those > topological differences. > > My question is : Is it possible and how to define all those topologically > different elements into a DMPlex struct? > Yes. The idea is to program in a dimension-independent way, so that the code can handle cells of any dimension. What you probably want is the "depth" in the DAG representation, which you can think of as the dimension of a cell. https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetPointDepth.html#DMPlexGetPointDepth Thanks, Matt > Thanks in advance! > > Best regards, > > Yuan. > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From Eric.Chamberland at giref.ulaval.ca Wed Oct 27 07:29:03 2021 From: Eric.Chamberland at giref.ulaval.ca (Eric Chamberland) Date: Wed, 27 Oct 2021 08:29:03 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? In-Reply-To: References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca> Message-ID: <12e32ebb-61ed-6a8c-ab77-2841090ba5fe@giref.ulaval.ca> Hi Matthew, the smallest mesh which crashes the code is a 2x5 mesh: See the modified ex44.c With smaller meshes(2x2, 2x4, etc), it passes...? 
But it bugs latter when I try to use DMPlexNaturalToGlobalBegin but let's keep that other problem for later... Thanks a lot for helping digging into this! :) Eric (sorry if you received this for a? 2nd times, I have trouble with my mail) On 2021-10-26 4:35 p.m., Matthew Knepley wrote: > On Tue, Oct 26, 2021 at 1:35 PM Eric Chamberland > > wrote: > > Here is a screenshot of the partition I hard coded (top) and > vertices/element numbers (down): > > I have not yet modified the ex44.c example to properly assign the > coordinates... > > (but I would not have done it like it is in the last version > because the sCoords array is the global array with global vertices > number) > > I will have time to do this tomorrow... > > Maybe I can first try to reproduce all this with a smaller mesh? > > > That might make it easier to find a problem. > > ? Thanks! > > ? ? ?Matt > > Eric > > On 2021-10-26 9:46 a.m., Matthew Knepley wrote: >> Okay, I ran it. Something seems off with the mesh. First, I >> cannot simply explain the partition. The number of shared >> vertices and edges >> does not seem to come from a straight cut. Second, the mesh look >> scrambled on output. >> >> ? Thanks, >> >> ? ? Matt >> >> On Sun, Oct 24, 2021 at 11:49 PM Eric Chamberland >> > > wrote: >> >> Hi Matthew, >> >> ok, I started back from your ex44.c example and added the >> global array of coordinates.? I just have to code the >> creation of the local coordinates now. >> >> Eric >> >> On 2021-10-20 6:55 p.m., Matthew Knepley wrote: >>> On Wed, Oct 20, 2021 at 3:06 PM Eric Chamberland >>> >> > wrote: >>> >>> Hi Matthew, >>> >>> we tried to reproduce the error in a simple example. >>> >>> The context is the following: We hard coded the mesh and >>> initial partition into the code (see sConnectivity and >>> sInitialPartition) for 2 ranks and try to create a >>> section in order to use the DMPlexNaturalToGlobalBegin >>> function to retreive our initial element numbers. >>> >>> Now the call to DMPlexDistribute give different errors >>> depending on what type of component we ask the field to >>> be created.? For our objective, we would like a global >>> field to be created on elements only (like a P0 >>> interpolation). >>> >>> We now have the following error generated: >>> >>> [0]PETSC ERROR: --------------------- Error Message >>> -------------------------------------------------------------- >>> [0]PETSC ERROR: Petsc has generated inconsistent data >>> [0]PETSC ERROR: Inconsistency in indices, 18 should be 17 >>> [0]PETSC ERROR: See >>> https://www.mcs.anl.gov/petsc/documentation/faq.html >>> >>> for trouble shooting. >>> [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar 30, 2021 >>> [0]PETSC ERROR: ./bug on a? named rohan by ericc Wed Oct >>> 20 14:52:36 2021 >>> [0]PETSC ERROR: Configure options >>> --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 >>> --with-mpi-compilers=1 >>> --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 >>> --with-cxx-dialect=C++14 --with-make-np=12 >>> --with-shared-libraries=1 --with-debugging=yes >>> --with-memalign=64 --with-visibility=0 >>> --with-64-bit-indices=0 --download-ml=yes >>> --download-mumps=yes --download-superlu=yes >>> --download-hpddm=yes --download-slepc=yes >>> --download-superlu_dist=yes --download-parmetis=yes >>> --download-ptscotch=yes --download-metis=yes >>> --download-strumpack=yes --download-suitesparse=yes >>> --download-hypre=yes >>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. 
>>> --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >>> --with-scalapack=1 >>> --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include >>> --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>> -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" >>> [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at >>> /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 >>> [0]PETSC ERROR: #2 DMPlexCreateGlobalToNaturalSF() at >>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 >>> [0]PETSC ERROR: #3 DMPlexDistribute() at >>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 >>> [0]PETSC ERROR: #4 main() at bug_section.cc:159 >>> [0]PETSC ERROR: No PETSc Option Table entries >>> [0]PETSC ERROR: ----------------End of Error Message >>> -------send entire error message to >>> petsc-maint at mcs.anl.gov >>> ---------- >>> >>> Hope the attached code is self-explaining, note that to >>> make it short, we have not included the final part of >>> it, just the buggy part we are encountering right now... >>> >>> Thanks for your insights, >>> >>> Thanks for making the example. I tweaked it slightly. I put >>> in a test case that just makes a parallel 7 x 10 quad mesh. >>> This works >>> fine. Thus I think it must be something connected with the >>> original mesh. It is hard to get a handle on it without the >>> coordinates. >>> Do you think you could put the coordinate array in? I have >>> added the code to load them (see attached file). >>> >>> ? Thanks, >>> >>> ? ? ?Matt >>> >>> Eric >>> >>> On 2021-10-06 9:23 p.m., Matthew Knepley wrote: >>>> On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland >>>> >>> > wrote: >>>> >>>> Hi Matthew, >>>> >>>> we tried to use that.? Now, we discovered that: >>>> >>>> 1- even if we "ask" for sfNatural creation with >>>> DMSetUseNatural, it is not created because >>>> DMPlexCreateGlobalToNaturalSF looks for a >>>> "section": this is not documented in >>>> DMSetUseNaturalso we are asking ourselfs: "is this >>>> a permanent feature or a temporary situation?" >>>> >>>> I think explaining this will help clear up a lot. >>>> >>>> What the Natural2Global?map does is permute a solution >>>> vector into the ordering that it would have had prior >>>> to mesh distribution. >>>> Now, in order to do this permutation, I need to know >>>> the original (global) data layout. If it is not >>>> specified _before_ distribution, we >>>> cannot build the permutation.? The section describes >>>> the data layout, so I need it before distribution. >>>> >>>> I cannot think of another way that you would implement >>>> this, but if you want something else, let me know. >>>> >>>> 2- We then tried to create a "section" in different >>>> manners: we took the code into the example >>>> petsc/src/dm/impls/plex/tests/ex15.c. However, we >>>> ended up with a segfault: >>>> >>>> corrupted size vs. 
prev_size >>>> [rohan:07297] *** Process received signal *** >>>> [rohan:07297] Signal: Aborted (6) >>>> [rohan:07297] Signal code: (-6) >>>> [rohan:07297] [ 0] >>>> /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] >>>> [rohan:07297] [ 1] >>>> /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] >>>> [rohan:07297] [ 2] >>>> /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] >>>> [rohan:07297] [ 3] >>>> /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] >>>> [rohan:07297] [ 4] >>>> /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] >>>> [rohan:07297] [ 5] >>>> /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] >>>> [rohan:07297] [ 6] >>>> /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] >>>> [rohan:07297] [ 7] >>>> /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] >>>> [rohan:07297] [ 8] >>>> /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] >>>> [rohan:07297] [ 9] >>>> /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] >>>> [rohan:07297] [10] >>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] >>>> [rohan:07297] [11] >>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] >>>> [rohan:07297] [12] >>>> /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] >>>> [rohan:07297] [13] >>>> /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] >>>> >>>> [rohan:07297] [14] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] >>>> [rohan:07297] [15] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] >>>> [rohan:07297] [16] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] >>>> [rohan:07297] [17] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] >>>> [rohan:07297] [18] >>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] >>>> >>>> I am not sure what happened here, but if you could send >>>> a sample code, I will figure it out. >>>> >>>> If we do not create a section, the call to >>>> DMPlexDistribute is successful, but >>>> DMPlexGetGlobalToNaturalSF return a null SF pointer... >>>> >>>> Yes, it just ignores it in this case because it does >>>> not have a global layout. >>>> >>>> Here are the operations we are calling ( this is >>>> almost the code we are using, I just removed >>>> verifications and creation of the connectivity >>>> which use our parallel structure and code): >>>> >>>> =========== >>>> >>>> ? PetscInt* lCells????? = 0; >>>> ? PetscInt? lNumCorners = 0; >>>> ? PetscInt? lDimMail??? = 0; >>>> ? PetscInt? lnumCells?? = 0; >>>> >>>> ? //At this point we create the cells for PETSc >>>> expected input for DMPlexBuildFromCellListParallel >>>> and set lNumCorners, lDimMail and lnumCells to >>>> correct values. >>>> ? ... >>>> >>>> ? DM?????? lDMBete = 0 >>>> DMPlexCreate(lMPIComm,&lDMBete); >>>> >>>> ? DMSetDimension(lDMBete, lDimMail); >>>> >>>> DMPlexBuildFromCellListParallel(lDMBete, >>>> ????????????????????????????????? lnumCells, >>>> ????????????????????????????????? PETSC_DECIDE, >>>> pLectureElementsLocaux.reqNbTotalSommets(), >>>> ????????????????????????????????? lNumCorners, >>>> ????????????????????????????????? lCells, >>>> ????????????????????????????????? PETSC_NULL); >>>> >>>> ? DM lDMBeteInterp = 0; >>>> ? DMPlexInterpolate(lDMBete, &lDMBeteInterp); >>>> ? 
DMDestroy(&lDMBete); >>>> ? lDMBete = lDMBeteInterp; >>>> >>>> DMSetUseNatural(lDMBete,PETSC_TRUE); >>>> >>>> ? PetscSF lSFMigrationSansOvl = 0; >>>> ? PetscSF lSFMigrationOvl = 0; >>>> ? DM lDMDistribueSansOvl = 0; >>>> ? DM lDMAvecOverlap = 0; >>>> >>>> ? PetscPartitioner lPart; >>>> DMPlexGetPartitioner(lDMBete, &lPart); >>>> PetscPartitionerSetFromOptions(lPart); >>>> >>>> ? PetscSection?? section; >>>> ? PetscInt?????? numFields = 1; >>>> ? PetscInt?????? numBC = 0; >>>> ? PetscInt?????? numComp[1] = {1}; >>>> ? PetscInt?????? numDof[4] = {1, 0, 0, 0}; >>>> ? PetscInt?????? bcFields[1] = {0}; >>>> ? IS???????????? bcPoints[1] = {NULL}; >>>> >>>> ? DMSetNumFields(lDMBete, numFields); >>>> >>>> DMPlexCreateSection(lDMBete, NULL, numComp, numDof, >>>> numBC, bcFields, bcPoints, NULL, NULL, §ion); >>>> ? DMSetLocalSection(lDMBete, section); >>>> >>>> ? DMPlexDistribute(lDMBete, 0, >>>> &lSFMigrationSansOvl, &lDMDistribueSansOvl); // >>>> segfault! >>>> >>>> =========== >>>> >>>> So we have other question/remarks: >>>> >>>> 3- Maybe PETSc expect something specific that is >>>> missing/not verified: for example, we didn't gave >>>> any coordinates since we just want to partition and >>>> compute overlap for the mesh... and then recover >>>> our element numbers in a "simple way" >>>> >>>> 4- We are telling ourselves it is somewhat a "big >>>> price to pay" to have to build an unused section to >>>> have the global to natural ordering set ?? Could >>>> this requirement be avoided? >>>> >>>> I don't think so. There would have to be _some_ way of >>>> describing your data layout in terms of mesh points, >>>> and I do not see how you could use less memory doing that. >>>> >>>> 5- Are there any improvement towards our usages in >>>> 3.16 release? >>>> >>>> Let me try and run the code above. >>>> >>>> ? Thanks, >>>> >>>> ? ? ?Matt >>>> >>>> Thanks, >>>> >>>> Eric >>>> >>>> >>>> On 2021-09-29 7:39 p.m., Matthew Knepley wrote: >>>>> On Wed, Sep 29, 2021 at 5:18 PM Eric Chamberland >>>>> >>>> > wrote: >>>>> >>>>> Hi, >>>>> >>>>> I come back with _almost_ the original question: >>>>> >>>>> I would like to add an integer information >>>>> (*our* original element >>>>> number, not petsc one) on each element of the >>>>> DMPlex I create with >>>>> DMPlexBuildFromCellListParallel. >>>>> >>>>> I would like this interger to be distribruted >>>>> by or the same way >>>>> DMPlexDistribute distribute the mesh. >>>>> >>>>> Is it possible to do this? >>>>> >>>>> >>>>> I think we already have support for what you want. >>>>> If you call >>>>> >>>>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html >>>>> >>>>> >>>>> before DMPlexDistribute(), it will compute a >>>>> PetscSF encoding the global to natural map. You >>>>> can get it with >>>>> >>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html >>>>> >>>>> >>>>> and use it with >>>>> >>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html >>>>> >>>>> >>>>> Is this sufficient? >>>>> >>>>> ? Thanks, >>>>> >>>>> ? ? ?Matt >>>>> >>>>> Thanks, >>>>> >>>>> Eric >>>>> >>>>> On 2021-07-14 1:18 p.m., Eric Chamberland wrote: >>>>> > Hi, >>>>> > >>>>> > I want to use DMPlexDistribute from PETSc >>>>> for computing overlapping >>>>> > and play with the different partitioners >>>>> supported. >>>>> > >>>>> > However, after calling DMPlexDistribute, I >>>>> noticed the elements are >>>>> > renumbered and then the original number is lost. 
>>>>> > >>>>> > What would be the best way to keep track of >>>>> the element renumbering? >>>>> > >>>>> > a) Adding an optional parameter to let the >>>>> user retrieve a vector or >>>>> > "IS" giving the old number? >>>>> > >>>>> > b) Adding a DMLabel (seems a wrong good >>>>> solution) >>>>> > >>>>> > c) Other idea? >>>>> > >>>>> > Of course, I don't want to loose >>>>> performances with the need of this >>>>> > "mapping"... >>>>> > >>>>> > Thanks, >>>>> > >>>>> > Eric >>>>> > >>>>> -- >>>>> Eric Chamberland, ing., M. Ing >>>>> Professionnel de recherche >>>>> GIREF/Universit? Laval >>>>> (418) 656-2131 poste 41 22 42 >>>>> >>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before >>>>> they begin their experiments is infinitely more >>>>> interesting than any results to which their >>>>> experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://www.cse.buffalo.edu/~knepley/ >>>>> >>>> >>>> -- >>>> Eric Chamberland, ing., M. Ing >>>> Professionnel de recherche >>>> GIREF/Universit? Laval >>>> (418) 656-2131 poste 41 22 42 >>>> >>>> >>>> >>>> -- >>>> What most experimenters take for granted before they >>>> begin their experiments is infinitely more interesting >>>> than any results to which their experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>> >>> -- >>> Eric Chamberland, ing., M. Ing >>> Professionnel de recherche >>> GIREF/Universit? Laval >>> (418) 656-2131 poste 41 22 42 >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin >>> their experiments is infinitely more interesting than any >>> results to which their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >> >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to >> which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -- Eric Chamberland, ing., M. Ing Professionnel de recherche GIREF/Universit? Laval (418) 656-2131 poste 41 22 42 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: hbnbhlbilhmjdpfg.png Type: image/png Size: 42972 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: eejjfmbjimlkboec.png Type: image/png Size: 87901 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ex44.c Type: text/x-csrc Size: 11543 bytes Desc: not available URL: From knepley at gmail.com Wed Oct 27 08:03:39 2021 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 27 Oct 2021 09:03:39 -0400 Subject: [petsc-users] Question regarding DMPlex reordering In-Reply-To: References: <01e3a622-8561-93b4-4ebd-25331bd93486@onera.fr> Message-ID: On Wed, Oct 27, 2021 at 3:15 AM Pierre Seize wrote: > > > On 26/10/21 22:28, Matthew Knepley wrote: > > On Tue, Oct 26, 2021 at 10:17 AM Pierre Seize > wrote: > >> Hi, I had the idea to try and renumber my mesh cells, as I've heard it's >> better: "neighbouring cells are stored next to one another, and memory >> access are faster". >> Right now, I load the mesh then I distribute it over the processes. I >> thought I'd try to permute the numbering between those two steps : >> >> DMPlexCreateFromFile >> DMPlexGetOrdering >> DMPlexPermute >> DMPlexDistribute >> >> but that gives me an error when it runs on more than one process: >> >> [0]PETSC ERROR: --------------------- Error Message >> -------------------------------------------------------------- >> [0]PETSC ERROR: No support for this operation for this object type >> [0]PETSC ERROR: Number of dofs for point 0 in the local section should be >> positive >> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting. >> [0]PETSC ERROR: Petsc Release Version 3.16.0, unknown >> [0]PETSC ERROR: ./build/bin/yanss on a named ldmpe202z.onera by pseize >> Tue Oct 26 16:03:33 2021 >> [0]PETSC ERROR: Configure options --PETSC_ARCH=arch-ld-gcc >> --download-metis --download-parmetis --prefix=~/.local --with-cgns >> [0]PETSC ERROR: #1 PetscPartitionerDMPlexPartition() at >> /stck/pseize/softwares/petsc/src/dm/impls/plex/plexpartition.c:720 >> [0]PETSC ERROR: #2 DMPlexDistribute() at >> /stck/pseize/softwares/petsc/src/dm/impls/plex/plexdistribute.c:1630 >> [0]PETSC ERROR: #3 MeshLoadFromFile() at src/spatial.c:689 >> [0]PETSC ERROR: #4 main() at src/main.c:22 >> [0]PETSC ERROR: PETSc Option Table entries: >> [0]PETSC ERROR: -draw_comp 0 >> [0]PETSC ERROR: -mesh data/box.msh >> [0]PETSC ERROR: -mesh_view draw >> [0]PETSC ERROR: -riemann anrs >> [0]PETSC ERROR: -ts_max_steps 100 >> [0]PETSC ERROR: -vec_view_partition >> [0]PETSC ERROR: ----------------End of Error Message -------send entire >> error message to petsc-maint at mcs.anl.gov---------- >> >> I checked and before I tried to reorder the mesh, the dm->localSection >> was NULL before entering DMPlexDistribute, and I was able to fix the >> error with DMSetLocalSection(dm, NULL) after DMPlexPermute, but it >> doesn't seems it's the right way to do what I want. Does someone have any >> advice ? >> > Oh, this is probably me trying to be too clever. If a local section is > defined, then I try to use the number of dofs in it to load balance better. > There should never be a negative number of dofs in the local section (a > global section uses this to indicate a dof owned by another process). > So eliminating the local section will definitely fix that error. > > Now the question of how you got a local section. DMPlexPermute() does not > create one, so it seems like you had one ahead of time, and that > the values were not valid. > > > DMPlexPermute calls DMGetLocalSection, which creates dm->localSection if > it's NULL, so before DMPlexPermute my dm->localSection is NULL, and after > it is set. Because of that I enter the if in > src/dm/impls/plex/plexpartition.c:707 and then I got the error. 
> If i have a "wrong" dm->localSection, I think it has to come from > DMPlexPermute. > > Note that you can probably get rid of some of the loading code using > > DMCreate(comm, &dm); > DMSetType(dm, DMPLEX); > DMSetFromOptions(dm); > DMViewFromOptions(dm, NULL, "-mesh_view"); > > and use > > -dm_plex_filename databox,msh -mesh_view > > > My loading code is already small, but just to make sure I wrote this > minimal example: > > int main(int argc, char **argv){ > PetscErrorCode ierr; > > ierr = PetscInitialize(&argc, &argv, NULL, help); if (ierr) return ierr; > > DM dm, foo_dm; > ierr = DMCreate(PETSC_COMM_WORLD, &dm); CHKERRQ(ierr); > ierr = DMSetType(dm, DMPLEX); CHKERRQ(ierr); > ierr = DMSetFromOptions(dm); CHKERRQ(ierr); > > IS perm; > ierr = DMPlexGetOrdering(dm, NULL, NULL, &perm); CHKERRQ(ierr); > ierr = DMPlexPermute(dm, perm, &foo_dm); CHKERRQ(ierr); > if (foo_dm) { > ierr = DMDestroy(&dm); CHKERRQ(ierr); > dm = foo_dm; > } > ierr = DMPlexDistribute(dm, 2, NULL, &foo_dm); CHKERRQ(ierr); > if (foo_dm) { > ierr = DMDestroy(&dm); CHKERRQ(ierr); > dm = foo_dm; > } > > ierr = ISDestroy(&perm); CHKERRQ(ierr); > ierr = DMDestroy(&dm); CHKERRQ(ierr); > ierr = PetscFinalize(); > return ierr; > } > > ran with mpiexec -n 2 ./build/bin/yanss -dm_plex_filename data/box.msh. > The mesh is a 2D box from GMSH but I've got the same result with any mesh > I've tried. It runs fine with 1 process but gives the previous error for > more processes. > Hi Pierre, You are right. This is my bug. Here is the fix: https://gitlab.com/petsc/petsc/-/merge_requests/4504 Is it possible to try this branch? Thanks, Matt > Pierre > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Wed Oct 27 08:43:03 2021 From: bsmith at petsc.dev (Barry Smith) Date: Wed, 27 Oct 2021 09:43:03 -0400 Subject: [petsc-users] Strange behavior of TS after setting hand-coded Jacobian In-Reply-To: References: <0C6ACBF3-F457-4BFD-AD19-8C455444748F@petsc.dev> Message-ID: <9CC15214-4601-4554-808F-C3E96DC3D34A@petsc.dev> You can run with -ksp_error_if_not_converged to get it to stop as soon as a linear solve fails to help track down the exact breaking point. > The problem under consideration contains an eigen-value to be solved, > making the first diagonal element of the jacobian matrix being zero. > From these outputs, it seems that the PC failed to factorize, which is > caused by this 0 diagonal element. But I'm wondering why it works > with jacobian matrix generated by finite-difference? Presumably your "exact" Jacobian puts a zero on the diagonal while the finite differencing may put a small non-zero value in that location due to numerical round-off. In that case even if the factorization succeeds it may produce an inaccurate solution if the value on the diagonal is very small. If your matrix is singular or cannot be factored with LU then you need to use a different solver for the linear system that will be robust to the zero on the diagonal. What is the structure of your Jacobian? (The analytic form). Barry > On Oct 27, 2021, at 1:47 AM, ?? wrote: > > Thanks for your kind reply. > > Several comparison tests have been performed. Attached are execution > output files. Below are corresponding descriptions. 
> > good.txt -- Run without hand-coded jacobian, solution converged, with > option '-ts_monitor -snes_monitor -snes_converged_reason > -ksp_monitor_true_residual -ksp_converged_reason'; > jac1.txt -- Run with hand-coded jacobian, does not converge, with > option '-ts_monitor -snes_monitor -snes_converged_reason > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian'; > jac2.txt -- Run with hand-coded jacobian, does not converge, with > option '-ts_monitor -snes_monitor -snes_converged_reason > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian > -ksp_view'; > jac3.txt -- Run with hand-coded jacobian, does not converge, with > option '-ts_monitor -snes_monitor -snes_converged_reason > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian > -ksp_view -ts_max_snes_failures -1 '; > > The problem under consideration contains an eigen-value to be solved, > making the first diagonal element of the jacobian matrix being zero. > From these outputs, it seems that the PC failed to factorize, which is > caused by this 0 diagonal element. But I'm wondering why it works > with jacobian matrix generated by finite-difference? Would employing > DMDA for discretization be helpful? > > Regards > > Yu Cang > > Barry Smith wrote on Monday, October 25, 2021 at 10:50 PM: >> >> >> It is definitely unexpected that -snes_test_jacobian verifies the Jacobian as matching but the solve process is completely different. >> >> Please run with -snes_monitor -snes_converged_reason -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian and send all the output >> >> Barry >> >> >>> On Oct 25, 2021, at 9:53 AM, Yu Cang wrote: >>> >>> I'm using TS to solve a set of DAE, which originates from a >>> one-dimensional problem. The grid points are uniformly distributed. >>> For simplicity, the DMDA is not employed for discretization. >>> >>> At first, only the residual function is prescribed through >>> 'TSSetIFunction', and PETSc produces converged results. However, after >>> providing hand-coded Jacobian through 'TSSetIJacobian', the internal >>> SNES object fails (residual norm does not change), and TS reports >>> 'DIVERGED_STEP_REJECTED'. >>> >>> I have tried to add the option '-snes_test_jacobian' to see if the >>> hand-coded jacobian is somewhere wrong, but it shows '||J - >>> Jfd||_F/||J||_F = 1.07488e-10, ||J - Jfd||_F = 2.14458e-07', >>> indicating that the hand-coded jacobian is correct. >>> >>> Then, I added a monitor for the internal SNES object through >>> 'SNESMonitorSet', in which the solution vector will be displayed at >>> each iteration. It is interesting to find that, if the jacobian is not >>> provided, meaning finite-difference is utilized for jacobian >>> evaluation internally, the solution vector converges to the steady >>> solution and the SNES residual norm is reduced continuously. However, >>> it turns out that, as long as the jacobian is provided, the solution >>> vector will NEVER get changed! So the solution procedure got stuck! >>> >>> This is quite strange! Hope to get some advice. >>> PETSc version=3.14.6, program run in serial mode. >>> >>> Regards >>> >>> Yu Cang >> > From pierre.seize at onera.fr Wed Oct 27 08:54:09 2021 From: pierre.seize at onera.fr (Pierre Seize) Date: Wed, 27 Oct 2021 15:54:09 +0200 Subject: [petsc-users] Question regarding DMPlex reordering In-Reply-To: References: <01e3a622-8561-93b4-4ebd-25331bd93486@onera.fr> Message-ID: Hi, thanks for the fix. It seems to work fine. 
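[For reference, here is a minimal, self-contained sketch of the reorder-then-distribute sequence discussed in this thread. It is essentially the test case posted earlier, cleaned up; it assumes a PETSc build that already contains the fix from merge request 4504, and the mesh file name and the overlap of 2 are simply the values used in this thread.]

    #include <petscdmplex.h>

    static char help[] = "Reorder a DMPlex with RCM, then distribute it.\n";

    int main(int argc, char **argv)
    {
      DM             dm, foo_dm;
      IS             perm;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, help); if (ierr) return ierr;

      /* Load the mesh from the options database, e.g. -dm_plex_filename data/box.msh */
      ierr = DMCreate(PETSC_COMM_WORLD, &dm); CHKERRQ(ierr);
      ierr = DMSetType(dm, DMPLEX); CHKERRQ(ierr);
      ierr = DMSetFromOptions(dm); CHKERRQ(ierr);

      /* Compute a permutation and apply it; per this thread only RCM is
         implemented, so the MatOrderingType argument is currently ignored */
      ierr = DMPlexGetOrdering(dm, NULL, NULL, &perm); CHKERRQ(ierr);
      ierr = DMPlexPermute(dm, perm, &foo_dm); CHKERRQ(ierr);
      if (foo_dm) { ierr = DMDestroy(&dm); CHKERRQ(ierr); dm = foo_dm; }

      /* Distribute the permuted mesh with an overlap of 2 cells */
      ierr = DMPlexDistribute(dm, 2, NULL, &foo_dm); CHKERRQ(ierr);
      if (foo_dm) { ierr = DMDestroy(&dm); CHKERRQ(ierr); dm = foo_dm; }

      ierr = DMViewFromOptions(dm, NULL, "-mesh_view"); CHKERRQ(ierr);

      ierr = ISDestroy(&perm); CHKERRQ(ierr);
      ierr = DMDestroy(&dm); CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }

[Run with, for example, mpiexec -n 2 ./reorder -dm_plex_filename data/box.msh -mesh_view; the executable name is a placeholder.]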
Out of curiosity, I noticed the MatOrderingType of DMPlexGetOrdering is not used. Is this intentional ? To match MatGetOrdering ? Pierre On 27/10/21 15:03, Matthew Knepley wrote: > On Wed, Oct 27, 2021 at 3:15 AM Pierre Seize > wrote: > > > > On 26/10/21 22:28, Matthew Knepley wrote: >> On Tue, Oct 26, 2021 at 10:17 AM Pierre Seize >> > wrote: >> >> Hi, I had the idea to try and renumber my mesh cells, as I've >> heard it's better: "neighbouring cells are stored next to one >> another, and memory access are faster". >> >> Right now, I load the mesh then I distribute it over the >> processes. I thought I'd try to permute the numbering between >> those two steps : >> >> DMPlexCreateFromFile >> DMPlexGetOrdering >> DMPlexPermute >> DMPlexDistribute >> >> but that gives me an error when it runs on more than one process: >> >> [0]PETSC ERROR: --------------------- Error Message >> -------------------------------------------------------------- >> [0]PETSC ERROR: No support for this operation for this object >> type >> [0]PETSC ERROR: Number of dofs for point 0 in the local >> section should be positive >> [0]PETSC ERROR: See https://petsc.org/release/faq/ for >> trouble shooting. >> [0]PETSC ERROR: Petsc Release Version 3.16.0, unknown >> [0]PETSC ERROR: ./build/bin/yanss on a? named ldmpe202z.onera >> by pseize Tue Oct 26 16:03:33 2021 >> [0]PETSC ERROR: Configure options --PETSC_ARCH=arch-ld-gcc >> --download-metis --download-parmetis --prefix=~/.local >> --with-cgns >> [0]PETSC ERROR: #1 PetscPartitionerDMPlexPartition() at >> /stck/pseize/softwares/petsc/src/dm/impls/plex/plexpartition.c:720 >> [0]PETSC ERROR: #2 DMPlexDistribute() at >> /stck/pseize/softwares/petsc/src/dm/impls/plex/plexdistribute.c:1630 >> [0]PETSC ERROR: #3 MeshLoadFromFile() at src/spatial.c:689 >> [0]PETSC ERROR: #4 main() at src/main.c:22 >> [0]PETSC ERROR: PETSc Option Table entries: >> [0]PETSC ERROR: -draw_comp 0 >> [0]PETSC ERROR: -mesh data/box.msh >> [0]PETSC ERROR: -mesh_view draw >> [0]PETSC ERROR: -riemann anrs >> [0]PETSC ERROR: -ts_max_steps 100 >> [0]PETSC ERROR: -vec_view_partition >> [0]PETSC ERROR: ----------------End of Error Message >> -------send entire error message to petsc-maint at mcs.anl.gov >> ---------- >> >> I checked and before I tried to reorder the mesh, the >> dm->localSection was NULL before entering DMPlexDistribute, >> and I was able to fix the error with DMSetLocalSection(dm, >> NULL) after DMPlexPermute, but it doesn't seems it's the >> right way to do what I want. Does someone have any advice ? >> >> Oh, this is probably me trying to be too clever. If a local >> section is defined, then I try to use the number of dofs in it to >> load balance better. >> There should never be a negative number of dofs in the local >> section (a global section uses this to indicate a dof?owned by >> another process). >> So eliminating the local section will definitely fix that error. >> >> Now the question of how you got a local section. DMPlexPermute()? >> does not create one, so it seems like you had one ahead of time, >> and that >> the values were not valid. > > DMPlexPermute calls DMGetLocalSection, which creates > dm->localSection if it's NULL, so before DMPlexPermute my > dm->localSection is NULL, and after it is set. Because of that I > enter the if in src/dm/impls/plex/plexpartition.c:707 and then I > got the error. > If i have a "wrong" dm->localSection, I think it has to come from > DMPlexPermute. > >> Note that you can probably get rid of some of the loading code using >> >> ? 
DMCreate(comm, &dm); >> ? DMSetType(dm, DMPLEX); >> ? DMSetFromOptions(dm); >> ? DMViewFromOptions(dm, NULL, "-mesh_view"); >> >> and use >> >> ? -dm_plex_filename databox,msh -mesh_view > > My loading code is already small, but just to make sure I wrote > this minimal example: > > int main(int argc, char **argv){ > ? PetscErrorCode ierr; > > ? ierr = PetscInitialize(&argc, &argv, NULL, help); if (ierr) > return ierr; > > ? DM dm, foo_dm; > ? ierr = DMCreate(PETSC_COMM_WORLD, &dm); CHKERRQ(ierr); > ? ierr = DMSetType(dm, DMPLEX); CHKERRQ(ierr); > ? ierr = DMSetFromOptions(dm); CHKERRQ(ierr); > > ? IS perm; > ? ierr = DMPlexGetOrdering(dm, NULL, NULL, &perm); CHKERRQ(ierr); > ? ierr = DMPlexPermute(dm, perm, &foo_dm); CHKERRQ(ierr); > ? if (foo_dm) { > ??? ierr = DMDestroy(&dm); CHKERRQ(ierr); > ??? dm = foo_dm; > ? } > ? ierr = DMPlexDistribute(dm, 2, NULL, &foo_dm); CHKERRQ(ierr); > ? if (foo_dm) { > ??? ierr = DMDestroy(&dm); CHKERRQ(ierr); > ??? dm = foo_dm; > ? } > > ? ierr = ISDestroy(&perm); CHKERRQ(ierr); > ? ierr = DMDestroy(&dm); CHKERRQ(ierr); > ? ierr = PetscFinalize(); > ? return ierr; > } > > ran with mpiexec -n 2 ./build/bin/yanss -dm_plex_filename > data/box.msh. The mesh is a 2D box from GMSH but I've got the same > result with any mesh I've tried. It runs fine with 1 process but > gives the previous error for more processes. > > > Hi Pierre, > > You are right. This is my bug. Here is the fix: > > https://gitlab.com/petsc/petsc/-/merge_requests/4504 > > Is it possible to try this branch? > > ? Thanks, > > ? ? ?Matt > > Pierre > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Wed Oct 27 09:06:58 2021 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 27 Oct 2021 10:06:58 -0400 Subject: [petsc-users] Question regarding DMPlex reordering In-Reply-To: References: <01e3a622-8561-93b4-4ebd-25331bd93486@onera.fr> Message-ID: On Wed, Oct 27, 2021 at 9:54 AM Pierre Seize wrote: > Hi, thanks for the fix. It seems to work fine. > > Out of curiosity, I noticed the MatOrderingType of DMPlexGetOrdering is > not used. Is this intentional ? To match MatGetOrdering ? > True. It only does RCM. I noticed that when fixing it. I have to write interfaces to the rest of the matrix ordering routines since they are slightly different for meshes. I will make an issue. Thanks, Matt > Pierre > > On 27/10/21 15:03, Matthew Knepley wrote: > > On Wed, Oct 27, 2021 at 3:15 AM Pierre Seize > wrote: > >> >> >> On 26/10/21 22:28, Matthew Knepley wrote: >> >> On Tue, Oct 26, 2021 at 10:17 AM Pierre Seize >> wrote: >> >>> Hi, I had the idea to try and renumber my mesh cells, as I've heard it's >>> better: "neighbouring cells are stored next to one another, and memory >>> access are faster". >>> Right now, I load the mesh then I distribute it over the processes. 
I >>> thought I'd try to permute the numbering between those two steps : >>> >>> DMPlexCreateFromFile >>> DMPlexGetOrdering >>> DMPlexPermute >>> DMPlexDistribute >>> >>> but that gives me an error when it runs on more than one process: >>> >>> [0]PETSC ERROR: --------------------- Error Message >>> -------------------------------------------------------------- >>> [0]PETSC ERROR: No support for this operation for this object type >>> [0]PETSC ERROR: Number of dofs for point 0 in the local section should >>> be positive >>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting. >>> [0]PETSC ERROR: Petsc Release Version 3.16.0, unknown >>> [0]PETSC ERROR: ./build/bin/yanss on a named ldmpe202z.onera by pseize >>> Tue Oct 26 16:03:33 2021 >>> [0]PETSC ERROR: Configure options --PETSC_ARCH=arch-ld-gcc >>> --download-metis --download-parmetis --prefix=~/.local --with-cgns >>> [0]PETSC ERROR: #1 PetscPartitionerDMPlexPartition() at >>> /stck/pseize/softwares/petsc/src/dm/impls/plex/plexpartition.c:720 >>> [0]PETSC ERROR: #2 DMPlexDistribute() at >>> /stck/pseize/softwares/petsc/src/dm/impls/plex/plexdistribute.c:1630 >>> [0]PETSC ERROR: #3 MeshLoadFromFile() at src/spatial.c:689 >>> [0]PETSC ERROR: #4 main() at src/main.c:22 >>> [0]PETSC ERROR: PETSc Option Table entries: >>> [0]PETSC ERROR: -draw_comp 0 >>> [0]PETSC ERROR: -mesh data/box.msh >>> [0]PETSC ERROR: -mesh_view draw >>> [0]PETSC ERROR: -riemann anrs >>> [0]PETSC ERROR: -ts_max_steps 100 >>> [0]PETSC ERROR: -vec_view_partition >>> [0]PETSC ERROR: ----------------End of Error Message -------send entire >>> error message to petsc-maint at mcs.anl.gov---------- >>> >>> I checked and before I tried to reorder the mesh, the dm->localSection >>> was NULL before entering DMPlexDistribute, and I was able to fix the >>> error with DMSetLocalSection(dm, NULL) after DMPlexPermute, but it >>> doesn't seems it's the right way to do what I want. Does someone have any >>> advice ? >>> >> Oh, this is probably me trying to be too clever. If a local section is >> defined, then I try to use the number of dofs in it to load balance better. >> There should never be a negative number of dofs in the local section (a >> global section uses this to indicate a dof owned by another process). >> So eliminating the local section will definitely fix that error. >> >> Now the question of how you got a local section. DMPlexPermute() does >> not create one, so it seems like you had one ahead of time, and that >> the values were not valid. >> >> >> DMPlexPermute calls DMGetLocalSection, which creates dm->localSection if >> it's NULL, so before DMPlexPermute my dm->localSection is NULL, and >> after it is set. Because of that I enter the if in >> src/dm/impls/plex/plexpartition.c:707 and then I got the error. >> If i have a "wrong" dm->localSection, I think it has to come from >> DMPlexPermute. 
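A minimal sketch of the workaround mentioned earlier in the thread (clearing the spurious local section that DMPlexPermute leaves behind, before calling DMPlexDistribute) could look like the following; dm and ierr are assumed to be declared as in the minimal example, and this is only a stopgap until the fix in the merge request referenced earlier is available:

  IS perm;
  DM permuted_dm = NULL, distributed_dm = NULL;

  ierr = DMPlexGetOrdering(dm, NULL, NULL, &perm); CHKERRQ(ierr);
  ierr = DMPlexPermute(dm, perm, &permuted_dm); CHKERRQ(ierr);
  ierr = ISDestroy(&perm); CHKERRQ(ierr);
  if (permuted_dm) {
    ierr = DMDestroy(&dm); CHKERRQ(ierr);
    dm = permuted_dm;
  }
  /* DMPlexPermute() left a default local section on the permuted DM;
     dropping it keeps the partitioner from trying to use it for load balancing */
  ierr = DMSetLocalSection(dm, NULL); CHKERRQ(ierr);
  ierr = DMPlexDistribute(dm, 2, NULL, &distributed_dm); CHKERRQ(ierr);
  if (distributed_dm) {
    ierr = DMDestroy(&dm); CHKERRQ(ierr);
    dm = distributed_dm;
  }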
>> >> Note that you can probably get rid of some of the loading code using >> >> DMCreate(comm, &dm); >> DMSetType(dm, DMPLEX); >> DMSetFromOptions(dm); >> DMViewFromOptions(dm, NULL, "-mesh_view"); >> >> and use >> >> -dm_plex_filename databox,msh -mesh_view >> >> >> My loading code is already small, but just to make sure I wrote this >> minimal example: >> >> int main(int argc, char **argv){ >> PetscErrorCode ierr; >> >> ierr = PetscInitialize(&argc, &argv, NULL, help); if (ierr) return ierr; >> >> DM dm, foo_dm; >> ierr = DMCreate(PETSC_COMM_WORLD, &dm); CHKERRQ(ierr); >> ierr = DMSetType(dm, DMPLEX); CHKERRQ(ierr); >> ierr = DMSetFromOptions(dm); CHKERRQ(ierr); >> >> IS perm; >> ierr = DMPlexGetOrdering(dm, NULL, NULL, &perm); CHKERRQ(ierr); >> ierr = DMPlexPermute(dm, perm, &foo_dm); CHKERRQ(ierr); >> if (foo_dm) { >> ierr = DMDestroy(&dm); CHKERRQ(ierr); >> dm = foo_dm; >> } >> ierr = DMPlexDistribute(dm, 2, NULL, &foo_dm); CHKERRQ(ierr); >> if (foo_dm) { >> ierr = DMDestroy(&dm); CHKERRQ(ierr); >> dm = foo_dm; >> } >> >> ierr = ISDestroy(&perm); CHKERRQ(ierr); >> ierr = DMDestroy(&dm); CHKERRQ(ierr); >> ierr = PetscFinalize(); >> return ierr; >> } >> >> ran with mpiexec -n 2 ./build/bin/yanss -dm_plex_filename data/box.msh. >> The mesh is a 2D box from GMSH but I've got the same result with any mesh >> I've tried. It runs fine with 1 process but gives the previous error for >> more processes. >> > > Hi Pierre, > > You are right. This is my bug. Here is the fix: > > https://gitlab.com/petsc/petsc/-/merge_requests/4504 > > Is it possible to try this branch? > > Thanks, > > Matt > > >> Pierre >> > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From hongzhang at anl.gov Wed Oct 27 09:43:54 2021 From: hongzhang at anl.gov (Zhang, Hong) Date: Wed, 27 Oct 2021 14:43:54 +0000 Subject: [petsc-users] Strange behavior of TS after setting hand-coded Jacobian In-Reply-To: References: <0C6ACBF3-F457-4BFD-AD19-8C455444748F@petsc.dev> Message-ID: <918457CD-6B4F-49A2-9029-E6BAD039C9C0@anl.gov> Since your Jacobian matrix is small, it is possible to compare your hand-written Jacobian with the finite-difference approximation directly. Add -snes_test_jacobian_view to print out the matrices. Then you can see exactly where the difference is. Hong > On Oct 27, 2021, at 12:47 AM, ?? wrote: > > Thanks for your kind reply. > > Several comparison tests have been performed. Attached are execution > output files. Below are corresponding descriptions. 
> > good.txt -- Run without hand-coded jacobian, solution converged, with > option '-ts_monitor -snes_monitor -snes_converged_reason > -ksp_monitor_true_residual -ksp_converged_reason'; > jac1.txt -- Run with hand-coded jacobian, does not converge, with > option '-ts_monitor -snes_monitor -snes_converged_reason > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian'; > jac2.txt -- Run with hand-coded jacobian, does not converge, with > option '-ts_monitor -snes_monitor -snes_converged_reason > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian > -ksp_view'; > jac3.txt -- Run with hand-coded jacobian, does not converge, with > option '-ts_monitor -snes_monitor -snes_converged_reason > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian > -ksp_view -ts_max_snes_failures -1 '; > > The problem under consideration contains an eigen-value to be solved, > making the first diagonal element of the jacobian matrix being zero. > From these outputs, it seems that the PC failed to factorize, which is > caused by this 0 diagonal element. But I'm wondering why it works > with jacobian matrix generated by finite-difference? Would employing > DMDA for discretization be helpful? > > Regards > > Yu Cang > > Barry Smith ?2021?10?25??? ??10:50??? >> >> >> It is definitely unexpected that -snes_test_jacobian verifies the Jacobian as matching but the solve process is completely different. >> >> Please run with -snes_monitor -snes_converged_reason -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian and send all the output >> >> Barry >> >> >>> On Oct 25, 2021, at 9:53 AM, ?? wrote: >>> >>> I'm using TS to solve a set of DAE, which originates from a >>> one-dimensional problem. The grid points are uniformly distributed. >>> For simplicity, the DMDA is not employed for discretization. >>> >>> At first, only the residual function is prescribed through >>> 'TSSetIFunction', and PETSC produces converged results. However, after >>> providing hand-coded Jacobian through 'TSSetIJacobian', the internal >>> SNES object fails (residual norm does not change), and TS reports >>> 'DIVERGED_STEP_REJECTED'. >>> >>> I have tried to add the option '-snes_test_jacobian' to see if the >>> hand-coded jacobian is somewhere wrong, but it shows '||J - >>> Jfd||_F/||J||_F = 1.07488e-10, ||J - Jfd||_F = 2.14458e-07', >>> indicating that the hand-coded jacobian is correct. >>> >>> Then, I added a monitor for the internal SNES object through >>> 'SNESMonitorSet', in which the solution vector will be displayed at >>> each iteration. It is interesting to find that, if the jacobian is not >>> provided, meaning finite-difference is utilized for jacobian >>> evaluation internally, the solution vector converges to steady >>> solution and the SNES residual norm is reduced continuously. However, >>> it turns out that, as long as the jacobian is provided, the solution >>> vector will NEVER get changed! So the solution procedure stucked! >>> >>> This is quite strange! Hope to get some advice. >>> PETSC version=3.14.6, program run in serial mode. >>> >>> Regards >>> >>> Yu Cang >> > From knepley at gmail.com Wed Oct 27 10:14:27 2021 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 27 Oct 2021 11:14:27 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? 
In-Reply-To: <12e32ebb-61ed-6a8c-ab77-2841090ba5fe@giref.ulaval.ca> References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca> <12e32ebb-61ed-6a8c-ab77-2841090ba5fe@giref.ulaval.ca> Message-ID: On Wed, Oct 27, 2021 at 8:29 AM Eric Chamberland < Eric.Chamberland at giref.ulaval.ca> wrote: > Hi Matthew, > > the smallest mesh which crashes the code is a 2x5 mesh: > > See the modified ex44.c > > With smaller meshes(2x2, 2x4, etc), it passes... But it bugs latter when > I try to use DMPlexNaturalToGlobalBegin but let's keep that other problem > for later... > > Thanks a lot for helping digging into this! :) > I have made a small fix in this branch https://gitlab.com/petsc/petsc/-/commits/knepley/fix-plex-g2n It seems to run for me. Can you check it? Thanks, Matt > Eric > > (sorry if you received this for a 2nd times, I have trouble with my mail) > On 2021-10-26 4:35 p.m., Matthew Knepley wrote: > > On Tue, Oct 26, 2021 at 1:35 PM Eric Chamberland < > Eric.Chamberland at giref.ulaval.ca> wrote: > >> Here is a screenshot of the partition I hard coded (top) and >> vertices/element numbers (down): >> >> I have not yet modified the ex44.c example to properly assign the >> coordinates... >> >> (but I would not have done it like it is in the last version because the >> sCoords array is the global array with global vertices number) >> >> I will have time to do this tomorrow... >> >> Maybe I can first try to reproduce all this with a smaller mesh? >> > > That might make it easier to find a problem. > > Thanks! > > Matt > > >> Eric >> On 2021-10-26 9:46 a.m., Matthew Knepley wrote: >> >> Okay, I ran it. Something seems off with the mesh. First, I cannot simply >> explain the partition. The number of shared vertices and edges >> does not seem to come from a straight cut. Second, the mesh look >> scrambled on output. >> >> Thanks, >> >> Matt >> >> On Sun, Oct 24, 2021 at 11:49 PM Eric Chamberland < >> Eric.Chamberland at giref.ulaval.ca> wrote: >> >>> Hi Matthew, >>> >>> ok, I started back from your ex44.c example and added the global array >>> of coordinates. I just have to code the creation of the local coordinates >>> now. >>> >>> Eric >>> On 2021-10-20 6:55 p.m., Matthew Knepley wrote: >>> >>> On Wed, Oct 20, 2021 at 3:06 PM Eric Chamberland < >>> Eric.Chamberland at giref.ulaval.ca> wrote: >>> >>>> Hi Matthew, >>>> >>>> we tried to reproduce the error in a simple example. >>>> >>>> The context is the following: We hard coded the mesh and initial >>>> partition into the code (see sConnectivity and sInitialPartition) for 2 >>>> ranks and try to create a section in order to use the >>>> DMPlexNaturalToGlobalBegin function to retreive our initial element numbers. >>>> >>>> Now the call to DMPlexDistribute give different errors depending on >>>> what type of component we ask the field to be created. For our objective, >>>> we would like a global field to be created on elements only (like a P0 >>>> interpolation). >>>> >>>> We now have the following error generated: >>>> >>>> [0]PETSC ERROR: --------------------- Error Message >>>> -------------------------------------------------------------- >>>> [0]PETSC ERROR: Petsc has generated inconsistent data >>>> [0]PETSC ERROR: Inconsistency in indices, 18 should be 17 >>>> [0]PETSC ERROR: See >>>> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble >>>> shooting. 
>>>> [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar 30, 2021 >>>> [0]PETSC ERROR: ./bug on a named rohan by ericc Wed Oct 20 14:52:36 >>>> 2021 >>>> [0]PETSC ERROR: Configure options >>>> --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 --with-mpi-compilers=1 >>>> --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 --with-cxx-dialect=C++14 >>>> --with-make-np=12 --with-shared-libraries=1 --with-debugging=yes >>>> --with-memalign=64 --with-visibility=0 --with-64-bit-indices=0 >>>> --download-ml=yes --download-mumps=yes --download-superlu=yes >>>> --download-hpddm=yes --download-slepc=yes --download-superlu_dist=yes >>>> --download-parmetis=yes --download-ptscotch=yes --download-metis=yes >>>> --download-strumpack=yes --download-suitesparse=yes --download-hypre=yes >>>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >>>> --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >>>> --with-scalapack=1 >>>> --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include >>>> --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>>> -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" >>>> [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at >>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 >>>> [0]PETSC ERROR: #2 DMPlexCreateGlobalToNaturalSF() at >>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 >>>> [0]PETSC ERROR: #3 DMPlexDistribute() at >>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 >>>> [0]PETSC ERROR: #4 main() at bug_section.cc:159 >>>> [0]PETSC ERROR: No PETSc Option Table entries >>>> [0]PETSC ERROR: ----------------End of Error Message -------send entire >>>> error message to petsc-maint at mcs.anl.gov---------- >>>> >>>> Hope the attached code is self-explaining, note that to make it short, >>>> we have not included the final part of it, just the buggy part we are >>>> encountering right now... >>>> >>>> Thanks for your insights, >>>> >>> Thanks for making the example. I tweaked it slightly. I put in a test >>> case that just makes a parallel 7 x 10 quad mesh. This works >>> fine. Thus I think it must be something connected with the original >>> mesh. It is hard to get a handle on it without the coordinates. >>> Do you think you could put the coordinate array in? I have added the >>> code to load them (see attached file). >>> >>> Thanks, >>> >>> Matt >>> >>>> Eric >>>> On 2021-10-06 9:23 p.m., Matthew Knepley wrote: >>>> >>>> On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland < >>>> Eric.Chamberland at giref.ulaval.ca> wrote: >>>> >>>>> Hi Matthew, >>>>> >>>>> we tried to use that. Now, we discovered that: >>>>> >>>>> 1- even if we "ask" for sfNatural creation with DMSetUseNatural, it is >>>>> not created because DMPlexCreateGlobalToNaturalSF looks for a "section": >>>>> this is not documented in DMSetUseNaturalso we are asking ourselfs: "is >>>>> this a permanent feature or a temporary situation?" >>>>> >>>> I think explaining this will help clear up a lot. >>>> >>>> What the Natural2Global map does is permute a solution vector into the >>>> ordering that it would have had prior to mesh distribution. >>>> Now, in order to do this permutation, I need to know the original >>>> (global) data layout. If it is not specified _before_ distribution, we >>>> cannot build the permutation. The section describes the data layout, >>>> so I need it before distribution. 
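For illustration, one way to describe a one-value-per-cell layout before distribution is to build the PetscSection by hand over the cell stratum. This is only a sketch (dm and ierr are assumed to exist, and petscdmplex.h to be included), not the code used in the attached example:

  PetscSection s;
  PetscInt     cStart, cEnd, c;

  ierr = PetscSectionCreate(PetscObjectComm((PetscObject) dm), &s); CHKERRQ(ierr);
  ierr = DMPlexGetHeightStratum(dm, 0, &cStart, &cEnd); CHKERRQ(ierr);  /* height 0 = cells */
  ierr = PetscSectionSetChart(s, cStart, cEnd); CHKERRQ(ierr);
  for (c = cStart; c < cEnd; ++c) {
    ierr = PetscSectionSetDof(s, c, 1); CHKERRQ(ierr);                  /* one dof per cell (P0-like) */
  }
  ierr = PetscSectionSetUp(s); CHKERRQ(ierr);
  ierr = DMSetLocalSection(dm, s); CHKERRQ(ierr);                       /* must be set before DMPlexDistribute() */
  ierr = PetscSectionDestroy(&s); CHKERRQ(ierr);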
>>>> >>>> I cannot think of another way that you would implement this, but if you >>>> want something else, let me know. >>>> >>>>> 2- We then tried to create a "section" in different manners: we took >>>>> the code into the example petsc/src/dm/impls/plex/tests/ex15.c. However, >>>>> we ended up with a segfault: >>>>> >>>>> corrupted size vs. prev_size >>>>> [rohan:07297] *** Process received signal *** >>>>> [rohan:07297] Signal: Aborted (6) >>>>> [rohan:07297] Signal code: (-6) >>>>> [rohan:07297] [ 0] /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] >>>>> [rohan:07297] [ 1] /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] >>>>> [rohan:07297] [ 2] /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] >>>>> [rohan:07297] [ 3] /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] >>>>> [rohan:07297] [ 4] /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] >>>>> [rohan:07297] [ 5] /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] >>>>> [rohan:07297] [ 6] /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] >>>>> [rohan:07297] [ 7] /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] >>>>> [rohan:07297] [ 8] >>>>> /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] >>>>> [rohan:07297] [ 9] >>>>> /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] >>>>> [rohan:07297] [10] >>>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] >>>>> [rohan:07297] [11] >>>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] >>>>> [rohan:07297] [12] >>>>> /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] >>>>> [rohan:07297] [13] /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] >>>>> >>>>> [rohan:07297] [14] >>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] >>>>> [rohan:07297] [15] >>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] >>>>> [rohan:07297] [16] >>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] >>>>> [rohan:07297] [17] >>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] >>>>> [rohan:07297] [18] >>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] >>>>> >>>> I am not sure what happened here, but if you could send a sample code, >>>> I will figure it out. >>>> >>>>> If we do not create a section, the call to DMPlexDistribute is >>>>> successful, but DMPlexGetGlobalToNaturalSF return a null SF pointer... >>>>> >>>> Yes, it just ignores it in this case because it does not have a global >>>> layout. >>>> >>>>> Here are the operations we are calling ( this is almost the code we >>>>> are using, I just removed verifications and creation of the connectivity >>>>> which use our parallel structure and code): >>>>> >>>>> =========== >>>>> >>>>> PetscInt* lCells = 0; >>>>> PetscInt lNumCorners = 0; >>>>> PetscInt lDimMail = 0; >>>>> PetscInt lnumCells = 0; >>>>> >>>>> //At this point we create the cells for PETSc expected input for >>>>> DMPlexBuildFromCellListParallel and set lNumCorners, lDimMail and lnumCells >>>>> to correct values. >>>>> ... 
>>>>> >>>>> DM lDMBete = 0 >>>>> DMPlexCreate(lMPIComm,&lDMBete); >>>>> >>>>> DMSetDimension(lDMBete, lDimMail); >>>>> >>>>> DMPlexBuildFromCellListParallel(lDMBete, >>>>> lnumCells, >>>>> PETSC_DECIDE, >>>>> >>>>> pLectureElementsLocaux.reqNbTotalSommets(), >>>>> lNumCorners, >>>>> lCells, >>>>> PETSC_NULL); >>>>> >>>>> DM lDMBeteInterp = 0; >>>>> DMPlexInterpolate(lDMBete, &lDMBeteInterp); >>>>> DMDestroy(&lDMBete); >>>>> lDMBete = lDMBeteInterp; >>>>> >>>>> DMSetUseNatural(lDMBete,PETSC_TRUE); >>>>> >>>>> PetscSF lSFMigrationSansOvl = 0; >>>>> PetscSF lSFMigrationOvl = 0; >>>>> DM lDMDistribueSansOvl = 0; >>>>> DM lDMAvecOverlap = 0; >>>>> >>>>> PetscPartitioner lPart; >>>>> DMPlexGetPartitioner(lDMBete, &lPart); >>>>> PetscPartitionerSetFromOptions(lPart); >>>>> >>>>> PetscSection section; >>>>> PetscInt numFields = 1; >>>>> PetscInt numBC = 0; >>>>> PetscInt numComp[1] = {1}; >>>>> PetscInt numDof[4] = {1, 0, 0, 0}; >>>>> PetscInt bcFields[1] = {0}; >>>>> IS bcPoints[1] = {NULL}; >>>>> >>>>> DMSetNumFields(lDMBete, numFields); >>>>> >>>>> DMPlexCreateSection(lDMBete, NULL, numComp, numDof, numBC, bcFields, >>>>> bcPoints, NULL, NULL, §ion); >>>>> DMSetLocalSection(lDMBete, section); >>>>> >>>>> DMPlexDistribute(lDMBete, 0, &lSFMigrationSansOvl, >>>>> &lDMDistribueSansOvl); // segfault! >>>>> >>>>> =========== >>>>> >>>>> So we have other question/remarks: >>>>> >>>>> 3- Maybe PETSc expect something specific that is missing/not verified: >>>>> for example, we didn't gave any coordinates since we just want to partition >>>>> and compute overlap for the mesh... and then recover our element numbers in >>>>> a "simple way" >>>>> >>>>> 4- We are telling ourselves it is somewhat a "big price to pay" to >>>>> have to build an unused section to have the global to natural ordering set >>>>> ? Could this requirement be avoided? >>>>> >>>> I don't think so. There would have to be _some_ way of describing your >>>> data layout in terms of mesh points, and I do not see how you could use >>>> less memory doing that. >>>> >>>>> 5- Are there any improvement towards our usages in 3.16 release? >>>>> >>>> Let me try and run the code above. >>>> >>>> Thanks, >>>> >>>> Matt >>>> >>>>> Thanks, >>>>> >>>>> Eric >>>>> >>>>> >>>>> On 2021-09-29 7:39 p.m., Matthew Knepley wrote: >>>>> >>>>> On Wed, Sep 29, 2021 at 5:18 PM Eric Chamberland < >>>>> Eric.Chamberland at giref.ulaval.ca> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I come back with _almost_ the original question: >>>>>> >>>>>> I would like to add an integer information (*our* original element >>>>>> number, not petsc one) on each element of the DMPlex I create with >>>>>> DMPlexBuildFromCellListParallel. >>>>>> >>>>>> I would like this interger to be distribruted by or the same way >>>>>> DMPlexDistribute distribute the mesh. >>>>>> >>>>>> Is it possible to do this? >>>>>> >>>>> >>>>> I think we already have support for what you want. If you call >>>>> >>>>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html >>>>> >>>>> before DMPlexDistribute(), it will compute a PetscSF encoding the >>>>> global to natural map. You >>>>> can get it with >>>>> >>>>> >>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html >>>>> >>>>> and use it with >>>>> >>>>> >>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html >>>>> >>>>> Is this sufficient? 
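Put together, a minimal sketch of that call sequence (assuming a local section such as the one shown earlier has already been set on dm; dmDist, sfMigration and sfNatural are placeholder names) might read:

  PetscSF sfMigration = NULL, sfNatural = NULL;
  DM      dmDist      = NULL;

  ierr = DMSetUseNatural(dm, PETSC_TRUE); CHKERRQ(ierr);           /* before DMPlexDistribute() */
  ierr = DMPlexDistribute(dm, 0, &sfMigration, &dmDist); CHKERRQ(ierr);
  if (dmDist) {
    ierr = DMPlexGetGlobalToNaturalSF(dmDist, &sfNatural); CHKERRQ(ierr);
    /* sfNatural should now be non-NULL; DMPlexGlobalToNaturalBegin/End() and
       DMPlexNaturalToGlobalBegin/End() use it to permute global vectors
       to/from the original (natural) ordering */
    ierr = DMDestroy(&dm); CHKERRQ(ierr);
    dm = dmDist;
  }
  ierr = PetscSFDestroy(&sfMigration); CHKERRQ(ierr);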
>>>>> >>>>> Thanks, >>>>> >>>>> Matt >>>>> >>>>> >>>>>> Thanks, >>>>>> >>>>>> Eric >>>>>> >>>>>> On 2021-07-14 1:18 p.m., Eric Chamberland wrote: >>>>>> > Hi, >>>>>> > >>>>>> > I want to use DMPlexDistribute from PETSc for computing overlapping >>>>>> > and play with the different partitioners supported. >>>>>> > >>>>>> > However, after calling DMPlexDistribute, I noticed the elements are >>>>>> > renumbered and then the original number is lost. >>>>>> > >>>>>> > What would be the best way to keep track of the element renumbering? >>>>>> > >>>>>> > a) Adding an optional parameter to let the user retrieve a vector >>>>>> or >>>>>> > "IS" giving the old number? >>>>>> > >>>>>> > b) Adding a DMLabel (seems a wrong good solution) >>>>>> > >>>>>> > c) Other idea? >>>>>> > >>>>>> > Of course, I don't want to loose performances with the need of this >>>>>> > "mapping"... >>>>>> > >>>>>> > Thanks, >>>>>> > >>>>>> > Eric >>>>>> > >>>>>> -- >>>>>> Eric Chamberland, ing., M. Ing >>>>>> Professionnel de recherche >>>>>> GIREF/Universit? Laval >>>>>> (418) 656-2131 poste 41 22 42 >>>>>> >>>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before they begin their >>>>> experiments is infinitely more interesting than any results to which their >>>>> experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://www.cse.buffalo.edu/~knepley/ >>>>> >>>>> >>>>> -- >>>>> Eric Chamberland, ing., M. Ing >>>>> Professionnel de recherche >>>>> GIREF/Universit? Laval >>>>> (418) 656-2131 poste 41 22 42 >>>>> >>>>> >>>> >>>> -- >>>> What most experimenters take for granted before they begin their >>>> experiments is infinitely more interesting than any results to which their >>>> experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>>> >>>> -- >>>> Eric Chamberland, ing., M. Ing >>>> Professionnel de recherche >>>> GIREF/Universit? Laval >>>> (418) 656-2131 poste 41 22 42 >>>> >>>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >>> >>> -- >>> Eric Chamberland, ing., M. Ing >>> Professionnel de recherche >>> GIREF/Universit? Laval >>> (418) 656-2131 poste 41 22 42 >>> >>> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: hbnbhlbilhmjdpfg.png Type: image/png Size: 42972 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: eejjfmbjimlkboec.png Type: image/png Size: 87901 bytes Desc: not available URL: From samuelestes91 at gmail.com Wed Oct 27 11:25:01 2021 From: samuelestes91 at gmail.com (Samuel Estes) Date: Wed, 27 Oct 2021 11:25:01 -0500 Subject: [petsc-users] Question about setting block size for arbitrary Mat formats Message-ID: Hi, I am solving a linear system in which the matrix has some block structure. We will ultimately use the BAIJ format but for now we are just using the default CSR and would like to play with different formats to compare performance for our problem. Currently, I call MatSetBlockSize so that I can then use MatSetValuesBlocked and MatSetValuesBlockedLocal. My question is: in the absence of specifying one of the blocked formats, does setting the block size with MatSetBlockSize have any real effect on performance? My understanding is that it is really just useful from a programming perspective in that it allows you to set/access Mat values in blocks which is often a natural way to do things. Obviously changing the actual format to have a blocked structure could make a difference but I just want to check if there's anything else going on under the hood with the block size when the matrix is in AIJ format. Thanks! Sam -------------- next part -------------- An HTML attachment was scrubbed... URL: From Eric.Chamberland at giref.ulaval.ca Wed Oct 27 12:25:34 2021 From: Eric.Chamberland at giref.ulaval.ca (Eric Chamberland) Date: Wed, 27 Oct 2021 13:25:34 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? In-Reply-To: References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca> <12e32ebb-61ed-6a8c-ab77-2841090ba5fe@giref.ulaval.ca> Message-ID: Great! Thanks Matthew, it is working for me up to that point! We are continuing the ex44.c and forward it to you at the next blocking point... Eric On 2021-10-27 11:14 a.m., Matthew Knepley wrote: > On Wed, Oct 27, 2021 at 8:29 AM Eric Chamberland > > wrote: > > Hi Matthew, > > the smallest mesh which crashes the code is a 2x5 mesh: > > See the modified ex44.c > > With smaller meshes(2x2, 2x4, etc), it passes...? But it bugs > latter when I try to use DMPlexNaturalToGlobalBegin but let's keep > that other problem for later... > > Thanks a lot for helping digging into this! :) > > I have made a small fix in this branch > > https://gitlab.com/petsc/petsc/-/commits/knepley/fix-plex-g2n > > > It seems to run for me. Can you check it? > > ? Thanks, > > ? ? ?Matt > > Eric > > (sorry if you received this for a? 2nd times, I have trouble with > my mail) > > On 2021-10-26 4:35 p.m., Matthew Knepley wrote: >> On Tue, Oct 26, 2021 at 1:35 PM Eric Chamberland >> > > wrote: >> >> Here is a screenshot of the partition I hard coded (top) and >> vertices/element numbers (down): >> >> I have not yet modified the ex44.c example to properly assign >> the coordinates... >> >> (but I would not have done it like it is in the last version >> because the sCoords array is the global array with global >> vertices number) >> >> I will have time to do this tomorrow... >> >> Maybe I can first try to reproduce all this with a smaller mesh? >> >> >> That might make it easier to find a problem. >> >> ? Thanks! >> >> ? ? 
?Matt >> >> Eric >> >> On 2021-10-26 9:46 a.m., Matthew Knepley wrote: >>> Okay, I ran it. Something seems off with the mesh. First, I >>> cannot simply explain the partition. The number of shared >>> vertices and edges >>> does not seem to come from a straight cut. Second, the mesh >>> look scrambled on output. >>> >>> ? Thanks, >>> >>> ? ? Matt >>> >>> On Sun, Oct 24, 2021 at 11:49 PM Eric Chamberland >>> >> > wrote: >>> >>> Hi Matthew, >>> >>> ok, I started back from your ex44.c example and added >>> the global array of coordinates.? I just have to code >>> the creation of the local coordinates now. >>> >>> Eric >>> >>> On 2021-10-20 6:55 p.m., Matthew Knepley wrote: >>>> On Wed, Oct 20, 2021 at 3:06 PM Eric Chamberland >>>> >>> > wrote: >>>> >>>> Hi Matthew, >>>> >>>> we tried to reproduce the error in a simple example. >>>> >>>> The context is the following: We hard coded the >>>> mesh and initial partition into the code (see >>>> sConnectivity and sInitialPartition) for 2 ranks >>>> and try to create a section in order to use the >>>> DMPlexNaturalToGlobalBegin function to retreive our >>>> initial element numbers. >>>> >>>> Now the call to DMPlexDistribute give different >>>> errors depending on what type of component we ask >>>> the field to be created.? For our objective, we >>>> would like a global field to be created on elements >>>> only (like a P0 interpolation). >>>> >>>> We now have the following error generated: >>>> >>>> [0]PETSC ERROR: --------------------- Error Message >>>> -------------------------------------------------------------- >>>> [0]PETSC ERROR: Petsc has generated inconsistent data >>>> [0]PETSC ERROR: Inconsistency in indices, 18 should >>>> be 17 >>>> [0]PETSC ERROR: See >>>> https://www.mcs.anl.gov/petsc/documentation/faq.html >>>> >>>> for trouble shooting. >>>> [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar >>>> 30, 2021 >>>> [0]PETSC ERROR: ./bug on a named rohan by ericc Wed >>>> Oct 20 14:52:36 2021 >>>> [0]PETSC ERROR: Configure options >>>> --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 >>>> --with-mpi-compilers=1 >>>> --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 >>>> --with-cxx-dialect=C++14 --with-make-np=12 >>>> --with-shared-libraries=1 --with-debugging=yes >>>> --with-memalign=64 --with-visibility=0 >>>> --with-64-bit-indices=0 --download-ml=yes >>>> --download-mumps=yes --download-superlu=yes >>>> --download-hpddm=yes --download-slepc=yes >>>> --download-superlu_dist=yes --download-parmetis=yes >>>> --download-ptscotch=yes --download-metis=yes >>>> --download-strumpack=yes --download-suitesparse=yes >>>> --download-hypre=yes >>>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >>>> --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. 
>>>> --with-scalapack=1 >>>> --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include >>>> --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>>> -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" >>>> [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at >>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 >>>> [0]PETSC ERROR: #2 DMPlexCreateGlobalToNaturalSF() >>>> at >>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 >>>> [0]PETSC ERROR: #3 DMPlexDistribute() at >>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 >>>> [0]PETSC ERROR: #4 main() at bug_section.cc:159 >>>> [0]PETSC ERROR: No PETSc Option Table entries >>>> [0]PETSC ERROR: ----------------End of Error >>>> Message -------send entire error message to >>>> petsc-maint at mcs.anl.gov >>>> ---------- >>>> >>>> Hope the attached code is self-explaining, note >>>> that to make it short, we have not included the >>>> final part of it, just the buggy part we are >>>> encountering right now... >>>> >>>> Thanks for your insights, >>>> >>>> Thanks for making the example. I tweaked it slightly. I >>>> put in a test case that just makes a parallel 7 x 10 >>>> quad mesh. This works >>>> fine. Thus I think it must be something connected with >>>> the original mesh. It is hard to get a handle on it >>>> without the coordinates. >>>> Do you think you could put the coordinate array in? I >>>> have added the code to load them (see attached file). >>>> >>>> ? Thanks, >>>> >>>> ? ? ?Matt >>>> >>>> Eric >>>> >>>> On 2021-10-06 9:23 p.m., Matthew Knepley wrote: >>>>> On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland >>>>> >>>> > wrote: >>>>> >>>>> Hi Matthew, >>>>> >>>>> we tried to use that.? Now, we discovered that: >>>>> >>>>> 1- even if we "ask" for sfNatural creation >>>>> with DMSetUseNatural, it is not created >>>>> because DMPlexCreateGlobalToNaturalSF looks >>>>> for a "section": this is not documented in >>>>> DMSetUseNaturalso we are asking ourselfs: "is >>>>> this a permanent feature or a temporary >>>>> situation?" >>>>> >>>>> I think explaining this will help clear up a lot. >>>>> >>>>> What the Natural2Global?map does is permute a >>>>> solution vector into the ordering that it would >>>>> have had prior to mesh distribution. >>>>> Now, in order to do this permutation, I need to >>>>> know the original (global) data layout. If it is >>>>> not specified _before_ distribution, we >>>>> cannot build the permutation.? The section >>>>> describes the data layout, so I need it before >>>>> distribution. >>>>> >>>>> I cannot think of another way that you would >>>>> implement this, but if you want something else, >>>>> let me know. >>>>> >>>>> 2- We then tried to create a "section" in >>>>> different manners: we took the code into the >>>>> example petsc/src/dm/impls/plex/tests/ex15.c. >>>>> However, we ended up with a segfault: >>>>> >>>>> corrupted size vs. prev_size >>>>> [rohan:07297] *** Process received signal *** >>>>> [rohan:07297] Signal: Aborted (6) >>>>> [rohan:07297] Signal code:? 
(-6) >>>>> [rohan:07297] [ 0] >>>>> /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] >>>>> [rohan:07297] [ 1] >>>>> /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] >>>>> [rohan:07297] [ 2] >>>>> /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] >>>>> [rohan:07297] [ 3] >>>>> /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] >>>>> [rohan:07297] [ 4] >>>>> /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] >>>>> [rohan:07297] [ 5] >>>>> /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] >>>>> [rohan:07297] [ 6] >>>>> /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] >>>>> [rohan:07297] [ 7] >>>>> /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] >>>>> [rohan:07297] [ 8] >>>>> /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] >>>>> [rohan:07297] [ 9] >>>>> /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] >>>>> [rohan:07297] [10] >>>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] >>>>> [rohan:07297] [11] >>>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] >>>>> [rohan:07297] [12] >>>>> /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] >>>>> [rohan:07297] [13] >>>>> /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] >>>>> >>>>> [rohan:07297] [14] >>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] >>>>> [rohan:07297] [15] >>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] >>>>> [rohan:07297] [16] >>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] >>>>> [rohan:07297] [17] >>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] >>>>> [rohan:07297] [18] >>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] >>>>> >>>>> I am not sure what happened here, but if you could >>>>> send a sample code, I will figure it out. >>>>> >>>>> If we do not create a section, the call to >>>>> DMPlexDistribute is successful, but >>>>> DMPlexGetGlobalToNaturalSF return a null SF >>>>> pointer... >>>>> >>>>> Yes, it just ignores it in this case because it >>>>> does not have a global layout. >>>>> >>>>> Here are the operations we are calling ( this >>>>> is almost the code we are using, I just >>>>> removed verifications and creation of the >>>>> connectivity which use our parallel structure >>>>> and code): >>>>> >>>>> =========== >>>>> >>>>> ? PetscInt* lCells????? = 0; >>>>> ? PetscInt lNumCorners = 0; >>>>> ? PetscInt lDimMail??? = 0; >>>>> ? PetscInt lnumCells?? = 0; >>>>> >>>>> ? //At this point we create the cells for >>>>> PETSc expected input for >>>>> DMPlexBuildFromCellListParallel and set >>>>> lNumCorners, lDimMail and lnumCells to correct >>>>> values. >>>>> ? ... >>>>> >>>>> ? DM?????? lDMBete = 0 >>>>> DMPlexCreate(lMPIComm,&lDMBete); >>>>> >>>>> DMSetDimension(lDMBete, lDimMail); >>>>> >>>>> DMPlexBuildFromCellListParallel(lDMBete, >>>>> ????????????????????????????????? lnumCells, >>>>> ????????????????????????????????? PETSC_DECIDE, >>>>> pLectureElementsLocaux.reqNbTotalSommets(), >>>>> ????????????????????????????????? lNumCorners, >>>>> ????????????????????????????????? lCells, >>>>> ????????????????????????????????? PETSC_NULL); >>>>> >>>>> ? DM lDMBeteInterp = 0; >>>>> DMPlexInterpolate(lDMBete, &lDMBeteInterp); >>>>> DMDestroy(&lDMBete); >>>>> ? 
lDMBete = lDMBeteInterp; >>>>> >>>>> DMSetUseNatural(lDMBete,PETSC_TRUE); >>>>> >>>>> ? PetscSF lSFMigrationSansOvl = 0; >>>>> ? PetscSF lSFMigrationOvl = 0; >>>>> ? DM lDMDistribueSansOvl = 0; >>>>> ? DM lDMAvecOverlap = 0; >>>>> >>>>> ? PetscPartitioner lPart; >>>>> DMPlexGetPartitioner(lDMBete, &lPart); >>>>> PetscPartitionerSetFromOptions(lPart); >>>>> >>>>> ? PetscSection section; >>>>> ? PetscInt numFields?? = 1; >>>>> ? PetscInt numBC?????? = 0; >>>>> ? PetscInt numComp[1]? = {1}; >>>>> ? PetscInt numDof[4]?? = {1, 0, 0, 0}; >>>>> ? PetscInt bcFields[1] = {0}; >>>>> ? IS bcPoints[1] = {NULL}; >>>>> >>>>> DMSetNumFields(lDMBete, numFields); >>>>> >>>>> DMPlexCreateSection(lDMBete, NULL, numComp, >>>>> numDof, numBC, bcFields, bcPoints, NULL, NULL, >>>>> §ion); >>>>> DMSetLocalSection(lDMBete, section); >>>>> >>>>> DMPlexDistribute(lDMBete, 0, >>>>> &lSFMigrationSansOvl, &lDMDistribueSansOvl); >>>>> // segfault! >>>>> >>>>> =========== >>>>> >>>>> So we have other question/remarks: >>>>> >>>>> 3- Maybe PETSc expect something specific that >>>>> is missing/not verified: for example, we >>>>> didn't gave any coordinates since we just want >>>>> to partition and compute overlap for the >>>>> mesh... and then recover our element numbers >>>>> in a "simple way" >>>>> >>>>> 4- We are telling ourselves it is somewhat a >>>>> "big price to pay" to have to build an unused >>>>> section to have the global to natural ordering >>>>> set ?? Could this requirement be avoided? >>>>> >>>>> I don't think so. There would have to be _some_ >>>>> way of describing your data layout in terms of >>>>> mesh points, and I do not see how you could use >>>>> less memory doing that. >>>>> >>>>> 5- Are there any improvement towards our >>>>> usages in 3.16 release? >>>>> >>>>> Let me try and run the code above. >>>>> >>>>> ? Thanks, >>>>> >>>>> ? ? ?Matt >>>>> >>>>> Thanks, >>>>> >>>>> Eric >>>>> >>>>> >>>>> On 2021-09-29 7:39 p.m., Matthew Knepley wrote: >>>>>> On Wed, Sep 29, 2021 at 5:18 PM Eric >>>>>> Chamberland >>>>> > wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> I come back with _almost_ the original >>>>>> question: >>>>>> >>>>>> I would like to add an integer >>>>>> information (*our* original element >>>>>> number, not petsc one) on each element of >>>>>> the DMPlex I create with >>>>>> DMPlexBuildFromCellListParallel. >>>>>> >>>>>> I would like this interger to be >>>>>> distribruted by or the same way >>>>>> DMPlexDistribute distribute the mesh. >>>>>> >>>>>> Is it possible to do this? >>>>>> >>>>>> >>>>>> I think we already have support for what you >>>>>> want. If you call >>>>>> >>>>>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html >>>>>> >>>>>> >>>>>> before DMPlexDistribute(), it will compute a >>>>>> PetscSF encoding the global to natural map. You >>>>>> can get it with >>>>>> >>>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html >>>>>> >>>>>> >>>>>> and use it with >>>>>> >>>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html >>>>>> >>>>>> >>>>>> Is this sufficient? >>>>>> >>>>>> ? Thanks, >>>>>> >>>>>> ? ? ?Matt >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Eric >>>>>> >>>>>> On 2021-07-14 1:18 p.m., Eric Chamberland >>>>>> wrote: >>>>>> > Hi, >>>>>> > >>>>>> > I want to use DMPlexDistribute from >>>>>> PETSc for computing overlapping >>>>>> > and play with the different >>>>>> partitioners supported. 
>>>>>> > >>>>>> > However, after calling >>>>>> DMPlexDistribute, I noticed the elements are >>>>>> > renumbered and then the original number >>>>>> is lost. >>>>>> > >>>>>> > What would be the best way to keep >>>>>> track of the element renumbering? >>>>>> > >>>>>> > a) Adding an optional parameter to let >>>>>> the user retrieve a vector or >>>>>> > "IS" giving the old number? >>>>>> > >>>>>> > b) Adding a DMLabel (seems a wrong good >>>>>> solution) >>>>>> > >>>>>> > c) Other idea? >>>>>> > >>>>>> > Of course, I don't want to loose >>>>>> performances with the need of this >>>>>> > "mapping"... >>>>>> > >>>>>> > Thanks, >>>>>> > >>>>>> > Eric >>>>>> > >>>>>> -- >>>>>> Eric Chamberland, ing., M. Ing >>>>>> Professionnel de recherche >>>>>> GIREF/Universit? Laval >>>>>> (418) 656-2131 poste 41 22 42 >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> What most experimenters take for granted >>>>>> before they begin their experiments is >>>>>> infinitely more interesting than any results >>>>>> to which their experiments lead. >>>>>> -- Norbert Wiener >>>>>> >>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>> >>>>> >>>>> -- >>>>> Eric Chamberland, ing., M. Ing >>>>> Professionnel de recherche >>>>> GIREF/Universit? Laval >>>>> (418) 656-2131 poste 41 22 42 >>>>> >>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before >>>>> they begin their experiments is infinitely more >>>>> interesting than any results to which their >>>>> experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://www.cse.buffalo.edu/~knepley/ >>>>> >>>> >>>> -- >>>> Eric Chamberland, ing., M. Ing >>>> Professionnel de recherche >>>> GIREF/Universit? Laval >>>> (418) 656-2131 poste 41 22 42 >>>> >>>> >>>> >>>> -- >>>> What most experimenters take for granted before they >>>> begin their experiments is infinitely more interesting >>>> than any results to which their experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>> >>> -- >>> Eric Chamberland, ing., M. Ing >>> Professionnel de recherche >>> GIREF/Universit? Laval >>> (418) 656-2131 poste 41 22 42 >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin >>> their experiments is infinitely more interesting than any >>> results to which their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >> >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to >> which their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> > > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > -- Eric Chamberland, ing., M. Ing Professionnel de recherche GIREF/Universit? Laval (418) 656-2131 poste 41 22 42 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: hbnbhlbilhmjdpfg.png Type: image/png Size: 42972 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: eejjfmbjimlkboec.png Type: image/png Size: 87901 bytes Desc: not available URL: From Eric.Chamberland at giref.ulaval.ca Wed Oct 27 13:32:33 2021 From: Eric.Chamberland at giref.ulaval.ca (Eric Chamberland) Date: Wed, 27 Oct 2021 14:32:33 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? In-Reply-To: References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca> <12e32ebb-61ed-6a8c-ab77-2841090ba5fe@giref.ulaval.ca> Message-ID: <8a3704c3-d626-d860-0e98-33e113c5c376@giref.ulaval.ca> Hi Matthew, we continued the example.? Now it must be our misuse of PETSc that produced the wrong result. As stated into the code: // The call to DMPlexNaturalToGlobalBegin/End does not produce our expected result... ? // In lGlobalVec, we expect to have: ? /* ?? * Process [0] ?? * 2. ?? * 4. ?? * 8. ?? * 3. ?? * 9. ?? * Process [1] ?? * 1. ?? * 5. ?? * 7. ?? * 0. ?? * 6. ?? * ?? * but we obtained: ?? * ?? * Process [0] ?? * 2. ?? * 4. ?? * 8. ?? * 0. ?? * 0. ?? * Process [1] ?? * 0. ?? * 0. ?? * 0. ?? * 0. ?? * 0. ?? */ (see attached ex44.c) Thanks, Eric On 2021-10-27 1:25 p.m., Eric Chamberland wrote: > > Great! > > Thanks Matthew, it is working for me up to that point! > > We are continuing the ex44.c and forward it to you at the next > blocking point... > > Eric > > On 2021-10-27 11:14 a.m., Matthew Knepley wrote: >> On Wed, Oct 27, 2021 at 8:29 AM Eric Chamberland >> > > wrote: >> >> Hi Matthew, >> >> the smallest mesh which crashes the code is a 2x5 mesh: >> >> See the modified ex44.c >> >> With smaller meshes(2x2, 2x4, etc), it passes...? But it bugs >> latter when I try to use DMPlexNaturalToGlobalBegin but let's >> keep that other problem for later... >> >> Thanks a lot for helping digging into this! :) >> >> I have made a small fix in this branch >> >> https://gitlab.com/petsc/petsc/-/commits/knepley/fix-plex-g2n >> >> >> It seems to run for me. Can you check it? >> >> ? Thanks, >> >> ? ? ?Matt >> >> Eric >> >> (sorry if you received this for a? 2nd times, I have trouble with >> my mail) >> >> On 2021-10-26 4:35 p.m., Matthew Knepley wrote: >>> On Tue, Oct 26, 2021 at 1:35 PM Eric Chamberland >>> >> > wrote: >>> >>> Here is a screenshot of the partition I hard coded (top) and >>> vertices/element numbers (down): >>> >>> I have not yet modified the ex44.c example to properly >>> assign the coordinates... >>> >>> (but I would not have done it like it is in the last version >>> because the sCoords array is the global array with global >>> vertices number) >>> >>> I will have time to do this tomorrow... >>> >>> Maybe I can first try to reproduce all this with a smaller mesh? >>> >>> >>> That might make it easier to find a problem. >>> >>> ? Thanks! >>> >>> ? ? ?Matt >>> >>> Eric >>> >>> On 2021-10-26 9:46 a.m., Matthew Knepley wrote: >>>> Okay, I ran it. Something seems off with the mesh. First, I >>>> cannot simply explain the partition. The number of shared >>>> vertices and edges >>>> does not seem to come from a straight cut. Second, the mesh >>>> look scrambled on output. >>>> >>>> ? Thanks, >>>> >>>> ? ? Matt >>>> >>>> On Sun, Oct 24, 2021 at 11:49 PM Eric Chamberland >>>> >>> > wrote: >>>> >>>> Hi Matthew, >>>> >>>> ok, I started back from your ex44.c example and added >>>> the global array of coordinates.? I just have to code >>>> the creation of the local coordinates now. 
>>>> >>>> Eric >>>> >>>> On 2021-10-20 6:55 p.m., Matthew Knepley wrote: >>>>> On Wed, Oct 20, 2021 at 3:06 PM Eric Chamberland >>>>> >>>> > wrote: >>>>> >>>>> Hi Matthew, >>>>> >>>>> we tried to reproduce the error in a simple example. >>>>> >>>>> The context is the following: We hard coded the >>>>> mesh and initial partition into the code (see >>>>> sConnectivity and sInitialPartition) for 2 ranks >>>>> and try to create a section in order to use the >>>>> DMPlexNaturalToGlobalBegin function to retreive >>>>> our initial element numbers. >>>>> >>>>> Now the call to DMPlexDistribute give different >>>>> errors depending on what type of component we ask >>>>> the field to be created.? For our objective, we >>>>> would like a global field to be created on >>>>> elements only (like a P0 interpolation). >>>>> >>>>> We now have the following error generated: >>>>> >>>>> [0]PETSC ERROR: --------------------- Error >>>>> Message >>>>> -------------------------------------------------------------- >>>>> [0]PETSC ERROR: Petsc has generated inconsistent data >>>>> [0]PETSC ERROR: Inconsistency in indices, 18 >>>>> should be 17 >>>>> [0]PETSC ERROR: See >>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html >>>>> >>>>> for trouble shooting. >>>>> [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar >>>>> 30, 2021 >>>>> [0]PETSC ERROR: ./bug on a? named rohan by ericc >>>>> Wed Oct 20 14:52:36 2021 >>>>> [0]PETSC ERROR: Configure options >>>>> --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 >>>>> --with-mpi-compilers=1 >>>>> --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 >>>>> --with-cxx-dialect=C++14 --with-make-np=12 >>>>> --with-shared-libraries=1 --with-debugging=yes >>>>> --with-memalign=64 --with-visibility=0 >>>>> --with-64-bit-indices=0 --download-ml=yes >>>>> --download-mumps=yes --download-superlu=yes >>>>> --download-hpddm=yes --download-slepc=yes >>>>> --download-superlu_dist=yes >>>>> --download-parmetis=yes --download-ptscotch=yes >>>>> --download-metis=yes --download-strumpack=yes >>>>> --download-suitesparse=yes --download-hypre=yes >>>>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>>>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >>>>> --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >>>>> --with-scalapack=1 >>>>> --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include >>>>> --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>>>> -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" >>>>> [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at >>>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 >>>>> [0]PETSC ERROR: #2 DMPlexCreateGlobalToNaturalSF() >>>>> at >>>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 >>>>> [0]PETSC ERROR: #3 DMPlexDistribute() at >>>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 >>>>> [0]PETSC ERROR: #4 main() at bug_section.cc:159 >>>>> [0]PETSC ERROR: No PETSc Option Table entries >>>>> [0]PETSC ERROR: ----------------End of Error >>>>> Message -------send entire error message to >>>>> petsc-maint at mcs.anl.gov >>>>> ---------- >>>>> >>>>> Hope the attached code is self-explaining, note >>>>> that to make it short, we have not included the >>>>> final part of it, just the buggy part we are >>>>> encountering right now... >>>>> >>>>> Thanks for your insights, >>>>> >>>>> Thanks for making the example. I tweaked it slightly. >>>>> I put in a test case that just makes a parallel 7 x 10 >>>>> quad mesh. 
This works >>>>> fine. Thus I think it must be something connected with >>>>> the original mesh. It is hard to get a handle on it >>>>> without the coordinates. >>>>> Do you think you could put the coordinate array in? I >>>>> have added the code to load them (see attached file). >>>>> >>>>> ? Thanks, >>>>> >>>>> ? ? ?Matt >>>>> >>>>> Eric >>>>> >>>>> On 2021-10-06 9:23 p.m., Matthew Knepley wrote: >>>>>> On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland >>>>>> >>>>> > wrote: >>>>>> >>>>>> Hi Matthew, >>>>>> >>>>>> we tried to use that.? Now, we discovered that: >>>>>> >>>>>> 1- even if we "ask" for sfNatural creation >>>>>> with DMSetUseNatural, it is not created >>>>>> because DMPlexCreateGlobalToNaturalSF looks >>>>>> for a "section": this is not documented in >>>>>> DMSetUseNaturalso we are asking ourselfs: "is >>>>>> this a permanent feature or a temporary >>>>>> situation?" >>>>>> >>>>>> I think explaining this will help clear up a lot. >>>>>> >>>>>> What the Natural2Global?map does is permute a >>>>>> solution vector into the ordering that it would >>>>>> have had prior to mesh distribution. >>>>>> Now, in order to do this permutation, I need to >>>>>> know the original (global) data layout. If it is >>>>>> not specified _before_ distribution, we >>>>>> cannot build the permutation.? The section >>>>>> describes the data layout, so I need it before >>>>>> distribution. >>>>>> >>>>>> I cannot think of another way that you would >>>>>> implement this, but if you want something else, >>>>>> let me know. >>>>>> >>>>>> 2- We then tried to create a "section" in >>>>>> different manners: we took the code into the >>>>>> example petsc/src/dm/impls/plex/tests/ex15.c. >>>>>> However, we ended up with a segfault: >>>>>> >>>>>> corrupted size vs. prev_size >>>>>> [rohan:07297] *** Process received signal *** >>>>>> [rohan:07297] Signal: Aborted (6) >>>>>> [rohan:07297] Signal code: (-6) >>>>>> [rohan:07297] [ 0] >>>>>> /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] >>>>>> [rohan:07297] [ 1] >>>>>> /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] >>>>>> [rohan:07297] [ 2] >>>>>> /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] >>>>>> [rohan:07297] [ 3] >>>>>> /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] >>>>>> [rohan:07297] [ 4] >>>>>> /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] >>>>>> [rohan:07297] [ 5] >>>>>> /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] >>>>>> [rohan:07297] [ 6] >>>>>> /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] >>>>>> [rohan:07297] [ 7] >>>>>> /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] >>>>>> [rohan:07297] [ 8] >>>>>> /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] >>>>>> [rohan:07297] [ 9] >>>>>> /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] >>>>>> [rohan:07297] [10] >>>>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] >>>>>> [rohan:07297] [11] >>>>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] >>>>>> [rohan:07297] [12] >>>>>> /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] >>>>>> [rohan:07297] [13] >>>>>> /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] >>>>>> >>>>>> [rohan:07297] [14] >>>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] >>>>>> [rohan:07297] [15] >>>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] >>>>>> [rohan:07297] [16] >>>>>> 
/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] >>>>>> [rohan:07297] [17] >>>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] >>>>>> [rohan:07297] [18] >>>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] >>>>>> >>>>>> I am not sure what happened here, but if you >>>>>> could send a sample code, I will figure it out. >>>>>> >>>>>> If we do not create a section, the call to >>>>>> DMPlexDistribute is successful, but >>>>>> DMPlexGetGlobalToNaturalSF return a null SF >>>>>> pointer... >>>>>> >>>>>> Yes, it just ignores it in this case because it >>>>>> does not have a global layout. >>>>>> >>>>>> Here are the operations we are calling ( this >>>>>> is almost the code we are using, I just >>>>>> removed verifications and creation of the >>>>>> connectivity which use our parallel structure >>>>>> and code): >>>>>> >>>>>> =========== >>>>>> >>>>>> ? PetscInt* lCells????? = 0; >>>>>> ? PetscInt lNumCorners = 0; >>>>>> ? PetscInt lDimMail??? = 0; >>>>>> ? PetscInt lnumCells?? = 0; >>>>>> >>>>>> ? //At this point we create the cells for >>>>>> PETSc expected input for >>>>>> DMPlexBuildFromCellListParallel and set >>>>>> lNumCorners, lDimMail and lnumCells to >>>>>> correct values. >>>>>> ? ... >>>>>> >>>>>> ? DM lDMBete = 0 >>>>>> DMPlexCreate(lMPIComm,&lDMBete); >>>>>> >>>>>> DMSetDimension(lDMBete, lDimMail); >>>>>> >>>>>> DMPlexBuildFromCellListParallel(lDMBete, >>>>>> ????????????????????????????????? lnumCells, >>>>>> ????????????????????????????????? PETSC_DECIDE, >>>>>> pLectureElementsLocaux.reqNbTotalSommets(), >>>>>> ????????????????????????????????? lNumCorners, >>>>>> ????????????????????????????????? lCells, >>>>>> ????????????????????????????????? PETSC_NULL); >>>>>> >>>>>> ? DM lDMBeteInterp = 0; >>>>>> DMPlexInterpolate(lDMBete, &lDMBeteInterp); >>>>>> DMDestroy(&lDMBete); >>>>>> ? lDMBete = lDMBeteInterp; >>>>>> >>>>>> DMSetUseNatural(lDMBete,PETSC_TRUE); >>>>>> >>>>>> ? PetscSF lSFMigrationSansOvl = 0; >>>>>> ? PetscSF lSFMigrationOvl = 0; >>>>>> ? DM lDMDistribueSansOvl = 0; >>>>>> ? DM lDMAvecOverlap = 0; >>>>>> >>>>>> PetscPartitioner lPart; >>>>>> DMPlexGetPartitioner(lDMBete, &lPart); >>>>>> PetscPartitionerSetFromOptions(lPart); >>>>>> >>>>>> ? PetscSection section; >>>>>> ? PetscInt numFields?? = 1; >>>>>> ? PetscInt numBC?????? = 0; >>>>>> ? PetscInt numComp[1]? = {1}; >>>>>> ? PetscInt numDof[4]?? = {1, 0, 0, 0}; >>>>>> ? PetscInt bcFields[1] = {0}; >>>>>> ? IS bcPoints[1] = {NULL}; >>>>>> >>>>>> DMSetNumFields(lDMBete, numFields); >>>>>> >>>>>> DMPlexCreateSection(lDMBete, NULL, numComp, >>>>>> numDof, numBC, bcFields, bcPoints, NULL, >>>>>> NULL, §ion); >>>>>> DMSetLocalSection(lDMBete, section); >>>>>> >>>>>> DMPlexDistribute(lDMBete, 0, >>>>>> &lSFMigrationSansOvl, &lDMDistribueSansOvl); >>>>>> // segfault! >>>>>> >>>>>> =========== >>>>>> >>>>>> So we have other question/remarks: >>>>>> >>>>>> 3- Maybe PETSc expect something specific that >>>>>> is missing/not verified: for example, we >>>>>> didn't gave any coordinates since we just >>>>>> want to partition and compute overlap for the >>>>>> mesh... and then recover our element numbers >>>>>> in a "simple way" >>>>>> >>>>>> 4- We are telling ourselves it is somewhat a >>>>>> "big price to pay" to have to build an unused >>>>>> section to have the global to natural >>>>>> ordering set ?? 
Could this requirement be >>>>>> avoided? >>>>>> >>>>>> I don't think so. There would have to be _some_ >>>>>> way of describing your data layout in terms of >>>>>> mesh points, and I do not see how you could use >>>>>> less memory doing that. >>>>>> >>>>>> 5- Are there any improvement towards our >>>>>> usages in 3.16 release? >>>>>> >>>>>> Let me try and run the code above. >>>>>> >>>>>> ? Thanks, >>>>>> >>>>>> ? ? ?Matt >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Eric >>>>>> >>>>>> >>>>>> On 2021-09-29 7:39 p.m., Matthew Knepley wrote: >>>>>>> On Wed, Sep 29, 2021 at 5:18 PM Eric >>>>>>> Chamberland >>>>>>> >>>>>> > >>>>>>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I come back with _almost_ the original >>>>>>> question: >>>>>>> >>>>>>> I would like to add an integer >>>>>>> information (*our* original element >>>>>>> number, not petsc one) on each element >>>>>>> of the DMPlex I create with >>>>>>> DMPlexBuildFromCellListParallel. >>>>>>> >>>>>>> I would like this interger to be >>>>>>> distribruted by or the same way >>>>>>> DMPlexDistribute distribute the mesh. >>>>>>> >>>>>>> Is it possible to do this? >>>>>>> >>>>>>> >>>>>>> I think we already have support for what you >>>>>>> want. If you call >>>>>>> >>>>>>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html >>>>>>> >>>>>>> >>>>>>> before DMPlexDistribute(), it will compute a >>>>>>> PetscSF encoding the global to natural map. You >>>>>>> can get it with >>>>>>> >>>>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html >>>>>>> >>>>>>> >>>>>>> and use it with >>>>>>> >>>>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html >>>>>>> >>>>>>> >>>>>>> Is this sufficient? >>>>>>> >>>>>>> ? Thanks, >>>>>>> >>>>>>> ? ? ?Matt >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Eric >>>>>>> >>>>>>> On 2021-07-14 1:18 p.m., Eric >>>>>>> Chamberland wrote: >>>>>>> > Hi, >>>>>>> > >>>>>>> > I want to use DMPlexDistribute from >>>>>>> PETSc for computing overlapping >>>>>>> > and play with the different >>>>>>> partitioners supported. >>>>>>> > >>>>>>> > However, after calling >>>>>>> DMPlexDistribute, I noticed the elements >>>>>>> are >>>>>>> > renumbered and then the original >>>>>>> number is lost. >>>>>>> > >>>>>>> > What would be the best way to keep >>>>>>> track of the element renumbering? >>>>>>> > >>>>>>> > a) Adding an optional parameter to let >>>>>>> the user retrieve a vector or >>>>>>> > "IS" giving the old number? >>>>>>> > >>>>>>> > b) Adding a DMLabel (seems a wrong >>>>>>> good solution) >>>>>>> > >>>>>>> > c) Other idea? >>>>>>> > >>>>>>> > Of course, I don't want to loose >>>>>>> performances with the need of this >>>>>>> > "mapping"... >>>>>>> > >>>>>>> > Thanks, >>>>>>> > >>>>>>> > Eric >>>>>>> > >>>>>>> -- >>>>>>> Eric Chamberland, ing., M. Ing >>>>>>> Professionnel de recherche >>>>>>> GIREF/Universit? Laval >>>>>>> (418) 656-2131 poste 41 22 42 >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> What most experimenters take for granted >>>>>>> before they begin their experiments is >>>>>>> infinitely more interesting than any results >>>>>>> to which their experiments lead. >>>>>>> -- Norbert Wiener >>>>>>> >>>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>>> >>>>>> >>>>>> -- >>>>>> Eric Chamberland, ing., M. Ing >>>>>> Professionnel de recherche >>>>>> GIREF/Universit? 
Laval >>>>>> (418) 656-2131 poste 41 22 42 >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> What most experimenters take for granted before >>>>>> they begin their experiments is infinitely more >>>>>> interesting than any results to which their >>>>>> experiments lead. >>>>>> -- Norbert Wiener >>>>>> >>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>> >>>>> >>>>> -- >>>>> Eric Chamberland, ing., M. Ing >>>>> Professionnel de recherche >>>>> GIREF/Universit? Laval >>>>> (418) 656-2131 poste 41 22 42 >>>>> >>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before they >>>>> begin their experiments is infinitely more interesting >>>>> than any results to which their experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://www.cse.buffalo.edu/~knepley/ >>>>> >>>> >>>> -- >>>> Eric Chamberland, ing., M. Ing >>>> Professionnel de recherche >>>> GIREF/Universit? Laval >>>> (418) 656-2131 poste 41 22 42 >>>> >>>> >>>> >>>> -- >>>> What most experimenters take for granted before they begin >>>> their experiments is infinitely more interesting than any >>>> results to which their experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>> >>> -- >>> Eric Chamberland, ing., M. Ing >>> Professionnel de recherche >>> GIREF/Universit? Laval >>> (418) 656-2131 poste 41 22 42 >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to >>> which their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >> >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which >> their experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 -- Eric Chamberland, ing., M. Ing Professionnel de recherche GIREF/Universit? Laval (418) 656-2131 poste 41 22 42 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: hbnbhlbilhmjdpfg.png Type: image/png Size: 42972 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: eejjfmbjimlkboec.png Type: image/png Size: 87901 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ex44.c Type: text/x-csrc Size: 13091 bytes Desc: not available URL: From karthikeyan.chockalingam at stfc.ac.uk Wed Oct 27 14:24:01 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Wed, 27 Oct 2021 19:24:01 +0000 Subject: [petsc-users] Cuda: Vec and Mat types Message-ID: Hello, I hope, I am framing the question currently. Are only distributed arrays (DMDA) of -vec_type and -mat_type only supported by CUDA? I am reading the petsc user manual in section 2.4 distributed arrays are introduced but at the start of chapter two there are other vector and matrix types as well. I wonder if these types (I don?t how they are referred by) are also CUDA supported? Can you please point me to some tutorial examples in KSP and SNES that can run on gpus? 
At the moment I am testing KSP/ex45.c with different preconditioners on cpus and gpus. I tried to run KSP/ex2.c with -vec_type cuda and -mat_type aijcuda noticed there was no gpu flops recorded in my log file. Many thanks, Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Wed Oct 27 14:47:56 2021 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 27 Oct 2021 15:47:56 -0400 Subject: [petsc-users] Question about setting block size for arbitrary Mat formats In-Reply-To: References: Message-ID: On Wed, Oct 27, 2021 at 12:25 PM Samuel Estes wrote: > Hi, > > I am solving a linear system in which the matrix has some block structure. > We will ultimately use the BAIJ format but for now we are just using the > default CSR and would like to play with different formats to compare > performance for our problem. Currently, I call MatSetBlockSize so that I > can then use MatSetValuesBlocked and MatSetValuesBlockedLocal. > > My question is: in the absence of specifying one of the blocked formats, > does setting the block size with MatSetBlockSize have any real effect on > performance? My understanding is that it is really just useful from a > programming perspective in that it allows you to set/access Mat values in > blocks which is often a natural way to do things. Obviously changing the > actual format to have a blocked structure could make a difference but I > just want to check if there's anything else going on under the hood with > the block size when the matrix is in AIJ format. > There are no performance gains with MatMult I think. There is some specialized code for inverting blocks, but it does not sound like you would be using that. Thanks, Matt > Thanks! > > Sam > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Wed Oct 27 15:13:36 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Wed, 27 Oct 2021 15:13:36 -0500 Subject: [petsc-users] Cuda: Vec and Mat types In-Reply-To: References: Message-ID: On Wed, Oct 27, 2021 at 2:24 PM Karthikeyan Chockalingam - STFC UKRI < karthikeyan.chockalingam at stfc.ac.uk> wrote: > Hello, > > > > I hope, I am framing the question currently. > > Are only distributed arrays (DMDA) of -vec_type and -mat_type only > supported by CUDA? > I don't understand this question. Currently, CUDA-capable types include VECCUDA, MATAIJCUDA and MATDENSECUDA, either sequential or MPI. > > > > I am reading the petsc user manual in section 2.4 distributed arrays are > introduced but at the start of chapter two there are other vector and > matrix types as well. 
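Returning to the block-size question above, a short sketch of how the blocked insertion API is typically used while the matrix stays in AIJ format (sizes and values are made up for illustration; the same insertion code keeps working if the format is switched at run time, e.g. with -mat_type baij):

  Mat         A;
  PetscInt    bs = 2, brow = 0, bcol = 1;        /* block row/column indices     */
  PetscScalar vals[4] = {1.0, 2.0, 3.0, 4.0};    /* one bs x bs block, row-major */

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 8, 8);
  MatSetFromOptions(A);            /* AIJ by default, or -mat_type baij */
  MatSetBlockSize(A, bs);          /* set before preallocation/MatSetUp */
  MatSetUp(A);
  MatSetValuesBlocked(A, 1, &brow, 1, &bcol, vals, INSERT_VALUES);
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);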
I wonder if these types (I don?t how they are > referred by) are also CUDA supported? > > > > Can you please point me to some tutorial examples in KSP and SNES that can > run on gpus? > search "-mat_type aijcusparse" or "-dm_mat_type aijcusparse" in petsc tests/tutorials, you will find many. > > > At the moment I am testing KSP/ex45.c with different preconditioners on > cpus and gpus. > > > > I tried to run KSP/ex2.c with -vec_type cuda and -mat_type aijcuda noticed > there was no gpu flops recorded in my log file. > It is -mat_type aijcusparse > > > Many thanks, > > Karthik. > > This email and any attachments are intended solely for the use of the > named recipients. If you are not the intended recipient you must not use, > disclose, copy or distribute this email or any of its attachments and > should notify the sender immediately and delete this email from your > system. UK Research and Innovation (UKRI) has taken every reasonable > precaution to minimise risk of this email or any attachments containing > viruses or malware but the recipient should carry out its own virus and > malware checks before opening the attachments. UKRI does not accept any > liability for any losses or damages which the recipient may sustain due to > presence of any viruses. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyan.chockalingam at stfc.ac.uk Wed Oct 27 15:37:24 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Wed, 27 Oct 2021 20:37:24 +0000 Subject: [petsc-users] Cuda: Vec and Mat types In-Reply-To: References: Message-ID: Thank you for your response. It tried running ksp/ex2.c using ./ex2 -m 9 -n 9 ?vec_type cuda -mat_type aijcusparse -ksp_type cg -pc_type jacobi -log_view but the log file didn?t record any gpu flops. Sorry, my next question doesn?t belong to this thread. Does DMDA only work on structured grid/mesh and not on unstructured grid/mesh? Best, Karthik. From: Junchao Zhang Date: Wednesday, 27 October 2021 at 21:13 To: "Chockalingam, Karthikeyan (STFC,DL,HC)" Cc: "petsc-users at mcs.anl.gov" Subject: Re: [petsc-users] Cuda: Vec and Mat types On Wed, Oct 27, 2021 at 2:24 PM Karthikeyan Chockalingam - STFC UKRI > wrote: Hello, I hope, I am framing the question currently. Are only distributed arrays (DMDA) of -vec_type and -mat_type only supported by CUDA? I don't understand this question. Currently, CUDA-capable types include VECCUDA, MATAIJCUDA and MATDENSECUDA, either sequential or MPI. I am reading the petsc user manual in section 2.4 distributed arrays are introduced but at the start of chapter two there are other vector and matrix types as well. I wonder if these types (I don?t how they are referred by) are also CUDA supported? Can you please point me to some tutorial examples in KSP and SNES that can run on gpus? search "-mat_type aijcusparse" or "-dm_mat_type aijcusparse" in petsc tests/tutorials, you will find many. At the moment I am testing KSP/ex45.c with different preconditioners on cpus and gpus. I tried to run KSP/ex2.c with -vec_type cuda and -mat_type aijcuda noticed there was no gpu flops recorded in my log file. It is -mat_type aijcusparse Many thanks, Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. 
UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Wed Oct 27 15:48:39 2021 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 27 Oct 2021 16:48:39 -0400 Subject: [petsc-users] Cuda: Vec and Mat types In-Reply-To: References: Message-ID: On Wed, Oct 27, 2021 at 4:37 PM Karthikeyan Chockalingam - STFC UKRI < karthikeyan.chockalingam at stfc.ac.uk> wrote: > Thank you for your response. > > > > It tried running ksp/ex2.c using > > > > ./ex2 -m 9 -n 9 ?vec_type cuda -mat_type aijcusparse -ksp_type cg -pc_type > jacobi -log_view > > > > but the log file didn?t record any gpu flops. > > > > Sorry, my next question doesn?t belong to this thread. > > Does DMDA only work on structured grid/mesh and not on unstructured > grid/mesh? > DMDA means a structured grid. DMPlex is an unstructured grid. Thanks, Matt > Best, > > Karthik. > > > > *From: *Junchao Zhang > *Date: *Wednesday, 27 October 2021 at 21:13 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] Cuda: Vec and Mat types > > > > > > > > On Wed, Oct 27, 2021 at 2:24 PM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > Hello, > > > > I hope, I am framing the question currently. > > Are only distributed arrays (DMDA) of -vec_type and -mat_type only > supported by CUDA? > > I don't understand this question. Currently, CUDA-capable types include > VECCUDA, MATAIJCUDA and MATDENSECUDA, either sequential or MPI. > > > > I am reading the petsc user manual in section 2.4 distributed arrays are > introduced but at the start of chapter two there are other vector and > matrix types as well. I wonder if these types (I don?t how they are > referred by) are also CUDA supported? > > > > Can you please point me to some tutorial examples in KSP and SNES that can > run on gpus? > > search "-mat_type aijcusparse" or "-dm_mat_type aijcusparse" in petsc > tests/tutorials, you will find many. > > > > > > At the moment I am testing KSP/ex45.c with different preconditioners on > cpus and gpus. > > > > I tried to run KSP/ex2.c with -vec_type cuda and -mat_type aijcuda noticed > there was no gpu flops recorded in my log file. > > It is -mat_type aijcusparse > > > > > > Many thanks, > > Karthik. > > This email and any attachments are intended solely for the use of the > named recipients. If you are not the intended recipient you must not use, > disclose, copy or distribute this email or any of its attachments and > should notify the sender immediately and delete this email from your > system. UK Research and Innovation (UKRI) has taken every reasonable > precaution to minimise risk of this email or any attachments containing > viruses or malware but the recipient should carry out its own virus and > malware checks before opening the attachments. UKRI does not accept any > liability for any losses or damages which the recipient may sustain due to > presence of any viruses. 
> > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Wed Oct 27 17:16:37 2021 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Wed, 27 Oct 2021 17:16:37 -0500 Subject: [petsc-users] Cuda: Vec and Mat types In-Reply-To: References: Message-ID: On Wed, Oct 27, 2021 at 3:37 PM Karthikeyan Chockalingam - STFC UKRI < karthikeyan.chockalingam at stfc.ac.uk> wrote: > Thank you for your response. > > > > It tried running ksp/ex2.c using > > > > ./ex2 -m 9 -n 9 ?vec_type cuda -mat_type aijcusparse -ksp_type cg -pc_type > jacobi -log_view > under src/ksp/ksp/tutorials, run this command (your old command line has a weird character ?) ./ex2 -m 9 -n 9 -vec_type cuda -mat_type aijcusparse -ksp_type cg -pc_type jacobi -log_view Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F --------------------------------------------------------------------------------------------------------------------------------------------------------------- --- Event Stage 0: Main Stage MatMult 14 1.0 5.5454e-04 1.0 9.20e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 40 0 0 0 0 40 0 0 0 17 40 1 4.78e-03 0 0.00e+00 100 MatAssemblyBegin 1 1.0 1.9960e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 MatAssemblyEnd 1 1.0 2.2575e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0 MatCUSPARSCopyTo 1 1.0 1.9215e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 4.78e-03 0 0.00e+00 0 VecTDot 26 1.0 7.3121e-04 1.0 4.19e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 18 0 0 0 0 18 0 0 0 6 9 0 0.00e+00 0 0.00e+00 100 VecNorm 15 1.0 8.1064e-04 1.0 2.42e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 10 0 0 0 0 10 0 0 0 3 4 0 0.00e+00 0 0.00e+00 100 > > but the log file didn?t record any gpu flops. > > > > Sorry, my next question doesn?t belong to this thread. > > Does DMDA only work on structured grid/mesh and not on unstructured > grid/mesh? > > > > Best, > > Karthik. > > > > *From: *Junchao Zhang > *Date: *Wednesday, 27 October 2021 at 21:13 > *To: *"Chockalingam, Karthikeyan (STFC,DL,HC)" < > karthikeyan.chockalingam at stfc.ac.uk> > *Cc: *"petsc-users at mcs.anl.gov" > *Subject: *Re: [petsc-users] Cuda: Vec and Mat types > > > > > > > > On Wed, Oct 27, 2021 at 2:24 PM Karthikeyan Chockalingam - STFC UKRI < > karthikeyan.chockalingam at stfc.ac.uk> wrote: > > Hello, > > > > I hope, I am framing the question currently. > > Are only distributed arrays (DMDA) of -vec_type and -mat_type only > supported by CUDA? > > I don't understand this question. Currently, CUDA-capable types include > VECCUDA, MATAIJCUDA and MATDENSECUDA, either sequential or MPI. > > > > I am reading the petsc user manual in section 2.4 distributed arrays are > introduced but at the start of chapter two there are other vector and > matrix types as well. I wonder if these types (I don?t how they are > referred by) are also CUDA supported? > > > > Can you please point me to some tutorial examples in KSP and SNES that can > run on gpus? > > search "-mat_type aijcusparse" or "-dm_mat_type aijcusparse" in petsc > tests/tutorials, you will find many. 
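For a DM-based tutorial such as ksp/ksp/tutorials/ex45.c, the GPU types are usually selected through the DM options rather than -vec_type/-mat_type; typical invocations on a CUDA-enabled build look like the following (adjust paths and problem sizes to taste):

  cd $PETSC_DIR/src/ksp/ksp/tutorials
  ./ex45 -dm_vec_type cuda -dm_mat_type aijcusparse -ksp_type cg -pc_type jacobi -log_view
  ./ex2  -m 9 -n 9 -vec_type cuda -mat_type aijcusparse -ksp_type cg -pc_type jacobi -log_view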
> > > > > > At the moment I am testing KSP/ex45.c with different preconditioners on > cpus and gpus. > > > > I tried to run KSP/ex2.c with -vec_type cuda and -mat_type aijcuda noticed > there was no gpu flops recorded in my log file. > > It is -mat_type aijcusparse > > > > > > Many thanks, > > Karthik. > > This email and any attachments are intended solely for the use of the > named recipients. If you are not the intended recipient you must not use, > disclose, copy or distribute this email or any of its attachments and > should notify the sender immediately and delete this email from your > system. UK Research and Innovation (UKRI) has taken every reasonable > precaution to minimise risk of this email or any attachments containing > viruses or malware but the recipient should carry out its own virus and > malware checks before opening the attachments. UKRI does not accept any > liability for any losses or damages which the recipient may sustain due to > presence of any viruses. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yuanxi at advancesoft.jp Thu Oct 28 03:59:31 2021 From: yuanxi at advancesoft.jp (=?UTF-8?B?6KKB54WV?=) Date: Thu, 28 Oct 2021 17:59:31 +0900 Subject: [petsc-users] How to construct DMPlex of cells with different topological dimension? In-Reply-To: References: Message-ID: Dear Matt, Thank you for your quick response. I think what you mean is to build DAG from my mesh at first and then call DMPlexCreateFromDAG () to construct DMPlex. A new problem is, as I know, the function DMPlexInterpolate would generate points with different depth. What's the difference between those faces and segment elements generated by DMPlexInterpolate with that defined by the original mesh, or should we not use DMPlexInterpolate in such a case? On the other hand, can DMComposite be used in this case by defining DMPlex with different topological dimensions at first and then composite them? Thanks in advance. Yuan 2021?10?27?(?) 19:27 Matthew Knepley : > On Wed, Oct 27, 2021 at 4:50 AM ?? wrote: > >> Hi, >> >> I am trying to parallelize my serial FEM program using PETSc. This >> program calculates structure deformation by using various types of elements >> such as solid, shell, beam, and truss. At the very beginning, I found it >> was hard for me to put such kinds of elements into DMPlex. Because solid >> elements are topologically three dimensional, shell element two, and beam >> or truss are topologically one-dimensional elements. After reading chapter >> 2.10: "DMPlex: Unstructured Grids in PETSc" of users manual carefully, I >> found the provided functions, such as DMPlexSetCone, cannot declare those >> topological differences. >> >> My question is : Is it possible and how to define all those topologically >> different elements into a DMPlex struct? >> > > Yes. The idea is to program in a dimension-independent way, so that the > code can handle cells of any dimension. > What you probably want is the "depth" in the DAG representation, which you > can think of as the dimension of a cell. > > > https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetPointDepth.html#DMPlexGetPointDepth > > Thanks, > > Matt > > >> Thanks in advance! >> >> Best regards, >> >> Yuan. >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. 
> -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Thu Oct 28 08:05:14 2021 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 28 Oct 2021 09:05:14 -0400 Subject: [petsc-users] How to construct DMPlex of cells with different topological dimension? In-Reply-To: References: Message-ID: On Thu, Oct 28, 2021 at 4:59 AM ?? wrote: > Dear Matt, > > Thank you for your quick response. > > I think what you mean is to build DAG from my mesh at first and then call > DMPlexCreateFromDAG > () > to construct DMPlex. > No, I do not mean that. > A new problem is, as I know, the function DMPlexInterpolate would > generate points with different depth. What's the difference between those > faces and segment elements generated by DMPlexInterpolate with that > defined by the original mesh, or should we not use DMPlexInterpolate in > such a case? > > On the other hand, can DMComposite be used in this case by defining DMPlex > with different topological dimensions at first and then composite them? > You do not need that. I am obviously not understanding your question. My short answer is that Plex _already_ handles cells of different dimension automatically without anything extra. Maybe it would help if you defined a specific problem you have. Thanks, Matt > Thanks in advance. > > Yuan > > > 2021?10?27?(?) 19:27 Matthew Knepley : > >> On Wed, Oct 27, 2021 at 4:50 AM ?? wrote: >> >>> Hi, >>> >>> I am trying to parallelize my serial FEM program using PETSc. This >>> program calculates structure deformation by using various types of elements >>> such as solid, shell, beam, and truss. At the very beginning, I found it >>> was hard for me to put such kinds of elements into DMPlex. Because solid >>> elements are topologically three dimensional, shell element two, and beam >>> or truss are topologically one-dimensional elements. After reading chapter >>> 2.10: "DMPlex: Unstructured Grids in PETSc" of users manual carefully, I >>> found the provided functions, such as DMPlexSetCone, cannot declare those >>> topological differences. >>> >>> My question is : Is it possible and how to define all those >>> topologically different elements into a DMPlex struct? >>> >> >> Yes. The idea is to program in a dimension-independent way, so that the >> code can handle cells of any dimension. >> What you probably want is the "depth" in the DAG representation, which >> you can think of as the dimension of a cell. >> >> >> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetPointDepth.html#DMPlexGetPointDepth >> >> Thanks, >> >> Matt >> >> >>> Thanks in advance! >>> >>> Best regards, >>> >>> Yuan. >>> >>> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From rafael.m.silva at alumni.usp.br Thu Oct 28 14:12:57 2021 From: rafael.m.silva at alumni.usp.br (Rafael Monteiro da Silva) Date: Thu, 28 Oct 2021 16:12:57 -0300 Subject: [petsc-users] Installation on NEC SX-Aurora TSUBASA Message-ID: Hello. 
On my machine, for initial tests, I use the following options to install petsc: PETSC_DIR=/home/rafael/petsc PETSC_ARCH=optimized-v3.15.5 --with-debugging=0 COPTFLAGS="-O3 -march=native -mtune=native" CXXOPTFLAGS="-O3 -march=native -mtune=native" FOPTFLAGS="-O3 -march=native -mtune=native" --with-cc=gcc --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich --download-superlu_dist --download-metis --download-parmetis --download-mumps --download-scalapack --download-hdf5 I need to test our software in an environment with NEC SX-Aurora TSUBASA Vector Engine. Is there any resource where I can set up petsc to use Vector Engine? Thank you! Regards, Rafael. -------------- next part -------------- An HTML attachment was scrubbed... URL: From balay at mcs.anl.gov Thu Oct 28 14:26:22 2021 From: balay at mcs.anl.gov (Satish Balay) Date: Thu, 28 Oct 2021 14:26:22 -0500 (CDT) Subject: [petsc-users] Installation on NEC SX-Aurora TSUBASA In-Reply-To: References: Message-ID: >>>> https://www.nec.com/en/global/solutions/hpc/sx/software.html? Everything that compiles for Linux can also be compiled for the Vector Engine. The compilers support Fortran 2003, with extensions from Fortran 2008, as well as C++14. The compilers are able to vectorize and auto-parallelize loops. For general parallelization OpenMP and MPI will be supported. <<<< So I guess you can give it a try [using MPI provided by NEC] and see how it goes.. [assuming they also provide blas/lapack] ./configure --with-cc=mpicc --with-fc=mpif90 --with-cxx=mpicxx --with-blaslapack-lib="-lneclapack -lnecblas" And if needed - the additional option --with-batch=1 Once this works - you can try additional build options. Satish On Thu, 28 Oct 2021, Rafael Monteiro da Silva wrote: > Hello. > > On my machine, for initial tests, I use the following options to install > petsc: > > PETSC_DIR=/home/rafael/petsc PETSC_ARCH=optimized-v3.15.5 > --with-debugging=0 COPTFLAGS="-O3 -march=native -mtune=native" > CXXOPTFLAGS="-O3 -march=native -mtune=native" FOPTFLAGS="-O3 -march=native > -mtune=native" --with-cc=gcc --with-cxx=g++ --with-fc=gfortran > --download-fblaslapack --download-mpich --download-superlu_dist > --download-metis --download-parmetis --download-mumps --download-scalapack > --download-hdf5 > > I need to test our software in an environment with NEC SX-Aurora > TSUBASA Vector Engine. > Is there any resource where I can set up petsc to use Vector Engine? > > Thank you! > Regards, > Rafael. > From stefano.zampini at gmail.com Thu Oct 28 14:38:12 2021 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Thu, 28 Oct 2021 22:38:12 +0300 Subject: [petsc-users] Installation on NEC SX-Aurora TSUBASA In-Reply-To: References: Message-ID: Rafael PETSc can be built for NEC vector engines. Here is a sample configure script https://gitlab.com/petsc/petsc/-/blob/main/config/examples/arch-necve.py NEC blas lapack should be automatically used. I don?t know if the packages you need will compile and run smoothly. Their C/C++ compiler is very buggy, and I had to resort compiling with -O1to get almost all PETSc tests pass. PETSc automatically uses this optimization flag if you compile using with-debugging=0. Do not use higher optimizations, unless you are willing to file bug reports to them Stefano > On Oct 28, 2021, at 10:12 PM, Rafael Monteiro da Silva wrote: > > Hello. 
> > On my machine, for initial tests, I use the following options to install petsc: > > PETSC_DIR=/home/rafael/petsc PETSC_ARCH=optimized-v3.15.5 --with-debugging=0 COPTFLAGS="-O3 -march=native -mtune=native" CXXOPTFLAGS="-O3 -march=native -mtune=native" FOPTFLAGS="-O3 -march=native -mtune=native" --with-cc=gcc --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich --download-superlu_dist --download-metis --download-parmetis --download-mumps --download-scalapack --download-hdf5 > > I need to test our software in an environment with NEC SX-Aurora TSUBASA Vector Engine. > Is there any resource where I can set up petsc to use Vector Engine? > > Thank you! > Regards, > Rafael. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rafael.m.silva at alumni.usp.br Thu Oct 28 14:52:29 2021 From: rafael.m.silva at alumni.usp.br (Rafael Monteiro da Silva) Date: Thu, 28 Oct 2021 16:52:29 -0300 Subject: [petsc-users] Installation on NEC SX-Aurora TSUBASA In-Reply-To: References: Message-ID: Thank you Satish and Stefano for pointing me out how to do this. Stefano, if I'm interpreting correctly, I could try to add build options I need to this script. Is that correct? First, I'll try to install (based on arch-necve.py script) and then, as Satish suggested, include additional build options. Rafael. Em qui., 28 de out. de 2021 ?s 16:38, Stefano Zampini < stefano.zampini at gmail.com> escreveu: > Rafael > > PETSc can be built for NEC vector engines. Here is a sample configure > script > https://gitlab.com/petsc/petsc/-/blob/main/config/examples/arch-necve.py > NEC blas lapack should be automatically used. > > I don?t know if the packages you need will compile and run smoothly. Their > C/C++ compiler is very buggy, and I had to resort compiling with -O1to get > almost all PETSc tests pass. > PETSc automatically uses this optimization flag if you compile using > with-debugging=0. Do not use higher optimizations, unless you are willing > to file bug reports to them > > > Stefano > > On Oct 28, 2021, at 10:12 PM, Rafael Monteiro da Silva < > rafael.m.silva at alumni.usp.br> wrote: > > Hello. > > On my machine, for initial tests, I use the following options to install > petsc: > > PETSC_DIR=/home/rafael/petsc PETSC_ARCH=optimized-v3.15.5 > --with-debugging=0 COPTFLAGS="-O3 -march=native -mtune=native" > CXXOPTFLAGS="-O3 -march=native -mtune=native" FOPTFLAGS="-O3 -march=native > -mtune=native" --with-cc=gcc --with-cxx=g++ --with-fc=gfortran > --download-fblaslapack --download-mpich --download-superlu_dist > --download-metis --download-parmetis --download-mumps --download-scalapack > --download-hdf5 > > I need to test our software in an environment with NEC SX-Aurora > TSUBASA Vector Engine. > Is there any resource where I can set up petsc to use Vector Engine? > > Thank you! > Regards, > Rafael. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.zampini at gmail.com Thu Oct 28 14:54:36 2021 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Thu, 28 Oct 2021 22:54:36 +0300 Subject: [petsc-users] Installation on NEC SX-Aurora TSUBASA In-Reply-To: References: Message-ID: <72466844-80DA-447D-AA8F-ED7FC074645C@gmail.com> > On Oct 28, 2021, at 10:52 PM, Rafael Monteiro da Silva wrote: > > Thank you Satish and Stefano for pointing me out how to do this. > > Stefano, if I'm interpreting correctly, I could try to add build options I need to this script. Is that correct? 
The script configures PETSc with default options for NEC. I don?t recommend changing compilation flags > > First, I'll try to install (based on arch-necve.py script) and then, as Satish suggested, include additional build options. > Good luck with building and running these external packages > > Rafael. > > Em qui., 28 de out. de 2021 ?s 16:38, Stefano Zampini > escreveu: > Rafael > > PETSc can be built for NEC vector engines. Here is a sample configure script https://gitlab.com/petsc/petsc/-/blob/main/config/examples/arch-necve.py > NEC blas lapack should be automatically used. > > I don?t know if the packages you need will compile and run smoothly. Their C/C++ compiler is very buggy, and I had to resort compiling with -O1to get almost all PETSc tests pass. > PETSc automatically uses this optimization flag if you compile using with-debugging=0. Do not use higher optimizations, unless you are willing to file bug reports to them > > > Stefano > >> On Oct 28, 2021, at 10:12 PM, Rafael Monteiro da Silva > wrote: >> >> Hello. >> >> On my machine, for initial tests, I use the following options to install petsc: >> >> PETSC_DIR=/home/rafael/petsc PETSC_ARCH=optimized-v3.15.5 --with-debugging=0 COPTFLAGS="-O3 -march=native -mtune=native" CXXOPTFLAGS="-O3 -march=native -mtune=native" FOPTFLAGS="-O3 -march=native -mtune=native" --with-cc=gcc --with-cxx=g++ --with-fc=gfortran --download-fblaslapack --download-mpich --download-superlu_dist --download-metis --download-parmetis --download-mumps --download-scalapack --download-hdf5 >> >> I need to test our software in an environment with NEC SX-Aurora TSUBASA Vector Engine. >> Is there any resource where I can set up petsc to use Vector Engine? >> >> Thank you! >> Regards, >> Rafael. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rafael.m.silva at alumni.usp.br Thu Oct 28 15:03:21 2021 From: rafael.m.silva at alumni.usp.br (Rafael Monteiro da Silva) Date: Thu, 28 Oct 2021 17:03:21 -0300 Subject: [petsc-users] Installation on NEC SX-Aurora TSUBASA In-Reply-To: <72466844-80DA-447D-AA8F-ED7FC074645C@gmail.com> References: <72466844-80DA-447D-AA8F-ED7FC074645C@gmail.com> Message-ID: Thanks for examplation, Stefano. I was referring to options for downloading external packages, like mumps. My understanding is that I'll need to look for how to build those packages and check if they support nec vector engine, is that correct? Rafael. Em qui., 28 de out. de 2021 ?s 16:54, Stefano Zampini < stefano.zampini at gmail.com> escreveu: > > > On Oct 28, 2021, at 10:52 PM, Rafael Monteiro da Silva < > rafael.m.silva at alumni.usp.br> wrote: > > Thank you Satish and Stefano for pointing me out how to do this. > > Stefano, if I'm interpreting correctly, I could try to add build options I > need to this script. Is that correct? > > > The script configures PETSc with default options for NEC. I don?t > recommend changing compilation flags > > > > First, I'll try to install (based on arch-necve.py script) and then, as > Satish suggested, include additional build options. > > > Good luck with building and running these external packages > > > Rafael. > > Em qui., 28 de out. de 2021 ?s 16:38, Stefano Zampini < > stefano.zampini at gmail.com> escreveu: > >> Rafael >> >> PETSc can be built for NEC vector engines. Here is a sample configure >> script >> https://gitlab.com/petsc/petsc/-/blob/main/config/examples/arch-necve.py >> NEC blas lapack should be automatically used. 
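Putting the pieces of this thread together, a first configure attempt on the VE might look like the following (a sketch only: the BLAS/LAPACK library names follow Satish's suggestion and are not verified, and whether MUMPS and the other downloaded packages build with the NEC compiler is exactly the open question here):

  ./configure --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 \
              --with-blaslapack-lib="-lneclapack -lnecblas" \
              --with-debugging=0 --with-batch=1 \
              --download-metis --download-parmetis --download-scalapack --download-mumps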
>> >> I don?t know if the packages you need will compile and run smoothly. >> Their C/C++ compiler is very buggy, and I had to resort compiling with >> -O1to get almost all PETSc tests pass. >> PETSc automatically uses this optimization flag if you compile using >> with-debugging=0. Do not use higher optimizations, unless you are willing >> to file bug reports to them >> >> >> Stefano >> >> On Oct 28, 2021, at 10:12 PM, Rafael Monteiro da Silva < >> rafael.m.silva at alumni.usp.br> wrote: >> >> Hello. >> >> On my machine, for initial tests, I use the following options to install >> petsc: >> >> PETSC_DIR=/home/rafael/petsc PETSC_ARCH=optimized-v3.15.5 >> --with-debugging=0 COPTFLAGS="-O3 -march=native -mtune=native" >> CXXOPTFLAGS="-O3 -march=native -mtune=native" FOPTFLAGS="-O3 -march=native >> -mtune=native" --with-cc=gcc --with-cxx=g++ --with-fc=gfortran >> --download-fblaslapack --download-mpich --download-superlu_dist >> --download-metis --download-parmetis --download-mumps --download-scalapack >> --download-hdf5 >> >> I need to test our software in an environment with NEC SX-Aurora >> TSUBASA Vector Engine. >> Is there any resource where I can set up petsc to use Vector Engine? >> >> Thank you! >> Regards, >> Rafael. >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yuanxi at advancesoft.jp Thu Oct 28 21:48:47 2021 From: yuanxi at advancesoft.jp (=?UTF-8?B?6KKB54WV?=) Date: Fri, 29 Oct 2021 11:48:47 +0900 Subject: [petsc-users] How to construct DMPlex of cells with different topological dimension? In-Reply-To: References: Message-ID: Dear Matt, My mesh is something like the following figure, which is composed of three elements : one hexahedron(solid element), one quadrilateral (shell element), and one line (beam element). I found the function "TestEmptyStrata" in file \dm\impls\plex\tests\ex11.c would be a good example to read in such a kind of mesh by using DMPlexSetCone. But a problem is that you should declare all faces and edges of hexahedron element, all edges in quadrilateral element by DMPlexSetCone, otherwise PETsc could not do topological interpolation afterwards. Am I right here? As general in FEM mesh, my mesh does not contain any information about faces or edges of solid elements. That's why I consider using DMCOMPOSITE. That is - Put hexahedron, quadrilateral, and line elements into different DM structures. - do topological interpolation in those DMs separately. - composite them. Is there anything wrong in my above consideration? Any suggestions? ------------ /| /| / | / | cell 0: Hex / | / | ------------/ | | | | | | | | | cell 1: Quad | --------|---|------------ | / | / / | / | / / |/ |/ / ------------------------------------------- cell 2: line Much thanks for your help. Yuan 2021?10?28?(?) 22:05 Matthew Knepley : > On Thu, Oct 28, 2021 at 4:59 AM ?? wrote: > >> Dear Matt, >> >> Thank you for your quick response. >> >> I think what you mean is to build DAG from my mesh at first and then call >> DMPlexCreateFromDAG >> () >> to construct DMPlex. >> > > No, I do not mean that. > > >> A new problem is, as I know, the function DMPlexInterpolate would >> generate points with different depth. What's the difference between those >> faces and segment elements generated by DMPlexInterpolate with that >> defined by the original mesh, or should we not use DMPlexInterpolate in >> such a case? 
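A minimal sketch of entering the three-cell mesh drawn above as a single, uninterpolated DMPlex, in the style of TestEmptyStrata in ex11.c (all point numbers are hypothetical and the vertex orderings are only illustrative):

  DM       dm;
  PetscInt coneHex[8]  = {3, 4, 5, 6, 7, 8, 9, 10};  /* cell 0: hexahedron    */
  PetscInt coneQuad[4] = {4, 7, 11, 12};              /* cell 1: quadrilateral */
  PetscInt coneSeg[2]  = {11, 13};                    /* cell 2: line          */

  DMPlexCreate(PETSC_COMM_WORLD, &dm);
  DMSetDimension(dm, 3);
  DMPlexSetChart(dm, 0, 14);        /* points 0-2 are the cells, 3-13 the vertices */
  DMPlexSetConeSize(dm, 0, 8);
  DMPlexSetConeSize(dm, 1, 4);
  DMPlexSetConeSize(dm, 2, 2);
  DMSetUp(dm);
  DMPlexSetCone(dm, 0, coneHex);
  DMPlexSetCone(dm, 1, coneQuad);
  DMPlexSetCone(dm, 2, coneSeg);
  DMPlexSymmetrize(dm);
  DMPlexStratify(dm);

In this uninterpolated form no faces or edges are listed at all; whether DMPlexInterpolate() handles the mixed-dimension case as expected is the open question in this thread.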
>> >> On the other hand, can DMComposite be used in this case by defining >> DMPlex with different topological dimensions at first and then composite >> them? >> > > You do not need that. I am obviously not understanding your question. My > short answer is that Plex _already_ handles cells of different > dimension automatically without anything extra. > > Maybe it would help if you defined a specific problem you have. > > Thanks, > > Matt > > >> Thanks in advance. >> >> Yuan >> >> >> 2021?10?27?(?) 19:27 Matthew Knepley : >> >>> On Wed, Oct 27, 2021 at 4:50 AM ?? wrote: >>> >>>> Hi, >>>> >>>> I am trying to parallelize my serial FEM program using PETSc. This >>>> program calculates structure deformation by using various types of elements >>>> such as solid, shell, beam, and truss. At the very beginning, I found it >>>> was hard for me to put such kinds of elements into DMPlex. Because solid >>>> elements are topologically three dimensional, shell element two, and beam >>>> or truss are topologically one-dimensional elements. After reading chapter >>>> 2.10: "DMPlex: Unstructured Grids in PETSc" of users manual carefully, I >>>> found the provided functions, such as DMPlexSetCone, cannot declare those >>>> topological differences. >>>> >>>> My question is : Is it possible and how to define all those >>>> topologically different elements into a DMPlex struct? >>>> >>> >>> Yes. The idea is to program in a dimension-independent way, so that the >>> code can handle cells of any dimension. >>> What you probably want is the "depth" in the DAG representation, which >>> you can think of as the dimension of a cell. >>> >>> >>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetPointDepth.html#DMPlexGetPointDepth >>> >>> Thanks, >>> >>> Matt >>> >>> >>>> Thanks in advance! >>>> >>>> Best regards, >>>> >>>> Yuan. >>>> >>>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >>> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yhcy1993 at gmail.com Thu Oct 28 21:49:31 2021 From: yhcy1993 at gmail.com (=?UTF-8?B?5LuT5a6H?=) Date: Fri, 29 Oct 2021 10:49:31 +0800 Subject: [petsc-users] Strange behavior of TS after setting hand-coded Jacobian In-Reply-To: <9CC15214-4601-4554-808F-C3E96DC3D34A@petsc.dev> References: <0C6ACBF3-F457-4BFD-AD19-8C455444748F@petsc.dev> <9CC15214-4601-4554-808F-C3E96DC3D34A@petsc.dev> Message-ID: Thanks for your careful inspection and thoughtful suggestions. > finite differencing may put a small non-zero value in that location due to numerical round-off I think your explanation is reasonable. This numerical round-off may somehow help to avoid this pivot issue. The structure of my jacobian matrix looks like this (generated by '-mat_view draw'): [image: jac_view.png] Analytically, the first diagonal element of the jacobian is indeed 0, as its corresponding residual function is solely determined from boundary condition of another variable. This seems a little bit wired but is mathematically well-posed. 
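For a structurally zero pivot like this, the factorization-based preconditioners also expose run-time options that can be tried (a sketch; whether they help depends on the matrix):

  -pc_type lu -pc_factor_nonzeros_along_diagonal   # permute to move the zero off the diagonal
  -pc_type lu -pc_factor_shift_type nonzero        # shift (near-)zero pivots when they appear
  -ksp_error_if_not_converged                      # stop at the first failing inner solve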
For more description about the background physics, please refer to attached PDF file for more detailed explanation on the discretization and boundary conditions. Actually, the jacobian matrix is not singular, but I do believe this numerical difficulty is caused by the zero-element in diagonal. In this regard, I've performed some trial and test. It seems that several methods have been worked out for this pivot issue: a) By setting '-pc_type svd', PETSC does not panic any more with my hand-coded jacobian, and converged solution is obtained. Efficiency is also preserved. b) By setting '-pc_type none', converged solution is also obtained, but it takes too many KSP iterations to converge per SNES step (usually hundreds), making the overall solution procedure very slow. Do you think these methods really solved this kind of pivot issue? Not by chance like the numerical round-off in finite difference previously. Regards Yu Cang Barry Smith ?2021?10?27??? ??9:43??? > > > You can run with -ksp_error_if_not_converged to get it to stop as soon as a linear solve fails to help track down the exact breaking point. > > > The problem under consideration contains an eigen-value to be solved, > > making the first diagonal element of the jacobian matrix being zero. > > From these outputs, it seems that the PC failed to factorize, which is > > caused by this 0 diagonal element. But I'm wondering why it works > > with jacobian matrix generated by finite-difference? > > Presumably your "exact" Jacobian puts a zero on the diagonal while the finite differencing may put a small non-zero value in that location due to numerical round-off. In that case even if the factorization succeeds it may produce an inaccurate solution if the value on the diagonal is very small. > > If your matrix is singular or cannot be factored with LU then you need to use a different solver for the linear system that will be robust to the zero on the diagonal. What is the structure of your Jacobian? (The analytic form). > > Barry > > > > On Oct 27, 2021, at 1:47 AM, ?? wrote: > > > > Thanks for your kind reply. > > > > Several comparison tests have been performed. Attached are execution > > output files. Below are corresponding descriptions. > > > > good.txt -- Run without hand-coded jacobian, solution converged, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason'; > > jac1.txt -- Run with hand-coded jacobian, does not converge, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian'; > > jac2.txt -- Run with hand-coded jacobian, does not converge, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian > > -ksp_view'; > > jac3.txt -- Run with hand-coded jacobian, does not converge, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian > > -ksp_view -ts_max_snes_failures -1 '; > > > > The problem under consideration contains an eigen-value to be solved, > > making the first diagonal element of the jacobian matrix being zero. > > From these outputs, it seems that the PC failed to factorize, which is > > caused by this 0 diagonal element. But I'm wondering why it works > > with jacobian matrix generated by finite-difference? Would employing > > DMDA for discretization be helpful? 
> > > > Regards > > > > Yu Cang > > > > Barry Smith ?2021?10?25??? ??10:50??? > >> > >> > >> It is definitely unexpected that -snes_test_jacobian verifies the Jacobian as matching but the solve process is completely different. > >> > >> Please run with -snes_monitor -snes_converged_reason -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian and send all the output > >> > >> Barry > >> > >> > >>> On Oct 25, 2021, at 9:53 AM, ?? wrote: > >>> > >>> I'm using TS to solve a set of DAE, which originates from a > >>> one-dimensional problem. The grid points are uniformly distributed. > >>> For simplicity, the DMDA is not employed for discretization. > >>> > >>> At first, only the residual function is prescribed through > >>> 'TSSetIFunction', and PETSC produces converged results. However, after > >>> providing hand-coded Jacobian through 'TSSetIJacobian', the internal > >>> SNES object fails (residual norm does not change), and TS reports > >>> 'DIVERGED_STEP_REJECTED'. > >>> > >>> I have tried to add the option '-snes_test_jacobian' to see if the > >>> hand-coded jacobian is somewhere wrong, but it shows '||J - > >>> Jfd||_F/||J||_F = 1.07488e-10, ||J - Jfd||_F = 2.14458e-07', > >>> indicating that the hand-coded jacobian is correct. > >>> > >>> Then, I added a monitor for the internal SNES object through > >>> 'SNESMonitorSet', in which the solution vector will be displayed at > >>> each iteration. It is interesting to find that, if the jacobian is not > >>> provided, meaning finite-difference is utilized for jacobian > >>> evaluation internally, the solution vector converges to steady > >>> solution and the SNES residual norm is reduced continuously. However, > >>> it turns out that, as long as the jacobian is provided, the solution > >>> vector will NEVER get changed! So the solution procedure stucked! > >>> > >>> This is quite strange! Hope to get some advice. > >>> PETSC version=3.14.6, program run in serial mode. > >>> > >>> Regards > >>> > >>> Yu Cang > >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: jac_view.png Type: image/png Size: 1998 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: TFM.pdf Type: application/pdf Size: 44074 bytes Desc: not available URL: From yhcy1993 at gmail.com Thu Oct 28 22:10:05 2021 From: yhcy1993 at gmail.com (=?UTF-8?B?5LuT5a6H?=) Date: Fri, 29 Oct 2021 11:10:05 +0800 Subject: [petsc-users] Strange behavior of TS after setting hand-coded Jacobian In-Reply-To: <918457CD-6B4F-49A2-9029-E6BAD039C9C0@anl.gov> References: <0C6ACBF3-F457-4BFD-AD19-8C455444748F@petsc.dev> <918457CD-6B4F-49A2-9029-E6BAD039C9C0@anl.gov> Message-ID: Thanks for your kind reply. Actually, I've checked the ascii output of '-snes_test_jacobian_view' already, but these detailed diagnostics were not attached in the previous post. It shows no difference between finite-difference and hand-coded jacobian matrices. So I declared that the hand-coded jacobian is correct in previous post. Regards Yu Cang Zhang, Hong ?2021?10?27??? ??10:43??? > > Since your Jacobian matrix is small, it is possible to compare your hand-written Jacobian with the finite-difference approximation directly. Add -snes_test_jacobian_view to print out the matrices. Then you can see exactly where the difference is. > > Hong > > > On Oct 27, 2021, at 12:47 AM, ?? wrote: > > > > Thanks for your kind reply. 
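The checks being discussed here can be combined on a single run; with a placeholder executable name this looks like (a sketch only):

  ./my_ts_app -snes_test_jacobian -snes_test_jacobian_view \
              -snes_converged_reason -ksp_converged_reason -ksp_error_if_not_converged

-snes_test_jacobian compares the hand-coded Jacobian against a finite-difference one at each SNES solve, and -snes_test_jacobian_view additionally prints both matrices and their difference.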
> > > > Several comparison tests have been performed. Attached are execution > > output files. Below are corresponding descriptions. > > > > good.txt -- Run without hand-coded jacobian, solution converged, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason'; > > jac1.txt -- Run with hand-coded jacobian, does not converge, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian'; > > jac2.txt -- Run with hand-coded jacobian, does not converge, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian > > -ksp_view'; > > jac3.txt -- Run with hand-coded jacobian, does not converge, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian > > -ksp_view -ts_max_snes_failures -1 '; > > > > The problem under consideration contains an eigen-value to be solved, > > making the first diagonal element of the jacobian matrix being zero. > > From these outputs, it seems that the PC failed to factorize, which is > > caused by this 0 diagonal element. But I'm wondering why it works > > with jacobian matrix generated by finite-difference? Would employing > > DMDA for discretization be helpful? > > > > Regards > > > > Yu Cang > > > > Barry Smith ?2021?10?25??? ??10:50??? > >> > >> > >> It is definitely unexpected that -snes_test_jacobian verifies the Jacobian as matching but the solve process is completely different. > >> > >> Please run with -snes_monitor -snes_converged_reason -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian and send all the output > >> > >> Barry > >> > >> > >>> On Oct 25, 2021, at 9:53 AM, ?? wrote: > >>> > >>> I'm using TS to solve a set of DAE, which originates from a > >>> one-dimensional problem. The grid points are uniformly distributed. > >>> For simplicity, the DMDA is not employed for discretization. > >>> > >>> At first, only the residual function is prescribed through > >>> 'TSSetIFunction', and PETSC produces converged results. However, after > >>> providing hand-coded Jacobian through 'TSSetIJacobian', the internal > >>> SNES object fails (residual norm does not change), and TS reports > >>> 'DIVERGED_STEP_REJECTED'. > >>> > >>> I have tried to add the option '-snes_test_jacobian' to see if the > >>> hand-coded jacobian is somewhere wrong, but it shows '||J - > >>> Jfd||_F/||J||_F = 1.07488e-10, ||J - Jfd||_F = 2.14458e-07', > >>> indicating that the hand-coded jacobian is correct. > >>> > >>> Then, I added a monitor for the internal SNES object through > >>> 'SNESMonitorSet', in which the solution vector will be displayed at > >>> each iteration. It is interesting to find that, if the jacobian is not > >>> provided, meaning finite-difference is utilized for jacobian > >>> evaluation internally, the solution vector converges to steady > >>> solution and the SNES residual norm is reduced continuously. However, > >>> it turns out that, as long as the jacobian is provided, the solution > >>> vector will NEVER get changed! So the solution procedure stucked! > >>> > >>> This is quite strange! Hope to get some advice. > >>> PETSC version=3.14.6, program run in serial mode. 
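For reference, the setup being described reduces to the standard TS callback registration; a minimal C fragment is sketched below (FormIFunction, FormIJacobian, J and the user context are placeholder names, not the poster's actual code; the prototypes are the standard TSIFunction/TSIJacobian ones):

PetscErrorCode FormIFunction(TS ts, PetscReal t, Vec U, Vec Udot, Vec F, void *ctx);
PetscErrorCode FormIJacobian(TS ts, PetscReal t, Vec U, Vec Udot, PetscReal shift, Mat J, Mat P, void *ctx);

/* residual only: TS then builds the Jacobian by finite differences, which converges in the runs above */
ierr = TSSetIFunction(ts, NULL, FormIFunction, &user);CHKERRQ(ierr);
/* adding the hand-coded Jacobian: the step after which the SNES residual stagnates in the reported runs */
ierr = TSSetIJacobian(ts, J, J, FormIJacobian, &user);CHKERRQ(ierr);

The correctness check quoted above corresponds to running with '-snes_test_jacobian', and '-snes_test_jacobian_view' additionally prints both matrices so they can be compared entry by entry.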
> >>> > >>> Regards > >>> > >>> Yu Cang > >> > > > From karthikeyan.chockalingam at stfc.ac.uk Fri Oct 29 03:40:06 2021 From: karthikeyan.chockalingam at stfc.ac.uk (Karthikeyan Chockalingam - STFC UKRI) Date: Fri, 29 Oct 2021 08:40:06 +0000 Subject: [petsc-users] Memory usage Message-ID: <47EBBBAF-6603-408C-9B2D-64A7AFDCBE3F@stfc.ac.uk> Hello, I used the flags -memory_view -malloc_log to ran a problem on the same size on 32 cores and 64 cores. I understand what is total, max and min memory usage of the problem is. However I don?t understand the difference between Maximum and Current process memory? I also curious to understand why the memory usage is different on 32 cores (1 node, 2 sockets with 16 cores/socket) and 64 cores (2 nodes) for a problem of the same size? On 32 core: Summary of Memory Usage in PETSc Maximum (over computational time) process memory: total 7.5014e+10 max 2.6396e+09 min 2.0659e+09 Current process memory: total 3.2583e+10 max 1.3562e+09 min 8.3587e+08 On 64 core: Summary of Memory Usage in PETSc Maximum (over computational time) process memory: total 7.9337e+10 max 1.5433e+09 min 1.0319e+09 Current process memory: total 6.4491e+10 max 1.2090e+09 min 8.4000e+08 Kind regards, Karthik. This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. -------------- next part -------------- An HTML attachment was scrubbed... URL: From yuanxi at advancesoft.jp Fri Oct 29 05:11:21 2021 From: yuanxi at advancesoft.jp (=?UTF-8?B?6KKB54WV?=) Date: Fri, 29 Oct 2021 19:11:21 +0900 Subject: [petsc-users] Tutorials test case cannot run in parallel Message-ID: Hi, I have tried the test case ex3f90 in the folder \src\dm\impls\plex\tutorials to run in parallel but found it fails. When I run it in 1 CPU by - mpirun -np 1 ./ex3f90 Everything seems OK. But when run it in 2 CPU by - mpirun -np 2 ./ex3f90 I got the following error message [0]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [0]PETSC ERROR: Object is in wrong state [0]PETSC ERROR: This DMPlex is distributed but its PointSF has no graph set [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting. 
[0]PETSC ERROR: Petsc Development GIT revision: v3.16.0-248-ge617e6467c GIT Date: 2021-10-19 23:11:25 -0500 [0]PETSC ERROR: ./ex3f90 on a named pc-010-088 by Fri Oct 29 18:48:54 2021 [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpiifort --with-fortran-bindings=1 --with-debugging=0 --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.4.0 --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.4.0 --download-metis=1 --download-parmetis=1 --download-cmake --force --download-superlu_dist=1 --download-mumps=1 --download-scalapack=1 --download-hypre=1 --download-ml=1 --with-debugging=yes --prefix=/home/yuanxi [0]PETSC ERROR: #1 DMPlexCheckPointSF() at /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plex.c:8626 [0]PETSC ERROR: #2 DMPlexOrientInterface_Internal() at /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:595 [0]PETSC ERROR: #3 DMPlexInterpolate() at /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:1357 [0]PETSC ERROR: #4 User provided function() at User file:0 Abort(73) on node 0 (rank 0 in comm 16): application called MPI_Abort(MPI_COMM_SELF, 73) - process 0 ------------------------------------------------------------------------------------------------------------------------------------ It fails in calling DMPlexInterpolate. Maybe this program is not considered to be run in parallel. But if I wish to do so, how should I modify it to let it run on multiple CPUs? Much thanks for your help Yuan -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Fri Oct 29 05:16:45 2021 From: mfadams at lbl.gov (Mark Adams) Date: Fri, 29 Oct 2021 06:16:45 -0400 Subject: [petsc-users] Memory usage In-Reply-To: <47EBBBAF-6603-408C-9B2D-64A7AFDCBE3F@stfc.ac.uk> References: <47EBBBAF-6603-408C-9B2D-64A7AFDCBE3F@stfc.ac.uk> Message-ID: On Fri, Oct 29, 2021 at 4:40 AM Karthikeyan Chockalingam - STFC UKRI < karthikeyan.chockalingam at stfc.ac.uk> wrote: > Hello, > > > > I used the flags -memory_view -malloc_log to ran a problem on the same > size on 32 cores and 64 cores. > > > > I understand what is total, max and min memory usage of the problem is. > Total is the sum of all 32/62 processes. Max is, as it says "Maximum (over computational time)" > However I don?t understand the difference between Maximum and Current > process memory? > > > Current memory usage does not include some earlier high water mark in memory usage. > I also curious to understand why the memory usage is different on 32 cores > (1 node, 2 sockets with 16 cores/socket) and 64 cores (2 nodes) for a > problem of the same size? > There is some data and metadata that is stored redundantly on each process. > > > On 32 core: > > > > Summary of Memory Usage in PETSc > > Maximum (over computational time) process memory: total 7.5014e+10 > max 2.6396e+09 min 2.0659e+09 > > Current process memory: > total 3.2583e+10 max 1.3562e+09 min 8.3587e+08 > > > > On 64 core: > > > > Summary of Memory Usage in PETSc > > Maximum (over computational time) process memory: total 7.9337e+10 > max 1.5433e+09 min 1.0319e+09 > > Current process memory: > total 6.4491e+10 max 1.2090e+09 min 8.4000e+08 > > > > Kind regards, > > Karthik. > > This email and any attachments are intended solely for the use of the > named recipients. If you are not the intended recipient you must not use, > disclose, copy or distribute this email or any of its attachments and > should notify the sender immediately and delete this email from your > system. 
UK Research and Innovation (UKRI) has taken every reasonable > precaution to minimise risk of this email or any attachments containing > viruses or malware but the recipient should carry out its own virus and > malware checks before opening the attachments. UKRI does not accept any > liability for any losses or damages which the recipient may sustain due to > presence of any viruses. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Fri Oct 29 06:22:20 2021 From: mfadams at lbl.gov (Mark Adams) Date: Fri, 29 Oct 2021 07:22:20 -0400 Subject: [petsc-users] Tutorials test case cannot run in parallel In-Reply-To: References: Message-ID: This works for me (appended) using an up to date version of PETSc. I would delete the architecture director and reconfigure, and make all, and try again. Next, you seem to be using git. Use the 'main' branch and try again. Mark (base) 07:09 adams/swarm-omp-pc *= ~/Codes/petsc$ cd src/dm/impls/plex/tutorials/ (base) 07:16 adams/swarm-omp-pc *= ~/Codes/petsc/src/dm/impls/plex/tutorials$ make PETSC_DIR=/Users/markadams/Codes/petsc PETSC_ARCH=arch-macosx-gnu-g ex3f90 gfortran-11 -Wl,-bind_at_load -Wl,-multiply_defined,suppress -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs -Wl,-search_paths_first -Wl,-no_compact_unwind -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -I/Users/markadams/Codes/petsc/include -I/Users/markadams/Codes/petsc/arch-macosx-gnu-g/include ex3f90.F90 -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 -lpetsc -lp4est -lsc -llapack -lblas -lhdf5_hl -lhdf5 -lmetis -lz -lstdc++ -ldl -lgcc_s.1 -lgfortran -lquadmath -lm -lquadmath -lstdc++ -ldl -lgcc_s.1 -o ex3f90 (base) 07:16 adams/swarm-omp-pc *= ~/Codes/petsc/src/dm/impls/plex/tutorials$ mpirun -np 2 ./ex3f90 DM Object: testplex 1 MPI processes type: plex testplex in 3 dimensions: 0-cells: 12 1-cells: 20 2-cells: 11 3-cells: 2 Labels: celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) DM Object: testplex 1 MPI processes type: plex testplex in 3 dimensions: 0-cells: 12 1-cells: 20 2-cells: 11 3-cells: 2 Labels: celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 On Fri, Oct 29, 2021 at 6:11 AM ?? wrote: > Hi, > > I have tried the test case ex3f90 in the folder > \src\dm\impls\plex\tutorials to run in parallel but found it fails. When I > run it in 1 CPU by > > - mpirun -np 1 ./ex3f90 > > Everything seems OK. 
But when run it in 2 CPU by > > - mpirun -np 2 ./ex3f90 > > I got the following error message > > [0]PETSC ERROR: --------------------- Error Message > -------------------------------------------------------------- > [0]PETSC ERROR: Object is in wrong state > [0]PETSC ERROR: This DMPlex is distributed but its PointSF has no graph set > [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting. > [0]PETSC ERROR: Petsc Development GIT revision: v3.16.0-248-ge617e6467c > GIT Date: 2021-10-19 23:11:25 -0500 > [0]PETSC ERROR: ./ex3f90 on a named pc-010-088 by Fri Oct 29 18:48:54 > 2021 > [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx > --with-fc=mpiifort --with-fortran-bindings=1 --with-debugging=0 > --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.4.0 > --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.4.0 --download-metis=1 > --download-parmetis=1 --download-cmake --force --download-superlu_dist=1 > --download-mumps=1 --download-scalapack=1 --download-hypre=1 > --download-ml=1 --with-debugging=yes --prefix=/home/yuanxi > [0]PETSC ERROR: #1 DMPlexCheckPointSF() at > /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plex.c:8626 > [0]PETSC ERROR: #2 DMPlexOrientInterface_Internal() at > /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:595 > [0]PETSC ERROR: #3 DMPlexInterpolate() at > /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:1357 > [0]PETSC ERROR: #4 User provided function() at User file:0 > Abort(73) on node 0 (rank 0 in comm 16): application called > MPI_Abort(MPI_COMM_SELF, 73) - process 0 > > ------------------------------------------------------------------------------------------------------------------------------------ > > It fails in calling DMPlexInterpolate. Maybe this program is not > considered to be run in parallel. But if I wish to do so, how should I > modify it to let it run on multiple CPUs? > > Much thanks for your help > > Yuan > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hongzhang at anl.gov Fri Oct 29 09:05:52 2021 From: hongzhang at anl.gov (Zhang, Hong) Date: Fri, 29 Oct 2021 14:05:52 +0000 Subject: [petsc-users] Strange behavior of TS after setting hand-coded Jacobian In-Reply-To: References: <0C6ACBF3-F457-4BFD-AD19-8C455444748F@petsc.dev> <9CC15214-4601-4554-808F-C3E96DC3D34A@petsc.dev> Message-ID: One way to avoid the zero element in Jacobian is to exclude the boundary point from the solution vector. I often do this for Dirichlet boundary conditions since the value at the boundary is given directly and does not need to be taken as a degree of freedom. Hong (Mr.) On Oct 28, 2021, at 9:49 PM, ?? > wrote: Thanks for your careful inspection and thoughtful suggestions. > finite differencing may put a small non-zero value in that location due to numerical round-off I think your explanation is reasonable. This numerical round-off may somehow help to avoid this pivot issue. The structure of my jacobian matrix looks like this (generated by '-mat_view draw'): Analytically, the first diagonal element of the jacobian is indeed 0, as its corresponding residual function is solely determined from boundary condition of another variable. This seems a little bit wired but is mathematically well-posed. For more description about the background physics, please refer to attached PDF file for more detailed explanation on the discretization and boundary conditions. 
Actually, the jacobian matrix is not singular, but I do believe this numerical difficulty is caused by the zero-element in diagonal. In this regard, I've performed some trial and test. It seems that several methods have been worked out for this pivot issue: a) By setting '-pc_type svd', PETSC does not panic any more with my hand-coded jacobian, and converged solution is obtained. Efficiency is also preserved. b) By setting '-pc_type none', converged solution is also obtained, but it takes too many KSP iterations to converge per SNES step (usually hundreds), making the overall solution procedure very slow. Do you think these methods really solved this kind of pivot issue? Not by chance like the numerical round-off in finite difference previously. Regards Yu Cang Barry Smith > ?2021?10?27??? ??9:43??? > > > You can run with -ksp_error_if_not_converged to get it to stop as soon as a linear solve fails to help track down the exact breaking point. > > > The problem under consideration contains an eigen-value to be solved, > > making the first diagonal element of the jacobian matrix being zero. > > From these outputs, it seems that the PC failed to factorize, which is > > caused by this 0 diagonal element. But I'm wondering why it works > > with jacobian matrix generated by finite-difference? > > Presumably your "exact" Jacobian puts a zero on the diagonal while the finite differencing may put a small non-zero value in that location due to numerical round-off. In that case even if the factorization succeeds it may produce an inaccurate solution if the value on the diagonal is very small. > > If your matrix is singular or cannot be factored with LU then you need to use a different solver for the linear system that will be robust to the zero on the diagonal. What is the structure of your Jacobian? (The analytic form). > > Barry > > > > On Oct 27, 2021, at 1:47 AM, ?? > wrote: > > > > Thanks for your kind reply. > > > > Several comparison tests have been performed. Attached are execution > > output files. Below are corresponding descriptions. > > > > good.txt -- Run without hand-coded jacobian, solution converged, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason'; > > jac1.txt -- Run with hand-coded jacobian, does not converge, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian'; > > jac2.txt -- Run with hand-coded jacobian, does not converge, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian > > -ksp_view'; > > jac3.txt -- Run with hand-coded jacobian, does not converge, with > > option '-ts_monitor -snes_monitor -snes_converged_reason > > -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian > > -ksp_view -ts_max_snes_failures -1 '; > > > > The problem under consideration contains an eigen-value to be solved, > > making the first diagonal element of the jacobian matrix being zero. > > From these outputs, it seems that the PC failed to factorize, which is > > caused by this 0 diagonal element. But I'm wondering why it works > > with jacobian matrix generated by finite-difference? Would employing > > DMDA for discretization be helpful? > > > > Regards > > > > Yu Cang > > > > Barry Smith > ?2021?10?25??? ??10:50??? 
> >> > >> > >> It is definitely unexpected that -snes_test_jacobian verifies the Jacobian as matching but the solve process is completely different. > >> > >> Please run with -snes_monitor -snes_converged_reason -ksp_monitor_true_residual -ksp_converged_reason -snes_test_jacobian and send all the output > >> > >> Barry > >> > >> > >>> On Oct 25, 2021, at 9:53 AM, ?? > wrote: > >>> > >>> I'm using TS to solve a set of DAE, which originates from a > >>> one-dimensional problem. The grid points are uniformly distributed. > >>> For simplicity, the DMDA is not employed for discretization. > >>> > >>> At first, only the residual function is prescribed through > >>> 'TSSetIFunction', and PETSC produces converged results. However, after > >>> providing hand-coded Jacobian through 'TSSetIJacobian', the internal > >>> SNES object fails (residual norm does not change), and TS reports > >>> 'DIVERGED_STEP_REJECTED'. > >>> > >>> I have tried to add the option '-snes_test_jacobian' to see if the > >>> hand-coded jacobian is somewhere wrong, but it shows '||J - > >>> Jfd||_F/||J||_F = 1.07488e-10, ||J - Jfd||_F = 2.14458e-07', > >>> indicating that the hand-coded jacobian is correct. > >>> > >>> Then, I added a monitor for the internal SNES object through > >>> 'SNESMonitorSet', in which the solution vector will be displayed at > >>> each iteration. It is interesting to find that, if the jacobian is not > >>> provided, meaning finite-difference is utilized for jacobian > >>> evaluation internally, the solution vector converges to steady > >>> solution and the SNES residual norm is reduced continuously. However, > >>> it turns out that, as long as the jacobian is provided, the solution > >>> vector will NEVER get changed! So the solution procedure stucked! > >>> > >>> This is quite strange! Hope to get some advice. > >>> PETSC version=3.14.6, program run in serial mode. > >>> > >>> Regards > >>> > >>> Yu Cang > >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yuanxi at advancesoft.jp Fri Oct 29 21:41:30 2021 From: yuanxi at advancesoft.jp (=?UTF-8?B?6KKB54WV?=) Date: Sat, 30 Oct 2021 11:41:30 +0900 Subject: [petsc-users] Tutorials test case cannot run in parallel In-Reply-To: References: Message-ID: Thanks, Mark. I do what you suggested but nothing changes. Besides, from your compile history and result, - you use gfortran with no MPI library, not mpif90 - two CPUs gives exactly the same result - The first line of the DMView output should be "DM Object: testplex 2 MPI processes", not "DM Object: testplex 1 MPI processes", when you use 2CPUs It seems like you did not use MPI but just two CPUs do exactly the same thing.. Best regards, Yuan 2021?10?29?(?) 20:22 Mark Adams : > This works for me (appended) using an up to date version of PETSc. > > I would delete the architecture director and reconfigure, and make all, > and try again. > > Next, you seem to be using git. Use the 'main' branch and try again. 
> > Mark > > (base) 07:09 adams/swarm-omp-pc *= ~/Codes/petsc$ cd > src/dm/impls/plex/tutorials/ > (base) 07:16 adams/swarm-omp-pc *= > ~/Codes/petsc/src/dm/impls/plex/tutorials$ make > PETSC_DIR=/Users/markadams/Codes/petsc PETSC_ARCH=arch-macosx-gnu-g ex3f90 > gfortran-11 -Wl,-bind_at_load -Wl,-multiply_defined,suppress > -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs > -Wl,-search_paths_first -Wl,-no_compact_unwind -fPIC -Wall > -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -fPIC -Wall > -ffree-line-length-0 -Wno-unused-dummy-argument -g -O > -I/Users/markadams/Codes/petsc/include > -I/Users/markadams/Codes/petsc/arch-macosx-gnu-g/include ex3f90.F90 > -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib > -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib > -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib > -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib > -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 > -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 > -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 > -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 -lpetsc -lp4est -lsc -llapack > -lblas -lhdf5_hl -lhdf5 -lmetis -lz -lstdc++ -ldl -lgcc_s.1 -lgfortran > -lquadmath -lm -lquadmath -lstdc++ -ldl -lgcc_s.1 -o ex3f90 > (base) 07:16 adams/swarm-omp-pc *= > ~/Codes/petsc/src/dm/impls/plex/tutorials$ mpirun -np 2 ./ex3f90 > DM Object: testplex 1 MPI processes > type: plex > testplex in 3 dimensions: > 0-cells: 12 > 1-cells: 20 > 2-cells: 11 > 3-cells: 2 > Labels: > celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) > depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) > DM Object: testplex 1 MPI processes > type: plex > testplex in 3 dimensions: > 0-cells: 12 > 1-cells: 20 > 2-cells: 11 > 3-cells: 2 > Labels: > celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) > depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) > cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 > cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 > cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 > cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 > > On Fri, Oct 29, 2021 at 6:11 AM ?? wrote: > >> Hi, >> >> I have tried the test case ex3f90 in the folder >> \src\dm\impls\plex\tutorials to run in parallel but found it fails. When I >> run it in 1 CPU by >> >> - mpirun -np 1 ./ex3f90 >> >> Everything seems OK. But when run it in 2 CPU by >> >> - mpirun -np 2 ./ex3f90 >> >> I got the following error message >> >> [0]PETSC ERROR: --------------------- Error Message >> -------------------------------------------------------------- >> [0]PETSC ERROR: Object is in wrong state >> [0]PETSC ERROR: This DMPlex is distributed but its PointSF has no graph >> set >> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting. 
>> [0]PETSC ERROR: Petsc Development GIT revision: v3.16.0-248-ge617e6467c >> GIT Date: 2021-10-19 23:11:25 -0500 >> [0]PETSC ERROR: ./ex3f90 on a named pc-010-088 by Fri Oct 29 18:48:54 >> 2021 >> [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx >> --with-fc=mpiifort --with-fortran-bindings=1 --with-debugging=0 >> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.4.0 >> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.4.0 --download-metis=1 >> --download-parmetis=1 --download-cmake --force --download-superlu_dist=1 >> --download-mumps=1 --download-scalapack=1 --download-hypre=1 >> --download-ml=1 --with-debugging=yes --prefix=/home/yuanxi >> [0]PETSC ERROR: #1 DMPlexCheckPointSF() at >> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plex.c:8626 >> [0]PETSC ERROR: #2 DMPlexOrientInterface_Internal() at >> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:595 >> [0]PETSC ERROR: #3 DMPlexInterpolate() at >> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:1357 >> [0]PETSC ERROR: #4 User provided function() at User file:0 >> Abort(73) on node 0 (rank 0 in comm 16): application called >> MPI_Abort(MPI_COMM_SELF, 73) - process 0 >> >> ------------------------------------------------------------------------------------------------------------------------------------ >> >> It fails in calling DMPlexInterpolate. Maybe this program is not >> considered to be run in parallel. But if I wish to do so, how should I >> modify it to let it run on multiple CPUs? >> >> Much thanks for your help >> >> Yuan >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Sat Oct 30 07:29:00 2021 From: mfadams at lbl.gov (Mark Adams) Date: Sat, 30 Oct 2021 08:29:00 -0400 Subject: [petsc-users] Tutorials test case cannot run in parallel In-Reply-To: References: Message-ID: 08:27 adams/pcksp-batch-kokkos *= summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ make PETSC_ARCH=arch-summit-opt-gnu-kokkos-cuda ex3f90 mpifort -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -I/gpfs/alpine/csc314/scratch/adams/petsc/include -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/include -I/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/include -I/sw/summit/cuda/11.0.3/include ex3f90.F90 -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib -Wl,-rpath,/sw/summit/cuda/11.0.3/lib64 -L/sw/summit/cuda/11.0.3/lib64 -L/sw/summit/cuda/11.0.3/lib64/stubs -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib 
-Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib -lpetsc -lkokkoskernels -lkokkoscontainers -lkokkoscore -lp4est -lsc -lblas -llapack -lhdf5_hl -lhdf5 -lm -lz -lcudart -lcufft -lcublas -lcusparse -lcusolver -lcurand -lcuda -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempif08 -lmpi_ibm_usempi_ignore_tkr -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread -lquadmath -lstdc++ -ldl -o ex3f90 08:27 adams/pcksp-batch-kokkos *= summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ jsrun -n 2 -g 1 ./ex3f90 DM Object: testplex 2 MPI processes type: plex testplex in 3 dimensions: 0-cells: 12 12 1-cells: 20 20 2-cells: 11 11 3-cells: 2 2 Labels: celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 08:28 adams/pcksp-batch-kokkos *= summit:/gpfs/alpine/csc314/scratch/adams/pets On Fri, Oct 29, 2021 at 10:41 PM ?? wrote: > Thanks, Mark. > > I do what you suggested but nothing changes. Besides, from your compile > history and result, > > - you use gfortran with no MPI library, not mpif90 > - two CPUs gives exactly the same result > - The first line of the DMView output should be "DM Object: testplex 2 > MPI processes", not "DM Object: testplex 1 MPI processes", when you use > 2CPUs > > It seems like you did not use MPI but just two CPUs do exactly the same > thing.. > > Best regards, > > Yuan > > > 2021?10?29?(?) 20:22 Mark Adams : > >> This works for me (appended) using an up to date version of PETSc. >> >> I would delete the architecture director and reconfigure, and make all, >> and try again. >> >> Next, you seem to be using git. Use the 'main' branch and try again. 
>> >> Mark >> >> (base) 07:09 adams/swarm-omp-pc *= ~/Codes/petsc$ cd >> src/dm/impls/plex/tutorials/ >> (base) 07:16 adams/swarm-omp-pc *= >> ~/Codes/petsc/src/dm/impls/plex/tutorials$ make >> PETSC_DIR=/Users/markadams/Codes/petsc PETSC_ARCH=arch-macosx-gnu-g ex3f90 >> gfortran-11 -Wl,-bind_at_load -Wl,-multiply_defined,suppress >> -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs >> -Wl,-search_paths_first -Wl,-no_compact_unwind -fPIC -Wall >> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -fPIC -Wall >> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >> -I/Users/markadams/Codes/petsc/include >> -I/Users/markadams/Codes/petsc/arch-macosx-gnu-g/include ex3f90.F90 >> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 >> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 -lpetsc -lp4est -lsc -llapack >> -lblas -lhdf5_hl -lhdf5 -lmetis -lz -lstdc++ -ldl -lgcc_s.1 -lgfortran >> -lquadmath -lm -lquadmath -lstdc++ -ldl -lgcc_s.1 -o ex3f90 >> (base) 07:16 adams/swarm-omp-pc *= >> ~/Codes/petsc/src/dm/impls/plex/tutorials$ mpirun -np 2 ./ex3f90 >> DM Object: testplex 1 MPI processes >> type: plex >> testplex in 3 dimensions: >> 0-cells: 12 >> 1-cells: 20 >> 2-cells: 11 >> 3-cells: 2 >> Labels: >> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >> DM Object: testplex 1 MPI processes >> type: plex >> testplex in 3 dimensions: >> 0-cells: 12 >> 1-cells: 20 >> 2-cells: 11 >> 3-cells: 2 >> Labels: >> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >> >> On Fri, Oct 29, 2021 at 6:11 AM ?? wrote: >> >>> Hi, >>> >>> I have tried the test case ex3f90 in the folder >>> \src\dm\impls\plex\tutorials to run in parallel but found it fails. When I >>> run it in 1 CPU by >>> >>> - mpirun -np 1 ./ex3f90 >>> >>> Everything seems OK. But when run it in 2 CPU by >>> >>> - mpirun -np 2 ./ex3f90 >>> >>> I got the following error message >>> >>> [0]PETSC ERROR: --------------------- Error Message >>> -------------------------------------------------------------- >>> [0]PETSC ERROR: Object is in wrong state >>> [0]PETSC ERROR: This DMPlex is distributed but its PointSF has no graph >>> set >>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting. 
>>> [0]PETSC ERROR: Petsc Development GIT revision: v3.16.0-248-ge617e6467c >>> GIT Date: 2021-10-19 23:11:25 -0500 >>> [0]PETSC ERROR: ./ex3f90 on a named pc-010-088 by Fri Oct 29 18:48:54 >>> 2021 >>> [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx >>> --with-fc=mpiifort --with-fortran-bindings=1 --with-debugging=0 >>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.4.0 >>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.4.0 --download-metis=1 >>> --download-parmetis=1 --download-cmake --force --download-superlu_dist=1 >>> --download-mumps=1 --download-scalapack=1 --download-hypre=1 >>> --download-ml=1 --with-debugging=yes --prefix=/home/yuanxi >>> [0]PETSC ERROR: #1 DMPlexCheckPointSF() at >>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plex.c:8626 >>> [0]PETSC ERROR: #2 DMPlexOrientInterface_Internal() at >>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:595 >>> [0]PETSC ERROR: #3 DMPlexInterpolate() at >>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:1357 >>> [0]PETSC ERROR: #4 User provided function() at User file:0 >>> Abort(73) on node 0 (rank 0 in comm 16): application called >>> MPI_Abort(MPI_COMM_SELF, 73) - process 0 >>> >>> ------------------------------------------------------------------------------------------------------------------------------------ >>> >>> It fails in calling DMPlexInterpolate. Maybe this program is not >>> considered to be run in parallel. But if I wish to do so, how should I >>> modify it to let it run on multiple CPUs? >>> >>> Much thanks for your help >>> >>> Yuan >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Sat Oct 30 07:51:27 2021 From: mfadams at lbl.gov (Mark Adams) Date: Sat, 30 Oct 2021 08:51:27 -0400 Subject: [petsc-users] Tutorials test case cannot run in parallel In-Reply-To: References: Message-ID: Ah, I can reproduce this error with debugging turned on. This test is not a parallel test, but it does not say that serial is a requirement. So there is a problem here. Anyone? 
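For anyone who wants to reproduce this, the steps are essentially the ones already shown in the thread, using a debug build such as the one in the original report (its configure line ends with --with-debugging=yes); paths and arch names are whatever the local build uses:

cd $PETSC_DIR/src/dm/impls/plex/tutorials
make ex3f90
mpirun -np 1 ./ex3f90    # runs cleanly
mpirun -np 2 ./ex3f90    # fails with "This DMPlex is distributed but its PointSF has no graph set"

which matches the behaviour reported at the start of the thread.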
On Sat, Oct 30, 2021 at 8:29 AM Mark Adams wrote: > 08:27 adams/pcksp-batch-kokkos *= > summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ > make PETSC_ARCH=arch-summit-opt-gnu-kokkos-cuda ex3f90 > mpifort -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O > -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O > -I/gpfs/alpine/csc314/scratch/adams/petsc/include > -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/include > -I/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/include > -I/sw/summit/cuda/11.0.3/include ex3f90.F90 > -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib > -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib > -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib > -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib > -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 > -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib > -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib > -Wl,-rpath,/sw/summit/cuda/11.0.3/lib64 -L/sw/summit/cuda/11.0.3/lib64 > -L/sw/summit/cuda/11.0.3/lib64/stubs > -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib > -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib > -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 > -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 > -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc > -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc > -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 > -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 > -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 > -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib > -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib > -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib > -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib -lpetsc > -lkokkoskernels -lkokkoscontainers -lkokkoscore -lp4est -lsc -lblas > -llapack -lhdf5_hl -lhdf5 -lm -lz -lcudart -lcufft -lcublas -lcusparse > -lcusolver -lcurand -lcuda -lstdc++ -ldl -lmpiprofilesupport > -lmpi_ibm_usempif08 -lmpi_ibm_usempi_ignore_tkr -lmpi_ibm_mpifh -lmpi_ibm > -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread -lquadmath > -lstdc++ -ldl -o ex3f90 > 08:27 adams/pcksp-batch-kokkos *= > summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ > jsrun -n 2 -g 1 ./ex3f90 > DM Object: testplex 2 MPI processes > type: plex > testplex in 3 dimensions: > 0-cells: 12 12 > 1-cells: 20 20 > 2-cells: 11 11 > 3-cells: 2 2 > Labels: > celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) > depth: 4 strata with value/size (0 (12), 1 
(20), 2 (11), 3 (2)) > cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 > cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 > cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 > cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 > 08:28 adams/pcksp-batch-kokkos *= > summit:/gpfs/alpine/csc314/scratch/adams/pets > > On Fri, Oct 29, 2021 at 10:41 PM ?? wrote: > >> Thanks, Mark. >> >> I do what you suggested but nothing changes. Besides, from your compile >> history and result, >> >> - you use gfortran with no MPI library, not mpif90 >> - two CPUs gives exactly the same result >> - The first line of the DMView output should be "DM Object: testplex 2 >> MPI processes", not "DM Object: testplex 1 MPI processes", when you use >> 2CPUs >> >> It seems like you did not use MPI but just two CPUs do exactly the same >> thing.. >> >> Best regards, >> >> Yuan >> >> >> 2021?10?29?(?) 20:22 Mark Adams : >> >>> This works for me (appended) using an up to date version of PETSc. >>> >>> I would delete the architecture director and reconfigure, and make all, >>> and try again. >>> >>> Next, you seem to be using git. Use the 'main' branch and try again. >>> >>> Mark >>> >>> (base) 07:09 adams/swarm-omp-pc *= ~/Codes/petsc$ cd >>> src/dm/impls/plex/tutorials/ >>> (base) 07:16 adams/swarm-omp-pc *= >>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ make >>> PETSC_DIR=/Users/markadams/Codes/petsc PETSC_ARCH=arch-macosx-gnu-g ex3f90 >>> gfortran-11 -Wl,-bind_at_load -Wl,-multiply_defined,suppress >>> -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs >>> -Wl,-search_paths_first -Wl,-no_compact_unwind -fPIC -Wall >>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -fPIC -Wall >>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >>> -I/Users/markadams/Codes/petsc/include >>> -I/Users/markadams/Codes/petsc/arch-macosx-gnu-g/include ex3f90.F90 >>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 >>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 -lpetsc -lp4est -lsc -llapack >>> -lblas -lhdf5_hl -lhdf5 -lmetis -lz -lstdc++ -ldl -lgcc_s.1 -lgfortran >>> -lquadmath -lm -lquadmath -lstdc++ -ldl -lgcc_s.1 -o ex3f90 >>> (base) 07:16 adams/swarm-omp-pc *= >>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ mpirun -np 2 ./ex3f90 >>> DM Object: testplex 1 MPI processes >>> type: plex >>> testplex in 3 dimensions: >>> 0-cells: 12 >>> 1-cells: 20 >>> 2-cells: 11 >>> 3-cells: 2 >>> Labels: >>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>> DM Object: testplex 1 MPI processes >>> type: plex >>> testplex in 3 dimensions: >>> 0-cells: 12 >>> 1-cells: 20 >>> 2-cells: 11 >>> 3-cells: 2 >>> Labels: >>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>> >>> On Fri, Oct 29, 2021 at 6:11 AM ?? 
wrote: >>> >>>> Hi, >>>> >>>> I have tried the test case ex3f90 in the folder >>>> \src\dm\impls\plex\tutorials to run in parallel but found it fails. When I >>>> run it in 1 CPU by >>>> >>>> - mpirun -np 1 ./ex3f90 >>>> >>>> Everything seems OK. But when run it in 2 CPU by >>>> >>>> - mpirun -np 2 ./ex3f90 >>>> >>>> I got the following error message >>>> >>>> [0]PETSC ERROR: --------------------- Error Message >>>> -------------------------------------------------------------- >>>> [0]PETSC ERROR: Object is in wrong state >>>> [0]PETSC ERROR: This DMPlex is distributed but its PointSF has no graph >>>> set >>>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble >>>> shooting. >>>> [0]PETSC ERROR: Petsc Development GIT revision: v3.16.0-248-ge617e6467c >>>> GIT Date: 2021-10-19 23:11:25 -0500 >>>> [0]PETSC ERROR: ./ex3f90 on a named pc-010-088 by Fri Oct 29 18:48:54 >>>> 2021 >>>> [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx >>>> --with-fc=mpiifort --with-fortran-bindings=1 --with-debugging=0 >>>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.4.0 >>>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.4.0 --download-metis=1 >>>> --download-parmetis=1 --download-cmake --force --download-superlu_dist=1 >>>> --download-mumps=1 --download-scalapack=1 --download-hypre=1 >>>> --download-ml=1 --with-debugging=yes --prefix=/home/yuanxi >>>> [0]PETSC ERROR: #1 DMPlexCheckPointSF() at >>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plex.c:8626 >>>> [0]PETSC ERROR: #2 DMPlexOrientInterface_Internal() at >>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:595 >>>> [0]PETSC ERROR: #3 DMPlexInterpolate() at >>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:1357 >>>> [0]PETSC ERROR: #4 User provided function() at User file:0 >>>> Abort(73) on node 0 (rank 0 in comm 16): application called >>>> MPI_Abort(MPI_COMM_SELF, 73) - process 0 >>>> >>>> ------------------------------------------------------------------------------------------------------------------------------------ >>>> >>>> It fails in calling DMPlexInterpolate. Maybe this program is not >>>> considered to be run in parallel. But if I wish to do so, how should I >>>> modify it to let it run on multiple CPUs? >>>> >>>> Much thanks for your help >>>> >>>> Yuan >>>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From yuanxi at advancesoft.jp Sat Oct 30 08:37:56 2021 From: yuanxi at advancesoft.jp (=?UTF-8?B?6KKB54WV?=) Date: Sat, 30 Oct 2021 22:37:56 +0900 Subject: [petsc-users] Tutorials test case cannot run in parallel In-Reply-To: References: Message-ID: Thank you for your reply. 
I have solved the problem by modifying ---------------------------------------------------- call DMPlexCreateFromDAG(dm, depth, numPoints, coneSize, cones,coneOrientations, vertexCoords, ierr);CHKERRA(ierr) ---------------------------------------------------- into ----------------------------------------------------- numPoints1 = [0, 0, 0, 0] if (rank == 0) then call DMPlexCreateFromDAG(dm, depth, numPoints, coneSize, cones,coneOrientations, vertexCoords, ierr);CHKERRA(ierr) else call DMPlexCreateFromDAG(dm, 3, numPoints1, PETSC_NULL_INTEGER, PETSC_NULL_INTEGER,PETSC_NULL_INTEGER, PETSC_NULL_REAL, ierr) endif ---------------------------------------------------- The result obtained as follows DM Object: testplex 2 MPI processes type: plex testplex in 3 dimensions: 0-cells: 12 0 1-cells: 20 0 2-cells: 11 0 3-cells: 2 0 Labels: celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = RANK 0 PID 15428 RUNNING AT DESKTOP-9ITFSBM = KILLED BY SIGNAL: 9 (Killed) =================================================================================== There is still problem left. I like it relevent 2021?10?30?(?) 21:51 Mark Adams : > Ah, I can reproduce this error with debugging turned on. > This test is not a parallel test, but it does not say that serial is a > requirement. > So there is a problem here. > Anyone? > > On Sat, Oct 30, 2021 at 8:29 AM Mark Adams wrote: > >> 08:27 adams/pcksp-batch-kokkos *= >> summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ >> make PETSC_ARCH=arch-summit-opt-gnu-kokkos-cuda ex3f90 >> mpifort -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >> -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >> -I/gpfs/alpine/csc314/scratch/adams/petsc/include >> -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/include >> -I/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/include >> -I/sw/summit/cuda/11.0.3/include ex3f90.F90 >> -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >> -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >> -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >> -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 >> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib >> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib >> -Wl,-rpath,/sw/summit/cuda/11.0.3/lib64 -L/sw/summit/cuda/11.0.3/lib64 >> -L/sw/summit/cuda/11.0.3/lib64/stubs >> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib >> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib >> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 >> 
-L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 >> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc >> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc >> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 >> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 >> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 >> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib >> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib >> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib >> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib -lpetsc >> -lkokkoskernels -lkokkoscontainers -lkokkoscore -lp4est -lsc -lblas >> -llapack -lhdf5_hl -lhdf5 -lm -lz -lcudart -lcufft -lcublas -lcusparse >> -lcusolver -lcurand -lcuda -lstdc++ -ldl -lmpiprofilesupport >> -lmpi_ibm_usempif08 -lmpi_ibm_usempi_ignore_tkr -lmpi_ibm_mpifh -lmpi_ibm >> -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread -lquadmath >> -lstdc++ -ldl -o ex3f90 >> 08:27 adams/pcksp-batch-kokkos *= >> summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ >> jsrun -n 2 -g 1 ./ex3f90 >> DM Object: testplex 2 MPI processes >> type: plex >> testplex in 3 dimensions: >> 0-cells: 12 12 >> 1-cells: 20 20 >> 2-cells: 11 11 >> 3-cells: 2 2 >> Labels: >> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >> 08:28 adams/pcksp-batch-kokkos *= >> summit:/gpfs/alpine/csc314/scratch/adams/pets >> >> On Fri, Oct 29, 2021 at 10:41 PM ?? wrote: >> >>> Thanks, Mark. >>> >>> I do what you suggested but nothing changes. Besides, from your compile >>> history and result, >>> >>> - you use gfortran with no MPI library, not mpif90 >>> - two CPUs gives exactly the same result >>> - The first line of the DMView output should be "DM Object: testplex 2 >>> MPI processes", not "DM Object: testplex 1 MPI processes", when you use >>> 2CPUs >>> >>> It seems like you did not use MPI but just two CPUs do exactly the same >>> thing.. >>> >>> Best regards, >>> >>> Yuan >>> >>> >>> 2021?10?29?(?) 20:22 Mark Adams : >>> >>>> This works for me (appended) using an up to date version of PETSc. >>>> >>>> I would delete the architecture director and reconfigure, and make all, >>>> and try again. >>>> >>>> Next, you seem to be using git. Use the 'main' branch and try again. 
>>>> >>>> Mark >>>> >>>> (base) 07:09 adams/swarm-omp-pc *= ~/Codes/petsc$ cd >>>> src/dm/impls/plex/tutorials/ >>>> (base) 07:16 adams/swarm-omp-pc *= >>>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ make >>>> PETSC_DIR=/Users/markadams/Codes/petsc PETSC_ARCH=arch-macosx-gnu-g ex3f90 >>>> gfortran-11 -Wl,-bind_at_load -Wl,-multiply_defined,suppress >>>> -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs >>>> -Wl,-search_paths_first -Wl,-no_compact_unwind -fPIC -Wall >>>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -fPIC -Wall >>>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >>>> -I/Users/markadams/Codes/petsc/include >>>> -I/Users/markadams/Codes/petsc/arch-macosx-gnu-g/include ex3f90.F90 >>>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 >>>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 -lpetsc -lp4est -lsc -llapack >>>> -lblas -lhdf5_hl -lhdf5 -lmetis -lz -lstdc++ -ldl -lgcc_s.1 -lgfortran >>>> -lquadmath -lm -lquadmath -lstdc++ -ldl -lgcc_s.1 -o ex3f90 >>>> (base) 07:16 adams/swarm-omp-pc *= >>>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ mpirun -np 2 ./ex3f90 >>>> DM Object: testplex 1 MPI processes >>>> type: plex >>>> testplex in 3 dimensions: >>>> 0-cells: 12 >>>> 1-cells: 20 >>>> 2-cells: 11 >>>> 3-cells: 2 >>>> Labels: >>>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>> DM Object: testplex 1 MPI processes >>>> type: plex >>>> testplex in 3 dimensions: >>>> 0-cells: 12 >>>> 1-cells: 20 >>>> 2-cells: 11 >>>> 3-cells: 2 >>>> Labels: >>>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>> >>>> On Fri, Oct 29, 2021 at 6:11 AM ?? wrote: >>>> >>>>> Hi, >>>>> >>>>> I have tried the test case ex3f90 in the folder >>>>> \src\dm\impls\plex\tutorials to run in parallel but found it fails. When I >>>>> run it in 1 CPU by >>>>> >>>>> - mpirun -np 1 ./ex3f90 >>>>> >>>>> Everything seems OK. But when run it in 2 CPU by >>>>> >>>>> - mpirun -np 2 ./ex3f90 >>>>> >>>>> I got the following error message >>>>> >>>>> [0]PETSC ERROR: --------------------- Error Message >>>>> -------------------------------------------------------------- >>>>> [0]PETSC ERROR: Object is in wrong state >>>>> [0]PETSC ERROR: This DMPlex is distributed but its PointSF has no >>>>> graph set >>>>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble >>>>> shooting. 
>>>>> [0]PETSC ERROR: Petsc Development GIT revision: >>>>> v3.16.0-248-ge617e6467c GIT Date: 2021-10-19 23:11:25 -0500 >>>>> [0]PETSC ERROR: ./ex3f90 on a named pc-010-088 by Fri Oct 29 >>>>> 18:48:54 2021 >>>>> [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx >>>>> --with-fc=mpiifort --with-fortran-bindings=1 --with-debugging=0 >>>>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.4.0 >>>>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.4.0 --download-metis=1 >>>>> --download-parmetis=1 --download-cmake --force --download-superlu_dist=1 >>>>> --download-mumps=1 --download-scalapack=1 --download-hypre=1 >>>>> --download-ml=1 --with-debugging=yes --prefix=/home/yuanxi >>>>> [0]PETSC ERROR: #1 DMPlexCheckPointSF() at >>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plex.c:8626 >>>>> [0]PETSC ERROR: #2 DMPlexOrientInterface_Internal() at >>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:595 >>>>> [0]PETSC ERROR: #3 DMPlexInterpolate() at >>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:1357 >>>>> [0]PETSC ERROR: #4 User provided function() at User file:0 >>>>> Abort(73) on node 0 (rank 0 in comm 16): application called >>>>> MPI_Abort(MPI_COMM_SELF, 73) - process 0 >>>>> >>>>> ------------------------------------------------------------------------------------------------------------------------------------ >>>>> >>>>> It fails in calling DMPlexInterpolate. Maybe this program is not >>>>> considered to be run in parallel. But if I wish to do so, how should I >>>>> modify it to let it run on multiple CPUs? >>>>> >>>>> Much thanks for your help >>>>> >>>>> Yuan >>>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Sat Oct 30 09:40:08 2021 From: mfadams at lbl.gov (Mark Adams) Date: Sat, 30 Oct 2021 10:40:08 -0400 Subject: [petsc-users] Tutorials test case cannot run in parallel In-Reply-To: References: Message-ID: Great. Thank you. Could you please send a 'git diff' if that is available? And we can take care of it. On Sat, Oct 30, 2021 at 9:38 AM ?? wrote: > Thank you for your reply. 
> > I have solved the problem by modifying > ---------------------------------------------------- > call DMPlexCreateFromDAG(dm, depth, numPoints, coneSize, > cones,coneOrientations, vertexCoords, ierr);CHKERRA(ierr) > ---------------------------------------------------- > into > ----------------------------------------------------- > numPoints1 = [0, 0, 0, 0] > if (rank == 0) then > call DMPlexCreateFromDAG(dm, depth, numPoints, coneSize, > cones,coneOrientations, vertexCoords, ierr);CHKERRA(ierr) > else > call DMPlexCreateFromDAG(dm, 3, numPoints1, PETSC_NULL_INTEGER, > PETSC_NULL_INTEGER,PETSC_NULL_INTEGER, PETSC_NULL_REAL, ierr) > endif > ---------------------------------------------------- > > The result obtained as follows > > DM Object: testplex 2 MPI processes > type: plex > testplex in 3 dimensions: > 0-cells: 12 0 > 1-cells: 20 0 > 2-cells: 11 0 > 3-cells: 2 0 > Labels: > celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) > depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) > cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 > cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = RANK 0 PID 15428 RUNNING AT DESKTOP-9ITFSBM > = KILLED BY SIGNAL: 9 (Killed) > > =================================================================================== > > There is still problem left. I like it relevent > > 2021?10?30?(?) 21:51 Mark Adams : > >> Ah, I can reproduce this error with debugging turned on. >> This test is not a parallel test, but it does not say that serial is a >> requirement. >> So there is a problem here. >> Anyone? >> >> On Sat, Oct 30, 2021 at 8:29 AM Mark Adams wrote: >> >>> 08:27 adams/pcksp-batch-kokkos *= >>> summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ >>> make PETSC_ARCH=arch-summit-opt-gnu-kokkos-cuda ex3f90 >>> mpifort -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g >>> -O -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >>> -I/gpfs/alpine/csc314/scratch/adams/petsc/include >>> -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/include >>> -I/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/include >>> -I/sw/summit/cuda/11.0.3/include ex3f90.F90 >>> -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>> -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>> -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>> -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 >>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib >>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib >>> -Wl,-rpath,/sw/summit/cuda/11.0.3/lib64 -L/sw/summit/cuda/11.0.3/lib64 >>> -L/sw/summit/cuda/11.0.3/lib64/stubs >>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib >>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib >>> 
-Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 >>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 >>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc >>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc >>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 >>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 >>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 >>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib >>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib >>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib >>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib -lpetsc >>> -lkokkoskernels -lkokkoscontainers -lkokkoscore -lp4est -lsc -lblas >>> -llapack -lhdf5_hl -lhdf5 -lm -lz -lcudart -lcufft -lcublas -lcusparse >>> -lcusolver -lcurand -lcuda -lstdc++ -ldl -lmpiprofilesupport >>> -lmpi_ibm_usempif08 -lmpi_ibm_usempi_ignore_tkr -lmpi_ibm_mpifh -lmpi_ibm >>> -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread -lquadmath >>> -lstdc++ -ldl -o ex3f90 >>> 08:27 adams/pcksp-batch-kokkos *= >>> summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ >>> jsrun -n 2 -g 1 ./ex3f90 >>> DM Object: testplex 2 MPI processes >>> type: plex >>> testplex in 3 dimensions: >>> 0-cells: 12 12 >>> 1-cells: 20 20 >>> 2-cells: 11 11 >>> 3-cells: 2 2 >>> Labels: >>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>> 08:28 adams/pcksp-batch-kokkos *= >>> summit:/gpfs/alpine/csc314/scratch/adams/pets >>> >>> On Fri, Oct 29, 2021 at 10:41 PM ?? wrote: >>> >>>> Thanks, Mark. >>>> >>>> I do what you suggested but nothing changes. Besides, from your compile >>>> history and result, >>>> >>>> - you use gfortran with no MPI library, not mpif90 >>>> - two CPUs gives exactly the same result >>>> - The first line of the DMView output should be "DM Object: testplex 2 >>>> MPI processes", not "DM Object: testplex 1 MPI processes", when you use >>>> 2CPUs >>>> >>>> It seems like you did not use MPI but just two CPUs do exactly the same >>>> thing.. >>>> >>>> Best regards, >>>> >>>> Yuan >>>> >>>> >>>> 2021?10?29?(?) 20:22 Mark Adams : >>>> >>>>> This works for me (appended) using an up to date version of PETSc. >>>>> >>>>> I would delete the architecture director and reconfigure, and make >>>>> all, and try again. >>>>> >>>>> Next, you seem to be using git. Use the 'main' branch and try again. 
>>>>> >>>>> Mark >>>>> >>>>> (base) 07:09 adams/swarm-omp-pc *= ~/Codes/petsc$ cd >>>>> src/dm/impls/plex/tutorials/ >>>>> (base) 07:16 adams/swarm-omp-pc *= >>>>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ make >>>>> PETSC_DIR=/Users/markadams/Codes/petsc PETSC_ARCH=arch-macosx-gnu-g ex3f90 >>>>> gfortran-11 -Wl,-bind_at_load -Wl,-multiply_defined,suppress >>>>> -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs >>>>> -Wl,-search_paths_first -Wl,-no_compact_unwind -fPIC -Wall >>>>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -fPIC -Wall >>>>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >>>>> -I/Users/markadams/Codes/petsc/include >>>>> -I/Users/markadams/Codes/petsc/arch-macosx-gnu-g/include ex3f90.F90 >>>>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>>>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>>>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 >>>>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 -lpetsc -lp4est -lsc -llapack >>>>> -lblas -lhdf5_hl -lhdf5 -lmetis -lz -lstdc++ -ldl -lgcc_s.1 -lgfortran >>>>> -lquadmath -lm -lquadmath -lstdc++ -ldl -lgcc_s.1 -o ex3f90 >>>>> (base) 07:16 adams/swarm-omp-pc *= >>>>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ mpirun -np 2 ./ex3f90 >>>>> DM Object: testplex 1 MPI processes >>>>> type: plex >>>>> testplex in 3 dimensions: >>>>> 0-cells: 12 >>>>> 1-cells: 20 >>>>> 2-cells: 11 >>>>> 3-cells: 2 >>>>> Labels: >>>>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>>> DM Object: testplex 1 MPI processes >>>>> type: plex >>>>> testplex in 3 dimensions: >>>>> 0-cells: 12 >>>>> 1-cells: 20 >>>>> 2-cells: 11 >>>>> 3-cells: 2 >>>>> Labels: >>>>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>>> >>>>> On Fri, Oct 29, 2021 at 6:11 AM ?? wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I have tried the test case ex3f90 in the folder >>>>>> \src\dm\impls\plex\tutorials to run in parallel but found it fails. When I >>>>>> run it in 1 CPU by >>>>>> >>>>>> - mpirun -np 1 ./ex3f90 >>>>>> >>>>>> Everything seems OK. But when run it in 2 CPU by >>>>>> >>>>>> - mpirun -np 2 ./ex3f90 >>>>>> >>>>>> I got the following error message >>>>>> >>>>>> [0]PETSC ERROR: --------------------- Error Message >>>>>> -------------------------------------------------------------- >>>>>> [0]PETSC ERROR: Object is in wrong state >>>>>> [0]PETSC ERROR: This DMPlex is distributed but its PointSF has no >>>>>> graph set >>>>>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble >>>>>> shooting. 
>>>>>> [0]PETSC ERROR: Petsc Development GIT revision: >>>>>> v3.16.0-248-ge617e6467c GIT Date: 2021-10-19 23:11:25 -0500 >>>>>> [0]PETSC ERROR: ./ex3f90 on a named pc-010-088 by Fri Oct 29 >>>>>> 18:48:54 2021 >>>>>> [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx >>>>>> --with-fc=mpiifort --with-fortran-bindings=1 --with-debugging=0 >>>>>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.4.0 >>>>>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.4.0 --download-metis=1 >>>>>> --download-parmetis=1 --download-cmake --force --download-superlu_dist=1 >>>>>> --download-mumps=1 --download-scalapack=1 --download-hypre=1 >>>>>> --download-ml=1 --with-debugging=yes --prefix=/home/yuanxi >>>>>> [0]PETSC ERROR: #1 DMPlexCheckPointSF() at >>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plex.c:8626 >>>>>> [0]PETSC ERROR: #2 DMPlexOrientInterface_Internal() at >>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:595 >>>>>> [0]PETSC ERROR: #3 DMPlexInterpolate() at >>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:1357 >>>>>> [0]PETSC ERROR: #4 User provided function() at User file:0 >>>>>> Abort(73) on node 0 (rank 0 in comm 16): application called >>>>>> MPI_Abort(MPI_COMM_SELF, 73) - process 0 >>>>>> >>>>>> ------------------------------------------------------------------------------------------------------------------------------------ >>>>>> >>>>>> It fails in calling DMPlexInterpolate. Maybe this program is not >>>>>> considered to be run in parallel. But if I wish to do so, how should I >>>>>> modify it to let it run on multiple CPUs? >>>>>> >>>>>> Much thanks for your help >>>>>> >>>>>> Yuan >>>>>> >>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Sat Oct 30 11:17:24 2021 From: knepley at gmail.com (Matthew Knepley) Date: Sat, 30 Oct 2021 12:17:24 -0400 Subject: [petsc-users] Tutorials test case cannot run in parallel In-Reply-To: References: Message-ID: Yes, it is a serial test. Thanks, Matt On Sat, Oct 30, 2021 at 9:38 AM ?? wrote: > Thank you for your reply. 
> > I have solved the problem by modifying > ---------------------------------------------------- > call DMPlexCreateFromDAG(dm, depth, numPoints, coneSize, > cones,coneOrientations, vertexCoords, ierr);CHKERRA(ierr) > ---------------------------------------------------- > into > ----------------------------------------------------- > numPoints1 = [0, 0, 0, 0] > if (rank == 0) then > call DMPlexCreateFromDAG(dm, depth, numPoints, coneSize, > cones,coneOrientations, vertexCoords, ierr);CHKERRA(ierr) > else > call DMPlexCreateFromDAG(dm, 3, numPoints1, PETSC_NULL_INTEGER, > PETSC_NULL_INTEGER,PETSC_NULL_INTEGER, PETSC_NULL_REAL, ierr) > endif > ---------------------------------------------------- > > The result obtained as follows > > DM Object: testplex 2 MPI processes > type: plex > testplex in 3 dimensions: > 0-cells: 12 0 > 1-cells: 20 0 > 2-cells: 11 0 > 3-cells: 2 0 > Labels: > celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) > depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) > cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 > cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = RANK 0 PID 15428 RUNNING AT DESKTOP-9ITFSBM > = KILLED BY SIGNAL: 9 (Killed) > > =================================================================================== > > There is still problem left. I like it relevent > > 2021?10?30?(?) 21:51 Mark Adams : > >> Ah, I can reproduce this error with debugging turned on. >> This test is not a parallel test, but it does not say that serial is a >> requirement. >> So there is a problem here. >> Anyone? >> >> On Sat, Oct 30, 2021 at 8:29 AM Mark Adams wrote: >> >>> 08:27 adams/pcksp-batch-kokkos *= >>> summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ >>> make PETSC_ARCH=arch-summit-opt-gnu-kokkos-cuda ex3f90 >>> mpifort -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g >>> -O -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >>> -I/gpfs/alpine/csc314/scratch/adams/petsc/include >>> -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/include >>> -I/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/include >>> -I/sw/summit/cuda/11.0.3/include ex3f90.F90 >>> -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>> -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>> -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>> -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 >>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib >>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib >>> -Wl,-rpath,/sw/summit/cuda/11.0.3/lib64 -L/sw/summit/cuda/11.0.3/lib64 >>> -L/sw/summit/cuda/11.0.3/lib64/stubs >>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib >>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib >>> 
-Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 >>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 >>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc >>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc >>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 >>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 >>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 >>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib >>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib >>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib >>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib -lpetsc >>> -lkokkoskernels -lkokkoscontainers -lkokkoscore -lp4est -lsc -lblas >>> -llapack -lhdf5_hl -lhdf5 -lm -lz -lcudart -lcufft -lcublas -lcusparse >>> -lcusolver -lcurand -lcuda -lstdc++ -ldl -lmpiprofilesupport >>> -lmpi_ibm_usempif08 -lmpi_ibm_usempi_ignore_tkr -lmpi_ibm_mpifh -lmpi_ibm >>> -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread -lquadmath >>> -lstdc++ -ldl -o ex3f90 >>> 08:27 adams/pcksp-batch-kokkos *= >>> summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ >>> jsrun -n 2 -g 1 ./ex3f90 >>> DM Object: testplex 2 MPI processes >>> type: plex >>> testplex in 3 dimensions: >>> 0-cells: 12 12 >>> 1-cells: 20 20 >>> 2-cells: 11 11 >>> 3-cells: 2 2 >>> Labels: >>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>> 08:28 adams/pcksp-batch-kokkos *= >>> summit:/gpfs/alpine/csc314/scratch/adams/pets >>> >>> On Fri, Oct 29, 2021 at 10:41 PM ?? wrote: >>> >>>> Thanks, Mark. >>>> >>>> I do what you suggested but nothing changes. Besides, from your compile >>>> history and result, >>>> >>>> - you use gfortran with no MPI library, not mpif90 >>>> - two CPUs gives exactly the same result >>>> - The first line of the DMView output should be "DM Object: testplex 2 >>>> MPI processes", not "DM Object: testplex 1 MPI processes", when you use >>>> 2CPUs >>>> >>>> It seems like you did not use MPI but just two CPUs do exactly the same >>>> thing.. >>>> >>>> Best regards, >>>> >>>> Yuan >>>> >>>> >>>> 2021?10?29?(?) 20:22 Mark Adams : >>>> >>>>> This works for me (appended) using an up to date version of PETSc. >>>>> >>>>> I would delete the architecture director and reconfigure, and make >>>>> all, and try again. >>>>> >>>>> Next, you seem to be using git. Use the 'main' branch and try again. 
>>>>> >>>>> Mark >>>>> >>>>> (base) 07:09 adams/swarm-omp-pc *= ~/Codes/petsc$ cd >>>>> src/dm/impls/plex/tutorials/ >>>>> (base) 07:16 adams/swarm-omp-pc *= >>>>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ make >>>>> PETSC_DIR=/Users/markadams/Codes/petsc PETSC_ARCH=arch-macosx-gnu-g ex3f90 >>>>> gfortran-11 -Wl,-bind_at_load -Wl,-multiply_defined,suppress >>>>> -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs >>>>> -Wl,-search_paths_first -Wl,-no_compact_unwind -fPIC -Wall >>>>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -fPIC -Wall >>>>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >>>>> -I/Users/markadams/Codes/petsc/include >>>>> -I/Users/markadams/Codes/petsc/arch-macosx-gnu-g/include ex3f90.F90 >>>>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>>>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>>>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 >>>>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 -lpetsc -lp4est -lsc -llapack >>>>> -lblas -lhdf5_hl -lhdf5 -lmetis -lz -lstdc++ -ldl -lgcc_s.1 -lgfortran >>>>> -lquadmath -lm -lquadmath -lstdc++ -ldl -lgcc_s.1 -o ex3f90 >>>>> (base) 07:16 adams/swarm-omp-pc *= >>>>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ mpirun -np 2 ./ex3f90 >>>>> DM Object: testplex 1 MPI processes >>>>> type: plex >>>>> testplex in 3 dimensions: >>>>> 0-cells: 12 >>>>> 1-cells: 20 >>>>> 2-cells: 11 >>>>> 3-cells: 2 >>>>> Labels: >>>>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>>> DM Object: testplex 1 MPI processes >>>>> type: plex >>>>> testplex in 3 dimensions: >>>>> 0-cells: 12 >>>>> 1-cells: 20 >>>>> 2-cells: 11 >>>>> 3-cells: 2 >>>>> Labels: >>>>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>>> >>>>> On Fri, Oct 29, 2021 at 6:11 AM ?? wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I have tried the test case ex3f90 in the folder >>>>>> \src\dm\impls\plex\tutorials to run in parallel but found it fails. When I >>>>>> run it in 1 CPU by >>>>>> >>>>>> - mpirun -np 1 ./ex3f90 >>>>>> >>>>>> Everything seems OK. But when run it in 2 CPU by >>>>>> >>>>>> - mpirun -np 2 ./ex3f90 >>>>>> >>>>>> I got the following error message >>>>>> >>>>>> [0]PETSC ERROR: --------------------- Error Message >>>>>> -------------------------------------------------------------- >>>>>> [0]PETSC ERROR: Object is in wrong state >>>>>> [0]PETSC ERROR: This DMPlex is distributed but its PointSF has no >>>>>> graph set >>>>>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble >>>>>> shooting. 
>>>>>> [0]PETSC ERROR: Petsc Development GIT revision: >>>>>> v3.16.0-248-ge617e6467c GIT Date: 2021-10-19 23:11:25 -0500 >>>>>> [0]PETSC ERROR: ./ex3f90 on a named pc-010-088 by Fri Oct 29 >>>>>> 18:48:54 2021 >>>>>> [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx >>>>>> --with-fc=mpiifort --with-fortran-bindings=1 --with-debugging=0 >>>>>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.4.0 >>>>>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.4.0 --download-metis=1 >>>>>> --download-parmetis=1 --download-cmake --force --download-superlu_dist=1 >>>>>> --download-mumps=1 --download-scalapack=1 --download-hypre=1 >>>>>> --download-ml=1 --with-debugging=yes --prefix=/home/yuanxi >>>>>> [0]PETSC ERROR: #1 DMPlexCheckPointSF() at >>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plex.c:8626 >>>>>> [0]PETSC ERROR: #2 DMPlexOrientInterface_Internal() at >>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:595 >>>>>> [0]PETSC ERROR: #3 DMPlexInterpolate() at >>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:1357 >>>>>> [0]PETSC ERROR: #4 User provided function() at User file:0 >>>>>> Abort(73) on node 0 (rank 0 in comm 16): application called >>>>>> MPI_Abort(MPI_COMM_SELF, 73) - process 0 >>>>>> >>>>>> ------------------------------------------------------------------------------------------------------------------------------------ >>>>>> >>>>>> It fails in calling DMPlexInterpolate. Maybe this program is not >>>>>> considered to be run in parallel. But if I wish to do so, how should I >>>>>> modify it to let it run on multiple CPUs? >>>>>> >>>>>> Much thanks for your help >>>>>> >>>>>> Yuan >>>>>> >>>>> -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mfadams at lbl.gov Sat Oct 30 18:20:54 2021 From: mfadams at lbl.gov (Mark Adams) Date: Sat, 30 Oct 2021 19:20:54 -0400 Subject: [petsc-users] Tutorials test case cannot run in parallel In-Reply-To: References: Message-ID: Do we have a policy on this? I know some tests say they are serial. Maybe just say that tests are only supported for the parameters in the test. Yuan: at the bottom of all tests and tutorials are example input arguments and parallel run configurations. This tutorial is very rudimentary as you can see by: !/*TEST ! ! test: ! suffix: 0 ! !TEST*/ If you are looking for a parallel test, find one that has something like this: test: suffix: mesh_2 * nsize: 2* requires: exodusii args: -dm_distribute -petscpartitioner_type simple -dm_plex_filename ${wPETSC_DIR}/share/petsc/datafiles/meshes/sevenside-quad-15.exo -orth_qual_atol 0.95 TEST*/ On Sat, Oct 30, 2021 at 12:17 PM Matthew Knepley wrote: > Yes, it is a serial test. > > Thanks, > > Matt > > On Sat, Oct 30, 2021 at 9:38 AM ?? wrote: > >> Thank you for your reply. 
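For ex3f90 itself, a hypothetical stanza declaring a 2-rank configuration, written in the same harness syntax as the TEST blocks quoted above, could look like the following. This is purely illustrative (no such stanza exists in the repository), and it would only make sense once the example had actually been made safe to run on more than one rank:

!/*TEST
!
!   test:
!     suffix: 0
!
!   test:
!     suffix: 1
!     nsize: 2
!
!TEST*/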
>> >> I have solved the problem by modifying >> ---------------------------------------------------- >> call DMPlexCreateFromDAG(dm, depth, numPoints, coneSize, >> cones,coneOrientations, vertexCoords, ierr);CHKERRA(ierr) >> ---------------------------------------------------- >> into >> ----------------------------------------------------- >> numPoints1 = [0, 0, 0, 0] >> if (rank == 0) then >> call DMPlexCreateFromDAG(dm, depth, numPoints, coneSize, >> cones,coneOrientations, vertexCoords, ierr);CHKERRA(ierr) >> else >> call DMPlexCreateFromDAG(dm, 3, numPoints1, PETSC_NULL_INTEGER, >> PETSC_NULL_INTEGER,PETSC_NULL_INTEGER, PETSC_NULL_REAL, ierr) >> endif >> ---------------------------------------------------- >> >> The result obtained as follows >> >> DM Object: testplex 2 MPI processes >> type: plex >> testplex in 3 dimensions: >> 0-cells: 12 0 >> 1-cells: 20 0 >> 2-cells: 11 0 >> 3-cells: 2 0 >> Labels: >> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = RANK 0 PID 15428 RUNNING AT DESKTOP-9ITFSBM >> = KILLED BY SIGNAL: 9 (Killed) >> >> =================================================================================== >> >> There is still problem left. I like it relevent >> >> 2021?10?30?(?) 21:51 Mark Adams : >> >>> Ah, I can reproduce this error with debugging turned on. >>> This test is not a parallel test, but it does not say that serial is a >>> requirement. >>> So there is a problem here. >>> Anyone? 
>>> >>> On Sat, Oct 30, 2021 at 8:29 AM Mark Adams wrote: >>> >>>> 08:27 adams/pcksp-batch-kokkos *= >>>> summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ >>>> make PETSC_ARCH=arch-summit-opt-gnu-kokkos-cuda ex3f90 >>>> mpifort -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g >>>> -O -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >>>> -I/gpfs/alpine/csc314/scratch/adams/petsc/include >>>> -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/include >>>> -I/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/include >>>> -I/sw/summit/cuda/11.0.3/include ex3f90.F90 >>>> -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>>> -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>>> -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>>> -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 >>>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib >>>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib >>>> -Wl,-rpath,/sw/summit/cuda/11.0.3/lib64 -L/sw/summit/cuda/11.0.3/lib64 >>>> -L/sw/summit/cuda/11.0.3/lib64/stubs >>>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib >>>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib >>>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 >>>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 >>>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc >>>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc >>>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 >>>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 >>>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 >>>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib >>>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib >>>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib >>>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib -lpetsc >>>> -lkokkoskernels -lkokkoscontainers -lkokkoscore -lp4est -lsc -lblas >>>> -llapack -lhdf5_hl -lhdf5 -lm -lz -lcudart -lcufft -lcublas -lcusparse >>>> -lcusolver -lcurand -lcuda -lstdc++ -ldl -lmpiprofilesupport >>>> -lmpi_ibm_usempif08 -lmpi_ibm_usempi_ignore_tkr -lmpi_ibm_mpifh -lmpi_ibm >>>> -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread -lquadmath >>>> -lstdc++ -ldl -o ex3f90 >>>> 08:27 adams/pcksp-batch-kokkos *= >>>> summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ >>>> jsrun -n 2 -g 1 ./ex3f90 >>>> DM Object: testplex 2 MPI processes >>>> type: plex >>>> testplex in 3 dimensions: >>>> 0-cells: 12 12 >>>> 1-cells: 20 20 >>>> 
2-cells: 11 11 >>>> 3-cells: 2 2 >>>> Labels: >>>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>> 08:28 adams/pcksp-batch-kokkos *= >>>> summit:/gpfs/alpine/csc314/scratch/adams/pets >>>> >>>> On Fri, Oct 29, 2021 at 10:41 PM ?? wrote: >>>> >>>>> Thanks, Mark. >>>>> >>>>> I do what you suggested but nothing changes. Besides, from your >>>>> compile history and result, >>>>> >>>>> - you use gfortran with no MPI library, not mpif90 >>>>> - two CPUs gives exactly the same result >>>>> - The first line of the DMView output should be "DM Object: testplex >>>>> 2 MPI processes", not "DM Object: testplex 1 MPI processes", when you use >>>>> 2CPUs >>>>> >>>>> It seems like you did not use MPI but just two CPUs do exactly >>>>> the same thing.. >>>>> >>>>> Best regards, >>>>> >>>>> Yuan >>>>> >>>>> >>>>> 2021?10?29?(?) 20:22 Mark Adams : >>>>> >>>>>> This works for me (appended) using an up to date version of PETSc. >>>>>> >>>>>> I would delete the architecture director and reconfigure, and make >>>>>> all, and try again. >>>>>> >>>>>> Next, you seem to be using git. Use the 'main' branch and try again. >>>>>> >>>>>> Mark >>>>>> >>>>>> (base) 07:09 adams/swarm-omp-pc *= ~/Codes/petsc$ cd >>>>>> src/dm/impls/plex/tutorials/ >>>>>> (base) 07:16 adams/swarm-omp-pc *= >>>>>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ make >>>>>> PETSC_DIR=/Users/markadams/Codes/petsc PETSC_ARCH=arch-macosx-gnu-g ex3f90 >>>>>> gfortran-11 -Wl,-bind_at_load -Wl,-multiply_defined,suppress >>>>>> -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs >>>>>> -Wl,-search_paths_first -Wl,-no_compact_unwind -fPIC -Wall >>>>>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -fPIC -Wall >>>>>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >>>>>> -I/Users/markadams/Codes/petsc/include >>>>>> -I/Users/markadams/Codes/petsc/arch-macosx-gnu-g/include ex3f90.F90 >>>>>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>>>>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>>>>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 >>>>>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 -lpetsc -lp4est -lsc -llapack >>>>>> -lblas -lhdf5_hl -lhdf5 -lmetis -lz -lstdc++ -ldl -lgcc_s.1 -lgfortran >>>>>> -lquadmath -lm -lquadmath -lstdc++ -ldl -lgcc_s.1 -o ex3f90 >>>>>> (base) 07:16 adams/swarm-omp-pc *= >>>>>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ mpirun -np 2 ./ex3f90 >>>>>> DM Object: testplex 1 MPI processes >>>>>> type: plex >>>>>> testplex in 3 dimensions: >>>>>> 0-cells: 12 >>>>>> 1-cells: 20 >>>>>> 2-cells: 11 >>>>>> 3-cells: 2 >>>>>> Labels: >>>>>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>>>> DM Object: testplex 1 MPI processes >>>>>> type: plex >>>>>> testplex in 3 dimensions: >>>>>> 0-cells: 12 >>>>>> 1-cells: 20 >>>>>> 2-cells: 11 >>>>>> 3-cells: 2 >>>>>> Labels: >>>>>> 
celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>>>> >>>>>> On Fri, Oct 29, 2021 at 6:11 AM ?? wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I have tried the test case ex3f90 in the folder >>>>>>> \src\dm\impls\plex\tutorials to run in parallel but found it fails. When I >>>>>>> run it in 1 CPU by >>>>>>> >>>>>>> - mpirun -np 1 ./ex3f90 >>>>>>> >>>>>>> Everything seems OK. But when run it in 2 CPU by >>>>>>> >>>>>>> - mpirun -np 2 ./ex3f90 >>>>>>> >>>>>>> I got the following error message >>>>>>> >>>>>>> [0]PETSC ERROR: --------------------- Error Message >>>>>>> -------------------------------------------------------------- >>>>>>> [0]PETSC ERROR: Object is in wrong state >>>>>>> [0]PETSC ERROR: This DMPlex is distributed but its PointSF has no >>>>>>> graph set >>>>>>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble >>>>>>> shooting. >>>>>>> [0]PETSC ERROR: Petsc Development GIT revision: >>>>>>> v3.16.0-248-ge617e6467c GIT Date: 2021-10-19 23:11:25 -0500 >>>>>>> [0]PETSC ERROR: ./ex3f90 on a named pc-010-088 by Fri Oct 29 >>>>>>> 18:48:54 2021 >>>>>>> [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx >>>>>>> --with-fc=mpiifort --with-fortran-bindings=1 --with-debugging=0 >>>>>>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.4.0 >>>>>>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.4.0 --download-metis=1 >>>>>>> --download-parmetis=1 --download-cmake --force --download-superlu_dist=1 >>>>>>> --download-mumps=1 --download-scalapack=1 --download-hypre=1 >>>>>>> --download-ml=1 --with-debugging=yes --prefix=/home/yuanxi >>>>>>> [0]PETSC ERROR: #1 DMPlexCheckPointSF() at >>>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plex.c:8626 >>>>>>> [0]PETSC ERROR: #2 DMPlexOrientInterface_Internal() at >>>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:595 >>>>>>> [0]PETSC ERROR: #3 DMPlexInterpolate() at >>>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:1357 >>>>>>> [0]PETSC ERROR: #4 User provided function() at User file:0 >>>>>>> Abort(73) on node 0 (rank 0 in comm 16): application called >>>>>>> MPI_Abort(MPI_COMM_SELF, 73) - process 0 >>>>>>> >>>>>>> ------------------------------------------------------------------------------------------------------------------------------------ >>>>>>> >>>>>>> It fails in calling DMPlexInterpolate. Maybe this program is not >>>>>>> considered to be run in parallel. But if I wish to do so, how should I >>>>>>> modify it to let it run on multiple CPUs? >>>>>>> >>>>>>> Much thanks for your help >>>>>>> >>>>>>> Yuan >>>>>>> >>>>>> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yuanxi at advancesoft.jp Sun Oct 31 02:05:00 2021 From: yuanxi at advancesoft.jp (=?UTF-8?B?6KKB54WV?=) Date: Sun, 31 Oct 2021 16:05:00 +0900 Subject: [petsc-users] Tutorials test case cannot run in parallel In-Reply-To: References: Message-ID: Please see the attached file. 
I Hope it will be of some help! 2021?10?30?(?) 23:40 Mark Adams : > Great. Thank you. > Could you please send a 'git diff' if that is available? And we can take > care of it. > > > On Sat, Oct 30, 2021 at 9:38 AM ?? wrote: > >> Thank you for your reply. >> >> I have solved the problem by modifying >> ---------------------------------------------------- >> call DMPlexCreateFromDAG(dm, depth, numPoints, coneSize, >> cones,coneOrientations, vertexCoords, ierr);CHKERRA(ierr) >> ---------------------------------------------------- >> into >> ----------------------------------------------------- >> numPoints1 = [0, 0, 0, 0] >> if (rank == 0) then >> call DMPlexCreateFromDAG(dm, depth, numPoints, coneSize, >> cones,coneOrientations, vertexCoords, ierr);CHKERRA(ierr) >> else >> call DMPlexCreateFromDAG(dm, 3, numPoints1, PETSC_NULL_INTEGER, >> PETSC_NULL_INTEGER,PETSC_NULL_INTEGER, PETSC_NULL_REAL, ierr) >> endif >> ---------------------------------------------------- >> >> The result obtained as follows >> >> DM Object: testplex 2 MPI processes >> type: plex >> testplex in 3 dimensions: >> 0-cells: 12 0 >> 1-cells: 20 0 >> 2-cells: 11 0 >> 3-cells: 2 0 >> Labels: >> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = RANK 0 PID 15428 RUNNING AT DESKTOP-9ITFSBM >> = KILLED BY SIGNAL: 9 (Killed) >> >> =================================================================================== >> >> There is still problem left. I like it relevent >> >> 2021?10?30?(?) 21:51 Mark Adams : >> >>> Ah, I can reproduce this error with debugging turned on. >>> This test is not a parallel test, but it does not say that serial is a >>> requirement. >>> So there is a problem here. >>> Anyone? 
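For readers without the attachment, a C rendering of the same workaround (the attached ex3f90.diff does this in Fortran) might look roughly like the sketch below. The arrays numPoints, coneSize, cones, coneOrientations and vertexCoords stand for the data hard-coded in ex3f90 and are not reproduced here. The final PetscSFSetGraph() call is an assumption about how the "PointSF has no graph set" check might be satisfied on the empty ranks; it is not confirmed anywhere in this thread and does not address the later BAD TERMINATION.

----------------------------------------------------
#include <petscdmplex.h>

static PetscErrorCode CreateMeshOnRankZero(MPI_Comm comm, PetscInt depth,
                     const PetscInt numPoints[], const PetscInt coneSize[],
                     const PetscInt cones[], const PetscInt coneOrientations[],
                     const PetscScalar vertexCoords[], DM *dm)
{
  const PetscInt numPointsEmpty[4] = {0, 0, 0, 0}; /* one entry per stratum; assumes depth == 3 as in ex3f90 */
  PetscSF        sf;
  PetscInt       pStart, pEnd;
  PetscMPIInt    rank;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MPI_Comm_rank(comm, &rank);CHKERRQ(ierr);
  ierr = DMCreate(comm, dm);CHKERRQ(ierr);
  ierr = DMSetType(*dm, DMPLEX);CHKERRQ(ierr);
  ierr = DMSetDimension(*dm, 3);CHKERRQ(ierr);
  if (rank == 0) { /* only rank 0 hands the hard-coded DAG to Plex */
    ierr = DMPlexCreateFromDAG(*dm, depth, numPoints, coneSize, cones, coneOrientations, vertexCoords);CHKERRQ(ierr);
  } else {         /* the other ranks create an empty plex of the same depth */
    ierr = DMPlexCreateFromDAG(*dm, depth, numPointsEmpty, NULL, NULL, NULL, NULL);CHKERRQ(ierr);
  }
  /* Assumption: give the PointSF an explicit zero-leaf graph so the check in
     DMPlexCheckPointSF() sees "no shared points" rather than "no graph set" */
  ierr = DMGetPointSF(*dm, &sf);CHKERRQ(ierr);
  ierr = DMPlexGetChart(*dm, &pStart, &pEnd);CHKERRQ(ierr);
  ierr = PetscSFSetGraph(sf, pEnd - pStart, 0, NULL, PETSC_COPY_VALUES, NULL, PETSC_COPY_VALUES);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
----------------------------------------------------

Called on PETSC_COMM_WORLD with the tutorial's arrays, this should mirror the behaviour of the Fortran diff quoted above: rank 0 reports the 12/20/11/2 point counts and the other ranks report zeros, after which a DMPlexDistribute() could spread the mesh back out if that is what the test is ultimately meant to exercise.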
>>> >>> On Sat, Oct 30, 2021 at 8:29 AM Mark Adams wrote: >>> >>>> 08:27 adams/pcksp-batch-kokkos *= >>>> summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ >>>> make PETSC_ARCH=arch-summit-opt-gnu-kokkos-cuda ex3f90 >>>> mpifort -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g >>>> -O -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >>>> -I/gpfs/alpine/csc314/scratch/adams/petsc/include >>>> -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/include >>>> -I/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/include >>>> -I/sw/summit/cuda/11.0.3/include ex3f90.F90 >>>> -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>>> -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>>> -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>>> -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-cuda/lib >>>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 >>>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib >>>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/hdf5-1.10.7-yxvwkhm4nhgezbl2mwzdruwoaiblt6q2/lib >>>> -Wl,-rpath,/sw/summit/cuda/11.0.3/lib64 -L/sw/summit/cuda/11.0.3/lib64 >>>> -L/sw/summit/cuda/11.0.3/lib64/stubs >>>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib >>>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/spectrum-mpi-10.4.0.3-20210112-6jbupg3thjwhsabgevk6xmwhd2bbyxdc/lib >>>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 >>>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc/powerpc64le-unknown-linux-gnu/9.1.0 >>>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc >>>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib/gcc >>>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.1.0/netlib-lapack-3.9.1-t2a6tcso5tkezcjmfrqvqi2cpary7kgx/lib64 >>>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 >>>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib64 >>>> -Wl,-rpath,/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib >>>> -L/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/darshan-runtime-3.3.0-mu6tnxlhxfplrq3srkkgi5dvly6wenwy/lib >>>> -Wl,-rpath,/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib >>>> -L/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/lib -lpetsc >>>> -lkokkoskernels -lkokkoscontainers -lkokkoscore -lp4est -lsc -lblas >>>> -llapack -lhdf5_hl -lhdf5 -lm -lz -lcudart -lcufft -lcublas -lcusparse >>>> -lcusolver -lcurand -lcuda -lstdc++ -ldl -lmpiprofilesupport >>>> -lmpi_ibm_usempif08 -lmpi_ibm_usempi_ignore_tkr -lmpi_ibm_mpifh -lmpi_ibm >>>> -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread -lquadmath >>>> -lstdc++ -ldl -o ex3f90 >>>> 08:27 adams/pcksp-batch-kokkos *= >>>> summit:/gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/tutorials$ >>>> jsrun -n 2 -g 1 ./ex3f90 >>>> DM Object: testplex 2 MPI processes >>>> type: plex >>>> testplex in 3 dimensions: >>>> 0-cells: 12 12 >>>> 1-cells: 20 20 >>>> 
2-cells: 11 11 >>>> 3-cells: 2 2 >>>> Labels: >>>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>> 08:28 adams/pcksp-batch-kokkos *= >>>> summit:/gpfs/alpine/csc314/scratch/adams/pets >>>> >>>> On Fri, Oct 29, 2021 at 10:41 PM ?? wrote: >>>> >>>>> Thanks, Mark. >>>>> >>>>> I do what you suggested but nothing changes. Besides, from your >>>>> compile history and result, >>>>> >>>>> - you use gfortran with no MPI library, not mpif90 >>>>> - two CPUs gives exactly the same result >>>>> - The first line of the DMView output should be "DM Object: testplex >>>>> 2 MPI processes", not "DM Object: testplex 1 MPI processes", when you use >>>>> 2CPUs >>>>> >>>>> It seems like you did not use MPI but just two CPUs do exactly >>>>> the same thing.. >>>>> >>>>> Best regards, >>>>> >>>>> Yuan >>>>> >>>>> >>>>> 2021?10?29?(?) 20:22 Mark Adams : >>>>> >>>>>> This works for me (appended) using an up to date version of PETSc. >>>>>> >>>>>> I would delete the architecture director and reconfigure, and make >>>>>> all, and try again. >>>>>> >>>>>> Next, you seem to be using git. Use the 'main' branch and try again. >>>>>> >>>>>> Mark >>>>>> >>>>>> (base) 07:09 adams/swarm-omp-pc *= ~/Codes/petsc$ cd >>>>>> src/dm/impls/plex/tutorials/ >>>>>> (base) 07:16 adams/swarm-omp-pc *= >>>>>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ make >>>>>> PETSC_DIR=/Users/markadams/Codes/petsc PETSC_ARCH=arch-macosx-gnu-g ex3f90 >>>>>> gfortran-11 -Wl,-bind_at_load -Wl,-multiply_defined,suppress >>>>>> -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs >>>>>> -Wl,-search_paths_first -Wl,-no_compact_unwind -fPIC -Wall >>>>>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O -fPIC -Wall >>>>>> -ffree-line-length-0 -Wno-unused-dummy-argument -g -O >>>>>> -I/Users/markadams/Codes/petsc/include >>>>>> -I/Users/markadams/Codes/petsc/arch-macosx-gnu-g/include ex3f90.F90 >>>>>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>>> -Wl,-rpath,/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>>> -L/Users/markadams/Codes/petsc/arch-macosx-gnu-g/lib >>>>>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>>>>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11/gcc/x86_64-apple-darwin20/11.2.0 >>>>>> -Wl,-rpath,/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 >>>>>> -L/usr/local/Cellar/gcc/11.2.0/lib/gcc/11 -lpetsc -lp4est -lsc -llapack >>>>>> -lblas -lhdf5_hl -lhdf5 -lmetis -lz -lstdc++ -ldl -lgcc_s.1 -lgfortran >>>>>> -lquadmath -lm -lquadmath -lstdc++ -ldl -lgcc_s.1 -o ex3f90 >>>>>> (base) 07:16 adams/swarm-omp-pc *= >>>>>> ~/Codes/petsc/src/dm/impls/plex/tutorials$ mpirun -np 2 ./ex3f90 >>>>>> DM Object: testplex 1 MPI processes >>>>>> type: plex >>>>>> testplex in 3 dimensions: >>>>>> 0-cells: 12 >>>>>> 1-cells: 20 >>>>>> 2-cells: 11 >>>>>> 3-cells: 2 >>>>>> Labels: >>>>>> celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>>>> DM Object: testplex 1 MPI processes >>>>>> type: plex >>>>>> testplex in 3 dimensions: >>>>>> 0-cells: 12 >>>>>> 1-cells: 20 >>>>>> 2-cells: 11 >>>>>> 3-cells: 2 >>>>>> Labels: >>>>>> 
celltype: 4 strata with value/size (0 (12), 7 (2), 4 (11), 1 (20)) >>>>>> depth: 4 strata with value/size (0 (12), 1 (20), 2 (11), 3 (2)) >>>>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>>>> cell: 0 volume: 0.5000 centroid: -0.2500 0.5000 0.5000 >>>>>> cell: 1 volume: 0.5000 centroid: 0.2500 0.5000 0.5000 >>>>>> >>>>>> On Fri, Oct 29, 2021 at 6:11 AM ?? wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I have tried the test case ex3f90 in the folder >>>>>>> \src\dm\impls\plex\tutorials to run in parallel but found it fails. When I >>>>>>> run it in 1 CPU by >>>>>>> >>>>>>> - mpirun -np 1 ./ex3f90 >>>>>>> >>>>>>> Everything seems OK. But when run it in 2 CPU by >>>>>>> >>>>>>> - mpirun -np 2 ./ex3f90 >>>>>>> >>>>>>> I got the following error message >>>>>>> >>>>>>> [0]PETSC ERROR: --------------------- Error Message >>>>>>> -------------------------------------------------------------- >>>>>>> [0]PETSC ERROR: Object is in wrong state >>>>>>> [0]PETSC ERROR: This DMPlex is distributed but its PointSF has no >>>>>>> graph set >>>>>>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble >>>>>>> shooting. >>>>>>> [0]PETSC ERROR: Petsc Development GIT revision: >>>>>>> v3.16.0-248-ge617e6467c GIT Date: 2021-10-19 23:11:25 -0500 >>>>>>> [0]PETSC ERROR: ./ex3f90 on a named pc-010-088 by Fri Oct 29 >>>>>>> 18:48:54 2021 >>>>>>> [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx >>>>>>> --with-fc=mpiifort --with-fortran-bindings=1 --with-debugging=0 >>>>>>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.4.0 >>>>>>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.4.0 --download-metis=1 >>>>>>> --download-parmetis=1 --download-cmake --force --download-superlu_dist=1 >>>>>>> --download-mumps=1 --download-scalapack=1 --download-hypre=1 >>>>>>> --download-ml=1 --with-debugging=yes --prefix=/home/yuanxi >>>>>>> [0]PETSC ERROR: #1 DMPlexCheckPointSF() at >>>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plex.c:8626 >>>>>>> [0]PETSC ERROR: #2 DMPlexOrientInterface_Internal() at >>>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:595 >>>>>>> [0]PETSC ERROR: #3 DMPlexInterpolate() at >>>>>>> /home/yuanxi/myprograms/petsc/src/dm/impls/plex/plexinterpolate.c:1357 >>>>>>> [0]PETSC ERROR: #4 User provided function() at User file:0 >>>>>>> Abort(73) on node 0 (rank 0 in comm 16): application called >>>>>>> MPI_Abort(MPI_COMM_SELF, 73) - process 0 >>>>>>> >>>>>>> ------------------------------------------------------------------------------------------------------------------------------------ >>>>>>> >>>>>>> It fails in calling DMPlexInterpolate. Maybe this program is not >>>>>>> considered to be run in parallel. But if I wish to do so, how should I >>>>>>> modify it to let it run on multiple CPUs? >>>>>>> >>>>>>> Much thanks for your help >>>>>>> >>>>>>> Yuan >>>>>>> >>>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ex3f90.diff Type: application/octet-stream Size: 1904 bytes Desc: not available URL: From knepley at gmail.com Sun Oct 31 07:16:14 2021 From: knepley at gmail.com (Matthew Knepley) Date: Sun, 31 Oct 2021 08:16:14 -0400 Subject: [petsc-users] How to construct DMPlex of cells with different topological dimension? In-Reply-To: References: Message-ID: On Thu, Oct 28, 2021 at 10:48 PM ?? 
wrote: > Dear Matt, > > My mesh is something like the following figure, which is composed of > three elements : one hexahedron(solid element), one quadrilateral (shell > element), and one line (beam element). I found the function "TestEmptyStrata" > in file \dm\impls\plex\tests\ex11.c would be a good example to read in such > a kind of mesh by using DMPlexSetCone. But a problem is that you should > declare all faces and edges of hexahedron element, all edges in > quadrilateral element by DMPlexSetCone, otherwise PETsc could not do > topological interpolation afterwards. Am I right here? > As general in FEM mesh, my mesh does not contain any information about faces > or edges of solid elements. That's why I consider using DMCOMPOSITE. That is > > - Put hexahedron, quadrilateral, and line elements into different DM > structures. > - do topological interpolation in those DMs separately. > - composite them. > > Is there anything wrong in my above consideration? Any suggestions? > > ------------ > /| /| > / | / | cell 0: Hex > / | / | > ------------/ | > | | | | > | | | | cell 1: Quad > | --------|---|------------ > | / | / / > | / | / / > |/ |/ / > ------------------------------------------- > cell 2: line > > Much thanks for your help. > If you are solving something where everything is embedded in a volumetric mesh, then there is no problem. However, if you really have the mesh above, where lower dimensional pieces are sticking out of the mesh, then Plex can represent the mesh, but automatic interpolation (creation of edges and faces) will not work. Why is this? We use depth in the DAG as a proxy for cell dimension, but this will no longer work if faces are not part of a volume. Will DMCOMPOSITE do what you want? It depends. It will be able to lay out a vector, but it will not know about any topological connectivity between the meshes and will not preallocate a Jacobian with any interaction. If the meshes are truly separate, this is fine. If not, it is not that useful. Could you modify the existing code to support this? Yes, it would not be terribly difficult. When you load the mesh, you must know what kind of cell you are loading. You could explicitly set this using DMPlexSetCellType(). Then, instead of taking a certain height stratum of the DAG to loop over, you would instead use all cells marked with a certain cell type. The rest of the interpolation code should work fine. What kind of physics do you have where low dimensional features are not embedded in the larger volume? Thanks, Matt > Yuan > > 2021?10?28?(?) 22:05 Matthew Knepley : > >> On Thu, Oct 28, 2021 at 4:59 AM ?? wrote: >> >>> Dear Matt, >>> >>> Thank you for your quick response. >>> >>> I think what you mean is to build DAG from my mesh at first and then >>> call DMPlexCreateFromDAG >>> () >>> to construct DMPlex. >>> >> >> No, I do not mean that. >> >> >>> A new problem is, as I know, the function DMPlexInterpolate would >>> generate points with different depth. What's the difference between those >>> faces and segment elements generated by DMPlexInterpolate with that >>> defined by the original mesh, or should we not use DMPlexInterpolate in >>> such a case? >>> >>> On the other hand, can DMComposite be used in this case by defining >>> DMPlex with different topological dimensions at first and then composite >>> them? >>> >> >> You do not need that. I am obviously not understanding your question. My >> short answer is that Plex _already_ handles cells of different >> dimension automatically without anything extra. 
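To make the DMPlexSetCellType() suggestion above concrete, here is a small, self-contained C sketch that builds a mixed-dimensional plex by hand and tags every point with its shape, in the spirit of the hex/quad/line mesh in the figure. The point numbering and cone contents are made up purely for illustration (real cones must follow the DMPolytopeType vertex-ordering conventions), and, as noted above, the stock interpolation code would still need to key off the cell type rather than the DAG height for a mesh like this.

----------------------------------------------------
#include <petscdmplex.h>

int main(int argc, char **argv)
{
  DM             dm;
  PetscInt       p;
  /* 3 cells (hex, quad, segment) followed by 11 vertices: chart [0, 14) */
  const PetscInt coneSizes[3] = {8, 4, 2};
  const PetscInt coneHex[8]   = {3, 4, 5, 6, 7, 8, 9, 10}; /* illustrative numbering only */
  const PetscInt coneQuad[4]  = {7, 8, 11, 12};            /* shares an edge with the hex */
  const PetscInt coneSeg[2]   = {12, 13};                  /* hangs off one quad vertex */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = DMCreate(PETSC_COMM_WORLD, &dm);CHKERRQ(ierr);
  ierr = DMSetType(dm, DMPLEX);CHKERRQ(ierr);
  ierr = DMSetDimension(dm, 3);CHKERRQ(ierr);  /* the highest dimension present */
  ierr = DMPlexSetChart(dm, 0, 14);CHKERRQ(ierr);
  for (p = 0; p < 3; ++p) {ierr = DMPlexSetConeSize(dm, p, coneSizes[p]);CHKERRQ(ierr);}
  ierr = DMSetUp(dm);CHKERRQ(ierr);
  ierr = DMPlexSetCone(dm, 0, coneHex);CHKERRQ(ierr);
  ierr = DMPlexSetCone(dm, 1, coneQuad);CHKERRQ(ierr);
  ierr = DMPlexSetCone(dm, 2, coneSeg);CHKERRQ(ierr);
  /* Record what each point is, instead of letting DAG depth stand in for dimension */
  ierr = DMPlexSetCellType(dm, 0, DM_POLYTOPE_HEXAHEDRON);CHKERRQ(ierr);
  ierr = DMPlexSetCellType(dm, 1, DM_POLYTOPE_QUADRILATERAL);CHKERRQ(ierr);
  ierr = DMPlexSetCellType(dm, 2, DM_POLYTOPE_SEGMENT);CHKERRQ(ierr);
  for (p = 3; p < 14; ++p) {ierr = DMPlexSetCellType(dm, p, DM_POLYTOPE_POINT);CHKERRQ(ierr);}
  ierr = DMPlexSymmetrize(dm);CHKERRQ(ierr);
  ierr = DMPlexStratify(dm);CHKERRQ(ierr);
  ierr = DMViewFromOptions(dm, NULL, "-dm_view");CHKERRQ(ierr);
  ierr = DMDestroy(&dm);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
----------------------------------------------------

The celltype label (DMPlexGetCellTypeLabel()) can then be used to loop over all cells of a given shape regardless of their depth in the DAG.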
>> >> Maybe it would help if you defined a specific problem you have. >> >> Thanks, >> >> Matt >> >> >>> Thanks in advance. >>> >>> Yuan >>> >>> >>> 2021?10?27?(?) 19:27 Matthew Knepley : >>> >>>> On Wed, Oct 27, 2021 at 4:50 AM ?? wrote: >>>> >>>>> Hi, >>>>> >>>>> I am trying to parallelize my serial FEM program using PETSc. This >>>>> program calculates structure deformation by using various types of elements >>>>> such as solid, shell, beam, and truss. At the very beginning, I found it >>>>> was hard for me to put such kinds of elements into DMPlex. Because solid >>>>> elements are topologically three dimensional, shell element two, and beam >>>>> or truss are topologically one-dimensional elements. After reading chapter >>>>> 2.10: "DMPlex: Unstructured Grids in PETSc" of users manual carefully, I >>>>> found the provided functions, such as DMPlexSetCone, cannot declare those >>>>> topological differences. >>>>> >>>>> My question is : Is it possible and how to define all those >>>>> topologically different elements into a DMPlex struct? >>>>> >>>> >>>> Yes. The idea is to program in a dimension-independent way, so that the >>>> code can handle cells of any dimension. >>>> What you probably want is the "depth" in the DAG representation, which >>>> you can think of as the dimension of a cell. >>>> >>>> >>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetPointDepth.html#DMPlexGetPointDepth >>>> >>>> Thanks, >>>> >>>> Matt >>>> >>>> >>>>> Thanks in advance! >>>>> >>>>> Best regards, >>>>> >>>>> Yuan. >>>>> >>>>> >>>> >>>> -- >>>> What most experimenters take for granted before they begin their >>>> experiments is infinitely more interesting than any results to which their >>>> experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>>> >>> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://www.cse.buffalo.edu/~knepley/ >> >> > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From Eric.Chamberland at giref.ulaval.ca Sun Oct 31 09:07:24 2021 From: Eric.Chamberland at giref.ulaval.ca (Eric Chamberland) Date: Sun, 31 Oct 2021 10:07:24 -0400 Subject: [petsc-users] Is it possible to keep track of original elements # after a call to DMPlexDistribute ? In-Reply-To: <8a3704c3-d626-d860-0e98-33e113c5c376@giref.ulaval.ca> References: <7236c736-6066-1ba3-55b1-60782d8e754f@giref.ulaval.ca> <631832bb-4953-a3eb-13c6-0f7fe17e869d@giref.ulaval.ca> <12e32ebb-61ed-6a8c-ab77-2841090ba5fe@giref.ulaval.ca> <8a3704c3-d626-d860-0e98-33e113c5c376@giref.ulaval.ca> Message-ID: Hi Matthew, we do not know if DMPlexNaturalToGlobalBegin/End is buggy or if it is our comprehension of what it should do that is wrong... Would you just check if what we try to do from line 313 to 356 is good or wrong? The expected result is that the global vector "lGlobalVec" contains the element numbers from the initial partition that have been put into "lNatVec". Thanks a lot for any insights!! Eric On 2021-10-27 2:32 p.m., Eric Chamberland wrote: > > Hi Matthew, > > we continued the example.? Now it must be our misuse of PETSc that > produced the wrong result. 
> > As stated into the code: > > // The call to DMPlexNaturalToGlobalBegin/End does not produce our > expected result... > ? // In lGlobalVec, we expect to have: > ? /* > ?? * Process [0] > ?? * 2. > ?? * 4. > ?? * 8. > ?? * 3. > ?? * 9. > ?? * Process [1] > ?? * 1. > ?? * 5. > ?? * 7. > ?? * 0. > ?? * 6. > ?? * > ?? * but we obtained: > ?? * > ?? * Process [0] > ?? * 2. > ?? * 4. > ?? * 8. > ?? * 0. > ?? * 0. > ?? * Process [1] > ?? * 0. > ?? * 0. > ?? * 0. > ?? * 0. > ?? * 0. > ?? */ > > (see attached ex44.c) > > Thanks, > > Eric > > On 2021-10-27 1:25 p.m., Eric Chamberland wrote: >> >> Great! >> >> Thanks Matthew, it is working for me up to that point! >> >> We are continuing the ex44.c and forward it to you at the next >> blocking point... >> >> Eric >> >> On 2021-10-27 11:14 a.m., Matthew Knepley wrote: >>> On Wed, Oct 27, 2021 at 8:29 AM Eric Chamberland >>> >> > wrote: >>> >>> Hi Matthew, >>> >>> the smallest mesh which crashes the code is a 2x5 mesh: >>> >>> See the modified ex44.c >>> >>> With smaller meshes(2x2, 2x4, etc), it passes... But it bugs >>> latter when I try to use DMPlexNaturalToGlobalBegin but let's >>> keep that other problem for later... >>> >>> Thanks a lot for helping digging into this! :) >>> >>> I have made a small fix in this branch >>> >>> https://gitlab.com/petsc/petsc/-/commits/knepley/fix-plex-g2n >>> >>> >>> It seems to run for me. Can you check it? >>> >>> ? Thanks, >>> >>> ? ? ?Matt >>> >>> Eric >>> >>> (sorry if you received this for a? 2nd times, I have trouble >>> with my mail) >>> >>> On 2021-10-26 4:35 p.m., Matthew Knepley wrote: >>>> On Tue, Oct 26, 2021 at 1:35 PM Eric Chamberland >>>> >>> > wrote: >>>> >>>> Here is a screenshot of the partition I hard coded (top) >>>> and vertices/element numbers (down): >>>> >>>> I have not yet modified the ex44.c example to properly >>>> assign the coordinates... >>>> >>>> (but I would not have done it like it is in the last >>>> version because the sCoords array is the global array with >>>> global vertices number) >>>> >>>> I will have time to do this tomorrow... >>>> >>>> Maybe I can first try to reproduce all this with a smaller >>>> mesh? >>>> >>>> >>>> That might make it easier to find a problem. >>>> >>>> ? Thanks! >>>> >>>> ? ? ?Matt >>>> >>>> Eric >>>> >>>> On 2021-10-26 9:46 a.m., Matthew Knepley wrote: >>>>> Okay, I ran it. Something seems off with the mesh. First, >>>>> I cannot simply explain the partition. The number of >>>>> shared vertices and edges >>>>> does not seem to come from a straight cut. Second, the >>>>> mesh look scrambled on output. >>>>> >>>>> ? Thanks, >>>>> >>>>> ? ? Matt >>>>> >>>>> On Sun, Oct 24, 2021 at 11:49 PM Eric Chamberland >>>>> >>>> > wrote: >>>>> >>>>> Hi Matthew, >>>>> >>>>> ok, I started back from your ex44.c example and added >>>>> the global array of coordinates.? I just have to code >>>>> the creation of the local coordinates now. >>>>> >>>>> Eric >>>>> >>>>> On 2021-10-20 6:55 p.m., Matthew Knepley wrote: >>>>>> On Wed, Oct 20, 2021 at 3:06 PM Eric Chamberland >>>>>> >>>>> > wrote: >>>>>> >>>>>> Hi Matthew, >>>>>> >>>>>> we tried to reproduce the error in a simple example. >>>>>> >>>>>> The context is the following: We hard coded the >>>>>> mesh and initial partition into the code (see >>>>>> sConnectivity and sInitialPartition) for 2 ranks >>>>>> and try to create a section in order to use the >>>>>> DMPlexNaturalToGlobalBegin function to retreive >>>>>> our initial element numbers. 
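For reference, here is a minimal sketch of the order of operations the natural-to-global machinery appears to require, following the discussion quoted below: the section describing the data layout (here P0, one value per cell) is attached to the serial DM before DMPlexDistribute(), and only afterwards is the natural vector pushed into a global vector of the distributed DM. All names are illustrative, natVec is assumed to already hold the original element numbers in the pre-distribution ordering, and the single-rank case (where DMPlexDistribute() may return a NULL DM) is not handled.

```c
#include <petscdmplex.h>
#include <petscsf.h>

/* Sketch only: attach a P0 section, distribute, then permute natVec
   (natural ordering) into a global vector of the distributed DM. */
static PetscErrorCode DistributeAndNaturalToGlobal(DM dm, Vec natVec, DM *dmDist, Vec *globalVec)
{
  PetscSection   s;
  PetscSF        sfMigration;
  PetscInt       pStart, pEnd, cStart, cEnd, c;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = DMSetUseNatural(dm, PETSC_TRUE);CHKERRQ(ierr);
  /* The layout must exist before distribution, or no GlobalToNatural SF is built */
  ierr = PetscSectionCreate(PetscObjectComm((PetscObject)dm), &s);CHKERRQ(ierr);
  ierr = DMPlexGetChart(dm, &pStart, &pEnd);CHKERRQ(ierr);
  ierr = PetscSectionSetChart(s, pStart, pEnd);CHKERRQ(ierr);
  ierr = DMPlexGetHeightStratum(dm, 0, &cStart, &cEnd);CHKERRQ(ierr);
  for (c = cStart; c < cEnd; ++c) {ierr = PetscSectionSetDof(s, c, 1);CHKERRQ(ierr);} /* one value per cell */
  ierr = PetscSectionSetUp(s);CHKERRQ(ierr);
  ierr = DMSetLocalSection(dm, s);CHKERRQ(ierr);
  ierr = PetscSectionDestroy(&s);CHKERRQ(ierr);

  ierr = DMPlexDistribute(dm, 0, &sfMigration, dmDist);CHKERRQ(ierr);
  if (sfMigration) {ierr = PetscSFDestroy(&sfMigration);CHKERRQ(ierr);}

  ierr = DMCreateGlobalVector(*dmDist, globalVec);CHKERRQ(ierr);
  ierr = DMPlexNaturalToGlobalBegin(*dmDist, natVec, *globalVec);CHKERRQ(ierr);
  ierr = DMPlexNaturalToGlobalEnd(*dmDist, natVec, *globalVec);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```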
>>>>>> >>>>>> Now the call to DMPlexDistribute give different >>>>>> errors depending on what type of component we ask >>>>>> the field to be created. For our objective, we >>>>>> would like a global field to be created on >>>>>> elements only (like a P0 interpolation). >>>>>> >>>>>> We now have the following error generated: >>>>>> >>>>>> [0]PETSC ERROR: --------------------- Error >>>>>> Message >>>>>> -------------------------------------------------------------- >>>>>> [0]PETSC ERROR: Petsc has generated inconsistent data >>>>>> [0]PETSC ERROR: Inconsistency in indices, 18 >>>>>> should be 17 >>>>>> [0]PETSC ERROR: See >>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html >>>>>> >>>>>> for trouble shooting. >>>>>> [0]PETSC ERROR: Petsc Release Version 3.15.0, Mar >>>>>> 30, 2021 >>>>>> [0]PETSC ERROR: ./bug on a? named rohan by ericc >>>>>> Wed Oct 20 14:52:36 2021 >>>>>> [0]PETSC ERROR: Configure options >>>>>> --prefix=/opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7 >>>>>> --with-mpi-compilers=1 >>>>>> --with-mpi-dir=/opt/openmpi-4.1.0_gcc7 >>>>>> --with-cxx-dialect=C++14 --with-make-np=12 >>>>>> --with-shared-libraries=1 --with-debugging=yes >>>>>> --with-memalign=64 --with-visibility=0 >>>>>> --with-64-bit-indices=0 --download-ml=yes >>>>>> --download-mumps=yes --download-superlu=yes >>>>>> --download-hpddm=yes --download-slepc=yes >>>>>> --download-superlu_dist=yes >>>>>> --download-parmetis=yes --download-ptscotch=yes >>>>>> --download-metis=yes --download-strumpack=yes >>>>>> --download-suitesparse=yes --download-hypre=yes >>>>>> --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>>>>> --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >>>>>> --with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/.. >>>>>> --with-scalapack=1 >>>>>> --with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include >>>>>> --with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64 >>>>>> -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" >>>>>> [0]PETSC ERROR: #1 PetscSFCreateSectionSF() at >>>>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/vec/is/sf/utils/sfutils.c:409 >>>>>> [0]PETSC ERROR: #2 >>>>>> DMPlexCreateGlobalToNaturalSF() at >>>>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexnatural.c:184 >>>>>> [0]PETSC ERROR: #3 DMPlexDistribute() at >>>>>> /tmp/ompi-opt/petsc-3.15.0-debug/src/dm/impls/plex/plexdistribute.c:1733 >>>>>> [0]PETSC ERROR: #4 main() at bug_section.cc:159 >>>>>> [0]PETSC ERROR: No PETSc Option Table entries >>>>>> [0]PETSC ERROR: ----------------End of Error >>>>>> Message -------send entire error message to >>>>>> petsc-maint at mcs.anl.gov >>>>>> ---------- >>>>>> >>>>>> Hope the attached code is self-explaining, note >>>>>> that to make it short, we have not included the >>>>>> final part of it, just the buggy part we are >>>>>> encountering right now... >>>>>> >>>>>> Thanks for your insights, >>>>>> >>>>>> Thanks for making the example. I tweaked it slightly. >>>>>> I put in a test case that just makes a parallel 7 x >>>>>> 10 quad mesh. This works >>>>>> fine. Thus I think it must be something connected >>>>>> with the original mesh. It is hard to get a handle on >>>>>> it without the coordinates. >>>>>> Do you think you could put the coordinate array in? I >>>>>> have added the code to load them (see attached file). >>>>>> >>>>>> ? Thanks, >>>>>> >>>>>> ? ? 
?Matt >>>>>> >>>>>> Eric >>>>>> >>>>>> On 2021-10-06 9:23 p.m., Matthew Knepley wrote: >>>>>>> On Wed, Oct 6, 2021 at 5:43 PM Eric Chamberland >>>>>>> >>>>>> > wrote: >>>>>>> >>>>>>> Hi Matthew, >>>>>>> >>>>>>> we tried to use that. Now, we discovered that: >>>>>>> >>>>>>> 1- even if we "ask" for sfNatural creation >>>>>>> with DMSetUseNatural, it is not created >>>>>>> because DMPlexCreateGlobalToNaturalSF looks >>>>>>> for a "section": this is not documented in >>>>>>> DMSetUseNaturalso we are asking ourselfs: >>>>>>> "is this a permanent feature or a temporary >>>>>>> situation?" >>>>>>> >>>>>>> I think explaining this will help clear up a lot. >>>>>>> >>>>>>> What the Natural2Global?map does is permute a >>>>>>> solution vector into the ordering that it would >>>>>>> have had prior to mesh distribution. >>>>>>> Now, in order to do this permutation, I need to >>>>>>> know the original (global) data layout. If it is >>>>>>> not specified _before_ distribution, we >>>>>>> cannot build the permutation. The section >>>>>>> describes the data layout, so I need it before >>>>>>> distribution. >>>>>>> >>>>>>> I cannot think of another way that you would >>>>>>> implement this, but if you want something else, >>>>>>> let me know. >>>>>>> >>>>>>> 2- We then tried to create a "section" in >>>>>>> different manners: we took the code into the >>>>>>> example >>>>>>> petsc/src/dm/impls/plex/tests/ex15.c. >>>>>>> However, we ended up with a segfault: >>>>>>> >>>>>>> corrupted size vs. prev_size >>>>>>> [rohan:07297] *** Process received signal *** >>>>>>> [rohan:07297] Signal: Aborted (6) >>>>>>> [rohan:07297] Signal code: (-6) >>>>>>> [rohan:07297] [ 0] >>>>>>> /lib64/libpthread.so.0(+0x13f80)[0x7f6f13be3f80] >>>>>>> [rohan:07297] [ 1] >>>>>>> /lib64/libc.so.6(gsignal+0x10b)[0x7f6f109b718b] >>>>>>> [rohan:07297] [ 2] >>>>>>> /lib64/libc.so.6(abort+0x175)[0x7f6f109b8585] >>>>>>> [rohan:07297] [ 3] >>>>>>> /lib64/libc.so.6(+0x7e2f7)[0x7f6f109fb2f7] >>>>>>> [rohan:07297] [ 4] >>>>>>> /lib64/libc.so.6(+0x857ea)[0x7f6f10a027ea] >>>>>>> [rohan:07297] [ 5] >>>>>>> /lib64/libc.so.6(+0x86036)[0x7f6f10a03036] >>>>>>> [rohan:07297] [ 6] >>>>>>> /lib64/libc.so.6(+0x861a3)[0x7f6f10a031a3] >>>>>>> [rohan:07297] [ 7] >>>>>>> /lib64/libc.so.6(+0x88740)[0x7f6f10a05740] >>>>>>> [rohan:07297] [ 8] >>>>>>> /lib64/libc.so.6(__libc_malloc+0x1b8)[0x7f6f10a070c8] >>>>>>> [rohan:07297] [ 9] >>>>>>> /lib64/libc.so.6(__backtrace_symbols+0x134)[0x7f6f10a8b064] >>>>>>> [rohan:07297] [10] >>>>>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x4538ce] >>>>>>> [rohan:07297] [11] >>>>>>> /home/mefpp_ericc/GIREF/bin/MEF++.dev(_Z15attacheDebuggerv+0x120)[0x4523c0] >>>>>>> [rohan:07297] [12] >>>>>>> /home/mefpp_ericc/GIREF/lib/libgiref_dev_Util.so(traitementSignal+0x612)[0x7f6f28f503a2] >>>>>>> [rohan:07297] [13] >>>>>>> /lib64/libc.so.6(+0x3a210)[0x7f6f109b7210] >>>>>>> >>>>>>> [rohan:07297] [14] >>>>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscTrMallocDefault+0x6fd)[0x7f6f22f1b8ed] >>>>>>> [rohan:07297] [15] >>>>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscMallocA+0x5cd)[0x7f6f22f19c2d] >>>>>>> [rohan:07297] [16] >>>>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(PetscSFCreateSectionSF+0xb48)[0x7f6f23268e18] >>>>>>> [rohan:07297] [17] >>>>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexCreateGlobalToNaturalSF+0x13b2)[0x7f6f241a5602] >>>>>>> [rohan:07297] [18] 
>>>>>>> /opt/petsc-3.15.0_debug_openmpi-4.1.0_gcc7/lib/libpetsc.so.3.15(DMPlexDistribute+0x39b1)[0x7f6f23fdca21] >>>>>>> >>>>>>> I am not sure what happened here, but if you >>>>>>> could send a sample code, I will figure it out. >>>>>>> >>>>>>> If we do not create a section, the call to >>>>>>> DMPlexDistribute is successful, but >>>>>>> DMPlexGetGlobalToNaturalSF return a null SF >>>>>>> pointer... >>>>>>> >>>>>>> Yes, it just ignores it in this case because it >>>>>>> does not have a global layout. >>>>>>> >>>>>>> Here are the operations we are calling ( >>>>>>> this is almost the code we are using, I just >>>>>>> removed verifications and creation of the >>>>>>> connectivity which use our parallel >>>>>>> structure and code): >>>>>>> >>>>>>> =========== >>>>>>> >>>>>>> ? PetscInt* lCells????? = 0; >>>>>>> ? PetscInt lNumCorners = 0; >>>>>>> ? PetscInt lDimMail??? = 0; >>>>>>> ? PetscInt lnumCells?? = 0; >>>>>>> >>>>>>> ? //At this point we create the cells for >>>>>>> PETSc expected input for >>>>>>> DMPlexBuildFromCellListParallel and set >>>>>>> lNumCorners, lDimMail and lnumCells to >>>>>>> correct values. >>>>>>> ? ... >>>>>>> >>>>>>> ? DM lDMBete = 0 >>>>>>> DMPlexCreate(lMPIComm,&lDMBete); >>>>>>> >>>>>>> DMSetDimension(lDMBete, lDimMail); >>>>>>> >>>>>>> DMPlexBuildFromCellListParallel(lDMBete, >>>>>>> ????????????????????????????????? lnumCells, >>>>>>> ????????????????????????????????? PETSC_DECIDE, >>>>>>> pLectureElementsLocaux.reqNbTotalSommets(), >>>>>>> ????????????????????????????????? lNumCorners, >>>>>>> ????????????????????????????????? lCells, >>>>>>> ????????????????????????????????? PETSC_NULL); >>>>>>> >>>>>>> ? DM lDMBeteInterp = 0; >>>>>>> DMPlexInterpolate(lDMBete, &lDMBeteInterp); >>>>>>> DMDestroy(&lDMBete); >>>>>>> ? lDMBete = lDMBeteInterp; >>>>>>> >>>>>>> DMSetUseNatural(lDMBete,PETSC_TRUE); >>>>>>> >>>>>>> ? PetscSF lSFMigrationSansOvl = 0; >>>>>>> ? PetscSF lSFMigrationOvl = 0; >>>>>>> ? DM lDMDistribueSansOvl = 0; >>>>>>> ? DM lDMAvecOverlap = 0; >>>>>>> >>>>>>> PetscPartitioner lPart; >>>>>>> DMPlexGetPartitioner(lDMBete, &lPart); >>>>>>> PetscPartitionerSetFromOptions(lPart); >>>>>>> >>>>>>> PetscSection section; >>>>>>> PetscInt numFields?? = 1; >>>>>>> PetscInt numBC?????? = 0; >>>>>>> PetscInt numComp[1]? = {1}; >>>>>>> PetscInt numDof[4]?? = {1, 0, 0, 0}; >>>>>>> PetscInt bcFields[1] = {0}; >>>>>>> IS bcPoints[1] = {NULL}; >>>>>>> >>>>>>> DMSetNumFields(lDMBete, numFields); >>>>>>> >>>>>>> DMPlexCreateSection(lDMBete, NULL, numComp, >>>>>>> numDof, numBC, bcFields, bcPoints, NULL, >>>>>>> NULL, §ion); >>>>>>> DMSetLocalSection(lDMBete, section); >>>>>>> >>>>>>> DMPlexDistribute(lDMBete, 0, >>>>>>> &lSFMigrationSansOvl, &lDMDistribueSansOvl); >>>>>>> // segfault! >>>>>>> >>>>>>> =========== >>>>>>> >>>>>>> So we have other question/remarks: >>>>>>> >>>>>>> 3- Maybe PETSc expect something specific >>>>>>> that is missing/not verified: for example, >>>>>>> we didn't gave any coordinates since we just >>>>>>> want to partition and compute overlap for >>>>>>> the mesh... and then recover our element >>>>>>> numbers in a "simple way" >>>>>>> >>>>>>> 4- We are telling ourselves it is somewhat a >>>>>>> "big price to pay" to have to build an >>>>>>> unused section to have the global to natural >>>>>>> ordering set ?? Could this requirement be >>>>>>> avoided? >>>>>>> >>>>>>> I don't think so. 
There would have to be _some_ >>>>>>> way of describing your data layout in terms of >>>>>>> mesh points, and I do not see how you could use >>>>>>> less memory doing that. >>>>>>> >>>>>>> 5- Are there any improvement towards our >>>>>>> usages in 3.16 release? >>>>>>> >>>>>>> Let me try and run the code above. >>>>>>> >>>>>>> ? Thanks, >>>>>>> >>>>>>> ? ? ?Matt >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Eric >>>>>>> >>>>>>> >>>>>>> On 2021-09-29 7:39 p.m., Matthew Knepley wrote: >>>>>>>> On Wed, Sep 29, 2021 at 5:18 PM Eric >>>>>>>> Chamberland >>>>>>>> >>>>>>> > >>>>>>>> wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> I come back with _almost_ the original >>>>>>>> question: >>>>>>>> >>>>>>>> I would like to add an integer >>>>>>>> information (*our* original element >>>>>>>> number, not petsc one) on each element >>>>>>>> of the DMPlex I create with >>>>>>>> DMPlexBuildFromCellListParallel. >>>>>>>> >>>>>>>> I would like this interger to be >>>>>>>> distribruted by or the same way >>>>>>>> DMPlexDistribute distribute the mesh. >>>>>>>> >>>>>>>> Is it possible to do this? >>>>>>>> >>>>>>>> >>>>>>>> I think we already have support for what >>>>>>>> you want. If you call >>>>>>>> >>>>>>>> https://petsc.org/main/docs/manualpages/DM/DMSetUseNatural.html >>>>>>>> >>>>>>>> >>>>>>>> before DMPlexDistribute(), it will compute >>>>>>>> a PetscSF encoding the global to natural >>>>>>>> map. You >>>>>>>> can get it with >>>>>>>> >>>>>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetGlobalToNaturalSF.html >>>>>>>> >>>>>>>> >>>>>>>> and use it with >>>>>>>> >>>>>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGlobalToNaturalBegin.html >>>>>>>> >>>>>>>> >>>>>>>> Is this sufficient? >>>>>>>> >>>>>>>> ? Thanks, >>>>>>>> >>>>>>>> ? ? ?Matt >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Eric >>>>>>>> >>>>>>>> On 2021-07-14 1:18 p.m., Eric >>>>>>>> Chamberland wrote: >>>>>>>> > Hi, >>>>>>>> > >>>>>>>> > I want to use DMPlexDistribute from >>>>>>>> PETSc for computing overlapping >>>>>>>> > and play with the different >>>>>>>> partitioners supported. >>>>>>>> > >>>>>>>> > However, after calling >>>>>>>> DMPlexDistribute, I noticed the >>>>>>>> elements are >>>>>>>> > renumbered and then the original >>>>>>>> number is lost. >>>>>>>> > >>>>>>>> > What would be the best way to keep >>>>>>>> track of the element renumbering? >>>>>>>> > >>>>>>>> > a) Adding an optional parameter to >>>>>>>> let the user retrieve a vector or >>>>>>>> > "IS" giving the old number? >>>>>>>> > >>>>>>>> > b) Adding a DMLabel (seems a wrong >>>>>>>> good solution) >>>>>>>> > >>>>>>>> > c) Other idea? >>>>>>>> > >>>>>>>> > Of course, I don't want to loose >>>>>>>> performances with the need of this >>>>>>>> > "mapping"... >>>>>>>> > >>>>>>>> > Thanks, >>>>>>>> > >>>>>>>> > Eric >>>>>>>> > >>>>>>>> -- >>>>>>>> Eric Chamberland, ing., M. Ing >>>>>>>> Professionnel de recherche >>>>>>>> GIREF/Universit? Laval >>>>>>>> (418) 656-2131 poste 41 22 42 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> What most experimenters take for granted >>>>>>>> before they begin their experiments is >>>>>>>> infinitely more interesting than any >>>>>>>> results to which their experiments lead. >>>>>>>> -- Norbert Wiener >>>>>>>> >>>>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Eric Chamberland, ing., M. Ing >>>>>>> Professionnel de recherche >>>>>>> GIREF/Universit? 
Laval >>>>>>> (418) 656-2131 poste 41 22 42 >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> What most experimenters take for granted before >>>>>>> they begin their experiments is infinitely more >>>>>>> interesting than any results to which their >>>>>>> experiments lead. >>>>>>> -- Norbert Wiener >>>>>>> >>>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>>> >>>>>> >>>>>> -- >>>>>> Eric Chamberland, ing., M. Ing >>>>>> Professionnel de recherche >>>>>> GIREF/Universit? Laval >>>>>> (418) 656-2131 poste 41 22 42 >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> What most experimenters take for granted before they >>>>>> begin their experiments is infinitely more >>>>>> interesting than any results to which their >>>>>> experiments lead. >>>>>> -- Norbert Wiener >>>>>> >>>>>> https://www.cse.buffalo.edu/~knepley/ >>>>>> >>>>> >>>>> -- >>>>> Eric Chamberland, ing., M. Ing >>>>> Professionnel de recherche >>>>> GIREF/Universit? Laval >>>>> (418) 656-2131 poste 41 22 42 >>>>> >>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before they begin >>>>> their experiments is infinitely more interesting than any >>>>> results to which their experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://www.cse.buffalo.edu/~knepley/ >>>>> >>>> >>>> -- >>>> Eric Chamberland, ing., M. Ing >>>> Professionnel de recherche >>>> GIREF/Universit? Laval >>>> (418) 656-2131 poste 41 22 42 >>>> >>>> >>>> >>>> -- >>>> What most experimenters take for granted before they begin >>>> their experiments is infinitely more interesting than any >>>> results to which their experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> >>> >>> -- >>> Eric Chamberland, ing., M. Ing >>> Professionnel de recherche >>> GIREF/Universit? Laval >>> (418) 656-2131 poste 41 22 42 >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which >>> their experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >> -- >> Eric Chamberland, ing., M. Ing >> Professionnel de recherche >> GIREF/Universit? Laval >> (418) 656-2131 poste 41 22 42 > -- > Eric Chamberland, ing., M. Ing > Professionnel de recherche > GIREF/Universit? Laval > (418) 656-2131 poste 41 22 42 -- Eric Chamberland, ing., M. Ing Professionnel de recherche GIREF/Universit? Laval (418) 656-2131 poste 41 22 42 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: hbnbhlbilhmjdpfg.png Type: image/png Size: 42972 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: eejjfmbjimlkboec.png Type: image/png Size: 87901 bytes Desc: not available URL: From yuanxi at advancesoft.jp Sun Oct 31 11:21:08 2021 From: yuanxi at advancesoft.jp (=?UTF-8?B?6KKB54WV?=) Date: Mon, 1 Nov 2021 01:21:08 +0900 Subject: [petsc-users] How to construct DMPlex of cells with different topological dimension? In-Reply-To: References: Message-ID: Dear Matt Thank you for your detailed explanation. First, I would like to answer your question about my application where low dimensional features are not embedded in the larger volume. It is quite general in structural engineering. For example, buildings are generally modelled as shells and beams, which are two and one dimension respectively. 
While building foundation is modelled by solid elements, which is three dimension, at the same time. Secondly, It is regrettably to find that DMComposite is not available for me, because all my solid, shell, and beam elements are connected each other. At last, I have build a simple program to see if DMPlexSetCellType() works for me, following the suggestion of functions in PETSc like DMPlexCreateCGNS. But it failed when it tried to do DMPlexInterpolate ! 9----------8---------13 ! /| /| /| ! / | / | / | ! / | / | / | ! 6---------7---------12 | ! | | | | | | ! | | | | | | ! | | | | | | ! | | | | | | ! | 5------|---4-------|-11--------17--------16 ! | / | / | / / / ! | / | / | / / / ! |/ |/ |/ / / ! 2---------3---------10--------14-------15 The calculation result are follows. It seems that the cell type are set correctly but depth is still 2. -------------------------------------------------------------------- DM Object: TestMesh 2 MPI processes type: plex TestMesh in 3 dimensions: 0-cells: 16 0 3-cells: 20 (18) 0 Labels: celltype: 3 strata with value/size (7 (2), 4 (2), 0 (16)) depth: 2 strata with value/size (0 (16), 1 (20)) [0]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [0]PETSC ERROR: Object is in wrong state [0]PETSC ERROR: Array was not checked out [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting. [0]PETSC ERROR: Petsc Development GIT revision: v3.16.0-351-g743e004674 GIT Date: 2021-10-29 09:32:23 -0500 [0]PETSC ERROR: ./ex3f90 on a arch-linux-c-debug named DESKTOP-9ITFSBM by hillyuan Mon Nov 1 00:26:39 2021 [0]PETSC ERROR: Configure options --with-cc=mpiicc --with-cxx=mpiicpc --with-fc=mpiifort --with-fortran-bindings=1 --with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.3.0 --with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.3.0 [0]PETSC ERROR: #1 DMRestoreWorkArray() at /home/hillyuan/programs/petsc/src/dm/interface/dm.c:1580 [0]PETSC ERROR: #2 DMPlexRestoreRawFaces_Internal() at /home/hillyuan/programs/petsc/src/dm/impls/plex/plexinterpolate.c:323 [0]PETSC ERROR: #3 DMPlexInterpolateFaces_Internal() at /home/hillyuan/programs/petsc/src/dm/impls/plex/plexinterpolate.c:375 [0]PETSC ERROR: #4 DMPlexInterpolate() at /home/hillyuan/programs/petsc/src/dm/impls/plex/plexinterpolate.c:1340 ----------------------------------------------------------------------------------------- I attached my test program in this mail. It is much appreciated that you could provide any suggestion. Best regards, Yuan 2021?10?31?(?) 21:16 Matthew Knepley : > On Thu, Oct 28, 2021 at 10:48 PM ?? wrote: > >> Dear Matt, >> >> My mesh is something like the following figure, which is composed of >> three elements : one hexahedron(solid element), one quadrilateral (shell >> element), and one line (beam element). I found the function "TestEmptyStrata" >> in file \dm\impls\plex\tests\ex11.c would be a good example to read in such >> a kind of mesh by using DMPlexSetCone. But a problem is that you should >> declare all faces and edges of hexahedron element, all edges in >> quadrilateral element by DMPlexSetCone, otherwise PETsc could not do >> topological interpolation afterwards. Am I right here? >> As general in FEM mesh, my mesh does not contain any information about faces >> or edges of solid elements. That's why I consider using DMCOMPOSITE. That is >> >> - Put hexahedron, quadrilateral, and line elements into different DM >> structures. >> - do topological interpolation in those DMs separately. >> - composite them. 
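Since the celltype label in the test output above does carry separate hexahedron and quadrilateral strata, the cell-type-driven loop that Matt suggests as a replacement for a height-stratum loop might be sketched as below; the function name, variable names, and the choice of DM_POLYTOPE_HEXAHEDRON are only illustrative.

```c
#include <petscdmplex.h>

/* Sketch only: visit all cells of one polytope type through the celltype
   label instead of a depth/height stratum. */
static PetscErrorCode VisitHexCells(DM dm)
{
  DMLabel         ctLabel;
  IS              hexIS;
  const PetscInt *hexes;
  PetscInt        nHex, i;
  PetscErrorCode  ierr;

  PetscFunctionBeginUser;
  ierr = DMPlexGetCellTypeLabel(dm, &ctLabel);CHKERRQ(ierr);
  ierr = DMLabelGetStratumIS(ctLabel, DM_POLYTOPE_HEXAHEDRON, &hexIS);CHKERRQ(ierr);
  if (hexIS) {
    ierr = ISGetLocalSize(hexIS, &nHex);CHKERRQ(ierr);
    ierr = ISGetIndices(hexIS, &hexes);CHKERRQ(ierr);
    for (i = 0; i < nHex; ++i) {
      /* hexes[i] is a hexahedral cell point; per-hex work would go here */
    }
    ierr = ISRestoreIndices(hexIS, &hexes);CHKERRQ(ierr);
    ierr = ISDestroy(&hexIS);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}
```

Whether such a loop is enough to make interpolation work for cells that are not embedded in a volume is a separate question; the DMPlexInterpolate() failure reported above suggests further changes inside the interpolation code itself would still be needed.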
>> >> Is there anything wrong in my above consideration? Any suggestions? >> >> ------------ >> /| /| >> / | / | cell 0: Hex >> / | / | >> ------------/ | >> | | | | >> | | | | cell 1: Quad >> | --------|---|------------ >> | / | / / >> | / | / / >> |/ |/ / >> ------------------------------------------- >> cell 2: line >> >> Much thanks for your help. >> > > If you are solving something where everything is embedded in a volumetric > mesh, then there is no problem. However, if you really have > the mesh above, where lower dimensional pieces are sticking out of the > mesh, then Plex can represent the mesh, but automatic interpolation > (creation of edges and faces) will not work. Why is this? We use depth in > the DAG as a proxy for cell dimension, but this will no longer work > if faces are not part of a volume. > > Will DMCOMPOSITE do what you want? It depends. It will be able to lay out > a vector, but it will not know about any topological connectivity > between the meshes and will not preallocate a Jacobian with any > interaction. If the meshes are truly separate, this is fine. If not, it is > not that > useful. > > Could you modify the existing code to support this? Yes, it would not be > terribly difficult. When you load the mesh, you must know what kind > of cell you are loading. You could explicitly set this using > DMPlexSetCellType(). Then, instead of taking a certain height stratum of > the DAG > to loop over, you would instead use all cells marked with a certain cell > type. The rest of the interpolation code should work fine. > > What kind of physics do you have where low dimensional features are not > embedded in the larger volume? > > Thanks, > > Matt > > >> Yuan >> >> 2021?10?28?(?) 22:05 Matthew Knepley : >> >>> On Thu, Oct 28, 2021 at 4:59 AM ?? wrote: >>> >>>> Dear Matt, >>>> >>>> Thank you for your quick response. >>>> >>>> I think what you mean is to build DAG from my mesh at first and then >>>> call DMPlexCreateFromDAG >>>> () >>>> to construct DMPlex. >>>> >>> >>> No, I do not mean that. >>> >>> >>>> A new problem is, as I know, the function DMPlexInterpolate would >>>> generate points with different depth. What's the difference between those >>>> faces and segment elements generated by DMPlexInterpolate with that >>>> defined by the original mesh, or should we not use DMPlexInterpolate in >>>> such a case? >>>> >>>> On the other hand, can DMComposite be used in this case by defining >>>> DMPlex with different topological dimensions at first and then composite >>>> them? >>>> >>> >>> You do not need that. I am obviously not understanding your question. My >>> short answer is that Plex _already_ handles cells of different >>> dimension automatically without anything extra. >>> >>> Maybe it would help if you defined a specific problem you have. >>> >>> Thanks, >>> >>> Matt >>> >>> >>>> Thanks in advance. >>>> >>>> Yuan >>>> >>>> >>>> 2021?10?27?(?) 19:27 Matthew Knepley : >>>> >>>>> On Wed, Oct 27, 2021 at 4:50 AM ?? wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I am trying to parallelize my serial FEM program using PETSc. This >>>>>> program calculates structure deformation by using various types of elements >>>>>> such as solid, shell, beam, and truss. At the very beginning, I found it >>>>>> was hard for me to put such kinds of elements into DMPlex. Because solid >>>>>> elements are topologically three dimensional, shell element two, and beam >>>>>> or truss are topologically one-dimensional elements. 
After reading chapter >>>>>> 2.10: "DMPlex: Unstructured Grids in PETSc" of users manual carefully, I >>>>>> found the provided functions, such as DMPlexSetCone, cannot declare those >>>>>> topological differences. >>>>>> >>>>>> My question is : Is it possible and how to define all those >>>>>> topologically different elements into a DMPlex struct? >>>>>> >>>>> >>>>> Yes. The idea is to program in a dimension-independent way, so that >>>>> the code can handle cells of any dimension. >>>>> What you probably want is the "depth" in the DAG representation, which >>>>> you can think of as the dimension of a cell. >>>>> >>>>> >>>>> https://petsc.org/main/docs/manualpages/DMPLEX/DMPlexGetPointDepth.html#DMPlexGetPointDepth >>>>> >>>>> Thanks, >>>>> >>>>> Matt >>>>> >>>>> >>>>>> Thanks in advance! >>>>>> >>>>>> Best regards, >>>>>> >>>>>> Yuan. >>>>>> >>>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before they begin their >>>>> experiments is infinitely more interesting than any results to which their >>>>> experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://www.cse.buffalo.edu/~knepley/ >>>>> >>>>> >>>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> >>> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: test_topomesh.f90 Type: application/octet-stream Size: 2988 bytes Desc: not available URL: From mfadams at lbl.gov Sun Oct 31 15:00:53 2021 From: mfadams at lbl.gov (Mark Adams) Date: Sun, 31 Oct 2021 16:00:53 -0400 Subject: [petsc-users] How to construct DMPlex of cells with different topological dimension? In-Reply-To: References: Message-ID: > >> Is there anything wrong in my above consideration? Any suggestions? >> >> ------------ >> /| /| >> / | / | cell 0: Hex >> / | / | >> ------------/ | >> | | | | >> | | | | cell 1: Quad >> | --------|---|------------ >> | / | / / >> | / | / / >> |/ |/ / >> ------------------------------------------- >> cell 2: line >> >> Much thanks for your help. >> > > If you are solving something where everything is embedded in a volumetric > mesh, then there is no problem. However, if you really have > the mesh above, where lower dimensional pieces are sticking out of the > mesh, then Plex can represent the mesh, but automatic interpolation > (creation of edges and faces) will not work. > Yuan: can you add a fake Hex over the hanging shell here? Give them zero stiffness and constrain the hanging nodes that the fake elements create. Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From yuanxi at advancesoft.jp Sun Oct 31 20:01:58 2021 From: yuanxi at advancesoft.jp (=?UTF-8?B?6KKB54WV?=) Date: Mon, 1 Nov 2021 10:01:58 +0900 Subject: [petsc-users] How to construct DMPlex of cells with different topological dimension? In-Reply-To: References: Message-ID: | Yuan: can you add a fake Hex over the hanging shell here? Give them zero stiffness and constrain the hanging nodes that the fake elements create. 
I don't think it is a good idea although it would work theoretically. It would be a disaster when you try to do calculations over several tens of millions of elements this way. Maybe I should give up automatic interpolation but call DMPlexDistribute directly after setting up Dofs upon each node. Is it possible? Yuan 2021?11?1?(?) 5:01 Mark Adams : > > >>> Is there anything wrong in my above consideration? Any suggestions? >>> >>> ------------ >>> /| /| >>> / | / | cell 0: Hex >>> / | / | >>> ------------/ | >>> | | | | >>> | | | | cell 1: Quad >>> | --------|---|------------ >>> | / | / / >>> | / | / / >>> |/ |/ / >>> ------------------------------------------- >>> cell 2: line >>> >>> Much thanks for your help. >>> >> >> If you are solving something where everything is embedded in a volumetric >> mesh, then there is no problem. However, if you really have >> the mesh above, where lower dimensional pieces are sticking out of the >> mesh, then Plex can represent the mesh, but automatic interpolation >> (creation of edges and faces) will not work. >> > > Yuan: can you add a fake Hex over the hanging shell here? Give them zero > stiffness and constrain the hanging nodes that the fake elements create. > Mark > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
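On the last question, distributing without interpolation appears possible in principle: DMPlexDistribute() redistributes whatever DAG points exist, so a cells-plus-vertices mesh with a nodal section can be partitioned directly. A minimal sketch follows, with illustrative names and an assumed 3 dofs per node; whether the default partitioner builds a sensible dual graph for a mixed-dimension, uninterpolated mesh is something to verify on a small case first.

```c
#include <petscdmplex.h>
#include <petscsf.h>

/* Sketch only: attach nodal dofs to an uninterpolated (cells + vertices)
   mesh and distribute it without ever creating faces or edges. */
static PetscErrorCode DistributeNodalMesh(DM dm, DM *dmDist)
{
  PetscSection   s;
  PetscSF        sfMigration;
  PetscInt       pStart, pEnd, vStart, vEnd, v;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscSectionCreate(PetscObjectComm((PetscObject)dm), &s);CHKERRQ(ierr);
  ierr = DMPlexGetChart(dm, &pStart, &pEnd);CHKERRQ(ierr);
  ierr = PetscSectionSetChart(s, pStart, pEnd);CHKERRQ(ierr);
  ierr = DMPlexGetDepthStratum(dm, 0, &vStart, &vEnd);CHKERRQ(ierr); /* depth 0 = vertices */
  for (v = vStart; v < vEnd; ++v) {ierr = PetscSectionSetDof(s, v, 3);CHKERRQ(ierr);} /* e.g. 3 dofs per node */
  ierr = PetscSectionSetUp(s);CHKERRQ(ierr);
  ierr = DMSetLocalSection(dm, s);CHKERRQ(ierr);
  ierr = PetscSectionDestroy(&s);CHKERRQ(ierr);

  /* Distribution works on the DAG as given; no interpolation is required.
     On a single rank DMPlexDistribute() may return a NULL DM. */
  ierr = DMPlexDistribute(dm, 0, &sfMigration, dmDist);CHKERRQ(ierr);
  if (sfMigration) {ierr = PetscSFDestroy(&sfMigration);CHKERRQ(ierr);}
  PetscFunctionReturn(0);
}
```

Shell rotations or beam-specific dofs would only change the per-vertex dof count; the point is just that neither the section nor the distribution requires interpolated faces or edges.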