[petsc-users] [petsc-maint] Inquiry about Multithreading Capabilities in PETSc's KSPSolver

Junchao Zhang junchao.zhang at gmail.com
Tue Apr 23 16:13:04 CDT 2024


No: sparse matrix-vector products (MatMult in PETSc) can be accelerated with
multithreading.  PETSc itself does not do that, but one can use other
libraries, such as MKL, for it.
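
For example, as a minimal sketch (assuming PETSc was configured against Intel
MKL so that the MATAIJMKL matrix type is available; the routine name here is
just a placeholder):

#include <petscmat.h>

/* Sketch: convert an assembled AIJ matrix to the MKL-backed type so that
   MatMult() goes through MKL's sparse BLAS, whose threading is controlled
   by MKL (e.g. via MKL_NUM_THREADS). */
PetscErrorCode UseMKLSparseMatMult(Mat A)
{
  PetscErrorCode ierr;
  ierr = MatConvert(A, MATAIJMKL, MAT_INPLACE_MATRIX, &A); CHKERRQ(ierr);
  return 0;
}

The same switch can be made at runtime with -mat_type aijmkl when the matrix
type is chosen through MatSetFromOptions().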

--Junchao Zhang


On Tue, Apr 23, 2024 at 3:00 PM Yongzhong Li <yongzhong.li at mail.utoronto.ca>
wrote:

> Thanks Barry! Does this mean that the sparse matrix-vector products, which
> actually constitute the majority of the computations in my GMRES routine in
> PETSc, don’t utilize multithreading? Only basic vector operations such as
> VecAXPY and VecDot or dense matrix operations in PETSc will benefit from
> multithreading, is that correct?
>
> Best,
>
> Yongzhong
>
>
>
> *From: *Barry Smith <bsmith at petsc.dev>
> *Date: *Tuesday, April 23, 2024 at 3:35 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
> piero.triverio at utoronto.ca>
> *Subject: *Re: [petsc-maint] Inquiry about Multithreading Capabilities in
> PETSc's KSPSolver
>
>
>
>
>    Intel MKL or OpenBLAS are the best bet, but for vector operations the
> gains will not be significant, since the vector operations do not dominate
> the computations.
>
>
>
> On Apr 23, 2024, at 3:23 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
>
>
> Hi Barry,
>
> Thank you for the information provided!
>
> Do you think a different BLAS implementation will affect the
> multithreading performance of some vector operations in GMRES in PETSc?
>
>
>
> I am now using OpenBLAS but didn’t see much improvement when
> multithreading is enabled. Do you think another implementation such as
> Netlib or Intel MKL will help?
>
> Best,
>
> Yongzhong
>
>
>
> *From: *Barry Smith <bsmith at petsc.dev>
> *Date: *Monday, April 22, 2024 at 4:20 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
> piero.triverio at utoronto.ca>
> *Subject: *Re: [petsc-maint] Inquiry about Multithreading Capabilities in
> PETSc's KSPSolver
>
>
>
>
>    PETSc provided solvers do not directly use threads.
>
>
>
>    The BLAS used by LAPACK and PETSc may use threads depending on what
> BLAS is being used and how it was configured.
>
>
>
>    Some of the vector operations in GMRES in PETSc use BLAS that can use
> threads, including axpy, dot, etc. For sufficiently large problems, the use
> of threaded BLAS can help with these routines, but not significantly for
> the solver.
>
>
>
>    Dense matrix-vector products MatMult() and dense matrix direct solvers
> PCLU use BLAS and thus can benefit from threading. The benefit can be
> significant for large enough problems with good hardware, especially with
> PCLU.
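>
>    As a minimal sketch of the kind of solve where a threaded BLAS/LAPACK
> pays off (the routine and its arguments are placeholders, not code from
> this thread; A is assumed to be an assembled MATSEQDENSE matrix):
>
> #include <petscksp.h>
>
> /* Sketch: a sequential dense LU solve; the factorization and the
>    triangular solves call LAPACK/BLAS, so a threaded BLAS can speed them
>    up for sufficiently large problems. */
> PetscErrorCode DenseLUSolve(Mat A, Vec b, Vec x)
> {
>   KSP            ksp;
>   PC             pc;
>   PetscErrorCode ierr;
>
>   ierr = KSPCreate(PETSC_COMM_SELF, &ksp); CHKERRQ(ierr);
>   ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
>   ierr = KSPSetType(ksp, KSPPREONLY); CHKERRQ(ierr); /* direct solve, no Krylov iterations */
>   ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
>   ierr = PCSetType(pc, PCLU); CHKERRQ(ierr);         /* dense LU through LAPACK */
>   ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
>   ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
>   return 0;
> }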
>
>
>
>    If you run with -blas_view, PETSc tries to report information about the
> threading of the BLAS. You can also use -blas_num_threads <n> to set the
> number of threads, equivalent to setting the environment variable.  For
> dense solvers you can vary the number of threads and run with -log_view to
> see what it helps to improve and what it does not affect.
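>
>    As a small illustration (a sketch, assuming a PETSc version that
> provides PetscBLASGetNumThreads()/PetscBLASSetNumThreads(); older releases
> such as 3.16 may not have them, in which case the OPENBLAS_NUM_THREADS
> environment variable or -blas_num_threads is the way to control this):
>
> #include <petscsys.h>
>
> /* Sketch: report the current BLAS thread count and change it from within
>    the program, mirroring what -blas_num_threads <n> does on the command
>    line. */
> PetscErrorCode SetBlasThreads(PetscInt nthreads)
> {
>   PetscInt       current;
>   PetscErrorCode ierr;
>
>   ierr = PetscBLASGetNumThreads(&current); CHKERRQ(ierr);
>   ierr = PetscPrintf(PETSC_COMM_WORLD, "BLAS threads: %" PetscInt_FMT "\n", current); CHKERRQ(ierr);
>   ierr = PetscBLASSetNumThreads(nthreads); CHKERRQ(ierr);
>   return 0;
> }
>
> Varying the thread count this way and comparing the factorization and solve
> times reported by -log_view shows where the threaded BLAS actually helps.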
>
>
>
>
>
>
>
> On Apr 22, 2024, at 4:06 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
>
>
>
> Hello all,
>
>
>
> I am writing to ask if PETSc’s KSPSolver makes use of
> OpenMP/multithreading, specifically when performing iterative solutions
> with the GMRES algorithm.
>
>
>
> The question arose when I was running a large numerical program based on
> the boundary element method. I used PETSc's GMRES algorithm in KSPSolve to
> solve a shell-matrix system iteratively. I observed that threads were being
> utilized, controlled by the *OPENBLAS_NUM_THREADS* environment variable.
> However, I noticed no significant performance difference between running
> the solver with multiple threads versus a single thread.
>
> Could you please *confirm whether GMRES in KSPSolve leverages
> multithreading, and also whether it is influenced by the multithreading of
> low-level math libraries such as BLAS and LAPACK?* *If so*, how can I
> enable multithreading effectively to see noticeable improvements in
> solution times when using GMRES? *If not*, why do I observe that threads
> are being used during the GMRES solution?
>
>
>
> *For reference, I am using PETSc version 3.16.0, configured in CMakelists
> as follows:*
>
>
> ./configure PETSC_ARCH=config-release --with-scalar-type=complex
> --with-fortran-kernels=1 --with-debugging=0 COPTFLAGS=-O3 -march=native
> CXXOPTFLAGS=-O3 -march=native FOPTFLAGS=-O3 -march=native --with-cxx=g++
> --download-openmpi --download-superlu --download-opencascade
> --with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB}
> --with-threadsafety --with-log=0 --with-openmp
>
> To simplify the diagnosis of potential issues, I have also written a small
> example program that uses GMRES to solve a sparse matrix system derived
> from a 2D Poisson problem discretized with the finite difference method. I
> found similar issues with this piece of code. The code is as follows:
>
> #include <petscksp.h>
>
> /* Monitor function to print iteration number and residual norm */
> PetscErrorCode MyKSPMonitor(KSP ksp, PetscInt n, PetscReal rnorm, void
> *ctx) {
>     PetscErrorCode ierr;
>     ierr = PetscPrintf(PETSC_COMM_WORLD, "Iteration %D, Residual norm %g\n", n, (double)rnorm);
>     CHKERRQ(ierr);
>     return 0;
> }
>
> int main(int argc, char **args) {
>     Vec x, b, x_true, e;
>     Mat A;
>     KSP ksp;
>     PetscErrorCode ierr;
>     PetscInt i, j, Ii, J, n = 500; // Size of the grid n x n
>     PetscInt Istart, Iend, ncols;
>     PetscScalar v;
>     PetscMPIInt rank;
>     PetscInitialize(&argc, &args, NULL, NULL);
>     PetscLogDouble t1, t2;     // Variables for timing
>     MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
>
>     // Create vectors and matrix
>     ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n*n, &x);
> CHKERRQ(ierr);
>     ierr = VecDuplicate(x, &b); CHKERRQ(ierr);
>     ierr = VecDuplicate(x, &x_true); CHKERRQ(ierr);
>
>     // Set true solution as all ones
>     ierr = VecSet(x_true, 1.0); CHKERRQ(ierr);
>
>     // Create and assemble matrix A for the 2D Laplacian using 5-point
> stencil
>     ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
>     ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n*n, n*n);
> CHKERRQ(ierr);
>     ierr = MatSetFromOptions(A); CHKERRQ(ierr);
>     ierr = MatSetUp(A); CHKERRQ(ierr);
>     ierr = MatGetOwnershipRange(A, &Istart, &Iend); CHKERRQ(ierr);
>     for (Ii = Istart; Ii < Iend; Ii++) {
>         i = Ii / n; // Row index
>         j = Ii % n; // Column index
>         v = -4.0;
>         ierr = MatSetValues(A, 1, &Ii, 1, &Ii, &v, INSERT_VALUES);
> CHKERRQ(ierr);
>         if (i > 0) { // South
>             J = Ii - n;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
> CHKERRQ(ierr);
>         }
>         if (i < n - 1) { // North
>             J = Ii + n;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
> CHKERRQ(ierr);
>         }
>         if (j > 0) { // West
>             J = Ii - 1;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
> CHKERRQ(ierr);
>         }
>         if (j < n - 1) { // East
>             J = Ii + 1;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
> CHKERRQ(ierr);
>         }
>     }
>     ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
>     ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
>
>     // Compute the RHS corresponding to the true solution
>     ierr = MatMult(A, x_true, b); CHKERRQ(ierr);
>
>     // Set up and solve the linear system
>     ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
>     ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
>     ierr = KSPSetType(ksp, KSPGMRES); CHKERRQ(ierr);
>     ierr = KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT,
> PETSC_DEFAULT); CHKERRQ(ierr);
>
>     /* Set up the monitor */
>     ierr = KSPMonitorSet(ksp, MyKSPMonitor, NULL, NULL); CHKERRQ(ierr);
>
>     // Start timing
>     PetscTime(&t1);
>
>     ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
>
>     // Stop timing
>     PetscTime(&t2);
>
>     // Compute error
>     ierr = VecDuplicate(x, &e); CHKERRQ(ierr);
>     ierr = VecWAXPY(e, -1.0, x_true, x); CHKERRQ(ierr);
>     PetscReal norm_error, norm_true;
>     ierr = VecNorm(e, NORM_2, &norm_error); CHKERRQ(ierr);
>     ierr = VecNorm(x_true, NORM_2, &norm_true); CHKERRQ(ierr);
>     PetscReal relative_error = norm_error / norm_true;
>     if (rank == 0) { // Print only from the first MPI process
>         PetscPrintf(PETSC_COMM_WORLD, "Relative error ||x - x_true||_2 / ||x_true||_2: %g\n", (double)relative_error);
>     }
>
>     // Output the wall time taken for KSPSolve
>     PetscPrintf(PETSC_COMM_WORLD, "Time taken for KSPSolve: %f seconds\n", t2 - t1);
>
>     // Cleanup
>     ierr = VecDestroy(&x); CHKERRQ(ierr);
>     ierr = VecDestroy(&b); CHKERRQ(ierr);
>     ierr = VecDestroy(&x_true); CHKERRQ(ierr);
>     ierr = VecDestroy(&e); CHKERRQ(ierr);
>     ierr = MatDestroy(&A); CHKERRQ(ierr);
>     ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
>     PetscFinalize();
>     return 0;
> }
>
> Here are some profiling results for the GMRES solution:
>
> OPENBLAS_NUM_THREADS = 1, iteration steps  = 859, solution time = 16.1
> OPENBLAS_NUM_THREADS = 2, iteration steps  = 859, solution time = 16.3
>
> OPENBLAS_NUM_THREADS = 4, iteration steps  = 859, solution time = 16.7
>
> OPENBLAS_NUM_THREADS = 8, iteration steps  = 859, solution time = 16.8
>
> OPENBLAS_NUM_THREADS = 16, iteration steps  = 859, solution time = 17.8
>
>
> *I am using one workstation with an Intel® Core™ i9-11900K processor (8
> cores, 16 threads). Note that I am not launching multiple MPI processes
> (no mpirun/mpiexec), so the number of MPI processes should default to 1;
> correct me if I am wrong.*
>
>
>
> Thank you in advance!
>
> Sincerely,
>
> Yongzhong
>
>
>
> -----------------------------------------------------------
>
> *Yongzhong Li*
>
> PhD student | Electromagnetics Group
>
> Department of Electrical & Computer Engineering
>
> University of Toronto
>
> http://www.modelics.org
>
>
>

