[petsc-users] [petsc-maint] Inquiry about Multithreading Capabilities in PETSc's KSPSolver

Matthew Knepley knepley at gmail.com
Tue Apr 23 17:34:17 CDT 2024


On Tue, Apr 23, 2024 at 4:00 PM Yongzhong Li <yongzhong.li at mail.utoronto.ca>
wrote:

> Thanks Barry! Does this mean that the sparse matrix-vector products, which
> actually constitute the majority of the computations in my GMRES routine in
> PETSc, don’t utilize multithreading? Only basic vector operations such as
> VecAXPY and VecDot or dense matrix operations in PETSc will benefit from
> multithreading, is it correct?
>

I am not sure what your point is above.

SpMV performance is mainly controlled by memory bandwidth (latency plays
very little role). If you think the MPI processes are not using the full
bandwidth, use more processes. If they are, and you think using threads
will speed anything up, you are incorrect. In fact, the only difference
between a thread and a process is the default memory sharing flag; MPI
will perform at least as well as (and usually better than) adding threads
to SpMV.

There are dozens of publications showing this.
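
One concrete way to check whether the ranks already saturate the memory
bandwidth (a minimal sketch; the executable name below is just a placeholder
for your application) is to run PETSc's STREAMS benchmark and then scale the
number of MPI processes on the actual solve:

  # measure achievable memory bandwidth as the MPI process count grows
  cd $PETSC_DIR && make streams NPMAX=8

  # compare solve times with more processes instead of more threads
  mpiexec -n 2 ./poisson_gmres -log_view
  mpiexec -n 4 ./poisson_gmres -log_view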

  Thanks,

     Matt


> Best,
>
> Yongzhong
>
>
>
> *From: *Barry Smith <bsmith at petsc.dev>
> *Date: *Tuesday, April 23, 2024 at 3:35 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
> piero.triverio at utoronto.ca>
> *Subject: *Re: [petsc-maint] Inquiry about Multithreading Capabilities in
> PETSc's KSPSolver
>
>
>
>
>    Intel MKL or OpenBLAS are the best bet, but for vector operations the
> benefit will not be significant, since the vector operations do not dominate
> the computations.
>
>
>
> On Apr 23, 2024, at 3:23 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
>
>
> Hi Barry,
>
> Thank you for the information provided!
>
> Do you think a different BLAS implementation will affect the multithreading
> performance of the vector operations in GMRES in PETSc?
>
>
>
> I am now using OpenBLAS but didn’t see much improvement when multithreading
> is enabled. Do you think other implementations such as Netlib or Intel MKL
> will help?
>
> Best,
>
> Yongzhong
>
>
>
> *From: *Barry Smith <bsmith at petsc.dev>
> *Date: *Monday, April 22, 2024 at 4:20 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
> piero.triverio at utoronto.ca>
> *Subject: *Re: [petsc-maint] Inquiry about Multithreading Capabilities in
> PETSc's KSPSolver
>
>
>
>
>    PETSc-provided solvers do not directly use threads.
>
>
>
>    The BLAS used by LAPACK and PETSc may use threads depending on what
> BLAS is being used and how it was configured.
>
>
>
>    Some of the vector operations in GMRES in PETSc use BLAS that can use
> threads, including axpy, dot, etc. For sufficiently large problems, the use
> of threaded BLAS can help with these routines, but not significantly for
> the solver.
>
>
>
>    Dense matrix-vector products MatMult() and dense matrix direct solvers
> PCLU use BLAS and thus can benefit from threading. The benefit can be
> significant for large enough problems with good hardware, especially with
> PCLU.
>
>
>
>    If you run with -blas_view, PETSc tries to indicate information about
> the threading of the BLAS. You can also use -blas_num_threads <n> to set the
> number of threads, equivalent to setting the environment variable. For
> dense solvers you can vary the number of threads and run with -log_view to
> see what it helps to improve and what it does not affect.
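>
>    For example (a minimal sketch; the executable name here is just a
> placeholder), one could compare runs such as:
>
>    ./poisson_gmres -blas_view -log_view            # report how the BLAS is threaded, plus timings
>    ./poisson_gmres -blas_num_threads 4 -log_view   # same run, allowing 4 BLAS threads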
>
>
>
>
>
>
>
> On Apr 22, 2024, at 4:06 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
>
>
> Hello all,
>
>
>
> I am writing to ask if PETSc’s KSPSolver makes use of
> OpenMP/multithreading, specifically when performing iterative solutions
> with the GMRES algorithm.
>
>
>
> The question arose when I was running a large numerical program based on
> the boundary element method. I used PETSc’s GMRES algorithm in KSPSolve to
> solve a shell matrix system iteratively. I observed that threads were being
> utilized, controlled by the *OPENBLAS_NUM_THREADS* environment variable.
> However, I noticed no significant performance difference between running
> the solver with multiple threads versus a single thread.
>
> Could you please *confirm whether GMRES in KSPSolve leverages multithreading,
> and also whether it is influenced by the multithreading of the low-level
> math libraries such as BLAS and LAPACK?* *If so*, how can I enable
> multithreading effectively to see noticeable improvements in solution times
> when using GMRES? *If not*, why do I observe that threads are being used
> during the GMRES solutions?
>
>
>
> *For reference, I am using PETSc version 3.16.0, configured from my
> CMakeLists as follows:*
>
>
> ./configure PETSC_ARCH=config-release --with-scalar-type=complex
> --with-fortran-kernels=1 --with-debugging=0 COPTFLAGS="-O3 -march=native"
> CXXOPTFLAGS="-O3 -march=native" FOPTFLAGS="-O3 -march=native" --with-cxx=g++
> --download-openmpi --download-superlu --download-opencascade
> --with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB}
> --with-threadsafety --with-log=0 --with-openmp
>
> To simplify the diagnosis of potential issues, I have also written a small
> example program using GMRES to solve a sparse matrix system derived from a
> 2D Poisson problem using the finite difference method. I found similar
> issues with this piece of code. The code is as follows:
>
> #include <petscksp.h>
>
> /* Monitor function to print iteration number and residual norm */
> PetscErrorCode MyKSPMonitor(KSP ksp, PetscInt n, PetscReal rnorm, void *ctx) {
>     PetscErrorCode ierr;
>     ierr = PetscPrintf(PETSC_COMM_WORLD, "Iteration %D, Residual norm %g\n", n, (double)rnorm);
>     CHKERRQ(ierr);
>     return 0;
> }
>
> int main(int argc, char **args) {
>     Vec x, b, x_true, e;
>     Mat A;
>     KSP ksp;
>     PetscErrorCode ierr;
>     PetscInt i, j, Ii, J, n = 500; // Size of the grid n x n
>     PetscInt Istart, Iend, ncols;
>     PetscScalar v;
>     PetscMPIInt rank;
>     PetscInitialize(&argc, &args, NULL, NULL);
>     PetscLogDouble t1, t2;     // Variables for timing
>     MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
>
>     // Create vectors and matrix
>     ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n*n, &x);
> CHKERRQ(ierr);
>     ierr = VecDuplicate(x, &b); CHKERRQ(ierr);
>     ierr = VecDuplicate(x, &x_true); CHKERRQ(ierr);
>
>     // Set true solution as all ones
>     ierr = VecSet(x_true, 1.0); CHKERRQ(ierr);
>
>     // Create and assemble matrix A for the 2D Laplacian using 5-point
> stencil
>     ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
>     ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n*n, n*n);
> CHKERRQ(ierr);
>     ierr = MatSetFromOptions(A); CHKERRQ(ierr);
>     ierr = MatSetUp(A); CHKERRQ(ierr);
>     ierr = MatGetOwnershipRange(A, &Istart, &Iend); CHKERRQ(ierr);
>     for (Ii = Istart; Ii < Iend; Ii++) {
>         i = Ii / n; // Row index
>         j = Ii % n; // Column index
>         v = -4.0;
>         ierr = MatSetValues(A, 1, &Ii, 1, &Ii, &v, INSERT_VALUES);
> CHKERRQ(ierr);
>         if (i > 0) { // South
>             J = Ii - n;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
> CHKERRQ(ierr);
>         }
>         if (i < n - 1) { // North
>             J = Ii + n;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
> CHKERRQ(ierr);
>         }
>         if (j > 0) { // West
>             J = Ii - 1;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
> CHKERRQ(ierr);
>         }
>         if (j < n - 1) { // East
>             J = Ii + 1;
>             v = 1.0;
>             ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
> CHKERRQ(ierr);
>         }
>     }
>     ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
>     ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
>
>     // Compute the RHS corresponding to the true solution
>     ierr = MatMult(A, x_true, b); CHKERRQ(ierr);
>
>     // Set up and solve the linear system
>     ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
>     ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
>     ierr = KSPSetType(ksp, KSPGMRES); CHKERRQ(ierr);
>     ierr = KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT,
> PETSC_DEFAULT); CHKERRQ(ierr);
>
>     /* Set up the monitor */
>     ierr = KSPMonitorSet(ksp, MyKSPMonitor, NULL, NULL); CHKERRQ(ierr);
>
>     // Start timing
>     PetscTime(&t1);
>
>     ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
>
>     // Stop timing
>     PetscTime(&t2);
>
>     // Compute error
>     ierr = VecDuplicate(x, &e); CHKERRQ(ierr);
>     ierr = VecWAXPY(e, -1.0, x_true, x); CHKERRQ(ierr);
>     PetscReal norm_error, norm_true;
>     ierr = VecNorm(e, NORM_2, &norm_error); CHKERRQ(ierr);
>     ierr = VecNorm(x_true, NORM_2, &norm_true); CHKERRQ(ierr);
>     PetscReal relative_error = norm_error / norm_true;
>     if (rank == 0) { // Print only from the first MPI process
>         PetscPrintf(PETSC_COMM_WORLD, "Relative error ||x - x_true||_2 / ||x_true||_2: %g\n", (double)relative_error);
>     }
>
>     // Output the wall time taken for KSPSolve
>     PetscPrintf(PETSC_COMM_WORLD, "Time taken for KSPSolve: %f seconds\n", t2 - t1);
>
>     // Cleanup
>     ierr = VecDestroy(&x); CHKERRQ(ierr);
>     ierr = VecDestroy(&b); CHKERRQ(ierr);
>     ierr = VecDestroy(&x_true); CHKERRQ(ierr);
>     ierr = VecDestroy(&e); CHKERRQ(ierr);
>     ierr = MatDestroy(&A); CHKERRQ(ierr);
>     ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
>     PetscFinalize();
>     return 0;
> }
>
> Here are some profiling results for the GMRES solution (times in seconds):
>
> OPENBLAS_NUM_THREADS = 1,  iteration steps = 859, solution time = 16.1
> OPENBLAS_NUM_THREADS = 2,  iteration steps = 859, solution time = 16.3
> OPENBLAS_NUM_THREADS = 4,  iteration steps = 859, solution time = 16.7
> OPENBLAS_NUM_THREADS = 8,  iteration steps = 859, solution time = 16.8
> OPENBLAS_NUM_THREADS = 16, iteration steps = 859, solution time = 17.8
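>
> Each of the runs above was driven along these lines (a minimal sketch; the
> executable name is just a placeholder):
>
> export OPENBLAS_NUM_THREADS=4   # number of threads OpenBLAS may use
> ./poisson_gmres                 # single MPI process (no mpirun/mpiexec)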
>
>
> *I am using one workstation with an Intel® Core™ i9-11900K processor (8
> cores, 16 threads). Note that I am not launching multiple MPI processes with
> mpirun/mpiexec, so the number of MPI processes should default to 1; correct
> me if I am wrong.*
>
>
>
> Thank you in advance!
>
> Sincerely,
>
> Yongzhong
>
>
>
> -----------------------------------------------------------
>
> *Yongzhong Li*
>
> PhD student | Electromagnetics Group
>
> Department of Electrical & Computer Engineering
>
> University of Toronto
>
> http://www.modelics.org
>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

