[petsc-users] Inquiry about Multithreading Capabilities in PETSc's KSPSolver

Yongzhong Li yongzhong.li at mail.utoronto.ca
Mon Apr 22 15:06:28 CDT 2024


Hello all,



I am writing to ask whether PETSc's KSP solver makes use of OpenMP/multithreading, specifically when performing iterative solutions with the GMRES algorithm.



The question arose while I was running a large numerical program based on the boundary element method. I use PETSc's GMRES algorithm in KSPSolve to solve a shell-matrix system iteratively. I observed that threads were being utilized, controlled by the OPENBLAS_NUM_THREADS environment variable; however, I noticed no significant performance difference between running the solver with multiple threads and with a single thread.
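For reference, the shell matrix is hooked into KSPSolve following the standard MATSHELL pattern, roughly as in the minimal sketch below (the mat-vec here is a trivial stand-in for my actual BEM operator, and the preconditioner is set to none because a shell matrix cannot be factored):

#include <petscksp.h>

/* Stand-in mat-vec: y = 2*x. The real code applies the BEM operator here. */
static PetscErrorCode UserMult(Mat A, Vec x, Vec y) {
    PetscErrorCode ierr;
    ierr = VecCopy(x, y); CHKERRQ(ierr);
    ierr = VecScale(y, 2.0); CHKERRQ(ierr);
    return 0;
}

int main(int argc, char **args) {
    Mat A;
    Vec b, x;
    KSP ksp;
    PC pc;
    PetscInt N = 100;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &args, NULL, NULL); if (ierr) return ierr;
    ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, N, &b); CHKERRQ(ierr);
    ierr = VecDuplicate(b, &x); CHKERRQ(ierr);
    ierr = VecSet(b, 1.0); CHKERRQ(ierr);

    // The matrix is never assembled; KSP only calls the user-provided mat-vec
    ierr = MatCreateShell(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, N, N, NULL, &A); CHKERRQ(ierr);
    ierr = MatShellSetOperation(A, MATOP_MULT, (void (*)(void))UserMult); CHKERRQ(ierr);

    ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
    ierr = KSPSetType(ksp, KSPGMRES); CHKERRQ(ierr);
    ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
    ierr = PCSetType(pc, PCNONE); CHKERRQ(ierr); // no factorization possible for MATSHELL
    ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);

    ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
    ierr = MatDestroy(&A); CHKERRQ(ierr);
    ierr = VecDestroy(&x); CHKERRQ(ierr);
    ierr = VecDestroy(&b); CHKERRQ(ierr);
    PetscFinalize();
    return 0;
}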

Could you please confirm whether GMRES in KSPSolve leverages multithreading, and whether it is influenced by the multithreading of the low-level math libraries such as BLAS and LAPACK? If so, how can I enable multithreading effectively to see noticeable improvements in solution time when using GMRES? If not, why do I observe threads being used during the GMRES solves?
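As a side note, in case it helps the diagnosis: I assume the number of threads OpenBLAS actually uses can be probed directly through its extension functions, roughly as in the standalone sketch below (this assumes linking directly against OpenBLAS, whose cblas.h declares the openblas_* extensions):

#include <stdio.h>
#include <cblas.h>   /* OpenBLAS's cblas.h declares the openblas_* extensions */

int main(void) {
    /* Reports the thread count OpenBLAS is configured to use
       (normally governed by OPENBLAS_NUM_THREADS). */
    printf("OpenBLAS reports %d threads\n", openblas_get_num_threads());

    /* Runtime override, equivalent in effect to setting the environment variable. */
    openblas_set_num_threads(4);
    printf("OpenBLAS now reports %d threads\n", openblas_get_num_threads());
    return 0;
}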



For reference, I am using PETSc version 3.16.0, configured as follows (the configure command is invoked from my CMakeLists):

./configure PETSC_ARCH=config-release --with-scalar-type=complex --with-fortran-kernels=1 --with-debugging=0 COPTFLAGS="-O3 -march=native" CXXOPTFLAGS="-O3 -march=native" FOPTFLAGS="-O3 -march=native" --with-cxx=g++ --download-openmpi --download-superlu --download-opencascade --with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB} --with-threadsafety --with-log=0 --with-openmp

To simplify the diagnosis of potential issues, I have also written a small example program that uses GMRES to solve a sparse matrix system derived from a 2D Poisson problem discretized with the finite difference method. I observe the same behavior with this code. The code is as follows:

#include <petscksp.h>

/* Monitor function to print iteration number and residual norm */
PetscErrorCode MyKSPMonitor(KSP ksp, PetscInt n, PetscReal rnorm, void *ctx) {
    PetscErrorCode ierr;
    ierr = PetscPrintf(PETSC_COMM_WORLD, "Iteration %D, Residual norm %g\n", n, (double)rnorm);
    CHKERRQ(ierr);
    return 0;
}

int main(int argc, char **args) {
    Vec x, b, x_true, e;
    Mat A;
    KSP ksp;
    PetscErrorCode ierr;
    PetscInt i, j, Ii, J, n = 500; // Size of the grid n x n
    PetscInt Istart, Iend, ncols;
    PetscScalar v;
    PetscMPIInt rank;
    ierr = PetscInitialize(&argc, &args, NULL, NULL); if (ierr) return ierr;
    PetscLogDouble t1, t2;     // Variables for timing
    MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

    // Create vectors and matrix
    ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n*n, &x); CHKERRQ(ierr);
    ierr = VecDuplicate(x, &b); CHKERRQ(ierr);
    ierr = VecDuplicate(x, &x_true); CHKERRQ(ierr);

    // Set true solution as all ones
    ierr = VecSet(x_true, 1.0); CHKERRQ(ierr);

    // Create and assemble matrix A for the 2D Laplacian using 5-point stencil
    ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
    ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n*n, n*n); CHKERRQ(ierr);
    ierr = MatSetFromOptions(A); CHKERRQ(ierr);
    ierr = MatSetUp(A); CHKERRQ(ierr);
    ierr = MatGetOwnershipRange(A, &Istart, &Iend); CHKERRQ(ierr);
    for (Ii = Istart; Ii < Iend; Ii++) {
        i = Ii / n; // Row index
        j = Ii % n; // Column index
        v = -4.0;
        ierr = MatSetValues(A, 1, &Ii, 1, &Ii, &v, INSERT_VALUES); CHKERRQ(ierr);
        if (i > 0) { // South
            J = Ii - n;
            v = 1.0;
            ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
        }
        if (i < n - 1) { // North
            J = Ii + n;
            v = 1.0;
            ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
        }
        if (j > 0) { // West
            J = Ii - 1;
            v = 1.0;
            ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
        }
        if (j < n - 1) { // East
            J = Ii + 1;
            v = 1.0;
            ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES); CHKERRQ(ierr);
        }
    }
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

    // Compute the RHS corresponding to the true solution
    ierr = MatMult(A, x_true, b); CHKERRQ(ierr);

    // Set up and solve the linear system
    ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
    ierr = KSPSetType(ksp, KSPGMRES); CHKERRQ(ierr);
    ierr = KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT); CHKERRQ(ierr);

    /* Set up the monitor */
    ierr = KSPMonitorSet(ksp, MyKSPMonitor, NULL, NULL); CHKERRQ(ierr);

    // Start timing
    ierr = PetscTime(&t1); CHKERRQ(ierr);

    ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);

    // Stop timing
    ierr = PetscTime(&t2); CHKERRQ(ierr);

    // Compute error
    ierr = VecDuplicate(x, &e); CHKERRQ(ierr);
    ierr = VecWAXPY(e, -1.0, x_true, x); CHKERRQ(ierr);
    PetscReal norm_error, norm_true;
    ierr = VecNorm(e, NORM_2, &norm_error); CHKERRQ(ierr);
    ierr = VecNorm(x_true, NORM_2, &norm_true); CHKERRQ(ierr);
    PetscReal relative_error = norm_error / norm_true;
    if (rank == 0) { // Print only from the first MPI process
        PetscPrintf(PETSC_COMM_WORLD, "Relative error ||x - x_true||_2 / ||x_true||_2: %g\n", (double)relative_error);
    }

    // Output the wall time taken for KSPSolve
    PetscPrintf(PETSC_COMM_WORLD, "Time taken for KSPSolve: %f seconds\n", t2 - t1);

    // Cleanup
    ierr = VecDestroy(&x); CHKERRQ(ierr);
    ierr = VecDestroy(&b); CHKERRQ(ierr);
    ierr = VecDestroy(&x_true); CHKERRQ(ierr);
    ierr = VecDestroy(&e); CHKERRQ(ierr);
    ierr = MatDestroy(&A); CHKERRQ(ierr);
    ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
    PetscFinalize();
    return 0;
}

Here are some profiling results for the GMRES solution (solution time is wall-clock time in seconds):

OPENBLAS_NUM_THREADS = 1,  iteration steps = 859, solution time = 16.1
OPENBLAS_NUM_THREADS = 2,  iteration steps = 859, solution time = 16.3
OPENBLAS_NUM_THREADS = 4,  iteration steps = 859, solution time = 16.7
OPENBLAS_NUM_THREADS = 8,  iteration steps = 859, solution time = 16.8
OPENBLAS_NUM_THREADS = 16, iteration steps = 859, solution time = 17.8

I am using a single workstation with an Intel® Core™ i9-11900K processor (8 cores, 16 threads). Note that I am not launching multiple MPI processes via mpirun/mpiexec, so the number of MPI processes should default to 1; please correct me if I am wrong.
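To double-check this, the example could also report the communicator size directly, e.g. with something like:

    PetscMPIInt size;
    MPI_Comm_size(PETSC_COMM_WORLD, &size);   // should report 1 when run without mpirun/mpiexec
    PetscPrintf(PETSC_COMM_WORLD, "Number of MPI processes: %d\n", size);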



Thank you in advance!

Sincerely,

Yongzhong

-----------------------------------------------------------
Yongzhong Li
PhD student | Electromagnetics Group
Department of Electrical & Computer Engineering
University of Toronto
http://www.modelics.org
