[petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
Junchao Zhang
junchao.zhang at gmail.com
Wed Jun 26 10:12:58 CDT 2024
Yongzhong,
Try Barry's approach first. BTW, I ran another petsc test. You can see
GEMV was used in KSPSolve. You could also try this one.
$ cd src/ksp/ksp/tutorials
$ make bench_kspsolve
$ MKL_VERBOSE=1 OMP_PROC_BIND=spread MKL_NUM_THREADS=8 ./bench_kspsolve
-split_ksp -mat_type aijmkl
===========================================
Test: KSP performance - Poisson
Input matrix: 27-pt finite difference stencil
-n 100
DoFs = 1000000
Number of nonzeros = 26463592
Step1 - creating Vecs and Mat...
Step2a - running PCSetUp()...
Step2b - running KSPSolve()...
MKL_VERBOSE oneMKL 2022.0 Product build 20211112 for Intel(R) 64
architecture Intel(R) Architecture processors, Lnx 3.18GHz lp64 gnu_thread
MKL_VERBOSE ZSCAL(1000000,0x7ffccef20c58,0x7fa9432b5e60,1) 474.25us CNR:OFF
Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZSCAL(1000000,0x7ffccef20c58,0x7fa9441f8260,1) 1.93ms CNR:OFF
Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE *ZGEMV*(C,1000000,2,0x7ffccef20c20,0x7fa9432b5e60,1000000,0x7fa94513a660,1,0x7ffccef20c30,0x1c4b610,1)
1.86ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZSCAL(1000000,0x7ffccef20c58,0x7fa94513a660,1) 2.55ms CNR:OFF
Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE *ZGEMV*(C,1000000,3,0x7ffccef20c20,0x7fa9432b5e60,1000000,0x7fa8cb7a6660,1,0x7ffccef20c30,0x1c4b610,1)
2.95ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
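(My reading of the MKL_VERBOSE format, worth double-checking against the oneMKL
documentation: in ZGEMV(C,1000000,2,...) and ZGEMV(C,1000000,3,...) the second and third
arguments are the matrix dimensions, i.e. 1000000 x 2 and 1000000 x 3 tall-and-skinny
products, which correspond to VecMDot over 2 and 3 Krylov vectors in KSPGMRESOrthog.)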
--Junchao Zhang
On Tue, Jun 25, 2024 at 10:19 PM Yongzhong Li <yongzhong.li at mail.utoronto.ca>
wrote:
> Hi Junchao, thank you for your help with these benchmarking tests!
>
> I checked out petsc/main and did a few things to verify on my side:
>
> 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute
> node. The results are as follows:
>
> $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4
> Vector(N)  VecMDot-1  VecMDot-3  VecMDot-8  VecMDot-30 (us)
> --------------------------------------------------------------------------
>       128       14.5        1.2        1.8         5.2
>       256        1.5        0.9        1.6         4.7
>       512        2.7        2.8        6.1        13.2
>      1024        4.0        4.0        9.3        16.4
>      2048        7.4        7.3       11.3        39.3
>      4096       14.2       13.9       19.1        93.4
>      8192       28.8       26.3       25.4        31.3
>     16384       54.1       25.8       26.7        33.8
>     32768      109.8       25.7       24.2        56.0
>     65536      220.2       24.4       26.5        89.0
>    131072      424.1       31.5       36.1       149.6
>    262144      898.1       37.1       53.9       286.1
>    524288     1754.6       48.7      100.3      1122.2
>   1048576     3645.8       86.5      347.9      2950.4
>   2097152     7371.4      308.7     1440.6      6874.9
>
>
>
> $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4
>
> Vector(N)  VecMDot-1  VecMDot-3  VecMDot-8  VecMDot-30 (us)
> --------------------------------------------------------------------------
>       128       14.9        1.2        1.9         5.2
>       256        1.5        1.0        1.7         4.7
>       512        2.7        2.8        6.1        12.0
>      1024        3.9        4.0        9.3        16.8
>      2048        7.4        7.3       10.4        41.3
>      4096       14.0       13.8       18.6        84.2
>      8192       27.0       21.3       43.8       177.5
>     16384       54.1       34.1       89.1       330.4
>     32768      110.4       82.1      203.5       781.1
>     65536      213.0      191.8      423.9      1696.4
>    131072      428.7      360.2      934.0      4080.0
>    262144      883.4      723.2     1745.6     10120.7
>    524288     1817.5     1466.1     4751.4     23217.2
>   1048576     3611.0     3796.5    11814.9     48687.7
>   2097152     7401.9    10592.0    27543.2    106565.4
>
>
> I can see the speedup brought by more MKL threads, and if I set
> MKL_VERBOSE to 1, I can see something like
>
> MKL_VERBOSE
> ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1)
> 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6
>
> From my understanding, the VecMDot()/VecMAXPY() calls can benefit from more MKL
> threads on my compute node and are using the ZGEMV MKL BLAS routine.
>
> However, when I run my own program with MKL_VERBOSE set to 1, it is very
> strange that I still cannot find any MKL output, even though I can see from the
> PETSc log that VecMDot() and VecMAXPY() are called.
>
>
> I am wondering: are VecMDot() and VecMAXPY() in KSPGMRESOrthog optimized in a
> way similar to the ex2k test? Should I expect to see MKL output for
> whatever linear system I solve with KSPGMRES? Does it depend on whether the
> matrix is dense or sparse? I do not really understand why
> VecMDot()/VecMAXPY() have anything to do with dense matrix-vector
> multiplication.
>
> Thank you,
>
> Yongzhong
>
> *From: *Junchao Zhang <junchao.zhang at gmail.com>
> *Date: *Tuesday, June 25, 2024 at 6:34 PM
> *To: *Matthew Knepley <knepley at gmail.com>
> *Cc: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>, Pierre Jolivet <
> pierre at joliv.et>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
> Hi, Yongzhong,
>
> Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we
> can speed up those two with OpenMP threads, then we can speed up
> KSPGMRESOrthog. We recently added an optimization that does VecMDot()/VecMAXPY()
> as dense matrix-vector multiplications (i.e., BLAS2 GEMV with tall-and-skinny
> matrices). So with MKL_VERBOSE=1 you should see something like
> "MKL_VERBOSE ZGEMV ..." in the output. If not, could you try again with
> petsc/main?
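>
> To sketch the idea in equations (my paraphrase of the optimization, not the exact
> PETSc code): if the j Krylov vectors y_1, ..., y_j of length N are stored contiguously,
> they form a tall-and-skinny dense N x j matrix Y, and
>
>   VecMDot:  a_i = y_i^H x  for i = 1..j   <=>   a = Y^H x    (one ZGEMV, trans = 'C')
>   VecMAXPY: x := x + sum_i b_i y_i        <=>   x = x + Y b  (one ZGEMV, trans = 'N')
>
> so each orthogonalization step turns into a single BLAS2 call that a threaded
> MKL/OpenBLAS can parallelize; that is why ZGEMV shows up in the MKL_VERBOSE output.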
>
> PETSc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. When I ran
> VecMDot with multithreaded oneMKL (by setting MKL_NUM_THREADS), it was
> strange to see no speedup. I then configured PETSc with OpenBLAS and did
> see better performance with more threads:
>
>
>
> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4
> Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us)
> --------------------------------------------------------------------------
> 128 2.0 2.5 6.1
> 256 1.8 2.7 7.0
> 512 2.1 3.1 8.6
> 1024 2.7 4.0 12.3
> 2048 3.8 6.3 28.0
> 4096 6.1 10.6 42.4
> 8192 10.9 21.8 79.5
> 16384 21.2 39.4 149.6
> 32768 45.9 75.7 224.6
> 65536 142.2 215.8 732.1
> 131072 169.1 233.2 1729.4
> 262144 367.5 830.0 4159.2
> 524288 999.2 1718.1 8538.5
> 1048576 2113.5 4082.1 18274.8
> 2097152 5392.6 10273.4 43273.4
>
>
>
>
>
> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4
> Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us)
> --------------------------------------------------------------------------
> 128 2.0 2.5 6.0
> 256 1.8 2.7 15.0
> 512 2.1 9.0 16.6
> 1024 2.6 8.7 16.1
> 2048 7.7 10.3 20.5
> 4096 9.9 11.4 25.9
> 8192 14.5 22.1 39.6
> 16384 25.1 27.8 67.8
> 32768 44.7 95.7 91.5
> 65536 82.1 156.8 165.1
> 131072 194.0 335.1 341.5
> 262144 388.5 380.8 612.9
> 524288 1046.7 967.1 1653.3
> 1048576 1997.4 2169.0 4034.4
> 2097152 5502.9 5787.3 12608.1
>
>
>
> The tall-and-skinny matrices in KSPGMRESOrthog vary in width, so the average
> speedup depends on that mix of widths. I suggest you run ex2k to see whether
> oneMKL can speed up these kernels in your environment.
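>
> For example, a sweep like this (paths relative to $PETSC_DIR; adjust to your build):
>
>   $ cd src/vec/vec/tests
>   $ make ex2k
>   $ for t in 1 2 4 8 16; do echo "== MKL_NUM_THREADS=$t =="; \
>       MKL_NUM_THREADS=$t OMP_PROC_BIND=spread ./ex2k -n 15 -m 4; done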
>
>
>
> --Junchao Zhang
>
>
>
>
>
> On Mon, Jun 24, 2024 at 11:35 AM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
> Let me run some examples on our end to see whether the code calls the expected
> functions.
>
>
> --Junchao Zhang
>
>
>
>
>
> On Mon, Jun 24, 2024 at 10:46 AM Matthew Knepley <knepley at gmail.com>
> wrote:
>
>
> On Mon, Jun 24, 2024 at 11:21 AM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
>
> Thank you Pierre for your information. Do we have a conclusion for my
> original question about the parallelization efficiency for different stages
> of KSPSolve? Do we need to do more testing to figure out the issues?
>
>
>
> We have an extended discussion of this here:
> https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup
>
>
>
> The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc)
> are memory bandwidth limited. If there is no more bandwidth to be
> marshalled on your board, then adding more processes does nothing at all.
> This is why people were asking about how many "nodes" you are running on,
> because that is the unit of memory bandwidth, not "cores" which make little
> difference.
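>
> As a rough back-of-the-envelope illustration (generic numbers, not measurements from
> your machine): VecAXPY streams 3 doubles (24 bytes) per 2 flops, so a socket with
> ~100 GB/s of memory bandwidth cannot exceed roughly 100e9 / 24 * 2 ≈ 8 GFLOP/s on that
> kernel no matter how many cores are used; a handful of cores already saturates the
> memory system, which is why extra threads buy little.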
>
>
>
> Thanks,
>
>
>
> Matt
>
>
>
> Thank you,
>
> Yongzhong
>
>
>
> *From: *Pierre Jolivet <pierre at joliv.et>
> *Date: *Sunday, June 23, 2024 at 12:41 AM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
>
>
>
>
> On 23 Jun 2024, at 4:07 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
>
>
>
> Yeah, I ran my program again using -mat_view ::ascii_info and set
> MKL_VERBOSE to 1, and the output showed that the matrix is of
> type seqaijmkl (I've attached a few lines below).
>
> --> Setting up matrix-vector products...
>
>
>
> Mat Object: 1 MPI process
>
> type: seqaijmkl
>
> rows=16490, cols=35937
>
> total: nonzeros=128496, allocated nonzeros=128496
>
> total number of mallocs used during MatSetValues calls=0
>
> not using I-node routines
>
> Mat Object: 1 MPI process
>
> type: seqaijmkl
>
> rows=16490, cols=35937
>
> total: nonzeros=128496, allocated nonzeros=128496
>
> total number of mallocs used during MatSetValues calls=0
>
> not using I-node routines
>
>
>
> --> Solving the system...
>
>
>
> Excitation 1 of 1...
>
>
>
> ================================================
>
> Iterative solve completed in 7435 ms.
>
> CONVERGED: rtol.
>
> Iterations: 72
>
> Final relative residual norm: 9.22287e-07
>
> ================================================
>
> [CPU TIME] System solution: 2.27160000e+02 s.
>
> [WALL TIME] System solution: 7.44387218e+00 s.
>
> However, it seems to me that there were still no MKL outputs even though I set
> MKL_VERBOSE to 1, although I think there should be many SpMV operations
> during KSPSolve(). Do you see any possible reasons?
>
>
>
> SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS
> is.
>
>
>
> Thanks,
>
> Pierre
>
>
>
> Thanks,
>
> Yongzhong
>
>
>
>
>
> *From: *Matthew Knepley <knepley at gmail.com>
> *Date: *Saturday, June 22, 2024 at 5:56 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *Junchao Zhang <junchao.zhang at gmail.com>, Pierre Jolivet <
> pierre at joliv.et>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
>
> On Sat, Jun 22, 2024 at 5:03 PM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
>
> MKL_VERBOSE=1 ./ex1
>
>
> matrix nonzeros = 100, allocated nonzeros = 100
>
> MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for
> Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R)
> AVX-512) with support of Vector Neural Network Instructions enabled
> processors, Lnx 2.50GHz lp64 gnu_thread
>
> MKL_VERBOSE
> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1)
> 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0)
> 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE
> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10)
> 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF
> Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF
> Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE
> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1)
> 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1
> FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF
> Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE
> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10)
> 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF
> Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1
> FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF
> Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE
> ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1)
> 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0)
> 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE
> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0)
> 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE
> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0)
> 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF
> Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF
> Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE
> ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15)
> 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE
> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0)
> 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE
> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0)
> 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF
> Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF
> Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us
> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE
> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0)
> 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE
> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0)
> 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF
> Dyn:1 FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF
> Dyn:1 FastMM:1 TID:0 NThr:1
>
> Yes, for the PETSc example there are MKL outputs, but not for my own program. All
> I did was change the matrix type from MATAIJ to MATAIJMKL to get
> optimized SpMV performance from MKL. Should I expect to see any MKL
> outputs in this case?
>
>
>
> Are you sure that the type changed? You can MatView() the matrix with
> format ascii_info to see.
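>
> For example (a minimal sketch; please check the manual pages for the exact option
> syntax): on the command line run with
>
>   -mat_view ::ascii_info
>
> or, in code, after the matrix is assembled/converted:
>
>   PetscCall(PetscViewerPushFormat(PETSC_VIEWER_STDOUT_WORLD, PETSC_VIEWER_ASCII_INFO));
>   PetscCall(MatView(A, PETSC_VIEWER_STDOUT_WORLD));
>   PetscCall(PetscViewerPopFormat(PETSC_VIEWER_STDOUT_WORLD));
>
> and check that the "type:" line says seqaijmkl.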
>
>
>
> Thanks,
>
>
>
> Matt
>
>
>
>
>
> Thanks,
>
> Yongzhong
>
>
>
> *From: *Junchao Zhang <junchao.zhang at gmail.com>
> *Date: *Saturday, June 22, 2024 at 9:40 AM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *Pierre Jolivet <pierre at joliv.et>, petsc-users at mcs.anl.gov <
> petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
> No, you don't. It is strange. Perhaps you can run a PETSc example
> first and see if MKL is really used:
>
> $ cd src/mat/tests
>
> $ make ex1
>
> $ MKL_VERBOSE=1 ./ex1
>
>
> --Junchao Zhang
>
>
>
>
>
> On Fri, Jun 21, 2024 at 4:03 PM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
> I am using
>
> export MKL_VERBOSE=1
>
> ./xx
>
> in the bash file. Do I have to use -ksp_converged_reason?
>
> Thanks,
>
> Yongzhong
>
>
>
> *From: *Pierre Jolivet <pierre at joliv.et>
> *Date: *Friday, June 21, 2024 at 1:47 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *Junchao Zhang <junchao.zhang at gmail.com>, petsc-users at mcs.anl.gov <
> petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
>
> How do you set the variable?
>
>
>
> $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason
>
> MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64
> architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled
> processors, Lnx 2.80GHz lp64 intel_thread
>
> MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1
> TID:0 NThr:1
>
> MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1
> FastMM:1 TID:0 NThr:1
>
> MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1
> TID:0 NThr:1
>
> MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1
> TID:0 NThr:1
>
> [...]
>
>
>
> On 21 Jun 2024, at 7:37 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
>
>
>
> Hello all,
>
> I set MKL_VERBOSE = 1, but observed no print output specific to the use of
> MKL. Does PETSc enable this verbose output?
>
> Best,
>
> Yongzhong
>
>
>
> *From: *Pierre Jolivet <pierre at joliv.et>
> *Date: *Friday, June 21, 2024 at 1:36 AM
> *To: *Junchao Zhang <junchao.zhang at gmail.com>
> *Cc: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>,
> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
>
>
>
>
>
> On 21 Jun 2024, at 6:42 AM, Junchao Zhang <junchao.zhang at gmail.com> wrote:
>
>
>
>
> I remember there are some MKL env vars to print MKL routines called.
>
>
>
> The environment variable is MKL_VERBOSE
>
>
>
> Thanks,
>
> Pierre
>
>
>
> Maybe we can try it to see what MKL routines are really used and then we
> can understand why some petsc functions did not speed up
>
>
> --Junchao Zhang
>
>
>
>
>
> On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
>
>
>
> Hi Barry, sorry about my last results. I didn't fully understand the stage
> profiling and logging in PETSc; now I record only the KSPSolve() stage of my
> program. Sample code is as follows:
>
> // Static variable to keep track of the stage counter
> static int stageCounter = 1;
>
> // Generate a unique stage name
> std::ostringstream oss;
> oss << "Stage " << stageCounter << " of Code";
> std::string stageName = oss.str();
>
> // Register the stage
> PetscLogStage stagenum;
> PetscLogStageRegister(stageName.c_str(), &stagenum);
> PetscLogStagePush(stagenum);
>
> KSPSolve(*ksp_ptr, b, x);
>
> PetscLogStagePop();
> stageCounter++;
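>
> (For reference, a minimal variant of the same idea with PETSc error checking, assuming
> a recent PETSc where PetscCall() is available; the names are just illustrative:)
>
>   PetscLogStage stage;
>   char          name[64];
>   PetscCall(PetscSNPrintf(name, sizeof(name), "KSPSolve stage %d", stageCounter));
>   PetscCall(PetscLogStageRegister(name, &stage));
>   PetscCall(PetscLogStagePush(stage));
>   PetscCall(KSPSolve(*ksp_ptr, b, x));
>   PetscCall(PetscLogStagePop());
>   stageCounter++;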
>
> I have attached my new logging results; there is 1 main stage and 4 other
> stages, each of which is one KSPSolve() call.
>
> To provide some additional background, if you recall, I have been trying
> to get an efficient iterative solution using multithreading. I found that by
> compiling PETSc with the Intel MKL library instead of OpenBLAS, I am able to
> perform sparse matrix-vector multiplication faster; I am using
> MATSEQAIJMKL. This makes the shell matrix-vector product in each iteration
> scale well with the # of threads. However, I found that the total GMRES solve
> time (~KSPSolve() time) is not scaling well with the # of threads.
>
> From the logging results I learned that when performing KSPSolve(), there
> are some CPU overheads in PCApply() and KSPGMRESOrthog(). I ran my program
> using different numbers of threads and plotted the time consumption of
> PCApply() and KSPGMRESOrthog() against the # of threads. I found that these two
> operations are not scaling with the threads at all! My results are attached
> as a PDF to give you a clear view.
>
> My question is:
>
> From my understanding, MatSolve() is involved in PCApply(), and
> KSPGMRESOrthog() involves many vector operations, so why can't these two parts
> scale well with the # of threads when the Intel MKL library is linked?
>
> Thank you,
> Yongzhong
>
>
>
> *From: *Barry Smith <bsmith at petsc.dev>
> *Date: *Friday, June 14, 2024 at 11:36 AM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
> piero.triverio at utoronto.ca>
> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
> Performance Issue
>
>
>
> I am a bit confused. Without the initial guess computation, there are
> still a bunch of events I don't understand
>
>
>
> MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>
> MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>
> MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>
> MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>
> MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>
> MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>
> MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 275
>
> MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>
> MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>
>
>
> In addition, there are many more VecMAXPY than VecMDot calls (in GMRES they are
> each done the same number of times).
>
>
>
> VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00
> 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016
>
> VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00
> 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913
>
>
>
> Finally there are a huge number of
>
>
>
> MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00
> 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025
>
>
>
> Are you making calls to all these routines? Are you doing this inside your
> MatMult() or before you call KSPSolve?
>
>
>
> The reason I wanted you to make a simpler run without the initial guess
> code is that your events are far more complicated than would be produced by
> GMRES alone, so it is not possible to understand the behavior you are seeing
> without fully understanding all the events happening in the code.
>
>
>
> Barry
>
>
>
>
>
> On Jun 14, 2024, at 1:19 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
>
>
> Thanks, I have attached the results without using any KSPGuess. At low
> frequency, the iteration counts are quite close to those with KSPGuess,
> specifically:
>
> KSPGuess Object: 1 MPI process
>
> type: fischer
>
> Model 1, size 200
>
> However, I found that at higher frequency the # of iteration steps is
> significantly higher than with KSPGuess; I have attached both of the
> results for your reference.
>
> Moreover, could I ask why the run without the KSPGuess options can be used
> as a baseline comparison? What are we comparing here? How does it relate
> to the performance issue/bottleneck I found ("I have noticed that the
> time taken by KSPSolve is almost two times greater than the CPU
> time for the matrix-vector product multiplied by the number of iterations")?
>
> Thank you!
> Yongzhong
>
>
>
> *From: *Barry Smith <bsmith at petsc.dev>
> *Date: *Thursday, June 13, 2024 at 2:14 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
> piero.triverio at utoronto.ca>
> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
> Performance Issue
>
>
>
> Can you please run the same thing without the KSPGuess option(s) for a
> baseline comparison?
>
>
>
> Thanks
>
>
>
> Barry
>
>
>
> On Jun 13, 2024, at 1:27 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
>
>
>
> Hi Matt,
>
> I have rerun the program with the keys you provided. The system output
> when performing ksp solve and the final petsc log output were stored in a
> .txt file attached for your reference.
>
> Thanks!
> Yongzhong
>
>
>
> *From: *Matthew Knepley <knepley at gmail.com>
> *Date: *Wednesday, June 12, 2024 at 6:46 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
> piero.triverio at utoronto.ca>
> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
> Performance Issue
>
>
> On Wed, Jun 12, 2024 at 6:36 PM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
>
> Dear PETSc’s developers,
>
> I hope this email finds you well.
>
> I am currently working on a project using PETSc and have encountered a
> performance issue with the KSPSolve function. Specifically, I have
> noticed that the time taken by KSPSolve is almost two times greater
> than the CPU time for the matrix-vector product multiplied by the number of
> iteration steps. I use C++ chrono to record CPU time.
>
> For context, I am using a shell system matrix A. Despite my efforts to
> parallelize the matrix-vector product (Ax), the overall solve time
> remains higher than what the per-iteration matrix-vector product time would
> suggest when multiple threads are used. Here are a few details of my setup:
>
> - *Matrix Type*: Shell system matrix
> - *Preconditioner*: Shell PC
> - *Parallel Environment*: Using Intel MKL as PETSc’s BLAS/LAPACK
> library, multithreading is enabled
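>
> (For clarity, a minimal sketch of how such a shell operator is typically wired up;
> MyCtx and MyMatMult are placeholder names for my context struct and the threaded
> y = A*x routine, not the actual code:)
>
>   Mat    A;
>   MyCtx *ctx = /* ... */;   // application data used inside MyMatMult
>   // n = matrix size; sequential run assumed, so local size == global size
>   PetscCall(MatCreateShell(PETSC_COMM_SELF, n, n, n, n, ctx, &A));
>   PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MyMatMult));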
>
> I have considered several potential reasons, such as preconditioner setup,
> additional solver operations, and the inherent overhead of using a shell
> system matrix. However, since KSPSolve is a high-level API, I have been
> unable to pinpoint the exact cause of the increased solve time.
>
> Have you observed the same issue? Could you please provide some
> experience on how to diagnose and address this performance discrepancy?
> Any insights or recommendations you could offer would be greatly
> appreciated.
>
>
>
> For any performance question like this, we need to see the output of your
> code run with
>
>
>
> -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view
>
>
>
> Thanks,
>
>
>
> Matt
>
>
>
> Thank you for your time and assistance.
>
> Best regards,
>
> Yongzhong
>
> -----------------------------------------------------------
>
> *Yongzhong Li*
>
> PhD student | Electromagnetics Group
>
> Department of Electrical & Computer Engineering
>
> University of Toronto
>
> http://www.modelics.org
>
>
>
>
>
>
> --
>
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
>
>
> https://www.cse.buffalo.edu/~knepley/
>
> <ksp_petsc_log.txt>
>
>
>
> <ksp_petsc_log.txt><ksp_petsc_log_noguess.txt>
>
>
>
>
>
>
> --
>
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
>
>
> https://www.cse.buffalo.edu/~knepley/
>
>
>
>
>
>
> --
>
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
>
>
> https://www.cse.buffalo.edu/~knepley/
>