[petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

Junchao Zhang junchao.zhang at gmail.com
Fri Jun 28 12:20:16 CDT 2024


Hi, Yongzhong,
   It is great to see you have made such good progress.  Barry is right,
you need -vec_maxpy_use_gemv 1.  It's my mistake for not mentioning it
earlier. But even with that, there are still problems.
   petsc tries to optimize VecMDot/VecMAXPY with BLAS GEMV, in the hope that
vendors' BLAS libraries are highly optimized for it.  However, we found that
they did well for VecMDot but not for VecMAXPY, so by default petsc disables
the GEMV optimization for VecMAXPY.  One can turn it on with
-vec_maxpy_use_gemv 1.
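   For intuition, here is a hand-written sketch (my own illustration, not
the PETSc source, and real-valued for brevity; complex builds use ZGEMV, as
in your logs).  VecMAXPY computes y += sum_j alpha[j]*x[j], and when the m
input vectors are laid out contiguously as the columns of an n-by-m
column-major matrix X (the situation the GEMV path targets), the whole
update is a single BLAS2 call:

  #include <cblas.h>

  /* y += X*alpha, i.e. y += sum_j alpha[j]*X[:,j]; X is n-by-m with lda = n */
  static void maxpy_as_gemv(int n, int m, const double *alpha, const double *X, double *y)
  {
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, m, 1.0, X, n, alpha, 1, 1.0, y, 1);
  }

VecMDot is essentially the conjugate-transposed version of the same GEMV,
which is the ZGEMV(C,...) you can see in the MKL_VERBOSE lines quoted below.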
   I turned it on and tested VecMAXPY with ex2k and MKL, but saw no
improvement with multiple threads.  I do not understand why MKL does so
poorly on it.  You can try it yourself in your environment.
   Without the GEMV optimization, petsc implements VecMAXPY() with a batch
of PetscKernelAXPY() kernels, which are simple for loops and are not OpenMP
parallelized (petsc does not support OpenMP outright).  I added an "omp
parallel for" pragma to the PetscKernelAXPY() kernels and tested ex2k again
with the now-parallelized petsc; a sketch of the kind of change is below.
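   (The sketch is hand-written here to show the shape of the change, not
the actual PetscKernelAXPY macro, and uses real scalars for brevity; the
real kernels operate on PetscScalar.  Compile with -fopenmp.)

  /* axpy-style kernel: y[i] += alpha*x[i].  The only change is the pragma,
     which lets each OpenMP thread update a disjoint chunk of y. */
  static void kernel_axpy(int n, double alpha, const double *x, double *y)
  {
  #pragma omp parallel for
    for (int i = 0; i < n; i++) y[i] += alpha * x[i];
  }

Here are the results with 1 and 16 OpenMP threads: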

 $ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m
2 -test_name VecMAXPY -vec_maxpy_use_gemv 0
Vector(N)      VecMAXPY-1    VecMAXPY-3    VecMAXPY-8    VecMAXPY-30  (us)
--------------------------------------------------------------------------
         128         7.0         10.1         21.4         72.7
         256         7.9         12.9         29.5        101.0
         512         9.4         17.2         40.5        136.2
        1024        15.9         27.3         67.5        249.3
        2048        26.5         48.7        139.6        432.7
        4096        47.1         77.3        186.4        710.3
        8192        84.8        152.2        423.9       1580.6
       16384       154.9        298.5        792.1       2889.2
       32768       183.7        338.7        893.9       3436.2
       65536       639.1       1247.8       3219.1      12494.8
      131072      1125.2       1856.2       6843.0      23653.7
      262144      2603.2       4948.4      13259.4      51287.7
      524288      5093.6      10305.0      26451.7      96919.6
     1048576      5898.6      10947.2      45486.4     127352.8
     2097152     11845.4      21912.5      57999.6     331403.4

$ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=16 ./ex2k -n 15 -m
2 -test_name VecMAXPY -vec_maxpy_use_gemv 0
Vector(N)      VecMAXPY-1    VecMAXPY-3    VecMAXPY-8    VecMAXPY-30  (us)
--------------------------------------------------------------------------
         128        17.0         16.1         31.5        112.9
         256        13.7         16.8         31.2        120.2
         512        14.5         18.1         33.9        129.9
        1024        16.5         21.0         38.5        150.4
        2048        18.5         22.1         41.8        171.4
        4096        21.0         25.4         55.3        212.3
        8192        27.0         30.3         68.6        251.9
       16384        32.2         44.5         93.3        350.5
       32768        45.8         65.0        149.8        558.8
       65536        59.7        102.8        247.1        946.0
      131072       100.7        186.4        485.3       1898.1
      262144       183.4        345.2        922.2       3567.0
      524288       339.6        676.8       1820.7       7530.4
     1048576       662.0       1364.7       3585.3      13969.1
     2097152      1379.7       2788.6       7414.0      28275.3

We can see that VecMAXPY() can easily be sped up with multithreading.

For MatSolve, I checked petsc's aijmkl.c and found we do not have an
interface to MKL's sparse solve.  I also checked
https://urldefense.us/v3/__https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2023-0/openmp-threaded-functions-and-problems.html__;!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIgTsoHmt$ ,
but I am confused by MKL's list of threaded functions:

   - Direct sparse solver.

   - All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse
     Triangular solvers.

I don't know whether MKL has a threaded sparse triangular solver.
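   On the MatSolve/MUMPS question: I have not benchmarked it myself, but one
thing you could try is a threaded direct solver through PETSc's MUMPS
interface (MUMPS has OpenMP support).  For example, configure with
--download-mumps --download-scalapack and run with

  -pc_type lu -pc_factor_mat_solver_type mumps

controlling the threads with OMP_NUM_THREADS.  Whether it actually helps for
your matrices is something you would have to measure.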

--Junchao Zhang


On Fri, Jun 28, 2024 at 11:35 AM Barry Smith <bsmith at petsc.dev> wrote:

>
>   Are you running with -vec_maxpy_use_gemv ?
>
>
> On Jun 28, 2024, at 1:46 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
> Thanks all for your help!!!
>
> I think I found the issue. I am compiling a large CMake project that
> relies on many external libraries (projects). Previously, I used OpenBLAS
> as the BLAS for all the dependencies, including PETSc. After I switched to
> Intel MKL for PETSc, I still kept OpenBLAS as the BLAS for all the other
> dependencies. It seems that even when I pointed --with-blaslapack-dir to
> MKLROOT while configuring PETSc, the program still used OpenBLAS for some
> PETSc functions, such as VecMDot() and VecMAXPY(), which is why I did not
> see any MKL verbose output during KSPSolve(). Now I have removed OpenBLAS
> and use Intel MKL as the BLAS for all the dependencies. The issue is
> resolved; I can clearly see MKL routines being called while KSP GMRES is
> running.
>
> Back to my original questions: my goal is to achieve good parallelization
> efficiency for the KSP GMRES solve. Since I use multithreading-enabled MKL
> spmv routines, the wall time for MatMult/MatMultAdd() has been greatly
> reduced. However, the KSPGMRESOrthog and the MatSolve in PCApply still take
> over 50% of the solve time and do not benefit from multithreading. *After I
> fixed the issue I mentioned, the time was reduced by around 15% because of
> more efficient VecMDot() calls*. I attach a petsc log comparison for your
> reference (same settings, the only difference is whether MKL BLAS is used);
> you can see the percentage of VecMDot() is reduced. However, here comes the
> interesting part: *VecMAXPY() did not benefit from MKL BLAS, it still takes
> almost 40% of the solve time when I use 64 MKL threads*, which is a lot for
> my program. And if I multiply this percentage by the actual wall time for
> different numbers of threads, the product stays the same. Then I used the
> ex2k benchmark to verify what I found. Here is the result:
>
> $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 5 -test_name VecMAXPY
> Vector(N)      VecMAXPY-1    VecMAXPY-3    VecMAXPY-8    VecMAXPY-30  (us)
> --------------------------------------------------------------------------
>          128         0.4          0.9          2.4          8.8
>          256         0.3          1.1          3.5         13.3
>          512         0.5          4.4          6.7         26.5
>         1024         0.9          4.8         13.3         51.0
>         2048         3.5         12.3         37.1         94.7
>         4096         4.3         24.5         73.6        179.6
>         8192         6.3         48.7         98.9        380.8
>        16384         9.3         99.2        200.2        774.0
>        32768        30.6        155.4        421.2       1662.9
>        65536       101.2        269.4        827.4       3565.0
>       131072       206.9        551.0       1829.0       7580.5
>       262144       450.2       1251.9       3986.2      15525.6
>       524288      1322.1       2901.7       8567.1      31840.0
>      1048576      2788.6       6190.6      16394.7      63514.9
>      2097152      5534.8      12619.9      35427.4     130064.5
> $ MKL_NUM_THREADS=8 ./ex2k -n 15 -m 5 -test_name VecMAXPY
> Vector(N)      VecMAXPY-1    VecMAXPY-3    VecMAXPY-8    VecMAXPY-30  (us)
> --------------------------------------------------------------------------
>          128         0.3          0.7          2.4          8.8
>          256         0.3          1.1          3.6         13.5
>          512         0.5          4.4          6.8         26.4
>         1024         0.9          4.8         13.6         50.5
>         2048         7.6         12.2         36.5         95.0
>         4096         8.5         25.7         72.4        182.6
>         8192        11.9         48.5        103.7        383.7
>        16384        12.8         97.7        203.7        785.0
>        32768        11.2        148.5        421.9       1681.5
>        65536        15.5        271.2        843.8       3613.7
>       131072        34.3        564.7       1905.2       7558.8
>       262144       106.4       1334.5       4002.8      15458.3
>       524288       217.2       2858.4       8407.9      31303.7
>      1048576       701.5       6060.6      16947.3      64118.5
>      2097152      1769.7      13218.3      36347.3     131062.9
>
> It stays the same: no benefit from the multithreaded BLAS!! This is unlike
> what I found for VecMDot(), where I did see a speedup with more threads.
> Then I dug deeper. *I learned that VecMDot() calls ZGEMV while VecMAXPY()
> calls ZAXPY. This observation seems to indicate that ZAXPY is not
> benefiting from MKL threads.*
>
> My question is, *do you know why ZAXPY is not multithreaded*? From my
> perspective, VecMDot() and VecMAXPY() are very similar operations; the only
> difference is whether we need to scale the vectors being multiplied. I
> think you mentioned that you recently did some optimization of these two
> routines; *from my above results and observations, is this aligned with
> your expectations*? Could we further optimize the code to get more
> parallelization efficiency in my case?
>
> *And another question, can MatSolve() in KSPSolve be multithreaded? Would
> MUMPS help?*
>
> Thank you and regards,
> Yongzhong
>
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Thursday, June 27, 2024 11:10 AM
> *To:* Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc:* Barry Smith <bsmith at petsc.dev>; petsc-users at mcs.anl.gov
> *Subject:* Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
> How big is n when you call PetscCallBLAS("BLASgemv", BLASgemv_(trans,
> &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione))?  n is
> the vector length in VecMDot.
> It is strange that with MKL_VERBOSE=1 you did not see MKL_VERBOSE *ZGEMV...*,
> since the code did call gemv. Perhaps you need to double-check your spelling
> etc.
>
> You could also run ex2k, potentially modifying Ms[] and Ns[] to match the
> sizes in your code, to see if there is a speedup with more threads.
>
> --Junchao Zhang
>
>
> On Thu, Jun 27, 2024 at 9:39 AM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
> Mostly 3, maximum 7, but it definitely hits the point where m > 1.
>
> I can see that PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one,
> yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple
> times.
>
>
> *From: *Barry Smith <bsmith at petsc.dev>
> *Date: *Thursday, June 27, 2024 at 1:12 AM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
>   How big are the m's getting in your code?
>
>
>
>
> On Jun 27, 2024, at 12:40 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
> Hi Barry, I used gdb to debug my program and set a breakpoint at the
> VecMultiDot_Seq_GEMV function. I did see that when I step through this
> function, it calls BLAS (but not always, only if m > 1), as shown below.
> However, I still didn’t see any MKL output even though I set MKL_VERBOSE=1.
>
> *(gdb) *
> *550            PetscCall(VecRestoreArrayRead(yin[i], &yfirst));*
> *(gdb) *
> *553            m = j - i;*
> *(gdb) *
> *554            if (m > 1) {*
> *(gdb) *
> *555              PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the
> cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above*
> *(gdb) *
> *556              PetscScalar  one = 1, zero = 0;*
> *(gdb) *
> *558              PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one,
> yarray, &lda2, xarray, &ione, &zero, z + i, &ione));*
> *(gdb) s*
> *PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210>
> "VecMultiDot_Seq_GEMV",*
> *    file=0x7ffff68a1078
> "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c")*
> *    at
> /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106*
> *106          if (!TRdebug) return PETSC_SUCCESS;*
> *(gdb) *
> *154        }*
>
> Am I not using MKL BLAS? Is that why I didn’t see a multithreading speedup
> for KSPGMRESOrthog? What do you think the potential reasons could be? Is
> there some silent mode that could affect the MKL verbose output?
>
> Thank you and best regards,
> Yongzhong
>
>
> *From: *Barry Smith <bsmith at petsc.dev>
> *Date: *Wednesday, June 26, 2024 at 8:15 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
>   if (m > 1) {
>       PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe
> since we've screened out those lda > PETSC_BLAS_INT_MAX above
>       PetscScalar  one = 1, zero = 0;
>
>       PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray,
> &lda2, xarray, &ione, &zero, z + i, &ione));
>       PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0)));
>
> The call to BLAS above is where it uses MKL.
>
>
>
>
>
> On Jun 26, 2024, at 6:59 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
> Hi Barry, I am looking into the source code of VecMultiDot_Seq_GEMV:
> https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIoPmWgCr$ 
> <https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!ZshPGnAUymZ7rmZ8Cq0JR23FBhEioHOuAq-lFnn4iQn1bK8ioexLwIQVLSQNCfmBaWWExCcshZ6KphgTYR6kv18wg0MHEITtuVo$>
> Can I ask which lines of code suggest the use of Intel MKL?
>
> Thanks,
> Yongzhong
>
>
> *From: *Barry Smith <bsmith at petsc.dev>
> *Date: *Wednesday, June 26, 2024 at 10:30 AM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
>    In a debug version of PETSc run your application in a debugger and put
> a break point in VecMultiDot_Seq_GEMV.  Then next through the code from
> that point to see what decision it makes about using dgemv() to see why it
> is not getting into the Intel code.
>
>
>
>
>
> On Jun 25, 2024, at 11:19 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
> Hi Junchao, thank you for your help with these benchmarking tests!
>
> I checked out petsc/main and did a few things to verify on my side.
>
> 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) on my compute node. The
> results are as follows:
> $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4
> Vector(N)      VecMDot-1    VecMDot-3    VecMDot-8    VecMDot-30  (us)
> --------------------------------------------------------------------------
>          128        14.5          1.2          1.8          5.2
>          256         1.5          0.9          1.6          4.7
>          512         2.7          2.8          6.1         13.2
>         1024         4.0          4.0          9.3         16.4
>         2048         7.4          7.3         11.3         39.3
>         4096        14.2         13.9         19.1         93.4
>         8192        28.8         26.3         25.4         31.3
>        16384        54.1         25.8         26.7         33.8
>        32768       109.8         25.7         24.2         56.0
>        65536       220.2         24.4         26.5         89.0
>       131072       424.1         31.5         36.1        149.6
>       262144       898.1         37.1         53.9        286.1
>       524288      1754.6         48.7        100.3       1122.2
>      1048576      3645.8         86.5        347.9       2950.4
>      2097152      7371.4        308.7       1440.6       6874.9
>
> $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4
> Vector(N)      VecMDot-1    VecMDot-3    VecMDot-8    VecMDot-30  (us)
> --------------------------------------------------------------------------
>          128        14.9          1.2          1.9          5.2
>          256         1.5          1.0          1.7          4.7
>          512         2.7          2.8          6.1         12.0
>         1024         3.9          4.0          9.3         16.8
>         2048         7.4          7.3         10.4         41.3
>         4096        14.0         13.8         18.6         84.2
>         8192        27.0         21.3         43.8        177.5
>        16384        54.1         34.1         89.1        330.4
>        32768       110.4         82.1        203.5        781.1
>        65536       213.0        191.8        423.9       1696.4
>       131072       428.7        360.2        934.0       4080.0
>       262144       883.4        723.2       1745.6      10120.7
>       524288      1817.5       1466.1       4751.4      23217.2
>      1048576      3611.0       3796.5      11814.9      48687.7
>      2097152      7401.9      10592.0      27543.2     106565.4
>
> I can see the speedup brought by more MKL threads, and if I set
> MKL_VERBOSE to 1, I can see something like
>
> MKL_VERBOSE
> ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1)
> 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:6
>
> From my understanding, VecMDot()/VecMAXPY() can benefit from more MKL
> threads on my compute node and are using the MKL ZGEMV BLAS routine.
>
> However, when I ran my own program with MKL_VERBOSE set to 1, it is very
> strange that I still cannot find any MKL output, even though I can see from
> the PETSc log that VecMDot() and VecMAXPY() are called.
>
> I am wondering whether VecMDot and VecMAXPY in KSPGMRESOrthog are optimized
> in a way similar to the ex2k test. Should I expect to see MKL output for
> whatever linear system I solve with KSPGMRES? Does it depend on whether the
> matrix is dense or sparse? Although I do not really understand what
> VecMDot/VecMAXPY() have to do with dense matrix-vector multiplication.
>
> Thank you,
>
> Yongzhong
>
> *From: *Junchao Zhang <junchao.zhang at gmail.com>
> *Date: *Tuesday, June 25, 2024 at 6:34 PM
> *To: *Matthew Knepley <knepley at gmail.com>
> *Cc: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>, Pierre Jolivet <
> pierre at joliv.et>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
> Hi, Yongzhong,
>   Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we
> can speed up these two with OpenMP threads, then we can speed up
> KSPGMRESOrthog.  We recently added an optimization that does VecMDot/MAXPY()
> as dense matrix-vector multiplications (i.e., BLAS2 GEMV with
> tall-and-skinny matrices).  So with MKL_VERBOSE=1 you should see something
> like "MKL_VERBOSE ZGEMV ..." in the output.  If not, could you try again
> with petsc/main?
>   petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them.  I ran
> VecMDot with multithreaded oneMKL (by setting MKL_NUM_THREADS), and it was
> strange to see no speedup.  I then configured petsc with OpenBLAS and did
> see better performance with more threads:
>
> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4
> Vector(N)      VecMDot-3    VecMDot-8    VecMDot-30  (us)
> --------------------------------------------------------------------------
>          128         2.0          2.5          6.1
>          256         1.8          2.7          7.0
>          512         2.1          3.1          8.6
>         1024         2.7          4.0         12.3
>         2048         3.8          6.3         28.0
>         4096         6.1         10.6         42.4
>         8192        10.9         21.8         79.5
>        16384        21.2         39.4        149.6
>        32768        45.9         75.7        224.6
>        65536       142.2        215.8        732.1
>       131072       169.1        233.2       1729.4
>       262144       367.5        830.0       4159.2
>       524288       999.2       1718.1       8538.5
>      1048576      2113.5       4082.1      18274.8
>      2097152      5392.6      10273.4      43273.4
>
>
> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4
> Vector(N)      VecMDot-3    VecMDot-8    VecMDot-30  (us)
> --------------------------------------------------------------------------
>          128         2.0          2.5          6.0
>          256         1.8          2.7         15.0
>          512         2.1          9.0         16.6
>         1024         2.6          8.7         16.1
>         2048         7.7         10.3         20.5
>         4096         9.9         11.4         25.9
>         8192        14.5         22.1         39.6
>        16384        25.1         27.8         67.8
>        32768        44.7         95.7         91.5
>        65536        82.1        156.8        165.1
>       131072       194.0        335.1        341.5
>       262144       388.5        380.8        612.9
>       524288      1046.7        967.1       1653.3
>      1048576      1997.4       2169.0       4034.4
>      2097152      5502.9       5787.3      12608.1
>
> The tall-and-skinny matrices in KSPGMRESOrthog vary in width, so the
> average speedup depends on the mix of widths.  I suggest you run ex2k to
> see whether oneMKL can speed up the kernels in your environment.
>
> --Junchao Zhang
>
>
> On Mon, Jun 24, 2024 at 11:35 AM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
> Let me run some examples on our end to see whether the code calls the
> expected functions.
>
> --Junchao Zhang
>
>
> On Mon, Jun 24, 2024 at 10:46 AM Matthew Knepley <knepley at gmail.com>
> wrote:
> On Mon, Jun 24, 2024 at 11:21 AM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
> Thank you Pierre for your information. Do we have a conclusion for my
> original question about the parallelization efficiency for different stages
> of KSP Solve? Do we need to do more testing to figure out the issues?
>
>
> We have an extended discussion of this here:
> https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIp2OY8h7$ 
> <https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!aQJpmm5W6l6FUiumnIPmkouzwzNUfx-Dyq04i1O2KS_InQGk6qjI7wUir0Hx6QEUQE2AMiJDsez3x4zRO7V_$>
>
> The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc)
> are memory bandwidth limited. If there is no more bandwidth to be
> marshalled on your board, then adding more processes does nothing at all.
> This is why people were asking about how many "nodes" you are running on,
> because that is the unit of memory bandwidth, not "cores" which make little
> difference.
>
>   Thanks,
>
>      Matt
>
>
> Thank you,
> Yongzhong
>
>
> *From: *Pierre Jolivet <pierre at joliv.et>
> *Date: *Sunday, June 23, 2024 at 12:41 AM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
>
>
>
> On 23 Jun 2024, at 4:07 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
> Yeah, I ran my program again using -mat_view ::ascii_info and set
> MKL_VERBOSE to 1, and the output suggests that the matrix is of type
> seqaijmkl (I’ve attached a few lines below).
>
> --> Setting up matrix-vector products...
>
> Mat Object: 1 MPI process
>   type: seqaijmkl
>   rows=16490, cols=35937
>   total: nonzeros=128496, allocated nonzeros=128496
>   total number of mallocs used during MatSetValues calls=0
>     not using I-node routines
> Mat Object: 1 MPI process
>   type: seqaijmkl
>   rows=16490, cols=35937
>   total: nonzeros=128496, allocated nonzeros=128496
>   total number of mallocs used during MatSetValues calls=0
>     not using I-node routines
>
> --> Solving the system...
>
> Excitation 1 of 1...
>
> ================================================
> Iterative solve completed in 7435 ms.
> CONVERGED: rtol.
> Iterations: 72
> Final relative residual norm: 9.22287e-07
> ================================================
> [CPU TIME] System solution: 2.27160000e+02 s.
> [WALL TIME] System solution: 7.44387218e+00 s.
>
> However, it seems to me that there is still no MKL output even though I
> set MKL_VERBOSE to 1, although I think there should be many spmv operations
> during KSPSolve(). Do you see any possible reasons?
>
>
> SpMV is not reported by MKL_VERBOSE (last I checked); only dense BLAS is.
>
> Thanks,
> Pierre
>
>
>
> Thanks,
> Yongzhong
>
>
>
> *From: *Matthew Knepley <knepley at gmail.com>
> *Date: *Saturday, June 22, 2024 at 5:56 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *Junchao Zhang <junchao.zhang at gmail.com>, Pierre Jolivet <
> pierre at joliv.et>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
> On Sat, Jun 22, 2024 at 5:03 PM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
> MKL_VERBOSE=1 ./ex1
>
> matrix nonzeros = 100, allocated nonzeros = 100
> MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for
> Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R)
> AVX-512) with support of Vector Neural Network Instructions enabled
> processors, Lnx 2.50GHz lp64 gnu_thread
> MKL_VERBOSE
> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1)
> 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0)
> 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE
> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10)
> 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF
> Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF
> Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE
> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1)
> 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1
> FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF
> Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE
> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10)
> 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF
> Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1
> FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF
> Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE
> ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1)
> 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0)
> 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE
> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0)
> 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE
> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0)
> 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF
> Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF
> Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE
> ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15)
> 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE
> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0)
> 730ns CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE
> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0)
> 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF
> Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF
> Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us
> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE
> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0)
> 390ns CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE
> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0)
> 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF
> Dyn:1 FastMM:1 TID:0  NThr:1
> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF
> Dyn:1 FastMM:1 TID:0  NThr:1
>
> Yes, for the petsc example there is MKL output, but not for my own
> program. All I did was change the matrix type from MATAIJ to MATAIJMKL to
> get optimized spmv performance from MKL. Should I expect to see any MKL
> output in this case?
>
>
> Are you sure that the type changed? You can MatView() the matrix with
> format ascii_info to see.
>
>   Thanks,
>
>      Matt
>
>
>
> Thanks,
> Yongzhong
>
>
> *From: *Junchao Zhang <junchao.zhang at gmail.com>
> *Date: *Saturday, June 22, 2024 at 9:40 AM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *Pierre Jolivet <pierre at joliv.et>, petsc-users at mcs.anl.gov <
> petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
> No, you don't.  It is strange.  Perhaps you can run a petsc example first
> and see whether MKL is really used:
> $ cd src/mat/tests
> $ make ex1
> $ MKL_VERBOSE=1 ./ex1
>
> --Junchao Zhang
>
>
> On Fri, Jun 21, 2024 at 4:03 PM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
> I am using
>
> export MKL_VERBOSE=1
> ./xx
>
> in the bash script. Do I have to use -ksp_converged_reason?
>
> Thanks,
> Yongzhong
>
>
> *From: *Pierre Jolivet <pierre at joliv.et>
> *Date: *Friday, June 21, 2024 at 1:47 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *Junchao Zhang <junchao.zhang at gmail.com>, petsc-users at mcs.anl.gov <
> petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
> How do you set the variable?
>
> $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason
> MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64
> architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled
> processors, Lnx 2.80GHz lp64 intel_thread
> MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1
> TID:0  NThr:1
> MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1
> FastMM:1 TID:0  NThr:1
> MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1
> TID:0  NThr:1
> MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1
> TID:0  NThr:1
> [...]
>
>
> On 21 Jun 2024, at 7:37 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
> Hello all,
>
> I set MKL_VERBOSE = 1, but observed no print output specific to the use of
> MKL. Does PETSc enable this verbose output?
>
> Best,
>
> Yongzhong
>
>
> *From: *Pierre Jolivet <pierre at joliv.et>
> *Date: *Friday, June 21, 2024 at 1:36 AM
> *To: *Junchao Zhang <junchao.zhang at gmail.com>
> *Cc: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>,
> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
> KSPSolve Performance Issue
>
>
>
> On 21 Jun 2024, at 6:42 AM, Junchao Zhang <junchao.zhang at gmail.com> wrote:
>
> I remember there are some MKL env vars that print the MKL routines called.
>
>
> The environment variable is MKL_VERBOSE
>
> Thanks,
> Pierre
>
>
> Maybe we can try it to see which MKL routines are really used, and then we
> can understand why some petsc functions did not speed up.
>
> --Junchao Zhang
>
>
> On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
> Hi Barry, sorry about my last results. I didn’t fully understand stage
> profiling and logging in PETSc; now I only record the KSPSolve() stage of
> my program. Some sample code is as follows:
>
>                 // Static variable to keep track of the stage counter
>                 static int stageCounter = 1;
>
>                 // Generate a unique stage name
>                 std::ostringstream oss;
>                 oss << "Stage " << stageCounter << " of Code";
>                 std::string stageName = oss.str();
>
>                 // Register the stage
>                 PetscLogStage stagenum;
>
>                 PetscLogStageRegister(stageName.c_str(), &stagenum);
>                 PetscLogStagePush(stagenum);
>
>                 *KSPSolve(*ksp_ptr, b, x);*
>
>                 PetscLogStagePop();
>                 stageCounter++;
>
> I have attached my new logging results; there is 1 main stage and 4 other
> stages, each of which is a KSPSolve() call.
>
> To provide some additional background: if you recall, I have been trying
> to get an efficient iterative solution using multithreading. I found that
> by compiling PETSc with the Intel MKL library instead of OpenBLAS, I can
> perform sparse matrix-vector multiplication faster (I am using
> MATSEQAIJMKL). This makes the shell matrix-vector product in each iteration
> scale well with the number of threads. However, I found that the total
> GMRES solve time (~KSPSolve() time) does not scale well with the number of
> threads.
>
> From the logging results I learned that when performing KSPSolve(), there
> is some CPU overhead in PCApply() and KSPGMRESOrthog(). I ran my program
> with different numbers of threads and plotted the time consumed by
> PCApply() and KSPGMRESOrthog() against the number of threads. I found that
> these two operations do not scale with the threads at all! My results are
> attached as a PDF to give you a clear view.
>
> My question is:
>
> From my understanding, PCApply involves MatSolve(), and KSPGMRESOrthog()
> involves many vector operations, so why can't these two parts scale well
> with the number of threads when the Intel MKL library is linked?
>
> Thank you,
> Yongzhong
>
>
> *From: *Barry Smith <bsmith at petsc.dev>
> *Date: *Friday, June 14, 2024 at 11:36 AM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
> piero.triverio at utoronto.ca>
> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
> Performance Issue
>
>    I am a bit confused. Without the initial guess computation, there are
> still a bunch of events I don't understand
>
> MatTranspose          79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMatMultSym        110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatMatMultNum         90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatMatMatMultSym      20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatRARtSym            25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatMatTrnMultSym      25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMatTrnMultNum      25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0   275
> MatTrnMatMultSym      10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatTrnMatMultNum      10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>
> in addition there are many more VecMAXPY than VecMDot calls (in GMRES they
> are each done the same number of times)
>
> VecMDot             5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00
> 0.0e+00  8 10  0  0  0   8 10  0  0  0 12016
> VecMAXPY           22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00
> 0.0e+00 39 20  0  0  0  39 20  0  0  0  4913
>
> Finally there are a huge number of
>
> MatMultAdd        258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00
> 0.0e+00  7 29  0  0  0   7 29  0  0  0 43025
>
> Are you making calls to all these routines? Are you doing this inside your
> MatMult() or before you call KSPSolve?
>
> The reason I wanted you to make a simpler run without the initial guess
> code is that your events are far more complicated than would be produced by
> GMRES alone so it is not possible to understand the behavior you are seeing
> without fully understanding all the events happening in the code.
>
>   Barry
>
>
>
> On Jun 14, 2024, at 1:19 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
> Thanks, I have attached the results without using any KSPGuess. At low
> frequency, the iteration counts are quite close to those with KSPGuess,
> specifically
>
>   KSPGuess Object: 1 MPI process
>     type: fischer
>     Model 1, size 200
>
> However, I found that at higher frequencies the number of iteration steps
> is significantly higher than with KSPGuess; I have attached both sets of
> results for your reference.
>
> Moreover, could I ask why the run without the KSPGuess options can be used
> as a baseline comparison? What are we comparing here? How does it relate
> to the performance issue/bottleneck I found: “*I have noticed that the
> time taken by **KSPSolve** is **almost two times **greater than the CPU
> time for the matrix-vector product multiplied by the number of
> iterations*”?
>
> Thank you!
> Yongzhong
>
>
> *From: *Barry Smith <bsmith at petsc.dev>
> *Date: *Thursday, June 13, 2024 at 2:14 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
> piero.triverio at utoronto.ca>
> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
> Performance Issue
>
>   Can you please run the same thing without the  KSPGuess option(s) for a
> baseline comparison?
>
>    Thanks
>
>    Barry
>
>
> On Jun 13, 2024, at 1:27 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> wrote:
>
> Hi Matt,
>
> I have rerun the program with the options you provided. The system output
> from the KSP solve and the final petsc log output are stored in the
> attached .txt file for your reference.
>
> Thanks!
> Yongzhong
>
>
> *From: *Matthew Knepley <knepley at gmail.com>
> *Date: *Wednesday, June 12, 2024 at 6:46 PM
> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
> piero.triverio at utoronto.ca>
> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
> Performance Issue
> On Wed, Jun 12, 2024 at 6:36 PM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
> Dear PETSc’s developers,
> I hope this email finds you well.
> I am currently working on a project using PETSc and have encountered a
> performance issue with the KSPSolve function. Specifically, *I have
> noticed that the time taken by **KSPSolve** is **almost two times **greater
> than the CPU time for the matrix-vector product multiplied by the number of
> iteration steps*. I use C++ chrono to record CPU time.
> For context, I am using a shell system matrix A. Despite my efforts to
> parallelize the matrix-vector product (Ax), the overall solve time remains
> higher than the per-iteration matrix-vector product time suggests when
> multiple threads are used. Here are a few details of my setup:
>
>    - *Matrix Type*: Shell system matrix
>    - *Preconditioner*: Shell PC
>    - *Parallel Environment*: Using Intel MKL as PETSc’s BLAS/LAPACK
>    library, multithreading is enabled
>
> I have considered several potential reasons, such as preconditioner setup,
> additional solver operations, and the inherent overhead of using a shell
> system matrix. *However, since KSPSolve is a high-level API, I have been
> unable to pinpoint the exact cause of the increased solve time.*
> Have you observed the same issue? Could you please share some experience
> on how to diagnose and address this performance discrepancy?
> Any insights or recommendations you could offer would be greatly
> appreciated.
>
>
> For any performance question like this, we need to see the output of your
> code run with
>
>   -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view
>
>   Thanks,
>
>      Matt
>
>
> Thank you for your time and assistance.
> Best regards,
> Yongzhong
> -----------------------------------------------------------
> *Yongzhong Li*
> PhD student | Electromagnetics Group
> Department of Electrical & Computer Engineering
> University of Toronto
> https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIug_3RUa$ 
> <https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!cuLttMJEcegaqu461Bt4QLsO4fASfLM5vjRbtyNhWJQiInbjgNwkGNdkFE1ebSbFjOUatYB0-jd2yQWMWzqkDFFjwMvNl3ZKAr8$>
>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIqd3E3yv$ 
> <https://urldefense.us/v3/__http://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!djGfJnEhNJROfsMsBJy5u_KoRKbug55xZ64oHKUFnH2cWku_Th1hwt4TDdoMd8pWYVDzJeqJslMNZwpO3y0Et94d31qkNOuenGA$>
> <ksp_petsc_log.txt>
>
>
> <ksp_petsc_log.txt><ksp_petsc_log_noguess.txt>
>
>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIqd3E3yv$ 
> <https://urldefense.us/v3/__http://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fVvbGldqcUV5ju4jpu5oGmt-VjITi5JpCJzhHxpbgsERLVYZzglpxKOOyrBRGxjRxp7vWHwt3SnINFOQErR1Z8kcDcf3cNeD9Gw$>
>
>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIqd3E3yv$ 
> <https://urldefense.us/v3/__http://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!aQJpmm5W6l6FUiumnIPmkouzwzNUfx-Dyq04i1O2KS_InQGk6qjI7wUir0Hx6QEUQE2AMiJDsez3x2Os2C2d$>
>
>
>
> <petsc_log_comparison.txt>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240628/56ea1970/attachment-0001.html>


More information about the petsc-users mailing list