[petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

Junchao Zhang junchao.zhang at gmail.com
Fri Jun 28 13:33:00 CDT 2024


OK, then you need '--with-mkl_pardiso-dir='+os.environ['MKLROOT'] in your PETSc
configure.
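(That is, add that entry to the configure_options list of a Python configure
script, or equivalently pass --with-mkl_pardiso-dir=$MKLROOT directly on the
./configure command line, assuming MKLROOT is set in your environment.)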

--Junchao Zhang


On Fri, Jun 28, 2024 at 1:05 PM Pierre Jolivet <pierre.jolivet at lip6.fr>
wrote:

>
>
> On 28 Jun 2024, at 7:20 PM, Junchao Zhang <junchao.zhang at gmail.com> wrote:
>
> Hi, Yongzhong,
>    It is great to see you have made such good progress.  Barry is right:
> you need -vec_maxpy_use_gemv 1.  My mistake for not mentioning it
> earlier. But even with that, there are still problems.
>    PETSc tries to optimize VecMDot/VecMAXPY with BLAS GEMV, in the hope that
> vendors' BLAS libraries are highly optimized for it. However, we found that
> they were good for VecMDot but not for VecMAXPY, so by default
> PETSc disables the GEMV optimization for VecMAXPY.  One can use
> -vec_maxpy_use_gemv 1 to turn it on.
>    I turned it on and tested VecMAXPY with ex2k and MKL, but failed to see
> any improvement with multiple threads.  I could not understand why MKL is
> so bad at it.   You can try it yourself in your environment.
>    Without the GEMV optimization, VecMAXPY() is implemented in PETSc with
> a batch of PetscKernelAXPY() kernels, which contain simple for loops that are
> not OpenMP-parallelized (since PETSc does not support OpenMP outright).  I
> added an "omp parallel for" pragma to the PetscKernelAXPY() kernels and tested
> ex2k again with the now-parallelized PETSc. Here are the results (a sketch of
> the kind of change is shown after the tables).
>
>  $ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m
> 2 -test_name VecMAXPY -vec_maxpy_use_gemv 0
> Vector(N)      VecMAXPY-1    VecMAXPY-3    VecMAXPY-8    VecMAXPY-30  (us)
> --------------------------------------------------------------------------
>          128         7.0         10.1         21.4         72.7
>          256         7.9         12.9         29.5        101.0
>          512         9.4         17.2         40.5        136.2
>         1024        15.9         27.3         67.5        249.3
>         2048        26.5         48.7        139.6        432.7
>         4096        47.1         77.3        186.4        710.3
>         8192        84.8        152.2        423.9       1580.6
>        16384       154.9        298.5        792.1       2889.2
>        32768       183.7        338.7        893.9       3436.2
>        65536       639.1       1247.8       3219.1      12494.8
>       131072      1125.2       1856.2       6843.0      23653.7
>       262144      2603.2       4948.4      13259.4      51287.7
>       524288      5093.6      10305.0      26451.7      96919.6
>      1048576      5898.6      10947.2      45486.4     127352.8
>      2097152     11845.4      21912.5      57999.6     331403.4
>
> $ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=16 ./ex2k -n 15 -m
> 2 -test_name VecMAXPY -vec_maxpy_use_gemv 0
> Vector(N)      VecMAXPY-1    VecMAXPY-3    VecMAXPY-8    VecMAXPY-30  (us)
> --------------------------------------------------------------------------
>          128        17.0         16.1         31.5        112.9
>          256        13.7         16.8         31.2        120.2
>          512        14.5         18.1         33.9        129.9
>         1024        16.5         21.0         38.5        150.4
>         2048        18.5         22.1         41.8        171.4
>         4096        21.0         25.4         55.3        212.3
>         8192        27.0         30.3         68.6        251.9
>        16384        32.2         44.5         93.3        350.5
>        32768        45.8         65.0        149.8        558.8
>        65536        59.7        102.8        247.1        946.0
>       131072       100.7        186.4        485.3       1898.1
>       262144       183.4        345.2        922.2       3567.0
>       524288       339.6        676.8       1820.7       7530.4
>      1048576       662.0       1364.7       3585.3      13969.1
>      2097152      1379.7       2788.6       7414.0      28275.3
>
> We can see that VecMAXPY() can easily be sped up with multithreading.
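>
> To be concrete, here is a minimal sketch of the kind of change I mean; it is
> not the actual PETSc source, and the function and type names are simplified
> for illustration:
>
> #include <omp.h>
>
> /* Schematic of a fused 4-term AXPY kernel, y += a0*x0 + a1*x1 + a2*x2 + a3*x3,
>    threaded with OpenMP.  The real PetscKernelAXPY kernels are similar simple
>    loops over PetscScalar (complex in this discussion). */
> static void KernelAXPY4_threaded(int n, double a0, double a1, double a2, double a3,
>                                  const double *x0, const double *x1,
>                                  const double *x2, const double *x3, double *y)
> {
>   #pragma omp parallel for
>   for (int i = 0; i < n; i++) y[i] += a0*x0[i] + a1*x1[i] + a2*x2[i] + a3*x3[i];
> }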
>
> For MatSolve, I checked PETSc's aijmkl.c and found that we don't have an
> interface to MKL's sparse solve.
>
>
> We do, it’s in src/mat/impls/aij/seq/mkl_pardiso, and it’s threaded (the
> distributed version is in src/mat/impls/aij/mpi/mkl_cpardiso).
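>
> For reference, a minimal sketch of selecting that solver at run time, assuming
> PETSc was configured with --with-mkl_pardiso-dir=$MKLROOT (variable names and
> error-checking style are illustrative; adapt to your code):
>
> #include <petscksp.h>
>
> /* Sketch: use MKL PARDISO as the (threaded) direct solver behind PCLU. */
> static PetscErrorCode UseMklPardiso(KSP ksp)
> {
>   PC pc;
>   PetscFunctionBeginUser;
>   PetscCall(KSPGetPC(ksp, &pc));
>   PetscCall(PCSetType(pc, PCLU));
>   PetscCall(PCFactorSetMatSolverType(pc, MATSOLVERMKL_PARDISO));
>   PetscFunctionReturn(PETSC_SUCCESS);
> }
>
> /* or, equivalently, from the command line:
>      -pc_type lu -pc_factor_mat_solver_type mkl_pardiso */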
>
> Thanks,
> Pierre
>
>   I checked
> https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2023-0/openmp-threaded-functions-and-problems.html,
> but I am confused by MKL's list of threaded functions:
>
>    - Direct sparse solver.
>    - All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse
>    Triangular solvers.
>
> I don't know whether MKL has a threaded sparse solver.
>
> --Junchao Zhang
>
>
> On Fri, Jun 28, 2024 at 11:35 AM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>   Are you running with -vec_maxpy_use_gemv ?
>>
>>
>> On Jun 28, 2024, at 1:46 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>> Thanks all for your help!!!
>>
>> I think I found the issue. I am compiling a large CMake project that
>> relies on many external libraries (projects). Previously, I used OpenBLAS
>> as the BLAS for all the dependencies, including PETSc. After I switched to
>> Intel MKL for PETSc, I still kept OpenBLAS as the BLAS for all the other
>> dependencies. It seems that even though I pointed blas-lapack-dir to
>> MKLROOT when PETSc was configured, the actual program still used OpenBLAS
>> for some PETSc functions, such as VecMDot() and VecMAXPY(), which is why I
>> didn't see any MKL verbose output during KSPSolve(). Now I have removed
>> OpenBLAS and use Intel MKL as the BLAS for all the dependencies. The issue
>> is resolved: I can clearly see MKL routines being called when KSP GMRES is
>> running.
>>
>> Back to my original question: my goal is to achieve good parallelization
>> efficiency for the KSP GMRES solve. Since I use multithreading-enabled MKL
>> spmv routines, the wall time for MatMult/MatMultAdd() has been greatly
>> reduced. However, KSPGMRESOrthog and the MatSolve in PCApply still take over
>> 50% of the solve time and do not benefit from multithreading. *After I fixed
>> the issue I mentioned, I got around a 15% time reduction because of more
>> efficient VecMDot() calls.* I attach a PETSc log comparison for your
>> reference (same settings, the only difference being whether MKL BLAS is used
>> or not); you can see that the percentage of VecMDot() is reduced. However,
>> here comes the interesting part: *VecMAXPY() didn't benefit from MKL BLAS;
>> it still takes almost 40% of the solution time when I use 64 MKL threads*,
>> which is a lot for my program. And if I multiply this percentage by the
>> actual wall time for different numbers of threads, it stays the same. Then I
>> used the ex2k benchmark to verify what I found. Here are the results,
>>
>> $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 5 -test_name VecMAXPY
>> Vector(N)      VecMAXPY-1    VecMAXPY-3    VecMAXPY-8    VecMAXPY-30  (us)
>> --------------------------------------------------------------------------
>>          128         0.4          0.9          2.4          8.8
>>          256         0.3          1.1          3.5         13.3
>>          512         0.5          4.4          6.7         26.5
>>         1024         0.9          4.8         13.3         51.0
>>         2048         3.5         12.3         37.1         94.7
>>         4096         4.3         24.5         73.6        179.6
>>         8192         6.3         48.7         98.9        380.8
>>        16384         9.3         99.2        200.2        774.0
>>        32768        30.6        155.4        421.2       1662.9
>>        65536       101.2        269.4        827.4       3565.0
>>       131072       206.9        551.0       1829.0       7580.5
>>       262144       450.2       1251.9       3986.2      15525.6
>>       524288      1322.1       2901.7       8567.1      31840.0
>>      1048576      2788.6       6190.6      16394.7      63514.9
>>      2097152      5534.8      12619.9      35427.4     130064.5
>> $ MKL_NUM_THREADS=8 ./ex2k -n 15 -m 5 -test_name VecMAXPY
>> Vector(N)      VecMAXPY-1    VecMAXPY-3    VecMAXPY-8    VecMAXPY-30  (us)
>> --------------------------------------------------------------------------
>>          128         0.3          0.7          2.4          8.8
>>          256         0.3          1.1          3.6         13.5
>>          512         0.5          4.4          6.8         26.4
>>         1024         0.9          4.8         13.6         50.5
>>         2048         7.6         12.2         36.5         95.0
>>         4096         8.5         25.7         72.4        182.6
>>         8192        11.9         48.5        103.7        383.7
>>        16384        12.8         97.7        203.7        785.0
>>        32768        11.2        148.5        421.9       1681.5
>>        65536        15.5        271.2        843.8       3613.7
>>       131072        34.3        564.7       1905.2       7558.8
>>       262144       106.4       1334.5       4002.8      15458.3
>>       524288       217.2       2858.4       8407.9      31303.7
>>      1048576       701.5       6060.6      16947.3      64118.5
>>      2097152      1769.7      13218.3      36347.3     131062.9
>>
>> It stays the same: no benefit from the multithreaded BLAS! This is unlike
>> what I found for VecMDot(), where I did see a speedup with more threads.
>> Then I dug deeper. *I learned that VecMDot() calls ZGEMV, while VecMAXPY()
>> calls ZAXPY. This observation seems to indicate that ZAXPY is not benefiting
>> from MKL threads.*
>>
>> My question is: *do you know why ZAXPY is not multithreaded?* From my
>> perspective, VecMDot() and VecMAXPY() are very similar operations; the
>> only difference is whether we need to scale the vectors being multiplied or
>> not. I think you mentioned that you recently did some optimization of
>> these two routines; *from my above results and observations, is this
>> aligned with your expectations?* Could we further optimize the code to
>> get more parallelization efficiency in my case?
>>
>> *And another question: can MatSolve() in KSPSolve be multithreaded? Would
>> MUMPS help?*
>>
>> Thank you and regards,
>> Yongzhong
>>
>> *From:* Junchao Zhang <junchao.zhang at gmail.com>
>> *Sent:* Thursday, June 27, 2024 11:10 AM
>> *To:* Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc:* Barry Smith <bsmith at petsc.dev>; petsc-users at mcs.anl.gov
>> *Subject:* Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>> How big is n when you call PetscCallBLAS("BLASgemv", BLASgemv_(trans,
>> &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione))?  n is
>> the vector length in VecMDot.
>> It is strange that with MKL_VERBOSE=1 you did not see MKL_VERBOSE ZGEMV ...,
>> since the code did call gemv. Perhaps you need to double-check your spelling, etc.
>>
>> You could also run ex2k, potentially modifying Ms[] and Ns[] to match the
>> sizes in your code, to see if there is a speedup with more threads.
>>
>> --Junchao Zhang
>>
>>
>> On Thu, Jun 27, 2024 at 9:39 AM Yongzhong Li <
>> yongzhong.li at mail.utoronto.ca> wrote:
>>
>>
>> Mostly 3, maximum 7, but definitely hit the point when m > 1,
>>
>> I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one,
>> yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple
>> times
>>
>>
>> *From: *Barry Smith <bsmith at petsc.dev>
>> *Date: *Thursday, June 27, 2024 at 1:12 AM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>>   How big are the m's getting in your code?
>>
>>
>>
>>
>> On Jun 27, 2024, at 12:40 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>> Hi Barry, I used gdb to debug my program and set a breakpoint at the
>> VecMultiDot_Seq_GEMV function. I did see that when I step through this
>> function it calls BLAS (but not always, only if m > 1), as shown below.
>> However, I still didn't see any MKL output even though I set MKL_VERBOSE=1.
>>
>> (gdb)
>> 550            PetscCall(VecRestoreArrayRead(yin[i], &yfirst));
>> (gdb)
>> 553            m = j - i;
>> (gdb)
>> 554            if (m > 1) {
>> (gdb)
>> 555              PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above
>> (gdb)
>> 556              PetscScalar  one = 1, zero = 0;
>> (gdb)
>> 558              PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione));
>> (gdb) s
>> PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> "VecMultiDot_Seq_GEMV",
>>     file=0x7ffff68a1078 "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c")
>>     at /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106
>> 106          if (!TRdebug) return PETSC_SUCCESS;
>> (gdb)
>> 154        }
>>
>> Am I not using MKL BLAS, and is that why I didn't see a multithreading
>> speedup for KSPGMRESOrthog? What do you think the potential reasons could
>> be? Is there any silent mode that might affect the MKL verbose output?
>>
>> Thank you and best regards,
>> Yongzhong
>>
>>
>> *From: *Barry Smith <bsmith at petsc.dev>
>> *Date: *Wednesday, June 26, 2024 at 8:15 PM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>>   if (m > 1) {
>>       PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is
>> safe since we've screened out those lda > PETSC_BLAS_INT_MAX above
>>       PetscScalar  one = 1, zero = 0;
>>
>>       PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray,
>> &lda2, xarray, &ione, &zero, z + i, &ione));
>>       PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0)));
>>
>> The call to BLAS above is where it uses MKL.
>>
>>
>>
>>
>>
>> On Jun 26, 2024, at 6:59 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>> Hi Barry, I am looking into the source code of VecMultiDot_Seq_GEMV at
>> https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html#VecMDot_Seq
>> Can I ask which lines of code indicate the use of Intel MKL?
>>
>> Thanks,
>> Yongzhong
>>
>>
>> *From: *Barry Smith <bsmith at petsc.dev>
>> *Date: *Wednesday, June 26, 2024 at 10:30 AM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>>    In a debug version of PETSc, run your application in a debugger and put
>> a breakpoint in VecMultiDot_Seq_GEMV.  Then "next" through the code from
>> that point to see what decision it makes about using dgemv(), and why it
>> is not getting into the Intel code.
>>
>>
>>
>>
>>
>> On Jun 25, 2024, at 11:19 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>>
>> Hi Junchao, thank you for your help with these benchmarking tests!
>>
>> I checked out petsc/main and did a few things to verify on my side.
>>
>> 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute
>> node. The results are as follows,
>> $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4
>> Vector(N)      VecMDot-1    VecMDot-3    VecMDot-8    VecMDot-30  (us)
>> --------------------------------------------------------------------------
>>          128        14.5          1.2          1.8          5.2
>>          256         1.5          0.9          1.6          4.7
>>          512         2.7          2.8          6.1         13.2
>>         1024         4.0          4.0          9.3         16.4
>>         2048         7.4          7.3         11.3         39.3
>>         4096        14.2         13.9         19.1         93.4
>>         8192        28.8         26.3         25.4         31.3
>>        16384        54.1         25.8         26.7         33.8
>>        32768       109.8         25.7         24.2         56.0
>>        65536       220.2         24.4         26.5         89.0
>>       131072       424.1         31.5         36.1        149.6
>>       262144       898.1         37.1         53.9        286.1
>>       524288      1754.6         48.7        100.3       1122.2
>>      1048576      3645.8         86.5        347.9       2950.4
>>      2097152      7371.4        308.7       1440.6       6874.9
>>
>> $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4
>> Vector(N)      VecMDot-1    VecMDot-3    VecMDot-8    VecMDot-30  (us)
>> --------------------------------------------------------------------------
>>          128        14.9          1.2          1.9          5.2
>>          256         1.5          1.0          1.7          4.7
>>          512         2.7          2.8          6.1         12.0
>>         1024         3.9          4.0          9.3         16.8
>>         2048         7.4          7.3         10.4         41.3
>>         4096        14.0         13.8         18.6         84.2
>>         8192        27.0         21.3         43.8        177.5
>>        16384        54.1         34.1         89.1        330.4
>>        32768       110.4         82.1        203.5        781.1
>>        65536       213.0        191.8        423.9       1696.4
>>       131072       428.7        360.2        934.0       4080.0
>>       262144       883.4        723.2       1745.6      10120.7
>>       524288      1817.5       1466.1       4751.4      23217.2
>>      1048576      3611.0       3796.5      11814.9      48687.7
>>      2097152      7401.9      10592.0      27543.2     106565.4
>>
>> I can see the speedup brought by more MKL threads, and if I set
>> MKL_VERBOSE to 1, I can see something like
>>
>> MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:6 ca
>>
>> From my understanding, VecMDot()/VecMAXPY() can benefit from more MKL
>> threads on my compute node and are using the ZGEMV MKL BLAS routine.
>>
>> However, when I ran my own program and set MKL_VERBOSE to 1, strangely I
>> still could not find any MKL output, even though I can see from the PETSc
>> log that VecMDot() and VecMAXPY() are called.
>>
>> I am wondering: are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a
>> way that is similar to the ex2k test? Should I expect to see MKL output for
>> whatever linear system I solve with KSPGMRES? Does it depend on whether the
>> matrix is dense or sparse? Although I don't really understand why
>> VecMDot/MAXPY() would have anything to do with dense matrix-vector
>> multiplication.
>>
>> Thank you,
>>
>> Yongzhong
>>
>> *From: *Junchao Zhang <junchao.zhang at gmail.com>
>> *Date: *Tuesday, June 25, 2024 at 6:34 PM
>> *To: *Matthew Knepley <knepley at gmail.com>
>> *Cc: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>, Pierre Jolivet <
>> pierre at joliv.et>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>> Hi, Yongzhong,
>>   Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if
>> we can speed up those two with OpenMP threads, then we can speed up
>> KSPGMRESOrthog.  We recently added an optimization that does VecMDot/MAXPY()
>> as a dense matrix-vector multiplication (i.e., BLAS2 GEMV with tall-and-skinny
>> matrices).  So with MKL_VERBOSE=1 you should see something like
>> "MKL_VERBOSE ZGEMV ..." in the output.  If not, could you try again with
>> petsc/main?
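>>   (To make the equivalence concrete, in my shorthand rather than a quote
>> from the source: if Y = [y_1 ... y_m] is the n-by-m tall-and-skinny matrix
>> whose columns are the basis vectors, then VecMDot computes z = Y^H x, a
>> single GEMV with trans='C', and VecMAXPY computes y <- y + Y*alpha, a GEMV
>> with trans='N'.)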
>>   PETSc has a microbenchmark (vec/vec/tests/ex2k.c) to test them.  I ran
>> VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS); it was
>> strange to see no speedup.  I then configured PETSc with OpenBLAS, and I did
>> see better performance with more threads:
>>
>> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4
>> Vector(N)      VecMDot-3    VecMDot-8    VecMDot-30  (us)
>> --------------------------------------------------------------------------
>>          128         2.0          2.5          6.1
>>          256         1.8          2.7          7.0
>>          512         2.1          3.1          8.6
>>         1024         2.7          4.0         12.3
>>         2048         3.8          6.3         28.0
>>         4096         6.1         10.6         42.4
>>         8192        10.9         21.8         79.5
>>        16384        21.2         39.4        149.6
>>        32768        45.9         75.7        224.6
>>        65536       142.2        215.8        732.1
>>       131072       169.1        233.2       1729.4
>>       262144       367.5        830.0       4159.2
>>       524288       999.2       1718.1       8538.5
>>      1048576      2113.5       4082.1      18274.8
>>      2097152      5392.6      10273.4      43273.4
>>
>>
>> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4
>> Vector(N)      VecMDot-3    VecMDot-8    VecMDot-30  (us)
>> --------------------------------------------------------------------------
>>          128         2.0          2.5          6.0
>>          256         1.8          2.7         15.0
>>          512         2.1          9.0         16.6
>>         1024         2.6          8.7         16.1
>>         2048         7.7         10.3         20.5
>>         4096         9.9         11.4         25.9
>>         8192        14.5         22.1         39.6
>>        16384        25.1         27.8         67.8
>>        32768        44.7         95.7         91.5
>>        65536        82.1        156.8        165.1
>>       131072       194.0        335.1        341.5
>>       262144       388.5        380.8        612.9
>>       524288      1046.7        967.1       1653.3
>>      1048576      1997.4       2169.0       4034.4
>>      2097152      5502.9       5787.3      12608.1
>>
>> The tall-and-skinny matrices in KSPGMRESOrthog vary in width, so the
>> average speedup depends on that mix.  I suggest you run ex2k to see whether
>> oneMKL can speed up the kernels in your environment.
>>
>> --Junchao Zhang
>>
>>
>> On Mon, Jun 24, 2024 at 11:35 AM Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>> Let me run some examples on our end to see whether the code calls
>> expected functions.
>>
>> --Junchao Zhang
>>
>>
>> On Mon, Jun 24, 2024 at 10:46 AM Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>> On Mon, Jun 24, 2024 at 11:21 AM Yongzhong Li <
>> yongzhong.li at mail.utoronto.ca> wrote:
>>
>> Thank you Pierre for your information. Do we have a conclusion for my
>> original question about the parallelization efficiency for different stages
>> of KSP Solve? Do we need to do more testing to figure out the issues?
>>
>>
>> We have an extended discussion of this here:
>> https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup
>>
>> The kinds of operations you are talking about (SpMV, VecDot, VecAXPY,
>> etc) are memory bandwidth limited. If there is no more bandwidth to be
>> marshalled on your board, then adding more processes does nothing at all.
>> This is why people were asking about how many "nodes" you are running on,
>> because that is the unit of memory bandwidth, not "cores" which make little
>> difference.
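>>
>> (That FAQ also describes how to measure the achievable memory bandwidth on
>> your machine with the STREAMS benchmark that ships with PETSc, roughly
>> "make streams" in $PETSC_DIR; that number bounds how much speedup is even
>> possible for these operations.)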
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>> Thank you,
>> Yongzhong
>>
>>
>> *From: *Pierre Jolivet <pierre at joliv.et>
>> *Date: *Sunday, June 23, 2024 at 12:41 AM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>>
>>
>>
>> On 23 Jun 2024, at 4:07 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>>
>> Yeah, I ran my program again using -mat_view ::ascii_info and set
>> MKL_VERBOSE to 1, and the output shows that the matrices are of seqaijmkl
>> type (I've attached a few below)
>>
>> --> Setting up matrix-vector products...
>>
>> Mat Object: 1 MPI process
>>   type: seqaijmkl
>>   rows=16490, cols=35937
>>   total: nonzeros=128496, allocated nonzeros=128496
>>   total number of mallocs used during MatSetValues calls=0
>>     not using I-node routines
>> Mat Object: 1 MPI process
>>   type: seqaijmkl
>>   rows=16490, cols=35937
>>   total: nonzeros=128496, allocated nonzeros=128496
>>   total number of mallocs used during MatSetValues calls=0
>>     not using I-node routines
>>
>> --> Solving the system...
>>
>> Excitation 1 of 1...
>>
>> ================================================
>> Iterative solve completed in 7435 ms.
>> CONVERGED: rtol.
>> Iterations: 72
>> Final relative residual norm: 9.22287e-07
>> ================================================
>> [CPU TIME] System solution: 2.27160000e+02 s.
>> [WALL TIME] System solution: 7.44387218e+00 s.
>>
>> However, it seems to me that there is still no MKL output even though I set
>> MKL_VERBOSE to 1, although I think there should be many spmv operations
>> during KSPSolve(). Do you see any possible reasons?
>>
>>
>> SpMV is not reported with MKL_VERBOSE (last I checked); only dense BLAS
>> is.
>>
>> Thanks,
>> Pierre
>>
>>
>>
>> Thanks,
>> Yongzhong
>>
>>
>>
>> *From: *Matthew Knepley <knepley at gmail.com>
>> *Date: *Saturday, June 22, 2024 at 5:56 PM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *Junchao Zhang <junchao.zhang at gmail.com>, Pierre Jolivet <
>> pierre at joliv.et>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>> On Sat, Jun 22, 2024 at 5:03 PM Yongzhong Li <
>> yongzhong.li at mail.utoronto.ca> wrote:
>>
>>
>> MKL_VERBOSE=1 ./ex1
>>
>> matrix nonzeros = 100, allocated nonzeros = 100
>> MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for
>> Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R)
>> AVX-512) with support of Vector Neural Network Instructions enabled
>> processors, Lnx 2.50GHz lp64 gnu_thread
>> MKL_VERBOSE
>> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1)
>> 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0)
>> 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE
>> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10)
>> 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE
>> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1)
>> 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1
>> FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE
>> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10)
>> 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1
>> FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE
>> ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1)
>> 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0)
>> 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0)
>> 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0)
>> 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE
>> ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15)
>> 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0)
>> 730ns CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0)
>> 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF
>> Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0)
>> 685ns CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0)
>> 390ns CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0)
>> 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF
>> Dyn:1 FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> Yes, for the PETSc example there is MKL output, but not for my own program.
>> All I did was change the matrix type from MATAIJ to MATAIJMKL to get
>> optimized spmv performance from MKL. Should I expect to see any MKL
>> output in this case?
>>
>>
>> Are you sure that the type changed? You can MatView() the matrix with
>> format ascii_info to see.
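>>
>> A minimal sketch, assuming a sequential AIJ matrix A that is already
>> assembled (names are illustrative):
>>
>> #include <petscmat.h>
>>
>> /* Convert an assembled matrix to the MKL-backed format and print its
>>    ascii_info so the "type:" line can be checked. */
>> static PetscErrorCode CheckAijMkl(Mat A)
>> {
>>   PetscFunctionBeginUser;
>>   PetscCall(MatConvert(A, MATSEQAIJMKL, MAT_INPLACE_MATRIX, &A));
>>   PetscCall(PetscViewerPushFormat(PETSC_VIEWER_STDOUT_SELF, PETSC_VIEWER_ASCII_INFO));
>>   PetscCall(MatView(A, PETSC_VIEWER_STDOUT_SELF));
>>   PetscCall(PetscViewerPopFormat(PETSC_VIEWER_STDOUT_SELF));
>>   PetscFunctionReturn(PETSC_SUCCESS);
>> }
>>
>> /* or, from the command line: -mat_type seqaijmkl -mat_view ::ascii_info */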
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>
>> Thanks,
>> Yongzhong
>>
>>
>> *From: *Junchao Zhang <junchao.zhang at gmail.com>
>> *Date: *Saturday, June 22, 2024 at 9:40 AM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *Pierre Jolivet <pierre at joliv.et>, petsc-users at mcs.anl.gov <
>> petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>> No, you don't.  It is strange.  Perhaps you could run a PETSc example
>> first and see if MKL is really used:
>> $ cd src/mat/tests
>> $ make ex1
>> $ MKL_VERBOSE=1 ./ex1
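>>
>> (You could also check which BLAS library your executable actually links
>> against, e.g. with something like "ldd ./ex1 | grep -i -E 'mkl|openblas'";
>> the exact command depends on your system. If both OpenBLAS and MKL show up,
>> the symbols may resolve to the one you did not intend.)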
>>
>> --Junchao Zhang
>>
>>
>> On Fri, Jun 21, 2024 at 4:03 PM Yongzhong Li <
>> yongzhong.li at mail.utoronto.ca> wrote:
>>
>> I am using
>>
>> export MKL_VERBOSE=1
>> ./xx
>>
>> in the bash file; do I have to use -ksp_converged_reason?
>>
>> Thanks,
>> Yongzhong
>>
>>
>> *From: *Pierre Jolivet <pierre at joliv.et>
>> *Date: *Friday, June 21, 2024 at 1:47 PM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *Junchao Zhang <junchao.zhang at gmail.com>, petsc-users at mcs.anl.gov <
>> petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>> How do you set the variable?
>>
>> $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason
>> MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64
>> architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled
>> processors, Lnx 2.80GHz lp64 intel_thread
>> MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1
>> FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1
>> FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1
>> FastMM:1 TID:0  NThr:1
>> MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1
>> TID:0  NThr:1
>> [...]
>>
>>
>> On 21 Jun 2024, at 7:37 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>> Hello all,
>>
>> I set MKL_VERBOSE = 1, but observed no print output specific to the use
>> of MKL. Does PETSc enable this verbose output?
>>
>> Best,
>>
>> Yongzhong
>>
>>
>> *From: *Pierre Jolivet <pierre at joliv.et>
>> *Date: *Friday, June 21, 2024 at 1:36 AM
>> *To: *Junchao Zhang <junchao.zhang at gmail.com>
>> *Cc: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>,
>> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>>
>>
>> On 21 Jun 2024, at 6:42 AM, Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>> I remember there are some MKL env vars to print MKL routines called.
>>
>>
>> The environment variable is MKL_VERBOSE
>>
>> Thanks,
>> Pierre
>>
>>
>> Maybe we can try it to see what MKL routines are really used and then we
>> can understand why some petsc functions did not speed up
>>
>> --Junchao Zhang
>>
>>
>> On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li <
>> yongzhong.li at mail.utoronto.ca> wrote:
>>
>>
>> Hi Barry, sorry about my last results. I didn't fully understand the stage
>> profiling and logging in PETSc; now I record only the KSPSolve() stage of my
>> program. Some sample code follows,
>>
>>                 // Static variable to keep track of the stage counter
>>                 static int stageCounter = 1;
>>
>>                 // Generate a unique stage name
>>                 std::ostringstream oss;
>>                 oss << "Stage " << stageCounter << " of Code";
>>                 std::string stageName = oss.str();
>>
>>                 // Register the stage
>>                 PetscLogStage stagenum;
>>
>>                 PetscLogStageRegister(stageName.c_str(), &stagenum);
>>                 PetscLogStagePush(stagenum);
>>
>>                 KSPSolve(*ksp_ptr, b, x);
>>
>>                 PetscLogStagePop();
>>                 stageCounter++;
>>
>> I have attached my new logging results; there is 1 main stage and 4
>> other stages, each of which is a KSPSolve() call.
>>
>> To provide some additional background, if you recall, I have been trying
>> to get an efficient iterative solution using multithreading. I found that by
>> compiling PETSc with the Intel MKL library instead of OpenBLAS I am able to
>> perform sparse matrix-vector multiplication faster; I am using
>> MATSEQAIJMKL. This makes the shell matrix-vector product in each iteration
>> scale well with the number of threads. However, I found that the total GMRES
>> solve time (~KSPSolve() time) is not scaling well with the number of threads.
>>
>> From the logging results I learned that when performing KSPSolve() there
>> is some CPU overhead in PCApply() and KSPGMRESOrthog(). I ran my program
>> with different numbers of threads and plotted the time consumption of
>> PCApply() and KSPGMRESOrthog() against the number of threads. I found that
>> these two operations are not scaling with the threads at all! My results are
>> attached as a PDF to give you a clear view.
>>
>> My question is,
>>
>> From my understanding, PCApply involves MatSolve() and KSPGMRESOrthog()
>> involves many vector operations, so why don't these two parts scale well
>> with the number of threads when the Intel MKL library is linked?
>>
>> Thank you,
>> Yongzhong
>>
>>
>> *From: *Barry Smith <bsmith at petsc.dev>
>> *Date: *Friday, June 14, 2024 at 11:36 AM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
>> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
>> piero.triverio at utoronto.ca>
>> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
>> Performance Issue
>>
>>    I am a bit confused. Without the initial guess computation, there are
>> still a bunch of events I don't understand
>>
>> MatTranspose          79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatMatMultSym        110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatMatMultNum         90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatMatMatMultSym      20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatRARtSym            25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatMatTrnMultSym      25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatMatTrnMultNum      25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0   275
>> MatTrnMatMultSym      10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatTrnMatMultNum      10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>
>> in addition there are many more VecMAXPY than VecMDot (in GMRES they are
>> each done the same number of times)
>>
>> VecMDot             5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00
>> 0.0e+00  8 10  0  0  0   8 10  0  0  0 12016
>> VecMAXPY           22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00
>> 0.0e+00 39 20  0  0  0  39 20  0  0  0  4913
>>
>> Finally there are a huge number of
>>
>> MatMultAdd        258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00
>> 0.0e+00  7 29  0  0  0   7 29  0  0  0 43025
>>
>> Are you making calls to all these routines? Are you doing this inside
>> your MatMult() or before you call KSPSolve?
>>
>> The reason I wanted you to make a simpler run without the initial guess
>> code is that your events are far more complicated than would be produced by
>> GMRES alone so it is not possible to understand the behavior you are seeing
>> without fully understanding all the events happening in the code.
>>
>>   Barry
>>
>>
>>
>> On Jun 14, 2024, at 1:19 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>> Thanks, I have attached the results without using any KSPGuess. At low
>> frequency, the iteration counts are quite close to those with KSPGuess,
>> specifically
>>
>>   KSPGuess Object: 1 MPI process
>>     type: fischer
>>     Model 1, size 200
>>
>> However, I found that at higher frequency the number of iteration steps is
>> significantly higher than with KSPGuess; I have attached both results
>> for your reference.
>>
>> Moreover, could I ask why the run without the KSPGuess options can be
>> used as a baseline comparison? What are we comparing here? How does it
>> relate to the performance issue/bottleneck I found? "*I have noticed
>> that the time taken by KSPSolve is almost two times greater than
>> the CPU time for the matrix-vector product multiplied by the number of
>> iterations*"
>>
>> Thank you!
>> Yongzhong
>>
>>
>> *From: *Barry Smith <bsmith at petsc.dev>
>> *Date: *Thursday, June 13, 2024 at 2:14 PM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
>> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
>> piero.triverio at utoronto.ca>
>> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
>> Performance Issue
>>
>>   Can you please run the same thing without the  KSPGuess option(s) for a
>> baseline comparison?
>>
>>    Thanks
>>
>>    Barry
>>
>>
>> On Jun 13, 2024, at 1:27 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>> Hi Matt,
>>
>> I have rerun the program with the keys you provided. The system output
>> when performing ksp solve and the final petsc log output were stored in a
>> .txt file attached for your reference.
>>
>> Thanks!
>> Yongzhong
>>
>>
>> *From: *Matthew Knepley <knepley at gmail.com>
>> *Date: *Wednesday, June 12, 2024 at 6:46 PM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
>> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
>> piero.triverio at utoronto.ca>
>> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
>> Performance Issue
>>
>> On Wed, Jun 12, 2024 at 6:36 PM Yongzhong Li <
>> yongzhong.li at mail.utoronto.ca> wrote:
>>
>>
>> Dear PETSc developers,
>> I hope this email finds you well.
>> I am currently working on a project using PETSc and have encountered a
>> performance issue with the KSPSolve function. Specifically, *I have
>> noticed that the time taken by KSPSolve is almost two times greater
>> than the CPU time for the matrix-vector product multiplied by the number of
>> iteration steps*. I use C++ chrono to record CPU time.
>> For context, I am using a shell system matrix A. Despite my efforts to
>> parallelize the matrix-vector product (Ax), the overall solve time
>> remains higher than the per-iteration matrix-vector product time indicates
>> when multiple threads are used. Here are a few details of my setup (a
>> minimal sketch of this setup follows the list):
>>
>>    - *Matrix Type*: Shell system matrix
>>    - *Preconditioner*: Shell PC
>>    - *Parallel Environment*: Using Intel MKL as PETSc’s BLAS/LAPACK
>>    library, multithreading is enabled
>>
>> I have considered several potential reasons, such as preconditioner
>> setup, additional solver operations, and the inherent overhead of using a
>> shell system matrix. *However, since KSPSolve is a high-level API, I
>> have been unable to pinpoint the exact cause of the increased solve time.*
>> Have you observed the same issue? Could you please share some
>> experience on how to diagnose and address this performance discrepancy?
>> Any insights or recommendations you could offer would be greatly
>> appreciated.
>>
>>
>> For any performance question like this, we need to see the output of your
>> code run with
>>
>>   -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>> Thank you for your time and assistance.
>> Best regards,
>> Yongzhong
>> -----------------------------------------------------------
>> *Yongzhong Li*
>> PhD student | Electromagnetics Group
>> Department of Electrical & Computer Engineering
>> University of Toronto
>> http://www.modelics.org
>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> <ksp_petsc_log.txt>
>>
>>
>> <ksp_petsc_log.txt><ksp_petsc_log_noguess.txt>
>>
>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>>
>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>>
>>
>>
>> <petsc_log_comparison.txt>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240628/7ef4aa71/attachment-0001.html>


More information about the petsc-users mailing list