[petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
Yongzhong Li
yongzhong.li at mail.utoronto.ca
Thu Jun 27 09:38:45 CDT 2024
Mostly 3, at most 7, but it definitely reaches the point where m > 1;
I can see that PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple times.
From: Barry Smith <bsmith at petsc.dev>
Date: Thursday, June 27, 2024 at 1:12 AM
To: Yongzhong Li <yongzhong.li at mail.utoronto.ca>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
How big are the m's getting in your code?
On Jun 27, 2024, at 12:40 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
Hi Barry, I used gdb to debug my program and set a breakpoint in the VecMultiDot_Seq_GEMV function. When I step through this function, I do see it call BLAS (but not always, only if m > 1), as shown below. However, I still don't see any MKL output even though I set MKL_VERBOSE=1.
(gdb)
550 PetscCall(VecRestoreArrayRead(yin[i], &yfirst));
(gdb)
553 m = j - i;
(gdb)
554 if (m > 1) {
(gdb)
555 PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above
(gdb)
556 PetscScalar one = 1, zero = 0;
(gdb)
558 PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione));
(gdb) s
PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> "VecMultiDot_Seq_GEMV",
file=0x7ffff68a1078 "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c")
at /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106
106 if (!TRdebug) return PETSC_SUCCESS;
(gdb)
154 }
Am I not using MKL BLAS? Is that why I don't see a multithreading speedup for KSPGMRESOrthog? What do you think the potential reasons could be? Is there some silent mode that could suppress the MKL verbose output?
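(For reference, here is a minimal standalone test of the kind of check I mean. It is not code from this thread: the vector length 262144 and the count of 8 vectors are arbitrary choices, and whether PETSc takes the GEMV path also depends on how the duplicated vectors are laid out in memory. Running it under MKL_VERBOSE=1 should print a ZGEMV/DGEMV line if MKL's GEMV is actually reached.)

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec         x, *y;
  PetscScalar dots[8];

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(VecCreateSeq(PETSC_COMM_SELF, 262144, &x));
  PetscCall(VecSet(x, 1.0));
  PetscCall(VecDuplicateVecs(x, 8, &y));   /* duplicate a group of work vectors, as KSP does internally */
  for (PetscInt i = 0; i < 8; i++) PetscCall(VecSet(y[i], (PetscScalar)(i + 1)));
  PetscCall(VecMDot(x, 8, y, dots));       /* the m > 1 case discussed above */
  PetscCall(VecDestroyVecs(8, &y));
  PetscCall(VecDestroy(&x));
  PetscCall(PetscFinalize());
  return 0;
}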
Thank you and best regards,
Yongzhong
From: Barry Smith <bsmith at petsc.dev>
Date: Wednesday, June 26, 2024 at 8:15 PM
To: Yongzhong Li <yongzhong.li at mail.utoronto.ca>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
if (m > 1) {
PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above
PetscScalar one = 1, zero = 0;
PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione));
PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0)));
The call to BLAS above is where it uses MKL.
On Jun 26, 2024, at 6:59 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
Hi Barry, I am looking into the source code of VecMultiDot_Seq_GEMV at https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html#VecMDot_Seq
Can I ask which lines of code indicate the use of Intel MKL?
Thanks,
Yongzhong
From: Barry Smith <bsmith at petsc.dev>
Date: Wednesday, June 26, 2024 at 10:30 AM
To: Yongzhong Li <yongzhong.li at mail.utoronto.ca>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
In a debug version of PETSc, run your application in a debugger and put a breakpoint in VecMultiDot_Seq_GEMV. Then step ("next") through the code from that point to see what decision it makes about calling dgemv() and why it is not getting into the Intel code.
On Jun 25, 2024, at 11:19 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
Hi Junchao, thank you for your help with these benchmarking tests!
I checked out petsc/main and did a few things to verify on my side.
1. I ran the microbenchmark test (vec/vec/tests/ex2k.c) on my compute node. The results are as follows:
$ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4
Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us)
--------------------------------------------------------------------------
128 14.5 1.2 1.8 5.2
256 1.5 0.9 1.6 4.7
512 2.7 2.8 6.1 13.2
1024 4.0 4.0 9.3 16.4
2048 7.4 7.3 11.3 39.3
4096 14.2 13.9 19.1 93.4
8192 28.8 26.3 25.4 31.3
16384 54.1 25.8 26.7 33.8
32768 109.8 25.7 24.2 56.0
65536 220.2 24.4 26.5 89.0
131072 424.1 31.5 36.1 149.6
262144 898.1 37.1 53.9 286.1
524288 1754.6 48.7 100.3 1122.2
1048576 3645.8 86.5 347.9 2950.4
2097152 7371.4 308.7 1440.6 6874.9
$ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4
Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us)
--------------------------------------------------------------------------
128 14.9 1.2 1.9 5.2
256 1.5 1.0 1.7 4.7
512 2.7 2.8 6.1 12.0
1024 3.9 4.0 9.3 16.8
2048 7.4 7.3 10.4 41.3
4096 14.0 13.8 18.6 84.2
8192 27.0 21.3 43.8 177.5
16384 54.1 34.1 89.1 330.4
32768 110.4 82.1 203.5 781.1
65536 213.0 191.8 423.9 1696.4
131072 428.7 360.2 934.0 4080.0
262144 883.4 723.2 1745.6 10120.7
524288 1817.5 1466.1 4751.4 23217.2
1048576 3611.0 3796.5 11814.9 48687.7
2097152 7401.9 10592.0 27543.2 106565.4
I can see the speedup brought by more MKL threads, and if I set MKL_VERBOSE to 1, I see something like:
MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca
From my understanding, VecMDot()/VecMAXPY() can benefit from more MKL threads on my compute node and are using the ZGEMV MKL BLAS routine.
However, when I run my own program with MKL_VERBOSE set to 1, it is very strange that I still cannot find any MKL output, even though I can see from the PETSc log that VecMDot() and VecMAXPY() are called.
I am wondering whether VecMDot and VecMAXPY in KSPGMRESOrthog are optimized in the same way as in the ex2k test. Should I expect to see MKL output for whatever linear system I solve with KSPGMRES? Does it depend on whether the matrix is dense or sparse? I don't really understand why VecMDot/VecMAXPY() would have anything to do with dense matrix-vector multiplication.
Thank you,
Yongzhong
From: Junchao Zhang <junchao.zhang at gmail.com>
Date: Tuesday, June 25, 2024 at 6:34 PM
To: Matthew Knepley <knepley at gmail.com>
Cc: Yongzhong Li <yongzhong.li at mail.utoronto.ca>, Pierre Jolivet <pierre at joliv.et>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
Hi, Yongzhong,
Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up those two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization that does VecMDot()/VecMAXPY() as a dense matrix-vector multiplication (i.e., BLAS2 GEMV with tall-and-skinny matrices). So with MKL_VERBOSE=1 you should see something like "MKL_VERBOSE ZGEMV ..." in the output. If not, could you try again with petsc/main?
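(As a rough illustration only, and not PETSc source: ignoring complex conjugation, the m dot products z[j] = sum_k Y(k,j)*x[k] are exactly one matrix-vector product z = Y^T x with a tall-and-skinny n-by-m matrix Y stored column by column, which is the shape of the single BLASgemv_ call inside VecMultiDot_Seq_GEMV.)

#include <stdio.h>

/* naive reference version of "m dot products as one GEMV" */
static void mdot_as_gemv(int n, int m, const double *Y, const double *x, double *z)
{
  for (int j = 0; j < m; j++) {            /* one output entry per column of Y */
    double s = 0.0;
    for (int k = 0; k < n; k++) s += Y[j * n + k] * x[k];
    z[j] = s;
  }
}

int main(void)
{
  const double Y[6] = {1, 2, 3, 4, 5, 6};  /* n = 3, m = 2, column-major */
  const double x[3] = {1, 1, 1};
  double       z[2];
  mdot_as_gemv(3, 2, Y, x, z);
  printf("z = [%g %g]\n", z[0], z[1]);     /* expect [6 15] */
  return 0;
}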
PETSc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (by setting MKL_NUM_THREADS) and, strangely, saw no speedup. I then configured PETSc with OpenBLAS and did see better performance with more threads:
$ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4
Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us)
--------------------------------------------------------------------------
128 2.0 2.5 6.1
256 1.8 2.7 7.0
512 2.1 3.1 8.6
1024 2.7 4.0 12.3
2048 3.8 6.3 28.0
4096 6.1 10.6 42.4
8192 10.9 21.8 79.5
16384 21.2 39.4 149.6
32768 45.9 75.7 224.6
65536 142.2 215.8 732.1
131072 169.1 233.2 1729.4
262144 367.5 830.0 4159.2
524288 999.2 1718.1 8538.5
1048576 2113.5 4082.1 18274.8
2097152 5392.6 10273.4 43273.4
$ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4
Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us)
--------------------------------------------------------------------------
128 2.0 2.5 6.0
256 1.8 2.7 15.0
512 2.1 9.0 16.6
1024 2.6 8.7 16.1
2048 7.7 10.3 20.5
4096 9.9 11.4 25.9
8192 14.5 22.1 39.6
16384 25.1 27.8 67.8
32768 44.7 95.7 91.5
65536 82.1 156.8 165.1
131072 194.0 335.1 341.5
262144 388.5 380.8 612.9
524288 1046.7 967.1 1653.3
1048576 1997.4 2169.0 4034.4
2097152 5502.9 5787.3 12608.1
The tall-and-skinny matrices in KSPGMRESOrthog vary in width, so the average speedup depends on that mix of widths. I suggest you run ex2k to see whether oneMKL can speed up these kernels in your environment.
--Junchao Zhang
On Mon, Jun 24, 2024 at 11:35 AM Junchao Zhang <junchao.zhang at gmail.com> wrote:
Let me run some examples on our end to see whether the code calls the expected functions.
--Junchao Zhang
On Mon, Jun 24, 2024 at 10:46 AM Matthew Knepley <knepley at gmail.com> wrote:
On Mon, Jun 24, 2024 at 11:21 AM Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues?
We have an extended discussion of this here: https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup
The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference.
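(A back-of-the-envelope sketch, with purely illustrative numbers I am assuming here: a VecDot of length n streams about 2*n*8 bytes and performs about 2*n flops, so memory bandwidth alone puts a floor on its runtime, and therefore a ceiling on its flop rate, no matter how many cores or threads are used.)

#include <stdio.h>

int main(void)
{
  const double n         = 1.0e7;          /* vector length (assumed)           */
  const double bytes     = 2.0 * n * 8.0;  /* two streams of doubles read       */
  const double flops     = 2.0 * n;        /* one multiply + one add per entry  */
  const double bandwidth = 100.0e9;        /* assumed node bandwidth in bytes/s */
  const double tmin      = bytes / bandwidth;
  printf("minimum time %.3e s -> at most %.2f Gflop/s, independent of core count\n",
         tmin, flops / tmin / 1.0e9);
  return 0;
}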
Thanks,
Matt
Thank you,
Yongzhong
From: Pierre Jolivet <pierre at joliv.et>
Date: Sunday, June 23, 2024 at 12:41 AM
To: Yongzhong Li <yongzhong.li at mail.utoronto.ca>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
On 23 Jun 2024, at 4:07 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
Yeah, I ran my program again using -mat_view::ascii_info with MKL_VERBOSE set to 1, and the output indicates that the matrix is of type seqaijmkl (I have attached a few excerpts below):
--> Setting up matrix-vector products...
Mat Object: 1 MPI process
type: seqaijmkl
rows=16490, cols=35937
total: nonzeros=128496, allocated nonzeros=128496
total number of mallocs used during MatSetValues calls=0
not using I-node routines
Mat Object: 1 MPI process
type: seqaijmkl
rows=16490, cols=35937
total: nonzeros=128496, allocated nonzeros=128496
total number of mallocs used during MatSetValues calls=0
not using I-node routines
--> Solving the system...
Excitation 1 of 1...
================================================
Iterative solve completed in 7435 ms.
CONVERGED: rtol.
Iterations: 72
Final relative residual norm: 9.22287e-07
================================================
[CPU TIME] System solution: 2.27160000e+02 s.
[WALL TIME] System solution: 7.44387218e+00 s.
However, it seems to me that there is still no MKL output even though I set MKL_VERBOSE to 1, although I would expect many SpMV operations during KSPSolve(). Do you see any possible reasons?
SpMV calls are not reported with MKL_VERBOSE (last I checked), only dense BLAS is.
Thanks,
Pierre
Thanks,
Yongzhong
From: Matthew Knepley <knepley at gmail.com>
Date: Saturday, June 22, 2024 at 5:56 PM
To: Yongzhong Li <yongzhong.li at mail.utoronto.ca>
Cc: Junchao Zhang <junchao.zhang at gmail.com>, Pierre Jolivet <pierre at joliv.et>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
On Sat, Jun 22, 2024 at 5:03 PM Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
MKL_VERBOSE=1 ./ex1
matrix nonzeros = 100, allocated nonzeros = 100
MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread
MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Yes, for the PETSc example there is MKL output, but not for my own program. All I did was change the matrix type from MATAIJ to MATAIJMKL to get optimized SpMV performance from MKL. Should I expect to see any MKL output in this case?
Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see.
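(A minimal sketch of that check inside the code; the Mat variable name A is an assumption on my part:)

PetscCall(PetscViewerPushFormat(PETSC_VIEWER_STDOUT_WORLD, PETSC_VIEWER_ASCII_INFO));
PetscCall(MatView(A, PETSC_VIEWER_STDOUT_WORLD)); /* prints the type, sizes, and nonzero info */
PetscCall(PetscViewerPopFormat(PETSC_VIEWER_STDOUT_WORLD));
/* or, from the command line, run with: -mat_view ::ascii_info */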
Thanks,
Matt
Thanks,
Yongzhong
From: Junchao Zhang <junchao.zhang at gmail.com>
Date: Saturday, June 22, 2024 at 9:40 AM
To: Yongzhong Li <yongzhong.li at mail.utoronto.ca>
Cc: Pierre Jolivet <pierre at joliv.et>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
No, you don't. It is strange. Perhaps you can run a PETSc example first and see whether MKL is really used:
$ cd src/mat/tests
$ make ex1
$ MKL_VERBOSE=1 ./ex1
--Junchao Zhang
On Fri, Jun 21, 2024 at 4:03 PM Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
I am using
export MKL_VERBOSE=1
./xx
in the bash script. Do I have to use -ksp_converged_reason?
Thanks,
Yongzhong
From: Pierre Jolivet <pierre at joliv.et>
Date: Friday, June 21, 2024 at 1:47 PM
To: Yongzhong Li <yongzhong.li at mail.utoronto.ca>
Cc: Junchao Zhang <junchao.zhang at gmail.com>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
How do you set the variable?
$ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason
MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread
MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
[...]
On 21 Jun 2024, at 7:37 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
Hello all,
I set MKL_VERBOSE=1 but observed no print output specific to the use of MKL. Does PETSc enable this verbose output?
Best,
Yongzhong
From: Pierre Jolivet <pierre at joliv.et>
Date: Friday, June 21, 2024 at 1:36 AM
To: Junchao Zhang <junchao.zhang at gmail.com>
Cc: Yongzhong Li <yongzhong.li at mail.utoronto.ca>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
On 21 Jun 2024, at 6:42 AM, Junchao Zhang <junchao.zhang at gmail.com> wrote:
I remember there are some MKL environment variables to print the MKL routines called.
The environment variable is MKL_VERBOSE
Thanks,
Pierre
Maybe we can try it to see which MKL routines are really used, and then we can understand why some PETSc functions did not speed up.
--Junchao Zhang
On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
Hi Barry, sorry about my last results. I didn't fully understand the stage profiling and logging in PETSc; now I record only the KSPSolve() stage of my program. Some sample code is as follows:
// Static variable to keep track of the stage counter
static int stageCounter = 1;
// Generate a unique stage name
std::ostringstream oss;
oss << "Stage " << stageCounter << " of Code";
std::string stageName = oss.str();
// Register the stage
PetscLogStage stagenum;
PetscLogStageRegister(stageName.c_str(), &stagenum);
PetscLogStagePush(stagenum);
KSPSolve(*ksp_ptr, b, x);
PetscLogStagePop();
stageCounter++;
I have attached my new logging results; there is one main stage and four other stages, each corresponding to one KSPSolve() call.
To provide some additional background: if you recall, I have been trying to get an efficient iterative solution using multithreading. I found that by compiling PETSc with the Intel MKL library instead of OpenBLAS I can perform the sparse matrix-vector multiplication faster (I am using MATSEQAIJMKL). This makes the shell matrix-vector product in each iteration scale well with the number of threads. However, I found that the total GMRES solve time (~KSPSolve() time) does not scale well with the number of threads.
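(For reference, a minimal sketch of the conversion I mean; the matrix variable name A is a placeholder, not my actual code:)

/* switch an assembled AIJ matrix to the MKL-backed storage so MatMult() uses MKL sparse kernels */
PetscCall(MatConvert(A, MATSEQAIJMKL, MAT_INPLACE_MATRIX, &A));
/* alternatively, set the type before assembly: PetscCall(MatSetType(A, MATSEQAIJMKL)); */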
From the logging results I learned that, when performing KSPSolve(), there is CPU overhead in PCApply() and KSPGMRESOrthog(). I ran my program with different numbers of threads and plotted the time spent in PCApply() and KSPGMRESOrthog() against the number of threads. These two operations are not scaling with the threads at all! My results are attached as a PDF to give you a clear view.
My question is: from my understanding, PCApply() involves MatSolve(), and KSPGMRESOrthog() performs many vector operations, so why don't these two parts scale well with the number of threads when the Intel MKL library is linked?
Thank you,
Yongzhong
From: Barry Smith <bsmith at petsc.dev>
Date: Friday, June 14, 2024 at 11:36 AM
To: Yongzhong Li <yongzhong.li at mail.utoronto.ca>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>, petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <piero.triverio at utoronto.ca>
Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
I am a bit confused. Without the initial guess computation, there are still a bunch of events I don't understand
MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275
MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
In addition, there are many more VecMAXPY calls than VecMDot calls (in GMRES they are each done the same number of times):
VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016
VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913
Finally there are a huge number of
MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025
Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve?
The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code.
Barry
On Jun 14, 2024, at 1:19 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration counts are quite close to those with KSPGuess, specifically:
KSPGuess Object: 1 MPI process
type: fischer
Model 1, size 200
However, I found that at higher frequency the number of iteration steps is significantly higher than with KSPGuess; I have attached both results for your reference.
Moreover, could I ask why the run without the KSPGuess option can be used as a baseline comparison? What are we comparing here? How does it relate to the performance issue/bottleneck I found, namely that "the time taken by KSPSolve is almost two times greater than the CPU time for the matrix-vector product multiplied by the number of iterations"?
Thank you!
Yongzhong
From: Barry Smith <bsmith at petsc.dev>
Date: Thursday, June 13, 2024 at 2:14 PM
To: Yongzhong Li <yongzhong.li at mail.utoronto.ca>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>, petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <piero.triverio at utoronto.ca>
Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
Can you please run the same thing without the KSPGuess option(s) for a baseline comparison?
Thanks
Barry
On Jun 13, 2024, at 1:27 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
Hi Matt,
I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference.
Thanks!
Yongzhong
From: Matthew Knepley <knepley at gmail.com>
Date: Wednesday, June 12, 2024 at 6:46 PM
To: Yongzhong Li <yongzhong.li at mail.utoronto.ca>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>, petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <piero.triverio at utoronto.ca>
Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
On Wed, Jun 12, 2024 at 6:36 PM Yongzhong Li <yongzhong.li at mail.utoronto.ca> wrote:
Dear PETSc’s developers,
I hope this email finds you well.
I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for the matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time.
For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the per-iteration matrix-vector product time would indicate when multiple threads are used. Here are a few details of my setup:
* Matrix Type: Shell system matrix
* Preconditioner: Shell PC
* Parallel Environment: Using Intel MKL as PETSc’s BLAS/LAPACK library, multithreading is enabled
I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time.
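(For concreteness, a rough sketch of the comparison I describe, written with PETSc's own timer rather than C++ chrono; this is not my actual code, and the names A, ksp, b, x are placeholders:)

PetscLogDouble t0, t1, t2, t3;
PetscInt       its;

PetscCall(PetscTime(&t0));
PetscCall(MatMult(A, b, x));                 /* one shell matrix-vector product */
PetscCall(PetscTime(&t1));
PetscCall(PetscTime(&t2));
PetscCall(KSPSolve(ksp, b, x));              /* the full iterative solve */
PetscCall(PetscTime(&t3));
PetscCall(KSPGetIterationNumber(ksp, &its));
PetscCall(PetscPrintf(PETSC_COMM_SELF, "matvec %g s, %d iterations, matvec*its %g s, KSPSolve %g s\n",
                      (double)(t1 - t0), (int)its, (double)((t1 - t0) * its), (double)(t3 - t2)));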
Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated.
For any performance question like this, we need to see the output of your code run with
-ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view
Thanks,
Matt
Thank you for your time and assistance.
Best regards,
Yongzhong
-----------------------------------------------------------
Yongzhong Li
PhD student | Electromagnetics Group
Department of Electrical & Computer Engineering
University of Toronto
http://www.modelics.org
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/
<ksp_petsc_log.txt>
<ksp_petsc_log.txt><ksp_petsc_log_noguess.txt>
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/