[petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

Junchao Zhang junchao.zhang at gmail.com
Mon Jun 24 11:35:43 CDT 2024


Let me run some examples on our end to see whether the code calls the
expected functions.

--Junchao Zhang


On Mon, Jun 24, 2024 at 10:46 AM Matthew Knepley <knepley at gmail.com> wrote:

> On Mon, Jun 24, 2024 at 11:21 AM Yongzhong Li <
> yongzhong.li at mail.utoronto.ca> wrote:
>
>>
>> Thank you Pierre for your information. Do we have a conclusion for my
>> original question about the parallelization efficiency of the different
>> stages of KSPSolve()? Do we need to do more testing to figure out the issues?
>>
>
> We have an extended discussion of this here:
> https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup
>
> The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc.)
> are memory-bandwidth limited. If there is no more bandwidth to be
> marshalled on your board, then adding more processes does nothing at all.
> This is why people were asking how many "nodes" you are running on,
> because the node is the unit of memory bandwidth, not the "core", which
> makes little difference.
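>
> A back-of-the-envelope sketch of this limit, assuming real double-precision
> AIJ storage (roughly 12 bytes streamed per nonzero: an 8-byte value plus a
> 4-byte column index, vector traffic ignored) and 2 flops per nonzero; the
> node bandwidth would come from a measurement such as PETSc's "make streams"
> benchmark:
>
>     #include <cstdio>
>
>     int main() {
>       // Hypothetical measured node bandwidth (GB/s), e.g. from "make streams"
>       const double bandwidth_GBs = 100.0;
>       // AIJ, double precision: ~8-byte value + 4-byte column index per nonzero
>       const double bytes_per_nnz = 12.0;
>       // One multiply and one add per nonzero
>       const double flops_per_nnz = 2.0;
>
>       // The attainable SpMV rate is set by bandwidth, not by the core count
>       const double peak_GFs = bandwidth_GBs / bytes_per_nnz * flops_per_nnz;
>       std::printf("attainable SpMV rate ~ %.1f GF/s per node\n", peak_GFs);
>       return 0;
>     }
>
> Once that bandwidth is saturated, adding threads or processes cannot speed
> up SpMV, VecDot, or VecAXPY any further.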
>
>   Thanks,
>
>      Matt
>
>
>> Thank you,
>>
>> Yongzhong
>>
>>
>>
>> *From: *Pierre Jolivet <pierre at joliv.et>
>> *Date: *Sunday, June 23, 2024 at 12:41 AM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>>
>>
>>
>>
>> On 23 Jun 2024, at 4:07 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>>
>>
>>
>> Yeah, I ran my program again using -mat_view ::ascii_info and set
>> MKL_VERBOSE to 1, and the output suggests that the matrix is of type
>> seqaijmkl (I've attached a few samples below).
>>
>> --> Setting up matrix-vector products...
>>
>>
>>
>> Mat Object: 1 MPI process
>>
>>   type: seqaijmkl
>>
>>   rows=16490, cols=35937
>>
>>   total: nonzeros=128496, allocated nonzeros=128496
>>
>>   total number of mallocs used during MatSetValues calls=0
>>
>>     not using I-node routines
>>
>> Mat Object: 1 MPI process
>>
>>   type: seqaijmkl
>>
>>   rows=16490, cols=35937
>>
>>   total: nonzeros=128496, allocated nonzeros=128496
>>
>>   total number of mallocs used during MatSetValues calls=0
>>
>>     not using I-node routines
>>
>>
>>
>> --> Solving the system...
>>
>>
>>
>> Excitation 1 of 1...
>>
>>
>>
>> ================================================
>>
>> Iterative solve completed in 7435 ms.
>>
>> CONVERGED: rtol.
>>
>> Iterations: 72
>>
>> Final relative residual norm: 9.22287e-07
>>
>> ================================================
>>
>> [CPU TIME] System solution: 2.27160000e+02 s.
>>
>> [WALL TIME] System solution: 7.44387218e+00 s.
>>
>> However, it seems that there is still no MKL output even though I set
>> MKL_VERBOSE to 1, although there should be many SpMV operations during
>> KSPSolve(). Do you see any possible reasons?
>>
>>
>>
>> SpMV is not reported with MKL_VERBOSE (last I checked); only dense BLAS
>> is.
>>
>>
>>
>> Thanks,
>>
>> Pierre
>>
>>
>>
>> Thanks,
>>
>> Yongzhong
>>
>>
>>
>>
>>
>> *From: *Matthew Knepley <knepley at gmail.com>
>> *Date: *Saturday, June 22, 2024 at 5:56 PM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *Junchao Zhang <junchao.zhang at gmail.com>, Pierre Jolivet <
>> pierre at joliv.et>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>>
>> On Sat, Jun 22, 2024 at 5:03 PM Yongzhong Li <
>> yongzhong.li at mail.utoronto.ca> wrote:
>>
>>
>> MKL_VERBOSE=1 ./ex1
>>
>>
>> matrix nonzeros = 100, allocated nonzeros = 100
>>
>> MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for
>> Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R)
>> AVX-512) with support of Vector Neural Network Instructions enabled
>> processors, Lnx 2.50GHz lp64 gnu_thread
>>
>> MKL_VERBOSE
>> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1)
>> 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0)
>> 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE
>> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10)
>> 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE
>> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1)
>> 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1
>> FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE
>> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10)
>> 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1
>> FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE
>> ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1)
>> 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0)
>> 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0)
>> 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0)
>> 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE
>> ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15)
>> 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0)
>> 730ns CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0)
>> 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF
>> Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0)
>> 685ns CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0)
>> 390ns CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE
>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0)
>> 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF
>> Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns
>> CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1
>>
>> Yes, for the PETSc example there are MKL outputs, but not for my own
>> program. All I did was change the matrix type from MATAIJ to MATAIJMKL to
>> get optimized SpMV performance from MKL. Should I expect to see any MKL
>> output in this case?
>>
>>
>>
>> Are you sure that the type changed? You can MatView() the matrix with
>> format ascii_info to see.
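>>
>> A minimal sketch of doing that check programmatically (assuming a recent
>> PETSc with PetscCall(), MKL sparse support configured, and a sequential,
>> already-assembled matrix A; the helper name is illustrative). It is the
>> in-code equivalent of -mat_view ::ascii_info:
>>
>>     #include <petscmat.h>
>>
>>     // Illustrative helper: force the type to MATSEQAIJMKL and print its info
>>     static PetscErrorCode CheckAijMklType(Mat A)
>>     {
>>       PetscFunctionBeginUser;
>>       // In-place conversion; a no-op if A is already of type MATSEQAIJMKL
>>       PetscCall(MatConvert(A, MATSEQAIJMKL, MAT_INPLACE_MATRIX, &A));
>>       // Same output as -mat_view ::ascii_info
>>       PetscCall(PetscViewerPushFormat(PETSC_VIEWER_STDOUT_SELF, PETSC_VIEWER_ASCII_INFO));
>>       PetscCall(MatView(A, PETSC_VIEWER_STDOUT_SELF));
>>       PetscCall(PetscViewerPopFormat(PETSC_VIEWER_STDOUT_SELF));
>>       PetscFunctionReturn(PETSC_SUCCESS);
>>     }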
>>
>>
>>
>>   Thanks,
>>
>>
>>
>>      Matt
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Yongzhong
>>
>>
>>
>> *From: *Junchao Zhang <junchao.zhang at gmail.com>
>> *Date: *Saturday, June 22, 2024 at 9:40 AM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *Pierre Jolivet <pierre at joliv.et>, petsc-users at mcs.anl.gov <
>> petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>> No, you don't. It is strange. Perhaps you can run a PETSc example first
>> and see whether MKL is really used:
>>
>> $ cd src/mat/tests
>>
>> $ make ex1
>>
>> $ MKL_VERBOSE=1 ./ex1
>>
>>
>> --Junchao Zhang
>>
>>
>>
>>
>>
>> On Fri, Jun 21, 2024 at 4:03 PM Yongzhong Li <
>> yongzhong.li at mail.utoronto.ca> wrote:
>>
>> I am using
>>
>> export MKL_VERBOSE=1
>>
>> ./xx
>>
>> in the bash script. Do I have to use -ksp_converged_reason?
>>
>> Thanks,
>>
>> Yongzhong
>>
>>
>>
>> *From: *Pierre Jolivet <pierre at joliv.et>
>> *Date: *Friday, June 21, 2024 at 1:47 PM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *Junchao Zhang <junchao.zhang at gmail.com>, petsc-users at mcs.anl.gov <
>> petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>>
>> How do you set the variable?
>>
>>
>>
>> $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason
>>
>> MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64
>> architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled
>> processors, Lnx 2.80GHz lp64 intel_thread
>>
>> MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1
>> FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1
>> FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1
>> FastMM:1 TID:0  NThr:1
>>
>> MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1
>> TID:0  NThr:1
>>
>> [...]
>>
>>
>>
>> On 21 Jun 2024, at 7:37 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>>
>>
>>
>> Hello all,
>>
>> I set MKL_VERBOSE = 1, but observed no print output specific to the use
>> of MKL. Does PETSc enable this verbose output?
>>
>> Best,
>>
>> Yongzhong
>>
>>
>>
>> *From: *Pierre Jolivet <pierre at joliv.et>
>> *Date: *Friday, June 21, 2024 at 1:36 AM
>> *To: *Junchao Zhang <junchao.zhang at gmail.com>
>> *Cc: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>,
>> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc
>> KSPSolve Performance Issue
>>
>>
>>
>>
>>
>>
>> On 21 Jun 2024, at 6:42 AM, Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>>
>>
>>
>> I remember there are some MKL environment variables to print the MKL routines called.
>>
>>
>>
>> The environment variable is MKL_VERBOSE
>>
>>
>>
>> Thanks,
>>
>> Pierre
>>
>>
>>
>> Maybe we can try it to see which MKL routines are really used, and then
>> we can understand why some PETSc functions did not speed up.
>>
>>
>> --Junchao Zhang
>>
>>
>>
>>
>>
>> On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li <
>> yongzhong.li at mail.utoronto.ca> wrote:
>>
>>
>>
>>
>> Hi Barry, sorry about my last results. I didn't fully understand stage
>> profiling and logging in PETSc; now I only record the KSPSolve() stage of
>> my program. A sample of the code is as follows:
>>
>>     // Static variable to keep track of the stage counter
>>     static int stageCounter = 1;
>>
>>     // Generate a unique stage name
>>     std::ostringstream oss;
>>     oss << "Stage " << stageCounter << " of Code";
>>     std::string stageName = oss.str();
>>
>>     // Register the stage
>>     PetscLogStage stagenum;
>>     PetscLogStageRegister(stageName.c_str(), &stagenum);
>>     PetscLogStagePush(stagenum);
>>
>>     KSPSolve(*ksp_ptr, b, x);
>>
>>     PetscLogStagePop();
>>     stageCounter++;
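>>
>> (A hedged alternative sketch, in case per-solve stages are not needed:
>> register the stage once and reuse it, so -log_view aggregates every
>> KSPSolve() call under a single stage. The function name is illustrative
>> and a recent PETSc with PetscCall() is assumed.)
>>
>>     #include <petscksp.h>
>>
>>     // Illustrative variant: one stage, registered on first use, then reused
>>     static PetscErrorCode TimedSolve(KSP ksp, Vec b, Vec x)
>>     {
>>       static PetscLogStage solveStage = -1;
>>
>>       PetscFunctionBeginUser;
>>       if (solveStage < 0) PetscCall(PetscLogStageRegister("KSPSolve stage", &solveStage));
>>       PetscCall(PetscLogStagePush(solveStage));
>>       PetscCall(KSPSolve(ksp, b, x));
>>       PetscCall(PetscLogStagePop());
>>       PetscFunctionReturn(PETSC_SUCCESS);
>>     }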
>>
>> I have attached my new logging results; there is 1 main stage and 4 other
>> stages, each corresponding to one KSPSolve() call.
>>
>> To provide some additional background, if you recall, I have been trying
>> to get an efficient iterative solution using multithreading. I found that by
>> compiling PETSc with the Intel MKL library instead of OpenBLAS, and using
>> MATSEQAIJMKL, I can perform the sparse matrix-vector multiplication faster.
>> This makes the shell matrix-vector product in each iteration scale well with
>> the number of threads. However, the total GMRES solve time (~KSPSolve()
>> time) is not scaling well with the number of threads.
>>
>> From the logging results I learned that when performing KSPSolve(), there
>> is some CPU overhead in PCApply() and KSPGMRESOrthog(). I ran my program
>> with different numbers of threads and plotted the time spent in PCApply()
>> and KSPGMRESOrthog() against the number of threads; these two operations do
>> not scale with the threads at all. My results are attached as a PDF to give
>> you a clear view.
>>
>> My question is:
>>
>> From my understanding, PCApply() involves MatSolve(), and KSPGMRESOrthog()
>> involves many vector operations, so why don't these two parts scale well
>> with the number of threads when the Intel MKL library is linked?
>>
>> Thank you,
>> Yongzhong
>>
>>
>>
>> *From: *Barry Smith <bsmith at petsc.dev>
>> *Date: *Friday, June 14, 2024 at 11:36 AM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
>> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
>> piero.triverio at utoronto.ca>
>> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
>> Performance Issue
>>
>>
>>
>>    I am a bit confused. Without the initial guess computation, there are
>> still a bunch of events I don't understand
>>
>>
>>
>> MatTranspose          79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>
>> MatMatMultSym        110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>
>> MatMatMultNum         90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>
>> MatMatMatMultSym      20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>
>> MatRARtSym            25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>
>> MatMatTrnMultSym      25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>
>> MatMatTrnMultNum      25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0   275
>>
>> MatTrnMatMultSym      10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>
>> MatTrnMatMultNum      10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>
>>
>>
>> in addition, there are many more VecMAXPY than VecMDot events (in GMRES
>> they are each done the same number of times)
>>
>>
>>
>> VecMDot             5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00
>> 0.0e+00  8 10  0  0  0   8 10  0  0  0 12016
>>
>> VecMAXPY           22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00
>> 0.0e+00 39 20  0  0  0  39 20  0  0  0  4913
>>
>>
>>
>> Finally there are a huge number of
>>
>>
>>
>> MatMultAdd        258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00
>> 0.0e+00  7 29  0  0  0   7 29  0  0  0 43025
>>
>>
>>
>> Are you making calls to all these routines? Are you doing this inside
>> your MatMult() or before you call KSPSolve?
>>
>>
>>
>> The reason I wanted you to make a simpler run without the initial guess
>> code is that your events are far more complicated than would be produced by
>> GMRES alone, so it is not possible to understand the behavior you are seeing
>> without fully understanding all the events happening in the code.
>>
>>
>>
>>   Barry
>>
>>
>>
>>
>>
>> On Jun 14, 2024, at 1:19 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>>
>>
>> Thanks, I have attached the results without using any KSPGuess. At low
>> frequency, the iteration counts are quite close to those with KSPGuess,
>> specifically
>>
>>   KSPGuess Object: 1 MPI process
>>
>>     type: fischer
>>
>>     Model 1, size 200
>>
>> However, I found that at higher frequency, the number of iteration steps
>> is significantly higher than with KSPGuess; I have attached both results
>> for your reference.
>>
>> Moreover, could I ask why the run without the KSPGuess options can be used
>> as a baseline comparison? What are we comparing here? How does it relate to
>> the performance issue/bottleneck I found: “I have noticed that the time
>> taken by KSPSolve is almost two times greater than the CPU time for the
>> matrix-vector product multiplied by the number of iterations”?
>>
>> Thank you!
>> Yongzhong
>>
>>
>>
>> *From: *Barry Smith <bsmith at petsc.dev>
>> *Date: *Thursday, June 13, 2024 at 2:14 PM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
>> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
>> piero.triverio at utoronto.ca>
>> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
>> Performance Issue
>>
>>
>>
>>   Can you please run the same thing without the  KSPGuess option(s) for
>> a baseline comparison?
>>
>>
>>
>>    Thanks
>>
>>
>>
>>    Barry
>>
>>
>>
>> On Jun 13, 2024, at 1:27 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> wrote:
>>
>>
>>
>>
>> Hi Matt,
>>
>> I have rerun the program with the options you provided. The system output
>> during the KSP solve and the final PETSc log output are stored in the
>> attached .txt file for your reference.
>>
>> Thanks!
>> Yongzhong
>>
>>
>>
>> *From: *Matthew Knepley <knepley at gmail.com>
>> *Date: *Wednesday, June 12, 2024 at 6:46 PM
>> *To: *Yongzhong Li <yongzhong.li at mail.utoronto.ca>
>> *Cc: *petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>,
>> petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, Piero Triverio <
>> piero.triverio at utoronto.ca>
>> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve
>> Performance Issue
>>
>>
>> On Wed, Jun 12, 2024 at 6:36 PM Yongzhong Li <
>> yongzhong.li at mail.utoronto.ca> wrote:
>>
>>
>> Dear PETSc developers,
>>
>> I hope this email finds you well.
>>
>> I am currently working on a project using PETSc and have encountered a
>> performance issue with the KSPSolve function. Specifically, I have noticed
>> that the time taken by KSPSolve is almost two times greater than the CPU
>> time for the matrix-vector product multiplied by the number of iteration
>> steps. I use C++ chrono to record the CPU time.
>>
>> For context, I am using a shell system matrix A. Despite my efforts to
>> parallelize the matrix-vector product (Ax), when multiple threads are used
>> the overall solve time remains higher than the per-iteration matrix-vector
>> product time would indicate. Here are a few details of my setup (a minimal
>> illustrative sketch follows the list):
>>
>>    - *Matrix Type*: Shell system matrix
>>    - *Preconditioner*: Shell PC
>>    - *Parallel Environment*: Using Intel MKL as PETSc’s BLAS/LAPACK
>>    library, multithreading is enabled
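>>
>> A minimal illustrative sketch of this kind of shell-matrix/shell-PC setup
>> (the context struct, callback names, and sizes are assumptions, not taken
>> from the actual code; a recent PETSc with PetscCall() is assumed):
>>
>>     #include <petscksp.h>
>>
>>     // Illustrative context holding whatever the matrix-free operator needs
>>     typedef struct {
>>       PetscInt n; /* placeholder for application data */
>>     } AppCtx;
>>
>>     static PetscErrorCode UserMatMult(Mat A, Vec x, Vec y)
>>     {
>>       AppCtx *ctx;
>>       PetscFunctionBeginUser;
>>       PetscCall(MatShellGetContext(A, &ctx));
>>       /* ... compute y = A*x here (the multithreaded kernel) ... */
>>       PetscFunctionReturn(PETSC_SUCCESS);
>>     }
>>
>>     static PetscErrorCode UserPCApply(PC pc, Vec r, Vec z)
>>     {
>>       AppCtx *ctx;
>>       PetscFunctionBeginUser;
>>       PetscCall(PCShellGetContext(pc, &ctx));
>>       /* ... apply the approximate preconditioner: z = M^{-1} r ... */
>>       PetscFunctionReturn(PETSC_SUCCESS);
>>     }
>>
>>     static PetscErrorCode BuildSolver(PetscInt n, AppCtx *ctx, KSP *ksp)
>>     {
>>       Mat A;
>>       PC  pc;
>>
>>       PetscFunctionBeginUser;
>>       PetscCall(MatCreateShell(PETSC_COMM_SELF, n, n, n, n, ctx, &A));
>>       PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))UserMatMult));
>>       PetscCall(KSPCreate(PETSC_COMM_SELF, ksp));
>>       PetscCall(KSPSetOperators(*ksp, A, A));
>>       PetscCall(KSPGetPC(*ksp, &pc));
>>       PetscCall(PCSetType(pc, PCSHELL));
>>       PetscCall(PCShellSetContext(pc, ctx));
>>       PetscCall(PCShellSetApply(pc, UserPCApply));
>>       PetscCall(KSPSetFromOptions(*ksp));
>>       PetscCall(MatDestroy(&A)); /* the KSP keeps its own reference */
>>       PetscFunctionReturn(PETSC_SUCCESS);
>>     }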
>>
>> I have considered several potential reasons, such as preconditioner
>> setup, additional solver operations, and the inherent overhead of using a
>> shell system matrix. *However, since KSPSolve is a high-level API, I
>> have been unable to pinpoint the exact cause of the increased solve time.*
>>
>> Have you observed the same issue? Could you please share some guidance on
>> how to diagnose and address this performance discrepancy? Any insights or
>> recommendations you could offer would be greatly appreciated.
>>
>>
>>
>> For any performance question like this, we need to see the output of your
>> code run with
>>
>>
>>
>>   -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view
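>>
>> If the command line is inconvenient (e.g. the solver is embedded in a
>> larger application), a small sketch of inserting the same options
>> programmatically, after PetscInitialize() and before KSPSetFromOptions()
>> is called (the helper name is illustrative; -log_view prints its summary
>> at PetscFinalize()):
>>
>>     #include <petscsys.h>
>>
>>     // Illustrative helper: inject the diagnostic options from code
>>     static PetscErrorCode EnableSolverDiagnostics(void)
>>     {
>>       PetscFunctionBeginUser;
>>       PetscCall(PetscOptionsInsertString(NULL,
>>                 "-ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view"));
>>       PetscFunctionReturn(PETSC_SUCCESS);
>>     }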
>>
>>
>>
>>   Thanks,
>>
>>
>>
>>      Matt
>>
>>
>>
>> Thank you for your time and assistance.
>>
>> Best regards,
>>
>> Yongzhong
>>
>> -----------------------------------------------------------
>>
>> *Yongzhong Li*
>>
>> PhD student | Electromagnetics Group
>>
>> Department of Electrical & Computer Engineering
>>
>> University of Toronto
>>
>> http://www.modelics.org
>>
>>
>>
>>
>>
>>
>> --
>>
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>>
>>
>> https://www.cse.buffalo.edu/~knepley/
>>
>> <ksp_petsc_log.txt>
>>
>>
>>
>> <ksp_petsc_log.txt><ksp_petsc_log_noguess.txt>
>>
>>
>>
>>
>>
>>
>> --
>>
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>>
>>
>> https://www.cse.buffalo.edu/~knepley/
>>
>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240624/389dce4c/attachment-0001.html>

