[petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
Pierre Jolivet
pierre at joliv.et
Fri Jun 21 00:36:27 CDT 2024
> On 21 Jun 2024, at 6:42 AM, Junchao Zhang <junchao.zhang at gmail.com> wrote:
>
> I remember there are some MKL env vars to print MKL routines called.
The environment variable is MKL_VERBOSE
Thanks,
Pierre
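
For reference, MKL_VERBOSE=1 is normally set in the shell before launching the program; MKL then prints a line for each call to a verbose-enabled routine (BLAS and LAPACK among them). A minimal C++ sketch, assuming a POSIX-like environment, that reports at startup whether the variable (and the usual thread-count variables) is actually visible to the process. The helper names envOrUnset and mklEnvReport are made up for illustration:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Returns the value of an environment variable, or "(unset)" if it is absent.
std::string envOrUnset(const char* name) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string("(unset)");
}

// Builds a one-line report of the settings that affect MKL tracing and threading.
std::string mklEnvReport() {
    return "MKL_VERBOSE=" + envOrUnset("MKL_VERBOSE") +
           " MKL_NUM_THREADS=" + envOrUnset("MKL_NUM_THREADS") +
           " OMP_NUM_THREADS=" + envOrUnset("OMP_NUM_THREADS");
}
```

Printing mklEnvReport() once near PetscInitialize() makes it easy to confirm that a run really had verbose mode and the intended thread count.
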
> Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up
>
> --Junchao Zhang
>
>
> On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li <yongzhong.li at mail.utoronto.ca <mailto:yongzhong.li at mail.utoronto.ca>> wrote:
>>
>> Hi Barry, sorry about my last results. I didn't fully understand stage profiling and logging in PETSc; I now record only the KSPSolve() stage of my program. Sample code follows:
>>
>> // Static variable to keep track of the stage counter
>> static int stageCounter = 1;
>>
>> // Generate a unique stage name
>> std::ostringstream oss;
>> oss << "Stage " << stageCounter << " of Code";
>> std::string stageName = oss.str();
>>
>> // Register the stage
>> PetscLogStage stagenum;
>> PetscLogStageRegister(stageName.c_str(), &stagenum);
>> PetscLogStagePush(stagenum);
>>
>> KSPSolve(*ksp_ptr, b, x);
>>
>> PetscLogStagePop();
>> stageCounter++;
>>
>> I have attached my new logging results. There is 1 main stage and 4 other stages, each of which is a KSPSolve() call.
>>
>> To provide some additional background: if you recall, I have been trying to get an efficient iterative solution using multithreading. I found that by compiling PETSc with the Intel MKL library instead of OpenBLAS, I can perform sparse matrix-vector multiplication faster; I am using MATSEQAIJMKL. This makes the shell matrix-vector product in each iteration scale well with the number of threads. However, the total GMRES solve time (~KSPSolve() time) does not scale well with the number of threads.
>>
>> From the logging results I learned that, during KSPSolve(), there is some CPU overhead in PCApply() and KSPGMRESOrthog(). I ran my program with different numbers of threads and plotted the time spent in PCApply() and KSPGMRESOrthog() against the number of threads. These two operations do not scale with the number of threads at all! My results are attached as a PDF to give you a clear view.
>>
>> My question is:
>>
>> From my understanding, PCApply involves MatSolve(), and KSPGMRESOrthog() involves many vector operations, so why can't these two parts scale well with the number of threads when the Intel MKL library is linked?
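
One way to frame this question is Amdahl's law: if only part of KSPSolve() runs threaded kernels, the serial remainder caps the overall speedup. A small sketch with hypothetical inputs (the threaded fraction is an assumption for illustration, not a value taken from the attached logs):

```cpp
#include <cassert>

// Amdahl's law: upper bound on the speedup when only a fraction of the
// run time benefits from threading and the rest stays serial.
double amdahlSpeedup(double threadedFraction, int threads) {
    assert(threadedFraction >= 0.0 && threadedFraction <= 1.0);
    assert(threads >= 1);
    return 1.0 / ((1.0 - threadedFraction) + threadedFraction / threads);
}
```

For example, if half of the solve time were threaded, 8 threads would give at most 1 / (0.5 + 0.5/8), or about 1.78x, however well the matrix-vector product itself scales.
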
>>
>> Thank you,
>> Yongzhong
>>
>>
>>
>> From: Barry Smith <bsmith at petsc.dev <mailto:bsmith at petsc.dev>>
>> Date: Friday, June 14, 2024 at 11:36 AM
>> To: Yongzhong Li <yongzhong.li at mail.utoronto.ca <mailto:yongzhong.li at mail.utoronto.ca>>
>> Cc: petsc-users at mcs.anl.gov <mailto:petsc-users at mcs.anl.gov> <petsc-users at mcs.anl.gov <mailto:petsc-users at mcs.anl.gov>>, petsc-maint at mcs.anl.gov <mailto:petsc-maint at mcs.anl.gov> <petsc-maint at mcs.anl.gov <mailto:petsc-maint at mcs.anl.gov>>, Piero Triverio <piero.triverio at utoronto.ca <mailto:piero.triverio at utoronto.ca>>
>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>>
>>
>>
>> I am a bit confused. Without the initial guess computation, there are still a bunch of events I don't understand
>>
>>
>>
>> MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>> MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>> MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>> MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>> MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275
>> MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>
>>
>>
>> In addition, there are many more VecMAXPY than VecMDot calls (in GMRES each is done the same number of times).
>>
>>
>>
>> VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016
>>
>> VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913
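
From the two lines above, 22412 / 5588 is about 4.0, so VecMAXPY is logged roughly four times as often as VecMDot. The expectation of matching counts can be sketched with a plain classical Gram-Schmidt step, which does one batched dot product and one batched update per new Krylov vector. This is an illustrative stand-in written with std::vector, not PETSc's implementation; the names orthogonalize and OpCounts are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Counts how many batched operations of each kind were issued.
struct OpCounts {
    int mdot = 0;
    int maxpy = 0;
};

// One classical Gram-Schmidt step: orthogonalize w against the basis vectors.
// The inner products stand in for a single VecMDot-style call and the
// update stands in for a single VecMAXPY-style call.
Vec orthogonalize(const std::vector<Vec>& basis, Vec& w, OpCounts& counts) {
    Vec h(basis.size(), 0.0);
    // "MDot": h[j] = <basis[j], w> for every basis vector j
    for (std::size_t j = 0; j < basis.size(); ++j) {
        assert(basis[j].size() == w.size());
        for (std::size_t i = 0; i < w.size(); ++i) h[j] += basis[j][i] * w[i];
    }
    ++counts.mdot;
    // "MAXPY": w -= sum_j h[j] * basis[j]
    for (std::size_t j = 0; j < basis.size(); ++j)
        for (std::size_t i = 0; i < w.size(); ++i) w[i] -= h[j] * basis[j][i];
    ++counts.maxpy;
    return h;
}
```

Each call increments both counters once, so a solver doing only this would log the two events equally often; a 4:1 ratio suggests extra VecMAXPY calls coming from somewhere other than the orthogonalization.
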
>>
>>
>>
>> Finally there are a huge number of
>>
>>
>>
>> MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025
>>
>>
>>
>> Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve?
>>
>>
>>
>> The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code.
>>
>>
>>
>> Barry
>>
>>
>>
>>
>>
>>
>> On Jun 14, 2024, at 1:19 AM, Yongzhong Li <yongzhong.li at mail.utoronto.ca <mailto:yongzhong.li at mail.utoronto.ca>> wrote:
>>
>>
>>
>> Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration counts are quite close to those with KSPGuess, specifically
>>
>> KSPGuess Object: 1 MPI process
>>
>> type: fischer
>>
>> Model 1, size 200
>>
>> However, at higher frequencies the number of iteration steps is significantly higher than with KSPGuess. I have attached both sets of results for your reference.
>>
>> Moreover, could I ask why the run without the KSPGuess options can be used as a baseline comparison? What are we comparing here? How does it relate to the performance issue/bottleneck I found? “I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration”
>>
>> Thank you!
>> Yongzhong
>>
>>
>> From: Barry Smith <bsmith at petsc.dev <mailto:bsmith at petsc.dev>>
>> Date: Thursday, June 13, 2024 at 2:14 PM
>> To: Yongzhong Li <yongzhong.li at mail.utoronto.ca <mailto:yongzhong.li at mail.utoronto.ca>>
>> Cc: petsc-users at mcs.anl.gov <mailto:petsc-users at mcs.anl.gov> <petsc-users at mcs.anl.gov <mailto:petsc-users at mcs.anl.gov>>, petsc-maint at mcs.anl.gov <mailto:petsc-maint at mcs.anl.gov> <petsc-maint at mcs.anl.gov <mailto:petsc-maint at mcs.anl.gov>>, Piero Triverio <piero.triverio at utoronto.ca <mailto:piero.triverio at utoronto.ca>>
>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>>
>>
>>
>> Can you please run the same thing without the KSPGuess option(s) for a baseline comparison?
>>
>>
>> Thanks
>>
>>
>> Barry
>>
>>
>>
>> On Jun 13, 2024, at 1:27 PM, Yongzhong Li <yongzhong.li at mail.utoronto.ca <mailto:yongzhong.li at mail.utoronto.ca>> wrote:
>>
>>
>> Hi Matt,
>>
>> I have rerun the program with the options you provided. The output printed during the KSP solve and the final PETSc log output are stored in the attached .txt file for your reference.
>>
>> Thanks!
>> Yongzhong
>>
>>
>> From: Matthew Knepley <knepley at gmail.com <mailto:knepley at gmail.com>>
>> Date: Wednesday, June 12, 2024 at 6:46 PM
>> To: Yongzhong Li <yongzhong.li at mail.utoronto.ca <mailto:yongzhong.li at mail.utoronto.ca>>
>> Cc: petsc-users at mcs.anl.gov <mailto:petsc-users at mcs.anl.gov> <petsc-users at mcs.anl.gov <mailto:petsc-users at mcs.anl.gov>>, petsc-maint at mcs.anl.gov <mailto:petsc-maint at mcs.anl.gov> <petsc-maint at mcs.anl.gov <mailto:petsc-maint at mcs.anl.gov>>, Piero Triverio <piero.triverio at utoronto.ca <mailto:piero.triverio at utoronto.ca>>
>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>>
>> On Wed, Jun 12, 2024 at 6:36 PM Yongzhong Li <yongzhong.li at mail.utoronto.ca <mailto:yongzhong.li at mail.utoronto.ca>> wrote:
>>
>> Dear PETSc developers,
>>
>> I hope this email finds you well.
>>
>> I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time.
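
Note that the std::chrono clocks measure elapsed (wall-clock) time, not CPU time, which matters once several threads are running. A minimal sketch of accumulating wall-clock time over repeated calls to one kernel, so that the matrix-vector total can be compared directly with the time around KSPSolve(); the class name WallTimer is hypothetical:

```cpp
#include <cassert>
#include <chrono>

// Accumulates wall-clock time over repeated calls to one kernel.
class WallTimer {
public:
    // Runs fn once and adds its elapsed wall-clock time to the running total.
    template <typename Fn>
    void time(Fn&& fn) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        total_ += std::chrono::steady_clock::now() - start;
        ++calls_;
    }

    // Total elapsed time in seconds across all timed calls.
    double seconds() const {
        return std::chrono::duration<double>(total_).count();
    }

    // Average elapsed time per call; requires at least one timed call.
    double meanSeconds() const {
        assert(calls_ > 0);
        return seconds() / calls_;
    }

    int calls() const { return calls_; }

private:
    std::chrono::steady_clock::duration total_ =
        std::chrono::steady_clock::duration::zero();
    int calls_ = 0;
};
```

Wrapping the body of the shell matrix-vector product in timer.time([&] { ... }) and comparing timer.seconds() with the time measured around KSPSolve() shows how much of the solve is spent outside the matrix-vector product.
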
>>
>> For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the per-iteration matrix-vector product time would suggest when multiple threads are used. Here are a few details of my setup:
>>
>> Matrix Type: Shell system matrix
>> Preconditioner: Shell PC
>> Parallel Environment: Using Intel MKL as PETSc’s BLAS/LAPACK library, multithreading is enabled
>> I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time.
>>
>> Have you observed the same issue? Could you please share some advice on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated.
>>
>>
>>
>> For any performance question like this, we need to see the output of your code run with
>>
>>
>>
>> -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Matt
>>
>>
>>
>> Thank you for your time and assistance.
>>
>> Best regards,
>>
>> Yongzhong
>>
>> -----------------------------------------------------------
>>
>> Yongzhong Li
>>
>> PhD student | Electromagnetics Group
>>
>> Department of Electrical & Computer Engineering
>>
>> University of Toronto
>>
>> http://www.modelics.org
>>
>>
>>
>>
>>
>>
>> --
>>
>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>> -- Norbert Wiener
>>
>>
>>
>> https://www.cse.buffalo.edu/~knepley/
>> <ksp_petsc_log.txt>
>>
>>
>>
>> <ksp_petsc_log.txt><ksp_petsc_log_noguess.txt>
>>