From knepley at gmail.com Sun Jun 2 09:27:17 2024
From: knepley at gmail.com (Matthew Knepley)
Date: Sun, 2 Jun 2024 10:27:17 -0400
Subject: [petsc-users] 2^32 integer problems
In-Reply-To: References: Message-ID:

On Sat, Jun 1, 2024 at 11:39 PM Carpenter, Mark H. (LARC-D302) via petsc-users wrote:

> Mark Carpenter, NASA Langley.
>
> I am a novice PETSC user of about 10 years. I've built a DG-FEM code with petsc
> as one of the solver paths (I have my own as well). Furthermore, I use petsc
> for MPI communication.
>
> I'm running the DG-FEM code on our NAS supercomputer. Everything works when my
> integer sizes are small. When I exceed the 2^32 limit of integer arithmetic the
> code fails in very strange ways. The users that originally set up the petsc
> infrastructure in the code are no longer at NASA and I'm "dead in the water".
>
> I think I've promoted all the integers that are problematic in my code (F95).
> On the PETSC side I've tried:
>
> 1. Reinstall petsc with --with-64-bit-integers (no luck).

That option does not exist, so this will not work.

> 2. Reinstall petsc with --with-64-bit-integers and --with-64-bit-indices (the
>    code will not compile with these options; additional variables on the F90
>    side require promotion and then the errors cascade through the code when
>    making PETSC calls).

We should fix this. I feel confident we can get the code to compile.

> 3. It's possible that I've missed offending integers, but the petsc error
>    messages are so cryptic that I can't even tell where it is failing.
>
> Further complicating matters: the problem by definition needs to be HUGE.
> Problem sizes requiring 1000 cores (10^6 elements at P5) are needed to
> experience the errors, which involves waiting in queues for half a day at least.
>
> Attached are:
>
> 1. The install script used to install PETSC on our machine
> 2. The Makefile used on the fortran side
> 3. A data dump from an offending simulation (which is huge and I can't see any
>    useful information in it)
>
> How do I attack this problem? (I've never gotten debugging working properly.)

Let's get the install for 64-bit indices to work. So we

1) Configure PETSc adding --with-64-bit-indices to the configure line. Does this
   work? If not, send configure.log

2) Compile PETSc. Does this work? If not, send make.log

3) Compile your code. Does this work? If not, send all output.

4) Do one of the 1/2 day runs and let us know what happens. An alternative is to
   run a small number of processes on a large memory workstation. We do this to
   test at the lab.

  Thanks,

     Matt

> Mark

--
What most experimenters take for granted before they begin their experiments is
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!auF9rrBGDlsDNKTGGczofe7W5jFe6xzdNRcYh93Hu_48IDvf_AkLauQ1sfAdN5qS_ENmKo_z_6HeyVJBTACI$
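To make the promotion issue concrete, a minimal sketch follows, assuming a PETSc build configured with --with-64-bit-indices so that PetscInt is 64-bit; the element and dof counts are invented for illustration and are not taken from the code discussed above:

program int_promotion
#include <petsc/finclude/petscsys.h>
      use petscsys
      implicit none
      PetscErrorCode ierr
      PetscInt       nelem, ndof_per_elem, ndof_global

      call PetscInitialize(PETSC_NULL_CHARACTER, ierr)

      nelem         = 1000000        ! hypothetical: 10^6 elements
      ndof_per_elem = 5000           ! hypothetical dof count per element
      ! The product is about 5e9 > 2^31-1. It stays exact here only because
      ! PetscInt is 64-bit in this build; with default 32-bit integers the same
      ! expression silently overflows and later PETSc calls fail in strange ways.
      ndof_global   = nelem*ndof_per_elem

      call PetscFinalize(ierr)
end program int_promotion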
From knepley at gmail.com Sun Jun 2 09:30:23 2024
From: knepley at gmail.com (Matthew Knepley)
Date: Sun, 2 Jun 2024 10:30:23 -0400
Subject: [petsc-users] 2^32 integer problems
In-Reply-To: References: Message-ID:

On Sun, Jun 2, 2024 at 10:27 AM Matthew Knepley wrote:
> On Sat, Jun 1, 2024 at 11:39 PM Carpenter, Mark H. (LARC-D302) via petsc-users wrote:
>> It's possible that I've missed offending integers, but the petsc error
>> messages are so cryptic that I can't even tell where it is failing.

One additional point. I have looked at the error message. When you make PETSc
calls, each call should be wrapped in PetscCall(). Here is a Fortran example:

https://urldefense.us/v3/__https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/tutorials/ex22f.F90?ref_type=heads__;!!G_uCfscf7eWS!eOkbaTOpui-YHhrX_HYLmYerXOaaGtlJn04-tdLvQzfRqa6gaCs2x-YtPn7xNTWzRRgD-wze7GkX5hkXqc8i$

This checks the return value after each call and ends early if there is an error.
It would make your error output much more readable.

  Thanks,

     Matt

--
What most experimenters take for granted before they begin their experiments is
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!eOkbaTOpui-YHhrX_HYLmYerXOaaGtlJn04-tdLvQzfRqa6gaCs2x-YtPn7xNTWzRRgD-wze7GkX5gB4gnrA$
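A minimal sketch of that pattern, assuming PETSc 3.18 or later, where the Fortran macros PetscCall() (inside subroutines) and PetscCallA() (in the main program) are available; the vector calls are placeholders only:

program checked_calls
#include <petsc/finclude/petscvec.h>
      use petscvec
      implicit none
      PetscErrorCode ierr
      PetscInt       n
      Vec            x

      PetscCallA(PetscInitialize(PETSC_NULL_CHARACTER, ierr))
      n = 100
      ! Every PETSc call is wrapped, so a failure prints the file and line of the
      ! first bad call and aborts, instead of cascading into cryptic errors later.
      PetscCallA(VecCreate(PETSC_COMM_WORLD, x, ierr))
      PetscCallA(VecSetSizes(x, PETSC_DECIDE, n, ierr))
      PetscCallA(VecSetFromOptions(x, ierr))
      PetscCallA(VecDestroy(x, ierr))
      PetscCallA(PetscFinalize(ierr))
end program checked_calls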
From balay at mcs.anl.gov Sun Jun 2 11:52:24 2024
From: balay at mcs.anl.gov (Satish Balay)
Date: Sun, 2 Jun 2024 11:52:24 -0500 (CDT)
Subject: [petsc-users] 2^32 integer problems
In-Reply-To: References: Message-ID: <09e84ebf-40c4-70bf-1ad9-210548b3fb87@mcs.anl.gov>

A couple of suggestions.

- Try building with gcc/gfortran. The compiler will likely flag issues (warnings)
  in the sources, and those might be the cause of some of the errors.

- Try using the PetscInt datatype across all sources (i.e. use the .F90 suffix
  and include the petsc includes) to avoid any lingering mismatch, as a fix for
  some of the above warnings.

- And then you might be able to simplify your makefile to be more portable, using
  a petsc-formatted makefile.

Satish

On Sun, 2 Jun 2024, Matthew Knepley wrote:
> One additional point. I have looked at the error message. When you make PETSc
> calls, each call should be wrapped in PetscCall().
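To illustrate the second suggestion (the file, module, and routine names below are hypothetical): a .F90 suffix means the file goes through the preprocessor, so it can pull in the PETSc Fortran include files, and every index or count handed to PETSc can then be declared as PetscInt:

! solver_mod.F90 -- capital .F90 so the preprocessor handles the #include line
module solver_mod
#include <petsc/finclude/petscmat.h>
      use petscmat
      implicit none
contains
      subroutine assemble_row(A, row, ncols, cols, vals, ierr)
         Mat, intent(inout)          :: A
         PetscInt, intent(in)        :: row, ncols, cols(*)   ! PetscInt, not integer
         PetscScalar, intent(in)     :: vals(*)
         PetscErrorCode, intent(out) :: ierr
         PetscInt                    :: one

         ! Because every index and count is PetscInt, the same source compiles
         ! unchanged against both 32-bit and --with-64-bit-indices builds.
         one = 1
         call MatSetValues(A, one, [row], ncols, cols, vals, INSERT_VALUES, ierr)
      end subroutine assemble_row
end module solver_mod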
From bsmith at petsc.dev Wed Jun 5 10:14:55 2024
From: bsmith at petsc.dev (Barry Smith)
Date: Wed, 5 Jun 2024 11:14:55 -0400
Subject: [petsc-users] Question for PETSc Fortran users on design moving forward
Message-ID:

I am working to improve PETSc support for Fortran and to automate more of the
process so our Fortran coverage will be complete and always up-to-date. This will
require a few small changes in the usage from Fortran.

Could you please take a look at
https://urldefense.us/v3/__https://gitlab.com/petsc/petsc/-/merge_requests/7598__;!!G_uCfscf7eWS!fnJy7aGoJDjNKY2hei6kcg1iNooee8vbE8XyX8qLz59wS4gTMHh87-8bO041Q0fvKaLCzgbUmH9QjqxFvI_CK2k$
and make any comments or suggestions.

  Thanks

    Barry

The changes would appear in the next release of PETSc in October and would be
only slightly backward incompatible with older versions of PETSc; the compiler
will tell you what needs to be updated.
From usovlev2000 at gmail.com Wed Jun 5 09:19:08 2024
From: usovlev2000 at gmail.com (Lev Usov)
Date: Wed, 5 Jun 2024 17:19:08 +0300
Subject: [petsc-users] PETSc for python
Message-ID:

Hello, dear colleagues!

I'm trying to download the PETSc library for Python. I use the pip command
"pip install mpi4py petsc petsc4py" with python3 on Windows 11, but I get an
installation error (see the attached log). Could you help me with this?

Best regards,
Lev.

-------------- next part --------------
C:\Users\usovl>pip install mpi4py petsc petsc4py
Collecting mpi4py
  Downloading mpi4py-3.1.6-cp312-cp312-win_amd64.whl.metadata (8.0 kB)
Collecting petsc
  Downloading petsc-3.21.2.tar.gz (17.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting petsc4py
  Downloading petsc4py-3.21.2.tar.gz (420 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... error
  error: subprocess-exited-with-error

  pip subprocess to install backend dependencies did not run successfully.
  exit code: 1
  [89 lines of output]
      ...
      Building wheels for collected packages: petsc
        Building wheel for petsc (pyproject.toml): started
        Building wheel for petsc (pyproject.toml): finished with status 'error'
        error: subprocess-exited-with-error

        Building wheel for petsc (pyproject.toml) did not run successfully.
        exit code: 1
        [63 lines of output]
            running bdist_wheel
            running build
            running build_py
            ...
            running install
            PETSc: configure
            configure options: --prefix=C:\Users\usovl\AppData\Local\Temp\pip-install-4vqqob1d\petsc_5264a13e7662435482a9bd86113b70f5\build\bdist.win-amd64\wheel\petsc PETSC_ARCH=arch-python --with-shared-libraries=1 --with-debugging=0 --with-c2html=0 --with-mpi=0
            ===============================================================================
            *** Windows python detected. Please rerun ./configure with cygwin-python. ***
            ===============================================================================
            Traceback (most recent call last):
            ...
            RuntimeError: 3
            [end of output]

        note: This error originates from a subprocess, and is likely not a problem with pip.
        ERROR: Failed building wheel for petsc
      Failed to build petsc
      ERROR: Could not build wheels for petsc, which is required to install pyproject.toml-based projects
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.

From s_g at berkeley.edu Wed Jun 5 13:38:35 2024
From: s_g at berkeley.edu (Sanjay Govindjee)
Date: Wed, 5 Jun 2024 13:38:35 -0500
Subject: [petsc-users] Question for PETSc Fortran users on design moving forward
In-Reply-To: References: Message-ID: <7dcc3b47-d979-4b08-bcaf-8fa00e5080d5@berkeley.edu>

Barry,

As a regular user of PETSc in Fortran, I see no problem with these changes to the
Fortran interface.

-sanjay

On 6/5/24 10:14 AM, Barry Smith wrote:
> I am working to improve PETSc support for Fortran and to automate more of the
> process so our Fortran coverage will be complete and always up-to-date.

From C.Klaij at marin.nl Thu Jun 6 07:52:28 2024
From: C.Klaij at marin.nl (Klaij, Christiaan)
Date: Thu, 6 Jun 2024 12:52:28 +0000
Subject: [petsc-users] matload and petsc4py
Message-ID:

I'm writing a matrix to file from my fortran code (that uses petsc-3.19.4) with
-mat_view binary. Then, I'm trying to load this matrix into python (that uses
petsc-3.21.0). This works fine using a single or multiple procs when the matrix
was written using a single proc (attached file a_mat_np_1.dat). However, when the
matrix was written using multiple procs (attached file a_mat_np_2.dat) I get the
error below. Is this supposed to work? If so, what am I doing wrong?
$ cat test_matrixImport_binary.py
import sys
import petsc4py
from petsc4py import PETSc
from mpi4py import MPI

# mat files
#filename = "./a_mat_np_1.dat"   # Works
filename = "./a_mat_np_2.dat"    # Doesn't work

# Initialize PETSc
petsc4py.init(sys.argv)

# Create a viewer for reading the binary file
viewer = PETSc.Viewer().createBinary(filename, mode='r', comm=PETSc.COMM_WORLD)

# Create a matrix and load data from the binary file
A = PETSc.Mat().create(comm=PETSc.COMM_WORLD)
A.load(viewer)

$ python test_matrixImport_binary.py
Traceback (most recent call last):
  File "/projects/P35662.700/test_cklaij/test_matrixImport_binary.py", line 18, in <module>
    A.load(viewer)
  File "petsc4py/PETSc/Mat.pyx", line 2025, in petsc4py.PETSc.Mat.load
petsc4py.PETSc.Error: error code 79
[0] MatLoad() at /home/cklaij/petsc-3.21.0/src/mat/interface/matrix.c:1344
[0] MatLoad_SeqAIJ() at /home/cklaij/petsc-3.21.0/src/mat/impls/aij/seq/aij.c:5091
[0] MatLoad_SeqAIJ_Binary() at /home/cklaij/petsc-3.21.0/src/mat/impls/aij/seq/aij.c:5142
[0] Unexpected data in file
[0] Inconsistent matrix data in file: nonzeros = 460, sum-row-lengths = 761

$ mpirun -n 2 python test_matrixImport_binary.py
Traceback (most recent call last):
  File "/projects/P35662.700/test_cklaij/test_matrixImport_binary.py", line 18, in <module>
    A.load(viewer)
  File "petsc4py/PETSc/Mat.pyx", line 2025, in petsc4py.PETSc.Mat.load
petsc4py.PETSc.Error: error code 79
[0] MatLoad() at /home/cklaij/petsc-3.21.0/src/mat/interface/matrix.c:1344
[0] MatLoad_MPIAIJ() at /home/cklaij/petsc-3.21.0/src/mat/impls/aij/mpi/mpiaij.c:3035
[0] MatLoad_MPIAIJ_Binary() at /home/cklaij/petsc-3.21.0/src/mat/impls/aij/mpi/mpiaij.c:3087
[0] Unexpected data in file
[0] Inconsistent matrix data in file: nonzeros = 460, sum-row-lengths = 761
petsc4py.PETSc.Error: error code 79
[1] MatLoad() at /home/cklaij/petsc-3.21.0/src/mat/interface/matrix.c:1344
[1] MatLoad_MPIAIJ() at /home/cklaij/petsc-3.21.0/src/mat/impls/aij/mpi/mpiaij.c:3035
[1] MatLoad_MPIAIJ_Binary() at /home/cklaij/petsc-3.21.0/src/mat/impls/aij/mpi/mpiaij.c:3087
[1] Unexpected data in file
[1] Inconsistent matrix data in file: nonzeros = 460, sum-row-lengths = 761

dr. ir. Christiaan Klaij | Senior Researcher | Research & Development
T +31 317 49 33 44 | C.Klaij at marin.nl | https://urldefense.us/v3/__http://www.marin.nl__;!!G_uCfscf7eWS!YrcVeQ6V8OD3jKxSzzxpyuTgFdncWh4YcL1SgDT8NHqystMpzO1pkd17oNGni-ll5I8qH9_ueOtj3WYWm7XFthU$
From stefano.zampini at gmail.com Thu Jun 6 08:01:12 2024
From: stefano.zampini at gmail.com (Stefano Zampini)
Date: Thu, 6 Jun 2024 15:01:12 +0200
Subject: [petsc-users] matload and petsc4py
In-Reply-To: References: Message-ID:

On Thu, Jun 6, 2024, 14:53 Klaij, Christiaan wrote:
> I'm writing a matrix to file from my fortran code (that uses petsc-3.19.4) with
> -mat_view binary. [...] However, when the matrix was written using multiple
> procs (attached file a_mat_np_2.dat) I get the error below. Is this supposed to
> work? If so, what am I doing wrong?

This should work. And your script seems ok too. How did you save the matrix in
parallel? I suspect that file is corrupt.
From bsmith at petsc.dev Thu Jun 6 08:31:18 2024
From: bsmith at petsc.dev (Barry Smith)
Date: Thu, 6 Jun 2024 09:31:18 -0400
Subject: [petsc-users] matload and petsc4py
In-Reply-To: References: Message-ID:

When I attempt to read the "np_2" data into Matlab (the reading process doesn't
use any PETSc code) I get the same failure. It does look like the file is
corrupted.

  Barry

From C.Klaij at marin.nl Thu Jun 6 08:50:53 2024
From: C.Klaij at marin.nl (Klaij, Christiaan)
Date: Thu, 6 Jun 2024 13:50:53 +0000
Subject: [petsc-users] matload and petsc4py
In-Reply-To: References: Message-ID:

The matrix was saved with this code:

CALL PetscViewerBinaryOpen(PETSC_COMM_WORLD, filename, FILE_MODE_WRITE, viewer, ier)
CALL PetscViewerPushFormat(viewer, PETSC_VIEWER_DEFAULT, ier)
CALL MatView(A, viewer, ier)
CALL PetscViewerDestroy(viewer, ier)

I will try again to check for file corruption.

From bsmith at petsc.dev Thu Jun 6 12:08:28 2024
From: bsmith at petsc.dev (Barry Smith)
Date: Thu, 6 Jun 2024 13:08:28 -0400
Subject: [petsc-users] matload and petsc4py
In-Reply-To: References: Message-ID: <7F3E9AE8-E4C9-47AE-9C50-DC9B26AFBFE0@petsc.dev>

Try without this line

CALL PetscViewerPushFormat(viewer, PETSC_VIEWER_DEFAULT, ier)

it shouldn't matter but worth trying.

> On Jun 6, 2024, at 9:50 AM, Klaij, Christiaan wrote:
> The matrix was saved with this code: [...] I will try again to check for file
> corruption.

From C.Klaij at marin.nl Fri Jun 7 03:17:53 2024
From: C.Klaij at marin.nl (Klaij, Christiaan)
Date: Fri, 7 Jun 2024 08:17:53 +0000
Subject: [petsc-users] matload and petsc4py
In-Reply-To: <7F3E9AE8-E4C9-47AE-9C50-DC9B26AFBFE0@petsc.dev> References: <7F3E9AE8-E4C9-47AE-9C50-DC9B26AFBFE0@petsc.dev> Message-ID:

Well, after trying a second time, the binary file is fine and loads into python
without issues. No idea how the first file got corrupted. I've also tried without
the PetscViewerPushFormat call, and that works equally well. Thanks for your help!

Chris

dr. ir. Christiaan Klaij | Senior Researcher | Research & Development
T +31 317 49 33 44 | C.Klaij at marin.nl | https://urldefense.us/v3/__http://www.marin.nl__;!!G_uCfscf7eWS!a9GfF_LhoavWZziwYJYPGdit19o_V12PJ7Va7HZ4FOqQrLcBCZOHGqnoEpTXlLRFh4MBui66Jv03qp9Zf3lKuoo$

From: Barry Smith
Sent: Thursday, June 6, 2024 7:08 PM
> Try without this line
> CALL PetscViewerPushFormat(viewer, PETSC_VIEWER_DEFAULT, ier)
> it shouldn't matter but worth trying.
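One way to rule out the Python side when a binary file is suspected of being corrupt is to read it back with MatLoad in a small Fortran program, independent of petsc4py. A minimal sketch, using the file name from this thread but otherwise illustrative:

program check_binary
#include <petsc/finclude/petscmat.h>
      use petscmat
      implicit none
      PetscErrorCode ierr
      Mat            A
      PetscViewer    viewer

      call PetscInitialize(PETSC_NULL_CHARACTER, ierr)
      ! Open the file written earlier and load the matrix back in.
      call PetscViewerBinaryOpen(PETSC_COMM_WORLD, 'a_mat_np_2.dat', FILE_MODE_READ, viewer, ierr)
      call MatCreate(PETSC_COMM_WORLD, A, ierr)
      call MatLoad(A, viewer, ierr)
      call PetscViewerDestroy(viewer, ierr)
      ! Print the matrix; if the file is damaged, MatLoad fails before this point.
      call MatView(A, PETSC_VIEWER_STDOUT_WORLD, ierr)
      call MatDestroy(A, ierr)
      call PetscFinalize(ierr)
end program check_binary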
From liufield at gmail.com Fri Jun 7 19:03:27 2024
From: liufield at gmail.com (neil liu)
Date: Fri, 7 Jun 2024 20:03:27 -0400
Subject: [petsc-users] About the complex version of gmres.
Message-ID:

Dear Petsc developers,

I am using Petsc to solve a complex system, AX=B, where A is complex and B is
real. The petsc was configured with

Configure options --download-mpich --download-fblaslapack=1 --with-cc=gcc
--with-cxx=g++ --with-fc=gfortran --download-triangle --with-scalar-type=complex

A and B were also imported into matlab and the same system was solved there. The
direct and iterative solvers in matlab give the same result, which is quite
different from the result from Petsc. A and B are attached. x from petsc is also
attached. I am using only one processor. It is weird.

Thanks a lot.

Xiaodong

-------------- next part --------------
%Mat Object: 1 MPI process
%  type: seqaij
%  Size = 189 189
%  Nonzeros = 1437
zzz = zeros(1437,4);
zzz = [
1 1  1.0000000000000000e+00  0.0000000000000000e+00
2 2  1.0000000000000000e+00  0.0000000000000000e+00
3 3  1.0000000000000000e+00  0.0000000000000000e+00
4 1  1.7448072822336421e+01  0.0000000000000000e+00
...
-2.6616457947952846e+01 0.0000000000000000e+00 38 48 -1.3333333333426900e+01 0.0000000000000000e+00 38 77 -7.8786346169886894e+00 0.0000000000000000e+00 38 79 1.3311600733637299e+01 0.0000000000000000e+00 38 83 2.7596110819336921e+01 0.0000000000000000e+00 38 84 -1.7769971617777554e+01 0.0000000000000000e+00 38 114 -6.6232014670614507e+00 0.0000000000000000e+00 38 116 -1.7489133700019053e+01 0.0000000000000000e+00 38 117 1.9090602567182867e+01 0.0000000000000000e+00 38 118 9.9130696007514807e+00 0.0000000000000000e+00 38 139 -2.6492805868186984e+01 0.0000000000000000e+00 38 142 -1.6449340668524162e+01 0.0000000000000000e+00 38 156 -4.1123351671170269e+01 0.0000000000000000e+00 38 158 -2.2380470700976407e+01 0.0000000000000000e+00 39 39 1.0000000000000000e+00 0.0000000000000000e+00 40 40 1.0000000000000000e+00 0.0000000000000000e+00 41 41 1.0000000000000000e+00 0.0000000000000000e+00 42 39 1.7489133700142400e+01 0.0000000000000000e+00 42 40 -1.9090602567359184e+01 0.0000000000000000e+00 42 41 -4.3159472534807662e+01 0.0000000000000000e+00 42 42 2.6777493357726918e+01 0.0000000000000000e+00 42 43 -1.6449340668433251e+01 0.0000000000000000e+00 42 44 -5.0043465199611703e+01 0.0000000000000000e+00 42 45 1.1006230771933016e+01 0.0000000000000000e+00 42 46 6.6232014670071973e+00 0.0000000000000000e+00 42 47 -1.6764324960524110e+01 0.0000000000000000e+00 42 57 -2.1872163005067677e+01 0.0000000000000000e+00 42 58 1.2467401100311948e+01 0.0000000000000000e+00 42 59 -1.6764324960536669e+01 0.0000000000000000e+00 42 102 -1.3076171287669142e-01 0.0000000000000000e+00 42 103 -2.0784484760043853e+01 0.0000000000000000e+00 42 125 -1.3076171282183557e-01 0.0000000000000000e+00 42 155 4.1527049998609456e+01 0.0000000000000000e+00 43 35 -8.4419963324012515e-01 0.0000000000000000e+00 43 39 -1.7489133700130310e+01 0.0000000000000000e+00 43 40 -1.1601468867205874e+01 0.0000000000000000e+00 43 41 -6.6232014670130877e+00 0.0000000000000000e+00 43 42 -1.6449340668433251e+01 0.0000000000000000e+00 43 43 2.4289567725555703e+01 0.0000000000000000e+00 43 44 -1.7380470700985949e+01 0.0000000000000000e+00 43 45 1.9090602567326492e+01 0.0000000000000000e+00 43 46 4.3159472534977773e+01 0.0000000000000000e+00 43 47 -5.0043465199837883e+01 0.0000000000000000e+00 43 61 -2.0043465199699927e+01 0.0000000000000000e+00 43 142 -8.4419963329562342e-01 0.0000000000000000e+00 43 143 -2.2467401100323890e+01 0.0000000000000000e+00 43 148 -1.7380470700979110e+01 0.0000000000000000e+00 43 149 1.2467401100356563e+01 0.0000000000000000e+00 43 154 3.8181205134735698e+01 0.0000000000000000e+00 44 31 -1.1601468867182296e+01 0.0000000000000000e+00 44 34 -8.4419963318848490e-01 0.0000000000000000e+00 44 35 -2.0043465199574989e+01 0.0000000000000000e+00 44 39 -1.2467401100346098e+01 0.0000000000000000e+00 44 40 2.2467401100326356e+01 0.0000000000000000e+00 44 41 4.3159472534787710e+01 0.0000000000000000e+00 44 42 -5.0043465199611703e+01 0.0000000000000000e+00 44 43 -1.7380470700985949e+01 0.0000000000000000e+00 44 44 2.4289567725532908e+01 0.0000000000000000e+00 44 57 1.9090602567331175e+01 0.0000000000000000e+00 44 58 -1.7489133700122572e+01 0.0000000000000000e+00 44 59 -1.6449340668376195e+01 0.0000000000000000e+00 44 61 -8.4419963324011982e-01 0.0000000000000000e+00 44 62 -6.6232014670015706e+00 0.0000000000000000e+00 44 63 -1.7380470701004306e+01 0.0000000000000000e+00 44 154 3.8181205134533059e+01 0.0000000000000000e+00 45 45 1.0000000000000000e+00 0.0000000000000000e+00 46 46 1.0000000000000000e+00 0.0000000000000000e+00 47 39 
1.2467401100319293e+01 0.0000000000000000e+00 47 42 -1.6764324960524110e+01 0.0000000000000000e+00 47 43 -5.0043465199837883e+01 0.0000000000000000e+00 47 45 -2.1872163005063907e+01 0.0000000000000000e+00 47 46 -4.3159472534999189e+01 0.0000000000000000e+00 47 47 2.6777493357764044e+01 0.0000000000000000e+00 47 68 6.6232014670190775e+00 0.0000000000000000e+00 47 102 -2.0784484760170976e+01 0.0000000000000000e+00 47 103 -1.3076171287695948e-01 0.0000000000000000e+00 47 143 1.9090602567357969e+01 0.0000000000000000e+00 47 144 -1.3076171293588179e-01 0.0000000000000000e+00 47 146 -1.1006230771956277e+01 0.0000000000000000e+00 47 147 -1.6764324960499092e+01 0.0000000000000000e+00 47 148 -1.6449340668493939e+01 0.0000000000000000e+00 47 149 -1.7489133700152156e+01 0.0000000000000000e+00 47 155 4.1527049998830122e+01 0.0000000000000000e+00 48 20 -1.7414916428258003e+01 0.0000000000000000e+00 48 38 -1.3333333333426900e+01 0.0000000000000000e+00 48 48 1.6016405898209033e+01 0.0000000000000000e+00 48 49 3.5488822097268788e+01 0.0000000000000000e+00 48 50 -1.7339039667063140e+01 0.0000000000000000e+00 48 51 -1.0894163608456564e+01 0.0000000000000000e+00 48 52 1.6175672477562717e+01 0.0000000000000000e+00 48 53 -1.4053360471994631e+00 0.0000000000000000e+00 48 83 1.7783889495298254e+01 0.0000000000000000e+00 48 114 -3.4179630537475248e+01 0.0000000000000000e+00 48 115 -2.8135536119673560e+01 0.0000000000000000e+00 48 116 -1.0459575115202444e+01 0.0000000000000000e+00 48 118 -1.9249782886004589e+01 0.0000000000000000e+00 48 139 4.8224670334251904e+01 0.0000000000000000e+00 48 142 -1.9090602567346330e+01 0.0000000000000000e+00 48 158 -1.5800734433625859e+01 0.0000000000000000e+00 49 17 -2.0381186490571285e+01 0.0000000000000000e+00 49 18 -1.7331655952024242e+01 0.0000000000000000e+00 49 20 -5.8035438684986659e+01 0.0000000000000000e+00 49 48 3.5488822097268788e+01 0.0000000000000000e+00 49 49 1.0964774914400550e+00 0.0000000000000000e+00 49 50 -3.9701701127612893e+01 0.0000000000000000e+00 49 51 -2.3697792022329636e+00 0.0000000000000000e+00 49 52 -2.2814274524002688e+01 0.0000000000000000e+00 49 53 4.3791673411322904e+01 0.0000000000000000e+00 49 70 -3.5815537467699436e+01 0.0000000000000000e+00 49 73 -8.0864434404526464e-01 0.0000000000000000e+00 49 80 1.9879352549751502e+01 0.0000000000000000e+00 49 83 1.7740766200929635e+01 0.0000000000000000e+00 49 105 -2.9692187617398202e-01 0.0000000000000000e+00 49 106 2.0726582079953101e+01 0.0000000000000000e+00 49 118 5.7191415197291157e+01 0.0000000000000000e+00 50 16 2.6615769043051355e+01 0.0000000000000000e+00 50 19 -5.1165882322008017e+01 0.0000000000000000e+00 50 20 -1.9471890319915452e+01 0.0000000000000000e+00 50 48 -1.7339039667063140e+01 0.0000000000000000e+00 50 49 -3.9701701127612893e+01 0.0000000000000000e+00 50 50 -1.3595792297217120e+00 0.0000000000000000e+00 50 51 -2.0073488294378681e+01 0.0000000000000000e+00 50 52 -3.7751152494324276e+00 0.0000000000000000e+00 50 53 4.0185699844313774e+01 0.0000000000000000e+00 50 72 -2.9860477361764172e+01 0.0000000000000000e+00 50 74 5.1766754357168931e-01 0.0000000000000000e+00 50 81 3.0715963936508487e+01 0.0000000000000000e+00 50 83 2.3244849821614178e+01 0.0000000000000000e+00 50 98 5.5584384299511463e+01 0.0000000000000000e+00 50 107 1.4070831509973505e+01 0.0000000000000000e+00 50 118 3.2584977267163240e-01 0.0000000000000000e+00 51 37 2.8574469510837048e+01 0.0000000000000000e+00 51 48 -1.0894163608456564e+01 0.0000000000000000e+00 51 49 -2.3697792022329636e+00 0.0000000000000000e+00 51 50 
-2.0073488294378681e+01 0.0000000000000000e+00 51 51 1.6049811715602488e+01 0.0000000000000000e+00 51 52 -5.1953259294069030e+00 0.0000000000000000e+00 51 53 -1.6070237705369234e+01 0.0000000000000000e+00 51 72 -1.8209919778019914e+00 0.0000000000000000e+00 51 74 -6.9454711937223053e+00 0.0000000000000000e+00 51 81 1.2989722224216774e+01 0.0000000000000000e+00 51 108 1.4294001552189453e+01 0.0000000000000000e+00 51 109 -1.5814075860761474e+01 0.0000000000000000e+00 51 110 -4.4667381637868147e+01 0.0000000000000000e+00 51 111 -1.6449340668656632e+01 0.0000000000000000e+00 51 112 -2.2106421349559767e+01 0.0000000000000000e+00 51 114 -2.2106421349502497e+01 0.0000000000000000e+00 51 115 -1.6449340668570123e+01 0.0000000000000000e+00 51 116 -1.4294001552190565e+01 0.0000000000000000e+00 51 117 1.5814075860695594e+01 0.0000000000000000e+00 52 48 1.6175672477562717e+01 0.0000000000000000e+00 52 49 -2.2814274524002688e+01 0.0000000000000000e+00 52 50 -3.7751152494324276e+00 0.0000000000000000e+00 52 51 -5.1953259294069030e+00 0.0000000000000000e+00 52 52 3.8357350751084979e+01 0.0000000000000000e+00 52 53 -1.6043886574567782e+01 0.0000000000000000e+00 52 70 -3.4390863654096888e+00 0.0000000000000000e+00 52 73 -5.7908521760064646e+00 0.0000000000000000e+00 52 80 1.6943875186319083e+01 0.0000000000000000e+00 52 113 -6.0291147700082801e+00 0.0000000000000000e+00 52 114 5.0177241411698461e+01 0.0000000000000000e+00 52 115 -3.2345921378490161e+01 0.0000000000000000e+00 52 116 -6.0291147701710637e+00 0.0000000000000000e+00 53 48 -1.4053360471994631e+00 0.0000000000000000e+00 53 49 4.3791673411322904e+01 0.0000000000000000e+00 53 50 4.0185699844313774e+01 0.0000000000000000e+00 53 51 -1.6070237705369234e+01 0.0000000000000000e+00 53 52 -1.6043886574567782e+01 0.0000000000000000e+00 53 53 -1.4637408281264747e+01 0.0000000000000000e+00 53 70 3.6634917548642932e+01 0.0000000000000000e+00 53 71 -1.1094523464978372e+00 0.0000000000000000e+00 53 72 3.3252070786516093e+01 0.0000000000000000e+00 53 73 -1.6194585729829392e+01 0.0000000000000000e+00 53 74 -1.6408724421341621e+01 0.0000000000000000e+00 53 80 -2.6304420213644240e+00 0.0000000000000000e+00 53 81 -2.3386595213736805e+00 0.0000000000000000e+00 54 54 1.0000000000000000e+00 0.0000000000000000e+00 55 55 1.0000000000000000e+00 0.0000000000000000e+00 56 32 -6.6232014670496744e+00 0.0000000000000000e+00 56 33 -2.6492805868184021e+01 0.0000000000000000e+00 56 35 -1.6449340668524155e+01 0.0000000000000000e+00 56 40 1.9090602567248258e+01 0.0000000000000000e+00 56 54 -1.7489133700043229e+01 0.0000000000000000e+00 56 55 -3.7358738101191719e+01 0.0000000000000000e+00 56 56 8.3183442524020244e+00 0.0000000000000000e+00 56 60 4.3159472534776100e+01 0.0000000000000000e+00 56 61 -5.0043465199670393e+01 0.0000000000000000e+00 56 77 2.5713804034296917e+01 0.0000000000000000e+00 56 84 2.8960206968421225e+01 0.0000000000000000e+00 56 118 9.9130696007758523e+00 0.0000000000000000e+00 56 139 -2.6492805868244766e+01 0.0000000000000000e+00 56 140 -2.4199265566346938e+01 0.0000000000000000e+00 56 141 1.2467401100241741e+01 0.0000000000000000e+00 56 156 -3.6883992664818962e+01 0.0000000000000000e+00 56 157 7.2681355338944371e+01 0.0000000000000000e+00 56 158 -1.6449340668568073e+01 0.0000000000000000e+00 56 159 2.7271807701986358e+01 0.0000000000000000e+00 57 57 1.0000000000000000e+00 0.0000000000000000e+00 58 58 1.0000000000000000e+00 0.0000000000000000e+00 59 31 -1.9090602567233109e+01 0.0000000000000000e+00 59 41 6.6232014670192285e+00 0.0000000000000000e+00 59 42 
-1.6764324960536669e+01 0.0000000000000000e+00 59 44 -1.6449340668376195e+01 0.0000000000000000e+00 59 57 1.1006230771987669e+01 0.0000000000000000e+00 59 58 1.7489133700086320e+01 0.0000000000000000e+00 59 59 2.6777493357847348e+01 0.0000000000000000e+00 59 62 -4.3159472534913199e+01 0.0000000000000000e+00 59 63 -5.0043465199646974e+01 0.0000000000000000e+00 59 103 -1.3076171282262211e-01 0.0000000000000000e+00 59 121 -2.1872163005043500e+01 0.0000000000000000e+00 59 124 1.2467401100265668e+01 0.0000000000000000e+00 59 125 -2.0784484760233202e+01 0.0000000000000000e+00 59 144 -1.3076171287746324e-01 0.0000000000000000e+00 59 147 -1.6764324960529994e+01 0.0000000000000000e+00 59 155 4.1527049998756240e+01 0.0000000000000000e+00 60 60 1.0000000000000000e+00 0.0000000000000000e+00 61 35 -1.7380470701007273e+01 0.0000000000000000e+00 61 40 -2.2467401100220776e+01 0.0000000000000000e+00 61 43 -2.0043465199699927e+01 0.0000000000000000e+00 61 44 -8.4419963324011982e-01 0.0000000000000000e+00 61 54 1.2467401100204443e+01 0.0000000000000000e+00 61 56 -5.0043465199670393e+01 0.0000000000000000e+00 61 60 -4.3159472534797487e+01 0.0000000000000000e+00 61 61 2.4289567725176678e+01 0.0000000000000000e+00 61 114 6.6232014670556927e+00 0.0000000000000000e+00 61 139 -1.6449340668584849e+01 0.0000000000000000e+00 61 140 1.9090602567279742e+01 0.0000000000000000e+00 61 141 -1.7489133700065072e+01 0.0000000000000000e+00 61 142 -1.7380470700988692e+01 0.0000000000000000e+00 61 143 -1.1601468867235742e+01 0.0000000000000000e+00 61 148 -8.4419963329562941e-01 0.0000000000000000e+00 61 154 -3.8181205134690224e+01 0.0000000000000000e+00 62 62 1.0000000000000000e+00 0.0000000000000000e+00 63 31 2.2467401100293539e+01 0.0000000000000000e+00 63 34 -2.0043465199753484e+01 0.0000000000000000e+00 63 35 -8.4419963318849123e-01 0.0000000000000000e+00 63 44 -1.7380470701004306e+01 0.0000000000000000e+00 63 58 -1.2467401100231541e+01 0.0000000000000000e+00 63 59 -5.0043465199646974e+01 0.0000000000000000e+00 63 62 4.3159472534933144e+01 0.0000000000000000e+00 63 63 2.4289567725507144e+01 0.0000000000000000e+00 63 68 -6.6232014670248569e+00 0.0000000000000000e+00 63 117 -1.1601468867247080e+01 0.0000000000000000e+00 63 121 1.9090602567261083e+01 0.0000000000000000e+00 63 124 -1.7489133700106120e+01 0.0000000000000000e+00 63 142 -8.4419963324011627e-01 0.0000000000000000e+00 63 147 -1.6449340668433234e+01 0.0000000000000000e+00 63 148 -1.7380470700980037e+01 0.0000000000000000e+00 63 154 3.8181205134667408e+01 0.0000000000000000e+00 64 64 1.0000000000000000e+00 0.0000000000000000e+00 65 65 1.0000000000000000e+00 0.0000000000000000e+00 66 23 8.5913042115943128e+00 0.0000000000000000e+00 66 24 -4.9049650732321602e+01 0.0000000000000000e+00 66 26 -2.7345733424251119e+01 0.0000000000000000e+00 66 64 2.9347815527649065e+01 0.0000000000000000e+00 66 65 5.4414158198222310e+01 0.0000000000000000e+00 66 66 1.2531207292774253e+02 0.0000000000000000e+00 66 67 -4.0279121954994665e+01 0.0000000000000000e+00 66 68 -2.6583572126570424e+01 0.0000000000000000e+00 66 69 -7.7667042901742050e+01 0.0000000000000000e+00 66 92 5.1374849561995696e+01 0.0000000000000000e+00 66 115 3.4963141930920521e+01 0.0000000000000000e+00 66 120 3.3626962294980871e+01 0.0000000000000000e+00 66 122 -9.8945155584021592e+01 0.0000000000000000e+00 66 126 4.7150779494343567e+01 0.0000000000000000e+00 66 152 -9.1333829836593821e+01 0.0000000000000000e+00 66 153 5.0430793667172779e+01 0.0000000000000000e+00 67 67 1.0000000000000000e+00 0.0000000000000000e+00 68 68 
1.0000000000000000e+00 0.0000000000000000e+00 69 26 -4.4663159887556333e+01 0.0000000000000000e+00 69 64 -2.8178869553063542e+01 0.0000000000000000e+00 69 66 -7.7667042901742050e+01 0.0000000000000000e+00 69 67 3.6693608927872376e+01 0.0000000000000000e+00 69 68 2.0067219210538251e+01 0.0000000000000000e+00 69 69 8.4340332882360215e+01 0.0000000000000000e+00 69 73 5.4027514936553924e+00 0.0000000000000000e+00 69 113 7.5173341243419980e+00 0.0000000000000000e+00 69 115 -4.9660430814339179e+01 0.0000000000000000e+00 69 122 3.3297687985166128e+01 0.0000000000000000e+00 69 126 7.0545466581885734e+01 0.0000000000000000e+00 69 137 -2.1681074519067085e+01 0.0000000000000000e+00 69 138 -1.6775902145416758e+01 0.0000000000000000e+00 70 18 -4.6268979531464907e+01 0.0000000000000000e+00 70 49 -3.5815537467699436e+01 0.0000000000000000e+00 70 52 -3.4390863654096888e+00 0.0000000000000000e+00 70 53 3.6634917548642932e+01 0.0000000000000000e+00 70 70 2.3437741868106370e+00 0.0000000000000000e+00 70 71 1.6166835965483756e+01 0.0000000000000000e+00 70 72 -2.7349592077169369e+01 0.0000000000000000e+00 70 73 -1.7801120814678470e+01 0.0000000000000000e+00 70 74 4.7743969236933970e-01 0.0000000000000000e+00 70 80 -3.3472162748130280e+01 0.0000000000000000e+00 70 105 4.8439642095812459e+01 0.0000000000000000e+00 70 106 2.0205271870454915e+01 0.0000000000000000e+00 70 130 -4.2744776025083631e-01 0.0000000000000000e+00 70 131 -1.4194653042381727e+01 0.0000000000000000e+00 70 133 1.8736454892255125e+01 0.0000000000000000e+00 70 134 -1.9040010693805481e+01 0.0000000000000000e+00 71 71 1.0000000000000000e+00 0.0000000000000000e+00 72 19 -1.5686699693518971e+01 0.0000000000000000e+00 72 50 -2.9860477361764172e+01 0.0000000000000000e+00 72 51 -1.8209919778019914e+00 0.0000000000000000e+00 72 53 3.3252070786516093e+01 0.0000000000000000e+00 72 70 -2.7349592077169369e+01 0.0000000000000000e+00 72 71 -2.9453033061618637e+01 0.0000000000000000e+00 72 72 -1.1086724857403496e+00 0.0000000000000000e+00 72 73 -6.3201265412849661e-01 0.0000000000000000e+00 72 74 -1.5966165212681368e+01 0.0000000000000000e+00 72 81 -1.3959809194814346e+01 0.0000000000000000e+00 72 98 5.9765829742346427e-02 0.0000000000000000e+00 72 107 2.0510009301603581e+01 0.0000000000000000e+00 72 130 4.7615590644471411e+01 0.0000000000000000e+00 72 131 -3.9766653206274306e+01 0.0000000000000000e+00 72 132 2.3817787858948190e+01 0.0000000000000000e+00 72 133 1.6390054863474191e+01 0.0000000000000000e+00 73 49 -8.0864434404526464e-01 0.0000000000000000e+00 73 52 -5.7908521760064646e+00 0.0000000000000000e+00 73 53 -1.6194585729829392e+01 0.0000000000000000e+00 73 69 5.4027514936553924e+00 0.0000000000000000e+00 73 70 -1.7801120814678470e+01 0.0000000000000000e+00 73 71 2.7081177937700178e+00 0.0000000000000000e+00 73 72 -6.3201265412849661e-01 0.0000000000000000e+00 73 73 2.4090582253076455e+01 0.0000000000000000e+00 73 74 -3.2929412089304748e+01 0.0000000000000000e+00 73 80 -1.1099274865000130e+01 0.0000000000000000e+00 73 113 -1.5496664507549749e+00 0.0000000000000000e+00 73 114 -2.2106421349298813e+01 0.0000000000000000e+00 73 115 -2.6121572966439849e+01 0.0000000000000000e+00 73 126 -8.5090389053661966e+01 0.0000000000000000e+00 73 135 2.3096201770041297e+01 0.0000000000000000e+00 73 137 -2.3497985375763026e+01 0.0000000000000000e+00 74 25 5.0219313686965641e+01 0.0000000000000000e+00 74 50 5.1766754357168931e-01 0.0000000000000000e+00 74 51 -6.9454711937223053e+00 0.0000000000000000e+00 74 53 -1.6408724421341621e+01 0.0000000000000000e+00 74 70 
4.7743969236933970e-01 0.0000000000000000e+00 74 71 -1.0337593869997969e+01 0.0000000000000000e+00 74 72 -1.5966165212681368e+01 0.0000000000000000e+00 74 73 -3.2929412089304748e+01 0.0000000000000000e+00 74 74 4.1990192765564522e+01 0.0000000000000000e+00 74 81 -1.6502172513268764e+01 0.0000000000000000e+00 74 108 6.0291147702444521e+00 0.0000000000000000e+00 74 111 -6.8766106137039799e+01 0.0000000000000000e+00 74 112 3.2407575082177992e+01 0.0000000000000000e+00 74 126 -4.9127550432182126e+01 0.0000000000000000e+00 74 135 -3.7704048746457630e+01 0.0000000000000000e+00 74 137 4.3920549752234976e+01 0.0000000000000000e+00 75 75 1.0000000000000000e+00 0.0000000000000000e+00 76 76 1.0000000000000000e+00 0.0000000000000000e+00 77 77 1.0000000000000000e+00 0.0000000000000000e+00 78 78 1.0000000000000000e+00 0.0000000000000000e+00 79 33 3.3116007334970750e+00 0.0000000000000000e+00 79 36 -7.4348022005954384e+00 0.0000000000000000e+00 79 38 1.3311600733637299e+01 0.0000000000000000e+00 79 75 6.6232014670561412e+00 0.0000000000000000e+00 79 76 -5.1521281988440970e+00 0.0000000000000000e+00 79 77 -2.5217325998193019e+00 0.0000000000000000e+00 79 78 -1.4978267400127701e+01 0.0000000000000000e+00 79 79 3.9635787279991894e+00 0.0000000000000000e+00 79 82 1.0490850646857598e-11 0.0000000000000000e+00 79 83 -1.0963728932341811e+01 0.0000000000000000e+00 79 84 -1.2543465199539856e+01 0.0000000000000000e+00 79 156 -6.8187948653282824e+00 0.0000000000000000e+00 79 158 6.6449340668385464e+00 0.0000000000000000e+00 80 80 1.0000000000000000e+00 0.0000000000000000e+00 81 81 1.0000000000000000e+00 0.0000000000000000e+00 82 16 -6.6666666666231240e+00 5.9676790943895563e-01 82 76 -1.0043465199680575e+01 0.0000000000000000e+00 82 78 1.0757749766886279e+01 -6.0762534852552182e-02 82 79 1.0490850646857598e-11 0.0000000000000000e+00 82 82 8.7072723266911556e+00 3.7583778255159461e+00 82 83 1.0043465199484812e+01 0.0000000000000000e+00 82 84 -1.0074890439491465e+01 7.3806700688669596e-02 82 97 -2.5535762623647805e+01 -1.1781048114242536e+00 82 98 8.6491799554800090e-12 0.0000000000000000e+00 82 163 3.4515451454850385e+01 3.8489110873569246e+00 82 166 -1.0753762576954568e+01 4.3151558690273251e-01 82 167 -8.9272429425853783e+00 -3.4589468699359380e+00 82 168 1.4206686392760542e+01 -1.9572360451562325e+00 82 169 -3.0214466474654518e+01 -5.2151641091413365e+00 82 171 -1.7882001707118309e+00 -2.1172957385442559e-02 82 184 1.8869095956946847e+01 1.7748727208668531e+00 83 16 1.0364306117981794e+01 0.0000000000000000e+00 83 19 -1.1075624703850663e+01 0.0000000000000000e+00 83 20 -1.6085168960056542e+01 0.0000000000000000e+00 83 38 2.7596110819336921e+01 0.0000000000000000e+00 83 48 1.7783889495298254e+01 0.0000000000000000e+00 83 49 1.7740766200929635e+01 0.0000000000000000e+00 83 50 2.3244849821614178e+01 0.0000000000000000e+00 83 76 -3.9478417604165529e+01 0.0000000000000000e+00 83 78 4.1297212469542018e+01 0.0000000000000000e+00 83 79 -1.0963728932341811e+01 0.0000000000000000e+00 83 82 1.0043465199484812e+01 0.0000000000000000e+00 83 83 -2.5442807651660807e+01 0.0000000000000000e+00 83 84 4.6003822371938995e+01 0.0000000000000000e+00 83 97 4.9348022004206094e+00 0.0000000000000000e+00 83 98 -3.2554805305031373e+01 0.0000000000000000e+00 83 118 -1.6642580518638933e+01 0.0000000000000000e+00 83 139 2.5713804034377343e+01 0.0000000000000000e+00 83 156 2.4414905684616095e+01 0.0000000000000000e+00 83 158 -2.4264463365716306e+01 0.0000000000000000e+00 84 84 1.0000000000000000e+00 0.0000000000000000e+00 85 5 
-2.3288694994240505e+01 0.0000000000000000e+00 85 85 4.0526712438262294e+01 0.0000000000000000e+00 85 86 2.1472831089089233e+01 0.0000000000000000e+00 85 87 5.1917657271250583e+01 0.0000000000000000e+00 85 88 -2.6020774803349561e+01 0.0000000000000000e+00 85 89 -2.6075348151002844e+00 0.0000000000000000e+00 85 91 -2.3253842548856689e+01 0.0000000000000000e+00 85 93 -3.6949498603262718e+00 0.0000000000000000e+00 85 94 1.5604370003431024e+01 0.0000000000000000e+00 85 95 4.6968039050478239e+01 0.0000000000000000e+00 85 96 -5.9640945775636055e+01 0.0000000000000000e+00 85 99 -2.6473189944988551e+01 0.0000000000000000e+00 85 128 1.4568216033769625e+01 0.0000000000000000e+00 86 86 1.0000000000000000e+00 0.0000000000000000e+00 87 2 2.0508483800085742e-11 0.0000000000000000e+00 87 5 4.6646332361794848e+01 0.0000000000000000e+00 87 6 4.6646332361931655e+01 0.0000000000000000e+00 87 85 5.1917657271250583e+01 0.0000000000000000e+00 87 86 5.1473750709339434e+00 0.0000000000000000e+00 87 87 -2.6555536016582089e+00 0.0000000000000000e+00 87 88 -3.8216861041382714e+01 0.0000000000000000e+00 87 89 -2.9823700187739199e+01 0.0000000000000000e+00 87 90 5.1473750709501687e+00 0.0000000000000000e+00 87 91 5.1917657271147483e+01 0.0000000000000000e+00 87 92 -2.9823700187864056e+01 0.0000000000000000e+00 87 93 -3.8216861041312981e+01 0.0000000000000000e+00 87 99 3.8830350754536821e-11 0.0000000000000000e+00 88 5 2.5398402558336581e+00 0.0000000000000000e+00 88 25 3.3297687985228649e+01 0.0000000000000000e+00 88 28 -2.8753234156828732e+01 0.0000000000000000e+00 88 29 -4.9825728880257245e+01 0.0000000000000000e+00 88 62 7.6762526523928187e+00 0.0000000000000000e+00 88 85 -2.6020774803349561e+01 0.0000000000000000e+00 88 86 2.3947201006777554e+00 0.0000000000000000e+00 88 87 -3.8216861041382714e+01 0.0000000000000000e+00 88 88 4.6044948205896510e+01 0.0000000000000000e+00 88 89 -1.5913704524898709e+01 0.0000000000000000e+00 88 91 -3.6949498603651034e+00 0.0000000000000000e+00 88 93 -1.3510878614578672e+01 0.0000000000000000e+00 88 99 -1.0987129275639305e+01 0.0000000000000000e+00 88 109 1.8684937856646826e+01 0.0000000000000000e+00 88 110 -1.6449340668565711e+01 0.0000000000000000e+00 88 111 -6.2810053867582674e+01 0.0000000000000000e+00 88 123 -1.7089004829949083e+01 0.0000000000000000e+00 88 126 -7.3968809367457411e+01 0.0000000000000000e+00 88 150 2.4955872988872677e+01 0.0000000000000000e+00 89 2 2.2446144896095017e+00 0.0000000000000000e+00 89 5 -1.3477435071945351e+01 0.0000000000000000e+00 89 6 8.7834743682660665e+00 0.0000000000000000e+00 89 29 7.7257735520364195e+00 0.0000000000000000e+00 89 85 -2.6075348151002844e+00 0.0000000000000000e+00 89 86 -1.4502999917857944e+00 0.0000000000000000e+00 89 87 -2.9823700187739199e+01 0.0000000000000000e+00 89 88 -1.5913704524898709e+01 0.0000000000000000e+00 89 89 4.1283867847855262e+01 0.0000000000000000e+00 89 92 -3.8194216398696653e+01 0.0000000000000000e+00 89 126 -9.0650378382518852e+01 0.0000000000000000e+00 89 150 -2.4553962047864061e+01 0.0000000000000000e+00 89 152 2.8297725455989607e+01 0.0000000000000000e+00 90 3 1.0852485140256135e+01 0.0000000000000000e+00 90 4 -9.3878182451303580e+00 0.0000000000000000e+00 90 6 -6.6946413258602018e+00 0.0000000000000000e+00 90 12 -1.6935645517848045e+01 0.0000000000000000e+00 90 14 -1.1604651406136309e+01 0.0000000000000000e+00 90 87 5.1473750709501687e+00 0.0000000000000000e+00 90 90 1.4624784316126876e+01 0.0000000000000000e+00 90 91 2.1472831089100517e+01 0.0000000000000000e+00 90 92 -1.5234207653726999e+01 
0.0000000000000000e+00 90 93 5.7878213972884396e+00 0.0000000000000000e+00 90 119 -1.4511438217375716e+01 0.0000000000000000e+00 90 120 -2.0609307911999629e+01 0.0000000000000000e+00 90 121 1.7355885974933507e+01 0.0000000000000000e+00 90 122 2.6780196646867150e+01 0.0000000000000000e+00 90 125 1.8095075137091108e+01 0.0000000000000000e+00 90 144 1.3382604049186964e+01 0.0000000000000000e+00 91 3 1.5604370003534353e+01 0.0000000000000000e+00 91 4 -5.9536194241145509e+01 0.0000000000000000e+00 91 6 -2.3288694994189619e+01 0.0000000000000000e+00 91 85 -2.3253842548856689e+01 0.0000000000000000e+00 91 87 5.1917657271147483e+01 0.0000000000000000e+00 91 88 -3.6949498603651034e+00 0.0000000000000000e+00 91 90 2.1472831089100517e+01 0.0000000000000000e+00 91 91 1.9987183943785425e+01 0.0000000000000000e+00 91 92 -2.6075348151653293e+00 0.0000000000000000e+00 91 93 -2.6020774803234076e+01 0.0000000000000000e+00 91 95 -7.6049649336359693e+00 0.0000000000000000e+00 91 96 -2.0434776960025673e+01 0.0000000000000000e+00 91 99 1.4255518621355611e+01 0.0000000000000000e+00 91 119 4.9528086829510066e+01 0.0000000000000000e+00 91 128 2.0032139326249474e+01 0.0000000000000000e+00 91 129 -2.2382684036448701e+01 0.0000000000000000e+00 92 2 7.7408918504385174e+00 0.0000000000000000e+00 92 5 8.7834743682865764e+00 0.0000000000000000e+00 92 6 -1.3477435072114222e+01 0.0000000000000000e+00 92 66 5.1374849561995696e+01 0.0000000000000000e+00 92 87 -2.9823700187864056e+01 0.0000000000000000e+00 92 89 -3.8194216398696653e+01 0.0000000000000000e+00 92 90 -1.5234207653726999e+01 0.0000000000000000e+00 92 91 -2.6075348151653293e+00 0.0000000000000000e+00 92 92 4.8126859401509321e+01 0.0000000000000000e+00 92 93 -4.0676346856151131e+00 0.0000000000000000e+00 92 120 -1.9422215090389944e+01 0.0000000000000000e+00 92 121 -1.5814075860780479e+01 0.0000000000000000e+00 92 122 -7.4395967888695111e+01 0.0000000000000000e+00 92 126 -5.2861465370024725e+01 0.0000000000000000e+00 92 150 5.2143212654248195e+01 0.0000000000000000e+00 92 152 -3.9361650367974505e+01 0.0000000000000000e+00 93 6 2.5398402557848376e+00 0.0000000000000000e+00 93 62 -4.1412937988349135e+01 0.0000000000000000e+00 93 85 -3.6949498603262718e+00 0.0000000000000000e+00 93 87 -3.8216861041312981e+01 0.0000000000000000e+00 93 88 -1.3510878614578672e+01 0.0000000000000000e+00 93 90 5.7878213972884396e+00 0.0000000000000000e+00 93 91 -2.6020774803234076e+01 0.0000000000000000e+00 93 92 -4.0676346856151131e+00 0.0000000000000000e+00 93 93 3.6605129437113803e+01 0.0000000000000000e+00 93 99 2.1237893729020371e+01 0.0000000000000000e+00 93 110 -4.9571093353742839e+01 0.0000000000000000e+00 93 120 1.0966120786086671e+01 0.0000000000000000e+00 93 121 -2.3584097786682609e+01 0.0000000000000000e+00 93 122 -2.6492805868140071e+01 0.0000000000000000e+00 93 123 -1.1846697480100222e+01 0.0000000000000000e+00 93 124 1.1846697480026791e+01 0.0000000000000000e+00 94 94 1.0000000000000000e+00 0.0000000000000000e+00 95 95 1.0000000000000000e+00 0.0000000000000000e+00 96 1 3.9086437101151272e+00 0.0000000000000000e+00 96 4 -5.9044935663294808e+01 0.0000000000000000e+00 96 5 -5.4704938971099068e+01 0.0000000000000000e+00 96 85 -5.9640945775636055e+01 0.0000000000000000e+00 96 86 -9.3878182450457377e+00 0.0000000000000000e+00 96 91 -2.0434776960025673e+01 0.0000000000000000e+00 96 94 -3.2994315576869099e+01 0.0000000000000000e+00 96 95 -4.8231052460292929e+01 0.0000000000000000e+00 96 96 4.4531507543960103e+01 0.0000000000000000e+00 96 99 2.2173180967405592e+01 
0.0000000000000000e+00 96 119 8.1314905263216541e+00 0.0000000000000000e+00 96 128 -1.9825404177536374e+01 0.0000000000000000e+00 96 129 -2.7538428878798253e+01 0.0000000000000000e+00 97 97 1.0000000000000000e+00 0.0000000000000000e+00 98 98 1.0000000000000000e+00 0.0000000000000000e+00 99 99 1.0000000000000000e+00 0.0000000000000000e+00 100 100 1.0000000000000000e+00 0.0000000000000000e+00 101 101 1.0000000000000000e+00 0.0000000000000000e+00 102 7 1.9704305647103112e+01 0.0000000000000000e+00 102 10 -4.9938558690455395e+01 0.0000000000000000e+00 102 11 -1.6449340668442446e+01 0.0000000000000000e+00 102 42 -1.3076171287669142e-01 0.0000000000000000e+00 102 45 -1.1315492283762314e+01 0.0000000000000000e+00 102 47 -2.0784484760170976e+01 0.0000000000000000e+00 102 100 1.8095075137114236e+01 0.0000000000000000e+00 102 101 -5.1467486931143966e+00 0.0000000000000000e+00 102 102 2.4338204324231416e+01 0.0000000000000000e+00 102 103 -1.8534515960647752e+01 0.0000000000000000e+00 102 104 4.4725308532643950e+01 0.0000000000000000e+00 102 144 -1.8534515960636728e+01 0.0000000000000000e+00 102 145 1.3382604049143946e+01 0.0000000000000000e+00 102 146 -2.3088865822875555e+01 0.0000000000000000e+00 102 147 -1.3076171293508593e-01 0.0000000000000000e+00 102 155 -4.2995741129939603e+01 0.0000000000000000e+00 103 11 -4.9938558690229470e+01 0.0000000000000000e+00 103 13 1.9704305647106722e+01 0.0000000000000000e+00 103 14 -1.6449340668387485e+01 0.0000000000000000e+00 103 42 -2.0784484760043853e+01 0.0000000000000000e+00 103 45 2.3088865822876826e+01 0.0000000000000000e+00 103 47 -1.3076171287695948e-01 0.0000000000000000e+00 103 57 -1.1315492283742888e+01 0.0000000000000000e+00 103 59 -1.3076171282262211e-01 0.0000000000000000e+00 103 100 1.3382604049133356e+01 0.0000000000000000e+00 103 101 4.4725308532444068e+01 0.0000000000000000e+00 103 102 -1.8534515960647752e+01 0.0000000000000000e+00 103 103 2.4338204324209894e+01 0.0000000000000000e+00 103 119 -5.1467486931059216e+00 0.0000000000000000e+00 103 125 -1.8534515960668010e+01 0.0000000000000000e+00 103 127 1.8095075137100217e+01 0.0000000000000000e+00 103 155 -4.2995741129710538e+01 0.0000000000000000e+00 104 104 1.0000000000000000e+00 0.0000000000000000e+00 105 105 1.0000000000000000e+00 0.0000000000000000e+00 106 106 1.0000000000000000e+00 0.0000000000000000e+00 107 107 1.0000000000000000e+00 0.0000000000000000e+00 108 108 1.0000000000000000e+00 0.0000000000000000e+00 109 109 1.0000000000000000e+00 0.0000000000000000e+00 110 28 -5.5701502816725919e+00 0.0000000000000000e+00 110 37 -5.6012393306473179e+01 0.0000000000000000e+00 110 51 -4.4667381637868147e+01 0.0000000000000000e+00 110 62 4.6699834731317758e+01 0.0000000000000000e+00 110 68 -5.5701502816672406e+00 0.0000000000000000e+00 110 88 -1.6449340668565711e+01 0.0000000000000000e+00 110 93 -4.9571093353742839e+01 0.0000000000000000e+00 110 99 -1.9522950132493040e+01 0.0000000000000000e+00 110 108 -2.4674094276182615e+01 0.0000000000000000e+00 110 109 3.6324564183551701e+01 0.0000000000000000e+00 110 110 1.5629361379673297e+01 0.0000000000000000e+00 110 111 -2.6492805868309290e+01 0.0000000000000000e+00 110 115 -2.6492805868205451e+01 0.0000000000000000e+00 110 116 2.4674094276076779e+01 0.0000000000000000e+00 110 117 -3.6324564183482664e+01 0.0000000000000000e+00 110 121 1.9522950132407644e+01 0.0000000000000000e+00 110 122 -1.6449340668479199e+01 0.0000000000000000e+00 110 123 1.7915945424690477e+01 0.0000000000000000e+00 110 124 -1.7915945424637485e+01 0.0000000000000000e+00 111 25 
-9.6124165763302869e+01 0.0000000000000000e+00 111 28 2.8911388257410497e+01 0.0000000000000000e+00 111 29 3.4963141930985500e+01 0.0000000000000000e+00 111 37 -8.8600184154211412e+00 0.0000000000000000e+00 111 51 -1.6449340668656632e+01 0.0000000000000000e+00 111 74 -6.8766106137039799e+01 0.0000000000000000e+00 111 81 2.8135536119804222e+01 0.0000000000000000e+00 111 88 -6.2810053867582674e+01 0.0000000000000000e+00 111 108 -2.6452672693858236e+01 0.0000000000000000e+00 111 109 -2.4841116200486788e+01 0.0000000000000000e+00 111 110 -2.6492805868309290e+01 0.0000000000000000e+00 111 111 9.0403347501215990e+01 0.0000000000000000e+00 111 112 -5.3875620264911724e+01 0.0000000000000000e+00 111 123 1.3114787574974235e+01 0.0000000000000000e+00 111 126 -5.2414568508313472e+01 0.0000000000000000e+00 111 135 4.9869806994679173e+01 0.0000000000000000e+00 112 112 1.0000000000000000e+00 0.0000000000000000e+00 113 113 1.0000000000000000e+00 0.0000000000000000e+00 114 114 1.0000000000000000e+00 0.0000000000000000e+00 115 37 -8.8600184153811892e+00 0.0000000000000000e+00 115 48 -2.8135536119673560e+01 0.0000000000000000e+00 115 51 -1.6449340668570123e+01 0.0000000000000000e+00 115 52 -3.2345921378490161e+01 0.0000000000000000e+00 115 66 3.4963141930920521e+01 0.0000000000000000e+00 115 68 2.8911388257303138e+01 0.0000000000000000e+00 115 69 -4.9660430814339179e+01 0.0000000000000000e+00 115 73 -2.6121572966439849e+01 0.0000000000000000e+00 115 80 -2.8135536119307094e+01 0.0000000000000000e+00 115 110 -2.6492805868205451e+01 0.0000000000000000e+00 115 113 2.3843651441156855e+01 0.0000000000000000e+00 115 114 -8.4021040675889608e+01 0.0000000000000000e+00 115 115 2.8685217943849054e+01 0.0000000000000000e+00 115 116 2.6452672693729774e+01 0.0000000000000000e+00 115 117 2.4841116200478140e+01 0.0000000000000000e+00 115 122 -6.2810053867281198e+01 0.0000000000000000e+00 115 124 -1.3114787574898628e+01 0.0000000000000000e+00 115 126 -6.9703611091676265e+01 0.0000000000000000e+00 115 137 1.9257522822590484e+01 0.0000000000000000e+00 116 34 1.2467401100150827e+01 0.0000000000000000e+00 116 37 1.8573837985352885e+01 0.0000000000000000e+00 116 38 -1.7489133700019053e+01 0.0000000000000000e+00 116 48 -1.0459575115202444e+01 0.0000000000000000e+00 116 51 -1.4294001552190565e+01 0.0000000000000000e+00 116 52 -6.0291147701710637e+00 0.0000000000000000e+00 116 110 2.4674094276076779e+01 0.0000000000000000e+00 116 114 3.1058087361754886e+01 0.0000000000000000e+00 116 115 2.6452672693729774e+01 0.0000000000000000e+00 116 116 2.0674004890939219e+01 0.0000000000000000e+00 116 117 -1.6620305244572116e+01 0.0000000000000000e+00 116 139 -1.2467401100284876e+01 0.0000000000000000e+00 116 142 1.7489133700079474e+01 0.0000000000000000e+00 117 34 -2.2467401100202494e+01 0.0000000000000000e+00 117 37 -2.1858170278467853e+01 0.0000000000000000e+00 117 38 1.9090602567182867e+01 0.0000000000000000e+00 117 51 1.5814075860695594e+01 0.0000000000000000e+00 117 63 -1.1601468867247080e+01 0.0000000000000000e+00 117 68 -1.6040587568851436e+01 0.0000000000000000e+00 117 110 -3.6324564183482664e+01 0.0000000000000000e+00 117 115 2.4841116200478140e+01 0.0000000000000000e+00 117 116 -1.6620305244572116e+01 0.0000000000000000e+00 117 117 1.7482138837763177e+01 0.0000000000000000e+00 117 122 -1.8684937856565870e+01 0.0000000000000000e+00 117 124 -1.3673584530817690e+01 0.0000000000000000e+00 117 142 1.1601468867171322e+01 0.0000000000000000e+00 117 147 -1.9090602567424561e+01 0.0000000000000000e+00 117 148 2.2467401100344624e+01 
0.0000000000000000e+00 117 154 3.7030813913021143e-15 0.0000000000000000e+00 118 118 1.0000000000000000e+00 0.0000000000000000e+00 119 119 1.0000000000000000e+00 0.0000000000000000e+00 120 120 1.0000000000000000e+00 0.0000000000000000e+00 121 12 -1.8529352742312085e+01 0.0000000000000000e+00 121 59 -2.1872163005043500e+01 0.0000000000000000e+00 121 62 -1.6878599844663768e+01 0.0000000000000000e+00 121 63 1.9090602567261083e+01 0.0000000000000000e+00 121 90 1.7355885974933507e+01 0.0000000000000000e+00 121 92 -1.5814075860780479e+01 0.0000000000000000e+00 121 93 -2.3584097786682609e+01 0.0000000000000000e+00 121 110 1.9522950132407644e+01 0.0000000000000000e+00 121 120 -2.3912473529038962e+01 0.0000000000000000e+00 121 121 1.8386189882562711e+01 0.0000000000000000e+00 121 122 3.8419594873139047e+01 0.0000000000000000e+00 121 124 -1.3254578392777509e+01 0.0000000000000000e+00 121 125 -1.1315492283802474e+01 0.0000000000000000e+00 121 144 2.3088865822894064e+01 0.0000000000000000e+00 121 147 1.1006230771893897e+01 0.0000000000000000e+00 121 155 -1.3103411407798407e-12 0.0000000000000000e+00 122 62 7.6762526523808523e+00 0.0000000000000000e+00 122 66 -9.8945155584021592e+01 0.0000000000000000e+00 122 68 -2.8753234156699836e+01 0.0000000000000000e+00 122 69 3.3297687985166128e+01 0.0000000000000000e+00 122 90 2.6780196646867150e+01 0.0000000000000000e+00 122 92 -7.4395967888695111e+01 0.0000000000000000e+00 122 93 -2.6492805868140071e+01 0.0000000000000000e+00 122 110 -1.6449340668479199e+01 0.0000000000000000e+00 122 115 -6.2810053867281198e+01 0.0000000000000000e+00 122 117 -1.8684937856565870e+01 0.0000000000000000e+00 122 120 4.1206568474082488e+01 0.0000000000000000e+00 122 121 3.8419594873139047e+01 0.0000000000000000e+00 122 122 9.6033209253089808e+01 0.0000000000000000e+00 122 124 1.7089004829904916e+01 0.0000000000000000e+00 122 126 -5.2414568508100203e+01 0.0000000000000000e+00 122 152 5.4356250761189244e+01 0.0000000000000000e+00 123 123 1.0000000000000000e+00 0.0000000000000000e+00 124 59 1.2467401100265668e+01 0.0000000000000000e+00 124 62 1.3681197950840481e+01 0.0000000000000000e+00 124 63 -1.7489133700106120e+01 0.0000000000000000e+00 124 68 1.2832113993874794e+01 0.0000000000000000e+00 124 93 1.1846697480026791e+01 0.0000000000000000e+00 124 110 -1.7915945424637485e+01 0.0000000000000000e+00 124 115 -1.3114787574898628e+01 0.0000000000000000e+00 124 117 -1.3673584530817690e+01 0.0000000000000000e+00 124 121 -1.3254578392777509e+01 0.0000000000000000e+00 124 122 1.7089004829904916e+01 0.0000000000000000e+00 124 124 2.0522287454795265e+01 0.0000000000000000e+00 124 147 1.7489133700166555e+01 0.0000000000000000e+00 124 148 -1.2467401100399702e+01 0.0000000000000000e+00 125 3 1.9704305647036673e+01 0.0000000000000000e+00 125 12 -1.6449340668440335e+01 0.0000000000000000e+00 125 14 -4.9938558690263804e+01 0.0000000000000000e+00 125 42 -1.3076171282183557e-01 0.0000000000000000e+00 125 57 2.3088865822845005e+01 0.0000000000000000e+00 125 59 -2.0784484760233202e+01 0.0000000000000000e+00 125 90 1.8095075137091108e+01 0.0000000000000000e+00 125 103 -1.8534515960668010e+01 0.0000000000000000e+00 125 119 4.4725308532591029e+01 0.0000000000000000e+00 125 120 -5.1467486931251205e+00 0.0000000000000000e+00 125 121 -1.1315492283802474e+01 0.0000000000000000e+00 125 125 2.4338204324197150e+01 0.0000000000000000e+00 125 127 1.3382604049018745e+01 0.0000000000000000e+00 125 144 -1.8534515960642914e+01 0.0000000000000000e+00 125 147 -1.3076171287615540e-01 0.0000000000000000e+00 125 155 
-4.2995741129862054e+01 0.0000000000000000e+00 126 2 2.3845487198258589e+01 0.0000000000000000e+00 126 25 4.7150779494431980e+01 0.0000000000000000e+00 126 28 1.6654539457568518e+00 0.0000000000000000e+00 126 29 7.7892879573674534e+01 0.0000000000000000e+00 126 66 4.7150779494343567e+01 0.0000000000000000e+00 126 68 1.6654539457543964e+00 0.0000000000000000e+00 126 69 7.0545466581885734e+01 0.0000000000000000e+00 126 71 -2.0824347982193679e+01 0.0000000000000000e+00 126 73 -8.5090389053661966e+01 0.0000000000000000e+00 126 74 -4.9127550432182126e+01 0.0000000000000000e+00 126 86 1.7230099436836259e+01 0.0000000000000000e+00 126 88 -7.3968809367457411e+01 0.0000000000000000e+00 126 89 -9.0650378382518852e+01 0.0000000000000000e+00 126 92 -5.2861465370024725e+01 0.0000000000000000e+00 126 111 -5.2414568508313472e+01 0.0000000000000000e+00 126 112 3.4950669228647313e-01 0.0000000000000000e+00 126 113 1.3854771328935094e+01 0.0000000000000000e+00 126 115 -6.9703611091676265e+01 0.0000000000000000e+00 126 120 2.9814011991935550e+00 0.0000000000000000e+00 126 122 -5.2414568508100203e+01 0.0000000000000000e+00 126 126 -1.6280212876367690e+02 0.0000000000000000e+00 126 135 4.5753430920864929e+01 0.0000000000000000e+00 126 137 1.0578089928095244e+02 0.0000000000000000e+00 126 150 1.2238074974045800e+02 0.0000000000000000e+00 126 152 5.2215371683106106e+01 0.0000000000000000e+00 127 127 1.0000000000000000e+00 0.0000000000000000e+00 128 128 1.0000000000000000e+00 0.0000000000000000e+00 129 129 1.0000000000000000e+00 0.0000000000000000e+00 130 130 1.0000000000000000e+00 0.0000000000000000e+00 131 15 7.7976060511603873e-01 0.0000000000000000e+00 131 18 -2.7722719906458007e+01 0.0000000000000000e+00 131 19 -2.7972608592053906e+01 0.0000000000000000e+00 131 70 -1.4194653042381727e+01 0.0000000000000000e+00 131 71 1.6817502623725026e+01 0.0000000000000000e+00 131 72 -3.9766653206274306e+01 0.0000000000000000e+00 131 105 1.1427000994260009e+01 0.0000000000000000e+00 131 107 -1.0793348766169260e+01 0.0000000000000000e+00 131 130 -3.5647361892048963e+01 0.0000000000000000e+00 131 131 3.1893340446453596e+01 0.0000000000000000e+00 131 132 -2.7905295652513615e+01 0.0000000000000000e+00 131 133 -1.5111365352317932e+01 0.0000000000000000e+00 131 134 -1.7582714884627066e+01 0.0000000000000000e+00 132 132 1.0000000000000000e+00 0.0000000000000000e+00 133 133 1.0000000000000000e+00 0.0000000000000000e+00 134 134 1.0000000000000000e+00 0.0000000000000000e+00 135 22 3.2847183223583592e+01 0.0000000000000000e+00 135 25 -8.7877538278337425e+01 0.0000000000000000e+00 135 26 -2.3098911292313055e+01 0.0000000000000000e+00 135 71 -6.8471544113370930e+00 0.0000000000000000e+00 135 73 2.3096201770041297e+01 0.0000000000000000e+00 135 74 -3.7704048746457630e+01 0.0000000000000000e+00 135 111 4.9869806994679173e+01 0.0000000000000000e+00 135 112 1.1646265066406764e+01 0.0000000000000000e+00 135 126 4.5753430920864929e+01 0.0000000000000000e+00 135 135 6.6742535265327945e+01 0.0000000000000000e+00 135 136 -2.6302969726668621e+01 0.0000000000000000e+00 135 137 -4.8530024155810395e+01 0.0000000000000000e+00 135 138 -3.3150802270595378e+01 0.0000000000000000e+00 136 136 1.0000000000000000e+00 0.0000000000000000e+00 137 26 -8.0589367291768525e+01 0.0000000000000000e+00 137 67 1.2806325667023055e+01 0.0000000000000000e+00 137 69 -2.1681074519067085e+01 0.0000000000000000e+00 137 71 1.8761698330425258e+01 0.0000000000000000e+00 137 73 -2.3497985375763026e+01 0.0000000000000000e+00 137 74 4.3920549752234976e+01 0.0000000000000000e+00 
137 113 -1.6010611270798254e+01 0.0000000000000000e+00 137 115 1.9257522822590484e+01 0.0000000000000000e+00 137 126 1.0578089928095244e+02 0.0000000000000000e+00 137 135 -4.8530024155810395e+01 0.0000000000000000e+00 137 136 3.0778506490551161e+00 0.0000000000000000e+00 137 137 6.8088251883491168e+01 0.0000000000000000e+00 137 138 6.5067057938926666e+01 0.0000000000000000e+00 138 138 1.0000000000000000e+00 0.0000000000000000e+00 139 38 -2.6492805868186984e+01 0.0000000000000000e+00 139 48 4.8224670334251904e+01 0.0000000000000000e+00 139 56 -2.6492805868244766e+01 0.0000000000000000e+00 139 60 -6.6232014670380028e+00 0.0000000000000000e+00 139 61 -1.6449340668584849e+01 0.0000000000000000e+00 139 83 2.5713804034377343e+01 0.0000000000000000e+00 139 114 4.3159472534672005e+01 0.0000000000000000e+00 139 116 -1.2467401100284876e+01 0.0000000000000000e+00 139 118 5.7833483537418630e+01 0.0000000000000000e+00 139 139 3.1347421188505429e+01 0.0000000000000000e+00 139 140 3.7358738101214797e+01 0.0000000000000000e+00 139 141 1.7489133700101345e+01 0.0000000000000000e+00 139 142 -5.0043465199635079e+01 0.0000000000000000e+00 139 143 -1.9090602567377832e+01 0.0000000000000000e+00 139 158 -4.3463728932235412e+01 0.0000000000000000e+00 139 159 2.5713804034426939e+01 0.0000000000000000e+00 140 140 1.0000000000000000e+00 0.0000000000000000e+00 141 141 1.0000000000000000e+00 0.0000000000000000e+00 142 34 -1.7380470701013184e+01 0.0000000000000000e+00 142 37 6.6232014670320378e+00 0.0000000000000000e+00 142 38 -1.6449340668524162e+01 0.0000000000000000e+00 142 43 -8.4419963329562342e-01 0.0000000000000000e+00 142 48 -1.9090602567346330e+01 0.0000000000000000e+00 142 61 -1.7380470700988692e+01 0.0000000000000000e+00 142 63 -8.4419963324011627e-01 0.0000000000000000e+00 142 114 -4.3159472534650575e+01 0.0000000000000000e+00 142 116 1.7489133700079474e+01 0.0000000000000000e+00 142 117 1.1601468867171322e+01 0.0000000000000000e+00 142 139 -5.0043465199635079e+01 0.0000000000000000e+00 142 141 -1.2467401100322142e+01 0.0000000000000000e+00 142 142 2.4289567725206084e+01 0.0000000000000000e+00 142 143 2.2467401100254907e+01 0.0000000000000000e+00 142 148 -2.0043465199521435e+01 0.0000000000000000e+00 142 154 -3.8181205134555896e+01 0.0000000000000000e+00 143 143 1.0000000000000000e+00 0.0000000000000000e+00 144 9 -1.9704305647236016e+01 0.0000000000000000e+00 144 10 -1.6449340668505251e+01 0.0000000000000000e+00 144 12 -4.9938558690418944e+01 0.0000000000000000e+00 144 47 -1.3076171293588179e-01 0.0000000000000000e+00 144 59 -1.3076171287746324e-01 0.0000000000000000e+00 144 90 1.3382604049186964e+01 0.0000000000000000e+00 144 102 -1.8534515960636728e+01 0.0000000000000000e+00 144 104 -5.1467486931015891e+00 0.0000000000000000e+00 144 120 4.4725308532538321e+01 0.0000000000000000e+00 144 121 2.3088865822894064e+01 0.0000000000000000e+00 144 125 -1.8534515960642914e+01 0.0000000000000000e+00 144 144 2.4338204324112478e+01 0.0000000000000000e+00 144 145 1.8095075137176806e+01 0.0000000000000000e+00 144 146 1.1315492283700914e+01 0.0000000000000000e+00 144 147 -2.0784484759981577e+01 0.0000000000000000e+00 144 155 -4.2995741129788058e+01 0.0000000000000000e+00 145 145 1.0000000000000000e+00 0.0000000000000000e+00 146 146 1.0000000000000000e+00 0.0000000000000000e+00 147 47 -1.6764324960499092e+01 0.0000000000000000e+00 147 59 -1.6764324960529994e+01 0.0000000000000000e+00 147 62 6.6232014669954147e+00 0.0000000000000000e+00 147 63 -1.6449340668433234e+01 0.0000000000000000e+00 147 68 -4.3159472534852263e+01 
[... remaining numeric entries of the sparse matrix omitted from this archive; the full Matlab-format dump was attached to the original message ...]
]; Mat_0x84000000_1 = spconvert(zzz);
-------------- next part --------------
%Vec Object: 1 MPI process
%  type: seq
Vec_0x84000000_1 = [ ... complex-valued entries omitted ... ];
-------------- next part --------------
%Vec Object: 1 MPI process
%  type: seq
Vec_0x84000000_1 = [ ... real-valued entries omitted ... ];
From bsmith at petsc.dev  Fri Jun  7 21:12:11 2024
From: bsmith at petsc.dev (Barry Smith)
Date: Fri, 7 Jun 2024 22:12:11 -0400
Subject: [petsc-users] About the complex version of gmres.
In-Reply-To: 
References: 
Message-ID: <1F6F070D-D78D-4E9E-9118-4D1437B3A9E5@petsc.dev>

   If I run with -pc_type lu it solves the system.

   If I run with the default (ILU) -ksp_monitor_true_residual -ksp_converged_reason it does not converge. In fact, it makes no real progress to the solution.

   It is always important to use KSPGetConvergedReason() or -ksp_converged_reason or -ksp_error_if_not_converged to check that the solver has actually converged.

   If I run with -ksp_gmres_restart 100 it converges in 75 iterations.

   Barry

> On Jun 7, 2024, at 8:03 PM, neil liu wrote:
>
> Dear Petsc developers,
>
> I am using Petsc to solve a complex system, AX = B.
>
> A is complex and B is real.
>
> And the petsc was configured with
> Configure options --download-mpich --download-fblaslapack=1 --with-cc=gcc --with-cxx=g++ --with-fc=gfortran --download-triangle --with-scalar-type=complex
>
> A and B were also imported to matlab and the same system was solved.
> The direct and iterative solvers in matlab give the same result, which is quite different from the result from Petsc.
> A and B are attached. x from petsc is also attached. I am using only one processor.
>
> It is weird.
>
> Thanks a lot.
>
> Xiaodong
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
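For reference, the check Barry describes above can also be done in code rather than on the command line. A minimal sketch using PETSc's C interface (the function and variable names are illustrative, not taken from any code in this thread):

#include <petscksp.h>

/* Solve, then verify the Krylov solver actually converged before using x. */
static PetscErrorCode SolveAndCheck(KSP ksp, Vec b, Vec x)
{
  KSPConvergedReason reason;

  PetscFunctionBeginUser;
  /* Alternatively, call KSPSetErrorIfNotConverged(ksp, PETSC_TRUE) before the solve */
  PetscCall(KSPSolve(ksp, b, x));
  PetscCall(KSPGetConvergedReason(ksp, &reason));
  if (reason < 0) { /* negative values indicate divergence, e.g. KSP_DIVERGED_ITS */
    PetscCall(PetscPrintf(PETSC_COMM_WORLD, "KSP did not converge: %s\n", KSPConvergedReasons[reason]));
  }
  PetscFunctionReturn(PETSC_SUCCESS);
}

The same information is printed automatically when the run includes -ksp_converged_reason, as suggested above.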
From liufield at gmail.com  Fri Jun  7 21:53:03 2024
From: liufield at gmail.com (neil liu)
Date: Fri, 7 Jun 2024 22:53:03 -0400
Subject: [petsc-users] About the complex version of gmres.
In-Reply-To: <1F6F070D-D78D-4E9E-9118-4D1437B3A9E5@petsc.dev>
References: <1F6F070D-D78D-4E9E-9118-4D1437B3A9E5@petsc.dev>
Message-ID: 

Thanks a lot for your explanation.
Could you please share your petsc code to test this?

Thanks,

Xiaodong

On Fri, Jun 7, 2024 at 10:12 PM Barry Smith wrote:
>
>    If I run with -pc_type lu it solves the system.
>
>    If I run with the default (ILU) -ksp_monitor_true_residual -ksp_converged_reason it does not converge. In fact, it makes no real progress to the solution.
>
>    It is always important to use KSPGetConvergedReason() or -ksp_converged_reason or -ksp_error_if_not_converged to check that the solver has actually converged.
>
>    If I run with -ksp_gmres_restart 100 it converges in 75 iterations.
>
>    Barry
>
>> On Jun 7, 2024, at 8:03 PM, neil liu wrote:
>>
>> Dear Petsc developers,
>>
>> I am using Petsc to solve a complex system, AX = B.
>>
>> A is complex and B is real.
>>
>> And the petsc was configured with
>> Configure options --download-mpich --download-fblaslapack=1 --with-cc=gcc --with-cxx=g++ --with-fc=gfortran --download-triangle --with-scalar-type=complex
>>
>> A and B were also imported to matlab and the same system was solved.
>> The direct and iterative solvers in matlab give the same result, which is quite different from the result from Petsc.
>> A and B are attached. x from petsc is also attached. I am using only one processor.
>>
>> It is weird.
>>
>> Thanks a lot.
>>
>> Xiaodong
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From bsmith at petsc.dev  Fri Jun  7 21:56:48 2024
From: bsmith at petsc.dev (Barry Smith)
Date: Fri, 7 Jun 2024 22:56:48 -0400
Subject: [petsc-users] About the complex version of gmres.
In-Reply-To: 
References: <1F6F070D-D78D-4E9E-9118-4D1437B3A9E5@petsc.dev>
Message-ID: <8263C2B5-1521-4A9B-B8CA-2B91BDA9FFB8@petsc.dev>

   I used src/ksp/ksp/tutorials/ex10.c and saved the A and b into a PETSc binary file from Matlab with PetscBinaryWrite() in PETSC_DIR/share/petsc/matlab

> On Jun 7, 2024, at 10:53 PM, neil liu wrote:
>
> Thanks a lot for your explanation.
> Could you please share your petsc code to test this?
>
> Thanks,
>
> Xiaodong
>
> On Fri, Jun 7, 2024 at 10:12 PM Barry Smith wrote:
>>
>>    If I run with -pc_type lu it solves the system.
>>
>>    If I run with the default (ILU) -ksp_monitor_true_residual -ksp_converged_reason it does not converge. In fact, it makes no real progress to the solution.
>>
>>    It is always important to use KSPGetConvergedReason() or -ksp_converged_reason or -ksp_error_if_not_converged to check that the solver has actually converged.
>>
>>    If I run with -ksp_gmres_restart 100 it converges in 75 iterations.
>>
>>    Barry
>>
>>> On Jun 7, 2024, at 8:03 PM, neil liu wrote:
>>>
>>> Dear Petsc developers,
>>>
>>> I am using Petsc to solve a complex system, AX = B.
>>>
>>> A is complex and B is real.
>>>
>>> And the petsc was configured with
>>> Configure options --download-mpich --download-fblaslapack=1 --with-cc=gcc --with-cxx=g++ --with-fc=gfortran --download-triangle --with-scalar-type=complex
>>>
>>> A and B were also imported to matlab and the same system was solved.
>>> The direct and iterative solvers in matlab give the same result, which is quite different from the result from Petsc.
>>> A and B are attached. x from petsc is also attached. I am using only one processor.
>>>
>>> It is weird.
>>>
>>> Thanks a lot.
>>>
>>> Xiaodong
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From miroslav.kuchta at gmail.com  Tue Jun 11 08:24:34 2024
From: miroslav.kuchta at gmail.com (Miroslav Kuchta)
Date: Tue, 11 Jun 2024 15:24:34 +0200
Subject: [petsc-users] Memory usage in SLEPc eigensolver
Message-ID: 

Dear mailing list,

I have a question regarding memory usage in SLEPc. Specifically, I am running out of memory when solving a generalized eigenvalue problem Ax = alpha Mx. Here M is singular, so we set the problem type to GNHEP and solve with the Krylov-Schur method and a shift-and-invert spectral transform. The matrix A comes from a Stokes-like problem, so the transform is set to use a block-diagonal preconditioner B where each of the blocks (through fieldsplit) uses hypre. The solver works nicely on smaller problems in 3d (with about 100K dofs).
However, upon further refinement the system size gets to millions of dofs and we run out of memory (>150GB). I find it surprising because KSP(A, B) on the same machine works without issues. When running with "-log_trace -info" I see that the memory requests before the job is killed come from the preconditioner setup

[0] PCSetUp(): Setting up PC for first time
[0] MatConvert(): Check superclass seqhypre mpiaij -> 0
[0] MatConvert(): Check superclass mpihypre mpiaij -> 0
[0] MatConvert(): Check specialized (1) MatConvert_mpiaij_seqhypre_C (mpiaij) -> 0
[0] MatConvert(): Check specialized (1) MatConvert_mpiaij_mpihypre_C (mpiaij) -> 0
[0] MatConvert(): Check specialized (1) MatConvert_mpiaij_hypre_C (mpiaij) -> 1

Interestingly, when solving just the problem Ax = b with B as preconditioner I don't see any calls like the above. We can get access to a larger machine, but I am curious whether our setup/solution strategy can be improved/optimized. Do you have any advice on how to reduce the memory footprint?

Thanks and best regards, Miro
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From jroman at dsic.upv.es  Tue Jun 11 08:43:17 2024
From: jroman at dsic.upv.es (Jose E. Roman)
Date: Tue, 11 Jun 2024 15:43:17 +0200
Subject: [petsc-users] Memory usage in SLEPc eigensolver
In-Reply-To: 
References: 
Message-ID: <8B6DD72E-5570-4A8F-A232-2EF8E8D8663B@dsic.upv.es>

An HTML attachment was scrubbed...
URL: 
From bsmith at petsc.dev  Tue Jun 11 08:43:38 2024
From: bsmith at petsc.dev (Barry Smith)
Date: Tue, 11 Jun 2024 09:43:38 -0400
Subject: [petsc-users] Memory usage in SLEPc eigensolver
In-Reply-To: 
References: 
Message-ID: <299B86EC-2F14-4369-A04E-20E3D1A097D6@petsc.dev>

   You can run with -log_view -log_view_memory and it will display rich information about in which event the memory is allocated and how much. There are several columns of information, and the notes displayed explain how to interpret each column. Feel free to post the output and ask questions about the information displayed, since it is a bit confusing.

   Barry

> On Jun 11, 2024, at 9:24 AM, Miroslav Kuchta wrote:
>
> Dear mailing list,
>
> I have a question regarding memory usage in SLEPc. Specifically, I am running out of memory when solving a generalized
> eigenvalue problem Ax = alpha Mx. Here M is singular, so we set the problem type to GNHEP and solve with the Krylov-Schur
> method and a shift-and-invert spectral transform. The matrix A comes from a Stokes-like problem, so the transform is set to use
> a block-diagonal preconditioner B where each of the blocks (through fieldsplit) uses hypre. The solver works nicely on smaller
> problems in 3d (with about 100K dofs). However, upon further refinement the system size gets to millions of dofs and we run
> out of memory (>150GB). I find it surprising because KSP(A, B) on the same machine works without issues.
When running > with "-log_trace -info" I see that the memory requests before the job is killed come from the preconditioner setup > > [0] PCSetUp(): Setting up PC for first time > [0] MatConvert(): Check superclass seqhypre mpiaij -> 0 > [0] MatConvert(): Check superclass mpihypre mpiaij -> 0 > [0] MatConvert(): Check specialized (1) MatConvert_mpiaij_seqhypre_C (mpiaij) -> 0 > [0] MatConvert(): Check specialized (1) MatConvert_mpiaij_mpihypre_C (mpiaij) -> 0 > [0] MatConvert(): Check specialized (1) MatConvert_mpiaij_hypre_C (mpiaij) -> 1 > > Interestingly, when solving just the problem Ax = b with B as preconditioner I don't see any calls like the above. We can get access > to a larger machine but I am curious if our setup/solution strategy can be improved/optimized. Do you have any advice on how to > reduce the memory footprint? > > Thanks and best regards, Miro -------------- next part -------------- An HTML attachment was scrubbed... URL: From danyang.su at gmail.com Tue Jun 11 20:05:36 2024 From: danyang.su at gmail.com (Danyang Su) Date: Tue, 11 Jun 2024 18:05:36 -0700 Subject: [petsc-users] Error in PETSc configuration on Mac Sonoma Message-ID: <28E0BACE-BA01-4BD4-9A3B-EEBA2FD751CF@gmail.com> Dear All, I recently upgraded my MacOS to Sonoma and found some issue in PETSc configuration, which does not occur before. The main issue is from hdf5 fortran binding. It?s more likely something is broken after system upgrade, but not clear which part. Any suggestions? Thanks and regards, Danyang -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: configure.log Type: application/octet-stream Size: 5536399 bytes Desc: not available URL: From bsmith at petsc.dev Tue Jun 11 21:10:45 2024 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 11 Jun 2024 22:10:45 -0400 Subject: [petsc-users] Error in PETSc configuration on Mac Sonoma In-Reply-To: <28E0BACE-BA01-4BD4-9A3B-EEBA2FD751CF@gmail.com> References: <28E0BACE-BA01-4BD4-9A3B-EEBA2FD751CF@gmail.com> Message-ID: The issue is here configure:31815: /Users/danyangsu/Soft/PETSc/petsc-3.20.5/macos-gnu-opt/bin/mpif90 -o conftest -I. -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -g -O -fallow-argument-mismatch conftest.f90 -ldl -lm >&5 ld: warning: -commons use_dylibs is no longer supported, using error treatment instead ld: common symbol '_mpifcmb5_' from '/private/var/folders/jm/wcm4mv8s3v1gqz383tcf_4c00000gp/T/ccqBB7yf.o' conflicts with definition from dylib '_mpifcmb5_' from '/Users/danyangsu/Soft/PETSc/petsc-3.20.5/macos-gnu-opt/lib/libmpifort.12.dylib' Perhaps someone has the fix. > On Jun 11, 2024, at 9:05?PM, Danyang Su wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Dear All, > > I recently upgraded my MacOS to Sonoma and found some issue in PETSc configuration, which does not occur before. The main issue is from hdf5 fortran binding. It?s more likely something is broken after system upgrade, but not clear which part. > > Any suggestions? > > Thanks and regards, > > Danyang > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From balay.anl at fastmail.org Tue Jun 11 23:01:18 2024 From: balay.anl at fastmail.org (Satish Balay) Date: Tue, 11 Jun 2024 23:01:18 -0500 (CDT) Subject: [petsc-users] Error in PETSc configuration on Mac Sonoma In-Reply-To: References: <28E0BACE-BA01-4BD4-9A3B-EEBA2FD751CF@gmail.com> Message-ID: An HTML attachment was scrubbed... URL: From danyang.su at gmail.com Tue Jun 11 23:28:34 2024 From: danyang.su at gmail.com (Danyang Su) Date: Tue, 11 Jun 2024 21:28:34 -0700 Subject: [petsc-users] Error in PETSc configuration on Mac Sonoma In-Reply-To: References: <28E0BACE-BA01-4BD4-9A3B-EEBA2FD751CF@gmail.com> Message-ID: An HTML attachment was scrubbed... URL: From derek.teaney at stonybrook.edu Wed Jun 12 08:11:11 2024 From: derek.teaney at stonybrook.edu (Derek Teaney) Date: Wed, 12 Jun 2024 09:11:11 -0400 Subject: [petsc-users] Questions on EIMEX: Message-ID: <8DF55AE6-9DC2-406E-861E-159155AAEC13@stonybrook.edu> Dear All, I have a question and a comment on the TSEIMEX scheme in the TS routines. 1/ Looking at the cited reference, I see three schemes there 2.4b, 2.4c, 2.4d . It is not clear which of these is being implemented. 2/ The documentation for EIMEX mixes up F(u, udot ) and G(u) relative to the users manual. This may have been done on purpose, to conform with the Constantinescu ref., but a perhaps a comment is in order. Thanks, Derek ------------------------------------------------------------------------ Derek Teaney Professor Dept. of Physics & Astronomy Stony Brook University Stony Brook, NY 11794-3800 Tel: (631) 632-4489 e-mail: Derek.Teaney at stonybrook.edu ------------------------------------------------------------------------ -------------- next part -------------- An HTML attachment was scrubbed... URL: From yongzhong.li at mail.utoronto.ca Wed Jun 12 17:36:23 2024 From: yongzhong.li at mail.utoronto.ca (Yongzhong Li) Date: Wed, 12 Jun 2024 22:36:23 +0000 Subject: [petsc-users] Assistance Needed with PETSc KSPSolve Performance Issue Message-ID: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. Thank you for your time and assistance. 
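For readers unfamiliar with the setup being described: a shell system matrix of this kind is normally created along the following lines (a minimal C sketch with illustrative names; the actual matrix-vector routine and context used in this work are not shown in the thread):

#include <petscksp.h>

typedef struct {
  PetscInt n;   /* whatever application data the matrix-vector product needs (illustrative) */
} UserCtx;

/* Matrix-free application of y = A*x; this is the routine whose cost is being
   compared against the total KSPSolve time in this thread. */
static PetscErrorCode UserMult(Mat A, Vec x, Vec y)
{
  UserCtx *ctx;

  PetscFunctionBeginUser;
  PetscCall(MatShellGetContext(A, &ctx));
  /* ... apply the operator to x using ctx, storing the result in y ... */
  PetscFunctionReturn(PETSC_SUCCESS);
}

static PetscErrorCode CreateShellOperator(MPI_Comm comm, UserCtx *ctx, PetscInt nlocal, Mat *A)
{
  PetscFunctionBeginUser;
  PetscCall(MatCreateShell(comm, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE, ctx, A));
  PetscCall(MatShellSetOperation(*A, MATOP_MULT, (void (*)(void))UserMult));
  PetscFunctionReturn(PETSC_SUCCESS);
}

A shell preconditioner is set up analogously with PCSHELL and PCShellSetApply().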
Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!YME6NPPibCKcgA6BRrCcOZBp90jG3xcObexgXGxsVV6i12v_JAnZlhZNJ1SQdikKzM6jBmFVU2Tqrhfjag9YiyROKq6IsdoQ8Lw$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Wed Jun 12 17:46:14 2024 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 12 Jun 2024 18:46:14 -0400 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: Message-ID: On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li wrote: > Dear PETSc?s developers, I hope this email finds you well. I am currently > working on a project using PETSc and have encountered a performance issue > with the KSPSolve function. Specifically, I have noticed that the time > taken by KSPSolve is > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > > Dear PETSc?s developers, > > I hope this email finds you well. > > I am currently working on a project using PETSc and have encountered a > performance issue with the KSPSolve function. Specifically, *I have > noticed that the time taken by **KSPSolve** is **almost two times **greater > than the CPU time for matrix-vector product multiplied by the number of > iteration steps*. I use C++ chrono to record CPU time. > > For context, I am using a shell system matrix A. Despite my efforts to > parallelize the matrix-vector product (Ax), the overall solve time > remains higher than the matrix vector product per iteration indicates > when multiple threads were used. Here are a few details of my setup: > > - *Matrix Type*: Shell system matrix > - *Preconditioner*: Shell PC > - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK > library, multithreading is enabled > > I have considered several potential reasons, such as preconditioner setup, > additional solver operations, and the inherent overhead of using a shell > system matrix. *However, since KSPSolve is a high-level API, I have been > unable to pinpoint the exact cause of the increased solve time.* > > Have you observed the same issue? Could you please provide some experience > on how to diagnose and address this performance discrepancy? Any insights > or recommendations you could offer would be greatly appreciated. > For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt > Thank you for your time and assistance. > > Best regards, > > Yongzhong > > ----------------------------------------------------------- > > *Yongzhong Li* > > PhD student | Electromagnetics Group > > Department of Electrical & Computer Engineering > > University of Toronto > > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!eMuXWvayLIhrQweHZY95IfQMST6PiUiLEskCz9WUy0pb9bazMdyoLiAyZh_l80blSuxXwO5yN7vzdEzWkCL8$ > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!eMuXWvayLIhrQweHZY95IfQMST6PiUiLEskCz9WUy0pb9bazMdyoLiAyZh_l80blSuxXwO5yN7vzdEsBefqt$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From yongzhong.li at mail.utoronto.ca Thu Jun 13 12:27:47 2024 From: yongzhong.li at mail.utoronto.ca (Yongzhong Li) Date: Thu, 13 Jun 2024 17:27:47 +0000 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: Message-ID: Hi Matt, I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. Thanks! Yongzhong From: Matthew Knepley Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li Cc: petsc-users at mcs.anl.gov , petsc-maint at mcs.anl.gov , Piero Triverio Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. 
Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!Yfo9Sj1FioTY_hbQkkkF2sbCwllFU-V5CHUuDrxbor7fV_x0ZipVlX0pNA0DVF4dgKpaGDiA3saQdP5M_n-IhQWNlnw5Ugt9PLY$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!Yfo9Sj1FioTY_hbQkkkF2sbCwllFU-V5CHUuDrxbor7fV_x0ZipVlX0pNA0DVF4dgKpaGDiA3saQdP5M_n-IhQWNlnw5mUCkHxo$ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ksp_petsc_log.txt URL: From bsmith at petsc.dev Thu Jun 13 13:14:20 2024 From: bsmith at petsc.dev (Barry Smith) Date: Thu, 13 Jun 2024 14:14:20 -0400 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: Message-ID: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? Thanks Barry > On Jun 13, 2024, at 1:27?PM, Yongzhong Li wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Matt, > > I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. > > Thanks! > Yongzhong > > From: Matthew Knepley > > Date: Wednesday, June 12, 2024 at 6:46?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: > Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Dear PETSc?s developers, > I hope this email finds you well. > I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. > For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: > Matrix Type: Shell system matrix > Preconditioner: Shell PC > Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled > I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. 
However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. > Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. > > For any performance question like this, we need to see the output of your code run with > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > Thanks, > > Matt > > Thank you for your time and assistance. > Best regards, > Yongzhong > ----------------------------------------------------------- > Yongzhong Li > PhD student | Electromagnetics Group > Department of Electrical & Computer Engineering > University of Toronto > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!fauKPPSN6fIvLxuqYn1CRvpUf5q9zeWauAOP28SBKtXHbucpJwjmXGMcWD21S3qRjSPoyFZTDYG9jPhI5dAE71E$ > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fauKPPSN6fIvLxuqYn1CRvpUf5q9zeWauAOP28SBKtXHbucpJwjmXGMcWD21S3qRjSPoyFZTDYG9jPhIrXxdS4M$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From spradeepmahadeek at gmail.com Thu Jun 13 17:33:24 2024 From: spradeepmahadeek at gmail.com (s.pradeep kumar) Date: Thu, 13 Jun 2024 17:33:24 -0500 Subject: [petsc-users] Petsc Build error Message-ID: Dear Petsc Development Team, I am encountering an issue while trying to build Petsc library in our Cray Cluster. Despite loading appropriate gcc and cray-mpich modules and successfully configuring Petsc, I receive the following error (see A1). I have also attached petsc configure statement (see A2) and snippet of configure log (see A3). Could you please provide assistance in resolving this build issue? Any insights or additional steps I might take to diagnose and fix the problem would be greatly appreciated. Thank you for your support Regards, Pradeep *A1.Error from make.log:* ========================================== /usr/bin/gmake --print-directory -f gmakefile -j88 -l272.0 --output-sync=recurse V= libs gmake[3]: *** No rule to make target '/$NETWORK$/users/$USERNAME$/NumLib/petsc-3.21.1/src/sys/f90-mod/petscsysmod.F90', needed by 'gnu-opt/obj/src/sys/f90-mod/petscsysmod.o'. Stop. 
gmake[2]: *** [/users/$USERNAME$/NumLib/petsc-3.21.1/lib/petsc/conf/ rules_doc.mk:5: libs] Error 2 **************************ERROR************************************* Error during compile, check gnu-opt/lib/petsc/conf/make.log Send it and gnu-opt/lib/petsc/conf/configure.log to petsc-maint at mcs.anl.gov ******************************************************************** Finishing make run at Thu, 13 Jun 2024 11:42:44 -0600 *A2.Configure Statement:* ./configure --with-scalar-type=real --with-precision=double --download-metis=$metis_path --download-metis-use-doubleprecision=1 --download-parmetis=$parmetis_path --with-cmake=1 --with-mpi-dir=$MPI_DIR --download-fblaslapack=$fblaslapack_path --with-debugging=0 *COPTFLAGS*=-O2 *FOPTFLAGS*=-O2 *Where the paths are defined:* *fblaslapack_path* ="/users/$USERNAME$/NumLib/ExternalPackages/petsc-pkg-fblaslapack-e8a03f57d64c.tar.gz" *metis_path* ="/users/$USERNAME$/NumLib/ExternalPackages/petsc-pkg-metis-69fb26dd0428.tar.gz" *parmetis_path* ="/users/$USERNAME$/NumLib/ExternalPackages/petsc-pkg-parmetis-f5e3aab04fd5.tar.gz" *MPI_DIR*=${CRAY_MPICH_PREFIX} *A3. Snippets From configure.log:* --------------------------------------------------------------------------------------------- PETSc: Build : Set default architecture to gnu-opt in lib/petsc/conf/petscvariables File creation : Created gnu-opt/lib/petsc/conf/reconfigure-gnu-opt.py for automatic reconfiguration Framework: RDict update : Substitutions were stored in RDict with parent None File creation : Created makefile configure header gnu-opt/lib/petsc/conf/petscvariables File creation : Created makefile configure header gnu-opt/lib/petsc/conf/petscrules File creation : Created configure header gnu-opt/include/petscconf.h File creation : Created C specific configure header gnu-opt/include/petscfix.h File creation : Created configure pkg header gnu-opt/include/petscpkg_version.h Compilers: C Compiler: /opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O2 Version: gcc (GCC) 12.2.0 20220819 (HPE) C++ Compiler: /opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/bin/mpicxx -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-psabi -fstack-protector -fvisibility=hidden -g -O -std=gnu++20 -fPIC Version: g++ (GCC) 12.2.0 20220819 (HPE) Fortran Compiler: /opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/bin/mpif90 -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O2 Version: GNU Fortran (GCC) 12.2.0 20220819 (HPE) Linkers: Shared linker: /opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/bin/mpicc -shared -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O2 Dynamic linker: /opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/bin/mpicc -shared -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O2 Libraries linked against: -lquadmath -ldl BlasLapack: Libraries: -Wl,-rpath,/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/lib -L/users/$USERNAME$ /NumLib/petsc-3.21.1/gnu-opt/lib -lflapack -lfblas uses 4 byte integers MPI: Version: 3 Includes: -I/opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/include mpiexec: /bin/false Implementation: mpich3 MPICH_NUMVERSION: 30400002 X: Libraries: -lX11 python: Executable: /usr/bin/python3 pthread: Libraries: -lpthread cmake: Version: 3.23.1 
Executable: /usr/projects/hpcsoft/tce/23-03/cray-sles15-x86_64_v3-slingshot-none/packages/cmake/cmake-3.23.1/bin/cmake fblaslapack: metis: Version: 5.1.0 Includes: -I/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/include Libraries: -Wl,-rpath,/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/lib -L/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/lib -lmetis parmetis: Version: 4.0.3 Includes: -I/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/include Libraries: -Wl,-rpath,/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/lib -L/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/lib -lparmetis regex: bison: Version: 3.0 Executable: /usr/bin/bison PETSc: Language used to compile PETSc: C PETSC_ARCH: gnu-opt PETSC_DIR: /users/$USERNAME$/NumLib/petsc-3.21.1 Prefix: Scalar type: real Precision: double Support for __float128 Integer size: 4 bytes Single library: yes Shared libraries: yes Memory alignment from malloc(): 16 bytes Using GNU make: /usr/bin/gmake xxx=======================================================================================xxx Configure stage complete. Now build PETSc libraries with: make PETSC_DIR=/users/$USERNAME$/NumLib/petsc-3.21.1 PETSC_ARCH=gnu-opt all xxx=======================================================================================xxx ================================================================================ Finishing configure run at Thu, 13 Jun 2024 11:42:35 -0600 -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Thu Jun 13 17:47:13 2024 From: bsmith at petsc.dev (Barry Smith) Date: Thu, 13 Jun 2024 18:47:13 -0400 Subject: [petsc-users] Petsc Build error In-Reply-To: References: Message-ID: <2B59960D-E8B1-4068-889F-4C9A5899383C@petsc.dev> Please send configure.log and make.log to petsc-maint at mcs.anl.gov without the exact details we cannot determine what has gone wrong. > On Jun 13, 2024, at 6:33?PM, s.pradeep kumar wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Dear Petsc Development Team, > > I am encountering an issue while trying to build Petsc library in our Cray Cluster. Despite loading appropriate gcc and cray-mpich modules and successfully configuring Petsc, I receive the following error (see A1). I have also attached petsc configure statement (see A2) and snippet of configure log (see A3). > > Could you please provide assistance in resolving this build issue? Any insights or additional steps I might take to diagnose and fix the problem would be greatly appreciated. > > Thank you for your support > > Regards, > > Pradeep > > > > > A1.Error from make.log: > > > ========================================== > /usr/bin/gmake --print-directory -f gmakefile -j88 -l272.0 --output-sync=recurse V= libs > gmake[3]: *** No rule to make target '/$NETWORK$/users/$USERNAME$/NumLib/petsc-3.21.1/src/sys/f90-mod/petscsysmod.F90', needed by 'gnu-opt/obj/src/sys/f90-mod/petscsysmod.o'. Stop. 
> gmake[2]: *** [/users/$USERNAME$/NumLib/petsc-3.21.1/lib/petsc/conf/rules_doc.mk:5 : libs] Error 2 > **************************ERROR************************************* > Error during compile, check gnu-opt/lib/petsc/conf/make.log > Send it and gnu-opt/lib/petsc/conf/configure.log to petsc-maint at mcs.anl.gov > ******************************************************************** > Finishing make run at Thu, 13 Jun 2024 11:42:44 -0600 > > A2.Configure Statement: > > ./configure --with-scalar-type=real --with-precision=double --download-metis=$metis_path --download-metis-use-doubleprecision=1 --download-parmetis=$parmetis_path --with-cmake=1 --with-mpi-dir=$MPI_DIR --download-fblaslapack=$fblaslapack_path --with-debugging=0 COPTFLAGS=-O2 FOPTFLAGS=-O2 > > Where the paths are defined: > > fblaslapack_path="/users/$USERNAME$/NumLib/ExternalPackages/petsc-pkg-fblaslapack-e8a03f57d64c.tar.gz" > metis_path="/users/$USERNAME$/NumLib/ExternalPackages/petsc-pkg-metis-69fb26dd0428.tar.gz" > parmetis_path="/users/$USERNAME$/NumLib/ExternalPackages/petsc-pkg-parmetis-f5e3aab04fd5.tar.gz" > > MPI_DIR=${CRAY_MPICH_PREFIX} > > > > > A3. Snippets From configure.log: > > --------------------------------------------------------------------------------------------- > PETSc: > Build : Set default architecture to gnu-opt in lib/petsc/conf/petscvariables > File creation : Created gnu-opt/lib/petsc/conf/reconfigure-gnu-opt.py for automatic reconfiguration > Framework: > RDict update : Substitutions were stored in RDict with parent None > File creation : Created makefile configure header gnu-opt/lib/petsc/conf/petscvariables > File creation : Created makefile configure header gnu-opt/lib/petsc/conf/petscrules > File creation : Created configure header gnu-opt/include/petscconf.h > File creation : Created C specific configure header gnu-opt/include/petscfix.h > File creation : Created configure pkg header gnu-opt/include/petscpkg_version.h > Compilers: > C Compiler: /opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O2 > Version: gcc (GCC) 12.2.0 20220819 (HPE) > C++ Compiler: /opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/bin/mpicxx -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-psabi -fstack-protector -fvisibility=hidden -g -O -std=gnu++20 -fPIC > Version: g++ (GCC) 12.2.0 20220819 (HPE) > Fortran Compiler: /opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/bin/mpif90 -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O2 > Version: GNU Fortran (GCC) 12.2.0 20220819 (HPE) > Linkers: > Shared linker: /opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/bin/mpicc -shared -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O2 > Dynamic linker: /opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/bin/mpicc -shared -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O2 > Libraries linked against: -lquadmath -ldl > BlasLapack: > Libraries: -Wl,-rpath,/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/lib -L/users/$USERNAME$ /NumLib/petsc-3.21.1/gnu-opt/lib -lflapack -lfblas > uses 4 byte integers > MPI: > Version: 3 > Includes: -I/opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/include > mpiexec: /bin/false > Implementation: mpich3 > MPICH_NUMVERSION: 30400002 > X: > Libraries: -lX11 > 
> python:
>   Executable: /usr/bin/python3
> pthread:
>   Libraries: -lpthread
> cmake:
>   Version: 3.23.1
>   Executable: /usr/projects/hpcsoft/tce/23-03/cray-sles15-x86_64_v3-slingshot-none/packages/cmake/cmake-3.23.1/bin/cmake
> fblaslapack:
> metis:
>   Version: 5.1.0
>   Includes: -I/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/include
>   Libraries: -Wl,-rpath,/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/lib -L/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/lib -lmetis
> parmetis:
>   Version: 4.0.3
>   Includes: -I/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/include
>   Libraries: -Wl,-rpath,/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/lib -L/users/$USERNAME$/NumLib/petsc-3.21.1/gnu-opt/lib -lparmetis
> regex:
> bison:
>   Version: 3.0
>   Executable: /usr/bin/bison
> PETSc:
>   Language used to compile PETSc: C
>   PETSC_ARCH: gnu-opt
>   PETSC_DIR: /users/$USERNAME$/NumLib/petsc-3.21.1
>   Prefix:
>   Scalar type: real
>   Precision: double
>   Support for __float128
>   Integer size: 4 bytes
>   Single library: yes
>   Shared libraries: yes
>   Memory alignment from malloc(): 16 bytes
>   Using GNU make: /usr/bin/gmake
> xxx=======================================================================================xxx
> Configure stage complete. Now build PETSc libraries with:
>   make PETSC_DIR=/users/$USERNAME$/NumLib/petsc-3.21.1 PETSC_ARCH=gnu-opt all
> xxx=======================================================================================xxx
> ================================================================================
> Finishing configure run at Thu, 13 Jun 2024 11:42:35 -0600
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From balay.anl at fastmail.org  Thu Jun 13 18:06:00 2024
From: balay.anl at fastmail.org (Satish Balay)
Date: Thu, 13 Jun 2024 18:06:00 -0500 (CDT)
Subject: [petsc-users] Petsc Build error
In-Reply-To: <2B59960D-E8B1-4068-889F-4C9A5899383C@petsc.dev>
References: <2B59960D-E8B1-4068-889F-4C9A5899383C@petsc.dev>
Message-ID: <277fbcf4-d939-f751-52bd-899331ee7e2b@fastmail.org>

An HTML attachment was scrubbed...
URL: 
From balay.anl at fastmail.org  Thu Jun 13 18:14:21 2024
From: balay.anl at fastmail.org (Satish Balay)
Date: Thu, 13 Jun 2024 18:14:21 -0500 (CDT)
Subject: [petsc-users] Petsc Build error
In-Reply-To: <277fbcf4-d939-f751-52bd-899331ee7e2b@fastmail.org>
References: <2B59960D-E8B1-4068-889F-4C9A5899383C@petsc.dev> <277fbcf4-d939-f751-52bd-899331ee7e2b@fastmail.org>
Message-ID: <87083505-33ab-8f2c-2b6a-3078efe6578e@fastmail.org>

An HTML attachment was scrubbed...
URL: 
From yongzhong.li at mail.utoronto.ca  Fri Jun 14 00:19:43 2024
From: yongzhong.li at mail.utoronto.ca (Yongzhong Li)
Date: Fri, 14 Jun 2024 05:19:43 +0000
Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
In-Reply-To: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev>
References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev>
Message-ID: 

Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically

KSPGuess Object: 1 MPI process
  type: fischer
    Model 1, size 200

However, I found that at higher frequency the number of iteration steps is significantly higher than with KSPGuess; I have attached both of the results for your reference.

Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparison? What are we comparing here? How does it relate to the performance issue/bottleneck I found?
?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? Thank you! Yongzhong From: Barry Smith Date: Thursday, June 13, 2024 at 2:14?PM To: Yongzhong Li Cc: petsc-users at mcs.anl.gov , petsc-maint at mcs.anl.gov , Piero Triverio Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? Thanks Barry On Jun 13, 2024, at 1:27?PM, Yongzhong Li wrote: This Message Is From an External Sender This message came from outside your organization. Hi Matt, I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. Thanks! Yongzhong From: Matthew Knepley > Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. 
Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!ZrobJp9NYWE8nb54WSzT7puCrLjMrJ2gz_VVjU7BYkDhaIoql4SuXfIZdDIZ_qqOZjY3w64UM-TXxOZFa6OIoiuhUWJal-lbdro$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZrobJp9NYWE8nb54WSzT7puCrLjMrJ2gz_VVjU7BYkDhaIoql4SuXfIZdDIZ_qqOZjY3w64UM-TXxOZFa6OIoiuhUWJa55ry-qw$ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ksp_petsc_log.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ksp_petsc_log_noguess.txt URL: From knepley at gmail.com Fri Jun 14 09:29:23 2024 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 14 Jun 2024 10:29:23 -0400 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> Message-ID: PETSc itself only takes 47% of the runtime. I am not sure what is happening for the other half. For the PETSc half, it is all in the solve: KSPSolve 20 1.0 5.3323e+03 1.0 1.01e+14 1.0 0.0e+00 0.0e+00 0.0e+00 47 100 0 0 0 47 100 0 0 0 18943 About 2/3 of that is matrix operations (I don't know where you are using LU) MatMult 19960 1.0 2.1336e+03 1.0 8.78e+13 1.0 0.0e+00 0.0e+00 0.0e+00 19 87 0 0 0 19 87 0 0 0 41163 MatMultAdd 152320 1.0 8.4854e+02 1.0 3.60e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 35 0 0 0 7 35 0 0 0 42442 MatSolve 6600 1.0 9.0724e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 8 0 0 0 0 8 0 0 0 0 0 and 1/3 is vector operations for orthogonalization in GMRES: KSPGMRESOrthog 3290 1.0 1.2390e+03 1.0 8.77e+12 1.0 0.0e+00 0.0e+00 0.0e+00 11 9 0 0 0 11 9 0 0 0 7082 VecMAXPY 13220 1.0 1.7894e+03 1.0 9.02e+12 1.0 0.0e+00 0.0e+00 0.0e+00 16 9 0 0 0 16 9 0 0 0 5040 The flop rates do not look crazy, but I do not know what kind of hardware you are running on. Thanks, Matt On Fri, Jun 14, 2024 at 1:20?AM Yongzhong Li wrote: > Thanks, I have attached the results without using any KSPGuess. At low > frequency, the iteration steps are quite close to the one with KSPGuess, > specifically KSPGuess Object: 1 MPI process type: fischer Model 1, size 200 > However, I found at > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > > Thanks, I have attached the results without using any KSPGuess. At low > frequency, the iteration steps are quite close to the one with KSPGuess, > specifically > > KSPGuess Object: 1 MPI process > > type: fischer > > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are > significant higher than the one with KSPGuess, I have attahced both of the > results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used > for a baseline comparsion? What are we comparing here? How does it relate > to the performance issue/bottleneck I found? 
?*I have noticed that the > time taken by **KSPSolve is **almost two times greater than the CPU time > for matrix-vector product multiplied by the number of iteration*? > > Thank you! > Yongzhong > > > > *From: *Barry Smith > *Date: *Thursday, June 13, 2024 at 2:14?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > Can you please run the same thing without the KSPGuess option(s) for a > baseline comparison? > > > > Thanks > > > > Barry > > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hi Matt, > > I have rerun the program with the keys you provided. The system output > when performing ksp solve and the final petsc log output were stored in a > .txt file attached for your reference. > > Thanks! > Yongzhong > > > > *From: *Matthew Knepley > *Date: *Wednesday, June 12, 2024 at 6:46?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > ????????? knepley at gmail.com ????????????????? > > > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Dear PETSc?s developers, I hope this email finds you well. I am currently > working on a project using PETSc and have encountered a performance issue > with the KSPSolve function. Specifically, I have noticed that the time > taken by KSPSolve is > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > Dear PETSc?s developers, > > I hope this email finds you well. > > I am currently working on a project using PETSc and have encountered a > performance issue with the KSPSolve function. Specifically, *I have > noticed that the time taken by **KSPSolve is **almost two times greater > than the CPU time for matrix-vector product multiplied by the number of > iteration steps*. I use C++ chrono to record CPU time. > > For context, I am using a shell system matrix A. Despite my efforts to > parallelize the matrix-vector product (Ax), the overall solve time > remains higher than the matrix vector product per iteration indicates > when multiple threads were used. Here are a few details of my setup: > > - *Matrix Type*: Shell system matrix > - *Preconditioner*: Shell PC > - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK > library, multithreading is enabled > > I have considered several potential reasons, such as preconditioner setup, > additional solver operations, and the inherent overhead of using a shell > system matrix. *However, since KSPSolve is a high-level API, I have been > unable to pinpoint the exact cause of the increased solve time.* > > Have you observed the same issue? Could you please provide some experience > on how to diagnose and address this performance discrepancy? Any > insights or recommendations you could offer would be greatly appreciated. > > > > For any performance question like this, we need to see the output of your > code run with > > > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > > > Thanks, > > > > Matt > > > > Thank you for your time and assistance. 
> > Best regards, > > Yongzhong > > ----------------------------------------------------------- > > *Yongzhong Li* > > PhD student | Electromagnetics Group > > Department of Electrical & Computer Engineering > > University of Toronto > > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!fTDOqOTfYZs4FVyI7NuFX2IPcFNkDKfw0tBwg7sqK1df_HIGAzkpZHNBcWjz96Mfb2isyStipMBB1awwc73f$ > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fTDOqOTfYZs4FVyI7NuFX2IPcFNkDKfw0tBwg7sqK1df_HIGAzkpZHNBcWjz96Mfb2isyStipMBB1W3-CeTd$ > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fTDOqOTfYZs4FVyI7NuFX2IPcFNkDKfw0tBwg7sqK1df_HIGAzkpZHNBcWjz96Mfb2isyStipMBB1W3-CeTd$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From spradeepmahadeek at gmail.com Fri Jun 14 10:16:58 2024 From: spradeepmahadeek at gmail.com (s.pradeep kumar) Date: Fri, 14 Jun 2024 10:16:58 -0500 Subject: [petsc-users] Petsc Build error In-Reply-To: <87083505-33ab-8f2c-2b6a-3078efe6578e@fastmail.org> References: <2B59960D-E8B1-4068-889F-4C9A5899383C@petsc.dev> <277fbcf4-d939-f751-52bd-899331ee7e2b@fastmail.org> <87083505-33ab-8f2c-2b6a-3078efe6578e@fastmail.org> Message-ID: Thanks Satish and Barry! Starting from fresh petsc tarball worked. Appreciate your support. Regards, Pradeep On Thu, Jun 13, 2024 at 6:14?PM Satish Balay wrote: > > > > On Jun 13, 2024, at 6:33?PM, s.pradeep kumar < > spradeepmahadeek at gmail.com> wrote: > > > > > ./configure --with-scalar-type=real --with-precision=double > --download-metis=$metis_path --download-metis-use-doubleprecision=1 > --download-parmetis=$parmetis_path --with-cmake=1 --with-mpi-dir=$MPI_DI > > R --download-fblaslapack=$fblaslapack_path --with-debugging=0 > COPTFLAGS=-O2 FOPTFLAGS=-O2 > > Also on cray you can: > > - use cc,CC,ftn as MPI compilers (with cray-mpich, PrgEnv-gnu or > equivalent loaded). i.e --with-cc=cc etc [and skip --with-mpi-dir] > > - use 'module load cray-libsci' instead of --download-fblaslapack > > Satish -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Fri Jun 14 10:36:22 2024 From: bsmith at petsc.dev (Barry Smith) Date: Fri, 14 Jun 2024 11:36:22 -0400 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> Message-ID: <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> I am a bit confused. 
Without the initial guess computation, there are still a bunch of events I don't understand MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 Finally there are a huge number of MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. Barry > On Jun 14, 2024, at 1:19?AM, Yongzhong Li wrote: > > Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically > > KSPGuess Object: 1 MPI process > type: fischer > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? > > Thank you! > Yongzhong > > From: Barry Smith > > Date: Thursday, June 13, 2024 at 2:14?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? > > Thanks > > Barry > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Matt, > > I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. > > Thanks! 
> Yongzhong > > From: Matthew Knepley > > Date: Wednesday, June 12, 2024 at 6:46?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: > Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Dear PETSc?s developers, > I hope this email finds you well. > I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. > For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: > Matrix Type: Shell system matrix > Preconditioner: Shell PC > Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled > I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. > Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. > > For any performance question like this, we need to see the output of your code run with > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > Thanks, > > Matt > > Thank you for your time and assistance. > Best regards, > Yongzhong > ----------------------------------------------------------- > Yongzhong Li > PhD student | Electromagnetics Group > Department of Electrical & Computer Engineering > University of Toronto > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!cpzruOwzb5N1ZGsGKL8RbNWCwCC7xZghRWjYeSbdL5VZd4fq0dIKpA21KkD9s30f3YZvEW-b_U_OktT0STuBwcM$ > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!cpzruOwzb5N1ZGsGKL8RbNWCwCC7xZghRWjYeSbdL5VZd4fq0dIKpA21KkD9s30f3YZvEW-b_U_OktT0QwJXMa4$ > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sircara at ornl.gov Fri Jun 14 12:35:14 2024 From: sircara at ornl.gov (Sircar, Arpan) Date: Fri, 14 Jun 2024 17:35:14 +0000 Subject: [petsc-users] Running PETSc with a Kokkos backend on OLCF Frontier Message-ID: Hi, We have been working with OpenFOAM (an open-source CFD software) which can transfer its matrices to PETSc to use its linear solvers. 
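For reference, the coupling goes through the petsc4Foam / external-solver module, so the PETSc solver is picked per field in OpenFOAM's fvSolution dictionary. A minimal illustrative sketch of such an entry (this is not our actual attached file, and the option values here are only placeholders) would look roughly like:

    p
    {
        solver          petsc;

        petsc
        {
            options
            {
                ksp_type        cg;
                pc_type         bjacobi;
                sub_pc_type     icc;
            }
        }
    }
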
This has been tested and is working well on OCLF's Frontier machine. Next we are trying to use the Kokkos backend to run it on Frontier GPUs. While the OpenFOAM+PETSc+Kokkos environment built correctly on Frontier using the modules sourced (attached file bash_petsc4foam_gpu) and configuring PETSc correctly (attached file config_gpu), the GPU solve seems to take more time than the CPU solve. The PETSc run-time options we are using are attached to this email (file fvSolution_petsc_pKok_Uof). Could you please take a look and let us know if this combination of options is fine? In this approach we are trying to solve the pressure equation only on the GPUs. Thanks, Arpan Arpan Sircar R&D Associate Staff Thermal Hydraulics Group Nuclear Energy and Fuel Cycle Division Oak Ridge National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: fvSolution_petsc_pKok_Uof Type: application/octet-stream Size: 2812 bytes Desc: fvSolution_petsc_pKok_Uof URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: bash_petsc4foam_gpu Type: application/octet-stream Size: 1725 bytes Desc: bash_petsc4foam_gpu URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: config_gpu Type: application/octet-stream Size: 895 bytes Desc: config_gpu URL: From bsmith at petsc.dev Fri Jun 14 12:47:45 2024 From: bsmith at petsc.dev (Barry Smith) Date: Fri, 14 Jun 2024 13:47:45 -0400 Subject: [petsc-users] Running PETSc with a Kokkos backend on OLCF Frontier In-Reply-To: References: Message-ID: <642149EE-B4FA-4876-9D62-217B961CE9F6@petsc.dev> Please run both the CPU solvers and GPU solvers cases with -log_view and send the two outputs. Barry > On Jun 14, 2024, at 1:35?PM, Sircar, Arpan via petsc-users wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi, > > We have been working with OpenFOAM (an open-source CFD software) which can transfer its matrices to PETSc to use its linear solvers. This has been tested and is working well on OCLF's Frontier machine. Next we are trying to use the Kokkos backend to run it on Frontier GPUs. While the OpenFOAM+PETSc+Kokkos environment built correctly on Frontier using the modules sourced (attached file bash_petsc4foam_gpu) and configuring PETSc correctly (attached file config_gpu), the GPU solve seems to take more time than the CPU solve. > > The PETSc run-time options we are using are attached to this email (file fvSolution_petsc_pKok_Uof). Could you please take a look and let us know if this combination of options is fine? In this approach we are trying to solve the pressure equation only on the GPUs. > > Thanks, > Arpan > > Arpan Sircar > R&D Associate Staff > Thermal Hydraulics Group > Nuclear Energy and Fuel Cycle Division > Oak Ridge National Laboratory > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From smm5164 at psu.edu Fri Jun 14 13:42:31 2024 From: smm5164 at psu.edu (Mcintyre, Sean Michael) Date: Fri, 14 Jun 2024 18:42:31 +0000 Subject: [petsc-users] Specify BoomerAMG aggressive coarsening interpolation type via options database Message-ID: Hi there, I'd like to try a different long-range interpolation scheme with BoomerAMG's aggressive coarsening (defaults to multipass). Is there a way to specify this via the PETSc options database? 
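For concreteness, the relevant part of what I run now looks roughly like the following; this is only a sketch, the Krylov solver choice is a placeholder, and the last option is the hypothetical one I was hoping to find rather than something that currently exists:

    -ksp_type gmres
    -pc_type hypre
    -pc_hypre_type boomeramg
    -pc_hypre_boomeramg_agg_nl 2
    -pc_hypre_boomeramg_agg_interp_type ext+i    (hypothetical - this is the option I could not find)
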
I see in the BoomerAMG documentation that the appropriate function call would be HYPRE_BoomerAMGSetInterpType. I'd prefer to do it via the options database than put it directly into my code. Adding the -help option, I don't see anything like pc_hypre_boomeramg_agg_interp_type. Could this perhaps be added if there isn't currently a way to do it? Thanks, Sean McIntyre -------------- next part -------------- An HTML attachment was scrubbed... URL: From sircara at ornl.gov Fri Jun 14 13:59:27 2024 From: sircara at ornl.gov (Sircar, Arpan) Date: Fri, 14 Jun 2024 18:59:27 +0000 Subject: [petsc-users] [EXTERNAL] Re: Running PETSc with a Kokkos backend on OLCF Frontier In-Reply-To: <642149EE-B4FA-4876-9D62-217B961CE9F6@petsc.dev> References: <642149EE-B4FA-4876-9D62-217B961CE9F6@petsc.dev> Message-ID: Hi Barry, Thanks for your prompt response. These are run with with the same PETSc solvers but the one on GPUs (log_pKok) uses mataijkokkos while the other one does not. Please let me know if you need any other information. Thanks, Arpan ________________________________ From: Barry Smith Sent: Friday, June 14, 2024 1:47 PM To: Sircar, Arpan Cc: petsc-users at mcs.anl.gov ; Gottiparthi, Kalyan Subject: [EXTERNAL] Re: [petsc-users] Running PETSc with a Kokkos backend on OLCF Frontier Please run both the CPU solvers and GPU solvers cases with -log_view and send the two outputs. Barry On Jun 14, 2024, at 1:35?PM, Sircar, Arpan via petsc-users wrote: This Message Is From an External Sender This message came from outside your organization. Hi, We have been working with OpenFOAM (an open-source CFD software) which can transfer its matrices to PETSc to use its linear solvers. This has been tested and is working well on OCLF's Frontier machine. Next we are trying to use the Kokkos backend to run it on Frontier GPUs. While the OpenFOAM+PETSc+Kokkos environment built correctly on Frontier using the modules sourced (attached file bash_petsc4foam_gpu) and configuring PETSc correctly (attached file config_gpu), the GPU solve seems to take more time than the CPU solve. The PETSc run-time options we are using are attached to this email (file fvSolution_petsc_pKok_Uof). Could you please take a look and let us know if this combination of options is fine? In this approach we are trying to solve the pressure equation only on the GPUs. Thanks, Arpan Arpan Sircar R&D Associate Staff Thermal Hydraulics Group Nuclear Energy and Fuel Cycle Division Oak Ridge National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: log_pKok Type: application/octet-stream Size: 871113 bytes Desc: log_pKok URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: log_pPet Type: application/octet-stream Size: 870264 bytes Desc: log_pPet URL: From pierre at joliv.et Fri Jun 14 14:05:19 2024 From: pierre at joliv.et (Pierre Jolivet) Date: Fri, 14 Jun 2024 21:05:19 +0200 Subject: [petsc-users] Specify BoomerAMG aggressive coarsening interpolation type via options database In-Reply-To: References: Message-ID: > On 14 Jun 2024, at 8:42?PM, Mcintyre, Sean Michael wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi there, > > I'd like to try a different long-range interpolation scheme with BoomerAMG's aggressive coarsening (defaults to multipass). Is there a way to specify this via the PETSc options database? 
I see in the BoomerAMG documentation that the appropriate function call would be HYPRE_BoomerAMGSetInterpType. I'd prefer to do it via the options database than put it directly into my code. Adding the -help option, I don't see anything like pc_hypre_boomeramg_agg_interp_type. Could this perhaps be added if there isn't currently a way to do it? It?s there already. $ ./ex1 -pc_type hypre -help|grep multipass -pc_hypre_boomeramg_interp_type: Interpolation type (choose one of) classical direct multipass multipass-wts ext+i ext+i-cc standard standard-wts block block-wtd FF FF1 ext ad-wts ext-mm ext+i-mm ext+e-mm (None) Thanks, Pierre > Thanks, > Sean McIntyre -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Fri Jun 14 14:16:20 2024 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Fri, 14 Jun 2024 14:16:20 -0500 Subject: [petsc-users] [EXTERNAL] Re: Running PETSc with a Kokkos backend on OLCF Frontier In-Reply-To: References: <642149EE-B4FA-4876-9D62-217B961CE9F6@petsc.dev> Message-ID: Arpan, Did you add -log_view ? --Junchao Zhang On Fri, Jun 14, 2024 at 2:00?PM Sircar, Arpan via petsc-users < petsc-users at mcs.anl.gov> wrote: > Hi Barry, Thanks for your prompt response. These are run with with the > same PETSc solvers but the one on GPUs (log_pKok) uses mataijkokkos while > the other one does not. Please let me know if you need any other > information. Thanks, Arpan From: > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Hi Barry, > > Thanks for your prompt response. These are run with with the same PETSc > solvers but the one on GPUs (log_pKok) uses mataijkokkos while the other > one does not. > > Please let me know if you need any other information. > > Thanks, > Arpan > ------------------------------ > *From:* Barry Smith > *Sent:* Friday, June 14, 2024 1:47 PM > *To:* Sircar, Arpan > *Cc:* petsc-users at mcs.anl.gov ; Gottiparthi, > Kalyan > *Subject:* [EXTERNAL] Re: [petsc-users] Running PETSc with a Kokkos > backend on OLCF Frontier > > > Please run both the CPU solvers and GPU solvers cases with -log_view > and send the two outputs. > > Barry > > > On Jun 14, 2024, at 1:35?PM, Sircar, Arpan via petsc-users < > petsc-users at mcs.anl.gov> wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi, > > We have been working with OpenFOAM (an open-source CFD software) which can > transfer its matrices to PETSc to use its linear solvers. This has been > tested and is working well on OCLF's Frontier machine. Next we are trying > to use the Kokkos backend to run it on Frontier GPUs. While the > OpenFOAM+PETSc+Kokkos environment built correctly on Frontier using the > modules sourced (attached file *bash_petsc4foam_gpu*) and configuring > PETSc correctly (attached file *config_gpu*), the GPU solve seems to take > more time than the CPU solve. > > The PETSc run-time options we are using are attached to this email (file > *fvSolution_petsc_pKok_Uof*). Could you please take a look and let us > know if this combination of options is fine? In this approach we are trying > to solve the pressure equation only on the GPUs. > > Thanks, > Arpan > > *Arpan Sircar* > R&D Associate Staff > Thermal Hydraulics Group > Nuclear Energy and Fuel Cycle Division > *Oak Ridge National Laboratory* > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From pierre at joliv.et Fri Jun 14 14:20:14 2024 From: pierre at joliv.et (Pierre Jolivet) Date: Fri, 14 Jun 2024 21:20:14 +0200 Subject: [petsc-users] Specify BoomerAMG aggressive coarsening interpolation type via options database In-Reply-To: References: Message-ID: <402F7522-5ABE-46DF-BFA8-1D1D25B5F514@joliv.et> > On 14 Jun 2024, at 9:15?PM, Mcintyre, Sean Michael wrote: > > Pierre, > > That only specifies it for AMG levels that are not coarsened aggressively. Aggressively coarsened levels end up with the default multipass long-range interpolation. My bad, in your initial message, you mentioned the HYPRE_BoomerAMGSetInterpType() API, but I guess you meant to write HYPRE_BoomerAMGSetAggInterpType()? This is indeed not interfaced right now, but you could copy what?s currently done for -pc_hypre_boomeramg_interp_type and submit an MR. Thanks, Pierre > Thanks, > Sean > From: Pierre Jolivet > > Sent: Friday, June 14, 2024 3:05 PM > To: Mcintyre, Sean Michael > > Cc: petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] Specify BoomerAMG aggressive coarsening interpolation type via options database > > You don't often get email from pierre at joliv.et . Learn why this is important > > >> On 14 Jun 2024, at 8:42?PM, Mcintyre, Sean Michael > wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. >> Hi there, >> >> I'd like to try a different long-range interpolation scheme with BoomerAMG's aggressive coarsening (defaults to multipass). Is there a way to specify this via the PETSc options database? I see in the BoomerAMG documentation that the appropriate function call would be HYPRE_BoomerAMGSetInterpType. I'd prefer to do it via the options database than put it directly into my code. Adding the -help option, I don't see anything like pc_hypre_boomeramg_agg_interp_type. Could this perhaps be added if there isn't currently a way to do it? > > It?s there already. > $ ./ex1 -pc_type hypre -help|grep multipass > -pc_hypre_boomeramg_interp_type: Interpolation type (choose one of) classical direct multipass multipass-wts ext+i ext+i-cc standard standard-wts block block-wtd FF FF1 ext ad-wts ext-mm ext+i-mm ext+e-mm (None) > > Thanks, > Pierre > >> Thanks, >> Sean McIntyre -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Fri Jun 14 14:22:21 2024 From: bsmith at petsc.dev (Barry Smith) Date: Fri, 14 Jun 2024 15:22:21 -0400 Subject: [petsc-users] [EXTERNAL] Running PETSc with a Kokkos backend on OLCF Frontier In-Reply-To: References: <642149EE-B4FA-4876-9D62-217B961CE9F6@petsc.dev> Message-ID: We need the output when run with -log_view to see where the time is being spent and what communication between the CPU and GPU is occurring. Barry > On Jun 14, 2024, at 2:59?PM, Sircar, Arpan wrote: > > Hi Barry, > > Thanks for your prompt response. These are run with with the same PETSc solvers but the one on GPUs (log_pKok) uses mataijkokkos while the other one does not. > > Please let me know if you need any other information. > > Thanks, > Arpan > From: Barry Smith > > Sent: Friday, June 14, 2024 1:47 PM > To: Sircar, Arpan > > Cc: petsc-users at mcs.anl.gov >; Gottiparthi, Kalyan > > Subject: [EXTERNAL] Re: [petsc-users] Running PETSc with a Kokkos backend on OLCF Frontier > > > Please run both the CPU solvers and GPU solvers cases with -log_view and send the two outputs. 
> > Barry > > >> On Jun 14, 2024, at 1:35?PM, Sircar, Arpan via petsc-users > wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. >> Hi, >> >> We have been working with OpenFOAM (an open-source CFD software) which can transfer its matrices to PETSc to use its linear solvers. This has been tested and is working well on OCLF's Frontier machine. Next we are trying to use the Kokkos backend to run it on Frontier GPUs. While the OpenFOAM+PETSc+Kokkos environment built correctly on Frontier using the modules sourced (attached file bash_petsc4foam_gpu) and configuring PETSc correctly (attached file config_gpu), the GPU solve seems to take more time than the CPU solve. >> >> The PETSc run-time options we are using are attached to this email (file fvSolution_petsc_pKok_Uof). Could you please take a look and let us know if this combination of options is fine? In this approach we are trying to solve the pressure equation only on the GPUs. >> >> Thanks, >> Arpan >> >> Arpan Sircar >> R&D Associate Staff >> Thermal Hydraulics Group >> Nuclear Energy and Fuel Cycle Division >> Oak Ridge National Laboratory >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From smm5164 at psu.edu Fri Jun 14 14:22:46 2024 From: smm5164 at psu.edu (Mcintyre, Sean Michael) Date: Fri, 14 Jun 2024 19:22:46 +0000 Subject: [petsc-users] Specify BoomerAMG aggressive coarsening interpolation type via options database In-Reply-To: <402F7522-5ABE-46DF-BFA8-1D1D25B5F514@joliv.et> References: <402F7522-5ABE-46DF-BFA8-1D1D25B5F514@joliv.et> Message-ID: I must have copied the wrong line out of the BoomerAMG website. Oops! I'll probably do that, then. Thanks, Sean ________________________________ From: Pierre Jolivet Sent: Friday, June 14, 2024 3:20 PM To: Mcintyre, Sean Michael Cc: petsc-users Subject: Re: [petsc-users] Specify BoomerAMG aggressive coarsening interpolation type via options database You don't often get email from pierre at joliv.et. Learn why this is important On 14 Jun 2024, at 9:15?PM, Mcintyre, Sean Michael wrote: Pierre, That only specifies it for AMG levels that are not coarsened aggressively. Aggressively coarsened levels end up with the default multipass long-range interpolation. My bad, in your initial message, you mentioned the HYPRE_BoomerAMGSetInterpType() API, but I guess you meant to write HYPRE_BoomerAMGSetAggInterpType()? This is indeed not interfaced right now, but you could copy what?s currently done for -pc_hypre_boomeramg_interp_type and submit an MR. Thanks, Pierre Thanks, Sean ________________________________ From: Pierre Jolivet > Sent: Friday, June 14, 2024 3:05 PM To: Mcintyre, Sean Michael > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] Specify BoomerAMG aggressive coarsening interpolation type via options database You don't often get email from pierre at joliv.et. Learn why this is important On 14 Jun 2024, at 8:42?PM, Mcintyre, Sean Michael > wrote: This Message Is From an External Sender This message came from outside your organization. Hi there, I'd like to try a different long-range interpolation scheme with BoomerAMG's aggressive coarsening (defaults to multipass). Is there a way to specify this via the PETSc options database? I see in the BoomerAMG documentation that the appropriate function call would be HYPRE_BoomerAMGSetInterpType. I'd prefer to do it via the options database than put it directly into my code. 
Adding the -help option, I don't see anything like pc_hypre_boomeramg_agg_interp_type. Could this perhaps be added if there isn't currently a way to do it? It?s there already. $ ./ex1 -pc_type hypre -help|grep multipass -pc_hypre_boomeramg_interp_type: Interpolation type (choose one of) classical direct multipass multipass-wts ext+i ext+i-cc standard standard-wts block block-wtd FF FF1 ext ad-wts ext-mm ext+i-mm ext+e-mm (None) Thanks, Pierre Thanks, Sean McIntyre -------------- next part -------------- An HTML attachment was scrubbed... URL: From sircara at ornl.gov Fri Jun 14 14:44:51 2024 From: sircara at ornl.gov (Sircar, Arpan) Date: Fri, 14 Jun 2024 19:44:51 +0000 Subject: [petsc-users] [EXTERNAL] Re: Running PETSc with a Kokkos backend on OLCF Frontier In-Reply-To: References: <642149EE-B4FA-4876-9D62-217B961CE9F6@petsc.dev> Message-ID: Hi Junchao and Barry, I tried adding -log_view to my OpenFOAM command since that is what I am running the entire package through. However, it does not recognize that as a valid option. I am not sure where to put that tag in this setup. I am however using a Kokkos profiling tool, the output of which (for the GPU run) is attached with this email. Do let me know if this is useful or if you have ideas of where to put the -log_view tag. Junchao - Great nice running into you here. Thanks, Arpan ________________________________ From: Junchao Zhang Sent: Friday, June 14, 2024 3:16 PM To: Sircar, Arpan Cc: Barry Smith ; petsc-users at mcs.anl.gov ; Gottiparthi, Kalyan Subject: Re: [petsc-users] [EXTERNAL] Re: Running PETSc with a Kokkos backend on OLCF Frontier Arpan, Did you add -log_view ? --Junchao Zhang On Fri, Jun 14, 2024 at 2:00?PM Sircar, Arpan via petsc-users > wrote: Hi Barry, Thanks for your prompt response. These are run with with the same PETSc solvers but the one on GPUs (log_pKok) uses mataijkokkos while the other one does not. Please let me know if you need any other information. Thanks, Arpan From:? ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Hi Barry, Thanks for your prompt response. These are run with with the same PETSc solvers but the one on GPUs (log_pKok) uses mataijkokkos while the other one does not. Please let me know if you need any other information. Thanks, Arpan ________________________________ From: Barry Smith > Sent: Friday, June 14, 2024 1:47 PM To: Sircar, Arpan > Cc: petsc-users at mcs.anl.gov >; Gottiparthi, Kalyan > Subject: [EXTERNAL] Re: [petsc-users] Running PETSc with a Kokkos backend on OLCF Frontier Please run both the CPU solvers and GPU solvers cases with -log_view and send the two outputs. Barry On Jun 14, 2024, at 1:35?PM, Sircar, Arpan via petsc-users > wrote: This Message Is From an External Sender This message came from outside your organization. Hi, We have been working with OpenFOAM (an open-source CFD software) which can transfer its matrices to PETSc to use its linear solvers. This has been tested and is working well on OCLF's Frontier machine. Next we are trying to use the Kokkos backend to run it on Frontier GPUs. While the OpenFOAM+PETSc+Kokkos environment built correctly on Frontier using the modules sourced (attached file bash_petsc4foam_gpu) and configuring PETSc correctly (attached file config_gpu), the GPU solve seems to take more time than the CPU solve. The PETSc run-time options we are using are attached to this email (file fvSolution_petsc_pKok_Uof). 
Could you please take a look and let us know if this combination of options is fine? In this approach we are trying to solve the pressure equation only on the GPUs. Thanks, Arpan Arpan Sircar R&D Associate Staff Thermal Hydraulics Group Nuclear Energy and Fuel Cycle Division Oak Ridge National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: log.kok Type: application/octet-stream Size: 4892 bytes Desc: log.kok URL: From junchao.zhang at gmail.com Fri Jun 14 14:58:22 2024 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Fri, 14 Jun 2024 14:58:22 -0500 Subject: [petsc-users] [EXTERNAL] Re: Running PETSc with a Kokkos backend on OLCF Frontier In-Reply-To: References: <642149EE-B4FA-4876-9D62-217B961CE9F6@petsc.dev> Message-ID: Arpan, Nice to meet you. -log_view in a petsc option, so I think you can add it to your fvSolution_petsc_pKok_Uof at location like mat_type mpiaijkokkos; vec_type kokkos; log_view; --Junchao Zhang On Fri, Jun 14, 2024 at 2:44?PM Sircar, Arpan wrote: > Hi Junchao and Barry, > > I tried adding -log_view to my OpenFOAM command since that is what I am > running the entire package through. However, it does not recognize that as > a valid option. I am not sure where to put that tag in this setup. > I am however using a Kokkos profiling tool, the output of which (for the > GPU run) is attached with this email. Do let me know if this is useful or > if you have ideas of where to put the -log_view tag. > > Junchao - Great nice running into you here. > > Thanks, > Arpan > ------------------------------ > *From:* Junchao Zhang > *Sent:* Friday, June 14, 2024 3:16 PM > *To:* Sircar, Arpan > *Cc:* Barry Smith ; petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov>; Gottiparthi, Kalyan > *Subject:* Re: [petsc-users] [EXTERNAL] Re: Running PETSc with a Kokkos > backend on OLCF Frontier > > Arpan, > Did you add -log_view ? > > --Junchao Zhang > > > On Fri, Jun 14, 2024 at 2:00?PM Sircar, Arpan via petsc-users < > petsc-users at mcs.anl.gov> wrote: > > Hi Barry, Thanks for your prompt response. These are run with with the > same PETSc solvers but the one on GPUs (log_pKok) uses mataijkokkos while > the other one does not. Please let me know if you need any other > information. Thanks, Arpan From: > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Hi Barry, > > Thanks for your prompt response. These are run with with the same PETSc > solvers but the one on GPUs (log_pKok) uses mataijkokkos while the other > one does not. > > Please let me know if you need any other information. > > Thanks, > Arpan > ------------------------------ > *From:* Barry Smith > *Sent:* Friday, June 14, 2024 1:47 PM > *To:* Sircar, Arpan > *Cc:* petsc-users at mcs.anl.gov ; Gottiparthi, > Kalyan > *Subject:* [EXTERNAL] Re: [petsc-users] Running PETSc with a Kokkos > backend on OLCF Frontier > > > Please run both the CPU solvers and GPU solvers cases with -log_view > and send the two outputs. > > Barry > > > On Jun 14, 2024, at 1:35?PM, Sircar, Arpan via petsc-users < > petsc-users at mcs.anl.gov> wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi, > > We have been working with OpenFOAM (an open-source CFD software) which can > transfer its matrices to PETSc to use its linear solvers. 
This has been > tested and is working well on OCLF's Frontier machine. Next we are trying > to use the Kokkos backend to run it on Frontier GPUs. While the > OpenFOAM+PETSc+Kokkos environment built correctly on Frontier using the > modules sourced (attached file *bash_petsc4foam_gpu*) and configuring > PETSc correctly (attached file *config_gpu*), the GPU solve seems to take > more time than the CPU solve. > > The PETSc run-time options we are using are attached to this email (file > *fvSolution_petsc_pKok_Uof*). Could you please take a look and let us > know if this combination of options is fine? In this approach we are trying > to solve the pressure equation only on the GPUs. > > Thanks, > Arpan > > *Arpan Sircar* > R&D Associate Staff > Thermal Hydraulics Group > Nuclear Energy and Fuel Cycle Division > *Oak Ridge National Laboratory* > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Fri Jun 14 15:01:00 2024 From: bsmith at petsc.dev (Barry Smith) Date: Fri, 14 Jun 2024 16:01:00 -0400 Subject: [petsc-users] [EXTERNAL] Re: Running PETSc with a Kokkos backend on OLCF Frontier In-Reply-To: References: <642149EE-B4FA-4876-9D62-217B961CE9F6@petsc.dev> Message-ID: Are you using https://urldefense.us/v3/__https://develop.openfoam.com/modules/external-solver__;!!G_uCfscf7eWS!YgXDdUKnfJdPEbKAQzR8TYlQzciWTcMmU_6jrKexdBB6UajcOHUvJhfaBjBtMN2cX6EUFDrcF4LcFItaSLxbAb0$ If so you might be able to add it like petsc { options { log_view; ksp_type cg; pc_type bjacobi; sub_pc_type icc; } as discussed on the webpage for setting PETSc solver options > On Jun 14, 2024, at 3:44?PM, Sircar, Arpan wrote: > > Hi Junchao and Barry, > > I tried adding -log_view to my OpenFOAM command since that is what I am running the entire package through. However, it does not recognize that as a valid option. I am not sure where to put that tag in this setup. > I am however using a Kokkos profiling tool, the output of which (for the GPU run) is attached with this email. Do let me know if this is useful or if you have ideas of where to put the -log_view tag. > > Junchao - Great nice running into you here. > > Thanks, > Arpan > From: Junchao Zhang > > Sent: Friday, June 14, 2024 3:16 PM > To: Sircar, Arpan > > Cc: Barry Smith >; petsc-users at mcs.anl.gov >; Gottiparthi, Kalyan > > Subject: Re: [petsc-users] [EXTERNAL] Re: Running PETSc with a Kokkos backend on OLCF Frontier > > Arpan, > Did you add -log_view ? > > --Junchao Zhang > > > On Fri, Jun 14, 2024 at 2:00?PM Sircar, Arpan via petsc-users > wrote: > This Message Is From an External Sender > This message came from outside your organization. > > Hi Barry, > > Thanks for your prompt response. These are run with with the same PETSc solvers but the one on GPUs (log_pKok) uses mataijkokkos while the other one does not. > > Please let me know if you need any other information. > > Thanks, > Arpan > From: Barry Smith > > Sent: Friday, June 14, 2024 1:47 PM > To: Sircar, Arpan > > Cc: petsc-users at mcs.anl.gov >; Gottiparthi, Kalyan > > Subject: [EXTERNAL] Re: [petsc-users] Running PETSc with a Kokkos backend on OLCF Frontier > > > Please run both the CPU solvers and GPU solvers cases with -log_view and send the two outputs. > > Barry > > >> On Jun 14, 2024, at 1:35?PM, Sircar, Arpan via petsc-users > wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. 
>> Hi, >> >> We have been working with OpenFOAM (an open-source CFD software) which can transfer its matrices to PETSc to use its linear solvers. This has been tested and is working well on OCLF's Frontier machine. Next we are trying to use the Kokkos backend to run it on Frontier GPUs. While the OpenFOAM+PETSc+Kokkos environment built correctly on Frontier using the modules sourced (attached file bash_petsc4foam_gpu) and configuring PETSc correctly (attached file config_gpu), the GPU solve seems to take more time than the CPU solve. >> >> The PETSc run-time options we are using are attached to this email (file fvSolution_petsc_pKok_Uof). Could you please take a look and let us know if this combination of options is fine? In this approach we are trying to solve the pressure equation only on the GPUs. >> >> Thanks, >> Arpan >> >> Arpan Sircar >> R&D Associate Staff >> Thermal Hydraulics Group >> Nuclear Energy and Fuel Cycle Division >> Oak Ridge National Laboratory >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.zampini at gmail.com Fri Jun 14 15:06:16 2024 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Fri, 14 Jun 2024 23:06:16 +0300 Subject: [petsc-users] [EXTERNAL] Re: Running PETSc with a Kokkos backend on OLCF Frontier In-Reply-To: References: <642149EE-B4FA-4876-9D62-217B961CE9F6@petsc.dev> Message-ID: These options are prefixed by the equation name, and it won't work. You should add the option via the environment variable PETSC_OPTIONS="-log_view" On Fri, Jun 14, 2024, 23:01 Barry Smith wrote: > Are you using https: //develop. openfoam. com/modules/external-solver If > so you might be able to add it like petsc { options { log_view; ksp_type > cg; pc_type bjacobi; sub_pc_type icc; } as discussed on the webpage for > setting PETSc solver optionsOn > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > > Are you using https://urldefense.us/v3/__https://develop.openfoam.com/modules/external-solver__;!!G_uCfscf7eWS!avqfqzhEZD_pucu7YsUcVV0ZJNb6Gaq5y7vItF1Yx2FXvC9s_R6nGKhtj1kvYWaVbBa2sHt5_Sm99fJknpUa2ErVWHwP6nE$ > > > If so you might be able to add it like > > petsc > { > options > { > > log_view; > ksp_type cg; > pc_type bjacobi; > sub_pc_type icc; > } > > as discussed on the webpage for setting PETSc solver options > > > > On Jun 14, 2024, at 3:44?PM, Sircar, Arpan wrote: > > Hi Junchao and Barry, > > I tried adding -log_view to my OpenFOAM command since that is what I am > running the entire package through. However, it does not recognize that as > a valid option. I am not sure where to put that tag in this setup. > I am however using a Kokkos profiling tool, the output of which (for the > GPU run) is attached with this email. Do let me know if this is useful or > if you have ideas of where to put the -log_view tag. > > Junchao - Great nice running into you here. > > Thanks, > Arpan > ------------------------------ > *From:* Junchao Zhang > *Sent:* Friday, June 14, 2024 3:16 PM > *To:* Sircar, Arpan > *Cc:* Barry Smith ; petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov>; Gottiparthi, Kalyan > *Subject:* Re: [petsc-users] [EXTERNAL] Re: Running PETSc with a Kokkos > backend on OLCF Frontier > > Arpan, > Did you add -log_view ? 
> > --Junchao Zhang > > > On Fri, Jun 14, 2024 at 2:00?PM Sircar, Arpan via petsc-users < > petsc-users at mcs.anl.gov> wrote: > > This Message Is From an External Sender > This message came from outside your organization. > > Hi Barry, > > Thanks for your prompt response. These are run with with the same PETSc > solvers but the one on GPUs (log_pKok) uses mataijkokkos while the other > one does not. > > Please let me know if you need any other information. > > Thanks, > Arpan > ------------------------------ > *From:* Barry Smith > *Sent:* Friday, June 14, 2024 1:47 PM > *To:* Sircar, Arpan > *Cc:* petsc-users at mcs.anl.gov ; Gottiparthi, > Kalyan > *Subject:* [EXTERNAL] Re: [petsc-users] Running PETSc with a Kokkos > backend on OLCF Frontier > > > Please run both the CPU solvers and GPU solvers cases with -log_view > and send the two outputs. > > Barry > > > On Jun 14, 2024, at 1:35?PM, Sircar, Arpan via petsc-users < > petsc-users at mcs.anl.gov> wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi, > > We have been working with OpenFOAM (an open-source CFD software) which can > transfer its matrices to PETSc to use its linear solvers. This has been > tested and is working well on OCLF's Frontier machine. Next we are trying > to use the Kokkos backend to run it on Frontier GPUs. While the > OpenFOAM+PETSc+Kokkos environment built correctly on Frontier using the > modules sourced (attached file *bash_petsc4foam_gpu*) and configuring > PETSc correctly (attached file *config_gpu*), the GPU solve seems to take > more time than the CPU solve. > > The PETSc run-time options we are using are attached to this email (file > *fvSolution_petsc_pKok_Uof*). Could you please take a look and let us > know if this combination of options is fine? In this approach we are trying > to solve the pressure equation only on the GPUs. > > Thanks, > Arpan > > *Arpan Sircar* > R&D Associate Staff > Thermal Hydraulics Group > Nuclear Energy and Fuel Cycle Division > *Oak Ridge National Laboratory* > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yongzhong.li at mail.utoronto.ca Thu Jun 20 22:39:37 2024 From: yongzhong.li at mail.utoronto.ca (Yongzhong Li) Date: Fri, 21 Jun 2024 03:39:37 +0000 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> Message-ID: Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, // Static variable to keep track of the stage counter static int stageCounter = 1; // Generate a unique stage name std::ostringstream oss; oss << "Stage " << stageCounter << " of Code"; std::string stageName = oss.str(); // Register the stage PetscLogStage stagenum; PetscLogStageRegister(stageName.c_str(), &stagenum); PetscLogStagePush(stagenum); KSPSolve(*ksp_ptr, b, x); PetscLogStagePop(); stageCounter++; I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. 
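For completeness, the thread counts in these experiments are controlled in the usual MKL/OpenMP way, roughly as below; this is only a sketch, the executable name and the solver options are placeholders, and the thread numbers change from run to run:

    export OMP_NUM_THREADS=16
    export MKL_NUM_THREADS=16
    ./my_solver -ksp_type gmres -ksp_rtol 1e-8 -log_view
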
I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. >From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. My questions is, >From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? Thank you, Yongzhong From: Barry Smith Date: Friday, June 14, 2024 at 11:36?AM To: Yongzhong Li Cc: petsc-users at mcs.anl.gov , petsc-maint at mcs.anl.gov , Piero Triverio Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue I am a bit confused. Without the initial guess computation, there are still a bunch of events I don't understand MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 Finally there are a huge number of MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. Barry On Jun 14, 2024, at 1:19?AM, Yongzhong Li wrote: Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically KSPGuess Object: 1 MPI process type: fischer Model 1, size 200 However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. 
Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? Thank you! Yongzhong From: Barry Smith > Date: Thursday, June 13, 2024 at 2:14?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? Thanks Barry On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Matt, I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. Thanks! Yongzhong From: Matthew Knepley > Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. 
Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!ZX0g5Rah2QayJiJpCO1eqen0Vf8-qL3bZcj8rdMOqqzQ4AeVbAlN6SGuiOE2X9iQjiCwj1fF1pdPGN_Afl53SdyYP93-Fwo6_H0$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZX0g5Rah2QayJiJpCO1eqen0Vf8-qL3bZcj8rdMOqqzQ4AeVbAlN6SGuiOE2X9iQjiCwj1fF1pdPGN_Afl53SdyYP93-Q9gBEl4$ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: PETSc Performance Summary.txt URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Parallel Performamce for KSPSolve.pdf Type: application/pdf Size: 246037 bytes Desc: Parallel Performamce for KSPSolve.pdf URL: From junchao.zhang at gmail.com Thu Jun 20 23:42:53 2024 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Thu, 20 Jun 2024 23:42:53 -0500 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> Message-ID: I remember there are some MKL env vars to print MKL routines called. Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up --Junchao Zhang On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li wrote: > Hi Barry, sorry for my last results. I didn?t fully understand the stage > profiling and logging in PETSc, now I only record KSPSolve() stage of my > program. Some sample codes are as follow, // Static variable to keep track > of the stage counter > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > > Hi Barry, sorry for my last results. I didn?t fully understand the stage > profiling and logging in PETSc, now I only record KSPSolve() stage of my > program. Some sample codes are as follow, > > // Static variable to keep track of the stage counter > > static int stageCounter = 1; > > > > // Generate a unique stage name > > std::ostringstream oss; > > oss << "Stage " << stageCounter << " of Code"; > > std::string stageName = oss.str(); > > > > // Register the stage > > PetscLogStage stagenum; > > > > PetscLogStageRegister(stageName.c_str(), &stagenum); > > PetscLogStagePush(stagenum); > > > > * KSPSolve(*ksp_ptr, b, x);* > > > > PetscLogStagePop(); > > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other > stages where each one is KSPSolve() call. > > To provide some additional backgrounds, if you recall, I have been trying > to get efficient iterative solution using multithreading. I found out by > compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to > perform sparse matrix-vector multiplication faster, I am using > MATSEQAIJMKL. This makes the shell matrix vector product in each iteration > scale well with the #of threads. However, I found out the total GMERS solve > time (~KSPSolve() time) is not scaling well the #of threads. 
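(Side note on the MATSEQAIJMKL format mentioned just above: a minimal sketch of how an assembled AIJ matrix is typically switched over to it. The variable name A is a placeholder here, not taken from the code in this thread.)

Mat Amkl;
PetscCall(MatConvert(A, MATSEQAIJMKL, MAT_INITIAL_MATRIX, &Amkl)); /* A is an existing MATSEQAIJ matrix */

/* or, if the matrix is created through MatSetFromOptions(), select the type at run time: */
/*   ./xx -mat_type seqaijmkl */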
> > From the logging results I learned that when performing KSPSolve(), there > are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs > using different number of threads and plotted the time consumption for > PCApply() and KSPGMERSOrthog() against #of thread. I found out these two > operations are not scaling with the threads at all! My results are attached > as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, > KSPGMERSOrthog() will have many vector operations, so why these two parts > can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > > > *From: *Barry Smith > *Date: *Friday, June 14, 2024 at 11:36?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > I am a bit confused. Without the initial guess computation, there are > still a bunch of events I don't understand > > > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are > each done the same number of times) > > > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > > > Finally there are a huge number of > > > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > > > Are you making calls to all these routines? Are you doing this inside your > MatMult() or before you call KSPSolve? > > > > The reason I wanted you to make a simpler run without the initial guess > code is that your events are far more complicated than would be produced by > GMRES alone so it is not possible to understand the behavior you are seeing > without fully understanding all the events happening in the code. > > > > Barry > > > > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > > > Thanks, I have attached the results without using any KSPGuess. At low > frequency, the iteration steps are quite close to the one with KSPGuess, > specifically > > KSPGuess Object: 1 MPI process > > type: fischer > > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are > significant higher than the one with KSPGuess, I have attahced both of the > results for your reference. 
> > Moreover, could I ask why the one without the KSPGuess options can be used > for a baseline comparsion? What are we comparing here? How does it relate > to the performance issue/bottleneck I found? ?*I have noticed that the > time taken by **KSPSolve is **almost two times greater than the CPU time > for matrix-vector product multiplied by the number of iteration*? > > Thank you! > Yongzhong > > > > *From: *Barry Smith > *Date: *Thursday, June 13, 2024 at 2:14?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > Can you please run the same thing without the KSPGuess option(s) for a > baseline comparison? > > > > Thanks > > > > Barry > > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hi Matt, > > I have rerun the program with the keys you provided. The system output > when performing ksp solve and the final petsc log output were stored in a > .txt file attached for your reference. > > Thanks! > Yongzhong > > > > *From: *Matthew Knepley > *Date: *Wednesday, June 12, 2024 at 6:46?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > ????????? knepley at gmail.com ????????????????? > > > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Dear PETSc?s developers, I hope this email finds you well. I am currently > working on a project using PETSc and have encountered a performance issue > with the KSPSolve function. Specifically, I have noticed that the time > taken by KSPSolve is > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > Dear PETSc?s developers, > > I hope this email finds you well. > > I am currently working on a project using PETSc and have encountered a > performance issue with the KSPSolve function. Specifically, *I have > noticed that the time taken by **KSPSolve is **almost two times greater > than the CPU time for matrix-vector product multiplied by the number of > iteration steps*. I use C++ chrono to record CPU time. > > For context, I am using a shell system matrix A. Despite my efforts to > parallelize the matrix-vector product (Ax), the overall solve time > remains higher than the matrix vector product per iteration indicates > when multiple threads were used. Here are a few details of my setup: > > - *Matrix Type*: Shell system matrix > - *Preconditioner*: Shell PC > - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK > library, multithreading is enabled > > I have considered several potential reasons, such as preconditioner setup, > additional solver operations, and the inherent overhead of using a shell > system matrix. *However, since KSPSolve is a high-level API, I have been > unable to pinpoint the exact cause of the increased solve time.* > > Have you observed the same issue? Could you please provide some experience > on how to diagnose and address this performance discrepancy? Any > insights or recommendations you could offer would be greatly appreciated. 
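(For readers following the setup bullets above: a bare-bones sketch of how a shell system matrix and a shell preconditioner are typically wired up in PETSc. The callback names MyMatMult/MyPCApply, the size n, and the context pointer ctx are hypothetical; the actual application code discussed in this thread is not shown here.)

static PetscErrorCode MyMatMult(Mat A, Vec x, Vec y)
{
  PetscFunctionBeginUser;
  /* user's (threaded) matrix-vector product goes here */
  PetscFunctionReturn(PETSC_SUCCESS);
}

static PetscErrorCode MyPCApply(PC pc, Vec r, Vec z)
{
  PetscFunctionBeginUser;
  /* user's preconditioner application goes here */
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* ... later, in the setup code ... */
Mat A;
KSP ksp;
PC  pc;
PetscCall(MatCreateShell(PETSC_COMM_SELF, n, n, n, n, ctx, &A));
PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MyMatMult));
PetscCall(KSPCreate(PETSC_COMM_SELF, &ksp));
PetscCall(KSPSetOperators(ksp, A, A));
PetscCall(KSPGetPC(ksp, &pc));
PetscCall(PCSetType(pc, PCSHELL));
PetscCall(PCShellSetContext(pc, ctx));
PetscCall(PCShellSetApply(pc, MyPCApply));
PetscCall(KSPSetFromOptions(ksp));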
> > > > For any performance question like this, we need to see the output of your > code run with > > > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > > > Thanks, > > > > Matt > > > > Thank you for your time and assistance. > > Best regards, > > Yongzhong > > ----------------------------------------------------------- > > *Yongzhong Li* > > PhD student | Electromagnetics Group > > Department of Electrical & Computer Engineering > > University of Toronto > > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!bLhD-VVLlAIX2LZGI3Xm13B5A9pPbt00el688AkMFtdLD_BKccqXIOS7Byytn1S4bRlVOvFchfDsvWuIOLHUgubsxH1t$ > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bLhD-VVLlAIX2LZGI3Xm13B5A9pPbt00el688AkMFtdLD_BKccqXIOS7Byytn1S4bRlVOvFchfDsvWuIOLHUgipHgofj$ > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre at joliv.et Fri Jun 21 00:36:27 2024 From: pierre at joliv.et (Pierre Jolivet) Date: Fri, 21 Jun 2024 07:36:27 +0200 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> Message-ID: <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> > On 21 Jun 2024, at 6:42?AM, Junchao Zhang wrote: > > This Message Is From an External Sender > This message came from outside your organization. > I remember there are some MKL env vars to print MKL routines called. The environment variable is MKL_VERBOSE Thanks, Pierre > Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up > > --Junchao Zhang > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: >> This Message Is From an External Sender >> This message came from outside your organization. >> >> Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, >> >> // Static variable to keep track of the stage counter >> >> static int stageCounter = 1; >> >> >> >> // Generate a unique stage name >> >> std::ostringstream oss; >> >> oss << "Stage " << stageCounter << " of Code"; >> >> std::string stageName = oss.str(); >> >> >> >> // Register the stage >> >> PetscLogStage stagenum; >> >> >> >> PetscLogStageRegister(stageName.c_str(), &stagenum); >> >> PetscLogStagePush(stagenum); >> >> >> >> KSPSolve(*ksp_ptr, b, x); >> >> >> >> PetscLogStagePop(); >> >> stageCounter++; >> >> I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. >> >> To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. 
>> >> From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. >> >> My questions is, >> >> From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? >> >> Thank you, >> Yongzhong >> >> >> >> From: Barry Smith > >> Date: Friday, June 14, 2024 at 11:36?AM >> To: Yongzhong Li > >> Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > >> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >> >> >> >> I am a bit confused. Without the initial guess computation, there are still a bunch of events I don't understand >> >> >> >> MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> >> MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> >> MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> >> MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> >> MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> >> MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> >> MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 >> >> MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> >> MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> >> >> >> in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) >> >> >> >> VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 >> >> VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 >> >> >> >> Finally there are a huge number of >> >> >> >> MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 >> >> >> >> Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? >> >> >> >> The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. >> >> >> >> Barry >> >> >> >> >> >> >> On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: >> >> >> >> Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically >> >> KSPGuess Object: 1 MPI process >> >> type: fischer >> >> Model 1, size 200 >> >> However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. >> >> Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? 
What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? >> >> Thank you! >> Yongzhong >> >> >> From: Barry Smith > >> Date: Thursday, June 13, 2024 at 2:14?PM >> To: Yongzhong Li > >> Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > >> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >> >> >> >> Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? >> >> >> Thanks >> >> >> Barry >> >> >> >> On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: >> >> >> This Message Is From an External Sender >> >> This message came from outside your organization. >> >> Hi Matt, >> >> I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. >> >> Thanks! >> Yongzhong >> >> >> From: Matthew Knepley > >> Date: Wednesday, June 12, 2024 at 6:46?PM >> To: Yongzhong Li > >> Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > >> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >> >> ????????? knepley at gmail.com ????????????????? >> On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: >> >> Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is >> >> ZjQcmQRYFpfptBannerStart >> >> This Message Is From an External Sender >> >> This message came from outside your organization. >> >> >> ZjQcmQRYFpfptBannerEnd >> >> Dear PETSc?s developers, >> >> I hope this email finds you well. >> >> I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. >> >> For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: >> >> Matrix Type: Shell system matrix >> Preconditioner: Shell PC >> Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled >> I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. >> >> Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. >> >> >> >> For any performance question like this, we need to see the output of your code run with >> >> >> >> -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view >> >> >> >> Thanks, >> >> >> >> Matt >> >> >> >> Thank you for your time and assistance. 
>> >> Best regards, >> >> Yongzhong >> >> ----------------------------------------------------------- >> >> Yongzhong Li >> >> PhD student | Electromagnetics Group >> >> Department of Electrical & Computer Engineering >> >> University of Toronto >> >> https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!cxTM09LsKoYUA08P97agSWfNaQ7kgSux1FjxDwySQtW7Eg2OyUPt_464qMf8D4fDNGWVJRXvPqZTEgKvCtkt7A$ >> >> >> >> >> >> >> -- >> >> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >> -- Norbert Wiener >> >> >> >> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!cxTM09LsKoYUA08P97agSWfNaQ7kgSux1FjxDwySQtW7Eg2OyUPt_464qMf8D4fDNGWVJRXvPqZTEgISAv2xYg$ >> >> >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From yongzhong.li at mail.utoronto.ca Fri Jun 21 12:37:57 2024 From: yongzhong.li at mail.utoronto.ca (Yongzhong Li) Date: Fri, 21 Jun 2024 17:37:57 +0000 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> Message-ID: Hello all, I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? Best, Yongzhong From: Pierre Jolivet Date: Friday, June 21, 2024 at 1:36?AM To: Junchao Zhang Cc: Yongzhong Li , petsc-users at mcs.anl.gov Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? On 21 Jun 2024, at 6:42?AM, Junchao Zhang wrote: This Message Is From an External Sender This message came from outside your organization. I remember there are some MKL env vars to print MKL routines called. The environment variable is MKL_VERBOSE Thanks, Pierre Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up --Junchao Zhang On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, // Static variable to keep track of the stage counter static int stageCounter = 1; // Generate a unique stage name std::ostringstream oss; oss << "Stage " << stageCounter << " of Code"; std::string stageName = oss.str(); // Register the stage PetscLogStage stagenum; PetscLogStageRegister(stageName.c_str(), &stagenum); PetscLogStagePush(stagenum); KSPSolve(*ksp_ptr, b, x); PetscLogStagePop(); stageCounter++; I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. 
However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. >From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. My questions is, >From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? Thank you, Yongzhong From: Barry Smith > Date: Friday, June 14, 2024 at 11:36?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue I am a bit confused. Without the initial guess computation, there are still a bunch of events I don't understand MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 Finally there are a huge number of MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. Barry On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically KSPGuess Object: 1 MPI process type: fischer Model 1, size 200 However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? 
?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? Thank you! Yongzhong From: Barry Smith > Date: Thursday, June 13, 2024 at 2:14?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? Thanks Barry On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Matt, I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. Thanks! Yongzhong From: Matthew Knepley > Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. 
Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!eXBeeIXo9Yqgp2nypqwKYimLnGBZXnF4dXxgLM1UoOIO6n8nt3XlfgjVWLPWJh4UOa5NNpx-nrJb_H828XRQKUREfR2m9I1yPZo$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!eXBeeIXo9Yqgp2nypqwKYimLnGBZXnF4dXxgLM1UoOIO6n8nt3XlfgjVWLPWJh4UOa5NNpx-nrJb_H828XRQKUREfR2m9E9w4UY$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre at joliv.et Fri Jun 21 12:47:11 2024 From: pierre at joliv.et (Pierre Jolivet) Date: Fri, 21 Jun 2024 19:47:11 +0200 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> Message-ID: How do you set the variable? $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 [...] > On 21 Jun 2024, at 7:37?PM, Yongzhong Li wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hello all, > > I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? > > Best, > Yongzhong > > > > From: Pierre Jolivet > > Date: Friday, June 21, 2024 at 1:36?AM > To: Junchao Zhang > > Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > > On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > I remember there are some MKL env vars to print MKL routines called. > > The environment variable is MKL_VERBOSE > > Thanks, > Pierre > > > Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up > > --Junchao Zhang > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: > This Message Is From an External Sender > This message came from outside your organization. > > Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. 
Some sample codes are as follow, > > // Static variable to keep track of the stage counter > static int stageCounter = 1; > > // Generate a unique stage name > std::ostringstream oss; > oss << "Stage " << stageCounter << " of Code"; > std::string stageName = oss.str(); > > // Register the stage > PetscLogStage stagenum; > > PetscLogStageRegister(stageName.c_str(), &stagenum); > PetscLogStagePush(stagenum); > > KSPSolve(*ksp_ptr, b, x); > > PetscLogStagePop(); > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. > > To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. > > From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > From: Barry Smith > > Date: Friday, June 14, 2024 at 11:36?AM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > I am a bit confused. Without the initial guess computation, there are still a bunch of events I don't understand > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > Finally there are a huge number of > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > Are you making calls to all these routines? 
Are you doing this inside your MatMult() or before you call KSPSolve? > > The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. > > Barry > > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically > > KSPGuess Object: 1 MPI process > type: fischer > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? > > Thank you! > Yongzhong > > From: Barry Smith > > Date: Thursday, June 13, 2024 at 2:14?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? > > Thanks > > Barry > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Matt, > > I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. > > Thanks! > Yongzhong > > From: Matthew Knepley > > Date: Wednesday, June 12, 2024 at 6:46?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: > Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Dear PETSc?s developers, > I hope this email finds you well. > I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. > For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. 
Here are a few details of my setup: > Matrix Type: Shell system matrix > Preconditioner: Shell PC > Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled > I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. > Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. > > For any performance question like this, we need to see the output of your code run with > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > Thanks, > > Matt > > Thank you for your time and assistance. > Best regards, > Yongzhong > ----------------------------------------------------------- > Yongzhong Li > PhD student | Electromagnetics Group > Department of Electrical & Computer Engineering > University of Toronto > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!dyMF1oRvr6dKSgMF8DY1CbNZpPH1TLs6jQQPaBSD91BavByk95ynHW8SxAFI8F3BNIxHhs0HO2I4dpeIlVq2fQ$ > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!dyMF1oRvr6dKSgMF8DY1CbNZpPH1TLs6jQQPaBSD91BavByk95ynHW8SxAFI8F3BNIxHhs0HO2I4dpfRK52EeQ$ > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yongzhong.li at mail.utoronto.ca Fri Jun 21 16:03:39 2024 From: yongzhong.li at mail.utoronto.ca (Yongzhong Li) Date: Fri, 21 Jun 2024 21:03:39 +0000 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> Message-ID: I am using export MKL_VERBOSE=1 ./xx in the bash file, do I have to use - ksp_converged_reason? Thanks, Yongzhong From: Pierre Jolivet Date: Friday, June 21, 2024 at 1:47?PM To: Yongzhong Li Cc: Junchao Zhang , petsc-users at mcs.anl.gov Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? How do you set the variable? $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 [...] On 21 Jun 2024, at 7:37?PM, Yongzhong Li wrote: This Message Is From an External Sender This message came from outside your organization. Hello all, I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? 
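(On the MKL_VERBOSE question above: the variable only has to be present in the environment of the process that actually calls MKL, so either of the forms below should end up equivalent; ./xx stands for the user's executable as elsewhere in this thread.)

MKL_VERBOSE=1 ./xx          # set only for this one run
# or
export MKL_VERBOSE=1        # e.g. near the top of the bash script
./xx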
Best, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:36?AM To: Junchao Zhang > Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: This Message Is From an External Sender This message came from outside your organization. I remember there are some MKL env vars to print MKL routines called. The environment variable is MKL_VERBOSE Thanks, Pierre Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up --Junchao Zhang On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, // Static variable to keep track of the stage counter static int stageCounter = 1; // Generate a unique stage name std::ostringstream oss; oss << "Stage " << stageCounter << " of Code"; std::string stageName = oss.str(); // Register the stage PetscLogStage stagenum; PetscLogStageRegister(stageName.c_str(), &stagenum); PetscLogStagePush(stagenum); KSPSolve(*ksp_ptr, b, x); PetscLogStagePop(); stageCounter++; I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. >From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. My questions is, >From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? Thank you, Yongzhong From: Barry Smith > Date: Friday, June 14, 2024 at 11:36?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue I am a bit confused. 
Without the initial guess computation, there are still a bunch of events I don't understand MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 Finally there are a huge number of MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. Barry On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically KSPGuess Object: 1 MPI process type: fischer Model 1, size 200 However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? Thank you! Yongzhong From: Barry Smith > Date: Thursday, June 13, 2024 at 2:14?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? Thanks Barry On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Matt, I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. Thanks! 
Yongzhong From: Matthew Knepley > Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!Zturq926lMOqbcXpBsLk3Xx52E8yBtGS2wWJPvtk_j6NkJZ0ZgRKIEMEXRthhtqyrAwxtK0Glw8h6uizbw18-ioxAAn3IwK8YCI$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!Zturq926lMOqbcXpBsLk3Xx52E8yBtGS2wWJPvtk_j6NkJZ0ZgRKIEMEXRthhtqyrAwxtK0Glw8h6uizbw18-ioxAAn3WtF1k0Y$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Sat Jun 22 08:40:04 2024 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Sat, 22 Jun 2024 08:40:04 -0500 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> Message-ID: No, you don't. It is strange. 
Perhaps you can you run a petsc example first and see if MKL is really used $ cd src/mat/tests $ make ex1 $ MKL_VERBOSE=1 ./ex1 --Junchao Zhang On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li wrote: > I am using > > export MKL_VERBOSE=1 > > ./xx > > in the bash file, do I have to use - ksp_converged_reason? > > Thanks, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:47?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > How do you set the variable? > > > > $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason > > MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 > architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled > processors, Lnx 2.80GHz lp64 intel_thread > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > [...] > > > > On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hello all, > > I set MKL_VERBOSE = 1, but observed no print output specific to the use of > MKL. Does PETSc enable this verbose output? > > Best, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:36?AM > *To: *Junchao Zhang > *Cc: *Yongzhong Li , > petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > > > > > On 21 Jun 2024, at 6:42?AM, Junchao Zhang wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > I remember there are some MKL env vars to print MKL routines called. > > > > The environment variable is MKL_VERBOSE > > > > Thanks, > > Pierre > > > > Maybe we can try it to see what MKL routines are really used and then we > can understand why some petsc functions did not speed up > > > --Junchao Zhang > > > > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > Hi Barry, sorry for my last results. I didn?t fully understand the stage > profiling and logging in PETSc, now I only record KSPSolve() stage of my > program. Some sample codes are as follow, > > // Static variable to keep track of the stage counter > > static int stageCounter = 1; > > > > // Generate a unique stage name > > std::ostringstream oss; > > oss << "Stage " << stageCounter << " of Code"; > > std::string stageName = oss.str(); > > > > // Register the stage > > PetscLogStage stagenum; > > > > PetscLogStageRegister(stageName.c_str(), &stagenum); > > PetscLogStagePush(stagenum); > > > > *KSPSolve(*ksp_ptr, b, x);* > > > > PetscLogStagePop(); > > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other > stages where each one is KSPSolve() call. 
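The stage-registration snippet above is close to what the PETSc logging interface expects, but it drops all error checking. A minimal, self-contained sketch of the same idea (illustrative only, not the poster's actual code; on PETSc releases that predate the PetscCall()/PETSC_SUCCESS macros, use the older ierr = ...; CHKERRQ(ierr); pattern and return 0 instead):

    #include <petscksp.h>

    /* Register a uniquely named logging stage, run KSPSolve() inside it, and pop the
       stage again, so that -log_view reports each solve separately from the rest of
       the program. ksp, b and x are assumed to be set up by the caller. */
    static PetscErrorCode LoggedSolve(KSP ksp, Vec b, Vec x, PetscInt stageCounter)
    {
      PetscLogStage stage;
      char          name[64];

      PetscFunctionBeginUser;
      PetscCall(PetscSNPrintf(name, sizeof(name), "Stage %" PetscInt_FMT " of Code", stageCounter));
      PetscCall(PetscLogStageRegister(name, &stage));
      PetscCall(PetscLogStagePush(stage));
      PetscCall(KSPSolve(ksp, b, x));
      PetscCall(PetscLogStagePop());
      PetscFunctionReturn(PETSC_SUCCESS);
    }

Calling a helper like this in place of the inline block keeps one -log_view stage per solve while also catching any error code returned by KSPSolve().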
> > To provide some additional backgrounds, if you recall, I have been trying > to get efficient iterative solution using multithreading. I found out by > compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to > perform sparse matrix-vector multiplication faster, I am using > MATSEQAIJMKL. This makes the shell matrix vector product in each iteration > scale well with the #of threads. However, I found out the total GMERS solve > time (~KSPSolve() time) is not scaling well the #of threads. > > From the logging results I learned that when performing KSPSolve(), there > are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs > using different number of threads and plotted the time consumption for > PCApply() and KSPGMERSOrthog() against #of thread. I found out these two > operations are not scaling with the threads at all! My results are attached > as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, > KSPGMERSOrthog() will have many vector operations, so why these two parts > can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > > > *From: *Barry Smith > *Date: *Friday, June 14, 2024 at 11:36?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > I am a bit confused. Without the initial guess computation, there are > still a bunch of events I don't understand > > > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are > each done the same number of times) > > > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > > > Finally there are a huge number of > > > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > > > Are you making calls to all these routines? Are you doing this inside your > MatMult() or before you call KSPSolve? > > > > The reason I wanted you to make a simpler run without the initial guess > code is that your events are far more complicated than would be produced by > GMRES alone so it is not possible to understand the behavior you are seeing > without fully understanding all the events happening in the code. 
> > > > Barry > > > > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > > > Thanks, I have attached the results without using any KSPGuess. At low > frequency, the iteration steps are quite close to the one with KSPGuess, > specifically > > KSPGuess Object: 1 MPI process > > type: fischer > > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are > significant higher than the one with KSPGuess, I have attahced both of the > results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used > for a baseline comparsion? What are we comparing here? How does it relate > to the performance issue/bottleneck I found? ?*I have noticed that the > time taken by **KSPSolve** is **almost two times **greater than the CPU > time for matrix-vector product multiplied by the number of iteration*? > > Thank you! > Yongzhong > > > > *From: *Barry Smith > *Date: *Thursday, June 13, 2024 at 2:14?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > Can you please run the same thing without the KSPGuess option(s) for a > baseline comparison? > > > > Thanks > > > > Barry > > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hi Matt, > > I have rerun the program with the keys you provided. The system output > when performing ksp solve and the final petsc log output were stored in a > .txt file attached for your reference. > > Thanks! > Yongzhong > > > > *From: *Matthew Knepley > *Date: *Wednesday, June 12, 2024 at 6:46?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > ????????? knepley at gmail.com ????????????????? > > > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Dear PETSc?s developers, I hope this email finds you well. I am currently > working on a project using PETSc and have encountered a performance issue > with the KSPSolve function. Specifically, I have noticed that the time > taken by KSPSolve is > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > Dear PETSc?s developers, > > I hope this email finds you well. > > I am currently working on a project using PETSc and have encountered a > performance issue with the KSPSolve function. Specifically, *I have > noticed that the time taken by **KSPSolve** is **almost two times **greater > than the CPU time for matrix-vector product multiplied by the number of > iteration steps*. I use C++ chrono to record CPU time. > > For context, I am using a shell system matrix A. Despite my efforts to > parallelize the matrix-vector product (Ax), the overall solve time > remains higher than the matrix vector product per iteration indicates > when multiple threads were used. 
Here are a few details of my setup: > > - *Matrix Type*: Shell system matrix > - *Preconditioner*: Shell PC > - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK > library, multithreading is enabled > > I have considered several potential reasons, such as preconditioner setup, > additional solver operations, and the inherent overhead of using a shell > system matrix. *However, since KSPSolve is a high-level API, I have been > unable to pinpoint the exact cause of the increased solve time.* > > Have you observed the same issue? Could you please provide some experience > on how to diagnose and address this performance discrepancy? Any > insights or recommendations you could offer would be greatly appreciated. > > > > For any performance question like this, we need to see the output of your > code run with > > > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > > > Thanks, > > > > Matt > > > > Thank you for your time and assistance. > > Best regards, > > Yongzhong > > ----------------------------------------------------------- > > *Yongzhong Li* > > PhD student | Electromagnetics Group > > Department of Electrical & Computer Engineering > > University of Toronto > > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!YIK0ThSYBvsQIwQn1nK19oKLIjhdsMJjFBn4TKBPWX9MdHPawEkV_Ol4-JrQi4VGPFKXs18XY44nNUBQDDP5XLnDSypz$ > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YIK0ThSYBvsQIwQn1nK19oKLIjhdsMJjFBn4TKBPWX9MdHPawEkV_Ol4-JrQi4VGPFKXs18XY44nNUBQDDP5XEgU8ZBa$ > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ykai0908 at 163.com Fri Jun 21 21:20:42 2024 From: ykai0908 at 163.com (Jacinto YANG) Date: Sat, 22 Jun 2024 10:20:42 +0800 (CST) Subject: [petsc-users] *****SPAM*****Consulting PETSc Parallel Solver for Solving Sparse Linear Equations Message-ID: <2abbb3aa.ae3.1903dbef385.Coremail.ykai0908@163.com> Dear PETSc team, Hello! I am personally engaged in numerical calculations and am currently developing a program project that requires efficient solution of large sparse linear equation systems. During the research process, I learned that the PETSc library provides a powerful parallel solver, which seems to be very suitable for my needs. What I want to ask is, can I call PETSc's parallel solver in my project to solve linear equations? If possible, could you recommend a parallel solver suitable for handling sparse matrices? The coefficient matrix of my linear equation system is usually sparse, so I need a solver that can effectively handle such matrices. In addition, I would greatly appreciate any relevant tutorials or document recommendations. Your help is crucial to the progress of my project, and I look forward to your reply. Best regards, Jacinto YANG -------------- next part -------------- An HTML attachment was scrubbed... 
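Regarding the question above about calling PETSc's parallel solvers on sparse systems: the KSP tutorials shipped with PETSc (src/ksp/ksp/tutorials) are the usual starting point, and a Krylov method such as GMRES or BiCGStab with an algebraic preconditioner is a common first choice for general sparse matrices. As a rough sketch only (the problem size, preallocation numbers, and assembly are placeholders, not a recommendation for any particular application):

    #include <petscksp.h>

    /* Sketch: solve A x = b for a sparse (AIJ) matrix with a Krylov method chosen at
       run time via options such as -ksp_type gmres -pc_type ilu (or bcgs, gamg, ...). */
    int main(int argc, char **argv)
    {
      Mat      A;
      Vec      x, b;
      KSP      ksp;
      PetscInt n = 100; /* placeholder problem size */

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      PetscCall(MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, 5, NULL, 5, NULL, &A));
      /* ... insert the nonzeros with MatSetValues() here ... */
      PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
      PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
      PetscCall(MatCreateVecs(A, &x, &b));
      /* ... fill the right-hand side b ... */

      PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
      PetscCall(KSPSetOperators(ksp, A, A));
      PetscCall(KSPSetFromOptions(ksp)); /* picks up -ksp_type, -pc_type, tolerances, ... */
      PetscCall(KSPSolve(ksp, b, x));

      PetscCall(KSPDestroy(&ksp));
      PetscCall(VecDestroy(&x));
      PetscCall(VecDestroy(&b));
      PetscCall(MatDestroy(&A));
      PetscCall(PetscFinalize());
      return 0;
    }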
URL: From yongzhong.li at mail.utoronto.ca Sat Jun 22 16:03:11 2024 From: yongzhong.li at mail.utoronto.ca (Yongzhong Li) Date: Sat, 22 Jun 2024 21:03:11 +0000 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> Message-ID: MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 
FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? Thanks, Yongzhong From: Junchao Zhang Date: Saturday, June 22, 2024 at 9:40?AM To: Yongzhong Li Cc: Pierre Jolivet , petsc-users at mcs.anl.gov Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used $ cd src/mat/tests $ make ex1 $ MKL_VERBOSE=1 ./ex1 --Junchao Zhang On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: I am using export MKL_VERBOSE=1 ./xx in the bash file, do I have to use - ksp_converged_reason? Thanks, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:47?PM To: Yongzhong Li > Cc: Junchao Zhang >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? How do you set the variable? $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 [...] 
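Since MKL_VERBOSE set in a batch script may not reach the program's environment, one way to rule that out, assuming the application itself includes the MKL headers and links MKL directly (which PETSc being built against MKL does not by itself guarantee), is to switch on the verbose mode from code using MKL's support routines:

    #include <cstdio>
    #include <mkl.h>

    // Rough sketch: enable MKL's verbose tracing from inside the program (same effect
    // as MKL_VERBOSE=1 in the environment) and report how many threads MKL will use.
    static void ReportMKLSettings()
    {
      mkl_verbose(1);
      std::printf("MKL max threads: %d\n", mkl_get_max_threads());
    }

If the MKL_VERBOSE lines still do not appear around the solve, the kernels being timed may simply not be routed through MKL.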
On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hello all, I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? Best, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:36?AM To: Junchao Zhang > Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: This Message Is From an External Sender This message came from outside your organization. I remember there are some MKL env vars to print MKL routines called. The environment variable is MKL_VERBOSE Thanks, Pierre Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up --Junchao Zhang On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, // Static variable to keep track of the stage counter static int stageCounter = 1; // Generate a unique stage name std::ostringstream oss; oss << "Stage " << stageCounter << " of Code"; std::string stageName = oss.str(); // Register the stage PetscLogStage stagenum; PetscLogStageRegister(stageName.c_str(), &stagenum); PetscLogStagePush(stagenum); KSPSolve(*ksp_ptr, b, x); PetscLogStagePop(); stageCounter++; I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. >From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. My questions is, >From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? Thank you, Yongzhong From: Barry Smith > Date: Friday, June 14, 2024 at 11:36?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue I am a bit confused. 
Without the initial guess computation, there are still a bunch of events I don't understand MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 Finally there are a huge number of MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. Barry On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically KSPGuess Object: 1 MPI process type: fischer Model 1, size 200 However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? Thank you! Yongzhong From: Barry Smith > Date: Thursday, June 13, 2024 at 2:14?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? Thanks Barry On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Matt, I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. Thanks! 
Yongzhong From: Matthew Knepley > Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!flsZMI97ne0yyxHhLda3hROB9qsgstuZS-jPinxGIzFCCSdn1ujdoMR8dyz-5_kVqqMM-12Lt0dTdjKrx3wXhHZmBhNy72AFb1k$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!flsZMI97ne0yyxHhLda3hROB9qsgstuZS-jPinxGIzFCCSdn1ujdoMR8dyz-5_kVqqMM-12Lt0dTdjKrx3wXhHZmBhNyYEgp7uQ$ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From knepley at gmail.com Sat Jun 22 16:56:02 2024 From: knepley at gmail.com (Matthew Knepley) Date: Sat, 22 Jun 2024 17:56:02 -0400 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> Message-ID: On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li wrote: > MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 > MKL_VERBOSE Intel(R) MKL 2019. 0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > > MKL_VERBOSE=1 ./ex1 > > > matrix nonzeros = 100, allocated nonzeros = 100 > > MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector Neural Network Instructions enabled > processors, Lnx 2.50GHz lp64 gnu_thread > > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) > 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) > 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) > 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF 
Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) > 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) > 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) > 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) > 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) > 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) > 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) > 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) > 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) > 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > Yes, for petsc example, there are MKL outputs, but for my own program. All > I did is to change the matrix type from MATAIJ to MATAIJMKL to get > optimized performance for spmv from MKL. Should I expect to see any MKL > outputs in this case? > Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. Thanks, Matt > Thanks, > > Yongzhong > > > > *From: *Junchao Zhang > *Date: *Saturday, June 22, 2024 at 9:40?AM > *To: *Yongzhong Li > *Cc: *Pierre Jolivet , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > No, you don't. It is strange. 
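A compact way to do the check suggested above in code rather than on the command line, assuming PETSc was configured with MKL's sparse support so the aijmkl types are available (a sketch, not the poster's code):

    #include <petscmat.h>

    /* Convert an assembled AIJ matrix to the MKL-backed format in place and print the
       ascii_info summary; the output should report "type: seqaijmkl" (or mpiaijmkl)
       if the conversion really took effect. The equivalent command-line options are
       -mat_type aijmkl and -mat_view ::ascii_info. */
    static PetscErrorCode CheckAIJMKL(Mat A)
    {
      PetscFunctionBeginUser;
      PetscCall(MatConvert(A, MATAIJMKL, MAT_INPLACE_MATRIX, &A));
      PetscCall(PetscViewerPushFormat(PETSC_VIEWER_STDOUT_WORLD, PETSC_VIEWER_ASCII_INFO));
      PetscCall(MatView(A, PETSC_VIEWER_STDOUT_WORLD));
      PetscCall(PetscViewerPopFormat(PETSC_VIEWER_STDOUT_WORLD));
      PetscFunctionReturn(PETSC_SUCCESS);
    }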
Perhaps you can you run a petsc example > first and see if MKL is really used > > $ cd src/mat/tests > > $ make ex1 > > $ MKL_VERBOSE=1 ./ex1 > > > --Junchao Zhang > > > > > > On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > I am using > > export MKL_VERBOSE=1 > > ./xx > > in the bash file, do I have to use - ksp_converged_reason? > > Thanks, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:47?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > How do you set the variable? > > > > $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason > > MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 > architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled > processors, Lnx 2.80GHz lp64 intel_thread > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > [...] > > > > On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hello all, > > I set MKL_VERBOSE = 1, but observed no print output specific to the use of > MKL. Does PETSc enable this verbose output? > > Best, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:36?AM > *To: *Junchao Zhang > *Cc: *Yongzhong Li , > petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > > > > > On 21 Jun 2024, at 6:42?AM, Junchao Zhang wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > I remember there are some MKL env vars to print MKL routines called. > > > > The environment variable is MKL_VERBOSE > > > > Thanks, > > Pierre > > > > Maybe we can try it to see what MKL routines are really used and then we > can understand why some petsc functions did not speed up > > > --Junchao Zhang > > > > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > Hi Barry, sorry for my last results. I didn?t fully understand the stage > profiling and logging in PETSc, now I only record KSPSolve() stage of my > program. Some sample codes are as follow, > > // Static variable to keep track of the stage counter > > static int stageCounter = 1; > > > > // Generate a unique stage name > > std::ostringstream oss; > > oss << "Stage " << stageCounter << " of Code"; > > std::string stageName = oss.str(); > > > > // Register the stage > > PetscLogStage stagenum; > > > > PetscLogStageRegister(stageName.c_str(), &stagenum); > > PetscLogStagePush(stagenum); > > > > *KSPSolve(*ksp_ptr, b, x);* > > > > PetscLogStagePop(); > > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other > stages where each one is KSPSolve() call. 
> > To provide some additional backgrounds, if you recall, I have been trying > to get efficient iterative solution using multithreading. I found out by > compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to > perform sparse matrix-vector multiplication faster, I am using > MATSEQAIJMKL. This makes the shell matrix vector product in each iteration > scale well with the #of threads. However, I found out the total GMERS solve > time (~KSPSolve() time) is not scaling well the #of threads. > > From the logging results I learned that when performing KSPSolve(), there > are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs > using different number of threads and plotted the time consumption for > PCApply() and KSPGMERSOrthog() against #of thread. I found out these two > operations are not scaling with the threads at all! My results are attached > as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, > KSPGMERSOrthog() will have many vector operations, so why these two parts > can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > > > *From: *Barry Smith > *Date: *Friday, June 14, 2024 at 11:36?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > I am a bit confused. Without the initial guess computation, there are > still a bunch of events I don't understand > > > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are > each done the same number of times) > > > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > > > Finally there are a huge number of > > > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > > > Are you making calls to all these routines? Are you doing this inside your > MatMult() or before you call KSPSolve? > > > > The reason I wanted you to make a simpler run without the initial guess > code is that your events are far more complicated than would be produced by > GMRES alone so it is not possible to understand the behavior you are seeing > without fully understanding all the events happening in the code. 
> > > > Barry > > > > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > > > Thanks, I have attached the results without using any KSPGuess. At low > frequency, the iteration steps are quite close to the one with KSPGuess, > specifically > > KSPGuess Object: 1 MPI process > > type: fischer > > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are > significant higher than the one with KSPGuess, I have attahced both of the > results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used > for a baseline comparsion? What are we comparing here? How does it relate > to the performance issue/bottleneck I found? ?*I have noticed that the > time taken by **KSPSolve** is **almost two times **greater than the CPU > time for matrix-vector product multiplied by the number of iteration*? > > Thank you! > Yongzhong > > > > *From: *Barry Smith > *Date: *Thursday, June 13, 2024 at 2:14?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > Can you please run the same thing without the KSPGuess option(s) for a > baseline comparison? > > > > Thanks > > > > Barry > > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hi Matt, > > I have rerun the program with the keys you provided. The system output > when performing ksp solve and the final petsc log output were stored in a > .txt file attached for your reference. > > Thanks! > Yongzhong > > > > *From: *Matthew Knepley > *Date: *Wednesday, June 12, 2024 at 6:46?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > ????????? knepley at gmail.com ????????????????? > > > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Dear PETSc?s developers, I hope this email finds you well. I am currently > working on a project using PETSc and have encountered a performance issue > with the KSPSolve function. Specifically, I have noticed that the time > taken by KSPSolve is > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > Dear PETSc?s developers, > > I hope this email finds you well. > > I am currently working on a project using PETSc and have encountered a > performance issue with the KSPSolve function. Specifically, *I have > noticed that the time taken by **KSPSolve** is **almost two times **greater > than the CPU time for matrix-vector product multiplied by the number of > iteration steps*. I use C++ chrono to record CPU time. > > For context, I am using a shell system matrix A. Despite my efforts to > parallelize the matrix-vector product (Ax), the overall solve time > remains higher than the matrix vector product per iteration indicates > when multiple threads were used. 
Here are a few details of my setup: > > - *Matrix Type*: Shell system matrix > - *Preconditioner*: Shell PC > - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK > library, multithreading is enabled > > I have considered several potential reasons, such as preconditioner setup, > additional solver operations, and the inherent overhead of using a shell > system matrix. *However, since KSPSolve is a high-level API, I have been > unable to pinpoint the exact cause of the increased solve time.* > > Have you observed the same issue? Could you please provide some experience > on how to diagnose and address this performance discrepancy? Any > insights or recommendations you could offer would be greatly appreciated. > > > > For any performance question like this, we need to see the output of your > code run with > > > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > > > Thanks, > > > > Matt > > > > Thank you for your time and assistance. > > Best regards, > > Yongzhong > > ----------------------------------------------------------- > > *Yongzhong Li* > > PhD student | Electromagnetics Group > > Department of Electrical & Computer Engineering > > University of Toronto > > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!bDAP9_cc4kxQoG-PxDlkBdIp_YAhb2swSdTCmldNce2eI4DO6YATl5KED0zpX5PC2AEvY1tq0jjSK32rn8gN$ > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bDAP9_cc4kxQoG-PxDlkBdIp_YAhb2swSdTCmldNce2eI4DO6YATl5KED0zpX5PC2AEvY1tq0jjSK2axnx8f$ > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bDAP9_cc4kxQoG-PxDlkBdIp_YAhb2swSdTCmldNce2eI4DO6YATl5KED0zpX5PC2AEvY1tq0jjSK2axnx8f$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From yongzhong.li at mail.utoronto.ca Sat Jun 22 21:07:15 2024 From: yongzhong.li at mail.utoronto.ca (Yongzhong Li) Date: Sun, 23 Jun 2024 02:07:15 +0000 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> Message-ID: Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) --> Setting up matrix-vector products... Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines --> Solving the system... Excitation 1 of 1... ================================================ Iterative solve completed in 7435 ms. CONVERGED: rtol. 
Iterations: 72 Final relative residual norm: 9.22287e-07 ================================================ [CPU TIME] System solution: 2.27160000e+02 s. [WALL TIME] System solution: 7.44387218e+00 s. However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? Thanks, Yongzhong From: Matthew Knepley Date: Saturday, June 22, 2024 at 5:56?PM To: Yongzhong Li Cc: Junchao Zhang , Pierre Jolivet , petsc-users at mcs.anl.gov Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE 
ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. Thanks, Matt Thanks, Yongzhong From: Junchao Zhang > Date: Saturday, June 22, 2024 at 9:40?AM To: Yongzhong Li > Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used $ cd src/mat/tests $ make ex1 $ MKL_VERBOSE=1 ./ex1 --Junchao Zhang On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: I am using export MKL_VERBOSE=1 ./xx in the bash file, do I have to use - ksp_converged_reason? 
From: Pierre Jolivet
Date: Friday, June 21, 2024 at 1:47 PM
To: Yongzhong Li
Cc: Junchao Zhang, petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

How do you set the variable?

$ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason
MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread
MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
[...]

On 21 Jun 2024, at 7:37 PM, Yongzhong Li wrote:

Hello all,

I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output?

Best,
Yongzhong

From: Pierre Jolivet
Date: Friday, June 21, 2024 at 1:36 AM
To: Junchao Zhang
Cc: Yongzhong Li, petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

On 21 Jun 2024, at 6:42 AM, Junchao Zhang wrote:

I remember there are some MKL env vars to print MKL routines called.

The environment variable is MKL_VERBOSE

Thanks,
Pierre

Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up

--Junchao Zhang

On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li wrote:

Hi Barry, sorry for my last results. I didn't fully understand the stage profiling and logging in PETSc; now I only record the KSPSolve() stage of my program. Some sample code follows:

// Static variable to keep track of the stage counter
static int stageCounter = 1;

// Generate a unique stage name
std::ostringstream oss;
oss << "Stage " << stageCounter << " of Code";
std::string stageName = oss.str();

// Register the stage
PetscLogStage stagenum;

PetscLogStageRegister(stageName.c_str(), &stagenum);
PetscLogStagePush(stagenum);

KSPSolve(*ksp_ptr, b, x);

PetscLogStagePop();
stageCounter++;

I have attached my new logging results; there is 1 main stage and 4 other stages, each one a KSPSolve() call.

To provide some additional background, if you recall, I have been trying to get an efficient iterative solution using multithreading. I found out that by compiling PETSc with the Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster; I am using MATSEQAIJMKL. This makes the shell matrix-vector product in each iteration scale well with the number of threads. However, I found out the total GMRES solve time (~KSPSolve() time) is not scaling well with the number of threads.

From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMRESOrthog(). I ran my program using different numbers of threads and plotted the time consumption for PCApply() and KSPGMRESOrthog() against the number of threads. I found out these two operations are not scaling with the threads at all! My results are attached as a pdf to give you a clear view.

My question is: from my understanding, in PCApply, MatSolve() is involved, and KSPGMRESOrthog() will have many vector operations, so why can't these two parts scale well with the number of threads when the Intel MKL library is linked?

Thank you,
Yongzhong
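One detail worth checking when making scaling plots like the ones described above is the thread count MKL actually has in effect during the timed stage. A small sketch, assuming MKL headers are available; mkl_get_max_threads() is a standard MKL service routine, and with the gnu_thread layer shown in the MKL_VERBOSE banner earlier the OMP_NUM_THREADS / MKL_NUM_THREADS environment variables control this count.

    #include <mkl.h>
    #include <petscsys.h>

    // Sketch: report the MKL thread count in effect around a timed KSPSolve()
    // stage, so each logged stage can be labelled with the thread count in use.
    static PetscErrorCode ReportMklThreads(void)
    {
      PetscFunctionBeginUser;
      PetscCall(PetscPrintf(PETSC_COMM_SELF, "MKL max threads: %d\n", mkl_get_max_threads()));
      PetscFunctionReturn(PETSC_SUCCESS);
    }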
From: Barry Smith
Date: Friday, June 14, 2024 at 11:36 AM
To: Yongzhong Li
Cc: petsc-users at mcs.anl.gov, petsc-maint at mcs.anl.gov, Piero Triverio
Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

I am a bit confused. Without the initial guess computation, there are still a bunch of events I don't understand

MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275
MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0

in addition there are many more VecMAXPY than VecMDot (in GMRES they are each done the same number of times)

VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016
VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913

Finally there are a huge number of

MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025

Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve?

The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone, so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code.

Barry

On Jun 14, 2024, at 1:19 AM, Yongzhong Li wrote:

Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically

KSPGuess Object: 1 MPI process
  type: fischer
    Model 1, size 200

However, I found that at higher frequency the number of iteration steps is significantly higher than the one with KSPGuess; I have attached both of the results for your reference.

Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparison? What are we comparing here? How does it relate to the performance issue/bottleneck I found ("I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iterations")?

Thank you!
Yongzhong
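For readers following the with/without-KSPGuess comparison: a sketch of the two configurations, using the model and size reported in the KSPGuess view above (Fischer model 1, size 200). KSPSetUseFischerGuess() is the standard PETSc call; the option spellings given in the comment are what current releases document and should be checked against the installed version's man pages.

    #include <petscksp.h>

    // Sketch of the two configurations being compared; values taken from the
    // KSPGuess report above (Fischer model 1, size 200).
    static PetscErrorCode ConfigureGuess(KSP ksp, PetscBool use_guess)
    {
      PetscFunctionBeginUser;
      if (use_guess) {
        // Roughly equivalent to -ksp_guess_type fischer -ksp_guess_fischer_model 1,200
        PetscCall(KSPSetUseFischerGuess(ksp, 1, 200));
      }
      // Baseline run: simply skip the call (and the command-line options),
      // so each KSPSolve() starts from the usual zero initial guess.
      PetscFunctionReturn(PETSC_SUCCESS);
    }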
From: Barry Smith
Date: Thursday, June 13, 2024 at 2:14 PM
To: Yongzhong Li
Cc: petsc-users at mcs.anl.gov, petsc-maint at mcs.anl.gov, Piero Triverio
Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

Can you please run the same thing without the KSPGuess option(s) for a baseline comparison?

Thanks

Barry

On Jun 13, 2024, at 1:27 PM, Yongzhong Li wrote:

Hi Matt,

I have rerun the program with the keys you provided. The system output when performing the ksp solve and the final petsc log output were stored in a .txt file attached for your reference.

Thanks!
Yongzhong

From: Matthew Knepley
Date: Wednesday, June 12, 2024 at 6:46 PM
To: Yongzhong Li
Cc: petsc-users at mcs.anl.gov, petsc-maint at mcs.anl.gov, Piero Triverio
Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

On Wed, Jun 12, 2024 at 6:36 PM Yongzhong Li wrote:

Dear PETSc's developers,

I hope this email finds you well.

I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for the matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time.

For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix-vector product per iteration indicates when multiple threads were used. Here are a few details of my setup:

* Matrix Type: Shell system matrix
* Preconditioner: Shell PC
* Parallel Environment: Using Intel MKL as PETSc's BLAS/LAPACK library, multithreading is enabled

I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time.

Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated.

For any performance question like this, we need to see the output of your code run with

-ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view

Thanks,

Matt

Thank you for your time and assistance.

Best regards,

Yongzhong

-----------------------------------------------------------
Yongzhong Li
PhD student | Electromagnetics Group
Department of Electrical & Computer Engineering
University of Toronto
http://www.modelics.org

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
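The original question above describes a shell system matrix and a shell preconditioner; for context, a minimal sketch of that setup follows. The callback names, the UserCtx contents, and BuildSolver are placeholders, not code from this thread; with this structure, -log_view attributes the callback costs to the MatMult and PCApply events discussed earlier.

    #include <petscksp.h>

    // Placeholder context; hold whatever the user operator actually needs.
    typedef struct { Mat Aloc; } UserCtx;

    // y = A*x for the shell matrix; the body is left to the application.
    static PetscErrorCode UserMatMult(Mat A, Vec x, Vec y)
    {
      UserCtx *ctx;
      PetscFunctionBeginUser;
      PetscCall(MatShellGetContext(A, &ctx));
      /* apply the operator, e.g. via MATAIJMKL pieces held in ctx */
      PetscFunctionReturn(PETSC_SUCCESS);
    }

    // z = M^{-1} r for the shell preconditioner.
    static PetscErrorCode UserPCApply(PC pc, Vec r, Vec z)
    {
      PetscFunctionBeginUser;
      /* apply the user preconditioner */
      PetscFunctionReturn(PETSC_SUCCESS);
    }

    static PetscErrorCode BuildSolver(MPI_Comm comm, PetscInt n, UserCtx *ctx, KSP *ksp)
    {
      Mat A;
      PC  pc;
      PetscFunctionBeginUser;
      PetscCall(MatCreateShell(comm, n, n, PETSC_DETERMINE, PETSC_DETERMINE, ctx, &A));
      PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))UserMatMult));
      PetscCall(KSPCreate(comm, ksp));
      PetscCall(KSPSetOperators(*ksp, A, A));
      PetscCall(KSPGetPC(*ksp, &pc));
      PetscCall(PCSetType(pc, PCSHELL));
      PetscCall(PCShellSetApply(pc, UserPCApply));
      PetscCall(KSPSetFromOptions(*ksp));
      PetscFunctionReturn(PETSC_SUCCESS);
    }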
From junchao.zhang at gmail.com Sat Jun 22 22:34:40 2024
From: junchao.zhang at gmail.com (Junchao Zhang)
Date: Sat, 22 Jun 2024 22:34:40 -0500
Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
In-Reply-To: 
References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et>
Message-ID: 

Could you send your petsc configure.log?

--Junchao Zhang

On Sat, Jun 22, 2024 at 9:07 PM Yongzhong Li wrote:

> Yeah, I ran my program again using -mat_view::ascii_info and set
> MKL_VERBOSE to be 1, then I noticed the outputs suggested the matrix
> to be of seqaijmkl type (I've attached a few as below)
>
> --> Setting up matrix-vector products...
>
> Mat Object: 1 MPI process
>   type: seqaijmkl
>   rows=16490, cols=35937
>   total: nonzeros=128496, allocated nonzeros=128496
>   total number of mallocs used during MatSetValues calls=0
>     not using I-node routines
> Mat Object: 1 MPI process
>   type: seqaijmkl
>   rows=16490, cols=35937
>   total: nonzeros=128496, allocated nonzeros=128496
>   total number of mallocs used during MatSetValues calls=0
>     not using I-node routines
>
> --> Solving the system...
>
> Excitation 1 of 1...
>
> ================================================
> Iterative solve completed in 7435 ms.
> CONVERGED: rtol.
> Iterations: 72
> Final relative residual norm: 9.22287e-07
> ================================================
> [CPU TIME] System solution: 2.27160000e+02 s.
> [WALL TIME] System solution: 7.44387218e+00 s.
>
> However, it seems to me that there were still no MKL outputs even though I
> set MKL_VERBOSE to 1, although there should be many spmv operations
> during KSPSolve(). Do you see the possible reasons?
>
> Thanks,
> Yongzhong
>
> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From yongzhong.li at mail.utoronto.ca Sat Jun 22 22:46:39 2024
From: yongzhong.li at mail.utoronto.ca (Yongzhong Li)
Date: Sun, 23 Jun 2024 03:46:39 +0000
Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
In-Reply-To: 
References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et>
Message-ID: 

Sure, the log file is attached!

Thanks,
Yongzhong

From: Junchao Zhang
Date: Saturday, June 22, 2024 at 11:34 PM
To: Yongzhong Li
Cc: Matthew Knepley, Pierre Jolivet, petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

Could you send your petsc configure.log?

--Junchao Zhang

On Sat, Jun 22, 2024 at 9:07 PM Yongzhong Li wrote:

Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested the matrix to be of seqaijmkl type (I've attached a few as below)

[...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: configure.log
Type: application/octet-stream
Size: 1496850 bytes
Desc: configure.log
URL: 

From pierre at joliv.et Sat Jun 22 23:40:38 2024
From: pierre at joliv.et (Pierre Jolivet)
Date: Sun, 23 Jun 2024 06:40:38 +0200
Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
In-Reply-To: 
References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et>
Message-ID: <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et>

> On 23 Jun 2024, at 4:07 AM, Yongzhong Li wrote:
>
> Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested the matrix to be of seqaijmkl type (I've attached a few as below)
>
> [...]
>
> However, it seems to me that there were still no MKL outputs even though I set MKL_VERBOSE to 1, although there should be many spmv operations during KSPSolve(). Do you see the possible reasons?

SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is.

Thanks,
Pierre

> Thanks,
> Yongzhong
>
> [...]
Without the initial guess computation, there are still a bunch of events I don't understand > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > Finally there are a huge number of > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? > > The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. > > Barry > > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically > > KSPGuess Object: 1 MPI process > type: fischer > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? > > Thank you! > Yongzhong > > From: Barry Smith > > Date: Thursday, June 13, 2024 at 2:14?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? > > Thanks > > Barry > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Matt, > > I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. > > Thanks! 
> Yongzhong > > From: Matthew Knepley > > Date: Wednesday, June 12, 2024 at 6:46?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: > Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Dear PETSc?s developers, > I hope this email finds you well. > I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. > For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: > Matrix Type: Shell system matrix > Preconditioner: Shell PC > Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled > I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. > Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. > > For any performance question like this, we need to see the output of your code run with > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > Thanks, > > Matt > > Thank you for your time and assistance. > Best regards, > Yongzhong > ----------------------------------------------------------- > Yongzhong Li > PhD student | Electromagnetics Group > Department of Electrical & Computer Engineering > University of Toronto > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!Z8HZiXGQnjHqDmc7iazH87rJlbMWsLB3EuunbfKExj-yIP_YO4fwbd8Pwjj09pkXSKM0E65MYD8qig2qydb_fA$ > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!Z8HZiXGQnjHqDmc7iazH87rJlbMWsLB3EuunbfKExj-yIP_YO4fwbd8Pwjj09pkXSKM0E65MYD8qig1_dQLi2Q$ > > > > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!Z8HZiXGQnjHqDmc7iazH87rJlbMWsLB3EuunbfKExj-yIP_YO4fwbd8Pwjj09pkXSKM0E65MYD8qig1_dQLi2Q$ -------------- next part -------------- An HTML attachment was scrubbed... 
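A small, self-contained sketch of the kind of setup under discussion may help readers of the archive: a MATSHELL operator whose MatMult is supplied by the application, with the KSPSolve wrapped in a PetscLogStage so that the -log_view summary reports the solve separately from setup. This is illustrative only and is not the application code from the thread: the shell product is just y = 2*x, PCNONE stands in for the shell preconditioner, and every name in it is made up.

#include <petscksp.h>

/* Stand-in for the application's (possibly threaded MKL) operator apply */
static PetscErrorCode UserMatMult(Mat A, Vec x, Vec y)
{
  PetscFunctionBeginUser;
  PetscCall(VecCopy(x, y));
  PetscCall(VecScale(y, 2.0));
  PetscFunctionReturn(PETSC_SUCCESS);
}

int main(int argc, char **argv)
{
  Mat           A;
  Vec           b, x;
  KSP           ksp;
  PC            pc;
  PetscLogStage stage;
  PetscInt      n = 1000;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  /* Shell operator: PETSc only ever calls the user-supplied product */
  PetscCall(MatCreateShell(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, NULL, &A));
  PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))UserMatMult));

  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));

  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCNONE));   /* a real code would set its PCSHELL here */
  PetscCall(KSPSetFromOptions(ksp));  /* picks up -ksp_view, -ksp_monitor_true_residual, ... */

  /* Isolate the solve in its own stage so -log_view breaks it out */
  PetscCall(PetscLogStageRegister("KSPSolve only", &stage));
  PetscCall(PetscLogStagePush(stage));
  PetscCall(KSPSolve(ksp, b, x));
  PetscCall(PetscLogStagePop());

  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}

Run with the options requested above, for example
./app -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view
and the stage table at the end of the log is the part to compare against the per-iteration matrix-vector timings.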
URL: 

From lzou at anl.gov  Sun Jun 23 16:03:55 2024
From: lzou at anl.gov (Zou, Ling)
Date: Sun, 23 Jun 2024 21:03:55 +0000
Subject: [petsc-users] Modelica + PETSc?
Message-ID: 

Hi all, I am just curious: any effort trying to include PETSc as Modelica's
solution option?

(Modelica forum or email list seem to be quite dead so asking here.)

-Ling
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From marildo.kola at gmail.com  Sun Jun 23 17:21:02 2024
From: marildo.kola at gmail.com (Marildo Kola)
Date: Mon, 24 Jun 2024 00:21:02 +0200
Subject: [petsc-users] Restart Krylov-Schur "Manually"
Message-ID: 

Hello,

I am using SLEPc to calculate eigenvalues for fluid dynamics stability
analysis (specifically studying bifurcations). We employ a MatShellOperation,
which involves propagating Navier-Stokes to construct the Krylov space, and
this particularly slows down our algorithm. The problem I am facing is that,
after days of simulations, the simulation may die due to a time limit on the
cluster, but the eigensolver (I am using the default Krylov-Schur) has not
converged yet, leading to the loss of all the information computed up to that
point. I wanted to inquire if it is possible to implement, with the available
features, a restarting strategy, which can allow me, once the simulation stops
(or after a given number of restart iterations of the solver), to save all the
information necessary to restart the EPSSolver from the point it had stopped.

Thank you in advance,
Best regards, Marildo Kola
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jroman at dsic.upv.es  Mon Jun 24 03:14:34 2024
From: jroman at dsic.upv.es (Jose E. Roman)
Date: Mon, 24 Jun 2024 10:14:34 +0200
Subject: [petsc-users] Restart Krylov-Schur "Manually"
In-Reply-To: 
References: 
Message-ID: <2A28BA96-F12D-44E1-91F4-12EA2B800D76@dsic.upv.es>

An HTML attachment was scrubbed...
URL: 

From samar.khatiwala at earth.ox.ac.uk  Mon Jun 24 03:24:03 2024
From: samar.khatiwala at earth.ox.ac.uk (Samar Khatiwala)
Date: Mon, 24 Jun 2024 08:24:03 +0000
Subject: [petsc-users] Restart Krylov-Schur "Manually"
In-Reply-To: <2A28BA96-F12D-44E1-91F4-12EA2B800D76@dsic.upv.es>
References: <2A28BA96-F12D-44E1-91F4-12EA2B800D76@dsic.upv.es>
Message-ID: <3DF11952-6113-49F7-ABB2-63F4F2DCDE45@earth.ox.ac.uk>

Hi,

Sorry to hijack this thread but I just want to add that this is a more
general problem that I constantly face with PETSc. Not being able to
checkpoint the complete state of a solver instance and restart a
computation (at least not easily) has long been the biggest missing feature
in PETSc for me.

Thanks,

Samar

On Jun 24, 2024, at 9:14 AM, Jose E. Roman via petsc-users <petsc-users at mcs.anl.gov> wrote:

Unfortunately there is no support for this.

If you requested several eigenvalues and the solver has converged some of
them already, then it would be possible to stop the run, save the
eigenvectors and rerun with the eigenvectors passed via
EPSSetDeflationSpace().

Jose


> On 24 Jun 2024, at 0:21, Marildo Kola <marildo.kola at gmail.com> wrote:
>
> Hello,
> I am using SLEPc to calculate eigenvalues for fluid dynamics stability
> analysis (specifically studying bifurcations). We employ a
> MatShellOperation, which involves propagating Navier-Stokes to construct
> the Krylov space, and this particularly slows down our algorithm. The
> problem I am facing is that, after days of simulations, the simulation may
> die due to a time limit on the cluster, but the eigensolver (I am using
> the default Krylov-Schur) has not converged yet, leading to the loss of
> all the information computed up to that point. I wanted to inquire if it
> is possible to implement, with the available features, a restarting
> strategy, which can allow me, once the simulation stops (or after a given
> number of restart iterations of the solver), to save all the information
> necessary to restart the EPSSolver from the point it had stopped.
> Thank you in advance,
> Best regards, Marildo Kola

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From knepley at gmail.com  Mon Jun 24 06:11:53 2024
From: knepley at gmail.com (Matthew Knepley)
Date: Mon, 24 Jun 2024 07:11:53 -0400
Subject: [petsc-users] Modelica + PETSc?
In-Reply-To: 
References: 
Message-ID: 

On Sun, Jun 23, 2024 at 5:04 PM Zou, Ling via petsc-users <
petsc-users at mcs.anl.gov> wrote:

> Hi all, I am just curious: any effort trying to include PETSc as
> Modelica's solution option?
>
> (Modelica forum or email list seem to be quite dead so asking here.)
>

I had not heard of it before. I looked at the 3.6 specification, but it
did not say how the generated DAE were solved, or how to interface
packages. Do they have documentation on that?

  Thanks,

     Matt


>
> -Ling
>
-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!Zm0NjFBejQX24YkLIjkKSr1FkyhGSd5YDzKPEYLhPVdjIB_EifkXVLP3RicixnSb0xjR5KtYyBcRN6v-fWU4$ 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From knepley at gmail.com  Mon Jun 24 06:15:28 2024
From: knepley at gmail.com (Matthew Knepley)
Date: Mon, 24 Jun 2024 07:15:28 -0400
Subject: [petsc-users] Restart Krylov-Schur "Manually"
In-Reply-To: <3DF11952-6113-49F7-ABB2-63F4F2DCDE45@earth.ox.ac.uk>
References: <2A28BA96-F12D-44E1-91F4-12EA2B800D76@dsic.upv.es>
 <3DF11952-6113-49F7-ABB2-63F4F2DCDE45@earth.ox.ac.uk>
Message-ID: 

On Mon, Jun 24, 2024 at 4:24 AM Samar Khatiwala <
samar.khatiwala at earth.ox.ac.uk> wrote:

> Hi,
>
> Sorry to hijack this thread but I just want to add that this is a more
> general problem that I constantly face with PETSc. Not being able to
> checkpoint the complete state of a solver instance and restart a
> computation (at least not easily) has long been the biggest missing feature
> in PETSc for me.
>

Which type of solver do you want to do this for? Some solvers, like
Newton, just need the current iterate, which we do. You could imagine
saving Krylov spaces, but it is very often cheaper to regenerate them than
to save and load them from disk (which tends to be under-provisioned).

  Thanks,

    Matt


> Thanks,
>
> Samar
>
> On Jun 24, 2024, at 9:14 AM, Jose E. Roman via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
>
> > Unfortunately there is no support for this. > > If you requested several eigenvalues and the solver has converged some of them already, then it would be possible to stop the run, save the eigenvectors and rerun with the eigenvectors passed via EPSSetDeflationSpace(). > > Jose > > > > El 24 jun 2024, a las 0:21, Marildo Kola escribi?: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hello, > > I am using SLEPc to calculate eigenvalues for fluid dynamics stability analysis (specifically studying bifurcations). We employ a MatShellOperation, which involves propagating Navier-Stokes to construct the Krylov space, and this particularly slows down our algorithm. The problem I am facing is that, after days of simulations, the simulation may die due to a time limit on the cluster, but the eigensolver (I am using the default Krylov-Schur) has not converged yet, leading to the loss of all the information computed up to that point. I wanted to inquire if it is possible to implement, with the available features, a restarting strategy, which can allow me, once the simulation stops (or after a given number of restart iterations of the solver), to save all the information necessary to restart the EPSSolver from the point it had stopped. > > Thank you in advance, > > Best regards, Marildo Kola > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ch8NH3Wyy13drlSnX_Ftydd3HzRlIz3IQda46x_WnHdpCZvqNPlj-Fhk8Ap5uLxb85QjWMip0Rn0PZdq7XGu$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From lzou at anl.gov Mon Jun 24 09:29:16 2024 From: lzou at anl.gov (Zou, Ling) Date: Mon, 24 Jun 2024 14:29:16 +0000 Subject: [petsc-users] Modelica + PETSc? In-Reply-To: References: Message-ID: This is the website I normally refer to https://urldefense.us/v3/__https://openmodelica.org/doc/OpenModelicaUsersGuide/latest/solving.html__;!!G_uCfscf7eWS!ZZZIEo4PTP8K3Wn8r3Qk0Zy1YJWCVZtiUVvdSq4KnaeMp3VLcrJ3eQOZocmlvn8MCCGcUyzd0niXGJqTtg4$ Looks like DASSL is the default solver. PS: I was playing with Modelica with some toy problem I have, which solves fine but could not hold on with the steady-state solution for some reason. Maybe I did it wrong, or maybe I am not familiar with the solver. That was the reason of the Modelica+PETSc question since I am quite familiar with PETSc. Also, the combination seems to be a powerful pair. -Ling From: Matthew Knepley Date: Monday, June 24, 2024 at 6:12 AM To: Zou, Ling Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Modelica + PETSc? On Sun, Jun 23, 2024 at 5:?04 PM Zou, Ling via petsc-users wrote: Hi all, I am just curious ? any effort trying to include PETSc as Modelica?s solution option? (Modelica forum or email list seem to be quite dead ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd On Sun, Jun 23, 2024 at 5:04?PM Zou, Ling via petsc-users > wrote: Hi all, I am just curious ? any effort trying to include PETSc as Modelica?s solution option? (Modelica forum or email list seem to be quite dead so asking here.) I had not heard of it before. I looked at the 3.6 specification, but it did not sy how the generated DAE were solved, or how to interface packages. Do they have documentation on that? 
Thanks, Matt -Ling -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZZZIEo4PTP8K3Wn8r3Qk0Zy1YJWCVZtiUVvdSq4KnaeMp3VLcrJ3eQOZocmlvn8MCCGcUyzd0niXYMDTlyc$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From yongzhong.li at mail.utoronto.ca Mon Jun 24 10:18:22 2024 From: yongzhong.li at mail.utoronto.ca (Yongzhong Li) Date: Mon, 24 Jun 2024 15:18:22 +0000 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> Message-ID: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? Thank you, Yongzhong From: Pierre Jolivet Date: Sunday, June 23, 2024 at 12:41?AM To: Yongzhong Li Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue On 23 Jun 2024, at 4:07?AM, Yongzhong Li wrote: This Message Is From an External Sender This message came from outside your organization. Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) --> Setting up matrix-vector products... Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines --> Solving the system... Excitation 1 of 1... ================================================ Iterative solve completed in 7435 ms. CONVERGED: rtol. Iterations: 72 Final relative residual norm: 9.22287e-07 ================================================ [CPU TIME] System solution: 2.27160000e+02 s. [WALL TIME] System solution: 7.44387218e+00 s. However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. Thanks, Pierre Thanks, Yongzhong From: Matthew Knepley > Date: Saturday, June 22, 2024 at 5:56?PM To: Yongzhong Li > Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. 
ZjQcmQRYFpfptBannerEnd MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE 
ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. Thanks, Matt Thanks, Yongzhong From: Junchao Zhang > Date: Saturday, June 22, 2024 at 9:40?AM To: Yongzhong Li > Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used $ cd src/mat/tests $ make ex1 $ MKL_VERBOSE=1 ./ex1 --Junchao Zhang On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: I am using export MKL_VERBOSE=1 ./xx in the bash file, do I have to use - ksp_converged_reason? Thanks, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:47?PM To: Yongzhong Li > Cc: Junchao Zhang >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? How do you set the variable? $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 [...] On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hello all, I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? 
Best, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:36?AM To: Junchao Zhang > Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: This Message Is From an External Sender This message came from outside your organization. I remember there are some MKL env vars to print MKL routines called. The environment variable is MKL_VERBOSE Thanks, Pierre Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up --Junchao Zhang On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, // Static variable to keep track of the stage counter static int stageCounter = 1; // Generate a unique stage name std::ostringstream oss; oss << "Stage " << stageCounter << " of Code"; std::string stageName = oss.str(); // Register the stage PetscLogStage stagenum; PetscLogStageRegister(stageName.c_str(), &stagenum); PetscLogStagePush(stagenum); KSPSolve(*ksp_ptr, b, x); PetscLogStagePop(); stageCounter++; I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. >From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. My questions is, >From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? Thank you, Yongzhong From: Barry Smith > Date: Friday, June 14, 2024 at 11:36?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue I am a bit confused. 
Without the initial guess computation, there are still a bunch of events I don't understand MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 Finally there are a huge number of MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. Barry On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically KSPGuess Object: 1 MPI process type: fischer Model 1, size 200 However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? Thank you! Yongzhong From: Barry Smith > Date: Thursday, June 13, 2024 at 2:14?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? Thanks Barry On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Matt, I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. Thanks! 
Yongzhong From: Matthew Knepley > Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!dMHrM6vrNtExHslaOjdVWdI5eBWZG56rbmYqPXwVW4whcyjsWtFdMuucr4NNgkppuRIOm1ipYS5gM2yRcHb2NDxqvuM5LRyRGzI$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!dMHrM6vrNtExHslaOjdVWdI5eBWZG56rbmYqPXwVW4whcyjsWtFdMuucr4NNgkppuRIOm1ipYS5gM2yRcHb2NDxqvuM5Yvl2iTo$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!dMHrM6vrNtExHslaOjdVWdI5eBWZG56rbmYqPXwVW4whcyjsWtFdMuucr4NNgkppuRIOm1ipYS5gM2yRcHb2NDxqvuM5Yvl2iTo$ -------------- next part -------------- An HTML attachment was scrubbed... 
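A note that is not part of the thread itself: before digging further into why PCApply and KSPGMRESOrthog do not scale with the number of threads, it is worth measuring how much memory bandwidth one node can actually deliver, since VecMDot, VecMAXPY and sparse matrix-vector products are bandwidth limited rather than core limited. PETSc ships a STREAMS benchmark for exactly this purpose; assuming the usual PETSC_DIR and PETSC_ARCH environment, running "make PETSC_DIR=$PETSC_DIR PETSC_ARCH=$PETSC_ARCH streams" from the PETSc source tree reports how the achievable bandwidth grows as processes are added (the sources live under src/benchmarks/streams if the target name differs in a given release). Once that curve flattens, adding more threads to the vector kernels cannot speed them up further, whatever BLAS library is linked.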
URL: From samar.khatiwala at earth.ox.ac.uk Mon Jun 24 10:38:19 2024 From: samar.khatiwala at earth.ox.ac.uk (Samar Khatiwala) Date: Mon, 24 Jun 2024 15:38:19 +0000 Subject: [petsc-users] Restart Krylov-Schur "Manually" In-Reply-To: References: <2A28BA96-F12D-44E1-91F4-12EA2B800D76@dsic.upv.es> <3DF11952-6113-49F7-ABB2-63F4F2DCDE45@earth.ox.ac.uk> Message-ID: Hi Matt, This would be for SNES and KSP. In many of my applications it would be too expensive to regenerate the Krylov space, which would also be problematic for Newton as I often do matrix-free calculations. I know how complex the underlying data structures are for these objects and entirely understand how difficult it would be to provide a general checkpointing facility. Still, I do dream that one day I?ll be able to do Save(snes,...) and Load(snes,?) ... Thanks, Samar On Jun 24, 2024, at 12:15 PM, Matthew Knepley wrote: On Mon, Jun 24, 2024 at 4:24?AM Samar Khatiwala > wrote: This Message Is From an External Sender This message came from outside your organization. Hi, Sorry to hijack this thread but I just want to add that this is a more general problem that I constantly face with PETSc. Not being able to checkpoint the complete state of a solver instance and restart a computation (at least not easily) has long been the biggest missing feature in PETSc for me. Which type of solver do you want to do this for? Some solvers, like Newton, just need the current iterate, which we do. You could imagine saving Krylov spaces, but it is very often cheaper to regenerate them than to save and load them from disk (which tends to be under-provisioned). Thanks, Matt Thanks, Samar On Jun 24, 2024, at 9:14 AM, Jose E. Roman via petsc-users > wrote: This Message Is From an External Sender This message came from outside your organization. Unfortunately there is no support for this. If you requested several eigenvalues and the solver has converged some of them already, then it would be possible to stop the run, save the eigenvectors and rerun with the eigenvectors passed via EPSSetDeflationSpace(). Jose > El 24 jun 2024, a las 0:21, Marildo Kola > escribi?: > > This Message Is From an External Sender > This message came from outside your organization. > Hello, > I am using SLEPc to calculate eigenvalues for fluid dynamics stability analysis (specifically studying bifurcations). We employ a MatShellOperation, which involves propagating Navier-Stokes to construct the Krylov space, and this particularly slows down our algorithm. The problem I am facing is that, after days of simulations, the simulation may die due to a time limit on the cluster, but the eigensolver (I am using the default Krylov-Schur) has not converged yet, leading to the loss of all the information computed up to that point. I wanted to inquire if it is possible to implement, with the available features, a restarting strategy, which can allow me, once the simulation stops (or after a given number of restart iterations of the solver), to save all the information necessary to restart the EPSSolver from the point it had stopped. > Thank you in advance, > Best regards, Marildo Kola -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZMeCWsu9Ah27To5Ol1-bQX3iJD0vUKjgJRbqyWvsTfTCcWaq5SCd1TrtLJBrASH0OQcLIcPjoloT_p0TASaEArdmYJjfQ_dNXgYjai8$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Mon Jun 24 10:39:35 2024 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 24 Jun 2024 11:39:35 -0400 Subject: [petsc-users] Modelica + PETSc? In-Reply-To: References: Message-ID: On Mon, Jun 24, 2024 at 10:29?AM Zou, Ling wrote: > This is the website I normally refer to > > https://urldefense.us/v3/__https://openmodelica.org/doc/OpenModelicaUsersGuide/latest/solving.html__;!!G_uCfscf7eWS!atB8LuQrlGQnbi8lXYaGJKrUHYTfhXYVS8-QcBlSPWc_cjEPT8rDhboXFr08Hx6cSDhSTlwO2WEXOpoY6C5F$ > > > > Looks like DASSL is the default solver. > > That is what I would have guessed. DASSL is a good solver, but quite dated. I think PETSc can solve those problems, and more scalably. We would be happy to give advice on conforming to their interface. Thanks, Matt > PS: I was playing with Modelica with some toy problem I have, which solves > fine but could not hold on with the steady-state solution for some reason. > Maybe I did it wrong, or maybe I am not familiar with the solver. That was > the reason of the Modelica+PETSc question since I am quite familiar with > PETSc. Also, the combination seems to be a powerful pair. > > > > -Ling > > > > *From: *Matthew Knepley > *Date: *Monday, June 24, 2024 at 6:12 AM > *To: *Zou, Ling > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] Modelica + PETSc? > > On Sun, Jun 23, 2024 at 5: 04 PM Zou, Ling via petsc-users mcs. anl. gov> wrote: Hi all, I am just curious ? any effort trying to > include PETSc as Modelica?s solution option? (Modelica forum or email list > seem to be quite dead > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender * > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > On Sun, Jun 23, 2024 at 5:04?PM Zou, Ling via petsc-users < > petsc-users at mcs.anl.gov> wrote: > > Hi all, I am just curious ? any effort trying to include PETSc as > Modelica?s solution option? > > (Modelica forum or email list seem to be quite dead so asking here.) > > > > I had not heard of it before. I looked at the 3.6 specification, but it > did not sy how the generated DAE were solved, or > > how to interface packages. Do they have documentation on that? > > > > Thanks, > > > > Matt > > > > > > -Ling > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!atB8LuQrlGQnbi8lXYaGJKrUHYTfhXYVS8-QcBlSPWc_cjEPT8rDhboXFr08Hx6cSDhSTlwO2WEXOjQrgtOM$ > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!atB8LuQrlGQnbi8lXYaGJKrUHYTfhXYVS8-QcBlSPWc_cjEPT8rDhboXFr08Hx6cSDhSTlwO2WEXOjQrgtOM$ -------------- next part -------------- An HTML attachment was scrubbed... 
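To make the suggestion concrete for anyone reading along: the systems a Modelica tool generates are implicit DAEs of the form F(t, u, u') = 0, which is the form PETSc's TS integrators accept through TSSetIFunction. The sketch below is not generated by Modelica and uses a made-up two-equation index-1 DAE; it only shows the shape a TS-based driver takes with the BDF integrator, with all names illustrative.

#include <petscts.h>

/* Residual of the DAE  F(t,u,u') = [ u0' + u0 - u1 ; u0 + u1 - 2 ] = 0 */
static PetscErrorCode IFunction(TS ts, PetscReal t, Vec U, Vec Udot, Vec F, void *ctx)
{
  const PetscScalar *u, *udot;
  PetscScalar       *f;

  PetscFunctionBeginUser;
  PetscCall(VecGetArrayRead(U, &u));
  PetscCall(VecGetArrayRead(Udot, &udot));
  PetscCall(VecGetArray(F, &f));
  f[0] = udot[0] + u[0] - u[1]; /* differential equation */
  f[1] = u[0] + u[1] - 2.0;     /* algebraic constraint  */
  PetscCall(VecRestoreArray(F, &f));
  PetscCall(VecRestoreArrayRead(Udot, &udot));
  PetscCall(VecRestoreArrayRead(U, &u));
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* Shifted Jacobian  a*dF/du' + dF/du  required by implicit TS methods */
static PetscErrorCode IJacobian(TS ts, PetscReal t, Vec U, Vec Udot, PetscReal a, Mat A, Mat P, void *ctx)
{
  PetscInt    idx[2] = {0, 1};
  PetscScalar v[4]   = {a + 1.0, -1.0, 1.0, 1.0};

  PetscFunctionBeginUser;
  PetscCall(MatSetValues(P, 2, idx, 2, idx, v, INSERT_VALUES));
  PetscCall(MatAssemblyBegin(P, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(P, MAT_FINAL_ASSEMBLY));
  PetscFunctionReturn(PETSC_SUCCESS);
}

int main(int argc, char **argv)
{
  TS          ts;
  Vec         u;
  Mat         J;
  PetscScalar *x;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(VecCreateSeq(PETSC_COMM_SELF, 2, &u));
  PetscCall(VecGetArray(u, &x));
  x[0] = 0.0; x[1] = 2.0;              /* consistent initial condition */
  PetscCall(VecRestoreArray(u, &x));
  PetscCall(MatCreateSeqAIJ(PETSC_COMM_SELF, 2, 2, 2, NULL, &J));

  PetscCall(TSCreate(PETSC_COMM_SELF, &ts));
  PetscCall(TSSetType(ts, TSBDF));
  PetscCall(TSSetIFunction(ts, NULL, IFunction, NULL));
  PetscCall(TSSetIJacobian(ts, J, J, IJacobian, NULL));
  PetscCall(TSSetTimeStep(ts, 0.01));
  PetscCall(TSSetMaxTime(ts, 1.0));
  PetscCall(TSSetExactFinalTime(ts, TS_EXACTFINALTIME_MATCHSTEP));
  PetscCall(TSSetFromOptions(ts));
  PetscCall(TSSolve(ts, u));

  PetscCall(TSDestroy(&ts));
  PetscCall(MatDestroy(&J));
  PetscCall(VecDestroy(&u));
  PetscCall(PetscFinalize());
  return 0;
}

Hooking a driver of this shape up behind the C code that OpenModelica generates, in place of DASSL, would be the interfacing work discussed above; options such as -ts_monitor and the usual -snes_/-ksp_ options then control the nonlinear and linear solves at each step.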
URL: From knepley at gmail.com Mon Jun 24 10:41:48 2024 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 24 Jun 2024 11:41:48 -0400 Subject: [petsc-users] Restart Krylov-Schur "Manually" In-Reply-To: References: <2A28BA96-F12D-44E1-91F4-12EA2B800D76@dsic.upv.es> <3DF11952-6113-49F7-ABB2-63F4F2DCDE45@earth.ox.ac.uk> Message-ID: On Mon, Jun 24, 2024 at 11:38?AM Samar Khatiwala < samar.khatiwala at earth.ox.ac.uk> wrote: > Hi Matt, > > This would be for SNES and KSP. In many of my applications it would be too > expensive to regenerate the Krylov space, which would also be problematic > for Newton as I often do matrix-free calculations. > > I know how complex the underlying data structures are for these objects > and entirely understand how difficult it would be to provide a general > checkpointing facility. Still, I do dream that one day I?ll be able to do > Save(snes,...) and Load(snes,?) ... > Let's talk specifically about SNES. I think this works now. It would be good to find out why you think it does not. You can do SNESView() and it will serialize the solver, and VecView() to serialize the current solution. Then you SNESLoad() and VecLoad() and call SNESSolve() with that solution as the initial guess. Thanks, Matt > Thanks, > > Samar > > On Jun 24, 2024, at 12:15 PM, Matthew Knepley wrote: > > On Mon, Jun 24, 2024 at 4:24?AM Samar Khatiwala < > samar.khatiwala at earth.ox.ac.uk> wrote: > >> This Message Is From an External Sender >> This message came from outside your organization. >> >> Hi, >> >> Sorry to hijack this thread but I just want to add that this is a more >> general problem that I constantly face with PETSc. Not being able to >> checkpoint the complete state of a solver instance and restart a >> computation (at least not easily) has long been the biggest missing feature >> in PETSc for me. >> > > Which type of solver do you want to do this for? Some solvers, like > Newton, just need the current iterate, which we do. You could imagine > saving Krylov spaces, but it is very often cheaper to regenerate them than > to save and load them from disk (which tends to be under-provisioned). > > Thanks, > > Matt > > >> Thanks, >> >> Samar >> >> On Jun 24, 2024, at 9:14 AM, Jose E. Roman via petsc-users < >> petsc-users at mcs.anl.gov> wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. >> >> Unfortunately there is no support for this. >> >> If you requested several eigenvalues and the solver has converged some of them already, then it would be possible to stop the run, save the eigenvectors and rerun with the eigenvectors passed via EPSSetDeflationSpace(). >> >> Jose >> >> >> > El 24 jun 2024, a las 0:21, Marildo Kola escribi?: >> > >> > This Message Is From an External Sender >> > This message came from outside your organization. >> > Hello, >> > I am using SLEPc to calculate eigenvalues for fluid dynamics stability analysis (specifically studying bifurcations). We employ a MatShellOperation, which involves propagating Navier-Stokes to construct the Krylov space, and this particularly slows down our algorithm. The problem I am facing is that, after days of simulations, the simulation may die due to a time limit on the cluster, but the eigensolver (I am using the default Krylov-Schur) has not converged yet, leading to the loss of all the information computed up to that point. 
I wanted to inquire if it is possible to implement, with the available features, a restarting strategy, which can allow me, once the simulation stops (or after a given number of restart iterations of the solver), to save all the information necessary to restart the EPSSolver from the point it had stopped. >> > Thank you in advance, >> > Best regards, Marildo Kola >> >> >> >> >> >> > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!eqOVZGVAqNSjCctZz15A80QkgJt28WLpriJPEkHdcCiN1vrJ4RfXPebAjRgUJQsG16l6LF3_JF75uEgZQtwr$ > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!eqOVZGVAqNSjCctZz15A80QkgJt28WLpriJPEkHdcCiN1vrJ4RfXPebAjRgUJQsG16l6LF3_JF75uEgZQtwr$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Mon Jun 24 10:45:26 2024 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 24 Jun 2024 11:45:26 -0400 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> Message-ID: On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li wrote: > Thank you Pierre for your information. Do we have a conclusion for my > original question about the parallelization efficiency for different stages > of KSP Solve? Do we need to do more testing to figure out the issues? Thank > you, Yongzhong From: > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > > Thank you Pierre for your information. Do we have a conclusion for my > original question about the parallelization efficiency for different stages > of KSP Solve? Do we need to do more testing to figure out the issues? > We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!aQJpmm5W6l6FUiumnIPmkouzwzNUfx-Dyq04i1O2KS_InQGk6qjI7wUir0Hx6QEUQE2AMiJDsez3x4zRO7V_$ The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference. Thanks, Matt > Thank you, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Sunday, June 23, 2024 at 12:41?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > > > > > On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. 
> > Yeah, I ran my program again using -mat_view::ascii_info and set > MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix > to be seqaijmkl type (I?ve attached a few as below) > > --> Setting up matrix-vector products... > > > > Mat Object: 1 MPI process > > type: seqaijmkl > > rows=16490, cols=35937 > > total: nonzeros=128496, allocated nonzeros=128496 > > total number of mallocs used during MatSetValues calls=0 > > not using I-node routines > > Mat Object: 1 MPI process > > type: seqaijmkl > > rows=16490, cols=35937 > > total: nonzeros=128496, allocated nonzeros=128496 > > total number of mallocs used during MatSetValues calls=0 > > not using I-node routines > > > > --> Solving the system... > > > > Excitation 1 of 1... > > > > ================================================ > > Iterative solve completed in 7435 ms. > > CONVERGED: rtol. > > Iterations: 72 > > Final relative residual norm: 9.22287e-07 > > ================================================ > > [CPU TIME] System solution: 2.27160000e+02 s. > > [WALL TIME] System solution: 7.44387218e+00 s. > > However, it seems to me that there were still no MKL outputs even I set > MKL_VERBOSE to be 1. Although, I think it should be many spmv operations > when doing KSPSolve(). Do you see the possible reasons? > > > > SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS > is. > > > > Thanks, > > Pierre > > > > Thanks, > > Yongzhong > > > > > > *From: *Matthew Knepley > *Date: *Saturday, June 22, 2024 at 5:56?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , Pierre Jolivet < > pierre at joliv.et>, petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > > > On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 > MKL_VERBOSE Intel(R) MKL 2019. 0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. 
> > > > ZjQcmQRYFpfptBannerEnd > > MKL_VERBOSE=1 ./ex1 > > > matrix nonzeros = 100, allocated nonzeros = 100 > > MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector Neural Network Instructions enabled > processors, Lnx 2.50GHz lp64 gnu_thread > > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) > 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) > 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) > 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) > 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) > 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) > 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > 
ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) > 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) > 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) > 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) > 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) > 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) > 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > Yes, for petsc example, there are MKL outputs, but for my own program. All > I did is to change the matrix type from MATAIJ to MATAIJMKL to get > optimized performance for spmv from MKL. Should I expect to see any MKL > outputs in this case? > > > > Are you sure that the type changed? You can MatView() the matrix with > format ascii_info to see. > > > > Thanks, > > > > Matt > > > > > > Thanks, > > Yongzhong > > > > *From: *Junchao Zhang > *Date: *Saturday, June 22, 2024 at 9:40?AM > *To: *Yongzhong Li > *Cc: *Pierre Jolivet , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > No, you don't. It is strange. Perhaps you can you run a petsc example > first and see if MKL is really used > > $ cd src/mat/tests > > $ make ex1 > > $ MKL_VERBOSE=1 ./ex1 > > > --Junchao Zhang > > > > > > On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > I am using > > export MKL_VERBOSE=1 > > ./xx > > in the bash file, do I have to use - ksp_converged_reason? > > Thanks, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:47?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > How do you set the variable? 
> > > > $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason > > MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 > architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled > processors, Lnx 2.80GHz lp64 intel_thread > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > [...] > > > > On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hello all, > > I set MKL_VERBOSE = 1, but observed no print output specific to the use of > MKL. Does PETSc enable this verbose output? > > Best, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:36?AM > *To: *Junchao Zhang > *Cc: *Yongzhong Li , > petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > > > > > On 21 Jun 2024, at 6:42?AM, Junchao Zhang wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > I remember there are some MKL env vars to print MKL routines called. > > > > The environment variable is MKL_VERBOSE > > > > Thanks, > > Pierre > > > > Maybe we can try it to see what MKL routines are really used and then we > can understand why some petsc functions did not speed up > > > --Junchao Zhang > > > > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > Hi Barry, sorry for my last results. I didn?t fully understand the stage > profiling and logging in PETSc, now I only record KSPSolve() stage of my > program. Some sample codes are as follow, > > // Static variable to keep track of the stage counter > > static int stageCounter = 1; > > > > // Generate a unique stage name > > std::ostringstream oss; > > oss << "Stage " << stageCounter << " of Code"; > > std::string stageName = oss.str(); > > > > // Register the stage > > PetscLogStage stagenum; > > > > PetscLogStageRegister(stageName.c_str(), &stagenum); > > PetscLogStagePush(stagenum); > > > > *KSPSolve(*ksp_ptr, b, x);* > > > > PetscLogStagePop(); > > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other > stages where each one is KSPSolve() call. > > To provide some additional backgrounds, if you recall, I have been trying > to get efficient iterative solution using multithreading. I found out by > compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to > perform sparse matrix-vector multiplication faster, I am using > MATSEQAIJMKL. This makes the shell matrix vector product in each iteration > scale well with the #of threads. However, I found out the total GMERS solve > time (~KSPSolve() time) is not scaling well the #of threads. > > From the logging results I learned that when performing KSPSolve(), there > are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs > using different number of threads and plotted the time consumption for > PCApply() and KSPGMERSOrthog() against #of thread. 
I found out these two > operations are not scaling with the threads at all! My results are attached > as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, > KSPGMERSOrthog() will have many vector operations, so why these two parts > can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > > > *From: *Barry Smith > *Date: *Friday, June 14, 2024 at 11:36?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > I am a bit confused. Without the initial guess computation, there are > still a bunch of events I don't understand > > > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are > each done the same number of times) > > > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > > > Finally there are a huge number of > > > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > > > Are you making calls to all these routines? Are you doing this inside your > MatMult() or before you call KSPSolve? > > > > The reason I wanted you to make a simpler run without the initial guess > code is that your events are far more complicated than would be produced by > GMRES alone so it is not possible to understand the behavior you are seeing > without fully understanding all the events happening in the code. > > > > Barry > > > > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > > > Thanks, I have attached the results without using any KSPGuess. At low > frequency, the iteration steps are quite close to the one with KSPGuess, > specifically > > KSPGuess Object: 1 MPI process > > type: fischer > > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are > significant higher than the one with KSPGuess, I have attahced both of the > results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used > for a baseline comparsion? What are we comparing here? How does it relate > to the performance issue/bottleneck I found? ?*I have noticed that the > time taken by **KSPSolve** is **almost two times **greater than the CPU > time for matrix-vector product multiplied by the number of iteration*? 
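For completeness, the Fischer guess object quoted above ("type: fischer, Model 1, size 200") is usually enabled either from the command line or in code; the values 1 and 200 below simply mirror that output and are otherwise an assumption:

  -ksp_guess_type fischer -ksp_guess_fischer_model 1,200

or, programmatically,

  PetscCall(KSPSetUseFischerGuess(ksp, 1, 200)); /* Fischer model 1 with a history of 200 previous solutions */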
> > Thank you! > Yongzhong > > > > *From: *Barry Smith > *Date: *Thursday, June 13, 2024 at 2:14?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > Can you please run the same thing without the KSPGuess option(s) for a > baseline comparison? > > > > Thanks > > > > Barry > > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hi Matt, > > I have rerun the program with the keys you provided. The system output > when performing ksp solve and the final petsc log output were stored in a > .txt file attached for your reference. > > Thanks! > Yongzhong > > > > *From: *Matthew Knepley > *Date: *Wednesday, June 12, 2024 at 6:46?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > ????????? knepley at gmail.com ????????????????? > > > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Dear PETSc?s developers, I hope this email finds you well. I am currently > working on a project using PETSc and have encountered a performance issue > with the KSPSolve function. Specifically, I have noticed that the time > taken by KSPSolve is > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > Dear PETSc?s developers, > > I hope this email finds you well. > > I am currently working on a project using PETSc and have encountered a > performance issue with the KSPSolve function. Specifically, *I have > noticed that the time taken by **KSPSolve** is **almost two times **greater > than the CPU time for matrix-vector product multiplied by the number of > iteration steps*. I use C++ chrono to record CPU time. > > For context, I am using a shell system matrix A. Despite my efforts to > parallelize the matrix-vector product (Ax), the overall solve time > remains higher than the matrix vector product per iteration indicates > when multiple threads were used. Here are a few details of my setup: > > - *Matrix Type*: Shell system matrix > - *Preconditioner*: Shell PC > - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK > library, multithreading is enabled > > I have considered several potential reasons, such as preconditioner setup, > additional solver operations, and the inherent overhead of using a shell > system matrix. *However, since KSPSolve is a high-level API, I have been > unable to pinpoint the exact cause of the increased solve time.* > > Have you observed the same issue? Could you please provide some experience > on how to diagnose and address this performance discrepancy? Any > insights or recommendations you could offer would be greatly appreciated. > > > > For any performance question like this, we need to see the output of your > code run with > > > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > > > Thanks, > > > > Matt > > > > Thank you for your time and assistance. 
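Since the setup under discussion is a shell operator with a shell preconditioner, a skeleton of how such a solver is typically wired up is sketched below; MyMatMult, MyPCApply, AppCtx and the sizes n/N are placeholders, not code from this thread:

  #include <petscksp.h>

  typedef struct {
    /* whatever data the application needs to apply A and the preconditioner */
  } AppCtx;

  static PetscErrorCode MyMatMult(Mat A, Vec x, Vec y)
  {
    AppCtx *ctx;
    PetscFunctionBeginUser;
    PetscCall(MatShellGetContext(A, &ctx));
    /* y = A*x using the application's (threaded) kernel */
    PetscFunctionReturn(PETSC_SUCCESS);
  }

  static PetscErrorCode MyPCApply(PC pc, Vec x, Vec y)
  {
    AppCtx *ctx;
    PetscFunctionBeginUser;
    PetscCall(PCShellGetContext(pc, &ctx));
    /* y = M^{-1}*x, the application's preconditioner */
    PetscFunctionReturn(PETSC_SUCCESS);
  }

  /* in the setup routine, with local size n, global size N, and Vecs b, x already created */
  PetscCall(MatCreateShell(PETSC_COMM_WORLD, n, n, N, N, &user, &A));
  PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MyMatMult));
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCSHELL));
  PetscCall(PCShellSetContext(pc, &user));
  PetscCall(PCShellSetApply(pc, MyPCApply));
  PetscCall(KSPSetFromOptions(ksp)); /* picks up -ksp_view -ksp_monitor_true_residual -log_view etc. */
  PetscCall(KSPSolve(ksp, b, x));

With the shell operator registered this way, the -log_view output requested above attributes the MatMult time to the user kernel, which makes it easier to see where the rest of the KSPSolve() time goes.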
> > Best regards, > > Yongzhong > > ----------------------------------------------------------- > > *Yongzhong Li* > > PhD student | Electromagnetics Group > > Department of Electrical & Computer Engineering > > University of Toronto > > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!aQJpmm5W6l6FUiumnIPmkouzwzNUfx-Dyq04i1O2KS_InQGk6qjI7wUir0Hx6QEUQE2AMiJDsez3x3b414pj$ > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!aQJpmm5W6l6FUiumnIPmkouzwzNUfx-Dyq04i1O2KS_InQGk6qjI7wUir0Hx6QEUQE2AMiJDsez3x7MKsDeh$ > > > > > > > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!aQJpmm5W6l6FUiumnIPmkouzwzNUfx-Dyq04i1O2KS_InQGk6qjI7wUir0Hx6QEUQE2AMiJDsez3x7MKsDeh$ > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!aQJpmm5W6l6FUiumnIPmkouzwzNUfx-Dyq04i1O2KS_InQGk6qjI7wUir0Hx6QEUQE2AMiJDsez3x7MKsDeh$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From ligang0309 at gmail.com Sun Jun 23 07:27:23 2024 From: ligang0309 at gmail.com (Gang Li) Date: Sun, 23 Jun 2024 20:27:23 +0800 Subject: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin Message-ID: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> Hi, I have configured the PETSc under Cygwin by: cygpath -u `cygpath -ms '/cygdrive/c/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64'` ./configure --with-cc='win32fe_icl' --with-fc='win32fe_ifort' --with-cxx='win32fe_icl' \ --with-precision=double --with-scalar-type=complex \ --with-shared-libraries=0 \ --with-mpi=0 \ --with-blaslapack-lib='-L/cygdrive/c/PROGRA~2/INTELS~1/COMPIL~2/windows/mkl/lib/intel64 mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib' It seems to be successful. When make it, I faced a problem: $ make PETSC_DIR=/cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2 PETSC_ARCH=arch-mswin-c-debug all /usr/bin/python3 ./config/gmakegen.py --petsc-arch=arch-mswin-c-debug makefile:25: /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf /rules_util.mk: No such file or directory gmake[1]: *** No rule to make target '/cygdrive/e/Major/Codes/libraries/PETSc/pe tsc-3.21.2/lib/petsc/conf/rules_util.mk'. Stop. make: *** [GNUmakefile:9: all] Error 2 Could you help to check it? Thanks. Gang Li -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: configure.log Type: application/octet-stream Size: 2274908 bytes Desc: not available URL: From bsmith at petsc.dev Mon Jun 24 11:06:09 2024 From: bsmith at petsc.dev (Barry Smith) Date: Mon, 24 Jun 2024 12:06:09 -0400 Subject: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin In-Reply-To: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> References: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> Message-ID: <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> Do ls /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf/ > On Jun 23, 2024, at 8:27?AM, Gang Li wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi, > > I have configured the PETSc under Cygwin by: > > cygpath -u `cygpath -ms '/cygdrive/c/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64'` > ./configure --with-cc='win32fe_icl' --with-fc='win32fe_ifort' --with-cxx='win32fe_icl' \ > --with-precision=double --with-scalar-type=complex \ > --with-shared-libraries=0 \ > --with-mpi=0 \ > --with-blaslapack-lib='-L/cygdrive/c/PROGRA~2/INTELS~1/COMPIL~2/windows/mkl/lib/intel64 mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib' > > It seems to be successful. When make it, I faced a problem: > > $ make PETSC_DIR=/cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2 PETSC_ARCH=arch-mswin-c-debug all > /usr/bin/python3 ./config/gmakegen.py --petsc-arch=arch-mswin-c-debug > makefile:25: /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf > /rules_util.mk: No such file or directory > gmake[1]: *** No rule to make target '/cygdrive/e/Major/Codes/libraries/PETSc/pe > tsc-3.21.2/lib/petsc/conf/rules_util.mk'. Stop. > make: *** [GNUmakefile:9: all] Error 2 > > Could you help to check it? > Thanks. > > Gang Li > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From balay.anl at fastmail.org Mon Jun 24 11:11:06 2024 From: balay.anl at fastmail.org (Satish Balay) Date: Mon, 24 Jun 2024 11:11:06 -0500 (CDT) Subject: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin In-Reply-To: <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> References: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> Message-ID: <21e32b88-aed2-a618-3e3c-dca47c6bc456@fastmail.org> An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Mon Jun 24 11:35:43 2024 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Mon, 24 Jun 2024 11:35:43 -0500 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> Message-ID: Let me run some examples on our end to see whether the code calls expected functions. --Junchao Zhang On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley wrote: > On Mon, Jun 24, 2024 at 11: 21 AM Yongzhong Li utoronto. ca> wrote: Thank you Pierre for your information. Do we have a > conclusion for my original question about the parallelization efficiency > for different stages of > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > >> Thank you Pierre for your information. 
>> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mmolinos at us.es  Mon Jun 24 13:50:02 2024
From: mmolinos at us.es (MIGUEL MOLINOS PEREZ)
Date: Mon, 24 Jun 2024 18:50:02 +0000
Subject: [petsc-users] Error type "Petsc has generated inconsistent data"
Message-ID: <4CA5B961-149B-4B30-8A63-197AC0BBB5BC@us.es>

Dear all,

I am trying to assemble a matrix A whose coefficients I need in order to assemble the RHS (F) and its Jacobian (J) in a TS type of problem. Determining each coefficient of A involves solving a small non-linear problem (1 dof) using the serial version of SNES. By the way, the matrix A is of type MATMPIAIJ.

The weird part is: if I pass the matrix A to the TS routines inside a user-context structure, without even accessing the values inside A, I get the following error message:

> [1]PETSC ERROR: Petsc has generated inconsistent data
> [1]PETSC ERROR: MPI_Allreduce() called in different locations (code lines)
> on different processors

But if I comment out the line which calls the SNES routine used to evaluate the coefficients of A, I don't get the error message.

Some additional context:

- The SNES routine is called once at a time inside each rank.
- I use PetscCall(SNESCreate(PETSC_COMM_SELF, &snes));
- The vectors inside the SNES function are defined as follows: VecCreateSeq(PETSC_COMM_SELF, 1, &Y)
- All the input fields for SNES are also sequential.

Any feedback is greatly appreciated!
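For reference, a bare-bones sketch of the pattern described above: one 1-dof serial (PETSC_COMM_SELF) SNES solve per locally owned coefficient, whose result is then inserted into the parallel MATMPIAIJ matrix. The residual, the diagonal-only fill and all names are placeholders, not the actual model; only the final MatAssemblyBegin/End are collective on the parallel communicator:

  #include <petscsnes.h>

  /* 1-dof residual for the local nonlinear problem; the equation is a stand-in */
  static PetscErrorCode LocalResidual(SNES snes, Vec y, Vec f, void *ctx)
  {
    const PetscScalar *yy;
    PetscScalar       *ff;

    PetscFunctionBeginUser;
    PetscCall(VecGetArrayRead(y, &yy));
    PetscCall(VecGetArray(f, &ff));
    ff[0] = yy[0] * yy[0] - 2.0; /* placeholder equation */
    PetscCall(VecRestoreArray(f, &ff));
    PetscCall(VecRestoreArrayRead(y, &yy));
    PetscFunctionReturn(PETSC_SUCCESS);
  }

  /* Fill the locally owned rows of the parallel matrix A; each rank works independently,
     obtaining every coefficient from its own serial SNES solve. */
  static PetscErrorCode FillCoefficients(Mat A)
  {
    SNES        snes;
    Vec         y, r;
    PetscInt    row, rstart, rend;
    PetscScalar val;

    PetscFunctionBeginUser;
    PetscCall(SNESCreate(PETSC_COMM_SELF, &snes));
    PetscCall(VecCreateSeq(PETSC_COMM_SELF, 1, &y));
    PetscCall(VecDuplicate(y, &r));
    PetscCall(SNESSetFunction(snes, r, LocalResidual, NULL));
    PetscCall(SNESSetFromOptions(snes)); /* run with -snes_fd (or set a Jacobian) so the 1x1 Jacobian is formed */

    PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
    for (row = rstart; row < rend; row++) {
      const PetscScalar *yy;

      PetscCall(VecSet(y, 1.0));           /* initial guess */
      PetscCall(SNESSolve(snes, NULL, y)); /* purely local: no collectives on PETSC_COMM_WORLD here */
      PetscCall(VecGetArrayRead(y, &yy));
      val = yy[0];
      PetscCall(VecRestoreArrayRead(y, &yy));
      PetscCall(MatSetValue(A, row, row, val, INSERT_VALUES));
    }
    PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY)); /* collective: every rank must reach these */
    PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

    PetscCall(VecDestroy(&r));
    PetscCall(VecDestroy(&y));
    PetscCall(SNESDestroy(&snes));
    PetscFunctionReturn(PETSC_SUCCESS);
  }

As a side note, in a --with-debugging build PETSc checks that all ranks of a communicator reach collective calls (MPI_Allreduce and friends) at the same code line, which is the check that produces the "called in different locations" message in the log below.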
Thanks, Miguel [test-Mass-Transport-Master-Equation-PETSc-Backward-Euler][MgHx-hcp-x5x5x5-cell] t=0.0000e+00 dt=1.0000e-07 it=( 0, 0) 0 KSP Residual norm 4.776631889125e-07 1 KSP Residual norm 6.807505564283e-17 [0]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [0]PETSC ERROR: Petsc has generated inconsistent data [5]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [5]PETSC ERROR: Petsc has generated inconsistent data [7]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [7]PETSC ERROR: Petsc has generated inconsistent data [0]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [1]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [1]PETSC ERROR: Petsc has generated inconsistent data [1]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [2]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [2]PETSC ERROR: Petsc has generated inconsistent data [2]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [3]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [3]PETSC ERROR: Petsc has generated inconsistent data [3]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [4]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [4]PETSC ERROR: Petsc has generated inconsistent data [4]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [5]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [6]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [6]PETSC ERROR: Petsc has generated inconsistent data [6]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [7]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [0]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [1]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [2]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [2]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [2]PETSC ERROR: [3]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [3]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [3]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!YCo3JjgVyVqwScfRf05FevGAOdJG2APEIkhxmpYQmmJFmnrOBYrCKeun20x6gytf7m2IpXZPhyib8zJmPUCYCw$ for trouble shooting. [4]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! 
[4]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [4]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!YCo3JjgVyVqwScfRf05FevGAOdJG2APEIkhxmpYQmmJFmnrOBYrCKeun20x6gytf7m2IpXZPhyib8zJmPUCYCw$ for trouble shooting. [5]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [5]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [5]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!YCo3JjgVyVqwScfRf05FevGAOdJG2APEIkhxmpYQmmJFmnrOBYrCKeun20x6gytf7m2IpXZPhyib8zJmPUCYCw$ for trouble shooting. [5]PETSC ERROR: Petsc Release Version 3.21.0, unknown [6]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [6]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [6]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!YCo3JjgVyVqwScfRf05FevGAOdJG2APEIkhxmpYQmmJFmnrOBYrCKeun20x6gytf7m2IpXZPhyib8zJmPUCYCw$ for trouble shooting. [6]PETSC ERROR: Petsc Release Version 3.21.0, unknown [6]PETSC ERROR: [7]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [7]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [0]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [0]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!YCo3JjgVyVqwScfRf05FevGAOdJG2APEIkhxmpYQmmJFmnrOBYrCKeun20x6gytf7m2IpXZPhyib8zJmPUCYCw$ for trouble shooting. [0]PETSC ERROR: Petsc Release Version 3.21.0, unknown [0]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [1]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [1]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!YCo3JjgVyVqwScfRf05FevGAOdJG2APEIkhxmpYQmmJFmnrOBYrCKeun20x6gytf7m2IpXZPhyib8zJmPUCYCw$ for trouble shooting. See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!YCo3JjgVyVqwScfRf05FevGAOdJG2APEIkhxmpYQmmJFmnrOBYrCKeun20x6gytf7m2IpXZPhyib8zJmPUCYCw$ for trouble shooting. 
[2]PETSC ERROR: Petsc Release Version 3.21.0, unknown [2]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [2]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x [3]PETSC ERROR: Petsc Release Version 3.21.0, unknown [3]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [3]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x [4]PETSC ERROR: Petsc Release Version 3.21.0, unknown [4]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [4]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x [5]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [5]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [6]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!YCo3JjgVyVqwScfRf05FevGAOdJG2APEIkhxmpYQmmJFmnrOBYrCKeun20x6gytf7m2IpXZPhyib8zJmPUCYCw$ for trouble shooting. [7]PETSC ERROR: Petsc Release Version 3.21.0, unknown [7]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [7]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x [1]PETSC ERROR: Petsc Release Version 3.21.0, unknown [1]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [1]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x [0]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [2]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [2]PETSC ERROR: [3]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [3]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [4]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [4]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [5]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [5]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [0]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [0]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [0]PETSC ERROR: #4 SNESLineSearchApply_BT() at 
/Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [1]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [1]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [1]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [2]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [2]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [2]PETSC ERROR: [3]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [3]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [3]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [4]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [4]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [4]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [5]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [5]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [5]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [5]PETSC ERROR: [6]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [6]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [6]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [6]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [6]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [7]PETSC ERROR: #1 VecXDot_MPI_Default() at /Users/migmolper/petsc/include/../src/vec/vec/impls/mpi/pvecimpl.h:107 [7]PETSC ERROR: #2 VecDot_MPI() at /Users/migmolper/petsc/src/vec/vec/impls/mpi/pvec2.c:10 [7]PETSC ERROR: #3 VecDot() at /Users/migmolper/petsc/src/vec/vec/interface/rvector.c:120 [7]PETSC ERROR: #4 SNESLineSearchApply_CP() at /Users/migmolper/petsc/src/snes/linesearch/impls/cp/linesearchcp.c:28 [0]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [0]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [0]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [0]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [1]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [1]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [1]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [1]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [1]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [1]PETSC ERROR: #9 TSStep_Theta() at 
/Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [1]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [1]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [2]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [2]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [2]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [2]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [2]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [2]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [3]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [3]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [3]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [3]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [3]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [3]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [4]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [4]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [4]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [4]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [4]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [4]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [5]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [5]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [5]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [5]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [5]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [6]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [6]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [6]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [6]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [6]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [6]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [7]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [7]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [7]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [7]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [7]PETSC ERROR: #9 TSStep_Theta() at 
/Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [7]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [7]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [0]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [0]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [0]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [0]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [2]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [3]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [1]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [4]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [5]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [6]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [7]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From knepley at gmail.com Mon Jun 24 14:52:30 2024 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 24 Jun 2024 15:52:30 -0400 Subject: [petsc-users] Error type "Petsc has generated inconsistent data" In-Reply-To: <4CA5B961-149B-4B30-8A63-197AC0BBB5BC@us.es> References: <4CA5B961-149B-4B30-8A63-197AC0BBB5BC@us.es> Message-ID:
On Mon, Jun 24, 2024 at 2:50 PM MIGUEL MOLINOS PEREZ wrote:
> Dear all,
>
> I am trying to assemble a matrix A with coefficients which I need to assemble the RHS (F) and its Jacobian (J) in a TS type of problem.
>
> Determining each coefficient of A involves the resolution of a small non-linear problem (1 dof) using the serial version of SNES. By the way, the matrix A is of the type "MATMPIAIJ".
>
> The weird part is, if I pass the matrix A to the TS routine inside of a user-context structure, without even accessing to the values inside of A, I got the following error message:
>
> > [1]PETSC ERROR: Petsc has generated inconsistent data
> > [1]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors
>
> But if I comment out the line which calls the SNES routine used to evaluate the coefficients inside of A, I don't get the error message.
>
> Some additional context:
> - The SNES routine is called once at a time inside of each rank.
> - I use PetscCall(SNESCreate(PETSC_COMM_SELF, &snes));
> - The vectors inside of the SNES function are defined as follows: VecCreateSeq(PETSC_COMM_SELF, 1, &Y)
> - All the input fields for SNES are also sequential.
>
> Any feedback is greatly appreciated!
There is something inconsistent among the processes. First, I would try running with a constant A. Then execute your nonlinear solve, but return the constant A. If that passes, then likely you are returning inconsistent results across processes with your solve.
Thanks,
Matt
> Thanks,
> Miguel
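As a rough illustration of both points, the per-rank serial SNES described above and the constant-A consistency check just suggested, a sketch along these lines may help. Everything in it is hypothetical (the residual g(y), the CoeffCtx data, and the routine name FillCoefficients are made up for illustration); it only shows the pattern of creating a PETSC_COMM_SELF solver per coefficient and short-circuiting it with a constant for the test:

#include <petscsnes.h>

typedef struct {
  PetscReal a, b; /* hypothetical data defining one scalar sub-problem */
} CoeffCtx;

/* residual of the made-up scalar problem g(y) = y + a*y^3 - b = 0 */
static PetscErrorCode ScalarResidual(SNES snes, Vec x, Vec f, void *ctx)
{
  CoeffCtx          *c = (CoeffCtx *)ctx;
  const PetscScalar *xx;
  PetscScalar       *ff;

  PetscFunctionBeginUser;
  PetscCall(VecGetArrayRead(x, &xx));
  PetscCall(VecGetArray(f, &ff));
  ff[0] = xx[0] + c->a * xx[0] * xx[0] * xx[0] - c->b;
  PetscCall(VecRestoreArray(f, &ff));
  PetscCall(VecRestoreArrayRead(x, &xx));
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* Fill the locally owned diagonal of the MPIAIJ matrix A. With
   use_constant == PETSC_TRUE the inner solve is skipped and 1.0 is
   inserted everywhere, which is the consistency test described above. */
PetscErrorCode FillCoefficients(Mat A, PetscBool use_constant)
{
  PetscInt rstart, rend;

  PetscFunctionBeginUser;
  PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
  for (PetscInt i = rstart; i < rend; ++i) {
    PetscScalar aii = 1.0;

    if (!use_constant) {
      SNES               snes;
      Vec                y, r;
      Mat                J;
      CoeffCtx           ctx = {1.0, (PetscReal)(i + 1)}; /* made-up data */
      const PetscScalar *yy;

      PetscCall(SNESCreate(PETSC_COMM_SELF, &snes));    /* serial, one per coefficient */
      PetscCall(SNESSetOptionsPrefix(snes, "coeff_"));  /* keep its options separate from the TS's SNES */
      PetscCall(VecCreateSeq(PETSC_COMM_SELF, 1, &y));
      PetscCall(VecDuplicate(y, &r));
      PetscCall(MatCreateSeqDense(PETSC_COMM_SELF, 1, 1, NULL, &J));
      PetscCall(SNESSetFunction(snes, r, ScalarResidual, &ctx));
      PetscCall(SNESSetJacobian(snes, J, J, SNESComputeJacobianDefault, NULL)); /* FD Jacobian of the 1x1 system */
      PetscCall(SNESSetFromOptions(snes));
      PetscCall(VecSet(y, 1.0));                        /* initial guess */
      PetscCall(SNESSolve(snes, NULL, y));
      PetscCall(VecGetArrayRead(y, &yy));
      aii = yy[0];
      PetscCall(VecRestoreArrayRead(y, &yy));
      PetscCall(MatDestroy(&J));
      PetscCall(VecDestroy(&r));
      PetscCall(VecDestroy(&y));
      PetscCall(SNESDestroy(&snes));
    }
    PetscCall(MatSetValue(A, i, i, aii, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
  PetscFunctionReturn(PETSC_SUCCESS);
}

Calling FillCoefficients(A, PETSC_TRUE) first, as suggested above, separates problems in the inner scalar solves from problems in how A (or the TS callbacks) are used across processes.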
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YlgDM4WJ1LZd1CuLZGNlMThj7OBlwkCu03P9vWxFt-dOODaCkEboruflECm3ntB9Grn1TGgk5Ic8lUlphb0M$
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From bsmith at petsc.dev Mon Jun 24 14:56:46 2024 From: bsmith at petsc.dev (Barry Smith) Date: Mon, 24 Jun 2024 15:56:46 -0400 Subject: [petsc-users] Error type "Petsc has generated inconsistent data" In-Reply-To: <4CA5B961-149B-4B30-8A63-197AC0BBB5BC@us.es> References: <4CA5B961-149B-4B30-8A63-197AC0BBB5BC@us.es> Message-ID:
The error is coming from the parallel SNES used by TS in TSTheta_SNESSolve(), not from your small SNES solver. Process 7 uses [7]PETSC ERROR: #4 SNESLineSearchApply_CP() while the rest use _BT(). I think this causes the problem, since they are different algorithms making different MPI calls. Are you perhaps setting the line search type somewhere, through the options database or in some other way that lets different processes end up with different values?
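If the intent is that every rank use the same line search, one way to rule this out is to pin the type explicitly on the SNES that the TS owns, rather than relying on options set from rank-dependent code paths. A minimal sketch, assuming the TS object is named ts (the helper name is made up):

#include <petscts.h>

/* Sketch: force one line-search type on the SNES that TS uses, so every
   rank runs the same algorithm (here "bt"; "cp" would also work, as long
   as it is the same on every process). */
static PetscErrorCode UseOneLineSearchType(TS ts)
{
  SNES           snes;
  SNESLineSearch linesearch;

  PetscFunctionBeginUser;
  PetscCall(TSGetSNES(ts, &snes));                 /* the SNES driven by TSTheta_SNESSolve() */
  PetscCall(SNESGetLineSearch(snes, &linesearch));
  PetscCall(SNESLineSearchSetType(linesearch, SNESLINESEARCHBT));
  PetscFunctionReturn(PETSC_SUCCESS);
}

The command-line form, -snes_linesearch_type bt, is applied uniformly on every rank by SNESSetFromOptions; trouble usually comes from options inserted programmatically (note the unused -sns_monitor above, reported with source: code) inside branches that only some ranks execute.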
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From mmolinos at us.es Mon Jun 24 15:18:09 2024 From: mmolinos at us.es (MIGUEL MOLINOS PEREZ) Date: Mon, 24 Jun 2024 20:18:09 +0000 Subject: [petsc-users] Error type "Petsc has generated inconsistent data" In-Reply-To: References: <4CA5B961-149B-4B30-8A63-197AC0BBB5BC@us.es> Message-ID:
Thanks Barry, that was the problem!
Thank you,
Miguel
On Jun 24, 2024, at 12:56 PM, Barry Smith wrote:
The error is coming from the parallel SNES used by TS in TSTheta_SNESSolve(), not from your small SNES solver. Process 7 uses [7]PETSC ERROR: #4 SNESLineSearchApply_CP() while the rest use _BT(). I think this causes the problem, since they are different algorithms making different MPI calls. Are you perhaps setting the line search type somewhere, through the options database or in some other way that lets different processes end up with different values?
On Jun 24, 2024, at 2:50 PM, MIGUEL MOLINOS PEREZ wrote:
Dear all,
I am trying to assemble a matrix A with coefficients which I need to assemble the RHS (F) and its Jacobian (J) in a TS type of problem.
Determining each coefficient of A involves the resolution of a small non-linear problem (1 dof) using the serial version of SNES. By the way, the matrix A is of the type "MATMPIAIJ".
The weird part is, if I pass the matrix A to the TS routine inside of a user-context structure, without even accessing to the values inside of A, I got the following error message: > [1]PETSC ERROR: Petsc has generated inconsistent data > [1]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) > on different processors But if I comment out the line which calls the SNES routine used to evaluate the coefficients inside of A, I don?t get the error message. Some additional context: - The SNES routine is called once at a time inside of each rank. - I use PetscCall(SNESCreate(PETSC_COMM_SELF, &snes)); - The vectors inside of the SNES function are defined as follows: VecCreateSeq(PETSC_COMM_SELF, 1, &Y) - All the input fields for SNES are also sequential. Any feedback is greatly appreciated! Thanks, Miguel [test-Mass-Transport-Master-Equation-PETSc-Backward-Euler][MgHx-hcp-x5x5x5-cell] t=0.0000e+00 dt=1.0000e-07 it=( 0, 0) 0 KSP Residual norm 4.776631889125e-07 1 KSP Residual norm 6.807505564283e-17 [0]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [0]PETSC ERROR: Petsc has generated inconsistent data [5]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [5]PETSC ERROR: Petsc has generated inconsistent data [7]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [7]PETSC ERROR: Petsc has generated inconsistent data [0]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [1]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [1]PETSC ERROR: Petsc has generated inconsistent data [1]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [2]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [2]PETSC ERROR: Petsc has generated inconsistent data [2]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [3]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [3]PETSC ERROR: Petsc has generated inconsistent data [3]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [4]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [4]PETSC ERROR: Petsc has generated inconsistent data [4]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [5]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [6]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [6]PETSC ERROR: Petsc has generated inconsistent data [6]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [7]PETSC ERROR: MPI_Allreduce() called in different locations (code lines) on different processors [0]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [1]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [2]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! 
[2]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [2]PETSC ERROR: [3]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [3]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [3]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!c-Z57Rxri5Jh6q1_vjqdM7SkhlAM_gLdtUB47zEkqWh4akdRiDP3bjc4UceIyWe0kH8WDSJskhzcSj3SJ2uxjA$ for trouble shooting. [4]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [4]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [4]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!c-Z57Rxri5Jh6q1_vjqdM7SkhlAM_gLdtUB47zEkqWh4akdRiDP3bjc4UceIyWe0kH8WDSJskhzcSj3SJ2uxjA$ for trouble shooting. [5]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [5]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [5]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!c-Z57Rxri5Jh6q1_vjqdM7SkhlAM_gLdtUB47zEkqWh4akdRiDP3bjc4UceIyWe0kH8WDSJskhzcSj3SJ2uxjA$ for trouble shooting. [5]PETSC ERROR: Petsc Release Version 3.21.0, unknown [6]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [6]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [6]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!c-Z57Rxri5Jh6q1_vjqdM7SkhlAM_gLdtUB47zEkqWh4akdRiDP3bjc4UceIyWe0kH8WDSJskhzcSj3SJ2uxjA$ for trouble shooting. [6]PETSC ERROR: Petsc Release Version 3.21.0, unknown [6]PETSC ERROR: [7]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc! [7]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [0]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [0]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!c-Z57Rxri5Jh6q1_vjqdM7SkhlAM_gLdtUB47zEkqWh4akdRiDP3bjc4UceIyWe0kH8WDSJskhzcSj3SJ2uxjA$ for trouble shooting. [0]PETSC ERROR: Petsc Release Version 3.21.0, unknown [0]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [1]PETSC ERROR: Option left: name:-sns_monitor (no value) source: code [1]PETSC ERROR: See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!c-Z57Rxri5Jh6q1_vjqdM7SkhlAM_gLdtUB47zEkqWh4akdRiDP3bjc4UceIyWe0kH8WDSJskhzcSj3SJ2uxjA$ for trouble shooting. See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!c-Z57Rxri5Jh6q1_vjqdM7SkhlAM_gLdtUB47zEkqWh4akdRiDP3bjc4UceIyWe0kH8WDSJskhzcSj3SJ2uxjA$ for trouble shooting. 
[2]PETSC ERROR: Petsc Release Version 3.21.0, unknown [2]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [2]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x [3]PETSC ERROR: Petsc Release Version 3.21.0, unknown [3]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [3]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x [4]PETSC ERROR: Petsc Release Version 3.21.0, unknown [4]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [4]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x [5]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [5]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [6]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x See https://urldefense.us/v3/__https://petsc.org/release/faq/__;!!G_uCfscf7eWS!c-Z57Rxri5Jh6q1_vjqdM7SkhlAM_gLdtUB47zEkqWh4akdRiDP3bjc4UceIyWe0kH8WDSJskhzcSj3SJ2uxjA$ for trouble shooting. [7]PETSC ERROR: Petsc Release Version 3.21.0, unknown [7]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [7]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x [1]PETSC ERROR: Petsc Release Version 3.21.0, unknown [1]PETSC ERROR: ./exe-tasting-SOLERA on a arch-darwin-c-debug named mmp-laptop.local by migmolper Mon Jun 24 11:37:38 2024 [1]PETSC ERROR: Configure options --download-hdf5=1 --download-mpich=1 --with-debugging=1 CC=gcc CXX=c++ PETSC_ARCH=arch-darwin-c-debug --with-x [0]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [2]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [2]PETSC ERROR: [3]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [3]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [4]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [4]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [5]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [5]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [0]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [0]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [0]PETSC ERROR: #4 SNESLineSearchApply_BT() at 
/Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [1]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [1]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [1]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [2]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [2]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [2]PETSC ERROR: [3]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [3]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [3]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [4]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [4]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [4]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [5]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [5]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [5]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [5]PETSC ERROR: [6]PETSC ERROR: #1 PetscSplitReductionApply() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:230 [6]PETSC ERROR: #2 PetscSplitReductionEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:172 [6]PETSC ERROR: #3 VecNormEnd() at /Users/migmolper/petsc/src/vec/vec/utils/comb.c:553 [6]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [6]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [7]PETSC ERROR: #1 VecXDot_MPI_Default() at /Users/migmolper/petsc/include/../src/vec/vec/impls/mpi/pvecimpl.h:107 [7]PETSC ERROR: #2 VecDot_MPI() at /Users/migmolper/petsc/src/vec/vec/impls/mpi/pvec2.c:10 [7]PETSC ERROR: #3 VecDot() at /Users/migmolper/petsc/src/vec/vec/interface/rvector.c:120 [7]PETSC ERROR: #4 SNESLineSearchApply_CP() at /Users/migmolper/petsc/src/snes/linesearch/impls/cp/linesearchcp.c:28 [0]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [0]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [0]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [0]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [1]PETSC ERROR: #4 SNESLineSearchApply_BT() at /Users/migmolper/petsc/src/snes/linesearch/impls/bt/linesearchbt.c:88 [1]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [1]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [1]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [1]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [1]PETSC ERROR: #9 TSStep_Theta() at 
/Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [1]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [1]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [2]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [2]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [2]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [2]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [2]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [2]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [3]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [3]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [3]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [3]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [3]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [3]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [4]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [4]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [4]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [4]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [4]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [4]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [5]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [5]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [5]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [5]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [5]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [6]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [6]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [6]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [6]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [6]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [6]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [7]PETSC ERROR: #5 SNESLineSearchApply() at /Users/migmolper/petsc/src/snes/linesearch/interface/linesearch.c:645 [7]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /Users/migmolper/petsc/src/snes/impls/ls/ls.c:234 [7]PETSC ERROR: #7 SNESSolve() at /Users/migmolper/petsc/src/snes/interface/snes.c:4738 [7]PETSC ERROR: #8 TSTheta_SNESSolve() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:174 [7]PETSC ERROR: #9 TSStep_Theta() at 
/Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [7]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [7]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [0]PETSC ERROR: #9 TSStep_Theta() at /Users/migmolper/petsc/src/ts/impls/implicit/theta/theta.c:225 [0]PETSC ERROR: #10 TSStep() at /Users/migmolper/petsc/src/ts/interface/ts.c:3391 [0]PETSC ERROR: #11 TSSolve() at /Users/migmolper/petsc/src/ts/interface/ts.c:4037 [0]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [2]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [3]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [1]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [4]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [5]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [6]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 [7]PETSC ERROR: #12 Mass_Transport_Master_Equation_PETSc() at /Users/migmolper/DMD/SOLERA/Chemical-eqs/Mass-Transport-PETSc.cpp:251 -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Mon Jun 24 16:04:14 2024 From: bsmith at petsc.dev (Barry Smith) Date: Mon, 24 Jun 2024 17:04:14 -0400 Subject: [petsc-users] Error type "Petsc has generated inconsistent data" In-Reply-To: References: <4CA5B961-149B-4B30-8A63-197AC0BBB5BC@us.es> Message-ID: Note you can uses SNESSetOptionsPrefix(snes,"myprefix_") to attach a prefix for the one dimensional solves and then set options for the one dimensional problems with -myprefix_snes_type etc Now you can set different options for the different types of snes in your code > On Jun 24, 2024, at 4:18?PM, MIGUEL MOLINOS PEREZ wrote: > > Thanks Barry, that was the problem! > > Thank you, > Miguel > >> On Jun 24, 2024, at 12:56?PM, Barry Smith wrote: >> >> >> The error is coming from the parallel SNES used by TS TSTheta_SNESSolve() not your small SNES solver but TSTheta_SNESSolve() >> >> Process 7 uses [7]PETSC ERROR: #4 SNESLineSearchApply_CP() while the rest uses _BT(). I think this causes the problem since they are different algorithms using different MPI. >> >> Are you perhaps setting somewhere the line search to use? Options database or someway that different processes will get a different value? >> >> >> >>> On Jun 24, 2024, at 2:50?PM, MIGUEL MOLINOS PEREZ wrote: >>> >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> Dear all, >>> >>> I am trying to assemble a matrix A with coefficients which I need to assemble the RHS (F) and its Jacobian (J) in a TS type of problem. >>> >>> Determining each coefficient of A involves the resolution of a small non-linear problem (1 dof) using the serial version of SNES. By the way, the matrix A is of the type ?MATMPIAIJ?. 
>>> The weird part is, if I pass the matrix A to the TS routine inside of a user-context structure, without even accessing to the values inside of A, I got the following error message:
>>> > [1]PETSC ERROR: Petsc has generated inconsistent data
>>> > [1]PETSC ERROR: MPI_Allreduce() called in different locations (code lines)
>>> > on different processors
>>> But if I comment out the line which calls the SNES routine used to evaluate the coefficients inside of A, I don't get the error message.
>>>
>>> Some additional context:
>>> - The SNES routine is called once at a time inside of each rank.
>>> - I use PetscCall(SNESCreate(PETSC_COMM_SELF, &snes));
>>> - The vectors inside of the SNES function are defined as follows: VecCreateSeq(PETSC_COMM_SELF, 1, &Y)
>>> - All the input fields for SNES are also sequential.
>>>
>>> Any feedback is greatly appreciated!
>>>
>>> Thanks,
>>> Miguel
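A minimal sketch of Barry's prefix suggestion above (this is not code from the thread: the prefix "coeff_", the function names, and the placeholder residual are invented for illustration). Once the small PETSC_COMM_SELF SNES has its own options prefix, only options spelled -coeff_snes_* reach it, so an option aimed at the TS's parallel SNES, such as a line-search choice, can no longer make some ranks take a different code path than the others.

    #include <petscsnes.h>

    /* Placeholder residual standing in for the real 1-dof coefficient equation */
    static PetscErrorCode FormFunction1D(SNES snes, Vec x, Vec f, void *ctx)
    {
      const PetscScalar *xx;
      PetscScalar       *ff;

      PetscFunctionBeginUser;
      PetscCall(VecGetArrayRead(x, &xx));
      PetscCall(VecGetArray(f, &ff));
      ff[0] = xx[0] * xx[0] - 2.0;
      PetscCall(VecRestoreArray(f, &ff));
      PetscCall(VecRestoreArrayRead(x, &xx));
      PetscFunctionReturn(PETSC_SUCCESS);
    }

    /* One small sequential solve per coefficient, isolated behind its own prefix */
    static PetscErrorCode SolveCoefficient(PetscScalar *coeff)
    {
      SNES               snes;
      Vec                Y, R;
      Mat                J;
      const PetscScalar *y;

      PetscFunctionBeginUser;
      PetscCall(SNESCreate(PETSC_COMM_SELF, &snes));
      PetscCall(SNESSetOptionsPrefix(snes, "coeff_")); /* reads only -coeff_snes_* options */
      PetscCall(VecCreateSeq(PETSC_COMM_SELF, 1, &Y));
      PetscCall(VecDuplicate(Y, &R));
      PetscCall(MatCreateSeqDense(PETSC_COMM_SELF, 1, 1, NULL, &J));
      PetscCall(SNESSetFunction(snes, R, FormFunction1D, NULL));
      PetscCall(SNESSetJacobian(snes, J, J, SNESComputeJacobianDefault, NULL)); /* FD Jacobian for the 1-dof problem */
      PetscCall(SNESSetFromOptions(snes));
      PetscCall(VecSet(Y, 1.0));
      PetscCall(SNESSolve(snes, NULL, Y));
      PetscCall(VecGetArrayRead(Y, &y));
      *coeff = y[0];
      PetscCall(VecRestoreArrayRead(Y, &y));
      PetscCall(MatDestroy(&J));
      PetscCall(VecDestroy(&R));
      PetscCall(VecDestroy(&Y));
      PetscCall(SNESDestroy(&snes));
      PetscFunctionReturn(PETSC_SUCCESS);
    }

With this in place the inner solves can still be tuned from the command line (for example -coeff_snes_monitor or -coeff_snes_linesearch_type bt) without touching the options seen by the SNES that TS uses internally.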
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mmolinos at us.es  Mon Jun 24 16:32:56 2024
From: mmolinos at us.es (MIGUEL MOLINOS PEREZ)
Date: Mon, 24 Jun 2024 21:32:56 +0000
Subject: [petsc-users] Error type "Petsc has generated inconsistent data"
In-Reply-To: 
References: <4CA5B961-149B-4B30-8A63-197AC0BBB5BC@us.es>
Message-ID: <40D8C023-5771-462F-A95E-DD50794BF62D@us.es>

This is a very useful tip, thank you!

Miguel

On Jun 24, 2024, at 2:04 PM, Barry Smith wrote:

SNESSetOptionsPrefix
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mmolinos at us.es  Mon Jun 24 19:47:58 2024
From: mmolinos at us.es (MIGUEL MOLINOS PEREZ)
Date: Tue, 25 Jun 2024 00:47:58 +0000
Subject: [petsc-users] Doubt about TSMonitorSolutionVTK
Message-ID: 

Dear all,

I want to monitor the results at each iteration of TS using the vtk format. To do so, I add the following lines to my Monitor function:

char vts_File_Name[MAXC];
PetscCall(PetscSNPrintf(vts_File_Name, sizeof(vts_File_Name), "./xi-MgHx-hcp-cube-x5-x5-x5-TS-BE-%i.vtu", step));
PetscCall(TSMonitorSolutionVTK(ts, step, time, X, (void*)vts_File_Name));

My script compiles and executes without any warning/error messages. However, no output files are produced at the end of the simulation. I have also tried the option "-ts_monitor_solution_vtk", but I got no results either.

I can't find any similar example on the PETSc website and I don't see what I am doing wrong. Could somebody point me in the right direction?

Thanks,
Miguel
-------------- next part --------------
An HTML attachment was scrubbed...
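On the TSMonitorSolutionVTK question just above, the sketch below shows the registration route. It assumes, consistent with the call used in the message above, that in this PETSc version the monitor's context is a freshly allocated file-name template containing a %d-style format specifier that the monitor itself expands with the step number; the template string and the helper name SetupVTKMonitor are made up for illustration. Barry's reply further down points to src/ts/tutorials/ex26.c for a worked example.

    #include <petscts.h>

    /* Register the built-in VTK monitor instead of calling TSMonitorSolutionVTK() by hand.
       The context handed to TSMonitorSet() is a malloc'ed file-name template, which
       TSMonitorSolutionVTKDestroy() frees when the TS is destroyed. */
    static PetscErrorCode SetupVTKMonitor(TS ts)
    {
      char *filenametemplate;

      PetscFunctionBeginUser;
      PetscCall(PetscStrallocpy("xi-MgHx-hcp-cube-TS-BE-%03d.vtu", &filenametemplate));
      PetscCall(TSMonitorSet(ts, TSMonitorSolutionVTK, filenametemplate, TSMonitorSolutionVTKDestroy));
      PetscFunctionReturn(PETSC_SUCCESS);
    }

The options-database equivalent quoted in Barry's reply below is -ts_monitor_solution_vtk 'foo-%03d.vts'; note that the option takes the file-name template as its argument.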
URL: From mmolinos at us.es Mon Jun 24 20:28:33 2024 From: mmolinos at us.es (MIGUEL MOLINOS PEREZ) Date: Tue, 25 Jun 2024 01:28:33 +0000 Subject: [petsc-users] Doubt about TSMonitorSolutionVTK In-Reply-To: References: Message-ID: By the way, the hdf5 format works and I have the vtk library installed in PETSc. Miguel On Jun 24, 2024, at 5:47?PM, MIGUEL MOLINOS PEREZ wrote: Dear all, I want to monitor the results at each iteration of TS using vtk format. To do so, I add the following lines to my Monitor function: char vts_File_Name[MAXC]; PetscCall(PetscSNPrintf(vts_File_Name, sizeof(vts_File_Name), "./xi-MgHx-hcp-cube-x5-x5-x5-TS-BE-%i.vtu", step)); PetscCall(TSMonitorSolutionVTK(ts, step, time, X, (void*)vts_File_Name)); My script compiles and executes without any sort of warning/error messages. However, no output files are produced at the end of the simulation. I?ve also tried the option ?-ts_monitor_solution_vtk ?, but I got no results as well. I can?t find any similar example on the petsc website and I don?t see what I am doing wrong. Could somebody point me to the right direction? Thanks, Miguel -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Mon Jun 24 20:37:18 2024 From: bsmith at petsc.dev (Barry Smith) Date: Mon, 24 Jun 2024 21:37:18 -0400 Subject: [petsc-users] Doubt about TSMonitorSolutionVTK In-Reply-To: References: Message-ID: <2067D58E-F041-429F-8ABE-B19DD9F733C2@petsc.dev> See, for example, the bottom of src/ts/tutorials/ex26.c that uses -ts_monitor_solution_vtk 'foo-%03d.vts' > On Jun 24, 2024, at 8:47?PM, MIGUEL MOLINOS PEREZ wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Dear all, > > I want to monitor the results at each iteration of TS using vtk format. To do so, I add the following lines to my Monitor function: > > > char > vts_File_Name[MAXC]; > > PetscCall(PetscSNPrintf(vts_File_Name, > sizeof(vts_File_Name), > > "./xi-MgHx-hcp-cube-x5-x5-x5-TS-BE-%i.vtu", > step)); > > PetscCall(TSMonitorSolutionVTK(ts, > step, > time, > X, > (void*)vts_File_Name)); > > > My script compiles and executes without any sort of warning/error messages. However, no output files are produced at the end of the simulation. I?ve also tried the option ?-ts_monitor_solution_vtk ?, but I got no results as well. > > I can?t find any similar example on the petsc website and I don?t see what I am doing wrong. Could somebody point me to the right direction? > > Thanks, > Miguel -------------- next part -------------- An HTML attachment was scrubbed... URL: From ligang0309 at gmail.com Tue Jun 25 03:44:59 2024 From: ligang0309 at gmail.com (Gang Li) Date: Tue, 25 Jun 2024 16:44:59 +0800 Subject: [petsc-users] =?utf-8?q?Problem_about_compiling_PETSc-3=2E21=2E2?= =?utf-8?q?_under_Cygwin?= In-Reply-To: <21e32b88-aed2-a618-3e3c-dca47c6bc456@fastmail.org> References: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> <21e32b88-aed2-a618-3e3c-dca47c6bc456@fastmail.org> Message-ID: <5627D31E-5225-47CA-B337-A08E74C29D4A@gmail.com> Hi Barry and Satish, Thanks for your help. The same error when I restart this build using a fresh tarball. See the attached file for details.? 
Sincerely, Gang ---- Replied Message ---- FromSatish BalayDate6/25/2024 00:11ToBarry SmithCcGang Li, petsc-users at mcs.anl.govSubjectRe: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin Probably best if you can restart this build using a fresh tarball - and see if the problem persists Satish On Mon, 24 Jun 2024, Barry Smith wrote: Do ls /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf/ On Jun 23, 2024, at 8:27?AM, Gang Li wrote: This Message Is From an External Sender This message came from outside your organization. Hi, I have configured the PETSc under Cygwin by: cygpath -u `cygpath -ms '/cygdrive/c/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64'` ./configure --with-cc='win32fe_icl' --with-fc='win32fe_ifort' --with-cxx='win32fe_icl' \ --with-precision=double --with-scalar-type=complex \ --with-shared-libraries=0 \ --with-mpi=0 \ --with-blaslapack-lib='-L/cygdrive/c/PROGRA~2/INTELS~1/COMPIL~2/windows/mkl/lib/intel64 mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib' It seems to be successful. When make it, I faced a problem: $ make PETSC_DIR=/cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2 PETSC_ARCH=arch-mswin-c-debug all /usr/bin/python3 ./config/gmakegen.py --petsc-arch=arch-mswin-c-debug makefile:25: /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf /rules_util.mk: No such file or directory gmake[1]: *** No rule to make target '/cygdrive/e/Major/Codes/libraries/PETSc/pe tsc-3.21.2/lib/petsc/conf/rules_util.mk'. Stop. make: *** [GNUmakefile:9: all] Error 2 Could you help to check it? Thanks. Gang Li -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: error.jpg Type: image/jpeg Size: 199729 bytes Desc: not available URL: From onur.notonur at proton.me Tue Jun 25 03:50:00 2024 From: onur.notonur at proton.me (onur.notonur) Date: Tue, 25 Jun 2024 08:50:00 +0000 Subject: [petsc-users] DMPlexBuildFromCellList node ordering for tetrahedral elements Message-ID: Hi, I'm trying to implement a Tetgen mesh importer for my Petsc/DMPlex-based solver. I am encountering some issues and suspect they might be due to my import process. The Tetgen mesh definitions can be found here for reference: https://urldefense.us/v3/__https://wias-berlin.de/software/tetgen/fformats.html__;!!G_uCfscf7eWS!c6srXdHHqgkq5FoeJpfHljLrP0U3UJOmz6A1-jLgkMBEtv8NTqRKgVTgoP_ArdJT5e5kx9lX53ecjbzZ3iu3-eoV4-Qq_g$ I am building DMPlex using the DMPlexBuildFromCellList function and using the exact ordering of nodes I get from the Tetgen mesh files (.ele file). The resulting mesh looks good when I export it to VTK, but I encounter issues when solving particular PDEs. (I can solve them while using other importers I write) I suspect there may be orientation errors or something similar. So, my question is, Is the ordering of nodes in elements important for tetrahedral elements while using DMPlexBuildFromCellList? If so, how should I arrange them? Thanks, Onur Sent with Proton Mail secure email. -------------- next part -------------- An HTML attachment was scrubbed... 
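Matt's reply just below explains that TetGen and Plex do not order tetrahedra the same way (Plex expects consistently oriented cells with outward-facing normals). As a rough sketch of what that implies for a hand-written importer: swapping any two vertices of a tetrahedron reverses its orientation, so the connectivity read from the .ele file can be flipped before it is handed to DMPlexBuildFromCellList(). Whether the flip is needed, and which vertex pair to swap, is exactly what tetgenerate.cxx (mentioned later in the thread) settles, so the choice of the first two vertices below is an assumption to verify against that file.

    #include <petscdmplex.h>

    /* Assumed fix-up: reverse each TetGen tetrahedron's orientation by swapping its
       first two vertices before building the DMPlex from the cell list. */
    static void FlipTetOrientation(PetscInt numCells, PetscInt cells[])
    {
      for (PetscInt c = 0; c < numCells; ++c) {
        PetscInt *cell = &cells[4 * c]; /* 4 vertex indices per tetrahedron (.ele row) */
        PetscInt  tmp  = cell[0];

        cell[0] = cell[1];
        cell[1] = tmp;
      }
    }

As Matt also suggests below, letting PETSc drive TetGen itself through DMPlexGenerate() sidesteps the ordering question entirely.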
URL: From knepley at gmail.com Tue Jun 25 06:11:49 2024 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 25 Jun 2024 07:11:49 -0400 Subject: [petsc-users] DMPlexBuildFromCellList node ordering for tetrahedral elements In-Reply-To: References: Message-ID: On Tue, Jun 25, 2024 at 4:50?AM onur.notonur via petsc-users < petsc-users at mcs.anl.gov> wrote: > Hi, I'm trying to implement a Tetgen mesh importer for my > Petsc/DMPlex-based solver. I am encountering some issues and suspect they > might be due to my import process. The Tetgen mesh definitions can be found > here for reference: https: //wias-berlin. de/software/tetgen/fformats. > htmlI > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Hi, > > I'm trying to implement a Tetgen mesh importer for my Petsc/DMPlex-based > solver. I am encountering some issues and suspect they might be due to my > import process. The Tetgen mesh definitions can be found here for > reference: https://urldefense.us/v3/__https://wias-berlin.de/software/tetgen/fformats.html__;!!G_uCfscf7eWS!fZrK2yb2VL8fOhEwFSop4QgqPuB1M2_C1OMkfebpI6V32422apR69VnseJBVL8CTE5Gn4r6jRSz7K8CgmdZe$ > > > I am building DMPlex using the DMPlexBuildFromCellList function and using > the exact ordering of nodes I get from the Tetgen mesh files (.ele file). > The resulting mesh looks good when I export it to VTK, but I encounter > issues when solving particular PDEs. (I can solve them while using other > importers I write) I suspect there may be orientation errors or something > similar. > > So, my question is, Is the ordering of nodes in elements important for > tetrahedral elements while using DMPlexBuildFromCellList? If so, how should > I arrange them? > Yes, TetGen inverts tetrahedra compared to Plex, since I use all outward facing normals, whereas those in TetGen are not consistently ordered. However, why not just use DMPlexGenerate() with TetGen? Thanks, Matt > Thanks, > Onur > > Sent with Proton Mail secure email. > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fZrK2yb2VL8fOhEwFSop4QgqPuB1M2_C1OMkfebpI6V32422apR69VnseJBVL8CTE5Gn4r6jRSz7K3D5kEEE$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From onur.notonur at proton.me Tue Jun 25 06:42:08 2024 From: onur.notonur at proton.me (onur.notonur) Date: Tue, 25 Jun 2024 11:42:08 +0000 Subject: [petsc-users] DMPlexBuildFromCellList node ordering for tetrahedral elements In-Reply-To: References: Message-ID: In my workflow, I was trying to use multiple mesh sources, and to achieve that, I converted them to an HDF5 file along with all other complementary information regarding my PDE. This approach seemed beneficial at first, but it's hard to manage now. (Also, I didn't know about DMPlexGenerate() :) ) However, thanks to you, I see I need to inspect the tetgenerate.cxx file. Thank you very much! Thanks, Onur Sent with Proton Mail secure email. On Tuesday, June 25th, 2024 at 2:11 PM, Matthew Knepley wrote: > On Tue, Jun 25, 2024 at 4:50?AM onur.notonur via petsc-users wrote: > >> Hi, I'm trying to implement a Tetgen mesh importer for my Petsc/DMPlex-based solver. I am encountering some issues and suspect they might be due to my import process. 
The Tetgen mesh definitions can be found here for reference: https:?//wias-berlin.?de/software/tetgen/fformats.?htmlI >> ZjQcmQRYFpfptBannerStart >> This Message Is From an External Sender >> This message came from outside your organization. >> >> ZjQcmQRYFpfptBannerEnd >> >> Hi, >> >> I'm trying to implement a Tetgen mesh importer for my Petsc/DMPlex-based solver. I am encountering some issues and suspect they might be due to my import process. The Tetgen mesh definitions can be found here for reference: [https://urldefense.us/v3/__https://wias-berlin.de/software/tetgen/fformats.html*(https:/*urldefense.us/v3/__https:/*wias-berlin.de/software/tetgen/fformats.html__;!!G_uCfscf7eWS!c6srXdHHqgkq5FoeJpfHljLrP0U3UJOmz6A1-jLgkMBEtv8NTqRKgVTgoP_ArdJT5e5kx9lX53ecjbzZ3iu3-eoV4-Qq_g$)__;XS8v!!G_uCfscf7eWS!YhhAyOWDVeQm_MzXouB0hyFqonqeV1Ds-awndO2XWRhhfqTAwYpejwkiZTw0ayB3NhhOZiJ1f6k-BSQdaNhcYww5f6pyCw$ >> >> I am building DMPlex using the DMPlexBuildFromCellList function and using the exact ordering of nodes I get from the Tetgen mesh files (.ele file). The resulting mesh looks good when I export it to VTK, but I encounter issues when solving particular PDEs. (I can solve them while using other importers I write) I suspect there may be orientation errors or something similar. >> >> So, my question is, Is the ordering of nodes in elements important for tetrahedral elements while using DMPlexBuildFromCellList? If so, how should I arrange them? > > Yes, TetGen inverts tetrahedra compared to Plex, since I use all outward facing normals, whereas those in TetGen are not consistently ordered. However, why not just use DMPlexGenerate() with TetGen? > > Thanks, > > Matt > >> Thanks, >> Onur >> >> Sent with Proton Mail secure email. > > -- > > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > [https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/*(http:/*www.cse.buffalo.edu/*knepley/)__;fl0vfg!!G_uCfscf7eWS!YhhAyOWDVeQm_MzXouB0hyFqonqeV1Ds-awndO2XWRhhfqTAwYpejwkiZTw0ayB3NhhOZiJ1f6k-BSQdaNhcYwy4tGMtYw$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Tue Jun 25 08:13:21 2024 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 25 Jun 2024 09:13:21 -0400 Subject: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin In-Reply-To: <5627D31E-5225-47CA-B337-A08E74C29D4A@gmail.com> References: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> <21e32b88-aed2-a618-3e3c-dca47c6bc456@fastmail.org> <5627D31E-5225-47CA-B337-A08E74C29D4A@gmail.com> Message-ID: <7E92C471-B1E3-44F0-AD96-8D01EA07A4CA@petsc.dev> Please do ls /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf and send the results > On Jun 25, 2024, at 4:44?AM, Gang Li wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Barry and Satish, > > Thanks for your help. > > The same error when I restart this build using a fresh tarball. > See the attached file for details. 
> > Sincerely, > Gang > ---- Replied Message ---- > From Satish Balay > Date 6/25/2024 00:11 > To Barry Smith > Cc Gang Li, > petsc-users at mcs.anl.gov > Subject Re: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin > Probably best if you can restart this build using a fresh tarball - and see if the problem persists > > Satish > > On Mon, 24 Jun 2024, Barry Smith wrote: > > > Do > > ls /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf/ > > > > > On Jun 23, 2024, at 8:27?AM, Gang Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi, > > I have configured the PETSc under Cygwin by: > > cygpath -u `cygpath -ms '/cygdrive/c/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64'` > ./configure --with-cc='win32fe_icl' --with-fc='win32fe_ifort' --with-cxx='win32fe_icl' \ > --with-precision=double --with-scalar-type=complex \ > --with-shared-libraries=0 \ > --with-mpi=0 \ > --with-blaslapack-lib='-L/cygdrive/c/PROGRA~2/INTELS~1/COMPIL~2/windows/mkl/lib/intel64 mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib' > > It seems to be successful. When make it, I faced a problem: > > $ make PETSC_DIR=/cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2 PETSC_ARCH=arch-mswin-c-debug all > /usr/bin/python3 ./config/gmakegen.py --petsc-arch=arch-mswin-c-debug > makefile:25: /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf > /rules_util.mk: No such file or directory > gmake[1]: *** No rule to make target '/cygdrive/e/Major/Codes/libraries/PETSc/pe > tsc-3.21.2/lib/petsc/conf/rules_util.mk'. Stop. > make: *** [GNUmakefile:9: all] Error 2 > > Could you help to check it? > Thanks. > > Gang Li > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Tue Jun 25 09:20:18 2024 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 25 Jun 2024 10:20:18 -0400 Subject: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin In-Reply-To: <8140C5A4-F483-44F5-B902-CF064B2F7003@gmail.com> References: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> <21e32b88-aed2-a618-3e3c-dca47c6bc456@fastmail.org> <5627D31E-5225-47CA-B337-A08E74C29D4A@gmail.com> <7E92C471-B1E3-44F0-AD96-8D01EA07A4CA@petsc.dev> <8140C5A4-F483-44F5-B902-CF064B2F7003@gmail.com> Message-ID: What do file /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf/rules_doc.mk and file /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf/rules_util.mk return? > On Jun 25, 2024, at 9:59?AM, Gang Li wrote: > > Hi Barry, > > It is: > Administrator at YC-20210717DLFI /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2 > $ ls /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf > bfort-base.txt bfort-petsc.txt rules rules_util.mk uncrustify.cfg > bfort-mpi.txt petscvariables rules_doc.mk test variables > > Sincerely, > Gang > ---- Replied Message ---- > From Barry Smith > Date 6/25/2024 21:13 > To Gang Li > Cc petsc-users > Subject Re: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin > > Please do > > ls /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf > > and send the results > > > >> On Jun 25, 2024, at 4:44?AM, Gang Li > wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. >> Hi Barry and Satish, >> >> Thanks for your help. 
>> >> The same error when I restart this build using a fresh tarball. >> See the attached file for details. >> >> Sincerely, >> Gang >> ---- Replied Message ---- >> From Satish Balay >> Date 6/25/2024 00:11 >> To Barry Smith >> Cc Gang Li, >> petsc-users at mcs.anl.gov >> Subject Re: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin >> Probably best if you can restart this build using a fresh tarball - and see if the problem persists >> >> Satish >> >> On Mon, 24 Jun 2024, Barry Smith wrote: >> >> >> Do >> >> ls /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf/ >> >> >> >> >> On Jun 23, 2024, at 8:27?AM, Gang Li > wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. >> Hi, >> >> I have configured the PETSc under Cygwin by: >> >> cygpath -u `cygpath -ms '/cygdrive/c/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64'` >> ./configure --with-cc='win32fe_icl' --with-fc='win32fe_ifort' --with-cxx='win32fe_icl' \ >> --with-precision=double --with-scalar-type=complex \ >> --with-shared-libraries=0 \ >> --with-mpi=0 \ >> --with-blaslapack-lib='-L/cygdrive/c/PROGRA~2/INTELS~1/COMPIL~2/windows/mkl/lib/intel64 mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib' >> >> It seems to be successful. When make it, I faced a problem: >> >> $ make PETSC_DIR=/cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2 PETSC_ARCH=arch-mswin-c-debug all >> /usr/bin/python3 ./config/gmakegen.py --petsc-arch=arch-mswin-c-debug >> makefile:25: /cygdrive/e/Major/Codes/libraries/PETSc/petsc-3.21.2/lib/petsc/conf >> /rules_util.mk: No such file or directory >> gmake[1]: *** No rule to make target '/cygdrive/e/Major/Codes/libraries/PETSc/pe >> tsc-3.21.2/lib/petsc/conf/rules_util.mk'. Stop. >> make: *** [GNUmakefile:9: all] Error 2 >> >> Could you help to check it? >> Thanks. >> >> Gang Li >> >> >> >> >> >> >> >> > ? -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ls.jpg Type: image/jpeg Size: 18653 bytes Desc: not available URL: From lzou at anl.gov Tue Jun 25 10:25:16 2024 From: lzou at anl.gov (Zou, Ling) Date: Tue, 25 Jun 2024 15:25:16 +0000 Subject: [petsc-users] Modelica + PETSc? In-Reply-To: References: Message-ID: Thank you, Matt. To clarify, I myself have no idea about the Modelica implementation/interface to solver, so I was looking for if any such efforts have been done so I could leverage them. -Ling From: Matthew Knepley Date: Monday, June 24, 2024 at 10:39 AM To: Zou, Ling Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Modelica + PETSc? On Mon, Jun 24, 2024 at 10:?29 AM Zou, Ling wrote: This is the website I normally refer to https:?//openmodelica.?org/doc/OpenModelicaUsersGuide/latest/solving.?html Looks like DASSL is the default solver. That is what I would ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd On Mon, Jun 24, 2024 at 10:29?AM Zou, Ling > wrote: This is the website I normally refer to https://urldefense.us/v3/__https://openmodelica.org/doc/OpenModelicaUsersGuide/latest/solving.html__;!!G_uCfscf7eWS!b_FUpA47GE22L7gOjpyjA0Ai96NvB4NYNw5JB70nNxxDMc5mzM2GtVbJGudJTvPG5-8IFcltmTx9uq9Tni4$ Looks like DASSL is the default solver. That is what I would have guessed. DASSL is a good solver, but quite dated. I think PETSc can solve those problems, and more scalably. 
We would be happy to give advice on conforming to their interface. Thanks, Matt PS: I was playing with Modelica with some toy problem I have, which solves fine but could not hold on with the steady-state solution for some reason. Maybe I did it wrong, or maybe I am not familiar with the solver. That was the reason of the Modelica+PETSc question since I am quite familiar with PETSc. Also, the combination seems to be a powerful pair. -Ling From: Matthew Knepley > Date: Monday, June 24, 2024 at 6:12 AM To: Zou, Ling > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] Modelica + PETSc? On Sun, Jun 23, 2024 at 5:?04 PM Zou, Ling via petsc-users wrote: Hi all, I am just curious ? any effort trying to include PETSc as Modelica?s solution option? (Modelica forum or email list seem to be quite dead ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd On Sun, Jun 23, 2024 at 5:04?PM Zou, Ling via petsc-users > wrote: Hi all, I am just curious ? any effort trying to include PETSc as Modelica?s solution option? (Modelica forum or email list seem to be quite dead so asking here.) I had not heard of it before. I looked at the 3.6 specification, but it did not sy how the generated DAE were solved, or how to interface packages. Do they have documentation on that? Thanks, Matt -Ling -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!b_FUpA47GE22L7gOjpyjA0Ai96NvB4NYNw5JB70nNxxDMc5mzM2GtVbJGudJTvPG5-8IFcltmTx9OZlpQYY$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!b_FUpA47GE22L7gOjpyjA0Ai96NvB4NYNw5JB70nNxxDMc5mzM2GtVbJGudJTvPG5-8IFcltmTx9OZlpQYY$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From bruchon at emse.fr Tue Jun 25 11:04:25 2024 From: bruchon at emse.fr (Julien BRUCHON) Date: Tue, 25 Jun 2024 18:04:25 +0200 (CEST) Subject: [petsc-users] Trying to develop my own Krylov solver Message-ID: <182097699.28951205.1719331465252.JavaMail.zimbra@emse.fr> Hi, Based on 'cg.c', I'm trying to develop my own Krylov solver (a projected conjugate gradient). I want to integrate this into my C++ code, where I already have an interface for PETSC which works well. However, I have the following questions : - Where am I sensed to put my 'cg_projected.c' and 'pcgimpl.h' files? Should they go in a directory petsc/src/ksp/ksp/impls/pcg/? If so, how do I compile that? Is it simply by adding this directory to the Makefile in petsc/src/ksp/ksp/impls/? - I have also tried the basic approach of putting these two files in directories of my own C++ code and compiling. However, I have this error at the link edition: [100%] Linking CXX shared library libcoeur.so /usr/bin/ld: src/solvers/libsolvers.a(cg_projected.c.o): warning: relocation against `petscstack' in read-only section `.text' /usr/bin/ld: src/solvers/libsolvers.a(cg_projected.c.o): relocation R_X86_64_PC32 against symbol `petscstack' can not be used when making a shared object; recompil? 
avec -fPIC /usr/bin/ld : ?chec de l'?dition de liens finale : bad value collect2: error: ld returned 1 exit status make[2]: *** [CMakeFiles/coeur.dir/build.make:121 : libcoeur.so] Erreur 1 make[1]: *** [CMakeFiles/Makefile2:286 : CMakeFiles/coeur.dir/all] Erreur 2 make: *** [Makefile:91 : all] Erreur 2 Could you please tell me what is the right way to proceed? Thank you, Julien -- Julien Bruchon Professeur IMT - Responsable du d?partement MPE LGF - UMR CNRS 5307 - [ https://urldefense.us/v3/__https://www.mines-stetienne.fr/lgf/__;!!G_uCfscf7eWS!dQgv-IRWC7OgdDf1X9Oew4nHSgleq2ty0AszuRPj70bBiFeCcT4RibQVAvv6FFeD081W1yY8IczRHAHopA0crg$ | https://urldefense.us/v3/__https://www.mines-stetienne.fr/lgf/__;!!G_uCfscf7eWS!dQgv-IRWC7OgdDf1X9Oew4nHSgleq2ty0AszuRPj70bBiFeCcT4RibQVAvv6FFeD081W1yY8IczRHAHopA0crg$ ] Mines Saint-?tienne, une ?cole de l'Institut Mines-T?l?com [ https://urldefense.us/v3/__https://gitlab.emse.fr/bruchon/Coeur/-/wikis/home__;!!G_uCfscf7eWS!dQgv-IRWC7OgdDf1X9Oew4nHSgleq2ty0AszuRPj70bBiFeCcT4RibQVAvv6FFeD081W1yY8IczRHAG6pz3FrQ$ | Librairie ?l?ments Finis Coeur ] 0477420072 -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Jun 25 12:11:43 2024 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 25 Jun 2024 13:11:43 -0400 Subject: [petsc-users] Trying to develop my own Krylov solver In-Reply-To: <182097699.28951205.1719331465252.JavaMail.zimbra@emse.fr> References: <182097699.28951205.1719331465252.JavaMail.zimbra@emse.fr> Message-ID: On Tue, Jun 25, 2024 at 12:05?PM Julien BRUCHON via petsc-users < petsc-users at mcs.anl.gov> wrote: > Hi, Based on 'cg. c', I'm trying to develop my own Krylov solver (a > projected conjugate gradient). I want to integrate this into my C++ code, > where I already have an interface for PETSC which works well. However, I > have the following questions > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Hi, > > Based on 'cg.c', I'm trying to develop my own Krylov solver (a projected > conjugate gradient). I want to integrate this into my C++ code, where I > already have an interface for PETSC which works well. However, I have the > following questions : > > - Where am I sensed to put my 'cg_projected.c' and 'pcgimpl.h' files? > Should they go in a directory petsc/src/ksp/ksp/impls/pcg/? If so, how do I > compile that? Is it simply by adding this directory to the Makefile in > petsc/src/ksp/ksp/impls/? > Yes. Thanks, Matt > - I have also tried the basic approach of putting these two files in > directories of my own C++ code and compiling. However, I have this error > at the link edition: > [100%] Linking CXX shared library libcoeur.so > /usr/bin/ld: src/solvers/libsolvers.a(cg_projected.c.o): warning: > relocation against `petscstack' in read-only section `.text' > /usr/bin/ld: src/solvers/libsolvers.a(cg_projected.c.o): relocation > R_X86_64_PC32 against symbol `petscstack' can not be used when making a > shared object; recompil? avec -fPIC > /usr/bin/ld : ?chec de l'?dition de liens finale : bad value > collect2: error: ld returned 1 exit status > make[2]: *** [CMakeFiles/coeur.dir/build.make:121 : libcoeur.so] Erreur 1 > make[1]: *** [CMakeFiles/Makefile2:286 : CMakeFiles/coeur.dir/all] Erreur 2 > make: *** [Makefile:91 : all] Erreur 2 > > Could you please tell me what is the right way to proceed? 
> > Thank you, > > Julien > -- > Julien Bruchon > Professeur IMT - Responsable du d?partement MPE > LGF - UMR CNRS 5307 - https://urldefense.us/v3/__https://www.mines-stetienne.fr/lgf/__;!!G_uCfscf7eWS!ZSMOgmxB-aRx34PmTC3s7ZkDC-zT09xxpmLjhj_vx8oVkTvDSORUOeoTe8ZdEFCHVCUxSrs3eOz34zZTK5ep$ > > Mines Saint-?tienne, une ?cole de l'Institut Mines-T?l?com > Librairie ?l?ments Finis Coeur > > 0477420072 > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZSMOgmxB-aRx34PmTC3s7ZkDC-zT09xxpmLjhj_vx8oVkTvDSORUOeoTe8ZdEFCHVCUxSrs3eOz341kHHQmb$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Tue Jun 25 12:14:52 2024 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 25 Jun 2024 13:14:52 -0400 Subject: [petsc-users] Trying to develop my own Krylov solver In-Reply-To: References: <182097699.28951205.1719331465252.JavaMail.zimbra@emse.fr> Message-ID: <791E8646-7C37-4D28-A939-F662E3881A29@petsc.dev> Make sure that you are using the latest PETSc if at all possible Also copy over a makefile from the cg directory (you do not need to edit any makefiles) You also need to add it to KSPRegisterAll() You will need to do make clean before running make all to compiler your new code. Barry > On Jun 25, 2024, at 1:11?PM, Matthew Knepley wrote: > > This Message Is From an External Sender > This message came from outside your organization. > On Tue, Jun 25, 2024 at 12:05?PM Julien BRUCHON via petsc-users > wrote: >> This Message Is From an External Sender >> This message came from outside your organization. >> >> Hi, >> >> Based on 'cg.c', I'm trying to develop my own Krylov solver (a projected conjugate gradient). I want to integrate this into my C++ code, where I already have an interface for PETSC which works well. However, I have the following questions : >> >> - Where am I sensed to put my 'cg_projected.c' and 'pcgimpl.h' files? Should they go in a directory petsc/src/ksp/ksp/impls/pcg/? If so, how do I compile that? Is it simply by adding this directory to the Makefile in petsc/src/ksp/ksp/impls/? > > Yes. > > Thanks, > > Matt > >> - I have also tried the basic approach of putting these two files in directories of my own C++ code and compiling. However, I have this error at the link edition: >> [100%] Linking CXX shared library libcoeur.so >> /usr/bin/ld: src/solvers/libsolvers.a(cg_projected.c.o): warning: relocation against `petscstack' in read-only section `.text' >> /usr/bin/ld: src/solvers/libsolvers.a(cg_projected.c.o): relocation R_X86_64_PC32 against symbol `petscstack' can not be used when making a shared object; recompil? avec -fPIC >> /usr/bin/ld : ?chec de l'?dition de liens finale : bad value >> collect2: error: ld returned 1 exit status >> make[2]: *** [CMakeFiles/coeur.dir/build.make:121 : libcoeur.so ] Erreur 1 >> make[1]: *** [CMakeFiles/Makefile2:286 : CMakeFiles/coeur.dir/all] Erreur 2 >> make: *** [Makefile:91 : all] Erreur 2 >> >> Could you please tell me what is the right way to proceed? 
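For completeness, the other route, which avoids touching the PETSc source tree at all, is to register the new implementation from the application with KSPRegister(). The skeleton below is only a sketch: the names KSPCreate_PCG, KSPSetUp_PCG, KSPSolve_PCG and "pcg_projected" are invented for illustration, and the iteration body would be adapted from KSPSolve_CG() in cg.c. Note also that the linker message quoted above is literal: when the object ends up inside a shared library such as libcoeur.so, cg_projected.c itself must be compiled with -fPIC.

#include <petsc/private/kspimpl.h> /* private header, needed to fill in ksp->ops */

static PetscErrorCode KSPSetUp_PCG(KSP ksp)
{
  PetscFunctionBegin;
  PetscCall(KSPSetWorkVecs(ksp, 3)); /* same work-vector setup as cg.c */
  PetscFunctionReturn(PETSC_SUCCESS);
}

static PetscErrorCode KSPSolve_PCG(KSP ksp)
{
  PetscFunctionBegin;
  /* ... projected-CG iteration adapted from KSPSolve_CG() in cg.c goes here ... */
  ksp->its    = 0;
  ksp->reason = KSP_CONVERGED_ITS; /* placeholder so this stub is self-consistent */
  PetscFunctionReturn(PETSC_SUCCESS);
}

PETSC_EXTERN PetscErrorCode KSPCreate_PCG(KSP ksp)
{
  PetscFunctionBegin;
  ksp->data               = NULL; /* or a private struct, cf. your pcgimpl.h */
  ksp->ops->setup         = KSPSetUp_PCG;
  ksp->ops->solve         = KSPSolve_PCG;
  ksp->ops->destroy       = KSPDestroyDefault;
  ksp->ops->buildsolution = KSPBuildSolutionDefault;
  ksp->ops->buildresidual = KSPBuildResidualDefault;
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* In the application, once, before KSPSetFromOptions():
     PetscCall(KSPRegister("pcg_projected", KSPCreate_PCG));
   after which -ksp_type pcg_projected (or KSPSetType(ksp, "pcg_projected")) selects it. */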
>> >> Thank you, >> >> Julien >> -- >> Julien Bruchon >> Professeur IMT - Responsable du d?partement MPE >> LGF - UMR CNRS 5307 - https://urldefense.us/v3/__https://www.mines-stetienne.fr/lgf/__;!!G_uCfscf7eWS!YdfbFMzwUYFerTqmfdwOGaPcudU8m_JVVuuMI9-wko7kKcOqgNi_xiJZC-uQ6hpwAtBdhocY_wzknRHk84A4kYM$ >> Mines Saint-?tienne, une ?cole de l'Institut Mines-T?l?com >> Librairie ?l?ments Finis Coeur >> 0477420072 > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YdfbFMzwUYFerTqmfdwOGaPcudU8m_JVVuuMI9-wko7kKcOqgNi_xiJZC-uQ6hpwAtBdhocY_wzknRHk8zs6keo$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From maitri.ksh at gmail.com Tue Jun 25 13:35:29 2024 From: maitri.ksh at gmail.com (maitri ksh) Date: Tue, 25 Jun 2024 21:35:29 +0300 Subject: [petsc-users] Issues Compiling petsc4py with Cython Message-ID: I am currently working on integrating petsc4py, but I am encountering persistent compilation issues related to Cython. Below are the details of my setup and the errors I am facing. I would greatly appreciate any assistance or guidance on how to resolve these issues. System Configuration: - *PETSc Architecture*: linux-gnu-c-debug - *Python Environment*: Python 3.6 (virtual environment) - *Cython Version*: 3.0.10 - *Compiler*: /gcc11.2/bin/gcc During the build process, I received multiple warnings and errors related to the use of noexcept, nogil, and except in function declarations. Here are some of the specific errors: cythonizing 'petsc4py.PETSc.pyx' -> 'petsc4py.PETSc.c' warning: petsc4py.PETSc.pyx:1:0: Dotted filenames ('petsc4py.PETSc.pyx') are deprecated. Please use the normal Python package directory layout. /home/maitri.ksh/Maitri/petsc/petsc4py/myenv/lib64/python3.6/site-packages/Cython/Compiler/Main.py:381: FutureWarning: Cython directive 'language_level' not set, using '3str' for now (Py3). This has changed from earlier releases! File: include/petsc4py/PETSc.pxd tree = Parsing.p_module(s, pxd, full_module_name) warning: PETSc/PETSc.pyx:53:48: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython. warning: PETSc/petscvec.pxi:406:79: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython. warning: PETSc/petscvec.pxi:411:79: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython. ... Error compiling Cython file: ... PETSc/petscobj.pxi:91:29: Cannot assign type 'int (void *) except? -1 nogil' to 'int (*)(void *) noexcept'. Exception values are incompatible. Suggest adding 'noexcept' to the type of 'PetscDelPyDict'. ... PETSc/cyclicgc.pxi:34:20: Cannot assign type 'int (PyObject *, visitproc, void *) except? -1' to 'traverseproc *'. Exception values are incompatible. Suggest adding 'noexcept' to the type of 'tp_traverse'. ... PETSc/cyclicgc.pxi:35:20: Cannot assign type 'int (PyObject *) except? -1' to 'inquiry *'. Exception values are incompatible. Suggest adding 'noexcept' to the type of 'tp_clear'. ... PETSc/PETSc.pyx:351:17: Cannot assign type 'void (void) except * nogil' to 'void (*)(void) noexcept'. 
Exception values are incompatible. Suggest adding 'noexcept' to the type of 'finalize'. error: Cython failure: 'petsc4py.PETSc.pyx' -> 'petsc4py.PETSc.c' Any advice on building petsc4py in environments similar to mine would be greatly appreciated. Thanks, Maitri -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.zampini at gmail.com Tue Jun 25 14:00:00 2024 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Tue, 25 Jun 2024 21:00:00 +0200 Subject: [petsc-users] Issues Compiling petsc4py with Cython In-Reply-To: References: Message-ID: Which version of petsc4py is it? Il giorno mar 25 giu 2024 alle ore 20:35 maitri ksh ha scritto: > I am currently working on integrating petsc4py, but I am encountering > persistent compilation issues related to Cython. Below are the details of > my setup and the errors I am facing. I would greatly appreciate any > assistance or guidance on how > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > > I am currently working on integrating petsc4py, but I am encountering > persistent compilation issues related to Cython. Below are the details of > my setup and the errors I am facing. I would greatly appreciate any > assistance or guidance on how to resolve these issues. > System Configuration: > > - *PETSc Architecture*: linux-gnu-c-debug > - *Python Environment*: Python 3.6 (virtual environment) > - *Cython Version*: 3.0.10 > - *Compiler*: /gcc11.2/bin/gcc > > During the build process, I received multiple warnings and errors related > to the use of noexcept, nogil, and except in function declarations. Here > are some of the specific errors: > > cythonizing 'petsc4py.PETSc.pyx' -> 'petsc4py.PETSc.c' > warning: petsc4py.PETSc.pyx:1:0: Dotted filenames ('petsc4py.PETSc.pyx') > are deprecated. Please use the normal Python package directory layout. > /home/maitri.ksh/Maitri/petsc/petsc4py/myenv/lib64/python3.6/site-packages/Cython/Compiler/Main.py:381: > FutureWarning: Cython directive 'language_level' not set, using '3str' for > now (Py3). This has changed from earlier releases! File: > include/petsc4py/PETSc.pxd > tree = Parsing.p_module(s, pxd, full_module_name) > warning: PETSc/PETSc.pyx:53:48: The keyword 'nogil' should appear at the > end of the function signature line. Placing it before 'except' or > 'noexcept' will be disallowed in a future version of Cython. > warning: PETSc/petscvec.pxi:406:79: The keyword 'nogil' should appear at > the end of the function signature line. Placing it before 'except' or > 'noexcept' will be disallowed in a future version of Cython. > warning: PETSc/petscvec.pxi:411:79: The keyword 'nogil' should appear at > the end of the function signature line. Placing it before 'except' or > 'noexcept' will be disallowed in a future version of Cython. > ... > > Error compiling Cython file: > ... > PETSc/petscobj.pxi:91:29: Cannot assign type 'int (void *) except? -1 > nogil' to 'int (*)(void *) noexcept'. Exception values are incompatible. > Suggest adding 'noexcept' to the type of 'PetscDelPyDict'. > ... > PETSc/cyclicgc.pxi:34:20: Cannot assign type 'int (PyObject *, visitproc, > void *) except? -1' to 'traverseproc *'. Exception values are incompatible. > Suggest adding 'noexcept' to the type of 'tp_traverse'. > ... > PETSc/cyclicgc.pxi:35:20: Cannot assign type 'int (PyObject *) except? -1' > to 'inquiry *'. Exception values are incompatible. 
Suggest adding > 'noexcept' to the type of 'tp_clear'. > ... > PETSc/PETSc.pyx:351:17: Cannot assign type 'void (void) except * nogil' to > 'void (*)(void) noexcept'. Exception values are incompatible. Suggest > adding 'noexcept' to the type of 'finalize'. > error: Cython failure: 'petsc4py.PETSc.pyx' -> 'petsc4py.PETSc.c' > > Any advice on building petsc4py in environments similar to mine would be > greatly appreciated. > Thanks, > Maitri > -- Stefano -------------- next part -------------- An HTML attachment was scrubbed... URL: From balay.anl at fastmail.org Tue Jun 25 16:06:56 2024 From: balay.anl at fastmail.org (Satish Balay) Date: Tue, 25 Jun 2024 16:06:56 -0500 (CDT) Subject: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin In-Reply-To: <5627D31E-5225-47CA-B337-A08E74C29D4A@gmail.com> References: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> <21e32b88-aed2-a618-3e3c-dca47c6bc456@fastmail.org> <5627D31E-5225-47CA-B337-A08E74C29D4A@gmail.com> Message-ID: <552dde2a-782a-5238-4897-18736ac9e94a@fastmail.org> An HTML attachment was scrubbed... URL: From balay.anl at fastmail.org Tue Jun 25 16:34:41 2024 From: balay.anl at fastmail.org (Satish Balay) Date: Tue, 25 Jun 2024 16:34:41 -0500 (CDT) Subject: [petsc-users] Issues Compiling petsc4py with Cython In-Reply-To: References: Message-ID: <879eb7b2-16e1-044e-299b-669d062f8650@fastmail.org> An HTML attachment was scrubbed... URL: From andrsd at gmail.com Tue Jun 25 17:10:00 2024 From: andrsd at gmail.com (David Andrs) Date: Tue, 25 Jun 2024 16:10:00 -0600 Subject: [petsc-users] BuildSystem looks for libOpenCL.a Message-ID: Hello! Is there a reason why the PETSc build system looks for libOpenCL.a, but not for libOpenCL.so on linux platforms? I have a machine with debian 12.5 and nvidia card. It has these packages installed: cuda-opencl-12-5/unknown,now 12.5.39-1 amd64 [installed,automatic] cuda-opencl-dev-12-5/unknown,now 12.5.39-1 amd64 [installed,automatic] nvidia-libopencl1/unknown,now 555.42.02-1 amd64 [installed,automatic] nvidia-opencl-common/unknown,now 555.42.02-1 amd64 [installed,automatic] nvidia-opencl-icd/unknown,now 555.42.02-1 amd64 [installed,automatic] opencl-c-headers/stable,now 3.0~2023.02.06-1 all [installed,automatic] opencl-clhpp-headers/stable,now 3.0~2023.02.06-1 all [installed,automatic] opencl-headers/stable,now 3.0~2023.02.06-1 all [installed] It only has .so, but no .a $ find /usr -iname 'libopencl*' /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so.1.0.0 /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so.1 /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so.1.0 /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0 /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 Are users supposed to use `--with-opencl-include=` and `--with-opencl-lib` switches in this case? Thanks, David -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Tue Jun 25 17:19:21 2024 From: bsmith at petsc.dev (Barry Smith) Date: Tue, 25 Jun 2024 18:19:21 -0400 Subject: [petsc-users] BuildSystem looks for libOpenCL.a In-Reply-To: References: Message-ID: <8E7B6E30-BBD1-481D-835B-8693795F5987@petsc.dev> Did you have a problem with the install? Are you concerned with self.liblist = [['libOpenCL.a'], ['-framework opencl'], ['libOpenCL.lib']] ? 
Even though only the .a library is listed that should be a stand-in for both the static and shared library and it should automatically find the shared library for you. Barry > On Jun 25, 2024, at 6:10?PM, David Andrs wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hello! > > Is there a reason why the PETSc build system looks for libOpenCL.a, but not for libOpenCL.so on linux platforms? I have a machine with debian 12.5 and nvidia card. It has these packages installed: > > cuda-opencl-12-5/unknown,now 12.5.39-1 amd64 [installed,automatic] > cuda-opencl-dev-12-5/unknown,now 12.5.39-1 amd64 [installed,automatic] > nvidia-libopencl1/unknown,now 555.42.02-1 amd64 [installed,automatic] > nvidia-opencl-common/unknown,now 555.42.02-1 amd64 [installed,automatic] > nvidia-opencl-icd/unknown,now 555.42.02-1 amd64 [installed,automatic] > opencl-c-headers/stable,now 3.0~2023.02.06-1 all [installed,automatic] > opencl-clhpp-headers/stable,now 3.0~2023.02.06-1 all [installed,automatic] > opencl-headers/stable,now 3.0~2023.02.06-1 all [installed] > > It only has .so, but no .a > > $ find /usr -iname 'libopencl*' > /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so .1.0.0 > /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so .1 > /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so > /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so .1.0 > /usr/lib/x86_64-linux-gnu/libOpenCL.so .1.0.0 > /usr/lib/x86_64-linux-gnu/libOpenCL.so .1 > > Are users supposed to use `--with-opencl-include=` and `--with-opencl-lib` switches in this case? > > Thanks, > > David -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Tue Jun 25 17:34:12 2024 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Tue, 25 Jun 2024 17:34:12 -0500 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> Message-ID: Hi, Yongzhong, Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. 
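If rebuilding the whole benchmark is awkward on the cluster, a stripped-down timing loop for the same kernel is easy to write. The sketch below is not ex2k itself; the vector length, number of vectors and repetition count are arbitrary and only meant to expose whether VecMDot speeds up with MKL threads on a given node.

/* Minimal VecMDot timing sketch (not ex2k): times dots[i] = V[i]^H x. */
#include <petscvec.h>
#include <petsctime.h>

int main(int argc, char **argv)
{
  PetscInt       n = 1 << 20, m = 30, reps = 20;
  Vec            x, V[30]; /* array size must match m */
  PetscScalar    dots[30];
  PetscLogDouble t0, t1;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(VecCreateSeq(PETSC_COMM_SELF, n, &x));
  PetscCall(VecSet(x, 1.0));
  for (PetscInt i = 0; i < m; i++) {
    PetscCall(VecDuplicate(x, &V[i]));
    PetscCall(VecSet(V[i], (PetscScalar)(i + 1)));
  }
  PetscCall(VecMDot(x, m, V, dots)); /* warm-up call, not timed */
  PetscCall(PetscTime(&t0));
  for (PetscInt r = 0; r < reps; r++) PetscCall(VecMDot(x, m, V, dots));
  PetscCall(PetscTime(&t1));
  PetscCall(PetscPrintf(PETSC_COMM_SELF, "VecMDot (n=%" PetscInt_FMT ", m=%" PetscInt_FMT "): %g us/call\n", n, m, 1e6 * (t1 - t0) / reps));
  for (PetscInt i = 0; i < m; i++) PetscCall(VecDestroy(&V[i]));
  PetscCall(VecDestroy(&x));
  PetscCall(PetscFinalize());
  return 0;
}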
I then configured petsc with openblas, I did see better performance with more threads $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.1 256 1.8 2.7 7.0 512 2.1 3.1 8.6 1024 2.7 4.0 12.3 2048 3.8 6.3 28.0 4096 6.1 10.6 42.4 8192 10.9 21.8 79.5 16384 21.2 39.4 149.6 32768 45.9 75.7 224.6 65536 142.2 215.8 732.1 131072 169.1 233.2 1729.4 262144 367.5 830.0 4159.2 524288 999.2 1718.1 8538.5 1048576 2113.5 4082.1 18274.8 2097152 5392.6 10273.4 43273.4 $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.0 256 1.8 2.7 15.0 512 2.1 9.0 16.6 1024 2.6 8.7 16.1 2048 7.7 10.3 20.5 4096 9.9 11.4 25.9 8192 14.5 22.1 39.6 16384 25.1 27.8 67.8 32768 44.7 95.7 91.5 65536 82.1 156.8 165.1 131072 194.0 335.1 341.5 262144 388.5 380.8 612.9 524288 1046.7 967.1 1653.3 1048576 1997.4 2169.0 4034.4 2097152 5502.9 5787.3 12608.1 The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average speedup depends on components. So I suggest you run ex2k to see in your environment whether oneMKL can speedup the kernels. --Junchao Zhang On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang wrote: > Let me run some examples on our end to see whether the code calls expected > functions. > > --Junchao Zhang > > > On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: > >> On Mon, Jun 24, 2024 at 11: 21 AM Yongzhong Li > utoronto. ca> wrote: Thank you Pierre for your information. Do we have a >> conclusion for my original question about the parallelization efficiency >> for different stages of >> ZjQcmQRYFpfptBannerStart >> This Message Is From an External Sender >> This message came from outside your organization. >> >> ZjQcmQRYFpfptBannerEnd >> On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li < >> yongzhong.li at mail.utoronto.ca> wrote: >> >>> Thank you Pierre for your information. Do we have a conclusion for my >>> original question about the parallelization efficiency for different stages >>> of KSP Solve? Do we need to do more testing to figure out the issues? Thank >>> you, Yongzhong From: >>> ZjQcmQRYFpfptBannerStart >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> >>> ZjQcmQRYFpfptBannerEnd >>> >>> Thank you Pierre for your information. Do we have a conclusion for my >>> original question about the parallelization efficiency for different stages >>> of KSP Solve? Do we need to do more testing to figure out the issues? >>> >> >> We have an extended discussion of this here: >> https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!YH0MZEitgVyMCSrjwIzvt_s5lUzx3y_DLknXI9TNdBzdEWWAvy0nWkeaPe2b54Q6ioRraV7S3gzxr_k9JDqYX1rvYgHD$ >> >> >> The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, >> etc) are memory bandwidth limited. If there is no more bandwidth to be >> marshalled on your board, then adding more processes does nothing at all. >> This is why people were asking about how many "nodes" you are running on, >> because that is the unit of memory bandwidth, not "cores" which make little >> difference. 
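To put rough numbers on that (illustrative figures only): VecAXPY, y <- a*x + y, performs 2N flops while streaming about 3*8*N bytes in real double precision (read x, read y, write y), i.e. roughly 0.08 flop/byte. On a node with, say, 200 GB/s of sustainable memory bandwidth the kernel is therefore capped near 0.08 * 200e9 ~ 17 GFlop/s no matter how many cores participate; a handful of cores is usually enough to saturate the bus, and the rest add nothing. Complex scalars change the constants (16 bytes and more flops per entry) but not the conclusion. Running the STREAM benchmark (e.g. "make streams" in a PETSc build) is the usual way to measure what a given node can actually sustain.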
>> >> Thanks, >> >> Matt >> >> >>> Thank you, >>> >>> Yongzhong >>> >>> >>> >>> *From: *Pierre Jolivet >>> *Date: *Sunday, June 23, 2024 at 12:41?AM >>> *To: *Yongzhong Li >>> *Cc: *petsc-users at mcs.anl.gov >>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >>> KSPSolve Performance Issue >>> >>> >>> >>> >>> >>> On 23 Jun 2024, at 4:07?AM, Yongzhong Li >>> wrote: >>> >>> >>> >>> This Message Is From an External Sender >>> >>> This message came from outside your organization. >>> >>> Yeah, I ran my program again using -mat_view::ascii_info and set >>> MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix >>> to be seqaijmkl type (I?ve attached a few as below) >>> >>> --> Setting up matrix-vector products... >>> >>> >>> >>> Mat Object: 1 MPI process >>> >>> type: seqaijmkl >>> >>> rows=16490, cols=35937 >>> >>> total: nonzeros=128496, allocated nonzeros=128496 >>> >>> total number of mallocs used during MatSetValues calls=0 >>> >>> not using I-node routines >>> >>> Mat Object: 1 MPI process >>> >>> type: seqaijmkl >>> >>> rows=16490, cols=35937 >>> >>> total: nonzeros=128496, allocated nonzeros=128496 >>> >>> total number of mallocs used during MatSetValues calls=0 >>> >>> not using I-node routines >>> >>> >>> >>> --> Solving the system... >>> >>> >>> >>> Excitation 1 of 1... >>> >>> >>> >>> ================================================ >>> >>> Iterative solve completed in 7435 ms. >>> >>> CONVERGED: rtol. >>> >>> Iterations: 72 >>> >>> Final relative residual norm: 9.22287e-07 >>> >>> ================================================ >>> >>> [CPU TIME] System solution: 2.27160000e+02 s. >>> >>> [WALL TIME] System solution: 7.44387218e+00 s. >>> >>> However, it seems to me that there were still no MKL outputs even I set >>> MKL_VERBOSE to be 1. Although, I think it should be many spmv operations >>> when doing KSPSolve(). Do you see the possible reasons? >>> >>> >>> >>> SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS >>> is. >>> >>> >>> >>> Thanks, >>> >>> Pierre >>> >>> >>> >>> Thanks, >>> >>> Yongzhong >>> >>> >>> >>> >>> >>> *From: *Matthew Knepley >>> *Date: *Saturday, June 22, 2024 at 5:56?PM >>> *To: *Yongzhong Li >>> *Cc: *Junchao Zhang , Pierre Jolivet < >>> pierre at joliv.et>, petsc-users at mcs.anl.gov >>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >>> KSPSolve Performance Issue >>> >>> ????????? knepley at gmail.com ????????????????? >>> >>> >>> On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li < >>> yongzhong.li at mail.utoronto.ca> wrote: >>> >>> MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 >>> MKL_VERBOSE Intel(R) MKL 2019. 0 Update 4 Product build 20190411 for >>> Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) >>> AVX-512) with support of Vector >>> >>> ZjQcmQRYFpfptBannerStart >>> >>> *This Message Is From an External Sender* >>> >>> This message came from outside your organization. 
>>> >>> >>> >>> ZjQcmQRYFpfptBannerEnd >>> >>> MKL_VERBOSE=1 ./ex1 >>> >>> >>> matrix nonzeros = 100, allocated nonzeros = 100 >>> >>> MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for >>> Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) >>> AVX-512) with support of Vector Neural Network Instructions enabled >>> processors, Lnx 2.50GHz lp64 gnu_thread >>> >>> MKL_VERBOSE >>> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) >>> 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) >>> 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE >>> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) >>> 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) >>> 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE >>> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) >>> 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF >>> Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE >>> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) >>> 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 >>> FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE >>> ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) >>> 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) >>> 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE 
>>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) >>> 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE >>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) >>> 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE >>> ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) >>> 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE >>> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) >>> 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE >>> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) >>> 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) >>> 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE >>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) >>> 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE >>> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) >>> 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns >>> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> Yes, for petsc example, there are MKL outputs, but for my own program. >>> All I did is to change the matrix type from MATAIJ to MATAIJMKL to get >>> optimized performance for spmv from MKL. Should I expect to see any MKL >>> outputs in this case? >>> >>> >>> >>> Are you sure that the type changed? You can MatView() the matrix with >>> format ascii_info to see. >>> >>> >>> >>> Thanks, >>> >>> >>> >>> Matt >>> >>> >>> >>> >>> >>> Thanks, >>> >>> Yongzhong >>> >>> >>> >>> *From: *Junchao Zhang >>> *Date: *Saturday, June 22, 2024 at 9:40?AM >>> *To: *Yongzhong Li >>> *Cc: *Pierre Jolivet , petsc-users at mcs.anl.gov < >>> petsc-users at mcs.anl.gov> >>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >>> KSPSolve Performance Issue >>> >>> No, you don't. It is strange. Perhaps you can you run a petsc example >>> first and see if MKL is really used >>> >>> $ cd src/mat/tests >>> >>> $ make ex1 >>> >>> $ MKL_VERBOSE=1 ./ex1 >>> >>> >>> --Junchao Zhang >>> >>> >>> >>> >>> >>> On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li < >>> yongzhong.li at mail.utoronto.ca> wrote: >>> >>> I am using >>> >>> export MKL_VERBOSE=1 >>> >>> ./xx >>> >>> in the bash file, do I have to use - ksp_converged_reason? >>> >>> Thanks, >>> >>> Yongzhong >>> >>> >>> >>> *From: *Pierre Jolivet >>> *Date: *Friday, June 21, 2024 at 1:47?PM >>> *To: *Yongzhong Li >>> *Cc: *Junchao Zhang , petsc-users at mcs.anl.gov < >>> petsc-users at mcs.anl.gov> >>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >>> KSPSolve Performance Issue >>> >>> ????????? 
pierre at joliv.et ????????????????? >>> >>> >>> How do you set the variable? >>> >>> >>> >>> $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason >>> >>> MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) >>> 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) >>> enabled processors, Lnx 2.80GHz lp64 intel_thread >>> >>> MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 >>> FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 >>> FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 >>> FastMM:1 TID:0 NThr:1 >>> >>> MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 >>> FastMM:1 TID:0 NThr:1 >>> >>> [...] >>> >>> >>> >>> On 21 Jun 2024, at 7:37?PM, Yongzhong Li >>> wrote: >>> >>> >>> >>> This Message Is From an External Sender >>> >>> This message came from outside your organization. >>> >>> Hello all, >>> >>> I set MKL_VERBOSE = 1, but observed no print output specific to the use >>> of MKL. Does PETSc enable this verbose output? >>> >>> Best, >>> >>> Yongzhong >>> >>> >>> >>> *From: *Pierre Jolivet >>> *Date: *Friday, June 21, 2024 at 1:36?AM >>> *To: *Junchao Zhang >>> *Cc: *Yongzhong Li , >>> petsc-users at mcs.anl.gov >>> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >>> KSPSolve Performance Issue >>> >>> ????????? pierre at joliv.et ????????????????? >>> >>> >>> >>> >>> >>> >>> On 21 Jun 2024, at 6:42?AM, Junchao Zhang >>> wrote: >>> >>> >>> >>> This Message Is From an External Sender >>> >>> This message came from outside your organization. >>> >>> I remember there are some MKL env vars to print MKL routines called. >>> >>> >>> >>> The environment variable is MKL_VERBOSE >>> >>> >>> >>> Thanks, >>> >>> Pierre >>> >>> >>> >>> Maybe we can try it to see what MKL routines are really used and then we >>> can understand why some petsc functions did not speed up >>> >>> >>> --Junchao Zhang >>> >>> >>> >>> >>> >>> On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li < >>> yongzhong.li at mail.utoronto.ca> wrote: >>> >>> *This Message Is From an External Sender* >>> >>> This message came from outside your organization. >>> >>> >>> >>> Hi Barry, sorry for my last results. I didn?t fully understand the stage >>> profiling and logging in PETSc, now I only record KSPSolve() stage of my >>> program. Some sample codes are as follow, >>> >>> // Static variable to keep track of the stage counter >>> >>> static int stageCounter = 1; >>> >>> >>> >>> // Generate a unique stage name >>> >>> std::ostringstream oss; >>> >>> oss << "Stage " << stageCounter << " of Code"; >>> >>> std::string stageName = oss.str(); >>> >>> >>> >>> // Register the stage >>> >>> PetscLogStage stagenum; >>> >>> >>> >>> PetscLogStageRegister(stageName.c_str(), &stagenum); >>> >>> PetscLogStagePush(stagenum); >>> >>> >>> >>> *KSPSolve(*ksp_ptr, b, x);* >>> >>> >>> >>> PetscLogStagePop(); >>> >>> stageCounter++; >>> >>> I have attached my new logging results, there are 1 main stage and 4 >>> other stages where each one is KSPSolve() call. >>> >>> To provide some additional backgrounds, if you recall, I have been >>> trying to get efficient iterative solution using multithreading. I found >>> out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am >>> able to perform sparse matrix-vector multiplication faster, I am using >>> MATSEQAIJMKL. This makes the shell matrix vector product in each iteration >>> scale well with the #of threads. 
However, I found out the total GMERS solve >>> time (~KSPSolve() time) is not scaling well the #of threads. >>> >>> From the logging results I learned that when performing KSPSolve(), >>> there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my >>> programs using different number of threads and plotted the time consumption >>> for PCApply() and KSPGMERSOrthog() against #of thread. I found out these >>> two operations are not scaling with the threads at all! My results are >>> attached as the pdf to give you a clear view. >>> >>> My questions is, >>> >>> From my understanding, in PCApply, MatSolve() is involved, >>> KSPGMERSOrthog() will have many vector operations, so why these two parts >>> can?t scale well with the # of threads when the intel MKL library is linked? >>> >>> Thank you, >>> Yongzhong >>> >>> >>> >>> *From: *Barry Smith >>> *Date: *Friday, June 14, 2024 at 11:36?AM >>> *To: *Yongzhong Li >>> *Cc: *petsc-users at mcs.anl.gov , >>> petsc-maint at mcs.anl.gov , Piero Triverio < >>> piero.triverio at utoronto.ca> >>> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve >>> Performance Issue >>> >>> >>> >>> I am a bit confused. Without the initial guess computation, there are >>> still a bunch of events I don't understand >>> >>> >>> >>> MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> >>> MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> >>> MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> >>> MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> >>> MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> >>> MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> >>> MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 >>> >>> MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> >>> MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> >>> >>> >>> in addition there are many more VecMAXPY then VecMDot (in GMRES they are >>> each done the same number of times) >>> >>> >>> >>> VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 >>> >>> VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 >>> >>> >>> >>> Finally there are a huge number of >>> >>> >>> >>> MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 >>> >>> >>> >>> Are you making calls to all these routines? Are you doing this inside >>> your MatMult() or before you call KSPSolve? >>> >>> >>> >>> The reason I wanted you to make a simpler run without the initial guess >>> code is that your events are far more complicated than would be produced by >>> GMRES alone so it is not possible to understand the behavior you are seeing >>> without fully understanding all the events happening in the code. >>> >>> >>> >>> Barry >>> >>> >>> >>> >>> >>> On Jun 14, 2024, at 1:19?AM, Yongzhong Li >>> wrote: >>> >>> >>> >>> Thanks, I have attached the results without using any KSPGuess. 
At low >>> frequency, the iteration steps are quite close to the one with KSPGuess, >>> specifically >>> >>> KSPGuess Object: 1 MPI process >>> >>> type: fischer >>> >>> Model 1, size 200 >>> >>> However, I found at higher frequency, the # of iteration steps are >>> significant higher than the one with KSPGuess, I have attahced both of the >>> results for your reference. >>> >>> Moreover, could I ask why the one without the KSPGuess options can be >>> used for a baseline comparsion? What are we comparing here? How does it >>> relate to the performance issue/bottleneck I found? ?*I have noticed >>> that the time taken by **KSPSolve** is **almost two times **greater >>> than the CPU time for matrix-vector product multiplied by the number of >>> iteration*? >>> >>> Thank you! >>> Yongzhong >>> >>> >>> >>> *From: *Barry Smith >>> *Date: *Thursday, June 13, 2024 at 2:14?PM >>> *To: *Yongzhong Li >>> *Cc: *petsc-users at mcs.anl.gov , >>> petsc-maint at mcs.anl.gov , Piero Triverio < >>> piero.triverio at utoronto.ca> >>> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve >>> Performance Issue >>> >>> >>> >>> Can you please run the same thing without the KSPGuess option(s) for >>> a baseline comparison? >>> >>> >>> >>> Thanks >>> >>> >>> >>> Barry >>> >>> >>> >>> On Jun 13, 2024, at 1:27?PM, Yongzhong Li >>> wrote: >>> >>> >>> >>> This Message Is From an External Sender >>> >>> This message came from outside your organization. >>> >>> Hi Matt, >>> >>> I have rerun the program with the keys you provided. The system output >>> when performing ksp solve and the final petsc log output were stored in a >>> .txt file attached for your reference. >>> >>> Thanks! >>> Yongzhong >>> >>> >>> >>> *From: *Matthew Knepley >>> *Date: *Wednesday, June 12, 2024 at 6:46?PM >>> *To: *Yongzhong Li >>> *Cc: *petsc-users at mcs.anl.gov , >>> petsc-maint at mcs.anl.gov , Piero Triverio < >>> piero.triverio at utoronto.ca> >>> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve >>> Performance Issue >>> >>> ????????? knepley at gmail.com ????????????????? >>> >>> >>> On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li < >>> yongzhong.li at mail.utoronto.ca> wrote: >>> >>> Dear PETSc?s developers, I hope this email finds you well. I am >>> currently working on a project using PETSc and have encountered a >>> performance issue with the KSPSolve function. Specifically, I have noticed >>> that the time taken by KSPSolve is >>> >>> ZjQcmQRYFpfptBannerStart >>> >>> *This Message Is From an External Sender* >>> >>> This message came from outside your organization. >>> >>> >>> >>> ZjQcmQRYFpfptBannerEnd >>> >>> Dear PETSc?s developers, >>> >>> I hope this email finds you well. >>> >>> I am currently working on a project using PETSc and have encountered a >>> performance issue with the KSPSolve function. Specifically, *I have >>> noticed that the time taken by **KSPSolve** is **almost two times **greater >>> than the CPU time for matrix-vector product multiplied by the number of >>> iteration steps*. I use C++ chrono to record CPU time. >>> >>> For context, I am using a shell system matrix A. Despite my efforts to >>> parallelize the matrix-vector product (Ax), the overall solve time >>> remains higher than the matrix vector product per iteration indicates >>> when multiple threads were used. 
Here are a few details of my setup: >>> >>> - *Matrix Type*: Shell system matrix >>> - *Preconditioner*: Shell PC >>> - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK >>> library, multithreading is enabled >>> >>> I have considered several potential reasons, such as preconditioner >>> setup, additional solver operations, and the inherent overhead of using a >>> shell system matrix. *However, since KSPSolve is a high-level API, I >>> have been unable to pinpoint the exact cause of the increased solve time.* >>> >>> Have you observed the same issue? Could you please provide some >>> experience on how to diagnose and address this performance discrepancy? >>> Any insights or recommendations you could offer would be greatly >>> appreciated. >>> >>> >>> >>> For any performance question like this, we need to see the output of >>> your code run with >>> >>> >>> >>> -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view >>> >>> >>> >>> Thanks, >>> >>> >>> >>> Matt >>> >>> >>> >>> Thank you for your time and assistance. >>> >>> Best regards, >>> >>> Yongzhong >>> >>> ----------------------------------------------------------- >>> >>> *Yongzhong Li* >>> >>> PhD student | Electromagnetics Group >>> >>> Department of Electrical & Computer Engineering >>> >>> University of Toronto >>> >>> https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!YH0MZEitgVyMCSrjwIzvt_s5lUzx3y_DLknXI9TNdBzdEWWAvy0nWkeaPe2b54Q6ioRraV7S3gzxr_k9JDqYXymw_hMJ$ >>> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> >>> >>> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YH0MZEitgVyMCSrjwIzvt_s5lUzx3y_DLknXI9TNdBzdEWWAvy0nWkeaPe2b54Q6ioRraV7S3gzxr_k9JDqYX78QsGNH$ >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> >>> >>> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YH0MZEitgVyMCSrjwIzvt_s5lUzx3y_DLknXI9TNdBzdEWWAvy0nWkeaPe2b54Q6ioRraV7S3gzxr_k9JDqYX78QsGNH$ >>> >>> >>> >>> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YH0MZEitgVyMCSrjwIzvt_s5lUzx3y_DLknXI9TNdBzdEWWAvy0nWkeaPe2b54Q6ioRraV7S3gzxr_k9JDqYX78QsGNH$ >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrsd at gmail.com Tue Jun 25 17:38:27 2024 From: andrsd at gmail.com (David Andrs) Date: Tue, 25 Jun 2024 16:38:27 -0600 Subject: [petsc-users] BuildSystem looks for libOpenCL.a In-Reply-To: <8E7B6E30-BBD1-481D-835B-8693795F5987@petsc.dev> References: <8E7B6E30-BBD1-481D-835B-8693795F5987@petsc.dev> Message-ID: > On Jun 25, 2024, at 16:19, Barry Smith wrote: > > > Did you have a problem with the install? Yes. See below. > > Are you concerned with self.liblist = [['libOpenCL.a'], ['-framework opencl'], ['libOpenCL.lib']] ? Yes. 
> > Even though only the .a library is listed that should be a stand-in for both the static and shared library and it should automatically find the shared library for you. I used `--with-opencl=1` and it did not find the .so. The complete configure line I used: ./configure --COPTFLAGS=-O3 --CXXOPTFLAGS=-O3 --FOPTFLAGS=-O3 --with-debugging=0 --with-64-bit-indices --with-yaml=0 --with-hdf5=1 --with-hwloc=0 --with-mpi=1 --with-pthread=1 --with-shared-libraries --with-ssl=0 --with-scalapack=1 --with-exodusii=1 --with-netcdf=1 --with-pnetcdf=1 --download-ptscotch --with-metis=1 --with-parmetis=1 --with-hypre=1 --with-zlib=1 --with-x=0 --with-pic=1 --with-viennacl=1 --with-viennacl-dir=/home/andrsd/usr --prefix=/home/andrsd/usr --with-metis-dir=/home/andrsd/usr --with-hdf5-dir=/home/andrsd/usr --with-hypre-dir=/home/andrsd/usr --with-pnetcdf-dir=/home/andrsd/usr --with-blas-lib=blas --with-lapack-lib=lapack --with-scalapack-dir=/home/andrsd/usr --with-opencl=1 Attached is a configure.log. David ? > > Barry > > >> On Jun 25, 2024, at 6:10?PM, David Andrs wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. >> Hello! >> >> Is there a reason why the PETSc build system looks for libOpenCL.a, but not for libOpenCL.so on linux platforms? I have a machine with debian 12.5 and nvidia card. It has these packages installed: >> >> cuda-opencl-12-5/unknown,now 12.5.39-1 amd64 [installed,automatic] >> cuda-opencl-dev-12-5/unknown,now 12.5.39-1 amd64 [installed,automatic] >> nvidia-libopencl1/unknown,now 555.42.02-1 amd64 [installed,automatic] >> nvidia-opencl-common/unknown,now 555.42.02-1 amd64 [installed,automatic] >> nvidia-opencl-icd/unknown,now 555.42.02-1 amd64 [installed,automatic] >> opencl-c-headers/stable,now 3.0~2023.02.06-1 all [installed,automatic] >> opencl-clhpp-headers/stable,now 3.0~2023.02.06-1 all [installed,automatic] >> opencl-headers/stable,now 3.0~2023.02.06-1 all [installed] >> >> It only has .so, but no .a >> >> $ find /usr -iname 'libopencl*' >> /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so .1.0.0 >> /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so .1 >> /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so >> /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so .1.0 >> /usr/lib/x86_64-linux-gnu/libOpenCL.so .1.0.0 >> /usr/lib/x86_64-linux-gnu/libOpenCL.so .1 >> >> Are users supposed to use `--with-opencl-include=` and `--with-opencl-lib` switches in this case? >> >> Thanks, >> >> David > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: configure.tar.gz Type: application/x-gzip Size: 106944 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrsd at gmail.com Tue Jun 25 17:55:41 2024 From: andrsd at gmail.com (David Andrs) Date: Tue, 25 Jun 2024 16:55:41 -0600 Subject: [petsc-users] BuildSystem looks for libOpenCL.a In-Reply-To: References: <8E7B6E30-BBD1-481D-835B-8693795F5987@petsc.dev> Message-ID: I spotted the problem. There is no libOpenCL.so in /usr/lib/x86_64-linux-gnu, only .so.1 and .so.1.0.0. That?s why OpenCL is not found. I have to point the build system to /usr/local/cuda-12.5/targets/x86_64-linux/lib/ which actually has .so. 
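(For reference, on Debian/Ubuntu the unversioned /usr/lib/x86_64-linux-gnu/libOpenCL.so development symlink normally comes from the ocl-icd-opencl-dev package; installing it, or passing --with-opencl-include=/--with-opencl-lib= pointing at the CUDA copy as above, are the two usual ways around this.)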
Thanks for your time and help, David > On Jun 25, 2024, at 16:38, David Andrs wrote: > > >> On Jun 25, 2024, at 16:19, Barry Smith wrote: >> >> >> Did you have a problem with the install? > > Yes. See below. > >> >> Are you concerned with self.liblist = [['libOpenCL.a'], ['-framework opencl'], ['libOpenCL.lib']] ? > > Yes. > >> >> Even though only the .a library is listed that should be a stand-in for both the static and shared library and it should automatically find the shared library for you. > > I used `--with-opencl=1` and it did not find the .so. The complete configure line I used: > > ./configure --COPTFLAGS=-O3 --CXXOPTFLAGS=-O3 --FOPTFLAGS=-O3 --with-debugging=0 --with-64-bit-indices --with-yaml=0 --with-hdf5=1 --with-hwloc=0 --with-mpi=1 --with-pthread=1 --with-shared-libraries --with-ssl=0 --with-scalapack=1 --with-exodusii=1 --with-netcdf=1 --with-pnetcdf=1 --download-ptscotch --with-metis=1 --with-parmetis=1 --with-hypre=1 --with-zlib=1 --with-x=0 --with-pic=1 --with-viennacl=1 --with-viennacl-dir=/home/andrsd/usr --prefix=/home/andrsd/usr --with-metis-dir=/home/andrsd/usr --with-hdf5-dir=/home/andrsd/usr --with-hypre-dir=/home/andrsd/usr --with-pnetcdf-dir=/home/andrsd/usr --with-blas-lib=blas --with-lapack-lib=lapack --with-scalapack-dir=/home/andrsd/usr --with-opencl=1 > > Attached is a configure.log. > > David > > > >> >> Barry >> >> >>> On Jun 25, 2024, at 6:10?PM, David Andrs wrote: >>> >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> Hello! >>> >>> Is there a reason why the PETSc build system looks for libOpenCL.a, but not for libOpenCL.so on linux platforms? I have a machine with debian 12.5 and nvidia card. It has these packages installed: >>> >>> cuda-opencl-12-5/unknown,now 12.5.39-1 amd64 [installed,automatic] >>> cuda-opencl-dev-12-5/unknown,now 12.5.39-1 amd64 [installed,automatic] >>> nvidia-libopencl1/unknown,now 555.42.02-1 amd64 [installed,automatic] >>> nvidia-opencl-common/unknown,now 555.42.02-1 amd64 [installed,automatic] >>> nvidia-opencl-icd/unknown,now 555.42.02-1 amd64 [installed,automatic] >>> opencl-c-headers/stable,now 3.0~2023.02.06-1 all [installed,automatic] >>> opencl-clhpp-headers/stable,now 3.0~2023.02.06-1 all [installed,automatic] >>> opencl-headers/stable,now 3.0~2023.02.06-1 all [installed] >>> >>> It only has .so, but no .a >>> >>> $ find /usr -iname 'libopencl*' >>> /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so .1.0.0 >>> /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so .1 >>> /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so >>> /usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so .1.0 >>> /usr/lib/x86_64-linux-gnu/libOpenCL.so .1.0.0 >>> /usr/lib/x86_64-linux-gnu/libOpenCL.so .1 >>> >>> Are users supposed to use `--with-opencl-include=` and `--with-opencl-lib` switches in this case? >>> >>> Thanks, >>> >>> David >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yongzhong.li at mail.utoronto.ca Tue Jun 25 22:19:31 2024 From: yongzhong.li at mail.utoronto.ca (Yongzhong Li) Date: Wed, 26 Jun 2024 03:19:31 +0000 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> Message-ID: Hi Junchao, thank you for your help for these benchmarking test! 
I check out to petsc/main and did a few things to verify from my side, 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node. The results are as follow, $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.5 1.2 1.8 5.2 256 1.5 0.9 1.6 4.7 512 2.7 2.8 6.1 13.2 1024 4.0 4.0 9.3 16.4 2048 7.4 7.3 11.3 39.3 4096 14.2 13.9 19.1 93.4 8192 28.8 26.3 25.4 31.3 16384 54.1 25.8 26.7 33.8 32768 109.8 25.7 24.2 56.0 65536 220.2 24.4 26.5 89.0 131072 424.1 31.5 36.1 149.6 262144 898.1 37.1 53.9 286.1 524288 1754.6 48.7 100.3 1122.2 1048576 3645.8 86.5 347.9 2950.4 2097152 7371.4 308.7 1440.6 6874.9 $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.9 1.2 1.9 5.2 256 1.5 1.0 1.7 4.7 512 2.7 2.8 6.1 12.0 1024 3.9 4.0 9.3 16.8 2048 7.4 7.3 10.4 41.3 4096 14.0 13.8 18.6 84.2 8192 27.0 21.3 43.8 177.5 16384 54.1 34.1 89.1 330.4 32768 110.4 82.1 203.5 781.1 65536 213.0 191.8 423.9 1696.4 131072 428.7 360.2 934.0 4080.0 262144 883.4 723.2 1745.6 10120.7 524288 1817.5 1466.1 4751.4 23217.2 1048576 3611.0 3796.5 11814.9 48687.7 2097152 7401.9 10592.0 27543.2 106565.4 I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca >From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. Thank you, Yongzhong From: Junchao Zhang Date: Tuesday, June 25, 2024 at 6:34?PM To: Matthew Knepley Cc: Yongzhong Li , Pierre Jolivet , petsc-users at mcs.anl.gov Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Hi, Yongzhong, Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. 
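On the last question above: the connection to dense matrix-vector products is that classical Gram-Schmidt in GMRES orthogonalizes the new Krylov vector w against the whole existing basis at once. Writing the k basis vectors as the columns of an n-by-k matrix V_k, one orthogonalization step computes h = V_k^H w (this is what VecMDot does) and then w <- w - V_k h (this is what VecMAXPY does). Both are BLAS-2 operations on a tall, skinny dense matrix, which is why they can be forwarded to (Z)GEMV even though the system matrix itself is sparse or a shell. Whether the GEMV path is actually taken in a given build is easiest to confirm exactly as above, with MKL_VERBOSE=1 on a run that exercises KSPGMRESOrthog.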
I then configured petsc with openblas, and I did see better performance with more threads:

$ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4
  Vector(N)   VecMDot-3   VecMDot-8  VecMDot-30  (us)
--------------------------------------------------------------------------
        128         2.0         2.5         6.1
        256         1.8         2.7         7.0
        512         2.1         3.1         8.6
       1024         2.7         4.0        12.3
       2048         3.8         6.3        28.0
       4096         6.1        10.6        42.4
       8192        10.9        21.8        79.5
      16384        21.2        39.4       149.6
      32768        45.9        75.7       224.6
      65536       142.2       215.8       732.1
     131072       169.1       233.2      1729.4
     262144       367.5       830.0      4159.2
     524288       999.2      1718.1      8538.5
    1048576      2113.5      4082.1     18274.8
    2097152      5392.6     10273.4     43273.4

$ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4
  Vector(N)   VecMDot-3   VecMDot-8  VecMDot-30  (us)
--------------------------------------------------------------------------
        128         2.0         2.5         6.0
        256         1.8         2.7        15.0
        512         2.1         9.0        16.6
       1024         2.6         8.7        16.1
       2048         7.7        10.3        20.5
       4096         9.9        11.4        25.9
       8192        14.5        22.1        39.6
      16384        25.1        27.8        67.8
      32768        44.7        95.7        91.5
      65536        82.1       156.8       165.1
     131072       194.0       335.1       341.5
     262144       388.5       380.8       612.9
     524288      1046.7       967.1      1653.3
    1048576      1997.4      2169.0      4034.4
    2097152      5502.9      5787.3     12608.1

The tall-and-skinny matrices in KSPGMRESOrthog vary in width, so the average speedup depends on that mix. I suggest you run ex2k to see whether oneMKL can speed up these kernels in your environment.

--Junchao Zhang


On Mon, Jun 24, 2024 at 11:35 AM Junchao Zhang wrote:

Let me run some examples on our end to see whether the code calls the expected functions.

--Junchao Zhang


On Mon, Jun 24, 2024 at 10:46 AM Matthew Knepley wrote:

On Mon, Jun 24, 2024 at 11:21 AM Yongzhong Li wrote:

Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues?

We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!Y2nH4BM7-Zuq2WFm8kMqkAVUO8uEpIeFLvoio1A15HZpChRoT5UWCnx3vPAn8K1wS-3wdspaUuQn7-qaioNgsyqTHBjpGzTTKTQ$

The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc.) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores", which make little difference.
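As a rough worked illustration of that bandwidth argument (the 200 GB/s figure is only an assumed per-node number, not a measurement of this machine): a dot product of two length-N double-precision vectors streams 2 * 8 * N = 16N bytes from memory while doing 2N flops, i.e. 1/8 flop per byte. At 200 GB/s of sustained memory bandwidth that caps the kernel at roughly 25 GFlop/s, and the ceiling is the same whether 8 or 64 cores share the socket; the complex (Z) kernels in this thread have the same character. The per-node memory bandwidth, not the core count, sets the scaling.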
Thanks, Matt Thank you, Yongzhong From: Pierre Jolivet > Date: Sunday, June 23, 2024 at 12:41?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) --> Setting up matrix-vector products... Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines --> Solving the system... Excitation 1 of 1... ================================================ Iterative solve completed in 7435 ms. CONVERGED: rtol. Iterations: 72 Final relative residual norm: 9.22287e-07 ================================================ [CPU TIME] System solution: 2.27160000e+02 s. [WALL TIME] System solution: 7.44387218e+00 s. However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. Thanks, Pierre Thanks, Yongzhong From: Matthew Knepley > Date: Saturday, June 22, 2024 at 5:56?PM To: Yongzhong Li > Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. 
ZjQcmQRYFpfptBannerEnd MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE 
ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. Thanks, Matt Thanks, Yongzhong From: Junchao Zhang > Date: Saturday, June 22, 2024 at 9:40?AM To: Yongzhong Li > Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used $ cd src/mat/tests $ make ex1 $ MKL_VERBOSE=1 ./ex1 --Junchao Zhang On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: I am using export MKL_VERBOSE=1 ./xx in the bash file, do I have to use - ksp_converged_reason? Thanks, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:47?PM To: Yongzhong Li > Cc: Junchao Zhang >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? How do you set the variable? $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 [...] On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hello all, I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? 
Best, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:36?AM To: Junchao Zhang > Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: This Message Is From an External Sender This message came from outside your organization. I remember there are some MKL env vars to print MKL routines called. The environment variable is MKL_VERBOSE Thanks, Pierre Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up --Junchao Zhang On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, // Static variable to keep track of the stage counter static int stageCounter = 1; // Generate a unique stage name std::ostringstream oss; oss << "Stage " << stageCounter << " of Code"; std::string stageName = oss.str(); // Register the stage PetscLogStage stagenum; PetscLogStageRegister(stageName.c_str(), &stagenum); PetscLogStagePush(stagenum); KSPSolve(*ksp_ptr, b, x); PetscLogStagePop(); stageCounter++; I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. >From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. My questions is, >From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? Thank you, Yongzhong From: Barry Smith > Date: Friday, June 14, 2024 at 11:36?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue I am a bit confused. 
Without the initial guess computation, there are still a bunch of events I don't understand MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 Finally there are a huge number of MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. Barry On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically KSPGuess Object: 1 MPI process type: fischer Model 1, size 200 However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? Thank you! Yongzhong From: Barry Smith > Date: Thursday, June 13, 2024 at 2:14?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? Thanks Barry On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Matt, I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. Thanks! 
Yongzhong From: Matthew Knepley > Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!Y2nH4BM7-Zuq2WFm8kMqkAVUO8uEpIeFLvoio1A15HZpChRoT5UWCnx3vPAn8K1wS-3wdspaUuQn7-qaioNgsyqTHBjpoj9DHuo$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!Y2nH4BM7-Zuq2WFm8kMqkAVUO8uEpIeFLvoio1A15HZpChRoT5UWCnx3vPAn8K1wS-3wdspaUuQn7-qaioNgsyqTHBjpNqVj8Kc$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!Y2nH4BM7-Zuq2WFm8kMqkAVUO8uEpIeFLvoio1A15HZpChRoT5UWCnx3vPAn8K1wS-3wdspaUuQn7-qaioNgsyqTHBjpNqVj8Kc$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!Y2nH4BM7-Zuq2WFm8kMqkAVUO8uEpIeFLvoio1A15HZpChRoT5UWCnx3vPAn8K1wS-3wdspaUuQn7-qaioNgsyqTHBjpNqVj8Kc$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From maitri.ksh at gmail.com Wed Jun 26 07:29:27 2024 From: maitri.ksh at gmail.com (maitri ksh) Date: Wed, 26 Jun 2024 15:29:27 +0300 Subject: [petsc-users] Issues Compiling petsc4py with Cython In-Reply-To: <879eb7b2-16e1-044e-299b-669d062f8650@fastmail.org> References: <879eb7b2-16e1-044e-299b-669d062f8650@fastmail.org> Message-ID: It was a version incompatibility issue, resolved by configuring it with petsc. Thank you stefano & satish. On Wed, Jun 26, 2024 at 12:34?AM Satish Balay wrote: > Best to get latest petsc and build petsc4py along with petsc: > > For example: > > balay at petsc-gpu-02:/scratch/balay$ wget -q > https://urldefense.us/v3/__https://web.cels.anl.gov/projects/petsc/download/release-snapshots/petsc-3.21.2.tar.gz__;!!G_uCfscf7eWS!ZtQZK7hSb7b0O7qL1NxuDxYo3aNcfvkMEs6i_bMC-NNUYojWqG8zVItjnY-Kvihf1GDSMdRoUuj7yOs32AUG7ysB$ > balay at petsc-gpu-02:/scratch/balay$ tar -xf petsc-3.21.2.tar.gz > balay at petsc-gpu-02:/scratch/balay$ cd petsc-3.21.2/ > balay at petsc-gpu-02:/scratch/balay/petsc-3.21.2$ ./configure > --download-mpich --download-fblaslapack --download-mpi4py --with-petsc4py=1 > && make && make check && make petsc4pytest > > ============================================================================================= > Configuring PETSc to compile on your system > > ============================================================================================= > > testPlaceArray (test_vec.TestVecShared) ... ok > testPlaceArray (test_vec.TestVecShared) ... ok > > Ran 6154 tests in 134.314s > OK > ===================================== > balay at petsc-gpu-02:/scratch/balay/petsc-3.21.2$ > > > Satish > > On Tue, 25 Jun 2024, Stefano Zampini wrote: > > > Which version of petsc4py is it? > > > > Il giorno mar 25 giu 2024 alle ore 20:35 maitri ksh < > maitri.ksh at gmail.com> > > ha scritto: > > > > > I am currently working on integrating petsc4py, but I am encountering > > > persistent compilation issues related to Cython. Below are the details > of > > > my setup and the errors I am facing. I would greatly appreciate any > > > assistance or guidance on how > > > ZjQcmQRYFpfptBannerStart > > > This Message Is From an External Sender > > > This message came from outside your organization. > > > > > > ZjQcmQRYFpfptBannerEnd > > > > > > I am currently working on integrating petsc4py, but I am encountering > > > persistent compilation issues related to Cython. Below are the details > of > > > my setup and the errors I am facing. I would greatly appreciate any > > > assistance or guidance on how to resolve these issues. > > > System Configuration: > > > > > > - *PETSc Architecture*: linux-gnu-c-debug > > > - *Python Environment*: Python 3.6 (virtual environment) > > > - *Cython Version*: 3.0.10 > > > - *Compiler*: /gcc11.2/bin/gcc > > > > > > During the build process, I received multiple warnings and errors > related > > > to the use of noexcept, nogil, and except in function declarations. > Here > > > are some of the specific errors: > > > > > > cythonizing 'petsc4py.PETSc.pyx' -> 'petsc4py.PETSc.c' > > > warning: petsc4py.PETSc.pyx:1:0: Dotted filenames > ('petsc4py.PETSc.pyx') > > > are deprecated. Please use the normal Python package directory layout. 
> > > > /home/maitri.ksh/Maitri/petsc/petsc4py/myenv/lib64/python3.6/site-packages/Cython/Compiler/Main.py:381: > > > FutureWarning: Cython directive 'language_level' not set, using '3str' > for > > > now (Py3). This has changed from earlier releases! File: > > > include/petsc4py/PETSc.pxd > > > tree = Parsing.p_module(s, pxd, full_module_name) > > > warning: PETSc/PETSc.pyx:53:48: The keyword 'nogil' should appear at > the > > > end of the function signature line. Placing it before 'except' or > > > 'noexcept' will be disallowed in a future version of Cython. > > > warning: PETSc/petscvec.pxi:406:79: The keyword 'nogil' should appear > at > > > the end of the function signature line. Placing it before 'except' or > > > 'noexcept' will be disallowed in a future version of Cython. > > > warning: PETSc/petscvec.pxi:411:79: The keyword 'nogil' should appear > at > > > the end of the function signature line. Placing it before 'except' or > > > 'noexcept' will be disallowed in a future version of Cython. > > > ... > > > > > > Error compiling Cython file: > > > ... > > > PETSc/petscobj.pxi:91:29: Cannot assign type 'int (void *) except? -1 > > > nogil' to 'int (*)(void *) noexcept'. Exception values are > incompatible. > > > Suggest adding 'noexcept' to the type of 'PetscDelPyDict'. > > > ... > > > PETSc/cyclicgc.pxi:34:20: Cannot assign type 'int (PyObject *, > visitproc, > > > void *) except? -1' to 'traverseproc *'. Exception values are > incompatible. > > > Suggest adding 'noexcept' to the type of 'tp_traverse'. > > > ... > > > PETSc/cyclicgc.pxi:35:20: Cannot assign type 'int (PyObject *) except? > -1' > > > to 'inquiry *'. Exception values are incompatible. Suggest adding > > > 'noexcept' to the type of 'tp_clear'. > > > ... > > > PETSc/PETSc.pyx:351:17: Cannot assign type 'void (void) except * > nogil' to > > > 'void (*)(void) noexcept'. Exception values are incompatible. Suggest > > > adding 'noexcept' to the type of 'finalize'. > > > error: Cython failure: 'petsc4py.PETSc.pyx' -> 'petsc4py.PETSc.c' > > > > > > Any advice on building petsc4py in environments similar to mine would > be > > > greatly appreciated. > > > Thanks, > > > Maitri > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefano.zampini at gmail.com Wed Jun 26 07:40:11 2024 From: stefano.zampini at gmail.com (Stefano Zampini) Date: Wed, 26 Jun 2024 14:40:11 +0200 Subject: [petsc-users] Issues Compiling petsc4py with Cython In-Reply-To: References: <879eb7b2-16e1-044e-299b-669d062f8650@fastmail.org> Message-ID: > It was a version incompatibility issue, resolved by configuring it with petsc. Note that "configuring with PETSc" is not a standard Python installation procedure. You should be able to install it in your virtual environment by just doing $ ... activate venv .... $ cd src/binding/petsc4py $ PETSC_DIR=... ; PETSC_ARCH=....; pip install . see also https://urldefense.us/v3/__https://petsc.org/release/petsc4py/install.html__;!!G_uCfscf7eWS!YOhounZLijkMQ6SoKMm5OyCugdS8yKtXQk2bpNiYOjsdPUTHQ3ceL1eZnBJM23hEv1qL6F9eyGtWgoGV_Gqu4mZKXuJJWGQ$ Il giorno mer 26 giu 2024 alle ore 14:30 maitri ksh ha scritto: > It was a version incompatibility issue, resolved by configuring it with > petsc. Thank you stefano & satish. On Wed, Jun 26, 2024 at 12: 34 AM Satish > Balay wrote: Best to get latest petsc and > build petsc4py along > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. 
> > ZjQcmQRYFpfptBannerEnd > It was a version incompatibility issue, resolved by configuring it with > petsc. Thank you stefano & satish. > > On Wed, Jun 26, 2024 at 12:34?AM Satish Balay > wrote: > >> Best to get latest petsc and build petsc4py along with petsc: >> >> For example: >> >> balay at petsc-gpu-02:/scratch/balay$ wget -q >> https://urldefense.us/v3/__https://web.cels.anl.gov/projects/petsc/download/release-snapshots/petsc-3.21.2.tar.gz__;!!G_uCfscf7eWS!YOhounZLijkMQ6SoKMm5OyCugdS8yKtXQk2bpNiYOjsdPUTHQ3ceL1eZnBJM23hEv1qL6F9eyGtWgoGV_Gqu4mZKbmitMG8$ >> >> balay at petsc-gpu-02:/scratch/balay$ tar -xf petsc-3.21.2.tar.gz >> balay at petsc-gpu-02:/scratch/balay$ cd petsc-3.21.2/ >> balay at petsc-gpu-02:/scratch/balay/petsc-3.21.2$ ./configure >> --download-mpich --download-fblaslapack --download-mpi4py --with-petsc4py=1 >> && make && make check && make petsc4pytest >> >> ============================================================================================= >> Configuring PETSc to compile on your system >> >> ============================================================================================= >> >> testPlaceArray (test_vec.TestVecShared) ... ok >> testPlaceArray (test_vec.TestVecShared) ... ok >> >> Ran 6154 tests in 134.314s >> OK >> ===================================== >> balay at petsc-gpu-02:/scratch/balay/petsc-3.21.2$ >> >> >> Satish >> >> On Tue, 25 Jun 2024, Stefano Zampini wrote: >> >> > Which version of petsc4py is it? >> > >> > Il giorno mar 25 giu 2024 alle ore 20:35 maitri ksh < >> maitri.ksh at gmail.com> >> > ha scritto: >> > >> > > I am currently working on integrating petsc4py, but I am encountering >> > > persistent compilation issues related to Cython. Below are the >> details of >> > > my setup and the errors I am facing. I would greatly appreciate any >> > > assistance or guidance on how >> > > ZjQcmQRYFpfptBannerStart >> > > This Message Is From an External Sender >> > > This message came from outside your organization. >> > > >> > > ZjQcmQRYFpfptBannerEnd >> > > >> > > I am currently working on integrating petsc4py, but I am encountering >> > > persistent compilation issues related to Cython. Below are the >> details of >> > > my setup and the errors I am facing. I would greatly appreciate any >> > > assistance or guidance on how to resolve these issues. >> > > System Configuration: >> > > >> > > - *PETSc Architecture*: linux-gnu-c-debug >> > > - *Python Environment*: Python 3.6 (virtual environment) >> > > - *Cython Version*: 3.0.10 >> > > - *Compiler*: /gcc11.2/bin/gcc >> > > >> > > During the build process, I received multiple warnings and errors >> related >> > > to the use of noexcept, nogil, and except in function declarations. >> Here >> > > are some of the specific errors: >> > > >> > > cythonizing 'petsc4py.PETSc.pyx' -> 'petsc4py.PETSc.c' >> > > warning: petsc4py.PETSc.pyx:1:0: Dotted filenames >> ('petsc4py.PETSc.pyx') >> > > are deprecated. Please use the normal Python package directory layout. >> > > >> /home/maitri.ksh/Maitri/petsc/petsc4py/myenv/lib64/python3.6/site-packages/Cython/Compiler/Main.py:381: >> > > FutureWarning: Cython directive 'language_level' not set, using >> '3str' for >> > > now (Py3). This has changed from earlier releases! File: >> > > include/petsc4py/PETSc.pxd >> > > tree = Parsing.p_module(s, pxd, full_module_name) >> > > warning: PETSc/PETSc.pyx:53:48: The keyword 'nogil' should appear at >> the >> > > end of the function signature line. 
Placing it before 'except' or >> > > 'noexcept' will be disallowed in a future version of Cython. >> > > warning: PETSc/petscvec.pxi:406:79: The keyword 'nogil' should appear >> at >> > > the end of the function signature line. Placing it before 'except' or >> > > 'noexcept' will be disallowed in a future version of Cython. >> > > warning: PETSc/petscvec.pxi:411:79: The keyword 'nogil' should appear >> at >> > > the end of the function signature line. Placing it before 'except' or >> > > 'noexcept' will be disallowed in a future version of Cython. >> > > ... >> > > >> > > Error compiling Cython file: >> > > ... >> > > PETSc/petscobj.pxi:91:29: Cannot assign type 'int (void *) except? -1 >> > > nogil' to 'int (*)(void *) noexcept'. Exception values are >> incompatible. >> > > Suggest adding 'noexcept' to the type of 'PetscDelPyDict'. >> > > ... >> > > PETSc/cyclicgc.pxi:34:20: Cannot assign type 'int (PyObject *, >> visitproc, >> > > void *) except? -1' to 'traverseproc *'. Exception values are >> incompatible. >> > > Suggest adding 'noexcept' to the type of 'tp_traverse'. >> > > ... >> > > PETSc/cyclicgc.pxi:35:20: Cannot assign type 'int (PyObject *) >> except? -1' >> > > to 'inquiry *'. Exception values are incompatible. Suggest adding >> > > 'noexcept' to the type of 'tp_clear'. >> > > ... >> > > PETSc/PETSc.pyx:351:17: Cannot assign type 'void (void) except * >> nogil' to >> > > 'void (*)(void) noexcept'. Exception values are incompatible. Suggest >> > > adding 'noexcept' to the type of 'finalize'. >> > > error: Cython failure: 'petsc4py.PETSc.pyx' -> 'petsc4py.PETSc.c' >> > > >> > > Any advice on building petsc4py in environments similar to mine would >> be >> > > greatly appreciated. >> > > Thanks, >> > > Maitri >> > > >> > >> > >> > >> >> -- Stefano -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Wed Jun 26 09:30:12 2024 From: bsmith at petsc.dev (Barry Smith) Date: Wed, 26 Jun 2024 10:30:12 -0400 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> Message-ID: In a debug version of PETSc run your application in a debugger and put a break point in VecMultiDot_Seq_GEMV. Then next through the code from that point to see what decision it makes about using dgemv() to see why it is not getting into the Intel code. > On Jun 25, 2024, at 11:19?PM, Yongzhong Li wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Junchao, thank you for your help for these benchmarking test! > > I check out to petsc/main and did a few things to verify from my side, > > 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node. 
The results are as follow, > > $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 14.5 1.2 1.8 5.2 > 256 1.5 0.9 1.6 4.7 > 512 2.7 2.8 6.1 13.2 > 1024 4.0 4.0 9.3 16.4 > 2048 7.4 7.3 11.3 39.3 > 4096 14.2 13.9 19.1 93.4 > 8192 28.8 26.3 25.4 31.3 > 16384 54.1 25.8 26.7 33.8 > 32768 109.8 25.7 24.2 56.0 > 65536 220.2 24.4 26.5 89.0 > 131072 424.1 31.5 36.1 149.6 > 262144 898.1 37.1 53.9 286.1 > 524288 1754.6 48.7 100.3 1122.2 > 1048576 3645.8 86.5 347.9 2950.4 > 2097152 7371.4 308.7 1440.6 6874.9 > > $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 14.9 1.2 1.9 5.2 > 256 1.5 1.0 1.7 4.7 > 512 2.7 2.8 6.1 12.0 > 1024 3.9 4.0 9.3 16.8 > 2048 7.4 7.3 10.4 41.3 > 4096 14.0 13.8 18.6 84.2 > 8192 27.0 21.3 43.8 177.5 > 16384 54.1 34.1 89.1 330.4 > 32768 110.4 82.1 203.5 781.1 > 65536 213.0 191.8 423.9 1696.4 > 131072 428.7 360.2 934.0 4080.0 > 262144 883.4 723.2 1745.6 10120.7 > 524288 1817.5 1466.1 4751.4 23217.2 > 1048576 3611.0 3796.5 11814.9 48687.7 > 2097152 7401.9 10592.0 27543.2 106565.4 > > I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like > > MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca > > From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. > > However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. > > I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. > > Thank you, > Yongzhong > > From: Junchao Zhang > > Date: Tuesday, June 25, 2024 at 6:34?PM > To: Matthew Knepley > > Cc: Yongzhong Li >, Pierre Jolivet >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > Hi, Yongzhong, > Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? > petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. 
I then configured petsc with openblas, I did see better performance with more threads > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.1 > 256 1.8 2.7 7.0 > 512 2.1 3.1 8.6 > 1024 2.7 4.0 12.3 > 2048 3.8 6.3 28.0 > 4096 6.1 10.6 42.4 > 8192 10.9 21.8 79.5 > 16384 21.2 39.4 149.6 > 32768 45.9 75.7 224.6 > 65536 142.2 215.8 732.1 > 131072 169.1 233.2 1729.4 > 262144 367.5 830.0 4159.2 > 524288 999.2 1718.1 8538.5 > 1048576 2113.5 4082.1 18274.8 > 2097152 5392.6 10273.4 43273.4 > > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.0 > 256 1.8 2.7 15.0 > 512 2.1 9.0 16.6 > 1024 2.6 8.7 16.1 > 2048 7.7 10.3 20.5 > 4096 9.9 11.4 25.9 > 8192 14.5 22.1 39.6 > 16384 25.1 27.8 67.8 > 32768 44.7 95.7 91.5 > 65536 82.1 156.8 165.1 > 131072 194.0 335.1 341.5 > 262144 388.5 380.8 612.9 > 524288 1046.7 967.1 1653.3 > 1048576 1997.4 2169.0 4034.4 > 2097152 5502.9 5787.3 12608.1 > > The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average speedup depends on components. So I suggest you run ex2k to see in your environment whether oneMKL can speedup the kernels. > > --Junchao Zhang > > > On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: > Let me run some examples on our end to see whether the code calls expected functions. > > --Junchao Zhang > > > On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: > On Mon, Jun 24, 2024 at 11:?21 AM Yongzhong Li wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li > wrote: > Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? Thank you, Yongzhong From:? > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? > > We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!cX1H6d0CktL2_obQbTo1hQgiLZtLPQom3MTt_yRHnW3ghcWAHjVCbY9MvCM3SAtfJ2jdiP1S7kZgJUGJhHOPQ7s$ > > The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference. 
> > Thanks, > > Matt > > Thank you, > Yongzhong > > From: Pierre Jolivet > > Date: Sunday, June 23, 2024 at 12:41?AM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > > > On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) > > --> Setting up matrix-vector products... > > Mat Object: 1 MPI process > type: seqaijmkl > rows=16490, cols=35937 > total: nonzeros=128496, allocated nonzeros=128496 > total number of mallocs used during MatSetValues calls=0 > not using I-node routines > Mat Object: 1 MPI process > type: seqaijmkl > rows=16490, cols=35937 > total: nonzeros=128496, allocated nonzeros=128496 > total number of mallocs used during MatSetValues calls=0 > not using I-node routines > > --> Solving the system... > > Excitation 1 of 1... > > ================================================ > Iterative solve completed in 7435 ms. > CONVERGED: rtol. > Iterations: 72 > Final relative residual norm: 9.22287e-07 > ================================================ > [CPU TIME] System solution: 2.27160000e+02 s. > [WALL TIME] System solution: 7.44387218e+00 s. > > However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? > > SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. > > Thanks, > Pierre > > > Thanks, > Yongzhong > > > From: Matthew Knepley > > Date: Saturday, June 22, 2024 at 5:56?PM > To: Yongzhong Li > > Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: > MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. 
> > ZjQcmQRYFpfptBannerEnd > MKL_VERBOSE=1 ./ex1 > > matrix nonzeros = 100, allocated nonzeros = 100 > MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread > MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us 
CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? > > Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. > > Thanks, > > Matt > > > Thanks, > Yongzhong > > From: Junchao Zhang > > Date: Saturday, June 22, 2024 at 9:40?AM > To: Yongzhong Li > > Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used > $ cd src/mat/tests > $ make ex1 > $ MKL_VERBOSE=1 ./ex1 > > --Junchao Zhang > > > On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: > I am using > > export MKL_VERBOSE=1 > ./xx > > in the bash file, do I have to use - ksp_converged_reason? > > Thanks, > Yongzhong > > From: Pierre Jolivet > > Date: Friday, June 21, 2024 at 1:47?PM > To: Yongzhong Li > > Cc: Junchao Zhang >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > How do you set the variable? > > $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason > MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread > MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > [...] > > On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. 
> Hello all, > > I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? > > Best, > Yongzhong > > > From: Pierre Jolivet > > Date: Friday, June 21, 2024 at 1:36?AM > To: Junchao Zhang > > Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > I remember there are some MKL env vars to print MKL routines called. > > The environment variable is MKL_VERBOSE > > Thanks, > Pierre > > Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up > > --Junchao Zhang > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: > This Message Is From an External Sender > This message came from outside your organization. > > Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, > > // Static variable to keep track of the stage counter > static int stageCounter = 1; > > // Generate a unique stage name > std::ostringstream oss; > oss << "Stage " << stageCounter << " of Code"; > std::string stageName = oss.str(); > > // Register the stage > PetscLogStage stagenum; > > PetscLogStageRegister(stageName.c_str(), &stagenum); > PetscLogStagePush(stagenum); > > KSPSolve(*ksp_ptr, b, x); > > PetscLogStagePop(); > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. > > To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. > > From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > From: Barry Smith > > Date: Friday, June 14, 2024 at 11:36?AM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > I am a bit confused. 
Without the initial guess computation, there are still a bunch of events I don't understand > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > Finally there are a huge number of > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? > > The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. > > Barry > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically > > KSPGuess Object: 1 MPI process > type: fischer > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? > > Thank you! > Yongzhong > > From: Barry Smith > > Date: Thursday, June 13, 2024 at 2:14?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? > > Thanks > > Barry > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Matt, > > I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. > > Thanks! 
> Yongzhong > > From: Matthew Knepley > > Date: Wednesday, June 12, 2024 at 6:46?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: > Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Dear PETSc?s developers, > I hope this email finds you well. > I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. > For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: > Matrix Type: Shell system matrix > Preconditioner: Shell PC > Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled > I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. > Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. > > For any performance question like this, we need to see the output of your code run with > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > Thanks, > > Matt > > Thank you for your time and assistance. > Best regards, > Yongzhong > ----------------------------------------------------------- > Yongzhong Li > PhD student | Electromagnetics Group > Department of Electrical & Computer Engineering > University of Toronto > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!cX1H6d0CktL2_obQbTo1hQgiLZtLPQom3MTt_yRHnW3ghcWAHjVCbY9MvCM3SAtfJ2jdiP1S7kZgJUGJeQZ8YSQ$ > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!cX1H6d0CktL2_obQbTo1hQgiLZtLPQom3MTt_yRHnW3ghcWAHjVCbY9MvCM3SAtfJ2jdiP1S7kZgJUGJriV_N_c$ > > > > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
> -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!cX1H6d0CktL2_obQbTo1hQgiLZtLPQom3MTt_yRHnW3ghcWAHjVCbY9MvCM3SAtfJ2jdiP1S7kZgJUGJriV_N_c$ > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!cX1H6d0CktL2_obQbTo1hQgiLZtLPQom3MTt_yRHnW3ghcWAHjVCbY9MvCM3SAtfJ2jdiP1S7kZgJUGJriV_N_c$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Wed Jun 26 10:12:58 2024 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Wed, 26 Jun 2024 10:12:58 -0500 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> Message-ID: Yongzhong, Try Barry's approach first. BTW, I ran another petsc test. You can see GEMV was used in KSPSolve. You could also try this one. $ cd src/ksp/ksp/tutorials $ make bench_kspsolve $ MKL_VERBOSE=1 OMP_PROC_BIND=spread MKL_NUM_THREADS=8 ./bench_kspsolve -split_ksp -mat_type aijmkl =========================================== Test: KSP performance - Poisson Input matrix: 27-pt finite difference stencil -n 100 DoFs = 1000000 Number of nonzeros = 26463592 Step1 - creating Vecs and Mat... Step2a - running PCSetUp()... Step2b - running KSPSolve()... MKL_VERBOSE oneMKL 2022.0 Product build 20211112 for Intel(R) 64 architecture Intel(R) Architecture processors, Lnx 3.18GHz lp64 gnu_thread MKL_VERBOSE ZSCAL(1000000,0x7ffccef20c58,0x7fa9432b5e60,1) 474.25us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 MKL_VERBOSE ZSCAL(1000000,0x7ffccef20c58,0x7fa9441f8260,1) 1.93ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 MKL_VERBOSE *ZGEMV*(C,1000000,2,0x7ffccef20c20,0x7fa9432b5e60,1000000,0x7fa94513a660,1,0x7ffccef20c30,0x1c4b610,1) 1.86ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 MKL_VERBOSE ZSCAL(1000000,0x7ffccef20c58,0x7fa94513a660,1) 2.55ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 MKL_VERBOSE *ZGEMV*(C,1000000,3,0x7ffccef20c20,0x7fa9432b5e60,1000000,0x7fa8cb7a6660,1,0x7ffccef20c30,0x1c4b610,1) 2.95ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 --Junchao Zhang On Tue, Jun 25, 2024 at 10:19?PM Yongzhong Li wrote: > Hi Junchao, thank you for your help for these benchmarking test! > > I check out to petsc/main and did a few things to verify from my side, > > 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute > node. 
The results are as follow, > > $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > > -------------------------------------------------------------------------- > > 128 14.5 1.2 1.8 5.2 > > 256 1.5 0.9 1.6 4.7 > > 512 2.7 2.8 6.1 13.2 > > 1024 4.0 4.0 9.3 16.4 > > 2048 7.4 7.3 11.3 39.3 > > 4096 14.2 13.9 19.1 93.4 > > 8192 28.8 26.3 25.4 31.3 > > 16384 54.1 25.8 26.7 33.8 > > 32768 109.8 25.7 24.2 56.0 > > 65536 220.2 24.4 26.5 89.0 > > 131072 424.1 31.5 36.1 149.6 > > 262144 898.1 37.1 53.9 286.1 > > 524288 1754.6 48.7 100.3 1122.2 > > 1048576 3645.8 86.5 347.9 2950.4 > > 2097152 7371.4 308.7 1440.6 6874.9 > > > > $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 > > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > > -------------------------------------------------------------------------- > > 128 14.9 1.2 1.9 5.2 > > 256 1.5 1.0 1.7 4.7 > > 512 2.7 2.8 6.1 12.0 > > 1024 3.9 4.0 9.3 16.8 > > 2048 7.4 7.3 10.4 41.3 > > 4096 14.0 13.8 18.6 84.2 > > 8192 27.0 21.3 43.8 177.5 > > 16384 54.1 34.1 89.1 330.4 > > 32768 110.4 82.1 203.5 781.1 > > 65536 213.0 191.8 423.9 1696.4 > > 131072 428.7 360.2 934.0 4080.0 > > 262144 883.4 723.2 1745.6 10120.7 > > 524288 1817.5 1466.1 4751.4 23217.2 > > 1048576 3611.0 3796.5 11814.9 48687.7 > > 2097152 7401.9 10592.0 27543.2 106565.4 > > > I can see the speed up brought by more MKL threads, and if I set > NKL_VERBOSE to 1, I can see something like > > > > > > *MKL_VERBOSE > ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) > 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca *From my understanding, > the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute > node and is using ZGEMV MKL BLAS. > > However, when I ran my own program and set MKL_VERBOSE to 1, it is very > strange that I still can?t find any MKL outputs, though I can see from the > PETSc log that VecMDot and VecMAXPY() are called. > > > I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a > way that is similar to ex2k test? Should I expect to see MKL outputs for > whatever linear system I solve with KSPGMRES? Does it relate to if it is > dense matrix or sparse matrix, although I am not really understand why > VecMDot/MAXPY() have something to do with dense matrix-vector > multiplication. > > Thank you, > > Yongzhong > > *From: *Junchao Zhang > *Date: *Tuesday, June 25, 2024 at 6:34?PM > *To: *Matthew Knepley > *Cc: *Yongzhong Li , Pierre Jolivet < > pierre at joliv.et>, petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > Hi, Yongzhong, > > Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we > can speed up the two with OpenMP threads, then we can speed up > KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in > dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny > matrices ). So with MKL_VERBOSE=1, you should see something like > "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with > petsc/main? > > petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran > VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was > strange to see no speedup. 
I then configured petsc with openblas, I did > see better performance with more threads > > > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.1 > 256 1.8 2.7 7.0 > 512 2.1 3.1 8.6 > 1024 2.7 4.0 12.3 > 2048 3.8 6.3 28.0 > 4096 6.1 10.6 42.4 > 8192 10.9 21.8 79.5 > 16384 21.2 39.4 149.6 > 32768 45.9 75.7 224.6 > 65536 142.2 215.8 732.1 > 131072 169.1 233.2 1729.4 > 262144 367.5 830.0 4159.2 > 524288 999.2 1718.1 8538.5 > 1048576 2113.5 4082.1 18274.8 > 2097152 5392.6 10273.4 43273.4 > > > > > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.0 > 256 1.8 2.7 15.0 > 512 2.1 9.0 16.6 > 1024 2.6 8.7 16.1 > 2048 7.7 10.3 20.5 > 4096 9.9 11.4 25.9 > 8192 14.5 22.1 39.6 > 16384 25.1 27.8 67.8 > 32768 44.7 95.7 91.5 > 65536 82.1 156.8 165.1 > 131072 194.0 335.1 341.5 > 262144 388.5 380.8 612.9 > 524288 1046.7 967.1 1653.3 > 1048576 1997.4 2169.0 4034.4 > 2097152 5502.9 5787.3 12608.1 > > > > The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average > speedup depends on components. So I suggest you run ex2k to see in your > environment whether oneMKL can speedup the kernels. > > > > --Junchao Zhang > > > > > > On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: > > Let me run some examples on our end to see whether the code calls expected > functions. > > > --Junchao Zhang > > > > > > On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: > > On Mon, Jun 24, 2024 at 11: 21 AM Yongzhong Li utoronto. ca> wrote: Thank you Pierre for your information. Do we have a > conclusion for my original question about the parallelization efficiency > for different stages of > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender * > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Thank you Pierre for your information. Do we have a conclusion for my > original question about the parallelization efficiency for different stages > of KSP Solve? Do we need to do more testing to figure out the issues? Thank > you, Yongzhong From: > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender * > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > Thank you Pierre for your information. Do we have a conclusion for my > original question about the parallelization efficiency for different stages > of KSP Solve? Do we need to do more testing to figure out the issues? > > > > We have an extended discussion of this here: > https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!fYR1IzVkhHAPHEV3ib7SU9PXqJ3xaxrejJTDrDveL4zA7m_U_FY7jFLdWUD1b0W1-WyzQvb5xXvjGLY51P7twG25odPc$ > > > > > The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) > are memory bandwidth limited. If there is no more bandwidth to be > marshalled on your board, then adding more processes does nothing at all. > This is why people were asking about how many "nodes" you are running on, > because that is the unit of memory bandwidth, not "cores" which make little > difference. 
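A quick way to measure how much bandwidth a node can actually sustain is the STREAMS benchmark that ships with PETSc; this is the measurement the FAQ linked above recommends, and the invocation below is only a sketch (it assumes a configured PETSc source tree, with NPMAX covering the process counts of interest):

$ cd $PETSC_DIR
$ make streams NPMAX=8

The reported rate typically stops increasing after a few processes per node; beyond that point, additional threads or ranks are competing for the same memory bandwidth, which is the saturation described above.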
> > > > Thanks, > > > > Matt > > > > Thank you, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Sunday, June 23, 2024 at 12:41?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > > > > > On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Yeah, I ran my program again using -mat_view::ascii_info and set > MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix > to be seqaijmkl type (I?ve attached a few as below) > > --> Setting up matrix-vector products... > > > > Mat Object: 1 MPI process > > type: seqaijmkl > > rows=16490, cols=35937 > > total: nonzeros=128496, allocated nonzeros=128496 > > total number of mallocs used during MatSetValues calls=0 > > not using I-node routines > > Mat Object: 1 MPI process > > type: seqaijmkl > > rows=16490, cols=35937 > > total: nonzeros=128496, allocated nonzeros=128496 > > total number of mallocs used during MatSetValues calls=0 > > not using I-node routines > > > > --> Solving the system... > > > > Excitation 1 of 1... > > > > ================================================ > > Iterative solve completed in 7435 ms. > > CONVERGED: rtol. > > Iterations: 72 > > Final relative residual norm: 9.22287e-07 > > ================================================ > > [CPU TIME] System solution: 2.27160000e+02 s. > > [WALL TIME] System solution: 7.44387218e+00 s. > > However, it seems to me that there were still no MKL outputs even I set > MKL_VERBOSE to be 1. Although, I think it should be many spmv operations > when doing KSPSolve(). Do you see the possible reasons? > > > > SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS > is. > > > > Thanks, > > Pierre > > > > Thanks, > > Yongzhong > > > > > > *From: *Matthew Knepley > *Date: *Saturday, June 22, 2024 at 5:56?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , Pierre Jolivet < > pierre at joliv.et>, petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > > > On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 > MKL_VERBOSE Intel(R) MKL 2019. 0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. 
> > > > ZjQcmQRYFpfptBannerEnd > > MKL_VERBOSE=1 ./ex1 > > > matrix nonzeros = 100, allocated nonzeros = 100 > > MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector Neural Network Instructions enabled > processors, Lnx 2.50GHz lp64 gnu_thread > > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) > 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) > 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) > 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) > 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) > 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) > 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > 
ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) > 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) > 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) > 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) > 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) > 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) > 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > Yes, for petsc example, there are MKL outputs, but for my own program. All > I did is to change the matrix type from MATAIJ to MATAIJMKL to get > optimized performance for spmv from MKL. Should I expect to see any MKL > outputs in this case? > > > > Are you sure that the type changed? You can MatView() the matrix with > format ascii_info to see. > > > > Thanks, > > > > Matt > > > > > > Thanks, > > Yongzhong > > > > *From: *Junchao Zhang > *Date: *Saturday, June 22, 2024 at 9:40?AM > *To: *Yongzhong Li > *Cc: *Pierre Jolivet , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > No, you don't. It is strange. Perhaps you can you run a petsc example > first and see if MKL is really used > > $ cd src/mat/tests > > $ make ex1 > > $ MKL_VERBOSE=1 ./ex1 > > > --Junchao Zhang > > > > > > On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > I am using > > export MKL_VERBOSE=1 > > ./xx > > in the bash file, do I have to use - ksp_converged_reason? > > Thanks, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:47?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > How do you set the variable? 
> > > > $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason > > MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 > architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled > processors, Lnx 2.80GHz lp64 intel_thread > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > [...] > > > > On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hello all, > > I set MKL_VERBOSE = 1, but observed no print output specific to the use of > MKL. Does PETSc enable this verbose output? > > Best, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:36?AM > *To: *Junchao Zhang > *Cc: *Yongzhong Li , > petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > > > > > On 21 Jun 2024, at 6:42?AM, Junchao Zhang wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > I remember there are some MKL env vars to print MKL routines called. > > > > The environment variable is MKL_VERBOSE > > > > Thanks, > > Pierre > > > > Maybe we can try it to see what MKL routines are really used and then we > can understand why some petsc functions did not speed up > > > --Junchao Zhang > > > > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > Hi Barry, sorry for my last results. I didn?t fully understand the stage > profiling and logging in PETSc, now I only record KSPSolve() stage of my > program. Some sample codes are as follow, > > // Static variable to keep track of the stage counter > > static int stageCounter = 1; > > > > // Generate a unique stage name > > std::ostringstream oss; > > oss << "Stage " << stageCounter << " of Code"; > > std::string stageName = oss.str(); > > > > // Register the stage > > PetscLogStage stagenum; > > > > PetscLogStageRegister(stageName.c_str(), &stagenum); > > PetscLogStagePush(stagenum); > > > > *KSPSolve(*ksp_ptr, b, x);* > > > > PetscLogStagePop(); > > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other > stages where each one is KSPSolve() call. > > To provide some additional backgrounds, if you recall, I have been trying > to get efficient iterative solution using multithreading. I found out by > compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to > perform sparse matrix-vector multiplication faster, I am using > MATSEQAIJMKL. This makes the shell matrix vector product in each iteration > scale well with the #of threads. However, I found out the total GMERS solve > time (~KSPSolve() time) is not scaling well the #of threads. > > From the logging results I learned that when performing KSPSolve(), there > are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs > using different number of threads and plotted the time consumption for > PCApply() and KSPGMERSOrthog() against #of thread. 
I found out these two > operations are not scaling with the threads at all! My results are attached > as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, > KSPGMERSOrthog() will have many vector operations, so why these two parts > can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > > > *From: *Barry Smith > *Date: *Friday, June 14, 2024 at 11:36?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > I am a bit confused. Without the initial guess computation, there are > still a bunch of events I don't understand > > > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are > each done the same number of times) > > > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > > > Finally there are a huge number of > > > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > > > Are you making calls to all these routines? Are you doing this inside your > MatMult() or before you call KSPSolve? > > > > The reason I wanted you to make a simpler run without the initial guess > code is that your events are far more complicated than would be produced by > GMRES alone so it is not possible to understand the behavior you are seeing > without fully understanding all the events happening in the code. > > > > Barry > > > > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > > > Thanks, I have attached the results without using any KSPGuess. At low > frequency, the iteration steps are quite close to the one with KSPGuess, > specifically > > KSPGuess Object: 1 MPI process > > type: fischer > > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are > significant higher than the one with KSPGuess, I have attahced both of the > results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used > for a baseline comparsion? What are we comparing here? How does it relate > to the performance issue/bottleneck I found? ?*I have noticed that the > time taken by **KSPSolve** is **almost two times **greater than the CPU > time for matrix-vector product multiplied by the number of iteration*? 
> > Thank you! > Yongzhong > > > > *From: *Barry Smith > *Date: *Thursday, June 13, 2024 at 2:14?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > Can you please run the same thing without the KSPGuess option(s) for a > baseline comparison? > > > > Thanks > > > > Barry > > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hi Matt, > > I have rerun the program with the keys you provided. The system output > when performing ksp solve and the final petsc log output were stored in a > .txt file attached for your reference. > > Thanks! > Yongzhong > > > > *From: *Matthew Knepley > *Date: *Wednesday, June 12, 2024 at 6:46?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > ????????? knepley at gmail.com ????????????????? > > > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Dear PETSc?s developers, I hope this email finds you well. I am currently > working on a project using PETSc and have encountered a performance issue > with the KSPSolve function. Specifically, I have noticed that the time > taken by KSPSolve is > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > Dear PETSc?s developers, > > I hope this email finds you well. > > I am currently working on a project using PETSc and have encountered a > performance issue with the KSPSolve function. Specifically, *I have > noticed that the time taken by **KSPSolve** is **almost two times **greater > than the CPU time for matrix-vector product multiplied by the number of > iteration steps*. I use C++ chrono to record CPU time. > > For context, I am using a shell system matrix A. Despite my efforts to > parallelize the matrix-vector product (Ax), the overall solve time > remains higher than the matrix vector product per iteration indicates > when multiple threads were used. Here are a few details of my setup: > > - *Matrix Type*: Shell system matrix > - *Preconditioner*: Shell PC > - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK > library, multithreading is enabled > > I have considered several potential reasons, such as preconditioner setup, > additional solver operations, and the inherent overhead of using a shell > system matrix. *However, since KSPSolve is a high-level API, I have been > unable to pinpoint the exact cause of the increased solve time.* > > Have you observed the same issue? Could you please provide some > experience on how to diagnose and address this performance discrepancy? > Any insights or recommendations you could offer would be greatly > appreciated. > > > > For any performance question like this, we need to see the output of your > code run with > > > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > > > Thanks, > > > > Matt > > > > Thank you for your time and assistance. 
> > Best regards, > > Yongzhong > > ----------------------------------------------------------- > > *Yongzhong Li* > > PhD student | Electromagnetics Group > > Department of Electrical & Computer Engineering > > University of Toronto > > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!fYR1IzVkhHAPHEV3ib7SU9PXqJ3xaxrejJTDrDveL4zA7m_U_FY7jFLdWUD1b0W1-WyzQvb5xXvjGLY51P7twMAVmany$ > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fYR1IzVkhHAPHEV3ib7SU9PXqJ3xaxrejJTDrDveL4zA7m_U_FY7jFLdWUD1b0W1-WyzQvb5xXvjGLY51P7twFI2FOPm$ > > > > > > > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fYR1IzVkhHAPHEV3ib7SU9PXqJ3xaxrejJTDrDveL4zA7m_U_FY7jFLdWUD1b0W1-WyzQvb5xXvjGLY51P7twFI2FOPm$ > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fYR1IzVkhHAPHEV3ib7SU9PXqJ3xaxrejJTDrDveL4zA7m_U_FY7jFLdWUD1b0W1-WyzQvb5xXvjGLY51P7twFI2FOPm$ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bruce.Palmer at pnnl.gov Wed Jun 26 15:34:04 2024 From: Bruce.Palmer at pnnl.gov (Palmer, Bruce J) Date: Wed, 26 Jun 2024 20:34:04 +0000 Subject: [petsc-users] Unconstrained optimization question Message-ID: Hi, I?m trying to do an unconstrained optimization on a molecular scale problem. Previously, I was looking at an artificial molecular problem where all parameters were of order 1 and so the objective function and variables were also in the range of 1 or at least within a few orders of magnitude of 1. More recently, I?ve been trying to apply this optimization to a real molecular system. Between Avogadro?s number (6.022e23) and Boltzmann?s constant (1.38e-16) combined with very small distances (1.0e-8 cm), etc. the objective function values and the values of the optimization variables have very large values (~1e86 and ~1e9, respectively). I?ve verified that the analytic gradients of the objective function that I?m calculating are correct by comparing them with numerical derivatives. I?ve tried using the LMVM and Conjugate Gradient optimizations, both of which worked previously, but I find that the optimization completes one objective function evaluation and then declares that the problem is converged and stops. I could find a set of units where everything is approximately 1 but I was hoping that there are some parameters I can set in the optimization that will get it moving again. Any suggestions? Bruce Palmer -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Wed Jun 26 16:02:19 2024 From: bsmith at petsc.dev (Barry Smith) Date: Wed, 26 Jun 2024 17:02:19 -0400 Subject: [petsc-users] Unconstrained optimization question In-Reply-To: References: Message-ID: Please run with -tao_monitor -tao_converged_reason and see why it has stopped. 
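The same diagnostic is also available programmatically, which can help when a batch job's command line is hard to change. A minimal sketch in C (the Fortran interface is analogous), assuming a Tao object tao on which TaoSolve() has already been called:

  TaoConvergedReason reason;

  /* Programmatic counterpart of -tao_converged_reason: ask why TaoSolve() stopped.
     Positive values mean a convergence test was satisfied; negative values mean the
     solve failed or stagnated. */
  PetscCall(TaoGetConvergedReason(tao, &reason));
  PetscCall(PetscPrintf(PETSC_COMM_WORLD, "TaoSolve termination reason: %d\n", (int)reason));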
Barry

> On Jun 26, 2024, at 4:34 PM, Palmer, Bruce J via petsc-users wrote:
>
> Hi,
>
> I'm trying to do an unconstrained optimization on a molecular scale problem. Previously, I was looking at an artificial molecular problem where all parameters were of order 1 and so the objective function and variables were also in the range of 1 or at least within a few orders of magnitude of 1.
>
> More recently, I've been trying to apply this optimization to a real molecular system. Between Avogadro's number (6.022e23) and Boltzmann's constant (1.38e-16) combined with very small distances (1.0e-8 cm), etc., the objective function values and the values of the optimization variables have very large values (~1e86 and ~1e9, respectively). I've verified that the analytic gradients of the objective function that I'm calculating are correct by comparing them with numerical derivatives.
>
> I've tried using the LMVM and Conjugate Gradient optimizations, both of which worked previously, but I find that the optimization completes one objective function evaluation and then declares that the problem is converged and stops. I could find a set of units where everything is approximately 1, but I was hoping that there are some parameters I can set in the optimization that will get it moving again. Any suggestions?
>
> Bruce Palmer

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mmolinos at us.es Wed Jun 26 16:30:00 2024
From: mmolinos at us.es (MIGUEL MOLINOS PEREZ)
Date: Wed, 26 Jun 2024 21:30:00 +0000
Subject: [petsc-users] Doubt about TSMonitorSolutionVTK
In-Reply-To:
References: <2067D58E-F041-429F-8ABE-B19DD9F733C2@petsc.dev>
Message-ID:

Sorry, I did not put petsc-users at mcs.anl.gov in cc on my reply.

Miguel

On Jun 24, 2024, at 6:39 PM, MIGUEL MOLINOS PEREZ wrote:

Thank you Barry,

This is exactly how I did it the first time.

Miguel

On Jun 24, 2024, at 6:37 PM, Barry Smith wrote:

See, for example, the bottom of src/ts/tutorials/ex26.c that uses -ts_monitor_solution_vtk 'foo-%03d.vts'

On Jun 24, 2024, at 8:47 PM, MIGUEL MOLINOS PEREZ wrote:

Dear all,

I want to monitor the results at each iteration of TS using the VTK format. To do so, I add the following lines to my Monitor function:

char vts_File_Name[MAXC];
PetscCall(PetscSNPrintf(vts_File_Name, sizeof(vts_File_Name), "./xi-MgHx-hcp-cube-x5-x5-x5-TS-BE-%i.vtu", step));
PetscCall(TSMonitorSolutionVTK(ts, step, time, X, (void*)vts_File_Name));

My script compiles and executes without any sort of warning/error messages. However, no output files are produced at the end of the simulation. I've also tried the option '-ts_monitor_solution_vtk ', but I got no results either. I can't find any similar example on the petsc website and I don't see what I am doing wrong. Could somebody point me in the right direction?

Thanks,
Miguel

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Bruce.Palmer at pnnl.gov Wed Jun 26 16:48:11 2024
From: Bruce.Palmer at pnnl.gov (Palmer, Bruce J)
Date: Wed, 26 Jun 2024 21:48:11 +0000
Subject: [petsc-users] Unconstrained optimization question
In-Reply-To:
References:
Message-ID:

This is a Fortran code that doesn't make use of argc,argv (I tried running with the runtime options anyway, in case you implemented some magic I'm not familiar with, but didn't see anything new in the output).
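Since the code never hands the command line to PETSc, the options Barry suggested can instead be injected from the source before TaoSetFromOptions() is called, or exported through the PETSC_OPTIONS environment variable. A hedged sketch in C (the Fortran bindings are analogous; only the option names come from this thread, the rest is illustrative and assumes a Tao object named tao):

  /* Make -tao_monitor and -tao_converged_reason take effect in an application
     that never passes argc/argv to PetscInitialize(). */
  PetscCall(PetscOptionsSetValue(NULL, "-tao_monitor", NULL));
  PetscCall(PetscOptionsSetValue(NULL, "-tao_converged_reason", NULL));
  PetscCall(TaoSetFromOptions(tao)); /* must run after the options are set */
  PetscCall(TaoSolve(tao));

Alternatively, setting PETSC_OPTIONS="-tao_monitor -tao_converged_reason" in the job script has the same effect without touching the code.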
I have a call to TaoView(tao, PETSC_VIEWER_STDOUT_SELF,ierr) in the code and it reports back

Tao Object: 1 MPI process
  type: cg
    CG Type: prp
    Gradient steps: 0
    Reset steps: 0
  TaoLineSearch Object: 1 MPI process
    type: more-thuente
    maximum function evaluations=30
    tolerances: ftol=0.0001, rtol=1e-10, gtol=0.9
    total number of function evaluations=0
    total number of gradient evaluations=0
    total number of function/gradient evaluations=0
    Termination reason: 0
  convergence tolerances: gatol=1e-08, steptol=0., gttol=0.
  Residual in Function/Gradient:=7.54237e+75
  Objective value=2.96082e+86
  total number of iterations=0, (max: 100)
  total number of function/gradient evaluations=1, (max: 4000)
  Solution converged: ||g(X)||/|f(X)| <= grtol

Bruce

From: Barry Smith
Date: Wednesday, June 26, 2024 at 2:02 PM
To: Palmer, Bruce J
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] Unconstrained optimization question

Please run with -tao_monitor -tao_converged_reason and see why it has stopped.

Barry

On Jun 26, 2024, at 4:34 PM, Palmer, Bruce J via petsc-users wrote:

Hi,

I'm trying to do an unconstrained optimization on a molecular scale problem. Previously, I was looking at an artificial molecular problem where all parameters were of order 1 and so the objective function and variables were also in the range of 1 or at least within a few orders of magnitude of 1.

More recently, I've been trying to apply this optimization to a real molecular system. Between Avogadro's number (6.022e23) and Boltzmann's constant (1.38e-16) combined with very small distances (1.0e-8 cm), etc., the objective function values and the values of the optimization variables have very large values (~1e86 and ~1e9, respectively). I've verified that the analytic gradients of the objective function that I'm calculating are correct by comparing them with numerical derivatives.

I've tried using the LMVM and Conjugate Gradient optimizations, both of which worked previously, but I find that the optimization completes one objective function evaluation and then declares that the problem is converged and stops. I could find a set of units where everything is approximately 1, but I was hoping that there are some parameters I can set in the optimization that will get it moving again. Any suggestions?

Bruce Palmer

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jed at jedbrown.org Wed Jun 26 17:02:42 2024
From: jed at jedbrown.org (Jed Brown)
Date: Wed, 26 Jun 2024 16:02:42 -0600
Subject: [petsc-users] Unconstrained optimization question
In-Reply-To:
References:
Message-ID: <87a5j7l5vh.fsf@jedbrown.org>

An HTML attachment was scrubbed...
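About the TaoView output above: with the objective at ~3e86, the relative test ||g(X)||/|f(X)| <= grtol is satisfied after the very first evaluation, so the solver declares convergence immediately. Rescaling the problem to order-1 units, as mentioned in the message, is the more robust fix; another option is to move the test onto absolute quantities. A hedged sketch, where tao is the solver object and the numerical values are placeholders that would need to be tuned to the actual gradient scale:

  /* Disable the relative-gradient test that is trivially satisfied when |f(X)| ~ 1e86
     and rely on an absolute gradient tolerance instead (placeholder values). */
  PetscCall(TaoSetTolerances(tao,
                             1.0e-6, /* gatol: ||g(X)|| <= gatol */
                             0.0,    /* grtol: ||g(X)||/|f(X)| <= grtol, 0 effectively disables it */
                             0.0));  /* gttol: ||g(X)||/||g(X_0)|| <= gttol */

The same change can be made from the options database with -tao_grtol 0 and a suitable -tao_gatol, assuming the options are actually being picked up as discussed above.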
URL: From yongzhong.li at mail.utoronto.ca Wed Jun 26 17:59:15 2024 From: yongzhong.li at mail.utoronto.ca (Yongzhong Li) Date: Wed, 26 Jun 2024 22:59:15 +0000 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> Message-ID: Hi Barry, I am looking into the source codes of VecMultiDot_Seq_GEMV https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!fd7ZxW7EKCLlbTqw0DDnyWRJxZCmIMWq56fUIPEAPDnsC33dSV7Kd0gq9PDoRRg4XP-LLo7cTaJQ5lLFLrC4Rnaoz_rAeE012FA$ Can I ask which lines of codes suggest the use of intel mkl? Thanks, Yongzhong From: Barry Smith Date: Wednesday, June 26, 2024 at 10:30?AM To: Yongzhong Li Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In a debug version of PETSc run your application in a debugger and put a break point in VecMultiDot_Seq_GEMV. Then next through the code from that point to see what decision it makes about using dgemv() to see why it is not getting into the Intel code. On Jun 25, 2024, at 11:19?PM, Yongzhong Li wrote: This Message Is From an External Sender This message came from outside your organization. Hi Junchao, thank you for your help for these benchmarking test! I check out to petsc/main and did a few things to verify from my side, 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node. The results are as follow, $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.5 1.2 1.8 5.2 256 1.5 0.9 1.6 4.7 512 2.7 2.8 6.1 13.2 1024 4.0 4.0 9.3 16.4 2048 7.4 7.3 11.3 39.3 4096 14.2 13.9 19.1 93.4 8192 28.8 26.3 25.4 31.3 16384 54.1 25.8 26.7 33.8 32768 109.8 25.7 24.2 56.0 65536 220.2 24.4 26.5 89.0 131072 424.1 31.5 36.1 149.6 262144 898.1 37.1 53.9 286.1 524288 1754.6 48.7 100.3 1122.2 1048576 3645.8 86.5 347.9 2950.4 2097152 7371.4 308.7 1440.6 6874.9 $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.9 1.2 1.9 5.2 256 1.5 1.0 1.7 4.7 512 2.7 2.8 6.1 12.0 1024 3.9 4.0 9.3 16.8 2048 7.4 7.3 10.4 41.3 4096 14.0 13.8 18.6 84.2 8192 27.0 21.3 43.8 177.5 16384 54.1 34.1 89.1 330.4 32768 110.4 82.1 203.5 781.1 65536 213.0 191.8 423.9 1696.4 131072 428.7 360.2 934.0 4080.0 262144 883.4 723.2 1745.6 10120.7 524288 1817.5 1466.1 4751.4 23217.2 1048576 3611.0 3796.5 11814.9 48687.7 2097152 7401.9 10592.0 27543.2 106565.4 I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca >From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? 
Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. Thank you, Yongzhong From: Junchao Zhang > Date: Tuesday, June 25, 2024 at 6:34?PM To: Matthew Knepley > Cc: Yongzhong Li >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Hi, Yongzhong, Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. I then configured petsc with openblas, I did see better performance with more threads $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.1 256 1.8 2.7 7.0 512 2.1 3.1 8.6 1024 2.7 4.0 12.3 2048 3.8 6.3 28.0 4096 6.1 10.6 42.4 8192 10.9 21.8 79.5 16384 21.2 39.4 149.6 32768 45.9 75.7 224.6 65536 142.2 215.8 732.1 131072 169.1 233.2 1729.4 262144 367.5 830.0 4159.2 524288 999.2 1718.1 8538.5 1048576 2113.5 4082.1 18274.8 2097152 5392.6 10273.4 43273.4 $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.0 256 1.8 2.7 15.0 512 2.1 9.0 16.6 1024 2.6 8.7 16.1 2048 7.7 10.3 20.5 4096 9.9 11.4 25.9 8192 14.5 22.1 39.6 16384 25.1 27.8 67.8 32768 44.7 95.7 91.5 65536 82.1 156.8 165.1 131072 194.0 335.1 341.5 262144 388.5 380.8 612.9 524288 1046.7 967.1 1653.3 1048576 1997.4 2169.0 4034.4 2097152 5502.9 5787.3 12608.1 The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average speedup depends on components. So I suggest you run ex2k to see in your environment whether oneMKL can speedup the kernels. --Junchao Zhang On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: Let me run some examples on our end to see whether the code calls expected functions. --Junchao Zhang On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: On Mon, Jun 24, 2024 at 11:?21 AM Yongzhong Li wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li > wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? Thank you, Yongzhong From:? ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Thank you Pierre for your information. 
Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!fd7ZxW7EKCLlbTqw0DDnyWRJxZCmIMWq56fUIPEAPDnsC33dSV7Kd0gq9PDoRRg4XP-LLo7cTaJQ5lLFLrC4Rnaoz_rA0mVFiCk$ The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference. Thanks, Matt Thank you, Yongzhong From: Pierre Jolivet > Date: Sunday, June 23, 2024 at 12:41?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) --> Setting up matrix-vector products... Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines --> Solving the system... Excitation 1 of 1... ================================================ Iterative solve completed in 7435 ms. CONVERGED: rtol. Iterations: 72 Final relative residual norm: 9.22287e-07 ================================================ [CPU TIME] System solution: 2.27160000e+02 s. [WALL TIME] System solution: 7.44387218e+00 s. However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. Thanks, Pierre Thanks, Yongzhong From: Matthew Knepley > Date: Saturday, June 22, 2024 at 5:56?PM To: Yongzhong Li > Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. 
ZjQcmQRYFpfptBannerEnd MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE 
ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. Thanks, Matt Thanks, Yongzhong From: Junchao Zhang > Date: Saturday, June 22, 2024 at 9:40?AM To: Yongzhong Li > Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used $ cd src/mat/tests $ make ex1 $ MKL_VERBOSE=1 ./ex1 --Junchao Zhang On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: I am using export MKL_VERBOSE=1 ./xx in the bash file, do I have to use - ksp_converged_reason? Thanks, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:47?PM To: Yongzhong Li > Cc: Junchao Zhang >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? How do you set the variable? $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 [...] On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hello all, I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? 
Best,
Yongzhong

From: Pierre Jolivet >
Date: Friday, June 21, 2024 at 1:36 AM
To: Junchao Zhang >
Cc: Yongzhong Li >, petsc-users at mcs.anl.gov >
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

On 21 Jun 2024, at 6:42 AM, Junchao Zhang > wrote:
I remember there are some MKL env vars to print the MKL routines called.

The environment variable is MKL_VERBOSE

Thanks,
Pierre

Maybe we can try it to see what MKL routines are really used, and then we can understand why some PETSc functions did not speed up
--Junchao Zhang

On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li > wrote:

Hi Barry, sorry for my last results. I didn't fully understand the stage profiling and logging in PETSc; now I only record the KSPSolve() stage of my program. A sample of the code is as follows:

// Static variable to keep track of the stage counter
static int stageCounter = 1;

// Generate a unique stage name
std::ostringstream oss;
oss << "Stage " << stageCounter << " of Code";
std::string stageName = oss.str();

// Register the stage
PetscLogStage stagenum;
PetscLogStageRegister(stageName.c_str(), &stagenum);
PetscLogStagePush(stagenum);

KSPSolve(*ksp_ptr, b, x);

PetscLogStagePop();
stageCounter++;

I have attached my new logging results; there is 1 main stage and 4 other stages, each one a KSPSolve() call.

To provide some additional background, if you recall, I have been trying to get an efficient iterative solution using multithreading. I found that by compiling PETSc with the Intel MKL library instead of OpenBLAS I am able to perform sparse matrix-vector multiplication faster; I am using MATSEQAIJMKL. This makes the shell matrix-vector product in each iteration scale well with the # of threads. However, I found that the total GMRES solve time (~KSPSolve() time) is not scaling well with the # of threads.

From the logging results I learned that when performing KSPSolve() there are some CPU overheads in PCApply() and KSPGMRESOrthog(). I ran my program using different numbers of threads and plotted the time consumption of PCApply() and KSPGMRESOrthog() against the # of threads. I found that these two operations are not scaling with the threads at all! My results are attached as a PDF to give you a clear view.

My question is: from my understanding, PCApply() involves MatSolve(), and KSPGMRESOrthog() consists mostly of vector operations, so why can't these two parts scale well with the # of threads when the Intel MKL library is linked?

Thank you,
Yongzhong

From: Barry Smith >
Date: Friday, June 14, 2024 at 11:36 AM
To: Yongzhong Li >
Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio >
Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

I am a bit confused. 
Without the initial guess computation, there are still a bunch of events I don't understand:

MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275
MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0

In addition, there are many more VecMAXPY than VecMDot (in GMRES they are each done the same number of times):

VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016
VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913

Finally, there are a huge number of

MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025

Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve()?

The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone, so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code.

Barry

On Jun 14, 2024, at 1:19 AM, Yongzhong Li > wrote:

Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the ones with KSPGuess, specifically

KSPGuess Object: 1 MPI process
  type: fischer
  Model 1, size 200

However, I found that at higher frequency the # of iteration steps is significantly higher than with KSPGuess; I have attached both sets of results for your reference.

Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparison? What are we comparing here? How does it relate to the performance issue/bottleneck I found, "I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for the matrix-vector product multiplied by the number of iterations"?

Thank you!
Yongzhong

From: Barry Smith >
Date: Thursday, June 13, 2024 at 2:14 PM
To: Yongzhong Li >
Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio >
Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

Can you please run the same thing without the KSPGuess option(s) for a baseline comparison?

Thanks

Barry

On Jun 13, 2024, at 1:27 PM, Yongzhong Li > wrote:

Hi Matt,

I have rerun the program with the keys you provided. The system output when performing the KSP solve and the final PETSc log output are stored in the attached .txt file for your reference.

Thanks! 
Yongzhong From: Matthew Knepley > Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!fd7ZxW7EKCLlbTqw0DDnyWRJxZCmIMWq56fUIPEAPDnsC33dSV7Kd0gq9PDoRRg4XP-LLo7cTaJQ5lLFLrC4Rnaoz_rAlE1oQhE$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fd7ZxW7EKCLlbTqw0DDnyWRJxZCmIMWq56fUIPEAPDnsC33dSV7Kd0gq9PDoRRg4XP-LLo7cTaJQ5lLFLrC4Rnaoz_rAQEQd5oQ$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fd7ZxW7EKCLlbTqw0DDnyWRJxZCmIMWq56fUIPEAPDnsC33dSV7Kd0gq9PDoRRg4XP-LLo7cTaJQ5lLFLrC4Rnaoz_rAQEQd5oQ$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener

https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fd7ZxW7EKCLlbTqw0DDnyWRJxZCmIMWq56fUIPEAPDnsC33dSV7Kd0gq9PDoRRg4XP-LLo7cTaJQ5lLFLrC4Rnaoz_rAQEQd5oQ$

From bsmith at petsc.dev Wed Jun 26 19:15:03 2024
From: bsmith at petsc.dev (Barry Smith)
Date: Wed, 26 Jun 2024 20:15:03 -0400
Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et>
Message-ID:

if (m > 1) {
  PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above
  PetscScalar  one = 1, zero = 0;

  PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione));
  PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0)));

The call to BLAS above is where it uses MKL.

> On Jun 26, 2024, at 6:59 PM, Yongzhong Li wrote:
>
> Hi Barry, I am looking into the source code of VecMultiDot_Seq_GEMV https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!YkdYdhSYNaujYHcsb6DHTn7qi2Cae0dZHw3sO7KUE55VZiuUPyXmKFJlSwNe0xr-uRcAZMgRLinheQIZsXxDJ9c$
> Can I ask which lines of code suggest the use of Intel MKL?
>
> Thanks,
> Yongzhong
>
> From: Barry Smith > Date: Wednesday, June 26, 2024 at 10:30 AM > To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>
> In a debug version of PETSc, run your application in a debugger and put a breakpoint in VecMultiDot_Seq_GEMV. Then "next" through the code from that point to see what decision it makes about using dgemv(), and why it is not getting into the Intel code.
>
> On Jun 25, 2024, at 11:19 PM, Yongzhong Li > wrote:
>
> Hi Junchao, thank you for your help with these benchmarking tests!
>
> I checked out petsc/main and did a few things to verify from my side,
>
> 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node. 
The results are as follow, > > $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 14.5 1.2 1.8 5.2 > 256 1.5 0.9 1.6 4.7 > 512 2.7 2.8 6.1 13.2 > 1024 4.0 4.0 9.3 16.4 > 2048 7.4 7.3 11.3 39.3 > 4096 14.2 13.9 19.1 93.4 > 8192 28.8 26.3 25.4 31.3 > 16384 54.1 25.8 26.7 33.8 > 32768 109.8 25.7 24.2 56.0 > 65536 220.2 24.4 26.5 89.0 > 131072 424.1 31.5 36.1 149.6 > 262144 898.1 37.1 53.9 286.1 > 524288 1754.6 48.7 100.3 1122.2 > 1048576 3645.8 86.5 347.9 2950.4 > 2097152 7371.4 308.7 1440.6 6874.9 > > $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 14.9 1.2 1.9 5.2 > 256 1.5 1.0 1.7 4.7 > 512 2.7 2.8 6.1 12.0 > 1024 3.9 4.0 9.3 16.8 > 2048 7.4 7.3 10.4 41.3 > 4096 14.0 13.8 18.6 84.2 > 8192 27.0 21.3 43.8 177.5 > 16384 54.1 34.1 89.1 330.4 > 32768 110.4 82.1 203.5 781.1 > 65536 213.0 191.8 423.9 1696.4 > 131072 428.7 360.2 934.0 4080.0 > 262144 883.4 723.2 1745.6 10120.7 > 524288 1817.5 1466.1 4751.4 23217.2 > 1048576 3611.0 3796.5 11814.9 48687.7 > 2097152 7401.9 10592.0 27543.2 106565.4 > > I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like > > MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca > > From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. > > However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. > > I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. > > Thank you, > Yongzhong > > From: Junchao Zhang > > Date: Tuesday, June 25, 2024 at 6:34?PM > To: Matthew Knepley > > Cc: Yongzhong Li >, Pierre Jolivet >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > Hi, Yongzhong, > Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? > petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. 
I then configured petsc with openblas, I did see better performance with more threads > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.1 > 256 1.8 2.7 7.0 > 512 2.1 3.1 8.6 > 1024 2.7 4.0 12.3 > 2048 3.8 6.3 28.0 > 4096 6.1 10.6 42.4 > 8192 10.9 21.8 79.5 > 16384 21.2 39.4 149.6 > 32768 45.9 75.7 224.6 > 65536 142.2 215.8 732.1 > 131072 169.1 233.2 1729.4 > 262144 367.5 830.0 4159.2 > 524288 999.2 1718.1 8538.5 > 1048576 2113.5 4082.1 18274.8 > 2097152 5392.6 10273.4 43273.4 > > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.0 > 256 1.8 2.7 15.0 > 512 2.1 9.0 16.6 > 1024 2.6 8.7 16.1 > 2048 7.7 10.3 20.5 > 4096 9.9 11.4 25.9 > 8192 14.5 22.1 39.6 > 16384 25.1 27.8 67.8 > 32768 44.7 95.7 91.5 > 65536 82.1 156.8 165.1 > 131072 194.0 335.1 341.5 > 262144 388.5 380.8 612.9 > 524288 1046.7 967.1 1653.3 > 1048576 1997.4 2169.0 4034.4 > 2097152 5502.9 5787.3 12608.1 > > The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average speedup depends on components. So I suggest you run ex2k to see in your environment whether oneMKL can speedup the kernels. > > --Junchao Zhang > > > On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: > Let me run some examples on our end to see whether the code calls expected functions. > > --Junchao Zhang > > > On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: > On Mon, Jun 24, 2024 at 11:?21 AM Yongzhong Li wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li > wrote: > Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? Thank you, Yongzhong From:? > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? > > We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!YkdYdhSYNaujYHcsb6DHTn7qi2Cae0dZHw3sO7KUE55VZiuUPyXmKFJlSwNe0xr-uRcAZMgRLinheQIZzNeYI0o$ > > The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference. 
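As a rough, illustrative back-of-the-envelope check (the bandwidth figures below are assumptions for a typical node, not measurements from this thread): a single VecMDot of one vector against 30 others with N = 2097152 complex doubles has to stream roughly

  31 vectors * 2097152 entries * 16 bytes/entry ~= 1.0 GB

through memory while doing comparatively little arithmetic per byte. At the ~10-20 GB/s one core typically sustains, that is on the order of 50-100 ms; at the ~100-200 GB/s a whole socket can sustain, it is on the order of 5-10 ms, and past the point where the memory system is saturated extra threads cannot make it faster. The ex2k numbers quoted earlier in this thread (~107 ms with 1 MKL thread vs. ~7 ms with 64 threads for the largest size) are consistent with exactly this kind of bandwidth ceiling.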
> > Thanks, > > Matt > > Thank you, > Yongzhong > > From: Pierre Jolivet > > Date: Sunday, June 23, 2024 at 12:41?AM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > > > On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) > > --> Setting up matrix-vector products... > > Mat Object: 1 MPI process > type: seqaijmkl > rows=16490, cols=35937 > total: nonzeros=128496, allocated nonzeros=128496 > total number of mallocs used during MatSetValues calls=0 > not using I-node routines > Mat Object: 1 MPI process > type: seqaijmkl > rows=16490, cols=35937 > total: nonzeros=128496, allocated nonzeros=128496 > total number of mallocs used during MatSetValues calls=0 > not using I-node routines > > --> Solving the system... > > Excitation 1 of 1... > > ================================================ > Iterative solve completed in 7435 ms. > CONVERGED: rtol. > Iterations: 72 > Final relative residual norm: 9.22287e-07 > ================================================ > [CPU TIME] System solution: 2.27160000e+02 s. > [WALL TIME] System solution: 7.44387218e+00 s. > > However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? > > SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. > > Thanks, > Pierre > > > Thanks, > Yongzhong > > > From: Matthew Knepley > > Date: Saturday, June 22, 2024 at 5:56?PM > To: Yongzhong Li > > Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: > MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. 
> > ZjQcmQRYFpfptBannerEnd > MKL_VERBOSE=1 ./ex1 > > matrix nonzeros = 100, allocated nonzeros = 100 > MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread > MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us 
CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? > > Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. > > Thanks, > > Matt > > > Thanks, > Yongzhong > > From: Junchao Zhang > > Date: Saturday, June 22, 2024 at 9:40?AM > To: Yongzhong Li > > Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used > $ cd src/mat/tests > $ make ex1 > $ MKL_VERBOSE=1 ./ex1 > > --Junchao Zhang > > > On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: > I am using > > export MKL_VERBOSE=1 > ./xx > > in the bash file, do I have to use - ksp_converged_reason? > > Thanks, > Yongzhong > > From: Pierre Jolivet > > Date: Friday, June 21, 2024 at 1:47?PM > To: Yongzhong Li > > Cc: Junchao Zhang >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > How do you set the variable? > > $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason > MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread > MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > [...] > > On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. 
> Hello all, > > I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? > > Best, > Yongzhong > > > From: Pierre Jolivet > > Date: Friday, June 21, 2024 at 1:36?AM > To: Junchao Zhang > > Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > I remember there are some MKL env vars to print MKL routines called. > > The environment variable is MKL_VERBOSE > > Thanks, > Pierre > > Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up > > --Junchao Zhang > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: > This Message Is From an External Sender > This message came from outside your organization. > > Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, > > // Static variable to keep track of the stage counter > static int stageCounter = 1; > > // Generate a unique stage name > std::ostringstream oss; > oss << "Stage " << stageCounter << " of Code"; > std::string stageName = oss.str(); > > // Register the stage > PetscLogStage stagenum; > > PetscLogStageRegister(stageName.c_str(), &stagenum); > PetscLogStagePush(stagenum); > > KSPSolve(*ksp_ptr, b, x); > > PetscLogStagePop(); > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. > > To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. > > From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > From: Barry Smith > > Date: Friday, June 14, 2024 at 11:36?AM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > I am a bit confused. 
Without the initial guess computation, there are still a bunch of events I don't understand > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > Finally there are a huge number of > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? > > The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. > > Barry > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically > > KSPGuess Object: 1 MPI process > type: fischer > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? > > Thank you! > Yongzhong > > From: Barry Smith > > Date: Thursday, June 13, 2024 at 2:14?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? > > Thanks > > Barry > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Matt, > > I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. > > Thanks! 
> Yongzhong > > From: Matthew Knepley > > Date: Wednesday, June 12, 2024 at 6:46?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: > Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Dear PETSc?s developers, > I hope this email finds you well. > I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. > For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: > Matrix Type: Shell system matrix > Preconditioner: Shell PC > Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled > I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. > Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. > > For any performance question like this, we need to see the output of your code run with > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > Thanks, > > Matt > > Thank you for your time and assistance. > Best regards, > Yongzhong > ----------------------------------------------------------- > Yongzhong Li > PhD student | Electromagnetics Group > Department of Electrical & Computer Engineering > University of Toronto > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!YkdYdhSYNaujYHcsb6DHTn7qi2Cae0dZHw3sO7KUE55VZiuUPyXmKFJlSwNe0xr-uRcAZMgRLinheQIZhKyJMEs$ > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YkdYdhSYNaujYHcsb6DHTn7qi2Cae0dZHw3sO7KUE55VZiuUPyXmKFJlSwNe0xr-uRcAZMgRLinheQIZ8xqYVC0$ > > > > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
> -- Norbert Wiener
>
> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YkdYdhSYNaujYHcsb6DHTn7qi2Cae0dZHw3sO7KUE55VZiuUPyXmKFJlSwNe0xr-uRcAZMgRLinheQIZ8xqYVC0$
>
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
>
> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YkdYdhSYNaujYHcsb6DHTn7qi2Cae0dZHw3sO7KUE55VZiuUPyXmKFJlSwNe0xr-uRcAZMgRLinheQIZ8xqYVC0$

From yongzhong.li at mail.utoronto.ca Wed Jun 26 23:40:03 2024
From: yongzhong.li at mail.utoronto.ca (Yongzhong Li)
Date: Thu, 27 Jun 2024 04:40:03 +0000
Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et>
Message-ID:

Hi Barry, I used gdb to debug my program and set a breakpoint in the VecMultiDot_Seq_GEMV function. I did see that, when I step through this function, it calls BLAS (but not always, only if m > 1), as shown below. However, I still didn't see any MKL output even though I set MKL_VERBOSE=1.

(gdb)
550       PetscCall(VecRestoreArrayRead(yin[i], &yfirst));
(gdb)
553       m = j - i;
(gdb)
554       if (m > 1) {
(gdb)
555         PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above
(gdb)
556         PetscScalar one = 1, zero = 0;
(gdb)
558         PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione));
(gdb) s
PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> "VecMultiDot_Seq_GEMV", file=0x7ffff68a1078 "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c") at /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106
106       if (!TRdebug) return PETSC_SUCCESS;
(gdb)
154     }

Am I not using MKL BLAS, and is that why I didn't see a multithreading speedup for KSPGMRESOrthog? What do you think could be the potential reasons? Is there any silent mode that could possibly affect MKL_VERBOSE?

Thank you and best regards,
Yongzhong

From: Barry Smith
Date: Wednesday, June 26, 2024 at 8:15 PM
To: Yongzhong Li
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

if (m > 1) {
  PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above
  PetscScalar  one = 1, zero = 0;

  PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione));
  PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0)));

The call to BLAS above is where it uses MKL.

On Jun 26, 2024, at 6:59 PM, Yongzhong Li wrote: Hi Barry, I am looking into the source code of VecMultiDot_Seq_GEMV https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!YUfsVRoVQQfGeK-iIORKffgc0-KL_n2OafjYnTsZGQRuWDiAbJ4RwqB3SmhIIvM83o5y6AP1fc_gaJVsJ8Fk2-2ihVffczoTXjA$ Can I ask which lines of code suggest the use of Intel MKL? 
Thanks, Yongzhong From: Barry Smith > Date: Wednesday, June 26, 2024 at 10:30?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In a debug version of PETSc run your application in a debugger and put a break point in VecMultiDot_Seq_GEMV. Then next through the code from that point to see what decision it makes about using dgemv() to see why it is not getting into the Intel code. On Jun 25, 2024, at 11:19?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Junchao, thank you for your help for these benchmarking test! I check out to petsc/main and did a few things to verify from my side, 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node. The results are as follow, $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.5 1.2 1.8 5.2 256 1.5 0.9 1.6 4.7 512 2.7 2.8 6.1 13.2 1024 4.0 4.0 9.3 16.4 2048 7.4 7.3 11.3 39.3 4096 14.2 13.9 19.1 93.4 8192 28.8 26.3 25.4 31.3 16384 54.1 25.8 26.7 33.8 32768 109.8 25.7 24.2 56.0 65536 220.2 24.4 26.5 89.0 131072 424.1 31.5 36.1 149.6 262144 898.1 37.1 53.9 286.1 524288 1754.6 48.7 100.3 1122.2 1048576 3645.8 86.5 347.9 2950.4 2097152 7371.4 308.7 1440.6 6874.9 $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.9 1.2 1.9 5.2 256 1.5 1.0 1.7 4.7 512 2.7 2.8 6.1 12.0 1024 3.9 4.0 9.3 16.8 2048 7.4 7.3 10.4 41.3 4096 14.0 13.8 18.6 84.2 8192 27.0 21.3 43.8 177.5 16384 54.1 34.1 89.1 330.4 32768 110.4 82.1 203.5 781.1 65536 213.0 191.8 423.9 1696.4 131072 428.7 360.2 934.0 4080.0 262144 883.4 723.2 1745.6 10120.7 524288 1817.5 1466.1 4751.4 23217.2 1048576 3611.0 3796.5 11814.9 48687.7 2097152 7401.9 10592.0 27543.2 106565.4 I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca >From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. Thank you, Yongzhong From: Junchao Zhang > Date: Tuesday, June 25, 2024 at 6:34?PM To: Matthew Knepley > Cc: Yongzhong Li >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Hi, Yongzhong, Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. 
We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. I then configured petsc with openblas, I did see better performance with more threads $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.1 256 1.8 2.7 7.0 512 2.1 3.1 8.6 1024 2.7 4.0 12.3 2048 3.8 6.3 28.0 4096 6.1 10.6 42.4 8192 10.9 21.8 79.5 16384 21.2 39.4 149.6 32768 45.9 75.7 224.6 65536 142.2 215.8 732.1 131072 169.1 233.2 1729.4 262144 367.5 830.0 4159.2 524288 999.2 1718.1 8538.5 1048576 2113.5 4082.1 18274.8 2097152 5392.6 10273.4 43273.4 $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.0 256 1.8 2.7 15.0 512 2.1 9.0 16.6 1024 2.6 8.7 16.1 2048 7.7 10.3 20.5 4096 9.9 11.4 25.9 8192 14.5 22.1 39.6 16384 25.1 27.8 67.8 32768 44.7 95.7 91.5 65536 82.1 156.8 165.1 131072 194.0 335.1 341.5 262144 388.5 380.8 612.9 524288 1046.7 967.1 1653.3 1048576 1997.4 2169.0 4034.4 2097152 5502.9 5787.3 12608.1 The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average speedup depends on components. So I suggest you run ex2k to see in your environment whether oneMKL can speedup the kernels. --Junchao Zhang On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: Let me run some examples on our end to see whether the code calls expected functions. --Junchao Zhang On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: On Mon, Jun 24, 2024 at 11:?21 AM Yongzhong Li wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li > wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? Thank you, Yongzhong From:? ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!YUfsVRoVQQfGeK-iIORKffgc0-KL_n2OafjYnTsZGQRuWDiAbJ4RwqB3SmhIIvM83o5y6AP1fc_gaJVsJ8Fk2-2ihVffU9aFj-k$ The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. 
This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference. Thanks, Matt Thank you, Yongzhong From: Pierre Jolivet > Date: Sunday, June 23, 2024 at 12:41?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) --> Setting up matrix-vector products... Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines --> Solving the system... Excitation 1 of 1... ================================================ Iterative solve completed in 7435 ms. CONVERGED: rtol. Iterations: 72 Final relative residual norm: 9.22287e-07 ================================================ [CPU TIME] System solution: 2.27160000e+02 s. [WALL TIME] System solution: 7.44387218e+00 s. However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. Thanks, Pierre Thanks, Yongzhong From: Matthew Knepley > Date: Saturday, June 22, 2024 at 5:56?PM To: Yongzhong Li > Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. 
ZjQcmQRYFpfptBannerEnd MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE 
ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. Thanks, Matt Thanks, Yongzhong From: Junchao Zhang > Date: Saturday, June 22, 2024 at 9:40?AM To: Yongzhong Li > Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used $ cd src/mat/tests $ make ex1 $ MKL_VERBOSE=1 ./ex1 --Junchao Zhang On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: I am using export MKL_VERBOSE=1 ./xx in the bash file, do I have to use - ksp_converged_reason? Thanks, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:47?PM To: Yongzhong Li > Cc: Junchao Zhang >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? How do you set the variable? $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 [...] On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hello all, I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? 
Best, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:36?AM To: Junchao Zhang > Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: This Message Is From an External Sender This message came from outside your organization. I remember there are some MKL env vars to print MKL routines called. The environment variable is MKL_VERBOSE Thanks, Pierre Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up --Junchao Zhang On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, // Static variable to keep track of the stage counter static int stageCounter = 1; // Generate a unique stage name std::ostringstream oss; oss << "Stage " << stageCounter << " of Code"; std::string stageName = oss.str(); // Register the stage PetscLogStage stagenum; PetscLogStageRegister(stageName.c_str(), &stagenum); PetscLogStagePush(stagenum); KSPSolve(*ksp_ptr, b, x); PetscLogStagePop(); stageCounter++; I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. >From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. My questions is, >From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? Thank you, Yongzhong From: Barry Smith > Date: Friday, June 14, 2024 at 11:36?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue I am a bit confused. 
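(Referring back to the stage-registration snippet quoted a few messages above: it can be written with PETSc error checking so that a failing call is reported immediately. This is only a sketch; it assumes the enclosing function returns PetscErrorCode and that <sstream>, <string>, and petscksp.h are included, and the stage-naming scheme is illustrative.)

  // One log stage per KSPSolve() call, so -log_view reports each solve separately.
  static int stageCounter = 1;

  std::ostringstream oss;
  oss << "KSPSolve " << stageCounter;        // illustrative stage name
  const std::string stageName = oss.str();

  PetscLogStage stage;
  PetscCall(PetscLogStageRegister(stageName.c_str(), &stage));
  PetscCall(PetscLogStagePush(stage));
  PetscCall(KSPSolve(*ksp_ptr, b, x));       // the solve being profiled
  PetscCall(PetscLogStagePop());
  stageCounter++;

(In code whose enclosing function does not return PetscErrorCode, PetscCallAbort(PETSC_COMM_WORLD, ...) can be used in place of PetscCall().)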
Without the initial guess computation, there are still a bunch of events I don't understand MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 Finally there are a huge number of MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. Barry On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically KSPGuess Object: 1 MPI process type: fischer Model 1, size 200 However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? Thank you! Yongzhong From: Barry Smith > Date: Thursday, June 13, 2024 at 2:14?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? Thanks Barry On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Matt, I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. Thanks! 
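For the baseline comparison Barry asks for above, the Fischer guess only has to be left unconfigured. If it is currently being enabled in code, the setup that produced the "type: fischer, Model 1, size 200" object would look roughly like the following sketch (the call and option names should be checked against the installed PETSc version):

  KSPGuess guess;
  PetscCall(KSPGetGuess(ksp, &guess));                   /* the guess object attached to the KSP       */
  PetscCall(KSPGuessSetType(guess, KSPGUESSFISCHER));    /* what produced "type: fischer" above        */
  PetscCall(KSPGuessFischerSetModel(guess, 1, 200));     /* model 1, size 200, as shown in the output  */

For the baseline run these calls (or the equivalent -ksp_guess_type fischer / -ksp_guess_fischer_model options, if the guess is set through the options database) are simply omitted, so that KSPSolve() starts from the usual zero or user-supplied initial guess; everything else stays the same, and the two -log_view outputs can then be compared stage by stage.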
Yongzhong From: Matthew Knepley > Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!YUfsVRoVQQfGeK-iIORKffgc0-KL_n2OafjYnTsZGQRuWDiAbJ4RwqB3SmhIIvM83o5y6AP1fc_gaJVsJ8Fk2-2ihVffKimfAgU$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YUfsVRoVQQfGeK-iIORKffgc0-KL_n2OafjYnTsZGQRuWDiAbJ4RwqB3SmhIIvM83o5y6AP1fc_gaJVsJ8Fk2-2ihVffoJriLeI$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YUfsVRoVQQfGeK-iIORKffgc0-KL_n2OafjYnTsZGQRuWDiAbJ4RwqB3SmhIIvM83o5y6AP1fc_gaJVsJ8Fk2-2ihVffoJriLeI$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
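Concretely, the run Matt asks for above can be as simple as appending the monitoring options to the normal command line and redirecting the log summary to a file (the executable name and thread count below are placeholders):

  $ OMP_NUM_THREADS=8 ./your_app <your usual arguments> \
      -ksp_view -ksp_monitor_true_residual -ksp_converged_reason \
      -log_view :kspsolve_log.txt

-ksp_view prints the full solver configuration (KSP type and restart length, preconditioner, matrix types), which also settles the question of whether GMRES is actually being used; -ksp_monitor_true_residual and -ksp_converged_reason show the convergence history; -log_view :kspsolve_log.txt writes the performance summary to a file that can be attached to a reply.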
--
Norbert Wiener
https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YUfsVRoVQQfGeK-iIORKffgc0-KL_n2OafjYnTsZGQRuWDiAbJ4RwqB3SmhIIvM83o5y6AP1fc_gaJVsJ8Fk2-2ihVffoJriLeI$
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From yongzhong.li at mail.utoronto.ca  Wed Jun 26 23:41:46 2024
From: yongzhong.li at mail.utoronto.ca (Yongzhong Li)
Date: Thu, 27 Jun 2024 04:41:46 +0000
Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
In-Reply-To:
References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et>
Message-ID:

Thank you, Junchao, for your kind help! I ran this test case and observed the same things as you did, but I am not sure whether it is using GMRES or not.

Thanks,
Yongzhong

From: Junchao Zhang
Date: Wednesday, June 26, 2024 at 11:13 AM
To: Yongzhong Li
Cc: Matthew Knepley, Pierre Jolivet, petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue

Yongzhong,
Try Barry's approach first. BTW, I ran another petsc test. You can see GEMV was used in KSPSolve. You could also try this one.

$ cd src/ksp/ksp/tutorials
$ make bench_kspsolve
$ MKL_VERBOSE=1 OMP_PROC_BIND=spread MKL_NUM_THREADS=8 ./bench_kspsolve -split_ksp -mat_type aijmkl
===========================================
Test: KSP performance - Poisson
  Input matrix: 27-pt finite difference stencil
  -n 100
  DoFs = 1000000
  Number of nonzeros = 26463592
Step1  - creating Vecs and Mat...
Step2a - running PCSetUp()...
Step2b - running KSPSolve()...
MKL_VERBOSE oneMKL 2022.0 Product build 20211112 for Intel(R) 64 architecture Intel(R) Architecture processors, Lnx 3.18GHz lp64 gnu_thread
MKL_VERBOSE ZSCAL(1000000,0x7ffccef20c58,0x7fa9432b5e60,1) 474.25us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZSCAL(1000000,0x7ffccef20c58,0x7fa9441f8260,1) 1.93ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZGEMV(C,1000000,2,0x7ffccef20c20,0x7fa9432b5e60,1000000,0x7fa94513a660,1,0x7ffccef20c30,0x1c4b610,1) 1.86ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZSCAL(1000000,0x7ffccef20c58,0x7fa94513a660,1) 2.55ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZGEMV(C,1000000,3,0x7ffccef20c20,0x7fa9432b5e60,1000000,0x7fa8cb7a6660,1,0x7ffccef20c30,0x1c4b610,1) 2.95ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8

--Junchao Zhang

On Tue, Jun 25, 2024 at 10:19 PM Yongzhong Li wrote:
Hi Junchao, thank you for your help with these benchmarking tests!
I checked out petsc/main and did a few things to verify from my side,
1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node.
The results are as follow, $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.5 1.2 1.8 5.2 256 1.5 0.9 1.6 4.7 512 2.7 2.8 6.1 13.2 1024 4.0 4.0 9.3 16.4 2048 7.4 7.3 11.3 39.3 4096 14.2 13.9 19.1 93.4 8192 28.8 26.3 25.4 31.3 16384 54.1 25.8 26.7 33.8 32768 109.8 25.7 24.2 56.0 65536 220.2 24.4 26.5 89.0 131072 424.1 31.5 36.1 149.6 262144 898.1 37.1 53.9 286.1 524288 1754.6 48.7 100.3 1122.2 1048576 3645.8 86.5 347.9 2950.4 2097152 7371.4 308.7 1440.6 6874.9 $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.9 1.2 1.9 5.2 256 1.5 1.0 1.7 4.7 512 2.7 2.8 6.1 12.0 1024 3.9 4.0 9.3 16.8 2048 7.4 7.3 10.4 41.3 4096 14.0 13.8 18.6 84.2 8192 27.0 21.3 43.8 177.5 16384 54.1 34.1 89.1 330.4 32768 110.4 82.1 203.5 781.1 65536 213.0 191.8 423.9 1696.4 131072 428.7 360.2 934.0 4080.0 262144 883.4 723.2 1745.6 10120.7 524288 1817.5 1466.1 4751.4 23217.2 1048576 3611.0 3796.5 11814.9 48687.7 2097152 7401.9 10592.0 27543.2 106565.4 I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca >From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. Thank you, Yongzhong From: Junchao Zhang > Date: Tuesday, June 25, 2024 at 6:34?PM To: Matthew Knepley > Cc: Yongzhong Li >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Hi, Yongzhong, Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. 
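One detail worth checking in runs like the ones above: the MKL_VERBOSE banner earlier in the thread reports "lp64 gnu_thread", i.e. an MKL linked against the GNU OpenMP threading layer. With that layer the OpenMP runtime settings also influence how many threads MKL actually gets, so it may be worth setting the OpenMP variables alongside MKL_NUM_THREADS and then confirming the per-call thread count in the NThr: field of the MKL_VERBOSE lines (the numbers below are illustrative):

  $ MKL_VERBOSE=1 MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 OMP_PROC_BIND=spread ./ex2k -n 15 -m 4

If every MKL_VERBOSE line still ends in NThr:1, MKL is being restricted to a single thread regardless of what the vector sizes would allow.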
I then configured petsc with openblas, I did see better performance with more threads $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.1 256 1.8 2.7 7.0 512 2.1 3.1 8.6 1024 2.7 4.0 12.3 2048 3.8 6.3 28.0 4096 6.1 10.6 42.4 8192 10.9 21.8 79.5 16384 21.2 39.4 149.6 32768 45.9 75.7 224.6 65536 142.2 215.8 732.1 131072 169.1 233.2 1729.4 262144 367.5 830.0 4159.2 524288 999.2 1718.1 8538.5 1048576 2113.5 4082.1 18274.8 2097152 5392.6 10273.4 43273.4 $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.0 256 1.8 2.7 15.0 512 2.1 9.0 16.6 1024 2.6 8.7 16.1 2048 7.7 10.3 20.5 4096 9.9 11.4 25.9 8192 14.5 22.1 39.6 16384 25.1 27.8 67.8 32768 44.7 95.7 91.5 65536 82.1 156.8 165.1 131072 194.0 335.1 341.5 262144 388.5 380.8 612.9 524288 1046.7 967.1 1653.3 1048576 1997.4 2169.0 4034.4 2097152 5502.9 5787.3 12608.1 The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average speedup depends on components. So I suggest you run ex2k to see in your environment whether oneMKL can speedup the kernels. --Junchao Zhang On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: Let me run some examples on our end to see whether the code calls expected functions. --Junchao Zhang On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: On Mon, Jun 24, 2024 at 11:?21 AM Yongzhong Li wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li > wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? Thank you, Yongzhong From:? ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!euhAeEEnBfvIesZHJwcDVzLCX52J1nnxgDX40y_uhuMX9Elp4dBtFwELlYv5RxDuwEmbnPd1nYq0YBGXTQT6qdINMz_d50vQsNs$ The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference. 
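The FAQ linked above also describes how to measure this directly: PETSc ships a STREAMS-based benchmark that reports the usable memory bandwidth of a node, which puts an upper bound on what VecMDot, VecMAXPY, and MatMult can gain from extra threads or ranks. A sketch of the invocation (NPMAX, if supported by the installed PETSc version, is the largest number of MPI ranks the benchmark sweeps over):

  $ cd $PETSC_DIR
  $ make streams NPMAX=8

If the reported bandwidth stops growing after a few processes, adding more threads to the bandwidth-bound kernels on the same node cannot help much, whichever BLAS is linked.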
Thanks, Matt Thank you, Yongzhong From: Pierre Jolivet > Date: Sunday, June 23, 2024 at 12:41?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) --> Setting up matrix-vector products... Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines --> Solving the system... Excitation 1 of 1... ================================================ Iterative solve completed in 7435 ms. CONVERGED: rtol. Iterations: 72 Final relative residual norm: 9.22287e-07 ================================================ [CPU TIME] System solution: 2.27160000e+02 s. [WALL TIME] System solution: 7.44387218e+00 s. However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. Thanks, Pierre Thanks, Yongzhong From: Matthew Knepley > Date: Saturday, June 22, 2024 at 5:56?PM To: Yongzhong Li > Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From bsmith at petsc.dev  Thu Jun 27 00:12:19 2024
From: bsmith at petsc.dev (Barry Smith)
Date: Thu, 27 Jun 2024 01:12:19 -0400
Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
In-Reply-To:
References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et>
Message-ID: <4D1A8BC2-66AD-4627-84B7-B12A18BA0983@petsc.dev>

How big are the m's getting in your code?

> On Jun 27, 2024, at 12:40 AM, Yongzhong Li wrote:
>
> Hi Barry, I used gdb to debug my program and set a breakpoint at the VecMultiDot_Seq_GEMV function. I did see that, when I step through this function, it calls BLAS (but not always, only if m > 1), as shown below. However, I still did not see any MKL output even with MKL_VERBOSE=1.
>
> (gdb)
> 550       PetscCall(VecRestoreArrayRead(yin[i], &yfirst));
> (gdb)
> 553     m = j - i;
> (gdb)
> 554     if (m > 1) {
> (gdb)
> 555       PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above
> (gdb)
> 556       PetscScalar one = 1, zero = 0;
> (gdb)
> 558       PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione));
> (gdb) s
> PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> "VecMultiDot_Seq_GEMV",
>     file=0x7ffff68a1078 "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c")
>     at /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106
> 106     if (!TRdebug) return PETSC_SUCCESS;
> (gdb)
> 154     }
>
> Am I not using MKL BLAS, and is that why I did not see any multithreading speedup for KSPGMRESOrthog? What do you think could be the potential reasons? Is there any silent mode that could suppress the MKL verbose output?
>
> Thank you and best regards,
> Yongzhong
>
> From: Barry Smith
> Date: Wednesday, June 26, 2024 at 8:15 PM
> To: Yongzhong Li
> Cc: petsc-users at mcs.anl.gov
> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>
>   if (m > 1) {
>     PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above
>     PetscScalar one = 1, zero = 0;
>
>     PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione));
>     PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0)));
>
> The call to BLAS above is where it uses MKL.
>
> On Jun 26, 2024, at 6:59 PM, Yongzhong Li wrote:
>
> Hi Barry, I am looking into the source code of VecMultiDot_Seq_GEMV https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!dh3H976l8YzMiiu7dnwzGQD1bVFcbKhwrXd-7KTlCcTy5-HuzF2NCAbjk_OUrSgTzRUM5mm3gkrwgfYHj9kjCas$
> Can I ask which lines of code suggest the use of Intel MKL?
> > Thanks, > Yongzhong > > From: Barry Smith > > Date: Wednesday, June 26, 2024 at 10:30?AM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > In a debug version of PETSc run your application in a debugger and put a break point in VecMultiDot_Seq_GEMV. Then next through the code from that point to see what decision it makes about using dgemv() to see why it is not getting into the Intel code. > > > > > On Jun 25, 2024, at 11:19?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Junchao, thank you for your help for these benchmarking test! > > I check out to petsc/main and did a few things to verify from my side, > > 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node. The results are as follow, > > $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 14.5 1.2 1.8 5.2 > 256 1.5 0.9 1.6 4.7 > 512 2.7 2.8 6.1 13.2 > 1024 4.0 4.0 9.3 16.4 > 2048 7.4 7.3 11.3 39.3 > 4096 14.2 13.9 19.1 93.4 > 8192 28.8 26.3 25.4 31.3 > 16384 54.1 25.8 26.7 33.8 > 32768 109.8 25.7 24.2 56.0 > 65536 220.2 24.4 26.5 89.0 > 131072 424.1 31.5 36.1 149.6 > 262144 898.1 37.1 53.9 286.1 > 524288 1754.6 48.7 100.3 1122.2 > 1048576 3645.8 86.5 347.9 2950.4 > 2097152 7371.4 308.7 1440.6 6874.9 > > $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 14.9 1.2 1.9 5.2 > 256 1.5 1.0 1.7 4.7 > 512 2.7 2.8 6.1 12.0 > 1024 3.9 4.0 9.3 16.8 > 2048 7.4 7.3 10.4 41.3 > 4096 14.0 13.8 18.6 84.2 > 8192 27.0 21.3 43.8 177.5 > 16384 54.1 34.1 89.1 330.4 > 32768 110.4 82.1 203.5 781.1 > 65536 213.0 191.8 423.9 1696.4 > 131072 428.7 360.2 934.0 4080.0 > 262144 883.4 723.2 1745.6 10120.7 > 524288 1817.5 1466.1 4751.4 23217.2 > 1048576 3611.0 3796.5 11814.9 48687.7 > 2097152 7401.9 10592.0 27543.2 106565.4 > > I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like > > MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca > > From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. > > However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. > > I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. 
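To connect the last question above with the m > 1 branch Barry quoted: at GMRES iteration m, KSPGMRESOrthog() orthogonalizes the newest Krylov vector x against the m basis vectors y_1, ..., y_m already stored. VecMDot() therefore computes all m inner products at once,

  z_i = y_i^H x,  i = 1, ..., m    that is    z = Y^H x,    Y = [ y_1 | y_2 | ... | y_m ]  (n rows, m columns),

which is exactly one BLAS-2 call, ZGEMV('C', n, m, ...), on a tall-and-skinny dense matrix; that is what the quoted branch hands to MKL. VecMAXPY() is the companion update w <- w + Y a for the coefficient vector a, again a single GEMV on the same tall-and-skinny data. Two practical consequences follow. First, m grows from 1 up to the restart length (30 by default, adjustable with -ksp_gmres_restart), so early in each restart cycle m is small and there is little work for a multithreaded GEMV to parallelize; that is the point of Barry's question about how big the m's get. Second, per the quoted code, when m is 1 the GEMV branch is skipped and a plain dot product is used instead, so those calls never appear in the MKL_VERBOSE log even when MKL is linked correctly.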
> Yongzhong > > From: Matthew Knepley > > Date: Wednesday, June 12, 2024 at 6:46?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: > Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Dear PETSc?s developers, > I hope this email finds you well. > I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. > For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: > Matrix Type: Shell system matrix > Preconditioner: Shell PC > Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled > I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. > Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. > > For any performance question like this, we need to see the output of your code run with > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > Thanks, > > Matt > > Thank you for your time and assistance. > Best regards, > Yongzhong > ----------------------------------------------------------- > Yongzhong Li > PhD student | Electromagnetics Group > Department of Electrical & Computer Engineering > University of Toronto > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!dh3H976l8YzMiiu7dnwzGQD1bVFcbKhwrXd-7KTlCcTy5-HuzF2NCAbjk_OUrSgTzRUM5mm3gkrwgfYHgcv1ywU$ > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!dh3H976l8YzMiiu7dnwzGQD1bVFcbKhwrXd-7KTlCcTy5-HuzF2NCAbjk_OUrSgTzRUM5mm3gkrwgfYHBKMMlBk$ > > > > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
> -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!dh3H976l8YzMiiu7dnwzGQD1bVFcbKhwrXd-7KTlCcTy5-HuzF2NCAbjk_OUrSgTzRUM5mm3gkrwgfYHBKMMlBk$ > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!dh3H976l8YzMiiu7dnwzGQD1bVFcbKhwrXd-7KTlCcTy5-HuzF2NCAbjk_OUrSgTzRUM5mm3gkrwgfYHBKMMlBk$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Thu Jun 27 06:59:24 2024 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 27 Jun 2024 07:59:24 -0400 Subject: [petsc-users] Doubt about TSMonitorSolutionVTK In-Reply-To: References: <2067D58E-F041-429F-8ABE-B19DD9F733C2@petsc.dev> Message-ID: Do you get output when you run an example with that option? Is it possible that your current working directory is not what you expect? Maybe try putting in an absolute path. Thanks, Matt On Wed, Jun 26, 2024 at 5:30?PM MIGUEL MOLINOS PEREZ wrote: > Sorry, I did not put in cc petsc-users@ mcs. anl. gov my replay. Miguel > On Jun 24, 2024, at 6: 39 PM, MIGUEL MOLINOS PEREZ > wrote: Thank you Barry, This is exactly how I did it the first time. Miguel > On Jun 24, 2024, at 6: 37 > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Sorry, I did not put in cc petsc-users at mcs.anl.gov my replay. > > Miguel > > On Jun 24, 2024, at 6:39?PM, MIGUEL MOLINOS PEREZ wrote: > > Thank you Barry, > > This is exactly how I did it the first time. > > Miguel > > On Jun 24, 2024, at 6:37?PM, Barry Smith wrote: > > > See, for example, the bottom of src/ts/tutorials/ex26.c that uses > -ts_monitor_*solution_vtk* 'foo-%03d.vts' > > > On Jun 24, 2024, at 8:47?PM, MIGUEL MOLINOS PEREZ wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Dear all, > > I want to monitor the results at each iteration of TS using vtk format. To > do so, I add the following lines to my Monitor function: > > char vts_File_Name[MAXC]; > PetscCall(PetscSNPrintf(vts_File_Name, sizeof(vts_File_Name), > "./xi-MgHx-hcp-cube-x5-x5-x5-TS-BE-%i.vtu", step)); > PetscCall(TSMonitorSolutionVTK(ts, step, time, X, (void*)vts_File_Name)); > > My script compiles and executes without any sort of warning/error > messages. However, no output files are produced at the end of the > simulation. I?ve also tried the option ?-ts_monitor_solution_vtk > ?, but I got no results as well. > > I can?t find any similar example on the petsc website and I don?t see what > I am doing wrong. Could somebody point me to the right direction? > > Thanks, > Miguel > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!cRDEhdo8wMYy8YI7qVH44Ui6kVCB25tWDo4FafPe5dkLag3M8deW0vrvVYE7_UDXg-mBs7lTNZGsNNie5ANx$ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From junchao.zhang at gmail.com Thu Jun 27 09:18:31 2024 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Thu, 27 Jun 2024 09:18:31 -0500 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> Message-ID: Yongzhong, VecMDot(x, m, y[], ...) will be called multiple times in GMRES, with an increasing m up to 30. If you continue running, you should hit the breakpoint with m > 1. --Junchao Zhang On Wed, Jun 26, 2024 at 11:40?PM Yongzhong Li wrote: > Hi Barry, I used gdb to debug my program, set a breakpoint to > VecMultiDot_Seq_GEMV function. I did see when I debug this function, it > will call BLAS (but not always, only if m > 1), as shown below. However, I > still didn?t see any MKL outputs > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > > Hi Barry, I used gdb to debug my program, set a breakpoint to > VecMultiDot_Seq_GEMV function. I did see when I debug this function, it > will call BLAS (but not always, only if m > 1), as shown below. However, I > still didn?t see any MKL outputs even if I set MKLK_VERBOSE=1. > > *(gdb) * > > *550 PetscCall(VecRestoreArrayRead(yin[i], &yfirst));* > > *(gdb) * > > *553 m = j - i;* > > *(gdb) * > > *554 if (m > 1) {* > > *(gdb) * > > *555 PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the > cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above* > > *(gdb) * > > *556 PetscScalar one = 1, zero = 0;* > > *(gdb) * > > *558 PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, > yarray, &lda2, xarray, &ione, &zero, z + i, &ione));* > > *(gdb) s* > > *PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> > "VecMultiDot_Seq_GEMV", * > > * file=0x7ffff68a1078 > "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c")* > > * at > /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106* > > *106 if (!TRdebug) return PETSC_SUCCESS;* > > *(gdb) * > > *154 }* > > Am I not using MKL BLAS, is that why I didn?t see multithreading speed up > for KSPGMRESOrthog? What do you think could be the potential reasons? Is > there any silent mode that will possibly affect the MKL Verbose. > > Thank you and best regards, > > Yongzhong > > > > *From: *Barry Smith > *Date: *Wednesday, June 26, 2024 at 8:15?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > > > if (m > 1) { > > PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe > since we've screened out those lda > PETSC_BLAS_INT_MAX above > > PetscScalar one = 1, zero = 0; > > > > PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, > &lda2, xarray, &ione, &zero, z + i, &ione)); > > PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0))); > > > > The call to BLAS above is where it uses MKL. 
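(A minimal standalone sketch -- vector sizes and names here are illustrative, not taken from the application discussed above -- that calls VecMDot() with m > 1 so the BLASgemv branch quoted above can be exercised in isolation. Run it with MKL_VERBOSE=1: if the build really links MKL and the y[] arrays happen to be equally spaced in memory (often the case right after VecDuplicateVecs()), a ZGEMV/DGEMV line should be printed.

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec         x, *y;
  PetscInt    n = 1048576, m = 8;  /* large enough that the GEMV call is easy to spot */
  PetscScalar dots[8];

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(VecCreateSeq(PETSC_COMM_SELF, n, &x));
  PetscCall(VecSet(x, 1.0));
  PetscCall(VecDuplicateVecs(x, m, &y));
  for (PetscInt i = 0; i < m; i++) PetscCall(VecSet(y[i], (PetscScalar)(i + 1)));
  PetscCall(VecMDot(x, m, y, dots)); /* m > 1: VecMultiDot_Seq_GEMV can take the BLASgemv branch shown above */
  PetscCall(VecDestroyVecs(m, &y));
  PetscCall(VecDestroy(&x));
  PetscCall(PetscFinalize());
  return 0;
}

If no MKL_VERBOSE line appears even for this, the executable is most likely linked against a different BLAS; which BLAS/LAPACK was configured is recorded in $PETSC_DIR/$PETSC_ARCH/lib/petsc/conf/petscvariables, and ldd on the executable shows which BLAS shared libraries are actually loaded at run time.)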
> > > > > > > > On Jun 26, 2024, at 6:59?PM, Yongzhong Li > wrote: > > > > Hi Barry, I am looking into the source codes of VecMultiDot_Seq_GEMV > https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!fXzW07uptbVnpewjH3Gn1K7QwYw0tD9yNS7qkulS__TDbnWy2tZNakl7i46b_GUg_fdSr7oBEfVlTZ8cz-fxxmj_HT2Z$ > > Can I ask which lines of codes suggest the use of intel mkl? > > Thanks, > > Yongzhong > > > > *From: *Barry Smith > *Date: *Wednesday, June 26, 2024 at 10:30?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > > > In a debug version of PETSc run your application in a debugger and put > a break point in VecMultiDot_Seq_GEMV. Then next through the code from > that point to see what decision it makes about using dgemv() to see why it > is not getting into the Intel code. > > > > > > > > On Jun 25, 2024, at 11:19?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hi Junchao, thank you for your help for these benchmarking test! > > I check out to petsc/main and did a few things to verify from my side, > > 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute > node. The results are as follow, > > $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > > -------------------------------------------------------------------------- > > 128 14.5 1.2 1.8 5.2 > > 256 1.5 0.9 1.6 4.7 > > 512 2.7 2.8 6.1 13.2 > > 1024 4.0 4.0 9.3 16.4 > > 2048 7.4 7.3 11.3 39.3 > > 4096 14.2 13.9 19.1 93.4 > > 8192 28.8 26.3 25.4 31.3 > > 16384 54.1 25.8 26.7 33.8 > > 32768 109.8 25.7 24.2 56.0 > > 65536 220.2 24.4 26.5 89.0 > > 131072 424.1 31.5 36.1 149.6 > > 262144 898.1 37.1 53.9 286.1 > > 524288 1754.6 48.7 100.3 1122.2 > > 1048576 3645.8 86.5 347.9 2950.4 > > 2097152 7371.4 308.7 1440.6 6874.9 > > > > $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 > > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > > -------------------------------------------------------------------------- > > 128 14.9 1.2 1.9 5.2 > > 256 1.5 1.0 1.7 4.7 > > 512 2.7 2.8 6.1 12.0 > > 1024 3.9 4.0 9.3 16.8 > > 2048 7.4 7.3 10.4 41.3 > > 4096 14.0 13.8 18.6 84.2 > > 8192 27.0 21.3 43.8 177.5 > > 16384 54.1 34.1 89.1 330.4 > > 32768 110.4 82.1 203.5 781.1 > > 65536 213.0 191.8 423.9 1696.4 > > 131072 428.7 360.2 934.0 4080.0 > > 262144 883.4 723.2 1745.6 10120.7 > > 524288 1817.5 1466.1 4751.4 23217.2 > > 1048576 3611.0 3796.5 11814.9 48687.7 > > 2097152 7401.9 10592.0 27543.2 106565.4 > > > I can see the speed up brought by more MKL threads, and if I set > NKL_VERBOSE to 1, I can see something like > > > > > > *MKL_VERBOSE > ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) > 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca *From my understanding, > the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute > node and is using ZGEMV MKL BLAS. > > However, when I ran my own program and set MKL_VERBOSE to 1, it is very > strange that I still can?t find any MKL outputs, though I can see from the > PETSc log that VecMDot and VecMAXPY() are called. > > > I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a > way that is similar to ex2k test? Should I expect to see MKL outputs for > whatever linear system I solve with KSPGMRES? 
Does it relate to if it is > dense matrix or sparse matrix, although I am not really understand why > VecMDot/MAXPY() have something to do with dense matrix-vector > multiplication. > > Thank you, > > Yongzhong > > *From: *Junchao Zhang > *Date: *Tuesday, June 25, 2024 at 6:34?PM > *To: *Matthew Knepley > *Cc: *Yongzhong Li , Pierre Jolivet < > pierre at joliv.et>, petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > Hi, Yongzhong, > > Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we > can speed up the two with OpenMP threads, then we can speed up > KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in > dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny > matrices ). So with MKL_VERBOSE=1, you should see something like > "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with > petsc/main? > > petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran > VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was > strange to see no speedup. I then configured petsc with openblas, I did > see better performance with more threads > > > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.1 > 256 1.8 2.7 7.0 > 512 2.1 3.1 8.6 > 1024 2.7 4.0 12.3 > 2048 3.8 6.3 28.0 > 4096 6.1 10.6 42.4 > 8192 10.9 21.8 79.5 > 16384 21.2 39.4 149.6 > 32768 45.9 75.7 224.6 > 65536 142.2 215.8 732.1 > 131072 169.1 233.2 1729.4 > 262144 367.5 830.0 4159.2 > 524288 999.2 1718.1 8538.5 > 1048576 2113.5 4082.1 18274.8 > 2097152 5392.6 10273.4 43273.4 > > > > > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.0 > 256 1.8 2.7 15.0 > 512 2.1 9.0 16.6 > 1024 2.6 8.7 16.1 > 2048 7.7 10.3 20.5 > 4096 9.9 11.4 25.9 > 8192 14.5 22.1 39.6 > 16384 25.1 27.8 67.8 > 32768 44.7 95.7 91.5 > 65536 82.1 156.8 165.1 > 131072 194.0 335.1 341.5 > 262144 388.5 380.8 612.9 > 524288 1046.7 967.1 1653.3 > 1048576 1997.4 2169.0 4034.4 > 2097152 5502.9 5787.3 12608.1 > > > > The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average > speedup depends on components. So I suggest you run ex2k to see in your > environment whether oneMKL can speedup the kernels. > > > > --Junchao Zhang > > > > > > On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: > > Let me run some examples on our end to see whether the code calls expected > functions. > > > --Junchao Zhang > > > > > > On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: > > On Mon, Jun 24, 2024 at 11: 21 AM Yongzhong Li utoronto. ca> wrote: Thank you Pierre for your information. Do we have a > conclusion for my original question about the parallelization efficiency > for different stages of > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Thank you Pierre for your information. Do we have a conclusion for my > original question about the parallelization efficiency for different stages > of KSP Solve? Do we need to do more testing to figure out the issues? 
Thank > you, Yongzhong From: > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > Thank you Pierre for your information. Do we have a conclusion for my > original question about the parallelization efficiency for different stages > of KSP Solve? Do we need to do more testing to figure out the issues? > > > > We have an extended discussion of this here: > https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!fXzW07uptbVnpewjH3Gn1K7QwYw0tD9yNS7qkulS__TDbnWy2tZNakl7i46b_GUg_fdSr7oBEfVlTZ8cz-fxxpKPCDL9$ > > > > > The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) > are memory bandwidth limited. If there is no more bandwidth to be > marshalled on your board, then adding more processes does nothing at all. > This is why people were asking about how many "nodes" you are running on, > because that is the unit of memory bandwidth, not "cores" which make little > difference. > > > > Thanks, > > > > Matt > > > > Thank you, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Sunday, June 23, 2024 at 12:41?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > > > > > On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Yeah, I ran my program again using -mat_view::ascii_info and set > MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix > to be seqaijmkl type (I?ve attached a few as below) > > --> Setting up matrix-vector products... > > > > Mat Object: 1 MPI process > > type: seqaijmkl > > rows=16490, cols=35937 > > total: nonzeros=128496, allocated nonzeros=128496 > > total number of mallocs used during MatSetValues calls=0 > > not using I-node routines > > Mat Object: 1 MPI process > > type: seqaijmkl > > rows=16490, cols=35937 > > total: nonzeros=128496, allocated nonzeros=128496 > > total number of mallocs used during MatSetValues calls=0 > > not using I-node routines > > > > --> Solving the system... > > > > Excitation 1 of 1... > > > > ================================================ > > Iterative solve completed in 7435 ms. > > CONVERGED: rtol. > > Iterations: 72 > > Final relative residual norm: 9.22287e-07 > > ================================================ > > [CPU TIME] System solution: 2.27160000e+02 s. > > [WALL TIME] System solution: 7.44387218e+00 s. > > However, it seems to me that there were still no MKL outputs even I set > MKL_VERBOSE to be 1. Although, I think it should be many spmv operations > when doing KSPSolve(). Do you see the possible reasons? > > > > SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS > is. > > > > Thanks, > > Pierre > > > > Thanks, > > Yongzhong > > > > > > *From: *Matthew Knepley > *Date: *Saturday, June 22, 2024 at 5:56?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , Pierre Jolivet < > pierre at joliv.et>, petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? 
> > > On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 > MKL_VERBOSE Intel(R) MKL 2019. 0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > MKL_VERBOSE=1 ./ex1 > > > matrix nonzeros = 100, allocated nonzeros = 100 > > MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector Neural Network Instructions enabled > processors, Lnx 2.50GHz lp64 gnu_thread > > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) > 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) > 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) > 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) > 
1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) > 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) > 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) > 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) > 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) > 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) > 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) > 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) > 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > Yes, for petsc example, there are MKL outputs, but for my own program. All > I did is to change the matrix type from MATAIJ to MATAIJMKL to get > optimized performance for spmv from MKL. Should I expect to see any MKL > outputs in this case? > > > > Are you sure that the type changed? You can MatView() the matrix with > format ascii_info to see. > > > > Thanks, > > > > Matt > > > > > > Thanks, > > Yongzhong > > > > *From: *Junchao Zhang > *Date: *Saturday, June 22, 2024 at 9:40?AM > *To: *Yongzhong Li > *Cc: *Pierre Jolivet , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > No, you don't. It is strange. Perhaps you can you run a petsc example > first and see if MKL is really used > > $ cd src/mat/tests > > $ make ex1 > > $ MKL_VERBOSE=1 ./ex1 > > > --Junchao Zhang > > > > > > On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > I am using > > export MKL_VERBOSE=1 > > ./xx > > in the bash file, do I have to use - ksp_converged_reason? 
> > Thanks, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:47?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > How do you set the variable? > > > > $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason > > MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 > architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled > processors, Lnx 2.80GHz lp64 intel_thread > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > [...] > > > > On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hello all, > > I set MKL_VERBOSE = 1, but observed no print output specific to the use of > MKL. Does PETSc enable this verbose output? > > Best, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:36?AM > *To: *Junchao Zhang > *Cc: *Yongzhong Li , > petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > > > > > On 21 Jun 2024, at 6:42?AM, Junchao Zhang wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > I remember there are some MKL env vars to print MKL routines called. > > > > The environment variable is MKL_VERBOSE > > > > Thanks, > > Pierre > > > > Maybe we can try it to see what MKL routines are really used and then we > can understand why some petsc functions did not speed up > > > --Junchao Zhang > > > > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > Hi Barry, sorry for my last results. I didn?t fully understand the stage > profiling and logging in PETSc, now I only record KSPSolve() stage of my > program. Some sample codes are as follow, > > // Static variable to keep track of the stage counter > > static int stageCounter = 1; > > > > // Generate a unique stage name > > std::ostringstream oss; > > oss << "Stage " << stageCounter << " of Code"; > > std::string stageName = oss.str(); > > > > // Register the stage > > PetscLogStage stagenum; > > > > PetscLogStageRegister(stageName.c_str(), &stagenum); > > PetscLogStagePush(stagenum); > > > > *KSPSolve(*ksp_ptr, b, x);* > > > > PetscLogStagePop(); > > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other > stages where each one is KSPSolve() call. > > To provide some additional backgrounds, if you recall, I have been trying > to get efficient iterative solution using multithreading. I found out by > compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to > perform sparse matrix-vector multiplication faster, I am using > MATSEQAIJMKL. This makes the shell matrix vector product in each iteration > scale well with the #of threads. 
However, I found out the total GMERS solve > time (~KSPSolve() time) is not scaling well the #of threads. > > From the logging results I learned that when performing KSPSolve(), there > are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs > using different number of threads and plotted the time consumption for > PCApply() and KSPGMERSOrthog() against #of thread. I found out these two > operations are not scaling with the threads at all! My results are attached > as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, > KSPGMERSOrthog() will have many vector operations, so why these two parts > can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > > > *From: *Barry Smith > *Date: *Friday, June 14, 2024 at 11:36?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > I am a bit confused. Without the initial guess computation, there are > still a bunch of events I don't understand > > > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are > each done the same number of times) > > > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > > > Finally there are a huge number of > > > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > > > Are you making calls to all these routines? Are you doing this inside your > MatMult() or before you call KSPSolve? > > > > The reason I wanted you to make a simpler run without the initial guess > code is that your events are far more complicated than would be produced by > GMRES alone so it is not possible to understand the behavior you are seeing > without fully understanding all the events happening in the code. > > > > Barry > > > > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > > > Thanks, I have attached the results without using any KSPGuess. 
At low > frequency, the iteration steps are quite close to the one with KSPGuess, > specifically > > KSPGuess Object: 1 MPI process > > type: fischer > > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are > significant higher than the one with KSPGuess, I have attahced both of the > results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used > for a baseline comparsion? What are we comparing here? How does it relate > to the performance issue/bottleneck I found? ?*I have noticed that the > time taken by **KSPSolve** is **almost two times **greater than the CPU > time for matrix-vector product multiplied by the number of iteration*? > > Thank you! > Yongzhong > > > > *From: *Barry Smith > *Date: *Thursday, June 13, 2024 at 2:14?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > Can you please run the same thing without the KSPGuess option(s) for a > baseline comparison? > > > > Thanks > > > > Barry > > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hi Matt, > > I have rerun the program with the keys you provided. The system output > when performing ksp solve and the final petsc log output were stored in a > .txt file attached for your reference. > > Thanks! > Yongzhong > > > > *From: *Matthew Knepley > *Date: *Wednesday, June 12, 2024 at 6:46?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > ????????? knepley at gmail.com ????????????????? > > > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Dear PETSc?s developers, I hope this email finds you well. I am currently > working on a project using PETSc and have encountered a performance issue > with the KSPSolve function. Specifically, I have noticed that the time > taken by KSPSolve is > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > Dear PETSc?s developers, > > I hope this email finds you well. > > I am currently working on a project using PETSc and have encountered a > performance issue with the KSPSolve function. Specifically, *I have > noticed that the time taken by **KSPSolve** is **almost two times **greater > than the CPU time for matrix-vector product multiplied by the number of > iteration steps*. I use C++ chrono to record CPU time. > > For context, I am using a shell system matrix A. Despite my efforts to > parallelize the matrix-vector product (Ax), the overall solve time > remains higher than the matrix vector product per iteration indicates > when multiple threads were used. Here are a few details of my setup: > > - *Matrix Type*: Shell system matrix > - *Preconditioner*: Shell PC > - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK > library, multithreading is enabled > > I have considered several potential reasons, such as preconditioner setup, > additional solver operations, and the inherent overhead of using a shell > system matrix. 
*However, since KSPSolve is a high-level API, I have been > unable to pinpoint the exact cause of the increased solve time.* > > Have you observed the same issue? Could you please provide some > experience on how to diagnose and address this performance discrepancy? > Any insights or recommendations you could offer would be greatly > appreciated. > > > > For any performance question like this, we need to see the output of your > code run with > > > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > > > Thanks, > > > > Matt > > > > Thank you for your time and assistance. > > Best regards, > > Yongzhong > > ----------------------------------------------------------- > > *Yongzhong Li* > > PhD student | Electromagnetics Group > > Department of Electrical & Computer Engineering > > University of Toronto > > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!fXzW07uptbVnpewjH3Gn1K7QwYw0tD9yNS7qkulS__TDbnWy2tZNakl7i46b_GUg_fdSr7oBEfVlTZ8cz-fxxlUVQ5ec$ > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fXzW07uptbVnpewjH3Gn1K7QwYw0tD9yNS7qkulS__TDbnWy2tZNakl7i46b_GUg_fdSr7oBEfVlTZ8cz-fxxoToYAaW$ > > > > > > > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fXzW07uptbVnpewjH3Gn1K7QwYw0tD9yNS7qkulS__TDbnWy2tZNakl7i46b_GUg_fdSr7oBEfVlTZ8cz-fxxoToYAaW$ > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!fXzW07uptbVnpewjH3Gn1K7QwYw0tD9yNS7qkulS__TDbnWy2tZNakl7i46b_GUg_fdSr7oBEfVlTZ8cz-fxxoToYAaW$ > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yongzhong.li at mail.utoronto.ca Thu Jun 27 09:38:45 2024 From: yongzhong.li at mail.utoronto.ca (Yongzhong Li) Date: Thu, 27 Jun 2024 14:38:45 +0000 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: <4D1A8BC2-66AD-4627-84B7-B12A18BA0983@petsc.dev> References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> <4D1A8BC2-66AD-4627-84B7-B12A18BA0983@petsc.dev> Message-ID: Mostly 3, maximum 7, but definitely hit the point when m > 1, I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple times From: Barry Smith Date: Thursday, June 27, 2024 at 1:12?AM To: Yongzhong Li Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue How big are the m's getting in your code? On Jun 27, 2024, at 12:40?AM, Yongzhong Li wrote: Hi Barry, I used gdb to debug my program, set a breakpoint to VecMultiDot_Seq_GEMV function. 
I did see when I debug this function, it will call BLAS (but not always, only if m > 1), as shown below. However, I still didn?t see any MKL outputs even if I set MKLK_VERBOSE=1. (gdb) 550 PetscCall(VecRestoreArrayRead(yin[i], &yfirst)); (gdb) 553 m = j - i; (gdb) 554 if (m > 1) { (gdb) 555 PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above (gdb) 556 PetscScalar one = 1, zero = 0; (gdb) 558 PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); (gdb) s PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> "VecMultiDot_Seq_GEMV", file=0x7ffff68a1078 "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c") at /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106 106 if (!TRdebug) return PETSC_SUCCESS; (gdb) 154 } Am I not using MKL BLAS, is that why I didn?t see multithreading speed up for KSPGMRESOrthog? What do you think could be the potential reasons? Is there any silent mode that will possibly affect the MKL Verbose. Thank you and best regards, Yongzhong From: Barry Smith > Date: Wednesday, June 26, 2024 at 8:15?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue if (m > 1) { PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above PetscScalar one = 1, zero = 0; PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0))); The call to BLAS above is where it uses MKL. On Jun 26, 2024, at 6:59?PM, Yongzhong Li > wrote: Hi Barry, I am looking into the source codes of VecMultiDot_Seq_GEMV https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!ZshPGnAUymZ7rmZ8Cq0JR23FBhEioHOuAq-lFnn4iQn1bK8ioexLwIQVLSQNCfmBaWWExCcshZ6KphgTYR6kv18wg0MHEITtuVo$ Can I ask which lines of codes suggest the use of intel mkl? Thanks, Yongzhong From: Barry Smith > Date: Wednesday, June 26, 2024 at 10:30?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In a debug version of PETSc run your application in a debugger and put a break point in VecMultiDot_Seq_GEMV. Then next through the code from that point to see what decision it makes about using dgemv() to see why it is not getting into the Intel code. On Jun 25, 2024, at 11:19?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Junchao, thank you for your help for these benchmarking test! I check out to petsc/main and did a few things to verify from my side, 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node. 
The results are as follow, $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.5 1.2 1.8 5.2 256 1.5 0.9 1.6 4.7 512 2.7 2.8 6.1 13.2 1024 4.0 4.0 9.3 16.4 2048 7.4 7.3 11.3 39.3 4096 14.2 13.9 19.1 93.4 8192 28.8 26.3 25.4 31.3 16384 54.1 25.8 26.7 33.8 32768 109.8 25.7 24.2 56.0 65536 220.2 24.4 26.5 89.0 131072 424.1 31.5 36.1 149.6 262144 898.1 37.1 53.9 286.1 524288 1754.6 48.7 100.3 1122.2 1048576 3645.8 86.5 347.9 2950.4 2097152 7371.4 308.7 1440.6 6874.9 $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.9 1.2 1.9 5.2 256 1.5 1.0 1.7 4.7 512 2.7 2.8 6.1 12.0 1024 3.9 4.0 9.3 16.8 2048 7.4 7.3 10.4 41.3 4096 14.0 13.8 18.6 84.2 8192 27.0 21.3 43.8 177.5 16384 54.1 34.1 89.1 330.4 32768 110.4 82.1 203.5 781.1 65536 213.0 191.8 423.9 1696.4 131072 428.7 360.2 934.0 4080.0 262144 883.4 723.2 1745.6 10120.7 524288 1817.5 1466.1 4751.4 23217.2 1048576 3611.0 3796.5 11814.9 48687.7 2097152 7401.9 10592.0 27543.2 106565.4 I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca >From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. Thank you, Yongzhong From: Junchao Zhang > Date: Tuesday, June 25, 2024 at 6:34?PM To: Matthew Knepley > Cc: Yongzhong Li >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Hi, Yongzhong, Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. 
I then configured petsc with openblas, I did see better performance with more threads $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.1 256 1.8 2.7 7.0 512 2.1 3.1 8.6 1024 2.7 4.0 12.3 2048 3.8 6.3 28.0 4096 6.1 10.6 42.4 8192 10.9 21.8 79.5 16384 21.2 39.4 149.6 32768 45.9 75.7 224.6 65536 142.2 215.8 732.1 131072 169.1 233.2 1729.4 262144 367.5 830.0 4159.2 524288 999.2 1718.1 8538.5 1048576 2113.5 4082.1 18274.8 2097152 5392.6 10273.4 43273.4 $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.0 256 1.8 2.7 15.0 512 2.1 9.0 16.6 1024 2.6 8.7 16.1 2048 7.7 10.3 20.5 4096 9.9 11.4 25.9 8192 14.5 22.1 39.6 16384 25.1 27.8 67.8 32768 44.7 95.7 91.5 65536 82.1 156.8 165.1 131072 194.0 335.1 341.5 262144 388.5 380.8 612.9 524288 1046.7 967.1 1653.3 1048576 1997.4 2169.0 4034.4 2097152 5502.9 5787.3 12608.1 The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average speedup depends on components. So I suggest you run ex2k to see in your environment whether oneMKL can speedup the kernels. --Junchao Zhang On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: Let me run some examples on our end to see whether the code calls expected functions. --Junchao Zhang On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: On Mon, Jun 24, 2024 at 11:?21 AM Yongzhong Li wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li > wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? Thank you, Yongzhong From:? ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!ZshPGnAUymZ7rmZ8Cq0JR23FBhEioHOuAq-lFnn4iQn1bK8ioexLwIQVLSQNCfmBaWWExCcshZ6KphgTYR6kv18wg0MHdxA7B0w$ The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference. 
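(Side note: the FAQ entry linked above also suggests measuring how much memory bandwidth the node actually provides with the STREAMS benchmark shipped with PETSc, e.g. something like

$ cd $PETSC_DIR && make streams    (with PETSC_ARCH set appropriately)

The reported speedup versus the number of MPI ranks is roughly the ceiling one can expect for bandwidth-bound kernels such as SpMV, VecMDot and VecMAXPY, regardless of how many cores are available.)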
Thanks, Matt Thank you, Yongzhong From: Pierre Jolivet > Date: Sunday, June 23, 2024 at 12:41?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) --> Setting up matrix-vector products... Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines --> Solving the system... Excitation 1 of 1... ================================================ Iterative solve completed in 7435 ms. CONVERGED: rtol. Iterations: 72 Final relative residual norm: 9.22287e-07 ================================================ [CPU TIME] System solution: 2.27160000e+02 s. [WALL TIME] System solution: 7.44387218e+00 s. However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. Thanks, Pierre Thanks, Yongzhong From: Matthew Knepley > Date: Saturday, June 22, 2024 at 5:56?PM To: Yongzhong Li > Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. 
ZjQcmQRYFpfptBannerEnd MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE 
ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. Thanks, Matt Thanks, Yongzhong From: Junchao Zhang > Date: Saturday, June 22, 2024 at 9:40?AM To: Yongzhong Li > Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used $ cd src/mat/tests $ make ex1 $ MKL_VERBOSE=1 ./ex1 --Junchao Zhang On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: I am using export MKL_VERBOSE=1 ./xx in the bash file, do I have to use - ksp_converged_reason? Thanks, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:47?PM To: Yongzhong Li > Cc: Junchao Zhang >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? How do you set the variable? $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 [...] On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hello all, I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? 
Best, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:36?AM To: Junchao Zhang > Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: This Message Is From an External Sender This message came from outside your organization. I remember there are some MKL env vars to print MKL routines called. The environment variable is MKL_VERBOSE Thanks, Pierre Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up --Junchao Zhang On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, // Static variable to keep track of the stage counter static int stageCounter = 1; // Generate a unique stage name std::ostringstream oss; oss << "Stage " << stageCounter << " of Code"; std::string stageName = oss.str(); // Register the stage PetscLogStage stagenum; PetscLogStageRegister(stageName.c_str(), &stagenum); PetscLogStagePush(stagenum); KSPSolve(*ksp_ptr, b, x); PetscLogStagePop(); stageCounter++; I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. >From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. My questions is, >From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? Thank you, Yongzhong From: Barry Smith > Date: Friday, June 14, 2024 at 11:36?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue I am a bit confused. 
Without the initial guess computation, there are still a bunch of events I don't understand MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 Finally there are a huge number of MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. Barry On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically KSPGuess Object: 1 MPI process type: fischer Model 1, size 200 However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? Thank you! Yongzhong From: Barry Smith > Date: Thursday, June 13, 2024 at 2:14?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? Thanks Barry On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Matt, I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. Thanks! 
Yongzhong From: Matthew Knepley > Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!ZshPGnAUymZ7rmZ8Cq0JR23FBhEioHOuAq-lFnn4iQn1bK8ioexLwIQVLSQNCfmBaWWExCcshZ6KphgTYR6kv18wg0MHpSnB5jI$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZshPGnAUymZ7rmZ8Cq0JR23FBhEioHOuAq-lFnn4iQn1bK8ioexLwIQVLSQNCfmBaWWExCcshZ6KphgTYR6kv18wg0MHhNWbDeU$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZshPGnAUymZ7rmZ8Cq0JR23FBhEioHOuAq-lFnn4iQn1bK8ioexLwIQVLSQNCfmBaWWExCcshZ6KphgTYR6kv18wg0MHhNWbDeU$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZshPGnAUymZ7rmZ8Cq0JR23FBhEioHOuAq-lFnn4iQn1bK8ioexLwIQVLSQNCfmBaWWExCcshZ6KphgTYR6kv18wg0MHhNWbDeU$ -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Thu Jun 27 10:09:58 2024 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Thu, 27 Jun 2024 10:09:58 -0500 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> <4D1A8BC2-66AD-4627-84B7-B12A18BA0983@petsc.dev> Message-ID: How big is the n when you call PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione))? n is the vector length in VecMDot. it is strange with MKL_VERBOSE=1 you did not see MKL_VERBOSE *ZGEMV..., *since the code did call gemv. Perhaps you need to double check your spelling etc. If you also use ex2k, and potentially modify Ms[] and Ns[] to match the sizes in your code, to see if there is a speedup with more threads. --Junchao Zhang On Thu, Jun 27, 2024 at 9:39?AM Yongzhong Li wrote: > Mostly 3, maximum 7, but definitely hit the point when m > 1, I can see > the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, > xarray, &ione, &zero, z + i, &ione)); is called multiple > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > > Mostly 3, maximum 7, but definitely hit the point when m > 1, > > I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, > yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple > times > > > > *From: *Barry Smith > *Date: *Thursday, June 27, 2024 at 1:12?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > > > How big are the m's getting in your code? > > > > > > On Jun 27, 2024, at 12:40?AM, Yongzhong Li > wrote: > > > > Hi Barry, I used gdb to debug my program, set a breakpoint to > VecMultiDot_Seq_GEMV function. I did see when I debug this function, it > will call BLAS (but not always, only if m > 1), as shown below. However, I > still didn?t see any MKL outputs even if I set MKLK_VERBOSE=1. 
> > *(gdb) * > > *550 PetscCall(VecRestoreArrayRead(yin[i], &yfirst));* > > *(gdb) * > > *553 m = j - i;* > > *(gdb) * > > *554 if (m > 1) {* > > *(gdb) * > > *555 PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the > cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above* > > *(gdb) * > > *556 PetscScalar one = 1, zero = 0;* > > *(gdb) * > > *558 PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, > yarray, &lda2, xarray, &ione, &zero, z + i, &ione));* > > *(gdb) s* > > *PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> > "VecMultiDot_Seq_GEMV",* > > * file=0x7ffff68a1078 > "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c")* > > * at > /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106* > > *106 if (!TRdebug) return PETSC_SUCCESS;* > > *(gdb) * > > *154 }* > > Am I not using MKL BLAS, is that why I didn?t see multithreading speed up > for KSPGMRESOrthog? What do you think could be the potential reasons? Is > there any silent mode that will possibly affect the MKL Verbose. > > Thank you and best regards, > > Yongzhong > > > > *From: *Barry Smith > *Date: *Wednesday, June 26, 2024 at 8:15?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > > > if (m > 1) { > > PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe > since we've screened out those lda > PETSC_BLAS_INT_MAX above > > PetscScalar one = 1, zero = 0; > > > > PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, > &lda2, xarray, &ione, &zero, z + i, &ione)); > > PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0))); > > > > The call to BLAS above is where it uses MKL. > > > > > > > > On Jun 26, 2024, at 6:59?PM, Yongzhong Li > wrote: > > > > Hi Barry, I am looking into the source codes of VecMultiDot_Seq_GEMV > https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!YIWcf_KKMkD0cF4WByxs1gvoSLUd6eESKvM001K-3QghRdbB1H6soUepzco6ZH-_0gFFh1NpdARL9jReQbb9V90pVsXK$ > > Can I ask which lines of codes suggest the use of intel mkl? > > Thanks, > > Yongzhong > > > > *From: *Barry Smith > *Date: *Wednesday, June 26, 2024 at 10:30?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > > > In a debug version of PETSc run your application in a debugger and put > a break point in VecMultiDot_Seq_GEMV. Then next through the code from > that point to see what decision it makes about using dgemv() to see why it > is not getting into the Intel code. > > > > > > > > On Jun 25, 2024, at 11:19?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hi Junchao, thank you for your help for these benchmarking test! > > I check out to petsc/main and did a few things to verify from my side, > > 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute > node. 
The results are as follow, > > $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > > -------------------------------------------------------------------------- > > 128 14.5 1.2 1.8 5.2 > > 256 1.5 0.9 1.6 4.7 > > 512 2.7 2.8 6.1 13.2 > > 1024 4.0 4.0 9.3 16.4 > > 2048 7.4 7.3 11.3 39.3 > > 4096 14.2 13.9 19.1 93.4 > > 8192 28.8 26.3 25.4 31.3 > > 16384 54.1 25.8 26.7 33.8 > > 32768 109.8 25.7 24.2 56.0 > > 65536 220.2 24.4 26.5 89.0 > > 131072 424.1 31.5 36.1 149.6 > > 262144 898.1 37.1 53.9 286.1 > > 524288 1754.6 48.7 100.3 1122.2 > > 1048576 3645.8 86.5 347.9 2950.4 > > 2097152 7371.4 308.7 1440.6 6874.9 > > > > $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 > > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > > -------------------------------------------------------------------------- > > 128 14.9 1.2 1.9 5.2 > > 256 1.5 1.0 1.7 4.7 > > 512 2.7 2.8 6.1 12.0 > > 1024 3.9 4.0 9.3 16.8 > > 2048 7.4 7.3 10.4 41.3 > > 4096 14.0 13.8 18.6 84.2 > > 8192 27.0 21.3 43.8 177.5 > > 16384 54.1 34.1 89.1 330.4 > > 32768 110.4 82.1 203.5 781.1 > > 65536 213.0 191.8 423.9 1696.4 > > 131072 428.7 360.2 934.0 4080.0 > > 262144 883.4 723.2 1745.6 10120.7 > > 524288 1817.5 1466.1 4751.4 23217.2 > > 1048576 3611.0 3796.5 11814.9 48687.7 > > 2097152 7401.9 10592.0 27543.2 106565.4 > > > I can see the speed up brought by more MKL threads, and if I set > NKL_VERBOSE to 1, I can see something like > > > > > > *MKL_VERBOSE > ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) > 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca *From my understanding, > the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute > node and is using ZGEMV MKL BLAS. > > However, when I ran my own program and set MKL_VERBOSE to 1, it is very > strange that I still can?t find any MKL outputs, though I can see from the > PETSc log that VecMDot and VecMAXPY() are called. > > > I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a > way that is similar to ex2k test? Should I expect to see MKL outputs for > whatever linear system I solve with KSPGMRES? Does it relate to if it is > dense matrix or sparse matrix, although I am not really understand why > VecMDot/MAXPY() have something to do with dense matrix-vector > multiplication. > > Thank you, > > Yongzhong > > *From: *Junchao Zhang > *Date: *Tuesday, June 25, 2024 at 6:34?PM > *To: *Matthew Knepley > *Cc: *Yongzhong Li , Pierre Jolivet < > pierre at joliv.et>, petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > Hi, Yongzhong, > > Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we > can speed up the two with OpenMP threads, then we can speed up > KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in > dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny > matrices ). So with MKL_VERBOSE=1, you should see something like > "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with > petsc/main? > > petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran > VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was > strange to see no speedup. 
I then configured petsc with openblas, I did > see better performance with more threads > > > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.1 > 256 1.8 2.7 7.0 > 512 2.1 3.1 8.6 > 1024 2.7 4.0 12.3 > 2048 3.8 6.3 28.0 > 4096 6.1 10.6 42.4 > 8192 10.9 21.8 79.5 > 16384 21.2 39.4 149.6 > 32768 45.9 75.7 224.6 > 65536 142.2 215.8 732.1 > 131072 169.1 233.2 1729.4 > 262144 367.5 830.0 4159.2 > 524288 999.2 1718.1 8538.5 > 1048576 2113.5 4082.1 18274.8 > 2097152 5392.6 10273.4 43273.4 > > > > > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.0 > 256 1.8 2.7 15.0 > 512 2.1 9.0 16.6 > 1024 2.6 8.7 16.1 > 2048 7.7 10.3 20.5 > 4096 9.9 11.4 25.9 > 8192 14.5 22.1 39.6 > 16384 25.1 27.8 67.8 > 32768 44.7 95.7 91.5 > 65536 82.1 156.8 165.1 > 131072 194.0 335.1 341.5 > 262144 388.5 380.8 612.9 > 524288 1046.7 967.1 1653.3 > 1048576 1997.4 2169.0 4034.4 > 2097152 5502.9 5787.3 12608.1 > > > > The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average > speedup depends on components. So I suggest you run ex2k to see in your > environment whether oneMKL can speedup the kernels. > > > > --Junchao Zhang > > > > > > On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: > > Let me run some examples on our end to see whether the code calls expected > functions. > > > --Junchao Zhang > > > > > > On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: > > On Mon, Jun 24, 2024 at 11: 21 AM Yongzhong Li utoronto. ca> wrote: Thank you Pierre for your information. Do we have a > conclusion for my original question about the parallelization efficiency > for different stages of > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Thank you Pierre for your information. Do we have a conclusion for my > original question about the parallelization efficiency for different stages > of KSP Solve? Do we need to do more testing to figure out the issues? Thank > you, Yongzhong From: > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > Thank you Pierre for your information. Do we have a conclusion for my > original question about the parallelization efficiency for different stages > of KSP Solve? Do we need to do more testing to figure out the issues? > > > > We have an extended discussion of this here: > https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!YIWcf_KKMkD0cF4WByxs1gvoSLUd6eESKvM001K-3QghRdbB1H6soUepzco6ZH-_0gFFh1NpdARL9jReQbb9Vz-uch5_$ > > > > > The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) > are memory bandwidth limited. If there is no more bandwidth to be > marshalled on your board, then adding more processes does nothing at all. > This is why people were asking about how many "nodes" you are running on, > because that is the unit of memory bandwidth, not "cores" which make little > difference. 
> > > > Thanks, > > > > Matt > > > > Thank you, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Sunday, June 23, 2024 at 12:41?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > > > > > On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Yeah, I ran my program again using -mat_view::ascii_info and set > MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix > to be seqaijmkl type (I?ve attached a few as below) > > --> Setting up matrix-vector products... > > > > Mat Object: 1 MPI process > > type: seqaijmkl > > rows=16490, cols=35937 > > total: nonzeros=128496, allocated nonzeros=128496 > > total number of mallocs used during MatSetValues calls=0 > > not using I-node routines > > Mat Object: 1 MPI process > > type: seqaijmkl > > rows=16490, cols=35937 > > total: nonzeros=128496, allocated nonzeros=128496 > > total number of mallocs used during MatSetValues calls=0 > > not using I-node routines > > > > --> Solving the system... > > > > Excitation 1 of 1... > > > > ================================================ > > Iterative solve completed in 7435 ms. > > CONVERGED: rtol. > > Iterations: 72 > > Final relative residual norm: 9.22287e-07 > > ================================================ > > [CPU TIME] System solution: 2.27160000e+02 s. > > [WALL TIME] System solution: 7.44387218e+00 s. > > However, it seems to me that there were still no MKL outputs even I set > MKL_VERBOSE to be 1. Although, I think it should be many spmv operations > when doing KSPSolve(). Do you see the possible reasons? > > > > SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS > is. > > > > Thanks, > > Pierre > > > > Thanks, > > Yongzhong > > > > > > *From: *Matthew Knepley > *Date: *Saturday, June 22, 2024 at 5:56?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , Pierre Jolivet < > pierre at joliv.et>, petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > > > On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 > MKL_VERBOSE Intel(R) MKL 2019. 0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. 
> > > > ZjQcmQRYFpfptBannerEnd > > MKL_VERBOSE=1 ./ex1 > > > matrix nonzeros = 100, allocated nonzeros = 100 > > MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector Neural Network Instructions enabled > processors, Lnx 2.50GHz lp64 gnu_thread > > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) > 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) > 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) > 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) > 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) > 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) > 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > 
ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) > 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) > 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) > 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) > 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) > 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) > 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > Yes, for petsc example, there are MKL outputs, but for my own program. All > I did is to change the matrix type from MATAIJ to MATAIJMKL to get > optimized performance for spmv from MKL. Should I expect to see any MKL > outputs in this case? > > > > Are you sure that the type changed? You can MatView() the matrix with > format ascii_info to see. > > > > Thanks, > > > > Matt > > > > > > Thanks, > > Yongzhong > > > > *From: *Junchao Zhang > *Date: *Saturday, June 22, 2024 at 9:40?AM > *To: *Yongzhong Li > *Cc: *Pierre Jolivet , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > No, you don't. It is strange. Perhaps you can you run a petsc example > first and see if MKL is really used > > $ cd src/mat/tests > > $ make ex1 > > $ MKL_VERBOSE=1 ./ex1 > > > --Junchao Zhang > > > > > > On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > I am using > > export MKL_VERBOSE=1 > > ./xx > > in the bash file, do I have to use - ksp_converged_reason? > > Thanks, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:47?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > How do you set the variable? 
> > > > $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason > > MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 > architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled > processors, Lnx 2.80GHz lp64 intel_thread > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > > [...] > > > > On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hello all, > > I set MKL_VERBOSE = 1, but observed no print output specific to the use of > MKL. Does PETSc enable this verbose output? > > Best, > > Yongzhong > > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:36?AM > *To: *Junchao Zhang > *Cc: *Yongzhong Li , > petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > > > > > On 21 Jun 2024, at 6:42?AM, Junchao Zhang wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > I remember there are some MKL env vars to print MKL routines called. > > > > The environment variable is MKL_VERBOSE > > > > Thanks, > > Pierre > > > > Maybe we can try it to see what MKL routines are really used and then we > can understand why some petsc functions did not speed up > > > --Junchao Zhang > > > > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > Hi Barry, sorry for my last results. I didn?t fully understand the stage > profiling and logging in PETSc, now I only record KSPSolve() stage of my > program. Some sample codes are as follow, > > // Static variable to keep track of the stage counter > > static int stageCounter = 1; > > > > // Generate a unique stage name > > std::ostringstream oss; > > oss << "Stage " << stageCounter << " of Code"; > > std::string stageName = oss.str(); > > > > // Register the stage > > PetscLogStage stagenum; > > > > PetscLogStageRegister(stageName.c_str(), &stagenum); > > PetscLogStagePush(stagenum); > > > > *KSPSolve(*ksp_ptr, b, x);* > > > > PetscLogStagePop(); > > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other > stages where each one is KSPSolve() call. > > To provide some additional backgrounds, if you recall, I have been trying > to get efficient iterative solution using multithreading. I found out by > compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to > perform sparse matrix-vector multiplication faster, I am using > MATSEQAIJMKL. This makes the shell matrix vector product in each iteration > scale well with the #of threads. However, I found out the total GMERS solve > time (~KSPSolve() time) is not scaling well the #of threads. > > From the logging results I learned that when performing KSPSolve(), there > are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs > using different number of threads and plotted the time consumption for > PCApply() and KSPGMERSOrthog() against #of thread. 
I found out these two > operations are not scaling with the threads at all! My results are attached > as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, > KSPGMERSOrthog() will have many vector operations, so why these two parts > can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > > > *From: *Barry Smith > *Date: *Friday, June 14, 2024 at 11:36?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > I am a bit confused. Without the initial guess computation, there are > still a bunch of events I don't understand > > > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are > each done the same number of times) > > > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > > > Finally there are a huge number of > > > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > > > Are you making calls to all these routines? Are you doing this inside your > MatMult() or before you call KSPSolve? > > > > The reason I wanted you to make a simpler run without the initial guess > code is that your events are far more complicated than would be produced by > GMRES alone so it is not possible to understand the behavior you are seeing > without fully understanding all the events happening in the code. > > > > Barry > > > > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > > > Thanks, I have attached the results without using any KSPGuess. At low > frequency, the iteration steps are quite close to the one with KSPGuess, > specifically > > KSPGuess Object: 1 MPI process > > type: fischer > > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are > significant higher than the one with KSPGuess, I have attahced both of the > results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used > for a baseline comparsion? What are we comparing here? How does it relate > to the performance issue/bottleneck I found? ?*I have noticed that the > time taken by **KSPSolve** is **almost two times **greater than the CPU > time for matrix-vector product multiplied by the number of iteration*? 
> > Thank you! > Yongzhong > > > > *From: *Barry Smith > *Date: *Thursday, June 13, 2024 at 2:14?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > > > Can you please run the same thing without the KSPGuess option(s) for a > baseline comparison? > > > > Thanks > > > > Barry > > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > > > This Message Is From an External Sender > > This message came from outside your organization. > > Hi Matt, > > I have rerun the program with the keys you provided. The system output > when performing ksp solve and the final petsc log output were stored in a > .txt file attached for your reference. > > Thanks! > Yongzhong > > > > *From: *Matthew Knepley > *Date: *Wednesday, June 12, 2024 at 6:46?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > ????????? knepley at gmail.com ????????????????? > > > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Dear PETSc?s developers, I hope this email finds you well. I am currently > working on a project using PETSc and have encountered a performance issue > with the KSPSolve function. Specifically, I have noticed that the time > taken by KSPSolve is > > ZjQcmQRYFpfptBannerStart > > *This Message Is From an External Sender* > > This message came from outside your organization. > > > > ZjQcmQRYFpfptBannerEnd > > Dear PETSc?s developers, > > I hope this email finds you well. > > I am currently working on a project using PETSc and have encountered a > performance issue with the KSPSolve function. Specifically, *I have > noticed that the time taken by **KSPSolve** is **almost two times **greater > than the CPU time for matrix-vector product multiplied by the number of > iteration steps*. I use C++ chrono to record CPU time. > > For context, I am using a shell system matrix A. Despite my efforts to > parallelize the matrix-vector product (Ax), the overall solve time > remains higher than the matrix vector product per iteration indicates > when multiple threads were used. Here are a few details of my setup: > > - *Matrix Type*: Shell system matrix > - *Preconditioner*: Shell PC > - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK > library, multithreading is enabled > > I have considered several potential reasons, such as preconditioner setup, > additional solver operations, and the inherent overhead of using a shell > system matrix. *However, since KSPSolve is a high-level API, I have been > unable to pinpoint the exact cause of the increased solve time.* > > Have you observed the same issue? Could you please provide some > experience on how to diagnose and address this performance discrepancy? > Any insights or recommendations you could offer would be greatly > appreciated. > > > > For any performance question like this, we need to see the output of your > code run with > > > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > > > Thanks, > > > > Matt > > > > Thank you for your time and assistance. 
> > Best regards, > > Yongzhong > > ----------------------------------------------------------- > > *Yongzhong Li* > > PhD student | Electromagnetics Group > > Department of Electrical & Computer Engineering > > University of Toronto > > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!YIWcf_KKMkD0cF4WByxs1gvoSLUd6eESKvM001K-3QghRdbB1H6soUepzco6ZH-_0gFFh1NpdARL9jReQbb9Vx-JXVUs$ > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YIWcf_KKMkD0cF4WByxs1gvoSLUd6eESKvM001K-3QghRdbB1H6soUepzco6ZH-_0gFFh1NpdARL9jReQbb9V8zM_3Nb$ > > > > > > > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YIWcf_KKMkD0cF4WByxs1gvoSLUd6eESKvM001K-3QghRdbB1H6soUepzco6ZH-_0gFFh1NpdARL9jReQbb9V8zM_3Nb$ > > > > > > > > -- > > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!YIWcf_KKMkD0cF4WByxs1gvoSLUd6eESKvM001K-3QghRdbB1H6soUepzco6ZH-_0gFFh1NpdARL9jReQbb9V8zM_3Nb$ > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Bruce.Palmer at pnnl.gov Thu Jun 27 13:41:29 2024 From: Bruce.Palmer at pnnl.gov (Palmer, Bruce J) Date: Thu, 27 Jun 2024 18:41:29 +0000 Subject: [petsc-users] Unconstrained optimization question In-Reply-To: <87a5j7l5vh.fsf@jedbrown.org> References: <87a5j7l5vh.fsf@jedbrown.org> Message-ID: I needed to add a call to TaoSetFromOptions to get the runtime options to work. I also set grtol from the default value of 1.0e-7 to 1.0e-15. The calculation goes for a few iterations, but it looks like it keeps pushing into territory where the objective function blows up. It eventually quits with a line search error. The complete output from both the runtime options and the TaoView command looks like 4 TAO, Function value: 9.24265e+85, Residual: 3.11789e+76 TAO solve did not converge due to DIVERGED_LS_FAILURE iteration 4 Tao Object: 1 MPI process type: cg CG Type: prp Gradient steps: 0 Reset steps: 4 TaoLineSearch Object: 1 MPI process type: more-thuente maximum function evaluations=30 tolerances: ftol=0.0001, rtol=1e-10, gtol=0.9 total number of function evaluations=0 total number of gradient evaluations=0 total number of function/gradient evaluations=0 Termination reason: -3 convergence tolerances: gatol=1e-08, steptol=0., gttol=0. Residual in Function/Gradient:=3.11789e+76 Objective value=9.24265e+85 total number of iterations=4, (max: 100) total number of function/gradient evaluations=20, (max: 4000) Solver terminated: -6 Line Search Failure I?ll have to bite the bullet and convert everything into ps, nm, and au to get values down to a range where they are all of order 1 to get this to work. 
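A minimal C sketch of the Tao call sequence under discussion (Bruce's code is Fortran, but the Fortran bindings mirror these names; the objective callback, its context, and the tolerance values below are placeholders for illustration, not his actual routines):

#include <petsctao.h>

/* FormFunctionGradient stands in for the user's objective/gradient routine */
extern PetscErrorCode FormFunctionGradient(Tao, Vec, PetscReal *, Vec, void *);

PetscErrorCode SolveUnconstrained(MPI_Comm comm, Vec x, void *userctx)
{
  Tao                tao;
  TaoConvergedReason reason;

  PetscFunctionBeginUser;
  PetscCall(TaoCreate(comm, &tao));
  PetscCall(TaoSetType(tao, TAOCG));                 /* "cg", as in the runs above */
  PetscCall(TaoSetSolution(tao, x));                 /* TaoSetInitialVector() in older releases */
  PetscCall(TaoSetObjectiveAndGradient(tao, NULL, FormFunctionGradient, userctx));
  /* gatol, grtol, gttol: grtol drives the ||g(X)||/|f(X)| test that fired
     after 0 iterations when |f| was ~1e86 */
  PetscCall(TaoSetTolerances(tao, 1e-8, 1e-15, 0.0));
  PetscCall(TaoSetFromOptions(tao));                 /* lets -tao_monitor, -tao_grtol, ... take effect */
  PetscCall(TaoSolve(tao));
  PetscCall(TaoGetConvergedReason(tao, &reason));
  PetscCall(TaoDestroy(&tao));
  PetscFunctionReturn(PETSC_SUCCESS);
}

Tightening grtol only postpones the relative-gradient test; rescaling the problem so the objective and variables are O(1), as suggested in the replies, remains the more robust fix.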
Bruce From: Jed Brown Date: Wednesday, June 26, 2024 at 3:02?PM To: Palmer, Bruce J , Barry Smith Cc: petsc-users at mcs.anl.gov Subject: Re: [petsc-users] Unconstrained optimization question You can use the PETSC_OPTIONS environment variable to specify options if you don't pass the command line arguments through. You can set -tao_grtol smaller to handle this difference in scales between the objective and the gradient, though applying some nondimensionalization/choice of appropriate units is still recommended. "Palmer, Bruce J via petsc-users" writes: > This is a fortran code that doesn?t make use of argc,argv (I tried running with the runtime options anyway, in case you implemented some magic I?m not familiar with, but didn?t see anything new in the output). I have a call to TaoView(tao, PETSC_VIEWER_STDOUT_SELF,ierr) in the code and it reports back > > > > Tao Object: 1 MPI process > > type: cg > > CG Type: prp > > Gradient steps: 0 > > Reset steps: 0 > > TaoLineSearch Object: 1 MPI process > > type: more-thuente > > maximum function evaluations=30 > > tolerances: ftol=0.0001, rtol=1e-10, gtol=0.9 > > total number of function evaluations=0 > > total number of gradient evaluations=0 > > total number of function/gradient evaluations=0 > > Termination reason: 0 > > convergence tolerances: gatol=1e-08, steptol=0., gttol=0. > > Residual in Function/Gradient:=7.54237e+75 > > Objective value=2.96082e+86 > > total number of iterations=0, (max: 100) > > total number of function/gradient evaluations=1, (max: 4000) > > Solution converged: ||g(X)||/|f(X)| <= grtol > > > > Bruce > > From: Barry Smith > Date: Wednesday, June 26, 2024 at 2:02?PM > To: Palmer, Bruce J > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] Unconstrained optimization question > Check twice before you click! This email originated from outside PNNL. > > > Please run with -tao_monitor -tao_converged_reason and see why it has stopped. > > Barry > > > > On Jun 26, 2024, at 4:34?PM, Palmer, Bruce J via petsc-users wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi, > > I?m trying to do an unconstrained optimization on a molecular scale problem. Previously, I was looking at an artificial molecular problem where all parameters were of order 1 and so the objective function and variables were also in the range of 1 or at least within a few orders of magnitude of 1. > > More recently, I?ve been trying to apply this optimization to a real molecular system. Between Avogadro?s number (6.022e23) and Boltzmann?s constant (1.38e-16) combined with very small distances (1.0e-8 cm), etc. the objective function values and the values of the optimization variables have very large values (~1e86 and ~1e9, respectively). I?ve verified that the analytic gradients of the objective function that I?m calculating are correct by comparing them with numerical derivatives. > > I?ve tried using the LMVM and Conjugate Gradient optimizations, both of which worked previously, but I find that the optimization completes one objective function evaluation and then declares that the problem is converged and stops. I could find a set of units where everything is approximately 1 but I was hoping that there are some parameters I can set in the optimization that will get it moving again. Any suggestions? > > Bruce Palmer -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From ligang0309 at gmail.com Thu Jun 27 21:26:05 2024
From: ligang0309 at gmail.com (Gang Li)
Date: Fri, 28 Jun 2024 10:26:05 +0800
Subject: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin
In-Reply-To: <552dde2a-782a-5238-4897-18736ac9e94a@fastmail.org>
References: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> <21e32b88-aed2-a618-3e3c-dca47c6bc456@fastmail.org> <5627D31E-5225-47CA-B337-A08E74C29D4A@gmail.com> <552dde2a-782a-5238-4897-18736ac9e94a@fastmail.org>
Message-ID: <7620557F-4CB0-4E6A-91AF-B3C47DC1BCDD@gmail.com>

Hi Satish,

Thanks for your help. The same error happens after I used tar (the attached file).

Sincerely,
Gang

---- Replied Message ----
From: Satish Balay
Date: 6/26/2024 05:06
To: Gang Li
Cc: petsc-users
Subject: Re: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin

On Tue, 25 Jun 2024, Gang Li wrote:

The same error when I restart this build using a fresh tarball.

Did you use winzip or a similar Windows utility to 'untar' the tarball?

Can you try from the cygwin shell:

tar -xzf petsc-3.21.2.tar.gz

And see if the build works here?

Alternatively use cygwin git and see if that makes a difference

git clone -b release https://urldefense.us/v3/__https://gitlab.com/petsc/petsc.git__;!!G_uCfscf7eWS!cbu0xpk1rHAOkguuYVHMSeA6o7-j5fVq5G2wTZDREAeyAYIovFxqPWeacYu299Sx8OW69AYzSza10erOU2OH4zoI$

Satish

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: error1.jpg
Type: image/jpeg
Size: 186301 bytes
Desc: not available
URL: 

From balay.anl at fastmail.org Thu Jun 27 23:41:41 2024
From: balay.anl at fastmail.org (Satish Balay)
Date: Thu, 27 Jun 2024 23:41:41 -0500 (CDT)
Subject: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin
In-Reply-To: <7620557F-4CB0-4E6A-91AF-B3C47DC1BCDD@gmail.com>
References: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> <21e32b88-aed2-a618-3e3c-dca47c6bc456@fastmail.org> <5627D31E-5225-47CA-B337-A08E74C29D4A@gmail.com> <552dde2a-782a-5238-4897-18736ac9e94a@fastmail.org> <7620557F-4CB0-4E6A-91AF-B3C47DC1BCDD@gmail.com>
Message-ID: <365c3d40-0f77-1158-1759-bb4c4e2b1dda@fastmail.org>

An HTML attachment was scrubbed...
URL: 

From yongzhong.li at mail.utoronto.ca Fri Jun 28 00:46:12 2024
From: yongzhong.li at mail.utoronto.ca (Yongzhong Li)
Date: Fri, 28 Jun 2024 05:46:12 +0000
Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
In-Reply-To: 
References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> <4D1A8BC2-66AD-4627-84B7-B12A18BA0983@petsc.dev>
Message-ID: 

Thanks all for your help!!! I think I found the issue.

I am compiling a large CMake project that relies on many external libraries (projects). Previously, I used OpenBLAS as the BLAS for all the dependencies, including PETSc. After I switched to Intel MKL for PETSc, I still kept OpenBLAS and used it as the BLAS for all the other dependencies. It seems that, even though I pointed blas-lapack-dir at MKLROOT when PETSc was configured, the actual program still used OpenBLAS for some PETSc functions, such as VecMDot() and VecMAXPY(), which is why I didn't see any MKL verbose output during KSPSolve().
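Given the diagnosis above (two BLAS libraries ending up in one executable), a quick runtime check can make it unambiguous which one the process actually resolves. A small diagnostic sketch, assuming Linux/glibc, linking with -ldl, and the usual trailing-underscore Fortran name mangling for BLAS symbols; the function name is made up for illustration:

#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>

/* Print which shared object provides a BLAS symbol in the running process,
   e.g. libmkl_rt.so versus libopenblas.so; "zgemv_" is just one convenient symbol. */
static void ReportBLASProvider(void)
{
  Dl_info info;
  void   *sym = dlsym(RTLD_DEFAULT, "zgemv_");

  if (sym && dladdr(sym, &info) && info.dli_fname) printf("zgemv_ is provided by %s\n", info.dli_fname);
  else printf("could not locate zgemv_ among the loaded libraries\n");
}

Running ldd on the executable (or inspecting /proc/<pid>/maps) shows the same information from outside the process.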
Now I remove the OpenBLAS and use Intel MKL as the BLAS for all the dependencies. The issue is resolved, I can clearly see MKL routines are called when KSP GMRES is running. Back to my original questions, my goal is to achieve good parallelization efficiency for KSP GMRES Solve. As I use multithreading-enabled MKL spmv routines, the wall time for MatMult/MatMultAdd() has been greatly reduced. However,the KSPGMRESOrthog and MatSolve in PCApply still take over 50% of solving time and can?t benefit from multithreading. After I fixed the issue I mentioned, I found I got around 15% time reduced because of more efficient VecMDot() calls. I attach a petsc log comparison for your reference (same settings, only difference is whether use MKL BLAS or not), you can see the percentage of VecMDot() is reduced. However, here comes the interesting part, VecMAXPY() didn?t benefit from MKL BLAS, it still takes almost 40% of solution when I use 64 MKL Threads, which is a lot for my program. And if I multiple this percentage with the actual wall time against different # of threads, it stays the same. Then I used ex2k benchmark to verify what I found. Here is the result, $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 5 -test_name VecMAXPY Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) -------------------------------------------------------------------------- 128 0.4 0.9 2.4 8.8 256 0.3 1.1 3.5 13.3 512 0.5 4.4 6.7 26.5 1024 0.9 4.8 13.3 51.0 2048 3.5 12.3 37.1 94.7 4096 4.3 24.5 73.6 179.6 8192 6.3 48.7 98.9 380.8 16384 9.3 99.2 200.2 774.0 32768 30.6 155.4 421.2 1662.9 65536 101.2 269.4 827.4 3565.0 131072 206.9 551.0 1829.0 7580.5 262144 450.2 1251.9 3986.2 15525.6 524288 1322.1 2901.7 8567.1 31840.0 1048576 2788.6 6190.6 16394.7 63514.9 2097152 5534.8 12619.9 35427.4 130064.5 $ MKL_NUM_THREADS=8 ./ex2k -n 15 -m 5 -test_name VecMAXPY Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) -------------------------------------------------------------------------- 128 0.3 0.7 2.4 8.8 256 0.3 1.1 3.6 13.5 512 0.5 4.4 6.8 26.4 1024 0.9 4.8 13.6 50.5 2048 7.6 12.2 36.5 95.0 4096 8.5 25.7 72.4 182.6 8192 11.9 48.5 103.7 383.7 16384 12.8 97.7 203.7 785.0 32768 11.2 148.5 421.9 1681.5 65536 15.5 271.2 843.8 3613.7 131072 34.3 564.7 1905.2 7558.8 262144 106.4 1334.5 4002.8 15458.3 524288 217.2 2858.4 8407.9 31303.7 1048576 701.5 6060.6 16947.3 64118.5 2097152 1769.7 13218.3 36347.3 131062.9 It stays the same, no benefit from multithreading BLAS!! Unlike what I found for VecMdot(), where I did see speed up for more #of threads. Then, I dig deeper. I learned that for VecMDot(), it calls ZGEMV while for VecMAXPY(), it calls ZAXPY. This observation seems to indicate that ZAXPY is not benefiting from MKL threads. My question is do you know why ZAXPY is not multithreaded? From my perspective, VecMDot() and VecMAXPY() are very similar operations, the only difference is whether we need to scale the vectors to be multiplied or not. I think you have mentioned that recently you did some optimization to these two routines, from my above results and observations, are these aligned with your expectations? Could we further optimize the codes to get more parallelization efficiency in my case? And another question, can MatSolve() in KSPSolve be multithreaded? Would MUMPS help? 
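For example, would something along these lines be the right direction (only a sketch on my end, assuming an OpenMP-enabled MUMPS build; the executable name is a placeholder, and I realize any benefit depends on what factorization my shell preconditioner actually performs in MatSolve()):

./configure --download-mumps --download-scalapack --with-openmp=1 ...
OMP_NUM_THREADS=8 ./myapp -pc_type lu -pc_factor_mat_solver_type mumps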
Thank you and regards, Yongzhong From: Junchao Zhang Sent: Thursday, June 27, 2024 11:10 AM To: Yongzhong Li Cc: Barry Smith ; petsc-users at mcs.anl.gov Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue How big is the n when you call PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione))? n is the vector length in VecMDot. it is strange with MKL_VERBOSE=1 you did not see MKL_VERBOSE ZGEMV..., since the code did call gemv. Perhaps you need to double check your spelling etc. If you also use ex2k, and potentially modify Ms[] and Ns[] to match the sizes in your code, to see if there is a speedup with more threads. --Junchao Zhang On Thu, Jun 27, 2024 at 9:39?AM Yongzhong Li > wrote: Mostly 3, maximum 7, but definitely hit the point when m > 1, I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Mostly 3, maximum 7, but definitely hit the point when m > 1, I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple times From: Barry Smith > Date: Thursday, June 27, 2024 at 1:12?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue How big are the m's getting in your code? On Jun 27, 2024, at 12:40?AM, Yongzhong Li > wrote: Hi Barry, I used gdb to debug my program, set a breakpoint to VecMultiDot_Seq_GEMV function. I did see when I debug this function, it will call BLAS (but not always, only if m > 1), as shown below. However, I still didn?t see any MKL outputs even if I set MKLK_VERBOSE=1. (gdb) 550 PetscCall(VecRestoreArrayRead(yin[i], &yfirst)); (gdb) 553 m = j - i; (gdb) 554 if (m > 1) { (gdb) 555 PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above (gdb) 556 PetscScalar one = 1, zero = 0; (gdb) 558 PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); (gdb) s PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> "VecMultiDot_Seq_GEMV", file=0x7ffff68a1078 "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c") at /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106 106 if (!TRdebug) return PETSC_SUCCESS; (gdb) 154 } Am I not using MKL BLAS, is that why I didn?t see multithreading speed up for KSPGMRESOrthog? What do you think could be the potential reasons? Is there any silent mode that will possibly affect the MKL Verbose. 
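One further check on my side is to confirm which BLAS library the executable actually resolves at run time, for instance (Linux, executable name is just a placeholder):

ldd ./myapp | grep -iE 'mkl|blas'

If an OpenBLAS library appears ahead of the MKL ones, that would explain seeing the GEMV call in gdb while getting no MKL_VERBOSE output.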
Thank you and best regards, Yongzhong From: Barry Smith > Date: Wednesday, June 26, 2024 at 8:15?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue if (m > 1) { PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above PetscScalar one = 1, zero = 0; PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0))); The call to BLAS above is where it uses MKL. On Jun 26, 2024, at 6:59?PM, Yongzhong Li > wrote: Hi Barry, I am looking into the source codes of VecMultiDot_Seq_GEMV https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!bXRYdiBfnJwoBG-4JcUQeLNA6EIE9Kicayx8GqhokY2D2U3eRc7_aHec0m8OquNYeuD4V7UO1xpvKA3PLMrZ5KTETKHhfx485MQ$ Can I ask which lines of codes suggest the use of intel mkl? Thanks, Yongzhong From: Barry Smith > Date: Wednesday, June 26, 2024 at 10:30?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In a debug version of PETSc run your application in a debugger and put a break point in VecMultiDot_Seq_GEMV. Then next through the code from that point to see what decision it makes about using dgemv() to see why it is not getting into the Intel code. On Jun 25, 2024, at 11:19?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Junchao, thank you for your help for these benchmarking test! I check out to petsc/main and did a few things to verify from my side, 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node. The results are as follow, $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.5 1.2 1.8 5.2 256 1.5 0.9 1.6 4.7 512 2.7 2.8 6.1 13.2 1024 4.0 4.0 9.3 16.4 2048 7.4 7.3 11.3 39.3 4096 14.2 13.9 19.1 93.4 8192 28.8 26.3 25.4 31.3 16384 54.1 25.8 26.7 33.8 32768 109.8 25.7 24.2 56.0 65536 220.2 24.4 26.5 89.0 131072 424.1 31.5 36.1 149.6 262144 898.1 37.1 53.9 286.1 524288 1754.6 48.7 100.3 1122.2 1048576 3645.8 86.5 347.9 2950.4 2097152 7371.4 308.7 1440.6 6874.9 $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 14.9 1.2 1.9 5.2 256 1.5 1.0 1.7 4.7 512 2.7 2.8 6.1 12.0 1024 3.9 4.0 9.3 16.8 2048 7.4 7.3 10.4 41.3 4096 14.0 13.8 18.6 84.2 8192 27.0 21.3 43.8 177.5 16384 54.1 34.1 89.1 330.4 32768 110.4 82.1 203.5 781.1 65536 213.0 191.8 423.9 1696.4 131072 428.7 360.2 934.0 4080.0 262144 883.4 723.2 1745.6 10120.7 524288 1817.5 1466.1 4751.4 23217.2 1048576 3611.0 3796.5 11814.9 48687.7 2097152 7401.9 10592.0 27543.2 106565.4 I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca >From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. 
However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. Thank you, Yongzhong From: Junchao Zhang > Date: Tuesday, June 25, 2024 at 6:34?PM To: Matthew Knepley > Cc: Yongzhong Li >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Hi, Yongzhong, Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. I then configured petsc with openblas, I did see better performance with more threads $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.1 256 1.8 2.7 7.0 512 2.1 3.1 8.6 1024 2.7 4.0 12.3 2048 3.8 6.3 28.0 4096 6.1 10.6 42.4 8192 10.9 21.8 79.5 16384 21.2 39.4 149.6 32768 45.9 75.7 224.6 65536 142.2 215.8 732.1 131072 169.1 233.2 1729.4 262144 367.5 830.0 4159.2 524288 999.2 1718.1 8538.5 1048576 2113.5 4082.1 18274.8 2097152 5392.6 10273.4 43273.4 $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) -------------------------------------------------------------------------- 128 2.0 2.5 6.0 256 1.8 2.7 15.0 512 2.1 9.0 16.6 1024 2.6 8.7 16.1 2048 7.7 10.3 20.5 4096 9.9 11.4 25.9 8192 14.5 22.1 39.6 16384 25.1 27.8 67.8 32768 44.7 95.7 91.5 65536 82.1 156.8 165.1 131072 194.0 335.1 341.5 262144 388.5 380.8 612.9 524288 1046.7 967.1 1653.3 1048576 1997.4 2169.0 4034.4 2097152 5502.9 5787.3 12608.1 The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average speedup depends on components. So I suggest you run ex2k to see in your environment whether oneMKL can speedup the kernels. --Junchao Zhang On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: Let me run some examples on our end to see whether the code calls expected functions. --Junchao Zhang On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: On Mon, Jun 24, 2024 at 11:?21 AM Yongzhong Li wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li > wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? 
Do we need to do more testing to figure out the issues? Thank you, Yongzhong From:? ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!bXRYdiBfnJwoBG-4JcUQeLNA6EIE9Kicayx8GqhokY2D2U3eRc7_aHec0m8OquNYeuD4V7UO1xpvKA3PLMrZ5KTETKHhuzdTCIE$ The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference. Thanks, Matt Thank you, Yongzhong From: Pierre Jolivet > Date: Sunday, June 23, 2024 at 12:41?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) --> Setting up matrix-vector products... Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines Mat Object: 1 MPI process type: seqaijmkl rows=16490, cols=35937 total: nonzeros=128496, allocated nonzeros=128496 total number of mallocs used during MatSetValues calls=0 not using I-node routines --> Solving the system... Excitation 1 of 1... ================================================ Iterative solve completed in 7435 ms. CONVERGED: rtol. Iterations: 72 Final relative residual norm: 9.22287e-07 ================================================ [CPU TIME] System solution: 2.27160000e+02 s. [WALL TIME] System solution: 7.44387218e+00 s. However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. Thanks, Pierre Thanks, Yongzhong From: Matthew Knepley > Date: Saturday, June 22, 2024 at 5:56?PM To: Yongzhong Li > Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. 
ZjQcmQRYFpfptBannerEnd MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE 
ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. Thanks, Matt Thanks, Yongzhong From: Junchao Zhang > Date: Saturday, June 22, 2024 at 9:40?AM To: Yongzhong Li > Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used $ cd src/mat/tests $ make ex1 $ MKL_VERBOSE=1 ./ex1 --Junchao Zhang On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: I am using export MKL_VERBOSE=1 ./xx in the bash file, do I have to use - ksp_converged_reason? Thanks, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:47?PM To: Yongzhong Li > Cc: Junchao Zhang >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? How do you set the variable? $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 [...] On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hello all, I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? 
Best, Yongzhong From: Pierre Jolivet > Date: Friday, June 21, 2024 at 1:36?AM To: Junchao Zhang > Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? pierre at joliv.et ????????????????? On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: This Message Is From an External Sender This message came from outside your organization. I remember there are some MKL env vars to print MKL routines called. The environment variable is MKL_VERBOSE Thanks, Pierre Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up --Junchao Zhang On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, // Static variable to keep track of the stage counter static int stageCounter = 1; // Generate a unique stage name std::ostringstream oss; oss << "Stage " << stageCounter << " of Code"; std::string stageName = oss.str(); // Register the stage PetscLogStage stagenum; PetscLogStageRegister(stageName.c_str(), &stagenum); PetscLogStagePush(stagenum); KSPSolve(*ksp_ptr, b, x); PetscLogStagePop(); stageCounter++; I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. >From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. My questions is, >From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? Thank you, Yongzhong From: Barry Smith > Date: Friday, June 14, 2024 at 11:36?AM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue I am a bit confused. 
Without the initial guess computation, there are still a bunch of events I don't understand MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 Finally there are a huge number of MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. Barry On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically KSPGuess Object: 1 MPI process type: fischer Model 1, size 200 However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? Thank you! Yongzhong From: Barry Smith > Date: Thursday, June 13, 2024 at 2:14?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? Thanks Barry On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: This Message Is From an External Sender This message came from outside your organization. Hi Matt, I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. Thanks! 
Yongzhong From: Matthew Knepley > Date: Wednesday, June 12, 2024 at 6:46?PM To: Yongzhong Li > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue ????????? knepley at gmail.com ????????????????? On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. ZjQcmQRYFpfptBannerEnd Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: * Matrix Type: Shell system matrix * Preconditioner: Shell PC * Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. For any performance question like this, we need to see the output of your code run with -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view Thanks, Matt Thank you for your time and assistance. Best regards, Yongzhong ----------------------------------------------------------- Yongzhong Li PhD student | Electromagnetics Group Department of Electrical & Computer Engineering University of Toronto https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!bXRYdiBfnJwoBG-4JcUQeLNA6EIE9Kicayx8GqhokY2D2U3eRc7_aHec0m8OquNYeuD4V7UO1xpvKA3PLMrZ5KTETKHhObj1JRo$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bXRYdiBfnJwoBG-4JcUQeLNA6EIE9Kicayx8GqhokY2D2U3eRc7_aHec0m8OquNYeuD4V7UO1xpvKA3PLMrZ5KTETKHhjKQ2DuE$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bXRYdiBfnJwoBG-4JcUQeLNA6EIE9Kicayx8GqhokY2D2U3eRc7_aHec0m8OquNYeuD4V7UO1xpvKA3PLMrZ5KTETKHhjKQ2DuE$ -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bXRYdiBfnJwoBG-4JcUQeLNA6EIE9Kicayx8GqhokY2D2U3eRc7_aHec0m8OquNYeuD4V7UO1xpvKA3PLMrZ5KTETKHhjKQ2DuE$ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: petsc_log_comparison.txt URL: From balay.anl at fastmail.org Fri Jun 28 00:51:55 2024 From: balay.anl at fastmail.org (Satish Balay) Date: Fri, 28 Jun 2024 00:51:55 -0500 (CDT) Subject: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin In-Reply-To: <365c3d40-0f77-1158-1759-bb4c4e2b1dda@fastmail.org> References: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> <21e32b88-aed2-a618-3e3c-dca47c6bc456@fastmail.org> <5627D31E-5225-47CA-B337-A08E74C29D4A@gmail.com> <552dde2a-782a-5238-4897-18736ac9e94a@fastmail.org> <7620557F-4CB0-4E6A-91AF-B3C47DC1BCDD@gmail.com> <365c3d40-0f77-1158-1759-bb4c4e2b1dda@fastmail.org> Message-ID: <9d7974dd-22ba-49e8-d96d-d69cba5653bd@fastmail.org> An HTML attachment was scrubbed... URL: From sebastian.blauth at itwm.fraunhofer.de Fri Jun 28 03:05:26 2024 From: sebastian.blauth at itwm.fraunhofer.de (Blauth, Sebastian) Date: Fri, 28 Jun 2024 08:05:26 +0000 Subject: [petsc-users] Question regarding naming of fieldsplit splits Message-ID: Hello everyone, I have a question regarding the naming convention using PETSc?s PCFieldsplit. I have been following https://lists.mcs.anl.gov/pipermail/petsc-users/2019-January/037262.html to create a DMShell with FEniCS in order to customize PCFieldsplit for my application. I am using the following options, which work nicely for me: -ksp_type fgmres -pc_type fieldsplit -pc_fieldsplit_0_fields 0, 1 -pc_fieldsplit_1_fields 2 -pc_fieldsplit_type additive -fieldsplit_0_ksp_type fgmres -fieldsplit_0_pc_type fieldsplit -fieldsplit_0_pc_fieldsplit_type schur -fieldsplit_0_pc_fieldsplit_schur_fact_type full -fieldsplit_0_pc_fieldsplit_schur_precondition selfp -fieldsplit_0_fieldsplit_u_ksp_type preonly -fieldsplit_0_fieldsplit_u_pc_type lu -fieldsplit_0_fieldsplit_p_ksp_type cg -fieldsplit_0_fieldsplit_p_ksp_rtol 1e-14 -fieldsplit_0_fieldsplit_p_ksp_atol 1e-30 -fieldsplit_0_fieldsplit_p_pc_type icc -fieldsplit_0_ksp_rtol 1e-14 -fieldsplit_0_ksp_atol 1e-30 -fieldsplit_0_ksp_monitor_true_residual -fieldsplit_c_ksp_type preonly -fieldsplit_c_pc_type lu -ksp_view Note that this is just an academic example (sorry for the low solver tolerances) to test the approach, consisting of a Stokes equation and some concentration equation (which is not even coupled to Stokes, just for testing). Completely analogous to https://lists.mcs.anl.gov/pipermail/petsc-users/2019-January/037262.html, I translate my IS?s to a PETSc Section, which is then supplied to a DMShell and assigned to a KSP. I am not so familiar with the code or how / why this works, but it seems to do so perfectly. I name my sections with petsc4py using section.setFieldName(0, "u") section.setFieldName(1, "p") section.setFieldName(2, "c") However, this is also reflected in the way I can access the fieldsplit options from the command line. My question is: Is there any way of not using the FieldNames specified in python but use the index of the field as defined with ?-pc_fieldsplit_0_fields 0, 1? and ?-pc_fieldsplit_1_fields 2?, i.e., instead of the prefix ?fieldsplit_0_fieldsplit_u? 
I want to write ?fieldsplit_0_fieldsplit_0?, instead of ?fieldsplit_0_fieldsplit_p? I want to use ?fieldsplit_0_fieldsplit_1?, and instead of ?fieldsplit_c? I want to use ?fieldsplit_1?. Just changing the names of the fields to section.setFieldName(0, "0") section.setFieldName(1, "1") section.setFieldName(2, "2") does obviously not work as expected, as it works for velocity and pressure, but not for the concentration ? the prefix there is then ?fieldsplit_2? and not ?fieldsplit_1?. In the docs, I have found https://petsc.org/main/manualpages/PC/PCFieldSplitSetFields/ which seems to suggest that the fieldname can potentially be supplied, but I don?t see how to do so from the command line. Also, for the sake of completeness, I attach the output of the solve with ?-ksp_view? below. Thanks a lot in advance and best regards, Sebastian The output of ksp_view is the following: KSP Object: 1 MPI processes type: fgmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-11, divergence=10000. right preconditioning using UNPRECONDITIONED norm type for convergence test PC Object: 1 MPI processes type: fieldsplit FieldSplit with ADDITIVE composition: total splits = 2 Solver info for each split is in the following KSP objects: Split number 0 Defined by IS KSP Object: (fieldsplit_0_) 1 MPI processes type: fgmres restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement happy breakdown tolerance 1e-30 maximum iterations=10000, initial guess is zero tolerances: relative=1e-14, absolute=1e-30, divergence=10000. right preconditioning using UNPRECONDITIONED norm type for convergence test PC Object: (fieldsplit_0_) 1 MPI processes type: fieldsplit FieldSplit with Schur preconditioner, factorization FULL Preconditioner for the Schur complement formed from Sp, an assembled approximation to S, which uses A00's diagonal's inverse Split info: Split number 0 Defined by IS Split number 1 Defined by IS KSP solver for A00 block KSP Object: (fieldsplit_0_fieldsplit_u_) 1 MPI processes type: preonly maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (fieldsplit_0_fieldsplit_u_) 1 MPI processes type: lu out-of-place factorization tolerance for zero pivot 2.22045e-14 matrix ordering: nd factor fill ratio given 5., needed 3.92639 Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=4290, cols=4290 package used to perform factorization: petsc total: nonzeros=375944, allocated nonzeros=375944 using I-node routines: found 2548 nodes, limit used is 5 linear system matrix = precond matrix: Mat Object: (fieldsplit_0_fieldsplit_u_) 1 MPI processes type: seqaij rows=4290, cols=4290 total: nonzeros=95748, allocated nonzeros=95748 total number of mallocs used during MatSetValues calls=0 using I-node routines: found 3287 nodes, limit used is 5 KSP solver for S = A11 - A10 inv(A00) A01 KSP Object: (fieldsplit_0_fieldsplit_p_) 1 MPI processes type: cg maximum iterations=10000, initial guess is zero tolerances: relative=1e-14, absolute=1e-30, divergence=10000. 
left preconditioning using PRECONDITIONED norm type for convergence test PC Object: (fieldsplit_0_fieldsplit_p_) 1 MPI processes type: icc out-of-place factorization 0 levels of fill tolerance for zero pivot 2.22045e-14 using Manteuffel shift [POSITIVE_DEFINITE] matrix ordering: natural factor fill ratio given 1., needed 1. Factored matrix follows: Mat Object: 1 MPI processes type: seqsbaij rows=561, cols=561 package used to perform factorization: petsc total: nonzeros=5120, allocated nonzeros=5120 block size is 1 linear system matrix followed by preconditioner matrix: Mat Object: (fieldsplit_0_fieldsplit_p_) 1 MPI processes type: schurcomplement rows=561, cols=561 Schur complement A11 - A10 inv(A00) A01 A11 Mat Object: (fieldsplit_0_fieldsplit_p_) 1 MPI processes type: seqaij rows=561, cols=561 total: nonzeros=3729, allocated nonzeros=3729 total number of mallocs used during MatSetValues calls=0 not using I-node routines A10 Mat Object: 1 MPI processes type: seqaij rows=561, cols=4290 total: nonzeros=19938, allocated nonzeros=19938 total number of mallocs used during MatSetValues calls=0 not using I-node routines KSP of A00 KSP Object: (fieldsplit_0_fieldsplit_u_) 1 MPI processes type: preonly maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. left preconditioning using NONE norm type for convergence test PC Object: (fieldsplit_0_fieldsplit_u_) 1 MPI processes type: lu out-of-place factorization tolerance for zero pivot 2.22045e-14 matrix ordering: nd factor fill ratio given 5., needed 3.92639 Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=4290, cols=4290 package used to perform factorization: petsc total: nonzeros=375944, allocated nonzeros=375944 using I-node routines: found 2548 nodes, limit used is 5 linear system matrix = precond matrix: Mat Object: (fieldsplit_0_fieldsplit_u_) 1 MPI processes type: seqaij rows=4290, cols=4290 total: nonzeros=95748, allocated nonzeros=95748 total number of mallocs used during MatSetValues calls=0 using I-node routines: found 3287 nodes, limit used is 5 A01 Mat Object: 1 MPI processes type: seqaij rows=4290, cols=561 total: nonzeros=19938, allocated nonzeros=19938 total number of mallocs used during MatSetValues calls=0 using I-node routines: found 3287 nodes, limit used is 5 Mat Object: 1 MPI processes type: seqaij rows=561, cols=561 total: nonzeros=9679, allocated nonzeros=9679 total number of mallocs used during MatSetValues calls=0 not using I-node routines linear system matrix = precond matrix: Mat Object: (fieldsplit_0_) 1 MPI processes type: seqaij rows=4851, cols=4851 total: nonzeros=139353, allocated nonzeros=139353 total number of mallocs used during MatSetValues calls=0 using I-node routines: found 3830 nodes, limit used is 5 Split number 1 Defined by IS KSP Object: (fieldsplit_c_) 1 MPI processes type: preonly maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000. 
left preconditioning using NONE norm type for convergence test PC Object: (fieldsplit_c_) 1 MPI processes type: lu out-of-place factorization tolerance for zero pivot 2.22045e-14 matrix ordering: nd factor fill ratio given 5., needed 4.24323 Factored matrix follows: Mat Object: 1 MPI processes type: seqaij rows=561, cols=561 package used to perform factorization: petsc total: nonzeros=15823, allocated nonzeros=15823 not using I-node routines linear system matrix = precond matrix: Mat Object: (fieldsplit_c_) 1 MPI processes type: seqaij rows=561, cols=561 total: nonzeros=3729, allocated nonzeros=3729 total number of mallocs used during MatSetValues calls=0 not using I-node routines linear system matrix = precond matrix: Mat Object: 1 MPI processes type: seqaij rows=5412, cols=5412 total: nonzeros=190416, allocated nonzeros=190416 total number of mallocs used during MatSetValues calls=0 using I-node routines: found 3833 nodes, limit used is 5 -- Dr. Sebastian Blauth Fraunhofer-Institut f?r Techno- und Wirtschaftsmathematik ITWM Abteilung Transportvorg?nge Fraunhofer-Platz 1, 67663 Kaiserslautern Telefon: +49 631 31600-4968 sebastian.blauth at itwm.fraunhofer.de https://www.itwm.fraunhofer.de -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 7943 bytes Desc: not available URL: From bruchon at emse.fr Fri Jun 28 10:30:53 2024 From: bruchon at emse.fr (Julien BRUCHON) Date: Fri, 28 Jun 2024 17:30:53 +0200 (CEST) Subject: [petsc-users] Trying to develop my own Krylov solver In-Reply-To: <791E8646-7C37-4D28-A939-F662E3881A29@petsc.dev> References: <182097699.28951205.1719331465252.JavaMail.zimbra@emse.fr> <791E8646-7C37-4D28-A939-F662E3881A29@petsc.dev> Message-ID: <672314548.30198011.1719588653570.JavaMail.zimbra@emse.fr> Thank you for your answers, it woks well. Julien De: "Barry Smith" ?: "Matthew Knepley" Cc: "Julien BRUCHON" , "petsc-users" Envoy?: Mardi 25 Juin 2024 19:14:52 Objet: Re: [petsc-users] Trying to develop my own Krylov solver Make sure that you are using the latest PETSc if at all possible Also copy over a makefile from the cg directory (you do not need to edit any makefiles) You also need to add it to KSPRegisterAll() You will need to do make clean before running make all to compiler your new code. Barry On Jun 25, 2024, at 1:11 PM, Matthew Knepley wrote: This Message Is From an External Sender This message came from outside your organization. On Tue, Jun 25, 2024 at 12:05 PM Julien BRUCHON via petsc-users < [ mailto:petsc-users at mcs.anl.gov | petsc-users at mcs.anl.gov ] > wrote: BQ_BEGIN This Message Is From an External Sender This message came from outside your organization. Hi, Based on 'cg.c', I'm trying to develop my own Krylov solver (a projected conjugate gradient). I want to integrate this into my C++ code, where I already have an interface for PETSC which works well. However, I have the following questions : - Where am I sensed to put my 'cg_projected.c' and 'pcgimpl.h' files? Should they go in a directory petsc/src/ksp/ksp/impls/pcg/? If so, how do I compile that? Is it simply by adding this directory to the Makefile in petsc/src/ksp/ksp/impls/? Yes. Thanks, Matt BQ_BEGIN - I have also tried the basic approach of putting these two files in directories of my own C++ code and compiling. 
However, I have this error at the link edition: [100%] Linking CXX shared library [ https://urldefense.us/v3/__http://libcoeur.so/__;!!G_uCfscf7eWS!eJ5wDIF503lmkSoc9ckRL6Hf7-ac1BL99ksSTDaYQ0VLXGrgZJzkMOQhSpDEOSQpkux5LxL9PqMV1b4f3-nU3A$ | libcoeur.so ] /usr/bin/ld: src/solvers/libsolvers.a(cg_projected.c.o): warning: relocation against `petscstack' in read-only section `.text' /usr/bin/ld: src/solvers/libsolvers.a(cg_projected.c.o): relocation R_X86_64_PC32 against symbol `petscstack' can not be used when making a shared object; recompil? avec -fPIC /usr/bin/ld : ?chec de l'?dition de liens finale : bad value collect2: error: ld returned 1 exit status make[2]: *** [CMakeFiles/coeur.dir/build.make:121 : [ https://urldefense.us/v3/__http://libcoeur.so/__;!!G_uCfscf7eWS!eJ5wDIF503lmkSoc9ckRL6Hf7-ac1BL99ksSTDaYQ0VLXGrgZJzkMOQhSpDEOSQpkux5LxL9PqMV1b4f3-nU3A$ | libcoeur.so ] ] Erreur 1 make[1]: *** [CMakeFiles/Makefile2:286 : CMakeFiles/coeur.dir/all] Erreur 2 make: *** [Makefile:91 : all] Erreur 2 Could you please tell me what is the right way to proceed? Thank you, Julien -- Julien Bruchon Professeur IMT - Responsable du d?partement MPE LGF - UMR CNRS 5307 - [ https://urldefense.us/v3/__https://www.mines-stetienne.fr/lgf/__;!!G_uCfscf7eWS!dQgv-IRWC7OgdDf1X9Oew4nHSgleq2ty0AszuRPj70bBiFeCcT4RibQVAvv6FFeD081W1yY8IczRHAHopA0crg$ | https://urldefense.us/v3/__https://www.mines-stetienne.fr/lgf/__;!!G_uCfscf7eWS!eJ5wDIF503lmkSoc9ckRL6Hf7-ac1BL99ksSTDaYQ0VLXGrgZJzkMOQhSpDEOSQpkux5LxL9PqMV1b7q0tC0Xw$ ] Mines Saint-?tienne, une ?cole de l'Institut Mines-T?l?com [ https://urldefense.us/v3/__https://gitlab.emse.fr/bruchon/Coeur/-/wikis/home__;!!G_uCfscf7eWS!dQgv-IRWC7OgdDf1X9Oew4nHSgleq2ty0AszuRPj70bBiFeCcT4RibQVAvv6FFeD081W1yY8IczRHAG6pz3FrQ$ | Librairie ?l?ments Finis Coeur ] 0477420072 BQ_END -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener [ https://urldefense.us/v3/__http://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZSMOgmxB-aRx34PmTC3s7ZkDC-zT09xxpmLjhj_vx8oVkTvDSORUOeoTe8ZdEFCHVCUxSrs3eOz346S6E1L_$ | https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!eJ5wDIF503lmkSoc9ckRL6Hf7-ac1BL99ksSTDaYQ0VLXGrgZJzkMOQhSpDEOSQpkux5LxL9PqMV1b50sR16ZQ$ ] BQ_END -- Julien Bruchon Professeur IMT - Responsable du d?partement MPE LGF - UMR CNRS 5307 - [ https://urldefense.us/v3/__https://www.mines-stetienne.fr/lgf/__;!!G_uCfscf7eWS!eJ5wDIF503lmkSoc9ckRL6Hf7-ac1BL99ksSTDaYQ0VLXGrgZJzkMOQhSpDEOSQpkux5LxL9PqMV1b7q0tC0Xw$ | https://urldefense.us/v3/__https://www.mines-stetienne.fr/lgf/__;!!G_uCfscf7eWS!eJ5wDIF503lmkSoc9ckRL6Hf7-ac1BL99ksSTDaYQ0VLXGrgZJzkMOQhSpDEOSQpkux5LxL9PqMV1b7q0tC0Xw$ ] Mines Saint-?tienne, une ?cole de l'Institut Mines-T?l?com [ https://urldefense.us/v3/__https://gitlab.emse.fr/bruchon/Coeur/-/wikis/home__;!!G_uCfscf7eWS!eJ5wDIF503lmkSoc9ckRL6Hf7-ac1BL99ksSTDaYQ0VLXGrgZJzkMOQhSpDEOSQpkux5LxL9PqMV1b4DLBtKFA$ | Librairie ?l?ments Finis Coeur ] 0477420072 -------------- next part -------------- An HTML attachment was scrubbed... 
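For anyone following the same route, here is a minimal sketch of registering a user-defined KSP from application code once the source compiles (the type name "pcgproj" and the creation routine KSPCreate_PCGProj are placeholders for whatever the projected-CG implementation defines; KSPRegister() itself is the documented PETSc entry point). For the shared-library error above, the object files simply need to be compiled with -fPIC, e.g. by enabling POSITION_INDEPENDENT_CODE on the CMake target that builds them.

#include <petscksp.h>

/* Creation routine implemented in cg_projected.c (placeholder name) */
PETSC_EXTERN PetscErrorCode KSPCreate_PCGProj(KSP);

int main(int argc, char **argv)
{
  KSP ksp;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  /* Make the new implementation selectable as -ksp_type pcgproj */
  PetscCall(KSPRegister("pcgproj", KSPCreate_PCGProj));
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetType(ksp, "pcgproj"));
  /* ... KSPSetOperators(), KSPSetFromOptions(), KSPSolve() ... */
  PetscCall(KSPDestroy(&ksp));
  PetscCall(PetscFinalize());
  return 0;
}

Registering this way avoids editing KSPRegisterAll(); adding the files to the PETSc source tree, as discussed above, is only needed if the solver should become a built-in type.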
URL: From bsmith at petsc.dev Fri Jun 28 11:35:10 2024 From: bsmith at petsc.dev (Barry Smith) Date: Fri, 28 Jun 2024 12:35:10 -0400 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> <4D1A8BC2-66AD-4627-84B7-B12A18BA0983@petsc.dev> Message-ID: <55B35581-80F7-482D-B53A-35FCAF907554@petsc.dev> Are you running with -vec_maxpy_use_gemv ? > On Jun 28, 2024, at 1:46?AM, Yongzhong Li wrote: > > Thanks all for your help!!! > > I think I find the issues. I am compiling a large CMake project that relies on many external libraries (projects). Previously, I used OpenBLAS as the BLAS for all the dependencies including PETSc. After I switched to Intel MKL for PETSc, I still kept the OpenBLAS and use it as the BLAS for all the other dependencies. I think somehow even when I specify the blas-lapack-dir to the MKLROOT when PETSc is configured, the actual program still use OpenBLAS as the BLAS for some PETSc functions, such as VecMDot() and VecMAXPY(), so that?s why I didn?t see any MKL verbose during the KSPSolve(). Now I remove the OpenBLAS and use Intel MKL as the BLAS for all the dependencies. The issue is resolved, I can clearly see MKL routines are called when KSP GMRES is running. > > Back to my original questions, my goal is to achieve good parallelization efficiency for KSP GMRES Solve. As I use multithreading-enabled MKL spmv routines, the wall time for MatMult/MatMultAdd() has been greatly reduced. However,the KSPGMRESOrthog and MatSolve in PCApply still take over 50% of solving time and can?t benefit from multithreading. After I fixed the issue I mentioned, I found I got around 15% time reduced because of more efficient VecMDot() calls. I attach a petsc log comparison for your reference (same settings, only difference is whether use MKL BLAS or not), you can see the percentage of VecMDot() is reduced. However, here comes the interesting part, VecMAXPY() didn?t benefit from MKL BLAS, it still takes almost 40% of solution when I use 64 MKL Threads, which is a lot for my program. And if I multiple this percentage with the actual wall time against different # of threads, it stays the same. Then I used ex2k benchmark to verify what I found. 
Here is the result, > > $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 5 -test_name VecMAXPY > Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) > -------------------------------------------------------------------------- > 128 0.4 0.9 2.4 8.8 > 256 0.3 1.1 3.5 13.3 > 512 0.5 4.4 6.7 26.5 > 1024 0.9 4.8 13.3 51.0 > 2048 3.5 12.3 37.1 94.7 > 4096 4.3 24.5 73.6 179.6 > 8192 6.3 48.7 98.9 380.8 > 16384 9.3 99.2 200.2 774.0 > 32768 30.6 155.4 421.2 1662.9 > 65536 101.2 269.4 827.4 3565.0 > 131072 206.9 551.0 1829.0 7580.5 > 262144 450.2 1251.9 3986.2 15525.6 > 524288 1322.1 2901.7 8567.1 31840.0 > 1048576 2788.6 6190.6 16394.7 63514.9 > 2097152 5534.8 12619.9 35427.4 130064.5 > $ MKL_NUM_THREADS=8 ./ex2k -n 15 -m 5 -test_name VecMAXPY > Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) > -------------------------------------------------------------------------- > 128 0.3 0.7 2.4 8.8 > 256 0.3 1.1 3.6 13.5 > 512 0.5 4.4 6.8 26.4 > 1024 0.9 4.8 13.6 50.5 > 2048 7.6 12.2 36.5 95.0 > 4096 8.5 25.7 72.4 182.6 > 8192 11.9 48.5 103.7 383.7 > 16384 12.8 97.7 203.7 785.0 > 32768 11.2 148.5 421.9 1681.5 > 65536 15.5 271.2 843.8 3613.7 > 131072 34.3 564.7 1905.2 7558.8 > 262144 106.4 1334.5 4002.8 15458.3 > 524288 217.2 2858.4 8407.9 31303.7 > 1048576 701.5 6060.6 16947.3 64118.5 > 2097152 1769.7 13218.3 36347.3 131062.9 > > It stays the same, no benefit from multithreading BLAS!! Unlike what I found for VecMdot(), where I did see speed up for more #of threads. Then, I dig deeper. I learned that for VecMDot(), it calls ZGEMV while for VecMAXPY(), it calls ZAXPY. This observation seems to indicate that ZAXPY is not benefiting from MKL threads. > > My question is do you know why ZAXPY is not multithreaded? From my perspective, VecMDot() and VecMAXPY() are very similar operations, the only difference is whether we need to scale the vectors to be multiplied or not. I think you have mentioned that recently you did some optimization to these two routines, from my above results and observations, are these aligned with your expectations? Could we further optimize the codes to get more parallelization efficiency in my case? > > And another question, can MatSolve() in KSPSolve be multithreaded? Would MUMPS help? > > Thank you and regards, > Yongzhong > > From: Junchao Zhang > > Sent: Thursday, June 27, 2024 11:10 AM > To: Yongzhong Li > > Cc: Barry Smith >; petsc-users at mcs.anl.gov > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > How big is the n when you call PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione))? n is the vector length in VecMDot. > it is strange with MKL_VERBOSE=1 you did not see MKL_VERBOSE ZGEMV..., since the code did call gemv. Perhaps you need to double check your spelling etc. > > If you also use ex2k, and potentially modify Ms[] and Ns[] to match the sizes in your code, to see if there is a speedup with more threads. > > --Junchao Zhang > > > On Thu, Jun 27, 2024 at 9:39?AM Yongzhong Li > wrote: > Mostly 3, maximum 7, but definitely hit the point when m > 1, I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. 
> > ZjQcmQRYFpfptBannerEnd > Mostly 3, maximum 7, but definitely hit the point when m > 1, > > I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple times > > From: Barry Smith > > Date: Thursday, June 27, 2024 at 1:12?AM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > How big are the m's getting in your code? > > > > On Jun 27, 2024, at 12:40?AM, Yongzhong Li > wrote: > > Hi Barry, I used gdb to debug my program, set a breakpoint to VecMultiDot_Seq_GEMV function. I did see when I debug this function, it will call BLAS (but not always, only if m > 1), as shown below. However, I still didn?t see any MKL outputs even if I set MKLK_VERBOSE=1. > > (gdb) > 550 PetscCall(VecRestoreArrayRead(yin[i], &yfirst)); > (gdb) > 553 m = j - i; > (gdb) > 554 if (m > 1) { > (gdb) > 555 PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above > (gdb) > 556 PetscScalar one = 1, zero = 0; > (gdb) > 558 PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); > (gdb) s > PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> "VecMultiDot_Seq_GEMV", > file=0x7ffff68a1078 "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c") > at /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106 > 106 if (!TRdebug) return PETSC_SUCCESS; > (gdb) > 154 } > > Am I not using MKL BLAS, is that why I didn?t see multithreading speed up for KSPGMRESOrthog? What do you think could be the potential reasons? Is there any silent mode that will possibly affect the MKL Verbose. > > Thank you and best regards, > Yongzhong > > From: Barry Smith > > Date: Wednesday, June 26, 2024 at 8:15?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > if (m > 1) { > PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above > PetscScalar one = 1, zero = 0; > > PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); > PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0))); > > The call to BLAS above is where it uses MKL. > > > > > On Jun 26, 2024, at 6:59?PM, Yongzhong Li > wrote: > > Hi Barry, I am looking into the source codes of VecMultiDot_Seq_GEMV https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!ZzgBh2JgD1rvtdQkjydC8NGYB2YAeHfQdv90T8uDT7ySzViGllSABORzXWWSchdrbAhUXSbYu2hOMZ4gFY08IcA$ > Can I ask which lines of codes suggest the use of intel mkl? > > Thanks, > Yongzhong > > From: Barry Smith > > Date: Wednesday, June 26, 2024 at 10:30?AM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > In a debug version of PETSc run your application in a debugger and put a break point in VecMultiDot_Seq_GEMV. Then next through the code from that point to see what decision it makes about using dgemv() to see why it is not getting into the Intel code. 
> > > > > On Jun 25, 2024, at 11:19?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Junchao, thank you for your help for these benchmarking test! > > I check out to petsc/main and did a few things to verify from my side, > > 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node. The results are as follow, > > $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 14.5 1.2 1.8 5.2 > 256 1.5 0.9 1.6 4.7 > 512 2.7 2.8 6.1 13.2 > 1024 4.0 4.0 9.3 16.4 > 2048 7.4 7.3 11.3 39.3 > 4096 14.2 13.9 19.1 93.4 > 8192 28.8 26.3 25.4 31.3 > 16384 54.1 25.8 26.7 33.8 > 32768 109.8 25.7 24.2 56.0 > 65536 220.2 24.4 26.5 89.0 > 131072 424.1 31.5 36.1 149.6 > 262144 898.1 37.1 53.9 286.1 > 524288 1754.6 48.7 100.3 1122.2 > 1048576 3645.8 86.5 347.9 2950.4 > 2097152 7371.4 308.7 1440.6 6874.9 > > $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 14.9 1.2 1.9 5.2 > 256 1.5 1.0 1.7 4.7 > 512 2.7 2.8 6.1 12.0 > 1024 3.9 4.0 9.3 16.8 > 2048 7.4 7.3 10.4 41.3 > 4096 14.0 13.8 18.6 84.2 > 8192 27.0 21.3 43.8 177.5 > 16384 54.1 34.1 89.1 330.4 > 32768 110.4 82.1 203.5 781.1 > 65536 213.0 191.8 423.9 1696.4 > 131072 428.7 360.2 934.0 4080.0 > 262144 883.4 723.2 1745.6 10120.7 > 524288 1817.5 1466.1 4751.4 23217.2 > 1048576 3611.0 3796.5 11814.9 48687.7 > 2097152 7401.9 10592.0 27543.2 106565.4 > > I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like > > MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca > > From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. > > However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. > > I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. > > Thank you, > Yongzhong > > From: Junchao Zhang > > Date: Tuesday, June 25, 2024 at 6:34?PM > To: Matthew Knepley > > Cc: Yongzhong Li >, Pierre Jolivet >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > Hi, Yongzhong, > Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? > petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. 
I then configured petsc with openblas, I did see better performance with more threads > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.1 > 256 1.8 2.7 7.0 > 512 2.1 3.1 8.6 > 1024 2.7 4.0 12.3 > 2048 3.8 6.3 28.0 > 4096 6.1 10.6 42.4 > 8192 10.9 21.8 79.5 > 16384 21.2 39.4 149.6 > 32768 45.9 75.7 224.6 > 65536 142.2 215.8 732.1 > 131072 169.1 233.2 1729.4 > 262144 367.5 830.0 4159.2 > 524288 999.2 1718.1 8538.5 > 1048576 2113.5 4082.1 18274.8 > 2097152 5392.6 10273.4 43273.4 > > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.0 > 256 1.8 2.7 15.0 > 512 2.1 9.0 16.6 > 1024 2.6 8.7 16.1 > 2048 7.7 10.3 20.5 > 4096 9.9 11.4 25.9 > 8192 14.5 22.1 39.6 > 16384 25.1 27.8 67.8 > 32768 44.7 95.7 91.5 > 65536 82.1 156.8 165.1 > 131072 194.0 335.1 341.5 > 262144 388.5 380.8 612.9 > 524288 1046.7 967.1 1653.3 > 1048576 1997.4 2169.0 4034.4 > 2097152 5502.9 5787.3 12608.1 > > The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average speedup depends on components. So I suggest you run ex2k to see in your environment whether oneMKL can speedup the kernels. > > --Junchao Zhang > > > On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: > Let me run some examples on our end to see whether the code calls expected functions. > > --Junchao Zhang > > > On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: > On Mon, Jun 24, 2024 at 11:?21 AM Yongzhong Li wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li > wrote: > Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? Thank you, Yongzhong From:? > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? > > We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!ZzgBh2JgD1rvtdQkjydC8NGYB2YAeHfQdv90T8uDT7ySzViGllSABORzXWWSchdrbAhUXSbYu2hOMZ4gNuYl-cE$ > > The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference. 
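As a quick check of how much bandwidth a node actually provides, PETSc ships a streams benchmark; assuming a standard PETSc source tree, something like

    $ cd $PETSC_DIR
    $ make streams NPMAX=8

prints the measured memory bandwidth as the number of processes grows (the exact make options may differ between PETSc versions). That measured bandwidth is an upper bound on the speedup any of these memory-bound kernels can reach, no matter how many cores or threads are used.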
> > Thanks, > > Matt > > Thank you, > Yongzhong > > From: Pierre Jolivet > > Date: Sunday, June 23, 2024 at 12:41?AM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > > > On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) > > --> Setting up matrix-vector products... > > Mat Object: 1 MPI process > type: seqaijmkl > rows=16490, cols=35937 > total: nonzeros=128496, allocated nonzeros=128496 > total number of mallocs used during MatSetValues calls=0 > not using I-node routines > Mat Object: 1 MPI process > type: seqaijmkl > rows=16490, cols=35937 > total: nonzeros=128496, allocated nonzeros=128496 > total number of mallocs used during MatSetValues calls=0 > not using I-node routines > > --> Solving the system... > > Excitation 1 of 1... > > ================================================ > Iterative solve completed in 7435 ms. > CONVERGED: rtol. > Iterations: 72 > Final relative residual norm: 9.22287e-07 > ================================================ > [CPU TIME] System solution: 2.27160000e+02 s. > [WALL TIME] System solution: 7.44387218e+00 s. > > However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? > > SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. > > Thanks, > Pierre > > > Thanks, > Yongzhong > > > From: Matthew Knepley > > Date: Saturday, June 22, 2024 at 5:56?PM > To: Yongzhong Li > > Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: > MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. 
> > ZjQcmQRYFpfptBannerEnd > MKL_VERBOSE=1 ./ex1 > > matrix nonzeros = 100, allocated nonzeros = 100 > MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread > MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us 
CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > > Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? > > Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. > > Thanks, > > Matt > > > Thanks, > Yongzhong > > From: Junchao Zhang > > Date: Saturday, June 22, 2024 at 9:40?AM > To: Yongzhong Li > > Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used > $ cd src/mat/tests > $ make ex1 > $ MKL_VERBOSE=1 ./ex1 > > --Junchao Zhang > > > On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: > I am using > > export MKL_VERBOSE=1 > ./xx > > in the bash file, do I have to use - ksp_converged_reason? > > Thanks, > Yongzhong > > From: Pierre Jolivet > > Date: Friday, June 21, 2024 at 1:47?PM > To: Yongzhong Li > > Cc: Junchao Zhang >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > How do you set the variable? > > $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason > MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread > MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > [...] > > On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. 
> Hello all, > > I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? > > Best, > Yongzhong > > > From: Pierre Jolivet > > Date: Friday, June 21, 2024 at 1:36?AM > To: Junchao Zhang > > Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > > Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? pierre at joliv.et ????????????????? > > > On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > I remember there are some MKL env vars to print MKL routines called. > > The environment variable is MKL_VERBOSE > > Thanks, > Pierre > > Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up > > --Junchao Zhang > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: > This Message Is From an External Sender > This message came from outside your organization. > > Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, > > // Static variable to keep track of the stage counter > static int stageCounter = 1; > > // Generate a unique stage name > std::ostringstream oss; > oss << "Stage " << stageCounter << " of Code"; > std::string stageName = oss.str(); > > // Register the stage > PetscLogStage stagenum; > > PetscLogStageRegister(stageName.c_str(), &stagenum); > PetscLogStagePush(stagenum); > > KSPSolve(*ksp_ptr, b, x); > > PetscLogStagePop(); > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. > > To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. > > From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > From: Barry Smith > > Date: Friday, June 14, 2024 at 11:36?AM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > I am a bit confused. 
Without the initial guess computation, there are still a bunch of events I don't understand > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > Finally there are a huge number of > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? > > The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. > > Barry > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically > > KSPGuess Object: 1 MPI process > type: fischer > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? > > Thank you! > Yongzhong > > From: Barry Smith > > Date: Thursday, June 13, 2024 at 2:14?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > > Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? > > Thanks > > Barry > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Matt, > > I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. > > Thanks! 
> Yongzhong > > From: Matthew Knepley > > Date: Wednesday, June 12, 2024 at 6:46?PM > To: Yongzhong Li > > Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > > Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue > > ????????? knepley at gmail.com ????????????????? > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: > Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Dear PETSc?s developers, > I hope this email finds you well. > I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. > For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: > Matrix Type: Shell system matrix > Preconditioner: Shell PC > Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled > I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. > Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. > > For any performance question like this, we need to see the output of your code run with > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > Thanks, > > Matt > > Thank you for your time and assistance. > Best regards, > Yongzhong > ----------------------------------------------------------- > Yongzhong Li > PhD student | Electromagnetics Group > Department of Electrical & Computer Engineering > University of Toronto > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!ZzgBh2JgD1rvtdQkjydC8NGYB2YAeHfQdv90T8uDT7ySzViGllSABORzXWWSchdrbAhUXSbYu2hOMZ4gzRTNsTo$ > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZzgBh2JgD1rvtdQkjydC8NGYB2YAeHfQdv90T8uDT7ySzViGllSABORzXWWSchdrbAhUXSbYu2hOMZ4guIGTpCw$ > > > > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
> -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZzgBh2JgD1rvtdQkjydC8NGYB2YAeHfQdv90T8uDT7ySzViGllSABORzXWWSchdrbAhUXSbYu2hOMZ4guIGTpCw$ > > > > -- > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!ZzgBh2JgD1rvtdQkjydC8NGYB2YAeHfQdv90T8uDT7ySzViGllSABORzXWWSchdrbAhUXSbYu2hOMZ4guIGTpCw$ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Fri Jun 28 12:20:16 2024 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Fri, 28 Jun 2024 12:20:16 -0500 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: <55B35581-80F7-482D-B53A-35FCAF907554@petsc.dev> References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> <4D1A8BC2-66AD-4627-84B7-B12A18BA0983@petsc.dev> <55B35581-80F7-482D-B53A-35FCAF907554@petsc.dev> Message-ID: Hi, Yongzhong, It is great to see you have made such good progress. Barry is right, you need -vec_maxpy_use_gemv 1. It's my mistake for not mentioning it earlier. But even with that, there are still problems. petsc tries to optimize VecMDot/MAXPY with BLAS GEMV, with hope that vendors' BLAS library would be highly optimized on that. However, we found though they were good with VecMDot, but not with VecMAXPY. So by default in petsc, we disabled the GEMV optimization for VecMAXPY. One can use -vec_maxpy_use_gemv 1 to turn on it. I turned it on and tested VecMAXPY with ex2k and MKL, but failed to see any improvement with multiple threads. I could not understand why MKL is so bad on it. You can try it yourself in your environment. Without the GEMV optimization, VecMAXPY() is implemented by petsc with a batch of PetscKernelAXPY() kernels, which contain simple for loops but not OpenMP parallelized (since petsc does not support OpenMP outright) . I added "omp parallel for" pragma in PetscKernelAXPY() kernels, and tested ex2k again with now parallelized petsc. Here is the result. 
$ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 2 -test_name VecMAXPY -vec_maxpy_use_gemv 0
 Vector(N)  VecMAXPY-1  VecMAXPY-3  VecMAXPY-8  VecMAXPY-30  (us)
--------------------------------------------------------------------------
       128        7.0        10.1        21.4        72.7
       256        7.9        12.9        29.5       101.0
       512        9.4        17.2        40.5       136.2
      1024       15.9        27.3        67.5       249.3
      2048       26.5        48.7       139.6       432.7
      4096       47.1        77.3       186.4       710.3
      8192       84.8       152.2       423.9      1580.6
     16384      154.9       298.5       792.1      2889.2
     32768      183.7       338.7       893.9      3436.2
     65536      639.1      1247.8      3219.1     12494.8
    131072     1125.2      1856.2      6843.0     23653.7
    262144     2603.2      4948.4     13259.4     51287.7
    524288     5093.6     10305.0     26451.7     96919.6
   1048576     5898.6     10947.2     45486.4    127352.8
   2097152    11845.4     21912.5     57999.6    331403.4

$ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=16 ./ex2k -n 15 -m 2 -test_name VecMAXPY -vec_maxpy_use_gemv 0
 Vector(N)  VecMAXPY-1  VecMAXPY-3  VecMAXPY-8  VecMAXPY-30  (us)
--------------------------------------------------------------------------
       128       17.0        16.1        31.5       112.9
       256       13.7        16.8        31.2       120.2
       512       14.5        18.1        33.9       129.9
      1024       16.5        21.0        38.5       150.4
      2048       18.5        22.1        41.8       171.4
      4096       21.0        25.4        55.3       212.3
      8192       27.0        30.3        68.6       251.9
     16384       32.2        44.5        93.3       350.5
     32768       45.8        65.0       149.8       558.8
     65536       59.7       102.8       247.1       946.0
    131072      100.7       186.4       485.3      1898.1
    262144      183.4       345.2       922.2      3567.0
    524288      339.6       676.8      1820.7      7530.4
   1048576      662.0      1364.7      3585.3     13969.1
   2097152     1379.7      2788.6      7414.0     28275.3

We can see that VecMAXPY() is easily sped up with multithreading.

For MatSolve, I checked petsc's aijmkl.c and found we do not have an interface to MKL's sparse solver. I checked https://urldefense.us/v3/__https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2023-0/openmp-threaded-functions-and-problems.html__;!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIgTsoHmt$ , but was confused by MKL's list of threaded functions:

 - Direct sparse solver.
 - All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.

So I don't know whether MKL has a threaded sparse solver.

--Junchao Zhang
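The kind of change described above, an OpenMP pragma on the inner AXPY loops, can be sketched as follows. This is a simplified stand-in for illustration only, not the actual PetscKernelAXPY() source; the function name and signature here are made up.

    #include <stdio.h>
    #include <stdlib.h>
    #include <complex.h>
    #include <omp.h>

    /* Sketch of a multi-vector AXPY, y += sum_k alpha[k]*x[k], with the inner
       loop split across OpenMP threads as described above. Illustrative only. */
    static void maxpy_sketch(int n, int nv, const double complex *alpha,
                             double complex *y, const double complex *const *x)
    {
      for (int k = 0; k < nv; k++) {            /* loop over the vectors being added */
        const double complex  a  = alpha[k];
        const double complex *xk = x[k];
        #pragma omp parallel for schedule(static) /* the added pragma: split y across threads */
        for (int i = 0; i < n; i++) y[i] += a * xk[i];
      }
    }

    int main(void)
    {
      enum { N = 1 << 20, NV = 3 };
      double complex *y = calloc(N, sizeof *y);
      double complex *xs[NV];
      double complex  alpha[NV] = {1.0, 2.0 * I, -0.5};
      for (int k = 0; k < NV; k++) {
        xs[k] = malloc(N * sizeof **xs);
        for (int i = 0; i < N; i++) xs[k][i] = (double)i + k * I;
      }
      double t = omp_get_wtime();
      maxpy_sketch(N, NV, alpha, y, (const double complex *const *)xs);
      printf("threads=%d  time=%g s  y[1]=%g%+gi\n", omp_get_max_threads(),
             omp_get_wtime() - t, creal(y[1]), cimag(y[1]));
      for (int k = 0; k < NV; k++) free(xs[k]);
      free(y);
      return 0;
    }

Built with something like "cc -O2 -fopenmp sketch.c -o sketch", varying OMP_NUM_THREADS shows the same pattern as the VecMAXPY timings above: the loop parallelizes trivially, and the scaling is ultimately limited by memory bandwidth rather than by core count.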
I attach a petsc log comparison for your > reference (same settings, only difference is whether use MKL BLAS or not), > you can see the percentage of VecMDot() is reduced. However, here comes the > interesting part, *VecMAXPY() didn?t benefit from MKL BLAS, it still > takes almost 40% of solution when I use 64 MKL Threads*, which is a lot > for my program. And if I multiple this percentage with the actual wall time > against different # of threads, it stays the same. Then I used ex2k > benchmark to verify what I found. Here is the result, > > $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 5 -test_name VecMAXPY > Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) > -------------------------------------------------------------------------- > 128 0.4 0.9 2.4 8.8 > 256 0.3 1.1 3.5 13.3 > 512 0.5 4.4 6.7 26.5 > 1024 0.9 4.8 13.3 51.0 > 2048 3.5 12.3 37.1 94.7 > 4096 4.3 24.5 73.6 179.6 > 8192 6.3 48.7 98.9 380.8 > 16384 9.3 99.2 200.2 774.0 > 32768 30.6 155.4 421.2 1662.9 > 65536 101.2 269.4 827.4 3565.0 > 131072 206.9 551.0 1829.0 7580.5 > 262144 450.2 1251.9 3986.2 15525.6 > 524288 1322.1 2901.7 8567.1 31840.0 > 1048576 2788.6 6190.6 16394.7 63514.9 > 2097152 5534.8 12619.9 35427.4 130064.5 > $ MKL_NUM_THREADS=8 ./ex2k -n 15 -m 5 -test_name VecMAXPY > Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) > -------------------------------------------------------------------------- > 128 0.3 0.7 2.4 8.8 > 256 0.3 1.1 3.6 13.5 > 512 0.5 4.4 6.8 26.4 > 1024 0.9 4.8 13.6 50.5 > 2048 7.6 12.2 36.5 95.0 > 4096 8.5 25.7 72.4 182.6 > 8192 11.9 48.5 103.7 383.7 > 16384 12.8 97.7 203.7 785.0 > 32768 11.2 148.5 421.9 1681.5 > 65536 15.5 271.2 843.8 3613.7 > 131072 34.3 564.7 1905.2 7558.8 > 262144 106.4 1334.5 4002.8 15458.3 > 524288 217.2 2858.4 8407.9 31303.7 > 1048576 701.5 6060.6 16947.3 64118.5 > 2097152 1769.7 13218.3 36347.3 131062.9 > > It stays the same, no benefit from multithreading BLAS!! Unlike what I > found for VecMdot(), where I did see speed up for more #of threads. Then, I > dig deeper. *I learned that for VecMDot(), it calls ZGEMV while for > VecMAXPY(), it calls ZAXPY. This observation seems to indicate that ZAXPY > is not benefiting from MKL threads.* > > My question is *do you know why ZAXPY is not multithreaded*? From my > perspective, VecMDot() and VecMAXPY() are very similar operations, the > only difference is whether we need to scale the vectors to be multiplied or > not. I think you have mentioned that recently you did some optimization to > these two routines*, from my above results and observations, are these > aligned with your expectations*? Could we further optimize the codes to > get more parallelization efficiency in my case? > > *And another question, can MatSolve() in KSPSolve be multithreaded? Would > MUMPS help?* > > Thank you and regards, > Yongzhong > > *From:* Junchao Zhang > *Sent:* Thursday, June 27, 2024 11:10 AM > *To:* Yongzhong Li > *Cc:* Barry Smith ; petsc-users at mcs.anl.gov > *Subject:* Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > How big is the n when you call PetscCallBLAS("BLASgemv", BLASgemv_(trans, > &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione))? n is > the vector length in VecMDot. > it is strange with MKL_VERBOSE=1 you did not see MKL_VERBOSE *ZGEMV..., *since > the code did call gemv. Perhaps you need to double check your spelling etc. 
> > If you also use ex2k, and potentially modify Ms[] and Ns[] to match the > sizes in your code, to see if there is a speedup with more threads. > > --Junchao Zhang > > > On Thu, Jun 27, 2024 at 9:39?AM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Mostly 3, maximum 7, but definitely hit the point when m > 1, I can see > the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, > xarray, &ione, &zero, z + i, &ione)); is called multiple > ZjQcmQRYFpfptBannerStart > *This Message Is From an External Sender* > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Mostly 3, maximum 7, but definitely hit the point when m > 1, > > I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, > yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple > times > > > *From: *Barry Smith > *Date: *Thursday, June 27, 2024 at 1:12?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > How big are the m's getting in your code? > > > > > On Jun 27, 2024, at 12:40?AM, Yongzhong Li > wrote: > > Hi Barry, I used gdb to debug my program, set a breakpoint to > VecMultiDot_Seq_GEMV function. I did see when I debug this function, it > will call BLAS (but not always, only if m > 1), as shown below. However, I > still didn?t see any MKL outputs even if I set MKLK_VERBOSE=1. > > *(gdb) * > *550 PetscCall(VecRestoreArrayRead(yin[i], &yfirst));* > *(gdb) * > *553 m = j - i;* > *(gdb) * > *554 if (m > 1) {* > *(gdb) * > *555 PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the > cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above* > *(gdb) * > *556 PetscScalar one = 1, zero = 0;* > *(gdb) * > *558 PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, > yarray, &lda2, xarray, &ione, &zero, z + i, &ione));* > *(gdb) s* > *PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> > "VecMultiDot_Seq_GEMV",* > * file=0x7ffff68a1078 > "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c")* > * at > /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106* > *106 if (!TRdebug) return PETSC_SUCCESS;* > *(gdb) * > *154 }* > > Am I not using MKL BLAS, is that why I didn?t see multithreading speed up > for KSPGMRESOrthog? What do you think could be the potential reasons? Is > there any silent mode that will possibly affect the MKL Verbose. > > Thank you and best regards, > Yongzhong > > > *From: *Barry Smith > *Date: *Wednesday, June 26, 2024 at 8:15?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > if (m > 1) { > PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe > since we've screened out those lda > PETSC_BLAS_INT_MAX above > PetscScalar one = 1, zero = 0; > > PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, > &lda2, xarray, &ione, &zero, z + i, &ione)); > PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0))); > > The call to BLAS above is where it uses MKL. 
> > > > > > On Jun 26, 2024, at 6:59?PM, Yongzhong Li > wrote: > > Hi Barry, I am looking into the source codes of VecMultiDot_Seq_GEMV > https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIoPmWgCr$ > > Can I ask which lines of codes suggest the use of intel mkl? > > Thanks, > Yongzhong > > > *From: *Barry Smith > *Date: *Wednesday, June 26, 2024 at 10:30?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > In a debug version of PETSc run your application in a debugger and put > a break point in VecMultiDot_Seq_GEMV. Then next through the code from > that point to see what decision it makes about using dgemv() to see why it > is not getting into the Intel code. > > > > > > On Jun 25, 2024, at 11:19?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > > Hi Junchao, thank you for your help for these benchmarking test! > > I check out to petsc/main and did a few things to verify from my side, > > 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute > node. The results are as follow, > $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 14.5 1.2 1.8 5.2 > 256 1.5 0.9 1.6 4.7 > 512 2.7 2.8 6.1 13.2 > 1024 4.0 4.0 9.3 16.4 > 2048 7.4 7.3 11.3 39.3 > 4096 14.2 13.9 19.1 93.4 > 8192 28.8 26.3 25.4 31.3 > 16384 54.1 25.8 26.7 33.8 > 32768 109.8 25.7 24.2 56.0 > 65536 220.2 24.4 26.5 89.0 > 131072 424.1 31.5 36.1 149.6 > 262144 898.1 37.1 53.9 286.1 > 524288 1754.6 48.7 100.3 1122.2 > 1048576 3645.8 86.5 347.9 2950.4 > 2097152 7371.4 308.7 1440.6 6874.9 > > $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 14.9 1.2 1.9 5.2 > 256 1.5 1.0 1.7 4.7 > 512 2.7 2.8 6.1 12.0 > 1024 3.9 4.0 9.3 16.8 > 2048 7.4 7.3 10.4 41.3 > 4096 14.0 13.8 18.6 84.2 > 8192 27.0 21.3 43.8 177.5 > 16384 54.1 34.1 89.1 330.4 > 32768 110.4 82.1 203.5 781.1 > 65536 213.0 191.8 423.9 1696.4 > 131072 428.7 360.2 934.0 4080.0 > 262144 883.4 723.2 1745.6 10120.7 > 524288 1817.5 1466.1 4751.4 23217.2 > 1048576 3611.0 3796.5 11814.9 48687.7 > 2097152 7401.9 10592.0 27543.2 106565.4 > > I can see the speed up brought by more MKL threads, and if I set > NKL_VERBOSE to 1, I can see something like > > > > *MKL_VERBOSE > ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) > 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca*From my understanding, > the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute > node and is using ZGEMV MKL BLAS. > > However, when I ran my own program and set MKL_VERBOSE to 1, it is very > strange that I still can?t find any MKL outputs, though I can see from the > PETSc log that VecMDot and VecMAXPY() are called. > > I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a > way that is similar to ex2k test? Should I expect to see MKL outputs for > whatever linear system I solve with KSPGMRES? 
Does it relate to if it is > dense matrix or sparse matrix, although I am not really understand why > VecMDot/MAXPY() have something to do with dense matrix-vector > multiplication. > > Thank you, > > Yongzhong > > *From: *Junchao Zhang > *Date: *Tuesday, June 25, 2024 at 6:34?PM > *To: *Matthew Knepley > *Cc: *Yongzhong Li , Pierre Jolivet < > pierre at joliv.et>, petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > Hi, Yongzhong, > Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we > can speed up the two with OpenMP threads, then we can speed up > KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in > dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny > matrices ). So with MKL_VERBOSE=1, you should see something like > "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with > petsc/main? > petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran > VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was > strange to see no speedup. I then configured petsc with openblas, I did > see better performance with more threads > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.1 > 256 1.8 2.7 7.0 > 512 2.1 3.1 8.6 > 1024 2.7 4.0 12.3 > 2048 3.8 6.3 28.0 > 4096 6.1 10.6 42.4 > 8192 10.9 21.8 79.5 > 16384 21.2 39.4 149.6 > 32768 45.9 75.7 224.6 > 65536 142.2 215.8 732.1 > 131072 169.1 233.2 1729.4 > 262144 367.5 830.0 4159.2 > 524288 999.2 1718.1 8538.5 > 1048576 2113.5 4082.1 18274.8 > 2097152 5392.6 10273.4 43273.4 > > > $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 > Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) > -------------------------------------------------------------------------- > 128 2.0 2.5 6.0 > 256 1.8 2.7 15.0 > 512 2.1 9.0 16.6 > 1024 2.6 8.7 16.1 > 2048 7.7 10.3 20.5 > 4096 9.9 11.4 25.9 > 8192 14.5 22.1 39.6 > 16384 25.1 27.8 67.8 > 32768 44.7 95.7 91.5 > 65536 82.1 156.8 165.1 > 131072 194.0 335.1 341.5 > 262144 388.5 380.8 612.9 > 524288 1046.7 967.1 1653.3 > 1048576 1997.4 2169.0 4034.4 > 2097152 5502.9 5787.3 12608.1 > > The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average > speedup depends on components. So I suggest you run ex2k to see in your > environment whether oneMKL can speedup the kernels. > > --Junchao Zhang > > > On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: > Let me run some examples on our end to see whether the code calls expected > functions. > > --Junchao Zhang > > > On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: > On Mon, Jun 24, 2024 at 11: 21 AM Yongzhong Li utoronto. ca> wrote: Thank you Pierre for your information. Do we have a > conclusion for my original question about the parallelization efficiency > for different stages of > ZjQcmQRYFpfptBannerStart > *This Message Is From an External Sender* > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Thank you Pierre for your information. Do we have a conclusion for my > original question about the parallelization efficiency for different stages > of KSP Solve? Do we need to do more testing to figure out the issues? 
Thank > you, Yongzhong From: > ZjQcmQRYFpfptBannerStart > *This Message Is From an External Sender* > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Thank you Pierre for your information. Do we have a conclusion for my > original question about the parallelization efficiency for different stages > of KSP Solve? Do we need to do more testing to figure out the issues? > > > We have an extended discussion of this here: > https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIp2OY8h7$ > > > The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) > are memory bandwidth limited. If there is no more bandwidth to be > marshalled on your board, then adding more processes does nothing at all. > This is why people were asking about how many "nodes" you are running on, > because that is the unit of memory bandwidth, not "cores" which make little > difference. > > Thanks, > > Matt > > > Thank you, > Yongzhong > > > *From: *Pierre Jolivet > *Date: *Sunday, June 23, 2024 at 12:41?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > > > > > On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Yeah, I ran my program again using -mat_view::ascii_info and set > MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix > to be seqaijmkl type (I?ve attached a few as below) > > --> Setting up matrix-vector products... > > Mat Object: 1 MPI process > type: seqaijmkl > rows=16490, cols=35937 > total: nonzeros=128496, allocated nonzeros=128496 > total number of mallocs used during MatSetValues calls=0 > not using I-node routines > Mat Object: 1 MPI process > type: seqaijmkl > rows=16490, cols=35937 > total: nonzeros=128496, allocated nonzeros=128496 > total number of mallocs used during MatSetValues calls=0 > not using I-node routines > > --> Solving the system... > > Excitation 1 of 1... > > ================================================ > Iterative solve completed in 7435 ms. > CONVERGED: rtol. > Iterations: 72 > Final relative residual norm: 9.22287e-07 > ================================================ > [CPU TIME] System solution: 2.27160000e+02 s. > [WALL TIME] System solution: 7.44387218e+00 s. > > However, it seems to me that there were still no MKL outputs even I set > MKL_VERBOSE to be 1. Although, I think it should be many spmv operations > when doing KSPSolve(). Do you see the possible reasons? > > > SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS > is. > > Thanks, > Pierre > > > > Thanks, > Yongzhong > > > > *From: *Matthew Knepley > *Date: *Saturday, June 22, 2024 at 5:56?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , Pierre Jolivet < > pierre at joliv.et>, petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > ????????? knepley at gmail.com ????????????????? > > On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 > MKL_VERBOSE Intel(R) MKL 2019. 
0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector > ZjQcmQRYFpfptBannerStart > *This Message Is From an External Sender* > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > MKL_VERBOSE=1 ./ex1 > > matrix nonzeros = 100, allocated nonzeros = 100 > MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for > Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) > AVX-512) with support of Vector Neural Network Instructions enabled > processors, Lnx 2.50GHz lp64 gnu_thread > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) > 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) > 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE > ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) > 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE > ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) > 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE > ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) > 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) > 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us > CNR:OFF Dyn:1 FastMM:1 TID:0 
NThr:1 > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) > 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) > 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE > ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) > 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) > 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE > ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) > 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us > CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) > 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE > ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) > 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF > Dyn:1 FastMM:1 TID:0 NThr:1 > > Yes, for petsc example, there are MKL outputs, but for my own program. All > I did is to change the matrix type from MATAIJ to MATAIJMKL to get > optimized performance for spmv from MKL. Should I expect to see any MKL > outputs in this case? > > > Are you sure that the type changed? You can MatView() the matrix with > format ascii_info to see. > > Thanks, > > Matt > > > > Thanks, > Yongzhong > > > *From: *Junchao Zhang > *Date: *Saturday, June 22, 2024 at 9:40?AM > *To: *Yongzhong Li > *Cc: *Pierre Jolivet , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > No, you don't. It is strange. Perhaps you can you run a petsc example > first and see if MKL is really used > $ cd src/mat/tests > $ make ex1 > $ MKL_VERBOSE=1 ./ex1 > > --Junchao Zhang > > > On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > I am using > > export MKL_VERBOSE=1 > ./xx > > in the bash file, do I have to use - ksp_converged_reason? > > Thanks, > Yongzhong > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:47?PM > *To: *Yongzhong Li > *Cc: *Junchao Zhang , petsc-users at mcs.anl.gov < > petsc-users at mcs.anl.gov> > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > ????????? pierre at joliv.et ????????????????? > > How do you set the variable? 
> > $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason > MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 > architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled > processors, Lnx 2.80GHz lp64 intel_thread > MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 > FastMM:1 TID:0 NThr:1 > MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 > TID:0 NThr:1 > [...] > > > On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hello all, > > I set MKL_VERBOSE = 1, but observed no print output specific to the use of > MKL. Does PETSc enable this verbose output? > > Best, > > Yongzhong > > > *From: *Pierre Jolivet > *Date: *Friday, June 21, 2024 at 1:36?AM > *To: *Junchao Zhang > *Cc: *Yongzhong Li , > petsc-users at mcs.anl.gov > *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc > KSPSolve Performance Issue > ????????? pierre at joliv.et ????????????????? > > > > > On 21 Jun 2024, at 6:42?AM, Junchao Zhang wrote: > > This Message Is From an External Sender > This message came from outside your organization. > I remember there are some MKL env vars to print MKL routines called. > > > The environment variable is MKL_VERBOSE > > Thanks, > Pierre > > > Maybe we can try it to see what MKL routines are really used and then we > can understand why some petsc functions did not speed up > > --Junchao Zhang > > > On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > *This Message Is From an External Sender* > This message came from outside your organization. > > Hi Barry, sorry for my last results. I didn?t fully understand the stage > profiling and logging in PETSc, now I only record KSPSolve() stage of my > program. Some sample codes are as follow, > > // Static variable to keep track of the stage counter > static int stageCounter = 1; > > // Generate a unique stage name > std::ostringstream oss; > oss << "Stage " << stageCounter << " of Code"; > std::string stageName = oss.str(); > > // Register the stage > PetscLogStage stagenum; > > PetscLogStageRegister(stageName.c_str(), &stagenum); > PetscLogStagePush(stagenum); > > *KSPSolve(*ksp_ptr, b, x);* > > PetscLogStagePop(); > stageCounter++; > > I have attached my new logging results, there are 1 main stage and 4 other > stages where each one is KSPSolve() call. > > To provide some additional backgrounds, if you recall, I have been trying > to get efficient iterative solution using multithreading. I found out by > compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to > perform sparse matrix-vector multiplication faster, I am using > MATSEQAIJMKL. This makes the shell matrix vector product in each iteration > scale well with the #of threads. However, I found out the total GMERS solve > time (~KSPSolve() time) is not scaling well the #of threads. > > From the logging results I learned that when performing KSPSolve(), there > are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs > using different number of threads and plotted the time consumption for > PCApply() and KSPGMERSOrthog() against #of thread. I found out these two > operations are not scaling with the threads at all! 
My results are attached > as the pdf to give you a clear view. > > My questions is, > > From my understanding, in PCApply, MatSolve() is involved, > KSPGMERSOrthog() will have many vector operations, so why these two parts > can?t scale well with the # of threads when the intel MKL library is linked? > > Thank you, > Yongzhong > > > *From: *Barry Smith > *Date: *Friday, June 14, 2024 at 11:36?AM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > I am a bit confused. Without the initial guess computation, there are > still a bunch of events I don't understand > > MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 > MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > in addition there are many more VecMAXPY then VecMDot (in GMRES they are > each done the same number of times) > > VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 > VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 > > Finally there are a huge number of > > MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 > 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 > > Are you making calls to all these routines? Are you doing this inside your > MatMult() or before you call KSPSolve? > > The reason I wanted you to make a simpler run without the initial guess > code is that your events are far more complicated than would be produced by > GMRES alone so it is not possible to understand the behavior you are seeing > without fully understanding all the events happening in the code. > > Barry > > > > On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: > > Thanks, I have attached the results without using any KSPGuess. At low > frequency, the iteration steps are quite close to the one with KSPGuess, > specifically > > KSPGuess Object: 1 MPI process > type: fischer > Model 1, size 200 > > However, I found at higher frequency, the # of iteration steps are > significant higher than the one with KSPGuess, I have attahced both of the > results for your reference. > > Moreover, could I ask why the one without the KSPGuess options can be used > for a baseline comparsion? What are we comparing here? How does it relate > to the performance issue/bottleneck I found? ?*I have noticed that the > time taken by **KSPSolve** is **almost two times **greater than the CPU > time for matrix-vector product multiplied by the number of iteration*? > > Thank you! 
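As a side note, the Fischer guess shown above ("type: fischer, Model 1, size 200") is normally enabled purely through options, so the baseline is simply the same run with the guess options dropped. The following is only a minimal sketch, not taken from the code in this thread; it assumes a KSP named ksp and reuses the model/size values reported above:

  -ksp_guess_type fischer -ksp_guess_fischer_model 1,200

or, programmatically,

  KSPGuess guess;
  PetscCall(KSPGetGuess(ksp, &guess));                 /* the guess object is owned by the KSP */
  PetscCall(KSPGuessSetType(guess, KSPGUESSFISCHER));  /* recycle previous solutions as initial guesses */
  PetscCall(KSPGuessFischerSetModel(guess, 1, 200));   /* model 1, subspace size 200 */

Leaving these out (or omitting the options) gives the no-KSPGuess baseline being compared here.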
> Yongzhong > > > *From: *Barry Smith > *Date: *Thursday, June 13, 2024 at 2:14?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > > Can you please run the same thing without the KSPGuess option(s) for a > baseline comparison? > > Thanks > > Barry > > > On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi Matt, > > I have rerun the program with the keys you provided. The system output > when performing ksp solve and the final petsc log output were stored in a > .txt file attached for your reference. > > Thanks! > Yongzhong > > > *From: *Matthew Knepley > *Date: *Wednesday, June 12, 2024 at 6:46?PM > *To: *Yongzhong Li > *Cc: *petsc-users at mcs.anl.gov , > petsc-maint at mcs.anl.gov , Piero Triverio < > piero.triverio at utoronto.ca> > *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve > Performance Issue > ????????? knepley at gmail.com ????????????????? > > On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li < > yongzhong.li at mail.utoronto.ca> wrote: > > Dear PETSc?s developers, I hope this email finds you well. I am currently > working on a project using PETSc and have encountered a performance issue > with the KSPSolve function. Specifically, I have noticed that the time > taken by KSPSolve is > ZjQcmQRYFpfptBannerStart > *This Message Is From an External Sender* > This message came from outside your organization. > > ZjQcmQRYFpfptBannerEnd > Dear PETSc?s developers, > I hope this email finds you well. > I am currently working on a project using PETSc and have encountered a > performance issue with the KSPSolve function. Specifically, *I have > noticed that the time taken by **KSPSolve** is **almost two times **greater > than the CPU time for matrix-vector product multiplied by the number of > iteration steps*. I use C++ chrono to record CPU time. > For context, I am using a shell system matrix A. Despite my efforts to > parallelize the matrix-vector product (Ax), the overall solve time > remains higher than the matrix vector product per iteration indicates > when multiple threads were used. Here are a few details of my setup: > > - *Matrix Type*: Shell system matrix > - *Preconditioner*: Shell PC > - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK > library, multithreading is enabled > > I have considered several potential reasons, such as preconditioner setup, > additional solver operations, and the inherent overhead of using a shell > system matrix. *However, since KSPSolve is a high-level API, I have been > unable to pinpoint the exact cause of the increased solve time.* > Have you observed the same issue? Could you please provide some > experience on how to diagnose and address this performance discrepancy? > Any insights or recommendations you could offer would be greatly > appreciated. > > > For any performance question like this, we need to see the output of your > code run with > > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view > > Thanks, > > Matt > > > Thank you for your time and assistance. 
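The shell matrix and shell preconditioner themselves never appear in the thread, so the following is only a hedged sketch of the kind of setup being described; MyMult, MyPCApply, ctx, n, b and x are hypothetical user callbacks and data, not the actual code:

  extern PetscErrorCode MyMult(Mat A, Vec x, Vec y);     /* user's threaded A*x */
  extern PetscErrorCode MyPCApply(PC pc, Vec r, Vec z);  /* user's preconditioner apply */

  Mat A; KSP ksp; PC pc;
  PetscCall(MatCreateShell(PETSC_COMM_SELF, n, n, n, n, ctx, &A));
  PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MyMult));
  PetscCall(KSPCreate(PETSC_COMM_SELF, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetType(ksp, KSPGMRES));
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCSHELL));
  PetscCall(PCShellSetApply(pc, MyPCApply));
  PetscCall(KSPSetFromOptions(ksp));   /* picks up -ksp_view, -ksp_monitor_true_residual, -log_view, ... */
  PetscCall(KSPSolve(ksp, b, x));

With this structure, everything KSPSolve does outside the two callbacks (orthogonalization, vector updates, norms) runs inside PETSc and its BLAS, which is why the -log_view output requested above is the way to see where the extra time goes.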
> Best regards, > Yongzhong > ----------------------------------------------------------- > *Yongzhong Li* > PhD student | Electromagnetics Group > Department of Electrical & Computer Engineering > University of Toronto > https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIug_3RUa$ > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIqd3E3yv$ > > > > > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIqd3E3yv$ > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bT8Fh0B1GB5nDS3DTpc--fcfGuqOeym0MPwCORXl6F2Sy8A0GFIbVFQUT0J54XZ5Ds7eG_kLdQ-s6tD0GVEQIqd3E3yv$ > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pierre.jolivet at lip6.fr Fri Jun 28 13:04:47 2024 From: pierre.jolivet at lip6.fr (Pierre Jolivet) Date: Fri, 28 Jun 2024 20:04:47 +0200 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> <4D1A8BC2-66AD-4627-84B7-B12A18BA0983@petsc.dev> <55B35581-80F7-482D-B53A-35FCAF907554@petsc.dev> Message-ID: > On 28 Jun 2024, at 7:20?PM, Junchao Zhang wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi, Yongzhong, > It is great to see you have made such good progress. Barry is right, you need -vec_maxpy_use_gemv 1. It's my mistake for not mentioning it earlier. But even with that, there are still problems. > petsc tries to optimize VecMDot/MAXPY with BLAS GEMV, with hope that vendors' BLAS library would be highly optimized on that. However, we found though they were good with VecMDot, but not with VecMAXPY. So by default in petsc, we disabled the GEMV optimization for VecMAXPY. One can use -vec_maxpy_use_gemv 1 to turn on it. > I turned it on and tested VecMAXPY with ex2k and MKL, but failed to see any improvement with multiple threads. I could not understand why MKL is so bad on it. You can try it yourself in your environment. > Without the GEMV optimization, VecMAXPY() is implemented by petsc with a batch of PetscKernelAXPY() kernels, which contain simple for loops but not OpenMP parallelized (since petsc does not support OpenMP outright) . I added "omp parallel for" pragma in PetscKernelAXPY() kernels, and tested ex2k again with now parallelized petsc. Here is the result. 
> > $ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 2 -test_name VecMAXPY -vec_maxpy_use_gemv 0 > Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) > -------------------------------------------------------------------------- > 128 7.0 10.1 21.4 72.7 > 256 7.9 12.9 29.5 101.0 > 512 9.4 17.2 40.5 136.2 > 1024 15.9 27.3 67.5 249.3 > 2048 26.5 48.7 139.6 432.7 > 4096 47.1 77.3 186.4 710.3 > 8192 84.8 152.2 423.9 1580.6 > 16384 154.9 298.5 792.1 2889.2 > 32768 183.7 338.7 893.9 3436.2 > 65536 639.1 1247.8 3219.1 12494.8 > 131072 1125.2 1856.2 6843.0 23653.7 > 262144 2603.2 4948.4 13259.4 51287.7 > 524288 5093.6 10305.0 26451.7 96919.6 > 1048576 5898.6 10947.2 45486.4 127352.8 > 2097152 11845.4 21912.5 57999.6 331403.4 > > $ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=16 ./ex2k -n 15 -m 2 -test_name VecMAXPY -vec_maxpy_use_gemv 0 > Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) > -------------------------------------------------------------------------- > 128 17.0 16.1 31.5 112.9 > 256 13.7 16.8 31.2 120.2 > 512 14.5 18.1 33.9 129.9 > 1024 16.5 21.0 38.5 150.4 > 2048 18.5 22.1 41.8 171.4 > 4096 21.0 25.4 55.3 212.3 > 8192 27.0 30.3 68.6 251.9 > 16384 32.2 44.5 93.3 350.5 > 32768 45.8 65.0 149.8 558.8 > 65536 59.7 102.8 247.1 946.0 > 131072 100.7 186.4 485.3 1898.1 > 262144 183.4 345.2 922.2 3567.0 > 524288 339.6 676.8 1820.7 7530.4 > 1048576 662.0 1364.7 3585.3 13969.1 > 2097152 1379.7 2788.6 7414.0 28275.3 > > We can see VecMAXPY() can be easily speeded up with multithreading. > > For MatSolve, I checked petsc's aijmkl.c, and found we don't have interface to MKL's sparse solve. We do, it?s in src/mat/impls/aij/seq/mkl_pardiso, and it?s threaded (the distributed version is in src/mat/impls/aij/mpi/mkl_cpardiso). Thanks, Pierre > I checked https://urldefense.us/v3/__https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2023-0/openmp-threaded-functions-and-problems.html__;!!G_uCfscf7eWS!f6hk447h0PeWgp-IOpXXRde0Uf4pZtbzZwUlQYKJenqYBIzAjwjGNioup3__-D4K_wLEmzLfZt_-8QTQumto04oCrka8yfZZ$ , but confused with MKL's list of threaded function > Direct sparse solver. > All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers. > I don't know whether MKL has threaded sparse solver. > > --Junchao Zhang > > > On Fri, Jun 28, 2024 at 11:35?AM Barry Smith > wrote: >> >> Are you running with -vec_maxpy_use_gemv ? >> >> >>> On Jun 28, 2024, at 1:46?AM, Yongzhong Li > wrote: >>> >>> Thanks all for your help!!! >>> >>> I think I find the issues. I am compiling a large CMake project that relies on many external libraries (projects). Previously, I used OpenBLAS as the BLAS for all the dependencies including PETSc. After I switched to Intel MKL for PETSc, I still kept the OpenBLAS and use it as the BLAS for all the other dependencies. I think somehow even when I specify the blas-lapack-dir to the MKLROOT when PETSc is configured, the actual program still use OpenBLAS as the BLAS for some PETSc functions, such as VecMDot() and VecMAXPY(), so that?s why I didn?t see any MKL verbose during the KSPSolve(). Now I remove the OpenBLAS and use Intel MKL as the BLAS for all the dependencies. The issue is resolved, I can clearly see MKL routines are called when KSP GMRES is running. >>> >>> Back to my original questions, my goal is to achieve good parallelization efficiency for KSP GMRES Solve. As I use multithreading-enabled MKL spmv routines, the wall time for MatMult/MatMultAdd() has been greatly reduced. 
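To make Pierre's pointer to src/mat/impls/aij/seq/mkl_pardiso above concrete: the MKL PARDISO interface is exposed as a PETSc MatSolverType, so the threaded MKL direct solve can be tried without changing the AIJ matrix itself. This is only a sketch of the usual recipe, assuming PETSc was configured with --with-mkl_pardiso-dir=$MKLROOT (the configure option mentioned later in this thread) and that F, rperm, cperm and info are local variables:

  Mat F; IS rperm, cperm; MatFactorInfo info;
  PetscCall(MatFactorInfoInitialize(&info));
  PetscCall(MatGetOrdering(A, MATORDERINGNATURAL, &rperm, &cperm));
  PetscCall(MatGetFactor(A, MATSOLVERMKL_PARDISO, MAT_FACTOR_LU, &F));
  PetscCall(MatLUFactorSymbolic(F, A, rperm, cperm, &info));
  PetscCall(MatLUFactorNumeric(F, A, &info));
  PetscCall(MatSolve(F, b, x));   /* this MatSolve is the threaded PARDISO solve */

For a plain (non-shell) preconditioner the same thing is reachable from the command line with -pc_type lu -pc_factor_mat_solver_type mkl_pardiso, and the number of PARDISO threads then follows MKL's usual threading controls. Whether the triangular solves themselves scale with threads is, of course, exactly what would have to be measured here.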
However,the KSPGMRESOrthog and MatSolve in PCApply still take over 50% of solving time and can?t benefit from multithreading. After I fixed the issue I mentioned, I found I got around 15% time reduced because of more efficient VecMDot() calls. I attach a petsc log comparison for your reference (same settings, only difference is whether use MKL BLAS or not), you can see the percentage of VecMDot() is reduced. However, here comes the interesting part, VecMAXPY() didn?t benefit from MKL BLAS, it still takes almost 40% of solution when I use 64 MKL Threads, which is a lot for my program. And if I multiple this percentage with the actual wall time against different # of threads, it stays the same. Then I used ex2k benchmark to verify what I found. Here is the result, >>> >>> $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 5 -test_name VecMAXPY >>> Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) >>> -------------------------------------------------------------------------- >>> 128 0.4 0.9 2.4 8.8 >>> 256 0.3 1.1 3.5 13.3 >>> 512 0.5 4.4 6.7 26.5 >>> 1024 0.9 4.8 13.3 51.0 >>> 2048 3.5 12.3 37.1 94.7 >>> 4096 4.3 24.5 73.6 179.6 >>> 8192 6.3 48.7 98.9 380.8 >>> 16384 9.3 99.2 200.2 774.0 >>> 32768 30.6 155.4 421.2 1662.9 >>> 65536 101.2 269.4 827.4 3565.0 >>> 131072 206.9 551.0 1829.0 7580.5 >>> 262144 450.2 1251.9 3986.2 15525.6 >>> 524288 1322.1 2901.7 8567.1 31840.0 >>> 1048576 2788.6 6190.6 16394.7 63514.9 >>> 2097152 5534.8 12619.9 35427.4 130064.5 >>> $ MKL_NUM_THREADS=8 ./ex2k -n 15 -m 5 -test_name VecMAXPY >>> Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) >>> -------------------------------------------------------------------------- >>> 128 0.3 0.7 2.4 8.8 >>> 256 0.3 1.1 3.6 13.5 >>> 512 0.5 4.4 6.8 26.4 >>> 1024 0.9 4.8 13.6 50.5 >>> 2048 7.6 12.2 36.5 95.0 >>> 4096 8.5 25.7 72.4 182.6 >>> 8192 11.9 48.5 103.7 383.7 >>> 16384 12.8 97.7 203.7 785.0 >>> 32768 11.2 148.5 421.9 1681.5 >>> 65536 15.5 271.2 843.8 3613.7 >>> 131072 34.3 564.7 1905.2 7558.8 >>> 262144 106.4 1334.5 4002.8 15458.3 >>> 524288 217.2 2858.4 8407.9 31303.7 >>> 1048576 701.5 6060.6 16947.3 64118.5 >>> 2097152 1769.7 13218.3 36347.3 131062.9 >>> >>> It stays the same, no benefit from multithreading BLAS!! Unlike what I found for VecMdot(), where I did see speed up for more #of threads. Then, I dig deeper. I learned that for VecMDot(), it calls ZGEMV while for VecMAXPY(), it calls ZAXPY. This observation seems to indicate that ZAXPY is not benefiting from MKL threads. >>> >>> My question is do you know why ZAXPY is not multithreaded? From my perspective, VecMDot() and VecMAXPY() are very similar operations, the only difference is whether we need to scale the vectors to be multiplied or not. I think you have mentioned that recently you did some optimization to these two routines, from my above results and observations, are these aligned with your expectations? Could we further optimize the codes to get more parallelization efficiency in my case? >>> >>> And another question, can MatSolve() in KSPSolve be multithreaded? Would MUMPS help? >>> >>> Thank you and regards, >>> Yongzhong >>> >>> From: Junchao Zhang > >>> Sent: Thursday, June 27, 2024 11:10 AM >>> To: Yongzhong Li > >>> Cc: Barry Smith >; petsc-users at mcs.anl.gov >>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> How big is the n when you call PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione))? n is the vector length in VecMDot. 
>>> it is strange with MKL_VERBOSE=1 you did not see MKL_VERBOSE ZGEMV..., since the code did call gemv. Perhaps you need to double check your spelling etc. >>> >>> If you also use ex2k, and potentially modify Ms[] and Ns[] to match the sizes in your code, to see if there is a speedup with more threads. >>> >>> --Junchao Zhang >>> >>> >>> On Thu, Jun 27, 2024 at 9:39?AM Yongzhong Li > wrote: >>> Mostly 3, maximum 7, but definitely hit the point when m > 1, I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple >>> ZjQcmQRYFpfptBannerStart >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> >>> ZjQcmQRYFpfptBannerEnd >>> Mostly 3, maximum 7, but definitely hit the point when m > 1, >>> >>> I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple times >>> >>> From: Barry Smith > >>> Date: Thursday, June 27, 2024 at 1:12?AM >>> To: Yongzhong Li > >>> Cc: petsc-users at mcs.anl.gov > >>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> >>> How big are the m's getting in your code? >>> >>> >>> >>> On Jun 27, 2024, at 12:40?AM, Yongzhong Li > wrote: >>> >>> Hi Barry, I used gdb to debug my program, set a breakpoint to VecMultiDot_Seq_GEMV function. I did see when I debug this function, it will call BLAS (but not always, only if m > 1), as shown below. However, I still didn?t see any MKL outputs even if I set MKLK_VERBOSE=1. >>> >>> (gdb) >>> 550 PetscCall(VecRestoreArrayRead(yin[i], &yfirst)); >>> (gdb) >>> 553 m = j - i; >>> (gdb) >>> 554 if (m > 1) { >>> (gdb) >>> 555 PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above >>> (gdb) >>> 556 PetscScalar one = 1, zero = 0; >>> (gdb) >>> 558 PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); >>> (gdb) s >>> PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> "VecMultiDot_Seq_GEMV", >>> file=0x7ffff68a1078 "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c") >>> at /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106 >>> 106 if (!TRdebug) return PETSC_SUCCESS; >>> (gdb) >>> 154 } >>> >>> Am I not using MKL BLAS, is that why I didn?t see multithreading speed up for KSPGMRESOrthog? What do you think could be the potential reasons? Is there any silent mode that will possibly affect the MKL Verbose. >>> >>> Thank you and best regards, >>> Yongzhong >>> >>> From: Barry Smith > >>> Date: Wednesday, June 26, 2024 at 8:15?PM >>> To: Yongzhong Li > >>> Cc: petsc-users at mcs.anl.gov > >>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> >>> if (m > 1) { >>> PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above >>> PetscScalar one = 1, zero = 0; >>> >>> PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); >>> PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0))); >>> >>> The call to BLAS above is where it uses MKL. 
>>> >>> >>> >>> >>> On Jun 26, 2024, at 6:59?PM, Yongzhong Li > wrote: >>> >>> Hi Barry, I am looking into the source codes of VecMultiDot_Seq_GEMV https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!f6hk447h0PeWgp-IOpXXRde0Uf4pZtbzZwUlQYKJenqYBIzAjwjGNioup3__-D4K_wLEmzLfZt_-8QTQumto04oCrqv2WFmI$ >>> Can I ask which lines of codes suggest the use of intel mkl? >>> >>> Thanks, >>> Yongzhong >>> >>> From: Barry Smith > >>> Date: Wednesday, June 26, 2024 at 10:30?AM >>> To: Yongzhong Li > >>> Cc: petsc-users at mcs.anl.gov > >>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> >>> In a debug version of PETSc run your application in a debugger and put a break point in VecMultiDot_Seq_GEMV. Then next through the code from that point to see what decision it makes about using dgemv() to see why it is not getting into the Intel code. >>> >>> >>> >>> >>> On Jun 25, 2024, at 11:19?PM, Yongzhong Li > wrote: >>> >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> Hi Junchao, thank you for your help for these benchmarking test! >>> >>> I check out to petsc/main and did a few things to verify from my side, >>> >>> 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node. The results are as follow, >>> >>> $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 >>> Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) >>> -------------------------------------------------------------------------- >>> 128 14.5 1.2 1.8 5.2 >>> 256 1.5 0.9 1.6 4.7 >>> 512 2.7 2.8 6.1 13.2 >>> 1024 4.0 4.0 9.3 16.4 >>> 2048 7.4 7.3 11.3 39.3 >>> 4096 14.2 13.9 19.1 93.4 >>> 8192 28.8 26.3 25.4 31.3 >>> 16384 54.1 25.8 26.7 33.8 >>> 32768 109.8 25.7 24.2 56.0 >>> 65536 220.2 24.4 26.5 89.0 >>> 131072 424.1 31.5 36.1 149.6 >>> 262144 898.1 37.1 53.9 286.1 >>> 524288 1754.6 48.7 100.3 1122.2 >>> 1048576 3645.8 86.5 347.9 2950.4 >>> 2097152 7371.4 308.7 1440.6 6874.9 >>> >>> $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 >>> Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) >>> -------------------------------------------------------------------------- >>> 128 14.9 1.2 1.9 5.2 >>> 256 1.5 1.0 1.7 4.7 >>> 512 2.7 2.8 6.1 12.0 >>> 1024 3.9 4.0 9.3 16.8 >>> 2048 7.4 7.3 10.4 41.3 >>> 4096 14.0 13.8 18.6 84.2 >>> 8192 27.0 21.3 43.8 177.5 >>> 16384 54.1 34.1 89.1 330.4 >>> 32768 110.4 82.1 203.5 781.1 >>> 65536 213.0 191.8 423.9 1696.4 >>> 131072 428.7 360.2 934.0 4080.0 >>> 262144 883.4 723.2 1745.6 10120.7 >>> 524288 1817.5 1466.1 4751.4 23217.2 >>> 1048576 3611.0 3796.5 11814.9 48687.7 >>> 2097152 7401.9 10592.0 27543.2 106565.4 >>> >>> I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like >>> >>> MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca >>> >>> From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. >>> >>> However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. >>> >>> I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? 
Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. >>> >>> Thank you, >>> Yongzhong >>> >>> From: Junchao Zhang > >>> Date: Tuesday, June 25, 2024 at 6:34?PM >>> To: Matthew Knepley > >>> Cc: Yongzhong Li >, Pierre Jolivet >, petsc-users at mcs.anl.gov > >>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> Hi, Yongzhong, >>> Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? >>> petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. I then configured petsc with openblas, I did see better performance with more threads >>> >>> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 >>> Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) >>> -------------------------------------------------------------------------- >>> 128 2.0 2.5 6.1 >>> 256 1.8 2.7 7.0 >>> 512 2.1 3.1 8.6 >>> 1024 2.7 4.0 12.3 >>> 2048 3.8 6.3 28.0 >>> 4096 6.1 10.6 42.4 >>> 8192 10.9 21.8 79.5 >>> 16384 21.2 39.4 149.6 >>> 32768 45.9 75.7 224.6 >>> 65536 142.2 215.8 732.1 >>> 131072 169.1 233.2 1729.4 >>> 262144 367.5 830.0 4159.2 >>> 524288 999.2 1718.1 8538.5 >>> 1048576 2113.5 4082.1 18274.8 >>> 2097152 5392.6 10273.4 43273.4 >>> >>> >>> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 >>> Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) >>> -------------------------------------------------------------------------- >>> 128 2.0 2.5 6.0 >>> 256 1.8 2.7 15.0 >>> 512 2.1 9.0 16.6 >>> 1024 2.6 8.7 16.1 >>> 2048 7.7 10.3 20.5 >>> 4096 9.9 11.4 25.9 >>> 8192 14.5 22.1 39.6 >>> 16384 25.1 27.8 67.8 >>> 32768 44.7 95.7 91.5 >>> 65536 82.1 156.8 165.1 >>> 131072 194.0 335.1 341.5 >>> 262144 388.5 380.8 612.9 >>> 524288 1046.7 967.1 1653.3 >>> 1048576 1997.4 2169.0 4034.4 >>> 2097152 5502.9 5787.3 12608.1 >>> >>> The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average speedup depends on components. So I suggest you run ex2k to see in your environment whether oneMKL can speedup the kernels. >>> >>> --Junchao Zhang >>> >>> >>> On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: >>> Let me run some examples on our end to see whether the code calls expected functions. >>> >>> --Junchao Zhang >>> >>> >>> On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: >>> On Mon, Jun 24, 2024 at 11:?21 AM Yongzhong Li wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of >>> ZjQcmQRYFpfptBannerStart >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> >>> ZjQcmQRYFpfptBannerEnd >>> On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li > wrote: >>> Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? Thank you, Yongzhong From:? 
>>> ZjQcmQRYFpfptBannerStart >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> >>> ZjQcmQRYFpfptBannerEnd >>> Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? >>> >>> We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!f6hk447h0PeWgp-IOpXXRde0Uf4pZtbzZwUlQYKJenqYBIzAjwjGNioup3__-D4K_wLEmzLfZt_-8QTQumto04oCrsvkykQX$ >>> >>> The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference. >>> >>> Thanks, >>> >>> Matt >>> >>> Thank you, >>> Yongzhong >>> >>> From: Pierre Jolivet > >>> Date: Sunday, June 23, 2024 at 12:41?AM >>> To: Yongzhong Li > >>> Cc: petsc-users at mcs.anl.gov > >>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> >>> >>> >>> On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: >>> >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) >>> >>> --> Setting up matrix-vector products... >>> >>> Mat Object: 1 MPI process >>> type: seqaijmkl >>> rows=16490, cols=35937 >>> total: nonzeros=128496, allocated nonzeros=128496 >>> total number of mallocs used during MatSetValues calls=0 >>> not using I-node routines >>> Mat Object: 1 MPI process >>> type: seqaijmkl >>> rows=16490, cols=35937 >>> total: nonzeros=128496, allocated nonzeros=128496 >>> total number of mallocs used during MatSetValues calls=0 >>> not using I-node routines >>> >>> --> Solving the system... >>> >>> Excitation 1 of 1... >>> >>> ================================================ >>> Iterative solve completed in 7435 ms. >>> CONVERGED: rtol. >>> Iterations: 72 >>> Final relative residual norm: 9.22287e-07 >>> ================================================ >>> [CPU TIME] System solution: 2.27160000e+02 s. >>> [WALL TIME] System solution: 7.44387218e+00 s. >>> >>> However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? >>> >>> SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. >>> >>> Thanks, >>> Pierre >>> >>> >>> Thanks, >>> Yongzhong >>> >>> >>> From: Matthew Knepley > >>> Date: Saturday, June 22, 2024 at 5:56?PM >>> To: Yongzhong Li > >>> Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > >>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> ????????? knepley at gmail.com ????????????????? 
>>> On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: >>> MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector >>> ZjQcmQRYFpfptBannerStart >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> >>> ZjQcmQRYFpfptBannerEnd >>> MKL_VERBOSE=1 ./ex1 >>> >>> matrix nonzeros = 100, allocated nonzeros = 100 >>> MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread >>> MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE 
ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> >>> Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? >>> >>> Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. >>> >>> Thanks, >>> >>> Matt >>> >>> >>> Thanks, >>> Yongzhong >>> >>> From: Junchao Zhang > >>> Date: Saturday, June 22, 2024 at 9:40?AM >>> To: Yongzhong Li > >>> Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > >>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> No, you don't. It is strange. Perhaps you can you run a petsc example first and see if MKL is really used >>> $ cd src/mat/tests >>> $ make ex1 >>> $ MKL_VERBOSE=1 ./ex1 >>> >>> --Junchao Zhang >>> >>> >>> On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li > wrote: >>> I am using >>> >>> export MKL_VERBOSE=1 >>> ./xx >>> >>> in the bash file, do I have to use - ksp_converged_reason? >>> >>> Thanks, >>> Yongzhong >>> >>> From: Pierre Jolivet > >>> Date: Friday, June 21, 2024 at 1:47?PM >>> To: Yongzhong Li > >>> Cc: Junchao Zhang >, petsc-users at mcs.anl.gov > >>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> ????????? pierre at joliv.et ????????????????? >>> How do you set the variable? 
>>> >>> $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason >>> MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread >>> MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>> [...] >>> >>> On 21 Jun 2024, at 7:37?PM, Yongzhong Li > wrote: >>> >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> Hello all, >>> >>> I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output? >>> >>> Best, >>> Yongzhong >>> >>> >>> From: Pierre Jolivet > >>> Date: Friday, June 21, 2024 at 1:36?AM >>> To: Junchao Zhang > >>> Cc: Yongzhong Li >, petsc-users at mcs.anl.gov > >>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> ????????? pierre at joliv.et ????????????????? >>> >>> >>> On 21 Jun 2024, at 6:42?AM, Junchao Zhang > wrote: >>> >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> I remember there are some MKL env vars to print MKL routines called. >>> >>> The environment variable is MKL_VERBOSE >>> >>> Thanks, >>> Pierre >>> >>> Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up >>> >>> --Junchao Zhang >>> >>> >>> On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li > wrote: >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> >>> Hi Barry, sorry for my last results. I didn?t fully understand the stage profiling and logging in PETSc, now I only record KSPSolve() stage of my program. Some sample codes are as follow, >>> >>> // Static variable to keep track of the stage counter >>> static int stageCounter = 1; >>> >>> // Generate a unique stage name >>> std::ostringstream oss; >>> oss << "Stage " << stageCounter << " of Code"; >>> std::string stageName = oss.str(); >>> >>> // Register the stage >>> PetscLogStage stagenum; >>> >>> PetscLogStageRegister(stageName.c_str(), &stagenum); >>> PetscLogStagePush(stagenum); >>> >>> KSPSolve(*ksp_ptr, b, x); >>> >>> PetscLogStagePop(); >>> stageCounter++; >>> >>> I have attached my new logging results, there are 1 main stage and 4 other stages where each one is KSPSolve() call. >>> >>> To provide some additional backgrounds, if you recall, I have been trying to get efficient iterative solution using multithreading. I found out by compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster, I am using MATSEQAIJMKL. This makes the shell matrix vector product in each iteration scale well with the #of threads. However, I found out the total GMERS solve time (~KSPSolve() time) is not scaling well the #of threads. >>> >>> From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs using different number of threads and plotted the time consumption for PCApply() and KSPGMERSOrthog() against #of thread. 
I found out these two operations are not scaling with the threads at all! My results are attached as the pdf to give you a clear view. >>> >>> My questions is, >>> >>> From my understanding, in PCApply, MatSolve() is involved, KSPGMERSOrthog() will have many vector operations, so why these two parts can?t scale well with the # of threads when the intel MKL library is linked? >>> >>> Thank you, >>> Yongzhong >>> >>> From: Barry Smith > >>> Date: Friday, June 14, 2024 at 11:36?AM >>> To: Yongzhong Li > >>> Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > >>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> >>> I am a bit confused. Without the initial guess computation, there are still a bunch of events I don't understand >>> >>> MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 >>> MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> >>> in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) >>> >>> VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 >>> VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 >>> >>> Finally there are a huge number of >>> >>> MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 >>> >>> Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? >>> >>> The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. >>> >>> Barry >>> >>> >>> On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: >>> >>> Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically >>> >>> KSPGuess Object: 1 MPI process >>> type: fischer >>> Model 1, size 200 >>> >>> However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. >>> >>> Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? >>> >>> Thank you! 
>>> Yongzhong >>> >>> From: Barry Smith > >>> Date: Thursday, June 13, 2024 at 2:14?PM >>> To: Yongzhong Li > >>> Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > >>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> >>> Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? >>> >>> Thanks >>> >>> Barry >>> >>> On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: >>> >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> Hi Matt, >>> >>> I have rerun the program with the keys you provided. The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. >>> >>> Thanks! >>> Yongzhong >>> >>> From: Matthew Knepley > >>> Date: Wednesday, June 12, 2024 at 6:46?PM >>> To: Yongzhong Li > >>> Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > >>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>> >>> ????????? knepley at gmail.com ????????????????? >>> On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: >>> Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is >>> ZjQcmQRYFpfptBannerStart >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> >>> ZjQcmQRYFpfptBannerEnd >>> Dear PETSc?s developers, >>> I hope this email finds you well. >>> I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. >>> For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: >>> Matrix Type: Shell system matrix >>> Preconditioner: Shell PC >>> Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled >>> I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. >>> Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. >>> >>> For any performance question like this, we need to see the output of your code run with >>> >>> -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view >>> >>> Thanks, >>> >>> Matt >>> >>> Thank you for your time and assistance. 
>>> Best regards, >>> Yongzhong >>> ----------------------------------------------------------- >>> Yongzhong Li >>> PhD student | Electromagnetics Group >>> Department of Electrical & Computer Engineering >>> University of Toronto >>> https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!f6hk447h0PeWgp-IOpXXRde0Uf4pZtbzZwUlQYKJenqYBIzAjwjGNioup3__-D4K_wLEmzLfZt_-8QTQumto04oCrsvFX7ZD$ >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >>> -- Norbert Wiener >>> >>> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!f6hk447h0PeWgp-IOpXXRde0Uf4pZtbzZwUlQYKJenqYBIzAjwjGNioup3__-D4K_wLEmzLfZt_-8QTQumto04oCrrT0ZwRg$ >>> >>> >>> >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >>> -- Norbert Wiener >>> >>> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!f6hk447h0PeWgp-IOpXXRde0Uf4pZtbzZwUlQYKJenqYBIzAjwjGNioup3__-D4K_wLEmzLfZt_-8QTQumto04oCrrT0ZwRg$ >>> >>> >>> >>> -- >>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >>> -- Norbert Wiener >>> >>> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!f6hk447h0PeWgp-IOpXXRde0Uf4pZtbzZwUlQYKJenqYBIzAjwjGNioup3__-D4K_wLEmzLfZt_-8QTQumto04oCrrT0ZwRg$ >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From junchao.zhang at gmail.com Fri Jun 28 13:33:00 2024 From: junchao.zhang at gmail.com (Junchao Zhang) Date: Fri, 28 Jun 2024 13:33:00 -0500 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> <4D1A8BC2-66AD-4627-84B7-B12A18BA0983@petsc.dev> <55B35581-80F7-482D-B53A-35FCAF907554@petsc.dev> Message-ID: OK, then you need '--with-mkl_pardiso-dir='+os.environ['MKLROOT'] in petsc configure --Junchao Zhang On Fri, Jun 28, 2024 at 1:05?PM Pierre Jolivet wrote: > > > On 28 Jun 2024, at 7:20?PM, Junchao Zhang wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Hi, Yongzhong, > It is great to see you have made such good progress. Barry is right, > you need -vec_maxpy_use_gemv 1. It's my mistake for not mentioning it > earlier. But even with that, there are still problems. > petsc tries to optimize VecMDot/MAXPY with BLAS GEMV, with hope that > vendors' BLAS library would be highly optimized on that. However, we found > though they were good with VecMDot, but not with VecMAXPY. So by default > in petsc, we disabled the GEMV optimization for VecMAXPY. One can use > -vec_maxpy_use_gemv 1 to turn on it. > I turned it on and tested VecMAXPY with ex2k and MKL, but failed to see > any improvement with multiple threads. I could not understand why MKL is > so bad on it. You can try it yourself in your environment. > Without the GEMV optimization, VecMAXPY() is implemented by petsc with > a batch of PetscKernelAXPY() kernels, which contain simple for loops but > not OpenMP parallelized (since petsc does not support OpenMP outright) . 
I > added "omp parallel for" pragma in PetscKernelAXPY() kernels, and tested > ex2k again with now parallelized petsc. Here is the result. > > $ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m > 2 -test_name VecMAXPY -vec_maxpy_use_gemv 0 > Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) > -------------------------------------------------------------------------- > 128 7.0 10.1 21.4 72.7 > 256 7.9 12.9 29.5 101.0 > 512 9.4 17.2 40.5 136.2 > 1024 15.9 27.3 67.5 249.3 > 2048 26.5 48.7 139.6 432.7 > 4096 47.1 77.3 186.4 710.3 > 8192 84.8 152.2 423.9 1580.6 > 16384 154.9 298.5 792.1 2889.2 > 32768 183.7 338.7 893.9 3436.2 > 65536 639.1 1247.8 3219.1 12494.8 > 131072 1125.2 1856.2 6843.0 23653.7 > 262144 2603.2 4948.4 13259.4 51287.7 > 524288 5093.6 10305.0 26451.7 96919.6 > 1048576 5898.6 10947.2 45486.4 127352.8 > 2097152 11845.4 21912.5 57999.6 331403.4 > > $ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=16 ./ex2k -n 15 -m > 2 -test_name VecMAXPY -vec_maxpy_use_gemv 0 > Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) > -------------------------------------------------------------------------- > 128 17.0 16.1 31.5 112.9 > 256 13.7 16.8 31.2 120.2 > 512 14.5 18.1 33.9 129.9 > 1024 16.5 21.0 38.5 150.4 > 2048 18.5 22.1 41.8 171.4 > 4096 21.0 25.4 55.3 212.3 > 8192 27.0 30.3 68.6 251.9 > 16384 32.2 44.5 93.3 350.5 > 32768 45.8 65.0 149.8 558.8 > 65536 59.7 102.8 247.1 946.0 > 131072 100.7 186.4 485.3 1898.1 > 262144 183.4 345.2 922.2 3567.0 > 524288 339.6 676.8 1820.7 7530.4 > 1048576 662.0 1364.7 3585.3 13969.1 > 2097152 1379.7 2788.6 7414.0 28275.3 > > We can see VecMAXPY() can be easily speeded up with multithreading. > > For MatSolve, I checked petsc's aijmkl.c, and found we don't have > interface to MKL's sparse solve. > > > We do, it?s in src/mat/impls/aij/seq/mkl_pardiso, and it?s threaded (the > distributed version is in src/mat/impls/aij/mpi/mkl_cpardiso). > > Thanks, > Pierre > > I checked > https://urldefense.us/v3/__https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2023-0/openmp-threaded-functions-and-problems.html__;!!G_uCfscf7eWS!cw4NqFC1djrGs8b1mL87vRIl7UhAC-HTBFefyAgrI5AoRpxI-JFc1ejwiH0LcrfhGr0_nA_giCdoDZLyYbpe_-ec-aQc$ > , > but confused with MKL's list of threaded function > > - Direct sparse solver. > - All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse > Triangular solvers. > > I don't know whether MKL has threaded sparse solver. > > --Junchao Zhang > > > On Fri, Jun 28, 2024 at 11:35?AM Barry Smith wrote: > >> >> Are you running with -vec_maxpy_use_gemv ? >> >> >> On Jun 28, 2024, at 1:46?AM, Yongzhong Li >> wrote: >> >> Thanks all for your help!!! >> >> I think I find the issues. I am compiling a large CMake project that >> relies on many external libraries (projects). Previously, I used OpenBLAS >> as the BLAS for all the dependencies including PETSc. After I switched to >> Intel MKL for PETSc, I still kept the OpenBLAS and use it as the BLAS for >> all the other dependencies. I think somehow even when I specify the >> blas-lapack-dir to the MKLROOT when PETSc is configured, the actual program >> still use OpenBLAS as the BLAS for some PETSc functions, such as VecMDot() >> and VecMAXPY(), so that?s why I didn?t see any MKL verbose during the >> KSPSolve(). Now I remove the OpenBLAS and use Intel MKL as the BLAS for all >> the dependencies. The issue is resolved, I can clearly see MKL routines are >> called when KSP GMRES is running. 
>> >> Back to my original questions, my goal is to achieve good parallelization >> efficiency for KSP GMRES Solve. As I use multithreading-enabled MKL spmv >> routines, the wall time for MatMult/MatMultAdd() has been greatly reduced. >> However,the KSPGMRESOrthog and MatSolve in PCApply still take over 50% of >> solving time and can?t benefit from multithreading. *After I fixed the >> issue I mentioned, I found I got around 15% time reduced because of more >> efficient VecMDot() calls*. I attach a petsc log comparison for your >> reference (same settings, only difference is whether use MKL BLAS or not), >> you can see the percentage of VecMDot() is reduced. However, here comes the >> interesting part, *VecMAXPY() didn?t benefit from MKL BLAS, it still >> takes almost 40% of solution when I use 64 MKL Threads*, which is a lot >> for my program. And if I multiple this percentage with the actual wall time >> against different # of threads, it stays the same. Then I used ex2k >> benchmark to verify what I found. Here is the result, >> >> $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 5 -test_name VecMAXPY >> Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) >> -------------------------------------------------------------------------- >> 128 0.4 0.9 2.4 8.8 >> 256 0.3 1.1 3.5 13.3 >> 512 0.5 4.4 6.7 26.5 >> 1024 0.9 4.8 13.3 51.0 >> 2048 3.5 12.3 37.1 94.7 >> 4096 4.3 24.5 73.6 179.6 >> 8192 6.3 48.7 98.9 380.8 >> 16384 9.3 99.2 200.2 774.0 >> 32768 30.6 155.4 421.2 1662.9 >> 65536 101.2 269.4 827.4 3565.0 >> 131072 206.9 551.0 1829.0 7580.5 >> 262144 450.2 1251.9 3986.2 15525.6 >> 524288 1322.1 2901.7 8567.1 31840.0 >> 1048576 2788.6 6190.6 16394.7 63514.9 >> 2097152 5534.8 12619.9 35427.4 130064.5 >> $ MKL_NUM_THREADS=8 ./ex2k -n 15 -m 5 -test_name VecMAXPY >> Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) >> -------------------------------------------------------------------------- >> 128 0.3 0.7 2.4 8.8 >> 256 0.3 1.1 3.6 13.5 >> 512 0.5 4.4 6.8 26.4 >> 1024 0.9 4.8 13.6 50.5 >> 2048 7.6 12.2 36.5 95.0 >> 4096 8.5 25.7 72.4 182.6 >> 8192 11.9 48.5 103.7 383.7 >> 16384 12.8 97.7 203.7 785.0 >> 32768 11.2 148.5 421.9 1681.5 >> 65536 15.5 271.2 843.8 3613.7 >> 131072 34.3 564.7 1905.2 7558.8 >> 262144 106.4 1334.5 4002.8 15458.3 >> 524288 217.2 2858.4 8407.9 31303.7 >> 1048576 701.5 6060.6 16947.3 64118.5 >> 2097152 1769.7 13218.3 36347.3 131062.9 >> >> It stays the same, no benefit from multithreading BLAS!! Unlike what I >> found for VecMdot(), where I did see speed up for more #of threads. Then, I >> dig deeper. *I learned that for VecMDot(), it calls ZGEMV while for >> VecMAXPY(), it calls ZAXPY. This observation seems to indicate that ZAXPY >> is not benefiting from MKL threads.* >> >> My question is *do you know why ZAXPY is not multithreaded*? From my >> perspective, VecMDot() and VecMAXPY() are very similar operations, the >> only difference is whether we need to scale the vectors to be multiplied or >> not. I think you have mentioned that recently you did some optimization to >> these two routines*, from my above results and observations, are these >> aligned with your expectations*? Could we further optimize the codes to >> get more parallelization efficiency in my case? >> >> *And another question, can MatSolve() in KSPSolve be multithreaded? 
Would >> MUMPS help?* >> >> Thank you and regards, >> Yongzhong >> >> *From:* Junchao Zhang >> *Sent:* Thursday, June 27, 2024 11:10 AM >> *To:* Yongzhong Li >> *Cc:* Barry Smith ; petsc-users at mcs.anl.gov >> *Subject:* Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >> KSPSolve Performance Issue >> >> How big is the n when you call PetscCallBLAS("BLASgemv", BLASgemv_(trans, >> &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione))? n is >> the vector length in VecMDot. >> it is strange with MKL_VERBOSE=1 you did not see MKL_VERBOSE *ZGEMV..., *since >> the code did call gemv. Perhaps you need to double check your spelling etc. >> >> If you also use ex2k, and potentially modify Ms[] and Ns[] to match the >> sizes in your code, to see if there is a speedup with more threads. >> >> --Junchao Zhang >> >> >> On Thu, Jun 27, 2024 at 9:39?AM Yongzhong Li < >> yongzhong.li at mail.utoronto.ca> wrote: >> >> Mostly 3, maximum 7, but definitely hit the point when m > 1, I can see >> the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, >> xarray, &ione, &zero, z + i, &ione)); is called multiple >> ZjQcmQRYFpfptBannerStart >> *This Message Is From an External Sender* >> This message came from outside your organization. >> >> ZjQcmQRYFpfptBannerEnd >> Mostly 3, maximum 7, but definitely hit the point when m > 1, >> >> I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, >> yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple >> times >> >> >> *From: *Barry Smith >> *Date: *Thursday, June 27, 2024 at 1:12?AM >> *To: *Yongzhong Li >> *Cc: *petsc-users at mcs.anl.gov >> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >> KSPSolve Performance Issue >> >> How big are the m's getting in your code? >> >> >> >> >> On Jun 27, 2024, at 12:40?AM, Yongzhong Li >> wrote: >> >> Hi Barry, I used gdb to debug my program, set a breakpoint to >> VecMultiDot_Seq_GEMV function. I did see when I debug this function, it >> will call BLAS (but not always, only if m > 1), as shown below. However, I >> still didn?t see any MKL outputs even if I set MKLK_VERBOSE=1. >> >> *(gdb) * >> *550 PetscCall(VecRestoreArrayRead(yin[i], &yfirst));* >> *(gdb) * >> *553 m = j - i;* >> *(gdb) * >> *554 if (m > 1) {* >> *(gdb) * >> *555 PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the >> cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above* >> *(gdb) * >> *556 PetscScalar one = 1, zero = 0;* >> *(gdb) * >> *558 PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, >> &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione));* >> *(gdb) s* >> *PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> >> "VecMultiDot_Seq_GEMV",* >> * file=0x7ffff68a1078 >> "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c")* >> * at >> /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106* >> *106 if (!TRdebug) return PETSC_SUCCESS;* >> *(gdb) * >> *154 }* >> >> Am I not using MKL BLAS, is that why I didn?t see multithreading speed up >> for KSPGMRESOrthog? What do you think could be the potential reasons? Is >> there any silent mode that will possibly affect the MKL Verbose. 
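A quick way to answer the "am I really using MKL BLAS" question is to check which libraries the final executable is bound to at run time, since a superbuild can easily resolve BLAS symbols from a different library than the one PETSc was configured with. A hedged sketch, with the binary name as a placeholder:

$ ldd ./myapp | grep -iE 'mkl|openblas'             # which BLAS/LAPACK shared libraries are actually loaded
$ MKL_VERBOSE=1 ./myapp 2>&1 | grep MKL_VERBOSE     # MKL only prints these lines when its routines are really entered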
>> >> Thank you and best regards, >> Yongzhong >> >> >> *From: *Barry Smith >> *Date: *Wednesday, June 26, 2024 at 8:15?PM >> *To: *Yongzhong Li >> *Cc: *petsc-users at mcs.anl.gov >> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >> KSPSolve Performance Issue >> >> if (m > 1) { >> PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is >> safe since we've screened out those lda > PETSC_BLAS_INT_MAX above >> PetscScalar one = 1, zero = 0; >> >> PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, >> &lda2, xarray, &ione, &zero, z + i, &ione)); >> PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0))); >> >> The call to BLAS above is where it uses MKL. >> >> >> >> >> >> On Jun 26, 2024, at 6:59?PM, Yongzhong Li >> wrote: >> >> Hi Barry, I am looking into the source codes of VecMultiDot_Seq_GEMV >> https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!cw4NqFC1djrGs8b1mL87vRIl7UhAC-HTBFefyAgrI5AoRpxI-JFc1ejwiH0LcrfhGr0_nA_giCdoDZLyYbpe_wPb-7VN$ >> >> Can I ask which lines of codes suggest the use of intel mkl? >> >> Thanks, >> Yongzhong >> >> >> *From: *Barry Smith >> *Date: *Wednesday, June 26, 2024 at 10:30?AM >> *To: *Yongzhong Li >> *Cc: *petsc-users at mcs.anl.gov >> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >> KSPSolve Performance Issue >> >> In a debug version of PETSc run your application in a debugger and put >> a break point in VecMultiDot_Seq_GEMV. Then next through the code from >> that point to see what decision it makes about using dgemv() to see why it >> is not getting into the Intel code. >> >> >> >> >> >> On Jun 25, 2024, at 11:19?PM, Yongzhong Li >> wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. >> >> Hi Junchao, thank you for your help for these benchmarking test! >> >> I check out to petsc/main and did a few things to verify from my side, >> >> 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute >> node. 
The results are as follow, >> $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 >> Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) >> -------------------------------------------------------------------------- >> 128 14.5 1.2 1.8 5.2 >> 256 1.5 0.9 1.6 4.7 >> 512 2.7 2.8 6.1 13.2 >> 1024 4.0 4.0 9.3 16.4 >> 2048 7.4 7.3 11.3 39.3 >> 4096 14.2 13.9 19.1 93.4 >> 8192 28.8 26.3 25.4 31.3 >> 16384 54.1 25.8 26.7 33.8 >> 32768 109.8 25.7 24.2 56.0 >> 65536 220.2 24.4 26.5 89.0 >> 131072 424.1 31.5 36.1 149.6 >> 262144 898.1 37.1 53.9 286.1 >> 524288 1754.6 48.7 100.3 1122.2 >> 1048576 3645.8 86.5 347.9 2950.4 >> 2097152 7371.4 308.7 1440.6 6874.9 >> >> $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 >> Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) >> -------------------------------------------------------------------------- >> 128 14.9 1.2 1.9 5.2 >> 256 1.5 1.0 1.7 4.7 >> 512 2.7 2.8 6.1 12.0 >> 1024 3.9 4.0 9.3 16.8 >> 2048 7.4 7.3 10.4 41.3 >> 4096 14.0 13.8 18.6 84.2 >> 8192 27.0 21.3 43.8 177.5 >> 16384 54.1 34.1 89.1 330.4 >> 32768 110.4 82.1 203.5 781.1 >> 65536 213.0 191.8 423.9 1696.4 >> 131072 428.7 360.2 934.0 4080.0 >> 262144 883.4 723.2 1745.6 10120.7 >> 524288 1817.5 1466.1 4751.4 23217.2 >> 1048576 3611.0 3796.5 11814.9 48687.7 >> 2097152 7401.9 10592.0 27543.2 106565.4 >> >> I can see the speed up brought by more MKL threads, and if I set >> NKL_VERBOSE to 1, I can see something like >> >> >> >> *MKL_VERBOSE >> ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) >> 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca*From my understanding, >> the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute >> node and is using ZGEMV MKL BLAS. >> >> However, when I ran my own program and set MKL_VERBOSE to 1, it is very >> strange that I still can?t find any MKL outputs, though I can see from the >> PETSc log that VecMDot and VecMAXPY() are called. >> >> I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a >> way that is similar to ex2k test? Should I expect to see MKL outputs for >> whatever linear system I solve with KSPGMRES? Does it relate to if it is >> dense matrix or sparse matrix, although I am not really understand why >> VecMDot/MAXPY() have something to do with dense matrix-vector >> multiplication. >> >> Thank you, >> >> Yongzhong >> >> *From: *Junchao Zhang >> *Date: *Tuesday, June 25, 2024 at 6:34?PM >> *To: *Matthew Knepley >> *Cc: *Yongzhong Li , Pierre Jolivet < >> pierre at joliv.et>, petsc-users at mcs.anl.gov >> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >> KSPSolve Performance Issue >> Hi, Yongzhong, >> Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if >> we can speed up the two with OpenMP threads, then we can speed up >> KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in >> dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny >> matrices ). So with MKL_VERBOSE=1, you should see something like >> "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with >> petsc/main? >> petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran >> VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was >> strange to see no speedup. 
I then configured petsc with openblas, I did >> see better performance with more threads >> >> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 >> Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) >> -------------------------------------------------------------------------- >> 128 2.0 2.5 6.1 >> 256 1.8 2.7 7.0 >> 512 2.1 3.1 8.6 >> 1024 2.7 4.0 12.3 >> 2048 3.8 6.3 28.0 >> 4096 6.1 10.6 42.4 >> 8192 10.9 21.8 79.5 >> 16384 21.2 39.4 149.6 >> 32768 45.9 75.7 224.6 >> 65536 142.2 215.8 732.1 >> 131072 169.1 233.2 1729.4 >> 262144 367.5 830.0 4159.2 >> 524288 999.2 1718.1 8538.5 >> 1048576 2113.5 4082.1 18274.8 >> 2097152 5392.6 10273.4 43273.4 >> >> >> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 >> Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) >> -------------------------------------------------------------------------- >> 128 2.0 2.5 6.0 >> 256 1.8 2.7 15.0 >> 512 2.1 9.0 16.6 >> 1024 2.6 8.7 16.1 >> 2048 7.7 10.3 20.5 >> 4096 9.9 11.4 25.9 >> 8192 14.5 22.1 39.6 >> 16384 25.1 27.8 67.8 >> 32768 44.7 95.7 91.5 >> 65536 82.1 156.8 165.1 >> 131072 194.0 335.1 341.5 >> 262144 388.5 380.8 612.9 >> 524288 1046.7 967.1 1653.3 >> 1048576 1997.4 2169.0 4034.4 >> 2097152 5502.9 5787.3 12608.1 >> >> The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The >> average speedup depends on components. So I suggest you run ex2k to see in >> your environment whether oneMKL can speedup the kernels. >> >> --Junchao Zhang >> >> >> On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang >> wrote: >> Let me run some examples on our end to see whether the code calls >> expected functions. >> >> --Junchao Zhang >> >> >> On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley >> wrote: >> On Mon, Jun 24, 2024 at 11: 21 AM Yongzhong Li > utoronto. ca> wrote: Thank you Pierre for your information. Do we have a >> conclusion for my original question about the parallelization efficiency >> for different stages of >> ZjQcmQRYFpfptBannerStart >> *This Message Is From an External Sender* >> This message came from outside your organization. >> >> ZjQcmQRYFpfptBannerEnd >> On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li < >> yongzhong.li at mail.utoronto.ca> wrote: >> >> Thank you Pierre for your information. Do we have a conclusion for my >> original question about the parallelization efficiency for different stages >> of KSP Solve? Do we need to do more testing to figure out the issues? Thank >> you, Yongzhong From: >> ZjQcmQRYFpfptBannerStart >> *This Message Is From an External Sender* >> This message came from outside your organization. >> >> ZjQcmQRYFpfptBannerEnd >> Thank you Pierre for your information. Do we have a conclusion for my >> original question about the parallelization efficiency for different stages >> of KSP Solve? Do we need to do more testing to figure out the issues? >> >> >> We have an extended discussion of this here: >> https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!cw4NqFC1djrGs8b1mL87vRIl7UhAC-HTBFefyAgrI5AoRpxI-JFc1ejwiH0LcrfhGr0_nA_giCdoDZLyYbpe_5b-GJoq$ >> >> >> The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, >> etc) are memory bandwidth limited. If there is no more bandwidth to be >> marshalled on your board, then adding more processes does nothing at all. >> This is why people were asking about how many "nodes" you are running on, >> because that is the unit of memory bandwidth, not "cores" which make little >> difference. 
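A rough roofline estimate makes this concrete. For a complex AXPY, each entry streams about three 16-byte values (load x, load y, store y) for roughly 8 flops, so with an assumed, purely illustrative 100 GB/s of sustainable memory bandwidth per socket the ceiling is about

    arithmetic intensity ~ 8 flops / 48 bytes ~ 0.17 flop/byte
    ceiling              ~ 100 GB/s * 0.17    ~ 17 GFLOP/s

independent of the number of cores: once a few threads saturate the memory bandwidth, adding more threads (or processes on the same socket) cannot increase the rate.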
>> >> Thanks, >> >> Matt >> >> >> Thank you, >> Yongzhong >> >> >> *From: *Pierre Jolivet >> *Date: *Sunday, June 23, 2024 at 12:41?AM >> *To: *Yongzhong Li >> *Cc: *petsc-users at mcs.anl.gov >> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >> KSPSolve Performance Issue >> >> >> >> >> On 23 Jun 2024, at 4:07?AM, Yongzhong Li >> wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. >> Yeah, I ran my program again using -mat_view::ascii_info and set >> MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix >> to be seqaijmkl type (I?ve attached a few as below) >> >> --> Setting up matrix-vector products... >> >> Mat Object: 1 MPI process >> type: seqaijmkl >> rows=16490, cols=35937 >> total: nonzeros=128496, allocated nonzeros=128496 >> total number of mallocs used during MatSetValues calls=0 >> not using I-node routines >> Mat Object: 1 MPI process >> type: seqaijmkl >> rows=16490, cols=35937 >> total: nonzeros=128496, allocated nonzeros=128496 >> total number of mallocs used during MatSetValues calls=0 >> not using I-node routines >> >> --> Solving the system... >> >> Excitation 1 of 1... >> >> ================================================ >> Iterative solve completed in 7435 ms. >> CONVERGED: rtol. >> Iterations: 72 >> Final relative residual norm: 9.22287e-07 >> ================================================ >> [CPU TIME] System solution: 2.27160000e+02 s. >> [WALL TIME] System solution: 7.44387218e+00 s. >> >> However, it seems to me that there were still no MKL outputs even I set >> MKL_VERBOSE to be 1. Although, I think it should be many spmv operations >> when doing KSPSolve(). Do you see the possible reasons? >> >> >> SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS >> is. >> >> Thanks, >> Pierre >> >> >> >> Thanks, >> Yongzhong >> >> >> >> *From: *Matthew Knepley >> *Date: *Saturday, June 22, 2024 at 5:56?PM >> *To: *Yongzhong Li >> *Cc: *Junchao Zhang , Pierre Jolivet < >> pierre at joliv.et>, petsc-users at mcs.anl.gov >> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >> KSPSolve Performance Issue >> ????????? knepley at gmail.com ????????????????? >> >> On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li < >> yongzhong.li at mail.utoronto.ca> wrote: >> >> MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 >> MKL_VERBOSE Intel(R) MKL 2019. 0 Update 4 Product build 20190411 for >> Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) >> AVX-512) with support of Vector >> ZjQcmQRYFpfptBannerStart >> *This Message Is From an External Sender* >> This message came from outside your organization. 
>> >> ZjQcmQRYFpfptBannerEnd >> MKL_VERBOSE=1 ./ex1 >> >> matrix nonzeros = 100, allocated nonzeros = 100 >> MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for >> Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) >> AVX-512) with support of Vector Neural Network Instructions enabled >> processors, Lnx 2.50GHz lp64 gnu_thread >> MKL_VERBOSE >> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) >> 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) >> 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE >> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) >> 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE >> ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) >> 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 >> FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE >> ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) >> 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 >> FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE >> ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) >> 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) >> 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE >> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) >> 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE >> 
ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) >> 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE >> ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) >> 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE >> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) >> 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE >> ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) >> 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF >> Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) >> 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE >> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) >> 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE >> ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) >> 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF >> Dyn:1 FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns >> CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >> >> Yes, for petsc example, there are MKL outputs, but for my own program. >> All I did is to change the matrix type from MATAIJ to MATAIJMKL to get >> optimized performance for spmv from MKL. Should I expect to see any MKL >> outputs in this case? >> >> >> Are you sure that the type changed? You can MatView() the matrix with >> format ascii_info to see. >> >> Thanks, >> >> Matt >> >> >> >> Thanks, >> Yongzhong >> >> >> *From: *Junchao Zhang >> *Date: *Saturday, June 22, 2024 at 9:40?AM >> *To: *Yongzhong Li >> *Cc: *Pierre Jolivet , petsc-users at mcs.anl.gov < >> petsc-users at mcs.anl.gov> >> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >> KSPSolve Performance Issue >> No, you don't. It is strange. Perhaps you can you run a petsc example >> first and see if MKL is really used >> $ cd src/mat/tests >> $ make ex1 >> $ MKL_VERBOSE=1 ./ex1 >> >> --Junchao Zhang >> >> >> On Fri, Jun 21, 2024 at 4:03?PM Yongzhong Li < >> yongzhong.li at mail.utoronto.ca> wrote: >> >> I am using >> >> export MKL_VERBOSE=1 >> ./xx >> >> in the bash file, do I have to use - ksp_converged_reason? >> >> Thanks, >> Yongzhong >> >> >> *From: *Pierre Jolivet >> *Date: *Friday, June 21, 2024 at 1:47?PM >> *To: *Yongzhong Li >> *Cc: *Junchao Zhang , petsc-users at mcs.anl.gov < >> petsc-users at mcs.anl.gov> >> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >> KSPSolve Performance Issue >> ????????? pierre at joliv.et ????????????????? >> >> How do you set the variable? 
>> >> $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason >> MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 >> architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled >> processors, Lnx 2.80GHz lp64 intel_thread >> MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 >> FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 >> FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 >> FastMM:1 TID:0 NThr:1 >> MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 >> TID:0 NThr:1 >> [...] >> >> >> On 21 Jun 2024, at 7:37?PM, Yongzhong Li >> wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. >> Hello all, >> >> I set MKL_VERBOSE = 1, but observed no print output specific to the use >> of MKL. Does PETSc enable this verbose output? >> >> Best, >> >> Yongzhong >> >> >> *From: *Pierre Jolivet >> *Date: *Friday, June 21, 2024 at 1:36?AM >> *To: *Junchao Zhang >> *Cc: *Yongzhong Li , >> petsc-users at mcs.anl.gov >> *Subject: *Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc >> KSPSolve Performance Issue >> ????????? pierre at joliv.et ????????????????? >> >> >> >> >> On 21 Jun 2024, at 6:42?AM, Junchao Zhang >> wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. >> I remember there are some MKL env vars to print MKL routines called. >> >> >> The environment variable is MKL_VERBOSE >> >> Thanks, >> Pierre >> >> >> Maybe we can try it to see what MKL routines are really used and then we >> can understand why some petsc functions did not speed up >> >> --Junchao Zhang >> >> >> On Thu, Jun 20, 2024 at 10:39?PM Yongzhong Li < >> yongzhong.li at mail.utoronto.ca> wrote: >> >> *This Message Is From an External Sender* >> This message came from outside your organization. >> >> Hi Barry, sorry for my last results. I didn?t fully understand the stage >> profiling and logging in PETSc, now I only record KSPSolve() stage of my >> program. Some sample codes are as follow, >> >> // Static variable to keep track of the stage counter >> static int stageCounter = 1; >> >> // Generate a unique stage name >> std::ostringstream oss; >> oss << "Stage " << stageCounter << " of Code"; >> std::string stageName = oss.str(); >> >> // Register the stage >> PetscLogStage stagenum; >> >> PetscLogStageRegister(stageName.c_str(), &stagenum); >> PetscLogStagePush(stagenum); >> >> *KSPSolve(*ksp_ptr, b, x);* >> >> PetscLogStagePop(); >> stageCounter++; >> >> I have attached my new logging results, there are 1 main stage and 4 >> other stages where each one is KSPSolve() call. >> >> To provide some additional backgrounds, if you recall, I have been trying >> to get efficient iterative solution using multithreading. I found out by >> compiling PETSc with Intel MKL library instead of OpenBLAS, I am able to >> perform sparse matrix-vector multiplication faster, I am using >> MATSEQAIJMKL. This makes the shell matrix vector product in each iteration >> scale well with the #of threads. However, I found out the total GMERS solve >> time (~KSPSolve() time) is not scaling well the #of threads. >> >> From the logging results I learned that when performing KSPSolve(), there >> are some CPU overheads in PCApply() and KSPGMERSOrthog(). I ran my programs >> using different number of threads and plotted the time consumption for >> PCApply() and KSPGMERSOrthog() against #of thread. 
I found out these two >> operations are not scaling with the threads at all! My results are attached >> as the pdf to give you a clear view. >> >> My questions is, >> >> From my understanding, in PCApply, MatSolve() is involved, >> KSPGMERSOrthog() will have many vector operations, so why these two parts >> can?t scale well with the # of threads when the intel MKL library is linked? >> >> Thank you, >> Yongzhong >> >> >> *From: *Barry Smith >> *Date: *Friday, June 14, 2024 at 11:36?AM >> *To: *Yongzhong Li >> *Cc: *petsc-users at mcs.anl.gov , >> petsc-maint at mcs.anl.gov , Piero Triverio < >> piero.triverio at utoronto.ca> >> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve >> Performance Issue >> >> I am a bit confused. Without the initial guess computation, there are >> still a bunch of events I don't understand >> >> MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 >> MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> >> in addition there are many more VecMAXPY then VecMDot (in GMRES they are >> each done the same number of times) >> >> VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 >> 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 >> VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 >> 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 >> >> Finally there are a huge number of >> >> MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 >> 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 >> >> Are you making calls to all these routines? Are you doing this inside >> your MatMult() or before you call KSPSolve? >> >> The reason I wanted you to make a simpler run without the initial guess >> code is that your events are far more complicated than would be produced by >> GMRES alone so it is not possible to understand the behavior you are seeing >> without fully understanding all the events happening in the code. >> >> Barry >> >> >> >> On Jun 14, 2024, at 1:19?AM, Yongzhong Li >> wrote: >> >> Thanks, I have attached the results without using any KSPGuess. At low >> frequency, the iteration steps are quite close to the one with KSPGuess, >> specifically >> >> KSPGuess Object: 1 MPI process >> type: fischer >> Model 1, size 200 >> >> However, I found at higher frequency, the # of iteration steps are >> significant higher than the one with KSPGuess, I have attahced both of the >> results for your reference. >> >> Moreover, could I ask why the one without the KSPGuess options can be >> used for a baseline comparsion? What are we comparing here? How does it >> relate to the performance issue/bottleneck I found? 
?*I have noticed >> that the time taken by **KSPSolve** is **almost two times **greater than >> the CPU time for matrix-vector product multiplied by the number of >> iteration*? >> >> Thank you! >> Yongzhong >> >> >> *From: *Barry Smith >> *Date: *Thursday, June 13, 2024 at 2:14?PM >> *To: *Yongzhong Li >> *Cc: *petsc-users at mcs.anl.gov , >> petsc-maint at mcs.anl.gov , Piero Triverio < >> piero.triverio at utoronto.ca> >> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve >> Performance Issue >> >> Can you please run the same thing without the KSPGuess option(s) for a >> baseline comparison? >> >> Thanks >> >> Barry >> >> >> On Jun 13, 2024, at 1:27?PM, Yongzhong Li >> wrote: >> >> This Message Is From an External Sender >> This message came from outside your organization. >> Hi Matt, >> >> I have rerun the program with the keys you provided. The system output >> when performing ksp solve and the final petsc log output were stored in a >> .txt file attached for your reference. >> >> Thanks! >> Yongzhong >> >> >> *From: *Matthew Knepley >> *Date: *Wednesday, June 12, 2024 at 6:46?PM >> *To: *Yongzhong Li >> *Cc: *petsc-users at mcs.anl.gov , >> petsc-maint at mcs.anl.gov , Piero Triverio < >> piero.triverio at utoronto.ca> >> *Subject: *Re: [petsc-maint] Assistance Needed with PETSc KSPSolve >> Performance Issue >> ????????? knepley at gmail.com ????????????????? >> >> On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li < >> yongzhong.li at mail.utoronto.ca> wrote: >> >> Dear PETSc?s developers, I hope this email finds you well. I am currently >> working on a project using PETSc and have encountered a performance issue >> with the KSPSolve function. Specifically, I have noticed that the time >> taken by KSPSolve is >> ZjQcmQRYFpfptBannerStart >> *This Message Is From an External Sender* >> This message came from outside your organization. >> >> ZjQcmQRYFpfptBannerEnd >> Dear PETSc?s developers, >> I hope this email finds you well. >> I am currently working on a project using PETSc and have encountered a >> performance issue with the KSPSolve function. Specifically, *I have >> noticed that the time taken by **KSPSolve** is **almost two times **greater >> than the CPU time for matrix-vector product multiplied by the number of >> iteration steps*. I use C++ chrono to record CPU time. >> For context, I am using a shell system matrix A. Despite my efforts to >> parallelize the matrix-vector product (Ax), the overall solve time >> remains higher than the matrix vector product per iteration indicates >> when multiple threads were used. Here are a few details of my setup: >> >> - *Matrix Type*: Shell system matrix >> - *Preconditioner*: Shell PC >> - *Parallel Environment*: Using Intel MKL as PETSc?s BLAS/LAPACK >> library, multithreading is enabled >> >> I have considered several potential reasons, such as preconditioner >> setup, additional solver operations, and the inherent overhead of using a >> shell system matrix. *However, since KSPSolve is a high-level API, I >> have been unable to pinpoint the exact cause of the increased solve time.* >> Have you observed the same issue? Could you please provide some >> experience on how to diagnose and address this performance discrepancy? >> Any insights or recommendations you could offer would be greatly >> appreciated. 
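For readers unfamiliar with the setup being described, below is a generic, self-contained sketch of a shell matrix driven by KSP. It is not the poster's code: the stand-in operator is simply y = 2x, and a real application would install its own mat-vec (for example an MKL-threaded one) and a shell preconditioner in place of PCNONE. The diagnostic options requested in the reply that follows can then be passed directly on the command line.

#include <petscksp.h>

/* Stand-in operator application: y = 2*x. In a real application this is where
   the user's (possibly MKL-threaded) matrix-vector product would go. */
static PetscErrorCode MyMatMult(Mat A, Vec x, Vec y)
{
  PetscFunctionBeginUser;
  PetscCall(VecCopy(x, y));
  PetscCall(VecScale(y, 2.0));
  PetscFunctionReturn(PETSC_SUCCESS);
}

int main(int argc, char **argv)
{
  Mat      A;
  Vec      b, x;
  KSP      ksp;
  PC       pc;
  PetscInt n = 100;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  /* Shell matrix: PETSc only knows how to apply it through MyMatMult */
  PetscCall(MatCreateShell(PETSC_COMM_WORLD, n, n, PETSC_DETERMINE, PETSC_DETERMINE, NULL, &A));
  PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MyMatMult));
  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));

  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCNONE));  /* a real code would install a shell PC here instead */
  PetscCall(KSPSetFromOptions(ksp)); /* so -ksp_view -ksp_monitor_true_residual -log_view apply */
  PetscCall(KSPSolve(ksp, b, x));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(MatDestroy(&A));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(PetscFinalize());
  return 0;
}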
>> >> >> For any performance question like this, we need to see the output of your >> code run with >> >> -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view >> >> Thanks, >> >> Matt >> >> >> Thank you for your time and assistance. >> Best regards, >> Yongzhong >> ----------------------------------------------------------- >> *Yongzhong Li* >> PhD student | Electromagnetics Group >> Department of Electrical & Computer Engineering >> University of Toronto >> https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!cw4NqFC1djrGs8b1mL87vRIl7UhAC-HTBFefyAgrI5AoRpxI-JFc1ejwiH0LcrfhGr0_nA_giCdoDZLyYbpe_92mZO9M$ >> >> >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!cw4NqFC1djrGs8b1mL87vRIl7UhAC-HTBFefyAgrI5AoRpxI-JFc1ejwiH0LcrfhGr0_nA_giCdoDZLyYbpe_xKEEYRn$ >> >> >> >> >> >> >> >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!cw4NqFC1djrGs8b1mL87vRIl7UhAC-HTBFefyAgrI5AoRpxI-JFc1ejwiH0LcrfhGr0_nA_giCdoDZLyYbpe_xKEEYRn$ >> >> >> >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which their >> experiments lead. >> -- Norbert Wiener >> >> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!cw4NqFC1djrGs8b1mL87vRIl7UhAC-HTBFefyAgrI5AoRpxI-JFc1ejwiH0LcrfhGr0_nA_giCdoDZLyYbpe_xKEEYRn$ >> >> >> >> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Fri Jun 28 16:11:01 2024 From: bsmith at petsc.dev (Barry Smith) Date: Fri, 28 Jun 2024 17:11:01 -0400 Subject: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue In-Reply-To: References: <5BB0F171-02ED-4ED7-A80B-C626FA482108@petsc.dev> <8177C64C-1C0E-4BD0-9681-7325EB463DB3@petsc.dev> <1B237F44-C03C-4FD9-8B34-2281D557D958@joliv.et> <660A31B0-E6AA-4A4F-85D0-DB5FEAF8527F@joliv.et> <4D1A8BC2-66AD-4627-84B7-B12A18BA0983@petsc.dev> <55B35581-80F7-482D-B53A-35FCAF907554@petsc.dev> Message-ID: In the branch I introduce at https://urldefense.us/v3/__https://gitlab.com/petsc/petsc/-/merge_requests/7658__;!!G_uCfscf7eWS!bDIUaHqh4wl1qOCJcAguXJHWFoOm4VTVCLhaVCpC9UKNvTShW_jjtq_DiWBWpbj0cSsF0wPyRAJm2hzppiiRwaE$ On src/ksp/ksp/tutorials/ex45.c on a good memory bandwidth Intel system using the options OMP_NUM_THREADS=1-64 ./ex45 -pc_type none -da_refine 4 -ksp_max_it 100 -log_view -ksp_gmres_preallocate true -blas_view -vec_maxpy_use_gemv -ksp_converged_reason -ksp_view I get pretty good performance improvement with speedups of 24 MatMult, 15 VecMdot, 27 VecMAXPY I also get good speedups on an M2 Mac. I am trying, slowly, to improve the experience of OpenMP users of PETSc (with or without MPI). 
$ OMP_NUM_THREADS=1 ./ex45 -pc_type none -da_refine 4 -ksp_max_it 100 -ksp_monitor -log_view -ksp_gmres_preallocate true -blas_view -vec_maxpy_use_gemv -ksp_converged_reason -ksp_view | egrep -e MatMult -e MAXPY -e MDot MatMult 104 1.0 9.6804e-01 1.0 1.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 17 0 0 0 28 17 0 0 0 1263 VecMDot 100 1.0 8.6170e-01 1.0 2.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00 25 38 0 0 0 25 38 0 0 0 3072 VecMAXPY 104 1.0 9.4225e-01 1.0 2.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 40 0 0 0 27 40 0 0 0 3003 $ OMP_NUM_THREADS=2 ./ex45 -pc_type none -da_refine 4 -ksp_max_it 100 -ksp_monitor -log_view -ksp_gmres_preallocate true -blas_view -vec_maxpy_use_gemv -ksp_converged_reason -ksp_view | egrep -e MatMult -e MAXPY -e MDot MatMult 104 1.0 5.2966e-01 1.0 1.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00 25 17 0 0 0 25 17 0 0 0 2308 VecMDot 100 1.0 4.8504e-01 1.0 2.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00 23 38 0 0 0 23 38 0 0 0 5457 VecMAXPY 104 1.0 4.9861e-01 1.0 2.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00 23 40 0 0 0 23 40 0 0 0 5674 $ OMP_NUM_THREADS=4 ./ex45 -pc_type none -da_refine 4 -ksp_max_it 100 -ksp_monitor -log_view -ksp_gmres_preallocate true -blas_view -vec_maxpy_use_gemv -ksp_converged_reason -ksp_view | egrep -e MatMult -e MAXPY -e MDot MatMult 104 1.0 2.9545e-01 1.0 1.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00 20 17 0 0 0 20 17 0 0 0 4137 VecMDot 100 1.0 2.7871e-01 1.0 2.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00 18 38 0 0 0 18 38 0 0 0 9496 VecMAXPY 104 1.0 2.9648e-01 1.0 2.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00 20 40 0 0 0 20 40 0 0 0 9543 $ OMP_NUM_THREADS=8 ./ex45 -pc_type none -da_refine 4 -ksp_max_it 100 -ksp_monitor -log_view -ksp_gmres_preallocate true -blas_view -vec_maxpy_use_gemv -ksp_converged_reason -ksp_view | egrep -e MatMult -e MAXPY -e MDot MatMult 104 1.0 1.6195e-01 1.0 1.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 17 0 0 0 15 17 0 0 0 7547 VecMDot 100 1.0 1.4976e-01 1.0 2.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 38 0 0 0 14 38 0 0 0 17674 VecMAXPY 104 1.0 1.5640e-01 1.0 2.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 40 0 0 0 14 40 0 0 0 18090 $ OMP_NUM_THREADS=16 ./ex45 -pc_type none -da_refine 4 -ksp_max_it 100 -ksp_monitor -log_view -ksp_gmres_preallocate true -blas_view -vec_maxpy_use_gemv -ksp_converged_reason -ksp_view | egrep -e MatMult -e MAXPY -e MDot MatMult 104 1.0 9.0148e-02 1.0 1.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00 11 17 0 0 0 11 17 0 0 0 13558 VecMDot 100 1.0 6.9346e-02 1.0 2.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00 8 38 0 0 0 8 38 0 0 0 38167 VecMAXPY 104 1.0 6.9274e-02 1.0 2.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00 8 40 0 0 0 8 40 0 0 0 40842 $ OMP_NUM_THREADS=24 ./ex45 -pc_type none -da_refine 4 -ksp_max_it 100 -ksp_monitor -log_view -ksp_gmres_preallocate true -blas_view -vec_maxpy_use_gemv -ksp_converged_reason -ksp_view | egrep -e MatMult -e MAXPY -e MDot MatMult 104 1.0 7.4300e-02 1.0 1.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00 8 17 0 0 0 8 17 0 0 0 16450 VecMDot 100 1.0 5.7918e-02 1.0 2.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00 7 38 0 0 0 7 38 0 0 0 45698 VecMAXPY 104 1.0 5.2788e-02 1.0 2.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00 6 40 0 0 0 6 40 0 0 0 53597 $ OMP_NUM_THREADS=64 ./ex45 -pc_type none -da_refine 4 -ksp_max_it 100 -ksp_monitor -log_view -ksp_gmres_preallocate true -blas_view -vec_maxpy_use_gemv -ksp_converged_reason -ksp_view | egrep -e MatMult -e MAXPY -e MDot MatMult 104 1.0 6.4348e-02 1.0 1.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00 8 17 0 0 0 8 17 0 0 0 18994 VecMDot 100 1.0 5.8682e-02 1.0 2.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00 8 38 0 0 0 8 38 0 0 0 45103 VecMAXPY 104 1.0 3.4760e-02 1.0 2.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00 
4 40 0 0 0 4 40 0 0 0 81394 > On Jun 28, 2024, at 2:33?PM, Junchao Zhang wrote: > > This Message Is From an External Sender > This message came from outside your organization. > OK, then you need '--with-mkl_pardiso-dir='+os.environ['MKLROOT'] in petsc configure > > --Junchao Zhang > > > On Fri, Jun 28, 2024 at 1:05?PM Pierre Jolivet > wrote: >> >> >>> On 28 Jun 2024, at 7:20?PM, Junchao Zhang > wrote: >>> >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> Hi, Yongzhong, >>> It is great to see you have made such good progress. Barry is right, you need -vec_maxpy_use_gemv 1. It's my mistake for not mentioning it earlier. But even with that, there are still problems. >>> petsc tries to optimize VecMDot/MAXPY with BLAS GEMV, with hope that vendors' BLAS library would be highly optimized on that. However, we found though they were good with VecMDot, but not with VecMAXPY. So by default in petsc, we disabled the GEMV optimization for VecMAXPY. One can use -vec_maxpy_use_gemv 1 to turn on it. >>> I turned it on and tested VecMAXPY with ex2k and MKL, but failed to see any improvement with multiple threads. I could not understand why MKL is so bad on it. You can try it yourself in your environment. >>> Without the GEMV optimization, VecMAXPY() is implemented by petsc with a batch of PetscKernelAXPY() kernels, which contain simple for loops but not OpenMP parallelized (since petsc does not support OpenMP outright) . I added "omp parallel for" pragma in PetscKernelAXPY() kernels, and tested ex2k again with now parallelized petsc. Here is the result. >>> >>> $ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 2 -test_name VecMAXPY -vec_maxpy_use_gemv 0 >>> Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) >>> -------------------------------------------------------------------------- >>> 128 7.0 10.1 21.4 72.7 >>> 256 7.9 12.9 29.5 101.0 >>> 512 9.4 17.2 40.5 136.2 >>> 1024 15.9 27.3 67.5 249.3 >>> 2048 26.5 48.7 139.6 432.7 >>> 4096 47.1 77.3 186.4 710.3 >>> 8192 84.8 152.2 423.9 1580.6 >>> 16384 154.9 298.5 792.1 2889.2 >>> 32768 183.7 338.7 893.9 3436.2 >>> 65536 639.1 1247.8 3219.1 12494.8 >>> 131072 1125.2 1856.2 6843.0 23653.7 >>> 262144 2603.2 4948.4 13259.4 51287.7 >>> 524288 5093.6 10305.0 26451.7 96919.6 >>> 1048576 5898.6 10947.2 45486.4 127352.8 >>> 2097152 11845.4 21912.5 57999.6 331403.4 >>> >>> $ OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=16 ./ex2k -n 15 -m 2 -test_name VecMAXPY -vec_maxpy_use_gemv 0 >>> Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) >>> -------------------------------------------------------------------------- >>> 128 17.0 16.1 31.5 112.9 >>> 256 13.7 16.8 31.2 120.2 >>> 512 14.5 18.1 33.9 129.9 >>> 1024 16.5 21.0 38.5 150.4 >>> 2048 18.5 22.1 41.8 171.4 >>> 4096 21.0 25.4 55.3 212.3 >>> 8192 27.0 30.3 68.6 251.9 >>> 16384 32.2 44.5 93.3 350.5 >>> 32768 45.8 65.0 149.8 558.8 >>> 65536 59.7 102.8 247.1 946.0 >>> 131072 100.7 186.4 485.3 1898.1 >>> 262144 183.4 345.2 922.2 3567.0 >>> 524288 339.6 676.8 1820.7 7530.4 >>> 1048576 662.0 1364.7 3585.3 13969.1 >>> 2097152 1379.7 2788.6 7414.0 28275.3 >>> >>> We can see VecMAXPY() can be easily speeded up with multithreading. >>> >>> For MatSolve, I checked petsc's aijmkl.c, and found we don't have interface to MKL's sparse solve. >> >> We do, it?s in src/mat/impls/aij/seq/mkl_pardiso, and it?s threaded (the distributed version is in src/mat/impls/aij/mpi/mkl_cpardiso). 
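For completeness, once PETSc is configured with MKL PARDISO support (the --with-mkl_pardiso-dir option mentioned above), the threaded direct solver can be selected for the factorization and MatSolve through the usual factor-solver options. A hedged sketch: this applies when the factorization is performed by a PETSc PC rather than inside a user shell preconditioner, the thread count is governed by MKL_NUM_THREADS/OMP_NUM_THREADS, and the exact configure flag for the cluster variant should be checked against ./configure --help.

$ ./configure ... --with-mkl_pardiso-dir=$MKLROOT
$ ./myapp -pc_type lu -pc_factor_mat_solver_type mkl_pardiso                  # sequential, threaded PARDISO
$ mpiexec -n 4 ./myapp -pc_type lu -pc_factor_mat_solver_type mkl_cpardiso    # distributed cluster PARDISO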
>> >> Thanks, >> Pierre >> >>> I checked https://urldefense.us/v3/__https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2023-0/openmp-threaded-functions-and-problems.html__;!!G_uCfscf7eWS!bDIUaHqh4wl1qOCJcAguXJHWFoOm4VTVCLhaVCpC9UKNvTShW_jjtq_DiWBWpbj0cSsF0wPyRAJm2hzp5_cFIXQ$ , but confused with MKL's list of threaded function >>> Direct sparse solver. >>> All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers. >>> I don't know whether MKL has threaded sparse solver. >>> >>> --Junchao Zhang >>> >>> >>> On Fri, Jun 28, 2024 at 11:35?AM Barry Smith > wrote: >>>> >>>> Are you running with -vec_maxpy_use_gemv ? >>>> >>>> >>>>> On Jun 28, 2024, at 1:46?AM, Yongzhong Li > wrote: >>>>> >>>>> Thanks all for your help!!! >>>>> >>>>> I think I find the issues. I am compiling a large CMake project that relies on many external libraries (projects). Previously, I used OpenBLAS as the BLAS for all the dependencies including PETSc. After I switched to Intel MKL for PETSc, I still kept the OpenBLAS and use it as the BLAS for all the other dependencies. I think somehow even when I specify the blas-lapack-dir to the MKLROOT when PETSc is configured, the actual program still use OpenBLAS as the BLAS for some PETSc functions, such as VecMDot() and VecMAXPY(), so that?s why I didn?t see any MKL verbose during the KSPSolve(). Now I remove the OpenBLAS and use Intel MKL as the BLAS for all the dependencies. The issue is resolved, I can clearly see MKL routines are called when KSP GMRES is running. >>>>> >>>>> Back to my original questions, my goal is to achieve good parallelization efficiency for KSP GMRES Solve. As I use multithreading-enabled MKL spmv routines, the wall time for MatMult/MatMultAdd() has been greatly reduced. However,the KSPGMRESOrthog and MatSolve in PCApply still take over 50% of solving time and can?t benefit from multithreading. After I fixed the issue I mentioned, I found I got around 15% time reduced because of more efficient VecMDot() calls. I attach a petsc log comparison for your reference (same settings, only difference is whether use MKL BLAS or not), you can see the percentage of VecMDot() is reduced. However, here comes the interesting part, VecMAXPY() didn?t benefit from MKL BLAS, it still takes almost 40% of solution when I use 64 MKL Threads, which is a lot for my program. And if I multiple this percentage with the actual wall time against different # of threads, it stays the same. Then I used ex2k benchmark to verify what I found. 
Here is the result, >>>>> >>>>> $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 5 -test_name VecMAXPY >>>>> Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) >>>>> -------------------------------------------------------------------------- >>>>> 128 0.4 0.9 2.4 8.8 >>>>> 256 0.3 1.1 3.5 13.3 >>>>> 512 0.5 4.4 6.7 26.5 >>>>> 1024 0.9 4.8 13.3 51.0 >>>>> 2048 3.5 12.3 37.1 94.7 >>>>> 4096 4.3 24.5 73.6 179.6 >>>>> 8192 6.3 48.7 98.9 380.8 >>>>> 16384 9.3 99.2 200.2 774.0 >>>>> 32768 30.6 155.4 421.2 1662.9 >>>>> 65536 101.2 269.4 827.4 3565.0 >>>>> 131072 206.9 551.0 1829.0 7580.5 >>>>> 262144 450.2 1251.9 3986.2 15525.6 >>>>> 524288 1322.1 2901.7 8567.1 31840.0 >>>>> 1048576 2788.6 6190.6 16394.7 63514.9 >>>>> 2097152 5534.8 12619.9 35427.4 130064.5 >>>>> $ MKL_NUM_THREADS=8 ./ex2k -n 15 -m 5 -test_name VecMAXPY >>>>> Vector(N) VecMAXPY-1 VecMAXPY-3 VecMAXPY-8 VecMAXPY-30 (us) >>>>> -------------------------------------------------------------------------- >>>>> 128 0.3 0.7 2.4 8.8 >>>>> 256 0.3 1.1 3.6 13.5 >>>>> 512 0.5 4.4 6.8 26.4 >>>>> 1024 0.9 4.8 13.6 50.5 >>>>> 2048 7.6 12.2 36.5 95.0 >>>>> 4096 8.5 25.7 72.4 182.6 >>>>> 8192 11.9 48.5 103.7 383.7 >>>>> 16384 12.8 97.7 203.7 785.0 >>>>> 32768 11.2 148.5 421.9 1681.5 >>>>> 65536 15.5 271.2 843.8 3613.7 >>>>> 131072 34.3 564.7 1905.2 7558.8 >>>>> 262144 106.4 1334.5 4002.8 15458.3 >>>>> 524288 217.2 2858.4 8407.9 31303.7 >>>>> 1048576 701.5 6060.6 16947.3 64118.5 >>>>> 2097152 1769.7 13218.3 36347.3 131062.9 >>>>> >>>>> It stays the same, no benefit from multithreading BLAS!! Unlike what I found for VecMdot(), where I did see speed up for more #of threads. Then, I dig deeper. I learned that for VecMDot(), it calls ZGEMV while for VecMAXPY(), it calls ZAXPY. This observation seems to indicate that ZAXPY is not benefiting from MKL threads. >>>>> >>>>> My question is do you know why ZAXPY is not multithreaded? From my perspective, VecMDot() and VecMAXPY() are very similar operations, the only difference is whether we need to scale the vectors to be multiplied or not. I think you have mentioned that recently you did some optimization to these two routines, from my above results and observations, are these aligned with your expectations? Could we further optimize the codes to get more parallelization efficiency in my case? >>>>> >>>>> And another question, can MatSolve() in KSPSolve be multithreaded? Would MUMPS help? >>>>> >>>>> Thank you and regards, >>>>> Yongzhong >>>>> >>>>> From: Junchao Zhang > >>>>> Sent: Thursday, June 27, 2024 11:10 AM >>>>> To: Yongzhong Li > >>>>> Cc: Barry Smith >; petsc-users at mcs.anl.gov >>>>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>>>> >>>>> How big is the n when you call PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione))? n is the vector length in VecMDot. >>>>> it is strange with MKL_VERBOSE=1 you did not see MKL_VERBOSE ZGEMV..., since the code did call gemv. Perhaps you need to double check your spelling etc. >>>>> >>>>> If you also use ex2k, and potentially modify Ms[] and Ns[] to match the sizes in your code, to see if there is a speedup with more threads. 
>>>>> >>>>> --Junchao Zhang >>>>> >>>>> >>>>> On Thu, Jun 27, 2024 at 9:39?AM Yongzhong Li > wrote: >>>>> Mostly 3, maximum 7, but definitely hit the point when m > 1, I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple >>>>> ZjQcmQRYFpfptBannerStart >>>>> This Message Is From an External Sender >>>>> This message came from outside your organization. >>>>> >>>>> ZjQcmQRYFpfptBannerEnd >>>>> Mostly 3, maximum 7, but definitely hit the point when m > 1, >>>>> >>>>> I can see the PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); is called multiple times >>>>> >>>>> From: Barry Smith > >>>>> Date: Thursday, June 27, 2024 at 1:12?AM >>>>> To: Yongzhong Li > >>>>> Cc: petsc-users at mcs.anl.gov > >>>>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>>>> >>>>> >>>>> How big are the m's getting in your code? >>>>> >>>>> >>>>> >>>>> On Jun 27, 2024, at 12:40?AM, Yongzhong Li > wrote: >>>>> >>>>> Hi Barry, I used gdb to debug my program, set a breakpoint to VecMultiDot_Seq_GEMV function. I did see when I debug this function, it will call BLAS (but not always, only if m > 1), as shown below. However, I still didn?t see any MKL outputs even if I set MKLK_VERBOSE=1. >>>>> >>>>> (gdb) >>>>> 550 PetscCall(VecRestoreArrayRead(yin[i], &yfirst)); >>>>> (gdb) >>>>> 553 m = j - i; >>>>> (gdb) >>>>> 554 if (m > 1) { >>>>> (gdb) >>>>> 555 PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above >>>>> (gdb) >>>>> 556 PetscScalar one = 1, zero = 0; >>>>> (gdb) >>>>> 558 PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); >>>>> (gdb) s >>>>> PetscMallocValidate (line=558, function=0x7ffff68a11a0 <__func__.18210> "VecMultiDot_Seq_GEMV", >>>>> file=0x7ffff68a1078 "/gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/vec/vec/impls/seq/dvec2.c") >>>>> at /gpfs/s4h/scratch/t/triverio/modelics/workplace/rebel/build_debug/external/petsc-3.21.0/src/sys/memory/mtr.c:106 >>>>> 106 if (!TRdebug) return PETSC_SUCCESS; >>>>> (gdb) >>>>> 154 } >>>>> >>>>> Am I not using MKL BLAS, is that why I didn?t see multithreading speed up for KSPGMRESOrthog? What do you think could be the potential reasons? Is there any silent mode that will possibly affect the MKL Verbose. >>>>> >>>>> Thank you and best regards, >>>>> Yongzhong >>>>> >>>>> From: Barry Smith > >>>>> Date: Wednesday, June 26, 2024 at 8:15?PM >>>>> To: Yongzhong Li > >>>>> Cc: petsc-users at mcs.anl.gov > >>>>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>>>> >>>>> >>>>> if (m > 1) { >>>>> PetscBLASInt ione = 1, lda2 = (PetscBLASInt)lda; // the cast is safe since we've screened out those lda > PETSC_BLAS_INT_MAX above >>>>> PetscScalar one = 1, zero = 0; >>>>> >>>>> PetscCallBLAS("BLASgemv", BLASgemv_(trans, &n, &m, &one, yarray, &lda2, xarray, &ione, &zero, z + i, &ione)); >>>>> PetscCall(PetscLogFlops(PetscMax(m * (2.0 * n - 1), 0.0))); >>>>> >>>>> The call to BLAS above is where it uses MKL. 
>>>>> >>>>> >>>>> >>>>> >>>>> On Jun 26, 2024, at 6:59?PM, Yongzhong Li > wrote: >>>>> >>>>> Hi Barry, I am looking into the source codes of VecMultiDot_Seq_GEMV https://urldefense.us/v3/__https://petsc.org/release/src/vec/vec/impls/seq/dvec2.c.html*VecMDot_Seq__;Iw!!G_uCfscf7eWS!bDIUaHqh4wl1qOCJcAguXJHWFoOm4VTVCLhaVCpC9UKNvTShW_jjtq_DiWBWpbj0cSsF0wPyRAJm2hzpCCxkPcI$ >>>>> Can I ask which lines of codes suggest the use of intel mkl? >>>>> >>>>> Thanks, >>>>> Yongzhong >>>>> >>>>> From: Barry Smith > >>>>> Date: Wednesday, June 26, 2024 at 10:30?AM >>>>> To: Yongzhong Li > >>>>> Cc: petsc-users at mcs.anl.gov > >>>>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>>>> >>>>> >>>>> In a debug version of PETSc run your application in a debugger and put a break point in VecMultiDot_Seq_GEMV. Then next through the code from that point to see what decision it makes about using dgemv() to see why it is not getting into the Intel code. >>>>> >>>>> >>>>> >>>>> >>>>> On Jun 25, 2024, at 11:19?PM, Yongzhong Li > wrote: >>>>> >>>>> This Message Is From an External Sender >>>>> This message came from outside your organization. >>>>> Hi Junchao, thank you for your help for these benchmarking test! >>>>> >>>>> I check out to petsc/main and did a few things to verify from my side, >>>>> >>>>> 1. I ran the microbenchmark (vec/vec/tests/ex2k.c) test on my compute node. The results are as follow, >>>>> >>>>> $ MKL_NUM_THREADS=64 ./ex2k -n 15 -m 4 >>>>> Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) >>>>> -------------------------------------------------------------------------- >>>>> 128 14.5 1.2 1.8 5.2 >>>>> 256 1.5 0.9 1.6 4.7 >>>>> 512 2.7 2.8 6.1 13.2 >>>>> 1024 4.0 4.0 9.3 16.4 >>>>> 2048 7.4 7.3 11.3 39.3 >>>>> 4096 14.2 13.9 19.1 93.4 >>>>> 8192 28.8 26.3 25.4 31.3 >>>>> 16384 54.1 25.8 26.7 33.8 >>>>> 32768 109.8 25.7 24.2 56.0 >>>>> 65536 220.2 24.4 26.5 89.0 >>>>> 131072 424.1 31.5 36.1 149.6 >>>>> 262144 898.1 37.1 53.9 286.1 >>>>> 524288 1754.6 48.7 100.3 1122.2 >>>>> 1048576 3645.8 86.5 347.9 2950.4 >>>>> 2097152 7371.4 308.7 1440.6 6874.9 >>>>> >>>>> $ MKL_NUM_THREADS=1 ./ex2k -n 15 -m 4 >>>>> Vector(N) VecMDot-1 VecMDot-3 VecMDot-8 VecMDot-30 (us) >>>>> -------------------------------------------------------------------------- >>>>> 128 14.9 1.2 1.9 5.2 >>>>> 256 1.5 1.0 1.7 4.7 >>>>> 512 2.7 2.8 6.1 12.0 >>>>> 1024 3.9 4.0 9.3 16.8 >>>>> 2048 7.4 7.3 10.4 41.3 >>>>> 4096 14.0 13.8 18.6 84.2 >>>>> 8192 27.0 21.3 43.8 177.5 >>>>> 16384 54.1 34.1 89.1 330.4 >>>>> 32768 110.4 82.1 203.5 781.1 >>>>> 65536 213.0 191.8 423.9 1696.4 >>>>> 131072 428.7 360.2 934.0 4080.0 >>>>> 262144 883.4 723.2 1745.6 10120.7 >>>>> 524288 1817.5 1466.1 4751.4 23217.2 >>>>> 1048576 3611.0 3796.5 11814.9 48687.7 >>>>> 2097152 7401.9 10592.0 27543.2 106565.4 >>>>> >>>>> I can see the speed up brought by more MKL threads, and if I set NKL_VERBOSE to 1, I can see something like >>>>> >>>>> MKL_VERBOSE ZGEMV(C,262144,8,0x7ffd375d6470,0x2ac76e7fb010,262144,0x16d0f40,1,0x7ffd375d6480,0x16435d0,1) 32.70us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:6 ca >>>>> >>>>> From my understanding, the VecMDot()/VecMAXPY() can benefit from more MKL threads in my compute node and is using ZGEMV MKL BLAS. >>>>> >>>>> However, when I ran my own program and set MKL_VERBOSE to 1, it is very strange that I still can?t find any MKL outputs, though I can see from the PETSc log that VecMDot and VecMAXPY() are called. 
>>>>> >>>>> I am wondering are VecMDot and VecMAXPY in KSPGMRESOrthog optimized in a way that is similar to ex2k test? Should I expect to see MKL outputs for whatever linear system I solve with KSPGMRES? Does it relate to if it is dense matrix or sparse matrix, although I am not really understand why VecMDot/MAXPY() have something to do with dense matrix-vector multiplication. >>>>> >>>>> Thank you, >>>>> Yongzhong >>>>> >>>>> From: Junchao Zhang > >>>>> Date: Tuesday, June 25, 2024 at 6:34?PM >>>>> To: Matthew Knepley > >>>>> Cc: Yongzhong Li >, Pierre Jolivet >, petsc-users at mcs.anl.gov > >>>>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>>>> >>>>> Hi, Yongzhong, >>>>> Since the two kernels of KSPGMRESOrthog are VecMDot and VecMAXPY, if we can speed up the two with OpenMP threads, then we can speed up KSPGMRESOrthog. We recently added an optimization to do VecMDot/MAXPY() in dense matrix-vector multiplication (i.e., BLAS2 GEMV, with tall-and-skinny matrices ). So with MKL_VERBOSE=1, you should see something like "MKL_VERBOSE ZGEMV ..." in output. If not, could you try again with petsc/main? >>>>> petsc has a microbenchmark (vec/vec/tests/ex2k.c) to test them. I ran VecMDot with multithreaded oneMKL (via setting MKL_NUM_THREADS), it was strange to see no speedup. I then configured petsc with openblas, I did see better performance with more threads >>>>> >>>>> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=1 ./ex2k -n 15 -m 4 >>>>> Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) >>>>> -------------------------------------------------------------------------- >>>>> 128 2.0 2.5 6.1 >>>>> 256 1.8 2.7 7.0 >>>>> 512 2.1 3.1 8.6 >>>>> 1024 2.7 4.0 12.3 >>>>> 2048 3.8 6.3 28.0 >>>>> 4096 6.1 10.6 42.4 >>>>> 8192 10.9 21.8 79.5 >>>>> 16384 21.2 39.4 149.6 >>>>> 32768 45.9 75.7 224.6 >>>>> 65536 142.2 215.8 732.1 >>>>> 131072 169.1 233.2 1729.4 >>>>> 262144 367.5 830.0 4159.2 >>>>> 524288 999.2 1718.1 8538.5 >>>>> 1048576 2113.5 4082.1 18274.8 >>>>> 2097152 5392.6 10273.4 43273.4 >>>>> >>>>> >>>>> $ OMP_PROC_BIND=spread OMP_NUM_THREADS=8 ./ex2k -n 15 -m 4 >>>>> Vector(N) VecMDot-3 VecMDot-8 VecMDot-30 (us) >>>>> -------------------------------------------------------------------------- >>>>> 128 2.0 2.5 6.0 >>>>> 256 1.8 2.7 15.0 >>>>> 512 2.1 9.0 16.6 >>>>> 1024 2.6 8.7 16.1 >>>>> 2048 7.7 10.3 20.5 >>>>> 4096 9.9 11.4 25.9 >>>>> 8192 14.5 22.1 39.6 >>>>> 16384 25.1 27.8 67.8 >>>>> 32768 44.7 95.7 91.5 >>>>> 65536 82.1 156.8 165.1 >>>>> 131072 194.0 335.1 341.5 >>>>> 262144 388.5 380.8 612.9 >>>>> 524288 1046.7 967.1 1653.3 >>>>> 1048576 1997.4 2169.0 4034.4 >>>>> 2097152 5502.9 5787.3 12608.1 >>>>> >>>>> The tall-and-skinny matrices in KSPGMRESOrthog vary in width. The average speedup depends on components. So I suggest you run ex2k to see in your environment whether oneMKL can speedup the kernels. >>>>> >>>>> --Junchao Zhang >>>>> >>>>> >>>>> On Mon, Jun 24, 2024 at 11:35?AM Junchao Zhang > wrote: >>>>> Let me run some examples on our end to see whether the code calls expected functions. >>>>> >>>>> --Junchao Zhang >>>>> >>>>> >>>>> On Mon, Jun 24, 2024 at 10:46?AM Matthew Knepley > wrote: >>>>> On Mon, Jun 24, 2024 at 11:?21 AM Yongzhong Li wrote: Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of >>>>> ZjQcmQRYFpfptBannerStart >>>>> This Message Is From an External Sender >>>>> This message came from outside your organization. 
>>>>> >>>>> ZjQcmQRYFpfptBannerEnd >>>>> On Mon, Jun 24, 2024 at 11:21?AM Yongzhong Li > wrote: >>>>> Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? Thank you, Yongzhong From:? >>>>> ZjQcmQRYFpfptBannerStart >>>>> This Message Is From an External Sender >>>>> This message came from outside your organization. >>>>> >>>>> ZjQcmQRYFpfptBannerEnd >>>>> Thank you Pierre for your information. Do we have a conclusion for my original question about the parallelization efficiency for different stages of KSP Solve? Do we need to do more testing to figure out the issues? >>>>> >>>>> We have an extended discussion of this here: https://urldefense.us/v3/__https://petsc.org/release/faq/*what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup__;Iw!!G_uCfscf7eWS!bDIUaHqh4wl1qOCJcAguXJHWFoOm4VTVCLhaVCpC9UKNvTShW_jjtq_DiWBWpbj0cSsF0wPyRAJm2hzpA6bJvC4$ >>>>> >>>>> The kinds of operations you are talking about (SpMV, VecDot, VecAXPY, etc) are memory bandwidth limited. If there is no more bandwidth to be marshalled on your board, then adding more processes does nothing at all. This is why people were asking about how many "nodes" you are running on, because that is the unit of memory bandwidth, not "cores" which make little difference. >>>>> >>>>> Thanks, >>>>> >>>>> Matt >>>>> >>>>> Thank you, >>>>> Yongzhong >>>>> >>>>> From: Pierre Jolivet > >>>>> Date: Sunday, June 23, 2024 at 12:41?AM >>>>> To: Yongzhong Li > >>>>> Cc: petsc-users at mcs.anl.gov > >>>>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>>>> >>>>> >>>>> >>>>> >>>>> On 23 Jun 2024, at 4:07?AM, Yongzhong Li > wrote: >>>>> >>>>> This Message Is From an External Sender >>>>> This message came from outside your organization. >>>>> Yeah, I ran my program again using -mat_view::ascii_info and set MKL_VERBOSE to be 1, then I noticed the outputs suggested that the matrix to be seqaijmkl type (I?ve attached a few as below) >>>>> >>>>> --> Setting up matrix-vector products... >>>>> >>>>> Mat Object: 1 MPI process >>>>> type: seqaijmkl >>>>> rows=16490, cols=35937 >>>>> total: nonzeros=128496, allocated nonzeros=128496 >>>>> total number of mallocs used during MatSetValues calls=0 >>>>> not using I-node routines >>>>> Mat Object: 1 MPI process >>>>> type: seqaijmkl >>>>> rows=16490, cols=35937 >>>>> total: nonzeros=128496, allocated nonzeros=128496 >>>>> total number of mallocs used during MatSetValues calls=0 >>>>> not using I-node routines >>>>> >>>>> --> Solving the system... >>>>> >>>>> Excitation 1 of 1... >>>>> >>>>> ================================================ >>>>> Iterative solve completed in 7435 ms. >>>>> CONVERGED: rtol. >>>>> Iterations: 72 >>>>> Final relative residual norm: 9.22287e-07 >>>>> ================================================ >>>>> [CPU TIME] System solution: 2.27160000e+02 s. >>>>> [WALL TIME] System solution: 7.44387218e+00 s. >>>>> >>>>> However, it seems to me that there were still no MKL outputs even I set MKL_VERBOSE to be 1. Although, I think it should be many spmv operations when doing KSPSolve(). Do you see the possible reasons? >>>>> >>>>> SPMV are not reported with MKL_VERBOSE (last I checked), only dense BLAS is. 
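Since MKL_VERBOSE therefore cannot confirm that the sparse AIJMKL kernels are being hit, querying the matrix type directly is the more reliable test. A minimal sketch (A stands for the already assembled application matrix; MATSEQAIJMKL is only registered when PETSc was configured against MKL):

  MatType   type;
  PetscBool ismkl;

  PetscCall(MatGetType(A, &type));
  PetscCall(PetscObjectTypeCompare((PetscObject)A, MATSEQAIJMKL, &ismkl));
  PetscCall(PetscPrintf(PETSC_COMM_SELF, "matrix type %s, seqaijmkl? %s\n", type, ismkl ? "yes" : "no"));

This prints the same type information that -mat_view ::ascii_info reports, as used earlier in the thread.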
>>>>> >>>>> Thanks, >>>>> Pierre >>>>> >>>>> >>>>> Thanks, >>>>> Yongzhong >>>>> >>>>> >>>>> From: Matthew Knepley > >>>>> Date: Saturday, June 22, 2024 at 5:56?PM >>>>> To: Yongzhong Li > >>>>> Cc: Junchao Zhang >, Pierre Jolivet >, petsc-users at mcs.anl.gov > >>>>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>>>> >>>>> ????????? knepley at gmail.com ????????????????? >>>>> On Sat, Jun 22, 2024 at 5:03?PM Yongzhong Li > wrote: >>>>> MKL_VERBOSE=1 ./ex1 matrix nonzeros = 100, allocated nonzeros = 100 MKL_VERBOSE Intel(R) MKL 2019.?0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector >>>>> ZjQcmQRYFpfptBannerStart >>>>> This Message Is From an External Sender >>>>> This message came from outside your organization. >>>>> >>>>> ZjQcmQRYFpfptBannerEnd >>>>> MKL_VERBOSE=1 ./ex1 >>>>> >>>>> matrix nonzeros = 100, allocated nonzeros = 100 >>>>> MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 gnu_thread >>>>> MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 167.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d7078c0,-1,0) 77.19ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x1894490,10,0) 83.97ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 44.94ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 20.72us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZSYTRS(L,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 4.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x1896a70,10) 1.41ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x1896a70,1,0x187b650,1) 381ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x7ffd9d707840,-1,0) 742ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZSYTRF(L,10,0x1894b50,10,0x1893df0,0x18951a0,10,0) 4.20us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZSYTRS(L,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 2.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 292ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGEMV(N,10,10,0x7ffd9d7078f0,0x187eb20,10,0x187f7c0,1,0x7ffd9d707900,0x187ff70,1) 1.17us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 202.48ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 20.78ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 954ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGETRS(N,10,2,0x1894b50,10,0x1893df0,0x187d2a0,10,0) 30.74ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGEMM(N,N,10,2,10,0x7ffd9d707790,0x187eb20,10,0x187d2a0,10,0x7ffd9d7077a0,0x18969c0,10) 3.95us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x18969c0,1,0x187b650,1) 995ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE 
ZGETRF(10,10,0x1894b50,10,0x1893df0,0) 4.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGETRS(N,10,1,0x1894b50,10,0x1893df0,0x1880720,10,0) 3.92us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187f7c0,1,0x1880720,1) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGEMV(N,15,10,0x7ffd9d7078f0,0x187ec70,15,0x187fc30,1,0x7ffd9d707900,0x1880400,1) 1.59us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x7ffd9d707900,-1,0) 47.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x1894550,0x1895cb0,10,0) 26.62us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x7ffd9d7078b0,-1,0) 35.32us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x1894550,0x1895b00,15,0x1895cb0,10,0) 42.33ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895b00,15,0) 16.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 395ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGEMM(N,N,15,2,10,0x7ffd9d707790,0x187ec70,15,0x187d310,10,0x7ffd9d7077a0,0x187b5b0,15) 3.22us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x7ffd9d7078c0,-1,0) 730ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZUNMQR(L,C,15,2,10,0x1894b40,15,0x1894550,0x1897760,15,0x1895cb0,10,0) 4.42us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZTRTRS(U,N,N,10,2,0x1894b40,15,0x1897760,15,0) 5.96us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZAXPY(20,0x7ffd9d7078a0,0x187d310,1,0x1897610,1) 222ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x7ffd9d707820,-1,0) 685ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZGEQRF(15,10,0x1894b40,15,0x18954b0,0x1895d60,10,0) 6.11us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x7ffd9d7078b0,-1,0) 390ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZUNMQR(L,C,15,1,10,0x1894b40,15,0x18954b0,0x1895bb0,15,0x1895d60,10,0) 3.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZTRTRS(U,N,N,10,1,0x1894b40,15,0x1895bb0,15,0) 1.05us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> MKL_VERBOSE ZAXPY(10,0x7ffd9d7078f0,0x187fc30,1,0x1880c70,1) 257ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1 >>>>> >>>>> Yes, for petsc example, there are MKL outputs, but for my own program. All I did is to change the matrix type from MATAIJ to MATAIJMKL to get optimized performance for spmv from MKL. Should I expect to see any MKL outputs in this case? >>>>> >>>>> Are you sure that the type changed? You can MatView() the matrix with format ascii_info to see. >>>>> >>>>> Thanks, >>>>> >>>>> Matt >>>>> >>>>> >>>>> Thanks, >>>>> Yongzhong >>>>> >>>>> From: Junchao Zhang > >>>>> Date: Saturday, June 22, 2024 at 9:40?AM >>>>> To: Yongzhong Li > >>>>> Cc: Pierre Jolivet >, petsc-users at mcs.anl.gov > >>>>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>>>> >>>>> No, you don't. It is strange. 
Perhaps you can run a petsc example first and see if MKL is really used
>>>>> $ cd src/mat/tests
>>>>> $ make ex1
>>>>> $ MKL_VERBOSE=1 ./ex1
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Fri, Jun 21, 2024 at 4:03 PM Yongzhong Li > wrote:
>>>>> I am using
>>>>>
>>>>> export MKL_VERBOSE=1
>>>>> ./xx
>>>>>
>>>>> in the bash file, do I have to use -ksp_converged_reason?
>>>>>
>>>>> Thanks,
>>>>> Yongzhong
>>>>>
>>>>> From: Pierre Jolivet >
>>>>> Date: Friday, June 21, 2024 at 1:47 PM
>>>>> To: Yongzhong Li >
>>>>> Cc: Junchao Zhang >, petsc-users at mcs.anl.gov >
>>>>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>>>>>
>>>>> How do you set the variable?
>>>>>
>>>>> $ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason
>>>>> MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread
>>>>> MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>>>>> MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>>>>> MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>>>>> MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
>>>>> [...]
>>>>>
>>>>> On 21 Jun 2024, at 7:37 PM, Yongzhong Li > wrote:
>>>>>
>>>>> Hello all,
>>>>>
>>>>> I set MKL_VERBOSE = 1, but observed no print output specific to the use of MKL. Does PETSc enable this verbose output?
>>>>>
>>>>> Best,
>>>>> Yongzhong
>>>>>
>>>>>
>>>>> From: Pierre Jolivet >
>>>>> Date: Friday, June 21, 2024 at 1:36 AM
>>>>> To: Junchao Zhang >
>>>>> Cc: Yongzhong Li >, petsc-users at mcs.anl.gov >
>>>>> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>>>>>
>>>>>
>>>>> On 21 Jun 2024, at 6:42 AM, Junchao Zhang > wrote:
>>>>>
>>>>> I remember there are some MKL env vars to print the MKL routines called.
>>>>>
>>>>> The environment variable is MKL_VERBOSE
>>>>>
>>>>> Thanks,
>>>>> Pierre
>>>>>
>>>>> Maybe we can try it to see what MKL routines are really used and then we can understand why some petsc functions did not speed up
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li > wrote:
>>>>>
>>>>> Hi Barry, sorry for my last results. I didn't fully understand the stage profiling and logging in PETSc, so now I only record the KSPSolve() stage of my program.
Some sample code is as follows,
>>>>>
>>>>> // Static variable to keep track of the stage counter
>>>>> static int stageCounter = 1;
>>>>>
>>>>> // Generate a unique stage name
>>>>> std::ostringstream oss;
>>>>> oss << "Stage " << stageCounter << " of Code";
>>>>> std::string stageName = oss.str();
>>>>>
>>>>> // Register the stage
>>>>> PetscLogStage stagenum;
>>>>>
>>>>> PetscLogStageRegister(stageName.c_str(), &stagenum);
>>>>> PetscLogStagePush(stagenum);
>>>>>
>>>>> KSPSolve(*ksp_ptr, b, x);
>>>>>
>>>>> PetscLogStagePop();
>>>>> stageCounter++;
>>>>>
>>>>> I have attached my new logging results; there is 1 main stage and 4 other stages, each of which is one KSPSolve() call.
>>>>>
>>>>> To provide some additional background, if you recall, I have been trying to get an efficient iterative solution using multithreading. I found that by compiling PETSc with the Intel MKL library instead of OpenBLAS I am able to perform the sparse matrix-vector multiplication faster; I am using MATSEQAIJMKL. This makes the shell matrix-vector product in each iteration scale well with the # of threads. However, I found that the total GMRES solve time (~KSPSolve() time) is not scaling well with the # of threads.
>>>>>
>>>>> From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMRESOrthog(). I ran my program with different numbers of threads and plotted the time consumption of PCApply() and KSPGMRESOrthog() against the # of threads. I found that these two operations are not scaling with the threads at all! My results are attached as a pdf to give you a clear view.
>>>>>
>>>>> My question is:
>>>>>
>>>>> From my understanding, PCApply involves MatSolve(), and KSPGMRESOrthog() consists of many vector operations, so why can't these two parts scale well with the # of threads when the Intel MKL library is linked?
>>>>>
>>>>> Thank you,
>>>>> Yongzhong
>>>>>
>>>>> From: Barry Smith >
>>>>> Date: Friday, June 14, 2024 at 11:36 AM
>>>>> To: Yongzhong Li >
>>>>> Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio >
>>>>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>>>>>
>>>>>
>>>>> I am a bit confused.
Without the initial guess computation, there are still a bunch of events I don't understand >>>>> >>>>> MatTranspose 79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>>> MatMatMultSym 110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>>>> MatMatMultNum 90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>>>> MatMatMatMultSym 20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>>>> MatRARtSym 25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>>>> MatMatTrnMultSym 25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>>> MatMatTrnMultNum 25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 275 >>>>> MatTrnMatMultSym 10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>>> MatTrnMatMultNum 10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>>> >>>>> in addition there are many more VecMAXPY then VecMDot (in GMRES they are each done the same number of times) >>>>> >>>>> VecMDot 5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00 8 10 0 0 0 8 10 0 0 0 12016 >>>>> VecMAXPY 22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20 0 0 0 39 20 0 0 0 4913 >>>>> >>>>> Finally there are a huge number of >>>>> >>>>> MatMultAdd 258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00 7 29 0 0 0 7 29 0 0 0 43025 >>>>> >>>>> Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve? >>>>> >>>>> The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code. >>>>> >>>>> Barry >>>>> >>>>> >>>>> On Jun 14, 2024, at 1:19?AM, Yongzhong Li > wrote: >>>>> >>>>> Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to the one with KSPGuess, specifically >>>>> >>>>> KSPGuess Object: 1 MPI process >>>>> type: fischer >>>>> Model 1, size 200 >>>>> >>>>> However, I found at higher frequency, the # of iteration steps are significant higher than the one with KSPGuess, I have attahced both of the results for your reference. >>>>> >>>>> Moreover, could I ask why the one without the KSPGuess options can be used for a baseline comparsion? What are we comparing here? How does it relate to the performance issue/bottleneck I found? ?I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration? >>>>> >>>>> Thank you! >>>>> Yongzhong >>>>> >>>>> From: Barry Smith > >>>>> Date: Thursday, June 13, 2024 at 2:14?PM >>>>> To: Yongzhong Li > >>>>> Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > >>>>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>>>> >>>>> >>>>> Can you please run the same thing without the KSPGuess option(s) for a baseline comparison? >>>>> >>>>> Thanks >>>>> >>>>> Barry >>>>> >>>>> On Jun 13, 2024, at 1:27?PM, Yongzhong Li > wrote: >>>>> >>>>> This Message Is From an External Sender >>>>> This message came from outside your organization. >>>>> Hi Matt, >>>>> >>>>> I have rerun the program with the keys you provided. 
The system output when performing ksp solve and the final petsc log output were stored in a .txt file attached for your reference. >>>>> >>>>> Thanks! >>>>> Yongzhong >>>>> >>>>> From: Matthew Knepley > >>>>> Date: Wednesday, June 12, 2024 at 6:46?PM >>>>> To: Yongzhong Li > >>>>> Cc: petsc-users at mcs.anl.gov >, petsc-maint at mcs.anl.gov >, Piero Triverio > >>>>> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue >>>>> >>>>> ????????? knepley at gmail.com ????????????????? >>>>> On Wed, Jun 12, 2024 at 6:36?PM Yongzhong Li > wrote: >>>>> Dear PETSc?s developers, I hope this email finds you well. I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is >>>>> ZjQcmQRYFpfptBannerStart >>>>> This Message Is From an External Sender >>>>> This message came from outside your organization. >>>>> >>>>> ZjQcmQRYFpfptBannerEnd >>>>> Dear PETSc?s developers, >>>>> I hope this email finds you well. >>>>> I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time. >>>>> For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix vector product per iteration indicates when multiple threads were used. Here are a few details of my setup: >>>>> Matrix Type: Shell system matrix >>>>> Preconditioner: Shell PC >>>>> Parallel Environment: Using Intel MKL as PETSc?s BLAS/LAPACK library, multithreading is enabled >>>>> I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time. >>>>> Have you observed the same issue? Could you please provide some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated. >>>>> >>>>> For any performance question like this, we need to see the output of your code run with >>>>> >>>>> -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view >>>>> >>>>> Thanks, >>>>> >>>>> Matt >>>>> >>>>> Thank you for your time and assistance. >>>>> Best regards, >>>>> Yongzhong >>>>> ----------------------------------------------------------- >>>>> Yongzhong Li >>>>> PhD student | Electromagnetics Group >>>>> Department of Electrical & Computer Engineering >>>>> University of Toronto >>>>> https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!bDIUaHqh4wl1qOCJcAguXJHWFoOm4VTVCLhaVCpC9UKNvTShW_jjtq_DiWBWpbj0cSsF0wPyRAJm2hzp6WbSEQM$ >>>>> >>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
>>>>> -- Norbert Wiener >>>>> >>>>> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bDIUaHqh4wl1qOCJcAguXJHWFoOm4VTVCLhaVCpC9UKNvTShW_jjtq_DiWBWpbj0cSsF0wPyRAJm2hzpUQSz1ow$ >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bDIUaHqh4wl1qOCJcAguXJHWFoOm4VTVCLhaVCpC9UKNvTShW_jjtq_DiWBWpbj0cSsF0wPyRAJm2hzpUQSz1ow$ >>>>> >>>>> >>>>> >>>>> -- >>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. >>>>> -- Norbert Wiener >>>>> >>>>> https://urldefense.us/v3/__https://www.cse.buffalo.edu/*knepley/__;fg!!G_uCfscf7eWS!bDIUaHqh4wl1qOCJcAguXJHWFoOm4VTVCLhaVCpC9UKNvTShW_jjtq_DiWBWpbj0cSsF0wPyRAJm2hzpUQSz1ow$ >>>>> >>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From y.hu at mpie.de Sat Jun 29 03:28:32 2024 From: y.hu at mpie.de (y.hu at mpie.de) Date: Sat, 29 Jun 2024 08:28:32 +0000 Subject: [petsc-users] SNESVISetComputeVariableBounds() is not available in Fortran interface Message-ID: Dear PETSc team, I find SNES variational inequality capability (e.g. src/snes/tutorials/ex9.c example) very useful for my current problem. Since I am using fortran API of petsc, I explore a little bit about this functionality and it sems not there for fortran (a simple SNESVISetVariableBounds() is not enough, because it only handles constant VI and my VI changes during each SNESSolve call). Am I correct? If I would like to implement the interface myself, is it straightforward? Could you give me some advice or reference code on it? I have some knowledge in MatCreateShell() fortran interface, I saw the fortran API was created in a folder ftn-custom/ and writing such interface seems not that hard, but seems a lot of many manual definitions of other related APIs. Thanks for your help. Best regards, Yi ------------------------------------------------- Stay up to date and follow us on LinkedIn, Twitter and YouTube. Max-Planck-Institut f?r Eisenforschung GmbH Max-Planck-Stra?e 1 D-40237 D?sseldorf Handelsregister B 2533 Amtsgericht D?sseldorf Gesch?ftsf?hrung Prof. Dr. Gerhard Dehm Prof. Dr. J?rg Neugebauer Prof. Dr. Dierk Raabe Dr. Kai de Weldige Ust.-Id.-Nr.: DE 11 93 58 514 Steuernummer: 105 5891 1000 Please consider that invitations and e-mails of our institute are only valid if they end with ?@mpie.de. If you are not sure of the validity please contact rco at mpie.de Bitte beachten Sie, dass Einladungen zu Veranstaltungen und E-Mails aus unserem Haus nur mit der Endung ?@mpie.de g?ltig sind. In Zweifelsf?llen wenden Sie sich bitte an rco at mpie.de ------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at petsc.dev Sat Jun 29 09:12:13 2024 From: bsmith at petsc.dev (Barry Smith) Date: Sat, 29 Jun 2024 10:12:13 -0400 Subject: [petsc-users] SNESVISetComputeVariableBounds() is not available in Fortran interface In-Reply-To: References: Message-ID: Yi, I have made a start of it in the branch barry/2024-06-29/add-fortran-snesvisetcomputevariablebounds but have not tested it. 
You should be able to do git fetch git checkout barry/2024-06-29/add-fortran-snesvisetcomputevariablebounds make all Please let me know if you have any difficulties Barry > On Jun 29, 2024, at 4:28?AM, y.hu at mpie.de wrote: > > This Message Is From an External Sender > This message came from outside your organization. > Dear PETSc team, > > I find SNES variational inequality capability (e.g. src/snes/tutorials/ex9.c example) very useful for my current problem. Since I am using fortran API of petsc, I explore a little bit about this functionality and it sems not there for fortran (a simple SNESVISetVariableBounds() is not enough, because it only handles constant VI and my VI changes during each SNESSolve call). Am I correct? > > If I would like to implement the interface myself, is it straightforward? Could you give me some advice or reference code on it? > > I have some knowledge in MatCreateShell() fortran interface, I saw the fortran API was created in a folder ftn-custom/ and writing such interface seems not that hard, but seems a lot of many manual definitions of other related APIs. > > Thanks for your help. > > Best regards, > Yi > > > ------------------------------------------------- > Stay up to date and follow us on LinkedIn, Twitter and YouTube. > > Max-Planck-Institut f?r Eisenforschung GmbH > Max-Planck-Stra?e 1 > D-40237 D?sseldorf > > Handelsregister B 2533 > Amtsgericht D?sseldorf > > Gesch?ftsf?hrung > Prof. Dr. Gerhard Dehm > Prof. Dr. J?rg Neugebauer > Prof. Dr. Dierk Raabe > Dr. Kai de Weldige > > Ust.-Id.-Nr.: DE 11 93 58 514 > Steuernummer: 105 5891 1000 > > > Please consider that invitations and e-mails of our institute are > only valid if they end with ?@mpie.de. > If you are not sure of the validity please contact rco at mpie.de > > Bitte beachten Sie, dass Einladungen zu Veranstaltungen und E-Mails > aus unserem Haus nur mit der Endung ?@mpie.de g?ltig sind. > In Zweifelsf?llen wenden Sie sich bitte an rco at mpie.de > ------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From y.hu at mpie.de Sat Jun 29 09:51:16 2024 From: y.hu at mpie.de (y.hu at mpie.de) Date: Sat, 29 Jun 2024 14:51:16 +0000 Subject: [petsc-users] SNESVISetComputeVariableBounds() is not available in Fortran interface In-Reply-To: References: Message-ID: Looks good. Thx! I'll do the corresponding example ex9.c in fortran and let you know how it works. Best regards, Yi ________________________________ Von: Barry Smith Gesendet: Samstag, 29. Juni 2024 16:12 An: Yi Hu Cc: petsc-users Betreff: Re: [petsc-users] SNESVISetComputeVariableBounds() is not available in Fortran interface Yi, I have made a start of it in the branch barry/2024-06-29/add-fortran-snesvisetcomputevariablebounds but have not tested it. You should be able to do git fetch git checkout barry/2024-06-29/add-fortran-snesvisetcomputevariablebounds make all Please let me know if you have any difficulties Barry On Jun 29, 2024, at 4:28?AM, y.hu at mpie.de wrote: This Message Is From an External Sender This message came from outside your organization. Dear PETSc team, I find SNES variational inequality capability (e.g. src/snes/tutorials/ex9.c example) very useful for my current problem. Since I am using fortran API of petsc, I explore a little bit about this functionality and it sems not there for fortran (a simple SNESVISetVariableBounds() is not enough, because it only handles constant VI and my VI changes during each SNESSolve call). Am I correct? 
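For reference, a sketch of the C-side pattern from src/snes/tutorials/ex9.c that the new Fortran binding is meant to mirror; the bound values below are placeholders, not the obstacle function used in that example:

  #include <petscsnes.h>

  PetscErrorCode FormBounds(SNES snes, Vec Xl, Vec Xu)
  {
    PetscFunctionBeginUser;
    PetscCall(VecSet(Xl, -1.0));           /* placeholder lower bound; recompute from the current state here */
    PetscCall(VecSet(Xu, PETSC_INFINITY)); /* no upper bound */
    PetscFunctionReturn(PETSC_SUCCESS);
  }

  /* after SNESCreate() and SNESSetType(snes, SNESVINEWTONRSLS): */
  PetscCall(SNESVISetComputeVariableBounds(snes, FormBounds));

Because the callback is invoked by SNES when the problem is solved, the intent is that the bounds can change from one SNESSolve() call to the next, which is exactly what SNESVISetVariableBounds() with fixed vectors cannot provide. A Fortran version would presumably register a subroutine with the analogous (snes, Xl, Xu, ierr) argument pattern once the binding from the branch above is in place.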
If I would like to implement the interface myself, is it straightforward? Could you give me some advice or reference code for it?

I have some knowledge of the MatCreateShell() Fortran interface. I saw that the Fortran API was created in a folder ftn-custom/, and writing such an interface seems not that hard, but it appears to require a lot of manual definitions of other related APIs.

Thanks for your help.

Best regards,
Yi

From ligang0309 at gmail.com Sat Jun 29 23:43:23 2024
From: ligang0309 at gmail.com (Gang Li)
Date: Sun, 30 Jun 2024 12:43:23 +0800
Subject: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin
In-Reply-To: <9d7974dd-22ba-49e8-d96d-d69cba5653bd@fastmail.org>
References: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> <21e32b88-aed2-a618-3e3c-dca47c6bc456@fastmail.org> <5627D31E-5225-47CA-B337-A08E74C29D4A@gmail.com> <552dde2a-782a-5238-4897-18736ac9e94a@fastmail.org> <7620557F-4CB0-4E6A-91AF-B3C47DC1BCDD@gmail.com> <365c3d40-0f77-1158-1759-bb4c4e2b1dda@fastmail.org> <9d7974dd-22ba-49e8-d96d-d69cba5653bd@fastmail.org>
Message-ID:

Hi Satish,

Thanks for your help. I found the problem: I uninstalled the Perl software under Windows, and now the configure works.
Sincerely, Gang ---- Replied Message ---- FromSatish BalayDate6/28/2024 13:51Topetsc-usersCcGang LiSubjectRe: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin Here is what I get Satish ---- balay at petsc-win01 /cygdrive/e/balay $ wget -q https://urldefense.us/v3/__https://web.cels.anl.gov/projects/petsc/download/release-snapshots/petsc-3.21.2.tar.gz__;!!G_uCfscf7eWS!Ze0ZAk1qXDGZFKn3Vq2RYM_vyDsJvhPiTac02fcbpw2inF5bSCaBJjBhIHex9FKkNDld509Av8QQh_iHj3bBZnoY$ balay at petsc-win01 /cygdrive/e/balay $ tar -xzf petsc-3.21.2.tar.gz balay at petsc-win01 /cygdrive/e/balay $ cd petsc-3.21.2 balay at petsc-win01 /cygdrive/e/balay/petsc-3.21.2 $ ./configure --with-cc=win32fe_icl --with-fc=win32fe_ifort --with-cxx=win32fe_icl --with-precision=double --with-scalar-type=complex --with-shared-libraries=0 --with-mpi=0 '--with-blaslapack-lib=-L/cygdrive/c/PROGRA~2/Intel/oneAPI/mkl/latest/lib/intel64 mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib' ============================================================================================= Configuring PETSc to compile on your system ============================================================================================= Compilers: C Compiler: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_icl -Qstd=c99 -MT -Z7 -Od Version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM\nIntel(R) C++ Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 C++ Compiler: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_icl -MT -GR -EHsc -Z7 -Od -Qstd=c++17 -TP Version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM\nIntel(R) C++ Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 Fortran Compiler: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_ifort -MT -Z7 -Od -fpp Version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM\nIntel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 Linkers: Static linker: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_lib -a BlasLapack: Libraries: -L/cygdrive/c/PROGRA~2/Intel/oneAPI/mkl/latest/lib/intel64 mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib Unknown if this uses OpenMP (try export OMP_NUM_THREADS=<1-4> yourprogram -log_view) uses 4 byte integers MPI: Version: PETSc MPIUNI uniprocessor MPI replacement mpiexec: ${PETSC_DIR}/lib/petsc/bin/petsc-mpiexec.uni python: Executable: /usr/bin/python3 cmake: Version: 3.20.0 Executable: /usr/bin/cmake bison: Version: 3.8 Executable: /usr/bin/bison PETSc: Language used to compile PETSc: C PETSC_ARCH: arch-mswin-c-debug PETSC_DIR: /cygdrive/e/balay/petsc-3.21.2 Prefix: Scalar type: complex Precision: double Integer size: 4 bytes Single library: yes Shared libraries: no Memory alignment from malloc(): 16 bytes Using GNU make: /usr/bin/make xxx=======================================================================================xxx Configure stage complete. 
Now build PETSc libraries with: make PETSC_DIR=/cygdrive/e/balay/petsc-3.21.2 PETSC_ARCH=arch-mswin-c-debug all xxx=======================================================================================xxx balay at petsc-win01 /cygdrive/e/balay/petsc-3.21.2 $ ls -l lib/petsc/conf/ total 135 -rw-r--r--+ 1 balay Domain Users 391 Mar 29 08:59 bfort-base.txt -rw-r--r--+ 1 balay Domain Users 877 Mar 29 08:59 bfort-mpi.txt -rw-r--r--+ 1 balay Domain Users 5735 Mar 29 19:34 bfort-petsc.txt -rw-rw-r--+ 1 balay Domain Users 136 Jun 28 00:33 petscvariables -rw-r--r--+ 1 balay Domain Users 13140 May 29 14:34 rules -rw-r--r--+ 1 balay Domain Users 613 Mar 29 19:34 rules_doc.mk -rw-r--r--+ 1 balay Domain Users 16516 May 29 14:06 rules_util.mk -rw-r--r--+ 1 balay Domain Users 119 Mar 29 08:59 test -rw-r--r--+ 1 balay Domain Users 71503 Mar 29 08:59 uncrustify.cfg -rw-r--r--+ 1 balay Domain Users 4769 Mar 29 19:34 variables balay at petsc-win01 /cygdrive/e/balay/petsc-3.21.2 $ make ========================================== See documentation/faq.html and documentation/bugreporting.html for help with installation problems. Please send EVERYTHING printed out below when reporting problems. Please check the mailing list archives and consider subscribing. https://urldefense.us/v3/__https://petsc.org/release/community/mailing/__;!!G_uCfscf7eWS!Ze0ZAk1qXDGZFKn3Vq2RYM_vyDsJvhPiTac02fcbpw2inF5bSCaBJjBhIHex9FKkNDld509Av8QQh_iHj0lPUVJI$ ========================================== Starting make run on petsc-win01 at Fri, 28 Jun 2024 00:34:15 -0500 Machine characteristics: CYGWIN_NT-10.0 petsc-win01 3.2.0(0.340/5/3) 2021-03-29 08:42 x86_64 Cygwin ----------------------------------------- Using PETSc directory: /cygdrive/e/balay/petsc-3.21.2 Using PETSc arch: arch-mswin-c-debug ----------------------------------------- PETSC_VERSION_RELEASE 1 PETSC_VERSION_MAJOR 3 PETSC_VERSION_MINOR 21 PETSC_VERSION_SUBMINOR 2 PETSC_VERSION_DATE "May 29, 2024" PETSC_VERSION_GIT "v3.21.2" PETSC_VERSION_DATE_GIT "2024-05-29 14:05:28 -0500" ----------------------------------------- Using configure Options: --with-cc=win32fe_icl --with-fc=win32fe_ifort --with-cxx=win32fe_icl --with-precision=double --with-scalar-type=complex --with-shared-libraries=0 --with-mpi=0 --with-blaslapack-lib="-L/cygdrive/c/PROGRA~2/Intel/oneAPI/mkl/latest/lib/intel64 mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib" Using configuration flags: #define MPI_Comm_create_errhandler(p_err_fun,p_errhandler) MPI_Errhandler_create((p_err_fun),(p_errhandler)) #define MPI_Comm_set_errhandler(comm,p_errhandler) MPI_Errhandler_set((comm),(p_errhandler)) #define MPI_Type_create_struct(count,lens,displs,types,newtype) MPI_Type_struct((count),(lens),(displs),(types),(newtype)) #define PETSC_ARCH "arch-mswin-c-debug" #define PETSC_ATTRIBUTEALIGNED(size) #define PETSC_BLASLAPACK_CAPS 1 #define PETSC_CANNOT_START_DEBUGGER 1 #define PETSC_CLANGUAGE_C 1 #define PETSC_CXX_RESTRICT __restrict #define PETSC_DEPRECATED_ENUM_BASE(string_literal_why) #define PETSC_DEPRECATED_FUNCTION_BASE(string_literal_why) __declspec(deprecated(string_literal_why)) #define PETSC_DEPRECATED_MACRO_BASE(string_literal_why) PETSC_DEPRECATED_MACRO_BASE_(GCC warning string_literal_why) #define PETSC_DEPRECATED_MACRO_BASE_(why) _Pragma(#why) #define PETSC_DEPRECATED_OBJECT_BASE(string_literal_why) __declspec(deprecated(string_literal_why)) #define PETSC_DEPRECATED_TYPEDEF_BASE(string_literal_why) #define PETSC_DIR "E:\\balay\\petsc-3.21.2" #define PETSC_DIR_SEPARATOR '\\' #define 
PETSC_FORTRAN_CHARLEN_T int #define PETSC_FORTRAN_TYPE_INITIALIZE = -2 #define PETSC_FUNCTION_NAME_C __func__ #define PETSC_FUNCTION_NAME_CXX __func__ #define PETSC_HAVE_ACCESS 1 #define PETSC_HAVE_ATOLL 1 #define PETSC_HAVE_BUILTIN_EXPECT 1 #define PETSC_HAVE_C99_COMPLEX 1 #define PETSC_HAVE_CLOCK 1 #define PETSC_HAVE_CLOSESOCKET 1 #define PETSC_HAVE_CXX 1 #define PETSC_HAVE_CXX_COMPLEX 1 #define PETSC_HAVE_CXX_COMPLEX_FIX 1 #define PETSC_HAVE_CXX_DIALECT_CXX11 1 #define PETSC_HAVE_CXX_DIALECT_CXX14 1 #define PETSC_HAVE_CXX_DIALECT_CXX17 1 #define PETSC_HAVE_DIRECT_H 1 #define PETSC_HAVE_DOS_H 1 #define PETSC_HAVE_DOUBLE_ALIGN_MALLOC 1 #define PETSC_HAVE_ERF 1 #define PETSC_HAVE_FCNTL_H 1 #define PETSC_HAVE_FENV_H 1 #define PETSC_HAVE_FE_VALUES 1 #define PETSC_HAVE_FLOAT_H 1 #define PETSC_HAVE_FORTRAN_CAPS 1 #define PETSC_HAVE_FORTRAN_FLUSH 1 #define PETSC_HAVE_FORTRAN_FREE_LINE_LENGTH_NONE 1 #define PETSC_HAVE_FORTRAN_TYPE_STAR 1 #define PETSC_HAVE_FREELIBRARY 1 #define PETSC_HAVE_GETCOMPUTERNAME 1 #define PETSC_HAVE_GETCWD 1 #define PETSC_HAVE_GETLASTERROR 1 #define PETSC_HAVE_GETPROCADDRESS 1 #define PETSC_HAVE_GET_USER_NAME 1 #define PETSC_HAVE_IMMINTRIN_H 1 #define PETSC_HAVE_INTTYPES_H 1 #define PETSC_HAVE_IO_H 1 #define PETSC_HAVE_ISINF 1 #define PETSC_HAVE_ISNAN 1 #define PETSC_HAVE_ISNORMAL 1 #define PETSC_HAVE_LARGE_INTEGER_U 1 #define PETSC_HAVE_LGAMMA 1 #define PETSC_HAVE_LOADLIBRARY 1 #define PETSC_HAVE_LOG2 1 #define PETSC_HAVE_LSEEK 1 #define PETSC_HAVE_MALLOC_H 1 #define PETSC_HAVE_MEMMOVE 1 #define PETSC_HAVE_MKL_LIBS 1 #define PETSC_HAVE_MPIUNI 1 #define PETSC_HAVE_O_BINARY 1 #define PETSC_HAVE_PACKAGES ":blaslapack:mathlib:mpi:" #define PETSC_HAVE_RAND 1 #define PETSC_HAVE_SETJMP_H 1 #define PETSC_HAVE_SETLASTERROR 1 #define PETSC_HAVE_SNPRINTF 1 #define PETSC_HAVE_STDINT_H 1 #define PETSC_HAVE_STRICMP 1 #define PETSC_HAVE_SYS_TYPES_H 1 #define PETSC_HAVE_TAU_PERFSTUBS 1 #define PETSC_HAVE_TGAMMA 1 #define PETSC_HAVE_TIME 1 #define PETSC_HAVE_TIME_H 1 #define PETSC_HAVE_TMPNAM_S 1 #define PETSC_HAVE_VA_COPY 1 #define PETSC_HAVE_VSNPRINTF 1 #define PETSC_HAVE_WINDOWSX_H 1 #define PETSC_HAVE_WINDOWS_COMPILERS 1 #define PETSC_HAVE_WINDOWS_H 1 #define PETSC_HAVE_WINSOCK2_H 1 #define PETSC_HAVE_WS2TCPIP_H 1 #define PETSC_HAVE_WSAGETLASTERROR 1 #define PETSC_HAVE_XMMINTRIN_H 1 #define PETSC_HAVE__ACCESS 1 #define PETSC_HAVE__GETCWD 1 #define PETSC_HAVE__LSEEK 1 #define PETSC_HAVE__MKDIR 1 #define PETSC_HAVE__SLEEP 1 #define PETSC_HAVE__SNPRINTF 1 #define PETSC_HAVE___INT64 1 #define PETSC_INTPTR_T intptr_t #define PETSC_INTPTR_T_FMT "#" PRIxPTR #define PETSC_IS_COLORING_MAX USHRT_MAX #define PETSC_IS_COLORING_VALUE_TYPE short #define PETSC_IS_COLORING_VALUE_TYPE_F integer2 #define PETSC_LEVEL1_DCACHE_LINESIZE 32 #define PETSC_LIB_DIR "/cygdrive/e/balay/petsc-3.21.2/arch-mswin-c-debug/lib" #define PETSC_MAX_PATH_LEN 4096 #define PETSC_MEMALIGN 16 #define PETSC_MISSING_GETLINE 1 #define PETSC_MISSING_SIGALRM 1 #define PETSC_MISSING_SIGBUS 1 #define PETSC_MISSING_SIGCHLD 1 #define PETSC_MISSING_SIGCONT 1 #define PETSC_MISSING_SIGHUP 1 #define PETSC_MISSING_SIGKILL 1 #define PETSC_MISSING_SIGPIPE 1 #define PETSC_MISSING_SIGQUIT 1 #define PETSC_MISSING_SIGSTOP 1 #define PETSC_MISSING_SIGSYS 1 #define PETSC_MISSING_SIGTRAP 1 #define PETSC_MISSING_SIGTSTP 1 #define PETSC_MISSING_SIGURG 1 #define PETSC_MISSING_SIGUSR1 1 #define PETSC_MISSING_SIGUSR2 1 #define PETSC_MPICC_SHOW "Unavailable" #define PETSC_MPIU_IS_COLORING_VALUE_TYPE MPI_UNSIGNED_SHORT #define PETSC_NEEDS_UTYPE_TYPEDEFS 
1 #define PETSC_OMAKE "/usr/bin/make --no-print-directory" #define PETSC_PREFETCH_HINT_NTA _MM_HINT_NTA #define PETSC_PREFETCH_HINT_T0 _MM_HINT_T0 #define PETSC_PREFETCH_HINT_T1 _MM_HINT_T1 #define PETSC_PREFETCH_HINT_T2 _MM_HINT_T2 #define PETSC_PYTHON_EXE "/usr/bin/python3" #define PETSC_Prefetch(a,b,c) _mm_prefetch((const char*)(a),(c)) #define PETSC_REPLACE_DIR_SEPARATOR '/' #define PETSC_SIGNAL_CAST #define PETSC_SIZEOF_INT 4 #define PETSC_SIZEOF_LONG 4 #define PETSC_SIZEOF_LONG_LONG 8 #define PETSC_SIZEOF_SIZE_T 8 #define PETSC_SIZEOF_VOID_P 8 #define PETSC_SLSUFFIX "" #define PETSC_UINTPTR_T uintptr_t #define PETSC_UINTPTR_T_FMT "#" PRIxPTR #define PETSC_UNUSED #define PETSC_USE_AVX512_KERNELS 1 #define PETSC_USE_BACKWARD_LOOP 1 #define PETSC_USE_COMPLEX 1 #define PETSC_USE_CTABLE 1 #define PETSC_USE_DEBUG 1 #define PETSC_USE_DEBUGGER "gdb" #define PETSC_USE_DMLANDAU_2D 1 #define PETSC_USE_FORTRAN_BINDINGS 1 #define PETSC_USE_INFO 1 #define PETSC_USE_ISATTY 1 #define PETSC_USE_LOG 1 #define PETSC_USE_MICROSOFT_TIME 1 #define PETSC_USE_PROC_FOR_SIZE 1 #define PETSC_USE_REAL_DOUBLE 1 #define PETSC_USE_SINGLE_LIBRARY 1 #define PETSC_USE_WINDOWS_GRAPHICS 1 #define PETSC_USING_64BIT_PTR 1 #define PETSC_USING_F2003 1 #define PETSC_USING_F90FREEFORM 1 #define PETSC__BSD_SOURCE 1 #define PETSC__DEFAULT_SOURCE 1 #define R_OK 04 #define S_ISDIR(a) (((a)&_S_IFMT) == _S_IFDIR) #define S_ISREG(a) (((a)&_S_IFMT) == _S_IFREG) #define W_OK 02 #define X_OK 01 #define _USE_MATH_DEFINES 1 ----------------------------------------- Using C compile: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_icl -o .o -c -Qstd=c99 -MT -Z7 -Od mpicc -show: Unavailable C compiler version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM Intel(R) C++ Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 Using C++ compile: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_icl -o .o -c -MT -GR -EHsc -Z7 -Od -Qstd=c++17 -TP -I/cygdrive/e/balay/petsc-3.21.2/include -I/cygdrive/e/balay/petsc-3.21.2/arch-mswin-c-debug/include mpicxx -show: Unavailable C++ compiler version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM Intel(R) C++ Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 Using Fortran compile: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_ifort -o .o -c -MT -Z7 -Od -fpp -I/cygdrive/e/balay/petsc-3.21.2/include -I/cygdrive/e/balay/petsc-3.21.2/arch-mswin-c-debug/include mpif90 -show: Unavailable Fortran compiler version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 ----------------------------------------- Using C/C++ linker: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_icl Using C/C++ flags: -Qwd10161 -Qstd=c99 -MT -Z7 -Od Using Fortran linker: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_ifort Using Fortran flags: -MT -Z7 -Od -fpp ----------------------------------------- Using system modules: Using mpi.h: mpiuni ----------------------------------------- Using libraries: -L/cygdrive/e/balay/petsc-3.21.2/arch-mswin-c-debug/lib -L/cygdrive/c/PROGRA~2/Intel/oneAPI/mkl/latest/lib/intel64 -lpetsc mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib Gdi32.lib User32.lib Advapi32.lib Kernel32.lib Ws2_32.lib 
------------------------------------------ Using mpiexec: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/petsc-mpiexec.uni ------------------------------------------ Using MAKE: /usr/bin/make Default MAKEFLAGS: MAKE_NP:10 MAKE_LOAD:18.0 MAKEFLAGS: --no-print-directory -- PETSC_ARCH=arch-mswin-c-debug PETSC_DIR=/cygdrive/e/balay/petsc-3.21.2 ========================================== /usr/bin/make --print-directory -f gmakefile -j10 -l18.0 --output-sync=recurse V= libs /usr/bin/python3 ./config/gmakegen.py --petsc-arch=arch-mswin-c-debug CC arch-mswin-c-debug/obj/src/vec/vec/interface/veccreate.o veccreate.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/vecreg.o vecreg.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/vecregall.o vecregall.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/vector.o vector.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecglvis.o vecglvis.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/rvector.o rvector.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecs.o vecs.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecio.o vecio.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecstash.o vecstash.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vsection.o vsection.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vinv.o vinv.c CC arch-mswin-c-debug/obj/src/mat/graphops/coarsen/scoarsen.o scoarsen.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/fdaij.o fdaij.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/ij.o ij.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/inode2.o inode2.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/matrart.o matrart.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/mattransposematmult.o taoshell.c CC arch-mswin-c-debug/obj/src/tao/snes/taosnes.o taosnes.c CC arch-mswin-c-debug/obj/src/tao/util/ftn-auto/tao_utilf.o tao_utilf.c CC arch-mswin-c-debug/obj/src/tao/python/ftn-custom/zpythontaof.o zpythontaof.c CC arch-mswin-c-debug/obj/src/tao/util/tao_util.o tao_util.c FC arch-mswin-c-debug/obj/src/sys/f90-mod/petscsysmod.o FC arch-mswin-c-debug/obj/src/sys/mpiuni/fsrc/somempifort.o FC arch-mswin-c-debug/obj/src/sys/objects/f2003-src/fsrc/optionenum.o FC arch-mswin-c-debug/obj/src/vec/f90-mod/petscvecmod.o FC arch-mswin-c-debug/obj/src/sys/classes/bag/f2003-src/fsrc/bagenum.o FC arch-mswin-c-debug/obj/src/mat/f90-mod/petscmatmod.o FC arch-mswin-c-debug/obj/src/dm/f90-mod/petscdmmod.o FC arch-mswin-c-debug/obj/src/dm/f90-mod/petscdmswarmmod.o FC arch-mswin-c-debug/obj/src/dm/f90-mod/petscdmplexmod.o FC arch-mswin-c-debug/obj/src/dm/f90-mod/petscdmdamod.o FC arch-mswin-c-debug/obj/src/ksp/f90-mod/petsckspdefmod.o CC arch-mswin-c-debug/obj/src/tao/python/pythontao.o pythontao.c FC arch-mswin-c-debug/obj/src/ksp/f90-mod/petscpcmod.o FC arch-mswin-c-debug/obj/src/ksp/f90-mod/petsckspmod.o FC arch-mswin-c-debug/obj/src/snes/f90-mod/petscsnesmod.o FC arch-mswin-c-debug/obj/src/ts/f90-mod/petsctsmod.o FC arch-mswin-c-debug/obj/src/tao/f90-mod/petsctaomod.o AR arch-mswin-c-debug/lib/libpetsc.lib ========================================= Now to check if the libraries are working do: make PETSC_DIR=/cygdrive/e/balay/petsc-3.21.2 PETSC_ARCH=arch-mswin-c-debug check ========================================= balay at petsc-win01 /cygdrive/e/balay/petsc-3.21.2 $ make check Running PETSc check examples to verify correct installation Using PETSC_DIR=/cygdrive/e/balay/petsc-3.21.2 and PETSC_ARCH=arch-mswin-c-debug C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process Fortran example src/snes/tutorials/ex5f run successfully with 1 MPI process 
Completed PETSc check examples balay at petsc-win01 /cygdrive/e/balay/petsc-3.21.2 $ -------------- next part -------------- An HTML attachment was scrubbed... URL: From ligang0309 at gmail.com Sun Jun 30 00:42:59 2024 From: ligang0309 at gmail.com (Gang Li) Date: Sun, 30 Jun 2024 13:42:59 +0800 Subject: [petsc-users] =?utf-8?q?Problem_about_compiling_PETSc-3=2E21=2E2?= =?utf-8?q?_under_Cygwin?= In-Reply-To: References: <8E45A797-EC22-41B4-9222-5389EEAFCB64@gmail.com> <73B3587D-BE73-4DE3-8E89-6F395FC3F849@petsc.dev> <21e32b88-aed2-a618-3e3c-dca47c6bc456@fastmail.org> <5627D31E-5225-47CA-B337-A08E74C29D4A@gmail.com> <552dde2a-782a-5238-4897-18736ac9e94a@fastmail.org> <7620557F-4CB0-4E6A-91AF-B3C47DC1BCDD@gmail.com> <365c3d40-0f77-1158-1759-bb4c4e2b1dda@fastmail.org> <9d7974dd-22ba-49e8-d96d-d69cba5653bd@fastmail.org> Message-ID: <854C9B5E-1FF5-40B9-B45C-A61EACA2EE94@gmail.com> Hi Satish, I met another issue when make the lib: gli at WROKSTATION-OFFICE308 /cygdrive/c/Users/gli/Desktop/PETSc $ tar -xzf petsc-3.21.3.tar.gz gli at WROKSTATION-OFFICE308 /cygdrive/c/Users/gli/Desktop/PETSc $ cd petsc-3.21.3 gli at WROKSTATION-OFFICE308 /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3 $ cd petsc-3.21.3 -bash: cd: petsc-3.21.3: No such file or directory gli at WROKSTATION-OFFICE308 /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3 $ cygpath -u `cygpath -ms '/cygdrive/c/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64'` /cygdrive/c/PROGRA~2/INTELS~1/COMPIL~2/windows/mkl/lib/intel64 gli at WROKSTATION-OFFICE308 /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3 $ ./configure --with-cc=win32fe_icl --with-fc=win32fe_ifort --with-cxx=win32fe_icl \ --with-precision=double --with-scalar-type=complex \ --with-shared-libraries=0 \ --with-mpi=0 \ --with-blaslapack-lib='-L/cygdrive/c/PROGRA~2/INTELS~1/COMPIL~2/windows/mkl/lib/intel64 mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib' ================================================================================ Configuring PETSc to compile on your system ================================================================================ Compilers: C Compiler: /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/lib/petsc/bin/win32fe/win32fe_icl -Qstd=c99 -MT -Z7 -Od Version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM\nIntel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.8.275 Build 20180907 C++ Compiler: /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/lib/petsc/bin/win32fe/win32fe_icl -MT -GR -EHsc -Z7 -Od -Qstd=c++14 -TP Version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM\nIntel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.8.275 Build 20180907 Fortran Compiler: /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/lib/petsc/bin/win32fe/win32fe_ifort -MT -Z7 -Od -fpp Version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM\nIntel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.8.275 Build 20180907 Linkers: Static linker: /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/lib/petsc/bin/win32fe/win32fe_lib -a BlasLapack: Intel MKL Version: 20170004 Libraries: -L/cygdrive/c/PROGRA~2/INTELS~1/COMPIL~2/windows/mkl/lib/intel64 mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib Unknown if this uses OpenMP (try export OMP_NUM_THREADS=<1-4> yourprogram -log_view) uses 4 byte integers MPI: Version: PETSc MPIUNI 
uniprocessor MPI replacement mpiexec: ${PETSC_DIR}/lib/petsc/bin/petsc-mpiexec.uni python: Executable: /usr/bin/python3 mkl_sparse: Unknown if this uses OpenMP (try export OMP_NUM_THREADS=<1-4> yourprogram -log_view) mkl_sparse_optimize: Unknown if this uses OpenMP (try export OMP_NUM_THREADS=<1-4> yourprogram -log_view) PETSc: Language used to compile PETSc: C PETSC_ARCH: arch-mswin-c-debug PETSC_DIR: /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3 Prefix: Scalar type: complex Precision: double Integer size: 4 bytes Single library: yes Shared libraries: no Memory alignment from malloc(): 16 bytes Using GNU make: /usr/bin/make xxx=======================================================================================xxx Configure stage complete. Now build PETSc libraries with: make PETSC_DIR=/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3 PETSC_ARCH=arch-mswin-c-debug all xxx=======================================================================================xxx gli at WROKSTATION-OFFICE308 /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3 $ make make[2]: Entering directory '/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3' ========================================== See documentation/faq.html and documentation/bugreporting.html for help with installation problems. Please send EVERYTHING printed out below when reporting problems. Please check the mailing list archives and consider subscribing. https://urldefense.us/v3/__https://petsc.org/release/community/mailing/__;!!G_uCfscf7eWS!drRoCJiI5IcVVrrjYlGWO1leUL5hjFHVfTGJtV0Smxkw6N7wTSeO5I3sGNYcF_DVCZjpoTfUtIbHzDqPEqUK3Mfi$ ========================================== Starting make run on WROKSTATION-OFFICE308 at Sun, 30 Jun 2024 13:11:53 +0800 Machine characteristics: CYGWIN_NT-10.0-19045 WROKSTATION-OFFICE308 3.5.3-1.x86_64 2024-04-03 17:25 UTC x86_64 Cygwin ----------------------------------------- Using PETSc directory: /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3 Using PETSc arch: arch-mswin-c-debug ----------------------------------------- PETSC_VERSION_RELEASE 1 PETSC_VERSION_MAJOR 3 PETSC_VERSION_MINOR 21 PETSC_VERSION_SUBMINOR 3 PETSC_VERSION_DATE "Jun 28, 2024" PETSC_VERSION_GIT "v3.21.3" PETSC_VERSION_DATE_GIT "2024-06-28 11:53:00 -0500" ----------------------------------------- Using configure Options: --with-cc=win32fe_icl --with-fc=win32fe_ifort --with-cxx=win32fe_icl --with-precision=double --with-scalar-type=complex --with-shared-libraries=0 --with-mpi=0 --with-blaslapack-lib="-L/cygdrive/c/PROGRA~2/INTELS~1/COMPIL~2/windows/mkl/lib/intel64 mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib" Using configuration flags: #define MPI_Comm_create_errhandler(p_err_fun,p_errhandler) MPI_Errhandler_create((p_err_fun),(p_errhandler)) #define MPI_Comm_set_errhandler(comm,p_errhandler) MPI_Errhandler_set((comm),(p_errhandler)) #define MPI_Type_create_struct(count,lens,displs,types,newtype) MPI_Type_struct((count),(lens),(displs),(types),(newtype)) #define PETSC_ARCH "arch-mswin-c-debug" #define PETSC_ATTRIBUTEALIGNED(size) #define PETSC_BLASLAPACK_CAPS 1 #define PETSC_CANNOT_START_DEBUGGER 1 #define PETSC_CLANGUAGE_C 1 #define PETSC_CXX_RESTRICT __restrict #define PETSC_DEPRECATED_ENUM_BASE(string_literal_why) #define PETSC_DEPRECATED_FUNCTION_BASE(string_literal_why) __declspec(deprecated(string_literal_why)) #define PETSC_DEPRECATED_MACRO_BASE(string_literal_why) PETSC_DEPRECATED_MACRO_BASE_(GCC warning string_literal_why) #define PETSC_DEPRECATED_MACRO_BASE_(why) _Pragma(#why) #define 
PETSC_DEPRECATED_OBJECT_BASE(string_literal_why) __declspec(deprecated(string_literal_why)) #define PETSC_DEPRECATED_TYPEDEF_BASE(string_literal_why) #define PETSC_DIR "C:\\Users\\gli\\Desktop\\PETSc\\petsc-3.21.3" #define PETSC_DIR_SEPARATOR '\\' #define PETSC_FORTRAN_CHARLEN_T int #define PETSC_FORTRAN_TYPE_INITIALIZE = -2 #define PETSC_FUNCTION_NAME_C __func__ #define PETSC_FUNCTION_NAME_CXX __func__ #define PETSC_HAVE_ACCESS 1 #define PETSC_HAVE_ATOLL 1 #define PETSC_HAVE_BUILTIN_EXPECT 1 #define PETSC_HAVE_C99_COMPLEX 1 #define PETSC_HAVE_CLOCK 1 #define PETSC_HAVE_CLOSESOCKET 1 #define PETSC_HAVE_CXX 1 #define PETSC_HAVE_CXX_ATOMIC 1 #define PETSC_HAVE_CXX_COMPLEX 1 #define PETSC_HAVE_CXX_COMPLEX_FIX 1 #define PETSC_HAVE_CXX_DIALECT_CXX11 1 #define PETSC_HAVE_CXX_DIALECT_CXX14 1 #define PETSC_HAVE_DIRECT_H 1 #define PETSC_HAVE_DOS_H 1 #define PETSC_HAVE_DOUBLE_ALIGN_MALLOC 1 #define PETSC_HAVE_ERF 1 #define PETSC_HAVE_FCNTL_H 1 #define PETSC_HAVE_FENV_H 1 #define PETSC_HAVE_FE_VALUES 1 #define PETSC_HAVE_FLOAT_H 1 #define PETSC_HAVE_FORTRAN_CAPS 1 #define PETSC_HAVE_FORTRAN_FLUSH 1 #define PETSC_HAVE_FORTRAN_FREE_LINE_LENGTH_NONE 1 #define PETSC_HAVE_FORTRAN_TYPE_STAR 1 #define PETSC_HAVE_FREELIBRARY 1 #define PETSC_HAVE_GETCOMPUTERNAME 1 #define PETSC_HAVE_GETCWD 1 #define PETSC_HAVE_GETLASTERROR 1 #define PETSC_HAVE_GETPROCADDRESS 1 #define PETSC_HAVE_GET_USER_NAME 1 #define PETSC_HAVE_IMMINTRIN_H 1 #define PETSC_HAVE_INTTYPES_H 1 #define PETSC_HAVE_IO_H 1 #define PETSC_HAVE_ISINF 1 #define PETSC_HAVE_ISNAN 1 #define PETSC_HAVE_ISNORMAL 1 #define PETSC_HAVE_LARGE_INTEGER_U 1 #define PETSC_HAVE_LGAMMA 1 #define PETSC_HAVE_LOADLIBRARY 1 #define PETSC_HAVE_LOG2 1 #define PETSC_HAVE_LSEEK 1 #define PETSC_HAVE_MALLOC_H 1 #define PETSC_HAVE_MEMMOVE 1 #define PETSC_HAVE_MKL_LIBS 1 #define PETSC_HAVE_MKL_SPARSE 1 #define PETSC_HAVE_MKL_SPARSE_OPTIMIZE 1 #define PETSC_HAVE_MPIUNI 1 #define PETSC_HAVE_O_BINARY 1 #define PETSC_HAVE_PACKAGES ":blaslapack:mathlib:mkl_sparse:mkl_sparse_optimize:mpi:" #define PETSC_HAVE_RAND 1 #define PETSC_HAVE_SETJMP_H 1 #define PETSC_HAVE_SETLASTERROR 1 #define PETSC_HAVE_STDATOMIC_H 1 #define PETSC_HAVE_STDINT_H 1 #define PETSC_HAVE_STRICMP 1 #define PETSC_HAVE_SYS_TYPES_H 1 #define PETSC_HAVE_TAU_PERFSTUBS 1 #define PETSC_HAVE_TGAMMA 1 #define PETSC_HAVE_TIME 1 #define PETSC_HAVE_TIME_H 1 #define PETSC_HAVE_TMPNAM_S 1 #define PETSC_HAVE_VA_COPY 1 #define PETSC_HAVE_VSNPRINTF 1 #define PETSC_HAVE_WINDOWSX_H 1 #define PETSC_HAVE_WINDOWS_COMPILERS 1 #define PETSC_HAVE_WINDOWS_H 1 #define PETSC_HAVE_WINSOCK2_H 1 #define PETSC_HAVE_WS2TCPIP_H 1 #define PETSC_HAVE_WSAGETLASTERROR 1 #define PETSC_HAVE_XMMINTRIN_H 1 #define PETSC_HAVE__ACCESS 1 #define PETSC_HAVE__GETCWD 1 #define PETSC_HAVE__LSEEK 1 #define PETSC_HAVE__MKDIR 1 #define PETSC_HAVE__SLEEP 1 #define PETSC_HAVE___INT64 1 #define PETSC_INTPTR_T intptr_t #define PETSC_INTPTR_T_FMT "#" PRIxPTR #define PETSC_IS_COLORING_MAX USHRT_MAX #define PETSC_IS_COLORING_VALUE_TYPE short #define PETSC_IS_COLORING_VALUE_TYPE_F integer2 #define PETSC_LEVEL1_DCACHE_LINESIZE 32 #define PETSC_LIB_DIR "/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/arch-mswin-c-debug/lib" #define PETSC_MAX_PATH_LEN 4096 #define PETSC_MEMALIGN 16 #define PETSC_MISSING_GETLINE 1 #define PETSC_MISSING_SIGALRM 1 #define PETSC_MISSING_SIGBUS 1 #define PETSC_MISSING_SIGCHLD 1 #define PETSC_MISSING_SIGCONT 1 #define PETSC_MISSING_SIGHUP 1 #define PETSC_MISSING_SIGKILL 1 #define PETSC_MISSING_SIGPIPE 1 #define PETSC_MISSING_SIGQUIT 1 #define 
PETSC_MISSING_SIGSTOP 1 #define PETSC_MISSING_SIGSYS 1 #define PETSC_MISSING_SIGTRAP 1 #define PETSC_MISSING_SIGTSTP 1 #define PETSC_MISSING_SIGURG 1 #define PETSC_MISSING_SIGUSR1 1 #define PETSC_MISSING_SIGUSR2 1 #define PETSC_MPICC_SHOW "Unavailable" #define PETSC_MPIU_IS_COLORING_VALUE_TYPE MPI_UNSIGNED_SHORT #define PETSC_NEEDS_UTYPE_TYPEDEFS 1 #define PETSC_OMAKE "/usr/bin/make --no-print-directory" #define PETSC_PREFETCH_HINT_NTA _MM_HINT_NTA #define PETSC_PREFETCH_HINT_T0 _MM_HINT_T0 #define PETSC_PREFETCH_HINT_T1 _MM_HINT_T1 #define PETSC_PREFETCH_HINT_T2 _MM_HINT_T2 #define PETSC_PYTHON_EXE "/usr/bin/python3" #define PETSC_Prefetch(a,b,c) _mm_prefetch((const char*)(a),(c)) #define PETSC_REPLACE_DIR_SEPARATOR '/' #define PETSC_SIGNAL_CAST #define PETSC_SIZEOF_INT 4 #define PETSC_SIZEOF_LONG 4 #define PETSC_SIZEOF_LONG_LONG 8 #define PETSC_SIZEOF_SIZE_T 8 #define PETSC_SIZEOF_VOID_P 8 #define PETSC_SLSUFFIX "" #define PETSC_UINTPTR_T uintptr_t #define PETSC_UINTPTR_T_FMT "#" PRIxPTR #define PETSC_UNUSED #define PETSC_USE_AVX512_KERNELS 1 #define PETSC_USE_BACKWARD_LOOP 1 #define PETSC_USE_COMPLEX 1 #define PETSC_USE_CTABLE 1 #define PETSC_USE_DEBUG 1 #define PETSC_USE_DEBUGGER "gdb" #define PETSC_USE_DMLANDAU_2D 1 #define PETSC_USE_FORTRAN_BINDINGS 1 #define PETSC_USE_INFO 1 #define PETSC_USE_ISATTY 1 #define PETSC_USE_LOG 1 #define PETSC_USE_MICROSOFT_TIME 1 #define PETSC_USE_PROC_FOR_SIZE 1 #define PETSC_USE_REAL_DOUBLE 1 #define PETSC_USE_SINGLE_LIBRARY 1 #define PETSC_USE_WINDOWS_GRAPHICS 1 #define PETSC_USING_64BIT_PTR 1 #define PETSC_USING_F2003 1 #define PETSC_USING_F90FREEFORM 1 #define PETSC__BSD_SOURCE 1 #define PETSC__DEFAULT_SOURCE 1 #define R_OK 04 #define S_ISDIR(a) (((a)&_S_IFMT) == _S_IFDIR) #define S_ISREG(a) (((a)&_S_IFMT) == _S_IFREG) #define W_OK 02 #define X_OK 01 #define _USE_MATH_DEFINES 1 ----------------------------------------- Using C compile: /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/lib/petsc/bin/win32fe/win32fe_icl -o .o -c -Qstd=c99 -MT -Z7 -Od mpicc -show: Unavailable C compiler version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.8.275 Build 20180907 Using C++ compile: /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/lib/petsc/bin/win32fe/win32fe_icl -o .o -c -MT -GR -EHsc -Z7 -Od -Qstd=c++14 -TP -I/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/include -I/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/arch-mswin-c-debug/include mpicxx -show: Unavailable C++ compiler version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.8.275 Build 20180907 Using Fortran compile: /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/lib/petsc/bin/win32fe/win32fe_ifort -o .o -c -MT -Z7 -Od -fpp -I/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/include -I/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/arch-mswin-c-debug/include mpif90 -show: Unavailable Fortran compiler version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.8.275 Build 20180907 ----------------------------------------- Using C/C++ linker: /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/lib/petsc/bin/win32fe/win32fe_icl Using C/C++ flags: -Qwd10161 -Qstd=c99 -MT -Z7 -Od Using Fortran linker: 
/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/lib/petsc/bin/win32fe/win32fe_ifort Using Fortran flags: -MT -Z7 -Od -fpp ----------------------------------------- Using system modules: Using mpi.h: mpiuni ----------------------------------------- Using libraries: -L/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/arch-mswin-c-debug/lib -L/cygdrive/c/PROGRA~2/INTELS~1/COMPIL~2/windows/mkl/lib/intel64 -lpetsc mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib Gdi32.lib User32.lib Advapi32.lib Kernel32.lib Ws2_32.lib ------------------------------------------ Using mpiexec: /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/lib/petsc/bin/petsc-mpiexec.uni ------------------------------------------ Using MAKE: /usr/bin/make Default MAKEFLAGS: MAKE_NP:24 MAKE_LOAD:48.0 MAKEFLAGS: --no-print-directory -- PETSC_ARCH=arch-mswin-c-debug PETSC_DIR=/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3 ========================================== /usr/bin/make --print-directory -f gmakefile -j24 -l48.0 --output-sync=recurse V= libs make[3]: Entering directory '/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3' /usr/bin/python3 ./config/gmakegen.py --petsc-arch=arch-mswin-c-debug CC arch-mswin-c-debug/obj/src/sys/error/pstack.o pstack.c CC arch-mswin-c-debug/obj/src/sys/error/signal.o signal.c CC arch-mswin-c-debug/obj/src/sys/fileio/fwd.o fwd.c CC arch-mswin-c-debug/obj/src/sys/fileio/ghome.o ghome.c CC arch-mswin-c-debug/obj/src/sys/fileio/grpath.o grpath.c CC arch-mswin-c-debug/obj/src/sys/fileio/mpiuopen.o mpiuopen.c CC arch-mswin-c-debug/obj/src/sys/fileio/mprint.o mprint.c CC arch-mswin-c-debug/obj/src/sys/fileio/rpath.o rpath.c CC arch-mswin-c-debug/obj/src/sys/fileio/smatlab.o smatlab.c CC arch-mswin-c-debug/obj/src/sys/fileio/sysio.o sysio.c CC arch-mswin-c-debug/obj/src/sys/objects/garbage.o garbage.c CC arch-mswin-c-debug/obj/src/sys/objects/gcomm.o gcomm.c CC arch-mswin-c-debug/obj/src/sys/objects/gcookie.o gcookie.c CC arch-mswin-c-debug/obj/src/sys/objects/gtype.o gtype.c CC arch-mswin-c-debug/obj/src/sys/objects/inherit.o inherit.c CC arch-mswin-c-debug/obj/src/sys/objects/init.o init.c CC arch-mswin-c-debug/obj/src/sys/objects/olist.o olist.c CC arch-mswin-c-debug/obj/src/sys/objects/options.o options.c CC arch-mswin-c-debug/obj/src/sys/objects/package.o package.c CC arch-mswin-c-debug/obj/src/sys/objects/pgname.o pgname.c CC arch-mswin-c-debug/obj/src/sys/objects/optionsyaml.o optionsyaml.c C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\optionsyaml.c(297): warning #161: unrecognized #pragma #pragma GCC diagnostic push ^ C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\optionsyaml.c(298): warning #161: unrecognized #pragma #pragma GCC diagnostic ignored "-Wsign-conversion" ^ C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\optionsyaml.c(300): warning #161: unrecognized #pragma #pragma GCC diagnostic pop ^ CC arch-mswin-c-debug/obj/src/sys/objects/pname.o pname.c CC arch-mswin-c-debug/obj/src/sys/objects/pinit.o pinit.c CC arch-mswin-c-debug/obj/src/sys/objects/prefix.o prefix.c CC arch-mswin-c-debug/obj/src/sys/objects/ptype.o ptype.c CC arch-mswin-c-debug/obj/src/sys/objects/tagm.o tagm.c CC arch-mswin-c-debug/obj/src/sys/objects/subcomm.o subcomm.c CC arch-mswin-c-debug/obj/src/sys/objects/state.o state.c CC arch-mswin-c-debug/obj/src/sys/objects/version.o version.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/veccreate.o veccreate.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/vecreg.o vecreg.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/vecregall.o 
vecregall.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/rvector.o rvector.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/vector.o vector.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecglvis.o vecglvis.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecs.o vecs.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecio.o vecio.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecstash.o vecstash.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vsection.o vsection.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vinv.o vinv.c CC arch-mswin-c-debug/obj/src/mat/graphops/coarsen/scoarsen.o scoarsen.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/fdaij.o fdaij.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/ij.o ij.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/matmatmatmult.o matmatmatmult.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/inode2.o inode2.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/inode.o inode.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/matrart.o matrart.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/matmatmult.o matmatmult.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/matptap.o matptap.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/mattransposematmult.o mattransposematmult.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/symtranspose.o symtranspose.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolv.o baijsolv.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvnat1.o baijsolvnat1.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvnat14.o baijsolvnat14.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvnat11.o baijsolvnat11.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvnat15.o baijsolvnat15.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvnat2.o baijsolvnat2.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvnat3.o baijsolvnat3.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvnat4.o baijsolvnat4.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvnat5.o baijsolvnat5.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvnat6.o baijsolvnat6.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtran1.o baijsolvtran1.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvnat7.o baijsolvnat7.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtran2.o baijsolvtran2.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtran3.o baijsolvtran3.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtran4.o baijsolvtran4.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtran5.o baijsolvtran5.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtran6.o baijsolvtran6.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtran7.o baijsolvtran7.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtrann.o baijsolvtrann.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtrannat1.o baijsolvtrannat1.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtrannat2.o baijsolvtrannat2.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtrannat4.o baijsolvtrannat4.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtrannat3.o baijsolvtrannat3.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtrannat5.o baijsolvtrannat5.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtrannat6.o baijsolvtrannat6.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/dgedi.o dgedi.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/dgefa.o dgefa.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/baijsolvtrannat7.o 
baijsolvtrannat7.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/dgefa2.o dgefa2.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/dgefa4.o dgefa4.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/dgefa3.o dgefa3.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/dgefa5.o dgefa5.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/dgefa6.o dgefa6.c CC arch-mswin-c-debug/obj/src/mat/impls/baij/seq/dgefa7.o dgefa7.c CC arch-mswin-c-debug/obj/src/mat/interface/matnull.o matnull.c CC arch-mswin-c-debug/obj/src/mat/interface/matproduct.o matproduct.c CC arch-mswin-c-debug/obj/src/mat/interface/matreg.o matreg.c CC arch-mswin-c-debug/obj/src/mat/interface/matregis.o matregis.c CC arch-mswin-c-debug/obj/src/mat/interface/matrix.o matrix.c CC arch-mswin-c-debug/obj/src/dm/impls/da/gr1.o gr1.c CC arch-mswin-c-debug/obj/src/dm/impls/da/gr2.o gr2.c CC arch-mswin-c-debug/obj/src/dm/impls/da/grglvis.o grglvis.c CC arch-mswin-c-debug/obj/src/dm/impls/da/grvtk.o grvtk.c CC arch-mswin-c-debug/obj/src/dm/impls/swarm/swarm_migrate.o swarm_migrate.c CC arch-mswin-c-debug/obj/src/dm/impls/swarm/swarm.o swarm.c CC arch-mswin-c-debug/obj/src/dm/impls/swarm/swarmpic.o swarmpic.c CC arch-mswin-c-debug/obj/src/dm/impls/swarm/swarmpic_da.o swarmpic_da.c CC arch-mswin-c-debug/obj/src/dm/impls/swarm/swarmpic_plex.o swarmpic_plex.c CC arch-mswin-c-debug/obj/src/dm/impls/swarm/swarmpic_sort.o swarmpic_sort.c CC arch-mswin-c-debug/obj/src/dm/impls/swarm/swarmpic_view.o swarmpic_view.c CC arch-mswin-c-debug/obj/src/ksp/ksp/impls/gmres/gmpre.o gmpre.c CC arch-mswin-c-debug/obj/src/ksp/ksp/impls/gmres/gmreig.o gmreig.c CC arch-mswin-c-debug/obj/src/ksp/ksp/impls/gmres/gmres.o gmres.c CC arch-mswin-c-debug/obj/src/ksp/ksp/impls/gmres/gmres2.o gmres2.c CC arch-mswin-c-debug/obj/src/ksp/ksp/interface/iguess.o iguess.c CC arch-mswin-c-debug/obj/src/ksp/ksp/interface/itcl.o itcl.c CC arch-mswin-c-debug/obj/src/ksp/ksp/interface/itcreate.o itcreate.c CC arch-mswin-c-debug/obj/src/ksp/ksp/interface/itregis.o itregis.c CC arch-mswin-c-debug/obj/src/ksp/ksp/interface/iterativ.o iterativ.c CC arch-mswin-c-debug/obj/src/ksp/ksp/interface/itres.o itres.c CC arch-mswin-c-debug/obj/src/ksp/ksp/interface/itfunc.o itfunc.c CC arch-mswin-c-debug/obj/src/ksp/ksp/interface/xmon.o xmon.c CC arch-mswin-c-debug/obj/src/ksp/pc/impls/mg/gdsw.o gdsw.c CC arch-mswin-c-debug/obj/src/ksp/pc/impls/mg/mgfunc.o mgfunc.c CC arch-mswin-c-debug/obj/src/ksp/pc/impls/mg/mg.o mg.c CC arch-mswin-c-debug/obj/src/ksp/pc/impls/mg/mgadapt.o mgadapt.c CC arch-mswin-c-debug/obj/src/ksp/pc/impls/mg/smg.o smg.c CC arch-mswin-c-debug/obj/src/snes/interface/snesj.o snesj.c CC arch-mswin-c-debug/obj/src/snes/interface/snesj2.o snesj2.c CC arch-mswin-c-debug/obj/src/snes/interface/snesob.o snesob.c CC arch-mswin-c-debug/obj/src/snes/interface/snespc.o snespc.c CC arch-mswin-c-debug/obj/src/snes/interface/snesregi.o snesregi.c CC arch-mswin-c-debug/obj/src/snes/interface/snesut.o snesut.c CC arch-mswin-c-debug/obj/src/snes/interface/snes.o snes.c CC arch-mswin-c-debug/obj/src/ts/interface/tscreate.o tscreate.c CC arch-mswin-c-debug/obj/src/ts/interface/tseig.o tseig.c CC arch-mswin-c-debug/obj/src/ts/interface/tshistory.o tshistory.c CC arch-mswin-c-debug/obj/src/ts/interface/tsreg.o tsreg.c CC arch-mswin-c-debug/obj/src/ts/interface/tsmon.o tsmon.c CC arch-mswin-c-debug/obj/src/ts/interface/ts.o ts.c CC arch-mswin-c-debug/obj/src/ts/interface/tsregall.o tsregall.c CC arch-mswin-c-debug/obj/src/ts/interface/tsrhssplit.o tsrhssplit.c CC 
arch-mswin-c-debug/obj/src/ts/utils/dmplexts.o dmplexts.c
CC arch-mswin-c-debug/obj/src/ts/utils/tsconvest.o tsconvest.c
CC arch-mswin-c-debug/obj/src/ts/utils/dmts.o dmts.c
FC arch-mswin-c-debug/obj/src/sys/mpiuni/f90-mod/mpiunimod.o
FC arch-mswin-c-debug/obj/src/sys/f90-src/fsrc/f90_fwrap.o
FC arch-mswin-c-debug/obj/src/sys/fsrc/somefort.o
CXX arch-mswin-c-debug/obj/src/sys/dll/cxx/demangle.o demangle.cxx
CXX arch-mswin-c-debug/obj/src/sys/objects/device/impls/host/hostcontext.o hostcontext.cxx
CXX arch-mswin-c-debug/obj/src/sys/objects/cxx/object_pool.o object_pool.cxx
C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\cxx\object_pool.cxx(330): error: no instance of function template "Petsc::util::construct_at" matches the argument list
            argument types are: (Petsc::memory::PoolAllocator::AllocationHeader *, Petsc::memory::PoolAllocator::size_type, Petsc::memory::PoolAllocator::align_type)
    PetscCallCXX(base_ptr = reinterpret_cast(util::construct_at(reinterpret_cast(base_ptr), size, align)));
    ^
C:\Users\gli\Desktop\PETSc\PETSC-~1.3\include\petsc/private/cpp/memory.hpp(77): note: this candidate was rejected because at least one template argument could not be deduced
    inline constexpr T *construct_at(T *ptr, Args &&...args) noexcept(std::is_nothrow_constructible::value)
    ^
compilation aborted for C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\cxx\object_pool.cxx (code 2)
make[3]: *** [gmakefile:203: arch-mswin-c-debug/obj/src/sys/objects/cxx/object_pool.o] Error 2
make[3]: *** Waiting for unfinished jobs....
CXX arch-mswin-c-debug/obj/src/sys/objects/device/impls/host/hostdevice.o hostdevice.cxx
CXX arch-mswin-c-debug/obj/src/sys/objects/device/interface/dcontext.o dcontext.cxx
C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\device\INTERF~1\petscdevice_interface_internal.hpp(47): error: defaulted default constructor cannot be constexpr because the corresponding implicitly declared default constructor would not be constexpr
    constexpr _n_WeakContext() noexcept = default;
    ^
compilation aborted for C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\device\INTERF~1\dcontext.cxx (code 2)
make[3]: *** [gmakefile:203: arch-mswin-c-debug/obj/src/sys/objects/device/interface/dcontext.o] Error 2
CXX arch-mswin-c-debug/obj/src/sys/objects/device/interface/global_dcontext.o global_dcontext.cxx
C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\device\INTERF~1\petscdevice_interface_internal.hpp(47): error: defaulted default constructor cannot be constexpr because the corresponding implicitly declared default constructor would not be constexpr
    constexpr _n_WeakContext() noexcept = default;
    ^
compilation aborted for C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\device\INTERF~1\global_dcontext.cxx (code 2)
make[3]: *** [gmakefile:203: arch-mswin-c-debug/obj/src/sys/objects/device/interface/global_dcontext.o] Error 2
CXX arch-mswin-c-debug/obj/src/sys/objects/device/interface/device.o device.cxx
C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\device\INTERF~1\petscdevice_interface_internal.hpp(47): error: defaulted default constructor cannot be constexpr because the corresponding implicitly declared default constructor would not be constexpr
    constexpr _n_WeakContext() noexcept = default;
    ^
compilation aborted for C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\device\INTERF~1\device.cxx (code 2)
make[3]: *** [gmakefile:203: arch-mswin-c-debug/obj/src/sys/objects/device/interface/device.o] Error 2
CXX arch-mswin-c-debug/obj/src/sys/objects/device/interface/mark_dcontext.o mark_dcontext.cxx
C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\device\INTERF~1\petscdevice_interface_internal.hpp(47): error: defaulted default constructor cannot be constexpr because the corresponding implicitly declared default constructor would not be constexpr
    constexpr _n_WeakContext() noexcept = default;
    ^
compilation aborted for C:\Users\gli\Desktop\PETSc\PETSC-~1.3\src\sys\objects\device\INTERF~1\mark_dcontext.cxx (code 2)
make[3]: *** [gmakefile:203: arch-mswin-c-debug/obj/src/sys/objects/device/interface/mark_dcontext.o] Error 2
make[3]: Leaving directory '/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3'
make[2]: *** [/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3/lib/petsc/conf/rules_doc.mk:5: libs] Error 2
make[2]: Leaving directory '/cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3'
**************************ERROR*************************************
Error during compile, check arch-mswin-c-debug/lib/petsc/conf/make.log
Send it and arch-mswin-c-debug/lib/petsc/conf/configure.log to petsc-maint at mcs.anl.gov
********************************************************************
make[1]: *** [makefile:44: all] Error 1
make: *** [GNUmakefile:9: all] Error 2

gli at WROKSTATION-OFFICE308 /cygdrive/c/Users/gli/Desktop/PETSc/petsc-3.21.3 $

Sincerely,
Gang
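[Editorial note, not part of the original thread: both compile failures above are in PETSc's C++ sources, and the compiler recorded in this log is Intel C++ 17.0.8 driven with -Qstd=c++14, whereas the successful build quoted further below uses Intel C++ Classic 2021.6.0 with -Qstd=c++17. A stand-alone probe along the following lines -- a sketch with hypothetical names, not PETSc code -- mirrors the two rejected constructs (a variadic construct-at helper with a deduced noexcept clause, and a defaulted default constructor declared constexpr). A conforming C++14 compiler accepts it, so a failure when compiling this file with the same front end would point at the compiler rather than at PETSc itself.]

    // cxx14_probe.cxx -- hypothetical stand-alone file, not part of PETSc
    #include <cstddef>
    #include <new>
    #include <type_traits>
    #include <utility>

    // Variadic perfect-forwarding "construct at" helper with a deduced noexcept
    // clause, similar in shape to the candidate reported from memory.hpp(77).
    template <typename T, typename... Args>
    T *construct_at_probe(T *ptr, Args &&...args) noexcept(std::is_nothrow_constructible<T, Args...>::value)
    {
      return ::new (static_cast<void *>(ptr)) T(std::forward<Args>(args)...);
    }

    struct HeaderProbe {
      std::size_t size;
      std::size_t align;
      constexpr HeaderProbe(std::size_t s, std::size_t a) noexcept : size(s), align(a) {}
    };

    // Defaulted default constructor explicitly marked constexpr, as in the
    // _n_WeakContext declaration reported from petscdevice_interface_internal.hpp(47).
    struct WeakContextProbe {
      HeaderProbe h{0, 0};
      constexpr WeakContextProbe() noexcept = default;
    };

    int main()
    {
      alignas(HeaderProbe) unsigned char buf[sizeof(HeaderProbe)];
      HeaderProbe *h = construct_at_probe(reinterpret_cast<HeaderProbe *>(buf), std::size_t{16}, std::size_t{8});
      WeakContextProbe w;
      return (h->align == 8 && w.h.size == 0) ? 0 : 1;
    }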
---- Replied Message ----
From: Gang Li
Date: 6/30/2024 12:43
To: petsc-users
Subject: Re: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin

Hi Satish,

Thanks for your help. I found the problem: after I uninstalled the Perl software under Windows, the configure now works.

Sincerely,
Gang

---- Replied Message ----
From: Satish Balay
Date: 6/28/2024 13:51
To: petsc-users
Cc: Gang Li
Subject: Re: [petsc-users] Problem about compiling PETSc-3.21.2 under Cygwin

Here is what I get

Satish

----

balay at petsc-win01 /cygdrive/e/balay $ wget -q https://urldefense.us/v3/__https://web.cels.anl.gov/projects/petsc/download/release-snapshots/petsc-3.21.2.tar.gz__;!!G_uCfscf7eWS!drRoCJiI5IcVVrrjYlGWO1leUL5hjFHVfTGJtV0Smxkw6N7wTSeO5I3sGNYcF_DVCZjpoTfUtIbHzDqPEqlDiEPK$
balay at petsc-win01 /cygdrive/e/balay $ tar -xzf petsc-3.21.2.tar.gz
balay at petsc-win01 /cygdrive/e/balay $ cd petsc-3.21.2
balay at petsc-win01 /cygdrive/e/balay/petsc-3.21.2 $ ./configure --with-cc=win32fe_icl --with-fc=win32fe_ifort --with-cxx=win32fe_icl --with-precision=double --with-scalar-type=complex --with-shared-libraries=0 --with-mpi=0 '--with-blaslapack-lib=-L/cygdrive/c/PROGRA~2/Intel/oneAPI/mkl/latest/lib/intel64 mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib'
=============================================================================================
                     Configuring PETSc to compile on your system
=============================================================================================
Compilers:
  C Compiler:       /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_icl -Qstd=c99 -MT -Z7 -Od
    Version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM\nIntel(R) C++ Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000
  C++ Compiler:     /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_icl -MT -GR -EHsc -Z7 -Od -Qstd=c++17 -TP
    Version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM\nIntel(R) C++ Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000
  Fortran Compiler: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_ifort -MT -Z7 -Od -fpp
    Version: Win32 Development Tool Front
End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM\nIntel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 Linkers: Static linker: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_lib -a BlasLapack: Libraries: -L/cygdrive/c/PROGRA~2/Intel/oneAPI/mkl/latest/lib/intel64 mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib Unknown if this uses OpenMP (try export OMP_NUM_THREADS=<1-4> yourprogram -log_view) uses 4 byte integers MPI: Version: PETSc MPIUNI uniprocessor MPI replacement mpiexec: ${PETSC_DIR}/lib/petsc/bin/petsc-mpiexec.uni python: Executable: /usr/bin/python3 cmake: Version: 3.20.0 Executable: /usr/bin/cmake bison: Version: 3.8 Executable: /usr/bin/bison PETSc: Language used to compile PETSc: C PETSC_ARCH: arch-mswin-c-debug PETSC_DIR: /cygdrive/e/balay/petsc-3.21.2 Prefix: Scalar type: complex Precision: double Integer size: 4 bytes Single library: yes Shared libraries: no Memory alignment from malloc(): 16 bytes Using GNU make: /usr/bin/make xxx=======================================================================================xxx Configure stage complete. Now build PETSc libraries with: make PETSC_DIR=/cygdrive/e/balay/petsc-3.21.2 PETSC_ARCH=arch-mswin-c-debug all xxx=======================================================================================xxx balay at petsc-win01 /cygdrive/e/balay/petsc-3.21.2 $ ls -l lib/petsc/conf/ total 135 -rw-r--r--+ 1 balay Domain Users 391 Mar 29 08:59 bfort-base.txt -rw-r--r--+ 1 balay Domain Users 877 Mar 29 08:59 bfort-mpi.txt -rw-r--r--+ 1 balay Domain Users 5735 Mar 29 19:34 bfort-petsc.txt -rw-rw-r--+ 1 balay Domain Users 136 Jun 28 00:33 petscvariables -rw-r--r--+ 1 balay Domain Users 13140 May 29 14:34 rules -rw-r--r--+ 1 balay Domain Users 613 Mar 29 19:34 rules_doc.mk -rw-r--r--+ 1 balay Domain Users 16516 May 29 14:06 rules_util.mk -rw-r--r--+ 1 balay Domain Users 119 Mar 29 08:59 test -rw-r--r--+ 1 balay Domain Users 71503 Mar 29 08:59 uncrustify.cfg -rw-r--r--+ 1 balay Domain Users 4769 Mar 29 19:34 variables balay at petsc-win01 /cygdrive/e/balay/petsc-3.21.2 $ make ========================================== See documentation/faq.html and documentation/bugreporting.html for help with installation problems. Please send EVERYTHING printed out below when reporting problems. Please check the mailing list archives and consider subscribing. 
https://urldefense.us/v3/__https://petsc.org/release/community/mailing/__;!!G_uCfscf7eWS!drRoCJiI5IcVVrrjYlGWO1leUL5hjFHVfTGJtV0Smxkw6N7wTSeO5I3sGNYcF_DVCZjpoTfUtIbHzDqPEqUK3Mfi$ ========================================== Starting make run on petsc-win01 at Fri, 28 Jun 2024 00:34:15 -0500 Machine characteristics: CYGWIN_NT-10.0 petsc-win01 3.2.0(0.340/5/3) 2021-03-29 08:42 x86_64 Cygwin ----------------------------------------- Using PETSc directory: /cygdrive/e/balay/petsc-3.21.2 Using PETSc arch: arch-mswin-c-debug ----------------------------------------- PETSC_VERSION_RELEASE 1 PETSC_VERSION_MAJOR 3 PETSC_VERSION_MINOR 21 PETSC_VERSION_SUBMINOR 2 PETSC_VERSION_DATE "May 29, 2024" PETSC_VERSION_GIT "v3.21.2" PETSC_VERSION_DATE_GIT "2024-05-29 14:05:28 -0500" ----------------------------------------- Using configure Options: --with-cc=win32fe_icl --with-fc=win32fe_ifort --with-cxx=win32fe_icl --with-precision=double --with-scalar-type=complex --with-shared-libraries=0 --with-mpi=0 --with-blaslapack-lib="-L/cygdrive/c/PROGRA~2/Intel/oneAPI/mkl/latest/lib/intel64 mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib" Using configuration flags: #define MPI_Comm_create_errhandler(p_err_fun,p_errhandler) MPI_Errhandler_create((p_err_fun),(p_errhandler)) #define MPI_Comm_set_errhandler(comm,p_errhandler) MPI_Errhandler_set((comm),(p_errhandler)) #define MPI_Type_create_struct(count,lens,displs,types,newtype) MPI_Type_struct((count),(lens),(displs),(types),(newtype)) #define PETSC_ARCH "arch-mswin-c-debug" #define PETSC_ATTRIBUTEALIGNED(size) #define PETSC_BLASLAPACK_CAPS 1 #define PETSC_CANNOT_START_DEBUGGER 1 #define PETSC_CLANGUAGE_C 1 #define PETSC_CXX_RESTRICT __restrict #define PETSC_DEPRECATED_ENUM_BASE(string_literal_why) #define PETSC_DEPRECATED_FUNCTION_BASE(string_literal_why) __declspec(deprecated(string_literal_why)) #define PETSC_DEPRECATED_MACRO_BASE(string_literal_why) PETSC_DEPRECATED_MACRO_BASE_(GCC warning string_literal_why) #define PETSC_DEPRECATED_MACRO_BASE_(why) _Pragma(#why) #define PETSC_DEPRECATED_OBJECT_BASE(string_literal_why) __declspec(deprecated(string_literal_why)) #define PETSC_DEPRECATED_TYPEDEF_BASE(string_literal_why) #define PETSC_DIR "E:\\balay\\petsc-3.21.2" #define PETSC_DIR_SEPARATOR '\\' #define PETSC_FORTRAN_CHARLEN_T int #define PETSC_FORTRAN_TYPE_INITIALIZE = -2 #define PETSC_FUNCTION_NAME_C __func__ #define PETSC_FUNCTION_NAME_CXX __func__ #define PETSC_HAVE_ACCESS 1 #define PETSC_HAVE_ATOLL 1 #define PETSC_HAVE_BUILTIN_EXPECT 1 #define PETSC_HAVE_C99_COMPLEX 1 #define PETSC_HAVE_CLOCK 1 #define PETSC_HAVE_CLOSESOCKET 1 #define PETSC_HAVE_CXX 1 #define PETSC_HAVE_CXX_COMPLEX 1 #define PETSC_HAVE_CXX_COMPLEX_FIX 1 #define PETSC_HAVE_CXX_DIALECT_CXX11 1 #define PETSC_HAVE_CXX_DIALECT_CXX14 1 #define PETSC_HAVE_CXX_DIALECT_CXX17 1 #define PETSC_HAVE_DIRECT_H 1 #define PETSC_HAVE_DOS_H 1 #define PETSC_HAVE_DOUBLE_ALIGN_MALLOC 1 #define PETSC_HAVE_ERF 1 #define PETSC_HAVE_FCNTL_H 1 #define PETSC_HAVE_FENV_H 1 #define PETSC_HAVE_FE_VALUES 1 #define PETSC_HAVE_FLOAT_H 1 #define PETSC_HAVE_FORTRAN_CAPS 1 #define PETSC_HAVE_FORTRAN_FLUSH 1 #define PETSC_HAVE_FORTRAN_FREE_LINE_LENGTH_NONE 1 #define PETSC_HAVE_FORTRAN_TYPE_STAR 1 #define PETSC_HAVE_FREELIBRARY 1 #define PETSC_HAVE_GETCOMPUTERNAME 1 #define PETSC_HAVE_GETCWD 1 #define PETSC_HAVE_GETLASTERROR 1 #define PETSC_HAVE_GETPROCADDRESS 1 #define PETSC_HAVE_GET_USER_NAME 1 #define PETSC_HAVE_IMMINTRIN_H 1 #define PETSC_HAVE_INTTYPES_H 1 #define PETSC_HAVE_IO_H 1 #define PETSC_HAVE_ISINF 1 
#define PETSC_HAVE_ISNAN 1 #define PETSC_HAVE_ISNORMAL 1 #define PETSC_HAVE_LARGE_INTEGER_U 1 #define PETSC_HAVE_LGAMMA 1 #define PETSC_HAVE_LOADLIBRARY 1 #define PETSC_HAVE_LOG2 1 #define PETSC_HAVE_LSEEK 1 #define PETSC_HAVE_MALLOC_H 1 #define PETSC_HAVE_MEMMOVE 1 #define PETSC_HAVE_MKL_LIBS 1 #define PETSC_HAVE_MPIUNI 1 #define PETSC_HAVE_O_BINARY 1 #define PETSC_HAVE_PACKAGES ":blaslapack:mathlib:mpi:" #define PETSC_HAVE_RAND 1 #define PETSC_HAVE_SETJMP_H 1 #define PETSC_HAVE_SETLASTERROR 1 #define PETSC_HAVE_SNPRINTF 1 #define PETSC_HAVE_STDINT_H 1 #define PETSC_HAVE_STRICMP 1 #define PETSC_HAVE_SYS_TYPES_H 1 #define PETSC_HAVE_TAU_PERFSTUBS 1 #define PETSC_HAVE_TGAMMA 1 #define PETSC_HAVE_TIME 1 #define PETSC_HAVE_TIME_H 1 #define PETSC_HAVE_TMPNAM_S 1 #define PETSC_HAVE_VA_COPY 1 #define PETSC_HAVE_VSNPRINTF 1 #define PETSC_HAVE_WINDOWSX_H 1 #define PETSC_HAVE_WINDOWS_COMPILERS 1 #define PETSC_HAVE_WINDOWS_H 1 #define PETSC_HAVE_WINSOCK2_H 1 #define PETSC_HAVE_WS2TCPIP_H 1 #define PETSC_HAVE_WSAGETLASTERROR 1 #define PETSC_HAVE_XMMINTRIN_H 1 #define PETSC_HAVE__ACCESS 1 #define PETSC_HAVE__GETCWD 1 #define PETSC_HAVE__LSEEK 1 #define PETSC_HAVE__MKDIR 1 #define PETSC_HAVE__SLEEP 1 #define PETSC_HAVE__SNPRINTF 1 #define PETSC_HAVE___INT64 1 #define PETSC_INTPTR_T intptr_t #define PETSC_INTPTR_T_FMT "#" PRIxPTR #define PETSC_IS_COLORING_MAX USHRT_MAX #define PETSC_IS_COLORING_VALUE_TYPE short #define PETSC_IS_COLORING_VALUE_TYPE_F integer2 #define PETSC_LEVEL1_DCACHE_LINESIZE 32 #define PETSC_LIB_DIR "/cygdrive/e/balay/petsc-3.21.2/arch-mswin-c-debug/lib" #define PETSC_MAX_PATH_LEN 4096 #define PETSC_MEMALIGN 16 #define PETSC_MISSING_GETLINE 1 #define PETSC_MISSING_SIGALRM 1 #define PETSC_MISSING_SIGBUS 1 #define PETSC_MISSING_SIGCHLD 1 #define PETSC_MISSING_SIGCONT 1 #define PETSC_MISSING_SIGHUP 1 #define PETSC_MISSING_SIGKILL 1 #define PETSC_MISSING_SIGPIPE 1 #define PETSC_MISSING_SIGQUIT 1 #define PETSC_MISSING_SIGSTOP 1 #define PETSC_MISSING_SIGSYS 1 #define PETSC_MISSING_SIGTRAP 1 #define PETSC_MISSING_SIGTSTP 1 #define PETSC_MISSING_SIGURG 1 #define PETSC_MISSING_SIGUSR1 1 #define PETSC_MISSING_SIGUSR2 1 #define PETSC_MPICC_SHOW "Unavailable" #define PETSC_MPIU_IS_COLORING_VALUE_TYPE MPI_UNSIGNED_SHORT #define PETSC_NEEDS_UTYPE_TYPEDEFS 1 #define PETSC_OMAKE "/usr/bin/make --no-print-directory" #define PETSC_PREFETCH_HINT_NTA _MM_HINT_NTA #define PETSC_PREFETCH_HINT_T0 _MM_HINT_T0 #define PETSC_PREFETCH_HINT_T1 _MM_HINT_T1 #define PETSC_PREFETCH_HINT_T2 _MM_HINT_T2 #define PETSC_PYTHON_EXE "/usr/bin/python3" #define PETSC_Prefetch(a,b,c) _mm_prefetch((const char*)(a),(c)) #define PETSC_REPLACE_DIR_SEPARATOR '/' #define PETSC_SIGNAL_CAST #define PETSC_SIZEOF_INT 4 #define PETSC_SIZEOF_LONG 4 #define PETSC_SIZEOF_LONG_LONG 8 #define PETSC_SIZEOF_SIZE_T 8 #define PETSC_SIZEOF_VOID_P 8 #define PETSC_SLSUFFIX "" #define PETSC_UINTPTR_T uintptr_t #define PETSC_UINTPTR_T_FMT "#" PRIxPTR #define PETSC_UNUSED #define PETSC_USE_AVX512_KERNELS 1 #define PETSC_USE_BACKWARD_LOOP 1 #define PETSC_USE_COMPLEX 1 #define PETSC_USE_CTABLE 1 #define PETSC_USE_DEBUG 1 #define PETSC_USE_DEBUGGER "gdb" #define PETSC_USE_DMLANDAU_2D 1 #define PETSC_USE_FORTRAN_BINDINGS 1 #define PETSC_USE_INFO 1 #define PETSC_USE_ISATTY 1 #define PETSC_USE_LOG 1 #define PETSC_USE_MICROSOFT_TIME 1 #define PETSC_USE_PROC_FOR_SIZE 1 #define PETSC_USE_REAL_DOUBLE 1 #define PETSC_USE_SINGLE_LIBRARY 1 #define PETSC_USE_WINDOWS_GRAPHICS 1 #define PETSC_USING_64BIT_PTR 1 #define PETSC_USING_F2003 1 #define 
PETSC_USING_F90FREEFORM 1 #define PETSC__BSD_SOURCE 1 #define PETSC__DEFAULT_SOURCE 1 #define R_OK 04 #define S_ISDIR(a) (((a)&_S_IFMT) == _S_IFDIR) #define S_ISREG(a) (((a)&_S_IFMT) == _S_IFREG) #define W_OK 02 #define X_OK 01 #define _USE_MATH_DEFINES 1 ----------------------------------------- Using C compile: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_icl -o .o -c -Qstd=c99 -MT -Z7 -Od mpicc -show: Unavailable C compiler version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM Intel(R) C++ Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 Using C++ compile: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_icl -o .o -c -MT -GR -EHsc -Z7 -Od -Qstd=c++17 -TP -I/cygdrive/e/balay/petsc-3.21.2/include -I/cygdrive/e/balay/petsc-3.21.2/arch-mswin-c-debug/include mpicxx -show: Unavailable C++ compiler version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM Intel(R) C++ Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 Using Fortran compile: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_ifort -o .o -c -MT -Z7 -Od -fpp -I/cygdrive/e/balay/petsc-3.21.2/include -I/cygdrive/e/balay/petsc-3.21.2/arch-mswin-c-debug/include mpif90 -show: Unavailable Fortran compiler version: Win32 Development Tool Front End, version 1.11.4 Fri, Sep 10, 2021 6:33:40 PM Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 ----------------------------------------- Using C/C++ linker: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_icl Using C/C++ flags: -Qwd10161 -Qstd=c99 -MT -Z7 -Od Using Fortran linker: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/win32fe/win32fe_ifort Using Fortran flags: -MT -Z7 -Od -fpp ----------------------------------------- Using system modules: Using mpi.h: mpiuni ----------------------------------------- Using libraries: -L/cygdrive/e/balay/petsc-3.21.2/arch-mswin-c-debug/lib -L/cygdrive/c/PROGRA~2/Intel/oneAPI/mkl/latest/lib/intel64 -lpetsc mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib Gdi32.lib User32.lib Advapi32.lib Kernel32.lib Ws2_32.lib ------------------------------------------ Using mpiexec: /cygdrive/e/balay/petsc-3.21.2/lib/petsc/bin/petsc-mpiexec.uni ------------------------------------------ Using MAKE: /usr/bin/make Default MAKEFLAGS: MAKE_NP:10 MAKE_LOAD:18.0 MAKEFLAGS: --no-print-directory -- PETSC_ARCH=arch-mswin-c-debug PETSC_DIR=/cygdrive/e/balay/petsc-3.21.2 ========================================== /usr/bin/make --print-directory -f gmakefile -j10 -l18.0 --output-sync=recurse V= libs /usr/bin/python3 ./config/gmakegen.py --petsc-arch=arch-mswin-c-debug CC arch-mswin-c-debug/obj/src/vec/vec/interface/veccreate.o veccreate.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/vecreg.o vecreg.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/vecregall.o vecregall.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/vector.o vector.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecglvis.o vecglvis.c CC arch-mswin-c-debug/obj/src/vec/vec/interface/rvector.o rvector.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecs.o vecs.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecio.o vecio.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vecstash.o vecstash.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vsection.o vsection.c CC arch-mswin-c-debug/obj/src/vec/vec/utils/vinv.o 
vinv.c CC arch-mswin-c-debug/obj/src/mat/graphops/coarsen/scoarsen.o scoarsen.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/fdaij.o fdaij.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/ij.o ij.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/inode2.o inode2.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/matrart.o matrart.c CC arch-mswin-c-debug/obj/src/mat/impls/aij/seq/mattransposematmult.o taoshell.c CC arch-mswin-c-debug/obj/src/tao/snes/taosnes.o taosnes.c CC arch-mswin-c-debug/obj/src/tao/util/ftn-auto/tao_utilf.o tao_utilf.c CC arch-mswin-c-debug/obj/src/tao/python/ftn-custom/zpythontaof.o zpythontaof.c CC arch-mswin-c-debug/obj/src/tao/util/tao_util.o tao_util.c FC arch-mswin-c-debug/obj/src/sys/f90-mod/petscsysmod.o FC arch-mswin-c-debug/obj/src/sys/mpiuni/fsrc/somempifort.o FC arch-mswin-c-debug/obj/src/sys/objects/f2003-src/fsrc/optionenum.o FC arch-mswin-c-debug/obj/src/vec/f90-mod/petscvecmod.o FC arch-mswin-c-debug/obj/src/sys/classes/bag/f2003-src/fsrc/bagenum.o FC arch-mswin-c-debug/obj/src/mat/f90-mod/petscmatmod.o FC arch-mswin-c-debug/obj/src/dm/f90-mod/petscdmmod.o FC arch-mswin-c-debug/obj/src/dm/f90-mod/petscdmswarmmod.o FC arch-mswin-c-debug/obj/src/dm/f90-mod/petscdmplexmod.o FC arch-mswin-c-debug/obj/src/dm/f90-mod/petscdmdamod.o FC arch-mswin-c-debug/obj/src/ksp/f90-mod/petsckspdefmod.o CC arch-mswin-c-debug/obj/src/tao/python/pythontao.o pythontao.c FC arch-mswin-c-debug/obj/src/ksp/f90-mod/petscpcmod.o FC arch-mswin-c-debug/obj/src/ksp/f90-mod/petsckspmod.o FC arch-mswin-c-debug/obj/src/snes/f90-mod/petscsnesmod.o FC arch-mswin-c-debug/obj/src/ts/f90-mod/petsctsmod.o FC arch-mswin-c-debug/obj/src/tao/f90-mod/petsctaomod.o AR arch-mswin-c-debug/lib/libpetsc.lib ========================================= Now to check if the libraries are working do: make PETSC_DIR=/cygdrive/e/balay/petsc-3.21.2 PETSC_ARCH=arch-mswin-c-debug check ========================================= balay at petsc-win01 /cygdrive/e/balay/petsc-3.21.2 $ make check Running PETSc check examples to verify correct installation Using PETSC_DIR=/cygdrive/e/balay/petsc-3.21.2 and PETSC_ARCH=arch-mswin-c-debug C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process Fortran example src/snes/tutorials/ex5f run successfully with 1 MPI process Completed PETSc check examples balay at petsc-win01 /cygdrive/e/balay/petsc-3.21.2 $ -------------- next part -------------- An HTML attachment was scrubbed... URL:
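[Editorial note, not part of the original thread: once "make check" passes as above, a user program is normally built against such an installation with the standard PETSc makefile fragments; the variables and rules files it includes are the ones visible in the lib/petsc/conf listing earlier in this message. A minimal sketch, assuming a hypothetical application source app.c and the PETSC_DIR/PETSC_ARCH values from this build:]

    # Minimal user makefile sketch (hypothetical app.c; paths taken from the log above)
    PETSC_DIR  = /cygdrive/e/balay/petsc-3.21.2
    PETSC_ARCH = arch-mswin-c-debug

    include ${PETSC_DIR}/lib/petsc/conf/variables
    include ${PETSC_DIR}/lib/petsc/conf/rules

    app: app.o
	${CLINKER} -o app app.o ${PETSC_LIB}

[The rules file supplies the compile rule for app.c, and CLINKER/PETSC_LIB come from the variables file, so the same makefile works unchanged on other PETSc installations by pointing PETSC_DIR and PETSC_ARCH elsewhere.]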