[petsc-users] [EXTERNAL] Re: Kokkos backend for Mat and Vec diverging when running on CUDA device.

Fackler, Philip facklerpw at ornl.gov
Wed Nov 16 13:38:15 CST 2022


------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------

Unknown Name on a  named PC0115427 with 1 processor, by 4pf Wed Nov 16 14:36:46 2022
Using Petsc Development GIT revision: v3.18.1-115-gdca010e0e9a  GIT Date: 2022-10-28 14:39:41 +0000

                         Max       Max/Min     Avg       Total
Time (sec):           6.023e+00     1.000   6.023e+00
Objects:              1.020e+02     1.000   1.020e+02
Flops:                1.080e+09     1.000   1.080e+09  1.080e+09
Flops/sec:            1.793e+08     1.000   1.793e+08  1.793e+08
MPI Msg Count:        0.000e+00     0.000   0.000e+00  0.000e+00
MPI Msg Len (bytes):  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 6.0226e+00 100.0%  1.0799e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
   GPU    - CpuToGpu -   - GpuToCpu - GPU

                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
 Mflop/s Count   Size   Count   Size  %F

------------------------------------------------------------------------------------------------------------------------
---------------------------------------


--- Event Stage 0: Main Stage

BuildTwoSided          3 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

DMCreateMat            1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

SFSetGraph             3 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

SFSetUp                3 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

SFPack              4647 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

SFUnpack            4647 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

VecDot               190 1.0   nan nan 2.11e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

VecMDot              775 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

VecNorm             1728 1.0   nan nan 1.92e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

VecScale            1983 1.0   nan nan 6.24e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

VecCopy              780 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

VecSet              4955 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

VecAXPY              190 1.0   nan nan 2.11e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

VecAYPX              597 1.0   nan nan 6.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

VecAXPBYCZ           643 1.0   nan nan 1.79e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

VecWAXPY             502 1.0   nan nan 5.58e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

VecMAXPY            1159 1.0   nan nan 3.68e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

VecScatterBegin     4647 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  -nan
    -nan      2 5.14e-03    0 0.00e+00  0

VecScatterEnd       4647 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

VecReduceArith       380 1.0   nan nan 4.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

VecReduceComm        190 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

VecNormalize         965 1.0   nan nan 1.61e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

TSStep                20 1.0 5.8699e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 0.0e+00 97100  0  0  0  97100  0  0  0   184
    -nan      2 5.14e-03    0 0.00e+00 54

TSFunctionEval       597 1.0   nan nan 6.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00 63  1  0  0  0  63  1  0  0  0  -nan
    -nan      1 3.36e-04    0 0.00e+00 100

TSJacobianEval       190 1.0   nan nan 3.37e+07 1.0 0.0e+00 0.0e+00 0.0e+00 24  3  0  0  0  24  3  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 97

MatMult             1930 1.0   nan nan 4.46e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 41  0  0  0   1 41  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

MatMultTranspose       1 1.0   nan nan 3.44e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

MatSolve             965 1.0   nan nan 5.04e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  5  0  0  0   1  5  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

MatSOR               965 1.0   nan nan 3.33e+08 1.0 0.0e+00 0.0e+00 0.0e+00  4 31  0  0  0   4 31  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

MatLUFactorSym         1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

MatLUFactorNum       190 1.0   nan nan 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

MatScale             190 1.0   nan nan 3.26e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

MatAssemblyBegin     761 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

MatAssemblyEnd       761 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

MatGetRowIJ            1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

MatCreateSubMats     380 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

MatGetOrdering         1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

MatZeroEntries       379 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

MatSetPreallCOO        1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

MatSetValuesCOO      190 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

KSPSetUp             760 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

KSPSolve             190 1.0 5.8052e-01 1.0 9.30e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10 86  0  0  0  10 86  0  0  0  1602
    -nan      1 4.80e-03    0 0.00e+00 46

KSPGMRESOrthog       775 1.0   nan nan 2.27e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

SNESSolve             71 1.0 5.7117e+00 1.0 1.07e+09 1.0 0.0e+00 0.0e+00 0.0e+00 95 99  0  0  0  95 99  0  0  0   188
    -nan      1 4.80e-03    0 0.00e+00 53

SNESSetUp              1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

SNESFunctionEval     573 1.0   nan nan 2.23e+07 1.0 0.0e+00 0.0e+00 0.0e+00 60  2  0  0  0  60  2  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

SNESJacobianEval     190 1.0   nan nan 3.37e+07 1.0 0.0e+00 0.0e+00 0.0e+00 24  3  0  0  0  24  3  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 97

SNESLineSearch       190 1.0   nan nan 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00 53 10  0  0  0  53 10  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00 100

PCSetUp              570 1.0   nan nan 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2 11  0  0  0   2 11  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

PCApply              965 1.0   nan nan 6.14e+08 1.0 0.0e+00 0.0e+00 0.0e+00  8 57  0  0  0   8 57  0  0  0  -nan
    -nan      1 4.80e-03    0 0.00e+00 19

KSPSolve_FS_0        965 1.0   nan nan 3.33e+08 1.0 0.0e+00 0.0e+00 0.0e+00  4 31  0  0  0   4 31  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0

KSPSolve_FS_1        965 1.0   nan nan 1.66e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2 15  0  0  0   2 15  0  0  0  -nan
    -nan      0 0.00e+00    0 0.00e+00  0


--- Event Stage 1: Unknown

------------------------------------------------------------------------------------------------------------------------
---------------------------------------


Object Type          Creations   Destructions. Reports information only for process 0.

--- Event Stage 0: Main Stage

           Container     5              5
    Distributed Mesh     2              2
           Index Set    11             11
   IS L to G Mapping     1              1
   Star Forest Graph     7              7
     Discrete System     2              2
           Weak Form     2              2
              Vector    49             49
             TSAdapt     1              1
                  TS     1              1
                DMTS     1              1
                SNES     1              1
              DMSNES     3              3
      SNESLineSearch     1              1
       Krylov Solver     4              4
     DMKSP interface     1              1
              Matrix     4              4
      Preconditioner     4              4
              Viewer     2              1

--- Event Stage 1: Unknown

========================================================================================================================
Average time to get PetscTime(): 3.14e-08
#PETSc Option Table entries:
-log_view
-log_view_gpu_times
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with 64 bit PetscInt
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 8
Configure options: PETSC_DIR=/home/4pf/repos/petsc PETSC_ARCH=arch-kokkos-cuda-no-tpls --with-cc=mpicc --with-cxx=mpicxx --with-fc=0 --with-cuda --with-debugging=0 --with-shared-libraries --prefix=/home/4pf/build/petsc/cuda-no-tpls/install --with-64-bit-indices --COPTFLAGS=-O3 --CXXOPTFLAGS=-O3 --CUDAOPTFLAGS=-O3 --with-kokkos-dir=/home/4pf/build/kokkos/cuda/install --with-kokkos-kernels-dir=/home/4pf/build/kokkos-kernels/cuda-no-tpls/install

-----------------------------------------
Libraries compiled on 2022-11-01 21:01:08 on PC0115427
Machine characteristics: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
Using PETSc directory: /home/4pf/build/petsc/cuda-no-tpls/install
Using PETSc arch:
-----------------------------------------

Using C compiler: mpicc  -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O3
-----------------------------------------

Using include paths: -I/home/4pf/build/petsc/cuda-no-tpls/install/include -I/home/4pf/build/kokkos-kernels/cuda-no-tpls/install/include -I/home/4pf/build/kokkos/cuda/install/include -I/usr/local/cuda-11.8/include
-----------------------------------------

Using C linker: mpicc
Using libraries: -Wl,-rpath,/home/4pf/build/petsc/cuda-no-tpls/install/lib -L/home/4pf/build/petsc/cuda-no-tpls/install/lib -lpetsc -Wl,-rpath,/home/4pf/build/kokkos-kernels/cuda-no-tpls/install/lib -L/home/4pf/build/kokkos-kernels/cuda-no-tpls/install/lib -Wl,-rpath,/home/4pf/build/kokkos/cuda/install/lib -L/home/4pf/build/kokkos/cuda/install/lib -Wl,-rpath,/usr/local/cuda-11.8/lib64 -L/usr/local/cuda-11.8/lib64 -L/usr/local/cuda-11.8/lib64/stubs -lkokkoskernels -lkokkoscontainers -lkokkoscore -llapack -lblas -lm -lcudart -lnvToolsExt -lcufft -lcublas -lcusparse -lcusolver -lcurand -lcuda -lquadmath -lstdc++ -ldl
-----------------------------------------


Philip Fackler
Research Software Engineer, Application Engineering Group
Advanced Computing Systems Research Section
Computer Science and Mathematics Division
Oak Ridge National Laboratory
________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Tuesday, November 15, 2022 13:03
To: Fackler, Philip <facklerpw at ornl.gov>
Cc: xolotl-psi-development at lists.sourceforge.net <xolotl-psi-development at lists.sourceforge.net>; petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; Blondel, Sophie <sblondel at utk.edu>; Roth, Philip <rothpc at ornl.gov>
Subject: Re: [EXTERNAL] Re: [petsc-users] Kokkos backend for Mat and Vec diverging when running on CUDA device.

Can you paste -log_view result so I can see what functions are used?

--Junchao Zhang


On Tue, Nov 15, 2022 at 10:24 AM Fackler, Philip <facklerpw at ornl.gov<mailto:facklerpw at ornl.gov>> wrote:
Yes, most (but not all) of our system test cases fail with the kokkos/cuda or cuda backends. All of them pass with the CPU-only kokkos backend.

Philip Fackler
Research Software Engineer, Application Engineering Group
Advanced Computing Systems Research Section
Computer Science and Mathematics Division
Oak Ridge National Laboratory
________________________________
From: Junchao Zhang <junchao.zhang at gmail.com<mailto:junchao.zhang at gmail.com>>
Sent: Monday, November 14, 2022 19:34
To: Fackler, Philip <facklerpw at ornl.gov<mailto:facklerpw at ornl.gov>>
Cc: xolotl-psi-development at lists.sourceforge.net<mailto:xolotl-psi-development at lists.sourceforge.net> <xolotl-psi-development at lists.sourceforge.net<mailto:xolotl-psi-development at lists.sourceforge.net>>; petsc-users at mcs.anl.gov<mailto:petsc-users at mcs.anl.gov> <petsc-users at mcs.anl.gov<mailto:petsc-users at mcs.anl.gov>>; Blondel, Sophie <sblondel at utk.edu<mailto:sblondel at utk.edu>>; Zhang, Junchao <jczhang at mcs.anl.gov<mailto:jczhang at mcs.anl.gov>>; Roth, Philip <rothpc at ornl.gov<mailto:rothpc at ornl.gov>>
Subject: [EXTERNAL] Re: [petsc-users] Kokkos backend for Mat and Vec diverging when running on CUDA device.

Hi, Philip,
  Sorry to hear that.  It seems you could run the same code on CPUs but not no GPUs (with either petsc/Kokkos backend or petsc/cuda backend, is it right?

--Junchao Zhang


On Mon, Nov 14, 2022 at 12:13 PM Fackler, Philip via petsc-users <petsc-users at mcs.anl.gov<mailto:petsc-users at mcs.anl.gov>> wrote:
This is an issue I've brought up before (and discussed in-person with Richard). I wanted to bring it up again because I'm hitting the limits of what I know to do, and I need help figuring this out.

The problem can be reproduced using Xolotl's "develop" branch built against a petsc build with kokkos and kokkos-kernels enabled. Then, either add the relevant kokkos options to the "petscArgs=" line in the system test parameter file(s), or just replace the system test parameter files with the ones from the "feature-petsc-kokkos" branch. See here the files that begin with "params_system_".

Note that those files use the "kokkos" options, but the problem is similar using the corresponding cuda/cusparse options. I've already tried building kokkos-kernels with no TPLs and got slightly different results, but the same problem.

Any help would be appreciated.

Thanks,

Philip Fackler
Research Software Engineer, Application Engineering Group
Advanced Computing Systems Research Section
Computer Science and Mathematics Division
Oak Ridge National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20221116/cb7a9e57/attachment-0001.html>


More information about the petsc-users mailing list