[petsc-users] CPU utilization during GPU solver

Sat Nov 17 14:39:55 CST 2012

Hi David,

the CUDA runtime might spawn threads of its own in addition to those of
PETSc. How many GPUs do you have on your system?

You might also want to compare with the CUDA examples. If they also run 
at ~800% CPU utilization, then it is most likely due to the CUDA runtime
rather than PETSc.
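
One way to see whether the load comes from several threads is `top -H`, which
lists threads individually. The same information can be read from /proc on
Linux. The following is only a sketch under that assumption (it is not part of
PETSc or CUDA, and the helper name `thread_cpu_seconds` is made up for
illustration); here it inspects its own process, but any PID could be passed:

```python
import os

def thread_cpu_seconds(pid):
    """Return {tid: cpu_seconds} for each thread of `pid`, or None without /proc."""
    task_dir = f"/proc/{pid}/task"
    if not os.path.isdir(task_dir):
        return None  # not Linux, or the process no longer exists
    ticks = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    usage = {}
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # the thread exited while we were reading
        # Field 2 (the command name) is parenthesised and may contain spaces,
        # so split after the last ')'. The remainder starts at field 3.
        fields = stat.rsplit(")", 1)[1].split()
        utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15
        usage[int(tid)] = (utime + stime) / ticks
    return usage

usage = thread_cpu_seconds(os.getpid())
if usage is None:
    print("no /proc filesystem available")
else:
    for tid, seconds in sorted(usage.items()):
        print(f"thread {tid}: {seconds:.2f} s CPU")
```

If the solver process shows many threads that each accumulate CPU time, the
~800% figure is spread over those threads rather than coming from one
serial code path.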

Note that there might be full CPU utilization even if all the operations 
were carried out on the GPU, because the CPU threads are waiting in a 
synchronization loop for the GPU kernels to terminate. This is, of 
course, nothing specific to PETSc.
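
To illustrate the synchronization-loop effect, here is a minimal sketch in
plain Python. It is an analogy only, not PETSc or CUDA code: a sleeping
worker thread stands in for a GPU kernel, and the function name
`cpu_cost_of_waiting` is made up for illustration. It compares the CPU time
consumed by polling a flag with the CPU time consumed by a blocking wait:

```python
import threading
import time

def cpu_cost_of_waiting(spin, work_seconds=0.2):
    """Return process CPU seconds used while waiting for a stand-in 'kernel'."""
    done = threading.Event()

    def fake_kernel():
        # Stand-in for a GPU kernel: sleeping uses no host CPU time.
        time.sleep(work_seconds)
        done.set()

    worker = threading.Thread(target=fake_kernel)
    start = time.process_time()  # CPU time of the whole process
    worker.start()
    if spin:
        while not done.is_set():  # busy-wait: polls the flag continuously
            pass
    else:
        done.wait()  # blocking wait: the OS suspends this thread
    worker.join()
    return time.process_time() - start

spin_cpu = cpu_cost_of_waiting(spin=True)
block_cpu = cpu_cost_of_waiting(spin=False)
print(f"spin-wait:     {spin_cpu:.3f} s of CPU time")
print(f"blocking wait: {block_cpu:.3f} s of CPU time")
```

The spinning variant keeps a core busy for roughly the whole wait although
no useful work is done on the host, which is what top reports as
utilization. The CUDA runtime exposes a similar choice through
cudaSetDeviceFlags() (cudaDeviceScheduleSpin versus
cudaDeviceScheduleBlockingSync); whether changing it is practical in this
PETSc/CUSP setup has not been checked here.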

Best regards,
Karli

On 11/17/2012 02:05 PM, David Fuentes wrote:
> Thanks Jed.
> I was trying to run it in dbg mode to verify that all significant parts of
> the solver were running on the GPU and not on the CPU by mistake.
> I can't pinpoint what part of the solver is running on the CPU. When I
> run top while running the solver there seems to be ~800% CPU utilization
> that I wasn't expecting. I can't tell if I'm slowing things down
> by transferring between CPU/GPU by accident?
>
> thanks again,
> df
>
> On Sat, Nov 17, 2012 at 1:49 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>
>     Please read the large boxed message about debugging mode.
>
>     (Replying from phone so can't make it 72 point blinking red, sorry.)
>
>     On Nov 17, 2012 1:41 PM, "David Fuentes" <fuentesdt at gmail.com> wrote:
>
>         thanks Matt,
>
>         My log summary is below.
>
>         ************************************************************************************************************************
>         ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
>         ************************************************************************************************************************
>
>         ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
>         ./FocusUltraSoundModel on a gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg named SCRGP2 with 1 processor, by fuentes Sat Nov 17 13:35:06 2012
>         Using Petsc Release Version 3.3.0, Patch 4, Fri Oct 26 10:46:51 CDT 2012
>
>                                   Max       Max/Min        Avg      Total
>         Time (sec):           3.164e+01      1.00000   3.164e+01
>         Objects:              4.100e+01      1.00000   4.100e+01
>         Flops:                2.561e+09      1.00000   2.561e+09  2.561e+09
>         Flops/sec:            8.097e+07      1.00000   8.097e+07  8.097e+07
>         Memory:               2.129e+08      1.00000              2.129e+08
>         MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
>         MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
>         MPI Reductions:       4.230e+02      1.00000
>
>         Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                                      e.g., VecAXPY() for real vectors of length N --> 2N flops
>                                      and VecAXPY() for complex vectors of length N --> 8N flops
>
>         Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                                  Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>           0:      Main Stage: 3.1636e+01 100.0%  2.5615e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  4.220e+02  99.8%
>
>         ------------------------------------------------------------------------------------------------------------------------
>         See the 'Profiling' chapter of the users' manual for details on interpreting output.
>         Phase summary info:
>             Count: number of times phase was executed
>             Time and Flops: Max - maximum over all processors
>                             Ratio - ratio of maximum to minimum over all processors
>             Mess: number of messages sent
>             Avg. len: average message length
>             Reduct: number of global reductions
>             Global: entire computation
>             Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>                %T - percent time in this phase         %f - percent flops in this phase
>                %M - percent messages in this phase     %L - percent message lengths in this phase
>                %R - percent reductions in this phase
>             Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
>         ------------------------------------------------------------------------------------------------------------------------
>
>
>                ##########################################################
>                #                                                        #
>                #                          WARNING!!!                    #
>                #                                                        #
>                #   This code was compiled with a debugging option,      #
>                #   To get timing results run ./configure                #
>                #   using --with-debugging=no, the performance will      #
>                #   be generally two or three times faster.              #
>                #                                                        #
>                ##########################################################
>
>
>         Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                            Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
>         ------------------------------------------------------------------------------------------------------------------------
>
>         --- Event Stage 0: Main Stage
>
>         ComputeFunction       52 1.0 3.9104e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  1  0  0  0  1   1  0  0  0  1     0
>         VecDot                50 1.0 3.2072e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3025
>         VecMDot               50 1.0 1.3100e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   741
>         VecNorm              200 1.0 9.7943e-02 1.0 3.88e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 15  0  0  0   0 15  0  0  0  3963
>         VecScale             100 1.0 1.3496e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   719
>         VecCopy              150 1.0 4.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>         VecSet               164 1.0 2.9707e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>         VecAXPY               50 1.0 3.2194e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3014
>         VecWAXPY              50 1.0 2.9040e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0   167
>         VecMAXPY             100 1.0 5.4555e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  8  0  0  0   2  8  0  0  0   356
>         VecPointwiseMult     100 1.0 5.3003e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0   183
>         VecScatterBegin       53 1.0 1.8660e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>         VecReduceArith       101 1.0 6.9973e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  2801
>         VecReduceComm         51 1.0 1.0252e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>         VecNormalize         100 1.0 1.8565e-01 1.0 2.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  1568
>         VecCUSPCopyTo        152 1.0 5.8016e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>         VecCUSPCopyFrom      201 1.0 6.0029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>         MatMult              100 1.0 6.8465e-01 1.0 1.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2 49  0  0  0   2 49  0  0  0  1825
>         MatAssemblyBegin       3 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>         MatAssemblyEnd         3 1.0 2.7767e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>         MatZeroEntries         1 1.0 2.0346e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>         MatCUSPCopyTo          3 1.0 1.4056e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>         SNESSolve              1 1.0 2.2094e+01 1.0 2.56e+09 1.0 0.0e+00 0.0e+00 3.7e+02 70100  0  0 88  70100  0  0 89   116
>         SNESFunctionEval      51 1.0 3.9031e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>         SNESJacobianEval      50 1.0 1.3191e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
>         SNESLineSearch        50 1.0 6.2922e+00 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 5.0e+01 20 45  0  0 12  20 45  0  0 12   184
>         KSPGMRESOrthog        50 1.0 4.0436e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 5.0e+01  1  8  0  0 12   1  8  0  0 12   480
>         KSPSetUp              50 1.0 2.1935e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+01  0  0  0  0  4   0  0  0  0  4     0
>         KSPSolve              50 1.0 1.3230e+01 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 3.2e+02 42 55  0  0 75  42 55  0  0 75   106
>         PCSetUp               50 1.0 1.9897e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01  6  0  0  0 12   6  0  0  0 12     0
>         PCApply              100 1.0 5.7457e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 4.0e+00  2  4  0  0  1   2  4  0  0  1   169
>         ------------------------------------------------------------------------------------------------------------------------
>
>         Memory usage is given in bytes:
>
>         Object Type          Creations   Destructions     Memory  Descendants' Mem.
>         Reports information only for process 0.
>
>         --- Event Stage 0: Main Stage
>
>                     Container     2              2         1096     0
>                        Vector    16             16    108696592     0
>                Vector Scatter     2              2         1240     0
>                        Matrix     1              1     96326824     0
>              Distributed Mesh     3              3      7775936     0
>               Bipartite Graph     6              6         4104     0
>                     Index Set     5              5      3884908     0
>             IS L to G Mapping     1              1      3881760     0
>                          SNES     1              1         1268     0
>                SNESLineSearch     1              1          840     0
>                        Viewer     1              0            0     0
>                 Krylov Solver     1              1        18288     0
>                Preconditioner     1              1          792     0
>         ========================================================================================================================
>         Average time to get PetscTime(): 9.53674e-08
>         #PETSc Option Table entries:
>         -da_vec_type cusp
>         -dm_mat_type seqaijcusp
>         -ksp_monitor
>         -log_summary
>         -pc_type jacobi
>         -snes_converged_reason
>         -snes_monitor
>         #End of PETSc Option Table entries
>         Compiled without FORTRAN kernels
>         Compiled with full precision matrices (default)
>         sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>         sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>         Configure run at: Fri Nov 16 08:40:52 2012
>         Configure options: --with-clanguage=C++ --with-mpi-dir=/usr
>         --with-shared-libraries --with-cuda-arch=sm_20 --CFLAGS=-O0
>         --CXXFLAGS=-O0 --CUDAFLAGS=-O0 --with-etags=1 --with-mpi4py=0
>         --with-blas-lapack-lib="[/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_rt.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_intel_thread.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_core.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libiomp5.so]"
>         --download-blacs --download-superlu_dist --download-triangle
>         --download-parmetis --download-metis --download-mumps
>         --download-scalapack --with-cuda=1 --with-cusp=1 --with-thrust=1
>         --with-cuda-dir=/opt/apps/cuda/4.2//cuda --with-sieve=1
>         --download-exodusii=yes --download-netcdf --with-boost=1
>         --with-boost-dir=/usr --download-fiat=yes --download-generator
>         --download-scientificpython --with-matlab=1
>         --with-matlab-engine=1 --with-matlab-dir=/opt/MATLAB/R2011a
>         -----------------------------------------
>         Libraries compiled on Fri Nov 16 08:40:52 2012 on SCRGP2
>         Machine characteristics:
>         Linux-2.6.32-41-server-x86_64-with-debian-squeeze-sid
>         Using PETSc directory: /opt/apps/PETSC/petsc-3.3-p4
>         Using PETSc arch: gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg
>         -----------------------------------------
>
>         Using C compiler: /usr/bin/mpicxx -O0 -g   -fPIC   ${COPTFLAGS}
>         ${CFLAGS}
>         Using Fortran compiler: /usr/bin/mpif90  -fPIC -Wall
>         -Wno-unused-variable -g   ${FOPTFLAGS} ${FFLAGS}
>         -----------------------------------------
>
>         Using include paths:
>         -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
>         -I/opt/apps/PETSC/petsc-3.3-p4/include
>         -I/opt/apps/PETSC/petsc-3.3-p4/include
>         -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
>         -I/opt/apps/cuda/4.2//cuda/include
>         -I/opt/apps/PETSC/petsc-3.3-p4/include/sieve
>         -I/opt/MATLAB/R2011a/extern/include -I/usr/include
>         -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/cbind/include
>         -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/forbind/include
>         -I/usr/include/mpich2
>         -----------------------------------------
>
>         Using C linker: /usr/bin/mpicxx
>         Using Fortran linker: /usr/bin/mpif90
>         Using libraries:
>         -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>         -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>         -lpetsc
>         -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>         -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>         -ltriangle -lX11 -lpthread -lsuperlu_dist_3.1 -lcmumps -ldmumps
>         -lsmumps -lzmumps -lmumps_common -lpord -lparmetis -lmetis
>         -lscalapack -lblacs -Wl,-rpath,/opt/apps/cuda/4.2//cuda/lib64
>         -L/opt/apps/cuda/4.2//cuda/lib64 -lcufft -lcublas -lcudart
>         -lcusparse
>         -Wl,-rpath,/opt/MATLAB/R2011a/sys/os/glnxa64:/opt/MATLAB/R2011a/bin/glnxa64:/opt/MATLAB/R2011a/extern/lib/glnxa64
>         -L/opt/MATLAB/R2011a/bin/glnxa64
>         -L/opt/MATLAB/R2011a/extern/lib/glnxa64 -leng -lmex -lmx -lmat
>         -lut -licudata -licui18n -licuuc
>         -Wl,-rpath,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib
>         -L/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -lmkl_rt
>         -lmkl_intel_thread -lmkl_core -liomp5 -lexoIIv2for -lexodus
>         -lnetcdf_c++ -lnetcdf
>         -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.4.3
>         -L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -lmpichf90 -lgfortran -lm
>         -lm -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lopa
>         -lpthread -lrt -lgcc_s -ldl
>         -----------------------------------------
>
>
>
>         On Sat, Nov 17, 2012 at 11:02 AM, Matthew Knepley <knepley at gmail.com> wrote:
>
>             On Sat, Nov 17, 2012 at 10:50 AM, David Fuentes <fuentesdt at gmail.com> wrote:
>              > Hi,
>              >
>              > I'm using petsc 3.3p4
>              > I'm trying to run a nonlinear SNES solver on GPU with gmres and jacobi PC
>              > using VECSEQCUSP and MATSEQAIJCUSP datatypes for the rhs and jacobian matrix
>              > respectively.
>              > When running top I still see significant CPU utilization (800-900 %CPU)
>              > during the solve? Possibly from some multithreaded operations?
>              >
>              > Is this expected?
>              > I was thinking that since I input everything into the solver as a CUSP
>              > datatype, all linear algebra operations would be on the GPU device from
>              > there and wasn't expecting to see such CPU utilization during the solve.
>              > Do I probably have an error in my code somewhere?
>
>             We cannot answer performance questions without -log_summary
>
>                 Matt
>
>              > Thanks,
>              > David
>
>
>
>             --
>             What most experimenters take for granted before they begin their
>             experiments is infinitely more interesting than any results to which
>             their experiments lead.
>             -- Norbert Wiener
>
>
>