[petsc-users] CPU utilization during GPU solver
Karl Rupp
rupp at mcs.anl.gov
Sat Nov 17 14:39:55 CST 2012
Hi David,
The CUDA runtime might spawn threads in addition to PETSc. How many GPUs
do you have on your system?
You might also want to compare with the CUDA examples. If they also run
at ~800% CPU utilization, then it's definitely due to the CUDA runtime.
Note that there might be full CPU utilization even if all the operations
were carried out on the GPU, because the CPU threads are waiting in a
synchronization loop for the GPU kernels to finish. This is, of
course, nothing specific to PETSc.
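
To illustrate the synchronization-loop effect in isolation, here is a minimal sketch in plain Python (no CUDA or PETSc involved). A timer stands in for a GPU kernel finishing; the function names are made up for this example. It compares the CPU time a process burns when it polls for completion with the CPU time it burns when it blocks. For reference, the CUDA runtime exposes a similar choice through cudaSetDeviceFlags (cudaDeviceScheduleSpin versus cudaDeviceScheduleBlockingSync); whether that accounts for the numbers seen in top here is an open question, not something this sketch shows.

```python
import threading
import time


def cpu_time_while_waiting(wait, duration=0.2):
    """Return the CPU seconds this process uses while waiting `duration` seconds."""
    done = threading.Event()
    # The timer is a stand-in for "the GPU kernel has finished".
    timer = threading.Timer(duration, done.set)
    start = time.process_time()  # CPU time of the whole process, not wall time
    timer.start()
    wait(done)
    timer.join()
    return time.process_time() - start


def spin_wait(done):
    # Poll in a tight loop: the core looks fully busy although no work is done.
    while not done.is_set():
        pass


def blocking_wait(done):
    # Sleep until signalled: the waiting thread uses almost no CPU.
    done.wait()


if __name__ == "__main__":
    print(f"spin wait:     {cpu_time_while_waiting(spin_wait):.3f} s CPU")
    print(f"blocking wait: {cpu_time_while_waiting(blocking_wait):.3f} s CPU")
```

With the spin wait, a tool like top shows the waiting thread as fully busy for the whole wait, even though the actual work happens elsewhere.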
Best regards,
Karli
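
P.S. On the question of accidental CPU/GPU transfers: the -log_summary output quoted below already lists the copies as separate events (VecCUSPCopyTo and VecCUSPCopyFrom). The following rough sketch, with the numbers typed in by hand from that log, estimates how much of the SNESSolve time they account for. The helper names are made up for this example. Note that the log legend prints the Mflop/s factor as 10e-6, but the reported rates match 1e-6, which is what is used here. Keep in mind that these are timings from a debug build.

```python
# Figures copied by hand from the -log_summary output quoted below (one process).
events = {
    "VecDot":          {"time": 3.2072e-02, "flops": 9.70e+07},
    "VecCUSPCopyTo":   {"time": 5.8016e-01, "flops": 0.0},
    "VecCUSPCopyFrom": {"time": 6.0029e-01, "flops": 0.0},
    "SNESSolve":       {"time": 2.2094e+01, "flops": 2.56e+09},
}


def mflops(event):
    """Mflop/s for one process: 1e-6 * flops / time."""
    return 1e-6 * event["flops"] / event["time"]


def copy_fraction(events, reference="SNESSolve"):
    """Fraction of the reference event's time spent in host/device vector copies."""
    copies = events["VecCUSPCopyTo"]["time"] + events["VecCUSPCopyFrom"]["time"]
    return copies / events[reference]["time"]


if __name__ == "__main__":
    print(f"VecDot rate: {mflops(events['VecDot']):.0f} Mflop/s")
    print(f"Vector copies / SNESSolve time: {100 * copy_fraction(events):.1f} %")
```

The VecDot rate comes out close to the 3025 Mflop/s in the log (the log uses unrounded flop counts), and the vector copies are a small fraction of the solve time.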
On 11/17/2012 02:05 PM, David Fuentes wrote:
> Thanks Jed.
> I was trying to run it in dbg mode to verify that all significant parts of
> the solver were running on the GPU and not on the CPU by mistake.
> I can't pinpoint what part of the solver is running on the CPU. When I
> run top while running the solver there seems to be ~800% CPU utilization
> that I wasn't expecting. I can't tell if I'm slowing things down
> by accidentally transferring between CPU and GPU?
>
> thanks again,
> df
>
> On Sat, Nov 17, 2012 at 1:49 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>
> Please read the large boxed message about debugging mode.
>
> (Replying from phone so can't make it 72 point blinking red, sorry.)
>
> On Nov 17, 2012 1:41 PM, "David Fuentes" <fuentesdt at gmail.com> wrote:
>
> thanks Matt,
>
> My log summary is below.
>
> ************************************************************************************************************************
> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> ./FocusUltraSoundModel on a gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg named SCRGP2 with 1 processor, by fuentes Sat Nov 17 13:35:06 2012
> Using Petsc Release Version 3.3.0, Patch 4, Fri Oct 26 10:46:51 CDT 2012
>
> Max Max/Min Avg Total
> Time (sec): 3.164e+01 1.00000 3.164e+01
> Objects: 4.100e+01 1.00000 4.100e+01
> Flops: 2.561e+09 1.00000 2.561e+09 2.561e+09
> Flops/sec: 8.097e+07 1.00000 8.097e+07 8.097e+07
> Memory: 2.129e+08 1.00000 2.129e+08
> MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Reductions: 4.230e+02 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 3.1636e+01 100.0%  2.5615e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  4.220e+02  99.8%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %f - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was compiled with a debugging option, #
> # To get timing results run ./configure #
> # using --with-debugging=no, the performance will #
> # be generally two or three times faster. #
> # #
> ##########################################################
>
>
> Event               Count      Time (sec)     Flops                                --- Global ---       --- Stage ---     Total
>                     Max Ratio  Max     Ratio  Max    Ratio  Mess   Avg len  Reduct   %T  %f  %M  %L  %R   %T  %f  %M  %L  %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> ComputeFunction      52 1.0 3.9104e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00   1   0   0   0   1    1   0   0   0   1      0
> VecDot               50 1.0 3.2072e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00   0   4   0   0   0    0   4   0   0   0   3025
> VecMDot              50 1.0 1.3100e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00   0   4   0   0   0    0   4   0   0   0    741
> VecNorm             200 1.0 9.7943e-02 1.0 3.88e+08 1.0 0.0e+00 0.0e+00 0.0e+00   0  15   0   0   0    0  15   0   0   0   3963
> VecScale            100 1.0 1.3496e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00   0   4   0   0   0    0   4   0   0   0    719
> VecCopy             150 1.0 4.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   2   0   0   0   0    2   0   0   0   0      0
> VecSet              164 1.0 2.9707e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   1   0   0   0   0    1   0   0   0   0      0
> VecAXPY              50 1.0 3.2194e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00   0   4   0   0   0    0   4   0   0   0   3014
> VecWAXPY             50 1.0 2.9040e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00   1   2   0   0   0    1   2   0   0   0    167
> VecMAXPY            100 1.0 5.4555e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 0.0e+00   2   8   0   0   0    2   8   0   0   0    356
> VecPointwiseMult    100 1.0 5.3003e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00   2   4   0   0   0    2   4   0   0   0    183
> VecScatterBegin      53 1.0 1.8660e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   1   0   0   0   0    1   0   0   0   0      0
> VecReduceArith      101 1.0 6.9973e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00   0   8   0   0   0    0   8   0   0   0   2801
> VecReduceComm        51 1.0 1.0252e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0    0   0   0   0   0      0
> VecNormalize        100 1.0 1.8565e-01 1.0 2.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00   1  11   0   0   0    1  11   0   0   0   1568
> VecCUSPCopyTo       152 1.0 5.8016e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   2   0   0   0   0    2   0   0   0   0      0
> VecCUSPCopyFrom     201 1.0 6.0029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   2   0   0   0   0    2   0   0   0   0      0
> MatMult             100 1.0 6.8465e-01 1.0 1.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00   2  49   0   0   0    2  49   0   0   0   1825
> MatAssemblyBegin      3 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0    0   0   0   0   0      0
> MatAssemblyEnd        3 1.0 2.7767e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   1   0   0   0   0    1   0   0   0   0      0
> MatZeroEntries        1 1.0 2.0346e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0    0   0   0   0   0      0
> MatCUSPCopyTo         3 1.0 1.4056e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0    0   0   0   0   0      0
> SNESSolve             1 1.0 2.2094e+01 1.0 2.56e+09 1.0 0.0e+00 0.0e+00 3.7e+02  70 100   0   0  88   70 100   0   0  89    116
> SNESFunctionEval     51 1.0 3.9031e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   1   0   0   0   0    1   0   0   0   0      0
> SNESJacobianEval     50 1.0 1.3191e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   4   0   0   0   0    4   0   0   0   0      0
> SNESLineSearch       50 1.0 6.2922e+00 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 5.0e+01  20  45   0   0  12   20  45   0   0  12    184
> KSPGMRESOrthog       50 1.0 4.0436e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 5.0e+01   1   8   0   0  12    1   8   0   0  12    480
> KSPSetUp             50 1.0 2.1935e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+01   0   0   0   0   4    0   0   0   0   4      0
> KSPSolve             50 1.0 1.3230e+01 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 3.2e+02  42  55   0   0  75   42  55   0   0  75    106
> PCSetUp              50 1.0 1.9897e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01   6   0   0   0  12    6   0   0   0  12      0
> PCApply             100 1.0 5.7457e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 4.0e+00   2   4   0   0   1    2   4   0   0   1    169
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Container 2 2 1096 0
> Vector 16 16 108696592 0
> Vector Scatter 2 2 1240 0
> Matrix 1 1 96326824 0
> Distributed Mesh 3 3 7775936 0
> Bipartite Graph 6 6 4104 0
> Index Set 5 5 3884908 0
> IS L to G Mapping 1 1 3881760 0
> SNES 1 1 1268 0
> SNESLineSearch 1 1 840 0
> Viewer 1 0 0 0
> Krylov Solver 1 1 18288 0
> Preconditioner 1 1 792 0
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> #PETSc Option Table entries:
> -da_vec_type cusp
> -dm_mat_type seqaijcusp
> -ksp_monitor
> -log_summary
> -pc_type jacobi
> -snes_converged_reason
> -snes_monitor
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure run at: Fri Nov 16 08:40:52 2012
> Configure options: --with-clanguage=C++ --with-mpi-dir=/usr
> --with-shared-libraries --with-cuda-arch=sm_20 --CFLAGS=-O0
> --CXXFLAGS=-O0 --CUDAFLAGS=-O0 --with-etags=1 --with-mpi4py=0
> --with-blas-lapack-lib="[/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_rt.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_intel_thread.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_core.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libiomp5.so]"
> --download-blacs --download-superlu_dist --download-triangle
> --download-parmetis --download-metis --download-mumps
> --download-scalapack --with-cuda=1 --with-cusp=1 --with-thrust=1
> --with-cuda-dir=/opt/apps/cuda/4.2//cuda --with-sieve=1
> --download-exodusii=yes --download-netcdf --with-boost=1
> --with-boost-dir=/usr --download-fiat=yes --download-generator
> --download-scientificpython --with-matlab=1
> --with-matlab-engine=1 --with-matlab-dir=/opt/MATLAB/R2011a
> -----------------------------------------
> Libraries compiled on Fri Nov 16 08:40:52 2012 on SCRGP2
> Machine characteristics:
> Linux-2.6.32-41-server-x86_64-with-debian-squeeze-sid
> Using PETSc directory: /opt/apps/PETSC/petsc-3.3-p4
> Using PETSc arch: gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg
> -----------------------------------------
>
> Using C compiler: /usr/bin/mpicxx -O0 -g -fPIC ${COPTFLAGS}
> ${CFLAGS}
> Using Fortran compiler: /usr/bin/mpif90 -fPIC -Wall
> -Wno-unused-variable -g ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
>
> Using include paths:
> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
> -I/opt/apps/PETSC/petsc-3.3-p4/include
> -I/opt/apps/PETSC/petsc-3.3-p4/include
> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
> -I/opt/apps/cuda/4.2//cuda/include
> -I/opt/apps/PETSC/petsc-3.3-p4/include/sieve
> -I/opt/MATLAB/R2011a/extern/include -I/usr/include
> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/cbind/include
> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/forbind/include
> -I/usr/include/mpich2
> -----------------------------------------
>
> Using C linker: /usr/bin/mpicxx
> Using Fortran linker: /usr/bin/mpif90
> Using libraries:
> -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
> -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
> -lpetsc
> -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
> -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
> -ltriangle -lX11 -lpthread -lsuperlu_dist_3.1 -lcmumps -ldmumps
> -lsmumps -lzmumps -lmumps_common -lpord -lparmetis -lmetis
> -lscalapack -lblacs -Wl,-rpath,/opt/apps/cuda/4.2//cuda/lib64
> -L/opt/apps/cuda/4.2//cuda/lib64 -lcufft -lcublas -lcudart
> -lcusparse
> -Wl,-rpath,/opt/MATLAB/R2011a/sys/os/glnxa64:/opt/MATLAB/R2011a/bin/glnxa64:/opt/MATLAB/R2011a/extern/lib/glnxa64
> -L/opt/MATLAB/R2011a/bin/glnxa64
> -L/opt/MATLAB/R2011a/extern/lib/glnxa64 -leng -lmex -lmx -lmat
> -lut -licudata -licui18n -licuuc
> -Wl,-rpath,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib
> -L/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -lmkl_rt
> -lmkl_intel_thread -lmkl_core -liomp5 -lexoIIv2for -lexodus
> -lnetcdf_c++ -lnetcdf
> -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.4.3
> -L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -lmpichf90 -lgfortran -lm
> -lm -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lopa
> -lpthread -lrt -lgcc_s -ldl
> -----------------------------------------
>
>
>
> On Sat, Nov 17, 2012 at 11:02 AM, Matthew Knepley <knepley at gmail.com> wrote:
>
> On Sat, Nov 17, 2012 at 10:50 AM, David Fuentes <fuentesdt at gmail.com> wrote:
> > Hi,
> >
> > I'm using PETSc 3.3-p4.
> > I'm trying to run a nonlinear SNES solver on the GPU with GMRES and a Jacobi PC,
> > using the VECSEQCUSP and MATSEQAIJCUSP datatypes for the rhs and Jacobian matrix,
> > respectively.
> > When running top I still see significant CPU utilization (800-900 %CPU)
> > during the solve, possibly from some multithreaded operations?
> >
> > Is this expected?
> > I was thinking that since I input everything into the solver as a CUSP
> > datatype, all linear algebra operations would be on the GPU device from
> > there, and I wasn't expecting to see such CPU utilization during the solve.
> > Do I probably have an error in my code somewhere?
>
> We cannot answer performance questions without -log_summary
>
> Matt
>
> > Thanks,
> > David
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results
> to which
> their experiments lead.
> -- Norbert Wiener
>
>
>