[petsc-users] CPU utilization during GPU solver
Matthew Knepley
knepley at gmail.com
Sat Nov 17 14:42:58 CST 2012
On Sat, Nov 17, 2012 at 3:05 PM, David Fuentes <fuentesdt at gmail.com> wrote:
> Thanks Jed.
> I was trying to run it in debug mode to verify that all significant parts of
> the solver were running on the GPU and not on the CPU by mistake.
> I can't pinpoint what part of the solver is running on the CPU. When I run
> top while the solver is running, there is ~800% CPU utilization that I
> wasn't expecting. I can't tell whether I'm slowing things down by
> transferring between the CPU and GPU by accident.
1) I am not sure what you mean by 800%, but it is definitely legitimate to
want to know where you are computing.

2) At least some computation is happening on the GPU. I can tell this from
the VecCUSPCopyTo/MatCUSPCopyTo events.

3) Your flop rates are not great. The MatMult is about half of what we get
on a Tesla, but you could have another card without good double-precision
support. The vector ops, however, are pretty bad.

4) It looks like half the flops are in MatMult, which is definitely on the
card, and the rest are in vector operations. Do you create any other
vectors without the CUSP type?
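For reference, here is a minimal sketch (illustrative only, not taken from
your code; it assumes a PETSc 3.3 build with --with-cuda and --with-cusp)
of creating a work vector directly with the CUSP type so its operations
stay on the GPU:

  #include <petscvec.h>

  /* Minimal sketch: create a GPU-resident work vector by giving it the
     CUSP type explicitly. A vector left at the default VECSEQ type would
     run its kernels on the CPU instead. The size here is arbitrary. */
  int main(int argc, char **argv)
  {
    Vec            w;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);
    ierr = VecCreate(PETSC_COMM_SELF, &w);CHKERRQ(ierr);
    ierr = VecSetSizes(w, PETSC_DECIDE, 1000000);CHKERRQ(ierr);
    ierr = VecSetType(w, VECSEQCUSP);CHKERRQ(ierr);  /* GPU-resident vector */
    /* ... use w in the residual/Jacobian routines ... */
    ierr = VecDestroy(&w);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return 0;
  }

Vectors obtained with VecDuplicate() from a CUSP vector, or from the DM
when you pass -da_vec_type cusp, inherit the GPU type; anything created as
plain VECSEQ will run its kernels on the CPU.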
Matt
> thanks again,
> df
>
> On Sat, Nov 17, 2012 at 1:49 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>
>> Please read the large boxed message about debugging mode.
>>
>> (Replying from phone so can't make it 72 point blinking red, sorry.)
>>
>> On Nov 17, 2012 1:41 PM, "David Fuentes" <fuentesdt at gmail.com> wrote:
>>>
>>> thanks Matt,
>>>
>>> My log summary is below.
>>>
>>>
>>> ************************************************************************************************************************
>>> ***    WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document    ***
>>>
>>> ************************************************************************************************************************
>>>
>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>
>>> ./FocusUltraSoundModel on a gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg named SCRGP2 with 1 processor, by fuentes Sat Nov 17 13:35:06 2012
>>> Using Petsc Release Version 3.3.0, Patch 4, Fri Oct 26 10:46:51 CDT 2012
>>>
>>> Max Max/Min Avg Total
>>> Time (sec): 3.164e+01 1.00000 3.164e+01
>>> Objects: 4.100e+01 1.00000 4.100e+01
>>> Flops: 2.561e+09 1.00000 2.561e+09 2.561e+09
>>> Flops/sec: 8.097e+07 1.00000 8.097e+07 8.097e+07
>>> Memory: 2.129e+08 1.00000 2.129e+08
>>> MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
>>> MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
>>> MPI Reductions: 4.230e+02 1.00000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>>>                           and VecAXPY() for complex vectors of length N --> 8N flops
>>>
>>> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>>>  0:  Main Stage: 3.1636e+01 100.0%  2.5615e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  4.220e+02  99.8%
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>> Phase summary info:
>>>    Count: number of times phase was executed
>>>    Time and Flops: Max - maximum over all processors
>>>                    Ratio - ratio of maximum to minimum over all processors
>>>    Mess: number of messages sent
>>>    Avg. len: average message length
>>>    Reduct: number of global reductions
>>>    Global: entire computation
>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>       %T - percent time in this phase         %f - percent flops in this phase
>>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>>       %R - percent reductions in this phase
>>>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>> ##########################################################
>>> # #
>>> # WARNING!!! #
>>> # #
>>> # This code was compiled with a debugging option, #
>>> # To get timing results run ./configure #
>>> # using --with-debugging=no, the performance will #
>>> # be generally two or three times faster. #
>>> # #
>>> ##########################################################
>>>
>>>
>>> Event                Count      Time (sec)     Flops                              --- Global ---  --- Stage ---   Total
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> ComputeFunction       52 1.0 3.9104e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  1  0  0  0  1   1  0  0  0  1     0
>>> VecDot                50 1.0 3.2072e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3025
>>> VecMDot               50 1.0 1.3100e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   741
>>> VecNorm              200 1.0 9.7943e-02 1.0 3.88e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 15  0  0  0   0 15  0  0  0  3963
>>> VecScale             100 1.0 1.3496e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   719
>>> VecCopy              150 1.0 4.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>> VecSet               164 1.0 2.9707e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> VecAXPY               50 1.0 3.2194e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3014
>>> VecWAXPY              50 1.0 2.9040e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0   167
>>> VecMAXPY             100 1.0 5.4555e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  8  0  0  0   2  8  0  0  0   356
>>> VecPointwiseMult     100 1.0 5.3003e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0   183
>>> VecScatterBegin       53 1.0 1.8660e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> VecReduceArith       101 1.0 6.9973e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  2801
>>> VecReduceComm         51 1.0 1.0252e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecNormalize         100 1.0 1.8565e-01 1.0 2.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  1568
>>> VecCUSPCopyTo        152 1.0 5.8016e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>> VecCUSPCopyFrom      201 1.0 6.0029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>> MatMult              100 1.0 6.8465e-01 1.0 1.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2 49  0  0  0   2 49  0  0  0  1825
>>> MatAssemblyBegin       3 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyEnd         3 1.0 2.7767e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatZeroEntries         1 1.0 2.0346e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatCUSPCopyTo          3 1.0 1.4056e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> SNESSolve              1 1.0 2.2094e+01 1.0 2.56e+09 1.0 0.0e+00 0.0e+00 3.7e+02 70100  0  0 88  70100  0  0 89   116
>>> SNESFunctionEval      51 1.0 3.9031e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> SNESJacobianEval      50 1.0 1.3191e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
>>> SNESLineSearch        50 1.0 6.2922e+00 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 5.0e+01 20 45  0  0 12  20 45  0  0 12   184
>>> KSPGMRESOrthog        50 1.0 4.0436e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 5.0e+01  1  8  0  0 12   1  8  0  0 12   480
>>> KSPSetUp              50 1.0 2.1935e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+01  0  0  0  0  4   0  0  0  0  4     0
>>> KSPSolve              50 1.0 1.3230e+01 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 3.2e+02 42 55  0  0 75  42 55  0  0 75   106
>>> PCSetUp               50 1.0 1.9897e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01  6  0  0  0 12   6  0  0  0 12     0
>>> PCApply              100 1.0 5.7457e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 4.0e+00  2  4  0  0  1   2  4  0  0  1   169
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> Container 2 2 1096 0
>>> Vector 16 16 108696592 0
>>> Vector Scatter 2 2 1240 0
>>> Matrix 1 1 96326824 0
>>> Distributed Mesh 3 3 7775936 0
>>> Bipartite Graph 6 6 4104 0
>>> Index Set 5 5 3884908 0
>>> IS L to G Mapping 1 1 3881760 0
>>> SNES 1 1 1268 0
>>> SNESLineSearch 1 1 840 0
>>> Viewer 1 0 0 0
>>> Krylov Solver 1 1 18288 0
>>> Preconditioner 1 1 792 0
>>>
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 9.53674e-08
>>> #PETSc Option Table entries:
>>> -da_vec_type cusp
>>> -dm_mat_type seqaijcusp
>>> -ksp_monitor
>>> -log_summary
>>> -pc_type jacobi
>>> -snes_converged_reason
>>> -snes_monitor
>>> #End of PETSc Option Table entries
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>> Configure run at: Fri Nov 16 08:40:52 2012
>>> Configure options: --with-clanguage=C++ --with-mpi-dir=/usr
>>> --with-shared-libraries --with-cuda-arch=sm_20 --CFLAGS=-O0 --CXXFLAGS=-O0
>>> --CUDAFLAGS=-O0 --with-etags=1 --with-mpi4py=0
>>> --with-blas-lapack-lib="[/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_rt.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_intel_thread.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_core.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libiomp5.so]"
>>> --download-blacs --download-superlu_dist --download-triangle
>>> --download-parmetis --download-metis --download-mumps --download-scalapack
>>> --with-cuda=1 --with-cusp=1 --with-thrust=1
>>> --with-cuda-dir=/opt/apps/cuda/4.2//cuda --with-sieve=1
>>> --download-exodusii=yes --download-netcdf --with-boost=1
>>> --with-boost-dir=/usr --download-fiat=yes --download-generator
>>> --download-scientificpython --with-matlab=1 --with-matlab-engine=1
>>> --with-matlab-dir=/opt/MATLAB/R2011a
>>> -----------------------------------------
>>> Libraries compiled on Fri Nov 16 08:40:52 2012 on SCRGP2
>>> Machine characteristics:
>>> Linux-2.6.32-41-server-x86_64-with-debian-squeeze-sid
>>> Using PETSc directory: /opt/apps/PETSC/petsc-3.3-p4
>>> Using PETSc arch: gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg
>>> -----------------------------------------
>>>
>>> Using C compiler: /usr/bin/mpicxx -O0 -g -fPIC ${COPTFLAGS} ${CFLAGS}
>>> Using Fortran compiler: /usr/bin/mpif90 -fPIC -Wall -Wno-unused-variable
>>> -g ${FOPTFLAGS} ${FFLAGS}
>>> -----------------------------------------
>>>
>>> Using include paths:
>>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
>>> -I/opt/apps/cuda/4.2//cuda/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/include/sieve
>>> -I/opt/MATLAB/R2011a/extern/include -I/usr/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/cbind/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/forbind/include
>>> -I/usr/include/mpich2
>>> -----------------------------------------
>>>
>>> Using C linker: /usr/bin/mpicxx
>>> Using Fortran linker: /usr/bin/mpif90
>>> Using libraries:
>>> -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>>> -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>>> -lpetsc
>>> -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>>> -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>>> -ltriangle -lX11 -lpthread -lsuperlu_dist_3.1 -lcmumps -ldmumps -lsmumps
>>> -lzmumps -lmumps_common -lpord -lparmetis -lmetis -lscalapack -lblacs
>>> -Wl,-rpath,/opt/apps/cuda/4.2//cuda/lib64 -L/opt/apps/cuda/4.2//cuda/lib64
>>> -lcufft -lcublas -lcudart -lcusparse
>>> -Wl,-rpath,/opt/MATLAB/R2011a/sys/os/glnxa64:/opt/MATLAB/R2011a/bin/glnxa64:/opt/MATLAB/R2011a/extern/lib/glnxa64
>>> -L/opt/MATLAB/R2011a/bin/glnxa64 -L/opt/MATLAB/R2011a/extern/lib/glnxa64
>>> -leng -lmex -lmx -lmat -lut -licudata -licui18n -licuuc
>>> -Wl,-rpath,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib
>>> -L/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -lmkl_rt -lmkl_intel_thread
>>> -lmkl_core -liomp5 -lexoIIv2for -lexodus -lnetcdf_c++ -lnetcdf
>>> -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.4.3
>>> -L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -lmpichf90 -lgfortran -lm -lm
>>> -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lpthread -lrt
>>> -lgcc_s -ldl
>>> -----------------------------------------
>>>
>>>
>>>
>>> On Sat, Nov 17, 2012 at 11:02 AM, Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>>
>>>> On Sat, Nov 17, 2012 at 10:50 AM, David Fuentes <fuentesdt at gmail.com>
>>>> wrote:
>>>> > Hi,
>>>> >
>>>> > I'm using PETSc 3.3-p4.
>>>> > I'm trying to run a nonlinear SNES solver on the GPU with GMRES and a
>>>> > Jacobi PC, using the VECSEQCUSP and MATSEQAIJCUSP data types for the
>>>> > rhs and the Jacobian matrix, respectively.
>>>> > When running top I still see significant CPU utilization (800-900 %CPU)
>>>> > during the solve, possibly from some multithreaded operations.
>>>> >
>>>> > Is this expected?
>>>> > I was thinking that since I input everything into the solver as a CUSP
>>>> > data type, all linear algebra operations would happen on the GPU device
>>>> > from there, and I wasn't expecting to see such CPU utilization during
>>>> > the solve. Do I perhaps have an error in my code somewhere?
>>>>
>>>> We cannot answer performance questions without -log_summary.
>>>>
>>>> Matt
>>>>
>>>> > Thanks,
>>>> > David
>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which
>>>> their experiments lead.
>>>> -- Norbert Wiener
>>>
>>>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener