[petsc-users] CPU utilization during GPU solver

Matthew Knepley knepley at gmail.com
Sat Nov 17 14:42:58 CST 2012


On Sat, Nov 17, 2012 at 3:05 PM, David Fuentes <fuentesdt at gmail.com> wrote:
> Thanks Jed.
> I was trying to run it in debug mode to verify that all significant parts of
> the solver were running on the GPU and not on the CPU by mistake.
> I can't pinpoint what part of the solver is running on the CPU. When I run
> top while the solver is running, there is ~800% CPU utilization that I
> wasn't expecting. I can't tell whether I'm slowing things down by
> transferring between CPU and GPU by accident.

1) I am not sure what you mean by 800%, but it is definitely legitimate to
want to know where you are computing.

2) At least some computation is happening on the GPU. I can tell this from
the VecCUSPCopyTo/MatCUSPCopyTo events.

3) Your flop rates are not great. The MatMult rate is about half of what we
get on a Tesla, but you could have another card without good double-precision
support. The vector operations, however, are quite poor.

4) It looks like half the flops are in MatMult, which is definitely on the
card, and the rest are in vector operations. Do you create any other vectors
without the CUSP type? (See the sketch below.)
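For reference, here is a minimal sketch (not your code; the size and variable
names are made up) of how a vector gets the CUSP type so that its operations
run on the GPU. Any vector created without this type, and not duplicated from
one that has it, will compute on the CPU instead:

  #include <petscvec.h>

  /* Sketch only: assumes PETSc configured --with-cusp=1. */
  int main(int argc,char **argv)
  {
    Vec            x;
    PetscErrorCode ierr;
    PetscInt       n = 100;                      /* illustrative local size */

    ierr = PetscInitialize(&argc,&argv,NULL,NULL);CHKERRQ(ierr);
    ierr = VecCreate(PETSC_COMM_SELF,&x);CHKERRQ(ierr);
    ierr = VecSetSizes(x,PETSC_DECIDE,n);CHKERRQ(ierr);
    /* Put the vector on the GPU; VecSetFromOptions() plus -vec_type cusp
       on the command line accomplishes the same thing. */
    ierr = VecSetType(x,VECSEQCUSP);CHKERRQ(ierr);
    /* Vectors obtained with VecDuplicate(x,...) inherit this type;
       anything made with, e.g., VecCreateSeq() stays on the CPU. */
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return 0;
  }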

   Matt

> thanks again,
> df
>
> On Sat, Nov 17, 2012 at 1:49 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>
>> Please read the large boxed message about debugging mode.
>>
>> (Replying from phone so can't make it 72 point blinking red, sorry.)
>>
>> On Nov 17, 2012 1:41 PM, "David Fuentes" <fuentesdt at gmail.com> wrote:
>>>
>>> Thanks, Matt.
>>>
>>> My log summary is below.
>>>
>>>
>>> ************************************************************************************************************************
>>> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
>>>
>>> ************************************************************************************************************************
>>>
>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>
>>> ./FocusUltraSoundModel on a gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg named SCRGP2 with 1 processor, by fuentes Sat Nov 17 13:35:06 2012
>>> Using Petsc Release Version 3.3.0, Patch 4, Fri Oct 26 10:46:51 CDT 2012
>>>
>>>                          Max       Max/Min        Avg      Total
>>> Time (sec):           3.164e+01      1.00000   3.164e+01
>>> Objects:              4.100e+01      1.00000   4.100e+01
>>> Flops:                2.561e+09      1.00000   2.561e+09  2.561e+09
>>> Flops/sec:            8.097e+07      1.00000   8.097e+07  8.097e+07
>>> Memory:               2.129e+08      1.00000              2.129e+08
>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
>>> MPI Reductions:       4.230e+02      1.00000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>                             e.g., VecAXPY() for real vectors of length N --> 2N flops
>>>                             and VecAXPY() for complex vectors of length N --> 8N flops
>>>
>>> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>>>  0:      Main Stage: 3.1636e+01 100.0%  2.5615e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  4.220e+02  99.8%
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>> Phase summary info:
>>>    Count: number of times phase was executed
>>>    Time and Flops: Max - maximum over all processors
>>>                    Ratio - ratio of maximum to minimum over all processors
>>>    Mess: number of messages sent
>>>    Avg. len: average message length
>>>    Reduct: number of global reductions
>>>    Global: entire computation
>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>       %T - percent time in this phase         %f - percent flops in this phase
>>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>>       %R - percent reductions in this phase
>>>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>>       ##########################################################
>>>       #                                                        #
>>>       #                          WARNING!!!                    #
>>>       #                                                        #
>>>       #   This code was compiled with a debugging option,      #
>>>       #   To get timing results run ./configure                #
>>>       #   using --with-debugging=no, the performance will      #
>>>       #   be generally two or three times faster.              #
>>>       #                                                        #
>>>       ##########################################################
>>>
>>>
>>> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> ComputeFunction       52 1.0 3.9104e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  1  0  0  0  1   1  0  0  0  1     0
>>> VecDot                50 1.0 3.2072e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3025
>>> VecMDot               50 1.0 1.3100e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   741
>>> VecNorm              200 1.0 9.7943e-02 1.0 3.88e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 15  0  0  0   0 15  0  0  0  3963
>>> VecScale             100 1.0 1.3496e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   719
>>> VecCopy              150 1.0 4.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>> VecSet               164 1.0 2.9707e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> VecAXPY               50 1.0 3.2194e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3014
>>> VecWAXPY              50 1.0 2.9040e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0   167
>>> VecMAXPY             100 1.0 5.4555e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  8  0  0  0   2  8  0  0  0   356
>>> VecPointwiseMult     100 1.0 5.3003e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0   183
>>> VecScatterBegin       53 1.0 1.8660e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> VecReduceArith       101 1.0 6.9973e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  2801
>>> VecReduceComm         51 1.0 1.0252e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecNormalize         100 1.0 1.8565e-01 1.0 2.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  1568
>>> VecCUSPCopyTo        152 1.0 5.8016e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>> VecCUSPCopyFrom      201 1.0 6.0029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>> MatMult              100 1.0 6.8465e-01 1.0 1.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2 49  0  0  0   2 49  0  0  0  1825
>>> MatAssemblyBegin       3 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyEnd         3 1.0 2.7767e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatZeroEntries         1 1.0 2.0346e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatCUSPCopyTo          3 1.0 1.4056e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> SNESSolve              1 1.0 2.2094e+01 1.0 2.56e+09 1.0 0.0e+00 0.0e+00 3.7e+02 70100  0  0 88  70100  0  0 89   116
>>> SNESFunctionEval      51 1.0 3.9031e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> SNESJacobianEval      50 1.0 1.3191e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
>>> SNESLineSearch        50 1.0 6.2922e+00 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 5.0e+01 20 45  0  0 12  20 45  0  0 12   184
>>> KSPGMRESOrthog        50 1.0 4.0436e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 5.0e+01  1  8  0  0 12   1  8  0  0 12   480
>>> KSPSetUp              50 1.0 2.1935e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+01  0  0  0  0  4   0  0  0  0  4     0
>>> KSPSolve              50 1.0 1.3230e+01 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 3.2e+02 42 55  0  0 75  42 55  0  0 75   106
>>> PCSetUp               50 1.0 1.9897e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01  6  0  0  0 12   6  0  0  0 12     0
>>> PCApply              100 1.0 5.7457e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 4.0e+00  2  4  0  0  1   2  4  0  0  1   169
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>>            Container     2              2         1096     0
>>>               Vector    16             16    108696592     0
>>>       Vector Scatter     2              2         1240     0
>>>               Matrix     1              1     96326824     0
>>>     Distributed Mesh     3              3      7775936     0
>>>      Bipartite Graph     6              6         4104     0
>>>            Index Set     5              5      3884908     0
>>>    IS L to G Mapping     1              1      3881760     0
>>>                 SNES     1              1         1268     0
>>>       SNESLineSearch     1              1          840     0
>>>               Viewer     1              0            0     0
>>>        Krylov Solver     1              1        18288     0
>>>       Preconditioner     1              1          792     0
>>>
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 9.53674e-08
>>> #PETSc Option Table entries:
>>> -da_vec_type cusp
>>> -dm_mat_type seqaijcusp
>>> -ksp_monitor
>>> -log_summary
>>> -pc_type jacobi
>>> -snes_converged_reason
>>> -snes_monitor
>>> #End of PETSc Option Table entries
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>> Configure run at: Fri Nov 16 08:40:52 2012
>>> Configure options: --with-clanguage=C++ --with-mpi-dir=/usr
>>> --with-shared-libraries --with-cuda-arch=sm_20 --CFLAGS=-O0 --CXXFLAGS=-O0
>>> --CUDAFLAGS=-O0 --with-etags=1 --with-mpi4py=0
>>> --with-blas-lapack-lib="[/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_rt.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_intel_thread.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_core.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libiomp5.so]"
>>> --download-blacs --download-superlu_dist --download-triangle
>>> --download-parmetis --download-metis --download-mumps --download-scalapack
>>> --with-cuda=1 --with-cusp=1 --with-thrust=1
>>> --with-cuda-dir=/opt/apps/cuda/4.2//cuda --with-sieve=1
>>> --download-exodusii=yes --download-netcdf --with-boost=1
>>> --with-boost-dir=/usr --download-fiat=yes --download-generator
>>> --download-scientificpython --with-matlab=1 --with-matlab-engine=1
>>> --with-matlab-dir=/opt/MATLAB/R2011a
>>> -----------------------------------------
>>> Libraries compiled on Fri Nov 16 08:40:52 2012 on SCRGP2
>>> Machine characteristics:
>>> Linux-2.6.32-41-server-x86_64-with-debian-squeeze-sid
>>> Using PETSc directory: /opt/apps/PETSC/petsc-3.3-p4
>>> Using PETSc arch: gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg
>>> -----------------------------------------
>>>
>>> Using C compiler: /usr/bin/mpicxx -O0 -g   -fPIC   ${COPTFLAGS} ${CFLAGS}
>>> Using Fortran compiler: /usr/bin/mpif90  -fPIC -Wall -Wno-unused-variable
>>> -g   ${FOPTFLAGS} ${FFLAGS}
>>> -----------------------------------------
>>>
>>> Using include paths:
>>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
>>> -I/opt/apps/cuda/4.2//cuda/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/include/sieve
>>> -I/opt/MATLAB/R2011a/extern/include -I/usr/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/cbind/include
>>> -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/forbind/include
>>> -I/usr/include/mpich2
>>> -----------------------------------------
>>>
>>> Using C linker: /usr/bin/mpicxx
>>> Using Fortran linker: /usr/bin/mpif90
>>> Using libraries:
>>> -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>>> -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>>> -lpetsc
>>> -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>>> -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
>>> -ltriangle -lX11 -lpthread -lsuperlu_dist_3.1 -lcmumps -ldmumps -lsmumps
>>> -lzmumps -lmumps_common -lpord -lparmetis -lmetis -lscalapack -lblacs
>>> -Wl,-rpath,/opt/apps/cuda/4.2//cuda/lib64 -L/opt/apps/cuda/4.2//cuda/lib64
>>> -lcufft -lcublas -lcudart -lcusparse
>>> -Wl,-rpath,/opt/MATLAB/R2011a/sys/os/glnxa64:/opt/MATLAB/R2011a/bin/glnxa64:/opt/MATLAB/R2011a/extern/lib/glnxa64
>>> -L/opt/MATLAB/R2011a/bin/glnxa64 -L/opt/MATLAB/R2011a/extern/lib/glnxa64
>>> -leng -lmex -lmx -lmat -lut -licudata -licui18n -licuuc
>>> -Wl,-rpath,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib
>>> -L/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -lmkl_rt -lmkl_intel_thread
>>> -lmkl_core -liomp5 -lexoIIv2for -lexodus -lnetcdf_c++ -lnetcdf
>>> -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.4.3
>>> -L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -lmpichf90 -lgfortran -lm -lm
>>> -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lpthread -lrt
>>> -lgcc_s -ldl
>>> -----------------------------------------
>>>
>>>
>>>
>>> On Sat, Nov 17, 2012 at 11:02 AM, Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>>
>>>> On Sat, Nov 17, 2012 at 10:50 AM, David Fuentes <fuentesdt at gmail.com> wrote:
>>>> > Hi,
>>>> >
>>>> > I'm using PETSc 3.3-p4.
>>>> > I'm trying to run a nonlinear SNES solver on the GPU with GMRES and a
>>>> > Jacobi PC, using the VECSEQCUSP and MATSEQAIJCUSP data types for the rhs
>>>> > and the Jacobian matrix, respectively.
>>>> > When running top I still see significant CPU utilization (800-900 %CPU)
>>>> > during the solve, possibly from some multithreaded operations.
>>>> >
>>>> > Is this expected?
>>>> > I was thinking that since I pass everything into the solver as a CUSP
>>>> > data type, all linear algebra operations would happen on the GPU device
>>>> > from there, and I wasn't expecting to see such CPU utilization during
>>>> > the solve. Could I have an error in my code somewhere?
>>>>
>>>> We cannot answer performance questions without -log_summary
>>>>
>>>>    Matt
>>>>
>>>> > Thanks,
>>>> > David
>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which
>>>> their experiments lead.
>>>> -- Norbert Wiener
>>>
>>>
>



--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener

