[petsc-users] CPU utilization during GPU solver
David Fuentes
fuentesdt at gmail.com
Sat Nov 17 13:41:35 CST 2012
Thanks Matt,
My log summary is below.
************************************************************************************************************************
***        WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document        ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./FocusUltraSoundModel on a gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg named SCRGP2 with 1 processor, by fuentes Sat Nov 17 13:35:06 2012
Using Petsc Release Version 3.3.0, Patch 4, Fri Oct 26 10:46:51 CDT 2012
                         Max       Max/Min        Avg      Total
Time (sec):           3.164e+01      1.00000   3.164e+01
Objects:              4.100e+01      1.00000   4.100e+01
Flops:                2.561e+09      1.00000   2.561e+09  2.561e+09
Flops/sec:            8.097e+07      1.00000   8.097e+07  8.097e+07
Memory:               2.129e+08      1.00000              2.129e+08
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       4.230e+02      1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 3.1636e+01 100.0%  2.5615e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  4.220e+02  99.8%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %f - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option, #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
ComputeFunction       52 1.0 3.9104e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  1  0  0  0  1   1  0  0  0  1     0
VecDot                50 1.0 3.2072e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3025
VecMDot               50 1.0 1.3100e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   741
VecNorm              200 1.0 9.7943e-02 1.0 3.88e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 15  0  0  0   0 15  0  0  0  3963
VecScale             100 1.0 1.3496e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   719
VecCopy              150 1.0 4.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
VecSet               164 1.0 2.9707e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecAXPY               50 1.0 3.2194e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  3014
VecWAXPY              50 1.0 2.9040e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0   167
VecMAXPY             100 1.0 5.4555e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  8  0  0  0   2  8  0  0  0   356
VecPointwiseMult     100 1.0 5.3003e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0   183
VecScatterBegin       53 1.0 1.8660e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecReduceArith       101 1.0 6.9973e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  2801
VecReduceComm         51 1.0 1.0252e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecNormalize         100 1.0 1.8565e-01 1.0 2.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  1568
VecCUSPCopyTo        152 1.0 5.8016e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
VecCUSPCopyFrom      201 1.0 6.0029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatMult              100 1.0 6.8465e-01 1.0 1.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2 49  0  0  0   2 49  0  0  0  1825
MatAssemblyBegin       3 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         3 1.0 2.7767e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatZeroEntries         1 1.0 2.0346e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatCUSPCopyTo          3 1.0 1.4056e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SNESSolve              1 1.0 2.2094e+01 1.0 2.56e+09 1.0 0.0e+00 0.0e+00 3.7e+02 70100  0  0 88  70100  0  0 89   116
SNESFunctionEval      51 1.0 3.9031e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
SNESJacobianEval      50 1.0 1.3191e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
SNESLineSearch        50 1.0 6.2922e+00 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 5.0e+01 20 45  0  0 12  20 45  0  0 12   184
KSPGMRESOrthog        50 1.0 4.0436e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 5.0e+01  1  8  0  0 12   1  8  0  0 12   480
KSPSetUp              50 1.0 2.1935e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+01  0  0  0  0  4   0  0  0  0  4     0
KSPSolve              50 1.0 1.3230e+01 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 3.2e+02 42 55  0  0 75  42 55  0  0 75   106
PCSetUp               50 1.0 1.9897e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01  6  0  0  0 12   6  0  0  0 12     0
PCApply              100 1.0 5.7457e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 4.0e+00  2  4  0  0  1   2  4  0  0  1   169
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
           Container     2              2         1096         0
              Vector    16             16    108696592         0
      Vector Scatter     2              2         1240         0
              Matrix     1              1     96326824         0
    Distributed Mesh     3              3      7775936         0
     Bipartite Graph     6              6         4104         0
           Index Set     5              5      3884908         0
   IS L to G Mapping     1              1      3881760         0
                SNES     1              1         1268         0
      SNESLineSearch     1              1          840         0
              Viewer     1              0            0         0
       Krylov Solver     1              1        18288         0
      Preconditioner     1              1          792         0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
#PETSc Option Table entries:
-da_vec_type cusp
-dm_mat_type seqaijcusp
-ksp_monitor
-log_summary
-pc_type jacobi
-snes_converged_reason
-snes_monitor
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure run at: Fri Nov 16 08:40:52 2012
Configure options: --with-clanguage=C++ --with-mpi-dir=/usr
--with-shared-libraries --with-cuda-arch=sm_20 --CFLAGS=-O0 --CXXFLAGS=-O0
--CUDAFLAGS=-O0 --with-etags=1 --with-mpi4py=0
--with-blas-lapack-lib="[/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_rt.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_intel_thread.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_core.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libiomp5.so]"
--download-blacs --download-superlu_dist --download-triangle
--download-parmetis --download-metis --download-mumps --download-scalapack
--with-cuda=1 --with-cusp=1 --with-thrust=1
--with-cuda-dir=/opt/apps/cuda/4.2//cuda --with-sieve=1
--download-exodusii=yes --download-netcdf --with-boost=1
--with-boost-dir=/usr --download-fiat=yes --download-generator
--download-scientificpython --with-matlab=1 --with-matlab-engine=1
--with-matlab-dir=/opt/MATLAB/R2011a
-----------------------------------------
Libraries compiled on Fri Nov 16 08:40:52 2012 on SCRGP2
Machine characteristics:
Linux-2.6.32-41-server-x86_64-with-debian-squeeze-sid
Using PETSc directory: /opt/apps/PETSC/petsc-3.3-p4
Using PETSc arch: gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg
-----------------------------------------
Using C compiler: /usr/bin/mpicxx -O0 -g -fPIC ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /usr/bin/mpif90 -fPIC -Wall -Wno-unused-variable -g ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------
Using include paths:
-I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
-I/opt/apps/PETSC/petsc-3.3-p4/include
-I/opt/apps/PETSC/petsc-3.3-p4/include
-I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include
-I/opt/apps/cuda/4.2//cuda/include
-I/opt/apps/PETSC/petsc-3.3-p4/include/sieve
-I/opt/MATLAB/R2011a/extern/include -I/usr/include
-I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/cbind/include
-I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/forbind/include
-I/usr/include/mpich2
-----------------------------------------
Using C linker: /usr/bin/mpicxx
Using Fortran linker: /usr/bin/mpif90
Using libraries:
-Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
-L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
-lpetsc
-Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
-L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib
-ltriangle -lX11 -lpthread -lsuperlu_dist_3.1 -lcmumps -ldmumps -lsmumps
-lzmumps -lmumps_common -lpord -lparmetis -lmetis -lscalapack -lblacs
-Wl,-rpath,/opt/apps/cuda/4.2//cuda/lib64 -L/opt/apps/cuda/4.2//cuda/lib64
-lcufft -lcublas -lcudart -lcusparse
-Wl,-rpath,/opt/MATLAB/R2011a/sys/os/glnxa64:/opt/MATLAB/R2011a/bin/glnxa64:/opt/MATLAB/R2011a/extern/lib/glnxa64
-L/opt/MATLAB/R2011a/bin/glnxa64 -L/opt/MATLAB/R2011a/extern/lib/glnxa64
-leng -lmex -lmx -lmat -lut -licudata -licui18n -licuuc
-Wl,-rpath,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib
-L/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -lmkl_rt -lmkl_intel_thread
-lmkl_core -liomp5 -lexoIIv2for -lexodus -lnetcdf_c++ -lnetcdf
-Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.4.3
-L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -lmpichf90 -lgfortran -lm -lm
-lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lpthread -lrt
-lgcc_s -ldl
-----------------------------------------
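For reference, the options in the table above correspond to a run along
these lines (a sketch reconstructed from the log; the exact invocation may
have differed):

  ./FocusUltraSoundModel -da_vec_type cusp -dm_mat_type seqaijcusp \
      -pc_type jacobi -ksp_monitor -snes_monitor -snes_converged_reason \
      -log_summary
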
On Sat, Nov 17, 2012 at 11:02 AM, Matthew Knepley <knepley at gmail.com> wrote:
> On Sat, Nov 17, 2012 at 10:50 AM, David Fuentes <fuentesdt at gmail.com>
> wrote:
> > Hi,
> >
> > I'm using petsc 3.3p4.
> > I'm trying to run a nonlinear SNES solver on the GPU with GMRES and a
> > Jacobi PC, using the VECSEQCUSP and MATSEQAIJCUSP datatypes for the rhs
> > and Jacobian matrix respectively.
> > When running top I still see significant CPU utilization (800-900 %CPU)
> > during the solve, possibly from some multithreaded operations?
> >
> > Is this expected?
> > I was thinking that since I input everything into the solver as a CUSP
> > datatype, all linear algebra operations would be on the GPU device from
> > there, and I wasn't expecting to see such CPU utilization during the
> > solve. Do I perhaps have an error in my code somewhere?
>
> We cannot answer performance questions without -log_summary
>
> Matt
>
> > Thanks,
> > David
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener
>
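
For concreteness, here is a minimal sketch of the kind of setup described
in my quoted question above, assuming PETSc 3.3 built with --with-cuda and
--with-cusp. It is illustrative only, not my actual FocusUltraSoundModel
code; the function name and n are placeholders.

  #include <petscsnes.h>

  /* Create a GPU-resident vector and matrix to use as the SNES residual
     and Jacobian. With -pc_type jacobi and GMRES, the KSP kernels
     (MatMult, VecDot, VecNorm, ...) should then execute on the device. */
  PetscErrorCode CreateGPUObjects(PetscInt n, Vec *x, Mat *J)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = VecCreate(PETSC_COMM_SELF, x);CHKERRQ(ierr);
    ierr = VecSetSizes(*x, PETSC_DECIDE, n);CHKERRQ(ierr);
    ierr = VecSetType(*x, VECSEQCUSP);CHKERRQ(ierr);      /* GPU vector */

    ierr = MatCreate(PETSC_COMM_SELF, J);CHKERRQ(ierr);
    ierr = MatSetSizes(*J, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
    ierr = MatSetType(*J, MATSEQAIJCUSP);CHKERRQ(ierr);   /* GPU matrix */
    ierr = MatSetUp(*J);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }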