Thanks Jed.<div>I was trying to run it in dbg mode to verify that all significant parts of the solver were running on the GPU and not on the CPU by mistake.</div><div>I can't pinpoint what part of the solver is running on the CPU. When I run top while the solver is running, there is ~800% CPU utilization</div>
<div>that I wasn't expecting. I can't tell whether I'm slowing things down by accidentally transferring between CPU and GPU.</div><div><br></div><div>thanks again,</div><div>df</div><div><br><div class="gmail_quote">On Sat, Nov 17, 2012 at 1:49 PM, Jed Brown <span dir="ltr"><<a href="mailto:jedbrown@mcs.anl.gov" target="_blank">jedbrown@mcs.anl.gov</a>></span> wrote:<br>
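One guess worth ruling out, not a confirmed diagnosis: the configure line below links the threaded MKL (libmkl_intel_thread plus libiomp5), so any host-side BLAS call can fan out across all cores and show up as ~800% CPU in top even when the main solver kernels run on the GPU. Pinning the thread counts to 1 before re-running would isolate that effect (the solver invocation is commented out here since it depends on the local build):

```shell
# Force MKL / OpenMP to a single host thread, then re-run and watch top.
# If the ~800% CPU drops to ~100%, the load was threaded host BLAS, not
# accidental CPU/GPU transfers.
export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1
# ./FocusUltraSoundModel -da_vec_type cusp -dm_mat_type seqaijcusp \
#     -pc_type jacobi -snes_monitor -log_summary
```

If the CPU load persists with one thread, comparing the VecCUSPCopyTo/VecCUSPCopyFrom counts in `-log_summary` against the number of solver iterations is the next place to look for unintended transfers.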
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><p>Please read the large boxed message about debugging mode.</p>
<p>(Replying from phone so can't make it 72 point blinking red, sorry.)</p><div class="HOEnZb"><div class="h5">
<div class="gmail_quote">On Nov 17, 2012 1:41 PM, "David Fuentes" <<a href="mailto:fuentesdt@gmail.com" target="_blank">fuentesdt@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>thanks Matt,</div><div><br></div><div>My log summary is below.</div><div><br></div><div><div>************************************************************************************************************************</div>
<div>*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***</div><div>************************************************************************************************************************</div>
<div><br></div><div>---------------------------------------------- PETSc Performance Summary: ----------------------------------------------</div><div><br></div><div>./FocusUltraSoundModel on a gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg named SCRGP2 with 1 processor, by fuentes Sat Nov 17 13:35:06 2012</div>
<div>Using Petsc Release Version 3.3.0, Patch 4, Fri Oct 26 10:46:51 CDT 2012 </div><div><br></div><div> Max Max/Min Avg Total </div><div>Time (sec): 3.164e+01 1.00000 3.164e+01</div>
<div>Objects: 4.100e+01 1.00000 4.100e+01</div><div>Flops: 2.561e+09 1.00000 2.561e+09 2.561e+09</div><div>Flops/sec: 8.097e+07 1.00000 8.097e+07 8.097e+07</div>
<div>Memory: 2.129e+08 1.00000 2.129e+08</div><div>MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00</div><div>MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00</div>
<div>MPI Reductions: 4.230e+02 1.00000</div><div><br></div><div>Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)</div><div> e.g., VecAXPY() for real vectors of length N --> 2N flops</div>
<div> and VecAXPY() for complex vectors of length N --> 8N flops</div><div><br></div><div>Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --</div>
<div> Avg %Total Avg %Total counts %Total Avg %Total counts %Total </div><div> 0: Main Stage: 3.1636e+01 100.0% 2.5615e+09 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 4.220e+02 99.8% </div>
<div><br></div><div>------------------------------------------------------------------------------------------------------------------------</div><div>See the 'Profiling' chapter of the users' manual for details on interpreting output.</div>
<div>Phase summary info:</div><div> Count: number of times phase was executed</div><div> Time and Flops: Max - maximum over all processors</div><div> Ratio - ratio of maximum to minimum over all processors</div>
<div> Mess: number of messages sent</div><div> Avg. len: average message length</div><div> Reduct: number of global reductions</div><div> Global: entire computation</div><div> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().</div>
<div> %T - percent time in this phase %f - percent flops in this phase</div><div> %M - percent messages in this phase %L - percent message lengths in this phase</div><div> %R - percent reductions in this phase</div>
<div> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)</div><div>------------------------------------------------------------------------------------------------------------------------</div>
<div><br></div><div><br></div><div> ##########################################################</div><div> # #</div><div> # WARNING!!! #</div>
<div> # #</div><div> # This code was compiled with a debugging option, #</div><div> # To get timing results run ./configure #</div>
<div> # using --with-debugging=no, the performance will #</div><div> # be generally two or three times faster. #</div><div> # #</div>
<div> ##########################################################</div><div><br></div><div><br></div><div>Event Count Time (sec) Flops --- Global --- --- Stage --- Total</div>
<div> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %f %M %L %R %T %f %M %L %R Mflop/s</div><div>------------------------------------------------------------------------------------------------------------------------</div>
<div><br></div><div>--- Event Stage 0: Main Stage</div><div><br></div><div>ComputeFunction 52 1.0 3.9104e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00 1 0 0 0 1 1 0 0 0 1 0</div><div>VecDot 50 1.0 3.2072e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 3025</div>
<div>VecMDot 50 1.0 1.3100e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 741</div><div>VecNorm 200 1.0 9.7943e-02 1.0 3.88e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 15 0 0 0 0 15 0 0 0 3963</div>
<div>VecScale 100 1.0 1.3496e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 719</div><div>VecCopy 150 1.0 4.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0</div>
<div>VecSet 164 1.0 2.9707e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0</div><div>VecAXPY 50 1.0 3.2194e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 3014</div>
<div>VecWAXPY 50 1.0 2.9040e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 167</div><div>VecMAXPY 100 1.0 5.4555e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 8 0 0 0 2 8 0 0 0 356</div>
<div>VecPointwiseMult 100 1.0 5.3003e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 2 4 0 0 0 2 4 0 0 0 183</div><div>VecScatterBegin 53 1.0 1.8660e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0</div>
<div>VecReduceArith 101 1.0 6.9973e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 8 0 0 0 0 8 0 0 0 2801</div><div>VecReduceComm 51 1.0 1.0252e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0</div>
<div>VecNormalize 100 1.0 1.8565e-01 1.0 2.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 1568</div><div>VecCUSPCopyTo 152 1.0 5.8016e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0</div>
<div>VecCUSPCopyFrom 201 1.0 6.0029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0</div><div>MatMult 100 1.0 6.8465e-01 1.0 1.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00 2 49 0 0 0 2 49 0 0 0 1825</div>
<div>MatAssemblyBegin 3 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0</div><div>MatAssemblyEnd 3 1.0 2.7767e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0</div>
<div>MatZeroEntries 1 1.0 2.0346e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0</div><div>MatCUSPCopyTo 3 1.0 1.4056e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0</div>
<div>SNESSolve 1 1.0 2.2094e+01 1.0 2.56e+09 1.0 0.0e+00 0.0e+00 3.7e+02 70100 0 0 88 70100 0 0 89 116</div><div>SNESFunctionEval 51 1.0 3.9031e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0</div>
<div>SNESJacobianEval 50 1.0 1.3191e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 4 0 0 0 0 4 0 0 0 0 0</div><div>SNESLineSearch 50 1.0 6.2922e+00 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 5.0e+01 20 45 0 0 12 20 45 0 0 12 184</div>
<div>KSPGMRESOrthog 50 1.0 4.0436e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 5.0e+01 1 8 0 0 12 1 8 0 0 12 480</div><div>KSPSetUp 50 1.0 2.1935e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+01 0 0 0 0 4 0 0 0 0 4 0</div>
<div>KSPSolve 50 1.0 1.3230e+01 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 3.2e+02 42 55 0 0 75 42 55 0 0 75 106</div><div>PCSetUp 50 1.0 1.9897e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01 6 0 0 0 12 6 0 0 0 12 0</div>
<div>PCApply 100 1.0 5.7457e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 4.0e+00 2 4 0 0 1 2 4 0 0 1 169</div><div>------------------------------------------------------------------------------------------------------------------------</div>
<div><br></div><div>Memory usage is given in bytes:</div><div><br></div><div>Object Type Creations Destructions Memory Descendants' Mem.</div><div>Reports information only for process 0.</div><div><br>
</div><div>--- Event Stage 0: Main Stage</div><div><br></div><div> Container 2 2 1096 0</div><div> Vector 16 16 108696592 0</div>
<div> Vector Scatter 2 2 1240 0</div>
<div> Matrix 1 1 96326824 0</div><div> Distributed Mesh 3 3 7775936 0</div><div> Bipartite Graph 6 6 4104 0</div><div> Index Set 5 5 3884908 0</div>
<div> IS L to G Mapping 1 1 3881760 0</div><div> SNES 1 1 1268 0</div><div> SNESLineSearch 1 1 840 0</div><div> Viewer 1 0 0 0</div>
<div> Krylov Solver 1 1 18288 0</div><div> Preconditioner 1 1 792 0</div><div>========================================================================================================================</div>
<div>Average time to get PetscTime(): 9.53674e-08</div><div>#PETSc Option Table entries:</div><div>-da_vec_type cusp</div><div>-dm_mat_type seqaijcusp</div><div>-ksp_monitor</div><div>-log_summary</div><div>-pc_type jacobi</div>
<div>-snes_converged_reason</div><div>-snes_monitor</div><div>#End of PETSc Option Table entries</div><div>Compiled without FORTRAN kernels</div><div>Compiled with full precision matrices (default)</div><div>sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4</div>
<div>Configure run at: Fri Nov 16 08:40:52 2012</div><div>Configure options: --with-clanguage=C++ --with-mpi-dir=/usr --with-shared-libraries --with-cuda-arch=sm_20 --CFLAGS=-O0 --CXXFLAGS=-O0 --CUDAFLAGS=-O0 --with-etags=1 --with-mpi4py=0 --with-blas-lapack-lib="[/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_rt.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_intel_thread.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_core.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libiomp5.so]" --download-blacs --download-superlu_dist --download-triangle --download-parmetis --download-metis --download-mumps --download-scalapack --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-dir=/opt/apps/cuda/4.2//cuda --with-sieve=1 --download-exodusii=yes --download-netcdf --with-boost=1 --with-boost-dir=/usr --download-fiat=yes --download-generator --download-scientificpython --with-matlab=1 --with-matlab-engine=1 --with-matlab-dir=/opt/MATLAB/R2011a</div>
<div>-----------------------------------------</div><div>Libraries compiled on Fri Nov 16 08:40:52 2012 on SCRGP2 </div><div>Machine characteristics: Linux-2.6.32-41-server-x86_64-with-debian-squeeze-sid</div><div>Using PETSc directory: /opt/apps/PETSC/petsc-3.3-p4</div>
<div>Using PETSc arch: gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg</div><div>-----------------------------------------</div><div><br></div><div>Using C compiler: /usr/bin/mpicxx -O0 -g -fPIC ${COPTFLAGS} ${CFLAGS}</div><div>Using Fortran compiler: /usr/bin/mpif90 -fPIC -Wall -Wno-unused-variable -g ${FOPTFLAGS} ${FFLAGS} </div>
<div>-----------------------------------------</div><div><br></div><div>Using include paths: -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include -I/opt/apps/PETSC/petsc-3.3-p4/include -I/opt/apps/PETSC/petsc-3.3-p4/include -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include -I/opt/apps/cuda/4.2//cuda/include -I/opt/apps/PETSC/petsc-3.3-p4/include/sieve -I/opt/MATLAB/R2011a/extern/include -I/usr/include -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/cbind/include -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/forbind/include -I/usr/include/mpich2</div>
<div>-----------------------------------------</div><div><br></div><div>Using C linker: /usr/bin/mpicxx</div><div>Using Fortran linker: /usr/bin/mpif90</div><div>Using libraries: -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -lpetsc -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -ltriangle -lX11 -lpthread -lsuperlu_dist_3.1 -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lparmetis -lmetis -lscalapack -lblacs -Wl,-rpath,/opt/apps/cuda/4.2//cuda/lib64 -L/opt/apps/cuda/4.2//cuda/lib64 -lcufft -lcublas -lcudart -lcusparse -Wl,-rpath,/opt/MATLAB/R2011a/sys/os/glnxa64:/opt/MATLAB/R2011a/bin/glnxa64:/opt/MATLAB/R2011a/extern/lib/glnxa64 -L/opt/MATLAB/R2011a/bin/glnxa64 -L/opt/MATLAB/R2011a/extern/lib/glnxa64 -leng -lmex -lmx -lmat -lut -licudata -licui18n -licuuc -Wl,-rpath,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -L/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -lmkl_rt -lmkl_intel_thread -lmkl_core -liomp5 -lexoIIv2for -lexodus -lnetcdf_c++ -lnetcdf -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -lmpichf90 -lgfortran -lm -lm -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lpthread -lrt -lgcc_s -ldl </div>
<div>-----------------------------------------</div><div><br></div></div><br><br><div class="gmail_quote">On Sat, Nov 17, 2012 at 11:02 AM, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div>On Sat, Nov 17, 2012 at 10:50 AM, David Fuentes <<a href="mailto:fuentesdt@gmail.com" target="_blank">fuentesdt@gmail.com</a>> wrote:<br>
> Hi,<br>
><br>
> I'm using petsc 3.3p4<br>
> I'm trying to run a nonlinear SNES solver on GPU with gmres and jacobi PC<br>
> using VECSEQCUSP and MATSEQAIJCUSP datatypes for the rhs and jacobian matrix<br>
> respectively.<br>
> When running top I still see significant CPU utilization (800-900 %CPU)<br>
> during the solve ? possibly from some multithreaded operations ?<br>
><br>
> Is this expected ?<br>
> I was thinking that since I input everything into the solver as a CUSP<br>
> datatype, all linear algebra operations would be on the GPU device from<br>
> there and wasn't expecting to see such CPU utilization during the solve ?<br>
> Do I probably have an error in my code somewhere ?<br>
<br>
</div></div>We cannot answer performance questions without -log_summary<br>
<br>
Matt<br>
<br>
> Thanks,<br>
> David<br>
<br>
<br>
<br>
--<br>
What most experimenters take for granted before they begin their<br>
experiments is infinitely more interesting than any results to which<br>
their experiments lead.<br>
-- Norbert Wiener<br>
</blockquote></div><br>
</blockquote></div>
</div></div></blockquote></div><br></div>