Thanks Karl, Matt,

I thought I created all of the vectors with the CUSP type; I'll double check.
I was trying to find any vectors that I may have accidentally set up without the CUSP type through some interface with the SNES solver.

I'll also double check with the CUDA examples as Karl suggested.
There are 6 Tesla M2070s on this box, but I'm only running on one of them.
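For reference, the kind of check I have in mind is roughly the following (an untested sketch against the PETSc 3.3 API; CheckSNESVecTypes is just a helper name I made up), to flag any SNES work vector whose type is not seqcusp:

  #include <petscsnes.h>

  /* Sketch: report whether the vectors the SNES actually uses are seqcusp;
     anything that is not would presumably do its vector ops on the host. */
  static PetscErrorCode CheckSNESVecTypes(SNES snes)
  {
    Vec            x, r;
    PetscBool      xcusp, rcusp;
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = SNESGetSolution(snes, &x);CHKERRQ(ierr);
    ierr = SNESGetFunction(snes, &r, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);
    ierr = PetscObjectTypeCompare((PetscObject)x, VECSEQCUSP, &xcusp);CHKERRQ(ierr);
    ierr = PetscObjectTypeCompare((PetscObject)r, VECSEQCUSP, &rcusp);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_SELF, "solution vec is cusp: %d, residual vec is cusp: %d\n",
                       (int)xcusp, (int)rcusp);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

I'd call that right after SNESSolve returns; the same PetscObjectTypeCompare check should work for the Jacobian against MATSEQAIJCUSP.
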
On Sat, Nov 17, 2012 at 2:42 PM, Matthew Knepley <knepley@gmail.com> wrote:

On Sat, Nov 17, 2012 at 3:05 PM, David Fuentes <fuentesdt@gmail.com> wrote:
> Thanks Jed.
> I was trying to run it in dbg mode to verify that all significant parts of the
> solver were running on the GPU and not on the CPU by mistake.
> I can't pinpoint what part of the solver is running on the CPU. When I run
> top while running the solver there seems to be ~800% CPU utilization
> that I wasn't expecting. I can't tell if I'm slowing things down by
> transferring between CPU/GPU by accident.

1) I am not sure what you mean by 800%, but it is definitely
legitimate to want to know where you are computing.

2) At least some computation is happening on the GPU. I can tell this
from the Vec/MatCopyToGPU events.

3) Your flop rates are not great. The MatMult is about half what we get
on the Tesla, but you could have another card without good support for
double precision. The vector ops, however, are pretty bad.

4) It looks like half the flops are in MatMult, which is definitely on
the card, and the others are in vector operations. Do you create any
other vectors without the CUSP type?

   Matt

> thanks again,
> df
>
> On Sat, Nov 17, 2012 at 1:49 PM, Jed Brown <jedbrown@mcs.anl.gov> wrote:
>>
>> Please read the large boxed message about debugging mode.
>>
>> (Replying from phone so can't make it 72 point blinking red, sorry.)
>>
>> On Nov 17, 2012 1:41 PM, "David Fuentes" <fuentesdt@gmail.com> wrote:
>>>
>>> thanks Matt,
>>>
>>> My log summary is below.
>>>
>>>
>>> ************************************************************************************************************************
>>> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
>>> ************************************************************************************************************************
>>>
>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>
>>> ./FocusUltraSoundModel on a gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg named SCRGP2 with 1 processor, by fuentes Sat Nov 17 13:35:06 2012
>>> Using Petsc Release Version 3.3.0, Patch 4, Fri Oct 26 10:46:51 CDT 2012
>>>
>>> Max Max/Min Avg Total
>>> Time (sec): 3.164e+01 1.00000 3.164e+01
>>> Objects: 4.100e+01 1.00000 4.100e+01
>>> Flops: 2.561e+09 1.00000 2.561e+09 2.561e+09
>>> Flops/sec: 8.097e+07 1.00000 8.097e+07 8.097e+07
>>> Memory: 2.129e+08 1.00000 2.129e+08
>>> MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
>>> MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
>>> MPI Reductions: 4.230e+02 1.00000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of length N --> 2N flops
>>> and VecAXPY() for complex vectors of length N --> 8N flops
>>>
>>> Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total counts %Total Avg %Total counts %Total
>>> 0: Main Stage: 3.1636e+01 100.0% 2.5615e+09 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 4.220e+02 99.8%
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>> Phase summary info:
>>> Count: number of times phase was executed
>>> Time and Flops: Max - maximum over all processors
>>> Ratio - ratio of maximum to minimum over all processors
>>> Mess: number of messages sent
>>> Avg. len: average message length
>>> Reduct: number of global reductions
>>> Global: entire computation
>>> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>> %T - percent time in this phase %f - percent flops in this phase
>>> %M - percent messages in this phase %L - percent message lengths in this phase
>>> %R - percent reductions in this phase
>>> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>> ##########################################################
>>> #                                                        #
>>> #                       WARNING!!!                       #
>>> #   This code was compiled with a debugging option,      #
>>> #   To get timing results run ./configure                #
>>> #   using --with-debugging=no, the performance will      #
>>> #   be generally two or three times faster.              #
>>> #                                                        #
>>> ##########################################################
>>>
>>>
>>> Event Count Time (sec) Flops --- Global --- --- Stage --- Total
>>> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %f %M %L %R %T %f %M %L %R Mflop/s
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> ComputeFunction 52 1.0 3.9104e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00 1 0 0 0 1 1 0 0 0 1 0
>>> VecDot 50 1.0 3.2072e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 3025
>>> VecMDot 50 1.0 1.3100e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 741
>>> VecNorm 200 1.0 9.7943e-02 1.0 3.88e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 15 0 0 0 0 15 0 0 0 3963
>>> VecScale 100 1.0 1.3496e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 719
>>> VecCopy 150 1.0 4.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
>>> VecSet 164 1.0 2.9707e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> VecAXPY 50 1.0 3.2194e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 3014
>>> VecWAXPY 50 1.0 2.9040e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 167
>>> VecMAXPY 100 1.0 5.4555e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 8 0 0 0 2 8 0 0 0 356
>>> VecPointwiseMult 100 1.0 5.3003e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 2 4 0 0 0 2 4 0 0 0 183
>>> VecScatterBegin 53 1.0 1.8660e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> VecReduceArith 101 1.0 6.9973e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 8 0 0 0 0 8 0 0 0 2801
>>> VecReduceComm 51 1.0 1.0252e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecNormalize 100 1.0 1.8565e-01 1.0 2.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 1568
>>> VecCUSPCopyTo 152 1.0 5.8016e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
>>> VecCUSPCopyFrom 201 1.0 6.0029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
>>> MatMult 100 1.0 6.8465e-01 1.0 1.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00 2 49 0 0 0 2 49 0 0 0 1825
>>> MatAssemblyBegin 3 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatAssemblyEnd 3 1.0 2.7767e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> MatZeroEntries 1 1.0 2.0346e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatCUSPCopyTo 3 1.0 1.4056e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> SNESSolve 1 1.0 2.2094e+01 1.0 2.56e+09 1.0 0.0e+00 0.0e+00 3.7e+02 70100 0 0 88 70100 0 0 89 116
>>> SNESFunctionEval 51 1.0 3.9031e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> SNESJacobianEval 50 1.0 1.3191e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 4 0 0 0 0 4 0 0 0 0 0
>>> SNESLineSearch 50 1.0 6.2922e+00 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 5.0e+01 20 45 0 0 12 20 45 0 0 12 184
>>> KSPGMRESOrthog 50 1.0 4.0436e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 5.0e+01 1 8 0 0 12 1 8 0 0 12 480
>>> KSPSetUp 50 1.0 2.1935e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+01 0 0 0 0 4 0 0 0 0 4 0
>>> KSPSolve 50 1.0 1.3230e+01 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 3.2e+02 42 55 0 0 75 42 55 0 0 75 106
>>> PCSetUp 50 1.0 1.9897e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01 6 0 0 0 12 6 0 0 0 12 0
>>> PCApply 100 1.0 5.7457e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 4.0e+00 2 4 0 0 1 2 4 0 0 1 169
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type Creations Destructions Memory Descendants' Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> Container 2 2 1096 0
>>> Vector 16 16 108696592 0
>>> Vector Scatter 2 2 1240 0
>>> Matrix 1 1 96326824 0
>>> Distributed Mesh 3 3 7775936 0
>>> Bipartite Graph 6 6 4104 0
>>> Index Set 5 5 3884908 0
>>> IS L to G Mapping 1 1 3881760 0
>>> SNES 1 1 1268 0
>>> SNESLineSearch 1 1 840 0
>>> Viewer 1 0 0 0
>>> Krylov Solver 1 1 18288 0
>>> Preconditioner 1 1 792 0
>>>
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 9.53674e-08
>>> #PETSc Option Table entries:
>>> -da_vec_type cusp
>>> -dm_mat_type seqaijcusp
>>> -ksp_monitor
>>> -log_summary
>>> -pc_type jacobi
>>> -snes_converged_reason
>>> -snes_monitor
>>> #End of PETSc Option Table entries
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>> Configure run at: Fri Nov 16 08:40:52 2012
>>> Configure options: --with-clanguage=C++ --with-mpi-dir=/usr --with-shared-libraries --with-cuda-arch=sm_20 --CFLAGS=-O0 --CXXFLAGS=-O0 --CUDAFLAGS=-O0 --with-etags=1 --with-mpi4py=0 --with-blas-lapack-lib="[/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_rt.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_intel_thread.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_core.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libiomp5.so]" --download-blacs --download-superlu_dist --download-triangle --download-parmetis --download-metis --download-mumps --download-scalapack --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-dir=/opt/apps/cuda/4.2//cuda --with-sieve=1 --download-exodusii=yes --download-netcdf --with-boost=1 --with-boost-dir=/usr --download-fiat=yes --download-generator --download-scientificpython --with-matlab=1 --with-matlab-engine=1 --with-matlab-dir=/opt/MATLAB/R2011a
>>> -----------------------------------------
>>> Libraries compiled on Fri Nov 16 08:40:52 2012 on SCRGP2
>>> Machine characteristics: Linux-2.6.32-41-server-x86_64-with-debian-squeeze-sid
>>> Using PETSc directory: /opt/apps/PETSC/petsc-3.3-p4
>>> Using PETSc arch: gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg
>>> -----------------------------------------
>>>
>>> Using C compiler: /usr/bin/mpicxx -O0 -g -fPIC ${COPTFLAGS} ${CFLAGS}
>>> Using Fortran compiler: /usr/bin/mpif90 -fPIC -Wall -Wno-unused-variable -g ${FOPTFLAGS} ${FFLAGS}
>>> -----------------------------------------
>>>
>>> Using include paths: -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include -I/opt/apps/PETSC/petsc-3.3-p4/include -I/opt/apps/PETSC/petsc-3.3-p4/include -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include -I/opt/apps/cuda/4.2//cuda/include -I/opt/apps/PETSC/petsc-3.3-p4/include/sieve -I/opt/MATLAB/R2011a/extern/include -I/usr/include -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/cbind/include -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/forbind/include -I/usr/include/mpich2
>>> -----------------------------------------
>>>
>>> Using C linker: /usr/bin/mpicxx
>>> Using Fortran linker: /usr/bin/mpif90
>>> Using libraries: -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -lpetsc -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -ltriangle -lX11 -lpthread -lsuperlu_dist_3.1 -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lparmetis -lmetis -lscalapack -lblacs -Wl,-rpath,/opt/apps/cuda/4.2//cuda/lib64 -L/opt/apps/cuda/4.2//cuda/lib64 -lcufft -lcublas -lcudart -lcusparse -Wl,-rpath,/opt/MATLAB/R2011a/sys/os/glnxa64:/opt/MATLAB/R2011a/bin/glnxa64:/opt/MATLAB/R2011a/extern/lib/glnxa64 -L/opt/MATLAB/R2011a/bin/glnxa64 -L/opt/MATLAB/R2011a/extern/lib/glnxa64 -leng -lmex -lmx -lmat -lut -licudata -licui18n -licuuc -Wl,-rpath,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -L/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -lmkl_rt -lmkl_intel_thread -lmkl_core -liomp5 -lexoIIv2for -lexodus -lnetcdf_c++ -lnetcdf -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -lmpichf90 -lgfortran -lm -lm -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lpthread -lrt -lgcc_s -ldl
>>> -----------------------------------------
>>>
>>>
>>>
>>> On Sat, Nov 17, 2012 at 11:02 AM, Matthew Knepley <knepley@gmail.com> wrote:
>>>>
>>>> On Sat, Nov 17, 2012 at 10:50 AM, David Fuentes <fuentesdt@gmail.com> wrote:
>>>> > Hi,
>>>> >
>>>> > I'm using PETSc 3.3-p4.
>>>> > I'm trying to run a nonlinear SNES solver on the GPU with GMRES and a
>>>> > Jacobi PC, using the VECSEQCUSP and MATSEQAIJCUSP datatypes for the rhs
>>>> > and the Jacobian matrix, respectively.
>>>> > When running top I still see significant CPU utilization (800-900 %CPU)
>>>> > during the solve, possibly from some multithreaded operations?
>>>> >
>>>> > Is this expected?
>>>> > I was thinking that since I input everything into the solver as a CUSP
>>>> > datatype, all linear algebra operations would be on the GPU device from
>>>> > there, and I wasn't expecting to see such CPU utilization during the solve.
>>>> > Do I have an error in my code somewhere?
>>>>
>>>> We cannot answer performance questions without -log_summary
>>>>
>>>>    Matt
>>>>
>>>> > Thanks,
>>>> > David
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which
>>>> their experiments lead.
>>>> -- Norbert Wiener
>>>
>>>
>

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener