Thanks Karl, Matt,

I thought I created all of the vectors with the CUSP type; I'll double check.
I was trying to find any vectors that I may have accidentally set up without the CUSP type through some interface with the SNES solver.

I'll also double check with the CUDA examples as Karl suggested.
There are 6 Tesla M2070s on this box, but I'm only running on one of them.
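For reference, the kind of check I have in mind is roughly the following (an untested sketch against the PETSc 3.3 API; CheckSNESVecTypes is just a helper name I made up), to flag any SNES work vector whose type is not seqcusp:

  #include <petscsnes.h>

  /* Sketch: report whether the vectors the SNES actually uses are seqcusp;
     anything that is not would presumably do its vector ops on the host. */
  static PetscErrorCode CheckSNESVecTypes(SNES snes)
  {
    Vec            x, r;
    PetscBool      xcusp, rcusp;
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = SNESGetSolution(snes, &x);CHKERRQ(ierr);
    ierr = SNESGetFunction(snes, &r, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);
    ierr = PetscObjectTypeCompare((PetscObject)x, VECSEQCUSP, &xcusp);CHKERRQ(ierr);
    ierr = PetscObjectTypeCompare((PetscObject)r, VECSEQCUSP, &rcusp);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_SELF, "solution vec is cusp: %d, residual vec is cusp: %d\n",
                       (int)xcusp, (int)rcusp);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

I'd call that right after SNESSolve returns; the same PetscObjectTypeCompare check should work for the Jacobian against MATSEQAIJCUSP.
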
On Sat, Nov 17, 2012 at 2:42 PM, Matthew Knepley <knepley@gmail.com> wrote:

On Sat, Nov 17, 2012 at 3:05 PM, David Fuentes <fuentesdt@gmail.com> wrote:
> Thanks Jed.
> I was trying to run it in dbg mode to verify that all significant parts of the
> solver were running on the GPU and not on the CPU by mistake.
> I can't pinpoint what part of the solver is running on the CPU. When I run
> top while running the solver there seems to be ~800% CPU utilization
> that I wasn't expecting. I can't tell if I'm slowing things down by
> transferring between CPU/GPU by accident.

1) I am not sure what you mean by 800%, but it is definitely
legitimate to want to know where you are computing.

2) At least some computation is happening on the GPU. I can tell this
from the Vec/MatCopyToGPU events.

3) Your flop rates are not great. The MatMult is about half what we get
on the Tesla, but you could have another card without good support for
double precision. The vector ops, however, are pretty bad.

4) It looks like half the flops are in MatMult, which is definitely on
the card, and the others are in vector operations. Do you create any
other vectors without the CUSP type?

   Matt

> thanks again,
> df
>
> On Sat, Nov 17, 2012 at 1:49 PM, Jed Brown <jedbrown@mcs.anl.gov> wrote:
>>
>> Please read the large boxed message about debugging mode.
>>
>> (Replying from phone so can't make it 72 point blinking red, sorry.)
>>
>> On Nov 17, 2012 1:41 PM, "David Fuentes" <fuentesdt@gmail.com> wrote:
>>>
>>> thanks Matt,
>>>
>>> My log summary is below.
>>>
>>>
>>> ************************************************************************************************************************
>>> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
>>> ************************************************************************************************************************
>>>
>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>
>>> ./FocusUltraSoundModel on a gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg named SCRGP2 with 1 processor, by fuentes Sat Nov 17 13:35:06 2012
>>> Using Petsc Release Version 3.3.0, Patch 4, Fri Oct 26 10:46:51 CDT 2012
>>>
>>> Max Max/Min Avg Total
>>> Time (sec): 3.164e+01 1.00000 3.164e+01
>>> Objects: 4.100e+01 1.00000 4.100e+01
>>> Flops: 2.561e+09 1.00000 2.561e+09 2.561e+09
>>> Flops/sec: 8.097e+07 1.00000 8.097e+07 8.097e+07
>>> Memory: 2.129e+08 1.00000 2.129e+08
>>> MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
>>> MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
>>> MPI Reductions: 4.230e+02 1.00000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of length N --> 2N flops
>>> and VecAXPY() for complex vectors of length N --> 8N flops
>>>
>>> Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total counts %Total Avg %Total counts %Total
>>> 0: Main Stage: 3.1636e+01 100.0% 2.5615e+09 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 4.220e+02 99.8%
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>> Phase summary info:
>>> Count: number of times phase was executed
>>> Time and Flops: Max - maximum over all processors
>>> Ratio - ratio of maximum to minimum over all processors
>>> Mess: number of messages sent
>>> Avg. len: average message length
>>> Reduct: number of global reductions
>>> Global: entire computation
>>> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>> %T - percent time in this phase %f - percent flops in this phase
>>> %M - percent messages in this phase %L - percent message lengths in this phase
>>> %R - percent reductions in this phase
>>> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>> ##########################################################
>>> #                                                        #
>>> #                       WARNING!!!                       #
>>> #   This code was compiled with a debugging option,      #
>>> #   To get timing results run ./configure                #
>>> #   using --with-debugging=no, the performance will      #
>>> #   be generally two or three times faster.              #
>>> #                                                        #
>>> ##########################################################
>>>
>>>
>>> Event Count Time (sec) Flops --- Global --- --- Stage --- Total
>>> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %f %M %L %R %T %f %M %L %R Mflop/s
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> ComputeFunction 52 1.0 3.9104e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00 1 0 0 0 1 1 0 0 0 1 0
>>> VecDot 50 1.0 3.2072e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 3025
>>> VecMDot 50 1.0 1.3100e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 741
>>> VecNorm 200 1.0 9.7943e-02 1.0 3.88e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 15 0 0 0 0 15 0 0 0 3963
>>> VecScale 100 1.0 1.3496e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 719
>>> VecCopy 150 1.0 4.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
>>> VecSet 164 1.0 2.9707e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> VecAXPY 50 1.0 3.2194e-02 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 3014
>>> VecWAXPY 50 1.0 2.9040e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 167
>>> VecMAXPY 100 1.0 5.4555e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 8 0 0 0 2 8 0 0 0 356
>>> VecPointwiseMult 100 1.0 5.3003e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 0.0e+00 2 4 0 0 0 2 4 0 0 0 183
>>> VecScatterBegin 53 1.0 1.8660e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> VecReduceArith 101 1.0 6.9973e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 8 0 0 0 0 8 0 0 0 2801
>>> VecReduceComm 51 1.0 1.0252e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecNormalize 100 1.0 1.8565e-01 1.0 2.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 1 11 0 0 0 1568
>>> VecCUSPCopyTo 152 1.0 5.8016e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
>>> VecCUSPCopyFrom 201 1.0 6.0029e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
>>> MatMult 100 1.0 6.8465e-01 1.0 1.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00 2 49 0 0 0 2 49 0 0 0 1825
>>> MatAssemblyBegin 3 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatAssemblyEnd 3 1.0 2.7767e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> MatZeroEntries 1 1.0 2.0346e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatCUSPCopyTo 3 1.0 1.4056e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> SNESSolve 1 1.0 2.2094e+01 1.0 2.56e+09 1.0 0.0e+00 0.0e+00 3.7e+02 70100 0 0 88 70100 0 0 89 116
>>> SNESFunctionEval 51 1.0 3.9031e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> SNESJacobianEval 50 1.0 1.3191e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 4 0 0 0 0 4 0 0 0 0 0
>>> SNESLineSearch 50 1.0 6.2922e+00 1.0 1.16e+09 1.0 0.0e+00 0.0e+00 5.0e+01 20 45 0 0 12 20 45 0 0 12 184
>>> KSPGMRESOrthog 50 1.0 4.0436e-01 1.0 1.94e+08 1.0 0.0e+00 0.0e+00 5.0e+01 1 8 0 0 12 1 8 0 0 12 480
>>> KSPSetUp 50 1.0 2.1935e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+01 0 0 0 0 4 0 0 0 0 4 0
>>> KSPSolve 50 1.0 1.3230e+01 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 3.2e+02 42 55 0 0 75 42 55 0 0 75 106
>>> PCSetUp 50 1.0 1.9897e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01 6 0 0 0 12 6 0 0 0 12 0
>>> PCApply 100 1.0 5.7457e-01 1.0 9.70e+07 1.0 0.0e+00 0.0e+00 4.0e+00 2 4 0 0 1 2 4 0 0 1 169
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type Creations Destructions Memory Descendants' Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> Container 2 2 1096 0
>>> Vector 16 16 108696592 0
>>> Vector Scatter 2 2 1240 0
>>> Matrix 1 1 96326824 0
>>> Distributed Mesh 3 3 7775936 0
>>> Bipartite Graph 6 6 4104 0
>>> Index Set 5 5 3884908 0
>>> IS L to G Mapping 1 1 3881760 0
>>> SNES 1 1 1268 0
>>> SNESLineSearch 1 1 840 0
>>> Viewer 1 0 0 0
>>> Krylov Solver 1 1 18288 0
>>> Preconditioner 1 1 792 0
>>>
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 9.53674e-08
>>> #PETSc Option Table entries:
>>> -da_vec_type cusp
>>> -dm_mat_type seqaijcusp
>>> -ksp_monitor
>>> -log_summary
>>> -pc_type jacobi
>>> -snes_converged_reason
>>> -snes_monitor
>>> #End of PETSc Option Table entries
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>> Configure run at: Fri Nov 16 08:40:52 2012
>>> Configure options: --with-clanguage=C++ --with-mpi-dir=/usr --with-shared-libraries --with-cuda-arch=sm_20 --CFLAGS=-O0 --CXXFLAGS=-O0 --CUDAFLAGS=-O0 --with-etags=1 --with-mpi4py=0 --with-blas-lapack-lib="[/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_rt.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_intel_thread.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libmkl_core.so,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib/libiomp5.so]" --download-blacs --download-superlu_dist --download-triangle --download-parmetis --download-metis --download-mumps --download-scalapack --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-dir=/opt/apps/cuda/4.2//cuda --with-sieve=1 --download-exodusii=yes --download-netcdf --with-boost=1 --with-boost-dir=/usr --download-fiat=yes --download-generator --download-scientificpython --with-matlab=1 --with-matlab-engine=1 --with-matlab-dir=/opt/MATLAB/R2011a
>>> -----------------------------------------
>>> Libraries compiled on Fri Nov 16 08:40:52 2012 on SCRGP2
>>> Machine characteristics: Linux-2.6.32-41-server-x86_64-with-debian-squeeze-sid
>>> Using PETSc directory: /opt/apps/PETSC/petsc-3.3-p4
>>> Using PETSc arch: gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg
>>> -----------------------------------------
>>>
>>> Using C compiler: /usr/bin/mpicxx -O0 -g -fPIC ${COPTFLAGS} ${CFLAGS}
>>> Using Fortran compiler: /usr/bin/mpif90 -fPIC -Wall -Wno-unused-variable -g ${FOPTFLAGS} ${FFLAGS}
>>> -----------------------------------------
>>>
>>> Using include paths: -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include -I/opt/apps/PETSC/petsc-3.3-p4/include -I/opt/apps/PETSC/petsc-3.3-p4/include -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/include -I/opt/apps/cuda/4.2//cuda/include -I/opt/apps/PETSC/petsc-3.3-p4/include/sieve -I/opt/MATLAB/R2011a/extern/include -I/usr/include -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/cbind/include -I/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/forbind/include -I/usr/include/mpich2
>>> -----------------------------------------
>>>
>>> Using C linker: /usr/bin/mpicxx
>>> Using Fortran linker: /usr/bin/mpif90
>>> Using libraries: -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -lpetsc -Wl,-rpath,/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -L/opt/apps/PETSC/petsc-3.3-p4/gcc-4.4.3-mpich2-1.2-epd-sm_20-dbg/lib -ltriangle -lX11 -lpthread -lsuperlu_dist_3.1 -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lparmetis -lmetis -lscalapack -lblacs -Wl,-rpath,/opt/apps/cuda/4.2//cuda/lib64 -L/opt/apps/cuda/4.2//cuda/lib64 -lcufft -lcublas -lcudart -lcusparse -Wl,-rpath,/opt/MATLAB/R2011a/sys/os/glnxa64:/opt/MATLAB/R2011a/bin/glnxa64:/opt/MATLAB/R2011a/extern/lib/glnxa64 -L/opt/MATLAB/R2011a/bin/glnxa64 -L/opt/MATLAB/R2011a/extern/lib/glnxa64 -leng -lmex -lmx -lmat -lut -licudata -licui18n -licuuc -Wl,-rpath,/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -L/opt/apps/EPD/epd-7.3-1-rh5-x86_64/lib -lmkl_rt -lmkl_intel_thread -lmkl_core -liomp5 -lexoIIv2for -lexodus -lnetcdf_c++ -lnetcdf -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -lmpichf90 -lgfortran -lm -lm -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lpthread -lrt -lgcc_s -ldl
>>> -----------------------------------------
>>>
>>>
>>>
>>> On Sat, Nov 17, 2012 at 11:02 AM, Matthew Knepley <knepley@gmail.com> wrote:
>>>>
>>>> On Sat, Nov 17, 2012 at 10:50 AM, David Fuentes <fuentesdt@gmail.com> wrote:
>>>> > Hi,
>>>> >
>>>> > I'm using PETSc 3.3-p4.
>>>> > I'm trying to run a nonlinear SNES solver on the GPU with GMRES and a
>>>> > Jacobi PC, using the VECSEQCUSP and MATSEQAIJCUSP datatypes for the rhs
>>>> > and the Jacobian matrix, respectively.
>>>> > When running top I still see significant CPU utilization (800-900 %CPU)
>>>> > during the solve, possibly from some multithreaded operations?
>>>> >
>>>> > Is this expected?
>>>> > I was thinking that since I input everything into the solver as a CUSP
>>>> > datatype, all linear algebra operations would be on the GPU device from
>>>> > there, and I wasn't expecting to see such CPU utilization during the solve.
>>>> > Do I have an error in my code somewhere?
>>>>
>>>> We cannot answer performance questions without -log_summary
>>>>
>>>>    Matt
>>>>
>>>> > Thanks,
>>>> > David
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which
>>>> their experiments lead.
>>>> -- Norbert Wiener
>>>
>>>
>

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener