[petsc-dev] [petsc-maint #87339] Re: ex19 on GPU

Barry Smith bsmith at mcs.anl.gov
Sat Sep 17 22:48:41 CDT 2011


  Run the first one with -da_vec_type seqcusp and -da_mat_type seqaijcusp.
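  As a sketch, the rerun would look something like this (assuming the same ex19 options used below, with only the vector and matrix types switched to their sequential CUSP variants for the single-process run):

```shell
# Hypothetical rerun of the 100x100 case with seq* CUSP types
# (all other options taken from the run quoted below)
./ex19 -da_vec_type seqcusp -da_mat_type seqaijcusp \
       -pc_type none -dmmg_nlevels 1 \
       -da_grid_x 100 -da_grid_y 100 \
       -log_summary -mat_no_inode -preload off \
       -cusp_synchronize
```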

> VecScatterBegin     2097 1.0 1.0270e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   7  0  0  0  0     0
> VecCUSPCopyTo       2140 1.0 2.4991e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0
> VecCUSPCopyFrom     2135 1.0 1.0437e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   7  0  0  0  0     0

   Why is it doing all these vector copies up and down? Since it is run on one process, it shouldn't be doing more than a handful total.

   Barry

On Sep 17, 2011, at 9:56 PM, Shiyuan wrote:

> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 100 -da_grid_y 100 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_show_devices
> CUDA device 0: Tesla M2050
> CUDA device 1: Tesla M2050
> lid velocity = 0.0001, prandtl # = 1, grashof # = 1
> Number of SNES iterations = 2
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
> ************************************************************************************************************************
> 
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> 
> ./ex19 on a gpu00CCT- named gpu00.cct.lsu.edu with 1 processor, by sgu Sat Sep 17 20:34:38 2011
> Using Petsc Development HG revision: 94fea4d40b1fcca2e886a14e7fdb916b8f6fecf3  HG Date: Sat Sep 17 00:48:29 2011 -0500
> 
>                         Max       Max/Min        Avg      Total 
> Time (sec):           1.928e+01      1.00000   1.928e+01
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                9.039e+09      1.00000   9.039e+09  9.039e+09
> Flops/sec:            4.687e+08      1.00000   4.687e+08  4.687e+08
> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Reductions:       0.000e+00      0.00000
> 
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                            e.g., VecAXPY() for real vectors of length N --> 2N flops
>                            and VecAXPY() for complex vectors of length N --> 8N flops
> 
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
> 0:      Main Stage: 4.3905e+00  22.8%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 1:           SetUp: 6.0178e-02   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 2:           Solve: 1.4834e+01  76.9%  9.0389e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>   Count: number of times phase was executed
>   Time and Flops: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>   Mess: number of messages sent
>   Avg. len: average message length
>   Reduct: number of global reductions
>   Global: entire computation
>   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>      %T - percent time in this phase         %F - percent flops in this phase
>      %M - percent messages in this phase     %L - percent message lengths in this phase
>      %R - percent reductions in this phase
>   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
> PetscBarrier           1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> --- Event Stage 1: SetUp
> 
> MatAssemblyBegin       1 1.0 1.1921e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 2.0661e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0
> MatFDColorCreate       1 1.0 1.8455e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  31  0  0  0  0     0
> 
> --- Event Stage 2: Solve
> 
> VecDot                 2 1.0 1.6947e-03 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    94
> VecMDot             2024 1.0 8.6724e+00 1.0 2.54e+09 1.0 0.0e+00 0.0e+00 0.0e+00 45 28  0  0  0  58 28  0  0  0   293
> VecNorm             2096 1.0 1.5712e+00 1.0 3.35e+08 1.0 0.0e+00 0.0e+00 0.0e+00  8  4  0  0  0  11  4  0  0  0   213
> VecScale            2092 1.0 3.7956e-01 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   3  1  0  0  0   220
> VecCopy             2072 1.0 3.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   3  0  0  0  0     0
> VecSet                70 1.0 1.3284e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY              108 1.0 4.7269e-02 1.0 8.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   183
> VecWAXPY              68 1.0 1.2537e-02 1.0 2.72e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   217
> VecMAXPY            2092 1.0 6.4375e-01 1.0 2.71e+09 1.0 0.0e+00 0.0e+00 0.0e+00  3 30  0  0  0   4 30  0  0  0  4203
> VecScatterBegin     2097 1.0 1.0270e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   7  0  0  0  0     0
> VecReduceArith         2 1.0 3.7239e-03 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    43
> VecReduceComm          1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecCUSPCopyTo       2140 1.0 2.4991e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0
> VecCUSPCopyFrom     2135 1.0 1.0437e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   7  0  0  0  0     0
> SNESSolve              1 1.0 1.4807e+01 1.0 9.04e+09 1.0 0.0e+00 0.0e+00 0.0e+00 77100  0  0  0 100100  0  0  0   610
> SNESLineSearch         2 1.0 1.2360e-02 1.0 5.81e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   470
> SNESFunctionEval       3 1.0 2.7061e-03 1.0 2.52e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   931
> SNESJacobianEval       2 1.0 2.4291e-01 1.0 3.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0   158
> KSPGMRESOrthog      2024 1.0 9.2966e+00 1.0 5.09e+09 1.0 0.0e+00 0.0e+00 0.0e+00 48 56  0  0  0  63 56  0  0  0   547
> KSPSetup               2 1.0 6.2943e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               2 1.0 1.4543e+01 1.0 8.99e+09 1.0 0.0e+00 0.0e+00 0.0e+00 75 99  0  0  0  98 99  0  0  0   618
> PCSetUp                2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply             2024 1.0 3.8127e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   3  0  0  0  0     0
> MatMult             2092 1.0 2.8551e+00 1.0 3.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 37  0  0  0  19 37  0  0  0  1163
> MatAssemblyBegin       2 1.0 1.8120e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         2 1.0 3.1030e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatZeroEntries         2 1.0 1.8611e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatFDColorApply        2 1.0 2.4285e-01 1.0 3.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0   158
> MatFDColorFunc        42 1.0 1.2794e-02 1.0 3.53e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2758
> MatCUSPCopyTo          4 1.0 1.6344e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
> 
> Memory usage is given in bytes:
> 
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
> 
> --- Event Stage 0: Main Stage
> 
> 
> --- Event Stage 1: SetUp
> 
>    Distributed Mesh     1              0            0     0
>              Vector    11              3         4424     0
>      Vector Scatter     4              0            0     0
>           Index Set    29              9        46600     0
>   IS L to G Mapping     3              0            0     0
>                SNES     1              0            0     0
>       Krylov Solver     2              1         1064     0
>      Preconditioner     2              1          752     0
>              Matrix     3              0            0     0
>  Matrix FD Coloring     1              0            0     0
> 
> --- Event Stage 2: Solve
> 
>    Distributed Mesh     0              1       204840     0
>              Vector    74             82     13242416     0
>      Vector Scatter     0              4         2448     0
>           Index Set     0             20       174720     0
>   IS L to G Mapping     0              3       161668     0
>                SNES     0              1         1288     0
>       Krylov Solver     0              1        18864     0
>      Preconditioner     0              1          952     0
>              Matrix     0              3     10810468     0
>  Matrix FD Coloring     0              1      6510068     0
>              Viewer     1              0            0     0
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> #PETSc Option Table entries:
> -cuda_show_devices
> -cusp_synchronize
> -da_grid_x 100
> -da_grid_y 100
> -da_mat_type mpiaijcusp
> -da_vec_type mpicusp
> -dmmg_nlevels 1
> -log_summary
> -mat_no_inode
> -pc_type none
> -preload off
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
> Configure run at: Sat Sep 17 11:25:49 2011
> Configure options: PETSC_DIR=/home/sgu/softwares/petsc-dev PETSC_ARCH=gpu00CCT-cxx-nompi-release -with-clanguage=cxx --with-mpi=0 --download-f2cblaslapack=1 --download-f-blas-lapack=1 --with-debugging=0 --with-c2html=0 --with-valgrind-dir=~/softwares/valgrind --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-arch=sm_20
> -----------------------------------------
> Libraries compiled on Sat Sep 17 11:25:49 2011 on gpu00.cct.lsu.edu 
> Machine characteristics: Linux-2.6.32-131.6.1.el6.x86_64-x86_64-with-redhat-6.1-Santiago
> Using PETSc directory: /home/sgu/softwares/petsc-dev
> Using PETSc arch: gpu00CCT-cxx-nompi-release
> -----------------------------------------
> 
> Using C compiler: g++  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O     ${COPTFLAGS} ${CFLAGS}
> -----------------------------------------
> 
> Using include paths: -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/valgrind/include -I/usr/local/cuda/include -I/home/sgu/softwares/petsc-dev/include/mpiuni
> -----------------------------------------
> 
> Using C linker: g++
> Using libraries: -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lpetsc -lX11 -lpthread -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lf2clapack -lf2cblas -lm -lm -lstdc++ -ldl 
> -----------------------------------------
> 
> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 200 -da_grid_y 200 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0
> lid velocity = 2.5e-05, prandtl # = 1, grashof # = 1
> Number of SNES iterations = 2
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
> ************************************************************************************************************************
> 
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> 
> ./ex19 on a gpu00CCT- named gpu00.cct.lsu.edu with 1 processor, by sgu Sat Sep 17 20:36:14 2011
> Using Petsc Development HG revision: 94fea4d40b1fcca2e886a14e7fdb916b8f6fecf3  HG Date: Sat Sep 17 00:48:29 2011 -0500
> 
>                         Max       Max/Min        Avg      Total 
> Time (sec):           5.042e+01      1.00000   5.042e+01
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                8.283e+10      1.00000   8.283e+10  8.283e+10
> Flops/sec:            1.643e+09      1.00000   1.643e+09  1.643e+09
> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Reductions:       0.000e+00      0.00000
> 
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                            e.g., VecAXPY() for real vectors of length N --> 2N flops
>                            and VecAXPY() for complex vectors of length N --> 8N flops
> 
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
> 0:      Main Stage: 4.6509e+00   9.2%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 1:           SetUp: 2.5148e-01   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 2:           Solve: 4.5517e+01  90.3%  8.2826e+10 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>   Count: number of times phase was executed
>   Time and Flops: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>   Mess: number of messages sent
>   Avg. len: average message length
>   Reduct: number of global reductions
>   Global: entire computation
>   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>      %T - percent time in this phase         %F - percent flops in this phase
>      %M - percent messages in this phase     %L - percent message lengths in this phase
>      %R - percent reductions in this phase
>   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
> PetscBarrier           1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> --- Event Stage 1: SetUp
> 
> MatAssemblyBegin       1 1.0 1.4067e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 8.0690e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0
> MatFDColorCreate       1 1.0 7.4871e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  30  0  0  0  0     0
> 
> --- Event Stage 2: Solve
> 
> VecDot                 2 1.0 1.6088e-03 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   398
> VecMDot             4637 1.0 2.1155e+01 1.0 2.34e+10 1.0 0.0e+00 0.0e+00 0.0e+00 42 28  0  0  0  46 28  0  0  0  1104
> VecNorm             4796 1.0 3.7077e+00 1.0 3.07e+09 1.0 0.0e+00 0.0e+00 0.0e+00  7  4  0  0  0   8  4  0  0  0   828
> VecScale            4792 1.0 9.7300e-01 1.0 7.67e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0   788
> VecCopy             4685 1.0 9.9265e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecSet               157 1.0 3.0819e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY              195 1.0 9.1851e-02 1.0 6.24e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   679
> VecWAXPY             155 1.0 3.3326e-02 1.0 2.48e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   744
> VecMAXPY            4792 1.0 2.6158e+00 1.0 2.48e+10 1.0 0.0e+00 0.0e+00 0.0e+00  5 30  0  0  0   6 30  0  0  0  9498
> VecScatterBegin     4797 1.0 4.9713e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  11  0  0  0  0     0
> VecReduceArith         2 1.0 5.0960e-03 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   126
> VecReduceComm          1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecCUSPCopyTo       4840 1.0 6.1929e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecCUSPCopyFrom     4835 1.0 5.0045e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  11  0  0  0  0     0
> SNESSolve              1 1.0 4.5474e+01 1.0 8.28e+10 1.0 0.0e+00 0.0e+00 0.0e+00 90100  0  0  0 100100  0  0  0  1821
> SNESLineSearch         2 1.0 2.3559e-02 1.0 2.33e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   989
> SNESFunctionEval       3 1.0 8.9130e-03 1.0 1.01e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1131
> SNESJacobianEval       2 1.0 9.7259e-01 1.0 1.54e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   158
> KSPGMRESOrthog      4637 1.0 2.3658e+01 1.0 4.67e+10 1.0 0.0e+00 0.0e+00 0.0e+00 47 56  0  0  0  52 56  0  0  0  1975
> KSPSetup               2 1.0 6.1035e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               2 1.0 4.4465e+01 1.0 8.26e+10 1.0 0.0e+00 0.0e+00 0.0e+00 88100  0  0  0  98100  0  0  0  1859
> PCSetUp                2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply             4637 1.0 9.8032e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> MatMult             4792 1.0 1.4925e+01 1.0 3.05e+10 1.0 0.0e+00 0.0e+00 0.0e+00 30 37  0  0  0  33 37  0  0  0  2047
> MatAssemblyBegin       2 1.0 2.0027e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         2 1.0 1.2705e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatZeroEntries         2 1.0 7.4351e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatFDColorApply        2 1.0 9.7253e-01 1.0 1.54e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   158
> MatFDColorFunc        42 1.0 5.1462e-02 1.0 1.41e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2742
> MatCUSPCopyTo          4 1.0 4.9795e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
> 
> Memory usage is given in bytes:
> 
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
> 
> --- Event Stage 0: Main Stage
> 
> 
> --- Event Stage 1: SetUp
> 
>    Distributed Mesh     1              0            0     0
>              Vector    11              3         4424     0
>      Vector Scatter     4              0            0     0
>           Index Set    29              9       166600     0
>   IS L to G Mapping     3              0            0     0
>                SNES     1              0            0     0
>       Krylov Solver     2              1         1064     0
>      Preconditioner     2              1          752     0
>              Matrix     3              0            0     0
>  Matrix FD Coloring     1              0            0     0
> 
> --- Event Stage 2: Solve
> 
>    Distributed Mesh     0              1       804840     0
>              Vector    74             82     52602416     0
>      Vector Scatter     0              4         2448     0
>           Index Set     0             20       654720     0
>   IS L to G Mapping     0              3       641668     0
>                SNES     0              1         1288     0
>       Krylov Solver     0              1        18864     0
>      Preconditioner     0              1          952     0
>              Matrix     0              3     43373668     0
>  Matrix FD Coloring     0              1     26138868     0
>              Viewer     1              0            0     0
> ========================================================================================================================
> Average time to get PetscTime(): 0
> #PETSc Option Table entries:
> -cuda_set_device 0
> -cusp_synchronize
> -da_grid_x 200
> -da_grid_y 200
> -da_mat_type mpiaijcusp
> -da_vec_type mpicusp
> -dmmg_nlevels 1
> -log_summary
> -mat_no_inode
> -pc_type none
> -preload off
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
> Configure run at: Sat Sep 17 11:25:49 2011
> Configure options: PETSC_DIR=/home/sgu/softwares/petsc-dev PETSC_ARCH=gpu00CCT-cxx-nompi-release -with-clanguage=cxx --with-mpi=0 --download-f2cblaslapack=1 --download-f-blas-lapack=1 --with-debugging=0 --with-c2html=0 --with-valgrind-dir=~/softwares/valgrind --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-arch=sm_20
> -----------------------------------------
> Libraries compiled on Sat Sep 17 11:25:49 2011 on gpu00.cct.lsu.edu 
> Machine characteristics: Linux-2.6.32-131.6.1.el6.x86_64-x86_64-with-redhat-6.1-Santiago
> Using PETSc directory: /home/sgu/softwares/petsc-dev
> Using PETSc arch: gpu00CCT-cxx-nompi-release
> -----------------------------------------
> 
> Using C compiler: g++  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O     ${COPTFLAGS} ${CFLAGS}
> -----------------------------------------
> 
> Using include paths: -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/valgrind/include -I/usr/local/cuda/include -I/home/sgu/softwares/petsc-dev/include/mpiuni
> -----------------------------------------
> 
> Using C linker: g++
> Using libraries: -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lpetsc -lX11 -lpthread -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lf2clapack -lf2cblas -lm -lm -lstdc++ -ldl 
> -----------------------------------------
> 
> 
> 
> 
> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 300 -da_grid_y 300 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0
> 
> lid velocity = 1.11111e-05, prandtl # = 1, grashof # = 1
> Number of SNES iterations = 2
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
> ************************************************************************************************************************
> 
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> 
> ./ex19 on a gpu00CCT- named gpu00.cct.lsu.edu with 1 processor, by sgu Sat Sep 17 20:38:29 2011
> Using Petsc Development HG revision: 94fea4d40b1fcca2e886a14e7fdb916b8f6fecf3  HG Date: Sat Sep 17 00:48:29 2011 -0500
> 
>                         Max       Max/Min        Avg      Total 
> Time (sec):           1.095e+02      1.00000   1.095e+02
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                3.136e+11      1.00000   3.136e+11  3.136e+11
> Flops/sec:            2.865e+09      1.00000   2.865e+09  2.865e+09
> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Reductions:       0.000e+00      0.00000
> 
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                            e.g., VecAXPY() for real vectors of length N --> 2N flops
>                            and VecAXPY() for complex vectors of length N --> 8N flops
> 
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
> 0:      Main Stage: 4.4090e+00   4.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 1:           SetUp: 5.6010e-01   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 2:           Solve: 1.0449e+02  95.5%  3.1360e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>   Count: number of times phase was executed
>   Time and Flops: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>   Mess: number of messages sent
>   Avg. len: average message length
>   Reduct: number of global reductions
>   Global: entire computation
>   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>      %T - percent time in this phase         %F - percent flops in this phase
>      %M - percent messages in this phase     %L - percent message lengths in this phase
>      %R - percent reductions in this phase
>   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
> PetscBarrier           1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> --- Event Stage 1: SetUp
> 
> MatAssemblyBegin       1 1.0 1.4067e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 1.7501e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0
> MatFDColorCreate       1 1.0 1.6907e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  30  0  0  0  0     0
> 
> --- Event Stage 2: Solve
> 
> VecDot                 2 1.0 1.6890e-03 1.0 1.44e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   853
> VecMDot             7803 1.0 3.8534e+01 1.0 8.85e+10 1.0 0.0e+00 0.0e+00 0.0e+00 35 28  0  0  0  37 28  0  0  0  2297
> VecNorm             8068 1.0 6.5087e+00 1.0 1.16e+10 1.0 0.0e+00 0.0e+00 0.0e+00  6  4  0  0  0   6  4  0  0  0  1785
> VecScale            8064 1.0 1.8853e+00 1.0 2.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  1540
> VecCopy             7851 1.0 1.9321e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecSet               263 1.0 5.4441e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY              301 1.0 1.5158e-01 1.0 2.17e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1430
> VecWAXPY             261 1.0 6.9037e-02 1.0 9.40e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1361
> VecMAXPY            8064 1.0 7.6110e+00 1.0 9.41e+10 1.0 0.0e+00 0.0e+00 0.0e+00  7 30  0  0  0   7 30  0  0  0 12366
> VecScatterBegin     8069 1.0 1.2707e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 12  0  0  0  0  12  0  0  0  0     0
> VecReduceArith         2 1.0 6.5138e-03 1.0 1.44e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   221
> VecReduceComm          1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecCUSPCopyTo       8112 1.0 1.0913e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecCUSPCopyFrom     8107 1.0 1.2753e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 12  0  0  0  0  12  0  0  0  0     0
> SNESSolve              1 1.0 1.0444e+02 1.0 3.14e+11 1.0 0.0e+00 0.0e+00 0.0e+00 95100  0  0  0 100100  0  0  0  3003
> SNESLineSearch         2 1.0 3.9190e-02 1.0 5.25e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1339
> SNESFunctionEval       3 1.0 1.7656e-02 1.0 2.27e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1285
> SNESJacobianEval       2 1.0 2.0955e+00 1.0 3.46e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   165
> KSPGMRESOrthog      7803 1.0 4.5761e+01 1.0 1.77e+11 1.0 0.0e+00 0.0e+00 0.0e+00 42 56  0  0  0  44 56  0  0  0  3868
> KSPSetup               2 1.0 4.4107e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               2 1.0 1.0228e+02 1.0 3.13e+11 1.0 0.0e+00 0.0e+00 0.0e+00 93100  0  0  0  98100  0  0  0  3062
> PCSetUp                2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply             7803 1.0 1.9026e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> MatMult             8064 1.0 4.5629e+01 1.0 1.16e+11 1.0 0.0e+00 0.0e+00 0.0e+00 42 37  0  0  0  44 37  0  0  0  2538
> MatAssemblyBegin       2 1.0 2.0981e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         2 1.0 2.8598e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatZeroEntries         2 1.0 1.9902e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatFDColorApply        2 1.0 2.0955e+00 1.0 3.46e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   165
> MatFDColorFunc        42 1.0 1.1288e-01 1.0 3.18e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2813
> MatCUSPCopyTo          4 1.0 8.1736e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
> 
> Memory usage is given in bytes:
> 
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
> 
> --- Event Stage 0: Main Stage
> 
> 
> --- Event Stage 1: SetUp
> 
>    Distributed Mesh     1              0            0     0
>              Vector    11              3         4424     0
>      Vector Scatter     4              0            0     0
>           Index Set    29              9       366600     0
>   IS L to G Mapping     3              0            0     0
>                SNES     1              0            0     0
>       Krylov Solver     2              1         1064     0
>      Preconditioner     2              1          752     0
>              Matrix     3              0            0     0
>  Matrix FD Coloring     1              0            0     0
> 
> --- Event Stage 2: Solve
> 
>    Distributed Mesh     0              1      1804840     0
>              Vector    74             82    118202416     0
>      Vector Scatter     0              4         2448     0
>           Index Set     0             20      1454720     0
>   IS L to G Mapping     0              3      1441668     0
>                SNES     0              1         1288     0
>       Krylov Solver     0              1        18864     0
>      Preconditioner     0              1          952     0
>              Matrix     0              3     97696868     0
>  Matrix FD Coloring     0              1     58887668     0
>              Viewer     1              0            0     0
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> #PETSc Option Table entries:
> -cuda_set_device 0
> -cusp_synchronize
> -da_grid_x 300
> -da_grid_y 300
> -da_mat_type mpiaijcusp
> -da_vec_type mpicusp
> -dmmg_nlevels 1
> -log_summary
> -mat_no_inode
> -pc_type none
> -preload off
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
> Configure run at: Sat Sep 17 11:25:49 2011
> Configure options: PETSC_DIR=/home/sgu/softwares/petsc-dev PETSC_ARCH=gpu00CCT-cxx-nompi-release -with-clanguage=cxx --with-mpi=0 --download-f2cblaslapack=1 --download-f-blas-lapack=1 --with-debugging=0 --with-c2html=0 --with-valgrind-dir=~/softwares/valgrind --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-arch=sm_20
> -----------------------------------------
> Libraries compiled on Sat Sep 17 11:25:49 2011 on gpu00.cct.lsu.edu 
> Machine characteristics: Linux-2.6.32-131.6.1.el6.x86_64-x86_64-with-redhat-6.1-Santiago
> Using PETSc directory: /home/sgu/softwares/petsc-dev
> Using PETSc arch: gpu00CCT-cxx-nompi-release
> -----------------------------------------
> 
> Using C compiler: g++  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O     ${COPTFLAGS} ${CFLAGS}
> -----------------------------------------
> 
> Using include paths: -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/valgrind/include -I/usr/local/cuda/include -I/home/sgu/softwares/petsc-dev/include/mpiuni
> -----------------------------------------
> 
> Using C linker: g++
> Using libraries: -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lpetsc -lX11 -lpthread -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lf2clapack -lf2cblas -lm -lm -lstdc++ -ldl 
> -----------------------------------------
> 
> 
> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 400 -da_grid_y 400 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0
> 
> lid velocity = 6.25e-06, prandtl # = 1, grashof # = 1
> Number of SNES iterations = 2
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
> ************************************************************************************************************************
> 
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> 
> ./ex19 on a gpu00CCT- named gpu00.cct.lsu.edu with 1 processor, by sgu Sat Sep 17 20:42:05 2011
> Using Petsc Development HG revision: 94fea4d40b1fcca2e886a14e7fdb916b8f6fecf3  HG Date: Sat Sep 17 00:48:29 2011 -0500
> 
>                         Max       Max/Min        Avg      Total 
> Time (sec):           1.909e+02      1.00000   1.909e+02
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                7.167e+11      1.00000   7.167e+11  7.167e+11
> Flops/sec:            3.753e+09      1.00000   3.753e+09  3.753e+09
> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Reductions:       0.000e+00      0.00000
> 
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                            e.g., VecAXPY() for real vectors of length N --> 2N flops
>                            and VecAXPY() for complex vectors of length N --> 8N flops
> 
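As a concrete reading of the flop-counting convention quoted above, the VecAXPY charge can be written out as a tiny helper (a sketch only; the function name is invented and this is not PETSc's internal accounting code):

```python
def vec_axpy_flops(n, complex_scalars=False):
    """Flops charged for VecAXPY (y = alpha*x + y) on length-n vectors.

    Per the convention above: one multiply and one add per entry for
    real scalars (2N flops); a complex multiply-add is counted as 8N.
    """
    return (8 if complex_scalars else 2) * n
```

So a real VecAXPY on a vector of length 100 is charged 200 flops, and the complex case 800.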
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
> 0:      Main Stage: 4.4291e+00   2.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 1:           SetUp: 1.0122e+00   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 2:           Solve: 1.8551e+02  97.2%  7.1669e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
> 
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>   Count: number of times phase was executed
>   Time and Flops: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>   Mess: number of messages sent
>   Avg. len: average message length
>   Reduct: number of global reductions
>   Global: entire computation
>   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>      %T - percent time in this phase         %F - percent flops in this phase
>      %M - percent messages in this phase     %L - percent message lengths in this phase
>      %R - percent reductions in this phase
>   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
> PetscBarrier           1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> --- Event Stage 1: SetUp
> 
> MatAssemblyBegin       1 1.0 1.5974e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 3.1045e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0
> MatFDColorCreate       1 1.0 3.1857e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  31  0  0  0  0     0
> 
> --- Event Stage 2: Solve
> 
> VecDot                 2 1.0 1.8530e-03 1.0 2.56e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1382
> VecMDot            10031 1.0 5.4102e+01 1.0 2.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 28 28  0  0  0  29 28  0  0  0  3739
> VecNorm            10370 1.0 8.5987e+00 1.0 2.65e+10 1.0 0.0e+00 0.0e+00 0.0e+00  5  4  0  0  0   5  4  0  0  0  3087
> VecScale           10366 1.0 2.9179e+00 1.0 6.63e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  2274
> VecCopy            10079 1.0 2.9971e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecSet               337 1.0 7.6832e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY              375 1.0 2.3210e-01 1.0 4.80e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2068
> VecWAXPY             335 1.0 1.1250e-01 1.0 2.14e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1906
> VecMAXPY           10366 1.0 1.5716e+01 1.0 2.15e+11 1.0 0.0e+00 0.0e+00 0.0e+00  8 30  0  0  0   8 30  0  0  0 13687
> VecScatterBegin    10371 1.0 2.5508e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 13  0  0  0  0  14  0  0  0  0     0
> VecReduceArith         2 1.0 8.3668e-03 1.0 2.56e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   306
> VecReduceComm          1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecCUSPCopyTo      10414 1.0 1.5341e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecCUSPCopyFrom    10409 1.0 2.5585e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 13  0  0  0  0  14  0  0  0  0     0
> SNESSolve              1 1.0 1.8546e+02 1.0 7.17e+11 1.0 0.0e+00 0.0e+00 0.0e+00 97100  0  0  0 100100  0  0  0  3864
> SNESLineSearch         2 1.0 6.2440e-02 1.0 9.33e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1495
> SNESFunctionEval       3 1.0 3.0468e-02 1.0 4.03e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1323
> SNESJacobianEval       2 1.0 3.7313e+00 1.0 6.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   165
> KSPGMRESOrthog     10031 1.0 6.8969e+01 1.0 4.05e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 56  0  0  0  37 56  0  0  0  5865
> KSPSetup               2 1.0 4.6015e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               2 1.0 1.8163e+02 1.0 7.16e+11 1.0 0.0e+00 0.0e+00 0.0e+00 95100  0  0  0  98100  0  0  0  3942
> PCSetUp                2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply            10031 1.0 2.9429e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> MatMult            10366 1.0 9.7032e+01 1.0 2.65e+11 1.0 0.0e+00 0.0e+00 0.0e+00 51 37  0  0  0  52 37  0  0  0  2729
> MatAssemblyBegin       2 1.0 2.1935e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         2 1.0 5.1729e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatZeroEntries         2 1.0 3.1707e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatFDColorApply        2 1.0 3.7312e+00 1.0 6.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   165
> MatFDColorFunc        42 1.0 2.3831e-01 1.0 5.64e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2369
> MatCUSPCopyTo          4 1.0 1.3754e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
> 
> Memory usage is given in bytes:
> 
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
> 
> --- Event Stage 0: Main Stage
> 
> 
> --- Event Stage 1: SetUp
> 
>    Distributed Mesh     1              0            0     0
>              Vector    11              3         4424     0
>      Vector Scatter     4              0            0     0
>           Index Set    29              9       646600     0
>   IS L to G Mapping     3              0            0     0
>                SNES     1              0            0     0
>       Krylov Solver     2              1         1064     0
>      Preconditioner     2              1          752     0
>              Matrix     3              0            0     0
>  Matrix FD Coloring     1              0            0     0
> 
> --- Event Stage 2: Solve
> 
>    Distributed Mesh     0              1      3204840     0
>              Vector    74             82    210042416     0
>      Vector Scatter     0              4         2448     0
>           Index Set     0             20      2574720     0
>   IS L to G Mapping     0              3      2561668     0
>                SNES     0              1         1288     0
>       Krylov Solver     0              1        18864     0
>      Preconditioner     0              1          952     0
>              Matrix     0              3    173780068     0
>  Matrix FD Coloring     0              1    104756468     0
>              Viewer     1              0            0     0
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> #PETSc Option Table entries:
> -cuda_set_device 0
> -cusp_synchronize
> -da_grid_x 400
> -da_grid_y 400
> -da_mat_type mpiaijcusp
> -da_vec_type mpicusp
> -dmmg_nlevels 1
> -log_summary
> -mat_no_inode
> -pc_type none
> -preload off
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
> Configure run at: Sat Sep 17 11:25:49 2011
> Configure options: PETSC_DIR=/home/sgu/softwares/petsc-dev PETSC_ARCH=gpu00CCT-cxx-nompi-release -with-clanguage=cxx --with-mpi=0 --download-f2cblaslapack=1 --download-f-blas-lapack=1 --with-debugging=0 --with-c2html=0 --with-valgrind-dir=~/softwares/valgrind --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-arch=sm_20
> -----------------------------------------
> Libraries compiled on Sat Sep 17 11:25:49 2011 on gpu00.cct.lsu.edu 
> Machine characteristics: Linux-2.6.32-131.6.1.el6.x86_64-x86_64-with-redhat-6.1-Santiago
> Using PETSc directory: /home/sgu/softwares/petsc-dev
> Using PETSc arch: gpu00CCT-cxx-nompi-release
> -----------------------------------------
> 
> Using C compiler: g++  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O     ${COPTFLAGS} ${CFLAGS}
> -----------------------------------------
> 
> Using include paths: -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/valgrind/include -I/usr/local/cuda/include -I/home/sgu/softwares/petsc-dev/include/mpiuni
> -----------------------------------------
> 
> Using C linker: g++
> Using libraries: -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lpetsc -lX11 -lpthread -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lf2clapack -lf2cblas -lm -lm -lstdc++ -ldl 
> -----------------------------------------
> 
> 
> On Sat, Sep 17, 2011 at 4:14 PM, Matthew Knepley <petsc-maint at mcs.anl.gov> wrote:
> 
>> On Sat, Sep 17, 2011 at 3:26 PM, Shiyuan <gshy2014 at gmail.com> wrote:
>> 
>>> I configured petsc-dev with --with-cuda-arch=sm_20 and rebuilt, but it doesn't help. The performance is essentially the same. The machine has two Tesla M2050s, with CUDA driver 4.0 and CUSP 2.0, and I use -cuda_set_device to choose one. Any clues about what's going wrong? configure.log is attached.
>>> 
>> 
>> Can you show me the output of -cuda_show_devices?
>> 
> CUDA device 0: Tesla M2050
> CUDA device 1: Tesla M2050
> 
>> 
>> 
>> In order to investigate further, please run with
>> 
>>  -da_vec_type mpicusp -da_mat_type mpiaijcusp
>> 
>> and then a series of sizes
>> 
>>  -da_grid_x {100,200,300,400} -da_grid_y {100,200,300,400}
>> 
>> and send the log summaries.
>> 
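The requested sweep over grid sizes can be generated mechanically; a sketch in Python that just builds the command lines (option values copied from the runs in this thread, not executed here):

```python
def ex19_command(n):
    """Build the ex19 command line for an n x n grid, matching the
    options used in the runs quoted in this thread."""
    opts = [
        "./ex19",
        "-da_vec_type", "mpicusp",
        "-da_mat_type", "mpiaijcusp",
        "-pc_type", "none",
        "-dmmg_nlevels", "1",
        "-da_grid_x", str(n),
        "-da_grid_y", str(n),
        "-log_summary",
        "-mat_no_inode",
        "-preload", "off",
        "-cusp_synchronize",
        "-cuda_set_device", "0",
    ]
    return " ".join(opts)

# Print one command per requested size.
for n in (100, 200, 300, 400):
    print(ex19_command(n))
```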
> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 100 -da_grid_y 100 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_show_devices
> CUDA device 0: Tesla M2050
> CUDA device 1: Tesla M2050
> 
> Time (sec):           1.928e+01      1.00000   1.928e+01
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                9.039e+09      1.00000   9.039e+09  9.039e+09
> Flops/sec:            4.687e+08      1.00000   4.687e+08  4.687e+08
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 4.3905e+00  22.8%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  1:           SetUp: 6.0178e-02   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  2:           Solve: 1.4834e+01  76.9%  9.0389e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> 
> ------------------------------------------------------------------------------------------------------------------------
> VecMDot             2024 1.0 8.6724e+00 1.0 2.54e+09 1.0 0.0e+00 0.0e+00 0.0e+00 45 28  0  0  0  58 28  0  0  0   293
> VecNorm             2096 1.0 1.5712e+00 1.0 3.35e+08 1.0 0.0e+00 0.0e+00 0.0e+00  8  4  0  0  0  11  4  0  0  0   213
> VecCUSPCopyTo       2140 1.0 2.4991e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0
> VecCUSPCopyFrom     2135 1.0 1.0437e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   7  0  0  0  0     0
> KSPSolve               2 1.0 1.4543e+01 1.0 8.99e+09 1.0 0.0e+00 0.0e+00 0.0e+00 75 99  0  0  0  98 99  0  0  0   618
> MatMult             2092 1.0 2.8551e+00 1.0 3.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 37  0  0  0  19 37  0  0  0  1163
> MatCUSPCopyTo          4 1.0 1.6344e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> 
> 
> 
> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 200 -da_grid_y 200 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0
> 
> Time (sec):           5.042e+01      1.00000   5.042e+01
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                8.283e+10      1.00000   8.283e+10  8.283e+10
> Flops/sec:            1.643e+09      1.00000   1.643e+09  1.643e+09
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 4.6509e+00   9.2%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  1:           SetUp: 2.5148e-01   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  2:           Solve: 4.5517e+01  90.3%  8.2826e+10 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> 
> ------------------------------------------------------------------------------------------------------------------------
> VecMDot             4637 1.0 2.1155e+01 1.0 2.34e+10 1.0 0.0e+00 0.0e+00 0.0e+00 42 28  0  0  0  46 28  0  0  0  1104
> VecNorm             4796 1.0 3.7077e+00 1.0 3.07e+09 1.0 0.0e+00 0.0e+00 0.0e+00  7  4  0  0  0   8  4  0  0  0   828
> VecCUSPCopyTo       4840 1.0 6.1929e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecCUSPCopyFrom     4835 1.0 5.0045e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  11  0  0  0  0     0
> KSPSolve               2 1.0 4.4465e+01 1.0 8.26e+10 1.0 0.0e+00 0.0e+00 0.0e+00 88100  0  0  0  98100  0  0  0  1859
> MatMult             4792 1.0 1.4925e+01 1.0 3.05e+10 1.0 0.0e+00 0.0e+00 0.0e+00 30 37  0  0  0  33 37  0  0  0  2047
> MatCUSPCopyTo          4 1.0 4.9795e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> 
> 
> 
> 
> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 300 -da_grid_y 300 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0 >> ex19p.txt
> Time (sec):           1.095e+02      1.00000   1.095e+02
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                3.136e+11      1.00000   3.136e+11  3.136e+11
> Flops/sec:            2.865e+09      1.00000   2.865e+09  2.865e+09
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 4.4090e+00   4.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  1:           SetUp: 5.6010e-01   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  2:           Solve: 1.0449e+02  95.5%  3.1360e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> 
> ------------------------------------------------------------------------------------------------------------------------
> VecMDot             7803 1.0 3.8534e+01 1.0 8.85e+10 1.0 0.0e+00 0.0e+00 0.0e+00 35 28  0  0  0  37 28  0  0  0  2297
> VecNorm             8068 1.0 6.5087e+00 1.0 1.16e+10 1.0 0.0e+00 0.0e+00 0.0e+00  6  4  0  0  0   6  4  0  0  0  1785
> VecCUSPCopyTo       8112 1.0 1.0913e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecCUSPCopyFrom     8107 1.0 1.2753e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 12  0  0  0  0  12  0  0  0  0     0
> KSPSolve               2 1.0 1.0228e+02 1.0 3.13e+11 1.0 0.0e+00 0.0e+00 0.0e+00 93100  0  0  0  98100  0  0  0  3062
> MatMult             8064 1.0 4.5629e+01 1.0 1.16e+11 1.0 0.0e+00 0.0e+00 0.0e+00 42 37  0  0  0  44 37  0  0  0  2538
> MatCUSPCopyTo          4 1.0 8.1736e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> 
> 
> 
> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 400 -da_grid_y 400 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0 >> ex19p.txt
> 
> Time (sec):           1.909e+02      1.00000   1.909e+02
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                7.167e+11      1.00000   7.167e+11  7.167e+11
> Flops/sec:            3.753e+09      1.00000   3.753e+09  3.753e+09
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 4.4291e+00   2.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  1:           SetUp: 1.0122e+00   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  2:           Solve: 1.8551e+02  97.2%  7.1669e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> 
> ------------------------------------------------------------------------------------------------------------------------
> VecMDot            10031 1.0 5.4102e+01 1.0 2.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 28 28  0  0  0  29 28  0  0  0  3739
> VecNorm            10370 1.0 8.5987e+00 1.0 2.65e+10 1.0 0.0e+00 0.0e+00 0.0e+00  5  4  0  0  0   5  4  0  0  0  3087
> VecCUSPCopyTo      10414 1.0 1.5341e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecCUSPCopyFrom    10409 1.0 2.5585e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 13  0  0  0  0  14  0  0  0  0     0
> KSPSolve               2 1.0 1.8163e+02 1.0 7.16e+11 1.0 0.0e+00 0.0e+00 0.0e+00 95100  0  0  0  98100  0  0  0  3942
> MatMult            10366 1.0 9.7032e+01 1.0 2.65e+11 1.0 0.0e+00 0.0e+00 0.0e+00 51 37  0  0  0  52 37  0  0  0  2729
> MatCUSPCopyTo          4 1.0 1.3754e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
>  The complete log_summaries are attached.
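One way to compare events across the attached log summaries is to parse the fixed-width event rows; a rough sketch (the field layout is inferred from the -log_summary output above, and only a few columns are extracted):

```python
import re

def parse_event_row(line):
    """Parse one -log_summary event row into
    (name, count, max_time, max_flops, total_mflops), or None if the
    line is not an event row. Quoted lines may carry a leading '> '.
    """
    line = line.lstrip("> ").rstrip()
    # name, count, ratio, time, ratio, flops, ..., trailing Mflop/s column
    m = re.match(r"(\S+)\s+(\d+)\s+\S+\s+(\S+)\s+\S+\s+(\S+)\s+.*?(\d+)$", line)
    if m is None:
        return None
    name, count, time, flops, mflops = m.groups()
    return name, int(count), float(time), float(flops), int(mflops)
```

For example, feeding it the VecMDot row from the 100x100 run returns the event name, call count, max time, max flops, and the 293 Mflop/s figure quoted above.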
> 
> 
> 
> 
> On Sat, Sep 17, 2011 at 4:14 PM, Matthew Knepley <petsc-maint at mcs.anl.gov> wrote:
> On Sat, Sep 17, 2011 at 3:26 PM, Shiyuan <gshy2014 at gmail.com> wrote:
> I configure petsc-dev with --with-cuda-arch=sm_20 and rebuild, but it
> doesn't help. The performance is essentially the same. The machine has two
> Tesla M2050, with CUDA driver 4.0 and cusp 2.0 and I use -cuda-set-device to
> choose one.  Any clues what's going wrong ?  configure.log is attached.
> 
> Can you show me the output of -cuda_show_devices?
> CUDA device 0: Tesla M2050
> CUDA device 1: Tesla M2050 
>  
> 
> In order to investigate further, please run with
> 
>   -da_vec_type mpicusp -da_mat_type mpiaijcusp
> 
> and then a series of sizes
> 
>   -da_grid_x {100,200,300,400} -da_grid_y {100,200,300,400}
> 
> and send the log summaries.
> 
> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 100 -da_grid_y 100 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_show_devices
> CUDA device 0: Tesla M2050
> CUDA device 1: Tesla M2050
> 
> Time (sec):           1.928e+01      1.00000   1.928e+01
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                9.039e+09      1.00000   9.039e+09  9.039e+09
> Flops/sec:            4.687e+08      1.00000   4.687e+08  4.687e+08
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 4.3905e+00  22.8%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  1:           SetUp: 6.0178e-02   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  2:           Solve: 1.4834e+01  76.9%  9.0389e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> 
> ------------------------------------------------------------------------------------------------------------------------
> VecMDot             2024 1.0 8.6724e+00 1.0 2.54e+09 1.0 0.0e+00 0.0e+00 0.0e+00 45 28  0  0  0  58 28  0  0  0   293
> VecNorm             2096 1.0 1.5712e+00 1.0 3.35e+08 1.0 0.0e+00 0.0e+00 0.0e+00  8  4  0  0  0  11  4  0  0  0   213
> VecCUSPCopyTo       2140 1.0 2.4991e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0
> VecCUSPCopyFrom     2135 1.0 1.0437e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   7  0  0  0  0     0
> KSPSolve               2 1.0 1.4543e+01 1.0 8.99e+09 1.0 0.0e+00 0.0e+00 0.0e+00 75 99  0  0  0  98 99  0  0  0   618
> MatMult             2092 1.0 2.8551e+00 1.0 3.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 37  0  0  0  19 37  0  0  0  1163
> MatCUSPCopyTo          4 1.0 1.6344e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> 
> 
> 
> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 200 -da_grid_y 200 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0
> 
> Time (sec):           5.042e+01      1.00000   5.042e+01
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                8.283e+10      1.00000   8.283e+10  8.283e+10
> Flops/sec:            1.643e+09      1.00000   1.643e+09  1.643e+09
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 4.6509e+00   9.2%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  1:           SetUp: 2.5148e-01   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  2:           Solve: 4.5517e+01  90.3%  8.2826e+10 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> 
> ------------------------------------------------------------------------------------------------------------------------
> VecMDot             4637 1.0 2.1155e+01 1.0 2.34e+10 1.0 0.0e+00 0.0e+00 0.0e+00 42 28  0  0  0  46 28  0  0  0  1104
> VecNorm             4796 1.0 3.7077e+00 1.0 3.07e+09 1.0 0.0e+00 0.0e+00 0.0e+00  7  4  0  0  0   8  4  0  0  0   828
> VecCUSPCopyTo       4840 1.0 6.1929e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecCUSPCopyFrom     4835 1.0 5.0045e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  11  0  0  0  0     0
> KSPSolve               2 1.0 4.4465e+01 1.0 8.26e+10 1.0 0.0e+00 0.0e+00 0.0e+00 88100  0  0  0  98100  0  0  0  1859
> MatMult             4792 1.0 1.4925e+01 1.0 3.05e+10 1.0 0.0e+00 0.0e+00 0.0e+00 30 37  0  0  0  33 37  0  0  0  2047
> MatCUSPCopyTo          4 1.0 4.9795e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> 
> 
> 
> 
> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 300 -da_grid_y 300 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0 >> ex19p.txt
> Time (sec):           1.095e+02      1.00000   1.095e+02
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                3.136e+11      1.00000   3.136e+11  3.136e+11
> Flops/sec:            2.865e+09      1.00000   2.865e+09  2.865e+09
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 4.4090e+00   4.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  1:           SetUp: 5.6010e-01   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  2:           Solve: 1.0449e+02  95.5%  3.1360e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> 
> ------------------------------------------------------------------------------------------------------------------------
> VecMDot             7803 1.0 3.8534e+01 1.0 8.85e+10 1.0 0.0e+00 0.0e+00 0.0e+00 35 28  0  0  0  37 28  0  0  0  2297
> VecNorm             8068 1.0 6.5087e+00 1.0 1.16e+10 1.0 0.0e+00 0.0e+00 0.0e+00  6  4  0  0  0   6  4  0  0  0  1785
> VecCUSPCopyTo       8112 1.0 1.0913e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecCUSPCopyFrom     8107 1.0 1.2753e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 12  0  0  0  0  12  0  0  0  0     0
> KSPSolve               2 1.0 1.0228e+02 1.0 3.13e+11 1.0 0.0e+00 0.0e+00 0.0e+00 93100  0  0  0  98100  0  0  0  3062
> MatMult             8064 1.0 4.5629e+01 1.0 1.16e+11 1.0 0.0e+00 0.0e+00 0.0e+00 42 37  0  0  0  44 37  0  0  0  2538
> MatCUSPCopyTo          4 1.0 8.1736e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
> ./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 400 -da_grid_y 400 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0 >> ex19p.txt
> 
> Time (sec):           1.909e+02      1.00000   1.909e+02
> Objects:              1.320e+02      1.00000   1.320e+02
> Flops:                7.167e+11      1.00000   7.167e+11  7.167e+11
> Flops/sec:            3.753e+09      1.00000   3.753e+09  3.753e+09
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 4.4291e+00   2.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  1:           SetUp: 1.0122e+00   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  2:           Solve: 1.8551e+02  97.2%  7.1669e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> 
> ------------------------------------------------------------------------------------------------------------------------
> VecMDot            10031 1.0 5.4102e+01 1.0 2.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 28 28  0  0  0  29 28  0  0  0  3739
> VecNorm            10370 1.0 8.5987e+00 1.0 2.65e+10 1.0 0.0e+00 0.0e+00 0.0e+00  5  4  0  0  0   5  4  0  0  0  3087
> VecCUSPCopyTo      10414 1.0 1.5341e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecCUSPCopyFrom    10409 1.0 2.5585e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 13  0  0  0  0  14  0  0  0  0     0
> KSPSolve               2 1.0 1.8163e+02 1.0 7.16e+11 1.0 0.0e+00 0.0e+00 0.0e+00 95100  0  0  0  98100  0  0  0  3942
> MatMult            10366 1.0 9.7032e+01 1.0 2.65e+11 1.0 0.0e+00 0.0e+00 0.0e+00 51 37  0  0  0  52 37  0  0  0  2729
> MatCUSPCopyTo          4 1.0 1.3754e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> 
>   The complete log_summaries are attached.  
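To compare the copy overhead across the three runs, the event rows above can be parsed mechanically. Below is a minimal sketch (not part of PETSc) that assumes the whitespace-separated column layout shown in these logs: event name, call count, count ratio, then max time in seconds.

```python
# Sketch of a parser for a PETSc -log_summary event line, assuming the
# column order seen in the logs above: Event, Count, Ratio, Time (Max), ...
def parse_event(line):
    parts = line.lstrip("> ").split()
    return {
        "event": parts[0],          # e.g. "VecCUSPCopyFrom"
        "count": int(parts[1]),     # number of calls
        "time": float(parts[3]),    # max time in seconds
    }

# The VecCUSPCopyFrom row from the 400x400 run:
line = ("VecCUSPCopyFrom    10409 1.0 2.5585e+01 1.0 0.00e+00 0.0 "
        "0.0e+00 0.0e+00 0.0e+00 13  0  0  0  0  14  0  0  0  0     0")
ev = parse_event(line)
print(ev["event"], ev["count"], ev["time"])
```

Applied to the three runs, this makes the trend explicit: VecCUSPCopyFrom grows from about 5 s to 12.8 s to 25.6 s of the solve time, which is the device-to-host traffic being questioned above.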



