[petsc-dev] [petsc-maint #87339] Re: ex19 on GPU

Satish Balay balay at mcs.anl.gov
Mon Sep 19 10:44:23 CDT 2011


Attached is the output from the run on breadboard. It has 2 "nVidia
Corporation GT200 [Tesla C1060]" cards.

Satish

--------

balay at bb30:~/petsc-dev/src/snes/examples/tutorials>./ex19 -da_vec_type seqcusp -da_mat_type seqaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 100 -da_grid_y 100 -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0 -log_summary ex19.cuda.log
lid velocity = 0.0001, prandtl # = 1, grashof # = 1
Number of SNES iterations = 2
balay at bb30:~/petsc-dev/src/snes/examples/tutorials>
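
As a side note, the -da_vec_type seqcusp and -da_mat_type seqaijcusp options simply select the CUSP (GPU) implementations for the vectors and matrices the DA hands out. A minimal standalone sketch of the equivalent explicit calls (assuming a CUSP-enabled petsc-dev build like the one above; the DA-managed path differs in detail, and error checking is omitted for brevity) would be:

static char help[] = "Sketch: create a CUSP Vec and an AIJCUSP Mat by type name.\n";

#include <petscvec.h>
#include <petscmat.h>

int main(int argc, char **argv)
{
  Vec      x;
  Mat      A;
  PetscInt n = 100*100;                /* illustrative size only, not ex19's actual layout */

  PetscInitialize(&argc, &argv, PETSC_NULL, help);

  VecCreate(PETSC_COMM_SELF, &x);
  VecSetSizes(x, PETSC_DECIDE, n);
  VecSetType(x, "seqcusp");            /* the type name -da_vec_type seqcusp passes in */

  MatCreate(PETSC_COMM_SELF, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetType(A, "seqaijcusp");         /* the type name -da_mat_type seqaijcusp passes in */

  VecDestroy(&x);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}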

On Sun, 18 Sep 2011, Barry Smith wrote:

> 
> 
>    Ok, the copies up and down are not a problem. 
> 
>    Except for VecMAXPY(), the vector operations are terrible (as if they are not using the GPU, but they must be?). The MatMult() must be on the GPU, because it is pretty good: 2779 Mflop/s? 
> 
>    Does someone else have access to a similar system, and can they run the exact same test to see what numbers they get? Satish, could you try it on breadboard? Maybe on Magellan :-)
> 
> 
>    Barry
> 
> 
> 
> VecDot                 2 1.0 1.7049e-03 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    94
> VecMDot             2024 1.0 8.6273e+00 1.0 2.54e+09 1.0 0.0e+00 0.0e+00 0.0e+00 50 29  0  0  0  66 29  0  0  0   295
> VecNorm             2096 1.0 1.5544e+00 1.0 1.68e+08 1.0 0.0e+00 0.0e+00 0.0e+00  9  2  0  0  0  12  2  0  0  0   108
> VecScale            2092 1.0 3.7774e-01 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   3  1  0  0  0   222
> VecCopy             2072 1.0 3.8258e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   3  0  0  0  0     0
> VecSet                70 1.0 1.3119e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY              108 1.0 4.7407e-02 1.0 8.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   182
> VecWAXPY              68 1.0 1.2545e-02 1.0 2.72e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   217
> VecMAXPY            2092 1.0 6.4464e-01 1.0 2.71e+09 1.0 0.0e+00 0.0e+00 0.0e+00  4 31  0  0  0   5 31  0  0  0  4198
> VecScatterBegin        5 1.0 1.5609e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecReduceArith         2 1.0 3.8650e-03 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    41
> VecReduceComm          1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecCUSPCopyTo         49 1.0 3.0950e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecCUSPCopyFrom       44 1.0 2.0876e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SNESSolve              1 1.0 1.3044e+01 1.0 8.87e+09 1.0 0.0e+00 0.0e+00 0.0e+00 75 100  0  0  0 100 100  0  0  0   680
> SNESLineSearch         2 1.0 1.1921e-02 1.0 5.49e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   461
> SNESFunctionEval       3 1.0 2.7192e-03 1.0 2.52e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   927
> SNESJacobianEval       2 1.0 2.0424e-01 1.0 3.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0   188
> KSPGMRESOrthog      2024 1.0 9.2522e+00 1.0 5.09e+09 1.0 0.0e+00 0.0e+00 0.0e+00 53 57  0  0  0  71 57  0  0  0   550
> KSPSetup               2 1.0 5.1975e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               2 1.0 1.2819e+01 1.0 8.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00 74 99  0  0  0  98 99  0  0  0   689
> PCSetUp                2 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply             2024 1.0 3.8054e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   3  0  0  0  0     0
> MatMult             2092 1.0 1.1950e+00 1.0 3.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00  7 37  0  0  0   9 37  0  0  0  2779
> 
> On Sep 18, 2011, at 10:29 AM, Shiyuan wrote:
> 
> > 
> > 
> > On Sat, Sep 17, 2011 at 10:48 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > 
> >  Run the first one  with -da_vec_type seqcusp and -da_mat_type seqaijcusp
> > 
> > > VecScatterBegin     2097 1.0 1.0270e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   7  0  0  0  0     0
> > > VecCUSPCopyTo       2140 1.0 2.4991e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0
> > > VecCUSPCopyFrom     2135 1.0 1.0437e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   7  0  0  0  0     0
> > 
> >   Why is it doing all these vector copies up and down? It is run on one process; it shouldn't be doing more than a handful in total.
> > 
> >   Barry
> > 
> > ./ex19 -da_vec_type seqcusp -da_mat_type seqaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 100 -da_grid_y 100 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0 | tee ex19p2.txt
> > 
> > Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
> >                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
> >  0:      Main Stage: 4.2393e+00  24.4%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> >  1:           SetUp: 4.9079e-02   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> >  2:           Solve: 1.3071e+01  75.3%  8.8712e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> > 
> > ------------------------------------------------------------------------------------------------------------------------
> > 
> > VecScatterBegin        5 1.0 1.5609e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecReduceArith         2 1.0 3.8650e-03 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    41
> > VecReduceComm          1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecCUSPCopyTo         49 1.0 3.0950e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecCUSPCopyFrom       44 1.0 2.0876e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > 
> > The complete log is attached.  Thanks. 
> > <ex19p2.txt>
> 
> 
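
For reference, the Mflop/s column in these tables is just the flop count divided by the max time (scaled by 1e-6), so the rates quoted above can be checked directly from the rows, e.g. from Shiyuan's log:

  MatMult:  3.32e+09 flops / 1.1950e+00 s  ~  2779 Mflop/s
  VecMDot:  2.54e+09 flops / 8.6273e+00 s  ~   295 Mflop/s
  VecNorm:  1.68e+08 flops / 1.5544e+00 s  ~   108 Mflop/s

The breadboard log attached below gets a similar 2931 Mflop/s for MatMult but 1189 and 1406 Mflop/s for VecMDot and VecNorm, so the difference between the two runs shows up in the vector kernels rather than in MatMult.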
-------------- next part --------------
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex19 on a arch-cuda named bb30 with 1 processor, by balay Mon Sep 19 10:41:28 2011
Using Petsc Development HG revision: 0c1d30b63d8488b9b083d69444e587dbdd98ebee  HG Date: Sun Sep 18 11:45:23 2011 -0700

                         Max       Max/Min        Avg      Total 
Time (sec):           6.106e+00      1.00000   6.106e+00
Objects:              1.260e+02      1.00000   1.260e+02
Flops:                8.871e+09      1.00000   8.871e+09  8.871e+09
Flops/sec:            1.453e+09      1.00000   1.453e+09  1.453e+09
Memory:               2.505e+07      1.00000              2.505e+07
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       3.384e+04      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 1.6191e+00  26.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
 1:           SetUp: 1.2512e-01   2.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  7.800e+01   0.2% 
 2:           Solve: 4.3615e+00  71.4%  8.8712e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  3.376e+04  99.8% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------


      ##########################################################
      #                                                        #
      #                          WARNING!!!                    #
      #                                                        #
      #   This code was compiled with a debugging option,      #
      #   To get timing results run ./configure                #
      #   using --with-debugging=no, the performance will      #
      #   be generally two or three times faster.              #
      #                                                        #
      ##########################################################


Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

PetscBarrier           1 1.0 5.0068e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

--- Event Stage 1: SetUp

MatAssemblyBegin       1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         1 1.0 3.4699e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0
MatFDColorCreate       1 1.0 4.1661e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.2e+01  1  0  0  0  0  33  0  0  0 41     0

--- Event Stage 2: Solve

VecDot                 2 1.0 1.2088e-04 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1324
VecMDot             2024 1.0 2.1392e+00 1.0 2.54e+09 1.0 0.0e+00 0.0e+00 0.0e+00 35 29  0  0  0  49 29  0  0  0  1189
VecNorm             2096 1.0 1.1928e-01 1.0 1.68e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0   3  2  0  0  0  1406
VecScale            2092 1.0 5.2948e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  1580
VecCopy             2072 1.0 6.9294e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0
VecSet                70 1.0 1.7152e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              108 1.0 1.3336e-02 1.0 8.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   648
VecWAXPY              68 1.0 2.0800e-03 1.0 2.72e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1308
VecMAXPY            2092 1.0 2.9396e-01 1.0 2.71e+09 1.0 0.0e+00 0.0e+00 0.0e+00  5 31  0  0  0   7 31  0  0  0  9205
VecScatterBegin        5 1.0 1.7390e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecReduceArith         2 1.0 9.2983e-04 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   172
VecReduceComm          1 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecCUSPCopyTo         49 1.0 9.9115e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecCUSPCopyFrom       44 1.0 1.4175e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SNESSolve              1 1.0 4.3493e+00 1.0 8.87e+09 1.0 0.0e+00 0.0e+00 3.4e+04 71 100  0  0 100 100 100  0  0 100  2040
SNESLineSearch         2 1.0 8.4569e-03 1.0 5.49e+06 1.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0   650
SNESFunctionEval       3 1.0 6.6361e-03 1.0 2.52e+06 1.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0   380
SNESJacobianEval       2 1.0 5.0695e-01 1.0 3.85e+07 1.0 0.0e+00 0.0e+00 4.3e+01  8  0  0  0  0  12  0  0  0  0    76
KSPGMRESOrthog      2024 1.0 2.4282e+00 1.0 5.09e+09 1.0 0.0e+00 0.0e+00 3.1e+04 40 57  0  0 92  56 57  0  0 93  2095
KSPSetup               2 1.0 2.9206e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+01  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               2 1.0 3.8301e+00 1.0 8.83e+09 1.0 0.0e+00 0.0e+00 3.4e+04 63 99  0  0 100  88 99  0  0 100  2304
PCSetUp                2 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             2024 1.0 6.3726e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatMult             2092 1.0 1.1330e+00 1.0 3.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00 19 37  0  0  0  26 37  0  0  0  2931
MatAssemblyBegin       2 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 6.8700e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries         2 1.0 1.8141e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatFDColorApply        2 1.0 5.0662e-01 1.0 3.85e+07 1.0 0.0e+00 0.0e+00 4.3e+01  8  0  0  0  0  12  0  0  0  0    76
MatFDColorFunc        42 1.0 5.3421e-02 1.0 3.53e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0   660
MatCUSPCopyTo          2 1.0 9.9909e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Viewer     1              0            0     0

--- Event Stage 1: SetUp

    Distributed Mesh     1              0            0     0
              Vector     9              2         2928     0
      Vector Scatter     3              0            0     0
           Index Set    27              7        45136     0
   IS L to G Mapping     3              0            0     0
                SNES     1              0            0     0
       Krylov Solver     2              1         1064     0
      Preconditioner     2              1          752     0
              Matrix     1              0            0     0
  Matrix FD Coloring     1              0            0     0

--- Event Stage 2: Solve

    Distributed Mesh     0              1       204840     0
              Vector    74             81      3636056     0
      Vector Scatter     0              3         1836     0
           Index Set     0             20       174720     0
   IS L to G Mapping     0              3       161668     0
                SNES     0              1         1288     0
       Krylov Solver     0              1        18864     0
      Preconditioner     0              1          952     0
              Matrix     0              1     10165692     0
  Matrix FD Coloring     0              1          708     0
              Viewer     1              0            0     0
========================================================================================================================
Average time to get PetscTime(): 1.90735e-07
#PETSc Option Table entries:
-cuda_set_device 0
-cusp_synchronize
-da_grid_x 100
-da_grid_y 100
-da_mat_type seqaijcusp
-da_vec_type seqcusp
-dmmg_nlevels 1
-log_summary ex19.cuda.log
-malloc_dump
-mat_no_inode
-nox
-nox_warning
-pc_type none
-preload off
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Sun Sep 18 18:38:34 2011
Configure options: --with-cc=gcc --with-cxx=g++ --download-mpich=1 --with-cuda=1 --with-cusp=1 --with-thrust=1 PETSC_ARCH=arch-cuda-double --with-precision=double --with-fc=0 --with-clanguage=c --with-cuda-arch=sm_13
-----------------------------------------
Libraries compiled on Sun Sep 18 18:38:34 2011 on bb30 
Machine characteristics: Linux-2.6.32-24-generic-x86_64-with-Ubuntu-10.04-lucid
Using PETSc directory: /home/balay/petsc-dev
Using PETSc arch: arch-cuda-double
-----------------------------------------

Using C compiler: /home/balay/petsc-dev/arch-cuda-double/bin/mpicc  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g3  ${COPTFLAGS} ${CFLAGS}
-----------------------------------------

Using include paths: -I/home/balay/petsc-dev/arch-cuda-double/include -I/home/balay/petsc-dev/include -I/home/balay/petsc-dev/include -I/home/balay/petsc-dev/arch-cuda-double/include -I/usr/local/cuda/include
-----------------------------------------

Using C linker: /home/balay/petsc-dev/arch-cuda-double/bin/mpicc
Using libraries: -Wl,-rpath,/home/balay/petsc-dev/arch-cuda-double/lib -L/home/balay/petsc-dev/arch-cuda-double/lib -lpetsc -lX11 -lpthread -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart -llapack -lblas -lm -lmpichcxx -lstdc++ -ldl 
-----------------------------------------


