[petsc-dev] ex19 on GPU

Shiyuan gshy2014 at gmail.com
Sat Sep 17 21:56:11 CDT 2011


On Sat, Sep 17, 2011 at 4:14 PM, Matthew Knepley <petsc-maint at mcs.anl.gov> wrote:

> On Sat, Sep 17, 2011 at 3:26 PM, Shiyuan <gshy2014 at gmail.com> wrote:
>
>> I configured petsc-dev with --with-cuda-arch=sm_20 and rebuilt, but it
>> doesn't help; the performance is essentially the same. The machine has two
>> Tesla M2050s, with CUDA driver 4.0 and cusp 2.0, and I use -cuda_set_device
>> to choose one. Any clues as to what's going wrong? configure.log is attached.
>>
>
> Can you show me the output of -cuda_show_devices?
>
CUDA device 0: Tesla M2050
CUDA device 1: Tesla M2050

>
>
> In order to investigate further, please run with
>
>   -da_vec_type mpicusp -da_mat_type mpiaijcusp
>
> and then a series of sizes
>
>   -da_grid_x {100,200,300,400} -da_grid_y {100,200,300,400}
>
> and send the log summaries.
>
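The size sweep requested above can be scripted. This is a minimal POSIX sh sketch, not part of PETSc. It assumes ./ex19 has already been built in the current directory and reuses the option set from the runs below. The helper name ex19_opts and the per-size log names ex19_<n>.log are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the ex19 options for one n x n grid, using the
# CUSP vector/matrix types requested above and the options from the runs below.
ex19_opts() {
  n=$1
  opts="-da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1"
  opts="$opts -da_grid_x $n -da_grid_y $n -log_summary -mat_no_inode"
  opts="$opts -preload off -cusp_synchronize -cuda_set_device 0"
  printf '%s\n' "$opts"
}

# Run the sweep only if the executable is present; one log file per size.
if [ -x ./ex19 ]; then
  for n in 100 200 300 400; do
    # Deliberately unquoted: the option string must split into separate words.
    ./ex19 $(ex19_opts "$n") > "ex19_${n}.log"
  done
fi
```

Writing each size to its own file, rather than appending everything to one file, keeps each -log_summary self-contained for later comparison.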
./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 100 -da_grid_y 100 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_show_devices
CUDA device 0: Tesla M2050
CUDA device 1: Tesla M2050

Time (sec):           1.928e+01      1.00000   1.928e+01
Objects:              1.320e+02      1.00000   1.320e+02
Flops:                9.039e+09      1.00000   9.039e+09  9.039e+09
Flops/sec:            4.687e+08      1.00000   4.687e+08  4.687e+08
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 4.3905e+00  22.8%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 1:           SetUp: 6.0178e-02   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 2:           Solve: 1.4834e+01  76.9%  9.0389e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
VecMDot             2024 1.0 8.6724e+00 1.0 2.54e+09 1.0 0.0e+00 0.0e+00 0.0e+00 45 28  0  0  0  58 28  0  0  0   293
VecNorm             2096 1.0 1.5712e+00 1.0 3.35e+08 1.0 0.0e+00 0.0e+00 0.0e+00  8  4  0  0  0  11  4  0  0  0   213
VecCUSPCopyTo       2140 1.0 2.4991e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0
VecCUSPCopyFrom     2135 1.0 1.0437e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   7  0  0  0  0     0
KSPSolve               2 1.0 1.4543e+01 1.0 8.99e+09 1.0 0.0e+00 0.0e+00 0.0e+00 75 99  0  0  0  98 99  0  0  0   618
MatMult             2092 1.0 2.8551e+00 1.0 3.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 37  0  0  0  19 37  0  0  0  1163
MatCUSPCopyTo          4 1.0 1.6344e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0




./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 200 -da_grid_y 200 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0

Time (sec):           5.042e+01      1.00000   5.042e+01
Objects:              1.320e+02      1.00000   1.320e+02
Flops:                8.283e+10      1.00000   8.283e+10  8.283e+10
Flops/sec:            1.643e+09      1.00000   1.643e+09  1.643e+09
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 4.6509e+00   9.2%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 1:           SetUp: 2.5148e-01   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 2:           Solve: 4.5517e+01  90.3%  8.2826e+10 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
VecMDot             4637 1.0 2.1155e+01 1.0 2.34e+10 1.0 0.0e+00 0.0e+00 0.0e+00 42 28  0  0  0  46 28  0  0  0  1104
VecNorm             4796 1.0 3.7077e+00 1.0 3.07e+09 1.0 0.0e+00 0.0e+00 0.0e+00  7  4  0  0  0   8  4  0  0  0   828
VecCUSPCopyTo       4840 1.0 6.1929e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecCUSPCopyFrom     4835 1.0 5.0045e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  11  0  0  0  0     0
KSPSolve               2 1.0 4.4465e+01 1.0 8.26e+10 1.0 0.0e+00 0.0e+00 0.0e+00 88100  0  0  0  98100  0  0  0  1859
MatMult             4792 1.0 1.4925e+01 1.0 3.05e+10 1.0 0.0e+00 0.0e+00 0.0e+00 30 37  0  0  0  33 37  0  0  0  2047
MatCUSPCopyTo          4 1.0 4.9795e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0





./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 300 -da_grid_y 300 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0 >> ex19p.txt

Time (sec):           1.095e+02      1.00000   1.095e+02
Objects:              1.320e+02      1.00000   1.320e+02
Flops:                3.136e+11      1.00000   3.136e+11  3.136e+11
Flops/sec:            2.865e+09      1.00000   2.865e+09  2.865e+09
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 4.4090e+00   4.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 1:           SetUp: 5.6010e-01   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 2:           Solve: 1.0449e+02  95.5%  3.1360e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
VecMDot             7803 1.0 3.8534e+01 1.0 8.85e+10 1.0 0.0e+00 0.0e+00 0.0e+00 35 28  0  0  0  37 28  0  0  0  2297
VecNorm             8068 1.0 6.5087e+00 1.0 1.16e+10 1.0 0.0e+00 0.0e+00 0.0e+00  6  4  0  0  0   6  4  0  0  0  1785
VecCUSPCopyTo       8112 1.0 1.0913e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecCUSPCopyFrom     8107 1.0 1.2753e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 12  0  0  0  0  12  0  0  0  0     0
KSPSolve               2 1.0 1.0228e+02 1.0 3.13e+11 1.0 0.0e+00 0.0e+00 0.0e+00 93100  0  0  0  98100  0  0  0  3062
MatMult             8064 1.0 4.5629e+01 1.0 1.16e+11 1.0 0.0e+00 0.0e+00 0.0e+00 42 37  0  0  0  44 37  0  0  0  2538
MatCUSPCopyTo          4 1.0 8.1736e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0




./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 400 -da_grid_y 400 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0 >> ex19p.txt

Time (sec):           1.909e+02      1.00000   1.909e+02
Objects:              1.320e+02      1.00000   1.320e+02
Flops:                7.167e+11      1.00000   7.167e+11  7.167e+11
Flops/sec:            3.753e+09      1.00000   3.753e+09  3.753e+09
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 4.4291e+00   2.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 1:           SetUp: 1.0122e+00   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 2:           Solve: 1.8551e+02  97.2%  7.1669e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
VecMDot            10031 1.0 5.4102e+01 1.0 2.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 28 28  0  0  0  29 28  0  0  0  3739
VecNorm            10370 1.0 8.5987e+00 1.0 2.65e+10 1.0 0.0e+00 0.0e+00 0.0e+00  5  4  0  0  0   5  4  0  0  0  3087
VecCUSPCopyTo      10414 1.0 1.5341e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecCUSPCopyFrom    10409 1.0 2.5585e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 13  0  0  0  0  14  0  0  0  0     0
KSPSolve               2 1.0 1.8163e+02 1.0 7.16e+11 1.0 0.0e+00 0.0e+00 0.0e+00 95100  0  0  0  98100  0  0  0  3942
MatMult            10366 1.0 9.7032e+01 1.0 2.65e+11 1.0 0.0e+00 0.0e+00 0.0e+00 51 37  0  0  0  52 37  0  0  0  2729
MatCUSPCopyTo          4 1.0 1.3754e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

  The complete log_summaries are attached.
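The excerpts above keep only a handful of events from each run. The following is a minimal sketch for pulling the same rows out of full, unwrapped -log_summary files so the four sizes can be compared side by side. It assumes the single-process event-line layout shown in the attachments, where field 4 is the maximum time and the last field is Total Mflop/s. The helper name summarize_events and the ex19_<n>.log file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read one full -log_summary on stdin and print, for the
# events quoted above, the event name, max time (s) and Total Mflop/s.
# Uses $NF for the rate because fused columns such as "88100" vary the field count.
summarize_events() {
  awk -v re='^(VecMDot|VecNorm|VecCUSPCopyTo|VecCUSPCopyFrom|KSPSolve|MatMult|MatCUSPCopyTo)$' \
      '$1 ~ re { printf "%-16s %12s %8s\n", $1, $4, $NF }'
}
```

For example: for f in ex19_*.log; do echo "== $f"; summarize_events < "$f"; done. Lines wrapped at 80 columns, like the quoted excerpts in this message, would give a wrong last field, so the sketch should only be fed the original log files.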
-------------- next part --------------
./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 100 -da_grid_y 100 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_show_devices
CUDA device 0: Tesla M2050
CUDA device 1: Tesla M2050
lid velocity = 0.0001, prandtl # = 1, grashof # = 1
Number of SNES iterations = 2
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex19 on a gpu00CCT- named gpu00.cct.lsu.edu with 1 processor, by sgu Sat Sep 17 20:34:38 2011
Using Petsc Development HG revision: 94fea4d40b1fcca2e886a14e7fdb916b8f6fecf3  HG Date: Sat Sep 17 00:48:29 2011 -0500

                         Max       Max/Min        Avg      Total 
Time (sec):           1.928e+01      1.00000   1.928e+01
Objects:              1.320e+02      1.00000   1.320e+02
Flops:                9.039e+09      1.00000   9.039e+09  9.039e+09
Flops/sec:            4.687e+08      1.00000   4.687e+08  4.687e+08
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00      0.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 4.3905e+00  22.8%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
 1:           SetUp: 6.0178e-02   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
 2:           Solve: 1.4834e+01  76.9%  9.0389e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

PetscBarrier           1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

--- Event Stage 1: SetUp

MatAssemblyBegin       1 1.0 1.1921e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         1 1.0 2.0661e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0
MatFDColorCreate       1 1.0 1.8455e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  31  0  0  0  0     0

--- Event Stage 2: Solve

VecDot                 2 1.0 1.6947e-03 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    94
VecMDot             2024 1.0 8.6724e+00 1.0 2.54e+09 1.0 0.0e+00 0.0e+00 0.0e+00 45 28  0  0  0  58 28  0  0  0   293
VecNorm             2096 1.0 1.5712e+00 1.0 3.35e+08 1.0 0.0e+00 0.0e+00 0.0e+00  8  4  0  0  0  11  4  0  0  0   213
VecScale            2092 1.0 3.7956e-01 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   3  1  0  0  0   220
VecCopy             2072 1.0 3.8405e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   3  0  0  0  0     0
VecSet                70 1.0 1.3284e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              108 1.0 4.7269e-02 1.0 8.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   183
VecWAXPY              68 1.0 1.2537e-02 1.0 2.72e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   217
VecMAXPY            2092 1.0 6.4375e-01 1.0 2.71e+09 1.0 0.0e+00 0.0e+00 0.0e+00  3 30  0  0  0   4 30  0  0  0  4203
VecScatterBegin     2097 1.0 1.0270e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   7  0  0  0  0     0
VecReduceArith         2 1.0 3.7239e-03 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    43
VecReduceComm          1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecCUSPCopyTo       2140 1.0 2.4991e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0
VecCUSPCopyFrom     2135 1.0 1.0437e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   7  0  0  0  0     0
SNESSolve              1 1.0 1.4807e+01 1.0 9.04e+09 1.0 0.0e+00 0.0e+00 0.0e+00 77100  0  0  0 100100  0  0  0   610
SNESLineSearch         2 1.0 1.2360e-02 1.0 5.81e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   470
SNESFunctionEval       3 1.0 2.7061e-03 1.0 2.52e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   931
SNESJacobianEval       2 1.0 2.4291e-01 1.0 3.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0   158
KSPGMRESOrthog      2024 1.0 9.2966e+00 1.0 5.09e+09 1.0 0.0e+00 0.0e+00 0.0e+00 48 56  0  0  0  63 56  0  0  0   547
KSPSetup               2 1.0 6.2943e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               2 1.0 1.4543e+01 1.0 8.99e+09 1.0 0.0e+00 0.0e+00 0.0e+00 75 99  0  0  0  98 99  0  0  0   618
PCSetUp                2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             2024 1.0 3.8127e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   3  0  0  0  0     0
MatMult             2092 1.0 2.8551e+00 1.0 3.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 37  0  0  0  19 37  0  0  0  1163
MatAssemblyBegin       2 1.0 1.8120e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 3.1030e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries         2 1.0 1.8611e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatFDColorApply        2 1.0 2.4285e-01 1.0 3.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0   158
MatFDColorFunc        42 1.0 1.2794e-02 1.0 3.53e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2758
MatCUSPCopyTo          4 1.0 1.6344e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage


--- Event Stage 1: SetUp

    Distributed Mesh     1              0            0     0
              Vector    11              3         4424     0
      Vector Scatter     4              0            0     0
           Index Set    29              9        46600     0
   IS L to G Mapping     3              0            0     0
                SNES     1              0            0     0
       Krylov Solver     2              1         1064     0
      Preconditioner     2              1          752     0
              Matrix     3              0            0     0
  Matrix FD Coloring     1              0            0     0

--- Event Stage 2: Solve

    Distributed Mesh     0              1       204840     0
              Vector    74             82     13242416     0
      Vector Scatter     0              4         2448     0
           Index Set     0             20       174720     0
   IS L to G Mapping     0              3       161668     0
                SNES     0              1         1288     0
       Krylov Solver     0              1        18864     0
      Preconditioner     0              1          952     0
              Matrix     0              3     10810468     0
  Matrix FD Coloring     0              1      6510068     0
              Viewer     1              0            0     0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
#PETSc Option Table entries:
-cuda_show_devices
-cusp_synchronize
-da_grid_x 100
-da_grid_y 100
-da_mat_type mpiaijcusp
-da_vec_type mpicusp
-dmmg_nlevels 1
-log_summary
-mat_no_inode
-pc_type none
-preload off
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Sat Sep 17 11:25:49 2011
Configure options: PETSC_DIR=/home/sgu/softwares/petsc-dev PETSC_ARCH=gpu00CCT-cxx-nompi-release -with-clanguage=cxx --with-mpi=0 --download-f2cblaslapack=1 --download-f-blas-lapack=1 --with-debugging=0 --with-c2html=0 --with-valgrind-dir=~/softwares/valgrind --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-arch=sm_20
-----------------------------------------
Libraries compiled on Sat Sep 17 11:25:49 2011 on gpu00.cct.lsu.edu 
Machine characteristics: Linux-2.6.32-131.6.1.el6.x86_64-x86_64-with-redhat-6.1-Santiago
Using PETSc directory: /home/sgu/softwares/petsc-dev
Using PETSc arch: gpu00CCT-cxx-nompi-release
-----------------------------------------

Using C compiler: g++  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O     ${COPTFLAGS} ${CFLAGS}
-----------------------------------------

Using include paths: -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/valgrind/include -I/usr/local/cuda/include -I/home/sgu/softwares/petsc-dev/include/mpiuni
-----------------------------------------

Using C linker: g++
Using libraries: -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lpetsc -lX11 -lpthread -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lf2clapack -lf2cblas -lm -lm -lstdc++ -ldl 
-----------------------------------------
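The legend above defines Total Mflop/s as "10e-6 * (sum of flops over all processors)/(max time over all processors)". However, the reported values correspond to a factor of 1e-6. For VecMDot, 2.54e+09 flops in 8.6724 s is about 2.93e+08 flop/s, which the log reports as 293 Mflop/s; a factor of 10e-6 would give about 2929. The sketch below recomputes the column from a single-process log, where maximum and summed flops coincide. It assumes the event-line layout above (field 4 = max time, field 6 = max flops), and the helper name event_mflops is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: recompute Total Mflop/s for one event from a
# single-process -log_summary on stdin as 1e-6 * flops / time.
# Field 4 is the max time (s) and field 6 the max flops in this layout;
# events with zero time are skipped to avoid dividing by zero.
event_mflops() {
  awk -v ev="$1" '$1 == ev && $4 > 0 { printf "%s %.0f\n", $1, 1e-6 * $6 / $4 }'
}
```

For example, event_mflops MatMult < ex19_100.log should reproduce the 1163 shown for MatMult in the 100 x 100 run, assuming the log was saved under that name.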

./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 200 -da_grid_y 200 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0
lid velocity = 2.5e-05, prandtl # = 1, grashof # = 1
Number of SNES iterations = 2
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex19 on a gpu00CCT- named gpu00.cct.lsu.edu with 1 processor, by sgu Sat Sep 17 20:36:14 2011
Using Petsc Development HG revision: 94fea4d40b1fcca2e886a14e7fdb916b8f6fecf3  HG Date: Sat Sep 17 00:48:29 2011 -0500

                         Max       Max/Min        Avg      Total 
Time (sec):           5.042e+01      1.00000   5.042e+01
Objects:              1.320e+02      1.00000   1.320e+02
Flops:                8.283e+10      1.00000   8.283e+10  8.283e+10
Flops/sec:            1.643e+09      1.00000   1.643e+09  1.643e+09
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00      0.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 4.6509e+00   9.2%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
 1:           SetUp: 2.5148e-01   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
 2:           Solve: 4.5517e+01  90.3%  8.2826e+10 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

PetscBarrier           1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

--- Event Stage 1: SetUp

MatAssemblyBegin       1 1.0 1.4067e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         1 1.0 8.0690e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0
MatFDColorCreate       1 1.0 7.4871e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  30  0  0  0  0     0

--- Event Stage 2: Solve

VecDot                 2 1.0 1.6088e-03 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   398
VecMDot             4637 1.0 2.1155e+01 1.0 2.34e+10 1.0 0.0e+00 0.0e+00 0.0e+00 42 28  0  0  0  46 28  0  0  0  1104
VecNorm             4796 1.0 3.7077e+00 1.0 3.07e+09 1.0 0.0e+00 0.0e+00 0.0e+00  7  4  0  0  0   8  4  0  0  0   828
VecScale            4792 1.0 9.7300e-01 1.0 7.67e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0   788
VecCopy             4685 1.0 9.9265e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
VecSet               157 1.0 3.0819e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              195 1.0 9.1851e-02 1.0 6.24e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   679
VecWAXPY             155 1.0 3.3326e-02 1.0 2.48e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   744
VecMAXPY            4792 1.0 2.6158e+00 1.0 2.48e+10 1.0 0.0e+00 0.0e+00 0.0e+00  5 30  0  0  0   6 30  0  0  0  9498
VecScatterBegin     4797 1.0 4.9713e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  11  0  0  0  0     0
VecReduceArith         2 1.0 5.0960e-03 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   126
VecReduceComm          1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecCUSPCopyTo       4840 1.0 6.1929e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecCUSPCopyFrom     4835 1.0 5.0045e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  11  0  0  0  0     0
SNESSolve              1 1.0 4.5474e+01 1.0 8.28e+10 1.0 0.0e+00 0.0e+00 0.0e+00 90100  0  0  0 100100  0  0  0  1821
SNESLineSearch         2 1.0 2.3559e-02 1.0 2.33e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   989
SNESFunctionEval       3 1.0 8.9130e-03 1.0 1.01e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1131
SNESJacobianEval       2 1.0 9.7259e-01 1.0 1.54e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   158
KSPGMRESOrthog      4637 1.0 2.3658e+01 1.0 4.67e+10 1.0 0.0e+00 0.0e+00 0.0e+00 47 56  0  0  0  52 56  0  0  0  1975
KSPSetup               2 1.0 6.1035e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               2 1.0 4.4465e+01 1.0 8.26e+10 1.0 0.0e+00 0.0e+00 0.0e+00 88100  0  0  0  98100  0  0  0  1859
PCSetUp                2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             4637 1.0 9.8032e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatMult             4792 1.0 1.4925e+01 1.0 3.05e+10 1.0 0.0e+00 0.0e+00 0.0e+00 30 37  0  0  0  33 37  0  0  0  2047
MatAssemblyBegin       2 1.0 2.0027e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 1.2705e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries         2 1.0 7.4351e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatFDColorApply        2 1.0 9.7253e-01 1.0 1.54e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   158
MatFDColorFunc        42 1.0 5.1462e-02 1.0 1.41e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2742
MatCUSPCopyTo          4 1.0 4.9795e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage


--- Event Stage 1: SetUp

    Distributed Mesh     1              0            0     0
              Vector    11              3         4424     0
      Vector Scatter     4              0            0     0
           Index Set    29              9       166600     0
   IS L to G Mapping     3              0            0     0
                SNES     1              0            0     0
       Krylov Solver     2              1         1064     0
      Preconditioner     2              1          752     0
              Matrix     3              0            0     0
  Matrix FD Coloring     1              0            0     0

--- Event Stage 2: Solve

    Distributed Mesh     0              1       804840     0
              Vector    74             82     52602416     0
      Vector Scatter     0              4         2448     0
           Index Set     0             20       654720     0
   IS L to G Mapping     0              3       641668     0
                SNES     0              1         1288     0
       Krylov Solver     0              1        18864     0
      Preconditioner     0              1          952     0
              Matrix     0              3     43373668     0
  Matrix FD Coloring     0              1     26138868     0
              Viewer     1              0            0     0
========================================================================================================================
Average time to get PetscTime(): 0
#PETSc Option Table entries:
-cuda_set_device 0
-cusp_synchronize
-da_grid_x 200
-da_grid_y 200
-da_mat_type mpiaijcusp
-da_vec_type mpicusp
-dmmg_nlevels 1
-log_summary
-mat_no_inode
-pc_type none
-preload off
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Sat Sep 17 11:25:49 2011
Configure options: PETSC_DIR=/home/sgu/softwares/petsc-dev PETSC_ARCH=gpu00CCT-cxx-nompi-release -with-clanguage=cxx --with-mpi=0 --download-f2cblaslapack=1 --download-f-blas-lapack=1 --with-debugging=0 --with-c2html=0 --with-valgrind-dir=~/softwares/valgrind --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-arch=sm_20
-----------------------------------------
Libraries compiled on Sat Sep 17 11:25:49 2011 on gpu00.cct.lsu.edu 
Machine characteristics: Linux-2.6.32-131.6.1.el6.x86_64-x86_64-with-redhat-6.1-Santiago
Using PETSc directory: /home/sgu/softwares/petsc-dev
Using PETSc arch: gpu00CCT-cxx-nompi-release
-----------------------------------------

Using C compiler: g++  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O     ${COPTFLAGS} ${CFLAGS}
-----------------------------------------

Using include paths: -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/valgrind/include -I/usr/local/cuda/include -I/home/sgu/softwares/petsc-dev/include/mpiuni
-----------------------------------------

Using C linker: g++
Using libraries: -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lpetsc -lX11 -lpthread -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lf2clapack -lf2cblas -lm -lm -lstdc++ -ldl 
-----------------------------------------




./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 300 -da_grid_y 300 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0

lid velocity = 1.11111e-05, prandtl # = 1, grashof # = 1
Number of SNES iterations = 2
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex19 on a gpu00CCT- named gpu00.cct.lsu.edu with 1 processor, by sgu Sat Sep 17 20:38:29 2011
Using Petsc Development HG revision: 94fea4d40b1fcca2e886a14e7fdb916b8f6fecf3  HG Date: Sat Sep 17 00:48:29 2011 -0500

                         Max       Max/Min        Avg      Total 
Time (sec):           1.095e+02      1.00000   1.095e+02
Objects:              1.320e+02      1.00000   1.320e+02
Flops:                3.136e+11      1.00000   3.136e+11  3.136e+11
Flops/sec:            2.865e+09      1.00000   2.865e+09  2.865e+09
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00      0.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops
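As a cross-check of the flop-counting convention quoted above, the VecAXPY flop totals in these logs can be reproduced by hand. This is a small sketch, not PETSc code. It assumes that ex19 carries 4 unknowns per grid point (velocity u, v, vorticity omega and temperature), so N = grid_x * grid_y * 4. The call counts and logged totals come from the VecAXPY rows of the event tables: 301 calls and 2.17e+08 flops for the 300x300 run, and 375 calls and 4.80e+08 flops for the 400x400 run. The helper names are hypothetical.

```python
# Sketch: reproduce the VecAXPY flop count reported by -log_summary.
# Assumption: ex19 has 4 fields per grid point (u, v, omega, temperature).
DOF_PER_POINT = 4


def vec_length(grid_x: int, grid_y: int, dof: int = DOF_PER_POINT) -> int:
    """Global vector length N for a grid_x-by-grid_y grid with `dof` fields."""
    return grid_x * grid_y * dof


def vecaxpy_flops(n: int, calls: int, complex_scalars: bool = False) -> int:
    """Flops logged for `calls` VecAXPY() calls on length-n vectors.

    Follows the convention printed in the log: 2N flops per call for real
    vectors, 8N flops per call for complex vectors.
    """
    per_call = 8 * n if complex_scalars else 2 * n
    return per_call * calls


if __name__ == "__main__":
    # (grid size, VecAXPY call count, flops as logged) taken from the runs.
    for grid, calls, logged in [(300, 301, 2.17e8), (400, 375, 4.80e8)]:
        n = vec_length(grid, grid)
        flops = vecaxpy_flops(n, calls)
        print(f"{grid}x{grid}: N={n}, VecAXPY flops={flops:.3e} (log: {logged:.2e})")
```

Under the 4-fields assumption, both runs match the logged values to the three significant digits that PETSc prints.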

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 4.4090e+00   4.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
 1:           SetUp: 5.6010e-01   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
 2:           Solve: 1.0449e+02  95.5%  3.1360e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

PetscBarrier           1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

--- Event Stage 1: SetUp

MatAssemblyBegin       1 1.0 1.4067e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         1 1.0 1.7501e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0
MatFDColorCreate       1 1.0 1.6907e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  30  0  0  0  0     0

--- Event Stage 2: Solve

VecDot                 2 1.0 1.6890e-03 1.0 1.44e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   853
VecMDot             7803 1.0 3.8534e+01 1.0 8.85e+10 1.0 0.0e+00 0.0e+00 0.0e+00 35 28  0  0  0  37 28  0  0  0  2297
VecNorm             8068 1.0 6.5087e+00 1.0 1.16e+10 1.0 0.0e+00 0.0e+00 0.0e+00  6  4  0  0  0   6  4  0  0  0  1785
VecScale            8064 1.0 1.8853e+00 1.0 2.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  1540
VecCopy             7851 1.0 1.9321e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
VecSet               263 1.0 5.4441e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              301 1.0 1.5158e-01 1.0 2.17e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1430
VecWAXPY             261 1.0 6.9037e-02 1.0 9.40e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1361
VecMAXPY            8064 1.0 7.6110e+00 1.0 9.41e+10 1.0 0.0e+00 0.0e+00 0.0e+00  7 30  0  0  0   7 30  0  0  0 12366
VecScatterBegin     8069 1.0 1.2707e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 12  0  0  0  0  12  0  0  0  0     0
VecReduceArith         2 1.0 6.5138e-03 1.0 1.44e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   221
VecReduceComm          1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecCUSPCopyTo       8112 1.0 1.0913e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecCUSPCopyFrom     8107 1.0 1.2753e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 12  0  0  0  0  12  0  0  0  0     0
SNESSolve              1 1.0 1.0444e+02 1.0 3.14e+11 1.0 0.0e+00 0.0e+00 0.0e+00 95100  0  0  0 100100  0  0  0  3003
SNESLineSearch         2 1.0 3.9190e-02 1.0 5.25e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1339
SNESFunctionEval       3 1.0 1.7656e-02 1.0 2.27e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1285
SNESJacobianEval       2 1.0 2.0955e+00 1.0 3.46e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   165
KSPGMRESOrthog      7803 1.0 4.5761e+01 1.0 1.77e+11 1.0 0.0e+00 0.0e+00 0.0e+00 42 56  0  0  0  44 56  0  0  0  3868
KSPSetup               2 1.0 4.4107e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               2 1.0 1.0228e+02 1.0 3.13e+11 1.0 0.0e+00 0.0e+00 0.0e+00 93100  0  0  0  98100  0  0  0  3062
PCSetUp                2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             7803 1.0 1.9026e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatMult             8064 1.0 4.5629e+01 1.0 1.16e+11 1.0 0.0e+00 0.0e+00 0.0e+00 42 37  0  0  0  44 37  0  0  0  2538
MatAssemblyBegin       2 1.0 2.0981e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 2.8598e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries         2 1.0 1.9902e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatFDColorApply        2 1.0 2.0955e+00 1.0 3.46e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   165
MatFDColorFunc        42 1.0 1.1288e-01 1.0 3.18e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2813
MatCUSPCopyTo          4 1.0 8.1736e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------
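The "Total Mflop/s" column in the event table above can be recomputed from the Flops and Time columns. The legend prints the scale factor as "10e-6", but the values in this log are consistent with a factor of 1e-6, i.e. Mflop/s = 1e-6 * flops / max time. The sketch below checks three rows from this 300x300 run. Because the Flops column is printed with only three significant digits, the recomputed values are expected to agree to within roughly 0.5%. The function name is hypothetical.

```python
# Sketch: recompute the "Total Mflop/s" column of -log_summary from the
# Flops and Time columns (single process, so sum == max == the logged value).


def total_mflops(flops: float, max_time_s: float) -> float:
    """Mflop/s = 1e-6 * (sum of flops over processes) / (max time over processes)."""
    return 1e-6 * flops / max_time_s


# (event, flops, time in seconds, Mflop/s as logged) from the 300x300 run.
ROWS = [
    ("MatMult", 1.16e11, 4.5629e01, 2538),
    ("VecMAXPY", 9.41e10, 7.6110e00, 12366),
    ("SNESSolve", 3.14e11, 1.0444e02, 3003),
]

if __name__ == "__main__":
    for name, flops, t, logged in ROWS:
        est = total_mflops(flops, t)
        # Flops are printed to 3 significant digits, so small drift is expected.
        print(f"{name:10s} recomputed {est:8.0f}  logged {logged}")
```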

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage


--- Event Stage 1: SetUp

    Distributed Mesh     1              0            0     0
              Vector    11              3         4424     0
      Vector Scatter     4              0            0     0
           Index Set    29              9       366600     0
   IS L to G Mapping     3              0            0     0
                SNES     1              0            0     0
       Krylov Solver     2              1         1064     0
      Preconditioner     2              1          752     0
              Matrix     3              0            0     0
  Matrix FD Coloring     1              0            0     0

--- Event Stage 2: Solve

    Distributed Mesh     0              1      1804840     0
              Vector    74             82    118202416     0
      Vector Scatter     0              4         2448     0
           Index Set     0             20      1454720     0
   IS L to G Mapping     0              3      1441668     0
                SNES     0              1         1288     0
       Krylov Solver     0              1        18864     0
      Preconditioner     0              1          952     0
              Matrix     0              3     97696868     0
  Matrix FD Coloring     0              1     58887668     0
              Viewer     1              0            0     0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
#PETSc Option Table entries:
-cuda_set_device 0
-cusp_synchronize
-da_grid_x 300
-da_grid_y 300
-da_mat_type mpiaijcusp
-da_vec_type mpicusp
-dmmg_nlevels 1
-log_summary
-mat_no_inode
-pc_type none
-preload off
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Sat Sep 17 11:25:49 2011
Configure options: PETSC_DIR=/home/sgu/softwares/petsc-dev PETSC_ARCH=gpu00CCT-cxx-nompi-release -with-clanguage=cxx --with-mpi=0 --download-f2cblaslapack=1 --download-f-blas-lapack=1 --with-debugging=0 --with-c2html=0 --with-valgrind-dir=~/softwares/valgrind --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-arch=sm_20
-----------------------------------------
Libraries compiled on Sat Sep 17 11:25:49 2011 on gpu00.cct.lsu.edu 
Machine characteristics: Linux-2.6.32-131.6.1.el6.x86_64-x86_64-with-redhat-6.1-Santiago
Using PETSc directory: /home/sgu/softwares/petsc-dev
Using PETSc arch: gpu00CCT-cxx-nompi-release
-----------------------------------------

Using C compiler: g++  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O     ${COPTFLAGS} ${CFLAGS}
-----------------------------------------

Using include paths: -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/valgrind/include -I/usr/local/cuda/include -I/home/sgu/softwares/petsc-dev/include/mpiuni
-----------------------------------------

Using C linker: g++
Using libraries: -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lpetsc -lX11 -lpthread -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lf2clapack -lf2cblas -lm -lm -lstdc++ -ldl 
-----------------------------------------


./ex19 -da_vec_type mpicusp -da_mat_type mpiaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 400 -da_grid_y 400 -log_summary -mat_no_inode -preload off  -cusp_synchronize -cuda_set_device 0

lid velocity = 6.25e-06, prandtl # = 1, grashof # = 1
Number of SNES iterations = 2
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex19 on a gpu00CCT- named gpu00.cct.lsu.edu with 1 processor, by sgu Sat Sep 17 20:42:05 2011
Using Petsc Development HG revision: 94fea4d40b1fcca2e886a14e7fdb916b8f6fecf3  HG Date: Sat Sep 17 00:48:29 2011 -0500

                         Max       Max/Min        Avg      Total 
Time (sec):           1.909e+02      1.00000   1.909e+02
Objects:              1.320e+02      1.00000   1.320e+02
Flops:                7.167e+11      1.00000   7.167e+11  7.167e+11
Flops/sec:            3.753e+09      1.00000   3.753e+09  3.753e+09
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00      0.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 4.4291e+00   2.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
 1:           SetUp: 1.0122e+00   0.5%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 
 2:           Solve: 1.8551e+02  97.2%  7.1669e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

PetscBarrier           1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

--- Event Stage 1: SetUp

MatAssemblyBegin       1 1.0 1.5974e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         1 1.0 3.1045e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0
MatFDColorCreate       1 1.0 3.1857e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  31  0  0  0  0     0

--- Event Stage 2: Solve

VecDot                 2 1.0 1.8530e-03 1.0 2.56e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1382
VecMDot            10031 1.0 5.4102e+01 1.0 2.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 28 28  0  0  0  29 28  0  0  0  3739
VecNorm            10370 1.0 8.5987e+00 1.0 2.65e+10 1.0 0.0e+00 0.0e+00 0.0e+00  5  4  0  0  0   5  4  0  0  0  3087
VecScale           10366 1.0 2.9179e+00 1.0 6.63e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  2274
VecCopy            10079 1.0 2.9971e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
VecSet               337 1.0 7.6832e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              375 1.0 2.3210e-01 1.0 4.80e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2068
VecWAXPY             335 1.0 1.1250e-01 1.0 2.14e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1906
VecMAXPY           10366 1.0 1.5716e+01 1.0 2.15e+11 1.0 0.0e+00 0.0e+00 0.0e+00  8 30  0  0  0   8 30  0  0  0 13687
VecScatterBegin    10371 1.0 2.5508e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 13  0  0  0  0  14  0  0  0  0     0
VecReduceArith         2 1.0 8.3668e-03 1.0 2.56e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   306
VecReduceComm          1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecCUSPCopyTo      10414 1.0 1.5341e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecCUSPCopyFrom    10409 1.0 2.5585e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 13  0  0  0  0  14  0  0  0  0     0
SNESSolve              1 1.0 1.8546e+02 1.0 7.17e+11 1.0 0.0e+00 0.0e+00 0.0e+00 97100  0  0  0 100100  0  0  0  3864
SNESLineSearch         2 1.0 6.2440e-02 1.0 9.33e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1495
SNESFunctionEval       3 1.0 3.0468e-02 1.0 4.03e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1323
SNESJacobianEval       2 1.0 3.7313e+00 1.0 6.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   165
KSPGMRESOrthog     10031 1.0 6.8969e+01 1.0 4.05e+11 1.0 0.0e+00 0.0e+00 0.0e+00 36 56  0  0  0  37 56  0  0  0  5865
KSPSetup               2 1.0 4.6015e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               2 1.0 1.8163e+02 1.0 7.16e+11 1.0 0.0e+00 0.0e+00 0.0e+00 95100  0  0  0  98100  0  0  0  3942
PCSetUp                2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply            10031 1.0 2.9429e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatMult            10366 1.0 9.7032e+01 1.0 2.65e+11 1.0 0.0e+00 0.0e+00 0.0e+00 51 37  0  0  0  52 37  0  0  0  2729
MatAssemblyBegin       2 1.0 2.1935e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 5.1729e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries         2 1.0 3.1707e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatFDColorApply        2 1.0 3.7312e+00 1.0 6.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   165
MatFDColorFunc        42 1.0 2.3831e-01 1.0 5.64e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2369
MatCUSPCopyTo          4 1.0 1.3754e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage


--- Event Stage 1: SetUp

    Distributed Mesh     1              0            0     0
              Vector    11              3         4424     0
      Vector Scatter     4              0            0     0
           Index Set    29              9       646600     0
   IS L to G Mapping     3              0            0     0
                SNES     1              0            0     0
       Krylov Solver     2              1         1064     0
      Preconditioner     2              1          752     0
              Matrix     3              0            0     0
  Matrix FD Coloring     1              0            0     0

--- Event Stage 2: Solve

    Distributed Mesh     0              1      3204840     0
              Vector    74             82    210042416     0
      Vector Scatter     0              4         2448     0
           Index Set     0             20      2574720     0
   IS L to G Mapping     0              3      2561668     0
                SNES     0              1         1288     0
       Krylov Solver     0              1        18864     0
      Preconditioner     0              1          952     0
              Matrix     0              3    173780068     0
  Matrix FD Coloring     0              1    104756468     0
              Viewer     1              0            0     0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
#PETSc Option Table entries:
-cuda_set_device 0
-cusp_synchronize
-da_grid_x 400
-da_grid_y 400
-da_mat_type mpiaijcusp
-da_vec_type mpicusp
-dmmg_nlevels 1
-log_summary
-mat_no_inode
-pc_type none
-preload off
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Sat Sep 17 11:25:49 2011
Configure options: PETSC_DIR=/home/sgu/softwares/petsc-dev PETSC_ARCH=gpu00CCT-cxx-nompi-release -with-clanguage=cxx --with-mpi=0 --download-f2cblaslapack=1 --download-f-blas-lapack=1 --with-debugging=0 --with-c2html=0 --with-valgrind-dir=~/softwares/valgrind --with-cuda=1 --with-cusp=1 --with-thrust=1 --with-cuda-arch=sm_20
-----------------------------------------
Libraries compiled on Sat Sep 17 11:25:49 2011 on gpu00.cct.lsu.edu 
Machine characteristics: Linux-2.6.32-131.6.1.el6.x86_64-x86_64-with-redhat-6.1-Santiago
Using PETSc directory: /home/sgu/softwares/petsc-dev
Using PETSc arch: gpu00CCT-cxx-nompi-release
-----------------------------------------

Using C compiler: g++  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O     ${COPTFLAGS} ${CFLAGS}
-----------------------------------------

Using include paths: -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/include -I/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/include -I/home/sgu/softwares/valgrind/include -I/usr/local/cuda/include -I/home/sgu/softwares/petsc-dev/include/mpiuni
-----------------------------------------

Using C linker: g++
Using libraries: -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lpetsc -lX11 -lpthread -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart -Wl,-rpath,/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -L/home/sgu/softwares/petsc-dev/gpu00CCT-cxx-nompi-release/lib -lf2clapack -lf2cblas -lm -lm -lstdc++ -ldl 
-----------------------------------------

