[petsc-dev] [petsc-maint #87339] Re: ex19 on GPU
Satish Balay
balay at mcs.anl.gov
Mon Sep 19 10:44:23 CDT 2011
Attached is the output from the run on breadboard. It has 2 "nVidia
Corporation GT200 [Tesla C1060]" cards.
Satish
--------
balay at bb30:~/petsc-dev/src/snes/examples/tutorials>./ex19 -da_vec_type seqcusp -da_mat_type seqaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 100 -da_grid_y 100 -mat_no_inode -preload off -cusp_synchronize -cuda_set_device 0 -log_summary ex19.cuda.log
lid velocity = 0.0001, prandtl # = 1, grashof # = 1
Number of SNES iterations = 2
balay at bb30:~/petsc-dev/src/snes/examples/tutorials>
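For reference, the -da_vec_type seqcusp and -da_mat_type seqaijcusp options in the run above select the GPU (CUSP) implementations for the vectors and matrix the DA hands out. The same type names can also be requested directly in code; below is a minimal sketch, assuming a petsc-dev build configured with --with-cuda=1 --with-cusp=1 and the CUSP-era type names. It is illustrative only (a standalone Vec/Mat with made-up sizes and preallocation), not code taken from ex19, which goes through the DA/DMMG options instead.

  #include <petsc.h>

  int main(int argc,char **argv)
  {
    Vec            x;
    Mat            A;
    PetscInt       n = 100*100;   /* illustration only: one field on a 100x100 grid */
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc,&argv,PETSC_NULL,PETSC_NULL);CHKERRQ(ierr);

    ierr = VecCreate(PETSC_COMM_SELF,&x);CHKERRQ(ierr);
    ierr = VecSetSizes(x,PETSC_DECIDE,n);CHKERRQ(ierr);
    ierr = VecSetType(x,"seqcusp");CHKERRQ(ierr);        /* vector data lives on the GPU */

    ierr = MatCreate(PETSC_COMM_SELF,&A);CHKERRQ(ierr);
    ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,n,n);CHKERRQ(ierr);
    ierr = MatSetType(A,"seqaijcusp");CHKERRQ(ierr);     /* MatMult runs through CUSP on the GPU */
    ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);

    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return 0;
  }

With -log_summary the per-event Mflop/s column then shows which operations actually ran at device rates, which is what the discussion below is about.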
On Sun, 18 Sep 2011, Barry Smith wrote:
>
>
> Ok, the copies up and down are not a problem.
>
> Except for VecMAXPY(), the vector operations are terrible (as if they are not using the GPU, but they must be?). The MatMult() must be on the GPU because it is pretty good at 2779 Mflop/s???
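(Reading aid: the last column of the log is simply flops divided by max time, reported in Mflop/s. From the rows below,

     MatMult:  3.32e+09 flops / 1.1950e+00 s  ~ 2779 Mflop/s
     VecNorm:  1.68e+08 flops / 1.5544e+00 s  ~  108 Mflop/s

so the sparse MatMult is running at a plausibly GPU-like rate while most of the vector kernels are an order of magnitude slower.)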
>
> Does someone else have access to a similar system, and can they run the exact same test to see what numbers they get? Satish, could you run it on breadboard? Maybe on Magellan :-)
>
>
> Barry
>
>
>
> VecDot 2 1.0 1.7049e-03 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 94
> VecMDot 2024 1.0 8.6273e+00 1.0 2.54e+09 1.0 0.0e+00 0.0e+00 0.0e+00 50 29 0 0 0 66 29 0 0 0 295
> VecNorm 2096 1.0 1.5544e+00 1.0 1.68e+08 1.0 0.0e+00 0.0e+00 0.0e+00 9 2 0 0 0 12 2 0 0 0 108
> VecScale 2092 1.0 3.7774e-01 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 0.0e+00 2 1 0 0 0 3 1 0 0 0 222
> VecCopy 2072 1.0 3.8258e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 3 0 0 0 0 0
> VecSet 70 1.0 1.3119e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY 108 1.0 4.7407e-02 1.0 8.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 182
> VecWAXPY 68 1.0 1.2545e-02 1.0 2.72e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 217
> VecMAXPY 2092 1.0 6.4464e-01 1.0 2.71e+09 1.0 0.0e+00 0.0e+00 0.0e+00 4 31 0 0 0 5 31 0 0 0 4198
> VecScatterBegin 5 1.0 1.5609e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecReduceArith 2 1.0 3.8650e-03 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 41
> VecReduceComm 1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecCUSPCopyTo 49 1.0 3.0950e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecCUSPCopyFrom 44 1.0 2.0876e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> SNESSolve 1 1.0 1.3044e+01 1.0 8.87e+09 1.0 0.0e+00 0.0e+00 0.0e+00 75100 0 0 0 100100 0 0 0 680
> SNESLineSearch 2 1.0 1.1921e-02 1.0 5.49e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 461
> SNESFunctionEval 3 1.0 2.7192e-03 1.0 2.52e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 927
> SNESJacobianEval 2 1.0 2.0424e-01 1.0 3.85e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 2 0 0 0 0 188
> KSPGMRESOrthog 2024 1.0 9.2522e+00 1.0 5.09e+09 1.0 0.0e+00 0.0e+00 0.0e+00 53 57 0 0 0 71 57 0 0 0 550
> KSPSetup 2 1.0 5.1975e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 2 1.0 1.2819e+01 1.0 8.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00 74 99 0 0 0 98 99 0 0 0 689
> PCSetUp 2 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> PCApply 2024 1.0 3.8054e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 3 0 0 0 0 0
> MatMult 2092 1.0 1.1950e+00 1.0 3.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00 7 37 0 0 0 9 37 0 0 0 2779
>
> On Sep 18, 2011, at 10:29 AM, Shiyuan wrote:
>
> >
> >
> > On Sat, Sep 17, 2011 at 10:48 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> > Run the first one with -da_vec_type seqcusp and -da_mat_type seqaijcusp
> >
> > > VecScatterBegin 2097 1.0 1.0270e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 5 0 0 0 0 7 0 0 0 0 0
> > > VecCUSPCopyTo 2140 1.0 2.4991e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 2 0 0 0 0 0
> > > VecCUSPCopyFrom 2135 1.0 1.0437e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 5 0 0 0 0 7 0 0 0 0 0
> >
> > Why is it doing all these vector copies up and down? It is run on one process; it shouldn't be doing more than a handful total.
> >
> > Barry
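(For context on the question above: the VecCUSPCopyTo/VecCUSPCopyFrom events count host<->device transfers, and a transfer is typically triggered whenever host code touches a GPU vector's array. The following minimal sketch shows the pattern that produces such events; it assumes the CUSP-era "seqcusp" type and standard Vec calls, and is an illustration of the mechanism, not code from ex19.

  #include <petsc.h>

  int main(int argc,char **argv)
  {
    Vec            x;
    PetscScalar    *a;
    PetscReal      nrm;
    PetscInt       i,n = 10000;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc,&argv,PETSC_NULL,PETSC_NULL);CHKERRQ(ierr);
    ierr = VecCreate(PETSC_COMM_SELF,&x);CHKERRQ(ierr);
    ierr = VecSetSizes(x,PETSC_DECIDE,n);CHKERRQ(ierr);
    ierr = VecSetType(x,"seqcusp");CHKERRQ(ierr);
    ierr = VecSet(x,1.0);CHKERRQ(ierr);              /* runs on the GPU */

    ierr = VecGetArray(x,&a);CHKERRQ(ierr);          /* GPU -> host copy, logged as VecCUSPCopyFrom */
    for (i=0; i<n; i++) a[i] += 1.0;                 /* host-side update marks the GPU copy stale */
    ierr = VecRestoreArray(x,&a);CHKERRQ(ierr);

    ierr = VecNorm(x,NORM_2,&nrm);CHKERRQ(ierr);     /* host -> GPU copy first, logged as VecCUSPCopyTo */

    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return 0;
  }

If some code path in the solve touches the host array like this on every iteration, the copy counts grow with the iteration count rather than staying at a handful, which would be consistent with the ~2100 copies in the numbers quoted below.)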
> >
> > ./ex19 -da_vec_type seqcusp -da_mat_type seqaijcusp -pc_type none -dmmg_nlevels 1 -da_grid_x 100 -da_grid_y 100 -log_summary -mat_no_inode -preload off -cusp_synchronize -cuda_set_device 0 | tee ex19p2.txt
> >
> > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
> > Avg %Total Avg %Total counts %Total Avg %Total counts %Total
> > 0: Main Stage: 4.2393e+00 24.4% 0.0000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
> > 1: SetUp: 4.9079e-02 0.3% 0.0000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
> > 2: Solve: 1.3071e+01 75.3% 8.8712e+09 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
> >
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > VecScatterBegin 5 1.0 1.5609e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > VecReduceArith 2 1.0 3.8650e-03 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 41
> > VecReduceComm 1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > VecCUSPCopyTo 49 1.0 3.0950e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > VecCUSPCopyFrom 44 1.0 2.0876e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> >
> > The complete log is attached. Thanks.
> > <ex19p2.txt>
>
>
-------------- next part --------------
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./ex19 on a arch-cuda named bb30 with 1 processor, by balay Mon Sep 19 10:41:28 2011
Using Petsc Development HG revision: 0c1d30b63d8488b9b083d69444e587dbdd98ebee HG Date: Sun Sep 18 11:45:23 2011 -0700
Max Max/Min Avg Total
Time (sec): 6.106e+00 1.00000 6.106e+00
Objects: 1.260e+02 1.00000 1.260e+02
Flops: 8.871e+09 1.00000 8.871e+09 8.871e+09
Flops/sec: 1.453e+09 1.00000 1.453e+09 1.453e+09
Memory: 2.505e+07 1.00000 2.505e+07
MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Reductions: 3.384e+04 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 1.6191e+00 26.5% 0.0000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
1: SetUp: 1.2512e-01 2.0% 0.0000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0% 7.800e+01 0.2%
2: Solve: 4.3615e+00 71.4% 8.8712e+09 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 3.376e+04 99.8%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option, #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
PetscBarrier 1 1.0 5.0068e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
--- Event Stage 1: SetUp
MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 1 1.0 3.4699e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0
MatFDColorCreate 1 1.0 4.1661e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.2e+01 1 0 0 0 0 33 0 0 0 41 0
--- Event Stage 2: Solve
VecDot 2 1.0 1.2088e-04 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1324
VecMDot 2024 1.0 2.1392e+00 1.0 2.54e+09 1.0 0.0e+00 0.0e+00 0.0e+00 35 29 0 0 0 49 29 0 0 0 1189
VecNorm 2096 1.0 1.1928e-01 1.0 1.68e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 2 0 0 0 3 2 0 0 0 1406
VecScale 2092 1.0 5.2948e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 1580
VecCopy 2072 1.0 6.9294e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 2 0 0 0 0 0
VecSet 70 1.0 1.7152e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 108 1.0 1.3336e-02 1.0 8.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 648
VecWAXPY 68 1.0 2.0800e-03 1.0 2.72e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1308
VecMAXPY 2092 1.0 2.9396e-01 1.0 2.71e+09 1.0 0.0e+00 0.0e+00 0.0e+00 5 31 0 0 0 7 31 0 0 0 9205
VecScatterBegin 5 1.0 1.7390e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecReduceArith 2 1.0 9.2983e-04 1.0 1.60e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 172
VecReduceComm 1 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecCUSPCopyTo 49 1.0 9.9115e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecCUSPCopyFrom 44 1.0 1.4175e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SNESSolve 1 1.0 4.3493e+00 1.0 8.87e+09 1.0 0.0e+00 0.0e+00 3.4e+04 71100 0 0100 100100 0 0100 2040
SNESLineSearch 2 1.0 8.4569e-03 1.0 5.49e+06 1.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 650
SNESFunctionEval 3 1.0 6.6361e-03 1.0 2.52e+06 1.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 380
SNESJacobianEval 2 1.0 5.0695e-01 1.0 3.85e+07 1.0 0.0e+00 0.0e+00 4.3e+01 8 0 0 0 0 12 0 0 0 0 76
KSPGMRESOrthog 2024 1.0 2.4282e+00 1.0 5.09e+09 1.0 0.0e+00 0.0e+00 3.1e+04 40 57 0 0 92 56 57 0 0 93 2095
KSPSetup 2 1.0 2.9206e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+01 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 2 1.0 3.8301e+00 1.0 8.83e+09 1.0 0.0e+00 0.0e+00 3.4e+04 63 99 0 0100 88 99 0 0100 2304
PCSetUp 2 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
PCApply 2024 1.0 6.3726e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatMult 2092 1.0 1.1330e+00 1.0 3.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00 19 37 0 0 0 26 37 0 0 0 2931
MatAssemblyBegin 2 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 2 1.0 6.8700e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 2 1.0 1.8141e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatFDColorApply 2 1.0 5.0662e-01 1.0 3.85e+07 1.0 0.0e+00 0.0e+00 4.3e+01 8 0 0 0 0 12 0 0 0 0 76
MatFDColorFunc 42 1.0 5.3421e-02 1.0 3.53e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 660
MatCUSPCopyTo 2 1.0 9.9909e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Viewer 1 0 0 0
--- Event Stage 1: SetUp
Distributed Mesh 1 0 0 0
Vector 9 2 2928 0
Vector Scatter 3 0 0 0
Index Set 27 7 45136 0
IS L to G Mapping 3 0 0 0
SNES 1 0 0 0
Krylov Solver 2 1 1064 0
Preconditioner 2 1 752 0
Matrix 1 0 0 0
Matrix FD Coloring 1 0 0 0
--- Event Stage 2: Solve
Distributed Mesh 0 1 204840 0
Vector 74 81 3636056 0
Vector Scatter 0 3 1836 0
Index Set 0 20 174720 0
IS L to G Mapping 0 3 161668 0
SNES 0 1 1288 0
Krylov Solver 0 1 18864 0
Preconditioner 0 1 952 0
Matrix 0 1 10165692 0
Matrix FD Coloring 0 1 708 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 1.90735e-07
#PETSc Option Table entries:
-cuda_set_device 0
-cusp_synchronize
-da_grid_x 100
-da_grid_y 100
-da_mat_type seqaijcusp
-da_vec_type seqcusp
-dmmg_nlevels 1
-log_summary ex19.cuda.log
-malloc_dump
-mat_no_inode
-nox
-nox_warning
-pc_type none
-preload off
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Sun Sep 18 18:38:34 2011
Configure options: --with-cc=gcc --with-cxx=g++ --download-mpich=1 --with-cuda=1 --with-cusp=1 --with-thrust=1 PETSC_ARCH=arch-cuda-double --with-precision=double --with-fc=0 --with-clanguage=c --with-cuda-arch=sm_13
-----------------------------------------
Libraries compiled on Sun Sep 18 18:38:34 2011 on bb30
Machine characteristics: Linux-2.6.32-24-generic-x86_64-with-Ubuntu-10.04-lucid
Using PETSc directory: /home/balay/petsc-dev
Using PETSc arch: arch-cuda-double
-----------------------------------------
Using C compiler: /home/balay/petsc-dev/arch-cuda-double/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g3 ${COPTFLAGS} ${CFLAGS}
-----------------------------------------
Using include paths: -I/home/balay/petsc-dev/arch-cuda-double/include -I/home/balay/petsc-dev/include -I/home/balay/petsc-dev/include -I/home/balay/petsc-dev/arch-cuda-double/include -I/usr/local/cuda/include
-----------------------------------------
Using C linker: /home/balay/petsc-dev/arch-cuda-double/bin/mpicc
Using libraries: -Wl,-rpath,/home/balay/petsc-dev/arch-cuda-double/lib -L/home/balay/petsc-dev/arch-cuda-double/lib -lpetsc -lX11 -lpthread -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart -llapack -lblas -lm -lmpichcxx -lstdc++ -ldl
-----------------------------------------