[petsc-dev] KSP run in parallel with GPU
Matthew Knepley
knepley at gmail.com
Mon Apr 25 09:34:35 CDT 2011
On Mon, Apr 25, 2011 at 9:30 AM, Eugene Kozlov <neoveneficus at gmail.com> wrote:
> I have never used the mailing list before. How can I best share the logs?
> Should I attach a file to the mail?
1) You attached 3 runs, but all of them use only 1 processor (see the example run below)
2) This matrix looks pretty small
3) You are still only getting about 2 GF/s (the MatMult rate)
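A run that would give more useful numbers could look something like the
following (a sketch only: 'mpiexec' stands in for your cleo-submit launcher,
and the rank count, the larger -m/-n, and the log file name are just
illustrative; the solver options are the ones you already use):

   mpiexec -np 4 ./ex2 -ksp_type cg -vec_type mpicusp -mat_type mpiaijcusp \
       -m 2000 -n 2000 -log_summary > run_np4.log 2>&1

Attach the resulting log as plain text. Also rebuild PETSc with
--with-debugging=no before timing, as the warning in your own log output
already suggests.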
Matt
> Norm of error 0.00012322 iterations 448
> Norm of error 0.00012322 iterations 448
> Norm of error 0.00012322 iterations 448
>
> ************************************************************************************************************************
> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
> -fCourier9' to print this document ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance
> Summary: ----------------------------------------------
>
> ex2 on a arch-linu named cn11 with 1 processor, by kukushkinav Mon Apr
> 25 18:24:15 2011
> Using Petsc Development HG revision:
> d3e10315d68b1dd5481adb2889c7d354880da362 HG Date: Wed Apr 20 21:03:56
> 2011 -0500
>
> Max Max/Min Avg Total
> Time (sec): 3.891e+01 1.00000 3.891e+01
> Objects: 2.500e+01 1.00000 2.500e+01
> Flops: 9.474e+09 1.00000 9.474e+09 9.474e+09
> Flops/sec: 2.435e+08 1.00000 2.435e+08 2.435e+08
> Memory: 1.562e+08 1.00000 1.562e+08
> MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Reductions: 5.881e+03 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total    counts   %Total     Avg        %Total    counts   %Total
>  0:      Main Stage: 3.8368e+01  98.6%  9.4739e+09 100.0%  0.000e+00   0.0%  0.000e+00       0.0%   5.855e+03  99.6%
>  1:        Assembly: 5.3823e-01   1.4%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00       0.0%   1.200e+01   0.2%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush()
> and PetscLogStagePop().
>      %T - percent time in this phase         %F - percent flops in this phase
>      %M - percent messages in this phase     %L - percent message lengths in this phase
>      %R - percent reductions in this phase
>   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was compiled with a debugging option, #
> # To get timing results run ./configure #
> # using --with-debugging=no, the performance will #
> # be generally two or three times faster. #
> # #
> ##########################################################
>
>
> Event                Count      Time (sec)     Flops                          --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult              449 1.0 1.4191e+00 1.0 2.87e+09 1.0 0.0e+00 0.0e+00 0.0e+00  4 30  0  0  0   4 30  0  0  0  2023
> MatSolve             449 1.0 1.3127e+01 1.0 2.58e+09 1.0 0.0e+00 0.0e+00 0.0e+00 34 27  0  0  0  34 27  0  0  0   197
> MatCholFctrNum         1 1.0 1.7007e-01 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     4
> MatICCFactorSym        1 1.0 1.1560e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.0204e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatCUSPCopyTo          2 1.0 2.0546e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecDot               896 1.0 1.0086e+00 1.0 1.15e+09 1.0 0.0e+00 0.0e+00 1.8e+03  3 12  0  0 30   3 12  0  0 31  1137
> VecNorm              450 1.0 8.4599e-01 1.0 1.15e+09 1.0 0.0e+00 0.0e+00 9.0e+02  2 12  0  0 15   2 12  0  0 15  1362
> VecCopy                2 1.0 2.6081e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet               451 1.0 2.7629e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY              897 1.0 2.1131e-01 1.0 1.15e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 12  0  0  0   1 12  0  0  0  5434
> VecAYPX              447 1.0 1.0049e-01 1.0 5.72e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  6  0  0  0   0  6  0  0  0  5694
> VecScatterBegin      449 1.0 6.8694e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecCUSPCopyTo       1346 1.0 1.2865e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
> VecCUSPCopyFrom     1346 1.0 2.2437e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0
> KSPSetup               2 1.0 3.1233e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 1.8947e+01 1.0 9.46e+09 1.0 0.0e+00 0.0e+00 5.8e+03 49100  0  0 99  49100  0  0100   499
> PCSetUp                2 1.0 3.8846e-01 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 7.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCSetUpOnBlocks        1 1.0 3.8828e-01 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 5.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCApply              449 1.0 1.4987e+01 1.0 2.58e+09 1.0 0.0e+00 0.0e+00 1.8e+03 39 27  0  0 31  39 27  0  0 31   172
>
> --- Event Stage 1: Assembly
>
> MatAssemblyBegin       1 1.0 2.3842e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0 17     0
> MatAssemblyEnd         1 1.0 7.0319e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01  0  0  0  0  0  13  0  0  0 83     0
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Matrix 4 4 120310620 0
> Vec 8 9 20493272 0
> Vec Scatter 0 1 596 0
> Index Set 3 3 2562160 0
> Krylov Solver 2 2 2048 0
> Preconditioner 2 2 1688 0
> Viewer 1 0 0 0
>
> --- Event Stage 1: Assembly
>
> Vec 2 1 1496 0
> Vec Scatter 1 0 0 0
> Index Set 2 2 1432 0
>
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> #PETSc Option Table entries:
> -ksp_type cg
> -log_summary
> -m 800
> -mat_type mpiaijcusp
> -n 800
> -vec_type mpicusp
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Mon Apr 25 12:42:34 2011
> Configure options: --prefix=/home/kukushkinav
> --with-blas-lapack-dir=/opt/intel/composerxe-2011.0.084/mkl
> --with-mpi-dir=/opt/intel/impi/4.0.1.007/intel64/bin --with-cuda=1
> --with-cusp=1 --with-thrust=1
> --with-thrust-dir=/home/kukushkinav/include
> --with-cusp-dir=/home/kukushkinav/include --with-cuda-arch=sm_13
> -----------------------------------------
> Libraries compiled on Mon Apr 25 12:42:34 2011 on manager
> Machine characteristics:
> Linux-2.6.18-238.5.1.el5-x86_64-with-redhat-5.6-Tikanga
> Using PETSc directory: /export/home/kukushkinav/soft/petsc-dev
> Using PETSc arch: arch-linux2-c-debug
> -----------------------------------------
>
> Using C compiler: mpicc -Wall -Wwrite-strings -Wno-strict-aliasing
> -Wno-unknown-pragmas -g3 ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: mpif90 -Wall -Wno-unused-variable -g
> ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
>
> Using include paths:
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/usr/local/cuda/include -I/home/kukushkinav/include/
> -I/opt/intel/impi/4.0.1.007/intel64/bin/include
> -I/opt/intel/impi/4.0.1.007/intel64/include
> -----------------------------------------
>
> Using C linker: mpicc
> Using Fortran linker: mpif90
> Using libraries:
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -L/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec
> -lpetscsys -lX11 -Wl,-rpath,/usr/local/cuda/lib64
> -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart
> -Wl,-rpath,/opt/intel/composerxe-2011.0.084/mkl
> -L/opt/intel/composerxe-2011.0.084/mkl -lmkl_intel_lp64
> -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
> -L/opt/intel/impi/4.0.1.007/intel64/lib
> -L/opt/intel/composerxe-2011.0.084/compiler/lib/intel64
> -L/opt/intel/composerxe-2011.0.084/mkl/lib/intel64
> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -lmpi -lmpigf -lmpigi
> -lpthread -lrt -lgcc_s
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/-Xlinker -lmpi_dbg
> -lgfortran -lm -Wl,-rpath,/opt/intel/impi/4.0.1.007/intel64/lib
> -Wl,-rpath,/opt/intel/mpi-rt/4.0.1 -lm -lmpigc4 -lmpi_dbg -lstdc++
> -lmpigc4 -lmpi_dbg -lstdc++ -ldl -lmpi -lmpigf -lmpigi -lpthread -lrt
> -lgcc_s -ldl
> -----------------------------------------
>
>
> ************************************************************************************************************************
> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
> -fCourier9' to print this document ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance
> Summary: ----------------------------------------------
>
> ex2 on a arch-linu named cn11 with 1 processor, by kukushkinav Mon Apr
> 25 18:24:15 2011
> Using Petsc Development HG revision:
> d3e10315d68b1dd5481adb2889c7d354880da362 HG Date: Wed Apr 20 21:03:56
> 2011 -0500
>
> Max Max/Min Avg Total
> Time (sec): 3.892e+01 1.00000 3.892e+01
> Objects: 2.500e+01 1.00000 2.500e+01
> Flops: 9.474e+09 1.00000 9.474e+09 9.474e+09
> Flops/sec: 2.434e+08 1.00000 2.434e+08 2.434e+08
> Memory: 1.562e+08 1.00000 1.562e+08
> MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Reductions: 5.881e+03 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total    counts   %Total     Avg        %Total    counts   %Total
>  0:      Main Stage: 3.8368e+01  98.6%  9.4739e+09 100.0%  0.000e+00   0.0%  0.000e+00       0.0%   5.855e+03  99.6%
>  1:        Assembly: 5.4742e-01   1.4%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00       0.0%   1.200e+01   0.2%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush()
> and PetscLogStagePop().
>      %T - percent time in this phase         %F - percent flops in this phase
>      %M - percent messages in this phase     %L - percent message lengths in this phase
>      %R - percent reductions in this phase
>   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was compiled with a debugging option, #
> # To get timing results run ./configure #
> # using --with-debugging=no, the performance will #
> # be generally two or three times faster. #
> # #
> ##########################################################
>
>
> Event                Count      Time (sec)     Flops                          --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult              449 1.0 1.4400e+00 1.0 2.87e+09 1.0 0.0e+00 0.0e+00 0.0e+00  4 30  0  0  0   4 30  0  0  0  1994
> MatSolve             449 1.0 1.3012e+01 1.0 2.58e+09 1.0 0.0e+00 0.0e+00 0.0e+00 33 27  0  0  0  34 27  0  0  0   199
> MatCholFctrNum         1 1.0 1.7184e-01 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     4
> MatICCFactorSym        1 1.0 1.1631e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.0190e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatCUSPCopyTo          2 1.0 2.1061e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecDot               896 1.0 1.0891e+00 1.0 1.15e+09 1.0 0.0e+00 0.0e+00 1.8e+03  3 12  0  0 30   3 12  0  0 31  1053
> VecNorm              450 1.0 1.1033e+00 1.0 1.15e+09 1.0 0.0e+00 0.0e+00 9.0e+02  3 12  0  0 15   3 12  0  0 15  1044
> VecCopy                2 1.0 2.7471e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet               451 1.0 2.5600e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY              897 1.0 1.9726e-01 1.0 1.15e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 12  0  0  0   1 12  0  0  0  5821
> VecAYPX              447 1.0 1.0141e-01 1.0 5.72e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  6  0  0  0   0  6  0  0  0  5642
> VecScatterBegin      449 1.0 6.9110e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecCUSPCopyTo       1346 1.0 1.5747e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
> VecCUSPCopyFrom     1346 1.0 2.0444e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   5  0  0  0  0     0
> KSPSetup               2 1.0 4.0770e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 1.8957e+01 1.0 9.46e+09 1.0 0.0e+00 0.0e+00 5.8e+03 49100  0  0 99  49100  0  0100   499
> PCSetUp                2 1.0 3.9091e-01 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 7.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCSetUpOnBlocks        1 1.0 3.9070e-01 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 5.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCApply              449 1.0 1.4649e+01 1.0 2.58e+09 1.0 0.0e+00 0.0e+00 1.8e+03 38 27  0  0 31  38 27  0  0 31   176
>
> --- Event Stage 1: Assembly
>
> MatAssemblyBegin       1 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0 17     0
> MatAssemblyEnd         1 1.0 7.0405e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01  0  0  0  0  0  13  0  0  0 83     0
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Matrix 4 4 120310620 0
> Vec 8 9 20493272 0
> Vec Scatter 0 1 596 0
> Index Set 3 3 2562160 0
> Krylov Solver 2 2 2048 0
> Preconditioner 2 2 1688 0
> Viewer 1 0 0 0
>
> --- Event Stage 1: Assembly
>
> Vec 2 1 1496 0
> Vec Scatter 1 0 0 0
> Index Set 2 2 1432 0
>
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> #PETSc Option Table entries:
> -ksp_type cg
> -log_summary
> -m 800
> -mat_type mpiaijcusp
> -n 800
> -vec_type mpicusp
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Mon Apr 25 12:42:34 2011
> Configure options: --prefix=/home/kukushkinav
> --with-blas-lapack-dir=/opt/intel/composerxe-2011.0.084/mkl
> --with-mpi-dir=/opt/intel/impi/4.0.1.007/intel64/bin --with-cuda=1
> --with-cusp=1 --with-thrust=1
> --with-thrust-dir=/home/kukushkinav/include
> --with-cusp-dir=/home/kukushkinav/include --with-cuda-arch=sm_13
> -----------------------------------------
> Libraries compiled on Mon Apr 25 12:42:34 2011 on manager
> Machine characteristics:
> Linux-2.6.18-238.5.1.el5-x86_64-with-redhat-5.6-Tikanga
> Using PETSc directory: /export/home/kukushkinav/soft/petsc-dev
> Using PETSc arch: arch-linux2-c-debug
> -----------------------------------------
>
> Using C compiler: mpicc -Wall -Wwrite-strings -Wno-strict-aliasing
> -Wno-unknown-pragmas -g3 ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: mpif90 -Wall -Wno-unused-variable -g
> ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
>
> Using include paths:
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/usr/local/cuda/include -I/home/kukushkinav/include/
> -I/opt/intel/impi/4.0.1.007/intel64/bin/include
> -I/opt/intel/impi/4.0.1.007/intel64/include
> -----------------------------------------
>
> Using C linker: mpicc
> Using Fortran linker: mpif90
> Using libraries:
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -L/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec
> -lpetscsys -lX11 -Wl,-rpath,/usr/local/cuda/lib64
> -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart
> -Wl,-rpath,/opt/intel/composerxe-2011.0.084/mkl
> -L/opt/intel/composerxe-2011.0.084/mkl -lmkl_intel_lp64
> -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
> -L/opt/intel/impi/4.0.1.007/intel64/lib
> -L/opt/intel/composerxe-2011.0.084/compiler/lib/intel64
> -L/opt/intel/composerxe-2011.0.084/mkl/lib/intel64
> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -lmpi -lmpigf -lmpigi
> -lpthread -lrt -lgcc_s
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/-Xlinker -lmpi_dbg
> -lgfortran -lm -Wl,-rpath,/opt/intel/impi/4.0.1.007/intel64/lib
> -Wl,-rpath,/opt/intel/mpi-rt/4.0.1 -lm -lmpigc4 -lmpi_dbg -lstdc++
> -lmpigc4 -lmpi_dbg -lstdc++ -ldl -lmpi -lmpigf -lmpigi -lpthread -lrt
> -lgcc_s -ldl
> -----------------------------------------
>
>
> ************************************************************************************************************************
> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
> -fCourier9' to print this document ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance
> Summary: ----------------------------------------------
>
> ex2 on a arch-linu named cn11 with 1 processor, by kukushkinav Mon Apr
> 25 18:24:16 2011
> Using Petsc Development HG revision:
> d3e10315d68b1dd5481adb2889c7d354880da362 HG Date: Wed Apr 20 21:03:56
> 2011 -0500
>
> Max Max/Min Avg Total
> Time (sec): 3.946e+01 1.00000 3.946e+01
> Objects: 2.500e+01 1.00000 2.500e+01
> Flops: 9.474e+09 1.00000 9.474e+09 9.474e+09
> Flops/sec: 2.401e+08 1.00000 2.401e+08 2.401e+08
> Memory: 1.562e+08 1.00000 1.562e+08
> MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Reductions: 5.881e+03 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total    counts   %Total     Avg        %Total    counts   %Total
>  0:      Main Stage: 3.8927e+01  98.6%  9.4739e+09 100.0%  0.000e+00   0.0%  0.000e+00       0.0%   5.855e+03  99.6%
>  1:        Assembly: 5.3425e-01   1.4%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00       0.0%   1.200e+01   0.2%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush()
> and PetscLogStagePop().
>      %T - percent time in this phase         %F - percent flops in this phase
>      %M - percent messages in this phase     %L - percent message lengths in this phase
>      %R - percent reductions in this phase
>   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was compiled with a debugging option, #
> # To get timing results run ./configure #
> # using --with-debugging=no, the performance will #
> # be generally two or three times faster. #
> # #
> ##########################################################
>
>
> Event                Count      Time (sec)     Flops                          --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult              449 1.0 1.4559e+00 1.0 2.87e+09 1.0 0.0e+00 0.0e+00 0.0e+00  4 30  0  0  0   4 30  0  0  0  1972
> MatSolve             449 1.0 1.2787e+01 1.0 2.58e+09 1.0 0.0e+00 0.0e+00 0.0e+00 32 27  0  0  0  33 27  0  0  0   202
> MatCholFctrNum         1 1.0 1.6867e-01 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     4
> MatICCFactorSym        1 1.0 1.1588e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.0179e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatCUSPCopyTo          2 1.0 1.9272e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecDot               896 1.0 1.1960e+00 1.0 1.15e+09 1.0 0.0e+00 0.0e+00 1.8e+03  3 12  0  0 30   3 12  0  0 31   959
> VecNorm              450 1.0 1.0563e+00 1.0 1.15e+09 1.0 0.0e+00 0.0e+00 9.0e+02  3 12  0  0 15   3 12  0  0 15  1091
> VecCopy                2 1.0 2.4669e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet               451 1.0 2.5303e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY              897 1.0 1.9844e-01 1.0 1.15e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 12  0  0  0   1 12  0  0  0  5786
> VecAYPX              447 1.0 1.0348e-01 1.0 5.72e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  6  0  0  0   0  6  0  0  0  5529
> VecScatterBegin      449 1.0 7.1386e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecCUSPCopyTo       1346 1.0 1.6784e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
> VecCUSPCopyFrom     1346 1.0 2.2173e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0
> KSPSetup               2 1.0 4.5061e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 1.8962e+01 1.0 9.46e+09 1.0 0.0e+00 0.0e+00 5.8e+03 48100  0  0 99  49100  0  0100   499
> PCSetUp                2 1.0 3.8716e-01 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 7.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCSetUpOnBlocks        1 1.0 3.8694e-01 1.0 6.40e+05 1.0 0.0e+00 0.0e+00 5.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCApply              449 1.0 1.4576e+01 1.0 2.58e+09 1.0 0.0e+00 0.0e+00 1.8e+03 37 27  0  0 31  37 27  0  0 31   177
>
> --- Event Stage 1: Assembly
>
> MatAssemblyBegin       1 1.0 5.1022e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0 17     0
> MatAssemblyEnd         1 1.0 6.9999e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01  0  0  0  0  0  13  0  0  0 83     0
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Matrix 4 4 120310620 0
> Vec 8 9 20493272 0
> Vec Scatter 0 1 596 0
> Index Set 3 3 2562160 0
> Krylov Solver 2 2 2048 0
> Preconditioner 2 2 1688 0
> Viewer 1 0 0 0
>
> --- Event Stage 1: Assembly
>
> Vec 2 1 1496 0
> Vec Scatter 1 0 0 0
> Index Set 2 2 1432 0
>
> ========================================================================================================================
> Average time to get PetscTime(): 0
> #PETSc Option Table entries:
> -ksp_type cg
> -log_summary
> -m 800
> -mat_type mpiaijcusp
> -n 800
> -vec_type mpicusp
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Mon Apr 25 12:42:34 2011
> Configure options: --prefix=/home/kukushkinav
> --with-blas-lapack-dir=/opt/intel/composerxe-2011.0.084/mkl
> --with-mpi-dir=/opt/intel/impi/4.0.1.007/intel64/bin --with-cuda=1
> --with-cusp=1 --with-thrust=1
> --with-thrust-dir=/home/kukushkinav/include
> --with-cusp-dir=/home/kukushkinav/include --with-cuda-arch=sm_13
> -----------------------------------------
> Libraries compiled on Mon Apr 25 12:42:34 2011 on manager
> Machine characteristics:
> Linux-2.6.18-238.5.1.el5-x86_64-with-redhat-5.6-Tikanga
> Using PETSc directory: /export/home/kukushkinav/soft/petsc-dev
> Using PETSc arch: arch-linux2-c-debug
> -----------------------------------------
>
> Using C compiler: mpicc -Wall -Wwrite-strings -Wno-strict-aliasing
> -Wno-unknown-pragmas -g3 ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: mpif90 -Wall -Wno-unused-variable -g
> ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
>
> Using include paths:
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/usr/local/cuda/include -I/home/kukushkinav/include/
> -I/opt/intel/impi/4.0.1.007/intel64/bin/include
> -I/opt/intel/impi/4.0.1.007/intel64/include
> -----------------------------------------
>
> Using C linker: mpicc
> Using Fortran linker: mpif90
> Using libraries:
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -L/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec
> -lpetscsys -lX11 -Wl,-rpath,/usr/local/cuda/lib64
> -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart
> -Wl,-rpath,/opt/intel/composerxe-2011.0.084/mkl
> -L/opt/intel/composerxe-2011.0.084/mkl -lmkl_intel_lp64
> -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
> -L/opt/intel/impi/4.0.1.007/intel64/lib
> -L/opt/intel/composerxe-2011.0.084/compiler/lib/intel64
> -L/opt/intel/composerxe-2011.0.084/mkl/lib/intel64
> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -lmpi -lmpigf -lmpigi
> -lpthread -lrt -lgcc_s
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/-Xlinker -lmpi_dbg
> -lgfortran -lm -Wl,-rpath,/opt/intel/impi/4.0.1.007/intel64/lib
> -Wl,-rpath,/opt/intel/mpi-rt/4.0.1 -lm -lmpigc4 -lmpi_dbg -lstdc++
> -lmpigc4 -lmpi_dbg -lstdc++ -ldl -lmpi -lmpigf -lmpigi -lpthread -lrt
> -lgcc_s -ldl
> -----------------------------------------
>
>
> 2011/4/25 Matthew Knepley <knepley at gmail.com>:
> > On Mon, Apr 25, 2011 at 9:06 AM, Eugene Kozlov <neoveneficus at gmail.com>
> > wrote:
> >>
> >> Hello,
> >
> > To answer any kind of question about performance, we need the full output
> of
> > -log_summary.
> > Matt
> >
> >>
> >> I am trying to test PETSc capability of solving sparse linear systems
> >> in parallel with GPU.
> >>
> >> I compiled and tried to run example
> >> src/ksp/ksp/examples/tutorials/ex2.c, which can be executed in
> >> parallel.
> >>
> >> In this example matrix and vectors created using VecSetFromOptions()
> >> and MatSetFromOptions().
> >>
> >> According to the page
> >> http://www.mcs.anl.gov/petsc/petsc-2/features/gpus.html , I execute
> >> the program with keys
> >>
> >> -vec_type mpicusp -mat_type mpiaijcusp
> >>
> >> in parallel on the different number of GPUs. Full command:
> >>
> >> cleo-submit -np 1 ex2 -ksp_type cg -vec_type mpicusp -mat_type
> >> mpiaijcusp -m 800 -n 800
> >>
> >> Where 'cleo-submit' is a batch manager utility.
> >>
> >> I tested the program on 1, 2 and 3 GPUs. As a result I have output as
> >> (for 3 GPUs):
> >>
> >> Norm of error 0.00012322 iterations 448
> >> Norm of error 0.00012322 iterations 448
> >> Norm of error 0.00012322 iterations 448
> >>
> >> and run times: 30, 40 and 46 seconds respectively.
> >>
> >> What can be a cause of these results?
> >>
> >> --
> >> Best regards,
> >> Eugene
> >
> >
> >
> > --
> > What most experimenters take for granted before they begin their
> experiments
> > is infinitely more interesting than any results to which their
> experiments
> > lead.
> > -- Norbert Wiener
> >
>
--
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener