[petsc-dev] KSP run in parallel with GPU

Matthew Knepley knepley at gmail.com
Mon Apr 25 09:34:35 CDT 2011


On Mon, Apr 25, 2011 at 9:30 AM, Eugene Kozlov <neoveneficus at gmail.com> wrote:

> I have never used the mailing list before. How can I best send the logs?
> Should I attach them as files to the mail?


1) You attached 3 runs, but each of them uses only 1 processor.

2) This matrix looks pretty small (-m 800 -n 800 is only 640,000 unknowns).

3) You are still getting only about 2 GF/s (the MatMult rate in your logs).
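
As a rough sketch of the kind of run that would say something useful (here
mpiexec stands in for your cleo-submit wrapper, and the 3200 x 3200 grid is
just an arbitrary larger size), build PETSc with --with-debugging=no and
compare runs like

  mpiexec -n 3 ./ex2 -ksp_type cg -vec_type mpicusp -mat_type mpiaijcusp \
      -m 3200 -n 3200 -log_summary

so that several MPI ranks (one per GPU) each have enough local work, and the
CPU-GPU vector copies that show up as VecCUSPCopyTo/VecCUSPCopyFrom in your
logs can be amortized.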

   Matt


> Norm of error 0.00012322 iterations 448
> Norm of error 0.00012322 iterations 448
> Norm of error 0.00012322 iterations 448
>
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> -fCourier9' to print this document            ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance
> Summary: ----------------------------------------------
>
> ex2 on a arch-linu named cn11 with 1 processor, by kukushkinav Mon Apr
> 25 18:24:15 2011
> Using Petsc Development HG revision:
> d3e10315d68b1dd5481adb2889c7d354880da362  HG Date: Wed Apr 20 21:03:56
> 2011 -0500
>
>                         Max       Max/Min        Avg      Total
> Time (sec):           3.891e+01      1.00000   3.891e+01
> Objects:              2.500e+01      1.00000   2.500e+01
> Flops:                9.474e+09      1.00000   9.474e+09  9.474e+09
> Flops/sec:            2.435e+08      1.00000   2.435e+08  2.435e+08
> Memory:               1.562e+08      1.00000              1.562e+08
> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Reductions:       5.881e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                            e.g., VecAXPY() for real vectors of length
> N --> 2N flops
>                            and VecAXPY() for complex vectors of
> length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  ---
> Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total   counts
> %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 3.8368e+01  98.6%  9.4739e+09 100.0%  0.000e+00
> 0.0%  0.000e+00        0.0%  5.855e+03  99.6%
>  1:        Assembly: 5.3823e-01   1.4%  0.0000e+00   0.0%  0.000e+00
> 0.0%  0.000e+00        0.0%  1.200e+01   0.2%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>   Count: number of times phase was executed
>   Time and Flops: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>   Mess: number of messages sent
>   Avg. len: average message length
>   Reduct: number of global reductions
>   Global: entire computation
>   Stage: stages of a computation. Set stages with PetscLogStagePush()
> and PetscLogStagePop().
>      %T - percent time in this phase         %F - percent flops in this
> phase
>      %M - percent messages in this phase     %L - percent message
> lengths in this phase
>      %R - percent reductions in this phase
>   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
> over all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
>      ##########################################################
>      #                                                        #
>      #                          WARNING!!!                    #
>      #                                                        #
>      #   This code was compiled with a debugging option,      #
>      #   To get timing results run ./configure                #
>      #   using --with-debugging=no, the performance will      #
>      #   be generally two or three times faster.              #
>      #                                                        #
>      ##########################################################
>
>
> Event                Count      Time (sec)     Flops
>          --- Global ---  --- Stage ---   Total
>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg
> len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult              449 1.0 1.4191e+00 1.0 2.87e+09 1.0 0.0e+00
> 0.0e+00 0.0e+00  4 30  0  0  0   4 30  0  0  0  2023
> MatSolve             449 1.0 1.3127e+01 1.0 2.58e+09 1.0 0.0e+00
> 0.0e+00 0.0e+00 34 27  0  0  0  34 27  0  0  0   197
> MatCholFctrNum         1 1.0 1.7007e-01 1.0 6.40e+05 1.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     4
> MatICCFactorSym        1 1.0 1.1560e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            1 1.0 1.1921e-06 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.0204e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatCUSPCopyTo          2 1.0 2.0546e-02 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecDot               896 1.0 1.0086e+00 1.0 1.15e+09 1.0 0.0e+00
> 0.0e+00 1.8e+03  3 12  0  0 30   3 12  0  0 31  1137
> VecNorm              450 1.0 8.4599e-01 1.0 1.15e+09 1.0 0.0e+00
> 0.0e+00 9.0e+02  2 12  0  0 15   2 12  0  0 15  1362
> VecCopy                2 1.0 2.6081e-03 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet               451 1.0 2.7629e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY              897 1.0 2.1131e-01 1.0 1.15e+09 1.0 0.0e+00
> 0.0e+00 0.0e+00  1 12  0  0  0   1 12  0  0  0  5434
> VecAYPX              447 1.0 1.0049e-01 1.0 5.72e+08 1.0 0.0e+00
> 0.0e+00 0.0e+00  0  6  0  0  0   0  6  0  0  0  5694
> VecScatterBegin      449 1.0 6.8694e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecCUSPCopyTo       1346 1.0 1.2865e+00 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
> VecCUSPCopyFrom     1346 1.0 2.2437e+00 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0
> KSPSetup               2 1.0 3.1233e-05 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 1.8947e+01 1.0 9.46e+09 1.0 0.0e+00
> 0.0e+00 5.8e+03 49100  0  0 99  49100  0  0100   499
> PCSetUp                2 1.0 3.8846e-01 1.0 6.40e+05 1.0 0.0e+00
> 0.0e+00 7.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCSetUpOnBlocks        1 1.0 3.8828e-01 1.0 6.40e+05 1.0 0.0e+00
> 0.0e+00 5.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCApply              449 1.0 1.4987e+01 1.0 2.58e+09 1.0 0.0e+00
> 0.0e+00 1.8e+03 39 27  0  0 31  39 27  0  0 31   172
>
> --- Event Stage 1: Assembly
>
> MatAssemblyBegin       1 1.0 2.3842e-05 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0 17     0
> MatAssemblyEnd         1 1.0 7.0319e-02 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+01  0  0  0  0  0  13  0  0  0 83     0
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>              Matrix     4              4    120310620     0
>                 Vec     8              9     20493272     0
>         Vec Scatter     0              1          596     0
>           Index Set     3              3      2562160     0
>       Krylov Solver     2              2         2048     0
>      Preconditioner     2              2         1688     0
>              Viewer     1              0            0     0
>
> --- Event Stage 1: Assembly
>
>                 Vec     2              1         1496     0
>         Vec Scatter     1              0            0     0
>           Index Set     2              2         1432     0
>
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> #PETSc Option Table entries:
> -ksp_type cg
> -log_summary
> -m 800
> -mat_type mpiaijcusp
> -n 800
> -vec_type mpicusp
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Mon Apr 25 12:42:34 2011
> Configure options: --prefix=/home/kukushkinav
> --with-blas-lapack-dir=/opt/intel/composerxe-2011.0.084/mkl
> --with-mpi-dir=/opt/intel/impi/4.0.1.007/intel64/bin --with-cuda=1
> --with-cusp=1 --with-thrust=1
> --with-thrust-dir=/home/kukushkinav/include
> --with-cusp-dir=/home/kukushkinav/include --with-cuda-arch=sm_13
> -----------------------------------------
> Libraries compiled on Mon Apr 25 12:42:34 2011 on manager
> Machine characteristics:
> Linux-2.6.18-238.5.1.el5-x86_64-with-redhat-5.6-Tikanga
> Using PETSc directory: /export/home/kukushkinav/soft/petsc-dev
> Using PETSc arch: arch-linux2-c-debug
> -----------------------------------------
>
> Using C compiler: mpicc  -Wall -Wwrite-strings -Wno-strict-aliasing
> -Wno-unknown-pragmas -g3  ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: mpif90  -Wall -Wno-unused-variable -g
> ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
>
> Using include paths:
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/usr/local/cuda/include -I/home/kukushkinav/include/
> -I/opt/intel/impi/4.0.1.007/intel64/bin/include
> -I/opt/intel/impi/4.0.1.007/intel64/include
> -----------------------------------------
>
> Using C linker: mpicc
> Using Fortran linker: mpif90
> Using libraries:
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -L/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec
> -lpetscsys -lX11 -Wl,-rpath,/usr/local/cuda/lib64
> -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart
> -Wl,-rpath,/opt/intel/composerxe-2011.0.084/mkl
> -L/opt/intel/composerxe-2011.0.084/mkl -lmkl_intel_lp64
> -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
> -L/opt/intel/impi/4.0.1.007/intel64/lib
> -L/opt/intel/composerxe-2011.0.084/compiler/lib/intel64
> -L/opt/intel/composerxe-2011.0.084/mkl/lib/intel64
> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -lmpi -lmpigf -lmpigi
> -lpthread -lrt -lgcc_s
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/-Xlinker -lmpi_dbg
> -lgfortran -lm -Wl,-rpath,/opt/intel/impi/4.0.1.007/intel64/lib
> -Wl,-rpath,/opt/intel/mpi-rt/4.0.1 -lm -lmpigc4 -lmpi_dbg -lstdc++
> -lmpigc4 -lmpi_dbg -lstdc++ -ldl -lmpi -lmpigf -lmpigi -lpthread -lrt
> -lgcc_s -ldl
> -----------------------------------------
>
>
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> -fCourier9' to print this document            ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance
> Summary: ----------------------------------------------
>
> ex2 on a arch-linu named cn11 with 1 processor, by kukushkinav Mon Apr
> 25 18:24:15 2011
> Using Petsc Development HG revision:
> d3e10315d68b1dd5481adb2889c7d354880da362  HG Date: Wed Apr 20 21:03:56
> 2011 -0500
>
>                         Max       Max/Min        Avg      Total
> Time (sec):           3.892e+01      1.00000   3.892e+01
> Objects:              2.500e+01      1.00000   2.500e+01
> Flops:                9.474e+09      1.00000   9.474e+09  9.474e+09
> Flops/sec:            2.434e+08      1.00000   2.434e+08  2.434e+08
> Memory:               1.562e+08      1.00000              1.562e+08
> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Reductions:       5.881e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                            e.g., VecAXPY() for real vectors of length
> N --> 2N flops
>                            and VecAXPY() for complex vectors of
> length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  ---
> Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total   counts
> %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 3.8368e+01  98.6%  9.4739e+09 100.0%  0.000e+00
> 0.0%  0.000e+00        0.0%  5.855e+03  99.6%
>  1:        Assembly: 5.4742e-01   1.4%  0.0000e+00   0.0%  0.000e+00
> 0.0%  0.000e+00        0.0%  1.200e+01   0.2%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>   Count: number of times phase was executed
>   Time and Flops: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>   Mess: number of messages sent
>   Avg. len: average message length
>   Reduct: number of global reductions
>   Global: entire computation
>   Stage: stages of a computation. Set stages with PetscLogStagePush()
> and PetscLogStagePop().
>      %T - percent time in this phase         %F - percent flops in this
> phase
>      %M - percent messages in this phase     %L - percent message
> lengths in this phase
>      %R - percent reductions in this phase
>   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
> over all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
>      ##########################################################
>      #                                                        #
>      #                          WARNING!!!                    #
>      #                                                        #
>      #   This code was compiled with a debugging option,      #
>      #   To get timing results run ./configure                #
>      #   using --with-debugging=no, the performance will      #
>      #   be generally two or three times faster.              #
>      #                                                        #
>      ##########################################################
>
>
> Event                Count      Time (sec)     Flops
>          --- Global ---  --- Stage ---   Total
>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg
> len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult              449 1.0 1.4400e+00 1.0 2.87e+09 1.0 0.0e+00
> 0.0e+00 0.0e+00  4 30  0  0  0   4 30  0  0  0  1994
> MatSolve             449 1.0 1.3012e+01 1.0 2.58e+09 1.0 0.0e+00
> 0.0e+00 0.0e+00 33 27  0  0  0  34 27  0  0  0   199
> MatCholFctrNum         1 1.0 1.7184e-01 1.0 6.40e+05 1.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     4
> MatICCFactorSym        1 1.0 1.1631e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.0190e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatCUSPCopyTo          2 1.0 2.1061e-02 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecDot               896 1.0 1.0891e+00 1.0 1.15e+09 1.0 0.0e+00
> 0.0e+00 1.8e+03  3 12  0  0 30   3 12  0  0 31  1053
> VecNorm              450 1.0 1.1033e+00 1.0 1.15e+09 1.0 0.0e+00
> 0.0e+00 9.0e+02  3 12  0  0 15   3 12  0  0 15  1044
> VecCopy                2 1.0 2.7471e-03 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet               451 1.0 2.5600e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY              897 1.0 1.9726e-01 1.0 1.15e+09 1.0 0.0e+00
> 0.0e+00 0.0e+00  1 12  0  0  0   1 12  0  0  0  5821
> VecAYPX              447 1.0 1.0141e-01 1.0 5.72e+08 1.0 0.0e+00
> 0.0e+00 0.0e+00  0  6  0  0  0   0  6  0  0  0  5642
> VecScatterBegin      449 1.0 6.9110e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecCUSPCopyTo       1346 1.0 1.5747e+00 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
> VecCUSPCopyFrom     1346 1.0 2.0444e+00 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  5  0  0  0  0   5  0  0  0  0     0
> KSPSetup               2 1.0 4.0770e-05 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 1.8957e+01 1.0 9.46e+09 1.0 0.0e+00
> 0.0e+00 5.8e+03 49100  0  0 99  49100  0  0100   499
> PCSetUp                2 1.0 3.9091e-01 1.0 6.40e+05 1.0 0.0e+00
> 0.0e+00 7.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCSetUpOnBlocks        1 1.0 3.9070e-01 1.0 6.40e+05 1.0 0.0e+00
> 0.0e+00 5.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCApply              449 1.0 1.4649e+01 1.0 2.58e+09 1.0 0.0e+00
> 0.0e+00 1.8e+03 38 27  0  0 31  38 27  0  0 31   176
>
> --- Event Stage 1: Assembly
>
> MatAssemblyBegin       1 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0 17     0
> MatAssemblyEnd         1 1.0 7.0405e-02 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+01  0  0  0  0  0  13  0  0  0 83     0
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>              Matrix     4              4    120310620     0
>                 Vec     8              9     20493272     0
>         Vec Scatter     0              1          596     0
>           Index Set     3              3      2562160     0
>       Krylov Solver     2              2         2048     0
>      Preconditioner     2              2         1688     0
>              Viewer     1              0            0     0
>
> --- Event Stage 1: Assembly
>
>                 Vec     2              1         1496     0
>         Vec Scatter     1              0            0     0
>           Index Set     2              2         1432     0
>
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> #PETSc Option Table entries:
> -ksp_type cg
> -log_summary
> -m 800
> -mat_type mpiaijcusp
> -n 800
> -vec_type mpicusp
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Mon Apr 25 12:42:34 2011
> Configure options: --prefix=/home/kukushkinav
> --with-blas-lapack-dir=/opt/intel/composerxe-2011.0.084/mkl
> --with-mpi-dir=/opt/intel/impi/4.0.1.007/intel64/bin --with-cuda=1
> --with-cusp=1 --with-thrust=1
> --with-thrust-dir=/home/kukushkinav/include
> --with-cusp-dir=/home/kukushkinav/include --with-cuda-arch=sm_13
> -----------------------------------------
> Libraries compiled on Mon Apr 25 12:42:34 2011 on manager
> Machine characteristics:
> Linux-2.6.18-238.5.1.el5-x86_64-with-redhat-5.6-Tikanga
> Using PETSc directory: /export/home/kukushkinav/soft/petsc-dev
> Using PETSc arch: arch-linux2-c-debug
> -----------------------------------------
>
> Using C compiler: mpicc  -Wall -Wwrite-strings -Wno-strict-aliasing
> -Wno-unknown-pragmas -g3  ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: mpif90  -Wall -Wno-unused-variable -g
> ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
>
> Using include paths:
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/usr/local/cuda/include -I/home/kukushkinav/include/
> -I/opt/intel/impi/4.0.1.007/intel64/bin/include
> -I/opt/intel/impi/4.0.1.007/intel64/include
> -----------------------------------------
>
> Using C linker: mpicc
> Using Fortran linker: mpif90
> Using libraries:
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -L/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec
> -lpetscsys -lX11 -Wl,-rpath,/usr/local/cuda/lib64
> -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart
> -Wl,-rpath,/opt/intel/composerxe-2011.0.084/mkl
> -L/opt/intel/composerxe-2011.0.084/mkl -lmkl_intel_lp64
> -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
> -L/opt/intel/impi/4.0.1.007/intel64/lib
> -L/opt/intel/composerxe-2011.0.084/compiler/lib/intel64
> -L/opt/intel/composerxe-2011.0.084/mkl/lib/intel64
> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -lmpi -lmpigf -lmpigi
> -lpthread -lrt -lgcc_s
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/-Xlinker -lmpi_dbg
> -lgfortran -lm -Wl,-rpath,/opt/intel/impi/4.0.1.007/intel64/lib
> -Wl,-rpath,/opt/intel/mpi-rt/4.0.1 -lm -lmpigc4 -lmpi_dbg -lstdc++
> -lmpigc4 -lmpi_dbg -lstdc++ -ldl -lmpi -lmpigf -lmpigi -lpthread -lrt
> -lgcc_s -ldl
> -----------------------------------------
>
>
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> -fCourier9' to print this document            ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance
> Summary: ----------------------------------------------
>
> ex2 on a arch-linu named cn11 with 1 processor, by kukushkinav Mon Apr
> 25 18:24:16 2011
> Using Petsc Development HG revision:
> d3e10315d68b1dd5481adb2889c7d354880da362  HG Date: Wed Apr 20 21:03:56
> 2011 -0500
>
>                         Max       Max/Min        Avg      Total
> Time (sec):           3.946e+01      1.00000   3.946e+01
> Objects:              2.500e+01      1.00000   2.500e+01
> Flops:                9.474e+09      1.00000   9.474e+09  9.474e+09
> Flops/sec:            2.401e+08      1.00000   2.401e+08  2.401e+08
> Memory:               1.562e+08      1.00000              1.562e+08
> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Reductions:       5.881e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                            e.g., VecAXPY() for real vectors of length
> N --> 2N flops
>                            and VecAXPY() for complex vectors of
> length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  ---
> Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total   counts
> %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 3.8927e+01  98.6%  9.4739e+09 100.0%  0.000e+00
> 0.0%  0.000e+00        0.0%  5.855e+03  99.6%
>  1:        Assembly: 5.3425e-01   1.4%  0.0000e+00   0.0%  0.000e+00
> 0.0%  0.000e+00        0.0%  1.200e+01   0.2%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>   Count: number of times phase was executed
>   Time and Flops: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>   Mess: number of messages sent
>   Avg. len: average message length
>   Reduct: number of global reductions
>   Global: entire computation
>   Stage: stages of a computation. Set stages with PetscLogStagePush()
> and PetscLogStagePop().
>      %T - percent time in this phase         %F - percent flops in this
> phase
>      %M - percent messages in this phase     %L - percent message
> lengths in this phase
>      %R - percent reductions in this phase
>   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
> over all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
>      ##########################################################
>      #                                                        #
>      #                          WARNING!!!                    #
>      #                                                        #
>      #   This code was compiled with a debugging option,      #
>      #   To get timing results run ./configure                #
>      #   using --with-debugging=no, the performance will      #
>      #   be generally two or three times faster.              #
>      #                                                        #
>      ##########################################################
>
>
> Event                Count      Time (sec)     Flops
>          --- Global ---  --- Stage ---   Total
>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg
> len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult              449 1.0 1.4559e+00 1.0 2.87e+09 1.0 0.0e+00
> 0.0e+00 0.0e+00  4 30  0  0  0   4 30  0  0  0  1972
> MatSolve             449 1.0 1.2787e+01 1.0 2.58e+09 1.0 0.0e+00
> 0.0e+00 0.0e+00 32 27  0  0  0  33 27  0  0  0   202
> MatCholFctrNum         1 1.0 1.6867e-01 1.0 6.40e+05 1.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     4
> MatICCFactorSym        1 1.0 1.1588e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.0179e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatCUSPCopyTo          2 1.0 1.9272e-02 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecDot               896 1.0 1.1960e+00 1.0 1.15e+09 1.0 0.0e+00
> 0.0e+00 1.8e+03  3 12  0  0 30   3 12  0  0 31   959
> VecNorm              450 1.0 1.0563e+00 1.0 1.15e+09 1.0 0.0e+00
> 0.0e+00 9.0e+02  3 12  0  0 15   3 12  0  0 15  1091
> VecCopy                2 1.0 2.4669e-03 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet               451 1.0 2.5303e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY              897 1.0 1.9844e-01 1.0 1.15e+09 1.0 0.0e+00
> 0.0e+00 0.0e+00  1 12  0  0  0   1 12  0  0  0  5786
> VecAYPX              447 1.0 1.0348e-01 1.0 5.72e+08 1.0 0.0e+00
> 0.0e+00 0.0e+00  0  6  0  0  0   0  6  0  0  0  5529
> VecScatterBegin      449 1.0 7.1386e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecCUSPCopyTo       1346 1.0 1.6784e+00 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
> VecCUSPCopyFrom     1346 1.0 2.2173e+00 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0
> KSPSetup               2 1.0 4.5061e-05 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 1.8962e+01 1.0 9.46e+09 1.0 0.0e+00
> 0.0e+00 5.8e+03 48100  0  0 99  49100  0  0100   499
> PCSetUp                2 1.0 3.8716e-01 1.0 6.40e+05 1.0 0.0e+00
> 0.0e+00 7.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCSetUpOnBlocks        1 1.0 3.8694e-01 1.0 6.40e+05 1.0 0.0e+00
> 0.0e+00 5.0e+00  1  0  0  0  0   1  0  0  0  0     2
> PCApply              449 1.0 1.4576e+01 1.0 2.58e+09 1.0 0.0e+00
> 0.0e+00 1.8e+03 37 27  0  0 31  37 27  0  0 31   177
>
> --- Event Stage 1: Assembly
>
> MatAssemblyBegin       1 1.0 5.1022e-05 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0 17     0
> MatAssemblyEnd         1 1.0 6.9999e-02 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+01  0  0  0  0  0  13  0  0  0 83     0
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>              Matrix     4              4    120310620     0
>                 Vec     8              9     20493272     0
>         Vec Scatter     0              1          596     0
>           Index Set     3              3      2562160     0
>       Krylov Solver     2              2         2048     0
>      Preconditioner     2              2         1688     0
>              Viewer     1              0            0     0
>
> --- Event Stage 1: Assembly
>
>                 Vec     2              1         1496     0
>         Vec Scatter     1              0            0     0
>           Index Set     2              2         1432     0
>
> ========================================================================================================================
> Average time to get PetscTime(): 0
> #PETSc Option Table entries:
> -ksp_type cg
> -log_summary
> -m 800
> -mat_type mpiaijcusp
> -n 800
> -vec_type mpicusp
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Mon Apr 25 12:42:34 2011
> Configure options: --prefix=/home/kukushkinav
> --with-blas-lapack-dir=/opt/intel/composerxe-2011.0.084/mkl
> --with-mpi-dir=/opt/intel/impi/4.0.1.007/intel64/bin --with-cuda=1
> --with-cusp=1 --with-thrust=1
> --with-thrust-dir=/home/kukushkinav/include
> --with-cusp-dir=/home/kukushkinav/include --with-cuda-arch=sm_13
> -----------------------------------------
> Libraries compiled on Mon Apr 25 12:42:34 2011 on manager
> Machine characteristics:
> Linux-2.6.18-238.5.1.el5-x86_64-with-redhat-5.6-Tikanga
> Using PETSc directory: /export/home/kukushkinav/soft/petsc-dev
> Using PETSc arch: arch-linux2-c-debug
> -----------------------------------------
>
> Using C compiler: mpicc  -Wall -Wwrite-strings -Wno-strict-aliasing
> -Wno-unknown-pragmas -g3  ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: mpif90  -Wall -Wno-unused-variable -g
> ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
>
> Using include paths:
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/include
> -I/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/include
> -I/usr/local/cuda/include -I/home/kukushkinav/include/
> -I/opt/intel/impi/4.0.1.007/intel64/bin/include
> -I/opt/intel/impi/4.0.1.007/intel64/include
> -----------------------------------------
>
> Using C linker: mpicc
> Using Fortran linker: mpif90
> Using libraries:
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -L/export/home/kukushkinav/soft/petsc-dev/arch-linux2-c-debug/lib
> -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec
> -lpetscsys -lX11 -Wl,-rpath,/usr/local/cuda/lib64
> -L/usr/local/cuda/lib64 -lcufft -lcublas -lcudart
> -Wl,-rpath,/opt/intel/composerxe-2011.0.084/mkl
> -L/opt/intel/composerxe-2011.0.084/mkl -lmkl_intel_lp64
> -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl
> -L/opt/intel/impi/4.0.1.007/intel64/lib
> -L/opt/intel/composerxe-2011.0.084/compiler/lib/intel64
> -L/opt/intel/composerxe-2011.0.084/mkl/lib/intel64
> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -lmpi -lmpigf -lmpigi
> -lpthread -lrt -lgcc_s
> -Wl,-rpath,/export/home/kukushkinav/soft/petsc-dev/-Xlinker -lmpi_dbg
> -lgfortran -lm -Wl,-rpath,/opt/intel/impi/4.0.1.007/intel64/lib
> -Wl,-rpath,/opt/intel/mpi-rt/4.0.1 -lm -lmpigc4 -lmpi_dbg -lstdc++
> -lmpigc4 -lmpi_dbg -lstdc++ -ldl -lmpi -lmpigf -lmpigi -lpthread -lrt
> -lgcc_s -ldl
> -----------------------------------------
>
>
> 2011/4/25 Matthew Knepley <knepley at gmail.com>:
> > On Mon, Apr 25, 2011 at 9:06 AM, Eugene Kozlov <neoveneficus at gmail.com>
> > wrote:
> >>
> >> Hello,
> >
> > To answer any kind of question about performance, we need the full
> > output of -log_summary.
> >     Matt
> >
> >>
> >> I am trying to test PETSc's capability to solve sparse linear systems
> >> in parallel on GPUs.
> >>
> >> I compiled and tried to run the example
> >> src/ksp/ksp/examples/tutorials/ex2.c, which can be executed in
> >> parallel.
> >>
> >> In this example the matrix and vectors are created using
> >> VecSetFromOptions() and MatSetFromOptions().
> >>
> >> According to the page
> >> http://www.mcs.anl.gov/petsc/petsc-2/features/gpus.html, I execute
> >> the program with the options
> >>
> >> -vec_type mpicusp -mat_type mpiaijcusp
> >>
> >> in parallel on different numbers of GPUs. The full command is:
> >>
> >> cleo-submit -np 1 ex2 -ksp_type cg -vec_type mpicusp -mat_type
> >> mpiaijcusp -m 800 -n 800
> >>
> >> where 'cleo-submit' is a batch manager utility.
> >>
> >> I tested the program on 1, 2, and 3 GPUs. As a result I get the
> >> following output (for 3 GPUs):
> >>
> >> Norm of error 0.00012322 iterations 448
> >> Norm of error 0.00012322 iterations 448
> >> Norm of error 0.00012322 iterations 448
> >>
> >> and the run times were 30, 40, and 46 seconds, respectively.
> >>
> >> What could be the cause of these results?
> >>
> >> --
> >> Best regards,
> >> Eugene
> >
> >
> >
> > --
> > What most experimenters take for granted before they begin their
> > experiments is infinitely more interesting than any results to which
> > their experiments lead.
> > -- Norbert Wiener
> >
>
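
For reference, here is a minimal sketch (not the actual ex2.c source) of the
options-driven creation pattern mentioned in the quoted message, which is
what lets -vec_type mpicusp and -mat_type mpiaijcusp take effect at runtime;
preallocation, assembly, and the KSP solve are omitted:

  #include <petscksp.h>

  int main(int argc, char **argv)
  {
    /* Sketch only; sizes mirror the -m/-n 800 grid of the quoted runs. */
    Mat      A;
    Vec      x;
    PetscInt m = 800, n = 800;

    PetscInitialize(&argc, &argv, (char *)0, (char *)0);

    /* Matrix type is chosen from the options database,
       e.g. -mat_type mpiaijcusp */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, m * n, m * n);
    MatSetFromOptions(A);

    /* Vector type likewise comes from -vec_type, e.g. mpicusp */
    VecCreate(PETSC_COMM_WORLD, &x);
    VecSetSizes(x, PETSC_DECIDE, m * n);
    VecSetFromOptions(x);

    /* ... preallocate and assemble A, set up the KSP, solve, and
       destroy the objects, as done in ex2.c ... */

    PetscFinalize();
    return 0;
  }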



-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener

