[petsc-users] Code (possibly) not running on GPU with CUDA
GIBB Gordon
g.gibb at epcc.ed.ac.uk
Wed Aug 5 12:09:28 CDT 2020
Hi,
I used nvidia-smi before; it’s essentially a kind of ’top’ for NVIDIA GPUs.
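For example, running it in loop mode alongside the job shows utilisation over time (the one-second interval is just an illustrative choice):

    nvidia-smi -l 1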
The log output I get is:
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option. #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with GPU support but you used #
# an MPI that's not GPU-aware, such Petsc had to copy #
# data from GPU to CPU for MPI communication. To get #
# meaningfull timing results, please use a GPU-aware #
# MPI instead. #
##########################################################
/lustre/home/z04/gpsgibb/TPLS/petsc/share/petsc/examples/src/vec/vec/tests/./ex28 on a named r2i7n0 with 1 processor, by gpsgibb Wed Aug 5 18:05:59 2020
Using Petsc Release Version 3.13.3, Jul 01, 2020
Max Max/Min Avg Total
Time (sec): 1.566e-01 1.000 1.566e-01
Objects: 4.400e+01 1.000 4.400e+01
Flop: 2.546e+03 1.000 2.546e+03 2.546e+03
Flop/sec: 1.626e+04 1.000 1.626e+04 1.626e+04
Memory: 1.438e+05 1.000 1.438e+05 1.438e+05
MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Reductions: 0.000e+00 0.000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flop
and VecAXPY() for complex vectors of length N --> 8N flop
Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total Count %Total Avg %Total Count %Total
0: Main Stage: 1.5657e-01 100.0% 2.5460e+03 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
AvgLen: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
CpuToGpu Count: total number of CPU to GPU copies per processor
CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
GpuToCpu Count: total number of GPU to CPU copies per processor
GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option. #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU
Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecDot 4 1.0 7.4222e-05 1.0 1.96e+02 1.0 0.0e+00 0.0e+00 0.0e+00 0 8 0 0 0 0 8 0 0 0 3 3 0 0.00e+00 0 0.00e+00 100
VecNorm 1 1.0 5.4168e-05 1.0 7.30e+01 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 1 1 0 0.00e+00 0 0.00e+00 100
VecSet 83 1.0 9.0480e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecAssemblyBegin 1 1.0 2.7206e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecAssemblyEnd 1 1.0 2.6403e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecSetRandom 1 1.0 1.5260e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecReduceArith 52 1.0 1.1307e-03 1.0 2.28e+03 1.0 0.0e+00 0.0e+00 0.0e+00 1 89 0 0 0 1 89 0 0 0 2 2 2 4.00e-04 0 0.00e+00 100
VecReduceComm 4 1.0 3.4969e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecReduceBegin 1 1.0 2.5639e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecReduceEnd 1 1.0 2.5495e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecCUDACopyTo 2 1.0 1.7550e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 2 4.00e-04 0 0.00e+00 0
VecCUDACopyFrom 42 1.0 3.7747e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 42 8.40e-03 0
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Vector 42 42 75264 0.
PetscRandom 1 1 646 0.
Viewer 1 0 0 0.
========================================================================================================================
Average time to get PetscTime(): 3.67989e-08
#PETSc Option Table entries:
-log_view
-use_gpu_aware_mpi 0
-vec_type cuda
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: CC=nvcc FC=mpif90 CXX=mpicxx --prefix=/lustre/home/z04/gpsgibb/TPLS/petsc --with-cudac=nvcc --with-cuda=1 --with-mpi-dir= --with-batch
-----------------------------------------
Libraries compiled on 2020-07-31 14:46:25 on r2i7n0
Machine characteristics: Linux-4.18.0-147.8.1.el8_1.x86_64-x86_64-with-centos-8.1.1911-Core
Using PETSc directory: /lustre/home/z04/gpsgibb/TPLS/petsc
Using PETSc arch:
-----------------------------------------
Using C compiler: nvcc -g -I/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/include
Using Fortran compiler: mpif90 -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -I/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/include
-----------------------------------------
Using include paths: -I/lustre/home/z04/gpsgibb/TPLS/petsc/include -I/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.2/include -I/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/include
-----------------------------------------
Using C linker: nvcc
Using Fortran linker: mpif90
Using libraries: -L/lustre/home/z04/gpsgibb/TPLS/petsc/lib -L/lustre/home/z04/gpsgibb/TPLS/petsc/lib -lpetsc -L/lustre/sw/intel/compilers_and_libraries_2019.0.117/linux/mkl -L/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.2/lib64 -L/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/lib -L/opt/hpe/hpc/mpt/mpt-2.22/lib -L/lustre/sw/nvidia/hpcsdk/Linux_x86_64/20.5/math_libs/10.2/lib64 -L/lustre/sw/gcc/6.3.0/lib/gcc/x86_64-pc-linux-gnu/6.3.0 -L/lustre/sw/gcc/6.3.0/lib64 -L/lustre/sw/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64 -L/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.2/bin -L/lustre/sw/gcc/6.3.0/lib -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lX11 -lcufft -lcublas -lcudart -lcusparse -lcusolver -lcuda -lmpi++ -lmpi -lstdc++ -ldl -lpthread -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -ldl
-----------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with GPU support but you used #
# an MPI that's not GPU-aware, such Petsc had to copy #
# data from GPU to CPU for MPI communication. To get #
# meaningfull timing results, please use a GPU-aware #
# MPI instead. #
##########################################################
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option. #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
-----------------------------------------------
Dr Gordon P S Gibb
EPCC, The University of Edinburgh
Tel: +44 131 651 3459
On 5 Aug 2020, at 17:58, Matthew Knepley <knepley at gmail.com> wrote:
On Wed, Aug 5, 2020 at 12:47 PM GIBB Gordon <g.gibb at epcc.ed.ac.uk> wrote:
Hi Matt,
It runs; however, it doesn’t produce any output, and I have no way of checking whether it actually ran on the GPU. It was run with:
srun -n 1 ./ex28 -vec_type cuda -use_gpu_aware_mpi 0
1) How did you check last time?
2) You can check using -log_view
Thanks,
Matt
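For reference, that amounts to appending -log_view to the earlier invocation:

    srun -n 1 ./ex28 -vec_type cuda -use_gpu_aware_mpi 0 -log_view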
Cheers,
Gordon
-----------------------------------------------
Dr Gordon P S Gibb
EPCC, The University of Edinburgh
Tel: +44 131 651 3459
On 5 Aug 2020, at 17:10, Matthew Knepley <knepley at gmail.com> wrote:
On Wed, Aug 5, 2020 at 11:24 AM GIBB Gordon <g.gibb at epcc.ed.ac.uk> wrote:
Hi,
I’ve built PETSc with NVIDIA support for our GPU machine (https://cirrus.readthedocs.io/en/master/user-guide/gpu.html) and then compiled our executable against this PETSc (using version 3.13.3). I should add that the MPI on our system is not GPU-aware, so I have to use -use_gpu_aware_mpi 0.
When running this, I put the following in the .petscrc:
-dm_vec_type cuda
-dm_mat_type aijcusparse
as suggested on the PETSc GPU page (https://www.mcs.anl.gov/petsc/features/gpus.html) to enable CUDA for DMs (all our PETSc data structures are built on DMs). I have also ensured I'm using the jacobi preconditioner so that it definitely runs on the GPU (again, per the PETSc GPU page).
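For illustration, a minimal sketch of where those options get consumed (the DMDA creation and grid sizes here are placeholder assumptions, not our actual setup); the -dm_* options only take effect if DMSetFromOptions() is called on the DM before its vectors and matrices are created:

    #include <petscdmda.h>
    int main(int argc, char **argv)
    {
      DM  dm;
      Vec x;
      Mat A;
      PetscInitialize(&argc, &argv, NULL, NULL);  /* error checking omitted for brevity */
      DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                   DMDA_STENCIL_STAR, 64, 64, PETSC_DECIDE, PETSC_DECIDE,
                   1, 1, NULL, NULL, &dm);
      DMSetFromOptions(dm);          /* this is where -dm_vec_type/-dm_mat_type are read */
      DMSetUp(dm);
      DMCreateGlobalVector(dm, &x);  /* VECCUDA if -dm_vec_type cuda was given */
      DMCreateMatrix(dm, &A);        /* MATAIJCUSPARSE if -dm_mat_type aijcusparse was given */
      MatDestroy(&A);
      VecDestroy(&x);
      DMDestroy(&dm);
      PetscFinalize();
      return 0;
    }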
When I run this, I note that the GPU has memory allocated on it by my executable but seems to be doing no computation:
Wed Aug 5 13:10:23 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:1A:00.0 Off | Off |
| N/A 43C P0 64W / 300W | 490MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 33712 C .../z04/gpsgibb/TPLS/TPLS-GPU/./twophase.x 479MiB |
+-----------------------------------------------------------------------------+
I then ran the same example but without the -dm_vec_type cuda and -dm_mat_type aijcusparse arguments, and found the same behaviour (479MiB allocated on the GPU, 0% GPU utilisation).
In both cases the runtimes of the example are near identical, suggesting that both are essentially the same run.
As a further test, I compiled PETSc without CUDA support and ran the same example again; I found the same runtime as with the GPUs and, as expected, no GPU memory allocated. I then tried to run the example with the -dm_vec_type cuda and -dm_mat_type aijcusparse arguments, and it ran without complaint. I would have expected it to throw an error, or at least a warning, if invalid arguments were passed to it.
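As an aside, one way to check whether such options are ever consumed is PETSc's -options_left flag, which lists any options that were set but never queried when the program finalizes (the executable name here is just ours from above):

    srun -n 1 ./twophase.x -dm_vec_type cuda -dm_mat_type aijcusparse -options_left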
All this suggests to me that PETSc is ignoring my requests to use the GPUs. The CUDA-enabled PETSc seems to allocate memory on the GPUs but performs no calculations on them, regardless of whether I request GPU use; the non-CUDA PETSc accepts my requests to use the GPUs without throwing an error.
What am I doing wrong?
Let's step back to something simpler so we can make sure your configuration is correct. Can you run the 2_cuda test from
src/vec/vec/tests/ex28.c? Does it execute on your GPU?
Thanks,
Matt
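A minimal sketch in the spirit of such a test (this is not the actual ex28 source; the vector size and the norm check are illustrative assumptions) shows how -vec_type cuda is picked up:

    #include <petscvec.h>
    int main(int argc, char **argv)
    {
      Vec       x;
      PetscReal nrm;
      PetscInitialize(&argc, &argv, NULL, NULL);  /* error checking omitted for brevity */
      VecCreate(PETSC_COMM_WORLD, &x);
      VecSetSizes(x, PETSC_DECIDE, 100);
      VecSetFromOptions(x);      /* picks up -vec_type cuda from the command line */
      VecSet(x, 1.0);
      VecNorm(x, NORM_2, &nrm);  /* executes on the GPU for a VECCUDA vector */
      PetscPrintf(PETSC_COMM_WORLD, "norm = %g\n", (double)nrm);
      VecDestroy(&x);
      PetscFinalize();
      return 0;
    }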
Thanks in advance,
Gordon
-----------------------------------------------
Dr Gordon P S Gibb
EPCC, The University of Edinburgh
Tel: +44 131 651 3459
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/