[petsc-users] Code (possibly) not running on GPU with CUDA
GIBB Gordon
g.gibb at epcc.ed.ac.uk
Wed Aug 5 12:09:28 CDT 2020
Hi,
I used nvidia-smi before; it’s essentially a kind of ’top’ for NVIDIA GPUs.
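For example, running it in loop mode alongside the job shows utilisation over time (the one-second interval is just an illustrative choice):

    nvidia-smi -l 1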
The log output I get is:
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option. #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with GPU support but you used #
# an MPI that's not GPU-aware, such Petsc had to copy #
# data from GPU to CPU for MPI communication. To get #
# meaningfull timing results, please use a GPU-aware #
# MPI instead. #
##########################################################
/lustre/home/z04/gpsgibb/TPLS/petsc/share/petsc/examples/src/vec/vec/tests/./ex28 on a named r2i7n0 with 1 processor, by gpsgibb Wed Aug 5 18:05:59 2020
Using Petsc Release Version 3.13.3, Jul 01, 2020
Max Max/Min Avg Total
Time (sec): 1.566e-01 1.000 1.566e-01
Objects: 4.400e+01 1.000 4.400e+01
Flop: 2.546e+03 1.000 2.546e+03 2.546e+03
Flop/sec: 1.626e+04 1.000 1.626e+04 1.626e+04
Memory: 1.438e+05 1.000 1.438e+05 1.438e+05
MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Reductions: 0.000e+00 0.000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flop
and VecAXPY() for complex vectors of length N --> 8N flop
Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total Count %Total Avg %Total Count %Total
0: Main Stage: 1.5657e-01 100.0% 2.5460e+03 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
AvgLen: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
CpuToGpu Count: total number of CPU to GPU copies per processor
CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
GpuToCpu Count: total number of GPU to CPU copies per processor
GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option. #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU
Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecDot 4 1.0 7.4222e-05 1.0 1.96e+02 1.0 0.0e+00 0.0e+00 0.0e+00 0 8 0 0 0 0 8 0 0 0 3 3 0 0.00e+00 0 0.00e+00 100
VecNorm 1 1.0 5.4168e-05 1.0 7.30e+01 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 1 1 0 0.00e+00 0 0.00e+00 100
VecSet 83 1.0 9.0480e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecAssemblyBegin 1 1.0 2.7206e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecAssemblyEnd 1 1.0 2.6403e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecSetRandom 1 1.0 1.5260e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecReduceArith 52 1.0 1.1307e-03 1.0 2.28e+03 1.0 0.0e+00 0.0e+00 0.0e+00 1 89 0 0 0 1 89 0 0 0 2 2 2 4.00e-04 0 0.00e+00 100
VecReduceComm 4 1.0 3.4969e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecReduceBegin 1 1.0 2.5639e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecReduceEnd 1 1.0 2.5495e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecCUDACopyTo 2 1.0 1.7550e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 2 4.00e-04 0 0.00e+00 0
VecCUDACopyFrom 42 1.0 3.7747e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 42 8.40e-03 0
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Vector 42 42 75264 0.
PetscRandom 1 1 646 0.
Viewer 1 0 0 0.
========================================================================================================================
Average time to get PetscTime(): 3.67989e-08
#PETSc Option Table entries:
-log_view
-use_gpu_aware_mpi 0
-vec_type cuda
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: CC=nvcc FC=mpif90 CXX=mpicxx --prefix=/lustre/home/z04/gpsgibb/TPLS/petsc --with-cudac=nvcc --with-cuda=1 --with-mpi-dir= --with-batch
-----------------------------------------
Libraries compiled on 2020-07-31 14:46:25 on r2i7n0
Machine characteristics: Linux-4.18.0-147.8.1.el8_1.x86_64-x86_64-with-centos-8.1.1911-Core
Using PETSc directory: /lustre/home/z04/gpsgibb/TPLS/petsc
Using PETSc arch:
-----------------------------------------
Using C compiler: nvcc -g -I/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/include
Using Fortran compiler: mpif90 -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -I/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/include
-----------------------------------------
Using include paths: -I/lustre/home/z04/gpsgibb/TPLS/petsc/include -I/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.2/include -I/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/include
-----------------------------------------
Using C linker: nvcc
Using Fortran linker: mpif90
Using libraries: -L/lustre/home/z04/gpsgibb/TPLS/petsc/lib -L/lustre/home/z04/gpsgibb/TPLS/petsc/lib -lpetsc -L/lustre/sw/intel/compilers_and_libraries_2019.0.117/linux/mkl -L/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.2/lib64 -L/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/lib -L/opt/hpe/hpc/mpt/mpt-2.22/lib -L/lustre/sw/nvidia/hpcsdk/Linux_x86_64/20.5/math_libs/10.2/lib64 -L/lustre/sw/gcc/6.3.0/lib/gcc/x86_64-pc-linux-gnu/6.3.0 -L/lustre/sw/gcc/6.3.0/lib64 -L/lustre/sw/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64 -L/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.2/bin -L/lustre/sw/gcc/6.3.0/lib -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lX11 -lcufft -lcublas -lcudart -lcusparse -lcusolver -lcuda -lmpi++ -lmpi -lstdc++ -ldl -lpthread -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -ldl
-----------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with GPU support but you used #
# an MPI that's not GPU-aware, such Petsc had to copy #
# data from GPU to CPU for MPI communication. To get #
# meaningfull timing results, please use a GPU-aware #
# MPI instead. #
##########################################################
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option. #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
-----------------------------------------------
Dr Gordon P S Gibb
EPCC, The University of Edinburgh
Tel: +44 131 651 3459
On 5 Aug 2020, at 17:58, Matthew Knepley <knepley at gmail.com> wrote:
On Wed, Aug 5, 2020 at 12:47 PM GIBB Gordon <g.gibb at epcc.ed.ac.uk> wrote:
Hi Matt,
It runs; however, it doesn’t produce any output, and I have no way of checking whether it actually ran on the GPU. It was run with:
srun -n 1 ./ex28 -vec_type cuda -use_gpu_aware_mpi 0
1) How did you check last time?
2) You can check using -log_view
Thanks,
Matt
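For reference, that amounts to appending -log_view to the earlier invocation:

    srun -n 1 ./ex28 -vec_type cuda -use_gpu_aware_mpi 0 -log_view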
Cheers,
Gordon
-----------------------------------------------
Dr Gordon P S Gibb
EPCC, The University of Edinburgh
Tel: +44 131 651 3459
On 5 Aug 2020, at 17:10, Matthew Knepley <knepley at gmail.com> wrote:
On Wed, Aug 5, 2020 at 11:24 AM GIBB Gordon <g.gibb at epcc.ed.ac.uk> wrote:
Hi,
I’ve built PETSc with NVIDIA support for our GPU machine (https://cirrus.readthedocs.io/en/master/user-guide/gpu.html) and then compiled our executable against this PETSc (using version 3.13.3). I should add that the MPI on our system is not GPU-aware, so I have to use -use_gpu_aware_mpi 0.
When running this, I put the following in the .petscrc:
-dm_vec_type cuda
-dm_mat_type aijcusparse
as suggested on the PETSc GPU page (https://www.mcs.anl.gov/petsc/features/gpus.html) to enable CUDA for DMs (all our PETSc data structures are built on DMs). I have also ensured I'm using the jacobi preconditioner so that it definitely runs on the GPU (again, per the PETSc GPU page).
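For illustration, a minimal sketch of where those options get consumed (the DMDA creation and grid sizes here are placeholder assumptions, not our actual setup); the -dm_* options only take effect if DMSetFromOptions() is called on the DM before its vectors and matrices are created:

    #include <petscdmda.h>
    int main(int argc, char **argv)
    {
      DM  dm;
      Vec x;
      Mat A;
      PetscInitialize(&argc, &argv, NULL, NULL);  /* error checking omitted for brevity */
      DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                   DMDA_STENCIL_STAR, 64, 64, PETSC_DECIDE, PETSC_DECIDE,
                   1, 1, NULL, NULL, &dm);
      DMSetFromOptions(dm);          /* this is where -dm_vec_type/-dm_mat_type are read */
      DMSetUp(dm);
      DMCreateGlobalVector(dm, &x);  /* VECCUDA if -dm_vec_type cuda was given */
      DMCreateMatrix(dm, &A);        /* MATAIJCUSPARSE if -dm_mat_type aijcusparse was given */
      MatDestroy(&A);
      VecDestroy(&x);
      DMDestroy(&dm);
      PetscFinalize();
      return 0;
    }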
When I run this, I note that the GPU has memory allocated on it by my executable but seems to be doing no computation:
Wed Aug 5 13:10:23 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:1A:00.0 Off | Off |
| N/A 43C P0 64W / 300W | 490MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 33712 C .../z04/gpsgibb/TPLS/TPLS-GPU/./twophase.x 479MiB |
+-----------------------------------------------------------------------------+
I then ran the same example but without the -dm_vec_type cuda and -dm_mat_type aijcusparse arguments, and found the same behaviour (479MiB allocated on the GPU, 0% GPU utilisation).
In both cases the runtimes of the example are near identical, suggesting that both are essentially the same run.
As a further test, I compiled PETSc without CUDA support and ran the same example again; I found the same runtime as with the GPUs and, as expected, no GPU memory allocated. I then tried to run the example with the -dm_vec_type cuda and -dm_mat_type aijcusparse arguments, and it ran without complaint. I would have expected it to throw an error, or at least a warning, if invalid arguments were passed to it.
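As an aside, one way to check whether such options are ever consumed is PETSc's -options_left flag, which lists any options that were set but never queried when the program finalizes (the executable name here is just ours from above):

    srun -n 1 ./twophase.x -dm_vec_type cuda -dm_mat_type aijcusparse -options_left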
All this suggests to me that PETSc is ignoring my requests to use the GPUs. The CUDA-enabled PETSc seems to allocate memory on the GPUs but performs no calculations on them, regardless of whether I request GPU use; the non-CUDA PETSc accepts my requests to use the GPUs without throwing an error.
What am I doing wrong?
Let's step back to something simpler so we can make sure your configuration is correct. Can you run the 2_cuda test from
src/vec/vec/tests/ex28.c? Does it execute on your GPU?
Thanks,
Matt
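A minimal sketch in the spirit of such a test (this is not the actual ex28 source; the vector size and the norm check are illustrative assumptions) shows how -vec_type cuda is picked up:

    #include <petscvec.h>
    int main(int argc, char **argv)
    {
      Vec       x;
      PetscReal nrm;
      PetscInitialize(&argc, &argv, NULL, NULL);  /* error checking omitted for brevity */
      VecCreate(PETSC_COMM_WORLD, &x);
      VecSetSizes(x, PETSC_DECIDE, 100);
      VecSetFromOptions(x);      /* picks up -vec_type cuda from the command line */
      VecSet(x, 1.0);
      VecNorm(x, NORM_2, &nrm);  /* executes on the GPU for a VECCUDA vector */
      PetscPrintf(PETSC_COMM_WORLD, "norm = %g\n", (double)nrm);
      VecDestroy(&x);
      PetscFinalize();
      return 0;
    }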
Thanks in advance,
Gordon
-----------------------------------------------
Dr Gordon P S Gibb
EPCC, The University of Edinburgh
Tel: +44 131 651 3459
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/