[petsc-users] Code (possibly) not running on GPU with CUDA
Matthew Knepley
knepley at gmail.com
Wed Aug 5 12:30:55 CDT 2020
On Wed, Aug 5, 2020 at 1:09 PM GIBB Gordon <g.gibb at epcc.ed.ac.uk> wrote:
> Hi,
>
> I used nvidia-smi before, which is essentially a kind of ’top’ for NVIDIA GPUs.
>
> The log output I get is:
>
You can see that all flops are done on the GPU by looking at the last
column:
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecDot                 4 1.0 7.4222e-05 1.0 1.96e+02 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0     3       3      0 0.00e+00    0 0.00e+00 100
VecNorm                1 1.0 5.4168e-05 1.0 7.30e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0     1       1      0 0.00e+00    0 0.00e+00 100
Thanks,
Matt
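A quick way to pull that number out of a log (a sketch, assuming the -log_view column layout shown above, where GPU %F is the last field of each event line):

```shell
# Extract GPU %F (the last column) from a -log_view event line.
# The sample line is the VecDot entry from the log in this thread.
line='VecDot 4 1.0 7.4222e-05 1.0 1.96e+02 1.0 0.0e+00 0.0e+00 0.0e+00 0 8 0 0 0 0 8 0 0 0 3 3 0 0.00e+00 0 0.00e+00 100'
gpu_pct=$(printf '%s\n' "$line" | awk '{print $NF}')
echo "GPU %F = $gpu_pct"   # 100 means all flops for this event ran on the GPU
```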
************************************************************************************************************************
> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was compiled with a debugging option. #
> # To get timing results run ./configure #
> # using --with-debugging=no, the performance will #
> # be generally two or three times faster. #
> # #
> ##########################################################
>
>
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was compiled with GPU support but you used #
> # an MPI that's not GPU-aware, such Petsc had to copy #
> # data from GPU to CPU for MPI communication. To get #
> # meaningfull timing results, please use a GPU-aware #
> # MPI instead. #
> ##########################################################
>
>
> /lustre/home/z04/gpsgibb/TPLS/petsc/share/petsc/examples/src/vec/vec/tests/./ex28 on a named r2i7n0 with 1 processor, by gpsgibb Wed Aug 5 18:05:59 2020
> Using Petsc Release Version 3.13.3, Jul 01, 2020
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           1.566e-01     1.000   1.566e-01
> Objects:              4.400e+01     1.000   4.400e+01
> Flop:                 2.546e+03     1.000   2.546e+03  2.546e+03
> Flop/sec:             1.626e+04     1.000   1.626e+04  1.626e+04
> Memory:               1.438e+05     1.000   1.438e+05  1.438e+05
> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Reductions:       0.000e+00     0.000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>                             and VecAXPY() for complex vectors of length N --> 8N flop
>
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
>  0:      Main Stage: 1.5657e-01 100.0%  2.5460e+03 100.0%  0.000e+00  0.0%   0.000e+00       0.0%   0.000e+00  0.0%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>    CpuToGpu Count: total number of CPU to GPU copies per processor
>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>    GpuToCpu Count: total number of GPU to CPU copies per processor
>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>    GPU %F: percent flops on GPU in this event
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
>
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> VecDot                 4 1.0 7.4222e-05 1.0 1.96e+02 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0     3       3      0 0.00e+00    0 0.00e+00 100
> VecNorm                1 1.0 5.4168e-05 1.0 7.30e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0     1       1      0 0.00e+00    0 0.00e+00 100
> VecSet                83 1.0 9.0480e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00   0
> VecAssemblyBegin       1 1.0 2.7206e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00   0
> VecAssemblyEnd         1 1.0 2.6403e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00   0
> VecSetRandom           1 1.0 1.5260e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00   0
> VecReduceArith        52 1.0 1.1307e-03 1.0 2.28e+03 1.0 0.0e+00 0.0e+00 0.0e+00  1 89  0  0  0   1 89  0  0  0     2       2      2 4.00e-04    0 0.00e+00 100
> VecReduceComm          4 1.0 3.4969e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00   0
> VecReduceBegin         1 1.0 2.5639e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00   0
> VecReduceEnd           1 1.0 2.5495e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00   0
> VecCUDACopyTo          2 1.0 1.7550e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      2 4.00e-04    0 0.00e+00   0
> VecCUDACopyFrom       42 1.0 3.7747e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00   42 8.40e-03   0
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Vector 42 42 75264 0.
> PetscRandom 1 1 646 0.
> Viewer 1 0 0 0.
>
> ========================================================================================================================
> Average time to get PetscTime(): 3.67989e-08
> #PETSc Option Table entries:
> -log_view
> -use_gpu_aware_mpi 0
> -vec_type cuda
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: CC=nvcc FC=mpif90 CXX=mpicxx
> --prefix=/lustre/home/z04/gpsgibb/TPLS/petsc --with-cudac=nvcc
> --with-cuda=1 --with-mpi-dir= --with-batch
> -----------------------------------------
> Libraries compiled on 2020-07-31 14:46:25 on r2i7n0
> Machine characteristics:
> Linux-4.18.0-147.8.1.el8_1.x86_64-x86_64-with-centos-8.1.1911-Core
> Using PETSc directory: /lustre/home/z04/gpsgibb/TPLS/petsc
> Using PETSc arch:
> -----------------------------------------
>
> Using C compiler: nvcc -g
> -I/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/include
> Using Fortran compiler: mpif90 -Wall -ffree-line-length-0
> -Wno-unused-dummy-argument -g
> -I/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/include
> -----------------------------------------
>
> Using include paths: -I/lustre/home/z04/gpsgibb/TPLS/petsc/include
> -I/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.2/include
> -I/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/include
> -----------------------------------------
>
> Using C linker: nvcc
> Using Fortran linker: mpif90
> Using libraries: -L/lustre/home/z04/gpsgibb/TPLS/petsc/lib
> -L/lustre/home/z04/gpsgibb/TPLS/petsc/lib -lpetsc
> -L/lustre/sw/intel/compilers_and_libraries_2019.0.117/linux/mkl
> -L/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.2/lib64
> -L/lustre/home/z04/gpsgibb/TPLS/petsc-3.13.3/lib
> -L/opt/hpe/hpc/mpt/mpt-2.22/lib
> -L/lustre/sw/nvidia/hpcsdk/Linux_x86_64/20.5/math_libs/10.2/lib64
> -L/lustre/sw/gcc/6.3.0/lib/gcc/x86_64-pc-linux-gnu/6.3.0
> -L/lustre/sw/gcc/6.3.0/lib64
> -L/lustre/sw/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64
> -L/lustre/sw/nvidia/hpcsdk/Linux_x86_64/cuda/10.2/bin
> -L/lustre/sw/gcc/6.3.0/lib -lmkl_intel_lp64 -lmkl_core -lmkl_sequential
> -lpthread -lX11 -lcufft -lcublas -lcudart -lcusparse -lcusolver -lcuda
> -lmpi++ -lmpi -lstdc++ -ldl -lpthread -lmpi -lgfortran -lm -lgfortran -lm
> -lgcc_s -lquadmath -lstdc++ -ldl
> -----------------------------------------
>
>
> -----------------------------------------------
> Dr Gordon P S Gibb
> EPCC, The University of Edinburgh
> Tel: +44 131 651 3459
>
> On 5 Aug 2020, at 17:58, Matthew Knepley <knepley at gmail.com> wrote:
>
> On Wed, Aug 5, 2020 at 12:47 PM GIBB Gordon <g.gibb at epcc.ed.ac.uk> wrote:
>
>> Hi Matt,
>>
>> It runs, however it doesn’t produce any output, and I have no way of
>> checking to see if it actually ran on the GPU. It was run with:
>>
>> srun -n 1 ./ex28 -vec_type cuda -use_gpu_aware_mpi 0
>>
>
> 1) How did you check last time?
>
> 2) You can check using -log_view
>
> Thanks,
>
> Matt
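A minimal sketch of that -log_view check, reusing the srun invocation and ex28 binary quoted above (scheduler and binary path are site-specific): rerun with logging enabled and inspect the last column (GPU %F) of the Vec* event lines.

```shell
# Rerun the test with PETSc logging and keep only the vector event lines;
# the last field of each is GPU %F, the fraction of flops done on the GPU.
srun -n 1 ./ex28 -vec_type cuda -use_gpu_aware_mpi 0 -log_view 2>&1 | grep -E '^Vec'
```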
>
>
>> Cheers,
>>
>> Gordon
>>
>> -----------------------------------------------
>> Dr Gordon P S Gibb
>> EPCC, The University of Edinburgh
>> Tel: +44 131 651 3459
>>
>> On 5 Aug 2020, at 17:10, Matthew Knepley <knepley at gmail.com> wrote:
>>
>> On Wed, Aug 5, 2020 at 11:24 AM GIBB Gordon <g.gibb at epcc.ed.ac.uk> wrote:
>>
>>> Hi,
>>>
>>> I’ve built PETSc with NVIDIA support for our GPU machine (
>>> https://cirrus.readthedocs.io/en/master/user-guide/gpu.html), and then
>>> compiled our executable against this PETSc (using version 3.13.3). I should
>>> add that the MPI on our system is not GPU-aware, so I have to use
>>> -use_gpu_aware_mpi 0.
>>>
>>> When running this, in the .petscrc I put
>>>
>>> -dm_vec_type cuda
>>> -dm_mat_type aijcusparse
>>>
>>> as is suggested on the PETSc GPU page (
>>> https://www.mcs.anl.gov/petsc/features/gpus.html) to enable CUDA for
>>> DMs (all our PETSc data structures are with DMs). I have also ensured I'm
>>> using the jacobi preconditioner so that it definitely runs on the GPU
>>> (again, according to the PETSc GPU page).
>>>
>>> When I run this, I note that the GPU seems to have memory allocated on
>>> it by my executable, but it seems to be doing no computation:
>>>
>>> Wed Aug 5 13:10:23 2020
>>> +-----------------------------------------------------------------------------+
>>> | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
>>> |-------------------------------+----------------------+----------------------+
>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>> |===============================+======================+======================|
>>> |   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                  Off |
>>> | N/A   43C    P0    64W / 300W |    490MiB / 16160MiB |      0%      Default |
>>> +-------------------------------+----------------------+----------------------+
>>>
>>> +-----------------------------------------------------------------------------+
>>> | Processes:                                                       GPU Memory |
>>> |  GPU       PID   Type   Process name                             Usage      |
>>> |=============================================================================|
>>> |    0     33712      C   .../z04/gpsgibb/TPLS/TPLS-GPU/./twophase.x   479MiB |
>>> +-----------------------------------------------------------------------------+
>>>
>>> I then ran the same example but without the -dm_vec_type cuda,
>>> -dm_mat_type aijcusparse arguments, and I found the same behaviour (479MB
>>> allocated on the GPU, 0% GPU utilisation).
>>>
>>> In both cases the runtimes of the example are near identical, suggesting
>>> that both are essentially the same run.
>>>
>>> As a further test I compiled PETSc without CUDA support and ran the same
>>> example again, and found the same runtime as with the GPUs, and (as
>>> expected) no GPU memory allocated. I then tried to run the example with
>>> the -dm_vec_type cuda, -dm_mat_type aijcusparse arguments and it ran
>>> without complaint. I would have expected it to throw an error or at least
>>> a warning if invalid arguments were passed to it.
>>>
>>> All this suggests to me that PETSc is ignoring my requests to use the
>>> GPUs. For the GPU-aware PETSc it seems to allocate memory on the GPUs but
>>> perform no calculations on them, regardless of whether I requested it to
>>> use the GPUs or not. On non-GPU-aware PETSc it accepts my requests to use
>>> the GPUs, but does not throw an error.
>>>
>>> What am I doing wrong?
>>>
>>
>> Let's step back to a simpler thing so we can make sure your configuration
>> is correct. Can you run the 2_cuda test from src/vec/vec/tests/ex28.c?
>> Does it execute on your GPU?
>>
>> Thanks,
>>
>> Matt
>>
>>
>>> Thanks in advance,
>>>
>>> Gordon
>>> -----------------------------------------------
>>> Dr Gordon P S Gibb
>>> EPCC, The University of Edinburgh
>>> Tel: +44 131 651 3459
>>>
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> <http://www.cse.buffalo.edu/~knepley/>
>>
>>
>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
>
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>