[petsc-users] [External] Re: MatVec on GPUs
Swarnava Ghosh
swarnava89 at gmail.com
Tue Oct 19 20:17:30 CDT 2021
Thank you Junchao! Is it possible to determine from the log how much time is
being spent on data transfer from CPU memory to GPU memory?
************************************************************************************************************************
***                WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document         ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

/ccsopen/home/swarnava/MiniApp_xl_cu/bin/sq on a named h49n15 with 4 processors, by swarnava Tue Oct 19 21:10:56 2021
Using Petsc Release Version 3.15.0, Mar 30, 2021
                         Max       Max/Min     Avg       Total
Time (sec):           1.172e+02     1.000   1.172e+02
Objects:              1.160e+02     1.000   1.160e+02
Flop:                 5.832e+10     1.125   5.508e+10  2.203e+11
Flop/sec:             4.974e+08     1.125   4.698e+08  1.879e+09
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       1.320e+02     1.000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flop
                          and VecAXPY() for complex vectors of length N --> 8N flop
Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 1.1725e+02 100.0%  2.2033e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  1.140e+02  86.4%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)      Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max      Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s  Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided          2 1.0 6.2501e-03 145.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  2   0  0  0  0  2     0       0      0 0.00e+00     0 0.00e+00  0
BuildTwoSidedF         2 1.0 6.2628e-03 123.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  2   0  0  0  0  2     0       0      0 0.00e+00     0 0.00e+00  0
VecDot             89991 1.1 3.4663e+00   1.2 1.67e+09 1.1 0.0e+00 0.0e+00 0.0e+00  3  3  0  0  0   3  3  0  0  0  1816    1841      0 0.00e+00 84992 6.80e-01 100
VecNorm            89991 1.1 5.5282e+00   1.2 1.67e+09 1.1 0.0e+00 0.0e+00 0.0e+00  4  3  0  0  0   4  3  0  0  0  1139    1148      0 0.00e+00 84992 6.80e-01 100
VecScale           89991 1.1 1.3902e+00   1.2 8.33e+08 1.1 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  2265    2343  84992 6.80e-01     0 0.00e+00 100
VecCopy           178201 1.1 2.9825e+00   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0       0      0 0.00e+00     0 0.00e+00  0
VecSet              3589 1.1 1.0195e-01   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00     0 0.00e+00  0
VecAXPY           179091 1.1 2.7456e+00   1.2 3.32e+09 1.1 0.0e+00 0.0e+00 0.0e+00  2  6  0  0  0   2  6  0  0  0  4564    4739 169142 1.35e+00     0 0.00e+00 100
VecCUDACopyTo        891 1.1 1.5322e-02   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0    842 6.23e+01     0 0.00e+00  0
VecCUDACopyFrom      891 1.1 1.5837e-02   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00   842 6.23e+01  0
DMCreateMat            5 1.0 7.3491e-01   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  1  0  0  0  5   1  0  0  0  6     0       0      0 0.00e+00     0 0.00e+00  0
SFSetGraph             5 1.0 3.5016e-04   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00     0 0.00e+00  0
MatMult            89991 1.1 2.0423e+00   1.2 5.08e+10 1.1 0.0e+00 0.0e+00 0.0e+00  2 87  0  0  0   2 87  0  0  0 94039  105680   1683 2.00e+03     0 0.00e+00 100
MatCopy              891 1.1 1.3600e-01   1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00     0 0.00e+00  0
MatConvert             2 1.0 1.0489e+00   1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0      0 0.00e+00     0 0.00e+00  0
MatScale               2 1.0 2.7950e-04   1.3 3.18e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  4530       0      0 0.00e+00     0 0.00e+00  0
MatAssemblyBegin       7 1.0 6.3768e-03  68.8 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  2   0  0  0  0  2     0       0      0 0.00e+00     0 0.00e+00  0
MatAssemblyEnd         7 1.0 7.9870e-03   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  3   0  0  0  0  4     0       0      0 0.00e+00     0 0.00e+00  0
MatCUSPARSCopyTo     891 1.1 1.5229e-01   1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0    842 1.93e+03     0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Vector        69             11      19112     0.
    Distributed Mesh         3              0          0     0.
           Index Set        12             10     187512     0.
   IS L to G Mapping         3              0          0     0.
   Star Forest Graph        11              0          0     0.
     Discrete System         3              0          0     0.
           Weak Form         3              0          0     0.
   Application Order         1              0          0     0.
              Matrix         8              0          0     0.
       Krylov Solver         1              0          0     0.
      Preconditioner         1              0          0     0.
              Viewer         1              0          0     0.
========================================================================================================================
Average time to get PetscTime(): 4.32e-08
Average time for MPI_Barrier(): 9.94e-07
Average time for zero size MPI_Send(): 4.20135e-05
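
A note on the data-transfer question above: the -log_view output itself already
carries part of the answer. The CpuToGpu / GpuToCpu "Count" and "Size" columns
give the number and volume of copies per event (e.g. MatCUSPARSCopyTo above
reports 842 copies totaling about 1.93e+03 MB), and the dedicated copy events
VecCUDACopyTo, VecCUDACopyFrom, and MatCUSPARSCopyTo report the time spent in
those copies in their own Time column. To attribute transfer time to one
particular section of application code, one option is to wrap that section in a
user-registered log event so it gets its own line in -log_view. Below is a
minimal sketch, assuming PETSc 3.15's C logging API; the event name, the
function, and the VecScale() used to trigger a host-to-device copy are
illustrative only, not taken from the original code:

    #include <petsc.h>

    static PetscLogEvent MY_COPY_SECTION;   /* hypothetical user-defined event */

    /* Wrap a section whose host<->device traffic we want timed separately.
       The wrapping event's wall time in -log_view then includes any copies
       triggered inside it.  (Register the event once in real code.) */
    PetscErrorCode TimeCopySection(Vec v)
    {
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = PetscLogEventRegister("MyCopySection", VEC_CLASSID, &MY_COPY_SECTION);CHKERRQ(ierr);
      ierr = PetscLogEventBegin(MY_COPY_SECTION, 0, 0, 0, 0);CHKERRQ(ierr);
      /* the first GPU operation after a host-side write triggers the CpuToGpu copy */
      ierr = VecScale(v, 1.0);CHKERRQ(ierr);
      ierr = PetscLogEventEnd(MY_COPY_SECTION, 0, 0, 0, 0);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

Running with -log_view then shows MyCopySection as its own event line.
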
Sincerely,
SG
On Tue, Oct 19, 2021 at 12:28 AM Junchao Zhang <junchao.zhang at gmail.com>
wrote:
>
>
>
> On Mon, Oct 18, 2021 at 10:56 PM Swarnava Ghosh <swarnava89 at gmail.com>
> wrote:
>
>> I am trying to port parts of the following function to GPUs.
>> Essentially, the lines of code between the two "TODO..." comments should
>> be executed on the device. Here is the function:
>>
>> PetscScalar CalculateSpectralNodesAndWeights(LSDFT_OBJ *pLsdft, int p,
>> int LIp)
>> {
>>
>> PetscInt N_qp;
>> N_qp = pLsdft->N_qp;
>>
>> int k;
>> PetscScalar *a, *b;
>> k=0;
>>
>> PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &a);
>> PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &b);
>>
>> /*
>> * TODO: COPY a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1,
>> pLsdft->LapPlusVeffOprloc, k,p,N_qp from HOST to DEVICE
>> * DO THE FOLLOWING OPERATIONS ON DEVICE
>> */
>>
>> //zero out vectors
>> VecZeroEntries(pLsdft->Vk);
>> VecZeroEntries(pLsdft->Vkm1);
>> VecZeroEntries(pLsdft->Vkp1);
>>
>> VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES);
>> MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vkm1,pLsdft->Vk);
>> VecDot(pLsdft->Vkm1, pLsdft->Vk, &a[0]);
>> VecAXPY(pLsdft->Vk, -a[0], pLsdft->Vkm1);
>> VecNorm(pLsdft->Vk, NORM_2, &b[0]);
>> VecScale(pLsdft->Vk, 1.0 / b[0]);
>>
>> for (k = 0; k < N_qp; k++) {
>> MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vk,pLsdft->Vkp1);
>> VecDot(pLsdft->Vk, pLsdft->Vkp1, &a[k + 1]);
>> VecAXPY(pLsdft->Vkp1, -a[k + 1], pLsdft->Vk);
>> VecAXPY(pLsdft->Vkp1, -b[k], pLsdft->Vkm1);
>> VecCopy(pLsdft->Vk, pLsdft->Vkm1);
>> VecNorm(pLsdft->Vkp1, NORM_2, &b[k + 1]);
>> VecCopy(pLsdft->Vkp1, pLsdft->Vk);
>> VecScale(pLsdft->Vk, 1.0 / b[k + 1]);
>> }
>>
>> /*
>> * TODO: Copy back a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1,
>> pLsdft->LapPlusVeffOprloc, k,p,N_qp from DEVICE to HOST
>> */
>>
>> /*
>> * Some operation with a, and b on HOST
>> *
>> */
>> TridiagEigenVecSolve_NodesAndWeights(pLsdft, a, b, N_qp, LIp); //
>> operation on the host
>>
>> // free a,b
>> PetscFree(a);
>> PetscFree(b);
>>
>> return 0;
>> }
>>
>> If I just use the command line options to set vectors Vk,Vkp1 and Vkm1 as
>> cuda vectors and the matrix LapPlusVeffOprloc as aijcusparse, will the
>> lines of code between the two "TODO" comments be entirely executed on the
>> device?
>>
> Yes, except for VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES), which is
> done on the CPU by pulling the vector data down from the GPU to the CPU and
> setting the value there. Subsequent vector operations will push the updated
> vector data back to the GPU.
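
As a side note, for the command-line types to be picked up at all, the matrix
and vectors have to be created with MatSetFromOptions()/VecSetFromOptions()
rather than a hard-coded type. A minimal sketch of that setup follows, assuming
the objects are created in the application's own initialization code; the
creation calls and the local size n are illustrative only (not the original
setup), and error checking is omitted:

    PetscInt n = 100;   /* hypothetical local size, for illustration only */

    MatCreate(PETSC_COMM_SELF, &pLsdft->LapPlusVeffOprloc);
    MatSetSizes(pLsdft->LapPlusVeffOprloc, n, n, PETSC_DETERMINE, PETSC_DETERMINE);
    MatSetFromOptions(pLsdft->LapPlusVeffOprloc);   /* picks up -mat_type aijcusparse */
    /* ... preallocate and assemble as before ... */

    VecCreate(PETSC_COMM_SELF, &pLsdft->Vk);
    VecSetSizes(pLsdft->Vk, n, PETSC_DETERMINE);
    VecSetFromOptions(pLsdft->Vk);                  /* picks up -vec_type cuda */
    VecDuplicate(pLsdft->Vk, &pLsdft->Vkm1);        /* duplicates inherit the type */
    VecDuplicate(pLsdft->Vk, &pLsdft->Vkp1);

With that in place, running with -mat_type aijcusparse -vec_type cuda (or the
prefixed variants discussed later in this thread) moves the MatMult/VecDot/
VecAXPY loop above onto the device, apart from the VecSetValue round-trip
Junchao describes.
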
>
>
>>
>> Sincerely,
>> Swarnava
>>
>>
>> On Mon, Oct 18, 2021 at 10:13 PM Swarnava Ghosh <swarnava89 at gmail.com>
>> wrote:
>>
>>> Thanks for the clarification, Junchao.
>>>
>>> Sincerely,
>>> Swarnava
>>>
>>> On Mon, Oct 18, 2021 at 10:08 PM Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Mon, Oct 18, 2021 at 8:47 PM Swarnava Ghosh <swarnava89 at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Junchao,
>>>>>
>>>>> If I want to pass command line options as -mymat_mat_type
>>>>> aijcusparse, should it be MatSetOptionsPrefix(A,"mymat"); or
>>>>> MatSetOptionsPrefix(A,"mymat_"); ? Could you please clarify?
>>>>>
>>>> my fault, it should be MatSetOptionsPrefix(A,"mymat_"), as seen in
>>>> mat/tests/ex62.c
>>>> Thanks
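
To make the corrected prefix concrete, here is a minimal sketch; A and v are the
example objects from this thread, and extending the same trailing-underscore
convention to the vector prefix is an assumption on my part:

    MatSetOptionsPrefix(A, "mymat_");   /* note the trailing underscore */
    MatSetFromOptions(A);               /* only A responds to -mymat_* options */
    VecSetOptionsPrefix(v, "myvec_");
    VecSetFromOptions(v);

and then on the command line: -mymat_mat_type aijcusparse -myvec_vec_type cuda
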
>>>>
>>>>
>>>>>
>>>>> Sincerely,
>>>>> Swarnava
>>>>>
>>>>> On Mon, Oct 18, 2021 at 9:23 PM Junchao Zhang <junchao.zhang at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> MatSetOptionsPrefix(A,"mymat")
>>>>>> VecSetOptionsPrefix(v,"myvec")
>>>>>>
>>>>>> --Junchao Zhang
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 18, 2021 at 8:04 PM Chang Liu <cliu at pppl.gov> wrote:
>>>>>>
>>>>>>> Hi Junchao,
>>>>>>>
>>>>>>> Thank you for your answer. I tried MatConvert and it works. I didn't
>>>>>>> get it to work before because I had forgotten to convert a vector from
>>>>>>> mpi to mpicuda first.
>>>>>>>
>>>>>>> For vectors, there is no VecConvert to use, so I have to do
>>>>>>> VecDuplicate, VecSetType, and VecCopy. Is there an easier option?
>>>>>>>
>>>>>> As Matt suggested, you could single out the matrix and vector with an
>>>>>> options prefix and set their types on the command line:
>>>>>>
>>>>>> MatSetOptionsPrefix(A,"mymat");
>>>>>> VecSetOptionsPrefix(v,"myvec");
>>>>>>
>>>>>> Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda
>>>>>>
>>>>>> A simpler approach is to have the vector type set automatically by
>>>>>> MatCreateVecs(A,&v,NULL).
>>>>>>
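
A small sketch of that alternative, under the assumption that A has already
been given a CUSPARSE type via the options above (variable names are
illustrative):

    Vec x, y;

    MatCreateVecs(A, &x, &y);   /* x and y get a vector type compatible with A */
    MatMult(A, x, y);           /* runs on the GPU when A is aijcusparse */
    VecDestroy(&x);
    VecDestroy(&y);
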
>>>>>>
>>>>>>> Chang
>>>>>>>
>>>>>>> On 10/18/21 5:23 PM, Junchao Zhang wrote:
>>>>>>> >
>>>>>>> >
>>>>>>> > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users
>>>>>>> > <petsc-users at mcs.anl.gov> wrote:
>>>>>>> >
>>>>>>> > Hi Matt,
>>>>>>> >
>>>>>>> > I have a related question. In my code I have many matrices, and I
>>>>>>> > only want one of them living on the GPU; the others should stay in
>>>>>>> > CPU memory.
>>>>>>> >
>>>>>>> > I wonder if there is an easier way to copy an mpiaij matrix to
>>>>>>> > mpiaijcusparse (in other words, to copy its data to the GPU). I can
>>>>>>> > think of creating a new mpiaijcusparse matrix and copying the data
>>>>>>> > line by line, but I wonder if there is a better option.
>>>>>>> >
>>>>>>> > I have tried MatCopy and MatConvert but neither works.
>>>>>>> >
>>>>>>> > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)?
>>>>>>> >
>>>>>>> >
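
For the original copy-to-GPU question, the in-place conversion would look
roughly like this; B is a hypothetical, already-assembled MPIAIJ matrix, and a
CUDA-enabled PETSc build is assumed:

    /* convert B in place so its data lives on the GPU; other matrices are untouched */
    MatConvert(B, MATMPIAIJCUSPARSE, MAT_INPLACE_MATRIX, &B);
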
>>>>>>> > Chang
>>>>>>> >
>>>>>>> > On 10/17/21 7:50 PM, Matthew Knepley wrote:
>>>>>>> > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh
>>>>>>> > > <swarnava89 at gmail.com> wrote:
>>>>>>> > >
>>>>>>> > > Do I need to convert the MATSEQBAIJ to a cuda matrix in
>>>>>>> > > code?
>>>>>>> > >
>>>>>>> > >
>>>>>>> > > You would need a call to MatSetFromOptions() to take that type
>>>>>>> > > from the command line, and not have the type hard-coded in your
>>>>>>> > > application. It is generally a bad idea to hard code the
>>>>>>> > > implementation type.
>>>>>>> > >
>>>>>>> > > If I do it from the command line, are the other MatVec calls
>>>>>>> > > also ported onto CUDA? I have many MatVec calls in my code, but I
>>>>>>> > > specifically want to port just one call.
>>>>>>> > >
>>>>>>> > >
>>>>>>> > > You can give that one matrix an options prefix to isolate it.
>>>>>>> > >
>>>>>>> > > Thanks,
>>>>>>> > >
>>>>>>> > > Matt
>>>>>>> > >
>>>>>>> > > Sincerely,
>>>>>>> > > Swarnava
>>>>>>> > >
>>>>>>> > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang
>>>>>>> > > <junchao.zhang at gmail.com> wrote:
>>>>>>> > >
>>>>>>> > > You can do that with command line options -mat_type aijcusparse
>>>>>>> > > -vec_type cuda
>>>>>>> > >
>>>>>>> > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh
>>>>>>> > > <swarnava89 at gmail.com> wrote:
>>>>>>> > >
>>>>>>> > > Dear Petsc team,
>>>>>>> > >
>>>>>>> > > I had a query regarding using CUDA to accelerate a matrix
>>>>>>> > > vector product. I have a sequential sparse matrix (MATSEQBAIJ
>>>>>>> > > type). I want to port a MatVec call onto GPUs. Is there any
>>>>>>> > > code/example I can look at?
>>>>>>> > >
>>>>>>> > > Sincerely,
>>>>>>> > > SG
>>>>>>> > >
>>>>>>> > >
>>>>>>> > >
>>>>>>> > > --
>>>>>>> > > What most experimenters take for granted before they begin their
>>>>>>> > > experiments is infinitely more interesting than any results to
>>>>>>> > > which their experiments lead.
>>>>>>> > > -- Norbert Wiener
>>>>>>> > >
>>>>>>> > > https://www.cse.buffalo.edu/~knepley/
>>>>>>> >
>>>>>>> > --
>>>>>>> > Chang Liu
>>>>>>> > Staff Research Physicist
>>>>>>> > +1 609 243 3438
>>>>>>> > cliu at pppl.gov
>>>>>>> > Princeton Plasma Physics Laboratory
>>>>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA
>>>>>>> >
>>>>>>>
>>>>>>> --
>>>>>>> Chang Liu
>>>>>>> Staff Research Physicist
>>>>>>> +1 609 243 3438
>>>>>>> cliu at pppl.gov
>>>>>>> Princeton Plasma Physics Laboratory
>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA
>>>>>>>
>>>>>>