[petsc-users] [External] Re: MatVec on GPUs
Swarnava Ghosh
swarnava89 at gmail.com
Tue Oct 19 20:17:30 CDT 2021
Thank you Junchao! Is it possible to determine from the log how much time is
being spent on data transfer from CPU memory to GPU memory?
************************************************************************************************************************
***                WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document         ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

/ccsopen/home/swarnava/MiniApp_xl_cu/bin/sq on a named h49n15 with 4 processors, by swarnava Tue Oct 19 21:10:56 2021
Using Petsc Release Version 3.15.0, Mar 30, 2021
                         Max       Max/Min     Avg       Total
Time (sec):           1.172e+02     1.000   1.172e+02
Objects:              1.160e+02     1.000   1.160e+02
Flop:                 5.832e+10     1.125   5.508e+10  2.203e+11
Flop/sec:             4.974e+08     1.125   4.698e+08  1.879e+09
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       1.320e+02     1.000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flop
                          and VecAXPY() for complex vectors of length N --> 8N flop
Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 1.1725e+02 100.0%  2.2033e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  1.140e+02  86.4%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)      Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max      Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s  Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided          2 1.0 6.2501e-03 145.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  2   0  0  0  0  2     0       0      0 0.00e+00     0 0.00e+00  0
BuildTwoSidedF         2 1.0 6.2628e-03 123.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  2   0  0  0  0  2     0       0      0 0.00e+00     0 0.00e+00  0
VecDot             89991 1.1 3.4663e+00   1.2 1.67e+09 1.1 0.0e+00 0.0e+00 0.0e+00  3  3  0  0  0   3  3  0  0  0  1816    1841      0 0.00e+00 84992 6.80e-01 100
VecNorm            89991 1.1 5.5282e+00   1.2 1.67e+09 1.1 0.0e+00 0.0e+00 0.0e+00  4  3  0  0  0   4  3  0  0  0  1139    1148      0 0.00e+00 84992 6.80e-01 100
VecScale           89991 1.1 1.3902e+00   1.2 8.33e+08 1.1 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  2265    2343  84992 6.80e-01     0 0.00e+00 100
VecCopy           178201 1.1 2.9825e+00   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0       0      0 0.00e+00     0 0.00e+00  0
VecSet              3589 1.1 1.0195e-01   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00     0 0.00e+00  0
VecAXPY           179091 1.1 2.7456e+00   1.2 3.32e+09 1.1 0.0e+00 0.0e+00 0.0e+00  2  6  0  0  0   2  6  0  0  0  4564    4739 169142 1.35e+00     0 0.00e+00 100
VecCUDACopyTo        891 1.1 1.5322e-02   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0    842 6.23e+01     0 0.00e+00  0
VecCUDACopyFrom      891 1.1 1.5837e-02   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00   842 6.23e+01  0
DMCreateMat            5 1.0 7.3491e-01   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  1  0  0  0  5   1  0  0  0  6     0       0      0 0.00e+00     0 0.00e+00  0
SFSetGraph             5 1.0 3.5016e-04   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00     0 0.00e+00  0
MatMult            89991 1.1 2.0423e+00   1.2 5.08e+10 1.1 0.0e+00 0.0e+00 0.0e+00  2 87  0  0  0   2 87  0  0  0 94039  105680   1683 2.00e+03     0 0.00e+00 100
MatCopy              891 1.1 1.3600e-01   1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00     0 0.00e+00  0
MatConvert             2 1.0 1.0489e+00   1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0      0 0.00e+00     0 0.00e+00  0
MatScale               2 1.0 2.7950e-04   1.3 3.18e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  4530       0      0 0.00e+00     0 0.00e+00  0
MatAssemblyBegin       7 1.0 6.3768e-03  68.8 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  2   0  0  0  0  2     0       0      0 0.00e+00     0 0.00e+00  0
MatAssemblyEnd         7 1.0 7.9870e-03   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  3   0  0  0  0  4     0       0      0 0.00e+00     0 0.00e+00  0
MatCUSPARSCopyTo     891 1.1 1.5229e-01   1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0    842 1.93e+03     0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Vector        69             11      19112     0.
    Distributed Mesh         3              0          0     0.
           Index Set        12             10     187512     0.
   IS L to G Mapping         3              0          0     0.
   Star Forest Graph        11              0          0     0.
     Discrete System         3              0          0     0.
           Weak Form         3              0          0     0.
   Application Order         1              0          0     0.
              Matrix         8              0          0     0.
       Krylov Solver         1              0          0     0.
      Preconditioner         1              0          0     0.
              Viewer         1              0          0     0.
========================================================================================================================
Average time to get PetscTime(): 4.32e-08
Average time for MPI_Barrier(): 9.94e-07
Average time for zero size MPI_Send(): 4.20135e-05
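
A note on the data-transfer question above: the -log_view output itself already
carries part of the answer. The CpuToGpu / GpuToCpu "Count" and "Size" columns
give the number and volume of copies per event (e.g. MatCUSPARSCopyTo above
reports 842 copies totaling about 1.93e+03 MB), and the dedicated copy events
VecCUDACopyTo, VecCUDACopyFrom, and MatCUSPARSCopyTo report the time spent in
those copies in their own Time column. To attribute transfer time to one
particular section of application code, one option is to wrap that section in a
user-registered log event so it gets its own line in -log_view. Below is a
minimal sketch, assuming PETSc 3.15's C logging API; the event name, the
function, and the VecScale() used to trigger a host-to-device copy are
illustrative only, not taken from the original code:

    #include <petsc.h>

    static PetscLogEvent MY_COPY_SECTION;   /* hypothetical user-defined event */

    /* Wrap a section whose host<->device traffic we want timed separately.
       The wrapping event's wall time in -log_view then includes any copies
       triggered inside it.  (Register the event once in real code.) */
    PetscErrorCode TimeCopySection(Vec v)
    {
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = PetscLogEventRegister("MyCopySection", VEC_CLASSID, &MY_COPY_SECTION);CHKERRQ(ierr);
      ierr = PetscLogEventBegin(MY_COPY_SECTION, 0, 0, 0, 0);CHKERRQ(ierr);
      /* the first GPU operation after a host-side write triggers the CpuToGpu copy */
      ierr = VecScale(v, 1.0);CHKERRQ(ierr);
      ierr = PetscLogEventEnd(MY_COPY_SECTION, 0, 0, 0, 0);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

Running with -log_view then shows MyCopySection as its own event line.
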
Sincerely,
SG
On Tue, Oct 19, 2021 at 12:28 AM Junchao Zhang <junchao.zhang at gmail.com>
wrote:
>
>
>
> On Mon, Oct 18, 2021 at 10:56 PM Swarnava Ghosh <swarnava89 at gmail.com>
> wrote:
>
>> I am trying to port parts of the following function to GPUs.
>> Essentially, the lines of code between the two "TODO..." comments should
>> be executed on the device. Here is the function:
>>
>> PetscScalar CalculateSpectralNodesAndWeights(LSDFT_OBJ *pLsdft, int p,
>> int LIp)
>> {
>>
>> PetscInt N_qp;
>> N_qp = pLsdft->N_qp;
>>
>> int k;
>> PetscScalar *a, *b;
>> k=0;
>>
>> PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &a);
>> PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &b);
>>
>> /*
>> * TODO: COPY a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1,
>> pLsdft->LapPlusVeffOprloc, k,p,N_qp from HOST to DEVICE
>> * DO THE FOLLOWING OPERATIONS ON DEVICE
>> */
>>
>> //zero out vectors
>> VecZeroEntries(pLsdft->Vk);
>> VecZeroEntries(pLsdft->Vkm1);
>> VecZeroEntries(pLsdft->Vkp1);
>>
>> VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES);
>> MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vkm1,pLsdft->Vk);
>> VecDot(pLsdft->Vkm1, pLsdft->Vk, &a[0]);
>> VecAXPY(pLsdft->Vk, -a[0], pLsdft->Vkm1);
>> VecNorm(pLsdft->Vk, NORM_2, &b[0]);
>> VecScale(pLsdft->Vk, 1.0 / b[0]);
>>
>> for (k = 0; k < N_qp; k++) {
>> MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vk,pLsdft->Vkp1);
>> VecDot(pLsdft->Vk, pLsdft->Vkp1, &a[k + 1]);
>> VecAXPY(pLsdft->Vkp1, -a[k + 1], pLsdft->Vk);
>> VecAXPY(pLsdft->Vkp1, -b[k], pLsdft->Vkm1);
>> VecCopy(pLsdft->Vk, pLsdft->Vkm1);
>> VecNorm(pLsdft->Vkp1, NORM_2, &b[k + 1]);
>> VecCopy(pLsdft->Vkp1, pLsdft->Vk);
>> VecScale(pLsdft->Vk, 1.0 / b[k + 1]);
>> }
>>
>> /*
>> * TODO: Copy back a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1,
>> pLsdft->LapPlusVeffOprloc, k,p,N_qp from DEVICE to HOST
>> */
>>
>> /*
>> * Some operation with a, and b on HOST
>> *
>> */
>> TridiagEigenVecSolve_NodesAndWeights(pLsdft, a, b, N_qp, LIp); //
>> operation on the host
>>
>> // free a,b
>> PetscFree(a);
>> PetscFree(b);
>>
>> return 0;
>> }
>>
>> If I just use the command line options to set vectors Vk,Vkp1 and Vkm1 as
>> cuda vectors and the matrix LapPlusVeffOprloc as aijcusparse, will the
>> lines of code between the two "TODO" comments be entirely executed on the
>> device?
>>
> Yes, except for VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES), which is
> done on the CPU by pulling the vector data down from the GPU to the CPU and
> setting the value there. Subsequent vector operations will push the updated
> vector data back to the GPU.
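
As a side note, for the command-line types to be picked up at all, the matrix
and vectors have to be created with MatSetFromOptions()/VecSetFromOptions()
rather than a hard-coded type. A minimal sketch of that setup follows, assuming
the objects are created in the application's own initialization code; the
creation calls and the local size n are illustrative only (not the original
setup), and error checking is omitted:

    PetscInt n = 100;   /* hypothetical local size, for illustration only */

    MatCreate(PETSC_COMM_SELF, &pLsdft->LapPlusVeffOprloc);
    MatSetSizes(pLsdft->LapPlusVeffOprloc, n, n, PETSC_DETERMINE, PETSC_DETERMINE);
    MatSetFromOptions(pLsdft->LapPlusVeffOprloc);   /* picks up -mat_type aijcusparse */
    /* ... preallocate and assemble as before ... */

    VecCreate(PETSC_COMM_SELF, &pLsdft->Vk);
    VecSetSizes(pLsdft->Vk, n, PETSC_DETERMINE);
    VecSetFromOptions(pLsdft->Vk);                  /* picks up -vec_type cuda */
    VecDuplicate(pLsdft->Vk, &pLsdft->Vkm1);        /* duplicates inherit the type */
    VecDuplicate(pLsdft->Vk, &pLsdft->Vkp1);

With that in place, running with -mat_type aijcusparse -vec_type cuda (or the
prefixed variants discussed later in this thread) moves the MatMult/VecDot/
VecAXPY loop above onto the device, apart from the VecSetValue round-trip
Junchao describes.
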
>
>
>>
>> Sincerely,
>> Swarnava
>>
>>
>> On Mon, Oct 18, 2021 at 10:13 PM Swarnava Ghosh <swarnava89 at gmail.com>
>> wrote:
>>
>>> Thanks for the clarification, Junchao.
>>>
>>> Sincerely,
>>> Swarnava
>>>
>>> On Mon, Oct 18, 2021 at 10:08 PM Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Mon, Oct 18, 2021 at 8:47 PM Swarnava Ghosh <swarnava89 at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Junchao,
>>>>>
>>>>> If I want to pass command line options as -mymat_mat_type
>>>>> aijcusparse, should it be MatSetOptionsPrefix(A,"mymat"); or
>>>>> MatSetOptionsPrefix(A,"mymat_"); ? Could you please clarify?
>>>>>
>>>> my fault, it should be MatSetOptionsPrefix(A,"mymat_"), as seen in
>>>> mat/tests/ex62.c
>>>> Thanks
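
To make the corrected prefix concrete, here is a minimal sketch; A and v are the
example objects from this thread, and extending the same trailing-underscore
convention to the vector prefix is an assumption on my part:

    MatSetOptionsPrefix(A, "mymat_");   /* note the trailing underscore */
    MatSetFromOptions(A);               /* only A responds to -mymat_* options */
    VecSetOptionsPrefix(v, "myvec_");
    VecSetFromOptions(v);

and then on the command line: -mymat_mat_type aijcusparse -myvec_vec_type cuda
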
>>>>
>>>>
>>>>>
>>>>> Sincerely,
>>>>> Swarnava
>>>>>
>>>>> On Mon, Oct 18, 2021 at 9:23 PM Junchao Zhang <junchao.zhang at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> MatSetOptionsPrefix(A,"mymat")
>>>>>> VecSetOptionsPrefix(v,"myvec")
>>>>>>
>>>>>> --Junchao Zhang
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 18, 2021 at 8:04 PM Chang Liu <cliu at pppl.gov> wrote:
>>>>>>
>>>>>>> Hi Junchao,
>>>>>>>
>>>>>>> Thank you for your answer. I tried MatConvert and it works. I didn't
>>>>>>> get it to work before because I had forgotten to convert a vector from
>>>>>>> mpi to mpicuda first.
>>>>>>>
>>>>>>> For vectors, there is no VecConvert to use, so I have to do
>>>>>>> VecDuplicate, VecSetType, and VecCopy. Is there an easier option?
>>>>>>>
>>>>>> As Matt suggested, you could single out the matrix and vector with an
>>>>>> options prefix and set their types on the command line:
>>>>>>
>>>>>> MatSetOptionsPrefix(A,"mymat");
>>>>>> VecSetOptionsPrefix(v,"myvec");
>>>>>>
>>>>>> Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda
>>>>>>
>>>>>> A simpler approach is to have the vector type set automatically by
>>>>>> MatCreateVecs(A,&v,NULL).
>>>>>>
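
A small sketch of that alternative, under the assumption that A has already
been given a CUSPARSE type via the options above (variable names are
illustrative):

    Vec x, y;

    MatCreateVecs(A, &x, &y);   /* x and y get a vector type compatible with A */
    MatMult(A, x, y);           /* runs on the GPU when A is aijcusparse */
    VecDestroy(&x);
    VecDestroy(&y);
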
>>>>>>
>>>>>>> Chang
>>>>>>>
>>>>>>> On 10/18/21 5:23 PM, Junchao Zhang wrote:
>>>>>>> >
>>>>>>> >
>>>>>>> > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users
>>>>>>> > <petsc-users at mcs.anl.gov> wrote:
>>>>>>> >
>>>>>>> > Hi Matt,
>>>>>>> >
>>>>>>> > I have a related question. In my code I have many matrices, and I
>>>>>>> > only want one of them living on the GPU; the others should stay in
>>>>>>> > CPU memory.
>>>>>>> >
>>>>>>> > I wonder if there is an easier way to copy an mpiaij matrix to
>>>>>>> > mpiaijcusparse (in other words, to copy its data to the GPU). I can
>>>>>>> > think of creating a new mpiaijcusparse matrix and copying the data
>>>>>>> > line by line, but I wonder if there is a better option.
>>>>>>> >
>>>>>>> > I have tried MatCopy and MatConvert but neither works.
>>>>>>> >
>>>>>>> > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)?
>>>>>>> >
>>>>>>> >
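
For the original copy-to-GPU question, the in-place conversion would look
roughly like this; B is a hypothetical, already-assembled MPIAIJ matrix, and a
CUDA-enabled PETSc build is assumed:

    /* convert B in place so its data lives on the GPU; other matrices are untouched */
    MatConvert(B, MATMPIAIJCUSPARSE, MAT_INPLACE_MATRIX, &B);
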
>>>>>>> > Chang
>>>>>>> >
>>>>>>> > On 10/17/21 7:50 PM, Matthew Knepley wrote:
>>>>>>> > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh
>>>>>>> > > <swarnava89 at gmail.com> wrote:
>>>>>>> > >
>>>>>>> > > Do I need to convert the MATSEQBAIJ to a cuda matrix in
>>>>>>> > > code?
>>>>>>> > >
>>>>>>> > >
>>>>>>> > > You would need a call to MatSetFromOptions() to take that type
>>>>>>> > > from the command line, and not have the type hard-coded in your
>>>>>>> > > application. It is generally a bad idea to hard code the
>>>>>>> > > implementation type.
>>>>>>> > >
>>>>>>> > > If I do it from the command line, are the other MatVec calls
>>>>>>> > > also ported onto CUDA? I have many MatVec calls in my code, but I
>>>>>>> > > specifically want to port just one call.
>>>>>>> > >
>>>>>>> > >
>>>>>>> > > You can give that one matrix an options prefix to isolate it.
>>>>>>> > >
>>>>>>> > > Thanks,
>>>>>>> > >
>>>>>>> > > Matt
>>>>>>> > >
>>>>>>> > > Sincerely,
>>>>>>> > > Swarnava
>>>>>>> > >
>>>>>>> > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang
>>>>>>> > > <junchao.zhang at gmail.com> wrote:
>>>>>>> > >
>>>>>>> > > You can do that with command line options -mat_type aijcusparse
>>>>>>> > > -vec_type cuda
>>>>>>> > >
>>>>>>> > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh
>>>>>>> > > <swarnava89 at gmail.com> wrote:
>>>>>>> > >
>>>>>>> > > Dear Petsc team,
>>>>>>> > >
>>>>>>> > > I had a query regarding using CUDA to accelerate a matrix
>>>>>>> > > vector product. I have a sequential sparse matrix (MATSEQBAIJ
>>>>>>> > > type). I want to port a MatVec call onto GPUs. Is there any
>>>>>>> > > code/example I can look at?
>>>>>>> > >
>>>>>>> > > Sincerely,
>>>>>>> > > SG
>>>>>>> > >
>>>>>>> > >
>>>>>>> > >
>>>>>>> > > --
>>>>>>> > > What most experimenters take for granted before they begin their
>>>>>>> > > experiments is infinitely more interesting than any results to
>>>>>>> > > which their experiments lead.
>>>>>>> > > -- Norbert Wiener
>>>>>>> > >
>>>>>>> > > https://www.cse.buffalo.edu/~knepley/
>>>>>>> >
>>>>>>> > --
>>>>>>> > Chang Liu
>>>>>>> > Staff Research Physicist
>>>>>>> > +1 609 243 3438
>>>>>>> > cliu at pppl.gov
>>>>>>> > Princeton Plasma Physics Laboratory
>>>>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA
>>>>>>> >
>>>>>>>
>>>>>>> --
>>>>>>> Chang Liu
>>>>>>> Staff Research Physicist
>>>>>>> +1 609 243 3438
>>>>>>> cliu at pppl.gov
>>>>>>> Princeton Plasma Physics Laboratory
>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA
>>>>>>>
>>>>>>