We log GPU time before and after the cuSPARSE calls: https://gitlab.com/petsc/petsc/blob/master/src%2Fmat%2Fimpls%2Faij%2Fseq%2Fseqcusparse%2Faijcusparse.cu#L1441
But according to https://docs.nvidia.com/cuda/cusparse/index.html#asynchronous-execution, cuSPARSE calls are asynchronous. Does that mean the logged GPU time is meaningless?
   Hannah, Junchao and Richard,

    The on-GPU flop rates for 24 MPI ranks and 24 MPS GPUs look totally funky: 951558 and 973391 are so much lower than the unvirtualized 3084009
  and 3133521, and yet the total time to solution is similar for the runs.

    Is it possible these are being counted or calculated incorrectly? If not, what does this mean? Please check the code that computes them (I can't imagine it is wrong but ...).

    It means the GPUs are taking 3.x times longer to do the multiplies in the MPS case, but where does that time show up in the other numbers? Communication time doesn't drop that much, does it?
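
    A rough back-of-the-envelope check of the "3.x times" figure, using only the MatMult rows of the quoted log below. This is a sketch under two assumptions that are not confirmed anywhere in this thread: that the "Total Mflop/s" column is the total flop count divided by the event time, and that the "GPU Mflop/s" column divides the same flop count by the logged GPU time (the GPU %F column is 100 in every GPU run). The names `runs` and `implied_gpu_time` are made up for illustration.

```python
# Values copied from the MatMult rows of the quoted -log_view output.
runs = {
    "6 ranks, regular SF":     {"time_s": 1.7800e-01, "total_mflops": 318057, "gpu_mflops": 3084009},
    "6 ranks, CUDA-aware SF":  {"time_s": 1.1112e-01, "total_mflops": 509496, "gpu_mflops": 3133521},
    "24 ranks, regular SF":    {"time_s": 1.1094e-01, "total_mflops": 510337, "gpu_mflops": 951558},
    "24 ranks, CUDA-aware SF": {"time_s": 1.4597e-01, "total_mflops": 387864, "gpu_mflops": 973391},
}


def implied_gpu_time(run):
    """Back out the GPU time (seconds) implied by the two rate columns.

    ASSUMPTION: total Mflop/s = total Mflop / event time, and
    GPU Mflop/s = the same total Mflop / logged GPU time.
    """
    total_mflop = run["total_mflops"] * run["time_s"]
    return total_mflop / run["gpu_mflops"]


gpu_times = {name: implied_gpu_time(run) for name, run in runs.items()}

for name, t in gpu_times.items():
    share = t / runs[name]["time_s"]
    print(f"{name:26s} implied GPU time {t * 1e3:6.1f} ms "
          f"({share:5.1%} of the MatMult time)")

# Slowdown of the 24-rank runs relative to the matching 6-rank runs.
ratio_regular = gpu_times["24 ranks, regular SF"] / gpu_times["6 ranks, regular SF"]
ratio_cuda_aware = gpu_times["24 ranks, CUDA-aware SF"] / gpu_times["6 ranks, CUDA-aware SF"]
print(f"regular SF ratio:    {ratio_regular:.2f}")
print(f"CUDA-aware SF ratio: {ratio_cuda_aware:.2f}")
```

    Under those assumptions the implied GPU time is roughly 18 ms in both 6-rank runs and roughly 58 to 59 ms in both 24-rank runs, a ratio of about 3.2. The total flop count backed out this way is nearly the same in all four runs, which is at least consistent with the first assumption.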

    I can't present these numbers with this huge inconsistency.



> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I found that MatMult was almost entirely dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 GPUs, I found that CUDA-aware SF could improve performance. But when I enabled Multi-Process Service on Summit and used 24 ranks + 6 GPUs, I found that CUDA-aware SF hurt performance. I don't know why and will have to profile it. I will also collect data with multiple nodes. Are the matrix and tests appropriate?
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> 6 MPI ranks (CPU version)
> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecScatterEnd        100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> 6 MPI ranks + 6 GPUs + regular SF
> MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02  100 2.69e+02 100
> VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0     0       0      0 0.00e+00  100 2.69e+02  0
> VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0    100 1.02e+02    0 0.00e+00  0
> VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0     0       0      0 0.00e+00  100 2.69e+02  0
> 6 MPI ranks + 6 GPUs + CUDA-aware SF
> MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+00    0 0.00e+00 100
> VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> 24 MPI ranks + 6 GPUs + regular SF
> MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01  100 6.72e+01 100
> VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0     0       0      0 0.00e+00  100 6.72e+01  0
> VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 4.61e+01    0 0.00e+00  0
> VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0     0       0      0 0.00e+00  100 6.72e+01  0
> 24 MPI ranks + 6 GPUs + CUDA-aware SF
> MatMult              100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864   973391    0 0.00e+00    0 0.00e+00 100
> VecScatterBegin      100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecScatterEnd        100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
