<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div dir="ltr">
<div>We log GPU time before and after the cuSPARSE calls. <a href="https://gitlab.com/petsc/petsc/blob/master/src%2Fmat%2Fimpls%2Faij%2Fseq%2Fseqcusparse%2Faijcusparse.cu#L1441">
https://gitlab.com/petsc/petsc/blob/master/src%2Fmat%2Fimpls%2Faij%2Fseq%2Fseqcusparse%2Faijcusparse.cu#L1441</a></div>
<div>But according to <a href="https://docs.nvidia.com/cuda/cusparse/index.html#asynchronous-execution">https://docs.nvidia.com/cuda/cusparse/index.html#asynchronous-execution</a>, cuSPARSE routines execute asynchronously with respect to the host. Does that mean the GPU time we log is meaningless?</div>
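<div><br>
</div>
<div>For concreteness, here is a minimal sketch (plain CUDA, not the actual PETSc logging code; the spmv_like kernel is just a placeholder for the asynchronous work a cuSPARSE call enqueues) showing why a plain host timer stopped right after an asynchronous launch only measures launch overhead, while CUDA events recorded on the same stream and synchronized do bracket the device work:</div>
<pre>
#include <cuda_runtime.h>
#include <stdio.h>

/* Placeholder for the asynchronous SpMV-like work a cuSPARSE call would enqueue. */
__global__ void spmv_like(const double *x, double *y, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = 2.0 * x[i];
}

int main(void)
{
  const int n = 1 << 20;
  double *x, *y;
  cudaMalloc(&x, n * sizeof(double));
  cudaMalloc(&y, n * sizeof(double));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  /* Record events into the same stream as the asynchronous launch. */
  cudaEventRecord(start, 0);
  spmv_like<<<(n + 255) / 256, 256>>>(x, y, n);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);        /* wait for the device work to finish */

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("GPU time: %g ms\n", ms);

  /* A host timer stopped here without any synchronization would have timed
     only the kernel launch, not the SpMV itself. */

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(x);
  cudaFree(y);
  return 0;
}
</pre>
<div>So the logged GPU time is only meaningful if some synchronization (events as above, or a cudaDeviceSynchronize()) happens before the timer is read.</div>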
<div>
<div>
<div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sat, Sep 21, 2019 at 8:30 AM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Hannah, Junchao and Richard,<br>
<br>
The on-GPU flop rates for the 24 MPI rank runs that share the 6 GPUs through MPS look totally funky: 951558 and 973391 are so much lower than the unvirtualized 3084009<br>
and 3133521, and yet the total time to solution is similar across the runs.<br>
<br>
Is it possible these are being counted or calculated wrong? If not, what does this mean? Please check the code that computes them (I can't imagine it is wrong, but ...).<br>
<br>
It means the GPUs are taking roughly 3x longer to do the multiplies in the MPS case, but where is that extra time coming from in the other numbers? Communication time doesn't drop by that much.<br>
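<br>
(A rough check, assuming the GPU Mflop/s column is the event's total flop count divided by the logged GPU time: MatMult does about 5.7e10 total flops, so 3084009 Mflop/s corresponds to roughly 18 ms on the GPUs while 951558 Mflop/s corresponds to roughly 60 ms, a factor of about 3.2; yet the MatMult wall-clock times for the regular-SF runs are 0.178 s and 0.111 s.)<br>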
<br>
I can't present these numbers with this huge inconsistency.<br>
<br>
Thanks,<br>
<br>
Barry<br>
<br>
<br>
<br>
<br>
> On Sep 20, 2019, at 11:22 PM, Zhang, Junchao via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a>> wrote:<br>
> <br>
> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. It is about 2M x 2M. Then I ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I found MatMult was largely dominated by VecScatter
in this simple test. Using 6 MPI ranks + 6 GPUs, I found the CUDA-aware SF improved performance. But when I enabled the Multi-Process Service on Summit and used 24 ranks + 6 GPUs, I found the CUDA-aware SF hurt performance. I don't know why and will have to profile it.
I will also collect data on multiple nodes. Are the matrix and the tests appropriate?<br>
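<br>
For reference, a minimal sketch of the kind of driver this describes (an illustration, not the actual test code; the -f option name and the assumption that HV15R has already been converted to PETSc binary format are placeholders), run e.g. as "mpirun -n 6 ./matmult -f HV15R.petsc -mat_type aijcusparse -vec_type cuda -log_view":<br>
<pre>
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  Vec         x, y;
  PetscViewer viewer;
  char        file[PETSC_MAX_PATH_LEN];
  PetscBool   flg;

  PetscInitialize(&argc, &argv, NULL, NULL);
  PetscOptionsGetString(NULL, NULL, "-f", file, sizeof(file), &flg);

  /* Load the matrix from a PETSc binary file; -mat_type aijcusparse is
     picked up by MatSetFromOptions() before MatLoad(). */
  PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer);
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetFromOptions(A);
  MatLoad(A, viewer);
  PetscViewerDestroy(&viewer);

  /* Work vectors compatible with A (CUDA vectors for aijcusparse). */
  MatCreateVecs(A, &x, &y);
  VecSet(x, 1.0);

  /* The timed loop: 100 MatMults, reported under -log_view. */
  for (PetscInt i = 0; i < 100; i++) MatMult(A, x, y);

  VecDestroy(&x);
  VecDestroy(&y);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}
</pre>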
> <br>
> ------------------------------------------------------------------------------------------------------------------------<br>
> Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU<br>
> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F<br>
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------<br>
> 6 MPI ranks (CPU version)<br>
> MatMult 100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18 0 100100100100 0 4743 0 0 0.00e+00 0 0.00e+00 0<br>
> VecScatterBegin 100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 0 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0<br>
> VecScatterEnd 100 1.0 2.9441e+00133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 13 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>
> <br>
> 6 MPI ranks + 6 GPUs + regular SF<br>
> MatMult 100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 0 99 97 18 0 100100100100 0 318057 3084009 100 1.02e+02 100 2.69e+02 100<br>
> VecScatterBegin 100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 64 0100100 0 0 0 0 0.00e+00 100 2.69e+02 0<br>
> VecScatterEnd 100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 22 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>
> VecCUDACopyTo 100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 0 0 0 0 0 0 100 1.02e+02 0 0.00e+00 0<br>
> VecCopyFromSome 100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 54 0 0 0 0 0 0 0 0.00e+00 100 2.69e+02 0<br>
> <br>
> 6 MPI ranks + 6 GPUs + CUDA-aware SF<br>
> MatMult 100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 1 99 97 18 0 100100100100 0 509496 3133521 0 0.00e+00 0 0.00e+00 100<br>
> VecScatterBegin 100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 1 0 97 18 0 70 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0<br>
> VecScatterEnd 100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 17 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>
> <br>
> 24 MPI ranks + 6 GPUs + regular SF<br>
> MatMult 100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 1 99 97 25 0 100100100100 0 510337 951558 100 4.61e+01 100 6.72e+01 100<br>
> VecScatterBegin 100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 34 0100100 0 0 0 0 0.00e+00 100 6.72e+01 0<br>
> VecScatterEnd 100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 42 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>
> VecCUDACopyTo 100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 100 4.61e+01 0 0.00e+00 0<br>
> VecCopyFromSome 100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 29 0 0 0 0 0 0 0 0.00e+00 100 6.72e+01 0<br>
> <br>
> 24 MPI ranks + 6 GPUs + CUDA-aware SF<br>
> MatMult 100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 1 99 97 25 0 100100100100 0 387864 973391 0 0.00e+00 0 0.00e+00 100<br>
> VecScatterBegin 100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 1 0 97 25 0 35 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0<br>
> VecScatterEnd 100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 48 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>
> <br>
> <br>
> --Junchao Zhang<br>
<br>
</blockquote>
</div>
</body>
</html>