[petsc-dev] MatMult on Summit
Mills, Richard Tran
rtmills at anl.gov
Tue Sep 24 00:46:37 CDT 2019
Karl, that would be fantastic. Much obliged!
--Richard
On 9/23/19 8:09 PM, Karl Rupp wrote:
Hi,
`git grep cudaStreamCreate` reports that vectors, matrices and scatters create their own streams. This will almost inevitably create races (there is no synchronization mechanism implemented), unless one calls WaitForGPU() after each operation. Some of the non-deterministic tests can likely be explained by this.
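To illustrate the kind of race this allows (a minimal CUDA sketch, not PETSc source; all names here are made up): work launched on two different streams is unordered unless something synchronizes them, which is essentially what WaitForGPU() provides.

  #include <cuda_runtime.h>

  __global__ void scale(double *x, double a, int n)
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                     /* "vector" kernel writes x */
  }

  __global__ void axpy(double *y, const double *x, int n)
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] += x[i];                  /* "matrix" kernel reads x */
  }

  void two_stream_race(double *d_x, double *d_y, int n)
  {
    cudaStream_t svec, smat;
    cudaStreamCreate(&svec);                  /* stream owned by the vector */
    cudaStreamCreate(&smat);                  /* stream owned by the matrix */

    scale<<<(n+255)/256, 256, 0, svec>>>(d_x, 2.0, n);
    /* Without the next line, axpy on smat may read d_x before scale on svec
       has finished writing it -- that is the race.  cudaDeviceSynchronize()
       is essentially what WaitForGPU() does. */
    cudaDeviceSynchronize();
    axpy<<<(n+255)/256, 256, 0, smat>>>(d_y, d_x, n);

    cudaStreamDestroy(svec);
    cudaStreamDestroy(smat);
  }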
I'll clean this up in the next few hours if there are no objections.
Best regards,
Karli
On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
I'm no CUDA expert (not yet, anyway), but, from what I've read, the default stream (stream 0) synchronizes (mostly) with the host and with other work on the device, so WaitForGPU() is not needed in that case. I don't know whether there is any performance penalty in explicitly calling it in that case anyway.
In any case, it looks like there are still some cases where potentially asynchronous CUDA library calls are being "timed" without a WaitForGPU() to ensure that the calls actually complete. I will make a pass through the aijcusparse and aijviennacl code looking for these.
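To make the timing issue concrete (again just a sketch, not PETSc source; names are made up): a kernel launch returns to the host immediately, so a host-side timer around an un-synchronized launch measures little more than launch overhead.

  #include <cuda_runtime.h>
  #include <stdio.h>
  #include <time.h>

  __global__ void busy(double *x, int n)
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) for (int k = 0; k < 1000; k++) x[i] = x[i]*1.0000001 + 1.0;
  }

  static double wtime(void)
  {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9*ts.tv_nsec;
  }

  void time_kernel(double *d_x, int n)
  {
    double t0 = wtime();
    busy<<<(n+255)/256, 256>>>(d_x, n);   /* asynchronous launch            */
    double t_nosync = wtime() - t0;       /* ~launch overhead only          */
    cudaDeviceSynchronize();              /* the role WaitForGPU() plays    */
    double t_sync = wtime() - t0;         /* actual kernel execution time   */
    printf("no sync: %g s   with sync: %g s\n", t_nosync, t_sync);
  }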
--Richard
On 9/23/19 3:28 PM, Zhang, Junchao wrote:
It looks like cusparsestruct->stream is always created (never NULL), so I don't understand the logic behind the "if (!cusparsestruct->stream)" check.
--Junchao Zhang
On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the function it had:

  if (!yy) { /* MatMult */
    if (!cusparsestruct->stream) {
      ierr = WaitForGPU();CHKERRCUDA(ierr);
    }
  }
I assume we don't need the logic to do this only in the MatMult() (no add) case and should just do it all the time, for timing purposes if for no other reason. Is there some reason NOT to do this, because of worries about the effects that these WaitForGPU() invocations might have on performance?
I notice other problems in aijcusparse.cu, now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing calls around the cusparse_csr_spmv() (but no WaitForGPU() inside the timed region). I believe this is another place where we get a meaningless timing. It looks like we need a WaitForGPU() there, and then maybe another one inside the timed region that handles the scatter. (I don't know whether that part happens asynchronously or not.) But do we really want two WaitForGPU() calls in one function, just to help with getting timings? I don't have a good idea of how much overhead this adds.
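Roughly what I have in mind for the spmv region (just a sketch, not the actual code in aijcusparse.cu):

  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
  /* stat = cusparse_csr_spmv(...);CHKERRCUSPARSE(stat); */
  ierr = WaitForGPU();CHKERRCUDA(ierr);   /* make sure the spmv has actually finished */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);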
--Richard
On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end:

  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code swapped the first two lines. Since MatMultAdd_SeqAIJCUSPARSE is blocking with -log_view, I changed the order to get better overlap:

  ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log the time directly in the test code so we can also measure execution time without -log_view (and hence without the CUDA synchronization it triggers). I manually calculated the Total Mflop/s for these cases for easy comparison.
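For reference, a sketch of how the timing in 3) can be done (illustrative names, not the actual test code; cudaDeviceSynchronize() plays the role of WaitForGPU() here, and gnz stands for the global number of nonzeros):

  #include <petscmat.h>
  #include <petsctime.h>
  #include <cuda_runtime.h>

  static PetscErrorCode TimeMatMult(Mat A, Vec x, Vec y, PetscLogDouble gnz)
  {
    PetscErrorCode ierr;
    cudaError_t    cerr;
    PetscLogDouble t0, t1;
    PetscInt       i;

    PetscFunctionBegin;
    ierr = MatMult(A,x,y);CHKERRQ(ierr);                  /* warm-up call */
    cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr);
    ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
    ierr = PetscTime(&t0);CHKERRQ(ierr);
    for (i=0; i<100; i++) {
      ierr = MatMult(A,x,y);CHKERRQ(ierr);
    }
    cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr);      /* time the GPU work, not just the launches */
    ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
    ierr = PetscTime(&t1);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD,"MatMult: 100 calls, %g s, %g Mflop/s\n",t1-t0,100.0*2.0*gnz/(t1-t0)/1e6);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }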
(Note: the CPU versions are copied from yesterday's results.)
------------------------------------------------------------------------------------------------------------------------
Event           Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU
                Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

6 MPI ranks
MatMult         100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18 0 100100100100 0 4743 0 0 0.00e+00 0 0.00e+00 0
VecScatterBegin 100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 0 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
VecScatterEnd   100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 13 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0

24 MPI ranks
MatMult         100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 8 99 97 25 0 100100100100 0 17948 0 0 0.00e+00 0 0.00e+00 0
VecScatterBegin 100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 0 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
VecScatterEnd   100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 19 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0

42 MPI ranks
MatMult         100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30 0 100100100100 0 27493 0 0 0.00e+00 0 0.00e+00 0
VecScatterBegin 100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00 0 0 97 30 0 1 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
VecScatterEnd   100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 6 0 0 0 0 24 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult         100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 0 99 97 18 0 100100100100 0 335743 629278 100 1.02e+02 100 2.69e+02 100
VecScatterBegin 100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 24 0100100 0 0 0 0 0.00e+00 100 2.69e+02 0
VecScatterEnd   100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 20 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecCUDACopyTo   100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 4 0 0 0 0 0 0 100 1.02e+02 0 0.00e+00 0
VecCopyFromSome 100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 14 0 0 0 0 0 0 0 0.00e+00 100 2.69e+02 0

6 MPI ranks + 6 GPUs + regular SF + No log_view
MatMult: 100 1.0 1.4180e-01 399268

6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult         100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 1 99 97 18 0 100100100100 0 512224 642075 0 0.00e+00 0 0.00e+00 100
VecScatterBegin 100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 6 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
VecScatterEnd   100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 16 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0

6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
MatMult: 100 1.0 9.8344e-02 575717

24 MPI ranks + 6 GPUs + regular SF + log_view
MatMult         100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 0 99 97 25 0 100100100100 0 489223 708601 100 4.61e+01 100 6.72e+01 100
VecScatterBegin 100 1.0 2.0641e-02 2.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 13 0100100 0 0 0 0 0.00e+00 100 6.72e+01 0
VecScatterEnd   100 1.0 6.8114e-02 5.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 38 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecCUDACopyTo   100 1.0 6.6646e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 100 4.61e+01 0 0.00e+00 0
VecCopyFromSome 100 1.0 1.0546e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 7 0 0 0 0 0 0 0 0.00e+00 100 6.72e+01 0

24 MPI ranks + 6 GPUs + regular SF + No log_view
MatMult: 100 1.0 9.8254e-02 576201

24 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult         100 1.0 1.1602e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 1 99 97 25 0 100100100100 0 487956 707524 0 0.00e+00 0 0.00e+00 100
VecScatterBegin 100 1.0 2.7088e-02 7.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 8 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
VecScatterEnd   100 1.0 8.4262e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 52 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0

24 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
MatMult: 100 1.0 1.0397e-01 544510