[petsc-dev] MatMult on Summit
Karl Rupp
rupp at iue.tuwien.ac.at
Tue Sep 24 07:37:55 CDT 2019
Hi Mark, Richard, Junchao, et al.,
here we go:
https://gitlab.com/petsc/petsc/merge_requests/2091
This indeed fixes all the inconsistencies in test results for SNES ex19
and even ex56. A priori I wasn't sure about the latter, but it looks
like this was the only missing piece.
Mark, this should allow you to move forward with GPUs.
Best regards,
Karli
On 9/24/19 11:05 AM, Mark Adams wrote:
> Yes, please, thank you.
>
> On Tue, Sep 24, 2019 at 1:46 AM Mills, Richard Tran via petsc-dev
> <petsc-dev at mcs.anl.gov> wrote:
>
> Karl, that would be fantastic. Much obliged!
>
> --Richard
>
> On 9/23/19 8:09 PM, Karl Rupp wrote:
>> Hi,
>>
>> `git grep cudaStreamCreate` reports that vectors, matrices and
>> scatters create their own streams. This will almost inevitably
>> create races (there is no synchronization mechanism implemented),
>> unless one calls WaitForGPU() after each operation. Some of the
>> non-deterministic tests can likely be explained by this.
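>>
>> To make the kind of race concrete, here is a minimal standalone sketch
>> (the kernels are just placeholders, not PETSc code): two streams, a
>> producer on one, a consumer on the other, and nothing ordering them.
>>
>>   #include <cuda_runtime.h>
>>
>>   /* Placeholder kernels standing in for, e.g., an SpMV and a vector update. */
>>   __global__ void produce(double *y, int n)
>>   {
>>     int i = blockIdx.x*blockDim.x + threadIdx.x;
>>     if (i < n) y[i] = 2.0*i;
>>   }
>>   __global__ void consume(const double *y, double *z, int n)
>>   {
>>     int i = blockIdx.x*blockDim.x + threadIdx.x;
>>     if (i < n) z[i] = y[i] + 1.0;  /* reads y written by produce() */
>>   }
>>
>>   int main(void)
>>   {
>>     const int n = 1<<20;
>>     double *y, *z;
>>     cudaStream_t matStream, vecStream;  /* separate streams, as Mat and Vec create now */
>>
>>     cudaMalloc(&y, n*sizeof(double));
>>     cudaMalloc(&z, n*sizeof(double));
>>     cudaStreamCreate(&matStream);
>>     cudaStreamCreate(&vecStream);
>>
>>     produce<<<(n+255)/256, 256, 0, matStream>>>(y, n);
>>     /* Without the following synchronization, consume() may read y before
>>        produce() has finished, because the two streams are not ordered with
>>        respect to each other. Calling WaitForGPU() (essentially
>>        cudaDeviceSynchronize()) after each operation papers over this, at a cost. */
>>     cudaStreamSynchronize(matStream);
>>     consume<<<(n+255)/256, 256, 0, vecStream>>>(y, z, n);
>>
>>     cudaDeviceSynchronize();
>>     cudaStreamDestroy(matStream); cudaStreamDestroy(vecStream);
>>     cudaFree(y); cudaFree(z);
>>     return 0;
>>   }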
>>
>> I'll clean this up in the next few hours if there are no objections.
>>
>> Best regards,
>> Karli
>>
>>
>>
>> On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
>>> I'm no CUDA expert (not yet, anyway), but from what I've read, the
>>> default stream (stream 0) is (mostly) synchronous with respect to the
>>> host and device, so WaitForGPU() is not needed in that case. I don't
>>> know whether explicitly calling it in that case carries any
>>> performance penalty, though.
>>>
>>> In any case, it looks like there are still some cases where
>>> potentially asynchronous CUDA library calls are being "timed"
>>> without a WaitForGPU() to ensure that the calls actually
>>> complete. I will make a pass through the aijcusparse and
>>> aijviennacl code looking for these.
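>>>
>>> To illustrate the problem concretely, here is a small standalone
>>> sketch (placeholder kernel, not the aijcusparse code): a host timer
>>> around an asynchronous kernel launch reports essentially nothing
>>> unless we synchronize before stopping the clock, which is what
>>> WaitForGPU() is for.
>>>
>>>   #include <cuda_runtime.h>
>>>   #include <stdio.h>
>>>   #include <sys/time.h>
>>>
>>>   /* Placeholder kernel standing in for an asynchronous cusparse SpMV. */
>>>   __global__ void spmv_like(double *y, int n)
>>>   {
>>>     int i = blockIdx.x*blockDim.x + threadIdx.x;
>>>     if (i < n) {
>>>       double s = 0.0;
>>>       for (int k = 0; k < 200; k++) s += 1e-3*k;
>>>       y[i] = s;
>>>     }
>>>   }
>>>
>>>   static double now(void)
>>>   {
>>>     struct timeval tv;
>>>     gettimeofday(&tv, NULL);
>>>     return tv.tv_sec + 1e-6*tv.tv_usec;
>>>   }
>>>
>>>   int main(void)
>>>   {
>>>     const int n = 1<<22;
>>>     double *y, t0, t1, t2;
>>>
>>>     cudaMalloc(&y, n*sizeof(double));
>>>     t0 = now();
>>>     spmv_like<<<(n+255)/256, 256>>>(y, n);  /* launch returns immediately */
>>>     t1 = now();               /* kernel likely still running: t1-t0 is meaningless */
>>>     cudaDeviceSynchronize();  /* what WaitForGPU() amounts to */
>>>     t2 = now();               /* t2-t0 now covers the actual work */
>>>     printf("without sync: %g s, with sync: %g s\n", t1 - t0, t2 - t0);
>>>     cudaFree(y);
>>>     return 0;
>>>   }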
>>>
>>> --Richard
>>>
>>> On 9/23/19 3:28 PM, Zhang, Junchao wrote:
>>>> It looks like cusparsestruct->stream is always created (i.e., it is
>>>> never NULL), so I don't understand the logic of the
>>>> "if (!cusparsestruct->stream)" check.
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via
>>>> petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>>
>>>> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
>>>> the end of the function it had
>>>>
>>>>     if (!yy) { /* MatMult */
>>>>       if (!cusparsestruct->stream) {
>>>>         ierr = WaitForGPU();CHKERRCUDA(ierr);
>>>>       }
>>>>     }
>>>>
>>>> I assume we don't need the logic to do this only in the MatMult()
>>>> (no-add) case and should just do this all the time, for timing
>>>> purposes if no other reason. Is there some reason NOT to do this
>>>> because of worries about the effects that these WaitForGPU()
>>>> invocations might have on performance?
>>>>
>>>> I notice other problems in aijcusparse.cu, now that I look closer.
>>>> In MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU
>>>> timing calls around the cusparse_csr_spmv() (but no WaitForGPU()
>>>> inside the timed region). I believe this is another area in which we
>>>> get a meaningless timing. It looks like we need a WaitForGPU() there,
>>>> and then maybe another one inside the timed region that handles the
>>>> scatter. (I don't know whether this stuff happens asynchronously or
>>>> not.) But do we potentially want two WaitForGPU() calls in one
>>>> function, just to help with getting timings? I don't have a good idea
>>>> of how much overhead this adds.
>>>>
>>>> --Richard
>>>>
>>>> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>>>>> I made the following changes:
>>>>> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at
>>>>> the end
>>>>> ierr = WaitForGPU();CHKERRCUDA(ierr);
>>>>> ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>>>>> ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>> PetscFunctionReturn(0);
>>>>> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The
>>>>> old code swapped the first two lines. Since with -log_view,
>>>>> MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the order to have
>>>>> better overlap.
>>>>>   ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>>>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>>>>>   ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>>>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>>>>> 3) Log time directly in the test code so we can also know the
>>>>> execution time without -log_view (and hence without the CUDA
>>>>> synchronization it implies). I manually calculated the Total Mflop/s
>>>>> for these cases for easy comparison.
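>>>>>
>>>>> Roughly, the timing loop looks like the sketch below. This is a
>>>>> simplified stand-in for the actual test driver, not the real code:
>>>>> the tridiagonal matrix is just a placeholder, and the VecNorm is
>>>>> only there to force pending GPU work to finish before the timer
>>>>> stops.
>>>>>
>>>>>   #include <petscmat.h>
>>>>>   #include <petsctime.h>
>>>>>
>>>>>   int main(int argc, char **argv)
>>>>>   {
>>>>>     Mat            A;
>>>>>     Vec            x, y;
>>>>>     PetscInt       i, Istart, Iend, N = 1000000, its = 100;
>>>>>     PetscLogDouble t0, t1;
>>>>>     PetscReal      norm;
>>>>>     PetscErrorCode ierr;
>>>>>
>>>>>     ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
>>>>>     /* Placeholder matrix; run with -mat_type aijcusparse -vec_type cuda
>>>>>        to exercise the GPU path. */
>>>>>     ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>>>>>     ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);CHKERRQ(ierr);
>>>>>     ierr = MatSetFromOptions(A);CHKERRQ(ierr);
>>>>>     ierr = MatSetUp(A);CHKERRQ(ierr);
>>>>>     ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
>>>>>     for (i = Istart; i < Iend; i++) {
>>>>>       if (i > 0)   {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>>       if (i < N-1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>>       ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
>>>>>     }
>>>>>     ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>>     ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>>     ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr);
>>>>>     ierr = VecSet(x, 1.0);CHKERRQ(ierr);
>>>>>
>>>>>     ierr = MatMult(A, x, y);CHKERRQ(ierr);           /* warm-up; triggers the CpuToGpu copy */
>>>>>     ierr = VecNorm(y, NORM_2, &norm);CHKERRQ(ierr);  /* drain the warm-up GPU work */
>>>>>     ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>     ierr = PetscTime(&t0);CHKERRQ(ierr);
>>>>>     for (i = 0; i < its; i++) {
>>>>>       ierr = MatMult(A, x, y);CHKERRQ(ierr);
>>>>>     }
>>>>>     ierr = VecNorm(y, NORM_2, &norm);CHKERRQ(ierr);  /* forces pending GPU work to complete */
>>>>>     ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>>>>     ierr = PetscTime(&t1);CHKERRQ(ierr);
>>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD, "MatMult: %D iterations in %g seconds\n", its, (double)(t1-t0));CHKERRQ(ierr);
>>>>>
>>>>>     ierr = MatDestroy(&A);CHKERRQ(ierr);
>>>>>     ierr = VecDestroy(&x);CHKERRQ(ierr);
>>>>>     ierr = VecDestroy(&y);CHKERRQ(ierr);
>>>>>     ierr = PetscFinalize();
>>>>>     return ierr;
>>>>>   }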
>>>>>
>>>>> <<Note the CPU versions are copied from yesterday's results>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>> 6 MPI ranks
>>>>> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0   4743       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 24 MPI ranks
>>>>> MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0  17948       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 42 MPI ranks
>>>>> MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0  27493       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 6 MPI ranks + 6 GPUs + regular SF + log_view
>>>>> MatMult              100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 335743  629278    100 1.02e+02  100 2.69e+02 100
>>>>> VecScatterBegin      100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  24  0100100  0      0       0      0 0.00e+00  100 2.69e+02  0
>>>>> VecScatterEnd        100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  20  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecCUDACopyTo        100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   4  0  0  0  0      0       0    100 1.02e+02    0 0.00e+00  0
>>>>> VecCopyFromSome      100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  14  0  0  0  0      0       0      0 0.00e+00  100 2.69e+02  0
>>>>>
>>>>> 6 MPI ranks + 6 GPUs + regular SF + No log_view
>>>>> MatMult:             100 1.0 1.4180e-01   399268
>>>>>
>>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
>>>>> MatMult              100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 512224  642075      0 0.00e+00    0 0.00e+00 100
>>>>> VecScatterBegin      100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   6  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  16  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
>>>>> MatMult:             100 1.0 9.8344e-02   575717
>>>>>
>>>>> 24 MPI ranks + 6 GPUs + regular SF + log_view
>>>>> MatMult              100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  0 99 97 25  0 100100100100  0 489223  708601    100 4.61e+01  100 6.72e+01 100
>>>>> VecScatterBegin      100 1.0 2.0641e-02 2.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  13  0100100  0      0       0      0 0.00e+00  100 6.72e+01  0
>>>>> VecScatterEnd        100 1.0 6.8114e-02 5.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  38  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecCUDACopyTo        100 1.0 6.6646e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0      0       0    100 4.61e+01    0 0.00e+00  0
>>>>> VecCopyFromSome      100 1.0 1.0546e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   7  0  0  0  0      0       0      0 0.00e+00  100 6.72e+01  0
>>>>>
>>>>> 24 MPI ranks + 6 GPUs + regular SF + No log_view
>>>>> MatMult:             100 1.0 9.8254e-02   576201
>>>>>
>>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
>>>>> MatMult              100 1.0 1.1602e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 487956  707524      0 0.00e+00    0 0.00e+00 100
>>>>> VecScatterBegin      100 1.0 2.7088e-02 7.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   8  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 8.4262e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  52  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
>>>>> MatMult:             100 1.0 1.0397e-01   544510
>>>>>
>>>>
>>>
>