[petsc-dev] MatMult on Summit
Smith, Barry F.
bsmith at mcs.anl.gov
Sat Sep 21 11:11:16 CDT 2019
Sorry, I forgot to ask: could you please put the GPU wait call before each of the log ends in that routine and see what kind of new numbers you get?
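
A minimal sketch of the pattern I have in mind, assuming a CUDA-enabled PETSc build (illustrative code, not the actual body of the routine in aijcusparse.cu; cudaMemsetAsync() just stands in for the asynchronous cuSPARSE SpMV):

#include <petscsys.h>
#include <cuda_runtime.h>

static PetscErrorCode TimedAsyncWork(double *d_y, size_t n)
{
  PetscErrorCode ierr;
  cudaError_t    cerr;

  PetscFunctionBegin;
  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
  cerr = cudaMemsetAsync(d_y, 0, n*sizeof(double), 0);CHKERRCUDA(cerr); /* stand-in for the asynchronous cuSPARSE call */
  cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr);                      /* the GPU wait, placed ...                    */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);                            /* ... before the log end                      */
  PetscFunctionReturn(0);
}

With the synchronization in place the logged GPU time should cover the completed kernel rather than just its launch, so the GPU Mflop/s columns should become meaningful.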
Thanks
Barry
> On Sep 21, 2019, at 11:00 AM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
>
> We log GPU time before/after the cuSPARSE calls: https://gitlab.com/petsc/petsc/blob/master/src%2Fmat%2Fimpls%2Faij%2Fseq%2Fseqcusparse%2Faijcusparse.cu#L1441
> But according to https://docs.nvidia.com/cuda/cusparse/index.html#asynchronous-execution, cuSPARSE is asynchronous. Does that mean the logged GPU time is meaningless?
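>
> To see the effect concretely, here is a self-contained CUDA sketch (illustrative, not PETSc code): a host timer read right after the launch measures only the launch overhead, while a timer read after cudaDeviceSynchronize() measures the kernel itself.
>
> #include <cuda_runtime.h>
> #include <stdio.h>
> #include <sys/time.h>
>
> __global__ void busy(double *x, int n)
> {
>   int i = blockIdx.x*blockDim.x + threadIdx.x;
>   if (i < n) for (int k = 0; k < 2000; k++) x[i] = x[i]*1.0000001 + 1.0;
> }
>
> static double wtime(void)
> {
>   struct timeval tv;
>   gettimeofday(&tv, NULL);
>   return tv.tv_sec + 1e-6*tv.tv_usec;
> }
>
> int main(void)
> {
>   const int n = 1<<22;
>   double *d_x;
>   cudaMalloc(&d_x, n*sizeof(double));
>   cudaMemset(d_x, 0, n*sizeof(double));
>
>   double t0 = wtime();
>   busy<<<(n+255)/256, 256>>>(d_x, n);   /* returns as soon as the kernel is queued  */
>   double t1 = wtime();                  /* t1-t0: launch overhead only              */
>   cudaDeviceSynchronize();              /* block until the kernel has finished      */
>   double t2 = wtime();                  /* t2-t0: the time the kernel actually took */
>
>   printf("timed without sync: %.6f s, with sync: %.6f s\n", t1-t0, t2-t0);
>   cudaFree(d_x);
>   return 0;
> }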
> --Junchao Zhang
>
>
> On Sat, Sep 21, 2019 at 8:30 AM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
> Hannah, Junchao and Richard,
>
> The on-GPU flop rates for the 24 MPI ranks + 24 MPS-virtualized GPUs look totally funky: 951558 and 973391 are so much lower than the unvirtualized 3084009
> and 3133521, and yet the total time to solution is similar for the runs.
>
> Is it possible these are being counted or calculated wrong? If not, what does this mean? Please check the code that computes them (I can't imagine it is wrong, but ...).
>
> If the numbers are right, it means the GPUs are taking 3.x times longer to do the multiplies in the MPS case, but then where is that time coming from in the other numbers? The communication time doesn't drop by that much.
>
> I can't present these numbers with this huge inconsistency.
>
> Thanks,
>
> Barry
>
>
>
>
> > On Sep 20, 2019, at 11:22 PM, Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> >
> > I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I found MatMult was almost dominated by VecScatter in this simple test. With 6 MPI ranks + 6 GPUs, the CUDA-aware SF improved performance. But when I enabled the Multi-Process Service (MPS) on Summit and used 24 ranks + 6 GPUs, the CUDA-aware SF hurt performance. I don't know why and will have to profile it. I will also collect data with multiple nodes. Are the matrix and the tests suitable?
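> >
> > For context, a minimal driver in the spirit of this test might look like the sketch below (hypothetical code, not the actual test source; the -f option name is just for illustration). The run-time options above (-mat_type aijcusparse, -vec_type cuda, plus the CUDA-aware SF switch) then select the variants compared in the tables.
> >
> > #include <petscmat.h>
> >
> > int main(int argc, char **argv)
> > {
> >   Mat            A;
> >   Vec            x, y;
> >   PetscViewer    viewer;
> >   char           file[PETSC_MAX_PATH_LEN];
> >   PetscBool      flg;
> >   PetscInt       i, m, n;
> >   PetscErrorCode ierr;
> >
> >   ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
> >   ierr = PetscOptionsGetString(NULL, NULL, "-f", file, sizeof(file), &flg);CHKERRQ(ierr);
> >   if (!flg) SETERRQ(PETSC_COMM_WORLD, PETSC_ERR_USER, "Provide a PETSc binary matrix with -f <file>");
> >   ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer);CHKERRQ(ierr);
> >   ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
> >   ierr = MatSetFromOptions(A);CHKERRQ(ierr);                 /* picks up -mat_type aijcusparse */
> >   ierr = MatLoad(A, viewer);CHKERRQ(ierr);
> >   ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
> >   ierr = MatGetLocalSize(A, &m, &n);CHKERRQ(ierr);
> >   ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
> >   ierr = VecSetSizes(x, n, PETSC_DECIDE);CHKERRQ(ierr);
> >   ierr = VecSetFromOptions(x);CHKERRQ(ierr);                 /* picks up -vec_type cuda */
> >   ierr = VecCreate(PETSC_COMM_WORLD, &y);CHKERRQ(ierr);
> >   ierr = VecSetSizes(y, m, PETSC_DECIDE);CHKERRQ(ierr);
> >   ierr = VecSetFromOptions(y);CHKERRQ(ierr);
> >   ierr = VecSet(x, 1.0);CHKERRQ(ierr);
> >   for (i = 0; i < 100; i++) {                                /* the 100 timed MatMults */
> >     ierr = MatMult(A, x, y);CHKERRQ(ierr);
> >   }
> >   ierr = VecDestroy(&x);CHKERRQ(ierr);
> >   ierr = VecDestroy(&y);CHKERRQ(ierr);
> >   ierr = MatDestroy(&A);CHKERRQ(ierr);
> >   ierr = PetscFinalize();
> >   return ierr;
> > }
> >
> > A run would then look something like: mpirun -n 6 ./matmult -f HV15R.dat -mat_type aijcusparse -vec_type cuda -log_view (executable and file names illustrative).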
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU
> > Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F
> > ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> > 6 MPI ranks (CPU version)
> > MatMult 100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18 0 100100100100 0 4743 0 0 0.00e+00 0 0.00e+00 0
> > VecScatterBegin 100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 0 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> > VecScatterEnd 100 1.0 2.9441e+00133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 13 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> >
> > 6 MPI ranks + 6 GPUs + regular SF
> > MatMult 100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 0 99 97 18 0 100100100100 0 318057 3084009 100 1.02e+02 100 2.69e+02 100
> > VecScatterBegin 100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 64 0100100 0 0 0 0 0.00e+00 100 2.69e+02 0
> > VecScatterEnd 100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 22 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> > VecCUDACopyTo 100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 0 0 0 0 0 0 100 1.02e+02 0 0.00e+00 0
> > VecCopyFromSome 100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 54 0 0 0 0 0 0 0 0.00e+00 100 2.69e+02 0
> >
> > 6 MPI ranks + 6 GPUs + CUDA-aware SF
> > MatMult 100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 1 99 97 18 0 100100100100 0 509496 3133521 0 0.00e+00 0 0.00e+00 100
> > VecScatterBegin 100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 1 0 97 18 0 70 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> > VecScatterEnd 100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 17 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> >
> > 24 MPI ranks + 6 GPUs + regular SF
> > MatMult 100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 1 99 97 25 0 100100100100 0 510337 951558 100 4.61e+01 100 6.72e+01 100
> > VecScatterBegin 100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 34 0100100 0 0 0 0 0.00e+00 100 6.72e+01 0
> > VecScatterEnd 100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 42 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> > VecCUDACopyTo 100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 100 4.61e+01 0 0.00e+00 0
> > VecCopyFromSome 100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 29 0 0 0 0 0 0 0 0.00e+00 100 6.72e+01 0
> >
> > 24 MPI ranks + 6 GPUs + CUDA-aware SF
> > MatMult 100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 1 99 97 25 0 100100100100 0 387864 973391 0 0.00e+00 0 0.00e+00 100
> > VecScatterBegin 100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 1 0 97 25 0 35 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> > VecScatterEnd 100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 48 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> >
> >
> > --Junchao Zhang
>