[petsc-dev] MatPinToCPU
Mark Adams
mfadams at lbl.gov
Tue Jul 30 09:19:48 CDT 2019
On Mon, Jul 29, 2019 at 11:27 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
> Thanks. Could you please send the 24-processor run with the GPU?
>
That is in out_cuda_000024....
> Note the final column of the table gives you the percentage of flops
> (not rates, actual operations) on the GPU. For your biggest run:
>
> For the MatMult it is 18 percent and for the KSP solve it is 23 percent. I
> think this is much too low; we'd like to see well over 90 percent of the
> flops on the GPU, or 95 or more. Is this because you are forced to put very
> large matrices only on the CPU?
>
Hmm, that is strange. The BLAS1 stuff is 100% GPU (that could just be rounding
up from > 99.5%), but the coarse grids are on the CPU. And there is this in the
last solve phase:
MatMult           679 1.0 5.2220e+00 1.2 7.58e+09 1.3 8.0e+07 1.1e+04 0.0e+00 1 39 14 8 0 3 74 79 60 0 16438647 438720307 578 1.99e+02 519 2.55e+02 18
MatMultAdd        150 1.0 1.1836e+00 4.7 3.41e+08 1.2 1.0e+07 1.8e+03 0.0e+00 0 2 2 0 0 1 3 10 1 0 3409019 191195194 120 2.48e+01 60 2.25e+00 21
MatMultTranspose  150 1.0 5.7940e-01 2.4 3.37e+08 1.2 1.0e+07 1.8e+03 0.0e+00 0 2 2 0 0 0 3 10 1 0 6867795 2539317196 38 1.02e+02 150 3.22e+00 92
I have added print statements to MatMult_[CUDA,CPU] and it looks fine: well
over 90% of the flops should be on the GPU. I am puzzled. I'll keep digging,
but the log statements look OK.
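
(For context, the coarse grids sit on the CPU because they get pinned there.
A bare-bones sketch of what that amounts to, not the actual GAMG code; the
function name PinCoarseLevelToCPU and the arguments are illustrative, and
error handling is trimmed:

#include <petscmat.h>

/* Sketch only: pin a (small) coarse-grid operator and a work vector to the
   CPU so MatMult() etc. run the host kernels even when the fine-grid
   matrices are CUDA types. */
static PetscErrorCode PinCoarseLevelToCPU(Mat Ac, Vec xc)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatPinToCPU(Ac, PETSC_TRUE);CHKERRQ(ierr); /* keep the matrix on the host */
  ierr = VecPinToCPU(xc, PETSC_TRUE);CHKERRQ(ierr); /* and its vector, to avoid host<->device copies */
  PetscFunctionReturn(0);
}
)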
> For the MatMult: if we assume the flop rate for the GPU is 25 times as
> fast as the CPU and 18 percent of the flops are done on the GPU, then the
> time for the GPU run should be 82.7 percent of the time for the CPU run,
> but the ratio is 0.90; so where is the extra time? That seems like too much
> for communication alone.
>
I don't follow this analysis, but there is something funny about the
logging ...
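
(For reference, I think the estimate works out as follows, assuming the GPU
does the MatMult flops 25x faster than the CPU and 18% of them run on the GPU:

    expected GPU/CPU time ratio = (1 - 0.18) + 0.18/25 = 0.82 + 0.0072 ~ 0.827

i.e. the GPU run "should" take about 83% of the CPU-only time, while the
observed ratio is about 0.90, so roughly 7% of the time is unaccounted for by
the flops alone.)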
>
> There is so much information and so much happening in the final stage
> that it is hard to discern what is killing the performance in the GPU case
> for the KSP solve. Is there any way you can just have a stage at the end
> with several KSP solves and nothing else?
>
I added this, e.g.:
--- Event Stage 7: KSP only
SFBcastOpBegin    263 1.0 8.4140e-03 2.7 0.00e+00 0.0 6.1e+04 2.5e+03 0.0e+00 0 0 15 7 0 1 0 91 98 0 0 0 0 0.00e+00 0 0.00e+00 0
SFBcastOpEnd      263 1.0 6.6676e-02 6.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 8 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
SFReduceBegin      48 1.0 4.5977e-04 2.1 0.00e+00 0.0 6.4e+03 6.0e+02 0.0e+00 0 0 2 0 0 0 0 9 2 0 0 0 0 0.00e+00 0 0.00e+00 0
SFReduceEnd        48 1.0 5.4065e-03 21.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
MatMult           215 1.0 3.9271e-01 1.0 6.33e+08 1.4 5.5e+04 2.7e+03 0.0e+00 1 24 14 7 0 83 89 81 95 0 33405 177859 430 1.75e+01 358 2.23e+01 17
MatMultAdd         48 1.0 3.3079e-02 1.3 3.20e+07 1.3 6.4e+03 6.0e+02 0.0e+00 0 1 2 0 0 7 5 9 2 0 20318 106989 48 2.33e+00 48 2.24e-01 20
MatMultTranspose   48 1.0 1.1967e-02 1.8 3.15e+07 1.3 6.4e+03 6.0e+02 0.0e+00 0 1 2 0 0 2 4 9 2 0 55325 781863 0 0.00e+00 72 3.23e-01 93
MatSolve           24 0.0 3.6270e-03 0.0 1.02e+07 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2810 0 0 0.00e+00 0 0.00e+00 0
MatResidual        48 1.0 8.2272e-02 1.0 1.33e+08 1.4 1.2e+04 2.6e+03 0.0e+00 0 5 3 1 0 17 19 18 20 0 33284 136803 96 3.62e+00 72 4.50e+00 19
VecTDot            46 1.0 6.1646e-03 1.3 1.13e+06 1.2 0.0e+00 0.0e+00 4.6e+01 0 0 0 0 2 1 0 0 0 66 4109 6814 0 0.00e+00 0 0.00e+00 100
VecNorm            24 1.0 5.2724e-03 1.9 5.90e+05 1.2 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 1 0 0 0 34 2507 5050 0 0.00e+00 0 0.00e+00 100
VecCopy           146 1.0 3.9029e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 24 9.87e-02 0
VecSet            169 1.0 1.3301e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecAXPY            46 1.0 1.5963e-03 1.2 1.13e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 15870 23070 0 0.00e+00 0 0.00e+00 100
VecAYPX           310 1.0 1.3059e-02 1.1 4.25e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 1 0 0 0 7273 12000 48 1.97e-01 0 0.00e+00 100
VecAXPBYCZ         96 1.0 6.8591e-03 1.2 6.19e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 1 0 0 0 20134 46381 0 0.00e+00 0 0.00e+00 100
VecPointwiseMult  192 1.0 7.1075e-03 1.2 1.24e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 3886 4184 24 9.87e-02 0 0.00e+00 100
VecScatterBegin   311 1.0 1.1026e-02 2.0 0.00e+00 0.0 6.8e+04 2.3e+03 0.0e+00 0 0 17 7 0 2 0 100 100 0 0 0 0 0.00e+00 72 3.50e-01 0
VecScatterEnd     311 1.0 7.2357e-02 7.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecCUDACopyTo     550 1.0 1.5607e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 550 2.01e+01 0 0.00e+00 0
VecCUDACopyFrom   478 1.0 1.7491e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 0 0.00e+00 478 2.29e+01 0
VecCopyFromSome    24 1.0 7.9868e-04 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 24 1.26e-01 0
KSPSolve            1 1.0 4.6980e-01 1.0 7.11e+08 1.4 6.8e+04 2.3e+03 7.0e+01 1 28 17 7 3 100 100 100 100 100 31476 83700 550 2.01e+01 502 2.30e+01 23
PCSetUpOnBlocks    24 1.0 4.2097e-05 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
PCApply            24 1.0 3.8880e-01 1.0 6.02e+08 1.4 6.2e+04 2.2e+03 0.0e+00 1 23 16 6 0 83 84 91 86 0 32127 96704 504 1.71e+01 456 1.88e+01 24
---------------------------------------------------------------------------------------------------------------------------------------------------------------
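
(For reference, the "KSP only" stage above is just the standard logging-stage
mechanism wrapped around the final solve. A minimal sketch; the function name
SolveInOwnStage, the stage name, and the solve count are illustrative:

#include <petscksp.h>

/* Sketch: put only the solve(s) in their own log stage so -log_view reports
   them separately from setup. */
static PetscErrorCode SolveInOwnStage(KSP ksp, Vec b, Vec x)
{
  PetscLogStage  stage;
  PetscInt       i, nsolves = 1;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLogStageRegister("KSP only", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  for (i = 0; i < nsolves; i++) {
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  }
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
)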
>
> Barry
>
>
> > On Jul 29, 2019, at 5:26 PM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> >
> >
> > On Mon, Jul 29, 2019 at 5:31 PM Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
> >
> > I don't understand the notation in the legend on the second page
> >
> > 12,288 cpus and no GPUs ?
> >
> > Yes
> >
> >
> > 24 GPUs? or 6 GPUs
> >
> > 24 virtual, 6 real GPUs per node. The first case is one node, 24
> cores/vGPUs
> >
> >
> > 192 GPUs?
> >
> > 1536 GPUs?
> >
> > 12,288 GPUs? or 12288/4 = 3072 GPUs?
> >
> > All "GPUs" are one core/process/vGPU. So 12288 virtual GPUs and 3072
> physical GPUs.
> >
> > Maybe I should add 'virtual GPUs' and put (4 processes/SUMMIT GPU)
> >
> >
> > So on the largest run using GPUs or not takes pretty much exactly the
> same
> > amount of time?
> >
> > yes. The raw Mat-vec is about 3x faster with ~95K equations/process.
> I've attached the data.
> >
> >
> > What about 6 GPUs vs 24 CPUs? The same amount of time?
> >
> > Can you send some log summaries
> >
> >
> <out_cpu_012288><out_cuda_000024><out_cuda_001536><out_cuda_000192><out_cuda_012288>
>
>