[petsc-dev] MatPinToCPU
Smith, Barry F.
bsmith at mcs.anl.gov
Tue Jul 30 12:47:42 CDT 2019
Sorry, I meant the 24-processor CPU-only run.
> On Jul 30, 2019, at 9:19 AM, Mark Adams <mfadams at lbl.gov> wrote:
>
>
>
> On Mon, Jul 29, 2019 at 11:27 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
> Thanks. Could you please send the 24-processor run with the GPU?
>
> That is in out_cuda_000024....
>
>
> Note that the final column of the table gives the percentage of flops (not rates, actual operations) done on the GPU. For your biggest run:
>
> For the MatMult it is 18 percent and for the KSP solve it is 23 percent. I think this is much too low; we'd like to see well over 90 percent of the flops on the GPU, or 95 or more. Is this because you are forced to put very large matrices on the CPU only?
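> For context, tables like the ones quoted below come from running with -log_view; the GPU columns are typically present when PETSc is built with CUDA support and the CUDA Mat/Vec types are used. A hypothetical invocation (executable name and options assumed):
>
>   mpiexec -n 24 ./app -mat_type aijcusparse -vec_type cuda -log_view
>
> The trailing number in each event row is the GPU flop percentage referred to here.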
>
> Hmm, that is strange. The BLAS1 stuff is 100% on the GPU, but the coarse grids are on the CPU. This could be because it is > 99.5%. And there is this in the last solve phase:
>
> MatMult 679 1.0 5.2220e+00 1.2 7.58e+09 1.3 8.0e+07 1.1e+04 0.0e+00 1 39 14 8 0 3 74 79 60 0 16438647 438720307 578 1.99e+02 519 2.55e+02 18
> MatMultAdd 150 1.0 1.1836e+00 4.7 3.41e+08 1.2 1.0e+07 1.8e+03 0.0e+00 0 2 2 0 0 1 3 10 1 0 3409019 191195194 120 2.48e+01 60 2.25e+00 21
> MatMultTranspose 150 1.0 5.7940e-01 2.4 3.37e+08 1.2 1.0e+07 1.8e+03 0.0e+00 0 2 2 0 0 0 3 10 1 0 6867795 2539317196 38 1.02e+02 150 3.22e+00 92
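> For reference, MatPinToCPU (the routine in the subject line) marks a matrix so that its operations stay on the CPU. A minimal sketch, assuming a set-up PCMG/GAMG preconditioner reachable from a KSP and an illustrative helper name, of how the coarse-level operators could be pinned that way so that only the finest level's MatMult runs on the GPU:
>
>   #include <petscksp.h>
>
>   /* Minimal sketch: pin all coarse levels of a PCMG/GAMG hierarchy to the CPU.
>      Assumes PCSetUp() has already been called; names are illustrative. */
>   static PetscErrorCode PinCoarseLevelsToCPU(KSP ksp)
>   {
>     PC             pc;
>     PetscInt       nlevels, l;
>     PetscErrorCode ierr;
>
>     PetscFunctionBeginUser;
>     ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
>     ierr = PCMGGetLevels(pc, &nlevels);CHKERRQ(ierr);
>     for (l = 0; l < nlevels - 1; l++) {   /* level nlevels-1 is the finest grid */
>       KSP smoother;
>       Mat A, P;
>       ierr = PCMGGetSmoother(pc, l, &smoother);CHKERRQ(ierr);
>       ierr = KSPGetOperators(smoother, &A, &P);CHKERRQ(ierr);
>       ierr = MatPinToCPU(A, PETSC_TRUE);CHKERRQ(ierr);
>       if (P && P != A) {ierr = MatPinToCPU(P, PETSC_TRUE);CHKERRQ(ierr);}
>     }
>     PetscFunctionReturn(0);
>   }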
>
> I have added print statements to MatMult_[CUDA,CPU] and it looks fine. Well over 90% should be on the GPU. I am puzzled. I'll keep digging but the log statements look OK.
>
>
> For the MatMult, if we assume the flop rate of the GPU is 25 times that of the CPU and 18 percent of the flops are done on the GPU, then the time for the GPU run should be about 82.7 percent of the time for the CPU run, but the ratio is 0.90; so where is the extra time? That seems like too much to attribute to communication alone.
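> Spelling that estimate out: with a fraction f = 0.18 of the flops on the GPU at s = 25 times the CPU rate and the rest on the CPU, the expected time relative to an all-CPU run is (1 - f) + f/s = 0.82 + 0.18/25 ≈ 0.827, i.e. the 82.7 percent above, versus the observed 0.90.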
>
> I don't follow this analysis, but there is something funny about the logging ...
>
>
> There is so much information and so much happening in the final stage that it is hard to discern what is killing the performance in the GPU case for the KSP solve. Is there any way you can just have a stage at the end with several KSP solves and nothing else?
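> A stage like that is normally set up with PETSc's PetscLogStage API. A minimal sketch, assuming a KSP named ksp with right-hand side b and solution x already configured, and an illustrative helper name:
>
>   #include <petscksp.h>
>
>   /* Minimal sketch: wrap repeated solves in their own logging stage so that
>      -log_view reports them separately from setup. Names are assumptions. */
>   static PetscErrorCode TimeKSPOnly(KSP ksp, Vec b, Vec x, PetscInt nsolve)
>   {
>     PetscLogStage  stage;
>     PetscInt       i;
>     PetscErrorCode ierr;
>
>     PetscFunctionBeginUser;
>     ierr = PetscLogStageRegister("KSP only", &stage);CHKERRQ(ierr);
>     ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
>     for (i = 0; i < nsolve; i++) {
>       ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
>     }
>     ierr = PetscLogStagePop();CHKERRQ(ierr);
>     PetscFunctionReturn(0);
>   }
>
> Pushing and popping the stage around just the solves keeps the setup events out of that section of the summary.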
>
> I added this, e.g.,
>
> --- Event Stage 7: KSP only
>
> SFBcastOpBegin 263 1.0 8.4140e-03 2.7 0.00e+00 0.0 6.1e+04 2.5e+03 0.0e+00 0 0 15 7 0 1 0 91 98 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFBcastOpEnd 263 1.0 6.6676e-02 6.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 8 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFReduceBegin 48 1.0 4.5977e-04 2.1 0.00e+00 0.0 6.4e+03 6.0e+02 0.0e+00 0 0 2 0 0 0 0 9 2 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFReduceEnd 48 1.0 5.4065e-0321.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> MatMult 215 1.0 3.9271e-01 1.0 6.33e+08 1.4 5.5e+04 2.7e+03 0.0e+00 1 24 14 7 0 83 89 81 95 0 33405 177859 430 1.75e+01 358 2.23e+01 17
> MatMultAdd 48 1.0 3.3079e-02 1.3 3.20e+07 1.3 6.4e+03 6.0e+02 0.0e+00 0 1 2 0 0 7 5 9 2 0 20318 106989 48 2.33e+00 48 2.24e-01 20
> MatMultTranspose 48 1.0 1.1967e-02 1.8 3.15e+07 1.3 6.4e+03 6.0e+02 0.0e+00 0 1 2 0 0 2 4 9 2 0 55325 781863 0 0.00e+00 72 3.23e-01 93
> MatSolve 24 0.0 3.6270e-03 0.0 1.02e+07 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2810 0 0 0.00e+00 0 0.00e+00 0
> MatResidual 48 1.0 8.2272e-02 1.0 1.33e+08 1.4 1.2e+04 2.6e+03 0.0e+00 0 5 3 1 0 17 19 18 20 0 33284 136803 96 3.62e+00 72 4.50e+00 19
> VecTDot 46 1.0 6.1646e-03 1.3 1.13e+06 1.2 0.0e+00 0.0e+00 4.6e+01 0 0 0 0 2 1 0 0 0 66 4109 6814 0 0.00e+00 0 0.00e+00 100
> VecNorm 24 1.0 5.2724e-03 1.9 5.90e+05 1.2 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 1 0 0 0 34 2507 5050 0 0.00e+00 0 0.00e+00 100
> VecCopy 146 1.0 3.9029e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 24 9.87e-02 0
> VecSet 169 1.0 1.3301e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecAXPY 46 1.0 1.5963e-03 1.2 1.13e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 15870 23070 0 0.00e+00 0 0.00e+00 100
> VecAYPX 310 1.0 1.3059e-02 1.1 4.25e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 1 0 0 0 7273 12000 48 1.97e-01 0 0.00e+00 100
> VecAXPBYCZ 96 1.0 6.8591e-03 1.2 6.19e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 1 0 0 0 20134 46381 0 0.00e+00 0 0.00e+00 100
> VecPointwiseMult 192 1.0 7.1075e-03 1.2 1.24e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 3886 4184 24 9.87e-02 0 0.00e+00 100
> VecScatterBegin 311 1.0 1.1026e-02 2.0 0.00e+00 0.0 6.8e+04 2.3e+03 0.0e+00 0 0 17 7 0 2 0100100 0 0 0 0 0.00e+00 72 3.50e-01 0
> VecScatterEnd 311 1.0 7.2357e-02 7.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecCUDACopyTo 550 1.0 1.5607e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 550 2.01e+01 0 0.00e+00 0
> VecCUDACopyFrom 478 1.0 1.7491e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 0 0.00e+00 478 2.29e+01 0
> VecCopyFromSome 24 1.0 7.9868e-04 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 24 1.26e-01 0
> KSPSolve 1 1.0 4.6980e-01 1.0 7.11e+08 1.4 6.8e+04 2.3e+03 7.0e+01 1 28 17 7 3 100100100100100 31476 83700 550 2.01e+01 502 2.30e+01 23
> PCSetUpOnBlocks 24 1.0 4.2097e-05 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> PCApply 24 1.0 3.8880e-01 1.0 6.02e+08 1.4 6.2e+04 2.2e+03 0.0e+00 1 23 16 6 0 83 84 91 86 0 32127 96704 504 1.71e+01 456 1.88e+01 24
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
>
> Barry
>
>
> > On Jul 29, 2019, at 5:26 PM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> >
> >
> > On Mon, Jul 29, 2019 at 5:31 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> >
> > I don't understand the notation in the legend on the second page
> >
> > 12,288 CPUs and no GPUs?
> >
> > Yes
> >
> >
> > 24 GPUs? or 6 GPUs?
> >
> > 24 virtual, 6 real GPUs per node. The first case is one node, 24 cores/vGPUs
> >
> >
> > 192 GPUs?
> >
> > 1536 GPUs?
> >
> > 12,288 GPUs? or 12288/4 = 3072 GPUs?
> >
> > Each "GPU" is one core/process/vGPU. So 12,288 virtual GPUs and 3072 physical GPUs.
> >
> > Maybe I should add 'virtual GPUs' and put (4 processes/SUMMIT GPU)
> >
> >
> > So on the largest run using GPUs or not takes pretty much exactly the same
> > amount of time?
> >
> > yes. The raw Mat-vec is about 3x faster with ~95K equations/process. I've attached the data.
> >
> >
> > What about 6 GPUs vs 24 CPUs? The same amount of time?
> >
> > Can you send some log summaries?
> >
> > <out_cpu_012288><out_cuda_000024><out_cuda_001536><out_cuda_000192><out_cuda_012288>
>