<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 29, 2019 at 11:27 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
>   Thanks. Could you please send the 24 processor run with the GPU?

That is in out_cuda_000024....
>   Note the final column of the table gives you the percentage of flops (not rates, actual operations) on the GPU. For your biggest run:
>
>   For the MatMult it is 18 percent and for the KSP solve it is 23 percent. I think this is much too low; we'd like to see well over 90 percent of the flops on the GPU, or 95 or more. Is this because you are forced to put very large matrices only on the CPU?

Hmm, that is strange. The BLAS1 stuff is 100% on the GPU, and while the coarse grids are on the CPU they are small, so it should still be > 99.5%. And there is this in the last solve phase:

MatMult 679 1.0 5.2220e+00 1.2 7.58e+09 1.3 8.0e+07 1.1e+04 0.0e+00 1 39 14 8 0 3 74 79 60 0 16438647 438720307 578 1.99e+02 519 2.55e+02 18
MatMultAdd 150 1.0 1.1836e+00 4.7 3.41e+08 1.2 1.0e+07 1.8e+03 0.0e+00 0 2 2 0 0 1 3 10 1 0 3409019 191195194 120 2.48e+01 60 2.25e+00 21
MatMultTranspose 150 1.0 5.7940e-01 2.4 3.37e+08 1.2 1.0e+07 1.8e+03 0.0e+00 0 2 2 0 0 0 3 10 1 0 6867795 2539317196 38 1.02e+02 150 3.22e+00 92

I have added print statements to MatMult_[CUDA,CPU] and it looks fine: well over 90% should be on the GPU. I am puzzled. I'll keep digging, but the log statements look OK.
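A related check that does not require patching the MatMult implementations is to query each operator's run-time type, since that determines whether the CUDA back end (seqaijcusparse/mpiaijcusparse) or the plain CPU AIJ MatMult gets dispatched. A minimal sketch; the helper name ReportMatBackend is made up and not part of PETSc:

#include <petscmat.h>

/* Illustrative helper (not part of PETSc): print a matrix's run-time type so
   we can see whether MatMult will dispatch to the CUDA back end
   (seqaijcusparse/mpiaijcusparse) or to the plain CPU AIJ code. */
static PetscErrorCode ReportMatBackend(Mat A, const char label[])
{
  MatType        type;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatGetType(A, &type);CHKERRQ(ierr);
  ierr = PetscPrintf(PetscObjectComm((PetscObject)A), "%s: matrix type %s\n", label, type);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Calling something like this on the fine-grid matrix and on each multigrid level's operator just before the solve would show which levels actually use the CUDA type.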
>   For the MatMult, if we assume the flop rate for the GPU is 25 times as fast as the CPU and 18 percent of the flops are done on the GPU, then the time for the GPU run should be 82.7 percent of the time for the CPU run, but it is 0.90; so where is the extra time? That seems like too much for just the communication.

I don't follow this analysis, but there is something funny about the logging ...
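The 82.7 percent figure is presumably the simple two-speed split: with f the fraction of flops done on the GPU and s the assumed GPU/CPU speed ratio,

    T_gpu_run / T_cpu_run = (1 - f) + f/s = 0.82 + 0.18/25 ~ 0.827,

so with only 18 percent of the flops offloaded, even an infinitely fast GPU could not bring the time below about 82 percent of the CPU-only time.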
>   There is so much information and so much happening in the final stage that it is hard to discern what is killing the performance in the GPU case for the KSP solve. Any way you can just have a stage at the end with several KSP solves and nothing else?

I added this, e.g.:

--- Event Stage 7: KSP only

SFBcastOpBegin 263 1.0 8.4140e-03 2.7 0.00e+00 0.0 6.1e+04 2.5e+03 0.0e+00 0 0 15 7 0 1 0 91 98 0 0 0 0 0.00e+00 0 0.00e+00 0
SFBcastOpEnd 263 1.0 6.6676e-02 6.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 8 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
SFReduceBegin 48 1.0 4.5977e-04 2.1 0.00e+00 0.0 6.4e+03 6.0e+02 0.0e+00 0 0 2 0 0 0 0 9 2 0 0 0 0 0.00e+00 0 0.00e+00 0
SFReduceEnd 48 1.0 5.4065e-0321.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
MatMult 215 1.0 3.9271e-01 1.0 6.33e+08 1.4 5.5e+04 2.7e+03 0.0e+00 1 24 14 7 0 83 89 81 95 0 33405 177859 430 1.75e+01 358 2.23e+01 17
MatMultAdd 48 1.0 3.3079e-02 1.3 3.20e+07 1.3 6.4e+03 6.0e+02 0.0e+00 0 1 2 0 0 7 5 9 2 0 20318 106989 48 2.33e+00 48 2.24e-01 20
MatMultTranspose 48 1.0 1.1967e-02 1.8 3.15e+07 1.3 6.4e+03 6.0e+02 0.0e+00 0 1 2 0 0 2 4 9 2 0 55325 781863 0 0.00e+00 72 3.23e-01 93
MatSolve 24 0.0 3.6270e-03 0.0 1.02e+07 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2810 0 0 0.00e+00 0 0.00e+00 0
MatResidual 48 1.0 8.2272e-02 1.0 1.33e+08 1.4 1.2e+04 2.6e+03 0.0e+00 0 5 3 1 0 17 19 18 20 0 33284 136803 96 3.62e+00 72 4.50e+00 19
VecTDot 46 1.0 6.1646e-03 1.3 1.13e+06 1.2 0.0e+00 0.0e+00 4.6e+01 0 0 0 0 2 1 0 0 0 66 4109 6814 0 0.00e+00 0 0.00e+00 100
VecNorm 24 1.0 5.2724e-03 1.9 5.90e+05 1.2 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 1 1 0 0 0 34 2507 5050 0 0.00e+00 0 0.00e+00 100
VecCopy 146 1.0 3.9029e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 24 9.87e-02 0
VecSet 169 1.0 1.3301e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecAXPY 46 1.0 1.5963e-03 1.2 1.13e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 15870 23070 0 0.00e+00 0 0.00e+00 100
VecAYPX 310 1.0 1.3059e-02 1.1 4.25e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 1 0 0 0 7273 12000 48 1.97e-01 0 0.00e+00 100
VecAXPBYCZ 96 1.0 6.8591e-03 1.2 6.19e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 1 0 0 0 20134 46381 0 0.00e+00 0 0.00e+00 100
VecPointwiseMult 192 1.0 7.1075e-03 1.2 1.24e+06 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 3886 4184 24 9.87e-02 0 0.00e+00 100
VecScatterBegin 311 1.0 1.1026e-02 2.0 0.00e+00 0.0 6.8e+04 2.3e+03 0.0e+00 0 0 17 7 0 2 0100100 0 0 0 0 0.00e+00 72 3.50e-01 0
VecScatterEnd 311 1.0 7.2357e-02 7.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecCUDACopyTo 550 1.0 1.5607e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 550 2.01e+01 0 0.00e+00 0
VecCUDACopyFrom 478 1.0 1.7491e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 0 0.00e+00 478 2.29e+01 0
VecCopyFromSome 24 1.0 7.9868e-04 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 24 1.26e-01 0
KSPSolve 1 1.0 4.6980e-01 1.0 7.11e+08 1.4 6.8e+04 2.3e+03 7.0e+01 1 28 17 7 3 100100100100100 31476 83700 550 2.01e+01 502 2.30e+01 23
PCSetUpOnBlocks 24 1.0 4.2097e-05 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
PCApply 24 1.0 3.8880e-01 1.0 6.02e+08 1.4 6.2e+04 2.2e+03 0.0e+00 1 23 16 6 0 83 84 91 86 0 32127 96704 504 1.71e+01 456 1.88e+01 24
---------------------------------------------------------------------------------------------------------------------------------------------------------------
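A stage like that is typically set up with PETSc's stage-logging API. A minimal sketch, assuming a wrapper routine whose name and nsolve parameter are purely illustrative:

#include <petscksp.h>

/* Illustrative wrapper (names are made up): register a dedicated logging
   stage and run the solves inside it, so -log_view reports the solve phase
   by itself, as in the "Event Stage 7: KSP only" table above. */
PetscErrorCode RunTimedSolves(KSP ksp, Vec b, Vec x, PetscInt nsolve)
{
  PetscLogStage  stage;
  PetscInt       i;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscLogStageRegister("KSP only", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  for (i = 0; i < nsolve; i++) {
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  }
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Everything executed between the push and the pop is then attributed to the "KSP only" stage in the -log_view output.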
>
>   Barry
>
> On Jul 29, 2019, at 5:26 PM, Mark Adams <mfadams@lbl.gov> wrote:
>
>
>
> On Mon, Jul 29, 2019 at 5:31 PM Smith, Barry F. <bsmith@mcs.anl.gov> wrote:
>
> I don't understand the notation in the legend on the second page
>
> 12,288 cpus and no GPUs ?
>
> Yes
>
>
> 24 GPUs? or 6 GPUs
>
> 24 virtual, 6 real GPUs per node. The first case is one node, 24 cores/vGPUs
>
>
> 192 GPUs?
>
> 1536 GPUs?
>
> 12,288 GPUs? or 12288/4 = 3072 GPUs?
>
> All "GPUs" are one core/process/vGPU. So 12288 virtual GPUs and 3072 physical GPUs.
>
> Maybe I should add 'virtual GPUs' and put (4 processes/SUMMIT GPU)
>
>
> So on the largest run using GPUs or not takes pretty much exactly the same
> amount of time?
>
> yes. The raw Mat-vec is about 3x faster with ~95K equations/process. I've attached the data.
>
>
> What about 6 GPUs vs 24 CPUs ? Same equal amount of time.
>
> Can you send some log summaries
>
> <out_cpu_012288><out_cuda_000024><out_cuda_001536><out_cuda_000192><out_cuda_012288>