> On Dec 11, 2021, at 11:52 AM, Rohan Yadav <rohany@alumni.cmu.edu> wrote:
>
> Thanks Barry!
>
> > The flop rates for the sparse matrix-vector product are very low for an IBM Power 9. This is probably, at least partially, because the code is configured without any optimization flags. You should run ./configure with additional options something like COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3", but please consult the IBM documentation to determine exactly what optimization flags to use for mpixlc and mpixlf.
>
> This is a great catch! I was using the pre-built PETSc provided on Lassen, so I'm very surprised that it wasn't built with optimizations. I'll try building with optimizations enabled and see what the performance is.
>
> > When running in parallel I would expect the "sweet spot" of optimal performance to be roughly around 20 MPI ranks, since the memory bandwidth of the CPU will be saturated long before you reach 40 ranks. I would recommend running with 1, 2, 3, 4, ... ranks to determine the optimal number of ranks. Also please consult the documentation on the placement of the ranks onto the cores of the CPU; it is crucial to get this right, and the default is likely far from correct. Essentially you want each core in use to be as far away from the other cores in use as possible, to maximize the achievable memory bandwidth: the first core should be on the first socket, the second core on the second socket, the third core back on the first socket but far from the first core (that is, not sharing L1 or L2 cache with it), and so on.
>
> I did a sweep of rank counts already and found that 40 is the best performing on this system.

It may be different with optimization turned on. I am surprised that it is 40; usually it is lower.

> > The arabic-2005 matrix is not at all representative of the types of matrices PETSc is designed to solve. It does not come from a PDE and does not have the stencil structure of a matrix that comes from a PDE. PETSc's performance on such a matrix will be much lower than its performance on PDE matrices, since PETSc is not designed for this type of matrix. Depending on the goals of your work, you may want to use different matrices that come from PDEs.
>
> I'm probably not using PETSc for solvers right now, but more so for distributed sparse linear algebra operations. Is the matrix structure going to affect PETSc's performance that much for these kinds of operations?

Yes, especially with multiple MPI ranks. The reason is that for arabic-2005-like graphs, the PETSc parallel CSR format, split by rows across MPI ranks, is not a good layout of the data: it induces a lot of communication.
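
Concretely, a PETSc parallel AIJ matrix assigns each rank a contiguous block of rows, and during MatMult every nonzero whose column lies outside a rank's own rows forces the matching entry of the input vector to be fetched from the rank that owns it. For a banded PDE matrix those off-owned columns typically cluster near the block boundaries, so little is communicated; for a web graph like arabic-2005 they are scattered across all ranks. A minimal sketch of that row-wise ownership (the 1,000,000-row size is just a placeholder):

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  PetscInt       rstart, rend;
  PetscMPIInt    rank;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);

  /* A parallel AIJ matrix; PETSc hands each rank a contiguous block of rows. */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 1000000, 1000000);CHKERRQ(ierr);
  ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);

  /* Rows [rstart, rend) live on this rank; in MatMult, any nonzero with a
     column outside this range needs that entry of x from another rank. */
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD, "[%d] owns rows %d to %d\n",
                                 (int)rank, (int)rstart, (int)rend);CHKERRQ(ierr);
  ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}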

> Rohan
>
> On Sat, Dec 11, 2021 at 11:40 AM Barry Smith <bsmith@petsc.dev> wrote:
>
> > Rohan,
> >
> > The flop rates for the sparse matrix-vector product are very low for an IBM Power 9. This is probably, at least partially, because the code is configured without any optimization flags. You should run ./configure with additional options something like COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3", but please consult the IBM documentation to determine exactly what optimization flags to use for mpixlc and mpixlf.
> >
> > When running in parallel I would expect the "sweet spot" of optimal performance to be roughly around 20 MPI ranks, since the memory bandwidth of the CPU will be saturated long before you reach 40 ranks. I would recommend running with 1, 2, 3, 4, ... ranks to determine the optimal number of ranks. Also please consult the documentation on the placement of the ranks onto the cores of the CPU; it is crucial to get this right, and the default is likely far from correct. Essentially you want each core in use to be as far away from the other cores in use as possible, to maximize the achievable memory bandwidth: the first core should be on the first socket, the second core on the second socket, the third core back on the first socket but far from the first core (that is, not sharing L1 or L2 cache with it), and so on.
> >
> > The arabic-2005 matrix is not at all representative of the types of matrices PETSc is designed to solve. It does not come from a PDE and does not have the stencil structure of a matrix that comes from a PDE. PETSc's performance on such a matrix will be much lower than its performance on PDE matrices, since PETSc is not designed for this type of matrix. Depending on the goals of your work, you may want to use different matrices that come from PDEs.
> >
> > Regarding loading the matrix: yes, it is expected that one uses a custom stand-alone utility to read in SuiteSparse-formatted matrices and convert them to the PETSc binary format; we do have a couple of examples of how such code can be written in src/mat/tutorials or tests.
> >
> > Barry
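
One way to structure such a stand-alone converter is to assemble a Mat from the triplets parsed out of the Matrix Market / SuiteSparse file and then write it with a binary viewer, so the benchmark can later read it with MatLoad. A rough sketch, with the actual file parsing replaced by placeholder arrays (a real converter would count nonzeros per row in a first pass and preallocate with that array instead of a flat count):

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  PetscViewer    viewer;
  PetscInt       m = 4, n = 4, nz = 4;                        /* placeholder sizes   */
  PetscInt       row[] = {0, 1, 2, 3}, col[] = {0, 1, 2, 3};  /* placeholder pattern */
  PetscScalar    val[] = {1.0, 2.0, 3.0, 4.0};                /* placeholder values  */
  PetscInt       k;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Preallocate (here 1 nonzero per row, matching the placeholder data),
     insert the triplets, and assemble. */
  ierr = MatCreateSeqAIJ(PETSC_COMM_SELF, m, n, 1, NULL, &A);CHKERRQ(ierr);
  for (k = 0; k < nz; k++) {
    ierr = MatSetValue(A, row[k], col[k], val[k], INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* Dump in PETSc binary format; the benchmark can then load it, in parallel,
     with MatLoad(). The output file name here is only an example. */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_SELF, "arabic-2005.petsc", FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
  ierr = MatView(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Skipping per-row preallocation is a common reason entry-by-entry insertion becomes very slow for large matrices, which matches the loading experience described later in the thread.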

> > On Dec 10, 2021, at 6:54 PM, Rohan Yadav <rohany@alumni.cmu.edu> wrote:
> >
> > > Hi, I'm Rohan, a student working on compilation techniques for distributed tensor computations. I'm looking at using PETSc as a baseline for experiments I'm running, and want to understand if I'm using PETSc as it was intended to achieve high performance, and if the performance I'm seeing is expected. Currently, I'm just looking at SpMV operations.
> > >
> > > My experiments are run on the Lassen supercomputer (https://hpc.llnl.gov/hardware/platforms/lassen). The system has 40 CPUs, 4 V100s, and an InfiniBand interconnect. A visualization of the architecture is here: https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png.
> > >
> > > As of now, I'm trying to understand the single-node performance of PETSc, as the scaling performance onto multiple nodes appears to be as I expect. I'm using the arabic-2005 sparse matrix from the SuiteSparse matrix collection, detailed here: https://sparse.tamu.edu/LAW/arabic-2005.
> > > As a trusted baseline, I am comparing against SpMV code generated by the TACO compiler (http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races).
> > >
> > > My experiments find that PETSc is roughly 4 times slower on a single thread and node than the kernel generated by TACO:
> > >
> > > PETSc: 1 thread: 5694.72 ms, 1 node 40 threads: 262.6 ms.
> > > TACO:  1 thread: 1341 ms,    1 node 40 threads: 86 ms.
> > >
> > > My code using PETSc is here: https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38.
> > >
> > > Runs from 1 thread and 1 node with -log_view are attached to the email. The command lines for each were as follows:
> > >
> > > 1 node, 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
> > > 1 node, 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
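
For reference, the core of such a benchmark, loading a PETSc binary matrix and timing repeated MatMult calls after a warmup, can be written roughly as follows. This is an illustrative sketch rather than the code in the linked repository; the -matrix, -warmup, and -n option names simply mirror the command lines above.

#include <petscmat.h>
#include <petsctime.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, y;
  PetscViewer    viewer;
  char           file[PETSC_MAX_PATH_LEN];
  PetscInt       i, warmup = 10, niter = 20;
  PetscLogDouble t0, t1;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = PetscOptionsGetString(NULL, NULL, "-matrix", file, sizeof(file), NULL);CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(NULL, NULL, "-warmup", &warmup, NULL);CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(NULL, NULL, "-n", &niter, NULL);CHKERRQ(ierr);

  /* MatLoad reads the PETSc binary file and distributes the rows over the ranks. */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = MatLoad(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);

  /* Warmup multiplies are not timed. */
  for (i = 0; i < warmup; i++) { ierr = MatMult(A, x, y);CHKERRQ(ierr); }

  ierr = PetscTime(&t0);CHKERRQ(ierr);
  for (i = 0; i < niter; i++) { ierr = MatMult(A, x, y);CHKERRQ(ierr); }
  ierr = PetscTime(&t1);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "average MatMult time: %g ms\n",
                     1000.0 * (t1 - t0) / (double)niter);CHKERRQ(ierr);

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

With -log_view, the MatMult row of the log output reports the same timing along with flop rates and message counts.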

> > > In addition to these benchmarking concerns, I wanted to share my experience trying to load data from Matrix Market files into PETSc, which ended up being much more difficult than I anticipated. Essentially, iterating through the Matrix Market file and using `write` to insert entries into a `Mat` was extremely slow. In order to get reasonable performance, I had to use an external utility to construct a CSR matrix, and then pass the arrays from the CSR matrix into `MatCreateSeqAIJWithArrays`. I couldn't find any further guidance on the PETSc forums or Google, so I wanted to know if this was the right way to go.
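
Handing prebuilt CSR arrays to PETSc in that way looks roughly like the sketch below; the tiny 2x3 matrix is a stand-in for the real data. Note that MatCreateSeqAIJWithArrays uses the arrays in place rather than copying them, so they must stay allocated for the lifetime of the Mat (and are not freed by MatDestroy).

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  /* CSR for the 2x3 matrix [[1 0 2], [0 3 0]]:
     row pointers, column indices, and values. */
  PetscInt       ia[] = {0, 2, 3};
  PetscInt       ja[] = {0, 2, 1};
  PetscScalar    va[] = {1.0, 2.0, 3.0};
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Wrap the CSR arrays directly; no copy and no further assembly needed. */
  ierr = MatCreateSeqAIJWithArrays(PETSC_COMM_SELF, 2, 3, ia, ja, va, &A);CHKERRQ(ierr);
  ierr = MatView(A, PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

MatCreateMPIAIJWithArrays is the parallel analogue, taking each rank's local rows in CSR form.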
<span id="gmail-m_-8813746456660801058cid:f_kx11ocav0" class=""><petsc-1-node-1-thread.txt></span><span id="gmail-m_-8813746456660801058cid:f_kx11ocb51" class=""><petsc-1-node-40-threads.txt></span></div></blockquote></div><br class=""></div></div></blockquote></div>