[petsc-dev] Questions around benchmarking and data loading with PETSc

Sat Dec 11 10:40:01 CST 2021

   Rohan,

     The flop rates for the sparse matrix-vector product are very low for an IBM Power 9. This is probably, at least partially, because the code is configured without any optimization flags. You should rerun ./configure with additional options along the lines of COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3", but please consult the IBM documentation to determine exactly which optimization flags to use for mpixlc and mpixlf.

    When running in parallel I would expect the "sweet spot" of optimal performance to be roughly around 20 MPI ranks, since the memory bandwidth of the CPU will be saturated long before you reach 40 ranks. I would recommend running with 1, 2, 3, 4, ... ranks to determine the optimal number of ranks. Also please consult the documentation on the placement of the ranks onto the cores of the CPU; it is crucial to get this right, and the default is likely far from correct. Essentially you want each core used to be as far away from the other cores being used as possible, to maximize the achievable memory bandwidth. So the first core should be on the first socket, the second core on the second socket, the third core back on the first socket far from the first core (that is, it should not share L1 or L2 cache with the first core), etc.
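
    One way to read the results of such a sweep is sketched below in Python. The helper names sweet_spot and speedup_table and the 5% tolerance are illustrative assumptions, not part of PETSc; the only measurements used are the two PETSc timings reported in the original message.

```python
def sweet_spot(timings_ms, tolerance=0.05):
    """Smallest rank count whose time is within `tolerance` of the best time.

    timings_ms maps number of MPI ranks -> measured SpMV time in ms.
    The 5% default tolerance is an arbitrary, illustrative choice.
    """
    best = min(timings_ms.values())
    for ranks in sorted(timings_ms):
        if timings_ms[ranks] <= best * (1.0 + tolerance):
            return ranks


def speedup_table(timings_ms):
    """Map ranks -> (speedup, parallel efficiency) relative to the 1-rank run."""
    base = timings_ms[1]  # requires a 1-rank measurement
    return {r: (base / t, base / t / r) for r, t in sorted(timings_ms.items())}


if __name__ == "__main__":
    # The two PETSc timings quoted in the original message (ms).
    reported = {1: 5694.72, 40: 262.6}
    for ranks, (speedup, eff) in speedup_table(reported).items():
        print(f"{ranks:3d} ranks: speedup {speedup:6.2f}, efficiency {eff:5.2f}")
```

    On those two numbers the 40-rank run is about 21.7 times faster than the 1-rank run, roughly 54% parallel efficiency, which is consistent with the memory bandwidth saturating well before 40 ranks. With a full sweep, sweet_spot would report the smallest rank count that is already close to the best time.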

   The arabic-2005  matrix is not at all representative of the types of matrices PETSc is designed to solve. It does not come from a PDE and does not have the stencil structure of a matrix that comes from a PDE. PETSc's performance on such a matrix will be much lower than its performance for PDE matrices since PETSc is not designed for this type of matrix. Depending on the goals of your work you may want to use different matrices that come from PDEs.
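
   To make "stencil structure" concrete, here is a minimal Python sketch that assembles the 5-point Laplacian on an n x n grid in CSR form and summarizes its sparsity pattern. The function names are hypothetical and nothing here calls PETSc; it only illustrates the regular pattern that a simple PDE matrix has.

```python
def laplacian_5pt_csr(n):
    """CSR arrays (row_ptr, col_idx, vals) for the 5-point Laplacian
    on an n x n grid with row-major numbering of the grid points."""
    row_ptr, col_idx, vals = [0], [], []
    for i in range(n):
        for j in range(n):
            row = i * n + j
            if i > 0:            # neighbour in the previous grid row
                col_idx.append(row - n); vals.append(-1.0)
            if j > 0:            # left neighbour
                col_idx.append(row - 1); vals.append(-1.0)
            col_idx.append(row); vals.append(4.0)   # diagonal
            if j < n - 1:        # right neighbour
                col_idx.append(row + 1); vals.append(-1.0)
            if i < n - 1:        # neighbour in the next grid row
                col_idx.append(row + n); vals.append(-1.0)
            row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


def structure_summary(row_ptr, col_idx):
    """Return (min row length, max row length, bandwidth) of a CSR pattern."""
    nrows = len(row_ptr) - 1
    lengths = [row_ptr[r + 1] - row_ptr[r] for r in range(nrows)]
    bandwidth = max(abs(col_idx[k] - r)
                    for r in range(nrows)
                    for k in range(row_ptr[r], row_ptr[r + 1]))
    return min(lengths), max(lengths), bandwidth
```

   For n = 4 every row has between 3 and 5 nonzeros and no entry lies more than 4 columns from the diagonal. A web-graph matrix such as arabic-2005 typically has no such regularity, which is the structural difference referred to above.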

  Regarding loading the matrix: yes, it is expected that one uses a custom stand-alone utility to read in SuiteSparse-formatted matrices and convert them to the PETSc binary format; we do have a couple of examples of how such code can be written in src/mat/tutorials or tests.
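
  As a rough illustration of what such a utility does, here is a small pure-Python sketch that reads a coordinate-format Matrix Market file and writes the PETSc binary matrix layout. It is not one of the PETSc examples. It assumes a default PETSc build (32-bit PetscInt, real double-precision PetscScalar) and the binary layout as I understand it: big-endian, a header of class id 1211216, rows, columns and nonzero count, followed by the per-row nonzero counts, the column indices and the values. The function names are hypothetical, duplicate entries are not merged, and plain Python like this would be far too slow for a matrix the size of arabic-2005; it is only meant to show the file layout.

```python
import struct

MAT_FILE_CLASSID = 1211216  # class id that marks a Mat in a PETSc binary file


def read_matrix_market(path):
    """Parse a coordinate Matrix Market file (real, integer or pattern;
    general or symmetric) into (nrows, ncols, list of (i, j, value))."""
    with open(path) as f:
        header = f.readline().lower().split()
        if header[:3] != ["%%matrixmarket", "matrix", "coordinate"]:
            raise ValueError("only coordinate-format matrices are handled")
        field, symmetry = header[3], header[4]
        if field == "complex" or symmetry not in ("general", "symmetric"):
            raise ValueError("unsupported field or symmetry")
        # Skip comment and blank lines; the first remaining line is the size line.
        lines = (ln for ln in f if ln.strip() and not ln.startswith("%"))
        nrows, ncols, nnz = map(int, next(lines).split())
        triplets = []
        for _ in range(nnz):
            parts = next(lines).split()
            i, j = int(parts[0]) - 1, int(parts[1]) - 1  # 1-based -> 0-based
            v = 1.0 if field == "pattern" else float(parts[2])
            triplets.append((i, j, v))
            if symmetry == "symmetric" and i != j:
                triplets.append((j, i, v))  # mirror the stored triangle
    return nrows, ncols, triplets


def write_petsc_binary(path, nrows, ncols, triplets):
    """Write the triplets as a PETSc binary matrix (assumed layout, see above)."""
    triplets = sorted(triplets)  # row by row, columns ascending within a row
    row_counts = [0] * nrows
    for i, _, _ in triplets:
        row_counts[i] += 1
    nnz = len(triplets)
    with open(path, "wb") as f:
        f.write(struct.pack(">4i", MAT_FILE_CLASSID, nrows, ncols, nnz))
        f.write(struct.pack(f">{nrows}i", *row_counts))
        f.write(struct.pack(f">{nnz}i", *(j for _, j, _ in triplets)))
        f.write(struct.pack(f">{nnz}d", *(v for _, _, v in triplets)))
```

  A file written this way should then be readable with MatLoad() through a viewer opened by PetscViewerBinaryOpen() in FILE_MODE_READ, so the benchmark itself never has to insert entries one at a time.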

 Barry

> On Dec 10, 2021, at 6:54 PM, Rohan Yadav <rohany at alumni.cmu.edu> wrote:
> 
> Hi, I’m Rohan, a student working on compilation techniques for distributed tensor computations. I’m looking at using PETSc as a baseline for experiments I’m running, and want to understand if I’m using PETSc as it was intended to achieve high performance, and if the performance I’m seeing is expected. Currently, I’m just looking at SpMV operations.
> 
> My experiments are run on the Lassen Supercomputer (https://hpc.llnl.gov/hardware/platforms/lassen). The system has 40 CPUs, 4 V100s and an Infiniband interconnect. A visualization of the architecture is here: https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png.
> 
> As of now, I’m trying to understand the single-node performance of PETSc, as the scaling performance onto multiple nodes appears to be as I expect. I’m using the arabic-2005 sparse matrix from the SuiteSparse matrix collection, detailed here: https://sparse.tamu.edu/LAW/arabic-2005. As a trusted baseline, I am comparing against SpMV code generated by the TACO compiler (http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races).
> 
> My experiments find that PETSc is roughly 4 times slower on a single thread and node than the kernel generated by TACO:
> 
> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
> 
> My code using PETSc is here: https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38.
> 
> Runs from 1 thread and 1 node with -log_view are attached to the email. The command lines for each were as follows:
> 
> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
> 
> 
> In addition to these benchmarking concerns, I wanted to share my experiences trying to load data from Matrix Market files into PETSc, which ended up being much more difficult than I anticipated. Essentially, trying to iterate through the Matrix Market files and using `write` to insert entries into a `Mat` was extremely slow. In order to get reasonable performance, I had to use an external utility to basically construct a CSR matrix, and then pass the arrays from the CSR Matrix into `MatCreateSeqAIJWithArrays`. I couldn’t find any more guidance on PETSc forums or Google, so I wanted to know if this was the right way to go.
> 
> Thanks,
> 
> Rohan Yadav
> <petsc-1-node-1-thread.txt><petsc-1-node-40-threads.txt>
