[petsc-dev] Questions around benchmarking and data loading with PETSc

Sat Dec 11 10:52:14 CST 2021

Thanks Barry!

> The flop rates for the sparse matrix-vector product are very low for
> an IBM Power 9. This is probably, at least partially, because the code is
> configured without any optimization flags. You should run ./configure with
> additional options something like COPTFLAGS="-O3" CXXOPTFLAGS="-O3"
> FOPTFLAGS="-O3" but please consult the IBM documentation to determine
> exactly what optimization flags to use for mpixlc and mpixlf.

This is a great catch! I was using the pre-built petsc provided on Lassen,
so I'm very surprised that it wasn't built with optimizations. I'll try
building with optimizations enabled and see what the performance is.

> When running in parallel I would expect the "sweet spot" of optimal
> performance to be roughly around 20 MPI ranks since the memory bandwidth of
> the CPU will be saturated long before you reach 40 ranks. I would recommend
> running with 1, 2, 3, 4, .... ranks to determine the optimal number of
> ranks. Also please consult the documentation on the placement of the ranks
> into the cores of the CPU; it is crucial to get this right and likely the
> default is far from correct. Essentially you want each core used to be as
> far away from the other cores being used as possible to maximize the
> achievable memory bandwidth. So the first core should be on the first
> socket, the second core on the second socket, the third core back on the
> first socket far from the first core (that is it should not share L1 or L2
> cache with the first core), etc.

I did a sweep of rank counts already and found that 40 is the best
performing on this system.
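
For reference, the placement described in the quoted paragraph can be
sketched as a rank-to-core mapping. This is only a minimal sketch under
stated assumptions, not something taken from PETSc or the Lassen
documentation: `Placement` and `place_rank` are hypothetical names, the node
is assumed to have 2 sockets with 20 usable cores each, and cores 2k and 2k+1
on a socket are assumed to share a cache. It alternates sockets first, then
fills one core of every assumed pair before touching the second.

```cpp
#include <cassert>

// Hypothetical helper type: where a given MPI rank should be bound.
struct Placement {
  int socket;  // socket index, 0-based
  int core;    // core index within that socket, 0-based
};

// Sketch of a "spread out" placement.
// Assumption (not verified for any specific machine): cores 2k and 2k+1
// on a socket share a cache, so even-indexed cores are used first.
Placement place_rank(int rank, int sockets, int cores_per_socket) {
  assert(sockets > 0 && cores_per_socket > 0);
  assert(rank >= 0 && rank < sockets * cores_per_socket);

  // Alternate sockets: rank 0 -> socket 0, rank 1 -> socket 1, ...
  int socket = rank % sockets;
  // k counts how many earlier ranks already landed on this socket.
  int k = rank / sockets;

  // Number of even-indexed cores on one socket.
  int evens = (cores_per_socket + 1) / 2;
  // Use even cores 0, 2, 4, ... first, then odd cores 1, 3, 5, ...
  int core = (k < evens) ? 2 * k : 2 * (k - evens) + 1;
  return Placement{socket, core};
}
```

With `place_rank(r, 2, 20)`, ranks 0 and 1 land on core 0 of sockets 0 and 1,
rank 2 lands on core 2 of socket 0, and a second core of an assumed pair is
used only from rank 20 onward. Turning the mapping into actual bindings
depends on the job launcher's options.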

> The arabic-2005 matrix is not at all representative of the types of
> matrices PETSc is designed to solve. It does not come from a PDE and does
> not have the stencil structure of a matrix that comes from a PDE. PETSc's
> performance on such a matrix will be much lower than its performance for
> PDE matrices since PETSc is not designed for this type of matrix. Depending
> on the goals of your work you may want to use different matrices that come
> from PDEs.

I'm probably not using PETSc for solvers right now, but more so for
distributed sparse linear algebra operations. Is the matrix structure going
to affect PETSc's performance that much for these kinds of operations?
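
To make the question concrete, here is a plain CSR sparse matrix-vector
product. It is a minimal sketch for illustration, not PETSc's implementation
and not the TACO-generated kernel; `csr_spmv` and its argument names are
hypothetical. The only place the sparsity structure enters is the indirect
read `x[col_idx[j]]`: with a stencil-like matrix the column indices in a row
are close together, while for a web graph they can be scattered across the
whole vector.

```cpp
#include <cstddef>
#include <vector>

// Computes y = A * x for a matrix A stored in CSR form.
// row_ptr has (rows + 1) entries; col_idx and vals have one entry per nonzero.
std::vector<double> csr_spmv(const std::vector<int>& row_ptr,
                             const std::vector<int>& col_idx,
                             const std::vector<double>& vals,
                             const std::vector<double>& x) {
  if (row_ptr.empty()) return {};
  const std::size_t rows = row_ptr.size() - 1;
  std::vector<double> y(rows, 0.0);

  for (std::size_t i = 0; i < rows; ++i) {
    double sum = 0.0;
    // Nonzeros of row i occupy positions [row_ptr[i], row_ptr[i + 1]).
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
      // Indirect access: its locality depends on the matrix structure.
      sum += vals[j] * x[col_idx[j]];
    }
    y[i] = sum;
  }
  return y;
}
```

For the 2x3 matrix with rows (1, 0, 2) and (0, 3, 0), the inputs are
`row_ptr = {0, 2, 3}`, `col_idx = {0, 2, 1}` and `vals = {1, 2, 3}`.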

Rohan

On Sat, Dec 11, 2021 at 11:40 AM Barry Smith <bsmith at petsc.dev> wrote:

>
>    Rohan,
>
>      The flop rates for the sparse matrix-vector product are very low for
> an IBM Power 9. This is probably, at least partially, because the code is
> configured without any optimization flags. You should run ./configure with
> additional options something like COPTFLAGS="-O3"  CXXOPTFLAGS="-O3"
>  FOPTFLAGS="-O3" but please consult the IBM documentation to determine
> exactly what optimization flags to use for mpixlc and mpixlf.
>
>     When running in parallel I would expect the "sweet spot" of optimal
> performance to be roughly around 20 MPI ranks since the memory bandwidth of
> the CPU will be saturated long before you reach 40 ranks. I would recommend
> running with 1, 2, 3, 4, .... ranks to determine the optimal number of
> ranks. Also please consult the documentation on the placement of the ranks
> into the cores of the CPU; it is crucial to get this right and likely the
> default is far from correct. Essentially you want each core used to be as
> far away from the other cores being used as possible to maximize the
> achievable memory bandwidth. So the first core should be on the first
> socket, the second core on the second socket, the third core back on the
> first socket far from the first core (that is it should not share L1 or L2
> cache with the first core), etc.
>
>    The arabic-2005  matrix is not at all representative of the types of
> matrices PETSc is designed to solve. It does not come from a PDE and does
> not have the stencil structure of a matrix that comes from a PDE. PETSc's
> performance on such a matrix will be much lower than its performance for
> PDE matrices since PETSc is not designed for this type of matrix. Depending
> on the goals of your work you may want to use different matrices that come
> from PDEs.
>
>   Regarding loading the matrix. Yes, it is expected that one uses a custom
> stand-alone utility to read in SuiteSparse formatted matrices and convert
> them to the PETSc binary format; we do have a couple of examples of how
> such code can be written in src/mat/tutorials or tests.
>
>
>  Barry
>
>
> On Dec 10, 2021, at 6:54 PM, Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>
> Hi, I’m Rohan, a student working on compilation techniques for distributed
> tensor computations. I’m looking at using PETSc as a baseline for
> experiments I’m running, and want to understand if I’m using PETSc as it
> was intended to achieve high performance, and if the performance I’m seeing
> is expected. Currently, I’m just looking at SpMV operations.
>
> My experiments are run on the Lassen Supercomputer (
> https://hpc.llnl.gov/hardware/platforms/lassen). The system has 40 CPUs,
> 4 V100s and an Infiniband interconnect. A visualization of the architecture
> is here:
> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png.
>
> As of now, I’m trying to understand the single-node performance of PETSc,
> as the scaling performance onto multiple nodes appears to be as I expect.
> I’m using the arabic-2005 sparse matrix from the SuiteSparse matrix
> collection, detailed here: https://sparse.tamu.edu/LAW/arabic-2005. As a
> trusted baseline, I am comparing against SpMV code generated by the TACO
> compiler (
> http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races)
> .
>
> My experiments find that PETSc is roughly 4 times slower on a single
> thread, and roughly 3 times slower on a full node, than the kernel
> generated by TACO:
>
> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>
> My code using PETSc is here:
> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38
> .
>
> Runs from 1 thread and 1 node with -log_view are attached to the email.
> The command lines for each were as follows:
>
> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20 -warmup
> 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20
> -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>
>
> In addition to these benchmarking concerns, I wanted to share my
> experiences trying to load data from Matrix Market files into PETSc, which
> ended up being much more difficult than I anticipated. Essentially, trying
> to iterate through the Matrix Market files and using `write` to insert
> entries into a `Mat` was extremely slow. In order to get reasonable
> performance, I had to use an external utility to basically construct a CSR
> matrix, and then pass the arrays from the CSR Matrix into
> `MatCreateSeqAIJWithArrays`. I couldn’t find any more guidance on PETSc
> forums or Google, so I wanted to know if this was the right way to go.
>
> Thanks,
>
> Rohan Yadav
> <petsc-1-node-1-thread.txt><petsc-1-node-40-threads.txt>
>
>
>