[petsc-dev] Questions around benchmarking and data loading with PETSc

Junchao Zhang junchao.zhang at gmail.com
Sat Dec 11 12:04:44 CST 2021


On Sat, Dec 11, 2021 at 10:28 AM Rohan Yadav <rohany at alumni.cmu.edu> wrote:

> Hi Junchao,
>
> Thanks for the response!
>
> > You can use https://petsc.org/main/src/mat/tests/ex72.c.html to convert
> a Matrix Market file into a petsc binary file. And then in your test,
> load the binary matrix, following this example
> https://petsc.org/main/src/mat/tutorials/ex1.c.html
>
> I tried an example like this, but the performance was too slow (it would
> process ~2000-3000 calls to `SetValue` a second), which is not reasonable
> for loading matrices with millions of non-zeros.
>
A Matrix Market file in text format is not well suited to fast loading.  One
should convert it to PETSc binary format (only once), and use the binary file
afterwards.
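
For context on what such a one-time conversion involves: Matrix Market
stores unordered (row, column, value) triplets, while a CSR matrix (the
layout that `MatCreateSeqAIJWithArrays` takes) groups entries by row. A
minimal sketch of that regrouping in plain C follows. It is not PETSc's
converter; the function name `coo_to_csr` and the use of `size_t`/`double`
in place of PETSc's own index and scalar types are assumptions made for
illustration. It assumes 0-based indices (Matrix Market files are 1-based),
and it neither sorts columns within a row nor merges duplicate entries.

```c
#include <stddef.h>

/* Convert nnz coordinate triplets (row[k], col[k], val[k]) of an m-row
   matrix into CSR arrays. rowptr must hold m+1 entries; colidx and vals
   must hold nnz entries each. Hypothetical helper, not a PETSc routine. */
void coo_to_csr(size_t m, size_t nnz,
                const size_t *row, const size_t *col, const double *val,
                size_t *rowptr, size_t *colidx, double *vals)
{
    /* Count the entries in each row, stored one slot to the right. */
    for (size_t i = 0; i <= m; i++) rowptr[i] = 0;
    for (size_t k = 0; k < nnz; k++) rowptr[row[k] + 1]++;

    /* Prefix sum: rowptr[i] becomes the start offset of row i. */
    for (size_t i = 0; i < m; i++) rowptr[i + 1] += rowptr[i];

    /* Scatter each entry into its row, using rowptr[i] as a cursor. */
    for (size_t k = 0; k < nnz; k++) {
        size_t dst = rowptr[row[k]]++;
        colidx[dst] = col[k];
        vals[dst] = val[k];
    }

    /* Each cursor now sits at the end of its row; shift back to starts. */
    for (size_t i = m; i > 0; i--) rowptr[i] = rowptr[i - 1];
    rowptr[0] = 0;
}
```

This takes two passes over the entries and does no per-entry insertion
into an already assembled matrix.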


>
> > I don't know what "No Races" means, but it seems you'd better also
> verify the result of SpMV.
>
> This is a correct implementation of SpMV. The no-races is fine as it
> parallelizes over the rows of the matrix, and thus does not need
> synchronization between writes to the output.
>
> > You can think of PETSc's default CSR SpMV as the baseline, which is done
> in ~10 lines of code.
>
> I'm sorry, but I don't think that is a reasonable statement w.r.t. the
> lines of code making it a good baseline. The TACO compiler also can be used
> in 10 lines of code to compute an SpMV, or any other state-of-the-art
> library could wrap an SpMV implementation behind a single function call.
> I'm wondering if this performance I'm seeing using PETSc is expected, or if
> I've misconfigured or am misusing the system in some way.
>
I meant 10 lines of code without any function call, which can be thought of
as a textbook implementation of SpMV. As a baseline, one can apply
optimizations to it.  PETSc does not do sophisticated sparse matrix
optimization itself; instead, it relies on third-party libraries.  I
remember we had OSKI from Berkeley for CPU, and on GPU we use cuSparse,
hipSparse, MKLSparse or Kokkos-Kernels. If TACO is good, then PETSc can add
an interface to it too.
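
To make "textbook implementation" concrete, here is a minimal sketch of a
CSR SpMV in plain C. It is not PETSc's actual source; the function name
`csr_spmv` and the argument names are assumptions chosen for illustration.

```c
#include <stddef.h>

/* y = A * x for an m-row matrix A in CSR form.
   rowptr has m+1 entries; colidx and vals have rowptr[m] entries. */
void csr_spmv(size_t m, const size_t *rowptr, const size_t *colidx,
              const double *vals, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        /* Accumulate the dot product of row i with x. */
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += vals[k] * x[colidx[k]];
        y[i] = sum; /* row i writes only y[i] */
    }
}
```

Since each iteration of the outer loop writes only its own y[i], the rows
can be split across threads without synchronizing the writes to the output.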


> Rohan
>
>
> On Fri, Dec 10, 2021 at 11:39 PM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>> On Fri, Dec 10, 2021 at 8:05 PM Rohan Yadav <rohany at alumni.cmu.edu>
>> wrote:
>>
>>> Hi, I’m Rohan, a student working on compilation techniques for
>>> distributed tensor computations. I’m looking at using PETSc as a baseline
>>> for experiments I’m running, and want to understand if I’m using PETSc as
>>> it was intended to achieve high performance, and if the performance I’m
>>> seeing is expected. Currently, I’m just looking at SpMV operations.
>>>
>>>
>>> My experiments are run on the Lassen Supercomputer (
>>> https://hpc.llnl.gov/hardware/platforms/lassen). The system has 40
>>> CPUs, 4 V100s and an Infiniband interconnect. A visualization of the
>>> architecture is here:
>>> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png
>>> .
>>>
>>>
>>> As of now, I’m trying to understand the single-node performance of
>>> PETSc, as the scaling performance onto multiple nodes appears to be as I
>>> expect. I’m using the arabic-2005 sparse matrix from the SuiteSparse matrix
>>> collection, detailed here: https://sparse.tamu.edu/LAW/arabic-2005. As
>>> a trusted baseline, I am comparing against SpMV code generated by the TACO
>>> compiler (
>>> http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races)
>>> .
>>>
>> I don't know what "No Races" means, but it seems you'd better also verify
>> the result of SpMV.
>>
>>>
>>> My experiments find that PETSc is roughly 4 times slower on a single
>>> thread and node than the kernel generated by TACO:
>>>
>>>
>>> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
>>>
>>> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>>>
>> You can think of PETSc's default CSR SpMV as the baseline, which is done in
>> ~10 lines of code.
>>
>>>
>>> My code using PETSc is here:
>>> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38
>>> .
>>>
>>>
>>> Runs from 1 thread and 1 node with -log_view are attached to the email.
>>> The command lines for each were as follows:
>>>
>>>
>>> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20
>>> -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>
>>> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20
>>> -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>
>>>
>>>
>>> In addition to these benchmarking concerns, I wanted to share my
>>> experiences trying to load data from Matrix Market files into PETSc, which
>>> ended up being much more difficult than I anticipated. Essentially, trying
>>> to iterate through the Matrix Market files and using `write` to insert
>>> entries into a `Mat` was extremely slow. In order to get reasonable
>>> performance, I had to use an external utility to basically construct a CSR
>>> matrix, and then pass the arrays from the CSR Matrix into
>>> `MatCreateSeqAIJWithArrays`. I couldn’t find any more guidance on PETSc
>>> forums or Google, so I wanted to know if this was the right way to go.
>>>
>>>
>>> Thanks,
>>>
>>>
>>> Rohan Yadav
>>>
>>