[petsc-dev] Questions around benchmarking and data loading with PETSc

Sat Dec 11 10:28:05 CST 2021

Hi Junchao,

Thanks for the response!

> You can use https://petsc.org/main/src/mat/tests/ex72.c.html to convert a Matrix
> Market file into a petsc binary file. And then in your test, load the
> binary matrix, following this example
> https://petsc.org/main/src/mat/tutorials/ex1.c.html

I tried an approach like this, but it was too slow: it processed only
~2000-3000 calls to `SetValue` per second, which is not practical for
loading matrices with millions of non-zeros.
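
For reference, the workaround described in my original message below (building
the CSR arrays outside of PETSc and then handing them to
`MatCreateSeqAIJWithArrays`) can be sketched roughly as follows. This is a
minimal illustration, not the utility I actually used: the function name
`coo_to_csr` and the use of plain `int` and `double` (rather than PETSc's own
index and scalar types) are assumptions made for the example. The scatter is
stable, so entries within a row keep their input order; whether PETSc needs
the column indices of each row sorted should be checked against its
documentation.

```c
#include <assert.h>
#include <stdlib.h>

/* Convert nnz coordinate-format entries (row[k], col[k], val[k]) of a matrix
 * with nrows rows into CSR arrays. Entries may arrive in any row order, as
 * they do in a Matrix Market file; duplicates are kept, not summed.
 * The caller allocates rowptr (nrows + 1 entries) and colidx/vals (nnz each).
 * Returns 0 on success, -1 if the scratch allocation fails. */
int coo_to_csr(int nrows, int nnz,
               const int *row, const int *col, const double *val,
               int *rowptr, int *colidx, double *vals)
{
    /* Pass 1: count the entries of each row, stored one slot to the right. */
    for (int i = 0; i <= nrows; i++) rowptr[i] = 0;
    for (int k = 0; k < nnz; k++) {
        assert(row[k] >= 0 && row[k] < nrows);
        rowptr[row[k] + 1]++;
    }

    /* A prefix sum turns the counts into row start offsets. */
    for (int i = 0; i < nrows; i++) rowptr[i + 1] += rowptr[i];

    /* Pass 2: scatter each entry into the next free slot of its row. */
    int *next = malloc((size_t)nrows * sizeof *next);
    if (nrows > 0 && next == NULL) return -1;
    for (int i = 0; i < nrows; i++) next[i] = rowptr[i];
    for (int k = 0; k < nnz; k++) {
        int dst = next[row[k]]++;
        colidx[dst] = col[k];
        vals[dst] = val[k];
    }
    free(next);
    return 0;
}
```

Both passes are linear in the number of non-zeros, so the conversion cost is
small next to reading the file itself.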

> I don't know what "No Races" means, but it seems you'd better also verify
> the result of SpMV.

This is a correct implementation of SpMV. The "No Races" annotation is fine
here: the kernel parallelizes over the rows of the matrix, so each thread
writes to distinct entries of the output and no synchronization is needed.
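
To make the row-parallel argument concrete, here is a minimal sketch of a CSR
SpMV parallelized over rows, together with a small helper for comparing two
result vectors. It is not the code TACO generates: the names `spmv_csr` and
`max_abs_diff`, the OpenMP pragma, and the static schedule are all assumptions
chosen for illustration.

```c
#include <assert.h>

/* y = A * x for a CSR matrix with nrows rows. Each iteration of the outer
 * loop writes only y[i], so rows can be split across threads without atomics
 * or locks. The pragma is ignored when OpenMP is not enabled. */
void spmv_csr(int nrows, const int *rowptr, const int *colidx,
              const double *vals, const double *x, double *y)
{
    assert(nrows >= 0 && rowptr[0] == 0);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += vals[k] * x[colidx[k]];
        y[i] = sum;
    }
}

/* Largest absolute difference between two length-n vectors, for checking
 * one SpMV result against another. */
double max_abs_diff(int n, const double *a, const double *b)
{
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double d = a[i] - b[i];
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

Comparing the output of two implementations with something like
`max_abs_diff` is one way to do the verification suggested above; on real
matrices a small tolerance is more appropriate than exact equality, since the
order of summation can differ between implementations.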

> You can think petsc's default CSR spmv is the baseline,  which is done in
> ~10 lines of code.

I'm sorry, but I don't think that is a reasonable statement w.r.t. the
lines of code making it a good baseline. The TACO compiler also can be used
in 10 lines of code to compute an SpMV, or any other state-of-the-art
library could wrap an SpMV implementation behind a single function call.
I'm wondering if this performance I'm seeing using PETSc is expected, or if
I've misconfigured or am misusing the system in some way.

Rohan

On Fri, Dec 10, 2021 at 11:39 PM Junchao Zhang <junchao.zhang at gmail.com>
wrote:

> On Fri, Dec 10, 2021 at 8:05 PM Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>
>> Hi, I’m Rohan, a student working on compilation techniques for
>> distributed tensor computations. I’m looking at using PETSc as a baseline
>> for experiments I’m running, and want to understand if I’m using PETSc as
>> it was intended to achieve high performance, and if the performance I’m
>> seeing is expected. Currently, I’m just looking at SpMV operations.
>>
>>
>> My experiments are run on the Lassen Supercomputer (
>> https://hpc.llnl.gov/hardware/platforms/lassen). Each node has 40 CPU cores,
>> 4 V100 GPUs and an Infiniband interconnect. A visualization of the architecture
>> is here:
>> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png
>> .
>>
>>
>> As of now, I’m trying to understand the single-node performance of PETSc,
>> as the scaling performance onto multiple nodes appears to be as I expect.
>> I’m using the arabic-2005 sparse matrix from the SuiteSparse matrix
>> collection, detailed here: https://sparse.tamu.edu/LAW/arabic-2005. As a
>> trusted baseline, I am comparing against SpMV code generated by the TACO
>> compiler (
>> http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races)
>> .
>>
> I don't know what "No Races" means, but it seems you'd better also verify
> the result of SpMV.
>
>>
>> My experiments find that PETSc is roughly 4 times slower on a single
>> thread, and roughly 3 times slower on a full node, than the kernel
>> generated by TACO:
>>
>>
>> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
>>
>> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>>
> You can think petsc's default CSR spmv is the baseline,  which is done in
> ~10 lines of code.
>
>>
>> My code using PETSc is here:
>> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38
>> .
>>
>>
>> Runs from 1 thread and 1 node with -log_view are attached to the email.
>> The command lines for each were as follows:
>>
>>
>> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20
>> -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>
>> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20
>> -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>
>>
>>
>> In addition to these benchmarking concerns, I wanted to share my
>> experiences trying to load data from Matrix Market files into PETSc, which
>> ended up being much more difficult than I anticipated. Essentially, trying
>> to iterate through the Matrix Market files and using `MatSetValue` to insert
>> entries into a `Mat` was extremely slow. In order to get reasonable
>> performance, I had to use an external utility to basically construct a CSR
>> matrix, and then pass the arrays from the CSR Matrix into
>> `MatCreateSeqAIJWithArrays`. I couldn’t find any more guidance on PETSc
>> forums or Google, so I wanted to know if this was the right way to go.
>>
>>
>> Thanks,
>>
>>
>> Rohan Yadav
>>
>