[petsc-dev] Questions around benchmarking and data loading with PETSc

Sat Dec 11 17:06:56 CST 2021

On Sat, Dec 11, 2021, 4:22 PM Rohan Yadav <rohany at alumni.cmu.edu> wrote:

> Thanks all for the help, the main problem was the lack of optimization
> flags in the default build provided by my system. A manual installation
> with optimization flags delivers performance equal to the single node
> benchmark I discussed before.
>
Did you mean that, with 1 rank or 40 MPI ranks, PETSc's performance is now
close to that of TACO with 1 thread or 40 threads?

>
> Rohan
>
> On Sat, Dec 11, 2021 at 4:04 PM Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>
>> > The matrix market file in text format is not good for load.  One should
>> convert it to petsc binary format (only once), and use the new binary file
>> afterwards.
>>
>> Yes, I understand this. The point I'm trying to make is that using PETSc
>> to even perform the initial conversion from matrix market to the binary
>> format was prohibitively slow using `MatSetValues`.
>>
>> > I meant 10 lines of code without any function call, which can be
>> thought of as a textbook implementation of SpMV. As a baseline, one can
>> apply optimizations to it.  PETSc does not do sophisticated sparse matrix
>> optimization itself, instead it relies on third-party libraries.  I
>> remember we had OSKI from Berkeley for CPU, and on GPU we use cuSparse,
>> hipSparse, MKLSparse or Kokkos-Kernels. If TACO is good, then petsc can add
>> an interface to it too.
>>
>> Yes, this is what I expected. Given that PETSc uses high-performance
>> kernels for the sparse matrix operation itself, I was surprised that the
>> single-thread performance of PETSc was not closer to a baseline like
>> TACO. This performance will likely improve when I compile PETSc with
>> optimization flags.
>>
>> Rohan
>>
>> On Sat, Dec 11, 2021 at 1:04 PM Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>>>
>>>
>>>
>>> On Sat, Dec 11, 2021 at 10:28 AM Rohan Yadav <rohany at alumni.cmu.edu>
>>> wrote:
>>>
>>>> Hi Junchao,
>>>>
>>>> Thanks for the response!
>>>>
>>>> > You can use https://petsc.org/main/src/mat/tests/ex72.c.html to
>>>> convert a Matrix Market file into a petsc binary file. And then in
>>>> your test, load the binary matrix, following this example
>>>> https://petsc.org/main/src/mat/tutorials/ex1.c.html
>>>>
>>>> I tried an example like this, but the performance was too slow (it
>>>> would process ~2000-3000 calls to `SetValue` a second), which is not
>>>> reasonable for loading matrices with millions of non-zeros.
>>>>
>>> The matrix market file in text format is not good for load.  One should
>>> convert it to petsc binary format (only once), and use the new binary file
>>> afterwards.
>>>
>>>
>>>>
>>>> > I don't know what "No Races" means, but it seems you'd better also
>>>> verify the result of SpMV.
>>>>
>>>> This is a correct implementation of SpMV. The no-races is fine as it
>>>> parallelizes over the rows of the matrix, and thus does not need
>>>> synchronization between writes to the output.
>>>>
>>>> > You can think petsc's default CSR spmv is the baseline,  which is
>>>> done in ~10 lines of code.
>>>>
>>>> I'm sorry, but I don't think that is a reasonable statement w.r.t.
>>>> the lines of code making it a good baseline. The TACO compiler also can be
>>>> used in 10 lines of code to compute an SpMV, or any other state-of-the-art
>>>> library could wrap an SpMV implementation behind a single function call.
>>>> I'm wondering if this performance I'm seeing using PETSc is expected, or if
>>>> I've misconfigured or am misusing the system in some way.
>>>>
>>> I meant 10 lines of code without any function call, which can be thought
>>> of as a textbook implementation of SpMV. As a baseline, one can apply
>>> optimizations to it.  PETSc does not do sophisticated sparse matrix
>>> optimization itself, instead it relies on third-party libraries.  I
>>> remember we had OSKI from Berkeley for CPU, and on GPU we use cuSparse,
>>> hipSparse, MKLSparse or Kokkos-Kernels. If TACO is good, then petsc can add
>>> an interface to it too.
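
For readers following along, the "textbook" CSR SpMV mentioned above can be sketched as below. This is a minimal illustration, not PETSc's or TACO's actual kernel: the function name `spmv_csr`, the argument names, and the use of plain `int` indices are assumptions made for the example. Because each outer iteration writes only `y[i]`, rows can be processed in parallel without synchronization, which is the property the "No Races" setting refers to in the discussion; the OpenMP pragma is optional and is only active when OpenMP is enabled at compile time.

```c
#include <assert.h>

/* Textbook CSR sparse matrix-vector product: y = A * x.
 * row_ptr has nrows + 1 entries; col_idx and vals each have
 * row_ptr[nrows] entries. All names are illustrative only. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *vals, const double *x, double *y)
{
    assert(row_ptr[0] == 0);
    /* Each iteration writes only y[i], so rows are independent and the
     * loop can be parallelized over rows without synchronization. */
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            sum += vals[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
}
```

A baseline like this is sensitive to compiler optimization flags, so it is worth checking how the library and the benchmark were built before comparing timings.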
>>>
>>>
>>>> Rohan
>>>>
>>>>
>>>> On Fri, Dec 10, 2021 at 11:39 PM Junchao Zhang <junchao.zhang at gmail.com>
>>>> wrote:
>>>>
>>>>> On Fri, Dec 10, 2021 at 8:05 PM Rohan Yadav <rohany at alumni.cmu.edu>
>>>>> wrote:
>>>>>
>>>>>> Hi, I’m Rohan, a student working on compilation techniques for
>>>>>> distributed tensor computations. I’m looking at using PETSc as a baseline
>>>>>> for experiments I’m running, and want to understand if I’m using PETSc as
>>>>>> it was intended to achieve high performance, and if the performance I’m
>>>>>> seeing is expected. Currently, I’m just looking at SpMV operations.
>>>>>>
>>>>>>
>>>>>> My experiments are run on the Lassen Supercomputer (
>>>>>> https://hpc.llnl.gov/hardware/platforms/lassen). The system has 40
>>>>>> CPUs, 4 V100s and an Infiniband interconnect. A visualization of the
>>>>>> architecture is here:
>>>>>> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png
>>>>>> .
>>>>>>
>>>>>>
>>>>>> As of now, I’m trying to understand the single-node performance of
>>>>>> PETSc, as the scaling performance onto multiple nodes appears to be as I
>>>>>> expect. I’m using the arabic-2005 sparse matrix from the SuiteSparse matrix
>>>>>> collection, detailed here: https://sparse.tamu.edu/LAW/arabic-2005.
>>>>>> As a trusted baseline, I am comparing against SpMV code generated by the
>>>>>> TACO compiler (
>>>>>> http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races)
>>>>>> .
>>>>>>
>>>>> I don't know what "No Races" means, but it seems you'd better also
>>>>> verify the result of SpMV.
>>>>>
>>>>>>
>>>>>> My experiments find that PETSc is roughly 4 times slower on a single
>>>>>> thread and node than the kernel generated by TACO:
>>>>>>
>>>>>>
>>>>>> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
>>>>>>
>>>>>> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>>>>>>
>>>>> You can think petsc's default CSR spmv is the baseline,  which is done
>>>>> in ~10 lines of code.
>>>>>
>>>>>>
>>>>>> My code using PETSc is here:
>>>>>> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38
>>>>>> .
>>>>>>
>>>>>>
>>>>>> Runs from 1 thread and 1 node with -log_view are attached to the
>>>>>> email. The command lines for each were as follows:
>>>>>>
>>>>>>
>>>>>> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20
>>>>>> -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>
>>>>>> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n
>>>>>> 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>
>>>>>>
>>>>>>
>>>>>> In addition to these benchmarking concerns, I wanted to share my
>>>>>> experiences trying to load data from Matrix Market files into PETSc, which
>>>>>> ended up being much more difficult than I anticipated. Essentially, trying
>>>>>> to iterate through the Matrix Market files and use `MatSetValues` to insert
>>>>>> entries into a `Mat` was extremely slow. In order to get reasonable
>>>>>> performance, I had to use an external utility to basically construct a CSR
>>>>>> matrix, and then pass the arrays from the CSR Matrix into
>>>>>> `MatCreateSeqAIJWithArrays`. I couldn’t find any more guidance on PETSc
>>>>>> forums or Google, so I wanted to know if this was the right way to go.
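
The conversion step described above (building CSR arrays from Matrix Market style coordinate triplets before handing them to `MatCreateSeqAIJWithArrays`) might look roughly like the following. This is a sketch under stated assumptions, not the external utility used in the thread: the name `coo_to_csr` and its signature are hypothetical, indices are assumed to be already 0-based (Matrix Market files are 1-based), plain `int` stands in for whatever index type the PETSc build uses, and duplicates are not merged. Column indices within a row keep their input order, so any ordering requirement of the consuming routine should be checked against the PETSc documentation.

```c
#include <assert.h>
#include <stdlib.h>

/* Convert nnz coordinate triplets (rows[k], cols[k], vals[k]) to CSR.
 * The caller provides row_ptr (nrows + 1 ints) and col_idx, csr_vals
 * (nnz entries each). Returns 0 on success, -1 if allocation fails.
 * Hypothetical helper for illustration; indices must be 0-based. */
int coo_to_csr(int nrows, int nnz, const int *rows, const int *cols,
               const double *vals, int *row_ptr, int *col_idx,
               double *csr_vals)
{
    /* Pass 1: count the entries in each row. */
    for (int i = 0; i <= nrows; i++) row_ptr[i] = 0;
    for (int k = 0; k < nnz; k++) {
        assert(rows[k] >= 0 && rows[k] < nrows);
        row_ptr[rows[k] + 1]++;
    }
    /* A prefix sum turns the counts into row start offsets. */
    for (int i = 0; i < nrows; i++) row_ptr[i + 1] += row_ptr[i];

    /* Pass 2: scatter each entry into the next free slot of its row. */
    int *next = malloc((size_t)nrows * sizeof *next);
    if (next == NULL && nrows > 0) return -1;
    for (int i = 0; i < nrows; i++) next[i] = row_ptr[i];
    for (int k = 0; k < nnz; k++) {
        int dst = next[rows[k]]++;
        col_idx[dst] = cols[k];
        csr_vals[dst] = vals[k];
    }
    free(next);
    return 0;
}
```

Both passes are linear in the number of nonzeros, which avoids the per-entry insertion cost that made the element-by-element approach slow.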
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>> Rohan Yadav
>>>>>>
>>>>>