[petsc-dev] Questions around benchmarking and data loading with PETSc

Rohan Yadav rohany at alumni.cmu.edu
Sat Dec 11 17:56:08 CST 2021


Sorry, what’s surprising about this? 40 MPI ranks on a single node should give performance similar to 40 threads. Both PETSc and TACO use a row-based parallelism strategy, so the numbers should line up.
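
For what it’s worth, the comparison I have in mind is the one sketched below: with a row-distributed AIJ matrix, each of the 40 ranks owns a contiguous block of rows and multiplies only those rows in MatMult, which is essentially the same work decomposition as 40 threads splitting the row loop. (This is only a sketch of the idea, not code from my benchmark; A, x, and y are assumed to already be set up.)

    PetscInt rstart, rend;
    MatGetOwnershipRange(A, &rstart, &rend);  /* this rank owns rows [rstart, rend) */
    MatMult(A, x, y);                         /* each rank computes its own block of y */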

Rohan Yadav 

> On Dec 11, 2021, at 6:44 PM, Junchao Zhang <junchao.zhang at gmail.com> wrote:
> 
> 
> 
>> On Sat, Dec 11, 2021 at 5:09 PM Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>> > Did you mean that with 1 rank or 40 MPI ranks, PETSc's performance is close to that of 1 thread or 40 threads of TACO, respectively?
>> 
>> The 1-rank time is the same as TACO with 1 thread, and the 40-rank time is the same as TACO with 40 threads.
> Interesting. TACO is supposed to give an optimized SpMV. 
>  
>> 
>> Rohan
>> 
>>> On Sat, Dec 11, 2021 at 6:07 PM Junchao Zhang <junchao.zhang at gmail.com> wrote:
>>> 
>>> 
>>>> On Sat, Dec 11, 2021, 4:22 PM Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>>>> Thanks all for the help; the main problem was the lack of optimization flags in the default build provided by my system. A manual installation with optimization flags delivers performance equal to the single-node benchmark I discussed before.
>>> 
>>> Did you mean that with 1 rank or 40 MPI ranks, PETSc's performance is close to that of 1 thread or 40 threads of TACO, respectively?
>>>> 
>>>> Rohan
>>>> 
>>>>> On Sat, Dec 11, 2021 at 4:04 PM Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>>>>> > The Matrix Market file in text format is not good for loading. One should convert it to PETSc binary format (only once) and use the new binary file afterwards.
>>>>> 
>>>>> Yes, I understand this. The point I'm trying to make is that using PETSc even to perform the initial conversion from Matrix Market to the binary format was prohibitively slow with `MatSetValues`.
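>>>>>
>>>>> For reference, the conversion I was timing is essentially a loop over the Matrix Market entries with one `MatSetValues` call per nonzero, roughly the sketch below (not my exact code; `f` is the open Matrix Market file and `A` an already created `Mat`):
>>>>>
>>>>>     int    i, j;
>>>>>     double v;
>>>>>     while (fscanf(f, "%d %d %lf", &i, &j, &v) == 3) {  /* one coordinate entry per line */
>>>>>       PetscInt    row = i - 1, col = j - 1;            /* Matrix Market is 1-indexed */
>>>>>       PetscScalar val = v;
>>>>>       MatSetValues(A, 1, &row, 1, &col, &val, INSERT_VALUES);
>>>>>     }
>>>>>     MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>>>>     MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);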
>>>>> 
>>>>> > I meant 10 lines of code without any function call, which can be thought of as a textbook implementation of SpMV and serves as a baseline on which one can apply optimizations. PETSc does not do sophisticated sparse matrix optimization itself; instead, it relies on third-party libraries. I remember we had OSKI from Berkeley for the CPU, and on the GPU we use cuSPARSE, hipSPARSE, MKL Sparse, or Kokkos Kernels. If TACO is good, then PETSc can add an interface to it too.
>>>>> 
>>>>> Yes, this is what I expected. Given that PETSc uses high-performance kernels for the sparse matrix operation itself, I was surprised that the single-thread performance of PETSc was not closer to that of a baseline like TACO. This performance will likely improve when I compile PETSc with optimization flags.
>>>>> 
>>>>> Rohan
>>>>> 
>>>>>> On Sat, Dec 11, 2021 at 1:04 PM Junchao Zhang <junchao.zhang at gmail.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Sat, Dec 11, 2021 at 10:28 AM Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>>>>>>> Hi Junchao,
>>>>>>> 
>>>>>>> Thanks for the response!
>>>>>>> 
>>>>>>> > You can use https://petsc.org/main/src/mat/tests/ex72.c.html to convert a Matrix Market file into a PETSc binary file. Then, in your test, load the binary matrix following this example: https://petsc.org/main/src/mat/tutorials/ex1.c.html
>>>>>>> 
>>>>>>> I tried an example like this, but the performance was too slow (it processed only ~2000-3000 `SetValue` calls per second), which is not reasonable for loading matrices with millions of non-zeros.
>>>>>> The Matrix Market file in text format is not good for loading. One should convert it to PETSc binary format (only once) and use the new binary file afterwards.
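>>>>>>
>>>>>> Once you have the binary file, the load step in the benchmark is only a few lines, along the lines of ex1.c (a sketch with error checking omitted; the file name is just an example):
>>>>>>
>>>>>>     Mat         A;
>>>>>>     PetscViewer viewer;
>>>>>>     MatCreate(PETSC_COMM_WORLD, &A);
>>>>>>     MatSetFromOptions(A);
>>>>>>     PetscViewerBinaryOpen(PETSC_COMM_WORLD, "arabic-2005.petsc", FILE_MODE_READ, &viewer);
>>>>>>     MatLoad(A, viewer);   /* reads the matrix and distributes rows across ranks */
>>>>>>     PetscViewerDestroy(&viewer);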
>>>>>>  
>>>>>>> 
>>>>>>> > I don't know what "No Races" means, but it seems you should also verify the result of the SpMV.
>>>>>>> 
>>>>>>> This is a correct implementation of SpMV. The "No Races" strategy is fine because the kernel parallelizes over the rows of the matrix and therefore does not need synchronization between writes to the output.
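>>>>>>>
>>>>>>> Concretely, the generated kernel is essentially the standard row-parallel CSR loop, roughly the sketch below (not TACO's exact output; the array names are placeholders):
>>>>>>>
>>>>>>>     /* y = A * x with A in CSR: pos[] row offsets, crd[] column indices, vals[] nonzeros */
>>>>>>>     #pragma omp parallel for schedule(static)
>>>>>>>     for (int i = 0; i < num_rows; i++) {
>>>>>>>       double sum = 0.0;
>>>>>>>       for (int p = pos[i]; p < pos[i + 1]; p++)
>>>>>>>         sum += vals[p] * x[crd[p]];
>>>>>>>       y[i] = sum;   /* each thread writes only its own rows, so no races */
>>>>>>>     }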
>>>>>>> 
>>>>>>> > You can think of PETSc's default CSR SpMV as the baseline, which is done in ~10 lines of code.
>>>>>>> 
>>>>>>> I'm sorry, but I don't think the number of lines of code is what makes something a good baseline. The TACO compiler can also be used in 10 lines of code to compute an SpMV, and any other state-of-the-art library could wrap an SpMV implementation behind a single function call. I'm wondering whether the performance I'm seeing with PETSc is expected, or whether I've misconfigured or am misusing the system in some way.
>>>>>> I meant 10 lines of code without any function call, which can be thought of as a textbook implementation of SpMV and serves as a baseline on which one can apply optimizations. PETSc does not do sophisticated sparse matrix optimization itself; instead, it relies on third-party libraries. I remember we had OSKI from Berkeley for the CPU, and on the GPU we use cuSPARSE, hipSPARSE, MKL Sparse, or Kokkos Kernels. If TACO is good, then PETSc can add an interface to it too.
>>>>>>  
>>>>>>> Rohan
>>>>>>> 
>>>>>>> 
>>>>>>>> On Fri, Dec 10, 2021 at 11:39 PM Junchao Zhang <junchao.zhang at gmail.com> wrote:
>>>>>>>>> On Fri, Dec 10, 2021 at 8:05 PM Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>>>>>>>> 
>>>>>>>>> Hi, I’m Rohan, a student working on compilation techniques for distributed tensor computations. I’m looking at using PETSc as a baseline for experiments I’m running, and I want to understand whether I’m using PETSc as intended to achieve high performance and whether the performance I’m seeing is expected. Currently, I’m just looking at SpMV operations.
>>>>>>>>> 
>>>>>>>>> My experiments are run on the Lassen supercomputer (https://hpc.llnl.gov/hardware/platforms/lassen). Each node has 40 CPU cores, 4 V100 GPUs, and an InfiniBand interconnect. A visualization of the architecture is here: https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png.
>>>>>>>>> 
>>>>>>>>> As of now, I’m trying to understand the single-node performance of PETSc, as the scaling performance onto multiple nodes appears to be as I expect. I’m using the arabic-2005 sparse matrix from the SuiteSparse matrix collection, detailed here: https://sparse.tamu.edu/LAW/arabic-2005. As a trusted baseline, I am comparing against SpMV code generated by the TACO compiler (http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races).
>>>>>>>> I don't know what "No Races" means, but it seems you should also verify the result of the SpMV.
>>>>>>>>> 
>>>>>>>>> My experiments find that PETSc is roughly 4 times slower than the kernel generated by TACO, on both a single thread and a single node:
>>>>>>>>> 
>>>>>>>>> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
>>>>>>>>> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>>>>>>>> You can think of PETSc's default CSR SpMV as the baseline, which is done in ~10 lines of code.
>>>>>>>>> 
>>>>>>>>> My code using PETSc is here: https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38.
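>>>>>>>>>
>>>>>>>>> The timed region in that file is roughly the sketch below (based on the command-line flags, not the exact code: the `-warmup` iterations are discarded and the `-n` iterations are averaged):
>>>>>>>>>
>>>>>>>>>     for (int it = 0; it < warmup; it++) MatMult(A, x, y);  /* untimed warm-up */
>>>>>>>>>     PetscLogDouble t0, t1;
>>>>>>>>>     PetscTime(&t0);
>>>>>>>>>     for (int it = 0; it < niter; it++) MatMult(A, x, y);   /* timed SpMV iterations */
>>>>>>>>>     PetscTime(&t1);
>>>>>>>>>     PetscPrintf(PETSC_COMM_WORLD, "average MatMult time: %g ms\n", 1e3 * (t1 - t0) / niter);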
>>>>>>>>> 
>>>>>>>>> Runs from 1 thread and 1 node with -log_view are attached to the email. The command lines for each were as follows:
>>>>>>>>> 
>>>>>>>>> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> In addition to these benchmarking concerns, I wanted to share my experience trying to load data from Matrix Market files into PETSc, which ended up being much more difficult than I anticipated. Essentially, iterating through the Matrix Market files and using `write` to insert entries into a `Mat` was extremely slow. To get reasonable performance, I had to use an external utility to construct a CSR matrix and then pass the arrays from the CSR matrix into `MatCreateSeqAIJWithArrays` (sketched below). I couldn’t find any further guidance on the PETSc forums or Google, so I wanted to know if this was the right way to go.
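>>>>>>>>>
>>>>>>>>> Concretely, the fast path I ended up with looks roughly like this (a sketch, not my exact code; `rowptr`, `colidx`, and `vals` are the CSR arrays built by the external utility, stored as PetscInt/PetscScalar):
>>>>>>>>>
>>>>>>>>>     Mat         A;
>>>>>>>>>     PetscViewer viewer;
>>>>>>>>>     /* wrap the externally built CSR arrays directly, avoiding per-entry insertion */
>>>>>>>>>     MatCreateSeqAIJWithArrays(PETSC_COMM_SELF, nrows, ncols, rowptr, colidx, vals, &A);
>>>>>>>>>     /* write the matrix out once in PETSc binary format for later benchmark runs */
>>>>>>>>>     PetscViewerBinaryOpen(PETSC_COMM_SELF, "arabic-2005.petsc", FILE_MODE_WRITE, &viewer);
>>>>>>>>>     MatView(A, viewer);
>>>>>>>>>     PetscViewerDestroy(&viewer);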
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Rohan Yadav