[petsc-users] Python PETSc performance vs scipy ZVODE

Niclas Götting ngoetting at itp.uni-bremen.de
Tue Aug 15 05:48:17 CDT 2023


Following your suggestion, I tried using vode for the real-valued
problem in scipy and got roughly the same speed as before, which could
have three reasons:

 1. (Z)VODE is slower than plain RK (however, I must admit that I'm not
    quite sure what (Z)VODE does precisely)
 2. Sparse matrix operations in scipy are slow. Some of them are even
    written in pure python.
 3. The RHS function in scipy must *return* a vector and therefore
    allocates new memory on every call (see the short sketch below this
    list).
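
To illustrate point 3, here is a minimal sketch of the difference (the
matrices A and B are placeholders, not my actual operators): the scipy
RHS has to return a freshly allocated array on every call, while the
petsc4py RHSFunction writes into the preallocated Vec F in place.

    # scipy: a new array is allocated and returned on every call
    def f(t, y):
        return A @ y + 0.5 * (5 < t < 10) * (B @ y)

    # petsc4py: the result is written into the preallocated Vec F
    def rhsfunc(ts, t, u, F):
        A.mult(u, F)          # F = A u
        B.multAdd(u, F, F)    # F += B u  (time-dependent scaling omitted)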

Parallelizing the code is of course a goal of mine, but I believe this 
will only become relevant for larger systems, which I want to 
investigate in the near future.

Regarding the RHS Jacobian, I see why defining an RHSFunction vs. an
RHSJacobian should be computationally equivalent, but I found the
RHSFunction much easier to optimize in this case, and I'm not quite sure
why the documentation so strictly recommends the pattern of providing
only a Jacobian and not an RHS function, if the two are indeed
equivalent.

Lastly, I'm aware that another performance boost awaits once the
debugging functionality is turned off, but for this simple test I just
wanted to see whether there is *any* improvement in performance, and I
was very much surprised by the factor of 7 with debugging still turned on.

Thank you all for the interesting input and have a nice day!
Niclas

On 15.08.23 00:37, Zhang, Hong wrote:
> PETSc is not necessarily faster than scipy for your problem when 
> executed in serial, but you get benefits when running in parallel. The 
> PETSc code you wrote uses float64 while your scipy code uses 
> complex128, so the comparison may not be fair.
>
> In addition, using the RHS Jacobian does not necessarily make your 
> PETSc code slower. In your case, the bottleneck is the matrix 
> operations. For best performance, you should avoid adding two sparse 
> matrices (especially with different sparsity patterns), which is very 
> costly. So one MatMult + one MatMultAdd is the best option. MatAXPY with 
> the same nonzero pattern would be a bit slower, but still faster than 
> MatAXPY with a subset nonzero pattern, which you used in the Jacobian 
> function.
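>
> For illustration, a minimal sketch of that pattern, using the l and
> pump matrices from the code below; the work vector w is an extra
> assumption, created once outside the RHS function, so the
> time-dependent factor is applied to a vector instead of to a matrix:
>
>     def rhsfunc(ts, t, u, F):
>         l.mult(u, F)            # one MatMult:    F  = A_const u
>         scale = 0.5 * (5 < t < 10)
>         u.copy(w)               # w <- u
>         w.scale(scale)          # w  = f(t) u
>         pump.multAdd(w, F, F)   # one MatMultAdd: F += B_const f(t) u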
>
> I echo Barry’s suggestion that debugging should be turned off before 
> you do any performance study.
>
> Hong (Mr.)
>
>> On Aug 10, 2023, at 4:40 AM, Niclas Götting 
>> <ngoetting at itp.uni-bremen.de> wrote:
>>
>> Thank you both for the very quick answer!
>>
>> So far, I compiled PETSc with debugging turned on, but I think it 
>> should still be faster than standard scipy in both cases. Actually, 
>> Stefano's answer has already got me very far; now I only define the 
>> RHS of the ODE and no Jacobian (I wonder why the documentation 
>> suggests otherwise, though). I made the following four attempts at 
>> implementing the RHS:
>>
>>  1. def rhsfunc1(ts, t, u, F):
>>         scale = 0.5 * (5 < t < 10)
>>         (l + scale * pump).mult(u, F)
>>  2. def rhsfunc2(ts, t, u, F):
>>         l.mult(u, F)
>>         scale = 0.5 * (5 < t < 10)
>>         (scale * pump).multAdd(u, F, F)
>>  3. def rhsfunc3(ts, t, u, F):
>>         l.mult(u, F)
>>         scale = 0.5 * (5 < t < 10)
>>         if scale != 0:
>>             pump.scale(scale)
>>             pump.multAdd(u, F, F)
>>             pump.scale(1/scale)
>>  4. def rhsfunc4(ts, t, u, F):
>>         tmp_pump.zeroEntries() # tmp_pump is pump.duplicate()
>>         l.mult(u, F)
>>         scale = 0.5 * (5 < t < 10)
>>         tmp_pump.axpy(scale, pump,
>>     structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
>>         tmp_pump.multAdd(u, F, F)
>>
>> They all yield the same results, but with 50it/s, 800it/s, 2300it/s 
>> and 1900it/s, respectively, which is a huge performance boost (almost 
>> 7 times as fast as scipy, with PETSc debugging still turned on). As 
>> the scale function will most likely be a Gaussian in the future, I 
>> think that option 3 will become numerically unstable and I'll have 
>> to go with option 4, which is already faster than I expected. If you 
>> think it is possible to speed up the RHS calculation even more, I'd 
>> be happy to hear your suggestions; the -log_view output is attached 
>> to this message.
>>
>> One last point: If I didn't misunderstand the documentation at 
>> https://petsc.org/release/manual/ts/#special-cases, should this maybe 
>> be changed?
>>
>> Best regards
>> Niclas
>>
>> On 09.08.23 17:51, Stefano Zampini wrote:
>>> TSRK is an explicit solver. Unless you are changing the ts type from 
>>> the command line, the explicit Jacobian should not be needed. On top 
>>> of Barry's suggestion, I would suggest writing the explicit RHS 
>>> instead of assembling a throwaway matrix every time that function 
>>> needs to be sampled.
>>>
>>> On Wed, Aug 9, 2023, 17:09 Niclas Götting 
>>> <ngoetting at itp.uni-bremen.de> wrote:
>>>
>>>     Hi all,
>>>
>>>     I'm currently trying to convert a quantum simulation from scipy to
>>>     PETSc. The problem itself is extremely simple and of the form
>>>     \dot{u}(t)
>>>     = (A_const + f(t)*B_const)*u(t), where f(t) in this simple test
>>>     case is
>>>     a square function. The matrices A_const and B_const are
>>>     extremely sparse
>>>     and therefore I thought the problem would be well suited for PETSc.
>>>     Currently, I solve the ODE with the following procedure in scipy
>>>     (I can
>>>     provide the necessary data files, if needed, but they are just some
>>>     trace-preserving, very sparse matrices):
>>>
>>>     import numpy as np
>>>     import scipy.sparse
>>>     import scipy.integrate
>>>
>>>     from tqdm import tqdm
>>>
>>>
>>>     l = np.load("../liouvillian.npy")
>>>     pump = np.load("../pump_operator.npy")
>>>     state = np.load("../initial_state.npy")
>>>
>>>     l = scipy.sparse.csr_array(l)
>>>     pump = scipy.sparse.csr_array(pump)
>>>
>>>     def f(t, y, *args):
>>>          return (l + 0.5 * (5 < t < 10) * pump) @ y
>>>          #return l @ y # Uncomment for f(t) = 0
>>>
>>>     dt = 0.1
>>>     NUM_STEPS = 200
>>>     res = np.empty((NUM_STEPS, 4096), dtype=np.complex128)
>>>     solver =
>>>     scipy.integrate.ode(f).set_integrator("zvode").set_initial_value(state)
>>>     times = []
>>>     for i in tqdm(range(NUM_STEPS)):
>>>          res[i, :] = solver.integrate(solver.t + dt)
>>>          times.append(solver.t)
>>>
>>>     Here, A_const = l, B_const = pump and f(t) = 0.5 * (5 < t < 10). tqdm
>>>     reports
>>>     about 330it/s on my machine. When converting the code to PETSc,
>>>     I came
>>>     to the following result (according to the chapter
>>>     https://petsc.org/main/manual/ts/#special-cases)
>>>
>>>     import sys
>>>     import petsc4py
>>>     petsc4py.init(args=sys.argv)
>>>     import numpy as np
>>>     import scipy.sparse
>>>
>>>     from tqdm import tqdm
>>>     from petsc4py import PETSc
>>>
>>>     comm = PETSc.COMM_WORLD
>>>
>>>
>>>     def mat_to_real(arr):
>>>          return np.block([[arr.real, -arr.imag], [arr.imag,
>>>     arr.real]]).astype(np.float64)
>>>
>>>     def mat_to_petsc_aij(arr):
>>>          arr_sc_sp = scipy.sparse.csr_array(arr)
>>>          mat = PETSc.Mat().createAIJ(arr.shape[0], comm=comm)
>>>          rstart, rend = mat.getOwnershipRange()
>>>          print(rstart, rend)
>>>          print(arr.shape[0])
>>>          print(mat.sizes)
>>>          I = arr_sc_sp.indptr[rstart : rend + 1] -
>>>     arr_sc_sp.indptr[rstart]
>>>          J = arr_sc_sp.indices[arr_sc_sp.indptr[rstart] :
>>>     arr_sc_sp.indptr[rend]]
>>>          V = arr_sc_sp.data[arr_sc_sp.indptr[rstart] :
>>>     arr_sc_sp.indptr[rend]]
>>>
>>>          print(I.shape, J.shape, V.shape)
>>>          mat.setValuesCSR(I, J, V)
>>>          mat.assemble()
>>>          return mat
>>>
>>>
>>>     l = np.load("../liouvillian.npy")
>>>     l = mat_to_real(l)
>>>     pump = np.load("../pump_operator.npy")
>>>     pump = mat_to_real(pump)
>>>     state = np.load("../initial_state.npy")
>>>     state = np.hstack([state.real, state.imag]).astype(np.float64)
>>>
>>>     l = mat_to_petsc_aij(l)
>>>     pump = mat_to_petsc_aij(pump)
>>>
>>>
>>>     jac = l.duplicate()
>>>     for i in range(8192):
>>>          jac.setValue(i, i, 0)
>>>     jac.assemble()
>>>     jac += l
>>>
>>>     vec = l.createVecRight()
>>>     vec.setValues(np.arange(state.shape[0], dtype=np.int32), state)
>>>     vec.assemble()
>>>
>>>
>>>     dt = 0.1
>>>
>>>     ts = PETSc.TS().create(comm=comm)
>>>     ts.setFromOptions()
>>>     ts.setProblemType(ts.ProblemType.LINEAR)
>>>     ts.setEquationType(ts.EquationType.ODE_EXPLICIT)
>>>     ts.setType(ts.Type.RK)
>>>     ts.setRKType(ts.RKType.RK3BS)
>>>     ts.setTime(0)
>>>     print("KSP:", ts.getKSP().getType())
>>>     print("KSP PC:",ts.getKSP().getPC().getType())
>>>     print("SNES :", ts.getSNES().getType())
>>>
>>>     def jacobian(ts, t, u, Amat, Pmat):
>>>          Amat.zeroEntries()
>>>          Amat.aypx(1, l,
>>>     structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)
>>>          Amat.axpy(0.5 * (5 < t < 10), pump,
>>>     structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)
>>>
>>>     ts.setRHSFunction(PETSc.TS.computeRHSFunctionLinear)
>>>     #ts.setRHSJacobian(PETSc.TS.computeRHSJacobianConstant, l, l) #
>>>     Uncomment for f(t) = 0
>>>     ts.setRHSJacobian(jacobian, jac)
>>>
>>>     NUM_STEPS = 200
>>>     res = np.empty((NUM_STEPS, 8192), dtype=np.float64)
>>>     times = []
>>>     rstart, rend = vec.getOwnershipRange()
>>>     for i in tqdm(range(NUM_STEPS)):
>>>          time = ts.getTime()
>>>          ts.setMaxTime(time + dt)
>>>          ts.solve(vec)
>>>          res[i, rstart:rend] = vec.getArray()[:]
>>>          times.append(time)
>>>
>>>     I decomposed the complex ODE into a larger real ODE so that I can
>>>     more easily switch to GPU computation later on. Now, the
>>>     solutions of
>>>     both scripts are essentially identical, but PETSc runs about 3 times
>>>     slower at 120it/s on my machine. I don't use MPI for PETSc yet.
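>>>
>>>     (To spell out the decomposition: writing the complex matrix as
>>>     M = X + i*Y and the state as u = x + i*y, the product M*u
>>>     corresponds to the real block system [X, -Y; Y, X] * [x; y],
>>>     which is what mat_to_real and the hstack of real and imaginary
>>>     parts above implement.)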
>>>
>>>     I strongly suspect that the problem lies in the Jacobian
>>>     definition,
>>>     as PETSc is about 3 times *faster* than scipy with f(t) = 0 and
>>>     therefore a constant Jacobian.
>>>
>>>     Thank you in advance.
>>>
>>>     All the best,
>>>     Niclas
>>>
>>>
>> <log.log>
>