[petsc-users] Python PETSc performance vs scipy ZVODE

Stefano Zampini stefano.zampini at gmail.com
Thu Aug 10 05:27:12 CDT 2023


Then just do the multiplications you need. My proposal was for the example
function you were showing.

On Thu, Aug 10, 2023, 12:25 Niclas Götting <ngoetting at itp.uni-bremen.de>
wrote:

> You are absolutely right for this specific case (I get about 2400it/s
> instead of 2100it/s). However, the single square function will be replaced
> by a series of Gaussian pulses in the future, which will never be exactly
> zero. Maybe one could approximate and skip the second mult when the
> Gaussians are close to zero.
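>
> For illustration, a minimal sketch of that idea (the envelope function
> gaussian_envelope and the cutoff EPS are assumptions standing in for the
> future pulse train, not part of the code in this thread; l, pump and
> tmp_vec are the objects used in the RHS below):
>
> EPS = 1e-12  # treat the pump term as zero below this threshold
>
> def gaussian_envelope(t):
>     # hypothetical placeholder for the future series of Gaussian pulses
>     return 0.5 * np.exp(-((t - 7.5) / 1.0) ** 2)
>
> def rhsfunc_gauss(ts, t, u, F):
>     l.mult(u, F)                # F = l @ u
>     scale = gaussian_envelope(t)
>     if abs(scale) > EPS:        # skip the pump term when it is negligible
>         pump.mult(u, tmp_vec)   # tmp_vec = pump @ u
>         F.axpy(scale, tmp_vec)  # F += scale * tmp_vec
>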
> On 10.08.23 12:16, Stefano Zampini wrote:
>
> If you do the mult of "pump" inside an if, it should be faster
>
> On Thu, Aug 10, 2023, 12:12 Niclas Götting <ngoetting at itp.uni-bremen.de>
> wrote:
>
>> If I understood you right, this should be the resulting RHS:
>>
>> def rhsfunc5(ts, t, u, F):
>>     l.mult(u, F)               # F = l @ u
>>     pump.mult(u, tmp_vec)      # tmp_vec is a reusable work vector
>>     scale = 0.5 * (5 < t < 10)
>>     F.axpy(scale, tmp_vec)     # F += scale * tmp_vec
>>
>> It is a little slower than option 3, but at about 2100it/s it is
>> consistently ~10% faster than option 4.
>>
>> Thank you very much for the suggestion!
>> On 10.08.23 11:47, Stefano Zampini wrote:
>>
>> I would use option 3. Keep a work vector and do a vector summation
>> instead of scaling the matrix by scale and then by 1/scale.
>>
>> I agree with you the docs are a little misleading here.
>>
>> On Thu, Aug 10, 2023, 11:40 Niclas Götting <ngoetting at itp.uni-bremen.de>
>> wrote:
>>
>>> Thank you both for the very quick answer!
>>>
>>> So far, I have compiled PETSc with debugging turned on, but I think it
>>> should still be faster than standard scipy in both cases. Actually,
>>> Stefano's answer has already got me very far; now I only define the RHS
>>> of the ODE and no Jacobian (I wonder why the documentation suggests
>>> otherwise, though). I made the following four attempts at implementing
>>> the RHS:
>>>
>>>    1. def rhsfunc1(ts, t, u, F):
>>>           scale = 0.5 * (5 < t < 10)
>>>           (l + scale * pump).mult(u, F)
>>>
>>>    2. def rhsfunc2(ts, t, u, F):
>>>           l.mult(u, F)
>>>           scale = 0.5 * (5 < t < 10)
>>>           (scale * pump).multAdd(u, F, F)
>>>
>>>    3. def rhsfunc3(ts, t, u, F):
>>>           l.mult(u, F)
>>>           scale = 0.5 * (5 < t < 10)
>>>           if scale != 0:
>>>               pump.scale(scale)
>>>               pump.multAdd(u, F, F)
>>>               pump.scale(1/scale)
>>>
>>>    4. def rhsfunc4(ts, t, u, F):
>>>           tmp_pump.zeroEntries()  # tmp_pump is pump.duplicate()
>>>           l.mult(u, F)
>>>           scale = 0.5 * (5 < t < 10)
>>>           tmp_pump.axpy(scale, pump,
>>>                         structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
>>>           tmp_pump.multAdd(u, F, F)
>>>
>>> They all yield the same results, but run at 50it/s, 800it/s, 2300it/s and
>>> 1900it/s, respectively, which is a huge performance boost (almost 7 times
>>> as fast as scipy, with PETSc debugging still turned on). As the scale
>>> function will most likely be a Gaussian in the future, I think that option
>>> 3 will become numerically unstable and I'll have to go with option 4,
>>> which is already faster than I expected. If you think it is possible to
>>> speed up the RHS calculation even more, I'd be happy to hear your
>>> suggestions; the -log_view output is attached to this message.
>>>
>>> One last point: If I didn't misunderstand the documentation at
>>> https://petsc.org/release/manual/ts/#special-cases, should this maybe
>>> be changed?
>>>
>>> Best regards
>>> Niclas
>>> On 09.08.23 17:51, Stefano Zampini wrote:
>>>
>>> TSRK is an explicit solver. Unless you are changing the ts type from the
>>> command line, the explicit Jacobian should not be needed. On top of
>>> Barry's suggestion, I would suggest you write the explicit RHS instead
>>> of assembling a throwaway matrix every time that function needs to be
>>> sampled.
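>>>
>>> As a minimal sketch of what that looks like in petsc4py (reusing the
>>> assembled constant matrices l and pump from the script below; the work
>>> vector F_work is an assumption):
>>>
>>> F_work = l.createVecRight()   # reusable work vector for the pump term
>>>
>>> def rhsfunc(ts, t, u, F):
>>>     l.mult(u, F)              # F = l @ u
>>>     scale = 0.5 * (5 < t < 10)
>>>     if scale != 0.0:
>>>         pump.mult(u, F_work)  # F_work = pump @ u
>>>         F.axpy(scale, F_work) # F += scale * F_work
>>>
>>> ts.setRHSFunction(rhsfunc)    # instead of computeRHSFunctionLinear
>>> # no setRHSJacobian call is needed for an explicit RK integrator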
>>>
>>> On Wed, Aug 9, 2023, 17:09 Niclas Götting <ngoetting at itp.uni-bremen.de>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm currently trying to convert a quantum simulation from scipy to
>>>> PETSc. The problem itself is extremely simple and of the form
>>>> \dot{u}(t) = (A_const + f(t)*B_const)*u(t), where f(t) in this simple
>>>> test case is a square function. The matrices A_const and B_const are
>>>> extremely sparse, and therefore I thought the problem would be well
>>>> suited for PETSc. Currently, I solve the ODE with the following
>>>> procedure in scipy (I can provide the necessary data files if needed,
>>>> but they are just some trace-preserving, very sparse matrices):
>>>>
>>>> import numpy as np
>>>> import scipy.sparse
>>>> import scipy.integrate
>>>>
>>>> from tqdm import tqdm
>>>>
>>>>
>>>> l = np.load("../liouvillian.npy")
>>>> pump = np.load("../pump_operator.npy")
>>>> state = np.load("../initial_state.npy")
>>>>
>>>> l = scipy.sparse.csr_array(l)
>>>> pump = scipy.sparse.csr_array(pump)
>>>>
>>>> def f(t, y, *args):
>>>>      return (l + 0.5 * (5 < t < 10) * pump) @ y
>>>>      #return l @ y # Uncomment for f(t) = 0
>>>>
>>>> dt = 0.1
>>>> NUM_STEPS = 200
>>>> res = np.empty((NUM_STEPS, 4096), dtype=np.complex128)
>>>> solver = scipy.integrate.ode(f).set_integrator("zvode").set_initial_value(state)
>>>> times = []
>>>> for i in tqdm(range(NUM_STEPS)):
>>>>      res[i, :] = solver.integrate(solver.t + dt)
>>>>      times.append(solver.t)
>>>>
>>>> Here, A_const = l, B_const = pump and f(t) = 0.5 * (5 < t < 10). tqdm
>>>> reports about 330it/s on my machine. When converting the code to PETSc,
>>>> I came to the following result (following the chapter
>>>> https://petsc.org/main/manual/ts/#special-cases):
>>>>
>>>> import sys
>>>> import petsc4py
>>>> petsc4py.init(args=sys.argv)
>>>> import numpy as np
>>>> import scipy.sparse
>>>>
>>>> from tqdm import tqdm
>>>> from petsc4py import PETSc
>>>>
>>>> comm = PETSc.COMM_WORLD
>>>>
>>>>
>>>> def mat_to_real(arr):
>>>>     return np.block([[arr.real, -arr.imag],
>>>>                      [arr.imag, arr.real]]).astype(np.float64)
>>>>
>>>> def mat_to_petsc_aij(arr):
>>>>     arr_sc_sp = scipy.sparse.csr_array(arr)
>>>>     mat = PETSc.Mat().createAIJ(arr.shape[0], comm=comm)
>>>>     rstart, rend = mat.getOwnershipRange()
>>>>     print(rstart, rend)
>>>>     print(arr.shape[0])
>>>>     print(mat.sizes)
>>>>     I = arr_sc_sp.indptr[rstart : rend + 1] - arr_sc_sp.indptr[rstart]
>>>>     J = arr_sc_sp.indices[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]
>>>>     V = arr_sc_sp.data[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]
>>>>
>>>>     print(I.shape, J.shape, V.shape)
>>>>     mat.setValuesCSR(I, J, V)
>>>>     mat.assemble()
>>>>     return mat
>>>>
>>>>
>>>> l = np.load("../liouvillian.npy")
>>>> l = mat_to_real(l)
>>>> pump = np.load("../pump_operator.npy")
>>>> pump = mat_to_real(pump)
>>>> state = np.load("../initial_state.npy")
>>>> state = np.hstack([state.real, state.imag]).astype(np.float64)
>>>>
>>>> l = mat_to_petsc_aij(l)
>>>> pump = mat_to_petsc_aij(pump)
>>>>
>>>>
>>>> jac = l.duplicate()
>>>> for i in range(8192):
>>>>      jac.setValue(i, i, 0)
>>>> jac.assemble()
>>>> jac += l
>>>>
>>>> vec = l.createVecRight()
>>>> vec.setValues(np.arange(state.shape[0], dtype=np.int32), state)
>>>> vec.assemble()
>>>>
>>>>
>>>> dt = 0.1
>>>>
>>>> ts = PETSc.TS().create(comm=comm)
>>>> ts.setFromOptions()
>>>> ts.setProblemType(ts.ProblemType.LINEAR)
>>>> ts.setEquationType(ts.EquationType.ODE_EXPLICIT)
>>>> ts.setType(ts.Type.RK)
>>>> ts.setRKType(ts.RKType.RK3BS)
>>>> ts.setTime(0)
>>>> print("KSP:", ts.getKSP().getType())
>>>> print("KSP PC:",ts.getKSP().getPC().getType())
>>>> print("SNES :", ts.getSNES().getType())
>>>>
>>>> def jacobian(ts, t, u, Amat, Pmat):
>>>>     Amat.zeroEntries()
>>>>     Amat.aypx(1, l,
>>>>               structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)
>>>>     Amat.axpy(0.5 * (5 < t < 10), pump,
>>>>               structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)
>>>>
>>>> ts.setRHSFunction(PETSc.TS.computeRHSFunctionLinear)
>>>> #ts.setRHSJacobian(PETSc.TS.computeRHSJacobianConstant, l, l)  # Uncomment for f(t) = 0
>>>> ts.setRHSJacobian(jacobian, jac)
>>>>
>>>> NUM_STEPS = 200
>>>> res = np.empty((NUM_STEPS, 8192), dtype=np.float64)
>>>> times = []
>>>> rstart, rend = vec.getOwnershipRange()
>>>> for i in tqdm(range(NUM_STEPS)):
>>>>      time = ts.getTime()
>>>>      ts.setMaxTime(time + dt)
>>>>      ts.solve(vec)
>>>>      res[i, rstart:rend] = vec.getArray()[:]
>>>>      times.append(time)
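>>>>
>>>> For comparison, a minimal sketch of letting TS run once to the final
>>>> time and recording states with a monitor callback instead of the manual
>>>> setMaxTime/solve loop (the names monitor and history are assumptions;
>>>> note the monitor fires at every accepted step, not at fixed dt
>>>> intervals):
>>>>
>>>> history = []
>>>>
>>>> def monitor(ts, step, time, u):
>>>>     # called by TS after every accepted time step
>>>>     history.append((time, u.getArray().copy()))
>>>>
>>>> ts.setMonitor(monitor)
>>>> ts.setMaxTime(NUM_STEPS * dt)
>>>> ts.solve(vec)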
>>>>
>>>> I decomposed the complex ODE into a larger real ODE, so that I can more
>>>> easily switch to GPU computation later on. The solutions of both scripts
>>>> are virtually identical, but PETSc runs about 3 times slower at 120it/s
>>>> on my machine. I don't use MPI for PETSc yet.
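>>>>
>>>> (For reference, the block form used in mat_to_real above follows from
>>>> splitting M = A + i*B and u = x + i*y, so that M*u = (A*x - B*y) +
>>>> i*(B*x + A*y); the real system therefore applies the block matrix
>>>> [[A, -B], [B, A]] to the stacked vector [x; y].)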
>>>>
>>>> I strongly suspect that the problem lies within the Jacobian definition,
>>>> as PETSc is about 3 times *faster* than scipy with f(t) = 0 and
>>>> therefore a constant Jacobian.
>>>>
>>>> Thank you in advance.
>>>>
>>>> All the best,
>>>> Niclas
>>>>
>>>>
>>>>