[petsc-users] Python PETSc performance vs scipy ZVODE

Niclas Götting ngoetting at itp.uni-bremen.de
Thu Aug 10 05:25:24 CDT 2023


You are absolutely right for this specific case (I get about 2400it/s 
instead of 2100it/s). However, the single square function will be 
replaced by a series of Gaussian pulses in the future, which will never 
be exactly zero. Maybe one could do an approximation and skip the 
second mult whenever the Gaussians are close to zero.
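
A minimal sketch of that approximation (assuming a hypothetical helper
f_envelope(t) that returns the sum-of-Gaussians amplitude, and the same
reusable work vector tmp_vec as in the RHS quoted below) could look like
this:

def rhsfunc6(ts, t, u, F):
    l.mult(u, F)
    scale = f_envelope(t)   # hypothetical sum-of-Gaussians amplitude
    if abs(scale) > 1e-12:  # skip the pump mult while the pulses are negligible
        pump.mult(u, tmp_vec)
        F.axpy(scale, tmp_vec)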

On 10.08.23 12:16, Stefano Zampini wrote:
> If you do the mult of "pump" inside an if it should be faster
>
> On Thu, Aug 10, 2023, 12:12 Niclas Götting 
> <ngoetting at itp.uni-bremen.de> wrote:
>
>     If I understood you right, this should be the resulting RHS:
>
>     def rhsfunc5(ts, t, u, F):
>         l.mult(u, F)
>         pump.mult(u, tmp_vec)
>         scale = 0.5 * (5 < t < 10)
>         F.axpy(scale, tmp_vec)
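>
>     (Here tmp_vec is assumed to be a reusable work vector created once
>     outside the RHS, so no allocation happens per step, e.g.
>
>         tmp_vec = l.createVecRight()
>
>     analogous to how vec is created in the original script further down.)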
>
>     It is a little bit slower than option 3, but at about 2100it/s it
>     is consistently ~10% faster than option 4.
>
>     Thank you very much for the suggestion!
>
>     On 10.08.23 11:47, Stefano Zampini wrote:
>>     I would use option 3. Keep a work vector and do a vector
>>     summation instead of the repeated multiplications by scale and
>>     1/scale.
>>
>>     I agree with you the docs are a little misleading here.
>>
>>     On Thu, Aug 10, 2023, 11:40 Niclas Götting
>>     <ngoetting at itp.uni-bremen.de> wrote:
>>
>>         Thank you both for the very quick answer!
>>
>>         So far, I have compiled PETSc with debugging turned on, but I
>>         think it should still be faster than standard scipy in both
>>         cases. Actually, Stefano's answer has already got me very far;
>>         now I only define the RHS of the ODE and no Jacobian (I wonder
>>         why the documentation suggests otherwise, though). I made the
>>         following four attempts at implementing the RHS:
>>
>>          1. def rhsfunc1(ts, t, u, F):
>>                 scale = 0.5 * (5 < t < 10)
>>                 (l + scale * pump).mult(u, F)
>>          2. def rhsfunc2(ts, t, u, F):
>>                 l.mult(u, F)
>>                 scale = 0.5 * (5 < t < 10)
>>                 (scale * pump).multAdd(u, F, F)
>>          3. def rhsfunc3(ts, t, u, F):
>>                 l.mult(u, F)
>>                 scale = 0.5 * (5 < t < 10)
>>                 if scale != 0:
>>                     pump.scale(scale)
>>                     pump.multAdd(u, F, F)
>>                     pump.scale(1/scale)
>>          4. def rhsfunc4(ts, t, u, F):
>>                 tmp_pump.zeroEntries() # tmp_pump is pump.duplicate()
>>                 l.mult(u, F)
>>                 scale = 0.5 * (5 < t < 10)
>>                 tmp_pump.axpy(scale, pump, structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
>>                 tmp_pump.multAdd(u, F, F)
>>
>>         They all yield the same results, but run at 50it/s, 800it/s,
>>         2300it/s and 1900it/s, respectively, which is a huge
>>         performance boost (almost 7 times as fast as scipy, with
>>         PETSc debugging still turned on). As the scale function will
>>         most likely be a Gaussian in the future, I think that option
>>         3 will become numerically unstable and I'll have to go with
>>         option 4, which is already faster than I expected. If you
>>         think it is possible to speed up the RHS calculation even
>>         more, I'd be happy to hear your suggestions; the -log_view
>>         output is attached to this message.
>>
>>         One last point: if I didn't misunderstand the documentation
>>         at https://petsc.org/release/manual/ts/#special-cases, should
>>         that section maybe be changed?
>>
>>         Best regards
>>         Niclas
>>
>>         On 09.08.23 17:51, Stefano Zampini wrote:
>>>         TSRK is an explicit solver. Unless you are changing the ts
>>>         type from the command line, the explicit Jacobian should not
>>>         be needed. On top of Barry's suggestion, I would suggest you
>>>         write the explicit RHS instead of assembling a throwaway
>>>         matrix every time that function needs to be sampled.
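>>>
>>>         A minimal sketch of that suggestion (reusing the already
>>>         assembled matrices l and pump from the script below, plus a
>>>         reusable work vector; tmp_vec and rhsfunc are just assumed
>>>         names) could look like this:
>>>
>>>             tmp_vec = l.createVecRight()  # work vector, created once
>>>
>>>             def rhsfunc(ts, t, u, F):
>>>                 l.mult(u, F)                         # F = l @ u
>>>                 pump.mult(u, tmp_vec)                # tmp_vec = pump @ u
>>>                 F.axpy(0.5 * (5 < t < 10), tmp_vec)  # F += f(t) * (pump @ u)
>>>
>>>             ts.setRHSFunction(rhsfunc)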
>>>
>>>         On Wed, Aug 9, 2023, 17:09 Niclas Götting
>>>         <ngoetting at itp.uni-bremen.de> wrote:
>>>
>>>             Hi all,
>>>
>>>             I'm currently trying to convert a quantum simulation from
>>>             scipy to PETSc. The problem itself is extremely simple and
>>>             of the form \dot{u}(t) = (A_const + f(t)*B_const)*u(t),
>>>             where f(t) in this simple test case is a square function.
>>>             The matrices A_const and B_const are extremely sparse, and
>>>             therefore I thought the problem would be well suited for
>>>             PETSc. Currently, I solve the ODE with the following
>>>             procedure in scipy (I can provide the necessary data files
>>>             if needed, but they are just some trace-preserving, very
>>>             sparse matrices):
>>>
>>>             import numpy as np
>>>             import scipy.sparse
>>>             import scipy.integrate
>>>
>>>             from tqdm import tqdm
>>>
>>>
>>>             l = np.load("../liouvillian.npy")
>>>             pump = np.load("../pump_operator.npy")
>>>             state = np.load("../initial_state.npy")
>>>
>>>             l = scipy.sparse.csr_array(l)
>>>             pump = scipy.sparse.csr_array(pump)
>>>
>>>             def f(t, y, *args):
>>>                  return (l + 0.5 * (5 < t < 10) * pump) @ y
>>>                  #return l @ y # Uncomment for f(t) = 0
>>>
>>>             dt = 0.1
>>>             NUM_STEPS = 200
>>>             res = np.empty((NUM_STEPS, 4096), dtype=np.complex128)
>>>             solver = scipy.integrate.ode(f).set_integrator("zvode").set_initial_value(state)
>>>             times = []
>>>             for i in tqdm(range(NUM_STEPS)):
>>>                  res[i, :] = solver.integrate(solver.t + dt)
>>>                  times.append(solver.t)
>>>
>>>             Here, A_const = l, B_const = pump and f(t) = 0.5 * (5 < t < 10).
>>>             tqdm reports about 330it/s on my machine. When converting
>>>             the code to PETSc, I came to the following result
>>>             (following the chapter
>>>             https://petsc.org/main/manual/ts/#special-cases):
>>>
>>>             import sys
>>>             import petsc4py
>>>             petsc4py.init(args=sys.argv)
>>>             import numpy as np
>>>             import scipy.sparse
>>>
>>>             from tqdm import tqdm
>>>             from petsc4py import PETSc
>>>
>>>             comm = PETSc.COMM_WORLD
>>>
>>>
>>>             def mat_to_real(arr):
>>>                  # embed the complex matrix as the real block matrix
>>>                  # [[Re, -Im], [Im, Re]] acting on [Re(u); Im(u)]
>>>                  return np.block([[arr.real, -arr.imag],
>>>                                   [arr.imag, arr.real]]).astype(np.float64)
>>>
>>>             def mat_to_petsc_aij(arr):
>>>                  arr_sc_sp = scipy.sparse.csr_array(arr)
>>>                  mat = PETSc.Mat().createAIJ(arr.shape[0], comm=comm)
>>>                  rstart, rend = mat.getOwnershipRange()
>>>                  print(rstart, rend)
>>>                  print(arr.shape[0])
>>>                  print(mat.sizes)
>>>                  # extract the CSR block of the locally owned rows
>>>                  # (row pointers shifted to start at zero)
>>>                  I = arr_sc_sp.indptr[rstart : rend + 1] - arr_sc_sp.indptr[rstart]
>>>                  J = arr_sc_sp.indices[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]
>>>                  V = arr_sc_sp.data[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]
>>>
>>>                  print(I.shape, J.shape, V.shape)
>>>                  mat.setValuesCSR(I, J, V)
>>>                  mat.assemble()
>>>                  return mat
>>>
>>>
>>>             l = np.load("../liouvillian.npy")
>>>             l = mat_to_real(l)
>>>             pump = np.load("../pump_operator.npy")
>>>             pump = mat_to_real(pump)
>>>             state = np.load("../initial_state.npy")
>>>             state = np.hstack([state.real, state.imag]).astype(np.float64)
>>>
>>>             l = mat_to_petsc_aij(l)
>>>             pump = mat_to_petsc_aij(pump)
>>>
>>>
>>>             # Jacobian work matrix: l's nonzero pattern plus an
>>>             # explicit (zero) diagonal, then initialized to l
>>>             jac = l.duplicate()
>>>             for i in range(8192):
>>>                  jac.setValue(i, i, 0)
>>>             jac.assemble()
>>>             jac += l
>>>
>>>             vec = l.createVecRight()
>>>             vec.setValues(np.arange(state.shape[0], dtype=np.int32), state)
>>>             vec.assemble()
>>>
>>>
>>>             dt = 0.1
>>>
>>>             ts = PETSc.TS().create(comm=comm)
>>>             ts.setFromOptions()
>>>             ts.setProblemType(ts.ProblemType.LINEAR)
>>>             ts.setEquationType(ts.EquationType.ODE_EXPLICIT)
>>>             ts.setType(ts.Type.RK)
>>>             ts.setRKType(ts.RKType.RK3BS)
>>>             ts.setTime(0)
>>>             print("KSP:", ts.getKSP().getType())
>>>             print("KSP PC:",ts.getKSP().getPC().getType())
>>>             print("SNES :", ts.getSNES().getType())
>>>
>>>             def jacobian(ts, t, u, Amat, Pmat):
>>>                  # rebuild Amat = l + f(t)*pump at every sample point
>>>                  Amat.zeroEntries()
>>>                  Amat.aypx(1, l,
>>>                            structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)
>>>                  Amat.axpy(0.5 * (5 < t < 10), pump,
>>>                            structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)
>>>
>>>             ts.setRHSFunction(PETSc.TS.computeRHSFunctionLinear)
>>>             # Uncomment for f(t) = 0:
>>>             # ts.setRHSJacobian(PETSc.TS.computeRHSJacobianConstant, l, l)
>>>             ts.setRHSJacobian(jacobian, jac)
>>>
>>>             NUM_STEPS = 200
>>>             res = np.empty((NUM_STEPS, 8192), dtype=np.float64)
>>>             times = []
>>>             rstart, rend = vec.getOwnershipRange()
>>>             for i in tqdm(range(NUM_STEPS)):
>>>                  time = ts.getTime()
>>>                  ts.setMaxTime(time + dt)
>>>                  ts.solve(vec)
>>>                  res[i, rstart:rend] = vec.getArray()[:]
>>>                  times.append(time)
>>>
>>>             I decomposed the complex ODE into a larger real ODE so
>>>             that I can more easily switch to GPU computation later on.
>>>             The solutions of both scripts are essentially identical,
>>>             but PETSc runs about 3 times slower, at 120it/s, on my
>>>             machine. I don't use MPI for PETSc yet.
>>>
>>>             I strongly suspect that the problem lies within the
>>>             Jacobian definition, as PETSc is about 3 times *faster*
>>>             than scipy with f(t) = 0 and therefore a constant
>>>             Jacobian.
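>>>
>>>             (A note on checking this: petsc4py.init(args=sys.argv)
>>>             forwards the command-line options to PETSc, so running the
>>>             script with -log_view prints a profile in which events
>>>             such as MatAXPY and MatAssemblyEnd show how much time is
>>>             spent rebuilding the Jacobian at every step.)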
>>>
>>>             Thank you in advance.
>>>
>>>             All the best,
>>>             Niclas
>>>
>>>