<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Based on your suggestion, I tried using vode for the real-valued
      problem in scipy and got roughly the same speed as before, which
      could have three reasons:</p>
    <ol>
      <li>(Z)VODE is slower than plain RK (however, I must admit that
        I'm not quite sure what (Z)VODE does precisely)</li>
      <li>Sparse matrix operations in scipy are slow; some of them are
        even written in pure Python.</li>
      <li>The RHS function in scipy must *return* a vector and therefore
        allocates new memory on every call (see the sketch after this
        list).</li>
    </ol>
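    <p>To illustrate the third point, here is a minimal sketch of the
      difference (the small random matrix is just a stand-in for the
      actual operators):</p>
    <p><font face="monospace">import scipy.sparse<br>
        from petsc4py import PETSc<br>
        <br>
        n = 4<br>
        A_sp = scipy.sparse.random(n, n, density=0.5, format="csr")<br>
        <br>
        # scipy-style RHS: has to *return* the result, so every call<br>
        # allocates a fresh array<br>
        def f_scipy(t, y):<br>
            return A_sp @ y<br>
        <br>
        # petsc4py-style RHS: writes into the preallocated output vector F<br>
        A_pc = PETSc.Mat().createAIJ(n)<br>
        A_pc.setValuesCSR(A_sp.indptr, A_sp.indices, A_sp.data)<br>
        A_pc.assemble()<br>
        <br>
        def f_petsc(ts, t, u, F):<br>
            A_pc.mult(u, F)  # result goes into F, no new allocation per call<br>
      </font></p>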
    <p>Parallelizing the code is of course a goal of mine, but I believe
      this will only become relevant for larger systems, which I want to
      investigate in the near future. <br>
    </p>
    <p>Regarding the RHS Jacobian, I see why defining an RHSFunction vs.
      an RHSJacobian should be computationally equivalent, but I found the
      RHSFunction much easier to optimize in this case, and I'm not quite
      sure why the documentation so strictly recommends providing only a
      Jacobian rather than an RHS function if the two are equivalent.</p>
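    <p>For reference, this is roughly what the two setups look like in
      petsc4py (a minimal sketch with a tiny constant stand-in matrix, so
      the time dependence is omitted; the names l and jac mirror my script
      further down this thread):</p>
    <p><font face="monospace">from petsc4py import PETSc<br>
        <br>
        n = 4<br>
        l = PETSc.Mat().createAIJ(n, nnz=1)  # stand-in for the Liouvillian<br>
        for i in range(n):<br>
            l.setValue(i, i, -1.0)<br>
        l.assemble()<br>
        jac = l.duplicate(copy=True)<br>
        <br>
        ts = PETSc.TS().create()<br>
        ts.setType(PETSc.TS.Type.RK)<br>
        <br>
        # pattern from the manual's "special cases" section: provide the RHS<br>
        # Jacobian and use the built-in linear RHS function<br>
        def jacobian(ts, t, u, Amat, Pmat):<br>
            Amat.zeroEntries()<br>
            Amat.axpy(1.0, l, structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)<br>
        <br>
        ts.setRHSFunction(PETSc.TS.computeRHSFunctionLinear)<br>
        ts.setRHSJacobian(jacobian, jac)<br>
        <br>
        # equivalent for an explicit solver like TSRK: only an RHS function<br>
        def rhsfunc(ts, t, u, F):<br>
            l.mult(u, F)  # F = l*u, written into the preallocated vector<br>
        <br>
        # ts.setRHSFunction(rhsfunc)  # ...and no setRHSJacobian at all<br>
      </font></p>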
    <p>Lastly, I'm aware that another performance boost awaits once I turn
      off the debugging functionality, but for this simple test I just
      wanted to see if there is *any* improvement in performance, and I
      was quite surprised by the factor of 7 with debugging still turned
      on.</p>
    <p>Thank you all for the interesting input and have a nice day!<br>
      Niclas<br>
    </p>
    <div class="moz-cite-prefix">On 15.08.23 00:37, Zhang, Hong wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:F72D34A9-4D3C-4F21-8557-BB44B63D91BB@anl.gov">
      PETSc is not necessarily faster than scipy for your problem when
      executed in serial, but you get benefits when running in parallel.
      The PETSc code you wrote uses float64 while your scipy code uses
      complex128, so the comparison may not be fair.
      <div><br>
      </div>
      <div>In addition, using the RHS Jacobian does not necessarily make
        your PETSc code slower. In your case, the bottleneck is the matrix
        operations. For best performance, you should avoid adding two
        sparse matrices (especially with different sparsity patterns),
        which is very costly, so one MatMult plus one MatMultAdd is the
        best option. MatAXPY with SAME_NONZERO_PATTERN would be a bit
        slower, but still faster than MatAXPY with SUBSET_NONZERO_PATTERN,
        which is what you used in the Jacobian function.</div>
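      <div><br>
      </div>
      <div>For illustration, here is a minimal petsc4py sketch of keeping
        the two matrices separate inside the RHS. It is a slight variant of
        the above (two MatMults plus a VecAXPY instead of a MatMultAdd, so
        the scale factor never has to touch the matrices); the tiny
        diagonal matrices and the work vector are only placeholders.</div>
      <div><font face="monospace">from petsc4py import PETSc<br>
          <br>
          n = 4<br>
          # tiny diagonal stand-ins for the constant operators l and pump<br>
          l = PETSc.Mat().createAIJ(n, nnz=1)<br>
          pump = PETSc.Mat().createAIJ(n, nnz=1)<br>
          for i in range(n):<br>
              l.setValue(i, i, -1.0)<br>
              pump.setValue(i, i, 0.5)<br>
          l.assemble()<br>
          pump.assemble()<br>
          <br>
          work = l.createVecRight()  # preallocated work vector, reused every call<br>
          <br>
          def rhs(ts, t, u, F):<br>
              l.mult(u, F)  # F = l*u (one MatMult)<br>
              scale = 0.5 * (5 < t < 10)<br>
              if scale != 0.0:<br>
                  pump.mult(u, work)  # second MatMult into the work vector<br>
                  F.axpy(scale, work)  # cheap VecAXPY instead of any MatAXPY<br>
          <br>
          u = l.createVecRight(); u.set(1.0)<br>
          F = l.createVecRight()<br>
          rhs(None, 7.0, u, F)  # inside the pulse window: F = (l + 0.5*pump)*u<br>
        </font></div>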
      <div><br>
      </div>
      <div>I echo Barry’s suggestion that debugging should be turned off
        before you do any performance study.</div>
      <div><br>
      </div>
      <div>Hong (Mr.)</div>
      <div>
        <div><br>
          <blockquote type="cite">
            <div>On Aug 10, 2023, at 4:40 AM, Niclas Götting
              <a class="moz-txt-link-rfc2396E" href="mailto:ngoetting@itp.uni-bremen.de"><ngoetting@itp.uni-bremen.de></a> wrote:</div>
            <br class="Apple-interchange-newline">
            <div>
              <div>
                <p>Thank you both for the very quick answer!</p>
                <p>So far, I compiled PETSc with debugging turned on,
                  but I think it should still be faster than standard
                  scipy in both cases. Actually, Stefano's answer has
                  got me very far already; now I only define the RHS of
                  the ODE and no Jacobian (I wonder why the
                  documentation suggests otherwise, though). I made the
                  following four attempts at implementing the RHS:</p>
                <ol>
                  <li><font face="monospace">def rhsfunc1(ts, t, u, F):<br>
                          scale = 0.5 * (5 < t < 10)<br>
                          (l + scale * pump).mult(u, F)</font></li>
                  <li><font face="monospace">def rhsfunc2(ts, t, u, F):<br>
                          l.mult(u, F)<br>
                          scale = 0.5 * (5 < t < 10)<br>
                          (scale * pump).multAdd(u, F, F)</font></li>
                  <li><font face="monospace">def rhsfunc3(ts, t, u, F):<br>
                          l.mult(u, F)<br>
                          scale = 0.5 * (5 < t < 10)<br>
                          if scale != 0:<br>
                              pump.scale(scale)<br>
                              pump.multAdd(u, F, F)<br>
                              pump.scale(1/scale)</font></li>
                  <li><font face="monospace">def rhsfunc4(ts, t, u, F):<br>
                          tmp_pump.zeroEntries() # tmp_pump is
                      pump.duplicate()<br>
                          l.mult(u, F)<br>
                          scale = 0.5 * (5 < t < 10)<br>
                          tmp_pump.axpy(scale, pump,
                      structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)<br>
                          tmp_pump.multAdd(u, F, F)<br>
                    </font></li>
                </ol>
                <p>They all yield the same results, but run at 50it/s,
                  800it/s, 2300it/s and 1900it/s, respectively, which is
                  a huge performance boost (almost 7 times as fast as
                  scipy, with PETSc debugging still turned on). As the
                  scale function will most likely be a Gaussian in the
                  future, I think that option 3 will become numerically
                  unstable and I'll have to go with option 4, which is
                  already faster than I expected. If you think it is
                  possible to speed up the RHS calculation even more,
                  I'd be happy to hear your suggestions; the -log_view
                  output is attached to this message.</p>
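                <p>For context, the Gaussian I have in mind would look
                  roughly like this (t0 and sigma are just placeholders),
                  which is why the division by scale in option 3 worries
                  me:</p>
                <p><font face="monospace">import numpy as np<br>
                    <br>
                    def scale(t, t0=7.5, sigma=1.0):<br>
                        # hypothetical Gaussian envelope replacing the square pulse<br>
                        return 0.5 * np.exp(-0.5 * ((t - t0) / sigma) ** 2)<br>
                    <br>
                    # far from t0 this underflows toward zero, so the<br>
                    # pump.scale(1/scale) step in option 3 would divide by a<br>
                    # vanishingly small number<br>
                  </font></p>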
                <p>One last point: If I didn't misunderstand the
                  documentation at <a class="moz-txt-link-freetext"
href="https://petsc.org/release/manual/ts/#special-cases"
                    moz-do-not-send="true">
                    https://petsc.org/release/manual/ts/#special-cases</a>,
                  should this maybe be changed?</p>
                <p>Best regards<br>
                  Niclas<br>
                </p>
                <div class="moz-cite-prefix">On 09.08.23 17:51, Stefano
                  Zampini wrote:<br>
                </div>
                <blockquote type="cite"
cite="mid:CAGPUisj1qQ=6Km1FsjmrDHWD7VH7XP7sNL4-ity+wtAoxdNLAA@mail.gmail.com">
                  <div dir="auto">
                    <div>TSRK is an explicit solver. Unless you are
                      changing the ts type from the command line, the
                      explicit Jacobian should not be needed. On top of
                      Barry's suggestion, I would suggest you write the
                      explicit RHS instead of assembling a throwaway
                      matrix every time that function needs to be
                      sampled.<br>
                      <br>
                      <div class="gmail_quote">
                        <div dir="ltr" class="gmail_attr">On Wed, Aug 9,
                          2023, 17:09 Niclas Götting <<a
                            href="mailto:ngoetting@itp.uni-bremen.de"
                            moz-do-not-send="true"
                            class="moz-txt-link-freetext">ngoetting@itp.uni-bremen.de</a>>
                          wrote:<br>
                        </div>
                        <blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                          Hi all,<br>
                          <br>
                          I'm currently trying to convert a quantum
                          simulation from scipy to <br>
                          PETSc. The problem itself is extremely simple
                          and of the form \dot{u}(t) <br>
                          = (A_const + f(t)*B_const)*u(t), where f(t) in
                          this simple test case is <br>
                          a square function. The matrices A_const and
                          B_const are extremely sparse <br>
                          and therefore I thought the problem would be
                          well suited for PETSc. <br>
                          Currently, I solve the ODE with the following
                          procedure in scipy (I can <br>
                          provide the necessary data files, if needed,
                          but they are just some <br>
                          trace-preserving, very sparse matrices):<br>
                          <br>
                          import numpy as np<br>
                          import scipy.sparse<br>
                          import scipy.integrate<br>
                          <br>
                          from tqdm import tqdm<br>
                          <br>
                          <br>
                          l = np.load("../liouvillian.npy")<br>
                          pump = np.load("../pump_operator.npy")<br>
                          state = np.load("../initial_state.npy")<br>
                          <br>
                          l = scipy.sparse.csr_array(l)<br>
                          pump = scipy.sparse.csr_array(pump)<br>
                          <br>
                          def f(t, y, *args):<br>
                               return (l + 0.5 * (5 < t < 10) *
                          pump) @ y<br>
                               #return l @ y # Uncomment for f(t) = 0<br>
                          <br>
                          dt = 0.1<br>
                          NUM_STEPS = 200<br>
                          res = np.empty((NUM_STEPS, 4096),
                          dtype=np.complex128)<br>
                          solver = <br>
scipy.integrate.ode(f).set_integrator("zvode").set_initial_value(state)<br>
                          times = []<br>
                          for i in tqdm(range(NUM_STEPS)):<br>
                               res[i, :] = solver.integrate(solver.t +
                          dt)<br>
                               times.append(solver.t)<br>
                          <br>
                          Here, A_const = l, B_const = pump and f(t) = 0.5
                          * (5 < t < 10). tqdm reports <br>
                          about 330it/s on my machine. When converting
                          the code to PETSc, I came <br>
                          to the following result (according to the
                          chapter <br>
                          <a
href="https://petsc.org/main/manual/ts/#special-cases"
                            rel="noreferrer noreferrer" target="_blank"
                            moz-do-not-send="true"
                            class="moz-txt-link-freetext">https://petsc.org/main/manual/ts/#special-cases</a>)<br>
                          <br>
                          import sys<br>
                          import petsc4py<br>
                          petsc4py.init(args=sys.argv)<br>
                          import numpy as np<br>
                          import scipy.sparse<br>
                          <br>
                          from tqdm import tqdm<br>
                          from petsc4py import PETSc<br>
                          <br>
                          comm = PETSc.COMM_WORLD<br>
                          <br>
                          <br>
                          def mat_to_real(arr):<br>
                               return np.block([[arr.real, -arr.imag],
                          [arr.imag, <br>
                          arr.real]]).astype(np.float64)<br>
                          <br>
                          def mat_to_petsc_aij(arr):<br>
                               arr_sc_sp = scipy.sparse.csr_array(arr)<br>
                               mat = PETSc.Mat().createAIJ(arr.shape[0],
                          comm=comm)<br>
                               rstart, rend = mat.getOwnershipRange()<br>
                               print(rstart, rend)<br>
                               print(arr.shape[0])<br>
                               print(mat.sizes)<br>
                               I = arr_sc_sp.indptr[rstart : rend + 1] -
                          arr_sc_sp.indptr[rstart]<br>
                               J =
                          arr_sc_sp.indices[arr_sc_sp.indptr[rstart] : <br>
                          arr_sc_sp.indptr[rend]]<br>
                               V =
                          arr_sc_sp.data[arr_sc_sp.indptr[rstart] :
                          arr_sc_sp.indptr[rend]]<br>
                          <br>
                               print(I.shape, J.shape, V.shape)<br>
                               mat.setValuesCSR(I, J, V)<br>
                               mat.assemble()<br>
                               return mat<br>
                          <br>
                          <br>
                          l = np.load("../liouvillian.npy")<br>
                          l = mat_to_real(l)<br>
                          pump = np.load("../pump_operator.npy")<br>
                          pump = mat_to_real(pump)<br>
                          state = np.load("../initial_state.npy")<br>
                          state = np.hstack([state.real,
                          state.imag]).astype(np.float64)<br>
                          <br>
                          l = mat_to_petsc_aij(l)<br>
                          pump = mat_to_petsc_aij(pump)<br>
                          <br>
                          <br>
                          jac = l.duplicate()<br>
                          for i in range(8192):<br>
                               jac.setValue(i, i, 0)<br>
                          jac.assemble()<br>
                          jac += l<br>
                          <br>
                          vec = l.createVecRight()<br>
                          vec.setValues(np.arange(state.shape[0],
                          dtype=np.int32), state)<br>
                          vec.assemble()<br>
                          <br>
                          <br>
                          dt = 0.1<br>
                          <br>
                          ts = PETSc.TS().create(comm=comm)<br>
                          ts.setFromOptions()<br>
                          ts.setProblemType(ts.ProblemType.LINEAR)<br>
ts.setEquationType(ts.EquationType.ODE_EXPLICIT)<br>
                          ts.setType(ts.Type.RK)<br>
                          ts.setRKType(ts.RKType.RK3BS)<br>
                          ts.setTime(0)<br>
                          print("KSP:", ts.getKSP().getType())<br>
                          print("KSP PC:",ts.getKSP().getPC().getType())<br>
                          print("SNES :", ts.getSNES().getType())<br>
                          <br>
                          def jacobian(ts, t, u, Amat, Pmat):<br>
                               Amat.zeroEntries()<br>
                               Amat.aypx(1, l,
                          structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)<br>
                               Amat.axpy(0.5 * (5 < t < 10), pump,
                          <br>
structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)<br>
                          <br>
ts.setRHSFunction(PETSc.TS.computeRHSFunctionLinear)<br>
#ts.setRHSJacobian(PETSc.TS.computeRHSJacobianConstant, l, l) # <br>
                          Uncomment for f(t) = 0<br>
                          ts.setRHSJacobian(jacobian, jac)<br>
                          <br>
                          NUM_STEPS = 200<br>
                          res = np.empty((NUM_STEPS, 8192),
                          dtype=np.float64)<br>
                          times = []<br>
                          rstart, rend = vec.getOwnershipRange()<br>
                          for i in tqdm(range(NUM_STEPS)):<br>
                               time = ts.getTime()<br>
                               ts.setMaxTime(time + dt)<br>
                               ts.solve(vec)<br>
                               res[i, rstart:rend] = vec.getArray()[:]<br>
                               times.append(time)<br>
                          <br>
                          I decomposed the complex ODE into a larger
                          real ODE, so that I can <br>
                          more easily switch to GPU computation later
                          on. The solutions of <br>
                          both scripts are essentially identical, but
                          PETSc runs about 3 times <br>
                          slower, at 120it/s on my machine. I don't use
                          MPI for PETSc yet.<br>
                          <br>
                          I strongly suspect that the problem lies
                          in the Jacobian definition, <br>
                          as PETSc is about 3 times *faster* than scipy
                          with f(t) = 0 and <br>
                          therefore a constant Jacobian.<br>
                          <br>
                          Thank you in advance.<br>
                          <br>
                          All the best,<br>
                          Niclas<br>
                          <br>
                          <br>
                        </blockquote>
                      </div>
                    </div>
                  </div>
                </blockquote>
              </div>
              <span id="cid:BE1D2EE1-07F9-4CBC-9F5B-F2E7FA651DAA"><log.log></span></div>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
  </body>
</html>