[petsc-users] Petsc4py Simulation: Mat.axpy() slow
Barry Smith
bsmith at petsc.dev
Sat Nov 11 18:58:46 CST 2023
How many MPI processes did you use? Please try with just one to get a baseline.
MatSetValues 1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 32 0 0 0 1 32 0 0 0 669
MatGetRow 1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00 1 32 5 14 0 1 32 5 14 0 478
This is concerning: there are 1298 calls each to MatSetValues and MatGetRow. Are you calling them yourself? If not, that means MatAXPY is calling them internally (when SAME_NONZERO_PATTERN is not used, MatAXPY falls back on MatGetRow/MatSetValues to build the result).
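If the matrices being added really do share a nonzero pattern, passing the structure flag explicitly avoids that fallback. A minimal petsc4py sketch with two tiny placeholder matrices (not your actual operators):

from petsc4py import PETSc

same = PETSc.Mat.Structure.SAME_NONZERO_PATTERN

# Two small AIJ matrices with identical (diagonal) nonzero patterns,
# just to illustrate the call; substitute your own assembled matrices.
Y = PETSc.Mat().createAIJ([4, 4], nnz=1, comm=PETSc.COMM_WORLD)
X = PETSc.Mat().createAIJ([4, 4], nnz=1, comm=PETSc.COMM_WORLD)
for i in range(*Y.getOwnershipRange()):
    Y[i, i] = 1.0
    X[i, i] = 2.0
Y.assemble()
X.assemble()

# Y += 0.5*X reduces to a BLAS axpy on the stored values when the
# patterns match; no MatGetRow/MatSetValues calls are made.
Y.axpy(0.5, X, structure=same)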
> I'm still somewhat new to parallel computing, so I'm not sure what you mean by binding. Does this lock in certain processes to certain cores?
Yes, binding pins each MPI process to a particular core (or set of cores), and it is usually the right thing to do for numerical computing. I included a link to some discussion of it on petsc.org.
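For example, with Open MPI a bound launch line might look like the following (the exact flags depend on your MPI implementation and the script name is a placeholder, so treat this as a sketch):

mpiexec -n 6 --bind-to core --map-by core python your_script.py

With MPICH the rough equivalent is

mpiexec -n 6 -bind-to core python your_script.py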
> On Nov 11, 2023, at 6:38 PM, Donald Rex Planalp <Donald.Planalp at colorado.edu> wrote:
>
> Hello again,
>
> I've run the simulation with profiling again. In this setup I only ran the methods needed to construct the matrices, and then at the end I added H_mix and H_ang using the Mat.axpy() method. Below are the results:
>
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flop --- Global --- --- Stage ---- Total
> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided 350 1.0 6.9605e-01 1.9 0.00e+00 0.0 3.9e+02 5.6e+00 3.5e+02 3 0 24 0 22 3 0 24 0 22 0
> BuildTwoSidedF 236 1.0 6.9569e-01 1.9 0.00e+00 0.0 2.3e+02 1.1e+01 2.4e+02 3 0 14 0 15 3 0 14 0 15 0
> SFSetGraph 114 1.0 2.0555e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> SFSetUp 116 1.0 1.2123e-03 1.1 0.00e+00 0.0 6.2e+02 8.9e+02 1.1e+02 0 0 38 71 7 0 0 38 71 7 0
> SFPack 312 1.0 6.2350e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> SFUnpack 312 1.0 2.2470e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecView 26 1.0 4.0207e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecCopy 13 1.0 7.6100e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAssemblyBegin 14 1.0 2.9690e-04 2.1 0.00e+00 0.0 2.3e+02 1.1e+01 1.4e+01 0 0 14 0 1 0 0 14 0 1 0
> VecAssemblyEnd 14 1.0 2.9400e-05 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecScatterBegin 312 1.0 6.0288e-04 1.6 0.00e+00 0.0 7.3e+02 3.1e+02 1.0e+02 0 0 45 29 6 0 0 45 29 7 0
> VecScatterEnd 312 1.0 9.1886e-04 3.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSetRandom 2 1.0 3.6500e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatMult 104 1.0 2.5355e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02 0 1 49 29 14 0 1 49 29 14 70
> MatSolve 104 1.0 2.4431e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02 0 0 49 29 7 0 0 49 29 7 36
> MatLUFactorSym 2 1.0 1.2584e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatLUFactorNum 2 1.0 7.7804e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatScale 2 1.0 4.5068e-02 1.0 2.36e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 65 0 0 0 0 65 0 0 0 3657
> MatAssemblyBegin 169 1.0 4.2296e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+02 2 0 0 0 7 2 0 0 0 7 0
> MatAssemblyEnd 169 1.0 3.6979e+00 1.0 0.00e+00 0.0 5.9e+02 9.4e+02 7.8e+02 22 0 36 71 48 22 0 36 71 49 0
> MatSetValues 1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 32 0 0 0 1 32 0 0 0 669
> MatGetRow 1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAXPY 1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00 1 32 5 14 0 1 32 5 14 0 478
> PCSetUp 2 1.0 2.0848e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 0 0 0 0 0 1 0
> PCApply 104 1.0 2.4459e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02 0 0 49 29 7 0 0 49 29 7 36
> KSPSetUp 2 1.0 1.8400e-06 5.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 104 1.0 2.5008e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02 0 0 49 29 14 0 0 49 29 14 35
> EPSSetUp 2 1.0 2.4027e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+01 0 0 0 0 1 0 0 0 0 1 0
> EPSSolve 2 1.0 3.0789e-02 1.0 1.09e+06 1.1 8.0e+02 2.8e+02 4.5e+02 0 3 49 29 28 0 3 49 29 28 242
> STSetUp 2 1.0 2.1294e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 0 0 0 0 0 1 0
> STComputeOperatr 2 1.0 2.1100e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> STApply 104 1.0 2.5290e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02 0 1 49 29 14 0 1 49 29 14 70
> STMatSolve 104 1.0 2.5081e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02 0 0 49 29 14 0 0 49 29 14 35
> BVCopy 20 1.0 2.8770e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> BVMultVec 206 1.0 1.8736e-04 1.5 3.13e+05 1.1 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 11442
> BVMultInPlace 11 1.0 9.2730e-05 1.2 1.49e+05 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 11026
> BVDotVec 206 1.0 1.0389e-03 1.2 3.35e+05 1.1 0.0e+00 0.0e+00 2.1e+02 0 1 0 0 13 0 1 0 0 13 2205
> BVOrthogonalizeV 106 1.0 1.2649e-03 1.1 6.84e+05 1.1 0.0e+00 0.0e+00 2.1e+02 0 2 0 0 13 0 2 0 0 13 3706
> BVScale 106 1.0 9.0329e-05 1.9 5.51e+03 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 418
> BVSetRandom 2 1.0 1.4080e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> BVMatMultVec 104 1.0 2.5468e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02 0 1 49 29 14 0 1 49 29 14 70
> DSSolve 9 1.0 1.2452e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> DSVectors 24 1.0 1.5345e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> DSOther 27 1.0 1.6735e-04 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> ------------------------------------------------------------------------------------------------------------------------
>
> I believe the only matrix addition I did is shown there, taking about 0.17 seconds. Doing it twice per time step, in addition to the scaling, ends up costing about 0.4 seconds
> per step in the full calculation, so the time step is far too slow.
>
> However, it seems that the number of flops is only on the order of 10^7. Would this imply that
> I am perhaps memory-bandwidth bound on my home PC?
>
> I'm still somewhat new to parallel computing, so I'm not sure what you mean by binding. Does this lock in certain processes to certain cores?
>
> Thank you for your time and assistance
>
> On Sat, Nov 11, 2023 at 2:54 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>> Here is the code for SeqBAIJ with SAME_NONZERO_PATTERN (the other formats are similar):
>>
>> PetscErrorCode MatAXPY_SeqBAIJ(Mat Y, PetscScalar a, Mat X, MatStructure str)
>> {
>>   Mat_SeqBAIJ *x = (Mat_SeqBAIJ *)X->data, *y = (Mat_SeqBAIJ *)Y->data;
>>   PetscInt     bs = Y->rmap->bs, bs2 = bs * bs;
>>   PetscBLASInt one = 1;
>>   ....
>>   if (str == SAME_NONZERO_PATTERN) {
>>     PetscScalar  alpha = a;
>>     PetscBLASInt bnz;
>>     PetscCall(PetscBLASIntCast(x->nz * bs2, &bnz));
>>     PetscCallBLAS("BLASaxpy", BLASaxpy_(&bnz, &alpha, x->a, &one, y->a, &one));
>>     PetscCall(PetscObjectStateIncrease((PetscObject)Y));
>>
>> It directly adds the nonzero values of the two matrices together (the nonzero structure plays no role), and it uses the BLAS, so it should perform as "fast as possible" given the hardware. (Did you configure with --with-debugging=0? Are you using binding with mpiexec to ensure MPI is using the best combination of cores? See https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup)
>>
>> Can you run with -log_view and send the output so we can see directly the performance?
>>
>>
>>
>>
>>> On Nov 11, 2023, at 4:25 PM, Donald Rex Planalp <Donald.Planalp at colorado.edu> wrote:
>>>
>>> Hello, and thank you for the quick reply.
>>>
>>> For some context, all of these matrices are produced from a Kronecker product, computed in parallel, between an "angular" and a "radial" matrix. In this case S_total and H_atom use the same angular matrix, which is the identity, while their
>>> radial matrices have different sparsity patterns.
>>>
>>> Meanwhile, H_mix and H_ang are constructed from two different angular matrices; however, both are nonzero only along the two off-diagonals, and their radial matrices have the same sparse structure.
>>>
>>> So S_total and H_atom share one block sparse structure, while H_mix and H_ang share another. However, even just adding H_mix and H_ang in the simplest case takes around half a second, which is still too slow, even when using SAME_NONZERO_PATTERN.
>>>
>>> This confuses me the most because another researcher wrote a similar code a few years back using finite differences, so his matrices are on the order of 250k x 250k and larger, yet the matrix additions are far faster for him.
>>>
>>> If you would like, I can show more code snippets, but I'm not sure what you would want to see.
>>>
>>> Thank you for your time
>>>
>>>
>>>
>>> On Sat, Nov 11, 2023 at 11:22 AM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>
>>>> DIFFERENT_NONZERO_PATTERN will always be slow because it needs to determine the nonzero pattern of the result for each computation.
>>>>
>>>> SAME_NONZERO_PATTERN can obviously run fast but may not be practical for your problem.
>>>>
>>>> SUBSET_NONZERO_PATTERN (which means in A += b*B we know that B has NO nonzero locations that A does not have) does not need to reallocate anything in A so it can be reasonably fast (we can do more optimizations on the current code to improve it for you).
>>>>
>>>> So, can you construct the S nonzero structure initially to have the union of the nonzero structures of all the matrices that accumulate into it? This is the same
>>>> nonzero structure that it "ends up with" in the current code. With this nonzero structure, SUBSET_NONZERO_PATTERN can be used. Depending on how sparse all the matrices that get accumulated into S are, you could perhaps go further and make sure they all have the same nonzero structure (sure, it uses more memory and does extra operations, but the actual computation will be so fast that it may be worth it). A rough sketch of the idea follows below.
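>>>>
>>>> A petsc4py sketch of that idea (the work matrix W is a placeholder name, and the 1.0/-1.0 coefficients stand in for your actual time-stepping coefficients):
>>>>
>>>> union = PETSc.Mat.Structure.DIFFERENT_NONZERO_PATTERN
>>>> subset = PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN
>>>>
>>>> # One-time setup: let DIFFERENT_NONZERO_PATTERN build the union structure once
>>>> W = S.duplicate(copy=True)
>>>> W.axpy(1.0, H_mix, union)
>>>> W.axpy(1.0, H_ang, union)
>>>> W.axpy(1.0, H_atom, union)   # W now holds the union nonzero structure
>>>>
>>>> # Every time step: reset W's values (its pattern is kept) and accumulate
>>>> # with the cheap flag, since each operand's pattern is a subset of W's
>>>> W.zeroEntries()
>>>> W.axpy(1.0, S, subset)
>>>> W.axpy(-1.0, H_mix, subset)
>>>> W.axpy(-1.0, H_ang, subset)
>>>> W.axpy(-1.0, H_atom, subset)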
>>>>
>>>> Barry
>>>>
>>>>
>>>>
>>>>> On Nov 10, 2023, at 9:53 PM, Donald Rex Planalp <Donald.Planalp at colorado.edu> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I am trying to use petsc4py to conduct a quantum mechanics simulation. I've been able to construct all of the relevant matrices; however, I am hitting a gigantic bottleneck.
>>>>>
>>>>> For the simplest problem I am running I have a few matrices, each about 5000x5000. In order to begin time propagation I need to add these matrices together. However, on 6 cores of my local machine it is taking approximately 1-2 seconds per addition. Since I need to do this for each time step of my simulation, and there could be upwards of 10K time steps, this is prohibitively slow.
>>>>>
>>>>> Below is the relevant code:
>>>>>
>>>>> structure = PETSc.Mat.Structure.DIFFERENT_NONZERO_PATTERN
>>>>> if test2:
>>>>>     def makeLeft(S, MIX, ANG, ATOM, i):
>>>>>         S.axpy(-Field.pulse[i], MIX, structure)
>>>>>         S.axpy(-Field.pulse[i], ANG, structure)
>>>>>         S.axpy(-1, ATOM, structure)
>>>>>         return S
>>>>>
>>>>>     def makeRight(S, MIX, ANG, ATOM, i):
>>>>>         S.axpy(Field.pulse[i], MIX, structure)
>>>>>         S.axpy(Field.pulse[i], ANG, structure)
>>>>>         S.axpy(1, ATOM, structure)
>>>>>         return S
>>>>>
>>>>>     # scale the Hamiltonian pieces once before time stepping
>>>>>     H_mix = Int.H_mix
>>>>>     H_mix.scale(1j * dt / 2)
>>>>>
>>>>>     H_ang = Int.H_ang
>>>>>     H_ang.scale(1j * dt / 2)
>>>>>
>>>>>     H_atom = Int.H_atom
>>>>>     H_atom.scale(1j * dt / 2)
>>>>>
>>>>>     S = Int.S_total
>>>>>
>>>>>     psi_initial = psi.psi_initial.copy()
>>>>>     ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
>>>>>
>>>>>     for i, t in enumerate(box.t):
>>>>>         print(i, L)
>>>>>
>>>>>         # build the left and right operators for this time step
>>>>>         O_L = makeLeft(S, H_mix, H_ang, H_atom, i)
>>>>>         O_R = makeRight(S, H_mix, H_ang, H_atom, i)
>>>>>
>>>>>         if i == 0:
>>>>>             known = O_R.getVecRight()
>>>>>             sol = O_L.getVecRight()
>>>>>
>>>>>         O_R.mult(psi_initial, known)
>>>>>
>>>>>         ksp.setOperators(O_L)
>>>>>         ksp.solve(known, sol)
>>>>>
>>>>>         psi_initial.copy(sol)
>>>>>
>>>>>
>>>>> I need to clean it up a bit, but the main point is that I need to add the matrices many times per time step. I can't preallocate memory very well since some of the matrices aren't particularly sparse either. It seems that if I cannot speed up the addition it will be difficult to continue, so I was wondering if you had any insights.
>>>>>
>>>>> Thank you for your time
>>>>
>>