[petsc-users] Petsc4py Simulation: Mat.axpy() slow

Barry Smith bsmith at petsc.dev
Sat Nov 11 18:58:46 CST 2023


   How many MPI processes did you use? Please try with just one to get a baseline.

MatSetValues        1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 32  0  0  0   1 32  0  0  0   669
MatGetRow           1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAXPY                1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00  1 32  5 14  0   1 32  5 14  0   478

   This is concerning: there are 1298 calls each to MatSetValues and MatGetRow. Are you calling them yourself? If not, that means MatAXPY is calling them (when SAME_NONZERO_PATTERN is not used, MatAXPY uses MatGetRow/MatSetValues internally to merge the nonzero patterns).
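
   As a rough sanity check, 1.18e7 flops is at most a few hundredths of a second of floating-point work on a modern core, so most of the 0.17 seconds in MatAXPY is going into merging the nonzero patterns, not into the addition itself. A minimal petsc4py sketch of passing the structure flag explicitly (Y, X, and alpha here are stand-ins for two of your assembled matrices with identical nonzero structure and a scalar):

   from petsc4py import PETSc

   # With SAME_NONZERO_PATTERN the addition reduces to a BLAS axpy over the
   # stored values; no MatGetRow/MatSetValues events should appear in -log_view.
   Y.axpy(alpha, X, structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)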

> I'm still somewhat new to parallel computing, so I'm not sure what you mean by binding. Does this lock in certain processes to certain cores? 


   Yes; binding pins each MPI process to a particular core (or set of cores), and it is usually the right thing to do for numerical computing. I included a link to some discussion of it on petsc.org.
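
   For example, with Open MPI the launch could look something like the line below (the exact option names depend on your MPI implementation, and simulation.py is just a placeholder for your script):

   mpiexec -n 6 --bind-to core --map-by core python simulation.py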

> On Nov 11, 2023, at 6:38 PM, Donald Rex Planalp <Donald.Planalp at colorado.edu> wrote:
> 
> Hello again,
> 
> I've run the simulation with profiling again. In this setup I only ran the methods necessary to construct the matrices, and then at the end I added H_mix and H_ang using the Mat.axpy() method. Below are the results:
> 
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
> BuildTwoSided        350 1.0 6.9605e-01 1.9 0.00e+00 0.0 3.9e+02 5.6e+00 3.5e+02  3  0 24  0 22   3  0 24  0 22     0
> BuildTwoSidedF       236 1.0 6.9569e-01 1.9 0.00e+00 0.0 2.3e+02 1.1e+01 2.4e+02  3  0 14  0 15   3  0 14  0 15     0
> SFSetGraph           114 1.0 2.0555e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFSetUp              116 1.0 1.2123e-03 1.1 0.00e+00 0.0 6.2e+02 8.9e+02 1.1e+02  0  0 38 71  7   0  0 38 71  7     0
> SFPack               312 1.0 6.2350e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFUnpack             312 1.0 2.2470e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecView               26 1.0 4.0207e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecCopy               13 1.0 7.6100e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyBegin      14 1.0 2.9690e-04 2.1 0.00e+00 0.0 2.3e+02 1.1e+01 1.4e+01  0  0 14  0  1   0  0 14  0  1     0
> VecAssemblyEnd        14 1.0 2.9400e-05 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin      312 1.0 6.0288e-04 1.6 0.00e+00 0.0 7.3e+02 3.1e+02 1.0e+02  0  0 45 29  6   0  0 45 29  7     0
> VecScatterEnd        312 1.0 9.1886e-04 3.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSetRandom           2 1.0 3.6500e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMult              104 1.0 2.5355e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
> MatSolve             104 1.0 2.4431e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02  0  0 49 29  7   0  0 49 29  7    36
> MatLUFactorSym         2 1.0 1.2584e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum         2 1.0 7.7804e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatScale               2 1.0 4.5068e-02 1.0 2.36e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0 65  0  0  0   0 65  0  0  0  3657
> MatAssemblyBegin     169 1.0 4.2296e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+02  2  0  0  0  7   2  0  0  0  7     0
> MatAssemblyEnd       169 1.0 3.6979e+00 1.0 0.00e+00 0.0 5.9e+02 9.4e+02 7.8e+02 22  0 36 71 48  22  0 36 71 49     0
> MatSetValues        1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 32  0  0  0   1 32  0  0  0   669
> MatGetRow           1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAXPY                1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00  1 32  5 14  0   1 32  5 14  0   478
> PCSetUp                2 1.0 2.0848e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  1     0
> PCApply              104 1.0 2.4459e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02  0  0 49 29  7   0  0 49 29  7    36
> KSPSetUp               2 1.0 1.8400e-06 5.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve             104 1.0 2.5008e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02  0  0 49 29 14   0  0 49 29 14    35
> EPSSetUp               2 1.0 2.4027e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+01  0  0  0  0  1   0  0  0  0  1     0
> EPSSolve               2 1.0 3.0789e-02 1.0 1.09e+06 1.1 8.0e+02 2.8e+02 4.5e+02  0  3 49 29 28   0  3 49 29 28   242
> STSetUp                2 1.0 2.1294e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  1     0
> STComputeOperatr       2 1.0 2.1100e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> STApply              104 1.0 2.5290e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
> STMatSolve           104 1.0 2.5081e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02  0  0 49 29 14   0  0 49 29 14    35
> BVCopy                20 1.0 2.8770e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> BVMultVec            206 1.0 1.8736e-04 1.5 3.13e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0 11442
> BVMultInPlace         11 1.0 9.2730e-05 1.2 1.49e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 11026
> BVDotVec             206 1.0 1.0389e-03 1.2 3.35e+05 1.1 0.0e+00 0.0e+00 2.1e+02  0  1  0  0 13   0  1  0  0 13  2205
> BVOrthogonalizeV     106 1.0 1.2649e-03 1.1 6.84e+05 1.1 0.0e+00 0.0e+00 2.1e+02  0  2  0  0 13   0  2  0  0 13  3706
> BVScale              106 1.0 9.0329e-05 1.9 5.51e+03 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   418
> BVSetRandom            2 1.0 1.4080e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> BVMatMultVec         104 1.0 2.5468e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
> DSSolve                9 1.0 1.2452e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> DSVectors             24 1.0 1.5345e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> DSOther               27 1.0 1.6735e-04 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
> 
> I believe the only matrix addition I did is the one shown there, taking about 0.17 seconds. Doing it twice per timestep, in addition to the scaling, comes to roughly 0.4 seconds
> per timestep during the full calculation, so the timestep is far too slow.
> 
> However, it seems that the number of flops is on the order of 10^7. Would this imply that
> I am perhaps memory-bandwidth bound on my home PC?
> 
> I'm still somewhat new to parallel computing, so I'm not sure what you mean by binding. Does this lock in certain processes to certain cores? 
> 
> Thank you for your time and assistance
> 
> On Sat, Nov 11, 2023 at 2:54 PM Barry Smith <bsmith at petsc.dev <mailto:bsmith at petsc.dev>> wrote:
>> 
>>   Here is the (sequential block) code for BAIJ with SAME_NONZERO_PATTERN; the other formats are similar.
>> 
>> PetscErrorCode MatAXPY_SeqBAIJ(Mat Y, PetscScalar a, Mat X, MatStructure str)
>> {
>>   Mat_SeqBAIJ *x = (Mat_SeqBAIJ *)X->data, *y = (Mat_SeqBAIJ *)Y->data;
>>   PetscInt     bs = Y->rmap->bs, bs2 = bs * bs;
>>   PetscBLASInt one = 1;
>> ....
>>   if (str == SAME_NONZERO_PATTERN) {
>>     PetscScalar  alpha = a;
>>     PetscBLASInt bnz;
>>     PetscCall(PetscBLASIntCast(x->nz * bs2, &bnz));
>>     PetscCallBLAS("BLASaxpy", BLASaxpy_(&bnz, &alpha, x->a, &one, y->a, &one));
>>     PetscCall(PetscObjectStateIncrease((PetscObject)Y));
>> 
>> It directly adds the nonzero values from the two matrices together (the nonzero structure plays no role) and it uses the BLAS, so it should perform as "fast as possible" given the hardware. Did you configure with --with-debugging=0? Are you using binding with mpiexec to ensure MPI is using the best combination of cores? See https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup
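>> 
>> If you want a quick baseline on your machine, you can time the SAME_NONZERO_PATTERN path in isolation. A rough petsc4py sketch, reusing H_mix from your snippet (untested, just to illustrate the idea):
>> 
>> import time
>> from petsc4py import PETSc
>> 
>> SAME = PETSc.Mat.Structure.SAME_NONZERO_PATTERN
>> 
>> A = H_mix.duplicate(copy=True)          # two throwaway copies with the
>> B = H_mix.duplicate(copy=True)          # same nonzero structure by construction
>> PETSc.COMM_WORLD.barrier()              # so all ranks start timing together
>> t0 = time.perf_counter()
>> A.axpy(2.0, B, structure=SAME)          # should reduce to a straight BLAS axpy
>> PETSc.COMM_WORLD.barrier()
>> PETSc.Sys.Print("SAME_NONZERO_PATTERN axpy:", time.perf_counter() - t0, "seconds")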
>> 
>>  Can you run with -log_view and send the output so we can see the performance directly?
>> 
>> 
>> 
>> 
>>> On Nov 11, 2023, at 4:25 PM, Donald Rex Planalp <Donald.Planalp at colorado.edu <mailto:Donald.Planalp at colorado.edu>> wrote:
>>> 
>>> Hello, and thank you for the quick reply.
>>> 
>>> For some context, all of these matrices are produced from a Kronecker product, computed in parallel, between an "angular" and a "radial" matrix. In this case S_total and H_atom use the same angular matrix, which is the identity, while their
>>> radial matrices have different sparsity patterns.
>>> 
>>>  Meanwhile, H_mix and H_ang are constructed from two different angular matrices; however, both are nonzero only along the two off-diagonals, and their radial matrices have the same sparse structure.
>>> 
>>> So S_total and H_atom share one block-sparse structure, while H_mix and H_ang share another. However, even just adding H_mix and H_ang in the simplest case takes around half a second, which is still too slow, even when using SAME_NONZERO_PATTERN.
>>> 
>>> This confuses me the most because another researcher wrote a similar code a few years back using finite differences, so his matrices are on the order of 250k x 250k and larger, yet the matrix additions are far faster for him.
>>> 
>>> If you would like, I can show more code snippets, but I'm not sure what you would want to see.
>>> 
>>> Thank you for your time
>>> 
>>> 
>>> 
>>> On Sat, Nov 11, 2023 at 11:22 AM Barry Smith <bsmith at petsc.dev <mailto:bsmith at petsc.dev>> wrote:
>>>> 
>>>> 
>>>>  DIFFERENT_NONZERO_PATTERN will always be slow because it needs to determine the nonzero pattern of the result for each computation.
>>>> 
>>>>  SAME_NONZERO_PATTERN can obviously run fast but may not be practical for your problem. 
>>>> 
>>>>  SUBSET_NONZERO_PATTERN (which means in A += b*B we know that B has NO nonzero locations that A does not have) does not need to reallocate anything in A so it can be reasonably fast (we can do more optimizations on the current code to improve it for you).
>>>> 
>>>>   So, can you construct the nonzero structure of S initially to be the union of the nonzero structures of all the matrices that accumulate into it? This is the same
>>>> nonzero structure that it ends up with in the current code. With that structure in place, SUBSET_NONZERO_PATTERN can be used. Depending on how sparse the accumulated matrices are, you could perhaps go further and give them all the same nonzero structure (it uses more memory and does some extra work, but the actual computation will be so fast that it may be worth it).
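>>>> 
>>>> A rough petsc4py sketch of one way to set this up, using the names from your snippet (an illustration of the idea, not tested code):
>>>> 
>>>> from petsc4py import PETSc
>>>> 
>>>> SUBSET    = PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN
>>>> DIFFERENT = PETSc.Mat.Structure.DIFFERENT_NONZERO_PATTERN
>>>> 
>>>> # One-time setup: build a work matrix whose nonzero structure is the union of
>>>> # all the terms. The single expensive DIFFERENT_NONZERO_PATTERN merge happens
>>>> # here, outside the time loop; the values it holds at this point do not matter.
>>>> O_L = S.duplicate(copy=True)
>>>> O_L.axpy(1.0, H_mix,  DIFFERENT)
>>>> O_L.axpy(1.0, H_ang,  DIFFERENT)
>>>> O_L.axpy(1.0, H_atom, DIFFERENT)
>>>> 
>>>> # Inside the time loop: every operand's structure is a subset of O_L's,
>>>> # so nothing needs to be reallocated.
>>>> O_L.zeroEntries()                     # zeroes the values, keeps the structure
>>>> O_L.axpy(1.0,             S,      SUBSET)
>>>> O_L.axpy(-Field.pulse[i], H_mix,  SUBSET)
>>>> O_L.axpy(-Field.pulse[i], H_ang,  SUBSET)
>>>> O_L.axpy(-1.0,            H_atom, SUBSET)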
>>>> 
>>>>   Barry
>>>> 
>>>> 
>>>> 
>>>>> On Nov 10, 2023, at 9:53 PM, Donald Rex Planalp <Donald.Planalp at colorado.edu <mailto:Donald.Planalp at colorado.edu>> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> I am trying to use petsc4py to conduct a quantum mechanics simulation. I've been able to construct all of the relevant matrices, however I am reaching a gigantic bottleneck. 
>>>>> 
>>>>> For the simplest problem I am running I have a few matrices, each about 5000x5000. In order to begin time propagation I need to add these matrices together. However, on 6 cores of my local machine each addition takes approximately 1-2 seconds. Since I need to do this at every time step of my simulation, and there could be upwards of 10K time steps, it is prohibitively slow.
>>>>> 
>>>>> Below is the relevant code:
>>>>> 
>>>>> structure = PETSc.Mat.Structure.DIFFERENT_NONZERO_PATTERN
>>>>> 
>>>>>     if test2:
>>>>>         def makeLeft(S, MIX, ANG, ATOM, i):
>>>>>             S.axpy(-Field.pulse[i], MIX, structure)
>>>>>             S.axpy(-Field.pulse[i], ANG, structure)
>>>>>             S.axpy(-1, ATOM, structure)
>>>>>             return S
>>>>> 
>>>>>         def makeRight(S, MIX, ANG, ATOM, i):
>>>>>             S.axpy(Field.pulse[i], MIX, structure)
>>>>>             S.axpy(Field.pulse[i], ANG, structure)
>>>>>             S.axpy(1, ATOM, structure)
>>>>>             return S
>>>>> 
>>>>>         H_mix = Int.H_mix
>>>>>         H_mix.scale(1j * dt / 2)
>>>>> 
>>>>>         H_ang = Int.H_ang
>>>>>         H_ang.scale(1j * dt / 2)
>>>>> 
>>>>>         H_atom = Int.H_atom
>>>>>         H_atom.scale(1j * dt / 2)
>>>>> 
>>>>>         S = Int.S_total
>>>>> 
>>>>>         psi_initial = psi.psi_initial.copy()
>>>>>         ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
>>>>> 
>>>>>         for i, t in enumerate(box.t):
>>>>>             print(i, L)
>>>>> 
>>>>>             O_L = makeLeft(S, H_mix, H_ang, H_atom, i)
>>>>>             O_R = makeRight(S, H_mix, H_ang, H_atom, i)
>>>>> 
>>>>>             if i == 0:
>>>>>                 known = O_R.getVecRight()
>>>>>                 sol = O_L.getVecRight()
>>>>> 
>>>>>             O_R.mult(psi_initial, known)
>>>>> 
>>>>>             ksp.setOperators(O_L)
>>>>>             ksp.solve(known, sol)
>>>>> 
>>>>>             psi_initial.copy(sol)
>>>>> 
>>>>> I need to clean it up a bit, but the main point is that I need to add the matrices several times per time step. I can't preallocate memory very well since some of the matrices aren't very sparse. It seems that if I cannot speed up the addition it will be difficult to continue, so I was wondering if you had any insights.
>>>>> 
>>>>> Thank you for your time
>>>> 
>> 
