[petsc-users] Petsc4py Simulation: Mat.axpy() slow

Barry Smith bsmith at petsc.dev
Sat Nov 11 15:54:15 CST 2023


  Here is the code used for BAIJ with SAME_NONZERO_PATTERN (the MPI case applies the same kernel to each process's local part; the other formats are similar):

PetscErrorCode MatAXPY_SeqBAIJ(Mat Y, PetscScalar a, Mat X, MatStructure str)
{
  Mat_SeqBAIJ *x = (Mat_SeqBAIJ *)X->data, *y = (Mat_SeqBAIJ *)Y->data;
  PetscInt     bs = Y->rmap->bs, bs2 = bs * bs;
  PetscBLASInt one = 1;
  ...
  if (str == SAME_NONZERO_PATTERN) {
    PetscScalar  alpha = a;
    PetscBLASInt bnz;
    PetscCall(PetscBLASIntCast(x->nz * bs2, &bnz));
    PetscCallBLAS("BLASaxpy", BLASaxpy_(&bnz, &alpha, x->a, &one, y->a, &one));
    PetscCall(PetscObjectStateIncrease((PetscObject)Y));
  ...

It directly adds the nonzero values of the two matrices together (the nonzero structure plays no role), and it uses the BLAS, so it should perform about as fast as the hardware allows. Have you configured with --with-debugging=0? Are you using process binding with mpiexec so that MPI uses the best combination of cores? See https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup
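
  With Open MPI, for example, binding might look something like the line below (the exact flag spelling differs between MPI implementations, and "sim.py" is just a placeholder for your driver script):

     mpiexec -n 6 --bind-to core python sim.py -log_view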

 Can you run with -log_view and send the output so we can see the performance directly?
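
 Note that, depending on how the script initializes petsc4py, PETSc may only see command-line options such as -log_view if sys.argv is passed at initialization. A minimal sketch:

    import sys
    import petsc4py
    petsc4py.init(sys.argv)        # hand options like -log_view to PETSc
    from petsc4py import PETSc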




> On Nov 11, 2023, at 4:25 PM, Donald Rex Planalp <Donald.Planalp at colorado.edu> wrote:
> 
> Hello, and thank you for the quick reply.
> 
> For some context, all of these matrices are produced in parallel from a Kronecker product between an "angular" and a "radial" matrix. In this case S_total and H_atom use the same angular matrix, which is the identity, while their
> radial matrices have different sparsity patterns.
> 
>  Meanwhile, H_mix and H_ang are constructed from two different angular matrices; however, both are nonzero only along the two off-diagonals, and their radial matrices have the same sparse structure.
> 
> So S_total and H_atom share one block sparsity structure, while H_mix and H_ang share another. However, even just adding H_mix and H_ang in the simplest case takes around half a second, which is still too slow, even when using SAME_NONZERO_PATTERN.
> 
> This confuses me the most because another researcher wrote a similar code a few years back using finite differences, so his matrices are on the order of 250k x 250k and larger, yet the matrix additions are far faster for him.
> 
> If you would like, I can show more code snippets, but I'm not sure what you would want to see.
> 
> Thank you for your time
> 
> 
> 
> On Sat, Nov 11, 2023 at 11:22 AM Barry Smith <bsmith at petsc.dev> wrote:
>> 
>> 
>>  DIFFERENT_NONZERO_PATTERN will always be slow because it needs to determine the nonzero pattern of the result for each computation.
>> 
>>  SAME_NONZERO_PATTERN can obviously run fast but may not be practical for your problem. 
>> 
>>  SUBSET_NONZERO_PATTERN (which means in A += b*B we know that B has NO nonzero locations that A does not have) does not need to reallocate anything in A so it can be reasonably fast (we can do more optimizations on the current code to improve it for you).
>> 
>>   So, can you construct the nonzero structure of S initially to be the union of the nonzero structures of all the matrices that accumulate into it? This is the same
>> nonzero structure that it "ends up with" in the current code. With that structure, SUBSET_NONZERO_PATTERN can be used. Depending on how sparse the matrices that get accumulated into S are, you could perhaps go further and give them all the same nonzero structure (yes, that uses more memory and does extra operations, but the actual computation will be so fast that it may be worth it).
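>> 
>>   A rough, untested petsc4py sketch of that idea, reusing the names from your snippet (S, H_mix, H_ang, H_atom, Field.pulse, box.t), might look like:
>> 
>>     different = PETSc.Mat.Structure.DIFFERENT_NONZERO_PATTERN
>>     subset    = PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN
>> 
>>     # Done once, before the time loop: build work matrices whose nonzero
>>     # structure is the union of all the operators. The expensive unions
>>     # happen here, exactly once; the values get overwritten each step.
>>     O_L = S.duplicate(copy=True)
>>     O_L.axpy(1.0, H_mix, different)
>>     O_L.axpy(1.0, H_ang, different)
>>     O_L.axpy(1.0, H_atom, different)
>>     O_R = O_L.duplicate(copy=True)
>> 
>>     for i, t in enumerate(box.t):
>>         # Refill the values every step; the union structure never changes,
>>         # so every axpy can take the cheap SUBSET_NONZERO_PATTERN path.
>>         O_L.zeroEntries()
>>         O_L.axpy(1.0, S, subset)
>>         O_L.axpy(-Field.pulse[i], H_mix, subset)
>>         O_L.axpy(-Field.pulse[i], H_ang, subset)
>>         O_L.axpy(-1.0, H_atom, subset)
>> 
>>         O_R.zeroEntries()
>>         O_R.axpy(1.0, S, subset)
>>         O_R.axpy(Field.pulse[i], H_mix, subset)
>>         O_R.axpy(Field.pulse[i], H_ang, subset)
>>         O_R.axpy(1.0, H_atom, subset)
>>         # ... then the mult/solve as before ...
>> 
>>   Working on copies like this also avoids modifying S in place, which the current makeLeft/makeRight appear to do.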
>> 
>>   Barry
>> 
>> 
>> 
>>> On Nov 10, 2023, at 9:53 PM, Donald Rex Planalp <Donald.Planalp at colorado.edu> wrote:
>>> 
>>> Hello,
>>> 
>>> I am trying to use petsc4py to conduct a quantum mechanics simulation. I've been able to construct all of the relevant matrices; however, I am hitting a gigantic bottleneck.
>>> 
>>> For the simplest problem I am running I have a few matrices, each about 5000x5000. In order to begin time propagation I need to add these matrices together. However, on 6 cores of my local machine each addition takes approximately 1-2 seconds. Since I need to do this at every time step of my simulation, and there could be upwards of 10K time steps, it is prohibitively slow.
>>> 
>>> Below is the relevant code:
>>> 
>>> structure = PETSc.Mat.Structure.DIFFERENT_NONZERO_PATTERN
>>>     if test2:
>>>         def makeLeft(S,MIX,ANG,ATOM,i):
>>>             S.axpy(-Field.pulse[i],MIX,structure)
>>>             S.axpy(-Field.pulse[i],ANG,structure)
>>>             S.axpy(-1,ATOM,structure)
>>>             return S
>>> 
>>>         def makeRight(S,MIX,ANG,ATOM,i):
>>>             S.axpy(Field.pulse[i],MIX,structure)
>>>             S.axpy(Field.pulse[i],ANG,structure)
>>>             S.axpy(1,ATOM,structure)
>>>             return S
>>> 
>>>         H_mix = Int.H_mix
>>>         H_mix.scale(1j * dt / 2)
>>> 
>>>         H_ang = Int.H_ang
>>>         H_ang.scale(1j * dt / 2)
>>> 
>>>         H_atom = Int.H_atom
>>>         H_atom.scale(1j * dt / 2)
>>> 
>>>         S = Int.S_total
>>> 
>>>         psi_initial = psi.psi_initial.copy()
>>>         ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
>>> 
>>>         for i,t in enumerate(box.t):
>>>             print(i,L)
>>> 
>>>             O_L = makeLeft(S,H_mix,H_ang,H_atom,i)
>>>             O_R = makeRight(S,H_mix,H_ang,H_atom,i)
>>> 
>>>             if i == 0:
>>>                 known = O_R.getVecRight()
>>>                 sol = O_L.getVecRight()
>>> 
>>>             O_R.mult(psi_initial,known)
>>> 
>>>             ksp.setOperators(O_L)
>>>             ksp.solve(known,sol)
>>> 
>>>             psi_initial.copy(sol)
>>> 
>>> I need to clean it up a bit, but the main point is that I need to add the matrices many times for a single time step. I also can't preallocate memory very well, since some of the matrices aren't very sparse. It seems that if I cannot speed up the addition it will be difficult to continue, so I was wondering if you had any insights.
>>> 
>>> Thank you for your time
>> 
