[petsc-users] parallelize matrix assembly process
김성익
ksi2443 at gmail.com
Mon Dec 12 21:20:57 CST 2022
Following your comments, I checked by running with '-info'.
As you suspected, most elements are being computed on the wrong MPI rank,
and there are a lot of stashed entries.
Should I partition the domain at the problem-definition stage,
or is proper preallocation sufficient?
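For reference, a minimal sketch of the preallocation option (not the code from this thread): it creates an MPIAIJ matrix and preallocates it so MatSetValues() never has to malloc. The local row count and the 81-nonzeros-per-row upper bound are taken from the -info output below; real d_nnz/o_nnz counts would come from the element connectivity, and PetscCall() assumes a recent PETSc (older versions would use ierr = ...; CHKERRQ(ierr); instead).

    #include <petscmat.h>

    int main(int argc, char **argv)
    {
      Mat       A;
      PetscInt  nlocal = 13892;   /* local rows on this rank (rank 0's value in the -info output below) */
      PetscInt *d_nnz, *o_nnz;    /* per-row counts for the diagonal / off-diagonal blocks */

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      PetscCall(PetscMalloc2(nlocal, &d_nnz, nlocal, &o_nnz));
      /* Crude upper bound: at most 81 nonzeros per row (from the -info output).
         Real counts should be computed from the element connectivity. */
      for (PetscInt i = 0; i < nlocal; i++) { d_nnz[i] = 81; o_nnz[i] = 81; }

      PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
      PetscCall(MatSetSizes(A, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE));
      PetscCall(MatSetType(A, MATMPIAIJ));
      PetscCall(MatMPIAIJSetPreallocation(A, 0, d_nnz, 0, o_nnz));
      /* ... loop over local elements, MatSetValues(..., ADD_VALUES),
         then MatAssemblyBegin/MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY) ... */
      PetscCall(PetscFree2(d_nnz, o_nnz));
      PetscCall(MatDestroy(&A));
      PetscCall(PetscFinalize());
      return 0;
    }

Preallocation alone would remove the "Number of mallocs during MatSetValues() is 73242" cost reported below, but not the roughly 460,000 stashed entries per rank; those only disappear when each rank computes the elements whose matrix rows it owns (a sketch of that follows the quoted thread at the end of this message). The -info output from the two-rank run is below.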
[0] <sys> PetscCommDuplicate(): Duplicating a communicator 139687279637472
94370404729840 max tags = 2147483647
[1] <sys> PetscCommDuplicate(): Duplicating a communicator 139620736898016
94891084133376 max tags = 2147483647
[0] <mat> MatSetUp(): Warning not preallocating matrix storage
[1] <sys> PetscCommDuplicate(): Duplicating a communicator 139620736897504
94891083133744 max tags = 2147483647
[0] <sys> PetscCommDuplicate(): Duplicating a communicator 139687279636960
94370403730224 max tags = 2147483647
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736898016 94891084133376
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279637472 94370404729840
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736898016 94891084133376
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279637472 94370404729840
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736898016 94891084133376
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279637472 94370404729840
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279637472 94370404729840
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736898016 94891084133376
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736898016 94891084133376
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279637472 94370404729840
TIME0 : 0.000000
TIME0 : 0.000000
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 8 mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 5
mallocs.
[1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 5
mallocs.
[0] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13892 X 13892; storage
space: 180684 unneeded,987406 used
[0] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues()
is 73242
[0] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
[0] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
0)/(num_localrows 13892) < 0.6. Do not use CompressedRow routines.
[0] <mat> MatSeqAIJCheckInode(): Found 4631 nodes of 13892. Limit used: 5.
Using Inode routines
[1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 13891; storage
space: 180715 unneeded,987325 used
[1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues()
is 73239
[1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
[1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
0)/(num_localrows 13891) < 0.6. Do not use CompressedRow routines.
[1] <mat> MatSeqAIJCheckInode(): Found 4631 nodes of 13891. Limit used: 5.
Using Inode routines
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13892 X 1390; storage
space: 72491 unneeded,34049 used
[0] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues()
is 2472
[0] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 40
[0] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
12501)/(num_localrows 13892) > 0.6. Use CompressedRow routines.
Assemble Time : 174.079366sec
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 1391; storage
space: 72441 unneeded,34049 used
[1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues()
is 2469
[1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 41
[1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
12501)/(num_localrows 13891) > 0.6. Use CompressedRow routines.
Assemble Time : 174.141234sec
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 8
mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 13891; storage
space: 0 unneeded,987325 used
[1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues()
is 0
[1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
[1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
0)/(num_localrows 13891) < 0.6. Do not use CompressedRow routines.
[0] <pc> PCSetUp(): Setting up PC for first time
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator
is unchanged
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator
is unchanged
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator
is unchanged
Solving Time : 5.085394sec
[0] <ksp> KSPConvergedDefault(): Linear solver has converged. Residual norm
1.258030470407e-17 is less than relative tolerance 1.000000000000e-05 times
initial right hand side norm 2.579617304779e-03 at iteration 1
Solving Time : 5.089733sec
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 0
mallocs.
[1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 0
mallocs.
Assemble Time : 5.242508sec
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
Assemble Time : 5.240863sec
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 0
mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
TIME : 1.000000, TIME_STEP : 1.000000, ITER : 2, RESIDUAL
: 2.761615e-03
TIME : 1.000000, TIME_STEP : 1.000000, ITER : 2, RESIDUAL
: 2.761615e-03
[0] <pc> PCSetUp(): Setting up PC with same nonzero pattern
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator
is unchanged
[0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator
is unchanged
[0] <ksp> KSPConvergedDefault(): Linear solver has converged. Residual norm
1.539725065974e-19 is less than relative tolerance 1.000000000000e-05 times
initial right hand side norm 8.015104666105e-06 at iteration 1
Solving Time : 4.662785sec
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
Solving Time : 4.664515sec
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 0
mallocs.
[0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 0
mallocs.
Assemble Time : 5.238257sec
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
Assemble Time : 5.236535sec
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
TIME : 1.000000, TIME_STEP : 1.000000, ITER : 3, RESIDUAL
: 3.705062e-08
TIME0 : 1.000000
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 0
mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
TIME : 1.000000, TIME_STEP : 1.000000, ITER : 3, RESIDUAL
: 3.705062e-08
TIME0 : 1.000000
[1] <sys> PetscFinalize(): PetscFinalize() called
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[0] <sys> PetscFinalize(): PetscFinalize() called
On Tue, Dec 13, 2022 at 12:50 AM, Barry Smith <bsmith at petsc.dev> wrote:
>
> The problem is possibly due to most elements being computed on the "wrong"
> MPI rank, thus requiring almost all of the matrix entries to be "stashed"
> when computed and then sent off to the owning MPI rank. Please send ALL
> the output of a parallel run with -info so we can see how much
> communication is done in the matrix assembly.
>
> Barry
>
>
> > On Dec 12, 2022, at 6:16 AM, 김성익 <ksi2443 at gmail.com> wrote:
> >
> > Hello,
> >
> >
> > I need some keywords or examples for parallelizing the matrix assembly
> > process.
> >
> > My current state is as follows.
> > - Finite element analysis code for structural mechanics.
> > - Problem size: 3D solid hexahedral elements (number of elements: 125,000),
> > number of degrees of freedom: 397,953
> > - Matrix type: seqaij; preallocation set with
> > MatSeqAIJSetPreallocation
> > - Matrix assembly time using 1 core: 120 sec
> > for (int i=0; i<125000; i++) {
> > ~~ element matrix calculation}
> > matassemblybegin
> > matassemblyend
> > - Matrix assembly time using 8 cores: 70,234 sec
> > int start, end;
> > VecGetOwnershipRange( element_vec, &start, &end);
> > for (int i=start; i<end; i++){
> > ~~ element matrix calculation
> > matassemblybegin
> > matassemblyend
> >
> >
> > As you can see, the parallel case takes far longer than the
> > sequential case.
> > How can I speed this up?
> > Can I get some keywords or examples for parallelizing the assembly of the
> > matrix in finite element analysis?
> >
> > Thanks,
> > Hyung Kim
> >
>
>
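As Barry notes above, the 460,416/461,184 stashed entries per rank arise because splitting the element loop by VecGetOwnershipRange() of a vector does not guarantee that an element's matrix rows are owned by the rank that computes it. Below is a minimal sketch of the owner-computes alternative; it assumes hypothetical application routines element_dofs() and element_stiffness() and an element range elem_start..elem_end chosen (e.g. by a mesh partitioner or a DM) so that each rank's elements touch mostly locally owned rows.

    #include <petscmat.h>

    /* Hypothetical application routines, not PETSc calls. */
    extern void element_dofs(PetscInt e, PetscInt idx[24]);
    extern void element_stiffness(PetscInt e, PetscScalar Ke[24 * 24]);

    /* Each rank assembles only its own elements; with a partition that matches
       the matrix row distribution, MatSetValues() rarely needs to stash entries
       for other ranks, so MatAssemblyBegin/End communicate almost nothing. */
    static PetscErrorCode AssembleLocalElements(Mat A, PetscInt elem_start, PetscInt elem_end)
    {
      PetscFunctionBeginUser;
      for (PetscInt e = elem_start; e < elem_end; e++) {
        PetscInt    idx[24];      /* 8-node hexahedron, 3 DOFs per node */
        PetscScalar Ke[24 * 24];  /* dense 24x24 element stiffness      */
        element_dofs(e, idx);     /* global DOF (row/column) indices    */
        element_stiffness(e, Ke);
        PetscCall(MatSetValues(A, 24, idx, 24, idx, Ke, ADD_VALUES));
      }
      PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
      PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
      PetscFunctionReturn(0);
    }

With such a partition, the "MatAssemblyBegin_MPIAIJ(): Stash has ... entries" lines in -info should drop to (nearly) zero, and the 174-second assembly time should shrink accordingly. How the element-to-rank assignment is produced (DMPlex, an external partitioner such as ParMETIS, or a hand-made decomposition) is beyond this sketch.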