[petsc-users] parallelize matrix assembly process
김성익
ksi2443 at gmail.com
Mon Dec 12 21:20:57 CST 2022
Following your comments, I checked by running with '-info'.
As you suspected, most elements are being computed on the wrong MPI rank,
and there are a lot of stashed entries.
Should I partition the domain at the problem-definition stage,
or is proper preallocation sufficient?
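For reference, a minimal sketch of the preallocation option (not the code from this thread): it creates an MPIAIJ matrix and preallocates it so MatSetValues() never has to malloc. The local row count and the 81-nonzeros-per-row upper bound are taken from the -info output below; real d_nnz/o_nnz counts would come from the element connectivity, and PetscCall() assumes a recent PETSc (older versions would use ierr = ...; CHKERRQ(ierr); instead).

    #include <petscmat.h>

    int main(int argc, char **argv)
    {
      Mat       A;
      PetscInt  nlocal = 13892;   /* local rows on this rank (rank 0's value in the -info output below) */
      PetscInt *d_nnz, *o_nnz;    /* per-row counts for the diagonal / off-diagonal blocks */

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      PetscCall(PetscMalloc2(nlocal, &d_nnz, nlocal, &o_nnz));
      /* Crude upper bound: at most 81 nonzeros per row (from the -info output).
         Real counts should be computed from the element connectivity. */
      for (PetscInt i = 0; i < nlocal; i++) { d_nnz[i] = 81; o_nnz[i] = 81; }

      PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
      PetscCall(MatSetSizes(A, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE));
      PetscCall(MatSetType(A, MATMPIAIJ));
      PetscCall(MatMPIAIJSetPreallocation(A, 0, d_nnz, 0, o_nnz));
      /* ... loop over local elements, MatSetValues(..., ADD_VALUES),
         then MatAssemblyBegin/MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY) ... */
      PetscCall(PetscFree2(d_nnz, o_nnz));
      PetscCall(MatDestroy(&A));
      PetscCall(PetscFinalize());
      return 0;
    }

Preallocation alone would remove the "Number of mallocs during MatSetValues() is 73242" cost reported below, but not the roughly 460,000 stashed entries per rank; those only disappear when each rank computes the elements whose matrix rows it owns (a sketch of that follows the quoted thread at the end of this message). The -info output from the two-rank run is below.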
[0] <sys> PetscCommDuplicate(): Duplicating a communicator 139687279637472
94370404729840 max tags = 2147483647
[1] <sys> PetscCommDuplicate(): Duplicating a communicator 139620736898016
94891084133376 max tags = 2147483647
[0] <mat> MatSetUp(): Warning not preallocating matrix storage
[1] <sys> PetscCommDuplicate(): Duplicating a communicator 139620736897504
94891083133744 max tags = 2147483647
[0] <sys> PetscCommDuplicate(): Duplicating a communicator 139687279636960
94370403730224 max tags = 2147483647
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736898016 94891084133376
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279637472 94370404729840
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736898016 94891084133376
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279637472 94370404729840
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736898016 94891084133376
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279637472 94370404729840
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279637472 94370404729840
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736898016 94891084133376
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736898016 94891084133376
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279637472 94370404729840
TIME0 : 0.000000
TIME0 : 0.000000
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 8 mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 5
mallocs.
[1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 5
mallocs.
[0] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13892 X 13892; storage
space: 180684 unneeded,987406 used
[0] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues()
is 73242
[0] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
[0] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
0)/(num_localrows 13892) < 0.6. Do not use CompressedRow routines.
[0] <mat> MatSeqAIJCheckInode(): Found 4631 nodes of 13892. Limit used: 5.
Using Inode routines
[1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 13891; storage
space: 180715 unneeded,987325 used
[1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues()
is 73239
[1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
[1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
0)/(num_localrows 13891) < 0.6. Do not use CompressedRow routines.
[1] <mat> MatSeqAIJCheckInode(): Found 4631 nodes of 13891. Limit used: 5.
Using Inode routines
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13892 X 1390; storage
space: 72491 unneeded,34049 used
[0] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues()
is 2472
[0] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 40
[0] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
12501)/(num_localrows 13892) > 0.6. Use CompressedRow routines.
Assemble Time : 174.079366sec
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 1391; storage
space: 72441 unneeded,34049 used
[1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues()
is 2469
[1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 41
[1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
12501)/(num_localrows 13891) > 0.6. Use CompressedRow routines.
Assemble Time : 174.141234sec
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 8
mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 13891; storage
space: 0 unneeded,987325 used
[1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues()
is 0
[1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
[1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
0)/(num_localrows 13891) < 0.6. Do not use CompressedRow routines.
[0] <pc> PCSetUp(): Setting up PC for first time
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator
is unchanged
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator
is unchanged
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator
is unchanged
Solving Time : 5.085394sec
[0] <ksp> KSPConvergedDefault(): Linear solver has converged. Residual norm
1.258030470407e-17 is less than relative tolerance 1.000000000000e-05 times
initial right hand side norm 2.579617304779e-03 at iteration 1
Solving Time : 5.089733sec
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 0
mallocs.
[1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 0
mallocs.
Assemble Time : 5.242508sec
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
Assemble Time : 5.240863sec
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 0
mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
TIME : 1.000000, TIME_STEP : 1.000000, ITER : 2, RESIDUAL
: 2.761615e-03
TIME : 1.000000, TIME_STEP : 1.000000, ITER : 2, RESIDUAL
: 2.761615e-03
[0] <pc> PCSetUp(): Setting up PC with same nonzero pattern
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator
is unchanged
[0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator
is unchanged
[0] <ksp> KSPConvergedDefault(): Linear solver has converged. Residual norm
1.539725065974e-19 is less than relative tolerance 1.000000000000e-05 times
initial right hand side norm 8.015104666105e-06 at iteration 1
Solving Time : 4.662785sec
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
Solving Time : 4.664515sec
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 0
mallocs.
[0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 0
mallocs.
Assemble Time : 5.238257sec
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
[1] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139620736897504 94891083133744
Assemble Time : 5.236535sec
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
[0] <sys> PetscCommDuplicate(): Using internal PETSc communicator
139687279636960 94370403730224
TIME : 1.000000, TIME_STEP : 1.000000, ITER : 3, RESIDUAL
: 3.705062e-08
TIME0 : 1.000000
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 0
mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
TIME : 1.000000, TIME_STEP : 1.000000, ITER : 3, RESIDUAL
: 3.705062e-08
TIME0 : 1.000000
[1] <sys> PetscFinalize(): PetscFinalize() called
[0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
[0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0
mallocs.
[0] <sys> PetscFinalize(): PetscFinalize() called
On Tue, Dec 13, 2022 at 12:50 AM, Barry Smith <bsmith at petsc.dev> wrote:
>
> The problem is possibly due to most elements being computed on the "wrong"
> MPI rank, thus requiring almost all of the matrix entries to be "stashed"
> when computed and then sent off to the owning MPI rank. Please send ALL
> the output of a parallel run with -info so we can see how much
> communication is done in the matrix assembly.
>
> Barry
>
>
> > On Dec 12, 2022, at 6:16 AM, 김성익 <ksi2443 at gmail.com> wrote:
> >
> > Hello,
> >
> >
> > I need some keywords or examples for parallelizing the matrix assembly
> > process.
> >
> > My current state is as follows.
> > - Finite element analysis code for structural mechanics.
> > - Problem size: 3D solid hexahedral elements (number of elements: 125,000),
> > number of degrees of freedom: 397,953
> > - Matrix type: seqaij; preallocation set with
> > MatSeqAIJSetPreallocation
> > - Matrix assembly time using 1 core: 120 sec
> > for (int i=0; i<125000; i++) {
> > ~~ element matrix calculation}
> > matassemblybegin
> > matassemblyend
> > - Matrix assembly time using 8 cores: 70,234 sec
> > int start, end;
> > VecGetOwnershipRange( element_vec, &start, &end);
> > for (int i=start; i<end; i++){
> > ~~ element matrix calculation
> > matassemblybegin
> > matassemblyend
> >
> >
> > As you can see, the parallel case takes far longer than the
> > sequential case.
> > How can I speed this up?
> > Can I get some keywords or examples for parallelizing the assembly of the
> > matrix in finite element analysis?
> >
> > Thanks,
> > Hyung Kim
> >
>
>
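As Barry notes above, the 460,416/461,184 stashed entries per rank arise because splitting the element loop by VecGetOwnershipRange() of a vector does not guarantee that an element's matrix rows are owned by the rank that computes it. Below is a minimal sketch of the owner-computes alternative; it assumes hypothetical application routines element_dofs() and element_stiffness() and an element range elem_start..elem_end chosen (e.g. by a mesh partitioner or a DM) so that each rank's elements touch mostly locally owned rows.

    #include <petscmat.h>

    /* Hypothetical application routines, not PETSc calls. */
    extern void element_dofs(PetscInt e, PetscInt idx[24]);
    extern void element_stiffness(PetscInt e, PetscScalar Ke[24 * 24]);

    /* Each rank assembles only its own elements; with a partition that matches
       the matrix row distribution, MatSetValues() rarely needs to stash entries
       for other ranks, so MatAssemblyBegin/End communicate almost nothing. */
    static PetscErrorCode AssembleLocalElements(Mat A, PetscInt elem_start, PetscInt elem_end)
    {
      PetscFunctionBeginUser;
      for (PetscInt e = elem_start; e < elem_end; e++) {
        PetscInt    idx[24];      /* 8-node hexahedron, 3 DOFs per node */
        PetscScalar Ke[24 * 24];  /* dense 24x24 element stiffness      */
        element_dofs(e, idx);     /* global DOF (row/column) indices    */
        element_stiffness(e, Ke);
        PetscCall(MatSetValues(A, 24, idx, 24, idx, Ke, ADD_VALUES));
      }
      PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
      PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
      PetscFunctionReturn(0);
    }

With such a partition, the "MatAssemblyBegin_MPIAIJ(): Stash has ... entries" lines in -info should drop to (nearly) zero, and the 174-second assembly time should shrink accordingly. How the element-to-rank assignment is produced (DMPlex, an external partitioner such as ParMETIS, or a hand-made decomposition) is beyond this sketch.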