[petsc-users] large PetscCommDuplicate overhead

Matthew Overholt overholt at capesim.com
Thu Oct 6 10:45:17 CDT 2016


Matthew and Barry,

1) I did a direct measurement of PetscCommDuplicate() time by tracing just
that call (using CrayPat), and it confirmed the sampling results.  For 8
processes (n=8), tracing counted a total of 101 calls, which took essentially
no time on the root process but 11.78 seconds (6.3% of the 188-second total)
on each of the other 7 processes.  For 16 processes (n=16, still on a single
node), tracing counted 102 calls totaling 18.42 seconds (13.2% of the
139.6-second total) on every process except the root.

2) Copied below is a section of the log view for the first two solutions for
n=2, which shows the same calls as for n=8. (I can send the entire log files
if desired.)  In each case I count about 44 PetscCommDuplicate() calls per
process during initialization and meshing, 7 calls during setup, 9 calls for
the first solution, then 3 calls for each subsequent solution (fixed-point
iteration), and 3 calls to write out the solution, for 75 in total.

3) I would expect that the administrators of this machine have configured
PETSc appropriately. I am using their current default install, which is
3.7.2.
https://www.nersc.gov/users/software/programming-libraries/math-libraries/petsc/

4) Yes, I just gave the MUMPS time as a comparison.

5) As to where it is spending time, perhaps the timing results in the log
files will be helpful. The "Solution took ..." printouts give the total
solution time for that iteration; the others are incremental times.  (As an
aside, I have been wondering why the solution times do not scale well with
process count, even though that work is done entirely in parallel PETSc
routines.)

Thanks,
Matt Overholt


********** -log_view -info results for n=2 : the first solution and
subsequent fixed-point iteration ***********
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
-2080374779
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
-2080374779
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
-2080374781
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
Matrix setup took 0.108 s
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
-2080374779
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
-2080374781
KSP PC setup took 0.079 s
[0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
[0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
[0] MatStashScatterBegin_Ref(): No of messages: 0 
[0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
[1] MatStashScatterBegin_Ref(): No of messages: 1 
[1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
[1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 5 mallocs.
[1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space:
1050106 unneeded,15128672 used
[1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
[1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows
599214) < 0.6. Do not use CompressedRow routines.
[1] MatSeqAIJCheckInode(): Found 599214 nodes out of 599214 rows. Not using
Inode routines
[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space:
1237634 unneeded,15545404 used
[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
[0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows
621594) < 0.6. Do not use CompressedRow routines.
[0] MatSeqAIJCheckInode(): Found 621594 nodes out of 621594 rows. Not using
Inode routines
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
[0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
[0] VecScatterCreate(): General case: MPI to Seq
[1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space:
5257543 unneeded,136718 used
[1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 89
[1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
[1] MatCheckCompressedRow(): Found the ratio (num_zerorows
582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space:
5464978 unneeded,136718 used
[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 490
[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
[0] MatCheckCompressedRow(): Found the ratio (num_zerorows
605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
K and q SetValues took 26.426 s
[0] PCSetUp(): Setting up PC for first time
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[0] VecScatterCreate(): Special case: processor zero gets entire parallel
vector, rest get none
 ** Max-trans not allowed because matrix is distributed
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] PCSetUp(): Leaving PC with identical preconditioner since operator is
unchanged
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
[0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
[0] VecScatterCreate(): General case: Seq to MPI
[1] VecScatterCreate(): General case: Seq to MPI
Solution took 102.21 s

NL iteration 0: delta = 32.0488 67.6279.
Error delta calc took 0.045 s
Node and Element temps update took 0.017 s
[0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
[0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
[0] MatStashScatterBegin_Ref(): No of messages: 0 
[0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
[1] MatStashScatterBegin_Ref(): No of messages: 1 
[1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
[1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 0 mallocs.
[1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space: 0
unneeded,15128672 used
[1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
[1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows
599214) < 0.6. Do not use CompressedRow routines.
[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space: 0
unneeded,15545404 used
[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
[0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows
621594) < 0.6. Do not use CompressedRow routines.
[1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space: 0
unneeded,136718 used
[1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
[1] MatCheckCompressedRow(): Found the ratio (num_zerorows
582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space: 0
unneeded,136718 used
[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
[0] MatCheckCompressedRow(): Found the ratio (num_zerorows
605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
K and q SetValues took 2.366 s
[0] PCSetUp(): Setting up PC with same nonzero pattern
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
-2080374782
[0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
[0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
[0] VecScatterCreate(): General case: Seq to MPI
[1] VecScatterCreate(): General case: Seq to MPI
Solution took 82.156 s

-----Original Message-----
From: Barry Smith [mailto:bsmith at mcs.anl.gov] 
Sent: Wednesday, October 05, 2016 4:42 PM
To: overholt at capesim.com
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] large PetscCommDuplicate overhead


> On Oct 5, 2016, at 2:30 PM, Matthew Overholt <overholt at capesim.com> wrote:
> 
> Hi Petsc-Users,
>  
> I am trying to understand an issue where PetscCommDuplicate() calls are
> taking an increasing percentage of time as I run a fixed-sized problem on
> more processes.
>  
> I am using the FEM to solve the steady-state heat transfer equation (K.x = q)
> with a preconditioner-only direct solver such as MUMPS.
>  
> I am running on the NERSC Cray XC30, which has two Xeons per node with 12
> cores each, and I am profiling the code using CrayPat sampling.
>  
> On a typical problem (1E+6 finite elements), running on a single node:
> - for 2 cores (1 on each Xeon), about 1% of time is PetscCommDuplicate (on
>   process 1, but on the root it is less), and (for reference) 9% of total
>   time is for MUMPS.
> - for 8 cores (4 on each Xeon), over 6% of time is PetscCommDuplicate (on
>   every process except the root, where it is <1%), and 9-10% of total time
>   is for MUMPS.

   What does PetscCommDuplicate() have to do with MUMPS? Nothing at all, you
are just giving its time for comparison?

>  
> What is the large PetscCommDuplicate time connected to, an increasing
> number of messages (tags)?  Would using fewer MatSetValues() and
> VecSetValues() calls (with longer message lengths) alleviate this?

   No, PetscCommDuplicate() won't increase with more messages or calls to
XXXSetValues(). PetscCommDuplicate() is essentially only called on the
creation of new PETSc objects. It should also be fast, since it basically
needs to do just an MPI_Attr_get(). With more processes but the same problem
size and code, there should be pretty much the same number of objects
created.
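
   For illustration, a minimal sketch of that attribute-caching pattern might
look like the following (this is not PETSc's actual implementation; the
dup_keyval and get_inner_comm names are made up). After the first call on a
given communicator the work reduces to a single local lookup with
MPI_Comm_get_attr() (the newer name for MPI_Attr_get()); the collective
MPI_Comm_dup() happens only once.

    /* Sketch of the comm-attribute caching idea; names are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>

    static int dup_keyval = MPI_KEYVAL_INVALID;

    /* Return the cached duplicate of comm, creating it only on first use. */
    static MPI_Comm get_inner_comm(MPI_Comm comm)
    {
      MPI_Comm *inner;
      int       found;

      if (dup_keyval == MPI_KEYVAL_INVALID)
        MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                               &dup_keyval, NULL);

      /* Common path: one local attribute lookup, no communication. */
      MPI_Comm_get_attr(comm, dup_keyval, &inner, &found);
      if (found) return *inner;

      /* First object on this comm: duplicate once and cache the result. */
      inner = (MPI_Comm *)malloc(sizeof(MPI_Comm));
      MPI_Comm_dup(comm, inner);
      MPI_Comm_set_attr(comm, dup_keyval, inner);
      return *inner;
    }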

   PetscSpinlockLock() does nothing if you are not using threads, so it won't
take any time.

   Is there a way to see where it is spending its time inside
PetscCommDuplicate()?  Perhaps the Cray MPI_Attr_get() has issues.
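
   One way to check whether the attribute lookup itself is slow on this MPI
is a small standalone timing loop along the following lines (a rough sketch
only; the loop count and output format are arbitrary):

    /* Standalone check of MPI attribute-lookup cost on MPI_COMM_WORLD. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int    keyval, rank, flag, i, n = 1000000;
      int    value = 42, *attr;
      double t0, t1;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Cache one attribute, then time repeated lookups of it. */
      MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                             &keyval, NULL);
      MPI_Comm_set_attr(MPI_COMM_WORLD, keyval, &value);

      MPI_Barrier(MPI_COMM_WORLD);
      t0 = MPI_Wtime();
      for (i = 0; i < n; i++)
        MPI_Comm_get_attr(MPI_COMM_WORLD, keyval, &attr, &flag);
      t1 = MPI_Wtime();

      printf("[%d] %d attribute lookups took %g s (%g us each)\n",
             rank, n, t1 - t0, 1.0e6 * (t1 - t0) / n);

      MPI_Finalize();
      return 0;
    }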

   Barry

>  
> For reference, the PETSc calling sequence in the code is as follows.
>     // Create the solution and RHS vectors
>     ierr = VecCreate(petscData->mpicomm,&mesh->hpx);
>    ierr = PetscObjectSetName((PetscObject) mesh->hpx, "Solution");
>     ierr = VecSetSizes(mesh->hpx,mesh->lxN,mesh->neqns); // size = # of equations; distribution to match mesh
>     ierr = VecSetFromOptions(mesh->hpx);    // allow run time options
>     ierr = VecDuplicate(mesh->hpx,&q);      // create the RHS vector
>     // Create the stiffness matrix
>     ierr = MatCreate(petscData->mpicomm,&K);
>     ierr = MatSetSizes(K,mesh->lxN,mesh->lxN,mesh->neqns,mesh->neqns);
>     ierr = MatSetType(K,MATAIJ);  // default sparse type
>     // Do preallocation
>     ierr = MatMPIAIJSetPreallocation(K,d_nz,NULL,o_nz,NULL);
>     ierr = MatSeqAIJSetPreallocation(K,d_nz,NULL);
>     ierr = MatSetOption(K,MAT_NEW_NONZERO_ALLOCATION_ERR,PETSC_FALSE);
>     ierr = MatSetUp(K);
>     // Create and set up the KSP context as a PreConditioner Only (Direct) Solution
>     ierr = KSPCreate(petscData->mpicomm,&ksp);
>     ierr = KSPSetOperators(ksp,K,K);
>     ierr = KSPSetType(ksp,KSPPREONLY);
>     // Set the temperature vector
>     ierr = VecSet(mesh->hpx,mesh->Tmin);
>     // Set the default PC method as MUMPS
>     ierr = KSPGetPC(ksp,&pc);     // extract the preconditioner
>     ierr = PCSetType(pc,PCLU);    // set pc options
>     ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);
>     ierr = KSPSetFromOptions(ksp);
>     
>     // Set the values for the K matrix and q vector
>         // which involves a lot of these calls
>         ierr = MatSetValues(K,mrows,idxm,ncols,idxn,pKe,ADD_VALUES);    // 1 call per matrix row (equation)
>         ierr = VecSetValues(q,nqe,ixn,pqe,ADD_VALUES);      // 1 call per element
>     ierr = VecAssemblyBegin(q);
>     ierr = MatAssemblyBegin(K,MAT_FINAL_ASSEMBLY);
>     ierr = VecAssemblyEnd(q);
>     ierr = MatAssemblyEnd(K,MAT_FINAL_ASSEMBLY);
>  
>     // Solve //////////////////////////////////////
>     ierr = KSPSolve(ksp,q,mesh->hpx);
> ...
> *Note that the code evenly divides the finite elements over the total
> number of processors, and I am using ghosting of the FE vertices vector to
> handle the vertices that are needed on more than 1 process.
>  
> Thanks in advance for your help,
> Matt Overholt
> CapeSym, Inc.
> 




