[petsc-users] large PetscCommDuplicate overhead

Barry Smith bsmith at mcs.anl.gov
Sat Oct 8 16:18:07 CDT 2016


  What exact machine are you running on? Please run "module list" so we can see exactly which modules you are using, and tell us exactly what options you are passing to pat_build.

  Barry



> On Oct 6, 2016, at 10:45 AM, Matthew Overholt <overholt at capesim.com> wrote:
> 
> Matthew and Barry,
> 
> 1) I did a direct measurement of PetscCommDuplicate() time by tracing just
> that call (using CrayPat), and confirmed the sampling results.  For 8
> processes (n=8), tracing counted a total of 101 calls, taking ~0 time on the
> root process but taking 11.78 seconds (6.3% of 188 total seconds) on each of
> the other 7 processes.  For 16 processes (n=16, still only 1 node), tracing
> counted 102 total calls for a total of 18.42 seconds (13.2% of 139.6 total
> seconds) on every process except the root.
> 
> 2) Copied below is a section of the log view for the first two solutions for
> n=2, which shows the same calls as for n=8. (I can send the entire log files
> if desired.)  In each case I count about 44 PetscCommDuplicate() calls per process during
> initialization and meshing, 7 calls during setup, 9 calls for the first
> solution, then 3 calls for each subsequent solution (fixed-point iteration),
> and 3 calls to write out the solution, for 75 total.
> 
> 3) I would expect that the administrators of this machine have configured
> PETSc appropriately. I am using their current default install, which is
> 3.7.2.
> https://www.nersc.gov/users/software/programming-libraries/math-libraries/petsc/
> 
> 4) Yes, I just gave the MUMPS time as a comparison.
> 
> 5) As to where it is spending time, perhaps the timing results in the log
> files will be helpful. The "Solution took ..." printouts give the total
> solution time for that iteration; the others are incremental times.  (As an
> aside, I have been wondering why the solution times do not scale well with
> process count, even though that work is entirely done in parallel PETSc
> routines.)
> 
> Thanks,
> Matt Overholt
> 
> 
> ********** -log_view -info results for n=2 : the first solution and
> subsequent fixed-point iteration ***********
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
> -2080374779
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
> -2080374779
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
> -2080374781
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> Matrix setup took 0.108 s
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
> -2080374779
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
> -2080374781
> KSP PC setup took 0.079 s
> [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
> [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
> [0] MatStashScatterBegin_Ref(): No of messages: 0 
> [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> [1] MatStashScatterBegin_Ref(): No of messages: 1 
> [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
> [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 5 mallocs.
> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space:
> 1050106 unneeded,15128672 used
> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows
> 599214) < 0.6. Do not use CompressedRow routines.
> [1] MatSeqAIJCheckInode(): Found 599214 nodes out of 599214 rows. Not using
> Inode routines
> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space:
> 1237634 unneeded,15545404 used
> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows
> 621594) < 0.6. Do not use CompressedRow routines.
> [0] MatSeqAIJCheckInode(): Found 621594 nodes out of 621594 rows. Not using
> Inode routines
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
> [0] VecScatterCreate(): General case: MPI to Seq
> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space:
> 5257543 unneeded,136718 used
> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 89
> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows
> 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space:
> 5464978 unneeded,136718 used
> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 490
> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows
> 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
> K and q SetValues took 26.426 s
> [0] PCSetUp(): Setting up PC for first time
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [0] VecScatterCreate(): Special case: processor zero gets entire parallel
> vector, rest get none
> ** Max-trans not allowed because matrix is distributed
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] PCSetUp(): Leaving PC with identical preconditioner since operator is
> unchanged
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
> [0] VecScatterCreate(): General case: Seq to MPI
> [1] VecScatterCreate(): General case: Seq to MPI
> Solution took 102.21 s
> 
> NL iteration 0: delta = 32.0488 67.6279.
> Error delta calc took 0.045 s
> Node and Element temps update took 0.017 s
> [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
> [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
> [0] MatStashScatterBegin_Ref(): No of messages: 0 
> [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> [1] MatStashScatterBegin_Ref(): No of messages: 1 
> [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
> [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 0 mallocs.
> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space: 0
> unneeded,15128672 used
> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows
> 599214) < 0.6. Do not use CompressedRow routines.
> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space: 0
> unneeded,15545404 used
> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows
> 621594) < 0.6. Do not use CompressedRow routines.
> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space: 0
> unneeded,136718 used
> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows
> 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space: 0
> unneeded,136718 used
> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows
> 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
> K and q SetValues took 2.366 s
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> -2080374782
> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
> [0] VecScatterCreate(): General case: Seq to MPI
> [1] VecScatterCreate(): General case: Seq to MPI
> Solution took 82.156 s
> 
> -----Original Message-----
> From: Barry Smith [mailto:bsmith at mcs.anl.gov] 
> Sent: Wednesday, October 05, 2016 4:42 PM
> To: overholt at capesim.com
> Cc: petsc-users at mcs.anl.gov
> Subject: Re: [petsc-users] large PetscCommDuplicate overhead
> 
> 
>> On Oct 5, 2016, at 2:30 PM, Matthew Overholt <overholt at capesim.com> wrote:
>> 
>> Hi Petsc-Users,
>> 
>> I am trying to understand an issue where PetscCommDuplicate() calls are
>> taking an increasing percentage of time as I run a fixed-size problem on
>> more processes.
>> 
>> I am using the FEM to solve the steady-state heat transfer equation (K.x =
>> q) using a preconditioner-only direct solver (PCLU with MUMPS).
>> 
>> I am running on the NERSC Cray XC30, which has two Xeons per node with 12
>> cores each, and profiling the code using CrayPat sampling.
>> 
>> On a typical problem (1E+6 finite elements), running on a single node:
>> - for 2 cores (1 on each Xeon), about 1% of time is PetscCommDuplicate (on
>> process 1, but on the root it is less), and (for reference) 9% of total time
>> is for MUMPS.
>> - for 8 cores (4 on each Xeon), over 6% of time is PetscCommDuplicate (on
>> every process except the root, where it is <1%), and 9-10% of total time is
>> for MUMPS.
> 
>   What does PetscCommDuplicate() have to do with MUMPS? Nothing at all, you
> are just giving its time for comparison?
> 
>> 
>> What is the large PetscCommDuplicate time connected to, an increasing
>> number of messages (tags)?  Would using fewer MatSetValues() and
>> VecSetValues() calls (with longer message lengths) alleviate this?
> 
>   No, PetscCommDuplicate() won't increase with more messages or calls to
> XXXSetValues(). PetscCommDuplicate() is essentially only called on the
> creation of new PETSc objects.  It should also be fast since it basically
> needs to do just an MPI_Attr_get(). With more processes but the same problem
> size and code there should be pretty much the same number of objects
> created.
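> 
>   Roughly speaking (a simplified sketch, not PETSc's actual source), the idea
> is that the duplicated inner communicator is cached as an MPI attribute on the
> user's communicator, so only the first PETSc object created on a communicator
> pays for an MPI_Comm_dup(); later creations just do the attribute lookup:
> 
>   #include <mpi.h>
>   #include <stdio.h>
>   #include <stdlib.h>
> 
>   static int inner_keyval = MPI_KEYVAL_INVALID;
> 
>   /* Return the cached "inner" communicator for a user communicator,
>      duplicating it only on the first call. */
>   static void get_inner_comm(MPI_Comm user,MPI_Comm *inner)
>   {
>     MPI_Comm *cached;
>     int       flag;
>     if (inner_keyval == MPI_KEYVAL_INVALID)
>       MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN,MPI_COMM_NULL_DELETE_FN,&inner_keyval,NULL);
>     MPI_Comm_get_attr(user,inner_keyval,&cached,&flag);  /* fast path: attribute lookup only */
>     if (!flag) {                                         /* slow path: first use of this comm */
>       cached = (MPI_Comm*)malloc(sizeof(MPI_Comm));
>       MPI_Comm_dup(user,cached);
>       MPI_Comm_set_attr(user,inner_keyval,cached);
>     }
>     *inner = *cached;
>   }
> 
>   int main(int argc,char **argv)
>   {
>     MPI_Comm inner;
>     MPI_Init(&argc,&argv);
>     get_inner_comm(MPI_COMM_WORLD,&inner);   /* duplicates the communicator */
>     get_inner_comm(MPI_COMM_WORLD,&inner);   /* reuses the cached duplicate */
>     printf("inner communicator reused\n");
>     MPI_Finalize();
>     return 0;
>   }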
> 
>   PetscSpinlockLock() does nothing if you are not using threads so it won't
> take any time.
> 
>   Is there a way to see where it is spending its time inside
> PetscCommDuplicate()?  Perhaps the Cray MPI_Attr_get() has issues.
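> 
>   One quick stand-alone check (a hypothetical micro-benchmark, not taken from
> PETSc or CrayPat) would be to time the raw attribute lookup against the Cray
> MPI and see whether it is unexpectedly slow:
> 
>   #include <mpi.h>
>   #include <stdio.h>
> 
>   int main(int argc,char **argv)
>   {
>     int    keyval,val = 42,*ret,flag,i,n = 1000000;
>     double t0,t1;
>     MPI_Init(&argc,&argv);
>     MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN,MPI_COMM_NULL_DELETE_FN,&keyval,NULL);
>     MPI_Comm_set_attr(MPI_COMM_WORLD,keyval,&val);
>     t0 = MPI_Wtime();
>     for (i=0; i<n; i++) MPI_Comm_get_attr(MPI_COMM_WORLD,keyval,&ret,&flag);
>     t1 = MPI_Wtime();
>     /* Attribute lookup should normally be well under a microsecond per call. */
>     printf("%d MPI_Comm_get_attr() calls: %g s (%g us/call)\n",n,t1-t0,1e6*(t1-t0)/n);
>     MPI_Finalize();
>     return 0;
>   }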
> 
>   Barry
> 
> 
> 
> 
> 
> 
>> 
>> For reference, the PETSc calling sequence in the code is as follows.
>>    // Create the solution and RHS vectors
>>    ierr = VecCreate(petscData->mpicomm,&mesh->hpx);
>>    ierr = PetscObjectSetName((PetscObject) mesh->hpx, "Solution");
>>    ierr = VecSetSizes(mesh->hpx,mesh->lxN,mesh->neqns); // size = # of equations; distribution to match mesh
>>    ierr = VecSetFromOptions(mesh->hpx);    // allow run time options
>>    ierr = VecDuplicate(mesh->hpx,&q);      // create the RHS vector
>>    // Create the stiffness matrix
>>    ierr = MatCreate(petscData->mpicomm,&K);
>>    ierr = MatSetSizes(K,mesh->lxN,mesh->lxN,mesh->neqns,mesh->neqns);
>>    ierr = MatSetType(K,MATAIJ);  // default sparse type
>>    // Do preallocation
>>    ierr = MatMPIAIJSetPreallocation(K,d_nz,NULL,o_nz,NULL);
>>    ierr = MatSeqAIJSetPreallocation(K,d_nz,NULL);
>>    ierr = MatSetOption(K,MAT_NEW_NONZERO_ALLOCATION_ERR,PETSC_FALSE);
>>    ierr = MatSetUp(K);
>>    // Create and set up the KSP context as a preconditioner-only (direct) solution
>>    ierr = KSPCreate(petscData->mpicomm,&ksp);
>>    ierr = KSPSetOperators(ksp,K,K);
>>    ierr = KSPSetType(ksp,KSPPREONLY);
>>    // Set the temperature vector
>>    ierr = VecSet(mesh->hpx,mesh->Tmin);
>>    // Set the default PC method as MUMPS
>>    ierr = KSPGetPC(ksp,&pc);     // extract the preconditioner
>>    ierr = PCSetType(pc,PCLU);    // set pc options
>>    ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);
>>    ierr = KSPSetFromOptions(ksp);
>> 
>>    // Set the values for the K matrix and q vector,
>>    // which involves a lot of these calls:
>>        ierr = MatSetValues(K,mrows,idxm,ncols,idxn,pKe,ADD_VALUES);  // 1 call per matrix row (equation)
>>        ierr = VecSetValues(q,nqe,ixn,pqe,ADD_VALUES);                // 1 call per element
>>    ierr = VecAssemblyBegin(q);
>>    ierr = MatAssemblyBegin(K,MAT_FINAL_ASSEMBLY);
>>    ierr = VecAssemblyEnd(q);
>>    ierr = MatAssemblyEnd(K,MAT_FINAL_ASSEMBLY);
>> 
>>    // Solve //////////////////////////////////////
>>    ierr = KSPSolve(ksp,q,mesh->hpx);
>> ...
>> *Note that the code evenly divides the finite elements over the total
>> number of processors, and I am using ghosting of the FE vertices vector to
>> handle the vertices that are needed on more than 1 process.
>> 
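>> As an illustration of the batching I asked about above (a rough sketch; nen,
>> eqn[], Ke[] and qe[] are placeholder element-local arrays, not the code's
>> actual names), one MatSetValues() call per element block instead of one call
>> per row would look like this:
>>    PetscInt    nen = 8;      // equations (nodes) per element, for example
>>    PetscInt    eqn[8];       // global equation numbers for this element
>>    PetscScalar Ke[8*8];      // dense element stiffness block, row-major
>>    PetscScalar qe[8];        // element load vector
>>    // ... fill eqn[], Ke[], qe[] for the current element ...
>>    ierr = MatSetValues(K,nen,eqn,nen,eqn,Ke,ADD_VALUES);  // 1 call per element
>>    ierr = VecSetValues(q,nen,eqn,qe,ADD_VALUES);
>> 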
>> Thanks in advance for your help,
>> Matt Overholt
>> CapeSym, Inc.
>> 
> 


