[petsc-users] large PetscCommDuplicate overhead

Matthew Overholt overholt at capesim.com
Tue Oct 11 16:08:55 CDT 2016


Barry,

Subsequent tests with the same code on a problem (input) with a much
smaller vertex (equation) count (i.e. a much smaller matrix to invert for
the solution) did NOT show PetscCommDuplicate() accounting for any
significant time, so I'm not surprised that your test didn't find a problem.

I am running on Edison with its default modules, except as follows.
> module unload darshan
> module load cray-petsc
> module load perftools-base
> module load perftools

For sampling, I used the default pat_build command on my executable (xyz):
> pat_build -f xyz
For tracing, I used the following:
> pat_build -w -T PetscCommDuplicate xyz
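
In case it helps reproduce the trace, the executable only needs to create
and destroy a PETSc object or two on PETSC_COMM_WORLD to go through
PetscCommDuplicate(); a stripped-down stand-in (a sketch, not my actual
code) would be roughly:

    #include <petscvec.h>

    int main(int argc, char **argv)
    {
      Vec            x;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
      /* the first object created on PETSC_COMM_WORLD should trigger
         PetscCommDuplicate(), which attaches PETSc's inner communicator */
      ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
      ierr = VecSetSizes(x, PETSC_DECIDE, 1000000); CHKERRQ(ierr);
      ierr = VecSetFromOptions(x); CHKERRQ(ierr);
      ierr = VecSet(x, 1.0); CHKERRQ(ierr);
      ierr = VecDestroy(&x); CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }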

Thanks,
Matt...

-----Original Message-----
From: Barry Smith [mailto:bsmith at mcs.anl.gov] 
Sent: Saturday, October 08, 2016 5:18 PM
To: overholt at capesim.com
Cc: PETSc
Subject: Re: [petsc-users] large PetscCommDuplicate overhead


  What exact machine are you running on? Please run "module list" so we can
see exactly what modules you are using.  Please also tell us exactly what
options you are passing to pat_build.

  Barry



> On Oct 6, 2016, at 10:45 AM, Matthew Overholt <overholt at capesim.com> wrote:
> 
> Matthew and Barry,
> 
> 1) I did a direct measurement of PetscCommDuplicate() time by tracing
> just that call (using CrayPat), and confirmed the sampling results.
> For 8 processes (n=8), tracing counted a total of 101 calls, taking ~0
> time on the root process but 11.78 seconds (6.3% of 188 total seconds)
> on each of the other 7 processes.  For 16 processes (n=16, still only
> 1 node), tracing counted 102 calls, taking 18.42 seconds (13.2% of
> 139.6 total seconds) on every process except the root.
> 
> 2) Copied below is a section of the log view for the first two 
> solutions for n=2, which shows the same calls as for n=8. (I can send 
> the entire log files if desired.)  In each case I count about 44
> PetscCommDuplicate (PCD) calls per process during initialization and
> meshing, 7 calls during setup, 9 calls for the first solution, then 3
> calls for each subsequent solution (fixed-point iteration), and 3 calls
> to write out the solution, for 75 total.
> 
> 3) I would expect that the administrators of this machine have 
> configured PETSc appropriately. I am using their current default 
> install, which is 3.7.2.
> https://www.nersc.gov/users/software/programming-libraries/math-libraries/petsc/
> 
> 4) Yes, I just gave the MUMPS time as a comparison.
> 
> 5) As to where it is spending time, perhaps the timing results in the 
> log files will be helpful. The "Solution took ..." printouts give the 
> total solution time for that iteration; the others are incremental 
> times.  (As an aside, I have been wondering why the solution times do 
> not scale well with process count, even though that work is entirely 
> done in parallel PETSc
> routines.)
> 
> Thanks,
> Matt Overholt
> 
> 
> ********** -log_view -info results for n=2 : the first solution and subsequent fixed-point iteration ***********
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374779
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374779
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374781
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> Matrix setup took 0.108 s
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374779
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374781
> KSP PC setup took 0.079 s
> [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
> [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
> [0] MatStashScatterBegin_Ref(): No of messages: 0
> [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> [1] MatStashScatterBegin_Ref(): No of messages: 1
> [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
> [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 5 mallocs.
> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space: 1050106 unneeded,15128672 used
> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 599214) < 0.6. Do not use CompressedRow routines.
> [1] MatSeqAIJCheckInode(): Found 599214 nodes out of 599214 rows. Not using Inode routines
> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space: 1237634 unneeded,15545404 used
> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 621594) < 0.6. Do not use CompressedRow routines.
> [0] MatSeqAIJCheckInode(): Found 621594 nodes out of 621594 rows. Not using Inode routines
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
> [0] VecScatterCreate(): General case: MPI to Seq
> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space: 5257543 unneeded,136718 used
> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 89
> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space: 5464978 unneeded,136718 used
> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 490
> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
> K and q SetValues took 26.426 s
> [0] PCSetUp(): Setting up PC for first time
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [0] VecScatterCreate(): Special case: processor zero gets entire parallel vector, rest get none
> ** Max-trans not allowed because matrix is distributed
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] PCSetUp(): Leaving PC with identical preconditioner since operator is unchanged
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
> [0] VecScatterCreate(): General case: Seq to MPI
> [1] VecScatterCreate(): General case: Seq to MPI
> Solution took 102.21 s
> 
> NL iteration 0: delta = 32.0488 67.6279.
> Error delta calc took 0.045 s
> Node and Element temps update took 0.017 s
> [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
> [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
> [0] MatStashScatterBegin_Ref(): No of messages: 0
> [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> [1] MatStashScatterBegin_Ref(): No of messages: 1
> [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
> [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 0 mallocs.
> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space: 0 unneeded,15128672 used
> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 599214) < 0.6. Do not use CompressedRow routines.
> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space: 0 unneeded,15545404 used
> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 621594) < 0.6. Do not use CompressedRow routines.
> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space: 0 unneeded,136718 used
> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space: 0 unneeded,136718 used
> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
> K and q SetValues took 2.366 s
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
> [0] VecScatterCreate(): General case: Seq to MPI
> [1] VecScatterCreate(): General case: Seq to MPI
> Solution took 82.156 s
> 
> -----Original Message-----
> From: Barry Smith [mailto:bsmith at mcs.anl.gov]
> Sent: Wednesday, October 05, 2016 4:42 PM
> To: overholt at capesim.com
> Cc: petsc-users at mcs.anl.gov
> Subject: Re: [petsc-users] large PetscCommDuplicate overhead
> 
> 
>> On Oct 5, 2016, at 2:30 PM, Matthew Overholt <overholt at capesim.com> wrote:
>> 
>> Hi Petsc-Users,
>> 
>> I am trying to understand an issue where PetscCommDuplicate() calls
>> are taking an increasing percentage of time as I run a fixed-sized
>> problem on more processes.
>> 
>> I am using the FEM to solve the steady-state heat transfer equation
>> (K.x = q) using a PC direct solver, like MUMPS.
>> 
>> I am running on the NERSC Cray XC30, which has two Xeons per node
>> with 12 cores each, and profiling the code using CrayPat sampling.
>> 
>> On a typical problem (1E+6 finite elements), running on a single node:
>> - for 2 cores (1 on each Xeon), about 1% of time is PetscCommDuplicate
>>   (on process 1, but on the root it is less), and (for reference) 9% of
>>   total time is for MUMPS.
>> - for 8 cores (4 on each Xeon), over 6% of time is PetscCommDuplicate
>>   (on every process except the root, where it is <1%), and 9-10% of
>>   total time is for MUMPS.
> 
>   What does PetscCommDuplicate() have to do with MUMPS? Nothing at 
> all, you are just giving its time for comparison?
> 
>> 
>> What is the large PetscCommDuplicate time connected to, an increasing
>> number of messages (tags)?  Would using fewer MatSetValues() and
>> VecSetValues() calls (with longer message lengths) alleviate this?
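>> 
>> (For concreteness, "fewer calls with longer messages" would mean adding
>> each complete element stiffness block in a single call instead of one
>> row at a time, roughly as in the sketch below; nen, elemRows and Ke are
>> illustrative names, not from the code further down:)
>> 
>>    // add the full nen x nen element block (Ke stored row-major) in one call
>>    ierr = MatSetValues(K,nen,elemRows,nen,elemRows,Ke,ADD_VALUES);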
> 
>   No, PetscCommDuplicate() won't increase with more messages or calls to
> XXXSetValues(). PetscCommDuplicate() is essentially only called on the
> creation of new PETSc objects.  It should also be fast since it
> basically needs to do just an MPI_Attr_get(). With more processes but
> the same problem size and code there should be pretty much the same
> number of objects created.
> 
>   PetscSpinlockLock() does nothing if you are not using threads so it 
> won't take any time.
> 
>   Is there a way to see where it is spending its time inside
> PetscCommDuplicate()?  Perhaps the Cray MPI_Attr_get() has issues.
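> 
>   One way to check that outside of PETSc is a small standalone MPI test
> that caches an attribute on a duplicated communicator and then times
> repeated lookups (a minimal sketch of just the attribute-lookup part,
> not of PetscCommDuplicate() itself):
> 
>    #include <mpi.h>
>    #include <stdio.h>
> 
>    int main(int argc, char **argv)
>    {
>      MPI_Comm dup;
>      int      keyval, flag, i, rank;
>      void    *val;
>      long     payload = 42;   /* stand-in for the cached inner-comm pointer */
>      double   t0, t1;
> 
>      MPI_Init(&argc, &argv);
>      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>      MPI_Comm_dup(MPI_COMM_WORLD, &dup);
>      /* cache a value on the communicator, then look it up repeatedly */
>      MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
>                             &keyval, NULL);
>      MPI_Comm_set_attr(dup, keyval, &payload);
>      t0 = MPI_Wtime();
>      for (i = 0; i < 100000; i++) MPI_Comm_get_attr(dup, keyval, &val, &flag);
>      t1 = MPI_Wtime();
>      printf("[%d] 1e5 MPI_Comm_get_attr calls: %g s\n", rank, t1 - t0);
>      MPI_Comm_free_keyval(&keyval);
>      MPI_Comm_free(&dup);
>      MPI_Finalize();
>      return 0;
>    }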
> 
>   Barry
> 
> 
> 
> 
> 
> 
>> 
>> For reference, the PETSc calling sequence in the code is as follows.
>>    // Create the solution and RHS vectors
>>    ierr = VecCreate(petscData->mpicomm,&mesh->hpx);
>>    ierr = PetscObjectSetName((PetscObject) mesh->hpx, "Solution");
>>    ierr = VecSetSizes(mesh->hpx,mesh->lxN,mesh->neqns); // size = # of equations; distribution to match mesh
>>    ierr = VecSetFromOptions(mesh->hpx);    // allow run time options
>>    ierr = VecDuplicate(mesh->hpx,&q);      // create the RHS vector
>>    // Create the stiffness matrix
>>    ierr = MatCreate(petscData->mpicomm,&K);
>>    ierr = MatSetSizes(K,mesh->lxN,mesh->lxN,mesh->neqns,mesh->neqns);
>>    ierr = MatSetType(K,MATAIJ);  // default sparse type
>>    // Do preallocation
>>    ierr = MatMPIAIJSetPreallocation(K,d_nz,NULL,o_nz,NULL);
>>    ierr = MatSeqAIJSetPreallocation(K,d_nz,NULL);
>>    ierr = MatSetOption(K,MAT_NEW_NONZERO_ALLOCATION_ERR,PETSC_FALSE);
>>    ierr = MatSetUp(K);
>>    // Create and set up the KSP context as a PreConditioner Only (Direct) Solution
>>    ierr = KSPCreate(petscData->mpicomm,&ksp);
>>    ierr = KSPSetOperators(ksp,K,K);
>>    ierr = KSPSetType(ksp,KSPPREONLY);
>>    // Set the temperature vector
>>    ierr = VecSet(mesh->hpx,mesh->Tmin);
>>    // Set the default PC method as MUMPS
>>    ierr = KSPGetPC(ksp,&pc);     // extract the preconditioner
>>    ierr = PCSetType(pc,PCLU);    // set pc options
>>    ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);
>>    ierr = KSPSetFromOptions(ksp);
>> 
>>    // Set the values for the K matrix and q vector,
>>    // which involves a lot of these calls:
>>        ierr = MatSetValues(K,mrows,idxm,ncols,idxn,pKe,ADD_VALUES);  // 1 call per matrix row (equation)
>>        ierr = VecSetValues(q,nqe,ixn,pqe,ADD_VALUES);                // 1 call per element
>>    ierr = VecAssemblyBegin(q);
>>    ierr = MatAssemblyBegin(K,MAT_FINAL_ASSEMBLY);
>>    ierr = VecAssemblyEnd(q);
>>    ierr = MatAssemblyEnd(K,MAT_FINAL_ASSEMBLY);
>> 
>>    // Solve //////////////////////////////////////
>>    ierr = KSPSolve(ksp,q,mesh->hpx);
>> ...
>> *Note that the code evenly divides the finite elements over the total
>> number of processors, and I am using ghosting of the FE vertices
>> vector to handle the vertices that are needed on more than 1 process.
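>> 
>> For reference, the ghosted vertex vector is set up along these lines
>> (a sketch; nGhost, ghostIdx and xGhosted are illustrative names for the
>> count and global indices of the off-process vertices and the resulting
>> vector, not identifiers from my code):
>>    ierr = VecCreateGhost(petscData->mpicomm,mesh->lxN,mesh->neqns,
>>                          nGhost,ghostIdx,&xGhosted);
>>    // refresh the ghost copies from the owning processes after a solve
>>    ierr = VecGhostUpdateBegin(xGhosted,INSERT_VALUES,SCATTER_FORWARD);
>>    ierr = VecGhostUpdateEnd(xGhosted,INSERT_VALUES,SCATTER_FORWARD);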
>> 
>> Thanks in advance for your help,
>> Matt Overholt
>> CapeSym, Inc.
>> 
> 
> 




