[petsc-users] large PetscCommDuplicate overhead

Thu Oct 6 12:06:44 CDT 2016

Happy to (though Piz Daint goes down for an extended upgrade on Oct 17
so would need to be run before then)!

On Thu, Oct 6, 2016 at 6:03 PM, Matthew Knepley <knepley at gmail.com> wrote:
> On Thu, Oct 6, 2016 at 10:55 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>>
>>    Matt,
>>
>>     Thanks for this information. It sure looks like there is something
>> seriously wrong with the MPI_Attr_get() on the cray for non-root process.
>> Does any PETSc developer have access to such a machine? We need to write a
>> test program that just calls MPI_Attr_get a bunch of times (no PETSc) to see
>> if we can reproduce the problem and report it to Cray.
>
>
> Barry, if you write it, we can give it to Patrick Sanan to run.
>
>   Thanks,
>
>     Matt
>
>>
>>    Barry
>>
>>
>>
>> On Oct 6, 2016, at 10:45 AM, Matthew Overholt <overholt at capesim.com>
>> wrote:
>> >
>> >
>> > Matthew and Barry,
>> >
>> > 1) I did a direct measurement of PetscCommDuplicate() time by tracing
>> > just
>> > that call (using CrayPat), and confirmed the sampling results.  For 8
>> > processes (n=8), tracing counted a total of 101 calls, taking ~0 time on
>> > the
>> > root process but taking 11.78 seconds (6.3% of 188 total seconds) on
>> > each of
>> > the other 7 processes.  For 16 processes (n=16, still only 1 node),
>> > tracing
>> > counted 102 total calls for a total of 18.42 seconds (13.2% of 139.6
>> > total
>> > seconds) on every process except the root.
>> >
>> > 2) Copied below is a section of the log view for the first two solutions
>> > for
>> > n=2, which shows the same calls as for n=8. (I can send the entire log
>> > files
>> > if desired.)  In each case I count about 44 PCD calls per process during
>> > initialization and meshing, 7 calls during setup, 9 calls for the first
>> > solution, then 3 calls for each subsequent solution (fixed-point
>> > iteration),
>> > and 3 calls to write out the solution, for 75 total.
>> >
>> > 3) I would expect that the administrators of this machine have
>> > configured
>> > PETSc appropriately. I am using their current default install, which is
>> > 3.7.2.
>> >
>> > https://www.nersc.gov/users/software/programming-libraries/math-libraries/pe
>> > tsc/
>> >
>> > 4) Yes, I just gave the MUMPS time as a comparison.
>> >
>> > 5) As to where it is spending time, perhaps the timing results in the
>> > log
>> > files will be helpful. The "Solution took ..." printouts give the total
>> > solution time for that iteration, the others are incremental times.  (As
>> > an
>> > aside, I have been wondering why the solution times do not scale well
>> > with
>> > process count, even though that work is entirely done in parallel PETSc
>> > routines.)
>> >
>> > Thanks,
>> > Matt Overholt
>> >
>> >
>> > ********** -log_view -info results for n=2 : the first solution and
>> > subsequent fixed-point iteration ***********
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
>> > -2080374779
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
>> > -2080374779
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
>> > -2080374781
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > Matrix setup took 0.108 s
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
>> > -2080374779
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
>> > -2080374781
>> > KSP PC setup took 0.079 s
>> > [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
>> > [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
>> > [0] MatStashScatterBegin_Ref(): No of messages: 0
>> > [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>> > [1] MatStashScatterBegin_Ref(): No of messages: 1
>> > [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
>> > [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 5 mallocs.
>> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage
>> > space:
>> > 1050106 unneeded,15128672 used
>> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>> > 0
>> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>> > [1] MatCheckCompressedRow(): Found the ratio (num_zerorows
>> > 0)/(num_localrows
>> > 599214) < 0.6. Do not use CompressedRow routines.
>> > [1] MatSeqAIJCheckInode(): Found 599214 nodes out of 599214 rows. Not
>> > using
>> > Inode routines
>> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage
>> > space:
>> > 1237634 unneeded,15545404 used
>> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>> > 0
>> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>> > [0] MatCheckCompressedRow(): Found the ratio (num_zerorows
>> > 0)/(num_localrows
>> > 621594) < 0.6. Do not use CompressedRow routines.
>> > [0] MatSeqAIJCheckInode(): Found 621594 nodes out of 621594 rows. Not
>> > using
>> > Inode routines
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>> > [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>> > [0] VecScatterCreate(): General case: MPI to Seq
>> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space:
>> > 5257543 unneeded,136718 used
>> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>> > 89
>> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
>> > [1] MatCheckCompressedRow(): Found the ratio (num_zerorows
>> > 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
>> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space:
>> > 5464978 unneeded,136718 used
>> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>> > 490
>> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
>> > [0] MatCheckCompressedRow(): Found the ratio (num_zerorows
>> > 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
>> > K and q SetValues took 26.426 s
>> > [0] PCSetUp(): Setting up PC for first time
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [0] VecScatterCreate(): Special case: processor zero gets entire
>> > parallel
>> > vector, rest get none
>> > ** Max-trans not allowed because matrix is distributed
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] PCSetUp(): Leaving PC with identical preconditioner since operator
>> > is
>> > unchanged
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>> > [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>> > [0] VecScatterCreate(): General case: Seq to MPI
>> > [1] VecScatterCreate(): General case: Seq to MPI
>> > Solution took 102.21 s
>> >
>> > NL iteration 0: delta = 32.0488 67.6279.
>> > Error delta calc took 0.045 s
>> > Node and Element temps update took 0.017 s
>> > [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
>> > [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
>> > [0] MatStashScatterBegin_Ref(): No of messages: 0
>> > [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>> > [1] MatStashScatterBegin_Ref(): No of messages: 1
>> > [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
>> > [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 0 mallocs.
>> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage
>> > space: 0
>> > unneeded,15128672 used
>> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>> > 0
>> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>> > [1] MatCheckCompressedRow(): Found the ratio (num_zerorows
>> > 0)/(num_localrows
>> > 599214) < 0.6. Do not use CompressedRow routines.
>> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage
>> > space: 0
>> > unneeded,15545404 used
>> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>> > 0
>> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>> > [0] MatCheckCompressedRow(): Found the ratio (num_zerorows
>> > 0)/(num_localrows
>> > 621594) < 0.6. Do not use CompressedRow routines.
>> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space:
>> > 0
>> > unneeded,136718 used
>> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>> > 0
>> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
>> > [1] MatCheckCompressedRow(): Found the ratio (num_zerorows
>> > 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
>> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space:
>> > 0
>> > unneeded,136718 used
>> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>> > 0
>> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
>> > [0] MatCheckCompressedRow(): Found the ratio (num_zerorows
>> > 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
>> > K and q SetValues took 2.366 s
>> > [0] PCSetUp(): Setting up PC with same nonzero pattern
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>> > -2080374782
>> > [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>> > [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>> > [0] VecScatterCreate(): General case: Seq to MPI
>> > [1] VecScatterCreate(): General case: Seq to MPI
>> > Solution took 82.156 s
>> >
>> > -----Original Message-----
>> > From: Barry Smith [mailto:bsmith at mcs.anl.gov]
>> > Sent: Wednesday, October 05, 2016 4:42 PM
>> > To: overholt at capesim.com
>> > Cc: petsc-users at mcs.anl.gov
>> > Subject: Re: [petsc-users] large PetscCommDuplicate overhead
>> >
>> >
>> >> On Oct 5, 2016, at 2:30 PM, Matthew Overholt <overholt at capesim.com>
>> >> wrote:
>> >>
>> >> Hi Petsc-Users,
>> >>
>> >> I am trying to understand an issue where PetscCommDuplicate() calls are
>> > taking an increasing percentage of time as I run a fixed-sized problem
>> > on
>> > more processes.
>> >>
>> >> I am using the FEM to solve the steady-state heat transfer equation
>> >> (K.x =
>> > q) using a PC direct solver, like MUMPS.
>> >>
>> >> I am running on the NERSC Cray X30, which has two Xeon's per node with
>> >> 12
>> > cores each, and profiling the code using CrayPat sampling.
>> >>
>> >> On a typical problem (1E+6 finite elements), running on a single node:
>> >> -for 2 cores (1 on each Xeon), about 1% of time is PetscCommDuplicate
>> >> (on
>> > process 1, but on the root it is less), and (for reference) 9% of total
>> > time
>> > is for MUMPS.
>> >> -for 8 cores (4 on each Xeon), over 6% of time is PetscCommDuplicate
>> >> (on
>> > every process except the root, where it is <1%), and 9-10% of total time
>> > is
>> > for MUMPS.
>> >
>> >   What does PetscCommDuplicate() have to do with MUMPS? Nothing at all,
>> > you
>> > are just giving its time for comparison?
>> >
>> >>
>> >> What is the large PetscCommDuplicate time connected to, an increasing
>> > number of messages (tags)?  Would using fewer MatSetValues() and
>> > VecSetValues() calls (with longer message lengths) alleviate this?
>> >
>> >   No PetscCommDuplicate won't increate with more messages or calls to
>> > XXXSetValues(). PetscCommDuplicate() is only called essentially on the
>> > creation of new PETSc objects.  It should also be fast since it
>> > basically
>> > needs to do just a MPI_Attr_get(). With more processes but the same
>> > problem
>> > size and code there should be pretty much the same number of objects
>> > created.
>> >
>> >   PetscSpinlockLock() does nothing if you are not using threads so it
>> > won't
>> > take any time.
>> >
>> >   Is there a way to see where it is spending its time inside the
>> > PetscCommDuplicate()?  Perhaps the Cray MPI_Attr_get() has issues.
>> >
>> >   Barry
>> >
>> >
>> >
>> >
>> >
>> >
>> >>
>> >> For reference, the PETSc calling sequence in the code is as follows.
>> >>    // Create the solution and RHS vectors
>> >>    ierr = VecCreate(petscData->mpicomm,&mesh->hpx);
>> >>   ierr = PetscObjectSetName((PetscObject) mesh->hpx, "Solution");
>> >>    ierr = VecSetSizes(mesh->hpx,mesh->lxN,mesh->neqns); // size = # of
>> > equations; distribution to match mesh
>> >>    ierr = VecSetFromOptions(mesh->hpx);    // allow run time options
>> >>    ierr = VecDuplicate(mesh->hpx,&q);      // create the RHS vector
>> >>    // Create the stiffnexx matrix
>> >>    ierr = MatCreate(petscData->mpicomm,&K);
>> >>    ierr = MatSetSizes(K,mesh->lxN,mesh->lxN,mesh->neqns,mesh->neqns);
>> >>    ierr = MatSetType(K,MATAIJ);  // default sparse type
>> >>    // Do preallocation
>> >>    ierr = MatMPIAIJSetPreallocation(K,d_nz,NULL,o_nz,NULL);
>> >>    ierr = MatSeqAIJSetPreallocation(K,d_nz,NULL);
>> >>    ierr = MatSetOption(K,MAT_NEW_NONZERO_ALLOCATION_ERR,PETSC_FALSE);
>> >>    ierr = MatSetUp(K);
>> >>    // Create and set up the KSP context as a PreConditioner Only
>> >> (Direct)
>> > Solution
>> >>    ierr = KSPCreate(petscData->mpicomm,&ksp);
>> >>    ierr = KSPSetOperators(ksp,K,K);
>> >>    ierr = KSPSetType(ksp,KSPPREONLY);
>> >>    // Set the temperature vector
>> >>    ierr = VecSet(mesh->hpx,mesh->Tmin);
>> >>    // Set the default PC method as MUMPS
>> >>    ierr = KSPGetPC(ksp,&pc);     // extract the preconditioner
>> >>    ierr = PCSetType(pc,PCLU);    // set pc options
>> >>    ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);
>> >>    ierr = KSPSetFromOptions(ksp);
>> >>
>> >>    // Set the values for the K matrix and q vector
>> >>        // which involves a lot of these calls
>> >>        ierr = MatSetValues(K,mrows,idxm,ncols,idxn,pKe,ADD_VALUES);
>> >> //
>> > 1 call per matrix row (equation)
>> >>        ierr = VecSetValues(q,nqe,ixn,pqe,ADD_VALUES);      // 1 call
>> >> per
>> > element
>> >>    ierr = VecAssemblyBegin(q);
>> >>    ierr = MatAssemblyBegin(K,MAT_FINAL_ASSEMBLY);
>> >>    ierr = VecAssemblyEnd(q);
>> >>    ierr = MatAssemblyEnd(K,MAT_FINAL_ASSEMBLY);
>> >>
>> >>    // Solve //////////////////////////////////////
>> >>    ierr = KSPSolve(ksp,q,mesh->hpx);
>> >> ...
>> >> *Note that the code evenly divides the finite elements over the total
>> > number of processors, and I am using ghosting of the FE vertices vector
>> > to
>> > handle the vertices that are needed on more than 1 process.
>> >>
>> >> Thanks in advance for your help,
>> >> Matt Overholt
>> >> CapeSym, Inc.
>> >>
>> >>      Virus-free. www.avast.com
>> >
>> >
>> > ---
>> > This email has been checked for viruses by Avast antivirus software.
>> > https://www.avast.com/antivirus
>> >
>>
>
>
>
> --
> What most experimenters take for granted before they begin their experiments
> is infinitely more interesting than any results to which their experiments
> lead.
> -- Norbert Wiener