[petsc-users] large PetscCommDuplicate overhead
Barry Smith
bsmith at mcs.anl.gov
Sat Oct 8 16:21:42 CDT 2016
Attached is a simple test code. I had no luck getting it to behavior badly.
Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ex3.c
Type: application/octet-stream
Size: 831 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20161008/b531dfc4/attachment-0001.obj>
-------------- next part --------------
> On Oct 6, 2016, at 12:06 PM, Patrick Sanan <patrick.sanan at gmail.com> wrote:
>
> Happy to (though Piz Daint goes down for an extended upgrade on Oct 17
> so would need to be run before then)!
>
> On Thu, Oct 6, 2016 at 6:03 PM, Matthew Knepley <knepley at gmail.com> wrote:
>> On Thu, Oct 6, 2016 at 10:55 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>
>>>
>>> Matt,
>>>
>>> Thanks for this information. It sure looks like there is something
>>> seriously wrong with the MPI_Attr_get() on the cray for non-root process.
>>> Does any PETSc developer have access to such a machine? We need to write a
>>> test program that just calls MPI_Attr_get a bunch of times (no PETSc) to see
>>> if we can reproduce the problem and report it to Cray.
>>
>>
>> Barry, if you write it, we can give it to Patrick Sanan to run.
>>
>> Thanks,
>>
>> Matt
>>
>>>
>>> Barry
>>>
>>>
>>>
>>> On Oct 6, 2016, at 10:45 AM, Matthew Overholt <overholt at capesim.com>
>>> wrote:
>>>>
>>>>
>>>> Matthew and Barry,
>>>>
>>>> 1) I did a direct measurement of PetscCommDuplicate() time by tracing
>>>> just
>>>> that call (using CrayPat), and confirmed the sampling results. For 8
>>>> processes (n=8), tracing counted a total of 101 calls, taking ~0 time on
>>>> the
>>>> root process but taking 11.78 seconds (6.3% of 188 total seconds) on
>>>> each of
>>>> the other 7 processes. For 16 processes (n=16, still only 1 node),
>>>> tracing
>>>> counted 102 total calls for a total of 18.42 seconds (13.2% of 139.6
>>>> total
>>>> seconds) on every process except the root.
>>>>
>>>> 2) Copied below is a section of the log view for the first two solutions
>>>> for
>>>> n=2, which shows the same calls as for n=8. (I can send the entire log
>>>> files
>>>> if desired.) In each case I count about 44 PCD calls per process during
>>>> initialization and meshing, 7 calls during setup, 9 calls for the first
>>>> solution, then 3 calls for each subsequent solution (fixed-point
>>>> iteration),
>>>> and 3 calls to write out the solution, for 75 total.
>>>>
>>>> 3) I would expect that the administrators of this machine have
>>>> configured
>>>> PETSc appropriately. I am using their current default install, which is
>>>> 3.7.2.
>>>>
>>>> https://www.nersc.gov/users/software/programming-libraries/math-libraries/pe
>>>> tsc/
>>>>
>>>> 4) Yes, I just gave the MUMPS time as a comparison.
>>>>
>>>> 5) As to where it is spending time, perhaps the timing results in the
>>>> log
>>>> files will be helpful. The "Solution took ..." printouts give the total
>>>> solution time for that iteration, the others are incremental times. (As
>>>> an
>>>> aside, I have been wondering why the solution times do not scale well
>>>> with
>>>> process count, even though that work is entirely done in parallel PETSc
>>>> routines.)
>>>>
>>>> Thanks,
>>>> Matt Overholt
>>>>
>>>>
>>>> ********** -log_view -info results for n=2 : the first solution and
>>>> subsequent fixed-point iteration ***********
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
>>>> -2080374779
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
>>>> -2080374779
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
>>>> -2080374781
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> Matrix setup took 0.108 s
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
>>>> -2080374779
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
>>>> -2080374781
>>>> KSP PC setup took 0.079 s
>>>> [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
>>>> [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
>>>> [0] MatStashScatterBegin_Ref(): No of messages: 0
>>>> [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>>>> [1] MatStashScatterBegin_Ref(): No of messages: 1
>>>> [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
>>>> [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 5 mallocs.
>>>> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage
>>>> space:
>>>> 1050106 unneeded,15128672 used
>>>> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>>>> 0
>>>> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>>>> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows
>>>> 0)/(num_localrows
>>>> 599214) < 0.6. Do not use CompressedRow routines.
>>>> [1] MatSeqAIJCheckInode(): Found 599214 nodes out of 599214 rows. Not
>>>> using
>>>> Inode routines
>>>> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage
>>>> space:
>>>> 1237634 unneeded,15545404 used
>>>> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>>>> 0
>>>> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>>>> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows
>>>> 0)/(num_localrows
>>>> 621594) < 0.6. Do not use CompressedRow routines.
>>>> [0] MatSeqAIJCheckInode(): Found 621594 nodes out of 621594 rows. Not
>>>> using
>>>> Inode routines
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>>>> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>>>> [0] VecScatterCreate(): General case: MPI to Seq
>>>> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space:
>>>> 5257543 unneeded,136718 used
>>>> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>>>> 89
>>>> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
>>>> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows
>>>> 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
>>>> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space:
>>>> 5464978 unneeded,136718 used
>>>> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>>>> 490
>>>> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
>>>> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows
>>>> 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
>>>> K and q SetValues took 26.426 s
>>>> [0] PCSetUp(): Setting up PC for first time
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [0] VecScatterCreate(): Special case: processor zero gets entire
>>>> parallel
>>>> vector, rest get none
>>>> ** Max-trans not allowed because matrix is distributed
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] PCSetUp(): Leaving PC with identical preconditioner since operator
>>>> is
>>>> unchanged
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>>>> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>>>> [0] VecScatterCreate(): General case: Seq to MPI
>>>> [1] VecScatterCreate(): General case: Seq to MPI
>>>> Solution took 102.21 s
>>>>
>>>> NL iteration 0: delta = 32.0488 67.6279.
>>>> Error delta calc took 0.045 s
>>>> Node and Element temps update took 0.017 s
>>>> [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
>>>> [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
>>>> [0] MatStashScatterBegin_Ref(): No of messages: 0
>>>> [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>>>> [1] MatStashScatterBegin_Ref(): No of messages: 1
>>>> [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
>>>> [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 0 mallocs.
>>>> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage
>>>> space: 0
>>>> unneeded,15128672 used
>>>> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>>>> 0
>>>> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>>>> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows
>>>> 0)/(num_localrows
>>>> 599214) < 0.6. Do not use CompressedRow routines.
>>>> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage
>>>> space: 0
>>>> unneeded,15545404 used
>>>> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>>>> 0
>>>> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>>>> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows
>>>> 0)/(num_localrows
>>>> 621594) < 0.6. Do not use CompressedRow routines.
>>>> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space:
>>>> 0
>>>> unneeded,136718 used
>>>> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>>>> 0
>>>> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
>>>> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows
>>>> 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
>>>> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space:
>>>> 0
>>>> unneeded,136718 used
>>>> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is
>>>> 0
>>>> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
>>>> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows
>>>> 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
>>>> K and q SetValues took 2.366 s
>>>> [0] PCSetUp(): Setting up PC with same nonzero pattern
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
>>>> -2080374782
>>>> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>>>> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>>>> [0] VecScatterCreate(): General case: Seq to MPI
>>>> [1] VecScatterCreate(): General case: Seq to MPI
>>>> Solution took 82.156 s
>>>>
>>>> -----Original Message-----
>>>> From: Barry Smith [mailto:bsmith at mcs.anl.gov]
>>>> Sent: Wednesday, October 05, 2016 4:42 PM
>>>> To: overholt at capesim.com
>>>> Cc: petsc-users at mcs.anl.gov
>>>> Subject: Re: [petsc-users] large PetscCommDuplicate overhead
>>>>
>>>>
>>>>> On Oct 5, 2016, at 2:30 PM, Matthew Overholt <overholt at capesim.com>
>>>>> wrote:
>>>>>
>>>>> Hi Petsc-Users,
>>>>>
>>>>> I am trying to understand an issue where PetscCommDuplicate() calls are
>>>> taking an increasing percentage of time as I run a fixed-sized problem
>>>> on
>>>> more processes.
>>>>>
>>>>> I am using the FEM to solve the steady-state heat transfer equation
>>>>> (K.x =
>>>> q) using a PC direct solver, like MUMPS.
>>>>>
>>>>> I am running on the NERSC Cray X30, which has two Xeon's per node with
>>>>> 12
>>>> cores each, and profiling the code using CrayPat sampling.
>>>>>
>>>>> On a typical problem (1E+6 finite elements), running on a single node:
>>>>> -for 2 cores (1 on each Xeon), about 1% of time is PetscCommDuplicate
>>>>> (on
>>>> process 1, but on the root it is less), and (for reference) 9% of total
>>>> time
>>>> is for MUMPS.
>>>>> -for 8 cores (4 on each Xeon), over 6% of time is PetscCommDuplicate
>>>>> (on
>>>> every process except the root, where it is <1%), and 9-10% of total time
>>>> is
>>>> for MUMPS.
>>>>
>>>> What does PetscCommDuplicate() have to do with MUMPS? Nothing at all,
>>>> you
>>>> are just giving its time for comparison?
>>>>
>>>>>
>>>>> What is the large PetscCommDuplicate time connected to, an increasing
>>>> number of messages (tags)? Would using fewer MatSetValues() and
>>>> VecSetValues() calls (with longer message lengths) alleviate this?
>>>>
>>>> No PetscCommDuplicate won't increate with more messages or calls to
>>>> XXXSetValues(). PetscCommDuplicate() is only called essentially on the
>>>> creation of new PETSc objects. It should also be fast since it
>>>> basically
>>>> needs to do just a MPI_Attr_get(). With more processes but the same
>>>> problem
>>>> size and code there should be pretty much the same number of objects
>>>> created.
>>>>
>>>> PetscSpinlockLock() does nothing if you are not using threads so it
>>>> won't
>>>> take any time.
>>>>
>>>> Is there a way to see where it is spending its time inside the
>>>> PetscCommDuplicate()? Perhaps the Cray MPI_Attr_get() has issues.
>>>>
>>>> Barry
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> For reference, the PETSc calling sequence in the code is as follows.
>>>>> // Create the solution and RHS vectors
>>>>> ierr = VecCreate(petscData->mpicomm,&mesh->hpx);
>>>>> ierr = PetscObjectSetName((PetscObject) mesh->hpx, "Solution");
>>>>> ierr = VecSetSizes(mesh->hpx,mesh->lxN,mesh->neqns); // size = # of
>>>> equations; distribution to match mesh
>>>>> ierr = VecSetFromOptions(mesh->hpx); // allow run time options
>>>>> ierr = VecDuplicate(mesh->hpx,&q); // create the RHS vector
>>>>> // Create the stiffnexx matrix
>>>>> ierr = MatCreate(petscData->mpicomm,&K);
>>>>> ierr = MatSetSizes(K,mesh->lxN,mesh->lxN,mesh->neqns,mesh->neqns);
>>>>> ierr = MatSetType(K,MATAIJ); // default sparse type
>>>>> // Do preallocation
>>>>> ierr = MatMPIAIJSetPreallocation(K,d_nz,NULL,o_nz,NULL);
>>>>> ierr = MatSeqAIJSetPreallocation(K,d_nz,NULL);
>>>>> ierr = MatSetOption(K,MAT_NEW_NONZERO_ALLOCATION_ERR,PETSC_FALSE);
>>>>> ierr = MatSetUp(K);
>>>>> // Create and set up the KSP context as a PreConditioner Only
>>>>> (Direct)
>>>> Solution
>>>>> ierr = KSPCreate(petscData->mpicomm,&ksp);
>>>>> ierr = KSPSetOperators(ksp,K,K);
>>>>> ierr = KSPSetType(ksp,KSPPREONLY);
>>>>> // Set the temperature vector
>>>>> ierr = VecSet(mesh->hpx,mesh->Tmin);
>>>>> // Set the default PC method as MUMPS
>>>>> ierr = KSPGetPC(ksp,&pc); // extract the preconditioner
>>>>> ierr = PCSetType(pc,PCLU); // set pc options
>>>>> ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);
>>>>> ierr = KSPSetFromOptions(ksp);
>>>>>
>>>>> // Set the values for the K matrix and q vector
>>>>> // which involves a lot of these calls
>>>>> ierr = MatSetValues(K,mrows,idxm,ncols,idxn,pKe,ADD_VALUES);
>>>>> //
>>>> 1 call per matrix row (equation)
>>>>> ierr = VecSetValues(q,nqe,ixn,pqe,ADD_VALUES); // 1 call
>>>>> per
>>>> element
>>>>> ierr = VecAssemblyBegin(q);
>>>>> ierr = MatAssemblyBegin(K,MAT_FINAL_ASSEMBLY);
>>>>> ierr = VecAssemblyEnd(q);
>>>>> ierr = MatAssemblyEnd(K,MAT_FINAL_ASSEMBLY);
>>>>>
>>>>> // Solve //////////////////////////////////////
>>>>> ierr = KSPSolve(ksp,q,mesh->hpx);
>>>>> ...
>>>>> *Note that the code evenly divides the finite elements over the total
>>>> number of processors, and I am using ghosting of the FE vertices vector
>>>> to
>>>> handle the vertices that are needed on more than 1 process.
>>>>>
>>>>> Thanks in advance for your help,
>>>>> Matt Overholt
>>>>> CapeSym, Inc.
>>>>>
>>>>> Virus-free. www.avast.com
>>>>
>>>>
>>>> ---
>>>> This email has been checked for viruses by Avast antivirus software.
>>>> https://www.avast.com/antivirus
>>>>
>>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their experiments
>> is infinitely more interesting than any results to which their experiments
>> lead.
>> -- Norbert Wiener
More information about the petsc-users
mailing list