[petsc-users] large PetscCommDuplicate overhead

Barry Smith bsmith at mcs.anl.gov
Wed Oct 5 15:41:53 CDT 2016


> On Oct 5, 2016, at 2:30 PM, Matthew Overholt <overholt at capesim.com> wrote:
> 
> Hi Petsc-Users,
>  
> I am trying to understand an issue where PetscCommDuplicate() calls are taking an increasing percentage of time as I run a fixed-sized problem on more processes.
>  
> I am using the FEM to solve the steady-state heat transfer equation (K.x = q) using a PC direct solver, like MUMPS.
>  
> I am running on the NERSC Cray XC30, which has two 12-core Xeons per node, and profiling the code using CrayPat sampling.
>  
> On a typical problem (1E+6 finite elements), running on a single node:
> -for 2 cores (1 on each Xeon), about 1% of the time is in PetscCommDuplicate() (on process 1; on the root it is less), and (for reference) 9% of total time is in MUMPS.
> -for 8 cores (4 on each Xeon), over 6% of the time is in PetscCommDuplicate() (on every process except the root, where it is <1%), and 9-10% of total time is in MUMPS.

   What does PetscCommDuplicate() have to do with MUMPS? Nothing at all, you are just giving its time for comparison?

>  
> What is the large PetscCommDuplicate time connected to, an increasing number of messages (tags)?  Would using fewer MatSetValues() and VecSetValues() calls (with longer message lengths) alleviate this?

   No, PetscCommDuplicate() won't increase with more messages or calls to XXXSetValues(); PetscCommDuplicate() is essentially called only on the creation of new PETSc objects. It should also be fast since it basically needs to do just an MPI_Attr_get(). With more processes but the same problem size and code there should be pretty much the same number of objects created.
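
   For reference, here is a minimal sketch of the attribute-caching pattern that makes this cheap after the first object is created on a communicator (the keyval setup and helper below are illustrative only, not PETSc's actual internals):

/* Sketch of communicator caching via MPI attributes: the first call duplicates
 * the communicator once, every later call is just one attribute lookup.
 * Illustrative only; the keyval and helper are not PETSc's internals. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int cached_keyval = MPI_KEYVAL_INVALID;

static MPI_Comm get_inner_comm(MPI_Comm comm)
{
  MPI_Comm *inner;
  int       found;

  if (cached_keyval == MPI_KEYVAL_INVALID)
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                           &cached_keyval, NULL);

  /* Fast path: a single attribute lookup, no communication. */
  MPI_Comm_get_attr(comm, cached_keyval, &inner, &found);
  if (found) return *inner;

  /* Slow path: taken only for the first object created on this communicator.
   * (A real implementation frees this in the keyval's delete callback.) */
  inner = (MPI_Comm *)malloc(sizeof(MPI_Comm));
  MPI_Comm_dup(comm, inner);               /* collective, done only once */
  MPI_Comm_set_attr(comm, cached_keyval, inner);
  return *inner;
}

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  MPI_Comm a = get_inner_comm(MPI_COMM_WORLD);   /* MPI_Comm_dup happens here */
  MPI_Comm b = get_inner_comm(MPI_COMM_WORLD);   /* attribute hit only */
  if (a == b) printf("second call reused the cached communicator\n");
  MPI_Finalize();
  return 0;
}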

   PetscSpinlockLock() does nothing if you are not using threads so it won't take any time.

   Is there a way to see where it is spending its time inside PetscCommDuplicate()? Perhaps the Cray MPI_Attr_get() has issues.
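
   One way to check that independently of PETSc is a standalone micro-benchmark of the attribute lookup itself; a rough sketch (the loop count is arbitrary):

/* Times repeated MPI attribute lookups on MPI_COMM_WORLD to see whether the
 * MPI library's MPI_Comm_get_attr()/MPI_Attr_get() path itself is slow. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int       keyval, dummy = 42, *val, found, i, rank;
  const int nloops = 100000;
  double    t0, t1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                         &keyval, NULL);
  MPI_Comm_set_attr(MPI_COMM_WORLD, keyval, &dummy);

  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (i = 0; i < nloops; i++)
    MPI_Comm_get_attr(MPI_COMM_WORLD, keyval, &val, &found);
  t1 = MPI_Wtime();

  printf("[rank %d] %d attribute lookups: %g s (%g us per call)\n",
         rank, nloops, t1 - t0, 1e6 * (t1 - t0) / nloops);
  MPI_Finalize();
  return 0;
}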

   Barry

>  
> For reference, the PETSc calling sequence in the code is as follows.
>     // Create the solution and RHS vectors
>     ierr = VecCreate(petscData->mpicomm,&mesh->hpx);
>     ierr = PetscObjectSetName((PetscObject) mesh->hpx, "Solution");
>     ierr = VecSetSizes(mesh->hpx,mesh->lxN,mesh->neqns); // size = # of equations; distribution to match mesh
>     ierr = VecSetFromOptions(mesh->hpx);    // allow run time options
>     ierr = VecDuplicate(mesh->hpx,&q);      // create the RHS vector
>     // Create the stiffness matrix
>     ierr = MatCreate(petscData->mpicomm,&K);
>     ierr = MatSetSizes(K,mesh->lxN,mesh->lxN,mesh->neqns,mesh->neqns);
>     ierr = MatSetType(K,MATAIJ);  // default sparse type
>     // Do preallocation
>     ierr = MatMPIAIJSetPreallocation(K,d_nz,NULL,o_nz,NULL);
>     ierr = MatSeqAIJSetPreallocation(K,d_nz,NULL);
>     ierr = MatSetOption(K,MAT_NEW_NONZERO_ALLOCATION_ERR,PETSC_FALSE);
>     ierr = MatSetUp(K);
>     // Create and set up the KSP context as a PreConditioner Only (Direct) Solution
>     ierr = KSPCreate(petscData->mpicomm,&ksp);
>     ierr = KSPSetOperators(ksp,K,K);
>     ierr = KSPSetType(ksp,KSPPREONLY);
>     // Set the temperature vector
>     ierr = VecSet(mesh->hpx,mesh->Tmin);
>     // Set the default PC method as MUMPS
>     ierr = KSPGetPC(ksp,&pc);     // extract the preconditioner
>     ierr = PCSetType(pc,PCLU);    // set pc options
>     ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);
>     ierr = KSPSetFromOptions(ksp);
>     
>     // Set the values for the K matrix and q vector
>         // which involves a lot of these calls
>         ierr = MatSetValues(K,mrows,idxm,ncols,idxn,pKe,ADD_VALUES);    // 1 call per matrix row (equation)
>         ierr = VecSetValues(q,nqe,ixn,pqe,ADD_VALUES);      // 1 call per element
>     ierr = VecAssemblyBegin(q);
>     ierr = MatAssemblyBegin(K,MAT_FINAL_ASSEMBLY);
>     ierr = VecAssemblyEnd(q);
>     ierr = MatAssemblyEnd(K,MAT_FINAL_ASSEMBLY);
>  
>     // Solve //////////////////////////////////////
>     ierr = KSPSolve(ksp,q,mesh->hpx);
> ...
> *Note that the code evenly divides the finite elements over the total number of processors, and I am using ghosting of the FE vertices vector to handle the vertices that are needed on more than 1 process.
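
A minimal sketch of the kind of ghosted vertex vector described in that note (not the poster's code; the sizes and ghost pattern below are placeholders), using VecCreateGhost() with VecGhostUpdateBegin/End():

/* Sketch: a vertex-based vector with ghost entries for vertices owned by a
 * neighboring rank. Sizes and the ghost pattern are placeholders. */
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            vtx;                /* vertex-based field, e.g. temperature */
  PetscInt       nlocal = 1000;      /* vertices owned by this rank (placeholder) */
  PetscInt       nghost, ghosts[1];
  PetscMPIInt    rank, size;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
  MPI_Comm_size(PETSC_COMM_WORLD, &size);

  /* Placeholder ghost pattern: each rank also needs the first vertex owned
   * by the next rank (standing in for the shared FE vertices). */
  nghost    = (size > 1) ? 1 : 0;
  ghosts[0] = ((rank + 1) % size) * nlocal;

  ierr = VecCreateGhost(PETSC_COMM_WORLD, nlocal, PETSC_DECIDE, nghost, ghosts, &vtx);CHKERRQ(ierr);

  /* ... fill vtx with VecSetValues() and assemble, as in the code above ... */

  /* Propagate owned values into the local ghost slots before the element loops. */
  ierr = VecGhostUpdateBegin(vtx, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecGhostUpdateEnd(vtx, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);

  ierr = VecDestroy(&vtx);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}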
>  
> Thanks in advance for your help,
> Matt Overholt
> CapeSym, Inc.
> 


