<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Oct 6, 2016 at 10:55 AM, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

   Matt,<br>

<br>

    Thanks for this information. It sure looks like there is something seriously wrong with the MPI_Attr_get() on the cray for non-root process. Does any PETSc developer have access to such a machine? We need to write a test program that just calls MPI_Attr_get a bunch of times (no PETSc) to see if we can reproduce the problem and report it to Cray.<br><span class="HOEnZb"><font color="#888888"></font></span></blockquote><div><br></div><div>Barry, if you write it, we can give it to Patrick Sanan to run.</div><div><br></div><div>  Thanks,</div><div><br></div><div>    Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="HOEnZb"><font color="#888888">   Barry<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

<br>

<br>

On Oct 6, 2016, at 10:45 AM, Matthew Overholt <<a href="mailto:overholt@capesim.com">overholt@capesim.com</a>> wrote:<br>

><br>

><br>

> Matthew and Barry,<br>

><br>

> 1) I did a direct measurement of PetscCommDuplicate() time by tracing just<br>

> that call (using CrayPat), and confirmed the sampling results.  For 8<br>

> processes (n=8), tracing counted a total of 101 calls, taking ~0 time on the<br>

> root process but taking 11.78 seconds (6.3% of 188 total seconds) on each of<br>

> the other 7 processes.  For 16 processes (n=16, still only 1 node), tracing<br>

> counted 102 total calls for a total of 18.42 seconds (13.2% of 139.6 total<br>

> seconds) on every process except the root.<br>

><br>

> 2) Copied below is a section of the log view for the first two solutions for<br>

> n=2, which shows the same calls as for n=8. (I can send the entire log files<br>

> if desired.)  In each case I count about 44 PCD calls per process during<br>

> initialization and meshing, 7 calls during setup, 9 calls for the first<br>

> solution, then 3 calls for each subsequent solution (fixed-point iteration),<br>

> and 3 calls to write out the solution, for 75 total.<br>

><br>

> 3) I would expect that the administrators of this machine have configured<br>

> PETSc appropriately. I am using their current default install, which is<br>

> 3.7.2.<br>

> <a href="https://www.nersc.gov/users/software/programming-libraries/math-libraries/pe" rel="noreferrer" target="_blank">https://www.nersc.gov/users/<wbr>software/programming-<wbr>libraries/math-libraries/pe</a><br>

> tsc/<br>

><br>

> 4) Yes, I just gave the MUMPS time as a comparison.<br>

><br>

> 5) As to where it is spending time, perhaps the timing results in the log<br>

> files will be helpful. The "Solution took ..." printouts give the total<br>

> solution time for that iteration, the others are incremental times.  (As an<br>

> aside, I have been wondering why the solution times do not scale well with<br>

> process count, even though that work is entirely done in parallel PETSc<br>

> routines.)<br>

><br>

> Thanks,<br>

> Matt Overholt<br>

><br>

><br>

> ********** -log_view -info results for n=2 : the first solution and<br>

> subsequent fixed-point iteration ***********<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688<br>

> -2080374779<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688<br>

> -2080374779<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688<br>

> -2080374781<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> Matrix setup took 0.108 s<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688<br>

> -2080374779<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688<br>

> -2080374781<br>

> KSP PC setup took 0.079 s<br>

> [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.<br>

> [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.<br>

> [0] MatStashScatterBegin_Ref(): No of messages: 0<br>

> [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>

> [1] MatStashScatterBegin_Ref(): No of messages: 1<br>

> [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes<br>

> [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 5 mallocs.<br>

> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space:<br>

> 1050106 unneeded,15128672 used<br>

> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27<br>

> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows<br>

> 599214) < 0.6. Do not use CompressedRow routines.<br>

> [1] MatSeqAIJCheckInode(): Found 599214 nodes out of 599214 rows. Not using<br>

> Inode routines<br>

> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space:<br>

> 1237634 unneeded,15545404 used<br>

> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27<br>

> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows<br>

> 621594) < 0.6. Do not use CompressedRow routines.<br>

> [0] MatSeqAIJCheckInode(): Found 621594 nodes out of 621594 rows. Not using<br>

> Inode routines<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter<br>

> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter<br>

> [0] VecScatterCreate(): General case: MPI to Seq<br>

> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space:<br>

> 5257543 unneeded,136718 used<br>

> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 89<br>

> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19<br>

> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows<br>

> 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.<br>

> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space:<br>

> 5464978 unneeded,136718 used<br>

> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 490<br>

> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16<br>

> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows<br>

> 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.<br>

> K and q SetValues took 26.426 s<br>

> [0] PCSetUp(): Setting up PC for first time<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [0] VecScatterCreate(): Special case: processor zero gets entire parallel<br>

> vector, rest get none<br>

> ** Max-trans not allowed because matrix is distributed<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] PCSetUp(): Leaving PC with identical preconditioner since operator is<br>

> unchanged<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter<br>

> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter<br>

> [0] VecScatterCreate(): General case: Seq to MPI<br>

> [1] VecScatterCreate(): General case: Seq to MPI<br>

> Solution took 102.21 s<br>

><br>

> NL iteration 0: delta = 32.0488 67.6279.<br>

> Error delta calc took 0.045 s<br>

> Node and Element temps update took 0.017 s<br>

> [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.<br>

> [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.<br>

> [0] MatStashScatterBegin_Ref(): No of messages: 0<br>

> [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>

> [1] MatStashScatterBegin_Ref(): No of messages: 1<br>

> [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes<br>

> [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 0 mallocs.<br>

> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space: 0<br>

> unneeded,15128672 used<br>

> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27<br>

> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows<br>

> 599214) < 0.6. Do not use CompressedRow routines.<br>

> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space: 0<br>

> unneeded,15545404 used<br>

> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27<br>

> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows<br>

> 621594) < 0.6. Do not use CompressedRow routines.<br>

> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space: 0<br>

> unneeded,136718 used<br>

> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19<br>

> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows<br>

> 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.<br>

> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space: 0<br>

> unneeded,136718 used<br>

> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16<br>

> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows<br>

> 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.<br>

> K and q SetValues took 2.366 s<br>

> [0] PCSetUp(): Setting up PC with same nonzero pattern<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter<br>

> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter<br>

> [0] VecScatterCreate(): General case: Seq to MPI<br>

> [1] VecScatterCreate(): General case: Seq to MPI<br>

> Solution took 82.156 s<br>

><br>

> -----Original Message-----<br>

> From: Barry Smith [mailto:<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>]<br>

> Sent: Wednesday, October 05, 2016 4:42 PM<br>

> To: <a href="mailto:overholt@capesim.com">overholt@capesim.com</a><br>

> Cc: <a href="mailto:petsc-users@mcs.anl.gov">petsc-users@mcs.anl.gov</a><br>

> Subject: Re: [petsc-users] large PetscCommDuplicate overhead<br>

><br>

><br>

>> On Oct 5, 2016, at 2:30 PM, Matthew Overholt <<a href="mailto:overholt@capesim.com">overholt@capesim.com</a>> wrote:<br>

>><br>

>> Hi Petsc-Users,<br>

>><br>

>> I am trying to understand an issue where PetscCommDuplicate() calls are<br>

> taking an increasing percentage of time as I run a fixed-sized problem on<br>

> more processes.<br>

>><br>

>> I am using the FEM to solve the steady-state heat transfer equation (K.x =<br>

> q) using a PC direct solver, like MUMPS.<br>

>><br>

>> I am running on the NERSC Cray X30, which has two Xeon's per node with 12<br>

> cores each, and profiling the code using CrayPat sampling.<br>

>><br>

>> On a typical problem (1E+6 finite elements), running on a single node:<br>

>> -for 2 cores (1 on each Xeon), about 1% of time is PetscCommDuplicate (on<br>

> process 1, but on the root it is less), and (for reference) 9% of total time<br>

> is for MUMPS.<br>

>> -for 8 cores (4 on each Xeon), over 6% of time is PetscCommDuplicate (on<br>

> every process except the root, where it is <1%), and 9-10% of total time is<br>

> for MUMPS.<br>

><br>

>   What does PetscCommDuplicate() have to do with MUMPS? Nothing at all, you<br>

> are just giving its time for comparison?<br>

><br>

>><br>

>> What is the large PetscCommDuplicate time connected to, an increasing<br>

> number of messages (tags)?  Would using fewer MatSetValues() and<br>

> VecSetValues() calls (with longer message lengths) alleviate this?<br>

><br>

>   No PetscCommDuplicate won't increate with more messages or calls to<br>

> XXXSetValues(). PetscCommDuplicate() is only called essentially on the<br>

> creation of new PETSc objects.  It should also be fast since it basically<br>

> needs to do just a MPI_Attr_get(). With more processes but the same problem<br>

> size and code there should be pretty much the same number of objects<br>

> created.<br>

><br>

>   PetscSpinlockLock() does nothing if you are not using threads so it won't<br>

> take any time.<br>

><br>

>   Is there a way to see where it is spending its time inside the<br>

> PetscCommDuplicate()?  Perhaps the Cray MPI_Attr_get() has issues.<br>

><br>

>   Barry<br>

><br>

><br>

><br>

><br>

><br>

><br>

>><br>

>> For reference, the PETSc calling sequence in the code is as follows.<br>

>>    // Create the solution and RHS vectors<br>

>>    ierr = VecCreate(petscData->mpicomm,&<wbr>mesh->hpx);<br>

>>   ierr = PetscObjectSetName((<wbr>PetscObject) mesh->hpx, "Solution");<br>

>>    ierr = VecSetSizes(mesh->hpx,mesh-><wbr>lxN,mesh->neqns); // size = # of<br>

> equations; distribution to match mesh<br>

>>    ierr = VecSetFromOptions(mesh->hpx);    // allow run time options<br>

>>    ierr = VecDuplicate(mesh->hpx,&q);      // create the RHS vector<br>

>>    // Create the stiffnexx matrix<br>

>>    ierr = MatCreate(petscData->mpicomm,&<wbr>K);<br>

>>    ierr = MatSetSizes(K,mesh->lxN,mesh-><wbr>lxN,mesh->neqns,mesh->neqns);<br>

>>    ierr = MatSetType(K,MATAIJ);  // default sparse type<br>

>>    // Do preallocation<br>

>>    ierr = MatMPIAIJSetPreallocation(K,d_<wbr>nz,NULL,o_nz,NULL);<br>

>>    ierr = MatSeqAIJSetPreallocation(K,d_<wbr>nz,NULL);<br>

>>    ierr = MatSetOption(K,MAT_NEW_<wbr>NONZERO_ALLOCATION_ERR,PETSC_<wbr>FALSE);<br>

>>    ierr = MatSetUp(K);<br>

>>    // Create and set up the KSP context as a PreConditioner Only (Direct)<br>

> Solution<br>

>>    ierr = KSPCreate(petscData->mpicomm,&<wbr>ksp);<br>

>>    ierr = KSPSetOperators(ksp,K,K);<br>

>>    ierr = KSPSetType(ksp,KSPPREONLY);<br>

>>    // Set the temperature vector<br>

>>    ierr = VecSet(mesh->hpx,mesh->Tmin);<br>

>>    // Set the default PC method as MUMPS<br>

>>    ierr = KSPGetPC(ksp,&pc);     // extract the preconditioner<br>

>>    ierr = PCSetType(pc,PCLU);    // set pc options<br>

>>    ierr = PCFactorSetMatSolverPackage(<wbr>pc,MATSOLVERMUMPS);<br>

>>    ierr = KSPSetFromOptions(ksp);<br>

>><br>

>>    // Set the values for the K matrix and q vector<br>

>>        // which involves a lot of these calls<br>

>>        ierr = MatSetValues(K,mrows,idxm,<wbr>ncols,idxn,pKe,ADD_VALUES);    //<br>

> 1 call per matrix row (equation)<br>

>>        ierr = VecSetValues(q,nqe,ixn,pqe,<wbr>ADD_VALUES);      // 1 call per<br>

> element<br>

>>    ierr = VecAssemblyBegin(q);<br>

>>    ierr = MatAssemblyBegin(K,MAT_FINAL_<wbr>ASSEMBLY);<br>

>>    ierr = VecAssemblyEnd(q);<br>

>>    ierr = MatAssemblyEnd(K,MAT_FINAL_<wbr>ASSEMBLY);<br>

>><br>

>>    // Solve //////////////////////////////<wbr>////////<br>

>>    ierr = KSPSolve(ksp,q,mesh->hpx);<br>

>> ...<br>

>> *Note that the code evenly divides the finite elements over the total<br>

> number of processors, and I am using ghosting of the FE vertices vector to<br>

> handle the vertices that are needed on more than 1 process.<br>

>><br>

>> Thanks in advance for your help,<br>

>> Matt Overholt<br>

>> CapeSym, Inc.<br>

>><br>

>>      Virus-free. <a href="http://www.avast.com" rel="noreferrer" target="_blank">www.avast.com</a><br>

><br>

><br>

> ---<br>

> This email has been checked for viruses by Avast antivirus software.<br>

> <a href="https://www.avast.com/antivirus" rel="noreferrer" target="_blank">https://www.avast.com/<wbr>antivirus</a><br>

><br>

<br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>

</div></div>