[petsc-dev] PETSc OpenMP benchmarking

Sat Mar 17 10:13:03 CDT 2012

The two tests are a bit different: 1) one has "mystage" and 2) one does about 50% more iterations for some reason.  It would make me feel better if the semantics of both programs were the same.  Why are you getting more iterations with pure MPI?

That said, looking at the flop rates, the output does not seem to match your plots:  the output files have almost the same flop rates for MatMult (but there is a load balance ratio of 1.4 for pure MPI, so I would look at the partitioning of the problem and try to get perfect load balance, which should not be hard, I would think ...).  KSPSolve is degraded by the vector operations (AXPY, norms, VecPointwiseMult, etc.) in pure OpenMP.  This is counterintuitive: there is no communication in these routines and they are dead simple.  I would verify this carefully with the exact same test and, if it holds up, dig in with perf tools.
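
For reference, the ratio column in -log_summary is the maximum over the minimum across processes, so 1.0 is perfect balance. Below is a minimal sketch of that ratio and of an even row split. The function names are hypothetical, and rows are only a rough proxy for MatMult work (nonzeros per process matter more), so treat this as an illustration rather than what PETSc does internally.

```python
def load_balance_ratio(per_process_work):
    """Return max/min of the per-process work; 1.0 means perfect balance."""
    smallest = min(per_process_work)
    if smallest <= 0:
        raise ValueError("every process must own some work")
    return max(per_process_work) / smallest


def even_row_split(n_rows, n_procs):
    """Split n_rows as evenly as possible; sizes differ by at most one row."""
    base, extra = divmod(n_rows, n_procs)
    return [base + 1 if rank < extra else base for rank in range(n_procs)]


if __name__ == "__main__":
    # Assumed sizes: roughly 450k rows (the size quoted in the thread), 32 ranks.
    rows = even_row_split(450_000, 32)
    print("rows per rank (min, max):", min(rows), max(rows))
    print("row-count ratio:", load_balance_ratio(rows))
```

Feeding the per-process nonzero counts of the actual partition into load_balance_ratio would show whether the 1.4 comes from the row distribution or from uneven row density.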

Mark

On Mar 17, 2012, at 4:33 AM, Gerard Gorman wrote:

> Hi
> 
> We have profiled on Cray compute nodes with two 16-core AMD Opteron
> 2.3GHz Interlagos processors, using the same matrix but this time with
> -ksp_type cg and -pc_type jacobi. Attached are the logs with the 32 MPI
> processes and the 32 OpenMP threads tests.
> 
> Most of the time is in stage 2. As seen previously, MatMult is
> performing well, but the overall performance in KSPSolve drops for
> OpenMP. I have attached a plot of the ratio (hybrid MPI+OpenMP
> time)/(pure OpenMP time), where all 32 cores are always used. What the
> graph shows is that
> we are always getting better performance in MatMult for pure OpenMP but
> there is something additional in KSPSolve that degrades the OpenMP
> performance.
> 
> So far we have profiled with oprofile measuring the event
> CPU_CLK_UNHALTED, but this has not shown up the bottleneck. So more
> digging is required.
> 
> Any suggestions/comments gratefully received. 
> 
> Cheers
> Gerard
> 
> Gerard Gorman emailed the following on 14/03/12 16:59:
>> Hi
>> 
>> Since Vec and most of Mat is now threaded we have started to do more
>> detailed profiling. I'm posting these initial tasters from a two-socket
>> Intel Core Bloomfield processor system (i.e. 8 cores) to stimulate
>> discussion.
>> 
>> The matrix comes from a 3D lock exchange problem discretised using a
>> continuous Galerkin finite element formulation and has about 450k
>> degrees of freedom.
>> 
>> I have configured the simulator (Fluidity -
>> http://amcg.ese.ic.ac.uk/Fluidity) to dump out PETSc matrices at each
>> solve. These individual matrices are then solved using
>> petsc-dev/src/ksp/ksp/examples/tests/ex6 compiled with GCC 4.6.3
>> --with-debugging=0.
>> 
>> The PETSc options are:
>> -get_total_flops -pc_type gamg -ksp_type cg -ksp_rtol 1.0e-6 -log_summary
>> 
>> The 3 log files attached are for OMP_NUM_THREADS=1, OMP_NUM_THREADS=8
>> and non-threaded MPI run with 8 processes for comparison.
>> 
>> This benchmark is interesting because it is the pressure matrix,
>> which is really stiff, and it uses GAMG as a black box.
>> 
>> Using xxdiff to compare the logs I think the interesting points are:
>> - Overall OpenMP compares favourably with MPI.
>> - OpenMP converged in 2 fewer iterations than MPI. Earlier I was
>> expecting fewer iterations simply because of the absence of partitions
>> to diminish the effectiveness of coarsening. I have not been following
>> Mark's GAMG development but it looks like repartitioning is being used
>> to get around that issue (?). However, the biggest plus is that, because
>> Chebychev is used as a smoother (rather than something difficult to
>> parallelise like SSOR), GAMG appears to scale pretty well when threaded
>> with OpenMP.
>> - Important operations like MatMult etc perform well.
>> - From the summary, "mystage 1" is the main section where OMP appears to
>> need more work. We suffer from operations such as MatPtAP and
>> MatTrnMatMult, for example, which we have not got around to looking at
>> yet.
>> 
>> As this is a relatively small and boring UMA machine I have not bothered
>> with scaling curves. We are setting the same benchmark up on 32-core
>> Interlagos compute nodes at the moment - hopefully these will be ready
>> by tomorrow.
>> 
>> Comments welcome.
>> 
>> Cheers
>> Gerard
>> 
> 
> <pressure-matrix-cg-32mpi.dat><pressure-matrix-cg-jacobi-1mpi-32omp.dat><pressure-matrix-cg-hybrid_speedup.pdf>