[petsc-dev] PETSc OpenMP benchmarking
Gerard Gorman
g.gorman at imperial.ac.uk
Thu Mar 22 07:59:33 CDT 2012
Hi
We have repeated the benchmarks using the Cray compiler. In terms of
performance, it is a close call between 2MPI+16OMP and 1MPI+32OMP on the
Interlagos. As expected from the EPCC OpenMP micro-benchmarks, this is
faring much better than GCC. Also included are the parallel efficiencies
when using pure MPI on the node, and pure OpenMP. RCM reordering is
used. You can see that the parallel efficiency of the OpenMP
implementation starts to increase as cache sharing takes effect. One
could argue that this effect will diminish for larger problem sizes, but
as the ratio of total memory to number of cores becomes an increasingly
important constraint, these are exactly the kinds of effects you want to
start taking advantage of.
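For reference, the hybrid runs are launched along these lines on the
Cray (a sketch; the aprun flags are from memory and the executable name
is a placeholder, so site defaults may differ):

  # 2 MPI ranks on the node, 16 OpenMP threads each
  export OMP_NUM_THREADS=16
  aprun -n 2 -N 2 -d 16 ./ex6 <options>

  # 1 MPI rank on the node, 32 OpenMP threads
  export OMP_NUM_THREADS=32
  aprun -n 1 -N 1 -d 32 ./ex6 <options>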
Any comments are welcome.
Cheers
Gerard
Gerard Gorman emailed the following on 17/03/12 08:33:
> Hi
>
> We have profiled on Cray compute nodes with two 16-core AMD Opteron
> 2.3GHz Interlagos processors, using the same matrix but this time with
> -ksp_type cg and -pc_type jacobi. Attached are the logs for the 32 MPI
> process and 32 OpenMP thread tests.
>
> Most of the time is in stage 2. As seen previously, MatMult is
> performing well, but the overall performance in KSPSolve drops for
> OpenMP. I have attached a plot of (hybrid MPI+OpenMP time)/(pure
> OpenMP time), where all 32 cores are used in every case. The graph
> shows that pure OpenMP always gives better MatMult performance, but
> that something additional in KSPSolve degrades the OpenMP performance.
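>
> To illustrate where the extra cost could come from: besides MatMult,
> each CG iteration performs inner products and norms, and a threaded
> reduction implies a barrier across the thread team every iteration. A
> minimal sketch of the pattern (an illustration only, not the actual
> PETSc VecDot):
>
>   static double dot(const double *x, const double *y, int n)
>   {
>     double sum = 0.0;
>     /* The reduction, and the implicit barrier at the end of the
>        parallel region, is paid on every CG iteration on top of
>        the MatMult work. */
>   #pragma omp parallel for reduction(+:sum)
>     for (int i = 0; i < n; i++) sum += x[i]*y[i];
>     return sum;
>   }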
>
> So far we have profiled with oprofile measuring the event
> CPU_CLK_UNHALTED, but this has not revealed the bottleneck, so more
> digging is required.
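>
> For completeness, the sampling runs look roughly like this with the
> legacy oprofile tools (a sketch; the event count and flags are
> illustrative):
>
>   opcontrol --setup --no-vmlinux --event=CPU_CLK_UNHALTED:500000
>   opcontrol --start
>   ./ex6 <options>
>   opcontrol --stop && opcontrol --dump
>   opreport -l ./ex6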
>
> Any suggestions/comments gratefully received.
>
> Cheers
> Gerard
>
> Gerard Gorman emailed the following on 14/03/12 16:59:
>> Hi
>>
>> Since Vec and most of Mat are now threaded, we have started to do more
>> detailed profiling. I'm posting these initial tasters from a two-socket
>> Intel Core (Bloomfield) system (i.e. 8 cores in total) to stimulate
>> discussion.
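>>
>> For context, the threaded kernels follow the obvious row-parallel
>> pattern; a minimal sketch of an OpenMP CSR matrix-vector product (an
>> illustration, not the actual PETSc MatMult implementation):
>>
>>   /* y = A*x in CSR format: row i spans [ia[i], ia[i+1]) of ja/a.
>>      Rows are independent so the loop threads cleanly; in practice
>>      NUMA first-touch placement of a, ja, x and y decides scaling. */
>>   void csr_matmult(int m, const int *ia, const int *ja,
>>                    const double *a, const double *x, double *y)
>>   {
>>   #pragma omp parallel for schedule(static)
>>     for (int i = 0; i < m; i++) {
>>       double sum = 0.0;
>>       for (int k = ia[i]; k < ia[i+1]; k++) sum += a[k]*x[ja[k]];
>>       y[i] = sum;
>>     }
>>   }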
>>
>> The matrix comes from a 3D lock exchange problem discretised using a
>> continuous Galerkin finite element formulation and has about 450k
>> degrees of freedom.
>>
>> I have configured the simulator (Fluidity -
>> http://amcg.ese.ic.ac.uk/Fluidity) to dump out PETSc matrices at each
>> solve. These individual matrices are then solved using
>> petsc-dev/src/ksp/ksp/examples/tests/ex6 compiled with GCC 4.6.3
>> --with-debugging=0.
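>>
>> The driver does little more than load the dumped matrix and solve; a
>> minimal sketch of that workflow (not the actual ex6 source; the file
>> name is a placeholder and error checking is omitted):
>>
>>   #include <petscksp.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>     Mat A; Vec x, b; KSP ksp; PetscViewer fd;
>>     PetscInitialize(&argc, &argv, NULL, NULL);
>>     PetscViewerBinaryOpen(PETSC_COMM_WORLD, "matrix.dat",
>>                           FILE_MODE_READ, &fd);
>>     MatCreate(PETSC_COMM_WORLD, &A);
>>     MatLoad(A, fd);                 /* read the dumped matrix */
>>     PetscViewerDestroy(&fd);
>>     MatGetVecs(A, &x, &b);          /* vectors with A's layout */
>>     VecSet(b, 1.0);                 /* placeholder RHS */
>>     KSPCreate(PETSC_COMM_WORLD, &ksp);
>>     KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);
>>     KSPSetFromOptions(ksp);         /* picks up -ksp_type, -pc_type... */
>>     KSPSolve(ksp, b, x);
>>     KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
>>     PetscFinalize();
>>     return 0;
>>   }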
>>
>> The PETSc options are:
>> -get_total_flops -pc_type gamg -ksp_type cg -ksp_rtol 1.0e-6 -log_summary
>>
>> The 3 log files attached are for OMP_NUM_THREADS=1, OMP_NUM_THREADS=8,
>> and a non-threaded MPI run with 8 processes for comparison.
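>>
>> Concretely, the three runs are of the form below (the matrix-file
>> option for ex6 is from memory, so check the example source):
>>
>>   OMP_NUM_THREADS=1 ./ex6 -f <matrix file> <options above>
>>   OMP_NUM_THREADS=8 ./ex6 -f <matrix file> <options above>
>>   mpiexec -n 8 ./ex6 -f <matrix file> <options above>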
>>
>> So the reason this benchmark is interesting is that it is the pressure
>> equation, which is really stiff, and it uses GAMG as a black box.
>>
>> Using xxdiff to compare the logs I think the interesting points are:
>> - Overall OpenMP compares favourably with MPI.
>> - OpenMP converged in two fewer iterations than MPI. I had expected
>> fewer iterations simply because there are no partition boundaries to
>> diminish the effectiveness of coarsening. I have not been following
>> Mark's GAMG development, but it looks like repartitioning is being used
>> to get around that issue (?). However, the biggest plus is that because
>> Chebyshev is used as a smoother (rather than something difficult to
>> parallelise like SSOR), GAMG appears to scale pretty well when threaded
>> with OpenMP; see the sketch after this list.
>> - Important operations like MatMult etc. perform well.
>> - From the summary, "mystage 1" is the main section where OMP appears
>> to need more work. We suffer in operations such as MatPtAP and
>> MatTrnMatMult, which we have not got around to looking at yet.
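>>
>> To expand on the Chebyshev point: a Chebyshev smoother is built
>> entirely from MatMult and AXPY-style updates, with no row-ordering
>> dependency, so every kernel in it threads the same way MatMult does.
>> A minimal sketch of the recurrence, assuming eigenvalue bounds lmin
>> and lmax are available (coefficients follow the textbook Chebyshev
>> semi-iteration, not PETSc's KSPChebyshev code):
>>
>>   #include <petscmat.h>
>>
>>   /* m steps of Chebyshev smoothing for A x = b, using work vectors
>>      r, p and Ap. Only matvecs and vector updates appear. */
>>   void chebyshev_smooth(Mat A, Vec b, Vec x, Vec r, Vec p, Vec Ap,
>>                         double lmin, double lmax, int m)
>>   {
>>     double d = 0.5*(lmax + lmin), c = 0.5*(lmax - lmin);
>>     double alpha = 0.0, beta;
>>     MatMult(A, x, r); VecAYPX(r, -1.0, b);   /* r = b - A x */
>>     for (int k = 0; k < m; k++) {
>>       if (k == 0) { VecCopy(r, p); alpha = 1.0/d; }
>>       else {
>>         beta  = 0.25*c*c*alpha*alpha;
>>         alpha = 1.0/(d - beta/alpha);
>>         VecAYPX(p, beta, r);                 /* p = r + beta p */
>>       }
>>       VecAXPY(x, alpha, p);                  /* x += alpha p   */
>>       MatMult(A, p, Ap);
>>       VecAXPY(r, -alpha, Ap);                /* r -= alpha A p */
>>     }
>>   }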
>>
>> As this is a relatively small and boring UMA machine I have not bothered
>> with scaling curves. We are setting the same benchmark up on 32-core
>> Interlagos compute nodes at the moment - hopefully these will be ready
>> by tomorrow.
>>
>> Comments welcome.
>>
>> Cheers
>> Gerard
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: compare.tar.gz
Type: application/x-gzip
Size: 18879 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20120322/dc06e035/attachment.gz>