[petsc-dev] PETSc OpenMP benchmarking

Wed Mar 14 11:59:49 CDT 2012

Hi

Since Vec and most of Mat are now threaded, we have started to do more
detailed profiling. I'm posting these initial tasters from a two-socket
Intel Core Bloomfield processor system (i.e. 8 cores) to stimulate
discussion.

The matrix comes from a 3D lock exchange problem discretised using a
continuous Galerkin finite element formulation and has about 450k
degrees of freedom.

I have configured the simulator (Fluidity -
http://amcg.ese.ic.ac.uk/Fluidity) to dump out PETSc matrices at each
solve. These individual matrices are then solved using
petsc-dev/src/ksp/ksp/examples/tests/ex6, compiled with GCC 4.6.3 and
with PETSc configured using --with-debugging=0.

The PETSc options are:
-get_total_flops -pc_type gamg -ksp_type cg -ksp_rtol 1.0e-6 -log_summary

The 3 log files attached are for OMP_NUM_THREADS=1, OMP_NUM_THREADS=8
and a non-threaded MPI run with 8 processes for comparison.
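
For anyone wanting to repeat the comparison, the three configurations
above can be written down as a small sketch. This is not the script
that was actually used; the "-f" input option for ex6, the "mpiexec"
launcher name, the "./ex6" path and the matrix file name are all
assumptions for illustration. Only the PETSc options and the
OMP_NUM_THREADS values come from this post.

```python
import shlex

# The PETSc options quoted in this post.
PETSC_OPTIONS = [
    "-get_total_flops", "-pc_type", "gamg", "-ksp_type", "cg",
    "-ksp_rtol", "1.0e-6", "-log_summary",
]


def build_runs(matrix_file, exe="./ex6"):
    """Return (label, extra_env, argv) for the three runs being compared."""
    # Assumption: ex6 reads the dumped matrix via "-f <file>".
    solve = [exe, "-f", matrix_file] + PETSC_OPTIONS
    return [
        ("omp1", {"OMP_NUM_THREADS": "1"}, solve),
        ("omp8", {"OMP_NUM_THREADS": "8"}, solve),
        # Assumption: the MPI launcher is called mpiexec.
        ("mpi8", {}, ["mpiexec", "-n", "8"] + solve),
    ]


def as_shell(env, argv):
    """Format one run as a single shell command line."""
    prefix = " ".join(f"{k}={v}" for k, v in env.items())
    cmd = " ".join(shlex.quote(a) for a in argv)
    return f"{prefix} {cmd}".strip()


if __name__ == "__main__":
    # "matrix.dat" is a hypothetical file name for one dumped matrix.
    for label, env, argv in build_runs("matrix.dat"):
        print(f"{label}: {as_shell(env, argv)}")
```

Each printed line can be redirected to its own log file so the three
-log_summary outputs can be diffed afterwards.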

This benchmark is interesting because the matrix is for pressure, which
is really stiff, and because it uses GAMG as a black box.

Using xxdiff to compare the logs I think the interesting points are:
- Overall OpenMP compares favourably with MPI.
- OpenMP converged in 2 fewer iterations than MPI. Earlier I was
expecting fewer iterations simply because there are no partitions
to diminish the effectiveness of coarsening. I have not been following
Mark's GAMG development, but it looks like repartitioning is being used
to get around that issue (?). However, the biggest plus is that, because
Chebychev is used as a smoother (rather than something difficult to
parallelise like SSOR), GAMG appears to scale pretty well when threaded
with OpenMP.
- Important operations like MatMult etc. perform well.
- From the summary, "mystage 1" is the main section where OMP appears to
need more work. We suffer from operations such as MatPtAP and
MatTrnMatMult, which we have not got around to looking at yet.
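
To illustrate the point about the smoother: a Chebyshev iteration is
built only from matrix-vector products and vector updates, exactly the
Vec/Mat kernels that are already threaded, whereas an SSOR sweep has a
sequential dependency between rows. The following is a minimal textbook
sketch in plain Python, not PETSc's implementation; the function name
and the assumption that eigenvalue bounds lmin and lmax are supplied by
the caller are mine.

```python
def chebyshev(matvec, b, x, lmin, lmax, iters):
    """Textbook Chebyshev iteration for an SPD operator with
    eigenvalues assumed to lie in [lmin, lmax]."""
    theta = 0.5 * (lmax + lmin)   # centre of the eigenvalue interval
    delta = 0.5 * (lmax - lmin)   # half-width of the interval
    sigma = theta / delta
    rho = 1.0 / sigma
    # Initial residual r = b - A x and first correction d = r / theta.
    r = [bi - axi for bi, axi in zip(b, matvec(x))]
    d = [ri / theta for ri in r]
    for _ in range(iters):
        # Every step is one matvec plus element-wise vector updates;
        # there is no row-to-row dependency as in an SSOR sweep.
        x = [xi + di for xi, di in zip(x, d)]
        ad = matvec(d)
        r = [ri - adi for ri, adi in zip(r, ad)]
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = [rho_new * rho * di + (2.0 * rho_new / delta) * ri
             for di, ri in zip(d, r)]
        rho = rho_new
    return x


if __name__ == "__main__":
    # Toy diagonal operator with eigenvalues 1..10 (illustration only).
    def diag_matvec(v):
        return [(i + 1) * vi for i, vi in enumerate(v)]

    x = chebyshev(diag_matvec, [1.0] * 10, [0.0] * 10, 1.0, 10.0, 60)
    err = max(abs(xi - 1.0 / (i + 1)) for i, xi in enumerate(x))
    print("max error:", err)
```

When used as a multigrid smoother the interval would normally cover only
the upper part of the spectrum rather than all of it; the sketch uses
the full interval just to keep the example self-checking.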

As this is a relatively small and boring UMA machine I have not bothered
with scaling curves. We are setting the same benchmark up on 32-core
Interlagos compute nodes at the moment - hopefully these will be ready
by tomorrow.

Comments welcome.

Cheers
Gerard

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lock_exchange.tar.gz
Type: application/x-gzip
Size: 7559 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20120314/c96b4c35/attachment.gz>