<div class="gmail_quote">Sorry about the slow response.</div><div class="gmail_quote"><br></div><div class="gmail_quote">On Thu, Feb 16, 2012 at 10:33, Gerard Gorman <span dir="ltr">&lt;<a href="mailto:g.gorman@imperial.ac.uk">g.gorman@imperial.ac.uk</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi<br>

<br>

I have been running benchmarks on the OpenMP branch of petsc-dev on an<br>

Intel Westmere (Intel(R) Xeon(R) CPU X5670 @ 2.93GHz).<br>

<br>

You can see all graphs + test code + code to generate results in the tar<br>

ball linked below and I am just going to give a quick summary here.<br>

<a href="http://amcg.ese.ic.ac.uk/~ggorman/omp_vec_benchmarks.tar.gz" target="_blank">http://amcg.ese.ic.ac.uk/~ggorman/omp_vec_benchmarks.tar.gz</a><br>

<br>

There are 3 sets of results:<br>

<br>

gcc/ : GCC 4.6<br>

intel/ : Intel 12.0 with MKL<br>

intel-pinning/ : as above but applying hard affinity.<br>

<br>

Files matching  mpi_*.pdf show the MPI speedup and parallel efficiency<br>

for a range of vector sizes. Similarly for omp_*.pdf with respect to<br>

OpenMP. The remaining files directly compare scaling of MPI vs OpenMP<br>

for the various tests for the largest vector size.<br>

<br>

I think the results are very encouraging and there are many interesting<br>

little details in there. I am just going to summarise a few here that I<br>

think are particularly important.<br>

<br>

1. In most cases the threaded code performs as well as, and in many<br>

cases better than the MPI code.<br></blockquote><div><br></div><div>Cool.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

2. For GCC I did not use a threaded blas. For Intel I used<br>

-lmkl_intel_thread. However, it appears dnrm2 is not threaded. It seems<br>

to be a common feature among other threaded blas libraries that Level 1<br>

is not completely threaded (e.g. cray). Unfortunately most of this is<br>

experience/anecdotal information. I do not know of any proper survey. We<br>

have the option here of either rolling our own or ignoring the issue<br>

until profiling shows it is a problem...and eventually someone else will<br>

release a fully threaded blas.<br></blockquote><div><br></div><div>Even much of BLAS2 is not threaded in MKL. Note that opening a parallel region is actually quite expensive, so (perhaps ironically), MPI is expected to perform better than threading when parallel regions involve relatively little work. In the case of BLAS-1, only quite large sizes could possibly pay for spawning a parallel region.</div>

<div><br></div><div>This is why I think the long-term solutions for threading involve long-lived threads with mostly-private memory that prefer redundant computation so that you only have to pay for synchronization instead of also having to pay dearly to use interfaces. Unfortunately, I think the current threaded programming models are challenging to use in this way and it imposes some extra complexity on users.</div>
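<div><br></div><div>To make the break-even argument concrete, here is a minimal sketch of a cost model. It assumes a fixed cost to open a parallel region and perfect scaling of the per-element work across threads. Both are simplifying assumptions of mine, not measurements, and the function name and input values are placeholders chosen only to show the shape of the result. Bandwidth-bound BLAS-1 kernels will usually scale worse than this, which pushes the break-even length higher.</div>

```python
import math


def break_even_length(region_overhead_s, per_element_s, nthreads):
    """Smallest vector length n for which the threaded time beats serial.

    Assumed model (hypothetical, for illustration only):
        serial   = n * per_element_s
        threaded = region_overhead_s + n * per_element_s / nthreads
    """
    if nthreads < 2:
        # With one thread the region overhead can never be recouped.
        return math.inf
    # Time saved per element by spreading the work over nthreads.
    saved_per_element = per_element_s * (1.0 - 1.0 / nthreads)
    # Threading wins once n * saved_per_element exceeds the overhead.
    return math.floor(region_overhead_s / saved_per_element) + 1


# Placeholder inputs, not benchmark results.
n_star = break_even_length(region_overhead_s=2e-6, per_element_s=1e-9, nthreads=4)
print(n_star)
```

<div>Under this model, any vector shorter than the returned length is faster in serial, which is the case where I would expect MPI (with no region to open) to come out ahead.</div>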

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

3. Comparing intel/ and intel-pinning/ is particularly interesting.<br>

"First touch" has been applied to all memory in VecCreate so that memory<br>

should be paged correctly for NUMA. But first touch does not gain you<br>

much if threads migrate, so for the intel-pinning/ results I set the env<br>

KMP_AFFINITY=scatter to get hard affinity. You can clearly see from the<br>

results that this improves parallel efficiency by a few percentage<br>

points in many cases. It also really smooths out efficiency dips as you<br>

run on different numbers of threads.<br></blockquote><div><br></div><div>Are you choosing sizes so that thread partitions always fall on a page boundary, or are some pages cut irregularly?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

Full blown benchmarks would not make a lot of sense until we get the Mat<br>

classes threaded in a similar fashion. However, at this point I would<br>

like feedback on the direction this is taking and if we can start<br>

getting code committed.<br></blockquote><div><br></div><div>Did you post a repository yet? I'd like to have a look at the code.</div></div>
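<div><br></div><div>To clarify what I mean by the page-boundary question under point 3: a minimal sketch of a partition in which every interior thread boundary falls on a whole page. It assumes 4 KiB pages, 8-byte entries, and an array whose base address is itself page-aligned. The function name and defaults are hypothetical and are not taken from the branch.</div>

```python
def page_aligned_partition(n_elements, nthreads, page_bytes=4096, elem_bytes=8):
    """Split [0, n_elements) into nthreads contiguous half-open ranges.

    Every interior boundary is a multiple of the page size (in elements),
    so no page is shared between two threads. Assumes the array starts on
    a page boundary.
    """
    elems_per_page = page_bytes // elem_bytes
    # Number of pages touched by the array (ceiling division).
    n_pages = -(-n_elements // elems_per_page)
    bounds = [0]
    for t in range(1, nthreads):
        # Give each thread a near-equal share of whole pages.
        page = (t * n_pages) // nthreads
        bounds.append(min(page * elems_per_page, n_elements))
    bounds.append(n_elements)
    return list(zip(bounds[:-1], bounds[1:]))


for start, end in page_aligned_partition(2000, 3):
    print(start, end)
```

<div>If the partition is instead computed by dividing the element count evenly, the page that straddles a boundary is first touched by one thread but used by two, which is the irregular case I am asking about.</div>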