[mpich-discuss] Faster MPI_Attr_get?

Dave Goodell goodell at mcs.anl.gov
Sat May 12 01:00:58 CDT 2012


On May 11, 2012, at 4:12 PM CDT, Jed Brown wrote:

> On Fri, May 11, 2012 at 4:03 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> 
>> On May 11, 2012, at 1:49 PM CDT, Dave Goodell wrote:
>> 
>>> On May 11, 2012, at 1:17 PM CDT, Jed Brown wrote:
>>> 
>>>> Can you make this fast, like ~100 clocks for the common case?
>>> 
>>> Can you send us a small benchmark that you would like us to optimize?
>>> Presumably you already have one since you've measured the overhead.
>> 
> 
> I benchmarked an example using the PETSc ThreadComm. For a micro-benchmark
> that doesn't use pthreads, I would just call MPI_Attr_get() in a loop. I
> can prepare that if you like.

I wrote this simple benchmark (attached) and found that performance is on the order of ~66 clocks when error checking is enabled and around ~44 clocks when error checking is disabled.  This is all on my 2.5 GHz Core i7 MacBook Pro with the trunk version of MPICH2, built with options very similar to the ones you gave (including "--enable-error-checking=runtime --enable-error-messages=all --enable-shared").

The benchmark also does a second test to check whether the issue comes from having many attributes attached to the communicator.  That clearly has an effect, although it costs no more than a factor of ~2 for 20 attached attributes.
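
For reference, here is a minimal sketch of what such a micro-benchmark might look like.  This is not the attached attr_get_perf.c; the iteration count and keyval handling below are illustrative assumptions.

----8<----
/* sketch: time MPI_Attr_get() on a single attribute attached to
 * MPI_COMM_WORLD; the 20-attribute case would simply attach more
 * keyvals with MPI_Keyval_create/MPI_Attr_put before timing again */
#include <mpi.h>
#include <stdio.h>

#define NITER 10000000

int main(int argc, char **argv)
{
    int keyval, i, flag;
    void *val;
    long dummy = 42;
    double t0, t1;

    MPI_Init(&argc, &argv);

    /* create a keyval and attach one attribute */
    MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &keyval, NULL);
    MPI_Attr_put(MPI_COMM_WORLD, keyval, &dummy);

    /* time repeated lookups of that attribute */
    t0 = MPI_Wtime();
    for (i = 0; i < NITER; i++)
        MPI_Attr_get(MPI_COMM_WORLD, keyval, &val, &flag);
    t1 = MPI_Wtime();

    printf("t= %f (us/iteration)\n", 1.0e6 * (t1 - t0) / NITER);
    printf("t= %f (rough cycles/iteration @ 2.5 GHz)\n",
           2.5e9 * (t1 - t0) / NITER);

    MPI_Finalize();
    return 0;
}
----8<----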

----8<----
% MPICH_ERROR_CHECKING=1 ./a.out
SINGLE ATTRIBUTE
t= 0.026904 (us/iteration)
t= 67.260559 (rough cycles/iteration @ 2.5 GHz)

20 ATTRIBUTES
t= 0.045785 (us/iteration)
t= 114.462164 (rough cycles/iteration @ 2.5 GHz)

% MPICH_ERROR_CHECKING=0 ./a.out
SINGLE ATTRIBUTE
t= 0.017907 (us/iteration)
t= 44.767313 (rough cycles/iteration @ 2.5 GHz)

20 ATTRIBUTES
t= 0.039224 (us/iteration)
t= 98.059134 (rough cycles/iteration @ 2.5 GHz)
----8<----

Possible explanations for the difference from your ~1000 clock results:

1) A bug in my benchmark (missing a 0 somewhere, etc.).  Entirely possible; it's a bit late right now :)

2) Wildly different hardware being used for the test.  This seems unlikely.

3) The benchmark isn't as representative of the behavior you are seeing as we thought it would be.  If your measurements were taken in a multithreaded environment, then my guess is that you're seeing the cost of a contended pthread mutex hidden inside the MPI calls (see the sketch below).

If it is a contended mutex issue, then this will not be very quick/easy to fix.
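
To illustrate what I mean by the mutex hypothesis, here is a rough sketch of the kind of multithreaded usage where that contention would show up.  This is an assumption-laden example (plain pthreads hammering one communicator under MPI_THREAD_MULTIPLE), not the PETSc ThreadComm code: with a thread-safe MPICH2 build, each MPI_Attr_get call takes an internal critical section, so concurrent callers serialize on that lock.

----8<----
/* sketch: several threads calling MPI_Attr_get on the same communicator
 * concurrently; NTHREADS/NITER are illustrative, not from any real test */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    1000000

static int keyval;

static void *worker(void *arg)
{
    void *val;
    int i, flag;
    /* every call may take MPICH2's internal mutex, so the threads
     * contend on it rather than running independently */
    for (i = 0; i < NITER; i++)
        MPI_Attr_get(MPI_COMM_WORLD, keyval, &val, &flag);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t threads[NTHREADS];
    int provided, i;
    long dummy = 42;
    double t0, t1;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &keyval, NULL);
    MPI_Attr_put(MPI_COMM_WORLD, keyval, &dummy);

    t0 = MPI_Wtime();
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    t1 = MPI_Wtime();

    printf("t= %f (us/iteration/thread, wall clock)\n",
           1.0e6 * (t1 - t0) / NITER);

    MPI_Finalize();
    return 0;
}
----8<----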

-Dave

-------------- next part --------------
A non-text attachment was scrubbed...
Name: attr_get_perf.c
Type: application/octet-stream
Size: 2415 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20120512/eda1c65c/attachment.obj>

