[petsc-users] Explicit linking to OpenMP results in performance drop and wrong results

Matthew Knepley knepley at gmail.com
Thu Feb 18 20:03:21 CST 2021


On Thu, Feb 18, 2021 at 7:15 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>
> On Feb 18, 2021, at 6:10 AM, Matthew Knepley <knepley at gmail.com> wrote:
>
> On Thu, Feb 18, 2021 at 3:09 AM Roland Richter <roland.richter at ntnu.no>
> wrote:
>
>> Hei,
>>
>> that was the reason for the increased run times. When removing #pragma omp
>> parallel for, my loop took ~18 seconds. When changing it to #pragma omp
>> parallel for num_threads(2) or #pragma omp parallel for num_threads(4) (on
>> an i7-6700), the loop took ~16 s, but when increasing it to #pragma omp
>> parallel for num_threads(8), the loop took 28 s.
>>
> Editorial: This is a reason I think OpenMP is inappropriate as a tool
> for parallel computing (many people disagree). It makes resource management
> difficult for the user and impossible for a library.
>
>
>    It is possible to control these things properly with modern OpenMP APIs
> but, like MPI implementations, this can require some mucking around that a
> beginner would not know about, and the default settings can be terrible. MPI
> implementations are no better; their default bindings are generally
> horrible.
>
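For concreteness, here is a minimal sketch (not from the original thread, and not PETSc-specific) of the knobs meant here: the OpenMP 4.x environment variables and runtime API that control team size and thread binding. The file name, build line, and chosen settings below are illustrative assumptions.

// Build (assumed toolchain): g++ -fopenmp omp_resources.cpp -o omp_resources
// Run with explicit settings instead of the defaults, e.g.:
//   OMP_NUM_THREADS=4 OMP_PROC_BIND=close OMP_PLACES=cores ./omp_resources
#include <cstdio>
#include <omp.h>

int main() {
    // What the runtime would use by default; on an i7-6700 this is often all
    // 8 hardware threads, which oversubscribes the 4 physical cores for a
    // memory-bound loop.
    std::printf("procs: %d, default max threads: %d\n",
                omp_get_num_procs(), omp_get_max_threads());

    // Limit the team size explicitly rather than trusting the default.
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("team size actually used: %d\n", omp_get_num_threads());
    }
    return 0;
}

Note that all of these are process-global settings, which is exactly the resource-management problem discussed in this thread.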

MPI allows the library to understand what resources are available and used.
The last time we looked at it, OpenMP did not have such a context object that
gets passed into the library (like an MPI comm). The user could construct
one, but then the "usability" of OpenMP fades away.
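As an illustration of that contrast (a sketch with assumed names, not PETSc's actual API): an MPI-based library routine receives its resource context explicitly through the communicator, whereas an OpenMP-based one can only consult global runtime state.

// Build (assumed toolchain): mpicxx -fopenmp comm_context.cpp -o comm_context
#include <cstdio>
#include <mpi.h>
#include <omp.h>

// The caller hands this routine an explicit context; passing a
// sub-communicator would restrict it to a subset of processes without any
// global side effects.
void library_op_mpi(MPI_Comm comm) {
    int size, rank;
    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        std::printf("MPI variant: told it may use %d processes via the comm\n", size);
}

// No context object exists: the routine can only query (and possibly fight
// over) the process-wide OpenMP runtime state.
void library_op_omp() {
    std::printf("OpenMP variant: will take up to %d threads from global state\n",
                omp_get_max_threads());
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    library_op_mpi(MPI_COMM_WORLD);
    library_op_omp();
    MPI_Finalize();
    return 0;
}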

  Matt


>   Barry
>
>
>   Thanks,
>
>      Matt
>
>> Regards,
>>
>> Roland
>> On 17.02.21 at 18:51, Matthew Knepley wrote:
>>
>> Jed, is it possible that this is an oversubscription penalty from bad
>> OpenMP settings? <said by a person who knows less about OpenMP than
>> cuneiform>
>>
>>   Thanks,
>>
>>      Matt
>>
>> On Wed, Feb 17, 2021 at 12:11 PM Roland Richter <roland.richter at ntnu.no>
>> wrote:
>>
>>> My PetscScalar is complex double (i.e. even higher penalty), but my
>>> matrix has a size of 8kk elements, so that should not be an issue.
>>> Regards,
>>> Roland
>>> ------------------------------
>>> From: Jed Brown <jed at jedbrown.org>
>>> Sent: Wednesday, 17 February 2021 17:49:49
>>> To: Roland Richter; PETSc
>>> Subject: Re: [petsc-users] Explicit linking to OpenMP results in
>>> performance drop and wrong results
>>>
>>> Roland Richter <roland.richter at ntnu.no> writes:
>>>
>>> > Hei,
>>> >
>>> > I replaced the linking line with
>>> >
>>> > /usr/lib64/mpi/gcc/openmpi3/bin/mpicxx -march=native -fopenmp-simd
>>> > -DMKL_LP64 -m64
>>> > CMakeFiles/armadillo_with_PETSc.dir/Unity/unity_0_cxx.cxx.o -o
>>> > bin/armadillo_with_PETSc
>>> > -Wl,-rpath,/opt/boost/lib:/opt/fftw3/lib64:/opt/petsc_release/lib
>>> > /usr/lib64/libgsl.so /usr/lib64/libgslcblas.so -lgfortran
>>> > -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64
>>> > -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
>>> > /opt/boost/lib/libboost_filesystem.so.1.72.0
>>> > /opt/boost/lib/libboost_mpi.so.1.72.0
>>> > /opt/boost/lib/libboost_program_options.so.1.72.0
>>> > /opt/boost/lib/libboost_serialization.so.1.72.0
>>> > /opt/fftw3/lib64/libfftw3.so /opt/fftw3/lib64/libfftw3_mpi.so
>>> > /opt/petsc_release/lib/libpetsc.so
>>> > /usr/lib64/gcc/x86_64-suse-linux/9/libgomp.so
>>> >
>>> > and now the results are correct. Nevertheless, when comparing the loop
>>> > in lines 26-28 in file test_scaling.cpp
>>> >
>>> > #pragma omp parallel for
>>> >     for(int i = 0; i < r_0 * r_1; ++i)
>>> >         *(out_mat_ptr + i) = (*(in_mat_ptr + i) * scaling_factor);
>>> >
>>> > the version without #pragma omp parallel for is significantly faster
>>> > (i.e. 18 s vs 28 s) compared to the version with OpenMP. Why is there
>>> > still such a big difference?
>>>
>>> Sounds like you're using a profile to attribute time? Each `omp
>>> parallel` region incurs a cost ranging from about a microsecond to 10 or
>>> more microseconds depending on architecture, number of threads, and OpenMP
>>> implementation. Your loop (for double precision) operates at around 8
>>> entries per clock cycle (depending on architecture) if the operands are in
>>> cache, so the loop size r_0 * r_1 should be at least 10000 just to pay off
>>> the cost of `omp parallel`.
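To put rough numbers on that, here is a small self-contained timing sketch (illustrative array sizes and scaling factor, not the actual test_scaling.cpp) that compares the serial loop with the OpenMP version; for small n the per-region cost described above dominates, while for large n the parallel version can win.

// Build (assumed toolchain): g++ -O3 -march=native -fopenmp scaling_bench.cpp -o scaling_bench
#include <complex>
#include <cstdio>
#include <vector>
#include <omp.h>

using cd = std::complex<double>;

// The loop from the thread: scale in[i] into out[i].
void scale_serial(const cd *in, cd *out, long n, cd s) {
    for (long i = 0; i < n; ++i) out[i] = in[i] * s;
}

void scale_omp(const cd *in, cd *out, long n, cd s) {
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) out[i] = in[i] * s;
}

int main() {
    const cd s(2.0, -1.0);
    for (long n : {1000L, 100000L, 10000000L}) {   // illustrative sizes
        std::vector<cd> in(n, cd(1.0, 1.0)), out(n);
        const int reps = 100;

        double t0 = omp_get_wtime();
        for (int r = 0; r < reps; ++r) scale_serial(in.data(), out.data(), n, s);
        double t_serial = (omp_get_wtime() - t0) / reps;

        t0 = omp_get_wtime();
        for (int r = 0; r < reps; ++r) scale_omp(in.data(), out.data(), n, s);
        double t_omp = (omp_get_wtime() - t0) / reps;

        std::printf("n=%8ld  serial %.3e s  omp %.3e s\n", n, t_serial, t_omp);
    }
    return 0;
}

With ~8 million entries, as reported above, the region-entry overhead itself is negligible, so a persistent gap would point elsewhere (e.g. the oversubscription question raised earlier in the thread).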
>>>
>>
>>
>>
>>
>
>
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/