[petsc-users] Explicit linking to OpenMP results in performance drop and wrong results

Thu Feb 18 02:09:35 CST 2021

Hi,

that was the reason for the increased run times. After removing #pragma omp
parallel for, my loop took ~18 seconds. With #pragma omp
parallel for num_threads(2) or #pragma omp parallel for num_threads(4)
(on an i7-6700), the loop took ~16 s, but with #pragma
omp parallel for num_threads(8), the loop took 28 s.
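
A minimal sketch of how such a thread-count comparison could be timed is given here. The names timed_scale and Scalar are hypothetical and not taken from test_scaling.cpp; the loop body mirrors the scaling loop quoted further down, and the array size and scaling factor are left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <complex>
#include <cstddef>
#include <vector>

// Assumption: the scalar type is complex double, as stated in the thread.
using Scalar = std::complex<double>;

// Hypothetical helper: scales `in` into `out` using the requested number
// of OpenMP threads and returns the elapsed wall-clock time in seconds.
// When compiled without OpenMP support the pragma is ignored and the
// loop runs serially.
double timed_scale(const std::vector<Scalar>& in, std::vector<Scalar>& out,
                   Scalar scaling_factor, int threads) {
    assert(in.size() == out.size());
    assert(threads > 0);
    const Scalar* in_ptr = in.data();
    Scalar* out_ptr = out.data();
    const long n = static_cast<long>(in.size());

    const auto start = std::chrono::steady_clock::now();
    #pragma omp parallel for num_threads(threads)
    for (long i = 0; i < n; ++i)
        out_ptr[i] = in_ptr[i] * scaling_factor;
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

Calling timed_scale with threads set to 1, 2, 4 and 8 on the same input would reproduce the comparison above. The i7-6700 has 4 physical cores and 8 hardware threads, so 8 threads already places two threads on every core, and a memory-bound loop like this one usually gains little from that.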

Regards,

Roland

On 17.02.21 at 18:51, Matthew Knepley wrote:
> Jed, is it possible that this is an oversubscription penalty from bad
> OpenMP settings? <said by a person who knows less about OpenMP than
> cuneiform>
>
>   Thanks,
>
>      Matt
>
> On Wed, Feb 17, 2021 at 12:11 PM Roland Richter
> <roland.richter at ntnu.no <mailto:roland.richter at ntnu.no>> wrote:
>
>     My PetscScalar is complex double (i.e. even higher penalty), but
>     my matrix has a size of 8kk elements, so that should not be an issue.
>     Regards,
>     Roland
>     ------------------------------------------------------------------------
>     *From:* Jed Brown <jed at jedbrown.org <mailto:jed at jedbrown.org>>
>     *Sent:* Wednesday, 17 February 2021 17:49:49
>     *To:* Roland Richter; PETSc
>     *Subject:* Re: [petsc-users] Explicit linking to OpenMP results in
>     performance drop and wrong results
>      
>     Roland Richter <roland.richter at ntnu.no
>     <mailto:roland.richter at ntnu.no>> writes:
>
>     > Hi,
>     >
>     > I replaced the linking line with
>     >
>     > /usr/lib64/mpi/gcc/openmpi3/bin/mpicxx  -march=native -fopenmp-simd
>     > -DMKL_LP64 -m64
>     > CMakeFiles/armadillo_with_PETSc.dir/Unity/unity_0_cxx.cxx.o -o
>     > bin/armadillo_with_PETSc
>     > -Wl,-rpath,/opt/boost/lib:/opt/fftw3/lib64:/opt/petsc_release/lib
>     > /usr/lib64/libgsl.so /usr/lib64/libgslcblas.so -lgfortran
>     > -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64
>     > -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
>     > /opt/boost/lib/libboost_filesystem.so.1.72.0
>     > /opt/boost/lib/libboost_mpi.so.1.72.0
>     > /opt/boost/lib/libboost_program_options.so.1.72.0
>     > /opt/boost/lib/libboost_serialization.so.1.72.0
>     > /opt/fftw3/lib64/libfftw3.so /opt/fftw3/lib64/libfftw3_mpi.so
>     > /opt/petsc_release/lib/libpetsc.so
>     > /usr/lib64/gcc/x86_64-suse-linux/9/libgomp.so
>     >
>     > and now the results are correct. Nevertheless, when comparing the loop
>     > in line 26-28 in file test_scaling.cpp
>     >
>     > #pragma omp parallel for
>     >     for(int i = 0; i < r_0 * r_1; ++i)
>     >         *(out_mat_ptr + i) = (*(in_mat_ptr + i) * scaling_factor);
>     >
>     > the version without #pragma omp parallel for is significantly faster
>     > (i.e. 18 s vs 28 s) compared to the version with omp. Why is there
>     > still such a big difference?
>
>     Sounds like you're using a profiler to attribute time? Each `omp
>     parallel` region incurs a cost ranging from about a microsecond to
>     10 or more microseconds depending on architecture, number of
>     threads, and OpenMP implementation. Your loop (for double
>     precision) operates at around 8 entries per clock cycle (depending
>     on architecture) if the operands are in cache so the loop size r_0
>     * r_1 should be at least 10000 just to pay off the cost of `omp
>     parallel`.
>
>
>
> -- 
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
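
The break-even estimate in Jed's quoted reply can be worked through with a small helper. This is only a sketch: break_even_entries is a hypothetical name, and the clock frequency is an assumed input, since the reply gives the region overhead (about 1 to 10 or more microseconds) and the in-cache throughput (about 8 entries per clock cycle) but no clock rate.

```cpp
#include <cassert>

// Hypothetical helper: number of loop entries a serial, in-cache loop
// processes during the time one `omp parallel` region costs to start.
// A loop shorter than this cannot recover the region overhead.
//   overhead_seconds  : cost of entering the parallel region
//   clock_hz          : assumed CPU clock frequency (not given in the thread)
//   entries_per_cycle : serial throughput, about 8 per the quoted estimate
double break_even_entries(double overhead_seconds, double clock_hz,
                          double entries_per_cycle) {
    assert(overhead_seconds >= 0.0);
    assert(clock_hz > 0.0);
    assert(entries_per_cycle > 0.0);
    return overhead_seconds * clock_hz * entries_per_cycle;
}
```

With an overhead of 1 microsecond and an assumed 3 GHz clock, break_even_entries(1e-6, 3e9, 8.0) gives roughly 24,000 entries, the same order of magnitude as the "at least 10000" in the reply; a 10 microsecond overhead raises it to roughly 240,000.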