[petsc-users] Explicit linking to OpenMP results in performance drop and wrong results

Wed Feb 17 11:56:04 CST 2021

It's entirely possible, especially if libgomp is being mixed with libiomp.

Roland hasn't shown us the compilation line (just the linker line), and that matters because `omp parallel` shouldn't do anything with just -fopenmp-simd and no -fopenmp.
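
A quick way to see what the compiler did with the pragma is to test the `_OPENMP` macro, which the OpenMP specification requires a conforming compiler to define when OpenMP is enabled. The sketch below is a minimal check, not something taken from Roland's code: the helper names `openmp_build_status` and `threads_in_parallel_region` are made up for illustration, and it assumes GCC leaves `_OPENMP` undefined when only -fopenmp-simd is given.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: reports how this translation unit was compiled.
// Assumption: _OPENMP is defined only when OpenMP is fully enabled (-fopenmp).
std::string openmp_build_status() {
#ifdef _OPENMP
    return "OpenMP enabled (_OPENMP=" + std::to_string(_OPENMP) + ")";
#else
    return "OpenMP disabled: omp parallel pragmas are ignored";
#endif
}

// Hypothetical helper: counts the threads that actually execute a parallel
// region. If the pragma is ignored, the statement runs once and the result is 1.
int threads_in_parallel_region() {
    int count = 0;
#pragma omp parallel reduction(+ : count)
    count += 1;
    assert(count >= 1);
    return count;
}
```

If `threads_in_parallel_region()` returns 1 on a multi-core machine with OMP_NUM_THREADS unset, the file was most likely compiled without -fopenmp, whatever the link line says.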

Matthew Knepley <knepley at gmail.com> writes:

> Jed, is it possible that this is an oversubscription penalty from bad
> OpenMP settings? <said by a person who knows less about OpenMP than
> cuneiform>
>
>   Thanks,
>
>      Matt
>
> On Wed, Feb 17, 2021 at 12:11 PM Roland Richter <roland.richter at ntnu.no>
> wrote:
>
>> My PetscScalar is complex double (i.e. even higher penalty), but my matrix
>> has a size of 8kk elements, so that should not be an issue.
>> Regards,
>> Roland
>> ------------------------------
>> *From:* Jed Brown <jed at jedbrown.org>
>> *Sent:* Wednesday, 17 February 2021 17:49:49
>> *To:* Roland Richter; PETSc
>> *Subject:* Re: [petsc-users] Explicit linking to OpenMP results in
>> performance drop and wrong results
>>
>> Roland Richter <roland.richter at ntnu.no> writes:
>>
>> > Hi,
>> >
>> > I replaced the linking line with
>> >
>> > /usr/lib64/mpi/gcc/openmpi3/bin/mpicxx -march=native -fopenmp-simd
>> > -DMKL_LP64 -m64
>> > CMakeFiles/armadillo_with_PETSc.dir/Unity/unity_0_cxx.cxx.o -o
>> > bin/armadillo_with_PETSc
>> > -Wl,-rpath,/opt/boost/lib:/opt/fftw3/lib64:/opt/petsc_release/lib
>> > /usr/lib64/libgsl.so /usr/lib64/libgslcblas.so -lgfortran
>> > -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64
>> > -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
>> > /opt/boost/lib/libboost_filesystem.so.1.72.0
>> > /opt/boost/lib/libboost_mpi.so.1.72.0
>> > /opt/boost/lib/libboost_program_options.so.1.72.0
>> > /opt/boost/lib/libboost_serialization.so.1.72.0
>> > /opt/fftw3/lib64/libfftw3.so /opt/fftw3/lib64/libfftw3_mpi.so
>> > /opt/petsc_release/lib/libpetsc.so
>> > /usr/lib64/gcc/x86_64-suse-linux/9/libgomp.so
>> >
>> > and now the results are correct. Nevertheless, when comparing the loop
>> > in lines 26-28 of file test_scaling.cpp
>> >
>> > #pragma omp parallel for
>> >     for(int i = 0; i < r_0 * r_1; ++i)
>> >         *(out_mat_ptr + i) = (*(in_mat_ptr + i) * scaling_factor);
>> >
>> > the version without #pragma omp parallel for is significantly faster
>> > (18 s vs 28 s) than the version with omp. Why is there still such a
>> > big difference?
>>
>> Sounds like you're using a profile to attribute time? Each `omp parallel`
>> region incurs a cost ranging from about a microsecond to 10 or more
>> microseconds depending on architecture, number of threads, and OpenMP
>> implementation. Your loop (for double precision) operates at around 8
>> entries per clock cycle (depending on architecture) if the operands are in
>> cache, so the loop size r_0 * r_1 should be at least 10000 just to pay off
>> the cost of `omp parallel`.
>>
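
To put numbers on the break-even estimate quoted above, here is a back-of-the-envelope sketch. The function names are hypothetical, the 3 GHz clock used in the note below is an assumption, and the `if` clause is just one standard OpenMP way to skip threading for short loops, not necessarily what Roland's code should do.

```cpp
#include <cassert>
#include <complex>

// Hypothetical helper: number of loop entries that take as long as one
// `omp parallel` region costs. Microseconds times GHz gives thousands of cycles.
double break_even_entries(double overhead_us, double clock_ghz,
                          double entries_per_cycle) {
    assert(overhead_us > 0 && clock_ghz > 0 && entries_per_cycle > 0);
    const double overhead_cycles = overhead_us * clock_ghz * 1e3;
    return overhead_cycles * entries_per_cycle;
}

// Sketch of the scaling loop with a size guard: the OpenMP `if` clause asks
// for a single-threaded region when n is below min_parallel_n. Compiled
// without -fopenmp, the pragma is ignored and the loop is always serial.
void scale(const std::complex<double>* in, std::complex<double>* out, long n,
           std::complex<double> factor, long min_parallel_n) {
#pragma omp parallel for if (n >= min_parallel_n)
    for (long i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}
```

For example, `break_even_entries(1.0, 3.0, 8.0)` gives 24000 entries for a 1 microsecond region at an assumed 3 GHz, and ten times that for a 10 microsecond region, which is in line with the "at least 10000" rule of thumb. Complex double would process fewer entries per cycle, so the real threshold depends on the machine.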
>
>
> -- 
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>