[petsc-users] Explicit linking to OpenMP results in performance drop and wrong results

Jed Brown jed at jedbrown.org
Wed Feb 17 10:49:49 CST 2021


Roland Richter <roland.richter at ntnu.no> writes:

> Hi,
>
> I replaced the linking line with
>
> /usr/lib64/mpi/gcc/openmpi3/bin/mpicxx -march=native -fopenmp-simd
> -DMKL_LP64 -m64
> CMakeFiles/armadillo_with_PETSc.dir/Unity/unity_0_cxx.cxx.o -o
> bin/armadillo_with_PETSc
> -Wl,-rpath,/opt/boost/lib:/opt/fftw3/lib64:/opt/petsc_release/lib
> /usr/lib64/libgsl.so /usr/lib64/libgslcblas.so -lgfortran
> -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64
> -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
> /opt/boost/lib/libboost_filesystem.so.1.72.0
> /opt/boost/lib/libboost_mpi.so.1.72.0
> /opt/boost/lib/libboost_program_options.so.1.72.0
> /opt/boost/lib/libboost_serialization.so.1.72.0
> /opt/fftw3/lib64/libfftw3.so /opt/fftw3/lib64/libfftw3_mpi.so
> /opt/petsc_release/lib/libpetsc.so
> /usr/lib64/gcc/x86_64-suse-linux/9/libgomp.so
>
> and now the results are correct. Nevertheless, when comparing the loop
> in lines 26-28 of the file test_scaling.cpp
>
> #pragma omp parallel for
>     for(int i = 0; i < r_0 * r_1; ++i)
>         *(out_mat_ptr + i) = (*(in_mat_ptr + i) * scaling_factor);
>
> the version without `#pragma omp parallel for` is significantly faster
> (18 s vs. 28 s) than the version with omp. Why is there still such a
> big difference?

Sounds like you're using a profiler to attribute time? Each `omp parallel` region incurs a cost ranging from about a microsecond to 10 or more microseconds, depending on architecture, number of threads, and OpenMP implementation. A microsecond is already a few thousand clock cycles at GHz clock rates. Your loop (for double precision) operates at around 8 entries per clock cycle (depending on architecture) if the operands are in cache, so the loop size `r_0 * r_1` should be at least 10000 just to pay off the cost of `omp parallel`.
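
One way to act on that estimate is to guard the parallel region with OpenMP's `if` clause, so that small loops stay on a single thread. The following is only a sketch, not code from test_scaling.cpp: the function name `scale`, the use of `std::vector`, and the constant `kParallelThreshold` are made up for illustration, and the value 10000 is just the rough lower bound mentioned above. The real break-even point should be measured on the target machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cutoff based on the rough estimate above; tune by measurement.
constexpr std::ptrdiff_t kParallelThreshold = 10000;

// Computes out[i] = in[i] * factor for all i.
// The `if` clause asks OpenMP to use a team of threads only when the
// loop is large enough; otherwise the region runs on one thread.
void scale(const std::vector<double>& in, std::vector<double>& out,
           double factor) {
    assert(in.size() == out.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(in.size());
    const double* src = in.data();
    double* dst = out.data();

    #pragma omp parallel for if(n >= kParallelThreshold)
    for (std::ptrdiff_t i = 0; i < n; ++i)
        dst[i] = src[i] * factor;
}
```

If this is built without OpenMP enabled, the pragma is ignored and the loop runs serially, so the results are the same either way. Even with the threshold in place, a loop this simple is likely limited by memory bandwidth once the operands no longer fit in cache, so extra threads may not help much there either.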

