<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Hi,</p>

    <p>The compilation line is shown below:</p>

    <p><i>/usr/lib64/mpi/gcc/openmpi3/bin/mpicxx -DBOOST_ALL_NO_LIB

        -DBOOST_FILESYSTEM_DYN_LINK -DBOOST_MPI_DYN_LINK

        -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_SERIALIZATION_DYN_LINK

        -DUSE_CUDA

        -I/home/roland/Dokumente/C++-Projekte/armadillo_with_PETSc/include

-I/opt/intel/compilers_and_libraries_2020.2.254/linux/mkl/include

        -I/opt/armadillo/include -isystem /opt/petsc_release/include

        -isystem /opt/fftw3/include -isystem /opt/boost/include

        -march=native -fopenmp-simd -DMKL_LP64 -m64 -Wall -Wextra

        -pedantic -fPIC -flto -O2 -funroll-loops -funroll-all-loops

        -fstrict-aliasing -mavx -march=native -fopenmp -std=gnu++17 -c

        <source_files> -o <target_files></i></p>

    <p>Regards,</p>

    <p>Roland</p>

    <div class="moz-cite-prefix">On 17.02.2021 at 18:56, Jed Brown wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:87v9aqbmtn.fsf@jedbrown.org">

      <pre class="moz-quote-pre" wrap="">It's entirely possible, especially if libgomp is being mixed with libiomp.

Roland hasn't shown us the compilation line (just the linker line), which matters because `omp parallel` shouldn't do anything with just -fopenmp-simd and no -fopenmp.
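A minimal sketch for checking this from inside the program, assuming GCC's
behavior of defining `_OPENMP` only when -fopenmp is given (and not with
-fopenmp-simd alone). The helper names `openmp_version` and
`threads_in_parallel_region` are hypothetical, chosen only for illustration:

```cpp
#include <cassert>

#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the value of the _OPENMP macro, or 0 if this translation unit
// was compiled without full OpenMP support.
long openmp_version() {
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}

// Counts how many threads actually execute an `omp parallel` region.
// Without OpenMP support the pragma is skipped and the answer is 1.
int threads_in_parallel_region() {
    int count = 1;
#ifdef _OPENMP
#pragma omp parallel
    {
        // Only one thread of the team writes the result.
#pragma omp single
        count = omp_get_num_threads();
    }
#endif
    assert(count >= 1);
    return count;
}
```

If `openmp_version()` returns 0 for an object file, any `omp parallel`
pragma in it was ignored at compile time, whatever the linker line says.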

Matthew Knepley <a class="moz-txt-link-rfc2396E" href="mailto:knepley@gmail.com"><knepley@gmail.com></a> writes:

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">Jed, is it possible that this is an oversubscription penalty from bad

OpenMP settings? <said by a person who knows less about OpenMP than

cuneiform>

  Thanks,

     Matt
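
A rough way to test for oversubscription is to compare the number of software
threads requested on a node with the number of hardware threads. The sketch
below assumes one OpenMP team per MPI rank; all function names are
hypothetical, and the rank count has to come from the job launcher:

```cpp
#include <cassert>
#include <thread>

#ifdef _OPENMP
#include <omp.h>
#endif

// True if ranks_per_node * threads_per_rank exceeds the hardware threads.
bool is_oversubscribed(unsigned ranks_per_node, unsigned threads_per_rank,
                       unsigned hardware_threads) {
    assert(hardware_threads > 0);
    const unsigned long long requested =
        static_cast<unsigned long long>(ranks_per_node) * threads_per_rank;
    return requested > hardware_threads;
}

// Threads an OpenMP parallel region would use by default (1 without OpenMP).
unsigned default_openmp_threads() {
#ifdef _OPENMP
    return static_cast<unsigned>(omp_get_max_threads());
#else
    return 1u;
#endif
}

// hardware_concurrency() may return 0 if the value is not computable,
// so fall back to a caller-supplied default in that case.
unsigned hardware_threads(unsigned fallback = 1) {
    const unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : fallback;
}
```

For example, 4 ranks with 8 threads each on a node with 16 hardware threads
would be oversubscribed, while 2 ranks with 8 threads each would not.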

On Wed, Feb 17, 2021 at 12:11 PM Roland Richter <a class="moz-txt-link-rfc2396E" href="mailto:roland.richter@ntnu.no"><roland.richter@ntnu.no></a>

wrote:

</pre>

        <blockquote type="cite">

          <pre class="moz-quote-pre" wrap="">My PetscScalar is complex double (i.e. even higher penalty), but my matrix

has a size of 8kk elements, so that should not be an issue.

Regards,

Roland

------------------------------

*From:* Jed Brown <a class="moz-txt-link-rfc2396E" href="mailto:jed@jedbrown.org"><jed@jedbrown.org></a>

*Sent:* Wednesday, 17 February 2021 17:49:49

*To:* Roland Richter; PETSc

*Subject:* Re: [petsc-users] Explicit linking to OpenMP results in

performance drop and wrong results

Roland Richter <a class="moz-txt-link-rfc2396E" href="mailto:roland.richter@ntnu.no"><roland.richter@ntnu.no></a> writes:

</pre>

          <blockquote type="cite">

            <pre class="moz-quote-pre" wrap="">Hei,

I replaced the linking line with

/usr/lib64/mpi/gcc/openmpi3/bin/mpicxx  -march=native -fopenmp-simd

-DMKL_LP64 -m64

CMakeFiles/armadillo_with_PETSc.dir/Unity/unity_0_cxx.cxx.o -o

bin/armadillo_with_PETSc

-Wl,-rpath,/opt/boost/lib:/opt/fftw3/lib64:/opt/petsc_release/lib

/usr/lib64/libgsl.so /usr/lib64/libgslcblas.so -lgfortran

-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64

-lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl

/opt/boost/lib/libboost_filesystem.so.1.72.0

/opt/boost/lib/libboost_mpi.so.1.72.0

/opt/boost/lib/libboost_program_options.so.1.72.0

/opt/boost/lib/libboost_serialization.so.1.72.0

/opt/fftw3/lib64/libfftw3.so /opt/fftw3/lib64/libfftw3_mpi.so

/opt/petsc_release/lib/libpetsc.so

/usr/lib64/gcc/x86_64-suse-linux/9/libgomp.so


and now the results are correct. Nevertheless, when comparing the loop

in lines 26-28 of test_scaling.cpp

#pragma omp parallel for

    for(int i = 0; i < r_0 * r_1; ++i)

        *(out_mat_ptr + i) = (*(in_mat_ptr + i) * scaling_factor);

the version without `#pragma omp parallel for` is significantly faster

(18 s vs 28 s) than the version with it. Why is there

still such a big difference?

</pre>

          </blockquote>

          <pre class="moz-quote-pre" wrap="">

Sounds like you're using a profile to attribute time? Each `omp parallel`

region incurs a cost ranging from about a microsecond to 10 or more

microseconds depending on architecture, number of threads, and OpenMP

implementation. Your loop (for double precision) operates at around 8

entries per clock cycle (depending on architecture) if the operands are in

cache so the loop size r_0 * r_1 should be at least 10000 just to pay off

the cost of `omp parallel`.
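
One way to act on this is OpenMP's `if` clause, which lets short loops stay
on the calling thread. The following is only a sketch: the function name
`scale_array` and the constant `kParallelThreshold` are hypothetical, the
element type assumes a complex-double PetscScalar, and the cutoff of 10000 is
the rough estimate above, so it should be tuned by measurement:

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical cutoff based on the rough estimate above; tune by measurement.
constexpr std::ptrdiff_t kParallelThreshold = 10000;

// Writes out[i] = in[i] * factor for every i in [0, n).
void scale_array(const std::complex<double>* in, std::complex<double>* out,
                 std::ptrdiff_t n, std::complex<double> factor) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    // When the if clause is false the region runs with a single thread,
    // which avoids most of the cost of starting a thread team.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
#pragma omp parallel for if (n >= kParallelThreshold) schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        out[i] = in[i] * factor;
    }
}
```

Timing this for several values of n, with and without the pragma, should
show where the parallel version starts to pay off on a given machine.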

</pre>

        </blockquote>

        <pre class="moz-quote-pre" wrap="">

-- 

What most experimenters take for granted before they begin their

experiments is infinitely more interesting than any results to which their

experiments lead.

-- Norbert Wiener

<a class="moz-txt-link-freetext" href="https://www.cse.buffalo.edu/~knepley/">https://www.cse.buffalo.edu/~knepley/</a> <a class="moz-txt-link-rfc2396E" href="http://www.cse.buffalo.edu/~knepley/"><http://www.cse.buffalo.edu/~knepley/></a>

</pre>

      </blockquote>

    </blockquote>

  </body>

</html>