<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from text --><style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
</head>
<body>
<div>My PetscScalar is complex double (i.e. an even higher per-entry cost), but my matrix has roughly 8 million elements, so that should not be an issue.<br>
Regards,<br>
Roland
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="x_divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>Von:</b> Jed Brown <jed@jedbrown.org><br>
<b>Gesendet:</b> Mittwoch, 17. Februar 2021 17:49:49<br>
<b>An:</b> Roland Richter; PETSc<br>
<b>Betreff:</b> Re: [petsc-users] Explicit linking to OpenMP results in performance drop and wrong results</font>
<div> </div>
</div>
</div>
<font size="2"><span style="font-size:10pt;">
<div class="PlainText">Roland Richter <roland.richter@ntnu.no> writes:<br>
<br>
> Hei,<br>
><br>
> I replaced the linking line with<br>
><br>
> /usr/lib64/mpi/gcc/openmpi3/bin/mpicxx -march=native -fopenmp-simd<br>
> -DMKL_LP64 -m64<br>
> CMakeFiles/armadillo_with_PETSc.dir/Unity/unity_0_cxx.cxx.o -o<br>
> bin/armadillo_with_PETSc <br>
> -Wl,-rpath,/opt/boost/lib:/opt/fftw3/lib64:/opt/petsc_release/lib<br>
> /usr/lib64/libgsl.so /usr/lib64/libgslcblas.so -lgfortran <br>
> -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64<br>
> -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl<br>
> /opt/boost/lib/libboost_filesystem.so.1.72.0<br>
> /opt/boost/lib/libboost_mpi.so.1.72.0<br>
> /opt/boost/lib/libboost_program_options.so.1.72.0<br>
> /opt/boost/lib/libboost_serialization.so.1.72.0<br>
> /opt/fftw3/lib64/libfftw3.so /opt/fftw3/lib64/libfftw3_mpi.so<br>
> /opt/petsc_release/lib/libpetsc.so<br>
> /usr/lib64/gcc/x86_64-suse-linux/9/libgomp.so<br>
><br>
> and now the results are correct. Nevertheless, when comparing the loop<br>
> in line 26-28 in file test_scaling.cpp<br>
><br>
> #pragma omp parallel for<br>
> for(int i = 0; i < r_0 * r_1; ++i)<br>
>     *(out_mat_ptr + i) = (*(in_mat_ptr + i) * scaling_factor);<br>
><br>
> the version without #pragma omp parallel for is significantly faster<br>
> (18 s vs. 28 s) than the version with OpenMP. Why is there<br>
> still such a big difference?<br>
<br>
Sounds like you're using a profile to attribute time? Each `omp parallel` region incurs a cost ranging from about a microsecond to 10 or more microseconds, depending on architecture, number of threads, and OpenMP implementation. Your loop (for double precision) operates at around 8 entries per clock cycle (depending on architecture) if the operands are in cache, so the loop size r_0 * r_1 should be at least 10000 just to pay off the cost of `omp parallel`.<br>
</div>
</span></font>
</body>
</html>