[petsc-dev] Petsc "make test" have more failures for --with-openmp=1

Eric Chamberland Eric.Chamberland at giref.ulaval.ca
Fri Mar 12 13:54:53 CST 2021


Hi Pierre,

I now have a docker container reproducing the problems here.

Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm it fails 
like this:

not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
#       Initial guess
#       L_2 Error: 0.00803099
#       Initial Residual
#       L_2 Residual: 1.09057
#       Au - b = Au + F(0)
#       Linear L_2 Residual: 1.09057
#       [d470c54ce086:14127] Read -1, expected 4096, errno = 1
#       [d470c54ce086:14128] Read -1, expected 4096, errno = 1
#       [d470c54ce086:14129] Read -1, expected 4096, errno = 1
#       [3]PETSC ERROR: ------------------------------------------------------------------------
#       [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
#       [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
#       [3]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
#       [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
#       [3]PETSC ERROR: likely location of problem given in stack below
#       [3]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
#       [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
#       [3]PETSC ERROR:       INSTEAD the line number of the start of the function
#       [3]PETSC ERROR:       is given.
#       [3]PETSC ERROR: [3] buildTwo line 987 /opt/petsc-main/include/HPDDM_schwarz.hpp
#       [3]PETSC ERROR: [3] next line 1130 /opt/petsc-main/include/HPDDM_schwarz.hpp
#       [3]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
#       [3]PETSC ERROR: Signal received
#       [3]PETSC ERROR: [0]PETSC ERROR: ------------------------------------------------------------------------

ex12_quad_hpddm_reuse_baij also fails, with many more "Read -1, expected 
..." messages; I don't know where they come from.
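
A possible lead on these lines, offered as an assumption and not verified 
in this container: errno 1 is EPERM on Linux, and Open MPI's shared-memory 
transport prints a message of this shape when a CMA (process_vm_readv) 
copy is refused, which can happen under Docker's default security 
settings. Since the attached openmpi Dockerfile configures with 
--with-cma, a minimal sketch for ruling this out is to disable the 
single-copy mechanism before running the tests (the MCA parameter name 
below is the Open MPI 4.x one):

```shell
# Assumption: Open MPI 4.x, where the shared-memory BTL is named "vader".
# Fall back to copy-in/copy-out instead of CMA (process_vm_readv), which
# needs ptrace-like permissions that a default Docker container may lack.
export OMPI_MCA_btl_vader_single_copy_mechanism=none
```

In a Dockerfile the same setting could be written as an ENV entry. This 
only targets the "Read -1" messages; whether it has any bearing on the 
segmentation violation is unknown.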

Hypre tests (e.g. diff-snes_tutorials-ex56_hypre) are also failing with 
DIVERGED_INDEFINITE_PC...

Please see the 3 attached docker files:

1) fedora_mkl_and_devtools: the Dockerfile that installs Fedora 33 with 
the GNU compilers, MKL, and everything else needed for development.

2) openmpi: the Dockerfile that builds OpenMPI

3) petsc: the last Dockerfile, which builds, installs, and tests PETSc

I build the 3 like this:

docker build -t fedora_mkl_and_devtools -f fedora_mkl_and_devtools .

docker build -t openmpi -f openmpi .

docker build -t petsc -f petsc .

Disclaimer: I am not a Docker expert, so I may be doing things that are 
not state-of-the-art Docker practice, but I am open to suggestions... ;)

I have just run it on my laptop (it took a long time), which does not 
have enough cores, so many more tests failed (I should force 
--oversubscribe but don't know how to).  I will relaunch on my 
workstation in a few minutes.
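
For reference, one possible way to force oversubscription without editing 
each mpiexec command line, assuming Open MPI 4.x as built by the attached 
openmpi Dockerfile: mpirun reads MCA parameters from OMPI_MCA_* 
environment variables, and the mapping parameter below is the counterpart 
of the --oversubscribe flag. This is a sketch that has not been tried in 
this container:

```shell
# Assumption: Open MPI 4.x (the 5.x series uses different parameter names).
# Counterpart of passing --oversubscribe to every mpirun/mpiexec call, so
# tests asking for more ranks than available cores are still launched.
export OMPI_MCA_rmaps_base_oversubscribe=1
```

Written as an ENV entry in the petsc Dockerfile, next to 
OMPI_ALLOW_RUN_AS_ROOT, it would also apply to the "make test" step.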

I will now test your branch! (sorry for the delay).

Thanks,

Eric

On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
>
> Hi Pierre,
>
> ok, that's interesting!
>
> I will try to build a docker image by tomorrow and give you the 
> exact recipe to reproduce the bugs.
>
> Eric
>
>
> On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
>>
>>
>>> On 11 Mar 2021, at 6:16 AM, Barry Smith <bsmith at petsc.dev 
>>> <mailto:bsmith at petsc.dev>> wrote:
>>>
>>>
>>>   Eric,
>>>
>>>    Sorry about not being more immediate. We still have this in our 
>>> active email so you don't need to submit individual issues. We'll 
>>> try to get to them as soon as we can.
>>
>> Indeed, I’m still trying to figure this out.
>> I realized that some of my configure flags were different than yours, 
>> e.g., no --with-memalign.
>> I’ve also added SuperLU_DIST to my installation.
>> Still, I can’t reproduce any issue.
>> I will continue looking into this, it appears I’m seeing some 
>> valgrind errors, but I don’t know if this is some side effect of 
>> OpenMPI not being valgrind-clean (last time I checked, there was no 
>> error with MPICH).
>>
>> Thank you for your patience,
>> Pierre
>>
>> /usr/bin/gmake -f gmakefile test test-fail=1
>> Using MAKEFLAGS: test-fail=1
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
>>  ok snes_tutorials-ex12_quad_hpddm_reuse_baij
>>  ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
>>  ok ksp_ksp_tests-ex33_superlu_dist_2
>>  ok diff-ksp_ksp_tests-ex33_superlu_dist_2
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
>>  ok ksp_ksp_tutorials-ex50_tut_2
>>  ok diff-ksp_ksp_tutorials-ex50_tut_2
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
>>  ok ksp_ksp_tests-ex33_superlu_dist
>>  ok diff-ksp_ksp_tests-ex33_superlu_dist
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
>>  ok snes_tutorials-ex56_hypre
>>  ok diff-snes_tutorials-ex56_hypre
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
>>  ok ksp_ksp_tutorials-ex56_2
>>  ok diff-ksp_ksp_tutorials-ex56_2
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
>>  ok snes_tutorials-ex17_3d_q3_trig_elas
>>  ok diff-snes_tutorials-ex17_3d_q3_trig_elas
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
>>  ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>  ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
>> not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
>> #srun: error: Unable to create step for job 1426755: More processors 
>> requested than permitted
>>  ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command failed so no diff
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
>>  ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran required for 
>> this test
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
>>  ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>  ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
>>  ok snes_tutorials-ex19_tut_3
>>  ok diff-snes_tutorials-ex19_tut_3
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
>>  ok snes_tutorials-ex17_3d_q3_trig_vlap
>>  ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
>>  ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran required for 
>> this test
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
>>  ok snes_tutorials-ex19_superlu_dist
>>  ok diff-snes_tutorials-ex19_superlu_dist
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
>>  ok snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>  ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
>>  ok ksp_ksp_tutorials-ex49_hypre_nullspace
>>  ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
>>  ok snes_tutorials-ex19_superlu_dist_2
>>  ok diff-snes_tutorials-ex19_superlu_dist_2
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
>> not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
>> #srun: error: Unable to create step for job 1426755: More processors 
>> requested than permitted
>>  ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command failed so no diff
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
>>  ok snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>  ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
>>  ok ksp_ksp_tutorials-ex64_1
>>  ok diff-ksp_ksp_tutorials-ex64_1
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
>> not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
>> #srun: error: Unable to create step for job 1426755: More processors 
>> requested than permitted
>>  ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command failed so no diff
>>         TEST 
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
>>  ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran required for 
>> this test
>>
>>>    Barry
>>>
>>>
>>>> On Mar 10, 2021, at 11:03 PM, Eric Chamberland 
>>>> <Eric.Chamberland at giref.ulaval.ca 
>>>> <mailto:Eric.Chamberland at giref.ulaval.ca>> wrote:
>>>>
>>>> Barry,
>>>>
>>>> to get a some follow up on --with-openmp=1 failures, shall I open 
>>>> gitlab issues for:
>>>>
>>>> a) all hypre failures giving DIVERGED_INDEFINITE_PC
>>>>
>>>> b) all superlu_dist failures giving different results with initia 
>>>> and "Exceeded timeout limit of 60 s"
>>>>
>>>> c) hpddm failures "free(): invalid next size (fast)" and 
>>>> "Segmentation Violation"
>>>>
>>>> d) all tao's "Exceeded timeout limit of 60 s"
>>>>
>>>> I don't see how I could do all these debugging by myself...
>>>>
>>>> Thanks,
>>>>
>>>> Eric
>>>>
>>>>
>>>
>>
> -- 
> Eric Chamberland, ing., M. Ing
> Professionnel de recherche
> GIREF/Université Laval
> (418) 656-2131 poste 41 22 42

-- 
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42

-------------- next part --------------
# Base image.
FROM fedora:33

SHELL ["/bin/bash", "-c"]

WORKDIR /

# Intel oneAPI repo configuration and other packages for compiling OpenMPI and PETSc:
# (see https://software.intel.com/content/www/us/en/develop/articles/installing-intel-oneapi-toolkits-via-yum.html)

## Set the time zone inside the container:
ENV TZ=America/New_York

RUN \
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime \
  && \
echo $TZ > /etc/timezone \
  && \
echo "LC_ALL=en_US.UTF-8" >> /etc/environment \
  &&  \
echo "en_US.UTF-8 UTF-8"  >> /etc/locale.gen \
  &&   \
echo "LANG=en_US.UTF-8"   >  /etc/locale.conf \
  &&  \
source /etc/locale.conf \
  && \
echo -e "\
[oneAPI]\n\
name=Intel(R) oneAPI repository\n\
baseurl=https://yum.repos.intel.com/oneapi\n\
enabled=1\n\
gpgcheck=1\n\
repo_gpgcheck=1\n\
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB\n\
" > /etc/yum.repos.d/oneAPI.repo \
  &&\
dnf install -y \
   authconfig \
   autoconf \
   binutils \
   bison \
   blas-devel \
   ccache \
   clang \
   cmake \
   flex \
   gcc-c++ \
   gcc-gfortran \
   gdb \
   git \
   glibc-langpack-en \
   gnuplot \
   intel-oneapi-mkl-devel \
   libtool \
   libtirpc-devel \
   libXext-devel \
   libX11-devel \
   make \
   nfs-utils \
   numactl-libs \
   numactl-devel \
   nscd \
   perl \
   "perl(Data::Dumper)" \
   pkg-config \
   procps-ng \
   python2 \
   python2-six \
   python \
   screen \
   tar \
   time \
   valgrind \
   vim \
   wget \
   xorg-x11-apps \
  && \
dnf clean all

#   intel-hpckit \

# Run a command when the image starts.
#CMD ["/bin/bash"]
-------------- next part --------------
# Base image.
FROM fedora_mkl_and_devtools:latest

SHELL ["/bin/bash", "-c"]

WORKDIR /

ARG ompi_ver=openmpi-4.1.0
ARG ompi_tar=${ompi_ver}.tar.gz
ARG ompi_rep_dest=/opt/${ompi_ver}

ENV MPIdir=${ompi_rep_dest}

RUN \
wget https://www.open-mpi.org/software/ompi/v4.1/downloads/${ompi_tar} \
&& \
tar -xvf ${ompi_tar} \
&& \
cd ${ompi_ver} \
&& \
./configure \
   --prefix=${ompi_rep_dest} \
   CXXFLAGS=-std=c++14 \
   --with-wrapper-cxxflags='-std=c++14' \
   --with-cma \
   --enable-mpi1-compatibility \
   && \
make -j8 \
&& \
make install \
&& \
echo -e "\
export MPIdir=${ompi_rep_dest}\n\
export LD_LIBRARY_PATH=\${MPIdir}/lib:\${LD_LIBRARY_PATH}\n\
export PATH=\${MPIdir}/bin:\${PATH}" > ${ompi_rep_dest}/mpilibs.sh


# Run a command when the image starts.
#CMD ["/bin/bash"]
-------------- next part --------------
# Base image.
FROM openmpi:latest

SHELL ["/bin/bash", "-c"]

WORKDIR /

ARG petsc_branch=main
ARG petsc_ver=petsc-${petsc_branch}
ARG petsc_rep_dest=/opt/${petsc_ver}

RUN \
git clone https://gitlab.com/petsc/petsc.git -b ${petsc_branch}

RUN \
source /opt/intel/oneapi/mkl/latest/env/vars.sh intel64 && \
source ${MPIdir}/mpilibs.sh && \
cd petsc && \
  ./configure \
   --prefix=${petsc_rep_dest} \
   --with-mpi-compilers=1 --with-mpi-dir=${MPIdir} \
   --download-ml=yes \
   --download-mumps=yes \
   --download-superlu=yes \
   --with-cxx-dialect=C++14 \
   --with-make-np=12 \
   --with-shared-libraries=1 \
   --with-debugging=1 \
   --with-memalign=64 \
   --with-visibility=0 \
   --with-openmp=1 \
   --with-64-bit-indices=0 \
   --download-hpddm=yes \
   --download-slepc=yes \
   --download-superlu_dist=yes \
   --download-parmetis=yes \
   --download-ptscotch=yes \
   --download-metis=yes \
   --download-strumpack=yes \
   --download-suitesparse=yes \
   --download-hypre=yes \
   --with-blaslapack-dir="$MKLROOT/lib/intel64" \
   --with-mkl_pardiso-dir="$MKLROOT" \
   --with-mkl_cpardiso-dir="$MKLROOT" \
   --with-scalapack=1 \
   --with-scalapack-include="$MKLROOT/include" \
   --with-scalapack-lib="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" \
   && \
   export PETSC_ARCH_VAR=$(tail -20 configure.log |grep "PETSC_ARCH:"|awk '{print $2}') && \
   export PETSC_DIR_VAR=$(tail -20 configure.log |grep "PETSC_DIR:"|awk '{print $2}') && \
   make PETSC_DIR="$PETSC_DIR_VAR" PETSC_ARCH="$PETSC_ARCH_VAR" all && \
   make PETSC_DIR="$PETSC_DIR_VAR" PETSC_ARCH="$PETSC_ARCH_VAR" install && \
   touch "${petsc_rep_dest}/hpclibs.sh" && \
   echo -e "source $MKLROOT/env/vars.sh  intel64\n\
   source ${MPIdir}/mpilibs.sh\n\
   export PETSC_DIR=${petsc_rep_dest}\n\
   export PETSC_ARCH=\"\"\n\
   export LD_LIBRARY_PATH=\${PETSC_DIR}/lib:\${LD_LIBRARY_PATH}\n\
   export PATH=\${PETSC_DIR}/bin:\${MPIdir}/bin:\${PATH}\n" >> "${petsc_rep_dest}/hpclibs.sh"

ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
    OMPI_ALLOW_RUN_AS_ROOT=1

RUN source ${petsc_rep_dest}/hpclibs.sh \
&& \
cd /petsc \
&& \
export PETSC_ARCH_VAR=$(tail -20 configure.log |grep "PETSC_ARCH:"|awk '{print $2}') \
&& \
export PETSC_DIR_VAR=$(tail -20 configure.log |grep "PETSC_DIR:"|awk '{print $2}') \
&& \
make PETSC_DIR="$PETSC_DIR_VAR" PETSC_ARCH="$PETSC_ARCH_VAR" test |& tee make_test.log

# Run a command when the image starts.
#CMD ["cd /petsc; echo "You can source ${petsc_rep_dest}/hpclibs.sh to use PETSc"]

