[petsc-dev] Petsc "make test" have more failures for --with-openmp=1
Eric Chamberland
Eric.Chamberland at giref.ulaval.ca
Fri Mar 12 13:54:53 CST 2021
Hi Pierre,
I now have a docker container reproducing the problems here.
Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm it fails
like this:
not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
# Initial guess
# L_2 Error: 0.00803099
# Initial Residual
# L_2 Residual: 1.09057
# Au - b = Au + F(0)
# Linear L_2 Residual: 1.09057
# [d470c54ce086:14127] Read -1, expected 4096, errno = 1
# [d470c54ce086:14128] Read -1, expected 4096, errno = 1
# [d470c54ce086:14129] Read -1, expected 4096, errno = 1
# [3]PETSC ERROR:
------------------------------------------------------------------------
# [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation
Violation, probably memory access out of range
# [3]PETSC ERROR: Try option -start_in_debugger or
-on_error_attach_debugger
# [3]PETSC ERROR: or see
https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
# [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and
Apple Mac OS X to find memory corruption errors
# [3]PETSC ERROR: likely location of problem given in stack below
# [3]PETSC ERROR: --------------------- Stack Frames
------------------------------------
# [3]PETSC ERROR: Note: The EXACT line numbers in the stack are
not available,
# [3]PETSC ERROR: INSTEAD the line number of the start of
the function
# [3]PETSC ERROR: is given.
# [3]PETSC ERROR: [3] buildTwo line 987
/opt/petsc-main/include/HPDDM_schwarz.hpp
# [3]PETSC ERROR: [3] next line 1130
/opt/petsc-main/include/HPDDM_schwarz.hpp
# [3]PETSC ERROR: --------------------- Error Message
--------------------------------------------------------------
# [3]PETSC ERROR: Signal received
# [3]PETSC ERROR: [0]PETSC ERROR:
------------------------------------------------------------------------
ex12_quad_hpddm_reuse_baij also fails, with many more "Read -1, expected
..." messages; I don't know where they come from.
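A possible origin for these messages, stated as an assumption rather than a confirmed diagnosis: Open MPI's shared-memory transport (the "vader" BTL in the 4.x series) can use Cross-Memory Attach, which the openmpi image enables with --with-cma, and the underlying process_vm_readv call is commonly denied inside an unprivileged Docker container (errno = 1 is EPERM on Linux). A minimal sketch for testing that hypothesis by switching the single-copy mechanism off before running the tests:

```shell
# Assumption: Open MPI 4.x, where the shared-memory BTL is named "vader".
# Disable the CMA single-copy path so intra-node messages fall back to
# the copy-in/copy-out path through shared memory.
export OMPI_MCA_btl_vader_single_copy_mechanism=none
echo "btl_vader_single_copy_mechanism=${OMPI_MCA_btl_vader_single_copy_mechanism}"
```

In the petsc Dockerfile this would be an ENV entry placed before the test step. For an interactive run, starting the container with `docker run --cap-add=SYS_PTRACE` is another commonly used workaround that keeps CMA enabled. If the messages persist with either change, the assumption above is wrong.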
Hypre (like in diff-snes_tutorials-ex56_hypre) is also having
DIVERGED_INDEFINITE_PC failures...
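To compare failure lists between machines, the failing test names can be pulled out of a saved log such as the make_test.log written by the petsc Dockerfile. A minimal sketch, assuming the TAP-like "ok" / "not ok" lines shown in this thread; the function name `failed_tests` and the sample log are illustrative only:

```shell
# Self-contained demo: a tiny sample log built from lines quoted in this thread.
log=$(mktemp)
cat > "$log" <<'EOF'
ok snes_tutorials-ex19_tut_3
not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
# [3]PETSC ERROR: Signal received
not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran required for this test
EOF

# Print the name of each failing test: the third field of lines
# whose first two fields are "not ok".
failed_tests() {
    awk '$1 == "not" && $2 == "ok" { print $3 }' "$1"
}

failed_tests "$log"
```

Running `failed_tests make_test.log | sort` on two machines and diffing the results would show which failures are specific to one environment.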
Please see the 3 attached Dockerfiles:
1) fedora_mkl_and_devtools: the Dockerfile which installs Fedora 33 with
GNU compilers, MKL, and everything needed for development.
2) openmpi: the Dockerfile to build OpenMPI
3) petsc: the last Dockerfile, which builds, installs, and tests PETSc
I build the 3 like this:
docker build -t fedora_mkl_and_devtools -f fedora_mkl_and_devtools .
docker build -t openmpi -f openmpi .
docker build -t petsc -f petsc .
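Since each image is built FROM the previous one, the order of the three builds matters. A minimal sketch of a wrapper that runs them in order and stops at the first failure; the function name `build_images` and the `DRY_RUN` switch are illustrative additions, not part of the attached files:

```shell
# Build the three images in dependency order.
# With DRY_RUN=1 the commands are only printed, so the script can be
# checked on a machine without docker.
build_images() {
    local image
    for image in fedora_mkl_and_devtools openmpi petsc; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "docker build -t ${image} -f ${image} ."
        else
            # Stop immediately if one image fails to build.
            docker build -t "${image}" -f "${image}" . || return 1
        fi
    done
}

# Show what would be executed.
DRY_RUN=1 build_images
```

Calling `build_images` without DRY_RUN from the directory holding the three files performs the actual builds.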
Disclaimer: I am not a Docker expert, so I may do things that are not
state-of-the-art Docker practice, but I am open to suggestions... ;)
I have just run it on my laptop (which took a long time); it does not have
enough cores, so many more tests failed (I should force --oversubscribe but
don't know how to). I will relaunch on my workstation in a few minutes.
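On forcing oversubscription: a minimal sketch, assuming Open MPI 4.x (the openmpi Dockerfile pins openmpi-4.1.0), where the --oversubscribe switch has an MCA-parameter equivalent that can be set through the environment, so the mpiexec command line used by the test harness does not have to change. Whether this is sufficient for every test has not been verified here.

```shell
# Assumption: Open MPI 4.x; later major versions use a different runtime
# and different parameter names.
# Environment equivalent of passing --oversubscribe to mpiexec/mpirun.
export OMPI_MCA_rmaps_base_oversubscribe=1
echo "rmaps_base_oversubscribe=${OMPI_MCA_rmaps_base_oversubscribe}"
```

In the petsc Dockerfile this could sit next to the existing OMPI_ALLOW_RUN_AS_ROOT settings as another ENV entry.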
I will now test your branch! (sorry for the delay).
Thanks,
Eric
On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
>
> Hi Pierre,
>
> ok, that's interesting!
>
> I will try to build a docker image by tomorrow and give you the
> exact recipe to reproduce the bugs.
>
> Eric
>
>
> On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
>>
>>
>>> On 11 Mar 2021, at 6:16 AM, Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>
>>> Eric,
>>>
>>> Sorry about not being more immediate. We still have this in our
>>> active email so you don't need to submit individual issues. We'll
>>> try to get to them as soon as we can.
>>
>> Indeed, I’m still trying to figure this out.
>> I realized that some of my configure flags were different than yours,
>> e.g., no --with-memalign.
>> I’ve also added SuperLU_DIST to my installation.
>> Still, I can’t reproduce any issue.
>> I will continue looking into this, it appears I’m seeing some
>> valgrind errors, but I don’t know if this is some side effect of
>> OpenMPI not being valgrind-clean (last time I checked, there was no
>> error with MPICH).
>>
>> Thank you for your patience,
>> Pierre
>>
>> /usr/bin/gmake -f gmakefile test test-fail=1
>> Using MAKEFLAGS: test-fail=1
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
>> ok snes_tutorials-ex12_quad_hpddm_reuse_baij
>> ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
>> ok ksp_ksp_tests-ex33_superlu_dist_2
>> ok diff-ksp_ksp_tests-ex33_superlu_dist_2
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
>> ok ksp_ksp_tutorials-ex50_tut_2
>> ok diff-ksp_ksp_tutorials-ex50_tut_2
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
>> ok ksp_ksp_tests-ex33_superlu_dist
>> ok diff-ksp_ksp_tests-ex33_superlu_dist
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
>> ok snes_tutorials-ex56_hypre
>> ok diff-snes_tutorials-ex56_hypre
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
>> ok ksp_ksp_tutorials-ex56_2
>> ok diff-ksp_ksp_tutorials-ex56_2
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
>> ok snes_tutorials-ex17_3d_q3_trig_elas
>> ok diff-snes_tutorials-ex17_3d_q3_trig_elas
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
>> ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>> ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
>> not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
>> #srun: error: Unable to create step for job 1426755: More processors
>> requested than permitted
>> ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command failed so no diff
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
>> ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran required for
>> this test
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
>> ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
>> ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
>> ok snes_tutorials-ex19_tut_3
>> ok diff-snes_tutorials-ex19_tut_3
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
>> ok snes_tutorials-ex17_3d_q3_trig_vlap
>> ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
>> ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran required for
>> this test
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
>> ok snes_tutorials-ex19_superlu_dist
>> ok diff-snes_tutorials-ex19_superlu_dist
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
>> ok snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>> ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
>> ok ksp_ksp_tutorials-ex49_hypre_nullspace
>> ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
>> ok snes_tutorials-ex19_superlu_dist_2
>> ok diff-snes_tutorials-ex19_superlu_dist_2
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
>> not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
>> #srun: error: Unable to create step for job 1426755: More processors
>> requested than permitted
>> ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command failed so no diff
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
>> ok snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>> ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
>> ok ksp_ksp_tutorials-ex64_1
>> ok diff-ksp_ksp_tutorials-ex64_1
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
>> not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
>> #srun: error: Unable to create step for job 1426755: More processors
>> requested than permitted
>> ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command failed so no diff
>> TEST
>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
>> ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran required for
>> this test
>>
>>> Barry
>>>
>>>
>>>> On Mar 10, 2021, at 11:03 PM, Eric Chamberland
>>>> <Eric.Chamberland at giref.ulaval.ca> wrote:
>>>>
>>>> Barry,
>>>>
>>>> to get some follow-up on --with-openmp=1 failures, shall I open
>>>> gitlab issues for:
>>>>
>>>> a) all hypre failures giving DIVERGED_INDEFINITE_PC
>>>>
>>>> b) all superlu_dist failures giving different results with initia
>>>> and "Exceeded timeout limit of 60 s"
>>>>
>>>> c) hpddm failures "free(): invalid next size (fast)" and
>>>> "Segmentation Violation"
>>>>
>>>> d) all tao's "Exceeded timeout limit of 60 s"
>>>>
>>>> I don't see how I could do all this debugging by myself...
>>>>
>>>> Thanks,
>>>>
>>>> Eric
>>>>
>>>>
>>>
>>
> --
> Eric Chamberland, ing., M. Ing
> Professionnel de recherche
> GIREF/Université Laval
> (418) 656-2131 poste 41 22 42
--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
-------------- next part --------------
# Base image.
FROM fedora:33
SHELL ["/bin/bash", "-c"]
WORKDIR /
# Intel oneAPI repo configuration and other packages for compiling OpenMPI and PETSc:
# (see https://software.intel.com/content/www/us/en/develop/articles/installing-intel-oneapi-toolkits-via-yum.html)
## Set the time zone in the container:
ENV TZ=America/New_York
RUN \
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime \
&& \
echo $TZ > /etc/timezone \
&& \
echo "LC_ALL=en_US.UTF-8" >> /etc/environment \
&& \
echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen \
&& \
echo "LANG=en_US.UTF-8" > /etc/locale.conf \
&& \
source /etc/locale.conf \
&& \
echo -e "\
[oneAPI]\n\
name=Intel(R) oneAPI repository\n\
baseurl=https://yum.repos.intel.com/oneapi\n\
enabled=1\n\
gpgcheck=1\n\
repo_gpgcheck=1\n\
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB\n\
" > /etc/yum.repos.d/oneAPI.repo \
&&\
dnf install -y \
authconfig \
autoconf \
binutils \
bison \
blas-devel \
ccache \
clang \
cmake \
flex \
gcc-c++ \
gcc-gfortran \
gdb \
git \
glibc-langpack-en \
gnuplot \
intel-oneapi-mkl-devel \
libtool \
libtirpc-devel \
libXext-devel \
libX11-devel \
make \
nfs-utils \
numactl-libs \
numactl-devel \
nscd \
perl \
"perl(Data::Dumper)" \
pkg-config \
procps-ng \
python2 \
python2-six \
python \
screen \
tar \
time \
valgrind \
vim \
wget \
xorg-x11-apps \
&& \
dnf clean all
# intel-hpckit \
# Run a command when the image starts.
#CMD ["/bin/bash"]
-------------- next part --------------
# Base image.
FROM fedora_mkl_and_devtools:latest
SHELL ["/bin/bash", "-c"]
WORKDIR /
ARG ompi_ver=openmpi-4.1.0
ARG ompi_tar=${ompi_ver}.tar.gz
ARG ompi_rep_dest=/opt/${ompi_ver}
ENV MPIdir=${ompi_rep_dest}
RUN \
wget https://www.open-mpi.org/software/ompi/v4.1/downloads/${ompi_tar} \
&& \
tar -xvf ${ompi_tar} \
&& \
cd ${ompi_ver} \
&& \
./configure \
--prefix=${ompi_rep_dest} \
CXXFLAGS=-std=c++14 \
--with-wrapper-cxxflags='-std=c++14' \
--with-cma \
--enable-mpi1-compatibility \
&& \
make -j8 \
&& \
make install \
&& \
echo -e "\
export MPIdir=${ompi_rep_dest}\n\
export LD_LIBRARY_PATH=\${MPIdir}/lib:\${LD_LIBRARY_PATH}\n\
export PATH=\${MPIdir}/bin:\${PATH}" > ${ompi_rep_dest}/mpilibs.sh
# Run a command when the image starts.
#CMD ["/bin/bash"]
-------------- next part --------------
# Base image.
FROM openmpi:latest
SHELL ["/bin/bash", "-c"]
WORKDIR /
ARG petsc_branch=main
ARG petsc_ver=petsc-${petsc_branch}
ARG petsc_rep_dest=/opt/${petsc_ver}
RUN \
git clone https://gitlab.com/petsc/petsc.git -b ${petsc_branch}
RUN \
source /opt/intel/oneapi/mkl/latest/env/vars.sh intel64 && \
source ${MPIdir}/mpilibs.sh && \
cd petsc && \
./configure \
--prefix=${petsc_rep_dest} \
--with-mpi-compilers=1 --with-mpi-dir=${MPIdir} \
--download-ml=yes \
--download-mumps=yes \
--download-superlu=yes \
--with-cxx-dialect=C++14 \
--with-make-np=12 \
--with-shared-libraries=1 \
--with-debugging=1 \
--with-memalign=64 \
--with-visibility=0 \
--with-openmp=1 \
--with-64-bit-indices=0 \
--download-hpddm=yes \
--download-slepc=yes \
--download-superlu_dist=yes \
--download-parmetis=yes \
--download-ptscotch=yes \
--download-metis=yes \
--download-strumpack=yes \
--download-suitesparse=yes \
--download-hypre=yes \
--with-blaslapack-dir="$MKLROOT/lib/intel64" \
--with-mkl_pardiso-dir="$MKLROOT" \
--with-mkl_cpardiso-dir="$MKLROOT" \
--with-scalapack=1 \
--with-scalapack-include="$MKLROOT/include" \
--with-scalapack-lib="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" \
&& \
export PETSC_ARCH_VAR=$(tail -20 configure.log |grep "PETSC_ARCH:"|awk '{print $2}') && \
export PETSC_DIR_VAR=$(tail -20 configure.log |grep "PETSC_DIR:"|awk '{print $2}') && \
make PETSC_DIR="$PETSC_DIR_VAR" PETSC_ARCH="$PETSC_ARCH_VAR" all && \
make PETSC_DIR="$PETSC_DIR_VAR" PETSC_ARCH="$PETSC_ARCH_VAR" install && \
touch "${petsc_rep_dest}/hpclibs.sh" && \
echo -e "source $MKLROOT/env/vars.sh intel64\n\
source ${MPIdir}/mpilibs.sh\n\
export PETSC_DIR=${petsc_rep_dest}\n\
export PETSC_ARCH=\"\"\n\
export LD_LIBRARY_PATH=\${PETSC_DIR}/lib:\${LD_LIBRARY_PATH}\n\
export PATH=\${PETSC_DIR}/bin:\${MPIdir}/bin:\${PATH}\n" >> "${petsc_rep_dest}/hpclibs.sh"
ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
OMPI_ALLOW_RUN_AS_ROOT=1
RUN source ${petsc_rep_dest}/hpclibs.sh \
&& \
cd /petsc \
&& \
export PETSC_ARCH_VAR=$(tail -20 configure.log |grep "PETSC_ARCH:"|awk '{print $2}') \
&& \
export PETSC_DIR_VAR=$(tail -20 configure.log |grep "PETSC_DIR:"|awk '{print $2}') \
&& \
make PETSC_DIR="$PETSC_DIR_VAR" PETSC_ARCH="$PETSC_ARCH_VAR" test |& tee make_test.log
# Run a command when the image starts.
#CMD ["cd /petsc; echo "You can source ${petsc_rep_dest}/hpclibs.sh to use PETSc"]