[petsc-dev] Petsc "make test" have more failures for --with-openmp=1

Mark Adams mfadams at lbl.gov
Sat Mar 13 07:26:10 CST 2021


Hypre uses a multiplicative smoother by default. It also has a Chebyshev
smoother; that, combined with a Jacobi PC, should be thread invariant.
Mark
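
To illustrate the point about thread invariance, here is a toy sketch (not hypre's implementation; the matrix, the row orderings, and all function names are assumptions made up for this example). A Jacobi sweep reads only the previous iterate, so the order in which rows are processed cannot change the result. A multiplicative (Gauss-Seidel style) sweep reads values updated earlier in the same sweep, so its result depends on the row order, which is what a change in thread count can alter.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// One Jacobi sweep: every row reads only the old iterate x,
// so the processing order of the rows does not matter.
Vec jacobi(const Mat& A, const Vec& b, const Vec& x, const std::vector<int>& order) {
    Vec y(x.size());
    for (int i : order) {
        assert(A[i][i] != 0.0);
        double s = b[i];
        for (std::size_t j = 0; j < x.size(); ++j)
            if (static_cast<int>(j) != i) s -= A[i][j] * x[j];
        y[i] = s / A[i][i];
    }
    return y;
}

// One multiplicative (Gauss-Seidel) sweep: x is updated in place,
// so later rows see values already modified in this sweep.
Vec gaussSeidel(const Mat& A, const Vec& b, Vec x, const std::vector<int>& order) {
    for (int i : order) {
        assert(A[i][i] != 0.0);
        double s = b[i];
        for (std::size_t j = 0; j < x.size(); ++j)
            if (static_cast<int>(j) != i) s -= A[i][j] * x[j];
        x[i] = s / A[i][i];
    }
    return x;
}

// Largest entry-wise difference between two vectors of equal length.
double maxDiff(const Vec& u, const Vec& v) {
    double d = 0.0;
    for (std::size_t i = 0; i < u.size(); ++i) d = std::fmax(d, std::fabs(u[i] - v[i]));
    return d;
}
```

For the 4x4 tridiagonal matrix with stencil (-1, 2, -1), a right-hand side of ones, and a zero initial guess, sweeping the rows as 0,1,2,3 or as 2,3,0,1 gives identical Jacobi results, while the two Gauss-Seidel results differ by 0.375 in the max norm.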

On Sat, Mar 13, 2021 at 8:18 AM Pierre Jolivet <pierre at joliv.et> wrote:

>
> On 13 Mar 2021, at 9:17 AM, Pierre Jolivet <pierre at joliv.et> wrote:
>
> Hello Eric,
> I’ve made an “interesting” discovery, so I’ll put the list back in Cc.
> It appears that the following snippet of code, which uses Allreduce() with a
> lambda function and MPI_IN_PLACE, is:
> - Valgrind-clean with MPICH;
> - Valgrind-clean with OpenMPI 4.0.5;
> - not Valgrind-clean with OpenMPI 4.1.0.
> I’m not sure who is to blame here. I’ll need to check what the MPI
> specification requires of implementors and users in that
> case.
>
> In the meantime, I’ll do the following:
> - update config/BuildSystem/config/packages/OpenMPI.py to use OpenMPI
> 4.1.0, see if any other error appears;
> - provide a hotfix to bypass the segfaults;
>
>
> I can confirm that splitting the single Allreduce with my own MPI_Op into
> two Allreduce with MAX and BAND fixes the segfaults with OpenMPI (*).
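
A rough sketch of why the split is equivalent, simulated without MPI so it runs anywhere. The buffer layout (leading entries reduced with max, one trailing bit mask reduced with AND) and all names are hypothetical and are not taken from HPDDM or from ompi.cxx. In real code the two passes would be two MPI_Allreduce calls with MPI_MAX and MPI_BAND instead of one call with a user-defined MPI_Op.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: all entries but the last are reduced with max,
// the last entry is a bit mask reduced with bitwise AND.
using Buffer = std::vector<unsigned short>;

// Fused operator, playing the role of a user-defined MPI_Op:
// combine `in` into `inout` element by element.
void fusedOp(const Buffer& in, Buffer& inout) {
    assert(in.size() == inout.size() && !in.empty());
    const std::size_t last = in.size() - 1;
    for (std::size_t i = 0; i < last; ++i) inout[i] = std::max(inout[i], in[i]);
    inout[last] &= in[last];
}

// Single reduction over all rank buffers with the fused operator.
Buffer reduceFused(const std::vector<Buffer>& ranks) {
    assert(!ranks.empty());
    Buffer out = ranks.front();
    for (std::size_t r = 1; r < ranks.size(); ++r) fusedOp(ranks[r], out);
    return out;
}

// Split version: one max pass over the leading entries (stand-in for an
// Allreduce with MPI_MAX), then one AND pass over the trailing entry
// (stand-in for an Allreduce with MPI_BAND).
Buffer reduceSplit(const std::vector<Buffer>& ranks) {
    assert(!ranks.empty() && !ranks.front().empty());
    Buffer out = ranks.front();
    const std::size_t last = out.size() - 1;
    for (std::size_t r = 1; r < ranks.size(); ++r)
        for (std::size_t i = 0; i < last; ++i) out[i] = std::max(out[i], ranks[r][i]);
    for (std::size_t r = 1; r < ranks.size(); ++r) out[last] &= ranks[r][last];
    return out;
}
```

Both functions return the same buffer for any input, since max and AND act on disjoint entries. The split costs one extra collective but relies only on predefined operations.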
>
> - look at the hypre issues and whether they should be deferred to the hypre
> team.
>
>
> I don’t know if there is something wrong in hypre threading or if it’s
> just a side effect of threading, but it seems that the number of threads
> has a drastic effect on the quality of the PC.
> By default, it looks like there are two threads per process with your
> Docker image.
> If I force OMP_NUM_THREADS=1, then I get the same convergence as in the
> output file.
>
> Thanks,
> Pierre
>
> (*) https://gitlab.com/petsc/petsc/-/merge_requests/3712
>
> Thank you for the Docker files, they were really useful.
> If you want to avoid oversubscription failures, you can edit the file
> /opt/openmpi-4.1.0/etc/openmpi-default-hostfile and append the line:
> localhost slots=12
> If you want to increase the per-test timeout of the PETSc test suite,
> you can add the extra flag TIMEOUT=180 to your command line (the default
> is 60, units are seconds).
>
> Thanks, I’ll ping you on GitLab when I’ve got something ready for you to
> try,
> Pierre
>
> <ompi.cxx>
>
> On 12 Mar 2021, at 8:54 PM, Eric Chamberland <
> Eric.Chamberland at giref.ulaval.ca> wrote:
>
> Hi Pierre,
>
> I now have a docker container reproducing the problems here.
>
> Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm  it fails
> like this:
>
> not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
> #       Initial guess
> #       L_2 Error: 0.00803099
> #       Initial Residual
> #       L_2 Residual: 1.09057
> #       Au - b = Au + F(0)
> #       Linear L_2 Residual: 1.09057
> #       [d470c54ce086:14127] Read -1, expected 4096, errno = 1
> #       [d470c54ce086:14128] Read -1, expected 4096, errno = 1
> #       [d470c54ce086:14129] Read -1, expected 4096, errno = 1
> #       [3]PETSC ERROR:
> ------------------------------------------------------------------------
> #       [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation
> Violation, probably memory access out of range
> #       [3]PETSC ERROR: Try option -start_in_debugger or
> -on_error_attach_debugger
> #       [3]PETSC ERROR: or see
> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> #       [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple
> Mac OS X to find memory corruption errors
> #       [3]PETSC ERROR: likely location of problem given in stack below
> #       [3]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> #       [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> #       [3]PETSC ERROR:       INSTEAD the line number of the start of the
> function
> #       [3]PETSC ERROR:       is given.
> #       [3]PETSC ERROR: [3] buildTwo line 987
> /opt/petsc-main/include/HPDDM_schwarz.hpp
> #       [3]PETSC ERROR: [3] next line 1130
> /opt/petsc-main/include/HPDDM_schwarz.hpp
> #       [3]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> #       [3]PETSC ERROR: Signal received
> #       [3]PETSC ERROR: [0]PETSC ERROR:
> ------------------------------------------------------------------------
>
> Also, ex12_quad_hpddm_reuse_baij fails with many more "Read -1, expected
> ..." messages, and I don't know where they come from.
>
> Hypre (like in diff-snes_tutorials-ex56_hypre)  is also having
> DIVERGED_INDEFINITE_PC failures...
>
> Please see the 3 attached docker files:
>
> 1) fedora_mkl_and_devtools: the Dockerfile which installs Fedora 33 with
> GNU compilers, MKL, and everything needed to develop.
>
> 2) openmpi: the Dockerfile to build OpenMPI
>
> 3) petsc: the last Dockerfile, which builds, installs, and tests PETSc
>
> I build the 3 like this:
>
> docker build -t fedora_mkl_and_devtools -f fedora_mkl_and_devtools .
>
> docker build -t openmpi -f openmpi .
>
> docker build -t petsc -f petsc .
>
> Disclaimer: I am not a docker expert, so I may do things that are not
> docker-state-of-the-art, but I am open to suggestions... ;)
>
> I have just run it on my laptop (it took a long time), which does not have
> enough cores, so many more tests failed (I should force --oversubscribe but
> don't know how to).  I will relaunch on my workstation in a few minutes.
>
> I will now test your branch! (sorry for the delay).
>
> Thanks,
>
> Eric
> On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
>
> Hi Pierre,
>
> ok, that's interesting!
>
> I will try to build a docker image by tomorrow and give you the exact
> recipe to reproduce the bugs.
>
> Eric
>
>
> On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
>
>
>
> On 11 Mar 2021, at 6:16 AM, Barry Smith <bsmith at petsc.dev> wrote:
>
>
>   Eric,
>
>    Sorry about not being more immediate. We still have this in our active
> email so you don't need to submit individual issues. We'll try to get to
> them as soon as we can.
>
>
> Indeed, I’m still trying to figure this out.
> I realized that some of my configure flags were different than yours,
> e.g., no --with-memalign.
> I’ve also added SuperLU_DIST to my installation.
> Still, I can’t reproduce any issue.
> I will continue looking into this. It appears I’m seeing some valgrind
> errors, but I don’t know if this is some side effect of OpenMPI not being
> valgrind-clean (last time I checked, there was no error with MPICH).
>
> Thank you for your patience,
> Pierre
>
> /usr/bin/gmake -f gmakefile test test-fail=1
> Using MAKEFLAGS: test-fail=1
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
>  ok snes_tutorials-ex12_quad_hpddm_reuse_baij
>  ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
>  ok ksp_ksp_tests-ex33_superlu_dist_2
>  ok diff-ksp_ksp_tests-ex33_superlu_dist_2
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
>  ok ksp_ksp_tutorials-ex50_tut_2
>  ok diff-ksp_ksp_tutorials-ex50_tut_2
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
>  ok ksp_ksp_tests-ex33_superlu_dist
>  ok diff-ksp_ksp_tests-ex33_superlu_dist
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
>  ok snes_tutorials-ex56_hypre
>  ok diff-snes_tutorials-ex56_hypre
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
>  ok ksp_ksp_tutorials-ex56_2
>  ok diff-ksp_ksp_tutorials-ex56_2
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
>  ok snes_tutorials-ex17_3d_q3_trig_elas
>  ok diff-snes_tutorials-ex17_3d_q3_trig_elas
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
>  ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>  ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
> not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
> # srun: error: Unable to create step for job 1426755: More processors
> requested than permitted
>  ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command failed so no diff
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
>  ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran required for this
> test
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
>  ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
>  ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
>  ok snes_tutorials-ex19_tut_3
>  ok diff-snes_tutorials-ex19_tut_3
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
>  ok snes_tutorials-ex17_3d_q3_trig_vlap
>  ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
>  ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran required for this
> test
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
>  ok snes_tutorials-ex19_superlu_dist
>  ok diff-snes_tutorials-ex19_superlu_dist
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
>  ok snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>  ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
>  ok ksp_ksp_tutorials-ex49_hypre_nullspace
>  ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
>  ok snes_tutorials-ex19_superlu_dist_2
>  ok diff-snes_tutorials-ex19_superlu_dist_2
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
> not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
> # srun: error: Unable to create step for job 1426755: More processors
> requested than permitted
>  ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command failed so no diff
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
>  ok snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>  ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
>  ok ksp_ksp_tutorials-ex64_1
>  ok diff-ksp_ksp_tutorials-ex64_1
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
> not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
> # srun: error: Unable to create step for job 1426755: More processors
> requested than permitted
>  ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command failed so no diff
>         TEST
> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
>  ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran required for this
> test
>
>    Barry
>
>
> On Mar 10, 2021, at 11:03 PM, Eric Chamberland <
> Eric.Chamberland at giref.ulaval.ca> wrote:
>
> Barry,
>
> to get some follow-up on the --with-openmp=1 failures, shall I open GitLab
> issues for:
>
> a) all hypre failures giving DIVERGED_INDEFINITE_PC
>
> b) all superlu_dist failures giving different results with initia and
> "Exceeded timeout limit of 60 s"
>
> c) hpddm failures "free(): invalid next size (fast)" and "Segmentation
> Violation"
>
> d) all tao's "Exceeded timeout limit of 60 s"
>
> I don't see how I could do all this debugging by myself...
>
> Thanks,
>
> Eric
>
>
>
> --
> Eric Chamberland, ing., M. Ing
> Professionnel de recherche
> GIREF/Université Laval
> (418) 656-2131 poste 41 22 42
>
> <fedora_mkl_and_devtools.txt><openmpi.txt><petsc.txt>