[petsc-dev] Petsc "make test" have more failures for --with-openmp=1

Eric Chamberland Eric.Chamberland at giref.ulaval.ca
Sat Mar 13 18:44:56 CST 2021


For us it clearly creates problems in real computations...

I understand the need to have clean tests for PETSc, but for me, it
reveals that hypre isn't usable with more than one thread for now...

Another solution:  force single-threaded configuration for hypre until 
this is fixed?

Eric

On 2021-03-13 8:50 a.m., Pierre Jolivet wrote:
> -pc_hypre_boomeramg_relax_type_all Jacobi =>
>   Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 3
> -pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi =>
> OK, independently of the architecture it seems (Eric's Docker image with
> 1 or 2 threads, or my macOS), but the contraction factor is higher:
>   Linear solve converged due to CONVERGED_RTOL iterations 8
>   Linear solve converged due to CONVERGED_RTOL iterations 24
>   Linear solve converged due to CONVERGED_RTOL iterations 26
> versus currently:
>   Linear solve converged due to CONVERGED_RTOL iterations 7
>   Linear solve converged due to CONVERGED_RTOL iterations 9
>   Linear solve converged due to CONVERGED_RTOL iterations 10
>
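The iteration counts quoted above can be compared mechanically. The following is a minimal sketch, assuming the lines come from PETSc's `-ksp_converged_reason` output in the form shown in this thread; the helper names `parse_solves` and `total_iterations` are hypothetical and not part of PETSc.

```python
import re

# Matches converged-reason lines in the form quoted in this thread, e.g.
# "Linear solve converged due to CONVERGED_RTOL iterations 8".
REASON_RE = re.compile(
    r"Linear solve (converged|did not converge) due to (\w+) iterations (\d+)"
)


def parse_solves(text):
    """Return one (converged, reason, iterations) tuple per linear solve."""
    solves = []
    for match in REASON_RE.finditer(text):
        status, reason, iterations = match.groups()
        solves.append((status == "converged", reason, int(iterations)))
    return solves


def total_iterations(text):
    """Sum the iterations of all solves, or return None if any solve diverged."""
    solves = parse_solves(text)
    if not all(converged for converged, _, _ in solves):
        return None
    return sum(iterations for _, _, iterations in solves)


# The three outputs reported in the message above.
JACOBI = "Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 3"
L1_SCALED_JACOBI = """
  Linear solve converged due to CONVERGED_RTOL iterations 8
  Linear solve converged due to CONVERGED_RTOL iterations 24
  Linear solve converged due to CONVERGED_RTOL iterations 26
"""
CURRENT_DEFAULT = """
  Linear solve converged due to CONVERGED_RTOL iterations 7
  Linear solve converged due to CONVERGED_RTOL iterations 9
  Linear solve converged due to CONVERGED_RTOL iterations 10
"""
```

With these inputs, `total_iterations` gives 58 for l1scaled-Jacobi against 26 for the current setting, and `None` for plain Jacobi, which is the trade-off behind the question that follows.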
> Do we change this? Or should we force OMP_NUM_THREADS=1 for make test?
>
> Thanks,
> Pierre
>
>> On 13 Mar 2021, at 2:26 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>> Hypre uses a multiplicative smoother by default. It also has a Chebyshev
>> smoother; that, with a Jacobi PC, should be thread invariant.
>> Mark
>>
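To illustrate why a multiplicative smoother can depend on the thread count while Jacobi does not, here is a toy sketch. It is not hypre's implementation: it assumes each thread owns a contiguous block of rows, applies Gauss-Seidel inside its block, and only sees values from the previous sweep for rows owned by other threads. The function names and the small 1D Laplacian are illustrative assumptions.

```python
def jacobi_sweep(A, b, x):
    """One Jacobi sweep: every row uses only values from the previous sweep."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def hybrid_gauss_seidel_sweep(A, b, x, num_blocks):
    """Gauss-Seidel inside each row block, Jacobi coupling between blocks.

    `num_blocks` plays the role of the number of threads in this toy model.
    """
    n = len(b)
    block_size = -(-n // num_blocks)  # ceiling division
    new = list(x)
    for start in range(0, n, block_size):
        rows = range(start, min(start + block_size, n))
        for i in rows:
            acc = b[i]
            for j in range(n):
                if j == i:
                    continue
                # Same block: latest available value. Other blocks: old value.
                value = new[j] if j in rows else x[j]
                acc -= A[i][j] * value
            new[i] = acc / A[i][i]
    return new


# 1D Laplacian on four unknowns, zero initial guess.
A = [[2, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 2]]
b = [1.0, 1.0, 1.0, 1.0]
x0 = [0.0, 0.0, 0.0, 0.0]

one_block = hybrid_gauss_seidel_sweep(A, b, x0, 1)   # plain Gauss-Seidel
two_blocks = hybrid_gauss_seidel_sweep(A, b, x0, 2)  # "two threads"
```

In this model the smoother applied with two blocks is a different operator from the one applied with a single block, so the resulting preconditioner changes with the thread count. The Jacobi sweep has no such dependence because it never reads a value updated in the same sweep.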
>> On Sat, Mar 13, 2021 at 8:18 AM Pierre Jolivet <pierre at joliv.et> wrote:
>>
>>
>>>     On 13 Mar 2021, at 9:17 AM, Pierre Jolivet <pierre at joliv.et> wrote:
>>>
>>>     Hello Eric,
>>>     I’ve made an “interesting” discovery, so I’ll put the list back
>>>     in Cc.
>>>     It appears that the following snippet of code, which uses
>>>     Allreduce() + lambda function + MPI_IN_PLACE, is:
>>>     - Valgrind-clean with MPICH;
>>>     - Valgrind-clean with OpenMPI 4.0.5;
>>>     - not Valgrind-clean with OpenMPI 4.1.0.
>>>     I’m not sure who is to blame here, I’ll need to look at the MPI
>>>     specification for what is required by the implementors and users
>>>     in that case.
>>>
>>>     In the meantime, I’ll do the following:
>>>     - update config/BuildSystem/config/packages/OpenMPI.py to use
>>>     OpenMPI 4.1.0, see if any other error appears;
>>>     - provide a hotfix to bypass the segfaults;
>>
>>     I can confirm that splitting the single Allreduce with my own
>>     MPI_Op into two Allreduce with MAX and BAND fixes the segfaults
>>     with OpenMPI (*).
>>
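The equivalence behind this workaround can be modelled without MPI. The sketch below is a pure-Python stand-in, not the HPDDM code: it assumes, hypothetically, that each process contributes a pair (value, flags) and that the user-defined operation takes the maximum of the first entry and the bitwise AND of the second. The actual buffer layout and fix are in the merge request referenced as (*).

```python
from functools import reduce
from operator import and_


def combined_op(a, b):
    """Model of one user-defined reduction: MAX on entry 0, bitwise AND on entry 1."""
    return (max(a[0], b[0]), a[1] & b[1])


def allreduce_combined(contributions):
    """Single reduction with the combined operation."""
    return reduce(combined_op, contributions)


def allreduce_split(contributions):
    """Two separate reductions, one with MAX and one with BAND."""
    maximum = reduce(max, (c[0] for c in contributions))
    band = reduce(and_, (c[1] for c in contributions))
    return (maximum, band)


# One hypothetical (value, flags) pair per process.
contributions = [(3, 0b1110), (7, 0b0111), (5, 0b1111)]
```

Both functions return the same pair because the two entries are reduced independently of each other, so splitting the call changes the number of messages but not the result.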
>>>     - look at the hypre issue and whether they should be deferred to
>>>     the hypre team.
>>
>>     I don’t know if there is something wrong in hypre threading or if
>>     it’s just a side effect of threading, but it seems that the
>>     number of threads has a drastic effect on the quality of the PC.
>>     By default, it looks like there are two threads per process with
>>     your Docker image.
>>     If I force OMP_NUM_THREADS=1, then I get the same convergence as
>>     in the output file.
>>
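Forcing a single OpenMP thread for a test run can be scripted as follows. This is a minimal sketch; `single_thread_env` and `run_single_threaded` are hypothetical helper names, and only the `OMP_NUM_THREADS` variable itself comes from the discussion above.

```python
import os
import subprocess


def single_thread_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    return env


def run_single_threaded(argv):
    """Run `argv` with OMP_NUM_THREADS=1 and return the CompletedProcess."""
    return subprocess.run(
        argv,
        env=single_thread_env(),
        capture_output=True,
        text=True,
        check=False,
    )
```

The caller's environment is copied rather than modified, so a threaded run and a single-threaded run can be launched from the same script and their outputs compared.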
>>     Thanks,
>>     Pierre
>>
>>     (*) https://gitlab.com/petsc/petsc/-/merge_requests/3712
>>
>>>     Thank you for the Docker files, they were really useful.
>>>     If you want to avoid oversubscription failures, you can edit the
>>>     file /opt/openmpi-4.1.0/etc/openmpi-default-hostfile and append
>>>     the line:
>>>     localhost slots=12
>>>     If you want to increase the timeout limit of PETSc test suite
>>>     for each test, you can add the extra flag in your command line
>>>     TIMEOUT=180 (default is 60, units are seconds).
>>>
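The two adjustments just described can be sketched as below. The hostfile line and the `TIMEOUT=180` flag are taken from the message above; the helper names are hypothetical, and the `make -f gmakefile test` form is an assumption based on the `gmake -f gmakefile test` invocation quoted later in this thread.

```python
import os


def ensure_hostfile_slots(path, line="localhost slots=12"):
    """Append `line` to the hostfile unless it is already there.

    Returns True if the file was modified, False otherwise.
    """
    content = ""
    if os.path.exists(path):
        with open(path) as handle:
            content = handle.read()
    if line in (entry.strip() for entry in content.splitlines()):
        return False
    with open(path, "a") as handle:
        # Start on a fresh line if the file does not end with a newline.
        if content and not content.endswith("\n"):
            handle.write("\n")
        handle.write(line + "\n")
    return True


def make_test_command(timeout_seconds=180):
    """Build the test command with a per-test timeout (the default is 60 s)."""
    return ["make", "-f", "gmakefile", "test", f"TIMEOUT={timeout_seconds}"]
```

Making the append idempotent matters when the function runs at every container start, since otherwise the hostfile would accumulate duplicate `localhost` entries.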
>>>     Thanks, I’ll ping you on GitLab when I’ve got something ready
>>>     for you to try,
>>>     Pierre
>>>
>>>     <ompi.cxx>
>>>
>>>>     On 12 Mar 2021, at 8:54 PM, Eric Chamberland
>>>>     <Eric.Chamberland at giref.ulaval.ca> wrote:
>>>>
>>>>     Hi Pierre,
>>>>
>>>>     I now have a docker container reproducing the problems here.
>>>>
>>>>     Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm
>>>>     it fails like this:
>>>>
>>>>     not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
>>>>     #       Initial guess
>>>>     #       L_2 Error: 0.00803099
>>>>     #       Initial Residual
>>>>     #       L_2 Residual: 1.09057
>>>>     #       Au - b = Au + F(0)
>>>>     #       Linear L_2 Residual: 1.09057
>>>>     #       [d470c54ce086:14127] Read -1, expected 4096, errno = 1
>>>>     #       [d470c54ce086:14128] Read -1, expected 4096, errno = 1
>>>>     #       [d470c54ce086:14129] Read -1, expected 4096, errno = 1
>>>>     #       [3]PETSC ERROR:
>>>>     ------------------------------------------------------------------------
>>>>     #       [3]PETSC ERROR: Caught signal number 11 SEGV:
>>>>     Segmentation Violation, probably memory access out of range
>>>>     #       [3]PETSC ERROR: Try option -start_in_debugger or
>>>>     -on_error_attach_debugger
>>>>     #       [3]PETSC ERROR: or see
>>>>     https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>     #       [3]PETSC ERROR: or try http://valgrind.org on GNU/linux
>>>>     and Apple Mac OS X to find memory corruption errors
>>>>     #       [3]PETSC ERROR: likely location of problem given in
>>>>     stack below
>>>>     #       [3]PETSC ERROR: ---------------------  Stack Frames
>>>>     ------------------------------------
>>>>     #       [3]PETSC ERROR: Note: The EXACT line numbers in the
>>>>     stack are not available,
>>>>     #       [3]PETSC ERROR: INSTEAD the line number of the start of
>>>>     the function
>>>>     #       [3]PETSC ERROR:       is given.
>>>>     #       [3]PETSC ERROR: [3] buildTwo line 987
>>>>     /opt/petsc-main/include/HPDDM_schwarz.hpp
>>>>     #       [3]PETSC ERROR: [3] next line 1130
>>>>     /opt/petsc-main/include/HPDDM_schwarz.hpp
>>>>     #       [3]PETSC ERROR: --------------------- Error Message
>>>>     --------------------------------------------------------------
>>>>     #       [3]PETSC ERROR: Signal received
>>>>     #       [3]PETSC ERROR: [0]PETSC ERROR:
>>>>     ------------------------------------------------------------------------
>>>>
>>>>     Also, ex12_quad_hpddm_reuse_baij fails with many more "Read -1,
>>>>     expected ..." messages, and I don't know where they come from.
>>>>
>>>>     Hypre (like in diff-snes_tutorials-ex56_hypre) is also having
>>>>     DIVERGED_INDEFINITE_PC failures...
>>>>
>>>>     Please see the 3 attached docker files:
>>>>
>>>>     1) fedora_mkl_and_devtools: the Dockerfile which installs
>>>>     Fedora 33 with GNU compilers, MKL, and everything needed to develop.
>>>>
>>>>     2) openmpi: the Dockerfile to build OpenMPI
>>>>
>>>>     3) petsc: the last Dockerfile, which builds, installs, and tests PETSc
>>>>
>>>>     I build the 3 like this:
>>>>
>>>>     docker build -t fedora_mkl_and_devtools -f fedora_mkl_and_devtools .
>>>>
>>>>     docker build -t openmpi -f openmpi .
>>>>
>>>>     docker build -t petsc -f petsc .
>>>>
>>>>     Disclaimer: I am not a Docker expert, so I may do things that
>>>>     are not Docker state-of-the-art, but I am open to suggestions... ;)
>>>>
>>>>     I have just run it on my laptop (it took a long time), which does
>>>>     not have enough cores, so many more tests failed (I should force
>>>>     --oversubscribe but don't know how to).  I will relaunch on my
>>>>     workstation in a few minutes.
>>>>
>>>>     I will now test your branch! (sorry for the delay).
>>>>
>>>>     Thanks,
>>>>
>>>>     Eric
>>>>
>>>>     On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
>>>>>
>>>>>     Hi Pierre,
>>>>>
>>>>>     ok, that's interesting!
>>>>>
>>>>>     I will try to build a Docker image by tomorrow and give you
>>>>>     the exact recipe to reproduce the bugs.
>>>>>
>>>>>     Eric
>>>>>
>>>>>
>>>>>     On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
>>>>>>
>>>>>>
>>>>>>>     On 11 Mar 2021, at 6:16 AM, Barry Smith <bsmith at petsc.dev> wrote:
>>>>>>>
>>>>>>>
>>>>>>>       Eric,
>>>>>>>
>>>>>>>        Sorry about not being more immediate. We still have this
>>>>>>>     in our active email so you don't need to submit individual
>>>>>>>     issues. We'll try to get to them as soon as we can.
>>>>>>
>>>>>>     Indeed, I’m still trying to figure this out.
>>>>>>     I realized that some of my configure flags were different
>>>>>>     than yours, e.g., no --with-memalign.
>>>>>>     I’ve also added SuperLU_DIST to my installation.
>>>>>>     Still, I can’t reproduce any issue.
>>>>>>     I will continue looking into this. It appears I’m seeing some
>>>>>>     valgrind errors, but I don’t know if this is some side effect
>>>>>>     of OpenMPI not being valgrind-clean (last time I checked,
>>>>>>     there was no error with MPICH).
>>>>>>
>>>>>>     Thank you for your patience,
>>>>>>     Pierre
>>>>>>
>>>>>>     /usr/bin/gmake -f gmakefile test test-fail=1
>>>>>>     Using MAKEFLAGS: test-fail=1
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
>>>>>>      ok snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>>>      ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
>>>>>>      ok ksp_ksp_tests-ex33_superlu_dist_2
>>>>>>      ok diff-ksp_ksp_tests-ex33_superlu_dist_2
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
>>>>>>      ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>>>      ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>>>      ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>>>      ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>>>      ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>>>      ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>>>      ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>>>      ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>>>      ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>>>      ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>>>      ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>>>      ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>>>      ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>>>      ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>>>      ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>>>      ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
>>>>>>      ok ksp_ksp_tutorials-ex50_tut_2
>>>>>>      ok diff-ksp_ksp_tutorials-ex50_tut_2
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
>>>>>>      ok ksp_ksp_tests-ex33_superlu_dist
>>>>>>      ok diff-ksp_ksp_tests-ex33_superlu_dist
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
>>>>>>      ok snes_tutorials-ex56_hypre
>>>>>>      ok diff-snes_tutorials-ex56_hypre
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
>>>>>>      ok ksp_ksp_tutorials-ex56_2
>>>>>>      ok diff-ksp_ksp_tutorials-ex56_2
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
>>>>>>      ok snes_tutorials-ex17_3d_q3_trig_elas
>>>>>>      ok diff-snes_tutorials-ex17_3d_q3_trig_elas
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
>>>>>>      ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>>>      ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
>>>>>>     not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
>>>>>>     #srun: error: Unable to create step for job 1426755: More
>>>>>>     processors requested than permitted
>>>>>>      ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command
>>>>>>     failed so no diff
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
>>>>>>      ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran
>>>>>>     required for this test
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
>>>>>>      ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>>>      ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
>>>>>>      ok snes_tutorials-ex19_tut_3
>>>>>>      ok diff-snes_tutorials-ex19_tut_3
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
>>>>>>      ok snes_tutorials-ex17_3d_q3_trig_vlap
>>>>>>      ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
>>>>>>      ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran
>>>>>>     required for this test
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
>>>>>>      ok snes_tutorials-ex19_superlu_dist
>>>>>>      ok diff-snes_tutorials-ex19_superlu_dist
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
>>>>>>      ok
>>>>>>     snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>>>      ok
>>>>>>     diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
>>>>>>      ok ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>>>      ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
>>>>>>      ok snes_tutorials-ex19_superlu_dist_2
>>>>>>      ok diff-snes_tutorials-ex19_superlu_dist_2
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
>>>>>>     not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
>>>>>>     #srun: error: Unable to create step for job 1426755: More
>>>>>>     processors requested than permitted
>>>>>>      ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command
>>>>>>     failed so no diff
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
>>>>>>      ok
>>>>>>     snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>>>      ok
>>>>>>     diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
>>>>>>      ok ksp_ksp_tutorials-ex64_1
>>>>>>      ok diff-ksp_ksp_tutorials-ex64_1
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
>>>>>>     not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
>>>>>>     #srun: error: Unable to create step for job 1426755: More
>>>>>>     processors requested than permitted
>>>>>>      ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command failed
>>>>>>     so no diff
>>>>>>             TEST
>>>>>>     arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
>>>>>>      ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran
>>>>>>     required for this test
>>>>>>
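A log like the one above can be summarised with a few lines of code. This is a rough sketch under stated assumptions: it strips the leading `>` quote markers of this thread, classifies only lines that start with `ok` or `not ok` followed by a test name, and ignores entries whose name was wrapped onto the next line. The function name `summarize_tap` is hypothetical.

```python
import re

# Leading mail-quote markers and whitespace, e.g. ">>>>>>      ".
QUOTE_PREFIX = re.compile(r"^[>\s]*")


def summarize_tap(text):
    """Group test names from TAP-style lines into passed, failed and skipped."""
    summary = {"passed": [], "failed": [], "skipped": []}
    for raw in text.splitlines():
        line = QUOTE_PREFIX.sub("", raw)
        if line.startswith("not ok "):
            summary["failed"].append(line.split()[2])
        elif line.startswith("ok "):
            key = "skipped" if "# SKIP" in line else "passed"
            summary[key].append(line.split()[1])
    return summary


SAMPLE = """
>>>>>>      ok snes_tutorials-ex56_hypre
>>>>>>      ok diff-snes_tutorials-ex56_hypre
>>>>>>     not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
>>>>>>     #srun: error: Unable to create step for job 1426755: More
>>>>>>     processors requested than permitted
>>>>>>      ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command
>>>>>>     failed so no diff
"""
```

Applied to the full log, the `failed` list would contain only the three `ksp_ksp_tutorials-ex5_superlu_dist*` tests, whose error text points at the `srun` processor limit on that machine.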
>>>>>>>        Barry
>>>>>>>
>>>>>>>
>>>>>>>>     On Mar 10, 2021, at 11:03 PM, Eric Chamberland
>>>>>>>>     <Eric.Chamberland at giref.ulaval.ca> wrote:
>>>>>>>>
>>>>>>>>     Barry,
>>>>>>>>
>>>>>>>>     to get some follow-up on the --with-openmp=1 failures, shall
>>>>>>>>     I open GitLab issues for:
>>>>>>>>
>>>>>>>>     a) all hypre failures giving DIVERGED_INDEFINITE_PC
>>>>>>>>
>>>>>>>>     b) all superlu_dist failures giving different results with
>>>>>>>>     initia and "Exceeded timeout limit of 60 s"
>>>>>>>>
>>>>>>>>     c) hpddm failures "free(): invalid next size (fast)" and
>>>>>>>>     "Segmentation Violation"
>>>>>>>>
>>>>>>>>     d) all tao's "Exceeded timeout limit of 60 s"
>>>>>>>>
>>>>>>>>     I don't see how I could do all this debugging by myself...
>>>>>>>>
>>>>>>>>     Thanks,
>>>>>>>>
>>>>>>>>     Eric
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>     -- 
>>>>>     Eric Chamberland, ing., M. Ing
>>>>>     Professionnel de recherche
>>>>>     GIREF/Université Laval
>>>>>     (418) 656-2131 poste 41 22 42
>>>>     -- 
>>>>     Eric Chamberland, ing., M. Ing
>>>>     Professionnel de recherche
>>>>     GIREF/Université Laval
>>>>     (418) 656-2131 poste 41 22 42
>>>>     <fedora_mkl_and_devtools.txt><openmpi.txt><petsc.txt>
>>>
>>
>
-- 
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
