[petsc-dev] Petsc "make test" have more failures for --with-openmp=1
Eric Chamberland
Eric.Chamberland at giref.ulaval.ca
Sat Mar 13 18:44:56 CST 2021
For us it clearly creates problems in real computations...
I understand the need to have clean tests for PETSc, but for me, it
reveals that hypre isn't usable with more than one thread for now...
Another solution: force single-threaded configuration for hypre until
this is fixed?
Eric
On 2021-03-13 8:50 a.m., Pierre Jolivet wrote:
> -pc_hypre_boomeramg_relax_type_all Jacobi =>
> Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 3
> -pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi =>
> OK, independently of the architecture it seems (Eric's Docker image with
> 1 or 2 threads, or my macOS), but the contraction factor is higher
> Linear solve converged due to CONVERGED_RTOL iterations 8
> Linear solve converged due to CONVERGED_RTOL iterations 24
> Linear solve converged due to CONVERGED_RTOL iterations 26
> v. currently
> Linear solve converged due to CONVERGED_RTOL iterations 7
> Linear solve converged due to CONVERGED_RTOL iterations 9
> Linear solve converged due to CONVERGED_RTOL iterations 10
>
> Do we change this? Or should we force OMP_NUM_THREADS=1 for make test?
>
> Thanks,
> Pierre
>
>> On 13 Mar 2021, at 2:26 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>> Hypre uses a multiplicative smoother by default. It has a Chebyshev
>> smoother. That with a Jacobi PC should be thread invariant.
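>> (With PETSc's hypre interface, that would presumably be something like
>> -pc_hypre_boomeramg_relax_type_all Chebyshev, though I have not
>> double-checked that option value here.)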
>> Mark
>>
>> On Sat, Mar 13, 2021 at 8:18 AM Pierre Jolivet <pierre at joliv.et> wrote:
>>
>>
>>> On 13 Mar 2021, at 9:17 AM, Pierre Jolivet <pierre at joliv.et> wrote:
>>>
>>> Hello Eric,
>>> I’ve made an “interesting” discovery, so I’ll put the list back in Cc.
>>> It appears that the following snippet of code, which uses Allreduce()
>>> + a lambda function + MPI_IN_PLACE, is:
>>> - Valgrind-clean with MPICH;
>>> - Valgrind-clean with OpenMPI 4.0.5;
>>> - not Valgrind-clean with OpenMPI 4.1.0.
>>> I’m not sure who is to blame here; I’ll need to look at the MPI
>>> specification for what is required of implementors and users
>>> in that case.
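>>>
>>> For reference, a minimal sketch of that pattern (illustrative only,
>>> with made-up buffer contents; the actual code is in the attached
>>> ompi.cxx) looks like this:
>>>
>>> #include <mpi.h>
>>> #include <algorithm>
>>>
>>> int main(int argc, char **argv) {
>>>   MPI_Init(&argc, &argv);
>>>   int data[2] = {42, ~0}; // slot 0 reduced with MAX, slot 1 with BAND
>>>
>>>   MPI_Datatype pair; // contiguous pair of ints, reduced as one element
>>>   MPI_Type_contiguous(2, MPI_INT, &pair);
>>>   MPI_Type_commit(&pair);
>>>
>>>   MPI_Op op; // a captureless lambda converts to the plain function
>>>   MPI_Op_create( // pointer that MPI_Op_create() expects
>>>     [](void *in, void *inout, int *len, MPI_Datatype *) {
>>>       const int *a = static_cast<const int *>(in);
>>>       int *b = static_cast<int *>(inout);
>>>       for (int i = 0; i < *len; ++i) {
>>>         b[2 * i] = std::max(a[2 * i], b[2 * i]); // MAX on the first slot
>>>         b[2 * i + 1] &= a[2 * i + 1];            // BAND on the second slot
>>>       }
>>>     }, 1, &op);
>>>
>>>   MPI_Allreduce(MPI_IN_PLACE, data, 1, pair, op, MPI_COMM_WORLD);
>>>
>>>   MPI_Op_free(&op);
>>>   MPI_Type_free(&pair);
>>>   MPI_Finalize();
>>>   return 0;
>>> }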
>>>
>>> In the meantime, I’ll do the following:
>>> - update config/BuildSystem/config/packages/OpenMPI.py to use
>>> OpenMPI 4.1.0, see if any other error appears;
>>> - provide a hotfix to bypass the segfaults;
>>
>> I can confirm that splitting the single Allreduce with my own
>> MPI_Op into two Allreduces with MAX and BAND fixes the segfaults
>> with OpenMPI (*).
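>>
>> Concretely, the workaround amounts to something like the following
>> sketch (same made-up buffer layout as above, not the exact patch in
>> the merge request):
>>
>> #include <mpi.h>
>>
>> int main(int argc, char **argv) {
>>   MPI_Init(&argc, &argv);
>>   int vmax = 42, vband = ~0; // made-up local contributions
>>   // two in-place reductions with built-in operations replace the
>>   // single reduction with the user-defined MPI_Op
>>   MPI_Allreduce(MPI_IN_PLACE, &vmax, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
>>   MPI_Allreduce(MPI_IN_PLACE, &vband, 1, MPI_INT, MPI_BAND, MPI_COMM_WORLD);
>>   MPI_Finalize();
>>   return 0;
>> }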
>>
>>> - look at the hypre issue and whether they should be deferred to
>>> the hypre team.
>>
>> I don’t know if there is something wrong in hypre threading or if
>> it’s just a side effect of threading, but it seems that the
>> number of threads has a drastic effect on the quality of the PC.
>> By default, it looks like there are two threads per process with
>> your Docker image.
>> If I force OMP_NUM_THREADS=1, then I get the same convergence as
>> in the output file.
>>
>> Thanks,
>> Pierre
>>
>> (*) https://gitlab.com/petsc/petsc/-/merge_requests/3712
>>
>>> Thank you for the Docker files, they were really useful.
>>> If you want to avoid oversubscription failures, you can edit the
>>> file /opt/openmpi-4.1.0/etc/openmpi-default-hostfile and append
>>> the line:
>>> localhost slots=12
>>> If you want to increase the timeout limit of the PETSc test suite
>>> for each test, you can add the extra flag TIMEOUT=180 to your
>>> command line (the default is 60; units are seconds).
>>>
>>> Thanks, I’ll ping you on GitLab when I’ve got something ready
>>> for you to try,
>>> Pierre
>>>
>>> <ompi.cxx>
>>>
>>>> On 12 Mar 2021, at 8:54 PM, Eric Chamberland
>>>> <Eric.Chamberland at giref.ulaval.ca> wrote:
>>>>
>>>> Hi Pierre,
>>>>
>>>> I now have a docker container reproducing the problems here.
>>>>
>>>> Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm
>>>> it fails like this:
>>>>
>>>> not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
>>>> # Initial guess
>>>> # L_2 Error: 0.00803099
>>>> # Initial Residual
>>>> # L_2 Residual: 1.09057
>>>> # Au - b = Au + F(0)
>>>> # Linear L_2 Residual: 1.09057
>>>> # [d470c54ce086:14127] Read -1, expected 4096, errno = 1
>>>> # [d470c54ce086:14128] Read -1, expected 4096, errno = 1
>>>> # [d470c54ce086:14129] Read -1, expected 4096, errno = 1
>>>> # [3]PETSC ERROR:
>>>> ------------------------------------------------------------------------
>>>> # [3]PETSC ERROR: Caught signal number 11 SEGV:
>>>> Segmentation Violation, probably memory access out of range
>>>> # [3]PETSC ERROR: Try option -start_in_debugger or
>>>> -on_error_attach_debugger
>>>> # [3]PETSC ERROR: or see
>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>> # [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and
>>>> Apple Mac OS X to find memory corruption errors
>>>> # [3]PETSC ERROR: likely location of problem given in
>>>> stack below
>>>> # [3]PETSC ERROR: --------------------- Stack Frames
>>>> ------------------------------------
>>>> # [3]PETSC ERROR: Note: The EXACT line numbers in the
>>>> stack are not available,
>>>> # [3]PETSC ERROR: INSTEAD the line number of the start of
>>>> the function
>>>> # [3]PETSC ERROR: is given.
>>>> # [3]PETSC ERROR: [3] buildTwo line 987
>>>> /opt/petsc-main/include/HPDDM_schwarz.hpp
>>>> # [3]PETSC ERROR: [3] next line 1130
>>>> /opt/petsc-main/include/HPDDM_schwarz.hpp
>>>> # [3]PETSC ERROR: --------------------- Error Message
>>>> --------------------------------------------------------------
>>>> # [3]PETSC ERROR: Signal received
>>>> # [3]PETSC ERROR: [0]PETSC ERROR:
>>>> ------------------------------------------------------------------------
>>>>
>>>> Also, ex12_quad_hpddm_reuse_baij fails with many more "Read -1,
>>>> expected ..." messages, and I don't know where they come from...?
>>>>
>>>> Hypre (like in diff-snes_tutorials-ex56_hypre) is also having
>>>> DIVERGED_INDEFINITE_PC failures...
>>>>
>>>> Please see the 3 attached docker files:
>>>>
>>>> 1) fedora_mkl_and_devtools: the Dockerfile which installs Fedora 33
>>>> with GNU compilers, MKL, and everything needed for development.
>>>>
>>>> 2) openmpi: the Dockerfile to build OpenMPI
>>>>
>>>> 3) petsc: the last Dockerfile, which builds, installs, and tests PETSc
>>>>
>>>> I build the 3 like this:
>>>>
>>>> docker build -t fedora_mkl_and_devtools -f
>>>> fedora_mkl_and_devtools .
>>>>
>>>> docker build -t openmpi -f openmpi .
>>>>
>>>> docker build -t petsc -f petsc .
>>>>
>>>> Disclaimer: I am not a Docker expert, so I may do things that
>>>> are not Docker state-of-the-art, but I am open to suggestions... ;)
>>>>
>>>> I have just run it on my laptop (long), which does not have enough
>>>> cores, so many more tests failed (I should force --oversubscribe
>>>> but don't know how to). I will relaunch on my workstation in a
>>>> few minutes.
>>>>
>>>> I will now test your branch! (sorry for the delay).
>>>>
>>>> Thanks,
>>>>
>>>> Eric
>>>>
>>>> On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
>>>>>
>>>>> Hi Pierre,
>>>>>
>>>>> ok, that's interesting!
>>>>>
>>>>> I will try to build a Docker image by tomorrow and give you
>>>>> the exact recipe to reproduce the bugs.
>>>>>
>>>>> Eric
>>>>>
>>>>>
>>>>> On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
>>>>>>
>>>>>>
>>>>>>> On 11 Mar 2021, at 6:16 AM, Barry Smith <bsmith at petsc.dev> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Eric,
>>>>>>>
>>>>>>> Sorry about not being more responsive. We still have this
>>>>>>> in our active email, so you don't need to submit individual
>>>>>>> issues. We'll try to get to them as soon as we can.
>>>>>>
>>>>>> Indeed, I’m still trying to figure this out.
>>>>>> I realized that some of my configure flags were different
>>>>>> from yours, e.g., no --with-memalign.
>>>>>> I’ve also added SuperLU_DIST to my installation.
>>>>>> Still, I can’t reproduce any issue.
>>>>>> I will continue looking into this; it appears I’m seeing some
>>>>>> Valgrind errors, but I don’t know if this is some side effect
>>>>>> of OpenMPI not being Valgrind-clean (last time I checked,
>>>>>> there was no error with MPICH).
>>>>>>
>>>>>> Thank you for your patience,
>>>>>> Pierre
>>>>>>
>>>>>> /usr/bin/gmake -f gmakefile test test-fail=1
>>>>>> Using MAKEFLAGS: test-fail=1
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
>>>>>> ok snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>>> ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
>>>>>> ok ksp_ksp_tests-ex33_superlu_dist_2
>>>>>> ok diff-ksp_ksp_tests-ex33_superlu_dist_2
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
>>>>>> ok ksp_ksp_tutorials-ex50_tut_2
>>>>>> ok diff-ksp_ksp_tutorials-ex50_tut_2
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
>>>>>> ok ksp_ksp_tests-ex33_superlu_dist
>>>>>> ok diff-ksp_ksp_tests-ex33_superlu_dist
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
>>>>>> ok snes_tutorials-ex56_hypre
>>>>>> ok diff-snes_tutorials-ex56_hypre
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
>>>>>> ok ksp_ksp_tutorials-ex56_2
>>>>>> ok diff-ksp_ksp_tutorials-ex56_2
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
>>>>>> ok snes_tutorials-ex17_3d_q3_trig_elas
>>>>>> ok diff-snes_tutorials-ex17_3d_q3_trig_elas
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
>>>>>> ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>>> ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
>>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
>>>>>> #srun: error: Unable to create step for job 1426755: More
>>>>>> processors requested than permitted
>>>>>> ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command
>>>>>> failed so no diff
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
>>>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran
>>>>>> required for this test
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
>>>>>> ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>>> ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
>>>>>> ok snes_tutorials-ex19_tut_3
>>>>>> ok diff-snes_tutorials-ex19_tut_3
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
>>>>>> ok snes_tutorials-ex17_3d_q3_trig_vlap
>>>>>> ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
>>>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran
>>>>>> required for this test
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
>>>>>> ok snes_tutorials-ex19_superlu_dist
>>>>>> ok diff-snes_tutorials-ex19_superlu_dist
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
>>>>>> ok
>>>>>> snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>>> ok
>>>>>> diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
>>>>>> ok ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>>> ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
>>>>>> ok snes_tutorials-ex19_superlu_dist_2
>>>>>> ok diff-snes_tutorials-ex19_superlu_dist_2
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
>>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
>>>>>> #srun: error: Unable to create step for job 1426755: More
>>>>>> processors requested than permitted
>>>>>> ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command
>>>>>> failed so no diff
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
>>>>>> ok
>>>>>> snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>>> ok
>>>>>> diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
>>>>>> ok ksp_ksp_tutorials-ex64_1
>>>>>> ok diff-ksp_ksp_tutorials-ex64_1
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
>>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
>>>>>> #srun: error: Unable to create step for job 1426755: More
>>>>>> processors requested than permitted
>>>>>> ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command failed
>>>>>> so no diff
>>>>>> TEST
>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
>>>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran
>>>>>> required for this test
>>>>>>
>>>>>>> Barry
>>>>>>>
>>>>>>>
>>>>>>>> On Mar 10, 2021, at 11:03 PM, Eric Chamberland
>>>>>>>> <Eric.Chamberland at giref.ulaval.ca> wrote:
>>>>>>>>
>>>>>>>> Barry,
>>>>>>>>
>>>>>>>> to get some follow-up on --with-openmp=1 failures, shall
>>>>>>>> I open GitLab issues for:
>>>>>>>>
>>>>>>>> a) all hypre failures giving DIVERGED_INDEFINITE_PC
>>>>>>>>
>>>>>>>> b) all superlu_dist failures giving different results with
>>>>>>>> initia and "Exceeded timeout limit of 60 s"
>>>>>>>>
>>>>>>>> c) hpddm failures "free(): invalid next size (fast)" and
>>>>>>>> "Segmentation Violation"
>>>>>>>>
>>>>>>>> d) all tao's "Exceeded timeout limit of 60 s"
>>>>>>>>
>>>>>>>> I don't see how I could do all this debugging by myself...
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Eric
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>> --
>>>>> Eric Chamberland, ing., M. Ing
>>>>> Professionnel de recherche
>>>>> GIREF/Université Laval
>>>>> (418) 656-2131 poste 41 22 42
>>>> --
>>>> Eric Chamberland, ing., M. Ing
>>>> Professionnel de recherche
>>>> GIREF/Université Laval
>>>> (418) 656-2131 poste 41 22 42
>>>> <fedora_mkl_and_devtools.txt><openmpi.txt><petsc.txt>
>>>
>>
>
--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42