[petsc-dev] Petsc "make test" has more failures for --with-openmp=1

Eric Chamberland Eric.Chamberland at giref.ulaval.ca
Tue Mar 2 22:38:50 CST 2021


On 2021-03-02 10:59 p.m., Barry Smith wrote:
>
>   It could be related to MKL, but it could also be due to problems with 
> ScaLAPACK when used with OpenMP. Do you need ScaLAPACK? Maybe you want 
> to use it since it is used by MUMPS?
Yes, exactly, for MUMPS!
>>
>>
>> #2: I can deal with that! :)
>>
>>
>> #3: I am not sure if this output is due to the way I configure 
>> OpenMPI/3.x:
>>
>>   $ ./configure --prefix=/opt/openmpi-3.x_debug --enable-debug 
>> --enable-picky CXXFLAGS=-std=c++14 --with-wrapper-cxxflags=-std=c++14 
>> --with-cma
>>
>> or this export:
>>
>> export OMPI_MCA_plm_base_verbose=5
>>
>> which I left there to track an intermittent bug at singleton startup 
>> (https://www.mail-archive.com/devel@lists.open-mpi.org/msg19568.html).
>>
>> I will remove it now; 4 years later, the bug does not happen anymore... 
>> But I don't think it should be harmful for the PETSc tests, should it?
>>
> I am guessing that this is just informational output that does not 
> indicate a problem. But I am confused why it appears only 
> occasionally; presumably it is related to the current state of the 
> system.
>
> But the PETSc tests have no way of knowing whether this type of output to 
> stdout or stderr is "harmless" informational output versus an 
> indication of something being seriously broken. The way the PETSc tests 
> in make test check results is to process the output and look for 
> anything that is not "normal", and this is definitely not normal.
> I think you need to turn off the verbosity when running the PETSc tests, 
> and then hopefully this particular problem will go away.
>
>   ok tao_constrained_tutorials-toyf_1
> not ok diff-tao_constrained_tutorials-toyf_1 # Error code: 1
> #	0a1,17
> #	> [zorg:09243] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
> #	> [zorg:09243] plm:base:set_hnp_name: initial bias 9243 nodename hash 810220270
> #	> [zorg:09243] plm:base:set_hnp_name: final jobfam 61119
> #	> [zorg:09243] [[61119,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> #	> [zorg:09243] [[61119,0],0] plm:base:receive start comm
> #	> [zorg:09243] [[61119,0],0] plm:base:setup_job
> #	> [zorg:09243] [[61119,0],0] plm:base:setup_vm
> #	> [zorg:09243] [[61119,0],0] plm:base:setup_vm creating map
> #	> [zorg:09243] [[61119,0],0] setup:vm: working unmanaged allocation
> #	> [zorg:09243] [[61119,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
> #	> [zorg:09243] [[61119,0],0] plm:base:setup_vm only HNP in allocation
> #	> [zorg:09243] [[61119,0],0] plm:base:setting slots for node zorg by cores
> #	> [zorg:09243] [[61119,0],0] complete_setup on job [61119,1]
> #	> [zorg:09243] [[61119,0],0] plm:base:launch_apps for job [61119,1]
> #	> [zorg:09243] [[61119,0],0] plm:base:launch wiring up iof for job [61119,1]
> #	> [zorg:09243] [[61119,0],0] plm:base:launch job [61119,1] is not a dynamic spawn
> #	> [zorg:09243] [[61119,0],0] plm:base:launch [61119,1] registered
> #	58a76,77
> #	> [zorg:09243] [[61119,0],0] plm:base:orted_cmd sending orted_exit commands
> #	> [zorg:09243] [[61119,0],0] plm:base:receive stop comm
>
Understood: on our side we already have a workaround to filter these lines 
out of the stdout/stderr comparisons we do in our nightly tests.
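
In any case, dropping that export before launching the tests should make 
these extra lines disappear; a minimal sketch of what I will do on my side 
(assuming the variable is only set in my environment scripts):

   unset OMPI_MCA_plm_base_verbose
   make test
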
>
>>
>> #4: I do have a default choice for L2 projections which uses HYPRE 
>> BoomerAMG preconditioning... that now ends with 
>> KSP_DIVERGED_INDEFINITE_PC, so this is definitely a problem...
>>
>> #5: we do sometimes use SuperLU_DIST
>>
>>
>   I'm afraid that for the possible MUMPS, SuperLU_DIST, and hypre problems 
> you need to debug them one at a time by running the particular 
> troublesome example in the debugger to determine the problem.  It 
> could also be due to the relationship between MKL and the OpenMP 
> implementation. I don't know exactly how MKL's multi-threaded code 
> runs in relation to OpenMP, and certainly if the compiler is providing 
> a different OpenMP than MKL is using, it will not work.

Are the guys who maintain all these libs reading petsc-dev? ;)
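
More seriously, my plan would be to rerun one failing test at a time with 
the debugger attached, roughly like this (a sketch only, assuming the test 
harness still accepts the search= and EXTRA_OPTIONS= variables, which I 
still have to double-check):

   make test search='snes_tutorials-ex12_quad_hpddm_reuse_baij' EXTRA_OPTIONS='-start_in_debugger'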


>
>   Under most circumstances, if you are using MKL with threading and 
> PETSc, you likely want only one MKL thread, since PETSc already 
> handles the maximum parallelism with MPI and there are no "extra" 
> processors available to parallelize the BLAS/LAPACK calls made from 
> PETSc inside MKL.  This may not be true if you are 
> using MUMPS, which makes things far more complicated.
>
>  OpenMP is complicated in the context of PETSc and several external 
> packages because different packages may use it in different ways that 
> require different tuning and I won't know the tuning for each.

Okay: we do not use OpenMP either... we rely entirely on MPI for parallelism 
as well... So if I could just compile for CUDA without it, I would be happy...
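
About the MKL threading advice: if we do end up with a multi-threaded MKL, 
my understanding is that we should pin it to a single thread from the 
environment before launching, something like (just a sketch on my side):

   export MKL_NUM_THREADS=1
   export OMP_NUM_THREADS=1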

But do you think OpenMP could be turned on/off only for specific packages 
at configuration time?  Given the bugs encountered, it is not very 
appealing to activate it for all external packages...
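
To make the question concrete, here is the kind of configure line I would 
hope to use; purely a sketch on my side, assuming (as Barry mentioned) that 
SuperLU_DIST is the package forcing --with-openmp together with 
--with-cuda=1, and that I simply leave it out:

   $ ./configure --with-cuda=1 --with-openmp=0 --download-mumps \
       --download-scalapack --download-hypre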

regards,

Eric

>
>   Barry
>
>
>
>
>> #6: we are starting to look at DD solvers like HPDDM...
>>
>> So my killer question is: given the amount of work needed to get 
>> all these external packages fixed, is it possible to activate OpenMP 
>> only for the CUDA part?
>>
>> Thanks,
>>
>> Eric
>>
>> On 2021-03-02 3:47 p.m., Barry Smith wrote:
>>>
>>>   Eric,
>>>
>>>     Thanks for the detailed information.
>>>
>>>     I have cc:ed Pierre so he can look at the HPDDM failures.
>>>
>>>
>>>> On Mar 2, 2021, at 2:14 PM, Eric Chamberland 
>>>> <Eric.Chamberland at giref.ulaval.ca 
>>>> <mailto:Eric.Chamberland at giref.ulaval.ca>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> It all started when I wanted to test PETSc/CUDA compatibility for 
>>>> our code.
>>>>
>>>> I had to activate --with-openmp to configure with --with-cuda=1 
>>>> successfully.
>>>>
>>>>
>>> Certain packages like SuperLU_DIST require --with-openmp if using 
>>> --with-cuda=1, but PETSc's own use of CUDA, as well as some other 
>>> packages, does not require --with-openmp.
>>>
>>>> I then saw that PETSC_HAVE_OPENMP is used at least in MUMPS (and 
>>>> some other places).
>>>>
>>>> So I configured and tested PETSc with OpenMP activated, without CUDA.
>>>>
>>>> The first thing I saw is that our code's CI pipelines now fail for 
>>>> many tests.
>>>>
>>>> After looking deeper, it seems that PETSc itself fails many tests 
>>>> when I activate OpenMP!
>>>>
>>>> Here are all the configurations I have results for, after/before 
>>>> activating OpenMP for PETSc:
>>>
>>> There seem to be several distinct issues
>>>
>>> 1) failures inside Scalapack.
>>>
>>> 2) possibly slightly different convergence for some examples, 
>>> changing the number of iterations slightly in PETSc.
>>>
>>> 3) trouble initializing something outside of PETSc, almost surely 
>>> not related to PETSc:
>>>
>>> [zorg:08517] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>> #	[zorg:08517] plm:base:set_hnp_name: initial bias 8517 nodename hash 810220270
>>> #	[zorg:08517] plm:base:set_hnp_name: final jobfam 60385
>>> #	[zorg:08517] [[60385,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>> #	[zorg:08517] [[60385,0],0] plm:base:receive start comm
>>> #	[zorg:08517] [[60385,0],0] plm:base:setup_job
>>> #	[zorg:08517] [[60385,0],0] plm:base:setup_vm
>>> 4) a problem with a hypre run, "Linear solve did not converge due to 
>>> DIVERGED_INDEFINITE_PC iterations 3", again likely not a PETSc issue 
>>> but a hypre and OpenMP issue
>>> 5) different results for the inertia computed inside an external package
>>> #	1c1
>>> #	<  MatInertia: nneg: 17, nzero: 0, npos: 83
>>> #	---
>>> #	>  MatInertia: nneg: 21, nzero: 0, npos: 79
>>>          TEST arch-linux-c-debug/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
>>>   ok ksp_ksp_tests-ex33_superlu_dist_2
>>> not ok diff-ksp_ksp_tests-ex33_superlu_dist_2 # Error code: 1
>>> #	1c1
>>> #	<  MatInertia: nneg: 17, nzero: 0, npos: 83
>>> #	---
>>> #	>  MatInertia: nneg: 25, nzero: 0, npos: 75
>>>
>>>
>>> 6) problems with the external package hpddm
>>>
>>> not ok snes_tutorials-ex12_quad_hpddm_reuse_baij # Error code: 139
>>> #	  0 SNES Function norm 21.3344
>>> #	[0]PETSC ERROR: ------------------------------------------------------------------------
>>> #	[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>> #	[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> #	[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> #	[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> #	[0]PETSC ERROR: likely location of problem given in stack below
>>> #	[0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>>> #	[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>> #	[0]PETSC ERROR:       INSTEAD the line number of the start of the function
>>> #	[0]PETSC ERROR:       is given.
>>> #	[0]PETSC ERROR: [0] constructionMatrix line 313 /opt/petsc-main_debug/include/HPDDM_coarse_operator_impl.hpp
>>> #	[0]PETSC ERROR: [0] construction line 256 /opt/petsc-main_debug/include/HPDDM_coarse_operator_impl.hpp
>>> #	[0]PETSC ERROR: [0] buildTwo line 987 /opt/petsc-main_debug/include/HPDDM_schwarz.hpp
>>> #	[0]PETSC ERROR: [0] next line 1130 /opt/petsc-main_debug/include/HPDDM_schwarz.hpp
>>> #	[0]PETSC ERROR: [0] PCSetUp_HPDDM line 746 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/ksp/pc/impls/hpddm/hpddm.cxx
>>> #	[0]PETSC ERROR: [0] PCSetUp line 974 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/ksp/pc/interface/precon.c
>>> #	[0]PETSC ERROR: [0] KSPSetUp line 319 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/ksp/ksp/interface/itfunc.c
>>> #	[0]PETSC ERROR: [0] KSPSolve_Private line 808 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/ksp/ksp/interface/itfunc.c
>>> #	[0]PETSC ERROR: [0] KSPSolve line 1080 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/ksp/ksp/interface/itfunc.c
>>> #	[0]PETSC ERROR: [0] SNESSolve_NEWTONLS line 144 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/snes/impls/ls/ls.c
>>> #	[0]PETSC ERROR: [0] SNESSolve line 4533 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/snes/interface/snes.c
>>> #	[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>> #	[0]PETSC ERROR: Signal received
>>> #	[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>
>>> PETSc itself does not use OpenMP, so turning on OpenMP for pure PETSc 
>>> should generate no errors except possibly small changes in iteration 
>>> counts due to the different way the floating-point operations in MKL 
>>> are done.
>>>
>>> We don't see much use for OpenMP, so we rarely turn it on. What is your 
>>> end goal: to use PETSc on CUDA (for which you can keep OpenMP off) or 
>>> something else?
>>>
>>>
>>>   Barry
>>>
>>>
>>>> ==============================================================================
>>>>
>>>> ==============================================================================
>>>>
>>>> For petsc/master + OpenMPI 4.0.4 + MKL 2019.4.243:
>>>>
>>>> With OpenMP=1
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/petsc-master-debug/2021.03.02.02h00m02s_make_test.log
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/petsc-master-debug/2021.03.02.02h00m02s_configure.log
>>>>
>>>> # -------------
>>>> #   Summary
>>>> # -------------
>>>> # FAILED snes_tutorials-ex12_quad_hpddm_reuse_baij diff-ksp_ksp_tests-ex33_superlu_dist_2 diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0 diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1 diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0 diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1 diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0 diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1 diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0 diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1 ksp_ksp_tutorials-ex50_tut_2 diff-ksp_ksp_tests-ex33_superlu_dist diff-snes_tutorials-ex56_hypre snes_tutorials-ex17_3d_q3_trig_elas snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij ksp_ksp_tutorials-ex5_superlu_dist_3 ksp_ksp_tutorials-ex5f_superlu_dist snes_tutorials-ex12_tri_parmetis_hpddm_baij diff-snes_tutorials-ex19_tut_3 mat_tests-ex242_3 snes_tutorials-ex17_3d_q3_trig_vlap ksp_ksp_tutorials-ex5f_superlu_dist_3 snes_tutorials-ex19_superlu_dist diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre diff-ksp_ksp_tutorials-ex49_hypre_nullspace ts_tutorials-ex18_p1p1_xper_ref ts_tutorials-ex18_p1p1_xyper_ref snes_tutorials-ex19_superlu_dist_2 ksp_ksp_tutorials-ex5_superlu_dist_2 diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre ksp_ksp_tutorials-ex64_1 ksp_ksp_tutorials-ex5_superlu_dist ksp_ksp_tutorials-ex5f_superlu_dist_2
>>>> # success 8275/10003 tests (82.7%)
>>>> # failed 33/10003 tests (0.3%)
>>>>
>>>> With OpenMP=0
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/petsc-master-debug/2021.02.26.02h00m16s_make_test.log
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/petsc-master-debug/2021.02.26.02h00m16s_configure.log
>>>>
>>>> # -------------
>>>> #   Summary
>>>> # -------------
>>>> # FAILED tao_constrained_tutorials-tomographyADMM_6 snes_tutorials-ex17_3d_q3_trig_elas mat_tests-ex242_3 snes_tutorials-ex17_3d_q3_trig_vlap tao_leastsquares_tutorials-tomography_1 tao_constrained_tutorials-tomographyADMM_5
>>>> # success 8262/9983 tests (82.8%)
>>>> # failed 6/9983 tests (0.1%)
>>>>
>>>> ==============================================================================
>>>>
>>>> ==============================================================================
>>>>
>>>> For OpenMPI 3.1.x/master:
>>>>
>>>> With OpenMP=1:
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/ompi_3.x/2021.03.01.22h00m01s_make_test.log
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/ompi_3.x/2021.03.01.22h00m01s_configure.log
>>>>
>>>> # -------------
>>>> #   Summary
>>>> # -------------
>>>> # FAILED mat_tests-ex242_3 mat_tests-ex242_2 diff-mat_tests-ex219f_1 diff-dm_tutorials-ex11f90_1 ksp_ksp_tutorials-ex5_superlu_dist_3 diff-ksp_ksp_tutorials-ex49_hypre_nullspace ksp_ksp_tutorials-ex5f_superlu_dist_3 snes_tutorials-ex17_3d_q3_trig_vlap diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre diff-snes_tutorials-ex19_tut_3 diff-snes_tutorials-ex56_hypre diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre tao_leastsquares_tutorials-tomography_1 tao_constrained_tutorials-tomographyADMM_4 tao_constrained_tutorials-tomographyADMM_6 diff-tao_constrained_tutorials-toyf_1
>>>> # success 8142/9765 tests (83.4%)
>>>> # failed 16/9765 tests (0.2%)
>>>>
>>>> With OpenMP=0:
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/ompi_3.x/2021.02.28.22h00m02s_make_test.log
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/ompi_3.x/2021.02.28.22h00m02s_configure.log
>>>>
>>>> # -------------
>>>> #   Summary
>>>> # -------------
>>>> # FAILED mat_tests-ex242_3 mat_tests-ex242_2 diff-mat_tests-ex219f_1 diff-dm_tutorials-ex11f90_1 ksp_ksp_tutorials-ex56_2 snes_tutorials-ex17_3d_q3_trig_vlap tao_leastsquares_tutorials-tomography_1 tao_constrained_tutorials-tomographyADMM_4 diff-tao_constrained_tutorials-toyf_1
>>>> # success 8151/9767 tests (83.5%)
>>>> # failed 9/9767 tests (0.1%)
>>>>
>>>> ==============================================================================
>>>>
>>>> ==============================================================================
>>>>
>>>> For OpenMPI 4.0.x/master:
>>>>
>>>> With OpenMP=1:
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/ompi_4.x/2021.03.01.20h00m01s_make_test.log
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/ompi_4.x/2021.03.01.20h00m01s_configure.log
>>>>
>>>> # FAILED snes_tutorials-ex17_3d_q3_trig_elas snes_tutorials-ex19_hypre ksp_ksp_tutorials-ex56_2 tao_leastsquares_tutorials-tomography_1 tao_constrained_tutorials-tomographyADMM_5 mat_tests-ex242_3 ksp_ksp_tutorials-ex55_hypre ksp_ksp_tutorials-ex5_superlu_dist_2 tao_constrained_tutorials-tomographyADMM_6 snes_tutorials-ex56_hypre snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre ksp_ksp_tutorials-ex5f_superlu_dist_3 ksp_ksp_tutorials-ex34_hyprestruct diff-ksp_ksp_tutorials-ex49_hypre_nullspace snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre ksp_ksp_tutorials-ex5f_superlu_dist ksp_ksp_tutorials-ex5f_superlu_dist_2 ksp_ksp_tutorials-ex5_superlu_dist snes_tutorials-ex19_tut_3 snes_tutorials-ex19_superlu_dist ksp_ksp_tutorials-ex50_tut_2 snes_tutorials-ex17_3d_q3_trig_vlap ksp_ksp_tutorials-ex5_superlu_dist_3 snes_tutorials-ex19_superlu_dist_2 tao_constrained_tutorials-tomographyADMM_4 ts_tutorials-ex26_2
>>>> # success 8125/9753 tests (83.3%)
>>>> # failed 26/9753 tests (0.3%)
>>>>
>>>> With OpenMP=0
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/ompi_4.x/2021.02.28.20h00m04s_make_test.log
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/ompi_4.x/2021.02.28.20h00m04s_configure.log
>>>>
>>>> # FAILED mat_tests-ex242_3
>>>> # success 8174/9777 tests (83.6%)
>>>> # failed 1/9777 tests (0.0%)
>>>>
>>>> ==============================================================================
>>>>
>>>> ==============================================================================
>>>>
>>>> Is that known and normal?
>>>>
>>>> In all cases, I am using MKL and I suspect it may come from 
>>>> there... :/
>>>>
>>>> I also saw a second problem: "make test" fails to compile the PETSc 
>>>> examples on older versions of MKL (that's less important for 
>>>> me, I just upgraded to oneAPI to avoid it, but you may want to know):
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/dernier_ompi/2021.03.02.02h16m01s_make_test.log
>>>>
>>>> https://giref.ulaval.ca/~cmpgiref/dernier_ompi/2021.03.02.02h16m01s_configure.log
>>>>
>>>> Thanks,
>>>>
>>>> Eric
>>>>
>>>> -- 
>>>> Eric Chamberland, ing., M. Ing
>>>> Professionnel de recherche
>>>> GIREF/Université Laval
>>>> (418) 656-2131 poste 41 22 42
>>>
>> -- 
>> Eric Chamberland, ing., M. Ing
>> Professionnel de recherche
>> GIREF/Université Laval
>> (418) 656-2131 poste 41 22 42
>
-- 
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
