[petsc-users] SuperLU_dist bug with parallel symbolic factorisation

Xiaoye S. Li xsli at lbl.gov
Tue May 22 10:45:20 CDT 2018


Indeed, I am pretty sure the bug is in ParMETIS.  A few years ago, I sent a
sample matrix and debug trace to George Karypis. He was going to look at
it, but never did.

This bug seems to show up when the graph is relatively dense.  Can you try
serial symbolic factorization with Metis ordering instead, i.e.
-mat_superlu_dist_parsymbfact 0 -mat_superlu_dist_colperm METIS_AT_PLUS_A?
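
For reference, the statement at fm.c:60 in the trace quoted below divides by
nvtxs, so an empty (sub)graph would explain the FPE: integer division by zero
is undefined in C and typically raises SIGFPE on x86 Linux. Below is a minimal
standalone sketch of that expression with a guard added. It is not METIS code
and not a proposed patch; avg_vertex_weight, idx_min, the 32-bit idx_t typedef,
and the fallback value are all assumptions made for illustration.

```c
#include <stdint.h>

/* Assumption: 32-bit indices. METIS defines its own idx_t; this typedef
   only stands in for it so the sketch compiles on its own. */
typedef int32_t idx_t;

/* Hypothetical stand-in for the gk_min macro used at fm.c:60. */
static idx_t idx_min(idx_t a, idx_t b)
{
    return a < b ? a : b;
}

/* Hypothetical helper, not part of METIS: evaluates the same expression as
   fm.c:60, but skips the division when nvtxs is not positive. */
static idx_t avg_vertex_weight(idx_t pwgt0, idx_t pwgt1, idx_t nvtxs)
{
    idx_t total = pwgt0 + pwgt1;

    if (nvtxs <= 0) {
        /* Illustrative fallback only: keep the first term of the min(). */
        return total / 20;
    }
    return idx_min(total / 20, 2 * total / nvtxs);
}
```

With partition weights 100 and 100 and nvtxs = 10 this returns 10, and with
nvtxs = 0 it returns the fallback instead of trapping. Whether an empty graph
should reach this routine at all is a separate question for ParMETIS/METIS.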

Sherry


On Tue, May 22, 2018 at 8:41 AM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:

>
> 0x00007f96a2148e52 in libmetis__FM_2WayCutRefine (ctrl=0x2784d20, graph=0x2784940, ntpwgts=0x7ffdfa323060, niter=4)
> at /home/mefpp_ericc/petsc-3.9.2-debug/arch-linux2-c-debug/externalpackages/git.metis/libmetis/fm.c:60
>
> It appears the crash is in metis, not SuperLU_Dist.
>
>   So it is either a bug in Metis or a bug in how Metis is called by ParMetis or
> SuperLU_Dist.
>
>    Barry
>
>
>
>
> > On May 22, 2018, at 10:37 AM, Hong <hzhang at mcs.anl.gov> wrote:
> >
> > Eric:
> > You have likely encountered a zero pivot. Running your code with '-ksp_error_if_not_converged' would show it.
> > Adding the option '-mat_superlu_dist_replacetinypivot' might help.
> > Hong
> >
> > Hi,
> >
> > The given matrix+vector fails with SuperLU_Dist in some of our nightly validation tests since I activated parallel symbolic factorisation (with -mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1).
> >
> > I extracted an example system and reproduced the bug with src/ksp/ksp/examples/tests/ex6.c. It runs with 2 or 3 processes, but with 4 it gives an FPE on process #1:
> >
> > mpirun -n 4 ./ex6 -f AssembleurGD_resolution_no_0_0 -ksp_view -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type superlu_dist -mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1
> >
> > ...
> > [1]PETSC ERROR: ------------------------------------------------------------------------
> > [1]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
> > [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > [1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > [1]PETSC ERROR: likely location of problem given in stack below
> > [1]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> > [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> > [1]PETSC ERROR:       INSTEAD the line number of the start of the function
> > [1]PETSC ERROR:       is given.
> > [1]PETSC ERROR: [1] SuperLU_DIST:pdgssvx line 467 /home/mefpp_ericc/petsc-3.9.2-debug/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> > [1]PETSC ERROR: [1] MatLUFactorNumeric_SuperLU_DIST line 314 /home/mefpp_ericc/petsc-3.9.2-debug/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> > [1]PETSC ERROR: [1] MatLUFactorNumeric line 3014 /home/mefpp_ericc/petsc-3.9.2-debug/src/mat/interface/matrix.c
> > [1]PETSC ERROR: [1] PCSetUp_LU line 59 /home/mefpp_ericc/petsc-3.9.2-debug/src/ksp/pc/impls/factor/lu/lu.c
> > [1]PETSC ERROR: [1] PCSetUp line 885 /home/mefpp_ericc/petsc-3.9.2-debug/src/ksp/pc/interface/precon.c
> > [1]PETSC ERROR: [1] KSPSetUp line 294 /home/mefpp_ericc/petsc-3.9.2-debug/src/ksp/ksp/interface/itfunc.c
> > [1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> > [1]PETSC ERROR: Signal received
> > [1]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > [1]PETSC ERROR: Petsc Release Version 3.9.2, May, 20, 2018
> > [1]PETSC ERROR: ./ex6 on a  named lorien by eric Tue May 22 10:39:15 2018
> > [1]PETSC ERROR: Configure options --prefix=/opt/petsc-3.9.2_debug_openmpi-1.10.2 --with-mpi-compilers=1 --with-mpi-dir=/opt/openmpi-1.10.2 --with-make-np=12 --with-shared-libraries=1 --with-debugging=yes --with-memalign=64 --with-visibility=0 --with-64-bit-indices=0 --download-ml=yes --download-mumps=yes --download-superlu=yes --download-superlu_dist=yes --download-parmetis=yes --download-ptscotch=yes --download-metis=yes --download-suitesparse=yes --download-hypre=yes --with-blaslapack-dir=/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64 --with-mkl_pardiso-dir=/opt/intel/composer_xe_2015.2.164/mkl --with-mkl_cpardiso-dir=/opt/intel/composer_xe_2015.2.164/mkl --with-scalapack=1 --with-scalapack-include=/opt/intel/composer_xe_2015.2.164/mkl/include --with-scalapack-lib="-L/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64"
> > [1]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> > ...
> >
> > The given Matrix+Vector are available here:
> >
> > http://www.giref.ulaval.ca/~ericc/bug_superlu_dist_parallel_factorisation/AssembleurGD_resolution_no_0_0
> >
> > http://www.giref.ulaval.ca/~ericc/bug_superlu_dist_parallel_factorisation/AssembleurGD_resolution_no_0_0.info
> >
> > If I run with -on_error_attach_debugger, I can see a division by zero here:
> >
> > #8  <signal handler called>
> > (gdb)
> > #9  0x00007f96a2148e52 in libmetis__FM_2WayCutRefine (ctrl=0x2784d20, graph=0x2784940, ntpwgts=0x7ffdfa323060, niter=4)
> >     at /home/mefpp_ericc/petsc-3.9.2-debug/arch-linux2-c-debug/externalpackages/git.metis/libmetis/fm.c:60
> > 60        avgvwgt = gk_min((pwgts[0]+pwgts[1])/20, 2*(pwgts[0]+pwgts[1])/nvtxs);
> >
> > and the value of nvtxs is "0"...
> >
> > Thanks!
> >
> > Eric
> >
>
>