[petsc-users] [mumps-dev] MUMPS and PARMETIS: Crashes

Zhang, Hong hzhang at mcs.anl.gov
Mon Dec 12 09:50:37 CST 2016


I tested the master branch; it works fine.
Hong
________________________________________
From: Satish Balay [balay at mcs.anl.gov]
Sent: Monday, December 12, 2016 9:14 AM
To: Zhang, Hong
Cc: Alfredo Buttari; PETSc; mumps-dev
Subject: Re: [petsc-users] [mumps-dev]  MUMPS and PARMETIS: Crashes

Hong,

petsc master is updated to download/install mumps-5.0.2
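(To pick it up, something along these lines should work: "git checkout master && git pull" in the petsc clone, then re-run configure with --download-mumps and the other --download options used before, followed by make.)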

Satish

On Mon, 12 Dec 2016, Hong wrote:

> Alfredo:
> Sure, I got the tarball of mumps-5.0.2, and will test it and update the
> petsc-mumps interface. I'll let you know if the problem remains.
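>
> The tarball should be usable directly with configure, e.g. (a sketch; the
> path below is just wherever the tarball was saved):
>   ./configure --download-mumps=/path/to/MUMPS_5.0.2.tar.gz \
>     --download-metis --download-parmetis --download-scalapack --download-ptscotch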
>
> Hong
>
> > Dear all,
> > sorry for the late reply. The petsc installation went very smoothly and
> > I could easily reproduce the issue. I dumped the matrix generated by
> > petsc and read it back with a standalone mumps tester in order to
> > confirm the bug. This bug had already been reported by another user,
> > was fixed a few months ago, and the fix was included in the 5.0.2
> > release. Could you please check if everything works well with mumps
> > 5.0.2?
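> >
> > (For reference, the dump on the petsc side can be produced with a binary
> > viewer right after the matrix is assembled; a minimal sketch, where the
> > names A, ierr and the file name Amat.petsc are placeholders:
> >
> >   PetscViewer vw;
> >   ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,"Amat.petsc",FILE_MODE_WRITE,&vw);CHKERRQ(ierr);
> >   ierr = MatView(A,vw);CHKERRQ(ierr);   /* write the assembled operator in PETSc binary format */
> >   ierr = PetscViewerDestroy(&vw);CHKERRQ(ierr);
> >
> > The resulting file holds the distributed operator that was handed to MUMPS.)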
> >
> > Kind regards,
> > the MUMPS team
> >
> >
> >
> >
> > On Thu, Oct 20, 2016 at 4:44 PM, Hong <hzhang at mcs.anl.gov> wrote:
> > > Alfredo:
> > > It would be much easier to install petsc with mumps and parmetis, and
> > > debug this case. Here is what you can do on a linux machine
> > > (see http://www.mcs.anl.gov/petsc/documentation/installation.html):
> > >
> > > 1) get petsc-release:
> > > git clone -b maint https://bitbucket.org/petsc/petsc petsc
> > >
> > > cd petsc
> > > git pull
> > > export PETSC_DIR=$PWD
> > > export PETSC_ARCH=<>
> > >
> > > 2) configure petsc with additional options
> > > '--download-metis --download-parmetis --download-mumps
> > > --download-scalapack --download-ptscotch'
> > > see http://www.mcs.anl.gov/petsc/documentation/installation.html
> > >
> > > 3) build petsc and test
> > > make
> > > make test
> > >
> > > 4) test ex53.c:
> > > cd $PETSC_DIR/src/ksp/ksp/examples/tutorials
> > > make ex53
> > > mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2
> > > -mat_mumps_icntl_29 2
> > >
> > > 5) debugging ex53.c:
> > > mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2
> > > -mat_mumps_icntl_29 2 -start_in_debugger
> > >
> > > Give it a try. Contact us if you cannot reproduce this case.
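> > >
> > > The MUMPS controls used above can also be set from code rather than the
> > > command line. A minimal sketch against the PETSc 3.7 API, assuming a
> > > matrix A and vectors b, x (the names are placeholders) plus a
> > > PetscErrorCode ierr are already assembled:
> > >
> > >   KSP ksp;  PC pc;  Mat F;
> > >   ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
> > >   ierr = KSPSetOperators(ksp,A,A);CHKERRQ(ierr);
> > >   ierr = KSPSetType(ksp,KSPPREONLY);CHKERRQ(ierr);
> > >   ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
> > >   ierr = PCSetType(pc,PCLU);CHKERRQ(ierr);
> > >   ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);CHKERRQ(ierr);
> > >   ierr = PCFactorSetUpMatSolverPackage(pc);CHKERRQ(ierr); /* creates the factor matrix F */
> > >   ierr = PCFactorGetMatrix(pc,&F);CHKERRQ(ierr);
> > >   ierr = MatMumpsSetIcntl(F,28,2);CHKERRQ(ierr); /* ICNTL(28)=2: parallel analysis */
> > >   ierr = MatMumpsSetIcntl(F,29,2);CHKERRQ(ierr); /* ICNTL(29)=2: ParMETIS ordering */
> > >   ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);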
> > >
> > > Hong
> > >
> > >> Dear all,
> > >> this may well be due to a bug in the parallel analysis. Do you think you
> > >> can reproduce the problem in a standalone MUMPS program (i.e., without
> > >> going through PETSc)? That would save a lot of time in tracking down the
> > >> bug, since we do not have a PETSc install at hand. Otherwise we'll give
> > >> it a shot at installing petsc and reproducing the problem on our side.
> > >>
> > >> Kind regards,
> > >> the MUMPS team
> > >>
> > >>
> > >>
> > >> On Wed, Oct 19, 2016 at 8:32 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > >>>
> > >>>
> > >>>    Tim,
> > >>>
> > >>>     You can/should also run with valgrind to determine exactly the
> > >>> first point where memory corruption occurs.
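> > >>>
> > >>>     For example, something along these lines (the log-file name is just
> > >>> a suggestion):
> > >>>
> > >>>     mpiexec -n 4 valgrind -q --tool=memcheck --num-callers=20 \
> > >>>       --log-file=valgrind.log.%p ./ex53 -n 10000 \
> > >>>       -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2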
> > >>>
> > >>>   Barry
> > >>>
> > >>> > On Oct 19, 2016, at 11:08 AM, Hong <hzhang at mcs.anl.gov> wrote:
> > >>> >
> > >>> > Tim:
> > >>> > With '-mat_mumps_icntl_28 1', i.e., sequential analysis, I can run
> > >>> > ex53 with np=3 or larger successfully.
> > >>> >
> > >>> > With '-mat_mumps_icntl_28 2', i.e., parallel analysis, I can run up
> > >>> > to np=3.
> > >>> >
> > >>> > For np=4:
> > >>> > mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2
> > >>> > -mat_mumps_icntl_29 2 -start_in_debugger
> > >>> >
> > >>> > code crashes inside mumps:
> > >>> > Program received signal SIGSEGV, Segmentation fault.
> > >>> > 0x00007f33d75857cb in
> > >>> > dmumps_parallel_analysis::dmumps_build_scotch_graph (
> > >>> >     id=..., first=..., last=..., ipe=...,
> > >>> >     pe=<error reading variable: Cannot access memory at address 0x0>,
> > >>> > work=...)
> > >>> >     at dana_aux_par.F:1450
> > >>> > 1450                MAPTAB(J) = I
> > >>> > (gdb) bt
> > >>> > #0  0x00007f33d75857cb in
> > >>> > dmumps_parallel_analysis::dmumps_build_scotch_graph (
> > >>> >     id=..., first=..., last=..., ipe=...,
> > >>> >     pe=<error reading variable: Cannot access memory at address 0x0>,
> > >>> > work=...)
> > >>> >     at dana_aux_par.F:1450
> > >>> > #1  0x00007f33d759207c in dmumps_parallel_analysis::dmumps_parmetis_ord (
> > >>> >     id=..., ord=..., work=...) at dana_aux_par.F:400
> > >>> > #2  0x00007f33d7592d14 in dmumps_parallel_analysis::dmumps_do_par_ord (id=...,
> > >>> >     ord=..., work=...) at dana_aux_par.F:351
> > >>> > #3  0x00007f33d7593aa9 in dmumps_parallel_analysis::dmumps_ana_f_par
> > >>> > (id=...,
> > >>> >     work1=..., work2=..., nfsiz=...,
> > >>> >     fils=<error reading variable: Cannot access memory at address 0x0>,
> > >>> >     frere=<error reading variable: Cannot access memory at address
> > >>> > 0x0>)
> > >>> >     at dana_aux_par.F:98
> > >>> > #4  0x00007f33d74c622a in dmumps_ana_driver (id=...) at
> > >>> > dana_driver.F:563
> > >>> > #5  0x00007f33d747706b in dmumps (id=...) at dmumps_driver.F:1108
> > >>> > #6  0x00007f33d74721b5 in dmumps_f77 (job=1, sym=0, par=1,
> > >>> >     comm_f77=-2080374779, n=10000, icntl=..., cntl=..., keep=...,
> > >>> > dkeep=...,
> > >>> >     keep8=..., nz=0, irn=..., irnhere=0, jcn=..., jcnhere=0, a=...,
> > >>> > ahere=0,
> > >>> >     nz_loc=7500, irn_loc=..., irn_lochere=1, jcn_loc=...,
> > >>> > jcn_lochere=1,
> > >>> >     a_loc=..., a_lochere=1, nelt=0, eltptr=..., eltptrhere=0,
> > >>> > eltvar=...,
> > >>> >     eltvarhere=0, a_elt=..., a_elthere=0, perm_in=..., perm_inhere=0,
> > >>> > rhs=...,
> > >>> >     rhshere=0, redrhs=..., redrhshere=0, info=..., rinfo=...,
> > >>> > infog=...,
> > >>> >     rinfog=..., deficiency=0, lwk_user=0, size_schur=0,
> > >>> > listvar_schur=..., listvar_schurhere=0,
> > >>> >     schur=..., schurhere=0, wk_user=..., wk_userhere=0,
> > >>> > colsca=...,
> > >>> >     colscahere=0, rowsca=..., rowscahere=0, instance_number=1, nrhs=1,
> > >>> > lrhs=0, lredrhs=0,
> > >>> >     rhs_sparse=..., rhs_sparsehere=0, sol_loc=..., sol_lochere=0,
> > >>> > irhs_sparse=...,
> > >>> >     irhs_sparsehere=0, irhs_ptr=..., irhs_ptrhere=0, isol_loc=...,
> > >>> > isol_lochere=0,
> > >>> >     nz_rhs=0, lsol_loc=0, schur_mloc=0, schur_nloc=0, schur_lld=0,
> > >>> > mblock=0, nblock=0,
> > >>> >     nprow=0, npcol=0, ooc_tmpdir=..., ooc_prefix=...,
> > >>> > write_problem=..., tmpdirlen=20,
> > >>> >     prefixlen=20, write_problemlen=20) at dmumps_f77.F:260
> > >>> > #7  0x00007f33d74709b1 in dmumps_c (mumps_par=0x16126f0) at
> > >>> > mumps_c.c:415
> > >>> > #8  0x00007f33d68408ca in MatLUFactorSymbolic_AIJMUMPS (F=0x1610280,
> > >>> > A=0x14bafc0,
> > >>> >     r=0x160cc30, c=0x1609ed0, info=0x15c6708)
> > >>> >     at /scratch/hzhang/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:1487
> > >>> >
> > >>> > -mat_mumps_icntl_29 = 0 or 1 gives the same error.
> > >>> > I'm cc'ing this email to the mumps developers, who may help to
> > >>> > resolve this matter.
> > >>> >
> > >>> > Hong
> > >>> >
> > >>> >
> > >>> > Hi all,
> > >>> >
> > >>> > I have some problems with PETSc using MUMPS and PARMETIS.
> > >>> > In some cases it works fine, but in some others it doesn't, so I am
> > >>> > trying to understand what is happening.
> > >>> >
> > >>> > I just picked the following example:
> > >>> >
> > >>> > http://www.mcs.anl.gov/petsc/petsc-current/src/ksp/ksp/examples/tutorials/ex53.c.html
> > >>> >
> > >>> > Now, when I start it with fewer than 4 processes, it works as expected:
> > >>> > mpirun -n 3 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 1
> > >>> > -mat_mumps_icntl_29 2
> > >>> >
> > >>> > With 4 or more processes, however, it crashes, but only when I am
> > >>> > using Parmetis:
> > >>> > mpirun -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 1
> > >>> > -mat_mumps_icntl_29 2
> > >>> >
> > >>> > Metis worked in every case I tried without any problems.
> > >>> >
> > >>> > I wonder if I am doing something wrong, or if this is a general
> > >>> > problem or even a bug. Is Parmetis supposed to work with that example
> > >>> > with 4 processes?
> > >>> >
> > >>> > Thanks a lot and kind regards.
> > >>> >
> > >>> > Volker
> > >>> >
> > >>> >
> > >>> > Here is the error log of process 0:
> > >>> >
> > >>> > Entering DMUMPS 5.0.1 driver with JOB, N =   1       10000
> > >>> >  =================================================
> > >>> >  MUMPS compiled with option -Dmetis
> > >>> >  MUMPS compiled with option -Dparmetis
> > >>> >  =================================================
> > >>> > L U Solver for unsymmetric matrices
> > >>> > Type of parallelism: Working host
> > >>> >
> > >>> >  ****** ANALYSIS STEP ********
> > >>> >
> > >>> >  ** Max-trans not allowed because matrix is distributed
> > >>> > Using ParMETIS for parallel ordering.
> > >>> > [0]PETSC ERROR:
> > >>> >
> > >>> > ------------------------------------------------------------------------
> > >>> > [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> > >>> > probably memory access out of range
> > >>> > [0]PETSC ERROR: Try option -start_in_debugger or
> > >>> > -on_error_attach_debugger
> > >>> > [0]PETSC ERROR: or see
> > >>> > http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > >>> > [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple
> > >>> > Mac OS X to find memory corruption errors
> > >>> > [0]PETSC ERROR: likely location of problem given in stack below
> > >>> > [0]PETSC ERROR: ---------------------  Stack Frames
> > >>> > ------------------------------------
> > >>> > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> > >>> > available,
> > >>> > [0]PETSC ERROR:       INSTEAD the line number of the start of the
> > >>> > function
> > >>> > [0]PETSC ERROR:       is given.
> > >>> > [0]PETSC ERROR: [0] MatLUFactorSymbolic_AIJMUMPS line 1395
> > >>> >
> > >>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/mat/impls/aij/mpi/mumps/mumps.c
> > >>> > [0]PETSC ERROR: [0] MatLUFactorSymbolic line 2927
> > >>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/mat/interface/matrix.c
> > >>> > [0]PETSC ERROR: [0] PCSetUp_LU line 101
> > >>> >
> > >>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/pc/impls/factor/lu/lu.c
> > >>> > [0]PETSC ERROR: [0] PCSetUp line 930
> > >>> >
> > >>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/pc/interface/precon.c
> > >>> > [0]PETSC ERROR: [0] KSPSetUp line 305
> > >>> >
> > >>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/ksp/interface/itfunc.c
> > >>> > [0]PETSC ERROR: [0] KSPSolve line 563
> > >>> >
> > >>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/ksp/interface/itfunc.c
> > >>> > [0]PETSC ERROR: --------------------- Error Message
> > >>> > --------------------------------------------------------------
> > >>> > [0]PETSC ERROR: Signal received
> > >>> > [0]PETSC ERROR: See
> > >>> > http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
> > >>> > shooting.
> > >>> > [0]PETSC ERROR: Petsc Release Version 3.7.4, Oct, 02, 2016
> > >>> > [0]PETSC ERROR: ./ex53 on a linux-manni-mumps named manni by 133 Wed
> > >>> > Oct 19 16:39:49 2016
> > >>> > [0]PETSC ERROR: Configure options --with-cc=mpiicc --with-cxx=mpiicpc
> > >>> > --with-fc=mpiifort --with-shared-libraries=1
> > >>> > --with-valgrind-dir=~/usr/valgrind/
> > >>> >
> > >>> > --with-mpi-dir=/home/software/intel/Intel-2016.4/compilers_and_libraries_2016.4.258/linux/mpi
> > >>> > --download-scalapack --download-mumps --download-metis
> > >>> > --download-metis-shared=0 --download-parmetis
> > >>> > --download-parmetis-shared=0
> > >>> > [0]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> > >>> > application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
> > >>> >
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> -----------------------------------------
> > >> Alfredo Buttari, PhD
> > >> CNRS-IRIT
> > >> 2 rue Camichel, 31071 Toulouse, France
> > >> http://buttari.perso.enseeiht.fr
> > >
> > >
> >
> >
> >
> > --
> > -----------------------------------------
> > Alfredo Buttari, PhD
> > CNRS-IRIT
> > 2 rue Camichel, 31071 Toulouse, France
> > http://buttari.perso.enseeiht.fr
> >
>


