[petsc-users] MUMPS and PARMETIS: Crashes
Barry Smith
bsmith at mcs.anl.gov
Wed Oct 19 13:32:48 CDT 2016
Tim,
You can/should also run with valgrind to pinpoint exactly where the memory corruption first occurs.
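For example, something along these lines (adjust the MPI launcher and the options to your setup; the valgrind flags are just the usual memcheck ones):

  mpiexec -n 4 valgrind -q --tool=memcheck --num-callers=20 ./ex53 -n 10000 -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2

and look at the very first invalid read/write that valgrind reports.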
Barry
> On Oct 19, 2016, at 11:08 AM, Hong <hzhang at mcs.anl.gov> wrote:
>
> Tim:
> With '-mat_mumps_icntl_28 1', i.e., sequential analysis, I can run ex53 successfully with np=3 or larger.
>
> With '-mat_mumps_icntl_28 2', i.e., parallel analysis, I can run up to np=3.
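> That is, a run such as
> mpiexec -n 3 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2
> completes without problems here.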
>
> For np=4:
> mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2 -start_in_debugger
>
> The code crashes inside MUMPS:
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007f33d75857cb in dmumps_parallel_analysis::dmumps_build_scotch_graph (
> id=..., first=..., last=..., ipe=...,
> pe=<error reading variable: Cannot access memory at address 0x0>, work=...)
> at dana_aux_par.F:1450
> 1450 MAPTAB(J) = I
> (gdb) bt
> #0 0x00007f33d75857cb in dmumps_parallel_analysis::dmumps_build_scotch_graph (
> id=..., first=..., last=..., ipe=...,
> pe=<error reading variable: Cannot access memory at address 0x0>, work=...)
> at dana_aux_par.F:1450
> #1 0x00007f33d759207c in dmumps_parallel_analysis::dmumps_parmetis_ord (
> id=..., ord=..., work=...) at dana_aux_par.F:400
> #2 0x00007f33d7592d14 in dmumps_parallel_analysis::dmumps_do_par_ord (id=...,
> ord=..., work=...) at dana_aux_par.F:351
> #3 0x00007f33d7593aa9 in dmumps_parallel_analysis::dmumps_ana_f_par (id=...,
> work1=..., work2=..., nfsiz=...,
> fils=<error reading variable: Cannot access memory at address 0x0>,
> frere=<error reading variable: Cannot access memory at address 0x0>)
> at dana_aux_par.F:98
> #4 0x00007f33d74c622a in dmumps_ana_driver (id=...) at dana_driver.F:563
> #5 0x00007f33d747706b in dmumps (id=...) at dmumps_driver.F:1108
> #6 0x00007f33d74721b5 in dmumps_f77 (job=1, sym=0, par=1,
> comm_f77=-2080374779, n=10000, icntl=..., cntl=..., keep=..., dkeep=...,
> keep8=..., nz=0, irn=..., irnhere=0, jcn=..., jcnhere=0, a=..., ahere=0,
> nz_loc=7500, irn_loc=..., irn_lochere=1, jcn_loc=..., jcn_lochere=1,
> a_loc=..., a_lochere=1, nelt=0, eltptr=..., eltptrhere=0, eltvar=...,
> eltvarhere=0, a_elt=..., a_elthere=0, perm_in=..., perm_inhere=0, rhs=...,
> rhshere=0, redrhs=..., redrhshere=0, info=..., rinfo=..., infog=...,
> rinfog=..., deficiency=0, lwk_user=0, size_schur=0, listvar_schur=...,
> listvar_schurhere=0, schur=..., schurhere=0, wk_user=..., wk_userhere=0, colsca=...,
> colscahere=0, rowsca=..., rowscahere=0, instance_number=1, nrhs=1, lrhs=0, lredrhs=0,
> rhs_sparse=..., rhs_sparsehere=0, sol_loc=..., sol_lochere=0, irhs_sparse=...,
> irhs_sparsehere=0, irhs_ptr=..., irhs_ptrhere=0, isol_loc=..., isol_lochere=0,
> nz_rhs=0, lsol_loc=0, schur_mloc=0, schur_nloc=0, schur_lld=0, mblock=0, nblock=0,
> nprow=0, npcol=0, ooc_tmpdir=..., ooc_prefix=..., write_problem=..., tmpdirlen=20,
> prefixlen=20, write_problemlen=20) at dmumps_f77.F:260
> #7 0x00007f33d74709b1 in dmumps_c (mumps_par=0x16126f0) at mumps_c.c:415
> #8 0x00007f33d68408ca in MatLUFactorSymbolic_AIJMUMPS (F=0x1610280, A=0x14bafc0,
> r=0x160cc30, c=0x1609ed0, info=0x15c6708)
> at /scratch/hzhang/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:1487
>
> -mat_mumps_icntl_29 = 0 or 1 gives the same error.
> I'm cc'ing this email to the MUMPS developer, who may help resolve this matter.
>
> Hong
>
>
> Hi all,
>
> I am having some problems with PETSc when using MUMPS and ParMETIS.
> In some cases it works fine, but in others it doesn't, so I am trying
> to understand what is happening.
>
> I just picked the following example:
> http://www.mcs.anl.gov/petsc/petsc-current/src/ksp/ksp/examples/tutorials/ex53.c.html
>
> Now, when I start it with fewer than 4 processes, it works as expected:
> mpirun -n 3 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 1 -mat_mumps_icntl_29 2
>
> With 4 or more processes, however, it crashes, but only when I am using ParMETIS:
> mpirun -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 1 -mat_mumps_icntl_29 2
>
> METIS worked without any problems in every case I tried.
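>
> For reference, the METIS-based runs I mean would look something like this
> (sequential analysis; ICNTL(7)=5 should select METIS for the sequential
> ordering, if I read the MUMPS documentation correctly):
> mpirun -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 1 -mat_mumps_icntl_7 5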
>
> I wonder whether I am doing something wrong, or whether this is a general
> problem or even a bug. Is ParMETIS supposed to work with that example on 4
> processes?
>
> Thanks a lot and kind regards.
>
> Volker
>
>
> Here is the error log of process 0:
>
> Entering DMUMPS 5.0.1 driver with JOB, N = 1 10000
> =================================================
> MUMPS compiled with option -Dmetis
> MUMPS compiled with option -Dparmetis
> =================================================
> L U Solver for unsymmetric matrices
> Type of parallelism: Working host
>
> ****** ANALYSIS STEP ********
>
> ** Max-trans not allowed because matrix is distributed
> Using ParMETIS for parallel ordering.
> [0]PETSC ERROR:
> ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
> OS X to find memory corruption errors
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: --------------------- Stack Frames
> ------------------------------------
> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [0]PETSC ERROR: INSTEAD the line number of the start of the function
> [0]PETSC ERROR: is given.
> [0]PETSC ERROR: [0] MatLUFactorSymbolic_AIJMUMPS line 1395
> /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/mat/impls/aij/mpi/mumps/mumps.c
> [0]PETSC ERROR: [0] MatLUFactorSymbolic line 2927
> /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/mat/interface/matrix.c
> [0]PETSC ERROR: [0] PCSetUp_LU line 101
> /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/pc/impls/factor/lu/lu.c
> [0]PETSC ERROR: [0] PCSetUp line 930
> /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/pc/interface/precon.c
> [0]PETSC ERROR: [0] KSPSetUp line 305
> /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/ksp/interface/itfunc.c
> [0]PETSC ERROR: [0] KSPSolve line 563
> /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/ksp/interface/itfunc.c
> [0]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [0]PETSC ERROR: Signal received
> [0]PETSC ERROR: See
> http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
> shooting.
> [0]PETSC ERROR: Petsc Release Version 3.7.4, Oct, 02, 2016
> [0]PETSC ERROR: ./ex53 on a linux-manni-mumps named manni by 133 Wed
> Oct 19 16:39:49 2016
> [0]PETSC ERROR: Configure options --with-cc=mpiicc --with-cxx=mpiicpc
> --with-fc=mpiifort --with-shared-libraries=1
> --with-valgrind-dir=~/usr/valgrind/
> --with-mpi-dir=/home/software/intel/Intel-2016.4/compilers_and_libraries_2016.4.258/linux/mpi
> --download-scalapack --download-mumps --download-metis
> --download-metis-shared=0 --download-parmetis
> --download-parmetis-shared=0
> [0]PETSC ERROR: #1 User provided function() line 0 in unknown file
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
>