[petsc-users] [mumps-dev] MUMPS and PARMETIS: Crashes

Hong hzhang at mcs.anl.gov
Thu Oct 20 09:44:43 CDT 2016


Alfredo:
It would be much easier to install PETSc with MUMPS and ParMETIS and
debug this case directly. Here is what you can do on a Linux machine
(see http://www.mcs.anl.gov/petsc/documentation/installation.html):

1) get the PETSc release (maint branch):
git clone -b maint https://bitbucket.org/petsc/petsc petsc

cd petsc
git pull
export PETSC_DIR=$PWD
export PETSC_ARCH=<>
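(PETSC_ARCH is just a label for this particular build; any name works, for
example something like
export PETSC_ARCH=arch-linux-mumps-debug
where 'arch-linux-mumps-debug' is only an illustrative name.)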

2) configure PETSc with the additional options
'--download-metis --download-parmetis --download-mumps
--download-scalapack --download-ptscotch'
(see http://www.mcs.anl.gov/petsc/documentation/installation.html)
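
For example, a configure line along these lines should work (a sketch only;
the compiler/MPI wrappers shown here are an assumption, adjust them to your
system):

./configure --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 \
  --download-metis --download-parmetis --download-mumps \
  --download-scalapack --download-ptscotch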

3) build PETSc and run the tests
make
make test

4) test ex53.c:
cd $PETSC_DIR/src/ksp/ksp/examples/tutorials
make ex53
mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2
-mat_mumps_icntl_29 2
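
(Per the MUMPS documentation, -mat_mumps_icntl_28 2 requests the parallel
analysis phase and -mat_mumps_icntl_29 2 selects ParMETIS as the parallel
ordering tool, i.e., the combination that triggers the crash reported below.)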

5) debug ex53.c:
mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2
-mat_mumps_icntl_29 2 -start_in_debugger
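
Alternatively, as Barry suggests below, the same case can be run under
valgrind to pinpoint the first invalid memory access; assuming valgrind is
installed, something like:

mpiexec -n 4 valgrind --tool=memcheck -q ./ex53 -n 10000 -ksp_view
-mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2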

Give it a try. Contact us if you cannot reproduce this case.

Hong

Dear all,
> this may well be due to a bug in the parallel analysis. Do you think you
> can reproduce the problem in a standalone MUMPS program (i.e., without
> going through PETSc)? That would save a lot of time in tracking the bug,
> since we do not have a PETSc installation at hand. Otherwise we'll give it
> a shot at installing PETSc and reproducing the problem on our side.
>
> Kind regards,
> the MUMPS team
>
>
>
> On Wed, Oct 19, 2016 at 8:32 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>>
>>    Tim,
>>
>>     You can/should also run with valgrind to determine exactly the first
>> point where memory corruption occurs.
>>
>>   Barry
>>
>> > On Oct 19, 2016, at 11:08 AM, Hong <hzhang at mcs.anl.gov> wrote:
>> >
>> > Tim:
>> > With '-mat_mumps_icntl_28 1', i.e., sequential analysis, I can run ex53
>> with np=3 or larger successfully.
>> >
>> > With '-mat_mumps_icntl_28 2', i.e., parallel analysis, I can run up to
>> np=3.
>> >
>> > For np=4:
>> > mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2
>> -mat_mumps_icntl_29 2 -start_in_debugger
>> >
>> > the code crashes inside MUMPS:
>> > Program received signal SIGSEGV, Segmentation fault.
>> > 0x00007f33d75857cb in dmumps_parallel_analysis::dmumps_build_scotch_graph
>> (
>> >     id=..., first=..., last=..., ipe=...,
>> >     pe=<error reading variable: Cannot access memory at address 0x0>,
>> work=...)
>> >     at dana_aux_par.F:1450
>> > 1450                MAPTAB(J) = I
>> > (gdb) bt
>> > #0  0x00007f33d75857cb in dmumps_parallel_analysis::dmumps_build_scotch_graph
>> (
>> >     id=..., first=..., last=..., ipe=...,
>> >     pe=<error reading variable: Cannot access memory at address 0x0>,
>> work=...)
>> >     at dana_aux_par.F:1450
>> > #1  0x00007f33d759207c in dmumps_parallel_analysis::dmumps_parmetis_ord
>> (
>> >     id=..., ord=..., work=...) at dana_aux_par.F:400
>> > #2  0x00007f33d7592d14 in dmumps_parallel_analysis::dmumps_do_par_ord
>> (id=...,
>> >     ord=..., work=...) at dana_aux_par.F:351
>> > #3  0x00007f33d7593aa9 in dmumps_parallel_analysis::dmumps_ana_f_par
>> (id=...,
>> >     work1=..., work2=..., nfsiz=...,
>> >     fils=<error reading variable: Cannot access memory at address 0x0>,
>> >     frere=<error reading variable: Cannot access memory at address 0x0>)
>> >     at dana_aux_par.F:98
>> > #4  0x00007f33d74c622a in dmumps_ana_driver (id=...) at
>> dana_driver.F:563
>> > #5  0x00007f33d747706b in dmumps (id=...) at dmumps_driver.F:1108
>> > #6  0x00007f33d74721b5 in dmumps_f77 (job=1, sym=0, par=1,
>> >     comm_f77=-2080374779, n=10000, icntl=..., cntl=..., keep=...,
>> dkeep=...,
>> >     keep8=..., nz=0, irn=..., irnhere=0, jcn=..., jcnhere=0, a=...,
>> ahere=0,
>> >     nz_loc=7500, irn_loc=..., irn_lochere=1, jcn_loc=..., jcn_lochere=1,
>> >     a_loc=..., a_lochere=1, nelt=0, eltptr=..., eltptrhere=0,
>> eltvar=...,
>> >     eltvarhere=0, a_elt=..., a_elthere=0, perm_in=..., perm_inhere=0,
>> rhs=...,
>> >     rhshere=0, redrhs=..., redrhshere=0, info=..., rinfo=..., infog=...,
>> >     rinfog=..., deficiency=0, lwk_user=0, size_schur=0,
>> listvar_schur=...,
>> > ---Type <return> to continue, or q <return> to quit---
>> >     ar_schurhere=0, schur=..., schurhere=0, wk_user=..., wk_userhere=0,
>> colsca=...,
>> >     colscahere=0, rowsca=..., rowscahere=0, instance_number=1, nrhs=1,
>> lrhs=0, lredrhs=0,
>> >     rhs_sparse=..., rhs_sparsehere=0, sol_loc=..., sol_lochere=0,
>> irhs_sparse=...,
>> >     irhs_sparsehere=0, irhs_ptr=..., irhs_ptrhere=0, isol_loc=...,
>> isol_lochere=0,
>> >     nz_rhs=0, lsol_loc=0, schur_mloc=0, schur_nloc=0, schur_lld=0,
>> mblock=0, nblock=0,
>> >     nprow=0, npcol=0, ooc_tmpdir=..., ooc_prefix=...,
>> write_problem=..., tmpdirlen=20,
>> >     prefixlen=20, write_problemlen=20) at dmumps_f77.F:260
>> > #7  0x00007f33d74709b1 in dmumps_c (mumps_par=0x16126f0) at
>> mumps_c.c:415
>> > #8  0x00007f33d68408ca in MatLUFactorSymbolic_AIJMUMPS (F=0x1610280,
>> A=0x14bafc0,
>> >     r=0x160cc30, c=0x1609ed0, info=0x15c6708)
>> >     at /scratch/hzhang/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:1487
>> >
>> > -mat_mumps_icntl_29 = 0 or 1 gives the same error.
>> > I'm cc'ing this email to the MUMPS developers, who may help to resolve
>> this matter.
>> >
>> > Hong
>> >
>> >
>> > Hi all,
>> >
>> > I have some problems with PETSc using MUMPS and PARMETIS.
>> > In some cases it works fine, but in some others it doesn't, so I am
>> > trying to understand what is happening.
>> >
>> > I just picked the following example:
>> > http://www.mcs.anl.gov/petsc/petsc-current/src/ksp/ksp/examples/tutorials/ex53.c.html
>> >
>> > Now, when I start it with fewer than 4 processes it works as expected:
>> > mpirun -n 3 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 1
>> > -mat_mumps_icntl_29 2
>> >
>> > But with 4 or more processes it crashes, though only when I am using
>> ParMETIS:
>> > mpirun -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 1
>> > -mat_mumps_icntl_29 2
>> >
>> > Metis worked in every case I tried without any problems.
>> >
>> > I wonder if I am doing something wrong or if this is a general problem
>> > or even a bug? Is Parmetis supposed to work with that example with 4
>> > processes?
>> >
>> > Thanks a lot and kind regards.
>> >
>> > Volker
>> >
>> >
>> > Here is the error log of process 0:
>> >
>> > Entering DMUMPS 5.0.1 driver with JOB, N =   1       10000
>> >  =================================================
>> >  MUMPS compiled with option -Dmetis
>> >  MUMPS compiled with option -Dparmetis
>> >  =================================================
>> > L U Solver for unsymmetric matrices
>> > Type of parallelism: Working host
>> >
>> >  ****** ANALYSIS STEP ********
>> >
>> >  ** Max-trans not allowed because matrix is distributed
>> > Using ParMETIS for parallel ordering.
>> > [0]PETSC ERROR: ------------------------------------------------------------------------
>> > [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>> > probably memory access out of range
>> > [0]PETSC ERROR: Try option -start_in_debugger or
>> -on_error_attach_debugger
>> > [0]PETSC ERROR: or see
>> > http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> > [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
>> > OS X to find memory corruption errors
>> > [0]PETSC ERROR: likely location of problem given in stack below
>> > [0]PETSC ERROR: ---------------------  Stack Frames
>> > ------------------------------------
>> > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>> available,
>> > [0]PETSC ERROR:       INSTEAD the line number of the start of the
>> function
>> > [0]PETSC ERROR:       is given.
>> > [0]PETSC ERROR: [0] MatLUFactorSymbolic_AIJMUMPS line 1395
>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/mat/impls/aij/mpi/mumps/mumps.c
>> > [0]PETSC ERROR: [0] MatLUFactorSymbolic line 2927
>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/mat/interface/matrix.c
>> > [0]PETSC ERROR: [0] PCSetUp_LU line 101
>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/pc/impls/factor/lu/lu.c
>> > [0]PETSC ERROR: [0] PCSetUp line 930
>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/pc/interface/precon.c
>> > [0]PETSC ERROR: [0] KSPSetUp line 305
>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/ksp/interface/itfunc.c
>> > [0]PETSC ERROR: [0] KSPSolve line 563
>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/ksp/interface/itfunc.c
>> > [0]PETSC ERROR: --------------------- Error Message
>> > --------------------------------------------------------------
>> > [0]PETSC ERROR: Signal received
>> > [0]PETSC ERROR: See
>> > http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>> > shooting.
>> > [0]PETSC ERROR: Petsc Release Version 3.7.4, Oct, 02, 2016
>> > [0]PETSC ERROR: ./ex53 on a linux-manni-mumps named manni by 133 Wed
>> > Oct 19 16:39:49 2016
>> > [0]PETSC ERROR: Configure options --with-cc=mpiicc --with-cxx=mpiicpc
>> > --with-fc=mpiifort --with-shared-libraries=1
>> > --with-valgrind-dir=~/usr/valgrind/
>> > --with-mpi-dir=/home/software/intel/Intel-2016.4/compilers_and_libraries_2016.4.258/linux/mpi
>> > --download-scalapack --download-mumps --download-metis
>> > --download-metis-shared=0 --download-parmetis
>> > --download-parmetis-shared=0
>> > [0]PETSC ERROR: #1 User provided function() line 0 in  unknown file
>> > application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
>> >
>>
>>
>
>
> --
> -----------------------------------------
> Alfredo Buttari, PhD
> CNRS-IRIT
> 2 rue Camichel, 31071 Toulouse, France
> http://buttari.perso.enseeiht.fr
>

