[petsc-users] [mumps-dev] MUMPS and PARMETIS: Crashes

Alfredo Buttari alfredo.buttari at enseeiht.fr
Mon Dec 12 02:00:56 CST 2016


Dear all,
sorry for the late reply. The PETSc installation went very smoothly, and
I could easily reproduce the issue. I dumped the matrix generated by
PETSc and read it back with a standalone MUMPS tester in order to
confirm the bug. This bug had already been reported by another user;
it was fixed a few months ago, and the fix is included in the 5.0.2
release. Could you please check whether everything works with MUMPS
5.0.2?
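
For reference, a minimal standalone tester along the following lines is
enough to exercise the parallel analysis with distributed input and the
ParMETIS ordering. This is only a sketch, not the exact program we used:
the matrix, sizes and settings are made up for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include "dmumps_c.h"

#define JOB_INIT       -1
#define JOB_END        -2
#define USE_COMM_WORLD -987654
#define ICNTL(I) icntl[(I)-1]   /* 1-based ICNTL access, as in the MUMPS examples */

int main(int argc, char **argv)
{
  DMUMPS_STRUC_C id;
  int myid, nprocs, i;
  int n = 10000;                       /* global order, same as the ex53 runs */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  /* contiguous block-row distribution (1-based row indices) */
  int first = (n / nprocs) * myid + 1;
  int last  = (myid == nprocs - 1) ? n : first + n / nprocs - 1;
  int nloc  = last - first + 1;

  /* local part of a tridiagonal test matrix in coordinate format,
     with global row/column indices as required by ICNTL(18)=3 */
  MUMPS_INT *irn_loc = (MUMPS_INT*) malloc(3 * nloc * sizeof(MUMPS_INT));
  MUMPS_INT *jcn_loc = (MUMPS_INT*) malloc(3 * nloc * sizeof(MUMPS_INT));
  double    *a_loc   = (double*)    malloc(3 * nloc * sizeof(double));
  int nz_loc = 0;
  for (i = first; i <= last; i++) {
    irn_loc[nz_loc] = i; jcn_loc[nz_loc] = i; a_loc[nz_loc++] = 2.0;
    if (i > 1) { irn_loc[nz_loc] = i; jcn_loc[nz_loc] = i - 1; a_loc[nz_loc++] = -1.0; }
    if (i < n) { irn_loc[nz_loc] = i; jcn_loc[nz_loc] = i + 1; a_loc[nz_loc++] = -1.0; }
  }

  id.comm_fortran = USE_COMM_WORLD;
  id.par = 1;                          /* host participates, as in the PETSc interface */
  id.sym = 0;                          /* unsymmetric, LU */
  id.job = JOB_INIT;
  dmumps_c(&id);

  id.n         = n;
  id.ICNTL(5)  = 0;                    /* assembled matrix */
  id.ICNTL(18) = 3;                    /* distributed matrix entry */
  id.ICNTL(28) = 2;                    /* parallel analysis */
  id.ICNTL(29) = 2;                    /* ParMETIS ordering */
  id.nz_loc  = nz_loc;
  id.irn_loc = irn_loc;
  id.jcn_loc = jcn_loc;
  id.a_loc   = a_loc;

  id.job = 1;                          /* analysis only is enough to reach the ordering code */
  dmumps_c(&id);
  if (myid == 0) printf("analysis INFOG(1) = %d\n", (int) id.infog[0]);

  id.job = JOB_END;
  dmumps_c(&id);
  free(irn_loc); free(jcn_loc); free(a_loc);
  MPI_Finalize();
  return 0;
}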

Kind regards,
the MUMPS team




On Thu, Oct 20, 2016 at 4:44 PM, Hong <hzhang at mcs.anl.gov> wrote:
> Alfredo:
> It would be much easier to install PETSc with MUMPS and ParMETIS and
> debug this case directly. Here is what you can do on a Linux machine
> (see http://www.mcs.anl.gov/petsc/documentation/installation.html):
>
> 1) get petsc-release:
> git clone -b maint https://bitbucket.org/petsc/petsc petsc
>
> cd petsc
> git pull
> export PETSC_DIR=$PWD
> export PETSC_ARCH=<>
>
> 2) configure petsc with additional options
> '--download-metis --download-parmetis --download-mumps --download-scalapack
> --download-ptscotch'
> see http://www.mcs.anl.gov/petsc/documentation/installation.html
>
> 3) build petsc and test
> make
> make test
>
> 4) test ex53.c:
> cd $PETSC_DIR/src/ksp/ksp/examples/tutorials
> make ex53
> mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2
> -mat_mumps_icntl_29 2
>
> 5) debugging ex53.c:
> mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2
> -mat_mumps_icntl_29 2 -start_in_debugger
>
> Give it a try. Contact us if you cannot reproduce this case.
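>
> If you prefer to fix the MUMPS options in the code rather than on the
> command line, something along these lines should also work with PETSc 3.7.
> This is only a sketch, to be placed after KSPSetOperators() and before
> KSPSolve(), assuming the usual ksp and ierr variables of the example:
>
>   Mat F;
>   PC  pc;
>   ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
>   ierr = PCSetType(pc,PCLU);CHKERRQ(ierr);
>   ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);CHKERRQ(ierr);
>   ierr = PCFactorSetUpMatSolverPackage(pc);CHKERRQ(ierr);  /* create factor matrix F */
>   ierr = PCFactorGetMatrix(pc,&F);CHKERRQ(ierr);
>   ierr = MatMumpsSetIcntl(F,28,2);CHKERRQ(ierr);  /* parallel analysis */
>   ierr = MatMumpsSetIcntl(F,29,2);CHKERRQ(ierr);  /* ParMETIS ordering */
>
> This is just the programmatic equivalent of the -mat_mumps_icntl_28 and
> -mat_mumps_icntl_29 runtime options above.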
>
> Hong
>
>> Dear all,
>> this may well be due to a bug in the parallel analysis. Do you think you
>> can reproduce the problem in a standalone MUMPS program (i.e., without going
>> through PETSc)? That would save a lot of time in tracking down the bug, since
>> we do not have a PETSc installation at hand. Otherwise we'll take a shot at
>> installing PETSc and reproducing the problem on our side.
>>
>> Kind regards,
>> the MUMPS team
>>
>>
>>
>> On Wed, Oct 19, 2016 at 8:32 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>
>>>
>>>    Tim,
>>>
>>>     You can/should also run with valgrind to determine exactly the first
>>> point where memory corruption occurs.
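>>>
>>>     For example, roughly following the valgrind instructions in the PETSc
>>> FAQ (adjust the options to your run):
>>>
>>>     mpiexec -n 4 valgrind --tool=memcheck -q --num-callers=20 \
>>>        --log-file=valgrind.log.%p ./ex53 -n 10000 \
>>>        -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2 -malloc off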
>>>
>>>   Barry
>>>
>>> > On Oct 19, 2016, at 11:08 AM, Hong <hzhang at mcs.anl.gov> wrote:
>>> >
>>> > Tim:
>>> > With '-mat_mumps_icntl_28 1', i.e., sequential analysis, I can run ex53
>>> > successfully with np=3 or larger.
>>> >
>>> > With '-mat_mumps_icntl_28 2', i.e., parallel analysis, I can run up to
>>> > np=3.
>>> >
>>> > For np=4:
>>> > mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2
>>> > -mat_mumps_icntl_29 2 -start_in_debugger
>>> >
>>> > code crashes inside mumps:
>>> > Program received signal SIGSEGV, Segmentation fault.
>>> > 0x00007f33d75857cb in
>>> > dmumps_parallel_analysis::dmumps_build_scotch_graph (
>>> >     id=..., first=..., last=..., ipe=...,
>>> >     pe=<error reading variable: Cannot access memory at address 0x0>,
>>> > work=...)
>>> >     at dana_aux_par.F:1450
>>> > 1450                MAPTAB(J) = I
>>> > (gdb) bt
>>> > #0  0x00007f33d75857cb in
>>> > dmumps_parallel_analysis::dmumps_build_scotch_graph (
>>> >     id=..., first=..., last=..., ipe=...,
>>> >     pe=<error reading variable: Cannot access memory at address 0x0>,
>>> > work=...)
>>> >     at dana_aux_par.F:1450
>>> > #1  0x00007f33d759207c in dmumps_parallel_analysis::dmumps_parmetis_ord
>>> > (
>>> >     id=..., ord=..., work=...) at dana_aux_par.F:400
>>> > #2  0x00007f33d7592d14 in dmumps_parallel_analysis::dmumps_do_par_ord
>>> > (id=...,
>>> >     ord=..., work=...) at dana_aux_par.F:351
>>> > #3  0x00007f33d7593aa9 in dmumps_parallel_analysis::dmumps_ana_f_par
>>> > (id=...,
>>> >     work1=..., work2=..., nfsiz=...,
>>> >     fils=<error reading variable: Cannot access memory at address 0x0>,
>>> >     frere=<error reading variable: Cannot access memory at address
>>> > 0x0>)
>>> >     at dana_aux_par.F:98
>>> > #4  0x00007f33d74c622a in dmumps_ana_driver (id=...) at
>>> > dana_driver.F:563
>>> > #5  0x00007f33d747706b in dmumps (id=...) at dmumps_driver.F:1108
>>> > #6  0x00007f33d74721b5 in dmumps_f77 (job=1, sym=0, par=1,
>>> >     comm_f77=-2080374779, n=10000, icntl=..., cntl=..., keep=...,
>>> > dkeep=...,
>>> >     keep8=..., nz=0, irn=..., irnhere=0, jcn=..., jcnhere=0, a=...,
>>> > ahere=0,
>>> >     nz_loc=7500, irn_loc=..., irn_lochere=1, jcn_loc=...,
>>> > jcn_lochere=1,
>>> >     a_loc=..., a_lochere=1, nelt=0, eltptr=..., eltptrhere=0,
>>> > eltvar=...,
>>> >     eltvarhere=0, a_elt=..., a_elthere=0, perm_in=..., perm_inhere=0,
>>> > rhs=...,
>>> >     rhshere=0, redrhs=..., redrhshere=0, info=..., rinfo=...,
>>> > infog=...,
>>> >     rinfog=..., deficiency=0, lwk_user=0, size_schur=0,
>>> > listvar_schur=...,
>>> >     listvar_schurhere=0, schur=..., schurhere=0, wk_user=..., wk_userhere=0,
>>> > colsca=...,
>>> >     colscahere=0, rowsca=..., rowscahere=0, instance_number=1, nrhs=1,
>>> > lrhs=0, lredrhs=0,
>>> >     rhs_sparse=..., rhs_sparsehere=0, sol_loc=..., sol_lochere=0,
>>> > irhs_sparse=...,
>>> >     irhs_sparsehere=0, irhs_ptr=..., irhs_ptrhere=0, isol_loc=...,
>>> > isol_lochere=0,
>>> >     nz_rhs=0, lsol_loc=0, schur_mloc=0, schur_nloc=0, schur_lld=0,
>>> > mblock=0, nblock=0,
>>> >     nprow=0, npcol=0, ooc_tmpdir=..., ooc_prefix=...,
>>> > write_problem=..., tmpdirlen=20,
>>> >     prefixlen=20, write_problemlen=20) at dmumps_f77.F:260
>>> > #7  0x00007f33d74709b1 in dmumps_c (mumps_par=0x16126f0) at
>>> > mumps_c.c:415
>>> > #8  0x00007f33d68408ca in MatLUFactorSymbolic_AIJMUMPS (F=0x1610280,
>>> > A=0x14bafc0,
>>> >     r=0x160cc30, c=0x1609ed0, info=0x15c6708)
>>> >     at /scratch/hzhang/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:1487
>>> >
>>> > -mat_mumps_icntl_29 = 0 or 1 gives the same error.
>>> > I'm cc'ing this email to the MUMPS developers, who may help resolve this
>>> > matter.
>>> >
>>> > Hong
>>> >
>>> >
>>> > Hi all,
>>> >
>>> > I have some problems with PETSc using MUMPS and PARMETIS.
>>> > In some cases it works fine, but in others it doesn't, so I am trying
>>> > to understand what is happening.
>>> >
>>> > I just picked the following example:
>>> >
>>> > http://www.mcs.anl.gov/petsc/petsc-current/src/ksp/ksp/examples/tutorials/ex53.c.html
>>> >
>>> > Now, when I start it with fewer than 4 processes, it works as expected:
>>> > mpirun -n 3 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 1
>>> > -mat_mumps_icntl_29 2
>>> >
>>> > But with 4 or more processes it crashes, though only when I am using
>>> > ParMETIS:
>>> > mpirun -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 1
>>> > -mat_mumps_icntl_29 2
>>> >
>>> > Metis worked in every case I tried without any problems.
>>> >
>>> > Am I doing something wrong, or is this a general problem or even a bug?
>>> > Is ParMETIS supposed to work with that example on 4 processes?
>>> >
>>> > Thanks a lot and kind regards.
>>> >
>>> > Volker
>>> >
>>> >
>>> > Here is the error log of process 0:
>>> >
>>> > Entering DMUMPS 5.0.1 driver with JOB, N =   1       10000
>>> >  =================================================
>>> >  MUMPS compiled with option -Dmetis
>>> >  MUMPS compiled with option -Dparmetis
>>> >  =================================================
>>> > L U Solver for unsymmetric matrices
>>> > Type of parallelism: Working host
>>> >
>>> >  ****** ANALYSIS STEP ********
>>> >
>>> >  ** Max-trans not allowed because matrix is distributed
>>> > Using ParMETIS for parallel ordering.
>>> > [0]PETSC ERROR:
>>> >
>>> > ------------------------------------------------------------------------
>>> > [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>> > probably memory access out of range
>>> > [0]PETSC ERROR: Try option -start_in_debugger or
>>> > -on_error_attach_debugger
>>> > [0]PETSC ERROR: or see
>>> > http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> > [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
>>> > OS X to find memory corruption errors
>>> > [0]PETSC ERROR: likely location of problem given in stack below
>>> > [0]PETSC ERROR: ---------------------  Stack Frames
>>> > ------------------------------------
>>> > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>>> > available,
>>> > [0]PETSC ERROR:       INSTEAD the line number of the start of the
>>> > function
>>> > [0]PETSC ERROR:       is given.
>>> > [0]PETSC ERROR: [0] MatLUFactorSymbolic_AIJMUMPS line 1395
>>> >
>>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/mat/impls/aij/mpi/mumps/mumps.c
>>> > [0]PETSC ERROR: [0] MatLUFactorSymbolic line 2927
>>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/mat/interface/matrix.c
>>> > [0]PETSC ERROR: [0] PCSetUp_LU line 101
>>> >
>>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/pc/impls/factor/lu/lu.c
>>> > [0]PETSC ERROR: [0] PCSetUp line 930
>>> >
>>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/pc/interface/precon.c
>>> > [0]PETSC ERROR: [0] KSPSetUp line 305
>>> >
>>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/ksp/interface/itfunc.c
>>> > [0]PETSC ERROR: [0] KSPSolve line 563
>>> >
>>> > /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/ksp/interface/itfunc.c
>>> > [0]PETSC ERROR: --------------------- Error Message
>>> > --------------------------------------------------------------
>>> > [0]PETSC ERROR: Signal received
>>> > [0]PETSC ERROR: See
>>> > http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>>> > shooting.
>>> > [0]PETSC ERROR: Petsc Release Version 3.7.4, Oct, 02, 2016
>>> > [0]PETSC ERROR: ./ex53 on a linux-manni-mumps named manni by 133 Wed
>>> > Oct 19 16:39:49 2016
>>> > [0]PETSC ERROR: Configure options --with-cc=mpiicc --with-cxx=mpiicpc
>>> > --with-fc=mpiifort --with-shared-libraries=1
>>> > --with-valgrind-dir=~/usr/valgrind/
>>> >
>>> > --with-mpi-dir=/home/software/intel/Intel-2016.4/compilers_and_libraries_2016.4.258/linux/mpi
>>> > --download-scalapack --download-mumps --download-metis
>>> > --download-metis-shared=0 --download-parmetis
>>> > --download-parmetis-shared=0
>>> > [0]PETSC ERROR: #1 User provided function() line 0 in  unknown file
>>> > application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
>>> >
>>>
>>
>>
>>
>> --
>> -----------------------------------------
>> Alfredo Buttari, PhD
>> CNRS-IRIT
>> 2 rue Camichel, 31071 Toulouse, France
>> http://buttari.perso.enseeiht.fr
>
>



-- 
-----------------------------------------
Alfredo Buttari, PhD
CNRS-IRIT
2 rue Camichel, 31071 Toulouse, France
http://buttari.perso.enseeiht.fr

