[petsc-users] Strange behavior of MatLUFactorNumeric()
Barry Smith
bsmith at mcs.anl.gov
Fri Aug 17 16:14:48 CDT 2012
Those valgrind messages are not helpful; they are just issues with the MPI implementation. If you see valgrind messages where the problems are inside PETSc or superlu_dist, those may be real problems (send any such valgrind messages to petsc-maint at mcs.anl.gov).
You can track down the problem easily with a little debugger work. Consider running with -on_error_attach_debugger; when the program crashes, type where in the debugger and look at the lines and variables at the crash site, and you may find the problem. Basic usage of the debugger can save you endless amounts of pain. It is worth spending a few hours getting comfortable with the debugger; it will pay off big time.
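For example, a minimal sketch of that workflow (the executable name, process count, and variable name below are placeholders, not taken from your run):

  mpiexec -n 16 ./ZSOL -on_error_attach_debugger

and then, in the debugger that attaches on the faulting rank:

  (gdb) where
  (gdb) print some_variable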
Barry
On Aug 16, 2012, at 11:14 AM, Jinquan Zhong <jzhong at scsolutions.com> wrote:
> Barry,
>
> *********************************************************************************************************************************
> Here is the typical info I got from Valgrind for a large matrix of order N=141081. Would these messages imply something wrong in the installation or the software settings?
> *********************************************************************************************************************************
>
> 1. Several circumstances where the following (Conditional jump or move) happens:
>
> ==28565== Conditional jump or move depends on uninitialised value(s)
> ==28565== at 0x85A9590: ??? (in /usr/lib64/libmlx4-rdmav2.so)
> ==28565== by 0x3FDE4066AA: ibv_open_device (in /usr/lib64/libibverbs.so.1.0.0)
> ==28565== by 0x79BD990: ring_rdma_open_hca (rdma_iba_priv.c:411)
> ==28565== by 0x79CC6CD: rdma_setup_startup_ring (ring_startup.c:405)
> ==28565== by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
> ==28565== by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
> ==28565== by 0x7996600: MPID_Init (mpid_init.c:288)
> ==28565== by 0x799011E: MPIR_Init_thread (initthread.c:402)
> ==28565== by 0x7990328: PMPI_Init_thread (initthread.c:569)
> ==28565== by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
> ==28565== by 0x408ED9: main (hw.cpp:122)
>
>
> ==28565== Conditional jump or move depends on uninitialised value(s)
> ==28565== at 0x79A59D1: calloc (mvapich_malloc.c:3719)
> ==28565== by 0x85A99F2: ??? (in /usr/lib64/libmlx4-rdmav2.so)
> ==28565== by 0x85AB75A: ??? (in /usr/lib64/libmlx4-rdmav2.so)
> ==28565== by 0x3FDE408F21: ibv_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
> ==28565== by 0x79CC6A6: create_qp (ring_startup.c:223)
> ==28565== by 0x79CC7DE: rdma_setup_startup_ring (ring_startup.c:434)
> ==28565== by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
> ==28565== by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
> ==28565== by 0x7996600: MPID_Init (mpid_init.c:288)
> ==28565== by 0x799011E: MPIR_Init_thread (initthread.c:402)
> ==28565== by 0x7990328: PMPI_Init_thread (initthread.c:569)
> ==28565== by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
> ==28565== by 0x408ED9: main (hw.cpp:122)
>
> ==28565== Conditional jump or move depends on uninitialised value(s)
> ==28565== at 0x4A08E54: memset (mc_replace_strmem.c:731)
> ==28565== by 0x79A5BA1: calloc (mvapich_malloc.c:3825)
> ==28565== by 0x85A99F2: ??? (in /usr/lib64/libmlx4-rdmav2.so)
> ==28565== by 0x85AB75A: ??? (in /usr/lib64/libmlx4-rdmav2.so)
> ==28565== by 0x3FDE408F21: ibv_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
> ==28565== by 0x79CC6A6: create_qp (ring_startup.c:223)
> ==28565== by 0x79CC7DE: rdma_setup_startup_ring (ring_startup.c:434)
> ==28565== by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
> ==28565== by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
> ==28565== by 0x7996600: MPID_Init (mpid_init.c:288)
> ==28565== by 0x799011E: MPIR_Init_thread (initthread.c:402)
> ==28565== by 0x7990328: PMPI_Init_thread (initthread.c:569)
> ==28565== by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
> ==28565== by 0x408ED9: main (hw.cpp:122)
>
> ==28565== Conditional jump or move depends on uninitialised value(s)
> ==28565== at 0x4A08E79: memset (mc_replace_strmem.c:731)
> ==28565== by 0x79A5BA1: calloc (mvapich_malloc.c:3825)
> ==28565== by 0x85A99F2: ??? (in /usr/lib64/libmlx4-rdmav2.so)
> ==28565== by 0x85AB75A: ??? (in /usr/lib64/libmlx4-rdmav2.so)
> ==28565== by 0x3FDE408F21: ibv_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
> ==28565== by 0x79CC6A6: create_qp (ring_startup.c:223)
> ==28565== by 0x79CC7DE: rdma_setup_startup_ring (ring_startup.c:434)
> ==28565== by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
> ==28565== by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
> ==28565== by 0x7996600: MPID_Init (mpid_init.c:288)
> ==28565== by 0x799011E: MPIR_Init_thread (initthread.c:402)
> ==28565== by 0x7990328: PMPI_Init_thread (initthread.c:569)
> ==28565== by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
> ==28565== by 0x408ED9: main (hw.cpp:122)
>
> 2.
>
> ==28565== Syscall param write(buf) points to uninitialised byte(s)
> ==28565== at 0x3FDF40E460: __write_nocancel (in /lib64/libpthread-2.12.so)
> ==28565== by 0x3FDE404DFC: ibv_cmd_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
> ==28565== by 0x85AB742: ??? (in /usr/lib64/libmlx4-rdmav2.so)
> ==28565== by 0x3FDE408F21: ibv_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
> ==28565== by 0x79CC6A6: create_qp (ring_startup.c:223)
> ==28565== by 0x79CC7DE: rdma_setup_startup_ring (ring_startup.c:434)
> ==28565== by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
> ==28565== by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
> ==28565== by 0x7996600: MPID_Init (mpid_init.c:288)
> ==28565== by 0x799011E: MPIR_Init_thread (initthread.c:402)
> ==28565== by 0x7990328: PMPI_Init_thread (initthread.c:569)
> ==28565== by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
> ==28565== by 0x408ED9: main (hw.cpp:122)
> ==28565== Address 0x7feffbae8 is on thread 1's stack
>
>
> 3.
>
> ==28565== Use of uninitialised value of size 8
> ==28565== at 0x85A998F: ??? (in /usr/lib64/libmlx4-rdmav2.so)
> ==28565== by 0x85AB75A: ??? (in /usr/lib64/libmlx4-rdmav2.so)
> ==28565== by 0x3FDE408F21: ibv_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
> ==28565== by 0x79CC6A6: create_qp (ring_startup.c:223)
> ==28565== by 0x79CC7DE: rdma_setup_startup_ring (ring_startup.c:434)
> ==28565== by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
> ==28565== by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
> ==28565== by 0x7996600: MPID_Init (mpid_init.c:288)
> ==28565== by 0x799011E: MPIR_Init_thread (initthread.c:402)
> ==28565== by 0x7990328: PMPI_Init_thread (initthread.c:569)
> ==28565== by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
> ==28565== by 0x408ED9: main (hw.cpp:122)
>
>
>
> *******************************************
> The error message for rank 10 from PETSc is
> *******************************************
>
> [10]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [10]PETSC ERROR: INSTEAD the line number of the start of the function
> [10]PETSC ERROR: is given.
> [10]PETSC ERROR: --------------------- Error Message ------------------------------------
> [10]PETSC ERROR: Signal received!
> [10]PETSC ERROR: ------------------------------------------------------------------------
> [10]PETSC ERROR: Petsc Release Version 3.3.0, Patch 2, Fri Jul 13 15:42:00 CDT 2012
> [10]PETSC ERROR: See docs/changes/index.html for recent updates.
> [10]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [10]PETSC ERROR: See docs/index.html for manual pages.
> [10]PETSC ERROR: ------------------------------------------------------------------------
> [10]PETSC ERROR: /nfs/06/com0488/programs/examples/testc/ZSOL on a arch-linu named n0685.ten.osc.edu by com0488 Wed Aug 15 21:00:31 2012
> [10]PETSC ERROR: Libraries linked from /nfs/07/com0489/petsc/petsc-3.3-p2/arch-linux2-cxx-debug/lib
> [10]PETSC ERROR: Configure run at Wed Aug 15 13:55:29 2012
> [10]PETSC ERROR: Configure options --with-blas-lapack-lib="-L/usr/local/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64 -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -liomp5 -lm" --download-blacs --download-scalapack --with-mpi-dir=/usr/local/mvapich2/1.7-gnu --with-mpiexec=/usr/local/bin/mpiexec --with-scalar-type=complex --with-precision=double --with-clanguage=cxx --with-fortran-kernels=generic --download-mumps --download-superlu_dist --download-parmetis --download-metis --with-fortran-interfaces
> [10]PETSC ERROR: ------------------------------------------------------------------------
> [10]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
> [cli_10]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 10
>
>
>
> Jinquan
>
>
>
>
> -----Original Message-----
> From: petsc-users-bounces at mcs.anl.gov [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Barry Smith
> Sent: Tuesday, August 14, 2012 7:54 PM
> To: PETSc users list
> Subject: Re: [petsc-users] Strange behavior of MatLUFactorNumeric()
>
>
> On Aug 14, 2012, at 6:05 PM, Jinquan Zhong <jzhong at scsolutions.com> wrote:
>
>> Barry,
>>
>> The machine I ran this program does not have valgrind.
>>
>> Another interesting observation: when I ran the same three matrices using PETSc 3.2, MatLUFactorNumeric() hung even for N=75 and 2028 until I specified -mat_superlu_dist_colperm. However, MatLUFactorNumeric() didn't work for N=21180 either, even when I used
>>
>> -mat_superlu_dist_rowperm NATURAL -mat_superlu_dist_colperm NATURAL
>> -mat_superlu_dist_parsymbfact YES
>>
>> I suspect that there is something in the factored matrix from SuperLU_DIST that is incompatible with MatLUFactorNumeric() in PETSc 3.2. PETSc 3.3 fixed this issue for matrices with small N, but the issue reappeared for large N in PETSc 3.3.
>
> It is using SuperLU_DIST for this factorization (and that version changed with PETSc 3.3), so the problem is with SuperLU_DIST, not PETSc. Valgrind will likely find an error in SuperLU_DIST.
>
> Barry
>
>>
>> Jinquan
>>
>>
>> -----Original Message-----
>> From: petsc-users-bounces at mcs.anl.gov
>> [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Barry Smith
>> Sent: Tuesday, August 14, 2012 3:55 PM
>> To: PETSc users list
>> Subject: Re: [petsc-users] Strange behavior of MatLUFactorNumeric()
>>
>>
>> Can you run with valgrind
>>
>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>
>>
>>
>> On Aug 14, 2012, at 5:39 PM, Jinquan Zhong <jzhong at scsolutions.com> wrote:
>>
>>> Thanks, Matt.
>>>
>>> 1. Yes, I have checked the values of x returned from MatSolve(F,b,x).
>>>
>>> The norm error check for x completes for N=75 and 2028.
>>>
>>> 2. Good point, Matt. Here is the complete message for Rank 391. The others are similar to this one.
>>>
>>>
>>> [391]PETSC ERROR: ------------------------------------------------------------------------
>>> [391]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>> [391]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [391]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> [391]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> [391]PETSC ERROR: likely location of problem given in stack below
>>> [391]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>> [391]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>> [391]PETSC ERROR: INSTEAD the line number of the start of the function
>>> [391]PETSC ERROR: is given.
>>> [391]PETSC ERROR: [391] MatLUFactorNumeric_SuperLU_DIST line 284 /nfs/06/com0488/programs/libraries/PETSc/petsc-3.3-p2/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>> [391]PETSC ERROR: [391] MatLUFactorNumeric line 2778 /nfs/06/com0488/programs/libraries/PETSc/petsc-3.3-p2/src/mat/interface/matrix.c
>>> [391]PETSC ERROR: --------------------- Error Message ------------------------------------
>>> [391]PETSC ERROR: Signal received!
>>> [391]PETSC ERROR: ------------------------------------------------------------------------
>>> [391]PETSC ERROR: Petsc Release Version 3.3.0, Patch 2, Fri Jul 13 15:42:00 CDT 2012
>>> [391]PETSC ERROR: See docs/changes/index.html for recent updates.
>>> [391]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>>> [391]PETSC ERROR: See docs/index.html for manual pages.
>>> [391]PETSC ERROR: ------------------------------------------------------------------------
>>> [391]PETSC ERROR: /nfs/06/com0488/programs/examples/ZSOL0.2431/ZSOL on a arch-linu named n0272.ten.osc.edu by com0488 Sun Aug 12 23:18:07 2012
>>> [391]PETSC ERROR: Libraries linked from /nfs/06/com0488/programs/libraries/PETSc/petsc-3.3-p2/arch-linux2-cxx-debug/lib
>>> [391]PETSC ERROR: Configure run at Fri Aug 3 17:44:00 2012
>>> [391]PETSC ERROR: Configure options --with-blas-lib=/nfs/06/com0488/programs/libraries/ScaLAPACK/2.0.1/lib/librefblas.a --with-lapack-lib=/nfs/06/com0488/programs/libraries/ScaLAPACK/2.0.1/lib/libreflapack.a --download-blacs --download-scalapack --with-mpi-dir=/usr/local/mvapich2/1.7-gnu --with-mpiexec=/usr/local/bin/mpiexec --with-scalar-type=complex --with-precision=double --with-clanguage=cxx --with-fortran-kernels=generic --download-mumps --download-superlu_dist --download-parmetis --download-metis --with-fortran-interfaces
>>> [391]PETSC ERROR: ------------------------------------------------------------------------
>>> [391]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
>>> [cli_391]: aborting job:
>>> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 391
>>>
>>>
>>> From: petsc-users-bounces at mcs.anl.gov
>>> [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Matthew Knepley
>>> Sent: Tuesday, August 14, 2012 3:34 PM
>>> To: PETSc users list
>>> Subject: Re: [petsc-users] Strange behavior of MatLUFactorNumeric()
>>>
>>> On Tue, Aug 14, 2012 at 5:26 PM, Jinquan Zhong <jzhong at scsolutions.com> wrote:
>>> Dear PETSc folks,
>>>
>>> I have a strange observation when using MatLUFactorNumeric() for dense matrices of different order N. Here is the situation:
>>>
>>> 1. I use ./src/mat/tests/ex137.c as an example to direct PETSc to select SuperLU_DIST and MUMPS. The calling sequence, sketched in C just below, is
>>>
>>> MatGetOrdering(A,...)
>>>
>>> MatGetFactor(A,...)
>>>
>>> MatLUFactorSymbolic(F, A,...)
>>>
>>> MatLUFactorNumeric(F, A,...)
>>>
>>> MatSolve(F,b,x)
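>>>
>>> For reference, here is a minimal C sketch of this calling sequence with every return value checked. The solver package, the natural ordering, the helper-function name, and the assumption that A, b, and x are already assembled are illustrative choices, not taken from the actual code:
>>>
>>> #include <petscmat.h>
>>>
>>> /* Sketch only: factor an already assembled matrix A with SuperLU_DIST and solve A x = b.
>>>    Every return value is checked with CHKERRQ() so failures surface immediately. */
>>> PetscErrorCode SolveWithSuperLUDist(Mat A, Vec b, Vec x)
>>> {
>>>   PetscErrorCode ierr;
>>>   Mat            F;
>>>   IS             rperm, cperm;
>>>   MatFactorInfo  info;
>>>
>>>   PetscFunctionBegin;
>>>   ierr = MatGetOrdering(A, MATORDERINGNATURAL, &rperm, &cperm);CHKERRQ(ierr);
>>>   ierr = MatGetFactor(A, MATSOLVERSUPERLU_DIST, MAT_FACTOR_LU, &F);CHKERRQ(ierr);
>>>   ierr = MatFactorInfoInitialize(&info);CHKERRQ(ierr);
>>>   ierr = MatLUFactorSymbolic(F, A, rperm, cperm, &info);CHKERRQ(ierr);
>>>   ierr = MatLUFactorNumeric(F, A, &info);CHKERRQ(ierr);
>>>   ierr = MatSolve(F, b, x);CHKERRQ(ierr);
>>>   ierr = ISDestroy(&rperm);CHKERRQ(ierr);
>>>   ierr = ISDestroy(&cperm);CHKERRQ(ierr);
>>>   ierr = MatDestroy(&F);CHKERRQ(ierr);
>>>   PetscFunctionReturn(0);
>>> }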
>>>
>>> 2. I have three dense matrices A of three different dimensions: N=75, 2028, and 21180.
>>>
>>> 3. The calling sequence works for N=75 and 2028, but when N=21180 the program hung in MatLUFactorNumeric(...). It seemed to be a segmentation fault, with the following error message:
>>>
>>>
>>>
>>> [1]PETSC ERROR: --------------------- Error Message
>>> ------------------------------------
>>> [1]PETSC ERROR: Signal received!
>>>
>>> ALWAYS send the entire error message. How can we tell anything from a small snippet?
>>>
>>> Since you have [1], this was run in parallel, so you need 3rd party
>>> packages. But you do not seem to be checking return values. Check
>>> them to make sure those packages are installed correctly.
>>>
>>> Matt
>>>
>>> Does anybody have similar experience on that?
>>>
>>> Thanks a lot!
>>>
>>> Jinquan
>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>>
>