[petsc-users] Strange behavior of MatLUFactorNumeric()

Jinquan Zhong jzhong at scsolutions.com
Thu Aug 16 11:14:55 CDT 2012


Barry,

*********************************************************************************************************************************
Here is the typical output I got from Valgrind for a large matrix of order N=141081.  Do these warnings indicate something wrong with the installation or the software settings?
*********************************************************************************************************************************

1.  Several instances where the following "Conditional jump or move" warning appears:

==28565== Conditional jump or move depends on uninitialised value(s)
==28565==    at 0x85A9590: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==28565==    by 0x3FDE4066AA: ibv_open_device (in /usr/lib64/libibverbs.so.1.0.0)
==28565==    by 0x79BD990: ring_rdma_open_hca (rdma_iba_priv.c:411)
==28565==    by 0x79CC6CD: rdma_setup_startup_ring (ring_startup.c:405)
==28565==    by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
==28565==    by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
==28565==    by 0x7996600: MPID_Init (mpid_init.c:288)
==28565==    by 0x799011E: MPIR_Init_thread (initthread.c:402)
==28565==    by 0x7990328: PMPI_Init_thread (initthread.c:569)
==28565==    by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
==28565==    by 0x408ED9: main (hw.cpp:122)


==28565== Conditional jump or move depends on uninitialised value(s)
==28565==    at 0x79A59D1: calloc (mvapich_malloc.c:3719)
==28565==    by 0x85A99F2: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==28565==    by 0x85AB75A: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==28565==    by 0x3FDE408F21: ibv_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
==28565==    by 0x79CC6A6: create_qp (ring_startup.c:223)
==28565==    by 0x79CC7DE: rdma_setup_startup_ring (ring_startup.c:434)
==28565==    by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
==28565==    by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
==28565==    by 0x7996600: MPID_Init (mpid_init.c:288)
==28565==    by 0x799011E: MPIR_Init_thread (initthread.c:402)
==28565==    by 0x7990328: PMPI_Init_thread (initthread.c:569)
==28565==    by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
==28565==    by 0x408ED9: main (hw.cpp:122)

==28565== Conditional jump or move depends on uninitialised value(s)
==28565==    at 0x4A08E54: memset (mc_replace_strmem.c:731)
==28565==    by 0x79A5BA1: calloc (mvapich_malloc.c:3825)
==28565==    by 0x85A99F2: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==28565==    by 0x85AB75A: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==28565==    by 0x3FDE408F21: ibv_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
==28565==    by 0x79CC6A6: create_qp (ring_startup.c:223)
==28565==    by 0x79CC7DE: rdma_setup_startup_ring (ring_startup.c:434)
==28565==    by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
==28565==    by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
==28565==    by 0x7996600: MPID_Init (mpid_init.c:288)
==28565==    by 0x799011E: MPIR_Init_thread (initthread.c:402)
==28565==    by 0x7990328: PMPI_Init_thread (initthread.c:569)
==28565==    by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
==28565==    by 0x408ED9: main (hw.cpp:122)

==28565== Conditional jump or move depends on uninitialised value(s)
==28565==    at 0x4A08E79: memset (mc_replace_strmem.c:731)
==28565==    by 0x79A5BA1: calloc (mvapich_malloc.c:3825)
==28565==    by 0x85A99F2: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==28565==    by 0x85AB75A: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==28565==    by 0x3FDE408F21: ibv_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
==28565==    by 0x79CC6A6: create_qp (ring_startup.c:223)
==28565==    by 0x79CC7DE: rdma_setup_startup_ring (ring_startup.c:434)
==28565==    by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
==28565==    by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
==28565==    by 0x7996600: MPID_Init (mpid_init.c:288)
==28565==    by 0x799011E: MPIR_Init_thread (initthread.c:402)
==28565==    by 0x7990328: PMPI_Init_thread (initthread.c:569)
==28565==    by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
==28565==    by 0x408ED9: main (hw.cpp:122)

2.  One instance where a write() syscall parameter points to uninitialised bytes:

==28565== Syscall param write(buf) points to uninitialised byte(s)
==28565==    at 0x3FDF40E460: __write_nocancel (in /lib64/libpthread-2.12.so)
==28565==    by 0x3FDE404DFC: ibv_cmd_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
==28565==    by 0x85AB742: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==28565==    by 0x3FDE408F21: ibv_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
==28565==    by 0x79CC6A6: create_qp (ring_startup.c:223)
==28565==    by 0x79CC7DE: rdma_setup_startup_ring (ring_startup.c:434)
==28565==    by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
==28565==    by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
==28565==    by 0x7996600: MPID_Init (mpid_init.c:288)
==28565==    by 0x799011E: MPIR_Init_thread (initthread.c:402)
==28565==    by 0x7990328: PMPI_Init_thread (initthread.c:569)
==28565==    by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
==28565==    by 0x408ED9: main (hw.cpp:122)
==28565==  Address 0x7feffbae8 is on thread 1's stack


3.  One use of an uninitialised value of size 8:

==28565== Use of uninitialised value of size 8
==28565==    at 0x85A998F: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==28565==    by 0x85AB75A: ??? (in /usr/lib64/libmlx4-rdmav2.so)
==28565==    by 0x3FDE408F21: ibv_create_qp (in /usr/lib64/libibverbs.so.1.0.0)
==28565==    by 0x79CC6A6: create_qp (ring_startup.c:223)
==28565==    by 0x79CC7DE: rdma_setup_startup_ring (ring_startup.c:434)
==28565==    by 0x79B98D2: MPIDI_CH3I_CM_Init (rdma_iba_init.c:1031)
==28565==    by 0x793C79C: MPIDI_CH3_Init (ch3_init.c:159)
==28565==    by 0x7996600: MPID_Init (mpid_init.c:288)
==28565==    by 0x799011E: MPIR_Init_thread (initthread.c:402)
==28565==    by 0x7990328: PMPI_Init_thread (initthread.c:569)
==28565==    by 0x481AE0: PetscInitialize(int*, char***, char const*, char const*) (pinit.c:671)
==28565==    by 0x408ED9: main (hw.cpp:122)



*******************************************
The error message for rank 10 from PETSc is
*******************************************

[10]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[10]PETSC ERROR:       INSTEAD the line number of the start of the function
[10]PETSC ERROR:       is given.
[10]PETSC ERROR: --------------------- Error Message ------------------------------------
[10]PETSC ERROR: Signal received!
[10]PETSC ERROR: ------------------------------------------------------------------------
[10]PETSC ERROR: Petsc Release Version 3.3.0, Patch 2, Fri Jul 13 15:42:00 CDT 2012
[10]PETSC ERROR: See docs/changes/index.html for recent updates.
[10]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[10]PETSC ERROR: See docs/index.html for manual pages.
[10]PETSC ERROR: ------------------------------------------------------------------------
[10]PETSC ERROR: /nfs/06/com0488/programs/examples/testc/ZSOL on a arch-linu named n0685.ten.osc.edu by com0488 Wed Aug 15 21:00:31 2012
[10]PETSC ERROR: Libraries linked from /nfs/07/com0489/petsc/petsc-3.3-p2/arch-linux2-cxx-debug/lib
[10]PETSC ERROR: Configure run at Wed Aug 15 13:55:29 2012
[10]PETSC ERROR: Configure options --with-blas-lapack-lib="-L/usr/local/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64 -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -liomp5 -lm" --download-blacs --download-scalapack --with-mpi-dir=/usr/local/mvapich2/1.7-gnu --with-mpiexec=/usr/local/bin/mpiexec --with-scalar-type=complex --with-precision=double --with-clanguage=cxx --with-fortran-kernels=generic --download-mumps --download-superlu_dist --download-parmetis --download-metis --with-fortran-interfaces
[10]PETSC ERROR: ------------------------------------------------------------------------
[10]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
[cli_10]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 10



Jinquan




-----Original Message-----
From: petsc-users-bounces at mcs.anl.gov [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Barry Smith
Sent: Tuesday, August 14, 2012 7:54 PM
To: PETSc users list
Subject: Re: [petsc-users] Strange behavior of MatLUFactorNumeric()


On Aug 14, 2012, at 6:05 PM, Jinquan Zhong <jzhong at scsolutions.com> wrote:

> Barry,
> 
> The machine I ran this program on does not have valgrind.
> 
> Another interesting observation: when I ran the same three matrices 
> using PETSc 3.2, MatLUFactorNumeric() hung even for N=75 and 2028 
> until I specified -mat_superlu_dist_colperm.  However, 
> MatLUFactorNumeric() didn't work for N=21180 even when I used
> 
> 	-mat_superlu_dist_rowperm NATURAL -mat_superlu_dist_colperm NATURAL 
> -mat_superlu_dist_parsymbfact YES
> 
> I suspect that something in the factored matrix produced by SuperLU_DIST is incompatible with MatLUFactorNumeric() in PETSc 3.2.  PETSc 3.3 fixed this issue for matrices with small N, but the issue reappears for large N in PETSc 3.3.
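
For reference, here is a minimal sketch of setting those SuperLU_DIST options from code instead of the command line.  It is illustrative only: it assumes the calls are made after PetscInitialize() and before the symbolic factorization, and it uses the two-argument PetscOptionsSetValue() of PETSc 3.3; the option names and values simply mirror the ones quoted above.

    PetscErrorCode ierr;

    /* Equivalent to passing the options on the command line; they must be
       set before the factorization routines query them. */
    ierr = PetscOptionsSetValue("-mat_superlu_dist_rowperm", "NATURAL");CHKERRQ(ierr);
    ierr = PetscOptionsSetValue("-mat_superlu_dist_colperm", "NATURAL");CHKERRQ(ierr);
    ierr = PetscOptionsSetValue("-mat_superlu_dist_parsymbfact", "YES");CHKERRQ(ierr);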

  It is using SuperLU_DIST for this factorization (and that version changed with PETSc 3.3); the problem is with SuperLU_DIST, not PETSc.  valgrind will likely find an error in SuperLU_DIST.

   Barry

> 
> Jinquan
> 
> 
> -----Original Message-----
> From: petsc-users-bounces at mcs.anl.gov 
> [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Barry Smith
> Sent: Tuesday, August 14, 2012 3:55 PM
> To: PETSc users list
> Subject: Re: [petsc-users] Strange behavior of MatLUFactorNumeric()
> 
> 
>  Can you run with valgrind?
> 
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> 
> 
> 
> On Aug 14, 2012, at 5:39 PM, Jinquan Zhong <jzhong at scsolutions.com> wrote:
> 
>> Thanks, Matt.
>> 
>> 1.       Yes, I have checked the values of x returned from
>> MatSolve(F,b,x).
>> 
>>                The norm error check on x passes for N=75 and 2028 (a rough sketch of such a check is included further below).
>> 
>> 2.       Good point, Matt.  Here is the complete message for Rank 391.  The others are similar to this one.
>> 
>> 
>> [391]PETSC ERROR: ------------------------------------------------------------------------
>> [391]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>> [391]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [391]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> [391]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> [391]PETSC ERROR: likely location of problem given in stack below
>> [391]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>> [391]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> [391]PETSC ERROR:       INSTEAD the line number of the start of the function
>> [391]PETSC ERROR:       is given.
>> [391]PETSC ERROR: [391] MatLUFactorNumeric_SuperLU_DIST line 284 /nfs/06/com0488/programs/libraries/PETSc/petsc-3.3-p2/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>> [391]PETSC ERROR: [391] MatLUFactorNumeric line 2778 /nfs/06/com0488/programs/libraries/PETSc/petsc-3.3-p2/src/mat/interface/matrix.c
>> [391]PETSC ERROR: --------------------- Error Message ------------------------------------
>> [391]PETSC ERROR: Signal received!
>> [391]PETSC ERROR: ------------------------------------------------------------------------
>> [391]PETSC ERROR: Petsc Release Version 3.3.0, Patch 2, Fri Jul 13 15:42:00 CDT 2012
>> [391]PETSC ERROR: See docs/changes/index.html for recent updates.
>> [391]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>> [391]PETSC ERROR: See docs/index.html for manual pages.
>> [391]PETSC ERROR: ------------------------------------------------------------------------
>> [391]PETSC ERROR: /nfs/06/com0488/programs/examples/ZSOL0.2431/ZSOL on a arch-linu named n0272.ten.osc.edu by com0488 Sun Aug 12 23:18:07 2012
>> [391]PETSC ERROR: Libraries linked from /nfs/06/com0488/programs/libraries/PETSc/petsc-3.3-p2/arch-linux2-cxx-debug/lib
>> [391]PETSC ERROR: Configure run at Fri Aug  3 17:44:00 2012
>> [391]PETSC ERROR: Configure options --with-blas-lib=/nfs/06/com0488/programs/libraries/ScaLAPACK/2.0.1/lib/librefblas.a --with-lapack-lib=/nfs/06/com0488/programs/libraries/ScaLAPACK/2.0.1/lib/libreflapack.a --download-blacs --download-scalapack --with-mpi-dir=/usr/local/mvapich2/1.7-gnu --with-mpiexec=/usr/local/bin/mpiexec --with-scalar-type=complex --with-precision=double --with-clanguage=cxx --with-fortran-kernels=generic --download-mumps --download-superlu_dist --download-parmetis --download-metis --with-fortran-interfaces
>> [391]PETSC ERROR: ------------------------------------------------------------------------
>> [391]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
>> [cli_391]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 391
>> 
>> 
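
As an aside, the norm error check mentioned in item 1 above could look roughly like the following sketch.  It assumes A, b, and the computed x are still available after MatSolve(); the temporary vector r and the printed label are illustrative, not taken from the actual code.

    Vec            r;
    PetscReal      rnorm;
    PetscErrorCode ierr;

    ierr = VecDuplicate(b, &r);CHKERRQ(ierr);
    ierr = MatMult(A, x, r);CHKERRQ(ierr);            /* r = A*x     */
    ierr = VecAXPY(r, -1.0, b);CHKERRQ(ierr);         /* r = A*x - b */
    ierr = VecNorm(r, NORM_2, &rnorm);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "||A*x - b|| = %g\n", (double)rnorm);CHKERRQ(ierr);
    ierr = VecDestroy(&r);CHKERRQ(ierr);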
>> From: petsc-users-bounces at mcs.anl.gov 
>> [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Matthew Knepley
>> Sent: Tuesday, August 14, 2012 3:34 PM
>> To: PETSc users list
>> Subject: Re: [petsc-users] Strange behavior of MatLUFactorNumeric()
>> 
>> On Tue, Aug 14, 2012 at 5:26 PM, Jinquan Zhong <jzhong at scsolutions.com> wrote:
>> Dear PETSc folks,
>> 
>> I have a strange observation about using MatLUFactorNumeric() for dense matrices of different order N.  Here is the situation:
>> 
>> 1.       I use ./src/mat/tests/ex137.c as an example to direct PETSc to select SuperLU_DIST and MUMPS.  The calling sequence is (see the minimal sketch further below):
>> 
>> MatGetOrdering(A,...)
>> 
>> MatGetFactor(A,...)
>> 
>> MatLUFactorSymbolic(F, A,...)
>> 
>> MatLUFactorNumeric(F, A,...)
>> 
>> MatSolve(F,b,x)
>> 
>> 2.       I have three dense matrices A of three different orders: N=75, 2028, and 21180. 
>> 
>> 3.       The calling sequence works for N=75 and 2028.  But for N=21180, the program hangs when calling MatLUFactorNumeric(...).  It appears to be a segmentation fault, with the following error message:
>> 
>> 
>> 
>> [1]PETSC ERROR: --------------------- Error Message
>> ------------------------------------
>> [1]PETSC ERROR: Signal received!
>> 
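
A minimal sketch of that calling sequence, fleshed out into C: it assumes a square, assembled matrix A and vectors b and x; the natural ordering and the SuperLU_DIST solver package are shown only as examples, and the bare calls deliberately mirror the sequence quoted above (no return-value checks yet).

    Mat           F;
    IS            rperm, cperm;
    MatFactorInfo info;

    MatGetOrdering(A, MATORDERINGNATURAL, &rperm, &cperm);
    MatGetFactor(A, MATSOLVERSUPERLU_DIST, MAT_FACTOR_LU, &F);
    MatFactorInfoInitialize(&info);
    MatLUFactorSymbolic(F, A, rperm, cperm, &info);
    MatLUFactorNumeric(F, A, &info);
    MatSolve(F, b, x);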
>> ALWAYS send the entire error message. How can we tell anything from a small snippet?
>> 
>> Since you have [1], this was run in parallel, so you need 3rd party 
>> packages. But you do not seem to be checking return values. Check 
>> them to make sure those packages are installed correctly.
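
A sketch of the return-value checking Matt refers to: every PETSc call returns a PetscErrorCode, and wrapping it with CHKERRQ (inside a function that itself returns PetscErrorCode) reports a failed call with a full error trace rather than letting it pass silently.  Applied to the tail of the sequence above, with the same illustrative variable names:

    PetscErrorCode ierr;

    ierr = MatLUFactorSymbolic(F, A, rperm, cperm, &info);CHKERRQ(ierr);
    ierr = MatLUFactorNumeric(F, A, &info);CHKERRQ(ierr);
    ierr = MatSolve(F, b, x);CHKERRQ(ierr);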
>> 
>>   Matt
>> 
>> Does anybody have similar experience on that?
>> 
>> Thanks a lot!
>> 
>> Jinquan
>> 
>> 
>> 
>> --
>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>> -- Norbert Wiener
> 


