[petsc-users] SuperLU_dist issue in 3.7.4

Anton Popov popov at uni-mainz.de
Wed Oct 19 10:22:33 CDT 2016


Thank you, Sherry, for your efforts.

But before I can set up an example that reproduces the problem, I have to 
ask a PETSc-related question.

When I pass a matrix through MatView and MatLoad, its original partitioning 
is ignored.

Say I originally have 100 and 110 equations on two processors; after 
MatLoad I get 105 and 105 on the same two processors.

How do I pass the partitioning information through MatView/MatLoad?

I guess it's important for reproducing my setup exactly.

Thanks
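
To make the numbers above concrete: 105/105 is what an even default split of 210 rows over two ranks gives. Below is a minimal C sketch of that arithmetic. It assumes the loader falls back to the usual "N / size, plus one extra row on the first N % size ranks" rule when no local sizes are given. The function names are hypothetical helpers for illustration, not PETSc API.

```c
#include <assert.h>

/* Rows assigned to `rank` when N global rows are divided evenly over
 * `size` ranks: each rank gets N / size rows, and the first N % size
 * ranks get one extra row. Assumed to mirror the default layout used
 * when local sizes are left undetermined. Hypothetical helper. */
int default_local_rows(int N, int size, int rank)
{
    assert(N >= 0 && size > 0 && rank >= 0 && rank < size);
    return N / size + ((N % size) > rank ? 1 : 0);
}

/* First global row owned by `rank` for an explicit list of local sizes,
 * e.g. {100, 110}. This is the information that would have to be saved
 * alongside the matrix file to restore the original partitioning.
 * Hypothetical helper. */
int first_owned_row(const int *local_rows, int rank)
{
    int start = 0;
    for (int r = 0; r < rank; ++r) {
        start += local_rows[r];
    }
    return start;
}
```

With the original layout {100, 110}, rank 1 starts at global row 100, while the even split of 210 rows puts its first row at 105. One possible way to keep the original layout is to set the local sizes on the matrix (for example with MatSetSizes) before calling MatLoad; whether MatLoad honors them in 3.7.4 is an assumption here and should be checked against the MatLoad documentation.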


On 10/19/2016 08:06 AM, Xiaoye S. Li wrote:
> I looked at each item that valgrind complained about in your email dated
> Oct. 11. Those reports are really superficial; I don't see anything wrong
> with the lines singled out (mostly uninitialized variables). I ran a few
> tests with the latest version on GitHub, and all went fine.
>
> Perhaps you can print the matrix that caused the problem, so that I can
> run with your matrix.
>
> Sherry
>
>
> On Tue, Oct 11, 2016 at 2:18 PM, Anton <popov at uni-mainz.de> wrote:
>
>
>
>     On 10/11/16 7:19 PM, Satish Balay wrote:
>
>         This log looks truncated. Are there any valgrind messages
>         before this?
>         [like from your application code - or from MPI]
>
>     Yes it is indeed truncated. I only included relevant messages.
>
>
>         Perhaps you can send the complete log - with:
>         valgrind -q --tool=memcheck --leak-check=yes --num-callers=20
>         --track-origins=yes
>
>         [and if there were more valgrind messages from MPI - rebuild petsc
>         with --download-mpich - for a valgrind clean mpi]
>
>     There are no messages originating from our code, just a few MPI
>     related ones (probably false positives) and from SuperLU_DIST
>     (most of them).
>
>     Thanks,
>     Anton
>
>         Sherry,
>         Perhaps this log points to some issue in superlu_dist?
>
>         thanks,
>         Satish
>
>         On Tue, 11 Oct 2016, Anton Popov wrote:
>
>             Valgrind immediately detects interesting stuff:
>
>             ==25673== Use of uninitialised value of size 8
>             ==25673==    at 0x178272C: static_schedule
>             (static_schedule.c:960)
>             ==25674== Use of uninitialised value of size 8
>             ==25674==    at 0x178272C: static_schedule
>             (static_schedule.c:960)
>             ==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
>             ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
>
>             ==25673== Conditional jump or move depends on
>             uninitialised value(s)
>             ==25673==    at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
>             ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
>
>             ==25673== Conditional jump or move depends on
>             uninitialised value(s)
>             ==25673==    at 0x5C83F43: PMPI_Recv (in
>             /opt/mpich3/lib/libmpi.so.12.1.0)
>             ==25673==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>             ==25673==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>             ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
>             ==25674== Use of uninitialised value of size 8
>             ==25674==    at 0x62BF72B: _itoa_word (_itoa.c:179)
>             ==25674==    by 0x62C1289: printf_positional (vfprintf.c:2022)
>             ==25674==    by 0x62C2465: vfprintf (vfprintf.c:1677)
>             ==25674==    by 0x638AFD5: __vsnprintf_chk
>             (vsnprintf_chk.c:63)
>             ==25674==    by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
>             ==25674==    by 0x5CC6C08: MPIR_Err_create_code_valist (in
>             /opt/mpich3/lib/libmpi.so.12.1.0)
>             ==25674==    by 0x5CC7A9A: MPIR_Err_create_code (in
>             /opt/mpich3/lib/libmpi.so.12.1.0)
>             ==25674==    by 0x5C83FB1: PMPI_Recv (in
>             /opt/mpich3/lib/libmpi.so.12.1.0)
>             ==25674==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>             ==25674==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>             ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
>             ==25674== Use of uninitialised value of size 8
>             ==25674==    at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
>             ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
>             And it crashes after this:
>
>             ==25674== Invalid write of size 4
>             ==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
>             ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>             ==25674==    by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST
>             (superlu_dist.c:421)
>             ==25674==  Address 0xa0 is not stack'd, malloc'd or
>             (recently) free'd
>             ==25674==
>             [1]PETSC ERROR:
>             ------------------------------------------------------------------------
>             [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation
>             Violation, probably
>             memory access out of range
>
>
>             On 10/11/2016 03:26 PM, Anton Popov wrote:
>
>                 On 10/10/2016 07:11 PM, Satish Balay wrote:
>
>                     That's from petsc-3.5
>
>                     Anton - please post the stack trace you get with
>                     --download-superlu_dist-commit=origin/maint
>
>                 I guess this is it:
>
>                 [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421
>                 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>                 [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST
>                 line 282
>                 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>                 [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985
>                 /home/anton/LIB/petsc/src/mat/interface/matrix.c
>                 [0]PETSC ERROR: [0] PCSetUp_LU line 101
>                 /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
>                 [0]PETSC ERROR: [0] PCSetUp line 930
>                 /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c
>
>                 According to the line numbers it crashes within
>                 MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.
>
>                 Surprisingly this only happens on the second SNES
>                 iteration, but not on the first.
>
>                 I'm trying to reproduce this behavior with PETSc KSP
>                 and SNES examples.
>                 However, everything I've tried up to now with
>                 SuperLU_DIST does just fine.
>
>                 I'm also checking our code in Valgrind to make sure
>                 it's clean.
>
>                 Anton
>
>                     Satish
>
>
>                     On Mon, 10 Oct 2016, Xiaoye S. Li wrote:
>
>                         Which version of superlu_dist does this
>                         capture? I looked at the original error log; it
>                         pointed to pdgssvx: line 161. But that line is
>                         in a comment block, not the program.
>
>                         Sherry
>
>
>                         On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov
>                         <popov at uni-mainz.de> wrote:
>
>                             On 10/07/2016 05:23 PM, Satish Balay wrote:
>
>                                 On Fri, 7 Oct 2016, Kong, Fande wrote:
>
>                                 On Fri, Oct 7, 2016 at 9:04 AM, Satish
>                                 Balay <balay at mcs.anl.gov> wrote:
>
>                                     On Fri, 7 Oct 2016, Anton Popov wrote:
>
>                                             Hi guys,
>
>                                             is there any news about
>                                             fixing the buggy behavior of
>                                             SuperLU_DIST, exactly what
>                                             is described here:
>
>                                             http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html
>                                             ?
>
>                                             I'm using 3.7.4 and still
>                                             get SEGV in pdgssvx routine.
>                                             Everything works fine with
>                                             3.5.4.
>
>                                             Do I still have to stick
>                                             to maint branch, and what
>                                             are the chances for these
>                                             fixes to be included in 3.7.5?
>
>                                         3.7.4 is off maint branch [as
>                                         of a week ago]. So if you are
>                                         seeing issues with it - it's
>                                         best to debug and figure out
>                                         the cause.
>
>                                     This bug is indeed inside of
>                                     superlu_dist, and we started
>                                     having this issue from PETSc-3.6.x.
>                                     I think superlu_dist developers
>                                     should have fixed this bug. We
>                                     forgot to update superlu_dist??
>                                     This is not a thing users could
>                                     debug and fix.
>
>                                     I have many people in INL
>                                     suffering from this issue, and
>                                     they have to stay with
>                                     PETSc-3.5.4 to use superlu_dist.
>
>                                 To verify if the bug is fixed in
>                                 latest superlu_dist - you can try
>                                 [assuming you have git - either from
>                                 petsc-3.7/maint/master]:
>
>                                 --download-superlu_dist
>                                 --download-superlu_dist-commit=origin/maint
>
>
>                                 Satish
>
>                             Hi Satish,
>
>                             I did this:
>
>                             git clone -b maint https://bitbucket.org/petsc/petsc.git petsc
>
>                             --download-superlu_dist
>                             --download-superlu_dist-commit=origin/maint
>                             (not sure this is needed, since I'm already
>                             in maint)
>
>                             The problem is still there.
>
>                             Cheers,
>                             Anton