[petsc-users] SuperLU_dist issue in 3.7.4
Anton Popov
popov at uni-mainz.de
Wed Oct 19 10:22:33 CDT 2016
Thank you, Sherry, for your efforts, but before I can set up an example that
reproduces the problem, I have to ask a PETSc-related question.
When I dump a matrix via MatView and read it back with MatLoad, the original
partitioning is ignored. Say I originally have 100 and 110 equations on two
processors; after MatLoad I get 105 and 105, also on two processors.
What do I do to pass the partitioning information through MatView/MatLoad?
I guess it is important for reproducing my setup exactly.
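For reference, here is a minimal sketch of what I have in mind, assuming that
setting the local sizes before MatLoad is the intended way to do this (please
correct me if it is not); the file name and the helper function are just
placeholders:

#include <petscmat.h>

/* Assumes PetscInitialize() has been called and A already exists with the
   original layout (100 and 110 local rows on the two ranks). */
PetscErrorCode DumpAndReload(Mat A, Mat *B)
{
  PetscErrorCode ierr;
  PetscViewer    viewer;
  PetscInt       mloc, nloc;

  PetscFunctionBeginUser;
  /* write side: MatView stores the sizes and entries, but not the layout */
  ierr = MatGetLocalSize(A, &mloc, &nloc);CHKERRQ(ierr);
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.bin", FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
  ierr = MatView(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  /* read side: set the local sizes BEFORE MatLoad, otherwise PETSc picks
     its default (roughly equal) row distribution, e.g. 105 and 105 */
  ierr = MatCreate(PETSC_COMM_WORLD, B);CHKERRQ(ierr);
  ierr = MatSetSizes(*B, mloc, nloc, PETSC_DETERMINE, PETSC_DETERMINE);CHKERRQ(ierr);
  ierr = MatSetType(*B, MATAIJ);CHKERRQ(ierr);
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.bin", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = MatLoad(*B, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Is that roughly what you would suggest, or is there a dedicated mechanism for
storing the layout itself?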
Thanks
On 10/19/2016 08:06 AM, Xiaoye S. Li wrote:
> I looked at each item valgrind complained about in your email dated Oct.
> 11. Those reports are really superficial; I don't see anything wrong with
> the lines singled out (mostly uninitialized variables). I did a few tests
> with the latest version on GitHub, and all went fine.
>
> Perhaps you can print out the matrix that caused the problem, and I can
> run it using your matrix.
>
> Sherry
>
>
> On Tue, Oct 11, 2016 at 2:18 PM, Anton <popov at uni-mainz.de> wrote:
>
>
>
> On 10/11/16 7:19 PM, Satish Balay wrote:
>
> This log looks truncated. Are there any valgrind messages before this?
> [like from your application code - or from MPI]
>
> Yes, it is indeed truncated. I only included the relevant messages.
>
>
> Perhaps you can send the complete log - with:
> valgrind -q --tool=memcheck --leak-check=yes --num-callers=20
> --track-origins=yes
>
> [and if there were more valgrind messages from MPI - rebuild petsc
>
> There are no messages originating from our code, just a few MPI-related
> ones (probably false positives); most of them come from SuperLU_DIST.
>
> Thanks,
> Anton
>
> with --download-mpich - for a valgrind clean mpi]
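> [for reference, a complete run under valgrind might look like this - a
> sketch with a placeholder executable and options:
> mpiexec -n 2 valgrind -q --tool=memcheck --leak-check=yes \
>   --num-callers=20 --track-origins=yes ./your_app -your_options ]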
>
> Sherry,
> Perhaps this log points to some issue in superlu_dist?
>
> thanks,
> Satish
>
> On Tue, 11 Oct 2016, Anton Popov wrote:
>
> Valgrind immediately detects interesting stuff:
>
> ==25673== Use of uninitialised value of size 8
> ==25673== at 0x178272C: static_schedule
> (static_schedule.c:960)
> ==25674== Use of uninitialised value of size 8
> ==25674== at 0x178272C: static_schedule
> (static_schedule.c:960)
> ==25674== by 0x174E74E: pdgstrf (pdgstrf.c:572)
> ==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
>
> ==25673== Conditional jump or move depends on
> uninitialised value(s)
> ==25673== at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
> ==25673== by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
>
> ==25673== Conditional jump or move depends on
> uninitialised value(s)
> ==25673== at 0x5C83F43: PMPI_Recv (in
> /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25673== by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
> ==25673== by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
> ==25673== by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
> ==25674== Use of uninitialised value of size 8
> ==25674== at 0x62BF72B: _itoa_word (_itoa.c:179)
> ==25674== by 0x62C1289: printf_positional (vfprintf.c:2022)
> ==25674== by 0x62C2465: vfprintf (vfprintf.c:1677)
> ==25674== by 0x638AFD5: __vsnprintf_chk
> (vsnprintf_chk.c:63)
> ==25674== by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
> ==25674== by 0x5CC6C08: MPIR_Err_create_code_valist (in
> /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25674== by 0x5CC7A9A: MPIR_Err_create_code (in
> /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25674== by 0x5C83FB1: PMPI_Recv (in
> /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25674== by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
> ==25674== by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
> ==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
> ==25674== Use of uninitialised value of size 8
> ==25674== at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
> ==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
> And it crashes after this:
>
> ==25674== Invalid write of size 4
> ==25674== at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
> ==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)
> ==25674== by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST
> (superlu_dist.c:421)
> ==25674== Address 0xa0 is not stack'd, malloc'd or
> (recently) free'd
> ==25674==
> [1]PETSC ERROR:
> ------------------------------------------------------------------------
> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation
> Violation, probably
> memory access out of range
>
>
> On 10/11/2016 03:26 PM, Anton Popov wrote:
>
> On 10/10/2016 07:11 PM, Satish Balay wrote:
>
> That's from petsc-3.5
>
> Anton - please post the stack trace you get with
> --download-superlu_dist-commit=origin/maint
>
> I guess this is it:
>
> [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421
> /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST
> line 282
> /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985
> /home/anton/LIB/petsc/src/mat/interface/matrix.c
> [0]PETSC ERROR: [0] PCSetUp_LU line 101
> /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
> [0]PETSC ERROR: [0] PCSetUp line 930
> /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c
>
> According to the line numbers it crashes within
> MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.
>
> Surprisingly, this only happens on the second SNES iteration, but not on
> the first.
>
> I'm trying to reproduce this behavior with PETSc KSP and SNES examples.
> However, everything I've tried up to now with SuperLU_DIST does just fine.
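> For reference, a typical run I am testing looks like this (a sketch; the
> exact tutorial example and options vary, option names as in 3.7):
>
> mpiexec -n 2 ./ex2 -pc_type lu -pc_factor_mat_solver_package superlu_dist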
>
> I'm also running our code under Valgrind to make sure it's clean.
>
> Anton
>
> Satish
>
>
> On Mon, 10 Oct 2016, Xiaoye S. Li wrote:
>
> Which version of superlu_dist does this capture? I looked at the original
> error log; it pointed to pdgssvx, line 161. But that line is in a comment
> block, not in the code.
>
> Sherry
>
>
> On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov <popov at uni-mainz.de> wrote:
>
> On 10/07/2016 05:23 PM, Satish Balay wrote:
>
> On Fri, 7 Oct 2016, Kong, Fande wrote:
>
> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay <balay at mcs.anl.gov> wrote:
>
> On Fri, 7 Oct 2016, Anton Popov wrote:
>
> Hi guys,
>
> Is there any news about fixing the buggy behavior of SuperLU_DIST,
> exactly what is described here:
> http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
>
> I'm using 3.7.4 and still get a SEGV in the pdgssvx routine. Everything
> works fine with 3.5.4.
>
> Do I still have to stick to the maint branch, and what are the chances for
> these fixes to be included in 3.7.5?
>
> 3.7.4 is off the maint branch [as of a week ago]. So if you are seeing
> issues with it, it's best to debug and figure out the cause.
>
> This bug is indeed inside superlu_dist, and we started having this issue
> with PETSc-3.6.x. I think the superlu_dist developers should have fixed
> this bug. Did we forget to update superlu_dist?? This is not something
> users could debug and fix.
>
> I have many people at INL suffering from this issue, and they have to stay
> with PETSc-3.5.4 to use superlu_dist.
>
> To verify if the bug is fixed in the latest superlu_dist, you can try
> [assuming you have git; this works from petsc-3.7, maint, or master]:
>
> --download-superlu_dist --download-superlu_dist-commit=origin/maint
>
>
> Satish
>
> Hi Satish,
>
> I did this:
>
> git clone -b maint https://bitbucket.org/petsc/petsc.git petsc
>
> --download-superlu_dist --download-superlu_dist-commit=origin/maint
> (not sure this is needed, since I'm already in maint)
>
> The problem is still there.
>
> Cheers,
> Anton
>