[petsc-users] SuperLU_dist issue in 3.7.4

Satish Balay balay at mcs.anl.gov
Tue Oct 11 16:16:07 CDT 2016


On Tue, 11 Oct 2016, Anton wrote:

> 
> 
> On 10/11/16 7:44 PM, Barry Smith wrote:
> >     You can run your code with -ksp_view_mat binary -ksp_view_rhs binary
> >     this will cause it to save the matrices and right hand sides to the
> >     linear systems in a file called binaryoutput, then email the file to
> >     petsc-maint at mcs.anl.gov (don't worry this email address accepts large
> >     attachments). And tell us how many processes you ran on that produced
> >     the problems.
> >
> >     Barry
> >
> 
> I'll do that, but I just wonder which version of SuperLU_DIST is used in
> 3.7.4?
> 
> The latest version available on http://crd-legacy.lbl.gov/~xiaoye/SuperLU/ is
> 5.1.1 which is a week old and includes bug fixes.

This is the version you essentially got - when you configured with --download-superlu_dist-commit=origin/maint

Satish
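
Before mailing the binaryoutput file that Barry mentions, it can help to confirm that it really contains a matrix and a right-hand side for each solve. The sketch below is not part of PETSc: list_petsc_objects is a hypothetical helper that walks the file on the assumption that it follows the PETSc binary layout (big-endian, a class id followed by sizes and data), written by a default build with 32-bit indices, real double-precision scalars, and sparse (AIJ) matrices.

```python
import os
import struct

# Class ids that PETSc writes at the start of each object in a binary file.
MAT_FILE_CLASSID = 1211216
VEC_FILE_CLASSID = 1211214


def _read_ints(f, count):
    """Read `count` big-endian 32-bit integers or fail loudly."""
    data = f.read(4 * count)
    if len(data) != 4 * count:
        raise ValueError("file ends in the middle of a header")
    return struct.unpack(">%di" % count, data)


def list_petsc_objects(path):
    """Return [("Mat", rows, cols, nnz), ("Vec", n), ...] in file order.

    Assumption: 32-bit indices and 8-byte real scalars (default configure).
    """
    size = os.path.getsize(path)
    found = []
    with open(path, "rb") as f:
        while f.tell() < size:
            (classid,) = _read_ints(f, 1)
            if classid == MAT_FILE_CLASSID:
                rows, cols, nnz = _read_ints(f, 3)
                # Skip row lengths, column indices, and values.
                payload = 4 * rows + 4 * nnz + 8 * nnz
                found.append(("Mat", rows, cols, nnz))
            elif classid == VEC_FILE_CLASSID:
                (n,) = _read_ints(f, 1)
                payload = 8 * n  # vector entries
                found.append(("Vec", n))
            else:
                raise ValueError("unexpected class id %d" % classid)
            f.seek(payload, os.SEEK_CUR)
            if f.tell() > size:
                raise ValueError("file is truncated")
    return found
```

Calling list_petsc_objects("binaryoutput") should then list one Mat entry and one Vec entry per saved linear system; a ValueError suggests the file is incomplete or was written with different index or scalar sizes.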

> 
> Maybe we're facing a problem that is already solved.
> 
> Thanks,
> Anton
> >
> > > On Oct 11, 2016, at 12:19 PM, Satish Balay <balay at mcs.anl.gov> wrote:
> > >
> > > This log looks truncated. Are there any valgrind messages before this?
> > > [like from your application code - or from MPI]
> > >
> > > Perhaps you can send the complete log - with:
> > > valgrind -q --tool=memcheck --leak-check=yes --num-callers=20 --track-origins=yes
> > >
> > > [and if there were more valgrind messages from MPI - rebuild petsc
> > > with --download-mpich - for a valgrind clean mpi]
> > >
> > > Sherry,
> > > Perhaps this log points to some issue in superlu_dist?
> > >
> > > thanks,
> > > Satish
> > >
> > > On Tue, 11 Oct 2016, Anton Popov wrote:
> > >
> > > > Valgrind immediately detects interesting stuff:
> > > >
> > > > ==25673== Use of uninitialised value of size 8
> > > > ==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
> > > > ==25674== Use of uninitialised value of size 8
> > > > ==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
> > > > ==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
> > > > ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > >
> > > >
> > > > ==25673== Conditional jump or move depends on uninitialised value(s)
> > > > ==25673==    at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
> > > > ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > >
> > > >
> > > > ==25673== Conditional jump or move depends on uninitialised value(s)
> > > > ==25673==    at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
> > > > ==25673==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
> > > > ==25673==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
> > > > ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > >
> > > > ==25674== Use of uninitialised value of size 8
> > > > ==25674==    at 0x62BF72B: _itoa_word (_itoa.c:179)
> > > > ==25674==    by 0x62C1289: printf_positional (vfprintf.c:2022)
> > > > ==25674==    by 0x62C2465: vfprintf (vfprintf.c:1677)
> > > > ==25674==    by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
> > > > ==25674==    by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
> > > > ==25674==    by 0x5CC6C08: MPIR_Err_create_code_valist (in /opt/mpich3/lib/libmpi.so.12.1.0)
> > > > ==25674==    by 0x5CC7A9A: MPIR_Err_create_code (in /opt/mpich3/lib/libmpi.so.12.1.0)
> > > > ==25674==    by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
> > > > ==25674==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
> > > > ==25674==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
> > > > ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > >
> > > > ==25674== Use of uninitialised value of size 8
> > > > ==25674==    at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
> > > > ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > >
> > > > And it crashes after this:
> > > >
> > > > ==25674== Invalid write of size 4
> > > > ==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
> > > > ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > > ==25674==    by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
> > > > ==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
> > > > ==25674==
> > > > [1]PETSC ERROR: ------------------------------------------------------------------------
> > > > [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> > > >
> > > >
> > > > On 10/11/2016 03:26 PM, Anton Popov wrote:
> > > > > On 10/10/2016 07:11 PM, Satish Balay wrote:
> > > > > > That's from petsc-3.5
> > > > > >
> > > > > > Anton - please post the stack trace you get with
> > > > > > --download-superlu_dist-commit=origin/maint
> > > > > I guess this is it:
> > > > >
> > > > > [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> > > > > [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> > > > > [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985 /home/anton/LIB/petsc/src/mat/interface/matrix.c
> > > > > [0]PETSC ERROR: [0] PCSetUp_LU line 101 /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
> > > > > [0]PETSC ERROR: [0] PCSetUp line 930 /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c
> > > > >
> > > > > According to the line numbers it crashes within
> > > > > MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.
> > > > >
> > > > > Surprisingly this only happens on the second SNES iteration, but not on
> > > > > the first.
> > > > >
> > > > > I'm trying to reproduce this behavior with PETSc KSP and SNES examples.
> > > > > However, everything I've tried up to now with SuperLU_DIST does just fine.
> > > > >
> > > > > I'm also checking our code in Valgrind to make sure it's clean.
> > > > >
> > > > > Anton
> > > > > > Satish
> > > > > >
> > > > > >
> > > > > > On Mon, 10 Oct 2016, Xiaoye S. Li wrote:
> > > > > >
> > > > > > > Which version of superlu_dist does this capture? I looked at the
> > > > > > > original error log; it pointed to pdgssvx: line 161. But that line
> > > > > > > is in a comment block, not the program.
> > > > > > >
> > > > > > > Sherry
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov <popov at uni-mainz.de>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On 10/07/2016 05:23 PM, Satish Balay wrote:
> > > > > > > >
> > > > > > > > > On Fri, 7 Oct 2016, Kong, Fande wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay
> > > > > > > > > <balay at mcs.anl.gov>
> > > > > > > > > wrote:
> > > > > > > > > > On Fri, 7 Oct 2016, Anton Popov wrote:
> > > > > > > > > > > > > Hi guys,
> > > > > > > > > > > > >
> > > > > > > > > > > > > is there any news about fixing the buggy behavior of
> > > > > > > > > > > > > SuperLU_DIST, exactly what is described here:
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.mcs.anl.gov_pipermail_petsc-2Dusers_2015-2DAugust_026802.html&d=CwIBAg&c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&m=RwruX6ckX0t9H89Z6LXKBfJBOAM2vG1sQHw2tIsSQtA&s=bbB62oGLm582JebVs8xsUej_OX0eUwibAKsRRWKafos&e=
> > > > > > > > > > > > > ?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm using 3.7.4 and still get SEGV in pdgssvx routine.
> > > > > > > > > > > > > Everything works fine with 3.5.4.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Do I still have to stick to maint branch, and what are
> > > > > > > > > > > > > the chances for these fixes to be included in 3.7.5?
> > > > > > > > > > > > >
> > > > > > > > > > > > 3.7.4 is off maint branch [as of a week ago]. So if you are
> > > > > > > > > > > > seeing issues with it - it's best to debug and figure out
> > > > > > > > > > > > the cause.
> > > > > > > > > > > >
> > > > > > > > > > > This bug is indeed inside of superlu_dist, and we started
> > > > > > > > > > > having this issue from PETSc-3.6.x. I think superlu_dist
> > > > > > > > > > > developers should have fixed this bug. We forgot to update
> > > > > > > > > > > superlu_dist??  This is not a thing users could debug and fix.
> > > > > > > > > > >
> > > > > > > > > > > I have many people in INL suffering from this issue, and
> > > > > > > > > > > they have to stay with PETSc-3.5.4 to use superlu_dist.
> > > > > > > > > >
> > > > > > > > > > To verify if the bug is fixed in latest superlu_dist - you can try
> > > > > > > > > > [assuming you have git - either from petsc-3.7/maint/master]:
> > > > > > > > >
> > > > > > > > > --download-superlu_dist
> > > > > > > > > --download-superlu_dist-commit=origin/maint
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Satish
> > > > > > > > >
> > > > > > > > > Hi Satish,
> > > > > > > > I did this:
> > > > > > > >
> > > > > > > > git clone -b maint https://bitbucket.org/petsc/petsc.git petsc
> > > > > > > >
> > > > > > > > --download-superlu_dist
> > > > > > > > > --download-superlu_dist-commit=origin/maint (not sure this is needed,
> > > > > > > > > since I'm already in maint)
> > > > > > > >
> > > > > > > > The problem is still there.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Anton
> > > > > > > >
> > > >
> > > >
> 
> 
> 
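
With several MPI ranks writing to one terminal, the Valgrind reports quoted above interleave (PIDs 25673 and 25674). A small script can regroup them per process, which makes it easier to judge whether a log is complete, as Satish asks. This is only a sketch: summarize_valgrind is a hypothetical helper, and it assumes the log was produced with -q, so that every unindented ==PID== line starts an error report.

```python
import re

# "==PID==" followed by the rest of the line (quote markers before it are ignored).
LINE_RE = re.compile(r"==(\d+)==(.*)")
# The innermost stack frame: "   at 0xADDR: function (file:line)".
AT_RE = re.compile(r"\s+at 0x[0-9A-Fa-f]+: (\S+)")


def summarize_valgrind(log_text):
    """Map each PID to a list of (error message, innermost function) tuples."""
    errors = {}
    for raw in log_text.splitlines():
        m = LINE_RE.search(raw)
        if m is None:
            continue  # wrapped continuation or unrelated output
        pid, body = int(m.group(1)), m.group(2)
        if not body.strip():
            continue  # blank separator line
        if body.startswith("  "):
            # Indented detail line: record only the first "at" frame.
            frame = AT_RE.match(body)
            reports = errors.get(pid)
            if frame and reports and reports[-1][1] is None:
                reports[-1][1] = frame.group(1)
            continue
        # Unindented line: start of a new error report for this process.
        errors.setdefault(pid, []).append([body.strip(), None])
    return {pid: [tuple(r) for r in reports] for pid, reports in errors.items()}
```

Feeding it the full log, for example summarize_valgrind(open("valgrind.log").read()) with a file name of your choosing, gives one list of reports per rank, so a rank whose list stops early stands out.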


