[petsc-users] SuperLU_dist issue in 3.7.4

Xiaoye S. Li xsli at lbl.gov
Wed Oct 19 01:06:04 CDT 2016


I looked at each item valgrind flagged in your email dated Oct. 11.
Those reports are really superficial; I don't see anything wrong with
the lines it singled out (mostly uninitialized variables).  I ran a few
tests with the latest version on GitHub, and all went fine.

Perhaps you can write out the matrix that caused the problem, so that I
can run with your matrix.

Sherry
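
For sorting through reports like the ones quoted below, a small script can
group memcheck messages by kind and top stack frame, which makes it easier
to separate uninitialised-value warnings from invalid reads or writes. This
is only a minimal Python sketch and not part of PETSc, SuperLU_DIST, or
valgrind: the function name summarize and both regular expressions are
assumptions made for illustration, and they cover only the message kinds
that appear in this thread.

```python
import re
from collections import Counter

# Report headers seen in this thread (hypothetical, non-exhaustive pattern).
HEADER = re.compile(
    r"==(\d+)== (Use of uninitialised value"
    r"|Conditional jump or move depends on uninitialised value"
    r"|Invalid (?:read|write))"
)
# First stack frame of a report: "==PID==    at 0xADDR: func (file.c:123)".
# The "(location)" part is optional because mail wrapping can cut it off.
FRAME = re.compile(r"==(\d+)==\s+at 0x[0-9A-Fa-f]+: (\S+)(?: \(([^)]*)\))?")


def summarize(lines):
    """Count reports per (kind, function, location) of the top frame."""
    pending = {}        # pid -> kind of the report whose top frame is awaited
    counts = Counter()
    for line in lines:
        header = HEADER.search(line)
        if header:
            pid, message = header.groups()
            kind = "invalid access" if message.startswith("Invalid") else "uninitialised"
            pending[pid] = kind
            continue
        frame = FRAME.search(line)
        if frame and frame.group(1) in pending:
            kind = pending.pop(frame.group(1))
            counts[(kind, frame.group(2), frame.group(3) or "?")] += 1
    return counts


# A few lines copied from the log quoted in this thread.
SAMPLE = """\
==25673== Use of uninitialised value of size 8
==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674== Use of uninitialised value of size 8
==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
==25674== Invalid write of size 4
==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
"""

if __name__ == "__main__":
    for (kind, func, where), n in summarize(SAMPLE.splitlines()).most_common():
        print(f"{n:3d}  {kind:15s} {func} ({where})")
```

The patterns use search rather than match, so lines that still carry mail
quote prefixes (">>> ") are counted as well. Feeding it the complete log
rather than an excerpt avoids missing earlier reports.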


On Tue, Oct 11, 2016 at 2:18 PM, Anton <popov at uni-mainz.de> wrote:

>
>
> On 10/11/16 7:19 PM, Satish Balay wrote:
>
>> This log looks truncated. Are there any valgrind messages before this?
>> [like from your application code - or from MPI]
>>
> Yes it is indeed truncated. I only included relevant messages.
>
>>
>> Perhaps you can send the complete log - with:
>> valgrind -q --tool=memcheck --leak-check=yes --num-callers=20
>> --track-origins=yes
>>
>> [and if there were more valgrind messages from MPI - rebuild petsc
>> with --download-mpich - for a valgrind clean mpi]
>>
> There are no messages originating from our code, just a few MPI-related
> ones (probably false positives) and from SuperLU_DIST (most of them).
>
> Thanks,
> Anton
>
>>
>> Sherry,
>> Perhaps this log points to some issue in superlu_dist?
>>
>> thanks,
>> Satish
>>
>> On Tue, 11 Oct 2016, Anton Popov wrote:
>>
>>> Valgrind immediately detects interesting stuff:
>>>
>>> ==25673== Use of uninitialised value of size 8
>>> ==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
>>> ==25674== Use of uninitialised value of size 8
>>> ==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
>>> ==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
>>> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>>
>>> ==25673== Conditional jump or move depends on uninitialised value(s)
>>> ==25673==    at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
>>> ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>>
>>> ==25673== Conditional jump or move depends on uninitialised value(s)
>>> ==25673==    at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
>>> ==25673==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>>> ==25673==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>>> ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>> ==25674== Use of uninitialised value of size 8
>>> ==25674==    at 0x62BF72B: _itoa_word (_itoa.c:179)
>>> ==25674==    by 0x62C1289: printf_positional (vfprintf.c:2022)
>>> ==25674==    by 0x62C2465: vfprintf (vfprintf.c:1677)
>>> ==25674==    by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
>>> ==25674==    by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
>>> ==25674==    by 0x5CC6C08: MPIR_Err_create_code_valist (in /opt/mpich3/lib/libmpi.so.12.1.0)
>>> ==25674==    by 0x5CC7A9A: MPIR_Err_create_code (in /opt/mpich3/lib/libmpi.so.12.1.0)
>>> ==25674==    by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
>>> ==25674==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>>> ==25674==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>>> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>> ==25674== Use of uninitialised value of size 8
>>> ==25674==    at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
>>> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>> And it crashes after this:
>>>
>>> ==25674== Invalid write of size 4
>>> ==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
>>> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>> ==25674==    by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
>>> ==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
>>> ==25674==
>>> [1]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>>
>>>
>>> On 10/11/2016 03:26 PM, Anton Popov wrote:
>>>
>>>> On 10/10/2016 07:11 PM, Satish Balay wrote:
>>>>
>>>>> That's from petsc-3.5
>>>>>
>>>>> Anton - please post the stack trace you get with
>>>>> --download-superlu_dist-commit=origin/maint
>>>>>
>>>> I guess this is it:
>>>>
>>>> [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421
>>>> /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>>> [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282
>>>> /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>>> [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985
>>>> /home/anton/LIB/petsc/src/mat/interface/matrix.c
>>>> [0]PETSC ERROR: [0] PCSetUp_LU line 101
>>>> /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
>>>> [0]PETSC ERROR: [0] PCSetUp line 930
>>>> /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c
>>>>
>>>> According to the line numbers it crashes within
>>>> MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.
>>>>
>>>> Surprisingly this only happens on the second SNES iteration, but not on
>>>> the first.
>>>>
>>>> I'm trying to reproduce this behavior with PETSc KSP and SNES examples.
>>>> However, everything I've tried up to now with SuperLU_DIST does just
>>>> fine.
>>>>
>>>> I'm also checking our code in Valgrind to make sure it's clean.
>>>>
>>>> Anton
>>>>
>>>>> Satish
>>>>>
>>>>>
>>>>> On Mon, 10 Oct 2016, Xiaoye S. Li wrote:
>>>>>
>>>>>> Which version of superlu_dist does this capture?  I looked at the
>>>>>> original error log; it pointed to pdgssvx: line 161.  But that line is
>>>>>> in a comment block, not the program.
>>>>>>
>>>>>> Sherry
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov <popov at uni-mainz.de>
>>>>>> wrote:
>>>>>>
>>>>>>> On 10/07/2016 05:23 PM, Satish Balay wrote:
>>>>>>>
>>>>>>>> On Fri, 7 Oct 2016, Kong, Fande wrote:
>>>>>>>>
>>>>>>>>> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay <balay at mcs.anl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> On Fri, 7 Oct 2016, Anton Popov wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> is there any news about fixing the buggy behavior of SuperLU_DIST,
>>>>>>>>>>> exactly what is described here:
>>>>>>>>>>>
>>>>>>>>>>> http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
>>>>>>>>>>>
>>>>>>>>>>> I'm using 3.7.4 and still get SEGV in the pdgssvx routine. Everything
>>>>>>>>>>> works fine with 3.5.4.
>>>>>>>>>>>
>>>>>>>>>>> Do I still have to stick to maint branch, and what are the chances
>>>>>>>>>>> for these fixes to be included in 3.7.5?
>>>>>>>>>>>
>>>>>>>>>> 3.7.4. is off maint branch [as of a week ago]. So if you are seeing
>>>>>>>>>> issues with it - it's best to debug and figure out the cause.
>>>>>>>>>>
>>>>>>>>> This bug is indeed inside of superlu_dist, and we started having this
>>>>>>>>> issue from PETSc-3.6.x. I think superlu_dist developers should have
>>>>>>>>> fixed this bug. We forgot to update superlu_dist??  This is not a thing
>>>>>>>>> users could debug and fix.
>>>>>>>>>
>>>>>>>>> I have many people in INL suffering from this issue, and they have to
>>>>>>>>> stay with PETSc-3.5.4 to use superlu_dist.
>>>>>>>>>
>>>>>>>> To verify if the bug is fixed in latest superlu_dist - you can try
>>>>>>>> [assuming you have git - either from petsc-3.7/maint/master]:
>>>>>>>>
>>>>>>>> --download-superlu_dist --download-superlu_dist-commit=origin/maint
>>>>>>>>
>>>>>>>>
>>>>>>>> Satish
>>>>>>>>
>>>>>>> Hi Satish,
>>>>>>>>
>>>>>>> I did this:
>>>>>>>
>>>>>>> git clone -b maint https://bitbucket.org/petsc/petsc.git petsc
>>>>>>>
>>>>>>> --download-superlu_dist
>>>>>>> --download-superlu_dist-commit=origin/maint (not sure this is needed,
>>>>>>> since I'm already in maint)
>>>>>>>
>>>>>>> The problem is still there.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Anton
>>>>>>>
>>>>>>>
>>>
>>>
>