[petsc-users] SuperLU_dist issue in 3.7.4

Anton Popov popov at uni-mainz.de
Tue Oct 11 08:48:28 CDT 2016


Valgrind immediately detects interesting stuff:

==25673== Use of uninitialised value of size 8
==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674== Use of uninitialised value of size 8
==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)


==25673== Conditional jump or move depends on uninitialised value(s)
==25673==    at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)


==25673== Conditional jump or move depends on uninitialised value(s)
==25673==    at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25673==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25673==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25674== Use of uninitialised value of size 8
==25674==    at 0x62BF72B: _itoa_word (_itoa.c:179)
==25674==    by 0x62C1289: printf_positional (vfprintf.c:2022)
==25674==    by 0x62C2465: vfprintf (vfprintf.c:1677)
==25674==    by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
==25674==    by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
==25674==    by 0x5CC6C08: MPIR_Err_create_code_valist (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x5CC7A9A: MPIR_Err_create_code (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25674==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25674== Use of uninitialised value of size 8
==25674==    at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

And it crashes after this:

==25674== Invalid write of size 4
==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
==25674==    by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
==25674==
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
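
Since the two ranks interleave their reports, it can help to tally them by process and by the function at the top of each report. A minimal sketch in Python, assuming the log uses the standard `==PID==` Memcheck prefix seen above; the `summarize` helper and the `SAMPLE` excerpt are illustrative names and are not part of PETSc, SuperLU_DIST or Valgrind:

```python
import re
from collections import Counter

# Report header, e.g. "==25674== Invalid write of size 4".
# Exactly one space after the prefix, so indented "at"/"by" frames are skipped.
HEADER = re.compile(r"^==(\d+)== (\S.*)$")
# First frame of a report, e.g.
# "==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)".
TOP_FRAME = re.compile(r"^==(\d+)==\s+at 0x[0-9A-Fa-f]+: (\S+) \(([^)]*)\)")


def summarize(log_text):
    """Count (pid, error kind, top function, location) tuples in a Valgrind log."""
    counts = Counter()
    pending = {}  # pid -> most recent header still waiting for its "at" frame
    for line in log_text.splitlines():
        frame = TOP_FRAME.match(line)
        if frame:
            pid, func, where = frame.groups()
            kind = pending.pop(pid, None)
            if kind is not None:  # ignore secondary "at" frames without a header
                counts[(pid, kind, func, where)] += 1
            continue
        header = HEADER.match(line)
        if header:
            pid, kind = header.groups()
            pending[pid] = kind
    return counts


# Short excerpt copied from the log in this message (hypothetical variable name).
SAMPLE = """\
==25673== Use of uninitialised value of size 8
==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674== Use of uninitialised value of size 8
==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
==25674== Invalid write of size 4
==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
"""

for (pid, kind, func, where), n in sorted(summarize(SAMPLE).items()):
    print(f"{n}x pid {pid}: {kind} in {func} ({where})")
```

Keying the pending header on the PID keeps reports from different ranks apart even when their lines are interleaved, as they are at the top of this log.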


On 10/11/2016 03:26 PM, Anton Popov wrote:
>
> On 10/10/2016 07:11 PM, Satish Balay wrote:
>> That's from petsc-3.5
>>
>> Anton - please post the stack trace you get with 
>> --download-superlu_dist-commit=origin/maint
>
> I guess this is it:
>
> [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421 
> /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282 
> /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985 
> /home/anton/LIB/petsc/src/mat/interface/matrix.c
> [0]PETSC ERROR: [0] PCSetUp_LU line 101 
> /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
> [0]PETSC ERROR: [0] PCSetUp line 930 
> /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c
>
> According to the line numbers it crashes within 
> MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.
>
> Surprisingly this only happens on the second SNES iteration, but not 
> on the first.
>
> I'm trying to reproduce this behavior with PETSc KSP and SNES 
> examples. However, everything I've tried up to now with SuperLU_DIST 
> does just fine.
>
> I'm also checking our code in Valgrind to make sure it's clean.
>
> Anton
>>
>> Satish
>>
>>
>> On Mon, 10 Oct 2016, Xiaoye S. Li wrote:
>>
>>> Which version of superlu_dist does this capture? I looked at the
>>> original error log; it pointed to pdgssvx: line 161. But that line is
>>> in a comment block, not the program.
>>>
>>> Sherry
>>>
>>>
>>> On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov <popov at uni-mainz.de> 
>>> wrote:
>>>
>>>>
>>>> On 10/07/2016 05:23 PM, Satish Balay wrote:
>>>>
>>>>> On Fri, 7 Oct 2016, Kong, Fande wrote:
>>>>>
>>>>>> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay <balay at mcs.anl.gov>
>>>>>> wrote:
>>>>>>> On Fri, 7 Oct 2016, Anton Popov wrote:
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> is there any news about fixing the buggy behavior of SuperLU_DIST,
>>>>>>>> exactly what is described here:
>>>>>>>>
>>>>>>>> http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
>>>>>>>>
>>>>>>>> I'm using 3.7.4 and still get a SEGV in the pdgssvx routine.
>>>>>>>> Everything works fine with 3.5.4.
>>>>>>>>
>>>>>>>> Do I still have to stick to the maint branch, and what are the
>>>>>>>> chances for these fixes to be included in 3.7.5?
>>>>>>>>
>>>>>>> 3.7.4 is off the maint branch [as of a week ago]. So if you are seeing
>>>>>>> issues with it - it's best to debug and figure out the cause.
>>>>>>>
>>>>>> This bug is indeed inside of superlu_dist, and we started having this
>>>>>> issue from PETSc-3.6.x. I think superlu_dist developers should have
>>>>>> fixed this bug. We forgot to update superlu_dist?? This is not a thing
>>>>>> users could debug and fix.
>>>>>>
>>>>>> I have many people in INL suffering from this issue, and they have to
>>>>>> stay with PETSc-3.5.4 to use superlu_dist.
>>>>>>
>>>>> To verify if the bug is fixed in latest superlu_dist - you can try
>>>>> [assuming you have git - either from petsc-3.7/maint/master]:
>>>>>
>>>>> --download-superlu_dist --download-superlu_dist-commit=origin/maint
>>>>>
>>>>>
>>>>> Satish
>>>>>
>>>> Hi Satish,
>>>>
>>>> I did this:
>>>>
>>>> git clone -b maint https://bitbucket.org/petsc/petsc.git petsc
>>>>
>>>> --download-superlu_dist
>>>> --download-superlu_dist-commit=origin/maint (not sure this is needed,
>>>> since I'm already in maint)
>>>>
>>>> The problem is still there.
>>>>
>>>> Cheers,
>>>> Anton
>>>>
>


