[petsc-users] SuperLU_dist issue in 3.7.4
Barry Smith
bsmith at mcs.anl.gov
Tue Oct 11 12:44:15 CDT 2016
You can run your code with -ksp_view_mat binary -ksp_view_rhs binary. This will cause it to save the matrix and right-hand side of each linear system to a file called binaryoutput. Then email the file to petsc-maint at mcs.anl.gov (don't worry, this email address accepts large attachments), and tell us how many processes you ran on when the problem occurred.
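
Before mailing the file, it may help to confirm that it really contains the matrices and right-hand sides from the failing solve. The following is a minimal sketch in Python, assuming the default PETSc binary layout (big-endian, 32-bit integer indices, real double-precision scalars, sparse AIJ storage). The function name summarize_petsc_binary is hypothetical and not part of PETSc; a build with 64-bit indices or complex scalars would need different sizes.

```python
import os
import struct

MAT_FILE_CLASSID = 1211216  # class id that starts a Mat record in a PETSc binary file
VEC_FILE_CLASSID = 1211214  # class id that starts a Vec record in a PETSc binary file


def read_ints(f, count):
    """Read `count` big-endian 32-bit integers, failing on a short read."""
    data = f.read(4 * count)
    if len(data) != 4 * count:
        raise ValueError("file ends in the middle of a record header")
    return struct.unpack(">%di" % count, data)


def summarize_petsc_binary(path):
    """List the objects stored in a PETSc binary file without loading the data.

    Returns ("Mat", rows, cols, nnz) or ("Vec", length) per stored object.
    Assumption: 32-bit indices and real double scalars (the default build).
    """
    size = os.path.getsize(path)
    records = []
    with open(path, "rb") as f:
        while f.tell() < size:
            (classid,) = read_ints(f, 1)
            if classid == MAT_FILE_CLASSID:
                rows, cols, nnz = read_ints(f, 3)
                # Row lengths (rows ints), column indices (nnz ints), values (nnz doubles).
                payload = 4 * rows + 4 * nnz + 8 * nnz
                records.append(("Mat", rows, cols, nnz))
            elif classid == VEC_FILE_CLASSID:
                (n,) = read_ints(f, 1)
                payload = 8 * n  # n double-precision values
                records.append(("Vec", n))
            else:
                raise ValueError(
                    "unexpected class id %d at offset %d" % (classid, f.tell() - 4)
                )
            if f.tell() + payload > size:
                raise ValueError("file ends in the middle of a record")
            f.seek(payload, os.SEEK_CUR)  # skip the data, keep only the summary
    return records
```

Calling summarize_petsc_binary("binaryoutput") after a run should list one Mat and one Vec entry per linear solve, which also shows how many systems were written before the crash.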
Barry
> On Oct 11, 2016, at 12:19 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>
> This log looks truncated. Are there any valgrind messages before this?
> [like from your application code - or from MPI]
>
> Perhaps you can send the complete log - with:
> valgrind -q --tool=memcheck --leak-check=yes --num-callers=20 --track-origins=yes
>
> [and if there were more valgrind messages from MPI - rebuild petsc
> with --download-mpich - for a valgrind clean mpi]
>
> Sherry,
> Perhaps this log points to some issue in superlu_dist?
>
> thanks,
> Satish
>
> On Tue, 11 Oct 2016, Anton Popov wrote:
>
>> Valgrind immediately detects interesting stuff:
>>
>> ==25673== Use of uninitialised value of size 8
>> ==25673== at 0x178272C: static_schedule (static_schedule.c:960)
>> ==25674== Use of uninitialised value of size 8
>> ==25674== at 0x178272C: static_schedule (static_schedule.c:960)
>> ==25674== by 0x174E74E: pdgstrf (pdgstrf.c:572)
>> ==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>
>>
>> ==25673== Conditional jump or move depends on uninitialised value(s)
>> ==25673== at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
>> ==25673== by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>
>>
>> ==25673== Conditional jump or move depends on uninitialised value(s)
>> ==25673== at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
>> ==25673== by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>> ==25673== by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>> ==25673== by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>
>> ==25674== Use of uninitialised value of size 8
>> ==25674== at 0x62BF72B: _itoa_word (_itoa.c:179)
>> ==25674== by 0x62C1289: printf_positional (vfprintf.c:2022)
>> ==25674== by 0x62C2465: vfprintf (vfprintf.c:1677)
>> ==25674== by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
>> ==25674== by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
>> ==25674== by 0x5CC6C08: MPIR_Err_create_code_valist (in /opt/mpich3/lib/libmpi.so.12.1.0)
>> ==25674== by 0x5CC7A9A: MPIR_Err_create_code (in /opt/mpich3/lib/libmpi.so.12.1.0)
>> ==25674== by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
>> ==25674== by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>> ==25674== by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>> ==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>
>> ==25674== Use of uninitialised value of size 8
>> ==25674== at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
>> ==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>
>> And it crashes after this:
>>
>> ==25674== Invalid write of size 4
>> ==25674== at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
>> ==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)
>> ==25674== by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
>> ==25674== Address 0xa0 is not stack'd, malloc'd or (recently) free'd
>> ==25674==
>> [1]PETSC ERROR: ------------------------------------------------------------------------
>> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>
>>
>> On 10/11/2016 03:26 PM, Anton Popov wrote:
>>>
>>> On 10/10/2016 07:11 PM, Satish Balay wrote:
>>>> That's from petsc-3.5
>>>>
>>>> Anton - please post the stack trace you get with
>>>> --download-superlu_dist-commit=origin/maint
>>>
>>> I guess this is it:
>>>
>>> [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>> [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>> [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985 /home/anton/LIB/petsc/src/mat/interface/matrix.c
>>> [0]PETSC ERROR: [0] PCSetUp_LU line 101 /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
>>> [0]PETSC ERROR: [0] PCSetUp line 930 /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c
>>>
>>> According to the line numbers it crashes within
>>> MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.
>>>
>>> Surprisingly this only happens on the second SNES iteration, but not on the
>>> first.
>>>
>>> I'm trying to reproduce this behavior with PETSc KSP and SNES examples.
>>> However, everything I've tried up to now with SuperLU_DIST does just fine.
>>>
>>> I'm also checking our code in Valgrind to make sure it's clean.
>>>
>>> Anton
>>>>
>>>> Satish
>>>>
>>>>
>>>> On Mon, 10 Oct 2016, Xiaoye S. Li wrote:
>>>>
>>>>> Which version of superlu_dist does this capture? I looked at the original
>>>>> error log; it pointed to pdgssvx: line 161. But that line is in a comment
>>>>> block, not the program.
>>>>>
>>>>> Sherry
>>>>>
>>>>>
>>>>> On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov <popov at uni-mainz.de> wrote:
>>>>>
>>>>>>
>>>>>> On 10/07/2016 05:23 PM, Satish Balay wrote:
>>>>>>
>>>>>>> On Fri, 7 Oct 2016, Kong, Fande wrote:
>>>>>>>
>>>>>>> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay <balay at mcs.anl.gov>
>>>>>>> wrote:
>>>>>>>> On Fri, 7 Oct 2016, Anton Popov wrote:
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> is there any news about fixing the buggy behavior of SuperLU_DIST,
>>>>>>>>>> exactly what is described here:
>>>>>>>>>>
>>>>>>>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.mcs.anl.gov_pipermail_petsc-2Dusers_2015-2DAugust_026802.html&d=CwIBAg&c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&m=RwruX6ckX0t9H89Z6LXKBfJBOAM2vG1sQHw2tIsSQtA&s=bbB62oGLm582JebVs8xsUej_OX0eUwibAKsRRWKafos&e= ?
>>>>>>>>>>
>>>>>>>>>> I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works
>>>>>>>>>> fine with 3.5.4.
>>>>>>>>>>
>>>>>>>>>> Do I still have to stick to maint branch, and what are the chances for
>>>>>>>>>> these fixes to be included in 3.7.5?
>>>>>>>>>>
>>>>>>>>> 3.7.4. is off maint branch [as of a week ago]. So if you are seeing
>>>>>>>>> issues with it - it's best to debug and figure out the cause.
>>>>>>>>>
>>>>>>>> This bug is indeed inside of superlu_dist, and we started having this issue
>>>>>>>> from PETSc-3.6.x. I think superlu_dist developers should have fixed this
>>>>>>>> bug. We forgot to update superlu_dist?? This is not a thing users could
>>>>>>>> debug and fix.
>>>>>>>>
>>>>>>>> I have many people in INL suffering from this issue, and they have to stay
>>>>>>>> with PETSc-3.5.4 to use superlu_dist.
>>>>>>>>
>>>>>>> To verify if the bug is fixed in latest superlu_dist - you can try
>>>>>>> [assuming you have git - either from petsc-3.7/maint/master]:
>>>>>>>
>>>>>>> --download-superlu_dist --download-superlu_dist-commit=origin/maint
>>>>>>>
>>>>>>>
>>>>>>> Satish
>>>>>>>
>>>>>> Hi Satish,
>>>>>> I did this:
>>>>>>
>>>>>> git clone -b maint https://bitbucket.org/petsc/petsc.git petsc
>>>>>>
>>>>>> --download-superlu_dist
>>>>>> --download-superlu_dist-commit=origin/maint (not sure this is needed,
>>>>>> since I'm already in maint)
>>>>>>
>>>>>> The problem is still there.
>>>>>>
>>>>>> Cheers,
>>>>>> Anton
>>>>>>
>>>
>>
>>
>>
>