[petsc-users] superlu_dist segfault

Xiaoye S. Li xsli at lbl.gov
Sun Nov 1 13:37:45 CST 2020


From the memory output:

** Memory Usage **********************************
** NUMfact space (MB): (sum-of-all-processes)
    L\U :          952.18 |  Total :  1980.60
** Total highmark (MB):
    Sum-of-all : 12401.85 | Avg :   387.56  | Max :   387.56

It looks like you are using 32 MPI processes.  The maximum peak memory per
process is only 387.56 MB, so the sparse factors L and U do not take much
memory compared to your large dense matrices.
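
As a rough cross-check of the figures in this thread, the arithmetic can be
sketched as below. This is only a back-of-the-envelope estimate, not output
from SuperLU_DIST or PETSc. It assumes that a doublecomplex entry takes 16
bytes and that 1 MB = 1e6 bytes and 1 GB = 1e9 bytes; the helper name
dense_bytes is made up for illustration.

```python
# Back-of-the-envelope check of the numbers quoted in this thread.
# Assumptions (not taken from SuperLU_DIST itself): a doublecomplex entry
# takes 16 bytes, and 1 MB = 1e6 bytes, 1 GB = 1e9 bytes.

BYTES_PER_DOUBLECOMPLEX = 16


def dense_bytes(rows, cols):
    """Storage for a dense rows-by-cols complex double matrix, in bytes."""
    return rows * cols * BYTES_PER_DOUBLECOMPLEX


n = 42549                    # order of A, from Hong's message
sum_highmark_mb = 12401.85   # "Sum-of-all" from the memory report
avg_highmark_mb = 387.56     # "Avg" from the memory report

# Sum divided by average recovers the number of MPI processes.
nprocs = round(sum_highmark_mb / avg_highmark_mb)

full_gb = dense_bytes(n, n) / 1e9        # one n-by-n dense matrix (B or X)
reduced_gb = dense_bytes(n, 4000) / 1e9  # n-by-4000, the size Hong tested

print(f"MPI processes: {nprocs}")
print(f"one {n} x {n} dense matrix: {full_gb:.1f} GB")
print(f"B and X together: {2 * full_gb:.1f} GB")
print(f"one {n} x 4000 dense matrix: {reduced_gb:.1f} GB")
```

Under these assumptions each dense matrix is roughly 29 GB in total, which
is consistent with the "around 30 GB" estimate quoted further down, while
the L\U factors take under 1 GB summed over all processes.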

Can you send me the input matrix A ?  I can do some stand-alone debugging.

Sherry

On Sun, Nov 1, 2020 at 2:09 AM Stefano Zampini <stefano.zampini at gmail.com>
wrote:

> More importantly,
>
> ==43569== Conditional jump or move depends on uninitialised value(s)
> ==43569==    at 0x1473C515: pzgstrs (pzgstrs.c:1074)
> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569==    by 0x40465D: main (superlu_test.c:59)
>
> You should run using valgrind's option --track-origins=yes to
> understand the reason for this.
>
> On Sun, Nov 1, 2020 at 11:53 AM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>
>>   You can sometimes use -on_error_attach_debugger noxterm, and it will try
>> to attach the debugger in the console where you started the job. If you are
>> lucky this works, and you can use bt to see the stack and look at variables.
>> But if multiple ranks crash, the debugger will get confused, and even if only
>> one rank crashes, if it is not rank zero the stty can get messed up so that
>> you cannot type to control the debugger.
>>
>>    The valgrind information is very valuable. Sherry can likely look at
>> those lines and get a really good idea of what the problem is. For example,
>>
>> Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd
>>
>>
>>   means that for some reason the code is writing past the end of an
>> allocated array, either because the array allocated was not long enough or
>> because the code has some issue where it wants to write further than it
>> should. This kind of thing is very common and usually easy to debug by
>> someone who knows the code once they know exactly which line of code is
>> problematic, since valgrind shows exactly where the memory was allocated and
>> exactly where the write went out of bounds.
>>
>>   Barry
>>
>>
>> On Nov 1, 2020, at 1:21 AM, Marius Buerkle <mbuerkle at web.de> wrote:
>>
>> Hi,
>>
>> I cannot use -on_error_attach_debugger because X forwarding does not work on
>> the system. Is it possible to dump the gdb output to a file instead?
>>
>> I ran it through valgrind. It seems there is some problem during calls in
>> superlu_dist, but I don't know if this eventually causes the segfault.  I
>> think this is the relevant output:
>>
>> ==43569== Conditional jump or move depends on uninitialised value(s)
>> ==43569==    at 0x1473C515: pzgstrs (pzgstrs.c:1074)
>> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
>> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
>> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
>> ==43569==    by 0x40465D: main (superlu_test.c:59)
>> ==43569==
>> ==43569== Use of uninitialised value of size 8
>> ==43569==    at 0x1473C554: pzgstrs (pzgstrs.c:1077)
>> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
>> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
>> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
>> ==43569==    by 0x40465D: main (superlu_test.c:59)
>> ==43569==
>> ==43569== Use of uninitialised value of size 8
>> ==43569==    at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
>> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
>> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
>> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
>> ==43569==    by 0x40465D: main (superlu_test.c:59)
>> ==43569==
>> ==43569== Invalid write of size 8
>> ==43569==    at 0x1473C554: pzgstrs (pzgstrs.c:1077)
>> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
>> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
>> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
>> ==43569==    by 0x40465D: main (superlu_test.c:59)
>> ==43569==  Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd
>> ==43569==    at 0x4C2D814: memalign (vg_replace_malloc.c:906)
>> ==43569==    by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
>> ==43569==    by 0x1464D488: superlu_malloc_dist (memory.c:127)
>> ==43569==    by 0x1473C451: pzgstrs (pzgstrs.c:1044)
>> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
>> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
>> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
>> ==43569==    by 0x40465D: main (superlu_test.c:59)
>> ==43569==
>> ==43569== Invalid write of size 8
>> ==43569==    at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
>> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
>> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
>> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
>> ==43569==    by 0x40465D: main (superlu_test.c:59)
>> ==43569==  Address 0x266e5ad0 is 16 bytes after a block of size 35,520 alloc'd
>> ==43569==    at 0x4C2D814: memalign (vg_replace_malloc.c:906)
>> ==43569==    by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
>> ==43569==    by 0x1464D488: superlu_malloc_dist (memory.c:127)
>> ==43569==    by 0x1473C451: pzgstrs (pzgstrs.c:1044)
>> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
>> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
>> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
>> ==43569==    by 0x40465D: main (superlu_test.c:59)
>> ==43569==
>>
>> I also attached the whole log. Does this make any sense? The problem
>> seems to be around where I get the original segfault.
>>
>> best,
>> marius
>>
>>
>> *Sent:* Saturday, October 31, 2020 at 04:07
>> *From:* "Barry Smith" <bsmith at petsc.dev>
>> *To:* "Marius Buerkle" <mbuerkle at web.de>
>> *Cc:* "Xiaoye S. Li" <xsli at lbl.gov>, "petsc-users at mcs.anl.gov" <
>> petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
>> *Subject:* Re: [petsc-users] superlu_dist segfault
>>
>>  Have you run it with valgrind yet? There could be memory corruption earlier
>> that causes a later crash; crashes that occur at different places for the
>> same run are almost always due to memory corruption.
>>
>>   If valgrind is clean you can run with -on_error_attach_debugger and if
>> the X forwarding is set up it will open a debugger on the crashing process
>> and you can type bt to see exactly where it is crashing, at what line
>> number and code line.
>>
>>   Barry
>>
>>
>>
>> On Oct 29, 2020, at 1:04 AM, Marius Buerkle <mbuerkle at web.de> wrote:
>>
>> Hi Sherry,
>>
>> I used only 1 OpenMP thread, and I also recompiled PETSc in debug mode
>> with OpenMP turned off, but it did not help.
>>
>> Here is the output I can get from SuperLU during the PETSc run:
>>         Nonzeros in L       29519630
>>         Nonzeros in U       29519630
>>         nonzeros in L+U     58996711
>>         nonzeros in LSUB     4509612
>> ** Memory Usage **********************************
>> ** NUMfact space (MB): (sum-of-all-processes)
>>     L\U :          952.18 |  Total :  1980.60
>> ** Total highmark (MB):
>>     Sum-of-all : 12401.85 | Avg :   387.56  | Max :   387.56
>> **************************************************
>> **************************************************
>> **** Time (seconds) ****
>>         EQUIL time             0.06
>>         ROWPERM time           1.03
>>         COLPERM time           1.01
>>         SYMBFACT time          0.45
>>         DISTRIBUTE time        0.33
>>         FACTOR time            0.90
>>         Factor flops    2.225916e+11    Mflops  247438.62
>>         SOLVE time            0.000
>> **************************************************
>>
>> I tried all available ordering options for Colperm
>> (NATURAL, MMD_AT_PLUS_A, MMD_ATA, METIS_AT_PLUS_A), except for parmetis,
>> which always crashes. For Rowperm I used NOROWPERM and LargeDiag_MC64. All
>> give the same segfault.
>>
>>
>> *Sent:* Thursday, October 29, 2020 at 14:14
>> *From:* "Xiaoye S. Li" <xsli at lbl.gov>
>> *To:* "Marius Buerkle" <mbuerkle at web.de>
>> *Cc:* "Zhang, Hong" <hzhang at mcs.anl.gov>, "petsc-users at mcs.anl.gov" <
>> petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
>> *Subject:* Re: Re: Re: [petsc-users] superlu_dist segfault
>> Hong: thanks for the diagnosis!
>>
>> Marius: how many OpenMP threads are you using per MPI task?
>> In an earlier email, you mentioned the allocation failure at the
>> following line:
>>   if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread *
>> sizeof(doublecomplex))))     ABORT("Malloc fails for lsum[].");
>>
>> This is in the solve phase. I think that when we did some OpenMP
>> optimization, we allowed several data structures to grow with the number of
>> OpenMP threads.  You can try using 1 thread.
>>
>> The RHS and X  memories are easy to compute. However, in order to gauge
>> how much memory is used in the factorization, can you print out the number
>> of nonzeros in the L and U factors?   What ordering option are you using?
>> The sparse matrix A looks pretty small.
>>
>> The code can also print out the working storage used during
>> factorization.  I am not sure how this printing can be turned on through
>> PETSc.
>>
>> Sherry
>>
>> On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <mbuerkle at web.de> wrote:
>>
>>> Thanks for the swift reply.
>>>
>>> I also realized that if I reduce the number of RHS then it works. But I am
>>> running the code on a cluster with 256 GB RAM per node.  One dense matrix
>>> would be around 30 GB, so 60 GB for both, which is large but does not exceed
>>> the memory of even one node, and I also get the segfault if I run it on
>>> several nodes. Moreover, it works well with the MUMPS and MKL_CPARDISO
>>> solvers. The maximum memory used with MUMPS is around 150 GB during the
>>> solve phase, but SuperLU_DIST crashed even before reaching the solve
>>> phase. Could there be such a large difference in memory usage between
>>> SuperLU_DIST and MUMPS?
>>>
>>>
>>> best,
>>>
>>> marius
>>>
>>> *Sent:* Thursday, October 29, 2020 at 10:10
>>> *From:* "Zhang, Hong" <hzhang at mcs.anl.gov>
>>> *To:* "Marius Buerkle" <mbuerkle at web.de>
>>> *Cc:* "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <
>>> xiaoye at nersc.gov>
>>> *Subject:* Re: Re: [petsc-users] superlu_dist segfault
>>> Marius,
>>> I tested your code with petsc-release on my Mac laptop using np=2 cores.
>>> I first tested a small matrix data file successfully. Then I switched to
>>> your data file and ran out of memory, likely due to the dense matrices B
>>> and X. I got the error "Your system has run out of application memory" from
>>> my laptop.
>>>
>>> The sparse matrix A has size 42549 by 42549. Your code creates dense
>>> matrices B and X with the same size -- a huge memory requirement!
>>> After replacing B and X with matrices of size 42549 by nrhs (nrhs <= 4000),
>>> the code ran well with np=2. Note the error message you got:
>>> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>> probably memory access out of range
>>>
>>> The modified code I used is attached.
>>> Hong
>>>
>>> ------------------------------
>>> *From:* Marius Buerkle <mbuerkle at web.de>
>>> *Sent:* Tuesday, October 27, 2020 10:01 PM
>>> *To:* Zhang, Hong <hzhang at mcs.anl.gov>
>>> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; Sherry Li <
>>> xiaoye at nersc.gov>
>>> *Subject:* Aw: Re: [petsc-users] superlu_dist segfault
>>>
>>> Hi,
>>>
>>> I recompiled PETSc with the debug option; now I get a segfault at a
>>> different position:
>>>
>>> [23]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>> probably memory access out of range
>>> [23]PETSC ERROR: Try option -start_in_debugger or
>>> -on_error_attach_debugger
>>> [23]PETSC ERROR: or see
>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> [23]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
>>> OS X to find memory corruption errors
>>> [23]PETSC ERROR: likely location of problem given in stack below
>>> [23]PETSC ERROR: ---------------------  Stack Frames
>>> ------------------------------------
>>> [23]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>>> available,
>>> [23]PETSC ERROR:       INSTEAD the line number of the start of the
>>> function
>>> [23]PETSC ERROR:       is given.
>>> [23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242
>>> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>> [23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211
>>> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>> [23]PETSC ERROR: [23] MatMatSolve line 3466
>>> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c
>>> [23]PETSC ERROR: --------------------- Error Message
>>> --------------------------------------------------------------
>>> [23]PETSC ERROR: Signal received
>>>
>>> I  made a small reproducer. The matrix is a bit too big so I cannot
>>> attach it directly to the email, but I put it in the cloud
>>> https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw
>>>
>>> Best,
>>> Marius
>>>
>>>
>>> *Sent:* Tuesday, October 27, 2020 at 23:11
>>> *From:* "Zhang, Hong" <hzhang at mcs.anl.gov>
>>> *To:* "Marius Buerkle" <mbuerkle at web.de>, "petsc-users at mcs.anl.gov" <
>>> petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
>>> *Subject:* Re: [petsc-users] superlu_dist segfault
>>> Marius,
>>> It fails at line 1075 in file
>>> /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
>>>     if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread *
>>> sizeof(doublecomplex))))     ABORT("Malloc fails for lsum[].");
>>>
>>> We do not know what this means. You may use a debugger to check the values
>>> of the variables involved.
>>> I'm cc'ing Sherry (superlu_dist developer); alternatively, you may send us a
>>> short stand-alone code that reproduces the error, and we can help
>>> investigate it.
>>> Hong
>>>
>>>
>>> ------------------------------
>>> *From:* petsc-users <petsc-users-bounces at mcs.anl.gov> on behalf of
>>> Marius Buerkle <mbuerkle at web.de>
>>> *Sent:* Tuesday, October 27, 2020 8:46 AM
>>> *To:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
>>> *Subject:* [petsc-users] superlu_dist segfault
>>>
>>> Hi,
>>>
>>> When using MatMatSolve with superlu_dist I get a segmentation fault:
>>>
>>> Malloc fails for lsum[]. at line 1075 in file
>>> /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
>>>
>>> The matrix is not particularly big. I am using the petsc release
>>> branch, and superlu_dist is v6.3.0, I think.
>>>
>>> Best,
>>> Marius
>>>
>> <valgrind.tar.gz>
>>
>>
>>
>
> --
> Stefano
>