[petsc-users] superlu_dist segfault

Barry Smith bsmith at petsc.dev
Fri Oct 30 14:07:54 CDT 2020


 Have you run it yet with valgrind? It could be memory corruption earlier in the run that causes a later crash; crashes that occur at different places for the same run are almost always due to memory corruption.

  If valgrind is clean you can run with -on_error_attach_debugger; if X forwarding is set up, this will open a debugger on the crashing process, and you can type bt to see exactly where it is crashing, down to the function and line number.
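
  For example, one way to run a PETSc application under valgrind with MPI (a sketch following the standard PETSc valgrind instructions; the process count and executable name are placeholders) is:

      mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p ./your_app <petsc options>

  The %p in the log-file name gives each MPI rank its own valgrind report.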

  Barry


> On Oct 29, 2020, at 1:04 AM, Marius Buerkle <mbuerkle at web.de> wrote:
> 
> Hi Sherry,
>  
> I used only 1 OpenMP thread, and I also recompiled PETSc in debug mode with OpenMP turned off, but it did not help.
>  
> Here is the output I can get from SuperLu during the PETSC run
>         Nonzeros in L       29519630
>         Nonzeros in U       29519630
>         nonzeros in L+U     58996711
>         nonzeros in LSUB     4509612
> ** Memory Usage **********************************
> ** NUMfact space (MB): (sum-of-all-processes)
>     L\U :          952.18 |  Total :  1980.60
> ** Total highmark (MB):
>     Sum-of-all : 12401.85 | Avg :   387.56  | Max :   387.56
> **************************************************
> **************************************************
> **** Time (seconds) ****
>         EQUIL time             0.06
>         ROWPERM time           1.03
>         COLPERM time           1.01
>         SYMBFACT time          0.45
>         DISTRIBUTE time        0.33
>         FACTOR time            0.90
>         Factor flops    2.225916e+11    Mflops  247438.62
>         SOLVE time            0.000
> **************************************************
>  
> I tried all available ordering options for ColPerm (NATURAL, MMD_AT_PLUS_A, MMD_ATA, METIS_AT_PLUS_A), save for PARMETIS, which always crashes. For RowPerm I used NOROWPERM and LargeDiag_MC64. All give the same seg fault.
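
(For reference: these SuperLU_DIST orderings can be selected at runtime through the PETSc options -mat_superlu_dist_colperm and -mat_superlu_dist_rowperm, e.g.

    -mat_superlu_dist_colperm METIS_AT_PLUS_A -mat_superlu_dist_rowperm LargeDiag_MC64

on the command line; the accepted values for the installed PETSc version are listed by running with -help.)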
>  
>  
> Sent: Thursday, October 29, 2020 at 14:14
> From: "Xiaoye S. Li" <xsli at lbl.gov>
> To: "Marius Buerkle" <mbuerkle at web.de>
> Cc: "Zhang, Hong" <hzhang at mcs.anl.gov>, "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
> Subject: Re: Re: Re: [petsc-users] superlu_dist segfault
> Hong: thanks for the diagnosis!
>  
> Marius: how many OpenMP threads are you using per MPI task?
> In an earlier email, you mentioned the allocation failure at the following line:
>   if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex))))     ABORT("Malloc fails for lsum[].");
>  
> This is in the solve phase. I think that when we did some OpenMP optimization, we allowed several data structures to grow with the number of OpenMP threads. You can try using 1 thread.
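
(For example, with a bash shell one can set

    export OMP_NUM_THREADS=1

before launching the job, which limits each MPI rank to a single OpenMP thread and so bounds the sizelsum*num_thread allocation above.)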
> 
> The memory for the RHS and X is easy to compute. However, to gauge how much memory is used in the factorization, can you print out the number of nonzeros in the L and U factors? What ordering option are you using? The sparse matrix A looks pretty small.
>  
> The code can also print out the working storage used during factorization.  I am not sure how this printing can be turned on through PETSc.
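
(A note on this: the PETSc runtime option

    -mat_superlu_dist_statprint

appears to turn on SuperLU_DIST's statistics printout, and is presumably how the factorization statistics shown earlier in the thread were produced; the exact option name should be checked against the PETSc version in use.)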
>  
> Sherry
>  
> On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <mbuerkle at web.de> wrote:
> Thanks for the swift reply.
> 
> I also realized that if I reduce the number of RHS then it works. But I am running the code on a cluster with 256 GB RAM per node. One dense matrix would be around ~30 GB, so 60 GB in total, which is large but does not exceed the memory of even a single node, and I also get the seg fault if I run it on several nodes. Moreover, it works well with the MUMPS and MKL_CPARDISO solvers. The maximum memory used with MUMPS is around 150 GB during the solve phase, but SuperLU_DIST crashes even before reaching the solve phase. Could there be such a large difference in memory usage between SuperLU_DIST and MUMPS?
> 
>  
> best,
> 
> marius
> 
>  
> Sent: Thursday, October 29, 2020 at 10:10
> From: "Zhang, Hong" <hzhang at mcs.anl.gov>
> To: "Marius Buerkle" <mbuerkle at web.de>
> Cc: "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
> Subject: Re: Re: [petsc-users] superlu_dist segfault
> Marius,
> I tested your code with petsc-release on my Mac laptop using np=2 cores. I first tested a small matrix data file successfully. Then I switched to your data file and ran out of memory, likely due to the dense matrices B and X; my laptop reported the error "Your system has run out of application memory".
>  
> The sparse matrix A has size 42549 by 42549. Your code creates dense matrices B and X with the same size -- a huge memory requirement!
> By replacing B and X with size 42549 by nrhs (nrhs <= 4000), I had the code run well with np=2. Note the error message you got:
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>  
> The modified code I used is attached.
> Hong
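
A minimal sketch of the kind of modification Hong describes is given below (this is not the code she attached, which is not part of this archive; the function name, nrhs, and the assumption that A is an assembled parallel AIJ matrix of global size n x n are all illustrative):

    #include <petscmat.h>

    /* Factor A with SuperLU_DIST and solve for nrhs right-hand sides,
       using n x nrhs dense matrices B and X instead of n x n ones. */
    PetscErrorCode SolveWithSuperLUDist(Mat A, PetscInt nrhs)
    {
      Mat            F, B, X;
      IS             perm, iperm;
      MatFactorInfo  info;
      PetscInt       n, nlocal, rstart, rend;
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = MatGetSize(A, &n, NULL);CHKERRQ(ierr);
      ierr = MatGetLocalSize(A, &nlocal, NULL);CHKERRQ(ierr);

      /* Dense right-hand sides and solutions: n x nrhs, not n x n */
      ierr = MatCreateDense(PetscObjectComm((PetscObject)A), nlocal, PETSC_DECIDE, n, nrhs, NULL, &B);CHKERRQ(ierr);
      ierr = MatDuplicate(B, MAT_DO_NOT_COPY_VALUES, &X);CHKERRQ(ierr);
      /* ... fill B with the right-hand-side columns here ... */

      /* LU factorization with SuperLU_DIST; identity row/column permutations,
         since SuperLU_DIST applies its own orderings internally */
      ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
      ierr = ISCreateStride(PETSC_COMM_SELF, rend - rstart, rstart, 1, &perm);CHKERRQ(ierr);
      ierr = ISDuplicate(perm, &iperm);CHKERRQ(ierr);
      ierr = MatGetFactor(A, MATSOLVERSUPERLU_DIST, MAT_FACTOR_LU, &F);CHKERRQ(ierr);
      ierr = MatFactorInfoInitialize(&info);CHKERRQ(ierr);
      ierr = MatLUFactorSymbolic(F, A, perm, iperm, &info);CHKERRQ(ierr);
      ierr = MatLUFactorNumeric(F, A, &info);CHKERRQ(ierr);
      ierr = MatMatSolve(F, B, X);CHKERRQ(ierr);

      ierr = ISDestroy(&perm);CHKERRQ(ierr);
      ierr = ISDestroy(&iperm);CHKERRQ(ierr);
      ierr = MatDestroy(&F);CHKERRQ(ierr);
      ierr = MatDestroy(&B);CHKERRQ(ierr);
      ierr = MatDestroy(&X);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

The essential point is that B and X are n x nrhs dense matrices rather than n x n, so the dense storage scales with the number of right-hand sides actually needed.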
>  
> From: Marius Buerkle <mbuerkle at web.de>
> Sent: Tuesday, October 27, 2020 10:01 PM
> To: Zhang, Hong <hzhang at mcs.anl.gov>
> Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; Sherry Li <xiaoye at nersc.gov>
> Subject: Re: Re: [petsc-users] superlu_dist segfault
>  
> Hi,
>  
> I recompiled PETSc with the debug option; now I get a seg fault at a different position:
>  
> [23]PETSC ERROR: ------------------------------------------------------------------------
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [23]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [23]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [23]PETSC ERROR: likely location of problem given in stack below
> [23]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [23]PETSC ERROR:       INSTEAD the line number of the start of the function
> [23]PETSC ERROR:       is given.
> [23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve line 3466 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c
> [23]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [23]PETSC ERROR: Signal received
>  
> I made a small reproducer. The matrix is a bit too big to attach directly to the email, so I put it in the cloud:
> https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw
>  
> Best,
> Marius
>  
>  
> Sent: Tuesday, October 27, 2020 at 23:11
> From: "Zhang, Hong" <hzhang at mcs.anl.gov>
> To: "Marius Buerkle" <mbuerkle at web.de>, "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
> Subject: Re: [petsc-users] superlu_dist segfault
> Marius,
> It fails at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c:
>     if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex))))     ABORT("Malloc fails for lsum[].");
>  
> We do not know what it means. You may use a debugger to check the values of the variables involved.
> I'm cc'ing Sherry (the superlu_dist developer); you may also send us a short stand-alone code that reproduces the error, and we can help investigate it.
> Hong
>  
>  
> From: petsc-users <petsc-users-bounces at mcs.anl.gov> on behalf of Marius Buerkle <mbuerkle at web.de>
> Sent: Tuesday, October 27, 2020 8:46 AM
> To: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> Subject: [petsc-users] superlu_dist segfault
>  
> Hi,
>  
> When using MatMatSolve with superlu_dist I get a segmentation fault:
>  
> Malloc fails for lsum[]. at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
>  
> The matrix size is not particularly big. I am using the PETSc release branch, and superlu_dist is v6.3.0, I think.
>  
> Best,
> Marius
