[petsc-users] superlu_dist segfault
Barry Smith
bsmith at petsc.dev
Fri Oct 30 14:07:54 CDT 2020
Have you run it yet with valgrind? It could be memory corruption earlier that causes a later crash; crashes that occur at different places for the same run are almost always due to memory corruption.
If valgrind is clean you can run with -on_error_attach_debugger, and if X forwarding is set up it will open a debugger on the crashing process; you can type bt to see exactly where it is crashing, at what line number and code line.
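A command sketch of both suggestions (the executable name ./my_app and the rank count are placeholders; the PETSc options are the ones named above):

```shell
# Run every MPI rank under valgrind's memcheck; use-after-free and
# out-of-bounds writes are reported at the point of corruption, which
# often precedes the later, seemingly unrelated SEGV.
mpiexec -n 4 valgrind --tool=memcheck --leak-check=yes --num-callers=20 \
    ./my_app -pc_type lu -pc_factor_mat_solver_type superlu_dist

# If valgrind is clean, let PETSc attach a debugger on the failing rank
# (requires working X forwarding); then type 'bt' at the gdb prompt.
mpiexec -n 4 ./my_app -on_error_attach_debugger
```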
Barry
> On Oct 29, 2020, at 1:04 AM, Marius Buerkle <mbuerkle at web.de> wrote:
>
> Hi Sherry,
>
> I used only 1 OpenMP thread, and I also recompiled PETSc in debug mode with OpenMP turned off, but it did not help.
>
> Here is the output I get from SuperLU during the PETSc run:
> Nonzeros in L 29519630
> Nonzeros in U 29519630
> nonzeros in L+U 58996711
> nonzeros in LSUB 4509612
> ** Memory Usage **********************************
> ** NUMfact space (MB): (sum-of-all-processes)
> L\U : 952.18 | Total : 1980.60
> ** Total highmark (MB):
> Sum-of-all : 12401.85 | Avg : 387.56 | Max : 387.56
> **************************************************
> **************************************************
> **** Time (seconds) ****
> EQUIL time 0.06
> ROWPERM time 1.03
> COLPERM time 1.01
> SYMBFACT time 0.45
> DISTRIBUTE time 0.33
> FACTOR time 0.90
> Factor flops 2.225916e+11 Mflops 247438.62
> SOLVE time 0.000
> **************************************************
>
> I tried all available ordering options for ColPerm (NATURAL, MMD_AT_PLUS_A, MMD_ATA, METIS_AT_PLUS_A), save for ParMETIS, which always crashes. For RowPerm I used NOROWPERM and LargeDiag_MC64. All give the same seg fault.
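For reference, these orderings can be selected through PETSc's runtime options, roughly as follows (option names as in recent PETSc releases; the executable name is a placeholder):

```shell
# Column permutation (fill-reducing ordering) for SuperLU_DIST:
./my_app -pc_type lu -pc_factor_mat_solver_type superlu_dist \
         -mat_superlu_dist_colperm METIS_AT_PLUS_A

# Row permutation (numerical pivoting strategy):
./my_app -pc_type lu -pc_factor_mat_solver_type superlu_dist \
         -mat_superlu_dist_rowperm NOROWPERM
```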
>
>
> Sent: Thursday, October 29, 2020 at 14:14
> From: "Xiaoye S. Li" <xsli at lbl.gov>
> To: "Marius Buerkle" <mbuerkle at web.de>
> Cc: "Zhang, Hong" <hzhang at mcs.anl.gov>, "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
> Subject: Re: Re: Re: [petsc-users] superlu_dist segfault
> Hong: thanks for the diagnosis!
>
> Marius: how many OpenMP threads are you using per MPI task?
> In an earlier email, you mentioned the allocation failure at the following line:
> if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");
>
> This is in the solve phase. I think when we did some OpenMP optimization, we allowed several data structures to grow with the number of OpenMP threads. You can try to use 1 thread.
>
> The RHS and X memories are easy to compute. However, in order to gauge how much memory is used in the factorization, can you print out the number of nonzeros in the L and U factors? What ordering option are you using? The sparse matrix A looks pretty small.
>
> The code can also print out the working storage used during factorization. I am not sure how this printing can be turned on through PETSc.
>
> Sherry
>
> On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <mbuerkle at web.de> wrote:
> Thanks for the swift reply.
>
> I also realized that if I reduce the number of RHS it works. But I am running the code on a cluster with 256 GB RAM / node. One dense matrix would be around ~30 GB, so 60 GB in total, which is large but does not exceed the memory of even one node, and I also get the seg fault if I run it on several nodes. Moreover, it works well with the MUMPS and MKL_CPARDISO solvers. The maximum memory used with MUMPS is around 150 GB during the solve phase, but SuperLU_DIST crashes even before reaching the solve phase. Could there be such a large difference in memory usage between SuperLU_DIST and MUMPS?
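As a sanity check on those numbers: a dense double-complex matrix of the full system size (42549, per Hong's message below) does indeed come to roughly 29 GB, so B and X together come to roughly 58 GB. A minimal back-of-the-envelope sketch:

```python
# Estimate the memory footprint of the dense RHS/solution matrices B and X.
# A double-complex entry is 16 bytes (two 8-byte doubles).
n = 42549                # order of the sparse matrix A
bytes_per_entry = 16
dense_gb = n * n * bytes_per_entry / 1e9
print(f"one dense {n} x {n} complex matrix: {dense_gb:.1f} GB")  # ~29.0 GB
print(f"B and X together: {2 * dense_gb:.1f} GB")                # ~57.9 GB
```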
>
>
> best,
>
> marius
>
>
> Sent: Thursday, October 29, 2020 at 10:10
> From: "Zhang, Hong" <hzhang at mcs.anl.gov>
> To: "Marius Buerkle" <mbuerkle at web.de>
> Cc: "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
> Subject: Re: Re: [petsc-users] superlu_dist segfault
> Marius,
> I tested your code with petsc-release on my Mac laptop using np=2 cores. I first tested a small matrix data file successfully. Then I switched to your data file and ran out of memory, likely due to the dense matrices B and X; I got the error "Your system has run out of application memory" from my laptop.
>
> The sparse matrix A has size 42549 by 42549. Your code creates dense matrices B and X with the same size -- a huge memory requirement!
> By replacing B and X with size 42549 by nrhs (nrhs <= 4000), I had the code run well with np=2. Note the error message you got:
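For scale, the reduced shape cuts each dense matrix from ~29 GB down to a few GB (sketch; nrhs = 4000 taken from the bound above):

```python
# Memory for a 42549 x nrhs double-complex dense matrix (16 bytes/entry).
n, nrhs = 42549, 4000
gb = n * nrhs * 16 / 1e9
print(f"{n} x {nrhs} complex matrix: {gb:.2f} GB")  # ~2.72 GB
```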
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>
> The modified code I used is attached.
> Hong
>
> From: Marius Buerkle <mbuerkle at web.de>
> Sent: Tuesday, October 27, 2020 10:01 PM
> To: Zhang, Hong <hzhang at mcs.anl.gov>
> Cc: petsc-users at mcs.anl.gov; Sherry Li <xiaoye at nersc.gov>
> Subject: Re: Re: [petsc-users] superlu_dist segfault
>
> Hi,
>
> I recompiled PETSC with debug option, now I get a seg fault at a different position
>
> [23]PETSC ERROR: ------------------------------------------------------------------------
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [23]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [23]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [23]PETSC ERROR: likely location of problem given in stack below
> [23]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [23]PETSC ERROR: INSTEAD the line number of the start of the function
> [23]PETSC ERROR: is given.
> [23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve line 3466 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c
> [23]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [23]PETSC ERROR: Signal received
>
> I made a small reproducer. The matrix is a bit too big to attach directly to the email, so I put it in the cloud:
> https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw
>
> Best,
> Marius
>
>
> Sent: Tuesday, October 27, 2020 at 23:11
> From: "Zhang, Hong" <hzhang at mcs.anl.gov>
> To: "Marius Buerkle" <mbuerkle at web.de>, "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
> Subject: Re: [petsc-users] superlu_dist segfault
> Marius,
> It fails at the line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
> if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");
>
> We do not know what it means. You may use a debugger to check the values of the variables involved.
> I'm cc'ing Sherry (the SuperLU_DIST developer); alternatively, you may send us a stand-alone short code that reproduces the error. We can help with its investigation.
> Hong
>
>
> From: petsc-users <petsc-users-bounces at mcs.anl.gov> on behalf of Marius Buerkle <mbuerkle at web.de>
> Sent: Tuesday, October 27, 2020 8:46 AM
> To: petsc-users at mcs.anl.gov
> Subject: [petsc-users] superlu_dist segfault
>
> Hi,
>
> When using MatMatSolve with superlu_dist I get a segmentation fault:
>
> Malloc fails for lsum[]. at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
>
> The matrix size is not particularly big. I am using the PETSc release branch, and superlu_dist is v6.3.0, I think.
>
> Best,
> Marius