[petsc-users] superlu_dist segfault
Barry Smith
bsmith at petsc.dev
Sun Nov 1 02:52:47 CST 2020
You can sometimes use -on_error_attach_debugger noxterm, and it will try to attach the debugger in the console where you started the job. If you are lucky this works, and you can use bt to see the stack and look at variables. But if multiple ranks crash the debugger will get confused, and even if only one rank crashes, if it is not rank zero the terminal (stty) settings can get messed up so that you cannot type to control the debugger.
The valgrind information is very valuable; Sherry can likely look at those lines and get a really good idea of what the problem is. For example,
> Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd
means that for some reason the code is writing past the end of an allocated array, either because the array was not allocated long enough or because the code has a bug that makes it write further than it should. This kind of thing is very common and is usually easy to debug for someone who knows the code once they know exactly which line is problematic, since valgrind shows exactly where the memory was allocated and exactly where the write went out of bounds.
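As a rough illustration of how to read those numbers, here is a minimal sketch in C. It assumes the overrun buffer holds 16-byte double-complex entries (the z in pzgstrs suggests complex arithmetic, but the log does not say what the block allocated at pzgstrs.c:1044 contains). The names complex16, block_elements, and overrun_index are made up for this example and are not SuperLU_DIST code.

```c
#include <stddef.h>

/* Assumed element type: two doubles, 16 bytes on common platforms.
   This is an assumption about the buffer, not something the log confirms. */
typedef struct { double r, i; } complex16;

/* Number of whole elements that fit in a block of block_bytes bytes. */
static size_t block_elements(size_t block_bytes, size_t elem_bytes)
{
    return block_bytes / elem_bytes;
}

/* Element index touched by a write that valgrind reports as starting
   bytes_after bytes past the end of a block of block_bytes bytes. */
static size_t overrun_index(size_t block_bytes, size_t bytes_after,
                            size_t elem_bytes)
{
    return (block_bytes + bytes_after) / elem_bytes;
}
```

Under that assumption the 35,520-byte block holds 2220 entries (valid indices 0 to 2219), and the writes reported 0 and 16 bytes after the block would correspond to indices 2220 and 2221.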
Barry
> On Nov 1, 2020, at 1:21 AM, Marius Buerkle <mbuerkle at web.de> wrote:
>
> Hi,
>
> I cannot use on_error_attach_debugger as X forwarding does not work on the system. Is it possible to dump the gdb output to file instead?
>
> I ran it through valgrind. There seems to be some problem during calls in superlu_dist, but I don't know if this eventually causes the segfault. I think this is the relevant output:
>
> ==43569== Conditional jump or move depends on uninitialised value(s)
> ==43569== at 0x1473C515: pzgstrs (pzgstrs.c:1074)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Use of uninitialised value of size 8
> ==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Use of uninitialised value of size 8
> ==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Invalid write of size 8
> ==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569== Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd
> ==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
> ==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
> ==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)
> ==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Invalid write of size 8
> ==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569== Address 0x266e5ad0 is 16 bytes after a block of size 35,520 alloc'd
> ==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)
> ==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
> ==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)
> ==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)
> ==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569== by 0x40465D: main (superlu_test.c:59)
> ==43569==
>
> I also attached the whole log. Does this make any sense? The problem seems to be around where I get the original segfault.
>
> best,
> marius
>
>
> Sent: Saturday, October 31, 2020 at 04:07
> From: "Barry Smith" <bsmith at petsc.dev>
> To: "Marius Buerkle" <mbuerkle at web.de>
> Cc: "Xiaoye S. Li" <xsli at lbl.gov>, "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
> Subject: Re: [petsc-users] superlu_dist segfault
>
> Have you run it with valgrind yet? There could be memory corruption earlier that causes a later crash; crashes that occur at different places for the same run are almost always due to memory corruption.
>
> If valgrind is clean you can run with -on_error_attach_debugger and if the X forwarding is set up it will open a debugger on the crashing process and you can type bt to see exactly where it is crashing, at what line number and code line.
>
> Barry
>
>
> On Oct 29, 2020, at 1:04 AM, Marius Buerkle <mbuerkle at web.de> wrote:
>
> Hi Sherry,
>
> I used only 1 OpenMP thread and I also recompiled PETSc in debug mode with OpenMP turned off, but it did not help.
>
> Here is the output I can get from SuperLU_DIST during the PETSc run:
> Nonzeros in L 29519630
> Nonzeros in U 29519630
> nonzeros in L+U 58996711
> nonzeros in LSUB 4509612
> ** Memory Usage **********************************
> ** NUMfact space (MB): (sum-of-all-processes)
> L\U : 952.18 | Total : 1980.60
> ** Total highmark (MB):
> Sum-of-all : 12401.85 | Avg : 387.56 | Max : 387.56
> **************************************************
> **************************************************
> **** Time (seconds) ****
> EQUIL time 0.06
> ROWPERM time 1.03
> COLPERM time 1.01
> SYMBFACT time 0.45
> DISTRIBUTE time 0.33
> FACTOR time 0.90
> Factor flops 2.225916e+11 Mflops 247438.62
> SOLVE time 0.000
> **************************************************
>
> I tried all available ordering options for Colperm (NATURAL, MMD_AT_PLUS_A, MMD_ATA, METIS_AT_PLUS_A), save for parmetis, which always crashes. For Rowperm I used NOROWPERM and LargeDiag_MC64. All give the same segfault.
>
>
> Sent: Thursday, October 29, 2020 at 14:14
> From: "Xiaoye S. Li" <xsli at lbl.gov>
> To: "Marius Buerkle" <mbuerkle at web.de>
> Cc: "Zhang, Hong" <hzhang at mcs.anl.gov>, "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
> Subject: Re: Re: Re: [petsc-users] superlu_dist segfault
> Hong: thanks for the diagnosis!
>
> Marius: how many OpenMP threads are you using per MPI task?
> In an earlier email, you mentioned the allocation failure at the following line:
> if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");
>
> This is in the solve phase. I think when we did some OpenMP optimization, we allowed several data structures to grow with the number of OpenMP threads. You can try using 1 thread.
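
One thing that can be checked when inspecting this allocation in a debugger is whether the requested byte count itself is sane. The sketch below is a hypothetical helper, not SuperLU_DIST code: it takes sizelsum and num_thread as plain inputs (how SuperLU_DIST computes sizelsum is not shown in this thread) and reports whether their product with the element size overflows size_t. Nothing in the thread establishes that overflow is the cause here; the sketch only shows how the size grows with the thread count.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply a by b; return 1 and store the product in *out if it fits
   in size_t, return 0 (leaving *out untouched) if it would overflow. */
static int checked_mul(size_t a, size_t b, size_t *out)
{
    if (b != 0 && a > SIZE_MAX / b)
        return 0;
    *out = a * b;
    return 1;
}

/* Bytes requested for sizelsum * num_thread elements of elem_bytes each.
   Returns 1 on success, 0 if the byte count does not fit in size_t. */
static int lsum_bytes(size_t sizelsum, size_t num_thread,
                      size_t elem_bytes, size_t *out)
{
    size_t n;
    return checked_mul(sizelsum, num_thread, &n)
        && checked_mul(n, elem_bytes, out);
}
```

For example, 1000 entries with 4 threads and 16-byte elements would request 64,000 bytes, four times the single-thread amount.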
>
> The RHS and X memories are easy to compute. However, in order to gauge how much memory is used in the factorization, can you print out the number of nonzeros in the L and U factors? What ordering option are you using? The sparse matrix A looks pretty small.
>
> The code can also print out the working storage used during factorization. I am not sure how this printing can be turned on through PETSc.
>
> Sherry
>
> On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <mbuerkle at web.de> wrote:
> Thanks for the swift reply.
>
> I also realized that if I reduce the number of RHS then it works. But I am running the code on a cluster with 256 GB RAM per node. One dense matrix would be around 30 GB, so 60 GB in total, which is large but does not exceed the memory of even one node, and I also get the segfault if I run it on several nodes. Moreover, it works well with the MUMPS and MKL_CPARDISO solvers. The maximum memory used with MUMPS is around 150 GB during the solver phase, but SuperLU_DIST crashed even before reaching the solver phase. Could there be such a large difference in memory usage between SuperLU_DIST and MUMPS?
>
>
> best,
>
> marius
>
>
> Sent: Thursday, October 29, 2020 at 10:10
> From: "Zhang, Hong" <hzhang at mcs.anl.gov>
> To: "Marius Buerkle" <mbuerkle at web.de>
> Cc: "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
> Subject: Re: Re: [petsc-users] superlu_dist segfault
> Marius,
> I tested your code with petsc-release on my Mac laptop using np=2 cores. I first tested a small matrix data file successfully. Then I switched to your data file and ran out of memory, likely due to the dense matrices B and X. I got the error "Your system has run out of application memory" from my laptop.
>
> The sparse matrix A has size 42549 by 42549. Your code creates dense matrices B and X with the same size -- a huge memory requirement!
> After replacing B and X with matrices of size 42549 by nrhs (nrhs <= 4000), the code ran well with np=2. Note the error message you got:
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
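
For scale, here is a back-of-the-envelope sketch of the dense storage involved. It assumes 16-byte double-complex scalars (suggested by the doublecomplex type in the SuperLU_DIST line quoted in this thread, but still an assumption about the build); dense_bytes is a hypothetical helper name.

```c
/* Bytes needed to store a dense rows x cols matrix with
   scalar_bytes bytes per entry (16 is assumed for double complex). */
static unsigned long long dense_bytes(unsigned long long rows,
                                      unsigned long long cols,
                                      unsigned long long scalar_bytes)
{
    return rows * cols * scalar_bytes;
}
```

With these assumptions one 42549 by 42549 dense matrix takes 28,966,678,416 bytes (about 29 GB), while a 42549 by 4000 matrix takes 2,723,136,000 bytes (about 2.7 GB).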
>
> The modified code I used is attached.
> Hong
>
> From: Marius Buerkle <mbuerkle at web.de>
> Sent: Tuesday, October 27, 2020 10:01 PM
> To: Zhang, Hong <hzhang at mcs.anl.gov>
> Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; Sherry Li <xiaoye at nersc.gov>
> Subject: Re: Re: [petsc-users] superlu_dist segfault
>
> Hi,
>
> I recompiled PETSc with the debug option; now I get a segfault at a different position:
>
> [23]PETSC ERROR: ------------------------------------------------------------------------
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [23]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [23]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [23]PETSC ERROR: likely location of problem given in stack below
> [23]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [23]PETSC ERROR: INSTEAD the line number of the start of the function
> [23]PETSC ERROR: is given.
> [23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve line 3466 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c
> [23]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [23]PETSC ERROR: Signal received
>
> I made a small reproducer. The matrix is a bit too big so I cannot attach it directly to the email, but I put it in the cloud
> https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw
>
> Best,
> Marius
>
>
> Sent: Tuesday, October 27, 2020 at 23:11
> From: "Zhang, Hong" <hzhang at mcs.anl.gov>
> To: "Marius Buerkle" <mbuerkle at web.de>, "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
> Subject: Re: [petsc-users] superlu_dist segfault
> Marius,
> It fails at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c:
> if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");
>
> We do not know what it means. You may use a debugger to check the values of the variables involved.
> I'm cc'ing Sherry (superlu_dist developer), or you may send us a short stand-alone code that reproduces the error. We can help investigate it.
> Hong
>
>
> From: petsc-users <petsc-users-bounces at mcs.anl.gov> on behalf of Marius Buerkle <mbuerkle at web.de>
> Sent: Tuesday, October 27, 2020 8:46 AM
> To: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> Subject: [petsc-users] superlu_dist segfault
>
> Hi,
>
> When using MatMatSolve with superlu_dist I get a segmentation fault:
>
> Malloc fails for lsum[]. at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
>
> The matrix size is not particularly big. I am using the petsc release branch, and superlu_dist is v6.3.0, I think.
>
> Best,
> Marius
> <valgrind.tar.gz>