[petsc-users] superlu_dist segfault

Xiaoye S. Li xsli at lbl.gov
Thu Oct 29 00:14:24 CDT 2020


Hong: thanks for the diagnosis!

Marius: how many OpenMP threads are you using per MPI task?
In an earlier email, you mentioned the allocation failure at the following
line:
  if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread *
sizeof(doublecomplex))))     ABORT("Malloc fails for lsum[].");

This is in the solve phase. I think when we did some OpenMP optimization, we
allowed several data structures to grow with the number of OpenMP threads.
You can try using 1 thread.
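
If you want to force that from the application code rather than the environment,
here is a minimal sketch (it assumes SuperLU_DIST sees the same OpenMP runtime as
the caller, so setting OMP_NUM_THREADS=1 before the run has the same effect; the
helper name is mine):

  #ifdef _OPENMP
  #include <omp.h>
  #endif

  /* Call once on every MPI rank, before the factorization/solve, so workspace
     that is sized as sizelsum * num_thread (like lsum[]) does not grow with
     the thread count. */
  void cap_omp_threads_to_one(void)
  {
  #ifdef _OPENMP
    omp_set_num_threads(1);
  #endif
  }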

The memory for the RHS and X is easy to compute. However, in order to gauge how
much memory is used in the factorization, can you print out the number of
nonzeros in the L and U factors?  What ordering option are you using?  The
sparse matrix A looks pretty small.
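
For example, with n = 42549 and complex double entries (16 bytes each), a quick
back-of-the-envelope estimate is

  B or X stored n x n    :  42549 * 42549 * 16 bytes  ~  29 GB each
  B or X stored n x 4000 :  42549 *  4000 * 16 bytes  ~ 2.7 GB each

so two full-size dense matrices alone already take close to 60 GB before the
factorization starts.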

The code can also print out the working storage used during factorization.
I am not sure how this printing can be turned on through PETSc.
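Perhaps the runtime option -mat_superlu_dist_statprint does it; my recollection
is that it maps to SuperLU_DIST's options.PrintStat. If you prefer to set it
from the code rather than the command line, a minimal sketch:

  /* Assumption: -mat_superlu_dist_statprint enables SuperLU_DIST's PrintStat,
     which reports nnz(L+U) and the working storage used in factorization. */
  ierr = PetscOptionsSetValue(NULL, "-mat_superlu_dist_statprint", NULL); CHKERRQ(ierr);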

Sherry

On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <mbuerkle at web.de> wrote:

> Thanks for the swift reply.
>
> I also realized that if I reduce the number of RHS it works. But I am
> running the code on a cluster with 256 GB RAM / node.  One dense matrix
> would be around ~30 GB, so 60 GB in total, which is large but does not
> exceed the memory of even one node, and I also get the seg fault if I run it
> on several nodes. Moreover, it works well with the MUMPS and MKL_CPARDISO
> solvers. The maximum memory used with MUMPS is around 150 GB during the
> solve phase, but SuperLU_dist crashes even before reaching the solve
> phase. Could there be such a large difference in memory usage between
> SuperLU_dist and MUMPS?
>
>
>
> best,
>
> marius
>
> *Sent:* Thursday, October 29, 2020 at 10:10 AM
> *From:* "Zhang, Hong" <hzhang at mcs.anl.gov>
> *To:* "Marius Buerkle" <mbuerkle at web.de>
> *Cc:* "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>, "Sherry Li" <
> xiaoye at nersc.gov>
> *Subject:* Re: Re: [petsc-users] superlu_dist segfault
> Marius,
> I tested your code with petsc-release on my Mac laptop using np=2 cores. I
> first tested a small matrix data file successfully. Then I switched to your
> data file and ran out of memory, likely due to the dense matrices B and X.
> I got an error "Your system has run out of application memory" from my
> laptop.
>
> The sparse matrix A has size 42549 by 42549. Your code creates dense
> matrices B and X with the same size -- a huge memory requirement!
> By replacing B and X with size 42549 by nrhs (nrhs <= 4000), I had the
> code run well with np=2. Note the error message you got:
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
>
> The modified code I used is attached.
> Hong
>
> ------------------------------
> *From:* Marius Buerkle <mbuerkle at web.de>
> *Sent:* Tuesday, October 27, 2020 10:01 PM
> *To:* Zhang, Hong <hzhang at mcs.anl.gov>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; Sherry Li <
> xiaoye at nersc.gov>
> *Subject:* Aw: Re: [petsc-users] superlu_dist segfault
>
> Hi,
>
> I recompiled PETSc with the debug option; now I get a seg fault at a
> different position:
>
> [23]PETSC ERROR:
> ------------------------------------------------------------------------
> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [23]PETSC ERROR: or see
> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [23]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
> X to find memory corruption errors
> [23]PETSC ERROR: likely location of problem given in stack below
> [23]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [23]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> [23]PETSC ERROR:       INSTEAD the line number of the start of the function
> [23]PETSC ERROR:       is given.
> [23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242
> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211
> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [23]PETSC ERROR: [23] MatMatSolve line 3466
> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c
> [23]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [23]PETSC ERROR: Signal received
>
> I made a small reproducer. The matrix is a bit too big, so I cannot attach
> it directly to the email, but I put it in the cloud:
> https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw
>
> Best,
> Marius
>
>
> *Sent:* Tuesday, October 27, 2020 at 11:11 PM
> *From:* "Zhang, Hong" <hzhang at mcs.anl.gov>
> *To:* "Marius Buerkle" <mbuerkle at web.de>, "petsc-users at mcs.anl.gov" <
> petsc-users at mcs.anl.gov>, "Sherry Li" <xiaoye at nersc.gov>
> *Subject:* Re: [petsc-users] superlu_dist segfault
> Marius,
> It fails at line 1075 in file
> /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
>     if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread *
> sizeof(doublecomplex))))     ABORT("Malloc fails for lsum[].");
>
> We do not know what it means. You may use a debugger to check the values
> of the variables involved.
> I'm cc'ing Sherry (the superlu_dist developer), or you may send us a
> stand-alone short code that reproduces the error. We can help with its
> investigation.
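> Something along these lines would be enough (an untested skeleton; the binary
> file name and nrhs are placeholders):
>
>   #include <petscmat.h>
>
>   int main(int argc, char **argv)
>   {
>     Mat            A, F, B, X;
>     PetscViewer    viewer;
>     MatFactorInfo  info;
>     IS             perm, iperm;
>     PetscInt       m, ml, nrhs = 4000;   /* placeholder value */
>     PetscErrorCode ierr;
>
>     ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
>
>     /* Load the sparse matrix A from a PETSc binary file */
>     ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.bin", FILE_MODE_READ, &viewer); CHKERRQ(ierr);
>     ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
>     ierr = MatLoad(A, viewer); CHKERRQ(ierr);
>     ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
>     ierr = MatGetSize(A, &m, NULL); CHKERRQ(ierr);
>     ierr = MatGetLocalSize(A, &ml, NULL); CHKERRQ(ierr);
>
>     /* Dense right-hand sides B and solution X, m by nrhs, same row layout as A */
>     ierr = MatCreateDense(PETSC_COMM_WORLD, ml, PETSC_DECIDE, m, nrhs, NULL, &B); CHKERRQ(ierr);
>     ierr = MatSetRandom(B, NULL); CHKERRQ(ierr);
>     ierr = MatDuplicate(B, MAT_DO_NOT_COPY_VALUES, &X); CHKERRQ(ierr);
>
>     /* LU factorization with SuperLU_DIST, then the multi-RHS solve that fails */
>     ierr = MatGetFactor(A, MATSOLVERSUPERLU_DIST, MAT_FACTOR_LU, &F); CHKERRQ(ierr);
>     ierr = MatGetOrdering(A, MATORDERINGNATURAL, &perm, &iperm); CHKERRQ(ierr);
>     ierr = MatFactorInfoInitialize(&info); CHKERRQ(ierr);
>     ierr = MatLUFactorSymbolic(F, A, perm, iperm, &info); CHKERRQ(ierr);
>     ierr = MatLUFactorNumeric(F, A, &info); CHKERRQ(ierr);
>     ierr = MatMatSolve(F, B, X); CHKERRQ(ierr);
>
>     ierr = ISDestroy(&perm); CHKERRQ(ierr);
>     ierr = ISDestroy(&iperm); CHKERRQ(ierr);
>     ierr = MatDestroy(&F); CHKERRQ(ierr);
>     ierr = MatDestroy(&X); CHKERRQ(ierr);
>     ierr = MatDestroy(&B); CHKERRQ(ierr);
>     ierr = MatDestroy(&A); CHKERRQ(ierr);
>     ierr = PetscFinalize();
>     return ierr;
>   }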
> Hong
>
>
> ------------------------------
> *From:* petsc-users <petsc-users-bounces at mcs.anl.gov> on behalf of Marius
> Buerkle <mbuerkle at web.de>
> *Sent:* Tuesday, October 27, 2020 8:46 AM
> *To:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject:* [petsc-users] superlu_dist segfault
>
> Hi,
>
> When using MatMatSolve with superlu_dist I get a segmentation fault:
>
> Malloc fails for lsum[]. at line 1075 in file
> /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
>
> The matrix size is not particularly big. I am using the PETSc release
> branch, and superlu_dist is v6.3.0, I think.
>
> Best,
> Marius
>