<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class=""><br class=""></div> Have you run it yet with valgrind? It could be memory corruption earlier in the run that causes a later crash; crashes that occur at different places for the same run are almost always due to memory corruption. <div class=""><br class=""></div><div class=""> If valgrind is clean you can run with -on_error_attach_debugger and, if X forwarding is set up, it will open a debugger on the crashing process; you can then type bt to see exactly where it is crashing, at which line number and code line.</div><div class=""><br class=""></div><div class=""> Barry</div><div class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Oct 29, 2020, at 1:04 AM, Marius Buerkle <<a href="mailto:mbuerkle@web.de" class="">mbuerkle@web.de</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div class=""><div style="font-family: Verdana;font-size: 12.0px;" class=""><div class="">Hi Sherry,</div>
<div class=""> </div>
<div class="">I used only 1 OpenMP thread and I also recompiled PETSC in debug mode with OpenMP turned off. But did not help. </div>
<div class=""> </div>
<div class="">Here is the output I can get from SuperLu during the PETSC run</div>
<div class="">
<div class=""> Nonzeros in L 29519630<br class="">
Nonzeros in U 29519630<br class="">
nonzeros in L+U 58996711<br class="">
nonzeros in LSUB 4509612</div>
<div class="">** Memory Usage **********************************<br class="">
** NUMfact space (MB): (sum-of-all-processes)<br class="">
L\U : 952.18 | Total : 1980.60<br class="">
** Total highmark (MB):<br class="">
Sum-of-all : 12401.85 | Avg : 387.56 | Max : 387.56<br class="">
**************************************************<br class="">
**************************************************<br class="">
**** Time (seconds) ****<br class="">
EQUIL time 0.06<br class="">
ROWPERM time 1.03<br class="">
COLPERM time 1.01<br class="">
SYMBFACT time 0.45<br class="">
DISTRIBUTE time 0.33<br class="">
FACTOR time 0.90<br class="">
Factor flops 2.225916e+11 Mflops 247438.62<br class="">
SOLVE time 0.000<br class="">
**************************************************</div>
<div class=""> </div>
<div class="">I tried all available ordering options for Colperm (NATURAL,MMD_AT_PLUS_A,MMD_ATA,METIS_AT_PLUS_A), save for parmetis which always crashes. For Rowperm I used NOROWPERM, LargeDiag_MC64. All gives the same seg. fault.</div>
</div>
<div class="">
<div class="">
<div name="quote" style="margin:10px 5px 5px 10px; padding: 10px 0 10px 10px; border-left:2px solid #C3D9E5; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
<div style="margin:0 0 10px 0;" class=""><b class="">Gesendet:</b> Donnerstag, 29. Oktober 2020 um 14:14 Uhr<br class="">
<b class="">Von:</b> "Xiaoye S. Li" <<a href="mailto:xsli@lbl.gov" class="">xsli@lbl.gov</a>><br class="">
<b class="">An:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" class="">mbuerkle@web.de</a>><br class="">
<b class="">Cc:</b> "Zhang, Hong" <<a href="mailto:hzhang@mcs.anl.gov" class="">hzhang@mcs.anl.gov</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" class="">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" class="">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" class="">xiaoye@nersc.gov</a>><br class="">
<b class="">Betreff:</b> Re: Re: Re: [petsc-users] superlu_dist segfault</div>
<div name="quoted-content" class="">
<div class="">
<div class="gmail_default" style="font-family: verdana, sans-serif; font-size: small;">Hong: thanks for the diagnosis!</div>
<div class="gmail_default" style="font-family: verdana, sans-serif; font-size: small;"> </div>
<div class="gmail_default" style="font-family: verdana, sans-serif; font-size: small;">Marius: how many OpenMP threads are you using per MPI task?</div>
<div class="gmail_default" style="font-family: verdana, sans-serif; font-size: small;">In an earlier email, you mentioned the allocation failure at the following line:</div>
<div class="gmail_default" style="font-family: verdana, sans-serif; font-size: small;">
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;" class=""> if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class="">this is in the solve phase. I think when we do some OpenMP optimization, we allowed several data structures to grow with OpenMP threads. You can try to use 1 thread.</div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><br class="">
The memory for the RHS and X is easy to compute. <span style="font-size: 12.0pt;" class="">However, to gauge how much memory is used in the factorization, can you print out the number of nonzeros in the L and U factors? What ordering option are you using? The sparse matrix A looks pretty small.</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-size: 12.0pt;" class="">The code can also print out the working storage used during factorization. I am not sure how this printing can be turned on through PETSc.</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-size: 12.0pt;" class="">Sherry</span></div>
</div>
</div>
<div class="gmail_quote">
<div class="gmail_attr">On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>> wrote:</div>
<blockquote class="gmail_quote" style="margin: 0.0px 0.0px 0.0px 0.8ex;border-left: 1.0px solid rgb(204,204,204);padding-left: 1.0ex;">
<div class="">
<div style="font-family: Verdana;font-size: 12.0px;" class="">
<div class=""><p class=""><span style="font-size: 11.0pt;" class=""><span style="line-height: normal;" class=""><span style="font-family: Calibri , sans-serif;" class=""><span style="font-size: 12.0pt;" class=""><span style="font-family: "Times New Roman" , serif;" class="">Thanks for the swift reply. </span></span></span></span></span></p><p class=""><span style="font-size: 11.0pt;" class=""><span style="line-height: normal;" class=""><span style="font-family: Calibri , sans-serif;" class=""><span style="font-size: 12.0pt;" class=""><span style="font-family: "Times New Roman" , serif;" class="">I also realized if I reduce the number of RHS then it works. But I am running the code on a cluster with 256GB ram / node. One dense matrix would be around ~30 Gb so 60 Gb, which is large but does exceed the memory of even one node and I also get the seg fault if I run it on several nodes. Moreover, it works well with MUMPS and MKL_CPARDISO solver. The maxium memory used when using MUMPS is around 150 Gb during the solver phase but for SuperLU_dist it crashed even before reaching the solver phase. Could there be such a large difference in memory usage between SuperLu_dist and MUMPS ?</span></span></span></span></span></p><div class=""> <br class="webkit-block-placeholder"></div><p class=""><span style="font-size: 11.0pt;" class=""><span style="line-height: normal;" class=""><span style="font-family: Calibri , sans-serif;" class=""><span style="font-size: 12.0pt;" class=""><span style="font-family: "Times New Roman" , serif;" class="">best,</span></span></span></span></span></p><p class=""><span style="font-size: 11.0pt;" class=""><span style="line-height: normal;" class=""><span style="font-family: Calibri , sans-serif;" class=""><span style="font-size: 12.0pt;" class=""><span style="font-family: "Times New Roman" , serif;" class="">marius</span></span></span></span></span></p>
<div class="">
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0.0px 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);" class="">
<div style="margin: 0.0px 0.0px 10.0px;" class=""><b class="">Gesendet:</b> Donnerstag, 29. Oktober 2020 um 10:10 Uhr<br class="">
<b class="">Von:</b> "Zhang, Hong" <<a href="mailto:hzhang@mcs.anl.gov" target="_blank" class="">hzhang@mcs.anl.gov</a>><br class="">
<b class="">An:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>><br class="">
<b class="">Cc:</b> "<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" target="_blank" class="">xiaoye@nersc.gov</a>><br class="">
<b class="">Betreff:</b> Re: Re: [petsc-users] superlu_dist segfault</div>
<div class="">
<div class="">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""><span style="color: rgb(32,31,30);font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class="">Marius,</span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""><span style="color: rgb(32,31,30);font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class="">I tested your code with petsc-release on my mac laptop using np=2 cores. I first tested a small matrix data file successfully. Then I switch to your data file and run out of memory, likely due to the dense matrices B and X. I got an error "Your system has run out of application memory" from my laptop.</span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""> </div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""><span style="color: rgb(32,31,30);font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class="">The sparse matrix A has size 42549 by 42549. Your code creates dense matrices B and X with the same size -- a huge memory requirement!</span></div>
<div class=""><font color="#201f1e" face="Verdana" class=""><span style="font-size: 12.0px;" class="">By replacing B and X with size <span style="background-color: rgb(255,255,255);display: inline;" class="">42549 by<span class=""> nrhs (nrhs =< 4000), I had the code run well with np=2. Note the error message you got </span></span></span></font></div>
<div class=""><font color="#201f1e" face="Verdana" class=""><span style="font-size: 12.0px;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class=""><span class=""><span style="background-color: rgb(255,255,255);display: inline;" class="">[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range</span></span></span></span></font></div>
<div class=""> </div>
<div class="">The modified code I used is attached.</div>
<div class=""><font color="#201f1e" face="Verdana" class=""><span style="font-size: 12.0px;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class=""><span class=""><span style="background-color: rgb(255,255,255);display: inline;" class="">Hong</span></span></span></span></font></div>
<div id="gmail-m_2827192423146367754appendonsend" class=""> </div>
<hr style="display: inline-block;width: 98.0%;" class="">
<div id="gmail-m_2827192423146367754divRplyFwdMsg" class=""><font face="Calibri, sans-serif" style="font-size: 11.0pt;" class=""><b class="">From:</b> Marius Buerkle <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>><br class="">
<b class="">Sent:</b> Tuesday, October 27, 2020 10:01 PM<br class="">
<b class="">To:</b> Zhang, Hong <<a href="mailto:hzhang@mcs.anl.gov" target="_blank" class="">hzhang@mcs.anl.gov</a>><br class="">
<b class="">Cc:</b> <a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>>; Sherry Li <<a href="mailto:xiaoye@nersc.gov" target="_blank" class="">xiaoye@nersc.gov</a>><br class="">
<b class="">Subject:</b> Aw: Re: [petsc-users] superlu_dist segfault</font>
<div class=""> </div>
</div>
<div class="">
<div style="font-family: Verdana;font-size: 12.0px;" class="">
<div class="">Hi,</div>
<div class=""> </div>
<div class="">I recompiled PETSC with debug option, now I get a seg fault at a different position</div>
<div class=""> </div>
<div class="">[23]PETSC ERROR: ------------------------------------------------------------------------<br class="">
[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range<br class="">
[23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br class="">
[23]PETSC ERROR: or see <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank" class="">https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a><br class="">
[23]PETSC ERROR: or try <a href="http://valgrind.org/" target="_blank" class="">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors<br class="">
[23]PETSC ERROR: likely location of problem given in stack below<br class="">
[23]PETSC ERROR: --------------------- Stack Frames ------------------------------------<br class="">
[23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,<br class="">
[23]PETSC ERROR: INSTEAD the line number of the start of the function<br class="">
[23]PETSC ERROR: is given.<br class="">
[23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c<br class="">
[23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c<br class="">
[23]PETSC ERROR: [23] MatMatSolve line 3466 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c<br class="">
[23]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------<br class="">
[23]PETSC ERROR: Signal received</div>
<div class=""> </div>
<div class="">I made a small reproducer. The matrix is a bit too big so I cannot attach it directly to the email, but I put it in the cloud</div>
<div class=""><a href="https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw" target="_blank" class="">https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw</a></div>
<div class=""> </div>
<div class="">Best,</div>
<div class="">Marius</div>
<div class="">
<div class="">
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0.0px 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);" class="">
<div style="margin: 0.0px 0.0px 10.0px;" class=""><b class="">Gesendet:</b> Dienstag, 27. Oktober 2020 um 23:11 Uhr<br class="">
<b class="">Von:</b> "Zhang, Hong" <<a href="mailto:hzhang@mcs.anl.gov" target="_blank" class="">hzhang@mcs.anl.gov</a>><br class="">
<b class="">An:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" target="_blank" class="">xiaoye@nersc.gov</a>><br class="">
<b class="">Betreff:</b> Re: [petsc-users] superlu_dist segfault</div>
<div class="">
<div class="">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class="">Marius,</span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class="">It fails at the line <span style="background-color: rgb(255,255,255);display: inline;" class="">1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c</span></span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class=""> if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");</span></span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""> </div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class="">We do not know what it means. You may use a debugger to check the values of the variables involved.</span></span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class="">I'm cc'ing Sherry (superlu_dist developer), or you may send us a stand-alone short code that reproduce the error. We can help on its investigation.</span></span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class="">Hong</span></span></div>
<div id="gmail-m_2827192423146367754x_appendonsend" class=""> </div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;" class=""> </div>
<hr style="display: inline-block;width: 98.0%;" class="">
<div id="gmail-m_2827192423146367754x_divRplyFwdMsg" class=""><font face="Calibri, sans-serif" style="font-size: 11.0pt;" class=""><b class="">From:</b> petsc-users <<a href="mailto:petsc-users-bounces@mcs.anl.gov" target="_blank" class="">petsc-users-bounces@mcs.anl.gov</a>> on behalf of Marius Buerkle <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>><br class="">
<b class="">Sent:</b> Tuesday, October 27, 2020 8:46 AM<br class="">
<b class="">To:</b> <a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>><br class="">
<b class="">Subject:</b> [petsc-users] superlu_dist segfault</font>
<div class=""> </div>
</div>
<div class="">
<div style="font-family: Verdana;font-size: 12.0px;" class="">
<div class="">Hi,</div>
<div class=""> </div>
<div class="">When using MatMatSolve with superlu_dist I get a segmentation fault:</div>
<div class=""> </div>
<div class="">Malloc fails for lsum[]. at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c</div>
<div class=""> </div>
<div class="">The matrix size is not particular big and I am using the petsc release branch and superlu_dist is v6.3.0 I think.</div>
<div class=""> </div>
<div class="">Best,</div>
<div class="">Marius</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div></div></div>
</div></blockquote></div><br class=""></div></body></html>