<div dir="ltr"><div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small;color:#000000">Hong: thanks for the diagnosis!</div><div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small;color:#000000"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small;color:#000000">Marius: how many OpenMP threads are you using per MPI task?</div><div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small;color:#000000">In an earlier email, you mentioned the allocation failure at the following line:</div><div class="gmail_default" style="font-family:verdana,sans-serif;font-size:small;color:#000000"><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt"><span style="font-family:Verdana;font-size:12px"> if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");<br></span></div><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt"><br></div><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt">This is in the solve phase. I think that when we did some OpenMP optimization, we allowed several data structures to grow with the number of OpenMP threads. You can try using 1 thread.</div><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt"><br>The memory for the RHS and X is easy to compute. However, i<span style="font-size:12pt">n order to gauge how much memory is used in the factorization, can you print out the number of nonzeros in the L and U factors? What ordering option are you using? The sparse matrix A looks pretty small.</span></div><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt"><span style="font-size:12pt"><br></span></div><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt"><span style="font-size:12pt">The code can also print out the working storage used during factorization. 
I am not sure how this printing can be turned on through PETSc.</span></div><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt"><span style="font-size:12pt"><br></span></div><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt"><span style="font-size:12pt">Sherry</span></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <<a href="mailto:mbuerkle@web.de">mbuerkle@web.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div style="font-family:Verdana;font-size:12px"><div>
<p><span style="font-size:11pt"><span style="line-height:normal"><span style="font-family:Calibri,sans-serif"><span style="font-size:12pt"><span style="font-family:"Times New Roman",serif">Thanks for the swift reply. </span></span></span></span></span></p>
<p><span style="font-size:11pt"><span style="line-height:normal"><span style="font-family:Calibri,sans-serif"><span style="font-size:12pt"><span style="font-family:"Times New Roman",serif">I also realized that it works if I reduce the number of RHS. But I am running the code on a cluster with 256 GB of RAM per node. One dense matrix would be around 30 GB, so 60 GB for the two, which is large but does not exceed the memory of even one node, and I also get the segfault if I run it on several nodes. Moreover, it works well with the MUMPS and MKL_CPARDISO solvers. The maximum memory used with MUMPS is around 150 GB during the solver phase, but SuperLU_dist crashed even before reaching the solver phase. Could there be such a large difference in memory usage between SuperLU_dist and MUMPS?</span></span></span></span></span></p>
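<p>The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming double-precision complex entries (16 bytes each, consistent with the doublecomplex type in the SuperLU_DIST line quoted in this thread), the matrix size of 42549 reported elsewhere in the thread, and decimal gigabytes for the node memory; the helper name dense_bytes is made up for illustration.</p>
<pre>
```python
# Back-of-the-envelope check of the dense-matrix memory discussed above.
# Assumption: every entry is a double-precision complex number (16 bytes).
BYTES_PER_COMPLEX = 16


def dense_bytes(rows, cols, bytes_per_entry=BYTES_PER_COMPLEX):
    """Bytes needed to store a rows x cols dense matrix."""
    return rows * cols * bytes_per_entry


n = 42549                       # size of A reported in the thread
one_matrix = dense_bytes(n, n)  # B or X alone
both = 2 * one_matrix           # B and X together
node_ram = 256 * 10**9          # 256 GB per node, assumed decimal units

print(f"one dense matrix: {one_matrix / 1e9:.1f} GB")
print(f"B and X together: {both / 1e9:.1f} GB")
print(f"share of one node: {both / node_ram:.0%}")
```
</pre>
<p>Under these assumptions each dense matrix takes roughly 29 GB and the pair roughly 58 GB, which is in line with the figures quoted above. It says nothing about the solver's own working storage.</p>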
<p> </p>
<p><span style="font-size:11pt"><span style="line-height:normal"><span style="font-family:Calibri,sans-serif"><span style="font-size:12pt"><span style="font-family:"Times New Roman",serif">best,</span></span></span></span></span></p>
<p><span style="font-size:11pt"><span style="line-height:normal"><span style="font-family:Calibri,sans-serif"><span style="font-size:12pt"><span style="font-family:"Times New Roman",serif">marius</span></span></span></span></span></p>
<div>
<div name="quote" style="margin:10px 5px 5px 10px;padding:10px 0px 10px 10px;border-left:2px solid rgb(195,217,229)">
<div style="margin:0px 0px 10px"><b>Sent:</b> Thursday, October 29, 2020 at 10:10<br>
<b>From:</b> "Zhang, Hong" <<a href="mailto:hzhang@mcs.anl.gov" target="_blank">hzhang@mcs.anl.gov</a>><br>
<b>To:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" target="_blank">mbuerkle@web.de</a>><br>
<b>Cc:</b> "<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" target="_blank">xiaoye@nersc.gov</a>><br>
<b>Subject:</b> Re: Re: [petsc-users] superlu_dist segfault</div>
<div name="quoted-content">
<div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><span style="color:rgb(32,31,30);font-family:Verdana;font-size:12px;background-color:rgb(255,255,255);display:inline">Marius,</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><span style="color:rgb(32,31,30);font-family:Verdana;font-size:12px;background-color:rgb(255,255,255);display:inline">I tested your code with petsc-release on my Mac laptop using np=2 cores. I first tested a small matrix data file successfully. Then I switched to your data file and ran out of memory, likely due to the dense matrices B and X. I got the error "Your system has run out of application memory" from my laptop.</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"> </div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><span style="color:rgb(32,31,30);font-family:Verdana;font-size:12px;background-color:rgb(255,255,255);display:inline">The sparse matrix A has size 42549 by 42549. Your code creates dense matrices B and X with the same size -- a huge memory requirement!</span></div>
<div><font color="#201f1e" face="Verdana"><span style="font-size:12px">After replacing B and X with matrices of size <span style="background-color:rgb(255,255,255);display:inline">42549 by<span> nrhs (nrhs &lt;= 4000), the code ran well with np=2. Note the error message you got: </span></span></span></font></div>
<div><font color="#201f1e" face="Verdana"><span style="font-size:12px"><span style="background-color:rgb(255,255,255);display:inline"><span><span style="background-color:rgb(255,255,255);display:inline">[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range</span></span></span></span></font></div>
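<div>To put the reduced number of right-hand sides in perspective, here is a minimal sketch of the per-rank storage for one dense 42549 by nrhs matrix. It assumes rows are split evenly across MPI ranks and that each entry is a 16-byte double-precision complex number; neither assumption is stated in this thread, and the helper name per_rank_bytes is made up for illustration.</div>
<pre>
```python
# Per-rank storage for one dense n_rows x nrhs matrix of complex entries.
# Assumptions (not stated in the thread): rows are split evenly across MPI
# ranks, and each entry takes 16 bytes.
import math


def per_rank_bytes(n_rows, nrhs, n_ranks, bytes_per_entry=16):
    """Bytes one rank holds if it owns ceil(n_rows / n_ranks) full rows."""
    local_rows = math.ceil(n_rows / n_ranks)
    return local_rows * nrhs * bytes_per_entry


n = 42549  # size of A reported above
for nrhs in (42549, 4000):
    gb = per_rank_bytes(n, nrhs, n_ranks=2) / 1e9
    print(f"nrhs={nrhs}: about {gb:.2f} GB per rank for each of B and X")
```
</pre>
<div>With np=2 this gives roughly 14.5 GB per rank per matrix for the full-size case, against roughly 1.4 GB for nrhs = 4000.</div>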
<div> </div>
<div>The modified code I used is attached.</div>
<div><font color="#201f1e" face="Verdana"><span style="font-size:12px"><span style="background-color:rgb(255,255,255);display:inline"><span><span style="background-color:rgb(255,255,255);display:inline">Hong</span></span></span></span></font></div>
<div id="gmail-m_2827192423146367754appendonsend"> </div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_2827192423146367754divRplyFwdMsg"><font color="#000000" face="Calibri, sans-serif" style="font-size:11pt"><b>From:</b> Marius Buerkle <<a href="mailto:mbuerkle@web.de" target="_blank">mbuerkle@web.de</a>><br>
<b>Sent:</b> Tuesday, October 27, 2020 10:01 PM<br>
<b>To:</b> Zhang, Hong <<a href="mailto:hzhang@mcs.anl.gov" target="_blank">hzhang@mcs.anl.gov</a>><br>
<b>Cc:</b> <a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>>; Sherry Li <<a href="mailto:xiaoye@nersc.gov" target="_blank">xiaoye@nersc.gov</a>><br>
<b>Subject:</b> Aw: Re: [petsc-users] superlu_dist segfault</font>
<div> </div>
</div>
<div>
<div style="font-family:Verdana;font-size:12px">
<div>Hi,</div>
<div> </div>
<div>I recompiled PETSc with the debug option; now I get a segfault at a different location:</div>
<div> </div>
<div>[23]PETSC ERROR: ------------------------------------------------------------------------<br>
[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range<br>
[23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br>
[23]PETSC ERROR: or see <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a><br>
[23]PETSC ERROR: or try <a href="http://valgrind.org" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors<br>
[23]PETSC ERROR: likely location of problem given in stack below<br>
[23]PETSC ERROR: --------------------- Stack Frames ------------------------------------<br>
[23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,<br>
[23]PETSC ERROR: INSTEAD the line number of the start of the function<br>
[23]PETSC ERROR: is given.<br>
[23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c<br>
[23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c<br>
[23]PETSC ERROR: [23] MatMatSolve line 3466 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c<br>
[23]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------<br>
[23]PETSC ERROR: Signal received</div>
<div> </div>
<div>I made a small reproducer. The matrix is a bit too big to attach directly to the email, so I put it in the cloud:</div>
<div><a href="https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw" target="_blank">https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw</a></div>
<div> </div>
<div>Best,</div>
<div>Marius</div>
<div>
<div>
<div style="margin:10px 5px 5px 10px;padding:10px 0px 10px 10px;border-left:2px solid rgb(195,217,229)">
<div style="margin:0px 0px 10px"><b>Sent:</b> Tuesday, October 27, 2020 at 23:11<br>
<b>From:</b> "Zhang, Hong" <<a href="mailto:hzhang@mcs.anl.gov" target="_blank">hzhang@mcs.anl.gov</a>><br>
<b>To:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" target="_blank">mbuerkle@web.de</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" target="_blank">xiaoye@nersc.gov</a>><br>
<b>Subject:</b> Re: [petsc-users] superlu_dist segfault</div>
<div>
<div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><span style="font-family:Verdana;font-size:12px;background-color:rgb(255,255,255);display:inline">Marius,</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><span style="font-family:Verdana;font-size:12px;background-color:rgb(255,255,255);display:inline">It fails at the line <span style="background-color:rgb(255,255,255);display:inline">1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c</span></span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><span style="font-family:Verdana;font-size:12px;background-color:rgb(255,255,255);display:inline"><span style="background-color:rgb(255,255,255);display:inline"> if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");</span></span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"> </div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><span style="font-family:Verdana;font-size:12px;background-color:rgb(255,255,255);display:inline"><span style="background-color:rgb(255,255,255);display:inline">We do not know what it means. You may use a debugger to check the values of the variables involved.</span></span></div>
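<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">As a rough guide to what to look for in the debugger, the size requested on that line is the product sizelsum * num_thread * sizeof(doublecomplex). The sketch below only does that arithmetic: sizeof(doublecomplex) is 16 bytes (two doubles), while the sizelsum value is a made-up placeholder, not a number from this thread, and the helper name lsum_request_bytes is hypothetical.</div>
<pre>
```python
# Size of the request made by
#   SUPERLU_MALLOC(sizelsum * num_thread * sizeof(doublecomplex))
# sizeof(doublecomplex) is 16 bytes (two doubles). The sizelsum value used
# below is a placeholder; read the real one in the debugger.
SIZEOF_DOUBLECOMPLEX = 16


def lsum_request_bytes(sizelsum, num_thread):
    """Bytes requested for lsum[] given the two variables in the malloc call."""
    return sizelsum * num_thread * SIZEOF_DOUBLECOMPLEX


hypothetical_sizelsum = 100_000_000  # placeholder, NOT taken from the thread
for num_thread in (1, 2, 4, 8):
    gb = lsum_request_bytes(hypothetical_sizelsum, num_thread) / 1e9
    print(f"num_thread={num_thread}: {gb:.1f} GB requested")
```
</pre>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">The request grows linearly with num_thread, so the values of sizelsum and num_thread at the failing call are the two numbers worth checking first.</div>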
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><span style="font-family:Verdana;font-size:12px;background-color:rgb(255,255,255);display:inline"><span style="background-color:rgb(255,255,255);display:inline">I'm cc'ing Sherry (superlu_dist developer); alternatively, you may send us a short stand-alone code that reproduces the error, and we can help investigate it.</span></span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><span style="font-family:Verdana;font-size:12px;background-color:rgb(255,255,255);display:inline"><span style="background-color:rgb(255,255,255);display:inline">Hong</span></span></div>
<div id="gmail-m_2827192423146367754x_appendonsend"> </div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"> </div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_2827192423146367754x_divRplyFwdMsg"><font color="#000000" face="Calibri, sans-serif" style="font-size:11pt"><b>From:</b> petsc-users <<a href="mailto:petsc-users-bounces@mcs.anl.gov" target="_blank">petsc-users-bounces@mcs.anl.gov</a>> on behalf of Marius Buerkle <<a href="mailto:mbuerkle@web.de" target="_blank">mbuerkle@web.de</a>><br>
<b>Sent:</b> Tuesday, October 27, 2020 8:46 AM<br>
<b>To:</b> <a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>><br>
<b>Subject:</b> [petsc-users] superlu_dist segfault</font>
<div> </div>
</div>
<div>
<div style="font-family:Verdana;font-size:12px">
<div>Hi,</div>
<div> </div>
<div>When using MatMatSolve with superlu_dist I get a segmentation fault:</div>
<div> </div>
<div>Malloc fails for lsum[]. at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c</div>
<div> </div>
<div>The matrix is not particularly large. I am using the PETSc release branch, and superlu_dist is v6.3.0, I think.</div>
<div> </div>
<div>Best,</div>
<div>Marius</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div></div></div>
</blockquote></div>