<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class=""><br class=""></div> Code?<br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Nov 2, 2020, at 9:27 AM, Marius Buerkle <<a href="mailto:mbuerkle@web.de" class="">mbuerkle@web.de</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div class=""><div style="font-family: Verdana;font-size: 12.0px;" class=""><div class=""> </div>
<div class="">
<div class="">The matrix is a bit too big for email attachment, I put it on onedrive</div>
<div class=""><a href="https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw" target="_blank" class="">https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw</a></div>
<div class=""> </div>
<div class="">
<div name="quote" style="margin:10px 5px 5px 10px; padding: 10px 0 10px 10px; border-left:2px solid #C3D9E5; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
<div style="margin:0 0 10px 0;" class=""><b class="">Gesendet:</b> Montag, 02. November 2020 um 23:58 Uhr<br class="">
<b class="">Von:</b> "Barry Smith" <<a href="mailto:bsmith@petsc.dev" class="">bsmith@petsc.dev</a>><br class="">
<b class="">An:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" class="">mbuerkle@web.de</a>><br class="">
<b class="">Cc:</b> "Stefano Zampini" <<a href="mailto:stefano.zampini@gmail.com" class="">stefano.zampini@gmail.com</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" class="">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" class="">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" class="">xiaoye@nersc.gov</a>><br class="">
<b class="">Betreff:</b> Re: [petsc-users] superlu_dist segfault</div>
<div name="quoted-content" class="">
<div class="">
<div class=""> </div>
Please send this program and your data file. This should definitely not be happening.
<div class=""> </div>
<div class=""> Barry</div>
<div class=""> </div>
<div class=""> Valgrind is generally trustworthy.
<div class="">
<blockquote class="">
<div class="">On Nov 2, 2020, at 12:21 AM, Marius Buerkle <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>> wrote:</div>
<div class="">
<div class="">
<div style="font-family: Verdana;font-size: 12.0px;" class="">
<div class="">Hi,</div>
<div class=""> </div>
<div class="">I tried valgrind with track-origins, valgrind crashes at somepoint due to running out of energy though. But before I get a lot of </div>
<div class="">"Conditional jump or move depends on uninitialised value(s)" and "Use of uninitialised value of size 8" not all of them related to Petsc but some of them are during MatLoad, PCSetup_LU, and also in Superlu. For example</div>
<div class=""> </div>
<div class="">==41867== Conditional jump or move depends on uninitialised value(s)<br class="">
==41867== at 0x5DEA7C4: MatSetValues_MPIAIJ (mpiaij.c:601)<br class="">
==41867== by 0x5E310D8: MatMPIAIJSetPreallocationCSR_MPIAIJ (mpiaij.c:4031)<br class="">
==41867== by 0x5E31773: MatMPIAIJSetPreallocationCSR (mpiaij.c:4091)<br class="">
==41867== by 0x5E2184C: MatLoad_MPIAIJ_Binary (mpiaij.c:3197)<br class="">
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)<br class="">
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)<br class="">
==41867== by 0x4063ED: main (superlu_test.c:28)<br class="">
==41867== Uninitialised value was created by a heap allocation<br class="">
==41867== at 0x4C2D814: memalign (vg_replace_malloc.c:906)<br class="">
==41867== by 0x50220D6: PetscMallocAlign (mal.c:52)<br class="">
==41867== by 0x50242D4: PetscMallocA (mal.c:425)<br class="">
==41867== by 0x5E20FC2: MatLoad_MPIAIJ_Binary (mpiaij.c:3187)<br class="">
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)<br class="">
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)<br class="">
==41867== by 0x4063ED: main (superlu_test.c:28)<br class="">
==41867==<br class="">
==41867== Use of uninitialised value of size 8<br class="">
==41867== at 0x5DEA8AE: MatSetValues_MPIAIJ (mpiaij.c:603)<br class="">
==41867== by 0x5E310D8: MatMPIAIJSetPreallocationCSR_MPIAIJ (mpiaij.c:4031)<br class="">
==41867== by 0x5E31773: MatMPIAIJSetPreallocationCSR (mpiaij.c:4091)<br class="">
==41867== by 0x5E2184C: MatLoad_MPIAIJ_Binary (mpiaij.c:3197)<br class="">
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)<br class="">
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)<br class="">
==41867== by 0x4063ED: main (superlu_test.c:28)<br class="">
==41867== Uninitialised value was created by a heap allocation<br class="">
==41867== at 0x4C2D814: memalign (vg_replace_malloc.c:906)<br class="">
==41867== by 0x50220D6: PetscMallocAlign (mal.c:52)<br class="">
==41867== by 0x50242D4: PetscMallocA (mal.c:425)<br class="">
==41867== by 0x5E20FC2: MatLoad_MPIAIJ_Binary (mpiaij.c:3187)<br class="">
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)<br class="">
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)<br class="">
==41867== by 0x4063ED: main (superlu_test.c:28)</div>
<div class=""> </div>
<div class="">I don't know if this are real errors or only some problem of valgrind. I attached th whole valgrind logs, they are rather noisy though.</div>
<div class=""> </div>
<div class="">Best,</div>
<div class="">Marius</div>
<div class="">
<div class="">
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);" class="">
<div style="margin: 0 0 10.0px 0;" class=""><b class="">Gesendet:</b> Sonntag, 01. November 2020 um 19:09 Uhr<br class="">
<b class="">Von:</b> "Stefano Zampini" <<a href="mailto:stefano.zampini@gmail.com" target="_blank" class="">stefano.zampini@gmail.com</a>><br class="">
<b class="">An:</b> "Barry Smith" <<a href="mailto:bsmith@petsc.dev" target="_blank" class="">bsmith@petsc.dev</a>><br class="">
<b class="">Cc:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" target="_blank" class="">xiaoye@nersc.gov</a>><br class="">
<b class="">Betreff:</b> Re: [petsc-users] superlu_dist segfault</div>
<div class="">
<div class="">More importantly,
<div class="">
<div class="">==43569== Conditional jump or move depends on uninitialised value(s)<br class="">
==43569== at 0x1473C515: pzgstrs (pzgstrs.c:1074)<br class="">
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br class="">
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br class="">
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br class="">
==43569== by 0x40465D: main (superlu_test.c:59)</div>
</div>
<div class=""> </div>
<div class="">You should run using valgrind's option --track-origins=yes to understand the reason for this. </div>
</div>
<div class="gmail_quote">
<div class="gmail_attr">Il giorno dom 1 nov 2020 alle ore 11:53 Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank" class="">bsmith@petsc.dev</a>> ha scritto:</div>
<blockquote class="gmail_quote" style="margin: 0.0px 0.0px 0.0px 0.8ex;border-left: 1.0px solid rgb(204,204,204);padding-left: 1.0ex;">
<div class="">
<div class=""> </div>
<div class=""> You can sometimes use -on_error_attach_debugger noxterm and it will try to attach just in the console you started the job. If you are lucky this works and you use bt and see the stack and look at variables. But if multiple ranks crash the debugger will get confused and even if only one crashes if it is not rank zero the stty can get messed up so you cannot type to control the debugger.</div>
<div class=""> </div>
<div class=""> The valgrind information is very valuable, likely Sherry can look at those lines and have a really good idea what the problem is, for example,</div>
<div class=""> </div>
<div class="">
<blockquote class="">
<div class="">
<div style="font-family: Verdana;font-size: 12.0px;" class="">
<div class="">Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd</div>
</div>
</div>
</blockquote>
</div>
<div class=""> means that for some reason the code is writing past the end of an allocated array, either because the array allocated was not long enough or the code has some issue where it wants to write further than it should. This kind of thing is very common and usually easy to debug by someone who knows the code once they know exactly what line of code is problematic. Since it shows exactly where the memory was allocated and exactly where it went out of bounds.</div>
<div class=""> </div>
<div class=""> Barry</div>
<div class=""> </div>
<div class="">
<blockquote class="">
<div class="">On Nov 1, 2020, at 1:21 AM, Marius Buerkle <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>> wrote:</div>
<div class="">
<div class="">
<div style="font-family: Verdana;font-size: 12.0px;" class="">
<div class="">Hi,</div>
<div class=""> </div>
<div class="">I cannot use on_error_attach_debugger as X forwarding does not work on the system. Is it possible to dump the gdb output to file instead? </div>
<div class=""> </div>
<div class="">I run it through valgrind. It seems there is some problem during calls in superlu_dist but I don't know if this eventually causes the seg fault. I think this is the relevant output:</div>
<div class=""> </div>
<div class="">==43569== Conditional jump or move depends on uninitialised value(s)<br class="">
==43569== at 0x1473C515: pzgstrs (pzgstrs.c:1074)<br class="">
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br class="">
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br class="">
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br class="">
==43569== by 0x40465D: main (superlu_test.c:59)<br class="">
==43569==<br class="">
==43569== Use of uninitialised value of size 8<br class="">
==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)<br class="">
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br class="">
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br class="">
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br class="">
==43569== by 0x40465D: main (superlu_test.c:59)<br class="">
==43569==<br class="">
==43569== Use of uninitialised value of size 8<br class="">
==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)<br class="">
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br class="">
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br class="">
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br class="">
==43569== by 0x40465D: main (superlu_test.c:59)<br class="">
==43569==<br class="">
==43569== Invalid write of size 8<br class="">
==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)<br class="">
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br class="">
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br class="">
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br class="">
==43569== by 0x40465D: main (superlu_test.c:59)<br class="">
==43569== Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd<br class="">
==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)<br class="">
==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)<br class="">
==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)<br class="">
==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)<br class="">
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br class="">
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br class="">
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br class="">
==43569== by 0x40465D: main (superlu_test.c:59)<br class="">
==43569==<br class="">
==43569== Invalid write of size 8<br class="">
==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)<br class="">
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br class="">
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br class="">
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br class="">
==43569== by 0x40465D: main (superlu_test.c:59)<br class="">
==43569== Address 0x266e5ad0 is 16 bytes after a block of size 35,520 alloc'd<br class="">
==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)<br class="">
==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)<br class="">
==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)<br class="">
==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)<br class="">
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br class="">
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br class="">
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br class="">
==43569== by 0x40465D: main (superlu_test.c:59)<br class="">
==43569==</div>
<div class=""> </div>
<div class="">I also attached the whole log. Does this make any sense? The problem seems to be around where I get the original segfault.</div>
<div class=""> </div>
<div class="">best,</div>
<div class="">marius</div>
<div class="">
<div class="">
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0.0px 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);" class="">
<div style="margin: 0.0px 0.0px 10.0px;" class=""><b class="">Gesendet:</b> Samstag, 31. Oktober 2020 um 04:07 Uhr<br class="">
<b class="">Von:</b> "Barry Smith" <<a href="mailto:bsmith@petsc.dev" target="_blank" class="">bsmith@petsc.dev</a>><br class="">
<b class="">An:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>><br class="">
<b class="">Cc:</b> "Xiaoye S. Li" <<a href="mailto:xsli@lbl.gov" target="_blank" class="">xsli@lbl.gov</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" target="_blank" class="">xiaoye@nersc.gov</a>><br class="">
<b class="">Betreff:</b> Re: [petsc-users] superlu_dist segfault</div>
<div class="">
<div class="">
<div class=""> </div>
 Have you run it with valgrind yet? There could be memory corruption earlier that causes a later crash; crashes that occur at different places for the same run are almost always due to memory corruption.
<div class=""> </div>
<div class=""> If valgrind is clean you can run with -on_error_attach_debugger and if the X forwarding is set up it will open a debugger on the crashing process and you can type bt to see exactly where it is crashing, at what line number and code line.</div>
<div class=""> </div>
<div class=""> Barry</div>
<div class="">
<div class="">
<blockquote class="">
<div class="">On Oct 29, 2020, at 1:04 AM, Marius Buerkle <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>> wrote:</div>
<div class="">
<div class="">
<div style="font-family: Verdana;font-size: 12.0px;" class="">
<div class="">Hi Sherry,</div>
<div class=""> </div>
<div class="">I used only 1 OpenMP thread and I also recompiled PETSC in debug mode with OpenMP turned off. But did not help. </div>
<div class=""> </div>
<div class="">Here is the output I can get from SuperLu during the PETSC run</div>
<div class="">
<div class=""> Nonzeros in L 29519630<br class="">
Nonzeros in U 29519630<br class="">
nonzeros in L+U 58996711<br class="">
nonzeros in LSUB 4509612</div>
<div class="">** Memory Usage **********************************<br class="">
** NUMfact space (MB): (sum-of-all-processes)<br class="">
L\U : 952.18 | Total : 1980.60<br class="">
** Total highmark (MB):<br class="">
Sum-of-all : 12401.85 | Avg : 387.56 | Max : 387.56<br class="">
**************************************************<br class="">
**************************************************<br class="">
**** Time (seconds) ****<br class="">
EQUIL time 0.06<br class="">
ROWPERM time 1.03<br class="">
COLPERM time 1.01<br class="">
SYMBFACT time 0.45<br class="">
DISTRIBUTE time 0.33<br class="">
FACTOR time 0.90<br class="">
Factor flops 2.225916e+11 Mflops 247438.62<br class="">
SOLVE time 0.000<br class="">
**************************************************</div>
<div class=""> </div>
<div class="">I tried all available ordering options for Colperm (NATURAL,MMD_AT_PLUS_A,MMD_ATA,METIS_AT_PLUS_A), save for parmetis which always crashes. For Rowperm I used NOROWPERM, LargeDiag_MC64. All gives the same seg. fault.</div>
</div>
<div class="">
<div class="">
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0.0px 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);" class="">
<div style="margin: 0.0px 0.0px 10.0px;" class=""><b class="">Gesendet:</b> Donnerstag, 29. Oktober 2020 um 14:14 Uhr<br class="">
<b class="">Von:</b> "Xiaoye S. Li" <<a href="mailto:xsli@lbl.gov" target="_blank" class="">xsli@lbl.gov</a>><br class="">
<b class="">An:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>><br class="">
<b class="">Cc:</b> "Zhang, Hong" <<a href="mailto:hzhang@mcs.anl.gov" target="_blank" class="">hzhang@mcs.anl.gov</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" target="_blank" class="">xiaoye@nersc.gov</a>><br class="">
<b class="">Betreff:</b> Re: Re: Re: [petsc-users] superlu_dist segfault</div>
<div class="">
<div class="">
<div class="gmail_default" style="font-family: verdana , sans-serif;font-size: small;">Hong: thanks for the diagnosis!</div>
<div class="gmail_default" style="font-family: verdana , sans-serif;font-size: small;"> </div>
<div class="gmail_default" style="font-family: verdana , sans-serif;font-size: small;">Marius: how many OpenMP threads are you using per MPI task?</div>
<div class="gmail_default" style="font-family: verdana , sans-serif;font-size: small;">In an earlier email, you mentioned the allocation failure at the following line:</div>
<div class="gmail_default" style="font-family: verdana , sans-serif;font-size: small;">
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;" class=""> if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class="">this is in the solve phase. I think when we do some OpenMP optimization, we allowed several data structures to grow with OpenMP threads. You can try to use 1 thread.</div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><br class="">
The RHS and X memories are easy to compute. However, i<span style="font-size: 12.0pt;" class="">n order to gauge how much memory is used in the factorization, can you print out the number of nonzeros in the L and U factors? What ordering option are you using? The sparse matrix A looks pretty small.</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-size: 12.0pt;" class="">The code can also print out the working storage used during factorization. I am not sure how this printing can be turned on through PETSc.</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-size: 12.0pt;" class="">Sherry</span></div>
</div>
</div>
<div class="gmail_quote">
<div class="gmail_attr">On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>> wrote:</div>
<blockquote class="gmail_quote" style="margin: 0.0px 0.0px 0.0px 0.8ex;border-left: 1.0px solid rgb(204,204,204);padding-left: 1.0ex;">
<div class="">
<div style="font-family: Verdana;font-size: 12.0px;" class="">
<div class=""><p class=""><span style="font-size: 11.0pt;" class=""><span style="line-height: normal;" class=""><span style="font-family: Calibri , sans-serif;" class=""><span style="font-size: 12.0pt;" class=""><span style="font-family: "Times New Roman" , serif;" class="">Thanks for the swift reply. </span></span></span></span></span></p><p class=""><span style="font-size: 11.0pt;" class=""><span style="line-height: normal;" class=""><span style="font-family: Calibri , sans-serif;" class=""><span style="font-size: 12.0pt;" class=""><span style="font-family: "Times New Roman" , serif;" class="">I also realized if I reduce the number of RHS then it works. But I am running the code on a cluster with 256GB ram / node. One dense matrix would be around ~30 Gb so 60 Gb, which is large but does exceed the memory of even one node and I also get the seg fault if I run it on several nodes. Moreover, it works well with MUMPS and MKL_CPARDISO solver. The maxium memory used when using MUMPS is around 150 Gb during the solver phase but for SuperLU_dist it crashed even before reaching the solver phase. Could there be such a large difference in memory usage between SuperLu_dist and MUMPS ?</span></span></span></span></span></p>
<div class=""> </div><p class=""><span style="font-size: 11.0pt;" class=""><span style="line-height: normal;" class=""><span style="font-family: Calibri , sans-serif;" class=""><span style="font-size: 12.0pt;" class=""><span style="font-family: "Times New Roman" , serif;" class="">best,</span></span></span></span></span></p><p class=""><span style="font-size: 11.0pt;" class=""><span style="line-height: normal;" class=""><span style="font-family: Calibri , sans-serif;" class=""><span style="font-size: 12.0pt;" class=""><span style="font-family: "Times New Roman" , serif;" class="">marius</span></span></span></span></span></p>
<div class="">
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0.0px 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);" class="">
<div style="margin: 0.0px 0.0px 10.0px;" class=""><b class="">Gesendet:</b> Donnerstag, 29. Oktober 2020 um 10:10 Uhr<br class="">
<b class="">Von:</b> "Zhang, Hong" <<a href="mailto:hzhang@mcs.anl.gov" target="_blank" class="">hzhang@mcs.anl.gov</a>><br class="">
<b class="">An:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>><br class="">
<b class="">Cc:</b> "<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" target="_blank" class="">xiaoye@nersc.gov</a>><br class="">
<b class="">Betreff:</b> Re: Re: [petsc-users] superlu_dist segfault</div>
<div class="">
<div class="">
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="color: rgb(32,31,30);font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class="">Marius,</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="color: rgb(32,31,30);font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class="">I tested your code with petsc-release on my mac laptop using np=2 cores. I first tested a small matrix data file successfully. Then I switch to your data file and run out of memory, likely due to the dense matrices B and X. I got an error "Your system has run out of application memory" from my laptop.</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="color: rgb(32,31,30);font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class="">The sparse matrix A has size 42549 by 42549. Your code creates dense matrices B and X with the same size -- a huge memory requirement!</span></div>
<div class=""><font color="#201f1e" face="Verdana" class=""><span style="font-size: 12.0px;" class="">By replacing B and X with size <span style="background-color: rgb(255,255,255);display: inline;" class="">42549 by<span class=""> nrhs (nrhs =< 4000), I had the code run well with np=2. Note the error message you got </span></span></span></font></div>
<div class=""><font color="#201f1e" face="Verdana" class=""><span style="font-size: 12.0px;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class=""><span class=""><span style="background-color: rgb(255,255,255);display: inline;" class="">[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range</span></span></span></span></font></div>
<div class=""> </div>
<div class="">The modified code I used is attached.</div>
<div class=""><font color="#201f1e" face="Verdana" class=""><span style="font-size: 12.0px;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class=""><span class=""><span style="background-color: rgb(255,255,255);display: inline;" class="">Hong</span></span></span></span></font></div>
<div id="gmail-m_7151358651855493837gmail-m_2827192423146367754appendonsend" class=""> </div>
<hr style="display: inline-block;width: 98.0%;" class="">
<div id="gmail-m_7151358651855493837gmail-m_2827192423146367754divRplyFwdMsg" class=""><font face="Calibri, sans-serif" style="font-size: 11.0pt;" class=""><b class="">From:</b> Marius Buerkle <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>><br class="">
<b class="">Sent:</b> Tuesday, October 27, 2020 10:01 PM<br class="">
<b class="">To:</b> Zhang, Hong <<a href="mailto:hzhang@mcs.anl.gov" target="_blank" class="">hzhang@mcs.anl.gov</a>><br class="">
<b class="">Cc:</b> <a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>>; Sherry Li <<a href="mailto:xiaoye@nersc.gov" target="_blank" class="">xiaoye@nersc.gov</a>><br class="">
<b class="">Subject:</b> Aw: Re: [petsc-users] superlu_dist segfault</font>
<div class=""> </div>
</div>
<div class="">
<div style="font-family: Verdana;font-size: 12.0px;" class="">
<div class="">Hi,</div>
<div class=""> </div>
<div class="">I recompiled PETSC with debug option, now I get a seg fault at a different position</div>
<div class=""> </div>
<div class="">[23]PETSC ERROR: ------------------------------------------------------------------------<br class="">
[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range<br class="">
[23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br class="">
[23]PETSC ERROR: or see <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank" class="">https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a><br class="">
[23]PETSC ERROR: or try <a href="http://valgrind.org/" target="_blank" class="">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors<br class="">
[23]PETSC ERROR: likely location of problem given in stack below<br class="">
[23]PETSC ERROR: --------------------- Stack Frames ------------------------------------<br class="">
[23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,<br class="">
[23]PETSC ERROR: INSTEAD the line number of the start of the function<br class="">
[23]PETSC ERROR: is given.<br class="">
[23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c<br class="">
[23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c<br class="">
[23]PETSC ERROR: [23] MatMatSolve line 3466 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c<br class="">
[23]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------<br class="">
[23]PETSC ERROR: Signal received</div>
<div class=""> </div>
<div class="">I made a small reproducer. The matrix is a bit too big so I cannot attach it directly to the email, but I put it in the cloud</div>
<div class=""><a href="https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw" target="_blank" class="">https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw</a></div>
<div class=""> </div>
<div class="">Best,</div>
<div class="">Marius</div>
<div class="">
<div class="">
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0.0px 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);" class="">
<div style="margin: 0.0px 0.0px 10.0px;" class=""><b class="">Gesendet:</b> Dienstag, 27. Oktober 2020 um 23:11 Uhr<br class="">
<b class="">Von:</b> "Zhang, Hong" <<a href="mailto:hzhang@mcs.anl.gov" target="_blank" class="">hzhang@mcs.anl.gov</a>><br class="">
<b class="">An:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" target="_blank" class="">xiaoye@nersc.gov</a>><br class="">
<b class="">Betreff:</b> Re: [petsc-users] superlu_dist segfault</div>
<div class="">
<div class="">
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class="">Marius,</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class="">It fails at the line <span style="background-color: rgb(255,255,255);display: inline;" class="">1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c</span></span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class=""> if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");</span></span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class="">We do not know what it means. You may use a debugger to check the values of the variables involved.</span></span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class="">I'm cc'ing Sherry (superlu_dist developer), or you may send us a stand-alone short code that reproduce the error. We can help on its investigation.</span></span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;" class=""><span style="background-color: rgb(255,255,255);display: inline;" class="">Hong</span></span></div>
<div id="gmail-m_7151358651855493837gmail-m_2827192423146367754x_appendonsend" class=""> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;" class=""> </div>
<hr style="display: inline-block;width: 98.0%;" class="">
<div id="gmail-m_7151358651855493837gmail-m_2827192423146367754x_divRplyFwdMsg" class=""><font face="Calibri, sans-serif" style="font-size: 11.0pt;" class=""><b class="">From:</b> petsc-users <<a href="mailto:petsc-users-bounces@mcs.anl.gov" target="_blank" class="">petsc-users-bounces@mcs.anl.gov</a>> on behalf of Marius Buerkle <<a href="mailto:mbuerkle@web.de" target="_blank" class="">mbuerkle@web.de</a>><br class="">
<b class="">Sent:</b> Tuesday, October 27, 2020 8:46 AM<br class="">
<b class="">To:</b> <a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank" class="">petsc-users@mcs.anl.gov</a>><br class="">
<b class="">Subject:</b> [petsc-users] superlu_dist segfault</font>
<div class=""> </div>
</div>
<div class="">
<div style="font-family: Verdana;font-size: 12.0px;" class="">
<div class="">Hi,</div>
<div class=""> </div>
<div class="">When using MatMatSolve with superlu_dist I get a segmentation fault:</div>
<div class=""> </div>
<div class="">Malloc fails for lsum[]. at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c</div>
<div class=""> </div>
<div class="">The matrix size is not particular big and I am using the petsc release branch and superlu_dist is v6.3.0 I think.</div>
<div class=""> </div>
<div class="">Best,</div>
<div class="">Marius</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<span id="gmail-m_7151358651855493837cid:8484CCBB-7318-4599-8119-03B504563992@hsd1.il.comcast.net" class=""><valgrind.tar.gz></span></div>
</blockquote>
</div>
</div>
</blockquote>
</div>
<div class=""> </div>
--
<div class="gmail_signature">Stefano</div>
</div>
</div>
</div>
</div>
</div>
</div>
<span id="cid:0C64BCA4-C017-46BE-B4E4-5B00C48E7177@hsd1.il.comcast.net" class=""><valgrind_track-origins.tar.gz></span></div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</div>
</div></div></div>
</div></blockquote></div><br class=""></body></html>