<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"><div>Sorry, it is also included in the archive on OneDrive; I should have mentioned that. It is the same code and data that I sent at the beginning; I don't think I changed anything.</div>
<div>
<div>
<div name="quote" style="margin:10px 5px 5px 10px; padding: 10px 0 10px 10px; border-left:2px solid #C3D9E5; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">
<div style="margin:0 0 10px 0;"><b>Sent:</b> Tuesday, 03 November 2020 at 00:45<br/>
<b>From:</b> "Barry Smith" <bsmith@petsc.dev><br/>
<b>To:</b> "Marius Buerkle" <mbuerkle@web.de><br/>
<b>Cc:</b> "Stefano Zampini" <stefano.zampini@gmail.com>, "petsc-users@mcs.anl.gov" <petsc-users@mcs.anl.gov>, "Sherry Li" <xiaoye@nersc.gov><br/>
<b>Subject:</b> Re: [petsc-users] superlu_dist segfault</div>
<div name="quoted-content">
<div>
<div> </div>
Code?
<div>
<blockquote>
<div>On Nov 2, 2020, at 9:27 AM, Marius Buerkle <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>> wrote:</div>
<div>
<div>
<div style="font-family: Verdana;font-size: 12.0px;">
<div> </div>
<div>
<div>The matrix is a bit too big for an email attachment, so I put it on OneDrive:</div>
<div><a href="https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw" target="_blank">https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw</a></div>
<div> </div>
<div>
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);">
<div style="margin: 0 0 10.0px 0;"><b>Sent:</b> Monday, 02 November 2020 at 23:58<br/>
<b>From:</b> "Barry Smith" <<a href="mailto:bsmith@petsc.dev" onclick="parent.window.location.href='mailto:bsmith@petsc.dev'; return false;" target="_blank">bsmith@petsc.dev</a>><br/>
<b>To:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>><br/>
<b>Cc:</b> "Stefano Zampini" <<a href="mailto:stefano.zampini@gmail.com" onclick="parent.window.location.href='mailto:stefano.zampini@gmail.com'; return false;" target="_blank">stefano.zampini@gmail.com</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" onclick="parent.window.location.href='mailto:xiaoye@nersc.gov'; return false;" target="_blank">xiaoye@nersc.gov</a>><br/>
<b>Subject:</b> Re: [petsc-users] superlu_dist segfault</div>
<div>
<div>
<div> </div>
Please send this program and your data file. This should definitely not be happening.
<div> </div>
<div> Barry</div>
<div> </div>
<div> Valgrind is generally trustworthy.
<div>
<blockquote>
<div>On Nov 2, 2020, at 12:21 AM, Marius Buerkle <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>> wrote:</div>
<div>
<div>
<div style="font-family: Verdana;font-size: 12.0px;">
<div>Hi,</div>
<div> </div>
<div>I tried valgrind with --track-origins=yes. Valgrind crashes at some point because it runs out of memory, but before that I get a lot of </div>
<div>"Conditional jump or move depends on uninitialised value(s)" and "Use of uninitialised value of size 8" messages. Not all of them are related to PETSc, but some occur during MatLoad, PCSetup_LU, and also in SuperLU. For example:</div>
<div> </div>
<div>==41867== Conditional jump or move depends on uninitialised value(s)<br/>
==41867== at 0x5DEA7C4: MatSetValues_MPIAIJ (mpiaij.c:601)<br/>
==41867== by 0x5E310D8: MatMPIAIJSetPreallocationCSR_MPIAIJ (mpiaij.c:4031)<br/>
==41867== by 0x5E31773: MatMPIAIJSetPreallocationCSR (mpiaij.c:4091)<br/>
==41867== by 0x5E2184C: MatLoad_MPIAIJ_Binary (mpiaij.c:3197)<br/>
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)<br/>
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)<br/>
==41867== by 0x4063ED: main (superlu_test.c:28)<br/>
==41867== Uninitialised value was created by a heap allocation<br/>
==41867== at 0x4C2D814: memalign (vg_replace_malloc.c:906)<br/>
==41867== by 0x50220D6: PetscMallocAlign (mal.c:52)<br/>
==41867== by 0x50242D4: PetscMallocA (mal.c:425)<br/>
==41867== by 0x5E20FC2: MatLoad_MPIAIJ_Binary (mpiaij.c:3187)<br/>
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)<br/>
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)<br/>
==41867== by 0x4063ED: main (superlu_test.c:28)<br/>
==41867==<br/>
==41867== Use of uninitialised value of size 8<br/>
==41867== at 0x5DEA8AE: MatSetValues_MPIAIJ (mpiaij.c:603)<br/>
==41867== by 0x5E310D8: MatMPIAIJSetPreallocationCSR_MPIAIJ (mpiaij.c:4031)<br/>
==41867== by 0x5E31773: MatMPIAIJSetPreallocationCSR (mpiaij.c:4091)<br/>
==41867== by 0x5E2184C: MatLoad_MPIAIJ_Binary (mpiaij.c:3197)<br/>
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)<br/>
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)<br/>
==41867== by 0x4063ED: main (superlu_test.c:28)<br/>
==41867== Uninitialised value was created by a heap allocation<br/>
==41867== at 0x4C2D814: memalign (vg_replace_malloc.c:906)<br/>
==41867== by 0x50220D6: PetscMallocAlign (mal.c:52)<br/>
==41867== by 0x50242D4: PetscMallocA (mal.c:425)<br/>
==41867== by 0x5E20FC2: MatLoad_MPIAIJ_Binary (mpiaij.c:3187)<br/>
==41867== by 0x5E200FB: MatLoad_MPIAIJ (mpiaij.c:3142)<br/>
==41867== by 0x58DBDAC: MatLoad (matrix.c:1231)<br/>
==41867== by 0x4063ED: main (superlu_test.c:28)</div>
<div> </div>
<div>I don't know whether these are real errors or just valgrind artifacts. I attached the complete valgrind logs; they are rather noisy, though.</div>
<div> </div>
<div>Best,</div>
<div>Marius</div>
<div>
<div>
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);">
<div style="margin: 0 0 10.0px 0;"><b>Sent:</b> Sunday, 01 November 2020 at 19:09<br/>
<b>From:</b> "Stefano Zampini" <<a href="mailto:stefano.zampini@gmail.com" onclick="parent.window.location.href='mailto:stefano.zampini@gmail.com'; return false;" target="_blank">stefano.zampini@gmail.com</a>><br/>
<b>To:</b> "Barry Smith" <<a href="mailto:bsmith@petsc.dev" onclick="parent.window.location.href='mailto:bsmith@petsc.dev'; return false;" target="_blank">bsmith@petsc.dev</a>><br/>
<b>Cc:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" onclick="parent.window.location.href='mailto:xiaoye@nersc.gov'; return false;" target="_blank">xiaoye@nersc.gov</a>><br/>
<b>Subject:</b> Re: [petsc-users] superlu_dist segfault</div>
<div>
<div>More importantly,
<div>
<div>==43569== Conditional jump or move depends on uninitialised value(s)<br/>
==43569== at 0x1473C515: pzgstrs (pzgstrs.c:1074)<br/>
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br/>
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br/>
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br/>
==43569== by 0x40465D: main (superlu_test.c:59)</div>
</div>
<div> </div>
<div>You should run using valgrind's option --track-origins=yes to understand the reason for this. </div>
</div>
<div class="gmail_quote">
<div class="gmail_attr">On Sun, Nov 1, 2020 at 11:53 Barry Smith <<a href="mailto:bsmith@petsc.dev" onclick="parent.window.location.href='mailto:bsmith@petsc.dev'; return false;" target="_blank">bsmith@petsc.dev</a>> wrote:</div>
<blockquote class="gmail_quote" style="margin: 0.0px 0.0px 0.0px 0.8ex;border-left: 1.0px solid rgb(204,204,204);padding-left: 1.0ex;">
<div>
<div> </div>
<div> You can sometimes use -on_error_attach_debugger noxterm and it will try to attach the debugger in the console where you started the job. If you are lucky this works, and you can use bt to see the stack and look at variables. But if multiple ranks crash the debugger will get confused, and even if only one rank crashes, if it is not rank zero the tty settings can get messed up so that you cannot type to control the debugger.</div>
<div> </div>
<div> The valgrind information is very valuable, likely Sherry can look at those lines and have a really good idea what the problem is, for example,</div>
<div> </div>
<div>
<blockquote>
<div>
<div style="font-family: Verdana;font-size: 12.0px;">
<div>Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd</div>
</div>
</div>
</blockquote>
</div>
<div> means that for some reason the code is writing past the end of an allocated array, either because the allocated array was not long enough or because the code has some issue where it wants to write further than it should. This kind of thing is very common and usually easy to debug by someone who knows the code once they know exactly which line of code is problematic, since valgrind shows exactly where the memory was allocated and exactly where the write went out of bounds.</div>
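<div>To make the "0 bytes after a block" wording concrete, here is a minimal sketch in C. It assumes, purely for illustration, that the 35,520-byte block holds 16-byte complex entries; the thread does not say what is actually allocated at pzgstrs.c:1044, and the helper names block_capacity and bytes_past_end are hypothetical.</div>
<pre>
```c
#include <assert.h>
#include <stddef.h>

/* Number of whole elements of elem_size bytes that fit in a block. */
size_t block_capacity(size_t block_bytes, size_t elem_size)
{
    assert(elem_size > 0);
    return block_bytes / elem_size;
}

/* Signed distance in bytes from the end of the block to the start of
 * element `index`. A result >= 0 means the element starts at or past the
 * end of the block, which valgrind words as "N bytes after a block". */
ptrdiff_t bytes_past_end(size_t block_bytes, size_t elem_size, size_t index)
{
    assert(elem_size > 0);
    return (ptrdiff_t)(index * elem_size) - (ptrdiff_t)block_bytes;
}
```
</pre>
<div>Under that assumption the block holds 2220 entries (valid indices 0 to 2219), index 2220 starts 0 bytes after the block and index 2221 starts 16 bytes after it, which would be consistent with the two addresses in the valgrind report.</div>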
<div> </div>
<div> Barry</div>
<div> </div>
<div>
<blockquote>
<div>On Nov 1, 2020, at 1:21 AM, Marius Buerkle <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>> wrote:</div>
<div>
<div>
<div style="font-family: Verdana;font-size: 12.0px;">
<div>Hi,</div>
<div> </div>
<div>I cannot use -on_error_attach_debugger because X forwarding does not work on the system. Is it possible to dump the gdb output to a file instead? </div>
<div> </div>
<div>I ran it through valgrind. There seems to be a problem during calls in superlu_dist, but I don't know whether this eventually causes the segfault. I think this is the relevant output:</div>
<div> </div>
<div>==43569== Conditional jump or move depends on uninitialised value(s)<br/>
==43569== at 0x1473C515: pzgstrs (pzgstrs.c:1074)<br/>
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br/>
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br/>
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br/>
==43569== by 0x40465D: main (superlu_test.c:59)<br/>
==43569==<br/>
==43569== Use of uninitialised value of size 8<br/>
==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)<br/>
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br/>
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br/>
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br/>
==43569== by 0x40465D: main (superlu_test.c:59)<br/>
==43569==<br/>
==43569== Use of uninitialised value of size 8<br/>
==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)<br/>
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br/>
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br/>
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br/>
==43569== by 0x40465D: main (superlu_test.c:59)<br/>
==43569==<br/>
==43569== Invalid write of size 8<br/>
==43569== at 0x1473C554: pzgstrs (pzgstrs.c:1077)<br/>
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br/>
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br/>
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br/>
==43569== by 0x40465D: main (superlu_test.c:59)<br/>
==43569== Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd<br/>
==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)<br/>
==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)<br/>
==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)<br/>
==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)<br/>
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br/>
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br/>
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br/>
==43569== by 0x40465D: main (superlu_test.c:59)<br/>
==43569==<br/>
==43569== Invalid write of size 8<br/>
==43569== at 0x1473C55A: pzgstrs (pzgstrs.c:1077)<br/>
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br/>
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br/>
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br/>
==43569== by 0x40465D: main (superlu_test.c:59)<br/>
==43569== Address 0x266e5ad0 is 16 bytes after a block of size 35,520 alloc'd<br/>
==43569== at 0x4C2D814: memalign (vg_replace_malloc.c:906)<br/>
==43569== by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)<br/>
==43569== by 0x1464D488: superlu_malloc_dist (memory.c:127)<br/>
==43569== by 0x1473C451: pzgstrs (pzgstrs.c:1044)<br/>
==43569== by 0x146F5E72: pzgssvx (pzgssvx.c:1422)<br/>
==43569== by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)<br/>
==43569== by 0x55FB716: MatMatSolve (matrix.c:3485)<br/>
==43569== by 0x40465D: main (superlu_test.c:59)<br/>
==43569==</div>
<div> </div>
<div>I also attached the whole log. Does this make any sense? The problem seems to be around where I get the original segfault.</div>
<div> </div>
<div>best,</div>
<div>marius</div>
<div>
<div>
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0.0px 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);">
<div style="margin: 0.0px 0.0px 10.0px;"><b>Sent:</b> Saturday, 31 October 2020 at 04:07<br/>
<b>From:</b> "Barry Smith" <<a href="mailto:bsmith@petsc.dev" onclick="parent.window.location.href='mailto:bsmith@petsc.dev'; return false;" target="_blank">bsmith@petsc.dev</a>><br/>
<b>To:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>><br/>
<b>Cc:</b> "Xiaoye S. Li" <<a href="mailto:xsli@lbl.gov" onclick="parent.window.location.href='mailto:xsli@lbl.gov'; return false;" target="_blank">xsli@lbl.gov</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" onclick="parent.window.location.href='mailto:xiaoye@nersc.gov'; return false;" target="_blank">xiaoye@nersc.gov</a>><br/>
<b>Subject:</b> Re: [petsc-users] superlu_dist segfault</div>
<div>
<div>
<div> </div>
Have you run it with valgrind yet? There could be memory corruption earlier that causes a later crash; crashes that occur at different places for the same run are almost always due to memory corruption.
<div> </div>
<div> If valgrind is clean you can run with -on_error_attach_debugger and if the X forwarding is set up it will open a debugger on the crashing process and you can type bt to see exactly where it is crashing, at what line number and code line.</div>
<div> </div>
<div> Barry</div>
<div>
<div>
<blockquote>
<div>On Oct 29, 2020, at 1:04 AM, Marius Buerkle <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>> wrote:</div>
<div>
<div>
<div style="font-family: Verdana;font-size: 12.0px;">
<div>Hi Sherry,</div>
<div> </div>
<div>I used only 1 OpenMP thread and I also recompiled PETSc in debug mode with OpenMP turned off, but that did not help. </div>
<div> </div>
<div>Here is the output I get from SuperLU_DIST during the PETSc run:</div>
<div>
<div> Nonzeros in L 29519630<br/>
Nonzeros in U 29519630<br/>
nonzeros in L+U 58996711<br/>
nonzeros in LSUB 4509612</div>
<div>** Memory Usage **********************************<br/>
** NUMfact space (MB): (sum-of-all-processes)<br/>
L\U : 952.18 | Total : 1980.60<br/>
** Total highmark (MB):<br/>
Sum-of-all : 12401.85 | Avg : 387.56 | Max : 387.56<br/>
**************************************************<br/>
**************************************************<br/>
**** Time (seconds) ****<br/>
EQUIL time 0.06<br/>
ROWPERM time 1.03<br/>
COLPERM time 1.01<br/>
SYMBFACT time 0.45<br/>
DISTRIBUTE time 0.33<br/>
FACTOR time 0.90<br/>
Factor flops 2.225916e+11 Mflops 247438.62<br/>
SOLVE time 0.000<br/>
**************************************************</div>
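<div>The printed counts can be cross-checked with a small sketch in C. It assumes that L and U each include the diagonal, so that nnz(L+U) = nnz(L) + nnz(U) - n, and that each stored value is a 16-byte complex number; both are assumptions rather than something stated in this thread, and the helper names implied_order and factor_value_bytes are hypothetical.</div>
<pre>
```c
#include <assert.h>
#include <stdint.h>

/* Assuming nnz(L+U) = nnz(L) + nnz(U) - n (diagonal counted once),
 * solve for the matrix order n from the three printed counts. */
int64_t implied_order(int64_t nnz_l, int64_t nnz_u, int64_t nnz_lu)
{
    assert(nnz_l >= 0 && nnz_u >= 0 && nnz_lu >= 0);
    return nnz_l + nnz_u - nnz_lu;
}

/* Bytes needed for the numerical values alone, ignoring index arrays
 * and any other bookkeeping the solver keeps. */
int64_t factor_value_bytes(int64_t nnz_lu, int64_t scalar_bytes)
{
    assert(nnz_lu >= 0 && scalar_bytes > 0);
    return nnz_lu * scalar_bytes;
}
```
</pre>
<div>With the numbers above, implied_order(29519630, 29519630, 58996711) gives 42549, which matches the 42549 by 42549 size of A mentioned in this thread, and factor_value_bytes(58996711, 16) gives 943947376 bytes, i.e. roughly 0.94 GB for the values alone.</div>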
<div> </div>
<div>I tried all available ordering options for Colperm (NATURAL, MMD_AT_PLUS_A, MMD_ATA, METIS_AT_PLUS_A), except for parmetis, which always crashes. For Rowperm I used NOROWPERM and LargeDiag_MC64. All give the same segfault.</div>
</div>
<div>
<div>
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0.0px 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);">
<div style="margin: 0.0px 0.0px 10.0px;"><b>Sent:</b> Thursday, 29 October 2020 at 14:14<br/>
<b>From:</b> "Xiaoye S. Li" <<a href="mailto:xsli@lbl.gov" onclick="parent.window.location.href='mailto:xsli@lbl.gov'; return false;" target="_blank">xsli@lbl.gov</a>><br/>
<b>To:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>><br/>
<b>Cc:</b> "Zhang, Hong" <<a href="mailto:hzhang@mcs.anl.gov" onclick="parent.window.location.href='mailto:hzhang@mcs.anl.gov'; return false;" target="_blank">hzhang@mcs.anl.gov</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" onclick="parent.window.location.href='mailto:xiaoye@nersc.gov'; return false;" target="_blank">xiaoye@nersc.gov</a>><br/>
<b>Subject:</b> Re: Re: Re: [petsc-users] superlu_dist segfault</div>
<div>
<div>
<div class="gmail_default" style="font-family: verdana , sans-serif;font-size: small;">Hong: thanks for the diagnosis!</div>
<div class="gmail_default" style="font-family: verdana , sans-serif;font-size: small;"> </div>
<div class="gmail_default" style="font-family: verdana , sans-serif;font-size: small;">Marius: how many OpenMP threads are you using per MPI task?</div>
<div class="gmail_default" style="font-family: verdana , sans-serif;font-size: small;">In an earlier email, you mentioned the allocation failure at the following line:</div>
<div class="gmail_default" style="font-family: verdana , sans-serif;font-size: small;">
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="font-family: Verdana;font-size: 12.0px;"> if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;">this is in the solve phase. I think that when we did some OpenMP optimization, we allowed several data structures to grow with the number of OpenMP threads. You can try using 1 thread.</div>
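<div>As an illustration of how that request scales, here is a minimal sketch in C of the byte count behind the quoted allocation. The names sizelsum and num_thread come from the quoted line; complex16 is a stand-in assumed to match a two-double complex type, lsum_bytes is a hypothetical helper, and the overflow checks are an addition for the sketch, not something the quoted code is known to do.</div>
<pre>
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a double-precision complex number (two doubles). */
typedef struct { double r, i; } complex16;

/* Bytes requested for sizelsum * num_thread complex entries, computed in
 * size_t. Returns 0 if the product would overflow size_t. */
size_t lsum_bytes(size_t sizelsum, size_t num_thread)
{
    assert(sizeof(complex16) == 16);
    if (num_thread != 0 && sizelsum > SIZE_MAX / num_thread) return 0;
    size_t count = sizelsum * num_thread;
    if (count > SIZE_MAX / sizeof(complex16)) return 0;
    return count * sizeof(complex16);
}
```
</pre>
<div>For a hypothetical sizelsum of 1000, one thread requests 16000 bytes and eight threads request 128000 bytes, so the request grows linearly with the thread count.</div>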
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><br/>
The RHS and X memories are easy to compute. However, i<span style="font-size: 12.0pt;">n order to gauge how much memory is used in the factorization, can you print out the number of nonzeros in the L and U factors? What ordering option are you using? The sparse matrix A looks pretty small.</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="font-size: 12.0pt;">The code can also print out the working storage used during factorization. I am not sure how this printing can be turned on through PETSc.</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="font-size: 12.0pt;">Sherry</span></div>
</div>
</div>
<div class="gmail_quote">
<div class="gmail_attr">On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>> wrote:</div>
<blockquote class="gmail_quote" style="margin: 0.0px 0.0px 0.0px 0.8ex;border-left: 1.0px solid rgb(204,204,204);padding-left: 1.0ex;">
<div>
<div style="font-family: Verdana;font-size: 12.0px;">
<div>
<p><span style="font-size: 11.0pt;"><span style="line-height: normal;"><span style="font-family: Calibri , sans-serif;"><span style="font-size: 12.0pt;"><span style="font-family: "Times New Roman" , serif;">Thanks for the swift reply. </span></span></span></span></span></p>
<p><span style="font-size: 11.0pt;"><span style="line-height: normal;"><span style="font-family: Calibri , sans-serif;"><span style="font-size: 12.0pt;"><span style="font-family: "Times New Roman" , serif;">I also realized that it works if I reduce the number of RHS. But I am running the code on a cluster with 256 GB of RAM per node. One dense matrix would be around 30 GB, so 60 GB in total, which is large but does not exceed the memory of even one node, and I also get the segfault if I run on several nodes. Moreover, it works well with the MUMPS and MKL_CPARDISO solvers. The maximum memory used with MUMPS is around 150 GB during the solve phase, but SuperLU_DIST crashed even before reaching the solve phase. Could there be such a large difference in memory usage between SuperLU_DIST and MUMPS?</span></span></span></span></span></p>
<div> </div>
<p><span style="font-size: 11.0pt;"><span style="line-height: normal;"><span style="font-family: Calibri , sans-serif;"><span style="font-size: 12.0pt;"><span style="font-family: "Times New Roman" , serif;">best,</span></span></span></span></span></p>
<p><span style="font-size: 11.0pt;"><span style="line-height: normal;"><span style="font-family: Calibri , sans-serif;"><span style="font-size: 12.0pt;"><span style="font-family: "Times New Roman" , serif;">marius</span></span></span></span></span></p>
<div>
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0.0px 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);">
<div style="margin: 0.0px 0.0px 10.0px;"><b>Sent:</b> Thursday, 29 October 2020 at 10:10<br/>
<b>From:</b> "Zhang, Hong" <<a href="mailto:hzhang@mcs.anl.gov" onclick="parent.window.location.href='mailto:hzhang@mcs.anl.gov'; return false;" target="_blank">hzhang@mcs.anl.gov</a>><br/>
<b>To:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>><br/>
<b>Cc:</b> "<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" onclick="parent.window.location.href='mailto:xiaoye@nersc.gov'; return false;" target="_blank">xiaoye@nersc.gov</a>><br/>
<b>Subject:</b> Re: Re: [petsc-users] superlu_dist segfault</div>
<div>
<div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="color: rgb(32,31,30);font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;">Marius,</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="color: rgb(32,31,30);font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;">I tested your code with petsc-release on my Mac laptop using np=2 cores. I first tested a small matrix data file successfully. Then I switched to your data file and ran out of memory, likely due to the dense matrices B and X. I got the error "Your system has run out of application memory" from my laptop.</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="color: rgb(32,31,30);font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;">The sparse matrix A has size 42549 by 42549. Your code creates dense matrices B and X with the same size -- a huge memory requirement!</span></div>
<div><font color="#201f1e" face="Verdana"><span style="font-size: 12.0px;">After replacing B and X with matrices of size <span style="background-color: rgb(255,255,255);display: inline;">42549 by<span> nrhs (nrhs up to 4000), the code ran well with np=2. Note the error message you got </span></span></span></font></div>
<div><font color="#201f1e" face="Verdana"><span style="font-size: 12.0px;"><span style="background-color: rgb(255,255,255);display: inline;"><span><span style="background-color: rgb(255,255,255);display: inline;">[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range</span></span></span></span></font></div>
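<div>A rough size check, as a minimal sketch in C: it assumes each matrix entry is a 16-byte double-precision complex number (suggested by the doublecomplex type in the SuperLU_DIST source line quoted in this thread), and the helper names dense_matrix_bytes and bytes_to_gb are made up for illustration.</div>
<pre>
```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to store a dense rows x cols matrix whose entries each
 * occupy scalar_bytes bytes (16 is assumed here for a complex double). */
uint64_t dense_matrix_bytes(uint64_t rows, uint64_t cols, uint64_t scalar_bytes)
{
    assert(scalar_bytes > 0);
    /* Guard against wrap-around of the 64-bit products. */
    assert(cols == 0 || rows <= UINT64_MAX / cols);
    assert(rows * cols <= UINT64_MAX / scalar_bytes);
    return rows * cols * scalar_bytes;
}

/* Convert a byte count to decimal gigabytes (1 GB = 1e9 bytes). */
double bytes_to_gb(uint64_t bytes)
{
    return (double)bytes / 1.0e9;
}
```
</pre>
<div>With rows = cols = 42549 and 16 bytes per entry this gives 28966678416 bytes, about 29 GB per dense matrix, whereas 42549 by 4000 gives 2723136000 bytes, about 2.7 GB.</div>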
<div> </div>
<div>The modified code I used is attached.</div>
<div><font color="#201f1e" face="Verdana"><span style="font-size: 12.0px;"><span style="background-color: rgb(255,255,255);display: inline;"><span><span style="background-color: rgb(255,255,255);display: inline;">Hong</span></span></span></span></font></div>
<div id="gmail-m_7151358651855493837gmail-m_2827192423146367754appendonsend"> </div>
<hr style="display: inline-block;width: 98.0%;"/>
<div id="gmail-m_7151358651855493837gmail-m_2827192423146367754divRplyFwdMsg"><font face="Calibri, sans-serif" style="font-size: 11.0pt;"><b>From:</b> Marius Buerkle <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>><br/>
<b>Sent:</b> Tuesday, October 27, 2020 10:01 PM<br/>
<b>To:</b> Zhang, Hong <<a href="mailto:hzhang@mcs.anl.gov" onclick="parent.window.location.href='mailto:hzhang@mcs.anl.gov'; return false;" target="_blank">hzhang@mcs.anl.gov</a>><br/>
<b>Cc:</b> <a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>>; Sherry Li <<a href="mailto:xiaoye@nersc.gov" onclick="parent.window.location.href='mailto:xiaoye@nersc.gov'; return false;" target="_blank">xiaoye@nersc.gov</a>><br/>
<b>Subject:</b> Aw: Re: [petsc-users] superlu_dist segfault</font>
<div> </div>
</div>
<div>
<div style="font-family: Verdana;font-size: 12.0px;">
<div>Hi,</div>
<div> </div>
<div>I recompiled PETSc with the debug option; now I get a segfault at a different position:</div>
<div> </div>
<div>[23]PETSC ERROR: ------------------------------------------------------------------------<br/>
[23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range<br/>
[23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br/>
[23]PETSC ERROR: or see <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a><br/>
[23]PETSC ERROR: or try <a href="http://valgrind.org/" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors<br/>
[23]PETSC ERROR: likely location of problem given in stack below<br/>
[23]PETSC ERROR: --------------------- Stack Frames ------------------------------------<br/>
[23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,<br/>
[23]PETSC ERROR: INSTEAD the line number of the start of the function<br/>
[23]PETSC ERROR: is given.<br/>
[23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c<br/>
[23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c<br/>
[23]PETSC ERROR: [23] MatMatSolve line 3466 /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c<br/>
[23]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------<br/>
[23]PETSC ERROR: Signal received</div>
<div> </div>
<div>I made a small reproducer. The matrix is a bit too big to attach directly to the email, so I put it in the cloud:</div>
<div><a href="https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw" target="_blank">https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw</a></div>
<div> </div>
<div>Best,</div>
<div>Marius</div>
<div>
<div>
<div style="margin: 10.0px 5.0px 5.0px 10.0px;padding: 10.0px 0.0px 10.0px 10.0px;border-left: 2.0px solid rgb(195,217,229);">
<div style="margin: 0.0px 0.0px 10.0px;"><b>Sent:</b> Tuesday, 27 October 2020 at 23:11<br/>
<b>From:</b> "Zhang, Hong" <<a href="mailto:hzhang@mcs.anl.gov" onclick="parent.window.location.href='mailto:hzhang@mcs.anl.gov'; return false;" target="_blank">hzhang@mcs.anl.gov</a>><br/>
<b>To:</b> "Marius Buerkle" <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>>, "<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>>, "Sherry Li" <<a href="mailto:xiaoye@nersc.gov" onclick="parent.window.location.href='mailto:xiaoye@nersc.gov'; return false;" target="_blank">xiaoye@nersc.gov</a>><br/>
<b>Subject:</b> Re: [petsc-users] superlu_dist segfault</div>
<div>
<div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;">Marius,</span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;">It fails at the line <span style="background-color: rgb(255,255,255);display: inline;">1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c</span></span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;"><span style="background-color: rgb(255,255,255);display: inline;"> if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread * sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");</span></span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;"><span style="background-color: rgb(255,255,255);display: inline;">We do not know what it means. You may use a debugger to check the values of the variables involved.</span></span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;"><span style="background-color: rgb(255,255,255);display: inline;">I'm cc'ing Sherry (the superlu_dist developer). Alternatively, you may send us a short stand-alone code that reproduces the error, and we can help investigate it.</span></span></div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"><span style="font-family: Verdana;font-size: 12.0px;background-color: rgb(255,255,255);display: inline;"><span style="background-color: rgb(255,255,255);display: inline;">Hong</span></span></div>
<div id="gmail-m_7151358651855493837gmail-m_2827192423146367754x_appendonsend"> </div>
<div style="font-family: Calibri , Arial , Helvetica , sans-serif;font-size: 12.0pt;"> </div>
<hr style="display: inline-block;width: 98.0%;"/>
<div id="gmail-m_7151358651855493837gmail-m_2827192423146367754x_divRplyFwdMsg"><font face="Calibri, sans-serif" style="font-size: 11.0pt;"><b>From:</b> petsc-users <<a href="mailto:petsc-users-bounces@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users-bounces@mcs.anl.gov'; return false;" target="_blank">petsc-users-bounces@mcs.anl.gov</a>> on behalf of Marius Buerkle <<a href="mailto:mbuerkle@web.de" onclick="parent.window.location.href='mailto:mbuerkle@web.de'; return false;" target="_blank">mbuerkle@web.de</a>><br/>
<b>Sent:</b> Tuesday, October 27, 2020 8:46 AM<br/>
<b>To:</b> <a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" onclick="parent.window.location.href='mailto:petsc-users@mcs.anl.gov'; return false;" target="_blank">petsc-users@mcs.anl.gov</a>><br/>
<b>Subject:</b> [petsc-users] superlu_dist segfault</font>
<div> </div>
</div>
<div>
<div style="font-family: Verdana;font-size: 12.0px;">
<div>Hi,</div>
<div> </div>
<div>When using MatMatSolve with superlu_dist I get a segmentation fault:</div>
<div> </div>
<div>Malloc fails for lsum[]. at line 1075 in file /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c</div>
<div> </div>
<div>The matrix is not particularly big. I am using the PETSc release branch, and superlu_dist is v6.3.0, I think.</div>
<div> </div>
<div>Best,</div>
<div>Marius</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<span id="gmail-m_7151358651855493837cid:8484CCBB-7318-4599-8119-03B504563992@hsd1.il.comcast.net"><valgrind.tar.gz></span></div>
</blockquote>
</div>
</div>
</blockquote>
</div>
<div> </div>
--
<div class="gmail_signature">Stefano</div>
</div>
</div>
</div>
</div>
</div>
</div>
<span id="cid:0C64BCA4-C017-46BE-B4E4-5B00C48E7177@hsd1.il.comcast.net"><valgrind_track-origins.tar.gz></span></div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</div></div></body></html>