<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>Thank you, Sherry, for your efforts.</p>
<p>Before I can set up an example that reproduces the problem, I
have to ask a PETSc-related question.</p>
<p>When I write a matrix with MatView and read it back with MatLoad,
its original partitioning is ignored.<br>
</p>
<p>Say I originally have 100 and 110 equations on two processors;
after MatLoad I have 105 and 105 on the same two processors.</p>
<p>How do I pass the partitioning information through MatView and MatLoad?</p>
<p>I guess it's important for reproducing my setup exactly.</p>
<p>Thanks<br>
</p>
<br>
<div class="moz-cite-prefix">On 10/19/2016 08:06 AM, Xiaoye S. Li
wrote:<br>
</div>
<blockquote
cite="mid:CAFvbobWHxhgp1Lan4zf8t-O_D5_LO89Jc1VgbQ4JkMOrxoEz2Q@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div dir="ltr">
<div class="gmail_default" style="font-family:comic sans
ms,sans-serif;font-size:small">I looked at each
item that valgrind flagged in your email dated Oct. 11. Those
reports are really superficial; I don't see anything wrong
with the lines singled out (mostly uninitialized variables).
I ran a few tests with the latest version on GitHub, and
all went fine. </div>
<div class="gmail_default" style="font-family:comic sans
ms,sans-serif;font-size:small"><br>
</div>
<div class="gmail_default" style="font-family:comic sans
ms,sans-serif;font-size:small">Perhaps you can print the
matrix that caused the problem, so that I can run with it.</div>
<div class="gmail_default" style="font-family:comic sans
ms,sans-serif;font-size:small"><br>
</div>
<div class="gmail_default" style="font-family:comic sans
ms,sans-serif;font-size:small">Sherry</div>
<div class="gmail_default" style="font-family:comic sans
ms,sans-serif;font-size:small"><br>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Tue, Oct 11, 2016 at 2:18 PM, Anton
<span dir="ltr">&lt;<a moz-do-not-send="true"
href="mailto:popov@uni-mainz.de" target="_blank">popov@uni-mainz.de</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex"><span
class=""><br>
<br>
On 10/11/16 7:19 PM, Satish Balay wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
This log looks truncated. Are there any valgrind messages
before this?<br>
[like from your application code - or from MPI]<br>
</blockquote>
</span>
Yes, it is indeed truncated. I only included the relevant
messages.<span class=""><br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Perhaps you can send the complete log - with:<br>
valgrind -q --tool=memcheck --leak-check=yes
--num-callers=20 --track-origins=yes<br>
<br>
[and if there were more valgrind messages from MPI -
rebuild petsc<br>
</blockquote>
</span>
There are no messages originating from our code, just a few
MPI-related ones (probably false positives) and ones from
SuperLU_DIST (most of them).<br>
<br>
Thanks,<br>
Anton
<div class="HOEnZb">
<div class="h5"><br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
with --download-mpich - for a valgrind clean mpi]<br>
<br>
Sherry,<br>
Perhaps this log points to some issue in superlu_dist?<br>
<br>
thanks,<br>
Satish<br>
<br>
On Tue, 11 Oct 2016, Anton Popov wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
Valgrind immediately detects interesting stuff:<br>
<br>
==25673== Use of uninitialised value of size 8<br>
==25673== at 0x178272C: static_schedule
(static_schedule.c:960)<br>
==25674== Use of uninitialised value of size 8<br>
==25674== at 0x178272C: static_schedule
(static_schedule.c:960)<br>
==25674== by 0x174E74E: pdgstrf (pdgstrf.c:572)<br>
==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)<br>
<br>
<br>
==25673== Conditional jump or move depends on
uninitialised value(s)<br>
==25673== at 0x1752143: pdgstrf
(dlook_ahead_update.c:24)<br>
==25673== by 0x1733954: pdgssvx (pdgssvx.c:1124)<br>
<br>
<br>
==25673== Conditional jump or move depends on
uninitialised value(s)<br>
==25673== at 0x5C83F43: PMPI_Recv (in
/opt/mpich3/lib/libmpi.so.12.1<wbr>.0)<br>
==25673== by 0x1755385: pdgstrf2_trsm
(pdgstrf2.c:253)<br>
==25673== by 0x1751E4F: pdgstrf
(dlook_ahead_update.c:195)<br>
==25673== by 0x1733954: pdgssvx (pdgssvx.c:1124)<br>
<br>
==25674== Use of uninitialised value of size 8<br>
==25674== at 0x62BF72B: _itoa_word (_itoa.c:179)<br>
==25674== by 0x62C1289: printf_positional
(vfprintf.c:2022)<br>
==25674== by 0x62C2465: vfprintf
(vfprintf.c:1677)<br>
==25674== by 0x638AFD5: __vsnprintf_chk
(vsnprintf_chk.c:63)<br>
==25674== by 0x638AF37: __snprintf_chk
(snprintf_chk.c:34)<br>
==25674== by 0x5CC6C08:
MPIR_Err_create_code_valist (in<br>
/opt/mpich3/lib/libmpi.so.12.1<wbr>.0)<br>
==25674== by 0x5CC7A9A: MPIR_Err_create_code (in<br>
/opt/mpich3/lib/libmpi.so.12.1<wbr>.0)<br>
==25674== by 0x5C83FB1: PMPI_Recv (in
/opt/mpich3/lib/libmpi.so.12.1<wbr>.0)<br>
==25674== by 0x1755385: pdgstrf2_trsm
(pdgstrf2.c:253)<br>
==25674== by 0x1751E4F: pdgstrf
(dlook_ahead_update.c:195)<br>
==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)<br>
<br>
==25674== Use of uninitialised value of size 8<br>
==25674== at 0x1751E92: pdgstrf
(dlook_ahead_update.c:205)<br>
==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)<br>
<br>
And it crashes after this:<br>
<br>
==25674== Invalid write of size 4<br>
==25674== at 0x1751F2F: pdgstrf
(dlook_ahead_update.c:211)<br>
==25674== by 0x1733954: pdgssvx (pdgssvx.c:1124)<br>
==25674== by 0xAAEFAE:
MatLUFactorNumeric_SuperLU_DIS<wbr>T
(superlu_dist.c:421)<br>
==25674== Address 0xa0 is not stack'd, malloc'd or
(recently) free'd<br>
==25674==<br>
[1]PETSC ERROR:<br>
------------------------------<wbr>------------------------------<wbr>------------<br>
[1]PETSC ERROR: Caught signal number 11 SEGV:
Segmentation Violation, probably<br>
memory access out of range<br>
<br>
<br>
On 10/11/2016 03:26 PM, Anton Popov wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
On 10/10/2016 07:11 PM, Satish Balay wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0
0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">
That's from petsc-3.5<br>
<br>
Anton - please post the stack trace you get with<br>
--download-superlu_dist-commit<wbr>=origin/maint<br>
</blockquote>
I guess this is it:<br>
<br>
[0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421<br>
/home/anton/LIB/petsc/src/mat/<wbr>impls/aij/mpi/superlu_dist/sup<wbr>erlu_dist.c<br>
[0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIS<wbr>T
line 282<br>
/home/anton/LIB/petsc/src/mat/<wbr>impls/aij/mpi/superlu_dist/sup<wbr>erlu_dist.c<br>
[0]PETSC ERROR: [0] MatLUFactorNumeric line 2985<br>
/home/anton/LIB/petsc/src/mat/<wbr>interface/matrix.c<br>
[0]PETSC ERROR: [0] PCSetUp_LU line 101<br>
/home/anton/LIB/petsc/src/ksp/<wbr>pc/impls/factor/lu/lu.c<br>
[0]PETSC ERROR: [0] PCSetUp line 930<br>
/home/anton/LIB/petsc/src/ksp/<wbr>pc/interface/precon.c<br>
<br>
According to the line numbers, it crashes within<br>
MatLUFactorNumeric_SuperLU_DIS<wbr>T while calling
pdgssvx.<br>
<br>
Surprisingly, this only happens on the second SNES
iteration, not on the first.<br>
<br>
I'm trying to reproduce this behavior with PETSc
KSP and SNES examples.<br>
However, everything I've tried so far with
SuperLU_DIST works just fine.<br>
<br>
I'm also checking our code in Valgrind to make
sure it's clean.<br>
<br>
Anton<br>
<blockquote class="gmail_quote" style="margin:0 0
0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">
Satish<br>
<br>
<br>
On Mon, 10 Oct 2016, Xiaoye S. Li wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0
0 0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">
Which version of superlu_dist does this
capture? I looked at the original error log,
and it pointed to pdgssvx: line 161.
But that line is in a comment block,
not the program.<br>
<br>
Sherry<br>
<br>
<br>
On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov
&lt;<a moz-do-not-send="true"
href="mailto:popov@uni-mainz.de"
target="_blank">popov@uni-mainz.de</a>&gt;
wrote:<br>
<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
On 10/07/2016 05:23 PM, Satish Balay wrote:<br>
<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
On Fri, 7 Oct 2016, Kong, Fande wrote:<br>
<br>
On Fri, Oct 7, 2016 at 9:04 AM, Satish
Balay &lt;<a moz-do-not-send="true"
href="mailto:balay@mcs.anl.gov"
target="_blank">balay@mcs.anl.gov</a>&gt;<br>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
On Fri, 7 Oct 2016, Anton Popov wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
Hi guys,<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
is there any news about fixing the
buggy behavior of SuperLU_DIST, exactly
what is described here:<br>
<br>
<a moz-do-not-send="true"
href="https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.mcs.anl.gov_pipermail_petsc-2Dusers_2015-2DAugust_026802.html&amp;d=CwIBAg&amp;c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&amp;r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&amp;m=RwruX6ckX0t9H89Z6LXKBfJBOAM2vG1sQHw2tIsSQtA&amp;s=bbB62oGLm582JebVs8xsUej_OX0eUwibAKsRRWKafos&amp;e="
rel="noreferrer" target="_blank">https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.mcs.anl.gov_pipermail_petsc-2Dusers_2015-2DAugust_026802.html&amp;d=CwIBAg&amp;c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&amp;r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&amp;m=RwruX6ckX0t9H89Z6LXKBfJBOAM2vG1sQHw2tIsSQtA&amp;s=bbB62oGLm582JebVs8xsUej_OX0eUwibAKsRRWKafos&amp;e=</a>
?<br>
<br>
I'm using 3.7.4 and still get a SEGV
in the pdgssvx routine.
Everything works fine with 3.5.4.<br>
<br>
Do I still have to stick to the maint
branch, and what are the chances for
these fixes to be included in 3.7.5?<br>
<br>
</blockquote>
3.7.4 is off the maint branch [as of a
week ago]. So if you are seeing
issues with it - it's best to debug and
figure out the cause.<br>
<br>
</blockquote>
This bug is indeed inside of
superlu_dist, and we started having
this issue from PETSc-3.6.x. I think the superlu_dist
developers should have fixed this
bug. Did we forget to update superlu_dist?
This is not something users could
debug and fix.<br>
<br>
I have many people at INL suffering from
this issue, and they have to stay
with PETSc-3.5.4 to use superlu_dist.<br>
<br>
</blockquote>
To verify if the bug is fixed in latest
superlu_dist - you can try<br>
[assuming you have git - either from
petsc-3.7/maint/master]:<br>
<br>
--download-superlu_dist
--download-superlu_dist-commit<wbr>=origin/maint<br>
<br>
<br>
Satish<br>
<br>
</blockquote>
Hi Satish,<br>
<br>
I did this:<br>
<br>
git clone -b maint <a
moz-do-not-send="true"
href="https://bitbucket.org/petsc/petsc.git"
rel="noreferrer" target="_blank">https://bitbucket.org/petsc/pe<wbr>tsc.git</a>
petsc<br>
<br>
--download-superlu_dist<br>
--download-superlu_dist-commit<wbr>=origin/maint
(not sure this is needed,<br>
since I'm already in maint)<br>
<br>
The problem is still there.<br>
<br>
Cheers,<br>
Anton<br>
<br>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br>
<br>
</blockquote>
</blockquote>
<br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>