<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div dir="ltr">Some updates for this OpenMPI bug:
<div> 1) It affects OpenMPI 2.1.x when configured with --enable-heterogeneous, which is not a default option and is not commonly used. However, the Ubuntu package was built with it.
<div> 2) OpenMPI fixed it in 3.x</div>
<div> 3) It was reported to Ubuntu two years ago but is still unassigned: <a href="https://bugs.launchpad.net/ubuntu/+source/openmpi/+bug/1731938">https://bugs.launchpad.net/ubuntu/+source/openmpi/+bug/1731938</a>. One user commented there last year: "We have
just spent today hunting down a user bug report for Xyce (which uses Trilinos, and its Zoltan library) that turn out to be exactly this issue"</div>
<div>
<div>
<div dir="ltr" class="m_-1459072320325078085gmail_signature" data-smartmail="gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Jul 31, 2019 at 2:17 PM Junchao Zhang &lt;<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div>Hi, Fabian,</div>
I found that this is an OpenMPI bug in self-to-self MPI_Send/MPI_Recv when the receive uses MPI_ANY_SOURCE for message matching: OpenMPI does not put the correct value in the receive buffer.
<div>I have a workaround in the branch <a href="https://bitbucket.org/petsc/petsc/branch/jczhang/fix-ubuntu-openmpi-anysource" style="color:rgb(0,82,204);text-decoration-line:none;font-family:-apple-system,system-ui,'Segoe UI',Roboto,Oxygen,Ubuntu,'Fira Sans','Droid Sans','Helvetica Neue',sans-serif;font-size:14px" target="_blank">jczhang/fix-ubuntu-openmpi-anysource</a>.
I tested it with your petsc_ex.F90 and $PETSC_DIR/src/dm/examples/tests/ex14. Most of the valgrind errors disappeared. The few that remain are in ompi_mpi_init, and we can ignore them.</div>
<div>I filed a bug report with OpenMPI at <a href="https://www.mail-archive.com/users@lists.open-mpi.org//msg33383.html" target="_blank">https://www.mail-archive.com/users@lists.open-mpi.org//msg33383.html</a> and hope they can fix it in Ubuntu.</div>
<div>Thanks.</div>
<div><br>
<div>
<div dir="ltr" class="gmail-m_-1459072320325078085gmail-m_2874015940392249691gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Jul 30, 2019 at 9:47 AM Fabian.Jakub via petsc-dev &lt;<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Dear PETSc Team,<br>
Our cluster recently switched to Ubuntu 18.04, which has gcc 7.4 and<br>
(Open MPI) 2.1.1. With this I ended up with segfaults and valgrind<br>
errors in DMDAGlobalToNatural.<br>
<br>
This is evident in a minimal Fortran example such as the attached<br>
petsc_ex.F90<br>
<br>
with the following error:<br>
<br>
==22616== Conditional jump or move depends on uninitialised value(s)<br>
==22616== at 0x4FA5CDB: PetscTrMallocDefault (mtr.c:185)<br>
==22616== by 0x4FA4DAC: PetscMallocA (mal.c:413)<br>
==22616== by 0x5090E94: VecScatterSetUp_SF (vscatsf.c:652)<br>
==22616== by 0x50A1104: VecScatterSetUp (vscatfce.c:209)<br>
==22616== by 0x509EE3B: VecScatterCreate (vscreate.c:280)<br>
==22616== by 0x577B48B: DMDAGlobalToNatural_Create (dagtol.c:108)<br>
==22616== by 0x577BB6D: DMDAGlobalToNaturalBegin (dagtol.c:155)<br>
==22616== by 0x5798446: VecView_MPI_DA (gr2.c:720)<br>
==22616== by 0x51BC7D8: VecView (vector.c:574)<br>
==22616== by 0x4F4ECA1: PetscObjectView (destroy.c:90)<br>
==22616== by 0x4F4F05E: PetscObjectViewFromOptions (destroy.c:126)<br>
<br>
and consequently wrong results in the natural vec<br>
<br>
<br>
I checked whether I had forgotten something in the Fortran example, but I can<br>
also see the same error, i.e. not being valgrind clean, in pure C PETSc:<br>
<br>
cd $PETSC_DIR/src/dm/examples/tests && make ex14 && mpirun<br>
--allow-run-as-root -np 2 valgrind ./ex14<br>
<br>
I then tried various Linux distribution images with docker/podman to make sure that<br>
my setup is clean, and it seems to me that this error is confined to the<br>
particular combination of gcc 7.4 and (Open MPI) 2.1.1 from the ubuntu:latest repo.<br>
<br>
I tried other images from dockerhub including<br>
<br>
gcc:7.4.0 :: where I could install neither openmpi nor mpich through<br>
apt, but it works with --download-openmpi and --download-mpich<br>
<br>
ubuntu:rolling(19.04) <-- works<br>
<br>
debian:latest & :stable <-- works<br>
<br>
ubuntu:latest(18.04) <-- fails in case of openmpi, but works with mpich<br>
or with petsc-configure --download-openmpi or --download-mpich<br>
<br>
<br>
Is this error with (Open MPI) 2.1.1 a known issue? In the meantime, I<br>
guess I'll go with a custom MPI install, but given that ubuntu:latest is<br>
widely used, do you think there is an easy solution to the error?<br>
<br>
I guess you are not eager to delve into this issue with old MPI versions,<br>
but in case you find some spare time, maybe you can find the root cause<br>
and/or a workaround.<br>
<br>
Many thanks,<br>
Fabian<br>
</blockquote>
</div>
</blockquote>
</div>
</body>
</html>