[petsc-dev] DMDAGlobalToNatural errors with Ubuntu:latest; gcc 7 & Open MPI 2.1.1

Fabian.Jakub Fabian.Jakub at physik.uni-muenchen.de
Wed Jul 31 14:19:27 CDT 2019


Awesome, many thanks for your efforts!

On 7/31/19 9:17 PM, Zhang, Junchao wrote:
> Hi, Fabian,
> I found it is an OpenMPI bug w.r.t. self-to-self MPI_Send/Recv using MPI_ANY_SOURCE for message matching: OpenMPI does not put the correct value in the recv buffer.
> I have a workaround in the branch jczhang/fix-ubuntu-openmpi-anysource <https://bitbucket.org/petsc/petsc/branch/jczhang/fix-ubuntu-openmpi-anysource>. I tested it with your petsc_ex.F90 and $PETSC_DIR/src/dm/examples/tests/ex14. The majority of the valgrind errors disappeared. The few that remain are in ompi_mpi_init, and we can ignore them.
> I filed a bug report with OpenMPI (https://www.mail-archive.com/users@lists.open-mpi.org//msg33383.html) and hope they can fix it in Ubuntu.
> Thanks.
> 
> --Junchao Zhang
> 
> 
> On Tue, Jul 30, 2019 at 9:47 AM Fabian.Jakub via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> Dear PETSc Team,
> Our cluster recently switched to Ubuntu 18.04, which has gcc 7.4 and
> (Open MPI) 2.1.1. With this I ended up with segfaults and valgrind
> errors in DMDAGlobalToNatural.
> 
> This is evident in a minimal Fortran example such as the attached
> petsc_ex.F90
> 
> with the following error:
> 
> ==22616== Conditional jump or move depends on uninitialised value(s)
> ==22616==    at 0x4FA5CDB: PetscTrMallocDefault (mtr.c:185)
> ==22616==    by 0x4FA4DAC: PetscMallocA (mal.c:413)
> ==22616==    by 0x5090E94: VecScatterSetUp_SF (vscatsf.c:652)
> ==22616==    by 0x50A1104: VecScatterSetUp (vscatfce.c:209)
> ==22616==    by 0x509EE3B: VecScatterCreate (vscreate.c:280)
> ==22616==    by 0x577B48B: DMDAGlobalToNatural_Create (dagtol.c:108)
> ==22616==    by 0x577BB6D: DMDAGlobalToNaturalBegin (dagtol.c:155)
> ==22616==    by 0x5798446: VecView_MPI_DA (gr2.c:720)
> ==22616==    by 0x51BC7D8: VecView (vector.c:574)
> ==22616==    by 0x4F4ECA1: PetscObjectView (destroy.c:90)
> ==22616==    by 0x4F4F05E: PetscObjectViewFromOptions (destroy.c:126)
> 
> and consequently wrong results in the natural vec
> 
> 
> I checked the Fortran example to see whether I had forgotten something, but I can
> also see the same error, i.e. not being valgrind clean, in pure C PETSc:
> 
> cd $PETSC_DIR/src/dm/examples/tests && make ex14 && mpirun --allow-run-as-root -np 2 valgrind ./ex14
> 
> I then tried various docker/podman linux distributions to make sure that
> my setup is clean and to me it seems that this error is confined to the
> particular gcc version 7.4 and (Open MPI) 2.1.1 from the ubuntu:latest repo.
> 
> I tried other images from dockerhub including
> 
> gcc:7.4.0 :: I could install neither openmpi nor mpich through
> apt, but it works with --download-openmpi and --download-mpich
> 
> ubuntu:rolling(19.04) <-- works
> 
> debian:latest & :stable <-- works
> 
> ubuntu:latest(18.04) <-- fails in case of openmpi, but works with mpich
> or with petsc-configure --download-openmpi or --download-mpich
> 
> 
> Is this error with (Open MPI) 2.1.1 a known issue? In the meantime, I
> guess I'll go with a custom MPI install, but given that ubuntu:latest is
> widely used, do you think there is an easy solution to the error?
> 
> I guess you are not eager to delve into this issue with old mpi versions
> but in case you find some spare time, maybe you find the root cause
> and/or a workaround.
> 
> Many thanks,
> Fabian
> 
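
For readers who want to probe the pattern named in the quoted diagnosis (a self-to-self send matched by a receive that uses MPI_ANY_SOURCE), a minimal standalone sketch could look like the following. It is an illustrative assumption, not the attached petsc_ex.F90, not the code in the workaround branch, and not the test case from the bug report; whether this exact sequence triggers the problem with the Ubuntu 18.04 Open MPI package is not established here. The variable names and the payload value are hypothetical.

```c
/* Hypothetical probe: each rank sends one int to itself and receives it
 * with a wildcard source. Standard MPI calls only; no PETSc involved. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, sendval, recvval = -1, ok;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sendval = 100 + rank; /* arbitrary payload, distinct per rank */

    /* Post the wildcard receive first so the blocking self-send
     * always has a matching receive and cannot deadlock. */
    MPI_Irecv(&recvval, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
    MPI_Send(&sendval, 1, MPI_INT, rank, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, &status);

    /* A correct MPI delivers the rank's own value and reports
     * the rank itself as the source. */
    ok = (recvval == sendval) && (status.MPI_SOURCE == rank);
    printf("rank %d: sent %d, received %d from %d: %s\n",
           rank, sendval, recvval, status.MPI_SOURCE, ok ? "ok" : "MISMATCH");

    MPI_Finalize();
    return ok ? 0 : 1;
}
```

It can be built with mpicc and run in the same way as the ex14 command quoted above, e.g. under mpirun -np 2 with valgrind, to compare the distribution's Open MPI against an MPI built via --download-mpich.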


