[petsc-users] MatDestroy problem with multiple matrices and SUPERLU_DIST

Deij-van Rijswijk, Menno M.Deij at marin.nl
Wed Apr 28 07:22:31 CDT 2021


The modules have automatic freeing only in the sense that a variable local to a subroutine that is ALLOCATE'd is automatically deallocated when the subroutine returns. I don't think that is problematic, as MatDestroy is used a lot in the code and normally executes just fine.

As far as I can see, no new communicators are created explicitly; MatCreateAIJ and MatCreateSeqAIJ are called with PETSC_COMM_WORLD and PETSC_COMM_SELF, respectively, as the first argument.

We also run this with the Intel MPI library, which is based on MPICH. There this problem does not occur.

The Valgrind run did not produce any new insights (at least not for me); I have pasted the relevant bits at the end of this message. I also did a run with debug versions of PETSc (v3.14.5) and OpenMPI (v3.1.2), which gives the following stack trace with line numbers for each frame. Maybe that helps in pinpointing the problem further.

0x0000155540d11719 in ompi_comm_free (comm=0x483f4e0) at /home/mdeij/build-libs-gnu/superbuild/openmpi/src/ompi/communicator/comm.c:1470
1470            if ( ! OMPI_COMM_IS_INTRINSIC((*comm)->c_local_comm)) {
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.3.1-5.el8.0.2.x86_64 libgfortran-8.3.1-5.el8.0.2.x86_64 libibumad-47mlnx1-1.47329.x86_64 libibverbs-47mlnx1-1.47329.x86_64 libnl3-3.5.0-1.el8.x86_64 libquadmath-8.3.1-5.el8.0.2.x86_64 librdmacm-47mlnx1-1.47329.x86_64 libstdc++-8.3.1-5.el8.0.2.x86_64 libxml2-2.9.7-7.el8.x86_64 numactl-libs-2.0.12-9.el8.x86_64 opensm-libs-5.5.1.MLNX20191120.0c8dde0-0.1.47329.x86_64 openssl-libs-1.1.1c-15.el8.x86_64 python3-libs-3.6.8-23.el8.x86_64 sssd-client-2.2.3-20.el8.x86_64 ucx-cma-1.7.0-1.47329.x86_64 ucx-ib-1.7.0-1.47329.x86_64 xz-libs-5.2.4-3.el8.x86_64 zlib-1.2.11-16.el8_2.x86_64
(gdb) bt
#0  0x0000155540d11719 in ompi_comm_free (comm=0x483f4e0) at /home/mdeij/build-libs-gnu/superbuild/openmpi/src/ompi/communicator/comm.c:1470
#1  0x0000155540d4f1af in PMPI_Comm_free (comm=0x483f4e0) at pcomm_free.c:62
#2  0x000015555346329a in superlu_gridexit (grid=0x483f4e0) at /home/mdeij/install-gnu/extLibs/Linux-x86_64-Intel/superlu_dist-6.3.0/SRC/superlu_grid.c:174
#3  0x0000155553ca2ff1 in Petsc_Superlu_dist_keyval_Delete_Fn (comm=0x3921b10, keyval=16, attr_val=0x483f4d0, extra_state=0x0) at /home/mdeij/build-libs-gnu/superbuild/petsc/src/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c:97
#4  0x0000155540d0baa1 in ompi_attr_delete_impl (type=COMM_ATTR, object=0x3921b10, attr_hash=0x377efe0, key=16, predefined=true) at /home/mdeij/build-libs-gnu/superbuild/openmpi/src/ompi/attribute/attribute.c:1062
#5  0x0000155540d0c039 in ompi_attr_delete_all (type=COMM_ATTR, object=0x3921b10, attr_hash=0x377efe0) at /home/mdeij/build-libs-gnu/superbuild/openmpi/src/ompi/attribute/attribute.c:1166
#6  0x0000155540d11676 in ompi_comm_free (comm=0x7fffffffc5c0) at /home/mdeij/build-libs-gnu/superbuild/openmpi/src/ompi/communicator/comm.c:1462
#7  0x0000155540d4f1af in PMPI_Comm_free (comm=0x7fffffffc5c0) at pcomm_free.c:62
#8  0x000015555393fb68 in PetscCommDestroy (comm=0x3943a60) at /home/mdeij/build-libs-gnu/superbuild/petsc/src/src/sys/objects/tagm.c:217
#9  0x0000155553941e07 in PetscHeaderDestroy_Private (h=0x3943a20) at /home/mdeij/build-libs-gnu/superbuild/petsc/src/src/sys/objects/inherit.c:121
#10 0x000015555408edfe in MatDestroy (A=0x3558c18) at /home/mdeij/build-libs-gnu/superbuild/petsc/src/src/mat/interface/matrix.c:1306
#11 0x00001555540cb5fa in matdestroy_ (A=0x3558c18, __ierr=0x7fffffffc73c) at /home/mdeij/build-libs-gnu/superbuild/petsc/src/src/mat/interface/ftn-auto/matrixf.c:770
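Read from the bottom up, the backtrace shows a nested free: MatDestroy releases PETSc's communicator, MPI then runs the attribute delete callback (Petsc_Superlu_dist_keyval_Delete_Fn), and that callback calls superlu_gridexit, which frees a second communicator. The following is only a toy model of that call order in plain Python, not real MPI, PETSc or SuperLU_DIST code. The names ToyComm, toy_comm_free, grid_delete_fn and run_demo are hypothetical, and the double-free check is an illustrative assumption, not a diagnosis of the crash reported here.

```python
# Toy model (NOT real MPI): a communicator carries attributes, and each
# attribute has a delete callback that runs when the communicator is freed.

class ToyComm:
    def __init__(self, name):
        self.name = name
        self.attrs = {}      # keyval -> (value, delete_fn)
        self.freed = False

    def set_attr(self, keyval, value, delete_fn):
        self.attrs[keyval] = (value, delete_fn)


def toy_comm_free(comm, trace):
    """Free a communicator, running every attribute delete callback first."""
    if comm.freed:
        # Illustrative guard only; a real MPI library need not detect this.
        raise RuntimeError(f"double free of {comm.name}")
    trace.append(f"free({comm.name})")
    for keyval, (value, delete_fn) in list(comm.attrs.items()):
        trace.append(f"delete_attr({comm.name}, keyval={keyval})")
        delete_fn(comm, keyval, value, trace)
    comm.attrs.clear()
    comm.freed = True


def grid_delete_fn(comm, keyval, grid_comm, trace):
    """Stand-in for a delete callback that releases a solver's grid communicator."""
    toy_comm_free(grid_comm, trace)


def run_demo():
    trace = []
    inner = ToyComm("petsc_inner")    # models the communicator freed in frame #7
    grid = ToyComm("superlu_grid")    # models the communicator freed in frame #1
    inner.set_attr(16, grid, grid_delete_fn)  # keyval=16 as in frame #3
    toy_comm_free(inner, trace)
    return trace, inner, grid


if __name__ == "__main__":
    for step in run_demo()[0]:
        print(step)
```

In this model the inner free happens while the outer free is still in progress, which matches the order of frames #7 down to #0 above.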

Valgrind output:

==1026905== Invalid read of size 1
==1026905==    at 0x19184538: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x6943B61: superlu_gridexit (in /home/mdeij/install-gnu/extLibs/lib/libsuperlu_dist.so.6.3.0)
==1026905==    by 0x56F398E: Petsc_Superlu_dist_keyval_Delete_Fn (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x1912447B: ompi_attr_delete_impl (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x19126FFE: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x4FEE49D: PetscCommDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x4FF0EE1: PetscHeaderDestroy_Private (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x5317899: MatDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x5336E58: matdestroy_ (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x1528710: __fsi_MOD_fem_constructmatricespetscexit (fsi.F90:2297)
==1026905==  Address 0x2ce67398 is 11,112 bytes inside an unallocated block of size 11,232 in arena "client"
==1026905==
==1026905== Invalid read of size 8
==1026905==    at 0x1912AC9A: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x6943B61: superlu_gridexit (in /home/mdeij/install-gnu/extLibs/lib/libsuperlu_dist.so.6.3.0)
==1026905==    by 0x56F398E: Petsc_Superlu_dist_keyval_Delete_Fn (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x1912447B: ompi_attr_delete_impl (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x19126FFE: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x4FEE49D: PetscCommDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x4FF0EE1: PetscHeaderDestroy_Private (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x5317899: MatDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x5336E58: matdestroy_ (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==  Address 0x2ce673c0 is 11,152 bytes inside an unallocated block of size 11,232 in arena "client"
==1026905==
==1026905== Invalid read of size 8
==1026905==    at 0x19126E5B: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x6943B61: superlu_gridexit (in /home/mdeij/install-gnu/extLibs/lib/libsuperlu_dist.so.6.3.0)
==1026905==    by 0x56F398E: Petsc_Superlu_dist_keyval_Delete_Fn (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x1912447B: ompi_attr_delete_impl (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x19126FFE: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x4FEE49D: PetscCommDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x4FF0EE1: PetscHeaderDestroy_Private (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x5317899: MatDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==  Address 0x91 is not stack'd, malloc'd or (recently) free'd
==1026905==
==1026905==
==1026905== Process terminating with default action of signal 11 (SIGSEGV)
==1026905==  Access not within mapped region at address 0x91
==1026905==    at 0x19126E5B: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x6943B61: superlu_gridexit (in /home/mdeij/install-gnu/extLibs/lib/libsuperlu_dist.so.6.3.0)
==1026905==    by 0x56F398E: Petsc_Superlu_dist_keyval_Delete_Fn (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x1912447B: ompi_attr_delete_impl (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x19126FFE: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905==    by 0x4FEE49D: PetscCommDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x4FF0EE1: PetscHeaderDestroy_Private (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==    by 0x5317899: MatDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905==  If you believe this happened as a result of a stack
==1026905==  overflow in your program's main thread (unlikely but
==1026905==  possible), you can try to increase the size of the
==1026905==  main thread stack using the --main-stacksize= flag.
==1026905==  The main thread stack size used in this run was 16777216.


dr. ir. Menno A. Deij-van Rijswijk | Researcher | Research & Development
MARIN | T +31 317 49 35 06 | M.Deij at marin.nl | www.marin.nl

From: Barry Smith <bsmith at petsc.dev>
Sent: Friday, April 23, 2021 7:09 PM
To: Deij-van Rijswijk, Menno <M.Deij at marin.nl>
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] MatDestroy problem with multiple matrices and SUPERLU_DIST


   Thanks for looking. Do these modules have any "automatic freeing" when variables go out of scope (like C++ classes do)?

    Do you make specific new MPI communicators to use to create the matrices?

    Have you tried MPICH or a different version of OpenMPI?

    Maybe run the program with valgrind.  The stack frames you sent look "funny"; that is, I would not normally expect them to be in such an order.

   Barry





