[petsc-users] MatDestroy problem with multiple matrices and SUPERLU_DIST
Deij-van Rijswijk, Menno
M.Deij at marin.nl
Tue May 4 02:55:36 CDT 2021
Hi Barry,
Thank you for this message about finalisation. I have checked that PetscFinalize is called after the problematic call to MatDestroy, and that is indeed the case. Furthermore, the module does not use "final".
Menno
dr. ir. Menno A. Deij-van Rijswijk | Researcher | Research & Development
MARIN | T +31 317 49 35 06 | M.Deij at marin.nl | www.marin.nl
From: Barry Smith <bsmith at petsc.dev>
Sent: Sunday, May 2, 2021 6:30 PM
To: Deij-van Rijswijk, Menno <M.Deij at marin.nl>
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] MatDestroy problem with multiple matrices and SUPERLU_DIST
==1026905== by 0x5317899: MatDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x5336E58: matdestroy_ (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x1528710: __fsi_MOD_fem_constructmatricespetscexit (fsi.F90:2297)
==1026905== Address 0x2ce67398 is 11,112 bytes inside an unallocated block of size 11,232 in arena "client"
Is it possible that this __fsi_MOD_fem_constructmatricespetscexit is being called AFTER PetscFinalize()? Perhaps it is defined with a "final" and the compiler/linker schedules it to be called after the program has "completed".
This would explain the crash, the valgrind stack frames, and even why it does not crash with MPICH. This can happen with C++ destructors in code such as
int main(int argc, char **argv)
{
  MyCxxClass my;    // has a destructor that destroys PETSc objects
  PetscInitialize(&argc, &argv, NULL, NULL);
  ....
  PetscFinalize();
  return 0;
}                   // <-- the destructor gets called here and messes with MPI data that no longer exists
The fix is to force the destructor to be called before PetscFinalize(); this can be done with
int main(int argc, char **argv)
{
  PetscInitialize(&argc, &argv, NULL, NULL);
  {
    MyCxxClass my;  // has a destructor that destroys PETSc objects
    ....
  }                 // <-- the destructor gets called here and everything is fine
  PetscFinalize();
  return 0;
}
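The ordering issue described above can be reproduced without PETSc or MPI. The following is only a minimal sketch: MockLibrary, Owner, run_unscoped and run_scoped are hypothetical names invented for this illustration, and the mock merely records whether a destructor ran before or after the stand-in "finalize" call. It is not the poster's code and uses no PETSc API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a library with explicit init/finalize calls,
// playing the role of PetscInitialize()/PetscFinalize() in the text above.
struct MockLibrary {
    bool active = false;
    std::vector<std::string> log;
    void initialize() { active = true;  log.push_back("initialize"); }
    void finalize()   { active = false; log.push_back("finalize"); }
};

// Hypothetical owner of library objects. Its destructor records whether the
// library was still active when the cleanup ran.
class Owner {
public:
    explicit Owner(MockLibrary& lib) : lib_(lib) {}
    ~Owner() {
        lib_.log.push_back(lib_.active ? "destroy: ok"
                                       : "destroy: after finalize");
    }
private:
    MockLibrary& lib_;
};

// Problematic ordering: the owner outlives finalize().
std::vector<std::string> run_unscoped() {
    MockLibrary lib;
    {
        Owner my(lib);
        lib.initialize();
        lib.finalize();
    }   // destructor of `my` runs here, after finalize()
    return lib.log;
}

// Fixed ordering: an inner scope ends the owner's lifetime before finalize().
std::vector<std::string> run_scoped() {
    MockLibrary lib;
    lib.initialize();
    {
        Owner my(lib);
    }   // destructor of `my` runs here, while the library is still active
    lib.finalize();
    return lib.log;
}
```

In the real program the "destroy: after finalize" case would correspond to a PETSc object being destroyed once the MPI state behind it is gone. The inner scope in run_scoped is the same device as the extra pair of braces in the snippet above.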
I don't know the details of how Fortran's final is implemented, but this is my current guess as to what is happening in your code; you need to somehow arrange for the module final to be called before PetscFinalize().
Barry
On Apr 28, 2021, at 7:22 AM, Deij-van Rijswijk, Menno <M.Deij at marin.nl> wrote:
The modules have automatic freeing only in the sense that a variable that is local to a subroutine and is ALLOCATE'd is automatically freed when the subroutine returns. I don't think that is problematic, as MatDestroy is used a lot in the code and normally executes just fine.
As far as I can see, no specific new communicators are created; MatCreateAIJ and MatCreateSeqAIJ are called with PETSC_COMM_WORLD and PETSC_COMM_SELF, respectively, as first argument.
We also run this with the Intel MPI library, which is based on MPICH. There this problem does not occur.
The Valgrind run did not produce any new insights (at least not for me); I have pasted the relevant bits at the end of this message. I also did a run on debug versions of PETSc (v3.14.5) and OpenMPI (v3.1.2), and I find the following stack trace with line numbers for each frame. Maybe that helps in further pinpointing the problem.
0x0000155540d11719 in ompi_comm_free (comm=0x483f4e0) at /home/mdeij/build-libs-gnu/superbuild/openmpi/src/ompi/communicator/comm.c:1470
1470 if ( ! OMPI_COMM_IS_INTRINSIC((*comm)->c_local_comm)) {
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.3.1-5.el8.0.2.x86_64 libgfortran-8.3.1-5.el8.0.2.x86_64 libibumad-47mlnx1-1.47329.x86_64 libibverbs-47mlnx1-1.47329.x86_64 libnl3-3.5.0-1.el8.x86_64 libquadmath-8.3.1-5.el8.0.2.x86_64 librdmacm-47mlnx1-1.47329.x86_64 libstdc++-8.3.1-5.el8.0.2.x86_64 libxml2-2.9.7-7.el8.x86_64 numactl-libs-2.0.12-9.el8.x86_64 opensm-libs-5.5.1.MLNX20191120.0c8dde0-0.1.47329.x86_64 openssl-libs-1.1.1c-15.el8.x86_64 python3-libs-3.6.8-23.el8.x86_64 sssd-client-2.2.3-20.el8.x86_64 ucx-cma-1.7.0-1.47329.x86_64 ucx-ib-1.7.0-1.47329.x86_64 xz-libs-5.2.4-3.el8.x86_64 zlib-1.2.11-16.el8_2.x86_64
(gdb) bt
#0 0x0000155540d11719 in ompi_comm_free (comm=0x483f4e0) at /home/mdeij/build-libs-gnu/superbuild/openmpi/src/ompi/communicator/comm.c:1470
#1 0x0000155540d4f1af in PMPI_Comm_free (comm=0x483f4e0) at pcomm_free.c:62
#2 0x000015555346329a in superlu_gridexit (grid=0x483f4e0) at /home/mdeij/install-gnu/extLibs/Linux-x86_64-Intel/superlu_dist-6.3.0/SRC/superlu_grid.c:174
#3 0x0000155553ca2ff1 in Petsc_Superlu_dist_keyval_Delete_Fn (comm=0x3921b10, keyval=16, attr_val=0x483f4d0, extra_state=0x0) at /home/mdeij/build-libs-gnu/superbuild/petsc/src/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c:97
#4 0x0000155540d0baa1 in ompi_attr_delete_impl (type=COMM_ATTR, object=0x3921b10, attr_hash=0x377efe0, key=16, predefined=true) at /home/mdeij/build-libs-gnu/superbuild/openmpi/src/ompi/attribute/attribute.c:1062
#5 0x0000155540d0c039 in ompi_attr_delete_all (type=COMM_ATTR, object=0x3921b10, attr_hash=0x377efe0) at /home/mdeij/build-libs-gnu/superbuild/openmpi/src/ompi/attribute/attribute.c:1166
#6 0x0000155540d11676 in ompi_comm_free (comm=0x7fffffffc5c0) at /home/mdeij/build-libs-gnu/superbuild/openmpi/src/ompi/communicator/comm.c:1462
#7 0x0000155540d4f1af in PMPI_Comm_free (comm=0x7fffffffc5c0) at pcomm_free.c:62
#8 0x000015555393fb68 in PetscCommDestroy (comm=0x3943a60) at /home/mdeij/build-libs-gnu/superbuild/petsc/src/src/sys/objects/tagm.c:217
#9 0x0000155553941e07 in PetscHeaderDestroy_Private (h=0x3943a20) at /home/mdeij/build-libs-gnu/superbuild/petsc/src/src/sys/objects/inherit.c:121
#10 0x000015555408edfe in MatDestroy (A=0x3558c18) at /home/mdeij/build-libs-gnu/superbuild/petsc/src/src/mat/interface/matrix.c:1306
#11 0x00001555540cb5fa in matdestroy_ (A=0x3558c18, __ierr=0x7fffffffc73c) at /home/mdeij/build-libs-gnu/superbuild/petsc/src/src/mat/interface/ftn-auto/matrixf.c:770
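Read from frame #11 up to frame #0, the backtrace shows a nested sequence: MatDestroy leads to PetscCommDestroy, which frees a communicator (frames #7 and #6). Freeing it runs an attribute delete callback (frames #5 to #3), and that callback calls superlu_gridexit, which frees a second communicator (frames #2 to #0). The toy model below only illustrates that call order. ToyComm, toy_comm_free and destroy_last_object are hypothetical names, no MPI is involved, and the sketch says nothing about why the inner free crashes.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy communicator: freeing it first runs the delete callbacks that were
// attached to it, loosely mirroring attribute deletion on a communicator.
struct ToyComm {
    std::string name;
    std::vector<std::function<void(std::vector<std::string>&)>> delete_callbacks;
};

// Free a toy communicator, recording entry and exit in `trace`.
void toy_comm_free(ToyComm& comm, std::vector<std::string>& trace) {
    trace.push_back("enter free(" + comm.name + ")");
    for (auto& callback : comm.delete_callbacks) {
        callback(trace);  // may free further communicators
    }
    trace.push_back("leave free(" + comm.name + ")");
}

// Destroying the last object on `outer` frees `outer`; its delete callback
// then frees the `grid` communicator from inside the outer free.
std::vector<std::string> destroy_last_object() {
    std::vector<std::string> trace;
    ToyComm grid{"grid", {}};
    ToyComm outer{"outer", {}};
    outer.delete_callbacks.push_back(
        [&grid](std::vector<std::string>& t) { toy_comm_free(grid, t); });
    toy_comm_free(outer, trace);
    return trace;
}
```

The recorded trace nests the free of "grid" inside the free of "outer", which is the same shape as frames #7 through #0 above.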
Valgrind output:
==1026905== Invalid read of size 1
==1026905== at 0x19184538: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x6943B61: superlu_gridexit (in /home/mdeij/install-gnu/extLibs/lib/libsuperlu_dist.so.6.3.0)
==1026905== by 0x56F398E: Petsc_Superlu_dist_keyval_Delete_Fn (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x1912447B: ompi_attr_delete_impl (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x19126FFE: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x4FEE49D: PetscCommDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x4FF0EE1: PetscHeaderDestroy_Private (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x5317899: MatDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x5336E58: matdestroy_ (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x1528710: __fsi_MOD_fem_constructmatricespetscexit (fsi.F90:2297)
==1026905== Address 0x2ce67398 is 11,112 bytes inside an unallocated block of size 11,232 in arena "client"
==1026905==
==1026905== Invalid read of size 8
==1026905== at 0x1912AC9A: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x6943B61: superlu_gridexit (in /home/mdeij/install-gnu/extLibs/lib/libsuperlu_dist.so.6.3.0)
==1026905== by 0x56F398E: Petsc_Superlu_dist_keyval_Delete_Fn (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x1912447B: ompi_attr_delete_impl (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x19126FFE: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x4FEE49D: PetscCommDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x4FF0EE1: PetscHeaderDestroy_Private (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x5317899: MatDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x5336E58: matdestroy_ (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== Address 0x2ce673c0 is 11,152 bytes inside an unallocated block of size 11,232 in arena "client"
==1026905==
==1026905== Invalid read of size 8
==1026905== at 0x19126E5B: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x6943B61: superlu_gridexit (in /home/mdeij/install-gnu/extLibs/lib/libsuperlu_dist.so.6.3.0)
==1026905== by 0x56F398E: Petsc_Superlu_dist_keyval_Delete_Fn (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x1912447B: ompi_attr_delete_impl (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x19126FFE: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x4FEE49D: PetscCommDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x4FF0EE1: PetscHeaderDestroy_Private (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x5317899: MatDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== Address 0x91 is not stack'd, malloc'd or (recently) free'd
==1026905==
==1026905==
==1026905== Process terminating with default action of signal 11 (SIGSEGV)
==1026905== Access not within mapped region at address 0x91
==1026905== at 0x19126E5B: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x6943B61: superlu_gridexit (in /home/mdeij/install-gnu/extLibs/lib/libsuperlu_dist.so.6.3.0)
==1026905== by 0x56F398E: Petsc_Superlu_dist_keyval_Delete_Fn (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x1912447B: ompi_attr_delete_impl (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x19126FFE: ompi_attr_delete_all (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x1912ACC6: ompi_comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x19184555: PMPI_Comm_free (in /home/mdeij/install-gnu/extLibs/lib/libmpi.so.40.10.2)
==1026905== by 0x4FEE49D: PetscCommDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x4FF0EE1: PetscHeaderDestroy_Private (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== by 0x5317899: MatDestroy (in /home/mdeij/install-gnu/extLibs/lib/libpetsc.so.3.14.5)
==1026905== If you believe this happened as a result of a stack
==1026905== overflow in your program's main thread (unlikely but
==1026905== possible), you can try to increase the size of the
==1026905== main thread stack using the --main-stacksize= flag.
==1026905== The main thread stack size used in this run was 16777216.
From: Barry Smith <bsmith at petsc.dev>
Sent: Friday, April 23, 2021 7:09 PM
To: Deij-van Rijswijk, Menno <M.Deij at marin.nl>
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] MatDestroy problem with multiple matrices and SUPERLU_DIST
Thanks for looking. Do these modules have any "automatic freeing" when variables go out of scope (like C++ classes do)?
Do you make specific new MPI communicators to use to create the matrices?
Have you tried MPICH or a different version of OpenMPI?
Maybe run the program with valgrind. The stack frames you sent look "funny"; that is, I would not normally expect them to be in such an order.
Barry