[petsc-users] HDF5 VecView segfaults with more than 1 processor

zakaryah zakaryah at gmail.com
Thu Apr 4 10:35:35 CDT 2019


I'm trying to output PETSc vectors from a DMComposite using VecView with an
HDF5 viewer.  This works fine with -n 1, but segfaults in parallel.  The
code is simple:

ierr = PetscViewerHDF5Open(PETSC_COMM_WORLD, fname, FILE_MODE_WRITE, &H5viewer);CHKERRQ(ierr);
ierr = VecView(X, H5viewer);CHKERRQ(ierr);
ierr = PetscViewerDestroy(&H5viewer);CHKERRQ(ierr);

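For context, X is the global vector of the DMComposite. A stripped-down
sketch of the kind of setup involved (the DMDA dimensions, sizes, and dof
below are placeholders, not my actual problem):

DM  da1, da2, packer;
Vec X;

/* Placeholder DMDAs -- my real code uses different DMs and sizes */
ierr = DMDACreate1d(PETSC_COMM_WORLD,DM_BOUNDARY_NONE,128,1,1,NULL,&da1);CHKERRQ(ierr);
ierr = DMSetUp(da1);CHKERRQ(ierr);
ierr = DMDACreate1d(PETSC_COMM_WORLD,DM_BOUNDARY_NONE,128,1,1,NULL,&da2);CHKERRQ(ierr);
ierr = DMSetUp(da2);CHKERRQ(ierr);

/* Pack them into a composite and get its global vector */
ierr = DMCompositeCreate(PETSC_COMM_WORLD,&packer);CHKERRQ(ierr);
ierr = DMCompositeAddDM(packer,da1);CHKERRQ(ierr);
ierr = DMCompositeAddDM(packer,da2);CHKERRQ(ierr);
ierr = DMCreateGlobalVector(packer,&X);CHKERRQ(ierr);
ierr = PetscObjectSetName((PetscObject)X,"X");CHKERRQ(ierr); /* HDF5 dataset name */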

I'm using PETSc 3.11.0, configured with --with-hdf5
--with-hdf5-include=/MYFS/hdf5-1.10.5/include/
--with-hdf5-lib=/MYFS/hdf5-1.10.5/lib/libhdf5.so, where hdf5-1.10.5 was
built with mpicc and --enable-parallel, per the instructions in its
INSTALL_parallel doc. The MPI is Open MPI 1.10.3a1.

According to valgrind, the segfault originates in H5G_traverse, with an
attempt to write 8 bytes to a null pointer (this run used -n 3):

==35379== Thread 1:
==35379== Invalid write of size 8
==35379==    at 0x1AEBD92B: ADIOI_Flatten (flatten.c:225)
==35379==    by 0x1AEBF82E: ADIOI_Flatten_datatype (flatten.c:80)
==35379==    by 0x1AEB5982: ADIO_Set_view (ad_set_view.c:52)
==35379==    by 0x1AE9B860: mca_io_romio_dist_MPI_File_set_view (set_view.c:155)
==35379==    by 0x92C0D3D: PMPI_File_set_view (pfile_set_view.c:75)
==35379==    by 0x830D963: H5FD_mpio_write (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x8101576: H5FD_write (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x80DD30B: H5F__accum_write (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x81F3830: H5PB_write (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x80E906A: H5F_block_write (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x808FEBB: H5D__chunk_allocate (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x80A1774: H5D__init_storage (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x80A7282: H5D__alloc_storage (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x80AE4A4: H5D__layout_oh_create (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x80A3EF2: H5D__create (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x80AF768: H5O__dset_create (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x81A3898: H5O_obj_create (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x81675E3: H5L__link_cb (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x8137680: H5G__traverse_real (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x8137E93: H5G_traverse (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

There are also many "possibly lost" warnings, starting in
PetscViewerHDF5Open but arising from libhdf5, for example:

==35379== 10 bytes in 1 blocks are possibly lost in loss record 3,259 of 14,406
==35379==    at 0x4C29BFD: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==35379==    by 0x810D495: H5FL_blk_malloc (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x81F8F10: H5RS_create (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x8129F3B: H5G__name_init (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x81318AF: H5G_mkroot (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x80E7016: H5F_open (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x80D7C62: H5Fcreate (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
==35379==    by 0x565F43C: PetscViewerFileSetName_HDF5 (hdf5v.c:314)
==35379==    by 0x568A828: PetscViewerFileSetName (filev.c:667)
==35379==    by 0x56624DD: PetscViewerHDF5Open (hdf5v.c:547)
==35379==    by 0x434092: myFunction (myFunction.c:276)
==35379==    by 0x42DC03: main (main.c:280)

With -n 1, I don't get any of these errors and there is no segfault.

I suppose this is an HDF5 problem, but I'm wondering if anyone has advice
on how to fix it. If there are issues with parallel HDF5, I'd be happy to
settle for sequential VecView, but I couldn't figure out how to get that to
work either.
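
For reference, the kind of sequential fallback I have in mind is sketched
below: gather the vector onto rank 0 with VecScatterCreateToZero, then
write the gathered copy with a viewer on PETSC_COMM_SELF. This is only a
sketch of the idea, not working code:

VecScatter  scatter;
Vec         Xseq;
PetscMPIInt rank;
PetscViewer seqviewer;

/* Gather the distributed vector onto rank 0 */
ierr = VecScatterCreateToZero(X,&scatter,&Xseq);CHKERRQ(ierr);
ierr = VecScatterBegin(scatter,X,Xseq,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
ierr = VecScatterEnd(scatter,X,Xseq,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);

/* Only rank 0 holds the full copy, so only rank 0 opens a sequential viewer */
ierr = MPI_Comm_rank(PETSC_COMM_WORLD,&rank);CHKERRQ(ierr);
if (!rank) {
  ierr = PetscObjectSetName((PetscObject)Xseq,"X");CHKERRQ(ierr); /* dataset name */
  ierr = PetscViewerHDF5Open(PETSC_COMM_SELF,fname,FILE_MODE_WRITE,&seqviewer);CHKERRQ(ierr);
  ierr = VecView(Xseq,seqviewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&seqviewer);CHKERRQ(ierr);
}
ierr = VecScatterDestroy(&scatter);CHKERRQ(ierr);
ierr = VecDestroy(&Xseq);CHKERRQ(ierr);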