[petsc-users] HDF5 VecView segfaults with more than 1 processor

Smith, Barry F. bsmith at mcs.anl.gov
Thu Apr 4 23:58:13 CDT 2019


   Could you send us sample code that reproduces the problem? We don't have many tests for DMComposite, so there may well be bugs in our VecView() for that case.

   Barry

   Also, the "possibly lost" messages from valgrind are not meaningful once the run has crashed or valgrind has reported errors, so they can be ignored in that situation.
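
   A reproducer along the following lines would be enough. This is only a sketch: the original message does not say what the DMComposite contains, so the sub-DMs (a 1-D DMDA plus a DMRedundant), the grid size, and the file name repro.h5 are all assumptions made for illustration.

```c
/* Hypothetical reproducer sketch. The contents of the DMComposite are
   assumed (1-D DMDA + DMRedundant); substitute the real sub-DMs. */
#include <petscdmda.h>
#include <petscdmredundant.h>
#include <petscdmcomposite.h>
#include <petscviewerhdf5.h>

int main(int argc, char **argv)
{
  DM             da, red, pack;
  Vec            X;
  PetscViewer    H5viewer;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Assumed sub-DM 1: 1-D distributed array, 64 points, 1 dof, stencil width 1 */
  ierr = DMDACreate1d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, 64, 1, 1, NULL, &da);CHKERRQ(ierr);
  ierr = DMSetFromOptions(da);CHKERRQ(ierr);
  ierr = DMSetUp(da);CHKERRQ(ierr);

  /* Assumed sub-DM 2: two redundant unknowns stored on rank 0 */
  ierr = DMRedundantCreate(PETSC_COMM_WORLD, 0, 2, &red);CHKERRQ(ierr);

  /* Pack both sub-DMs into one composite and get its global vector */
  ierr = DMCompositeCreate(PETSC_COMM_WORLD, &pack);CHKERRQ(ierr);
  ierr = DMCompositeAddDM(pack, da);CHKERRQ(ierr);
  ierr = DMCompositeAddDM(pack, red);CHKERRQ(ierr);
  ierr = DMSetFromOptions(pack);CHKERRQ(ierr);
  ierr = DMCreateGlobalVector(pack, &X);CHKERRQ(ierr);
  ierr = PetscObjectSetName((PetscObject)X, "X");CHKERRQ(ierr);
  ierr = VecSet(X, 1.0);CHKERRQ(ierr);

  /* The three calls from the report; "repro.h5" is a placeholder name */
  ierr = PetscViewerHDF5Open(PETSC_COMM_WORLD, "repro.h5", FILE_MODE_WRITE, &H5viewer);CHKERRQ(ierr);
  ierr = VecView(X, H5viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&H5viewer);CHKERRQ(ierr);

  ierr = VecDestroy(&X);CHKERRQ(ierr);
  ierr = DMDestroy(&red);CHKERRQ(ierr);
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  ierr = DMDestroy(&pack);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
```

   Running the same executable with -n 1 and then with -n 3 should show whether the crash depends on the DMComposite layout or only on the parallel HDF5 write.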


> On Apr 4, 2019, at 10:35 AM, zakaryah via petsc-users <petsc-users at mcs.anl.gov> wrote:
> 
> I'm trying to output PETSc vectors from a DMComposite using VecView with an HDF5 viewer.  This works fine with -n 1, but segfaults in parallel.  The code is simple:
> 
> ierr=PetscViewerHDF5Open(PETSC_COMM_WORLD,fname,FILE_MODE_WRITE,&H5viewer);CHKERRQ(ierr);
> ierr=VecView(X,H5viewer);CHKERRQ(ierr);
> ierr=PetscViewerDestroy(&H5viewer);CHKERRQ(ierr);
> 
> I'm using PETSc 3.11.0, configured with --with-hdf5 --with-hdf5-include=/MYFS/hdf5-1.10.5/include/ --with-hdf5-lib=/MYFS/hdf5-1.10.5/lib/libhdf5.so, where hdf5-1.10.5 was configured with mpicc, and --enable-parallel, per the instructions in the INSTALL_parallel doc.  I'm using openmpi 1.10.3a1.
> 
> According to valgrind, the segfault is an invalid 8-byte write to a null pointer in ADIOI_Flatten, reached through H5G_traverse (-n 3):
> 
> ==35379== Thread 1:
> ==35379== Invalid write of size 8
> ==35379==    at 0x1AEBD92B: ADIOI_Flatten (flatten.c:225)
> ==35379==    by 0x1AEBF82E: ADIOI_Flatten_datatype (flatten.c:80)
> ==35379==    by 0x1AEB5982: ADIO_Set_view (ad_set_view.c:52)
> ==35379==    by 0x1AE9B860: mca_io_romio_dist_MPI_File_set_view (set_view.c:155)
> ==35379==    by 0x92C0D3D: PMPI_File_set_view (pfile_set_view.c:75)
> ==35379==    by 0x830D963: H5FD_mpio_write (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x8101576: H5FD_write (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x80DD30B: H5F__accum_write (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x81F3830: H5PB_write (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x80E906A: H5F_block_write (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x808FEBB: H5D__chunk_allocate (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x80A1774: H5D__init_storage (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x80A7282: H5D__alloc_storage (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x80AE4A4: H5D__layout_oh_create (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x80A3EF2: H5D__create (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x80AF768: H5O__dset_create (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x81A3898: H5O_obj_create (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x81675E3: H5L__link_cb (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x8137680: H5G__traverse_real (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x8137E93: H5G_traverse (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> 
> There are also many "possibly lost" warnings, starting in PetscViewerHDF5Open but arising from libhdf5, for example:
> 
> ==35379== 10 bytes in 1 blocks are possibly lost in loss record 3,259 of 14,406
> ==35379==    at 0x4C29BFD: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==35379==    by 0x810D495: H5FL_blk_malloc (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x81F8F10: H5RS_create (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x8129F3B: H5G__name_init (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x81318AF: H5G_mkroot (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x80E7016: H5F_open (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x80D7C62: H5Fcreate (in /rugpfs/fs0/home/zfrentz/hdf5-1.10.5/lib/libhdf5.so.103.1.0)
> ==35379==    by 0x565F43C: PetscViewerFileSetName_HDF5 (hdf5v.c:314)
> ==35379==    by 0x568A828: PetscViewerFileSetName (filev.c:667)
> ==35379==    by 0x56624DD: PetscViewerHDF5Open (hdf5v.c:547)
> ==35379==    by 0x434092: myFunction (myFunction.c:276)
> ==35379==    by 0x42DC03: main (main.c:280)
> 
> With -n 1, I don't get any of these errors and there is no segfault.
> 
> I suppose this is an HDF5 problem but I'm wondering if anyone has advice for me on how to fix this.  If there are issues with parallel HDF5 then I'm happy to settle for sequential VecView, but I couldn't figure out how to get that to work either.
> 
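
   On the sequential fallback asked about above: one possible approach, offered as an untested sketch and not as a confirmed fix, is to gather the parallel vector onto rank 0 with VecScatterCreateToZero() and view the gathered copy through a viewer opened on PETSC_COMM_SELF. The helper name WriteVecFromRankZero, the dataset name "X", and the file name seq.h5 are hypothetical. With an HDF5 library built with --enable-parallel, the write may still go through MPI-IO, so this might not avoid the crash. The gathered entries are in PETSc's global ordering, not the natural ordering of any DMDA inside the composite.

```c
/* Sketch of a rank-0-only HDF5 write. WriteVecFromRankZero is a
   hypothetical helper, not a PETSc routine. */
#include <petscvec.h>
#include <petscviewerhdf5.h>

PetscErrorCode WriteVecFromRankZero(Vec X, const char fname[])
{
  VecScatter     scatter;
  Vec            Xseq;   /* full-length copy on rank 0, length 0 elsewhere */
  PetscViewer    viewer;
  PetscMPIInt    rank;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MPI_Comm_rank(PetscObjectComm((PetscObject)X), &rank);CHKERRQ(ierr);

  /* Collective: every rank sends its local part of X to rank 0 */
  ierr = VecScatterCreateToZero(X, &scatter, &Xseq);CHKERRQ(ierr);
  ierr = VecScatterBegin(scatter, X, Xseq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(scatter, X, Xseq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);

  /* Only rank 0 opens the file and writes the gathered vector */
  if (!rank) {
    ierr = PetscObjectSetName((PetscObject)Xseq, "X");CHKERRQ(ierr);
    ierr = PetscViewerHDF5Open(PETSC_COMM_SELF, fname, FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
    ierr = VecView(Xseq, viewer);CHKERRQ(ierr);
    ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
  }

  ierr = VecScatterDestroy(&scatter);CHKERRQ(ierr);
  ierr = VecDestroy(&Xseq);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

int main(int argc, char **argv)
{
  Vec            X;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Stand-in for the DMComposite global vector: 100 entries, all set to 1 */
  ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, 100, &X);CHKERRQ(ierr);
  ierr = VecSet(X, 1.0);CHKERRQ(ierr);

  ierr = WriteVecFromRankZero(X, "seq.h5");CHKERRQ(ierr);

  ierr = VecDestroy(&X);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
```

   The trade-off is that rank 0 must hold the whole vector in memory, which is acceptable for modest problem sizes but does not scale.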


