[petsc-users] VecView to hdf5 broken for large (complex) vectors

Smith, Barry F. bsmith at mcs.anl.gov
Tue Apr 16 22:17:05 CDT 2019


https://bitbucket.org/petsc/petsc/pull-requests/1551/chunksize-could-overflow-and-become/diff

With this fix I can run with your vector size on 1 process. With 2 processes I get

$ petscmpiexec -n 2 ./ex1 
Assertion failed in file adio/common/ad_write_coll.c at line 904: (curr_to_proc[p] + len - done_to_proc[p]) == (unsigned) (curr_to_proc[p] + len - done_to_proc[p])
0   libpmpi.0.dylib                     0x0000000111241f3e backtrace_libc + 62
1   libpmpi.0.dylib                     0x0000000111241ef5 MPL_backtrace_show + 21
2   libpmpi.0.dylib                     0x000000011119f85a MPIR_Assert_fail + 90
3   libpmpi.0.dylib                     0x00000001111a15f3 MPIR_Ext_assert_fail + 35
4   libmpi.0.dylib                      0x0000000110eee16e ADIOI_Fill_send_buffer + 1134
5   libmpi.0.dylib                      0x0000000110eefe74 ADIOI_W_Exchange_data + 2980
6   libmpi.0.dylib                      0x0000000110eed7ad ADIOI_Exch_and_write + 3197
7   libmpi.0.dylib                      0x0000000110eec854 ADIOI_GEN_WriteStridedColl + 2004
8   libpmpi.0.dylib                     0x000000011128ad4b MPIOI_File_write_all + 1179
9   libmpi.0.dylib                      0x0000000110ec382b MPI_File_write_at_all + 91
10  libhdf5.10.dylib                    0x00000001108b982a H5FD_mpio_write + 1466
11  libhdf5.10.dylib                    0x00000001108b127a H5FD_write + 634
12  li

This looks like an int overflow in MPI-IO. (It is scary to see ints rather than 64-bit integers in the ADIO code, but I guess it works somehow; maybe this is a strange corner case. I don't know whether the problem is in HDF5 or in MPI-IO.)
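To illustrate what an assertion of the form x == (unsigned) x guards against, here is a minimal sketch. It assumes the quantity being checked is a 64-bit signed length and that unsigned is 32 bits wide; the actual types in ad_write_coll.c are not shown in the trace above, and the helper name fits_in_unsigned32 is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 when a 64-bit length survives a round trip
   through a 32-bit unsigned value, 0 otherwise. This mirrors the shape of the
   failed assertion, under the assumption that unsigned is 32 bits wide. */
static int fits_in_unsigned32(int64_t len)
{
  /* Conversion to uint32_t is well defined (reduction modulo 2^32). */
  return (int64_t)(uint32_t)len == len;
}
```

Under these assumptions, any per-process length of 2^32 bytes or more (or a negative length produced by an earlier overflow) would fail the check.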

On 4 and 8 processes it runs.

Note that you are playing with a very dangerous size. 32768 * 32768 * 2 = 2^31, which is one more than the largest 32-bit int and so comes out as a negative number in int. So this is essentially the largest problem you can run before switching to 64-bit indices for PETSc.
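The arithmetic can be checked with a small sketch. It assumes the factor of 2 counts the real and imaginary parts of each complex entry; the function names real_entries and fits_in_int32 are made up for illustration and are not PETSc API. The product is computed in 64 bits, because overflowing a signed int directly is undefined behavior in C.

```c
#include <assert.h>
#include <stdint.h>

/* Number of real scalars in an n x n complex array, computed in 64 bits
   so the product itself cannot overflow for any n that fits in 32 bits. */
static int64_t real_entries(int64_t n)
{
  assert(n >= 0 && n <= INT32_MAX);
  return n * n * 2;
}

/* Returns 1 if the count can be stored in a signed 32-bit int, 0 otherwise. */
static int fits_in_int32(int64_t count)
{
  return count >= 0 && count <= INT32_MAX;
}
```

For n = 32768 the count is 2147483648, which is INT32_MAX + 1, so fits_in_int32 returns 0; for n = 32767 it still fits.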

  Barry



> On Apr 16, 2019, at 9:32 AM, Sajid Ali via petsc-users <petsc-users at mcs.anl.gov> wrote:
> 
> Hi PETSc developers,
> 
> I’m trying to write a large vector created with VecCreateMPI (size 32768x32768) concurrently from 4 nodes (32 tasks per node, 128 MPI ranks in total), and I see the following (indicative) error. [The full error log is here: https://file.io/CdjUfe]
> 
> HDF5-DIAG: Error detected in HDF5 (1.10.5) MPI-process 52:
>   #000: H5D.c line 145 in H5Dcreate2(): unable to create dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #001: H5Dint.c line 329 in H5D__create_named(): unable to create and link to dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #002: H5L.c line 1557 in H5L_link_object(): unable to create new link to object
>     major: Links
>     minor: Unable to initialize object
>   #003: H5L.c line 1798 in H5L__create_real(): can't insert link
>     major: Links
>     minor: Unable to insert object
>   #004: H5Gtraverse.c line 851 in H5G_traverse(): internal path traversal failed
>     major: Symbol table
> HDF5-DIAG: Error detected in HDF5 (1.10.5) MPI-process 59:                              
>   #000: H5D.c line 145 in H5Dcreate2(): unable to create dataset                        
>     major: Dataset                                                                      
>     minor: Unable to initialize object                                                  
>   #001: H5Dint.c line 329 in H5D__create_named(): unable to create and link to dataset  
>     major: Dataset                                                                      
>     minor: Unable to initialize object                                                  
>   #002: H5L.c line 1557 in H5L_link_object(): unable to create new link to object       
>     major: Links                                                                        
>     minor: Unable to initialize object                                                  
>   #003: H5L.c line 1798 in H5L__create_real(): can't insert link                        
>     major: Links                                                                        
>     minor: Unable to insert object                                                      
>   #004: H5Gtraverse.c line 851 in H5G_traverse(): internal path traversal failed        
>     major: Symbol table                                                                 
>     minor: Object not found                                                             
>   #005: H5Gtraverse.c line 627 in H5G__traverse_real(): traversal operator failed       
>     major: Symbol table                                                                 
>     minor: Callback failed                                                              
>   #006: H5L.c line 1604 in H5L__link_cb(): unable to create object                      
>     major: Links                                                                        
>     minor: Unable to initialize object                                                  
>   #007: H5Oint.c line 2453 in H5O_obj_create(): unable to open object                   
>     major: Object header                                                                
>     minor: Can't open object                                                            
>   #008: H5Doh.c line 300 in H5O__dset_create(): unable to create dataset                
>     minor: Object not found                                                             
>   #005: H5Gtraverse.c line 627 in H5G__traverse_real(): traversal operator failed       
>     major: Symbol table                                                                 
>     minor: Callback failed                                                              
>   #006: H5L.c line 1604 in H5L__link_cb(): unable to create object                      
>     major: Links                                                                        
>     minor: Unable to initialize object                                                  
>   #007: H5Oint.c line 2453 in H5O_obj_create(): unable to open object                   
>     major: Object header                                                                
>     minor: Can't open object                                                            
>   #008: H5Doh.c line 300 in H5O__dset_create(): unable to create dataset                
>     major: Dataset                                                                      
>     minor: Unable to initialize object                                                  
>   #009: H5Dint.c line 1274 in H5D__create(): unable to construct layout information
>     major: Dataset
>     minor: Unable to initialize object
>   #010: H5Dchunk.c line 872 in H5D__chunk_construct(): unable to set chunk sizes
>     major: Dataset
>     minor: Bad value
>   #011: H5Dchunk.c line 831 in H5D__chunk_set_sizes(): chunk size must be < 4GB
>     major: Dataset
>     minor: Unable to initialize object
>     major: Dataset
>     minor: Unable to initialize object
>   #009: H5Dint.c line 1274 in H5D__create(): unable to construct layout information
>     major: Dataset
>     minor: Unable to initialize object
>   #010: H5Dchunk.c line 872 in H5D__chunk_construct(): unable to set chunk sizes
>     major: Dataset
>     minor: Bad value
>   #011: H5Dchunk.c line 831 in H5D__chunk_set_sizes(): chunk size must be < 4GB
>     major: Dataset
>     minor: Unable to initialize object
> .......
> 
> I spoke to Barry last evening, who said that this is a known error that was fixed for DMDA vecs but is still broken for non-DMDA vecs.
> 
> Could this be fixed?
> 
> 
> Thank You, 
> Sajid Ali
> Applied Physics
> Northwestern University


