[petsc-users] VecView to hdf5 broken for large (complex) vectors
Smith, Barry F.
bsmith at mcs.anl.gov
Tue Apr 16 22:17:05 CDT 2019
https://bitbucket.org/petsc/petsc/pull-requests/1551/chunksize-could-overflow-and-become/diff
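The bug class here is the usual one: the chunk size gets computed in an int that can wrap before it is checked against HDF5's 4 GiB-per-chunk limit. A rough sketch of the kind of guard that is needed (an illustration only, not the actual diff in the branch; safe_chunk_dim is a hypothetical helper):

  #include <stdint.h>
  #include <hdf5.h>    /* for hsize_t */

  /* Hypothetical helper: pick a 1-D chunk length that keeps each chunk below
     HDF5's 4 GiB limit.  The byte count is done in 64 bits on purpose, so it
     cannot wrap the way a 32-bit int would for very large vectors. */
  static hsize_t safe_chunk_dim(hsize_t dataset_len, size_t bytes_per_entry)
  {
    const uint64_t max_chunk_bytes = (UINT64_C(1) << 32) - 1;        /* < 4 GiB */
    const uint64_t total_bytes     = (uint64_t)dataset_len * bytes_per_entry;
    if (total_bytes <= max_chunk_bytes) return dataset_len;          /* one chunk is fine */
    return (hsize_t)(max_chunk_bytes / bytes_per_entry);             /* otherwise cap it */
  }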
With this fix I can run with your vector size on 1 process. With 2 processes I get
$ petscmpiexec -n 2 ./ex1
Assertion failed in file adio/common/ad_write_coll.c at line 904: (curr_to_proc[p] + len - done_to_proc[p]) == (unsigned) (curr_to_proc[p] + len - done_to_proc[p])
0 libpmpi.0.dylib 0x0000000111241f3e backtrace_libc + 62
1 libpmpi.0.dylib 0x0000000111241ef5 MPL_backtrace_show + 21
2 libpmpi.0.dylib 0x000000011119f85a MPIR_Assert_fail + 90
3 libpmpi.0.dylib 0x00000001111a15f3 MPIR_Ext_assert_fail + 35
4 libmpi.0.dylib 0x0000000110eee16e ADIOI_Fill_send_buffer + 1134
5 libmpi.0.dylib 0x0000000110eefe74 ADIOI_W_Exchange_data + 2980
6 libmpi.0.dylib 0x0000000110eed7ad ADIOI_Exch_and_write + 3197
7 libmpi.0.dylib 0x0000000110eec854 ADIOI_GEN_WriteStridedColl + 2004
8 libpmpi.0.dylib 0x000000011128ad4b MPIOI_File_write_all + 1179
9 libmpi.0.dylib 0x0000000110ec382b MPI_File_write_at_all + 91
10 libhdf5.10.dylib 0x00000001108b982a H5FD_mpio_write + 1466
11 libhdf5.10.dylib 0x00000001108b127a H5FD_write + 634
12 li
Looks like an int overflow in MPI-IO. (It is scary to see plain ints in the ADIO code instead of 64-bit integers, but I guess it somehow works; maybe this is a strange corner case, and I don't know whether the problem is in HDF5 or MPI-IO.)
On 4 and 8 processes it runs.
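For reference, the test I ran is essentially the following sketch (my own minimal stand-in for the ex1 above, assuming a complex, HDF5-enabled PETSc build; the output file name "vec.h5" and object name "u" are just placeholders):

  #include <petscvec.h>
  #include <petscviewerhdf5.h>

  int main(int argc, char **argv)
  {
    Vec            x;
    PetscViewer    viewer;
    PetscInt       n = 32768;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
    /* Global size 32768*32768 = 2^30 entries (complex scalars in this build) */
    ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n*n, &x);CHKERRQ(ierr);
    ierr = PetscObjectSetName((PetscObject)x, "u");CHKERRQ(ierr);
    ierr = VecSet(x, 1.0);CHKERRQ(ierr);
    /* VecView to an HDF5 viewer is where the chunk-size trouble shows up */
    ierr = PetscViewerHDF5Open(PETSC_COMM_WORLD, "vec.h5", FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
    ierr = VecView(x, viewer);CHKERRQ(ierr);
    ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }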
Note that you are playing with a very dangerous size: 32768 * 32768 * 2 overflows to a negative number in a 32-bit int, so this is essentially the largest problem you can run before switching to 64-bit indices in PETSc.
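To spell out the arithmetic: 32768 * 32768 * 2 = 2,147,483,648 = 2^31, one past INT_MAX = 2,147,483,647, so it wraps to a negative value in a signed 32-bit int. And since each complex double entry is 16 bytes, the whole vector is 32768 * 32768 * 16 bytes = 16 GiB, well above the 4 GiB HDF5 chunk limit that shows up in your trace, so the viewer has to split the dataset into several chunks. For anything larger you would configure PETSc with --with-64-bit-indices.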
Barry
> On Apr 16, 2019, at 9:32 AM, Sajid Ali via petsc-users <petsc-users at mcs.anl.gov> wrote:
>
> Hi PETSc developers,
>
> I’m trying to write a large vector created with VecCreateMPI (size 32768x32768) concurrently from 4 nodes (32 tasks per node, 128 MPI ranks in total), and I see the following (indicative) error: [full error log is here: https://file.io/CdjUfe]
>
> HDF5-DIAG: Error detected in HDF5 (1.10.5) MPI-process 52:
>   #000: H5D.c line 145 in H5Dcreate2(): unable to create dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #001: H5Dint.c line 329 in H5D__create_named(): unable to create and link to dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #002: H5L.c line 1557 in H5L_link_object(): unable to create new link to object
>     major: Links
>     minor: Unable to initialize object
>   #003: H5L.c line 1798 in H5L__create_real(): can't insert link
>     major: Links
>     minor: Unable to insert object
>   #004: H5Gtraverse.c line 851 in H5G_traverse(): internal path traversal failed
>     major: Symbol table
>     minor: Object not found
>   #005: H5Gtraverse.c line 627 in H5G__traverse_real(): traversal operator failed
>     major: Symbol table
>     minor: Callback failed
>   #006: H5L.c line 1604 in H5L__link_cb(): unable to create object
>     major: Links
>     minor: Unable to initialize object
>   #007: H5Oint.c line 2453 in H5O_obj_create(): unable to open object
>     major: Object header
>     minor: Can't open object
>   #008: H5Doh.c line 300 in H5O__dset_create(): unable to create dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #009: H5Dint.c line 1274 in H5D__create(): unable to construct layout information
>     major: Dataset
>     minor: Unable to initialize object
>   #010: H5Dchunk.c line 872 in H5D__chunk_construct(): unable to set chunk sizes
>     major: Dataset
>     minor: Bad value
>   #011: H5Dchunk.c line 831 in H5D__chunk_set_sizes(): chunk size must be < 4GB
>     major: Dataset
>     minor: Unable to initialize object
> (MPI-process 59 reports the identical trace; the two were interleaved in the original output.)
> .......
>
> I spoke to Barry last evening, who said that this is a known error that was fixed for DMDA Vecs but remains broken for non-DMDA Vecs.
>
> Could this be fixed?
>
>
> Thank You,
> Sajid Ali
> Applied Physics
> Northwestern University