[Darshan-users] HDF5 chunking and Darshan with enable-hdf5-mod results in error
Latham, Robert J.
robl at mcs.anl.gov
Tue Apr 20 12:11:05 CDT 2021
On Mon, 2021-04-19 at 14:17 +0200, Tobias Meisel wrote:
> Hi all,
> I’ve built Darshan 3.2.1 with --enable-hdf5-mod=
>
>
> When I began writing chunks collectively with HDF5, I got errors, but only
> when Darshan instrumentation was enabled.
> The darshan instrumentation without hdf5 chunking worked fine (also
> with --enable-hdf5-mod).
The errors are of two kinds:
- "H5Shyper.c line 12116 in H5Sget_regular_hyperslab(): not a hyperslab
selection"
-- Gerd Heber says some processes are calling H5Sselect_none which is
not a "regular hyperslab selection". We've seen this warning before in
other workloads this week.
But there's a more alarming error -- not just a warning but an integer
divide-by-zero:
[archlinux:31445] *** Process received signal ***
[archlinux:31445] Signal: Floating point exception (8)
[archlinux:31445] Signal code: Integer divide-by-zero (1)
[archlinux:31445] Failing at address: 0x7f3f980614bf
[archlinux:31445] [ 0] /usr/lib/libpthread.so.0(+0x13960)[0x7f3f99bbd960]
[archlinux:31445] [ 1] /usr/lib/openmpi/openmpi/mca_io_ompio.so(mca_io_ompio_file_get_byte_offset+0x3f)[0x7f3f980614bf]
[archlinux:31445] [ 2] /usr/lib/openmpi/libmpi.so.40(PMPI_File_get_byte_offset+0x70)[0x7f3f99f745b0]
[archlinux:31445] [ 3] /usr/local/lib/libdarshan.so(MPI_File_write_at_all+0x197)[0x7f3f9a4b76b7]
Darshan's wrappers never divide anything themselves, but OpenMPI's
mca_io_ompio_file_get_byte_offset does:
https://github.com/open-mpi/ompi/blob/master/ompi/mca/io/ompio/io_ompio_file_open.c#L511
Lots of questions here:
- OpenMPI-IO shouldn't ever crash based on user input, but it does.
- How is Darshan feeding OpenMPI-IO such a bogus payload?
- How is the hyperslab selection, or lack thereof, triggering all this?
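
To make the suspected failure mode concrete, here is a minimal sketch in
plain C (no MPI needed). It rests on an assumption I have not verified
against the OMPIO source linked above: that the byte-offset computation
divides by the number of data bytes in the file view, and that a process
which selected nothing ends up with a view of size zero. The struct, its
field names, and byte_offset_sketch() are all hypothetical and only
illustrate the kind of guard that would avoid the SIGFPE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a file view; NOT the real OMPIO structure. */
typedef struct {
    int64_t view_extent; /* bytes spanned by one repetition of the filetype */
    int64_t view_size;   /* data bytes in one filetype; 0 if nothing is selected */
    int64_t etype_size;  /* bytes per elementary type */
} view_sketch;

/*
 * Convert an offset counted in etypes into the byte displacement of the
 * filetype repetition that contains it (the remainder inside that
 * repetition is ignored to keep the sketch short).
 *
 * Returns false, leaving *disp untouched, when the view holds no data,
 * instead of performing an integer division by zero.
 */
static bool byte_offset_sketch(const view_sketch *v, int64_t offset, int64_t *disp)
{
    if (v->view_size == 0) {
        return false; /* empty view: assumed result of an empty selection */
    }
    int64_t data_bytes = offset * v->etype_size;
    *disp = v->view_extent * (data_bytes / v->view_size);
    return true;
}
```

A caller would check the return value before using the displacement: with
a view of extent 16, size 8 and etype size 4, offset 4 maps to byte 32,
whereas an empty view returns false rather than crashing.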
==rob
> I have opened a topic at the HDF group forum:
>
> https://forum.hdfgroup.org/t/parallel-hdf5-write-with-irregular-size-in-one-dimension/8284/5
> The resulting errors are posted to this topic as well.
>
> The minimal example to reproduce the error is also here:
>
> https://dbkt.hdfgroup.org/original/2X/c/c58be4df192333b6d15d6e91d58b114b85cea2f4.cc
>
> My setup is:
> OpenMPI (OpenRTE) 4.0.5
> HDF5 1.12.0
> Darshan 3.2.1
>
> Could you take a look and check if there is a problem with the HDF5
> instrumentation in Darshan?
>
> Thank you
>
> Tobias