[Darshan-users] HDF5 chunking and Darshan with enable-hdf5-mod results in error

Tue Apr 20 12:11:05 CDT 2021

On Mon, 2021-04-19 at 14:17 +0200, Tobias Meisel wrote:
> Hi all,
> I’ve built Darshan 3.2.1 with --enable-hdf5-mod=
>
> When I began writing chunks collectively with HDF5, I got errors, but
> only when Darshan instrumentation is enabled.
> Darshan instrumentation without HDF5 chunking worked fine (also
> with --enable-hdf5-mod).

The errors are of two kinds. The first is a warning:
- "H5Shyper.c line 12116 in H5Sget_regular_hyperslab(): not a hyperslab selection"
-- Gerd Heber says some processes are calling H5Sselect_none, which is
not a "regular hyperslab selection".  We've seen this warning before in
other workloads this week.

But there's a more alarming error, not just a warning but a divide by
zero:

[archlinux:31445] *** Process received signal ***
[archlinux:31445] Signal: Floating point exception (8)
[archlinux:31445] Signal code: Integer divide-by-zero (1)
[archlinux:31445] Failing at address: 0x7f3f980614bf
[archlinux:31445] [ 0] /usr/lib/libpthread.so.0(+0x13960)[0x7f3f99bbd960]
[archlinux:31445] [ 1] /usr/lib/openmpi/openmpi/mca_io_ompio.so(mca_io_ompio_file_get_byte_offset+0x3f)[0x7f3f980614bf]
[archlinux:31445] [ 2] /usr/lib/openmpi/libmpi.so.40(PMPI_File_get_byte_offset+0x70)[0x7f3f99f745b0]
[archlinux:31445] [ 3] /usr/local/lib/libdarshan.so(MPI_File_write_at_all+0x197)[0x7f3f9a4b76b7]

Darshan doesn't divide anything by anything in its wrappers, but
OpenMPI does:

https://github.com/open-mpi/ompi/blob/master/ompi/mca/io/ompio/io_ompio_file_open.c#L511
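
To make the failure mode concrete, here is a minimal Python sketch of the
kind of arithmetic involved. It is NOT the OpenMPI source. The function
name, the parameter names, and the assumption that the divisor is the size
of the file view are all mine, for illustration only. The idea is that
converting an offset counted in etypes into a byte position means dividing
by the number of data bytes in the view, and a rank with an empty view
would make that divisor zero.

```python
# Hypothetical model for illustration only; not taken from OpenMPI.
# All names here are assumptions, not real OpenMPI identifiers.

def byte_offset(offset, etype_size, view_size, view_extent):
    """Byte position of the start of the view tile that contains `offset`.

    offset      -- position in the file, counted in etypes
    etype_size  -- size of one etype in bytes
    view_size   -- data bytes in one tile of the file view (assumed divisor)
    view_extent -- bytes spanned by one tile, including holes
    """
    if view_size == 0:
        # An empty view has no data bytes, so the division below is
        # undefined. Report that instead of dividing.
        raise ValueError("file view has zero size")
    whole_tiles = (offset * etype_size) // view_size
    return view_extent * whole_tiles


if __name__ == "__main__":
    # 4 etypes of 4 bytes = 16 data bytes = 2 full tiles of 8 data bytes,
    # and each tile spans 16 bytes in the file.
    print(byte_offset(4, 4, 8, 16))
    try:
        byte_offset(4, 4, 0, 0)  # empty view
    except ValueError as err:
        print("rejected:", err)
```

Without a guard like the one above, the same integer division in C raises
SIGFPE, which matches the "Floating point exception (8)" and "Integer
divide-by-zero (1)" lines in the trace. Whether an empty view is really
what reaches OpenMPI here is still a guess.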

Lots of questions here:
- Why does OpenMPI-IO crash based on user input? It shouldn't ever do that.
- How is Darshan feeding OpenMPI-IO such a bogus payload?
- How is the hyperslab selection, or lack thereof, triggering all this?

==rob

> I have opened a topic at the HDF Group forum:
> https://forum.hdfgroup.org/t/parallel-hdf5-write-with-irregular-size-in-one-dimension/8284/5
> The resulting errors are posted to this topic as well.
>
> The minimal example to reproduce the error is also here:
> https://dbkt.hdfgroup.org/original/2X/c/c58be4df192333b6d15d6e91d58b114b85cea2f4.cc
> 
> My setup is:
> OpenMPI (OpenRTE) 4.0.5
> HDF5 1.12.0
> Darshan 3.2.1
> 
> Could you take a look and check if there is a problem with the HDF5
> instrumentation in Darshan?
> 
> Thank you
> 
> Tobias