[Darshan-users] HDF5 chunking and Darshan with enable-hdf5-mod results in error

Snyder, Shane ssnyder at mcs.anl.gov
Fri Apr 30 14:33:38 CDT 2021


Thanks again for reporting these issues, Tobias. Just wanted to close the loop on the mailing list:

  *   The HDF5 warnings related to hyperslab selections were due to a bug in Darshan that is now fixed.
  *   The crashes appear to be due to an OpenMPI bug that Darshan happens to trigger for some workloads. We have also modified Darshan to avoid triggering this bug when using OpenMPI, so you should be safe while the OpenMPI folks continue to investigate the underlying issue (see: https://github.com/open-mpi/ompi/issues/8841).


Both of these bug fixes are available starting in the darshan-3.3.0-pre2 pre-release that came out today, and they will be included in the stable 3.3.0 release that we plan to have available next week.

Please let us know if you have any further issues related to this and we'd be happy to investigate more.

--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Latham, Robert J. <robl at mcs.anl.gov>
Sent: Tuesday, April 20, 2021 12:11 PM
To: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>; tobias.meisel at ufz.de <tobias.meisel at ufz.de>
Subject: Re: [Darshan-users] HDF5 chunking and Darshan with enable-hdf5-mod results in error

On Mon, 2021-04-19 at 14:17 +0200, Tobias Meisel wrote:
> Hi all,
> I've built Darshan 3.2.1 with --enable-hdf5-mod=
>
>
> When I began writing chunks collectively with HDF5, I got errors, but only
> when Darshan instrumentation was enabled.
> Darshan instrumentation without HDF5 chunking worked fine (also
> with --enable-hdf5-mod).

The errors are of two kinds:
- "H5Shyper.c line 12116 in H5Sget_regular_hyperslab(): not a hyperslab selection"
-- Gerd Heber says some processes are calling H5Sselect_none, which is
not a "regular hyperslab selection".  We've seen this warning before in
other workloads this week.

But there's a more alarming error -- not just a warning but a divide by
zero

[archlinux:31445] *** Process received signal ***
[archlinux:31445] Signal: Floating point exception (8)
[archlinux:31445] Signal code: Integer divide-by-zero (1)
[archlinux:31445] Failing at address: 0x7f3f980614bf
[archlinux:31445] [ 0] /usr/lib/libpthread.so.0(+0x13960)[0x7f3f99bbd960]
[archlinux:31445] [ 1] /usr/lib/openmpi/openmpi/mca_io_ompio.so(mca_io_ompio_file_get_byte_offset+0x3f)[0x7f3f980614bf]
[archlinux:31445] [ 2] /usr/lib/openmpi/libmpi.so.40(PMPI_File_get_byte_offset+0x70)[0x7f3f99f745b0]
[archlinux:31445] [ 3] /usr/local/lib/libdarshan.so(MPI_File_write_at_all+0x197)[0x7f3f9a4b76b7]

Darshan doesn't divide anything by anything in its wrappers, but
OpenMPI does:

https://github.com/open-mpi/ompi/blob/master/ompi/mca/io/ompio/io_ompio_file_open.c#L511
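To illustrate how an offset conversion like this can divide by zero, here is a minimal sketch. It assumes the divisor is the number of data bytes in the file view's filetype, which would be zero for a rank that selected nothing; that assumption is not confirmed in this thread. The struct file_view and the function view_offset_to_bytes are hypothetical names for illustration only, not OpenMPI's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a file view; NOT OpenMPI's real data structure. */
struct file_view {
    long long disp;            /* byte displacement where the view starts */
    long long etype_size;      /* size of one elementary type, in bytes */
    long long filetype_size;   /* data bytes in one filetype tile (0 if empty) */
    long long filetype_extent; /* bytes one filetype tile spans in the file */
};

/*
 * Convert a view-relative offset (counted in etypes) to an absolute byte
 * offset. Simplification: the data in each tile is assumed to be contiguous
 * and to start at the beginning of the tile.
 *
 * Returns 0 on success, or -1 if the view holds no data.
 */
int view_offset_to_bytes(const struct file_view *v, long long offset,
                         long long *out)
{
    assert(v != NULL && out != NULL);

    /* Without this guard, the division below traps with SIGFPE
     * (integer divide-by-zero) whenever the filetype has size 0. */
    if (v->filetype_size == 0) {
        return -1;
    }

    long long data_bytes = offset * v->etype_size;
    long long full_tiles = data_bytes / v->filetype_size;
    long long remainder  = data_bytes % v->filetype_size;

    *out = v->disp + full_tiles * v->filetype_extent + remainder;
    return 0;
}
```

With a 4-byte etype and 8 data bytes per 16-byte tile, offset 3 maps to byte 20 (one full tile plus 4 bytes). A view whose filetype_size is 0 returns -1 instead of crashing.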

Lots of questions here:
- OpenMPI-IO shouldn't ever crash based on user input, but it does.
- How is Darshan feeding OpenMPI-IO such a bogus payload?
- How is the hyperslab selection, or lack thereof, triggering all this?

==rob

> I have opened a topic at the HDF group forum:
>
> https://forum.hdfgroup.org/t/parallel-hdf5-write-with-irregular-size-in-one-dimension/8284/5
> The resulting errors are posted to this topic as well.
>
> The minimal example to reproduce the error is also here:
>
> https://dbkt.hdfgroup.org/original/2X/c/c58be4df192333b6d15d6e91d58b114b85cea2f4.cc
>
> My Setup is:
> OpenMPI (OpenRTE) 4.0.5
> HDF5 1.12.0
> Darshan 3.2.1
>
> Could you take a look and check if there is a problem with the HDF5
> instrumentation in Darshan?
>
> Thank you
>
> Tobias
> _______________________________________________
> Darshan-users mailing list
> Darshan-users at lists.mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users