[Darshan-users] HDF5 chunking and Darshan with enable-hdf5-mod results in error
ssnyder at mcs.anl.gov
Fri Apr 30 14:33:38 CDT 2021
Thanks again for reporting these issues, Tobias. Just wanted to close the loop on the mailing list:
* The HDF5 warnings related to hyperslab selections were due to a bug in Darshan that is now fixed.
* The crashes appear to be due to an OpenMPI bug that Darshan happens to trigger for some workloads. We have similarly modified Darshan to avoid triggering this bug when using OpenMPI, so it should be safe to use while the OpenMPI folks continue to investigate the underlying issue (see: https://github.com/open-mpi/ompi/issues/8841).
Both of these bug fixes are available starting in the darshan-3.3.0-pre2 pre-release that just came out today, and they will obviously be included in the stable 3.3.0 release that we plan to have available next week.
Please let us know if you have any further issues related to this and we'd be happy to investigate more.
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Latham, Robert J. <robl at mcs.anl.gov>
Sent: Tuesday, April 20, 2021 12:11 PM
To: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>; tobias.meisel at ufz.de <tobias.meisel at ufz.de>
Subject: Re: [Darshan-users] HDF5 chunking and Darshan with enable-hdf5-mod results in error
On Mon, 2021-04-19 at 14:17 +0200, Tobias Meisel wrote:
> Hi all,
> I’ve built 3.2.1 Darshan with --enable-hdf5-mod=
> When I began writing chunks collectively with hdf5 I got errors only
> when darshan instrumentation is enabled.
> The darshan instrumentation without hdf5 chunking worked fine (also
> with --enable-hdf5-mod).
The errors are of two kinds:
- "H5Shyper.c line 12116 in H5Sget_regular_hyperslab(): not a hyperslab
-- Gerd Heber says some processes are calling H5Sselect_none which is
not a "regular hyperslab selection". We've seen this warning before in
other workloads this week.
But there's a more alarming error -- not just a warning but a divide by zero:
[archlinux:31445] *** Process received signal ***
[archlinux:31445] Signal: Floating point exception (8)
[archlinux:31445] Signal code: Integer divide-by-zero (1)
[archlinux:31445] Failing at address: 0x7f3f980614bf
[archlinux:31445] [ 0]
[archlinux:31445] [ 1]
[archlinux:31445] [ 2]
[archlinux:31445] [ 3]
Darshan doesn't divide anything by anything in its wrappers, but
Lots of questions here:
- OpenMPI-IO shouldn't ever crash based on user input, but it does.
- How is Darshan feeding OpenMPI-IO such a bogus payload?
- How is the hyperslab selection, or lack thereof, triggering all this?
> I have opened a topic at the HDF group forum:
> The resulting errors are posted to this topic as well.
> The minimal example to reproduce the error is also here:
> My Setup is:
> OpenMPI (OpenRTE) 4.0.5
> HDF5 1.12.0
> Darshan 3.2.1
> Could you take a look and check if there is a problem with the HDF5
> instrumentation in Darshan?
> Thank you
> Darshan-users mailing list
> Darshan-users at lists.mcs.anl.gov