[Darshan-users] Error in job_summary

Snyder, Shane ssnyder at mcs.anl.gov
Mon Jul 26 16:43:00 CDT 2021


Hi Jeff,

Existing Darshan releases do have some hard-coded limits that have been increasingly problematic for our users. The limit you are likely hitting is that Darshan instrumentation modules currently do not track more than 1,024 file records. Unfortunately, this isn't tunable in any way in existing releases.

You can get a list of the files Darshan did instrument by running darshan-parser with the '--file-list' option. That might give you ideas for additional directories to exclude so that Darshan reserves its instrumentation resources for other files, though that may not be sufficient depending on your workload.
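For example (the log path here is just a placeholder; point it at your actual .darshan file):

    darshan-parser --file-list /path/to/your_job.darshan

That prints the file records present in the log, i.e., everything the modules managed to track before running out of record space.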

We do have some functionality we are hoping to have merged for our next release to help address this issue. In fact, it's available to try out in a branch of our repo if you're motivated to get this working soon. There are more details in a PR on our GitHub: https://github.com/darshan-hpc/darshan/pull/405

Essentially, you can use a config file to control a number of different Darshan settings, including the ability to raise the hard-coded file record maximum mentioned above and to provide regular expressions (rather than just directory names) for files Darshan should exclude from instrumentation. If you have more specific questions or feedback about this functionality, please let us know.
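To give a rough idea, a config file along these lines would raise the record limit and add regex-based exclusions. The directive names and syntax here should be treated as illustrative until the branch is merged; the PR is the authoritative reference:

    # raise the per-module file record limit (default is 1,024)
    MAX_RECORDS 4096 POSIX,MPI-IO
    # exclude files matching these regexes from all modules
    NAME_EXCLUDE \.so$,^/tmp/ *

The PR also describes how to point the Darshan runtime at the config file.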

Thanks!
--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Jeffrey Layton <laytonjb at gmail.com>
Sent: Monday, July 26, 2021 9:15 AM
To: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: [Darshan-users] Error in job_summary

Good morning,

I'm post-processing a Darshan log file from TensorFlow training of a simple model (CIFAR-10). The post-processing completes just fine, but I see the following warning on the first page:


WARNING: This Darshan log contains incomplete data. This happens when a module runs out of memory to store
new record data. Please run darshan-parser on the log file for more information.

So I ran darshan-parser on the file and saw the following at the end.


# *******************************************************
# POSIX module data
# *******************************************************

# *ERROR*: The POSIX module contains incomplete data!
#            This happens when a module runs out of
#            memory to store new record data.

# To avoid this error, consult the darshan-runtime
# documentation and consider setting the
# DARSHAN_EXCLUDE_DIRS environment variable to prevent
# Darshan from instrumenting unecessary files.

# You can display the (incomplete) data that is
# present in this log using the --show-incomplete
# option to darshan-parser.


I have a bunch of file systems excluded: /proc,/etc,/dev,/sys,/snap,/run.
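(For reference, I'm setting those exclusions through the DARSHAN_EXCLUDE_DIRS environment variable before launching the training script; the preload path and script name below are just placeholders for my actual setup:

    export DARSHAN_EXCLUDE_DIRS=/proc,/etc,/dev,/sys,/snap,/run
    LD_PRELOAD=/path/to/libdarshan.so python train_cifar10.py
)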

How can I get a list of files that Darshan tracked? Is there a way to increase the amount of memory?

Thanks!

Jeff


