[Darshan-users] Error in job_summary

Snyder, Shane ssnyder at mcs.anl.gov
Wed Jul 28 11:28:53 CDT 2021


Hi Jeff,

Thanks for the details on these DL workloads you're working with. We've had some other users report similar issues with Darshan on workloads like this, e.g., https://lists.mcs.anl.gov/pipermail/darshan-users/2021-July/000715.html

That user tried bumping up the max records using the config file mechanism, as well as bumping up Darshan's default memory allocation size (by setting the DARSHAN_MODMEM env var), with varying levels of success for different workload sizes. That might give you some ideas for tuning Darshan for your test cases.
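For reference, the env-var route looks something like this (the 64 MiB value and the library/script paths below are placeholders, not a recommendation; DARSHAN_MODMEM is specified in MiB per the darshan-runtime docs):

```shell
# Raise Darshan's per-module memory pool (value in MiB; the default is
# only a few MiB, which workloads touching many files exhaust quickly).
export DARSHAN_MODMEM=64

# Then run the workload with the Darshan library preloaded, e.g. for a
# non-MPI Python job (paths here are placeholders):
#   env LD_PRELOAD=/path/to/libdarshan.so python train.py
```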

I think the branch I'm working on that allows more control over different instrumentation settings will help, but as you can see in the post I referenced above, even that isn't necessarily sufficient to get Darshan working properly with millions of files to instrument. I still need to sit down and get a better feel for what's breaking at that sort of scale and whether it's something we can easily resolve. If not, we can try to better document our limitations so it's clearer what Darshan can and can't do.

We are happy that users are keeping us in mind for their I/O profiling needs now that Darshan can be made to work in non-MPI contexts, but obviously there are still some growing pains we need to sort out for some workloads. Thanks again for reporting issues and giving us really helpful feedback!

--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Phil Carns <carns at mcs.anl.gov>
Sent: Wednesday, July 28, 2021 11:07 AM
To: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: Re: [Darshan-users] Error in job_summary


Hi Jeff,

Yes, it includes subdirs.  You can kind of think of it as a prefix match on the file paths.

thanks,

-Phil

On 7/27/21 8:07 AM, Jeffrey Layton wrote:
Oops. Did a reply instead of a reply-all.

One more question. When I specify a directory to exclude, does it include all subdirectories as well?

Thanks!

Jeff


On Tue, Jul 27, 2021 at 12:01 PM Jeffrey Layton <laytonjb at gmail.com<mailto:laytonjb at gmail.com>> wrote:
Shane,

Thanks for the reply. I'm glad you're making these changes to Darshan. They will have a big impact on profiling DL workloads. (I realize that wasn't a focus of Darshan originally, but Darshan has become a victim of its own success: when people mention 'I/O profiling', they immediately say 'Darshan'.)

DL is a tough workload. Let me give you an example. I just ran a pretty simple model with 1.2M parameters. I used the CIFAR-10 data set (a collection of images) and ran PyTorch for 100 epochs (not even close to fully trained). During those 100 epochs, PyTorch opened 167,076 files and closed 1,274,000 files (I measured this using the strace output). It also used 1,206 threads. The training code was all Python.

TensorFlow is better behaved with regard to I/O than PyTorch for CIFAR-10. The model only used 555K parameters. For 100 epochs it opened 4,099 files and closed 3,617 files. It used 350 threads during this time.

You can see that DL frameworks do a lot of stuff in the name of I/O! Being able to track over 100K files is probably not a bad idea (I might go as far as 1M files).

In the meantime, is there a limit to the number of directories you can list in the excludes?

Thanks!

Jeff



On Mon, Jul 26, 2021 at 9:43 PM Snyder, Shane <ssnyder at mcs.anl.gov<mailto:ssnyder at mcs.anl.gov>> wrote:
Hi Jeff,

Existing Darshan releases do have some hard-coded limits that have been increasingly problematic for our users, it seems. The limit you are likely hitting is that Darshan instrumentation modules currently do not track more than 1,024 file records. This isn't really tunable in any way, unfortunately.

You can get a list of files that Darshan did instrument by running darshan-parser with the '--file-list' option. That might give you some more ideas on directories you could potentially exclude to force Darshan to reserve instrumentation resources for other files, but that may not even be sufficient depending on your workload.
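A quick way to act on that list is to tally records by top-level directory, which makes exclusion candidates obvious. This is a sketch, not part of Darshan itself: the sample input below stands in for real `darshan-parser --file-list` output, whose exact column layout (record id, file path, mount info) is my assumption here; with a real log you would pipe the parser output into `summarize` instead.

```shell
# Count instrumented records per top-level directory from a
# darshan-parser --file-list dump (field 2 is assumed to be the path).
summarize() {
  awk '$2 ~ /^\// { split($2, p, "/"); c["/" p[2]]++ }
       END { for (d in c) print c[d], d }' | sort -rn
}

# Real usage would be:
#   darshan-parser --file-list my_job.darshan | summarize
# Stand-in sample input:
printf '1\t/data/cifar10/img_0001.png\t/data\n2\t/data/cifar10/img_0002.png\t/data\n3\t/home/jeff/model.py\t/home\n' | summarize
# prints:
#   2 /data
#   1 /home
```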

We do have some functionality we are hoping to have merged in for our next release to help address this issue. In fact, it's available to try out in a branch in our repo if you're really motivated to get this working soon. There are more details here in a PR on our GitHub: https://github.com/darshan-hpc/darshan/pull/405

Essentially, you can use a config file to control a number of different Darshan settings, including the ability to raise the hard-coded file maximum mentioned above and to provide regular expressions (rather than just directory names) for files Darshan should exclude from instrumentation. If you have more specific questions or feedback about this functionality, please let us know.
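To make that concrete, a config file along these lines is what the PR describes; the directive names, argument order, and the env var used to point Darshan at the file are my best reading of the branch and should be checked against the PR's own documentation before use:

```
# Hypothetical Darshan config file -- syntax unverified, see PR #405 docs.

# Raise module memory (MiB), equivalent to the DARSHAN_MODMEM env var.
MODMEM 128

# Raise the per-module record limit for the named modules.
MAX_RECORDS 1000000 POSIX,STDIO

# Exclude records whose names match these regexes, in all modules.
NAME_EXCLUDE ^/proc,^/sys,^/dev *
```

The branch reportedly points Darshan at this file via an environment variable (DARSHAN_CONFIG_PATH, if I recall the PR correctly).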

Thanks!
--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov<mailto:darshan-users-bounces at lists.mcs.anl.gov>> on behalf of Jeffrey Layton <laytonjb at gmail.com<mailto:laytonjb at gmail.com>>
Sent: Monday, July 26, 2021 9:15 AM
To: darshan-users at lists.mcs.anl.gov<mailto:darshan-users at lists.mcs.anl.gov> <darshan-users at lists.mcs.anl.gov<mailto:darshan-users at lists.mcs.anl.gov>>
Subject: [Darshan-users] Error in job_summary

Good morning,

I'm post-processing a Darshan log for a TensorFlow training run of a simple model (CIFAR-10). The post-processing completes just fine, but I see an error on the first page:


WARNING: This Darshan log contains incomplete data. This happens when a module runs out of memory to store
new record data. Please run darshan-parser on the log file for more information.

So I ran darshan-parser on the file and I see the following at the end.


# *******************************************************
# POSIX module data
# *******************************************************

# *ERROR*: The POSIX module contains incomplete data!
#            This happens when a module runs out of
#            memory to store new record data.

# To avoid this error, consult the darshan-runtime
# documentation and consider setting the
# DARSHAN_EXCLUDE_DIRS environment variable to prevent
# Darshan from instrumenting unecessary files.

# You can display the (incomplete) data that is
# present in this log using the --show-incomplete
# option to darshan-parser.


I have a bunch of file systems excluded: /proc,/etc,/dev,/sys,/snap,/run.
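(In shell form that exclusion list is just the following, assuming the comma-separated format the darshan-runtime docs describe for this variable:)

```shell
# Comma-separated directory prefixes Darshan should not instrument.
export DARSHAN_EXCLUDE_DIRS=/proc,/etc,/dev,/sys,/snap,/run
```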

How can I get a list of files that Darshan tracked? Is there a way to increase the amount of memory?

Thanks!

Jeff

_______________________________________________
Darshan-users mailing list
Darshan-users at lists.mcs.anl.gov<mailto:Darshan-users at lists.mcs.anl.gov>
https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
