[Darshan-users] Problems about Darshan Logs

Mon Jun 7 11:44:41 CDT 2021

Hi Jie,

Unfortunately, Darshan doesn't currently expose many tunables that are likely to help with this particular problem. Darshan modules have mostly been hard-coded to store a maximum of 1,024 file records per process, as an attempt to bound their memory usage at reasonable levels. That design tradeoff obviously creates problems for workloads like the one you've shared with us. We've run into this problem more and more recently, especially with Python frameworks that tend to open a lot of files, many of which are not really pertinent from an I/O analysis perspective (e.g., .so, .h, and .py files). You can see this in the example PDF you shared with us: many of the files Darshan tells you about are .py and .pyc files, which are probably not of any interest.
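To put the per-process cap in numbers, here is a small back-of-the-envelope sketch using only the figures quoted in this thread. It assumes one process per GPU (16 processes in total), that the image files are split evenly across processes, and that records are not shared between processes; none of that is confirmed here, so treat the results as rough bounds rather than exact values.

```python
# Figures quoted in this thread.
RECORDS_PER_PROCESS_CAP = 1024  # hard-coded per-process file record limit
NUM_PROCESSES = 16              # assumption: one process per GPU (2 nodes, 16 GPUs)
FILES_IN_LOG = 14792            # file count reported for the log
DATASET_FILES = 1_300_000       # approximate number of ImageNet image files

# Upper bound on records in the log if no record is shared between processes.
max_records_in_log = RECORDS_PER_PROCESS_CAP * NUM_PROCESSES

# Average number of records per process actually present in the log.
avg_records_per_process = FILES_IN_LOG / NUM_PROCESSES

# Image files each process would touch if the dataset is split evenly (assumption).
avg_images_per_process = DATASET_FILES / NUM_PROCESSES

print(max_records_in_log)       # 16384
print(avg_records_per_process)  # 924.5
print(avg_images_per_process)   # 81250.0
```

Under these assumptions, each process would need to track roughly 80 times more files than the cap allows. An average of 924.5 records per process being below 1,024 does not by itself rule out that individual processes hit the cap.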

As for existing options to help work around this that work with the Darshan installation on ThetaGPU, here are a couple of ideas:

  *   Unlikely, but on the off chance that many of the .py and .pyc files (and any other types of files you're not interested in) are isolated in a directory away from your image files, you could try using the DARSHAN_EXCLUDE_DIRS environment variable to exclude them.
     *   DARSHAN_EXCLUDE_DIRS: specifies a list of comma-separated paths that Darshan will not instrument at runtime (in addition to Darshan's default exclusion list)
  *   Darshan's tracing modules (DXT) only limit themselves in terms of memory usage, not total number of instrumented files. They default to 4 MiB per process, but you can ask for more memory at configure time when building Darshan (i.e., this is not currently a runtime tunable).
     *   Furthermore, DXT does have some trace filtering logic you can use to restrict which files Darshan instruments (using path prefixes or file extensions). See documentation here: https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html#_using_the_darshan_extended_tracing_dxt_module
     *   Note that DXT does not provide you with the per-file summary counters you traditionally get with Darshan, so you would have to post-process the traces yourself to get stats on read/write activity
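As a minimal sketch of the first idea, the snippet below builds the comma-separated DARSHAN_EXCLUDE_DIRS value from a list of directories and places it in an environment dictionary for a launcher script. The helper name darshan_env_with_exclusions and the example paths are hypothetical placeholders, not part of Darshan.

```python
import os


def darshan_env_with_exclusions(exclude_dirs, base_env=None):
    """Return a copy of the environment with DARSHAN_EXCLUDE_DIRS set.

    exclude_dirs: directories Darshan should not instrument.
    base_env: environment to copy; defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Darshan expects a comma-separated list of paths.
    env["DARSHAN_EXCLUDE_DIRS"] = ",".join(exclude_dirs)
    return env


# Hypothetical paths: replace with wherever your .py/.pyc files actually live.
env = darshan_env_with_exclusions(
    ["/path/to/conda/env/lib", "/path/to/project/src"],
    base_env={},
)
print(env["DARSHAN_EXCLUDE_DIRS"])
```

The resulting dictionary could be passed as the env argument of subprocess.run when launching the training job, so that the variable is already set when the instrumented processes start. Setting it from inside an already-running instrumented process is likely too late.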

I'm working on some new mechanisms for Darshan that will give you more runtime control over what files are instrumented (using regular expressions to exclude specific directories or extensions), how much memory each module uses, etc. It's kind of a generalization of the DXT trace filtering stuff I mentioned above. I'm hoping to have something ready to try in the next week or so, and it would be great if you guys could help try it out. I'll keep you posted.

In the meantime, it sounds like you guys have had some success manually modifying the limit in the source code. I think that should work fine; just keep in mind that you will probably also need to set the DARSHAN_MODMEM environment variable to a sufficiently large value to hold all of the records at runtime. It might take some experimentation to figure out the right settings (for the hard-coded limit and for DARSHAN_MODMEM) to capture everything.
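A rough starting point for that experimentation can be sketched as follows. This assumes DARSHAN_MODMEM is given in MiB (check the runtime documentation for your Darshan version) and uses a placeholder record size of 1,024 bytes; the real size depends on the module and Darshan version, so substitute the size of the record structure from your source tree. The function name modmem_mib is hypothetical.

```python
import math

MIB = 1024 * 1024


def modmem_mib(records_per_process, bytes_per_record, num_modules=1):
    """Estimate the memory (in MiB, rounded up) needed to hold the records.

    records_per_process: the raised per-process record limit.
    bytes_per_record: size of one module record (placeholder, version dependent).
    num_modules: number of modules that each store that many records.
    """
    total_bytes = records_per_process * bytes_per_record * num_modules
    return math.ceil(total_bytes / MIB)


# About 1.3 million files over 16 processes, assuming an even split.
records = 1_300_000 // 16

# Placeholder record size of 1,024 bytes; not a real Darshan constant.
print(modmem_mib(records, 1024))                 # 80
print(modmem_mib(records, 1024, num_modules=2))  # 159
```

The estimate only covers the record data itself, so leaving some headroom above the computed value is probably sensible.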

Thanks,
--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Chunduri, Sudheer <sudheer at anl.gov>
Sent: Monday, June 7, 2021 9:54 AM
To: Jie Liu <jliu279 at ucmerced.edu>; darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Cc: Nicolae, Bogdan <bnicolae at anl.gov>; Si, Min <msi at anl.gov>
Subject: Re: [Darshan-users] Problems about Darshan Logs

Hi Jie,

I see you copied the darshan-users mailing list, so Shane should hopefully see this.

Meanwhile, have you tried using “darshan-parser --show-incomplete”?

From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Jie Liu <jliu279 at ucmerced.edu>
Date: Monday, June 7, 2021 at 9:42 AM
To: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Cc: Nicolae, Bogdan <bnicolae at anl.gov>, Si, Min <msi at anl.gov>
Subject: [Darshan-users] Problems about Darshan Logs

Hi,

I used Darshan to do some profiling work when training Deep Learning models on ThetaGPU (Resnet50 on ImageNet, mini-batch size is 32).

When I used the following command to get the summary of darshan logs:

darshan-job-summary.pl  /path/to/.darshan --output /path/to/summary.pdf

The resulting summary.pdf file contains the following warning message on the first page:

WARNING: This Darshan log contains incomplete data. This happens when a module runs out of memory to store new record data. Please run darshan-parser on the log file for more information.

I also tried darshan-parser with the following command:

darshan-parser  /path/to/.darshan --output /path/to/summary.txt

It also reports an incomplete-data error:

*ERROR*: The POSIX module contains incomplete data! This happens when a module runs out of memory to store new record data.

The ImageNet dataset contains about 1.3 million image files, but the Darshan log shows only 14,792 opened files when I trained ResNet50 on ThetaGPU with 2 nodes and 16 GPUs. (Please see the attached file for more information about the logs obtained by Darshan.)

Is there an efficient way to make the darshan logs contain all the I/O information of 1.3 million image files during the model training?

Previously, I contacted the Support Team, and their response was “the POSIX module only tracks 1024 files, once we open 1025 files Darshan no longer tracks those files”. How can I make the POSIX module track all of the image files?

For the model training on ThetaGPU using 2 nodes and 16 GPUs, my experimental results show that every process handles 14792/16 ≈ 924 image files on average, which is actually less than 1024. How can this be explained?

Thanks for your help.

Best regards,

--

Jie Liu
