[Darshan-users] Error in job_summary

Jeffrey Layton laytonjb at gmail.com
Tue Jul 27 07:07:55 CDT 2021


Oops. Did a reply instead of a reply-all.

One more question: when I specify a directory to exclude, does the exclusion
apply to all of its subdirectories as well?

Thanks!

Jeff


On Tue, Jul 27, 2021 at 12:01 PM Jeffrey Layton <laytonjb at gmail.com> wrote:

> Shane,
>
> Thanks for the reply. I'm glad you're making these changes to Darshan. They
> will have a big impact on profiling DL workloads. (I realize that wasn't a
> focus of Darshan originally, but Darshan has become a victim of its own
> success: when people mention 'I/O profiling', they immediately say 'Darshan'.)
>
> DL is a tough workload. Let me give you an example. I just ran a pretty
> simple model with 1.2M parameters. I used the CIFAR-10 data set (a
> collection of images) and ran PyTorch for 100 epochs (not even close to
> being fully trained). During those 100 epochs, PyTorch opened 167,076 files
> and closed 1,274,000 files (I measured this using the strace output). It
> also used 1,206 threads. The training code was all Python.
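>
> A sketch of the kind of strace counting I mean (the script name is just a
> stand-in):
>
>     strace -f -e trace=open,openat,close -o trace.out python train.py
>     grep -c 'openat(' trace.out   # count of opens
>     grep -c 'close(' trace.out    # count of closes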
>
> TensorFlow is better than PyTorch with regard to I/O for CIFAR-10. The
> model only used 555K parameters. For 100 epochs it opened 4,099 files and
> closed 3,617 files. It used 350 threads during this time.
>
> You can see that DL frameworks do a lot of stuff in the name of I/O! Being
> able to track over 100K files is probably not a bad idea (I might go as far
> as 1M files).
>
> In the meantime, is there a limit to the number of paths you can list in
> the excludes?
>
> Thanks!
>
> Jeff
>
>
>
> On Mon, Jul 26, 2021 at 9:43 PM Snyder, Shane <ssnyder at mcs.anl.gov> wrote:
>
>> Hi Jeff,
>>
>> Existing Darshan releases do have some hard coded limits that have been
>> increasingly problematic for our users, it seems. The limit you are likely
>> hitting is just that Darshan instrumentation modules do not track more than
>> 1,024 file records currently. This isn't really tunable in any way,
>> unfortunately.
>>
>> You can get a list of files that Darshan did instrument by running
>> darshan-parser with the '--file-list' option. That might give you some more
>> ideas on directories you could potentially exclude to force Darshan to
>> reserve instrumentation resources for other files, but that may not even be
>> sufficient depending on your workload.
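>>
>> For example (the log file name here is just a placeholder):
>>
>>     darshan-parser --file-list /path/to/your_job.darshan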
>>
>> We do have some functionality we are hoping to have merged in for our
>> next release to help address this issue. In fact, it's available to try out
>> in a branch in our repo if you're really motivated to get this working
>> soon. There are more details in a PR on our GitHub:
>> https://github.com/darshan-hpc/darshan/pull/405
>>
>> Essentially, you can use a config file to control a number of different
>> Darshan settings, including the ability to change the hard-coded file
>> maximum from above and to provide regular expressions (rather than just
>> directory names) for files Darshan should exclude from instrumentation. If
>> you have more specific questions or feedback about this functionality,
>> please let us know.
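>>
>> To give a flavor, a minimal config might look like this (directive names
>> are taken from that PR and could still change before release):
>>
>>     # raise the per-module record limit well above the default 1,024
>>     MAX_RECORDS    100000    POSIX,STDIO
>>
>>     # regex-based exclusion, applied to all modules
>>     NAME_EXCLUDE   ^/proc,^/sys    *
>>
>> with the runtime pointed at it via an environment variable
>> (DARSHAN_CONFIG_PATH in the PR).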
>>
>> Thanks!
>> --Shane
>> ------------------------------
>> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on
>> behalf of Jeffrey Layton <laytonjb at gmail.com>
>> *Sent:* Monday, July 26, 2021 9:15 AM
>> *To:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
>> *Subject:* [Darshan-users] Error in job_summary
>>
>> Good morning,
>>
>> I'm post-processing a Darshan log for a TensorFlow training run of a
>> simple model (CIFAR-10). The post-processing completes just fine, but I
>> see an error on the first page:
>>
>>
>> WARNING: This Darshan log contains incomplete data. This happens when a
>> module runs out of memory to store new record data. Please run
>> darshan-parser on the log file for more information.
>>
>> So I ran darshan-parser on the file and I see the following at the end.
>>
>>
>> # *******************************************************
>> # POSIX module data
>> # *******************************************************
>>
>> # *ERROR*: The POSIX module contains incomplete data!
>> #            This happens when a module runs out of
>> #            memory to store new record data.
>>
>> # To avoid this error, consult the darshan-runtime
>> # documentation and consider setting the
>> # DARSHAN_EXCLUDE_DIRS environment variable to prevent
>> # Darshan from instrumenting unnecessary files.
>>
>> # You can display the (incomplete) data that is
>> # present in this log using the --show-incomplete
>> # option to darshan-parser.
>>
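>> For reference, that option is used like this (log name is a placeholder):
>>
>>     darshan-parser --show-incomplete /path/to/your_job.darshan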
>>
>> I have a bunch of file systems excluded: /proc, /etc, /dev, /sys, /snap,
>> /run.
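>>
>> i.e., set before the run (comma-separated, no spaces):
>>
>>     export DARSHAN_EXCLUDE_DIRS=/proc,/etc,/dev,/sys,/snap,/run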
>>
>> How can I get a list of files that Darshan tracked? Is there a way to
>> increase the amount of memory?
>>
>> Thanks!
>>
>> Jeff
>>
>>
>>
>>