[Darshan-users] Error in job_summary
Phil Carns
carns at mcs.anl.gov
Wed Jul 28 11:07:16 CDT 2021
Hi Jeff,
Yes, it includes subdirs. You can kind of think of it as a prefix match
on the file paths.
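
For example, the test is conceptually just a string-prefix match
against each entry in DARSHAN_EXCLUDE_DIRS -- a rough Python sketch
(not Darshan's actual C code):

    import os

    # entries come from the comma-separated DARSHAN_EXCLUDE_DIRS value
    exclude_dirs = [d for d in
                    os.environ.get("DARSHAN_EXCLUDE_DIRS", "").split(",")
                    if d]

    def is_excluded(path):
        # any entry that is a prefix of the path suppresses the record
        return any(path.startswith(d) for d in exclude_dirs)

    # so excluding /proc also covers /proc/self/status, etc.
    print(is_excluded("/proc/self/status"))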
thanks,
-Phil
On 7/27/21 8:07 AM, Jeffrey Layton wrote:
> Oops. Did a reply instead of a reply-all.
>
> One more question. When I specify a directory to exclude, does it
> include all subdirectories as well?
>
> Thanks!
>
> Jeff
>
>
> On Tue, Jul 27, 2021 at 12:01 PM Jeffrey Layton <laytonjb at gmail.com
> <mailto:laytonjb at gmail.com>> wrote:
>
> Shane,
>
> Thanks for the reply. I'm glad you're doing the changes to
> Darshan. This will have a big impact on profiling DL workloads. (I
> realize that wasn't a focus of Darshan originally, but Darshan has
> become a victim of its own success: when people mention 'IO
> profiling' they immediately say 'Darshan'.)
>
> DL is a tough workload. Let me give you an example. I just ran a
> pretty simple model with 1.2M parameters. I used the CIFAR-10 data
> set (collection of images) and ran PyTorch for 100 epochs (not even
> close to being fully trained). During those 100 epochs, PyTorch
> opened 167,076 files. It closed 1,274,000 files (I measured this
> using the strace output). It also used 1,206 threads. The training
> code was all Python.
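>
> (For reference, tallying opens and closes from an strace log can be
> done with a few lines of Python -- "strace.out" below is just a
> placeholder for whatever file strace -f -o wrote:)
>
>     import re
>     from collections import Counter
>
>     counts = Counter()
>     with open("strace.out") as f:
>         for line in f:
>             # count open/openat/close syscall invocations
>             m = re.search(r'\b(openat|open|close)\(', line)
>             if m:
>                 counts[m.group(1)] += 1
>     print(counts)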
>
> TensorFlow is better in regard to IO than PyTorch for CIFAR-10.
> The model only used 555K parameters. For 100 epochs it opened
> 4,099 files. It closed 3,617 files. It used 350 threads during
> this time.
>
> You can see that DL frameworks do a lot of stuff in the name of
> IO! Being able to track over 100K files is probably not a bad idea
> (I might go as far as 1M files).
>
> In the meantime, is there a limit to the number of directories
> you can list in the excludes?
>
> Thanks!
>
> Jeff
>
>
>
> On Mon, Jul 26, 2021 at 9:43 PM Snyder, Shane <ssnyder at mcs.anl.gov
> <mailto:ssnyder at mcs.anl.gov>> wrote:
>
> Hi Jeff,
>
> Existing Darshan releases do have some hard coded limits that
> have been increasingly problematic for our users, it seems.
> The limit you are likely hitting is just that Darshan
> instrumentation modules do not track more than 1,024 file
> records currently. This isn't really tunable in any way,
> unfortunately.
>
> You can get a list of files that Darshan did instrument by
> running darshan-parser with the '--file-list' option. That
> might give you some more ideas on directories you could
> potentially exclude to force Darshan to reserve
> instrumentation resources for other files, but that may not
> even be sufficient depending on your workload.
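>
> For example, you could tally records per top-level directory with a
> small script like this (assuming the file path is the second
> whitespace-separated column of each non-comment line in the
> --file-list output; "my_job.darshan" is a placeholder name):
>
>     import subprocess
>     from collections import Counter
>
>     out = subprocess.run(
>         ["darshan-parser", "--file-list", "my_job.darshan"],
>         capture_output=True, text=True, check=True).stdout
>
>     counts = Counter()
>     for line in out.splitlines():
>         fields = line.split()
>         if line.startswith("#") or len(fields) < 2:
>             continue
>         # bucket each record by its top-level directory
>         top = "/" + fields[1].lstrip("/").split("/", 1)[0]
>         counts[top] += 1
>
>     for top, n in counts.most_common(10):
>         print(n, top)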
>
> We do have some functionality we are hoping to have merged in
> for our next release to help address this issue. In fact, it's
> available to try out in a branch in our repo if you're really
> motivated to get this working soon. There are more details
> here in a PR on our GitHub:
> https://github.com/darshan-hpc/darshan/pull/405
> <https://github.com/darshan-hpc/darshan/pull/405>
>
> Essentially, you can use a config file to control a number of
> different Darshan settings, including the ability to change
> the hard coded file maximum from above and to provide regular
> expressions (rather than just directory names) for files
> Darshan should exclude from instrumentation. If you have more
> specific questions or feedback about this functionality,
> please let us know.
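>
> As a rough sketch (the directive names and syntax here are taken
> from the PR and may still change before release), a config file
> could look something like:
>
>     # raise per-module memory for record storage (in MiB)
>     MODMEM 64
>     # raise the POSIX module's file record limit
>     MAX_RECORDS 50000 POSIX
>     # regex excludes, applied to all modules
>     NAME_EXCLUDE ^/proc,^/sys *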
>
> Thanks!
> --Shane
> ------------------------------------------------------------------------
> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov
> <mailto:darshan-users-bounces at lists.mcs.anl.gov>> on behalf of
> Jeffrey Layton <laytonjb at gmail.com <mailto:laytonjb at gmail.com>>
> *Sent:* Monday, July 26, 2021 9:15 AM
> *To:* darshan-users at lists.mcs.anl.gov
> <mailto:darshan-users at lists.mcs.anl.gov>
> <darshan-users at lists.mcs.anl.gov
> <mailto:darshan-users at lists.mcs.anl.gov>>
> *Subject:* [Darshan-users] Error in job_summary
> Good morning,
>
> I'm post-processing a Darshan log for a TensorFlow training run
> of a simple model (CIFAR-10). The post-processing completes
> just fine, but I see an error on the first page:
>
>
> WARNING: This Darshan log contains incomplete data. This happens
> when a module runs out of memory to store new record data. Please
> run darshan-parser on the log file for more information.
>
> So I ran darshan-parser on the file and I see the following at
> the end.
>
>
> # *******************************************************
> # POSIX module data
> # *******************************************************
>
> # *ERROR*: The POSIX module contains incomplete data!
> # This happens when a module runs out of
> # memory to store new record data.
>
> # To avoid this error, consult the darshan-runtime
> # documentation and consider setting the
> # DARSHAN_EXCLUDE_DIRS environment variable to prevent
> # Darshan from instrumenting unnecessary files.
>
> # You can display the (incomplete) data that is
> # present in this log using the --show-incomplete
> # option to darshan-parser.
>
>
> I have a bunch of file systems excluded:
> /proc,/etc,/dev,/sys,/snap,/run.
>
> How can I get a list of files that Darshan tracked? Is there a
> way to increase the amount of memory?
>
> Thanks!
>
> Jeff
>
>
>
>
> _______________________________________________
> Darshan-users mailing list
> Darshan-users at lists.mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users