[Darshan-users] Error in job_summary
Phil Carns
carns at mcs.anl.gov
Wed Jul 28 11:07:16 CDT 2021
Hi Jeff,
Yes, it includes subdirs. You can kind of think of it as a prefix match
on the file paths.
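
For example, the test is conceptually just a string-prefix match
against each entry in DARSHAN_EXCLUDE_DIRS -- a rough Python sketch
(not Darshan's actual C code):

    import os

    # entries come from the comma-separated DARSHAN_EXCLUDE_DIRS value
    exclude_dirs = [d for d in
                    os.environ.get("DARSHAN_EXCLUDE_DIRS", "").split(",")
                    if d]

    def is_excluded(path):
        # any entry that is a prefix of the path suppresses the record
        return any(path.startswith(d) for d in exclude_dirs)

    # so excluding /proc also covers /proc/self/status, etc.
    print(is_excluded("/proc/self/status"))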
thanks,
-Phil
On 7/27/21 8:07 AM, Jeffrey Layton wrote:
> Oops. Did a reply instead of a reply-all.
>
> One more question. When I specify a directory to exclude, does it
> include all subdirectories as well?
>
> Thanks!
>
> Jeff
>
>
> On Tue, Jul 27, 2021 at 12:01 PM Jeffrey Layton <laytonjb at gmail.com
> <mailto:laytonjb at gmail.com>> wrote:
>
> Shane,
>
> Thanks for the reply. I'm glad you're doing the changes to
> Darshan. This will have a big impact on profiling DL workloads. (I
> realize that wasn't a focus of Darshan originally, but Darshan has
> become a victim of its own success: when people mention 'IO
> profiling' they immediately say 'Darshan'.)
>
> DL is a tough workload. Let me give you an example. I just ran a
> pretty simple model with 1.2M parameters. I used the CIFAR-10 data
> set (collection of images) and ran PyTorch for 100 epochs (not even
> close to being fully trained). During those 100 epochs, PyTorch
> opened 167,076 files. It closed 1,274,000 files (I measured this
> using the strace output). It also used 1,206 threads. The training
> code was all Python.
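>
> (For reference, tallying opens and closes from an strace log can be
> done with a few lines of Python -- "strace.out" below is just a
> placeholder for whatever file strace -f -o wrote:)
>
>     import re
>     from collections import Counter
>
>     counts = Counter()
>     with open("strace.out") as f:
>         for line in f:
>             # count open/openat/close syscall invocations
>             m = re.search(r'\b(openat|open|close)\(', line)
>             if m:
>                 counts[m.group(1)] += 1
>     print(counts)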
>
> TensorFlow is better in regard to IO than PyTorch for CIFAR-10.
> The model only used 555K parameters. For 100 epochs it opened
> 4,099 files. It closed 3,617 files. It used 350 threads during
> this time.
>
> You can see that DL frameworks do a lot of stuff in the name of
> IO! Being able to track over 100K files is probably not a bad idea
> (I might go as far as 1M files).
>
> In the meantime, is there a limit to the number of directories
> you can list in the excludes?
>
> Thanks!
>
> Jeff
>
>
>
> On Mon, Jul 26, 2021 at 9:43 PM Snyder, Shane <ssnyder at mcs.anl.gov
> <mailto:ssnyder at mcs.anl.gov>> wrote:
>
> Hi Jeff,
>
> Existing Darshan releases do have some hard coded limits that
> have been increasingly problematic for our users, it seems.
> The limit you are likely hitting is just that Darshan
> instrumentation modules do not track more than 1,024 file
> records currently. This isn't really tunable in any way,
> unfortunately.
>
> You can get a list of files that Darshan did instrument by
> running darshan-parser with the '--file-list' option. That
> might give you some more ideas on directories you could
> potentially exclude to force Darshan to reserve
> instrumentation resources for other files, but that may not
> even be sufficient depending on your workload.
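>
> For example, you could tally records per top-level directory with a
> small script like this (assuming the file path is the second
> whitespace-separated column of each non-comment line in the
> --file-list output; "my_job.darshan" is a placeholder name):
>
>     import subprocess
>     from collections import Counter
>
>     out = subprocess.run(
>         ["darshan-parser", "--file-list", "my_job.darshan"],
>         capture_output=True, text=True, check=True).stdout
>
>     counts = Counter()
>     for line in out.splitlines():
>         fields = line.split()
>         if line.startswith("#") or len(fields) < 2:
>             continue
>         # bucket each record by its top-level directory
>         top = "/" + fields[1].lstrip("/").split("/", 1)[0]
>         counts[top] += 1
>
>     for top, n in counts.most_common(10):
>         print(n, top)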
>
> We do have some functionality we are hoping to have merged in
> for our next release to help address this issue. In fact, it's
> available to try out in a branch in our repo if you're really
> motivated to get this working soon. There are more details
> here in a PR on our GitHub:
> https://github.com/darshan-hpc/darshan/pull/405
> <https://github.com/darshan-hpc/darshan/pull/405>
>
> Essentially, you can use a config file to control a number of
> different Darshan settings, including the ability to change
> the hard coded file maximum from above and to provide regular
> expressions (rather than just directory names) for files
> Darshan should exclude from instrumentation. If you have more
> specific questions or feedback about this functionality,
> please let us know.
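>
> As a rough sketch (the directive names and syntax here are taken
> from the PR and may still change before release), a config file
> could look something like:
>
>     # raise per-module memory for record storage (in MiB)
>     MODMEM 64
>     # raise the POSIX module's file record limit
>     MAX_RECORDS 50000 POSIX
>     # regex excludes, applied to all modules
>     NAME_EXCLUDE ^/proc,^/sys *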
>
> Thanks!
> --Shane
> ------------------------------------------------------------------------
> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov
> <mailto:darshan-users-bounces at lists.mcs.anl.gov>> on behalf of
> Jeffrey Layton <laytonjb at gmail.com <mailto:laytonjb at gmail.com>>
> *Sent:* Monday, July 26, 2021 9:15 AM
> *To:* darshan-users at lists.mcs.anl.gov
> <mailto:darshan-users at lists.mcs.anl.gov>
> <darshan-users at lists.mcs.anl.gov
> <mailto:darshan-users at lists.mcs.anl.gov>>
> *Subject:* [Darshan-users] Error in job_summary
> Good morning,
>
> I'm post-processing a Darshan log for a TensorFlow training run
> of a simple model (CIFAR-10). The post-processing completes
> just fine, but I see an error on the first page:
>
>
> WARNING: This Darshan log contains incomplete data. This happens
> when a module runs out of memory to store new record data. Please
> run darshan-parser on the log file for more information.
>
> So I ran darshan-parser on the file and I see the following at
> the end.
>
>
> # *******************************************************
> # POSIX module data
> # *******************************************************
>
> # *ERROR*: The POSIX module contains incomplete data!
> # This happens when a module runs out of
> # memory to store new record data.
>
> # To avoid this error, consult the darshan-runtime
> # documentation and consider setting the
> # DARSHAN_EXCLUDE_DIRS environment variable to prevent
> # Darshan from instrumenting unnecessary files.
>
> # You can display the (incomplete) data that is
> # present in this log using the --show-incomplete
> # option to darshan-parser.
>
>
> I have a bunch of file systems excluded:
> /proc,/etc,/dev,/sys,/snap,/run.
>
> How can I get a list of files that Darshan tracked? Is there a
> way to increase the amount of memory?
>
> Thanks!
>
> Jeff
>
>
>
>
> _______________________________________________
> Darshan-users mailing list
> Darshan-users at lists.mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users