<div dir="ltr"><div>Oops. Did a reply instead of a reply-all.</div><div><br></div><div>One more question. When I specify a directory to exclude, does the exclusion cover all of its subdirectories as well?</div><div><br></div><div>Thanks!</div><div><br></div><div>Jeff</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jul 27, 2021 at 12:01 PM Jeffrey Layton <<a href="mailto:laytonjb@gmail.com">laytonjb@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Shane,</div><div><br></div><div>Thanks for the reply. I'm glad you're making these changes to Darshan. This will have a big impact on profiling DL workloads (I realize that wasn't a focus of Darshan originally, but Darshan has become a victim of its own success. When people mention 'IO profiling' they immediately say 'Darshan').</div><div><br></div><div>DL is a tough workload. Let me give you an example. I just ran a pretty simple model with 1.2M parameters. I used the CIFAR-10 data set (collection of images) and ran PyTorch for 100 epochs (not even close to being fully trained). During those 100 epochs, PyTorch opened 167,076 files. It closed 1,274,000 files (I measured this using the strace output). It also used 1,206 threads. The training code was all Python.</div><div><br></div><div>TensorFlow is better in regard to IO than PyTorch for CIFAR-10. The model only used 555K parameters. For 100 epochs it opened 4,099 files. It closed 3,617 files. It used 350 threads during this time.</div><div><br></div><div>You can see that DL frameworks do a lot of stuff in the name of IO!
Being able to track over 100K files is probably not a bad idea (I might go as far as 1M files).</div><div><br></div><div>In the meantime, is there a limit to the number of items you can include using excludes?</div><div><br></div><div>Thanks!</div><div><br></div><div>Jeff</div><div><br></div><div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 26, 2021 at 9:43 PM Snyder, Shane <<a href="mailto:ssnyder@mcs.anl.gov" target="_blank">ssnyder@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi Jeff,</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Existing Darshan releases do have some hard-coded limits that seem to be increasingly problematic for our users. The limit you are likely hitting is that each Darshan instrumentation module currently tracks no more than 1,024 file records. This
isn't really tunable in any way, unfortunately.</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
You can get a list of files that Darshan did instrument by running darshan-parser with the '--file-list' option. That might give you some more ideas on directories you could potentially exclude to force Darshan to reserve instrumentation resources for other
files, but that may not even be sufficient depending on your workload.<br>
</div>
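<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
As a rough illustration of how the file list could be used to pick exclusion candidates, here is a minimal Python sketch that counts instrumented files per leading directory. It is not part of Darshan. It assumes the '--file-list' output has been saved to a text file, that header lines start with '#', and that the file path is the first whitespace-separated field beginning with '/' (so paths containing spaces would be truncated). The function name count_by_directory and the sample lines are hypothetical.</div>
<pre>
```python
from collections import Counter


def count_by_directory(file_list_lines, depth=2):
    """Count instrumented files per leading directory prefix."""
    counts = Counter()
    for line in file_list_lines:
        line = line.strip()
        # Skip blank lines and '#' comment/header lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: the path is the first field that begins with '/'.
        path = next((f for f in line.split() if f.startswith("/")), None)
        if path is None:
            continue
        parts = path.strip("/").split("/")
        # Keep only parent directories (never the file name), up to `depth`.
        dirs = parts[:-1][:depth]
        counts["/" + "/".join(dirs)] += 1
    return counts


# Hypothetical sample lines, not real darshan-parser output.
sample = [
    "# record_id  file_name  nprocs  slowest  avg",
    "111\t/usr/lib/python3/site.py\t1\t0.01\t0.01",
    "222\t/usr/lib/python3/os.py\t1\t0.01\t0.01",
    "333\t/data/cifar10/batch_1.bin\t1\t0.20\t0.20",
]

for prefix, n in count_by_directory(sample).most_common():
    print(prefix, n)
```
</pre>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Directories near the top of the printed list that are not part of the training data are the ones most worth considering for DARSHAN_EXCLUDE_DIRS.</div>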
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
We do have some functionality we are hoping to have merged in for our next release to help address this issue. In fact, it's available to try out in a branch in our repo if you're really motivated to get this working soon. There are more details here in a PR
on our GitHub: <a href="https://github.com/darshan-hpc/darshan/pull/405" id="gmail-m_1040552738000251815gmail-m_5528091281693620561LPlnk476838" target="_blank">
https://github.com/darshan-hpc/darshan/pull/405</a><br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Essentially, you can use a config file to control a number of different Darshan settings, including the ability to raise the hard-coded file record limit mentioned above and to provide regular expressions (rather than just directory names) for files Darshan should
exclude from instrumentation. If you have more specific questions or feedback about this functionality, please let us know.<br>
</div>
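<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
To show why regular expressions can be more useful than directory names alone, here is a small Python sketch comparing the two styles of filter. It only illustrates the idea and makes no claim about how Darshan matches paths internally or what its config file syntax looks like. The helper names, the patterns, and the /home/jeff paths are all hypothetical.</div>
<pre>
```python
import re


def excluded_by_dirs(path, dirs):
    # Hypothetical prefix check; not taken from Darshan's source.
    return any(path == d or path.startswith(d.rstrip("/") + "/") for d in dirs)


def excluded_by_regex(path, patterns):
    # A path is excluded if any pattern matches somewhere in it.
    return any(re.search(p, path) for p in patterns)


# Directory list taken from the original report in this thread.
dirs = ["/proc", "/etc", "/dev", "/sys", "/snap", "/run"]
# Illustrative patterns only.
patterns = [r"^/proc", r"\.pyc$", r"/site-packages/"]

paths = [
    "/proc/self/status",
    "/home/jeff/venv/lib/python3.8/site-packages/torch/__init__.py",
    "/home/jeff/train/__pycache__/model.cpython-38.pyc",
    "/data/cifar10/data_batch_1",
]

for p in paths:
    print(p, excluded_by_dirs(p, dirs), excluded_by_regex(p, patterns))
```
</pre>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
In this sketch the directory list only catches the /proc entry, while the patterns also filter out Python bytecode and installed packages wherever they live, leaving the data set file to be instrumented.</div>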
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Thanks!</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
--Shane<br>
</div>
<div id="gmail-m_1040552738000251815gmail-m_5528091281693620561appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_1040552738000251815gmail-m_5528091281693620561divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> Darshan-users <<a href="mailto:darshan-users-bounces@lists.mcs.anl.gov" target="_blank">darshan-users-bounces@lists.mcs.anl.gov</a>> on behalf of Jeffrey Layton <<a href="mailto:laytonjb@gmail.com" target="_blank">laytonjb@gmail.com</a>><br>
<b>Sent:</b> Monday, July 26, 2021 9:15 AM<br>
<b>To:</b> <a href="mailto:darshan-users@lists.mcs.anl.gov" target="_blank">darshan-users@lists.mcs.anl.gov</a> <<a href="mailto:darshan-users@lists.mcs.anl.gov" target="_blank">darshan-users@lists.mcs.anl.gov</a>><br>
<b>Subject:</b> [Darshan-users] Error in job_summary</font>
<div> </div>
</div>
<div>
<div dir="ltr">
<div>Good morning,</div>
<div><br>
</div>
<div>I'm post-processing a Darshan log for a TensorFlow training run of a simple model (CIFAR-10). The post-processing completes just fine, but I see this message on the first page of the job summary:</div>
<div><br>
</div>
<div><br>
</div>
<div>WARNING: This Darshan log contains incomplete data. This happens when a module runs out of memory to store<br>
new record data. Please run darshan-parser on the log file for more information.</div>
<div><br>
</div>
<div>So I ran darshan-parser on the file and I see the following at the end.</div>
<div><br>
</div>
<div><br>
</div>
<div># *******************************************************<br>
# POSIX module data<br>
# *******************************************************<br>
<br>
# *ERROR*: The POSIX module contains incomplete data!<br>
# This happens when a module runs out of<br>
# memory to store new record data.<br>
<br>
# To avoid this error, consult the darshan-runtime<br>
# documentation and consider setting the<br>
# DARSHAN_EXCLUDE_DIRS environment variable to prevent<br>
# Darshan from instrumenting unecessary files.<br>
<br>
# You can display the (incomplete) data that is<br>
# present in this log using the --show-incomplete<br>
# option to darshan-parser.</div>
<div><br>
</div>
<div><br>
</div>
<div>I already have a number of directories excluded: /proc,/etc,/dev,/sys,/snap,/run<br>
</div>
<div><br>
</div>
<div>How can I get a list of files that Darshan tracked? Is there a way to increase the amount of memory?</div>
<div><br>
</div>
<div>Thanks!</div>
<div><br>
</div>
<div>Jeff</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote></div></div>
</blockquote></div>