<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hi Jie,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Unfortunately, Darshan doesn't currently expose many tunables that are likely to help with this particular problem. Most Darshan modules are hard-coded to store a maximum of 1,024 file records per process, in an attempt to bound their memory usage
at reasonable levels. That design tradeoff obviously creates problems for workloads like the one you've shared with us. We've run into this problem more and more recently, especially with Python frameworks that tend to open a lot of files, many of
which are not really pertinent from an I/O analysis perspective (e.g., .so, .h, and .py files). You can see this in the example PDF you shared with us: many of the files Darshan tells you about are .py and .pyc files, which are probably not of any interest.<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
As for existing options to help work around this with the Darshan installation on ThetaGPU, here are a couple of ideas:</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<ul>
<li>This is unlikely, but on the off chance that many of the .py and .pyc files (and any other types of files you're not interested in) are isolated in a directory away from your image files, you could try using the DARSHAN_EXCLUDE_DIRS environment variable to exclude
them</li><ul>
<li>DARSHAN_EXCLUDE_DIRS: specifies a list of comma-separated paths that Darshan will not instrument at runtime (in addition to Darshan's default exclusion list)</li></ul>
<li>Darshan's tracing modules (DXT) limit themselves only in terms of memory usage, not the total number of instrumented files. They default to 4 MiB per process, but you can ask for more memory at configure time when building Darshan (i.e., this is not currently a runtime
tunable).</li><ul>
<li>Furthermore, DXT has some trace filtering logic you can use to restrict which files Darshan instruments (using path prefixes or file extensions). See the documentation here:
<a href="https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html#_using_the_darshan_extended_tracing_dxt_module" id="LPlnk">
https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html#_using_the_darshan_extended_tracing_dxt_module</a></li><li>Note that DXT does not provide the per-file summary counters you traditionally get with Darshan, so you would have to post-process the traces yourself to get stats on read/write activity<br>
</li>
</ul>
</ul>
</div>
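<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
As a rough illustration of the DXT post-processing mentioned above, the Python sketch below sums the bytes read and written per file from the text output of darshan-dxt-parser. The line layout it parses (a "# DXT, file_id: ..., file_name: ..." header followed by whitespace-separated rows of module, rank, operation, segment, offset, length, start, end) is an assumption about that tool's output and should be checked against your Darshan version; the function name summarize_dxt is hypothetical.</div>
<pre>
```python
from collections import defaultdict


def summarize_dxt(lines):
    """Sum bytes read and written per file from DXT trace text.

    Assumed input layout (verify against your darshan-dxt-parser output):
      # DXT, file_id: 123, file_name: /path/to/file
       X_POSIX  0  read  0  0  4096  0.0010  0.0012
    """
    totals = defaultdict(lambda: {"read": 0, "write": 0})
    current_file = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("# DXT, file_id:") and "file_name:" in line:
            # Header line: remember which file the following rows belong to.
            current_file = line.split("file_name:", 1)[1].strip()
        elif line.startswith("X_") and current_file is not None:
            # Data row: module, rank, operation, segment, offset, length, ...
            fields = line.split()
            operation, length = fields[2], int(fields[5])
            if operation in ("read", "write"):
                totals[current_file][operation] += length
    return dict(totals)


if __name__ == "__main__":
    import sys

    # Example: darshan-dxt-parser /path/to/.darshan | python summarize_dxt.py
    for name, counts in summarize_dxt(sys.stdin).items():
        print(name, counts["read"], counts["write"])
```
</pre>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
This only recovers byte totals; anything else the regular per-file counters would give you (operation counts, timing, access sizes) would need similar handling of the other columns.</div>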
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I'm working on some new mechanisms for Darshan that will give you more runtime control over which files are instrumented (using regular expressions to exclude specific directories or extensions), how much memory each module uses, etc. It's a generalization
of the DXT trace filtering I mentioned above. I'm hoping to have something ready to try in the next week or so, and it would be great if you could help try it out. I'll keep you posted.<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
In the meantime, it sounds like you have had some success manually modifying the limit in the source code. I think that should work fine; just keep in mind that you will probably also need to set the DARSHAN_MODMEM environment variable to a sufficiently large
value to hold all of the records at runtime. It might take some experimentation to figure out the right settings (for the hard-coded limit and for DARSHAN_MODMEM) to capture everything.<br>
</div>
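<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
One way to pick a starting point for that experimentation is a back-of-the-envelope estimate: records per process times bytes per record, plus some headroom. The Python sketch below assumes DARSHAN_MODMEM is expressed in MiB; the per-record size is a placeholder you would need to take from the Darshan source for the module in question, and the function name and headroom factor are hypothetical.</div>
<pre>
```python
import math


def modmem_mib(num_records, bytes_per_record, headroom=1.25):
    """Estimate a DARSHAN_MODMEM value in MiB (unit is an assumption).

    num_records:      raised per-process record limit (your patched value)
    bytes_per_record: placeholder; look up the real record size in the source
    headroom:         hypothetical safety factor for other module data
    """
    needed_bytes = num_records * bytes_per_record * headroom
    return math.ceil(needed_bytes / (1024 * 1024))


if __name__ == "__main__":
    # Hypothetical numbers purely for illustration.
    print("DARSHAN_MODMEM=%d" % modmem_mib(100000, 1024))
```
</pre>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Treat the result as a first guess only, and check the resulting log for the incomplete-data warning before trusting it.</div>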
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Thanks,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
--Shane<br>
</div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Darshan-users &lt;darshan-users-bounces@lists.mcs.anl.gov&gt; on behalf of Chunduri, Sudheer &lt;sudheer@anl.gov&gt;<br>
<b>Sent:</b> Monday, June 7, 2021 9:54 AM<br>
<b>To:</b> Jie Liu &lt;jliu279@ucmerced.edu&gt;; darshan-users@lists.mcs.anl.gov &lt;darshan-users@lists.mcs.anl.gov&gt;<br>
<b>Cc:</b> Nicolae, Bogdan &lt;bnicolae@anl.gov&gt;; Si, Min &lt;msi@anl.gov&gt;<br>
<b>Subject:</b> Re: [Darshan-users] Problems about Darshan Logs</font>
<div> </div>
</div>
<style>
<!--
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
@font-face
{font-family:"DengXian Light"}
p.x_MsoNormal, li.x_MsoNormal, div.x_MsoNormal
{margin:0in;
font-size:12.0pt;
font-family:"Calibri",sans-serif}
span.x_EmailStyle19
{font-family:"Calibri",sans-serif;
color:windowtext}
.x_MsoChpDefault
{font-size:10.0pt}
@page WordSection1
{margin:1.0in 1.0in 1.0in 1.0in}
div.x_WordSection1
{}
-->
</style>
<div lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="x_WordSection1">
<p class="x_MsoNormal"><span style="font-size:11.0pt">Hi Jie,</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt">I see you copied the darshan-users mailing list, so Shane should hopefully see this.</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt">Meanwhile, have you tried using “darshan-parser --show-incomplete”?</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt"> </span></p>
<div style="border:none; border-top:solid #B5C4DF 1.0pt; padding:3.0pt 0in 0in 0in">
<p class="x_MsoNormal" style="margin-bottom:12.0pt"><b><span style="color:black">From:
</span></b><span style="color:black">Darshan-users &lt;darshan-users-bounces@lists.mcs.anl.gov&gt; on behalf of Jie Liu &lt;jliu279@ucmerced.edu&gt;<br>
<b>Date: </b>Monday, June 7, 2021 at 9:42 AM<br>
<b>To: </b>darshan-users@lists.mcs.anl.gov &lt;darshan-users@lists.mcs.anl.gov&gt;<br>
<b>Cc: </b>Nicolae, Bogdan &lt;bnicolae@anl.gov&gt;, Si, Min &lt;msi@anl.gov&gt;<br>
<b>Subject: </b>[Darshan-users] Problems about Darshan Logs</span></p>
</div>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">Hi,</span></p>
<p class="x_MsoNormal"><span style="color:black"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">I used Darshan to do some profiling work when training Deep Learning models on ThetaGPU (Resnet50 on ImageNet, mini-batch size is 32).</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">When I used the following command to get the summary of darshan logs:</span></p>
<p class="x_MsoNormal"><b><span style="font-size:11.0pt; font-family:'DengXian Light'; color:red">darshan-job-summary.pl</span></b><span style="font-size:11.0pt; font-family:'DengXian Light'; color:red"> </span><span style="font-size:11.0pt; font-family:'DengXian Light'; color:black">/path/to/.darshan
--output /path/to/summary.pdf</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">The resulting summary.pdf file contains the following warning on the first page:</span></p>
<p class="x_MsoNormal"><b><span style="font-size:11.0pt; font-family:'DengXian Light'; color:red">WARNING</span></b><span style="font-size:11.0pt; font-family:'DengXian Light'; color:black">: This Darshan log contains incomplete data. This happens when a module
runs out of memory to store new record data. Please run darshan-parser on the log file for more information.</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">I also tried darshan-parser with the following command:</span></p>
<p class="x_MsoNormal"><b><span style="font-size:11.0pt; font-family:'DengXian Light'; color:red">darshan-parser</span></b><span style="font-size:11.0pt; font-family:'DengXian Light'; color:red"> </span><span style="font-size:11.0pt; font-family:'DengXian Light'; color:black">/path/to/.darshan
--output /path/to/summary.txt</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">It also shows incomplete data error:</span></p>
<p class="x_MsoNormal"><b><span style="font-size:11.0pt; font-family:'DengXian Light'; color:red">*ERROR*:</span></b><span style="font-size:11.0pt; font-family:'DengXian Light'; color:red"> </span><span style="font-size:11.0pt; font-family:'DengXian Light'; color:black">The
POSIX module contains incomplete data! This happens when a module runs out of memory to store new record data.</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">The ImageNet dataset contains about 1.3 million image files, but the Darshan log shows only 14792 opened files when I trained Resnet50 on ThetaGPU with 2 nodes and 16 GPUs. (<b>Please
check the attached file for more information about the logs obtained by Darshan</b>).</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">Is there an efficient way to make the Darshan logs contain the I/O information for all 1.3 million image files during model training?</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">Previously, I contacted the support team, and their response was “the POSIX module only tracks 1024 files, once we open 1025 files Darshan no longer tracks those files”. How can I make the POSIX
module track all of the image files? </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">For the model training on ThetaGPU using 2 nodes and 16 GPUs, my experimental results show that every process handles 14792/16 = 924.5 image files on average, which is actually less than
1024. How can this be explained? </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">Thanks for your help.</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">Best regards,</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">--</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; color:black">Jie Liu</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt"> </span></p>
</div>
</div>
</body>
</html>