<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
What issue specifically do you see when using PyTorch? Does Darshan run out of memory in those cases, or does it just not capture information on the files you expect? We recently modified Darshan to gracefully handle apps that call fork(), but if PyTorch is
using the Python 'multiprocessing' module, it is likely that we can't accurately capture the I/O behavior: 'multiprocessing' uses clone() system calls that we have not found a way to properly handle in Darshan. We should probably think a bit more to see if
it's at all possible to account for apps that use clone(), but it seemed pretty tricky when I last looked.</div>
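<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
As a rough illustration of why worker-process data loading matters here (a minimal sketch, not Darshan or PyTorch code): when file reads are pushed into worker processes, they happen under different PIDs than the main Python process. The helper names below are hypothetical, and the sketch assumes the 'fork' start method is available, as it is on Linux.</div>
<pre>
```python
import multiprocessing as mp
import os
import tempfile


def read_in_worker(path, conn):
    """Child process: read the file and report (pid, bytes_read) to the parent."""
    with open(path, "rb") as f:
        nbytes = len(f.read())
    conn.send((os.getpid(), nbytes))
    conn.close()


def demo(num_workers=2, size=1024):
    """Write a small temp file, then read it once in each worker process."""
    # Assumption: the "fork" start method exists on this platform (POSIX only).
    ctx = mp.get_context("fork")
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * size)
        path = tmp.name
    results = []
    try:
        for _ in range(num_workers):
            parent_conn, child_conn = ctx.Pipe()
            worker = ctx.Process(target=read_in_worker, args=(path, child_conn))
            worker.start()
            results.append(parent_conn.recv())  # blocks until the worker reports
            worker.join()
    finally:
        os.remove(path)
    return os.getpid(), results


if __name__ == "__main__":
    parent_pid, results = demo()
    print("parent pid:", parent_pid)
    for pid, nbytes in results:
        print("worker pid", pid, "read", nbytes, "bytes")
```
</pre>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Every read in this sketch is issued by a worker, so a tool that only follows the parent process would see none of that file I/O.</div>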
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
As for your second issue related to Darshan running out of memory, Darshan does have some internal limits that prevent each module from instrumenting more than 1,024 files for a job. Increasing DARSHAN_MODMEM does not increase those limits, and in fact, those
limits are not currently tunable in any way. That said, we are working on changes to Darshan that will allow you to control those limits on a per-module basis, so you could set DARSHAN_MODMEM really high and configure Darshan to allow the POSIX module to record
1.2 million files, theoretically. Those changes are in an experimental branch while I fine-tune the implementation, but if you're interested in trying them out I could give you some details.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
It does sound like there could be a bug in Darshan's core library that causes problems when compressing with zlib and using really large DARSHAN_MODMEM values. I'll investigate that more to see if I can trigger it and put in a workaround. My
hunch is that we don't properly handle buffers over 4 GB, so you might consider dialing DARSHAN_MODMEM back to around 2 GB or so at max -- that should still be enough space to capture info on 1.2 million files. But again, setting it that high isn't
helpful right now without the new changes I'm working on.<br>
</div>
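<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
For reference, here is the back-of-the-envelope arithmetic behind those numbers, as a small sketch. It assumes DARSHAN_MODMEM is given in MiB and simply divides the budget evenly across files; it does not model Darshan's actual per-record size, and the function name is made up for illustration.</div>
<pre>
```python
MIB = 1024 * 1024
FOUR_GIB = 4 * 1024 * MIB  # the 4 GB mark from the hunch above, taken as 2**32 bytes


def modmem_budget(modmem_mib, num_files):
    """Return (total bytes, size relative to 4 GiB, bytes available per file)."""
    total_bytes = modmem_mib * MIB
    return total_bytes, total_bytes / FOUR_GIB, total_bytes // num_files


if __name__ == "__main__":
    for modmem_mib in (40960, 2048):
        total, ratio, per_file = modmem_budget(modmem_mib, 1_200_000)
        print(modmem_mib, "MiB:", ratio, "x 4 GiB,", per_file, "bytes per file")
```
</pre>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
With 1.2 million files, 40960 MiB is ten times the 4 GiB mark, while 2048 MiB is half of it and still leaves about 1,789 bytes per file under this even-split assumption.</div>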
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Thanks,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
--Shane<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Lu Weizheng &lt;luweizheng36@hotmail.com&gt;<br>
<b>Sent:</b> Friday, June 18, 2021 9:04 AM<br>
<b>To:</b> Snyder, Shane &lt;ssnyder@mcs.anl.gov&gt;<br>
<b>Cc:</b> darshan-users@lists.mcs.anl.gov &lt;darshan-users@lists.mcs.anl.gov&gt;<br>
<b>Subject:</b> Re: Using darshan to instrument PyTorch</font>
<div> </div>
</div>
<div class="" style="word-wrap:break-word; line-break:after-white-space"><span class="" style="font-size:15px">Today I ran some more experiments on instrumenting PyTorch with Darshan. It seems very likely that PyTorch's default DataLoader uses multiprocessing,
so I cannot get a correct Darshan log. </span>
<div class=""><span class="" style="font-size:15px">I switched to NVIDIA DALI (<a href="https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html" class="">https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html</a>), a different data
loading backend, instead of using multiprocessing to load data. Now Darshan seems to collect the I/O behavior: using darshan-parser, I can see POSIX read/write records for the files I am using. </span>
<div class=""><span class="" style="font-size:15px">However, there is still a problem. I am training on ImageNet, which totals 160 GB and has 1.2 million images in 1k folders, and Darshan seems to run out of memory. I have raised the DARSHAN_MODMEM environment
variable to 40960 MB. I get the following warnings:<br class="">darshan_library_warning: error compressing job record<br class="">darshan_library_warning: unable to write job record to file </span></div>
<div class=""><span class="" style="font-size:15px">I added some debug lines to the darshan-runtime source code. Darshan hits this line: tmp_stream.avail_out == 0 (<a href="https://github.com/darshan-hpc/darshan/blob/e85b8bc929da91e54ff68fb1210dfe7bee3261a3/darshan-runtime/lib/darshan-core.c#L2039" class="">https://github.com/darshan-hpc/darshan/blob/e85b8bc929da91e54ff68fb1210dfe7bee3261a3/darshan-runtime/lib/darshan-core.c#L2039</a>).
It seems that zlib is trying to compress the buffered data but runs out of output buffer space. My current working node has 60 GB of main memory.</span></div>
<div class=""><span class="" style="font-size:15px">So what should I do now? Should I use another node with more memory, or set </span><span class="" style="font-size:15px">DARSHAN_MODMEM to a much larger value?</span></div>
<div class=""><span class="" style="font-size:15px"><br class="">
</span></div>
<div class=""><span class="" style="font-size:15px">I would be very grateful for a reply.</span></div>
<div class=""><span class="" style="font-size:15px"><br class="">
</span></div>
<div class=""><span class="" style="font-size:15px">Lu</span></div>
<div class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On June 18, 2021, at 5:26 AM, Snyder, Shane &lt;<a href="mailto:ssnyder@mcs.anl.gov" class="">ssnyder@mcs.anl.gov</a>&gt; wrote:</div>
<br class="x_Apple-interchange-newline">
<div class="">
<div class="" style="font-style:normal; font-variant-caps:normal; font-weight:normal; letter-spacing:normal; text-align:start; text-indent:0px; text-transform:none; white-space:normal; word-spacing:0px; text-decoration:none; font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
<span class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">Hi Lu,</span>
<div class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">
<br class="">
</div>
<div class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">
(sending to the entire mailing list now)<br class="">
</div>
<div class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">
<br class="">
</div>
<div class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">
Unfortunately, we don't currently have either a tool for combining multiple logs from a workflow into a single log file or analysis tools that work on sets of logs.<span class="x_Apple-converted-space"> </span><br class="">
</div>
<div class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">
<br class="">
</div>
<div class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">
We do have a utility called 'darshan-merge' that was written to help merge Darshan logs for another use case, but from some quick testing I don't think it will work correctly for this case. I've opened an issue on our GitHub page (<a href="https://github.com/darshan-hpc/darshan/issues/401" target="_blank" rel="noopener noreferrer" class="">https://github.com/darshan-hpc/darshan/issues/401</a>)
to remind myself to see if I can rework this tool to be more helpful in cases like yours.</div>
<div class=""></div>
<div class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">
<br class="">
</div>
<div class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">
At some point, we'd like to offer some of our own analysis tools that are workflow aware and can summarize data from multiple Darshan logs. That's something that's going to take some time though, as we are just now starting to look at revamping some of our
analysis tools using the new PyDarshan interface to Darshan logs. BTW, PyDarshan might be something you could consider using if you wanted to come up with your own analysis tools for Darshan data, but that might be more work than you're looking for. In case
it's helpful, here's some documentation on PyDarshan:<span class="x_Apple-converted-space"> </span><a href="https://www.mcs.anl.gov/research/projects/darshan/docs/pydarshan/index.html" target="_blank" rel="noopener noreferrer" class="">https://www.mcs.anl.gov/research/projects/darshan/docs/pydarshan/index.html</a><br class="">
</div>
<div class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">
<br class="">
</div>
<div class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">
Thanks,</div>
<span class="" style="font-size:12pt; font-family:Calibri,Arial,Helvetica,sans-serif">--Shane</span><br class="">
</div>
<div id="x_appendonsend" class="" style="font-family:Helvetica; font-size:12px; font-style:normal; font-variant-caps:normal; font-weight:normal; letter-spacing:normal; text-align:start; text-indent:0px; text-transform:none; white-space:normal; word-spacing:0px; text-decoration:none">
</div>
<hr tabindex="-1" class="" style="font-family:Helvetica; font-size:12px; font-style:normal; font-variant-caps:normal; font-weight:normal; letter-spacing:normal; text-align:start; text-indent:0px; text-transform:none; white-space:normal; word-spacing:0px; text-decoration:none; display:inline-block; width:748.71875px">
<span class="" style="font-family:Helvetica; font-size:12px; font-style:normal; font-variant-caps:normal; font-weight:normal; letter-spacing:normal; text-align:start; text-indent:0px; text-transform:none; white-space:normal; word-spacing:0px; text-decoration:none; float:none; display:inline!important"></span>
<div id="x_divRplyFwdMsg" dir="ltr" class="" style="font-family:Helvetica; font-size:12px; font-style:normal; font-variant-caps:normal; font-weight:normal; letter-spacing:normal; text-align:start; text-indent:0px; text-transform:none; white-space:normal; word-spacing:0px; text-decoration:none">
<font face="Calibri, sans-serif" class="" style="font-size:11pt"><b class="">From:</b><span class="x_Apple-converted-space"> </span>Darshan-users &lt;<a href="mailto:darshan-users-bounces@lists.mcs.anl.gov" class="">darshan-users-bounces@lists.mcs.anl.gov</a>&gt;
on behalf of Lu Weizheng &lt;<a href="mailto:luweizheng36@hotmail.com" class="">luweizheng36@hotmail.com</a>&gt;<br class="">
<b class="">Sent:</b><span class="x_Apple-converted-space"> </span>Tuesday, June 15, 2021 3:43 AM<br class="">
<b class="">To:</b><span class="x_Apple-converted-space"> </span><a href="mailto:darshan-users@lists.mcs.anl.gov" class="">darshan-users@lists.mcs.anl.gov</a> &lt;<a href="mailto:darshan-users@lists.mcs.anl.gov" class="">darshan-users@lists.mcs.anl.gov</a>&gt;<br class="">
<b class="">Subject:</b><span class="x_Apple-converted-space"> </span>[Darshan-users] Using darshan to instrument PyTorch</font>
<div class=""> </div>
</div>
<div dir="ltr" class="" style="font-family:Helvetica; font-size:12px; font-style:normal; font-variant-caps:normal; font-weight:normal; letter-spacing:normal; text-align:start; text-indent:0px; text-transform:none; white-space:normal; word-spacing:0px; text-decoration:none">
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
Hi,</div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
<br class="">
</div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
I am using Darshan to instrument PyTorch on a local machine. My workload is an image classification problem on the ImageNet dataset. When the training process ended, a lot of logs had been generated, such as:</div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
<br class="">
</div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
u2020000_python_id4719_6-15-41351-17690910011763757569_1.darshan </div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
u2020000_python_id5012_6-15-42860-17690910011763757569_1.darshan
<div class="">u2020000_python_id4721_6-15-41352-17690910011763757569_1.darshan </div>
<div class="">u2020000_uname_id4720_6-15-41351-17690910011763757569_1.darshan</div>
<div class="">u2020000_python_id4722_6-15-41352-17690910011763757569_1.darshan </div>
<div class="">u2020000_uname_id4723_6-15-41354-17690910011763757569_1.darshan</div>
u2020000_python_id4758_6-15-41830-17690910011763757569_1.darshan </div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
u2020000_uname_id4724_6-15-41354-17690910011763757569_1.darshan</div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
...</div>
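<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
<br class="">
</div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
As a side note, the executable name embedded in each file name can be used to sort these logs. The sketch below assumes the names follow a user_exe_idN_month-day-seconds-logid_n.darshan pattern, which is inferred from the examples above rather than taken from the Darshan documentation; the function name is hypothetical.</div>
<pre>
```python
import re
from collections import Counter

# Assumed layout (inferred from the file names above, not from Darshan docs):
# user_exe_idN_month-day-seconds-logid_n.darshan
LOG_NAME = re.compile(r"^(.+?)_(.+)_id(\d+)_(\d+)-(\d+)-(\d+)-(\d+)_(\d+)\.darshan$")


def count_logs_by_exe(filenames):
    """Count log files per executable name; names that do not match are skipped."""
    counts = Counter()
    for name in filenames:
        match = LOG_NAME.match(name.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    names = [
        "u2020000_python_id4719_6-15-41351-17690910011763757569_1.darshan",
        "u2020000_uname_id4720_6-15-41351-17690910011763757569_1.darshan",
    ]
    print(count_logs_by_exe(names))
```
</pre>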
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
<br class="">
</div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
After running the darshan-util analysis tool on one of the above log files, it shows: I/O performance estimate (at the POSIX layer): transferred 7.5 MiB at 36.02 MiB/s</div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
<br class="">
</div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
The amount of data transferred shown in the PDF report is far less than the whole dataset size. <span class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">As the PyTorch DataLoader is a multi-process program, I guess Darshan generates a separate log
for every process. </span></div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
<br class="">
</div>
<div class="" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
My question is: how can I get an I/O analysis of the whole PyTorch workload instead of these per-process logs?</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</div>
</body>
</html>