[Darshan-users] Using darshan to instrument PyTorch

Snyder, Shane ssnyder at mcs.anl.gov
Fri Jul 9 16:19:59 CDT 2021


Thank you for the update, Lu.

Let me see if I can write a test that generates a few thousand file records, like your Test Case 3 -- it seems you start hitting problems there, so it would be good for me to understand what is still limiting you. I can also check whether anything obvious breaks our use of zlib compression when users really start dialing up Darshan's memory usage. Maybe with those two things resolved we can get closer to complete coverage on workloads like this.

I'll see if I can get an updated branch for you to try soon, if you're interested.

--Shane
________________________________
From: Lu Weizheng <luweizheng36 at hotmail.com>
Sent: Tuesday, July 6, 2021 4:24 AM
To: Snyder, Shane <ssnyder at mcs.anl.gov>
Cc: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: Re: Using darshan to instrument PyTorch

Hi Shane,

Thank you so much for your reply!
I have tested your branch. Maybe there are still some problems.

The file structure of my dataset is like the following, which is a typical ImageNet file structure.

train/
|-- n01440764
|   |-- n01440764_10026.JPEG
|   |-- n01440764_10027.JPEG
|   |-- n01440764_10029.JPEG
|   |-- n01440764_10040.JPEG
|   |-- n01440764_10042.JPEG
…
val/
…

n01440764 is one of the 1000 class identifiers in this dataset. The train folder contains 1000 subfolders, one per class, and each class represents a different kind of object in the images.

I have two filesystems: a local SSD with xfs on the compute node, and a Lustre filesystem. I ran the tests on both filesystems and the results were the same.
Here is what I did to test Darshan with Python.

I compiled Darshan from the 'snyder/dev-log-filters' branch.
Test Cases 1-4 are based on a simple image-reading Python program. The program just walks through some folders and uses the Python PIL library (the most commonly used image-reading library in the PyTorch computer vision community) to read the JPEG images into memory and convert them to RGB.
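A minimal sketch of an image-reading program like the one described above, for readers who want to reproduce the workload. The original script was not posted, so the function names (load_rgb, read_images) and the file-extension filter are assumptions. The loader is passed in as a parameter so the directory walk can be exercised without Pillow installed.

```python
import os
from typing import Callable, Iterable


def load_rgb(path: str):
    """Decode one image with Pillow and convert it to RGB."""
    # Imported lazily so the directory walk below has no hard Pillow dependency.
    from PIL import Image

    with Image.open(path) as img:
        return img.convert("RGB")


def read_images(folders: Iterable[str],
                loader: Callable[[str], object] = load_rgb) -> int:
    """Walk each folder, load every JPEG found, and return the number of files read."""
    count = 0
    for folder in folders:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in sorted(filenames):
                # Assumption: images are identified by a .JPEG/.jpg suffix.
                if name.lower().endswith((".jpeg", ".jpg")):
                    loader(os.path.join(dirpath, name))
                    count += 1
    return count
```

For a single class folder, the call would be read_images(["train/n01440764"]). The returned count is the number of files the program tried to read, which is the number to compare against the file records in the Darshan log.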

Test Case 1:
WITHOUT DARSHAN_CONF_PATH and DARSHAN_MODMEM=2048.
I read only one folder, n01440764. There are 1300 JPEG images in this folder. Image sizes range from 10 KB to 100 KB.
The collected log reports that the POSIX module contains incomplete data.

Test Case 2:
WITH DARSHAN_CONF_PATH, which sets MAX_RECORDS to a very large value such as 1200000, and DARSHAN_MODMEM=2048.
The Python program and the folder are the same as in Test Case 1. There is no more incomplete data. Using grep on the darshan-parser output shows that the number of recorded files is exactly what the Python program reads. Total bytes read is correct.
So I guess DARSHAN_CONF_PATH takes effect. The Darshan log before parsing is 141K; the darshan-parser output is 23M.

Test Case 3:
WITH DARSHAN_CONF_PATH, which sets MAX_RECORDS to a very large value such as 1200000, and DARSHAN_MODMEM=2048.
I read more folders in the Python program. In total the program reads 2 folders and 2600 images. With two or more folders, the log shows incomplete data again. Using grep on the darshan-parser output shows that the number of recorded files is 100 fewer than what the Python program reads.

Test Case 4:
WITH DARSHAN_CONF_PATH, which sets MAX_RECORDS to a very large value such as 1200000, and DARSHAN_MODMEM=4096.
The Python program is the same as in Test Case 1. Now I get:
darshan_library_warning: error compressing job record.
darshan_library_warning: unable to write job record to file.
These warnings are probably related to a problem I mentioned earlier in this thread, and may be zlib related (https://github.com/darshan-hpc/darshan/blob/e85b8bc929da91e54ff68fb1210dfe7bee3261a3/darshan-runtime/lib/darshan-core.c#L2039).

Test Case 5:
WITH DARSHAN_CONF_PATH, which sets MAX_RECORDS to a very large value such as 3000000, and DARSHAN_MODMEM=2048.
I use a typical PyTorch ImageNet training program, which includes image reading, data processing and neural network training. darshan-parser shows that Darshan could not record all the counters correctly. The logs are incomplete and total bytes read is 0.


So I guess DARSHAN_CONF_PATH takes effect. But for a larger number of files, Darshan's POSIX module may run out of memory.
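The record-count check done with grep in Test Cases 2 and 3 can also be sketched in Python. This assumes darshan-parser's text output uses whitespace-separated columns in the order <module> <rank> <record id> <counter> <value> <file name> <mount pt> <fs type>, and that file names contain no whitespace (true for ImageNet file names). The function names are hypothetical.

```python
from typing import Iterable, List, Set


def posix_files_in_parser_output(lines: Iterable[str]) -> Set[str]:
    """Collect distinct file names from POSIX record lines of darshan-parser text output."""
    files = set()
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        # Assumed column layout: module is field 0, file name is field 5.
        if len(fields) >= 6 and fields[0] == "POSIX":
            files.add(fields[5])
    return files


def missing_files(expected: Iterable[str], recorded: Set[str]) -> List[str]:
    """Return the expected paths that have no POSIX record, sorted."""
    return sorted(set(expected) - recorded)
```

Feeding it the saved darshan-parser output (for example, the lines of a text file produced by redirecting darshan-parser's stdout) together with the list of paths the program read shows which files were dropped, rather than only how many.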



On July 3, 2021, at 4:28 AM, Snyder, Shane <ssnyder at mcs.anl.gov> wrote:

Hi Lu,

Sure, I can give you some details on how to use it. Most of the details are actually contained in this PR: https://github.com/darshan-hpc/darshan/pull/405

So, to use it, you would need to check out the branch that PR is based on ('snyder/dev-log-filters', https://github.com/darshan-hpc/darshan/tree/snyder/dev-log-filters) and build it just like you would normally build Darshan.

The only additional trick is that you can specify a config file for Darshan to use at runtime via the DARSHAN_CONF_PATH environment variable (i.e., export DARSHAN_CONF_PATH=/path/to/my/darshan/config/file). You can add whatever lines you need to your config file to control various aspects of Darshan's runtime behavior as outlined in the PR. Most relevant for you is probably just the ability to request that the POSIX module use more than the default 1,024 records, like this:

# request POSIX store 1.2 million file records rather than 1024 default
MAX_RECORDS       1200000                    POSIX

You may also consider using NAME_EXCLUDE options to provide regular expressions of files to ignore that are not related to your ImageNet test case.

# e.g., ignore files in /home and files that end in .txt
NAME_EXCLUDE    ^/home    *
NAME_EXCLUDE    .txt$          *
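As a rough way to preview which paths such patterns would exclude, the sketch below applies the same two expressions with Python's re module. This is only an approximation: it assumes Darshan matches the pattern anywhere in the path (search semantics), and the regex dialect Darshan uses may differ from Python's. The helper name is_excluded is hypothetical.

```python
import re

# The two patterns from the config example above, copied verbatim.
EXCLUDES = [r"^/home", r".txt$"]


def is_excluded(path, patterns=EXCLUDES):
    """Return True if any pattern matches somewhere in the path."""
    return any(re.search(p, path) for p in patterns)


for candidate in ["/home/lu/train.py", "/data/labels.txt",
                  "/data/train/n01440764/n01440764_10026.JPEG"]:
    print(candidate, is_excluded(candidate))
```

Note that the dot in .txt$ is unescaped, so in standard regular-expression syntax it matches any character: a path ending in "atxt" would also match. Writing \.txt$ restricts the match to a literal ".txt" suffix.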

I have not tried to get Darshan to instrument such a massive single process workload, but will be interested to see if you have success. As I mentioned in my previous email, you'll probably want to bump DARSHAN_MODMEM up to around 2 GB to handle this, at the very least.
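The setup described above can be sketched as a small Python helper that writes the config file and prepares the environment for the instrumented process. Only DARSHAN_CONF_PATH, DARSHAN_MODMEM, MAX_RECORDS and NAME_EXCLUDE come from this thread; the helper names, the default values, and the reading of DARSHAN_MODMEM as MiB (so 2048 is roughly 2 GB) are assumptions. Since the branch is experimental, the config syntax may change.

```python
import os


def write_darshan_config(path, max_records=1200000, excludes=()):
    """Write a Darshan config file in the format shown in the thread."""
    lines = ["MAX_RECORDS    %d    POSIX" % max_records]
    # '*' applies the exclusion to all modules, as in the example above.
    lines += ["NAME_EXCLUDE    %s    *" % pattern for pattern in excludes]
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return path


def darshan_env(conf_path, modmem=2048, base=None):
    """Return a copy of the environment with the two Darshan variables set."""
    env = dict(os.environ if base is None else base)
    env["DARSHAN_CONF_PATH"] = conf_path
    # Assumption: the value is interpreted in MiB, so 2048 is about 2 GB.
    env["DARSHAN_MODMEM"] = str(modmem)
    return env
```

The returned dictionary can be passed as env= to subprocess.run when launching the Python workload, alongside whatever mechanism is already used to load the Darshan library into the process.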

Another disclaimer I'll mention is that, since this stuff is experimental, some of these steps or naming conventions could change by the time we merge this into our main branch for eventual release. Not a big deal for now, but just a heads up.

Thanks,
--Shane

________________________________
From: Lu Weizheng <luweizheng36 at hotmail.com>
Sent: Friday, July 2, 2021 4:02 AM
To: Snyder, Shane <ssnyder at mcs.anl.gov>
Subject: Re: Using darshan to instrument PyTorch

Hi Shane,

Could you tell me more about the experimental branch? Is it on GitHub? I want to try it.

Thanks!

On June 18, 2021, at 11:11 PM, Snyder, Shane <ssnyder at mcs.anl.gov> wrote:

Those changes are in an experimental branch right now while I fine tune the implementation, but if you're interested in trying it out I could give you some details.
