[Darshan-users] Using darshan to instrument PyTorch

Fri Jun 18 09:04:07 CDT 2021

Today I ran some more experiments on instrumenting PyTorch with Darshan. I guess it is very likely that PyTorch's default DataLoader uses multiprocessing, which is why I cannot get a correct Darshan log.
I switched to NVIDIA DALI (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html), another data loading backend that does not use multiprocessing to load data. Darshan now seems able to collect the I/O behavior: with darshan-parser, I can see POSIX read/write records for the files I am using.
However, there is still a problem. I am training on ImageNet, which is 160 GB in total, with 1.2 million images in 1k folders, and Darshan seems to run out of memory. I have raised the DARSHAN_MODMEM environment variable to 40960 MB, but I still get these warnings:

darshan_library_warning: error compressing job record
darshan_library_warning: unable to write job record to file

I added some debug lines to the darshan-runtime source code. Darshan hits this line: tmp_stream.avail_out == 0 (https://github.com/darshan-hpc/darshan/blob/e85b8bc929da91e54ff68fb1210dfe7bee3261a3/darshan-runtime/lib/darshan-core.c#L2039). It seems that zlib is trying to compress the buffered data but runs out of output buffer space. My current working node has 60 GB of main memory.
So what should I do now? Should I use another node with more memory, or set DARSHAN_MODMEM to a much larger value?

Thank you very much in advance for any reply.

Lu

On June 18, 2021, at 5:26 AM, Snyder, Shane <ssnyder at mcs.anl.gov> wrote:

Hi Lu,

(sending to the entire mailing list now)

Unfortunately, we don't currently have either a tool for combining multiple logs from a workflow into a single log file or analysis tools that work on sets of logs.

We do have a utility called 'darshan-merge' that was written to help merge together Darshan logs for another use case, but I don't think it will work right for this case from some quick testing. I've opened an issue on our GitHub page (https://github.com/darshan-hpc/darshan/issues/401) to remind myself to see if I can rework this tool to be more helpful in cases like yours.

At some point, we'd like to offer some of our own analysis tools that are workflow aware and can summarize data from multiple Darshan logs. That's something that's going to take some time though, as we are just now starting to look at revamping some of our analysis tools using the new PyDarshan interface to Darshan logs. BTW, PyDarshan might be something you could consider using if you wanted to come up with your own analysis tools for Darshan data, but that might be more work than you're looking for. In case it's helpful, here's some documentation on PyDarshan: https://www.mcs.anl.gov/research/projects/darshan/docs/pydarshan/index.html
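
As a rough illustration of the "write your own analysis with PyDarshan" idea, here is a minimal sketch that sums POSIX byte counters over a set of per-process logs. It is not an official Darshan tool. The helper names (load_posix_counters, sum_posix_bytes) are made up for this example, and the PyDarshan calls (DarshanReport with read_all=True, records["POSIX"].to_df()["counters"]) are written as I understand the current interface, so please check them against the PyDarshan documentation for the version you have installed.

```python
from typing import Callable, Dict, Iterable

BYTE_COUNTERS = ("POSIX_BYTES_READ", "POSIX_BYTES_WRITTEN")


def load_posix_counters(path: str) -> Dict[str, int]:
    """Return the POSIX byte totals of one Darshan log (hypothetical helper)."""
    # Imported lazily so the summing logic below works without PyDarshan.
    import darshan

    report = darshan.DarshanReport(path, read_all=True)
    if "POSIX" not in report.records:
        # Assumption: a log without POSIX records contributes nothing.
        return {name: 0 for name in BYTE_COUNTERS}
    # Assumption: to_df() returns a dict holding a "counters" DataFrame
    # with one row per file record and one column per counter.
    counters = report.records["POSIX"].to_df()["counters"]
    return {name: int(counters[name].sum()) for name in BYTE_COUNTERS}


def sum_posix_bytes(
    paths: Iterable[str],
    load: Callable[[str], Dict[str, int]] = load_posix_counters,
) -> Dict[str, int]:
    """Add up the byte counters of every log in `paths`."""
    totals = {name: 0 for name in BYTE_COUNTERS}
    for path in paths:
        for name, value in load(path).items():
            totals[name] = totals.get(name, 0) + value
    return totals
```

Calling sum_posix_bytes(glob.glob("*.darshan")) would then give one read total and one write total for the whole set of logs. The loader is passed in as a parameter only so that the summing step can be tried out without real log files.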

Thanks,
--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Lu Weizheng <luweizheng36 at hotmail.com>
Sent: Tuesday, June 15, 2021 3:43 AM
To: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: [Darshan-users] Using darshan to instrument PyTorch

Hi,

I am using Darshan to instrument PyTorch on a local machine. My workload is an image classification problem on the ImageNet dataset. When the training process ends, a lot of logs are generated, such as:

u2020000_python_id4719_6-15-41351-17690910011763757569_1.darshan
u2020000_python_id5012_6-15-42860-17690910011763757569_1.darshan
u2020000_python_id4721_6-15-41352-17690910011763757569_1.darshan
u2020000_uname_id4720_6-15-41351-17690910011763757569_1.darshan
u2020000_python_id4722_6-15-41352-17690910011763757569_1.darshan
u2020000_uname_id4723_6-15-41354-17690910011763757569_1.darshan
u2020000_python_id4758_6-15-41830-17690910011763757569_1.darshan
u2020000_uname_id4724_6-15-41354-17690910011763757569_1.darshan
...
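
A listing like the one above can be sorted by executable before any analysis, so that the python logs are separated from the uname ones. The following is a minimal sketch. The regular expression is inferred only from the file names shown here (user, executable, "id" plus a number, a dash-separated stamp, a numeric suffix) and is an assumption, not a documented naming rule. The function name group_logs_by_exe is also made up for this example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed layout, inferred from the names in the listing:
#   <user>_<exe>_id<number>_<n>-<n>-<n>-<n>_<n>.darshan
# A user name containing "_" would be split incorrectly by this pattern.
LOG_NAME = re.compile(
    r"^(?P<user>.+?)_(?P<exe>.+)_id(?P<id>\d+)_"
    r"(?P<stamp>\d+-\d+-\d+-\d+)_(?P<suffix>\d+)\.darshan$"
)


def group_logs_by_exe(names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each executable name to the log files that belong to it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        match = LOG_NAME.match(name)
        if match is None:
            continue  # skip anything that does not look like a Darshan log
        groups[match.group("exe")].append(name)
    return dict(groups)
```

Applied to the eight names listed above, this gives five logs under "python" and three under "uname".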

After running the darshan-util analysis tool on one of the above log files, it shows: I/O performance estimate (at the POSIX layer): transferred 7.5 MiB at 36.02 MiB/s

The amount of data transferred shown in the PDF report is far less than the whole dataset size. As the PyTorch DataLoader is a multi-process program, I guess Darshan generates one log per process.

My question is: how can I get the I/O analysis for the whole PyTorch workload rather than for these per-process logs?
