[Darshan-users] Problems about Darshan Logs
Chunduri, Sudheer
sudheer at anl.gov
Mon Jun 7 11:26:32 CDT 2021
Bogdan,
Ok. On the part that we work on in Autoperf, we have received a request for a runtime variable to turn the loading of modules ON and OFF, and we will work on that soon, but that is more in darshan-runtime. I think you are referring to the darshan-utils code, though; if possible, it would be good to have a sample list of these constants in the code. Shane can probably comment better if it is in darshan-utils.
From: Nicolae, Bogdan <bnicolae at anl.gov>
Date: Monday, June 7, 2021 at 11:12 AM
To: Chunduri, Sudheer <sudheer at anl.gov>, Jie Liu <jliu279 at ucmerced.edu>, darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Cc: Si, Min <msi at anl.gov>
Subject: Re: Problems about Darshan Logs
Chunduri,
We finally fixed the problem: Darshan simply has a lot of hardcoded constants in the code. We compiled our own version. Based on our experience, it would be better to expose these constants as environment variables.
Cheers,
Bogdan
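The suggestion above (expose a hardcoded constant as an environment variable) can be sketched as follows. This is only an illustration of the pattern, written in Python for brevity; the names EXAMPLE_MAX_RECORDS, DEFAULT_MAX_RECORDS, and max_records are hypothetical and are not Darshan identifiers. The default of 1024 is the file limit quoted later in this thread.

```python
import os

# Hypothetical names, for illustration only; these are not Darshan identifiers.
DEFAULT_MAX_RECORDS = 1024           # compiled-in default (limit quoted in this thread)
ENV_VAR = "EXAMPLE_MAX_RECORDS"      # assumed name of the override variable


def max_records(environ=os.environ):
    """Return the record limit, preferring a valid override from the environment."""
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_MAX_RECORDS
    try:
        value = int(raw)
    except ValueError:
        # Ignore values that are not integers and keep the default.
        return DEFAULT_MAX_RECORDS
    # Ignore zero or negative limits as well.
    return value if value > 0 else DEFAULT_MAX_RECORDS


if __name__ == "__main__":
    print(max_records())
```

With this pattern, a user could raise the limit at launch time without recompiling, while unset or malformed values fall back to the built-in default.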
________________________________
From: Chunduri, Sudheer <sudheer at anl.gov>
Sent: Monday, June 7, 2021 9:54 AM
To: Jie Liu <jliu279 at ucmerced.edu>; darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Cc: Nicolae, Bogdan <bnicolae at anl.gov>; Si, Min <msi at anl.gov>
Subject: Re: Problems about Darshan Logs
Hi Jie,
I see you copied the darshan-users mailing list, so Shane should hopefully see this.
Meanwhile, have you tried using “darshan-parser --show-incomplete”?
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Jie Liu <jliu279 at ucmerced.edu>
Date: Monday, June 7, 2021 at 9:42 AM
To: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Cc: Nicolae, Bogdan <bnicolae at anl.gov>, Si, Min <msi at anl.gov>
Subject: [Darshan-users] Problems about Darshan Logs
Hi,
I used Darshan to do some profiling work when training Deep Learning models on ThetaGPU (Resnet50 on ImageNet, mini-batch size is 32).
When I used the following command to get the summary of darshan logs:
darshan-job-summary.pl /path/to/.darshan --output /path/to/summary.pdf
The resulting summary.pdf file contains the following warning message on the first page:
WARNING: This Darshan log contains incomplete data. This happens when a module runs out of memory to store new record data. Please run darshan-parser on the log file for more information.
I also tried darshan-parser with the following command:
darshan-parser /path/to/.darshan --output /path/to/summary.txt
It also reports an incomplete-data error:
*ERROR*: The POSIX module contains incomplete data! This happens when a module runs out of memory to store new record data.
The ImageNet dataset contains about 1.3 million image files, but the Darshan log shows only 14792 opened files when I trained Resnet50 on ThetaGPU with 2 nodes and 16 GPUs. (Please see the attached file for more information about the logs obtained by Darshan.)
Is there an efficient way to make the Darshan logs contain the I/O information for all 1.3 million image files accessed during model training?
Previously, I contacted the Support Team, and their response was: “the POSIX module only tracks 1024 files, once we open 1025 files Darshan no longer tracks those files”. How can I make the POSIX module track all the image files?
For the model training on ThetaGPU using 2 nodes and 16 GPUs, my experimental results show that every process handles 14792/16 = 924.5 image files on average, which is actually less than 1024. How can this be explained?
Thanks for your help.
Best regards,
--
Jie Liu
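As a quick check of the arithmetic in the message above, the following sketch recomputes the per-process average and compares it with the quoted limit. It assumes one process per GPU (16 in total) and that the 1024-file limit applies per process; both are assumptions taken from the way the question is framed, not confirmed Darshan behavior.

```python
# Figures taken from the message above.
opened_files = 14792        # files recorded in the Darshan log
per_process_limit = 1024    # POSIX-module limit quoted by the Support Team

# Assumption: one process per GPU, as the 14792/16 division implies.
processes = 16

average_per_process = opened_files / processes      # 924.5
combined_capacity = processes * per_process_limit   # 16384

print(f"average files per process: {average_per_process}")
print(f"combined capacity of {processes} processes: {combined_capacity}")
print(f"unused capacity: {combined_capacity - opened_files}")
```

An average below the limit does not by itself show that every process stayed below it, since the average says nothing about how the files were distributed across processes. Per-process record counts from the log would be needed to tell.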