[Darshan-users] Problems about Darshan Logs

Mon Jun 7 11:12:44 CDT 2021

Chunduri,

We finally fixed the problem: Darshan simply has a lot of hardcoded constants in the code, so we compiled our own version. Based on our experience, it would be better to expose these constants as environment variables.
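
The suggestion above, a built-in default that an environment variable can override, could look roughly like the following. This is only a sketch of the pattern in Python, not Darshan's code (Darshan itself is written in C). The variable name `EXAMPLE_MAX_RECORDS` and the default of 1024 are placeholders chosen for illustration and are not real Darshan settings.

```python
import os

# Hypothetical built-in default; stands in for a hardcoded constant.
DEFAULT_MAX_RECORDS = 1024


def max_records(env=None, name="EXAMPLE_MAX_RECORDS"):
    """Return the record limit, letting an environment variable override the default.

    `name` is a made-up variable name used only for this illustration.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return DEFAULT_MAX_RECORDS
    try:
        value = int(raw)
    except ValueError:
        # Malformed input: fall back to the default.
        return DEFAULT_MAX_RECORDS
    # Non-positive limits are rejected as well.
    return value if value > 0 else DEFAULT_MAX_RECORDS
```

With this pattern a user could raise the limit at job launch without rebuilding, while an unset or invalid value leaves the default in place.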

Cheers,
Bogdan
________________________________
From: Chunduri, Sudheer <sudheer at anl.gov>
Sent: Monday, June 7, 2021 9:54 AM
To: Jie Liu <jliu279 at ucmerced.edu>; darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Cc: Nicolae, Bogdan <bnicolae at anl.gov>; Si, Min <msi at anl.gov>
Subject: Re: Problems about Darshan Logs

Hi Jie,

I see you copied the darshan-users mailing list, so Shane should hopefully see this.

Meanwhile, have you tried using “darshan-parser --show-incomplete”?
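
For scripted use, the invocation suggested above might be wrapped as follows. This is a minimal sketch that assumes darshan-parser is on the PATH and prints its report to standard output; the helper names `parser_command` and `dump_log` are made up for this example.

```python
import subprocess


def parser_command(log_path):
    """Build the darshan-parser invocation that also prints incomplete records."""
    return ["darshan-parser", "--show-incomplete", log_path]


def dump_log(log_path, out_path):
    """Run darshan-parser and save its standard output to out_path."""
    with open(out_path, "w") as out:
        # check=True raises CalledProcessError if the parser exits non-zero.
        subprocess.run(parser_command(log_path), stdout=out, check=True)
```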

From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Jie Liu <jliu279 at ucmerced.edu>
Date: Monday, June 7, 2021 at 9:42 AM
To: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Cc: Nicolae, Bogdan <bnicolae at anl.gov>, Si, Min <msi at anl.gov>
Subject: [Darshan-users] Problems about Darshan Logs

Hi,

I used Darshan to do some profiling work when training Deep Learning models on ThetaGPU (Resnet50 on ImageNet, mini-batch size is 32).

I used the following command to get a summary of the Darshan logs:

darshan-job-summary.pl  /path/to/.darshan --output /path/to/summary.pdf

The resulting summary.pdf file contains the following message on the first page:

WARNING: This Darshan log contains incomplete data. This happens when a module runs out of memory to store new record data. Please run darshan-parser on the log file for more information.

I also tried darshan-parser with the following command:

darshan-parser  /path/to/.darshan --output /path/to/summary.txt

It also reports an incomplete-data error:

*ERROR*: The POSIX module contains incomplete data! This happens when a module runs out of memory to store new record data.

The ImageNet dataset contains about 1.3 million image files, but the Darshan log shows only 14792 opened files when I trained Resnet50 on ThetaGPU with 2 nodes and 16 GPUs. (Please check the attached file for more information about the logs obtained by Darshan.)

Is there an efficient way to make the Darshan logs contain the I/O information for all 1.3 million image files read during model training?

Previously, I contacted the Support Team, and their response was “the POSIX module only tracks 1024 files, once we open 1025 files Darshan no longer tracks those files”. How can I make the POSIX module track all the image files?

For the model training on ThetaGPU using 2 nodes and 16 GPUs, my experimental results show that every process handles 14792/16 = 924.5 image files on average, which is actually less than 1024. How can this be explained?
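
The division in the question above works out to 924.5. An average alone does not show how many records each process actually stored, so a per-rank count may be more informative. The sketch below assumes one process per GPU and assumes the darshan-parser text output uses whitespace-separated columns that begin with module, rank, and record id. The function name `records_per_rank` is made up for this example.

```python
from collections import defaultdict

total_files = 14792   # opened files reported in the log
processes = 16        # 2 nodes, 16 GPUs, one process per GPU (assumed)
print(total_files / processes)  # 924.5


def records_per_rank(lines, module="POSIX"):
    """Count distinct record ids per rank in darshan-parser text output."""
    seen = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        if len(fields) < 3 or fields[0] != module:
            continue  # skip other modules and short lines
        rank, record_id = int(fields[1]), fields[2]
        seen[rank].add(record_id)
    return {rank: len(ids) for rank, ids in seen.items()}
```

Feeding the lines of a parser dump to `records_per_rank` gives a dictionary from rank to the number of distinct records, which can then be compared with the 1024 limit quoted by the Support Team.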

Thanks for your help.

Best regards,

--

Jie Liu
