[Darshan-users] Darshan & IPM results not the same
Snyder, Shane
ssnyder at mcs.anl.gov
Tue Nov 9 11:36:08 CST 2021
Hi Vineet,
We've been getting more and more reports about this problem from users recently, but as Kevin mentions, we hope to have something more configurable in our next release.
Copying a response I had recently for another user:
We do have a work-in-progress PR on our repo that is supposed to help work around this limit: https://github.com/darshan-hpc/darshan/pull/405
That PR is about modifying Darshan to accept configuration options from users at runtime (currently only through a config file, but ultimately there will be corresponding environment variables), including the ability to control how many records are pre-allocated for each module (via the MAX_RECORDS configuration setting). I think things work well enough on that branch that you could try requesting more records for the POSIX module, or whatever modules you're interested in, and see how it goes. The PR has some details on how to do this.
So there is a WIP branch you can use, but it still needs to be cleaned up. I think you should have better luck if you use the config file to set a more appropriate MAX_RECORDS value for the POSIX module (and whatever other modules you have that run out of memory). You should probably set DARSHAN_MODMEM to a higher value than the default, too, but it seems really unlikely to me that you'd need GiBs of memory as you mention in your last response (and this in fact could cause Darshan to crash). A reasonable way to estimate a value is to assume Darshan needs 1 or 2 KiB of memory for each record it tracks (i.e., the MAX_RECORDS value from above).
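A rough sketch of the sizing estimate above, assuming 1 to 2 KiB per tracked record as suggested. The function name and the example record count of 8192 are hypothetical, chosen only for illustration; this is plain arithmetic, not a Darshan API.

```python
import math

KIB_PER_MIB = 1024

def estimate_modmem_mib(max_records, kib_per_record=2):
    """Return a whole number of MiB covering max_records records.

    Assumes each record needs about kib_per_record KiB (the 1 or 2 KiB
    rule of thumb above); the real footprint depends on the module.
    """
    total_kib = max_records * kib_per_record
    return math.ceil(total_kib / KIB_PER_MIB)

# Hypothetical example: 8192 records requested for one module.
low = estimate_modmem_mib(8192, kib_per_record=1)
high = estimate_modmem_mib(8192, kib_per_record=2)
print(f"Estimated module memory: {low} to {high} MiB")
```

Under this estimate a few thousand records need only a handful of MiB, far below the GiB range.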
Hopefully that's enough to get you started, but please let us know if you have more problems, questions, or feedback.
Thanks!
--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Vineet Soni <vsoni at mercator-ocean.fr>
Sent: Tuesday, November 9, 2021 9:48 AM
To: Harms, Kevin <harms at alcf.anl.gov>; darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: Re: [Darshan-users] Darshan & IPM results not the same
Thanks Kevin for the explanation.
Unfortunately, all of the files are read from a single directory. So, it's not possible to exclude any.
Do you have a rough estimate of when this new version will be available? Or is there a GitLab branch that I can test?
Thanks,
Vineet
________________________________
From: Harms, Kevin <harms at alcf.anl.gov>
Sent: Tuesday, November 9, 2021 3:58 PM
To: Vineet Soni; darshan-users at lists.mcs.anl.gov
Subject: Re: Darshan & IPM results not the same
Vineet,
OK, so the problem seems to be that you are exceeding the maximum number of files per process (1024). After Darshan hits this limit, it will not record any other files. Raising the memory limit will not change the file limit. If the files you don't care about are in a different directory than the files you do care about, you can use this variable:
DARSHAN_EXCLUDE_DIRS
A list of comma-separated paths that Darshan will not instrument at runtime (in addition to Darshan’s default blacklist)
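A minimal sketch of setting this variable from a launcher script. DARSHAN_EXCLUDE_DIRS and its comma-separated format come from the description above; the directory paths and the helper name are hypothetical.

```python
import os

def with_excluded_dirs(paths, base_env=None):
    """Return a copy of the environment with DARSHAN_EXCLUDE_DIRS set.

    paths: directories Darshan should not instrument (hypothetical
    examples below); they are joined with commas as described above.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DARSHAN_EXCLUDE_DIRS"] = ",".join(paths)
    return env

# Hypothetical directories holding files that are not of interest.
env = with_excluded_dirs(["/scratch/tmp", "/opt/app/config"], base_env={})
print(env["DARSHAN_EXCLUDE_DIRS"])
```

The resulting dictionary could be passed as env= to subprocess.run when launching the job.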
We are working on an updated version that will allow users to specify a higher file limit as well as more complex patterns for excluding files.
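To picture the effect of the file limit described above, here is a toy model, not Darshan's actual code: a tracker with a fixed number of record slots that ignores I/O on any file first seen after the slots are full. The class name, file names, and workload are hypothetical; only the 1024 limit comes from this message.

```python
class ToyRecordTracker:
    """Toy model of a per-process tracker with a fixed record limit."""

    def __init__(self, max_records=1024):
        self.max_records = max_records
        self.read_counts = {}   # file name -> reads recorded
        self.dropped_reads = 0  # reads on files that got no record

    def record_read(self, name):
        if name in self.read_counts:
            self.read_counts[name] += 1
        elif len(self.read_counts) < self.max_records:
            self.read_counts[name] = 1
        else:
            # Limit reached: this file is never tracked.
            self.dropped_reads += 1

tracker = ToyRecordTracker(max_records=1024)
# Hypothetical workload: 3000 files, 10 reads each.
for i in range(3000):
    for _ in range(10):
        tracker.record_read(f"file_{i}.dat")

recorded = sum(tracker.read_counts.values())
print(f"recorded reads: {recorded}, dropped reads: {tracker.dropped_reads}")
```

In this toy run about two thirds of the reads never reach the log, which is the same kind of undercount as in the comparison with IPM.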
kevin
________________________________________
From: Vineet Soni <vsoni at mercator-ocean.fr>
Sent: Tuesday, November 9, 2021 3:47 AM
To: Harms, Kevin; darshan-users at lists.mcs.anl.gov
Subject: RE: Darshan & IPM results not the same
Hi Kevin,
The code does not use threading. And yes, there are many files I don't see in the darshan log, and they are relatively large compared to the ones intercepted.
The application does have fread() calls, but the STDIO module does not show a significant value in total_STDIO_F_READ_TIME.
I realized that there are warnings in the POSIX and STDIO modules about incomplete data in the log. However, I see no change in the log even after setting DARSHAN_MODMEM to 1024 MiB.
Also, even though the application occupies only 110 GB of memory out of 256 GB per node, setting DARSHAN_MODMEM to higher values such as 4096 MiB crashes the job (which makes me think that this value is per process, with 128 processes per node?).
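A quick check of the per-process hypothesis above, assuming 128 processes per node and treating the 110 GB and 256 GB figures as GiB for simplicity. It only does arithmetic on the worst case in which every process uses its full quota; it does not query Darshan, and the function name is hypothetical.

```python
def node_demand_gib(modmem_mib, procs_per_node, app_gib):
    """Worst-case node memory if every process uses its full quota."""
    return app_gib + modmem_mib * procs_per_node / 1024

NODE_GIB = 256   # node memory from the message above
APP_GIB = 110    # application footprint from the message above
PROCS = 128      # processes per node (the guess in the message above)

for modmem_mib in (1024, 4096):
    demand = node_demand_gib(modmem_mib, PROCS, APP_GIB)
    verdict = "fits" if demand <= NODE_GIB else "exceeds node memory"
    print(f"DARSHAN_MODMEM={modmem_mib} MiB: {demand:.0f} GiB ({verdict})")
```

Under these assumptions 1024 MiB per process still fits on the node while 4096 MiB does not, which would be consistent with the crash, though it does not prove the cause.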
Is there any runtime environment variable to set for excluding a group of files instead of directories?
Thanks,
Vineet
-----Original Message-----
From: Harms, Kevin <harms at alcf.anl.gov>
Sent: Monday, November 8, 2021 8:36 PM
To: Vineet Soni <vsoni at mercator-ocean.fr>; darshan-users at lists.mcs.anl.gov
Subject: Re: Darshan & IPM results not the same
Vineet,
a few ideas:
- Is the I/O done using fread() or similar? These calls are accounted for under the STDIO module rather than the POSIX module. Can you check what the STDIO module shows?
- Is the application threaded? It's possibly an issue with threading, but given the disparity that seems less likely.
- Perhaps Darshan is not intercepting a subset of the calls your application is making. If you look at the file name list, does it seem obvious that Darshan is missing I/O from some set of files? (This could also be due to files being caught by the exclude list.)
kevin
________________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Vineet Soni <vsoni at mercator-ocean.fr>
Sent: Monday, November 8, 2021 4:05 AM
To: darshan-users at lists.mcs.anl.gov
Subject: [Darshan-users] Darshan & IPM results not the same
Hello,
I am trying to analyze the IO behavior of our codes with Darshan.
Darshan: 3.3.0
Compilers: Intel 2018
MPI: Intel MPI 2018
FS: Lustre (lustre-module disabled in Darshan configuration)
Darshan profiling: LD_PRELOAD
I observe a big difference in IO results from Darshan and IPM (v2.0.5) for one of our codes. I guess that both profilers are not profiling the same POSIX calls?
The POSIXIO calls profiled in IPM are:
fopen, fdopen, freopen, open, open64
fclose, close
fflush
fread, read
fwrite, write
fseek, lseek, lseek64
ftell
rewind
fgetpos, fsetpos, fgetc, getc, ungetc
creat
truncate, ftruncate, truncate64, ftruncate64
While the ones profiled by Darshan are: https://github.com/darshan-hpc/darshan/blob/main/darshan-runtime/lib/darshan-posix.c ?
However, the huge difference is observed in the “read” call, which is profiled by both tools.
+-------------------+------------+-----------+
| Metric            | IPM        | Darshan   |
+-------------------+------------+-----------+
| Read (s)          | 324.57     | 6.02      |
+-------------------+------------+-----------+
| Agg. Read (count) | 34 766 456 | 2 946 271 |
+-------------------+------------+-----------+
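For scale, the gap in the table can be quantified with a little arithmetic on the reported values (nothing here is measured anew; the variable names are only for illustration):

```python
ipm_read_s, darshan_read_s = 324.57, 6.02
ipm_reads, darshan_reads = 34_766_456, 2_946_271

time_ratio = ipm_read_s / darshan_read_s
count_ratio = ipm_reads / darshan_reads
captured_pct = 100 * darshan_reads / ipm_reads

print(f"read time ratio (IPM/Darshan):  {time_ratio:.1f}x")
print(f"read count ratio (IPM/Darshan): {count_ratio:.1f}x")
print(f"reads seen by Darshan:          {captured_pct:.1f}% of IPM's")
```

Darshan reports roughly 54 times less read time and about 12 times fewer read calls than IPM.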
I also tested Darshan and IPM with other codes (which do not read the same files) to check whether they show this issue as well, but for those codes the two profilers gave the same results.
So I don't understand why the two profilers do not give the same results for this application.
Do you have any idea of why this could happen?
Thanks in advance.
PS: The application does a lot of IO, and is expected to spend a significant time in read operations.
Best regards,
Vineet