[Darshan-users] Darshan & IPM results not the same
Vineet Soni
vsoni at mercator-ocean.fr
Tue Nov 9 09:48:13 CST 2021
Thanks Kevin for the explanation.
Unfortunately, all of the files are read from a single directory. So, it's not possible to exclude any.
Do you have a rough estimate of when this new version will be available? Or is there a GitLab branch that I can test?
Thanks,
Vineet
________________________________
From: Harms, Kevin <harms at alcf.anl.gov>
Sent: Tuesday, November 9, 2021 3:58 PM
To: Vineet Soni; darshan-users at lists.mcs.anl.gov
Subject: Re: Darshan & IPM results not the same
Vineet,
ok, so the problem seems to be you are exceeding the maximum limit of files per process (1024). After darshan hits this limit, it will not record any other files. Raising the memory limit will not change the file limit. If the files you don't care about are in a different directory than the files you do care about, you can use this variable:
DARSHAN_EXCLUDE_DIRS
A list of comma-separated paths that Darshan will not instrument at runtime (in addition to Darshan’s default blacklist)
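As a minimal sketch of how this variable might be set for a run, the Python snippet below builds the environment for a job instrumented through LD_PRELOAD, as in the original report. The helper name darshan_env, the library path, and the excluded directories are hypothetical placeholders. Only DARSHAN_EXCLUDE_DIRS and its comma-separated format come from the description above.

```python
import os


def darshan_env(exclude_dirs, preload_lib, base_env=None):
    """Return a copy of the environment with Darshan exclusions set.

    exclude_dirs: iterable of directory paths Darshan should not instrument.
    preload_lib:  path to the Darshan shared library (site-specific).
    base_env:     environment to start from (defaults to os.environ).
    """
    env = dict(os.environ if base_env is None else base_env)
    # The variable holds one comma-separated string of paths.
    env["DARSHAN_EXCLUDE_DIRS"] = ",".join(exclude_dirs)
    # Runtime instrumentation via LD_PRELOAD, as used in the original report.
    env["LD_PRELOAD"] = preload_lib
    return env


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    env = darshan_env(
        ["/scratch/run/logs", "/scratch/run/restart"],
        "/opt/darshan/lib/libdarshan.so",
    )
    print(env["DARSHAN_EXCLUDE_DIRS"])
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the application. This only helps when the unwanted files live in directories separate from the files of interest.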
We are working on an updated version that will allow users to specify a higher file limit as well as more complex patterns for excluding files.
kevin
________________________________________
From: Vineet Soni <vsoni at mercator-ocean.fr>
Sent: Tuesday, November 9, 2021 3:47 AM
To: Harms, Kevin; darshan-users at lists.mcs.anl.gov
Subject: RE: Darshan & IPM results not the same
Hi Kevin,
The code does not use threading. And yes, there are many files I don't see in the darshan log, and they are relatively large compared to the ones intercepted.
And, the application does have fread() calls. But, the STDIO module does not have a significant value in total_STDIO_F_READ_TIME.
I realized that there are warnings in POSIX and STDIO modules about the incomplete data in the log. However, I see no change in the log even after setting DARSHAN_MODMEM to 1024 MiB.
Also, even though the application occupies only 110 GB memory out of 256 GB per node, setting DARSHAN_MODMEM to higher values such as 4096 MiB crashes the job (which makes me think that this value is per process - 128 per node?).
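The per-process guess can be checked with a rough calculation. The sketch below assumes that the DARSHAN_MODMEM quota applies to each process, that 128 processes run per node, that every process actually commits its full quota, and that the 110 GB and 256 GB figures are decimal gigabytes. None of these assumptions is confirmed in this thread, and the function name is hypothetical.

```python
MIB = 1024 ** 2  # bytes per MiB
GB = 1000 ** 3   # bytes per (decimal) GB


def node_footprint_gb(modmem_mib, procs_per_node, app_gb):
    """Worst-case node memory in GB if every process fills its Darshan quota."""
    darshan_gb = modmem_mib * MIB * procs_per_node / GB
    return app_gb + darshan_gb


if __name__ == "__main__":
    node_gb = 256
    for modmem in (1024, 4096):
        total = node_footprint_gb(modmem, procs_per_node=128, app_gb=110)
        verdict = "fits" if total <= node_gb else "exceeds the node"
        print(f"DARSHAN_MODMEM={modmem} MiB: up to {total:.0f} GB ({verdict})")
```

Under these assumptions, 1024 MiB per process stays just below the node's 256 GB (about 247 GB in total), while 4096 MiB per process would need about 660 GB, which would be consistent with the observed crash.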
Is there any runtime environment variable to set for excluding a group of files instead of directories?
Thanks,
Vineet
-----Original Message-----
From: Harms, Kevin <harms at alcf.anl.gov>
Sent: Monday, November 8, 2021 8:36 PM
To: Vineet Soni <vsoni at mercator-ocean.fr>; darshan-users at lists.mcs.anl.gov
Subject: Re: Darshan & IPM results not the same
Vineet,
a few ideas:
- is the I/O done using fread() or similar? These are accounted under the STDIO module rather than the POSIX module. Can you check to see what STDIO module shows?
- is the application threaded? It's possible an issue with threading, but given the disparity that seems less likely.
- Perhaps an issue with darshan not intercepting a subset of the calls your application is making. If you look at the file name list, does it seem obvious that darshan is missing I/O from some set of files? (This could also be due to files being caught under the exclude list)
kevin
________________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Vineet Soni <vsoni at mercator-ocean.fr>
Sent: Monday, November 8, 2021 4:05 AM
To: darshan-users at lists.mcs.anl.gov
Subject: [Darshan-users] Darshan & IPM results not the same
Hello,
I am trying to analyze the IO behavior of our codes with Darshan.
Darshan: 3.3.0
Compilers: Intel 2018
MPI: Intel MPI 2018
FS: Lustre (lustre-module disabled in Darshan configuration)
Darshan profiling: LD_PRELOAD
I observe a big difference in IO results from Darshan and IPM (v2.0.5) for one of our codes. I guess that both profilers are not profiling the same POSIX calls?
The POSIXIO calls profiled in IPM are:
fopen, fdopen, freopen, open, open64
fclose, close
fflush
fread, read
fwrite, write
fseek, lseek, lseek64
ftell
rewind
fgetpos, fsetpos, fgetc, getc, ungetc
creat
truncate, ftruncate, truncate64, ftruncate64
While the ones profiled by Darshan are: https://github.com/darshan-hpc/darshan/blob/main/darshan-runtime/lib/darshan-posix.c ?
However, the huge difference is observed in the “read” call, which exists in both the profilers.
+-------------------+------------+-----------+
| | IPM | Darshan |
+-------------------+------------+-----------+
| Read (s) | 324.57 | 6.02 |
+-------------------+------------+-----------+
| Agg. Read (count) | 34 766 456 | 2 946 271 |
+-------------------+------------+-----------+
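To put the gap in the table in proportion, the short Python sketch below computes the IPM-to-Darshan ratios from the four reported values. The dictionary keys and function name are illustrative only. The numbers are copied from the table.

```python
# Values copied from the table above.
ipm = {"read_s": 324.57, "read_count": 34_766_456}
darshan = {"read_s": 6.02, "read_count": 2_946_271}


def ipm_to_darshan_ratio(key):
    """How many times larger the IPM value is than the Darshan value."""
    return ipm[key] / darshan[key]


if __name__ == "__main__":
    print(f"read time ratio:  {ipm_to_darshan_ratio('read_s'):.1f}x")
    print(f"read count ratio: {ipm_to_darshan_ratio('read_count'):.1f}x")
```

IPM reports roughly 54 times more read time but only about 12 times more read calls, so the reads that are missing from the Darshan log are, on average, slower per call than the reads Darshan recorded.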
I also tested Darshan and IPM with other codes (which do not read the same files) to check whether they show this issue as well. For those codes, however, the two profilers gave the same results.
So, I don't understand why the two profilers disagree for this application.
Do you have any idea of why this could happen?
Thanks in advance.
PS: The application does a lot of IO, and is expected to spend a significant time in read operations.
Best regards,
Vineet