[Darshan-users] Module contains incomplete data

Jiří Nádvorník nadvornik.ji at gmail.com
Wed Apr 27 12:34:07 CDT 2022


Hi,

To reproduce the installation issue:
mkdir darshan_root
cd darshan_root
git clone https://github.com/darshan-hpc/darshan.git .

Then cd into darshan-util/ and run:
autoconf
./configure
make install

Then:

   1. If I run darshan-parser from within the build folder, it runs fine.
   2. If I run the copy that make install installs (taken from .libs/), it
   fails to start because the shared library cannot be found -- see the
   previous email and the check below.
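
A quick way to pin this down (a sketch; the /usr/local paths assume the
default install prefix) is to ask the loader what the installed binary
actually resolves:

ldd /usr/local/bin/darshan-parser | grep darshan
# if libdarshan-util.so.0 shows as "not found", export the install lib dir:
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
# ...or, if /usr/local/lib is listed in /etc/ld.so.conf(.d), refresh the
# loader cache as root:
ldconfig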

Cheers,

Jiri



On Wed, Apr 27, 2022 at 18:23 Snyder, Shane <ssnyder at mcs.anl.gov>
wrote:

> Great, I'm glad that you were able to get the instrumentation mostly
> working!
>
> I think it's sensible to ignore Python source/compiled code for most cases
> -- I doubt there's any insight to gain and you'll just end up trying to
> filter them out when analyzing logs anyways.
>
> I'm not sure what's going on with the installation issues you mention. If
> you think something might be wrong with Darshan's build, then would you
> mind sharing how you ran configure, etc.? I could see if I'm able to
> reproduce anything.
>
> If you wouldn't mind starting a new thread for the HDF5 issue, I think that
> would be helpful -- including h5py/HDF5 in the title will help other users
> who later search the list archive for related issues.
>
> --Shane
> ------------------------------
> *From:* Jiří Nádvorník <nadvornik.ji at gmail.com>
> *Sent:* Wednesday, April 27, 2022 11:06 AM
> *To:* Snyder, Shane <ssnyder at mcs.anl.gov>
> *Cc:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
> *Subject:* Re: [Darshan-users] Module contains incomplete data
>
> Hi,
>
> yes, that NAMEMEM setting got it done. I also excluded .py and .pyc files --
> the reads on those are only from loading them, right? There's no data access
> itself (and no, I'm not reading and manually interpreting my own Python files
> :), so I'm not interested in those). Actually, I'm reading thousands of small
> files that I'm ingesting into HDF5, and I'm interested in how many reads,
> etc. are happening.
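>
> With the NAME_EXCLUDE syntax shown further down this thread, that exclusion
> might look like the following sketch (my exact regexes may have differed):
>
> NAME_EXCLUDE    .py$     *
> NAME_EXCLUDE    .pyc$    *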
>
> I'm trying to make some sense of what I see, but for now I'll just say it's
> very valuable data for me. It's a pity I can't get the HDF5 module working;
> the extra granularity it would give me would be very helpful.
>
> Regarding darshan-util you were right, I hadn't reinstalled it. I then
> actually ran into an install problem -- for some reason, the git build
> installs the .libs/darshan-parser binary to /usr/local/..., and that one
> throws:
> darshan-parser: error while loading shared libraries:
> libdarshan-util.so.0: cannot open shared object file: No such file or
> directory
>
> But if I run darshan-parser from within darshan_root_folder/darshan-util/,
> the error is gone and --show-incomplete | grep incomplete prints nothing.
>
> Could we now focus on the HDF5 issue or should I create a new thread for
> clarity?
>
> Cheers,
>
> Jiri
>
>
>
>
>
> On Wed, Apr 27, 2022 at 17:21 Snyder, Shane <ssnyder at mcs.anl.gov>
> wrote:
>
> Thanks for working through the build issues and giving this a shot.
>
> A couple of things stand out to me (ignoring your HDF5 issue for now):
>
>    - It looks like at least the MPI-IO module is no longer reporting
>    partial data? Small progress...
>    - There is a new warning about there being no log utility handlers for
>    a "null" module. Are you perhaps parsing a log generated by your prior
>    Darshan install? Maybe you have not completely re-installed a new
>    darshan-util? We should figure out what's going on there, too, to be safe.
>
> I'd also suggest two things for your config file:
>
>    - Dial back your MODMEM and MAX_RECORDS values. Your MODMEM value asks
>    Darshan to allocate a GiB of memory (it is expressed in MiB units and you
>    set it to 1024), which Darshan will happily try to do, though I'm not sure
>    that's a good idea. I'd probably start with a MODMEM value of 8 and a
>    MAX_RECORDS of 2000, and just double those if needed -- anything beyond
>    that would be surprising unless you know your workload really opens
>    hundreds of thousands of files. You might also have a look at the files
>    Darshan is currently instrumenting and see if you really want them all --
>    I've noticed when instrumenting Python frameworks that you can get tons
>    of records for things like shared libraries, source files, etc. that can
>    just be ignored using the NAME_EXCLUDE mechanism.
>    - Add "NAMEMEM  2" to your config file to force Darshan to allocate
>    more memory (2 MiB) for storing the filenames associated with each record.
>    This might actually be the main reason your log is reporting partial data
>    rather than actually running out of module data, which is another reason
>    not to get too aggressive with the MODMEM/MAX_RECORDS parameters. I should
>    have mentioned this setting originally as there have been other users who
>    have reported exceeding it recently.
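>
> Putting those suggestions together, a starting config file might look like
> this sketch (values are the conservative starting points above; tune as
> needed):
>
> # darshan.conf -- conservative starting point
> # module record memory and filename memory, both in MiB
> MODMEM  8
> NAMEMEM  2
> # per-process record limit for the listed modules
> MAX_RECORDS  2000  POSIX,MPI-IO,STDIO
> # skip shared-library records in all modules
> NAME_EXCLUDE  .so$  *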
>
> Hopefully that gets you further along and we can move onto the HDF5 issue
> you mention.
>
> Thanks,
> --Shane
> ------------------------------
> *From:* Jiří Nádvorník <nadvornik.ji at gmail.com>
> *Sent:* Wednesday, April 27, 2022 6:37 AM
> *To:* Snyder, Shane <ssnyder at mcs.anl.gov>
> *Cc:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
> *Subject:* Re: [Darshan-users] Module contains incomplete data
>
> Aha! I just realized there is an obvious "prepare.sh" script that I didn't
> run. I only found it by trial and error, though -- it could be better
> documented :).
>
> Now I've gotten further. With this config file:
> MAX_RECORDS     102400     POSIX,MPI-IO,STDIO
> MODMEM  1024
> APP_EXCLUDE     git,ls
>
> Running:
> darshan-parser --show-incomplete caucau_python_id127447-127447_4-27-48556-1842455298968263838_1.darshan | grep incomplete
>
> the output is:
> # *WARNING*: The POSIX module contains incomplete data!
> # *WARNING*: The STDIO module contains incomplete data!
> Warning: no log utility handlers defined for module (null), SKIPPING.
>
> I don't think my poor tiny Python script touches more than 100,000 files,
> right?
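>
> One way to sanity-check how many records actually landed in the log is
> darshan-parser's --file-list mode, which prints roughly one line per
> instrumented file plus a short header (a sketch, reusing the log name above):
>
> darshan-parser --file-list caucau_python_id127447-127447_4-27-48556-1842455298968263838_1.darshan | wc -l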
>
> By the way, I've encountered another problem; I'm not sure whether to put it
> in another thread. If I compile with HDF5 (the results above are without it):
> ./configure --with-log-path=/gpfs/raid/darshan-logs
> --with-jobid-env=PBS_JOBID CC=mpicc --enable-hdf5-mod
> --with-hdf5=/gpfs/raid/SDSSCube/ext_lib//hdf5-1.12.0/hdf5/
>
> It messes up my runtime and causes Python to crash:
> mpirun -x DARSHAN_CONFIG_PATH=/gpfs/raid/SDSSCube/darshan.conf -x
> LD_PRELOAD=/gpfs/raid/shared_libs/darshan/darshan-runtime/lib/.libs/libdarshan.so:/gpfs/raid/SDSSCube/ext_lib/hdf5-1.12.0/hdf5/lib/libhdf5.so
> -np 65 --hostfile hosts --map-by node
> /gpfs/raid/SDSSCube/venv_par/bin/python hisscube.py --truncate
> ../sdss_data/ results/SDSS_cube_c_par.h5
>
> Resulting in:
> INFO:rank[0]:Rank 0 pid: 137058
> Darshan HDF5 module error: runtime library version (1.12) incompatible
> with Darshan module (1.10-).
> Traceback (most recent call last):
>   File "hisscube.py", line 74, in <module>
>     writer.ingest(fits_image_path, fits_spectra_path,
> truncate_file=args.truncate)
>   File "/gpfs/raid/SDSSCube/hisscube/ParallelWriterMWMR.py", line 45, in
> ingest
>     self.process_metadata(image_path, image_pattern, spectra_path,
> spectra_pattern, truncate_file)
>   File "/gpfs/raid/SDSSCube/hisscube/CWriter.py", line 150, in
> process_metadata
>     h5_file = self.open_h5_file_serial(truncate_file)
>   File "/gpfs/raid/SDSSCube/hisscube/CWriter.py", line 170, in
> open_h5_file_serial
>     return h5py.File(self.h5_path, 'w', fs_strategy="page",
> fs_page_size=4096, libver="latest")
>   File
> "/gpfs/raid/SDSSCube/venv_par/lib/python3.8/site-packages/h5py-3.6.0-py3.8-linux-x86_64.egg/h5py/_hl/files.py",
> line 533, in __init__
>     fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
>   File
> "/gpfs/raid/SDSSCube/venv_par/lib/python3.8/site-packages/h5py-3.6.0-py3.8-linux-x86_64.egg/h5py/_hl/files.py",
> line 232, in make_fid
>     fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
>   File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
>   File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
>   File "h5py/h5f.pyx", line 126, in h5py.h5f.create
>   File "h5py/defs.pyx", line 693, in h5py.defs.H5Fcreate
> RuntimeError: Unspecified error in H5Fcreate (return value <0)
>
> You said that Darshan should be compatible with HDF5 newer than 1.8, which
> 1.12 is, right?
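>
> One way to check which HDF5 the Darshan build actually picked up versus the
> one being preloaded (a diagnostic sketch; paths are taken from the commands
> above):
>
> # version of the HDF5 install passed to --with-hdf5
> grep 'H5_VERS_' /gpfs/raid/SDSSCube/ext_lib/hdf5-1.12.0/hdf5/include/H5public.h
> # which libhdf5 the Darshan runtime library resolves to, if it links one
> ldd /gpfs/raid/shared_libs/darshan/darshan-runtime/lib/.libs/libdarshan.so | grep hdf5
>
> If the two disagree (e.g. a system HDF5 1.8 leaking in at configure time),
> rebuilding Darshan against the same 1.12 install that h5py uses should clear
> the version error.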
>
> Thanks for the help!
>
> Cheers,
>
> Jiri
>
>
>
>
>
>
> On Wed, Apr 27, 2022 at 8:43 Jiří Nádvorník <nadvornik.ji at gmail.com>
> wrote:
>
> Hi,
>
> I think I will chew through the documentation just fine, but two things are
> not clear:
>
>    1. Does the darshan library provide its own config file that I need to
>    change, or do I always need to create my own?
>    2. How can I build the git version? I didn't find any instructions, and
>    the usual autoconf just throws:
>       root@kub-b1:/gpfs/raid/shared_libs/darshan/darshan-runtime# autoconf
>       configure.ac:19: error: possibly undefined macro: AC_CONFIG_MACRO_DIRS
>         If this token and others are legitimate, please use m4_pattern_allow.
>         See the Autoconf documentation.
>       configure.ac:21: error: possibly undefined macro: AM_INIT_AUTOMAKE
>       configure.ac:22: error: possibly undefined macro: AM_SILENT_RULES
>       configure.ac:23: error: possibly undefined macro: AM_MAINTAINER_MODE
>       configure.ac:713: error: possibly undefined macro: AM_CONDITIONAL
>       root@kub-b1:/gpfs/raid/shared_libs/darshan/darshan-runtime# ./configure
>       configure: error: cannot find install-sh, install.sh, or shtool in
>       ../maint/scripts "."/../maint/scripts
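>
> As it turns out further up in this thread, the missing step is the repo's
> prepare.sh script; the "possibly undefined macro: AM_*" errors are the
> classic symptom of running bare autoconf without the rest of the autotools
> chain. A sketch of a full build from a fresh checkout (the configure flags
> are the ones used elsewhere in this thread; prepare.sh sits at the top of
> the checkout in recent trees):
>
> cd darshan
> ./prepare.sh
> cd darshan-runtime
> ./configure --with-log-path=/gpfs/raid/darshan-logs --with-jobid-env=PBS_JOBID CC=mpicc
> make && make install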
>
> Thanks for the help.
>
> Cheers,
>
> Jiri
>
> On Tue, Apr 26, 2022 at 17:43 Snyder, Shane <ssnyder at mcs.anl.gov>
> wrote:
>
> Hi Jiri,
>
> For some background, Darshan enforces some internal memory limits to avoid
> ballooning memory usage at runtime. Specifically, all of our instrumentation
> modules pre-allocate file records for up to 1,024 files opened by the app --
> if your app opens more than 1,024 files per-process, Darshan stops
> instrumenting additional files, and you see those warning messages when
> parsing the log file.
>
> We have users hit this issue pretty frequently now, and we actually just
> wrapped up development of some new mechanisms to help out with this. They
> were just merged into our main branch, and we will be formally releasing a
> pre-release version of this code in the next week or so. For the time
> being, you should be able to use the 'main' branch of our repo (
> https://github.com/darshan-hpc/darshan) to leverage this new
> functionality.
>
> There are two new mechanisms that can help out, both of which require you to
> provide a configuration file to Darshan at runtime:
>
>    - The MAX_RECORDS setting can be used to bump up the number of
>    pre-allocated records for different modules. In your case, you might try to
>    bump up the default number of records for the POSIX, MPI-IO, and STDIO
>    modules by setting something like this in your config file (this would
>    allow you to instrument up to 4000 files per-process for each of these
>    modules):
>       - MAX_RECORDS    4000    POSIX,MPI-IO,STDIO
>    - An alternative (or complementary) approach to bumping up the record
>    limit is to limit instrumentation to particular files. You can use the
>    NAME_EXCLUDE setting to avoid instrumenting specific directory paths, file
>    extensions, etc. by specifying regular expressions. E.g., the following
>    settings would avoid instrumenting files with a .so suffix, or files
>    located in a directory we don't care about, for all modules (* denotes all
>    modules):
>       - NAME_EXCLUDE    .so$    *
>       - NAME_EXCLUDE    ^/path/to/avoid    *
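>
> To make Darshan pick a config file up at runtime, point the
> DARSHAN_CONFIG_PATH environment variable at it (the same variable passed via
> mpirun -x further up in the thread), e.g.:
>
> export DARSHAN_CONFIG_PATH=/path/to/darshan.conf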
>
> I'm attaching the updated runtime documentation for Darshan for your
> reference. Section 8 provides a ton of detail on how to provide a config
> file to Darshan, which should fill in any gaps in my description above.
>
> Please let us know if you have any further questions or issues, though!
>
> Thanks,
> --Shane
> ------------------------------
> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf
> of Jiří Nádvorník <nadvornik.ji at gmail.com>
> *Sent:* Sunday, April 24, 2022 3:00 PM
> *To:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
> *Subject:* [Darshan-users] Module contains incomplete data
>
> Hi All,
>
> I just tried out Darshan, and the potential output seems perfect for my HDF5
> MPI application! However, I'm not able to get there :(.
>
> I have a log that has a big stamp "This darshan log contains incomplete
> data".
>
> When I run:
> darshan-parser --show-incomplete mylog.darshan | grep incomplete
> the output is:
> # *WARNING*: The POSIX module contains incomplete data!
> # *WARNING*: The MPI-IO module contains incomplete data!
> # *WARNING*: The STDIO module contains incomplete data!
>
> Would you be able to point me to some setting that would improve the
> measurements? Can I actually rely on the profiling results if it says the
> data is incomplete in some of the categories?
>
> Thank you very much for your help!
>
> Cheers,
>
> Jiri
>
>