[Darshan-users] Hang on post-process

Snyder, Shane ssnyder at mcs.anl.gov
Wed Jul 14 09:17:34 CDT 2021


Hi Jeff,

I also tried running job-summary on your log using our current main branch (which is essentially just Darshan 3.3.1), and it worked fine. I'm not sure exactly what the problem is, but it doesn't appear to be a general bug. You might be able to find some hints about what's going wrong by running job-summary again with the '--verbose' flag. This persists the temporary directory Darshan uses for creating the PDF files, including pdflatex logs, etc. You may find error messages in the 'summary.log' file that indicate what's failing or hanging. Not the most straightforward debugging strategy, but I don't really have much else to suggest...
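
In case it helps with digging through that log, here is a minimal Python sketch that pulls out lines that look like errors. The helper name `find_suspicious_lines` and the keyword list are my own assumptions, not part of Darshan. The only firm convention used is that LaTeX reports errors on lines starting with "!". The path to 'summary.log' depends on where the '--verbose' run leaves its temporary directory, so pass it in explicitly.

```python
import re
import sys
from pathlib import Path

# Lines starting with "!" are how LaTeX reports errors. The remaining
# keywords are generic guesses (assumptions), not an official list.
_SUSPICIOUS = re.compile(r"^!|error|fatal|emergency stop|not found", re.IGNORECASE)


def find_suspicious_lines(log_path):
    """Return (line_number, text) pairs for lines that look like errors."""
    lines = Path(log_path).read_text(errors="replace").splitlines()
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if _SUSPICIOUS.search(text)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 scan_log.py /path/to/preserved/tmpdir/summary.log
    for number, text in find_suspicious_lines(sys.argv[1]):
        print(f"{number}: {text}")
```

If nothing turns up, the last few lines of the log are still worth a look, since a hang usually shows up as output that simply stops partway through a step.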

As a side note, we are in the middle of developing new Darshan analysis tools based on PyDarshan that will hopefully be available before too long. There's a lot more development momentum on our end towards these new PyDarshan-based analysis tools, with the older tools likely being deprecated once these are available. I just mention this for you and other users so you're aware help is on the way and that we aren't completely ignoring these issues.

Thanks,
--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Jeffrey Layton <laytonjb at gmail.com>
Sent: Tuesday, July 13, 2021 2:00 PM
To: Latham, Robert J. <robl at mcs.anl.gov>
Cc: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: Re: [Darshan-users] Hang on post-process

Thanks Rob!! I appreciate the pdf (at least I won't look like a slacker, since I actually produced something).

What steps do you want to take to debug the issue? I'm guessing it's a configuration issue or dependency issue on my side. BTW - I'm running Ubuntu 20.04 on an AMD system.  I built Darshan 3.3.1 using gcc 9.3.0 (Ubuntu 20.04 version).

Thanks!

Jeff


On Tue, Jul 13, 2021 at 2:46 PM Latham, Robert J. <robl at mcs.anl.gov> wrote:
Howdy Jeff: thanks for sending the log file

It looks like a legitimate log file to me.  `darshan-parser`, which
simply dumps the counters and such to stdout, gives me reasonable
looking output.  Here's the header:

# darshan log version: 3.21
# compression method: ZLIB
# exe: python3 cifar10-4-checkpoint.py
# uid: 1000
# jobid: 6041
# start_time: 1626196275
# start_time_asci: Tue Jul 13 12:11:15 2021
# end_time: 1626196561
# end_time_asci: Tue Jul 13 12:16:01 2021
# nprocs: 1
# run time: 287
# metadata: lib_ver = 3.3.1
# metadata: h = romio_no_indep_rw=true;cb_nodes=4

# log file regions
# -------------------------------------------------------
# header: 360 bytes (uncompressed)
# job data: 543 bytes (compressed)
# record table: 18164 bytes (compressed)
# POSIX module: 41682 bytes (compressed), ver=4
# STDIO module: 230 bytes (compressed), ver=2
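
A header like the one above can be sanity-checked mechanically. The following is a minimal Python sketch, assuming the header text has been saved from the parser's stdout. The function name `parse_header` is hypothetical, and it relies only on the `# key: value` layout visible in the dump.

```python
def parse_header(text):
    """Collect '# key: value' lines into a dict of lists.

    Keys such as 'metadata' can repeat, so every value is kept in a list.
    Comment lines without a colon (titles, separator rules) are skipped.
    """
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, _, value = line[1:].partition(":")
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


sample = """\
# darshan log version: 3.21
# start_time: 1626196275
# end_time: 1626196561
# nprocs: 1
# metadata: lib_ver = 3.3.1
# metadata: h = romio_no_indep_rw=true;cb_nodes=4
"""

header = parse_header(sample)
elapsed = int(header["end_time"][0]) - int(header["start_time"][0])
print(header["metadata"], elapsed)
```

A plausible elapsed time and a matching library version are cheap signs that the log itself is intact, which points the search back at the post-processing environment.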

And a darshan-job-summary.pl that I built back in August 2020 generates
a pdf for me in a few seconds.  I've attached it for you but really we
should figure out what's going on in your environment.


==rob

On Tue, 2021-07-13 at 13:41 -0400, Jeffrey Layton wrote:
> Good afternoon,
>
> Apologies for posting yet another problem :)  I'm trying to use
> Darshan on a Tensorflow/Keras script. It's a simple model operating
> on the CIFAR-10 data set (fairly small). Darshan produces the output
> files but when I try to post-process one using
> darshan-job-summary.pl, it hangs and I end up having to kill the
> process (I waited about an hour - just to be sure).
>
> I run the script using the following:
>
> export DARSHAN_EXCLUDE_DIRS=/proc,/etc,/dev,/sys
> env LD_PRELOAD=/home/laytonjb/bin/darshan-3.3.1/lib/libdarshan.so python3 cifar10-4-checkpoint.py
>
> (I can provide the script if needed). It produces four files:
>
> $ ls -s
> total 72
>  4 laytonjb_ptxas_id6210-6210_7-13-47480-2131301613401632697_1.darshan
>  4 laytonjb_ptxas_id6211-6211_7-13-47480-2131301613401632697_1.darshan
> 60 laytonjb_python3_id6041-6041_7-13-47475-2131301613401632697_1.darshan
>  4 laytonjb_uname_id6056-6056_7-13-47475-2131301613401632697_1.darshan
>
>
> I chose to post-process the "python3" output but this is where it
> hangs. I'm attaching the darshan output file if that is of any help.
>
> Thanks for any help.
>
> Jeff