[Darshan-users] Hang on post-process

Jeffrey Layton laytonjb at gmail.com
Wed Jul 14 13:32:01 CDT 2021


I added the --verbose flag (sorry - I should have thought of that earlier).
The job-summary still hangs. However, when I look at the directory, I see
several PDFs of the individual plots, along with a bunch of .dat, .eps,
.tex, and a few other files:

$ ls -s
total 340
 4 access-hist-eps.gplt        4 op-counts-eps.gplt
 4 access-table.tex            8 op-counts.pdf
 4 file-access-eps.gplt        4 pattern.dat
20 file-access-read.dat       24 pattern.eps
24 file-access-read.eps        4 pattern-eps.gplt
 8 file-access-read.pdf        8 pattern.pdf
 4 file-access-read-sh.dat     4 posix-access-hist.dat
24 file-access-shared.eps     28 posix-access-hist.eps
 8 file-access-shared.pdf      8 posix-access-hist.pdf
 4 file-access-table.tex       4 posix-op-counts.dat
 4 file-access-write.dat       4 stdio-op-counts.dat
24 file-access-write.eps       0 summary.log
 8 file-access-write.pdf       4 summary.tex
 4 file-access-write-sh.dat    4 time-summary.dat
 4 file-count-table.tex       24 time-summary.eps
 4 fs-data-table.tex           4 time-summary-eps.gplt
 4 job-table.tex               8 time-summary.pdf
 4 latex.output                4 title.tex
28 op-counts.eps               4 variance-table.tex


The summary.log file is empty, but the summary.tex file looks correct
(there is an \end{document} at the end of the document). I'm wondering if
it gets stuck converting summary.tex to a PDF? Here are the pertinent
processes:


laytonjb   27458    5158  0 13:11 pts/2    00:00:00 perl
/home/laytonjb/bin/darshan-3.3.1/bin/darshan-job-summary.pl --verbose
/home/laytonjb/darshan-logs/2021/7/13/laytonjb_python3_id6041-6041_7-13-47475-2131301613401632697_1.darshan
--output python3.pdf
laytonjb   27492   27458  0 13:11 pts/2    00:00:00 sh -c pdflatex
"\def\inclstdio{1} \\def\inclperf{1} \\def\incompletelog{1}
\\def\titlecmd{python3} \     \def\titlemon{7} \     \def\titlemday{13} \
  \def\titleyear{2021} \     \def\titlecmdline{ python3
cifar10-4-checkpoint.py } \     \def\jobid{ 6041} \     \def\jobuid{ 1000}
\     \def\jobnprocs{ 1} \     \def\jobruntime{ 287} \
\def\filecri{0.046549} \     \def\filecrbi{9.35267639160156} \
\def\filecwi{0.046681} \     \def\filecwbi{0.772393226623535} \
\def\filecrs{0} \     \def\filecrbs{0} \     \def\filecws{0} \
\def\filecwbs{0} \     \def\filecmi{0.020773} \     \def\filecms{0} \
\def\filecmi{0.020773} \     \def\perflayer{POSIX} \
\def\perfest{88.94} \     \def\perfbytes{10.1} \
\def\stdioperfest{50.66} \     \def\stdioperfbytes{0.0} \
\input{summary.tex}" \     -halt-on-error > latex.output
laytonjb   27493   27492  0 13:11 pts/2    00:00:00 pdflatex
\def\inclstdio{1} \def\inclperf{1} \def\incompletelog{1}
\def\titlecmd{python3}     \def\titlemon{7}     \def\titlemday{13}
\def\titleyear{2021}     \def\titlecmdline{ python3 cifar10-4-checkpoint.py
}     \def\jobid{ 6041}     \def\jobuid{ 1000}     \def\jobnprocs{ 1}
\def\jobruntime{ 287}     \def\filecri{0.046549}
\def\filecrbi{9.35267639160156}     \def\filecwi{0.046681}
\def\filecwbi{0.772393226623535}     \def\filecrs{0}     \def\filecrbs{0}
  \def\filecws{0}     \def\filecwbs{0}     \def\filecmi{0.020773}
\def\filecms{0}     \def\filecmi{0.020773}     \def\perflayer{POSIX}
\def\perfest{88.94}     \def\perfbytes{10.1}     \def\stdioperfest{50.66}
  \def\stdioperfbytes{0.0}     \input{summary.tex} -halt-on-error


(Apologies for the length).
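
Since the process listing shows pdflatex's stdout being redirected to
latex.output, that file is probably the first place to look for where the
run stopped. A minimal sketch of such a check follows; the function name
and the directory argument are made up for illustration, and it only
assumes that TeX error lines start with "!":

```shell
#!/bin/sh
# Sketch: report whether a pdflatex log looks like it stopped at an error.
# "latex.output" is the file named in the process listing; the function
# name and the directory argument are illustrative only.
check_latex_output() {
    dir="${1:-.}"
    log="$dir/latex.output"
    # Nothing written yet (or no such file): pdflatex produced no output.
    if [ ! -s "$log" ]; then
        echo "empty-or-missing"
        return 0
    fi
    # TeX error messages start with "!" in the first column.
    if grep -q '^!' "$log"; then
        echo "error"
        grep -n '^!' "$log"
    else
        echo "no-error-lines"
    fi
}
```

Running `check_latex_output .` from inside the temporary directory kept by
--verbose prints one status word, followed by any error lines with their
line numbers. A log without "!" lines does not prove the run is healthy;
pdflatex may simply be waiting on something else.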

While it creates some .pdf files, I'm wondering if there is a problem
before pdflatex is called to process summary.tex? This is the version
output from pdflatex:

$ pdflatex --version
pdfTeX 3.14159265-2.6-1.40.20 (TeX Live 2019/Debian)
kpathsea version 6.3.1
Copyright 2019 Han The Thanh (pdfTeX) et al.
There is NO warranty.  Redistribution of this software is
covered by the terms of both the pdfTeX copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the pdfTeX source.
Primary author of pdfTeX: Han The Thanh (pdfTeX) et al.
Compiled with libpng 1.6.37; using libpng 1.6.37
Compiled with zlib 1.2.11; using zlib 1.2.11
Compiled with xpdf version 4.01


I'm guessing you and Rob are using CentOS? Ubuntu sometimes makes things
difficult.

Thanks!


Jeff


On Wed, Jul 14, 2021 at 10:17 AM Snyder, Shane <ssnyder at mcs.anl.gov> wrote:

> Hi Jeff,
>
> I similarly tried running job-summary on your log using our current main
> branch (which is essentially just Darshan 3.3.1), and it worked fine, so
> not exactly sure what the problem is, but doesn't appear to be a general
> bug. You might be able to find some hints about what's going wrong by
> running job-summary again with the '--verbose' flag -- this persists the
> temporary directory Darshan is using for creating the PDF files, including
> pdflatex logs, etc. You might be able to find some error messages in the
> 'summary.log' file that give some sort of indication of what's
> failing/hanging? Not the most straightforward debugging strategy but I
> don't really have much else to suggest...
>
> As a side note, we are in the middle of developing new Darshan analysis
> tools based on PyDarshan that will hopefully be available before too long.
> There's a lot more development momentum on our end towards these new
> PyDarshan-based analysis tools, with the older tools likely being
> deprecated once these are available. I just mention this for you and other
> users so you're aware help is on the way and that we aren't completely
> ignoring these issues.
>
> Thanks,
> --Shane
> ------------------------------
> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf
> of Jeffrey Layton <laytonjb at gmail.com>
> *Sent:* Tuesday, July 13, 2021 2:00 PM
> *To:* Latham, Robert J. <robl at mcs.anl.gov>
> *Cc:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
> *Subject:* Re: [Darshan-users] Hang on post-process
>
> Thanks Rob!! I appreciate the pdf (at least I won't look like a slacker
> and will actually have produced something).
>
> What steps do you want to take to debug the issue? I'm guessing it's a
> configuration issue or dependency issue on my side. BTW - I'm running
> Ubuntu 20.04 on an AMD system.  I built Darshan 3.3.1 using gcc 9.3.0
> (Ubuntu 20.04 version).
>
> Thanks!
>
> Jeff
>
>
> On Tue, Jul 13, 2021 at 2:46 PM Latham, Robert J. <robl at mcs.anl.gov>
> wrote:
>
> Howdy Jeff: thanks for sending the log file
>
> It looks like a legitimate log file to me.  `darshan-parser`, which
> simply dumps the counters and such to stdout, gives me reasonable-looking
> output.  Here's the header:
>
> # darshan log version: 3.21
> # compression method: ZLIB
> # exe: python3 cifar10-4-checkpoint.py
> # uid: 1000
> # jobid: 6041
> # start_time: 1626196275
> # start_time_asci: Tue Jul 13 12:11:15 2021
> # end_time: 1626196561
> # end_time_asci: Tue Jul 13 12:16:01 2021
> # nprocs: 1
> # run time: 287
> # metadata: lib_ver = 3.3.1
> # metadata: h = romio_no_indep_rw=true;cb_nodes=4
>
> # log file regions
> # -------------------------------------------------------
> # header: 360 bytes (uncompressed)
> # job data: 543 bytes (compressed)
> # record table: 18164 bytes (compressed)
> # POSIX module: 41682 bytes (compressed), ver=4
> # STDIO module: 230 bytes (compressed), ver=2
>
> And a darshan-job-summary.pl that I built back in August 2020 generates
> a pdf for me in a few seconds.  I've attached it for you but really we
> should figure out what's going on in your environment
>
>
> ==rob
>
> On Tue, 2021-07-13 at 13:41 -0400, Jeffrey Layton wrote:
> > Good afternoon,
> >
> > Apologies for posting yet another problem :)  I'm trying to use
> > Darshan on a Tensorflow/Keras script. It's a simple model operating
> > on the CIFAR-10 data set (fairly small). Darshan produces the output
> > files but when I try to post-process one using
> > darshan-job-summary.pl, it hangs and I end up having to kill the
> > process (I waited about an hour - just to be sure).
> >
> > I run the script using the following:
> >
> > export DARSHAN_EXCLUDE_DIRS=/proc,/etc,/dev,/sys
> > env LD_PRELOAD=/home/laytonjb/bin/darshan-3.3.1/lib/libdarshan.so python3 cifar10-4-checkpoint.py
> >
> > (I can provide the script if needed). It produces four files:
> >
> > $ ls -s
> > total 72
> >  4 laytonjb_ptxas_id6210-6210_7-13-47480-2131301613401632697_1.darshan
> >  4 laytonjb_ptxas_id6211-6211_7-13-47480-2131301613401632697_1.darshan
> > 60 laytonjb_python3_id6041-6041_7-13-47475-2131301613401632697_1.darshan
> >  4 laytonjb_uname_id6056-6056_7-13-47475-2131301613401632697_1.darshan
> >
> >
> > I chose to post-process the "python3" output but this is where it
> > hangs. I'm attaching the darshan output file if that is of any help.
> >
> > Thanks for any help.
> >
> > Jeff
> >
> >
> >
> >
> > _______________________________________________
> > Darshan-users mailing list
> > Darshan-users at lists.mcs.anl.gov
> > https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
>
>