[Darshan-users] A bug report and a feature request

Phil Carns carns at mcs.anl.gov
Tue Sep 20 15:55:42 CDT 2011


On 09/15/2011 02:00 PM, Bill Barth wrote:
> I'm interested in a tool that would scan a bunch of logfiles and throw up
> some red flags suggesting users/codes that we should be looking at.
>
> Any thoughts on metrics for that?

You might want to start by looking at some of the summary information 
from running "darshan-parser --file --perf <logfile>".  In particular I 
would look at these fields:

total_bytes:
   - This is the total amount of data read and written.  It's useful for 
filtering out jobs that didn't move enough data to be relevant from an 
I/O red-flag point of view.

agg_perf_by_slowest:
   - This is our most accurate estimate of the aggregate I/O performance 
obtained by the job.  You can divide by the number of nodes or number of 
processes to get a value that is normalized to the size of the job.  
(The tool prints other performance estimates as well, but those can be 
ignored for the versions of Darshan that you are using.)

total: (in the "files" section of the output)
   - This is the total number of files opened by the job.  You might 
want to be on the lookout for apps that open an extraordinary number of 
files.

So you could set thresholds for what reasonable performance or file 
counts look like and flag jobs that way.
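As a rough sketch of that screening idea (the field names match the 
"darshan-parser --file --perf" output described above, but the function, 
its parsing of per-job values, and the threshold numbers are my own 
illustration, not Darshan recommendations):

```python
# Hypothetical red-flag screen over per-job summary fields extracted
# from "darshan-parser --file --perf <logfile>".  All thresholds are
# illustrative placeholders; tune them for your own system.

MIN_BYTES = 1024 ** 3        # skip jobs that moved under 1 GiB total
MIN_MIB_PER_PROC = 10.0      # flag bandwidth below 10 MiB/s per process
MAX_FILES = 10000            # flag jobs opening more than 10k files

def red_flags(total_bytes, agg_perf_by_slowest, nprocs, total_files):
    """Return a list of red-flag labels for one job's summary fields.

    agg_perf_by_slowest is in MiB/s; total_bytes is in bytes.
    """
    flags = []
    if total_bytes < MIN_BYTES:
        return flags  # too little I/O to be interesting
    if agg_perf_by_slowest / nprocs < MIN_MIB_PER_PROC:
        flags.append("low per-process bandwidth")
    if total_files > MAX_FILES:
        flags.append("extraordinary file count")
    return flags
```

For example, a 256-process job that moved 2 GiB at 500 MiB/s aggregate 
(about 2 MiB/s per process) while opening 50000 files would trip both 
flags.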

You can also roughly estimate the amount of wall time spent performing 
I/O by working backwards with:

(total_bytes/(1024*1024))/agg_perf_by_slowest

and then divide that by the run time (end_time-start_time) to see which 
jobs are spending an unusual fraction of their time doing I/O rather 
than computing.
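Putting that arithmetic together (assuming, as above, that 
agg_perf_by_slowest is reported in MiB/s and total_bytes in bytes; the 
function itself is just an illustration):

```python
def io_fraction(total_bytes, agg_perf_by_slowest, start_time, end_time):
    """Estimate the fraction of a job's wall time spent in I/O.

    total_bytes is in bytes, agg_perf_by_slowest in MiB/s, and
    start_time/end_time in seconds.
    """
    # Work backwards: data moved divided by achieved bandwidth
    # gives an approximate number of seconds spent in I/O.
    io_seconds = (total_bytes / (1024 * 1024)) / agg_perf_by_slowest
    return io_seconds / (end_time - start_time)

# A job that moved 1000 MiB at 100 MiB/s over a 100 s run spent
# roughly 10 s in I/O, i.e. about 10% of its wall time:
# io_fraction(1000 * 1024 * 1024, 100.0, 0, 100) -> 0.1
```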

-Phil



>
> Bill.
> --
> Bill Barth, Ph.D., Director, HPC
> bbarth at tacc.utexas.edu        |   Phone: (512) 232-7069
> Office: ROC 1.435             |   Fax:   (512) 475-9445
>
> On 9/15/11 12:56 PM, "Rob Latham"<robl at mcs.anl.gov>  wrote:
>
>> On Thu, Sep 15, 2011 at 01:34:33PM -0400, Phil Carns wrote:
>>> On 09/15/2011 01:17 PM, Bill Barth wrote:
>>>> The Bug Report:
>>>>
>>>> I think that the summary.pdf has swapped the summaries for read and
>>>> write in the "Data Transfer Per Filesystem" table. I'm happy to share a
>>>> logfile if you can't reproduce on your end.
>>> That would be great if you could send us a log file example.  I just
>>> tried a quick test on an old log file and it looked Ok in that
>>> particular case.  It might depend on the contents of the log.
>>>
>>>> The Feature Request:
>>>>
>>>> darshan-job-summary.pl should produce its output pdf file based on the
>>>> name of the input logfile rather than calling it "summary.pdf".
>>> We could probably add a command line option to produce that
>>> behavior.  In the mean time you can at least use the --output option
>>> to specify the file name.
>> I've started looking at how to make the tool output one summary file
>> per program data file (motivating case study: visualization tools read
>> one set of input data and write a separate output file -- two distinct
>> workloads).
>>
>> ==rob
>>
>> -- 
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>> _______________________________________________
>> Darshan-users mailing list
>> Darshan-users at lists.mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
