[Darshan-users] Extracting wallclock time from darshan-parser?

Carns, Philip H. carns at mcs.anl.gov
Fri Feb 22 12:08:02 CST 2019


Hi Sam,

All of the times reported by Darshan are actually wall times already.  There is a subtlety with how the READ/WRITE/META_TIME fields are reported on shared files that is probably causing some confusion, though.

In this line, for example:

MPI-IO  -1      4647831539574040701     MPIIO_F_WRITE_TIME      471.718718      /lustre/dirac/home/dc-cox1/benchio/shared-file/source/benchio_files/mpiio.dat /lustre/dirac   lustre

The 2nd column is the rank number.  If the rank is -1, that means that the file was opened by all ranks, and Darshan has collapsed all of the counters down to a single record representing all ranks.

The MPIIO_F_WRITE_TIME field represents the (wall) time spent by a rank making MPIIO write calls.  So, for example, if each rank spent 10 seconds doing this, but you had 8 ranks, then the total reported in the above line would be 80, not 10, and therefore might be quite a bit higher than the elapsed run time of the job.

There are a few options to get around this.  You can of course divide the number of ranks back out to get an average per rank for those counters.  You can also look at the counters with "SLOWEST" in the name, which retain the timing from the slowest rank even for shared files.  You can also set an environment variable (DARSHAN_DISABLE_SHARED_REDUCTION) to tell Darshan *not* to collapse shared files.  This might result in a bigger log file (which is why it's not the default setting), but it will give you a distinct, accurate tally for each rank.
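The "divide the ranks back out" option can be done with a few lines of post-processing on the --base --perf --file-list output.  This is just a sketch, not a darshan-parser feature; the two sample records are copied from the log excerpt later in this thread, and NPROCS comes from the "# nprocs: 16" header line:

```python
# Sketch: recover a per-rank average from shared-file (-1) records in
# darshan-parser --base --perf --file-list output.
NPROCS = 16  # from the "# nprocs: 16" header line

records = [
    "POSIX\t0\t14271214572185876165\tPOSIX_F_WRITE_TIME\t39.276790\t/path/serial.dat",
    "MPI-IO\t-1\t4647831539574040701\tMPIIO_F_WRITE_TIME\t471.718718\t/path/mpiio.dat",
]

for rec in records:
    module, rank, rec_id, counter, value = rec.split("\t")[:5]
    secs = float(value)
    if rank == "-1":
        # Shared record: Darshan summed this counter across all ranks,
        # so divide by nprocs to get an average per-rank time.
        secs /= NPROCS
    print(f"{counter}: {secs:.6f} s per rank")
    # prints:
    #   POSIX_F_WRITE_TIME: 39.276790 s per rank
    #   MPIIO_F_WRITE_TIME: 29.482420 s per rank
```

The same filtering could of course be done with awk on the rank column; the point is only that the -1 records need dividing by nprocs while per-rank records do not.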

thanks,
-Phil



On 2019-02-19 11:20:19-05:00 Darshan-users wrote:
Hi,

I’m running Darshan 3.1.6. Is there any way to extract the wallclock time of operations from darshan-parser?

Let’s say I have a sample program running on 16 MPI processes: part of it uses serial I/O, and part of it MPI-IO.


> time mpirun -np 16 -host localhost ./benchio.x

I then analyse the results with darshan-parser --total and filter out certain lines:


> /cm/shared/apps/darshan/intel/3.1.6/bin/darshan-parser --total  ~/dc-cox1_benchio.x_id15025_2-18-43921-17740490755319864674_1.darshan | grep -i "nprocs:\|run time\|POSIX_F_READ_TIME\|POSIX_F_WRITE_TIME\|POSIX_F_META_TIME\|MPIIO_F_READ_TIME\|MPIIO_F_WRITE_TIME\|MPIIO_F_META_TIME"

# nprocs: 16

# run time: 83

total_POSIX_F_READ_TIME: 0.000000

total_POSIX_F_WRITE_TIME: 57.435211

total_POSIX_F_META_TIME: 8.595520

total_MPIIO_F_READ_TIME: 0.000000

total_MPIIO_F_WRITE_TIME: 471.718718

total_MPIIO_F_META_TIME: 5.261146

I also analyse with darshan-parser --base --perf --file-list and filter similarly:


> /cm/shared/apps/darshan/intel/3.1.6/bin/darshan-parser --base --perf --file-list  ~/dc-cox1_benchio.x_id15025_2-18-43921-17740490755319864674_1.darshan | grep -i "nprocs:\|run time\|POSIX_F_READ_TIME\|POSIX_F_WRITE_TIME\|POSIX_F_META_TIME\|MPIIO_F_READ_TIME\|MPIIO_F_WRITE_TIME\|MPIIO_F_META_TIME"

# nprocs: 16
# run time: 83
POSIX   0       14271214572185876165    POSIX_F_READ_TIME       0.000000        /lustre/dirac/home/dc-cox1/benchio/shared-file/source/benchio_files/serial.dat        /lustre/dirac   lustre
POSIX   0       14271214572185876165    POSIX_F_WRITE_TIME      39.276790       /lustre/dirac/home/dc-cox1/benchio/shared-file/source/benchio_files/serial.dat        /lustre/dirac   lustre
POSIX   0       14271214572185876165    POSIX_F_META_TIME       4.394738        /lustre/dirac/home/dc-cox1/benchio/shared-file/source/benchio_files/serial.dat        /lustre/dirac   lustre
POSIX   -1      4647831539574040701     POSIX_F_READ_TIME       0.000000        /lustre/dirac/home/dc-cox1/benchio/shared-file/source/benchio_files/mpiio.dat /lustre/dirac   lustre
POSIX   -1      4647831539574040701     POSIX_F_WRITE_TIME      18.158421       /lustre/dirac/home/dc-cox1/benchio/shared-file/source/benchio_files/mpiio.dat /lustre/dirac   lustre
POSIX   -1      4647831539574040701     POSIX_F_META_TIME       4.200782        /lustre/dirac/home/dc-cox1/benchio/shared-file/source/benchio_files/mpiio.dat /lustre/dirac   lustre
MPI-IO  -1      4647831539574040701     MPIIO_F_READ_TIME       0.000000        /lustre/dirac/home/dc-cox1/benchio/shared-file/source/benchio_files/mpiio.dat /lustre/dirac   lustre
MPI-IO  -1      4647831539574040701     MPIIO_F_WRITE_TIME      471.718718      /lustre/dirac/home/dc-cox1/benchio/shared-file/source/benchio_files/mpiio.dat /lustre/dirac   lustre
MPI-IO  -1      4647831539574040701     MPIIO_F_META_TIME       5.261146        /lustre/dirac/home/dc-cox1/benchio/shared-file/source/benchio_files/mpiio.dat /lustre/dirac   lustre

Comparing the two outputs, it looks as though CPU time is being measured. But as a high-level profiling aid I’d like to extract the associated wallclock time instead.

I think this would be 39.276+(18.15/16) and 4.39+(4.2/16) for the POSIX_F_WRITE/META_TIME. I’d rather not have to write a separate script to filter against the rank column from the --base --perf --file-list output.
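For reference, that per-rank arithmetic works out as follows using the full-precision values from the listing above (this simply assumes the shared-file totals divide evenly across the 16 ranks):

```python
# Estimated wallclock POSIX time: rank 0's serial-file time plus the
# shared-file (-1) total averaged over nprocs.  Assumes an even split
# of the shared-file time across ranks.
nprocs = 16
write_est = 39.276790 + 18.158421 / nprocs  # POSIX_F_WRITE_TIME
meta_est = 4.394738 + 4.200782 / nprocs     # POSIX_F_META_TIME
print(f"write ~ {write_est:.2f} s, meta ~ {meta_est:.2f} s")
# prints: write ~ 40.41 s, meta ~ 4.66 s
```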

Thanks in advance,
Sam
