[Darshan-users] Darshan & EPCC benchio different behaviour

Piero LANUCARA p.lanucara at cineca.it
Thu Feb 13 04:24:19 CST 2020


Hi Phil,

first of all thanks for your quite detailed analysis.

I did some further tests and the behaviour changes slightly, for example, 
when switching from mpiifort to mpif90 (which is a sort of wrapper around GNU):

[planucar at r033c01s05 source]$ mpif90 -v
mpif90 for the Intel(R) MPI Library 2018 Update 4 for Linux*
Copyright(C) 2003-2018, Intel Corporation.  All rights reserved.
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man 
--infodir=/usr/share/info 
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap 
--enable-shared --enable-threads=posix --enable-checking=release 
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions 
--enable-gnu-unique-object --enable-linker-build-id 
--with-linker-hash-style=gnu 
--enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto 
--enable-plugin --enable-initfini-array --disable-libgcj 
--with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install 
--with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install 
--enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 
--build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)


In this case the estimate is more correct (although there is still some 
"time" missing in that estimate too). Of course, this is a simple 
benchmark, so in the end the results are not so useful.

But my question is: how can we recover the "missing" I/O time?

In fact, we would get the incorrect impression that the I/O is more 
efficient than what we actually observe, and in a real situation this 
could be a limitation of the tool.

Do you agree?

thanks again


Piero


Il 12/02/2020 21:37, Carns, Philip H. ha scritto:
> Thank you Piero, that does help quite a bit.  I don't know the cause, 
> but I do see why Darshan reports a rate much higher than the benchmark.
>
> The clue is in the write() timing if you zero in on a specific file, 
> in this case serial.dat.1:
>
> [carns at carns-x1-7g Downloads]$ darshan-parser benchio_1202.dn.darshan 
> |grep serial.dat.1 |grep F_WRITE
> POSIX 0 2553187295521652424 POSIX_F_WRITE_START_TIMESTAMP 0.350832 
> /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat.1 
> /gpfs/scratch gpfs
> POSIX 0 2553187295521652424 POSIX_F_WRITE_END_TIMESTAMP 4.519859 
> /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat.1 
> /gpfs/scratch gpfs
> POSIX 0 2553187295521652424 POSIX_F_WRITE_TIME 1.336200 
> /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat.1 
> /gpfs/scratch gpfs
>
> The POSIX_F_WRITE_TIME counter shows the cumulative time in seconds 
> spent waiting in glibc write() calls: 1.3 seconds.
>
> The POSIX_F_WRITE_{START/END}_TIMESTAMP show the elapsed time between 
> when the first write started and the last write ended: 4.2 seconds.
>
> Somehow almost 70% of the elapsed time measured by the benchmark is 
> spent doing something between individual write() calls that make it to 
> glibc.  The writes seem to be relatively sparse in time.
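To make the discrepancy concrete, here is a small sketch (an editor's addition, not from the original email) that plugs in the three counter values quoted above for serial.dat.1 and computes the gap Phil describes:

```python
# Quantify the gap between cumulative write() time and first-to-last
# elapsed write time, using the Darshan counters quoted above.
start = 0.350832       # POSIX_F_WRITE_START_TIMESTAMP
end = 4.519859         # POSIX_F_WRITE_END_TIMESTAMP
write_time = 1.336200  # POSIX_F_WRITE_TIME (cumulative seconds in write())

elapsed = end - start       # first write start to last write end (~4.2 s)
gap = elapsed - write_time  # time spent *between* the write() calls

print(f"elapsed:    {elapsed:.2f} s")
print(f"in write(): {write_time:.2f} s")
print(f"gap:        {gap:.2f} s ({gap / elapsed:.0%} of elapsed)")  # ~68%
```

This reproduces the "almost 70%" figure: about 2.8 of the 4.2 elapsed seconds are unaccounted for by the write() calls themselves.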
>
> This doesn't make any sense if you look at the benchmark code, 
> though.  The write loop is very simple, in fact it is just a tight 
> loop calling write():
>
> https://github.com/EPCCed/benchio/blob/master/shared-file/source/mpiio.F90#L154
>
> Maybe someone on the list with more Fortran expertise than me can 
> comment on why this might happen?  Is it possible that the Fortran 
> runtime is doing something (buffering?) that consumes time in the 
> Fortran-level write call before the data is actually relayed to the 
> system?
>
> As a side note we could use other heuristics for performance, but they 
> all have their tradeoffs.  The one that would obviously work here 
> (measuring time from first to last I/O) would break for other common 
> patterns.
>
> thanks,
> -Phil
> ------------------------------------------------------------------------
> *From:* Piero LANUCARA <p.lanucara at cineca.it>
> *Sent:* Wednesday, February 12, 2020 1:47 PM
> *To:* Carns, Philip H. <carns at mcs.anl.gov>; Snyder, Shane 
> <ssnyder at mcs.anl.gov>; Harms, Kevin <harms at alcf.anl.gov>
> *Cc:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
> *Subject:* Re: [Darshan-users] Darshan & EPCC benchio different behaviour
>
> p.s
>
> sorry, I forgot to mention that we built the environment (both Darshan 
> and the benchmark) using the Intel MPI mpiifort script:
>
>  mpiifort -v
> mpiifort for the Intel(R) MPI Library 2018 Update 4 for Linux*
> Copyright(C) 2003-2018, Intel Corporation.  All rights reserved.
>
>
> regards
>
>
> Piero
>
> Il 12/02/2020 16:54, Piero LANUCARA ha scritto:
>>
>> Hi Phil
>>
>>
>> in attach
>>
>> dn stands for different names....
>>
>>
>> cheers
>>
>> Piero
>>
>> Il 12/02/2020 16:13, Carns, Philip H. ha scritto:
>>> Ah, great, thank you for the confirmation.
>>>
>>> In that case it looks like Darshan is instrumenting properly at run 
>>> time, but I think Kevin is on the right track that Darshan's 
>>> heuristics for calculating performance in post processing are 
>>> getting confused for some reason.
>>>
>>> Probably GPFS is client-side caching aggressively in the single 
>>> client case, but that wouldn't explain why the benchmark output 
>>> reports a much different number than Darshan, though.  They should 
>>> both perceive roughly the same performance; neither the benchmark 
>>> itself nor Darshan know if caching is happening or not.
>>>
>>> It's hard to see where the performance heuristic went wrong from 
>>> looking at the log, in large part because the app repeatedly opens a 
>>> file with the same name (there is a clue to this in the OPEN 
>>> counters; the same file name is opened 20 times):
>>>
>>> # WARNING: POSIX_OPENS counter includes both POSIX_FILENOS and 
>>> POSIX_DUPS counts
>>> POSIX 0 6563482044800691889 POSIX_OPENS 20 
>>> /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat 
>>> /gpfs/scratch gpfs
>>>
>>> Every time the file is opened (regardless of whether it was unlinked 
>>> in between or not), Darshan keeps adding counters to the same 
>>> record, which are associated with that serial.dat file name.  So 
>>> things like close() timestamps become nonsensical because Darshan 
>>> records when the first close() starts and when the last one finishes:
>>>
>>> [carns at carns-x1-7g Downloads]$ darshan-parser benchio_1202.darshan 
>>> |grep CLOSE
>>> POSIX 0 6563482044800691889 POSIX_F_CLOSE_START_TIMESTAMP 4.536853 
>>> /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat 
>>> /gpfs/scratch gpfs
>>> POSIX 0 6563482044800691889 POSIX_F_CLOSE_END_TIMESTAMP 43.041987 
>>> /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat 
>>> /gpfs/scratch gpfs
>>>
>>> (This doesn't mean there was one close() that took ~40 seconds; in 
>>> this case there were many close() calls and ~40 seconds elapsed 
>>> between the start of the first one and completion of the last one).
>>>
>>> If it is possible for you to modify the benchmark (as an experiment) 
>>> so that it chooses a new file name on each iteration, then I think 
>>> it would probably disentangle the counters enough for us to tell 
>>> what went wrong.
>>>
>>> thanks,
>>> -Phil
>>> ------------------------------------------------------------------------
>>> *From:* Piero LANUCARA <p.lanucara at cineca.it>
>>> *Sent:* Wednesday, February 12, 2020 9:47 AM
>>> *To:* Carns, Philip H. <carns at mcs.anl.gov>; Snyder, Shane 
>>> <ssnyder at mcs.anl.gov>; Harms, Kevin <harms at alcf.anl.gov>
>>> *Cc:* darshan-users at lists.mcs.anl.gov
>>> *Subject:* Re: [Darshan-users] Darshan & EPCC benchio different 
>>> behaviour
>>>
>>> Hi Phil.....POSIX
>>>
>>> this is a well known benchmark ....you can easily verify it!
>>>
>>>
>>> by the way, it's something like this:
>>>
>>>
>>> ! Serial write is unconditionally compiled
>>> subroutine serialwrite(filename, iodata, n1, n2, n3, cartcomm)
>>>
>>>   character*(*) :: filename
>>>
>>>   integer :: n1, n2, n3
>>>   double precision, dimension(0:n1+1,0:n2+1,0:n3+1) :: iodata
>>>
>>>   integer :: cartcomm, ierr, rank, size
>>>   integer, parameter :: iounit = 10
>>>
>>>   integer :: i
>>>
>>>   call MPI_Comm_size(cartcomm, size, ierr)
>>>   call MPI_Comm_rank(cartcomm, rank, ierr)
>>>
>>> !  Write same amount of data as the parallel write but do it all 
>>> from rank 0
>>> !  This is just to get a baseline figure for serial IO performance - 
>>> note
>>> !  that the contents of the file will be different from the 
>>> parallel calls
>>>
>>>   if (rank == 0) then
>>>
>>>      open(file=filename, unit=iounit, access='stream')
>>>
>>>      do i = 1, size
>>>         write(unit=iounit) iodata(1:n1, 1:n2, 1:n3)
>>>      end do
>>>
>>>      close(iounit)
>>>
>>>   end if
>>>
>>> end subroutine serialwrite
>>>
>>>
>>> Piero
>>>
>>> Il 12/02/2020 14:13, Carns, Philip H. ha scritto:
>>>> Hi Piero,
>>>>
>>>> In the serial case, is the rank that's doing I/O still using 
>>>> MPI-IO, or is it making calls directly to POSIX in that case?
>>>>
>>>> The Darshan log for the serial case doesn't show any MPI-IO 
>>>> activity, but I'm not sure if that's accurate, or if it's an 
>>>> indication that we missed some instrumentation.
>>>>
>>>> thanks,
>>>> -Phil
>>>> ------------------------------------------------------------------------
>>>> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on 
>>>> behalf of Piero LANUCARA <p.lanucara at cineca.it>
>>>> *Sent:* Wednesday, February 12, 2020 5:29 AM
>>>> *To:* Snyder, Shane <ssnyder at mcs.anl.gov>; Harms, Kevin 
>>>> <harms at alcf.anl.gov>
>>>> *Cc:* darshan-users at lists.mcs.anl.gov
>>>> *Subject:* Re: [Darshan-users] Darshan & EPCC benchio different 
>>>> behaviour
>>>> Hi Shane, Kevin
>>>>
>>>> thanks for the update.
>>>>
>>>> I attached new updated files (log and PDF) to this email.
>>>>
>>>> Also, the log from BENCHIO is attached.
>>>>
>>>> thanks again
>>>>
>>>> regards
>>>>
>>>> Piero
>>>>
>>>>
>>>> Il 11/02/2020 20:15, Shane Snyder ha scritto:
>>>> > Definitely looks like something strange is happening when Darshan is
>>>> > estimating the time spent in I/O operations (as seen in the very first
>>>> > figure, observed write time barely even registers) in the serial case,
>>>> > which is ultimately used to provide the performance estimate.
>>>> >
>>>> > If you could provide them, the raw Darshan logs would be really
>>>> > helpful. That should make it clear whether it's an instrumentation
>>>> > issue (i.e., under accounting for time spent in I/O operations at
>>>> > runtime) or if its an issue with the heuristics in the PDF summary
>>>> > tool you are using, as Kevin points out. If it's the latter, having
>>>> > an example log to test modifications to our heuristics would be very
>>>> > helpful to us.
>>>> >
>>>> > Thanks,
>>>> > --Shane
>>>> >
>>>> > On 2/11/20 8:36 AM, Harms, Kevin wrote:
>>>> >> Piero,
>>>> >>
>>>> >>    the performance estimate is based on heuristics; it's possible
>>>> >> the 'serial' model is breaking some assumptions about how the I/O is
>>>> >> done. Is every rank opening the file, but only rank 0 is doing actual
>>>> >> I/O?
>>>> >>
>>>> >>    If possible, you could provide the log and we could check to see
>>>> >> what the counters look like.
>>>> >>
>>>> >> kevin
>>>> >>
>>>> >> ________________________________________
>>>> >> From: Piero LANUCARA <p.lanucara at cineca.it>
>>>> >> Sent: Tuesday, February 11, 2020 2:28 AM
>>>> >> To: Harms, Kevin
>>>> >> Cc: darshan-users at lists.mcs.anl.gov
>>>> >> Subject: Re: [Darshan-users] Darshan & EPCC benchio different
>>>> >> behaviour
>>>> >>
>>>> >> Hi Kevin
>>>> >>
>>>> >> first of all thanks for the investigation. I did some further tests
>>>> >> and it seems like the issue may appear using Fortran (MPI, mainly
>>>> >> IntelMPI) codes.
>>>> >>
>>>> >> Is this information useful?
>>>> >>
>>>> >> regards
>>>> >> Piero
>>>> >> Il 07/02/2020 16:07, Harms, Kevin ha scritto:
>>>> >>> Piero,
>>>> >>>
>>>> >>>     just to confirm, the serial case is still running in parallel,
>>>> >>> 36 processes, but the I/O is only from rank 0?
>>>> >>>
>>>> >>> kevin
>>>> >>>
>>>> >>> ________________________________________
>>>> >>> From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on
>>>> >>> behalf of Piero LANUCARA <p.lanucara at cineca.it>
>>>> >>> Sent: Wednesday, February 5, 2020 4:56 AM
>>>> >>> To: darshan-users at lists.mcs.anl.gov
>>>> >>> Subject: Re: [Darshan-users] Darshan & EPCC benchio different
>>>> >>> behaviour
>>>> >>>
>>>> >>> p.s
>>>> >>>
>>>> >>> to be more "verbose" I add to the discussion:
>>>> >>>
>>>> >>> Darshan output for the "serial" run (serial.pdf)
>>>> >>>
>>>> >>> Darshan output for the MPI-IO run (mpiio.pdf)
>>>> >>>
>>>> >>> benchio output for "serial" run (serial.out)
>>>> >>>
>>>> >>> benchio output for "MPI-IO" run (mpi-io.out)
>>>> >>>
>>>> >>> thanks
>>>> >>>
>>>> >>> Piero
>>>> >>>
>>>> >>> Il 04/02/2020 19:44, Piero LANUCARA ha scritto:
>>>> >>>> Dear all
>>>> >>>>
>>>> >>>> I'm using Darshan to measure EPCC benchio benchmark
>>>> >>>> (https://github.com/EPCCed/benchio) behaviour on a given x86 Tier1
>>>> >>>> machine.
>>>> >>>>
>>>> >>>> running two benchio tests (MPI-IO and serial), a different behaviour
>>>> >>>> appears:
>>>> >>>>
>>>> >>>> while the Darshan PDF log file is able to recover the estimated time
>>>> >>>> and bandwidth in the MPI-IO case, the "serial" run is completely
>>>> >>>> underestimated by Darshan (the time is less and the bandwidth greater
>>>> >>>> than the benchio output).
>>>> >>>>
>>>> >>>> Suggestions are welcomed
>>>> >>>>
>>>> >>>> thanks
>>>> >>>>
>>>> >>>> Piero
>>>> >>>>
>>>> >>>> _______________________________________________
>>>> >>>> Darshan-users mailing list
>>>> >>>> Darshan-users at lists.mcs.anl.gov
>>>> >>>> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
>>>> >> _______________________________________________
>>>> >> Darshan-users mailing list
>>>> >> Darshan-users at lists.mcs.anl.gov
>>>> >> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
>>>> >
>>>> > _______________________________________________
>>>> > Darshan-users mailing list
>>>> > Darshan-users at lists.mcs.anl.gov
>>>> > https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
>>
>> _______________________________________________
>> Darshan-users mailing list
>> Darshan-users at lists.mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users