[Darshan-users] Darshan & EPCC benchio different behaviour
Piero LANUCARA
p.lanucara at cineca.it
Wed Feb 12 09:54:21 CST 2020
Hi Phil
please find the files attached ("dn" stands for "different names").
cheers
Piero
On 12/02/2020 16:13, Carns, Philip H. wrote:
> Ah, great, thank you for the confirmation.
>
> In that case it looks like Darshan is instrumenting properly at run
> time, but I think Kevin is on the right track: Darshan's heuristics for
> calculating performance in post-processing are getting confused for some
> reason.
>
> GPFS is probably caching aggressively on the client side in the
> single-client case, but that wouldn't explain why the benchmark output
> reports a much different number than Darshan; they should both perceive
> roughly the same performance, since neither the benchmark itself nor
> Darshan knows whether caching is happening.
>
> It's hard to see where the performance heuristic went wrong from
> looking at the log, in large part because the app repeatedly opens a
> file with the same name (there is a clue to this in the OPEN counters;
> the same file name is opened 20 times):
>
> # WARNING: POSIX_OPENS counter includes both POSIX_FILENOS and POSIX_DUPS counts
> POSIX 0 6563482044800691889 POSIX_OPENS 20 /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat /gpfs/scratch gpfs
>
> Every time the file is opened (regardless of whether it was unlinked in
> between or not), Darshan keeps adding counters to the same record, which
> is associated with that serial.dat file name. So things like close()
> timestamps become nonsensical, because Darshan records when the first
> close() starts and when the last one finishes:
>
> [carns at carns-x1-7g Downloads]$ darshan-parser benchio_1202.darshan |grep CLOSE
> POSIX 0 6563482044800691889 POSIX_F_CLOSE_START_TIMESTAMP 4.536853 /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat /gpfs/scratch gpfs
> POSIX 0 6563482044800691889 POSIX_F_CLOSE_END_TIMESTAMP 43.041987 /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat /gpfs/scratch gpfs
>
> (This doesn't mean there was a single close() that took ~40 seconds; in
> this case there were many close() calls, and about 38.5 seconds
> (43.04 - 4.54) elapsed between the start of the first one and the
> completion of the last one.)
>
> If it is possible for you to modify the benchmark (as an experiment)
> so that it chooses a new file name on each iteration, then I think it
> would probably disentangle the counters enough for us to tell what
> went wrong.
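>
> Something along these lines would do it (a minimal sketch only, written
> around the serialwrite routine quoted below; the wrapper name, the "nrep"
> argument, and the file-name format are illustrative and not part of
> benchio itself):
>
> subroutine serialwrite_unique(basename, nrep, iodata, n1, n2, n3, cartcomm)
>
>    ! Hypothetical driver: call the existing serialwrite once per repetition,
>    ! but with a distinct file name each time (serial_0001.dat, ...), so that
>    ! Darshan keeps one record per file instead of folding every open/close
>    ! into a single serial.dat record.
>
>    implicit none
>
>    character*(*) :: basename
>    integer :: nrep, n1, n2, n3, cartcomm
>    double precision, dimension(0:n1+1,0:n2+1,0:n3+1) :: iodata
>
>    character(len=256) :: filename
>    integer :: istep
>
>    do istep = 1, nrep
>       ! Encode the repetition index in the name, e.g. serial_0003.dat
>       write(filename, '(a,i4.4,a)') trim(basename)//'_', istep, '.dat'
>       call serialwrite(trim(filename), iodata, n1, n2, n3, cartcomm)
>    end do
>
> end subroutine serialwrite_unique
>
> Any naming scheme that gives each open its own file would work equally
> well; the point is just that each Darshan per-file record then covers a
> single open/close sequence.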
>
> thanks,
> -Phil
> ------------------------------------------------------------------------
> *From:* Piero LANUCARA <p.lanucara at cineca.it>
> *Sent:* Wednesday, February 12, 2020 9:47 AM
> *To:* Carns, Philip H. <carns at mcs.anl.gov>; Snyder, Shane
> <ssnyder at mcs.anl.gov>; Harms, Kevin <harms at alcf.anl.gov>
> *Cc:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
> *Subject:* Re: [Darshan-users] Darshan & EPCC benchio different behaviour
>
> Hi Phil: POSIX (the serial case uses plain Fortran I/O rather than MPI-IO).
>
> This is a well-known benchmark, so you can easily verify it!
>
>
> In any case, the routine looks something like this:
>
>
> ! Serial write is unconditionally compiled
> subroutine serialwrite(filename, iodata, n1, n2, n3, cartcomm)
>
>    use mpi
>    implicit none
>
>    character*(*) :: filename
>
>    integer :: n1, n2, n3
>    double precision, dimension(0:n1+1,0:n2+1,0:n3+1) :: iodata
>
>    integer :: cartcomm, ierr, rank, size
>    integer, parameter :: iounit = 10
>
>    integer :: i
>
>    call MPI_Comm_size(cartcomm, size, ierr)
>    call MPI_Comm_rank(cartcomm, rank, ierr)
>
>    ! Write the same amount of data as the parallel write, but do it all from
>    ! rank 0. This is just to get a baseline figure for serial IO performance -
>    ! note that the contents of the file will be different from the parallel
>    ! calls.
>
>    if (rank == 0) then
>
>       open(file=filename, unit=iounit, access='stream')
>
>       do i = 1, size
>          write(unit=iounit) iodata(1:n1, 1:n2, 1:n3)
>       end do
>
>       close(iounit)
>
>    end if
>
> end subroutine serialwrite
>
>
> Piero
>
> On 12/02/2020 14:13, Carns, Philip H. wrote:
>> Hi Piero,
>>
>> In the serial case, is the rank that's doing I/O still using MPI-IO,
>> or is it making calls directly to POSIX in that case?
>>
>> The Darshan log for the serial case doesn't show any MPI-IO activity,
>> but I'm not sure if that's accurate, or if it's an indication that we
>> missed some instrumentation.
>>
>> thanks,
>> -Phil
>> ------------------------------------------------------------------------
>> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Piero LANUCARA <p.lanucara at cineca.it>
>> *Sent:* Wednesday, February 12, 2020 5:29 AM
>> *To:* Snyder, Shane <ssnyder at mcs.anl.gov>; Harms, Kevin <harms at alcf.anl.gov>
>> *Cc:* darshan-users at lists.mcs.anl.gov
>> *Subject:* Re: [Darshan-users] Darshan & EPCC benchio different behaviour
>> Hi Shane, Kevin
>>
>> thanks for the update.
>>
>> I have attached the new, updated files (log and PDF) to this email.
>>
>> Also, the log from BENCHIO is attached.
>>
>> thanks again
>>
>> regards
>>
>> Piero
>>
>>
>> On 11/02/2020 20:15, Shane Snyder wrote:
>> > It definitely looks like something strange is happening in the serial
>> > case when Darshan estimates the time spent in I/O operations (as seen in
>> > the very first figure, the observed write time barely even registers),
>> > and that estimate is ultimately what is used to produce the performance
>> > figure.
>> >
>> > If you could provide them, the raw Darshan logs would be really helpful.
>> > That should make it clear whether it's an instrumentation issue (i.e.,
>> > under-accounting for time spent in I/O operations at runtime) or an
>> > issue with the heuristics in the PDF summary tool you are using, as
>> > Kevin points out. If it's the latter, having an example log to test
>> > modifications to our heuristics would be very helpful to us.
>> >
>> > Thanks,
>> > --Shane
>> >
>> > On 2/11/20 8:36 AM, Harms, Kevin wrote:
>> >> Piero,
>> >>
>> >> the performance estimate is based on heuristics; it's possible the
>> >> 'serial' model is breaking some assumptions about how the I/O is done.
>> >> Is every rank opening the file, but only rank 0 doing the actual I/O?
>> >>
>> >> If possible, you could provide the log and we could check to see
>> >> what the counters look like.
>> >>
>> >> kevin
>> >>
>> >> ________________________________________
>> >> From: Piero LANUCARA <p.lanucara at cineca.it>
>> >> Sent: Tuesday, February 11, 2020 2:28 AM
>> >> To: Harms, Kevin
>> >> Cc: darshan-users at lists.mcs.anl.gov
>> >> Subject: Re: [Darshan-users] Darshan & EPCC benchio different behaviour
>> >>
>> >> Hi Kevin
>> >>
>> >> First of all, thanks for the investigation. I did some further tests,
>> >> and it seems like the issue may appear when using Fortran (MPI, mainly
>> >> IntelMPI) codes.
>> >>
>> >> Is this information useful?
>> >>
>> >> regards
>> >> Piero
>> >> On 07/02/2020 16:07, Harms, Kevin wrote:
>> >>> Piero,
>> >>>
>> >>> just to confirm: the serial case is still running in parallel with
>> >>> 36 processes, but the I/O is only from rank 0?
>> >>>
>> >>> kevin
>> >>>
>> >>> ________________________________________
>> >>> From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Piero LANUCARA <p.lanucara at cineca.it>
>> >>> Sent: Wednesday, February 5, 2020 4:56 AM
>> >>> To: darshan-users at lists.mcs.anl.gov
>> >>> Subject: Re: [Darshan-users] Darshan & EPCC benchio different behaviour
>> >>>
>> >>> p.s.
>> >>>
>> >>> to be more "verbose", I am adding the following to the discussion:
>> >>>
>> >>> Darshan output for the "serial" run (serial.pdf)
>> >>>
>> >>> Darshan output for the MPI-IO run (mpiio.pdf)
>> >>>
>> >>> benchio output for "serial" run (serial.out)
>> >>>
>> >>> benchio output for "MPI-IO" run (mpi-io.out)
>> >>>
>> >>> thanks
>> >>>
>> >>> Piero
>> >>>
>> >>> On 04/02/2020 19:44, Piero LANUCARA wrote:
>> >>>> Dear all
>> >>>>
>> >>>> I'm using Darshan to measure the behaviour of the EPCC benchio
>> >>>> benchmark (https://github.com/EPCCed/benchio) on a given x86 Tier1
>> >>>> machine.
>> >>>>
>> >>>> Running two benchio tests (MPI-IO and serial), a different behaviour
>> >>>> appears:
>> >>>>
>> >>>> while the Darshan PDF report is able to recover the estimated time and
>> >>>> bandwidth in the MPI-IO case, the "serial" run is completely
>> >>>> misestimated by Darshan (its reported time is lower, and its bandwidth
>> >>>> higher, than the benchio output).
>> >>>>
>> >>>> Suggestions are welcomed
>> >>>>
>> >>>> thanks
>> >>>>
>> >>>> Piero
>> >>>>
Attachments:
  benchio_1202.dn.darshan.pdf (application/pdf, 67450 bytes): <http://lists.mcs.anl.gov/pipermail/darshan-users/attachments/20200212/bbec6087/attachment-0001.pdf>
  benchio_1202.dn.darshan (application/octet-stream, 1563 bytes): <http://lists.mcs.anl.gov/pipermail/darshan-users/attachments/20200212/bbec6087/attachment-0001.obj>