[Darshan-users] Darshan & EPCC benchio different behaviour

Piero LANUCARA p.lanucara at cineca.it
Wed Feb 12 09:54:21 CST 2020


Hi Phil


please find the new Darshan log and PDF attached

dn stands for different names....


cheers

Piero

On 12/02/2020 16:13, Carns, Philip H. wrote:
> Ah, great, thank you for the confirmation.
>
> In that case it looks like Darshan is instrumenting properly at run 
> time, but I think Kevin is on the right track that Darshan's 
> heuristics for calculating performance in post processing are getting 
> confused for some reason.
>
> Probably GPFS is caching aggressively on the client side in the 
> single-client case, but that wouldn't explain why the benchmark output 
> reports a very different number than Darshan.  They should both perceive 
> roughly the same performance; neither the benchmark itself nor Darshan 
> knows whether caching is happening or not.
>
> It's hard to see where the performance heuristic went wrong from 
> looking at the log, in large part because the app repeatedly opens a 
> file with the same name (there is a clue to this in the OPEN counters; 
> the same file name is opened 20 times):
>
> # WARNING: POSIX_OPENS counter includes both POSIX_FILENOS and 
> POSIX_DUPS counts
> POSIX 0 6563482044800691889 POSIX_OPENS 20 
> /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat 
> /gpfs/scratch gpfs
>
> Every time the file is opened (regardless of whether it was unlinked 
> in between or not), Darshan keeps adding counters to the same record, 
> which is associated with the serial.dat file name.  So things like 
> close() timestamps become nonsensical, because Darshan records when the 
> first close() starts and when the last one finishes:
>
> [carns at carns-x1-7g Downloads]$ darshan-parser benchio_1202.darshan 
> |grep CLOSE
> POSIX 0 6563482044800691889 POSIX_F_CLOSE_START_TIMESTAMP 4.536853 
> /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat 
> /gpfs/scratch gpfs
> POSIX 0 6563482044800691889 POSIX_F_CLOSE_END_TIMESTAMP 43.041987 
> /gpfs/scratch/userinternal/planucar/benchio-master/shared-file/source/benchio_files/serial.dat 
> /gpfs/scratch gpfs
>
> (This doesn't mean there was one close() that took ~40 seconds; in 
> this case there were many close() calls and ~40 seconds elapsed 
> between the start of the first one and completion of the last one).
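>
> (For the record, that span is 43.041987 - 4.536853 ≈ 38.5 seconds,
> covering all 20 open/close cycles of serial.dat.)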
>
> If it is possible for you to modify the benchmark (as an experiment) 
> so that it chooses a new file name on each iteration, then I think it 
> would probably disentangle the counters enough for us to tell what 
> went wrong.
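>
> Just to sketch the idea (untested, and the names nrep/basename below are
> made up rather than taken from benchio), the driver loop could do
> something like:
>
>   character(len=256) :: uniquename
>   integer :: irep
>
>   do irep = 1, nrep   ! nrep = number of repetitions of the serial test
>      ! embed the repetition number in the file name so that each open
>      ! gets its own Darshan record
>      write(uniquename, '(A,"_",I0,".dat")') trim(basename), irep
>      call serialwrite(uniquename, iodata, n1, n2, n3, cartcomm)
>   end do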
>
> thanks,
> -Phil
> ------------------------------------------------------------------------
> *From:* Piero LANUCARA <p.lanucara at cineca.it>
> *Sent:* Wednesday, February 12, 2020 9:47 AM
> *To:* Carns, Philip H. <carns at mcs.anl.gov>; Snyder, Shane 
> <ssnyder at mcs.anl.gov>; Harms, Kevin <harms at alcf.anl.gov>
> *Cc:* darshan-users at lists.mcs.anl.gov
> *Subject:* Re: [Darshan-users] Darshan & EPCC benchio different behaviour
>
> Hi Phil ... it's POSIX
>
> this is a well-known benchmark ... you can easily verify it!
>
>
> by the way, it's something like this:
>
>
> ! Serial write is unconditionally compiled
> subroutine serialwrite(filename, iodata, n1, n2, n3, cartcomm)
>
>   character*(*) :: filename
>
>   integer :: n1, n2, n3
>   double precision, dimension(0:n1+1,0:n2+1,0:n3+1) :: iodata
>
>   integer :: cartcomm, ierr, rank, size
>   integer, parameter :: iounit = 10
>
>   integer :: i
>
>   call MPI_Comm_size(cartcomm, size, ierr)
>   call MPI_Comm_rank(cartcomm, rank, ierr)
>
> !  Write same amount of data as the parallel write but do it all from rank 0
> !  This is just to get a baseline figure for serial IO performance - note
> !  that the contents of the file will be different from the parallel calls
>
>   if (rank == 0) then
>
>      open(file=filename, unit=iounit, access='stream')
>
>      do i = 1, size
>         write(unit=iounit) iodata(1:n1, 1:n2, 1:n3)
>      end do
>
>      close(iounit)
>
>   end if
>
> end subroutine serialwrite
>
>
> Piero
>
> On 12/02/2020 14:13, Carns, Philip H. wrote:
>> Hi Piero,
>>
>> In the serial case, is the rank that's doing I/O still using MPI-IO, 
>> or is it making calls directly to POSIX in that case?
>>
>> The Darshan log for the serial case doesn't show any MPI-IO activity, 
>> but I'm not sure if that's accurate, or if it's an indication that we 
>> missed some instrumentation.
>>
>> thanks,
>> -Phil
>> ------------------------------------------------------------------------
>> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on 
>> behalf of Piero LANUCARA <p.lanucara at cineca.it>
>> *Sent:* Wednesday, February 12, 2020 5:29 AM
>> *To:* Snyder, Shane <ssnyder at mcs.anl.gov>; Harms, Kevin <harms at alcf.anl.gov>
>> *Cc:* darshan-users at lists.mcs.anl.gov
>> *Subject:* Re: [Darshan-users] Darshan & EPCC benchio different 
>> behaviour
>> Hi Shane, Kevin
>>
>> thanks for the update.
>>
>> I have attached the new updated files (log and PDF) to this email.
>>
>> Also, the log from BENCHIO is attached.
>>
>> thanks again
>>
>> regards
>>
>> Piero
>>
>>
>> On 11/02/2020 20:15, Shane Snyder wrote:
>> > It definitely looks like something strange is happening in the serial
>> > case when Darshan estimates the time spent in I/O operations, which is
>> > ultimately used to provide the performance estimate (as seen in the
>> > very first figure, the observed write time barely even registers).
>> >
>> > If you could provide them, the raw Darshan logs would be really
>> > helpful. That should make it clear whether it's an instrumentation
>> > issue (i.e., under-accounting for time spent in I/O operations at
>> > runtime) or an issue with the heuristics in the PDF summary tool you
>> > are using, as Kevin points out. If it's the latter, having an example
>> > log to test modifications to our heuristics would be very helpful to us.
>> >
>> > Thanks,
>> > --Shane
>> >
>> > On 2/11/20 8:36 AM, Harms, Kevin wrote:
>> >> Piero,
>> >>
>> >>    the performance estimate is based on heuristics; it's possible the
>> >> 'serial' model is breaking some assumptions about how the I/O is
>> >> done. Is every rank opening the file, but only rank 0 doing the actual
>> >> I/O?
>> >>
>> >>    If possible, you could provide the log and we could check to see
>> >> what the counters look like.
>> >>
>> >> kevin
>> >>
>> >> ________________________________________
>> >> From: Piero LANUCARA <p.lanucara at cineca.it>
>> >> Sent: Tuesday, February 11, 2020 2:28 AM
>> >> To: Harms, Kevin
>> >> Cc: darshan-users at lists.mcs.anl.gov
>> >> Subject: Re: [Darshan-users] Darshan & EPCC benchio different 
>> behaviour
>> >>
>> >> Hi Kevin
>> >>
>> >> first of all, thanks for the investigation. I did some further tests
>> >> and it seems like the issue may appear with Fortran (MPI, mainly
>> >> Intel MPI) codes.
>> >>
>> >> Is this information useful?
>> >>
>> >> regards
>> >> Piero
>> >> On 07/02/2020 16:07, Harms, Kevin wrote:
>> >>> Piero,
>> >>>
>> >>>     just to confirm, the serial case is still running in parallel,
>> >>> 36 processes, but the I/O is only from rank 0?
>> >>>
>> >>> kevin
>> >>>
>> >>> ________________________________________
>> >>> From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on
>> >>> behalf of Piero LANUCARA <p.lanucara at cineca.it>
>> >>> Sent: Wednesday, February 5, 2020 4:56 AM
>> >>> To: darshan-users at lists.mcs.anl.gov
>> >>> Subject: Re: [Darshan-users] Darshan & EPCC benchio different 
>> behaviour
>> >>>
>> >>> P.S.
>> >>>
>> >>> to be more "verbose", I am adding the following to the discussion:
>> >>>
>> >>> Darshan output for the "serial" run (serial.pdf)
>> >>>
>> >>> Darshan output for the MPI-IO run (mpiio.pdf)
>> >>>
>> >>> benchio output for "serial" run (serial.out)
>> >>>
>> >>> benchio output for "MPI-IO" run (mpi-io.out)
>> >>>
>> >>> thanks
>> >>>
>> >>> Piero
>> >>>
>> >>> On 04/02/2020 19:44, Piero LANUCARA wrote:
>> >>>> Dear all
>> >>>>
>> >>>> I'm using Darshan to measure the behaviour of the EPCC benchio
>> >>>> benchmark (https://github.com/EPCCed/benchio) on a given x86 Tier-1
>> >>>> machine.
>> >>>>
>> >>>> Running two benchio tests (MPI-IO and serial), a different behaviour
>> >>>> appears:
>> >>>>
>> >>>> while the Darshan PDF summary is able to recover the estimated time
>> >>>> and bandwidth in the MPI-IO case, the "serial" run is completely
>> >>>> misestimated by Darshan (the reported time is lower, and the
>> >>>> bandwidth higher, than the benchio output).
>> >>>>
>> >>>> Suggestions are welcome
>> >>>>
>> >>>> thanks
>> >>>>
>> >>>> Piero
>> >>>>
>> >>>> _______________________________________________
>> >>>> Darshan-users mailing list
>> >>>> Darshan-users at lists.mcs.anl.gov
>> >>>> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
>> >> _______________________________________________
>> >> Darshan-users mailing list
>> >> Darshan-users at lists.mcs.anl.gov
>> >> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
>> >
>> > _______________________________________________
>> > Darshan-users mailing list
>> > Darshan-users at lists.mcs.anl.gov
>> > https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
-------------- next part --------------
A non-text attachment was scrubbed...
Name: benchio_1202.dn.darshan.pdf
Type: application/pdf
Size: 67450 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/darshan-users/attachments/20200212/bbec6087/attachment-0001.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: benchio_1202.dn.darshan
Type: application/octet-stream
Size: 1563 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/darshan-users/attachments/20200212/bbec6087/attachment-0001.obj>

