[Darshan-users] overstating mpi-io time
Phil Carns
carns at mcs.anl.gov
Thu Sep 2 15:16:30 CDT 2010
I don't know if I count as "wider darshan community" really, but I do
have some ideas :)
I happen to have been thinking about a similar problem again today.
already. Kevin and I worked out a scheme a while back for estimating
the aggregate I/O performance of an application based strictly on data
from Darshan, and I'm trying to add that into the job summary script
right now. It has a similar problem, which is basically how to compute
aggregate performance if not all processes are contributing the same
amount of time or data. What you really want is the absolute time
(across all procs) regardless of how many procs participated.
To make a long story short, we found that there was no way to work
backwards from the aggregate time on a shared file to an absolute time
that held in every case. What you have to do instead
is figure out how to time the slowest rank on any given file. In your
case you need:
- how much time was spent in MPI_File_write_all() by the slowest rank
that participated
- how much time was spent in posix calls on that file by the slowest
rank that participated
I'm not sure if we have enough data currently to pull that off, though.
Kevin recently added new fields that will tell you the fastest rank and
slowest rank on a shared file along with the number of bytes and seconds
that each one took and the variance. I'm not sure if it always uses
posix or if it switches to MPI counters on MPI files, but at any rate it
definitely doesn't report both. It sounds like maybe that's the thing
to do, though - if you just knew the max time for posix and mpi
separately, it would answer the question.
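To make that concrete, here is a minimal sketch of the metric I mean. The function name and plain-float inputs are purely illustrative (this is not Darshan's API); it just assumes you already have the slowest rank's time in MPI_File_write_all and in posix writes for one shared file:

```python
# Illustrative sketch, not actual Darshan log parsing: given the slowest
# rank's MPI and posix write times on a shared file, the fraction of MPI
# time that is "overhead" (i.e. not spent in posix I/O) falls out directly.
def mpi_overhead_fraction(max_mpi_write_time, max_posix_write_time):
    """Fraction of the slowest rank's MPI write time not spent in posix writes."""
    return (max_mpi_write_time - max_posix_write_time) / max_mpi_write_time

# Rob's example below: rank 0 is the slowest, with 100s in
# MPI_File_write_all and 50s in posix write.
print(mpi_overhead_fraction(100.0, 50.0))  # 0.5
```

That gives the "half MPI, half POSIX" answer the example calls for, without ever summing times across ranks.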
We figured out a way to guesstimate the absolute time before that feature
was added, by just measuring the time between the first open and the
last IO. You can still do that to get a reasonable estimate, but we
don't split that up by MPI and POSIX either, unfortunately. You can
only get the posix open timestamp.
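For what it's worth, that estimate is trivial to compute once you have the per-file timestamps out of a log. A rough sketch, assuming the timestamps are already extracted as plain floats (the function and argument names here are made up for illustration, not real counter fields):

```python
# Rough sketch of the "first open to last I/O" estimate: elapsed seconds
# between the (posix) open and whichever read or write finished last.
# Inputs are hypothetical floats pulled from a parsed log, not a real API.
def estimate_io_time(open_ts, last_read_end_ts, last_write_end_ts):
    """Elapsed seconds between the first open and the last read or write."""
    return max(last_read_end_ts, last_write_end_ts) - open_ts

# e.g. a file opened at t=10.0s whose last write finished at t=85.0s
print(estimate_io_time(10.0, 42.0, 85.0))  # 75.0
```

As noted above, this only gives one combined number; it can't split the interval between MPI and POSIX time.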
-Phil
On 09/02/2010 03:18 PM, Rob Latham wrote:
> Let's consider the case of collective file I/O. ROMIO uses
> "aggregation" to have a subset of processors carry out I/O. On
> BlueGene, that aggregation means for every 256 processors (in virtual
> node mode), 8 of them actually do I/O.
>
> Imagine the following ASCII art is actually Jumpshot
>
> 0: |---MPI_File_write_all --|-----write(2)|--|
> 1: |---MPI_File_write_all --|
> 2: |---MPI_File_write_all --|
>
> rank 0 spends, say, 100 seconds in MPI_File_write_all and 50 seconds
> in posix write.
>
> rank 1 and 2 don't do I/O and spend 50 seconds in MPI_File_write_all
> and 0 seconds in posix write.
>
> Now I look at CP_F_MPI_WRITE_TIME compared to CP_F_POSIX_WRITE_TIME,
> and see that I spent 200 seconds in MPI write, and only 50 seconds in
> posix write. Wow, mpi has 150/200 == 75% overhead!
>
> Ok, mpi kind of does have that overhead if you look at it in one way:
> if MPI_File_write_all on 1 and 2 returned sooner, some other work
> could happen on those processors. But the time in MPI-IO overlaps and
> so we are kind of counting the time in MPI-IO too many times.
>
> The problem is actually exaggerated even further on real BlueGene
> runs: the aggregator ratio is 1:32.
>
> I've been trying to come up with a metric for "collective
> MPI-IO overhead" derived from CP_F_MPI_WRITE_TIME,
> CP_F_POSIX_WRITE_TIME, and my separate knowledge of how many I/O
> aggregators there are. The problem is my derived metric sometimes
> gives negative numbers :>
>
> Take the example above: I really want to see a metric that tells me
> "half of your time was in MPI (the two-phase optimization) and half
> was in POSIX I/O". Since I'm looking for percentages and not absolute
> numbers, I thought I could play with some scaling factors:
>
> - divide overhead by nprocs: now we are not exaggerating MPI-IO
> overhead, but we have drastically under-stated POSIX time.
>
> - Ok, so take that "posix time per processor" and multiply by 32 (or
> whatever the aggregator ratio is for a run): that overstates the
> posix time (though I'm not sure why... really thought that one was
> the trick), giving you more time spent in posix than in MPI-IO (and
> a negative overhead)
>
> Does the wider darshan community have any suggestions?
>
> ==rob
>
>