[Darshan-users] overstating mpi-io time
Phil Carns
carns at mcs.anl.gov
Thu Sep 2 15:16:30 CDT 2010
I don't know if I count as "wider darshan community" really, but I do
have some ideas :)
I happen to have been thinking about a similar problem again today.
already. Kevin and I worked out a scheme a while back for estimating
the aggregate I/O performance of an application based strictly on data
from Darshan, and I'm trying to add that into the job summary script
right now. It has a similar problem, which is basically how to compute
aggregate performance if not all processes are contributing the same
amount of time or data. What you really want is the absolute time
(across all procs) regardless of how many procs participated.
To make a long story short, we found that there was no way to work
backwards from the aggregate time on a shared file to an absolute time
that held in every case. What you have to do instead
is figure out how to time the slowest rank on any given file. In your
case you need:
- how much time was spent in MPI_File_write_all() by the slowest rank
that participated
- how much time was spent in posix calls on that file by the slowest
rank that participated
I'm not sure if we have enough data currently to pull that off, though.
Kevin recently added new fields that will tell you the fastest rank and
slowest rank on a shared file along with the number of bytes and seconds
that each one took and the variance. I'm not sure if it always uses
posix or if it switches to MPI counters on MPI files, but at any rate it
definitely doesn't report both. It sounds like maybe that's the thing
to do, though - if you just knew the max time for posix and mpi
separately, it would answer the question.
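To make that concrete, here is a minimal sketch of the metric I mean. The function name and plain-float inputs are purely illustrative (this is not Darshan's API); it just assumes you already have the slowest rank's time in MPI_File_write_all and in posix writes for one shared file:

```python
# Illustrative sketch, not actual Darshan log parsing: given the slowest
# rank's MPI and posix write times on a shared file, the fraction of MPI
# time that is "overhead" (i.e. not spent in posix I/O) falls out directly.
def mpi_overhead_fraction(max_mpi_write_time, max_posix_write_time):
    """Fraction of the slowest rank's MPI write time not spent in posix writes."""
    return (max_mpi_write_time - max_posix_write_time) / max_mpi_write_time

# Rob's example below: rank 0 is the slowest, with 100s in
# MPI_File_write_all and 50s in posix write.
print(mpi_overhead_fraction(100.0, 50.0))  # 0.5
```

That gives the "half MPI, half POSIX" answer the example calls for, without ever summing times across ranks.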
We figured out a way to guesstimate the absolute time before that feature
was added, by just measuring the time between the first open and the
last IO. You can still do that to get a reasonable estimate, but we
don't split that up by MPI and POSIX either, unfortunately. You can
only get the posix open timestamp.
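For what it's worth, that estimate is trivial to compute once you have the per-file timestamps out of a log. A rough sketch, assuming the timestamps are already extracted as plain floats (the function and argument names here are made up for illustration, not real counter fields):

```python
# Rough sketch of the "first open to last I/O" estimate: elapsed seconds
# between the (posix) open and whichever read or write finished last.
# Inputs are hypothetical floats pulled from a parsed log, not a real API.
def estimate_io_time(open_ts, last_read_end_ts, last_write_end_ts):
    """Elapsed seconds between the first open and the last read or write."""
    return max(last_read_end_ts, last_write_end_ts) - open_ts

# e.g. a file opened at t=10.0s whose last write finished at t=85.0s
print(estimate_io_time(10.0, 42.0, 85.0))  # 75.0
```

As noted above, this only gives one combined number; it can't split the interval between MPI and POSIX time.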
-Phil
On 09/02/2010 03:18 PM, Rob Latham wrote:
> Let's consider the case of collective file I/O. ROMIO uses
> "aggregation" to have a subset of processors carry out I/O. On
> BlueGene, that aggregation means for every 256 processors (in virtual
> node mode), 8 of them actually do I/O.
>
> Imagine the following ASCII art is actually Jumpshot
>
> 0: |---MPI_File_write_all --|-----write(2)|--|
> 1: |---MPI_File_write_all --|
> 2: |---MPI_File_write_all --|
>
> rank 0 spends, say, 100 seconds in MPI_File_write_all and 50 seconds
> in posix write.
>
> rank 1 and 2 don't do I/O and spend 50 seconds in MPI_File_write_all
> and 0 seconds in posix write.
>
> Now I look at CP_F_MPI_WRITE_TIME compared to CP_F_POSIX_WRITE_TIME,
> and see that I spent 200 seconds in MPI write, and only 50 seconds in
> posix write. Wow, mpi has 150/200 == 75% overhead!
>
> Ok, mpi kind of does have that overhead if you look at it in one way:
> if MPI_File_write_all on 1 and 2 returned sooner, some other work
> could happen on those processors. But the time in MPI-IO overlaps and
> so we are kind of counting the time in MPI-IO too many times.
>
> The problem is actually exaggerated even further on real BlueGene
> runs: the aggregator ratio is 1:32.
>
> I've been trying to come up with a metric for "collective
> MPI-IO overhead" derived from CP_F_MPI_WRITE_TIME,
> CP_F_POSIX_WRITE_TIME, and my separate knowledge of how many I/O
> aggregators there are. The problem is my derived metric sometimes
> gives negative numbers :>
>
> Take the example above: I really want to see a metric that tells me
> "half of your time was in MPI (the two-phase optimization) and half
> was in POSIX I/O". Since I'm looking for percentages and not absolute
> numbers, I thought I could play with some scaling factors:
>
> - divide overhead by nprocs: now we are not exaggerating MPI-IO
> overhead, but we have drastically under-stated POSIX time.
>
> - Ok, so take that "posix time per processor" and multiply by 32 (or
> whatever the aggregator ratio is for a run): that overstates the
> posix time (though I'm not sure why... really thought that one was
> the trick), giving you more time spent in posix than in MPI-IO (and
> a negative overhead)
>
> Does the wider darshan community have any suggestions?
>
> ==rob
>
>