[Darshan-users] overstating mpi-io time

Rob Latham robl at mcs.anl.gov
Thu Sep 2 14:18:07 CDT 2010


Let's consider the case of collective file I/O.  ROMIO uses
"aggregation" to have a subset of processors carry out I/O.  On
BlueGene, that aggregation means for every 256 processors (in virtual
node mode), 8 of them actually do I/O.
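
For a concrete (made-up) illustration, here is roughly what such a
collective write looks like; the "cb_nodes" hint is ROMIO's generic
knob for the number of aggregators, though BlueGene normally picks its
own defaults, and the file name and buffer size here are arbitrary:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File   fh;
        MPI_Info   info;
        MPI_Offset offset;
        double     buf[1024] = { 0 };
        int        rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Generic ROMIO hint: use 8 collective-buffering (aggregator)
         * nodes.  On BlueGene the system picks its own defaults. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_nodes", "8");

        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* Every rank enters the collective, but only the aggregators
         * issue the underlying write(2) calls. */
        offset = (MPI_Offset)rank * sizeof(buf);
        MPI_File_write_at_all(fh, offset, buf, 1024, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }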

Imagine the following ASCII art is actually Jumpshot output:

0: |---MPI_File_write_all --|-----write(2)|--|
1: |---MPI_File_write_all --|
2: |---MPI_File_write_all --|

Rank 0 spends, say, 100 seconds in MPI_File_write_all and 50 seconds
in the POSIX write.

Ranks 1 and 2 don't do any I/O themselves; they spend 50 seconds in
MPI_File_write_all and 0 seconds in the POSIX write.

Now I compare CP_F_MPI_WRITE_TIME to CP_F_POSIX_WRITE_TIME and see
that I spent 200 seconds in MPI write but only 50 seconds in POSIX
write.  Wow, MPI has 150/200 == 75% overhead!
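
Just to make that arithmetic concrete, here is a tiny sketch of the
naive calculation, using the made-up per-rank numbers above:

    #include <stdio.h>

    int main(void)
    {
        /* Made-up per-rank times (seconds) from the example above. */
        double mpi_write[3]   = { 100.0, 50.0, 50.0 };  /* CP_F_MPI_WRITE_TIME   */
        double posix_write[3] = {  50.0,  0.0,  0.0 };  /* CP_F_POSIX_WRITE_TIME */
        double mpi_sum = 0.0, posix_sum = 0.0;
        int i;

        for (i = 0; i < 3; i++) {
            mpi_sum   += mpi_write[i];
            posix_sum += posix_write[i];
        }

        /* "Overhead" = time in MPI_File_write_all not spent in write(2).
         * Prints 75%, even though much of that time overlaps across ranks. */
        printf("naive MPI-IO overhead: %.0f%%\n",
               100.0 * (mpi_sum - posix_sum) / mpi_sum);
        return 0;
    }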

OK, MPI kind of does have that overhead if you look at it one way:
if MPI_File_write_all on ranks 1 and 2 returned sooner, some other
work could happen on those processors.  But the time in MPI-IO
overlaps across ranks, so summing it counts the same wall-clock time
too many times.

The problem is actually exaggerated even further on real BlueGene
runs: the aggregator ratio is 1:32. 

I've been trying to come up with a metric for "collective
MPI-IO overhead" derived from CP_F_MPI_WRITE_TIME,
CP_F_POSIX_WRITE_TIME, and my separate knowledge of how many I/O
aggregators there are.  The problem is my derived metric sometimes
gives negative numbers :>

Take the example above: I really want a metric that tells me "half
of your time was in MPI (the two-phase optimization) and half was in
POSIX I/O".  Since I'm looking for percentages and not absolute
numbers, I thought I could play with some scaling factors:

- divide the overhead by nprocs: now we are not exaggerating MPI-IO
  overhead, but we have drastically understated the POSIX time.

- OK, so take that "POSIX time per processor" and multiply it by 32
  (or whatever the aggregator ratio is for a given run): that
  overstates the POSIX time (though I'm not sure why... I really
  thought that one was the trick), giving more time spent in POSIX
  than in MPI-IO (and a negative overhead).  A small numeric sketch
  of both attempts follows.
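
Concretely, applying both scalings to the summed counters from the
three-rank example (derived_overhead() is just a made-up name for the
attempt, not anything Darshan reports):

    #include <stdio.h>

    /* Hypothetical derived metric: fraction of per-process MPI-IO write
     * time not accounted for by (scaled) POSIX write time. */
    static double derived_overhead(double mpi_sum, double posix_sum,
                                   int nprocs, int agg_ratio)
    {
        double mpi_per_proc   = mpi_sum / nprocs;
        double posix_per_proc = posix_sum / nprocs;          /* attempt 1 */
        double posix_scaled   = posix_per_proc * agg_ratio;  /* attempt 2 */
        return (mpi_per_proc - posix_scaled) / mpi_per_proc;
    }

    int main(void)
    {
        /* Summed counters from the three-rank example above. */
        double mpi_sum   = 200.0;  /* CP_F_MPI_WRITE_TIME   */
        double posix_sum =  50.0;  /* CP_F_POSIX_WRITE_TIME */

        /* With the toy example's own 1:3 aggregator ratio the metric
         * stays positive ... */
        printf("ratio 1:3  -> overhead %+.2f\n",
               derived_overhead(mpi_sum, posix_sum, 3, 3));

        /* ... but scaling the same numbers by a 1:32 ratio makes the
         * scaled POSIX time (533 s) exceed the per-process MPI-IO time
         * (67 s), so the derived overhead goes negative. */
        printf("ratio 1:32 -> overhead %+.2f\n",
               derived_overhead(mpi_sum, posix_sum, 3, 32));
        return 0;
    }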

Does the wider darshan community have any suggestions?

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

