[Darshan-users] overstating mpi-io time
Rob Latham
robl at mcs.anl.gov
Thu Sep 2 14:18:07 CDT 2010
Let's consider the case of collective file I/O. ROMIO uses
"aggregation" to have a subset of processors carry out the I/O. On
BlueGene, that aggregation means that for every 256 processors (in
virtual node mode), 8 of them actually do I/O.
Imagine the following ASCII art is actually a Jumpshot timeline:
0: |---MPI_File_write_all --|-----write(2)|--|
1: |---MPI_File_write_all --|
2: |---MPI_File_write_all --|
Rank 0 spends, say, 100 seconds in MPI_File_write_all and 50 seconds
in POSIX write.
Ranks 1 and 2 don't do any I/O themselves and spend 50 seconds in
MPI_File_write_all and 0 seconds in POSIX write.
Now I look at CP_F_MPI_WRITE_TIME compared to CP_F_POSIX_WRITE_TIME
and see that I spent 200 seconds in MPI write, but only 50 seconds in
POSIX write. Wow, MPI has 150/200 == 75% overhead!
OK, MPI kind of does have that overhead if you look at it one way:
if MPI_File_write_all on ranks 1 and 2 returned sooner, some other
work could happen on those processors. But the time in MPI-IO overlaps
across ranks, so we are effectively counting it too many times.
The problem is actually exaggerated even further on real BlueGene
runs: the aggregator ratio is 1:32.
I've been trying to come up with a metric for "collective
MPI-IO overhead" derived from CP_F_MPI_WRITE_TIME,
CP_F_POSIX_WRITE_TIME, and my separate knowledge of how many I/O
aggregators there are. The problem is my derived metric sometimes
gives negative numbers :>
Take the example above: I really want a metric that tells me
"half of your time was in MPI (the two-phase optimization) and half
was in POSIX I/O". Since I'm looking for percentages and not absolute
numbers, I thought I could play with some scaling factors:
- Divide the overhead by nprocs: now we are not exaggerating MPI-IO
  overhead, but we have drastically under-stated the POSIX time.
- OK, so take that "POSIX time per processor" and multiply by 32 (or
  whatever the aggregator ratio is for a run): that overstates the
  POSIX time (though I'm not sure why... I really thought that one was
  the trick), giving you more time spent in POSIX than in MPI-IO, and
  a negative overhead (see the sketch after this list).
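Roughly, that second attempt looks something like the following sketch
(the names and the exact arithmetic are just my shorthand for the idea,
not anything Darshan computes):

    def scaled_overhead_pct(mpi_time, posix_time, nprocs, aggregator_ratio):
        # Per-process MPI time, and per-process POSIX time scaled back up
        # by the aggregator ratio (32 on BlueGene, 3 in the toy example).
        mpi_per_proc = mpi_time / nprocs
        posix_scaled = (posix_time / nprocs) * aggregator_ratio
        # Goes negative whenever posix_scaled exceeds mpi_per_proc -- the
        # "more time in POSIX than in MPI-IO" case above.
        return 100.0 * (mpi_per_proc - posix_scaled) / mpi_per_proc

    # Toy example: 3 ranks, 1 aggregator, 200s of MPI write, 50s of POSIX.
    print(scaled_overhead_pct(200.0, 50.0, 3, 3))   # -> 25.0, not the
                                                    #    50% I'm after

On real BlueGene logs with the 1:32 ratio, the scaled POSIX term can come
out larger than the per-process MPI time, which is where the negative
numbers come from.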
Does the wider darshan community have any suggestions?
==rob
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA