[Darshan-users] overstating mpi-io time
Kevin Harms
harms at alcf.anl.gov
Fri Sep 3 09:16:33 CDT 2010
The fastest/slowest data is POSIX-based, but you could probably add the same thing for the MPI-IO timings. In your case, I suppose the overhead could be computed just by comparing the slowest POSIX rank with the slowest MPI rank?
http://trac.mcs.anl.gov/projects/darshan/browser/trunk/lib/darshan-mpi-io.c#L1175
http://trac.mcs.anl.gov/projects/darshan/browser/trunk/lib/darshan-mpi-io.c#L1448
http://trac.mcs.anl.gov/projects/darshan/browser/trunk/lib/darshan-mpi-io.c#L1902
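
A minimal sketch of that comparison, using the numbers from Rob's three-rank example quoted below. The per-rank lists and the function names (`aggregate_overhead`, `slowest_rank_overhead`) are hypothetical and only for illustration; Darshan does not hand you per-rank MPI-IO times in this form today, so this assumes the slowest-rank MPI timing has been added. Note also that the slowest MPI rank and the slowest POSIX rank need not be the same rank in general.

```python
def aggregate_overhead(mpi_times, posix_times):
    """Overhead from times summed over all ranks (overlapping time is counted repeatedly)."""
    total_mpi = sum(mpi_times)
    total_posix = sum(posix_times)
    return (total_mpi - total_posix) / total_mpi


def slowest_rank_overhead(mpi_times, posix_times):
    """Overhead from the slowest MPI rank and the slowest POSIX rank only."""
    slowest_mpi = max(mpi_times)
    slowest_posix = max(posix_times)
    return (slowest_mpi - slowest_posix) / slowest_mpi


# Hypothetical per-rank seconds, taken from the example in the quoted message:
# rank 0 is the aggregator; ranks 1 and 2 only wait inside MPI_File_write_all.
mpi_times = [100.0, 50.0, 50.0]
posix_times = [50.0, 0.0, 0.0]

print(aggregate_overhead(mpi_times, posix_times))     # 0.75
print(slowest_rank_overhead(mpi_times, posix_times))  # 0.5
```

With the summed counters you get the inflated 75% figure; with the slowest ranks you get the "half in MPI, half in POSIX" split Rob was looking for.
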
kevin
On Sep 2, 2010, at 3:16 PM, Phil Carns wrote:
> I don't know if I count as "wider darshan community" really, but I do have some ideas :)
>
> I happen to have been thinking about a similar problem again today. Kevin and I worked out a scheme a while back for estimating the aggregate I/O performance of an application based strictly on data from Darshan, and I'm trying to add that to the job summary script right now. It has a similar problem: how to compute aggregate performance when not all processes contribute the same amount of time or data. What you really want is the absolute time (across all procs), regardless of how many procs participated.
>
> To make a long story short, we found no way to work backwards from aggregate time on shared files to an absolute time that worked in every case. What you have to do instead is figure out how to time the slowest rank on any given file. In your case you need:
>
> - how much time was spent in MPI_File_write_all() by the slowest rank that participated
> - how much time was spent in posix calls on that file by the slowest rank that participated
>
> I'm not sure if we have enough data currently to pull that off, though. Kevin recently added new fields that will tell you the fastest rank and slowest rank on a shared file, along with the number of bytes and seconds that each one took and the variance. I'm not sure if it always uses POSIX or if it switches to MPI counters on MPI files, but at any rate it definitely doesn't report both. It sounds like maybe that's the thing to do, though: if you just knew the max time for POSIX and MPI separately, it would answer the question.
>
> We figured out a way to guesstimate the absolute time before that feature was added, by just measuring the time between the first open and the last I/O. You can still do that to get a reasonable estimate, but we don't split that up by MPI and POSIX either, unfortunately. You can only get the POSIX open timestamp.
>
> -Phil
>
> On 09/02/2010 03:18 PM, Rob Latham wrote:
>> Let's consider the case of collective file I/O. ROMIO uses
>> "aggregation" to have a subset of processors carry out I/O. On
>> BlueGene, that aggregation means for every 256 processors (in virtual
>> node mode), 8 of them actually do I/O.
>>
>> Imagine the following ASCII art is actually Jumpshot:
>>
>> 0: |---MPI_File_write_all --|-----write(2)|--|
>> 1: |---MPI_File_write_all --|
>> 2: |---MPI_File_write_all --|
>>
>> rank 0 spends, say, 100 seconds in MPI_File_write_all and 50 seconds
>> in posix write.
>>
>> rank 1 and 2 don't do I/O and spend 50 seconds in MPI_File_write_all
>> and 0 seconds in posix write.
>>
>> Now I look at CP_F_MPI_WRITE_TIME compared to CP_F_POSIX_WRITE_TIME,
>> and see that I spent 200 seconds in MPI write, and only 50 seconds in
>> posix write. Wow, mpi has 150/200 == 75% overhead!
>>
>> Ok, mpi kind of does have that overhead if you look at it in one way:
>> if MPI_File_write_all on 1 and 2 returned sooner, some other work
>> could happen on those processors. But the time in MPI-IO overlaps and
>> so we are kind of counting the time in MPI-IO too many times.
>>
>> The problem is actually exaggerated even further on real BlueGene
>> runs: the aggregator ratio is 1:32.
>>
>> I've been trying to come up with a metric for "collective
>> MPI-IO overhead" derived from CP_F_MPI_WRITE_TIME,
>> CP_F_POSIX_WRITE_TIME, and my separate knowledge of how many I/O
>> aggregators there are. The problem is my derived metric sometimes
>> gives negative numbers :>
>>
>> Take the example above: I really want to see a metric that tells me
>> "half of your time was in MPI (the two-phase optimization) and half
>> was in POSIX I/O". Since I'm looking for percentages and not absolute
>> numbers, I thought I could play with some scaling factors:
>>
>> - divide overhead by nprocs: now we are not exaggerating MPI-IO
>> overhead, but we have drastically under-stated POSIX time.
>>
>> - Ok, so take that "posix time per processor" and multiply by 32 (or
>> whatever the aggregator ratio is for a run): that overstates the
>> posix time (though I'm not sure why... really thought that one was
>> the trick), giving you more time spent in posix than in MPI-IO (and
>> a negative overhead)
>>
>> Does the wider darshan community have any suggestions?
>>
>> ==rob
>>
>>
>