Performance tuning problem with iput_vara_double/wait_all

Sun Dec 30 09:27:00 CST 2012

Hi Dr. Liao,

1. Why would the code produce a file type that violates the requirement?
Does it mean that the I/O of each process will span all the datasets, and may result in decreasing offsets?

2. As Phil said, can we deconstruct the write call and sort the offsets before performing the actual collective write? I don't understand why we need more space for the sorting. Can we just perform the deconstructing and sorting together during the traditional aggregator assignment (when the requests are sorted and the aggregators are assigned the partitioned requests)?
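
To make question 2 concrete, here is a minimal sketch of combining the deconstruction and the sorting in one pass. It assumes each request has already been flattened into (offset, length) pairs that are increasing within that request, so only a merge across requests is needed. The function name and the tuple layout are hypothetical, and this is not how PnetCDF or the MPI-IO layer actually implements aggregation.

```python
import heapq


def merge_requests(requests):
    """Merge per-request segment lists into one list ordered by file offset.

    requests: list of lists of (offset, length) pairs; each inner list is
    assumed to be sorted by offset already.
    Returns (offset, length, request_index, segment_index) tuples, so the
    data of every segment can still be located in its original buffer.
    """
    tagged = (
        [(off, length, r, s) for s, (off, length) in enumerate(req)]
        for r, req in enumerate(requests)
    )
    # k-way merge of sorted runs; tuples compare by offset first.
    return list(heapq.merge(*tagged))


# Two requests whose offsets interleave (element offsets, illustrative only).
requests = [
    [(1, 1), (7, 1), (13, 1)],
    [(4, 1), (10, 1), (16, 1)],
]
merged = merge_requests(requests)
print([m[0] for m in merged])  # [1, 4, 7, 10, 13, 16]
```

The merged list is only an index: the tags record where each segment's data lives, so no data is moved at this stage.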

Jialin

________________________________________
From: parallel-netcdf-bounces at lists.mcs.anl.gov [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Wei-keng Liao [wkliao at ece.northwestern.edu]
Sent: Friday, December 28, 2012 11:29 AM
To: parallel-netcdf at lists.mcs.anl.gov
Subject: Re: Performance tuning problem with iput_vara_double/wait_all

Hi, Phil,

I have seen cases where the read access patterns are orthogonal to
the write patterns. Data layout reorganization is commonly performed
offline.

You are very welcome to try sorting the file offsets.
Please note that while sorting the file offsets, you also have
to move the buffer data along with the offsets. This may require
additional memory space. If you find a nice
solution and successfully implement it, it will be very useful
for PnetCDF.
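
As an illustration of the point about moving the buffer data, here is a minimal sketch of an out-of-place sort. It assumes the pending writes have been flattened into parallel lists of offsets and lengths, with the data packed back to back in request order. The function name and data layout are hypothetical and are not taken from PnetCDF.

```python
def sort_requests(offsets, lengths, buf):
    """Sort segments by file offset and reorder the packed buffer to match.

    offsets[i], lengths[i]: file offset and byte length of segment i.
    buf: bytes holding the segments' data back to back in request order.
    """
    # Starting position of each segment inside the packed buffer.
    starts = []
    pos = 0
    for n in lengths:
        starts.append(pos)
        pos += n

    # Permutation that orders the segments by file offset.
    order = sorted(range(len(offsets)), key=lambda i: offsets[i])

    # Out-of-place copy: this second buffer is the additional memory.
    sorted_buf = bytearray(len(buf))
    out = 0
    for i in order:
        sorted_buf[out:out + lengths[i]] = buf[starts[i]:starts[i] + lengths[i]]
        out += lengths[i]

    return ([offsets[i] for i in order],
            [lengths[i] for i in order],
            bytes(sorted_buf))


offsets = [48, 0, 24]
lengths = [8, 8, 8]
buf = b"C" * 8 + b"A" * 8 + b"B" * 8
new_off, new_len, new_buf = sort_requests(offsets, lengths, buf)
print(new_off)  # [0, 24, 48]
```

In this sketch the extra space equals the size of the data being written, because the data is copied into a second buffer in sorted order.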

Wei-keng

On Dec 27, 2012, at 5:18 PM, Phil Miller wrote:

> So, it turns out that if I transpose the file to be (pft, week, lat,
> lon), with only the minimal necessary changes to the code, I get over
> 1GB/second, which is fast enough for our needs. I'll have to see if
> the transposition will require any changes more substantial than the
> corresponding read code in this application, but I think the immediate
> problem is basically solved.
>
> As for the broader issue of higher-level aggregation and
> non-decreasing file offsets, might it be possible to deconstruct and
> sort the write requests as they're provided? If this would be useful
> to pNetCDF, I may see about implementing it.
>
> On Thu, Dec 27, 2012 at 3:56 PM, Wei-keng Liao
> <wkliao at ece.northwestern.edu> wrote:
>> Hi, Phil,
>>
>> Sorry, I misread your code.
>> My understanding now is that each process is writing 24*52 array elements,
>> and each of them is written to a file location at a distance of
>> (numlon*numlat) from its next/previous element.
>>
>> PnetCDF nonblocking APIs define an MPI derived data type as
>> a file type to represent the file access layout for that call. Then all
>> file types are concatenated into a single one which is later used to
>> make a call to MPI collective write.
>>
>> The problem is that MPI-IO requires the file type to contain monotonically
>> non-decreasing file offsets. So your code will produce a concatenated file
>> type that violates this requirement, leaving PnetCDF unable to aggregate the
>> requests the way you think it should. In the end, PnetCDF will make
>> several MPI collective I/O calls, each of them accessing non-contiguous
>> file locations.
>>
>> Maybe a high-level aggregation will be unavoidable in your case.
>>
>> Wei-keng
>>
>> On Dec 27, 2012, at 3:03 PM, Phil Miller wrote:
>>
>>> On Thu, Dec 27, 2012 at 2:47 PM, Wei-keng Liao
>>> <wkliao at ece.northwestern.edu> wrote:
>>>> It looks like you are writing one array element of 8 bytes at a time.
>>>> Performance will definitely be poor for this I/O strategy.
>>>>
>>>> Please check if the indices translated by mask are actually continuous.
>>>> If so, you can replace the loop with one write call.
>>>
>>> Hi Wei-keng,
>>>
>>> I don't think it should be 8 bytes (1 element) at a time - each call
>>> should deposit 24*52 elements of data, spanning the entirety of the
>>> pft and weeks dimensions. Do you mean that because the file data isn't
>>> contiguous in those dimensions, the writes won't end up being
>>> combined?
>>>
>>> I could see what happens if I transpose the file dimensions to put pft
>>> and weeks first, I guess.
>>>
>>> Further, the documentation seems to indicate that by using collective
>>> data mode and non-blocking puts, the library will use MPI-IO's
>>> collective write calls to perform a two-phase redistribute-and-write,
>>> avoiding actually sending any tiny writes to the filesystem.
>>>
>>> Perhaps I've misunderstood something on one of those points, though?
>>>
>>> As for the mask continuity, they're explicitly not continuous. It's
>>> round-robin over the Earth's surface, so as to obtain good load
>>> balance, with gaps where a grid point is over water versus land. I
>>> have some other code that I'm adding now that fills in the gaps with a
>>> sentinel value, but making even those continuous will be challenging.
>>>
>>> Phil
>>
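
The exchange quoted above can be illustrated with a small sketch that computes element offsets for the two layouts and checks whether one process's concatenated requests are non-decreasing. The sizes and the grid-point assignment are made up for illustration, and the layouts are described only by which index varies fastest, which is an assumption about the file rather than something stated in the thread.

```python
def strided_offsets(point, npoints, nelems):
    """Offsets when the grid-point index varies fastest.

    The elements of one grid point sit npoints apart from each other.
    """
    return [k * npoints + point for k in range(nelems)]


def contiguous_offsets(point, nelems):
    """Offsets when the (pft, week) index varies fastest."""
    return [point * nelems + k for k in range(nelems)]


def is_non_decreasing(offsets):
    """True if every offset is >= the one before it."""
    return all(a <= b for a, b in zip(offsets, offsets[1:]))


npoints = 6         # hypothetical numlon * numlat
nelems = 4          # stands in for 24 * 52
my_points = [1, 4]  # hypothetical grid points owned by one process

# Concatenate the per-call offsets in the order the calls are posted.
strided = [o for p in my_points for o in strided_offsets(p, npoints, nelems)]
contig = [o for p in my_points for o in contiguous_offsets(p, nelems)]

print(strided, is_non_decreasing(strided))
# [1, 7, 13, 19, 4, 10, 16, 22] False
print(contig, is_non_decreasing(contig))
# [4, 5, 6, 7, 16, 17, 18, 19] True
```

In the strided case the second call starts at an offset below where the first call ended, so the concatenation is not monotonic. In the transposed case each call covers one contiguous run and the runs stay in increasing order.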