Performance tuning problem with iput_vara_double/wait_all

Wei-keng Liao wkliao at ece.northwestern.edu
Sun Dec 30 12:19:53 CST 2012


On Dec 30, 2012, at 9:27 AM, Liu, Jaln wrote:

> Hi Dr. Liao,
> 
> 1. Why would the code produce a file type that goes against the requirement?
> Does it mean that the I/O of each process will span all the datasets, and may result in decreasing offsets?

Each iput call produces a filetype that abides by the requirement of monotonically
non-decreasing file offsets. But concatenating two such filetypes can end up with a
filetype that violates this requirement.
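To illustrate, here is a toy sketch (made-up offsets, not PnetCDF's internal code):
suppose one iput request touches file offsets 0 and 1000 and a second touches 500 and
1500. Each filetype alone is monotonic; their concatenation is not.

    #include <mpi.h>

    /* Toy example: two individually valid filetypes whose concatenation
     * visits offsets 0, 1000, 500, 1500 -- not non-decreasing. */
    MPI_Datatype concat_two_filetypes(void)
    {
        int      len1[2]  = {8, 8};        /* 8-byte elements            */
        MPI_Aint disp1[2] = {0, 1000};     /* request 1: offsets 0, 1000 */
        int      len2[2]  = {8, 8};
        MPI_Aint disp2[2] = {500, 1500};   /* request 2: offsets 500, 1500 */

        MPI_Datatype ftype1, ftype2, concat;
        MPI_Type_create_hindexed(2, len1, disp1, MPI_BYTE, &ftype1);
        MPI_Type_create_hindexed(2, len2, disp2, MPI_BYTE, &ftype2);

        /* Glue the two together in request order.  The resulting typemap
         * is no longer legal as an MPI-IO file view, because MPI-IO
         * requires the displacements to be monotonically non-decreasing. */
        int          blens[2] = {1, 1};
        MPI_Aint     disps[2] = {0, 0};
        MPI_Datatype types[2] = {ftype1, ftype2};
        MPI_Type_create_struct(2, blens, disps, types, &concat);
        MPI_Type_commit(&concat);
        MPI_Type_free(&ftype1);
        MPI_Type_free(&ftype2);
        return concat;
    }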

> 
> 2. As Phil said, can we deconstruct the write call and sort the offsets before performing the actual collective write? I didn't understand why we need more space for the sorting,

When a filetype is flattened into a list of (offset, len) pairs, each pair corresponds
to a portion of the user's buffer(s). When sorting the offsets, the sub-buffer pointers
must move along with the offsets. Note that those sub-buffers may have different
lengths. After the offsets are sorted, you still have to create a filetype and
a buffer type. For the buffer type, you can either create a derived data type covering
all the sub-buffers or copy the contents of all sub-buffers into one contiguous buffer.
If you choose the latter, additional space is needed.

Also, sorting the offsets needs an additional array of C structs to store the offsets,
lengths, and buffer pointers. In Phil's case, where each offset-length pair covers only
one array element (8 bytes), the space for this C struct array will be larger than the
write data itself.
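To make this concrete, a rough sketch of what the sort would look like (the names and
details are made up, not PnetCDF's internal code): each flattened piece carries its file
offset, its length, and a pointer into the user's buffer, and both a filetype and a
buffer type are rebuilt after the sort.

    #include <stdlib.h>
    #include <mpi.h>

    typedef struct {
        MPI_Offset off;   /* file offset of this contiguous piece  */
        int        len;   /* length in bytes                       */
        void      *buf;   /* pointer into the user's buffer(s)     */
    } flat_pair;

    static int cmp_off(const void *a, const void *b)
    {
        const flat_pair *x = a, *y = b;
        return (x->off > y->off) - (x->off < y->off);
    }

    void build_sorted_types(flat_pair *p, int n,
                            MPI_Datatype *filetype, MPI_Datatype *buftype)
    {
        /* sort by file offset; the buffer pointers travel with the offsets */
        qsort(p, n, sizeof(flat_pair), cmp_off);

        int      *lens  = malloc(n * sizeof(int));
        MPI_Aint *fdisp = malloc(n * sizeof(MPI_Aint));
        MPI_Aint *mdisp = malloc(n * sizeof(MPI_Aint));
        for (int i = 0; i < n; i++) {
            lens[i]  = p[i].len;
            fdisp[i] = (MPI_Aint) p[i].off;
            MPI_Get_address(p[i].buf, &mdisp[i]);   /* absolute address */
        }
        /* file layout */
        MPI_Type_create_hindexed(n, lens, fdisp, MPI_BYTE, filetype);
        /* memory layout: used with MPI_BOTTOM in the write call, so no
         * copy into a contiguous buffer is needed */
        MPI_Type_create_hindexed(n, lens, mdisp, MPI_BYTE, buftype);
        MPI_Type_commit(filetype);
        MPI_Type_commit(buftype);
        free(lens); free(fdisp); free(mdisp);
    }

Note the bookkeeping cost: on a typical 64-bit system each entry above is roughly 24
bytes (offset, length, pointer, plus padding), which in the one-element-per-pair case
exceeds the 8 bytes of data it describes.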

If you have an idea of doing this differently, please do share it with us.


> can we just perform the deconstructing and sorting together during the traditional aggregator assignment (when the requests are sorted and the aggregators are assigned with partitioned requests)?

Are you referring to the aggregators at the MPI-IO level? Please note that
two-phase I/O is not visible to PnetCDF.
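(Just for context, not a fix for this issue: those aggregators live inside the MPI-IO
library, and from PnetCDF the usual way to influence them is through MPI-IO hints passed
in the MPI_Info argument of ncmpi_create/ncmpi_open, which PnetCDF forwards down to
MPI_File_open. A minimal sketch, with illustrative hint values and assuming ROMIO
underneath:)

    #include <mpi.h>
    #include <pnetcdf.h>

    int create_with_hints(MPI_Comm comm, const char *path, int *ncidp)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        /* ROMIO hints steering two-phase (collective buffering) I/O;
         * PnetCDF passes the info object down to MPI-IO. */
        MPI_Info_set(info, "cb_nodes", "8");            /* number of I/O aggregators */
        MPI_Info_set(info, "romio_cb_write", "enable");

        int err = ncmpi_create(comm, path, NC_CLOBBER | NC_64BIT_DATA, info, ncidp);
        MPI_Info_free(&info);
        return err;
    }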


Wei-keng


> Jialin
> 
> ________________________________________
> From: parallel-netcdf-bounces at lists.mcs.anl.gov [parallel-netcdf-bounces at lists.mcs.anl.gov] on behalf of Wei-keng Liao [wkliao at ece.northwestern.edu]
> Sent: Friday, December 28, 2012 11:29 AM
> To: parallel-netcdf at lists.mcs.anl.gov
> Subject: Re: Performance tuning problem with iput_vara_double/wait_all
> 
> Hi, Phil,
> 
> I have seen cases in which the read access patterns are orthogonal to
> the write patterns. Data layout reorganization is commonly performed
> offline.
> 
> You are very welcome to give sorting the file offsets a try.
> Please note that while sorting the file offsets, you also have
> to move the buffer data along with the offsets. There may be
> additional memory space required for this. If you find a nice
> solution and successfully implement it, it will be very useful
> for PnetCDF.
> 
> Wei-keng
> 
> On Dec 27, 2012, at 5:18 PM, Phil Miller wrote:
> 
>> So, it turns out that if I transpose the file to be (pft, week, lat,
>> lon), with only the minimal necessary changes to the code, I get over
>> 1GB/second, which is fast enough for our needs. I'll have to see if
>> the transposition will require any changes more substantial than the
>> corresponding read code in this application, but I think the immediate
>> problem is basically solved.
>> 
>> As for the broader issue of higher-level aggregation and
>> non-decreasing file offsets, might it be possible to deconstruct and
>> sort the write requests as they're provided? If this would be useful
>> to pNetCDF, I may see about implementing it.
>> 
>> On Thu, Dec 27, 2012 at 3:56 PM, Wei-keng Liao
>> <wkliao at ece.northwestern.edu> wrote:
>>> Hi, Phil,
>>> 
>>> Sorry, I misread your code.
>>> My understanding now is that each process is writing 24*52 array elements,
>>> and each of them is written to a file location at a distance of (numlon*numlat)
>>> from its next/previous element.
>>> 
>>> Each PnetCDF nonblocking API call defines an MPI derived data type as
>>> a filetype to represent the file access layout for that call. All the
>>> filetypes are then concatenated into a single one, which is later used
>>> in a call to an MPI collective write.
>>> 
>>> The problem is that MPI-IO requires the filetype to contain monotonically
>>> non-decreasing file offsets. Your code will produce a concatenated filetype
>>> that goes against this requirement, which leaves PnetCDF unable to aggregate
>>> the requests the way you think it should. In the end, PnetCDF will make
>>> several MPI collective I/O calls, each of them accessing non-contiguous
>>> file locations.
>>> 
>>> Maybe a high-level aggregation will be unavoidable in your case.
>>> 
>>> Wei-keng
>>> 
>>> On Dec 27, 2012, at 3:03 PM, Phil Miller wrote:
>>> 
>>>> On Thu, Dec 27, 2012 at 2:47 PM, Wei-keng Liao
>>>> <wkliao at ece.northwestern.edu> wrote:
>>>>> It looks like you are writing one array element of 8 bytes at a time.
>>>>> Performance will definitely be poor for this I/O strategy.
>>>>> 
>>>>> Please check if the indices translated by the mask are actually continuous.
>>>>> If so, you can replace the loop with one write call.
>>>> 
>>>> Hi Wei-keng,
>>>> 
>>>> I don't think it should be 8 bytes (1 element) at a time - each call
>>>> should deposit 24*52 elements of data, spanning the entirety of the
>>>> pft and weeks dimensions. Do you mean that because the file data isn't
>>>> contiguous in those dimensions, the writes won't end up being
>>>> combined?
>>>> 
>>>> I could see what happens if I transpose the file dimensions to put pft
>>>> and weeks first, I guess.
>>>> 
>>>> Further, the documentation seems to indicate that by using collective
>>>> data mode and non-blocking puts, the library will use MPI-IO's
>>>> collective write calls to perform a two-phase redistribute-and-write,
>>>> avoiding actually sending any tiny writes to the filesystem.
>>>> 
>>>> Perhaps I've misunderstood something on one of those points, though?
>>>> 
>>>> As for the mask continuity, the indices are explicitly not continuous. It's
>>>> round-robin over the Earth's surface, so as to obtain good load
>>>> balance, with gaps where a grid point is over water rather than land. I
>>>> have some other code that I'm adding now that fills in the gaps with a
>>>> sentinel value, but making even those continuous will be challenging.
>>>> 
>>>> Phil
>>> 
> 


