Performance tuning problem with iput_vara_double/wait_all

Phil Miller mille121 at illinois.edu
Thu Dec 27 17:18:49 CST 2012


So, it turns out that if I transpose the file to be (pft, week, lat,
lon), with only the minimal necessary changes to the code, I get over
1 GB/second, which is fast enough for our needs. I'll have to see
whether the transposition will require anything more substantial than
updating the corresponding read code in this application, but I think
the immediate problem is basically solved.
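
For concreteness, here is a minimal sketch of the write pattern under
discussion, using the PnetCDF C API. The dimension order follows the
transposed (pft, week, lat, lon) layout; the dimension sizes, gridcell
lists, and function name are placeholders rather than the actual
application code.

    /* Sketch only: one nonblocking put per masked gridcell, followed by a
     * single collective flush.  Error checking is omitted for brevity. */
    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    #define NPFT  24
    #define NWEEK 52

    void write_my_cells(int ncid, int varid, int ncells,
                        const MPI_Offset *my_lat, const MPI_Offset *my_lon,
                        const double *buf)  /* ncells * NPFT * NWEEK values */
    {
        int *reqs = (int*) malloc(ncells * sizeof(int));
        int *stat = (int*) malloc(ncells * sizeof(int));

        /* Each request covers the full pft x week block for one (lat, lon)
         * point of a variable declared as var(pft, week, lat, lon). */
        for (int i = 0; i < ncells; i++) {
            MPI_Offset start[4] = {0, 0, my_lat[i], my_lon[i]};
            MPI_Offset count[4] = {NPFT, NWEEK, 1, 1};
            ncmpi_iput_vara_double(ncid, varid, start, count,
                                   buf + (size_t)i * NPFT * NWEEK, &reqs[i]);
        }

        /* Pending requests are serviced here; in collective data mode this
         * is where PnetCDF issues its MPI-IO collective write(s). */
        ncmpi_wait_all(ncid, ncells, reqs, stat);

        free(reqs);
        free(stat);
    }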

As for the broader issue of higher-level aggregation and
non-decreasing file offsets, might it be possible to deconstruct and
sort the write requests as they're provided? If this would be useful
to PnetCDF, I may see about implementing it.
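
To make the suggestion a bit more concrete, here is a rough, purely
hypothetical sketch (this is not PnetCDF's existing internals): flatten
the pending requests into (offset, length) segments and sort them by
starting offset before the concatenated file type is built, so that its
displacements come out monotonically non-decreasing.

    /* Hypothetical helper, for illustration only: order flattened request
     * segments by file offset so a file type built from them ends up with
     * monotonically non-decreasing displacements. */
    #include <stdlib.h>
    #include <mpi.h>

    typedef struct {
        MPI_Offset offset;   /* starting byte offset in the file */
        MPI_Offset length;   /* number of contiguous bytes */
        const void *membuf;  /* where this piece lives in user memory */
    } flat_seg;

    static int cmp_by_offset(const void *a, const void *b)
    {
        const flat_seg *x = (const flat_seg*) a;
        const flat_seg *y = (const flat_seg*) b;
        if (x->offset < y->offset) return -1;
        if (x->offset > y->offset) return  1;
        return 0;
    }

    /* After this call the segments are ordered by file offset; the
     * memory-side (buffer) type would need the matching reordering. */
    void sort_segments(flat_seg *segs, size_t nsegs)
    {
        qsort(segs, nsegs, sizeof(flat_seg), cmp_by_offset);
    }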

On Thu, Dec 27, 2012 at 3:56 PM, Wei-keng Liao
<wkliao at ece.northwestern.edu> wrote:
> Hi, Phil,
>
> Sorry, I misread your code.
> My understanding now is that each process writes 24*52 array elements,
> and each of them goes to a file location that is (numlon*numlat)
> elements away from its next/previous element.
>
> Each PnetCDF nonblocking API call defines an MPI derived data type as
> a file type to represent the file access layout for that call. All of
> the file types are then concatenated into a single one, which is later
> used to make an MPI collective write call.
>
> The problem is that MPI-IO requires the file type to contain monotonically
> non-decreasing file offsets. Your code produces a concatenated file type
> that violates this requirement, which leaves PnetCDF unable to aggregate
> the requests the way you expect. In the end, PnetCDF makes several MPI
> collective I/O calls, each of them accessing non-contiguous file locations.
>
> Maybe a high-level aggregation will be unavoidable in your case.
>
> Wei-keng
>
> On Dec 27, 2012, at 3:03 PM, Phil Miller wrote:
>
>> On Thu, Dec 27, 2012 at 2:47 PM, Wei-keng Liao
>> <wkliao at ece.northwestern.edu> wrote:
>>> It looks like you are writing one array element of 8 bytes at a time.
>>> Performance will definitely be poor for this I/O strategy.
>>>
>>> Please check whether the indices translated by the mask are actually contiguous.
>>> If so, you can replace the loop with one write call.
>>
>> Hi Wei-keng,
>>
>> I don't think it should be 8 bytes (1 element) at a time - each call
>> should deposit 24*52 elements of data, spanning the entirety of the
>> pft and weeks dimensions. Do you mean that because the file data isn't
>> contiguous in those dimensions, the writes won't end up being
>> combined?
>>
>> I could see what happens if I transpose the file dimensions to put pft
>> and weeks first, I guess.
>>
>> Further, the documentation seems to indicate that by using collective
>> data mode and non-blocking puts, the library will use MPI-IO's
>> collective write calls to perform a two-phase redistribute-and-write,
>> avoiding actually sending any tiny writes to the filesystem.
>>
>> Perhaps I've misunderstood something on one of those points, though?
>>
>> As for the mask contiguity, the indices are explicitly not contiguous.
>> The decomposition is round-robin over the Earth's surface, so as to
>> obtain good load balance, with gaps where a grid point is over water
>> rather than land. I have some other code that I'm adding now that fills
>> in the gaps with a sentinel value, but making even those indices
>> contiguous will be challenging.
>>
>> Phil
>
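
As a footnote to the quoted explanation above: the constraint Wei-keng
describes shows up at the raw MPI-IO level, where a file view's
displacements must be non-negative and monotonically non-decreasing.
The sketch below, with placeholder arguments, builds one legal file type
from already-sorted byte displacements and issues a single collective
write.

    /* Placeholder illustration: blocklens[] counts doubles per segment and
     * disps[] holds byte displacements (multiples of sizeof(double)) that
     * are already sorted in ascending order.  total_doubles must equal the
     * sum of blocklens[].  Error checking is omitted. */
    #include <mpi.h>

    void write_sorted_view(MPI_Comm comm, const char *fname,
                           int nseg, int *blocklens, MPI_Aint *disps,
                           const double *buf, int total_doubles)
    {
        MPI_File     fh;
        MPI_Datatype filetype;

        /* Sorted displacements keep the typemap monotonically
         * non-decreasing, as MPI-IO requires of a file view. */
        MPI_Type_create_hindexed(nseg, blocklens, disps, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_open(comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

        /* One collective write covers all of this process's segments; the
         * two-phase aggregation happens inside the MPI-IO library. */
        MPI_File_write_all(fh, buf, total_doubles, MPI_DOUBLE,
                           MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
    }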

