pnetCDF performance issues
Wei-keng Liao
wkliao at ece.northwestern.edu
Wed Mar 9 18:19:39 CST 2011
>> 1) What should the defaults be so that users get good performance "out of the box"?
In terms of performance, setting the header alignment size to the file striping size
gives the best performance. But we also need to consider the file size. Say a user
creates a file with a few small arrays, each only a few KB in size, and the file system
striping size is 4 MB; do we want to enforce this default alignment? (It would waste
a lot of space.)
Proposed solution below.
>> 2) Can/should pnetCDF diagnose poor choices and inform the user?
The only diagnosis I can think of is to check whether a user's choice matches
the file system striping size. (Here "matches" means the "nc_header_align_size"
hint chosen by the user is a multiple of the striping size.)
As for informing users of a poor choice, Rob's suggestion is fine. I personally
think feedback from the I/O library (or another layer of the stack) is very useful.
>> 3) Can/should MPI-IO "fix" this by exploiting the MPI-IO semantics
>> to permit converting writes to be aligned (e.g., by caching)?
In MPI collective I/O, writes from the aggregators are aligned with the
striping size, if the striping size can be obtained from the file system. Currently,
the ROMIO drivers for PVFS and Lustre collect the striping info into
the hints. PnetCDF can use that info to choose the right header alignment size.
For independent I/O, no alignment is done.
If the striping info cannot be obtained, PnetCDF currently uses 512 bytes
as the file header alignment size.
>>
>> Of these, (1) is the most important for pnetCDF, particularly as
>> users compare approaches.
>
I propose the following way to pick a default value:

    if ROMIO can obtain the file striping size
    then
        if the total aggregate array size is at least N times the striping size (say N = 4)
        then PnetCDF uses the file striping size as the header alignment size
        else 512 bytes is used
    else
        use 512 bytes

(Note that the header size is calculated at the call to ncmpi_enddef; at that point,
the number of arrays and their sizes are also known.)
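The selection logic above could be sketched in C as follows. This is a minimal illustration; the function name and the way the striping size would actually be read out of the ROMIO hints are assumptions, not PnetCDF's real code:

```c
/* Sketch of the proposed default header-alignment choice.
 * striping_size:    file system striping size reported via ROMIO hints,
 *                   or 0 if it could not be obtained.
 * total_array_size: aggregate size of all arrays, known at ncmpi_enddef.
 * Returns the header alignment size in bytes. */
static long long
choose_header_align(long long striping_size, long long total_array_size)
{
    const long long N = 4;           /* threshold factor, say N = 4 */
    const long long FALLBACK = 512;  /* current PnetCDF default */

    if (striping_size <= 0)          /* striping info unavailable */
        return FALLBACK;
    if (total_array_size >= N * striping_size)
        return striping_size;        /* file is large enough to justify it */
    return FALLBACK;                 /* avoid wasting space on a small file */
}
```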
Wei-keng
>
> One:
>
> pnetcdf could stat the file system, but take a peek at ROMIO's file
> system detection code for the state of portable statfs. Today,
> perhaps it is less of a problem than when that code was written a
> decade ago. What I mean to say is: "does there exist a portable way
> to determine alignment"? st_blksize is probably our best bet, but on
> Lustre it's actually more important not to align blocks but to hit the
> same OST.
>
> Just because it's hard doesn't mean we shouldn't do it, of course...
>
> HDF5 has this problem too: both libraries would benefit from an MPI-IO
> interface to "file system features": alignment and "optimum tranfer
> size" come to mind. others no doubt.
>
> two:
>
> pnetcdf has two ways to get information back to the caller: the return
> code and the info object. A read-only "pnetcdf_how_we_doin" hint
> might do the trick.
>
> three:
>
> some MPI-IO implementations do fix this, as long as collective I/O is
> used. The MPI-IO on BlueGene, for example, always forces collective
> I/O (even if operations are not overlapping), then aligns file domains
> to block size boundaries. I know, I just complained about how
> un-portable st_blksize can be, but 'ad_bgl' gets to make some
> simplifying assumptions.
>
> ROMIO, at least in recent versions, can also do some file domain magic:
> - "romio_min_fdomain_size" will enforce a lower bound on the amount of
> I/O an aggregator will do.
>
> - set the "striping_unit" hint and ROMIO will ensure file domain
> boundaries are aligned to a multiple of that value.
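>
> As a concrete illustration, both of these can be supplied through a
> ROMIO hints file (a plain text file of "key value" lines, pointed to
> by the ROMIO_HINTS environment variable) without touching the
> application; the values below are just examples:
>
```text
romio_min_fdomain_size 1048576
striping_unit 4194304
```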
>
> ==rob
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>
More information about the parallel-netcdf
mailing list