pnetCDF performance issues

Wei-keng Liao wkliao at ece.northwestern.edu
Wed Mar 16 18:31:13 CDT 2011


I have committed my proposed approach for setting the default alignment
size as SVN r915. You are welcome to give it a try. Comments
are welcome, too.

The wiki document has been revised to describe how the defaults
are calculated. The confusing sentences have also been removed.
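
For illustration, the default-selection rule proposed in the quoted
message below can be sketched in C. This is not the code committed in
r915. The function names choose_header_align and align_up, and the two
macros, are hypothetical; the 512-byte fallback and N=4 are the values
mentioned in the thread.

```c
#include <assert.h>

#define FALLBACK_ALIGN 512  /* bytes; fallback value mentioned in the thread */
#define MIN_STRIPES    4    /* the "say N=4" threshold from the proposal */

/* Pick a default header alignment size (hypothetical helper).
 * striping_unit: file striping size in bytes, or 0 if it is unknown.
 * total_array_bytes: aggregate size of all arrays defined in the file. */
long long choose_header_align(long long striping_unit,
                              long long total_array_bytes)
{
    /* Use the striping size only when it is known and the data spans
     * at least MIN_STRIPES stripes; division avoids overflow. */
    if (striping_unit > 0 &&
        total_array_bytes / striping_unit >= MIN_STRIPES)
        return striping_unit;
    return FALLBACK_ALIGN;
}

/* Round offset up to the next multiple of align. */
long long align_up(long long offset, long long align)
{
    assert(align > 0 && offset >= 0);
    return ((offset + align - 1) / align) * align;
}
```

With a 4 MiB striping size and only 8 KB of array data, this sketch falls
back to 512 bytes, so a 1000-byte header is followed by data at offset
1024 rather than at offset 4194304.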

Please note that the default alignment size now depends on the
MPI-IO hint "striping_unit". Currently, only ROMIO's
Lustre driver sets this hint properly. Rob Latham is working
on this issue for the GPFS and PVFS2 drivers.
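
As a rough sketch of how a hint value might be consumed: MPI-IO hint
values are strings, so the "striping_unit" value has to be converted
before use. The helpers below are hypothetical and do not call MPI; they
assume the value is a decimal byte count, as would be copied out by a
call such as MPI_Info_get. The second helper is the "match" check
discussed in the quoted thread below.

```c
#include <stdlib.h>

/* Convert the string value of the "striping_unit" hint to bytes.
 * Pass NULL when the hint is not set. Returns 0 when the value is
 * missing or is not a positive decimal number (an assumption of this
 * sketch about how an unusable hint should be treated). */
long long parse_striping_unit(const char *value)
{
    char *end;
    long long v;

    if (value == NULL || *value == '\0')
        return 0;
    v = strtoll(value, &end, 10);
    if (*end != '\0' || v <= 0)
        return 0;
    return v;
}

/* Check a user-chosen header alignment against the striping size.
 * Returns 1 if header_align is a multiple of striping_unit,
 * 0 if it is not, and -1 if either value is unknown. */
int header_align_matches(long long header_align, long long striping_unit)
{
    if (header_align <= 0 || striping_unit <= 0)
        return -1;
    return (header_align % striping_unit == 0) ? 1 : 0;
}
```

A return value of 0 from header_align_matches is the case where a
library could warn the user about a poor choice.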

Wei-keng

On Mar 10, 2011, at 3:48 AM, Nils Smeds wrote:

> In light of this discussion could someone perhaps update the wiki page on striping_unit http://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/StripingUnitHint 
> 
> It is the section titled "Example scenario" that I find confusing. They set the striping_unit to 1/32 of the block size (128kB of 4MiB) to reserve space for an "enormous header while still making it possible to avoid a few unaligned file system accesses" 
> 
> Cheers, 
> 
> /Nils 
> ______________________________________________
> Nils Smeds,  IBM Deep Computing / World Wide Coordinated Tuning Team
> IT Specialist, Mobile phone: +46-70-793 2639
> Fax. +46-8-793 9523
> Mail address: IBM Sweden; Loc. 5-03; 164 92 Stockholm; SWEDEN 
> 
> 
> 
> From:        Wei-keng Liao <wkliao at ece.northwestern.edu> 
> To:        parallel-netcdf at lists.mcs.anl.gov 
> Date:        03/10/2011 01:26 AM 
> Subject:        Re: pnetCDF performance issues 
> Sent by:        parallel-netcdf-bounces at lists.mcs.anl.gov 
> 
> 
> 
> >> 1) What should the defaults be so that users get good performance "out of the box"?
> 
> In terms of performance, setting the header alignment size to the file striping size
> gives the best performance. But we also need to consider the file size. Say a user
> creates a file with a few small arrays, each only a few KB in size, and the file system
> striping size is 4 MB: do we want to enforce this default alignment? (It would produce
> a lot of unused space.)
> 
> Proposed solution below.
> 
> 
> >> 2) Can/should pnetCDF diagnose poor choices and inform the user?
> 
> The only diagnosis I can think of is to check whether a user's choice matches
> the file system striping size. ("Matches" means that the value of the hint "nc_header_align_size"
> chosen by the user is a multiple of the striping size.)
> 
> As for informing users of a poor choice, Rob's suggestion is fine. I personally
> think feedback from the I/O library (or another layer of the stack) is very useful.
> 
> 
> >> 3) Can/should MPI-IO "fix" this by exploiting the MPI-IO semantics
> >> to permit converting writes to be aligned (e.g., by caching)?
> 
> In MPI collective I/O, writes from the aggregators are aligned with the
> striping size, if the striping size can be obtained from the system. Currently, the
> ROMIO drivers for PVFS and Lustre collect the striping info into
> the hints. PnetCDF can use that info to choose the right header alignment size.
> 
> As for independent I/Os, no alignment is done.
> 
> If the striping info cannot be obtained, PnetCDF currently uses 512 bytes
> as the file header alignment size.
> 
> >> 
> >> Of these, (1) is the most important for pnetCDF, particularly as
> >> users compare approaches.
> > 
> 
> 
> I propose the following way to pick a default value.
> 
> if ROMIO can obtain the file striping size
> then
>      if the total aggregate array size is at least N times the striping size (say N=4)
>      then pnetcdf uses the file striping size as the header alignment size
>      else 512 bytes is used
> else
>      use 512 bytes
> 
> (Note that the header size is calculated at the call to ncmpi_enddef. By that point,
> the number of arrays and their sizes are also known.)
> 
> Wei-keng
> 
> > 
> 
> 
> 
> > One:
> > 
> > pnetcdf could stat the file system, but take a peek at ROMIO's file
> > system detection code for the state of portable statfs.  today,
> > perhaps it is less of a problem than when that code was written a
> > decade ago.  What I mean to say is: "does there exist a portable way
> > to determine alignment"?   st_blksize is probably our best bet, but on
> > Lustre it's actually more important not to align blocks but to hit the
> > same OST.
> > 
> > Just because it's hard doesn't mean we shouldn't do it, of course...
> > 
> > HDF5 has this problem too: both libraries would benefit from an MPI-IO
> > interface to "file system features": alignment and "optimum transfer
> > size" come to mind.  others no doubt.
> > 
> > two:
> > 
> > pnetcdf has two ways to get information back to the caller: the return
> > code and the info object.  A read-only "pnetcdf_how_we_doin" hint
> > might do the trick.
> > 
> > three:
> > 
> > some MPI-IO implementations do fix this, as long as collective I/O is
> > used.  The MPI-IO on BlueGene, for example, always forces collective
> > I/O (even if operations are not overlapping), then aligns file domains
> > to block size boundaries.  I know, I just complained about how
> > un-portable st_blksize can be but 'ad_bgl' gets to make some
> > simplifying assumptions.
> > 
> > ROMIO, at least recent versions, can also do some file domain magic
> > - "romio_min_fdomain_size" will enforce a lower bound on the amount of
> >  I/O an aggregator will do.
> > 
> > - set the "striping_unit" hint and ROMIO will ensure file domain
> >  boundaries are aligned to a multiple of that value.
> > 
> > ==rob
> > 
> > -- 
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA
> > 
> 
> 
> 
> 
> Unless stated otherwise above:
> IBM Svenska AB
> Organization number: 556026-6883
> Address: 164 92 Stockholm
