pnetCDF performance issues

Nils Smeds nils.smeds at se.ibm.com
Thu Mar 10 03:48:25 CST 2011


In light of this discussion could someone perhaps update the wiki page on 
striping_unit 
http://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/StripingUnitHint

It is the section titled "Example scenario" that I find confusing. It 
sets the striping_unit to 1/32 of the stripe size (128 kB out of 4 MiB) 
to reserve space for an "enormous header while still making it possible 
to avoid a few unaligned file system accesses".

Cheers,

/Nils
______________________________________________
Nils Smeds,  IBM Deep Computing / World Wide Coordinated Tuning Team
IT Specialist, Mobile phone: +46-70-793 2639
Fax. +46-8-793 9523
Mail address: IBM Sweden; Loc. 5-03; 164 92 Stockholm; SWEDEN



From:   Wei-keng Liao <wkliao at ece.northwestern.edu>
To:     parallel-netcdf at lists.mcs.anl.gov
Date:   03/10/2011 01:26 AM
Subject:        Re: pnetCDF performance issues
Sent by:        parallel-netcdf-bounces at lists.mcs.anl.gov



>> 1) What should the defaults be so that users get good performance
>> "out of the box"?

In terms of performance, setting the header alignment size to the file
striping size gives the best results. But we also need to consider the
file size. Say a user creates a file with a few small arrays, each only
a few KB in size, while the file system striping size is 4 MB: do we
want to enforce this default alignment? (It would waste a lot of space.)

Proposed solution below.


>> 2) Can/should pnetCDF diagnose poor choices and inform the user?

The only diagnosis I can think of is to check whether the user's choice
matches the file system striping size. (Here "match" means the
"nc_header_align_size" hint chosen by the user is a multiple of the
striping size.)
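The "multiple of the stripe size" criterion above can be expressed as a one-line check. This is just a sketch of the proposed diagnostic, not pnetcdf code; the function name is made up:

```c
#include <assert.h>

/* Hypothetical diagnostic: the user's nc_header_align_size "matches"
   the file system when it is a positive multiple of the stripe size. */
static int
align_matches_stripe(long long align, long long stripe)
{
    return stripe > 0 && align > 0 && align % stripe == 0;
}
```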

As for informing users of a poor choice, Rob's suggestion is fine. I
personally think feedback from the I/O library (or elsewhere in the
stack) is very useful.
 

>> 3) Can/should MPI-IO "fix" this by exploiting the MPI-IO semantics
>> to permit converting writes to be aligned (e.g., by caching)?

In MPI collective I/O, writes from the aggregators are aligned with the
striping size, if the striping size can be obtained from the system.
Currently, the ROMIO drivers for PVFS and Lustre collect the striping
info into the hints. PnetCDF can use that info to choose the right
header alignment size.

For independent I/O, no alignment is done.

If the striping info cannot be obtained, pnetcdf currently uses 512
bytes as the file header alignment size.
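Whatever alignment size is chosen, the header offset is rounded up to the next multiple of it. A sketch of the usual round-up arithmetic (not pnetcdf's actual code; names are illustrative):

```c
#include <assert.h>

/* Hypothetical helper: round a header size up to the next multiple of
   the alignment (e.g. 512 bytes when no striping info is available). */
static long long
round_up(long long size, long long align)
{
    return ((size + align - 1) / align) * align;
}
```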

>> 
>> Of these, (1) is the most important for pnetCDF, particularly as
>> users compare approaches.
> 


I propose the following way to pick a default value:

if ROMIO can obtain the file striping size
then
      if the total aggregate array size is at least N times the
      striping size (say N = 4)
      then pnetcdf uses the file striping size as the header alignment size
      else 512 bytes is used
else
      use 512 bytes

(Note that the header size is calculated at the call to ncmpi_enddef;
by then, the number of arrays and their sizes are also known.)
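The proposed default could be condensed into a small function. This is only a sketch of the heuristic described above (a stripe size of 0 stands for "ROMIO could not obtain it"; the function name and signature are made up):

```c
#include <assert.h>

/* Hypothetical sketch of the proposed default: use the stripe size as
   the header alignment only when the aggregate array size is at least
   N stripes; otherwise fall back to 512 bytes. */
static long long
default_header_align(long long stripe_size, long long total_array_size)
{
    const long long N = 4;   /* threshold factor from the proposal */
    if (stripe_size > 0 && total_array_size >= N * stripe_size)
        return stripe_size;
    return 512;
}
```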

Wei-keng




> One:
> 
> pnetcdf could stat the file system, but take a peek at ROMIO's file
> system detection code for the state of portable statfs.  Today,
> perhaps it is less of a problem than when that code was written a
> decade ago.  What I mean to say is: "does there exist a portable way
> to determine alignment"?  st_blksize is probably our best bet, but on
> Lustre it's actually more important not to align to blocks but to hit
> the same OST.
> 
> Just because it's hard doesn't mean we shouldn't do it, of course...
> 
> HDF5 has this problem too: both libraries would benefit from an MPI-IO
> interface to "file system features": alignment and "optimum transfer
> size" come to mind.  Others, no doubt.
> 
> two:
> 
> pnetcdf has two ways to get information back to the caller: the return
> code and the info object.  A read-only "pnetcdf_how_we_doin" hint
> might do the trick.
> 
> three:
> 
> some MPI-IO implementations do fix this, as long as collective I/O is
> used.  The MPI-IO on BlueGene, for example, always forces collective
> I/O (even if operations are not overlapping), then aligns file domains
> to block size boundaries.  I know, I just complained about how
> un-portable st_blksize can be but 'ad_bgl' gets to make some
> simplifying assumptions.
> 
> ROMIO, at least recent versions, can also do some file domain magic
> - "romio_min_fdomain_size" will enforce a lower bound on the amount of
>  I/O an aggregator will do.
> 
> - set the "striping_unit" hint and ROMIO will ensure file domain
>  boundaries are aligned to a multiple of that value.
> 
> ==rob
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
> 
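For reference, the two ROMIO hints Rob mentions can be set in the application via MPI_Info_set before opening the file, or supplied in a hints file named by the ROMIO_HINTS environment variable (one "key value" pair per line). A sketch of such a hints file, assuming a 4 MiB Lustre stripe; the values are illustrative, not recommendations:

```
striping_unit 4194304
romio_min_fdomain_size 4194304
```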




Unless stated otherwise above:
IBM Svenska AB
Registration number (Organisationsnummer): 556026-6883
Address: 164 92 Stockholm