bad performance

Rob Latham robl at mcs.anl.gov
Fri Nov 20 10:27:13 CST 2009


On Fri, Nov 20, 2009 at 04:50:35PM +0100, Joeckel, Patrick wrote:
> Dear Rob,
> 
> thanks for the quick reply. Here is the information you asked for
> (below the ncdump -h ...):

Ouch.  Yeah, this is going to require some tuning.  But first, a lengthy
explanation!

> The data in grid-point representation is decomposed in the "lon" and
> "lat" directions, e.g.,
> lon x lat = 128 x 64 = 8192,
> i.e., each CPU has 2 segments of 32 columns (lev=90)
> of each variable.
> For the spectral data, the decomposition is more complicated.
> In any case, the output is one complete variable at a time.
> 
> The file system is GPFS.
> 
> The compiler is
> IBM XL Fortran for Linux, V12.1
> Version: 12.01.0000.0000
> 
> The parallel environment is IBMs PE.

OK, that does complicate things a little bit -- I know how to tune
ROMIO, but I'm not as familiar with IBM's PE options.  Let's see what
we can do.  

First, for the sake of this discussion I'm going to cut out the
attributes on your variables.  Those are stored in the header and will
not have any impact on I/O performance.
 
netcdf test03_________20050131_0000_tr_NOx_NOy {
 dimensions:
        time = UNLIMITED ; // (5 currently)
        lon = 128 ;
        lat = 64 ;
        complex = 2 ;
        nsp = 946 ;
        lev = 90 ;
 variables:
        double time(time) ;
        double YYYYMMDD(time) ;
        double dt(time) ;
        double nstep(time) ;

        float lon(lon) ;
        float lat(lat) ;
        float lev(lev) ;
        float hyam(lev) ;
        float hybm(lev) ;

        float aps(time, lat, lon) ;
        float geosp(time, lat, lon) ;
        float lsp(time, nsp, complex) ;
        float gboxarea(time, lat, lon) ;

        float N(time, lev, lat, lon) ;
        float N2(time, lev, lat, lon) ;
        float NH3(time, lev, lat, lon) ;
        float N2O(time, lev, lat, lon) ;
        float NO(time, lev, lat, lon) ;
        float NO2(time, lev, lat, lon) ;
        float NO3(time, lev, lat, lon) ;
        float N2O5(time, lev, lat, lon) ;
        float HONO(time, lev, lat, lon) ;
        float HNO3(time, lev, lat, lon) ;
        float HNO4(time, lev, lat, lon) ;
        float PAN(time, lev, lat, lon) ;
        float HNO3_nat(time, lev, lat, lon) ;
        float NH2OH(time, lev, lat, lon) ;
        float NHOH(time, lev, lat, lon) ;
        float NH2O(time, lev, lat, lon) ;
        float HNO(time, lev, lat, lon) ;
        float NH2(time, lev, lat, lon) ;
}

You've got 31 variables: 5 non-record variables and 26 record
variables.  Your 18 chemical compounds (species?) are made up of
2949120 byte records, aps, geosp, and gboxarea have 32768 byte
records, and lsp has a 7568 byte record.
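(Those sizes fall straight out of the dimensions: one record is one
time step, so each species record is lev * lat * lon * 4 bytes =
90 * 64 * 128 * 4 = 2949120, the 2-D surface fields are
lat * lon * 4 = 64 * 128 * 4 = 32768, and lsp is
nsp * complex * 4 = 946 * 2 * 4 = 7568.)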

So, what does all this mean for I/O performance?  How familiar are you
with the NetCDF file format?  Your 26 record variables are laid out
on disk interleaved: for example, consecutive records of NH2O are
separated by about 50 MB of data from the other variables.
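(Rough arithmetic: one complete record -- one time step of every
record variable -- is 4*8 + 3*32768 + 7568 + 18*2949120 bytes, a
little over 50 MB, so consecutive records of any one species are
separated by that amount minus the species' own 2949120 bytes.)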

OK, almost there, thanks for bearing with me.  Operations on these
variables end up being very non-contiguous.  With some digging I
think we can find MPI-IO tuning parameters that will deal with this
more favorably.   

Are you familiar with MPI-IO hints?  Perhaps this
URL will be helpful. 

http://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/HintsForPnetcdf

Here's one that looks promising:  can you set the hint
IBM_io_buffer_size to the string "2949120"?
This can also be set with the 'MP_IO_BUFFER_SIZE' environment
variable.

There's another hint "IBM_sparse_access" which might help here.  Set
that to 'true'.
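
If it's easier to experiment from the code side, here's a rough sketch
of passing those hints through PnetCDF from C.  The hint names are the
IBM PE ones above; the file name and create flags are just
placeholders, and whether the hints actually take effect is up to
IBM's MPI-IO layer, so treat this as illustrative rather than
definitive:

    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int ncid;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        /* buffer one full species record: 90*64*128*4 bytes */
        MPI_Info_set(info, "IBM_io_buffer_size", "2949120");
        /* flag the access pattern as sparse / non-contiguous */
        MPI_Info_set(info, "IBM_sparse_access", "true");

        /* PnetCDF hands the info object to MPI_File_open, which is
         * where the MPI-IO implementation picks up its hints */
        ncmpi_create(MPI_COMM_WORLD, "test03.nc", NC_CLOBBER, info, &ncid);
        MPI_Info_free(&info);

        /* ... define dimensions/variables and write as you do now ... */

        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }

The Fortran bindings (nfmpi_create) take an info argument the same
way, and the MP_IO_BUFFER_SIZE environment variable mentioned above
covers the first hint without touching the code at all.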


I know you are probably constrained by conventions and collaborators,
but is there any way you can change your time variable to not be
UNLIMITED?  It's only 5 in this data set.  What if you made 'time' 100
and stored an attribute saying how many timesteps the dataset
contains?  That would re-arrange the data in the file in a much more
efficient manner.
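
A minimal sketch of that define-time change, assuming an attribute
named 'actual_timesteps' (a name I just made up -- use whatever fits
your conventions):

    #include <pnetcdf.h>

    /* Define 'time' as a fixed-size dimension instead of UNLIMITED, and
     * record how many of those slots actually contain data.  With no
     * record dimension, each variable is stored as one contiguous block
     * instead of being interleaved with the other 4-D variables. */
    static void define_fixed_time(int ncid, int nsteps_written)
    {
        int time_dim;

        /* was: ncmpi_def_dim(ncid, "time", NC_UNLIMITED, &time_dim); */
        ncmpi_def_dim(ncid, "time", 100, &time_dim);

        ncmpi_put_att_int(ncid, NC_GLOBAL, "actual_timesteps",
                          NC_INT, 1, &nsteps_written);
    }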


==rob


-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

