bad performance

Rob Latham robl at mcs.anl.gov
Mon Nov 23 14:39:06 CST 2009


On Mon, Nov 23, 2009 at 03:03:20PM +0100, Joeckel, Patrick wrote:
> Dear Rob,
> 
> setting
> IBM_io_buffer_size to 2949120
> and
> IBM_sparse_access to true
> reduced the factor to approx. 2 (compared to serial netCDF),
> which is already more than a factor of 2 improvement.

OK, great! Did you have to set both of these to see a benefit?
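
For anyone following along, here's a rough sketch of how those hints get
passed in.  This is only an illustration: the file path is a placeholder,
and the hint values are the ones you reported above.

    #include <mpi.h>
    #include <pnetcdf.h>

    /* sketch: pass the IBM PE MPI-IO hints through an MPI_Info object.
     * the path is a placeholder; the hints have to be in place at open
     * (or create) time. */
    int open_with_hints(MPI_Comm comm, const char *path, int *ncid)
    {
        MPI_Info info;
        int err;

        MPI_Info_create(&info);
        MPI_Info_set(info, "IBM_io_buffer_size", "2949120");
        MPI_Info_set(info, "IBM_sparse_access", "true");

        err = ncmpi_open(comm, path, NC_WRITE, info, ncid);

        MPI_Info_free(&info);
        return err;
    }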

> Going without UNLIMITED implies major changes to the overall structure
> (and almost our entire post-processing would need to be modified), so,
> if possible, I would try to avoid changing this.

I was afraid of that.  It's the answer I usually get when I suggest it :>

> Do you see other chances for fine-tuning?

There's one more approach you can take, but I don't think you're going
to like it.  These variables are laid out on disk with their "records"
interleaved.  If you decompose your access among your processes so
that each MPI process operates on a whole record (all of the
non-unlimited dimensions at once), then you'll probably see a much
larger I/O speedup.

The problem is that such a planar decomposition typically matches
very poorly with the computational model.
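
To make it concrete, here's a bare-bones sketch of that decomposition for
a hypothetical float variable var(time, lev, lat, lon): each rank writes
whole records, round-robin over the time steps.  The names and sizes are
made up; the only point is the access pattern.

    #include <mpi.h>
    #include <pnetcdf.h>

    /* each rank writes complete (lev, lat, lon) records, cycling over
     * time steps round-robin.  every rank makes the same number of
     * collective calls, padding with zero-sized accesses so nobody is
     * left waiting at the collective. */
    void write_by_record(int ncid, int varid, int nrec,
                         MPI_Offset nlev, MPI_Offset nlat, MPI_Offset nlon,
                         const float *record, int rank, int nprocs)
    {
        int iters = (nrec + nprocs - 1) / nprocs;
        for (int i = 0; i < iters; i++) {
            int t = i * nprocs + rank;
            int have = (t < nrec);
            MPI_Offset start[4] = { have ? t : 0, 0, 0, 0 };
            MPI_Offset count[4] = { have ? 1 : 0, nlev, nlat, nlon };
            ncmpi_put_vara_float_all(ncid, varid, start, count, record);
        }
    }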

> The file of which I sent you the ncdump is of course not the only
> output file, there are several output files open at a time.  There
> are several files with the grid-point layout (time, lev, lat, lon)
> or (time,lat,lon), some few with the spectral layout (time, nsp,
> complex) or (time, lev, nsp, complex), some with mixed variables
> (as in the example), and some more with even different data
> representations (e.g. scalar, i.e. time only, or columns (time,
> nlev)).  They are all with UNLIMITED time variable, however.

> Would you recommend calculating the IBM_io_buffer_size for each
> netCDF file individually (e.g., based on the number of variables
> that go into this file)?

The IBM_io_buffer_size hint should be set on a per-variable basis.
It's the product of the fixed dimensions times the size of the
datatype.  
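
As a made-up example: for a float variable whose fixed dimensions were,
say, 40 x 96 x 192, that product would come out to exactly the 2949120
bytes you set above.

    #include <stdio.h>

    /* hypothetical fixed (non-record) dimensions of one variable */
    int main(void)
    {
        long nlev = 40, nlat = 96, nlon = 192;
        long record_bytes = nlev * nlat * nlon * (long) sizeof(float);

        char hint[32];
        snprintf(hint, sizeof(hint), "%ld", record_bytes);
        printf("IBM_io_buffer_size = %s\n", hint);   /* prints 2949120 */
        return 0;
    }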

Looks like it's time we started setting this hint inside the pnetcdf
library.  As you say, you can have assorted record sizes for the
different variables in the dataset, but you only get one chance to set
those hints: at open or create time.   I don't think this will be hard
to do, but doing so in a way that works for both IBM PE and ROMIO will
take a little work.

> Or are there more fine tuning options I could use?

I think we've found the low-hanging fruit.  If you can send us an
"I/O kernel", though (some code that demonstrates a typical I/O
pattern but has none of the science), then we can run it on our
systems and try to tune the heck out of pnetcdf and MPI-IO.  It's
harder to do such tuning over email.
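
Something along these lines would be plenty.  This is only a minimal
sketch: all names and sizes are placeholders, and it assumes the number
of processes evenly divides the lat dimension.

    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    #define NLAT 96
    #define NLON 192

    /* minimal "I/O kernel": one UNLIMITED-dimension record variable,
     * each rank writing a contiguous band of latitudes in record 0. */
    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid, dimids[3], varid;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        ncmpi_create(MPI_COMM_WORLD, "kernel.nc", NC_CLOBBER,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
        ncmpi_def_dim(ncid, "lat",  NLAT,         &dimids[1]);
        ncmpi_def_dim(ncid, "lon",  NLON,         &dimids[2]);
        ncmpi_def_var(ncid, "tsurf", NC_FLOAT, 3, dimids, &varid);
        ncmpi_enddef(ncid);

        MPI_Offset my_nlat = NLAT / nprocs;          /* assumes even split */
        MPI_Offset start[3] = { 0, rank * my_nlat, 0 };
        MPI_Offset count[3] = { 1, my_nlat, NLON };
        float *buf = calloc(my_nlat * NLON, sizeof(float));

        ncmpi_put_vara_float_all(ncid, varid, start, count, buf);

        free(buf);
        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }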

> What is your experience with scaling when I go to more CPUs (e.g.
> up to 256 or 512)?

pnetcdf should scale very well: pnetcdf codes have run on up to 32,000
cores on our BlueGene system.

> Any further suggestions are very much appreciated.

Let's see if the list has anything to add...
==rob
 
-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

