[petsc-users] Unable to create >4GB sized HDF5 files on Cray XC30

Juha Jäykkä juhaj at iki.fi
Fri Oct 4 10:57:00 CDT 2013


On Sunday 18 August 2013 08:10:19 Jed Brown wrote:
> Output uses a collective write, so the granularity of the IO node is
> probably more relevant for writing (e.g., BG/Q would have one IO node
> per 128 compute nodes), but almost any chunk size should perform
> similarly.  It would make a lot more difference for something like

I ran into this on a Cray XC30, and it is certainly not the case there that any 
chunk size performs even close to similarly: I see IO throughput ranging from 
roughly 50 MB/s to 16 GB/s depending on the chunk sizes and on the number of 
ranks participating in the MPI-IO operation (underneath H5Dwrite()). 
Unfortunately, the Lustre stripe settings also need to be considered, and they 
might change things drastically, so I doubt there is a single good overall 
choice here.
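
For reference, the write path I am tuning looks roughly like the sketch below 
(write_local_block is just a made-up helper name; the dataset, dataspaces, 
offsets and buffer are assumed to come from elsewhere, and the chunking itself 
is fixed earlier, when the dataset is created with a chunked creation property 
list):

    #include <hdf5.h>

    /* Every rank selects its own hyperslab of the file dataspace and
     * then joins a single collective H5Dwrite(). */
    static herr_t write_local_block(hid_t dset, hid_t filespace, hid_t memspace,
                                    const hsize_t start[], const hsize_t count[],
                                    const double *local_array)
    {
      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);   /* collective MPI-IO */
      H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
      herr_t err = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                            dxpl, local_array);
      H5Pclose(dxpl);
      return err;
    }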

> visualization where subsets of the data are read, typically with
> independent IO.

Yes, this certainly needs to be considered, too. I guess huge chunks are bad 
here?

> Are you sure?  Did you try writing a second time step?  The
> documentation says that H5S_UNLIMITED requires chunking.

Sorry, I didn't consider several timesteps and extending the files at all. 
You're correct, of course.
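
Indeed, as far as I can tell the constraint is simply that a maxdims containing 
H5S_UNLIMITED is only accepted together with a chunked dataset creation 
property list, roughly like this (create_timeseries_dataset, the dataset name 
"u" and the sizes are only placeholders):

    #include <hdf5.h>

    /* One extensible (time) dimension: maxdims contains H5S_UNLIMITED,
     * which HDF5 only allows with a chunked layout. */
    static hid_t create_timeseries_dataset(hid_t file, hsize_t nx)
    {
      hsize_t dims[2]    = {1, nx};              /* one time step so far         */
      hsize_t maxdims[2] = {H5S_UNLIMITED, nx};  /* grown later by H5Dset_extent */
      hsize_t chunk[2]   = {1, nx};              /* e.g. one time step per chunk */

      hid_t space = H5Screate_simple(2, dims, maxdims);
      hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
      H5Pset_chunk(dcpl, 2, chunk);              /* required for H5S_UNLIMITED */

      hid_t dset = H5Dcreate2(file, "u", H5T_NATIVE_DOUBLE, space,
                              H5P_DEFAULT, dcpl, H5P_DEFAULT);
      H5Pclose(dcpl);
      H5Sclose(space);
      return dset;
    }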

> Chunk size needs to be collective.  We could compute an average size
> from each subdomain, but can't just use the subdomain size.

Why not use the size of the local part of the DA/Vec (see the sketch after the 
list below)? That would guarantee:

1. That the chunk is not ridiculously large (unless someone runs on a huge SMP 
with a single thread, which is probably something no one should be doing 
anyway).

2. Whatever gets written to disc is contiguous both on disc and in memory, 
eliminating unnecessary seek()s and cache thrashing (although the latter is 
probably irrelevant as the disc is going to be the bottleneck).

3. The chunk is most likely < 4 GB as very few machines have more than 4 GB / 
core available. (Well, this is not really a guarantee.)
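
To be concrete, the idea is something like the sketch below (not my actual 
patch; chunk_from_local_size is a made-up name, a 3D DMDA with a single dof is 
assumed, and, as you point out, this only works cleanly when all subdomains 
have the same size, since the chunk dimensions have to agree across ranks):

    #include <petscdmda.h>
    #include <hdf5.h>

    /* Derive the HDF5 chunk dimensions from the local DMDA block, so a
     * rank's chunk matches the data it already holds contiguously. */
    static PetscErrorCode chunk_from_local_size(DM da, hsize_t chunk[3])
    {
      PetscInt       xm, ym, zm;
      PetscErrorCode ierr;

      ierr = DMDAGetCorners(da, NULL, NULL, NULL, &xm, &ym, &zm); CHKERRQ(ierr);
      chunk[0] = (hsize_t)zm;   /* HDF5 orders dimensions slowest-varying first */
      chunk[1] = (hsize_t)ym;
      chunk[2] = (hsize_t)xm;
      /* hand "chunk" to H5Pset_chunk() on the dataset creation plist */
      return 0;
    }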

> I think the chunk size (or maximum chunk size) should be settable by the
> user.

I agree, that would be the best solution.

Is the granularity (the number of ranks actually doing disc IO) settable on the 
HDF5 side, or does that need to be set in MPI-IO?
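
The only way I know of is to pass MPI-IO hints through HDF5's file access 
property list, along the lines of the sketch below (create_file_with_hints is 
a made-up name, the hint names are the usual reserved MPI-IO ones and the 
values are just placeholders), but maybe there is something more direct:

    #include <hdf5.h>
    #include <mpi.h>

    /* The number of ranks that actually touch the disc is an MPI-IO
     * collective-buffering hint ("cb_nodes"), not an HDF5 setting; HDF5
     * just forwards the MPI_Info to MPI-IO.  Lustre striping hints can
     * be passed the same way. */
    static hid_t create_file_with_hints(const char *name, MPI_Comm comm)
    {
      MPI_Info info;
      MPI_Info_create(&info);
      MPI_Info_set(info, "cb_nodes", "16");           /* IO aggregator ranks */
      MPI_Info_set(info, "striping_factor", "16");    /* Lustre OST count    */
      MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MB stripe size    */

      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, comm, info);
      hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      H5Pclose(fapl);
      MPI_Info_free(&info);
      return file;
    }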

Any idea which PETSc version this fix might get into? I currently keep my own 
patched version of gr2.c around, which uses local-Vec-size chunks and works 
ok, but I'd like to be able to use vanilla PETSc again.

Cheers,
Juha


