[petsc-users] Unable to create >4GB sized HDF5 files on Cray XC30

Juha Jäykkä juhaj at iki.fi
Wed Oct 23 02:57:03 CDT 2013


Hi Jed, 

My first attempt at sanitising chunking.

The rationale is to keep the old behaviour whenever possible, so I just add 
an if-else after the chunk sizes are first calculated and change only those 
sizes that need to be changed. [There is also a zeroing of the 6-element 
arrays at the beginning of the function: I multiply their entries together 
to find the various sizes, and at least on the Cray the unused entries 
contain stack garbage on entry to the function, which made the computed 
sizes nonsensical.]

I first check whether the average vector size exceeds 4 GiB. If it does, I 
simply use as few chunks as possible, splitting along the slowest varying 
axis first and, if that is not enough, splitting the next axis too, and so 
on. The patch *can* split the fastest varying dimension as well, but that 
seems like overkill: it would mean that da_grid_x*dof > 4 GiB, which seems 
unlikely. It never splits dof.
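
Continuing the sketch above (again with made-up names; dimension 0 is the 
slowest varying axis and the last dimension is dof), the splitting loop is 
roughly:

  /* Shrink the chunk extents, slowest varying axis first, until one chunk
     is at most CHUNK_LIMIT bytes. The last dimension (dof) is never split. */
  static void split_to_fit(int ndims, hsize_t chunkDims[], size_t elemSize)
  {
    int i, j;
    for (i = 0; i < ndims - 1; i++) {
      hsize_t bytes = elemSize, pieces;
      for (j = 0; j < ndims; j++) bytes *= chunkDims[j];
      if (bytes <= CHUNK_LIMIT) return;               /* already small enough   */

      pieces = (bytes + CHUNK_LIMIT - 1)/CHUNK_LIMIT; /* chunks needed to fit   */
      if (chunkDims[i] > pieces) {
        chunkDims[i] /= pieces;                       /* partial split suffices */
        return;
      }
      chunkDims[i] = 1;                               /* not enough: split fully
                                                         and move to next axis  */
    }
  }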

If the average Vec size is <= 4 GiB but the total Vec size is > 4 GiB, the 
logic is simple: first check whether 
<local_z_size>*<global_y_size>*<global_x_size>*dof is smaller than the 
"target size" (the target size comes from your formula earlier in this 
thread). If it is, use that as the chunk size; if not, just use the local 
Vec size. This could be improved with an intermediate-size option, but the 
current choice gives *my* tests decent performance across several global 
sizes, so it does not seem too bad. The performance I get is about 3 GiB/s 
using 36 OSTs on a Lustre file system. With a single OST I get about 200 
MiB/s, so the scaling is not bad: I reach half of 36*200 MiB/s, which 
suggests my performance is limited by the OSTs rather than by MPI-IO, HDF5, 
or the chunking. Tests with bigger systems would of course be needed to 
determine whether this solution works for truly huge files, too: the biggest 
I tried are just over 50 GiB. Because I have to spend my research quota on 
this, I do not want to waste too much of it on ridiculously large files 
(these are about the biggest ones I plan to use in practice for now).
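
For that middle case the choice reduces to roughly the following (a 3D DMDA 
with dof; targetSize stands in for the value of your formula, and the names 
are again only illustrative):

  /* Average Vec size <= 4 GiB but the whole dataset is > 4 GiB.
     chunkDims is laid out as {z, y, x, dof}.                              */
  static void choose_medium_chunks(hsize_t localZ, hsize_t localY, hsize_t localX,
                                   hsize_t globalY, hsize_t globalX, hsize_t dof,
                                   hsize_t targetSize, size_t elemSize,
                                   hsize_t chunkDims[4])
  {
    /* One rank's z-range times the full global xy plane, in bytes. */
    hsize_t slab = localZ*globalY*globalX*dof*elemSize;

    if (slab < targetSize) {
      chunkDims[0] = localZ;  chunkDims[1] = globalY;  /* use the slab      */
      chunkDims[2] = globalX; chunkDims[3] = dof;
    } else {
      chunkDims[0] = localZ;  chunkDims[1] = localY;   /* fall back to the  */
      chunkDims[2] = localX;  chunkDims[3] = dof;      /* local Vec size    */
    }
  }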

And the patch itself is attached, of course. Please improve it if you think 
it needs improving, and adapt it to PETSc coding conventions where necessary 
(at least the #defines are in an odd place, to keep the patch small).

Cheers,
Juha

On Sunday 06 Oct 2013 12:24:58 Jed Brown wrote:
> Juha Jäykkä <juhaj at iki.fi> writes:
> > Argh, messy indeed. Are you sure you mean 65 k and not 64 Ki?
> 
> I was using 65 k as shorthand for 2^{16}, yes.
> 
> > I made a small table of the situation just to make sure I am not
> > missing anything. In the table, "small" means < 4 GB, "large" means >=
> > 4 GB, "few" means < 65 k, "many" means >= 65 k. Note that local size >
> > global size is impossible, but I include the row on the table for
> > completeness's sake.
> > 
> > Variables:   local size   global size   # ranks   chunks
> >
> >              small        small         few       global size
> >              small        small         many      global size [1]
> >              small        large         few       avg local size
> >              small        large         many      4 GiB
> >              large        small         few       impossible
> >              large        small         many      impossible
> >              large        large         few       4 GiB [2]
> >              large        large         many      65 k chunks
> 
> The last line cannot be stored since it consists of more than 65k chunks
> of size larger than 4 GiB.
> 
> > [1] It sounds improbable anyone would run a problem with < 4 GiB data with
> > >= 65k ranks, but fortunately it's not a problem.
> 
> Au contraire, it's typical for strong scaling to roll over at about 1000
> dofs/core (about 10 kB of data), so 65k ranks is still less than 1 GB.
> 
> > [2] Unless I'm mistaken, this situation will always give < 65 k chunks for
> > 4 GiB chunk size.
> 
> Well, 33k ranks each trying to write 8 GiB would need 66k chunks, but
> there is no way to write this file so we don't need to worry about it.
> 
> > I also believe your formula gives "the right" answer in each case. Just
> > one more question: is "average local size" a good solution or is it better to
> > use "max local size"? The latter will cause more unnecessary data in the
> > file, but unless I'm mistaken, the former will require extra MPI
> > communication to fill in the portions of ranks whose local size is less
> > than average.
> 
> It depends how the compute nodes are connected to the file system, but
> even if you use "average" and if size is statistically uncorrelated with
> rank but has positive variance, the expected value of the skew will be
> more than a rank's contribution for sufficiently large numbers of cores.
> In other words, it's not possible to "balance" without needing to move
> all data for a sufficiently large number of ranks.
> 
> > HDF5 really needs to fix this internally. As it stands, a single HDF5
> > dataset cannot hold more than 260 TiB – not that many people would want
> > such files anyway, but then again, "640 kiB should be enough for
> > everybody", right? I'm running simulations which take more than terabyte
> > of memory, and I'm by far not the biggest memory consumer in the world,
> > so the limit is not really as far as it might seem.
> 
> We're already there.  Today's large machines can fit several vectors of
> size 260 TB in memory.
> 
> >> I think we're planning to tag 3.4.3 in the next couple weeks.  There
> >> might be a 3.4.4 as well, but I could see going straight to 3.5.
> > 
> > Ok. I don't see myself having time to fix and test this in two weeks, but
> > 3.4.4 should be doable. Anyone else want to fix the bug by then?
> 
> I'll write back if I get to it.
-- 
		 -----------------------------------------------
		| Juha Jäykkä, juhaj at iki.fi			|
		| http://koti.kapsi.fi/~juhaj/			|
		 -----------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hdf5_chunks.patch
Type: text/x-patch
Size: 5397 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20131023/b0e769d8/attachment.bin>