[petsc-users] Unable to create >4GB sized HDF5 files on Cray XC30

Jed Brown jedbrown at mcs.anl.gov
Sun Oct 6 12:24:58 CDT 2013


Juha Jäykkä <juhaj at iki.fi> writes:

> Argh, messy indeed. Are you sure you mean 65 k and not 64 Ki? 

I was using 65 k as shorthand for 2^{16}, yes.

> I made a small table of the situation just to make sure I am not
> missing anything. In the table, "small" means < 4 GB, "large" means >=
> 4 GB, "few" means < 65 k, "many" means >= 65 k. Note that local size >
> global size is impossible, but I include those rows in the table for
> completeness' sake.
>
> Variables:  local size   global size   # ranks   chunks
>             small        small         few       global size
>             small        small         many      global size [1]
>             small        large         few       avg local size
>             small        large         many      4 GiB
>             large        small         few       impossible
>             large        small         many      impossible
>             large        large         few       4 GiB [2]
>             large        large         many      65 k chunks

The last line cannot be stored: with at least 65 k ranks each holding at
least 4 GiB, you need either more than 65 k chunks or chunks larger than
4 GiB, and neither is allowed.

> [1] It sounds improbable anyone would run a problem with < 4 GiB data with >= 
> 65k ranks, but fortunately it's not a problem.

Au contraire, it's typical for strong scaling to roll over at about 1000
dofs/core (about 10 kB of data), so 65k ranks is still less than 1 GB.
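(Assuming 8-byte scalars, that is 65,536 ranks * 1000 dofs * 8 bytes,
about 0.5 GB for the whole vector.)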

> [2] Unless I'm mistaken, this situation will always give < 65 k chunks for 4 
> GiB chunk size.

Well, 33k ranks each trying to write 8 GiB would need 66k chunks of 4 GiB
each, but there is no way to write such a file anyway, so we don't need to
worry about it.
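For concreteness, here is a sketch of the selection logic in your table,
with everything measured in bytes.  The function and macro names are just
placeholders for illustration; this is not the PETSc implementation or the
HDF5 API.

#include <stdint.h>
#include <stdio.h>

#define MAX_CHUNKS      65536ULL        /* 2^{16} chunks per dataset */
#define MAX_CHUNK_BYTES 4294967296ULL   /* 2^{32} bytes = 4 GiB per chunk */

/* Chunk size in bytes following the table, or 0 if no chunking satisfies
   both limits (the large/large/many row). */
static uint64_t choose_chunk_bytes(uint64_t global_bytes, uint64_t nranks)
{
  uint64_t chunk;
  if (global_bytes <= MAX_CHUNK_BYTES) {
    chunk = global_bytes;                            /* small global: one chunk */
  } else if (nranks < MAX_CHUNKS) {
    chunk = (global_bytes + nranks - 1) / nranks;    /* few ranks: average local size */
    if (chunk > MAX_CHUNK_BYTES) chunk = MAX_CHUNK_BYTES;
  } else {
    chunk = MAX_CHUNK_BYTES;                         /* many ranks: 4 GiB chunks */
  }
  if ((global_bytes + chunk - 1) / chunk > MAX_CHUNKS) return 0;  /* needs > 65 k chunks */
  return chunk;
}

int main(void)
{
  /* your footnote [2] counterexample: 33k ranks, 8 GiB each -> prints 0 */
  printf("%llu\n", (unsigned long long)choose_chunk_bytes(33000ULL*8*1024*1024*1024, 33000));
  /* small local, large global, few ranks -> average local size (2 GiB) */
  printf("%llu\n", (unsigned long long)choose_chunk_bytes(8ULL*1024*1024*1024, 4));
  return 0;
}

The chosen size, divided by the element size, is what would eventually be
handed to H5Pset_chunk().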

> I also believe your formula gives "the right" answer in each case. Just one 
> more question: is "average local size" a good solution or is it better to use 
> "max local size"? The latter will cause more unnecessary data in the file, but 
> unless I'm mistaken, the former will require extra MPI communication to fill 
> in the portions of ranks whose local size is less than average.

It depends on how the compute nodes are connected to the file system, but
even if you use the "average": if the local size is statistically
uncorrelated with rank but has positive variance, the expected skew between
a rank's data and its nominal chunk boundary eventually exceeds a single
rank's contribution.  In other words, for a sufficiently large number of
ranks it is not possible to "balance" without needing to move all the data.
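To put a number on that (assuming the local sizes are independent with mean
mu and standard deviation sigma > 0): the offset of rank r's first entry
from the "average" boundary r*mu is a sum of r zero-mean deviations, so its
expected magnitude grows like sigma*sqrt(2r/pi).  Once that exceeds mu,
rank r's data no longer even overlaps the chunk nominally assigned to it,
so all of it has to be shipped elsewhere.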

> HDF5 really needs to fix this internally. As it stands, a single HDF5 dataset
> cannot hold more than 260 TiB – not that many people would want such files
> anyway, but then again, "640 KiB should be enough for everybody", right? I'm
> running simulations which take more than a terabyte of memory, and I'm by far
> not the biggest memory consumer in the world, so the limit is not really as
> far away as it might seem.

We're already there.  Today's large machines can fit several vectors of
size 260 TB in memory.
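(Spelling out the cap: 2^{16} chunks * 2^{32} bytes per chunk = 2^{48}
bytes = 256 TiB per dataset.)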

>> I think we're planning to tag 3.4.3 in the next couple weeks.  There
>> might be a 3.4.4 as well, but I could see going straight to 3.5.
>
> Ok. I don't see myself having time to fix and test this in two weeks, but 
> 3.4.4 should be doable. Anyone else want to fix the bug by then?

I'll write back if I get to it.