[MOAB-dev] adding 'read and broadcast' to HDF5 reader

Mark Miller miller86 at llnl.gov
Fri Oct 19 18:39:53 CDT 2012


If HDF5's 'file image' feature works on a portion of a file, I would be
very surprised.

The problem is that when another processor is handed the buffer of
bytes, that buffer has to be a (somewhat) consistent sequence of bytes,
laid out just as an actual HDF5 file would appear on disk.

That said, even if the 'file image' feature in HDF5 is not strictly
designed to handle portions of files, I am betting that if the
important stuff you need to read was written early in the file by the
data producer, and if no processor attempts to descend into portions of
the file beyond some arbitrary maximum offset (which you would have to
determine by guessing or something), then you could probably get away
with the following...

      * Open or fopen the file path on proc zero.
      * Read N bytes starting at offset 0 (you would have to guess at
        'N').
      * Broadcast the buffer to all processors.
      * Have all processors 'open' the buffer using HDF5's file image
        feature.

The open should succeed, and attempts to open groups or datasets that
exist in the first 'N' bytes should work fine. I am not sure what will
happen if a processor inadvertently attempts an HDF5 operation that
causes it to 'seek' past the N'th byte. But if all the sets and tables
you need exist within the first 'N' bytes, I am betting you could get
away with this.
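
Concretely, here is a rough (and completely untested) sketch of those
four steps. The buffer size N_GUESS and the helper name
open_prefix_everywhere() are made-up placeholders, and whether the open
tolerates an image that is only a prefix of the real file is exactly
the open question above. H5LTopen_file_image() is the high-level
wrapper (HDF5 1.8.9 or later) around the file image property calls:

/* Rough, untested sketch of the four steps above.  N_GUESS and
 * open_prefix_everywhere() are made-up names; error checking omitted. */
#include <mpi.h>
#include <hdf5.h>
#include <hdf5_hl.h>              /* H5LTopen_file_image() */
#include <stdio.h>
#include <stdlib.h>

#define N_GUESS (64*1024*1024)    /* guess: first 64 MiB of the file */

hid_t open_prefix_everywhere( const char* path, MPI_Comm comm )
{
    int   rank;
    char* buf   = (char*) malloc( N_GUESS );
    long  nread = 0;

    MPI_Comm_rank( comm, &rank );

    if (rank == 0) {
        /* proc zero reads the first N bytes starting at offset 0 */
        FILE* f = fopen( path, "rb" );
        if (f) {
            nread = (long) fread( buf, 1, N_GUESS, f );
            fclose( f );
        }
    }

    /* broadcast the byte count, then the buffer, to all processors */
    MPI_Bcast( &nread, 1, MPI_LONG, 0, comm );
    MPI_Bcast( buf, (int) nread, MPI_BYTE, 0, comm );

    /* every processor 'opens' the buffer as an HDF5 file image;
     * flags == 0 means the library copies the image and opens it
     * read-only, so we can free our own buffer right away */
    hid_t file = H5LTopen_file_image( buf, (size_t) nread, 0 );
    free( buf );

    return file;   /* use like any other file id; H5Fclose() when done */
}

If rank zero cannot read the file at all, nread stays zero and the
image open will (and should) fail everywhere.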

Mark

On Fri, 2012-10-19 at 16:40 -0500, Iulian Grindeanu wrote:
> ______________________________________________________________________
>         On Fri, Oct 19, 2012 at 03:16:46PM -0500, Iulian Grindeanu wrote:
>         > Hello Rob,
>         > I think that change has to happen in src/parallel/ReadParallel.cpp
>         > I am not sure yet though, Tim would confirm that
>         
>         Interesting.  What is this POPT_BCAST option?
>         
>         I don't want to change all of moab into 'rank 0 does i/o' --
>         obviously that's not going to scale.
>         
> This is the read/broadcast option, which, as you said, does not scale.
>         But for some of these inputs we are looking at datasets that
>         are not all that big:
>         
>         /intrepid-fs0/users/tautges/persistent/meshes/2bricks/nogeom/64bricks_8mtet_ng_rib_2048.h5m
>         has a 7995686 x 4 "connectivity" dataset, but I know from
>         talking with Jason that you are only pulling one column out of
>         this array, so 61 MiBytes.
> These are the connectivity arrays for ~8 million tetrahedral elements.
> We do not read them on all procs; every processor first needs to find
> out what subset of elements it has to read. The set information should
> then be enough to decide what portion of the connectivity array needs
> to be read on each processor.
> 
> Iulian
>         
>         ==rob
>         
>         > 
>         > ----- Original Message -----
>         > 
>         > | Tim knows all this but for the rest of the list, here's the
>         > | short story:
>         > | 
>         > | MOAB's HDF5 reader and writer have a problem on BlueGene
>         > | where it will collectively read in initial conditions or
>         > | write output, and run out of memory. This out-of-memory
>         > | condition comes from MOAB doing all the right things --
>         > | using HDF5, using collective I/O -- but the MPI-IO library
>         > | on Intrepid goes and consumes too much memory.
>         > | 
>         > | I've got one approach to deal with the MPI-IO memory issue
>         > | for writes. This approach would sort of work for the reads,
>         > | but what is really needed is for rank 0 to do the read and
>         > | broadcast the result to everyone.
>         > | 
>         > | So, I'm looking for a little help understanding MOAB's read
>         > | side of the code. Conceptually, all processes read the
>         > | table of entities.
>         > | 
>         > | A fairly small 'mbconvert' job will run out of memory:
>         > | 
>         > | 512 nodes, 2048 processors:
>         > | 
>         > | ======
>         > | NODES=512
>         > | CORES=$(($NODES * 4))
>         > | cd /intrepid-fs0/users/robl/scratch/moab-test
>         > | 
>         > | cqsub -t 15 -m vn -p SSSPP -e MPIRUN_LABEL=1:BG_COREDUMPONEXIT=1 \
>         > | -n $NODES -c $CORES /home/robl/src/moab-svn/build/tools/mbconvert \
>         > | -O CPUTIME -O PARALLEL_GHOSTS=3.0.1 -O PARALLEL=READ_PART \
>         > | -O PARALLEL_RESOLVE_SHARED_ENTS -O PARTITION -t \
>         > | -o CPUTIME -o PARALLEL=WRITE_PART \
>         > | /intrepid-fs0/users/tautges/persistent/meshes/2bricks/nogeom/64bricks_8mtet_ng_rib_${CORES}.h5m \
>         > | /intrepid-fs0/users/robl/scratch/moab/8mtet_ng-${CORES}-out.h5m
>         > | ======
>         > | 
>         > | I'm kind of stumbling around ReadHDF5::load_file and
>         > | ReadHDF5::load_file_partial trying to find a spot where a
>         > | collection of tags are read into memory. I'd like to,
>         > | instead of having all processors do the read, have just one
>         > | processor read and then send the tag data to the other
>         > | processors.
>         > | 
>         > | First, do I remember the basic MOAB concept correctly: that
>         > | early on every process reads the exact same tables out of
>         > | the (in this case HDF5) file?
>         > | 
>         > | If I want rank 0 to do all the work and send data to other
>         > | ranks, where's the best place to slip that in? It's been a
>         > | while since I did anything non-trivial in C++, so some of
>         > | these data structures are kind of greek to me.
>         > | 
>         > | thanks
>         > | ==rob
>         > | 
>         > | --
>         > | Rob Latham
>         > | Mathematics and Computer Science Division
>         > | Argonne National Lab, IL USA
>         > 
>         
>         -- 
>         Rob Latham
>         Mathematics and Computer Science Division
>         Argonne National Lab, IL USA
> 
> 
-- 
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86 at llnl.gov      urgent: miller86 at pager.llnl.gov
T:8-6 (925)-423-5901    M/W/Th:7-12,2-7 (530)-753-8511


