[MOAB-dev] adding 'read and broadcast' to HDF5 reader

Tim Tautges tautges at mcs.anl.gov
Fri Oct 19 20:24:10 CDT 2012


In practice, the read process is more complicated than that.  The two things that complicate it the most are:
- figuring out what to read on a given proc requires reading the tags, then reading some of the sets
- the partition sets, as they are now, contain only the highest-dimensional entities in a given part, with the vertices 
and other entities to be read on that proc (and the sets to be represented locally) dependent on those

For the 2nd item, I've added the capability in mbpart to put *all* the entities to be represented on a given part into 
the part set, but I haven't updated the reader to take advantage of that information.  Going forward, I don't think it's 
necessary to maintain backward compatibility with that feature, though.

Anyway, attached is a pretty detailed description of the ReadHDF5 load_file function and the functions it calls.  To answer 
another question or two: the current code does open the file on the root proc only, then reads and broadcasts the file 
header information.
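
For what it's worth, that summary step is just the usual read-and-broadcast idiom.  A minimal sketch (the function and 
variable names here are illustrative, not the actual ReadHDF5 members):

======
#include <mpi.h>
#include <vector>

// Root has already opened the file serially and packed the header tables
// into `summary'; the other procs arrive with an empty buffer.
void broadcast_file_summary(MPI_Comm comm, std::vector<char>& summary)
{
  int rank;
  MPI_Comm_rank(comm, &rank);

  unsigned long size = (rank == 0) ? summary.size() : 0;
  MPI_Bcast(&size, 1, MPI_UNSIGNED_LONG, 0, comm);          // first the size...

  if (rank != 0)
    summary.resize(size);
  MPI_Bcast(summary.data(), (int)size, MPI_BYTE, 0, comm);  // ...then the bytes
}
======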

As Iulian implies, you can request the read_delete or bcast_delete mode, where the whole file is either read on each proc, 
or read on the root proc and then broadcast; each proc then instantiates the whole model and deletes the entities not 
in its partition.  But that does require instantiating the whole model in memory, which limits the total file size 
readable with those options.  The read_part method is what's used to read only the local mesh on each proc.
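
For reference, selecting among those modes happens through the load_file option string.  Roughly like the following -- the 
READ_PART spellings match the mbconvert command in Rob's mail; BCAST_DELETE is from memory, so check the user's guide for 
the authoritative list:

======
#include "moab/Core.hpp"
#include <mpi.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  if (argc < 2) { MPI_Finalize(); return 1; }

  moab::Core mb;

  // read_part: each proc reads only the mesh in its own partition sets
  moab::ErrorCode rval = mb.load_file(argv[1], 0,
      "PARALLEL=READ_PART;PARTITION;PARALLEL_RESOLVE_SHARED_ENTS");

  // bcast_delete alternative (option spelling from memory): root reads the
  // whole file and broadcasts it, then each proc deletes everything outside
  // its partition, so the whole model has to fit in memory on every proc.
  // rval = mb.load_file(argv[1], 0, "PARALLEL=BCAST_DELETE;PARTITION");

  MPI_Finalize();
  return (rval == moab::MB_SUCCESS) ? 0 : 1;
}
======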

These options are described in the user's guide in doc/, I think in section 5.

- tim

On 10/19/2012 03:01 PM, Rob Latham wrote:
> Tim knows all this but for the rest of the list, here's the short story:
>
> MOAB's HDF5 reader and writer have a problem on BlueGene where they will
> collectively read in initial conditions or write output, and run out
> of memory.  This out-of-memory condition comes from MOAB doing all the
> right things -- using HDF5, using collective I/O -- but the MPI-IO
> library on Intrepid goes and consumes too much memory.
>
> I've got one approach to deal with the MPI-IO memory issue for writes.
> This approach would sort of work for the reads, but what is really
> needed is for rank 0 to do the read and broadcast the result to
> everyone.
>
> So, I'm looking for a little help understanding MOAB's read side of
> the code.  Conceptually, all processes read the table of entities.
>
> A fairly small 'mbconvert' job will run out of memory:
>
> 512 nodes, 2048 processors:
>
> ======
> NODES=512
> CORES=$(($NODES * 4))
> cd /intrepid-fs0/users/robl/scratch/moab-test
>
> cqsub -t 15 -m vn -p SSSPP -e MPIRUN_LABEL=1:BG_COREDUMPONEXIT=1 \
>          -n $NODES -c  $CORES /home/robl/src/moab-svn/build/tools/mbconvert\
>          -O CPUTIME -O PARALLEL_GHOSTS=3.0.1 -O PARALLEL=READ_PART \
>          -O PARALLEL_RESOLVE_SHARED_ENTS -O PARTITION -t \
>          -o CPUTIME -o PARALLEL=WRITE_PART /intrepid-fs0/users/tautges/persistent/meshes/2bricks/nogeom/64bricks_8mtet_ng_rib_${CORES}.h5m \
>          /intrepid-fs0/users/robl/scratch/moab/8mtet_ng-${CORES}-out.h5m
> ======
>
> I'm kind of stumbling around ReadHDF5::load_file and
> ReadHDF5::load_file_partial trying to find a spot where a collection
> of tags are read into memory.  I'd like to, instead of having all
> processors do the read, have just one processor read and then send the
> tag data to the other processors.
>
> First, do I remember the basic MOAB concept correctly: that early on
> every process reads the exact same tables out of the (in this case
> HDF5) file?
>
> If I want rank 0 to do all the work and send data to other ranks,
> where's the best place to slip that in?  It's been a while since I did
> anything non-trivial in C++, so some of these data structures are kind
> of Greek to me.
>
> thanks
> ==rob
>

-- 
================================================================
"You will keep in perfect peace him whose mind is
   steadfast, because he trusts in you."               Isaiah 26:3

              Tim Tautges            Argonne National Laboratory
          (tautges at mcs.anl.gov)      (telecommuting from UW-Madison)
  phone (gvoice): (608) 354-1459      1500 Engineering Dr.
             fax: (608) 263-4499      Madison, WI 53706

-------------- next part --------------
load_file:
- set_up_read
  . parse options, allocate buffer
  . if bcastsummary && root:
    - get file info (vert/elem/set/tag table structures, etc.)
    - close file
  . bcast size of file info
  . bcast file info
  . open file with set_fapl_mpio, set_dxpl_mpio(collective)
  . if options given, set hyperslab selection limit / append behavior
- read_all_set_meta
  . if root, mhdf_readSetMetaWithOpt (read nx4 table with indices for contents/par/child/opts, n sets)
  . bcast set meta info
- load_file_partial:
  . get_subset_ids:
    - if (values not specified): get_tagged_entities:
      . for dense indices (entity type/#verts sequences with that tag), add all entities to file ids
      . if sparse data:
        - open sparse data table
        - while (remaining) read chunks, store in file ids
  . get_partition: filter file ids to just ones read on this proc
  . read_set_ids_recursive (sets):
    - open set data table
    - ReadHDF5Dataset.init(content_handle) (initialize window into dataset?)
    - ReadHDF5Dataset.init(child_handle) (initialize window into dataset?)
    - do: read_set_data while more children are read
  . get_set_contents (sets):
    - read_set_data (content), passing file ids out:
      . construct range of offsets into data table for the set
      . read contents in chunks determined by reader
      . convert file ids to entity handles, add to set
  . compute maximum dimension, MPI_Allreduce
  . for all element tables, get polyhedra, read_elems
  . for each element type/num_verts combination that's not polyhedra type, read_elems:
    - mhdf_openConnectivitySimple: open this connectivity table
    - allocate entities, get ptr to connectivity
    - set file ids, with chunk size determined by buffer size
    - read in chunks
    - if node_ids array passed to read_elems, also copy read node ids to passed array
    - insert elem ids into map
  . read_nodes (range of file ids for nodes):
    - open node coordinates table
    - allocate nodes, get ptrs to coords arrays
    - if (blocked)
      . for each dimension 0..dim-1:
        - set column #, file ids, with chunk size determined by reader
        - read in chunks
    - else (interleaved, default)
      . set file ids, with chunk size determined by buffer size
      . read in chunks determined by buffer size, assign into (blocked) storage in MOAB
    - insert node ids in id map

  . for each elem sequence with dim between 1 and max_dim-1: read_node_adjacent_elems:
    - while (remaining):
      . read chunk of connectivity
      . for each element, check whether all of its nodes were read; if not, mark its start vertex so the element is not
        created, else increment the number of entities
      . create number of entities
      . go back through connectivity, copying connectivity into new connect array
      . insert created entities into output handle list
  . update_connectivity:
    - convert all stored connectivity lists from file id to vertex handles
  . read_adjacencies
  . find_sets_containing (sets):
    - scan set contents, using either read-bcast or collective read
    - for each set, if contents ids intersects with file ids, add set id to file ids
  . read_sets:
    - create sets
    - read_set_data (contents):
      . construct range of offsets into data table for the set
      . read contents in chunks determined by reader
      . convert file ids to entity handles, add to set
    - read_set_data (children)
    - read_set_data (parents)
  . for each tag, read_tag:
    - create the tag
    - if sparse, read_sparse_tag:
      . read_sparse_tag_indices (read ids in chunks, convert to handles)
    - else if var length, read_var_len_tag
    - else, for each dense index (index=entity type + #verts):
      . get ptr to node/set/element description in file info
      . read_dense_tag:
        - get entities in this table actually read
        - create reader for this table, set file ids of items to read
        - while !done:
          . read data into buffer
          . if handle type tag, convert to handles
          . compute handles of entities read in this chunk
          . set tag data for those entities
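
(Not existing code, just a sketch of where a read-and-broadcast could be slipped in: wherever all procs need the same
table -- set meta, tag indices, the set-contents scan -- the chunked H5Dread could be wrapped so only rank 0 touches the
file and everyone else gets the bytes via MPI_Bcast.  Names below are illustrative.)

======
#include <hdf5.h>
#include <mpi.h>

/* Read a (hyperslab of a) dataset on rank 0 only and broadcast the result.
 * `bytes' is the size of the selection in memory; every proc passes the same
 * arguments, but only rank 0 actually hits the file. */
herr_t read_and_bcast(hid_t dset, hid_t mem_type, hid_t mem_space,
                      hid_t file_space, void* buffer, size_t bytes,
                      MPI_Comm comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);

  herr_t err = 0;
  if (rank == 0)
    err = H5Dread(dset, mem_type, mem_space, file_space, H5P_DEFAULT, buffer);

  MPI_Bcast(&err, 1, MPI_INT, 0, comm);   /* agree on success first */
  if (err < 0)
    return err;

  MPI_Bcast(buffer, (int)bytes, MPI_BYTE, 0, comm);
  return err;
}
======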
	
	

