[petsc-dev] Writing rich state

Tue Feb 23 09:31:32 CST 2010

Matt and I talked about this a couple months ago, but I'd like to also
mention it here.  It seems to me that data formats like HDF5 are really
a pain to use for generic purposes, because you end up trying to map a
directed graph of object relations (composition) into a hierarchical
data format, and then implement relational queries on top of this
hierarchy.  (I've done this, to some extent, and I ended up writing
cumbersome code to walk this hierarchy to answer queries that would be
one-line SQL queries.)

To elaborate slightly on the problem, the goal would be to write vectors
living on a DMComposite, with extra semantics like time step and units,
in a way that could be used for visualization as well as checkpoints for
forward and adjoint models.  PETSc's unadorned binary IO is fine if the
same code is going to read it back in, because everything will be wired
up correctly and we're just loading into a Vec (although it's already
somewhat tricky when the layout changes in the unstructured case).  But
there just isn't enough metadata to operate on in any sort of generic
way, and I hate writing custom code to describe meshes and relations
between them.

Current scientific data formats (at least those I have seen) are a
hassle to use since they have poor support for expressing relations.
HDF5 has the equivalent of file-system symlinks, but after
normalization, all the relations end up being encoded as a bunch of
symlinks, which is a relatively low-level view and isn't a particularly
convenient thing to traverse when answering a query.

So I'm curious if anyone has put such metadata into a relational
database instead of trying to contort it into one of these "scientific"
data formats.  My thought would be to drop only the metadata into
something like SQLite, and write the arrays themselves using MPI-IO (or
HDF5/NetCDF/whatever, but these don't provide much when we aren't using
them for metadata).  This would allow efficient support of queries like
"all vector fields at step M" and "fields B and C from step M to N on
subdomains intersecting bounding box XYZ".  This isn't completely
different from what XDMF tries to do, but experimentation with that left
a sour taste.  Is SQL a stupid idea for this purpose, and would I be
better off writing code to support the queries I want on
HDF5/XDMF/something else?
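
To make the idea concrete, here is a minimal sketch of what such a
metadata catalog might look like, using Python's built-in sqlite3
module.  Everything in it is an assumption for illustration: the table
names (field, subdomain, block), the column layout, the file name
state.bin, and the helper functions are all hypothetical, and the
arrays themselves are assumed to be written separately (e.g. with
MPI-IO), with only their file, byte offset, and length recorded here.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Create the hypothetical metadata catalog; array data lives elsewhere."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS field (
            id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL, units TEXT);
        CREATE TABLE IF NOT EXISTS subdomain (
            id INTEGER PRIMARY KEY,
            xmin REAL, xmax REAL, ymin REAL, ymax REAL, zmin REAL, zmax REAL);
        -- One row per (field, subdomain, step): where the raw array is stored.
        CREATE TABLE IF NOT EXISTS block (
            field_id     INTEGER NOT NULL REFERENCES field(id),
            subdomain_id INTEGER NOT NULL REFERENCES subdomain(id),
            step         INTEGER NOT NULL,
            sim_time     REAL,
            file         TEXT NOT NULL,
            byte_offset  INTEGER NOT NULL,
            nbytes       INTEGER NOT NULL,
            PRIMARY KEY (field_id, subdomain_id, step));
    """)
    return db


def fields_at_step(db, step):
    """'All vector fields at step M'."""
    rows = db.execute(
        "SELECT DISTINCT f.name FROM block b "
        "JOIN field f ON f.id = b.field_id "
        "WHERE b.step = ? ORDER BY f.name", (step,))
    return [r[0] for r in rows]


def blocks_in_box(db, names, first, last, box):
    """'Fields B and C from step M to N on subdomains intersecting a box'."""
    xmin, xmax, ymin, ymax, zmin, zmax = box
    marks = ",".join("?" * len(names))  # one placeholder per field name
    sql = f"""
        SELECT f.name, b.step, b.subdomain_id, b.file, b.byte_offset, b.nbytes
        FROM block b
        JOIN field f     ON f.id = b.field_id
        JOIN subdomain s ON s.id = b.subdomain_id
        WHERE f.name IN ({marks})
          AND b.step BETWEEN ? AND ?
          AND s.xmin <= ? AND s.xmax >= ?   -- interval overlap in x
          AND s.ymin <= ? AND s.ymax >= ?   -- interval overlap in y
          AND s.zmin <= ? AND s.zmax >= ?   -- interval overlap in z
        ORDER BY b.step, f.name, b.subdomain_id
    """
    args = [*names, first, last, xmax, xmin, ymax, ymin, zmax, zmin]
    return db.execute(sql, args).fetchall()


def demo_catalog():
    """Populate a tiny made-up catalog: 3 fields, 2 subdomains, 3 steps."""
    db = open_catalog()
    db.executemany("INSERT INTO field(id, name, units) VALUES (?,?,?)",
                   [(1, "A", "m/s"), (2, "B", "Pa"), (3, "C", "K")])
    db.executemany("INSERT INTO subdomain VALUES (?,?,?,?,?,?,?)",
                   [(0, 0, 1, 0, 1, 0, 1), (1, 1, 2, 0, 1, 0, 1)])
    nbytes, offset = 800, 0
    for step in range(3):
        for fid in (1, 2, 3):
            for sid in (0, 1):
                db.execute("INSERT INTO block VALUES (?,?,?,?,?,?,?)",
                           (fid, sid, step, 0.1 * step,
                            "state.bin", offset, nbytes))
                offset += nbytes
    return db


if __name__ == "__main__":
    db = demo_catalog()
    print(fields_at_step(db, 1))
    box = (1.5, 1.8, 0.2, 0.4, 0.2, 0.4)
    for row in blocks_in_box(db, ["B", "C"], 1, 2, box):
        print(row)
```

Each row returned by blocks_in_box says which file and byte range to
read, so the reader can hand that straight to whatever does the actual
array IO.  The bounding-box test here is a plain interval-overlap
check on per-subdomain extents; a real mesh would likely need
something smarter, but the point is that both example queries reduce
to a single SELECT.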

Jed