[petsc-dev] Writing rich state

Tue Feb 23 13:31:09 CST 2010

This takes the discussion in a somewhat tangential direction, but consider this:

We use hierarchical file systems, which are also a pain.
Say I'm working on the PETSc project and I'm writing a DOE proposal for it.
Should I put it in ~/PETSc/Proposals/DOE/proposal or
~/Proposals/DOE/PETSc/proposal or
~/Proposals/PETSc/DOE?
Later (3 months from now) I might want to come back and retrieve a
file from that proposal tree.
Where do I look for it?
Maybe I should have all of these paths, all but one being soft links
to the master path?
I've tried that.  It's a pain.

Basically, any hierarchical storage format, such as a file system, will
impose a tree structure on what is fundamentally a (hyper)graph.
GMail solves a similar problem by allowing multiple labels on a piece of email.
Then I can search on any or several of the labels: Proposals, DOE,
PETSc, irrespective of the order.
A file system imposes an artificial order.
You can think of labels as being the hyperedges in the hypergraph.

It would be nice to have a file system that functioned a bit like
GMail, I think.
In fact, I've thought about writing a Python replacement for 'ls',
that would list files with a given label or labels.   I'm too lazy and
incompetent, however.
In the simplest case the metadata could go right into the filename, but
maybe that's not a good thing to do in general.
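
For what it's worth, the label lookup itself is only a few lines. Here is a
minimal sketch, assuming the labels live in an in-memory index keyed by path
rather than in the filename. The names add_labels and ls_labels and the
example files are hypothetical, made up just for illustration:

```python
from typing import Dict, Iterable, List, Set


def add_labels(index: Dict[str, Set[str]], path: str, *labels: str) -> None:
    """Attach labels to a path; the order of the labels carries no meaning."""
    index.setdefault(path, set()).update(labels)


def ls_labels(index: Dict[str, Set[str]], labels: Iterable[str]) -> List[str]:
    """List the paths that carry every one of the requested labels."""
    wanted = set(labels)
    return sorted(p for p, tags in index.items() if wanted <= tags)


# Hypothetical files and labels, for illustration only.
index: Dict[str, Set[str]] = {}
add_labels(index, "proposal.tex", "PETSc", "Proposals", "DOE")
add_labels(index, "budget.xls", "Proposals", "DOE")
add_labels(index, "ksp_notes.txt", "PETSc")

# The same file comes back whichever order the labels are given in.
print(ls_labels(index, ["DOE", "PETSc"]))
print(ls_labels(index, ["PETSc", "DOE"]))
```

A real tool would have to persist the index somewhere (a sidecar file,
extended attributes, a small database); that choice is left open here.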

Dmitry.

On Tue, Feb 23, 2010 at 10:24 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>  I've thought about this but never done anything; I think it is worth
> investigating.
>
>  BTW: My long term goal is also that all PETSc source code lives in an
> appropriate database with appropriate relationships and meta-data stored
> there.
>
>  The fact that we (meaning HPC and OpenSource in general) use flat files so
> much shows a failure of something.
>
>   Barry
>
> On Feb 23, 2010, at 9:31 AM, Jed Brown wrote:
>
>> Matt and I talked about this a couple months ago, but I'd like to also
>> mention it here.  It seems to me that data formats like HDF5 are really
>> a pain to use for generic purposes, because you end up trying to map a
>> directed graph of object relations (composition) into a hierarchical
>> data format, and then implement relational queries on top of this
>> hierarchy.  (I've done this, to some extent, and I ended up writing
>> cumbersome code to walk this hierarchy to answer queries that would be
>> one-line SQL queries.)
>>
>> To elaborate slightly on the problem, the goal would be to write vectors
>> living on a DMComposite, with extra semantics like time step and units,
>> in a way that could be used for visualization as well as checkpoints for
>> forward and adjoint models.  PETSc's unadorned binary IO is fine if the
>> same code is going to read it back in, because everything will be wired
>> up correctly and we're just loading into a Vec (although it's already
>> somewhat tricky when the layout changes in the unstructured case).  But
>> there just isn't enough metadata to operate on in any sort of generic
>> way, and I hate writing custom code to describe meshes and relations
>> between them.
>>
>> Current scientific data formats (at least those I have seen) are a
>> hassle to use since they have poor support for expressing relations.
>> HDF5 has the equivalent of file-system symlinks, but after
>> normalization, all the relations end up being encoded as a bunch of
>> symlinks, which is a relatively low-level view and isn't a particularly
>> convenient thing to traverse when answering a query.
>>
>> So I'm curious if anyone has put such metadata into a relational
>> database instead of trying to contort it into one of these "scientific"
>> data formats.  My thought would be to drop only the metadata into
>> something like Sqlite, and write the arrays themselves using MPI-IO (or
>> HDF5/NetCDF/whatever, but these don't provide much when we aren't using
>> them for metadata).  This would allow efficient support of queries like
>> "all vector fields at step M" and "fields B and C from step M to N on
>> subdomains intersecting bounding box XYZ".  This isn't completely
>> different from what XDMF tries to do, but experimentation with that left
>> a sour taste.  Is SQL a stupid idea for this purpose and I'd be better
>> off writing code to support the queries I want on HDF5/XDMF/something
>> else?
>>
>> Jed
>
>
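
To make the queries in Jed's message concrete, here is a rough sketch of the
"metadata in Sqlite, arrays elsewhere" idea using Python's sqlite3 module.
The schema (tables field and subdomain, 2D bounding boxes, a path plus
byte_offset pointing at the raw array) and the sample rows are assumptions
made up for illustration; none of this is an existing PETSc format.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE subdomain (
    id   INTEGER PRIMARY KEY,
    xmin REAL, xmax REAL, ymin REAL, ymax REAL   -- 2D box, for brevity
);
CREATE TABLE field (
    id          INTEGER PRIMARY KEY,
    name        TEXT    NOT NULL,
    step        INTEGER NOT NULL,
    units       TEXT,
    subdomain   INTEGER NOT NULL REFERENCES subdomain(id),
    path        TEXT    NOT NULL,   -- file holding the raw array
    byte_offset INTEGER NOT NULL    -- where the array starts in that file
);
""")

# Hypothetical metadata: two subdomains, fields A, B, C at steps 0..2.
con.executemany("INSERT INTO subdomain VALUES (?, ?, ?, ?, ?)",
                [(0, 0.0, 1.0, 0.0, 1.0), (1, 1.0, 2.0, 0.0, 1.0)])
for step in range(3):
    for k, name in enumerate(["A", "B", "C"]):
        for sub in (0, 1):
            con.execute(
                "INSERT INTO field (name, step, units, subdomain, path,"
                " byte_offset) VALUES (?, ?, ?, ?, ?, ?)",
                (name, step, "m/s", sub, f"step{step:04d}.bin",
                 (2 * k + sub) * 8000))


def fields_at_step(con, m):
    """'All vector fields at step M'."""
    rows = con.execute(
        "SELECT DISTINCT name FROM field WHERE step = ? ORDER BY name", (m,))
    return [r[0] for r in rows]


def fields_in_box(con, names, m, n, box):
    """Named fields from step m to n on subdomains intersecting box."""
    xmin, xmax, ymin, ymax = box
    marks = ",".join("?" * len(names))
    sql = (
        "SELECT f.name, f.step, f.subdomain, f.path, f.byte_offset "
        "FROM field f JOIN subdomain s ON s.id = f.subdomain "
        f"WHERE f.name IN ({marks}) AND f.step BETWEEN ? AND ? "
        "AND s.xmin <= ? AND s.xmax >= ? AND s.ymin <= ? AND s.ymax >= ? "
        "ORDER BY f.step, f.name, f.subdomain")
    args = (*names, m, n, xmax, xmin, ymax, ymin)
    return con.execute(sql, args).fetchall()


print(fields_at_step(con, 1))
print(fields_in_box(con, ["B", "C"], 1, 2, (1.5, 2.5, 0.0, 0.5)))
```

Each returned row only says where an array lives; reading it back (MPI-IO,
HDF5, or anything else) is a separate step that this sketch does not cover.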