[petsc-dev] Writing rich state

Tue Feb 23 13:40:41 CST 2010

    With Google (and Spotlight on the Mac) is there any need to
organize anything anymore? Just burp down the data any way you please
anywhere you want it and then have smart search tools find it for you
and format it the way you need it at the time you need it? This does
mean you need decent tools to parse random stuff so that the search can
understand it.

    Ironically, in the past few years with Spotlight on my Mac I
actually do a better job of organizing my home directory structure
than I ever have before.

    Barry

On Feb 23, 2010, at 1:31 PM, Dmitry Karpeev wrote:

> This takes the discussion in a somewhat tangential direction, but  
> consider this:
>
> We use hierarchical file systems, which are also a pain.
> Say I'm working on project PETSc and I'm writing a DOE proposal
> for it.
> Should I put it in ~/PETSc/Proposals/DOE/proposal or
> ~/Proposals/DOE/PETSc/proposal or
> ~/Proposals/PETSc/DOE?
> Later (3 months from now) I might want to come back and retrieve a
> file from that proposal tree.
> Where do I look for it?
> Maybe I should have all of these paths, all but one being soft links
> to the master path?
> I've tried that.  It's a pain.
>
> Basically, any hierarchical storage format, such as a file system,
> will impose a tree structure on what is fundamentally a (hyper)graph.
> GMail solves a similar problem by allowing multiple labels on a
> piece of email.
> Then I can search on any or several of the labels: Proposals, DOE,
> PETSc, irrespective of the order.
> A file system imposes an artificial order.
> You can think of labels as being the hyperedges in the hypergraph.
>
> It would be nice to have a file system that functioned a bit like
> GMail, I think.
> In fact, I've thought about writing a Python replacement for 'ls'
> that would list files with a given label or labels. I'm too lazy and
> incompetent, however.
> In the simplest case the metadata could go right into the filename,
> but maybe that's not a good thing to do in general.
>
>
> Dmitry.
>
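
The label-based 'ls' that Dmitry describes could be sketched roughly as
follows. This is only an illustration under stated assumptions, not
anything that exists in PETSc or elsewhere: the labels live in a JSON
sidecar file rather than in the filename, and the function names
(load_index, save_index, add_labels, list_by_labels) are hypothetical.

```python
import json
from pathlib import Path


def load_index(index_path):
    """Return the {file path: [labels]} mapping stored in a JSON sidecar file."""
    path = Path(index_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def save_index(index, index_path):
    """Write the label mapping back to the JSON sidecar file."""
    Path(index_path).write_text(json.dumps(index, indent=2, sort_keys=True))


def add_labels(index, file_path, *labels):
    """Attach labels to a file; order is irrelevant and duplicates are ignored."""
    current = set(index.get(file_path, []))
    current.update(labels)
    index[file_path] = sorted(current)


def list_by_labels(index, *labels):
    """Return, sorted, every file that carries all of the requested labels."""
    wanted = set(labels)
    return sorted(f for f, have in index.items() if wanted <= set(have))


if __name__ == "__main__":
    index = {}
    add_labels(index, "proposal.tex", "PETSc", "Proposals", "DOE")
    add_labels(index, "budget.xls", "Proposals", "DOE")
    # The order of the labels in the query does not matter.
    print(list_by_labels(index, "DOE", "PETSc"))   # ['proposal.tex']
    print(list_by_labels(index, "PETSc", "DOE"))   # ['proposal.tex']
```

Because each file maps to a set of labels, there is no "master path" to
choose and no soft links to maintain; a query is a subset test on the
labels.
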
> On Tue, Feb 23, 2010 at 10:24 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>>  I've thought about this but never done anything; I think it is worth
>> investigating.
>>
>>  BTW: My long term goal is also that all PETSc source code lives in
>> an appropriate database with appropriate relationships and meta-data
>> stored there.
>>
>>  The fact that we (meaning HPC and OpenSource in general) use flat
>> files so much shows a failure of something.
>>
>>   Barry
>>
>> On Feb 23, 2010, at 9:31 AM, Jed Brown wrote:
>>
>>> Matt and I talked about this a couple months ago, but I'd like to
>>> also mention it here.  It seems to me that data formats like HDF5
>>> are really a pain to use for generic purposes, because you end up
>>> trying to map a directed graph of object relations (composition)
>>> into a hierarchical data format, and then implement relational
>>> queries on top of this hierarchy.  (I've done this, to some extent,
>>> and I ended up writing cumbersome code to walk this hierarchy to
>>> answer queries that would be one-line SQL queries.)
>>>
>>> To elaborate slightly on the problem, the goal would be to write
>>> vectors living on a DMComposite, with extra semantics like time step
>>> and units, in a way that could be used for visualization as well as
>>> checkpoints for forward and adjoint models.  PETSc's unadorned binary
>>> IO is fine if the same code is going to read it back in, because
>>> everything will be wired up correctly and we're just loading into a
>>> Vec (although it's already somewhat tricky when the layout changes in
>>> the unstructured case).  But there just isn't enough metadata to
>>> operate on in any sort of generic way, and I hate writing custom code
>>> to describe meshes and relations between them.
>>>
>>> Current scientific data formats (at least those I have seen) are a
>>> hassle to use since they have poor support for expressing relations.
>>> HDF5 has the equivalent of file-system symlinks, but after
>>> normalization, all the relations end up being encoded as a bunch of
>>> symlinks, which is a relatively low-level view and isn't a
>>> particularly convenient thing to traverse when answering a query.
>>>
>>> So I'm curious if anyone has put such metadata into a relational
>>> database instead of trying to contort it into one of these
>>> "scientific" data formats.  My thought would be to drop only the
>>> metadata into something like SQLite, and write the arrays themselves
>>> using MPI-IO (or HDF5/NetCDF/whatever, but these don't provide much
>>> when we aren't using them for metadata).  This would allow efficient
>>> support of queries like "all vector fields at step M" and "fields B
>>> and C from step M to N on subdomains intersecting bounding box XYZ".
>>> This isn't completely different from what XDMF tries to do, but
>>> experimentation with that left a sour taste.  Is SQL a stupid idea
>>> for this purpose and I'd be better off writing code to support the
>>> queries I want on HDF5/XDMF/something else?
>>>
>>> Jed
>>
>>
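
For concreteness, the metadata-in-SQLite idea from Jed's message could
look something like the sketch below, using Python's standard sqlite3
module. Everything about the schema is an assumption made for
illustration: the table and column names, the axis-aligned bounding box
stored per subdomain, and the (path, byte_offset) pair pointing at raw
arrays that would be written separately (for example with MPI-IO). None
of it is PETSc API.

```python
import sqlite3

# Hypothetical schema: only metadata is stored here; the arrays themselves
# live in separate files referenced by path and byte offset.
SCHEMA = """
CREATE TABLE subdomain (
    id INTEGER PRIMARY KEY,
    xmin REAL, xmax REAL, ymin REAL, ymax REAL, zmin REAL, zmax REAL
);
CREATE TABLE vector (
    id INTEGER PRIMARY KEY,
    field TEXT NOT NULL,
    units TEXT,
    step INTEGER NOT NULL,
    time REAL,
    subdomain_id INTEGER NOT NULL REFERENCES subdomain(id),
    path TEXT NOT NULL,
    byte_offset INTEGER NOT NULL
);
"""


def open_catalog(db_path=":memory:"):
    """Create a new metadata catalog and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def fields_at_step(conn, step):
    """Answer 'all vector fields at step M'."""
    rows = conn.execute(
        "SELECT DISTINCT field FROM vector WHERE step = ? ORDER BY field",
        (step,),
    )
    return [row[0] for row in rows]


def vectors_in_box(conn, fields, first_step, last_step, box):
    """Answer 'fields B and C from step M to N on subdomains intersecting
    a bounding box', with box = (xmin, xmax, ymin, ymax, zmin, zmax)."""
    marks = ",".join("?" * len(fields))
    sql = f"""
        SELECT v.field, v.step, v.path, v.byte_offset
        FROM vector AS v JOIN subdomain AS s ON s.id = v.subdomain_id
        WHERE v.field IN ({marks})
          AND v.step BETWEEN ? AND ?
          AND s.xmin <= ? AND s.xmax >= ?
          AND s.ymin <= ? AND s.ymax >= ?
          AND s.zmin <= ? AND s.zmax >= ?
        ORDER BY v.step, v.field, v.subdomain_id
    """
    xmin, xmax, ymin, ymax, zmin, zmax = box
    # Two boxes overlap when, on each axis, each one starts before the
    # other ends.
    params = [*fields, first_step, last_step,
              xmax, xmin, ymax, ymin, zmax, zmin]
    return conn.execute(sql, params).fetchall()
```

Each query returns only locations of arrays, so a reader would still
need to seek to byte_offset in path and load the data itself. Whether
this scales to many subdomains and steps is untested here.
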