Self Describing Data

Life would be so much nicer if my notebook (I’ve been using Zim-Wiki for some years) could link to emails archived in my email application (Thunderbird)… Linking in the other direction would, of course, be too much to ask!

Now Zim-Wiki is somewhat scriptable in the sense that I could write a plugin that gets some configuration telling it where to find my email archives and that understands the mbox data-storage format used by Thunderbird, so it’s at least possible in some theoretical sense. Of course I’d have to delve into Python (which I know only a tiny little bit) and learn Zim’s Plugin API first… shouldn’t take more than a week or two.
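To get a feel for the size of the mbox half of that job, here is a minimal sketch using Python's standard-library mailbox module. It is not a Zim plugin and touches none of Zim's Plugin API. The function name find_message is made up for illustration, and using the Message-ID header as the link key is my assumption, not anything Zim or Thunderbird prescribes.

```python
import mailbox
from email.message import Message
from typing import Optional


def find_message(mbox_path: str, message_id: str) -> Optional[Message]:
    """Return the first message whose Message-ID header matches, else None.

    mbox_path is whatever mbox file the mail client keeps for a folder;
    where that file lives is configuration this sketch does not guess at.
    """
    # create=False: fail loudly instead of silently creating an empty file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            # The header may be missing, so fall back to an empty string.
            if (msg.get("Message-ID") or "").strip() == message_id:
                return msg
        return None
    finally:
        box.close()
```

Even this much only answers "which message?". Getting the notebook to open that message in the mail client is the expensive wiring the rest of this post grumbles about.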

So: it’s possible. But it’s quite a high bar to clear. Quite an expensive bit of wiring for just one small integration. And just think for a moment about how many integrations I’d really like to see…

It’s a pretty clear-cut example of where self-descriptive data formats would be Oh So Useful. Then software could shoulder (probably) 90% of the burden, and the whole project would become approachable, even if only by “expert” users. But as things stand… Not So Much!

Two strands of thought:

  1. There are two (more, actually, but for now…) kinds of programming:

    • wiring bits together — most of Enterprise programming — and
    • writing basic domains where we’re encoding/debugging/capturing domain knowledge and enabling manipulation/editing of the resulting model.
  2. The world would be a better place if data formats were self-descriptive, and if that self-description were machine-readable as well as human-readable. Second prize (though a fairly distant second) would be data that is merely “described”… somewhere… anywhere… *sobs quietly into pillow*.

Bringing these together, I begin to wish for data formats to imitate one of Smalltalk’s Big Wins: The meta-level (descriptive level) is entirely on the same basis as the functional level. Classes are merely Objects. The meta-level uses exactly the same mechanisms, syntax and conventions as the functional level. Could we do this for data? Because if we can, then self-description almost just falls out as a natural, inevitable and, above all, easy consequence.
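A toy sketch of what "the meta-level on the same basis as the functional level" might look like for data. Everything here is invented for illustration (the field-description shape, the names META_FIELDS, conforms, dump and load, and the payee/amount example); it is not an existing format. The point is only that the description travels with the data, is itself plain records, and is checked by the very same function that checks the data.

```python
import json

# Field descriptions are ordinary records. This list describes what a field
# description looks like, and it conforms to itself.
META_FIELDS = [
    {"name": "name", "type": "str"},
    {"name": "type", "type": "str"},
]

TYPES = {"str": str, "int": int, "float": float}


def conforms(record: dict, fields: list) -> bool:
    """True if record has exactly the described fields, with the described types."""
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(
        f["type"] in TYPES and isinstance(record[f["name"]], TYPES[f["type"]])
        for f in fields
    )


def dump(fields: list, records: list) -> str:
    """Serialise the records together with their own description."""
    return json.dumps({"fields": fields, "records": records})


def load(text: str) -> tuple:
    """Parse a self-describing document, checking both levels the same way."""
    doc = json.loads(text)
    fields, records = doc["fields"], doc["records"]
    # The description is validated with the same function as the data.
    if not all(conforms(f, META_FIELDS) for f in fields):
        raise ValueError("field descriptions are malformed")
    if not all(conforms(r, fields) for r in records):
        raise ValueError("records do not match their description")
    return fields, records


if __name__ == "__main__":
    fields = [{"name": "payee", "type": "str"}, {"name": "amount", "type": "float"}]
    print(load(dump(fields, [{"payee": "Grocer", "amount": 12.5}])))
```

Because dump always writes the description alongside the records, there is no separate step to forget, which is the "entirely automatic or consequential" property I am after.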

Convenience matters! If self-description is not easy to do, preferably entirely automatic or consequential, then self-description simply won’t happen. (And there is the problem with “described elsewhere” data. Plus it has a nasty tendency to get out of sync with the actual data format. See here for an example where the description metadata gets arbitrarily clobbered by something and it turns out to be non-trivial to figure out what’s going wrong and what to do about it.)

Oh, and I’d like to link from my notes to single transactions on my bank statements… (Lather, rinse and repeat the performance.)

And from emails to some of those transaction records… (ugh: Thunderbird plugins? Fuhgeddaboudit.)

Baby Steps, I keep reminding myself, Baby Steps

Q: What’s the simplest possible thing that might work?

A: I could save the emails/bank statements to some folder in my Home space and then link to those from wherever I want. But then we’re back to location-based addressing of the data, and, being a notoriously ADHD sort of person, I like to move old and stale stuff off into the Attic once in a while as a way to reduce clutter. (Clutter being defined as anything that plants a hook in my brain.) I could write a little cron job to do this automatically on some sort of sunsetting schedule, but then I’d break all those lovely links, wouldn’t I?

So the simplest thing that might work… probably won’t.

Advanced techies will, at this juncture, start talking about saving the data to some location, but only referring to it via symlinks or some such, and then swizzling the symlinks when you want to move the data around…
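For the record, the swizzling itself is not much code. A minimal sketch, assuming a POSIX filesystem and that every link from my notes goes through a symlink and never points at the real file; the function name move_behind_symlink is made up for illustration.

```python
import os
import shutil


def move_behind_symlink(link_path: str, new_dir: str) -> str:
    """Move the file that link_path points at into new_dir, then re-point the link."""
    old_target = os.path.realpath(link_path)
    new_dir = os.path.abspath(new_dir)
    os.makedirs(new_dir, exist_ok=True)
    new_target = os.path.join(new_dir, os.path.basename(old_target))
    shutil.move(old_target, new_target)
    # Build the replacement link beside the old one and rename it into place.
    # rename() acts on the link itself and is atomic on POSIX, so the link
    # never goes missing; it only dangles briefly between the move and here.
    tmp_link = link_path + ".tmp"
    os.symlink(new_target, tmp_link)
    os.replace(tmp_link, link_path)
    return new_target
```

A cron job could call this for every stale file it banishes to the Attic, and links that go through the symlink would keep working.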

There is no problem in computer science that cannot be solved by an extra level of indirection. —David Wheeler

But that misses the point. It’s not where the data lives that really matters, but where the access reference that gives entry to the data lives that matters — where the hook in my brain lives. It’s the addressing scheme that’s at fault.

Pervasive search is one good answer to this, but so far I’ve not seen much use of programmatically initiated search in practice, and what little I have seen has been horribly poor: inaccurate, slow, filled with false positives and unable to distinguish similarly named but otherwise quite different blobs of data. Still: as a starting point for a form of content-addressable data, it might work. It’s a good entry.
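To make "content-addressable" concrete: the reference stored in my notes would be a fingerprint of the content instead of a path, and a search turns the fingerprint back into wherever the file currently lives. A deliberately naive sketch, assuming SHA-256 as the fingerprint; the function names and the idea of passing Home and Attic as search roots are mine. It rehashes every file on every lookup, so it is exactly as slow as the searches I complain about above; anything real would keep an index.

```python
import hashlib
import os
from typing import Iterable, Optional


def digest(path: str) -> str:
    """SHA-256 of a file's bytes, read in chunks so large files are fine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def find_by_digest(wanted: str, roots: Iterable[str]) -> Optional[str]:
    """Return the first file under any root whose content hashes to wanted."""
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    if digest(path) == wanted:
                        return path
                except OSError:
                    # Unreadable file or dangling link: skip it.
                    continue
    return None
```

The link survives a move from Home to the Attic because it never mentioned either. It does not survive an edit to the file, which is fine for archived emails and bank statements but not for living documents.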

Post Script: It’s worth acknowledging that there are substantial works out there seeking to address this ‘self-description’ want. Efforts such as OpenAPI are certainly worth mentioning as steps in a good direction. Now we need to expand those efforts and not make the metadata be “elsewhere”…