Dissolution of metadata

03 July 2015

The idea of metadata used to be easy. It was a type of shadow object that trailed the “real” object of which it was the metadata. Getting right which information to put into that shadow object wasn’t easy, but the concept itself was clean, clear and usually rectangular.

Now the rectangle is dissolving, which makes metadata both more useful and harder to use, although it will get easier over time.

The canonical example of rectangular metadata is the library catalog card. It contains information deemed useful for people browsing for books or trying to locate the ones they already know they want. Developing that metadata took generations of skilled professionals; it is all too easily undervalued.

That work was necessary because metadata used to be conveyed by physical objects that were usually smaller than the objects they described, both physically and in the amount of information they could hold. Not very much fits on a library catalog card.

When Google Books started digitizing tens of millions of volumes, they seemed to take a different view. They scraped what metadata they could from the title pages of the books they scanned, but didn’t seem overly impressed by the sort of metadata that librarians have worked so hard on. Presumably that is because when you can easily search all the words in a book, all the words become metadata. Why worry about a paltry index card’s worth of data?

Well, perhaps because there are key terms that may not appear in the book itself. The word “computer” does not appear in the memoirs of Charles Babbage, who is often considered to be the great-great-great-great grandfather of the modern computer. Moby-Dick does not contain the word “fiction,” much less “American classic.” But search engines these days are smarter than that; if you search Google for “‘American classic’ fiction whale,” Moby-Dick is the first result.

And now Google Books has added what we used to call a “word cloud,” showing the hundred or so most used, statistically interesting words in each book. “Sperm whale” and “Captain Ahab” are big in Moby-Dick. Against this we might compare the subject headings under which the Library of Congress classifies each work it reviews, up to 10 per work.

The Library of Congress’ headings are carefully derived and carefully applied by human experts. They have lasting value, but I think algorithmic metadata is likely to dominate, for three reasons: First, algorithms scale. Second, they get smarter the more they are developed and the more data they have to play with. Third, they can use the Library of Congress’ classifications as a way of making themselves smarter. And they’d be smart to do so.

In any event, this is one example of metadata dissolving. In this case, metadata has become the informational content of the object itself.

Linked Data gives us another example of the dissolution of those shadow objects. For example, all of the information in library catalog records can be turned into three-part Linked Data statements on the order of “Melville wrote Moby-Dick” and “Moby-Dick was written in 1851.” Into this stew can be stirred as many other triples as can be found, including perhaps “Melville was friends with Hawthorne” and “Hawthorne wrote The Scarlet Letter.” We can then query this soup, hopping from Melville to Hawthorne to Puritanism to Elizabeth I to wherever the links take us.
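
To make the hopping concrete, here is a minimal sketch in Python. It is not how any real Linked Data store works; it assumes the statements above are held as plain (subject, predicate, object) tuples in memory, and the names TRIPLES, build_index and reachable are invented for illustration.

```python
from collections import defaultdict

# Illustrative triples taken from the statements above,
# written as (subject, predicate, object).
TRIPLES = [
    ("Melville", "wrote", "Moby-Dick"),
    ("Moby-Dick", "was written in", "1851"),
    ("Melville", "was friends with", "Hawthorne"),
    ("Hawthorne", "wrote", "The Scarlet Letter"),
]


def build_index(triples):
    """Group the triples by subject so links can be followed quickly."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def reachable(index, start):
    """Return everything reachable from start, in breadth-first order."""
    seen = [start]
    queue = [start]
    while queue:
        node = queue.pop(0)
        for _predicate, obj in index.get(node, []):
            if obj not in seen:
                seen.append(obj)
                queue.append(obj)
    return seen


index = build_index(TRIPLES)
print(reachable(index, "Melville"))
# ['Melville', 'Moby-Dick', 'Hawthorne', '1851', 'The Scarlet Letter']
```

Notice that nothing in the sketch is a "Moby-Dick record." There is only a pile of statements, and the rectangle appears only if a query chooses to assemble it.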

The old concept of a record dissolves in Linked Data. The information is peeled off the index card and thrown into a semantic primal soup. It’s all there, it’s just not shoved into the Moby-Dick or Melville rectangle. You could reconstitute those rectangles but why would you want to?

There is at least one more way in which metadata is being dissolved. The move toward microformats—supercharged by the Schema.org initiative backed by the major search engines—embeds metadata into HTML pages themselves. A page about Moby-Dick might contain metadata that notes who the author of the page is, what sort of document it is (an academic article? a review? a whaling manual?), when it was written and what its topics are. This metadata is hidden in the HTML markup that typically tells browsers how to display the elements of the page. Schema.org also lets an author indicate the roles of the various elements on the page: This is a list of ingredients, that is a number indicating a rating, etc.
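
As a rough illustration of what that hidden markup looks like to a program, the sketch below uses Python's standard html.parser module to pull out values marked with the itemprop attribute, one of the ways Schema.org terms can be embedded in a page. The sample page, its author name and its date are made up, and the ItempropCollector class and extract_properties function are hypothetical helpers, far simpler than what a real search engine does.

```python
from html.parser import HTMLParser

# A made-up page fragment; the names and values are illustrative only.
PAGE = """
<article itemscope itemtype="https://schema.org/Review">
  <h1 itemprop="name">Rereading Moby-Dick</h1>
  <span itemprop="author">A. Reader</span>
  <meta itemprop="datePublished" content="2015-07-03">
  <p itemprop="reviewBody">The whale wins.</p>
</article>
"""


class ItempropCollector(HTMLParser):
    """Collect itemprop values. Simplification: ignores nested items."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # property whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "content" in attrs:
            # Invisible metadata: the value lives in the attribute.
            self.properties[prop] = attrs["content"]
        else:
            # Visible metadata: the value is the element's text.
            self._current = prop
            self.properties[prop] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


def extract_properties(html):
    collector = ItempropCollector()
    collector.feed(html)
    return {key: value.strip() for key, value in collector.properties.items()}


print(extract_properties(PAGE))
```

A human reader sees a headline, a byline and a paragraph. The program sees a review with a name, an author, a publication date and a body.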

Embedded metadata is nothing new. The phrase “Table of Contents” at the top of the page in an old-fashioned book is metadata telling us the nature of the text that follows. The difference is that electronic documents let us insert metadata hidden from human eyes that is useful for browsers, search engines and other computer programs.

As the line between metadata and what it’s about is dissolved, our ability to find, use and make sense of information grows. It turns out those old rectangles were holding us back.