Understanding big data vs. theory

30 October 2012

We don't need to understand what we know in order for that knowledge to be useful. For example, knowing that aspirin is good for headaches helps when you have a headache, even though for a long time we didn't understand why. Once you understand that "Aspirin inhibits prostaglandin production by inhibiting the COX-2 enzyme" (as eHow.com tells me at http://goo.gl/dJipf), that particular itch is scratched ... assuming you understand the explanation, which I do not. But understanding doesn't just make us feel better. Armed with this understanding of aspirin, you now know to look for other compounds that inhibit that ol' COX-2 more effectively or with fewer side effects. Understanding is a good thing all around.

The problem is that knowledge often outpaces understanding. In the Age of the Net, if we want our knowledge to get very, very big, it is going to blow far past our understanding, and we aren't going to be able to afford to wait around for understanding to catch up.

Brave words. But there are real risks. For example, Alex "Sandy" Pentland is a computational social scientist who has been writing and thinking about the effect of big data, especially on understanding human behavior. In a recent interview at The Edge (http://goo.gl/lqXIK), he said:

"The other problem with big data is human understanding. When you find a connection that works, you'd like to be able to use it to build new systems, and that requires having human understanding of the connection. The managers and the owners have to understand what this new connection means. There needs to be a dialogue between our human intuition and the big data statistics, and that's not something that's built into most of our management systems today. Our managers have little concept of how to use big data analytics, what they mean, and what to believe."

Having a "dialogue" between humans and big data so that we can understand the significance of the patterns we've observed sounds like a useful process. As Charles Gross, chair of the Lovell Group, suggested at an Aspen Institute event in 2010, "The use of data for correlations allows one to test theories and refine them." Certainly sometimes that will happen. And when it does, it's good.

But there need not always be human-comprehensible theories behind correlations of big data. We should get used to the idea that we're going to know some things without understanding them. We will be able to reliably predict outcomes, but not understand exactly what drove those outcomes. We can predict because we can build intricate computer models using metric tons of data. But, as an issue of Wired a few years ago pointed out, we can do this without having a theory that explains why the model works the way it does.

There is a metaphysical question here: Are there theories for everything? Perhaps some events are the product of chaotic combinations for which there is no real explanation other than describing how each dust mote banged into another—a task for big data.

If you were to ask me, I'd go with the complexity-without-theory idea at least for much of what we care about. And to support this, I will give you a single example, but it's a particularly important one: the life of anyone you know. Try to make sense of it as if you were writing a biography. You'll find some patterns ("He tried to live up to his parents' impossible expectations," "She saw adversity as a challenge"), but no theory for that individual with much predictive power, and you're unlikely to find a theory for human lives overall. If you want to tell the story of someone, you'll be talking about a whole lot of data—circumstances, events—and not a lot of theory. We can know a lot about a person, but not have a theory that enables us to understand her fully.

And on the scale of what we humans want to know, a single life is nothing. It is a dust mote's worth of order in a universe of ever-expanding disorder. Our understanding will never keep up with our knowledge.