The fervor over big data has largely focused on the sheer number of data points now at our disposal, from which ever-more specific and powerful analytic insights can supposedly be drawn. But managing the volume of computation is not the biggest challenge. The biggest challenge is what I call 2^N Analytics – creating knowledge within the proliferating combinations of data that can be analyzed. As I’ll show, even very small data sets can be impossible to analyze exhaustively. The challenge now, as it has always been, is developing analysis and knowledge without the ability to compute it all.
In computer science, there is a class of problems called NP-complete. These are problems where the computations would take so long to perform that doing them at a useful scale is computationally intractable. One such problem is finding cliques in a network. (Cliques are groups of people, or nodes, in which everyone is connected to everyone else.) To solve this problem, you literally have to check every possible combination of nodes to determine whether the nodes are mutually connected to one another. Mathematically, this requires 2^N computations: every node added to the network doubles the number of combinations to check (it’s actually (2^N)-1, but I’m rounding here). In a network of three people, a computer must check 2^3, or 8, combinations. In a network of just 300 nodes, the computer must check 2^300, or about 2×10^90, combinations! Just for reference, it would take IBM’s Sequoia supercomputer, the fastest we have at the time of writing, roughly 6×10^73 seconds to compute, which is more than 10^56 times as long as the universe has been around!
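To make the explosion concrete, here is a minimal brute-force sketch of clique-finding (the function and variable names are my own illustration, not a real library): it enumerates every subset of nodes, which is exactly the 2^N blow-up described above.

```python
from itertools import combinations

def find_cliques_brute_force(nodes, edges):
    """Check every possible subset of nodes -- all 2^N of them --
    and keep the subsets in which every pair is connected."""
    edge_set = {frozenset(pair) for pair in edges}
    cliques = []
    for size in range(2, len(nodes) + 1):
        for group in combinations(nodes, size):
            if all(frozenset(pair) in edge_set
                   for pair in combinations(group, 2)):
                cliques.append(group)
    return cliques

# A triangle plus one unconnected outsider: only the triangle
# and its pairs survive the mutual-connection check.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c")]
print(find_cliques_brute_force(nodes, edges))
```

At four nodes this finishes instantly; at 300 nodes the outer loops would have to visit on the order of 2^300 subsets, which is the whole point.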
Big Data presents the same problem, but not because we have 50 million data points. Instead, we have 50 million data points across 300 dimensions. In the same way that clique detection is NP-complete, exhaustive high-dimensional data analysis is intractable: for every new dimension added to the data, the analytic possibilities grow as 2^N. If we have just two dimensions, say cost and sales, and we’re trying to predict profit, we can calculate the isolated effect of each on profit, the interactive effect of both, and the effect of each controlling for the other. That’s four model specifications for just two variables. As our number of analytic parameters increases, the possibilities for analytic insight grow exponentially. This is not so new, really. The most widely used social survey, the General Social Survey, collects data on over 1,000 dimensions, from race and gender to attitudes about the environment and politics.
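The 2^N count of analytic possibilities can be seen by treating each subset of variables as a distinct specification (a sketch with made-up variable names; the count includes the empty, intercept-only specification):

```python
from itertools import combinations

def candidate_models(predictors):
    """Every subset of predictors is a distinct specification:
    each variable alone, every pairing, every set of controls.
    The total is 2^N, including the empty (baseline) model."""
    models = []
    for size in range(len(predictors) + 1):
        models.extend(combinations(predictors, size))
    return models

print(len(candidate_models(["cost", "sales"])))                      # 2^2 = 4
print(len(candidate_models(["cost", "sales", "region", "season"])))  # 2^4 = 16
```

Two variables give 4 specifications; 300 variables would give 2^300, the same astronomical number as the clique problem.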
However, there’s another 2^N problem that makes this issue more salient than ever. As the number of dimensions grows, our ability to gain meaningful insight from them diminishes, because there aren’t enough individual observations. A basic heuristic in statistics is that, for every variable you put into a linear regression, you need 10-15 observations. For a regression on 300 dimensions, that is a modest 3,000-4,500 observations. But as above, we can multiply the 2×10^90 calculations needed to analyze every combination of those 300 dimensions by another 10 or 15 observations apiece. And it gets even more mind-numbingly complicated, and computationally intractable, when we want to do an analysis within dimensions.
Let’s return to the cost and sales example. Say you want to compare sales for low-cost versus high-cost items. Knowing your product portfolio, you know that items over $1,000 are your high-end items. But though you have 1,000 observations, you’ve only sold five items over $1,000 in the past year. You have a lot of data, but not a lot of data about this fairly rare event. So, all of a sudden, the two dimensions you could analyze in four ways become impossible to analyze, even with 1,000 data points, because the dimension of interest is too rare. The thing is, rarity becomes extremely common in 2^N Analytics, and this is a big problem. Every dimension added has at least 2 subdimensions and as many as N subdimensions. In the case of low- and high-cost items, a variable with 1,000 distinct values is reduced to a two-category variable (assuming every item costs something slightly different). This is typically a strength, but when you want to make inferences about specific subdimensions (the power of big-N data), the data can run out fairly quickly.
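A quick simulation shows how dichotomizing makes the data run out. The numbers here are made up for illustration: prices are drawn from a skewed distribution so that sales over $1,000 are rare, just as in the portfolio example.

```python
import random

random.seed(0)
# Hypothetical portfolio of 1,000 sales; a log-normal draw gives
# mostly modest prices with a thin high-end tail (made-up parameters).
prices = [round(random.lognormvariate(5, 1), 2) for _ in range(1000)]

# Dichotomizing: 1,000 distinct prices collapse into two categories,
# and one of the two categories is nearly empty.
high_end = [p for p in prices if p > 1000]
low_end = [p for p in prices if p <= 1000]
print(len(prices), "observations, but only", len(high_end), "high-end sales")
```

Any inference about the high-end category now rests on a handful of observations, no matter how big the original data set was.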
Let’s use an example with the entire U.S. population. Using the U.S. Census (some of the oldest big data, now containing roughly 300 million people), say you want to compare the probability of unemployment (7%) for a black (12%) man (50%) in his thirties (13%) in a poor neighborhood (12%) of Detroit (0.25%) with that of a similar man in a similar place in Chicago (1%). [Note that I treat the probabilities here as independent; the unemployment rate for black men in these places is actually much higher. I use these because I happen to know most of the stats off the top of my head.] Combining these probabilities (.12 × .5 × .13 × .12 = .000936; then × .0025 for Detroit or × .01 for Chicago; × .07 for unemployment; × 300 million), you will find that there are about 702 such men in Detroit (49 of whom are unemployed) and about 2,808 such men in Chicago, of whom 197 are unemployed. In adding five variables, we’ve cut a data set of 300,000,000 people down to about 3,500 people, of whom only 246 have the outcome we’re testing. The power of big-N data is that we still have several thousand people. But there are still a couple hundred variables in the American Community Survey (an in-depth survey of samples within the U.S.) we could add to understand the employment likelihood for these two groups of people (political ideology, family, education, transportation access, home ownership, etc.). Who wants to imagine how small the data becomes when you compare these 3,500 people by their political ideology and family structure?
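The arithmetic above is easy to reproduce. This sketch multiplies out the same shares from the text, treated as independent (the author's own caveat applies; the variable names are mine):

```python
POP = 300_000_000

# Population shares from the text, treated as independent:
black, male, thirties, poor_hood = 0.12, 0.50, 0.13, 0.12
detroit_share, chicago_share = 0.0025, 0.01
unemployment = 0.07

base = black * male * thirties * poor_hood   # 0.000936
detroit_men = round(POP * base * detroit_share)
chicago_men = round(POP * base * chicago_share)

print("Detroit:", detroit_men, "men,",
      round(detroit_men * unemployment), "unemployed")
print("Chicago:", chicago_men, "men,",
      round(chicago_men * unemployment), "unemployed")
```

Five multiplications take the data from 300 million people to a few thousand, and a sixth (unemployment) leaves only a couple hundred cases of the outcome of interest.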
Hopefully by now I’ve convinced you that these computational problems are not solved with bigger data and faster computers. Big data has made us better at getting estimates at a fine-grained level, but the scale needed to compute everything should be considered unreachable. Instead, the promise of big data relies on analysts and their ability to choose the right features, set up the right kind of data collection, perform the right kind of analysis, and develop the right kind of conclusions. What is new is neither the data nor the computers, but our capacity to analytically and computationally reduce these 2^N problems to a meaningful and manageable scale from which we can build new insight.