Fellow Internet researchers, we need to have a little talk. It’s about “big data,” and what it isn’t.
Consider the following case:
Over the summer, David Corn at Mother Jones published an investigative piece about a conservative insider group named Groundswell. Groundswell included in-person meetings and a Google Group that tea party activists, think tankers, conservative media journalist/activists, and government staffers used to discuss strategy and coordinate messaging. In essence, it was yet another “JournoList,” this time for the right (and, as such, it received basically zero public outrage. As David Weigel puts it, “conservative news outlets talking to conservatives on background? Who didn’t figure this was happening anyway?”).
Weigel calls out the following passage from Corn’s reporting:
At the March 27 meeting, Groundswell participants discussed one multipurpose theme they had been deploying for weeks to bash the president on a variety of fronts, including immigration reform and the sequester: Obama places “politics over public safety.” In a display of Groundswell’s message-syncing, members of the group repeatedly flogged this phrase in public. Frank Gaffney penned a Washington Times op-ed titled “Putting Politics Over Public Safety.” Tom Fitton headlined a Judicial Watch weekly update “Politics over Public Safety: More Illegal Alien Criminals Released by Obama Administration.” Peter List, editor of LaborUnionReport.com, authored a RedState.com post called “Obama’s Machiavellian Sequestration Pain Game: Putting Politics Over Public Safety.” Matthew Boyle used the phrase in an immigration-related article for Breitbart. And Dan Bongino promoted Boyle’s story on Twitter by tweeting, “Politics over public safety?” In a message to Groundswellers, Ginni Thomas awarded “brownie points” to Fitton, Gaffney, and other members for promoting the “politics over public safety” riff.
The reason this passage is noteworthy is that it reveals an underlying flaw in virtually every academic study of online information diffusion.
Imagine you were conducting a study of how the “politics over public safety” meme diffused through the blogosphere. You’d likely combine data from Google Trends, LexisNexis, and the Twitter firehose to identify instances of the phrase. You’d rely on the digital traces from social network ties and hyperlinks to identify where the phrase started and how it spread. You’d probably produce some fancy network graphs. If it’s part of a larger study, you might combine this case with several others to assess Granger causality. In the end, the data would tell a sophisticated story about what sorts of news outlets, pieces of content, or individuals in a network drive meme diffusion.
But you’d be wrong. You’d be wrong because, according to the public data, it looks like the phrase diffused online from Frank Gaffney to Tom Fitton, then to Peter List, Matt Boyle, and Dan Bongino. But it actually diffused through an in-person meeting and a backchannel Google Group. The public data can’t account for the hidden structure provided by offline and online-but-private communication systems.
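For the methodologically inclined, here is a minimal Python sketch of that failure. Everything in it is an assumption made for illustration: the day offsets are invented (the real publication dates are not given here), `infer_chain` is a hypothetical stand-in for the far more elaborate timestamp-and-link inference a real study would use, and the "actual" structure is deliberately simplified to a single private source feeding every author.

```python
# Hypothetical public record: (author, day the phrase first appeared in public).
# The day offsets are invented for illustration; only the ordering matters.
public_posts = [
    ("Gaffney", 1),
    ("Fitton", 2),
    ("List", 3),
    ("Boyle", 4),
    ("Bongino", 5),
]


def infer_chain(posts):
    """Naive inference: each author picked up the phrase from the
    author who used it in public immediately before them."""
    ordered = sorted(posts, key=lambda post: post[1])
    return [(ordered[i][0], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def hidden_source_edges(posts, source="Groundswell (private)"):
    """Simplified 'actual' structure: one private channel feeds every author.
    The source node never appears in the public data."""
    return [(source, author) for author, _ in posts]


inferred = infer_chain(public_posts)
actual = hidden_source_edges(public_posts)

# Edges the naive method gets right: none, because the true source is unobserved.
overlap = set(inferred) & set(actual)

print("Inferred from public data:", inferred)
print("Simplified actual structure:", actual)
print("Edges correctly recovered:", len(overlap))
```

No amount of additional public data repairs this. The node that matters is missing from the dataset, so every inferred edge is an artifact of publication order.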
This is a simple point, but it’s also a point that I inevitably make at every academic panel on “big data.” We, as a research community, are repeatedly, comprehensively deriving incorrect conclusions. We’re able to draw upon more and more data, and we’re confusing that with comprehensive data.
Big data isn’t comprehensive data. It is systematically incomplete.