The Downward Spiral of Online Data Quality

Today in the New York Times’ “Bits” blog, Nicole Perlroth brings us the latest cautionary tale for those who want to trust online metrics a little too much.  Titled “Fake Twitter Followers Become a Million Dollar Business,” the article documents the growing market for fake follower numbers.

You can buy 1,000 followers on Fiverr for $5.  It took me a couple of years to reach the 1,000-follower threshold.  …I’m such a sucker.

Perlroth’s post highlights a phenomenon that I’ve discussed elsewhere.  In “Social Science Research Methods in Internet Time,” I phrased it as a general rule: “Any metric of digital influence that becomes financially valuable, or is used to determine newsworthiness, will become increasingly unreliable over time.”*

The drivers of this process are abundantly clear.  Attach value to a digital metric (hyperlinks, followers, retweets, site visitors) and you create an incentive for talented coders to game it.  There’s money to be made in spam blogs and fake Twitter accounts.  It isn’t particularly honest money, but it isn’t particularly dishonest money either.

Those coders will introduce noise into the system.  Another set of coders will work on proprietary countermeasures that help cut through the noise.  But that isn’t much use to researchers who rely on the publicly available data itself.  The result is an ever-deepening GIGO (garbage in, garbage out) problem.  Academics often decide to treat follower count, retweet count, site traffic, etc. as direct indicators of influence, success, or prominence.  But those indicators were more accurate in 2009 than in 2011, more accurate in 2011 than in 2013, and they will be less accurate still in 2015.  The data itself becomes less reliable over time.

This is a systemic property, which means we should be able to plan around it.  Theoretically, that is.  Practically, it’s devilishly hard to do.  Our best options include (1) relying on metrics that fly under the radar, and thus (potentially) attract less spammer attention, (2) thinking carefully about what biases to expect (which Twitter users are most likely to buy spam accounts?  Presidential candidates > physicists), and (3) developing partnerships with proprietary coders who can offer higher-quality, constantly refined data.  Each of those options carries its own set of risks and problems, though.
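To make option (2) a little more concrete, here is a minimal Python sketch of the kind of crude, publicly reproducible filtering a researcher might attempt before trusting a raw follower count.  The field names mirror common Twitter API user-object fields, and every threshold is an illustrative assumption rather than a validated classifier; the proprietary counter-spam methods mentioned above are far more sophisticated than this.

```python
# A crude heuristic filter for likely-fake followers. Thresholds and field
# names (followers_count, friends_count, statuses_count, default_profile_image)
# are illustrative assumptions, not a validated spam classifier.

def looks_fake(user):
    """Flag accounts that match a few stereotypical spam-account traits."""
    follows_many_followed_by_few = (
        user.get("friends_count", 0) > 1000 and user.get("followers_count", 0) < 10
    )
    rarely_tweets = user.get("statuses_count", 0) < 5
    no_avatar = user.get("default_profile_image", False)
    # Require at least two red flags before discounting the account.
    return sum([follows_many_followed_by_few, rarely_tweets, no_avatar]) >= 2

def adjusted_follower_count(followers):
    """Follower count after discounting accounts the heuristic flags."""
    return sum(1 for user in followers if not looks_fake(user))

# Usage: compare raw vs. adjusted counts for a list of follower records
# pulled from whatever API access or archive is available.
followers = [
    {"friends_count": 2001, "followers_count": 3, "statuses_count": 1,
     "default_profile_image": True},
    {"friends_count": 150, "followers_count": 220, "statuses_count": 4800,
     "default_profile_image": False},
]
print(len(followers), adjusted_follower_count(followers))  # raw: 2, adjusted: 1
```

Even a toy filter like this illustrates the bind: the heuristics are obvious enough that spammers can (and do) route around them, which is exactly why the arms race favors proprietary, constantly refined counter-methods over anything a researcher can publish and reuse.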

Consider this your semi-regular reminder that the future of Big Data is going to involve just as much messiness and muddling-through as the past and present have.


*Self-quoting is weird.