[I’m not a sports analyst, and this is not a sports blog. We’re scholars, especially of political communication, politics, and media policy. But I do crunch numbers, and I thought I could help add something to this debate.]
[Also, corrections and updates at the bottom, appended Jan 28, 2pm-ish.]
We’ve all spent the last week hearing a lot about Tom Brady’s balls. Patriots fans and Pats haters are fighting online with a viciousness that’s hard to overstate. A good number of you have also seen the use of statistics to try to sort out whether the Pats have a measurable advantage in something that would be directly related to the inflated pressure of footballs — namely, fumble rates. Statistical analysis is only good, however, if the data are correct, if we are testing what we think we are testing, and if we are using the right statistical tools for the job. In this case as in so many, we need more good analysis that asks the right questions and uses the correct data.
This post has a lot to say, so here’s a summary: A football handicapper named Warren Sharp did a statistical analysis comparing fumble rates for players who had played for the Patriots and for other teams, showing that the same players somehow held on to the ball much better when playing for the 2007-14 Pats than when playing for other teams. Once I recreated the data set and ran my own analysis, I found that Sharp’s analysis has several fundamental flaws. I also found that a proper analysis of the correct data shows a small Pats advantage but with no statistical significance.
The Wall Street Journal then published a follow-up study by Michael Salfino; he looked at six players who have all played significant minutes for the Patriots and for other teams since the 2010 season, and he found a striking difference. Unlike the Sharp piece, Salfino uses the correct data, and the correct statistical test (which I believe I am the first person to run) shows that there is a statistically significant difference for these six players, leading to lower fumble rates when playing for the Patriots.
The Patriots might have cheated. If they did, and if this had an effect on ball security, the effect is smaller and less certain than Sharp (and those sharing his results) claim. But, especially in light of Salfino’s data, it looks reasonably likely that there is an effect. This will, and should, contribute to the claim that the Pats likely cheated — even though the results are not so improbable under the chance-alone explanation as to prove the case conclusively.
The Promise of Sharp’s Second Study
I was really impressed by Warren Sharp’s first viral post on Sharp Football Analysis, in which he concluded that Pats fumble rates since 2007 are so low as to be “nearly impossible.”
Once his server stopped crashing from all the traffic, he wrote a second post exploring fumble rates for players who have both played for the Patriots since 2007 and played for other teams. Again, he found something highly suspicious: the same player was significantly less likely to fumble while playing for the Pats than while playing for another team.
I really locked in on the second study. The most persuasive case for an unfair advantage would, ideally, control for differences in player personnel. After all, if Belichick consistently identifies players who are better than average at hanging on to the ball, then the Patriots could gain a substantial — even freakishly large — advantage through fair means. This just happens in sports sometimes.
In the 2013 season, the Broncos offense provided a really good example of this phenomenon; their points-per-game were so far ahead of the league norm that, by chance alone, such a result would happen about once every 2,230 seasons. (In a year when the league averaged 23.42 points per game, with a standard deviation (SD) of 4.36 points, the Broncos averaged 37.9, for a z score of 3.32. There’s a formula to calculate how likely something is based on the z score, but you don’t have to learn it. You just have to go to this page with a handy z score calculator.) As you know, however, NFL seasons are not purely the result of chance; the players involved matter a great deal. (Just ask Eric Decker about the drop-off from 2013 Peyton Manning to 2014 Geno Smith.)
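For anyone who would rather skip the z score calculator, the arithmetic can be checked in a few lines of Python. This is my sketch, using a normal approximation in place of the linked calculator:

```python
import math

# 2013 season: league averaged 23.42 points/game (SD 4.36);
# the Broncos averaged 37.9.
z = (37.9 - 23.42) / 4.36

# One-tailed probability of a season at least that far above the mean,
# under a normal approximation.
p = 0.5 * math.erfc(z / math.sqrt(2))
print(z, p, 1 / p)  # z of about 3.32; roughly a once-in-2,200-seasons event
```

The reciprocal of the tail probability gives the "once every N seasons" framing used in the post.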
The potential for a player effect is why Sharp’s second study was so promising — and, on first read, convincing. Wes Welker would not stop caring about ball security once he got to the Broncos, and if he had learned anything particularly helpful on this count, he would surely retain it for, and even share it with, his new team. If there is still a team effect, however — if the same player fumbles less for the Pats than for other teams — then we have something potentially suspicious.
Unfortunately, Sharp’s second post does not have the goods.
Sharp Mistake Number One: Counting the Wrong Fumbles
One persistent question by commenters under Sharp’s post that he refused to clarify is whether he included special teams (ST) fumbles. Once I had the data, I was able to conclude that he had done just that. It is hard to overstate how serious this mistake was. In this case, it’s the difference between “This adds a lot to the case that the Pats are cheaters” and “These data are not measuring what you think they’re measuring.”
Sharp finds that these players, while playing for the Patriots from 2007 onward, fumbled 39 times on 3,815 touches (98 touches/fumble), while they fumbled 124 times on 8,349 touches (67 touches/fumble) when playing for other teams. I wanted to double-check his numbers, so I copied the results myself directly from the ESPN.com career statistics pages for all the players. Once I had punched in the numbers myself, I was baffled; not only were the fumble totals way lower, even the touch count was off. It turns out that Sharp’s 2007-14 Pats counts mistakenly included Laurence Maroney’s 2006 season (197 touches, 1 fumble) but not Deion Branch’s 2011 and ’12 seasons (67 total touches, 0 fumbles). So his New England 2007-14 totals should have been: 3,685 touches, 38 fumbles, and 97.0 touches/fumble.
Again, the real problem is including special teams fumbles. The kickers’ balls are not in question here, after all, so including special teams fumbles will at best add a bunch of noise to the data, and at worst create the appearance of an effect when there really is not one. Sharp’s analysis is built in large part on the latter type of error.
Of these players’ (actual) 3,685 offensive touches for the Pats from 2007-14, there were only 29 offensive fumbles — not the 38 we would get if we include ST fumbles. Of these same players’ 8,349 offensive touches playing for other teams, there were only 86 offensive fumbles, meaning that Sharp tacked on a whopping 38 additional fumbles racked up during their ST duties for other teams. Also, since each touch is a single event that either ends in a fumble or does not, let’s give the non-fumble touches their own column as well. In other words, the actual table should be:
Table 1: Fumbles per Offensive Touch, Patriots 2007-14 vs. Same Players on Other Teams
| | No fumble (% of touches) | Fumble (% of touches) | Total touches |
|---|---|---|---|
| Pats, 2007-14 | 3,656 (99.21%) | 29 (0.79%) | 3,685 |
| Other Teams | 8,263 (98.97%) | 86 (1.03%) | 8,349 |
| Total | 11,919 (99.04%) | 115 (0.96%) | 12,034 |
This paints a very different picture than the one Sharp offers — one in which the players are only slightly (and certainly not impossibly) ahead of their non-Pats performance. “About 23.6% better” ((86/8,349 − 29/3,685) / (86/8,349) = .236) is definitely remarkable, in my view — again, these are the same players, and the only change is the team they’re playing for — but it is a more modest effect than the 31.2% benefit that we would find with Sharp’s numbers.
Sharp Mistake Number Two: Getting the Fractions Upside-Down
Even if we were to take Sharp’s numbers at face value, he has the numerator and denominator upside-down. He uses plays-per-fumble instead of fumbles-per-play. We are interested in how often each player fumbles, so fumbles/touches is the correct measure; flipping the fraction is a great way to radically inflate the apparent difference, and if he had the fractions right side-up, we would see a smaller effect than he claims.
The analogy is not exact, but the mistake is something akin to when a department store says, “20% off of our clearance price, which was already 50% off!” We’re supposed to (and generally do) think “70% off,” even though this is only a net 60% discount. While not the same mistake per se, getting the numerator and denominator mixed up can be —and in this case, is — exponentially more misleading.
Sharp Mistake Number Three: Not Running Proper Statistical Tests
There was a lot of criticism of Sharp, and fast, but as of Sunday night, I found no comprehensive takedowns. So I decided to investigate for myself. I went to the ESPN career statistics pages for every player named in the study and copied their data into Excel, cleaning it up and standardizing the formatting. Then, I prepped the data for import into a statistical analysis program. (This meant turning each touch into its own “case” on a separate line. Thus, instead of one line expressing that Danny Amendola caught 85 passes and had one receiving fumble for the 2010 Rams, I had this across 85 lines. I also could have — and, in retrospect, probably should have — just run the most obvious appropriate statistical tests using the handy online tools I link to in this piece.) This added up to 12,633 lines.
Using proper statistical analysis is important because a difference that catches the eyeballs might actually be reasonably likely to happen by chance alone. Incidentally, this is how we wind up with all sorts of misunderstandings and silly folk “wisdom.” Your mother might say, “I ate extra kale when I was pregnant with your brother, and he came out a whole 5 inches taller than you did! I should have eaten kale when you were gestating so you could be tall, too.”
With all due respect to your hypothetical mother, simple luck of the draw is a much, much better explanation, and if we had a reasonable number of kale study participants, we would probably (but not certainly; that’s how randomness works) find no effect. (And don’t get me started on the anti-vaxxers. Thousands of children are vaccinated every workday. Some portion of them will, sadly, develop the first signs of autism in the following days. That’s just how chance works. “My child got vaccinated right before, therefore it’s the vaccines” is the equivalent of “I should have eaten kale so you’d be tall like your brother.” Except, to my knowledge, kale isn’t causing an outbreak of totally preventable diseases that are leading to hospitalizations and deaths.)
So, without further ado, I introduce you to the “proper” statistical test: Chi-square. (If you know the arguments about why Chi-square is inferior to other tests, you also know that (a) the differences in measured estimates of significance are generally marginal and are virtually the same here, (b) Chi-square critics can’t agree on what we should be doing instead, and (c) Chi-square is broadly understood. Oh, and while you’re here: Chi-square = 1.596.) It is the right test for a 2-by-2 table of binary data like this — at least, it is according to people much, much smarter than me. If you have a table of data and you want to test the significance of any association, there are helpful online calculators like this one.
If we run it on the actual offensive fumble data, here is what we find. In what I’ll call “Sharp’s Player Test” — that is, 2007-14 Patriots seasons versus all non-Patriots seasons — the answer is “not much.” Using offensive fumbles only, the probability that the Pats/non-Pats advantage would arise due to sheer chance alone is .206. Which is to say, about 21% of the time, we would expect a difference at least this large between for-the-Pats and for-someone-else results. Really, since we are only interested in whether the effect holds in one direction (fewer fumbles as Pats), we can cut this estimate in half and say that the Pats could gain this benefit, due to sheer chance alone, about 10.3% of the time. Unless we had something else to point to, we would be left concluding that the stats don’t really provide any additional fuel to the claims that the Pats cheated. This does not disprove that there is an effect — in fact, it suggests one. It’s just no more than a suggestion, and one that does not provide much hard proof.
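For readers who would rather not trust an online calculator, here is a minimal Python sketch of the same uncorrected Pearson Chi-square on the corrected Table 1 counts. It uses the closed-form 2-by-2 shortcut, and the erfc identity works because a df=1 Chi-square is a squared z score:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson Chi-square (no Yates correction) for the 2x2 table
    [[a, b], [c, d]], plus its two-tailed p value (df = 1)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # df = 1: the Chi-square statistic is a squared z score, so the
    # p value falls out of the complementary error function.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Corrected Table 1 counts: (no-fumble touches, fumbles) per row
chi2, p = chi2_2x2(3656, 29, 8263, 86)
print(chi2, p)  # roughly 1.596 and .206 (two-tailed)
print(p / 2)    # roughly .103 one-tailed (Pats-favoring direction only)
```

The same function reproduces the other Chi-square values in this post when fed the corresponding tables.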
Sharp’s Player Test is misleading in yet another potentially important way. Trends in the NFL change the nature of the game all the time, so we should always keep the timeframe constant if possible. (The big exception is if we’re interested in change over time. Then, we should try to hold other things constant or correct for changes, e.g. to season length.) Thus, I re-ran the analysis with only seasons from 2007 and later, Pats or otherwise. Here’s what that data looks like:
Table 2: Fumbles per Offensive Touch, 2007-14, Patriots vs. Same Players on Other Teams
| | No fumble (% of touches) | Fumble (% of touches) | Total touches |
|---|---|---|---|
| Patriots, 2007-14 | 3,656 (99.21%) | 29 (0.79%) | 3,685 |
| Other Teams | 3,273 (98.97%) | 34 (1.03%) | 3,307 |
| Total | 6,929 (99.10%) | 63 (0.90%) | 6,992 |
The result is even less significant: .287 (Chi-square = 1.135), meaning that about 29% of the time, an outcome at least this different would arise from chance alone — and just over 14% of the time, such a Pats-favoring result would occur.
I have to be honest; these results surprised me. At first sniff, I actually bought Sharp’s analyses, and I even said some things online based on that belief. Once I slept on it, thought about the flaws, and realized that it needed to be run using the appropriate tests, I hypothesized that Sharp’s Player Test would be statistically significant. After all, the same players, playing for the 2007-14 Pats, had shown “an improvement of OVER 100%!” (bold, caps, underline, AND! exclamation in original) versus their play on other teams.
I’m not just busting Sharp’s chops here, either; I really did believe it and expect the results to be statistically significant — that is, improbable enough to be less than 5% likely to be the result of chance.
By the time I was ready to run the tests, though, I no longer expected the results to be worthy of four separate typographical devices for emphasizing a point. (Now I am busting his chops, but in good fun.) Once I thought a bit more seriously about the nature of the problem, I realized the error Sharp had made in conflating summaries of binary data with a real variable that itself varies along a continuum or scale.
In other words, I suspected that, even if I had just taken Sharp’s data and run the proper significance test, the results might not have looked so dramatic. Fumbles are binary, and when a binary event is rare (happening, say, about 1% of the time), small absolute differences look exaggerated to the naked eye, especially if we look at simple ratios.
Thus, for your benefit, dear reader, I also ran a Chi-square on Sharp’s (inaccurate and wrongly premised) numbers. The result (Chi-square = 4.244) was indeed statistically significant, with a probability of .0394 — likely to result from chance alone only about 4% of the time. Favoring the Pats specifically by at least this much would happen about 2% of the time.
In a league with 32 teams, even 50-to-1 improbabilities happen. Which does not mean such a result would let the Patriots specifically off the hook. It just means that even Sharp’s numbers, analyzed properly, would not be enough by themselves to shut up the most rabid Pats homer — or to lead to a punishment for the team, Belichick, or Brady. Rather, it would merely have added to the evidence that suggests cheating and is hard to explain away otherwise. All of that is hypothetical, of course, since the right test of the right numbers shows findings that are not even statistically significant. Or so I thought.
The Wall Street Journal Six
Once I had concluded that Sharp was wrong in so many ways and the proper player-comparison results showed no findings of note, I thought my analysis was all but done here. Then I found the Wall Street Journal article by Michael Salfino that repeats Sharp’s claims but also runs a separate analysis.
[A]ccording to Stats, LLC, the six players who have played extensively for the Patriots and other teams in this span all fumbled far less frequently wearing the New England uniform. Including recovered fumbles, Danny Amendola, BenJarvus Green-Ellis, Danny Woodhead, Wes Welker, Brandon LaFell and LeGarrette Blount have lost the ball eight times in 1,482 touches for the Patriots since 2010, or once every 185.3 times. For their other teams, they fumbled 22 times in 1,701 touches (once every 77.3).
Since I already had all this data, I just had to (a) run the numbers myself to make sure that it only includes offensive fumbles, and (b) run a significance test.
I did all that, and in a word: Bingo.
The fumbles counted are only offensive fumbles, not special teams fumbles, so these are the “right” data. The counts are all correct. Unlike with the broader data set, this is also a statistically significant result (Chi-square = 4.817), with a difference this extreme likely to happen just 2.8% of the time. Since we’re really only interested in results this extreme in one direction (helping the Pats) instead of both directions (the other being that the Pats fumble that much more), we can actually halve our estimate of how likely such a result would be, down to just 1.4% of the time.
That’s not quite a smoking gun, but it’s a level of certainty that actually does add some additional fuel to the controversy.
Also, let us correct for Salfino’s error of using touches-per-fumble and show the data in a nice table while we’re here:
Table 3: Fumbles per Offensive Touch, 2010-14, Select Players
| | No fumble (% of touches) | Fumble (% of touches) | Total touches |
|---|---|---|---|
| Patriots | 1,474 (99.46%) | 8 (0.54%) | 1,482 |
| Other Teams | 1,679 (98.71%) | 22 (1.29%) | 1,701 |
| Total | 3,153 (99.06%) | 30 (0.94%) | 3,183 |
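The same shortcut Chi-square, run on the Journal’s six-player counts, reproduces these numbers. Again, this is a sketch, taking the Stats LLC touch and fumble totals as given:

```python
import math

# WSJ six (2010-14): touches and offensive fumbles, per the article
pats_t, pats_f = 1482, 8
other_t, other_f = 1701, 22

# How much lower is the Pats fumble rate, relative to the other-team rate?
improvement = 1 - (pats_f / pats_t) / (other_f / other_t)
print(f"fumble-rate improvement as Pats: {improvement:.0%}")  # about 58%

# Pearson Chi-square (df = 1) on the 2x2 table, no Yates correction
a, b = pats_t - pats_f, pats_f      # no-fumble touches, fumbles (Pats)
c, d = other_t - other_f, other_f   # same for the other teams
n = a + b + c + d
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
p_two = math.erfc(math.sqrt(chi2 / 2))  # two-tailed p for df = 1
print(chi2, p_two, p_two / 2)  # roughly 4.82, .028 two-tailed, .014 one-tailed
```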
This highlights that the same six players — six players who were major contributors for the Pats and major contributors for other teams — did 58% better at hanging on to the ball as Pats than when playing for other teams. This is a much larger difference than the differences in the broader data sets: about 24% better for the corrected version of Sharp’s Player Test (correct counts, no ST fumbles), and about 23% better for the 2007-and-later test.
The Pats/non-Pats difference among these six players looks impressive, but the naked eye is not as reliable as the significance test, and as I mentioned: The results are statistically significant and are thus probably better explained as not purely the result of chance.
I’m teaching tomorrow morning (it’s a class in statistics; I know, you’re shocked), so I need to wrap this up. Which is too bad, because I still haven’t tackled Sharp’s first post, looking at team-wide fumble data more broadly. I will do just that at some point this week, I’m sure, and I’ll share my findings. In the meantime, I have to stop here and assess.
What We Know Now
The post by Warren Sharp, comparing the rates for players as they have come on and off the Patriots, is based on a good idea but is methodologically unsound and should not be relied upon. A proper statistical analysis of the same question (Pats 2007-14 vs. same players for all other teams) and a slightly modified but more methodologically sound version of the same question (only using 2007-14) both found no statistical significance. In contrast, the Wall Street Journal test of six key players, using data from 2010 to 2014, is based on accurate data and shows results that would be very (if not extremely) unlikely to be the result of chance alone.
The Journal data supports the claim of cheating, even if it does not show such an unlikely outcome that cheating is the only sensible explanation. In contrast, the corrected Sharp tests simply fail to show much. They show a minor bias in favor of the Pats, but these results could credibly be the result of chance alone. Importantly, they do not really provide evidence for the lack of an effect; fumbles are rare, statistical significance is largely a function of sample size, and if these players had produced more data it very well might have shown an effect with statistical significance.
In light of these specific mixed findings, I am inclined to believe that the harder-to-explain-as-chance results probably speak more loudly. If each player has some kind of “natural” fumble rate, and this can vary from player to player, it would be ideal if we had a number of players with a large number of touches both in and outside a Pats uniform. We do have that — it’s the Journal data set — and that data set shows a statistically significant difference. Players with fewer touches while either on or off the Pats — especially if they have many touches in one category and few in the other — will be less reliable contributors to the data set than the six players Salfino focuses on.
These statistics are not the only relevant evidence, however, and in combination with what else we know, the claim of real interest — that the Patriots appear to have cheated — seems more likely than not but not especially certain. As in: I, Bill Herman, am more than 50% confident that they have systematically under-inflated their footballs for years, but I am less than 99% confident based on what we now know. I am already of the opinion, however, that the statistical evidence is good enough that we should be highly dubious if it is later claimed that a rogue equipment manager was responsible for a single incident. If there is specific evidence of some tampering, and if there is solid statistical evidence that this has led to a specific and causally related advantage, that combination of facts starts to look pretty damning — and not just for the equipment manager.
As with all sane uses of sports statistics, we don’t get to a good conclusion using exclusively quantitative data. Here, most of what we should be looking at is actually not statistical. Also, any attempt to quantify our estimates of the likelihood of cheating using something like Bayesian inference will have too many numbers-pulled-out-from-under-my-footballs variables. (E.g., “Before DeflateGate erupted, what was your probability estimate that Belichick was still cheating in some way?” I mean, not zero — come on, guys, it’s above zero for every team — but uh… Or, “If you saw any NFL equipment manager tampering with a team’s game footballs, what would be the probability that this manager was doing so on his/her own?”)
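To make that point concrete, here is what a toy Bayesian update looks like when you do pull numbers from under your footballs. Every input except the 1.4% figure from above is invented purely for illustration, which is exactly the problem:

```python
# Illustrative only: the prior and the cheating likelihood are made-up,
# pulled-from-under-my-footballs numbers (the author's point exactly).
prior = 0.30                # P(cheating) before the fumble data (invented)
p_data_if_cheating = 0.50   # P(WSJ-six result | cheating) (invented)
p_data_if_fair = 0.014      # one-tailed chance of the result by luck alone

# Bayes' rule for a binary hypothesis
posterior = (p_data_if_cheating * prior) / (
    p_data_if_cheating * prior + p_data_if_fair * (1 - prior)
)
print(f"posterior P(cheating) = {posterior:.2f}")
```

Nudge the two invented inputs and the posterior swings wildly, which is why this kind of quantification adds little here.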
In other words, we are not quite ready to be certain in a conclusion that there has been cheating. However, the statistics do provide at least some additional (non-definitive, caps-and-bold-not-appropriate) support for the claim that something untoward has happened. With more analysis (of team-wide fumble rates in particular) and more directly incriminating evidence, we may be approaching a more certain conclusion there.
But then again, we might not.
Update, Jan 28, 2:13 pm:
Thanks for all the social media love, folks — to say nothing of the link from FiveThirtyEight. (Hot damn!) This has gone about as well as I could have hoped, save for getting it posted somewhere more visible. (Will see if I can make time for working at that this afternoon, but it’s not looking promising.)
Two corrections to what I wrote above. First, I didn’t account for the fact that play selection is also a variable that changes substantially from team to team, and play calling surely has at least some effect on fumble rates as well. It’s still the case that this effect at least suggests, but does not definitively demonstrate, something fishy; the play-calling variable just makes it even a bit less definitive than I might have implied in the first version. (Shared credit for this suggestion goes to the best football handicapper I know in person — who might or might not want to be identified by name. Will ask him and change this accordingly.)
Second, I oversimplified the “don’t use summaries of binary data like continuous variables” warning. It’s not a commandment — more like a yellow light than a red. If the summaries are in fact normally distributed, then you can actually run normal distribution tests. But when we tally up rather rare binary events and run tests on the tallies, we really have to double-check our assumptions.
“Sharp makes a major error because he didn’t run tests on the normalcy of the summary distributions” is a lot harder to explain than “Dude, you can’t just tally up binary events and treat those tallies like IQ scores.” So I took the latter route, even though even this Ashton Kutcher-esque version in the update would have been more correct than what I wrote above.
While I was writing this post, several others also weighed in. I’ve had nice exchanges with many of them over social media. See this roundup by Neil Paine at Five Thirty Eight (I made the cut!), but I’d specifically recommend Drew Fustin‘s “Comments on Warren Sharp’s Patriots Fumble Analysis.” Fustin is especially merciless in identifying Sharp’s mistakes, but Fustin’s tone is appropriate and civil — versus some of the other analyses that get a little more caps-lock-and-disses than I’d like. (In fairness, Sharp’s tone was super over the top to start with.)
Fustin finds that, not only are New England’s team fumble rates not off-the-charts, they’re not even at the top of the charts. Frankly, this should make us a bit more skeptical about the way I end this piece — not least since the sample sizes are still so damnably small that the player-level results might be an artifact of specific players’ work for other teams. Fustin also highlights the assured partial effect of play selection.
In contrast, I don’t especially recommend the response that Sharp posted today. In it, he claims, “But unfortunately I never saw anything that actually looked at my conclusions, using the same time periods, and determined the fundamentals of my two key points were incorrect.”
I know that Sharp’s gone from “Who?” to internet-football-debates famous pretty much overnight, so he can’t have read everything critical that’s been written so far, but he surely isn’t looking that hard if he hasn’t found such a piece. (He’s also not working very hard to convince us that he has. None of the criticisms are linked, and [I think?] 2 are vaguely mentioned? Ugh.) Again, I specifically recommend Fustin’s piece, which does a great job analyzing the rates at the team level — and finds a much less dramatic effect.