Using R to answer a football question

By Joseph Adler
December 15, 2009 | Comments: 7

Last Monday night, I was watching the Ravens playing the Packers at Green Bay. Mostly, I was watching penalties. This game featured an astounding number of penalty calls: 23 calls, 310 yards. At first glance, you'd think that penalties were a bad thing; teams for which officials call more penalties should perform worse than teams for which officials call fewer penalties.

However, I wondered if this was really true. Penalties could be a sign that a team is playing passionately. Or they could be completely random and meaningless. I decided to take a look at this question. Here's what I figured out in 15 minutes, with a little help from R.

The NFL web site has a good statistics section. I decided to take a look at 2008 team offense statistics and 2008 team defense statistics. With these stats, I could look at the correlation between penalties and points. If penalties really affected games, you'd expect a negative correlation.

I copied these data tables from my web browser (Safari 4 for Mac OS X) to Microsoft Excel, then saved the data as a set of text files.

Now, I was ready to import the data into R, using read.delim. I started with defensive stats:


> def.08 <- read.delim(file="~/Desktop/nfldef2008.txt")

Before analyzing the data, I took a quick look at the penalty data to make sure that everything looked OK.


> names(def.08)
[1] "Rk" "Team" "G" "Pts.G" "TotPts"
[6] "Scrm.Plys" "Yds.G" "Yds.P" "X1st.G" "X3rd.Md"
[11] "X3rd.Att" "X3rd.Pct" "X4th.Md" "X4th.Att" "X4th.Pct"
[16] "Pen" "Pen.Yds" "ToP.G" "FUM" "Lost"
> def.08$Pen.Yds
[1] 801 792 593 639 866 1,002 750 601 660 636 543
[12] 772 869 540 615 663 691 736 816 721 827 659
[23] 637 854 708 770 633 654 738 671 588 753
32 Levels: 1,002 540 543 588 593 601 615 633 636 637 639 654 ... 869

This reveals one small problem: R loaded the penalty-yards field as text, converting it to a factor. The culprit was the value "1,002": the thousands separator kept R from parsing the column as numbers. I corrected the underlying data and tried again:


> def.08 <- read.delim(file="~/Desktop/nfldef2008.txt")
> fivenum(def.08$TotPts)
[1] 223 313 350 390 517
> fivenum(def.08$Pen.Yds)
[1] 540.0 636.5 699.5 782.0 1002.0
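
Editing the source file by hand works, but the comma can also be stripped inside R. The following is only a sketch: pen.raw is a hypothetical stand-in for a column that read.delim imported as a factor, not the actual def.08 data. It also shows why calling as.numeric directly on a factor is a trap: that call returns the internal level codes rather than the printed values.

```r
# Hypothetical stand-in for a column imported as a factor (not the real data)
pen.raw <- factor(c("801", "1,002", "540"))

# Pitfall: as.numeric() on a factor returns the level codes, not the values
codes <- as.numeric(pen.raw)

# Safer: convert to character, drop the thousands separator, then to numeric
pen.clean <- as.numeric(gsub(",", "", as.character(pen.raw), fixed = TRUE))

print(codes)      # small integers indexing the factor levels
print(pen.clean)  # the penalty yards as real numbers
```

Running fivenum on the cleaned vector is a quick way to confirm that the values fall in the expected range rather than between 1 and the number of levels.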

Let's plot total points allowed vs. total penalty yards:


> plot(TotPts~Pen.Yds,data=def.08.n)

Here's what the chart looks like:

defense.png

I didn't see any obvious correlation between penalty yards and points allowed, but I decided to check for correlation using the cor.test function in R:


> cor.test(def.08.n$Pen.Yds,def.08.n$TotPts)

Pearson's product-moment correlation

data: def.08.n$Pen.Yds and def.08.n$TotPts
t = -0.4091, df = 29, p-value = 0.6855
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4188408 0.2862818
sample estimates:
cor
-0.07574162

As you can see, with a p-value of about 0.69, there was no statistically significant correlation for defensive statistics. The story is much the same for offensive stats:


> off.08 <- read.delim(file="~/Desktop/nfloff2008.txt")
> plot(TotPts~Pen.Yds,data=off.08.n)

Here is a plot of points scored vs. penalty yards for offenses in 2008:

offense.png

And here is the result of the correlation test:


> cor.test(off.08.n$Pen.Yds,off.08.n$TotPts)

Pearson's product-moment correlation

data: off.08.n$Pen.Yds and off.08.n$TotPts
t = 0.8927, df = 30, p-value = 0.3791
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1989928 0.4824930
sample estimates:
cor
0.1608631

As with defensive statistics, the correlation isn't statistically significant, though it is slightly stronger.
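
As a sanity check on the two outputs above, the t statistic and p-value that cor.test prints follow directly from the correlation r and the degrees of freedom: t = r * sqrt(df) / sqrt(1 - r^2), with a two-sided p-value taken from the t distribution. The sketch below plugs in the printed values; cor.to.t is a made-up helper name for illustration, not a function that ships with R.

```r
# Hypothetical helper: t statistic for a Pearson correlation r with df = n - 2
cor.to.t <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

# Correlations and degrees of freedom copied from the cor.test output above
t.def <- cor.to.t(-0.07574162, 29)
t.off <- cor.to.t(0.1608631, 30)

# Two-sided p-values from the t distribution
p.def <- 2 * pt(-abs(t.def), df = 29)
p.off <- 2 * pt(-abs(t.off), df = 30)

print(round(c(t.def, t.off), 4))
print(round(c(p.def, p.off), 4))
```

Because df = n - 2, the defensive test ran on 31 rows and the offensive test on 32.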



7 Comments

You analyzed an entire season. Take a single week of play and I think you will probably find a greater correlation. When you take a team's entire body of work, you pretty much average out any loss in points that could be associated with high penalty yardage.

Yeah, agree with Chris' comment. Was gonna say the same. It's probably a wash over the course of a season, but at a per-game level it's probably significant. Neat idea btw!

Chris and Andy:

You're right that you'd see more significance in a single week. Taken to the extreme, it's obvious that there are single penalties that can affect the result of games (or seasons, or players' or coaches' careers). However, many football commentators comment on a team's cumulative penalty yards, implying that this statistic is important.

However, the big reason I wrote this post was to show how to pull this data into R and take a look at it by yourself. Let me know if you find anything interesting!

Wouldn't it be more meaningful to look for a correlation between penalties incurred and games lost? A high number of total points could hide negative consequences of penalties incurred if the other teams scored more points.

Patrick:

Interesting question. Ultimately, what matters most is games won and lost.

However, there are a bunch of reasons why W/L statistics aren't the best measure here. First, there are only 16 games in an NFL season, so it's really tough to find a statistically significant effect. Second, wins and losses are a function of lots of different effects. Among other things, quality of opponents has a big effect.

Mostly, the point of the exercise was to show that there are both good and bad teams that get large and small numbers of penalties. Oh, and to show how to get the data and play with it in R.

Actually you've got a bug in the way you read the data. Notice that the penalty yards should be in the range 540 to 1002, but on your graph they're in the range 1 to 32. That's because you converted the factor to numeric by using its underlying factor-level-number and there are 32 teams. So no wonder there was no correlation. =)

Here's the way I read my data in, which gets rid of the commas in the first place (that's what was tripping up the read). Note that it's specific to OS X:

def.08

Still not much correlation in the data, but the effect is somewhat stronger when the correct data is used.

Good catch Ken; I updated the example to fix the issue. Looks like R got stuck on the value "1,002" in the original data.
