This year with AFL there has been a lot of talk about how this season is just weird. That it is as close as it has ever been. The question is, is it really? Taking an idea 538 lets see if we can come up with a nice graph ourselves that paints a picture of a close or not so close competition.
Lets start out with a hypothesis, if the game is as close as it has ever been in recent memory, then it follows that:
- Teams aren’t winning by on average a lot of scoring shots
- Teams aren’t losing by on average a lot of scoring shots.
For more reading of scoring shot differential analysis and to give yourself a few more ideas you can tackle at home (now that you can get the data) I can’t reccomend Tony over at matterofstats enough.
Knowing that scoring shots are a good measure of team quality, lets see if looking at scoring shot differential follows a similar narrative to what is being talked about in the media today.
So lets do a couple of things:
- Look at average margin through time
- Look at scoring shot differentials through time
First things first, we need to get the score shot data. We can’t do anything without data.
library(dplyr)
library(readr)
library(lubridate)
library(tidyr)
library(ggplot2)
library(dplyr)
library(readr)
bg3 <- read.fwf(
file=url("http://afltables.com/afl/stats/biglists/bg3.txt"),
skip = 1,
widths=c(6,13,10,18,17,18,18)) %>%
slice(-1) ##you could just skip the first two rows, but I like showing different functions :)
Now that we have the data, we should always view it. Always.
head(bg3)
## V1 V2 V3 V4 V5
## 1 1 8-May-1897 R1 Fitzroy 6.13.49
## 2 2 8-May-1897 R1 Collingwood 5.11.41
## 3 3 8-May-1897 R1 Geelong 3.6.24
## 4 4 8-May-1897 R1 South Melbourne 3.9.27
## 5 5 15-May-1897 R2 South Melbourne 6.4.40
## 6 6 15-May-1897 R2 Essendon 4.6.30
## V6 V7
## 1 Carlton 2.4.16
## 2 St Kilda 2.4.16
## 3 Essendon 7.5.47
## 4 Melbourne 6.8.44
## 5 Carlton 5.6.36
## 6 Collingwood 8.2.50
glimpse(bg3)
## Observations: 15,407
## Variables: 7
## $ V1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
## $ V2 <fct> 8-May-1897 , 8-May-1897 , 8-May-1897 , 8-May-1897 ,...
## $ V3 <fct> R1 , R1 , R1 , R1 , R2 ,...
## $ V4 <fct> Fitzroy , Collingwood , Geelong ,...
## $ V5 <fct> 6.13.49 , 5.11.41 , 3.6.24 , 3....
## $ V6 <fct> Carlton , St Kilda , Essendon ,...
## $ V7 <fct> 2.4.16 , 2.4.16 , 7.5.47 ,...
str(bg3)
## 'data.frame': 15407 obs. of 7 variables:
## $ V1: num 1 2 3 4 5 6 7 8 9 10 ...
## $ V2: Factor w/ 4737 levels " 1-Apr-1972 ",..: 4521 4521 4521 4521 1027 1027 1027 1027 2225 2225 ...
## $ V3: Factor w/ 29 levels " EF "," GF ",..: 5 5 5 5 16 16 16 16 22 22 ...
## $ V4: Factor w/ 24 levels "Adelaide ",..: 7 5 10 19 19 6 20 15 5 7 ...
## $ V5: Factor w/ 675 levels "0.18.18 ",..: 579 554 504 507 592 548 506 650 593 575 ...
## $ V6: Factor w/ 24 levels "Adelaide ",..: 4 20 6 15 4 5 7 10 10 15 ...
## $ V7: Factor w/ 633 levels "0.1.1 ",..: 292 292 577 555 532 592 45 459 533 580 ...
#View(bg3)
What do you notice about $V2, it is saved as a factor variable. We know by looking at it that it is our date variable so lets change it to that.
bg3$V2<-dmy(bg3$V2) ###using the lubridate R package converting v2 to dates
head(bg3)
## V1 V2 V3 V4 V5
## 1 1 1897-05-08 R1 Fitzroy 6.13.49
## 2 2 1897-05-08 R1 Collingwood 5.11.41
## 3 3 1897-05-08 R1 Geelong 3.6.24
## 4 4 1897-05-08 R1 South Melbourne 3.9.27
## 5 5 1897-05-15 R2 South Melbourne 6.4.40
## 6 6 1897-05-15 R2 Essendon 4.6.30
## V6 V7
## 1 Carlton 2.4.16
## 2 St Kilda 2.4.16
## 3 Essendon 7.5.47
## 4 Melbourne 6.8.44
## 5 Carlton 5.6.36
## 6 Collingwood 8.2.50
glimpse(bg3)
## Observations: 15,407
## Variables: 7
## $ V1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
## $ V2 <date> 1897-05-08, 1897-05-08, 1897-05-08, 1897-05-08, 1897-05-15...
## $ V3 <fct> R1 , R1 , R1 , R1 , R2 ,...
## $ V4 <fct> Fitzroy , Collingwood , Geelong ,...
## $ V5 <fct> 6.13.49 , 5.11.41 , 3.6.24 , 3....
## $ V6 <fct> Carlton , St Kilda , Essendon ,...
## $ V7 <fct> 2.4.16 , 2.4.16 , 7.5.47 ,...
str(bg3)
## 'data.frame': 15407 obs. of 7 variables:
## $ V1: num 1 2 3 4 5 6 7 8 9 10 ...
## $ V2: Date, format: "1897-05-08" "1897-05-08" ...
## $ V3: Factor w/ 29 levels " EF "," GF ",..: 5 5 5 5 16 16 16 16 22 22 ...
## $ V4: Factor w/ 24 levels "Adelaide ",..: 7 5 10 19 19 6 20 15 5 7 ...
## $ V5: Factor w/ 675 levels "0.18.18 ",..: 579 554 504 507 592 548 506 650 593 575 ...
## $ V6: Factor w/ 24 levels "Adelaide ",..: 4 20 6 15 4 5 7 10 10 15 ...
## $ V7: Factor w/ 633 levels "0.1.1 ",..: 292 292 577 555 532 592 45 459 533 580 ...
#View(bg3)
Next we want to get out the scores
afl.games<-separate(bg3,V5, into=c("Home Goals","Home Behinds","Home Score"),sep="\\.")
afl.games<-separate(afl.games,V7, into=c("Away Goals","Away Behinds","Away Score"),sep="\\.")
head(afl.games)
## V1 V2 V3 V4 Home Goals Home Behinds
## 1 1 1897-05-08 R1 Fitzroy 6 13
## 2 2 1897-05-08 R1 Collingwood 5 11
## 3 3 1897-05-08 R1 Geelong 3 6
## 4 4 1897-05-08 R1 South Melbourne 3 9
## 5 5 1897-05-15 R2 South Melbourne 6 4
## 6 6 1897-05-15 R2 Essendon 4 6
## Home Score V6 Away Goals Away Behinds Away Score
## 1 49 Carlton 2 4 16
## 2 41 St Kilda 2 4 16
## 3 24 Essendon 7 5 47
## 4 27 Melbourne 6 8 44
## 5 40 Carlton 5 6 36
## 6 30 Collingwood 8 2 50
afl.games<-cbind(afl.games,season=year(afl.games$V2))
glimpse(afl.games)
## Observations: 15,407
## Variables: 12
## $ V1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ V2 <date> 1897-05-08, 1897-05-08, 1897-05-08, 1897-05-08...
## $ V3 <fct> R1 , R1 , R1 , R1 ,...
## $ V4 <fct> Fitzroy , Collingwood , Geelong...
## $ `Home Goals` <chr> "6", "5", "3", "3", "6", "4", "3", "9", "6", "5...
## $ `Home Behinds` <chr> "13", "11", "6", "9", "4", "6", "8", "10", "5",...
## $ `Home Score` <chr> "49 ", "41 ", "24 "...
## $ V6 <fct> Carlton , St Kilda , Essendo...
## $ `Away Goals` <chr> "2", "2", "7", "6", "5", "8", "10", "3", "5", "...
## $ `Away Behinds` <chr> "4", "4", "5", "8", "6", "2", "6", "1", "7", "8...
## $ `Away Score` <chr> "16 ", "16 ", "47 ...
## $ season <dbl> 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897,...
str(afl.games)
## 'data.frame': 15407 obs. of 12 variables:
## $ V1 : num 1 2 3 4 5 6 7 8 9 10 ...
## $ V2 : Date, format: "1897-05-08" "1897-05-08" ...
## $ V3 : Factor w/ 29 levels " EF "," GF ",..: 5 5 5 5 16 16 16 16 22 22 ...
## $ V4 : Factor w/ 24 levels "Adelaide ",..: 7 5 10 19 19 6 20 15 5 7 ...
## $ Home Goals : chr "6" "5" "3" "3" ...
## $ Home Behinds: chr "13" "11" "6" "9" ...
## $ Home Score : chr "49 " "41 " "24 " "27 " ...
## $ V6 : Factor w/ 24 levels "Adelaide ",..: 4 20 6 15 4 5 7 10 10 15 ...
## $ Away Goals : chr "2" "2" "7" "6" ...
## $ Away Behinds: chr "4" "4" "5" "8" ...
## $ Away Score : chr "16 " "16 " "47 " "44 " ...
## $ season : num 1897 1897 1897 1897 1897 ...
# View(afl.games)
After viewing the data, we can see that we would want \(Home Goals\), \(Away Goals\) etc to be saved as numbers instead of characters. We can do this as follows using as.numeric
afl.games$`Home Goals`<-as.numeric(afl.games$`Home Goals`)
afl.games$`Home Behinds`<-as.numeric(afl.games$`Home Behinds`)
afl.games$`Home Score`<-as.numeric(afl.games$`Home Score`)
afl.games$`Away Goals`<-as.numeric(afl.games$`Away Goals`)
afl.games$`Away Behinds`<-as.numeric(afl.games$`Away Behinds`)
afl.games$`Away Score`<-as.numeric(afl.games$`Away Score`)
afl.games$margin<-afl.games$`Home Score`-afl.games$`Away Score`
glimpse(afl.games)
## Observations: 15,407
## Variables: 13
## $ V1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ V2 <date> 1897-05-08, 1897-05-08, 1897-05-08, 1897-05-08...
## $ V3 <fct> R1 , R1 , R1 , R1 ,...
## $ V4 <fct> Fitzroy , Collingwood , Geelong...
## $ `Home Goals` <dbl> 6, 5, 3, 3, 6, 4, 3, 9, 6, 5, 12, 8, 5, 5, 2, 1...
## $ `Home Behinds` <dbl> 13, 11, 6, 9, 4, 6, 8, 10, 5, 9, 6, 11, 14, 11,...
## $ `Home Score` <dbl> 49, 41, 24, 27, 40, 30, 26, 64, 41, 39, 78, 59,...
## $ V6 <fct> Carlton , St Kilda , Essendo...
## $ `Away Goals` <dbl> 2, 2, 7, 6, 5, 8, 10, 3, 5, 7, 6, 0, 3, 5, 6, 7...
## $ `Away Behinds` <dbl> 4, 4, 5, 8, 6, 2, 6, 1, 7, 8, 5, 2, 4, 3, 6, 4,...
## $ `Away Score` <dbl> 16, 16, 47, 44, 36, 50, 66, 19, 37, 50, 41, 2, ...
## $ season <dbl> 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897,...
## $ margin <dbl> 33, 25, -23, -17, 4, -20, -40, 45, 4, -11, 37, ...
str(afl.games)
## 'data.frame': 15407 obs. of 13 variables:
## $ V1 : num 1 2 3 4 5 6 7 8 9 10 ...
## $ V2 : Date, format: "1897-05-08" "1897-05-08" ...
## $ V3 : Factor w/ 29 levels " EF "," GF ",..: 5 5 5 5 16 16 16 16 22 22 ...
## $ V4 : Factor w/ 24 levels "Adelaide ",..: 7 5 10 19 19 6 20 15 5 7 ...
## $ Home Goals : num 6 5 3 3 6 4 3 9 6 5 ...
## $ Home Behinds: num 13 11 6 9 4 6 8 10 5 9 ...
## $ Home Score : num 49 41 24 27 40 30 26 64 41 39 ...
## $ V6 : Factor w/ 24 levels "Adelaide ",..: 4 20 6 15 4 5 7 10 10 15 ...
## $ Away Goals : num 2 2 7 6 5 8 10 3 5 7 ...
## $ Away Behinds: num 4 4 5 8 6 2 6 1 7 8 ...
## $ Away Score : num 16 16 47 44 36 50 66 19 37 50 ...
## $ season : num 1897 1897 1897 1897 1897 ...
## $ margin : num 33 25 -23 -17 4 -20 -40 45 4 -11 ...
Looking at margins
Looking at this we can see we have negative margins, which makes sense as we have defined it as \(Home Score\) - \(Away Score\). So all it is really saying is that away teams happen to win games.
As we are not interested in things from only the home teams point of view, lets take the absolute value of margin.
afl.games$margin<-abs(afl.games$margin)
Lets get plotting yeah!
Remember up top, we wanted to look at average margin through time.
afl.games%>%
group_by(season)%>%
summarise(avemargin=mean(margin)) %>%
ggplot(aes(x=season,y=avemargin))+geom_line() +
ylab("Average Margin") +
xlab("Season") + ggtitle("Are margins really going down?")
Looking at the plot above do you think that the average margin of games is going up or down recently? What about compared to a few years ago?
Lets look at scoring shots
First step is lets create the dataset to analyse.
afl.games$home.ss.diff<-afl.games$`Home Goals`+afl.games$`Home Behinds` -afl.games$`Away Goals`-afl.games$`Away Behinds`
afl.games$away.ss.diff<-afl.games$`Away Goals`+afl.games$`Away Behinds` -afl.games$`Home Goals`-afl.games$`Home Behinds`
# head(afl.games)
# glimpse(afl.games)
home<-select(afl.games,season,V3,V4,home.ss.diff)
away<-select(afl.games,season,V3,V6,away.ss.diff)
head(home)
## season V3 V4 home.ss.diff
## 1 1897 R1 Fitzroy 13
## 2 1897 R1 Collingwood 10
## 3 1897 R1 Geelong -3
## 4 1897 R1 South Melbourne -2
## 5 1897 R2 South Melbourne -1
## 6 1897 R2 Essendon 0
head(away)
## season V3 V6 away.ss.diff
## 1 1897 R1 Carlton -13
## 2 1897 R1 St Kilda -10
## 3 1897 R1 Essendon 3
## 4 1897 R1 Melbourne 2
## 5 1897 R2 Carlton 1
## 6 1897 R2 Collingwood 0
Looking at the above datasets (home, away) ideally you would just like to stack them on top of each other. We can do this using rbind. Before using rbind, one thing to keep in mind is that the columns should be named the same.
So lets rename the columns so that they are the same and then lets stack them on top of each other.
names(home)[names(home)=="season"]<-"Season"
names(home)[names(home)=="V3"]<-"Round"
names(home)[names(home)=="V4"]<-"Team"
names(home)[names(home)=="home.ss.diff"]<-"ss.diff"
names(away)[names(away)=="season"]<-"Season"
names(away)[names(away)=="V3"]<-"Round"
names(away)[names(away)=="V6"]<-"Team"
names(away)[names(away)=="away.ss.diff"]<-"ss.diff"
games<-rbind(home,away)
head(games)
## Season Round Team ss.diff
## 1 1897 R1 Fitzroy 13
## 2 1897 R1 Collingwood 10
## 3 1897 R1 Geelong -3
## 4 1897 R1 South Melbourne -2
## 5 1897 R2 South Melbourne -1
## 6 1897 R2 Essendon 0
glimpse(games)
## Observations: 30,814
## Variables: 4
## $ Season <dbl> 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897, ...
## $ Round <fct> R1 , R1 , R1 , R1 , R...
## $ Team <fct> Fitzroy , Collingwood , Geelong ...
## $ ss.diff <dbl> 13, 10, -3, -2, -1, 0, -5, 15, -1, -1, 7, 17, 12, 8, -...
What we want to do now is create a summary table whereby we can get the total scoring shot difference, the standard deviation of scoring shot differential and the average scoring shot differential by team within each season. We can do that as follows:
games<-select(games,Season, Team,ss.diff)
afl<-group_by(games,Season,Team)
df1<-summarise(afl,total.ss.diff=sum(ss.diff),std.ss.diff=sd(ss.diff),average.ss.diff=mean(ss.diff))
head(df1)
## # A tibble: 6 x 5
## # Groups: Season [1]
## Season Team total.ss.diff std.ss.diff average.ss.diff
## <dbl> <fct> <dbl> <dbl> <dbl>
## 1 1897 "Carlton " -126 8.47 -9
## 2 1897 "Collingwood " 14 8.20 0.824
## 3 1897 "Essendon " 78 9.02 4.59
## 4 1897 "Fitzroy " 19 10.2 1.36
## 5 1897 "Geelong " 113 10.1 6.65
## 6 1897 "Melbourne " 38 9.68 2.24
glimpse(df1)
## Observations: 1,496
## Variables: 5
## $ Season <dbl> 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897...
## $ Team <fct> Carlton , Collingwood , Essend...
## $ total.ss.diff <dbl> -126, 14, 78, 19, 113, 38, 60, -196, -147, 128...
## $ std.ss.diff <dbl> 8.467131, 8.202403, 9.021217, 10.172415, 10.11...
## $ average.ss.diff <dbl> -9.0000000, 0.8235294, 4.5882353, 1.3571429, 6...
Now that we have the table we want, lets look at scoring shot differential, but instead of looking at it for all years, lets just look at things this side of 1999.
df1%>%filter(Season>1999)%>%
ggplot(aes(x=Season,y=average.ss.diff))+
geom_point() +geom_hline(yintercept = 5)+
geom_hline(yintercept = -5) +ylab("Average Scoring Shot Differential")
This graph is pretty interesting
From it we can see at first glance
- Only 3 teams have an absolute scoring shot differential average of over 5 this year which is the lowest its been in over 10 years.
- Only 1 team (guess who) get beaten by on average over 5 scoring shots which is the lowest its been this millenium
Of course we shouldn’t just base our conclusions from one graph. Now that you have the data, what would you graph?