This year with AFL there has been a lot of talk about how this season is just weird. That it is as close as it has ever been. The question is, is it really? Taking an idea 538 lets see if we can come up with a nice graph ourselves that paints a picture of a close or not so close competition.

Lets start out with a hypothesis, if the game is as close as it has ever been in recent memory, then it follows that:

  • Teams aren’t winning by on average a lot of scoring shots
  • Teams aren’t losing by on average a lot of scoring shots.

For more reading of scoring shot differential analysis and to give yourself a few more ideas you can tackle at home (now that you can get the data) I can’t reccomend Tony over at matterofstats enough.

Knowing that scoring shots are a good measure of team quality, lets see if looking at scoring shot differential follows a similar narrative to what is being talked about in the media today.

So lets do a couple of things:

  1. Look at average margin through time
  2. Look at scoring shot differentials through time

First things first, we need to get the score shot data. We can’t do anything without data.

library(dplyr)
library(readr)
library(lubridate)
library(tidyr)
library(ggplot2)
library(dplyr)
library(readr)
bg3 <- read.fwf(
  file=url("http://afltables.com/afl/stats/biglists/bg3.txt"), 
      skip = 1,
  widths=c(6,13,10,18,17,18,18)) %>% 
  slice(-1) ##you could just skip the first two rows, but I like showing different functions :)

Now that we have the data, we should always view it. Always.

head(bg3)
##   V1            V2         V3                 V4                V5
## 1  1  8-May-1897        R1    Fitzroy            6.13.49          
## 2  2  8-May-1897        R1    Collingwood        5.11.41          
## 3  3  8-May-1897        R1    Geelong            3.6.24           
## 4  4  8-May-1897        R1    South Melbourne    3.9.27           
## 5  5  15-May-1897       R2    South Melbourne    6.4.40           
## 6  6  15-May-1897       R2    Essendon           4.6.30           
##                   V6                 V7
## 1 Carlton            2.4.16            
## 2 St Kilda           2.4.16            
## 3 Essendon           7.5.47            
## 4 Melbourne          6.8.44            
## 5 Carlton            5.6.36            
## 6 Collingwood        8.2.50
glimpse(bg3)
## Observations: 15,407
## Variables: 7
## $ V1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
## $ V2 <fct>  8-May-1897  ,  8-May-1897  ,  8-May-1897  ,  8-May-1897  ,...
## $ V3 <fct>      R1   ,      R1   ,      R1   ,      R1   ,      R2   ,...
## $ V4 <fct> Fitzroy           , Collingwood       , Geelong           ,...
## $ V5 <fct> 6.13.49          , 5.11.41          , 3.6.24           , 3....
## $ V6 <fct> Carlton           , St Kilda          , Essendon          ,...
## $ V7 <fct> 2.4.16            , 2.4.16            , 7.5.47            ,...
str(bg3)
## 'data.frame':    15407 obs. of  7 variables:
##  $ V1: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ V2: Factor w/ 4737 levels " 1-Apr-1972  ",..: 4521 4521 4521 4521 1027 1027 1027 1027 2225 2225 ...
##  $ V3: Factor w/ 29 levels "     EF   ","     GF   ",..: 5 5 5 5 16 16 16 16 22 22 ...
##  $ V4: Factor w/ 24 levels "Adelaide          ",..: 7 5 10 19 19 6 20 15 5 7 ...
##  $ V5: Factor w/ 675 levels "0.18.18          ",..: 579 554 504 507 592 548 506 650 593 575 ...
##  $ V6: Factor w/ 24 levels "Adelaide          ",..: 4 20 6 15 4 5 7 10 10 15 ...
##  $ V7: Factor w/ 633 levels "0.1.1             ",..: 292 292 577 555 532 592 45 459 533 580 ...
#View(bg3)

What do you notice about $V2, it is saved as a factor variable. We know by looking at it that it is our date variable so lets change it to that.

bg3$V2<-dmy(bg3$V2) ###using the lubridate R package converting v2 to dates
head(bg3)
##   V1         V2         V3                 V4                V5
## 1  1 1897-05-08      R1    Fitzroy            6.13.49          
## 2  2 1897-05-08      R1    Collingwood        5.11.41          
## 3  3 1897-05-08      R1    Geelong            3.6.24           
## 4  4 1897-05-08      R1    South Melbourne    3.9.27           
## 5  5 1897-05-15      R2    South Melbourne    6.4.40           
## 6  6 1897-05-15      R2    Essendon           4.6.30           
##                   V6                 V7
## 1 Carlton            2.4.16            
## 2 St Kilda           2.4.16            
## 3 Essendon           7.5.47            
## 4 Melbourne          6.8.44            
## 5 Carlton            5.6.36            
## 6 Collingwood        8.2.50
glimpse(bg3)
## Observations: 15,407
## Variables: 7
## $ V1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
## $ V2 <date> 1897-05-08, 1897-05-08, 1897-05-08, 1897-05-08, 1897-05-15...
## $ V3 <fct>      R1   ,      R1   ,      R1   ,      R1   ,      R2   ,...
## $ V4 <fct> Fitzroy           , Collingwood       , Geelong           ,...
## $ V5 <fct> 6.13.49          , 5.11.41          , 3.6.24           , 3....
## $ V6 <fct> Carlton           , St Kilda          , Essendon          ,...
## $ V7 <fct> 2.4.16            , 2.4.16            , 7.5.47            ,...
str(bg3)
## 'data.frame':    15407 obs. of  7 variables:
##  $ V1: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ V2: Date, format: "1897-05-08" "1897-05-08" ...
##  $ V3: Factor w/ 29 levels "     EF   ","     GF   ",..: 5 5 5 5 16 16 16 16 22 22 ...
##  $ V4: Factor w/ 24 levels "Adelaide          ",..: 7 5 10 19 19 6 20 15 5 7 ...
##  $ V5: Factor w/ 675 levels "0.18.18          ",..: 579 554 504 507 592 548 506 650 593 575 ...
##  $ V6: Factor w/ 24 levels "Adelaide          ",..: 4 20 6 15 4 5 7 10 10 15 ...
##  $ V7: Factor w/ 633 levels "0.1.1             ",..: 292 292 577 555 532 592 45 459 533 580 ...
#View(bg3)

Next we want to get out the scores

afl.games<-separate(bg3,V5, into=c("Home Goals","Home Behinds","Home Score"),sep="\\.") 
  afl.games<-separate(afl.games,V7, into=c("Away Goals","Away Behinds","Away Score"),sep="\\.")

head(afl.games)
##   V1         V2         V3                 V4 Home Goals Home Behinds
## 1  1 1897-05-08      R1    Fitzroy                     6           13
## 2  2 1897-05-08      R1    Collingwood                 5           11
## 3  3 1897-05-08      R1    Geelong                     3            6
## 4  4 1897-05-08      R1    South Melbourne             3            9
## 5  5 1897-05-15      R2    South Melbourne             6            4
## 6  6 1897-05-15      R2    Essendon                    4            6
##      Home Score                 V6 Away Goals Away Behinds     Away Score
## 1  49           Carlton                     2            4 16            
## 2  41           St Kilda                    2            4 16            
## 3 24            Essendon                    7            5 47            
## 4 27            Melbourne                   6            8 44            
## 5 40            Carlton                     5            6 36            
## 6 30            Collingwood                 8            2 50
afl.games<-cbind(afl.games,season=year(afl.games$V2))
glimpse(afl.games)
## Observations: 15,407
## Variables: 12
## $ V1             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ V2             <date> 1897-05-08, 1897-05-08, 1897-05-08, 1897-05-08...
## $ V3             <fct>      R1   ,      R1   ,      R1   ,      R1   ,...
## $ V4             <fct> Fitzroy           , Collingwood       , Geelong...
## $ `Home Goals`   <chr> "6", "5", "3", "3", "6", "4", "3", "9", "6", "5...
## $ `Home Behinds` <chr> "13", "11", "6", "9", "4", "6", "8", "10", "5",...
## $ `Home Score`   <chr> "49          ", "41          ", "24           "...
## $ V6             <fct> Carlton           , St Kilda          , Essendo...
## $ `Away Goals`   <chr> "2", "2", "7", "6", "5", "8", "10", "3", "5", "...
## $ `Away Behinds` <chr> "4", "4", "5", "8", "6", "2", "6", "1", "7", "8...
## $ `Away Score`   <chr> "16            ", "16            ", "47        ...
## $ season         <dbl> 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897,...
str(afl.games)
## 'data.frame':    15407 obs. of  12 variables:
##  $ V1          : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ V2          : Date, format: "1897-05-08" "1897-05-08" ...
##  $ V3          : Factor w/ 29 levels "     EF   ","     GF   ",..: 5 5 5 5 16 16 16 16 22 22 ...
##  $ V4          : Factor w/ 24 levels "Adelaide          ",..: 7 5 10 19 19 6 20 15 5 7 ...
##  $ Home Goals  : chr  "6" "5" "3" "3" ...
##  $ Home Behinds: chr  "13" "11" "6" "9" ...
##  $ Home Score  : chr  "49          " "41          " "24           " "27           " ...
##  $ V6          : Factor w/ 24 levels "Adelaide          ",..: 4 20 6 15 4 5 7 10 10 15 ...
##  $ Away Goals  : chr  "2" "2" "7" "6" ...
##  $ Away Behinds: chr  "4" "4" "5" "8" ...
##  $ Away Score  : chr  "16            " "16            " "47            " "44            " ...
##  $ season      : num  1897 1897 1897 1897 1897 ...
# View(afl.games)

After viewing the data, we can see that we would want \(Home Goals\), \(Away Goals\) etc to be saved as numbers instead of characters. We can do this as follows using as.numeric

afl.games$`Home Goals`<-as.numeric(afl.games$`Home Goals`)
afl.games$`Home Behinds`<-as.numeric(afl.games$`Home Behinds`)
afl.games$`Home Score`<-as.numeric(afl.games$`Home Score`)

afl.games$`Away Goals`<-as.numeric(afl.games$`Away Goals`)
afl.games$`Away Behinds`<-as.numeric(afl.games$`Away Behinds`)
afl.games$`Away Score`<-as.numeric(afl.games$`Away Score`)
afl.games$margin<-afl.games$`Home Score`-afl.games$`Away Score`
glimpse(afl.games)
## Observations: 15,407
## Variables: 13
## $ V1             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ V2             <date> 1897-05-08, 1897-05-08, 1897-05-08, 1897-05-08...
## $ V3             <fct>      R1   ,      R1   ,      R1   ,      R1   ,...
## $ V4             <fct> Fitzroy           , Collingwood       , Geelong...
## $ `Home Goals`   <dbl> 6, 5, 3, 3, 6, 4, 3, 9, 6, 5, 12, 8, 5, 5, 2, 1...
## $ `Home Behinds` <dbl> 13, 11, 6, 9, 4, 6, 8, 10, 5, 9, 6, 11, 14, 11,...
## $ `Home Score`   <dbl> 49, 41, 24, 27, 40, 30, 26, 64, 41, 39, 78, 59,...
## $ V6             <fct> Carlton           , St Kilda          , Essendo...
## $ `Away Goals`   <dbl> 2, 2, 7, 6, 5, 8, 10, 3, 5, 7, 6, 0, 3, 5, 6, 7...
## $ `Away Behinds` <dbl> 4, 4, 5, 8, 6, 2, 6, 1, 7, 8, 5, 2, 4, 3, 6, 4,...
## $ `Away Score`   <dbl> 16, 16, 47, 44, 36, 50, 66, 19, 37, 50, 41, 2, ...
## $ season         <dbl> 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897,...
## $ margin         <dbl> 33, 25, -23, -17, 4, -20, -40, 45, 4, -11, 37, ...
str(afl.games)
## 'data.frame':    15407 obs. of  13 variables:
##  $ V1          : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ V2          : Date, format: "1897-05-08" "1897-05-08" ...
##  $ V3          : Factor w/ 29 levels "     EF   ","     GF   ",..: 5 5 5 5 16 16 16 16 22 22 ...
##  $ V4          : Factor w/ 24 levels "Adelaide          ",..: 7 5 10 19 19 6 20 15 5 7 ...
##  $ Home Goals  : num  6 5 3 3 6 4 3 9 6 5 ...
##  $ Home Behinds: num  13 11 6 9 4 6 8 10 5 9 ...
##  $ Home Score  : num  49 41 24 27 40 30 26 64 41 39 ...
##  $ V6          : Factor w/ 24 levels "Adelaide          ",..: 4 20 6 15 4 5 7 10 10 15 ...
##  $ Away Goals  : num  2 2 7 6 5 8 10 3 5 7 ...
##  $ Away Behinds: num  4 4 5 8 6 2 6 1 7 8 ...
##  $ Away Score  : num  16 16 47 44 36 50 66 19 37 50 ...
##  $ season      : num  1897 1897 1897 1897 1897 ...
##  $ margin      : num  33 25 -23 -17 4 -20 -40 45 4 -11 ...

Looking at margins

Looking at this we can see we have negative margins, which makes sense as we have defined it as \(Home Score\) - \(Away Score\). So all it is really saying is that away teams happen to win games.

As we are not interested in things from only the home teams point of view, lets take the absolute value of margin.

afl.games$margin<-abs(afl.games$margin)

Lets get plotting yeah!

Remember up top, we wanted to look at average margin through time.

afl.games%>%
  group_by(season)%>%
  summarise(avemargin=mean(margin))  %>%
   ggplot(aes(x=season,y=avemargin))+geom_line() +
  ylab("Average Margin") + 
  xlab("Season") + ggtitle("Are margins really going down?")

Looking at the plot above do you think that the average margin of games is going up or down recently? What about compared to a few years ago?

Lets look at scoring shots

First step is lets create the dataset to analyse.

afl.games$home.ss.diff<-afl.games$`Home Goals`+afl.games$`Home Behinds` -afl.games$`Away Goals`-afl.games$`Away Behinds`

afl.games$away.ss.diff<-afl.games$`Away Goals`+afl.games$`Away Behinds` -afl.games$`Home Goals`-afl.games$`Home Behinds`
# head(afl.games)
# glimpse(afl.games)

home<-select(afl.games,season,V3,V4,home.ss.diff)
away<-select(afl.games,season,V3,V6,away.ss.diff)
head(home)
##   season         V3                 V4 home.ss.diff
## 1   1897      R1    Fitzroy                      13
## 2   1897      R1    Collingwood                  10
## 3   1897      R1    Geelong                      -3
## 4   1897      R1    South Melbourne              -2
## 5   1897      R2    South Melbourne              -1
## 6   1897      R2    Essendon                      0
head(away)
##   season         V3                 V6 away.ss.diff
## 1   1897      R1    Carlton                     -13
## 2   1897      R1    St Kilda                    -10
## 3   1897      R1    Essendon                      3
## 4   1897      R1    Melbourne                     2
## 5   1897      R2    Carlton                       1
## 6   1897      R2    Collingwood                   0

Looking at the above datasets (home, away) ideally you would just like to stack them on top of each other. We can do this using rbind. Before using rbind, one thing to keep in mind is that the columns should be named the same.

So lets rename the columns so that they are the same and then lets stack them on top of each other.

names(home)[names(home)=="season"]<-"Season"
names(home)[names(home)=="V3"]<-"Round"
names(home)[names(home)=="V4"]<-"Team"
names(home)[names(home)=="home.ss.diff"]<-"ss.diff"

names(away)[names(away)=="season"]<-"Season"
names(away)[names(away)=="V3"]<-"Round"
names(away)[names(away)=="V6"]<-"Team"
names(away)[names(away)=="away.ss.diff"]<-"ss.diff"



games<-rbind(home,away)
head(games)
##   Season      Round               Team ss.diff
## 1   1897      R1    Fitzroy                 13
## 2   1897      R1    Collingwood             10
## 3   1897      R1    Geelong                 -3
## 4   1897      R1    South Melbourne         -2
## 5   1897      R2    South Melbourne         -1
## 6   1897      R2    Essendon                 0
glimpse(games)
## Observations: 30,814
## Variables: 4
## $ Season  <dbl> 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897, ...
## $ Round   <fct>      R1   ,      R1   ,      R1   ,      R1   ,      R...
## $ Team    <fct> Fitzroy           , Collingwood       , Geelong       ...
## $ ss.diff <dbl> 13, 10, -3, -2, -1, 0, -5, 15, -1, -1, 7, 17, 12, 8, -...

What we want to do now is create a summary table whereby we can get the total scoring shot difference, the standard deviation of scoring shot differential and the average scoring shot differential by team within each season. We can do that as follows:

games<-select(games,Season, Team,ss.diff)

afl<-group_by(games,Season,Team)
 
df1<-summarise(afl,total.ss.diff=sum(ss.diff),std.ss.diff=sd(ss.diff),average.ss.diff=mean(ss.diff))
head(df1)
## # A tibble: 6 x 5
## # Groups:   Season [1]
##   Season Team                 total.ss.diff std.ss.diff average.ss.diff
##    <dbl> <fct>                        <dbl>       <dbl>           <dbl>
## 1   1897 "Carlton           "          -126        8.47          -9    
## 2   1897 "Collingwood       "            14        8.20           0.824
## 3   1897 "Essendon          "            78        9.02           4.59 
## 4   1897 "Fitzroy           "            19       10.2            1.36 
## 5   1897 "Geelong           "           113       10.1            6.65 
## 6   1897 "Melbourne         "            38        9.68           2.24
glimpse(df1)
## Observations: 1,496
## Variables: 5
## $ Season          <dbl> 1897, 1897, 1897, 1897, 1897, 1897, 1897, 1897...
## $ Team            <fct> Carlton           , Collingwood       , Essend...
## $ total.ss.diff   <dbl> -126, 14, 78, 19, 113, 38, 60, -196, -147, 128...
## $ std.ss.diff     <dbl> 8.467131, 8.202403, 9.021217, 10.172415, 10.11...
## $ average.ss.diff <dbl> -9.0000000, 0.8235294, 4.5882353, 1.3571429, 6...

Now that we have the table we want, lets look at scoring shot differential, but instead of looking at it for all years, lets just look at things this side of 1999.

df1%>%filter(Season>1999)%>%
  ggplot(aes(x=Season,y=average.ss.diff))+
  geom_point() +geom_hline(yintercept = 5)+
  geom_hline(yintercept = -5) +ylab("Average Scoring Shot Differential")

This graph is pretty interesting

From it we can see at first glance

  • Only 3 teams have an absolute scoring shot differential average of over 5 this year which is the lowest its been in over 10 years.
  • Only 1 team (guess who) get beaten by on average over 5 scoring shots which is the lowest its been this millenium

Of course we shouldn’t just base our conclusions from one graph. Now that you have the data, what would you graph?