“The greatest value of a picture is when it forces us to notice what we never expected to see.” - J. W. Tukey (1977)

To learn a new skill, I think there need to be two main drivers:

  • An interest in the topic
  • Easy-to-reach, tangible wins that encourage deeper understanding

Hopefully you have stumbled across this blog because you have an interest in footy and because you want to start analysing the game yourself. So let's get started.

The graph you will be able to create by the end of this post is a cumulative line chart showing how quickly a player racks up a certain stat. For an example of a final product, have a look at a graph produced by Matt Cowgill.

To get started using R, you can download R from here and a nice editor for it, RStudio, from here.

Once you have both installed, let's get cracking.

install.packages("tidyverse")
install.packages("devtools")
devtools::install_github("jimmyday12/fitzRoy")
library(fitzRoy)
library(tidyverse)
## Warning: package 'fitzRoy' was built under R version 3.5.1
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.7
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## Warning: package 'tidyr' was built under R version 3.5.1
## Warning: package 'purrr' was built under R version 3.5.1
## Warning: package 'dplyr' was built under R version 3.5.1
## Warning: package 'stringr' was built under R version 3.5.1
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
df<-fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = Sys.Date())
## Returning data from 1897-01-01 to 2018-11-24
## Downloading data
## 
## Finished downloading data. Processing XMLs
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 396 parsing failures.
## # A tibble: 5 x 5
##     row col   expected   actual file
##   <int> <chr> <chr>      <chr>  <chr>
## 1  8713 Round an integer QF     'https://afltables.com/afl/stats/2018_sta~
## 2  8714 Round an integer QF     'https://afltables.com/afl/stats/2018_sta~
## 3  8715 Round an integer QF     'https://afltables.com/afl/stats/2018_sta~
## 4  8716 Round an integer QF     'https://afltables.com/afl/stats/2018_sta~
## 5  8717 Round an integer QF     'https://afltables.com/afl/stats/2018_sta~
## See problems(...) for more details.
## Warning: Unknown columns: `Substitute`
## Finished getting afltables data
df %>%
  filter(Season > 1990) %>%
  group_by(ID) %>%                                  # one running count per player
  mutate(games_played = row_number()) %>%           # game number within the filtered data
  mutate(cumulative_frees = cumsum(Frees.For)) %>%  # running total of frees for
  ggplot(aes(x = games_played, y = cumulative_frees, group = ID)) +
  geom_line() +
  xlab("Games Played") +
  ylab("Cumulative Count of Free Kicks Received")

Hopefully you are able to run the script above and get the same graph. If not, use #makemeauseR and tweet at me and I will lend a hand. Alternatively, there are great open Slack groups like R4DS, where the members are some of the most helpful going around!

So now you hopefully have the quick win out of the way and are a bit more keen to dive in and see how this all works.
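If the group_by, row_number and cumsum steps in that first script felt like magic, here is a rough sketch of the same idea on a tiny made-up table. The two players and their numbers are hypothetical (not fitzRoy data), and I am assuming the rows are already in game order.

```r
library(dplyr)

# Hypothetical toy data: two players, one row per game, already in game order
toy <- data.frame(
  ID        = c(1, 1, 1, 2, 2),
  Frees.For = c(2, 0, 3, 1, 1)
)

toy_cumulative <- toy %>%
  group_by(ID) %>%                          # restart the counts for each player
  mutate(
    games_played     = row_number(),        # 1, 2, 3, ... within each player
    cumulative_frees = cumsum(Frees.For)    # running total within each player
  ) %>%
  ungroup()

toy_cumulative
```

Each player's running total starts again from their first row, which is what lets every player get their own line on the chart.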

Why use ggplot2

  • Writing the code helps you think about how your data drives the visualisation
  • It is easy to use and easy to change if you want to see different variables, timeframes, etc.

Before you plot checklist

  • data is tidy
  • each variable is a column
  • each observation is a row
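As a rough illustration of that checklist, here is a small sketch using made-up data (the players and numbers are hypothetical). It assumes tidyr 1.0.0 or later, which is where pivot_longer() comes from. The "wide" table has one column per round, so the round is hiding in the column names; the tidy version has one row per player per round.

```r
library(tidyr)

# Hypothetical "wide" table: one row per player, one column per round
wide <- data.frame(
  player  = c("Player A", "Player B"),
  round_1 = c(5, 3),
  round_2 = c(7, 4)
)

# Reshape so each row is one observation: a player's marks in one round
tidy <- pivot_longer(
  wide,
  cols      = c(round_1, round_2),
  names_to  = "round",
  values_to = "marks"
)

tidy
```

Once the data looks like the tidy version, round and marks can go straight into aes() as variables.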

Another example, with a little explanation

Get the data – fitzRoy hopefully makes this an easier job because the data is already stored in a tidy format

But it's not enough just to have data; after all, what's the point? You want to analyse and visualise data because you have a question in mind, something you find interesting and are driven to look into.

So with that in mind, let's see if there is a relationship between Contested.Marks and Marks.

# install.packages("devtools")
devtools::install_github("jimmyday12/fitzRoy")
library(fitzRoy)
library(tidyverse)
df<-fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = Sys.Date())

Now that you have your dataframe df, you can see that it is a collection of variables (columns) and observations of those variables (rows). Let's look at some of the variables in df:

  • Marks
  • Contested.Marks

A bit more plotting explanation

df %>%
  ggplot(aes(x = Marks, y = Contested.Marks)) +
  geom_point() +
  facet_wrap(~Playing.for)
## Warning: Removed 401 rows containing missing values (geom_point).

Let me explain some of the commands going on here.

df is what we called our dataframe earlier. It is followed by the ‘pipe’ operator, %>%. In simple terms, the pipe takes whatever is on its left (here df) and inserts it as the first argument of the function on its right (here ggplot). This “chaining” lets you pass a result on to the next function. In the example above, we first created a dataframe called df (how creative); on the next line, df becomes the data we base our plot on.
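To make the pipe a bit more concrete, here is a minimal sketch on a made-up dataframe (the numbers are hypothetical). It writes the same calculation twice, once as a nested call and once with %>%, and checks that both give the same answer.

```r
library(dplyr)

# Hypothetical toy data standing in for df
toy <- data.frame(
  Marks           = c(4, 9, 6),
  Contested.Marks = c(1, 3, 2)
)

# Nested call: you have to read it from the inside out
nested <- nrow(filter(toy, Marks > 5))

# Piped call: reads left to right, one step at a time
piped <- toy %>%
  filter(Marks > 5) %>%
  nrow()

identical(nested, piped)
```

Both versions count the rows with more than 5 marks; the piped one just reads in the order the steps happen.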

aes

aes: Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms. Aesthetics are things such as the x and y positions and colour, for example x=Marks, y=Contested.Marks, colour=Playing.for.

geom

Geoms are the geometric objects displayed in the plot. Here geom_ controls the type of plot you want. For example, geom_point will give you a scatterplot and geom_line a line graph.
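Here is a small sketch of how aes and geoms fit together, using a made-up dataframe (the team names and numbers are hypothetical). The mapping is set up once, and swapping the geom changes the type of plot without touching the data or the mapping.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as the afltables data
toy <- data.frame(
  Marks           = c(4, 9, 6, 2),
  Contested.Marks = c(1, 3, 2, 0),
  Playing.for     = c("Team A", "Team A", "Team B", "Team B")
)

# The aesthetic mapping: which variable goes on x, on y, and drives colour
base <- ggplot(toy, aes(x = Marks, y = Contested.Marks, colour = Playing.for))

scatter <- base + geom_point()  # same mapping drawn as points
lines   <- base + geom_line()   # same mapping drawn as lines

scatter
```

Printing lines instead of scatter draws the line version of exactly the same mapping.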

facet

Faceting is a more general case of what are commonly called conditioned or trellis plots. It creates small multiples, one panel for each subset of a dataset. These plots come in handy when you want to compare whether patterns are the same or different across conditions.
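As a quick sketch of faceting on made-up data (three hypothetical teams with two rows each), facet_wrap splits one scatterplot into one panel per team:

```r
library(ggplot2)

# Hypothetical toy data: three teams, two observations each
toy <- data.frame(
  Marks           = c(4, 9, 6, 2, 7, 5),
  Contested.Marks = c(1, 3, 2, 0, 2, 1),
  Playing.for     = rep(c("Team A", "Team B", "Team C"), each = 2)
)

faceted <- ggplot(toy, aes(x = Marks, y = Contested.Marks)) +
  geom_point() +
  facet_wrap(~Playing.for)  # one panel per value of Playing.for

faceted
```

By default the panels share the same axes, which is what makes it easy to compare teams at a glance.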

Putting it all together

  • Question - I want to visualise the MAE of the squiggle tipsters.

Thankfully, fitzRoy has easy-to-use functions to get the squiggle data!

library(fitzRoy)
tips <- get_squiggle_data("tips")
## Getting data from https://api.squiggle.com.au/?q=tips
head(tips) 
##         hteam         venue                  ateam   err
## 1     Carlton        M.C.G.               Richmond 42.00
## 2     Carlton        M.C.G.               Richmond    NA
## 3     Carlton        M.C.G.               Richmond 48.39
## 4 Collingwood        M.C.G.       Western Bulldogs  3.69
## 5 Collingwood        M.C.G.       Western Bulldogs  3.00
## 6    Adelaide Adelaide Oval Greater Western Sydney 53.00
##                  date hconfidence hteamid             updated correct
## 1 2017-03-23 19:20:00        50.0       3 2017-07-11 13:59:46       1
## 2 2017-03-23 19:20:00        42.0       3 2017-04-10 12:18:02       1
## 3 2017-03-23 19:20:00        56.7       3 2017-07-11 13:59:46       0
## 4 2017-03-24 19:50:00        37.3       4 2017-07-11 13:59:46       1
## 5 2017-03-24 19:50:00        38.0       4 2017-07-11 13:59:46       1
## 6 2017-03-26 15:20:00        50.0       1 2017-07-11 13:59:46       1
##            source round    bits              tip gameid year tipteamid
## 1        Squiggle     1  0.0000         Richmond      1 2017        14
## 2  Figuring Footy     1  0.2141         Richmond      1 2017        14
## 3 Matter of Stats     1 -0.2076          Carlton      1 2017         3
## 4 Matter of Stats     1  0.3265 Western Bulldogs      2 2017        18
## 5        Squiggle     1  0.3103 Western Bulldogs      2 2017        18
## 6        Squiggle     1  0.0000         Adelaide      8 2017         1
##   margin confidence ateamid sourceid
## 1   1.00       50.0      14        1
## 2     NA       58.0      14        3
## 3   5.39       56.7      14        4
## 4  10.31       62.7      18        4
## 5  17.00       62.0      18        1
## 6   3.00       50.0       9        1

Getting data in the right format

What do I actually want to visualise here?

I would like to see the data for this year, 2018: filter(year>2017). For each round and tipster, group_by(round, source), I want that tipster's MAE for that round: summarise(MAE_by_round=mean(err)).

tips %>%
  filter(year > 2017) %>%
  group_by(round, source) %>%
  summarise(MAE_by_round = mean(err))
## # A tibble: 351 x 3
## # Groups:   round [?]
##    round source                MAE_by_round
##    <int> <chr>                        <dbl>
##  1     1 Aggregate                     21.1
##  2     1 Footy Maths Institute         26  
##  3     1 Graft                         20.1
##  4     1 HPN                           31.7
##  5     1 Live Ladders                  20.8
##  6     1 Massey Ratings                21.1
##  7     1 Matter of Stats               24.4
##  8     1 PlusSixOne                    19.7
##  9     1 Punters                       NA  
## 10     1 Squiggle                      22  
## # ... with 341 more rows
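Notice the NA for Punters in the output above. mean() returns NA as soon as any value in the group is missing, so one missing err wipes out the whole round for that tipster. Here is a rough sketch on made-up tips (the tipsters and errors are hypothetical) comparing the default with na.rm = TRUE. It assumes dplyr 1.0.0 or later for the .groups argument. Whether dropping the missing values is the right call depends on why they are missing, so treat this as one option, not the answer.

```r
library(dplyr)

# Hypothetical tips: Tipster B has a missing error for one game
toy_tips <- data.frame(
  round  = c(1, 1, 1, 1),
  source = c("Tipster A", "Tipster A", "Tipster B", "Tipster B"),
  err    = c(10, 20, 30, NA)
)

mae <- toy_tips %>%
  group_by(round, source) %>%
  summarise(
    MAE_strict = mean(err),                # NA if any err in the group is NA
    MAE_na_rm  = mean(err, na.rm = TRUE),  # average of the non-missing values only
    .groups    = "drop"
  )

mae
```

Rows where the summary is NA are the ones ggplot2 drops when plotting, with a "Removed ... rows containing missing values" warning.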

Now that we have our data in the right format for plotting, let's get plotting!

tips %>%
  filter(year > 2017) %>%
  group_by(round, source) %>%
  summarise(MAE_by_round = mean(err)) %>%
  ggplot(aes(x = round, y = MAE_by_round)) +
  geom_point()
## Warning: Removed 27 rows containing missing values (geom_point).

So what can we see here? We can see the average MAE by tipster by round for 2018. The problem is we can’t tell which tipster is which. So how can we do that? Just add colour.

tips %>%
  filter(year > 2017) %>%
  group_by(round, source) %>%
  summarise(MAE_by_round = mean(err)) %>%
  ggplot(aes(x = round, y = MAE_by_round)) +
  geom_point(aes(colour = source))
## Warning: Removed 27 rows containing missing values (geom_point).

Ok, now that we have added some colour, things are still hard to see. So what if we joined the points for each tipster?

tips %>%
  filter(year > 2017) %>%
  group_by(round, source) %>%
  summarise(MAE_by_round = mean(err)) %>%
  ggplot(aes(x = round, y = MAE_by_round)) +
  geom_point(aes(colour = source)) +
  geom_line(aes(group = source, colour = source))
## Warning: Removed 27 rows containing missing values (geom_point).
## Warning: Removed 27 rows containing missing values (geom_path).

Ok, so now things are a bit clearer, but still not as clear as we would like. It's hard to get a feel for each individual tipster because the points are fairly close together, which puts the lines close together too, so untangling them becomes difficult. This is where faceting, or small multiples, comes in handy.

tips %>%
  filter(year > 2017) %>%
  group_by(round, source) %>%
  summarise(MAE_by_round = mean(err)) %>%
  ggplot(aes(x = round, y = MAE_by_round)) +
  geom_point(aes(colour = source)) +
  geom_line(aes(group = source, colour = source)) +
  facet_wrap(~source)
## Warning: Removed 27 rows containing missing values (geom_point).
## Warning: Removed 27 rows containing missing values (geom_path).

There you go, how cool is that!