Group 4 WackyWednesday - Game Metrics: Decoding Premier League Outcomes Through Visual Analysis and Data Exploration

Exploring home versus away team dynamics in the world of English football

Dataset

soccer <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-04-04/soccer21-22.csv')

Description of the Dataset

The dataset comes from TidyTuesday’s 04 April 2023 post “Premier League Match Data 2021-2022”. It is derived from Evan Gower’s Kaggle post of the same name.

The Premier League has been the foremost league for professional football clubs in England since its creation in 1992. This dataset spans the entire 2021–22 Premier League season, which marked the 30th anniversary of the league.

Variables in this dataset include the teams, the referee, and statistics for the home and away sides such as goals, shots, fouls, corners, and cards. The dataset is in a tabular format with 380 rows, each representing an individual match, and 22 columns, one per variable, described below. Both categorical and numerical variables are included.

A glimpse at each variable:

Rows: 380
Columns: 22
$ Date     <chr> "13/08/2021", "14/08/2021", "14/08/2021", "14/08/2021", "14/0…
$ HomeTeam <chr> "Brentford", "Man United", "Burnley", "Chelsea", "Everton", "…
$ AwayTeam <chr> "Arsenal", "Leeds", "Brighton", "Crystal Palace", "Southampto…
$ FTHG     <dbl> 2, 5, 1, 3, 3, 1, 3, 0, 2, 1, 2, 2, 0, 2, 5, 2, 1, 0, 0, 4, 5…
$ FTAG     <dbl> 0, 1, 2, 0, 1, 0, 2, 3, 4, 0, 0, 0, 0, 2, 0, 0, 1, 1, 2, 1, 0…
$ FTR      <chr> "H", "H", "A", "H", "H", "H", "H", "A", "A", "H", "H", "H", "…
$ HTHG     <dbl> 1, 1, 1, 2, 0, 1, 2, 0, 2, 0, 1, 1, 0, 1, 2, 2, 1, 0, 0, 1, 3…
$ HTAG     <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0…
$ HTR      <chr> "H", "H", "H", "H", "A", "H", "H", "A", "H", "D", "H", "H", "…
$ Referee  <chr> "M Oliver", "P Tierney", "D Coote", "J Moss", "A Madley", "C …
$ HS       <dbl> 8, 16, 14, 13, 14, 9, 13, 14, 17, 13, 27, 10, 7, 17, 16, 13, …
$ AS       <dbl> 22, 10, 14, 4, 6, 17, 11, 19, 8, 18, 9, 9, 14, 17, 1, 10, 15,…
$ HST      <dbl> 3, 8, 3, 6, 6, 5, 7, 3, 3, 3, 9, 2, 2, 4, 4, 3, 3, 6, 3, 7, 1…
$ AST      <dbl> 4, 3, 8, 1, 3, 3, 2, 8, 9, 4, 3, 1, 3, 8, 0, 1, 4, 6, 5, 1, 0…
$ HF       <dbl> 12, 11, 10, 15, 13, 6, 18, 4, 4, 11, 6, 8, 12, 6, 13, 11, 12,…
$ AF       <dbl> 8, 9, 7, 11, 15, 10, 13, 14, 3, 8, 12, 18, 9, 13, 7, 6, 10, 7…
$ HC       <dbl> 2, 5, 7, 5, 6, 5, 2, 3, 7, 3, 8, 3, 3, 8, 6, 7, 7, 5, 9, 10, …
$ AC       <dbl> 5, 4, 6, 2, 8, 4, 4, 11, 6, 11, 4, 4, 5, 5, 1, 2, 7, 4, 8, 0,…
$ HY       <dbl> 0, 1, 2, 0, 2, 1, 3, 1, 1, 2, 0, 3, 3, 2, 1, 4, 2, 1, 3, 0, 1…
$ AY       <dbl> 0, 2, 1, 0, 0, 2, 1, 1, 0, 1, 0, 4, 1, 4, 0, 0, 3, 4, 0, 1, 2…
$ HR       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ AR       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…

The description and type of each variable are as follows:

  1. Date (character): The date when the match was played.
  2. HomeTeam (character): The home team.
  3. AwayTeam (character): The away team.
  4. FTHG (double): Full-time home goals.
  5. FTAG (double): Full-time away goals.
  6. FTR (character): Full-time result.
  7. HTHG (double): Half-time home goals.
  8. HTAG (double): Half-time away goals.
  9. HTR (character): Half-time result.
  10. Referee (character): Referee of the match.
  11. HS (double): Number of shots taken by the home team.
  12. AS (double): Number of shots taken by the away team.
  13. HST (double): Number of shots on target by the home team.
  14. AST (double): Number of shots on target by the away team.
  15. HF (double): Number of fouls by the home team.
  16. AF (double): Number of fouls by the away team.
  17. HC (double): Number of corners taken by the home team.
  18. AC (double): Number of corners taken by the away team.
  19. HY (double): Number of yellow cards received by the home team.
  20. AY (double): Number of yellow cards received by the away team.
  21. HR (double): Number of red cards received by the home team.
  22. AR (double): Number of red cards received by the away team.
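Because Date is stored as a character column in day/month/year form, any time-based analysis would first need to convert it. The following is a minimal base R sketch of that step, using the first three values shown in the glimpse; the names `dates_chr`, `match_date`, and `match_month` are illustrative and not part of the dataset.

```r
# First three Date values as shown in the glimpse (day/month/year strings).
dates_chr <- c("13/08/2021", "14/08/2021", "14/08/2021")

# Parse with an explicit day/month/year format so the strings
# are not misread by the default date formats.
match_date <- as.Date(dates_chr, format = "%d/%m/%Y")

# A year-month label is one possible grouping key for later summaries.
match_month <- format(match_date, "%Y-%m")

print(match_date)
print(match_month)
```

On the full data, the same conversion could be applied to `soccer$Date` after loading.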

Diagnosing the quality of each variable:

# A tibble: 22 × 6
   variables types     missing_count missing_percent unique_count unique_rate
   <chr>     <chr>             <int>           <dbl>        <int>       <dbl>
 1 Date      character             0               0          123     0.324  
 2 HomeTeam  character             0               0           20     0.0526 
 3 AwayTeam  character             0               0           20     0.0526 
 4 FTHG      numeric               0               0            8     0.0211 
 5 FTAG      numeric               0               0            7     0.0184 
 6 FTR       character             0               0            3     0.00789
 7 HTHG      numeric               0               0            5     0.0132 
 8 HTAG      numeric               0               0            5     0.0132 
 9 HTR       character             0               0            3     0.00789
10 Referee   character             0               0           22     0.0579 
11 HS        numeric               0               0           29     0.0763 
12 AS        numeric               0               0           29     0.0763 
13 HST       numeric               0               0           14     0.0368 
14 AST       numeric               0               0           15     0.0395 
15 HF        numeric               0               0           20     0.0526 
16 AF        numeric               0               0           22     0.0579 
17 HC        numeric               0               0           16     0.0421 
18 AC        numeric               0               0           14     0.0368 
19 HY        numeric               0               0            7     0.0184 
20 AY        numeric               0               0            6     0.0158 
21 HR        numeric               0               0            3     0.00789
22 AR        numeric               0               0            2     0.00526
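A table of this kind can be approximated in base R. The sketch below is not necessarily how the table above was produced; `diagnose_columns` is a hypothetical helper, and `toy` is a five-match slice whose values are copied from the glimpse so that the example runs without downloading the data.

```r
# Toy slice: first five matches, values copied from the glimpse.
toy <- data.frame(
  HomeTeam = c("Brentford", "Man United", "Burnley", "Chelsea", "Everton"),
  FTHG     = c(2, 5, 1, 3, 3),
  FTR      = c("H", "H", "A", "H", "H"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row of diagnostics per column of `df`.
diagnose_columns <- function(df) {
  out <- data.frame(
    variables       = names(df),
    types           = vapply(df, function(x) class(x)[1], character(1)),
    missing_count   = vapply(df, function(x) sum(is.na(x)), integer(1)),
    missing_percent = vapply(df, function(x) 100 * mean(is.na(x)), numeric(1)),
    unique_count    = vapply(df, function(x) length(unique(x)), integer(1)),
    unique_rate     = vapply(df, function(x) length(unique(x)) / length(x), numeric(1)),
    stringsAsFactors = FALSE
  )
  rownames(out) <- NULL  # drop the column names that vapply attaches as row names
  out
}

toy_diagnosis <- diagnose_columns(toy)
print(toy_diagnosis)
```

Calling `diagnose_columns(soccer)` on the full dataset should give counts comparable to the table above, under the assumption that `unique_rate` is the unique count divided by the number of rows.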

This dataset is well suited to analytical projects that aim to understand match-level football dynamics, such as home and away performance and disciplinary patterns, throughout the season.

Reason for Choosing this Dataset

We chose this dataset because it provides comprehensive statistics that contribute to the outcome of football matches. This lets us explore whether there is a bias against away teams in the English Premier League. To address this question, we can use existing variables such as shots on target, or create new ones such as the ratio of yellow cards to fouls. Some members of the group are also quite passionate about soccer, which definitely convinced us to use this dataset!

Questions

Question 1

What effect (if any) does being the “home” team versus being the “away” team have on the teams’ performances?

  1. Does the number of goals at half-time determine the final outcome?
  2. Does home-team advantage play a major role in the final match outcome?

Our hypothesis is that the home team will have 1) more goals at both half-time and full-time, and 2) more wins due to “home team advantage”, where it is thought that the hosting team has a significant advantage over the visiting team. The home team and away team advantage will also be analyzed using the full-time results as well as goals. We also predict that the team ahead at half-time will win at full-time.
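As a rough illustration of how these hypotheses could be checked, the sketch below cross-tabulates half-time and full-time results and computes the share of home wins. It uses only the first 12 matches shown in the glimpse, so the numbers are a demonstration of the calculation and not a finding; the object names (`first12`, `ht_vs_ft`, `leader_held`, and so on) are illustrative.

```r
# First 12 matches, values copied from the glimpse.
first12 <- data.frame(
  FTHG = c(2, 5, 1, 3, 3, 1, 3, 0, 2, 1, 2, 2),
  FTAG = c(0, 1, 2, 0, 1, 0, 2, 3, 4, 0, 0, 0),
  HTR  = c("H", "H", "H", "H", "A", "H", "H", "A", "H", "D", "H", "H"),
  FTR  = c("H", "H", "A", "H", "H", "H", "H", "A", "A", "H", "H", "H"),
  stringsAsFactors = FALSE
)

# Cross-tabulate half-time result (rows) against full-time result (columns).
result_levels <- c("H", "D", "A")
ht_vs_ft <- table(
  half_time = factor(first12$HTR, levels = result_levels),
  full_time = factor(first12$FTR, levels = result_levels)
)

# Share of matches won by the home side, and mean goals per side.
home_win_share <- mean(first12$FTR == "H")
mean_goals <- colMeans(first12[, c("FTHG", "FTAG")])

# Among matches with a half-time leader, how often did that side also win?
led <- first12$HTR != "D"
leader_held <- mean(first12$HTR[led] == first12$FTR[led])

print(ht_vs_ft)
print(home_win_share)
print(mean_goals)
print(leader_held)
```

On the full 380 matches, the same table could be passed to a test of association, although the choice of test is left open here.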

Question 2

What effect (if any) does being the “home” team versus being the “away” team have on the teams’ penalties?

  1. Is there a noticeable bias against the away team in terms of the number of fouls, red cards, and yellow cards received?
  2. How do the ratios of cards to fouls differ between home and away teams?

Although referees are supposed to be unbiased, we predict that the number of fouls and cards against the home team will be lower. Additionally, we predict that the home team will typically have a lower card-to-foul ratio than the away team (that is, it will receive fewer cards per foul in a game).
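The card-to-foul ratio could be computed along the following lines. This is a sketch under stated assumptions: `cards_per_foul` is a hypothetical helper, yellow and red cards are simply summed with equal weight, and a side with zero fouls gets `NA` instead of a division by zero. The values are the first 12 matches from the glimpse and are used only to show the arithmetic.

```r
# First 12 matches, values copied from the glimpse.
fouls_cards <- data.frame(
  HF = c(12, 11, 10, 15, 13, 6, 18, 4, 4, 11, 6, 8),
  AF = c(8, 9, 7, 11, 15, 10, 13, 14, 3, 8, 12, 18),
  HY = c(0, 1, 2, 0, 2, 1, 3, 1, 1, 2, 0, 3),
  AY = c(0, 2, 1, 0, 0, 2, 1, 1, 0, 1, 0, 4),
  HR = rep(0, 12),
  AR = rep(0, 12)
)

# Hypothetical helper: cards per foul, NA when a side committed no fouls.
cards_per_foul <- function(cards, fouls) {
  ifelse(fouls > 0, cards / fouls, NA_real_)
}

# Per-match ratios for each side (yellow and red cards weighted equally).
fouls_cards$home_ratio <- cards_per_foul(fouls_cards$HY + fouls_cards$HR, fouls_cards$HF)
fouls_cards$away_ratio <- cards_per_foul(fouls_cards$AY + fouls_cards$AR, fouls_cards$AF)

# Pooled ratios: total cards divided by total fouls across these matches.
pooled_ratio <- c(
  home = sum(fouls_cards$HY + fouls_cards$HR) / sum(fouls_cards$HF),
  away = sum(fouls_cards$AY + fouls_cards$AR) / sum(fouls_cards$AF)
)

print(fouls_cards[, c("home_ratio", "away_ratio")])
print(pooled_ratio)
```

Per-match ratios and pooled ratios answer slightly different questions, so it may be worth reporting both once the full season is analyzed.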

Analysis plan

Timeline to Completion

  • Week of 05 Feb
    • Have proposal prepared for peer review
    • Provide peer review for other groups
  • Week of 12 Feb
    • Make changes to proposal based on peer review
    • Make changes to proposal based on instructor review
  • Week of 19 Feb
    • Determine work division
    • Do individual exploration of the data and start first drafts of visualizations to compare with other group members
    • Start slidedeck formatting
  • Week of 26 Feb
    • Finalize visualizations, making them presentation-ready
    • Complete writeups
    • Website cleanup
    • Presentation slidedeck finalization
    • Presentation practice

Question 1

Members: Sai Navya Reddy, Akash Srinivasan, Sanja Dmitrovic

  • Variables involved:
    • The name of the teams: HomeTeam (the home team) and AwayTeam (the away team)
    • Game winner: FTR (Full time result)
    • Ahead at half time: HTR (Half Time Results)
    • Full time goals: FTHG (Full time home goals) and FTAG (full time away goals)
    • Half-time goals for the home team: HTHG (Half-time home goals)
    • Half-time goals for the away team: HTAG (Half-time away goals)
  • This question is not likely to require additional variable creation.
  • This question is not likely to require external data to be merged in.
  • Plot suggestions:
    • Violin plots to show the distribution of foul cards for both the home and away teams (this is important for revealing any bias that may exist for or against either team).
    • Scatter plot to show the relationship between half-time goals and full-time goals (a positive relationship would suggest that a team leading at half-time is more likely to win).
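One possible shape for the scatter plot is sketched below with ggplot2, assuming that package is installed. Goal counts are small integers, so many points overlap; `geom_count()` is used here as one way to show that, which is a design choice of this sketch and not something the proposal specifies. The data are the first 12 matches from the glimpse, and `goals_long` and `goals_plot` are illustrative names.

```r
library(ggplot2)

# First 12 matches, values copied from the glimpse.
goals <- data.frame(
  HTHG = c(1, 1, 1, 2, 0, 1, 2, 0, 2, 0, 1, 1),
  FTHG = c(2, 5, 1, 3, 3, 1, 3, 0, 2, 1, 2, 2),
  HTAG = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0),
  FTAG = c(0, 1, 2, 0, 1, 0, 2, 3, 4, 0, 0, 0)
)

# Stack home and away columns into one long table with a `side` label.
goals_long <- data.frame(
  side            = rep(c("Home", "Away"), each = nrow(goals)),
  half_time_goals = c(goals$HTHG, goals$HTAG),
  full_time_goals = c(goals$FTHG, goals$FTAG)
)

# Point size encodes how many matches share the same pair of goal counts.
goals_plot <- ggplot(goals_long, aes(x = half_time_goals, y = full_time_goals)) +
  geom_count(alpha = 0.6) +
  facet_wrap(~ side) +
  labs(
    x = "Half-time goals",
    y = "Full-time goals",
    size = "Matches"
  )

print(goals_plot)
```

Since full-time goals can never be lower than half-time goals, some positive relationship is expected by construction; comparing the half-time result (HTR) with the full-time result (FTR) may be a more direct check of the hypothesis.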

Question 2

Members: Valerie Okine, Jiayue He, Gillian McGinnis

  • Variables involved:
    • Fouls: HF (Home Fouls) and AF (Away Fouls).
    • Cards: HY (Home Yellow), AY (Away Yellow), HR (Home Red), and AR (Away Red).
  • Variables to be created:
    • Ratio of yellow cards to fouls and ratio of red cards to fouls, by team and game.
    • Ratio of total number of cards and number of fouls, by team and game.
  • This question is not likely to require external data to be merged in.
  • Plot suggestions:
    • Bar charts to show the frequency of the foul cards for both the home and away teams.
    • Frequency polygon to show the distribution of cards (either yellow or red) for both away and home teams.
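The frequency polygon suggested above might look like the following ggplot2 sketch, assuming that package is installed. It covers yellow cards only and uses the first 12 matches from the glimpse, so it illustrates the plot structure and nothing more; `cards_long` and `cards_plot` are illustrative names.

```r
library(ggplot2)

# Yellow cards in the first 12 matches, values copied from the glimpse.
cards <- data.frame(
  HY = c(0, 1, 2, 0, 2, 1, 3, 1, 1, 2, 0, 3),
  AY = c(0, 2, 1, 0, 0, 2, 1, 1, 0, 1, 0, 4)
)

# Stack home and away columns into one long table with a `side` label.
cards_long <- data.frame(
  side         = rep(c("Home", "Away"), each = nrow(cards)),
  yellow_cards = c(cards$HY, cards$AY)
)

# One line per side; a bin width of 1 matches the integer card counts.
cards_plot <- ggplot(cards_long, aes(x = yellow_cards, colour = side)) +
  geom_freqpoly(binwidth = 1) +
  labs(
    x = "Yellow cards per match",
    y = "Number of matches",
    colour = "Side"
  )

print(cards_plot)
```

The same long format would also feed a bar chart, for example by swapping `geom_freqpoly()` for `geom_bar(position = "dodge")` with `fill = side`.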