soccer <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-04-04/soccer21-22.csv')
Group 4 WackyWednesday - Game Metrics: Decoding Premier League Outcomes Through Visual Analysis and Data Exploration
Exploring home versus away team dynamics in the world of English football
Dataset
Description of the Dataset
The dataset comes from TidyTuesday’s 04 April 2023 post “Premier League Match Data 2021-2022”. It is derived from Evan Gower’s Kaggle post of the same name.
The Premier League has been the foremost league for professional football clubs in England since its creation in 1992. This dataset covers the entire 2021–22 Premier League season, which was the league’s 30th season.
Variables in this dataset include the teams, the referee, and statistics for the home and away sides such as goals, shots, fouls, corners, and cards. The dataset is in tabular format, with 380 rows (one per match) and 22 columns (one per variable), described below. Both categorical and numerical variables are included.
A glimpse at each variable:
Rows: 380
Columns: 22
$ Date <chr> "13/08/2021", "14/08/2021", "14/08/2021", "14/08/2021", "14/0…
$ HomeTeam <chr> "Brentford", "Man United", "Burnley", "Chelsea", "Everton", "…
$ AwayTeam <chr> "Arsenal", "Leeds", "Brighton", "Crystal Palace", "Southampto…
$ FTHG <dbl> 2, 5, 1, 3, 3, 1, 3, 0, 2, 1, 2, 2, 0, 2, 5, 2, 1, 0, 0, 4, 5…
$ FTAG <dbl> 0, 1, 2, 0, 1, 0, 2, 3, 4, 0, 0, 0, 0, 2, 0, 0, 1, 1, 2, 1, 0…
$ FTR <chr> "H", "H", "A", "H", "H", "H", "H", "A", "A", "H", "H", "H", "…
$ HTHG <dbl> 1, 1, 1, 2, 0, 1, 2, 0, 2, 0, 1, 1, 0, 1, 2, 2, 1, 0, 0, 1, 3…
$ HTAG <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0…
$ HTR <chr> "H", "H", "H", "H", "A", "H", "H", "A", "H", "D", "H", "H", "…
$ Referee <chr> "M Oliver", "P Tierney", "D Coote", "J Moss", "A Madley", "C …
$ HS <dbl> 8, 16, 14, 13, 14, 9, 13, 14, 17, 13, 27, 10, 7, 17, 16, 13, …
$ AS <dbl> 22, 10, 14, 4, 6, 17, 11, 19, 8, 18, 9, 9, 14, 17, 1, 10, 15,…
$ HST <dbl> 3, 8, 3, 6, 6, 5, 7, 3, 3, 3, 9, 2, 2, 4, 4, 3, 3, 6, 3, 7, 1…
$ AST <dbl> 4, 3, 8, 1, 3, 3, 2, 8, 9, 4, 3, 1, 3, 8, 0, 1, 4, 6, 5, 1, 0…
$ HF <dbl> 12, 11, 10, 15, 13, 6, 18, 4, 4, 11, 6, 8, 12, 6, 13, 11, 12,…
$ AF <dbl> 8, 9, 7, 11, 15, 10, 13, 14, 3, 8, 12, 18, 9, 13, 7, 6, 10, 7…
$ HC <dbl> 2, 5, 7, 5, 6, 5, 2, 3, 7, 3, 8, 3, 3, 8, 6, 7, 7, 5, 9, 10, …
$ AC <dbl> 5, 4, 6, 2, 8, 4, 4, 11, 6, 11, 4, 4, 5, 5, 1, 2, 7, 4, 8, 0,…
$ HY <dbl> 0, 1, 2, 0, 2, 1, 3, 1, 1, 2, 0, 3, 3, 2, 1, 4, 2, 1, 3, 0, 1…
$ AY <dbl> 0, 2, 1, 0, 0, 2, 1, 1, 0, 1, 0, 4, 1, 4, 0, 0, 3, 4, 0, 1, 2…
$ HR <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ AR <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
The description and type of each variable are as follows:
- Date (character): The date when the match was played.
- HomeTeam (character): The home team.
- AwayTeam (character): The away team.
- FTHG (double): Full-time home goals.
- FTAG (double): Full-time away goals.
- FTR (character): Full-time result.
- HTHG (double): Half-time home goals.
- HTAG (double): Half-time away goals.
- HTR (character): Half-time result.
- Referee (character): Referee of the match.
- HS (double): Number of shots taken by the home team.
- AS (double): Number of shots taken by the away team.
- HST (double): Number of shots on target by the home team.
- AST (double): Number of shots on target by the away team.
- HF (double): Number of fouls by the home team.
- AF (double): Number of fouls by the away team.
- HC (double): Number of corners taken by the home team.
- AC (double): Number of corners taken by the away team.
- HY (double): Number of yellow cards received by the home team.
- AY (double): Number of yellow cards received by the away team.
- HR (double): Number of red cards received by the home team.
- AR (double): Number of red cards received by the away team.
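Because Date is stored as character in day/month/year form, it would need parsing before any time-based summary. A minimal sketch in base R, using only the four dates fully visible in the glimpse output; the object names `date_chr` and `match_date` are illustrative and not part of the project code:

```r
# First four dates exactly as shown in the glimpse output
date_chr <- c("13/08/2021", "14/08/2021", "14/08/2021", "14/08/2021")

# Parse day/month/year strings into Date objects
match_date <- as.Date(date_chr, format = "%d/%m/%Y")

# Once parsed, dates print in ISO year-month-day form
format(match_date)

# Components such as the month number can then be extracted
match_month <- as.integer(format(match_date, "%m"))
```

With the full data, the same call could be applied to `soccer$Date`.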
Diagnosing the quality of each variable:
# A tibble: 22 × 6
variables types missing_count missing_percent unique_count unique_rate
<chr> <chr> <int> <dbl> <int> <dbl>
1 Date character 0 0 123 0.324
2 HomeTeam character 0 0 20 0.0526
3 AwayTeam character 0 0 20 0.0526
4 FTHG numeric 0 0 8 0.0211
5 FTAG numeric 0 0 7 0.0184
6 FTR character 0 0 3 0.00789
7 HTHG numeric 0 0 5 0.0132
8 HTAG numeric 0 0 5 0.0132
9 HTR character 0 0 3 0.00789
10 Referee character 0 0 22 0.0579
11 HS numeric 0 0 29 0.0763
12 AS numeric 0 0 29 0.0763
13 HST numeric 0 0 14 0.0368
14 AST numeric 0 0 15 0.0395
15 HF numeric 0 0 20 0.0526
16 AF numeric 0 0 22 0.0579
17 HC numeric 0 0 16 0.0421
18 AC numeric 0 0 14 0.0368
19 HY numeric 0 0 7 0.0184
20 AY numeric 0 0 6 0.0158
21 HR numeric 0 0 3 0.00789
22 AR numeric 0 0 2 0.00526
The diagnosis shows no missing values in any of the 22 variables. This makes the dataset well suited to match-level analyses of team performance and discipline across the season.
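The diagnostic columns in the table can be reproduced by hand, which makes clear what each one measures. The sketch below uses base R on three columns from the first ten matches in the glimpse output. It is an assumption that the original table came from a helper such as `dlookr::diagnose()`, and that `missing_percent` is on a 0 to 100 scale; `soccer_head` and `diagnosis` are illustrative names.

```r
# Three columns from the first ten matches shown in the glimpse output
soccer_head <- data.frame(
  FTHG = c(2, 5, 1, 3, 3, 1, 3, 0, 2, 1),
  FTAG = c(0, 1, 2, 0, 1, 0, 2, 3, 4, 0),
  FTR  = c("H", "H", "A", "H", "H", "H", "H", "A", "A", "H")
)

# One row per variable: type, number of missing values, number of distinct values
diagnosis <- data.frame(
  variables     = names(soccer_head),
  types         = vapply(soccer_head, function(x) class(x)[1], character(1)),
  missing_count = vapply(soccer_head, function(x) sum(is.na(x)), integer(1)),
  unique_count  = vapply(soccer_head, function(x) length(unique(x)), integer(1)),
  row.names     = NULL
)

# Rates are relative to the number of rows (assumed percent scale for missing values)
diagnosis$missing_percent <- 100 * diagnosis$missing_count / nrow(soccer_head)
diagnosis$unique_rate     <- diagnosis$unique_count / nrow(soccer_head)

diagnosis
```

In the full table, `unique_rate` is `unique_count` divided by the 380 rows, for example 123 / 380 ≈ 0.324 for Date.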
Reason for Choosing this Dataset
We chose this dataset because it provides comprehensive statistics that contribute to the outcome of football matches. This lets us explore the potential existence of bias against away teams in the English Premier League. To answer this question of bias, we can use existing variables such as shots on target, or create new ones such as the ratio of yellow cards to fouls. Some members of the group are also quite passionate about soccer, which definitely convinced us to use this dataset!
Questions
Question 1
What effect (if any) does being the “home” team versus being the “away” team have on the teams’ performances?
- Does the number of goals at half-time determine the final outcome?
- Does home-team advantage play a major role in the final match outcome?
Our hypothesis is that the home team will have 1) more goals at both half-time and full-time, and 2) more wins, owing to “home team advantage”, the idea that the hosting team has a significant advantage over the visiting team. We will assess this advantage using both the full-time results and the goals scored. We also predict that the team ahead at half-time will go on to win at full-time.
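One way these hypotheses could be checked is with simple tabulations of FTR and HTR. The sketch below is a base R illustration on the first ten results from the glimpse output, not the analysis itself; the helper names (`lv`, `ft_share`, `ht_vs_ft`, `held_lead`) are illustrative, and it assumes the result codes are H (home), D (draw) and A (away).

```r
# Results for the first ten matches, taken from the glimpse output
FTR <- c("H", "H", "A", "H", "H", "H", "H", "A", "A", "H")  # full-time result
HTR <- c("H", "H", "H", "H", "A", "H", "H", "A", "H", "D")  # half-time result

# Fix the level order so home, draw and away always appear, even with zero counts
lv  <- c("H", "D", "A")
FTR <- factor(FTR, levels = lv)
HTR <- factor(HTR, levels = lv)

# Share of matches ending in each full-time result
ft_share <- prop.table(table(FTR))

# Half-time state (rows) against full-time result (columns)
ht_vs_ft <- table(HTR, FTR)

# Among matches with a half-time leader, how often did that side also win?
led <- HTR != "D"
held_lead <- mean(as.character(HTR[led]) == as.character(FTR[led]))
```

With the full data, `soccer$FTR` and `soccer$HTR` would replace the two hand-typed vectors. Ten matches are far too few to support any conclusion.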
Question 2
What effect (if any) does being the “home” team versus being the “away” team have on the teams’ penalties?
- Is there a noticeable bias against the away team in terms of the number of fouls, red cards, and yellow cards received?
- How do the ratios of cards to fouls differ between home and away teams?
Although referees are supposed to be unbiased, we predict that the home team will be charged with fewer fouls and receive fewer cards than the away team. Additionally, we predict that the home team will typically have a lower card-to-foul ratio than the away team (that is, it will receive fewer cards per foul in a game).
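The two predictions translate into two simple quantities: the difference in fouls between the sides of the same match, and cards per foul for each side. A hedged base R sketch on the first ten matches from the glimpse output; pooling totals across matches is just one possible definition of the ratio, and `cards_per_foul` and `foul_diff` are illustrative names.

```r
# Fouls and cards for the first ten matches, from the glimpse output
HF <- c(12, 11, 10, 15, 13, 6, 18, 4, 4, 11)   # home fouls
AF <- c(8, 9, 7, 11, 15, 10, 13, 14, 3, 8)     # away fouls
HY <- c(0, 1, 2, 0, 2, 1, 3, 1, 1, 2)          # home yellow cards
AY <- c(0, 2, 1, 0, 0, 2, 1, 1, 0, 1)          # away yellow cards
HR <- rep(0, 10)                               # home red cards
AR <- rep(0, 10)                               # away red cards

# Pooled cards per foul for each side: total cards divided by total fouls
cards_per_foul <- c(
  home = sum(HY + HR) / sum(HF),
  away = sum(AY + AR) / sum(AF)
)

# Paired difference in fouls within each match (home minus away)
foul_diff <- HF - AF
mean_foul_diff <- mean(foul_diff)
```

Comparing sides within the same match keeps the referee and the fixture constant, which is why the paired difference is shown alongside the pooled ratio.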
Analysis plan
Timeline to Completion
- Week of 05 Feb
  - Have proposal prepared for peer review
  - Provide peer review for other groups
- Week of 12 Feb
  - Make changes to proposal based on peer review
  - Make changes to proposal based on instructor review
- Week of 19 Feb
  - Determine work division
  - Do individual exploration of the data and start first drafts of visualizations to compare with other group members
  - Start slidedeck formatting
- Week of 26 Feb
  - Finalize visualizations, making them presentation-ready
  - Complete writeups
  - Website cleanup
  - Presentation slidedeck finalization
  - Presentation practice
Question 1
Members: Sai Navya Reddy, Akash Srinivasan, Sanja Dmitrovic
- Variables involved:
  - The names of the teams: HomeTeam (the home team) and AwayTeam (the away team)
  - Game winner: FTR (full-time result)
  - Ahead at half-time: HTR (half-time result)
  - Full-time goals: FTHG (full-time home goals) and FTAG (full-time away goals)
  - Half-time goals for the home team: HTHG (half-time home goals)
  - Half-time goals for the away team: HTAG (half-time away goals)
- This question is not likely to require additional variable creation.
- This question is not likely to require external data to be merged in.
- Plot suggestions:
  - Violin plots to show the distribution of foul cards for both the home and away teams (this is important for revealing any bias that may exist toward either team).
  - Scatter plot to show the relationship between half-time goals and full-time goals (if the relationship is positive, a team that leads at half-time has a higher probability of winning).
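Both plot types compare home and away on a common axis, so the wide home/away columns would first need stacking into one row per team per match. The sketch below shows that reshaping in base R, plus the correlation that the scatter plot would visualise, on the first ten matches from the glimpse output; `wide`, `long`, `side`, `ht_goals` and `ft_goals` are illustrative names rather than part of the dataset.

```r
# Goals for the first ten matches, from the glimpse output
wide <- data.frame(
  HTHG = c(1, 1, 1, 2, 0, 1, 2, 0, 2, 0),
  HTAG = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 0),
  FTHG = c(2, 5, 1, 3, 3, 1, 3, 0, 2, 1),
  FTAG = c(0, 1, 2, 0, 1, 0, 2, 3, 4, 0)
)

# Stack home and away so each row is one team in one match
long <- rbind(
  data.frame(side = "Home", ht_goals = wide$HTHG, ft_goals = wide$FTHG),
  data.frame(side = "Away", ht_goals = wide$HTAG, ft_goals = wide$FTAG)
)

# Linear association between half-time and full-time goals
ht_ft_cor <- cor(long$ht_goals, long$ft_goals)

# Mean full-time goals by side
mean_goals <- aggregate(ft_goals ~ side, data = long, FUN = mean)
```

A data frame shaped like `long` can be passed to a plotting layer such as `ggplot2::geom_point()` or `ggplot2::geom_violin()` with `side` mapped to colour or to the x axis.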
Question 2
Members: Valerie Okine, Jiayue He, Gillian McGinnis
- Variables involved:
  - Fouls: HF (Home Fouls) and AF (Away Fouls).
  - Cards: HY (Home Yellow), AY (Away Yellow), HR (Home Red), and AR (Away Red).
- Variables to be created:
  - Ratios of yellow cards to fouls and of red cards to fouls, by team and game.
  - Ratio of the total number of cards to the number of fouls, by team and game.
- This question is not likely to require external data to be merged in.
- Plot suggestions:
  - Bar charts to show the frequency of the foul cards for both the home and away teams.
  - Frequency polygon to show the distribution of cards (either yellow or red) for both away and home teams.
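The per-game ratio variables listed above need a guard for matches in which a side commits no fouls, since the ratio is then undefined. A minimal sketch in base R, shown for the home side of the first ten matches in the glimpse output; `card_foul_ratio()` is a hypothetical helper, and returning NA for zero fouls is one possible convention, not a decision the proposal has made.

```r
# Hypothetical helper: cards per foul for one side in each match,
# returning NA where no fouls were recorded to avoid dividing by zero
card_foul_ratio <- function(yellow, red, fouls) {
  ifelse(fouls > 0, (yellow + red) / fouls, NA_real_)
}

# Home-side values for the first ten matches, from the glimpse output
HF <- c(12, 11, 10, 15, 13, 6, 18, 4, 4, 11)  # home fouls
HY <- c(0, 1, 2, 0, 2, 1, 3, 1, 1, 2)         # home yellow cards
HR <- rep(0, 10)                              # home red cards

# One ratio per match for the home side
home_ratio <- card_foul_ratio(HY, HR, HF)

# How many matches had 0, 1, 2, ... home yellow cards (the input for a bar chart)
yellow_freq <- table(HY)
```

The same helper applied to AY, AR and AF gives the away-side ratios, and a frequency table like `yellow_freq` is the summary that a bar chart or frequency polygon would draw.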