Game Metrics: Decoding Premier League Outcomes Through Visual Analysis and Data Exploration
INFO 526 - Project 1
Author
Group 4 WackyWednesday
Abstract
This project examines 2021-2022 Premier League match data to analyze team performance, home-team advantage, and potential officiating biases. Methods included creating visualizations and summary tables to compare home and away team performance and penalties across the entire season. Results indicate a strong correlation between half-time and full-time goals, with slightly higher predictability for home teams. Home-team advantage is evident in goal scoring but does not guarantee victory. Additionally, there’s a difference in the most frequent number of fouls given to the home team compared to the away team. The card-to-foul ratio also appears to be higher for the away team. Overall, this study offers insights into Premier League dynamics and officiating tendencies.
The Premier League has been the top league for professional football clubs in England since its creation in 1992. This dataset covers the entire 2021–22 Premier League season, which marked the 30th anniversary of the league.
The variables in this dataset include the teams, the referee, and statistics for the home and away sides such as fouls, shots, cards, and point totals. The dataset is in tabular format, with 380 rows representing individual matches and 22 columns, one per variable. Both categorical and numerical variables are included.
The breakdown of each variable in the data set can be found in the codebook as well as the project proposal.
In our analysis we will be focusing on the following variables:
HomeTeam (character): The home team.
AwayTeam (character): The away team.
FTHG (double): Full-time home goals.
FTAG (double): Full-time away goals.
FTR (character): Full-time result.
HTHG (double): Half-time home goals.
HTAG (double): Half-time away goals.
HTR (character): Half-time result.
HF (double): Number of fouls by the home team.
AF (double): Number of fouls by the away team.
HY (double): Number of yellow cards received by the home team.
AY (double): Number of yellow cards received by the away team.
HR (double): Number of red cards received by the home team.
AR (double): Number of red cards received by the away team.
This dataset is ideal for analytical projects aiming to understand football dynamics and team strategies throughout the season.
Reason for Choosing this Dataset
We chose this data set because it provides comprehensive statistics contributing to the outcome of football matches. Therefore, we can explore the interesting topic of the potential existence of bias against away teams in the English Premier League. The availability of data allows us to use variables such as shots on goal or create new variables such as calculating the ratio of yellow cards to fouls to answer this question of bias. Some members of the group are also quite passionate about soccer, which definitely convinced us to use this data set!
Question 1
Introduction
First, we explore whether a home-team advantage is present when assessing teams’ performances. Here, we use goals as the main indicator of how well a team performs. One way we answer this question is by characterizing the relationship between goals scored at half-time (variables: HTHG, HTAG) and goals scored at the end of the match (variables: FTHG, FTAG) for both home and away teams through scatterplot and linear regression analysis. We wish to quantify how well goals at half-time predict goals at full-time in general, and to see whether the strength of this relationship differs between home and away teams. Then, we explore home-team advantage using the Full Time Result variable (FTR) and a bar plot of total goals per team (e.g., Arsenal, Liverpool, etc.), split by whether the team was playing at home or away. Our hypothesis is that the home team will score more goals and win more matches.
Question 1A
Does the number of goals at half-time determine the final outcome? Does this relationship change between home and away teams?
Question 1B
Does home-team advantage play a major role in the final match outcome?
Approach
Part A: The variables used to answer this question are half-time home goals (HTHG), full-time home goals (FTHG), half-time away goals (HTAG), and full-time away goals (FTAG). A scatterplot is created to show the relationship between goals at half-time and goals at full-time (end of match) and is color-coded by home versus away team. Linear regressions are fitted separately for home-team and away-team data, and correlation coefficients are reported. Scatterplot and linear regression analysis are chosen because they directly show how strongly two variables are related to each other.
Part B: Using the Full Time Result (FTR) variable, each match is categorized in a new HomeAdvantage variable. If the result is “H”, the match is counted as a home-team win and therefore as home-team advantage. Otherwise, it is counted as an away-team win and as away-team advantage; note that this rule also places draws in the away-team category. These counts are summarized in a table and a bar plot of total games. Separately, the total goals scored by each team at home and away are computed, merged into a single table, and reshaped into a long format with one total per team and team type. Team is then mapped to the x-axis and team type to the fill color in a side-by-side bar plot covering all 20 teams.
Analysis
Question 1A: Correlation between half-time and full-time goals.
# Assumes the setup chunk (not shown) loads the tidyverse and colorspace
# and defines soccer, pal_pl, and pal_pl_logo.

# Create tibbles based on whether the home or away team scored the goals.
# This makes it easier to color-code the scatterplot and calculate separate
# linear regressions later.
home_goals <- tibble(half = soccer$HTHG, full = soccer$FTHG)
away_goals <- tibble(half = soccer$HTAG, full = soccer$FTAG)

# Calculate correlation coefficients for home and away data sets.
home_r <- cor(home_goals$half, home_goals$full)
away_r <- cor(away_goals$half, away_goals$full)

# Create scatterplot with two regression lines for home and away teams,
# respectively. Add number of points and correlation coefficients to plot.
q1_goals <- ggplot() +
  geom_jitter(data = home_goals, aes(x = half, y = full, color = 'Home'), width = 0.7) +
  geom_smooth(data = home_goals, aes(x = half, y = full, color = 'Home'), method = lm, se = FALSE) +
  geom_jitter(data = away_goals, aes(x = half, y = full, color = 'Away'), width = 0.7) +
  geom_smooth(data = away_goals, aes(x = half, y = full, color = 'Away'), method = lm, se = FALSE) +
  xlim(0, 4) +
  geom_label(aes(x = 0.25, y = 7, label = 'N = 380')) +
  geom_label(aes(x = 3.7, y = 5.5, label = 'r = 0.72', color = 'Home'), show.legend = FALSE) +
  geom_label(aes(x = 3.7, y = 4.2, label = 'r = 0.70', color = 'Away'), show.legend = FALSE) +
  labs(
    x = 'Goals at half-time',
    y = 'Goals at full-time',
    color = 'Team'
  ) +
  scale_color_manual(
    breaks = c("Home", "Away"),
    values = darken(pal_pl, 0.3)
  ) +
  theme_classic(base_size = 12) +
  theme(text = element_text(family = "Poppins", face = "bold", color = pal_pl_logo))

# Suppress messages when printing the ggplot object.
suppressMessages(print(q1_goals))
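The approach above mentions separate linear regressions, while the plotting code draws the fitted lines through geom_smooth. As a minimal sketch of how the correlation coefficient and regression slope could be obtained explicitly, the following uses base R on made-up goal counts. The `toy` data frame and its values are assumptions for illustration only, not the season data.

```r
# Toy half-time/full-time goal counts (illustrative values, not from the dataset).
toy <- data.frame(
  half = c(0, 1, 0, 2, 1, 3, 0, 1),
  full = c(1, 2, 0, 3, 1, 4, 2, 2)
)

# Pearson correlation between half-time and full-time goals.
r <- cor(toy$half, toy$full)

# Simple linear regression of full-time goals on half-time goals.
fit <- lm(full ~ half, data = toy)
slope <- unname(coef(fit)["half"])

# For a one-predictor regression, R-squared equals the squared correlation.
r_squared <- summary(fit)$r.squared

cat(sprintf("r = %.2f, slope = %.2f, R^2 = %.2f\n", r, slope, r_squared))
```

Running the same two calls once on the home columns and once on the away columns would give the two coefficients compared in the plot.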
Question 1B: Total games won by home and away teams.
# Create a variable for home team advantage.
# Note: any result other than "H" (including draws) is labeled "Away Team Win".
soccer <- soccer %>%
  mutate(HomeAdvantage = ifelse(FTR == "H", "Home Team Win", "Away Team Win"))

# Analyze home and away team advantage
advantage_summary <- soccer %>%
  group_by(HomeAdvantage) %>%
  summarise(TotalGames = n())

kable(advantage_summary, "html") %>%
  kable_styling(full_width = FALSE)
| HomeAdvantage | TotalGames |
| --- | --- |
| Away Team Win | 217 |
| Home Team Win | 163 |
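Because the rule above assigns every result other than "H" to the away side, draws end up in the "Away Team Win" row. A hedged sketch of a three-way classification is shown here in base R on made-up results. It assumes FTR uses the codes "H", "D", and "A" (only "H" appears in the original code), and the `ftr` vector is illustrative only.

```r
# Toy full-time results; codes assumed to be "H" = home win, "D" = draw, "A" = away win.
ftr <- c("H", "D", "A", "H", "A", "D", "H")

# Map each code to its own label so draws are not folded into away wins.
outcome_labels <- c(H = "Home Team Win", D = "Draw", A = "Away Team Win")
outcome <- factor(unname(outcome_labels[ftr]), levels = unname(outcome_labels))

# Count matches per outcome.
outcome_counts <- table(outcome)
print(outcome_counts)
```

Keeping draws in their own category would let the home and away win counts be compared directly.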
# Bar plot for home and away team advantage
ggplot(advantage_summary, aes(x = HomeAdvantage, y = TotalGames, fill = HomeAdvantage)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Distribution of Home and Away Team Wins",
    x = "Win_Count",
    y = "Total Games"
  ) +
  # scale_fill_manual(values = c("Home Team Win" = "lightgreen", "Away Team Win" = "red")) +
  scale_fill_manual(
    breaks = c("Home Team Win", "Away Team Win"),
    labels = c("Home", "Away"),
    values = darken(pal_pl, 0.3)
  ) +
  theme_classic() +
  theme(text = element_text(family = "Poppins", face = "bold", color = pal_pl_logo, size = 15))
# total_home_goals and total_away_goals are not defined in the code shown here;
# they are expected to hold per-team goal totals (columns Total_Home_goals and
# Total_Away_goals, judging by the fill breaks below).

# Merge the data for home and away goals
total_goals <- merge(total_home_goals, total_away_goals,
                     by.x = "HomeTeam", by.y = "AwayTeam", all = TRUE)
total_goals_long <- gather(total_goals, key = "TeamType", value = "TotalGoals", -HomeTeam)

# Create a side-by-side bar plot
ggplot(total_goals_long, aes(x = HomeTeam, y = TotalGoals, fill = TeamType)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(
    title = "Total Home and Away Goals by Team",
    x = "Teams on Home and Away",
    y = "Total Goals"
  ) +
  # scale_fill_manual(values = c("Total_Home_goals" = "lightgreen", "Total_Away_goals" = "red")) +
  scale_fill_manual(
    breaks = c("Total_Home_goals", "Total_Away_goals"),
    labels = c("Home", "Away"),
    values = darken(pal_pl, 0.3)
  ) +
  theme_classic() +
  theme(text = element_text(family = "Poppins", face = "bold", color = pal_pl_logo, size = 12)) +
  # theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
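The merge above relies on per-team totals that are not defined in the code shown. One way such totals could be built is sketched below in base R on a made-up match table. The `matches` data frame, its values, and the `Team` key column are assumptions for illustration, and the authors' actual construction may differ.

```r
# Toy match table (illustrative values only, not from the dataset).
matches <- data.frame(
  HomeTeam = c("Arsenal", "Chelsea", "Arsenal", "Leeds"),
  AwayTeam = c("Chelsea", "Arsenal", "Leeds", "Chelsea"),
  FTHG = c(2, 1, 3, 0),
  FTAG = c(0, 1, 1, 4)
)

# Goals each team scored when playing at home.
total_home_goals <- aggregate(FTHG ~ HomeTeam, data = matches, FUN = sum)
names(total_home_goals) <- c("Team", "Total_Home_goals")

# Goals each team scored when playing away.
total_away_goals <- aggregate(FTAG ~ AwayTeam, data = matches, FUN = sum)
names(total_away_goals) <- c("Team", "Total_Away_goals")

# One row per team; all = TRUE keeps teams that appear on only one side.
total_goals <- merge(total_home_goals, total_away_goals, by = "Team", all = TRUE)
print(total_goals)
```

Renaming both key columns to a shared name before merging avoids the by.x/by.y pair and keeps the team column label neutral.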
Discussion
Question 1A
There is a strong correlation between goals scored at half-time and goals scored at full-time. In other words, the number of goals at half-time is a strong indicator of how many goals will have been scored by the end of the match. However, there is not a perfect one-to-one relationship between these variables. For example, there are some cases where 0 goals are scored at half-time but 4 or 5 goals are scored by the end of the match. The plot also shows a slightly higher correlation between half-time and full-time goals for the home team (r = 0.72) than for the away team (r = 0.70). However, this does not by itself establish a home-team advantage, because more goals do not necessarily equate to more wins.
Question 1B
Playing at home did not translate into a win in every match: home teams won 163 of the 380 matches, while the remaining 217 fall into the Away Team Win category, which as coded covers every result other than “H”. In terms of goals, the majority of teams scored more at home than away (with a few exceptions, such as Chelsea).
Question 2
Introduction
In Question 2, we ask what effect (if any) being the “home” team versus the “away” team has on a team’s penalties.
A correlation matrix is first used to look for potential relationships among the variables of interest before any further analysis is conducted. The variables of interest here are red cards, yellow cards, the card-to-foul ratio, and full-time goals. In the correlation plot, a stronger color hue indicates a relatively higher correlation. The visualization shows that the most positively correlated pair is yellow cards and the card-to-foul ratio, followed by yellow cards and the number of fouls. All other pairs show very little correlation.
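As a rough illustration of the correlation-matrix step, the sketch below computes pairwise Pearson correlations in base R on made-up per-match values. The `toy` data frame and its column names are assumptions for illustration, not the dataset's variables.

```r
# Toy per-match values for one side (illustrative only, not from the dataset).
toy <- data.frame(
  fouls  = c(8, 12, 10, 15, 9, 11),
  yellow = c(1, 3, 2, 4, 1, 2),
  red    = c(0, 0, 1, 0, 0, 0),
  goals  = c(2, 0, 1, 1, 3, 1)
)

# Card-to-foul ratio: cards of either color per foul committed.
toy$card_foul_ratio <- (toy$yellow + toy$red) / toy$fouls

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))
```

Since the ratio is computed from the card counts, some correlation between it and yellow cards is expected by construction.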
Question 2A
Is there a noticeable bias against the away team in terms of the number of fouls, red cards, and yellow cards received?
Question 2B
How do the ratios of cards to fouls differ between home and away teams?
Approach
Question 2A
To determine whether there is home-team bias regarding penalties (fouls, yellow cards, and red cards), we plotted the number of games in the season in which each team type received a given number of penalties. Although this could be shown with a density plot or area plot, plotting the counts directly makes it easier to see precisely which penalty counts occurred more often for each team type. Since most games had no red cards at all, and the maximum number of red cards given was only two, red cards and yellow cards are plotted on the same facet. The range of foul counts is much wider, as every single game had at least one foul. The two team types are mapped to both color and shape for accessibility.
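The counting step behind such a chart can be sketched as follows, in base R on made-up yellow-card counts. The `hy` and `ay` vectors are illustrative stand-ins for the home and away columns, and their values are assumptions.

```r
# Toy yellow-card counts per match (illustrative only, not from the dataset).
hy <- c(0, 1, 1, 2, 3, 1, 0, 2)  # home side
ay <- c(1, 2, 2, 3, 1, 2, 4, 2)  # away side

# Use a shared set of levels so counts that never occur show up as zero games.
all_counts <- 0:max(hy, ay)
home_games <- table(factor(hy, levels = all_counts))
away_games <- table(factor(ay, levels = all_counts))

# One row per card count, with the number of games for each team type.
games_by_count <- data.frame(
  cards = all_counts,
  Home = as.vector(home_games),
  Away = as.vector(away_games)
)
print(games_by_count)
```

Sharing the levels between both team types keeps the two series aligned on the same x-axis positions.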
Question 2B
To visualize the card-to-foul ratio of the home and away teams, a density plot was used. A density plot shows the full distribution of the ratio for each team type, which makes it possible to see which team type tends to have a higher card-to-foul ratio and at which points in the distribution the two differ. The aim of this plot is to visualize how aggressively the home or away teams play, based on the card-to-foul ratio.
Discussion
Question 2A
From the chart of the number of games by number of penalties, there do not appear to be any major trends suggesting a home-team bias when it comes to cards. That said, there are a couple more instances of the home team receiving zero cards (both yellow and red) compared to the away team. The number of fouls does vary, however. Which team type has more games at a given foul count varies, but generally there are more instances of very high foul counts (18+) for the away team. Interestingly, there are noticeably more games in which the home team received a moderate number of fouls (11-13). In general, it also appears that the away team is right-skewed for the total number of fouls given, while the home team is left-skewed. The actual averages across the whole season are nonetheless quite close, as demonstrated in the later “Additional information” section.
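A minimal sketch of the ratio and density calculation is given below in base R on made-up per-match counts. The `home` and `away` data frames and the helper `card_foul_ratio` are hypothetical names introduced for illustration, and the zero-foul guard is an added assumption.

```r
# Toy per-match counts (illustrative values, not taken from the dataset).
home <- data.frame(
  fouls  = c(10, 12, 8, 14, 9, 11),
  yellow = c(2, 1, 0, 3, 2, 1),
  red    = c(0, 0, 0, 1, 0, 0)
)
away <- data.frame(
  fouls  = c(11, 9, 13, 10, 12, 8),
  yellow = c(3, 2, 2, 1, 4, 2),
  red    = c(0, 1, 0, 0, 0, 0)
)

# Hypothetical helper: cards of either color per foul; NA when no fouls were given.
card_foul_ratio <- function(df) {
  ifelse(df$fouls > 0, (df$yellow + df$red) / df$fouls, NA_real_)
}
home_ratio <- card_foul_ratio(home)
away_ratio <- card_foul_ratio(away)

# Kernel density estimates, the quantity a density plot draws.
home_density <- density(home_ratio, na.rm = TRUE)
away_density <- density(away_ratio, na.rm = TRUE)

cat(sprintf("Mean ratio: home %.3f, away %.3f\n",
            mean(home_ratio, na.rm = TRUE), mean(away_ratio, na.rm = TRUE)))
```

Plotting `home_density` and `away_density` on shared axes would give the kind of overlay described above.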
Question 2B
The density plot shows the distribution of the card-to-foul ratio for the home and away teams. The card-to-foul ratio is the number of cards (both yellow and red) given per foul committed during a match. A higher card-to-foul ratio might suggest that a team is playing more aggressively, while a lower ratio suggests the opposite. From this density plot, the card-to-foul ratios of the away team appear to be typically higher than those of the home team. This could indicate that the away team feels the need to play more aggressively, or that there is a home-team bias when it comes to cards given.
Additional information
The average penalties per game for home and away teams are provided in the following summary table:
# Make a new dataset with only the necessary columns
Compare_dataset <- soccer[, c("HomeTeam", "AwayTeam", "HF", "HR", "HY", "AF", "AR", "AY")]

# Calculate the home card-to-foul ratio
Compare_dataset$HCO <- (Compare_dataset$HY + Compare_dataset$HR) / Compare_dataset$HF
# Calculate the away card-to-foul ratio
Compare_dataset$ACO <- (Compare_dataset$AY + Compare_dataset$AR) / Compare_dataset$AF

# Reshape the dataset to long format for plotting
# (tidyr pivot_longer is the newer alternative to gather)
plot_data <- Compare_dataset %>%
  pivot_longer(
    cols = c(HF, HR, HY, AF, AR, AY, HCO, ACO),
    names_to = "Category",
    values_to = "Count"
  ) %>%
  mutate(
    Type = case_when(
      str_detect(Category, "F") ~ "Fouls",
      str_detect(Category, "R") ~ "Red Cards",
      str_detect(Category, "Y") ~ "Yellow Cards",
      str_detect(Category, "C") ~ "Fouls Ratio"
    ),
    Team = case_when(
      str_detect(Category, "^H") ~ "Home",
      str_detect(Category, "^A") ~ "Away"
    )
  ) %>%
  select(-Category) %>%
  group_by(Type, Team) %>%
  summarise(AverageCount = mean(Count, na.rm = TRUE))

# Summary table
plot_data |>
  pivot_wider(
    names_from = Type,
    values_from = AverageCount
  )
#| message: false
#| warning: false
reshaped_data <- plot_data |>
  pivot_wider(
    names_from = Type,
    values_from = AverageCount
  )

# Print the reshaped data as a table
reshaped_data |>
  kable("html") |>
  kable_styling(full_width = FALSE)
| Team | Fouls | Fouls Ratio | Red Cards | Yellow Cards |
| --- | --- | --- | --- | --- |
| Away | 10.15789 | 0.1875560 | 0.0631579 | 1.744737 |
| Home | 10.05526 | 0.1750374 | 0.0500000 | 1.652632 |
Based on the season-wide averages, the away team receives slightly more penalties per game in every category, but only by a relatively slim margin. The table is shown below in chart form to visualize how close these averages are:
ggplot(plot_data, aes(x = Type, y = AverageCount, fill = Team)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
  geom_text(
    aes(label = round(AverageCount, 2), y = AverageCount + 0.02), # Adjust the y position for visibility
    position = position_dodge(width = 0.9),
    vjust = 0, # Vertical adjustment; 0 means right above the bar
    size = 5, # Text size, adjust as needed
    color = pal_pl_logo,
    family = "Poppins",
    fontface = "bold"
  ) +
  theme_classic() +
  theme(
    text = element_text(family = "Poppins", face = "bold", color = pal_pl_logo, size = 15),
    # legend.position = "inside",
    # legend.position.inside = c(0.85, 0.85),
    legend.box.background = element_rect(),
    plot.title.position = "plot"
  ) +
  labs(
    title = paste0("Average Fouls, Red Cards, and ", "Yellow Cards\nfor Home vs. Away Teams"),
    subtitle = lab_pl_subtitle,
    x = NULL,
    y = "Count"
  ) +
  scale_fill_manual(
    breaks = c("Home", "Away"),
    values = darken(pal_pl, 0.3)
  )
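To put a number on how close the home and away averages are, the relative differences can be worked out directly from the summary table values. The sketch below hard-codes those averages in base R. The `averages` data frame and its column names are introduced here for illustration only.

```r
# Per-game averages copied from the summary table.
averages <- data.frame(
  measure = c("Fouls", "Fouls Ratio", "Red Cards", "Yellow Cards"),
  away = c(10.15789, 0.1875560, 0.0631579, 1.744737),
  home = c(10.05526, 0.1750374, 0.0500000, 1.652632)
)

# Absolute gap per game, and the gap relative to the home average in percent.
averages$abs_diff <- averages$away - averages$home
averages$pct_diff <- 100 * averages$abs_diff / averages$home

print(averages[, c("measure", "abs_diff", "pct_diff")])
```

On these figures the away averages are higher by roughly 1% for fouls, 7% for the card-to-foul ratio, 6% for yellow cards, and 26% for red cards. The red-card gap is large in relative terms but corresponds to only about 0.013 cards per game.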