Game Metrics: Decoding Premier League Outcomes Through Visual Analysis and Data Exploration

INFO 526 - Project 1

Author

Group 4 WackyWednesday

Abstract

This project examines 2021-2022 Premier League match data to analyze team performance, home-team advantage, and potential officiating biases. Methods included creating visualizations and summary tables to compare home and away team performance and penalties across the entire season. Results indicate a strong correlation between half-time and full-time goals, with slightly higher predictability for home teams. Home-team advantage is evident in goal scoring but does not guarantee victory. Additionally, there’s a difference in the most frequent number of fouls given to the home team compared to the away team. The card-to-foul ratio also appears to be higher for the away team. Overall, this study offers insights into Premier League dynamics and officiating tendencies.

Introduction

The dataset comes from TidyTuesday’s 04 April 2023 post “Premier League Match Data 2021-2022”. It is derived from Evan Gower’s Kaggle post of the same name.

The Premier League is the foremost league for professional football clubs in England since its creation in 1992. The time period covered in this dataset spans entire 2021 to 2022 Premier League season. The 2021–22 Premier League season marked the 30th anniversary of the league.

Variables in this dataset includes teams, referee, and stats by home and away side such as fouls, shots, cards, and point totals. The dataset is in a tabular format with 380 rows representing individual matches and 22 columns representing each variable, described below. Both categorical and numerical variables are included.

The breakdown of each variable in the data set can be found in the codebook as well as the project proposal.

In our analysis we will be focusing on the following variables:

  • HomeTeam (character): The home team.
  • AwayTeam (character): The away team.
  • FTHG (double): Full-time home goals.
  • FTAG (double): Full-time away goals.
  • FTR (character): Full-time result.
  • HTHG (double): Half-time home goals.
  • HTAG (double): Half-time away goals.
  • HTR (character): Half-time result.
  • HF (double): Number of fouls by the home team.
  • AF (double): Number of fouls by the away team.
  • HY (double): Number of yellow cards received by the home team.
  • AY (double): Number of yellow cards received by the away team.
  • HR (double): Number of red cards received by the home team.
  • AR (double): Number of red cards received by the away team.

This dataset is ideal for analytical projects aiming to understand football dynamics and team strategies throughout the season.

Reason for Choosing this Dataset

We chose this data set because it provides comprehensive statistics contributing to the outcome of football matches. Therefore, we can explore the interesting topic of the potential existence of bias against away teams in the English Premier League. The availability of data allows us to use variables such as shots on goal or create new variables such as calculating the ratio of yellow cards to fouls to answer this question of bias. Some members of the group are also quite passionate about soccer, which definitely convinced us to use this data set!

Question 1

Introduction

First, we explore if a home-team advantage is present when assessing teams’ performances. Here, we use goals as the main indicator of how well a team performs. One way we answer this question is by characterizing the relationship between goals scored at half-time (variables: HTHG, HTAG) and goals scored at the end of the match (variables: FTHG, FTAG) for both home and away teams through scatterplot and linear regression analysis. We wish to quantify how well goals at half-time predict goals at full-time in general as well as seeing if the strength of this relationship changes based on being on the home or away team. Then, we explore this question of home-team advantage using the Full Time Result variable (FTR) and creating a barplot of total goals per team (e.g., Arsenal, Liverpool, etc.) and if they were home or away. Our hypothesis is that the home team will score more goals and win more matches.

Question 1A

Do the number of goals at half-time determine the final outcome? Does this relationship change between home and away teams?

Question 1B

Does home-team advantage play a major role in the final match outcome?

Approach

Part A: The variables used to answer this question are half-time home goals (HTHG), full-time home goals (FTHG), half-time away goals (HTAG), and full-time away goals (HTHG). A scatterplot is created to quantify the correlation between goals at half-time and goals at full-time (end of match) and is color-coded based on home versus away teams. Linear regressions are performed separately for home team and away team data and correlation coefficients are reported. Scatterplot and linear regression analysis is chosen because they directly determine how strongly two variables are related to each other.

Part B: Using the Full Time Result (FTR) variable, home advantage and away advantage are categorized into a new Home_Advantage variable. The advantage is determined through the number of wins of each team. If the Result is “H”, it will be added to home team victory and is therefore counted as home team advantage. If not, it will be added to away team victory and is counted as away team advantage. This information is summarized in a bar plot for Total Games to find the total goals scored by individual team in home and away. An advantage summary is created and combined together to form a new Total goals variable. The Home team, Total variable is then used as an aesthetic layer and a bar plot is created side by side to interpret the outcome for all 20 teams.

Analysis

Question 1A: Correlation between half-time and full-time goals.

Show the code
#Create tibbles based on if the home or away team scored the goals. This makes it easier to color-code the scatterplot and calculate separate linear regressions later.
home_goals <- tibble(soccer$HTHG, soccer$FTHG)
home_goals <- home_goals %>% 
  rename(
     'half' = 'soccer$HTHG',
     'full' = 'soccer$FTHG'
  )

away_goals <- tibble(soccer$HTAG, soccer$FTAG)
away_goals <- away_goals %>% 
  rename(
    'half' = 'soccer$HTAG',
    'full' = 'soccer$FTAG'
  )

#Calculate correlation coefficients for home and away data sets.
home_r <- cor(home_goals$half, home_goals$full)
away_r <- cor(away_goals$half, away_goals$full)

#Create scatterplot with two regression lines for home and away teams, respectively. Add number of points and correlation coefficients to plot.
q1_goals <- ggplot() + 
  geom_jitter(data = home_goals, aes(x = half, y = full, color = 'Home'), width = 0.7) +
  geom_smooth(data = home_goals, aes(x = half, y = full, color = 'Home'), method = lm, se = FALSE) +
  geom_jitter(data = away_goals, aes(x = half, y = full, color = 'Away'), width = 0.7) +
  geom_smooth(data = away_goals, aes(x = half, y = full, color = 'Away'), method = lm,  se = FALSE) + 
  xlim(0, 4) +
  geom_label(aes(x = 0.25, y = 7, label= 'N = 380')) +
  geom_label(aes(x = 3.7, y = 5.5, label= 'r = 0.72', color = 'Home'), show.legend  = FALSE) +
  geom_label(aes(x = 3.7, y = 4.2, label= 'r = 0.70', color = 'Away'), show.legend = FALSE) +
  labs( x = 'Goals at half-time',
        y = 'Goals at full-time',
        color = 'Team') +
  scale_color_manual(
    breaks = c("Home", "Away"),
    values = darken(pal_pl, 0.3)
  ) + 
  theme_classic(base_size = 12) +
  theme(
    text = element_text("Poppins",
                        face = "bold",
                        color = pal_pl_logo))

#Suppress warning messages when printing ggplot object.
suppressMessages(print(q1_goals))

Scatter plot with linear line-of-best-fit showing goals at full-time versus goals at half-time for Premier League match data, with home team and away team indicated by color. The x-axis (half-time goals) is marked from 0 to 4 on intervals of 1, and the y-axis (full-time goals) is marked from 0 to 6 on intervals of 2. The lines-of-best-fit have a strong slope upward and are annotated with their r values, 0.72 for the home team and 0.70 for the away team.

Question 1B: Total games won by home and away teams.

Show the code
# Create a variable for home team advantage
soccer <- soccer %>%
  mutate(HomeAdvantage = ifelse(FTR == "H", "Home Team Win", "Away Team Win"))

# Analyze home and away team advantage
advantage_summary <- soccer %>%
  group_by(HomeAdvantage) %>%
  summarise(TotalGames = n())

kable(advantage_summary, "html") %>%
  kable_styling(full_width = FALSE)
HomeAdvantage TotalGames
Away Team Win 217
Home Team Win 163
Show the code
# Bar plot for home and away team advantage

ggplot(advantage_summary, aes(x = HomeAdvantage, y = TotalGames, fill = HomeAdvantage)) +
  geom_bar(stat = "identity") +
  labs(title = "Distribution of Home and Away Team Wins",
       x = "Win_Count", y = "Total Games") +
  # scale_fill_manual(values = c("Home Team Win" = "lightgreen", "Away Team Win" = "red")) +
  scale_fill_manual(
    breaks = c("Home Team Win", "Away Team Win"),
    labels = c("Home", "Away"),
    values = darken(pal_pl, 0.3)
  ) +
  theme_classic() +
  theme(
    text = element_text("Poppins",
                        face = "bold",
                        color = pal_pl_logo,
                        size = 15)
  )

Bar plot of the total number of games in which the away team won and when the home team won for Premier League match data. The y-axis is marked from 0 to 200 on intervals of 50, and the team types are indicated by x-axis labels and color. The away team appears to have won 217 games, while the home team won 163 games.

Show the code
total_home_goals <-soccer %>%
  group_by(HomeTeam) %>%
  summarize(Total_Home_goals = sum(FTHG))
 
total_away_goals <-soccer %>%
  group_by(AwayTeam) %>%
  summarize(Total_Away_goals = sum(FTAG))

kable(total_home_goals, "html") %>%
  kable_styling(full_width = FALSE)
HomeTeam Total_Home_goals
Arsenal 35
Aston Villa 29
Brentford 22
Brighton 19
Burnley 18
Chelsea 37
Crystal Palace 27
Everton 27
Leeds 19
Leicester 34
Liverpool 49
Man City 58
Man United 32
Newcastle 26
Norwich 12
Southampton 23
Tottenham 38
Watford 17
West Ham 33
Wolves 20
Show the code
kable(total_away_goals, "html") %>%
  kable_styling(full_width = FALSE)
AwayTeam Total_Away_goals
Arsenal 26
Aston Villa 23
Brentford 26
Brighton 23
Burnley 16
Chelsea 39
Crystal Palace 23
Everton 16
Leeds 23
Leicester 28
Liverpool 45
Man City 41
Man United 25
Newcastle 18
Norwich 11
Southampton 20
Tottenham 31
Watford 17
West Ham 27
Wolves 18
Show the code
# Merge the data for home and away goals
total_goals <- merge(total_home_goals, total_away_goals, by.x = "HomeTeam", by.y = "AwayTeam", all = TRUE)

total_goals_long <- gather(total_goals, key = "TeamType", value = "TotalGoals", -HomeTeam)

# Create a side-by-side bar plot
ggplot(total_goals_long , aes(x = HomeTeam, y = TotalGoals, fill = TeamType)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(title = "Total Home and Away Goals by Team",
       x = "Teams on Home and Away", y = "Total Goals") +
  # scale_fill_manual(values = c("Total_Home_goals" = "lightgreen", "Total_Away_goals" = "red")) +
  scale_fill_manual(
    breaks = c("Total_Home_goals", "Total_Away_goals"),
    labels = c("Home", "Away"),
    values = darken(pal_pl, 0.3)
  ) +
  theme_classic() +
  theme(
    text = element_text("Poppins",
                        face = "bold",
                        color = pal_pl_logo,
                        size = 12)
  ) +
  # theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bar plot of the total number of scores made by each team in a Premier League season, divided by home or away as indicated by color. The y-axis ranges from 0 to 60, and the x-axis has labels for each team. The greatest number of goals scored was by Man City as a home team (58), and the lowest was Norwich as an away team (11); no team had an even number of home goals versus away goals except for Watford (17). All values are reported in tables available as text above the plot.

Discussion

Question 1A

There is a strong correlation between goals scored at half-time and goals scored at full-time. In other words, goals at half-time is a strong indicator of how many goals will be scored at the end of the match. However, there is not a perfect one-to-one relationship between these variables. For example, there are some cases where there are 0 goals scored at half-time, but 4 or 5 goals are scored by the end of the match. This plot also shows that there is slightly higher correlation between half-time and full-time goals for the home team than the away team. However, this does not necessarily determine if there is a home-team advantage because more goals does not necessarily equate to more wins.

Question 1B

As per the dataset, the home team winning advantage did not create impact in every game but only in some games. For the goals scored in home and away match, the majority of the teams scored more goals when playing as home rather than in away (with a few exceptions, such as Chelsea).

Question 2

Introduction

In question 2, we want to figure out what effect (if any) does being the “home” team versus being the “away” team affect the teams’ penalties?

A correlation matrix is first used to find any potential relationships amongst the variables of interest before any further analysis is conducted. The variables of interest here are red cards, yellow cards,the card foul ratio and the full time goals. The correlation plot is designed to assign a higher color hue to the variables which have a relatively higher correlation. From the visualization, it is revealed that the most positively correlated variables of interest are yellow cards and the card foul ratio. This is followed by the yellow cards and the number of fouls. All other variables of interest seem to have very little correlation amongst them.

Show the code
soccer_corr <- soccer |>
  mutate(
    HTFCR = (HR + HY)/ HF,
    ATFCR = (AR + AY)/AF
  ) |>
  pivot_longer( cols = c( 'HF', 'AF'),
                names_to = 'Team_Fouls',
                values_to = 'Fouls') |>
  pivot_longer( cols = c( 'HY', 'AY'),
                names_to = 'Team_Yellows',
                values_to = 'yellow_cards') |>
  pivot_longer( cols = c( 'HR', 'AR'),
                names_to = 'Team_Reds',
                values_to = 'red_cards') |>
  pivot_longer(
    cols = c( 'HTFCR', 'ATFCR'),
    names_to = 'Team',
    values_to = 'CFR'
  ) |>
  pivot_longer(
    cols = c('FTHG', 'FTAG'),
    names_to = 'TFTG', values_to = 'FTG'
  )

soccer_corr <- soccer_corr |>
  select( red_cards, yellow_cards, Fouls, CFR, FTG ) |>
  rename( 'Red' = red_cards, 'Yellow' = yellow_cards )

cor_matrix <- round(cor(soccer_corr) , 2)

ggcorrplot(cor_matrix,
           hc.order = TRUE,
           type = 'lower',
           lab = TRUE) +
  theme_minimal() +
  theme(
    legend.position = 'none',
    panel.grid = element_blank(),
    text = element_text(
      face = "bold",
      color = "#3d195b"
    ),
    plot.title.position = "plot"
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "The correlation between the variables used for the analysis",
    subtitle = paste0("Card to Foul Ratio(CFR), Yellow cards(Yellow),\n",
                      "Red cards(Red), and Full Time Goals (FTG)")
  )

Correlation plot comparing variables relating to Premier League data variables of cards-to-foul ratio (CFR), yellow cards, red cards, and full-time goals (FTG). The values are shaded based on their correlation value. Red & fouls has a correlation of 0.04, FTG & fouls -0.03, CFR & fouls -0.05, yellow & fouls 0.25; FTG & red 0.01, CFR & red 0.09, yellow & red 0.01; CFR & FTG -0.05, yellow & FTG -0.05; yellow & CFR 0.46.

Question 2A

Is there a noticeable bias against the away team in terms of the number of fouls, red cards, and yellow cards received?

Question 2B

How do the ratios of cards to fouls differ between home and away teams?

Approach

Question 2A

To determine if there is home-team bias regarding penalties (fouls, yellow cards, and red cards), we plotted the distribution of the total number of games in the season in which a certain number of penalties were given. Although this can be shown with a density plot or area plot, it is easier to see at precisely which penalty counts were more often toward or against each team. Since most games had no red cards given at all, and the maximum number of red cards given was only two, both red cards and yellow cards are plotted on the same facet. The distribution of fouls is much greater, as every single game had at least one foul given. The two team types are mapped to both color and shape for accessibility.

Question 2B

To visualize the card foul ratio of the home and away team, a density plot was used. A density plot was chosen to facilitate the representation of the distribution of the home and away matches card foul ratio. A density plot in this context helps to view which team may have a higher card foul ratio and whether the card foul ratio is relatively higher or lower at certain point in the distribution. The aim of this plot is to visualize how aggressively the home or away teams play based on the card foul ratio information.

Analysis

Question 2A: Penalties by Team

Show the code
q2_soccer_pen <- soccer |> 
  select(HF, AF, HY, AY, HR, AR) |>
  rename(
    yellow_home = HY,
    yellow_away = AY,
    foul_home = HF,
    foul_away = AF,
    red_home = HR,
    red_away = AR
  ) |>
  mutate(game_number = row_number()) |> 
  group_by(game_number) |> 
  pivot_longer(
    cols = c(yellow_home, red_home,
             yellow_away, red_away,
             foul_away, foul_home),
    names_to = "pen_type",
    values_to = "pen_count"
  ) |> 
  mutate(
    pen_cat = case_when(
      str_detect(pen_type, "yellow") ~ "yellow",
      str_detect(pen_type, "red") ~ "red",
      str_detect(pen_type, "foul") ~ "foul"
    ),
    h_a = case_when(
      str_detect(pen_type, "home") ~ "Home",
      str_detect(pen_type, "away") ~ "Away"
    )
  ) |>
  group_by(h_a, pen_count, pen_cat) |> 
  select(!pen_type) |> 
  summarize(tot = n()) |> 
  mutate(
    pen_cf = case_when(
      pen_cat == "foul" ~ "Fouls",
      TRUE ~ "Cards"
    )
  )

q2_penalties_viz <- ggplot(q2_soccer_pen,
                           aes(x = pen_count,
                               y = tot,
                               group = pen_cat,
                               color = h_a)) +
  facet_wrap(~pen_cf,
             ncol = 1,
             scales = "free"
             ) +
  geom_line(
    data = filter(q2_soccer_pen, pen_cat == "red"),
    aes(group = interaction(pen_cat, h_a)),
    lineend = "round",
    linewidth = 10,
    alpha = 0.05,
    color = "red"
  ) +
  geom_line(
    data = filter(q2_soccer_pen, pen_cat == "yellow"),
    aes(group = interaction(pen_cat, h_a)),
    lineend = "round",
    linewidth = 10,
    alpha = 0.1,
    color = "yellow"
  ) +
  geom_line(
    data = filter(q2_soccer_pen, pen_cat == "foul"),
    aes(group = interaction(pen_cat, h_a)),
    lineend = "round",
    linewidth = 10,
    alpha = 0.05,
    color = "purple"
  ) +
  geom_line(
    aes(group = interaction(pen_cat, h_a)),
    linewidth = 1,
    alpha = 0.5
  ) +
  geom_point(
    aes(shape = h_a),
    size = 4,
    stroke = 1.5
  ) +
  geom_text(
    data = data.frame(
      "label" = c("Red Cards", "Yellow Cards", "Fouls"),
      x = c(1, 1.75, 4),
      y = c(300, 170, 30),
      pen_cf = c("Cards", "Cards", "Fouls")
    ),
    aes(x = x, y = y, group = 1, label = label),
    color = pal_pl_logo,
    family = "Poppins",
    fontface = "bold",
    size = 5
  ) +
  scale_x_continuous(
    breaks = scales::pretty_breaks(7)
  ) +
  scale_color_manual(
    breaks = c("Home", "Away"),
    values = darken(pal_pl, 0.3)
  ) +
  scale_shape_manual(
    breaks = c("Home", "Away"),
    values = c(5, 4)
  ) +
  theme_classic() +
  theme(
    text = element_text("Poppins",
                        face = "bold",
                        color = pal_pl_logo,
                        size = 15),
    #legend.position = "inside",
    #legend.position.inside = c(0.85, 0.85),
    legend.box.background = element_rect(),
    strip.background = element_blank(),
    strip.text.x = element_blank(),
    plot.title.position = "plot"
  ) +
  labs(
    color = "Team",
    shape = "Team",
    x = "Total given in a game",
    y = "Number of games",
    title = "Number of English football games by amounts of penalties",
    subtitle = lab_pl_subtitle,
    fill = "Team"
  )

q2_penalties_viz

Point and line plot showing the number of Premier League games versus the number of certain penalties given in a game to the home team and away team, where penalties are divided by yellow cards, red cards, and fouls, and the team type is indicated by color and point shape. The plot is faceted by cards and fouls; the former has an x-axis range of 0 to 6 on intervals of 1 and a y-axis range of 0 to 300 on intervals of 100, and the latter has an x-axis range of 0 to 25 on intervals of 5 and a y-axis range of 0 to 40 on intervals of 10. Most games had 0 red cards given, and the away & home team are roughly equal in the number of instances of 1 red card given; only the home team has an instance of 2 red cards given in a game. The distribution of yellow cards given is roughly equivalent for both the home team and the away team, and only the home team has an instance of 6 fouls given. The distribution of fouls is roughly equivalent from 0 to 6, at which point the away team peaks and troughs from 7-9, and the home team has more instances than the away team around 11-13. The trend resumes from 15-22 fouls. Only away teams had instances of 23 and 25 fouls given.

Question 2B: Cards:Fouls Ratio by Team

Show the code
soccer_ctf <- soccer |>
  mutate(
    HTFCR = (HR + HY)/ HF,
    ATFCR = (AR + AY)/ AF
  ) |>
  rename(
    'Home' = 'HTFCR',
    'Away' = 'ATFCR'
  ) |> 
  pivot_longer(
    cols = c('Home', 'Away'),
    names_to = 'Team',
    values_to = 'FCR'
  )

ggplot(soccer_ctf, aes(x = FCR)) +
  geom_density(
    aes(fill = Team, color = Team),
    alpha = 0.25,
    linewidth = 1
  ) +
  scale_fill_manual(
    breaks = c("Home", "Away"),
    values = darken(pal_pl, 0.3)
  ) +
  scale_color_manual(
    breaks = c("Home", "Away"),
    values = darken(pal_pl, 0.3)
  ) +
  theme_classic() +
  theme(
    text = element_text("Poppins",
                        face = "bold",
                        color = pal_pl_logo,
                        size = 15),
    #legend.position = "inside",
    #legend.position.inside = c(0.75, 0.75),
    legend.box.background = element_rect()
  ) +
  labs(
    x = 'Card:Foul Ratio',
    y = 'Density',
    title = 'The distribution of card:foul ratios across teams',
    subtitle = lab_pl_subtitle,
    caption = 'Data from TidyTuesday'
  )

Distribution plot of card-to-foul ratios by home team and away team in Premier League data. The x-axis ranges from 0 to 1 on intervals of 0.25, and the y-axis ranges from 0 to 3 on intervals of 1. The home team and away team are differentiated with color, and the fill has a slight opacity to allow the density values to be visible throughout. Both distributions are right-skewed, and have slight dips around 0.01 with the home team being slightly higher. The away team has a higher density peak of about 3.25 around a ratio of 0.2 compared to the home team, which peaks around a density of 3 at 0.15. Both taper as the ratio increases, with the away team occasionally rising in density above that of the home team, especially around 0.04.

Discussion

Question 2A

From the chart of the number of games by amount of penalties, there does not appear to be any major trends regarding a home-team bias when it comes to Cards. That said, there are a couple more instances of the home team receiving zero cards (both yellow and red) compared to the away team. The number of fouls does vary, however. The number of instances of a team with more fouls wavers, but generally there are more instances of very high foul counts (18+) for the away team. Interestingly, there is noticeably more instances of games in which the home team received a moderate number of fouls (from 11-13). In general, it also appears that the away team is right-skewed for the total number of fouls given, while the home team is left-skewed. Interestingly the actual averages across the whole season are quite close, as demonstrated in the later “Additional Information” section.

Question 2B

The density plot represents the distribution of the card foul ratio of the home and away teams of the data. The card foul ratio represents the number of cards (both yellow and red) given “per” fouls committed during a match. A higher card:foul ratio might suggest that a team may be playing more aggressively, while a lower card:foul ratio suggest the opposite. From this density plot, it is evident that the card:foul ratios of the away team are typically on the higher side compared to that of the home team. This could indicate that the away team feels the need to play more aggressively, or that there is a home-team bias when it comes to cards given.

Additional information

The average penalties by team are provided in the following summary table:

Show the code
#make new Dataset
# Selecting only the necessary columns 

Compare_dataset <- soccer[, c("HomeTeam", "AwayTeam", "HF", "HR", "HY", "AF", "AR", "AY")]


Compare_dataset$HCO = (Compare_dataset$HY + Compare_dataset$HR)/Compare_dataset$HF 
# Calculating Away Foul Ratio 
Compare_dataset$ACO = (Compare_dataset$AY + Compare_dataset$AR )/Compare_dataset$AF

# Compare_dataset

# Melt the dataset for plotting (tidyr pivot_longer is the newer alternative to gather) 

plot_data <- Compare_dataset %>% 
  pivot_longer(cols = c(HF, HR, HY, AF, AR, AY,HCO,ACO), names_to = "Category", values_to = "Count")%>%
  mutate(Type = case_when( str_detect(Category, "F") ~ "Fouls", 
                           str_detect(Category, "R") ~ "Red Cards", 
                           str_detect(Category, "Y") ~ "Yellow Cards", 
                           str_detect(Category, "C")~"Fouls Ratio"), 
         Team = case_when( str_detect(Category, "^H") ~ "Home", 
                           str_detect(Category, "^A") ~ "Away" )) %>% 
  select(-Category) %>% 
  group_by(Type, Team) %>% 
  summarise(AverageCount = mean(Count, na.rm = TRUE)) 

# Summary table
plot_data |>
  pivot_wider(
    names_from = Type,
    values_from = AverageCount
  )
Show the code
#|message: false
#|warning: false


reshaped_data <- plot_data |>
  pivot_wider(
    names_from = Type,
    values_from = AverageCount
  )

# Print the reshaped data as a table
reshaped_data |>
  kable("html") |>
  kable_styling(full_width = FALSE)
Team Fouls Fouls Ratio Red Cards Yellow Cards
Away 10.15789 0.1875560 0.0631579 1.744737
Home 10.05526 0.1750374 0.0500000 1.652632

Based on the overall averages, the away team on average receives a greater number of penalties per game. However, it is still by a relatively slim margin. The table is shown below in chart form to visualize how close these averages are:

Show the code
ggplot(plot_data, aes(x = Type,
                      y = AverageCount,
                      fill = Team)) + 
  geom_bar(stat = "identity", 
           position = position_dodge(width = 0.9)) + 
  geom_text(
    aes(label = round(AverageCount, 2), 
        y = AverageCount + 0.02),  # Adjust the y position for visibility 
    position = position_dodge(width = 0.9), 
    vjust = 0, # Vertical adjustment; 0 means right above the bar 
    size = 5, # Text size, adjust as needed 
    color = pal_pl_logo,
    family = "Poppins",
    fontface = "bold"
  ) +
  theme_classic() +
  theme(
    text = element_text("Poppins",
                        face = "bold",
                        color = pal_pl_logo,
                        size = 15),
    #legend.position = "inside",
    #legend.position.inside = c(0.85, 0.85),
    legend.box.background = element_rect(),
    plot.title.position = "plot"
  ) +
  labs(
    title = paste0("Average Fouls, Red Cards, and ",
                   "Yellow Cards\nfor Home vs. Away Teams"),
    subtitle = lab_pl_subtitle,
    x = NULL,
    y = "Count"
  ) + 
  scale_fill_manual(
    breaks = c("Home", "Away"),
    values = darken(pal_pl, 0.3)
  )

Bar plot showing the average fouls, red cards, yellow cards, and card-to-foul ratio for the home team and the away team Premier League data. The teams are separated by color. The x-axis is marked with each category, and the y-axis ranges from 0 to 10 on intervals of 2.5. The values are reported as text in the table above the visualization, and are also annotated on the plot itself rounded to two decimal points. The fouls average to 10.16 and 10.06 for the away and home team respectively; the ratio as 0.19 and 0.18; the red cards ad 0.06 and 0.06; and the yellow cards as 1.74 and 1.65.