Uncovering Trends in Cyber Breaches: our project analyzes cybersecurity attacks to reveal the prevalent attack types over time and identify the industries they target, enabling proactive security measures.
The academic project seeks to investigate trends, impacts, and vulnerabilities related to data breaches through comprehensive data analysis and visualization techniques.
Introduction
In the ever-evolving realm of cybersecurity, 2023 marked a watershed moment, as revealed by the Identity Theft Resource Center’s (ITRC) Annual Data Breach Report. With a 72% increase over the previous record set in 2021, the United States saw an unprecedented surge in data compromises, affecting over 353 million individuals. This trend underscores the urgent need for heightened vigilance and robust security measures to safeguard digital assets. In our project, we examine the “World’s Biggest Data Breaches & Hacks” dataset to gain insight into these incidents and explore strategies to mitigate future risks.
Data Description
The “World’s Biggest Data Breaches & Hacks” dataset, sourced from informationisbeautiful.net, catalogs data breaches across various sectors. Key columns such as year, date, sector, method, and data sensitivity provide avenues for in-depth analysis. Spanning 2004 to 2023, the dataset offers a comprehensive view of breaches over time.
Sectors covered include academic, app, finance, gaming, government, health, legal, military, and miscellaneous. Methods of breach encompass hacking, inside job, lost device, accidental exposure (oops), and poor security practices. Additionally, the dataset contains information on breaches affecting top companies like Facebook, Microsoft, and others.
```r
# Load the knitr package for the kable() function
library(knitr)

# Read the CSV file into the data frame df
df <- read.csv("data/Data_Breaches_LATEST.csv")

# Display the first few rows of the data frame using kable
kable(head(df))
```
Output of kable(head(df)). The table has 17 columns:

organisation, alternative.name, records.lost, year, date, story, sector, method, interesting.story, data.sensitivity, displayed.records, Unnamed..11, source.name, X1st.source.link, X2nd.source.link, ID, Unnamed..16

Rows recoverable from the preview:

| organisation | records.lost | year | date | story |
|---|---|---|---|---|
| Irish towing company | 512000 | 2023 | 2023-10-23 00:00:00 | The driving licences and payment card details of thousands of motorists who had vehicles towed on behalf of the Irish police |
| Delta Dental | 7000000 | 2023 | 2023-05-23 00:00:00 | The dental insurance company suffered unauthorized access by threat actors through the MOVEit file transfer software application exposing full credit card details of customers |

Other story values in the preview:

Russian ransomware group Clop stole names, dates of birth, Social Security numbers, driver’s license and other state or taxpayer identification numbers. Some individuals had medical and health insurance information taken.
Patient data was exposed during the breach, including full names, email addresses, physical addresses, and telephone numbers. For some, it also includes Social Security Numbers (SSNs), Medicare/Medicaid ID numbers, and certain Health Insurance information.
Names and email addresses of customers of the identity security company. 134 of the company’s 18,400 clients were impacted, but that only five instances of successful session hijacking were logged

Remaining cell values from the preview, in extraction order: tech, hacked, 1, NA, Okta, https://sec.okta.com/harfiles, 459
Relevance: Data breaches are a pressing issue affecting organizations globally, necessitating thorough analysis for insights into cybersecurity landscapes.
Scope: With coverage across diverse sectors and spanning nearly two decades, this dataset offers a comprehensive view of data breaches over time.
Variety of Variables: The dataset encompasses multiple variables, facilitating multifaceted analysis of breach methods, sector vulnerabilities, and breach trends.
Questions
Q1) General Assessment: How have data breaches evolved over the past decade (2013-2023), and what patterns emerge in their frequency, severity, and impact across different industries?
Q2) Vulnerability Assessment: Which sectors or types of data (e.g., personal, financial) are particularly susceptible to different breach methods such as hacking or inside jobs, and what are the resulting impacts on both businesses and individuals?
Analysis plan
QUESTION 1:
To answer this question, we will use the columns Method, Sector, Sensitivity, Records Lost, and Year.
The steps are:
Data Preprocessing:
We will clean and organize the provided data, ensuring accuracy and consistency.
We will focus on the relevant columns (records lost, method, year, sector, data sensitivity, and displayed records) and verify that the dataset covers the entire timeframe from 2004 to 2023.
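The preprocessing step above could look roughly like the following base R sketch. It runs on a small toy data frame with invented values rather than the real CSV, and it assumes that `records.lost` may arrive as text with thousands separators; the real file may need different handling.

```r
# Toy stand-in for the breach data; values are illustrative, not from the dataset.
raw <- data.frame(
  organisation      = c("A Corp", "B Bank", "C Clinic", "D Games"),
  records.lost      = c("512000", "7,000,000", "", "150000"),
  year              = c(2023, 2023, 2019, 2001),
  sector            = c("tech", "finance", "health", "gaming"),
  method            = c("hacked", "hacked", "lost device", "inside job"),
  data.sensitivity  = c(1, 3, 4, 2),
  displayed.records = c(NA, "7m", NA, NA),
  Unnamed..11       = NA,
  stringsAsFactors  = FALSE
)

# Keep only the columns named in the analysis plan.
keep <- c("organisation", "records.lost", "year", "sector",
          "method", "data.sensitivity", "displayed.records")
clean <- raw[, keep]

# Strip thousands separators before converting; blank strings become NA.
clean$records.lost <- as.numeric(gsub(",", "", clean$records.lost))
clean$year <- as.integer(clean$year)

# Restrict to the 2004-2023 window stated in the plan.
in_window <- !is.na(clean$year) & clean$year >= 2004 & clean$year <= 2023
clean <- clean[in_window, ]

print(clean)
```

Rows outside the window and the unnamed helper column are dropped, while missing record counts are kept as NA so they can be handled explicitly later.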
Exploratory Data Analysis (EDA):
We will calculate summary statistics for key variables such as records lost to understand the overall scale of breaches.
Visualize trends in the frequency of breaches over time using a line plot or bar chart.
Identify any outliers or anomalies in the data that may warrant further investigation.
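A minimal sketch of these EDA steps in base R is shown below. The data frame is a toy example with invented values, and the outlier rule (values above the third quartile plus 1.5 times the interquartile range) is one common convention that we assume here; other thresholds may suit the real data better.

```r
# Toy data; values are illustrative, not from the dataset.
breaches <- data.frame(
  year         = c(2019, 2019, 2020, 2021, 2021, 2021, 2022),
  records.lost = c(2e5, 5e5, 1e6, 3e5, 8e5, 9e8, 4e5)
)

# Summary statistics for the scale of breaches.
print(summary(breaches$records.lost))

# Number of breaches per year.
per_year <- as.data.frame(table(year = breaches$year),
                          responseName = "n_breaches")
print(per_year)

# Bar chart of breach frequency over time.
barplot(per_year$n_breaches,
        names.arg = as.character(per_year$year),
        xlab = "Year", ylab = "Number of breaches")

# Flag unusually large breaches with the 1.5 * IQR rule.
q <- quantile(breaches$records.lost, c(0.25, 0.75), na.rm = TRUE)
upper <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
outliers <- breaches[breaches$records.lost > upper, ]
print(outliers)
```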
Trend Analysis:
Analyze trends in breach severity by examining the distribution of records lost and displayed records over time.
Evaluate variations in breach frequency and severity across different sectors and regions.
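One way to sketch the trend analysis is to summarize severity per year and per sector, then fit a simple trend line. The example below uses toy values, takes the median of records lost as the severity measure, and fits a linear model on the log scale; these are our assumptions for illustration, not fixed choices of the project.

```r
# Toy data; values are illustrative, not from the dataset.
breaches <- data.frame(
  year         = c(2020, 2020, 2021, 2021, 2022, 2022),
  sector       = c("health", "finance", "health", "finance", "health", "finance"),
  records.lost = c(1e5, 4e5, 2e5, 6e5, 4e5, 5e5)
)

# Median records lost per year as a simple severity measure.
severity_by_year <- aggregate(records.lost ~ year, data = breaches, FUN = median)
print(severity_by_year)

# Total records lost per sector.
total_by_sector <- tapply(breaches$records.lost, breaches$sector, sum)
print(total_by_sector)

# Linear trend on the log10 scale; a positive slope suggests growing severity.
trend <- lm(log10(records.lost) ~ year, data = breaches)
print(coef(trend)[["year"]])
```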
Visualization and Interpretation:
We will use Shiny, an R package for building interactive web applications, to construct an interactive dashboard for data visualization.
Design a user-friendly interface incorporating dropdown menus, sliders, and checkboxes for seamless data filtering and selection.
Utilize ggplot2 to create visually appealing representations of key metrics such as breach frequency over time, distribution of records lost by sector, and trends in breach severity.
Implement interactive elements like hover tooltips, clickable legends, and zoom functionality to enhance user engagement and enable detailed exploration of the data.
If possible*, incorporate maps to visually display the geographical distribution of data breaches and their impacts across different regions.
Provide deeper insights through textual descriptions, annotations, or pop-up messages that highlight significant findings and their implications for cybersecurity strategies.
* Breached companies often operate across multiple regions, so we will first assess, from the dataset’s perspective, whether a map is a suitable choice.
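The dashboard described above might start from a skeleton like the one below. This is a minimal sketch on toy data, not the final design: the input IDs (`sector`, `years`), the output ID (`freq_plot`), and the helper `filter_breaches()` are hypothetical names chosen for illustration.

```r
library(shiny)
library(ggplot2)

# Toy data; values are illustrative, not from the dataset.
breaches <- data.frame(
  year   = c(2020, 2021, 2021, 2022, 2022),
  sector = c("health", "health", "finance", "tech", "health"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keeps rows matching the selected sectors and year range.
filter_breaches <- function(data, sectors, year_range) {
  keep <- data$sector %in% sectors &
    data$year >= year_range[1] & data$year <= year_range[2]
  data[keep, ]
}

ui <- fluidPage(
  titlePanel("Data breaches (toy data)"),
  sidebarLayout(
    sidebarPanel(
      selectInput("sector", "Sector",
                  choices = sort(unique(breaches$sector)),
                  selected = unique(breaches$sector), multiple = TRUE),
      sliderInput("years", "Year range",
                  min = min(breaches$year), max = max(breaches$year),
                  value = range(breaches$year), step = 1, sep = "")
    ),
    mainPanel(plotOutput("freq_plot"))
  )
)

server <- function(input, output, session) {
  output$freq_plot <- renderPlot({
    shown <- filter_breaches(breaches, input$sector, input$years)
    ggplot(shown, aes(x = factor(year), fill = sector)) +
      geom_bar() +
      labs(x = "Year", y = "Number of breaches", fill = "Sector")
  })
}

# Uncomment to launch the app locally:
# shinyApp(ui, server)
```

Keeping the filtering logic in a plain function makes it possible to check it without starting the app.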
QUESTION 2:
To address this vulnerability assessment, we will use the columns Method, Sector, Sensitivity, Records Lost, and Year.
We will follow these steps:
Data Preprocessing:
We will clean the data by addressing any formatting anomalies or irrelevant details.
Encode categorical variables and handle any missing data if present.
Prepare the data for exploratory analysis.
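As a rough illustration of this preprocessing, the sketch below normalizes category labels, encodes them as factors, and counts missing values. The toy data and the helper name `tidy_label()` are our own assumptions; the inconsistencies in the real file may differ.

```r
# Toy data with inconsistent labels; values are illustrative, not from the dataset.
raw <- data.frame(
  sector = c("Health", "finance ", "health", NA, "tech"),
  method = c("hacked", "Hacked", "oops", "inside job", ""),
  data.sensitivity = c(4, 3, NA, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case, trim whitespace, and turn blanks into NA.
tidy_label <- function(x) {
  x <- tolower(trimws(x))
  x[!is.na(x) & x == ""] <- NA
  x
}

# Encode the categorical variables as factors.
raw$sector <- factor(tidy_label(raw$sector))
raw$method <- factor(tidy_label(raw$method))

# Count missing values per column before deciding how to handle them.
print(colSums(is.na(raw)))

# One simple option: keep rows where both sector and method are known.
complete <- raw[complete.cases(raw[, c("sector", "method")]), ]
print(complete)
```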
Exploratory Data Analysis (EDA):
We will analyze the frequency of breaches across sectors and methods.
Identify trends or recurring patterns in breach incidents by year.
Identify correlations between methods and affected sectors or data types.
Examine the impact of breaches on both corporate entities and individuals by parsing the Story column.
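The cross-tabulation and story parsing described above can be sketched as follows. The rows are toy examples, and the keyword patterns are assumptions about what the story text might contain; a real analysis would need a broader, validated keyword list.

```r
# Toy data; values are illustrative, not from the dataset.
breaches <- data.frame(
  sector = c("health", "health", "finance", "finance", "tech", "tech"),
  method = c("hacked", "lost device", "hacked", "inside job", "hacked", "hacked"),
  story  = c("Patient names and Social Security numbers were exposed",
             "A laptop with patient records was lost",
             "Full credit card details of customers were exposed",
             "An employee copied account data",
             "Names and email addresses of customers were taken",
             "Payment card details were stolen"),
  stringsAsFactors = FALSE
)

# Breach counts for every sector and method combination.
counts <- table(breaches$sector, breaches$method)
print(counts)

# Share of each method within a sector (rows sum to 1).
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))

# Flag stories that mention selected kinds of personal or financial data.
breaches$mentions_ssn  <- grepl("social security", breaches$story, ignore.case = TRUE)
breaches$mentions_card <- grepl("credit card|payment card", breaches$story, ignore.case = TRUE)
print(breaches[, c("sector", "mentions_ssn", "mentions_card")])
```

With enough observations per cell, a test of association such as `chisq.test(counts)` could follow; with counts this small it would not be reliable.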
Trend Analysis:
We will look for trends or patterns in breach incidents across successive years.
Identify sectors or data types that have shown heightened vulnerability to breaches over time.
Analyze shifts in methods and their consequences across years.
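Shifts in breach methods over time can be made visible by computing each method's share of breaches per year. The sketch below does this on toy data; the values are invented purely to show the mechanics.

```r
# Toy data; values are illustrative, not from the dataset.
breaches <- data.frame(
  year   = c(2015, 2015, 2015, 2019, 2019, 2023, 2023, 2023, 2023),
  method = c("lost device", "hacked", "lost device",
             "hacked", "oops",
             "hacked", "hacked", "hacked", "oops"),
  stringsAsFactors = FALSE
)

# Count breaches per year and method, then convert to within-year shares.
by_year <- table(breaches$year, breaches$method)
share <- prop.table(by_year, margin = 1)
print(round(share, 2))

# Track one method across years to see whether its share is rising.
hacked_share <- share[, "hacked"]
print(hacked_share)
```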
Visualization and Interpretation:
We will utilize Shiny, an R package designed for constructing interactive web applications, to craft dynamic dashboards aimed at visualizing and interpreting data.
In our analysis, we will identify pertinent columns including “Sector,” “Method,” “Year,” “Type of Data Compromised,” and “Impact” for thorough examination.
To enhance user interaction, we will create user-friendly interactive components such as dropdown menus, sliders, and checkboxes tailored to users’ needs.
Furthermore, we will incorporate functionalities such as tooltips, hover effects, and linked views to enrich user engagement and facilitate data exploration.
Plot and UI Description of Dashboard
Plot 1: Breach Methods by Sector/Type of Data (Stacked Bar Chart with Gradient Colors)
Plot Explanation: This stacked bar chart visualizes the relationship between different sectors/types of data and various breach methods. Each bar represents the count of breaches for a specific sector/type of data, and gradient colors are used to enhance visual appeal.
UI Design:
Hover Tooltips: When users hover over a bar in the plot, a tooltip appears, providing detailed information about the count of breaches for that sector/type of data and breach method.
Clickable Legends: Users can click on the legend to toggle the visibility of different breach methods, allowing them to focus on specific methods for analysis.
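A static version of Plot 1 could be drafted with ggplot2 as in the sketch below. It uses toy data, and the sequential "Blues" palette is our stand-in for the gradient colors. Hover tooltips and a clickable legend are not part of ggplot2 itself; one option we assume here is wrapping the plot with the plotly package.

```r
library(ggplot2)

# Toy data; values are illustrative, not from the dataset.
breaches <- data.frame(
  sector = c("health", "health", "finance", "finance", "tech", "tech"),
  method = c("hacked", "lost device", "hacked", "inside job", "hacked", "hacked"),
  stringsAsFactors = FALSE
)

# Stacked bar chart: one bar per sector, segments coloured by breach method.
p1 <- ggplot(breaches, aes(x = sector, fill = method)) +
  geom_bar(position = "stack") +
  scale_fill_brewer(palette = "Blues") +
  labs(x = "Sector", y = "Number of breaches", fill = "Breach method",
       title = "Breach methods by sector (toy data)")

print(p1)

# Assumption: with the plotly package installed, plotly::ggplotly(p1)
# would add hover tooltips and a legend that toggles methods on click.
```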
Plot 2: Breach Severity by Sector/Type of Data (Diverging Stacked Bar Chart)
Plot Explanation: This diverging stacked bar chart shows the severity of breaches (e.g., measured by data sensitivity levels) for different sectors/types of data. It helps identify which sectors/types of data are most affected by severe breaches, with contrasting colors for positive and negative values.
UI Design:
Hover Tooltips: When users hover over a bar in the plot, a tooltip appears, providing detailed information about the severity of breaches for that sector/type of data.
Clickable Legends: Users can click on the legend to toggle the visibility of different severity levels, allowing them to focus on specific levels for analysis.
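A diverging stacked bar chart can be approximated in ggplot2 by giving one group of levels negative counts so they extend to the left of zero. In the sketch below we assume sensitivity is coded 1 to 5 and treat levels 1 and 2 as the lower-severity side; both the coding and the cut-off are assumptions for illustration on toy data.

```r
library(ggplot2)

# Toy data; values are illustrative, not from the dataset.
breaches <- data.frame(
  sector = c("health", "health", "health", "finance", "finance", "tech"),
  data.sensitivity = c(1, 4, 5, 2, 3, 1)
)

# Count breaches per sector and sensitivity level.
counts <- as.data.frame(
  table(sector = breaches$sector, sensitivity = breaches$data.sensitivity),
  responseName = "n"
)

# Assumed cut-off: levels 1-2 go left of zero, levels 3-5 go right.
counts$level <- as.integer(as.character(counts$sensitivity))
counts$signed_n <- ifelse(counts$level <= 2, -counts$n, counts$n)

p2 <- ggplot(counts, aes(x = sector, y = signed_n, fill = sensitivity)) +
  geom_col() +
  geom_hline(yintercept = 0) +
  coord_flip() +
  scale_y_continuous(labels = abs) +
  scale_fill_brewer(palette = "RdBu", direction = -1) +
  labs(x = "Sector", y = "Number of breaches", fill = "Sensitivity level",
       title = "Breach severity by sector (toy data)")

print(p2)
```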
Plot 3: Records Lost by Sector/Type of Data (Stacked Bar Chart with 3D Effect)
Plot Explanation: This stacked bar chart displays the distribution of records lost across different sectors/types of data. It helps identify which sectors/types of data experience the highest number of records lost in breaches, with a visually appealing 3D effect.
UI Design:
Hover Tooltips: When users hover over a section in the plot, a tooltip appears, providing detailed information about the number of records lost for that sector/type of data.
Clickable Legends: Users can click on the legend to toggle the visibility of different sectors/types of data, allowing them to focus on specific sectors/types for analysis.
Plot 4: Breach Frequency Over Time (Smoothed Line Plot with Animated Transitions)
Plot Explanation: This smoothed line plot visualizes the frequency of breaches over time with animated transitions between data points. It helps users easily track trends and patterns in breach occurrences across different time periods.
UI Design:
Hover Tooltips: When users hover over a point on the line plot, a tooltip appears, providing detailed information about the number of breaches in that specific time period.
Zoom Functionality: Users can zoom in and out of the plot to explore trends more closely.
Clickable Legends: Users can click on the legend to toggle the visibility of different breach methods, allowing them to focus on specific methods for analysis.
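The static core of Plot 4 might look like the following ggplot2 sketch, which overlays a loess smoother on yearly breach counts. The counts are invented for illustration, and the animated transitions are left out because they would require an additional package.

```r
library(ggplot2)

# Toy yearly counts; values are illustrative, not from the dataset.
per_year <- data.frame(
  year       = 2014:2023,
  n_breaches = c(12, 15, 14, 20, 22, 25, 24, 30, 33, 35)
)

# Points show the raw yearly counts; the smoothed line shows the overall trend.
p4 <- ggplot(per_year, aes(x = year, y = n_breaches)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  scale_x_continuous(breaks = per_year$year) +
  labs(x = "Year", y = "Number of breaches",
       title = "Breach frequency over time (toy data)")

print(p4)
```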
Repo organisation
The project repository contains the following folders and files:
.github/: This directory is designated for files associated with GitHub, encompassing workflows, actions, and templates tailored for issues.
_extra/: Reserved for miscellaneous files that don’t neatly fit into other project categories, providing a catch-all space for various supplementary documents.
_freeze/: Holds the frozen results of previously executed code chunks (Quarto’s freeze feature), so documents can be re-rendered without re-running their computations.
data/: Specifically allocated for storing indispensable data files crucial for the project’s functionality, encompassing input files, datasets, and other essential data resources.
images/: Serving as a repository for visual assets employed throughout the project, including diagrams, charts, and screenshots, this directory maintains visual elements integral to project documentation and presentation.
.gitignore: This file functions to specify exclusions from version control, ensuring that designated files and directories remain untracked by Git, thus streamlining the versioning process.
README.md: Serving as the primary hub of project information, this README document furnishes essential details encompassing project setup, usage instructions, and an overarching overview of project objectives and scope.
_quarto.yml: Acting as a pivotal configuration file for Quarto, this document encapsulates various settings and options governing the construction and rendering of Quarto documents, facilitating customization and control over document output.
about.qmd: This Quarto Markdown file supplements project documentation by providing additional contextual information, elucidating project purpose, contributor insights, and other pertinent project details.
index.qmd: The main documentation page for our project. This Quarto Markdown file provides detailed descriptions of the project, including code examples and visualizations.