Unveiling Trends in Data Breaches and Data Hacks

Proposal

Our project uncovers trends in cyber breaches by analyzing cybersecurity attacks to reveal prevalent attack types over time and identify targeted industries, empowering proactive security measures.

Author

Datavista: Abhishek Kumar, Srinivasan Akash, Gowtham Theeda, Divya Dhole, Lakshmi Neharika Anchula, Noureen Mithaigar

Affiliation

School of Information, University of Arizona

High Level Goal

The academic project seeks to investigate trends, impacts, and vulnerabilities related to data breaches through comprehensive data analysis and visualization techniques.


Introduction

In the ever-evolving realm of cybersecurity, 2023 marked a watershed moment, as documented in the Identity Theft Resource Center’s (ITRC) Annual Data Breach Report. The United States saw an unprecedented surge in data compromises, a 72% increase over the previous record set in 2021, impacting over 353 million individuals. This trend underscores the urgent need for heightened vigilance and robust security measures to safeguard digital assets. In this project, we examine the World’s Biggest Data Breaches & Hacks dataset to gain insight into these incidents and explore strategies for mitigating future risks.


Data Description

The “World’s Biggest Data Breaches & Hacks” dataset, sourced from informationisbeautiful.net, focuses on cataloging data breaches across various sectors. Key columns such as year, date, sector, method, and data sensitivity provide avenues for in-depth analysis. Spanning from 2003 to 2023, the dataset offers a comprehensive view of breaches over time.

Sectors covered include academic, app, finance, gaming, government, health, legal, military, and miscellaneous. Methods of breach encompass hacking, inside job, lost device, accidental exposure (oops), and poor security practices. Additionally, the dataset contains information on breaches affecting top companies like Facebook, Microsoft, and others.


# Load the knitr package for the kable() function
library(knitr)

# Read the CSV file into the data frame df
df <- read.csv("data/Data_Breaches_LATEST.csv")

# Display the first few rows of the data frame using kable
kable(head(df))
| organisation | alternative.name | records.lost | year | date | story | sector | method | interesting.story | data.sensitivity | displayed.records | Unnamed..11 | source.name | X1st.source.link | X2nd.source.link | ID | Unnamed..16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Irish towing company | | 512000 | 2023 | 2023-10-23 00:00:00 | The driving licences and payment card etails of thousands of motorists who had vehicles towed on behalf of the Irish police | transport | poor security | | 3 | | NA | Irish independent | https://www.independent.ie/irish-news/thousands-of-drivers-have-sensitive-data-exposed-to-hackers-in-major-it-breach/a1379036136.html | | 463 | |
| Maine Government | | 1300000 | 2023 | 2023-05-23 00:00:00 | Russian ransomware group Clop stole names, dates of birth, Social Security numbers, driver’s license and other state or taxpayer identification numbers. Some individuals had medical and health insurance information taken. | government | hacked | | 4 | | NA | Tech Crunch | https://techcrunch.com/2023/11/09/maine-government-data-breach-clop-ransomware/ | | 462 | |
| Welltok | | 8500000 | 2023 | 2023-11-23 00:00:00 | Patient data was exposed during the breach, including full names, email addresses, physical addresses, and telephone numbers. For some, it also includes Social Security Numbers (SSNs), Medicare/Medicaid ID numbers, and certain Health Insurance information. | health | hacked | | 4 | | NA | Bleeping Computer | https://www.bleepingcomputer.com/news/security/welltok-data-breach-exposes-data-of-85-million-us-patients/ | | 461 | |
| Maximus | | 10000000 | 2023 | 2023-07-23 00:00:00 | Exploit of a zero-day flaw in the MOVEit file transfer application. Data stolen included social security numbers, protected health information. | government | hacked | | 4 | | NA | Bleeping Computer | https://www.bleepingcomputer.com/news/security/8-million-people-hit-by-data-breach-at-us-govt-contractor-maximus/ | | 460 | |
| Okta | | 134 | 2023 | 2023-11-23 00:00:00 | Names and email addresses of customers of the identity security company. 134 of the company’s 18,400 clients were impacted, but that only five instances of successful session hijacking were logged | tech | hacked | | 1 | | NA | Okta | https://sec.okta.com/harfiles | | 459 | |
| Delta Dental | | 7000000 | 2023 | 2023-05-23 00:00:00 | The dental insurance company suffered unauthorized access by threat actors through the MOVEit file transfer software application exposing full credit card details of customers | health | hacked | | 3 | | NA | Bleeping Computer | https://www.bleepingcomputer.com/news/security/delta-dental-of-california-data-breach-exposed-info-of-7-million-people/ | | 458 | |

| Column | Data Type | Description |
| --- | --- | --- |
| organisation | object | Name of the organisation that lost data. |
| alternative name | object | Alternative name for the organisation. |
| records lost | int64 | Number of records lost in the data breach. |
| year | int64 | Year in which the data breach occurred. |
| date | object | Date of the data breach. |
| story | object | Description of the data breach incident. |
| sector | object | Sector affected by the data breach. |
| method | object | Method through which the data breach occurred. |
| interesting story | object | Interesting story related to the data breach. |
| data sensitivity | float64 | Level of data sensitivity (1, 2, 3, 4, 5). |
| displayed records | object | Displayed records in the data breach. |
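
The type names in the table above (object, int64, float64) follow pandas conventions, whereas the loading code in this proposal uses R. As a hedged illustration only, the sketch below shows how the corresponding R classes and column names could be checked after loading. The inline CSV text is a hypothetical two-row excerpt built from values in the sample output, not the project file itself, and the object name `toy` is an assumption.

```r
# Hypothetical two-row excerpt (values taken from the sample output above).
csv_text <- "organisation,records lost,year,data sensitivity
Okta,134,2023,1
Welltok,8500000,2023,4"

# read.csv() converts headers into syntactic R names by default,
# so "records lost" becomes "records.lost".
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the resulting column names and R classes.
print(names(toy))
print(sapply(toy, class))
```

Running the same `sapply(df, class)` call on the full data frame would show which columns R read as character, integer, or numeric.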

Reason for choosing this data


Questions

Q1) General Assessment: How have data breaches evolved over the past decade (2013-2023), and what patterns emerge in their frequency, severity, and impact across different industries?

Q2) Vulnerability Assessment: Which sectors or types of data (e.g., personal, financial) are most susceptible to particular breach methods such as hacking or inside jobs, and what are the resulting impacts on businesses and individuals?


Analysis plan

QUESTION 1:

To answer this question, we will use the following columns: Method, Sector, Sensitivity, Records Lost, Year.

The steps are:

  • Data Preprocessing:

    • We will clean and organize the provided data, ensuring accuracy and consistency.

    • We will focus on relevant columns such as records lost, method, year, sector, data sensitivity, and displayed records, and ensure the dataset covers the entire timeframe from 2004 to 2023.

  • Exploratory Data Analysis (EDA):

    • We will calculate summary statistics for key variables such as records lost to understand the overall scale of breaches.

    • Visualize trends in the frequency of breaches over time using a line plot or bar chart.

    • Identify any outliers or anomalies in the data that may warrant further investigation.

  • Trend Analysis:

    • Analyze trends in breach severity by examining the distribution of records lost and displayed records over time.

    • Evaluate variations in breach frequency and severity across different sectors and regions.

  • Visualization and Interpretation:

    • We will use Shiny, an R package for building interactive web applications, to construct an interactive dashboard for data visualization.

    • Design a user-friendly interface incorporating dropdown menus, sliders, and checkboxes for seamless data filtering and selection.

    • Utilize ggplot2 to create visually appealing representations of key metrics such as breach frequency over time, distribution of records lost by sector, and trends in breach severity.

    • Implement interactive elements like hover tooltips, clickable legends, and zoom functionality to enhance user engagement and enable detailed exploration of the data.

    • If possible*, incorporate maps to visually display the geographical distribution of data breaches and their impacts across different regions.

    • Provide deeper insights by including textual descriptions, annotations, or pop-up messages that highlight significant findings and their implications for cybersecurity strategies.

    • * Because data breaches generally occur at companies that span multiple regions, we will first assess whether a map is a suitable choice from a dataset perspective.
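
The Question 1 steps above could be sketched in base R roughly as follows. This is a minimal illustration rather than the project's actual pipeline: the `breaches` data frame holds made-up placeholder values, and its column names (`year`, `records.lost`, `sector`) are assumed to match those produced by `read.csv()` on the project file. The 1.5 x IQR rule used to flag outliers is one common convention, chosen here as an assumption.

```r
# Toy stand-in for the real dataset; all values are illustrative placeholders.
breaches <- data.frame(
  year         = c(2012, 2013, 2013, 2018, 2018, 2018, 2023, NA),
  records.lost = c(1000, 5000, NA, 20000, 300, 4500, 8000, 100),
  sector       = c("tech", "health", "finance", "tech", "health",
                   "government", "health", "tech"),
  stringsAsFactors = FALSE
)

# Data preprocessing: drop rows missing the fields we need, then keep
# the 2013-2023 window named in Question 1.
clean <- breaches[!is.na(breaches$year) & !is.na(breaches$records.lost), ]
clean <- clean[clean$year >= 2013 & clean$year <= 2023, ]

# EDA: summary statistics for the overall scale of breaches.
print(summary(clean$records.lost))

# EDA: flag unusually large breaches with the 1.5 * IQR rule (an assumption).
q <- quantile(clean$records.lost, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
outliers <- clean[clean$records.lost > upper_fence, ]

# Trend analysis: breach frequency and total records lost per year.
counts <- aggregate(records.lost ~ year, data = clean, FUN = length)
totals <- aggregate(records.lost ~ year, data = clean, FUN = sum)
yearly <- data.frame(
  year               = counts$year,
  breaches           = counts$records.lost,
  total_records_lost = totals$records.lost
)
print(yearly)
```

The resulting `yearly` table is the kind of input a line plot or bar chart of breach frequency over time would need.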


QUESTION 2:

To address this vulnerability assessment, we will use the following columns: Method, Sector, Sensitivity, Records Lost, Year.

We will follow these steps:

  • Data Preprocessing:

    • We will clean the data by addressing any formatting anomalies or irrelevant details.

    • Encode categorical variables and handle any missing data if present.

    • Prepare the data for exploratory analysis.

  • Exploratory Data Analysis (EDA):

    • We will analyze the frequency of breaches across sectors and methods.

    • Identify trends or recurring patterns in breach incidents by year.

    • Identify correlations between methods and affected sectors or data types.

    • Examine the impact of breach occurrences on both corporate entities and individuals by parsing the Story column.

  • Trend Analysis:

    • Find trends or patterns in breach incidents across successive years.

    • Identify sectors or data types that have shown heightened vulnerability to breaches over time.

    • Analyze shifts in methods or their consequences across different years.

  • Visualization and Interpretation:

    • We will utilize Shiny, an R package designed for constructing interactive web applications, to craft dynamic dashboards aimed at visualizing and interpreting data.

    • In our analysis, we will identify pertinent columns including “Sector,” “Method,” “Year,” “Type of Data Compromised,” and “Impact” for thorough examination.

    • To enhance user interaction, we will create user-friendly interactive components such as dropdown menus, sliders, and checkboxes tailored to users’ needs.

    • Furthermore, we will incorporate functionalities such as tooltips, hover effects, and linked views to enrich user engagement and facilitate data exploration.
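
As a hedged sketch of the Question 2 exploratory steps above, the base R snippet below cross-tabulates breach method against sector and summarizes sensitivity and records lost per sector. The six rows are copied from the sample output shown earlier in this proposal; the object names (`sample_rows`, `method_by_sector`, and so on) are assumptions, and six rows are far too few to draw conclusions from.

```r
# Six rows copied from the sample output shown earlier in this proposal.
sample_rows <- data.frame(
  organisation     = c("Irish towing company", "Maine Government", "Welltok",
                       "Maximus", "Okta", "Delta Dental"),
  sector           = c("transport", "government", "health",
                       "government", "tech", "health"),
  method           = c("poor security", "hacked", "hacked",
                       "hacked", "hacked", "hacked"),
  data.sensitivity = c(3, 4, 4, 4, 1, 3),
  records.lost     = c(512000, 1300000, 8500000, 10000000, 134, 7000000),
  stringsAsFactors = FALSE
)

# Frequency of breaches for every sector/method combination.
method_by_sector <- table(sector = sample_rows$sector,
                          method = sample_rows$method)
print(method_by_sector)

# Share of each method within a sector (each row sums to 1).
method_share <- prop.table(method_by_sector, margin = 1)

# Per-sector vulnerability indicators: mean sensitivity and total records lost.
mean_sensitivity <- aggregate(data.sensitivity ~ sector,
                              data = sample_rows, FUN = mean)
total_records    <- aggregate(records.lost ~ sector,
                              data = sample_rows, FUN = sum)
sector_summary   <- merge(mean_sensitivity, total_records, by = "sector")
print(sector_summary)
```

On the full dataset, the same cross-tabulation could be computed per year to look for shifts in methods over time.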


Plot and UI Description of Dashboard

Plot 1: Breach Methods by Sector/Type of Data (Stacked Bar Chart with Gradient Colors)

Plot 2: Breach Severity by Sector/Type of Data (Diverging Stacked Bar Chart)

Plot 3: Records Lost by Sector/Type of Data (Stacked Bar Chart with 3D Effect)

Plot 4: Breach Frequency Over Time (Smoothed Line Plot with Animated Transitions)
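
To make the dashboard idea more concrete, here is a minimal sketch of how Plot 1 might be wired into a Shiny app with a year slider and method checkboxes. It is an illustration under stated assumptions, not the final dashboard: the `breaches` data frame contains made-up placeholder rows, the helper `plot_methods_by_sector()` and all input IDs are hypothetical names, and gradient colors, tooltips, and animation are left out.

```r
library(shiny)
library(ggplot2)

# Toy data; all rows are illustrative placeholders, not real breaches.
breaches <- data.frame(
  year   = c(2019, 2019, 2020, 2021, 2021, 2022, 2023, 2023),
  sector = c("health", "tech", "health", "finance",
             "tech", "government", "health", "tech"),
  method = c("hacked", "poor security", "hacked", "inside job",
             "hacked", "hacked", "lost device", "hacked"),
  stringsAsFactors = FALSE
)

# Plot 1 sketch: stacked bar chart of breach methods by sector.
plot_methods_by_sector <- function(data) {
  ggplot(data, aes(x = sector, fill = method)) +
    geom_bar(position = "stack") +
    labs(x = "Sector", y = "Number of breaches", fill = "Method")
}

ui <- fluidPage(
  titlePanel("Breach methods by sector (sketch)"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("years", "Year range", min = 2019, max = 2023,
                  value = c(2019, 2023), step = 1, sep = ""),
      checkboxGroupInput("methods", "Methods",
                         choices  = sort(unique(breaches$method)),
                         selected = unique(breaches$method))
    ),
    mainPanel(plotOutput("methods_plot"))
  )
)

server <- function(input, output, session) {
  # Re-filter the data whenever a control changes.
  filtered <- reactive({
    req(input$methods)
    d <- breaches[breaches$year >= input$years[1] &
                    breaches$year <= input$years[2] &
                    breaches$method %in% input$methods, ]
    validate(need(nrow(d) > 0, "No breaches match the current filters."))
    d
  })
  output$methods_plot <- renderPlot(plot_methods_by_sector(filtered()))
}

# Build the app object; launch it interactively with runApp(app).
app <- shinyApp(ui, server)
```

Keeping the plotting code in its own function means the same chart can be reused in the written report as well as in the dashboard.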


Repo organisation

The following folders comprise the project repository


Plan of Attack

| Task Name | Status | Assignee | Due | Priority | Summary |
| --- | --- | --- | --- | --- | --- |
| Dataset, question selection | Explored and finalized the dataset | Everyone | March 24th | High | Selected dataset and questions to be solved. |
| Brainstorming, proposal | Completed the proposal | Everyone | April 1st | High | Proposal ready for review. |
| Peer Evaluation | Completed | Noureen, Gowtham | April 8th | High | Discussed feedback and made required changes. |
| Index.qmd | Completed | Divya, Akash | April 15th | Medium | Worked on data visualization, plot Q1. |
| Dashboard | Completed | Abhishek | April 22nd | Medium | Worked on data visualization, plot Q2. |
| Write-up, Project | Completed | Divya, Laxmi | April 29th | High | Worked on project and write-up. |
| Working in PPT | Completed | Divya, Abhishek | May 5th | High | Completed entire project. |