The Power of Voters

Proposal

Dataset

US House Election Results

Variable Type Description
year integer Year in which election was held.
state_po character State name.
party character Party of the candidate.
candidate character Name of the candidate as it appears in the House Clerk report.
candidatevotes integer No. of votes the candidate received.
district character District number in state
runoff logical Second election held to determine a winner
special logical Special Election.
totalvotes integer Total Votes in District

Preview of dataset

# A tibble: 5 × 9
   year state_po party      candidate     candidatevotes district runoff special
  <dbl> <chr>    <chr>      <chr>                  <dbl> <chr>    <lgl>  <lgl>  
1  1976 AL       DEMOCRAT   BILL DAVENPO…          58906 001      FALSE  FALSE  
2  1976 AK       REPUBLICAN DON YOUNG              83722 000      FALSE  FALSE  
3  1976 CA       DEMOCRAT   ROBERT L LEG…          75844 004      FALSE  FALSE  
4  1976 FL       DEMOCRAT   GABRIEL CAZA…          80821 006      FALSE  FALSE  
5  1976 HI       REPUBLICAN FREDERICK RO…          53745 001      FALSE  FALSE  
# ℹ 1 more variable: totalvotes <dbl>

About the dataset

The dataset, US House Election Results sourced from MIT Election Data and Science Lab (MEDSL), offers a comprehensive overview of US House elections.

This dataset contains observations for elections held over 47 years from 1976 to 2022, encompassing a total of 32,452 recorded events. Each event is represented as a row with 20 attributes as columns. These columns provide details including the year, state, district, political party, candidate’s name, votes received, and various indicators such as whether it was a runoff election or if it was a write-in candidate.

Reason for choosing dataset

The US House Election dataset offers a rich and comprehensive source of data, encompassing multiple variables such as election years, state information, candidate details, and voting outcomes. This abundance of data provides ample opportunities for rigorous analysis and exploration within the scope of the academic project.

Questions

Question 1: How did US voting trends change in the 2016-2020 presidential term?

Question 2: How is the type of voting influenced by population and how does that influence who looses?

Analysis plan

For Question 1:

Introduction:

using the election data of 2016-2020 presidential term, we aim to filter the data for the most recent or a specific year of interest.Then to determine the wining party for each state by calculating the total votes as the winner. Finally, we create a a map of the USA is created to visualize the political landscape, coloring each state based on the wining party.

Variables Involved:

  • state_po: The two-letter postal abbreviation for the state.

  • party: The political party of the candidate.

  • year: Year of the election, focusing on the most recent general election or a specific year of interest.

  • candidatevotes: The number of votes the candidate received.

Analysis Approach:

  1. Filter Data: Data selection from the general elections (stage variable) for the most recent year or a specific year of interest. The analysis focuses on data from the general elections for the most recent year or a specific year of interest within the 2016-2020 presidential term. This data includes information such as the two-letter postal abbreviation for each state (state_po), the political party of the candidate (party), the year of the election (year), and the number of votes each candidate received (candidatevotes).

  2. Determine Winners: For each state, the analysis identifies the party with the highest aggregate candidate votes in the general election. This involves summing up the candidate votes for each party within each state and selecting the party with the highest total votes as the winner for that state.

  3. Visualization: Create a map of the USA, coloring each state based on the winning party. This visualization provides a clear depiction of the political landscape across the states for the selected election year, highlighting which party won each state.

For Question 2:

Introduction:

By merging the population data with existing election data to analyze how type of voting vary according to the population size in congressional districts. Using statistical approach to identify relationships between population and voting type, and also to determine which party tends to loose under certain conditions.

Variables Involved:

  • district: The congressional district number.

  • runoff : second election held to determine a winner

  • special : Special election

  • party: The political party of the candidate.

  • year: Year of the election, focusing on the most recent general election or a specific year of interest.

  • candidatevotes: The number of votes the candidate received.

  • candidate: Name of candidate

  • totalvotes: Total Votes in District

Variables to be Created:

  • population: This will require merging external data containing population

Analysis Approach:

  1. Merge External Data: The first step involves incorporating population data from a reliable source, such as the U.S. Census Bureau. This additional dataset will provide information on the population within each congressional district, which is essential for analyzing the relationship between population and voting type.

  2. Handling missing data or NA values: The dataset has various missing data points and NA values where data was either unable to be gathered, or there is simply no available data. To handle missing and NA values in discrete data, we as interpolation or imputation is not possible. For continuous data that is missing or NA, we will use mid-points, interpolation, or imputation to replace missing values. In the case that a row has missing continuous and discrete data, we will default to the rules we established for discrete data and remove the row entirely.

  3. Merge External Data: The first step involves incorporating population data from a reliable source, such as the U.S. Census Bureau. This additional dataset will provide information on the population within each congressional district, which is essential for analyzing the relationship between population and voting type.

  4. Analyze Voting Type by Population: Examine the relationship between population and the type of voting, using statistical analysis .how the type of voting (e.g., fusion ticket, runoff, special elections) varies with population size. This analysis could involve methods such as regression analysis, chi-square tests, or other appropriate statistical techniques to identify any significant relationships or patterns.

    i. Statistical analysis techniques:

    • Regression Analysis: We plan to use Logistic Regression to predict the probability of a party winning in a district based on variables such as population size, type of voting (e.g., fusion ticket, runoff, special elections), and demographic factors. Logistic regression is chosen because the outcome variable (win/lose) is binary. We will also explore Multiple linear regression models to understand the relationship between the number of votes received (continuous outcome variable) and predictors like population density and election type.

    • Chi-Square Tests: Chi-square tests will be utilized to examine the association between categorical variables, such as the presence of a fusion ticket and the winning party. This will help us identify if the distribution of election outcomes differs significantly across categories like election type (general, runoff, special) or district characteristics.

    • Time-Series Analysis: For a longitudinal examination of voting trends, we will implement time-series analysis to assess changes over multiple election cycles. This includes using ARIMA (Auto Regressive Integrated Moving Average) , Exponential smoothing models to forecast voting patterns based on historical data. Additionally, Holt-Winters Seasonal Method , Time-series decomposition will also be employed to identify seasonal effects, trends, and residuals in voting behaviors, which can be correlated with major political or social events.

    ii. Structured Comparative Analysis Over Time

    To systematically address changes in US voting trends over time, our approach will include:

    • Segmented Time-Series Analysis: We will divide the dataset into distinct time segments based on significant political or social milestones (e.g., changes in administration, major legislative changes, social movements). This will allow us to compare voting patterns before and after these events, providing insights into their impact on electoral outcomes.

    • Cross-Election Cycle Comparison: By comparing data across different election cycles, we aim to identify shifts in voter preferences, party dominance, and the effectiveness of various campaign strategies. This comparison will be visualized through line graphs and heat maps to illustrate trends and deviations over time.

    • Interaction Effects in Regression Models: To understand the complex interplay between different factors influencing voting trends, we will include interaction terms in our regression models. This will help us examine how the impact of one variable (e.g., population size) on voting outcomes might change under different conditions (e.g., during special elections versus general elections).

  5. Correlate with Election Outcomes: Focusing on which party tends to loose with certain characteristics.the next step is to assess how these factors correlate with election outcomes. Specifically, the focus will be on understanding which party tends to lose under certain conditions.

Visualization and Reporting:

  • Use scatter plots or heat maps to show the relationship between population and mode of voting.

  • Create comparative visualizations to illustrate how these factors influence election outcomes, potentially using bar charts and pie chart

LIMITATIONS:

Limited Demographic Data: Obtaining and processing extensive demographic data spanning 47 years pose significant challenges. As a result, only basic demographic information such as population size is included in the analysis. Detailed demographic variables like age, race, and income level is not included due to data availability constraints.

Limited Economic and Social Data: Similarly, accessing economic indicators, education levels, and other social factors at the district level for a 47-year period could also be challenging and so we are not including that in the analysis too