vignettes/introductory-walkthrough.Rmd
introductory-walkthrough.Rmd
library(precinctsopenelex)
library(tidyr)
library(stringr)
library(dplyr, warn.conflicts = FALSE)
library(janitor, warn.conflicts = FALSE)
library(readxl)
This vignette is intended to provide a walk-through of processing a set of precinct results from Saratoga County, NY, for the 2020 U.S. presidential election.
The sample dataset for Saratoga is included with the package to allow for exploration, and provides a template for how the data should be structured for the reshaping functions to operate.
First we’ll assign a state abbreviation and county name variable needed for the input file string
#chose state abbreviation and county name variable needed for the input file string
current_state <- "NY"
current_county <- "Saratoga"
Now we’ll use our first function from the package to create a name string for the pre-processed “wide” data file saved as Excel. The infile_string()
function can help us with this.
infile_string <- create_infile_string(current_state, current_county)
infile_string
#> [1] "NY_Saratoga/NY_Saratoga_GE20_cleaned.xlsx"
For the purposes of this walkthrough, instead of importing a new file we’ll use a sample dataset included with the package - precinctsampledata_ny
- to demonstrate what the pre-processed “wide” data should look like structure-wise when.
Let’s take a look at the sample data:
precinct | Joseph R. Biden - DEM | Donald J. Trump - REP | Donald J. Trump - CON | Joseph R. Biden - WOR | Howie Hawkins - GRE | Jo Jorgensen - LIB | Brock Pierce - IND | WriteIn | Blanks | Voids |
---|---|---|---|---|---|---|---|---|---|---|
Ballston 1 | 273 | 195 | 16 | 31 | 5 | 8 | 1 | 0 | 4 | 1 |
Ballston 2 | 485 | 412 | 58 | 18 | 7 | 20 | 6 | 0 | 5 | 2 |
Ballston 3 | 577 | 502 | 64 | 31 | 3 | 7 | 7 | 0 | 7 | 3 |
Ballston 4 | 428 | 316 | 48 | 16 | 3 | 11 | 5 | 0 | 3 | 4 |
Note the precise way of formatting the column names that will allow the function to work correctly:
– candidate columns should be written as “Candidate Name - Party Abbreviation” (e.g. “Joe Biden - DEM”)
– additional choices should be listed as “WriteIn” (write-in votes), “Blanks” (undervotes), and “Voids” (overvotes)
Now let’s run the function the package’s main function - reshape_precinct_data()
- to transform the dataset into the tidy/long format the OpenElections project needs, along with the correct standardized column names.
The reshape_precinct_data()
function wants:
– dataset (or nested import from file function)
– office: text label for office (e.g. “U.S. House”)
– district: text label for district (e.g. “42”; note that statewide offices have have no district, so use "")
Ok, now we’re ready to see the rubber meet the road.
Let’s run reshape_precinct_data()
on our sample data:
processed_prez <- reshape_precinct_data(precinctsampledata_ny,
"Presidential",
"")
processed_prez %>%
head(10) %>%
knitr::kable()
precinct | office | district | candidate | party | votes |
---|---|---|---|---|---|
Ballston 1 | Presidential | Joseph R. Biden | DEM | 273 | |
Ballston 1 | Presidential | Donald J. Trump | REP | 195 | |
Ballston 1 | Presidential | Donald J. Trump | CON | 16 | |
Ballston 1 | Presidential | Joseph R. Biden | WOR | 31 | |
Ballston 1 | Presidential | Howie Hawkins | GRE | 5 | |
Ballston 1 | Presidential | Jo Jorgensen | LIB | 8 | |
Ballston 1 | Presidential | Brock Pierce | IND | 1 | |
Ballston 1 | Presidential | Write-Ins | 0 | ||
Ballston 1 | Presidential | Blanks | 4 | ||
Ballston 1 | Presidential | Voids | 1 |
Bingo. We now have the results in the correct format, with the column names also matching the OpenElections naming conventions.
In a real-world scenario, there would be several races to process in addition to presidential: U.S Senate and House, along with state House and Senate. They can be done using the same manner.
As noted earlier, the dataframe fed to the function can a named R object table already imported in your script, or if you prefer you can also nest the import itself inside the first argument, such as this:
# not run:
# processed_prez <- process_ny_data(read_excel(filestring_import, sheet = "presidential"),
# "President",
# "")
In the end, once all a county’s races are prepared we’ll stitch them together into a combined file.
There are several ways to accomplish this. For this example we’ll use the code below to capture all dataframes present in the global environment that contain the word processed
in their names. Then we’ll append them all into a combined dataframe using the bind_rows()
function from dplyr.
With the dataset ready, it never hurts to run a few quick counts to check the integrity of what we’ve created.
Let’s do some quick visual inspections to see if anything strange or unexpected stands out, such as districts you don’t expect, or candidate names that shouldn’t be there or were parsed incorrectly.
party | n |
---|---|
588 | |
CON | 196 |
DEM | 196 |
GRE | 196 |
IND | 196 |
LIB | 196 |
REP | 196 |
WOR | 196 |
office | district | n |
---|---|---|
Presidential | 1960 |
candidate | n |
---|---|
Blanks | 196 |
Brock Pierce | 196 |
Donald J. Trump | 392 |
Howie Hawkins | 196 |
Jo Jorgensen | 196 |
Joseph R. Biden | 392 |
Voids | 196 |
Write-Ins | 196 |
Looks good! In our walkthrough, we’ve now successfully reshaped our precinct data to meet the OpenElections format and standardizations - it’s time to share it.
Now we’ll export the results to a csv file. The OpenElections project has a very specific way it prefers its file names to be structured. The function create_outfile_string()
provides a shortcut to creating that filename for 2020 general election files. It aims to put the file into a directory also named for the county.
Let’s use the state and county variables we set already at the very top of this walkthrough to feed into create_outfile_string()
.
outfile_string <- create_outfile_string(current_state, current_county)
outfile_string
#> [1] "NY_Saratoga/20201103__NY__general__saratoga__precinct.csv"
With that in place, we’ll take the final step of exporting the file.
# not run:
# write_csv(processed_combined, outfile_string, na = "")