IRE 2025 - Gathering and Cleaning Data in R

Author

Aaron Kessler (adapted from Sean Mussenden)

Published

June 20, 2025

Intro

What We’re Doing

In this hourlong session, we’re going to show you two things:

1. The many ways to get data in RStudio! * We’ll practice loading data from: a built in R dataset, an R package, a local csv, an Excel file, and an html table. If the wifi cooperates, we’ll also show how to load from a csv on the web and read in a Google Sheet. * I’ll introduce you to resources that will show you how to: pull data from an API (like the Census), and pull data from a database.

2. Some data cleaning tricks in R! * We’ll practice: cleaning up column names, correcting erroneous values, converting data types, manipulating strings.

What You’ll Get Out of This

  • You will leave here understanding some basic loading and cleaning data concepts in R, a list of helpful packages, and some resources for further learning.

Let’s go

Loading preloaded data from a package

If you’ve taken earlier classes in this sequence, you’ll have loaded Social Security Administration data on names of babies born across the last ~100 years. It comes built in with the ‘babynames’ package we loaded.

To refresh, let’s load that now and store it as an object called babynames_package_df.

We’re going to load the same data from different locations, using different methods, to show you how some of the many ways to get data into R.

# Load babynames
babynames_package_df <- babynames

# Show first 10 records
babynames_package_df %>%
  head(10)
# A tibble: 10 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132

Loading data from a local csv file

Loading data from a flat file – like a csv – stored on your local machine is a very common data loading task. For that, we’ll use the read_csv() function that’s part of the readr package that loads with the tidyverse.

We’re going to read in a sample of the babynames data that contains all of the names that start with “A”.

Store it as an object called “babynames_a_csv” (the name doesn’t matter, can be anything). The information inside of read_csv() is the filepath to the data, which is stored in the data folder.

babynames_a_csv <- read_csv("data/babynames_a.csv")
Rows: 187950 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): sex, name
dbl (3): year, n, prop

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Show first 10 records
babynames_a_csv %>%
  head(10)
# A tibble: 10 × 5
    year sex   name       n    prop
   <dbl> <chr> <chr>  <dbl>   <dbl>
 1  1880 F     Anna    2604 0.0267 
 2  1880 F     Alice   1414 0.0145 
 3  1880 F     Annie   1258 0.0129 
 4  1880 F     Ada      652 0.00668
 5  1880 F     Agnes    473 0.00485
 6  1880 F     Alma     277 0.00284
 7  1880 F     Addie    274 0.00281
 8  1880 F     Amanda   241 0.00247
 9  1880 F     Amelia   221 0.00226
10  1880 F     Amy      167 0.00171
# Let's look at how you can use RStudio's import wizard to help
# with files like this?

Loading data from a csv file on the internet

We can load that same csv from the internet, instead of doing it locally. We just need to change the information inside of the read_csv() function to pass a URL, instead of a location on our computer.

The CSV is up on Github.com, here: https://raw.githubusercontent.com/smussenden/nicar_2024_r3_loading_cleaning_data/main/data/babynames_a.csv

Warning: this may not work if the wifi is buggy!

babynames_a_csv_web <- read_csv("https://raw.githubusercontent.com/amkessler/ire_2025_r3_gathering_cleaning_data/refs/heads/main/data/babynames_a.csv")
Rows: 187950 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): sex, name
dbl (3): year, n, prop

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Show first 10 records
babynames_a_csv_web %>%
  head(10)
# A tibble: 10 × 5
    year sex   name       n    prop
   <dbl> <chr> <chr>  <dbl>   <dbl>
 1  1880 F     Anna    2604 0.0267 
 2  1880 F     Alice   1414 0.0145 
 3  1880 F     Annie   1258 0.0129 
 4  1880 F     Ada      652 0.00668
 5  1880 F     Agnes    473 0.00485
 6  1880 F     Alma     277 0.00284
 7  1880 F     Addie    274 0.00281
 8  1880 F     Amanda   241 0.00247
 9  1880 F     Amelia   221 0.00226
10  1880 F     Amy      167 0.00171

Loading data from an Excel file

Another data format you’ll likely see in the wild: Excel files. They’re a little trickier than csvs, because they can contain multiple sheets.

We can use the read_xlsx() function from the readxl package to load data from Excel files.

There’s an Excel file in the data folder called babynames_excel. It has two sheets, one with all of the “A” babynames, followed by one with all of the “B” babynames.

Let’s load the first sheet, the “A” babynames, as a dataframe and store it as an object called babynames_a_excel.

babynames_a_excel <- read_xlsx("data/babynames_excel.xlsx")

babynames_a_excel %>%
  head(10)
# A tibble: 10 × 5
    year sex   name       n    prop
   <dbl> <chr> <chr>  <dbl>   <dbl>
 1  1880 F     Anna    2604 0.0267 
 2  1880 F     Alice   1414 0.0145 
 3  1880 F     Annie   1258 0.0129 
 4  1880 F     Ada      652 0.00668
 5  1880 F     Agnes    473 0.00485
 6  1880 F     Alma     277 0.00284
 7  1880 F     Addie    274 0.00281
 8  1880 F     Amanda   241 0.00247
 9  1880 F     Amelia   221 0.00226
10  1880 F     Amy      167 0.00171
# Let's look at how you can use RStudio's import wizard for Excel, too?

Because the “A” names sheet is the first sheet in the Excel file, the function automatically loads the “A” sheet.

But if we need to get any other sheet, like the one containing the “B” names, we’ll have to get it by adding another argument to the read_xlsx function.

The sheet is called “babynames_b”.

How can we load Sheet B? Let’s take a look at the handy help page for the readxl package.

  • You can either type ?read_xlsx in the console and look at what pops up on the right.
  • Or you can visit https://readxl.tidyverse.org/reference/read_excel.html.

Can you see how to read in a different sheet from the first?

Let’s modify the read_xlsx function below to load the sheet “babynames_b”.

babynames_b_excel <- read_xlsx("data/babynames_excel.xlsx")

Loading data from an HTML table

On this page is an HTML table showing the top 2022 baby names. https://www.ssa.gov/oact/babynames/. Because the wifi is often buggy, I’ve saved an HTML file locally of this page, which we’ll work with instead. It’s in the data folder.

We can use the inspector in our web browser to see this. It’s composed of html tags.

Using the rvest package we’re going to first read in the entire html page using the read_html() function.

# If we were going to pull directly from the web, instead of from a local file, this is what it will the function would look 

# top_2022_baby_names <- read_html("https://www.ssa.gov/oact/babynames/") 

# Instead, we're gonna load a local html file to simulate the act of scraping from the web
top_2022_baby_names <- read_html("data/html/ssa_baby_names.html")

Open it up in the environment window. It’s the full HTML of the page. Next, we’ll use rvest to extract the html table we want using html_table()

top_2022_baby_names <- read_html("data/html/ssa_baby_names.html") %>%
  html_table()

Open it up in the environment window. We’ve isolated that one html table and turned it into a dataframe, albeit one that is nested in a list. Let’s extract it from the list.

# Load html and extract table
top_2022_baby_names <- read_html("data/html/ssa_baby_names.html") %>%
  html_table()

# Pluck out the table
top_2022_baby_names <- top_2022_baby_names[[1]]

# display it
top_2022_baby_names
# A tibble: 10 × 3
    Rank `Male name` `Female name`
   <int> <chr>       <chr>        
 1     1 Liam        Olivia       
 2     2 Noah        Emma         
 3     3 Oliver      Charlotte    
 4     4 Elijah      Amelia       
 5     5 James       Ava          
 6     6 William     Sophia       
 7     7 Benjamin    Isabella     
 8     8 Lucas       Mia          
 9     9 Henry       Evelyn       
10    10 Theodore    Harper       

Never web scraped before? Now you have! A warning that web scraping does get more challenging from here.

Loading data from Google Sheets

We can also load data directly from Google Sheets, using the read_sheet() function from the googlesheets4 package. Google Sheets, like Excel files, can also have multiple workbooks.

I’ve loaded up a Google Sheet workbook here with two sheets, one for “A” babynames and one for “B” babynames: https://docs.google.com/spreadsheets/d/1GG_RmYPGCNKbLb4x1dkQv1B9NjDBF0kQRtge9Aea4rs. These samples only have 1,000 records in them.

To read in the sheet, we need to paste in the sheet ID, which is the long sting of numbers and letters at the end of the URL: “1GG_RmYPGCNKbLb4x1dkQv1B9NjDBF0kQRtge9Aea4rs”

Because the “A” sheet comes first, it automatically loads that sheet.

Before we get there, though, we’ll need to perform an authentication with Google. Look down in the console, select an account or hit 1 to start the process of in-browser authentication.

gs4_auth()
! Using an auto-discovered, cached token.
  To suppress this message, modify your code or options to clearly consent to
  the use of a cached token.
  See gargle's "Non-interactive auth" vignette for more details:
  <https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googlesheets4 package is using a cached token for
  'kessler.aaron@gmail.com'.
# If you've already authenticated once and want to tell R to use your saved choice
# without being prompted to enter its number on the list each time, you can do:
# gs4_auth("youremailaddress@gmail.com")

# Let's talk about the even better way to do this - storing your email value in .Renviron

# What is .REnviron??
# Run this line in the console:
# usethis::edit_r_environ()

# Now set the value to pull your locally stored value:
# gs4_auth(Sys.getenv("GMAIL_ADDRESS"))

Now we can read in the sheet.

babynames_a_sheet <- read_sheet("1GG_RmYPGCNKbLb4x1dkQv1B9NjDBF0kQRtge9Aea4rs")
✔ Reading from "babynames_excel_sample".
✔ Range 'babynames_a'.

If we want to load any other sheet, we need to tell it which one to load.

babynames_b_sheet <- read_sheet("1GG_RmYPGCNKbLb4x1dkQv1B9NjDBF0kQRtge9Aea4rs", sheet = "babynames_b")
✔ Reading from "babynames_excel_sample".
✔ Range ''babynames_b''.

Your Turn Option 1

Let’s revisit what we just talked about regarding the .Renviron variables. Could we do the same for the Google Sheet reference string? So we didn’t store it in a public repository where anyone could see where our sheet lived, but still allow for the functionality of it.

Try creating a new variable in your .Renviron file with the sheet reference and using it to load babynames_b…

In addition to loading data from Google Sheets, you can also WRITE data from R Studio to Google Sheets. This is such a great way to share data you’ve analyzed with other, less tech-savvy members of your team. More on that here: https://googlesheets4.tidyverse.org/index.html.

Your Turn Option 2

This page https://www.ssa.gov/oact/babynames/state/top5_2021.html, which I’ve also stored locally, contains tables of the top 5 birth names for 2021 by state, one for male and one for female.

Using what you just learned, I’d like you to take a stab at ingesting the html of this page, and saving each of the tables as a separate data frame.

# local path to the data 
# "data/html/baby_names_2021_state.html"

Other ways of getting data

There are lots of other ways to bring data into R. We don’t have to delve into all of them today, but you should be aware of them for continued learning.

  • Pull data from an SQL database with the dbplyr package. dbplyr

  • Get data from an API endpoint - You can access data available through pretty much any API, or application programming interface, by sending queries and getting back a response, using the excellent HTTR package. An example of an API: Spotify data

  • Get data from packages the R community has built to load and work with specific kinds of data, including some nice helper functions. If the dataset you want to use is popular, there’s a great change someone has built an R package for it! Some great examples: Tidycensus: for loading and working with all manner of U.S. Census Data. RTweet: for working with Twitter data

Clear enivronment

Before moving on to the data cleaning section, let’s clean up all those dataframes in our environment with this handy bit of code. You’ll rarely do this in practice in your reproducible code, but for now it helps remove some of the clutter for our demonstration.

rm(list=ls())


Data Cleaning

Data cleaning is a critical step in preparing data for analysis. Everyone wants to jump right to the fun stuff – me too! – but failing to account for flaws in the data can lead to errors, which can lead to corrections.

To start, we’re going to load a dataframe with info on year-by-year totals for the most popular female baby names from 2008 to 2017.

I’ve intentionally made this very ugly. Keeping with our theme, someone needs to change this dataframe’s diaper!

babynames_dirty <- read_rds("data/babynames_dirty.rds")

Identify Issues

Examine the data in the environment window.

What problems do you see? Don’t scroll down until you think you’ve found one or two of the problems.

Are you sure you can’t find more?

Are you sure? Ok. Here’s what I see.

  • EMPTY ROWS - There’s a totally empty row of NAs. Let’s get rid of it. It could prevent proper summarizing.
  • EMPTY COLUMN - There’s a totally empty column called source. Let’s get rid of it, because it’s unneeded.
  • INCONSISTENT VALUES - The “sex” column has inconsistent values for Female vs F. This could hurt our ability to group and summarize properly.
  • STYLING INCONSISTENCIES - There are lots of issues in “name_at_birth” column. MADISON is upper. elizabeth is lower. Abigail has “First Name: Abigail” instead of just Abigail. Everything else is title case. This is just…ugly.
  • DUPLICATE ROWS - There are two apparently identical rows for Emma and Amelia (or are they?). We should really only have one, and could lead to misleading summarization if we kept both.
  • NUMBERS STORED AS CHARACTER STRINGS - The 2015 column has a random underscore appended to the end of the number, unlike every other year column. The fact that it’s a mix of characters and numbers means the column is stored as a “character”, while all of the other year columns are numbers (integers). If we want to calculate say, the percent change in a name between 2008 and 2015, we’ll get an error.
  • BAD COLUMN NAMES - Less obvious, but still annoying: the fact that the year columns start with a number means they’ll be harder to include in our code.

Let’s examine each of these, while fixing them.

Our goal will be one nice long chained function to fix all the problems, so we have a clear record of what we’ve done.

Remove empty rows and columns

First, let’s get rid of that totally NA row and column. The janitor package has a nice function for getting rid of empty rows and columns

babynames_clean <- babynames_dirty %>%
  remove_empty()
value for "which" not specified, defaulting to c("rows", "cols")
# just empty rows? remove_empty(which="rows")
# just empty columns? remove_empty(which="columns")
# both rows and columns? remove_empty()

Remove duplicates

Right now, we have two identical rows with values for “Amelia” and “Emma”. We can remove those by using the distinct() function, which looks for duplicate values and keeps only one.

babynames_clean <- babynames_dirty %>%
  remove_empty() %>%
  distinct()
value for "which" not specified, defaulting to c("rows", "cols")

That appears to have removed the duplicate “Amelia” rows, but not the duplicate “Emma” rows.

Why?

Before scrolling down, see if you can figure it out.

If we examine the Emma row closely, we’ll see that one has extra whitespace on both sides of the name, and the other doesn’t. That means the rows aren’t EXACTLY equal.

We can fix that with our first of many string functions, from the stringr package, which has lots of helpful functions for cleaning character strings. The function we’ll use is called str_trim() and it’s good at trimming extra whitespace.

babynames_clean <- babynames_dirty %>%
  remove_empty() %>%
  mutate(name_at_birth = str_trim(name_at_birth, side = "both")) %>%
  distinct() 
value for "which" not specified, defaulting to c("rows", "cols")

Now when we run it again, distinct() gets rid of the extra Emma row because the difference has been fixed.

Clean up names column

There are still lots of issues with the “name_at_birth” column.

MADISON is UPPERCASE; elizabeth is lowercase; everything else is Title Case.

Let’s fix that first with a function called str_to_upper(), which will convert everything to upper case.

babynames_clean <- babynames_clean %>%
  mutate(
    name_at_birth = str_trim(name_at_birth),
    name_at_birth = str_to_upper(name_at_birth)
    ) 

head(babynames_clean)
# A tibble: 6 × 12
  sex    name_at_birth `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015`
  <chr>  <chr>          <int>  <int>  <int>  <int>  <int>  <int>  <int> <chr> 
1 F      AMELIA          4346   4695   5455   6368   7242   8032   8789 9852_ 
2 F      ARIA             540    592    903   1980   3228   5111   5918 6403_ 
3 Female AVA            17035  15867  15429  15496  15529  15249  15688 16361_
4 F      AVERY           5829   6298   6673   7340   8314   9174   9563 9339_ 
5 Female CHARLOTTE       3669   4187   5357   6427   7476   9297  10111 11390_
6 F      ELIZABETH      12002  11029  10260  10084   9703   9448   9557 9724_ 
# ℹ 2 more variables: `2016` <int>, `2017` <int>

Before we move on from names column

Before we move on, let’s suppose we wanted to make all the names lowercase or uppercase instead. What do we think a function to do that might be called? If you start typing str_to in the console, you might get a clue.

Fixing Abigail

The next issue will be a bit more challenging, fixing Abigail. Her name has extra stuff – “First Name: Abigail” – that the others don’t have. We can remove that extra stuff with str_remove().

babynames_clean <- babynames_clean %>%
  mutate(name_at_birth = str_remove(name_at_birth, "FIRST NAME: "))

head(babynames_clean)
# A tibble: 6 × 12
  sex    name_at_birth `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015`
  <chr>  <chr>          <int>  <int>  <int>  <int>  <int>  <int>  <int> <chr> 
1 F      AMELIA          4346   4695   5455   6368   7242   8032   8789 9852_ 
2 F      ARIA             540    592    903   1980   3228   5111   5918 6403_ 
3 Female AVA            17035  15867  15429  15496  15529  15249  15688 16361_
4 F      AVERY           5829   6298   6673   7340   8314   9174   9563 9339_ 
5 Female CHARLOTTE       3669   4187   5357   6427   7476   9297  10111 11390_
6 F      ELIZABETH      12002  11029  10260  10084   9703   9448   9557 9724_ 
# ℹ 2 more variables: `2016` <int>, `2017` <int>

Standardize “Sex” Column

We can use one another string function, str_sub() to standardize the “sex” column. This function in the block below says, “Starting at the left side of the sex column, keep everything from the first character on the left to…the first character on the left and Get rid of everything else.”
Let’s explore string functions!

babynames_clean %>%
  mutate(sex = str_sub(sex, start = 1L, end = 1L))
# A tibble: 20 × 12
   sex   name_at_birth `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015`
   <chr> <chr>          <int>  <int>  <int>  <int>  <int>  <int>  <int> <chr> 
 1 F     AMELIA          4346   4695   5455   6368   7242   8032   8789 9852_ 
 2 F     ARIA             540    592    903   1980   3228   5111   5918 6403_ 
 3 F     AVA            17035  15867  15429  15496  15529  15249  15688 16361_
 4 F     AVERY           5829   6298   6673   7340   8314   9174   9563 9339_ 
 5 F     CHARLOTTE       3669   4187   5357   6427   7476   9297  10111 11390_
 6 F     ELIZABETH      12002  11029  10260  10084   9703   9448   9557 9724_ 
 7 F     ELLA            9349   9650   9872   9586   9198   8444   8547 7896_ 
 8 F     EMILY          17429  15345  14268  14251  13646  13124  12642 11780_
 9 F     EMMA           18809  17892  17338  18797  20934  20936  20924 20435_
10 F     EVELYN          5098   5583   5840   6707   6876   7659   8739 9358_ 
11 F     ABIGAIL        15081  14387  14243  13247  12697  12405  12076 12390_
12 F     HARPER          1128   1903   2631   4674   7188   8277   9606 10291_
13 F     ISABELLA       18612  22298  22905  19901  19093  17631  17068 15594_
14 F     MADISON        17027  15235  13176  12381  11392  10591  10315 10087_
15 F     MIA            10168  11423  10637  11527  12019  13147  13506 14892_
16 F     OLIVIA         17078  17433  17022  17321  17310  18414  19791 19669_
17 F     SCARLETT        1620   1920   2716   3558   4042   5050   6003 7125_ 
18 F     SOFIA           5982   5690   6325   7327   7831   9157   9591 9684_ 
19 F     SOPHIA         16081  16941  20639  21837  22304  21213  18598 17402_
20 F     VICTORIA        7118   6359   6226   6888   6861   7202   7991 7606_ 
# ℹ 2 more variables: `2016` <int>, `2017` <int>

Before we move on from sex

What happens if you change the start and end numbers? How does the output change? Try it now.

Now we’ll save our original desired result of just one character being retained.

babynames_clean <- babynames_clean %>%
  mutate(
    sex = str_sub(sex, start = 1L, end = 1L)
    )

Fix 2015 column

The 2015 column is different from the other columns, with an underscore at the end. If we want to do math to that column, we’ll have to fix it. Here, again, we can use str_remove(). Uncomment this out and then try to run it…

# babynames_clean <- babynames_clean %>%
#   mutate(
#     2015 = str_remove(2015,"_")
#     )

But we get an error! Why? Because R hates it when you start column names with numbers. You could wrap it in tick marks when you use it, like 2015, but that’s pretty annoying.

Instead, we can use the helpful clean_names() function from the janitor package to standardize it. If our column names had spaces, or capital letters, it would fix that too.

babynames_clean <- babynames_clean %>%
  clean_names()

head(babynames_clean)
# A tibble: 6 × 12
  sex   name_at_birth x2008 x2009 x2010 x2011 x2012 x2013 x2014 x2015  x2016
  <chr> <chr>         <int> <int> <int> <int> <int> <int> <int> <chr>  <int>
1 F     AMELIA         4346  4695  5455  6368  7242  8032  8789 9852_  10753
2 F     ARIA            540   592   903  1980  3228  5111  5918 6403_   6927
3 F     AVA           17035 15867 15429 15496 15529 15249 15688 16361_ 16283
4 F     AVERY          5829  6298  6673  7340  8314  9174  9563 9339_   8759
5 F     CHARLOTTE      3669  4187  5357  6427  7476  9297 10111 11390_ 13080
6 F     ELIZABETH     12002 11029 10260 10084  9703  9448  9557 9724_   9542
# ℹ 1 more variable: x2017 <int>

This is one of my favorite functions to use on almost any dataset I get.
For columns that contain only numbers, it puts an x at the start of the name, so now we can get to cleaning 2015.

babynames_clean <- babynames_clean %>%
  mutate(
    x2015 = str_remove(x2015, "_")
    )

Okay, it looks like a proper number now. But if we look under the hood, at the data type of each column, we’ll see it hasn’t been fixed. We can do that with glimpse()

glimpse(babynames_clean)
Rows: 20
Columns: 12
$ sex           <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "…
$ name_at_birth <chr> "AMELIA", "ARIA", "AVA", "AVERY", "CHARLOTTE", "ELIZABET…
$ x2008         <int> 4346, 540, 17035, 5829, 3669, 12002, 9349, 17429, 18809,…
$ x2009         <int> 4695, 592, 15867, 6298, 4187, 11029, 9650, 15345, 17892,…
$ x2010         <int> 5455, 903, 15429, 6673, 5357, 10260, 9872, 14268, 17338,…
$ x2011         <int> 6368, 1980, 15496, 7340, 6427, 10084, 9586, 14251, 18797…
$ x2012         <int> 7242, 3228, 15529, 8314, 7476, 9703, 9198, 13646, 20934,…
$ x2013         <int> 8032, 5111, 15249, 9174, 9297, 9448, 8444, 13124, 20936,…
$ x2014         <int> 8789, 5918, 15688, 9563, 10111, 9557, 8547, 12642, 20924…
$ x2015         <chr> "9852", "6403", "16361", "9339", "11390", "9724", "7896"…
$ x2016         <int> 10753, 6927, 16283, 8759, 13080, 9542, 7898, 10957, 1947…
$ x2017         <int> 11800, 7132, 15902, 8186, 12893, 8915, 8014, 9746, 19738…

x2015 is still a “character”. If we try to do math to it, it will error. So let’s change the datatype to a number using as.integer().

babynames_clean <- babynames_clean %>%
  mutate(
    x2015 = as.integer(x2015)
    )

When we glimpse the data, it’s been converted to the correct datatype.

glimpse(babynames_clean)
Rows: 20
Columns: 12
$ sex           <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "…
$ name_at_birth <chr> "AMELIA", "ARIA", "AVA", "AVERY", "CHARLOTTE", "ELIZABET…
$ x2008         <int> 4346, 540, 17035, 5829, 3669, 12002, 9349, 17429, 18809,…
$ x2009         <int> 4695, 592, 15867, 6298, 4187, 11029, 9650, 15345, 17892,…
$ x2010         <int> 5455, 903, 15429, 6673, 5357, 10260, 9872, 14268, 17338,…
$ x2011         <int> 6368, 1980, 15496, 7340, 6427, 10084, 9586, 14251, 18797…
$ x2012         <int> 7242, 3228, 15529, 8314, 7476, 9703, 9198, 13646, 20934,…
$ x2013         <int> 8032, 5111, 15249, 9174, 9297, 9448, 8444, 13124, 20936,…
$ x2014         <int> 8789, 5918, 15688, 9563, 10111, 9557, 8547, 12642, 20924…
$ x2015         <int> 9852, 6403, 16361, 9339, 11390, 9724, 7896, 11780, 20435…
$ x2016         <int> 10753, 6927, 16283, 8759, 13080, 9542, 7898, 10957, 1947…
$ x2017         <int> 11800, 7132, 15902, 8186, 12893, 8915, 8014, 9746, 19738…

Bonus: Conditionals (and more string functions)

If we have time, we can delve more deeply into playing with string functions, as well as introducing you to conditionals (known as “if-then” statements).

We’ll use data from this Excel file to play with…

string_function_df <- read_excel("data/AdvancedExcel_StringFunctions_IfStatements.xlsx",
                                  sheet = "StringFunctions") %>% 
                      clean_names() %>% 
                      select(year, player)
New names:
• `` -> `...11`
basic_if_df <- read_excel("data/AdvancedExcel_StringFunctions_IfStatements.xlsx",
                          sheet = "BasicIF") %>% 
                clean_names() %>% 
                select(week_num:home_score)

if_with_calcs_df <- read_excel("data/AdvancedExcel_StringFunctions_IfStatements.xlsx",
                               sheet = "IFwithCalcs", skip = 3) %>% 
                    clean_names() %>% 
                    select(lastname:salary)
# Employees are getting a 1% raise, but all will get a minimum raise of $350

Basic IF-ELSE statement

Let’s take a look at some data on football scores in NFL games.

basic_if_df %>% 
  head(10)
# A tibble: 10 × 5
   week_num visit_team visit_score home_team home_score
      <dbl> <chr>            <dbl> <chr>          <dbl>
 1        1 NYJ                 13 WAS               16
 2        1 ARI                 24 DET               42
 3        1 DEN                 30 CIN               10
 4        1 IND                  9 CLE                6
 5        1 BAL                 15 PIT               34
 6        1 HOU                 21 MIA               20
 7        1 JAC                 23 CAR               24
 8        1 MIN                 30 GB                25
 9        1 NE                   0 BUF               31
10        1 SD                  14 KC                27

You’ll encounter a situation like this all the time: the dataset you have doesn’t contain certain columns you wish it had.

But does it include enough information to create the columns you wished were there? In this case, we have weekly NFL matchups with the scores of each team, but we don’t have:

  • whether the home or visiting team won

  • which team won the game

This is where If-Then statements, also known as If-Else, can help us out. Let’s see how - we’ll use dplyr’s if_else() function to set up a conditional. For this example, we’ll assume there won’t be any ties.

First let’s create a column to tell us which the home or visiting team won.

basic_if_df %>% 
   mutate(
     winner_home_or_visitor = if_else(visit_score > home_score,
                      "visitor",
                      "home")
   )
# A tibble: 256 × 6
   week_num visit_team visit_score home_team home_score winner_home_or_visitor
      <dbl> <chr>            <dbl> <chr>          <dbl> <chr>                 
 1        1 NYJ                 13 WAS               16 home                  
 2        1 ARI                 24 DET               42 home                  
 3        1 DEN                 30 CIN               10 visitor               
 4        1 IND                  9 CLE                6 visitor               
 5        1 BAL                 15 PIT               34 home                  
 6        1 HOU                 21 MIA               20 visitor               
 7        1 JAC                 23 CAR               24 home                  
 8        1 MIN                 30 GB                25 visitor               
 9        1 NE                   0 BUF               31 home                  
10        1 SD                  14 KC                27 home                  
# ℹ 246 more rows

Above, we effectively hard-coded our choices as to what would be displayed. But you can also pull from another column as to what you want to show - let’s see how that can help give us the winner of each game.

basic_if_df %>% 
   mutate(
     winner = if_else(visit_score > home_score,
                      visit_team,
                      home_team)
   )
# A tibble: 256 × 6
   week_num visit_team visit_score home_team home_score winner
      <dbl> <chr>            <dbl> <chr>          <dbl> <chr> 
 1        1 NYJ                 13 WAS               16 WAS   
 2        1 ARI                 24 DET               42 DET   
 3        1 DEN                 30 CIN               10 DEN   
 4        1 IND                  9 CLE                6 IND   
 5        1 BAL                 15 PIT               34 PIT   
 6        1 HOU                 21 MIA               20 HOU   
 7        1 JAC                 23 CAR               24 CAR   
 8        1 MIN                 30 GB                25 MIN   
 9        1 NE                   0 BUF               31 BUF   
10        1 SD                  14 KC                27 KC    
# ℹ 246 more rows

If-Else with calculations

How might you try and solve a problem that requires doing a little math?
Consider the data below and think about how you might come up with a way to determine each employee’s new salary if the rules are:

Employees are getting a 1% raise, but all will get a minimum raise of $350.

Let’s talk about ideas for tackling this.

if_with_calcs_df
# A tibble: 568 × 5
   lastname  firstname department            yrs_exp salary
   <chr>     <chr>     <chr>                   <dbl>  <dbl>
 1 RONAN     GARY      BUILDING MAINTENANCE        1  10500
 2 CASPER    KEVIN     CAPROP                      1  10500
 3 DEBROUX   EMILY     COMMUNICATIONS CENTER       1  10500
 4 THAO      PANG      COMMUNICATIONS CENTER       1  10500
 5 RYDBERG   LINDA     COMMUNICATIONS CENTER       1  10500
 6 ORCUTT    SARAH     COMMUNICATIONS CENTER       1  10500
 7 WAKEFIELD BONNIE    COMMUNICATIONS CENTER       1  10500
 8 SALKOWICZ AMANDA    COMMUNICATIONS CENTER       1  10500
 9 ZASADA    RICHARD   IMPOUND LOT                 1  10500
10 GLENNON   NANCY     PAYROLL                     1  10500
# ℹ 558 more rows

Good luck and go get ’em in your R journey!