Module 6
Tidyverse Family of Packages
The data frame is a key data structure in statistics and in R. The basic structure of a data frame is that there is one observation per row and each column represents a variable, a measure, feature, or characteristic of that observation. Before you can conduct any analyses or draw any conclusions, you often need to reorganize your data. The Tidyverse is a collection of R packages (developed by RStudio’s chief scientist Hadley Wickham) that provides an efficient, fast, and well-documented workflow for general data modeling, wrangling, and visualization tasks.
The Tidyverse introduces a set of useful data analysis packages to help streamline your work in R. In particular, the Tidyverse was designed to address the top three common issues that arise when dealing with data analysis in base R: (1) Results obtained from a base R function often depend on the type of data being used; (2) When R expressions are used in a non-standard way, they can confuse beginners; (3) Hidden arguments often have various default operations that beginners are unaware of.
The core Tidyverse includes the packages that you’re likely to use in everyday data analyses.
ggplot2
- ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.dplyr
- dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.tidyr
- tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.readr
- readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.purrr
- purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive.tibble
- tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code.stringr
- stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.forcats
- forcats provides a suite of useful tools that solve common problems with factors. R uses factors to handle categorical variables, variables that have a fixed and known set of possible values.
The Tidyverse also includes many other packages with more specialized usage. They are not loaded automatically with Tidyverse, so you’ll need to load each one with its own call.
To install the Tidyverse packages run the following code in the console:
install.packages("tidyverse")
Now the Tidyverse is available in R, but it is not activated yet. Whenever you start a new R session and plan to use the Tidyverse packages, you will need to activate the package by calling the library(tidyverse)
function in the console:
library(tidyverse)
#> ── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.2
#> ── Conflicts ────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We will start learning the Tidyverse family of packages by introducing the dplyr
package.
We will be working with a nyc_flights
data set that provides information about flights departed New York City in 2013 (the data set is available on Courseworks). It contains 336 776 observations (rows) and 19 variables (columns). Let’s import this data set into R:
flights <- read.csv(file = "C:/Users/alexp/OneDrive/Desktop/R Bootcamp/R_bootcamp/nyc_flights.csv", header = T)
Let’s convert our data frame into a tibble
data frame (don’t worry about this function; we use it in this module just for a better representation of results):
flights <- as_tibble(flights)
dplyr Package
As mentioned earlier, dplyr
provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges such as selecting important variables, filtering out key observations, creating new variables, computing summaries, and so on.
In this module you are going to learn the key dplyr functions that allow you to solve the vast majority of your data manipulation challenges. All of the functions that we will discuss have a few common characteristics. In particular,
The first argument is a data frame
The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names)
The return result of a function is a new data frame
dplyr aims to provide a function for each basic verb of data manipulation. These verbs can be organised into three categories based on the component of the data set that they work with:
-
Rows:
-
Columns:
-
select()
- changes whether or not a column is included -
rename()
- changes the name of columns -
mutate()
- changes the values of columns and creates new columns -
relocate()
- changes the order of the columns
-
-
Groups of rows:
-
group_by()
- changes the scope of each function from operating on the entire data set to operating on it group-by-group -
summarize()
- collapses a group into a single row
-
filter() Function
filter()
allows you to subset observations based on their values. The first argument is the name of the data frame, the second and subsequent arguments are the expressions that filter the data frame. For instance, let’s select all flights on January 1st:
filter(flights, month == 1, day == 1)
#> # A tibble: 842 × 19
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 832 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
When you run that line of code, dplyr executes the filtering operation and returns a new data frame. dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-
:
jan1 <- filter(flights, month == 1, day == 1)
Let’s find all flights that departed in November or December:
filter(flights, month == 11 | month == 12)
#> # A tibble: 55,403 × 19
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 11 1 5 2359 6
#> 2 2013 11 1 35 2250 105
#> 3 2013 11 1 455 500 -5
#> 4 2013 11 1 539 545 -6
#> 5 2013 11 1 542 545 -3
#> 6 2013 11 1 549 600 -11
#> 7 2013 11 1 550 600 -10
#> 8 2013 11 1 554 600 -6
#> 9 2013 11 1 554 600 -6
#> 10 2013 11 1 554 600 -6
#> # ℹ 55,393 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
We could do the same operation using the %in%
operator:
filter(flights, month %in% c(11, 12))
#> # A tibble: 55,403 × 19
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 11 1 5 2359 6
#> 2 2013 11 1 35 2250 105
#> 3 2013 11 1 455 500 -5
#> 4 2013 11 1 539 545 -6
#> 5 2013 11 1 542 545 -3
#> 6 2013 11 1 549 600 -11
#> 7 2013 11 1 550 600 -10
#> 8 2013 11 1 554 600 -6
#> 9 2013 11 1 554 600 -6
#> 10 2013 11 1 554 600 -6
#> # ℹ 55,393 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
slice() Function
slice()
function allows you to index rows by their (integer) locations. It can select, remove, and duplicate rows.
For instance, let’s get observations from rows 5 through 10:
slice(flights, 5:10)
#> # A tibble: 6 × 19
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 1 554 600 -6
#> 2 2013 1 1 554 558 -4
#> 3 2013 1 1 555 600 -5
#> 4 2013 1 1 557 600 -3
#> 5 2013 1 1 557 600 -3
#> 6 2013 1 1 558 600 -2
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
Let’s select all rows except the first four (this option can be used to drop some observations from a data set):
slice(flights, -(1:4))
#> # A tibble: 336,772 × 19
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 1 554 600 -6
#> 2 2013 1 1 554 558 -4
#> 3 2013 1 1 555 600 -5
#> 4 2013 1 1 557 600 -3
#> 5 2013 1 1 557 600 -3
#> 6 2013 1 1 558 600 -2
#> 7 2013 1 1 558 600 -2
#> 8 2013 1 1 558 600 -2
#> 9 2013 1 1 558 600 -2
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,762 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
Similar to head()
and tail()
functions, slice_head()
and slice_tail()
can be used to display top and bottom rows in the data set, respectively. Let’s print first and last 3 rows in the flights data set:
slice_head(flights, n = 3)
#> # A tibble: 3 × 19
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
slice_tail(flights, n = 3)
#> # A tibble: 3 × 19
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 9 30 NA 1210 NA
#> 2 2013 9 30 NA 1159 NA
#> 3 2013 9 30 NA 840 NA
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
Use the slice_sample()
function to randomly select rows. Use the option prop
to choose a certain proportion of the cases:
slice_sample(flights, n = 10)
#> # A tibble: 10 × 19
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 12 21 939 825 74
#> 2 2013 1 13 115 2000 315
#> 3 2013 5 6 1553 1557 -4
#> 4 2013 6 3 751 755 -4
#> 5 2013 6 22 712 715 -3
#> 6 2013 11 19 624 629 -5
#> 7 2013 7 9 557 600 -3
#> 8 2013 9 8 1553 1600 -7
#> 9 2013 9 21 555 600 -5
#> 10 2013 12 1 1451 1439 12
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
slice_sample(flights, prop = 0.001)
#> # A tibble: 336 × 19
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 5 29 754 800 -6
#> 2 2013 3 22 2053 2030 23
#> 3 2013 8 30 1254 1300 -6
#> 4 2013 7 10 2212 1820 232
#> 5 2013 3 18 1254 1238 16
#> 6 2013 10 26 1206 1206 0
#> 7 2013 1 30 600 600 0
#> 8 2013 9 17 1523 1530 -7
#> 9 2013 5 24 1749 1728 21
#> 10 2013 8 29 1343 1310 33
#> # ℹ 326 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
Use replace = TRUE
to take a sample with replacement.
arrange() Function
The arrange()
function is used to change the order of rows in a data set. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day)
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
Use desc()
to re-order by a column in descending order:
arrange(flights, desc(dep_delay))
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 9 641 900 1301
#> 2 2013 6 15 1432 1935 1137
#> 3 2013 1 10 1121 1635 1126
#> 4 2013 9 20 1139 1845 1014
#> 5 2013 7 22 845 1600 1005
#> 6 2013 4 10 1100 1900 960
#> 7 2013 3 17 2321 810 911
#> 8 2013 6 27 959 1900 899
#> 9 2013 7 22 2257 759 898
#> 10 2013 12 5 756 1700 896
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
select() Function
Often you work with large data sets with many columns but only a few are actually of interest to you. select()
function allows you to rapidly zoom in on a useful subset. You can select columns by name:
select(flights, year, month, day)
#> # A tibble: 336,776 × 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> 7 2013 1 1
#> 8 2013 1 1
#> 9 2013 1 1
#> 10 2013 1 1
#> # ℹ 336,766 more rows
You can select all columns between two variables (inclusive):
select(flights, year:day)
#> # A tibble: 336,776 × 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> 7 2013 1 1
#> 8 2013 1 1
#> 9 2013 1 1
#> 10 2013 1 1
#> # ℹ 336,766 more rows
You can select all columns except some:
select(flights, -(year:day))
#> # A tibble: 336,776 × 16
#> dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int>
#> 1 517 515 2 830 819
#> 2 533 529 4 850 830
#> 3 542 540 2 923 850
#> 4 544 545 -1 1004 1022
#> 5 554 600 -6 812 837
#> 6 554 558 -4 740 728
#> 7 555 600 -5 913 854
#> 8 557 600 -3 709 723
#> 9 557 600 -3 838 846
#> 10 558 600 -2 753 745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
You can do the same operation with !
operator:
select(flights, !(year:day))
#> # A tibble: 336,776 × 16
#> dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int>
#> 1 517 515 2 830 819
#> 2 533 529 4 850 830
#> 3 542 540 2 923 850
#> 4 544 545 -1 1004 1022
#> 5 554 600 -6 812 837
#> 6 554 558 -4 740 728
#> 7 555 600 -5 913 854
#> 8 557 600 -3 709 723
#> 9 557 600 -3 838 846
#> 10 558 600 -2 753 745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
You can use column indexes for column selection:
select(flights, c(1, 5, 8))
#> # A tibble: 336,776 × 3
#> year sched_dep_time sched_arr_time
#> <int> <int> <int>
#> 1 2013 515 819
#> 2 2013 529 830
#> 3 2013 540 850
#> 4 2013 545 1022
#> 5 2013 600 837
#> 6 2013 558 728
#> 7 2013 600 854
#> 8 2013 600 723
#> 9 2013 600 846
#> 10 2013 600 745
#> # ℹ 336,766 more rows
There are a number of helper functions you can use within select()
. For example, starts_with()
, ends_with()
, matches()
and contains()
. These let you quickly match larger blocks of variables that meet some criterion.
Let’s select all columns that start with “sched”:
select(flights, starts_with("sched"))
#> # A tibble: 336,776 × 2
#> sched_dep_time sched_arr_time
#> <int> <int>
#> 1 515 819
#> 2 529 830
#> 3 540 850
#> 4 545 1022
#> 5 600 837
#> 6 558 728
#> 7 600 854
#> 8 600 723
#> 9 600 846
#> 10 600 745
#> # ℹ 336,766 more rows
You can select all columns in the data set that end with “time”:
select(flights, ends_with("time"))
#> # A tibble: 336,776 × 5
#> dep_time sched_dep_time arr_time sched_arr_time air_time
#> <int> <int> <int> <int> <int>
#> 1 517 515 830 819 227
#> 2 533 529 850 830 227
#> 3 542 540 923 850 160
#> 4 544 545 1004 1022 183
#> 5 554 600 812 837 116
#> 6 554 558 740 728 150
#> 7 555 600 913 854 158
#> 8 557 600 709 723 53
#> 9 557 600 838 846 140
#> 10 558 600 753 745 138
#> # ℹ 336,766 more rows
Or suppose you want to select all columns in the data set that contain “ar”:
select(flights, contains("ar"))
#> # A tibble: 336,776 × 5
#> year arr_time sched_arr_time arr_delay carrier
#> <int> <int> <int> <int> <chr>
#> 1 2013 830 819 11 UA
#> 2 2013 850 830 20 UA
#> 3 2013 923 850 33 AA
#> 4 2013 1004 1022 -18 B6
#> 5 2013 812 837 -25 DL
#> 6 2013 740 728 12 UA
#> 7 2013 913 854 19 B6
#> 8 2013 709 723 -14 EV
#> 9 2013 838 846 -8 B6
#> 10 2013 753 745 8 AA
#> # ℹ 336,766 more rows
You can even combine these arguments:
select(flights, starts_with("sched") & ends_with("time"))
#> # A tibble: 336,776 × 2
#> sched_dep_time sched_arr_time
#> <int> <int>
#> 1 515 819
#> 2 529 830
#> 3 540 850
#> 4 545 1022
#> 5 600 837
#> 6 558 728
#> 7 600 854
#> 8 600 723
#> 9 600 846
#> 10 600 745
#> # ℹ 336,766 more rows
rename() Function
Use rename()
function to rename columns in a data frame. Suppose we want to rename the “year” and “month” variables and make them uppercase:
rename(flights, YEAR = year, MONTH = month)
#> # A tibble: 336,776 × 19
#> YEAR MONTH day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
relocate() Function
relocate()
function allows to change the positions of columns in a data frame. It has two useful arguments .before
and .after
that helps precisely select a location for a variable:
relocate(flights, year, .after = month)
#> # A tibble: 336,776 × 19
#> month year day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 1 2013 1 517 515 2
#> 2 1 2013 1 533 529 4
#> 3 1 2013 1 542 540 2
#> 4 1 2013 1 544 545 -1
#> 5 1 2013 1 554 600 -6
#> 6 1 2013 1 554 558 -4
#> 7 1 2013 1 555 600 -5
#> 8 1 2013 1 557 600 -3
#> 9 1 2013 1 557 600 -3
#> 10 1 2013 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
relocate(flights, c(year, month), .before = dep_delay)
#> # A tibble: 336,776 × 19
#> day dep_time sched_dep_time year month dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 1 517 515 2013 1 2
#> 2 1 533 529 2013 1 4
#> 3 1 542 540 2013 1 2
#> 4 1 544 545 2013 1 -1
#> 5 1 554 600 2013 1 -6
#> 6 1 554 558 2013 1 -4
#> 7 1 555 600 2013 1 -5
#> 8 1 557 600 2013 1 -3
#> 9 1 557 600 2013 1 -3
#> 10 1 558 600 2013 1 -2
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
relocate(flights, c(year, month), .after = last_col())
#> # A tibble: 336,776 × 19
#> day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int>
#> 1 1 517 515 2 830
#> 2 1 533 529 4 850
#> 3 1 542 540 2 923
#> 4 1 544 545 -1 1004
#> 5 1 554 600 -6 812
#> 6 1 554 558 -4 740
#> 7 1 555 600 -5 913
#> 8 1 557 600 -3 709
#> 9 1 557 600 -3 838
#> 10 1 558 600 -2 753
#> # ℹ 336,766 more rows
#> # ℹ 14 more variables: sched_arr_time <int>,
#> # arr_delay <int>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>, year <int>, month <int>
relocate(flights, dep_delay, .before = everything())
#> # A tibble: 336,776 × 19
#> dep_delay year month day dep_time sched_dep_time
#> <int> <int> <int> <int> <int> <int>
#> 1 2 2013 1 1 517 515
#> 2 4 2013 1 1 533 529
#> 3 2 2013 1 1 542 540
#> 4 -1 2013 1 1 544 545
#> 5 -6 2013 1 1 554 600
#> 6 -4 2013 1 1 554 558
#> 7 -5 2013 1 1 555 600
#> 8 -3 2013 1 1 557 600
#> 9 -3 2013 1 1 557 600
#> 10 -2 2013 1 1 558 600
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
mutate() Function
It’s often useful to add new columns that are functions of existing columns. That’s what the mutate()
function does.
mutate()
always adds new columns at the end of your data set so we’ll start by creating a narrower data set so we can see the new variables:
Now let’s add “gain” and “speed” columns to the data frame:
mutate(flights_2, gain = dep_delay - arr_delay, speed = distance / air_time * 60)
#> # A tibble: 336,776 × 7
#> month dep_delay arr_delay distance air_time gain speed
#> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 1 2 11 1400 227 -9 370.
#> 2 1 4 20 1416 227 -16 374.
#> 3 1 2 33 1089 160 -31 408.
#> 4 1 -1 -18 1576 183 17 517.
#> 5 1 -6 -25 762 116 19 394.
#> 6 1 -4 12 719 150 -16 288.
#> 7 1 -5 19 1065 158 -24 404.
#> 8 1 -3 -14 229 53 11 259.
#> 9 1 -3 -8 944 140 5 405.
#> 10 1 -2 8 733 138 -10 319.
#> # ℹ 336,766 more rows
Note that you can refer to columns that you’ve just created:
mutate(flights_2, gain = dep_delay - arr_delay, hours = air_time/60, gain_per_hour = gain/hours)
#> # A tibble: 336,776 × 8
#> month dep_delay arr_delay distance air_time gain hours
#> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 1 2 11 1400 227 -9 3.78
#> 2 1 4 20 1416 227 -16 3.78
#> 3 1 2 33 1089 160 -31 2.67
#> 4 1 -1 -18 1576 183 17 3.05
#> 5 1 -6 -25 762 116 19 1.93
#> 6 1 -4 12 719 150 -16 2.5
#> 7 1 -5 19 1065 158 -24 2.63
#> 8 1 -3 -14 229 53 11 0.883
#> 9 1 -3 -8 944 140 5 2.33
#> 10 1 -2 8 733 138 -10 2.3
#> # ℹ 336,766 more rows
#> # ℹ 1 more variable: gain_per_hour <dbl>
If you only want to keep the new variable, use transmute()
function:
transmute(flights_2, gain = dep_delay - arr_delay, hours = air_time/60, gain_per_hour = gain/hours)
#> # A tibble: 336,776 × 3
#> gain hours gain_per_hour
#> <int> <dbl> <dbl>
#> 1 -9 3.78 -2.38
#> 2 -16 3.78 -4.23
#> 3 -31 2.67 -11.6
#> 4 17 3.05 5.57
#> 5 19 1.93 9.83
#> 6 -16 2.5 -6.4
#> 7 -24 2.63 -9.11
#> 8 11 0.883 12.5
#> 9 5 2.33 2.14
#> 10 -10 2.3 -4.35
#> # ℹ 336,766 more rows
%>%
Pipe Operator
The dplyr functions are functional in the sense that function calls don’t have side-effects. You must always save their results. This doesn’t lead to particularly elegant code, especially if you want to do many operations at once. You either have to do it step-by-step or if you don’t want to name the intermediate results, you need to wrap the function calls inside each other, which lead to a messy and complex code:
select(filter(flights, month == 11 | month == 12), starts_with("sched") & ends_with("time"))
#> # A tibble: 55,403 × 2
#> sched_dep_time sched_arr_time
#> <int> <int>
#> 1 2359 345
#> 2 2250 2356
#> 3 500 651
#> 4 545 827
#> 5 545 855
#> 6 600 923
#> 7 600 659
#> 8 600 701
#> 9 600 827
#> 10 600 751
#> # ℹ 55,393 more rows
This is difficult to read because the order of the operations is from inside to out. Thus, the arguments are a long way away from the function. To get around this problem, dplyr provides the %>%
operator. The pipe operator, %>%
, comes from the magrittr package by Stefan Milton Bache. Packages in the tidyverse load %>%
for you automatically, so you don’t usually load magrittr explicitly.
x %>% f(y)
turns into f(x, y)
so you can use it to rewrite multiple operations that you can read left-to-right, top-to-bottom (reading the pipe operator as “then”):
flights %>%
filter(month == 11 | month == 12) %>%
select( starts_with("sched") & ends_with("time"))
#> # A tibble: 55,403 × 2
#> sched_dep_time sched_arr_time
#> <int> <int>
#> 1 2359 345
#> 2 2250 2356
#> 3 500 651
#> 4 545 827
#> 5 545 855
#> 6 600 923
#> 7 600 659
#> 8 600 701
#> 9 600 827
#> 10 600 751
#> # ℹ 55,393 more rows
Try to understand what the following code is doing:
flights %>%
filter(month %in% c(10, 11, 12), arr_delay < 10) %>%
slice(1:30) %>%
arrange(desc(arr_delay)) %>%
select(-c(1,4))
#> # A tibble: 30 × 17
#> month day sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int>
#> 1 10 1 600 -9 727
#> 2 10 1 600 -2 743
#> 3 10 1 600 -10 649
#> 4 10 1 610 -7 735
#> 5 10 1 600 -9 710
#> 6 10 1 600 -2 650
#> 7 10 1 600 -1 719
#> 8 10 1 600 0 706
#> 9 10 1 600 -10 648
#> 10 10 1 600 -9 655
#> # ℹ 20 more rows
#> # ℹ 12 more variables: sched_arr_time <int>,
#> # arr_delay <int>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
group_by(), summarise(), and across() Functions
Most data operations are done on groups defined by variables. dplyr
verbs are particularly powerful when you apply them to grouped data frames. The most important grouping verb is group_by()
. It takes an existing data frame and converts it into a grouped data frame where operations are performed “by group”. In other words, it takes a data frame and one or more variables to group by:
by_origin <- flights %>% group_by(origin)
by_origin
#> # A tibble: 336,776 × 19
#> # Groups: origin [3]
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
by_origin_carrier <- flights %>% group_by(origin, carrier)
by_origin_carrier
#> # A tibble: 336,776 × 19
#> # Groups: origin, carrier [35]
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <chr>
Grouping does not change how the data looks apart from listing how it is grouped.
Grouping is most useful when used in conjunction with the summarise()
function. summarise()
creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarizing all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified. Thus, it changes the unit of analysis from the complete dataset to individual groups. Together group_by()
and summarise()
provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries.
For instance, let’s calculate the average arrival delay time for each group in the by_origin
grouped data:
by_origin %>% summarise(Mean = mean(arr_delay, na.rm = T))
#> # A tibble: 3 × 2
#> origin Mean
#> <chr> <dbl>
#> 1 EWR 9.11
#> 2 JFK 5.55
#> 3 LGA 5.78
You can even pass several variables to it:
by_origin %>%
summarise(Mean = mean(arr_delay, na.rm = T),
Median = median(arr_delay, na.rm = T),
Count = n())
#> # A tibble: 3 × 4
#> origin Mean Median Count
#> <chr> <dbl> <dbl> <int>
#> 1 EWR 9.11 -4 120835
#> 2 JFK 5.55 -6 111279
#> 3 LGA 5.78 -5 104662
The table below displays useful functions that are frequently used with summarise()
:
Functionality | Functions |
---|---|
Center |
mean() , median()
|
Spread |
sd() , IQR() , var() , quantile()
|
Range |
min() , max() , range()
|
Position |
first() , last() , nth()
|
Count |
n() , n_distinct()
|
Logical |
any() , all()
|
group_by()
and summarise()
functions can be combined with other single table verbs:
by_carrier <- flights %>% group_by(carrier)
by_carrier %>%
summarise (Count = n(), Distance_sd = sd(distance)) %>%
filter(Count < 10000) %>%
arrange(desc(Distance_sd))
#> # A tibble: 7 × 3
#> carrier Count Distance_sd
#> <chr> <int> <dbl>
#> 1 OO 32 206.
#> 2 FL 3260 161.
#> 3 YV 601 160.
#> 4 VX 5162 88.0
#> 5 AS 714 0
#> 6 F9 685 0
#> 7 HA 342 0
If you need to remove grouping and return to operations on ungrouped data, use ungroup()
:
by_carrier %>%
ungroup() %>%
summarise(flights = n())
#> # A tibble: 1 × 1
#> flights
#> <int>
#> 1 336776
It’s often useful to perform the same operation on multiple columns, but copying and pasting is both tedious and error prone. For example:
flights %>%
group_by(origin, carrier) %>%
summarise(Mean_dep_delay = mean(dep_delay, na.rm = T),
Mean_arrival_delay = mean(arr_delay, na.rm = T),
Mean_air_time = mean(air_time, na.rm = T))
#> `summarise()` has grouped output by 'origin'. You can
#> override using the `.groups` argument.
#> # A tibble: 35 × 5
#> # Groups: origin [3]
#> origin carrier Mean_dep_delay Mean_arrival_delay
#> <chr> <chr> <dbl> <dbl>
#> 1 EWR 9E 5.95 1.62
#> 2 EWR AA 10.0 0.978
#> 3 EWR AS 5.80 -9.93
#> 4 EWR B6 13.1 9.39
#> 5 EWR DL 12.1 8.78
#> 6 EWR EV 20.2 17.0
#> 7 EWR MQ 17.5 16.3
#> 8 EWR OO 20.8 21.5
#> 9 EWR UA 12.5 3.48
#> 10 EWR US 3.74 0.977
#> # ℹ 25 more rows
#> # ℹ 1 more variable: Mean_air_time <dbl>
Instead, we can use across()
function, which lets you rewrite the previous code more succinctly:
flights %>%
group_by(origin, carrier) %>%
summarise(across(
c(dep_delay, arr_delay, air_time),
~ mean(.x, na.rm = T)
))
#> `summarise()` has grouped output by 'origin'. You can
#> override using the `.groups` argument.
#> # A tibble: 35 × 5
#> # Groups: origin [3]
#> origin carrier dep_delay arr_delay air_time
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 EWR 9E 5.95 1.62 103.
#> 2 EWR AA 10.0 0.978 196.
#> 3 EWR AS 5.80 -9.93 326.
#> 4 EWR B6 13.1 9.39 118.
#> 5 EWR DL 12.1 8.78 125.
#> 6 EWR EV 20.2 17.0 94.0
#> 7 EWR MQ 17.5 16.3 112.
#> 8 EWR OO 20.8 21.5 137.
#> 9 EWR UA 12.5 3.48 207.
#> 10 EWR US 3.74 0.977 138.
#> # ℹ 25 more rows
across()
has two primary arguments: (1) the first argument, .col
, selects the columns you want to operate on; (2) the second argument, .fns
, is a function or list of functions to apply to each column. Here are some examples:
flights %>%
group_by(origin) %>%
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
#> # A tibble: 3 × 15
#> origin year month day dep_time sched_dep_time dep_delay
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 EWR 2013 6.49 15.7 1337. 1322. 15.1
#> 2 JFK 2013 6.50 15.7 1399. 1402. 12.1
#> 3 LGA 2013 6.67 15.7 1310. 1308. 10.3
#> # ℹ 8 more variables: arr_time <dbl>, sched_arr_time <dbl>,
#> # arr_delay <dbl>, flight <dbl>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>
You can transform each variable with more than one function by supplying a named list of functions or lambda functions in the second argument:
min_max <- list(
min = ~min(.x, na.rm = TRUE),
max = ~max(.x, na.rm = TRUE)
)
flights %>%
group_by(origin) %>%
summarise(across(where(is.numeric), min_max))
#> # A tibble: 3 × 29
#> origin year_min year_max month_min month_max day_min
#> <chr> <int> <int> <int> <int> <int>
#> 1 EWR 2013 2013 1 12 1
#> 2 JFK 2013 2013 1 12 1
#> 3 LGA 2013 2013 1 12 1
#> # ℹ 23 more variables: day_max <int>, dep_time_min <int>,
#> # dep_time_max <int>, sched_dep_time_min <int>,
#> # sched_dep_time_max <int>, dep_delay_min <int>,
#> # dep_delay_max <int>, arr_time_min <int>,
#> # arr_time_max <int>, sched_arr_time_min <int>,
#> # sched_arr_time_max <int>, arr_delay_min <int>,
#> # arr_delay_max <int>, flight_min <int>, …
You can control how the names are created with the .names
argument:
flights %>%
group_by(origin) %>%
summarise(across(where(is.numeric), min_max,
.names = "{.fn}.{.col}"
))
#> # A tibble: 3 × 29
#> origin min.year max.year min.month max.month min.day
#> <chr> <int> <int> <int> <int> <int>
#> 1 EWR 2013 2013 1 12 1
#> 2 JFK 2013 2013 1 12 1
#> 3 LGA 2013 2013 1 12 1
#> # ℹ 23 more variables: max.day <int>, min.dep_time <int>,
#> # max.dep_time <int>, min.sched_dep_time <int>,
#> # max.sched_dep_time <int>, min.dep_delay <int>,
#> # max.dep_delay <int>, min.arr_time <int>,
#> # max.arr_time <int>, min.sched_arr_time <int>,
#> # max.sched_arr_time <int>, min.arr_delay <int>,
#> # max.arr_delay <int>, min.flight <int>, …
Relational Data: Two-Table Verbs
It’s rare that a data analysis involves only a single table of data. In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them.
In dplyr, there are three families of verbs that work with two tables at a time:
- Mutating joins, which add new variables to one table from matching rows in another
- Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table
- Set operations, which combine the observations in the data sets as if they were set elements
Mutating joins
Mutating joins allow you to combine variables from multiple tables. It first matches observations by their keys, then copies across variables from one table to the other. There are four types of mutating join, which differ in their behavior when a match is not found. These are:
All these functions have the same input arguments. We will be focusing on the following arguments:
-
x
andy
- tables or dataframes that are being combined (x
is known as a primary table andy
as a secondary table) -
by
- the join key, a variable or variables that is/are used to match the rows between thex
andy
tables. In other words, it controls which variables are used to match observations in the two tables. -
keep
- a logical operator indicating whether the join keys from bothx
andy
tables should be preserved in the output. The default value isFALSE
.
The output is always a new table. By default, if an observation in x
matches multiple observations in y
, all of the matching observations in y
will be returned. If this occurs, normally a warning will be thrown stating that multiple matches have been detected since this is usually surprising.
To illustrative how these functions work, we will be using the following toy data frames:
df1 <- data.frame(
a = c(1, 2, 3, 2, 4),
b = c(10, 20, 30, 35, 40),
c = c(100, 200, 300, 350, 400)
)
print(df1)
#> a b c
#> 1 1 10 100
#> 2 2 20 200
#> 3 3 30 300
#> 4 2 35 350
#> 5 4 40 400
df2 <- data.frame(
a = c(1, 2, 5, 4, 6, 2),
b = c(10, 40, 50, 40, 60, 50),
x = c(15, 25, 35, 45, 55, 65),
z = c(150, 200, 350, 400, 550, 270)
)
print(df2)
#> a b x z
#> 1 1 10 15 150
#> 2 2 40 25 200
#> 3 5 50 35 350
#> 4 4 40 45 400
#> 5 6 60 55 550
#> 6 2 50 65 270
inner_join()
The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal. The output of an inner join is a new data frame that contains the key, the x
values, and the y
values. The most important property of an inner join is that unmatched rows in either input are not included in the result.
Below are some examples of inner join:
# Merging tables by the "a" variable
df1 %>%
inner_join(df2, by = "a")
#> Warning in inner_join(., df2, by = "a"): Detected an unexpected many-to-many relationship between
#> `x` and `y`.
#> ℹ Row 2 of `x` matches multiple rows in `y`.
#> ℹ Row 2 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set
#> `relationship = "many-to-many"` to silence this warning.
#> a b.x c b.y x z
#> 1 1 10 100 10 15 150
#> 2 2 20 200 40 25 200
#> 3 2 20 200 50 65 270
#> 4 2 35 350 40 25 200
#> 5 2 35 350 50 65 270
#> 6 4 40 400 40 45 400
# Merging tables by the "a" and "b" variable
df1 %>%
inner_join(df2, by = c("a", "b"))
#> a b c x z
#> 1 1 10 100 15 150
#> 2 4 40 400 45 400
## Merging tables by the "c" and "z" variable (Have different variable names)
df1 %>%
inner_join(df2, by = c("c" = "z"))
#> a.x b.x c a.y b.y x
#> 1 2 20 200 2 40 25
#> 2 2 35 350 5 50 35
#> 3 4 40 400 4 40 45
## Merging tables by the "c" and "z" variable (Have different variable names) and keeping both key variables in the output table
df1 %>%
inner_join(df2, by = c("c" = "z"), keep = T)
#> a.x b.x c a.y b.y x z
#> 1 2 20 200 2 40 25 200
#> 2 2 35 350 5 50 35 350
#> 3 4 40 400 4 40 45 400
left_join()
left_join()
includes all observations in x
, regardless of whether they match or not. This is the most commonly used join because it ensures that you don’t lose observations from your primary table:
# Merging tables by the "a" variable
df1 %>%
left_join(df2, by = "a")
#> Warning in left_join(., df2, by = "a"): Detected an unexpected many-to-many relationship between
#> `x` and `y`.
#> ℹ Row 2 of `x` matches multiple rows in `y`.
#> ℹ Row 2 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set
#> `relationship = "many-to-many"` to silence this warning.
#> a b.x c b.y x z
#> 1 1 10 100 10 15 150
#> 2 2 20 200 40 25 200
#> 3 2 20 200 50 65 270
#> 4 3 30 300 NA NA NA
#> 5 2 35 350 40 25 200
#> 6 2 35 350 50 65 270
#> 7 4 40 400 40 45 400
right_join()
right_join()
includes all observations in y
:
# Merging tables by the "a" variable
df1 %>%
right_join(df2, by = "a")
#> Warning in right_join(., df2, by = "a"): Detected an unexpected many-to-many relationship between
#> `x` and `y`.
#> ℹ Row 2 of `x` matches multiple rows in `y`.
#> ℹ Row 2 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set
#> `relationship = "many-to-many"` to silence this warning.
#> a b.x c b.y x z
#> 1 1 10 100 10 15 150
#> 2 2 20 200 40 25 200
#> 3 2 20 200 50 65 270
#> 4 2 35 350 40 25 200
#> 5 2 35 350 50 65 270
#> 6 4 40 400 40 45 400
#> 7 5 NA NA 50 35 350
#> 8 6 NA NA 60 55 550
# Merging tables by the "a" and "b" variable
df1 %>%
right_join(df2, by = c("a", "b"))
#> a b c x z
#> 1 1 10 100 15 150
#> 2 4 40 400 45 400
#> 3 2 40 NA 25 200
#> 4 5 50 NA 35 350
#> 5 6 60 NA 55 550
#> 6 2 50 NA 65 270
## Merging tables by the "c" and "z" variable (Have different variable names)
df1 %>%
right_join(df2, by = c("c" = "z"))
#> a.x b.x c a.y b.y x
#> 1 2 20 200 2 40 25
#> 2 2 35 350 5 50 35
#> 3 4 40 400 4 40 45
#> 4 NA NA 150 1 10 15
#> 5 NA NA 550 6 60 55
#> 6 NA NA 270 2 50 65
full_join()
full_join()
includes all observations from both x
and y
:
# Merging tables by the "a" variable
df1 %>%
full_join(df2, by = "a")
#> Warning in full_join(., df2, by = "a"): Detected an unexpected many-to-many relationship between
#> `x` and `y`.
#> ℹ Row 2 of `x` matches multiple rows in `y`.
#> ℹ Row 2 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set
#> `relationship = "many-to-many"` to silence this warning.
#> a b.x c b.y x z
#> 1 1 10 100 10 15 150
#> 2 2 20 200 40 25 200
#> 3 2 20 200 50 65 270
#> 4 3 30 300 NA NA NA
#> 5 2 35 350 40 25 200
#> 6 2 35 350 50 65 270
#> 7 4 40 400 40 45 400
#> 8 5 NA NA 50 35 350
#> 9 6 NA NA 60 55 550
# Merging tables by the "a" and "b" variable
df1 %>%
full_join(df2, by = c("a", "b"))
#> a b c x z
#> 1 1 10 100 15 150
#> 2 2 20 200 NA NA
#> 3 3 30 300 NA NA
#> 4 2 35 350 NA NA
#> 5 4 40 400 45 400
#> 6 2 40 NA 25 200
#> 7 5 50 NA 35 350
#> 8 6 60 NA 55 550
#> 9 2 50 NA 65 270
## Merging tables by the "c" and "z" variable (Have different variable names)
df1 %>%
full_join(df2, by = c("c" = "z"))
#> a.x b.x c a.y b.y x
#> 1 1 10 100 NA NA NA
#> 2 2 20 200 2 40 25
#> 3 3 30 300 NA NA NA
#> 4 2 35 350 5 50 35
#> 5 4 40 400 4 40 45
#> 6 NA NA 150 1 10 15
#> 7 NA NA 550 6 60 55
#> 8 NA NA 270 2 50 65
Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
-
semi_join(x, y)
- keeps all observations inx
that have a match iny
-
anti_join(x, y)
- drops all observations inx
that have a match iny
Set Operations
Set operations expect x
and y
tables to have the same variables, and treat the observations as sets:
-
itersect(x, y)
- returns only observations in bothx
andy
-
union(x, y)
- returns unique observations in bothx
andy
-
setdiff(x, y)
- returns observations inx
, but not iny
We will first create toy data frames and then apply these functions to them:
df1 <- data.frame(
a = c(1, 2, 3, 4, 5),
b = c(10, 20, 30, 40, 50)
)
df1
#> a b
#> 1 1 10
#> 2 2 20
#> 3 3 30
#> 4 4 40
#> 5 5 50
df2 <- data.frame(
a = c(1, 2, 3, 4, 5),
b = c(10, 15, 30, 45, 65)
)
df2
#> a b
#> 1 1 10
#> 2 2 15
#> 3 3 30
#> 4 4 45
#> 5 5 65
intersect(df1, df2)
#> a b
#> 1 1 10
#> 2 3 30
union(df1, df2)
#> a b
#> 1 1 10
#> 2 2 20
#> 3 3 30
#> 4 4 40
#> 5 5 50
#> 6 2 15
#> 7 4 45
#> 8 5 65
setdiff(df1, df2)
#> a b
#> 1 2 20
#> 2 4 40
#> 3 5 50
setdiff(df2, df1)
#> a b
#> 1 2 15
#> 2 4 45
#> 3 5 65
Practice Data Sets
Here are two data sets that you can use to practice two-table verbs:
data1 <- data.frame(
Name = c("James", "Linda", "Jim", "Margo", "Nick", "Stacy", "Mary", "Tom", "Anna", "Bob", "Jeniffer", "Lucas", "Amy"),
Age = c(22, 56, 34, 48, 19, 25, 31, 68, 72, 42, 39, 52, 39),
State = c("California", "New York", "New York", "California", "Michigan", "Texas", "Ohio", "Arizona", "Texas", "Florida", "Nebraska", "Indiana", "Florida"),
state_abr = c("CA", "NY", "NY", "CA", "MI", "TX", "OH", "AZ", "TX", "FL", "NE", "IN", "FL"),
City = c("Los Angeles", "New York", "Buffalo", "San Diego", "Detroit", "Austin", "Cleveland", "Phoenix", "Houston", "Tampa", "Lincoln", "Indianapolis", "Miami"),
Salary = c(30000, 96500, 72000, 59000, 54300, 25000, 61000, 64000, 74700, 40000, 83000, 92400, 82000)
)
data1
#> Name Age State state_abr City Salary
#> 1 James 22 California CA Los Angeles 30000
#> 2 Linda 56 New York NY New York 96500
#> 3 Jim 34 New York NY Buffalo 72000
#> 4 Margo 48 California CA San Diego 59000
#> 5 Nick 19 Michigan MI Detroit 54300
#> 6 Stacy 25 Texas TX Austin 25000
#> 7 Mary 31 Ohio OH Cleveland 61000
#> 8 Tom 68 Arizona AZ Phoenix 64000
#> 9 Anna 72 Texas TX Houston 74700
#> 10 Bob 42 Florida FL Tampa 40000
#> 11 Jeniffer 39 Nebraska NE Lincoln 83000
#> 12 Lucas 52 Indiana IN Indianapolis 92400
#> 13 Amy 39 Florida FL Miami 82000
data2 <- data.frame(
State = c("Washington", "Florida", "Nebraska", "Indiana", "Florida","California", "New York", "New York", "California", "Michigan", "Texas", "Ohio", "Arizona", "Utah"),
state_abbriviation = c("WA", "FL", "NE", "IN", "FL","CA", "NY", "NY", "CA", "MI", "TX", "OH", "AZ", "UT"),
City = c( "Seatle", "Tampa", "Lincoln", "Indianapolis", "Miami","Los Angeles", "New York", "Ithaca", "San Francisco", "Detroit", "Dallas", "Cleveland", "Phoenix", "Salt Lake City"),
Average_salary = c(63500, 53900, 59900, 59800, 57900, 79000, 80000, 75000, 85000, 54000, 63800, 57000, 61000, 58600)
)
data2
#> State state_abbriviation City
#> 1 Washington WA Seatle
#> 2 Florida FL Tampa
#> 3 Nebraska NE Lincoln
#> 4 Indiana IN Indianapolis
#> 5 Florida FL Miami
#> 6 California CA Los Angeles
#> 7 New York NY New York
#> 8 New York NY Ithaca
#> 9 California CA San Francisco
#> 10 Michigan MI Detroit
#> 11 Texas TX Dallas
#> 12 Ohio OH Cleveland
#> 13 Arizona AZ Phoenix
#> 14 Utah UT Salt Lake City
#> Average_salary
#> 1 63500
#> 2 53900
#> 3 59900
#> 4 59800
#> 5 57900
#> 6 79000
#> 7 80000
#> 8 75000
#> 9 85000
#> 10 54000
#> 11 63800
#> 12 57000
#> 13 61000
#> 14 58600