Lecture 11
dplyr Package (continued)
We continue working with the nyc_flights
data set that provides information about flights departed New York City in 2013 (the data set is available on Courseworks). It contains 336 776 observations (rows) and 19 variables (columns).
group_by(), summarise(), and across() Functions
Most data operations are done on groups defined by variables. dplyr
verbs are particularly powerful when you apply them to grouped data frames. The most important grouping verb is group_by()
. It takes an existing data frame and converts it into a grouped data frame where operations are performed “by group”. In other words, it takes a data frame and one or more variables to group by:
by_origin <- flights %>% group_by(origin)
by_origin
#> # A tibble: 336,776 × 19
#> # Groups: origin [3]
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <fct>,
#> # flight <int>, tailnum <fct>, origin <fct>, dest <fct>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <fct>
by_origin_carrier <- flights %>% group_by(origin, carrier)
by_origin_carrier
#> # A tibble: 336,776 × 19
#> # Groups: origin, carrier [35]
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <int>, carrier <fct>,
#> # flight <int>, tailnum <fct>, origin <fct>, dest <fct>,
#> # air_time <int>, distance <int>, hour <int>,
#> # minute <int>, time_hour <fct>
Grouping does not change how the data looks apart from listing how it is grouped.
Grouping is most useful when used in conjunction with the summarise()
function. summarise()
creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarizing all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified. Thus, it changes the unit of analysis from the complete dataset to individual groups. Together group_by()
and summarise()
provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries.
For instance, let’s calculate the average arrival delay time for each group in the by_origin
grouped data:
by_origin %>% summarise(Mean = mean(arr_delay, na.rm = T))
#> # A tibble: 3 × 2
#> origin Mean
#> <fct> <dbl>
#> 1 EWR 9.11
#> 2 JFK 5.55
#> 3 LGA 5.78
You can even pass several variables to it:
by_origin %>%
summarise(Mean = mean(arr_delay, na.rm = T),
Median = median(arr_delay, na.rm = T),
Count = n())
#> # A tibble: 3 × 4
#> origin Mean Median Count
#> <fct> <dbl> <dbl> <int>
#> 1 EWR 9.11 -4 120835
#> 2 JFK 5.55 -6 111279
#> 3 LGA 5.78 -5 104662
The table below displays useful functions that are frequently used with summarise()
:
Functionality | Functions |
---|---|
Center |
mean() , median()
|
Spread |
sd() , IQR() , var() , quantile()
|
Range |
min() , max() , range()
|
Position |
first() , last() , nth()
|
Count |
n() , n_distinct()
|
Logical |
any() , all()
|
group_by()
and summarise()
functions can be combined with other single table verbs:
by_carrier <- flights %>% group_by(carrier)
by_carrier %>%
summarise (Count = n(), Distance_sd = sd(distance)) %>%
filter(Count < 10000) %>%
arrange(desc(Distance_sd))
#> # A tibble: 7 × 3
#> carrier Count Distance_sd
#> <fct> <int> <dbl>
#> 1 OO 32 206.
#> 2 FL 3260 161.
#> 3 YV 601 160.
#> 4 VX 5162 88.0
#> 5 AS 714 0
#> 6 F9 685 0
#> 7 HA 342 0
If you need to remove grouping and return to operations on ungrouped data, use ungroup()
:
by_carrier %>%
ungroup() %>%
summarise(flights = n())
#> # A tibble: 1 × 1
#> flights
#> <int>
#> 1 336776
It’s often useful to perform the same operation on multiple columns, but copying and pasting is both tedious and error prone. For example:
flights %>%
group_by(origin, carrier) %>%
summarise(Mean_dep_delay = mean(dep_delay, na.rm = T),
Mean_arrival_delay = mean(arr_delay, na.rm = T),
Mean_air_time = mean(air_time, na.rm = T))
#> `summarise()` has grouped output by 'origin'. You can
#> override using the `.groups` argument.
#> # A tibble: 35 × 5
#> # Groups: origin [3]
#> origin carrier Mean_dep_delay Mean_arrival_delay
#> <fct> <fct> <dbl> <dbl>
#> 1 EWR 9E 5.95 1.62
#> 2 EWR AA 10.0 0.978
#> 3 EWR AS 5.80 -9.93
#> 4 EWR B6 13.1 9.39
#> 5 EWR DL 12.1 8.78
#> 6 EWR EV 20.2 17.0
#> 7 EWR MQ 17.5 16.3
#> 8 EWR OO 20.8 21.5
#> 9 EWR UA 12.5 3.48
#> 10 EWR US 3.74 0.977
#> # ℹ 25 more rows
#> # ℹ 1 more variable: Mean_air_time <dbl>
Instead, we can use across()
function, which lets you rewrite the previous code more succinctly:
flights %>%
group_by(origin, carrier) %>%
summarise(across(
c(dep_delay, arr_delay, air_time),
~ mean(.x, na.rm = T)
))
#> `summarise()` has grouped output by 'origin'. You can
#> override using the `.groups` argument.
#> # A tibble: 35 × 5
#> # Groups: origin [3]
#> origin carrier dep_delay arr_delay air_time
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 EWR 9E 5.95 1.62 103.
#> 2 EWR AA 10.0 0.978 196.
#> 3 EWR AS 5.80 -9.93 326.
#> 4 EWR B6 13.1 9.39 118.
#> 5 EWR DL 12.1 8.78 125.
#> 6 EWR EV 20.2 17.0 94.0
#> 7 EWR MQ 17.5 16.3 112.
#> 8 EWR OO 20.8 21.5 137.
#> 9 EWR UA 12.5 3.48 207.
#> 10 EWR US 3.74 0.977 138.
#> # ℹ 25 more rows
across()
has two primary arguments: (1) the first argument, .col
, selects the columns you want to operate on; (2) the second argument, .fns
, is a function or list of functions to apply to each column. Here are some examples:
flights %>%
summarise(across(where(is.factor), n_distinct))
#> # A tibble: 1 × 5
#> carrier tailnum origin dest time_hour
#> <int> <int> <int> <int> <int>
#> 1 16 4044 3 105 6936
flights %>%
group_by(origin) %>%
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
#> # A tibble: 3 × 15
#> origin year month day dep_time sched_dep_time dep_delay
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 EWR 2013 6.49 15.7 1337. 1322. 15.1
#> 2 JFK 2013 6.50 15.7 1399. 1402. 12.1
#> 3 LGA 2013 6.67 15.7 1310. 1308. 10.3
#> # ℹ 8 more variables: arr_time <dbl>, sched_arr_time <dbl>,
#> # arr_delay <dbl>, flight <dbl>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>
You can transform each variable with more than one function by supplying a named list of functions or lambda functions in the second argument:
min_max <- list(
min = ~min(.x, na.rm = TRUE),
max = ~max(.x, na.rm = TRUE)
)
flights %>%
group_by(origin) %>%
summarise(across(where(is.numeric), min_max))
#> # A tibble: 3 × 29
#> origin year_min year_max month_min month_max day_min
#> <fct> <int> <int> <int> <int> <int>
#> 1 EWR 2013 2013 1 12 1
#> 2 JFK 2013 2013 1 12 1
#> 3 LGA 2013 2013 1 12 1
#> # ℹ 23 more variables: day_max <int>, dep_time_min <int>,
#> # dep_time_max <int>, sched_dep_time_min <int>,
#> # sched_dep_time_max <int>, dep_delay_min <int>,
#> # dep_delay_max <int>, arr_time_min <int>,
#> # arr_time_max <int>, sched_arr_time_min <int>,
#> # sched_arr_time_max <int>, arr_delay_min <int>,
#> # arr_delay_max <int>, flight_min <int>, …
You can control how the names are created with the .names
argument:
flights %>%
group_by(origin) %>%
summarise(across(where(is.numeric), min_max,
.names = "{.fn}.{.col}"
))
#> # A tibble: 3 × 29
#> origin min.year max.year min.month max.month min.day
#> <fct> <int> <int> <int> <int> <int>
#> 1 EWR 2013 2013 1 12 1
#> 2 JFK 2013 2013 1 12 1
#> 3 LGA 2013 2013 1 12 1
#> # ℹ 23 more variables: max.day <int>, min.dep_time <int>,
#> # max.dep_time <int>, min.sched_dep_time <int>,
#> # max.sched_dep_time <int>, min.dep_delay <int>,
#> # max.dep_delay <int>, min.arr_time <int>,
#> # max.arr_time <int>, min.sched_arr_time <int>,
#> # max.sched_arr_time <int>, min.arr_delay <int>,
#> # max.arr_delay <int>, min.flight <int>, …
Relational Data: Two-Table Verbs
It’s rare that a data analysis involves only a single table of data. In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them.
In dplyr, there are three families of verbs that work with two tables at a time:
- Mutating joins, which add new variables to one table from matching rows in another
- Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table
- Set operations, which combine the observations in the data sets as if they were set elements
Mutating joins
Mutating joins allow you to combine variables from multiple tables. It first matches observations by their keys, then copies across variables from one table to the other. There are four types of mutating join, which differ in their behavior when a match is not found. These are:
All these functions have the same input arguments. We will be focusing on the following arguments:
-
x
andy
- tables or dataframes that are being combined (x
is known as a primary table andy
as a secondary table) -
by
- the join key, a variable or variables that is/are used to match the rows between thex
andy
tables. In other words, it controls which variables are used to match observations in the two tables. -
keep
- a logical operator indicating whether the join keys from bothx
andy
tables should be preserved in the output. The default value isFALSE
.
The output is always a new table. By default, if an observation in x
matches multiple observations in y
, all of the matching observations in y
will be returned. If this occurs, normally a warning will be thrown stating that multiple matches have been detected since this is usually surprising.
To illustrative how these functions work, we will be using the following toy data frames:
df1 <- data.frame(
a = c(1, 2, 3, 2, 4),
b = c(10, 20, 30, 35, 40),
c = c(100, 200, 300, 350, 400)
)
print(df1)
#> a b c
#> 1 1 10 100
#> 2 2 20 200
#> 3 3 30 300
#> 4 2 35 350
#> 5 4 40 400
df2 <- data.frame(
a = c(1, 2, 5, 4, 6, 2),
b = c(10, 40, 50, 40, 60, 50),
x = c(15, 25, 35, 45, 55, 65),
z = c(150, 200, 350, 400, 550, 270)
)
print(df2)
#> a b x z
#> 1 1 10 15 150
#> 2 2 40 25 200
#> 3 5 50 35 350
#> 4 4 40 45 400
#> 5 6 60 55 550
#> 6 2 50 65 270
inner_join()
The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal. The output of an inner join is a new data frame that contains the key, the x
values, and the y
values. The most important property of an inner join is that unmatched rows in either input are not included in the result.
Below are some examples of inner join:
# Merging tables by the "a" variable
df1 %>%
inner_join(df2, by = "a")
#> Warning in inner_join(., df2, by = "a"): Detected an unexpected many-to-many relationship between
#> `x` and `y`.
#> ℹ Row 2 of `x` matches multiple rows in `y`.
#> ℹ Row 2 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set
#> `relationship = "many-to-many"` to silence this warning.
#> a b.x c b.y x z
#> 1 1 10 100 10 15 150
#> 2 2 20 200 40 25 200
#> 3 2 20 200 50 65 270
#> 4 2 35 350 40 25 200
#> 5 2 35 350 50 65 270
#> 6 4 40 400 40 45 400
# Merging tables by the "a" and "b" variable
df1 %>%
inner_join(df2, by = c("a", "b"))
#> a b c x z
#> 1 1 10 100 15 150
#> 2 4 40 400 45 400
## Merging tables by the "c" and "z" variable (Have different variable names)
df1 %>%
inner_join(df2, by = c("c" = "z"))
#> a.x b.x c a.y b.y x
#> 1 2 20 200 2 40 25
#> 2 2 35 350 5 50 35
#> 3 4 40 400 4 40 45
## Merging tables by the "c" and "z" variable (Have different variable names) and keeping both key variables in the output table
df1 %>%
inner_join(df2, by = c("c" = "z"), keep = T)
#> a.x b.x c a.y b.y x z
#> 1 2 20 200 2 40 25 200
#> 2 2 35 350 5 50 35 350
#> 3 4 40 400 4 40 45 400
left_join()
left_join()
includes all observations in x
, regardless of whether they match or not. This is the most commonly used join because it ensures that you don’t lose observations from your primary table:
# Merging tables by the "a" variable
df1 %>%
left_join(df2, by = "a")
#> Warning in left_join(., df2, by = "a"): Detected an unexpected many-to-many relationship between
#> `x` and `y`.
#> ℹ Row 2 of `x` matches multiple rows in `y`.
#> ℹ Row 2 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set
#> `relationship = "many-to-many"` to silence this warning.
#> a b.x c b.y x z
#> 1 1 10 100 10 15 150
#> 2 2 20 200 40 25 200
#> 3 2 20 200 50 65 270
#> 4 3 30 300 NA NA NA
#> 5 2 35 350 40 25 200
#> 6 2 35 350 50 65 270
#> 7 4 40 400 40 45 400
right_join()
right_join()
includes all observations in y
:
# Merging tables by the "a" variable
df1 %>%
right_join(df2, by = "a")
#> Warning in right_join(., df2, by = "a"): Detected an unexpected many-to-many relationship between
#> `x` and `y`.
#> ℹ Row 2 of `x` matches multiple rows in `y`.
#> ℹ Row 2 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set
#> `relationship = "many-to-many"` to silence this warning.
#> a b.x c b.y x z
#> 1 1 10 100 10 15 150
#> 2 2 20 200 40 25 200
#> 3 2 20 200 50 65 270
#> 4 2 35 350 40 25 200
#> 5 2 35 350 50 65 270
#> 6 4 40 400 40 45 400
#> 7 5 NA NA 50 35 350
#> 8 6 NA NA 60 55 550
# Merging tables by the "a" and "b" variable
df1 %>%
right_join(df2, by = c("a", "b"))
#> a b c x z
#> 1 1 10 100 15 150
#> 2 4 40 400 45 400
#> 3 2 40 NA 25 200
#> 4 5 50 NA 35 350
#> 5 6 60 NA 55 550
#> 6 2 50 NA 65 270
## Merging tables by the "c" and "z" variable (Have different variable names)
df1 %>%
right_join(df2, by = c("c" = "z"))
#> a.x b.x c a.y b.y x
#> 1 2 20 200 2 40 25
#> 2 2 35 350 5 50 35
#> 3 4 40 400 4 40 45
#> 4 NA NA 150 1 10 15
#> 5 NA NA 550 6 60 55
#> 6 NA NA 270 2 50 65
full_join()
full_join()
includes all observations from both x
and y
:
# Merging tables by the "a" variable
df1 %>%
full_join(df2, by = "a")
#> Warning in full_join(., df2, by = "a"): Detected an unexpected many-to-many relationship between
#> `x` and `y`.
#> ℹ Row 2 of `x` matches multiple rows in `y`.
#> ℹ Row 2 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set
#> `relationship = "many-to-many"` to silence this warning.
#> a b.x c b.y x z
#> 1 1 10 100 10 15 150
#> 2 2 20 200 40 25 200
#> 3 2 20 200 50 65 270
#> 4 3 30 300 NA NA NA
#> 5 2 35 350 40 25 200
#> 6 2 35 350 50 65 270
#> 7 4 40 400 40 45 400
#> 8 5 NA NA 50 35 350
#> 9 6 NA NA 60 55 550
# Merging tables by the "a" and "b" variable
df1 %>%
full_join(df2, by = c("a", "b"))
#> a b c x z
#> 1 1 10 100 15 150
#> 2 2 20 200 NA NA
#> 3 3 30 300 NA NA
#> 4 2 35 350 NA NA
#> 5 4 40 400 45 400
#> 6 2 40 NA 25 200
#> 7 5 50 NA 35 350
#> 8 6 60 NA 55 550
#> 9 2 50 NA 65 270
## Merging tables by the "c" and "z" variable (Have different variable names)
df1 %>%
full_join(df2, by = c("c" = "z"))
#> a.x b.x c a.y b.y x
#> 1 1 10 100 NA NA NA
#> 2 2 20 200 2 40 25
#> 3 3 30 300 NA NA NA
#> 4 2 35 350 5 50 35
#> 5 4 40 400 4 40 45
#> 6 NA NA 150 1 10 15
#> 7 NA NA 550 6 60 55
#> 8 NA NA 270 2 50 65
Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
-
semi_join(x, y)
- keeps all observations inx
that have a match iny
-
anti_join(x, y)
- drops all observations inx
that have a match iny
Set Operations
Set operations expect x
and y
tables to have the same variables, and treat the observations as sets:
-
intersect(x, y)
- returns only observations in bothx
andy
-
union(x, y)
- returns unique observations in bothx
andy
-
setdiff(x, y)
- returns observations inx
, but not iny
We will first create toy data frames and then apply these functions to them:
df1 <- data.frame(
a = c(1, 2, 3, 4, 5),
b = c(10, 20, 30, 40, 50)
)
df1
#> a b
#> 1 1 10
#> 2 2 20
#> 3 3 30
#> 4 4 40
#> 5 5 50
df2 <- data.frame(
a = c(1, 2, 3, 4, 5),
b = c(10, 15, 30, 45, 65)
)
df2
#> a b
#> 1 1 10
#> 2 2 15
#> 3 3 30
#> 4 4 45
#> 5 5 65
intersect(df1, df2)
#> a b
#> 1 1 10
#> 2 3 30
union(df1, df2)
#> a b
#> 1 1 10
#> 2 2 20
#> 3 3 30
#> 4 4 40
#> 5 5 50
#> 6 2 15
#> 7 4 45
#> 8 5 65
setdiff(df1, df2)
#> a b
#> 1 2 20
#> 2 4 40
#> 3 5 50
setdiff(df2, df1)
#> a b
#> 1 2 15
#> 2 4 45
#> 3 5 65
Practice Data Sets
Here are two data sets that you can use to practice two-table verbs:
data1 <- data.frame(
Name = c("James", "Linda", "Jim", "Margo", "Nick", "Stacy", "Mary", "Tom", "Anna", "Bob", "Jeniffer", "Lucas", "Amy"),
Age = c(22, 56, 34, 48, 19, 25, 31, 68, 72, 42, 39, 52, 39),
State = c("California", "New York", "New York", "California", "Michigan", "Texas", "Ohio", "Arizona", "Texas", "Florida", "Nebraska", "Indiana", "Florida"),
state_abr = c("CA", "NY", "NY", "CA", "MI", "TX", "OH", "AZ", "TX", "FL", "NE", "IN", "FL"),
City = c("Los Angeles", "New York", "Buffalo", "San Diego", "Detroit", "Austin", "Cleveland", "Phoenix", "Houston", "Tampa", "Lincoln", "Indianapolis", "Miami"),
Salary = c(30000, 96500, 72000, 59000, 54300, 25000, 61000, 64000, 74700, 40000, 83000, 92400, 82000)
)
data1
#> Name Age State state_abr City Salary
#> 1 James 22 California CA Los Angeles 30000
#> 2 Linda 56 New York NY New York 96500
#> 3 Jim 34 New York NY Buffalo 72000
#> 4 Margo 48 California CA San Diego 59000
#> 5 Nick 19 Michigan MI Detroit 54300
#> 6 Stacy 25 Texas TX Austin 25000
#> 7 Mary 31 Ohio OH Cleveland 61000
#> 8 Tom 68 Arizona AZ Phoenix 64000
#> 9 Anna 72 Texas TX Houston 74700
#> 10 Bob 42 Florida FL Tampa 40000
#> 11 Jeniffer 39 Nebraska NE Lincoln 83000
#> 12 Lucas 52 Indiana IN Indianapolis 92400
#> 13 Amy 39 Florida FL Miami 82000
data2 <- data.frame(
State = c("Washington", "Florida", "Nebraska", "Indiana", "Florida","California", "New York", "New York", "California", "Michigan", "Texas", "Ohio", "Arizona", "Utah"),
state_abbriviation = c("WA", "FL", "NE", "IN", "FL","CA", "NY", "NY", "CA", "MI", "TX", "OH", "AZ", "UT"),
City = c( "Seatle", "Tampa", "Lincoln", "Indianapolis", "Miami","Los Angeles", "New York", "Ithaca", "San Francisco", "Detroit", "Dallas", "Cleveland", "Phoenix", "Salt Lake City"),
Average_salary = c(63500, 53900, 59900, 59800, 57900, 79000, 80000, 75000, 85000, 54000, 63800, 57000, 61000, 58600)
)
data2
#> State state_abbriviation City
#> 1 Washington WA Seatle
#> 2 Florida FL Tampa
#> 3 Nebraska NE Lincoln
#> 4 Indiana IN Indianapolis
#> 5 Florida FL Miami
#> 6 California CA Los Angeles
#> 7 New York NY New York
#> 8 New York NY Ithaca
#> 9 California CA San Francisco
#> 10 Michigan MI Detroit
#> 11 Texas TX Dallas
#> 12 Ohio OH Cleveland
#> 13 Arizona AZ Phoenix
#> 14 Utah UT Salt Lake City
#> Average_salary
#> 1 63500
#> 2 53900
#> 3 59900
#> 4 59800
#> 5 57900
#> 6 79000
#> 7 80000
#> 8 75000
#> 9 85000
#> 10 54000
#> 11 63800
#> 12 57000
#> 13 61000
#> 14 58600