11 Tidyverse II
11.1 Tidying/Pivoting
We previewed some of the tidyverse functions already because they are vastly superior to their Base R counterparts. Two of those were the pivot_longer() and pivot_wider() functions to reshape data. We can now incorporate those into the tidy workflow.
For example, we worked with the following data on election results in PA for PS2.
dat <- rio::import("/Users/marctrussler/Documents/GitHub/IDS-Data/PS2Question2.Rds")
#> Warning: Missing `trust` will be set to FALSE by default
#> for RDS in 2.0.0.
The final question on that dataset had you filter to the high-education counties and do some reshaping. To accomplish the same thing using the tidyverse:
dat |>
  rename(percent.college = V5) |>
  filter(percent.college > 40) |>
  mutate(county.name = gsub("PA_", "", county.name)) |>
  select(county.name, candidate, votes) |>
  pivot_wider(names_from = candidate,
              values_from = votes) |>
  # Convert each candidate's votes to a percentage of the county total
  mutate(Biden = 100*round(biden/(biden + trump + other), 4),
         Trump = 100*round(trump/(biden + trump + other), 4),
         Other = 100*round(other/(biden + trump + other), 4)) |>
  select(county.name, Biden, Trump, Other) |>
  pivot_longer(cols = Biden:Other,
               names_to = "cand",
               values_to = "percent") |>
  pivot_wider(names_from = cand,
              values_from = percent) |>
  kableExtra::kable()
county.name | Biden | Trump | Other |
---|---|---|---|
Centre County | 51.69 | 46.94 | 46.94 |
Chester County | 57.99 | 40.88 | 40.88 |
Allegheny County | 59.61 | 39.23 | 39.23 |
Montgomery County | 62.63 | 36.35 | 36.35 |
Bucks County | 51.66 | 47.29 | 47.29 |
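To see the two reshaping steps in isolation, here is a minimal sketch on a small made-up table. The data frame toy and its values are hypothetical and only stand in for the election data; the calls are the same pivot_wider() and pivot_longer() used in the pipeline.

```r
library(tidyr)

# Hypothetical toy data in "long" form: one row per county-candidate pair
toy <- data.frame(
  county    = c("A", "A", "B", "B"),
  candidate = c("biden", "trump", "biden", "trump"),
  votes     = c(60, 40, 30, 70)
)

# Long -> wide: one row per county, one column per candidate
wide <- pivot_wider(toy, names_from = candidate, values_from = votes)

# Wide -> long: back to one row per county-candidate pair
long <- pivot_longer(wide, cols = biden:trump,
                     names_to = "candidate", values_to = "votes")
```

Going wide makes it easy to do arithmetic across candidates within a county; going long again puts the data back into one-observation-per-row form.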
11.2 Mutate Add ons
data(flights)
The mutate command is very powerful, and there are two particular add-ons we can use to make the creation of variables more effective. Both of them use conditional logic to help create new variables. Everything that we have learned to this point about conditional logic applies to these cases.
Let’s say that in the flights data we want a variable that tells us whether or not the arrival delay was more than an hour.
In base R we would do:
flights$hour.delay <- flights$arr_delay>60
table(flights$hour.delay)
#>
#> FALSE TRUE
#> 299557 27789
If a TRUE/FALSE variable is all we need, we can just use mutate directly:
flights <- flights |>
mutate(hour.delay = arr_delay>60)
flights |>
group_by(hour.delay) |>
summarise(count = n()) |>
mutate(prop = count/sum(count))
#> # A tibble: 3 × 3
#> hour.delay count prop
#> <lgl> <int> <dbl>
#> 1 FALSE 299557 0.889
#> 2 TRUE 27789 0.0825
#> 3 NA 9430 0.0280
But what if we wanted to do this in characters, where a delay of more than an hour says “Yes” and anything shorter says “No”?
In base R:
flights$hour.delay <- NA
flights$hour.delay[flights$arr_delay>60] <- "Yes"
flights$hour.delay[flights$arr_delay<=60] <- "No"
table(flights$hour.delay)
#>
#> No Yes
#> 299557 27789
In tidy we can make use of if_else() inside of mutate. This works by putting a logical condition as the first argument, then what you want to set the variable to if the condition is true, then what you want to set the variable to if the condition is false.
#Just doing this to delete the baseR work we did above
flights$hour.delay <- NULL
flights |>
mutate(hour.delay = if_else(arr_delay>60,"Yes","No")) |>
group_by(hour.delay) |>
summarise(n())
#> # A tibble: 3 × 2
#> hour.delay `n()`
#> <chr> <int>
#> 1 No 299557
#> 2 Yes 27789
#> 3 <NA> 9430
This did what we want. Importantly: it also preserved the NA values that we had in that column. It didn’t set those to “No”, which would be bad.
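The NA behavior is easiest to see on a tiny vector. This is a minimal sketch in which the vector delay is made up for illustration; the missing argument is an optional part of dplyr's if_else() for cases where you do want to set the NAs to something.

```r
library(dplyr)

# Hypothetical arrival delays, including one missing value
delay <- c(5, 90, NA)

# if_else() leaves NA as NA rather than forcing it into "Yes" or "No"
kept_na <- if_else(delay > 60, "Yes", "No")

# The optional `missing` argument sets a value for the NA cases explicitly
filled_na <- if_else(delay > 60, "Yes", "No", missing = "No Data")
```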
Another example: what if we want a variable that gives state of origin? There are three airports in the NYC area: LaGuardia, JFK (both in NY), and Newark (in NJ). Using if_else we can therefore do:
flights |>
mutate(state.origin = if_else(origin %in% c("JFK","LGA"), "NY", "NJ")) |>
group_by(state.origin) |>
summarize(n())
#> # A tibble: 2 × 2
#> state.origin `n()`
#> <chr> <int>
#> 1 NJ 120835
#> 2 NY 215941
The other addition to mutate, which is actually much more flexible, is case_when(). Using this function, we can define multiple conditions and specify exactly what we want the new variable to be set to when each condition is met.
So say we want a new variable, “delay degree”, that is set to “Early” if the arrival delay is negative, “On Time” if the arrival delay is at least 0 but less than 10 minutes, and “Late” if the arrival delay is 10 minutes or more.
In base R:
flights$delay.degree <- NA
flights$delay.degree[flights$arr_delay<0] <- "Early"
flights$delay.degree[flights$arr_delay>=0 & flights$arr_delay<10] <- "On Time"
flights$delay.degree[flights$arr_delay>=10] <- "Late"
table(flights$delay.degree)
#>
#> Early Late On Time
#> 188933 94994 43419
Using case_when():
flights$delay.degree <- NULL
flights |>
mutate(delay.degree = case_when(arr_delay<0 ~ "Early",
arr_delay>=0 & arr_delay<10 ~ "On Time",
arr_delay>=10 ~ "Late")) |>
group_by(delay.degree) |>
summarize(n())
#> # A tibble: 4 × 2
#> delay.degree `n()`
#> <chr> <int>
#> 1 Early 188933
#> 2 Late 94994
#> 3 On Time 43419
#> 4 <NA> 9430
If we want to set those NAs to something, say “No Data”, there are two ways that we could do that. First, we could explicitly capture the NAs:
flights$delay.degree <- NULL
flights |>
mutate(delay.degree = case_when(arr_delay<0 ~ "Early",
arr_delay>=0 & arr_delay<10 ~ "On Time",
arr_delay>=10 ~ "Late",
is.na(arr_delay) ~ "No Data")) |>
group_by(delay.degree) |>
summarize(n())
#> # A tibble: 4 × 2
#> delay.degree `n()`
#> <chr> <int>
#> 1 Early 188933
#> 2 Late 94994
#> 3 No Data 9430
#> 4 On Time 43419
The other option, using case_when(), is that we can set a default value for any row that matches none of the conditions:
flights$delay.degree <- NULL
flights |>
mutate(delay.degree = case_when(arr_delay<0 ~ "Early",
arr_delay>=0 & arr_delay<10 ~ "On Time",
arr_delay>=10 ~ "Late",
.default = "No Data")) |>
group_by(delay.degree) |>
summarize(n())
#> # A tibble: 4 × 2
#> delay.degree `n()`
#> <chr> <int>
#> 1 Early 188933
#> 2 Late 94994
#> 3 No Data 9430
#> 4 On Time 43419
I don’t like that it’s .default = "No Data" instead of .default ~ "No Data". They should fix that.
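One detail worth seeing on a small example: case_when() checks the conditions in order and uses the first one that is TRUE, and a condition that evaluates to NA counts as not matching. The sketch below uses a made-up vector delay; because of the ordering, the second condition does not need to repeat the lower bound.

```r
library(dplyr)

# Hypothetical arrival delays, including one missing value
delay <- c(-5, 3, 25, NA)

degree <- case_when(delay < 0   ~ "Early",
                    delay < 10  ~ "On Time",  # only reached if delay >= 0
                    delay >= 10 ~ "Late",
                    .default = "No Data")     # NA matches nothing above
```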
11.3 Joining
The merge commands also have analogues in the tidy world.
In base R the merge() command works pretty well. In particular, through the use of the all.x=T argument we could control what gets kept in a certain merge.
So for example, if we want to merge more information about the planes into our flights dataset using tailnum, we can do that via the merge command:
new <- merge(flights, planes, by = "tailnum")
Our new dataset has significantly fewer rows than it did before, because by default R drops the rows that do not have a match. We learned that we can specifically override that behavior:
new <- merge(flights, planes, by ="tailnum", all.x=T)
We are going to do something similar using tidyverse, with the difference being that we will use specific functions for specific behavior in terms of dropping or keeping non-matches.
To replicate the default behavior of the base R merge function we will use inner_join, which keeps only the cases that have a match in both datasets:
new <- inner_join(flights, planes,join_by(tailnum))
We see that the join function works similarly to the merge function. We give it both datasets and tell it the variable we want to join by.
What if we want to keep all the rows in flights? To do that we can use left_join, which keeps every row of the first dataset whether or not it has a match.
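A minimal sketch of the difference, using two small made-up tables. The names flights_toy and planes_toy and their contents are hypothetical stand-ins for flights and planes; the last comment shows what the equivalent call on the real data would presumably look like.

```r
library(dplyr)

# Hypothetical stand-ins for flights and planes
flights_toy <- tibble(flight = 1:3, tailnum = c("N1", "N2", "N9"))
planes_toy  <- tibble(tailnum = c("N1", "N2"), model = c("A320", "737"))

# inner_join() keeps only the flights whose tailnum appears in planes_toy
inner <- inner_join(flights_toy, planes_toy, join_by(tailnum))

# left_join() keeps every flight; the unmatched row gets NA for model
left <- left_join(flights_toy, planes_toy, join_by(tailnum))

# On the real data the same pattern would be:
# new <- left_join(flights, planes, join_by(tailnum))
```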
There are a lot of columns in planes. What if we don’t want to use all of them in the join, and just merge in manufacturer and model?
We will investigate a couple of ways of doing this.
First we could reduce the columns in planes, save a new dataset and then merge that dataset:
planes2 <- planes |>
select(tailnum, manufacturer, model)
#Remember we need to keep tailnum so we can merge!
left_join(flights, planes2)
#> Joining with `by = join_by(tailnum)`
#> # A tibble: 336,776 × 21
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <dbl>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 15 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, manufacturer <chr>,
#> # model <chr>
Note that I’ve omitted join_by and tidy took a guess at what to use to join the two datasets. It got it right here, but generally I try not to use this functionality because I want to be clear about what I’m doing!
Another possibility is to do the select stage within the left_join function. This demonstrates a pretty cool part of tidyverse, where we can actually embed a whole piped set of commands in another set of piped commands:
left_join(flights, planes |> select(tailnum, manufacturer, model))
#> Joining with `by = join_by(tailnum)`
#> # A tibble: 336,776 × 21
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <dbl>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 15 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, manufacturer <chr>,
#> # model <chr>
The third way that we can accomplish this is using right_join. I’m going to take a second to get us there.
First, we can use left_join() as part of a set of piped commands:
flights |>
mutate(mph = distance/(air_time/60)) |>
left_join(planes, join_by(tailnum))
#> # A tibble: 336,776 × 28
#> year.x month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <dbl>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 22 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, mph <dbl>,
#> # year.y <int>, type <chr>, manufacturer <chr>, …
Note that, like all other tidy commands, the left_join command uses the piped-in dataset as the data we are merging into (and will keep all of those rows).
right_join treats the second dataset as the dataset to merge into, and keeps all of its rows. Using this functionality we can start with the planes dataset, make the changes that we want, and then merge that into the flights dataset:
planes |>
select(tailnum, manufacturer, model) |>
right_join(flights, join_by(tailnum))
#> # A tibble: 336,776 × 21
#> tailnum manufacturer model year month day dep_time
#> <chr> <chr> <chr> <int> <int> <int> <int>
#> 1 N10156 EMBRAER EMB-145XR 2013 1 10 626
#> 2 N10156 EMBRAER EMB-145XR 2013 1 10 1120
#> 3 N10156 EMBRAER EMB-145XR 2013 1 10 1619
#> 4 N10156 EMBRAER EMB-145XR 2013 1 11 632
#> 5 N10156 EMBRAER EMB-145XR 2013 1 11 1116
#> 6 N10156 EMBRAER EMB-145XR 2013 1 11 1845
#> 7 N10156 EMBRAER EMB-145XR 2013 1 12 830
#> 8 N10156 EMBRAER EMB-145XR 2013 1 12 1410
#> 9 N10156 EMBRAER EMB-145XR 2013 1 13 1551
#> 10 N10156 EMBRAER EMB-145XR 2013 1 13 2221
#> # ℹ 336,766 more rows
#> # ℹ 14 more variables: sched_dep_time <int>,
#> # dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>,
#> # origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>
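Put differently, piping the small table into right_join() keeps the same rows as piping the main table into left_join(); only the column order differs. A minimal sketch, where flights_toy and planes_toy are made-up stand-ins for flights and planes:

```r
library(dplyr)

# Hypothetical stand-ins for flights and planes
flights_toy <- tibble(flight = 1:3, tailnum = c("N1", "N2", "N9"))
planes_toy  <- tibble(tailnum = c("N1", "N2"), model = c("A320", "737"))

# Start from the small table, keep every row of the table on the right
from_right <- planes_toy |>
  right_join(flights_toy, join_by(tailnum))

# Start from the main table, keep every row of the table on the left
from_left <- flights_toy |>
  left_join(planes_toy, join_by(tailnum))
```

Both results have one row per flight, with NA for model where the plane is unknown.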
All three of these methods will be useful at different points in time. I use right_join all the time when I load in some extra data but want to make some adjustments before I merge it into my “main” dataset.
Now in all of these cases the variable we are merging on has the same name in both datasets. That obviously won’t always be the case.
The join_by() option we use is secretly doing this behind the scenes:
left_join(flights, planes, join_by(tailnum==tailnum))
#> # A tibble: 336,776 × 27
#> year.x month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <dbl>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 21 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, year.y <int>,
#> # type <chr>, manufacturer <chr>, model <chr>, …
When the names differ, we put the variable name from the first dataset on the left of the == and the variable name from the second dataset on the right:
planes2 <- planes2 |>
rename(tailnumber=tailnum)
left_join(flights, planes2, join_by(tailnum==tailnumber))
#> # A tibble: 336,776 × 21
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <dbl>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 533 529 4
#> 3 2013 1 1 542 540 2
#> 4 2013 1 1 544 545 -1
#> 5 2013 1 1 554 600 -6
#> 6 2013 1 1 554 558 -4
#> 7 2013 1 1 555 600 -5
#> 8 2013 1 1 557 600 -3
#> 9 2013 1 1 557 600 -3
#> 10 2013 1 1 558 600 -2
#> # ℹ 336,766 more rows
#> # ℹ 15 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, manufacturer <chr>,
#> # model <chr>
To join on more than one variable, we can just add them to join_by(). For example, what if we want to merge in the average wind speed for every origin airport on every day? This is also a good opportunity to show right_join():
weather |>
group_by(origin, year, month, day) |>
summarize(avg.wind = mean(wind_speed, na.rm=T)) |>
right_join(flights, join_by(origin, year, month, day))
#> `summarise()` has grouped output by 'origin', 'year',
#> 'month'. You can override using the `.groups` argument.
#> # A tibble: 336,776 × 20
#> # Groups: origin, year, month [36]
#> origin year month day avg.wind dep_time sched_dep_time
#> <chr> <int> <int> <int> <dbl> <int> <int>
#> 1 EWR 2013 1 1 13.2 517 515
#> 2 EWR 2013 1 1 13.2 554 558
#> 3 EWR 2013 1 1 13.2 555 600
#> 4 EWR 2013 1 1 13.2 558 600
#> 5 EWR 2013 1 1 13.2 559 600
#> 6 EWR 2013 1 1 13.2 601 600
#> 7 EWR 2013 1 1 13.2 606 610
#> 8 EWR 2013 1 1 13.2 607 607
#> 9 EWR 2013 1 1 13.2 608 600
#> 10 EWR 2013 1 1 13.2 615 615
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: dep_delay <dbl>, arr_time <int>,
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
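The message in the output above mentions the .groups argument, and the joined result is still grouped. If that is not wanted, one option is to drop the grouping inside summarize(). A minimal sketch, where weather_toy is a made-up stand-in for weather:

```r
library(dplyr)

# Hypothetical stand-in for the weather data
weather_toy <- tibble(origin     = c("EWR", "EWR", "JFK", "JFK"),
                      day        = c(1, 1, 1, 2),
                      wind_speed = c(10, 14, 8, 12))

# .groups = "drop" returns an ungrouped tibble and silences the message
daily <- weather_toy |>
  group_by(origin, day) |>
  summarize(avg.wind = mean(wind_speed, na.rm = TRUE), .groups = "drop")
```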
11.4 Functions
In the introduction to tidyverse I said what we were doing here was “functional” programming. This broadly means that the tidyverse developers prefer to “hide” the complicated stuff within functions and avoid directly manipulating vectors and matrices. In particular, the school of thought that created tidyverse wants to avoid copying and pasting at all costs, and instead to use or write functions that accomplish what you want to accomplish.
I think group_by() is an excellent example of this. Instead of explicitly pulling out the groups, looping over them, and constructing a new, aggregated dataset, tidyverse just hides that stuff away and lets you use group_by and summarise to do that work.
That’s all fine and good if you are doing something that the makers of tidyverse expect, but what if you want to do something a bit different? In this case, what you often will want to do is to write your own functions that you can then make use of in your code.
This is a big topic so I’m just going to do a couple of examples here which you can hopefully generalize from.
I’m going to take some of the nycflights data as an example:
data(flights)
flights <- flights |>
select(carrier,flight, dep_delay, arr_delay, distance, air_time)
A common thing we might do with variables is to re-code them to be “standardized”, which is statistics-speak for making all variables have a mean of 0 and a standard deviation of 1. You don’t really have to care right now about why we would want to do that, just believe me that it’s a good goal.
Mathematically, we standardize a variable by subtracting its mean from each value and dividing by its standard deviation:
\[ x_s = \frac{x - \bar{x}}{sd(x)} \]
If I wanted to standardize the 4 continuous variables in my dataset using tidyverse right now I can do:
flights |>
mutate(dep_delay = (dep_delay- mean(dep_delay,na.rm=T))/sd(dep_delay,na.rm=T),
arr_delay = (arr_delay- mean(arr_delay,na.rm=T))/sd(arr_delay,na.rm=T),
distance = (distance- mean(distance,na.rm=T))/sd(distance,na.rm=T),
air_time = (air_time- mean(air_time,na.rm=T))/sd(air_time,na.rm=T)) |>
summarise(mean(dep_delay,na.rm=T),
sd(dep_delay,na.rm=T))
#> # A tibble: 1 × 2
#> `mean(dep_delay, na.rm = T)` `sd(dep_delay, na.rm = T)`
#> <dbl> <dbl>
#> 1 2.34e-13 1.00
But notice that in each line in mutate I am doing the same thing every time. All that’s changing is the variable that we are inputting into the algorithm that standardizes a variable.
We could imagine writing something like the following, where we generically set a vector equal to x, and then standardize x:
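A minimal sketch of that idea. The vector here is made up for illustration; in the notes x would presumably be a column such as flights$dep_delay.

```r
# Hypothetical small vector standing in for a column of the data
x <- c(2, 4, NA, 10)

# Standardize: subtract the mean, then divide by the standard deviation
x_s <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
```

The code never mentions a specific column name, so the same two lines work for whatever we assign to x.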
Writing a function in R allows us to do this in a better way. We can tell R that there is a new function that has a vector as an input. When we call that function it will do the operation that we wish on that vector.
So to define our function for standardizing:
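A minimal sketch of such a definition, assuming it mirrors the mutate() code above (subtract the mean, divide by the standard deviation, and drop missing values with na.rm):

```r
# Standardize a numeric vector to mean 0 and standard deviation 1.
# Assumption: NAs are ignored when computing the mean and sd, as in the
# mutate() version; they stay NA in the result.
stdrz <- function(x){
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Quick check on a vector with mean 2 and sd 1
check <- stdrz(c(1, 2, 3))  # returns -1, 0, 1
```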
Our function is called stdrz. That function takes one argument, x, which can be any numeric vector. Inside the function it performs the standardization math and returns the vector, now standardized. We can see that this function is now stored in our environment, and if we click on the scroll icon next to it we can see what that function does.
Applying this function:
head(stdrz(x=flights$dep_delay))
#> [1] -0.2645873 -0.2148485 -0.2645873 -0.3391955 -0.4635425
#> [6] -0.4138037
We can shortcut this by not writing x=. Like other functions we have seen, R is looking for x as the first argument, so we don’t need to actually write that if we don’t want to.
head(stdrz(flights$dep_delay))
#> [1] -0.2645873 -0.2148485 -0.2645873 -0.3391955 -0.4635425
#> [6] -0.4138037
To make this more explicit, we are writing our own function that is just like mean() or sd(). Those functions similarly just take inputs and do some math on them. If we wanted to use this functionality to re-write the mean function we could:
#using var instead of x to show it doesn't really matter, but we have to match that to what's in the function
our.mean <- function(var){
sum(var)/length(var)
}
our.mean(var=0:10)
#> [1] 5
our.mean(0:10)
#> [1] 5
Returning to our standardizing function, this allows us to shortcut the steps we were doing in mutate above:
flights |>
mutate(dep_delay = stdrz(dep_delay),
arr_delay = stdrz(arr_delay),
distance = stdrz(distance),
air_time = stdrz(air_time)) |>
summarise(mean(dep_delay,na.rm=T),
sd(dep_delay,na.rm=T))
#> # A tibble: 1 × 2
#> `mean(dep_delay, na.rm = T)` `sd(dep_delay, na.rm = T)`
#> <dbl> <dbl>
#> 1 2.34e-13 1.00
Functions are a great shortcut and can help you speed up repetitive tasks. This is really the minimum case covered here. If you are interested in learning more (particularly in writing functions similar to the other tidy verbs) I would encourage you to look at the R for Data Science textbook.
11.5 Iteration
In the above we are still doing a little bit of copying and pasting because we are calling stdrz 4 separate times. We can actually shortcut this further with a powerful verb, across. This verb lets us apply the same function to multiple columns at the same time:
flights |>
mutate(across(.cols=dep_delay:air_time, .fns=stdrz))
#> # A tibble: 336,776 × 6
#> carrier flight dep_delay arr_delay distance air_time
#> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 UA 1545 -0.265 0.0920 0.491 0.815
#> 2 UA 1714 -0.215 0.294 0.513 0.815
#> 3 AA 1141 -0.265 0.585 0.0669 0.0994
#> 4 B6 725 -0.339 -0.558 0.731 0.345
#> 5 DL 461 -0.464 -0.715 -0.379 -0.370
#> 6 UA 1696 -0.414 0.114 -0.438 -0.00733
#> 7 B6 507 -0.439 0.271 0.0342 0.0781
#> 8 EV 5708 -0.389 -0.468 -1.11 -1.04
#> 9 B6 79 -0.389 -0.334 -0.131 -0.114
#> 10 AA 301 -0.364 0.0247 -0.419 -0.135
#> # ℹ 336,766 more rows
We can provide any number of columns in the first argument, and then any function we want in the second argument (including a function that we made ourselves).
We can use the power of select() in defining which columns get the function applied to them. See that section of the notes to see what you can do, but for example:
flights |>
mutate(across(starts_with("dep"), stdrz))
#> # A tibble: 336,776 × 6
#> carrier flight dep_delay arr_delay distance air_time
#> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 UA 1545 -0.265 11 1400 227
#> 2 UA 1714 -0.215 20 1416 227
#> 3 AA 1141 -0.265 33 1089 160
#> 4 B6 725 -0.339 -18 1576 183
#> 5 DL 461 -0.464 -25 762 116
#> 6 UA 1696 -0.414 12 719 150
#> 7 B6 507 -0.439 19 1065 158
#> 8 EV 5708 -0.389 -14 229 53
#> 9 B6 79 -0.389 -8 944 140
#> 10 AA 301 -0.364 8 733 138
#> # ℹ 336,766 more rows
A special case is selecting all of the variables to apply the function to, though we have to ensure that the function can be applied to all of the variables. For example, here we would first have to remove the character variable from flights:
flights |>
select(-carrier) |>
mutate(across(everything(), stdrz))
#> # A tibble: 336,776 × 5
#> flight dep_delay arr_delay distance air_time
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.262 -0.265 0.0920 0.491 0.815
#> 2 -0.158 -0.215 0.294 0.513 0.815
#> 3 -0.509 -0.265 0.585 0.0669 0.0994
#> 4 -0.764 -0.339 -0.558 0.731 0.345
#> 5 -0.926 -0.464 -0.715 -0.379 -0.370
#> 6 -0.169 -0.414 0.114 -0.438 -0.00733
#> 7 -0.897 -0.439 0.271 0.0342 0.0781
#> 8 2.29 -0.389 -0.468 -1.11 -1.04
#> 9 -1.16 -0.389 -0.334 -0.131 -0.114
#> 10 -1.02 -0.364 0.0247 -0.419 -0.135
#> # ℹ 336,766 more rows
One way around this is that we can use where(), which allows us to specify that we want to apply the function to a certain class of variables:
flights |>
mutate(across(where(is.numeric), stdrz))
#> # A tibble: 336,776 × 6
#> carrier flight dep_delay arr_delay distance air_time
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 UA -0.262 -0.265 0.0920 0.491 0.815
#> 2 UA -0.158 -0.215 0.294 0.513 0.815
#> 3 AA -0.509 -0.265 0.585 0.0669 0.0994
#> 4 B6 -0.764 -0.339 -0.558 0.731 0.345
#> 5 DL -0.926 -0.464 -0.715 -0.379 -0.370
#> 6 UA -0.169 -0.414 0.114 -0.438 -0.00733
#> 7 B6 -0.897 -0.439 0.271 0.0342 0.0781
#> 8 EV 2.29 -0.389 -0.468 -1.11 -1.04
#> 9 B6 -1.16 -0.389 -0.334 -0.131 -0.114
#> 10 AA -1.02 -0.364 0.0247 -0.419 -0.135
#> # ℹ 336,766 more rows
We can use the across() function in summarize as well. For example, what if we want to get the mean of all these variables?
flights |>
summarize(across(where(is.numeric), mean))
#> # A tibble: 1 × 5
#> flight dep_delay arr_delay distance air_time
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1972. NA NA 1040. NA
Now we have a classic problem here. Several of our variables have NAs in them, which means the mean is not computed. We may want to do something like this:
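A minimal sketch of that kind of attempt, on a small made-up tibble df rather than flights. The call is wrapped in tryCatch() only so the example runs to completion; the point is that it fails.

```r
library(dplyr)

# Hypothetical small data with missing values
df <- tibble(a = c(1, NA, 3), b = c(2, 4, NA))

# mean(na.rm = TRUE) is evaluated right away, with no x supplied,
# so R raises an error instead of handing across() a function
attempt <- tryCatch(
  df |> summarize(across(where(is.numeric), mean(na.rm = TRUE))),
  error = function(e) "error"
)
```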
But that will give us the error: ! argument "x" is missing, with no default. So that doesn’t work! Indeed, this applies to any function we want to use in across() that has certain options we want to set. We can’t pass them in a straightforward way.
Now this is where the “functional programming” thing goes off the rails a little bit and I start to find it a bit annoying. The official recommendation here is to write a new function that has the options applied and then put that in across:
mean_na <- function(x){
mean(x, na.rm=T)
}
flights |>
summarize(across(where(is.numeric), mean_na))
#> # A tibble: 1 × 5
#> flight dep_delay arr_delay distance air_time
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1972. 12.6 6.90 1040. 151.
The “shortcut” that is provided is that you can write a little mini-function inside of across using a backslash like this:
flights |>
summarize(across(where(is.numeric), \(x) mean(x,na.rm=T)))
#> # A tibble: 1 × 5
#> flight dep_delay arr_delay distance air_time
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1972. 12.6 6.90 1040. 151.
Is that helpful? Honestly, this is stuff at the edge of my current knowledge/experience of tidyverse. I see the use cases for across() but I find the “just embed another function!” stuff to be a bit of a pain in the ass, and an overly dogmatic approach to an ideology (functional programming) that I have no real interest in. If you are going down this road and are enjoying it I would encourage you to check out the textbook I am pulling from that is linked at the start of these chapters. Truly: it’s hard for me to change my stripes because I’ve been doing things a certain way for a long time. If you can habituate yourself to this newer way of doing things it seems like it is very effective!