11 Tidyverse II

11.1 Tidying/Pivoting

We previewed some of the tidyverse functions already because they are vastly superior to their base R counterparts. Two of those were the pivot functions, pivot_wider() and pivot_longer(), which reshape data. We can now incorporate those into the tidy workflow.

For example, we worked with the following data on election results in PA for PS2.

dat <- rio::import("/Users/marctrussler/Documents/GitHub/IDS-Data/PS2Question2.Rds")
#> Warning: Missing `trust` will be set to FALSE by default
#> for RDS in 2.0.0.

The final question on that dataset had you filter to the high-education counties and do some reshaping. To accomplish the same thing using the tidyverse:

dat |> 
  rename(percent.college = V5) |> 
  filter(percent.college>40) |> 
  mutate(county.name = gsub("PA_","", county.name)) |> 
  select(county.name, candidate, votes) |> 
  pivot_wider(names_from = candidate, 
              values_from = votes) |> 
  mutate(Biden = 100*round(biden/(biden + trump + other),4),
         Trump = 100*round(trump/(biden + trump + other),4),
         Other = 100*round(other/(biden + trump + other),4)) |> 
  select(county.name, Biden, Trump, Other) |> 
  pivot_longer(cols = Biden:Other, 
               names_to = "cand",
               values_to = "percent") |> 
  pivot_wider(names_from = cand, 
              values_from = percent) |> 
  kableExtra::kable()

county.name	Biden	Trump	Other
Centre County	51.69	46.94	1.37
Chester County	57.99	40.88	1.13
Allegheny County	59.61	39.23	1.16
Montgomery County	62.63	36.35	1.02
Bucks County	51.66	47.29	1.05

The Other column here is filled in as the remainder after the Biden and Trump shares, so individual entries may be off by 0.01 because of rounding.
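
To see the two pivot functions in isolation, here is a minimal sketch on made-up data. The tibble votes_long and its values are hypothetical and are not the PS2 data; the point is only that pivot_wider() spreads one column's values into new columns and pivot_longer() reverses that.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per county-candidate pair ("long" format)
votes_long <- tibble(
  county    = c("A", "A", "B", "B"),
  candidate = c("biden", "trump", "biden", "trump"),
  votes     = c(60, 40, 30, 70)
)

# Long -> wide: one row per county, one column per candidate
votes_wide <- votes_long |>
  pivot_wider(names_from = candidate,
              values_from = votes)

# Wide -> long: back to one row per county-candidate pair
votes_back <- votes_wide |>
  pivot_longer(cols = biden:trump,
               names_to = "candidate",
               values_to = "votes")
```

Going wide is what makes the row-wise percentage math in mutate() possible, because each candidate's votes end up in their own column.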

11.2 Mutate Add ons

data(flights)

The mutate command is very powerful, and there are two particular add-ons we can use to make the creation of variables more effective. Both of them use conditional logic to help create new variables. Everything that we have learned to this point about conditional logic applies to these cases.

Let’s say that in the flights data we want a variable that tells us whether the arrival delay was more than an hour or not.

In base R we would do:

flights$hour.delay <- flights$arr_delay>60
table(flights$hour.delay)
#> 
#>  FALSE   TRUE 
#> 299557  27789

If that is all we need, we can use mutate() directly:

flights <- flights |> 
  mutate(hour.delay = arr_delay>60)

flights |> 
  group_by(hour.delay) |> 
  summarise(count = n()) |> 
  mutate(prop = count/sum(count))
#> # A tibble: 3 × 3
#>   hour.delay  count   prop
#>   <lgl>       <int>  <dbl>
#> 1 FALSE      299557 0.889 
#> 2 TRUE        27789 0.0825
#> 3 NA           9430 0.0280

But what if we wanted to do this with character values, where a delay of more than an hour says “Yes” and a delay of an hour or less says “No”?

In base R:

flights$hour.delay <- NA
flights$hour.delay[flights$arr_delay>60] <- "Yes"
flights$hour.delay[flights$arr_delay<=60] <- "No"
table(flights$hour.delay)
#> 
#>     No    Yes 
#> 299557  27789

In tidy we can use if_else() inside mutate(). The first argument is a logical condition, the second is the value to use where the condition is true, and the third is the value to use where the condition is false.

#Just doing this to delete the baseR work we did above
flights$hour.delay <- NULL

flights |> 
  mutate(hour.delay = if_else(arr_delay>60,"Yes","No")) |> 
  group_by(hour.delay) |> 
  summarise(n())
#> # A tibble: 3 × 2
#>   hour.delay  `n()`
#>   <chr>       <int>
#> 1 No         299557
#> 2 Yes         27789
#> 3 <NA>         9430

This did what we want. Importantly: it also preserved the NA that we had in that column. It didn’t set those to “No” which would be bad.
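
If we do want a label for the missing cases, if_else() has a missing argument for exactly that. A small sketch on a made-up vector (delay is hypothetical, not the flights data):

```r
library(dplyr)

# Hypothetical delays in minutes, including one missing value
delay <- c(75, 10, NA, 61)

# Default behaviour: where the condition is NA, the result stays NA
default_out <- if_else(delay > 60, "Yes", "No")

# `missing =` supplies a value for rows where the condition is NA
labelled_out <- if_else(delay > 60, "Yes", "No", missing = "Unknown")
```

Here default_out keeps the NA in the third position, while labelled_out has "Unknown" there instead.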

Another example: what if we want a variable that gives state of origin? There are three airports in the NYC area: LaGuardia, JFK (both in NY), and Newark (in NJ). Using if_else we can therefore do:

flights |> 
  mutate(state.origin = if_else(origin %in% c("JFK","LGA"), "NY", "NJ")) |> 
  group_by(state.origin) |> 
  summarize(n())
#> # A tibble: 2 × 2
#>   state.origin  `n()`
#>   <chr>         <int>
#> 1 NJ           120835
#> 2 NY           215941

The other addition to mutate, which is actually much more flexible, is case_when(). Using this function, we can define multiple conditions and specify exactly what we want the new variable to be set to when each condition is met.

So say we want a new variable, “delay degree”, that is set to “Early” if the arrival delay is negative, “On Time” if the arrival delay is at least 0 but less than 10 minutes, and “Late” if the arrival delay is 10 minutes or more.

In base R:

flights$delay.degree <- NA
flights$delay.degree[flights$arr_delay<0] <- "Early"
flights$delay.degree[flights$arr_delay>=0 & flights$arr_delay<10] <- "On Time"
flights$delay.degree[flights$arr_delay>=10] <- "Late"
table(flights$delay.degree)
#> 
#>   Early    Late On Time 
#>  188933   94994   43419

Using case_when():

flights$delay.degree <- NULL

flights |> 
  mutate(delay.degree = case_when(arr_delay<0 ~ "Early",
                                  arr_delay>=0 & arr_delay<10 ~ "On Time",
                                  arr_delay>=10 ~ "Late")) |> 
  group_by(delay.degree) |> 
  summarize(n())
#> # A tibble: 4 × 2
#>   delay.degree  `n()`
#>   <chr>         <int>
#> 1 Early        188933
#> 2 Late          94994
#> 3 On Time       43419
#> 4 <NA>           9430

If we want to set those NAs to something, say “No Data”, there are two ways that we could do that. We could explicitly capture the NAs.

flights$delay.degree <- NULL

flights |> 
  mutate(delay.degree = case_when(arr_delay<0 ~ "Early",
                                  arr_delay>=0 & arr_delay<10 ~ "On Time",
                                  arr_delay>=10 ~ "Late",
                                  is.na(arr_delay) ~ "No Data")) |> 
  group_by(delay.degree) |> 
  summarize(n())
#> # A tibble: 4 × 2
#>   delay.degree  `n()`
#>   <chr>         <int>
#> 1 Early        188933
#> 2 Late          94994
#> 3 No Data        9430
#> 4 On Time       43419

The other option with case_when() is to set a default value for any row that matches none of the conditions. Rows with a missing arr_delay match none of the conditions, so they get the default:

flights$delay.degree <- NULL

flights |> 
  mutate(delay.degree = case_when(arr_delay<0 ~ "Early",
                                  arr_delay>=0 & arr_delay<10 ~ "On Time",
                                  arr_delay>=10 ~ "Late",
                                  .default = "No Data")) |> 
  group_by(delay.degree) |> 
  summarize(n())
#> # A tibble: 4 × 2
#>   delay.degree  `n()`
#>   <chr>         <int>
#> 1 Early        188933
#> 2 Late          94994
#> 3 No Data        9430
#> 4 On Time       43419

I don’t like that it’s .default = "No Data" instead of .default ~ "No Data". They should fix that.
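
One thing to keep in mind is that .default catches every row that matches no condition, not just the NAs. The sketch below uses a made-up vector x (hypothetical) to show that, alongside the older idiom of a final TRUE ~ condition, which does use the ~ and gives the same result. As far as I know the .default argument needs dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical arrival delays, including one missing value
x <- c(-5, 3, 20, NA)

# Newer style: .default fills in anything no condition matched
new_style <- case_when(x < 0 ~ "Early",
                       x >= 0 & x < 10 ~ "On Time",
                       x >= 10 ~ "Late",
                       .default = "No Data")

# Older style: a final condition that is always TRUE acts as the fallback
old_style <- case_when(x < 0 ~ "Early",
                       x >= 0 & x < 10 ~ "On Time",
                       x >= 10 ~ "Late",
                       TRUE ~ "No Data")

# If we forget the middle condition, the default quietly swallows x = 3 too
with_gap <- case_when(x < 0 ~ "Early",
                      x >= 10 ~ "Late",
                      .default = "No Data")
```

This is why explicitly capturing is.na() can be the safer choice when “No Data” is supposed to mean missing and nothing else.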

11.3 Joining

data(flights)
data(airlines)
data(weather)
data(planes)

The merge commands also have analogues in the tidy world.

In base R the merge() command works pretty well. In particular, through the all.x=T argument we could control what gets kept in a given merge.

So for example, if we want to merge more information about the planes into our flights dataset using tailnum we can do that via the merge command:

new <- merge(flights, planes, by = "tailnum")

Our new dataset has significantly fewer rows than it did before, because by default merge() drops the rows that do not have a match. We learned that we can specifically override that behavior:

new <- merge(flights, planes, by ="tailnum", all.x=T)

We are going to do something similar using tidyverse, with the difference being that we will use specific functions for specific behavior in terms of dropping or keeping non-matches.

To replicate the default behavior of the base R merge function we will use inner_join, which keeps only the rows that have a match in both datasets:

new <- inner_join(flights, planes,join_by(tailnum))

We see that the join function works similarly to the merge function. We give it both datasets and tell it the variable we want to join by.

What if we want to keep all the rows in flights? To do that we can use left_join:

new <- left_join(flights, planes, join_by(tailnum))
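
The difference between the two joins is easier to see on a tiny example. The two tibbles below are made up (flights_toy and planes_toy are hypothetical stand-ins), with one tail number that has no match. I have also added anti_join(), which is a handy way to see which rows failed to match.

```r
library(dplyr)

# Hypothetical stand-ins for flights and planes; "N9" has no match in planes_toy
flights_toy <- tibble(flight  = c(1, 2, 3),
                      tailnum = c("N1", "N2", "N9"))
planes_toy  <- tibble(tailnum = c("N1", "N2"),
                      model   = c("A320", "737"))

# inner_join keeps only the rows with a match in both: 2 rows
inner <- inner_join(flights_toy, planes_toy, join_by(tailnum))

# left_join keeps every row of the first dataset: 3 rows, model is NA for "N9"
left <- left_join(flights_toy, planes_toy, join_by(tailnum))

# anti_join returns the rows of the first dataset that have no match
unmatched <- anti_join(flights_toy, planes_toy, join_by(tailnum))
```

Comparing nrow(inner) to nrow(left) is a quick check on how many rows a join would drop.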

There are a lot of columns in planes. What if we don’t want to use all of them in the join, and just merge in manufacturer and model?

We will investigate a couple of ways of doing this.

First we could reduce the columns in planes, save a new dataset and then merge that dataset:

planes2 <- planes |> 
            select(tailnum, manufacturer, model)
#Remember we need to keep tailnum so we can merge!

left_join(flights, planes2)
#> Joining with `by = join_by(tailnum)`
#> # A tibble: 336,776 × 21
#>     year month   day dep_time sched_dep_time dep_delay
#>    <int> <int> <int>    <int>          <int>     <dbl>
#>  1  2013     1     1      517            515         2
#>  2  2013     1     1      533            529         4
#>  3  2013     1     1      542            540         2
#>  4  2013     1     1      544            545        -1
#>  5  2013     1     1      554            600        -6
#>  6  2013     1     1      554            558        -4
#>  7  2013     1     1      555            600        -5
#>  8  2013     1     1      557            600        -3
#>  9  2013     1     1      557            600        -3
#> 10  2013     1     1      558            600        -2
#> # ℹ 336,766 more rows
#> # ℹ 15 more variables: arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, manufacturer <chr>,
#> #   model <chr>

Note that I’ve omitted join_by and tidy took a guess at what to use to join the two datasets. It got it right here, but generally I try not to use this functionality because I want to be clear about what I’m doing!
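
Here is a sketch of how that guess can go wrong. When join_by is omitted, the join uses every column name the two datasets share. In the made-up tibbles below (hypothetical toy data), both have a year column, but it means something different in each, much like the year columns in flights and the full planes data.

```r
library(dplyr)

# Hypothetical toy data: `year` is the year flown in one table
# and the year the plane was built in the other
flights_toy <- tibble(tailnum = c("N1", "N2"),
                      year    = c(2013, 2013))
planes_toy  <- tibble(tailnum = c("N1", "N2"),
                      year    = c(1999, 2004),
                      model   = c("A320", "737"))

# No join_by: joins on tailnum AND year, so nothing matches
implicit <- left_join(flights_toy, planes_toy)

# Explicit join_by: joins on tailnum only, and keeps both year columns
explicit <- left_join(flights_toy, planes_toy, join_by(tailnum))
```

The implicit version runs without an error and simply returns NA for model, which is exactly the kind of silent problem that being explicit avoids.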

Another possibility is to do the select stage within the left_join function. This demonstrates a pretty cool part of tidyverse, where we can actually embed a whole piped set of commands in another set of piped commands:

left_join(flights, planes |>  select(tailnum, manufacturer, model))
#> Joining with `by = join_by(tailnum)`
#> # A tibble: 336,776 × 21
#>     year month   day dep_time sched_dep_time dep_delay
#>    <int> <int> <int>    <int>          <int>     <dbl>
#>  1  2013     1     1      517            515         2
#>  2  2013     1     1      533            529         4
#>  3  2013     1     1      542            540         2
#>  4  2013     1     1      544            545        -1
#>  5  2013     1     1      554            600        -6
#>  6  2013     1     1      554            558        -4
#>  7  2013     1     1      555            600        -5
#>  8  2013     1     1      557            600        -3
#>  9  2013     1     1      557            600        -3
#> 10  2013     1     1      558            600        -2
#> # ℹ 336,766 more rows
#> # ℹ 15 more variables: arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, manufacturer <chr>,
#> #   model <chr>

The third way that we can accomplish this is using right_join. I’m going to take a second to get us there.

First, we can use left_join() as part of a set of piped commands:

flights |> 
  mutate(mph = distance/(air_time/60)) |> 
  left_join(planes, join_by(tailnum))
#> # A tibble: 336,776 × 28
#>    year.x month   day dep_time sched_dep_time dep_delay
#>     <int> <int> <int>    <int>          <int>     <dbl>
#>  1   2013     1     1      517            515         2
#>  2   2013     1     1      533            529         4
#>  3   2013     1     1      542            540         2
#>  4   2013     1     1      544            545        -1
#>  5   2013     1     1      554            600        -6
#>  6   2013     1     1      554            558        -4
#>  7   2013     1     1      555            600        -5
#>  8   2013     1     1      557            600        -3
#>  9   2013     1     1      557            600        -3
#> 10   2013     1     1      558            600        -2
#> # ℹ 336,766 more rows
#> # ℹ 22 more variables: arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, mph <dbl>,
#> #   year.y <int>, type <chr>, manufacturer <chr>, …

Note that like all other tidy commands, the left_join command uses the piped in dataset as the data we are merging into (and will keep all of those rows).

right_join treats the second dataset as the dataset to merge into and keeps all of its rows. Using this functionality we can start with the planes dataset, make the changes that we want, and then merge that into the flights dataset:

planes |> 
  select(tailnum, manufacturer, model) |> 
  right_join(flights, join_by(tailnum))
#> # A tibble: 336,776 × 21
#>    tailnum manufacturer model      year month   day dep_time
#>    <chr>   <chr>        <chr>     <int> <int> <int>    <int>
#>  1 N10156  EMBRAER      EMB-145XR  2013     1    10      626
#>  2 N10156  EMBRAER      EMB-145XR  2013     1    10     1120
#>  3 N10156  EMBRAER      EMB-145XR  2013     1    10     1619
#>  4 N10156  EMBRAER      EMB-145XR  2013     1    11      632
#>  5 N10156  EMBRAER      EMB-145XR  2013     1    11     1116
#>  6 N10156  EMBRAER      EMB-145XR  2013     1    11     1845
#>  7 N10156  EMBRAER      EMB-145XR  2013     1    12      830
#>  8 N10156  EMBRAER      EMB-145XR  2013     1    12     1410
#>  9 N10156  EMBRAER      EMB-145XR  2013     1    13     1551
#> 10 N10156  EMBRAER      EMB-145XR  2013     1    13     2221
#> # ℹ 336,766 more rows
#> # ℹ 14 more variables: sched_dep_time <int>,
#> #   dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

All three of these methods will be useful at different points in time. I use right_join all the time when I load in some extra data but want to make some adjustments before I merge it in to my “main” dataset.

Now in all of these cases the variable we are merging on has the same name in both datasets. That obviously won’t always be the case.

When we give join_by() a single name, it is secretly doing this behind the scenes:

left_join(flights, planes, join_by(tailnum==tailnum))
#> # A tibble: 336,776 × 27
#>    year.x month   day dep_time sched_dep_time dep_delay
#>     <int> <int> <int>    <int>          <int>     <dbl>
#>  1   2013     1     1      517            515         2
#>  2   2013     1     1      533            529         4
#>  3   2013     1     1      542            540         2
#>  4   2013     1     1      544            545        -1
#>  5   2013     1     1      554            600        -6
#>  6   2013     1     1      554            558        -4
#>  7   2013     1     1      555            600        -5
#>  8   2013     1     1      557            600        -3
#>  9   2013     1     1      557            600        -3
#> 10   2013     1     1      558            600        -2
#> # ℹ 336,766 more rows
#> # ℹ 21 more variables: arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, year.y <int>,
#> #   type <chr>, manufacturer <chr>, model <chr>, …

To join on variables with different names, we put the name from the first dataset on the left of the == and the name from the second dataset on the right:

planes2 <- planes2 |> 
            rename(tailnumber=tailnum)

left_join(flights, planes2, join_by(tailnum==tailnumber))
#> # A tibble: 336,776 × 21
#>     year month   day dep_time sched_dep_time dep_delay
#>    <int> <int> <int>    <int>          <int>     <dbl>
#>  1  2013     1     1      517            515         2
#>  2  2013     1     1      533            529         4
#>  3  2013     1     1      542            540         2
#>  4  2013     1     1      544            545        -1
#>  5  2013     1     1      554            600        -6
#>  6  2013     1     1      554            558        -4
#>  7  2013     1     1      555            600        -5
#>  8  2013     1     1      557            600        -3
#>  9  2013     1     1      557            600        -3
#> 10  2013     1     1      558            600        -2
#> # ℹ 336,766 more rows
#> # ℹ 15 more variables: arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, manufacturer <chr>,
#> #   model <chr>
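
A small sketch of the same idea on made-up data (flights_toy and planes_toy are hypothetical), mostly to show which key name survives the join:

```r
library(dplyr)

# Hypothetical toy data where the key has a different name in each table
flights_toy <- tibble(flight     = c(1, 2),
                      tailnum    = c("N1", "N2"))
planes_toy  <- tibble(tailnumber = c("N1", "N2"),
                      model      = c("A320", "737"))

# Left side of == is the column in the first dataset, right side the second
joined <- left_join(flights_toy, planes_toy,
                    join_by(tailnum == tailnumber))

names(joined)
```

The result keeps the key under the first dataset's name (tailnum), and tailnumber does not appear as a separate column.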

To join on more than one variable, we can just add them to join_by(). For example, what if we want to merge in the average wind speed for every origin airport on every day? This is also a good opportunity to show right_join():

weather |> 
  group_by(origin, year, month, day) |> 
  summarize(avg.wind = mean(wind_speed, na.rm=T)) |> 
  right_join(flights, join_by(origin, year, month, day))
#> `summarise()` has grouped output by 'origin', 'year',
#> 'month'. You can override using the `.groups` argument.
#> # A tibble: 336,776 × 20
#> # Groups:   origin, year, month [36]
#>    origin  year month   day avg.wind dep_time sched_dep_time
#>    <chr>  <int> <int> <int>    <dbl>    <int>          <int>
#>  1 EWR     2013     1     1     13.2      517            515
#>  2 EWR     2013     1     1     13.2      554            558
#>  3 EWR     2013     1     1     13.2      555            600
#>  4 EWR     2013     1     1     13.2      558            600
#>  5 EWR     2013     1     1     13.2      559            600
#>  6 EWR     2013     1     1     13.2      601            600
#>  7 EWR     2013     1     1     13.2      606            610
#>  8 EWR     2013     1     1     13.2      607            607
#>  9 EWR     2013     1     1     13.2      608            600
#> 10 EWR     2013     1     1     13.2      615            615
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: dep_delay <dbl>, arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>
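
Notice that the result above is still grouped, because summarize() only peels off the last grouping variable. If that is not what we want, one option is the .groups argument that the message mentions. A minimal sketch on made-up data (weather_toy and flights_toy are hypothetical):

```r
library(dplyr)

# Hypothetical toy versions of weather and flights
weather_toy <- tibble(origin     = c("EWR", "EWR", "JFK", "JFK"),
                      day        = c(1, 1, 1, 2),
                      wind_speed = c(10, NA, 8, 6))
flights_toy <- tibble(flight = c(101, 102, 103),
                      origin = c("EWR", "JFK", "JFK"),
                      day    = c(1, 1, 2))

# .groups = "drop" returns an ungrouped summary table
daily_wind <- weather_toy |>
  group_by(origin, day) |>
  summarize(avg.wind = mean(wind_speed, na.rm = TRUE),
            .groups = "drop")

# Join on both keys; every row of flights_toy is kept
joined <- daily_wind |>
  right_join(flights_toy, join_by(origin, day))
```

Dropping the groups before the join means later mutate() or summarize() calls on the joined data work on the whole dataset rather than within leftover groups.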

11.4 Functions

In the introduction to tidyverse I said that what we were doing here was “functional” programming. Broadly, this means that tidyverse prefers to “hide” the complicated stuff within functions and avoid directly manipulating vectors and matrices. In particular, the school of thought that created tidyverse wants to avoid copying and pasting at all costs, and instead to use or write functions that accomplish what you want to accomplish.

I think group_by() is an excellent example of this. Instead of explicitly pulling out the groups, looping over them, and constructing a new, aggregated dataset, tidyverse hides that stuff away and lets you use group_by and summarise to do that work.

That’s all fine and good if you are doing something that the makers of tidyverse expect, but what if you want to do something a bit different? In this case, what you often will want to do is to write your own functions that you can then make use of in your code.

This is a big topic, so I’m just going to do a couple of examples here which you can hopefully generalize from.

I’m going to take some of the nycflights data as an example:

data(flights)

flights <- flights |> 
  select(carrier,flight, dep_delay, arr_delay, distance, air_time)

A common thing we might do with variables is to re-code them to be “standardized”, which is statistics-speak for making all variables have a mean of 0 and a standard deviation of 1. You don’t really have to care right now about why we would want to do that; just believe me that it’s a good goal.

Mathematically, we standardize a variable by subtracting its mean from each value and dividing by its standard deviation:

\[ x_s = \frac{x - \bar{x}}{sd(x)} \]
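
As a quick worked example of the formula, take a made-up vector of three numbers (hypothetical, chosen so the arithmetic is easy): the mean is 4 and the standard deviation is 2.

```r
# Hypothetical values chosen so the arithmetic is easy to check by hand
x <- c(2, 4, 6)

# mean(x) is 4 and sd(x) is 2, so each value becomes (x - 4) / 2
x_s <- (x - mean(x)) / sd(x)

x_s        # -1  0  1
mean(x_s)  # 0
sd(x_s)    # 1
```

The standardized values tell us how many standard deviations each observation sits above or below the mean.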

If I wanted to standardize the 4 continuous variables in my dataset using tidyverse right now I can do:

flights |> 
  mutate(dep_delay = (dep_delay- mean(dep_delay,na.rm=T))/sd(dep_delay,na.rm=T),
         arr_delay = (arr_delay- mean(arr_delay,na.rm=T))/sd(arr_delay,na.rm=T),
         distance = (distance- mean(distance,na.rm=T))/sd(distance,na.rm=T),
         air_time = (air_time- mean(air_time,na.rm=T))/sd(air_time,na.rm=T)) |> 
  summarise(mean(dep_delay,na.rm=T),
            sd(dep_delay,na.rm=T))
#> # A tibble: 1 × 2
#>   `mean(dep_delay, na.rm = T)` `sd(dep_delay, na.rm = T)`
#>                          <dbl>                      <dbl>
#> 1                     2.34e-13                       1.00

But notice that in each line in mutate I am doing the same thing every time. All that’s changing is the variable that we are inputting into the algorithm that standardizes a variable.

We could imagine writing something like the following, where we generically set a vector equal to x, and then standardize x:

x <- flights$dep_delay
out <- (x-mean(x,na.rm=T))/sd(x,na.rm=T)

Writing a function in R allows us to do this in a better way. We can tell R that there is a new function that has a vector as an input. When we call that function it will do the operation that we wish on that vector.

So to define our function for standardizing:

stdrz <- function(x){
  (x-mean(x,na.rm=T))/sd(x,na.rm=T)
}

Our function is called stdrz. That function takes one argument, x, which can be any numeric vector. Inside the function it performs the standardization math and returns the vector, now standardized. We can see that this function is now stored in our environment, and if we click on the scroll icon next to it we can see what that function does.

Applying this function:

head(stdrz(x=flights$dep_delay))
#> [1] -0.2645873 -0.2148485 -0.2645873 -0.3391955 -0.4635425
#> [6] -0.4138037

We can shortcut this by not writing x=. Like other functions we have seen, R is looking for x as the first argument, so we don’t need to actually write that if we don’t want to.

head(stdrz(flights$dep_delay))
#> [1] -0.2645873 -0.2148485 -0.2645873 -0.3391955 -0.4635425
#> [6] -0.4138037

To make this more explicit: we are writing our own function that is just like mean() or sd(). Those functions similarly just take inputs and do some math on them. If we wanted to use this functionality to re-write the mean function we could:

#using var instead of x to show it doesn't really matter, but we have to match that to what's in the function
our.mean <- function(var){
  sum(var)/length(var)
}

our.mean(var=0:10)
#> [1] 5

our.mean(0:10)
#> [1] 5
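
Functions can take more than one argument, and arguments can have default values. As a sketch, here is a version of our.mean with an na.rm option like the one mean() has. The name our.mean2 and the way it handles missing values are my own illustration, not how base R implements mean().

```r
# A hypothetical extension of our.mean with an optional na.rm argument
our.mean2 <- function(var, na.rm = FALSE){
  # If asked, drop the missing values before doing the math
  if (na.rm) {
    var <- var[!is.na(var)]
  }
  sum(var)/length(var)
}

our.mean2(c(1, 2, NA))               # NA, because na.rm defaults to FALSE
our.mean2(c(1, 2, NA), na.rm = TRUE) # 1.5
```

Because na.rm has a default, we only have to write it when we want something other than FALSE.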

Returning to our standardizing function, this allows us to shortcut the steps we were doing in mutate above:

flights |> 
  mutate(dep_delay = stdrz(dep_delay),
         arr_delay = stdrz(arr_delay),
         distance = stdrz(distance),
         air_time = stdrz(air_time)) |> 
  summarise(mean(dep_delay,na.rm=T),
            sd(dep_delay,na.rm=T))
#> # A tibble: 1 × 2
#>   `mean(dep_delay, na.rm = T)` `sd(dep_delay, na.rm = T)`
#>                          <dbl>                      <dbl>
#> 1                     2.34e-13                       1.00

Functions are a great shortcut and can help you speed up repetitive tasks. This is really the minimum case covered here. If you are interested in learning more (particularly in writing functions similar to the other tidy verbs) I would encourage you to look at the R for Data Science textbook.

11.5 Iteration

In the above we are still doing a little bit of copying and pasting because we are calling stdrz 4 separate times. We can actually shortcut this further with a powerful verb, across. This verb lets us apply the same function to multiple columns at the same time:

flights |> 
  mutate(across(.cols=dep_delay:air_time, .fns=stdrz))
#> # A tibble: 336,776 × 6
#>    carrier flight dep_delay arr_delay distance air_time
#>    <chr>    <int>     <dbl>     <dbl>    <dbl>    <dbl>
#>  1 UA        1545    -0.265    0.0920   0.491   0.815  
#>  2 UA        1714    -0.215    0.294    0.513   0.815  
#>  3 AA        1141    -0.265    0.585    0.0669  0.0994 
#>  4 B6         725    -0.339   -0.558    0.731   0.345  
#>  5 DL         461    -0.464   -0.715   -0.379  -0.370  
#>  6 UA        1696    -0.414    0.114   -0.438  -0.00733
#>  7 B6         507    -0.439    0.271    0.0342  0.0781 
#>  8 EV        5708    -0.389   -0.468   -1.11   -1.04   
#>  9 B6          79    -0.389   -0.334   -0.131  -0.114  
#> 10 AA         301    -0.364    0.0247  -0.419  -0.135  
#> # ℹ 336,766 more rows

We can provide any number of columns in the first argument, and then any function we want in the second argument (including a function that we made ourselves).
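
By default across() overwrites the columns it is applied to. If we would rather keep the originals, the .names argument lets us name the new columns instead. A minimal sketch on a made-up tibble (toy is hypothetical, and stdrz is re-defined so the block runs on its own):

```r
library(dplyr)

# Same standardizing function as above, repeated so this block is self-contained
stdrz <- function(x){
  (x - mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)
}

# Hypothetical toy data
toy <- tibble(a = c(1, 2, 3),
              b = c(10, 20, 30))

# "{.col}_z" means: original column name followed by "_z"
toy_z <- toy |>
  mutate(across(.cols = a:b, .fns = stdrz, .names = "{.col}_z"))

names(toy_z)
```

The result has the original a and b plus the new standardized columns a_z and b_z.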

We can use the power of select() in defining which cols get the function applied to them. See that section of the notes to see what you can do, but for example:

flights |> 
  mutate(across(starts_with("dep"), stdrz))
#> # A tibble: 336,776 × 6
#>    carrier flight dep_delay arr_delay distance air_time
#>    <chr>    <int>     <dbl>     <dbl>    <dbl>    <dbl>
#>  1 UA        1545    -0.265        11     1400      227
#>  2 UA        1714    -0.215        20     1416      227
#>  3 AA        1141    -0.265        33     1089      160
#>  4 B6         725    -0.339       -18     1576      183
#>  5 DL         461    -0.464       -25      762      116
#>  6 UA        1696    -0.414        12      719      150
#>  7 B6         507    -0.439        19     1065      158
#>  8 EV        5708    -0.389       -14      229       53
#>  9 B6          79    -0.389        -8      944      140
#> 10 AA         301    -0.364         8      733      138
#> # ℹ 336,766 more rows

A special case is selecting all of the variables to apply the function to, though we have to ensure that the function applies to all the variables. For example, here we would first have to remove the character variable from flights:

flights |> 
  select(-carrier) |> 
  mutate(across(everything(), stdrz))
#> # A tibble: 336,776 × 5
#>    flight dep_delay arr_delay distance air_time
#>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
#>  1 -0.262    -0.265    0.0920   0.491   0.815  
#>  2 -0.158    -0.215    0.294    0.513   0.815  
#>  3 -0.509    -0.265    0.585    0.0669  0.0994 
#>  4 -0.764    -0.339   -0.558    0.731   0.345  
#>  5 -0.926    -0.464   -0.715   -0.379  -0.370  
#>  6 -0.169    -0.414    0.114   -0.438  -0.00733
#>  7 -0.897    -0.439    0.271    0.0342  0.0781 
#>  8  2.29     -0.389   -0.468   -1.11   -1.04   
#>  9 -1.16     -0.389   -0.334   -0.131  -0.114  
#> 10 -1.02     -0.364    0.0247  -0.419  -0.135  
#> # ℹ 336,766 more rows

One way around this is that we can use where(), which allows us to specify that we want to apply the function to a certain class of variables:

flights |> 
  mutate(across(where(is.numeric), stdrz))
#> # A tibble: 336,776 × 6
#>    carrier flight dep_delay arr_delay distance air_time
#>    <chr>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
#>  1 UA      -0.262    -0.265    0.0920   0.491   0.815  
#>  2 UA      -0.158    -0.215    0.294    0.513   0.815  
#>  3 AA      -0.509    -0.265    0.585    0.0669  0.0994 
#>  4 B6      -0.764    -0.339   -0.558    0.731   0.345  
#>  5 DL      -0.926    -0.464   -0.715   -0.379  -0.370  
#>  6 UA      -0.169    -0.414    0.114   -0.438  -0.00733
#>  7 B6      -0.897    -0.439    0.271    0.0342  0.0781 
#>  8 EV       2.29     -0.389   -0.468   -1.11   -1.04   
#>  9 B6      -1.16     -0.389   -0.334   -0.131  -0.114  
#> 10 AA      -1.02     -0.364    0.0247  -0.419  -0.135  
#> # ℹ 336,766 more rows

We can use the across() function in summarize as well. For example, what if we want to get the mean of all these variables?

flights |> 
  summarize(across(where(is.numeric), mean))
#> # A tibble: 1 × 5
#>   flight dep_delay arr_delay distance air_time
#>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
#> 1  1972.        NA        NA    1040.       NA

Now we have a classic problem here. Several of our variables have NAs in them, which means the mean is not computed. We may want to do something like this:

flights |> 
  summarize(across(where(is.numeric), mean(na.rm=T)))

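But that will give us the error ! argument "x" is missing, with no default. The problem is that mean(na.rm=T) tries to run mean() right away with no data, when what across() needs is the function itself. So that doesn’t work! Indeed, this applies to any function we want to use in across() that has certain options we want to apply. We can’t apply them in a straightforward way.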

Now this is where the “functional programming” thing goes off the rails a little bit and I start to find it a bit annoying. The official recommendation here is to write a new function that has the options applied and then put that in across:

mean_na <- function(x){
  mean(x, na.rm=T)
}

flights |> 
  summarize(across(where(is.numeric), mean_na))
#> # A tibble: 1 × 5
#>   flight dep_delay arr_delay distance air_time
#>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
#> 1  1972.      12.6      6.90    1040.     151.

The “shortcut” that is provided is that you can write a little mini-function (an anonymous function) inside of across using the backslash shorthand \(x), which is just a shorter way of writing function(x), like this:


flights |> 
  summarize(across(where(is.numeric), \(x) mean(x,na.rm=T)))
#> # A tibble: 1 × 5
#>   flight dep_delay arr_delay distance air_time
#>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
#> 1  1972.      12.6      6.90    1040.     151.
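
The same shorthand also works inside a named list, which lets across() apply several functions at once. This is a sketch on a made-up tibble (toy is hypothetical); as far as I know the \(x) syntax needs R 4.1 or later, and the output columns are named by combining the column name and the list name.

```r
library(dplyr)

# Hypothetical toy data with a character column and some missing values
toy <- tibble(g = c("x", "y", "y"),
              a = c(1, NA, 3),
              b = c(2, 4, NA))

# Each list element is a mini-function; the list names become column suffixes
toy_summary <- toy |>
  summarize(across(where(is.numeric),
                   list(mean      = \(x) mean(x, na.rm = TRUE),
                        n_missing = \(x) sum(is.na(x)))))

names(toy_summary)
```

This gives one row with the columns a_mean, a_n_missing, b_mean, and b_n_missing, and the character column g is left out because of where(is.numeric).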

Is that helpful? Honestly, this is stuff at the edge of my current knowledge/experience of tidyverse. I see the use cases for across() but I find the “just embed another function!” stuff to be a bit of a pain in the ass, and an overly dogmatic approach to an ideology (functional programming) that I have no real interest in. If you are going down this road and are enjoying it I would encourage you to check out the textbook I am pulling from that is linked at the start of these chapters. Truly: it’s hard for me to change my stripes because I’ve been doing things a certain way for a long time. If you can habituate yourself to this newer way of doing things it seems like it is very effective!
