Learning Objectives


What is dplyr?

The package dplyr is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with tibbles. The thinking behind it was largely inspired by the package plyr which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.

This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R.

dplyr is loaded with the tidyverse metapackage.

library(tidyverse)
library(tidylog)
## 
## Attaching package: 'tidylog'
## The following objects are masked from 'package:dplyr':
## 
##     add_count, add_tally, anti_join, count, distinct, distinct_all,
##     distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
##     full_join, group_by, group_by_all, group_by_at, group_by_if,
##     inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
##     rename, rename_all, rename_at, rename_if, right_join, sample_frac,
##     sample_n, select, select_all, select_at, select_if, semi_join,
##     slice, summarise, summarise_all, summarise_at, summarise_if,
##     summarize, summarize_all, summarize_at, summarize_if, tally,
##     top_frac, top_n, transmute, transmute_all, transmute_at,
##     transmute_if, ungroup
## The following objects are masked from 'package:tidyr':
## 
##     drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
##     spread, uncount
## The following object is masked from 'package:stats':
## 
##     filter

Selecting columns and filtering rows

We’re going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), and summarise(). To select columns of a data frame, use select(). The first argument to this function is the data frame (tibble), and the subsequent arguments are the columns to keep.

select(Ecoli_citrate, sample, clade, cit, genome_size)
## select: dropped 3 variables (generation, strain, run)
## # A tibble: 30 x 4
##    sample   clade   cit     genome_size
##    <chr>    <chr>   <chr>         <dbl>
##  1 REL606   <NA>    unknown        4.62
##  2 REL1166A unknown unknown        4.63
##  3 ZDB409   unknown unknown        4.6 
##  4 ZDB429   UC      unknown        4.59
##  5 ZDB446   UC      unknown        4.66
##  6 ZDB458   (C1,C2) unknown        4.63
##  7 ZDB464*  (C1,C2) unknown        4.62
##  8 ZDB467   (C1,C2) unknown        4.61
##  9 ZDB477   C1      unknown        4.65
## 10 ZDB483   C3      unknown        4.59
## # … with 20 more rows

To choose rows, use filter():

filter(Ecoli_citrate, cit == "plus")
## filter: removed 21 rows (70%), 9 rows remaining
## # A tibble: 9 x 7
##   sample   generation clade strain cit   run       genome_size
##   <chr>         <dbl> <chr> <chr>  <chr> <chr>           <dbl>
## 1 ZDB564        31500 Cit+  REL606 plus  SRR098289        4.74
## 2 ZDB172        32000 Cit+  REL606 plus  SRR098042        4.77
## 3 ZDB143        32500 Cit+  REL606 plus  SRR098040        4.79
## 4 CZB152        33000 Cit+  REL606 plus  SRR097977        4.8 
## 5 CZB154        33000 Cit+  REL606 plus  SRR098026        4.76
## 6 ZDB87         34000 C2    REL606 plus  SRR098035        4.75
## 7 ZDB96         36000 Cit+  REL606 plus  SRR098036        4.74
## 8 ZDB107        38000 Cit+  REL606 plus  SRR098038        4.79
## 9 REL10979      40000 Cit+  REL606 plus  SRR098029        4.78
filter(Ecoli_citrate, cit != "plus")
## filter: removed 9 rows (30%), 21 rows remaining
## # A tibble: 21 x 7
##    sample   generation clade   strain cit     run       genome_size
##    <chr>         <dbl> <chr>   <chr>  <chr>   <chr>           <dbl>
##  1 REL606            0 <NA>    REL606 unknown <NA>             4.62
##  2 REL1166A       2000 unknown REL606 unknown SRR098028        4.63
##  3 ZDB409         5000 unknown REL606 unknown SRR098281        4.6 
##  4 ZDB429        10000 UC      REL606 unknown SRR098282        4.59
##  5 ZDB446        15000 UC      REL606 unknown SRR098283        4.66
##  6 ZDB458        20000 (C1,C2) REL606 unknown SRR098284        4.63
##  7 ZDB464*       20000 (C1,C2) REL606 unknown SRR098285        4.62
##  8 ZDB467        20000 (C1,C2) REL606 unknown SRR098286        4.61
##  9 ZDB477        25000 C1      REL606 unknown SRR098287        4.65
## 10 ZDB483        25000 C3      REL606 unknown SRR098288        4.59
## # … with 11 more rows

Pipes

But what if you wanted to select and filter? There are three ways to do this: use intermediate steps, nested functions, or pipes. With the intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as the process from inside out. The last option, pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to many things to the same data set. Pipes in R look like %>% and are made available via the magrittr package installed as part of tidyverse. If you’re familiar with the Unix shell, you may already have used pipes to pass the output from one command to the next. The concept is the same, except the shell uses the | character rather than R’s pipe operator %>%.

The pipe operator can be tedious to type. In Rstudio pressing Ctrl + Shift + M under Windows / Linux will insert the pipe operator. On the mac, use ⌘ + Shift + M.

Ecoli_citrate %>%
  filter(cit == "plus") %>%
  select(sample, generation, clade, cit)
## filter: removed 21 rows (70%), 9 rows remaining
## select: dropped 3 variables (strain, run, genome_size)
## # A tibble: 9 x 4
##   sample   generation clade cit  
##   <chr>         <dbl> <chr> <chr>
## 1 ZDB564        31500 Cit+  plus 
## 2 ZDB172        32000 Cit+  plus 
## 3 ZDB143        32500 Cit+  plus 
## 4 CZB152        33000 Cit+  plus 
## 5 CZB154        33000 Cit+  plus 
## 6 ZDB87         34000 C2    plus 
## 7 ZDB96         36000 Cit+  plus 
## 8 ZDB107        38000 Cit+  plus 
## 9 REL10979      40000 Cit+  plus

In the above we use the pipe to send the data set first through filter, to keep rows where cit was equal to ‘plus’, and then through select to keep the sample and generation and clade columns. When the data frame is being passed to the filter() and select() functions through a pipe, we don’t need to include it as an argument to these functions anymore. Note that the order of operations here matters - try reversing the filter() and select() commands - why doesn’t that work?

This is the same as the nested version:

select(filter(Ecoli_citrate, cit == "plus"), sample, generation, clade)
## filter: removed 21 rows (70%), 9 rows remaining
## select: dropped 4 variables (strain, cit, run, genome_size)
## # A tibble: 9 x 3
##   sample   generation clade
##   <chr>         <dbl> <chr>
## 1 ZDB564        31500 Cit+ 
## 2 ZDB172        32000 Cit+ 
## 3 ZDB143        32500 Cit+ 
## 4 CZB152        33000 Cit+ 
## 5 CZB154        33000 Cit+ 
## 6 ZDB87         34000 C2   
## 7 ZDB96         36000 Cit+ 
## 8 ZDB107        38000 Cit+ 
## 9 REL10979      40000 Cit+

If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:

Ecoli_citplus <- Ecoli_citrate %>%
  filter(cit == "plus") %>%
  select(sample, generation, clade)
## filter: removed 21 rows (70%), 9 rows remaining
## select: dropped 4 variables (strain, cit, run, genome_size)
Ecoli_citplus
## # A tibble: 9 x 3
##   sample   generation clade
##   <chr>         <dbl> <chr>
## 1 ZDB564        31500 Cit+ 
## 2 ZDB172        32000 Cit+ 
## 3 ZDB143        32500 Cit+ 
## 4 CZB152        33000 Cit+ 
## 5 CZB154        33000 Cit+ 
## 6 ZDB87         34000 C2   
## 7 ZDB96         36000 Cit+ 
## 8 ZDB107        38000 Cit+ 
## 9 REL10979      40000 Cit+

We can think of the filter() and select() functions as verbs in the sentence; they do things to the data flowing through the pipeline.

EXERCISE

  1. Using pipes, subset Ecoli_citrate to include rows where the clade is ‘Cit+’ and keep only the columns sample, cit, and genome_size.
    How many rows are in that tibble?
    HINT: You can use the nrow() function to find out how many rows are in a tibble.

  2. Using pipes, subset Ecol_citrate to include rows where the genome_size is greater than 4.6 and generation is greater than 20 000. Then keep only the columns sample, run, clade

Mutate

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. For this we’ll use mutate().

To create a new column of genome size in bp:

Ecoli_citrate %>%
  mutate(genome_bp = genome_size *1e6)
## mutate: new variable 'genome_bp' with 14 unique values and 0% NA
## # A tibble: 30 x 8
##    sample   generation clade   strain cit     run       genome_size genome_bp
##    <chr>         <dbl> <chr>   <chr>  <chr>   <chr>           <dbl>     <dbl>
##  1 REL606            0 <NA>    REL606 unknown <NA>             4.62   4620000
##  2 REL1166A       2000 unknown REL606 unknown SRR098028        4.63   4630000
##  3 ZDB409         5000 unknown REL606 unknown SRR098281        4.6    4600000
##  4 ZDB429        10000 UC      REL606 unknown SRR098282        4.59   4590000
##  5 ZDB446        15000 UC      REL606 unknown SRR098283        4.66   4660000
##  6 ZDB458        20000 (C1,C2) REL606 unknown SRR098284        4.63   4630000
##  7 ZDB464*       20000 (C1,C2) REL606 unknown SRR098285        4.62   4620000
##  8 ZDB467        20000 (C1,C2) REL606 unknown SRR098286        4.61   4610000
##  9 ZDB477        25000 C1      REL606 unknown SRR098287        4.65   4650000
## 10 ZDB483        25000 C3      REL606 unknown SRR098288        4.59   4590000
## # … with 20 more rows

The first row has a NA value for clade, so if we wanted to remove those we could insert a filter() in this chain:

Ecoli_citrate %>%
  mutate(genome_bp = genome_size *1e6) %>%
  filter(!is.na(clade))
## mutate: new variable 'genome_bp' with 14 unique values and 0% NA
## filter: removed one row (3%), 29 rows remaining
## # A tibble: 29 x 8
##    sample   generation clade   strain cit     run       genome_size genome_bp
##    <chr>         <dbl> <chr>   <chr>  <chr>   <chr>           <dbl>     <dbl>
##  1 REL1166A       2000 unknown REL606 unknown SRR098028        4.63   4630000
##  2 ZDB409         5000 unknown REL606 unknown SRR098281        4.6    4600000
##  3 ZDB429        10000 UC      REL606 unknown SRR098282        4.59   4590000
##  4 ZDB446        15000 UC      REL606 unknown SRR098283        4.66   4660000
##  5 ZDB458        20000 (C1,C2) REL606 unknown SRR098284        4.63   4630000
##  6 ZDB464*       20000 (C1,C2) REL606 unknown SRR098285        4.62   4620000
##  7 ZDB467        20000 (C1,C2) REL606 unknown SRR098286        4.61   4610000
##  8 ZDB477        25000 C1      REL606 unknown SRR098287        4.65   4650000
##  9 ZDB483        25000 C3      REL606 unknown SRR098288        4.59   4590000
## 10 ZDB16         30000 C1      REL606 unknown SRR098031        4.61   4610000
## # … with 19 more rows

is.na() is a function that determines whether something is or is not an NA. The ! symbol negates it, so we’re asking for everything that is not an NA.

Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function, which splits the data into groups. When the data is grouped in this way summarize() can be used to collapse each group into a single-row summary. summarize() does this by applying an aggregating or summary function to each group. For example, if we wanted to group by citrate-using mutant status and find the number of rows of data for each status, we would do:

Ecoli_citrate %>%
  group_by(cit) %>%
  summarize(n())
## group_by: one grouping variable (cit)
## summarize: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 x 2
##   cit     `n()`
##   <chr>   <int>
## 1 minus       9
## 2 plus        9
## 3 unknown    12

Here the summary function used was n() to find the count for each group. We can also apply many other functions to individual columns to get other summary statistics. For example, in the R base package we can use built-in functions like mean, median, min, and max. By default, all R functions operating on vectors that contains missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm=TRUE (rm stands for remove).

So to view mean genome_size by mutant status:

Ecoli_citrate %>%
  group_by(cit) %>%
  summarize(mean_size = mean(genome_size, na.rm = TRUE))
## group_by: one grouping variable (cit)
## summarize: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 x 2
##   cit     mean_size
##   <chr>       <dbl>
## 1 minus        4.61
## 2 plus         4.77
## 3 unknown      4.62

You can group by multiple columns too:

Ecoli_citrate %>%
  group_by(cit, clade) %>%
  summarize(mean_size = mean(genome_size, na.rm = TRUE))
## group_by: 2 grouping variables (cit, clade)
## summarize: now 13 rows and 3 columns, one group variable remaining (cit)
## # A tibble: 13 x 3
## # Groups:   cit [3]
##    cit     clade   mean_size
##    <chr>   <chr>       <dbl>
##  1 minus   C1           4.61
##  2 minus   C2           4.62
##  3 minus   C3           4.61
##  4 minus   Cit+         4.6 
##  5 plus    C2           4.75
##  6 plus    Cit+         4.77
##  7 unknown (C1,C2)      4.62
##  8 unknown C1           4.63
##  9 unknown C2           4.62
## 10 unknown C3           4.59
## 11 unknown UC           4.62
## 12 unknown unknown      4.62
## 13 unknown <NA>         4.62

Looks like for one of these clones, the clade is missing. We could then discard those rows using filter():

Ecoli_citrate %>%
  group_by(cit, clade) %>%
  summarize(mean_size = mean(genome_size, na.rm = TRUE)) %>% 
  filter(!is.na(clade))
## group_by: 2 grouping variables (cit, clade)
## summarize: now 13 rows and 3 columns, one group variable remaining (cit)
## filter (grouped): removed one row (8%), 12 rows remaining
## # A tibble: 12 x 3
## # Groups:   cit [3]
##    cit     clade   mean_size
##    <chr>   <chr>       <dbl>
##  1 minus   C1           4.61
##  2 minus   C2           4.62
##  3 minus   C3           4.61
##  4 minus   Cit+         4.6 
##  5 plus    C2           4.75
##  6 plus    Cit+         4.77
##  7 unknown (C1,C2)      4.62
##  8 unknown C1           4.63
##  9 unknown C2           4.62
## 10 unknown C3           4.59
## 11 unknown UC           4.62
## 12 unknown unknown      4.62

You can also summarize multiple variables at the same time:

Ecoli_citrate %>%
  group_by(cit, clade) %>%
  summarize(mean_size = mean(genome_size, na.rm = TRUE),
            min_generation = min(generation))
## group_by: 2 grouping variables (cit, clade)
## summarize: now 13 rows and 4 columns, one group variable remaining (cit)
## # A tibble: 13 x 4
## # Groups:   cit [3]
##    cit     clade   mean_size min_generation
##    <chr>   <chr>       <dbl>          <dbl>
##  1 minus   C1           4.61          31500
##  2 minus   C2           4.62          31500
##  3 minus   C3           4.61          32000
##  4 minus   Cit+         4.6           34000
##  5 plus    C2           4.75          34000
##  6 plus    Cit+         4.77          31500
##  7 unknown (C1,C2)      4.62          20000
##  8 unknown C1           4.63          25000
##  9 unknown C2           4.62          30000
## 10 unknown C3           4.59          25000
## 11 unknown UC           4.62          10000
## 12 unknown unknown      4.62           2000
## 13 unknown <NA>         4.62              0

Handy dplyr cheatsheet

EXERCISE

Create a tibble that has removed all of the rows where the clade is “unknown”. Then find the mean genome size of each unique clade and the resultant rank of mean genome size. (note that ranking genome size will not sort the table; the row order will be unchanged. You can use the arrange() function to sort the table).

There are several functions for ranking observations, which handle tied values differently. For this exercise it doesn’t matter which function you choose. Use the help options to find a ranking function.

*Reminder that unknown is part of the dataset - R exlusively uses NA to indicate no data

Ecoli_citrate %>% 
  group_by(clade) %>%
  filter(clade != "unknown") %>% 
  summarise(means = mean(genome_size)) %>% 
  mutate(rank = rank(means)) %>% 
  arrange(rank)
## group_by: one grouping variable (clade)
## filter (grouped): removed 3 rows (10%), 27 rows remaining
## summarise: now 6 rows and 2 columns, ungrouped
## mutate: new variable 'rank' with 6 unique values and 0% NA
## # A tibble: 6 x 3
##   clade   means  rank
##   <chr>   <dbl> <dbl>
## 1 C3       4.6      1
## 2 C1       4.62     2
## 3 (C1,C2)  4.62     3
## 4 UC       4.62     4
## 5 C2       4.64     5
## 6 Cit+     4.75     6

Long to Wide or Wide to Long

Long to wide

Data is often recorded in a long format for efficiency. We’re going to use a toy dataset to illustrate. For every plate a researcher counted, they wrote down the number of colonies they saw, and the dilution from the original culture As this was just a running tally, the data ended up in a long format:

counts <- tibble(replicate = c(1, 1, 2, 2, 3), 
                      num_colonies = c(40, 70, 
                                25, 60, 
                                15),
                      dilution = c("d10", "d5", "d10", "d5", "d5")
            )
counts
## # A tibble: 5 x 3
##   replicate num_colonies dilution
##       <dbl>        <dbl> <chr>   
## 1         1           40 d10     
## 2         1           70 d5      
## 3         2           25 d10     
## 4         2           60 d5      
## 5         3           15 d5

But, what if we wanted to easily compare number of colonies across drug concentrations? We can use pivot_wider to change the data into a wide format.

pivot_wider requires you tell it the name of the column which contains values that will be the column names in your new data set - a so called “key” column. You also tell it which column contains the relevant numbers - the “values” column.

counts_wide <- counts %>%
  pivot_wider(names_from = dilution, values_from = num_colonies)
## pivot_wider: reorganized (num_colonies, dilution) into (d10, d5) [was 5x3, now 3x3]
counts_wide
## # A tibble: 3 x 3
##   replicate   d10    d5
##       <dbl> <dbl> <dbl>
## 1         1    40    70
## 2         2    25    60
## 3         3    NA    15

You’ll notice there is an NA values for one of the plates. This is incredibly common in microbial and biological data. Sometimes, they are NA - they weren’t recorded (or, for example, a contaminated plate). Other times, such as in this data set, they actually mean 0 observations. To add this information we use the values_fill argument in pivot_wider.

counts_wide <- counts %>%
  pivot_wider(names_from = dilution, values_from = c(num_colonies), values_fill=list(num_colonies = 0))
## pivot_wider: reorganized (num_colonies, dilution) into (d10, d5) [was 5x3, now 3x3]
counts_wide
## # A tibble: 3 x 3
##   replicate   d10    d5
##       <dbl> <dbl> <dbl>
## 1         1    40    70
## 2         2    25    60
## 3         3     0    15

Note that we could also fill those in right into the long data format using the complete function. In that function, we specify which columns we want all combinations of, and then supply a list of how new values should be filled in for other columns. If we don’t give a column name in that list, it defaults to NA.

counts_long_0 <- counts %>%
  complete(dilution, replicate, fill=list(num_colonies=0))

counts_long_0
## # A tibble: 6 x 3
##   dilution replicate num_colonies
##   <chr>        <dbl>        <dbl>
## 1 d10              1           40
## 2 d10              2           25
## 3 d10              3            0
## 4 d5               1           70
## 5 d5               2           60
## 6 d5               3           15

Wide to long

In other cases the reverse is true - as some people recode their data in a wide format. To go from wide to long we use the pivot_longer function, which “gathers up your wide data”.

In this case you specify what you want the name of the new key column to be, what you want the name of the new values column to be, and then you can either specify which columns are to be gathered up (which can be tedious if there are a lot or they are spread) or you can specify which columns you want to exclude. I actually do things by exclusion quite often.

count_long <- counts_wide %>%
  pivot_longer(cols = c(d10, d5), names_to = "dilution", values_to = "count")
## pivot_longer: reorganized (d10, d5) into (dilution, count) [was 3x3, now 6x3]
count_long
## # A tibble: 6 x 3
##   replicate dilution count
##       <dbl> <chr>    <dbl>
## 1         1 d10         40
## 2         1 d5          70
## 3         2 d10         25
## 4         2 d5          60
## 5         3 d10          0
## 6         3 d5          15

See vignette("pivot") for more details.

Other great resources

Key Points

Attribution

This lesson was created by Aleeza Gerstein at the University of Manitoba. It is based largely on material from The Carpentries and compiled from workshop materials located here and here. Much of that was copied or adapted from Jeff Hollister’s materials. Made available under the Creative Commons Attribution license. License.