Learning Objectives
The package dplyr
is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with tibbles. The thinking behind it was largely inspired by the package plyr
which has been in use for some time but suffered from being slow in some cases. dplyr
addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.
This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R.
dplyr
is loaded with the tidyverse
metapackage.
library(tidyverse)
library(tidylog)
##
## Attaching package: 'tidylog'
## The following objects are masked from 'package:dplyr':
##
## add_count, add_tally, anti_join, count, distinct, distinct_all,
## distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
## full_join, group_by, group_by_all, group_by_at, group_by_if,
## inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
## rename, rename_all, rename_at, rename_if, right_join, sample_frac,
## sample_n, select, select_all, select_at, select_if, semi_join,
## slice, summarise, summarise_all, summarise_at, summarise_if,
## summarize, summarize_all, summarize_at, summarize_if, tally,
## top_frac, top_n, transmute, transmute_all, transmute_at,
## transmute_if, ungroup
## The following objects are masked from 'package:tidyr':
##
## drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
## spread, uncount
## The following object is masked from 'package:stats':
##
## filter
We’re going to learn some of the most common dplyr
functions: select()
, filter()
, mutate()
, group_by()
, and summarise()
. To select columns of a data frame, use select()
. The first argument to this function is the data frame (tibble), and the subsequent arguments are the columns to keep.
select(Ecoli_citrate, sample, clade, cit, genome_size)
## select: dropped 3 variables (generation, strain, run)
## # A tibble: 30 x 4
## sample clade cit genome_size
## <chr> <chr> <chr> <dbl>
## 1 REL606 <NA> unknown 4.62
## 2 REL1166A unknown unknown 4.63
## 3 ZDB409 unknown unknown 4.6
## 4 ZDB429 UC unknown 4.59
## 5 ZDB446 UC unknown 4.66
## 6 ZDB458 (C1,C2) unknown 4.63
## 7 ZDB464* (C1,C2) unknown 4.62
## 8 ZDB467 (C1,C2) unknown 4.61
## 9 ZDB477 C1 unknown 4.65
## 10 ZDB483 C3 unknown 4.59
## # … with 20 more rows
To choose rows, use filter():
filter(Ecoli_citrate, cit == "plus")
## filter: removed 21 rows (70%), 9 rows remaining
## # A tibble: 9 x 7
## sample generation clade strain cit run genome_size
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 ZDB564 31500 Cit+ REL606 plus SRR098289 4.74
## 2 ZDB172 32000 Cit+ REL606 plus SRR098042 4.77
## 3 ZDB143 32500 Cit+ REL606 plus SRR098040 4.79
## 4 CZB152 33000 Cit+ REL606 plus SRR097977 4.8
## 5 CZB154 33000 Cit+ REL606 plus SRR098026 4.76
## 6 ZDB87 34000 C2 REL606 plus SRR098035 4.75
## 7 ZDB96 36000 Cit+ REL606 plus SRR098036 4.74
## 8 ZDB107 38000 Cit+ REL606 plus SRR098038 4.79
## 9 REL10979 40000 Cit+ REL606 plus SRR098029 4.78
filter(Ecoli_citrate, cit != "plus")
## filter: removed 9 rows (30%), 21 rows remaining
## # A tibble: 21 x 7
## sample generation clade strain cit run genome_size
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 REL606 0 <NA> REL606 unknown <NA> 4.62
## 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63
## 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.6
## 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59
## 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66
## 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63
## 7 ZDB464* 20000 (C1,C2) REL606 unknown SRR098285 4.62
## 8 ZDB467 20000 (C1,C2) REL606 unknown SRR098286 4.61
## 9 ZDB477 25000 C1 REL606 unknown SRR098287 4.65
## 10 ZDB483 25000 C3 REL606 unknown SRR098288 4.59
## # … with 11 more rows
But what if you wanted to select
and filter
? There are three ways to do this: use intermediate steps, nested functions, or pipes. With the intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as the process from inside out. The last option, pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to many things to the same data set. Pipes in R look like %>%
and are made available via the magrittr
package installed as part of tidyverse. If you’re familiar with the Unix shell, you may already have used pipes to pass the output from one command to the next. The concept is the same, except the shell uses the |
character rather than R’s pipe operator %>%
.
The pipe operator can be tedious to type. In Rstudio pressing Ctrl + Shift + M
under Windows / Linux will insert the pipe operator. On the mac, use ⌘ + Shift + M
.
Ecoli_citrate %>%
filter(cit == "plus") %>%
select(sample, generation, clade, cit)
## filter: removed 21 rows (70%), 9 rows remaining
## select: dropped 3 variables (strain, run, genome_size)
## # A tibble: 9 x 4
## sample generation clade cit
## <chr> <dbl> <chr> <chr>
## 1 ZDB564 31500 Cit+ plus
## 2 ZDB172 32000 Cit+ plus
## 3 ZDB143 32500 Cit+ plus
## 4 CZB152 33000 Cit+ plus
## 5 CZB154 33000 Cit+ plus
## 6 ZDB87 34000 C2 plus
## 7 ZDB96 36000 Cit+ plus
## 8 ZDB107 38000 Cit+ plus
## 9 REL10979 40000 Cit+ plus
In the above we use the pipe to send the data set first through filter
, to keep rows where cit was equal to ‘plus’, and then through select
to keep the sample and generation and clade columns. When the data frame is being passed to the filter()
and select()
functions through a pipe, we don’t need to include it as an argument to these functions anymore. Note that the order of operations here matters - try reversing the filter()
and select()
commands - why doesn’t that work?
This is the same as the nested version:
select(filter(Ecoli_citrate, cit == "plus"), sample, generation, clade)
## filter: removed 21 rows (70%), 9 rows remaining
## select: dropped 4 variables (strain, cit, run, genome_size)
## # A tibble: 9 x 3
## sample generation clade
## <chr> <dbl> <chr>
## 1 ZDB564 31500 Cit+
## 2 ZDB172 32000 Cit+
## 3 ZDB143 32500 Cit+
## 4 CZB152 33000 Cit+
## 5 CZB154 33000 Cit+
## 6 ZDB87 34000 C2
## 7 ZDB96 36000 Cit+
## 8 ZDB107 38000 Cit+
## 9 REL10979 40000 Cit+
If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:
Ecoli_citplus <- Ecoli_citrate %>%
filter(cit == "plus") %>%
select(sample, generation, clade)
## filter: removed 21 rows (70%), 9 rows remaining
## select: dropped 4 variables (strain, cit, run, genome_size)
Ecoli_citplus
## # A tibble: 9 x 3
## sample generation clade
## <chr> <dbl> <chr>
## 1 ZDB564 31500 Cit+
## 2 ZDB172 32000 Cit+
## 3 ZDB143 32500 Cit+
## 4 CZB152 33000 Cit+
## 5 CZB154 33000 Cit+
## 6 ZDB87 34000 C2
## 7 ZDB96 36000 Cit+
## 8 ZDB107 38000 Cit+
## 9 REL10979 40000 Cit+
We can think of the filter()
and select()
functions as verbs in the sentence; they do things to the data flowing through the pipeline.
Using pipes, subset Ecoli_citrate
to include rows where the clade is ‘Cit+’ and keep only the columns sample
, cit
, and genome_size
.
How many rows are in that tibble?
HINT: You can use the nrow()
function to find out how many rows are in a tibble.
Using pipes, subset Ecol_citrate
to include rows where the genome_size is greater than 4.6 and generation is greater than 20 000. Then keep only the columns sample, run, clade
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. For this we’ll use mutate()
.
To create a new column of genome size in bp:
Ecoli_citrate %>%
mutate(genome_bp = genome_size *1e6)
## mutate: new variable 'genome_bp' with 14 unique values and 0% NA
## # A tibble: 30 x 8
## sample generation clade strain cit run genome_size genome_bp
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 REL606 0 <NA> REL606 unknown <NA> 4.62 4620000
## 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63 4630000
## 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.6 4600000
## 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59 4590000
## 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66 4660000
## 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63 4630000
## 7 ZDB464* 20000 (C1,C2) REL606 unknown SRR098285 4.62 4620000
## 8 ZDB467 20000 (C1,C2) REL606 unknown SRR098286 4.61 4610000
## 9 ZDB477 25000 C1 REL606 unknown SRR098287 4.65 4650000
## 10 ZDB483 25000 C3 REL606 unknown SRR098288 4.59 4590000
## # … with 20 more rows
The first row has a NA value for clade, so if we wanted to remove those we could insert a filter() in this chain:
Ecoli_citrate %>%
mutate(genome_bp = genome_size *1e6) %>%
filter(!is.na(clade))
## mutate: new variable 'genome_bp' with 14 unique values and 0% NA
## filter: removed one row (3%), 29 rows remaining
## # A tibble: 29 x 8
## sample generation clade strain cit run genome_size genome_bp
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 REL1166A 2000 unknown REL606 unknown SRR098028 4.63 4630000
## 2 ZDB409 5000 unknown REL606 unknown SRR098281 4.6 4600000
## 3 ZDB429 10000 UC REL606 unknown SRR098282 4.59 4590000
## 4 ZDB446 15000 UC REL606 unknown SRR098283 4.66 4660000
## 5 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63 4630000
## 6 ZDB464* 20000 (C1,C2) REL606 unknown SRR098285 4.62 4620000
## 7 ZDB467 20000 (C1,C2) REL606 unknown SRR098286 4.61 4610000
## 8 ZDB477 25000 C1 REL606 unknown SRR098287 4.65 4650000
## 9 ZDB483 25000 C3 REL606 unknown SRR098288 4.59 4590000
## 10 ZDB16 30000 C1 REL606 unknown SRR098031 4.61 4610000
## # … with 19 more rows
is.na()
is a function that determines whether something is or is not an NA. The !
symbol negates it, so we’re asking for everything that is not an NA
.
Many data analysis tasks can be approached using the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr
makes this very easy through the use of the group_by()
function, which splits the data into groups. When the data is grouped in this way summarize()
can be used to collapse each group into a single-row summary. summarize()
does this by applying an aggregating or summary function to each group. For example, if we wanted to group by citrate-using mutant status and find the number of rows of data for each status, we would do:
Ecoli_citrate %>%
group_by(cit) %>%
summarize(n())
## group_by: one grouping variable (cit)
## summarize: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 x 2
## cit `n()`
## <chr> <int>
## 1 minus 9
## 2 plus 9
## 3 unknown 12
Here the summary function used was n()
to find the count for each group. We can also apply many other functions to individual columns to get other summary statistics. For example, in the R base package we can use built-in functions like mean
, median
, min
, and max
. By default, all R functions operating on vectors that contains missing data will return NA
. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean
, the easiest way to ignore NA
(the missing data) is to use na.rm=TRUE
(rm stands for remove).
So to view mean genome_size
by mutant status:
Ecoli_citrate %>%
group_by(cit) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE))
## group_by: one grouping variable (cit)
## summarize: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 x 2
## cit mean_size
## <chr> <dbl>
## 1 minus 4.61
## 2 plus 4.77
## 3 unknown 4.62
You can group by multiple columns too:
Ecoli_citrate %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE))
## group_by: 2 grouping variables (cit, clade)
## summarize: now 13 rows and 3 columns, one group variable remaining (cit)
## # A tibble: 13 x 3
## # Groups: cit [3]
## cit clade mean_size
## <chr> <chr> <dbl>
## 1 minus C1 4.61
## 2 minus C2 4.62
## 3 minus C3 4.61
## 4 minus Cit+ 4.6
## 5 plus C2 4.75
## 6 plus Cit+ 4.77
## 7 unknown (C1,C2) 4.62
## 8 unknown C1 4.63
## 9 unknown C2 4.62
## 10 unknown C3 4.59
## 11 unknown UC 4.62
## 12 unknown unknown 4.62
## 13 unknown <NA> 4.62
Looks like for one of these clones, the clade is missing. We could then discard those rows using filter():
Ecoli_citrate %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE)) %>%
filter(!is.na(clade))
## group_by: 2 grouping variables (cit, clade)
## summarize: now 13 rows and 3 columns, one group variable remaining (cit)
## filter (grouped): removed one row (8%), 12 rows remaining
## # A tibble: 12 x 3
## # Groups: cit [3]
## cit clade mean_size
## <chr> <chr> <dbl>
## 1 minus C1 4.61
## 2 minus C2 4.62
## 3 minus C3 4.61
## 4 minus Cit+ 4.6
## 5 plus C2 4.75
## 6 plus Cit+ 4.77
## 7 unknown (C1,C2) 4.62
## 8 unknown C1 4.63
## 9 unknown C2 4.62
## 10 unknown C3 4.59
## 11 unknown UC 4.62
## 12 unknown unknown 4.62
You can also summarize multiple variables at the same time:
Ecoli_citrate %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE),
min_generation = min(generation))
## group_by: 2 grouping variables (cit, clade)
## summarize: now 13 rows and 4 columns, one group variable remaining (cit)
## # A tibble: 13 x 4
## # Groups: cit [3]
## cit clade mean_size min_generation
## <chr> <chr> <dbl> <dbl>
## 1 minus C1 4.61 31500
## 2 minus C2 4.62 31500
## 3 minus C3 4.61 32000
## 4 minus Cit+ 4.6 34000
## 5 plus C2 4.75 34000
## 6 plus Cit+ 4.77 31500
## 7 unknown (C1,C2) 4.62 20000
## 8 unknown C1 4.63 25000
## 9 unknown C2 4.62 30000
## 10 unknown C3 4.59 25000
## 11 unknown UC 4.62 10000
## 12 unknown unknown 4.62 2000
## 13 unknown <NA> 4.62 0
Create a tibble that has removed all of the rows where the clade is “unknown”. Then find the mean genome size of each unique clade and the resultant rank of mean genome size. (note that ranking genome size will not sort the table; the row order will be unchanged. You can use the arrange()
function to sort the table).
There are several functions for ranking observations, which handle tied values differently. For this exercise it doesn’t matter which function you choose. Use the help options to find a ranking function.
*Reminder that unknown is part of the dataset - R exlusively uses NA to indicate no data
Ecoli_citrate %>%
group_by(clade) %>%
filter(clade != "unknown") %>%
summarise(means = mean(genome_size)) %>%
mutate(rank = rank(means)) %>%
arrange(rank)
## group_by: one grouping variable (clade)
## filter (grouped): removed 3 rows (10%), 27 rows remaining
## summarise: now 6 rows and 2 columns, ungrouped
## mutate: new variable 'rank' with 6 unique values and 0% NA
## # A tibble: 6 x 3
## clade means rank
## <chr> <dbl> <dbl>
## 1 C3 4.6 1
## 2 C1 4.62 2
## 3 (C1,C2) 4.62 3
## 4 UC 4.62 4
## 5 C2 4.64 5
## 6 Cit+ 4.75 6
Data is often recorded in a long format for efficiency. We’re going to use a toy dataset to illustrate. For every plate a researcher counted, they wrote down the number of colonies they saw, and the dilution from the original culture As this was just a running tally, the data ended up in a long format:
counts <- tibble(replicate = c(1, 1, 2, 2, 3),
num_colonies = c(40, 70,
25, 60,
15),
dilution = c("d10", "d5", "d10", "d5", "d5")
)
counts
## # A tibble: 5 x 3
## replicate num_colonies dilution
## <dbl> <dbl> <chr>
## 1 1 40 d10
## 2 1 70 d5
## 3 2 25 d10
## 4 2 60 d5
## 5 3 15 d5
But, what if we wanted to easily compare number of colonies across drug concentrations? We can use pivot_wider
to change the data into a wide format.
pivot_wider
requires you tell it the name of the column which contains values that will be the column names in your new data set - a so called “key” column. You also tell it which column contains the relevant numbers - the “values” column.
counts_wide <- counts %>%
pivot_wider(names_from = dilution, values_from = num_colonies)
## pivot_wider: reorganized (num_colonies, dilution) into (d10, d5) [was 5x3, now 3x3]
counts_wide
## # A tibble: 3 x 3
## replicate d10 d5
## <dbl> <dbl> <dbl>
## 1 1 40 70
## 2 2 25 60
## 3 3 NA 15
You’ll notice there is an NA values for one of the plates. This is incredibly common in microbial and biological data. Sometimes, they are NA - they weren’t recorded (or, for example, a contaminated plate). Other times, such as in this data set, they actually mean 0 observations. To add this information we use the values_fill
argument in pivot_wider
.
counts_wide <- counts %>%
pivot_wider(names_from = dilution, values_from = c(num_colonies), values_fill=list(num_colonies = 0))
## pivot_wider: reorganized (num_colonies, dilution) into (d10, d5) [was 5x3, now 3x3]
counts_wide
## # A tibble: 3 x 3
## replicate d10 d5
## <dbl> <dbl> <dbl>
## 1 1 40 70
## 2 2 25 60
## 3 3 0 15
Note that we could also fill those in right into the long data format using the complete
function. In that function, we specify which columns we want all combinations of, and then supply a list of how new values should be filled in for other columns. If we don’t give a column name in that list, it defaults to NA.
counts_long_0 <- counts %>%
complete(dilution, replicate, fill=list(num_colonies=0))
counts_long_0
## # A tibble: 6 x 3
## dilution replicate num_colonies
## <chr> <dbl> <dbl>
## 1 d10 1 40
## 2 d10 2 25
## 3 d10 3 0
## 4 d5 1 70
## 5 d5 2 60
## 6 d5 3 15
In other cases the reverse is true - as some people recode their data in a wide format. To go from wide to long we use the pivot_longer
function, which “gathers up your wide data”.
In this case you specify what you want the name of the new key column to be, what you want the name of the new values column to be, and then you can either specify which columns are to be gathered up (which can be tedious if there are a lot or they are spread) or you can specify which columns you want to exclude. I actually do things by exclusion quite often.
count_long <- counts_wide %>%
pivot_longer(cols = c(d10, d5), names_to = "dilution", values_to = "count")
## pivot_longer: reorganized (d10, d5) into (dilution, count) [was 3x3, now 6x3]
count_long
## # A tibble: 6 x 3
## replicate dilution count
## <dbl> <chr> <dbl>
## 1 1 d10 40
## 2 1 d5 70
## 3 2 d10 25
## 4 2 d5 60
## 5 3 d10 0
## 6 3 d5 15
See vignette("pivot")
for more details.
Use the dplyr
package (within the tidyverse
) to manipulate tibbles.
Use select()
to choose variables from a tibbles.
Use filter()
to choose data based on values.
Use group_by()
and summarize()
to work with subsets of data.
Use mutate()
to create new variables.
This lesson was created by Aleeza Gerstein at the University of Manitoba. It is based largely on material from The Carpentries and compiled from workshop materials located here and here. Much of that was copied or adapted from Jeff Hollister’s materials. Made available under the Creative Commons Attribution license. License.