Learning Objectives


Constructing a tibble

This section is from the tibbles vignette

Basic tibble construction is analogous to data frame construction.

place1 <- tibble(x = 1:5, y = c("a", "b", "c", "d", "e"))

When you print a tibble, it only shows the first ten rows and all the columns that fit on one screen. It also prints an abbreviated description of the column type, and uses font styles and color for highlighting.

tibble(x = -5:100, y = 123.456 * (3 ^ x))
## # A tibble: 106 × 2
##        x         y
##    <int>     <dbl>
##  1    -5     0.508
##  2    -4     1.52 
##  3    -3     4.57 
##  4    -2    13.7  
##  5    -1    41.2  
##  6     0   123.   
##  7     1   370.   
##  8     2  1111.   
##  9     3  3333.   
## 10     4 10000.   
## # … with 96 more rows
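
If ten rows are not enough, the print method for tibbles accepts an `n` argument (number of rows) and a `width` argument (number of characters across). The sketch below uses a made-up toy tibble, not the lesson data, and the exact footer text will depend on your installed version of tibble.

```r
library(tibble)

# A toy tibble (hypothetical data) with more rows than the default print limit
tb <- tibble(x = 1:30, y = x^2)

# Show the first 15 rows instead of the default 10
print(tb, n = 15)

# Show every row; width = Inf would likewise show every column
print(tb, n = Inf)
```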

Tibbles are evaluated lazily and sequentially:

tibble(x = 1:5, y = x ^ 2)
## # A tibble: 5 × 2
##       x     y
##   <int> <dbl>
## 1     1     1
## 2     2     4
## 3     3     9
## 4     4    16
## 5     5    25
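
To see why sequential evaluation matters, it can help to try the same construction with `data.frame()`. This is a minimal sketch; the column names `side_len`, `area`, and `doubled` are invented for illustration, and it assumes no object called `side_len` already exists in your workspace.

```r
library(tibble)

# Each column can use the columns defined before it
squares <- tibble(side_len = 1:3, area = side_len^2, doubled = area * 2)
squares

# The same idea with data.frame() fails, because side_len is not
# visible while the area argument is being evaluated
df_attempt <- tryCatch(
  data.frame(side_len = 1:3, area = side_len^2),
  error = function(e) conditionMessage(e)
)
df_attempt  # the error message, returned as a character string
```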

When constructing a tibble, only values of length 1 are recycled. The first column with a length other than one determines the number of rows in the tibble; conflicts lead to an error. This also extends to tibbles with zero rows, which is sometimes important for programming:

tibble(a = 1, b = 1:3)
## # A tibble: 3 × 2
##       a     b
##   <dbl> <int>
## 1     1     1
## 2     1     2
## 3     1     3
tibble(a = 1:3, b = 1)
## # A tibble: 3 × 2
##       a     b
##   <int> <dbl>
## 1     1     1
## 2     2     1
## 3     3     1
# tibble(a = 1:3, c = 1:2)
tibble(a = 1, b = integer())
## # A tibble: 0 × 2
## # … with 2 variables: a <dbl>, b <int>
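
The commented-out call above is the conflicting case: columns of length 3 and 2 cannot be recycled to a common size. One way to look at the error without stopping a script is to wrap the call in `tryCatch()`, as in this small sketch. The wording of the message depends on your tibble version, so it is not reproduced here.

```r
library(tibble)

# Lengths 3 and 2 are incompatible, so tibble() signals an error
result <- tryCatch(
  tibble(a = 1:3, c = 1:2),
  error = function(e) e
)

inherits(result, "error")   # TRUE: no tibble was created
conditionMessage(result)    # the explanation tibble gives
```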

Importing Data

“From Modern Dive Section 5.1”

Almost everything you are typically going to do in R will require you to load data (at least at first), typically stored in a spreadsheet.

Spreadsheet data is often saved in one of the following formats:

Today we are going to study a population of Escherichia coli (designated Ara-3), which was propagated for more than 40,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate, which the ancestral E. coli cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points revealed that spontaneous citrate-using mutants (Cit+) appeared at around 31,000 generations in one of twelve populations (Blount et al., 2008 PNAS). This metadata describes information on the Ara-3 clones.

The dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single clone, and the columns represent:

Data dictionary.
Column        Description
sample        clone name
generation    generation when sample frozen
clade         based on a parsimony tree
strain        ancestral strain
cit           citrate-using mutant status
run           sequence read archive sample ID
genome_size   size in Mbp (made up for this lesson)

Note that this type of information is often called metadata, and this table is referred to as a data dictionary.

First we are going to download the metadata file using the download.file() function and the here() function.

The here package

See posts by Jenny Bryan and Malcolm Barrett

RStudio projects let us set up a local working directory, which makes it easier for someone else to access your files with the same folder and file structure.
The here package lets us write file paths that work across operating systems: it detects the root directory and lets us build paths relative to it. Within an RStudio project, the root directory is where your *.Rproj file is. This is one reason why each project should have only one *.Rproj file.

getwd()   #  prints out my working directory
## [1] "/Users/acgerstein/Nextcloud/Umanitoba/Teaching/22MBIO7040-RStats/lecture/scripts"
dir()     #  prints out what is in my working directory
##  [1] "01-Spreadsheets.html"         "01-Spreadsheets.Rmd"         
##  [3] "02-Intro.html"                "02-Intro.Rmd"                
##  [5] "03-R_Start.html"              "03-R_Start.Rmd"              
##  [7] "04-Data_Start.html"           "04-Data_Start.Rmd"           
##  [9] "05-dplyr.html"                "05-dplyr.Rmd"                
## [11] "06-EDA.html"                  "06-EDA.Rmd"                  
## [13] "06-introStats.html"           "06-introStats.Rmd"           
## [15] "07-EDA.html"                  "Day1_code_handout.R"         
## [17] "Day1_code_inClass.R"          "Day2_code_handout_full.R"    
## [19] "Day2_code_handout_skeleton.R" "Day2_code_handout.R"         
## [21] "GA_script.html"               "hideOutput.css"              
## [23] "hideOutput.js"                "index.html"                  
## [25] "index.Rmd"                    "style.css"
library(here) #  load the here package 
## here() starts at /Users/acgerstein/Nextcloud/Umanitoba/Teaching/22MBIO7040-RStats/lecture

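Importantly, this allows us to read and write files relative to the project root that here detected when the package was loaded, i.e., the place where the *.Rproj file is, rather than relative to the current working directory (note that getwd() and here() report different folders above).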
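
One practical detail: download.file() will not create a missing folder for you. The sketch below builds paths with here() and creates the `data_in` folder if it does not exist yet. It assumes you want `data_in` directly under the project root, as in this lesson.

```r
library(here)

# Path to the input folder, relative to the project root
data_dir <- here("data_in")

# Create the folder only if it is not already there
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}

# Build the full path to the file we are about to download
csv_path <- here("data_in", "Ecoli_citrate.csv")
basename(csv_path)     # "Ecoli_citrate.csv"
file.exists(csv_path)  # FALSE until the file has been downloaded
```

Because every path is built from the project root, the same code should work whether your working directory is the root itself or a subfolder such as `scripts`.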

library(here)
library(tidyverse)
library(tidylog)
## 
## Attaching package: 'tidylog'
## The following objects are masked from 'package:dplyr':
## 
##     add_count, add_tally, anti_join, count, distinct, distinct_all,
##     distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
##     full_join, group_by, group_by_all, group_by_at, group_by_if,
##     inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
##     relocate, rename, rename_all, rename_at, rename_if, rename_with,
##     right_join, sample_frac, sample_n, select, select_all, select_at,
##     select_if, semi_join, slice, slice_head, slice_max, slice_min,
##     slice_sample, slice_tail, summarise, summarise_all, summarise_at,
##     summarise_if, summarize, summarize_all, summarize_at, summarize_if,
##     tally, top_frac, top_n, transmute, transmute_all, transmute_at,
##     transmute_if, ungroup
## The following objects are masked from 'package:tidyr':
## 
##     drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
##     spread, uncount
## The following object is masked from 'package:stats':
## 
##     filter
download.file("https://raw.githubusercontent.com/datacarpentry/R-genomics/gh-pages/data/Ecoli_metadata.csv", here("data_in", "Ecoli_citrate.csv"))
here("data_in", "Ecoli_citrate")
## [1] "/Users/acgerstein/Nextcloud/Umanitoba/Teaching/22MBIO7040-RStats/lecture/data_in/Ecoli_citrate"
Ecoli_citrate <- read_csv(here("data_in", "Ecoli_citrate.csv"))
## Rows: 30 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): sample, clade, strain, cit, run
## dbl (2): generation, genome_size
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You can see that read_csv reports a “column specification”. This shows the variable names that were read in and the type of data that each column was interpreted as. Among other things, read_csv() will store the data as a tibble and read.csv() (which is the built-in function in base R) will store the data as a data frame.
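
As the message suggests, you can also state the column types yourself instead of letting read_csv() guess. This is a small sketch on a made-up three-column file written to a temporary location (not the lesson file), using readr's `cols()` helper; here `generation` is deliberately read as an integer rather than a double.

```r
library(readr)

# Write a tiny CSV (hypothetical values) to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("sample,generation,genome_size",
    "REL606,0,4.62",
    "REL1166A,2000,4.63"),
  tmp
)

# Supplying col_types fixes the types and silences the message
toy <- read_csv(
  tmp,
  col_types = cols(
    sample = col_character(),
    generation = col_integer(),
    genome_size = col_double()
  )
)

spec(toy)  # the full column specification that was used
```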

Ecoli_citrate
## # A tibble: 30 × 7
##    sample   generation clade   strain cit     run       genome_size
##    <chr>         <dbl> <chr>   <chr>  <chr>   <chr>           <dbl>
##  1 REL606            0 <NA>    REL606 unknown <NA>             4.62
##  2 REL1166A       2000 unknown REL606 unknown SRR098028        4.63
##  3 ZDB409         5000 unknown REL606 unknown SRR098281        4.6 
##  4 ZDB429        10000 UC      REL606 unknown SRR098282        4.59
##  5 ZDB446        15000 UC      REL606 unknown SRR098283        4.66
##  6 ZDB458        20000 (C1,C2) REL606 unknown SRR098284        4.63
##  7 ZDB464*       20000 (C1,C2) REL606 unknown SRR098285        4.62
##  8 ZDB467        20000 (C1,C2) REL606 unknown SRR098286        4.61
##  9 ZDB477        25000 C1      REL606 unknown SRR098287        4.65
## 10 ZDB483        25000 C3      REL606 unknown SRR098288        4.59
## # … with 20 more rows

EXERCISE

Using the GUI interface:
1. Go to the Files panel of RStudio.
2. Navigate to the directory i.e. folder on your computer where the downloaded Ecoli_citrate.csv file is saved.
3. Click on Ecoli_citrate.csv
4. Click “Import Dataset…”

At this point you should see an image like this:

After clicking on the “Import” button on the bottom right, RStudio will save this spreadsheet’s data in a data frame called Ecoli_citrate and display its contents in the spreadsheet viewer. Furthermore, note that in the bottom right of the above image there is a “Code Preview”: you can copy and paste this code to reload your data later automatically, instead of repeating the manual point-and-click process.

Note that both methods have imported the data as a tibble (if you copied the code from the Code Preview you’ll note that it used read_csv just like we did).

Differences with base R

Here you’ve learned to read files using the functionality in the readr package, which is part of the tidyverse.

R’s standard data structure for tabular data is the data.frame. In contrast, read_csv() creates a tibble (also referred to, for historic reasons, as a tbl_df). This extends the functionality of a data.frame, and can, for the most part, be treated like a data.frame.

You will find that some older functions don’t work on tibbles. A tibble can be converted to a data frame using as.data.frame(mytibble). To convert a data frame to a tibble, use as_tibble(mydataframe).

Different R Syntaxes

Different varieties of R syntax give you many ways to “say” the same thing.

We just learned the dollar-sign “base” syntax and will now be discussing much more of the tidyverse syntax. I will try to keep pointing out which syntax we’re working in. At least a little bit of mixing and matching is unavoidable, since most of the statistics functions require base and/or formula syntax, while the methods we use for plotting and data wrangling (coming up next) are tidyverse.

You can download a cheatsheet that provides an overview of this here

Just for fun we can compare the tibble to the same file loaded using read.csv():

Ecoli_citrate2 <- read.csv(here("data_in", "Ecoli_citrate.csv"))
head(Ecoli_citrate2)
##     sample generation   clade strain     cit       run genome_size
## 1   REL606          0    <NA> REL606 unknown                  4.62
## 2 REL1166A       2000 unknown REL606 unknown SRR098028        4.63
## 3   ZDB409       5000 unknown REL606 unknown SRR098281        4.60
## 4   ZDB429      10000      UC REL606 unknown SRR098282        4.59
## 5   ZDB446      15000      UC REL606 unknown SRR098283        4.66
## 6   ZDB458      20000 (C1,C2) REL606 unknown SRR098284        4.63

Notice that I’ve used head(), which gives me only the first 6 rows of the data sheet. What happens if you type Ecoli_citrate2 into the console? Note that this is different from what we saw when we read in Ecoli_citrate as a tibble. There will be some applications in the future where we will need to use data frames instead of tibbles. We can easily convert between them using as.data.frame() or as_tibble():

Ecoli_citrate2 <- as_tibble(Ecoli_citrate2)
Ecoli_citrate2
## # A tibble: 30 × 7
##    sample   generation clade   strain cit     run         genome_size
##    <chr>         <int> <chr>   <chr>  <chr>   <chr>             <dbl>
##  1 REL606            0 <NA>    REL606 unknown ""                 4.62
##  2 REL1166A       2000 unknown REL606 unknown "SRR098028"        4.63
##  3 ZDB409         5000 unknown REL606 unknown "SRR098281"        4.6 
##  4 ZDB429        10000 UC      REL606 unknown "SRR098282"        4.59
##  5 ZDB446        15000 UC      REL606 unknown "SRR098283"        4.66
##  6 ZDB458        20000 (C1,C2) REL606 unknown "SRR098284"        4.63
##  7 ZDB464*       20000 (C1,C2) REL606 unknown "SRR098285"        4.62
##  8 ZDB467        20000 (C1,C2) REL606 unknown "SRR098286"        4.61
##  9 ZDB477        25000 C1      REL606 unknown "SRR098287"        4.65
## 10 ZDB483        25000 C3      REL606 unknown "SRR098288"        4.59
## # … with 20 more rows
Ecoli_citrate2 <- as.data.frame(Ecoli_citrate2)
head(Ecoli_citrate2)
##     sample generation   clade strain     cit       run genome_size
## 1   REL606          0    <NA> REL606 unknown                  4.62
## 2 REL1166A       2000 unknown REL606 unknown SRR098028        4.63
## 3   ZDB409       5000 unknown REL606 unknown SRR098281        4.60
## 4   ZDB429      10000      UC REL606 unknown SRR098282        4.59
## 5   ZDB446      15000      UC REL606 unknown SRR098283        4.66
## 6   ZDB458      20000 (C1,C2) REL606 unknown SRR098284        4.63
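
One difference visible in the output above is the run column for REL606: read_csv() stored the empty field as NA, while read.csv() kept it as an empty string "". If you want base R to treat empty text fields as missing too, the `na.strings` argument can be used. The sketch below works on a two-row inline example rather than the lesson file.

```r
# A tiny CSV as text; the first row has an empty run field
csv_text <- "sample,run\nREL606,\nREL1166A,SRR098028"

# Default behaviour: the empty character field is read as ""
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)
default_read$run[1]

# Treat both "" and "NA" as missing values
na_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                    na.strings = c("", "NA"))
is.na(na_read$run[1])
```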

Since we don’t need this data frame, let’s remove the object from memory.

rm(Ecoli_citrate2)

You can see what is currently being stored using ls().

Exploring tibbles

We can explore the contents of a tibble in several ways. We can view the first ten rows of a tibble as above, which tells us lots of information about the column types and the number of rows. We can also use View() and glimpse():

View(Ecoli_citrate)
glimpse(Ecoli_citrate)
## Rows: 30
## Columns: 7
## $ sample      <chr> "REL606", "REL1166A", "ZDB409", "ZDB429", "ZDB446", "ZDB45…
## $ generation  <dbl> 0, 2000, 5000, 10000, 15000, 20000, 20000, 20000, 25000, 2…
## $ clade       <chr> NA, "unknown", "unknown", "UC", "UC", "(C1,C2)", "(C1,C2)"…
## $ strain      <chr> "REL606", "REL606", "REL606", "REL606", "REL606", "REL606"…
## $ cit         <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "un…
## $ run         <chr> NA, "SRR098028", "SRR098281", "SRR098282", "SRR098283", "S…
## $ genome_size <dbl> 4.62, 4.63, 4.60, 4.59, 4.66, 4.63, 4.62, 4.61, 4.65, 4.59…

We can return a vector containing the values of a variable (column) using the $ sign:

Ecoli_citrate$generation
##  [1]     0  2000  5000 10000 15000 20000 20000 20000 25000 25000 30000 30000
## [13] 31500 31500 31500 32000 32000 32500 32500 33000 33000 33000 34000 34000
## [25] 36000 36000 38000 38000 40000 40000

We can also use the subsetting operator [] directly on tibbles. In contrast to a vector, a tibble is two-dimensional. We pass two arguments to the [] operator: the first indicates the row(s) we require and the second indicates the column(s). So to return the value in row 10, column 1:

Ecoli_citrate[10, 1]
## # A tibble: 1 × 1
##   sample
##   <chr> 
## 1 ZDB483

Similarly, to return the values in rows 25 to 30, and columns 1 to 3:

Ecoli_citrate[25:30, 1:3]
## # A tibble: 6 × 3
##   sample   generation clade
##   <chr>         <dbl> <chr>
## 1 ZDB96         36000 Cit+ 
## 2 ZDB99         36000 C1   
## 3 ZDB107        38000 Cit+ 
## 4 ZDB111        38000 C2   
## 5 REL10979      40000 Cit+ 
## 6 REL10988      40000 C2

If we leave an index blank, this acts as a wildcard and matches all of the rows or columns:

Ecoli_citrate[22, ]
## # A tibble: 1 × 7
##   sample generation clade strain cit   run       genome_size
##   <chr>       <dbl> <chr> <chr>  <chr> <chr>           <dbl>
## 1 CZB154      33000 Cit+  REL606 plus  SRR098026        4.76
Ecoli_citrate[, 1:3]
## # A tibble: 30 × 3
##    sample   generation clade  
##    <chr>         <dbl> <chr>  
##  1 REL606            0 <NA>   
##  2 REL1166A       2000 unknown
##  3 ZDB409         5000 unknown
##  4 ZDB429        10000 UC     
##  5 ZDB446        15000 UC     
##  6 ZDB458        20000 (C1,C2)
##  7 ZDB464*       20000 (C1,C2)
##  8 ZDB467        20000 (C1,C2)
##  9 ZDB477        25000 C1     
## 10 ZDB483        25000 C3     
## # … with 20 more rows

You can also refer to columns by name with quotation marks.

Ecoli_citrate[, "sample"]
## # A tibble: 30 × 1
##    sample  
##    <chr>   
##  1 REL606  
##  2 REL1166A
##  3 ZDB409  
##  4 ZDB429  
##  5 ZDB446  
##  6 ZDB458  
##  7 ZDB464* 
##  8 ZDB467  
##  9 ZDB477  
## 10 ZDB483  
## # … with 20 more rows
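
Row positions and column names can be combined in a single call, and several names can be supplied with c(). A minimal sketch on a made-up three-row tibble (the values mimic the lesson data but are typed in by hand):

```r
library(tibble)

# Toy tibble with hand-typed values
toy <- tibble(
  sample = c("REL606", "ZDB409", "ZDB429"),
  generation = c(0, 5000, 10000),
  clade = c(NA, "unknown", "UC")
)

# Rows 2 to 3, and two columns chosen by name
picked <- toy[2:3, c("sample", "clade")]
picked
dim(picked)  # 2 rows, 2 columns
```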

Note that subsetting a tibble returns another tibble; in contrast, using $ to extract a variable returns a vector:

Ecoli_citrate$cit
##  [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
##  [8] "unknown" "unknown" "unknown" "unknown" "unknown" "minus"   "minus"  
## [15] "plus"    "minus"   "plus"    "minus"   "plus"    "minus"   "plus"   
## [22] "plus"    "minus"   "plus"    "plus"    "minus"   "plus"    "minus"  
## [29] "plus"    "minus"
Ecoli_citrate[, "cit"]
## # A tibble: 30 × 1
##    cit    
##    <chr>  
##  1 unknown
##  2 unknown
##  3 unknown
##  4 unknown
##  5 unknown
##  6 unknown
##  7 unknown
##  8 unknown
##  9 unknown
## 10 unknown
## # … with 20 more rows
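
This is one place where tibbles and base data frames behave differently: selecting a single column of a data frame with [] silently returns a vector, whereas a tibble always returns a tibble. The sketch below compares the two on a small hand-made example, and shows [[ ]] as another way to get the vector.

```r
library(tibble)

# The same toy data as a data frame and as a tibble
df <- data.frame(
  sample = c("REL606", "ZDB409"),
  generation = c(0, 5000),
  stringsAsFactors = FALSE
)
tb <- as_tibble(df)

class(df[, "sample"])  # "character": the data frame dropped to a vector
class(tb[, "sample"])  # still a tibble

# [[ ]] extracts the column as a vector, just like $
tb[["sample"]]
```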

We can use arrange() to re-order a data frame based on the values of a column. It also accepts multiple columns, and each can be sorted in ascending (the default) or descending order.

# ascending (the default)
arrange(Ecoli_citrate, genome_size)
## # A tibble: 30 × 7
##    sample generation clade   strain cit     run       genome_size
##    <chr>       <dbl> <chr>   <chr>  <chr>   <chr>           <dbl>
##  1 ZDB429      10000 UC      REL606 unknown SRR098282        4.59
##  2 ZDB483      25000 C3      REL606 unknown SRR098288        4.59
##  3 CZB199      33000 C1      REL606 minus   SRR098027        4.59
##  4 ZDB409       5000 unknown REL606 unknown SRR098281        4.6 
##  5 ZDB83       34000 Cit+    REL606 minus   SRR098034        4.6 
##  6 ZDB467      20000 (C1,C2) REL606 unknown SRR098286        4.61
##  7 ZDB16       30000 C1      REL606 unknown SRR098031        4.61
##  8 ZDB30*      32000 C3      REL606 minus   SRR098032        4.61
##  9 ZDB99       36000 C1      REL606 minus   SRR098037        4.61
## 10 REL606          0 <NA>    REL606 unknown <NA>             4.62
## # … with 20 more rows
#multiple columns: smallest genome size and largest generation
arrange(Ecoli_citrate, genome_size, desc(generation))
## # A tibble: 30 × 7
##    sample   generation clade   strain cit     run       genome_size
##    <chr>         <dbl> <chr>   <chr>  <chr>   <chr>           <dbl>
##  1 CZB199        33000 C1      REL606 minus   SRR098027        4.59
##  2 ZDB483        25000 C3      REL606 unknown SRR098288        4.59
##  3 ZDB429        10000 UC      REL606 unknown SRR098282        4.59
##  4 ZDB83         34000 Cit+    REL606 minus   SRR098034        4.6 
##  5 ZDB409         5000 unknown REL606 unknown SRR098281        4.6 
##  6 ZDB99         36000 C1      REL606 minus   SRR098037        4.61
##  7 ZDB30*        32000 C3      REL606 minus   SRR098032        4.61
##  8 ZDB16         30000 C1      REL606 unknown SRR098031        4.61
##  9 ZDB467        20000 (C1,C2) REL606 unknown SRR098286        4.61
## 10 REL10988      40000 C2      REL606 minus   SRR098030        4.62
## # … with 20 more rows
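
The calls above sort from smallest to largest unless a column is wrapped in desc(). To make the tie-breaking easier to follow, here is a sketch on a three-row toy tibble with invented sample names, where two rows share the same genome size.

```r
library(dplyr)
library(tibble)

# Toy data: samples A and C tie on genome_size
toy <- tibble(
  sample = c("A", "B", "C"),
  genome_size = c(4.60, 4.59, 4.60),
  generation = c(5000, 10000, 34000)
)

# Largest genome first
by_size_desc <- arrange(toy, desc(genome_size))
by_size_desc$sample   # "A" "C" "B"

# Smallest genome first; ties broken by the latest generation
by_size_gen <- arrange(toy, genome_size, desc(generation))
by_size_gen$sample    # "B" "C" "A"
```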

Writing data in R

We can save a tibble (or data frame) to a CSV file using readr’s write_csv() function. For example, to save a subset of the Ecoli_citrate data to its own file:

Ecoli_citrate_sub <- Ecoli_citrate[25:30, 1:3]  # rows 25 to 30, columns 1 to 3; we'll see other ways to select rows and columns in the next lesson
Ecoli_citrate_sub
## # A tibble: 6 × 3
##   sample   generation clade
##   <chr>         <dbl> <chr>
## 1 ZDB96         36000 Cit+ 
## 2 ZDB99         36000 C1   
## 3 ZDB107        38000 Cit+ 
## 4 ZDB111        38000 C2   
## 5 REL10979      40000 Cit+ 
## 6 REL10988      40000 C2
write_csv(Ecoli_citrate_sub, here("data_out", "Ecoli_citrate_sub.csv"))
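
A quick way to convince yourself that a file was written correctly is to read it back in and compare. The sketch below does this round trip with a two-row toy tibble and a temporary file, so it does not touch the lesson's `data_out` folder.

```r
library(readr)
library(tibble)

# Toy tibble with hand-typed values
toy <- tibble(
  sample = c("ZDB96", "ZDB99"),
  generation = c(36000, 36000)
)

# Write to a temporary CSV file, then read it back in
out_path <- tempfile(fileext = ".csv")
write_csv(toy, out_path)
back <- read_csv(out_path, show_col_types = FALSE)

# Compare the contents column by column
identical(toy$sample, back$sample)
all.equal(toy$generation, back$generation)
```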

Attribution

This lesson was created by Aleeza Gerstein at the University of Manitoba. It is based largely on material from The Carpentries. The material is compiled from workshop materials located here and here. The section on read_csv vs. read.csv is from Modern Dive Section 5. Made available under the Creative Commons Attribution license. The R syntax cheatsheet was developed by Amelia McNamara.