Learning Objectives
What the here package does
How read_csv compares to read.csv
This section is from the tibbles vignette
Basic tibble construction is analogous to data frame construction.
place1 <- tibble(x = 1:5, y = c("a", "b", "c", "d", "e"))
When you print a tibble, it only shows the first ten rows and all the columns that fit on one screen. It also prints an abbreviated description of the column type, and uses font styles and color for highlighting.
tibble(x = -5:100, y = 123.456 * (3 ^ x))
## # A tibble: 106 × 2
## x y
## <int> <dbl>
## 1 -5 0.508
## 2 -4 1.52
## 3 -3 4.57
## 4 -2 13.7
## 5 -1 41.2
## 6 0 123.
## 7 1 370.
## 8 2 1111.
## 9 3 3333.
## 10 4 10000.
## # … with 96 more rows
The arguments to tibble() are evaluated lazily and sequentially, so a column can refer to columns defined before it:
tibble(x = 1:5, y = x ^ 2)
## # A tibble: 5 × 2
## x y
## <int> <dbl>
## 1 1 1
## 2 2 4
## 3 3 9
## 4 4 16
## 5 5 25
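Because the arguments are evaluated in order, a later column can build on any column defined earlier in the same call. A minimal sketch, with column names made up for illustration:

```r
library(tibble)

# Each argument can refer to columns created earlier in the same call.
squares <- tibble(
  x = 1:3,
  y = x^2,     # uses x
  z = x + y    # uses both x and y
)
squares
```

The same call with data.frame() would fail unless an object named x already existed in your workspace, because data.frame() does not look up columns defined earlier in the call.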
When constructing a tibble, only values of length 1 are recycled. The first column with a length other than one determines the number of rows in the tibble; conflicting lengths lead to an error. This also extends to tibbles with zero rows, which is sometimes important for programming:
tibble(a = 1, b = 1:3)
## # A tibble: 3 × 2
## a b
## <dbl> <int>
## 1 1 1
## 2 1 2
## 3 1 3
tibble(a = 1:3, b = 1)
## # A tibble: 3 × 2
## a b
## <int> <dbl>
## 1 1 1
## 2 2 1
## 3 3 1
# tibble(a = 1:3, c = 1:2) # not run: lengths 3 and 2 conflict, so this call gives an error
tibble(a = 1, b = integer())
## # A tibble: 0 × 2
## # … with 2 variables: a <dbl>, b <int>
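If you would like to see the length conflict without stopping your script, one option is to wrap the call in tryCatch(). This is only a sketch; the message strings are made up for illustration and are not what tibble itself prints:

```r
library(tibble)

# Lengths 3 and 2 cannot be reconciled, so tibble() signals an error.
# tryCatch() captures the error instead of halting the script.
outcome <- tryCatch(
  {
    tibble(a = 1:3, c = 1:2)
    "no error"
  },
  error = function(e) "error: incompatible column lengths"
)
outcome
```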
“From Modern Dive Section 5.1”
Almost everything you are typically going to do in R will require you to load data (at least at first), typically stored in a spreadsheet.
Spreadsheet data is often saved in one of the following formats:
A comma-separated values .csv file. You can think of a .csv file as a bare-bones spreadsheet.
An Excel .xlsx file. This format is based on Microsoft’s proprietary Excel software. As opposed to bare-bones .csv files, .xlsx Excel files sometimes contain a lot of meta-data, or put more simply, data about the data. Some examples of spreadsheet meta-data include the use of bold and italic fonts, colored cells, different column widths, and formula macros.
A Google Sheets file. Google Sheets can export data in both .csv and Excel .xlsx formats: go to the Google Sheets menu bar -> File -> Download as -> Select “Microsoft Excel” or “Comma-separated values.”
Today we are going to study a population of Escherichia coli (designated Ara-3), which was propagated for more than 40,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate, which the ancestral E. coli cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points revealed that spontaneous citrate-using mutants (Cit+) appeared at around 31,000 generations in one of twelve populations (Blount et al., 2008 PNAS).
The dataset is stored as a comma-separated values (CSV) file. This metadata describes the Ara-3 clones: each row holds information for a single clone, and the columns represent:
Column | Description |
---|---|
sample | clone name |
generation | generation when sample frozen |
clade | based on a parsimony tree |
strain | ancestral strain |
cit | citrate-using mutant status |
run | sequence read archive sample ID |
genome_size | size in Mbp (made up for this lesson) |
Note that this type of information is often called metadata, and this table is referred to as a data dictionary.
First we are going to download the metadata file using the download.file() function and the here() function.
See posts by Jenny Bryan and Malcolm Barrett
RStudio projects let us set up a local working directory, which makes
it easier for someone else to access your files with the same folder and
file structure.
The here package lets us write file paths that work across operating systems: it detects the root directory and lets us build paths accordingly. Within an RStudio project the root directory is where your *.Rproj file is. This is one reason why each project should have only one *.Rproj file.
getwd() # prints out my working directory
## [1] "/Users/acgerstein/Nextcloud/Umanitoba/Teaching/22MBIO7040-RStats/lecture/scripts"
dir() # prints out what is in my working directory
## [1] "01-Spreadsheets.html" "01-Spreadsheets.Rmd"
## [3] "02-Intro.html" "02-Intro.Rmd"
## [5] "03-R_Start.html" "03-R_Start.Rmd"
## [7] "04-Data_Start.html" "04-Data_Start.Rmd"
## [9] "05-dplyr.html" "05-dplyr.Rmd"
## [11] "06-EDA.html" "06-EDA.Rmd"
## [13] "06-introStats.html" "06-introStats.Rmd"
## [15] "07-EDA.html" "Day1_code_handout.R"
## [17] "Day1_code_inClass.R" "Day2_code_handout_full.R"
## [19] "Day2_code_handout_skeleton.R" "Day2_code_handout.R"
## [21] "GA_script.html" "hideOutput.css"
## [23] "hideOutput.js" "index.html"
## [25] "index.Rmd" "style.css"
library(here) # load the here package
## here() starts at /Users/acgerstein/Nextcloud/Umanitoba/Teaching/22MBIO7040-RStats/lecture
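Importantly, this allows us to read and write files relative to the project root that here detected when the package was loaded, i.e., the place where the *.Rproj file is, rather than relative to the current working directory. In the output above, the working directory is the scripts folder, but here() starts one level up, in the lecture folder.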
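A minimal sketch of how a path is built with here(). The folder and file names match the ones used in this lesson; whether the file exists on your computer depends on whether you have downloaded it yet:

```r
library(here)

# here() with no arguments returns the project root it detected.
root <- here()

# Additional arguments are joined into a single path below that root.
csv_path <- here("data_in", "Ecoli_citrate.csv")

basename(csv_path)      # the file name component of the path
file.exists(csv_path)   # TRUE only once the file has been downloaded
```

Checking file.exists() before reading is a cheap way to confirm that the path points where you expect.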
library(here)
library(tidyverse)
library(tidylog)
##
## Attaching package: 'tidylog'
## The following objects are masked from 'package:dplyr':
##
## add_count, add_tally, anti_join, count, distinct, distinct_all,
## distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
## full_join, group_by, group_by_all, group_by_at, group_by_if,
## inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
## relocate, rename, rename_all, rename_at, rename_if, rename_with,
## right_join, sample_frac, sample_n, select, select_all, select_at,
## select_if, semi_join, slice, slice_head, slice_max, slice_min,
## slice_sample, slice_tail, summarise, summarise_all, summarise_at,
## summarise_if, summarize, summarize_all, summarize_at, summarize_if,
## tally, top_frac, top_n, transmute, transmute_all, transmute_at,
## transmute_if, ungroup
## The following objects are masked from 'package:tidyr':
##
## drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
## spread, uncount
## The following object is masked from 'package:stats':
##
## filter
download.file("https://raw.githubusercontent.com/datacarpentry/R-genomics/gh-pages/data/Ecoli_metadata.csv", here("data_in", "Ecoli_citrate.csv"))
here("data_in", "Ecoli_citrate")
## [1] "/Users/acgerstein/Nextcloud/Umanitoba/Teaching/22MBIO7040-RStats/lecture/data_in/Ecoli_citrate"
Ecoli_citrate <- read_csv(here("data_in", "Ecoli_citrate.csv"))
## Rows: 30 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): sample, clade, strain, cit, run
## dbl (2): generation, genome_size
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
You can see that read_csv() reports a “column specification”. This shows the variable names that were read in and the type of data that each column was interpreted as. Among other things, read_csv() will store the data as a tibble and read.csv() (which is the built-in function in base R) will store the data as a data frame.
Ecoli_citrate
## # A tibble: 30 × 7
## sample generation clade strain cit run genome_size
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 REL606 0 <NA> REL606 unknown <NA> 4.62
## 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63
## 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.6
## 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59
## 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66
## 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63
## 7 ZDB464* 20000 (C1,C2) REL606 unknown SRR098285 4.62
## 8 ZDB467 20000 (C1,C2) REL606 unknown SRR098286 4.61
## 9 ZDB477 25000 C1 REL606 unknown SRR098287 4.65
## 10 ZDB483 25000 C3 REL606 unknown SRR098288 4.59
## # … with 20 more rows
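To see the difference between the two readers without touching the lesson data, here is a small sketch that writes a made-up two-row CSV to a temporary file and reads it back both ways. The values and object names are assumptions for illustration only:

```r
library(readr)

# Write a tiny CSV to a temporary file (made-up values).
tmp <- tempfile(fileext = ".csv")
writeLines(c("sample,generation", "A,0", "B,2000"), tmp)

from_readr <- read_csv(tmp, show_col_types = FALSE)  # readr
from_base  <- read.csv(tmp)                          # base R

class(from_readr)             # includes "tbl_df", i.e. a tibble
class(from_base)              # a plain data frame
class(from_readr$generation)  # read_csv() reads whole numbers as doubles
class(from_base$generation)   # read.csv() reads them as integers
```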
Alternatively, you can import the file using the RStudio graphical interface:
1. Go to the Files panel of RStudio.
2. Navigate to the directory (i.e., folder) on your computer where the downloaded Ecoli_citrate.csv file is saved.
3. Click on Ecoli_citrate.csv
4. Click “Import Dataset…”
At this point you should see an image like this:
After clicking on the “Import” button on the bottom right, RStudio will save this spreadsheet’s data in a data frame called Ecoli_citrate and display its contents in the spreadsheet viewer. Furthermore, note that in the bottom right of the above image there is a “Code Preview”: you can copy and paste this code to reload your data again later automatically instead of repeating the above manual point-and-click process.
Note that both methods have imported the data as a tibble (if you copied the code from the Code Preview you’ll note that it used read_csv() just like we did).
Here you’ve learned to read files using the functionality in the readr package, which is part of the tidyverse.
R’s standard data structure for tabular data is the data.frame. In contrast, read_csv() creates a tibble (also referred to, for historic reasons, as a tbl_df). This extends the functionality of a data.frame, and can, for the most part, be treated like a data.frame.
You will find that some older functions don’t work on tibbles. A tibble can be converted to a data frame using as.data.frame(mytibble). To convert a data frame to a tibble, use as_tibble(mydataframe).
Different varieties of R syntax give you many ways to “say” the same thing.
We just learned the dollar-sign “base” syntax and will now be discussing much more of the tidyverse syntax. I will try to keep pointing out which syntax we’re working in. At least a little bit of mixing and matching is unavoidable, since most of the statistics functions require base and/or formula syntax, while the methods we use for plotting and data wrangling (coming up next) are tidyverse.
You can download a cheatsheet that provides an overview of this here
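As a rough illustration of the three syntaxes, the sketch below computes mean genome size on a small made-up data frame. The object names and values are assumptions for illustration and are not part of the lesson data:

```r
library(dplyr)

# Made-up values, for illustration only.
sizes <- data.frame(cit = c("plus", "minus", "plus"),
                    genome_size = c(4.7, 4.6, 4.8))

# Base ("dollar sign") syntax: overall mean
overall_mean <- mean(sizes$genome_size)

# Formula syntax, common in statistics functions: mean per group
by_cit_formula <- aggregate(genome_size ~ cit, data = sizes, FUN = mean)

# Tidyverse syntax: mean per group
by_cit_tidy <- sizes %>%
  group_by(cit) %>%
  summarise(mean_size = mean(genome_size))

by_cit_formula
by_cit_tidy
```

The formula and tidyverse versions give the same group means; only the way the request is phrased differs.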
Just for fun we can compare the tibble to the same file loaded using read.csv():
Ecoli_citrate2 <- read.csv(here("data_in", "Ecoli_citrate.csv"))
head(Ecoli_citrate2)
## sample generation clade strain cit run genome_size
## 1 REL606 0 <NA> REL606 unknown 4.62
## 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63
## 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.60
## 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59
## 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66
## 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63
Notice that I’ve used head(), which gives me only the first 6 rows of the data sheet. What happens if you type Ecoli_citrate2 into the console? Note that this is different from what happened when we read in Ecoli_citrate as a tibble.
There will be some applications in the future where we will need to use data frames instead of tibbles. We can easily convert between them using as.data.frame() or as_tibble():
Ecoli_citrate2 <- as_tibble(Ecoli_citrate2)
Ecoli_citrate2
## # A tibble: 30 × 7
## sample generation clade strain cit run genome_size
## <chr> <int> <chr> <chr> <chr> <chr> <dbl>
## 1 REL606 0 <NA> REL606 unknown "" 4.62
## 2 REL1166A 2000 unknown REL606 unknown "SRR098028" 4.63
## 3 ZDB409 5000 unknown REL606 unknown "SRR098281" 4.6
## 4 ZDB429 10000 UC REL606 unknown "SRR098282" 4.59
## 5 ZDB446 15000 UC REL606 unknown "SRR098283" 4.66
## 6 ZDB458 20000 (C1,C2) REL606 unknown "SRR098284" 4.63
## 7 ZDB464* 20000 (C1,C2) REL606 unknown "SRR098285" 4.62
## 8 ZDB467 20000 (C1,C2) REL606 unknown "SRR098286" 4.61
## 9 ZDB477 25000 C1 REL606 unknown "SRR098287" 4.65
## 10 ZDB483 25000 C3 REL606 unknown "SRR098288" 4.59
## # … with 20 more rows
Ecoli_citrate2 <- as.data.frame(Ecoli_citrate2)
head(Ecoli_citrate2)
## sample generation clade strain cit run genome_size
## 1 REL606 0 <NA> REL606 unknown 4.62
## 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63
## 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.60
## 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59
## 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66
## 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63
Since we don’t need this data frame, let’s remove the object from memory.
rm(Ecoli_citrate2)
You can see what is currently being stored using ls().
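We can explore the contents of a tibble in several ways. We can view the first ten rows of a tibble as above, which tells us lots of information about the column types and the number of rows. We can also use View(), which opens the tibble in RStudio’s spreadsheet viewer, and glimpse(), which prints a compact summary of every column: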
View(Ecoli_citrate)
glimpse(Ecoli_citrate)
## Rows: 30
## Columns: 7
## $ sample <chr> "REL606", "REL1166A", "ZDB409", "ZDB429", "ZDB446", "ZDB45…
## $ generation <dbl> 0, 2000, 5000, 10000, 15000, 20000, 20000, 20000, 25000, 2…
## $ clade <chr> NA, "unknown", "unknown", "UC", "UC", "(C1,C2)", "(C1,C2)"…
## $ strain <chr> "REL606", "REL606", "REL606", "REL606", "REL606", "REL606"…
## $ cit <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "un…
## $ run <chr> NA, "SRR098028", "SRR098281", "SRR098282", "SRR098283", "S…
## $ genome_size <dbl> 4.62, 4.63, 4.60, 4.59, 4.66, 4.63, 4.62, 4.61, 4.65, 4.59…
We can return a vector containing the values of a variable (column) using the $ sign:
Ecoli_citrate$generation
## [1] 0 2000 5000 10000 15000 20000 20000 20000 25000 25000 30000 30000
## [13] 31500 31500 31500 32000 32000 32500 32500 33000 33000 33000 34000 34000
## [25] 36000 36000 38000 38000 40000 40000
We can also use the subsetting operator [] directly on tibbles. In contrast to a vector, a tibble is two dimensional. We pass two arguments to the [] operator; the first indicates the row(s) we require and the second indicates the column(s). So to return the value in row 10, column 1:
Ecoli_citrate[10, 1]
## # A tibble: 1 × 1
## sample
## <chr>
## 1 ZDB483
Similarly, to return the values in rows 25 to 30, and columns 1 to 3:
Ecoli_citrate[25:30, 1:3]
## # A tibble: 6 × 3
## sample generation clade
## <chr> <dbl> <chr>
## 1 ZDB96 36000 Cit+
## 2 ZDB99 36000 C1
## 3 ZDB107 38000 Cit+
## 4 ZDB111 38000 C2
## 5 REL10979 40000 Cit+
## 6 REL10988 40000 C2
If we leave an index blank, this acts as a wildcard and matches all of the rows or columns:
Ecoli_citrate[22, ]
## # A tibble: 1 × 7
## sample generation clade strain cit run genome_size
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 CZB154 33000 Cit+ REL606 plus SRR098026 4.76
Ecoli_citrate[, 1:3]
## # A tibble: 30 × 3
## sample generation clade
## <chr> <dbl> <chr>
## 1 REL606 0 <NA>
## 2 REL1166A 2000 unknown
## 3 ZDB409 5000 unknown
## 4 ZDB429 10000 UC
## 5 ZDB446 15000 UC
## 6 ZDB458 20000 (C1,C2)
## 7 ZDB464* 20000 (C1,C2)
## 8 ZDB467 20000 (C1,C2)
## 9 ZDB477 25000 C1
## 10 ZDB483 25000 C3
## # … with 20 more rows
You can also refer to columns by name with quotation marks.
Ecoli_citrate[, "sample"]
## # A tibble: 30 × 1
## sample
## <chr>
## 1 REL606
## 2 REL1166A
## 3 ZDB409
## 4 ZDB429
## 5 ZDB446
## 6 ZDB458
## 7 ZDB464*
## 8 ZDB467
## 9 ZDB477
## 10 ZDB483
## # … with 20 more rows
Note that subsetting a tibble returns another tibble; in contrast, using $ to extract a variable returns a vector:
Ecoli_citrate$cit
## [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
## [8] "unknown" "unknown" "unknown" "unknown" "unknown" "minus" "minus"
## [15] "plus" "minus" "plus" "minus" "plus" "minus" "plus"
## [22] "plus" "minus" "plus" "plus" "minus" "plus" "minus"
## [29] "plus" "minus"
Ecoli_citrate[, "cit"]
## # A tibble: 30 × 1
## cit
## <chr>
## 1 unknown
## 2 unknown
## 3 unknown
## 4 unknown
## 5 unknown
## 6 unknown
## 7 unknown
## 8 unknown
## 9 unknown
## 10 unknown
## # … with 20 more rows
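The difference between the two results can be checked directly. The sketch below uses a small made-up tibble and also shows double brackets, which behave like $ when the column name is given as a string:

```r
library(tibble)

# Made-up values, for illustration only.
clones <- tibble(sample = c("A", "B", "C"),
                 cit = c("plus", "minus", "plus"))

col_tibble <- clones[, "cit"]   # single brackets keep the tibble structure
col_vector <- clones[["cit"]]   # double brackets extract the column itself

is_tibble(col_tibble)              # is the result still a tibble?
is.character(col_vector)           # is the result a plain character vector?
identical(col_vector, clones$cit)  # do [[ ]] and $ return the same thing?
```

With a base data frame, df[, "cit"] drops to a vector by default, which is one place where code written for data frames can behave differently on a tibble.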
We can use arrange() to re-order a data frame based on the values of a column. It also accepts multiple columns, and each can be sorted in ascending (the default) or descending order.
# ascending (the default)
arrange(Ecoli_citrate, genome_size)
## # A tibble: 30 × 7
## sample generation clade strain cit run genome_size
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 ZDB429 10000 UC REL606 unknown SRR098282 4.59
## 2 ZDB483 25000 C3 REL606 unknown SRR098288 4.59
## 3 CZB199 33000 C1 REL606 minus SRR098027 4.59
## 4 ZDB409 5000 unknown REL606 unknown SRR098281 4.6
## 5 ZDB83 34000 Cit+ REL606 minus SRR098034 4.6
## 6 ZDB467 20000 (C1,C2) REL606 unknown SRR098286 4.61
## 7 ZDB16 30000 C1 REL606 unknown SRR098031 4.61
## 8 ZDB30* 32000 C3 REL606 minus SRR098032 4.61
## 9 ZDB99 36000 C1 REL606 minus SRR098037 4.61
## 10 REL606 0 <NA> REL606 unknown <NA> 4.62
## # … with 20 more rows
#multiple columns: smallest genome size and largest generation
arrange(Ecoli_citrate, genome_size, desc(generation))
## # A tibble: 30 × 7
## sample generation clade strain cit run genome_size
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 CZB199 33000 C1 REL606 minus SRR098027 4.59
## 2 ZDB483 25000 C3 REL606 unknown SRR098288 4.59
## 3 ZDB429 10000 UC REL606 unknown SRR098282 4.59
## 4 ZDB83 34000 Cit+ REL606 minus SRR098034 4.6
## 5 ZDB409 5000 unknown REL606 unknown SRR098281 4.6
## 6 ZDB99 36000 C1 REL606 minus SRR098037 4.61
## 7 ZDB30* 32000 C3 REL606 minus SRR098032 4.61
## 8 ZDB16 30000 C1 REL606 unknown SRR098031 4.61
## 9 ZDB467 20000 (C1,C2) REL606 unknown SRR098286 4.61
## 10 REL10988 40000 C2 REL606 minus SRR098030 4.62
## # … with 20 more rows
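To sort from largest to smallest, wrap the column in desc(). A minimal sketch on a made-up tibble (the values and object names are for illustration only):

```r
library(dplyr)
library(tibble)

# Made-up values, for illustration only.
clones <- tibble(sample = c("A", "B", "C"),
                 genome_size = c(4.62, 4.59, 4.76))

ascending  <- arrange(clones, genome_size)        # default: smallest first
descending <- arrange(clones, desc(genome_size))  # largest first

ascending$sample
descending$sample
```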
We can save a tibble (or data frame) to a CSV file using readr’s write_csv() function. For example, to save a subset of the Ecoli_citrate data to a CSV file in the data_out folder:
Ecoli_citrate_sub <- Ecoli_citrate[25:30, 1:3] # rows 25 to 30, columns 1 to 3; note that slice() only works for rows, we'll see a way to select specific columns in the next lesson
Ecoli_citrate_sub
## # A tibble: 6 × 3
## sample generation clade
## <chr> <dbl> <chr>
## 1 ZDB96 36000 Cit+
## 2 ZDB99 36000 C1
## 3 ZDB107 38000 Cit+
## 4 ZDB111 38000 C2
## 5 REL10979 40000 Cit+
## 6 REL10988 40000 C2
write_csv(Ecoli_citrate_sub, here("data_out", "Ecoli_citrate_sub.csv")) # include the .csv extension so other programs recognise the file type
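A quick way to confirm that a file was written as expected is to read it back in. The sketch below uses a made-up tibble and a temporary file rather than the data_out folder, so it can be run anywhere:

```r
library(readr)
library(tibble)

# Made-up values, for illustration only.
clones <- tibble(sample = c("A", "B"), generation = c(0, 2000))

out_path <- tempfile(fileext = ".csv")  # a temporary location
write_csv(clones, out_path)

# Read the file back and compare each column with the original.
reread <- read_csv(out_path, show_col_types = FALSE)
identical(reread$sample, clones$sample)
all.equal(reread$generation, clones$generation)
```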
This lesson was created by Aleeza
Gerstein at the University of Manitoba. It is based largely on
material from The Carpentries.
The material is compiled from workshop materials located here and here.
The section on read_csv vs. read.csv is from Modern Dive Section 5.
Made available under the Creative Commons Attribution license.
The R syntax cheatsheet was developed by Amelia McNamara