Learning Objectives
SOURCE: (Whitlock & Schluter, The Analysis of Biological Data)[http://whitlockschluter.zoology.ubc.ca/]
We will follow these metrics to create and evaluate figures:
1. Show the data
2. Make patterns in the data easy to see
3. Represent magnitudes honestly
4. Draw graphical elements clearly, minimizing clutter
In the context of this workshop we won’t have too much time to talk about the philosophy of visualization, but I want to point you to a wonderful new book on this that is available free online. It’s actually a really accessible book that could probably be read in it’s entirety in a day. (“Fundamental of Data Visualization by Clause Wilke”)[https://serialmentor.com/dataviz/].
The type of statistical analysis you will conduct depends critically on the type of data you have. In the time we have available it is not possible to go through every possible iteration of data and analysis. We also seek to link data visualization in tandem with statistics through exploratory data analysis or EDA, of our data. We will use EDA to get a sense of the distribution of our data, and see whether there are outliers or missing values. We will let the EDA inform us how to build our models.
For our first EDA we’re going to load a new dataset that contains the MIC50 information for 488 Candida albicans strains. This data comes from a paper that compared two different methods of measuring fluconazole resistance, broth microdilution tests (MIC) and disk diffusion assays (disk). The MIC is determined as the drug concentration where growth is inhibited by ~ 50% relative to the level of growth in medium with no drug. The measured drug concentrations are (0, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128). The disk diffusion measures resistance in millimeters as the diameter of the zone of inhibition on an agar plate seeded with fungal cells that has a single disk placed in the center that contains a set amount of fluconazole. For the sake of this workshop I added two toy categorical variables, “type” that indicates whether the strain was isolated from the skin or the blood and “sex” that specifies whether the strain was isolated from a male or female.
Load the data:
download.file("http://home.cc.umanitoba.ca/~gersteia/MBIO7040/Calb_resistance.csv",
here("data_in", "Calb_resistance.csv"))
Calb_R <- read_csv(here("data_in", "Calb_resistance.csv"))
## Rows: 501 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): strain, site, sex
## dbl (2): MIC (ug/mL), disk (mm)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(Calb_R)
## Rows: 501
## Columns: 5
## $ strain <chr> "s498", "s499", "s465", "s480", "s481", "s466", "s482", …
## $ site <chr> "blood", "blood", "blood", "blood", "blood", "blood", "b…
## $ sex <chr> "f", "f", "f", "f", "f", "f", "m", "m", "f", "f", "f", "…
## $ `MIC (ug/mL)` <dbl> 128, 128, 32, 64, 64, 32, 64, 64, 64, 64, 64, 64, 32, 32…
## $ `disk (mm)` <dbl> 6, 6, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14…
First I’m going to rename the two numerical variables to remove the unit information. It’s useful but unnecessary. You can see that there are quotes around the column names and this isn’t ideal. Since we’re keeping a script of all the steps are are taking, the information won’t really be lost anyways. I’m going to break one of my rules and overwrite the original tibble without changing the name because I like this name and I haven’t actually done anything to alter the data itself, just how we access it.
Calb_R <- Calb_R %>%
rename(MIC = `MIC (ug/mL)`, disk = `disk (mm)`)
glimpse(Calb_R)
## Rows: 501
## Columns: 5
## $ strain <chr> "s498", "s499", "s465", "s480", "s481", "s466", "s482", "s483",…
## $ site <chr> "blood", "blood", "blood", "blood", "blood", "blood", "blood", …
## $ sex <chr> "f", "f", "f", "f", "f", "f", "m", "m", "f", "f", "f", "f", "f"…
## $ MIC <dbl> 128, 128, 32, 64, 64, 32, 64, 64, 64, 64, 64, 64, 32, 32, 32, 6…
## $ disk <dbl> 6, 6, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 1…
ggplot2
ggplot2
is a plotting package that makes it simple to
create complex plots from data in a data frame. ggplot
was
actually first developed as a PhD thesis chapter! It provides a more
programmatic interface for specifying what variables to plot, how they
are displayed, and general visual properties. Therefore, we only need
minimal changes if the underlying data change or if we decide to change
from a bar plot to a scatterplot. This helps in creating publication
quality plots with minimal amounts of adjustments and tweaking.
ggplot2
functions like data in the ‘long’ format, i.e.,
a column for every dimension, and a row for every observation.
Well-structured data will save you lots of time when making figures with
ggplot2
.
ggplot
graphics are built step by step by adding new
elements. Adding layers in this fashion allows for extensive flexibility
and customization of plots.
To build a ggplot, we will use the following basic template that can be used for different types of plots:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>
The basic idea is that a statistical graphic is a mapping from data to aesthetic attributes (such as colour, shape, and size) of geometric objects (such as points, lines, and bars).
Use the ggplot() function and bind the plot to a specific data frame
using the data argument ggplot(data = Calb_R)
Define a mapping (using the aesthetic (aes) function), by selecting
the variables to be plotted and specifying how to present them in the
graph, e.g. as x/y positions or characteristics such as size, shape,
color, etc.
ggplot(data = Calb_R, mapping = aes(x = MIC, y = disk))
.
So we’ve told R the data we want to plot, but not how we want it
plotted. For that we need to add ‘geoms’ – graphical representations of
the data in the plot (points, lines, bars). ggplot2
offers
many different geoms; we will use some common ones today,
including:
- geom_histogram()
for (unsurprisingly) histograms
- geom_boxplot()
for, well, boxplots!
- geom_bar()
for bar plots
- geom_point()
for scatter plots, dot plots, etc.
- geom_line()
for trend lines, time series, etc.
Let’s first look at a couple very simple plots to see what our data looks like. I like to refer to this step as touching the data or interacting with the data. I think one of the best things about R is that graphs are really cheap - we can make quick and dirty graphs really easily and (hopefully) gain a lot of insight about the data just by plotting alone.
First we’ll explore the spread of the different variables. We’ll start with a histogram of the disk diffusion resistance.
One important thing to note: we’re going to iteratively build up the
code using the +
sign, not the %>%
. The
plus sign is the same as what happens in the console when you have
multiple lines of code that are to be run together from a single funcion
or operation, which is what we’re doing here. That’s different than what
the pipe does, which is chaining functions together.
ggplot(data = Calb_R, mapping = aes(disk)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
We got two messages from R. As with the red font yesterday, messages
don’t necessarily mean errors. The first point is about
stat_bin
. To plot a histogram with the geometric object
geom_histogram
, that requires a statistical transformation.
Some plot types (such as scatterplots) do not require transformations,
each point is plotted at x and y coordinates equal to the original
value. The data for other plots, such as boxplots, histograms,
prediction lines etc. need to be transformed, and usually those geoms
have a default statistic that can be changed via stat_*
arguments.
If we use the help menu to look up geom_histogram
(or
type ?geom_histogram
directly into the console), the
description tells us what is happening: “Visualise the distribution of a
single continuous variable by dividing the x axis into bins and counting
the number of observations in each bin. Histograms (geom_histogram())
display the counts with bars; frequency polygons (geom_freqpoly())
display the counts with lines.” If you scroll down and look through the
arguments, you can see all the different things we can change that will
alter how the data is presented. For histograms, there’s actually two
different ways to specify bins. The first is through bins
,
which is the number of bins to break the data into. This is the default,
and the message here has told us that the default was 30, and it prompts
us to purposefully decide if this is what we want or not. The other
option is to specify a binwidth
, which can be either a
numeric value or a function that calculates width from x. These help
files often contain a lot of useful information, and I recommend you
spend some time reading them.
Let’s play a bit with changing the the number of bins (which the help file tells us we should always do). I’ve done this type of experiments before, and I can use my expert knowledge to make this decision. In this case I suggest choosing a binwidth of 1, because 1mm is a natural unit for this type of data.
ggplot(data = Calb_R, mapping = aes(disk)) +
geom_histogram(binwidth = 1)
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
The second prompt tells us that 6 rows containing “non-finite values”
were removed. This typically means NA
s. Let’s double check
that:
Calb_R %>%
filter(is.na(disk))
## # A tibble: 6 × 5
## strain site sex MIC disk
## <chr> <chr> <chr> <dbl> <dbl>
## 1 s317 blood f NA NA
## 2 s318 blood f NA NA
## 3 s344 blood f 1 NA
## 4 s28 oral m 0.12 NA
## 5 s146 skin m 0.25 NA
## 6 s147 skin m 0.25 NA
We can get rid of that using the same nomenclature as before
na.rm = TRUE
within our ggplot function.
Note that we actually have options for how we specify this exact same
figure. The aes
can actually be set in either the original
ggplot
call or in the geom_*
call. If you set
them in the original ggplot
they are inherited by any other
geom’s that build on top of that (unless you override it by specifying
specific aes
in the geom
). If you specify the
aes
in a geom
, it will only be used in that
geom
. e.g.,
ggplot(data = Calb_R) +
geom_histogram(mapping = aes(disk), binwidth =1, na.rm = TRUE)
We can also specify binwidth
using a different layer in
the ggplot
call, stat_bin
instead of
geom_histogram
.
ggplot(data = Calb_R) +
# geom_histogram(mapping = aes(disk)) +
stat_bin(mapping = aes(disk), binwidth = 1)
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
The difference between geom and stat layers is somewhat nuanced and I refer you to this stackoverflow (post)[https://stackoverflow.com/questions/38775661/what-is-the-difference-between-geoms-and-stats-in-ggplot2]. “geoms stand for”geometric objects.” These are the core elements that you see on the plot, object like points, lines, areas, curves. stats stand for “statistical transformations.” These objects summarize the data in different ways such as counting observations, creating a loess line that best fits the data, or adding a confidence interval to the loess line. … these distinctions are somewhat conceptual as the majority of geoms undergo some statistical transformation prior to being plotted.” Let’s leave it at that (but have a read of the full post if you’re interested).
There are a few easy things we can do to change this plot. We can easily change the theme (I personally don’t like the grey background) and the axis labels. In reality if you can think of something to change, it can be done. The nice thing is that once we develop the code to specify the format we like, we can just copy it as we go forward.
ggplot(data = Calb_R, mapping = aes(disk)) +
geom_histogram(binwidth = 1) + # change the number of bins
theme_bw() + # change the theme
labs(x = "disk diffusion zone of inhibition (mm)" , y = "Number of strains")
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
We can also easily overlay information from a categorical variable in
this data set. In this case we’ll add the strain type
information using the fill
command within the aesthetics
specification.
ggplot(data = Calb_R, mapping = aes(disk, fill = site)) +
geom_histogram(binwidth = 1, na.rm = TRUE) + # change the number of bins
theme_bw() + # change the theme
labs(x = "disk diffusion zone of inhibition (mm)" , y = "Number of strains")
We could accomplish a related visualization by separating this into
three different panels using the function facet_wrap
:
ggplot(data = Calb_R, mapping = aes(disk)) +
geom_histogram(binwidth = 1, na.rm = TRUE) +
theme_bw() +
labs(x = "disk diffusion zone of inhibition (mm)" , y = "Number of strains") +
facet_wrap(~site)
We can see here quite clearly that we have different amounts of data
in each of these categories. Sometimes instead of plotting counts data
it can also be helpful to visualize this as the density of the data. In
this case we use a different (but related) geom,
geom_freqpoly
, which also requires us to specify that the
stat we want is density.
ggplot(data = Calb_R, mapping = aes(disk)) +
geom_freqpoly(mapping = aes(disk, stat(density)), binwidth = 1, na.rm = TRUE) +
theme_bw() +
labs(x = "disk diffusion zone of inhibition (mm)" , y = "Density of strains") +
facet_wrap(~site)
## Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
I don’t actually think we gain that much in this case (and actually lose the intuitive ability to see that there are different numbers of strains in the different groups) but I wanted to demonstrate how easy it is to do fairly complicated data transformations for visualization with very minor adjustments to the code.
And as always, we can use the hand help menus to see what options there are. Note that we can actully calling up help directly on a component of the ggplot function e.g., ?geom_freqpoly
Create a new figure to plot the disk
variable with two
panels, one for each sex
that colours the different strain
site
differently. Use the help menus or “google fu” to
figure out how to add a title to the figure (add whatever you like).
Since dividing up the data into six categories makes it a bit sparse,
play around with either bin
or binwidth
to
find something that looks more pleasing.
Extra things to do: If you get that done, can you figure out how to stack the different figures/facets vertically instead of horizonatally? (remember that google is your friend). Can you figure out how to change bar colours? Can you change the name of the legend? What about the legend labels - can you change it from f & m to female and male? (don’t recode the dataset, just change it in the figure) Change the panel titles from “f” and “m” to “female” and “male” (don’t recode the actual dataset, just change the title labels) Look up different themes and find one you like Change the default colours that are used
Visually it looks like type, but not sex, influences disk resistance.
If this is the story we want to tell, the histograms may not be the best
method to do that. Let’s change to another geom method,
geom_boxplot
.
ggplot(data = Calb_R) +
geom_boxplot(mapping = aes(x = site, y = disk), na.rm = TRUE) +
theme_bw()
I’m going to save this base figure as an object and then build it up:
p <- ggplot(data = Calb_R) +
geom_boxplot(mapping = aes(x = site, y = disk), na.rm = TRUE) +
theme_bw()
p
I’d like to change the axes labels (this is almost always true) and
add the tukey test results using the annotate
layer.
p +
scale_y_reverse(limits = c(50, 0)) +
labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation")
And just as a quick aside, I don’t actually like boxplots all that much, since I think they aren’t that intuitive. A related alternative to the boxplot called a violin plot has been developed and is easily accessible through the ggplot framework. Again, in the interest of time I’ll refer you elsewhere for a more in-depth discussion. To start, wikipedia is a good place to start.
p2 <- ggplot(data = Calb_R) +
geom_violin(mapping = aes(x = site, y = disk), na.rm = TRUE) +
theme_bw() +
scale_y_reverse(limits = c(50, 0)) +
labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation")
p2
We can also easily add different summary statistics directly on top, like the mean or the median.
p2 +
stat_summary(fun="mean",
geom="point",
shape=23,
size=2)
Aha. Here is a case where it matters where we specify the aesthetics. Because they are defined in the geom and not the ggplot function, stat_summary doesn’t have access to the data.
p3 <- ggplot(data = Calb_R, mapping = aes(x = site, y = disk)) +
geom_violin(na.rm = TRUE) +
theme_bw() +
scale_y_reverse(limits = c(50, 0)) +
labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation") + stat_summary(fun="mean", geom="point", shape=23, size=3, na.rm = TRUE, fill = "red")
p3
One more variant on this plot, the sina plot. We can literaly just
swap geom_sina()
in for geom_violin()
(once we
download and load the ggforce
package).
library(ggforce)
p4 <- ggplot(data = Calb_R, mapping = aes(x = site, y = disk)) +
geom_sina(na.rm=TRUE) +
theme_bw() +
scale_y_reverse(limits = c(50, 0)) +
labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation") +
stat_summary(fun="mean", geom="point", shape=23, size=3, na.rm = TRUE, fill = "red")
p4
We could also add a horizontal line to indicate the mean (note that we need to chang a number of arguments:
p4 <- ggplot(data = Calb_R, mapping = aes(x = site, y = disk)) +
geom_sina(na.rm=TRUE) +
theme_bw() +
scale_y_reverse(limits = c(50, 0)) +
labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation") +
stat_summary(fun="mean", geom="crossbar", linewidth=0.5, na.rm = TRUE, colour = "red")
p4
We’re now going to look at our second continuous variable, resistance from MIC. We’ll start the same way, with a histogram:
pM <- ggplot(data = Calb_R, mapping = aes(MIC)) +
geom_histogram(na.rm = TRUE)
pM
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Well. That doesn’t look very good, does it? Again, we can use our expert knowledge to think about what we have here. Broth microdiluation is measured on drug levels that begin low (in this case 0.012 uM drug) and then increase by doubling up to 128. This tells me how to plot the data: in a way that mimics the experiment, i.e., on a \(log_2\) scale, which we specify in the scale_x_continuous layer.
pM +
scale_x_continuous(trans="log2")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
That’s better. And thinking about this further, what I probably want is a bar plot, not a histogram. That’s because a histogram, by definition, requires continuous data. Yet this data is not continuous, there are only a set number of possible values because of the way the data was recorded [side note: when my lab collects this data we actually fit a logistic equation through the optical density readings taken at each drug concentration, which gives me a continuous measurement.]
So given that, let’s switch to geom_bar()
:
ggplot(data = Calb_R, mapping = aes(MIC)) +
geom_bar(na.rm = TRUE) +
scale_x_continuous(trans="log2", breaks = unique(Calb_R$MIC)) +
labs(x = expression(MIC[50]), y = "Number of strains")
What if I want to add the type information into this plot? We have to
do this differently than we did before, because this is a different type
of plot with a different type of data. What I think I’d like to do is
present a bar plot with different coloured bars for each type beside
each other. To do that I actually need a new tibble, one that calculates
how many strains fall into each type x MIC group. We can turn back to
our friends group_by
and summarize
to
calculate this information for us.
Calb_R_sum <- Calb_R %>%
group_by(site, MIC) %>%
summarize(num_strains = n())
## `summarise()` has grouped output by 'site'. You can override using the
## `.groups` argument.
Calb_R_sum
## # A tibble: 23 × 3
## # Groups: site [3]
## site MIC num_strains
## <chr> <dbl> <int>
## 1 blood 0.12 4
## 2 blood 0.25 53
## 3 blood 0.5 29
## 4 blood 1 22
## 5 blood 2 24
## 6 blood 4 25
## 7 blood 8 4
## 8 blood 16 2
## 9 blood 32 15
## 10 blood 64 17
## # … with 13 more rows
pSum <- ggplot(Calb_R_sum, mapping = aes(MIC, num_strains, fill= site)) +
geom_bar(stat = "identity", position=position_dodge(preserve = "single"), na.rm = TRUE, width=0.75) +
scale_x_continuous(trans="log2", breaks = unique(Calb_R$MIC)) +
labs(x = expression(MIC[50]), y = "Number of strains") +
scale_fill_discrete(name = "Site of strain isolation")
pSum
Using the help menu, explore the options for geom_bar
.
What does stat = "identity"
do? What about
position_dodge
? How do you change the width of the bar to
be equal to the total width of all elements at a position?
Now we’re going to tie all of this together. The goal of the original
paper was to see whether the same resistance level is found when you use
disk assays compared to broth microdilution. We’ll use
geom_point
for this task and I’m going to use all of the
things we worked out above to specify the layers.
gP <- ggplot(Calb_R, aes(MIC, disk)) +
scale_x_continuous(trans="log2", breaks = unique(Calb_R$MIC)) +
scale_y_reverse(limits = c(50, 0)) +
labs(y = "disk diffusion zone of inhibition (mm)" , x = expression(MIC[50])) +
theme_bw()
gP +
geom_point(na.rm =TRUE)
That looks like a pretty solid relationship. We can add a second geom to this figure to add a line of fit:
gP +
geom_point(na.rm =TRUE) +
geom_smooth(method = "lm", na.rm=TRUE)
## `geom_smooth()` using formula = 'y ~ x'
There’s a few more improvements I’d like to make. For one, the points
are all on top of each other, which obscures the data. We can use a
technique called jitter to add a bit of noise to the data on
the x-axis for visualization purposes (it will not influence our linear
model fit). We can do that through geom_jitter
which takes
the place to geom_point
. Another good technique that serves
a similar purpose is to change the opacity of the points (which is
refereed to as alpha).
gP +
geom_smooth(method = "lm", na.rm=TRUE) +
geom_jitter(alpha = 0.5, color = "tomato", width = 0.2)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 15 rows containing missing values (`geom_point()`).
The way we plotted the scatterplot ignored the other variables in our dataset that we know are important. How can you improve this figure to add in that additional information? There are many different options, see how many you can think of (and implement).
Through the GUI you can navigate over to the plot area and use the export button.
If you prefer a programmatic (non-clicky) way of doing things, I usually wrap my figures in a pdf command like this:
pdf(here("figures_out", "190411Calb_R_MIC_DDA.pdf"), width=4, height=4)
gP +
geom_smooth(method = "lm", na.rm=TRUE) +
geom_jitter(alpha = 0.5, color = "tomato", width = 0.2)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 14 rows containing missing values (`geom_point()`).
dev.off()
## quartz_off_screen
## 2
system("open figures_out/190411Calb_R_MIC_DDA.pdf")
This lesson was created by Aleeza Gerstein at the University of Manitoba based partially on material from: The Carpentries and Five number summary.
Made available under the Creative Commons Attribution license. License.