Learning Objectives


Principles of effective display

SOURCE: (Whitlock & Schluter, The Analysis of Biological Data)[http://whitlockschluter.zoology.ubc.ca/]

We will follow these metrics to create and evaluate figures:
1. Show the data
2. Make patterns in the data easy to see
3. Represent magnitudes honestly
4. Draw graphical elements clearly, minimizing clutter

In the context of this workshop we won’t have too much time to talk about the philosophy of visualization, but I want to point you to a wonderful new book on this that is available free online. It’s actually a really accessible book that could probably be read in it’s entirety in a day. (“Fundamental of Data Visualization by Clause Wilke”)[https://serialmentor.com/dataviz/].

Data types are important

The type of statistical analysis you will conduct depends critically on the type of data you have. In the time we have available it is not possible to go through every possible iteration of data and analysis. We also seek to link data visualization in tandem with statistics through exploratory data analysis or EDA, of our data. We will use EDA to get a sense of the distribution of our data, and see whether there are outliers or missing values. We will let the EDA inform us how to build our models.

For our first EDA we’re going to load a new dataset that contains the MIC50 information for 488 Candida albicans strains. This data comes from a paper that compared two different methods of measuring fluconazole resistance, broth microdilution tests (MIC) and disk diffusion assays (disk). The MIC is determined as the drug concentration where growth is inhibited by ~ 50% relative to the level of growth in medium with no drug. The measured drug concentrations are (0, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128). The disk diffusion measures resistance in millimeters as the diameter of the zone of inhibition on an agar plate seeded with fungal cells that has a single disk placed in the center that contains a set amount of fluconazole. For the sake of this workshop I added two toy categorical variables, “type” that indicates whether the strain was isolated from the skin or the blood and “sex” that specifies whether the strain was isolated from a male or female.

Load the data:

download.file("http://home.cc.umanitoba.ca/~gersteia/MBIO7040/Calb_resistance.csv", 
              here("data_in", "Calb_resistance.csv"))

Calb_R <- read_csv(here("data_in", "Calb_resistance.csv"))
## Rows: 501 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): strain, site, sex
## dbl (2): MIC (ug/mL), disk (mm)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(Calb_R)
## Rows: 501
## Columns: 5
## $ strain        <chr> "s498", "s499", "s465", "s480", "s481", "s466", "s482", …
## $ site          <chr> "blood", "blood", "blood", "blood", "blood", "blood", "b…
## $ sex           <chr> "f", "f", "f", "f", "f", "f", "m", "m", "f", "f", "f", "…
## $ `MIC (ug/mL)` <dbl> 128, 128, 32, 64, 64, 32, 64, 64, 64, 64, 64, 64, 32, 32…
## $ `disk (mm)`   <dbl> 6, 6, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14…

First I’m going to rename the two numerical variables to remove the unit information. It’s useful but unnecessary. You can see that there are quotes around the column names and this isn’t ideal. Since we’re keeping a script of all the steps are are taking, the information won’t really be lost anyways. I’m going to break one of my rules and overwrite the original tibble without changing the name because I like this name and I haven’t actually done anything to alter the data itself, just how we access it.

Calb_R <- Calb_R %>% 
  rename(MIC = `MIC (ug/mL)`, disk = `disk (mm)`) 
glimpse(Calb_R)
## Rows: 501
## Columns: 5
## $ strain <chr> "s498", "s499", "s465", "s480", "s481", "s466", "s482", "s483",…
## $ site   <chr> "blood", "blood", "blood", "blood", "blood", "blood", "blood", …
## $ sex    <chr> "f", "f", "f", "f", "f", "f", "m", "m", "f", "f", "f", "f", "f"…
## $ MIC    <dbl> 128, 128, 32, 64, 64, 32, 64, 64, 64, 64, 64, 64, 32, 32, 32, 6…
## $ disk   <dbl> 6, 6, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 1…

Plotting with ggplot2

ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame. ggplot was actually first developed as a PhD thesis chapter! It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.

ggplot2 functions like data in the ‘long’ format, i.e., a column for every dimension, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2.

ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

To build a ggplot, we will use the following basic template that can be used for different types of plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>

The basic idea is that a statistical graphic is a mapping from data to aesthetic attributes (such as colour, shape, and size) of geometric objects (such as points, lines, and bars).

Use the ggplot() function and bind the plot to a specific data frame using the data argument ggplot(data = Calb_R)

Define a mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc. ggplot(data = Calb_R, mapping = aes(x = MIC, y = disk)).

So we’ve told R the data we want to plot, but not how we want it plotted. For that we need to add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:
- geom_histogram() for (unsurprisingly) histograms
- geom_boxplot() for, well, boxplots!
- geom_bar() for bar plots
- geom_point() for scatter plots, dot plots, etc.
- geom_line() for trend lines, time series, etc.

Plotting the distribution of one varible

Let’s first look at a couple very simple plots to see what our data looks like. I like to refer to this step as touching the data or interacting with the data. I think one of the best things about R is that graphs are really cheap - we can make quick and dirty graphs really easily and (hopefully) gain a lot of insight about the data just by plotting alone.

First we’ll explore the spread of the different variables. We’ll start with a histogram of the disk diffusion resistance.

One important thing to note: we’re going to iteratively build up the code using the + sign, not the %>%. The plus sign is the same as what happens in the console when you have multiple lines of code that are to be run together from a single funcion or operation, which is what we’re doing here. That’s different than what the pipe does, which is chaining functions together.

ggplot(data = Calb_R, mapping = aes(disk)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).

We got two messages from R. As with the red font yesterday, messages don’t necessarily mean errors. The first point is about stat_bin. To plot a histogram with the geometric object geom_histogram, that requires a statistical transformation. Some plot types (such as scatterplots) do not require transformations, each point is plotted at x and y coordinates equal to the original value. The data for other plots, such as boxplots, histograms, prediction lines etc. need to be transformed, and usually those geoms have a default statistic that can be changed via stat_* arguments.

If we use the help menu to look up geom_histogram (or type ?geom_histogram directly into the console), the description tells us what is happening: “Visualise the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram()) display the counts with bars; frequency polygons (geom_freqpoly()) display the counts with lines.” If you scroll down and look through the arguments, you can see all the different things we can change that will alter how the data is presented. For histograms, there’s actually two different ways to specify bins. The first is through bins, which is the number of bins to break the data into. This is the default, and the message here has told us that the default was 30, and it prompts us to purposefully decide if this is what we want or not. The other option is to specify a binwidth, which can be either a numeric value or a function that calculates width from x. These help files often contain a lot of useful information, and I recommend you spend some time reading them.

Let’s play a bit with changing the the number of bins (which the help file tells us we should always do). I’ve done this type of experiments before, and I can use my expert knowledge to make this decision. In this case I suggest choosing a binwidth of 1, because 1mm is a natural unit for this type of data.

ggplot(data = Calb_R, mapping = aes(disk)) +
  geom_histogram(binwidth = 1)
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).

The second prompt tells us that 6 rows containing “non-finite values” were removed. This typically means NAs. Let’s double check that:

Calb_R %>% 
  filter(is.na(disk))
## # A tibble: 6 × 5
##   strain site  sex     MIC  disk
##   <chr>  <chr> <chr> <dbl> <dbl>
## 1 s317   blood f     NA       NA
## 2 s318   blood f     NA       NA
## 3 s344   blood f      1       NA
## 4 s28    oral  m      0.12    NA
## 5 s146   skin  m      0.25    NA
## 6 s147   skin  m      0.25    NA

We can get rid of that using the same nomenclature as before na.rm = TRUE within our ggplot function.

Note that we actually have options for how we specify this exact same figure. The aes can actually be set in either the original ggplot call or in the geom_* call. If you set them in the original ggplot they are inherited by any other geom’s that build on top of that (unless you override it by specifying specific aes in the geom). If you specify the aes in a geom, it will only be used in that geom. e.g.,

ggplot(data = Calb_R) +
  geom_histogram(mapping = aes(disk), binwidth =1, na.rm = TRUE)

We can also specify binwidth using a different layer in the ggplot call, stat_bin instead of geom_histogram.

ggplot(data = Calb_R) +
#  geom_histogram(mapping = aes(disk)) +
  stat_bin(mapping = aes(disk), binwidth = 1)
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).

The difference between geom and stat layers is somewhat nuanced and I refer you to this stackoverflow (post)[https://stackoverflow.com/questions/38775661/what-is-the-difference-between-geoms-and-stats-in-ggplot2]. “geoms stand for”geometric objects.” These are the core elements that you see on the plot, object like points, lines, areas, curves. stats stand for “statistical transformations.” These objects summarize the data in different ways such as counting observations, creating a loess line that best fits the data, or adding a confidence interval to the loess line. … these distinctions are somewhat conceptual as the majority of geoms undergo some statistical transformation prior to being plotted.” Let’s leave it at that (but have a read of the full post if you’re interested).

There are a few easy things we can do to change this plot. We can easily change the theme (I personally don’t like the grey background) and the axis labels. In reality if you can think of something to change, it can be done. The nice thing is that once we develop the code to specify the format we like, we can just copy it as we go forward.

ggplot(data = Calb_R, mapping = aes(disk)) +
  geom_histogram(binwidth = 1) +    # change the number of bins
  theme_bw() +   # change the theme
  labs(x = "disk diffusion zone of inhibition (mm)" , y = "Number of strains")
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).

Plotting with multiple types of data

We can also easily overlay information from a categorical variable in this data set. In this case we’ll add the strain type information using the fill command within the aesthetics specification.

ggplot(data = Calb_R, mapping = aes(disk, fill = site)) +
  geom_histogram(binwidth = 1, na.rm = TRUE) +    # change the number of bins
  theme_bw() +   # change the theme
  labs(x = "disk diffusion zone of inhibition (mm)" , y = "Number of strains")

We could accomplish a related visualization by separating this into three different panels using the function facet_wrap:

ggplot(data = Calb_R, mapping = aes(disk)) +
  geom_histogram(binwidth = 1, na.rm = TRUE) + 
  theme_bw() +
  labs(x = "disk diffusion zone of inhibition (mm)" , y = "Number of strains") +
  facet_wrap(~site)

We can see here quite clearly that we have different amounts of data in each of these categories. Sometimes instead of plotting counts data it can also be helpful to visualize this as the density of the data. In this case we use a different (but related) geom, geom_freqpoly, which also requires us to specify that the stat we want is density.

ggplot(data = Calb_R, mapping = aes(disk)) +
  geom_freqpoly(mapping = aes(disk, stat(density)), binwidth = 1, na.rm = TRUE) +
  theme_bw() +
  labs(x = "disk diffusion zone of inhibition (mm)" , y = "Density of strains") +
  facet_wrap(~site)
## Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.

I don’t actually think we gain that much in this case (and actually lose the intuitive ability to see that there are different numbers of strains in the different groups) but I wanted to demonstrate how easy it is to do fairly complicated data transformations for visualization with very minor adjustments to the code.

And as always, we can use the hand help menus to see what options there are. Note that we can actully calling up help directly on a component of the ggplot function e.g., ?geom_freqpoly

EXERCISE:

Create a new figure to plot the disk variable with two panels, one for each sex that colours the different strain site differently. Use the help menus or “google fu” to figure out how to add a title to the figure (add whatever you like). Since dividing up the data into six categories makes it a bit sparse, play around with either bin or binwidth to find something that looks more pleasing.

Extra things to do: If you get that done, can you figure out how to stack the different figures/facets vertically instead of horizonatally? (remember that google is your friend). Can you figure out how to change bar colours? Can you change the name of the legend? What about the legend labels - can you change it from f & m to female and male? (don’t recode the dataset, just change it in the figure) Change the panel titles from “f” and “m” to “female” and “male” (don’t recode the actual dataset, just change the title labels) Look up different themes and find one you like Change the default colours that are used

Visually it looks like type, but not sex, influences disk resistance. If this is the story we want to tell, the histograms may not be the best method to do that. Let’s change to another geom method, geom_boxplot.

ggplot(data = Calb_R) +
  geom_boxplot(mapping = aes(x = site, y = disk), na.rm = TRUE) + 
  theme_bw()

I’m going to save this base figure as an object and then build it up:

p <- ggplot(data = Calb_R) +
  geom_boxplot(mapping = aes(x = site, y = disk), na.rm = TRUE) + 
  theme_bw()
p

I’d like to change the axes labels (this is almost always true) and add the tukey test results using the annotate layer.

p +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation")

And just as a quick aside, I don’t actually like boxplots all that much, since I think they aren’t that intuitive. A related alternative to the boxplot called a violin plot has been developed and is easily accessible through the ggplot framework. Again, in the interest of time I’ll refer you elsewhere for a more in-depth discussion. To start, wikipedia is a good place to start.

p2 <- ggplot(data = Calb_R) +
  geom_violin(mapping = aes(x = site, y = disk), na.rm = TRUE) + 
  theme_bw() +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation") 
p2

We can also easily add different summary statistics directly on top, like the mean or the median.

p2 +
  stat_summary(fun="mean", 
               geom="point", 
               shape=23, 
               size=2)

Aha. Here is a case where it matters where we specify the aesthetics. Because they are defined in the geom and not the ggplot function, stat_summary doesn’t have access to the data.

p3 <- ggplot(data = Calb_R, mapping = aes(x = site, y = disk)) +
  geom_violin(na.rm = TRUE) + 
  theme_bw() +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation") +  stat_summary(fun="mean", geom="point", shape=23, size=3, na.rm = TRUE, fill = "red")

p3

One more variant on this plot, the sina plot. We can literaly just swap geom_sina() in for geom_violin() (once we download and load the ggforce package).

library(ggforce)
p4 <- ggplot(data = Calb_R, mapping = aes(x = site, y = disk)) +
  geom_sina(na.rm=TRUE) + 
  theme_bw() +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation") +
  stat_summary(fun="mean", geom="point", shape=23, size=3, na.rm = TRUE, fill = "red")

p4

We could also add a horizontal line to indicate the mean (note that we need to chang a number of arguments:

p4 <- ggplot(data = Calb_R, mapping = aes(x = site, y = disk)) +
  geom_sina(na.rm=TRUE) + 
  theme_bw() +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation") +
  stat_summary(fun="mean", geom="crossbar", linewidth=0.5, na.rm = TRUE, colour = "red")

p4

We’re now going to look at our second continuous variable, resistance from MIC. We’ll start the same way, with a histogram:

pM <- ggplot(data = Calb_R, mapping = aes(MIC)) +
  geom_histogram(na.rm = TRUE) 

pM
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Well. That doesn’t look very good, does it? Again, we can use our expert knowledge to think about what we have here. Broth microdiluation is measured on drug levels that begin low (in this case 0.012 uM drug) and then increase by doubling up to 128. This tells me how to plot the data: in a way that mimics the experiment, i.e., on a \(log_2\) scale, which we specify in the scale_x_continuous layer.

pM +
  scale_x_continuous(trans="log2")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

That’s better. And thinking about this further, what I probably want is a bar plot, not a histogram. That’s because a histogram, by definition, requires continuous data. Yet this data is not continuous, there are only a set number of possible values because of the way the data was recorded [side note: when my lab collects this data we actually fit a logistic equation through the optical density readings taken at each drug concentration, which gives me a continuous measurement.]

So given that, let’s switch to geom_bar():

ggplot(data = Calb_R, mapping = aes(MIC)) +
  geom_bar(na.rm = TRUE) + 
  scale_x_continuous(trans="log2", breaks = unique(Calb_R$MIC)) +
  labs(x = expression(MIC[50]), y = "Number of strains")

What if I want to add the type information into this plot? We have to do this differently than we did before, because this is a different type of plot with a different type of data. What I think I’d like to do is present a bar plot with different coloured bars for each type beside each other. To do that I actually need a new tibble, one that calculates how many strains fall into each type x MIC group. We can turn back to our friends group_by and summarize to calculate this information for us.

Calb_R_sum <- Calb_R %>% 
  group_by(site, MIC) %>% 
  summarize(num_strains = n())
## `summarise()` has grouped output by 'site'. You can override using the
## `.groups` argument.
Calb_R_sum
## # A tibble: 23 × 3
## # Groups:   site [3]
##    site    MIC num_strains
##    <chr> <dbl>       <int>
##  1 blood  0.12           4
##  2 blood  0.25          53
##  3 blood  0.5           29
##  4 blood  1             22
##  5 blood  2             24
##  6 blood  4             25
##  7 blood  8              4
##  8 blood 16              2
##  9 blood 32             15
## 10 blood 64             17
## # … with 13 more rows
pSum <- ggplot(Calb_R_sum, mapping = aes(MIC, num_strains, fill= site)) +
    geom_bar(stat = "identity", position=position_dodge(preserve = "single"), na.rm = TRUE, width=0.75) +
    scale_x_continuous(trans="log2", breaks = unique(Calb_R$MIC)) +
    labs(x = expression(MIC[50]), y = "Number of strains") +
    scale_fill_discrete(name = "Site of strain isolation")

pSum

EXERCISE

Using the help menu, explore the options for geom_bar. What does stat = "identity" do? What about position_dodge? How do you change the width of the bar to be equal to the total width of all elements at a position?

Two numerical variables

Now we’re going to tie all of this together. The goal of the original paper was to see whether the same resistance level is found when you use disk assays compared to broth microdilution. We’ll use geom_pointfor this task and I’m going to use all of the things we worked out above to specify the layers.

gP <- ggplot(Calb_R, aes(MIC, disk)) +
  scale_x_continuous(trans="log2", breaks = unique(Calb_R$MIC)) +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , x = expression(MIC[50])) +
  theme_bw()

gP +
  geom_point(na.rm =TRUE)

That looks like a pretty solid relationship. We can add a second geom to this figure to add a line of fit:

gP +
  geom_point(na.rm =TRUE) +
  geom_smooth(method = "lm", na.rm=TRUE)
## `geom_smooth()` using formula = 'y ~ x'

There’s a few more improvements I’d like to make. For one, the points are all on top of each other, which obscures the data. We can use a technique called jitter to add a bit of noise to the data on the x-axis for visualization purposes (it will not influence our linear model fit). We can do that through geom_jitter which takes the place to geom_point. Another good technique that serves a similar purpose is to change the opacity of the points (which is refereed to as alpha).

gP +
  geom_smooth(method = "lm", na.rm=TRUE) +
  geom_jitter(alpha = 0.5, color = "tomato", width = 0.2)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 15 rows containing missing values (`geom_point()`).

EXERCISE:

The way we plotted the scatterplot ignored the other variables in our dataset that we know are important. How can you improve this figure to add in that additional information? There are many different options, see how many you can think of (and implement).

Saving graphs

Through the GUI you can navigate over to the plot area and use the export button.

If you prefer a programmatic (non-clicky) way of doing things, I usually wrap my figures in a pdf command like this:

pdf(here("figures_out", "190411Calb_R_MIC_DDA.pdf"), width=4, height=4)
gP +
  geom_smooth(method = "lm", na.rm=TRUE) +
  geom_jitter(alpha = 0.5, color = "tomato", width = 0.2)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 14 rows containing missing values (`geom_point()`).
dev.off()
## quartz_off_screen 
##                 2
system("open figures_out/190411Calb_R_MIC_DDA.pdf")

Attributes

This lesson was created by Aleeza Gerstein at the University of Manitoba based partially on material from: The Carpentries and Five number summary.

Made available under the Creative Commons Attribution license. License.