Learning Objectives


Data types are important

The type of statistical analysis you will conduct depends critically on the type of data you have. In the time we have available it is not possible to go through every possible iteration of data and analysis. We also seek to link data visualization in tandem with statistics through exploratory data analysis or EDA, of our data. We will use EDA to get a sense of the distribution of our data, and see whether there are outliers or missing values. We will let the EDA inform us how to build our models.

The Five Number Summary

The five number summary gives a quick look at the features of numerical variables. It consists of the variables:

  • minimum
  • 1st quartile
  • median
  • 3rd quartile
  • maximum

QUANTILES: The pth percentile of a data set sorted from smallest to largest is the value such that p percent of the data are at or below this value. The quartiles are special percentiles; the 1st quartile is the 25th percentile, and the 3rd quartile is the 75th percentile. The median is also a quartile – it is the 50th percentile.

Within these five numbers is a lot of useful data!

  • the median gives a measure of the center of the data
  • the minimum and maximum give the range of the data
  • the 1st and 3rd quartiles give a sense of the spread of the data, especially when compared to the minimum, maximum, and median

For our first EDA we’re going to load a new dataset that contains the MIC50 information for 488 Candida albicans strains. This data comes from a paper that compared two different methods of measuring fluconazole resistance, broth microdilution tests (MIC) and disk diffusion assays (disk). The MIC is determined as the drug concentration where growth is inhibited by ~ 50% relative to the level of growth in medium with no drug. The measured drug concentrations are (0, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128). The disk diffusion measures resistance in millimeters as the diameter of the zone of inhibition on an agar plate seeded with fungal cells that has a single disk placed in the center that contains a set amount of fluconazole. For the sake of this workshop I added two toy categorical variables, “type” that indicates whether the strain was isolated from the skin or the blood and “gender” that specifies whether the strain was isolated from a male or female.

Load the data:

download.file("http://home.cc.umanitoba.ca/~gersteia/MBIO7040/Calb_resistance.csv", 
              here("data_in", "Calb_resistance.csv"))

Calb_R <- read_csv(here("data_in", "Calb_resistance.csv"))
## Parsed with column specification:
## cols(
##   strain = col_character(),
##   type = col_character(),
##   gender = col_character(),
##   `MIC (ug/mL)` = col_double(),
##   `disk (mm)` = col_double()
## )
glimpse(Calb_R)
## Observations: 501
## Variables: 5
## $ strain        <chr> "s498", "s499", "s465", "s480", "s481", "s466", "s482",…
## $ type          <chr> "blood", "blood", "blood", "blood", "blood", "blood", "…
## $ gender        <chr> "f", "f", "f", "f", "f", "f", "m", "m", "f", "f", "f", …
## $ `MIC (ug/mL)` <dbl> 128, 128, 32, 64, 64, 32, 64, 64, 64, 64, 64, 64, 32, 3…
## $ `disk (mm)`   <dbl> 6, 6, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14, 1…

First I’m going to rename the two numerical variables to remove the unit information. It’s useful but unnecessary. You can see that there are quotes around the column names and this isn’t ideal. Since we’re keeping a script of all the steps are are taking, the information won’t really be lost anyways. I’m going to break one of my rules and overwrite the original tibble without changing the name because I like this name and I haven’t actually done anything to alter the data itself, just how we access it.

Calb_R <- Calb_R %>% 
  rename(MIC = `MIC (ug/mL)`, disk = `disk (mm)`) 
glimpse(Calb_R)
## Observations: 501
## Variables: 5
## $ strain <chr> "s498", "s499", "s465", "s480", "s481", "s466", "s482", "s483"…
## $ type   <chr> "blood", "blood", "blood", "blood", "blood", "blood", "blood",…
## $ gender <chr> "f", "f", "f", "f", "f", "f", "m", "m", "f", "f", "f", "f", "f…
## $ MIC    <dbl> 128, 128, 32, 64, 64, 32, 64, 64, 64, 64, 64, 64, 32, 32, 32, …
## $ disk   <dbl> 6, 6, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, …

We’ll going to use a new function skim() from the skimr package to look at the summary information for the numerical variables of interest.

#install.packages("skimr")
library(skimr)

#select the columns with our data of interest.
Calb_R %>% 
  select(MIC, disk) %>% 
  skim()
Data summary
Name Piped data
Number of rows 501
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
MIC 9 0.98 5.08 16.84 0.12 0.25 0.5 1 128 ▇▁▁▁▁
disk 6 0.99 32.56 7.25 6.00 30.00 34.0 37 50 ▁▁▅▇▁

Skim() gives us the five numbers, as well as the sample size (n), the mean, and the standard deviation sd. We also get a quick snapshot of the histogram of values.

In this case the two variables give us quite different results. For the disk parameter, the mean and median (p50) are very similar, while for the MIC results they are quite different. We say that the mean is not a robust statistic since it is not resistant to extreme observations. In contrast, the median is robust to extreme (or outlier) observations. This summary (as well as the histogram) tells us that there is a skew in the data for MIC the data is not symmetrical around the median. It also gives us information about differences between these two types of experiments that seek to capture the same data, and perhaps suggests that outliers in MIC experiments are more common than in disk diffusion experiments.

We’re going to explore these concepts more through the second phase of EDA, plotting the data. To do that we’ll use ggplot, which was actually one of the first developed components of the tidyverse framework that we’ve been using is this workshop. ggplot was actually first developed as a PhD thesis chapter!

Plotting with ggplot2

ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.

ggplot2 functions like data in the ‘long’ format, i.e., a column for every dimension, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2.

ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

To build a ggplot, we will use the following basic template that can be used for different types of plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>

The basic idea is that a statistical graphic is a mapping from data to aesthetic attributes (such as colour, shape, and size) of geometric objects (such as points, lines, and bars).

Use the ggplot() function and bind the plot to a specific data frame using the data argument ggplot(data = Calb_R)

Define a mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc. ggplot(data = Calb_R, mapping = aes(x = MIC, y = disk)). In the context of this workshop we won’t have too much time to talk about the philosophy of visualization, but I want to point you to a wonderful new book on this that is available free online. It’s actually a really accessible book that could probably be read in it’s entirety in a day. (“Fundamental of Data Visualization by Clause Wilke”)[https://serialmentor.com/dataviz/].

So we’ve told R the data we want to plot, but not how we want it plotted. For that we need to add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:
- geom_histogram() for (unsurprisingly) histograms
- geom_boxplot() for, well, boxplots!
- geom_bar() for bar plots
- geom_point() for scatter plots, dot plots, etc.
- geom_line() for trend lines, time series, etc.

Principles of effective display

SOURCE: (Whitlock & Schluter, The Analysis of Biological Data)[http://whitlockschluter.zoology.ubc.ca/]

We will follow these metrics to create and evaluate figures:
1. Show the data
2. Make patterns in the data easy to see
3. Represent magnitudes honestly
4. Draw graphical elements clearly, minimizing clutter

Plotting the distribution of one varible

Let’s first look at a couple very simple plots to see what our data looks like. I like to refer to this step as touching the data or interacting with the data. I think one of the best things about R is that graphs are really cheap - we can make quick and dirty graphs really easily and (hopefully) gain a lot of insight about the data just by plotting alone.

First we’ll explore the spread of the different variables. We’ll start with a histogram of the disk diffusion resistance.

One important thing to note: we’re going to iteratively build up the code using the + sign, not the %>%. The plus sign is the same as what happens in the console when you have multiple lines of code that are to be run together from a single funcion or operation, which is what we’re doing here. That’s different than what the pipe does, which is chaining functions together.

ggplot(data = Calb_R, mapping = aes(disk)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite values (stat_bin).

We got two messages from R. As with the red font yesterday, messages don’t necessarily mean errors. The first point is about stat_bin. To plot a histogram with the geometric object geom_histogram, that requires a statistical transformation. Some plot types (such as scatterplots) do not require transformations, each point is plotted at x and y coordinates equal to the original value. The data for other plots, such as boxplots, histograms, prediction lines etc. need to be transformed, and usually those geoms have a default statistic that can be changed via stat_* arguments.

If we use the help menu to look up geom_histogram (or type ?geom_histogram directly into the console), the description tells us what is happening: “Visualise the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram()) display the counts with bars; frequency polygons (geom_freqpoly()) display the counts with lines.” If you scroll down and look through the arguments, you can see all the different things we can change that will alter how the data is presented. For histograms, there’s actually two different ways to specify bins. The first is through bins, which is the number of bins to break the data into. This is the default, and the message here has told us that the default was 30, and it prompts us to purposefully decide if this is what we want or not. The other option is to specify a binwidth, which can be either a numeric value or a function that calculates width from x. These help files often contain a lot of useful information, and I recommend you spend some time reading them.

Let’s play a bit with changing the the number of bins (which the help file tells us we should always do). I’ve done this type of experiments before, and I can use my expert knowledge to make this decision. In this case I suggest choosing a binwidth of 1, because 1mm is a natural unit for this type of data.

ggplot(data = Calb_R, mapping = aes(disk)) +
  geom_histogram(binwidth = 1)
## Warning: Removed 6 rows containing non-finite values (stat_bin).

The second prompt tells us that 6 rows containing “non-finite values” were removed. This typically means NAs. Let’s double check that:

Calb_R %>% 
  filter(is.na(disk))
## # A tibble: 6 x 5
##   strain type  gender   MIC  disk
##   <chr>  <chr> <chr>  <dbl> <dbl>
## 1 s317   blood f      NA       NA
## 2 s318   blood f      NA       NA
## 3 s344   blood f       1       NA
## 4 s28    oral  m       0.12    NA
## 5 s146   skin  m       0.25    NA
## 6 s147   skin  m       0.25    NA

We can get rid of that using the same nomenclature as before na.rm = TRUE within our ggplot function.

Note that we actually have options for how we specify this exact same figure. The aes can actually be set in either the original ggplot call or in the geom_* call. If you set them in the original ggplot they are inherited by any other geom’s that build on top of that (unless you override it by specifying specific aes in the geom). If you specify the aes in a geom, it will only be used in that geom. e.g.,

ggplot(data = Calb_R) +
  geom_histogram(mapping = aes(disk), binwidth =1, na.rm = TRUE)

We can also specify binwidth using a different layer in the ggplot call, stat_bin instead of geom_histogram.

ggplot(data = Calb_R) +
#  geom_histogram(mapping = aes(disk)) +
  stat_bin(mapping = aes(disk), binwidth = 1)
## Warning: Removed 6 rows containing non-finite values (stat_bin).

The difference between geom and stat layers is somewhat nuanced and I refer you to this stackoverflow (post)[https://stackoverflow.com/questions/38775661/what-is-the-difference-between-geoms-and-stats-in-ggplot2]. “geoms stand for”geometric objects." These are the core elements that you see on the plot, object like points, lines, areas, curves. stats stand for “statistical transformations.” These objects summarize the data in different ways such as counting observations, creating a loess line that best fits the data, or adding a confidence interval to the loess line. … these distinctions are somewhat conceptual as the majority of geoms undergo some statistical transformation prior to being plotted." Let’s leave it at that (but have a read of the full post if you’re interested).

There are a few easy things we can do to change this plot. We can easily change the theme (I personally don’t like the grey background) and the axis labels. In reality if you can think of something to change, it can be done. The nice thing is that once we develop the code to specify the format we like, we can just copy it as we go forward.

ggplot(data = Calb_R, mapping = aes(disk)) +
  geom_histogram(binwidth = 1) +    # change the number of bins
  theme_bw() +   # change the theme
  labs(x = "disk diffusion zone of inhibition (mm)" , y = "Number of strains")
## Warning: Removed 6 rows containing non-finite values (stat_bin).

EXERCISE

By eye, does this variable look normally distributed? Why or why not?

We can statistically test for normality using the shapiro.test():

shapiro.test(Calb_R$disk)
## 
##  Shapiro-Wilk normality test
## 
## data:  Calb_R$disk
## W = 0.89958, p-value < 2.2e-16

Notice that we switched syntaxes above, to base R (i.e., we used df$variable). The majority, if not all of the common statistical tests require this syntax, because they were developed prior to the tidyverse set of commands. Although there are some workarounds, we’re going to go back and forth a little bit from these two syntaxes as required.

Plotting with multiple types of data

We can also easily overlay information from a categorical variable in this data set. In this case we’ll add the strain type information using the fill command within the aesthetics specification.

ggplot(data = Calb_R, mapping = aes(disk, fill = type)) +
  geom_histogram(binwidth = 1, na.rm = TRUE) +    # change the number of bins
  theme_bw() +   # change the theme
  labs(x = "disk diffusion zone of inhibition (mm)" , y = "Number of strains")

We could accomplish a related visualization by separating this into three different panels using the function facet_wrap:

ggplot(data = Calb_R, mapping = aes(disk)) +
  geom_histogram(binwidth = 1, na.rm = TRUE) + 
  theme_bw() +
  labs(x = "disk diffusion zone of inhibition (mm)" , y = "Number of strains") +
  facet_wrap(~type)

We can see here quite clearly that we have different amounts of data in each of these categories. Sometimes instead of plotting counts data it can also be helpful to visualize this as the density of the data. In this case we use a different (but related) geom, geom_freqpoly, which also requires us to specify that the stat we want is density.

ggplot(data = Calb_R, mapping = aes(disk)) +
  geom_freqpoly(mapping = aes(disk, stat(density)), binwidth = 1, na.rm = TRUE) +
  theme_bw() +
  labs(x = "disk diffusion zone of inhibition (mm)" , y = "Density of strains") +
  facet_wrap(~type)

I don’t actually think we gain that much in this case (and actually lose the intuitive ability to see that there are different numbers of strains in the different groups) but I wanted to demonstrate how easy it is to do fairly complicated data transformations for visualization with very minor adjustments to the code.

And as always, we can use the hand help menus to see what options there are. Note that we can actully calling up help directly on a component of the ggplot function e.g., ?geom_freqpoly

EXERCISE:

Create a new figure with two panels, one panel for each gender that colours the different strain types differently. Use the help menus or “google fu” to figure out how to add a title to the figure (whatever you like). Since dividing up the data into six categories makes it a bit sparse, play around with either bin or binwidth to find something that looks more pleasing. If you get that done, can you figure out how to stack the different figures/facets vertically instead of horizonatally? (remember that google is your friend). Can you figure out how to change bar colours? What about the name of the legend? What about the legend labels?

Statistical tests

It’s probably time to do a statistical test, though having fairly extensively plotted (and interacted) with this data, we can probably intuit what the test will say.

First let’s ignore the different types and test whether disk diffusion MIC differed between strains isolated from men and women. To do that we’re going to use a two sample t-test. The assumptions of a t-test are normally-distributed data, which we already saw is not met. So instead we will do the non-parametric wilcoxon test (or Mann-Whitney U test) that compares data ranks instead.

wilcox.test(disk ~ gender, data = Calb_R)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  disk by gender
## W = 30727, p-value = 0.928
## alternative hypothesis: true location shift is not equal to 0

We can similarly ignore gender and test the effect of type. In this case we have more than two groups, so we’re going to use an ANOVA test. In reality a t-test is the same thing as an ANOVA, just with two groups instead of more than two groups. The non-parametric equivalent of an ANOVA is the Kruskal-Wallis test, but the Kruskal-Wallis test assumes that sampled populations have identical shape and dispersion. We can see from our figure that this is not met. In this case it is actually better to use an ANOVA test. Although the ANOVA is parametric, it is considered a robust test against the normality assumption, that is non-normal data has only a small effect on the Type I error rate.

aov(disk ~ type, data = Calb_R)
## Call:
##    aov(formula = disk ~ type, data = Calb_R)
## 
## Terms:
##                      type Residuals
## Sum of Squares   1106.458 24891.534
## Deg. of Freedom         2       492
## 
## Residual standard error: 7.112844
## Estimated effects may be unbalanced
## 6 observations deleted due to missingness

When you run the ANOVA we don’t actually get all the information we need out of just the model aov call. We need to wrap that in a second function to pull out additional information:

anova_test <- aov(disk ~ type, data = Calb_R) 
summary(anova_test)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## type          2   1106   553.2   10.94 2.26e-05 ***
## Residuals   492  24892    50.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 6 observations deleted due to missingness

There’s the stats. We’re going to install one more package, the broom package, that will clean up this output.

library(broom)
tidy(anova_test)
## # A tibble: 2 x 6
##   term         df  sumsq meansq statistic    p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
## 1 type          2  1106.  553.       10.9  0.0000226
## 2 Residuals   492 24892.   50.6      NA   NA
glance(anova_test)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1    0.0426        0.0387  7.11      10.9 2.26e-5     3 -1672. 3352. 3369.
## # … with 2 more variables: deviance <dbl>, df.residual <int>

Using the broom functions tidy and glance we can easily access the parameter values:

anova_test_tidy <- tidy(anova_test)
anova_test_tidy
## # A tibble: 2 x 6
##   term         df  sumsq meansq statistic    p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
## 1 type          2  1106.  553.       10.9  0.0000226
## 2 Residuals   492 24892.   50.6      NA   NA
anova_test_tidy$p.value[1]
## [1] 2.256929e-05

If we want to know which groups are different from each other, we can use the post-hoc (or “after the event”) tukey test:

TukeyHSD(anova_test)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = disk ~ type, data = Calb_R)
## 
## $type
##                 diff       lwr       upr     p adj
## oral-blood  3.965051  1.886492 6.0436086 0.0000271
## skin-blood  2.130479  0.458185 3.8027737 0.0080993
## skin-oral  -1.834571 -3.923196 0.2540533 0.0982914

In reality, we actually know that there are two different categorical variables here that could influence the disk diffusion resistance, and we can include them both in one test, a two-way ANOVA.

full_anova_test <- aov(disk ~ type*gender, data = Calb_R) 
tidy(full_anova_test)
## # A tibble: 4 x 6
##   term           df   sumsq meansq statistic    p.value
##   <chr>       <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
## 1 type            2  1106.   553.     10.9    0.0000228
## 2 gender          1    13.6   13.6     0.269  0.604    
## 3 type:gender     2   116.    58.0     1.15   0.319    
## 4 Residuals     489 24762.    50.6    NA     NA

This hopefully tells us what we already intuited from the EDA: type but not gender influences disk resistance. If this is the story we want to tell, the histograms may not be the best method to do that. Let’s change to another geom method, geom_boxplot.

ggplot(data = Calb_R) +
  geom_boxplot(mapping = aes(x = type, y = disk), na.rm = TRUE) + 
  theme_bw()

I’m going to save this base figure as an object and then build it up:

p <- ggplot(data = Calb_R) +
  geom_boxplot(mapping = aes(x = type, y = disk), na.rm = TRUE) + 
  theme_bw()
p

I’d like to change the axes labels (this is almost always true) and add the tukey test results using the annotate layer.

p +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation") +
  annotate("text",x = c(1, 2, 3), y = c(0, 0, 0), label = c("a", "b", "b"))

And just as a quick aside, I don’t actually like boxplots all that much, since I think they aren’t that intuitive. A related alternative to the boxplot called a violin plot has been developed and is easily accessible through the ggplot framework. Again, in the interest of time I’ll refer you elsewhere for a more in-depth discussion. To start, wikipedia is a good place to start.

p2 <- ggplot(data = Calb_R) +
  geom_violin(mapping = aes(x = type, y = disk), na.rm = TRUE) + 
  theme_bw() +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation") +
  annotate("text",x = c(1, 2, 3), y = c(0, 0, 0), label = c("a", "b", "b"))

p2

We can also easily add different summary statistics directly on top, like the mean or the median.

p2 +
  stat_summary(fun.y=mean, geom="point", shape=23, size=2)

Aha. Here is a case where it matters where we specify the aesthetics. Because they are defined in the geom and not the ggplot function, stat_summary doesn’t have access to the data.

p3 <- ggplot(data = Calb_R, mapping = aes(x = type, y = disk)) +
  geom_violin(na.rm = TRUE) + 
  theme_bw() +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation") +
  annotate("text",x = c(1, 2, 3), y = c(0, 0, 0), label = c("a", "b", "b")) +
  stat_summary(fun.y=mean, geom="point", shape=23, size=3, na.rm = TRUE, fill = "red")

p3

One more variant on this plot, the sina plot. We can literaly just swap geom_sina() in for geom_violin() (once we download and load the ggforce package).

library(ggforce)
p4 <- ggplot(data = Calb_R, mapping = aes(x = type, y = disk)) +
  geom_sina() + 
  theme_bw() +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , y = "site of strain isolation") +
  annotate("text",x = c(1, 2, 3), y = c(0, 0, 0), label = c("a", "b", "b")) +
  stat_summary(fun.y=mean, geom="point", shape=23, size=3, na.rm = TRUE, fill = "red")

p4
## Warning: Removed 6 rows containing non-finite values (stat_sina).

We’re now going to look at our second continuous variable, resistance from MIC. We’ll start the same way, with a histogram:

pM <- ggplot(data = Calb_R, mapping = aes(MIC)) +
  geom_histogram(na.rm = TRUE) 

pM
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Well. That doesn’t look very good, does it? Again, we can use our expert knowledge to think about what we have here. Broth microdiluation is measured on drug levels that begin low (in this case 0.012 uM drug) and then increase by doubling up to 128. This tells me how to plot the data: in a way that mimics the experiment, i.e., on a \(log_2\) scale, which we specify in the scale_x_continuous layer.

pM +
  scale_x_continuous(trans="log2")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

That’s better. And thinking about this further, what I probably want is a bar plot, not a histogram. That’s because a histogram, by definition, requires continuous data. Yet this data is not continuous, there are only a set number of possible values because of the way the data was recorded [side note: when my lab collects this data we actually fit a logistic equation through the optical density readings taken at each drug concentration, which gives me a continuous measurement.]

So given that, let’s switch to geom_bar():

ggplot(data = Calb_R, mapping = aes(MIC)) +
  geom_bar(na.rm = TRUE) + 
  scale_x_continuous(trans="log2", breaks = unique(Calb_R$MIC)) +
  labs(x = expression(MIC[50]), y = "Number of strains")

What if I want to add the type information into this plot? We have to do this differently than we did before, because this is a different type of plot with a different type of data. What I think I’d like to do is present a bar plot with different coloured bars for each type beside each other. To do that I actually need a new tibble, one that calculates how many strains fall into each type x MIC group. We can turn back to our friends group_by and summarize to calculate this information for us.

Calb_R_sum <- Calb_R %>% 
  group_by(type, MIC) %>% 
  summarize(num_strains = n())

Calb_R_sum
## # A tibble: 23 x 3
## # Groups:   type [3]
##    type    MIC num_strains
##    <chr> <dbl>       <int>
##  1 blood  0.12           4
##  2 blood  0.25          53
##  3 blood  0.5           29
##  4 blood  1             22
##  5 blood  2             24
##  6 blood  4             25
##  7 blood  8              4
##  8 blood 16              2
##  9 blood 32             15
## 10 blood 64             17
## # … with 13 more rows
pSum <- ggplot(Calb_R_sum, mapping = aes(MIC, num_strains, fill= type)) +
    geom_bar(stat = "identity", position=position_dodge(preserve = "single"), na.rm = TRUE, width=0.75) +
    scale_x_continuous(trans="log2", breaks = unique(Calb_R$MIC)) +
    labs(x = expression(MIC[50]), y = "Number of strains") +
    scale_fill_discrete(name = "Site of strain isolation")

pSum

EXERCISE

  1. Using the help menu, explore the options for geom_bar. What does stat = "identity" do? What about position_dodge? How do you change the width of the bar to be equal to the total width of all elements at a position?

  2. Conduct a statistical test to determine whether type or gender (or their interaction) has a significant effect on MIC.

Two numerical variables

Now we’re going to tie all of this together. The goal of the original paper was to see whether the same resistance level is found when you use disk assays compared to broth microdilution. We’ll use geom_pointfor this task and I’m going to use all of the things we worked out above to specify the layers.

gP <- ggplot(Calb_R, aes(MIC, disk)) +
  scale_x_continuous(trans="log2", breaks = unique(Calb_R$MIC)) +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , x = expression(MIC[50])) +
  theme_bw()

gP +
  geom_point(na.rm =TRUE)

That looks like a pretty solid relationship. We can add a second geom to this figure to add a line of fit:

gP +
  geom_point(na.rm =TRUE) +
  geom_smooth(method = "lm", na.rm=TRUE)

There’s a few more improvements I’d like to make. For one, the points are all on top of each other, which obscures the data. We can use a technique called jitter to add a bit of noise to the data on the x-axis for visualization purposes (it will not influence our linear model fit). We can do that through geom_jitter which takes the place to geom_point. Another good technique that serves a similar purpose is to change the opacity of the points (which is refereed to as alpha).

gP +
  geom_smooth(method = "lm", na.rm=TRUE) +
  geom_jitter(alpha = 0.5, color = "tomato", width = 0.2)
## Warning: Removed 14 rows containing missing values (geom_point).

And finally a statistical test to cap it all off, we’ll look for a correlation between these two variables. We’ll again turn to our non-parametric statistics, and specify that we want Spearman’s rho test (in this case it’s the same function as the parametric test, we just specify the method we want to use).

cor_test <- cor.test(Calb_R$MIC, Calb_R$disk, method = "spearman")
## Warning in cor.test.default(Calb_R$MIC, Calb_R$disk, method = "spearman"):
## Cannot compute exact p-value with ties
cor_test
## 
##  Spearman's rank correlation rho
## 
## data:  Calb_R$MIC and Calb_R$disk
## S = 30335647, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5661987

And we can again use the broom package to make the output more tidy and easier to access.

tidy(cor_test)
## # A tibble: 1 x 5
##   estimate statistic  p.value method                          alternative
##      <dbl>     <dbl>    <dbl> <chr>                           <chr>      
## 1   -0.566 30335647. 1.03e-42 Spearman's rank correlation rho two.sided

EXERCISE:

The way we plotted the scatterplot ignored the other variables in our dataset that we know are important. How can you improve this figure to add in that additional information? There are many different options, see how many you can think of (and implement).

Saving graphs

Through the GUI you can navigate over to the plot area and use the export button.

If you prefer a programmatic (non-clicky) way of doing things, I usually wrap my figures in a pdf command like this:

pdf(here("figures_out", "190411Calb_R_MIC_DDA.pdf"), width=4, height=4)
gP +
  geom_smooth(method = "lm", na.rm=TRUE) +
  geom_jitter(alpha = 0.5, color = "tomato", width = 0.2)
## Warning: Removed 13 rows containing missing values (geom_point).
dev.off()
## quartz_off_screen 
##                 2
system("open figures_out/190411Calb_R_MIC_DDA.pdf")

Attributes

This lesson was created by Aleeza Gerstein at the University of Manitoba based partially on material from: The Carpentries and Five number summary.

Made available under the Creative Commons Attribution license. License.