Learning Objectives

Describe the difference between populations and samples
Distinguish between data that follows a normal distribution and data that deviates in modality, skew, or kurtosis
Define the major steps of hypothesis testing
Define type I and type II error
Describe the relationship between alpha, beta and power

Introduction to Statistics

Statistics is the science of collecting, organising, summarising, analyzing, and interpreting data. Descriptive statistics refers to summarising data with numbers or pictures while inferential statistics is about making conclusions or decisions based on data. Inferential statistics uses data from a sample to make inferences about a population.

For example, if we are interested in the number of bacteria in a test tube, it is not possible to count every single individual cell. Instead, we could take a subset of the population and count it with a hemocytometer.

EXERCISE

Think about the experiment of taking samples from the population to counting cells with a hemocytometer. What experimental details would help us to get a more accurate measurement from our samples of the true population size?

The goal of statistics is to use imperfect information (our data) in order to infer facts, make decisions, and make predictions about the world.

A statistic is a number or value measured within some particular context.

It is essential to understand the context: * Which data were collected? * How and why were the data collected? * On which individuals or entities were the data collected? * What questions do we hope to answer from the data?

We use statistics to (hopefully) tease apart causation from correlation using frameworks (briefly!): 1. Hypothesis testing - do the data at hand sufficiently support a particular hypothesis?

Estimation - not just is there an effect, how large is the effect size (point estimate) and how confident are you in it?
Causal inference - ascribe causal relationships to associations between variables

As biostatisticians, we have very big goals:

What is the motivating biological question?
What experiments can I do/have been done and/or data can I collect to address this question?
How can I design this experiment and/or collect these data in a way that will map cleanly onto both a statistical model and my motivating idea?
Do results support an interesting conclusion?
What are the shortcomings of statistical models and causal frameworks in the analysis?
How do I best communicate my results (including estimates, visualizations, conclusions, and caveats)?

To a statistician the TRUTH is a population – a collection of all individuals of a circumscribed type (e.g. all microbiologists in Winnipeg), or a process for generating these individuals (e.g., the expectation from the process of flipping a fair coin). The population is characterized by its parameters (e.g. mean, variance).

It is almost always impractical or impossible to study every individual in a population. As such, we often deal with a sample – a subset of a population. We characterize a sample by taking an estimate of population parameters.

So a major goal of a statistical analysis is how to go from conclusions about a sample which we can measure and observe, to the population(s) we care about. In doing so we must worry about random differences between a sample and a population (known as Sampling Error, as well as any systematic issues in our sampling or measuring procedure which will cause estimates to reliably differ from the population (known as Sampling Bias).

A statistic is a number or value measured within some particular context.

It is essential to understand the context:
* Which data were collected?
* How and why were the data collected?
* On which individuals or entities were the data collected?
* What questions do we hope to answer from the data?

A common use of statistics (especially biostatistics) is hypothesis testing, where we use estimates from samples to ask if data come from populations with different parameters. This typically relies on statistical models.

To build and evaluate models, we need to consider the type of data and the process that generates these data. Variables are things which differ among individuals (or sampling units) of our study. So, for example, species, genotype, temperature, or the drug concentration in the media are all variables.

We often need to distinguish between explanatory variables, which we think underlie or are associated with the biological process of interest, from response variables, the outcome we aim to understand. This distinction helps us build and consider our statistical model and relate the results to our biological motivation.

Parametric vs. Non-Parametric Tests and the Normal Distribution

In the literal meaning of the terms, a parametric statistical test is one that makes assumptions about the parameters (defining properties) of the population distribution(s) from which one’s data are drawn, while a non-parametric test is one that makes no such assumptions. In this strict sense, “non-parametric” is essentially a null category, since virtually all statistical tests assume one thing or another about the properties of the source population(s).

For practical purposes, you can think of “parametric” as referring to tests that assume the underlying source population(s) to be normally distributed; they generally also assume that one’s measures derive from an equal-interval scale. And you can think of “non-parametric” as referring to tests that do not make on these particular assumptions.

If data is perfectly normally distributed, the two sides of the curve are the exact mirror of each other and the three measures of central tendancy [mean (average, $\mu$), median (middle number) and mode (most commonly observed value)] are all exactly the same in the middle of the distribution.

[Source: wikipedia]

Another important consideration is the variability (or amount of spread) in the data. You can see above that different normal distributions have different amount of spread, shown above with $\sigma^2$ which is shorthand for variance (the square of the standard deviation). Variance is calculated by finding the difference between every data point and the mean, squaring them, and taking the average of those numbers. The squares weigh outliers more heavily than points that are close to the mean and prevents values above the mean from neutralizing those below. Standard deviation (the square-root of variance) is used more often, because it is in the same unit of measurement as the original data. If you remember the words Z-Score, that tells us how many standard deviations a specific data point lies above or below the mean.

There are three other aspects of the data to consider with regards to the normal distribution.

Modality: Is there a single peak in the distribution or multiple peaks?
Skewness: Is the distribution symmetrical?
Kurtosis: How much of the data is in the middle of the distribution vs. the tails? It is a measure of how many outliers are in the data.

For todays purposes you can think about non-parametric statistics as ranked versions of the corresponding parametric tests when the assumptions of normality of variance are not met. We will discuss this further as we proceed below.
SOURCE: Concepts and Applications of Inferential Statistics, Towards Data Science blog post and blog post

Different types of data

In general when we are analyzing data what we are tying to do is define the relationship between:

the outcome (or dependent) variable, y and
an explanatory/predictor (or independent) variable, x (also sometimes called a covariate).

The specific type of statistics required depends on the type of data, i.e., whether you have numerical (quantitative) or categorical data.

Categorical data represent characteristics and can be thought of as ways to label the data. They can be broken down into two classes: nominal and ordinal. Nominal data has no quantitative value and have no inherent order (e.g., growth or no growth of a population in a given drug concentration). By contrast, ordinal data represents discrete and ordered units (e.g., level of growth of a mutant compared to wildtype such as +1, +2, etc.).

There are also two categories of numerical data. It can be discrete, if the data can only take on certain values. This type of data can be counted but not measured (e.g., the number of heads in 100 coin flips). By contrast, continuous data can be measured (e.g., CFU counts, growth rate).

The basics of hypothesis testing

SOURCE: Modern Dive, Chapter 10

In a hypothesis test, we use data from a sample to help us decide between two competing hypotheses about a population. We make these hypotheses more concrete by specifying them in terms of at least one population parameter of interest. We refer to the competing claims about the population as the null hypothesis, denoted by $H_0$, and the alternative (or research) hypothesis, denoted by $H_a$. The roles of these two hypotheses are NOT interchangeable.

The claim for which we seek significant evidence is assigned to the alternative hypothesis. The alternative is usually what the experimenter or researcher wants to establish or find evidence for. Usually, the null hypothesis is a claim that there really is “no effect” or “no difference.” In many cases, the null hypothesis represents the status quo or that nothing interesting is happening. We assess the strength of evidence by assuming the null hypothesis is true and determining how unlikely it would be to see sample results/statistics as extreme (or more extreme) as those in the original sample.

Hypothesis testing brings about many weird and incorrect notions in the scientific community and society at large. One reason for this is that statistics has traditionally been thought of as this magic box of algorithms and procedures to get to results and this has been readily apparent if you do a Google search of “flowchart statistics hypothesis tests.” There are so many different complex ways to determine which test is appropriate.

You’ll see that we don’t need to rely on these complicated series of assumptions and procedures to conduct a hypothesis test any longer. These methods were introduced in a time when computers weren’t powerful. Your cellphone (in 2016) has more power than the computers that sent NASA astronauts to the moon after all.

We can actually break down ALL hypothesis tests into the following framework given by Allen Downey here:

From Allen’s blog post “There is still only one test”

Given a dataset, you compute a test statistic that measures the size of the apparent effect. For example, if you are describing a difference between two groups, the test statistic might be the absolute difference in means. I’ll call the test statistic from the observed data 𝛿*.
Next, you define a null hypothesis, which is a model of the world under the assumption that the effect is not real; for example, if you think there might be a difference between two groups, the null hypothesis would assume that there is no difference.
Your model of the null hypothesis should be stochastic; that is, capable of generating random datasets similar to the original dataset.
Now, the goal of classical hypothesis testing is to compute a p-value, which is the probability of seeing an effect as big as $\delta^*$ under the null hypothesis. You can estimate the p-value by using your model of the null hypothesis to generate many simulated datasets. For each simulated dataset, compute the same test statistic you used on the actual data.
Finally, count the fraction of times the test statistic from simulated data exceeds 𝛿*. This fraction approximates the p-value. If it’s sufficiently small, you can conclude that the apparent effect is unlikely to be due to chance (if you don’t believe that sentence, please read this).

That’s it. All hypothesis tests fit into this framework. The reason there are so many names for so many supposedly different tests is that each name corresponds to

A test statistic,
A model of a null hypothesis, and usually,
An analytic method that computes or approximates the p-value.

To break this down a different way,

“Brandvain Chapter 16: Hypothesis testing”

Our goal in null hypothesis significance testing is to see if results are easily explained by sampling error. Let’s work though a concrete example:

So, say we did an experiment: we gave the Moderna Covid vaccine to 15,000 people and a placebo to 15,000 people. This experimental design is meant to

Imagine if the population that got the Covid vaccine, or it did not. Calculate parameters of interest (e.g. the probability of contracting Covid, or the frequency of severe Covid among those who caught Covid), or the frequency of severe reactions etc.. in the vaccinated and unvaccinated population. Compare these parameters across populations with and without the placebo. Let’s look at the estimates from the data!!

Vaccine group: 11 cases of Covid 0 severe cases. Placebo group: 185 cases of Covid 30 severe cases. So did the vaccine work? There are certainly fewer Covid cases in the vaccine group.

But these are estimates, NOT parameters. We didn’t look at populations, rather these results came from a process of sampling – we sampled from a population. So, these results represent all the things that make samples estimates differ from population parameters, as well as true differences between populations (if there were any). So, before beginning a vaccination campaign we want to know if results are easily explained by something other than a real effect.

What leads samples to deviate from a population?

Sampling bias Nonindependent sampling Sampling error

Our goal in null hypotheses significance testing is to see if results are easily explained by sampling error.

There are six major steps to hypothesis testing:

Step 1: Determine level of significance

Step 2: State the hypotheses

Step 3: State the decision rule

Step 4: Calculate test statistic

Step 5: Find the critical value

Step 6: Write a conclusion

Criminal trial analogy

We can think of these hypothesistesting stepa in the same context as a criminal trial in which a choice between two contradictory claims must be made:

The accuser of the crime must be judged either guilty or not guilty.
The individual on trial is initially presumed not guilty.
Only STRONG EVIDENCE to the contrary causes the not guilty claim to be rejected in favor of a guilty verdict.
The phrase “beyond a reasonable doubt” is often used to set the cutoff value for when enough evidence has been given to convict.
Theoretically, we should never say “The person is innocent.” but instead “There is not sufficient evidence to show that the person is guilty.”

Now let’s compare that to how we look at a hypothesis test.

Step 1: Determine level of significance

This sets your $\alpha$ value. It says how far out in the tails you consider to be extreme and unlikely to occur by random chance. It is also your Type 1 Error, the probability you will reject a claim that is true. In practice it is almost always 0.05

We will be willing to conclude in of vaccines having an effect if the p-value less than or equal to 0.05. The analogy to “beyond a reasonable doubt” in hypothesis testing is what is known as the significance level.

Step 2: State hypotheses

Give your null and alternative hypotheses in words or in symbols.

Convention:

$H_0$ is always stated as: parameter = number

$H_a$ has three possible forms:

parameter > number
parameter < number
parameter $\neq$ number

and (2) are referred to as 1-tailed tests (3) is referred to as a 2-tailed test

We initially assume that $H_0$ is true. The null hypothesis $H_0$ will be rejected (in favor of $H_a$) only if the sample evidence strongly suggests that
$H_0$ is false. If the sample does not provide such evidence, $H_0$ will not be rejected.

Step 3: State the decision rule (also known as rejection rule)

Give a precise statement of what must happen to reject $H_0$.

Often: We will reject H0 if the p-value $\leq$ $\alpha$ = 0.05.

Step 4: Calculate test statistic

The test statistic provides a measure of the compatibility between the null hypothesis and our data. The value we calculate depends on the specifics of the test we are conducting (which is based on the question we have and the data we collected).

Step 5: Step 5: Calculate the p-value

The p-value in a one-sided test (like this one) tells you the probability of observing a test statistic as extreme as this one, if the null hypothesis is true.

In a two-sided test, the p-value tells you the probability of observing a value as extreme in either direction.

Step 6: Write a conclusion

Results that lead to the rejection of a null hypothesis are said to be statistically significant.

For this reason, statistical hypothesis tests are also referred to as tests of significance.

Statistical significance is an effect so large that it would rarely occur by chance alone.

Note that we never conclude that $H_0$ is true. All we can say is that we have insufficient evidence to reject the null hypothesis. We never write (or say) that we accept the null hypothesis.

Gut instinct says that “Fail to reject $H_0$” should say “Accept $H_0$” but this technically is not correct. Accepting
$H_0$ is the same as saying that a person is innocent. We cannot show that a person is innocent; we can only say that there was not enough substantial evidence to find the person guilty.

When you run a hypothesis test, you are the jury of the trial. You decide whether there is enough evidence to convince yourself that $H_a$ is true (“the person is guilty”) or that there was not enough evidence to convince yourself $H_a$ is true (“the person is not guilty”). You must convince yourself (using statistical arguments) which hypothesis is the correct one given the sample information.

Types of errors in hypothesis testing

The risk of error is the price researchers pay for basing an inference about a population on a sample. With any reasonable sample-based procedure, there is some chance that a Type I error will be made and some chance that a Type II error will occur.

Image source: unbiasedresearch.blogspot.com

If we are using sample data to make inferences about a population parameter, we run the risk of making a mistake. Obviously, we want to minimize our chance of error; we want a small probability of drawing an incorrect conclusion. A type I error is the rejection of a true null hypothesis (also known as a “false positive”), while a type II error is the failure to reject a false null hypothesis (also known as a “false negative”).

The probability of a Type I Error occurring is denoted by $\alpha$ and is called the significance level of a hypothesis test.
The probability of a Type II Error is denoted by $\beta$. $\alpha$ corresponds to the probability of rejecting $H_0$ when, in fact, $H_0$ is true.

$\beta$ corresponds to the probability of failing to reject $H_0$ when, in fact, $H_0$ is false. Ideally, we want
$\alpha$ = 0 and $\beta$ = 0, meaning that the chance of making an error does not exist. When we have to use incomplete information (sample data), it is not possible to have both $\alpha$ = 0 and $\beta$ = 0. We will always have the possibility of at least one error existing when we use sample data.

Usually, what is done is that $\alpha$ is set before the hypothesis test is conducted and then the evidence is judged against that significance level. Common values for $\alpha$ are 0.05, 0.01, and 0.10. If $\alpha$ = 0.05, we are using a testing procedure that, used over and over with different samples, rejects a TRUE null hypothesis five percent of the time.

EXERCISE

So if we can set $\alpha$ to be whatever we want, why choose 0.05 instead of 0.01 or even better 0.0000000000000001?

Well, a small $\alpha$ means the test procedure requires the evidence against $H_0$ to be very strong before we can reject
$H_0$. This means we will almost never reject $H_0$ if $\alpha$ is very small. If we almost never reject $H_0$, the probability of a Type II Error – failing to reject $H_0$ when we should – will increase! Thus, as $\alpha$ decreases, $\beta$ increases and $\alpha$ increases, $\beta$ decreases. We therefore need to strike a balance, and 0.05, 0.01 and 0.1 usually lead to a nice balance.

Power

The third part of this discussion is power. Power is the probability of not making a Type II error, which we can write mathematically as power is 1 – $\beta$. The power of a hypothesis test is between 0 and 1; if the power is close to 1, the hypothesis test is very good at detecting a false null hypothesis. $\beta$ is commonly set at 0.2, but may be set by the researchers to be smaller.

Consequently, power may be as low as 0.8, but may be higher. There are the following four primary factors affecting power: * Significance level ($\alpha$) * Sample size * Variability, or variance, in the measured response variable * Magnitude of the effect of the variable

Power is increased when the sample size increases, as well as when there is a stronger effect size and a higher level. Power decreases when variance ($\sigma$) increases.

Statistical Significance

The idea that sample results are more extreme than we would reasonably expect to see by random chance if the null hypothesis were true is the fundamental idea behind statistical hypothesis tests. If data at least as extreme would be very unlikely if the null hypothesis were true, we say the data are statistically significant. Statistically significant data provide convincing evidence against the null hypothesis in favor of the alternative, and allow us to generalize our sample results to the claim about the population.

However, from the discussion of the tradeoff between $\alpha$ and $\beta$ hopefully you can see that strictly relying on p-values can be troublesome. There are many papers written about this.

Linking statistical analysis to data visualization

The goal is to show that both in tandem can tell us the stories that lie in our data.

We’re going to start by going back to our one-dimensional visualizations, i.e., histograms.

Calb_R <- read_csv(here("data_in", "Calb_resistance.csv"))

## Rows: 501 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): strain, site, sex
## dbl (2): MIC (ug/mL), disk (mm)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Calb_R <- Calb_R %>% 
  rename(MIC = `MIC (ug/mL)`, disk = `disk (mm)`) 

Calb_R

## # A tibble: 501 × 5
##    strain site  sex     MIC  disk
##    <chr>  <chr> <chr> <dbl> <dbl>
##  1 s498   blood f       128     6
##  2 s499   blood f       128     6
##  3 s465   blood f        32    12
##  4 s480   blood f        64    12
##  5 s481   blood f        64    12
##  6 s466   blood f        32    13
##  7 s482   blood m        64    13
##  8 s483   blood m        64    13
##  9 s484   blood f        64    13
## 10 s486   blood f        64    13
## # ℹ 491 more rows

ggplot(data = Calb_R, mapping = aes(disk)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).

EXERCISE

By eye, does this variable look normally distributed? Why or why not?

We can statistically test for normality using the shapiro.test():

shapiro.test(Calb_R$disk)

## 
##  Shapiro-Wilk normality test
## 
## data:  Calb_R$disk
## W = 0.89958, p-value < 2.2e-16

Notice that we switched syntaxes above, to base R (i.e., we used df$variable). The majority, if not all of the common statistical tests require this syntax, because they were developed prior to the tidyverse set of commands. Although there are some workarounds, we’re going to go back and forth a little bit from these two syntaxes as required.

Now we’re going to look at multiple variables.

A reminder of what our dataset it:

glimpse(Calb_R)

## Rows: 501
## Columns: 5
## $ strain <chr> "s498", "s499", "s465", "s480", "s481", "s466", "s482", "s483",…
## $ site   <chr> "blood", "blood", "blood", "blood", "blood", "blood", "blood", …
## $ sex    <chr> "f", "f", "f", "f", "f", "f", "m", "m", "f", "f", "f", "f", "f"…
## $ MIC    <dbl> 128, 128, 32, 64, 64, 32, 64, 64, 64, 64, 64, 64, 32, 32, 32, 6…
## $ disk   <dbl> 6, 6, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 1…

First let’s ignore the different sites and test whether disk diffusion MIC differed between strains isolated from males and females. To do that we’re going to use a two sample t-test. The assumptions of a t-test are normally-distributed data. So first let’s subset the data and test for normality.

Calb_R_f <- filter(Calb_R, sex == "f")
Calb_R_m <- filter(Calb_R, sex == "m")
shapiro.test(Calb_R_f$disk)

## 
##  Shapiro-Wilk normality test
## 
## data:  Calb_R_f$disk
## W = 0.87021, p-value = 6.053e-14

shapiro.test(Calb_R_m$disk)

## 
##  Shapiro-Wilk normality test
## 
## data:  Calb_R_m$disk
## W = 0.91646, p-value = 2.678e-10

We already saw this was not met on the whole data set so this is not surprising. So instead we will do the non-parametric wilcoxon test (or Mann-Whitney U test) that compares data ranks instead.

# specify x and y
wilcox.test(Calb_R_f$disk, Calb_R_m$disk)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Calb_R_f$disk and Calb_R_m$disk
## W = 30727, p-value = 0.928
## alternative hypothesis: true location shift is not equal to 0

# use equation format
wilcox.test(disk ~ sex, data = Calb_R)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  disk by sex
## W = 30727, p-value = 0.928
## alternative hypothesis: true location shift is not equal to 0

ggplot(data = Calb_R, mapping = aes(disk, fill = site)) +
  geom_histogram(binwidth = 2, na.rm = TRUE) + 
  theme_bw() +
  labs(x = "disk diffusion zone of inhibition (mm)" , y = "Number of strains")

We can similarly ignore sex and test the effect of site. In this case we have more than two groups, so we’re going to use an ANOVA test. In reality a t-test is the same thing as an ANOVA, just with two groups instead of more than two groups. The non-parametric equivalent of an ANOVA is the Kruskal-Wallis test, but the Kruskal-Wallis test assumes that sampled populations have identical shape and dispersion. We can see from our figure that this is not met. In this case it is actually better to use an ANOVA test. Although the ANOVA is parametric, it is considered a robust test against the normality assumption, that is non-normal data has only a small effect on the Type I error rate.

aov(disk ~ site, data = Calb_R)

## Call:
##    aov(formula = disk ~ site, data = Calb_R)
## 
## Terms:
##                      site Residuals
## Sum of Squares   1106.458 24891.534
## Deg. of Freedom         2       492
## 
## Residual standard error: 7.112844
## Estimated effects may be unbalanced
## 6 observations deleted due to missingness

When you run the ANOVA we don’t actually get all the information we need out of just the model aov call. We need to wrap that in a second function to pull out additional information:

anova_test <- aov(disk ~ site, data = Calb_R) 
summary(anova_test)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## site          2   1106   553.2   10.94 2.26e-05 ***
## Residuals   492  24892    50.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 6 observations deleted due to missingness

There’s the stats. We’re going to install one more package, the broom package, that will clean up this output.

library(broom)
tidy(anova_test)

## # A tibble: 2 × 6
##   term         df  sumsq meansq statistic    p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
## 1 site          2  1106.  553.       10.9  0.0000226
## 2 Residuals   492 24892.   50.6      NA   NA

Using the broom functions tidy we can easily access the parameter values:

anova_test_tidy <- tidy(anova_test)
anova_test_tidy

## # A tibble: 2 × 6
##   term         df  sumsq meansq statistic    p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
## 1 site          2  1106.  553.       10.9  0.0000226
## 2 Residuals   492 24892.   50.6      NA   NA

anova_test_tidy$p.value[1]

## [1] 2.256929e-05

If we want to know which groups are different from each other, we can use the post-hoc (or “after the event”) tukey test:

TukeyHSD(anova_test)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = disk ~ site, data = Calb_R)
## 
## $site
##                 diff       lwr       upr     p adj
## oral-blood  3.965051  1.886492 6.0436086 0.0000271
## skin-blood  2.130479  0.458185 3.8027737 0.0080993
## skin-oral  -1.834571 -3.923196 0.2540533 0.0982914

Exercise

Conduct a t-test comparing skin and oral samples. First subset the data frame as needed, then check for normality. ?t.test might be helpful

In reality, we actually know that there are two different categorical variables here that could influence the disk diffusion resistance (site and sex), and we can include them both in one test, a two-way ANOVA.

full_anova_test <- aov(disk ~ site*sex, data = Calb_R) 
tidy(full_anova_test)

## # A tibble: 4 × 6
##   term         df   sumsq meansq statistic    p.value
##   <chr>     <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
## 1 site          2  1106.   553.     10.9    0.0000228
## 2 sex           1    13.6   13.6     0.269  0.604    
## 3 site:sex      2   116.    58.0     1.15   0.319    
## 4 Residuals   489 24762.    50.6    NA     NA

This hopefully tells us what we already intuited from the EDA: site but not sex influences disk resistance.

Exercise

Conduct a statistical test to determine whether site or sex (or their interaction) has a significant effect on MIC.

ggplot(Calb_R, aes(MIC, disk)) +
  scale_x_continuous(trans="log2", breaks = unique(Calb_R$MIC)) +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , x = expression(MIC[50])) +
  theme_bw() +
  geom_point(na.rm =TRUE) +
  geom_jitter(alpha = 0.5, color = "tomato", width = 0.2)

## Warning: Removed 14 rows containing missing values (`geom_point()`).

And finally a statistical test to cap it all off, we’ll look for a correlation between these two resistance variables. We’ll again turn to our non-parametric statistics, and specify that we want Spearman’s rho test (in this case it’s the same function as the parametric test, we just specify the method we want to use).

cor_test <- cor.test(Calb_R$MIC, Calb_R$disk, method = "spearman")

## Warning in cor.test.default(Calb_R$MIC, Calb_R$disk, method = "spearman"):
## Cannot compute exact p-value with ties

cor_test

## 
##  Spearman's rank correlation rho
## 
## data:  Calb_R$MIC and Calb_R$disk
## S = 30335647, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5661987

And we can again use the broom package to make the output more tidy and easier to access.

tidy(cor_test)

## # A tibble: 1 × 5
##   estimate statistic  p.value method                          alternative
##      <dbl>     <dbl>    <dbl> <chr>                           <chr>      
## 1   -0.566 30335647. 1.03e-42 Spearman's rank correlation rho two.sided

ggplot(Calb_R, aes(MIC, disk)) +
  scale_x_continuous(trans="log2", breaks = unique(Calb_R$MIC)) +
  scale_y_reverse(limits = c(50, 0)) +
  labs(y = "disk diffusion zone of inhibition (mm)" , x = expression(MIC[50])) +
  theme_bw() +
  geom_point(na.rm =TRUE) +
  geom_jitter(alpha = 0.5, color = "tomato", width = 0.2)+
  geom_smooth(method = "lm", na.rm=TRUE)

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 13 rows containing missing values (`geom_point()`).

Attributes

This lesson was created by Aleeza Gerstein at the University of Manitoba based on material from: Wikipedia Applied Biostats Modern Dive,
Diva Jain, Alan Downey, :Probably Overthinking it” Introduction to Statistical Ideas and Methods online modules, Statistics Teacher: What is power?, Concepts and Applications of Inferential Statistics, Towards Data Science blog post, and blog post. Brandvain Chapter 1: Introduction to Statistics

Made available under the Creative Commons Attribution license. License.