Mixed-effects models

Using a regular linear model or Anova when several of your observations come from the same speaker/word is bad because:

The observations aren’t independent.
- Subject F’s response to kangaroo is likely to be more similar to subject F’s response to giraffe than Subject Z’s would be.
- Subject F’s and Subject G’s response to kangaroo are likely more similar to each other than they should be.
But every statistical test assumes that you're calculating the variance of independent data.
So you'll be underestimating the true variance in your data.
So you’ll be underestimating the true p-value for your test statistic.
So you’ll be more likely to make a Type I error and claim you’ve found an effect when one isn’t there.

A repeated-measures Anova is a type of Anova that has been modified to deal with this problem. You'll see they a lot in psycholinguistics articles. But they suffer from the same problem that Anovas in general have — their assumptions about the data are very restrictive and probably not true of your data.

A mixed-effect model is a modification of an ordinary linear model that deals with the same problem.

Fixed and random effects

The name “mixed-effects model” comes from the fact that these models combine both random effects and fixed effects.

Random effects are categorical variables in your regression model whose values are a random sample from a larger population. For example, individual speakers, participants in an experiment, the 40 words of English you’re testing with, schools in a multi-school study, cities in a multi-city study, and so on.
Fixed effects are variables or categories in your regression model whose values are the entire set of possibilities in the population (or at least the entire set you’re interested in making claims about). Examples might include gender, regular vs. irregular verb, bilinguals vs. monolinguals, new drug vs. old drug vs. placebo.

Individual speakers are a random effect because, unless your language is on the verge of extinction, your speakers are just a sample from a wider population — you'd like to be able to make claims about the whole population based on your research, not just claims about the 10 speakers you happened to measure.

(You should think of most continuous variables as fixed effects, especially if they're measurements about one of your random effects. For example: the age of a speaker, the length or the frequency of a word.)

Thinking about mixed-effects models

The result of a mixed-effects model is the same as for an ordinary linear model. It gives you a set of coefficients that you can turn into an equation for predicting the response variable. The only difference is that you now also get coefficients for every individual member in a random variable, e.g., individual speakers and individual words.

For example, a mixed-effects model might tell you that the best prediction for how fast male subject F responds to the word kangaroo in a lexical decision experiment is the sum of a number of fixed and random effects:

671 ms: the average RT of all female subjects to all words with a frequency of 0 (the overall intercept of the whole model).
24 ms: how much slower men are overall compared to women (a fixed effect).
–9 ms: how much faster subject F is than the average male (a random effect, which helps the model cope with any additional weirdness subject F has).
129 ms: how much slower the average subject is to any word with the same frequency as kangaroo has (a fixed effect).
21 ms: how much slower the average subject is to kangaroo over and above the frequency effect (a random effect, which helps the model cope with any additional weirdness the word kangaroo might have).
–32 ms: how much less sensitive subject F is to the frequency effect (multiplied by the frequency of kangaroo – an interaction).
etc.

Building mixed-effects models in R

You call the model-building function and construct your formula as usual, except:

the model-building function is lmer instead of lm.

lmer is in the lme4 package, which you'll need to install. But it's more convenient to use the lmerTest package, which acts as a middleman between you and lme4, adding a few extra abilities to the models and giving you summaries that are more convenient to interpret.
You allow each value in a random effect variable (say, Word) to have a different intercept with:

+ (1|Word)
You can allow each speaker to have a different intercept and a different slope for Frequency with:

+ (1+Frequency|Subject)

library(languageR)
library(lmerTest)

A typical call might be:

model1 = lmer(RT ~ Frequency + Sex + (1 | Word) + (1 + Frequency | Subject), 
    data = lexdec)

When you ask for the summary of this model, it might take a few seconds to get it, since R has to considerably more computation for the p-values than it does with an ordinary linear model.

summary(model1)

## Linear mixed model fit by REML ['merModLmerTest']
## Formula: RT ~ Frequency + Sex + (1 | Word) + (1 + Frequency | Subject)
##    Data: lexdec
## 
## REML criterion at convergence: -953.5
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.310 -0.610 -0.116  0.478  6.172 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  Word     (Intercept) 0.00297  0.0545        
##  Subject  (Intercept) 0.05717  0.2391        
##           Frequency   0.00041  0.0202   -0.91
##  Residual             0.02917  0.1708        
## Number of obs: 1659, groups: Word, 79; Subject, 21
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  6.57856    0.06248 31.10000  105.28  < 2e-16 ***
## Frequency   -0.04287    0.00731 46.80000   -5.86  4.4e-07 ***
## SexM         0.03065    0.05694 19.00000    0.54      0.6    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) Frqncy
## Frequency -0.814       
## SexM      -0.304  0.000

The information we're most interested in is two-thirds of the way down the summary, under the heading “Fixed effects” — the coefficients and the p-values for our fixed-effect predictors.

Remember that lmer has also calculated coefficients for every individual subject and word. The model summary doesn't show you all of these; it only tells you what the variance (and standard deviation) of each group of coefficients is. Here, we can see that subjects cause a lot more variation in reaction time than words do (.057 compared to .003).

It's still possible to find out what the coefficients for the random effects of subject and word are by calling the function ranef on the model:

ranef(model1)

## $Word
##            (Intercept)
## almond       0.0166598
## ant         -0.0247979
## apple       -0.0499647
## apricot     -0.0415198
## asparagus    0.0339187
## avocado     -0.0261636
## banana      -0.0378619
## bat         -0.0186912
## beaver       0.0003554
## bee         -0.0317677
## beetroot     0.0388577
## blackberry  -0.0127765
## blueberry   -0.0432901
## broccoli    -0.0400282
## bunny       -0.0250453
## butterfly    0.0279270
## camel        0.0402816
## carrot      -0.0173136
## cat         -0.0118809
## cherry      -0.0217833
## chicken     -0.0151615
## clove       -0.0027544
## crocodile    0.0498886
## cucumber     0.0282791
## dog          0.0241956
## dolphin     -0.0324489
## donkey      -0.0201388
## eagle       -0.0001176
## eggplant    -0.0897609
## elephant     0.0453390
## fox         -0.0096457
## frog        -0.0240729
## gherkin      0.0394817
## goat        -0.0318655
## goose       -0.0164713
## grape       -0.0426421
## gull         0.0414234
## hedgehog     0.0925493
## horse        0.0550439
## kiwi        -0.0960585
## leek         0.0072247
## lemon       -0.0532865
## lettuce     -0.0048729
## lion        -0.0176973
## magpie       0.0626921
## melon       -0.0276672
## mole         0.0160981
## monkey      -0.0227018
## moose       -0.0720857
## mouse       -0.0137014
## mushroom    -0.0101361
## mustard      0.0131914
## olive       -0.0369809
## orange      -0.0206903
## owl         -0.0151321
## paprika      0.0156422
## peanut      -0.0626069
## pear        -0.0346215
## pig         -0.0315359
## pineapple   -0.0106043
## potato       0.0328864
## radish      -0.0073131
## reindeer     0.0779623
## shark       -0.0127645
## sheep       -0.0056749
## snake        0.0087572
## spider      -0.0279644
## squid       -0.0065762
## squirrel     0.0704945
## stork        0.0715612
## strawberry  -0.0255583
## swan         0.0014335
## tomato      -0.0060598
## tortoise     0.1178308
## vulture      0.1657870
## walnut       0.0274418
## wasp         0.0961881
## whale       -0.0046359
## woodpecker  -0.0045011
## 
## $Subject
##    (Intercept)  Frequency
## A1   -0.099090  0.0011274
## A2   -0.262159  0.0168706
## A3    0.007167  0.0029788
## C    -0.119466  0.0138360
## D    -0.024990  0.0051250
## I    -0.127415  0.0020594
## J    -0.185231  0.0139719
## K    -0.260788  0.0171942
## M1   -0.323062  0.0259097
## M2    0.312304 -0.0322195
## P     0.142120 -0.0180479
## R1   -0.165228  0.0201179
## R2    0.063301  0.0001795
## R3   -0.025278  0.0026383
## S     0.004059 -0.0010477
## T1   -0.006359 -0.0010557
## T2    0.603151 -0.0375270
## V     0.260399 -0.0112407
## W1   -0.206544  0.0141113
## W2    0.065240 -0.0024841
## Z     0.347866 -0.0324976

We can see, for example, that subjects responded exceptionally slowly to tortoise and vulture, but very quickly to kiwi. (Remember that this experiment was conducted in New Zealand.)

p-values of mixed-effect models

Calculating p-values for mixed-effects models is problematic. Remember that hypothesis testing involves figuring out how far out the results of your experiment are on a reference distribution of the results of similar experiments where the null hypothesis was true: for example, how far out on the normal distribution of height a single mystery child was, or how far out on the t-distribution a sample of five mystery children was.

Unfortunately, mixed-effects models are so complex that it's not obvious what that reference distribution should be. There are a couple of proposals on this, but even they are so complex that they don't make it possible to exactly calculate the F-statistic (the version of the t-statistic used in Anovas) — instead they have to estimate an approximation of the F-statistic. The lmerTest package uses the “Satterthwaite approximation” — which is what takes R so long to calculate when you ask for a summary() of your model.

Here's a different model for response time in lexdec:

model.with.NL = lmer(RT ~ Frequency + NativeLanguage + (1 | Subject) + (1 | 
    Word), data = lexdec)
summary(model.with.NL)

## Linear mixed model fit by REML ['merModLmerTest']
## Formula: RT ~ Frequency + NativeLanguage + (1 | Subject) + (1 | Word)
##    Data: lexdec
## 
## REML criterion at convergence: -934.2
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.452 -0.625 -0.131  0.479  5.952 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Word     (Intercept) 0.00294  0.0542  
##  Subject  (Intercept) 0.01847  0.1359  
##  Residual             0.02984  0.1727  
## Number of obs: 1659, groups: Word, 79; Subject, 21
## 
## Fixed effects:
##                     Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)          6.52200    0.04872 40.70000  133.85  < 2e-16 ***
## Frequency           -0.04287    0.00583 77.00000   -7.36  1.7e-10 ***
## NativeLanguageOther  0.15582    0.06053 19.00000    2.57    0.019 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Frqncy
## Frequency   -0.568       
## NtvLnggOthr -0.532  0.000

Just like with ordinary linear models, it's possible to get the results printed in an Anova table that's more familiar to traditional psycholinguists, using the anova() function. Because this model doesn't contain any interactions, anova() shows the same results as summary().

anova(model.with.NL)

## Analysis of Variance Table of type 3  with  Satterthwaite 
## approximation for degrees of freedom
##                Sum Sq Mean Sq NumDF DenDF F.value  Pr(>F)    
## Frequency       1.615   1.615     1    77    54.1 1.7e-10 ***
## NativeLanguage  0.198   0.198     1    19     6.6   0.019 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Unfortunately, in the near future, any reactionary journal editors and referees will want to see the results for a repeated-measures Anova, which gives two hard-to-interpret F-values that apply to the entire table, rather than one F-value for every predictor, like this anova() function gives you.)

Another method of estimating p-values that you'll see a lot in recent psycholinguistics articles uses Monte Carlo Markov chain (MCMC) simulation. This method carries out some large number (often 10,000) simulated replications of your experiment, each time adjusting the coefficients of the model a little, generating a new set of simulated data based on that new prediction equation, and keeping track of how often the data from those simulated experiments ended up looking like yours. Because these \( p_{MCMC} \) values are simulated rather than calculated exactly, they'll end up a little bit different every time you do the computation. They've been commonly used in psycholinguistics because, until very recently, Harald Baayen's languageR package offered the pvals.fnc() function that would do all the necessary work automatically; unfortunately, the latest changes in the lme4 package have stopped pvals.fnc() from working.

Comparing models

Most scientific disciplines are moving, at different speeds, away from hypothesis testing and towards a perspective where a researcher's job is to find the model that best expresses the structure in their data, and to quantify how much better that best model is than the alternatives.

Using hypothesis testing, the question we (kind of) asked in the preceding section was: “Is the effect of native language on response time statistically significant?” Meaning the backwards question: “Is there less than a 5% chance that our experiment would have produced data this extreme if there weren't an effect of native language?”

In model comparison, the question is “How much better is a model that includes native language as a predictor compared to a model that doesn't include it?”

To see how to answer this question, let's build the mixed-effects model that's exactly like model.with.NL, except without the native language predictor:

model.without.NL = lmer(RT ~ Frequency + (1 | Subject) + (1 | Word), data = lexdec)
summary(model.without.NL)

## Linear mixed model fit by REML ['merModLmerTest']
## Formula: RT ~ Frequency + (1 | Subject) + (1 | Word)
##    Data: lexdec
## 
## REML criterion at convergence: -932
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.438 -0.625 -0.134  0.473  5.946 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Word     (Intercept) 0.00294  0.0542  
##  Subject  (Intercept) 0.02377  0.1542  
##  Residual             0.02984  0.1727  
## Number of obs: 1659, groups: Word, 79; Subject, 21
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  6.58878    0.04420 51.00000  149.07  < 2e-16 ***
## Frequency   -0.04287    0.00583 77.00000   -7.36  1.7e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr)
## Frequency -0.626

anova(model.without.NL)

## Analysis of Variance Table of type 3  with  Satterthwaite 
## approximation for degrees of freedom
##           Sum Sq Mean Sq NumDF DenDF F.value  Pr(>F)    
## Frequency   1.61    1.61     1    77    54.1 1.7e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To see how well our original model.with.NL fits the data compared to this less complex model, we compare them directly against each other by giving them both to the anova() function.

anova(model.with.NL, model.without.NL)

## refitting model(s) with ML (instead of REML)

## Data: lexdec
## Models:
## ..1: RT ~ Frequency + (1 | Subject) + (1 | Word)
## object: RT ~ Frequency + NativeLanguage + (1 | Subject) + (1 | Word)
##        Df     AIC     BIC logLik deviance  Chisq Chi Df Pr(>Chisq)  
## ..1     5 -935.42 -908.35 472.71  -945.42                           
## object  6 -939.69 -907.20 475.84  -951.69 6.2713      1    0.01227 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova function has inconveniently relabelled our models: the simpler one (model.without.NL) appears as ..1 in the output, and the more complex one (model.with.NL) as object.

For those still concerned with hypothesis testing, the number they're interested in is the final p-value for the comparison, in the Pr(>Chisq) column — that is, 0.012, which is less than 0.05, so they'd be justified in saying that the model with native language is “significantly better” than the model without.

People who are serious about model comparison look instead at two other pairs of numbers, the AIC and the BIC. The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are measures of how well a model fits its data. They are calculated almost identically, and both are pretty much useless on their own. They only become interesting when comparing two models — the model with a lower AIC or BIC (that is, closer to negative infinity) fits the data better.

Both the AIC and the BIC penalize a model for the amount that the data disagree with its predictions. Both also penalize a model for added complexity, that is, for every additional predictor that the model uses. So, although a more complex model will always make predictions that are closer to the data, the improvement might not be enough to overcome the complexity penalty, so the AIC and BIC might choose the simpler model as better.

The only difference between the AIC and the BIC is how much they penalize a model for its complexity: the BIC probably penalizes complexity a little too much; the AIC probably doesn't penalize complexity enough. This sometimes results in situations like we have here, where the AIC prefers the more complex model (model.with.NL has a lower AIC than model.without.NL) and the BIC prefers the simpler model (model.without.NL has a smaller BIC than model.with.NL). All you can do is report the situation and let your audience make up their own minds about how much they value simple models and how convincing they find your experiment — which, despite all the pretense of objectivity, is what we've been doing all along.

Estimating p(h|d)

For more, see the paper by Eric-Jan Wagenmakers (2007, Psychonomic Bulletin & Review)

The BIC can give us the most conservative possible estimate for the likelihood of the data assuming the hypothesis embodied in the model, i.e., \( p(d|h) \).

This means that we can get the conservative estimate of what we wanted to know all along: Given the results of our experiment, how much more likely is it that my theory is true than that some other theory is true. Recall that by Bayes' Rule, this can be calculated by:

\[ \frac{p(h_1|d)}{p(h_2|d)} = \frac{\frac{p(d|h_1)p(h_1)}{p(d)}}{\frac{p(d|h_2)p(h_2)}{p(d)}} \]

The two denominators, \( p(d) \), will cancel each other out. Furthermore, if we assume equal prior probabilities, that is, that we accept that both hypotheses were equally likely before we did the experiment, then \( p(h_1) \) and \( p(h_2) \) will also cancel each other out, and we're left with:

\[ \frac{p(h_1|d)}{p(h_2|d)} = \frac{p(d|h_1)}{p(d|h_2)} \]

And the BICs of the two models let us estimate \( p(d|h_1) \) and \( p(d|h_2) \).

According to Wagenmakers, the difference between the models' BICs, divided by 2, is the log-odds ratio between them.

The BIC of model.with.NL is -907.20. The BIC of model.without.NL is -908.35. Their log-odds ratio, then their odds-ratio, is:

(-907.2 - -908.35)/2

## [1] 0.575

exp((-907.2 - -908.35)/2)

## [1] 1.777131

So the most conservative possible estimate is that the simpler model, model.without.NL is 1.777 times more likely to be true than the more complex model, model.with.NL. Or:

1.777/2.777

## [1] 0.6398992

If these are the only two possible models in the world (which they're obviously not), then there's at most a 64% chance that model.without.NL is right and at least a 36% chance that model.with.NL is right.

(Notice that, if the odds-ratio had been more impressive than 1.777, we could have ended up arguing in favour of the null hypothesis — something which was completely impossible according to hypothesis testing.)

I predict that within a couple of decades, every serious scientific discipline is going to reporting their results like this. You'll be hearing reports like: “According to the results of this clinical trial, there is an 85% probability that Drug X is better than Drug Y.” It will be left up to the judgment of human beings to decide whether an 85% probability is enough for doctors to start prescribing only Drug X or for Pharmacare programs to pay four times as much for Drug X as they're now paying for Drug Y. This will bug the hell out of people who want their decisions to be made by machines, but that's life.