Directions

In this lab you will:

Open RStudio

Click on the home button, and scroll down until you find the RStudio directory. Click on RStudio.

Create a script file

Click on “File”, “New File”, “R Script”.

Download a CPS data set

Download the data from the website using:

cps <- read.csv("http://home.cc.umanitoba.ca/~godwinrt/3040/data/lab4.csv")

This is another version of the “Current Population Survey” data from the US. Take a good look at the data set either by clicking on the spreadsheet icon next to its object name in the top-right window, or by using the command:

View(cps)

Our dependent variable will be ahe (average hourly earnings). Notice that we have a “factor” variable in the data set called location. Let’s estimate a simple model that uses all of the variables:

summary(lm(ahe ~ female + age + yrseduc + location, data = cps))
## 
## Call:
## lm(formula = ahe ~ female + age + yrseduc + location, data = cps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.259  -5.731  -1.321   4.267  49.160 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -10.366088   0.246199 -42.105  < 2e-16 ***
## female             -4.235319   0.071359 -59.352  < 2e-16 ***
## age                 0.156247   0.003348  46.664  < 2e-16 ***
## yrseduc             1.741244   0.014448 120.521  < 2e-16 ***
## locationnortheast   1.288082   0.106263  12.122  < 2e-16 ***
## locationsouth       0.036092   0.095600   0.378    0.706    
## locationwest        0.806209   0.101145   7.971  1.6e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.762 on 61388 degrees of freedom
## Multiple R-squared:  0.2514, Adjusted R-squared:  0.2514 
## F-statistic:  3437 on 6 and 61388 DF,  p-value: < 2.2e-16

What are the estimated returns to education?

Estimate a polynomial regression model

We want to allow the possibility for age and yrseduc to have a non-linear effect on wage, for reasons discussed in class. One way to approximate a non-linear relationship is to use a polynomial in the variables age and yrseduc. To begin with, create the new variables that we’ll need for the polynomial:

cps$age2 <- cps$age ^ 2
cps$age3 <- cps$age ^ 3
cps$age4 <- cps$age ^ 4
cps$yrseduc2 <- cps$yrseduc ^ 2
cps$yrseduc3 <- cps$yrseduc ^ 3
cps$yrseduc4 <- cps$yrseduc ^ 4
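As an aside, R can also build these polynomial terms directly inside the model formula with I(), so you don’t need the extra columns. A minimal sketch on simulated data (the data and variable names below are made up for illustration; in the lab you would use the cps data frame):

```r
# Simulated stand-in for the cps data (hypothetical values, illustration only)
set.seed(1)
fake <- data.frame(age = sample(18:65, 200, replace = TRUE))
fake$ahe <- 5 + 0.9 * fake$age - 0.01 * fake$age^2 + rnorm(200)

# Creating the squared column by hand...
fake$age2 <- fake$age^2
by_hand <- lm(ahe ~ age + age2, data = fake)

# ...is equivalent to squaring inside the formula with I()
inline <- lm(ahe ~ age + I(age^2), data = fake)

# Both fits use the same model matrix, so the estimates are identical
coef(by_hand)
coef(inline)
```

Either approach works; the lab creates the columns explicitly so that the variable names in the output stay short.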

Next, estimate a regression model with these extra variables:

summary(lm(ahe ~ female + age + age2 + age3 + age4 + yrseduc + yrseduc2 + yrseduc3 + yrseduc4 + location, data = cps))
## 
## Call:
## lm(formula = ahe ~ female + age + age2 + age3 + age4 + yrseduc + 
##     yrseduc2 + yrseduc3 + yrseduc4 + location, data = cps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.458  -5.605  -1.238   4.162  48.309 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.031e+00  8.774e+00   0.345 0.729752    
## female            -4.200e+00  7.062e-02 -59.470  < 2e-16 ***
## age                1.577e+00  6.504e-01   2.424 0.015346 *  
## age2              -2.245e-02  2.466e-02  -0.911 0.362512    
## age3               3.908e-05  4.021e-04   0.097 0.922568    
## age4               7.079e-07  2.386e-06   0.297 0.766716    
## yrseduc           -7.516e+00  2.137e+00  -3.517 0.000438 ***
## yrseduc2           7.913e-01  2.615e-01   3.026 0.002482 ** 
## yrseduc3          -2.702e-02  1.362e-02  -1.983 0.047326 *  
## yrseduc4           2.922e-04  2.570e-04   1.137 0.255500    
## locationnortheast  1.230e+00  1.051e-01  11.706  < 2e-16 ***
## locationsouth      8.359e-03  9.458e-02   0.088 0.929576    
## locationwest       7.350e-01  1.001e-01   7.344  2.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.656 on 61382 degrees of freedom
## Multiple R-squared:  0.2696, Adjusted R-squared:  0.2695 
## F-statistic:  1888 on 12 and 61382 DF,  p-value: < 2.2e-16

Determine the appropriate degree for the polynomial

We can use a series of t-tests in order to determine how many “powers” of the variables age and yrseduc to include in the model. Currently, we have a polynomial of degree \(r = 4\) for both variables. We begin with the null hypothesis:

\(H_0: \beta_{age^4} = 0\)

vs.

\(H_A: \beta_{age^4} \neq 0\)

Notice, from the previous regression, that the p-value for this test is 0.766716. We fail to reject. We eliminate age4 from the model, and re-estimate:

summary(lm(ahe ~ female + age + age2 + age3 + yrseduc + yrseduc2 + yrseduc3 + yrseduc4 + location, data = cps))
## 
## Call:
## lm(formula = ahe ~ female + age + age2 + age3 + yrseduc + yrseduc2 + 
##     yrseduc3 + yrseduc4 + location, data = cps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.471  -5.606  -1.237   4.163  48.317 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.263e+00  6.441e+00   0.196 0.844480    
## female            -4.200e+00  7.062e-02 -59.470  < 2e-16 ***
## age                1.765e+00  1.350e-01  13.077  < 2e-16 ***
## age2              -2.970e-02  3.335e-03  -8.905  < 2e-16 ***
## age3               1.581e-04  2.646e-05   5.975 2.31e-09 ***
## yrseduc           -7.517e+00  2.137e+00  -3.517 0.000436 ***
## yrseduc2           7.917e-01  2.615e-01   3.027 0.002469 ** 
## yrseduc3          -2.705e-02  1.362e-02  -1.986 0.047059 *  
## yrseduc4           2.930e-04  2.569e-04   1.140 0.254207    
## locationnortheast  1.230e+00  1.051e-01  11.709  < 2e-16 ***
## locationsouth      8.260e-03  9.458e-02   0.087 0.930405    
## locationwest       7.350e-01  1.001e-01   7.344 2.09e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.656 on 61383 degrees of freedom
## Multiple R-squared:  0.2696, Adjusted R-squared:  0.2695 
## F-statistic:  2060 on 11 and 61383 DF,  p-value: < 2.2e-16

We now test the hypothesis:

\(H_0: \beta_{age^3} = 0\)

vs.

\(H_A: \beta_{age^3} \neq 0\)

Based on the output from above, the p-value for this test is 0.00000000231. We reject the null. We have now determined that \(r = 3\) for age. Using a similar process for yrseduc, we will also find that \(r = 3\) (try it yourself).
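The mechanics of this “general-to-specific” procedure can be sketched on simulated data (the data below are made up purely to illustrate the steps; in the lab you would run the same sequence on cps):

```r
# Sequential t-tests on the highest power, on simulated data where the
# true relationship is quadratic
set.seed(2)
x <- runif(300, 1, 20)
y <- 2 + 0.5 * x + 0.1 * x^2 + rnorm(300)
d <- data.frame(y, x, x2 = x^2, x3 = x^3, x4 = x^4)

# Start with the full quartic and inspect the p-value on the highest power
fit4 <- lm(y ~ x + x2 + x3 + x4, data = d)
summary(fit4)$coefficients["x4", "Pr(>|t|)"]

# If we fail to reject, drop x4 and test x3; repeat until a power is significant
fit3 <- lm(y ~ x + x2 + x3, data = d)
summary(fit3)$coefficients["x3", "Pr(>|t|)"]

fit2 <- lm(y ~ x + x2, data = d)
summary(fit2)$coefficients["x2", "Pr(>|t|)"]
```

The procedure stops at the first power whose p-value is below the chosen significance level; that power is the degree \(r\).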

For simplicity, however, we will only use the squared term. Note that many researchers would make this decision as well, in order to obtain a simpler model, unless the cubic terms really made a difference in the interpretation of the model. The model is now:

mod1 <- lm(ahe ~ female + age + age2 + yrseduc + yrseduc2 + location, data = cps)
summary(mod1)
## 
## Call:
## lm(formula = ahe ~ female + age + age2 + yrseduc + yrseduc2 + 
##     location, data = cps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.640  -5.609  -1.247   4.154  48.448 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.703e+01  8.644e-01 -19.702  < 2e-16 ***
## female            -4.179e+00  7.069e-02 -59.110  < 2e-16 ***
## age                9.789e-01  2.468e-02  39.660  < 2e-16 ***
## age2              -9.949e-03  2.949e-04 -33.733  < 2e-16 ***
## yrseduc            3.878e-01  1.020e-01   3.802 0.000143 ***
## yrseduc2           4.816e-02  3.656e-03  13.174  < 2e-16 ***
## locationnortheast  1.214e+00  1.052e-01  11.536  < 2e-16 ***
## locationsouth     -1.271e-02  9.468e-02  -0.134 0.893207    
## locationwest       7.639e-01  1.002e-01   7.623 2.52e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.67 on 61386 degrees of freedom
## Multiple R-squared:  0.2672, Adjusted R-squared:  0.2671 
## F-statistic:  2797 on 8 and 61386 DF,  p-value: < 2.2e-16

Perform an F-test

Recall that an F-test is used when we want to test multiple restrictions simultaneously. Let’s test to see if location has any effect on ahe. Notice that location is a categorical variable with 4 categories. Using it in the model means we have 3 dummy variables. Our null hypothesis involves 3 of the \(\beta\)s simultaneously:

\(H_0: \beta_{northeast} = 0, \beta_{south} = 0, \beta_{west} = 0\)

vs.

\(H_A: \mathrm{not} \ H_0\)

The alternative hypothesis is that one or more of the \(\beta\)s are not equal to zero.

In order to perform this hypothesis test we can estimate two models: one under the alternative hypothesis (the unrestricted model), and one under the null hypothesis (the restricted model).

The unrestricted model has already been estimated. We saved it as mod1 above. To get the restricted model, we simply substitute in the restrictions under the null hypothesis. This results in the location dummy variables being dropped from the model:

mod2 <- lm(ahe ~ female + age + age2 + yrseduc + yrseduc2, data = cps)
summary(mod2)
## 
## Call:
## lm(formula = ahe ~ female + age + age2 + yrseduc + yrseduc2, 
##     data = cps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.166  -5.630  -1.276   4.153  48.793 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.653e+01  8.592e-01 -19.235  < 2e-16 ***
## female      -4.193e+00  7.078e-02 -59.229  < 2e-16 ***
## age          9.815e-01  2.472e-02  39.707  < 2e-16 ***
## age2        -9.971e-03  2.954e-04 -33.753  < 2e-16 ***
## yrseduc      3.598e-01  1.020e-01   3.529 0.000418 ***
## yrseduc2     4.942e-02  3.657e-03  13.517  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.685 on 61389 degrees of freedom
## Multiple R-squared:  0.2647, Adjusted R-squared:  0.2646 
## F-statistic:  4419 on 5 and 61389 DF,  p-value: < 2.2e-16

Notice that there are 3 fewer variables in the restricted model.

Now, we can use the anova() command:

anova(mod1, mod2)
## Analysis of Variance Table
## 
## Model 1: ahe ~ female + age + age2 + yrseduc + yrseduc2 + location
## Model 2: ahe ~ female + age + age2 + yrseduc + yrseduc2
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1  61386 4614395                                  
## 2  61389 4630134 -3    -15739 69.792 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic is 69.792, and the p-value is less than 2.2e-16 (effectively zero). We reject the null hypothesis that the location variables do not matter. That is, we reject the restricted model (mod2) in favour of the unrestricted model (mod1).

Calculate the F-test statistic by hand

The formula for the F-test statistic that you would use on the exam is:

\(F = \frac{(R^2_U - R^2_R) / q}{(1 - R^2_U) / (n - k_u - 1)}\)

Get the \(R^2\) from the unrestricted and restricted models. We have 3 restrictions in the null hypothesis, so \(q = 3\). \(n = 61395\) (see the top-right of your screen), and there are 8 regressors in the unrestricted model, so \(k_u = 8\). Plug all of this into the formula and get:

((0.2672 - 0.2647) / 3) / ((1 - 0.2672) / (61395 - 8 - 1))
## [1] 69.80759

This is very close to the value from the anova() command; the small difference is due to rounding the \(R^2\) values in the summary() output.
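If you want to reuse the calculation, the formula can be wrapped in a small helper function (fstat is a name chosen here, not a built-in R function):

```r
# F-statistic from R-squared values: q restrictions, n observations,
# ku regressors in the unrestricted model
fstat <- function(r2u, r2r, q, n, ku) {
  ((r2u - r2r) / q) / ((1 - r2u) / (n - ku - 1))
}

# Plugging in the rounded R-squared values from the summary() output:
fstat(0.2672, 0.2647, q = 3, n = 61395, ku = 8)
## [1] 69.80759
```

Using the exact (unrounded) values, summary(mod1)$r.squared and summary(mod2)$r.squared, should reproduce the anova() statistic of 69.792.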

Interpret the estimated effect of education

As discussed in class, interpreting the effects of a variable that has a squared term (or cubed term, etc.) in the model can be tricky. One way is to get two OLS predicted values that differ in yrseduc by one unit, and take the difference. For example, let’s consider the impact of an additional year of education for a female, aged 30 years, who lives in the “midwest” and has 8 years of education:

predict(mod1, data.frame(female = 1, age = 30, age2 = (30^2), yrseduc = 8, yrseduc2 = (8^2), location = "midwest"))
##        1 
## 5.387269

Notice that when we plug in the “30” and the “8” we need to square them for the age2 and yrseduc2 variables, respectively. Now, we take the same person, but increase their education by 1:

predict(mod1, data.frame(female = 1, age = 30, age2 = (30^2), yrseduc = 9, yrseduc2 = (9^2), location = "midwest"))
##        1 
## 6.593853

Taking the difference between the two predicted values gives us the estimated effect due to an increase in education by one year, for a worker that has 8 years of education:

6.593853 - 5.387269
## [1] 1.206584

Because yrseduc has a non-linear effect, we need to determine the effect of the 1 year increase for a different starting value of yrseduc. We’ll take the difference in predicted wages for a worker with 17 vs. 16 years of education:

predict(mod1, data.frame(female = 1, age = 30, age2 = (30^2), yrseduc = 17, yrseduc2 = (17^2), location = "midwest")) - predict(mod1, data.frame(female = 1, age = 30, age2 = (30^2), yrseduc = 16, yrseduc2 = (16^2), location = "midwest"))
##        1 
## 1.977198

The estimated marginal effect of education on wages depends on education itself. For a worker with 8 years of education, the effect is $1.21. For a worker with 16 years, it is $1.98.

Important: notice that none of the other characteristics that we chose for the individual (everything except yrseduc) affect this result. They all cancel out when we take the difference between the two predicted values.

Look at summary(mod1) again, and make sure you understand where this is coming from (it will be on the final exam!):

(0.3878 * (17) + 0.04816 * (17 ^ 2)) - (0.3878 * (16) + 0.04816 * (16 ^ 2))
## [1] 1.97708

It’s the same marginal effect determined from the predict() command (up to rounding of the coefficients).
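This hand calculation generalizes: for a model containing \(\beta_1 x + \beta_2 x^2\), the change in the prediction when \(x\) increases from \(x_0\) to \(x_0 + 1\) is \(\beta_1 + \beta_2(2x_0 + 1)\), since the other terms cancel. A small helper function using the rounded coefficients from summary(mod1) (unit_effect is a name made up here for illustration):

```r
# Effect on the predicted value of a one-unit increase in x, starting at x0,
# when the model includes b1*x + b2*x^2:
#   [b1*(x0+1) + b2*(x0+1)^2] - [b1*x0 + b2*x0^2] = b1 + b2*(2*x0 + 1)
unit_effect <- function(b1, b2, x0) {
  b1 + b2 * (2 * x0 + 1)
}

# Rounded coefficients on yrseduc and yrseduc2 from mod1
unit_effect(0.3878, 0.04816, 8)   # 8 -> 9 years of education
## [1] 1.20652
unit_effect(0.3878, 0.04816, 16)  # 16 -> 17 years of education
## [1] 1.97708
```

These match the predict()-based answers above, apart from rounding of the coefficients.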

Save your script file

Make sure the top-left window (your R script) is active (click on it), then click “File”, “Save As…”, and name your script file.

Assignment 4

Assignment 4 can be found here.