In this lab you will estimate wage regressions that include dummy variables, predict the wage of a “representative case”, and test hypotheses about the returns to education. To begin:
Click on the home button, and scroll down until you find the RStudio directory. Click on RStudio.
Click on “File”, “New File”, “R Script”.
Download the data from the website using:
cps <- read.csv("http://home.cc.umanitoba.ca/~godwinrt/3040/data/cps1985.csv")
This is a sub-sample from the 1985 wave of the “Current Population Survey” in the US, which contains information on individual wages and demographic characteristics. Take a good look at the data set either by clicking on the spreadsheet icon next to its object name in the top-right window, or by using the command:
View(cps)
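A few other quick ways to look at the data, all from base R:

str(cps)      # the name and type of each variable
head(cps)     # the first six rows
summary(cps)  # summary statistics for every variable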
Notice that many of the variables do not contain numbers, but are instead characters (words). For example, the ethnicity variable takes on the values “hispanic”, “cauc”, and “other”. In order to use variables such as ethnicity, region, and gender, we need to create dummy variables from them. From the ethnicity variable, for example, we would create 2 dummies, even though there are 3 categories (in order to avoid the dummy variable trap).
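To see what this looks like by hand, here is a minimal sketch that builds the two dummies manually (the names hispanic and other are my own choices, and “cauc” is left as the base category):

cps$hispanic <- as.numeric(cps$ethnicity == "hispanic")  # 1 if hispanic, 0 otherwise
cps$other <- as.numeric(cps$ethnicity == "other")        # 1 if other, 0 otherwise
summary(lm(wage ~ hispanic + other, data = cps))         # should match the automatic version below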
R will recognize that variables like ethnicity contain several categories, and will automatically create the system of dummies for us. To see this, try:
summary(lm(wage ~ ethnicity, data = cps))
##
## Call:
## lm(formula = wage ~ ethnicity, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.278 -3.765 -1.278 2.205 35.222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.2779 0.2439 38.032 <2e-16 ***
## ethnicityhispanic -1.9946 1.0146 -1.966 0.0498 *
## ethnicityother -1.2196 0.6711 -1.817 0.0697 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.117 on 531 degrees of freedom
## Multiple R-squared: 0.01227, Adjusted R-squared: 0.008545
## F-statistic: 3.297 on 2 and 531 DF, p-value: 0.03776
How would you interpret the estimated coefficients?
Compared to the omitted base category (“cauc”), Hispanics make $1.9946 less on average, and other ethnicities make $1.2196 less on average, according to these data.
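You can check this interpretation against the raw group averages: the intercept should equal the mean wage for the base category, and each dummy coefficient should equal the difference between that group’s mean and the base category’s mean. A quick sketch using base R:

aggregate(wage ~ ethnicity, data = cps, FUN = mean)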
Let’s try some models in order to estimate the returns to education. Start with a simple model:
\(wage = \beta_0 + \beta_1 education + \beta_2 gender + \epsilon\)
Estimate this population model, and view the results, using:
model1 <- lm(wage ~ education + gender, data = cps)
summary(model1)
What are the estimated returns to education?
An additional year of education is associated with an additional $0.75 in hourly wage, holding gender constant.
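If you want a margin of error around this estimate, a minimal sketch using base R’s confint():

coef(model1)["education"]     # the point estimate
confint(model1, "education")  # a 95% confidence interval for the returns to education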
One use of the estimated model above (model1) is to create an OLS predicted value for a “representative case”. That is, we could use the model to predict the wage of an individual with certain characteristics. How much (according to the estimated model) can a woman with 12 years of education expect to make? We can get this number with the following code:
predict(model1, data.frame(education = 12, gender = "female"))
We just “plug” the numbers into the estimated equation. Take a moment to let the usefulness of prediction wash over you. You want to know how much a house will sell for? How much product you will sell at a certain price? How much GDP will be lost if the temperature rises by 3\(^{\circ}\)C?
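As a sketch of this idea, predict() can also report the uncertainty around the prediction; adding interval = "prediction" returns lower and upper bounds for an individual’s wage rather than just a point estimate:

predict(model1, data.frame(education = 12, gender = "female"), interval = "prediction")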
In order to avoid omitted variable bias, we need to include variables that are correlated with education, and are determinants of wage. Notice that things like age and experience are correlated with education (think about why):
cor(cps$education, cps$experience)
cor(cps$education, cps$age)
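Equivalently, you can view all of the pairwise correlations at once by passing the relevant columns to cor(); a quick sketch:

cor(cps[, c("education", "experience", "age")])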
Hence, when we add these variables to the regression, we should expect the estimate for the returns to education to change. Estimate a model using all of the variables available in the data set:
model2 <- lm(wage ~ education + experience + age + ethnicity + region + gender + occupation + sector + union + married, data = cps)
summary(model2)
Look at the results, and interpret the estimates. Which variables are statistically significant?
gender, occupation, and union are the only variables that appear to be statistically significant.
Using the above estimated model, test the hypothesis that the returns to education are zero. The null and alternative hypotheses are:
\(H_0: \beta_{educ} = 0\)
\(H_A: \beta_{educ} \neq 0\)
The t-statistic associated with this test is:
\(t = \frac{b_{educ} - 0}{s.e.(b_{educ})}\)
Take another look at the summary for model2:
summary(model2)
and note that \(b_{educ} = 0.8128\) and that \(s.e.(b_{educ}) = 1.0869\). The t-statistic is then:
test1 <- 0.8128 / 1.0869
Notice test1 pop up in the top-right window. Find this value in the summary(model2) table.
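Rather than retyping numbers from the table, you can also pull the education row directly from the summary; a sketch:

coef(summary(model2))["education", ]  # Estimate, Std. Error, t value, Pr(>|t|)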
We need a p-value for this t-statistic, but first we need to know the degrees of freedom for this t-distribution. The sample size is 534 (see the top-right window), and we have estimated 17 \(\beta\)s (count them), so the degrees of freedom are \(534 - 17 = 517\). Find the degrees of freedom in the summary(model2) table.
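R can also report this number directly; as a quick check:

df.residual(model2)  # residual degrees of freedom: 534 observations minus 17 estimated coefficients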
Now our p-value is found by:
(1 - pt(test1, 517)) * 2
## [1] 0.4549118
The pt() command gives the probability to the left of the t-statistic. Since our t-statistic is greater than zero, we want the area to the right, so we have to use 1 - pt(), and since our alternative hypothesis is two-sided we need to multiply by two: * 2. Be sure that you can find this p-value in the summary(model2) table.
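An equivalent sketch that works whether the t-statistic is positive or negative:

2 * pt(-abs(test1), 517)  # same two-sided p-value, without worrying about the sign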
Notice that all tests of “no effect” have been automatically performed by R, but that any test where the \(\beta\) takes a non-zero value under the null hypothesis needs to be calculated manually.
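For example, to test \(H_0: \beta_{educ} = 1\) against \(H_A: \beta_{educ} \neq 1\) (the value 1 is chosen purely for illustration), a sketch of the manual calculation:

test2 <- (0.8128 - 1) / 1.0869  # subtract the hypothesized value under the null
2 * pt(-abs(test2), 517)        # two-sided p-value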
Make sure the top-left window (your R script) is active (click on it), then click “File”, “Save As…”, and name your script file.