This tutorial will help you get accustomed to R and RStudio. Please refer to Lecture Slides 1 for information on how to install both. R is a free software environment for statistical computing and graphics. It has been one of the faster-growing programming languages in recent years, and is used in a wide array of industries. R has the advantage that it is free and open-source, and that thousands of users have contributed “add-on” packages that anyone can download. RStudio is a free integrated development environment (IDE) for R, which we will be using for the course.
You will:
Search your computer for RStudio.exe and open the program. It should look something like this:
Click on “File”, “New File”, “R Script”.
R’s arithmetic operators include:
Operator | Function |
---|---|
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
^ | exponentiation |
Type the following command into your R script: `1 + 2`. Run the command by highlighting it, or by placing the cursor on that line, and clicking “Run”. Ignore the `[1]` that is printed out (it only indicates that the output starts at its first element).
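As a quick illustration of the operators in the table, the lines below can be typed into the same script. The numbers are arbitrary; the point is that R follows the usual order of operations, and that parentheses can be used to change it.

```r
# Each line is a separate command; run them one at a time or all together.
7 - 2        # subtraction: 5
3 * 4        # multiplication: 12
10 / 4       # division: 2.5
2 ^ 3        # exponentiation: 8
1 + 2 * 3    # multiplication is done before addition: 7
(1 + 2) * 3  # parentheses change the order: 9
```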
You can create objects in R. Objects can be vectors, matrices, character strings, data frames, etc. Create two different scalars (give them any names you like; they don’t have to be `a` and `b`):
a <- 3
b <- 5
Note that you can copy-paste commands in the gray boxes directly into R.
We have created two new objects called `a` and `b`, and have assigned them values using the assignment operator `<-` (the “less than” symbol followed by the “minus” symbol). Notice that `a` and `b` pop up in the top-right of your screen. We can now refer to these objects by name:
a * b
## [1] 15
To create a vector in R we use the “combine” function, `c()`:
myvector <- c(1, 2, 4, 6, 7)
To call a function in R, type the name of the function, followed by its arguments in parentheses: `functionname(arguments)`.
There are thousands of functions in R. Here are a few that we’ll need:
Function |
---|
sum() |
mean() |
var() |
summary() |
Try all of these functions on `myvector`. For example:
sum(myvector)
## [1] 20
The `sum()` function looks for arguments that it can add together. Put an object in the parentheses, and the function will try to add up its elements. Try `mean(myvector)`. You can get help on a function by typing `?` followed by the function name, and running the command. For example, try `?summary`.
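For reference, here is a short sketch of the remaining functions from the table applied to the same vector. The vector is re-created so that the lines run on their own.

```r
myvector <- c(1, 2, 4, 6, 7)
mean(myvector)     # sample mean: 4
var(myvector)      # sample variance (divides by n - 1): 6.5
summary(myvector)  # minimum, quartiles, median, mean, and maximum
```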
Logical operators are used to determine whether something is `TRUE` or `FALSE`. Some logical operators are:
Operator | Function |
---|---|
> | greater than |
== | equal to |
< | less than |
>= | greater than or equal to |
<= | less than or equal to |
!= | not equal to |
The operators `&` (“and”) and `|` (“or”) may be used to combine comparisons made with the above operators.
Try entering the following commands:
8 > 4
## [1] TRUE
b == 6
## [1] FALSE
b > 2
## [1] TRUE
myvector > 3
## [1] FALSE FALSE TRUE TRUE TRUE
myvector > 3 & myvector < 7
## [1] FALSE FALSE TRUE TRUE FALSE
For the last example, R has checked whether each element in `myvector` is greater than 3 and less than 7.
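The “or” operator works in the same element-by-element way. A small sketch, re-creating the vector so the lines run on their own:

```r
myvector <- c(1, 2, 4, 6, 7)
# TRUE where an element is below 2 OR above 6
myvector < 2 | myvector > 6
# Because TRUE counts as 1 and FALSE as 0, sum() counts how many elements pass a test
sum(myvector > 3)  # 3
```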
We will use these logical operators for “indexing”, which allows us to choose certain observations in the data set based on the values of the variables. In other words, it allows us to “subset” the data.
Indexing allows us to extract parts of a vector, matrix, or data set. For example, to get the third value from `myvector` use:
myvector[3]
## [1] 4
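Square brackets accept more than a single position. The sketch below shows a few common variations on the same vector; it is meant as an illustration, not a complete list.

```r
myvector <- c(1, 2, 4, 6, 7)
myvector[c(1, 3)]       # first and third elements: 1 4
myvector[2:4]           # second through fourth elements: 2 4 6
myvector[-1]            # everything except the first element: 2 4 6 7
myvector[myvector > 3]  # only the elements greater than 3: 4 6 7
```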
We need to create a matrix for the next example. Copy-paste the following command and run it (don’t worry about what it means, just do it):
mymatrix <- matrix(c(rep(1:2,2), 1:8), 4, 3)
Again, you’ll see the object you created pop up in the top-right of your screen. Click the “spreadsheet” icon next to the data set to view it:
Close the “mymatrix” tab when you are done. You can also view an object by entering its name:
mymatrix
##      [,1] [,2] [,3]
## [1,]    1    1    5
## [2,]    2    2    6
## [3,]    1    3    7
## [4,]    2    4    8
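For readers curious about the command that built `mymatrix`, it can be unpacked into steps. The intermediate name `values` is introduced here only for illustration.

```r
# rep(1:2, 2) repeats the sequence 1, 2 twice (1 2 1 2); 1:8 is the sequence 1 to 8
values <- c(rep(1:2, 2), 1:8)
length(values)  # 12 values in total
# matrix() fills the matrix column by column: 4 rows and 3 columns
mymatrix <- matrix(values, nrow = 4, ncol = 3)
dim(mymatrix)   # number of rows and columns: 4 3
```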
Suppose you want to extract the `6` from the matrix. You would ask for the 2nd row, 3rd column:
mymatrix[2, 3]
## [1] 6
Suppose you wanted the entire 3rd row of the matrix. You would ask for the 3rd row, and all columns (leave a blank):
mymatrix[3, ]
## [1] 1 3 7
Now, let’s combine indexing and logical operators. First, let’s see which elements in the first column are equal to 1:
mymatrix[, 1] == 1
## [1] TRUE FALSE TRUE FALSE
and use this to get only the rows that have a 1 in their 1st column:
mymatrix[mymatrix[, 1] == 1, ]
##      [,1] [,2] [,3]
## [1,]    1    1    5
## [2,]    1    3    7
On your own, try to extract only the rows for which the number in the 2nd column is greater than or equal to 2, and also less than 4.
mymatrix[mymatrix[, 2] >= 2 & mymatrix[, 2] < 4, ]
This will get easier when the columns have names.
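To see why names help, here is a sketch that gives the matrix columns some names and repeats the selection above. The names `group`, `x` and `y` are made up for this example.

```r
mymatrix <- matrix(c(rep(1:2, 2), 1:8), 4, 3)
colnames(mymatrix) <- c("group", "x", "y")  # hypothetical column names
# Same selection as before, but referring to the column by name instead of position
picked <- mymatrix[mymatrix[, "x"] >= 2 & mymatrix[, "x"] < 4, ]
picked
```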
The data for this tutorial was scraped by Abdulshaheed Alqunber and is originally from VGChartz (https://www.vgchartz.com/). We are using a sub-sample of the data, and only look at video games with at least $100,000 USD in global sales.
The data is in the “comma-separated values” (“.csv”) format. This is a very simple and common format for storing data.
RStudio can read data from your computer, or from the internet. Load the data with:
mydata <- read.csv("http://home.cc.umanitoba.ca/~godwinrt/data/vidsales.csv")
We have created a new object called `mydata`, and have assigned it the values in the `.csv` file using the assignment operator `<-`. You could have chosen a name other than `mydata`.
The `mydata` object shows up in the top-right of your screen. The sample size is 4706 and there are 8 variables.
Click the “spreadsheet” icon next to the data set to view it. Take a moment to look through the data, and make sure you have an idea of what each of the variables is. The `Sales` variable is either the total global sales of the game, or the sales from bundling with consoles, and is measured in millions of US dollars. `Score` is the critic score from video game reviewers, and is on a scale of 0 to 10, with 10 being best. Notice that each video game (each observation in the data set) takes up a different row, while each type of information on the video game (each variable) takes up a different column. Close the “mydata” tab when you are done viewing.
Click the blue arrow next to `mydata` in the top-right of your screen.
Each variable name is listed, along with some info. Notice that the variables each have a type:
Type | Description |
---|---|
chr | A character string (words) |
num | Any real (continuous) number |
int | Any integer |
The “character” variables can be used to create dummy variables (more on this later!)
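The same information can also be obtained from the console. The sketch below uses a tiny made-up data set (the three games and their values are invented for illustration, not taken from the course data), read with `read.csv()` from text instead of a file so that it runs without an internet connection. The same functions can be applied to `mydata`.

```r
# A hypothetical three-row data set in .csv form
toy <- read.csv(text = "Name,Score,Sales,Year
Game A,8.5,3.20,2005
Game B,6.0,0.45,2010
Game C,9.1,12.70,2017")
dim(toy)            # number of rows (observations) and columns (variables): 3 4
names(toy)          # the variable names
sapply(toy, class)  # the type of each variable
str(toy)            # a compact overview, similar to the blue-arrow view
head(toy, 2)        # the first two rows
```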
To explore the data, we can calculate “summary” statistics for all the variables:
summary(mydata)
##      Name              Genre           ESRB_Rating          Platform
##  Length:4706        Length:4706        Length:4706        Length:4706
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##   Publisher             Score            Sales              Year
##  Length:4706        Min.   : 1.00   Min.   : 0.010   Min.   :1985
##  Class :character   1st Qu.: 6.50   1st Qu.: 0.150   1st Qu.:2004
##  Mode  :character   Median : 7.50   Median : 0.420   Median :2008
##                     Mean   : 7.27   Mean   : 1.177   Mean   :2007
##                     3rd Qu.: 8.30   3rd Qu.: 1.100   3rd Qu.:2010
##                     Max.   :10.00   Max.   :82.860   Max.   :2020
The minimum, maximum, quartiles, and mean have been calculated for each numeric variable. For example, the sample mean of `Score` is 7.27.
We can extract variables from the data set, and apply functions to them. To calculate the sample variance of `Sales`:
var(mydata$Sales)
## [1] 7.607421
Important: when we type `mydata$Sales` we are getting the `Sales` variable from within the `mydata` data set.
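The `$` operator is one of several equivalent ways to pull a column out of a data set. A minimal sketch on a small made-up data frame (the name `toy` and its values are hypothetical):

```r
toy <- data.frame(Score = c(8.5, 6.0, 9.1), Sales = c(3.20, 0.45, 12.70))
toy$Sales       # the Sales column as a vector
toy[["Sales"]]  # the same column, selected by name in double brackets
toy[, "Sales"]  # the same column again, using row/column indexing
max(toy$Sales)  # any function that accepts a vector can be applied: 12.7
```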
Now, calculate the correlation between `Sales` and `Score`:
cor(mydata$Sales, mydata$Score)
## [1] 0.2634555
What does this tell you?
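As a reminder of what `cor()` computes: the sample correlation is the sample covariance divided by the product of the two sample standard deviations, and it always lies between -1 and 1. The sketch below checks this on two short made-up vectors (`x` and `y` are illustrative, not course data).

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
# Covariance divided by the product of the standard deviations
by_hand <- cov(x, y) / (sd(x) * sd(y))
by_hand    # 0.8
cor(x, y)  # the built-in function gives the same value
```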
We can use indexing, together with logical operators, to create sub-samples.
Nintendo is my favourite publisher. Let’s see their critic scores vs. the critic scores from other publishers:
mean(mydata$Score[mydata$Publisher == "Nintendo"])
## [1] 7.799296
mean(mydata$Score[mydata$Publisher != "Nintendo"])
## [1] 7.21722
Notice how we have used logical operators and indexing to create two sub-samples: Nintendo games, and other games.
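The pattern can be made more readable by storing the logical condition in its own object first. The sketch below does this on a small made-up data frame (publishers and scores are invented for illustration). It also uses `!`, which flips `TRUE` and `FALSE`, and `tapply()`, which computes one mean per group in a single call.

```r
toy <- data.frame(
  Publisher = c("Nintendo", "Nintendo", "Sega", "Sega", "Sega"),
  Score     = c(8, 9, 6, 7, 8)
)
is_nintendo <- toy$Publisher == "Nintendo"  # TRUE for Nintendo rows
mean(toy$Score[is_nintendo])                # Nintendo games: 8.5
mean(toy$Score[!is_nintendo])               # all other games: 7
by_publisher <- tapply(toy$Score, toy$Publisher, mean)
by_publisher                                # one mean per publisher
```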
On your own, determine the total global sales in the data set for the publisher “Rockstar Games”.
sum(mydata$Sales[mydata$Publisher == "Rockstar Games"])
Now, find all the video games that have received a perfect score of 10.
mydata[mydata$Score == 10,]
For the rest of the tutorial, we’ll use a sub-sample of all video games that have sales of 2 million USD or more:
vid2 <- mydata[mydata$Sales >= 2, ]
The above line creates a new data set by selecting only the rows from `mydata` which have `Sales >= 2`, and by selecting all columns (notice the blank space in the brackets `[]` after the comma). The new data set shows up in the top-right panel. Check the sample size.
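The sample size can also be checked from the console with `nrow()`. The sketch below uses a small made-up data frame (names and sales are invented) and also shows `subset()`, a built-in function that selects the same rows with slightly shorter syntax.

```r
toy <- data.frame(Name = c("A", "B", "C", "D"), Sales = c(0.5, 2.0, 3.4, 1.9))
big <- toy[toy$Sales >= 2, ]     # rows with Sales >= 2, all columns
nrow(big)                        # sample size of the sub-sample: 2
big2 <- subset(toy, Sales >= 2)  # subset() selects the same rows
big2$Name
```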
Visualization is important. Plot a histogram of critic scores:
hist(vid2$Score)
Many options are available to customize the histogram; see `?hist`. Let’s add some labels, and control the number of “breakpoints”:
hist(vid2$Score,
     main = "Histogram of video game critic scores",
     xlab = "score",
     breaks = 10)
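One detail worth knowing: according to `?hist`, a single number passed to `breaks` is treated as a suggestion only, and R picks “pretty” breakpoints near that number. To control the bins exactly, a vector of breakpoints can be supplied instead. A minimal sketch, using made-up scores in place of the course data:

```r
scores <- c(5.5, 6.1, 6.8, 7.0, 7.2, 7.9, 8.3, 8.4, 9.0, 9.6)
# Exact breakpoints at every whole number from 5 to 10
h <- hist(scores, breaks = seq(5, 10, by = 1),
          main = "Histogram with explicit breakpoints", xlab = "score")
h$breaks  # the breakpoints actually used: 5 6 7 8 9 10
h$counts  # the number of scores that fall in each bin
```

Saving the result of `hist()` in an object, as done here with `h`, is optional; it just makes the breakpoints and counts available for inspection.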
The scatterplot is one of the most widely used tools for visualizing the relationship between two variables. Draw a scatterplot of `Sales` against `Score`, adding a title and labelling the axes:
plot(vid2$Score, vid2$Sales,
     main = "critic scores and video game sales",
     xlab = "score", ylab = "Sales")
We can also change the color and style of the dots:
plot(vid2$Score, vid2$Sales,
     col = 3,
     pch = 16)
Type `?pch` to see the different styles, and Google to see the different colours.
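R can also list its colours directly. The sketch below shows which colours the numbers refer to, lists a few of the built-in colour names, and draws a small made-up plot with a colour given by name.

```r
palette()       # the colours that the numbers 1, 2, 3, ... refer to
head(colors())  # the first few of R's built-in colour names
# A colour can also be given by name instead of by number
plot(1:5, c(2, 4, 3, 5, 1), col = "steelblue", pch = 16)
```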
Let’s create a new variable that we’ll use to select the colour of each data point. Make a new variable called `G` inside the `vid2` data set, and use `G` to control the colour of each data point. We begin by setting all rows of the variable equal to 1, and then change the value of `G` based on the game’s `Genre`. Copy and paste the code below:
vid2$G <- 1
vid2$G[vid2$Genre == "Action"] <- 2
vid2$G[vid2$Genre == "Sports"] <- 3
vid2$G[vid2$Genre == "Shooter"] <- 7
vid2$G[vid2$Genre == "Role-Playing"] <- 4
vid2$G[vid2$Genre == "Platform"] <- 5
vid2$G[vid2$Genre == "Racing"] <- 6
For example, if `Genre` is equal to `"Action"`, the variable `G` gets a value of 2. Look at this `G` variable in the spreadsheet to understand what has happened. Now, use this variable to determine the colour for each data point:
plot(vid2$Score, vid2$Sales, col = vid2$G, pch = 16,
     main = "video game sales and scores by genre",
     xlab = "score", ylab = "sales")
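The same colour codes can also be assigned in one step with a named lookup vector, which keeps the genre-to-colour mapping in a single place. This is only an alternative sketch, not the approach used in this tutorial. The names `genre_colours` and `genre` are hypothetical, and the short `genre` vector stands in for `vid2$Genre`.

```r
# Hypothetical lookup table: genre name -> colour number
genre_colours <- c("Action" = 2, "Sports" = 3, "Shooter" = 7,
                   "Role-Playing" = 4, "Platform" = 5, "Racing" = 6)
genre <- c("Action", "Puzzle", "Racing", "Shooter")  # stand-in for vid2$Genre
G <- unname(genre_colours[genre])  # look up each genre; unlisted genres give NA
G[is.na(G)] <- 1                   # every other genre gets colour 1
G                                  # 2 1 6 7
```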
We need to add a legend to the plot to show what the colours mean:
plot(vid2$Score, vid2$Sales, col = vid2$G, pch = 16)
legend("topleft",
       legend = c("Action", "Sports", "Shooter", "Role-Playing", "Platform", "Racing", "Other"),
       col = c(2, 3, 7, 4, 5, 6, 1), pch = 16)
In the “plots” window, click “Export” and “Save as Image…”. This way, you can import graphics you create into Word, or another word processor.
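A plot can also be saved from code, which is handy when a figure needs to be re-created later. The sketch below writes a small made-up scatterplot to a PDF file in a temporary folder; the file name `scatter.pdf` and the data points are hypothetical. The function `png()` can be used in the same way to produce an image file, provided PNG support is available in your R installation.

```r
out_file <- file.path(tempdir(), "scatter.pdf")  # hypothetical file name
pdf(out_file, width = 7, height = 5)             # open a PDF graphics device
plot(c(6, 7, 8, 9), c(2.5, 3.1, 4.0, 9.8), pch = 16,
     xlab = "score", ylab = "sales")
dev.off()                                        # close the device so the file is written
file.exists(out_file)                            # check that the file was created
```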
Make sure the top-left window is active, then click “File”, “Save As…”, and name your script file.
If you get too much junk floating around in R, you can clear the environment by clicking the “broom” icon in the top-right, and start over.
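Objects can also be removed from the console. In the sketch below, `scratch_a` and `scratch_b` are throwaway names created only for the example.

```r
scratch_a <- 3
scratch_b <- 5
ls()                      # lists the objects currently in the environment
rm(scratch_a, scratch_b)  # removes only the named objects
# rm(list = ls())         # would remove everything, like clearing the environment
```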