<- c(1,2,3)
my_very_first_vector my_very_first_vector
[1] 1 2 3
Michal Wypych
So far we have worked only with single values or objects storing only one value. However, usually you want to work with whole sets of values like variables or whole datasets. There is a number of types of data you can encounter in R which allow you to do that. A fairly easy way to orient yourself in the different types of data is:
The most basic type of data is a vector. Vectors can store any number of values of the same type in 1 dimension. You can create a vector using c()
function.
Vectors are indexed: they have a first, second value etc. This means that you can access part of a vector -subset them. Subsetting is accomplished with []
. You can also subset a range of values from a vector wirh [:]
:
If you try to put different types of values into one vector R will convert the types to a matching one. This is especially important when due to some mistake/error a single value of a different type gets lost in some other variable. Just a single value will trigger the whole variable to be converted!
[1] "1" "TRUE" "some text"
[1] "character"
You can get a brief summary of a given vector with summary()
. It will give slightly different information depending on what type of values is stored in a given vector:
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.25 5.50 5.50 7.75 10.00
Length Class Mode
3 character character
You can make pretty much the same operations on vectors as on single values. One of the great features of R is that by default it will make operations element wise - if you try to add two vectors together then the first element from vector 1 will be added to first element of vector 2 and so on (the fancy name for this is vectorization).
If the vectors have different lengths then R will start to recycle values from the shorter vectors. But it will output a warning if the length of one vector is not a multiple of the other vector.
Warning in short_v + long_v: longer object length is not a multiple of shorter
object length
[1] 2 4 6 5 7
If you want to join two vectors together you can do it in 2 ways: the first one is with c()
just like creating a new vector (and in fact it will simply create a new vector!). The other one is with append()
. The first argument is the vector you want to append to and the second argument is the vector you want to append. append()
also allows you to specify where to append the second vector with after
argument that requires an index so you can put it e.g. inside the first vector rather than at the end
Another thing about vectors is that they can be named: each element can have a name. This can be especially useful e.g. when the vector is a result of some statistical operations and you want to make it easier to understand which number means what (e.g. you want to put together the mean, median and mode in 1 vector). You can add names to elements in a vector simply with a =
:
Before we move on to factors lets introduce a few functions that can be useful for creating vectors:
rep()
function allows you to repeat a given value or a vector n times. You can create large vectors with it easily:
seq()
allows you to create a sequence of numbers from some number to some number. You can either specify how long the sequence is to be and R will figure out the distances between numbers (lenght.out
argument) or you can specify the distances between numbers with by
argument and R will figure out the length of a resulting vector.
rnorm()
allows you to draw random numbers from a normal distribution with a specified mean and standard deviation (there is actually a whole family of drawing numbers from different distributions e.g. rbinom()
for drawing from binomial distribution or rbeta()
for drawing from beta distribution):
Factors are much like vectors except that they are used for storing categorical values - they have levels. You can store variables such as country or experimental condition of participants in a factor. You can create a factor by calling factor()
and passing it a vector as an argument.
Factors can also have ordered levels. You can make an ordered factor by setting ordered = T
argument when creating a factor. Notice how the output looks different now: it includes information on the order.
ordered_vec <- c('low', 'high', 'high', 'low', 'low')
ordered_fac <- factor(ordered_vec, ordered = T)
ordered_fac
[1] low high high low low
Levels: high < low
You can also manually set the levels of a factor. You can do it when creating the factor. Notice that for ordered factor the order in which you pass the levels will determine the order of levels in the factor.
Matrices are a bit like vectors but they have two dimensions. They have rows and columns but treat them in the same way. Because of this they can store only one type of values (much like vectors). You can create a matrix from scratch with the matrix()
function. This function takes a vectors of values as its input (these are the values we will fill our matrix with) and additional information on how the matrix has to look - how many columns and rows it should have and whether to fill the matrix with values by rows or columns
numbered_vector <- c(1,2,3,4,5,6,7,8,9)
my_matrix <- matrix(numbered_vector, nrow = 3, ncol = 3)
my_matrix
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
You can also create a matrix by ‘glueing’ vectors together. You can bind them either as rows (rbind()
function) or by columns (cbind()
function). Notice that the names of the vectors will be used either as names of rows or columns.
Since we have two dimensions subsetting matrices can work both on rows and columns. The general idea is still the same but we have to specify whether we are subsetting rows or columns. Rows always come first, columns second separated by a comma like this matrix[rows,columns]
. You can select ranges of rows or columns just like in a vector.
If you want to select all rows or columns you can leave the space blank. Remember to keep the comma though!
Operations on matrices follow similar rules like operations on vectors - they are element-wise by default (note that they are not your classical linear algebra operations!). E.g. if you multiply a matrix by a vector each row from the matrix will be multiplied by a given element from the vector (1st row by 1st value etc):
In a day to day analysis you will likely work with data frames most of the time. A data frame is like a matrix in that it has rows and columns but can store different types of values in each column (so that e.g. you can have some variables that are numeric and others that are text). A different way of thinking about data frames is as a list of vectors of the same length with each vector representing a different variable. Each row represents a different observation (e.g. participant).
You can create a data frame with data.frame()
function passing all the variables as arguments. Lets create 3 vectors: author, title and year.
author <- c('Allport', 'Heider', 'Lewin', 'Allport', 'Heider')
title <- c('Nature of Prejudice', 'Psychology of interpersonal relations',
'Principles of Topological Psychology', 'Psychology of Rumor', 'The life of a psychologist: An autobiography')
year <- c(1954, 1958, 1936, 1947, 1983)
psych_books <- data.frame(author, title, year)
psych_books
author title year
1 Allport Nature of Prejudice 1954
2 Heider Psychology of interpersonal relations 1958
3 Lewin Principles of Topological Psychology 1936
4 Allport Psychology of Rumor 1947
5 Heider The life of a psychologist: An autobiography 1983
Subsetting data frames works the same way as matrices. You can subset both on rows and columns. An important thing to remember (and one of the reasons a lot of people switch to tibbles which are kind of data frames+. We’ll get to tibbles some time in the future) is that if you subset a single column the result will be a vector and not a dataframe. This sometimes is annoying if you are designing something that is supposed to work on data frames specifically.
There is one additional way of subsetting a data frame. Subsetting variables based on their position is tiresome because we rarely remember the order of all the columns (especially as our data frames get bigger). You can select a single variable using $
:
Using the $
operator reflects a way of thinking about datasets that is pretty common: data frames are ordered collections of variables and each variable has its name. You can also use $
to create new variables. Just assign a vector of values to a new name in your dataframe:
psych_books$discipline <- c('intergroup relations', 'social psychology',
'general psychology', 'social psychology',
'biography')
psych_books
author title year
1 Allport Nature of Prejudice 1954
2 Heider Psychology of interpersonal relations 1958
3 Lewin Principles of Topological Psychology 1936
4 Allport Psychology of Rumor 1947
5 Heider The life of a psychologist: An autobiography 1983
discipline
1 intergroup relations
2 social psychology
3 general psychology
4 social psychology
5 biography
Remember how we talked about element-wise operations on vectors? You can leverage it to easily create new variables that are results of operations on other variables. Thanks to this you can add a whole new variable that is a result of a mathematical operation in just one line (imagine adding a variable that is a sum of all points from a quiz for each student). Here’s an example if we wanted to calculate how many years ago each book from our data frame was published:
author title year
1 Allport Nature of Prejudice 1954
2 Heider Psychology of interpersonal relations 1958
3 Lewin Principles of Topological Psychology 1936
4 Allport Psychology of Rumor 1947
5 Heider The life of a psychologist: An autobiography 1983
discipline book_age
1 intergroup relations 68
2 social psychology 64
3 general psychology 86
4 social psychology 75
5 biography 39
You can also subset data frames based on condition. Lets say we want to find out which psychology books in our dataset are really old, say above 70. We can subset psych_books
using []
but we need to add one more thing to specify our condition - we need to tell R which rows are the ones that fulfill our condition. We can do it with which()
. It needs the condition as argument and will return numbers of rows from the data frame that fulfill the condition. We just need to put which()
inside our subsetting to get the rows we want:
Lists are the final type of basic data in R we will discuss here. They are the most versatile ones - they can store anything inside of them: single values, vectors, matrices, dataframes or even other lists! So they have 1 dimension but can store anything inside them. One important feature of lists is that they are ordered: you can access their elements by position. So you can think of lists as collections of objects but there aren’t really limits to what these objects are (in fact data frames are very specific lists: they are collections of variables that have the same length and form a nice rectangular table). Lists are created with list()
. Lets create a list of the plants in a house along with a value storing information on how many days ago did we last water them:
list_of_objects <- list(
plants = c('Calathea', 'Chamedora', 'Pilea', 'Philodendron'),
days_since_watering = 5
)
list_of_objects
$plants
[1] "Calathea" "Chamedora" "Pilea" "Philodendron"
$days_since_watering
[1] 5
You might encounter lists if you need to store a number of different things together. E.g. results of many statistical analyses are stored in lists because they might contain both information about the model, data and results. In fact as you dive deeper into R you will start to encounter more and more lists because they are very versatile.
You can access objects stored in lists in a few ways. You can use the []
you used for all other types of data. An important feature of this type of subsetting is that the result will always be a list (even if it has only 1 element). The other option is to use double square brackets [[]]
. This will extract the object inside a list so the result won’t be a list (you can think of it as ‘getting deeper’ into the list to extract the exact element you want).
$plants
[1] "Calathea" "Chamedora" "Pilea" "Philodendron"
[1] "Calathea" "Chamedora" "Pilea" "Philodendron"
If the elements in your list are named you can also use $
to extract them.
Suppose you ran a survey among your friends to what extent they agree with a statement “I like pineapples” and got 6 answers: “agree”, “disagree”, “agree”, “somewhat agree”, “somewhat disagree”, “disagree”. Choose the most appropriate type of object to store this information and save it as answers
.
#' We need a data structure that can store a categorical variable with various levels.
#' These levels also have some order to them. The best structure to store the answers is
#' an ordered factor. We can create it by turning a vector into a factor,
#' setting ordered argument to TRUE and specifying the levels of values.
pinapple_factor <- factor(c("agree", "disagree", "agree", "somewhat agree", "somewhat disagree", "disagree"), ordered = TRUE, levels = c("disagree", "somewhat disagree", "somewhat agree", "agree"))
pinapple_factor
Create v1
vector as 500 random numbers from normal distribution with mean of 3 and standard deviation of 1 and a v2
vector as 500 random numbers from normal distribution with mean of 2.5 and standard deviation of 1.5. Append v2
to v1
to create one vector with 1000 values and save it as v4
. Create a v3
vector by repeating c("a", "b")
500 times. Put together v4
and v3
as a dataframe. Then calculate which rows: a
or b
has a higher mean of v4
?
#' We need to create 2 vectors, put them into a data frame and then calculate means for
#' subsetted data frame. We can create v4 by using rnorm() with appropriate arguments twice
#' and putting the resulting vectors together into v4.
#' v3 can be created with rep() function to repeat vector c("a", "b") 500 times.
#' Then we need to put v3 and v4 into a dataframe with data.frame and save it as e.g. df.
#' Finally we need to calculate 2 means on subsetted dataframes. We can use which() inside
#' df[] to get the rows that match our condition (v1 == "a" or v1 == "b") and get the mean
#' of v4 from the subsetted data frames.
v1 <- rnorm(500, 3, 1)
v2 <- rnorm(500, 2.5, 1.5)
v4 <- c(v1, v2)
v3 <- rep(c("a", "b"), 500)
df <- data.frame(v3, v4)
mean(df[which(df$v3 == "a"), "v4"])
mean(df[which(df$v3 == "b"), "v4"])
Suppose you are interested in assessing how effective a given drug is. You have the following data: Out of 900 people who took the drug 657 got better. Out of 1000 people who did not take the drug, 540 got better. Represent this information as a matrix with columns coding those who took a drug or didn’t take it and rows representing those who got better or not. Next calculate the percentage of people who got better in the drug and no drug conditions.
# we have all the information that we just need to put into a matrix.
# Remember that columns are supposed to code whether someone took the drug or not
# (so 1st column should sum to 900 and second to 1000) and rows code whether someone got better or not.
# We can get all the necesary values into the vector: we know how many people took the drug or
#not and how many got better in each condition. We can substract the appropriate numbers
# to get how many people took (or didn't take) the drug and did not get better
# we then need to create a column with 2 rows and carefully set the byrow argument
# so that it will match our input vector.
# Finally to get the percentages we need to divide values from the first row and appropriate column by the sum of appropriate column.
trial_matrix <- matrix(c(657, 900-657, 540, 1000 - 540), nrow = 2, byrow = F)
got_better_drug_percent <- trial_matrix[1,1]/sum(trial_matrix[,1])
got_better_nodrug_percent <- trial_matrix[1,2]/sum(trial_matrix[,2])
got_better_drug_percent
got_better_nodrug_percent