Functions
What is a function?
Most of the things we do in R don't have to be written from scratch. We have many tools available to get what we want. These tools are functions.
How is a function built?
The first approximation to how a function is built is to think of it as a kind of machine. The machine takes some inputs, processes them in some way and returns outputs. The inputs are the arguments you provide to a function like a vector or a dataset. The result is the output. Very often the insides of a function, the machinery within it that is responsible for getting from the input to the output is a black box to us. We have no clue how exactly a function arrives at its result. Sometimes we don’t need to know it but in many situations at least some knowledge is necessary to be certain that the function does exactly what we need it to do and won’t surprise us (an annoying example we will get to later is silent dropping of missing values by some functions).
Types of arguments
Let's focus on the inputs. There are a few kinds of them. The most basic ones are input arguments - this is what you put into the machine. Apart from these, there are a few other types of arguments that allow you to have more control over the behavior of functions. They are a bit like toggles and switches on a machine that change how it operates.
Default arguments: arguments that are set to some default value. This value will be used unless specified otherwise. An example is the na.rm argument of mean() or sum(). This argument is set to FALSE by default, so the function will return NA if there are any missing values in the input argument. This is such a good example because it also stresses why choosing proper defaults is really important when writing functions. A lot of people, when they first encounter functions like mean() or sum(), are surprised or even annoyed. Why in the world set defaults that are more likely to produce such unhelpful results? We are often fixated on avoiding errors in code, but this is not always the way to go in data analysis. We often want functions to operate smoothly and seamlessly. But that is false peace. Smooth behavior is not always what we need from functions. Clunky functions are often good in data analysis because they force us to be explicit about what we do with the data. Even if they make the climb a bit steeper, they are sure to lead us on the right path to the top.
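To see the default in action, here is a small sketch (the vector is made up for illustration):

x <- c(1, 2, NA, 4)
mean(x)               # returns NA because na.rm defaults to FALSE
mean(x, na.rm = TRUE) # drops the NA first and returns 2.333333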
You can also encounter alternative arguments. These arguments have a prespecified set of possible values (usually defined as a vector). For example, the table() function, which gives us a frequency table of a factor, has a useNA argument that specifies whether to count NA values. It can take three different values that specify possible behaviors of the function. You can read more on what they do in the documentation of the function.
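As a quick illustration (a sketch with a made-up vector; see ?table for the full set of allowed values):

x <- c("a", "b", "b", NA)
table(x)                  # by default missing values are not counted
table(x, useNA = "ifany") # adds a count of NAs whenever there are any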
The final type of argument is the ... argument. It is a placeholder for any number and kind of arguments that will later be passed on inside the function, usually as arguments to some internal function. Take lapply() as an example. Apart from the arguments X and FUN, which specify what to loop over and what function to apply to each element of X, it also has the ... argument. It's there because the function you want to apply to every element of X might take some additional arguments. How many and what kind of arguments these are might vary from function to function, and the ... argument allows us to handle this. Any arguments passed in the ... will be used as arguments of the function specified in the FUN argument of lapply().
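A minimal sketch of how this works in practice (the list below is made up):

v_list <- list(c(1, 2, NA), c(3, 4, 5))
# na.rm is not an argument of lapply() itself;
# it is passed through ... to mean(), which is the FUN here
lapply(v_list, mean, na.rm = TRUE)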
Building your own functions
Why spend time building your own functions? There are a few general cases. The first, and probably most obvious one, is when there is no available function that does what you need. For example, there is no built-in function to find the mode of a vector in R. If you want to find it you need to build your own function. Finding a mode is a simple example, but there may be cases where you need to do something more complex or customize the behavior of an already existing function. The second reason is to avoid repetition. If you do similar operations a number of times (e.g. only the dataset or variables change but all the rest stays the same), then copying and pasting code will soon become problematic. It makes code less readable, longer and more difficult to manage. Imagine you need to change one thing in that code. You'll need to change it in every place where it was pasted. Writing a function instead means you can just change how you define the function.
The general logic for defining a function is as follows:

my_function <- function(arguments) {
  #what the function does
}
Turning a chunk of code into a function can be done quite easily in a few steps. Imagine we want to see what the probability is that a random number drawn from one vector will be larger than the mean of another vector (we will simulate a few variables with rnorm(), which draws random numbers from a normal distribution with a given mean and standard deviation):
#simulate vectors
var1 <- rnorm(100, 0, 1)
var2 <- rnorm(100, 1, 3)
var3 <- rnorm(100, .5, 2)
var4 <- rnorm(100, 0, 3)
var5 <- rnorm(100, 0.1, .2)

#calculate mean
mean_1 <- mean(var1)

#calculate length
length_2 <- length(var2)

#calculate how many values in var2 are larger than mean_1
n_larger <- length(var2[var2 > mean_1])

#get the proportion
n_larger/length_2

[1] 0.61
1. Build the scaffolding of the function. This is exactly what is in the code chunk above:
my_function <- function(arguments) {
  #what the function does
}
2. Paste the code you want to turn into a function:
my_function <- function(arguments) {
  #calculate mean
  mean_1 <- mean(var1)
  #calculate length
  length_2 <- length(var2)
  #calculate how many values in var2 are larger than mean_1
  n_larger <- length(var2[var2 > mean_1])
  #get the proportion
  n_larger/length_2
}
3. Identify all the "moving parts": what will change from one use to the next? Each of these things has to get its own argument. In our example the moving parts are the two vectors, var1 and var2.
4. Change each of the "moving parts" in the code chunk into an appropriate argument:
my_function <- function(x, y) {
  #calculate mean
  mean_1 <- mean(y)
  #calculate length
  length_2 <- length(x)
  #calculate how many values in x are larger than mean_1
  n_larger <- length(x[x > mean_1])
  #get the proportion
  n_larger/length_2
}
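With the arguments in place we can reproduce the original calculation by passing the vectors explicitly (a quick sketch of the call, nothing more):

#x is the vector whose values we count, y is the vector whose mean we compare against
my_function(x = var2, y = var1)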
A few remarks on testing
When building your own functions, especially if they are going to be used by other people, it’s a good idea to consider potential weird things that could happen. When first creating a function we usually have its typical behaviour in mind because we just want our function to work. However, there’s a whole bunch of weird stuff that might happen if you don’t prepare for it in advance. For example, imagine you want to create a function from scratch that will output the mean of a numeric vector. You could try to do something like a for loop (yes this is slow and inefficient but it’s just for the purpose of demonstration):
my_mean <- function(x) {
  sum <- 0
  for(i in x){
    sum <- sum + i
  }
  result <- sum/length(x)
  return(result)
}
Pretty straightforward, right? Now let's see our function in action on a typical use case and compare its results to the built-in mean() function:
v <- c(1,2,3,4,5,6,7,8,9)

my_mean(v)
[1] 5
mean(v)
[1] 5
Yay, we get the same result! Seems like our function works! But before we call it a day and start using our own mean function, let's look at some less typical cases. What will happen if the vector has some missing values? Or if it's an empty vector? Or if it is not a numeric vector at all? Let's see:
v_na <- c(1,2,3,NA,5)
v_empty <- c()
v_char <- c("A", "B", "C")

my_mean(v_na)
[1] NA
my_mean(v_empty)
[1] NaN
my_mean(v_char)
Error in sum + i: non-numeric argument to binary operator
We get some weird behaviour. Each of these calls to the my_mean() function returned something different. First we got NA when the vector had NAs in it. Passing an empty vector resulted in NaN - short for Not a Number. Finally, passing a character vector gave us an error. Notice that only the last case produced an error, so if we then used our function in some calculations we might not even notice something is wrong. For example, imagine we calculated a mean with our function from a vector with missing values (e.g. we asked a bunch of participants about their mood 5 times a day and now we want to calculate daily average mood and see how it relates to some variables of interest) and then tried to use its output in some other function that had the na.rm argument set to TRUE. We'd lose a bunch of information without so much as a warning! That's why considering possible but unusual cases for a new function is important. It allows us to prepare for possible future problems.
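One way to prepare for such cases is to make the checks explicit. Below is a minimal sketch of a more defensive version (the name my_mean_safe and the specific choices - erroring on non-numeric input, returning NA for an empty vector - are just one possible design, not the only correct one):

my_mean_safe <- function(x, na.rm = FALSE) {
  #an empty vector has no mean - return NA instead of NaN
  if (length(x) == 0) return(NA_real_)
  #refuse non-numeric input right away instead of failing mid-loop
  if (!is.numeric(x)) stop("x must be a numeric vector")
  #optionally drop missing values, mirroring the na.rm argument of mean()
  if (na.rm) x <- x[!is.na(x)]
  sum <- 0
  for (i in x) {
    sum <- sum + i
  }
  sum/length(x)
}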
Anonymous functions
There are situations in which you might want to use a custom function but not necessarily save it with a name for future use. In such situations we often use what is called an anonymous function (sometimes you can also encounter the term lambda functions). The general way these functions are constructed in R is as follows:
(function(x) WHAT THE FUNCTION DOES)(arguments)
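For instance, an anonymous function can be defined and called in one go (a throwaway example with made-up numbers; since R 4.1 you can also write \(x) as a shorthand for function(x)):

(function(x) x^2)(4)
[1] 16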
A pretty common situation where you can encounter these functions is inside iterations, for example with lapply():
v1 <- c(1,2,3,4)
v2 <- c(3,4,5,6,8)
v3 <- c(-1,4,3,2)
v_list <- list(v1,v2,v3)
lapply(v_list, function(x) {x+x})
[[1]]
[1] 2 4 6 8
[[2]]
[1] 6 8 10 12 16
[[3]]
[1] -2 8 6 4
Documentation
Ok, so we wrote our super cool new function. We tested it and are pretty confident it works properly. Can we finally call it a day? Again, not so fast. We need one more thing. Imagine you take a long holiday and get back to work after a month or two. How confident are you that you will remember how exactly the new function works? Or imagine you share the function you created with other people. Of course you (or they) can read the code of the function to figure it out again, but that is tedious. That's why it's so important to document code. This goes for functions but is just as true for any code that will be used by others or by you in the future. Treat your future self like you would treat another person. Documentation is super important! For a lot of people writing (or reading!) documentation is seen as a tedious and redundant task. I guarantee you that if you don't document your own functions you will regret it sooner or later (probably sooner). Very few functions are self-explanatory enough to not need any form of documentation. In many cases simply using comments in code with # will be enough. Sometimes building vignettes that show how to use some functions can be a better idea. Remember to always leave some description of what a given piece of code is about and what it does.
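As a sketch of what comment-based documentation might look like (the function and its name are made up purely for illustration):

#' prop_above_mean: proportion of values in x that are larger than the mean of y
#' Arguments:
#'   x - numeric vector whose values are compared against the threshold
#'   y - numeric vector whose mean is used as the threshold
prop_above_mean <- function(x, y) {
  length(x[x > mean(y)]) / length(x)
}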
Functions from packages
Since R has a huge community, people are constantly developing new things you can do in R. You don't have to define everything from scratch. Usually, if you need a function for some statistical procedure or, say, for plotting, some package out there already has it. There is no need to reinvent the wheel.
In order to use functions from packages you need to first install the package on your computer. You can do it by calling the install.packages("PACKAGENAME") function. You need to do it only once on a given device (unless you want to update the package or you are using renv, but more on that later). Once the package has been installed you can load it in a given R session by calling library(PACKAGENAME). Remember you need to run it every time you open a new R session. Alternatively, after installing a package you can call a function from it directly without loading the package first by using PACKAGENAME::function_name().
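For example, with the dplyr package (used here purely as an illustration; any CRAN package works the same way):

# install once per machine
install.packages("dplyr")
# load once per R session
library(dplyr)
# or skip loading and call a single function with the :: notation
dplyr::filter(mtcars, cyl == 6)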
Another thing to know about functions from packages is name conflicts. Since R is open source and most of the packages are developed and maintained by the community, it is not so uncommon that two different packages have a function with the same name. You might wonder what will happen if you load both packages and then call this function. Generally, the version from the last package loaded is going to mask the ones loaded earlier. However, this can be problematic, e.g. if you are sharing scripts (and someone changes the order of loading packages) or if you actually want to use the function from the first package.
There are at least two ways of dealing with this problem. The first one is to be explicit. Above we described a second way of calling a function from a package: PACKAGENAME::function_name(). This way you explicitly state which package the function comes from, so you shield yourself from name conflicts. The second way is by specifying additional arguments to the library() function. If you look up its documentation you can see that it has two optional arguments: exclude and include.only. They allow you to load a package without some of its functions, or to load only some functions from a package. This is useful in situations where you want to load two packages with conflicting functions but you know you want to use the conflicting function from only one of them.
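A short sketch of both options (dplyr and its filter() function are used here only as an example of a common conflict, in this case with stats::filter()):

# be explicit about which package a function comes from
dplyr::filter(mtcars, cyl == 6)
# or load the package but keep the conflicting function out
library(dplyr, exclude = "filter")
# or load only the functions you actually need
library(dplyr, include.only = c("mutate", "select"))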
Exercises
Remember the for loop that generated the first n numbers of the Fibonacci sequence from the class on loops? Now turn it into a function that returns the ith through jth Fibonacci numbers. Document the function properly so it is clear what it does.
exercise 1
#' Recall our code for generating 50 numbers from the Fibonacci sequence:
x <- c(0,1)
for(i in 3:50) {
  x[i] <- x[i-1] + x[i-2]
}
#' What we need to do now is wrap this in a function that will generate numbers from the Fibonacci sequence and select from the ith to the jth number.
#' We'll need 2 arguments: starting and ending number
#' Next we'll need to generate numbers from the Fibonacci sequence up to the jth number
#' Finally we'll need to subset the resulting vector from the ith number
#' We also need one more modification: the loop won't give us only the first or second number
#' because the subsetting won't work there. We just need an if else statement that will test if the jth number is 1 or 2 and return the proper result if yes.

get_fibonacci <- function(i,j) {
  x <- c(0,1)
  if (j == 1) {
    result <- x[1]
    return(result)
  } else if (j == 2) {
    result <- x
    return(result[i:j])
  }
  for(k in 3:j) {
    x[k] <- x[k-1] + x[k-2]
  }
  result <- x[i:j]
  return(result)
}
Create a function that calculates the mode of a vector. Consider potential edge cases and provide tests that show your function behaves properly.
exercise 2
#' A mode of a vector is its most common value.
#' One way to try and get a mode of a vector would be to get frequencies of every value in the vector,
#' sort them in descending order and then extract the first value. We can do it with table() to get counts
#' and sort() with decreasing = TRUE to sort. There are just two problems, one small and another bigger:
#' 1. table() will return a named vector of counts - what we'll get is the highest count and not really
#' the value with the highest count. We can fix it by calling names() on the result of table() and sort()
#' to get the values rather than the counts.
#' 2. What if there are two modes? In such cases extracting the highest value won't work because we'll
#' only get one mode. We'll need to adjust for this somehow. One way to do it would be to first
#' extract the first highest value and then loop over the remaining values. In each iteration of the loop
#' we want to test if the ith value is equal to the highest value. If yes - we have another mode and we
#' append it to the results. If no - we got all modes and we can close the loop. One more thing to consider
#' here is whether this loop should run every time. What if we get a vector with just one type of values,
#' e.g. c(1,1,1,1)? Or an empty vector? The result of table() will have length of 1, so we can't loop over
#' the elements other than the most common one because there is just one element.
#' We need to test if the sorted counts have at least 2 elements and run the loop only if yes.
#'
#' Other things we might want to test are: what if we get missing values?
#' table() will remove them by default and not return counts of NAs. What if we provide an empty vector?
#' Again, table() has a default behavior - it will return an empty table that will be turned into NULL when
#' we try to get the names. Are these the behaviors we want for our function?
#' If not then we need to adjust accordingly.
#' There might be other edge cases but these should suffice to get you thinking about what you might encounter when designing functions.

mode <- function(factor){
  #get sorted counts of factor
  sorted <- sort(table(factor), decreasing = T)
  #extract highest count
  largest <- sorted[1]
  if (length(sorted) >= 2) {
    #loop over remaining values
    for (i in 2:length(sorted)) {
      #test if next count is equal to highest count
      if(sorted[i] == largest[1]) {
        #append next mode
        largest <- c(largest, sorted[i])
      } else {
        #if next count is not equal - end the loop
        break
      }
    }
  }
  #get names of all the modes to return modes rather than their counts
  largest <- names(sorted[1:length(largest)])
  #return results
  return(largest)
}

#some checks: simple case, two modes, empty vector, missing values, one type of value
simple_v <- c(1,1,1,2,3,2)
two_modes_v <- c(1,1,1,2,2,2)
empty_v <- c()
missing_v <- c(1,2,1,NA,NA,NA)
one_type_v <- c("a", "a")

mode(simple_v)
mode(two_modes_v)
mode(empty_v)
mode(missing_v)
mode(one_type_v)