R for reproducible scientific analysis

Learning Objectives

• Define a function that takes arguments.
• Return a value from a function.
• Test a function.
• Set default values for function arguments.
• Explain why we should divide programs into small, single-purpose functions.

If we only had one data set to analyze, it would probably be faster to load the file into a spreadsheet and use that to plot simple statistics. However, the gapminder data is updated periodically, and we may want to pull in that new information later and re-run our analysis again. We may also obtain similar data from a different source in the future.

In this lesson, we’ll learn how to write a function so that we can repeat several operations with a single command.

Defining a function

Let’s open a new R script file in the functions/ directory and call it functions-lesson.R.

my_sum <- function(a, b) {
the_sum <- a + b
return(the_sum)
}

Let’s define a function fahr_to_kelvin that converts temperatures from Fahrenheit to Kelvin:

fahr_to_kelvin <- function(temp) {
kelvin <- ((temp - 32) * (5 / 9)) + 273.15
return(kelvin)
}

We define fahr_to_kelvin by assigning it to the output of function. The list of argument names are contained within parentheses. Next, the body of the function–the statements that are executed when it runs–is contained within curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.

When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.

Let’s try running our function. Calling our own function is no different from calling any other function:

# freezing point of water
fahr_to_kelvin(32)
[1] 273.15

# boiling point of water
fahr_to_kelvin(212)
[1] 373.15


Challenge 1

Write a function called kelvin_to_celsius that takes a temperature in Kelvin and returns that temperature in Celsius

Hint: To convert from Kelvin to Celsius you minus 273.15

Combining functions

The real power of functions comes from mixing, matching and combining them into ever large chunks to get the effect we want.

Let’s define two functions that will convert temperature from Fahrenheit to Kelvin, and Kelvin to Celsius:

fahr_to_kelvin <- function(temp) {
kelvin <- ((temp - 32) * (5 / 9)) + 273.15
return(kelvin)
}

kelvin_to_celsius <- function(temp) {
celsius <- temp - 273.15
return(celsius)
}

Challenge 2

Define the function to convert directly from Fahrenheit to Celsius, by reusing the two functions above (or using your own functions if you prefer).

We’re going to define a function that calculates the Gross Domestic Product of a nation from the data available in our dataset:

# Takes a dataset and multiplies the population column
# with the GDP per capita column.
calcGDP <- function(dat) {
gdp <- dat$pop * dat$gdpPercap
return(gdp)
}

We define calcGDP by assigning it to the output of function. The list of argument names are contained within parentheses. Next, the body of the function – the statements executed when you call the function – is contained within curly braces ({}).

We’ve indented the statements in the body by two spaces. This makes the code easier to read but does not affect how it operates.

When we call the function, the values we pass to it are assigned to the arguments, which become variables inside the body of the function.

Inside the function, we use the return function to send back the result. This return function is optional: R will automatically return the results of whatever command is executed on the last line of the function.

calcGDP(head(gapminder))
[1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231


That’s not very informative. Let’s add some more arguments so we can extract that per year and country.

# Takes a dataset and multiplies the population column
# with the GDP per capita column.
calcGDP <- function(dat, year=NULL, country=NULL) {
if(!is.null(year)) {
dat <- dat[dat$year %in% year, ] } if (!is.null(country)) { dat <- dat[dat$country %in% country,]
}
gdp <- dat$pop * dat$gdpPercap

new <- cbind(dat, gdp=gdp)
return(new)
}

If you’ve been writing these functions down into a separate R script (a good idea!), you can load in the functions into our R session by using the source function:

source("functions/functions-lesson.R")

Ok, so there’s a lot going on in this function now. In plain English, the function now subsets the provided data by year if the year argument isn’t empty, then subsets the result by country if the country argument isn’t empty. Then it calculates the GDP for whatever subset emerges from the previous two steps. The function then adds the GDP as a new column to the subsetted data and returns this as the final result. You can see that the output is much more informative than a vector of numbers.

Let’s take a look at what happens when we specify the year:

head(calcGDP(gapminder, year=2007))
       country year      pop continent lifeExp  gdpPercap          gdp
12 Afghanistan 2007 31889923      Asia  43.828   974.5803  31079291949
24     Albania 2007  3600523    Europe  76.423  5937.0295  21376411360
36     Algeria 2007 33333216    Africa  72.301  6223.3675 207444851958
48      Angola 2007 12420476    Africa  42.731  4797.2313  59583895818
60   Argentina 2007 40301927  Americas  75.320 12779.3796 515033625357
72   Australia 2007 20434176   Oceania  81.235 34435.3674 703658358894


Or for a specific country:

calcGDP(gapminder, country="Australia")
     country year      pop continent lifeExp gdpPercap          gdp
61 Australia 1952  8691212   Oceania  69.120  10039.60  87256254102
62 Australia 1957  9712569   Oceania  70.330  10949.65 106349227169
63 Australia 1962 10794968   Oceania  70.930  12217.23 131884573002
64 Australia 1967 11872264   Oceania  71.100  14526.12 172457986742
65 Australia 1972 13177000   Oceania  71.930  16788.63 221223770658
66 Australia 1977 14074100   Oceania  73.490  18334.20 258037329175
67 Australia 1982 15184200   Oceania  74.740  19477.01 295742804309
68 Australia 1987 16257249   Oceania  76.320  21888.89 355853119294
69 Australia 1992 17481977   Oceania  77.560  23424.77 409511234952
70 Australia 1997 18565243   Oceania  78.830  26997.94 501223252921
71 Australia 2002 19546792   Oceania  80.370  30687.75 599847158654
72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894


Or both:

calcGDP(gapminder, year=2007, country="Australia")
     country year      pop continent lifeExp gdpPercap          gdp
72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894


Let’s walk through the body of the function:

calcGDP <- function(dat, year=NULL, country=NULL) {

Here we’ve added two arguments, year, and country. We’ve set default arguments for both as NULL using the = operator in the function definition. This means that those arguments will take on those values unless the user specifies otherwise.

  if(!is.null(year)) {
dat <- dat[dat$year %in% year, ] } if (!is.null(country)) { dat <- dat[dat$country %in% country,]
}

Here, we check whether each additional argument is set to null, and whenever they’re not null overwrite the dataset stored in dat with a subset given by the non-null argument.

I did this so that our function is more flexible for later. We can ask it to calculate the GDP for:

• The whole dataset;
• A single year;
• A single country;
• A single combination of year and country.

By using %in% instead, we can also give multiple years or countries to those arguments.

  gdp <- dat$pop * dat$gdpPercap
new <- cbind(dat, gdp=gdp)
return(new)
}

Finally, we calculated the GDP on our new subset, and created a new data frame with that column added. This means when we call the function later we can see the context for the returned GDP values, which is much better than in our first attempt where we got a vector of numbers.

Challenge 3

Test out your GDP function by calculating the GDP for New Zealand in 1987. How does this differ from New Zealand’s GDP in 1952?

Challenge 4

The paste function can be used to combine text together, e.g:

best_practice <- c("Write", "programs", "for", "people", "not", "computers")
paste(best_practice, collapse=" ")
[1] "Write programs for people not computers"


Write a function called fence that takes two vectors as arguments, called text and wrapper, and prints out the text wrapped with the wrapper:

fence(text=best_practice, wrapper="***")

Note: the paste function has an argument called sep, which specifies the separator between text. The default is a space: " “. The default for paste0 is no space”“.

Solution to challenge 1

Write a function called kelvin_to_celsius that takes a temperature in Kelvin and returns that temperature in Celsius

kelvin_to_celsius <- function(temp) {
celsius <- temp - 273.15
return(celsius)
}

Solution to challenge 2

Define the function to convert directly from Fahrenheit to Celsius, by reusing these two functions above

fahr_to_celsius <- function(temp) {
temp_k <- fahr_to_kelvin(temp)
result <- kelvin_to_celsius(temp_k)
return(result)
}

Solution to challenge 3

GDP for New Zealand in 1987: 65050008703

GDP for New Zealand in 1952: 21058193787

Solution to challenge 4

Write a function called fence that takes two vectors as arguments, called text and wrapper, and prints out the text wrapped with the wrapper:

fence <- function(text, wrapper){
text <- c(wrapper, text, wrapper)
result <- paste(text, collapse = " ")
return(result)
}
best_practice <- c("Write", "programs", "for", "people", "not", "computers")
fence(text=best_practice, wrapper="***")
[1] "*** Write programs for people not computers ***"