# Performance Optimization and Parallelization

## Learning Objectives

• Learn how to benchmark code
• Understand several common R performance bottlenecks
• Properly identify instances where code might benefit from parallelization
• Be able to parallelize R code

Although R is a fantastic tool for data analysis and exploration, speed is not one of its strengths. This is by design - R was made to be easy and powerful to use, and takes many performance shortcuts for the sake of usability. That said, simply knowing several common mistakes to avoid can often reduce the execution times of your scripts from hours to minutes.

Before we start, make sure you have the following packages installed:

install.packages(c("microbenchmark", "plyr", "doParallel"))

## Benchmarking

Before we can start to cover what slows code down, we’ll need a way of measuring how fast it is. This is called benchmarking. The microbenchmark package provides a very easy way of doing this. Simply give it a piece of code to execute, and it will run your code repeatedly, reporting the amount of time it took to run. Less efficient code is slower.

As an example, let’s generate 100 random numbers and see how long it takes to compute their square root.

library(microbenchmark)

randoms <- runif(100)
microbenchmark(sqrt(randoms))
Unit: microseconds
expr   min    lq    mean median    uq    max neval
sqrt(randoms) 1.118 1.257 1.48609 1.2985 1.334 17.515   100


We can also compare two pieces of code at once. Just specify two pieces of code to compare. In this case, we will compare the execution time of R’s sqrt() function and the ^ exponent operator for taking a square root.

microbenchmark(sqrt(randoms), randoms ^ 0.5)
Unit: microseconds
expr    min      lq     mean median      uq     max neval
sqrt(randoms)  1.102  1.2700  1.71767  1.377  1.5485  17.973   100
randoms^0.5 10.590 10.7765 15.94200 10.874 11.2030 298.370   100


After benchmarking it would appear that sqrt() is significantly faster. Using comparisons like this, we’ll be able to figure out what code is good (fast) and what code could use improvement.

## Common performance optimizations

Although there’s not enough time to cover every possible performance optimization in R, we’ll cover several common mistakes that have a major impact on code performance.

### Preallocate for-loops

Let’s cover two examples of the same piece of code: in this case, just creating a sequence of numbers of a certain length. For the first case (dynamic), we will grow the variable sequence with each iteration of our loop. In the second case, we will create sequence beforehand at its expected final size.

dynamic <- function(size) {
sequence <- NA
for (i in 1:size) {
sequence[i] <- i
}
return(sequence)
}

preallocated <- function(size) {
sequence <- rep(NA, size)
for (i in 1:size) {
sequence[i] <- i
}
return(sequence)
}

microbenchmark(dynamic(10),
dynamic(100),
dynamic(1000),
dynamic(10000),
preallocated(10),
preallocated(100),
preallocated(1000),
preallocated(10000))
Unit: microseconds
expr       min          lq         mean      median
dynamic(10)    12.013     14.1545     21.08508     16.9640
dynamic(100)   116.729    133.5830    158.08893    145.4185
dynamic(1000)  1836.242   2142.1455   3189.47906   2663.8300
dynamic(10000) 92574.725 100751.2375 120766.04840 109819.2705
preallocated(10)     9.923     11.9220     18.89082     17.7230
preallocated(100)    81.102     88.6555    103.51341     95.1705
preallocated(1000)   776.135    852.0280   1004.88944    907.9835
preallocated(10000)  8187.022   8579.1210   9595.38399   9130.6060
uq        max neval
24.3220     79.757   100
173.2030    279.465   100
3094.8480  45449.344   100
142190.6030 205317.575   100
22.7275     97.232   100
110.0850    226.714   100
1037.4640   2639.557   100
10290.4660  13649.102   100


Preallocating the sequence variable of our loop performed better in every case. Intriguingly, the speed advantage of preallocating sequence vs. dynamically growing it actually increases with larger sizes. The reason for the difference in speed is caused by how R handles stored objects in memory. Every time it has to make sequence bigger, R actually has to create a copy of sequence at the new size, and then add i to the new object. By preallocating sequence at its final size, we avoid all of this unnecessary copying.

Note that all of the *ply functions from the plyr function do this internally. The easiest way to avoid the pitfalls of dynamically resizing an object is to simply use the corresponding *ply function.

The equivalent *ply function for the last example, compared to preallocated():

library(plyr)

microbenchmark(
preallocated(10000),
llply(1:10000, function(x) return(x)))
Unit: milliseconds
expr      min       lq      mean
preallocated(10000) 8.343753 8.805730 10.772959
llply(1:10000, function(x) return(x)) 7.377793 7.892931  8.805502
median        uq      max neval
10.02329 11.242248 67.28878   100
8.62486  9.408815 11.17397   100


### Avoid creating unnecessary variables

The more things you make your computer do, the longer it will take. Saving variables with <- is not always an instantaneous operation, especially if the variables you are saving are comparatively large. In this case we will calculate the mean gdpPercap for a particular country using the gapminder dataset. In our extra_variables() function, we will save all of the data for the country of interest as countryData, then calculate the mean gdpPercap. In the second case, we calculate gdpPercap in a single step.

library(gapminder)

extra_variables <- function(country) {
countryData <- gapminder[gapminder$country == country, ] return(mean(countryData$gdpPercap))
}

less_variables <- function(country) {
return(mean(gapminder$gdpPercap[gapminder$country == country]))
}

microbenchmark(extra_variables("Germany"), less_variables("Germany"))
Unit: microseconds
expr     min       lq     mean  median       uq
extra_variables("Germany") 253.530 258.5945 360.6558 274.616 321.8900
less_variables("Germany") 110.416 113.5490 133.6912 115.664 145.1995
max neval
2410.062   100
375.777   100


Using less variables is faster. That said, keep in mind that there is a balance to be struck between having less variables and having readable code. If you do not understand what a piece of code is doing at first glance, it’s a great idea to break it into more readable chunks and add comments that explain what your code is doing.

### Use vectorized code

Many of R’s core functions use actually use ultra-fast code written in other programming languages like C and Fortran. One major example of this is using R’s vectorized operators and functions (that take a vector as input and return a vector as output). One example of this is adding a number to a vector: in the first case, we will simply add 10 to a set of 1000 random numbers with the + operator. In the second case, we will loop over all 1000 numbers and add 10 to each.

randoms <- runif(1000)

microbenchmark(randoms + 10,
for (i in 1:1000) {
randoms[i] + 10
})
Unit: microseconds
expr     min       lq      mean
randoms + 10   2.285   2.5485   3.89762
for (i in 1:1000) {     randoms[i] + 10 } 354.520 403.1360 497.57094
median       uq      max neval
3.0245   3.7370   29.299   100
460.2490 530.2925 2293.270   100


Although the computer is in fact performing the same operation in both cases (it loops over the vector), the vectorized operation is almost 200x faster because it is using an ultra-fast implementation written in another language.

### Avoid repeatedly writing or reading from disk

Using the hard disk on your computer is an extremely slow operation. Objects stored in memory (the R workspace) are orders of magnitude faster to access. If one of your scripts is reading or writing to disk repeatedly, your code will be limited by your disk I/O speed.

As an example of this, we will save the data from a country as a in memory and then read it back. We will then do the same thing, but instead of using memory, we will save it to disk and then read it back.

memory <- function(country) {
data <- gapminder[gapminder$country == country, ] return(data) } disk <- function(country) { write.csv(gapminder[gapminder$country == country, ],
file = "temp.csv", row.names = FALSE)
}

microbenchmark(memory("Germany"), disk("Germany"))
Unit: microseconds
expr      min        lq      mean    median        uq
memory("Germany")  242.075  275.7285  380.5267  304.1155  345.0335
disk("Germany") 1901.962 2040.0400 2381.5342 2154.5500 2419.0180
max neval
3193.915   100
7233.583   100

file.remove("temp.csv")  # clean up after ourselves
[1] TRUE


## Parallelization with plyr and doParallel

One final way to have code run faster is to simply run it with more computing power. Keep in mind however, that faster code will always beat faster hardware. As we’ve shown in previous examples, optimizing your code can often result in it running hundreds, if not thousands of times faster. The performance boost offered by parallelizing code is comparatively modest and is limited by the number of CPUs you have available.

Furthermore, some code simply cannot be parallelized. At its heart, all parallel computing involves splitting a large operation into separate chunks, performing operations on separate pieces of hardware, and then aggregating the results. If one operation depends on the last or otherwise modifies a variable used by other processes, it cannot be done in parallel.

Put more simply, problems that are a good fit for parallel computing typically meet the following criteria:

• The code takes a long time to execute
• Each computation is completely independent of other computations
• The code has already been optimized as much as possible

With that said, let’s write some parallel code! We will use plyr and the doParallel package, as those packages are easy to use and can be run on any OS (OSX/Linux/Windows).

How many cores do we have available?

library(doParallel)
Loading required package: foreach

Loading required package: iterators

Loading required package: parallel

detectCores()
[1] 4


The machine this tutorial was written on has 4 cores. Let’s use ’em. For our example calculation, we will actually cheat a bit and make function that just returns the mean gdp for a country / 1000 (from gapminder) after waiting for a fraction of a second. The process for parallelization on is slightly different between UNIX (OSX/Linux) and Windows.

### On OSX/Linux

To parallelize a function using one of the *ply functions, it’s as simple as registering a parallel backend with registerDoParallel(cores) once in your script and adding the argument .parallel = TRUE to the *ply function you want to use. That’s it - you’re done!

So the parallel workflow looks like this:

• registerDoParallel(cores)
• Parallel calculations with plyr

A sample parallel script on OSX/Linux might look like this:

library(plyr)
library(microbenchmark)
library(gapminder)
library(doParallel)

registerDoParallel(cores = detectCores())

someVariable <- 1000

fakefunc <- function(country) {
Sys.sleep(0.2)
df <- gapminder[gapminder$country == country, ] toReturn <- data.frame(country, gdp = mean(df$gdpPercap) / someVariable)
return(toReturn)
}

microbenchmark(
laply(unique(gapminder$country), fakefunc), laply(unique(gapminder$country), fakefunc, .parallel = TRUE),
times = 1)
Unit: seconds
expr       min
laply(unique(gapminder$country), fakefunc) 28.821741 laply(unique(gapminder$country), fakefunc, .parallel = TRUE)  7.530481
lq      mean    median        uq       max neval
28.821741 28.821741 28.821741 28.821741 28.821741     1
7.530481  7.530481  7.530481  7.530481  7.530481     1


### On Windows

Parallelizing a function on Windows is slightly harder than on OSX/Linux. On OSX/Linux, the workspace is copied to all of the parallel processes you use. On Windows, the parallel R processes must be manually created and variables and functions must be passed to your parallel R workers (this is done with the .paropts argument to *ply. Finally, once you are done using R in parallel (like at the end of your script), you must stop your parallel R workers with stopImplicitCluster() or they will hang around and continue to eat up system resources.

On Windows, a sample parallel workflow looks like this:

• registerDoParallel(cores)
• Parallel calculations with plyr (must use .paropts to pass variables and packages). .paropts typically looks like .paropts = list(.packages = c("package1", "package2"), .export = c("variableName1", "functionName2"))
• stopImplicitCluster()
library(microbenchmark)
library(plyr)
library(doParallel)
library(gapminder)

# start parallel workers
registerDoParallel(cores = detectCores())

someVariable <- 1000

fakefunc <- function(country) {
Sys.sleep(0.2)
df <- gapminder[gapminder$country == country, ] toReturn <- data.frame(country, gdp = mean(df$gdpPercap) / someVariable)
return(toReturn)
}

microbenchmark(
ldply(unique(gapminder$country), fakefunc), ldply(unique(gapminder$country), fakefunc, .parallel = TRUE,
.paropts = list(.packages = "gapminder", .export = c("someVariable"))),
times = 1)

# kill parallel workers
stopImplicitCluster()

### Final notes

Sometimes, if what you’re parallelizing is very fast, your code make actually run slower when in parallel. Every parallel iteration of your *ply function involves splitting up your job and sending it to the different parallel R workers. If individual iterations run slower in parallel, it is likely that your function is spending more time splitting up the job to run it on multiple cores than it is doing actual work. These types of jobs should not be done in parallel.

Attempting to registerDoParallel(cores) with more cores than your machine actually has will slow down your job. If you tried to do a 16 core computation on a machine that only has 4 cores, for instance, the 16 R workers will be forced to take turns on your available CPUs, resulting in them running at roughly 25% speed. For the same reason, do not put a parallel function inside another parallel function. Something like this: **ply(data, function(x) {**ply(data, fun, .parallel = TRUE)}), .parallel = TRUE) will only end badly.

Finally when running a parallel R job on a compute cluster, NEVER do this: registerDoParallel(cores = detectCores()). This will attempt to use every available CPU (including ones you have not been allocated), potentially ruining other jobs on a compute node. You must register a fixed number of cores (registerDoParallel(cores = 4) for instance), and then request that number of cores from the scheduler.