Performance Optimization and Parallelization
Learning Objectives
- Learn how to benchmark code
- Understand several common R performance bottlenecks
- Properly identify instances where code might benefit from parallelization
- Be able to parallelize R code
Although R is a fantastic tool for data analysis and exploration, speed is not one of its strengths. This is by design - R was made to be easy and powerful to use, and takes many performance shortcuts for the sake of usability. That said, simply knowing several common mistakes to avoid can often reduce the execution times of your scripts from hours to minutes.
Before we start, make sure you have the following packages installed:
install.packages(c("microbenchmark", "plyr", "doParallel", "gapminder"))
Benchmarking
Before we can start to cover what slows code down, we’ll need a way of measuring how fast it is. This is called benchmarking. The microbenchmark package provides a very easy way of doing this. Simply give it a piece of code to execute, and it will run your code repeatedly, reporting the amount of time it took to run. Less efficient code is slower.
As an example, let’s generate 100 random numbers and see how long it takes to compute their square root.
library(microbenchmark)
randoms <- runif(100)
microbenchmark(sqrt(randoms))
Unit: microseconds
expr min lq mean median uq max neval
sqrt(randoms) 1.118 1.257 1.48609 1.2985 1.334 17.515 100
We can also compare two pieces of code at once - just give both to microbenchmark(). In this case, we will compare the execution time of R’s sqrt() function and the ^ exponent operator for taking a square root.
microbenchmark(sqrt(randoms), randoms ^ 0.5)
Unit: microseconds
expr min lq mean median uq max neval
sqrt(randoms) 1.102 1.2700 1.71767 1.377 1.5485 17.973 100
randoms^0.5 10.590 10.7765 15.94200 10.874 11.2030 298.370 100
After benchmarking, it would appear that sqrt() is significantly faster. Using comparisons like this, we’ll be able to figure out what code is good (fast) and what code could use improvement.
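By default, microbenchmark() evaluates each expression 100 times (that is the neval column in its output). If you want more repetitions for a more stable estimate, you can pass the times argument - a quick sketch (your timings will differ):
microbenchmark(sqrt(randoms), randoms ^ 0.5, times = 1000)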
Common performance optimizations
Although there’s not enough time to cover every possible performance optimization in R, we’ll cover several common mistakes that have a major impact on code performance.
Preallocate for-loops
Let’s compare two versions of the same piece of code: in this case, just creating a sequence of numbers of a certain length. In the first version (dynamic), we will grow the variable sequence with each iteration of our loop. In the second version (preallocated), we will create sequence beforehand at its expected final size.
dynamic <- function(size) {
  sequence <- NA
  for (i in 1:size) {
    sequence[i] <- i
  }
  return(sequence)
}

preallocated <- function(size) {
  sequence <- rep(NA, size)
  for (i in 1:size) {
    sequence[i] <- i
  }
  return(sequence)
}
microbenchmark(dynamic(10),
               dynamic(100),
               dynamic(1000),
               dynamic(10000),
               preallocated(10),
               preallocated(100),
               preallocated(1000),
               preallocated(10000))
Unit: microseconds
                expr       min          lq         mean      median          uq        max neval
         dynamic(10)    12.013     14.1545     21.08508     16.9640     24.3220     79.757   100
        dynamic(100)   116.729    133.5830    158.08893    145.4185    173.2030    279.465   100
       dynamic(1000)  1836.242   2142.1455   3189.47906   2663.8300   3094.8480  45449.344   100
      dynamic(10000) 92574.725 100751.2375 120766.04840 109819.2705 142190.6030 205317.575   100
    preallocated(10)     9.923     11.9220     18.89082     17.7230     22.7275     97.232   100
   preallocated(100)    81.102     88.6555    103.51341     95.1705    110.0850    226.714   100
  preallocated(1000)   776.135    852.0280   1004.88944    907.9835   1037.4640   2639.557   100
 preallocated(10000)  8187.022   8579.1210   9595.38399   9130.6060  10290.4660  13649.102   100
Preallocating the sequence variable of our loop performed better in every case. Intriguingly, the speed advantage of preallocating sequence vs. dynamically growing it actually increases with larger sizes. The difference in speed comes from how R handles stored objects in memory: every time it has to make sequence bigger, R actually has to create a copy of sequence at the new size, and then add i to the new object. By preallocating sequence at its final size, we avoid all of this unnecessary copying.
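As an aside, when a fully vectorized alternative exists it usually beats even the preallocated loop, because the entire result is built by fast internal code. A quick, purely illustrative sketch for this particular task (timings will vary by machine):
# seq_len() builds the whole sequence in one vectorized call, with no R-level loop
microbenchmark(preallocated(10000), seq_len(10000))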
Note that all of the *ply functions from the plyr package do this preallocation internally. The easiest way to avoid the pitfalls of dynamically resizing an object is to simply use the corresponding *ply function. Here is the equivalent *ply function for the last example, compared to preallocated():
library(plyr)
microbenchmark(
  preallocated(10000),
  llply(1:10000, function(x) return(x)))
Unit: milliseconds
                                  expr      min       lq      mean   median        uq      max neval
                   preallocated(10000) 8.343753 8.805730 10.772959 10.02329 11.242248 67.28878   100
 llply(1:10000, function(x) return(x)) 7.377793 7.892931  8.805502  8.62486  9.408815 11.17397   100
Avoid creating unnecessary variables
The more things you make your computer do, the longer it will take. Saving variables with <- is not always an instantaneous operation, especially if the variables you are saving are comparatively large. In this case, we will calculate the mean gdpPercap for a particular country using the gapminder dataset. In our extra_variables() function, we will save all of the data for the country of interest as countryData, then calculate the mean gdpPercap from it. In our less_variables() function, we calculate the mean gdpPercap in a single step.
library(gapminder)
extra_variables <- function(country) {
  countryData <- gapminder[gapminder$country == country, ]
  return(mean(countryData$gdpPercap))
}

less_variables <- function(country) {
  return(mean(gapminder$gdpPercap[gapminder$country == country]))
}
microbenchmark(extra_variables("Germany"), less_variables("Germany"))
Unit: microseconds
                       expr     min       lq     mean  median       uq      max neval
 extra_variables("Germany") 253.530 258.5945 360.6558 274.616 321.8900 2410.062   100
  less_variables("Germany") 110.416 113.5490 133.6912 115.664 145.1995  375.777   100
Using fewer variables is faster. That said, keep in mind that there is a balance to be struck between having fewer variables and having readable code. If you do not understand what a piece of code is doing at first glance, it’s a great idea to break it into more readable chunks and add comments that explain what your code is doing.
Use vectorized code
Many of R’s core functions actually use ultra-fast code written in other programming languages like C and Fortran. One major example of this is R’s vectorized operators and functions (those that take a vector as input and return a vector as output). Consider adding a number to a vector: in the first case, we will simply add 10 to a set of 1000 random numbers with the + operator. In the second case, we will loop over all 1000 numbers and add 10 to each.
randoms <- runif(1000)
microbenchmark(randoms + 10,
               for (i in 1:1000) {
                 randoms[i] + 10
               })
Unit: microseconds
                                  expr     min       lq      mean   median       uq      max neval
                          randoms + 10   2.285   2.5485   3.89762   3.0245   3.7370   29.299   100
 for (i in 1:1000) { randoms[i] + 10 } 354.520 403.1360 497.57094 460.2490 530.2925 2293.270   100
Although the computer is in fact performing the same operation in both cases (it loops over the vector), the vectorized operation is well over 100x faster here because it uses an ultra-fast implementation written in another language.
Avoid repeatedly writing or reading from disk
Using the hard disk on your computer is an extremely slow operation. Objects stored in memory (the R workspace) are orders of magnitude faster to access. If one of your scripts is reading or writing to disk repeatedly, your code will be limited by your disk I/O speed.
As an example of this, we will retrieve the data for a country from an object already in memory. We will then do the same thing, but instead of using memory, we will write the data to disk and then read it back.
memory <- function(country) {
  data <- gapminder[gapminder$country == country, ]
  return(data)
}

disk <- function(country) {
  write.csv(gapminder[gapminder$country == country, ],
            file = "temp.csv", row.names = FALSE)
  return(read.csv("temp.csv", as.is = TRUE))
}
microbenchmark(memory("Germany"), disk("Germany"))
Unit: microseconds
              expr      min        lq      mean    median        uq      max neval
 memory("Germany")  242.075  275.7285  380.5267  304.1155  345.0335 3193.915   100
   disk("Germany") 1901.962 2040.0400 2381.5342 2154.5500 2419.0180 7233.583   100
file.remove("temp.csv") # clean up after ourselves
[1] TRUE
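If your analysis truly needs data from disk inside a loop, read the file once up front and reuse the in-memory copy on every iteration. A minimal sketch (the gapminder.csv file below is created purely for illustration):
# write a copy to disk just so we have a file to demonstrate with
write.csv(gapminder, file = "gapminder.csv", row.names = FALSE)

# read the file ONCE, then reuse the in-memory copy for every country
gapminderData <- read.csv("gapminder.csv", as.is = TRUE)
meanGdp <- sapply(unique(gapminderData$country),
                  function(x) mean(gapminderData$gdpPercap[gapminderData$country == x]))

file.remove("gapminder.csv")  # clean up after ourselves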
Parallelization with plyr and doParallel
One final way to make code run faster is to simply run it with more computing power. Keep in mind, however, that faster code will always beat faster hardware. As we’ve shown in the previous examples, optimizing your code can often result in it running hundreds, if not thousands, of times faster. The performance boost offered by parallelizing code is comparatively modest and is limited by the number of CPUs you have available.
Furthermore, some code simply cannot be parallelized. At its heart, all parallel computing involves splitting a large operation into separate chunks, performing operations on separate pieces of hardware, and then aggregating the results. If one operation depends on the last or otherwise modifies a variable used by other processes, it cannot be done in parallel.
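As a toy illustration (a hypothetical example, not one of this lesson’s benchmarks), a running total cannot be parallelized, because every iteration needs the result of the one before it:
total <- 0
for (i in 1:10) {
  total <- total + i  # each step depends on the value computed in the previous step
}
total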
Put more simply, problems that are a good fit for parallel computing typically meet the following criteria:
- The code takes a long time to execute
- Each computation is completely independent of other computations
- The code has already been optimized as much as possible
With that said, let’s write some parallel code! We will use the plyr and doParallel packages, as they are easy to use and can be run on any OS (OSX/Linux/Windows).
How many cores do we have available?
library(doParallel)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
detectCores()
[1] 4
The machine this tutorial was written on has 4 cores. Let’s use ’em. For our example calculation, we will actually cheat a bit and make a function that just returns the mean gdpPercap for a country (from gapminder) divided by 1000, after waiting for a fraction of a second. The process for parallelization is slightly different between UNIX (OSX/Linux) and Windows.
On OSX/Linux
To parallelize a function using one of the *ply functions, it’s as simple as registering a parallel backend with registerDoParallel(cores) once in your script and adding the argument .parallel = TRUE to the *ply function you want to use. That’s it - you’re done!
So the parallel workflow looks like this:
- registerDoParallel(cores)
- Parallel calculations with plyr
A sample parallel script on OSX/Linux might look like this:
library(plyr)
library(microbenchmark)
library(gapminder)
library(doParallel)
registerDoParallel(cores = detectCores())
someVariable <- 1000
fakefunc <- function(country) {
  Sys.sleep(0.2)
  df <- gapminder[gapminder$country == country, ]
  toReturn <- data.frame(country, gdp = mean(df$gdpPercap) / someVariable)
  return(toReturn)
}

microbenchmark(
  laply(unique(gapminder$country), fakefunc),
  laply(unique(gapminder$country), fakefunc, .parallel = TRUE),
  times = 1)
Unit: seconds
                                                         expr       min        lq      mean    median        uq       max neval
                   laply(unique(gapminder$country), fakefunc) 28.821741 28.821741 28.821741 28.821741 28.821741 28.821741     1
 laply(unique(gapminder$country), fakefunc, .parallel = TRUE)  7.530481  7.530481  7.530481  7.530481  7.530481  7.530481     1
On Windows
Parallelizing a function on Windows is slightly harder than on OSX/Linux. On OSX/Linux, the workspace is copied to all of the parallel processes you use. On Windows, the parallel R processes must be created manually, and variables and functions must be passed to your parallel R workers (this is done with the .paropts argument to the *ply function). Finally, once you are done using R in parallel (like at the end of your script), you must stop your parallel R workers with stopImplicitCluster() or they will hang around and continue to eat up system resources.
On Windows, a sample parallel workflow looks like this:
- registerDoParallel(cores)
- Parallel calculations with plyr (must use .paropts to pass variables and packages). .paropts typically looks like .paropts = list(.packages = c("package1", "package2"), .export = c("variableName1", "functionName2"))
- stopImplicitCluster()
library(microbenchmark)
library(plyr)
library(doParallel)
library(gapminder)
# start parallel workers
registerDoParallel(cores = detectCores())
someVariable <- 1000
fakefunc <- function(country) {
  Sys.sleep(0.2)
  df <- gapminder[gapminder$country == country, ]
  toReturn <- data.frame(country, gdp = mean(df$gdpPercap) / someVariable)
  return(toReturn)
}

microbenchmark(
  ldply(unique(gapminder$country), fakefunc),
  ldply(unique(gapminder$country), fakefunc, .parallel = TRUE,
        .paropts = list(.packages = "gapminder", .export = c("someVariable"))),
  times = 1)
# kill parallel workers
stopImplicitCluster()
Final notes
Sometimes, if what you’re parallelizing is very fast, your code may actually run slower when run in parallel. Every parallel iteration of your *ply function involves splitting up your job and sending the pieces to the different parallel R workers. If individual iterations run slower in parallel, it is likely that your function is spending more time splitting up the job and distributing it across cores than it is doing actual work. These types of jobs should not be done in parallel.
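To see this overhead for yourself, you could time a trivially fast operation both ways (this sketch assumes a parallel backend has already been registered with registerDoParallel(), as in the earlier examples):
# sqrt() on a handful of numbers finishes almost instantly, so the cost of
# farming the work out to the parallel workers dominates the total run time
microbenchmark(
  laply(1:100, sqrt),
  laply(1:100, sqrt, .parallel = TRUE),
  times = 1)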
Attempting to registerDoParallel(cores) with more cores than your machine actually has will slow down your job. If you tried to do a 16-core computation on a machine that only has 4 cores, for instance, the 16 R workers would be forced to take turns on your available CPUs, resulting in them running at roughly 25% speed. For the same reason, do not put a parallel function inside another parallel function. Something like **ply(data, function(x) {**ply(data, fun, .parallel = TRUE)}, .parallel = TRUE) will only end badly.
Finally, when running a parallel R job on a compute cluster, NEVER do this: registerDoParallel(cores = detectCores()). This will attempt to use every CPU on the node (including ones you have not been allocated), potentially ruining other jobs running on that compute node. You must register a fixed number of cores (registerDoParallel(cores = 4), for instance), and then request that same number of cores from the scheduler.
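For example, on a cluster that uses the SLURM scheduler (an assumption here - adapt the environment variable name to your own scheduler), you can read the number of cores you were actually allocated instead of hard-coding it:
# SLURM exposes the allocated core count as an environment variable;
# fall back to a single core if it is not set
allocatedCores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
registerDoParallel(cores = allocatedCores)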