3 min read

Updated vectors in R 3.4.0

The last version of R comes with a small update of vector behavior in regard to inserting new value just after the last element. See the entry from the announcement (https://stat.ethz.ch/pipermail/r-announce/2017/000612.html):

* Assigning to an element of a vector beyond the current length now
     over-allocates by a small fraction. The new vector is marked
     internally as growable, and the true length of the new vector is
     stored in the truelength field. This makes building up a vector
     result by assigning to the next element beyond the current length
     more efficient, though pre-allocating is still preferred.  The
     implementation is subject to change and not intended to be used
     in packages at this time.

In this post, I’ll try to show how does this work in practice, and what does it mean to the user.

Functions used in benchmark

To test the new functionality I created three simple functions which add a new element at the end of the vector in three different ways:

  • Assign new item using [[]] operator:
fnc_assign <- function(y) {
  x <- NULL
  for (i in y) {
    x[[i]] <- i
  }
  x
}
  • Use c(...) function:
fnc_combines <- function(y) {
  x <- NULL
  for (i in y) {
    x <- c(x, i)
  }
  x
}
  • Pre-allocate memory (recommended way):
fnc_prealloc <- function(y) {
  x <- numeric(length(y))
  for (i in y) {
    x[[i]] <- i
  }
  x
}

Then I created simple bash scripts to download and compile two versions of R. The testing functions were executed in both releases using microbenchmark (the code used to create this analysis is available in GitHub repository - https://github.com/zzawadz/blog-entry-vector-speed).

Results

Assign with [[]] operator.

In the first version of the append function, the difference is significant. For \(10^4\) elements, the speed up is by order of magnitude.

Version Fnc Median Time Ratio
R-3.3.3 fnc_assign(y) 62.84 8.52
R-3.4.0 fnc_assign(y) 7.37 1.00

Even when we compare this result with the pre-allocated version, it is only slightly slower. Good news!

Version Fnc Median Time Ratio
R-3.4.0 fnc_assign(y) 7.37 1.07
R-3.4.0 fnc_prealloc(y) 6.89 1.00

Pre-allocation.

No real difference;)

Version Fnc Median Time Ratio
R-3.3.3 fnc_prealloc(y) 6.72 1.00
R-3.4.0 fnc_prealloc(y) 6.89 1.03

c() function.

This version does not support the new functionality. It might be a bit confusing because it seems to be quite natural to add the new observation to the end of a vector using c(...) function. However, this causes massive loss of performance, and it is a big mistake! It’s better to avoid c(...) in the for loop!

Version Fnc Median Time Ratio
R-3.3.3 fnc_combines(y) 118.08 1.04
R-3.4.0 fnc_combines(y) 113.33 1.00

Impact of vector size.

It is worth to note that in the old version (and in the current one when the c function is used) creating a vector of length \(n\) by adding just one element in the loop has \(O(n^2)\) complexity. Inserting just one element causes reallocating the whole vector in the memory, so in the ith iteration, we need to move \(i-1\) elements. So in the end we get \(1 + 2 + 3 + ... + n - 2 + n - 1\) operations, which is \(O(n^2)\).

In the new version, the reallocation can be done less often so it can be much more linear. See the following graphs for comparison:

And the zoom for the new version:

Summary

In R 3.4.0 the memory management for vectors is much better organized so resizing of the vector is much faster (\(O(n)\) vs. \(O(n^2)\)). The only problem is that this does not apply to the x <- c(x, i) - this one is always slow. But the best option is to stay with preallocation because it is still the fastest way of dealing with growable vectors:)

comments powered by Disqus