R Code Optimization III: Hardware Utilization and Performance
Welcome to the third post in this series! In the first article we covered the foundational principles, in the second we explored language choices and algorithm design. Now we’ll talk a bit about where the real performance gains happen: vectorization, parallelization, and memory management.
Hardware Utilization
Hardware utilization refers to how code leverages computational resources. For example, vectorization and parallelization help us squeeze every last drop of juice from our CPUs, while in-place modification, object size pre-allocation, and on-demand data access are useful to manage memory usage.
Vectorization
Vectorization refers to the application of an operation to multiple elements simultaneously.
At the hardware level, vectorization is enabled by an architectural feature known as Single Instruction Multiple Data (SIMD). SIMD operations can, for example, sum 16 pairs of vector elements simultaneously within a single core, offering substantial speed-ups. However, only compiled languages (C, C++, Fortran, etc.) can leverage SIMD instructions via specific compiler optimizations.
At the software level, many languages implement vectorized semantics. Think of adding two vectors b and c with the expression a = b + c. This abstraction makes code concise, and can also unlock performance gains in different ways. In compiled languages like Fortran, such expressions are typically optimized for SIMD vectorization. In interpreted languages like R, many vectorized functions are backed by compiled code. For instance, primitives like + are implemented as fast C loops that may or may not be optimized for SIMD by the compiler (see the section “R side: how can R possibly use SIMD?” in this excellent StackOverflow answer for details). Matrix operations, in contrast, rely on blazing-fast matrix algebra backends such as BLAS and LAPACK, which explicitly exploit SIMD vectorization (and parallelization!).
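To make this concrete, here is a minimal sketch comparing the vectorized expression from above with an explicit interpreted loop (the vector contents and sizes are arbitrary):
```r
# Vectorized addition: a single call that runs a compiled C loop under the hood
b <- runif(1e6)
c <- runif(1e6)
a_vectorized <- b + c

# Interpreted loop: the R interpreter evaluates one addition per iteration
a_loop <- numeric(length(b))
for (i in seq_along(b)) {
  a_loop[i] <- b[i] + c[i]
}

identical(a_vectorized, a_loop)  # same result, very different run times
```
Benchmarking both versions (for example with bench::mark() or microbenchmark) typically shows the vectorized expression running orders of magnitude faster than the loop.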
Vectorization in R
Some functions offer vectorized semantics without performance gains. This is the case with R functions like apply(), lapply(), purrr::map(), and the like, which are essentially loops in a trenchcoat.
By combining SIMD vectorization for raw performance with semantics-level vectorization for expressiveness, we maximize hardware utilization while keeping our code clean and efficient.
Parallelization
Parallelization accelerates execution by spreading independent tasks across multiple cores.
At the software level, parallelization can be achieved by spawning multiple processes, each with its own memory space, or by a single process spawning several threads, all of them sharing the same memory space.
Explicit vs Implicit Parallelization
Parallelization can be explicit or implicit.
Explicit parallelization requires the user to define how and where parallel tasks are executed. This approach offers fine control over execution, but also demands more setup and understanding of parallel workflows. In R, parallelized loops written with the packages doParallel and foreach require defining a parallelization backend (a.k.a. “cluster”), selecting a number of cores, and a specific syntax (y <- foreach(...) %dopar% {...}). That’s pretty explicit if you ask me! Modern alternatives like future and future.apply achieve the same results with less involved code.
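A minimal sketch of that explicit setup (the bootstrap task, the mtcars dataset, and the number of cores are just placeholders):
```r
library(foreach)
library(doParallel)

# Define the parallelization backend ("cluster") and the number of cores
cl <- makeCluster(4)
registerDoParallel(cl)

# Each iteration is an independent task: fit a model on a bootstrap resample
boot_coefs <- foreach(i = 1:100, .combine = rbind) %dopar% {
  resample <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  coef(lm(mpg ~ wt, data = resample))
}

stopCluster(cl)
```
With future.apply, the same idea boils down to future::plan(multisession) followed by a future_lapply() call, with no cluster bookkeeping on our side.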
On the other hand, implicit parallelization happens without user intervention or even knowledge. For example, the packages arrow and data.table apply multithreading to parallelize many data operations. This is also the case for matrix operations in R (e.g. GAM fitting with mgcv::gam()), which are multithreaded by matrix algebra libraries such as BLAS and Intel MKL.
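For instance, data.table decides on its own how many threads to use for grouping, joins, and file reading; all we can do is inspect or cap that number. A minimal sketch with made-up data:
```r
library(data.table)

getDTthreads()     # threads data.table will use by default
setDTthreads(4)    # optionally cap them, e.g. to share the machine

dt <- data.table(g = sample(1e3, 1e7, replace = TRUE), x = rnorm(1e7))
agg <- dt[, .(mean_x = mean(x)), by = g]  # grouping runs multithreaded internally
```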
The CRAN Task View: High-Performance and Parallel Computing with R offers a more complete overview of the different parallelization options available in R.
Requirements for Effective Parallelization
In any case, parallelization has several requirements:
- The task must be easy to split into independent sub-tasks (a.k.a. embarrassingly parallel).
- The computation time of a task must be longer than the time required to move its input and output data from memory or disk to the CPU and back; otherwise the communication overhead will cause a parallel slowdown. Parallelizing very fast tasks is rarely worth it!
- **The memory required by a parallel task times the number of parallel processes must not exceed the available system memory.** I wrote this one in bold so you can remember it whenever your code crashes for this very reason!
Even under ideal conditions, parallelization has well-known diminishing returns, formalized in Amdahl’s Law. That just means that beyond some point we cannot simply throw more processors at our code and expect immediate efficiency gains.
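To put a number on it: if a fraction p of a task can be parallelized, Amdahl’s Law caps the speedup on n processors at 1 / ((1 - p) + p / n). A quick sketch with a hypothetical workload that is 90% parallelizable:
```r
# Theoretical speedup under Amdahl's Law
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

amdahl(p = 0.9, n = c(2, 4, 8, 16, 64))
# ~1.8, 3.1, 4.7, 6.4, 8.8: nowhere near the 2x, 4x, 8x... we might hope for
```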
Memory Management
Let’s jump into what’s IMHO the most interesting topic of this article: Memory Management!
Computers have a short-term memory directly connected to the processor, known as main memory, system memory, or RAM. Any code and data required by a program lives (and sometimes dies) there during run time. For example, when you start an R session, the operating system assigns it a section of the system memory, and all functions of the packages base, stats, graphics, and a few others are read from disk and loaded there. The same goes for the code of any package you load using library(), the data your program reads from disk, and any results it generates via models or other computations.
Main memory is FAST, but FINITE! If a program requires more memory than available, the operating system may start moving parts of the main memory to the hard disk (see memory paging and swap file), slowing things down. In extreme cases, a program can run out of memory and crash.

Also, a program repeatedly allocating and deallocating memory chunks of varying sizes usually accumulates non-contiguous free gaps between used memory blocks that are hard to reuse. This issue, known as memory fragmentation, leads to performance slowdowns and higher memory usage that can end in a crash.
Efficient memory management can help avoid these issues by ensuring that our code uses the system’s memory in a sensible manner.
Being memory aware is a first step in the right direction. It sounds like a silly concept, really, but keeping a memory monitor like htop (or similar tools) open during code development and testing helps build an intuition of how our program uses memory.
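A couple of R-side helpers complement an external monitor nicely (this assumes the lobstr package is installed):
```r
lobstr::mem_used()        # memory currently used by the R session
lobstr::obj_size(mtcars)  # memory footprint of a specific object
gc()                      # run the garbage collector and print a memory summary
```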
Other good techniques we can apply to consistently improve memory management in R are in-place modification, pre-allocating object size, and on-demand data access.
In-place Modification
Also known as modification by reference, in-place modification refers to modifying an object without duplicating it (see copy-on-modify in R). This is probably the most consistent strategy we can apply to manage memory in R! Section 2.5 of the book Advanced R covers the technical details and offers great advice: “We can reduce the number of copies by using a list instead of a data frame. Modifying a list uses internal C code, so … no copy is made.” If data frames are your jam, then the package data.table may come as a life-saver, as it has an innate ability to modify large data frames in place, making it fast and efficient.
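The difference is easy to see with tracemem(), which prints a message every time R copies an object (a minimal sketch; tracemem() is available in standard CRAN builds of R):
```r
library(data.table)

df <- data.frame(x = 1:5)
tracemem(df)        # watch df for copies
df$y <- df$x * 2    # copy-on-modify: R duplicates the data frame

dt <- data.table(x = 1:5)
tracemem(dt)
dt[, y := x * 2]    # := modifies the data.table in place, no copy reported
```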
Object Size Pre-allocation
Growing data frames, vectors, or matrices in a loop triggers the copy-on-modify behavior of R and makes things very slow. This happens because R has to reallocate memory on each iteration for the object’s copy, which takes time and increases memory usage. If building an object iteratively is unavoidable, either pre-allocate its final size or, better, grow a list, as lists are dynamically allocated (rather than pre-allocated) and don’t require their elements to be stored in contiguous memory regions. In any case, when in doubt, apply benchmarking to identify the most efficient method.
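A minimal sketch of the three approaches (the toy computation is arbitrary):
```r
n <- 1e4

# Growing a vector: R reallocates and copies it over and over
grown <- c()
for (i in 1:n) grown <- c(grown, i^2)

# Pre-allocating the final size avoids the repeated copies
prealloc <- numeric(n)
for (i in 1:n) prealloc[i] <- i^2

# Growing a list: elements are stored as references, no contiguous block needed
as_list <- list()
for (i in 1:n) as_list[[i]] <- i^2

identical(prealloc, unlist(as_list))  # same values, very different costs
```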
On-demand Data Access
On-demand data access refers to several data handling strategies to work with data larger than memory.
Memory-mapped files are representations of large on-disk data in the virtual memory of the operating system. The operating system directly handles the on-demand reading and caching of specific portions of these files, which reduces memory overhead at the expense of increased disk reads (having an efficient SSD is a game changer here!) and computation time. In R, the packages mmap and ff (see a brief tutorial here) offer low-level memory-mapping implementations, while the bigmemory package focuses on large matrices.
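A rough sketch of what a file-backed matrix with bigmemory can look like (the file names and dimensions are made up):
```r
library(bigmemory)

# The matrix lives on disk; only the parts we touch are pulled into RAM
x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                           backingfile = "big_matrix.bin",
                           descriptorfile = "big_matrix.desc")

x[1:5, 1] <- rnorm(5)  # reads and writes go through the memory-mapped file
x[1:5, 1]
```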
Chunk-wise processing involves explicitly dividing large data into smaller and more manageable pieces, making it a flexible solution for handling large-scale computations efficiently. For example, the package terra combines this technique with lazy evaluation when working with large raster files to control memory usage.
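The same idea can be sketched outside the raster world, for example reading a big CSV in chunks with readr (the file name and the column x below are hypothetical):
```r
library(readr)

# Summarise each chunk as it is read, so only small results stay in memory
chunk_summaries <- read_csv_chunked(
  "big_file.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    data.frame(first_row = pos, mean_x = mean(chunk$x))
  }),
  chunk_size = 1e5
)
```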
Modern data solutions like Apache Arrow and DuckDB provide efficient columnar storage and query capabilities. The arrow package enables on-demand access and streaming reads with arrow::open_dataset(), while DuckDB brings SQL-powered processing to R with support for lazy evaluation and filtering only the relevant data subsets.
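A minimal sketch with arrow and dplyr (the dataset path and column names are hypothetical):
```r
library(arrow)
library(dplyr)

ds <- open_dataset("data/measurements/")  # points at the files, reads nothing yet

result <- ds |>
  filter(year == 2024) |>                 # pushed down to the file scan
  group_by(site) |>
  summarise(mean_value = mean(value)) |>
  collect()                               # only the needed subset reaches memory
```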
The package targets combines chunk-wise processing, parallelization, and multisession execution seamlessly via dynamic branching.
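In a _targets.R file, dynamic branching looks roughly like this (summarise_file() is a hypothetical user-defined function):
```r
library(targets)

list(
  tar_target(files, list.files("data", full.names = TRUE)),
  tar_target(
    summaries,
    summarise_file(files),   # hypothetical function applied to each file
    pattern = map(files)     # one dynamic branch per file
  )
)
```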
Memory Management Resources
Memory management in R is a deep rabbit hole, but there are several great resources out there that may help you find your footing on this topic:
- Best Coding Practices for R: Chapters 10, 11, and 12 of this regrettably unfinished on-line book offer plenty of tips and tricks to improve memory management in R.
- Chapter 14 of The Art of R Programming (pdf available here): it might seem dated, but it goes deep into the trade-off between computational speed and memory usage through many enlightening examples.
- Advanced R: the first edition of this essential book has the chapter Memory, which explains in detail how modification in place and garbage collection work in R. Chapter 24 of the latest edition, titled Improving Performance, is full of tips to improve the general performance of R code.
Wrapping Up
And that’s it! We’ve covered the three major pillars of hardware utilization: vectorization, parallelization, and memory management. These techniques can deliver dramatic performance gains when applied appropriately.
Throughout this series we’ve covered the theoretical foundations and key techniques. But knowing these techniques is one thing—knowing when and how to apply them in practice is another!
The final post focuses on practical tools for code optimization: profiling to identify bottlenecks, benchmarking to validate improvements, and the iterative optimization workflow that ties all these concepts into a systematic process.