Multicollinearity Hinders Model Interpretability

Adapted from XKCD: https://xkcd.com/2347/
This post is written for beginner to intermediate R users wishing to learn what multicollinearity is and how it can turn model interpretation into a challenge.

Summary

In this post, I delve into the intricacies of model interpretation under the influence of multicollinearity, and use R and a toy data set to demonstrate how this phenomenon impacts both linear and machine learning models:

  • The section Multicollinearity Explained explains the origin of the word and the nature of the problem.
  • The section Model Interpretation Challenges describes how to create the toy data set, and applies it to Linear Models and Random Forest to explain how multicollinearity can make model interpretation a challenge.
  • The Appendix shows extra examples of linear and machine learning models affected by multicollinearity.

I hope you’ll enjoy it!

R packages

This tutorial requires the newly released R package collinear, and a few more listed below. The optional ones are used only in the Appendix at the end of the post.

#required
install.packages("collinear")
install.packages("ranger")
install.packages("dplyr")

#optional
install.packages("nlme")
install.packages("glmnet")
install.packages("xgboost")

Multicollinearity Explained

This cute word comes from the amalgamation of these three Latin terms:

  • multus: adjective meaning many or multiple.
  • con: preposition often converted to co- (as in co-worker) meaning together or mutually.
  • linealis (later converted to linearis): from linea (line), adjective meaning “resembling a line” or “belonging to a line”, among others.

After looking at these serious words, we can come up with a (VERY) liberal translation: “several things together in the same line”. From here, we just have to replace the word “things” with “predictors” (or “features”, or “independent variables”, whatever floats your boat) to build an intuition of the whole meaning of the word in the context of statistical and machine learning modeling.

If I lost you there, we can move forward with this idea instead: multicollinearity happens when there are redundant predictors in a modeling dataset. A predictor can be redundant because it shows a high pairwise correlation with other predictors, or because it is a linear combination of other predictors. For example, in a data frame with the columns a, b, and c, if the correlation between a and b is high, we can say that a and b are mutually redundant and there is multicollinearity. But also, if c is the result of a linear operation between a and b, like c <- a + b, or c <- a * 1 + b * 0.5, then we can also say that there is multicollinearity between c, a, and b.
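
If you prefer to see this idea in code, here is a minimal sketch with invented throwaway vectors (not the toy data set built later in the post): c is constructed as an exact linear combination of a and b, so even though no single pairwise correlation has to be extreme, c is fully explained by the other two.

#minimal sketch with invented data: c is a linear combination of a and b
set.seed(1)
a <- rnorm(100)
b <- rnorm(100)
c <- a * 1 + b * 0.5

#the pairwise correlations are high, but not necessarily extreme
cor(c, a)
cor(c, b)

#yet c is perfectly explained by a and b together:
#multicollinearity by linear combination
summary(lm(c ~ a + b))$r.squared #exactly 1 (up to floating point)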

Multicollinearity is a fact of life that lurks in most data sets. For example, in climate data, variables like temperature, humidity and air pressure are closely intertwined, leading to multicollinearity. That’s the case as well in medical research, where parameters like blood pressure, heart rate, and body mass index frequently display common patterns. Economic analysis is another good example, as variables such as Gross Domestic Product (GDP), unemployment rate, and consumer spending often exhibit multicollinearity.

Model Interpretation Challenges

Multicollinearity isn’t inherently problematic, but it can be a real buzzkill when the goal is interpreting predictor importance in explanatory models. In the presence of highly correlated predictors, most modeling methods, from the veteran linear models to the fancy gradient boosting, attribute a large part of the importance to only one of the predictors and not the others. In such cases, neglecting multicollinearity will certainly lead us to underestimate the relevance of certain predictors.

Let me go ahead and develop a toy data set to showcase this issue. But let’s load the required libraries first.

#load the collinear package and its example data
library(collinear)
data(vi)

#other required libraries
library(ranger)
library(dplyr)

In the vi data frame shipped with the collinear package, the variables “soil_clay” and “humidity_range” are not correlated at all (Pearson correlation = -0.06).
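
If you want to double-check that figure, a one-liner does it (vi is already loaded by data(vi) in the block above; the printed value should be close to the -0.06 reported here):

#correlation between the two original variables
cor(
  x = vi$soil_clay,
  y = vi$humidity_range,
  use = "pairwise.complete.obs" #just in case there are missing values
)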

In the code block below, the dplyr::transmute() command selects and renames them as a and b. After that, the two variables are scaled and centered, and dplyr::mutate() generates a few new columns:

  • y: response variable resulting from a linear model where a has a slope of 0.75, b has a slope of 0.25, plus a bit of white noise generated with runif().
  • c: a new predictor highly correlated with a.
  • d: a new predictor resulting from a linear combination of a and b.
set.seed(1)
df <- vi |>
  dplyr::slice_sample(n = 2000) |>
  dplyr::transmute(
    a = soil_clay,
    b = humidity_range
  ) |>
  scale() |>
  as.data.frame() |> 
  dplyr::mutate(
    y = a * 0.75 + b * 0.25 + runif(n = dplyr::n(), min = -0.5, max = 0.5),
    c = a + runif(n = dplyr::n(), min = -0.5, max = 0.5),
    d = (a + b)/2 + runif(n = dplyr::n(), min = -0.5, max = 0.5)
  )

The Pearson correlation between all pairs of these predictors is shown below.

collinear::cor_df(
  df = df,
  predictors = c("a", "b", "c", "d")
)
## # A tibble: 6 × 3
##   x     y     correlation
##   <chr> <chr>       <dbl>
## 1 c     a           0.962
## 2 d     b           0.639
## 3 d     a           0.636
## 4 d     c           0.615
## 5 b     a          -0.047
## 6 c     b          -0.042

At this point, we have two groups of predictors that are useful to understand how multicollinearity muddles model interpretation:

  • Predictors with no multicollinearity: a and b.
  • Predictors with multicollinearity: a, b, c, and d.

In the next two sections and the Appendix, I show how and why model interpretation becomes challenging when multicollinearity is high. Let’s start with linear models.

Linear Models

The code below fits multiple linear regression models for both groups of predictors.

#non-collinear predictors
lm_ab <- lm(
  formula = y ~ a + b,
  data = df
  )

#collinear predictors
lm_abcd <- lm(
  formula = y ~ a + b + c + d,
  data = df
  )

I would like you to pay attention to the estimates of the predictors a and b in both models. The estimates are the slopes of the linear model, a direct indication of the effect of a predictor on the response.

coefficients(lm_ab)[2:3] |> 
  round(4)
##      a      b 
## 0.7477 0.2616
coefficients(lm_abcd)[2:5] |> 
  round(4)
##      a      b      c      d 
## 0.7184 0.2596 0.0273 0.0039

On one hand, the model with no multicollinearity (lm_ab) achieved a pretty good solution for the coefficients of a and b. Remember that we created y as a * 0.75 + b * 0.25 plus some noise, and that’s pretty much what the model recovers here, so the interpretation is straightforward.

On the other hand, the model with multicollinearity (lm_abcd) did well with b, but there are a few things in there that make the interpretation harder.

  • The coefficient of a (0.7184) is slightly smaller than the true one (0.75), which could lead us to downplay its relationship with y by a tiny bit. This is kinda OK though, as long as one is not using the model’s results to build nukes in the basement.
  • The coefficient of c is so small that it could lead us to believe that this predictor is not important at all for explaining y. But we know that a and c are almost identical copies, so model interpretation here is definitely being muddled by multicollinearity.
  • The coefficient of d is tiny. Since d results from the average of a and b (plus noise), we could expect this predictor to be important in explaining y, but it got the short end of the stick in this case.

It is not that the model is wrong, though. This behavior of the linear model results from the QR decomposition (also known as QR factorization) applied by functions like lm(), glm(), glmnet::glmnet(), and nlme::gls() to improve numerical stability and computational efficiency, and to… address multicollinearity in the model predictors.

The QR decomposition transforms the original predictors into a set of orthogonal predictors with no multicollinearity. This is the Q matrix, created in a fashion that resembles the way in which a Principal Components Analysis generates uncorrelated components from a set of correlated variables.

The code below applies QR decomposition to our multicollinear predictors, extracts the Q matrix, and shows the correlation between the new versions of a, b, c, and d.

#predictors names
predictors <- c("a", "b", "c", "d")

#QR decomposition of predictors
df.qr <- qr(df[, predictors])

#extract Q matrix
df.q <- qr.Q(df.qr)
colnames(df.q) <- predictors

#correlation between transformed predictors
collinear::cor_df(df = df.q)
## # A tibble: 6 × 3
##   x     y     correlation
##   <chr> <chr>       <dbl>
## 1 d     c               0
## 2 c     b               0
## 3 d     b               0
## 4 d     a               0
## 5 c     a               0
## 6 b     a               0

The new predictors we are left with after the QR decomposition have exactly zero correlation! But they are not our original predictors anymore, and their interpretation changes: each column of the Q matrix keeps only the part of the original predictor that is not already contained in the columns that precede it (the quick check after this list illustrates the idea).

  • a stays essentially the same, just rescaled.
  • b is now “the part of b not in a”.
  • c is now “the part of c not in a and b”.
  • d is now “the part of d not in a, b, and c”.
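
Here is a quick sanity check of that sequential interpretation. It is only a sketch, and it assumes qr() kept the columns in their original order (which it does here, because no predictor is an exact linear combination of the others): the first Q column should be just a rescaled a, and the second should match the residuals of regressing b on a, that is, “the part of b not in a”. Both correlations should be 1 in absolute value.

#first Q column: just a rescaled version of a
cor(df.q[, "a"], df$a) |> 
  abs()

#second Q column: the part of b not explained by a
#(our predictors are centered, so the intercept plays no role here)
cor(df.q[, "b"], residuals(lm(b ~ a, data = df))) |> 
  abs()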

The result of the QR decomposition can be plugged into the solve() function along with the response vector to estimate the coefficients of the linear model.

solve(a = df.qr, b = df$y) |> 
  round(4)
##      a      b      c      d 
## 0.7189 0.2595 0.0268 0.0040

These are almost exactly the ones we got for our model with multicollinearity. In the end, the coefficients resulting from a linear model are not those of the original predictors, but the ones of their uncorrelated versions generated by the QR decomposition.

But this is not the only issue of model interpretability under multicollinearity. Let’s take a look at the standard errors of the estimates. These are a measure of the coefficient estimation uncertainty, and are used to compute the p-values of the estimates. As such, they are directly linked with the “statistical significance” (whatever that means) of the predictors within the model.

The code below shows the standard errors of the model without and with multicollinearity.

summary(lm_ab)$coefficients[, "Std. Error"][2:3] |> 
  round(4)
##      a      b 
## 0.0066 0.0066
summary(lm_abcd)$coefficients[, "Std. Error"][2:5] |> 
  round(4)
##      a      b      c      d 
## 0.0267 0.0134 0.0232 0.0230

The standard errors of the model with multicollinearity are several times higher than the ones of the model without multicollinearity (about four times higher in the case of a).

Since our toy data set is relatively large (2000 cases) and the relationship between the response and a few of the predictors is pretty robust, no real issues arise here: these differences in estimation precision are not enough to change the p-values of the estimates. However, in a small data set with high multicollinearity and a weaker relationship between the response and the predictors, the standard errors of the estimates become wide, which increases p-values and reduces “significance”. Such a situation might lead us to believe that a predictor does not explain the response when in fact it does. And this, again, is a model interpretability issue caused by multicollinearity.
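
To get a feeling for this, here is a small experiment you can try. It is just a sketch, and the subsample size of 50 is an arbitrary choice of mine: refit the collinear model on a small random subsample of df and compare the standard errors and p-values with the ones shown above. The standard errors should come out considerably larger.

#refit the collinear model on a small random subsample
#(n = 50 is an arbitrary choice for this sketch)
set.seed(1)
df_small <- dplyr::slice_sample(df, n = 50)

lm_abcd_small <- lm(
  formula = y ~ a + b + c + d,
  data = df_small
  )

#standard errors and p-values of the predictors
summary(lm_abcd_small)$coefficients[2:5, c("Std. Error", "Pr(>|t|)")] |> 
  round(4)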

At the end of this post there is an appendix with code examples of other types of linear models that use QR decomposition and become challenging to interpret in the presence of multicollinearity. Play with them as you please!

Now, let’s take a look at how multicollinearity can also mess up the interpretation of a commonly used machine learning algorithm.

Random Forest

It is not uncommon to hear something like “random forest is insensitive to multicollinearity”. Actually, I can neither confirm nor deny that I have said that before. Anyway, it is kind of true if one is focused on prediction problems. However, when the aim is interpreting predictor importance scores, then one has to be mindful about multicollinearity as well.

Let’s see an example. The code below fits two random forest models with our two sets of predictors.

#non-collinear predictors
rf_ab <- ranger::ranger(
  formula = y ~ a + b,
  data = df,
  importance = "permutation",
  seed = 1 #for reproducibility
)

#collinear predictors
rf_abcd <- ranger::ranger(
  formula = y ~ a + b + c + d,
  data = df,
  importance = "permutation",
  seed = 1
)

Let’s take a look at the prediction error of the two models on the out-of-bag data. While building each regression tree, Random Forest leaves a random subset of the data out. Each case then gets a prediction from all the trees that did not use it for training, and the prediction error is averaged across all cases to produce the numbers below.

rf_ab$prediction.error
## [1] 0.1026779
rf_abcd$prediction.error
## [1] 0.1035678

According to these numbers, these two models are basically equivalent in their ability to predict our response y.

Now, you may have noticed that I set the argument importance to “permutation”. Permutation importance quantifies how the out-of-bag error increases when a predictor is permuted across all the trees where it is used. It is a pretty robust importance metric that bears no resemblance whatsoever to the coefficients of a linear model. Think of it as a very different way to answer the question “what variables are important in this model?”.
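
If the idea still feels abstract, the sketch below mimics it by hand on the full data set. This is not ranger’s internal computation (which permutes each predictor within the out-of-bag data of every tree); it just shuffles one predictor, re-predicts, and checks how much the mean squared error grows.

#manual sketch of permutation importance (not ranger's exact method)
set.seed(1)

#baseline error of the collinear model on the full data set
baseline_error <- mean(
  (df$y - predict(rf_abcd, data = df)$predictions)^2
)

#shuffle the predictor a and compute the error again
df_shuffled <- df
df_shuffled$a <- sample(df_shuffled$a)
shuffled_error <- mean(
  (df$y - predict(rf_abcd, data = df_shuffled)$predictions)^2
)

#the increase in error is the (approximate) importance of a
shuffled_error - baseline_error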

The permutation importance scores of the two random forest models are shown below.

rf_ab$variable.importance |> round(4)
##      a      b 
## 1.0702 0.1322
rf_abcd$variable.importance |> round(4)
##      a      b      c      d 
## 0.5019 0.0561 0.1662 0.0815

There is one interesting detail here. The predictor a has a permutation importance three times higher than c in the second model, even though we could expect them to be similar due to their very high correlation. There are two reasons for this mismatch:

  • Random Forest is much more sensitive to the white noise in c than linear models are, especially in the deep parts of the regression trees, where the within-split data partially decouples c from the response y. As a consequence, c does not get selected as often as a in these deeper areas of the trees, and ends up with less overall importance.
  • The predictor c competes with d, which carries around 50% of the information in c (and a). If we remove d from the model, as in the code below, the permutation importance of c roughly doubles. So, with d in the model, we underestimate the real importance of c due to multicollinearity alone.
rf_abc <- ranger::ranger(
  formula = y ~ a + b + c,
  data = df,
  importance = "permutation",
  seed = 1
)
rf_abc$variable.importance |> round(4)
##      a      b      c 
## 0.5037 0.1234 0.3133

With all that in mind, we can conclude that interpreting importance scores in Random Forest models is challenging when multicollinearity is high. But Random Forest is not the only machine learning method affected by this issue. In the Appendix below I have left an example with Extreme Gradient Boosting so you can play with it.

And that’s all for now, folks, I hope you found this post useful!

Appendix

This section shows several extra examples of linear and machine learning models you can play with.

Other linear models using QR decomposition

As I commented above, many linear modeling functions use QR decomposition, and you will have to be careful interpreting model coefficients in the presence of strong multicollinearity in the predictors.

Here I show several examples with glm() (Generalized Linear Models), nlme::gls() (Generalized Least Squares), and glmnet::cv.glmnet() (Elastic Net Regularization). In all of them, no matter how fancy, the interpretation of coefficients becomes tricky when multicollinearity is high.

Generalized Linear Models with glm()

#Generalized Linear Models
#non-collinear predictors
glm_ab <- glm(
  formula = y ~ a + b,
  data = df
  )

round(coefficients(glm_ab), 4)[2:3]
##      a      b 
## 0.7477 0.2616
#collinear predictors
glm_abcd <- glm(
  formula = y ~ a + b + c + d,
  data = df
  )

round(coefficients(glm_abcd), 4)[2:5]
##      a      b      c      d 
## 0.7184 0.2596 0.0273 0.0039

Generalized Least Squares with nlme::gls()

library(nlme)
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
#Generalized Least Squares
#non-collinear predictors
gls_ab <- nlme::gls(
  model = y ~ a + b,
  data = df
  )

round(coefficients(gls_ab), 4)[2:3]
##      a      b 
## 0.7477 0.2616
#collinear predictors
gls_abcd <- nlme::gls(
  model = y ~ a + b + c + d,
  data = df
  )

round(coefficients(gls_abcd), 4)[2:5]
##      a      b      c      d 
## 0.7184 0.2596 0.0273 0.0039

Elastic Net Regularization and Lasso penalty with glmnet::glmnet()

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
#Elastic net regularization with Lasso penalty
#non-collinear predictors
glmnet_ab <- glmnet::cv.glmnet(
  x = as.matrix(df[, c("a", "b")]),
  y = df$y,
  alpha = 1 #lasso penalty
)

round(coef(glmnet_ab$glmnet.fit, s = glmnet_ab$lambda.min), 4)[2:3]
## [1] 0.7438 0.2578
#collinear predictors
glmnet_abcd <- glmnet::cv.glmnet(
  x = as.matrix(df[, c("a", "b", "c", "d")]),
  y = df$y,
  alpha = 1 
)

#notice that the lasso regularization almost nuked the coefficients of predictors c and d
round(coef(glmnet_abcd$glmnet.fit, s = glmnet_abcd$lambda.min), 4)[2:5]
## [1] 0.7101 0.2507 0.0267 0.0149

Extreme Gradient Boosting under multicollinearity

Gradient Boosting models trained with multicollinear predictors behave in a way similar to linear models with QR decomposition. When two variables are highly correlated, one of them is going to have an importance much higher than the other.

library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
#without multicollinearity
gb_ab <- xgboost::xgboost(
  data = as.matrix(df[, c("a", "b")]),
  label = df$y,
  objective = "reg:squarederror",
  nrounds = 100,
  verbose = FALSE
  )

#with multicollinearity
gb_abcd <- xgboost::xgboost(
  data = as.matrix(df[, c("a", "b", "c", "d")]),
  label = df$y,
  objective = "reg:squarederror",
  nrounds = 100,
  verbose = FALSE
)
xgb.importance(model = gb_ab)[, c(1:2)]
##    Feature      Gain
## 1:       a 0.8463005
## 2:       b 0.1536995
xgb.importance(model = gb_abcd)[, c(1:2)] |> 
  dplyr::arrange(Feature)
##    Feature       Gain
## 1:       a 0.78129661
## 2:       b 0.07386393
## 3:       c 0.03595619
## 4:       d 0.10888327

But there is a twist too. When two variables are perfectly correlated, one of them is removed right away from the model!

#replace c with perfect copy of a
df$c <- df$a

#with multicollinearity
gb_abcd <- xgboost::xgboost(
  data = as.matrix(df[, c("a", "b", "c", "d")]),
  label = df$y,
  objective = "reg:squarederror",
  nrounds = 100,
  verbose = FALSE
)

xgb.importance(model = gb_abcd)[, c(1:2)] |> 
  dplyr::arrange(Feature)
##    Feature       Gain
## 1:       a 0.79469959
## 2:       b 0.07857141
## 3:       d 0.12672900