Andy Jones

The linear algebra of ridge regression

Ridge regression — a regularized variant of ordinary least squares — is useful for dealing with collinearity and non-identifiability. Here, we’ll explore some of the linear algebra behind it.

Introduction

Let $\mathbf{X} \in \mathbb{R}^{n \times p}$ be a design matrix made up of $n$ samples characterized by $p$ covariates, and let $\mathbf{Y} \in \mathbb{R}^{n}$ be a vector of response variables. Often, we are interested in finding a linear relationship between $\mathbf{X}$ and $\mathbf{Y}$:

\[\mathbf{Y} = \mathbf{X} \mathbf{\beta} + \epsilon\]

where $\mathbf{\beta} \in \mathbb{R}^p$ is a vector of coefficients, and $\epsilon$ is some noise (error) around the line or hyperplane.

Recall that the ordinary least squares (OLS) estimator for $\beta$ is

\[\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}.\]
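As a quick sanity check of this formula, here is a minimal NumPy sketch on a synthetic dataset (the data, dimensions, and variable names are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))                   # design matrix (n samples, p covariates)
beta_true = rng.normal(size=p)                # "true" coefficients for the simulation
Y = X @ beta_true + 0.1 * rng.normal(size=n)  # responses with a little noise

# OLS estimate: (X^T X)^{-1} X^T Y.
# Solving the normal equations is numerically preferable to forming the inverse.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
```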

Ridge regression

If $p > n$ or if the columns of $\mathbf{X}$ are collinear, then $\mathbf{X}^\top \mathbf{X}$ is singular (or nearly so), and the OLS estimate of $\beta$ is unstable and not identifiable (i.e., a single best solution does not exist).

One method for dealing with this problem is ridge regression. The estimate of $\beta$ in ridge regression is:

\[\hat{\beta} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_p)^{-1} \mathbf{X}^\top \mathbf{Y}\]

where $\lambda \in [0, \infty)$ is a tunable parameter, and $\mathbf{I}_p$ is the $p\times p$ identity matrix. (Notice that this is the OLS estimator when $\lambda = 0$.)
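Here is the same sketch with the ridge estimate; only the matrix being inverted changes (again with NumPy, synthetic data, and an arbitrary choice of $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Ridge estimate: (X^T X + lambda * I_p)^{-1} X^T Y
lam = 1.0                                            # tunable regularization strength
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# With lam = 0 this reduces to the OLS estimate
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.linalg.norm(beta_ridge) <= np.linalg.norm(beta_ols))  # True: ridge shrinks
```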

Intuitively, ridge regression stabilizes the estimate by constraining the problem. There are multiple ways to understand the effect of switching from OLS to the ridge estimator, and we’ll discuss a few below.

Linear algebra of ridge regression

One way to understand ridge regression is in terms of linear algebra and linear transformations. Recall that the singular value decomposition (SVD) of a matrix $\mathbf{X}$ is

\[\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^\top\]

where $\mathbf{U}$’s columns are the left singular vectors, $\mathbf{V}$’s columns are the right singular vectors, and $\mathbf{D}$ is a diagonal matrix whose diagonal elements are the singular values. (Recall that this is equivalent to saying that $\mathbf{U}$’s columns contain the eigenvectors of $\mathbf{X} \mathbf{X}^\top$, $\mathbf{V}$’s columns contain the eigenvectors of $\mathbf{X}^\top \mathbf{X}$, and $\mathbf{D}$’s diagonal contains the square roots of the eigenvalues of $\mathbf{X} \mathbf{X}^\top$ or $\mathbf{X}^\top \mathbf{X}$.)
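These facts are easy to verify numerically; a small sketch with NumPy and an arbitrary matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Thin SVD: U is n x p, d holds the singular values, Vt is V^T (p x p)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(X, U @ np.diag(d) @ Vt)

# The singular values are the square roots of the eigenvalues of X^T X
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]   # eigenvalues, sorted descending
assert np.allclose(d, np.sqrt(eigvals))
```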

We can rewrite the OLS estimator using the SVD of $\mathbf{X}$:

\begin{align} \hat{\beta} &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \\ &= (\mathbf{V} \mathbf{D} \mathbf{U}^\top \mathbf{U} \mathbf{D} \mathbf{V}^\top)^{-1} \mathbf{V} \mathbf{D} \mathbf{U}^\top \mathbf{Y} \\ &= (\mathbf{V} \mathbf{D}^2 \mathbf{V}^\top)^{-1} \mathbf{V} \mathbf{D} \mathbf{U}^\top \mathbf{Y} && \text{($\mathbf{U}$ is orthogonal)} \\ &= \mathbf{V} \mathbf{D}^{-2} \mathbf{V}^\top \mathbf{V} \mathbf{D} \mathbf{U}^\top \mathbf{Y} \\ &= \mathbf{V} \mathbf{D}^{-1} \mathbf{U}^\top \mathbf{Y} && \text{($\mathbf{V}$ is orthogonal)} \\ \end{align}

Notice that $\mathbf{V}\mathbf{D}^{-1}\mathbf{U}^\top$ is essentially performing the inverse transformation of $\mathbf{U} \mathbf{D} \mathbf{V}^\top$, the original SVD of $\mathbf{X}$. One way to understand this is that our original goal was to solve the linear system $\mathbf{Y} = \mathbf{X} \beta$; if $\mathbf{X}$ were square and invertible, we could simply apply its inverse to get $\beta = \mathbf{X}^{-1} \mathbf{Y}$. Since $\mathbf{X}$ is generally not square, the solution $\hat{\beta} = \mathbf{V} \mathbf{D}^{-1} \mathbf{U}^\top \mathbf{Y}$ plays the same role using the pseudoinverse, undoing each of the transformations in $\mathbf{X}$’s SVD.
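We can check both claims numerically: the SVD form recovers the OLS estimate, and $\mathbf{V} \mathbf{D}^{-1} \mathbf{U}^\top$ matches NumPy’s Moore-Penrose pseudoinverse (synthetic data as before):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# OLS via the SVD: V D^{-1} U^T Y
beta_svd = Vt.T @ np.diag(1.0 / d) @ U.T @ Y
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(beta_svd, beta_ols)

# V D^{-1} U^T is the (Moore-Penrose) pseudoinverse of X
assert np.allclose(Vt.T @ np.diag(1.0 / d) @ U.T, np.linalg.pinv(X))
```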

Note also that decreasing the diagonal entries of $\mathbf{D}^{-1}$ will decrease the magnitude of the coefficient estimates.

Similar to above, we can also rewrite the ridge estimator:

\begin{align} \hat{\beta} &= (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_p)^{-1} \mathbf{X}^\top \mathbf{Y} \\ &= (\mathbf{V} \mathbf{D} \mathbf{U}^\top \mathbf{U} \mathbf{D} \mathbf{V}^\top + \lambda \mathbf{I}_p)^{-1} \mathbf{V} \mathbf{D} \mathbf{U}^\top \mathbf{Y} \\ &= (\mathbf{V} \mathbf{D}^2 \mathbf{V}^\top + \lambda \mathbf{I}_p)^{-1} \mathbf{V} \mathbf{D} \mathbf{U}^\top \mathbf{Y} && \text{($\mathbf{U}$ is orthogonal)} \\ &= (\mathbf{V} \mathbf{D}^2 \mathbf{V}^\top + \lambda \mathbf{V} \mathbf{V}^\top)^{-1} \mathbf{V} \mathbf{D} \mathbf{U}^\top \mathbf{Y} && \text{($\mathbf{V}$ is orthogonal, so $\mathbf{I}_p = \mathbf{V} \mathbf{V}^\top$)} \\ &= \mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I}_p)^{-1} \mathbf{V}^\top \mathbf{V} \mathbf{D} \mathbf{U}^\top \mathbf{Y} \\ &= \mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I}_p)^{-1} \mathbf{D} \mathbf{U}^\top \mathbf{Y} \\ \end{align}
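The same kind of numerical check works for the ridge form (same synthetic setup, arbitrary $\lambda$); note that $(\mathbf{D}^2 + \lambda \mathbf{I}_p)^{-1} \mathbf{D}$ is just the diagonal matrix with entries $d_i / (d_i^2 + \lambda)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
lam = 1.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Ridge via the SVD: V (D^2 + lambda I)^{-1} D U^T Y
beta_svd = Vt.T @ np.diag(d / (d**2 + lam)) @ U.T @ Y

# Closed-form ridge estimate for comparison
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
assert np.allclose(beta_svd, beta_ridge)
```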

Now, let’s directly compare the OLS and ridge estimators:

\begin{align} \hat{\beta}_{\text{OLS}} &= \mathbf{V} \mathbf{D}^{-1} \mathbf{U}^\top \mathbf{Y} \\ &= \mathbf{V} (\mathbf{D}^2)^{-1} \mathbf{D} \mathbf{U}^\top \mathbf{Y} \\ \end{align}

\[\hat{\beta}_{\text{ridge}} = \mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I}_p)^{-1} \mathbf{D} \mathbf{U}^\top \mathbf{Y}\]

Notice that these are the same, except for the term inside the inverse: $\mathbf{D}^2$ vs. $\mathbf{D}^2 + \lambda \mathbf{I}_p$. The ridge estimates are essentially the OLS estimates with the component along each singular direction multiplied by the factor $\frac{d_i^2}{d_i^2 + \lambda}$, where $d_i$ is the $i$’th diagonal entry of $\mathbf{D}$; this factor always lies between zero and one. As mentioned above, this has the effect of shrinking the coefficient estimates toward zero. Further, directions with a smaller singular value $d_i$ are shrunk more than directions with a large $d_i$. In other words, directions in covariate space that account for very little of the variance in the data are pushed toward zero most aggressively.
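We can also look at these shrinkage factors, $d_i^2 / (d_i^2 + \lambda)$, directly (a quick NumPy sketch; the data and $\lambda$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
lam = 1.0

d = np.linalg.svd(X, compute_uv=False)   # singular values, largest first
shrinkage = d**2 / (d**2 + lam)          # one shrinkage factor per singular direction

# Directions with small singular values get factors closer to zero (more shrinkage)
print(np.round(d, 2))
print(np.round(shrinkage, 3))
```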

Also, if $\mathbf{X}^\top \mathbf{X}$ is singular (equivalently, if any of $\mathbf{X}$’s singular values are zero), ridge regression pushes its eigenvalues upward, forcing the matrix being inverted to be nonsingular. Effectively, we invert $\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_p = \mathbf{V} (\mathbf{D}^2 + \lambda \mathbf{I}_p) \mathbf{V}^\top$ instead of $\mathbf{V} \mathbf{D}^2 \mathbf{V}^\top$, and the eigenvalues $d_i^2 + \lambda$ are strictly positive for any $\lambda > 0$. This can be very useful when $p > n$.
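A final sketch of the $p > n$ case (synthetic data again): $\mathbf{X}^\top \mathbf{X}$ is rank-deficient, so the plain OLS formula has no unique solution, but the ridge system is well-posed for any $\lambda > 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                          # more covariates than samples
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# X^T X is p x p but has rank at most n < p, so it is singular
print(np.linalg.matrix_rank(X.T @ X))  # 10

# Adding lambda * I_p makes the system invertible
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
print(beta_ridge.shape)                # (50,)
```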
