# Nyström approximation

$\DeclareMathOperator*{\argmin}{arg\,min}$ $\DeclareMathOperator*{\argmax}{arg\,max}$

Numerical linear algebra relies on performing fast and efficient calculations with matrices. However, when matrices become very large, computational bottlenecks can arise even for the most efficient numerical linear algebra techniques. Thus, there’s a need for methods that circumvent these bottlenecks by performing approximations that trade accuracy for computational efficiency.

In this post, we describe one of the simplest and most general approximations, widely known as the Nyström approximation. We focus on the case of positive definite covariance matrices, although similar approximations have been made in more general cases.

## Covariance matrices

Suppose we have $n$ data points $x_1, \dots, x_n$. Let $K$ be an $n \times n$ covariance matrix (which we will also refer to as a kernel matrix), where $[K]_{ij} = k(x_i, x_j)$, and $k$ is a positive definite kernel function.

When $n$ is large, performing operations with $K$ will be expensive (recall that matrix storage costs $O(n^2)$ and matrix inversion costs $O(n^3)$). Thus, it’s of interest to find cheaper ways to approximate this kernel matrix, which can make downstream operations more efficient. The Nyström approximation is one such approach, which we explore next.

## Nyström approximation

The basic idea behind the Nyström approximation is to describe the covariance matrix in terms of a smaller number of points. A full $n \times n$ covariance matrix describes the relationship between all $\frac{n^2 - n}{2}$ pairs of points, which can be expensive when $n$ is large. On the other hand, the Nyström approximation selects $m < n$ points (which we’ll call the “representative” points here) and approximates the covariance matrix by describing each of the original point’s relationship to these $m$ representative points.

More conceretly, a Nyström approximation makes the assumption that $\text{rank}(K) = m < n$. By making this low-rank assumption, we can avoid having to work with the full large $n \times n$ matrix.

### External represenatative points

To see how this works in practice, suppose we randomly choose $m$ new points to be the “representative” samples of the dataset. Let’s call these points $\widetilde{x}_1, \dots, \widetilde{x}_m$. Then we can augment $K$ with these new points in the form of a block matrix:

$\widetilde{K} = \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix},$

where $K_{11}$ is the $m \times m$ kernel matrix for our representative points, $K_{22} = K$ is our $n \times n$ kernel matrix for the original points, and $K_{12}$ is the $m \times n$ kernel matrix for the cross-dataset covariances.

Let’s write $\widetilde{K}$ in terms of its SVD:

$\widetilde{K} = U \Lambda U^\top = \begin{bmatrix} U_1 \Lambda U_1^\top & U_1 \Lambda U_2^\top \\ U_2 \Lambda U_1^\top & U_2 \Lambda U_2^\top \end{bmatrix}.$

## References

• Rasmussen, Carl Edward. “Gaussian processes in machine learning.” Summer school on machine learning. Springer, Berlin, Heidelberg, 2003.
• Liu, Haitao, et al. “When Gaussian process meets big data: A review of scalable GPs.” IEEE transactions on neural networks and learning systems 31.11 (2020): 4405-4423.
• This stackoverflow post.