These posts are informal notes and may contain errors — please let me know if you spot any.
Neural networks effectively become equivalent to Gaussian processes as the number of hidden units tends to infinity.
Matérn kernels, which can be seen as generalizations of the RBF kernel, allow for modeling non-smooth functions with Gaussian processes.
Conjugate gradient descent is an approach to optimization that accounts for second-order structure of the objective function.
Traditional Gaussian processes only allow for a single scalar output. However, we often are interested in understanding the relationship between the input and *multiple* outputs. The linear model of coregionalization is one approach for doing this.
Constructing and evaluating positive semidefinite kernel functions is a major challenge in statistics and machine learning. By leveraging Bochner's theorem, we can approximate a kernel function by transforming samples from its spectral density.
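As a rough illustration of this idea, the sketch below approximates an RBF kernel with random Fourier features. The choice of the RBF kernel, the lengthscale, the number of features, and the function name `random_fourier_features` are all assumptions made for this example rather than details taken from the post.

```python
import numpy as np

def random_fourier_features(X, num_features, lengthscale, rng):
    """Map X (n x d) to features whose inner products approximate an RBF kernel."""
    d = X.shape[1]
    # The spectral density of the RBF kernel is Gaussian with std 1 / lengthscale.
    W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))
    # Random phases make the cosine features unbiased for the kernel.
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
lengthscale = 1.0  # assumed value for illustration

Z = random_fourier_features(X, num_features=10000, lengthscale=lengthscale, rng=rng)
K_approx = Z @ Z.T

# Exact RBF kernel for comparison.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * lengthscale**2))

max_abs_error = np.max(np.abs(K_approx - K_exact))
print(f"max abs error: {max_abs_error:.4f}")
```

The approximation error should shrink roughly like one over the square root of the number of features, so `num_features` trades accuracy against cost.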
Normalizing flows are a family of methods for flexibly approximating complex distributions. By combining ideas from probability theory, statistics, and deep learning, the learned distributions can be much more complex than traditional approaches to density estimation.
Martingales are a special type of stochastic process that are, in a sense, unpredictable.
Inducing point approximations for Gaussian processes can be formalized into a Bayesian model using a variational inference approach.
The GPLVM provides a principled approach to nonlinear dimensionality reduction. Here, we review its relationship to probabilistic PCA and provide a rough overview of how to fit the model.
Automatic differentiation (AD) refers to a family of algorithms that can be used to compute derivatives of functions in a systematized way.
Most research in machine learning and computational statistics focuses on advancing methodology. However, a less-hyped topic — but an extremely important one — is the actual implementation of these methods using programming languages and compilers.
The mixture of factor analyzers model combines clustering and dimensionality reduction by allowing different regions of the data space to be modeled by different low-dimensional approximations.
Stochastic variational inference (SVI) is a family of methods that exploits stochastic optimization techniques to speed up variational approaches and scale them to large datasets.
The natural gradient generalizes the classical gradient to account for non-Euclidean geometries.
Linear dimensionality reduction is a cornerstone of machine learning and statistics, and it comes in a zoo of different methods. Here we review a 2015 paper by Cunningham and Ghahramani that unifies this zoo by casting each method as a special case of a very general optimization problem.
The radial basis function (RBF) kernel is one of the most commonly-used kernels in kernel methods. Here, we show how the kernel arises from taking an infinite polynomial feature expansion.
Inducing points provide a strategy for lowering the computational cost of Gaussian process prediction by closely modeling only a subset of the input space.
Minimizing the $\chi^2$ divergence between a true posterior and an approximate posterior is equivalent to minimizing an upper bound on the log marginal likelihood.
Here, we discuss and visualize the mode-seeking behavior of the reverse KL divergence.
Mixed models are effectively a special case of hierarchical models. In this post, I try to draw some connections between these jargon-filled modeling approaches.
Bayesian posterior inference requires the analyst to specify a full probabilistic model of the data generating process. Gibbs posteriors are a broader family of distributions that are intended to relax this requirement and to allow arbitrary loss functions.
Estimators based on sampling schemes can be 'Rao-Blackwellized' to reduce their variance.
Thompson sampling is a simple Bayesian approach to selecting actions in a multi-armed bandit setting.
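A minimal sketch of Thompson sampling for a Bernoulli bandit is shown below. The arm probabilities in `true_probs`, the Beta(1, 1) priors, and the number of rounds are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.2, 0.5, 0.8])  # hypothetical arm reward probabilities
num_arms = len(true_probs)

# Beta(1, 1) priors on each arm's success probability.
successes = np.ones(num_arms)
failures = np.ones(num_arms)
pull_counts = np.zeros(num_arms, dtype=int)

for _ in range(2000):
    # Draw one plausible success probability per arm from its posterior.
    sampled_probs = rng.beta(successes, failures)
    # Act greedily with respect to the sampled values.
    arm = int(np.argmax(sampled_probs))
    reward = int(rng.random() < true_probs[arm])
    # Conjugate Beta-Bernoulli posterior update.
    successes[arm] += reward
    failures[arm] += 1 - reward
    pull_counts[arm] += 1

print(pull_counts)
```

Because each action is chosen by sampling from the posterior, arms with uncertain estimates still get explored, while arms that are clearly worse are pulled less and less often.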
Bayesian models provide a principled way to make inferences about underlying parameters. But under what conditions do those inferences converge to the truth?
Describing $\chi$ random variables as the lengths of vectors.
'Recommending that scientists use Bayes' theorem is like giving the neighborhood kids the key to your F-16' and other critiques.
The power iteration algorithm is a numerical approach to computing the top eigenvector and eigenvalue of a matrix.
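The basic loop can be sketched in a few lines. The function name `power_iteration`, the fixed iteration count, and the small test matrix are assumptions for this example; a more careful version would add a convergence check.

```python
import numpy as np

def power_iteration(A, num_iters=200, seed=0):
    """Estimate the top eigenvalue and eigenvector of a square matrix A."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        # Repeatedly apply A and renormalize to unit length.
        w = A @ v
        v = w / np.linalg.norm(w)
    # The Rayleigh quotient gives the eigenvalue estimate for unit-norm v.
    eigval = v @ A @ v
    return eigval, v

A = np.array([[2.0, 1.0], [1.0, 3.0]])
eigval, eigvec = power_iteration(A)
print(eigval, eigvec)
```

The convergence rate depends on the ratio between the second-largest and largest eigenvalue magnitudes, so matrices with a small gap need more iterations.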
Belief propagation is a family of message passing algorithms, often used for computing marginal distributions and maximum a posteriori (MAP) estimates of random variables that have a graph structure.
In this post, we try to visualize a couple simple differential equations and their solutions with a few lines of Python code.
Cubic splines are flexible nonparametric models. Here, we discuss some of the spline fundamentals.
A brief review of shrinkage in ridge regression and a comparison to OLS.
The binomial model is a simple method for determining the prices of options.
BFGS is a second-order optimization method -- a close relative of Newton's method -- that approximates the Hessian of the objective function.
Tweedie distributions are a very general family of distributions that includes the Gaussian, Poisson, and Gamma (among many others) as special cases.
Here, we discuss two distributions which arise as scale mixtures of normals: the Laplace and the Student-$t$.
The Concrete distribution is a relaxation of discrete distributions.
A brief review of Gaussian processes with simple visualizations.
A sketch of the derivation for Ito's Lemma and a simple example.
A brief review of three types of stochastic processes: Wiener processes, generalized Wiener processes, and Ito processes.
Slice sampling is a method for obtaining random samples from an arbitrary distribution. Here, we walk through the basic steps of slice sampling and present two visual examples.
Bayesian model averaging provides a way to combine information across statistical models and account for the uncertainty embedded in each.
The James-Stein estimator dominates the MLE by sharing information across seemingly unrelated variables.
Expectation maximization can be seen as a special case of variational inference when the approximating distribution for the parameters $q(\theta)$ is taken to be a point mass.
Copulas are flexible statistical tools for modeling correlation structure between variables.
Schur complements are quantities that arise often in linear algebra in the context of block matrix inversion. Here, we review the basics and show an application in statistics.
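As a quick numerical check of the block-inversion connection, the sketch below builds a random positive definite matrix, forms the Schur complement of its upper-left block, and compares it to the corresponding block of the inverse. The matrix and block sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))
M = G @ G.T + 5.0 * np.eye(5)  # symmetric positive definite, so all blocks invert

# Partition M into blocks [[A, B], [C, D]].
A, B = M[:3, :3], M[:3, 3:]
C, D = M[3:, :3], M[3:, 3:]

# Schur complement of the block A in M.
S = D - C @ np.linalg.solve(A, B)

# The lower-right block of the inverse of M equals the inverse of S.
M_inv = np.linalg.inv(M)
lower_right = M_inv[3:, 3:]
print(np.allclose(lower_right, np.linalg.inv(S)))
```

If `M` is read as the covariance matrix of a joint Gaussian, `S` is the conditional covariance of the second block of variables given the first.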
Hamiltonian Monte Carlo (HMC) is an MCMC method that borrows ideas from physics. Here, we'll give a brief overview and a simple example implementation.
In this post, we draw a simple connection between the optimization problems for NMF and PMF.
Markov Chain Monte Carlo (MCMC) methods encompass a broad class of tools for fitting Bayesian models. Here, we'll review some of the basic motivation behind MCMC and a couple of the most well-known methods.
There exists a duality between maximum likelihood estimation and finding the maximum entropy distribution subject to a set of linear constraints.
Bayesian and frequentist methods can lead people to very different conclusions. One instance of this is exemplified in Lindley's paradox, in which a hypothesis test arrives at opposite conclusions depending on whether a Bayesian or a frequentist test is used.
Choosing a prior distribution is a philosophically and practically challenging part of Bayesian data analysis. Noninformative priors try to skirt this issue by placing equal weight on all possible parameter values; however, these priors are often 'improper' -- we review this issue here.
Probabilistic PCA generalizes traditional PCA into a probabilistic model whose maximum likelihood estimate corresponds to the traditional version. Here, we give step-by-step derivations for some of the quantities of interest.
Principal component analysis is a widely-used dimensionality reduction technique. However, PCA has an implicit connection to the Gaussian distribution, which may be undesirable for non-Gaussian data. Here, we'll see a second approach for generalizing PCA to other distributions introduced by Andrew Landgraf in 2015.
Reduced-rank regression is a method for finding associations between two high-dimensional datasets with paired samples.
In many types of programming, random seeds are used to make computational results reproducible by generating a known set of random numbers. However, the choice of a random seed can affect results in non-trivial ways.
Condition numbers measure the sensitivity of a function to changes in its inputs. We review this concept here, along with some specific examples to build intuition.
In prediction problems, we often fit one model, evaluate its performance, and test it on unseen data. But what if we could combine multiple models at once and leverage their combined performance? This is the spirit of 'boosting': creating an ensemble of learning algorithms, which perform better together than each does independently. Here, we'll give a quick overview of boosting, and we'll review one of the most influential boosting algorithms, AdaBoost.
Statistical 'whitening' is a family of procedures for standardizing and decorrelating a set of variables. Here, we'll review this concept in a general sense, and see two specific examples.
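One common member of this family is ZCA whitening, sketched below. The choice of ZCA, the synthetic data, and the `mixing` matrix are assumptions for this example and may not match the two examples covered in the post.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical mixing matrix used to create correlated synthetic data.
mixing = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, -1.0, 0.5]])
X = rng.normal(size=(1000, 3)) @ mixing.T

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# ZCA whitening matrix: the inverse symmetric square root of the covariance.
eigvals, eigvecs = np.linalg.eigh(cov)
W_zca = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
X_white = X_centered @ W_zca

# After whitening, the sample covariance is the identity.
cov_white = np.cov(X_white, rowvar=False)
print(np.round(cov_white, 6))
```

Any rotation of `X_white` is also white, which is why whitening is a family of procedures rather than a single transformation.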
Matrix decomposition methods factor a matrix $A$ into a product of two other matrices, $A = BC$. In this post, we review some of the most common matrix decompositions, and why they're useful.
On 4/2/20, the Dana-Farber Cancer Institute and the Brown Institute at Columbia hosted a 'zoomposium' (symposium via Zoom) about epidemiological modeling of the COVID-19 pandemic. These are my notes from the speakers' presentations.
As their name suggests, 'quasi-likelihoods' are quantities that aren't formally likelihood functions, but can be used as replacements for formal likelihoods in more general settings.
Variable selection is an important part of high-dimensional statistical modeling. Many popular approaches for variable selection, such as LASSO, suffer from bias. The smoothly clipped absolute deviation (SCAD) estimator attempts to alleviate this bias issue, while also retaining a continuous penalty that encourages sparsity.
In this post, we'll review a family of fundamental classification algorithms: linear and quadratic discriminant analysis.
The representer theorem is a powerful result that implies a certain type of duality between solutions to function estimation problems.
Generalized linear models are flexible tools for modeling various response distributions. This post covers one common way of fitting them.
In this post, we cover a condition that is necessary and sufficient for the LASSO estimator to work correctly.
The log-derivative trick is really just a simple application of the chain rule. However, it allows us to rewrite expectations in a way that is amenable to Monte Carlo approximation.
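Concretely, the trick rewrites the gradient of an expectation as $\nabla_\theta \mathbb{E}_{p_\theta}[f(x)] = \mathbb{E}_{p_\theta}[f(x) \nabla_\theta \log p_\theta(x)]$, which can be estimated from samples. The sketch below checks this on a toy problem; the Gaussian $p_\theta = N(\mu, 1)$, the test function $f(x) = x^2$, and the sample size are assumptions chosen so that the exact answer is known.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.5
num_samples = 200_000

def f(x):
    return x ** 2

x = rng.normal(loc=mu, scale=1.0, size=num_samples)

# Score function: d/dmu log N(x; mu, 1) = x - mu.
score = x - mu

# Monte Carlo estimate of d/dmu E[f(x)] using the log-derivative identity.
grad_estimate = np.mean(f(x) * score)

# Exact gradient, since E[x^2] = mu^2 + 1.
grad_exact = 2.0 * mu
print(grad_estimate, grad_exact)
```

This estimator only needs evaluations of `f`, not its derivative, although its variance can be large in practice.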
When we construct and analyze statistical estimators, we often assume that the model is correctly specified. However, in practice, this is rarely the case --- our assumed models are usually approximations of the truth, but they're useful nonetheless.
The Gumbel max trick is a method for sampling from discrete distributions using only a deterministic function of the distributions' parameters and independent Gumbel noise.
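In code, the trick amounts to adding independent standard Gumbel noise to the log-probabilities and taking the argmax. The probabilities in `probs` and the sample size below are arbitrary values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.3])  # hypothetical category probabilities
num_samples = 20_000

# One independent standard Gumbel draw per category, per sample.
gumbel_noise = rng.gumbel(size=(num_samples, len(probs)))

# The argmax of the perturbed log-probabilities is a sample from the distribution.
samples = np.argmax(np.log(probs) + gumbel_noise, axis=1)

empirical = np.bincount(samples, minlength=len(probs)) / num_samples
print(empirical)
```

The empirical frequencies should be close to `probs`, and the same recipe works with unnormalized log-probabilities because the argmax is unaffected by adding a constant.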
Control variates are a class of methods for reducing the variance of a generic Monte Carlo estimator.
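Here is a small sketch of the idea on a toy integral. The target $\mathbb{E}[e^U]$ with $U \sim \text{Uniform}(0, 1)$ and the control variate $U$ itself are assumptions picked because both means are known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)

f = np.exp(u)  # target: E[exp(U)] = e - 1
g = u          # control variate with known mean 0.5

# Coefficient that minimizes the variance: Cov(f, g) / Var(g).
# Estimating it from the same samples adds a small bias, ignored in this sketch.
cov_fg = np.cov(f, g)
c = cov_fg[0, 1] / cov_fg[1, 1]

# Subtract the centered control variate from each term.
cv_terms = f - c * (g - 0.5)

plain_estimate = f.mean()
cv_estimate = cv_terms.mean()
variance_ratio = cv_terms.var() / f.var()
print(plain_estimate, cv_estimate, variance_ratio)
```

The more strongly the control variate correlates with the quantity of interest, the larger the variance reduction.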
Mallows' $C_p$ statistic is one way to measure and correct for model complexity when searching for the statistical model with the best performance.
Ridge regression --- a regularized variant of ordinary least squares --- is useful for dealing with collinearity and non-identifiability. Here, we'll explore some of the linear algebra behind it.
In this post we'll cover a simple algorithm for managing a portfolio of assets called the Universal Portfolio, developed by Thomas Cover in the 90s. Although the method was developed in the context of finance, it applies more generally to the setting of online learning.
Maximum likelihood estimation (MLE) is one of the most popular and well-studied methods for creating statistical estimators. This post will review conditions under which the MLE is consistent.
Online learning algorithms make decisions in uncertain, constantly-changing environments. This post will review a couple basic forms of online learning algorithms, as well as some motivating examples.
This post briefly covers a broad class of statistical estimators: M-estimators. We'll review the basic definition, some well-known special cases, and some of their asymptotic properties.
Here, we'll look at linear regression from a statistical learning theory perspective. In particular, we'll derive the number of samples necessary in order to achieve a certain level of regression error. We'll also see a technique called 'discretization' that allows for proving things about infinite sets by relying on results in finite sets.
Maximum entropy distributions are those that are the 'least informative' (i.e., have the greatest entropy) among a class of distributions with certain constraints. The principle of maximum entropy has roots across information theory, statistical mechanics, Bayesian probability, and philosophy. For this post, we'll focus on the simple definition of maximum entropy distributions.
Hypothesis testing is a fundamental part of mathematical statistics, but finding the best hypothesis test for a given problem is a nontrivial exercise. The Neyman-Pearson Lemma gives strong guidance about how to choose hypothesis tests -- we review and prove it here.
VC dimension is a measure of the complexity of a statistical model. In essence, a model with a higher VC dimension is able to learn more complex mappings between data and labels. In this post, we'll firm up this definition and walk through a couple simple examples.
I take for granted that I can easily generate random samples from a variety of probability distributions in NumPy, R, and other statistical software. However, the process for generating these quantities is somewhat nontrivial, and we'll look under the hood at one example in this post.
When thinking about the convergence of random quantities, two types of convergence that are often confused with one another are convergence in probability and almost sure convergence. Here, I give the definition of each and a simple example that illustrates the difference. The example comes from the textbook *Statistical Inference* by Casella and Berger, but I'll step through the example in more detail.
Here, we'll discuss a high-level concept relating to efficient learning. In many learning problems, an agent seeks to minimize the cost they incur by selecting a hypothesis about the world from a set of many possible hypotheses. We'll dive into a framework for prioritizing among the candidate hypotheses (structural risk minimization) -- and one interesting way to specifically assign priority levels to hypotheses (minimum description length).
In this post, we'll look at a simple example of performing transformations on random variables. Specifically, we'll explore what happens when two independent Gaussian-distributed random variables are transformed to polar coordinates.
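A quick simulation gives a feel for the result. The sketch below assumes two independent standard Gaussians (zero mean, unit variance); in that case the radius follows a Rayleigh distribution, so its square has mean 2, and the angle is uniform on $(-\pi, \pi]$ and independent of the radius.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 200_000
x = rng.normal(size=num_samples)
y = rng.normal(size=num_samples)

# Transform from Cartesian to polar coordinates.
r = np.hypot(x, y)
theta = np.arctan2(y, x)

# Summary statistics to compare against the theoretical values.
mean_r_squared = np.mean(r ** 2)            # theory: 2
mean_theta = np.mean(theta)                 # theory: 0
corr_r_theta = np.corrcoef(r, theta)[0, 1]  # theory: 0
print(mean_r_squared, mean_theta, corr_r_theta)
```

A histogram of `theta` should look flat, and a histogram of `r` should show the skewed Rayleigh shape.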