Andy Jones

These posts are informal notes and may contain errors — please let me know if you spot any.

2022

Moment-generating functions November 27
An alternative characterization of probability distributions.

ARMA distributions November 25
Deriving the distributional assumptions behind the AR, MA, and ARMA models.

The forward-backward algorithm November 12
Computing posterior distributions in hidden Markov models.

Kernel PCA October 22
A nonlinear extension of PCA using basis expansions and the kernel trick.

Statistical tests as linear models October 8
Framing t-, ANOVA, and chi-squared tests as linear models.

Gibbs posteriors and energy-based models September 10
Two approaches to generalizing probabilistic models and Bayesian inference.

The ubiquity of degrees of freedom September 8
The concept of degrees of freedom plays an important role in many areas of statistics, engineering, and mathematics.

Gaussian copulas and the Global Financial Crisis September 2
Gaussian copulas, which have been used to model credit risk, have zero tail dependence, causing the model to break down under extreme conditions in some settings.

Convex cones and positive definite matrices May 29
Definitions and visualizations of cones, convex cones, and PD matrices.

Regression splines May 22
Splines - originally a tool for shipbuilding - are useful in statistics for fitting smooth nonlinear functions to data.

Biology of Genomes themes: A text analysis of abstracts May 16

Autoregressive moving average models May 1
ARMA models capture stable, longer-term trends in sequential data, as well as shorter-term noisy shocks to the trend.

Determinantal point processes April 11
DPPs induce probability distributions over arbitrary subsets, which is useful for many applications.

Visualizing Lebesgue integration March 27
Lebesgue integrals can be visualized in a similar way to Riemann sums.

First- and second-price auctions March 13
We explore two types of sealed-bid auctions that form the backbone for much of auction theory.

Reproducing kernel Hilbert spaces March 5
Exploring vector spaces with special inner product structure and a frequentist perspective on Gaussian processes.

Bayesian optimal experimental design February 24
Adaptively adjusting experimental design choices based on streaming observations can provide benefits for downstream inference.

Epistemic and aleatoric uncertainty in statistical models February 12
A simple demonstration of uncertainty arising from modeling choices and uncertainty arising from noise.

The unintuitive nature of high-dimensional spaces February 6
Projecting our intuition from two- and three-dimensional spaces onto high-dimensional spaces can go wildly wrong.

Classification and regression trees January 23
Classification and regression tree (CART) models provide a flexible, interpretable nonparametric approach to supervised learning problems.

2021

Nyström approximation December 28
The Nyström approximation is a simple way to approximate covariance matrices and speed up downstream matrix computations.

False discovery rate and the $q$-value November 28
Multiple hypothesis testing requires balancing different types of errors and determining a tolerable level of such errors. The most common approaches for controlling errors are overly restrictive. Controlling the false discovery rate and computing $q$-values can circumvent this problem.

Score matching November 20
Complicated likelihoods with intractable normalizing constants are commonplace in many modern machine learning methods. Score matching is an approach to fit these models which circumvents the need to approximate these intractable constants.

Empirical Bayes November 13
Empirical Bayesian methods take a counterintuitive approach to the problem of choosing priors: selecting priors that are informed by the data itself.

Effective sample size November 7
Classical central limit theorems characterize the error in computing the mean of a set of independent random variables. The effective sample size helps generalize this to dependent/correlated sequences of random variables.

Power posteriors October 9
Power posteriors are a slight deviation from standard Bayesian posteriors and offer a simple approach for making Bayesian inference robust to model misspecification.

Monge and Kantorovich formulations of the Optimal Transport problem September 25
The field of optimal transport is concerned with finding routes for the movement of mass that minimize cost. Here, we review two of the most popular framings of the OT problem and demonstrate some solutions with simple numerical examples.

Nearest neighbor Gaussian processes September 12
Nearest neighbor Gaussian processes are sparse and fast approximations for Gaussian process models.

Dirichlet process mixture models August 29
Dirichlet process mixture models provide an attractive alternative to finite mixture models because they don't require the modeler to specify the number of components a priori.

Neural networks as Gaussian processes August 2
Neural networks effectively become equivalent to Gaussian processes as the number of hidden units tends to infinity.

The Matérn class of covariance functions July 31
Matérn kernels, which can be seen as generalizations of the RBF kernel, allow for modeling non-smooth functions with Gaussian processes.

Conjugate gradients July 24
Conjugate gradient descent is an approach to optimization that accounts for second-order structure of the objective function.

Linear model of coregionalization July 17
Traditional Gaussian processes only allow for a single scalar output. However, we often are interested in understanding the relationship between the input and *multiple* outputs. The linear model of coregionalization is one approach for doing this.

Approximating kernels with random projections July 10
Constructing and evaluating positive semidefinite kernel functions is a major challenge in statistics and machine learning. By leveraging Bochner's theorem, we can approximate a kernel function by transforming samples from its spectral density.

Normalizing flows July 8
Normalizing flows are a family of methods for flexibly approximating complex distributions. By combining ideas from probability theory, statistics, and deep learning, the learned distributions can be much more complex than traditional approaches to density estimation.

Martingales June 26
Martingales are a special type of stochastic process that is, in a sense, unpredictable.

Variational inference for Gaussian processes June 16
Inducing point approximations for Gaussian processes can be formalized into a Bayesian model using a variational inference approach.

Gaussian process latent variable models June 6
The GPLVM provides a principled approach to nonlinear dimensionality reduction. Here, we review its relationship to probabilistic PCA and provide a rough overview of how to fit the model.

Automatic differentiation May 21
Automatic differentiation (AD) refers to a family of algorithms that can be used to compute derivatives of functions in a systematized way.

Just-in-time compilation and JAX May 15
Most research in machine learning and computational statistics focuses on advancing methodology. However, a less-hyped topic — but an extremely important one — is the actual implementation of these methods using programming languages and compilers.

Mixture of factor analyzers May 6
The mixture of factor analyzers model combines clustering and dimensionality reduction by allowing different regions of the data space to be modeled by different low-dimensional approximations.

Stochastic variational inference April 24
Stochastic variational inference (SVI) is a family of methods that exploits stochastic optimization techniques to speed up variational approaches and scale them to large datasets.

Natural gradients April 13
The natural gradient generalizes the classical gradient to account for non-Euclidean geometries.

Unifying linear dimensionality reduction methods April 10
Linear dimensionality reduction is a cornerstone of machine learning and statistics. Here we review a 2015 paper by Cunningham and Ghahramani that unifies this zoo of methods by casting each one as a special case of a very general optimization problem.

RBF kernel as an infinite feature expansion April 3
The radial basis function (RBF) kernel is one of the most commonly-used kernels in kernel methods. Here, we show how the kernel arises from taking an infinite polynomial feature expansion.

Inducing points for Gaussian Processes March 27
Inducing points provide a strategy for lowering the computational cost of Gaussian process prediction by closely modeling only a subset of the input space.

$\chi$ divergence upper bound (CUBO) March 20
Minimizing the $\chi^2$ divergence between a true posterior and an approximate posterior is equivalent to minimizing an upper bound on the log marginal likelihood.

$KL(q \| p)$ is mode-seeking March 15
Here, we discuss and visualize the mode-seeking behavior of the reverse KL divergence.

Equivalence of mixed models and hierarchical models March 14
Mixed models are effectively a special case of hierarchical models. In this post, I try to draw some connections between these jargon-filled modeling approaches.

Gibbs posteriors February 28
Bayesian posterior inference requires the analyst to specify a full probabilistic model of the data generating process. Gibbs posteriors are a broader family of distributions that are intended to relax this requirement and to allow arbitrary loss functions.

Rao-Blackwellization February 27
Estimators based on sampling schemes can be 'Rao-Blackwellized' to reduce their variance.

Thompson sampling February 18
Thompson sampling is a simple Bayesian approach to selecting actions in a multi-armed bandit setting.

Posterior consistency February 13
Bayesian models provide a principled way to make inferences about underlying parameters. But under what conditions do those inferences converge to the truth?

$\chi$ triangles February 5
Describing $\chi$ random variables as the lengths of vectors.

Critiques of Bayesian statistics January 31
'Recommending that scientists use Bayes' theorem is like giving the neighborhood kids the key to your F-16' and other critiques.

Power iteration method January 24
The power iteration algorithm is a numerical approach to computing the top eigenvector and eigenvalue of a matrix.
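
As a rough illustration of the idea, here is a minimal sketch in plain Python. The function name `power_iteration`, the fixed iteration count, and the small symmetric example matrix are all assumptions made for this example; the sketch also assumes the matrix has a single dominant eigenvalue and that the starting vector is not orthogonal to its eigenvector.

```python
import math

def power_iteration(A, num_iters=200):
    """Estimate the dominant eigenvalue/eigenvector of a square matrix (nested lists)."""
    n = len(A)
    v = [1.0 / math.sqrt(n)] * n  # arbitrary unit-norm starting vector
    for _ in range(num_iters):
        # Multiply by A, then renormalize so the vector does not blow up or vanish.
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v gives the eigenvalue estimate (v has unit norm).
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigval = sum(v[i] * Av[i] for i in range(n))
    return eigval, v

A = [[2.0, 1.0], [1.0, 3.0]]
eigval, eigvec = power_iteration(A)
print(eigval, eigvec)
```

For this matrix the exact top eigenvalue is $(5 + \sqrt{5})/2$, so the estimate can be checked by hand.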

Belief propagation January 12
Belief propagation is a family of message passing algorithms, often used for computing marginal distributions and maximum a posteriori (MAP) estimates of random variables that have a graph structure.

Visualizing differential equations in Python January 7
In this post, we try to visualize a couple simple differential equations and their solutions with a few lines of Python code.

2020

Cubic splines December 26
Cubic splines are flexible nonparametric models. Here, we discuss some of the spline fundamentals.

Relationship between the multivariate normal, SVD, and Cholesky decomposition December 19
Matrix musings.

Shrinkage in ridge regression December 18
A brief review of shrinkage in ridge regression and a comparison to OLS.

Binomial model for options pricing December 6
The binomial model is a simple method for determining the prices of options.

BFGS November 27
BFGS is a second-order optimization method -- a close relative of Newton's method -- that approximates the Hessian of the objective function.

Tweedie distributions November 21
Tweedie distributions are a very general family of distributions that includes the Gaussian, Poisson, and Gamma (among many others) as special cases.

Scale mixtures of normals November 15
Here, we discuss two distributions which arise as scale mixtures of normals: the Laplace and the Student-$t$.

The Concrete Distribution November 12
The Concrete distribution is a relaxation of discrete distributions.

Gaussian process regression November 1
A brief review of Gaussian processes with simple visualizations.

Ito's lemma October 24
A sketch of the derivation for Ito's Lemma and a simple example.

Wiener and Ito processes October 12
A brief review of three types of stochastic processes: Wiener processes, generalized Wiener processes, and Ito processes.

Slice sampling October 10
Slice sampling is a method for obtaining random samples from an arbitrary distribution. Here, we walk through the basic steps of slice sampling and present two visual examples.

Bayesian model averaging September 27
Bayesian model averaging provides a way to combine information across statistical models and account for the uncertainty embedded in each.

James-Stein estimator September 5
The James-Stein estimator dominates the MLE by sharing information across seemingly unrelated variables.

EM as a special case of variational inference August 29
Expectation maximization can be seen as a special case of variational inference when the approximating distribution for the parameters $q(\theta)$ is taken to be a point mass.

Copulas and Sklar's Theorem August 22
Copulas are flexible statistical tools for modeling correlation structure between variables.

Schur complements August 19
Schur complements are quantities that arise often in linear algebra in the context of block matrix inversion. Here, we review the basics and show an application in statistics.

Hamiltonian Monte Carlo August 16
Hamiltonian Monte Carlo (HMC) is an MCMC method that borrows ideas from physics. Here, we'll give a brief overview and a simple example implementation.

Connection between non-negative matrix factorization and Poisson matrix factorization August 7
In this post, we draw a simple connection between the optimization problems for NMF and PMF.

Whirlwind tour of MCMC for posterior inference August 2
Markov Chain Monte Carlo (MCMC) methods encompass a broad class of tools for fitting Bayesian models. Here, we'll review some of the basic motivation behind MCMC and a couple of the most well-known methods.

Duality between maximum likelihood and maximum entropy July 25
There exists a duality between maximum likelihood estimation and finding the maximum entropy distribution subject to a set of linear constraints.

Lindley's paradox July 24
Bayesian and frequentist methods can lead people to very different conclusions. One instance of this is exemplified in Lindley's paradox, in which a hypothesis test arrives at opposite conclusions depending on whether a Bayesian or a frequentist test is used.

Improper priors July 18
Choosing a prior distribution is a philosophically and practically challenging part of Bayesian data analysis. Noninformative priors try to skirt this issue by placing equal weight on all possible parameter values; however, these priors are often 'improper' -- we review this issue here.

Probabilistic PCA derivations July 11
Probabilistic PCA generalizes traditional PCA into a probabilistic model whose maximum likelihood estimate corresponds to the traditional version. Here, we give step-by-step derivations for some of the quantities of interest.

Dirichlet Processes June 28

Estimation and Inference in probabilistic models: A whirlwind tour June 21

Generalized PCA: an alternative approach June 13
Principal component analysis is a widely-used dimensionality reduction technique. However, PCA has an implicit connection to the Gaussian distribution, which may be undesirable for non-Gaussian data. Here, we'll see a second approach for generalizing PCA to other distributions introduced by Andrew Landgraf in 2015.

Reduced-rank regression June 7
Reduced-rank regression is a method for finding associations between two high-dimensional datasets with paired samples.

Are random seeds hyperparameters? May 30
In many types of programming, random seeds are used to make computational results reproducible by generating a known set of random numbers. However, the choice of a random seed can affect results in non-trivial ways.

Condition numbers May 17
Condition numbers measure the sensitivity of a function to changes in its inputs. We review this concept here, along with some specific examples to build intuition.

AdaBoost May 13
In prediction problems, we often fit one model, evaluate its performance, and test it on unseen data. But what if we could combine multiple models at once and leverage their combined performance? This is the spirit of 'boosting': creating an ensemble of learning algorithms, which perform better together than each does independently. Here, we'll give a quick overview of boosting, and we'll review one of the most influential boosting algorithms, AdaBoost.

Statistical whitening transformations May 3
Statistical 'whitening' is a family of procedures for standardizing and decorrelating a set of variables. Here, we'll review this concept in a general sense, and see two specific examples.

Common matrix decompositions April 17
Matrix decomposition methods factor a matrix $A$ into a product of two other matrices, $A = BC$. In this post, we review some of the most common matrix decompositions, and why they're useful.

COVID-19 Zoomposium notes April 4
On 4/2/20, The Dana-Farber Cancer Institute and the Brown Institute at Columbia hosted a 'zoomposium' (symposium via Zoom) about epidemiological modeling of the COVID-19 pandemic. These are my notes from the speakers' presentations.

Quasi-likelihoods April 2
As their name suggests, 'quasi-likelihoods' are quantities that aren't formally likelihood functions, but can be used as replacements for formal likelihoods in more general settings.

The smoothly clipped absolute deviation (SCAD) penalty March 27
Variable selection is an important part of high-dimensional statistical modeling. Many popular approaches for variable selection, such as LASSO, suffer from bias. The smoothly clipped absolute deviation (SCAD) estimator attempts to alleviate this bias issue, while also retaining a continuous penalty that encourages sparsity.

Linear discriminant analysis from scratch March 24
In this post, we'll review a family of fundamental classification algorithms: linear and quadratic discriminant analysis.

Generalized PCA March 14

The representer theorem and kernel ridge regression March 7
The representer theorem is a powerful result that implies a certain type of duality between solutions to function estimation problems.

Newton's method and Fisher scoring for fitting GLMs March 4
Generalized linear models are flexible tools for modeling various response distributions. This post covers one common way of fitting them.

LASSO and the irrepresentable condition February 28
In this post, we cover a condition that is necessary and sufficient for the LASSO estimator to work correctly.

Log-derivative trick February 26
The log-derivative trick is really just a simple application of the chain rule. However, it allows us to rewrite expectations in a way that is amenable to Monte Carlo approximation.

MLE under a misspecified model February 24
When we construct and analyze statistical estimators, we often assume that the model is correctly specified. However, in practice, this is rarely the case --- our assumed models are usually approximations of the truth, but they're useful nonetheless.

Gumbel max trick February 21
The Gumbel max trick is a method for sampling from discrete distributions using only a deterministic function of the distribution's parameters and independent Gumbel noise.
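
A minimal sketch of the trick, assuming the standard construction: add independent Gumbel(0, 1) noise to each log-probability and take the argmax. The function name `gumbel_max_sample` and the example probabilities are hypothetical and chosen only for illustration.

```python
import math
import random

def gumbel_max_sample(probs, rng):
    """Draw one index from a discrete distribution via the Gumbel max trick."""
    best_index, best_score = None, -math.inf
    for i, p in enumerate(probs):
        if p <= 0.0:
            continue  # zero-probability categories can never win the argmax
        u = rng.random()
        while u == 0.0:  # keep u strictly inside (0, 1) so both logs are finite
            u = rng.random()
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise via inverse CDF
        score = math.log(p) + gumbel
        if score > best_score:
            best_index, best_score = i, score
    return best_index

rng = random.Random(0)
probs = [0.2, 0.5, 0.3]
num_draws = 20000
counts = [0] * len(probs)
for _ in range(num_draws):
    counts[gumbel_max_sample(probs, rng)] += 1
frequencies = [c / num_draws for c in counts]
print(frequencies)
```

With enough draws, the empirical frequencies should land close to `probs`.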

Control variates February 14
Control variates are a class of methods for reducing the variance of a generic Monte Carlo estimator.

Mallows $C_p$ February 11
Mallows' $C_p$ statistic is one way to measure and correct for model complexity when searching for the statistical model with the best performance.

The linear algebra of ridge regression February 9
Ridge regression --- a regularized variant of ordinary least squares --- is useful for dealing with collinearity and non-identifiability. Here, we'll explore some of the linear algebra behind it.

Universal Portfolios: A simple online learning algorithm January 25
In this post we'll cover a simple algorithm for managing a portfolio of assets called the Universal Portfolio, developed by Thomas Cover in the 90s. Although the method was developed in the context of finance, it applies more generally to the setting of online learning.

Consistency of MLE January 10
Maximum likelihood estimation (MLE) is one of the most popular and well-studied methods for creating statistical estimators. This post will review conditions under which the MLE is consistent.

Learning from Expert Advice and Hedge January 5
Online learning algorithms make decisions in uncertain, constantly-changing environments. This post will review a couple basic forms of online learning algorithms, as well as some motivating examples.

2019

$M$-estimation December 31
This post briefly covers a broad class of statistical estimators: M-estimators. We'll review the basic definition, some well-known special cases, and some of their asymptotic properties.

Sample complexity of linear regression December 29
Here, we'll look at linear regression from a statistical learning theory perspective. In particular, we'll derive the number of samples necessary in order to achieve a certain level of regression error. We'll also see a technique called 'discretization' that allows for proving things about infinite sets by relying on results in finite sets.

Maximum entropy distributions December 13
Maximum entropy distributions are those that are the 'least informative' (i.e., have the greatest entropy) among a class of distributions with certain constraints. The principle of maximum entropy has roots across information theory, statistical mechanics, Bayesian probability, and philosophy. For this post, we'll focus on the simple definition of maximum entropy distributions.

Neyman-Pearson Lemma December 7
Hypothesis testing is a fundamental part of mathematical statistics, but finding the best hypothesis test for a given problem is a nontrivial exercise. The Neyman-Pearson Lemma gives strong guidance about how to choose hypothesis tests -- we review and prove it here.

Introduction to VC dimension November 17
VC dimension is a measure of the complexity of a statistical model. In essence, a model with a higher VC dimension is able to learn more complex mappings between data and labels. In this post, we'll firm up this definition and walk through a couple simple examples.

Generating random samples from probability distributions November 12
I take for granted that I can easily generate random samples from a variety of probability distributions in NumPy, R, and other statistical software. However, the process for generating these quantities is somewhat nontrivial, and we'll look under the hood at one example in this post.

Convergence in probability vs. almost sure convergence November 11
When thinking about the convergence of random quantities, two types of convergence that are often confused with one another are convergence in probability and almost sure convergence. Here, I give the definition of each and a simple example that illustrates the difference. The example comes from the textbook *Statistical Inference* by Casella and Berger, but I'll step through the example in more detail.

Structural risk minimization and minimum description length October 5
Here, we'll discuss a high-level concept relating to efficient learning. In many learning problems, an agent seeks to minimize the cost they incur by selecting a hypothesis about the world from a set of many possible hypotheses. We'll dive into a framework for prioritizing among the candidate hypotheses (structural risk minimization) -- and one interesting way to specifically assign priority levels to hypotheses (minimum description length).

Rayleigh distribution (aka polar transformation of Gaussians) September 28
In this post, we'll look at a simple example of performing transformations on random variables. Specifically, we'll explore what happens when two independent Gaussian-distributed random variables are transformed to polar coordinates.
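
As a quick numerical check of this idea, the sketch below draws pairs of independent zero-mean Gaussians, converts each pair to a radius, and compares the sample mean of the radii with the known Rayleigh mean $\sigma\sqrt{\pi/2}$. The scale `sigma`, the sample size, and the seed are arbitrary choices made for this example.

```python
import math
import random

rng = random.Random(0)
sigma = 2.0       # common standard deviation of both Gaussians (assumed value)
n = 50000         # number of simulated (x, y) pairs

radii = []
for _ in range(n):
    x = rng.gauss(0.0, sigma)
    y = rng.gauss(0.0, sigma)
    radii.append(math.hypot(x, y))  # polar radius r = sqrt(x^2 + y^2)

empirical_mean = sum(radii) / n
rayleigh_mean = sigma * math.sqrt(math.pi / 2)  # theoretical mean of Rayleigh(sigma)
print(empirical_mean, rayleigh_mean)
```

The two printed values should agree to within Monte Carlo error.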