These posts are informal notes and may contain errors — please let me know if you spot any.
The natural gradient generalizes the classical gradient to account for non-Euclidean geometries.
Linear dimensionality reduction is a cornerstone of machine learning and statistics. Here we review a 2015 paper by Cunningham and Ghahramani that unifies the zoo of linear dimensionality reduction methods by casting each one as a special case of a single, very general optimization problem.
The radial basis function (RBF) kernel is one of the most commonly used kernels in kernel methods. Here, we show how the kernel arises from taking an infinite polynomial feature expansion.
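As a quick sketch of the idea (the post works through the details; the one-dimensional case with lengthscale $\sigma$ is shown here just for intuition):

$$
k(x, y) = \exp\left(-\frac{(x - y)^2}{2\sigma^2}\right) = \exp\left(-\frac{x^2}{2\sigma^2}\right)\exp\left(-\frac{y^2}{2\sigma^2}\right)\sum_{k=0}^{\infty} \frac{1}{k!}\left(\frac{xy}{\sigma^2}\right)^k,
$$

so the kernel is an inner product between infinite-dimensional feature vectors whose entries are scaled monomials $1, x, x^2, \dots$.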
Inducing points provide a strategy for lowering the computational cost of Gaussian process prediction by closely modeling only a subset of the input space.
Minimizing the $\chi^2$ divergence between a true posterior and an approximate posterior is equivalent to minimizing an upper bound on the log marginal likelihood.
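A rough sketch of the identity behind this claim (my notation, with data $x$, latent variables $z$, and approximation $q$, following the $\chi^2$ variational inference literature):

$$
\frac{1}{2}\log \mathbb{E}_{q(z)}\left[\left(\frac{p(x, z)}{q(z)}\right)^2\right] = \log p(x) + \frac{1}{2}\log\left(1 + D_{\chi^2}\big(p(z \mid x) \,\|\, q(z)\big)\right) \geq \log p(x),
$$

so minimizing this quantity over $q$ simultaneously tightens an upper bound on $\log p(x)$ and drives the $\chi^2$ divergence toward zero.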
Here, we discuss and visualize the mode-seeking behavior of the reverse KL divergence.
Mixed models are effectively a special case of hierarchical models. In this post, I try to draw some connections between these jargon-filled modeling approaches.
Bayesian posterior inference requires the analyst to specify a full probabilistic model of the data generating process. Gibbs posteriors are a broader family of distributions that are intended to relax this requirement and to allow arbitrary loss functions.
Estimators based on sampling schemes can be 'Rao-Blackwellized' to reduce their variance.
Thompson sampling is a simple Bayesian approach to selecting actions in a multi-armed bandit setting.
Bayesian models provide a principled way to make inferences about underlying parameters. But under what conditions do those inferences converge to the truth?
'Recommending that scientists use Bayes' theorem is like giving the neighborhood kids the key to your F-16' and other critiques.
The power iteration algorithm is a numerical approach to computing the top eigenvector and eigenvalue of a matrix.
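A minimal numpy sketch of the algorithm (my own toy implementation, not the post's code):

```python
import numpy as np

def power_iteration(A, num_iters=1000):
    """Approximate the top eigenvector and eigenvalue of a square matrix A."""
    v = np.random.randn(A.shape[0])
    for _ in range(num_iters):
        v = A @ v                      # repeatedly apply A...
        v = v / np.linalg.norm(v)      # ...and renormalize to avoid overflow
    eigenvalue = v @ A @ v             # Rayleigh quotient at the final iterate
    return v, eigenvalue

# Example: compare against np.linalg.eigh on a small symmetric matrix.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
v, lam = power_iteration(A)
```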
Belief propagation is a family of message passing algorithms, often used for computing marginal distributions and maximum a posteriori (MAP) estimates of random variables that have a graph structure.
In this post, we try to visualize a couple of simple differential equations and their solutions with a few lines of Python code.
Cubic splines are flexible nonparametric models. Here, we discuss some of the spline fundamentals.
Matrix musings.
A brief review of shrinkage in ridge regression and a comparison to OLS.
The binomial model is a simple method for determining the prices of options.
BFGS is a quasi-Newton optimization method -- a close relative of Newton's method -- that iteratively builds an approximation to the Hessian of the objective function.
Tweedie distributions are a very general family of distributions that includes the Gaussian, Poisson, and Gamma (among many others) as special cases.
Here, we discuss two distributions which arise as scale mixtures of normals: the Laplace and the Student-$t$.
The Concrete distribution is a continuous relaxation of discrete distributions.
A brief review of Gaussian processes with simple visualizations.
A brief review of three types of stochastic processes: Wiener processes, generalized Wiener processes, and Ito processes.
Slice sampling is a method for obtaining random samples from an arbitrary distribution. Here, we walk through the basic steps of slice sampling and present two visual examples.
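A minimal univariate sketch of the method (my own toy implementation with stepping-out and shrinkage, not the post's code):

```python
import numpy as np

def slice_sample(logpdf, x0, num_samples=1000, width=1.0):
    """Univariate slice sampler with stepping-out and shrinkage."""
    samples, x = [], x0
    for _ in range(num_samples):
        # 1. Draw a vertical level uniformly under the density at x (in log space).
        log_y = logpdf(x) + np.log(np.random.rand())
        # 2. Step out to find an interval [left, right] that contains the slice.
        left = x - width * np.random.rand()
        right = left + width
        while logpdf(left) > log_y:
            left -= width
        while logpdf(right) > log_y:
            right += width
        # 3. Sample uniformly from the interval, shrinking it after each rejection.
        while True:
            x_new = np.random.uniform(left, right)
            if logpdf(x_new) > log_y:
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        x = x_new
        samples.append(x)
    return np.array(samples)

# Example: sample from a standard normal (log-density up to a constant).
draws = slice_sample(lambda x: -0.5 * x ** 2, x0=0.0)
```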
Bayesian model averaging provides a way to combine information across statistical models and account for the uncertainty embedded in each.
The James-Stein estimator dominates the MLE by sharing information across seemingly unrelated variables.
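For reference, in the canonical setting $y \sim \mathcal{N}_p(\theta, \sigma^2 I)$ with $p \geq 3$, the estimator shrinks the observations toward zero:

$$
\hat{\theta}_{\text{JS}} = \left(1 - \frac{(p - 2)\,\sigma^2}{\lVert y \rVert^2}\right) y,
$$

and its risk is uniformly lower than that of the MLE $\hat{\theta}_{\text{MLE}} = y$.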
Expectation maximization can be seen as a special case of variational inference when the approximating distribution for the parameters $q(\theta)$ is taken to be a point mass.
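A sketch of the connection, using the usual ELBO decomposition (the post gives the full argument):

$$
\log p(x \mid \theta) = \underbrace{\mathbb{E}_{q(z)}\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right]}_{\text{ELBO}} + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big),
$$

so EM is coordinate ascent on the ELBO: the E-step sets $q(z) = p(z \mid x, \theta)$, closing the KL gap, and the M-step maximizes the ELBO over a point estimate of $\theta$.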
Copulas are flexible statistical tools for modeling correlation structure between variables.
Schur complements are quantities that arise often in linear algebra in the context of block matrix inversion. Here, we review the basics and show an application in statistics.
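For reference (standard notation, not necessarily the post's), given a block matrix

$$
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad M/D := A - B D^{-1} C,
$$

the Schur complement $M/D$ is what appears, for example, as the conditional covariance $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ of one block of a partitioned Gaussian given the other.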
Hamiltonian Monte Carlo (HMC) is an MCMC method that borrows ideas from physics. Here, we'll give a brief overview and a simple example implementation.
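A toy one-dimensional sketch of the method (my own simplified implementation, separate from the post's example):

```python
import numpy as np

def hmc_sample(log_p, grad_log_p, x0, num_samples=1000, step_size=0.1, num_leapfrog=20):
    """Toy Hamiltonian Monte Carlo sampler for a one-dimensional target density."""
    x, samples = x0, []
    for _ in range(num_samples):
        p = np.random.randn()                         # resample the momentum
        x_new, p_new = x, p
        # Leapfrog integration of the Hamiltonian dynamics.
        p_new += 0.5 * step_size * grad_log_p(x_new)
        for _ in range(num_leapfrog - 1):
            x_new += step_size * p_new
            p_new += step_size * grad_log_p(x_new)
        x_new += step_size * p_new
        p_new += 0.5 * step_size * grad_log_p(x_new)
        # Metropolis correction based on the change in total energy.
        current_H = -log_p(x) + 0.5 * p ** 2
        proposed_H = -log_p(x_new) + 0.5 * p_new ** 2
        if np.log(np.random.rand()) < current_H - proposed_H:
            x = x_new
        samples.append(x)
    return np.array(samples)

# Example: sample from a standard normal target.
draws = hmc_sample(lambda x: -0.5 * x ** 2, lambda x: -x, x0=0.0)
```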
In this post, we draw a simple connection between the optimization problems for NMF and PMF.
Markov Chain Monte Carlo (MCMC) methods encompass a broad class of tools for fitting Bayesian models. Here, we'll review some of the basic motivation behind MCMC and a couple of the most well-known methods.
There exists a duality between maximum likelihood estimation and finding the maximum entropy distribution subject to a set of linear constraints.
Bayesian and frequentist methods can lead people to very different conclusions. One instance of this is Lindley's paradox, in which a hypothesis test arrives at opposite conclusions depending on whether a Bayesian or a frequentist test is used.
Choosing a prior distribution is a philosophically and practically challenging part of Bayesian data analysis. Noninformative priors try to skirt this issue by placing equal weight on all possible parameter values; however, these priors are often 'improper' -- we review this issue here.
Probabilistic PCA generalizes traditional PCA into a probabilistic model whose maximum likelihood estimate corresponds to the traditional version. Here, we give step-by-step derivations for some of the quantities of interest.
Principal component analysis is a widely-used dimensionality reduction technique. However, PCA has an implicit connection to the Gaussian distribution, which may be undesirable for non-Gaussian data. Here, we'll see a second approach for generalizing PCA to other distributions introduced by [Andrew Landgraf in 2015](https://etd.ohiolink.edu/!etd.send_file?accession=osu1437610558&disposition=inline).
Reduced-rank regression is a method for finding associations between two high-dimensional datasets with paired samples.
In many types of programming, random seeds are used to make computational results reproducible by generating a known set of random numbers. However, the choice of a random seed can affect results in non-trivial ways.
Condition numbers measure the sensitivity of a function to changes in its inputs. We review this concept here, along with some specific examples to build intuition.
In prediction problems, we often fit one model, evaluate its performance, and test it on unseen data. But what if we could combine multiple models at once and leverage their combined performance? This is the spirit of 'boosting': creating an ensemble of learning algorithms, which perform better together than each does independently. Here, we'll give a quick overview of boosting, and we'll review one of the most influential boosting algorithms, AdaBoost.
Statistical 'whitening' is a family of procedures for standardizing and decorrelating a set of variables. Here, we'll review this concept in a general sense, and see two specific examples.
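A minimal numpy sketch of two common whitening transforms (a toy example of my own; the post's notation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=1000)

# Center the data and estimate its covariance.
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)

# Eigendecompose the covariance: Sigma = U diag(lam) U^T.
lam, U = np.linalg.eigh(Sigma)

# PCA whitening: rotate onto the eigenbasis, then rescale each axis.
W_pca = np.diag(1.0 / np.sqrt(lam)) @ U.T
# ZCA (Mahalanobis) whitening: rotate back so the result stays close to the original axes.
W_zca = U @ W_pca

Z = Xc @ W_zca.T   # whitened data: np.cov(Z, rowvar=False) is approximately the identity
```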
Matrix decomposition methods factor a matrix $A$ into a product of two other matrices, $A = BC$. In this post, we review some of the most common matrix decompositions, and why they're useful.
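As a quick illustration (a small numpy sketch, not tied to the post's examples), a few of the workhorse decompositions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

# QR: A = Q R with Q orthogonal and R upper triangular (useful for least squares).
Q, R = np.linalg.qr(A)

# SVD: A = U diag(s) V^T (useful for low-rank approximation and PCA).
U, s, Vt = np.linalg.svd(A)

# Cholesky: S = L L^T for a symmetric positive definite matrix (useful for sampling and solving).
S = A @ A.T + 4 * np.eye(4)   # construct a positive definite matrix
L = np.linalg.cholesky(S)
```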
On 4/2/20, The Dana-Farber Cancer Institute and the Brown Institute at Columbia hosted a 'zoomposium' (symposium via Zoom) about epidemiological modeling of the COVID-19 pandemic. These are my notes from the speakers' presentations.
As their name suggests, 'quasi-likelihoods' are quantities that aren't formally likelihood functions, but can be used as replacements for formal likelihoods in more general settings.
Variable selection is an important part of high-dimensional statistical modeling. Many popular approaches for variable selection, such as LASSO, suffer from bias. The smoothly clipped absolute deviation (SCAD) estimator attempts to alleviate this bias issue, while also retaining a continuous penalty that encourages sparsity.
In this post, we'll review a family of fundamental classification algorithms: linear and quadratic discriminant analysis.