# Blog posts

These posts are notes on topics I've found interesting. They may contain errors -- please let me know if you spot any!

## Tweedie distributions

Published:

Tweedie distributions are a very general family of distributions that includes the Gaussian, Poisson, and Gamma (among many others) as special cases.

## Scale mixtures of normals

Published:

Here, we discuss two distributions which arise as scale mixtures of normals: the Laplace and the Student-$t$.

## The Concrete Distribution

Published:

The Concrete distribution is a relaxation of discrete distributions.

## Gaussian process regression

Published:

A brief review of Gaussian processes with simple visualizations.

## Ito’s lemma

Published:

A sketch of the derivation for Ito’s Lemma and a simple example.

## Wiener and Ito processes

Published:

A brief review of three types of stochastic processes: Wiener processes, generalized Wiener processes, and Ito processes.

## Slice sampling

Published:

Slice sampling is a method for obtaining random samples from an arbitrary distribution. Here, we walk through the basic steps of slice sampling and present two visual examples.

## Bayesian model averaging

Published:

Bayesian model averaging provides a way to combine information across statistical models and account for the uncertainty embedded in each.

## James-Stein estimator

Published:

The James-Stein estimator dominates the MLE by sharing information across seemingly unrelated variables.
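As a rough illustration of the claim above, the following sketch simulates a single observation $x \sim N(\theta, I_p)$ many times and compares the total squared error of the MLE with that of the James-Stein estimator. The dimension, the true mean `theta`, and the number of trials are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 10                      # dimension (James-Stein requires p >= 3)
theta = np.full(p, 0.5)     # hypothetical true mean vector
n_trials = 5000

# One observation x ~ N(theta, I_p) per trial.
x = theta + rng.standard_normal((n_trials, p))

# The MLE of theta is the observation itself.
mle = x

# James-Stein shrinks each observation toward the origin.
shrinkage = 1.0 - (p - 2) / np.sum(x**2, axis=1, keepdims=True)
js = shrinkage * x

# Total squared error, averaged across trials.
mse_mle = np.mean(np.sum((mle - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print(f"MLE: {mse_mle:.2f}, James-Stein: {mse_js:.2f}")
```

The gain from shrinkage is largest when the true mean is close to the shrinkage target (here, the origin) and shrinks as `theta` moves away from it.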

## EM as a special case of variational inference

Published:

Expectation maximization can be seen as a special case of variational inference when the approximating distribution for the parameters $q(\theta)$ is taken to be a point mass.

## Copulas and Sklar’s Theorem

Published:

Copulas are flexible statistical tools for modeling correlation structure between variables.

## Schur complements

Published:

Schur complements are quantities that arise often in linear algebra in the context of block matrix inversion. Here, we review the basics and show an application in statistics.
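A small numerical check of the block-inversion connection, assuming a randomly generated symmetric positive-definite matrix (chosen here only so that the required inverses exist): the upper-left block of $M^{-1}$ should equal the inverse of the Schur complement $A - B D^{-1} C$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive-definite matrix, so M and D are invertible.
G = rng.standard_normal((5, 5))
M = G @ G.T + 5 * np.eye(5)

# Partition M = [[A, B], [C, D]] with a 2x2 upper-left block.
A, B = M[:2, :2], M[:2, 2:]
C, D = M[2:, :2], M[2:, 2:]

# Schur complement of D in M.
S = A - B @ np.linalg.solve(D, C)

# The upper-left block of M^{-1} equals S^{-1}.
M_inv = np.linalg.inv(M)
schur_matches = np.allclose(M_inv[:2, :2], np.linalg.inv(S))
print(schur_matches)
```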

## Hamiltonian Monte Carlo

Published:

Hamiltonian Monte Carlo (HMC) is an MCMC method that borrows ideas from physics. Here, we’ll give a brief overview and a simple example implementation.

## Connection between non-negative matrix factorization and Poisson matrix factorization

Published:

In this post, we draw a simple connection between the optimization problems for NMF and PMF.

## Whirlwind tour of MCMC for posterior inference

Published:

Markov Chain Monte Carlo (MCMC) methods encompass a broad class of tools for fitting Bayesian models. Here, we’ll review some of the basic motivation behind MCMC and a couple of the most well-known methods.

## Duality between maximum likelihood and maximum entropy

Published:

There exists a duality between maximum likelihood estimation and finding the maximum entropy distribution subject to a set of linear constraints.

Published:

Bayesian and frequentist methods can lead people to very different conclusions. One instance of this is exemplified in Lindley’s paradox, in which a hypothesis test arrives at opposite conclusions depending on whether a Bayesian or a frequentist test is used.

## Improper priors

Published:

Choosing a prior distribution is a philosophically and practically challenging part of Bayesian data analysis. Noninformative priors try to skirt this issue by placing equal weight on all possible parameter values; however, these priors are often “improper” – we review this issue here.

## Probabilistic PCA derivations

Published:

Probabilistic PCA generalizes traditional PCA into a probabilistic model whose maximum likelihood estimate corresponds to the traditional version. Here, we give step-by-step derivations for some of the quantities of interest.

## Dirichlet Processes: the basics

Published:

The Dirichlet process (DP) is one of the most common – and one of the simplest – prior distributions used in Bayesian nonparametric models. In this post, we’ll review a couple different interpretations of DPs.

## Estimation and Inference in probabilistic models: A whirlwind tour

Published:

Probabilistic models are flexible tools for understanding a data generating process. There are many ways to do inference and estimation in these models, and it continues to be an active area of research. Here, our goal will be to understand a few of the most general classes of probabilistic model fitting, while trying to understand the important differences between them. We’ll demonstrate how to implement each one with a very simple example: the Beta-Bernoulli model.

## Generalized PCA: an alternative approach

Published:

Principal component analysis is a widely-used dimensionality reduction technique. However, PCA has an implicit connection to the Gaussian distribution, which may be undesirable for non-Gaussian data. Here, we’ll see a second approach, introduced by Andrew Landgraf in 2015, for generalizing PCA to other distributions.

## Reduced-rank regression

Published:

Reduced-rank regression is a method for finding associations between two high-dimensional datasets with paired samples.

## Are random seeds hyperparameters?

Published:

In many types of programming, random seeds are used to make computational results reproducible by generating a known set of random numbers. However, the choice of a random seed can affect results in non-trivial ways.

## Condition numbers

Published:

Condition numbers measure the sensitivity of a function to changes in its inputs. We review this concept here, along with some specific examples to build intuition.
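To make the idea of sensitivity concrete, here is a minimal sketch using `np.linalg.cond`, which by default returns the ratio of the largest to the smallest singular value. Both matrices and the size of the perturbation are made up for illustration.

```python
import numpy as np

# A well-conditioned matrix and a nearly singular one (both chosen for illustration).
well = np.array([[2.0, 0.0], [0.0, 1.0]])
ill = np.array([[1.0, 1.0], [1.0, 1.0001]])

cond_well = np.linalg.cond(well)
cond_ill = np.linalg.cond(ill)

# Solve ill @ x = b, then perturb b slightly and compare the two solutions.
b = np.array([2.0, 2.0001])
x = np.linalg.solve(ill, b)
x_perturbed = np.linalg.solve(ill, b + np.array([0.0, 1e-4]))

rel_input_change = 1e-4 / np.linalg.norm(b)
rel_output_change = np.linalg.norm(x_perturbed - x) / np.linalg.norm(x)
amplification = rel_output_change / rel_input_change
print(cond_well, cond_ill, amplification)
```

For a linear system, the condition number bounds how much a relative change in `b` can be amplified in the solution, so `amplification` should not exceed `cond_ill` (up to floating-point error).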

Published:

In prediction problems, we often fit one model, evaluate its performance, and test it on unseen data. But what if we could combine multiple models at once and leverage their combined performance? This is the spirit of “boosting”: creating an ensemble of learning algorithms, which perform better together than each does independently. Here, we’ll give a quick overview of boosting, and we’ll review one of the most influential boosting algorithms, AdaBoost.

## Statistical whitening transformations

Published:

Statistical “whitening” is a family of procedures for standardizing and decorrelating a set of variables. Here, we’ll review this concept in a general sense, and see two specific examples.
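As one possible illustration, the sketch below applies ZCA whitening, which uses $W = \Sigma^{-1/2}$. This is a common choice of whitening matrix, though not necessarily one of the two examples covered in the post, and the covariance matrix used to simulate data is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data with a hypothetical covariance matrix.
cov = np.array([[2.0, 1.2], [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

# Center the data and estimate its covariance.
Xc = X - X.mean(axis=0)
sigma = np.cov(Xc, rowvar=False)

# ZCA whitening matrix W = Sigma^{-1/2}, built from the eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(sigma)
W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

Z = Xc @ W  # W is symmetric, so no transpose is needed
whitened_cov = np.cov(Z, rowvar=False)
print(np.round(whitened_cov, 3))
```

After the transformation, the sample covariance of `Z` should be the identity matrix up to floating-point error.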

## Common matrix decompositions

Published:

Matrix decomposition methods factor a matrix $A$ into a product of two other matrices, $A = BC$. In this post, we review some of the most common matrix decompositions, and why they’re useful.
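As a quick example of the form $A = BC$, the sketch below computes a QR decomposition with NumPy and checks the defining properties of the two factors. QR is used here only as a representative decomposition, and the matrix is random.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

# QR decomposition: A = Q R, with orthonormal columns in Q and upper-triangular R.
Q, R = np.linalg.qr(A)

reconstruction_ok = np.allclose(A, Q @ R)
orthonormal_ok = np.allclose(Q.T @ Q, np.eye(3))
upper_triangular_ok = np.allclose(R, np.triu(R))
print(reconstruction_ok, orthonormal_ok, upper_triangular_ok)
```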

## COVID-19 Zoomposium notes

Published:

On 4/2/20, The Dana-Farber Cancer Institute and the Brown Institute at Columbia hosted a “zoomposium” (symposium via Zoom) about epidemiological modeling of the COVID-19 pandemic. These are my notes from the speakers’ presentations.

## Quasi-likelihoods

Published:

As their name suggests, “quasi-likelihoods” are quantities that aren’t formally likelihood functions, but can be used as replacements for formal likelihoods in more general settings.

## The smoothly clipped absolute deviation (SCAD) penalty

Published:

Variable selection is an important part of high-dimensional statistical modeling. Many popular approaches for variable selection, such as LASSO, suffer from bias. The smoothly clipped absolute deviation (SCAD) estimator attempts to alleviate this bias issue, while also retaining a continuous penalty that encourages sparsity.

## Linear discriminant analysis from scratch

Published:

In this post, we’ll review a family of fundamental classification algorithms: linear and quadratic discriminant analysis.

## Generalized PCA

Published:

Principal component analysis (PCA) in its typical form implicitly assumes that the observed data matrix follows a Gaussian distribution. However, PCA can be generalized to allow for other distributions – here, we take a look at its generalization for exponential families introduced by Collins et al. in 2001.

## The representer theorem and kernel ridge regression

Published:

The representer theorem is a powerful result that implies a certain type of duality between solutions to function estimation problems.

## Newton’s method and Fisher scoring for fitting GLMs

Published:

Generalized linear models are flexible tools for modeling various response distributions. This post covers one common way of fitting them.

## LASSO and the irrepresentable condition

Published:

In this post, we cover the irrepresentable condition, which is necessary and sufficient for the LASSO estimator to select the correct set of variables.

## Log-derivative trick

Published:

The “log-derivative trick” is really just a simple application of the chain rule. However, it allows us to rewrite expectations in a way that is amenable to Monte Carlo approximation.
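A minimal sketch of the idea, assuming $x \sim N(\mu, 1)$ and $f(x) = x^2$ (both picked for this example because the answer is known in closed form): the identity $\nabla_\mu \mathbb{E}[f(x)] = \mathbb{E}[f(x) \nabla_\mu \log p(x; \mu)]$ turns the gradient into an expectation that can be approximated with samples.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 1.5          # hypothetical parameter value
n = 200_000

# Goal: estimate d/dmu E[f(x)] for x ~ N(mu, 1) with f(x) = x^2.
# Analytically, E[x^2] = mu^2 + 1, so the true gradient is 2 * mu.
x = rng.normal(loc=mu, scale=1.0, size=n)
f = x**2

# Score function: d/dmu log N(x; mu, 1) = x - mu.
score = x - mu

# Log-derivative identity: grad E[f(x)] = E[f(x) * score(x)].
grad_estimate = np.mean(f * score)
grad_true = 2 * mu
print(grad_estimate, grad_true)
```

This estimator is unbiased but can have high variance, which is why a large sample size is used here.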

## MLE under a misspecified model

Published:

When we construct and analyze statistical estimators, we often assume that the model is correctly specified. However, in practice, this is rarely the case — our assumed models are usually approximations of the truth, but they’re useful nonetheless.

## Gumbel max trick

Published:

The “Gumbel max trick” is a method for sampling from discrete distributions using only a deterministic function of the distributions’ parameters and independent Gumbel noise.
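The trick can be sketched in a few lines: add independent standard Gumbel noise to the log-probabilities and take the argmax. The probability vector below is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(0)

probs = np.array([0.1, 0.6, 0.3])   # hypothetical categorical distribution
n = 100_000

# Independent standard Gumbel noise for every category and every sample.
g = rng.gumbel(loc=0.0, scale=1.0, size=(n, probs.size))

# argmax_k (log p_k + g_k) is distributed as Categorical(probs).
samples = np.argmax(np.log(probs) + g, axis=1)

empirical = np.bincount(samples, minlength=probs.size) / n
print(empirical)
```

The empirical frequencies should be close to `probs`. All of the randomness lives in the noise `g`, while the parameters enter only through a deterministic function.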

## Control variates

Published:

Control variates are a class of methods for reducing the variance of a generic Monte Carlo estimator.
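Here is a small sketch of a control variate in action, assuming we want to estimate $\mathbb{E}[e^U]$ for $U \sim \text{Uniform}(0, 1)$ and use $U$ itself (with known mean $0.5$) as the control. The target, the control, and the sample sizes are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_reps = 1000, 2000

# Target: E[exp(U)] for U ~ Uniform(0, 1); the true value is e - 1.
u = rng.uniform(size=(n_reps, n))
f = np.exp(u)

# Control variate: U itself, whose mean 0.5 is known exactly.
# For simplicity the coefficient is estimated from the same samples,
# which introduces a small bias that vanishes as the sample size grows.
c = np.cov(f.ravel(), u.ravel())[0, 1] / np.var(u, ddof=1)

plain = f.mean(axis=1)                        # one plain estimate per replication
adjusted = (f - c * (u - 0.5)).mean(axis=1)   # control-variate estimates

print(plain.var(), adjusted.var())
```

Because $e^U$ and $U$ are strongly correlated, the adjusted estimator should have a much smaller variance across replications than the plain Monte Carlo average.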

## Mallows Cp

Published:

Mallows’ $C_p$ statistic is one way to measure and correct for model complexity when searching for the statistical model with the best performance.

## The linear algebra of ridge regression

Published:

Ridge regression — a regularized variant of ordinary least squares — is useful for dealing with collinearity and non-identifiability. Here, we’ll explore some of the linear algebra behind it.
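As a small taste of that linear algebra, the sketch below computes the ridge estimate two ways on simulated data: from the closed form $(X^\top X + \lambda I)^{-1} X^\top y$, and from the SVD $X = U \,\text{diag}(d)\, V^\top$. The data and the penalty `lam` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 2.0   # lam is a hypothetical penalty strength

X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Closed form: beta = (X^T X + lam I)^{-1} X^T y.
beta_normal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Same estimator via the SVD X = U diag(d) V^T:
# beta = V diag(d / (d^2 + lam)) U^T y.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((d / (d**2 + lam)) * (U.T @ y))

print(np.allclose(beta_normal, beta_svd))
```

The SVD form shows how the penalty helps with collinearity: each singular direction is scaled by $d / (d^2 + \lambda)$, which stays bounded even when a singular value $d$ is close to zero.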

## Universal Portfolios: A simple online learning algorithm

Published:

In this post we’ll cover a simple algorithm for managing a portfolio of assets called the “Universal Portfolio”, developed by Thomas Cover in the 90s. Although the method was developed in the context of finance, it applies more generally to the setting of online learning.

## Consistency of MLE

Published:

Maximum likelihood estimation (MLE) is one of the most popular and well-studied methods for creating statistical estimators. This post will review conditions under which the MLE is consistent.

## Learning from Expert Advice and Hedge

Published:

Online learning algorithms make decisions in uncertain, constantly-changing environments. This post will review a couple basic forms of online learning algorithms, as well as some motivating examples.

## M-estimation

Published:

This post briefly covers a broad class of statistical estimators: M-estimators. We’ll review the basic definition, some well-known special cases, and some of their asymptotic properties.

## Sample complexity of linear regression

Published:

Here, we’ll look at linear regression from a statistical learning theory perspective. In particular, we’ll derive the number of samples necessary in order to achieve a certain level of regression error. We’ll also see a technique called “discretization” that allows for proving things about infinite sets by relying on results in finite sets.

## Maximum entropy distributions

Published:

Maximum entropy distributions are those that are the “least informative” (i.e., have the greatest entropy) among a class of distributions with certain constraints. The principle of maximum entropy has roots across information theory, statistical mechanics, Bayesian probability, and philosophy. For this post, we’ll focus on the simple definition of maximum entropy distributions.

## Neyman-Pearson Lemma

Published:

Hypothesis testing is a fundamental part of mathematical statistics, but finding the best hypothesis test for a given problem is a nontrivial exercise. The Neyman-Pearson Lemma gives strong guidance about how to choose hypothesis tests – we review and prove it here.

## Introduction to VC dimension

Published:

VC dimension is a measure of the complexity of a statistical model. In essence, a model with a higher VC dimension is able to learn more complex mappings between data and labels. In this post, we’ll firm up this definition and walk through a couple simple examples.

## Generating random samples from probability distributions

Published:

I take for granted that I can easily generate random samples from a variety of probability distributions in NumPy, R, and other statistical software. However, the process for generating these quantities is somewhat nontrivial, and we’ll look under the hood at one example in this post.

## Convergence in probability vs. almost sure convergence

Published:

When thinking about the convergence of random quantities, two types of convergence that are often confused with one another are convergence in probability and almost sure convergence. Here, I give the definition of each and a simple example that illustrates the difference. The example comes from the textbook Statistical Inference by Casella and Berger, but I’ll step through the example in more detail.