Posts by Tags

Bayesian statistics

Lindley’s paradox

6 minute read

Published:

Bayesian and frequentist methods can lead people to very different conclusions. This is exemplified by Lindley’s paradox, in which a hypothesis test arrives at opposite conclusions depending on whether a Bayesian or a frequentist test is used.

Improper priors

8 minute read

Published:

Choosing a prior distribution is a philosophically and practically challenging part of Bayesian data analysis. Noninformative priors try to skirt this issue by placing equal weight on all possible parameter values; however, these priors are often “improper” – we review this issue here.

MCMC

Hamiltonian Monte Carlo

8 minute read

Published:

Hamiltonian Monte Carlo (HMC) is an MCMC method that borrows ideas from physics. Here, we’ll give a brief overview and a simple example implementation.

algorithms

Universal Portfolios: A simple online learning algorithm

8 minute read

Published:

In this post, we’ll cover the “Universal Portfolio”, a simple algorithm for managing a portfolio of assets developed by Thomas Cover in the 1990s. Although the method was developed in the context of finance, it applies more generally to the setting of online learning.

Learning from Expert Advice and Hedge

6 minute read

Published:

Online learning algorithms make decisions in uncertain, constantly changing environments. This post will review a couple of basic forms of online learning algorithms, as well as some motivating examples.

bayesian statistics

Bayesian model averaging

8 minute read

Published:

Bayesian model averaging provides a way to combine information across statistical models and account for the uncertainty embedded in each.

decision theory

James-Stein estimator

8 minute read

Published:

The James-Stein estimator dominates the MLE by sharing information across seemingly unrelated variables.

epidemiology

COVID-19 Zoomposium notes

10 minute read

Published:

On 4/2/20, the Dana-Farber Cancer Institute and the Brown Institute at Columbia hosted a “zoomposium” (a symposium held via Zoom) about epidemiological modeling of the COVID-19 pandemic. These are my notes from the speakers’ presentations.

information theory

Maximum entropy distributions

6 minute read

Published:

Maximum entropy distributions are those that are the “least informative” (i.e., have the greatest entropy) among a class of distributions with certain constraints. The principle of maximum entropy has roots across information theory, statistical mechanics, Bayesian probability, and philosophy. For this post, we’ll focus on the simple definition of maximum entropy distributions.

learning theory

AdaBoost

6 minute read

Published:

In prediction problems, we often fit one model, evaluate its performance, and test it on unseen data. But what if we could combine multiple models and leverage their collective performance? This is the spirit of “boosting”: creating an ensemble of learning algorithms that perform better together than each does independently. Here, we’ll give a quick overview of boosting, and we’ll review one of the most influential boosting algorithms, AdaBoost.

Learning from Expert Advice and Hedge

6 minute read

Published:

Online learning algorithms make decisions in uncertain, constantly changing environments. This post will review a couple of basic forms of online learning algorithms, as well as some motivating examples.

Sample complexity of linear regression

5 minute read

Published:

Here, we’ll look at linear regression from a statistical learning theory perspective. In particular, we’ll derive the number of samples necessary to achieve a certain level of regression error. We’ll also see a technique called “discretization” that allows us to prove results about infinite sets by relying on results for finite sets.

Introduction to VC dimension

6 minute read

Published:

VC dimension is a measure of the complexity of a statistical model. In essence, a model with a higher VC dimension is able to learn more complex mappings between data and labels. In this post, we’ll firm up this definition and walk through a couple of simple examples.

Structural risk minimization and minimum description length

7 minute read

Published:

Here, we’ll discuss a high-level concept relating to efficient learning. In many learning problems, an agent seeks to minimize the cost it incurs by selecting a hypothesis about the world from a set of many possible hypotheses. We’ll dive into a framework for prioritizing among the candidate hypotheses (structural risk minimization), and one interesting way to assign priority levels to hypotheses (minimum description length).

linear algebra

Schur complements

6 minute read

Published:

Schur complements are quantities that arise often in linear algebra in the context of block matrix inversion. Here, we review the basics and show an application in statistics.

Reduced-rank regression

7 minute read

Published:

Reduced-rank regression is a method for finding associations between two high-dimensional datasets with paired samples.

Condition numbers

6 minute read

Published:

Condition numbers measure the sensitivity of a function to changes in its inputs. We review this concept here, along with some specific examples to build intuition.

Statistical whitening transformations

5 minute read

Published:

Statistical “whitening” is a family of procedures for standardizing and decorrelating a set of variables. Here, we’ll review this concept in a general sense, and see two specific examples.

Common matrix decompositions

10 minute read

Published:

Matrix decomposition methods factor a matrix $A$ into a product of simpler matrices, such as $A = BC$. In this post, we review some of the most common matrix decompositions and why they’re useful.

The linear algebra of ridge regression

5 minute read

Published:

Ridge regression — a regularized variant of ordinary least squares — is useful for dealing with collinearity and non-identifiability. Here, we’ll explore some of the linear algebra behind it.

machine learning

The Concrete Distribution

6 minute read

Published:

The Concrete distribution is a continuous relaxation of discrete distributions.

EM as a special case of variational inference

4 minute read

Published:

Expectation maximization can be seen as a special case of variational inference when the approximating distribution for the parameters $q(\theta)$ is taken to be a point mass.

Hamiltonian Monte Carlo

8 minute read

Published:

Hamiltonian Monte Carlo (HMC) is an MCMC method that borrows ideas from physics. Here, we’ll give a brief overview and a simple example implementation.

Whirlwind tour of MCMC for posterior inference

12 minute read

Published:

Markov Chain Monte Carlo (MCMC) methods encompass a broad class of tools for fitting Bayesian models. Here, we’ll review some of the basic motivation behind MCMC and a couple of the most well-known methods.

Probabilistic PCA derivations

13 minute read

Published:

Probabilistic PCA generalizes traditional PCA into a probabilistic model whose maximum likelihood estimate corresponds to the traditional version. Here, we give step-by-step derivations for some of the quantities of interest.

Dirichlet Processes: the basics

3 minute read

Published:

The Dirichlet process (DP) is one of the most common – and simplest – prior distributions used in Bayesian nonparametric models. In this post, we’ll review a couple of different interpretations of DPs.

Estimation and Inference in probabilistic models: A whirlwind tour

17 minute read

Published:

Probabilistic models are flexible tools for understanding a data-generating process. There are many ways to do inference and estimation in these models, and it continues to be an active area of research. Here, our goal will be to survey a few of the most general approaches to fitting probabilistic models, while highlighting the important differences between them. We’ll demonstrate how to implement each one with a very simple example: the Beta-Bernoulli model.

Generalized PCA: an alternative approach

5 minute read

Published:

Principal component analysis is a widely used dimensionality reduction technique. However, PCA has an implicit connection to the Gaussian distribution, which may be undesirable for non-Gaussian data. Here, we’ll see a second approach for generalizing PCA to other distributions, introduced by Andrew Landgraf in 2015.

Are random seeds hyperparameters?

9 minute read

Published:

In many types of programming, random seeds are used to make computational results reproducible by generating a known set of random numbers. However, the choice of a random seed can affect results in non-trivial ways.

AdaBoost

6 minute read

Published:

In prediction problems, we often fit one model, evaluate its performance, and test it on unseen data. But what if we could combine multiple models and leverage their collective performance? This is the spirit of “boosting”: creating an ensemble of learning algorithms that perform better together than each does independently. Here, we’ll give a quick overview of boosting, and we’ll review one of the most influential boosting algorithms, AdaBoost.

The smoothly clipped absolute deviation (SCAD) penalty

6 minute read

Published:

Variable selection is an important part of high-dimensional statistical modeling. Many popular approaches for variable selection, such as LASSO, suffer from bias. The smoothly clipped absolute deviation (SCAD) estimator attempts to alleviate this bias issue, while also retaining a continuous penalty that encourages sparsity.

Log-derivative trick

3 minute read

Published:

The “log-derivative trick” is really just a simple application of the chain rule. However, it allows us to rewrite expectations in a way that is amenable to Monte Carlo approximation.

Introduction to VC dimension

6 minute read

Published:

VC dimension is a measure of the complexity of a statistical model. In essence, a model with a higher VC dimension is able to learn more complex mappings between data and labels. In this post, we’ll firm up this definition and walk through a couple of simple examples.

numerical analysis

Condition numbers

6 minute read

Published:

Condition numbers measure the sensitivity of a function to changes in its inputs. We review this concept here, along with some specific examples to build intuition.

philosophy of science

Lindley’s paradox

6 minute read

Published:

Bayesian and frequentist methods can lead people to very different conclusions. This is exemplified by Lindley’s paradox, in which a hypothesis test arrives at opposite conclusions depending on whether a Bayesian or a frequentist test is used.

physics

Hamiltonian Monte Carlo

8 minute read

Published:

Hamiltonian Monte Carlo (HMC) is an MCMC method that borrows ideas from physics. Here, we’ll give a brief overview and a simple example implementation.

probability

Wiener and Ito processes

4 minute read

Published:

A brief review of three types of stochastic processes: Wiener processes, generalized Wiener processes, and Ito processes.

Slice sampling

6 minute read

Published:

Slice sampling is a method for obtaining random samples from an arbitrary distribution. Here, we walk through the basic steps of slice sampling and present two visual examples.

Bayesian model averaging

8 minute read

Published:

Bayesian model averaging provides a way to combine information across statistical models and account for the uncertainty embedded in each.

Dirichlet Processes: the basics

3 minute read

Published:

The Dirichlet process (DP) is one of the most common – and simplest – prior distributions used in Bayesian nonparametric models. In this post, we’ll review a couple of different interpretations of DPs.

Estimation and Inference in probabilistic models: A whirlwind tour

17 minute read

Published:

Probabilistic models are flexible tools for understanding a data-generating process. There are many ways to do inference and estimation in these models, and it continues to be an active area of research. Here, our goal will be to survey a few of the most general approaches to fitting probabilistic models, while highlighting the important differences between them. We’ll demonstrate how to implement each one with a very simple example: the Beta-Bernoulli model.

Maximum entropy distributions

6 minute read

Published:

Maximum entropy distributions are those that are the “least informative” (i.e., have the greatest entropy) among a class of distributions with certain constraints. The principle of maximum entropy has roots across information theory, statistical mechanics, Bayesian probability, and philosophy. For this post, we’ll focus on the simple definition of maximum entropy distributions.

Generating random samples from probability distributions

6 minute read

Published:

I take for granted that I can easily generate random samples from a variety of probability distributions in NumPy, R, and other statistical software. However, the process for generating these quantities is somewhat nontrivial, and we’ll look under the hood at one example in this post.

Convergence in probability vs. almost sure convergence

5 minute read

Published:

When thinking about the convergence of random quantities, two types of convergence that are often confused with one another are convergence in probability and almost sure convergence. Here, I give the definition of each and a simple example that illustrates the difference. The example comes from the textbook Statistical Inference by Casella and Berger, but I’ll step through the example in more detail.

Rayleigh distribution (aka polar transformation of Gaussians)

5 minute read

Published:

In this post, we’ll look at a simple example of performing transformations on random variables. Specifically, we’ll explore what happens when two independent Gaussian-distributed random variables are transformed to polar coordinates.

programming

Are random seeds hyperparameters?

9 minute read

Published:

In many types of programming, random seeds are used to make computational results reproducible by generating a known set of random numbers. However, the choice of a random seed can affect results in non-trivial ways.

statistics

Tweedie distributions

4 minute read

Published:

Tweedie distributions are a very general family of distributions that includes the Gaussian, Poisson, and Gamma (among many others) as special cases.

Scale mixtures of normals

8 minute read

Published:

Here, we discuss two distributions which arise as scale mixtures of normals: the Laplace and the Student-$t$.

The Concrete Distribution

6 minute read

Published:

The Concrete distribution is a continuous relaxation of discrete distributions.

Wiener and Ito processes

4 minute read

Published:

A brief review of three types of stochastic processes: Wiener processes, generalized Wiener processes, and Ito processes.

Slice sampling

6 minute read

Published:

Slice sampling is a method for obtaining random samples from an arbitrary distribution. Here, we walk through the basic steps of slice sampling and present two visual examples.

Bayesian model averaging

8 minute read

Published:

Bayesian model averaging provides a way to combine information across statistical models and account for the uncertainty embedded in each.

James-Stein estimator

8 minute read

Published:

The James-Stein estimator dominates the MLE by sharing information across seemingly unrelated variables.

EM as a special case of variational inference

4 minute read

Published:

Expectation maximization can be seen as a special case of variational inference when the approximating distribution for the parameters $q(\theta)$ is taken to be a point mass.

Copulas and Sklar’s Theorem

4 minute read

Published:

Copulas are flexible statistical tools for modeling correlation structure between variables.

Schur complements

6 minute read

Published:

Schur complements are quantities that arise often in linear algebra in the context of block matrix inversion. Here, we review the basics and show an application in statistics.

Hamiltonian Monte Carlo

8 minute read

Published:

Hamiltonian Monte Carlo (HMC) is an MCMC method that borrows ideas from physics. Here, we’ll give a brief overview and a simple example implementation.

Whirlwind tour of MCMC for posterior inference

12 minute read

Published:

Markov Chain Monte Carlo (MCMC) methods encompass a broad class of tools for fitting Bayesian models. Here, we’ll review some of the basic motivation behind MCMC and a couple of the most well-known methods.

Lindley’s paradox

6 minute read

Published:

Bayesian and frequentist methods can lead people to very different conclusions. This is exemplified by Lindley’s paradox, in which a hypothesis test arrives at opposite conclusions depending on whether a Bayesian or a frequentist test is used.

Improper priors

8 minute read

Published:

Choosing a prior distribution is a philosophically and practically challenging part of Bayesian data analysis. Noninformative priors try to skirt this issue by placing equal weight on all possible parameter values; however, these priors are often “improper” – we review this issue here.

Probabilistic PCA derivations

13 minute read

Published:

Probabilistic PCA generalizes traditional PCA into a probabilistic model whose maximum likelihood estimate corresponds to the traditional version. Here, we give step-by-step derivations for some of the quantities of interest.

Dirichlet Processes: the basics

3 minute read

Published:

The Dirichlet process (DP) is one of the most common – and simplest – prior distributions used in Bayesian nonparametric models. In this post, we’ll review a couple of different interpretations of DPs.

Estimation and Inference in probabilistic models: A whirlwind tour

17 minute read

Published:

Probabilistic models are flexible tools for understanding a data-generating process. There are many ways to do inference and estimation in these models, and it continues to be an active area of research. Here, our goal will be to survey a few of the most general approaches to fitting probabilistic models, while highlighting the important differences between them. We’ll demonstrate how to implement each one with a very simple example: the Beta-Bernoulli model.

Generalized PCA: an alternative approach

5 minute read

Published:

Principal component analysis is a widely used dimensionality reduction technique. However, PCA has an implicit connection to the Gaussian distribution, which may be undesirable for non-Gaussian data. Here, we’ll see a second approach for generalizing PCA to other distributions, introduced by Andrew Landgraf in 2015.

Reduced-rank regression

7 minute read

Published:

Reduced-rank regression is a method for finding associations between two high-dimensional datasets with paired samples.

Statistical whitening transformations

5 minute read

Published:

Statistical “whitening” is a family of procedures for standardizing and decorrelating a set of variables. Here, we’ll review this concept in a general sense, and see two specific examples.

Quasi-likelihoods

5 minute read

Published:

As their name suggests, “quasi-likelihoods” are quantities that aren’t formally likelihood functions, but can be used as replacements for formal likelihoods in more general settings.

The smoothly clipped absolute deviation (SCAD) penalty

6 minute read

Published:

Variable selection is an important part of high-dimensional statistical modeling. Many popular approaches for variable selection, such as LASSO, suffer from bias. The smoothly clipped absolute deviation (SCAD) estimator attempts to alleviate this bias issue, while also retaining a continuous penalty that encourages sparsity.

Generalized PCA

12 minute read

Published:

Principal component analysis (PCA) in its typical form implicitly assumes that the observed data matrix follows a Gaussian distribution. However, PCA can be generalized to allow for other distributions – here, we take a look at its generalization for exponential families introduced by Collins et al. in 2001.

Log-derivative trick

3 minute read

Published:

The “log-derivative trick” is really just a simple application of the chain rule. However, it allows us to rewrite expectations in a way that is amenable to Monte Carlo approximation.

MLE under a misspecified model

4 minute read

Published:

When we construct and analyze statistical estimators, we often assume that the model is correctly specified. However, in practice, this is rarely the case — our assumed models are usually approximations of the truth, but they’re useful nonetheless.

Gumbel max trick

4 minute read

Published:

The “Gumbel max trick” is a method for sampling from discrete distributions by applying a deterministic transformation to the distribution’s parameters and auxiliary Gumbel-distributed noise.

Control variates

3 minute read

Published:

Control variates are a class of methods for reducing the variance of a generic Monte Carlo estimator.

Mallows Cp

3 minute read

Published:

Mallows’ $C_p$ statistic is one way to measure and correct for model complexity when searching for the statistical model with the best performance.

The linear algebra of ridge regression

5 minute read

Published:

Ridge regression — a regularized variant of ordinary least squares — is useful for dealing with collinearity and non-identifiability. Here, we’ll explore some of the linear algebra behind it.

Consistency of MLE

4 minute read

Published:

Maximum likelihood estimation (MLE) is one of the most popular and well-studied methods for creating statistical estimators. This post will review conditions under which the MLE is consistent.

M-estimation

4 minute read

Published:

This post briefly covers a broad class of statistical estimators: M-estimators. We’ll review the basic definition, some well-known special cases, and some of their asymptotic properties.

Neyman-Pearson Lemma

5 minute read

Published:

Hypothesis testing is a fundamental part of mathematical statistics, but finding the best hypothesis test for a given problem is a nontrivial exercise. The Neyman-Pearson Lemma gives strong guidance about how to choose hypothesis tests – we review and prove it here.

Generating random samples from probability distributions

6 minute read

Published:

I take for granted that I can easily generate random samples from a variety of probability distributions in NumPy, R, and other statistical software. However, the process for generating these quantities is somewhat nontrivial, and we’ll look under the hood at one example in this post.

stochastic calculus

Ito’s lemma

4 minute read

Published:

A sketch of the derivation for Ito’s Lemma and a simple example.

Wiener and Ito processes

4 minute read

Published:

A brief review of three types of stochastic processes: Wiener processes, generalized Wiener processes, and Ito processes.

stochastic processes

Whirlwind tour of MCMC for posterior inference

12 minute read

Published:

Markov Chain Monte Carlo (MCMC) methods encompass a broad class of tools for fitting Bayesian models. Here, we’ll review some of the basic motivation behind MCMC and a couple of the most well-known methods.