## James-Stein estimator

** Published:**

The James-Stein estimator dominates the MLE by sharing information across seemingly unrelated variables.
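
As a minimal sketch of that claim, the simulation below compares the squared error of the MLE with that of the (origin-shrinking) James-Stein estimator for a single draw $x \sim N(\theta, I_p)$. The dimension `p = 10`, the true means `theta`, and the helper name `james_stein` are illustrative assumptions, and the dominance result itself requires $p \geq 3$.

```python
import numpy as np

def james_stein(x, sigma2=1.0):
    """Shrink x ~ N(theta, sigma2 * I) toward the origin (requires p >= 3)."""
    p = x.shape[-1]
    shrinkage = 1.0 - (p - 2) * sigma2 / np.sum(x**2, axis=-1, keepdims=True)
    return shrinkage * x

rng = np.random.default_rng(0)
p, n_trials = 10, 5000                       # assumed settings for illustration
theta = rng.normal(size=p)                   # hypothetical true means
x = theta + rng.normal(size=(n_trials, p))   # one noisy observation per trial

# The MLE of theta is x itself; compare average total squared error.
mse_mle = np.mean(np.sum((x - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((james_stein(x) - theta) ** 2, axis=1))
print(f"MLE: {mse_mle:.2f}  James-Stein: {mse_js:.2f}")
```

Each coordinate's estimate depends on all of the others through $\|x\|^2$, which is the sense in which information is shared across variables.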

** Published:**

Expectation maximization can be seen as a special case of variational inference when the approximating distribution for the parameters $q(\theta)$ is taken to be a point mass.

** Published:**

Copulas are flexible statistical tools for modeling correlation structure between variables.

** Published:**

Schur complements are quantities that arise often in linear algebra in the context of block matrix inversion. Here, we review the basics and show an application in statistics.
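
A small numerical check of the block-inversion connection, assuming a partition $M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$ with $D$ invertible: the top-left block of $M^{-1}$ equals the inverse of the Schur complement $S = A - B D^{-1} C$. The matrix sizes and the random test matrix are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))
M = G @ G.T + 5 * np.eye(5)        # symmetric positive definite test matrix

# Partition M into blocks [[A, B], [C, D]].
A, B = M[:2, :2], M[:2, 2:]
C, D = M[2:, :2], M[2:, 2:]

# Schur complement of D in M.
S = A - B @ np.linalg.solve(D, C)

# The top-left block of M^{-1} should equal S^{-1}.
top_left_of_inverse = np.linalg.inv(M)[:2, :2]
print(np.allclose(np.linalg.inv(S), top_left_of_inverse))
```

When $M$ is a Gaussian covariance matrix, the same expression $A - B D^{-1} C$ is the conditional covariance of the first block given the second, which is one place this shows up in statistics.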

** Published:**

Hamiltonian Monte Carlo (HMC) is an MCMC method that borrows ideas from physics. Here, we’ll give a brief overview and a simple example implementation.

** Published:**

In this post, we draw a simple connection between the optimization problems for NMF and PMF.

** Published:**

Markov Chain Monte Carlo (MCMC) methods encompass a broad class of tools for fitting Bayesian models. Here, we’ll review some of the basic motivation behind MCMC and a couple of the most well-known methods.

** Published:**

There exists a duality between maximum likelihood estimation and finding the maximum entropy distribution subject to a set of linear constraints.

** Published:**

Bayesian and frequentist methods can lead people to very different conclusions. One example is Lindley’s paradox, in which a hypothesis test arrives at opposite conclusions depending on whether a Bayesian or a frequentist test is used.

** Published:**

Choosing a prior distribution is a philosophically and practically challenging part of Bayesian data analysis. Noninformative priors try to skirt this issue by placing equal weight on all possible parameter values; however, these priors are often “improper” – we review this issue here.

** Published:**

Probabilistic PCA generalizes traditional PCA into a probabilistic model whose maximum likelihood estimate corresponds to the traditional version. Here, we give step-by-step derivations for some of the quantities of interest.

** Published:**

The Dirichlet process (DP) is one of the most common – and one of the simplest – prior distributions used in Bayesian nonparametric models. In this post, we’ll review a couple different interpretations of DPs.

** Published:**

Probabilistic models are flexible tools for understanding a data generating process. There are many ways to do inference and estimation in these models, and it continues to be an active area of research. Here, our goal will be to understand a few of the most general classes of probabilistic model fitting, while trying to understand the important differences between them. We’ll demonstrate how to implement each one with a very simple example: the Beta-Bernoulli model.

** Published:**

Principal component analysis is a widely-used dimensionality reduction technique. However, PCA has an implicit connection to the Gaussian distribution, which may be undesirable for non-Gaussian data. Here, we’ll see a second approach, introduced by Andrew Landgraf in 2015, for generalizing PCA to other distributions.

** Published:**

Reduced-rank regression is a method for finding associations between two high-dimensional datasets with paired samples.

** Published:**

In many types of programming, random seeds are used to make computational results reproducible by generating a known set of random numbers. However, the choice of a random seed can affect results in non-trivial ways.

** Published:**

Condition numbers measure the sensitivity of a function to changes in its inputs. We review this concept here, along with some specific examples to build intuition.
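
To build a bit of intuition, here is a rough sketch using a nearly singular $2 \times 2$ linear system (the matrix and right-hand sides are made up for illustration). A perturbation of size $10^{-4}$ in the input `b` moves the solution by an amount of order one, which is the kind of sensitivity a large condition number warns about.

```python
import numpy as np

# Hypothetical ill-conditioned matrix: its rows are almost identical.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
cond = np.linalg.cond(A)           # 2-norm condition number

b = np.array([2.0, 2.0001])
b_perturbed = b + np.array([0.0, 1e-4])

x = np.linalg.solve(A, b)                      # approximately [1, 1]
x_perturbed = np.linalg.solve(A, b_perturbed)  # approximately [0, 2]

print(f"condition number: {cond:.0f}")
print("solution:", x, "perturbed solution:", x_perturbed)
```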

** Published:**

In prediction problems, we often fit one model, evaluate its performance, and test it on unseen data. But what if we could combine multiple models at once and leverage their combined performance? This is the spirit of “boosting”: creating an ensemble of learning algorithms, which perform better together than each does independently. Here, we’ll give a quick overview of boosting, and we’ll review one of the most influential boosting algorithms, AdaBoost.

** Published:**

Statistical “whitening” is a family of procedures for standardizing and decorrelating a set of variables. Here, we’ll review this concept in a general sense, and see two specific examples.
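
As one concrete instance, the sketch below applies ZCA whitening, which uses the transform $W = \Sigma^{-1/2}$ built from the eigendecomposition of the sample covariance. ZCA is just one common member of the family and is chosen here as an assumption; the simulated data and the `mixing` matrix are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0],
                   [1.5, 0.5]])                  # hypothetical mixing matrix
X = rng.normal(size=(2000, 2)) @ mixing.T        # correlated, unequal variances

Xc = X - X.mean(axis=0)                          # center each variable
cov = np.cov(Xc, rowvar=False)

# ZCA whitening matrix: W = cov^{-1/2} via the eigendecomposition of cov.
eigvals, eigvecs = np.linalg.eigh(cov)
W_zca = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

Z = Xc @ W_zca.T                                 # whitened data
print(np.round(np.cov(Z, rowvar=False), 6))      # close to the identity
```

Any matrix $W$ with $W^\top W = \Sigma^{-1}$ whitens the data, so the choice of $W$ is only unique up to a rotation; this is why whitening is a family of procedures rather than a single one.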

** Published:**

Matrix decomposition methods factor a matrix $A$ into a product of two other matrices, $A = BC$. In this post, we review some of the most common matrix decompositions, and why they’re useful.

** Published:**

On 4/2/20, the Dana-Farber Cancer Institute and the Brown Institute at Columbia hosted a “zoomposium” (symposium via Zoom) about epidemiological modeling of the COVID-19 pandemic. These are my notes from the speakers’ presentations.

** Published:**

As their name suggests, “quasi-likelihoods” are quantities that aren’t formally likelihood functions, but can be used as replacements for formal likelihoods in more general settings.

** Published:**

Variable selection is an important part of high-dimensional statistical modeling. Many popular approaches for variable selection, such as LASSO, suffer from bias. The smoothly clipped absolute deviation (SCAD) estimator attempts to alleviate this bias issue, while also retaining a continuous penalty that encourages sparsity.

** Published:**

In this post, we’ll review a family of fundamental classification algorithms: linear and quadratic discriminant analysis.

** Published:**

Principal component analysis (PCA) in its typical form implicitly assumes that the observed data matrix follows a Gaussian distribution. However, PCA can be generalized to allow for other distributions – here, we take a look at its generalization to exponential families, introduced by Collins et al. in 2001.

** Published:**

The representer theorem is a powerful result that implies a certain type of duality between solutions to function estimation problems.

** Published:**

Generalized linear models are flexible tools for modeling various response distributions. This post covers one common way of fitting them.

** Published:**

In this post, we cover a condition that is necessary and sufficient for the LASSO estimator to work correctly.

** Published:**

The “log-derivative trick” is really just a simple application of the chain rule. However, it allows us to rewrite expectations in a way that is amenable to Monte Carlo approximation.
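
Concretely, the identity is $\nabla_\theta \mathbb{E}_{p_\theta}[f(x)] = \mathbb{E}_{p_\theta}[f(x) \nabla_\theta \log p_\theta(x)]$, which follows from $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$. Below is a minimal Monte Carlo sketch under assumed choices: $p_\theta = N(\theta, 1)$ and $f(x) = x^2$, for which the exact gradient is $2\theta$ because $\mathbb{E}[x^2] = \theta^2 + 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5                                    # assumed parameter value
x = rng.normal(loc=theta, scale=1.0, size=200_000)

f = x ** 2                                     # assumed test function f(x) = x^2
score = x - theta                              # d/dtheta log N(x; theta, 1)

grad_estimate = np.mean(f * score)             # Monte Carlo estimate of the gradient
grad_exact = 2 * theta                         # since E[x^2] = theta^2 + 1

print(f"estimate: {grad_estimate:.3f}  exact: {grad_exact:.3f}")
```

The estimator only needs samples from $p_\theta$ and the score function, not the derivative of $f$.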

** Published:**

When we construct and analyze statistical estimators, we often assume that the model is correctly specified. However, in practice, this is rarely the case — our assumed models are usually approximations of the truth, but they’re useful nonetheless.

** Published:**

The “Gumbel max trick” is a method for sampling from a discrete distribution by applying a deterministic function to the distribution’s parameters and independent Gumbel noise.
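
A short sketch of the idea: add independent standard Gumbel noise to the log-probabilities and take the argmax, and the resulting index is distributed according to the original probabilities. The three-category distribution `probs` is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.2, 0.7])      # hypothetical target distribution
n = 100_000

# One independent Gumbel(0, 1) draw per category per sample.
gumbel_noise = rng.gumbel(size=(n, probs.size))

# argmax of (log-probability + noise) is a sample from `probs`.
samples = np.argmax(np.log(probs) + gumbel_noise, axis=1)

freqs = np.bincount(samples, minlength=probs.size) / n
print(freqs)                           # should be close to probs
```

The same construction works with unnormalized log-weights, since adding a constant to every entry does not change the argmax.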

** Published:**

Control variates are a class of methods for reducing the variance of a generic Monte Carlo estimator.
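
Here is a minimal sketch with an assumed toy problem: estimating $\mathbb{E}[e^U]$ for $U \sim \text{Uniform}(0, 1)$, whose true value is $e - 1$. The control variate is $U$ itself, whose mean $1/2$ is known, and the coefficient `c` is estimated from the same samples (which introduces a small bias that vanishes as the sample size grows).

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=10_000)

f = np.exp(u)                 # quantity whose mean we want
g = u                         # control variate with known mean 0.5

# Variance-minimizing coefficient: c = Cov(f, g) / Var(g).
c = np.cov(f, g)[0, 1] / np.var(g, ddof=1)

f_cv = f - c * (g - 0.5)      # same mean as f, lower variance

plain_estimate = f.mean()
cv_estimate = f_cv.mean()
print(f"plain: {plain_estimate:.4f}  control variate: {cv_estimate:.4f}")
print(f"per-sample variance: {np.var(f):.4f} -> {np.var(f_cv):.4f}")
```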

** Published:**

Mallows’ $C_p$ statistic is one way to measure and correct for model complexity when searching for the statistical model with the best performance.

** Published:**

Ridge regression — a regularized variant of ordinary least squares — is useful for dealing with collinearity and non-identifiability. Here, we’ll explore some of the linear algebra behind it.
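
For a concrete picture, the sketch below uses the closed-form ridge solution $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$ on a deliberately extreme, made-up design with two identical columns. Here $X^\top X$ is singular, so ordinary least squares has no unique solution, but adding $\lambda I$ makes the system solvable. The helper name `ridge` and the value `lam=1.0` are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1])              # perfectly collinear columns
y = 3.0 * x1 + 0.1 * rng.normal(size=50)   # true total effect is 3

beta = ridge(X, y, lam=1.0)
print(beta)   # the effect is split evenly across the two identical columns
```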

** Published:**

In this post we’ll cover a simple algorithm for managing a portfolio of assets called the “Universal Portfolio”, developed by Thomas Cover in the 90s. Although the method was developed in the context of finance, it applies more generally to the setting of online learning.

** Published:**

Maximum likelihood estimation (MLE) is one of the most popular and well-studied methods for creating statistical estimators. This post will review conditions under which the MLE is consistent.

** Published:**

Online learning algorithms make decisions in uncertain, constantly-changing environments. This post will review a couple basic forms of online learning algorithms, as well as some motivating examples.

** Published:**

This post briefly covers a broad class of statistical estimators: M-estimators. We’ll review the basic definition, some well-known special cases, and some of their asymptotic properties.

** Published:**

Here, we’ll look at linear regression from a statistical learning theory perspective. In particular, we’ll derive the number of samples necessary in order to achieve a certain level of regression error. We’ll also see a technique called “discretization” that allows for proving things about infinite sets by relying on results in finite sets.

** Published:**

Maximum entropy distributions are those that are the “least informative” (i.e., have the greatest entropy) among a class of distributions with certain constraints. The principle of maximum entropy has roots across information theory, statistical mechanics, Bayesian probability, and philosophy. For this post, we’ll focus on the simple definition of maximum entropy distributions.

** Published:**

Hypothesis testing is a fundamental part of mathematical statistics, but finding the best hypothesis test for a given problem is a nontrivial exercise. The Neyman-Pearson Lemma gives strong guidance about how to choose hypothesis tests – we review and prove it here.

** Published:**

VC dimension is a measure of the complexity of a statistical model. In essence, a model with a higher VC dimension is able to learn more complex mappings between data and labels. In this post, we’ll firm up this definition and walk through a couple simple examples.

** Published:**

I take for granted that I can easily generate random samples from a variety of probability distributions in NumPy, R, and other statistical software. However, the process for generating these quantities is somewhat nontrivial, and we’ll look under the hood at one example in this post.

** Published:**

When thinking about the convergence of random quantities, two types of convergence that are often confused with one another are convergence in probability and almost sure convergence. Here, I give the definition of each and a simple example that illustrates the difference. The example comes from the textbook *Statistical Inference* by Casella and Berger, but I’ll step through the example in more detail.

** Published:**

Here, we’ll discuss a high-level concept relating to efficient learning. In many learning problems, an agent seeks to minimize the cost they incur by selecting a hypothesis about the world from a set of many possible hypotheses. We’ll dive into a framework for prioritizing among the candidate hypotheses (structural risk minimization) – and one interesting way to specifically assign priority levels to hypotheses (minimum description length).

** Published:**

In this post, we’ll look at a simple example of performing transformations on random variables. Specifically, we’ll explore what happens when two independent Gaussian-distributed random variables are transformed to polar coordinates.
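
As a quick empirical sketch, assuming two independent standard normal variables $X$ and $Y$: the squared radius $R^2 = X^2 + Y^2$ is exponentially distributed with mean 2, the angle is uniform on $(-\pi, \pi]$, and the radius and angle are independent. The sample size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 100_000))   # two independent standard normals

r = np.hypot(x, y)                     # radius
angle = np.arctan2(y, x)               # angle in (-pi, pi]

print(f"mean of r^2: {np.mean(r**2):.3f}  (theory: 2)")
print(f"mean angle:  {np.mean(angle):.3f}  (theory: 0)")
print(f"corr(r, angle): {np.corrcoef(r, angle)[0, 1]:.4f}  (theory: 0)")
```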