Jekyll2021-01-12T14:23:00-08:00https://andrewcharlesjones.github.io/feed.xmlAndy Jonespersonal descriptionAndy Jonesaj13@princeton.eduBelief propagation2021-01-12T00:00:00-08:002021-01-12T00:00:00-08:00https://andrewcharlesjones.github.io/posts/2021/01/belief-propagation<p>Belief propagation is a family of message passing algorithms, often used for computing marginal distributions and maximum a posteriori (MAP) estimates of random variables that have a graph structure.</p> <h2 id="computing-marginals">Computing marginals</h2> <p>Consider three binary random variables $x_1, x_2, x_3 \in \{0, 1\}$. Denote their joint distribution as $p(x_1, x_2, x_3)$. Suppose we want to compute the marginal distribution of $x_2$, $p(x_2)$. Naively, we can do this by summing over the other variables:</p> $p(x_2) = \sum\limits_{x_1 \in \{0, 1\}} \sum\limits_{x_3 \in \{0, 1\}} p(x_1, x_2, x_3).$ <p>This sum has 4 terms, one for each possible value of $(x_1, x_3)$. In general, finding the marginal of $x_i$ from a joint distribution of $p$ binary variables requires computing a sum with $2^{p-1}$ terms:</p> $p(x_i) = \sum\limits_{x_1 \in \{0, 1\}} \cdots \sum\limits_{x_{i-1} \in \{0, 1\}} \sum\limits_{x_{i+1} \in \{0, 1\}} \cdots \sum\limits_{x_p \in \{0, 1\}} p(x_1, \dots, x_p).$ <p>These sums become intractable even for moderate values of $p$, and the problem only worsens when each variable can take more than two states.</p> <p>However, exploiting any special structure that exists between the variables can greatly expedite computing the marginals. Here, we explore factor graphs and how belief propagation uses them to compute marginal distributions.</p> <h2 id="factor-graphs">Factor graphs</h2> <p>Belief propagation is typically defined as operating on factor graphs. In their simplest form, factor graphs are a graph representation of a function of multiple variables.
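</p> <p>Before turning to factor graphs, the brute-force marginalization described above can be sketched in a few lines of NumPy (the joint table below is a randomly generated, hypothetical example):</p>

```python
import numpy as np

# A hypothetical joint distribution over three binary variables,
# stored as a 2x2x2 table indexed by (x1, x2, x3).
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()  # normalize so the table sums to one

# Brute-force marginal of x2: sum over all values of x1 and x3.
p_x2 = np.zeros(2)
for x1 in range(2):
    for x3 in range(2):
        p_x2 += joint[x1, :, x3]

# Equivalent one-liner: sum out axes 0 (x1) and 2 (x3).
assert np.allclose(p_x2, joint.sum(axis=(0, 2)))
print(p_x2)
```

<p>For $p$ variables the analogous loop has $2^{p-1}$ iterations per entry of the marginal, which is exactly the blow-up described above.</p> <p>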
When a function factorizes in a certain way, a factor graph makes the relationships among its variables explicit.</p> <p>In the context of probability and statistics, factor graphs are usually used to represent probability distributions. For example, consider the joint distribution of three random variables $p(x_1, x_2, x_3)$. Without knowing anything else about their relationships, we can represent these variables in a fully-connected graph:</p> <p><img src="/assets/bp1.png" alt="bp1" /></p> <p>However, we may know more about the relationships between them. Suppose that $x_1$ and $x_3$ don’t directly depend on one another, and $x_3$ has its own special behavior. This means that the joint factorizes as</p> $p(x_1, x_2, x_3) = p(x_1, x_2) p(x_2, x_3) p(x_3).$ <p>We can think of this as having three “groups” of interrelated variables: one consisting of $\{x_1, x_2\}$, another consisting of $\{x_2, x_3\}$, and a third with just $x_3$. We can now represent these variables in the form of a factor graph. Factor graphs show the connections between the factors and the variables that those factors relate. In this case, we have the following graph:</p> <p><img src="/assets/bp2.png" alt="bp2" /></p> <p>Factor graphs are always bipartite – we can rearrange the nodes in the above graph to make this visually clear:</p> <p><img src="/assets/bp3.png" alt="bp3" /></p> <p>The “factors” in factor graphs describe the relationships between the variable nodes, and they coordinate inference, as we’ll see next.</p> <h2 id="messages">Messages</h2> <p>As mentioned above, belief propagation is part of a family of algorithms known as “message-passing” algorithms. The name means exactly what it sounds like: the nodes in the graph send “messages” to one another in order to learn about the overall structure of the variables.</p> <p>In this post, we’ll denote a message from node $a$ to node $b$ as $\mu_{a \to b}$.
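</p> <p>One lightweight way to represent this factor graph in code is as a pair of neighbor maps (the factor-node names $a_1, a_2, a_3$ follow the example above; the structure itself is just the factorization):</p>

```python
# Factor graph for p(x1, x2, x3) = p(x1, x2) p(x2, x3) p(x3):
# each factor node is mapped to the variable nodes it touches.
factor_neighbors = {
    "a1": ["x1", "x2"],
    "a2": ["x2", "x3"],
    "a3": ["x3"],
}

# Variable-to-factor adjacency, derived from the map above.
variable_neighbors = {}
for factor, variables in factor_neighbors.items():
    for v in variables:
        variable_neighbors.setdefault(v, []).append(factor)

# The graph is bipartite by construction: edges only connect
# factor nodes to variable nodes.
print(variable_neighbors)
# {'x1': ['a1'], 'x2': ['a1', 'a2'], 'x3': ['a2', 'a3']}
```

<p>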
In the context of probability and statistics, we can usually think about a message from $a$ to $b$ as node $a$ “encouraging” node $b$ to have some type of behavior. In our example above, consider the message from $a_1$ to $x_2$, $\mu_{a_1 \to x_2}$. This message encodes what node $a_1$ “thinks” the state of $x_2$ should be, based on its information about the relationship between $x_1$ and $x_2$.</p> <p>The exact content of these messages – and how they’re passed – depends on the algorithm. Here, we’ll look at belief propagation’s protocol.</p> <h2 id="belief-propagation">Belief propagation</h2> <p>Belief propagation updates the messages that are outgoing from a node based on the ones that are incoming to that node. This eventually spreads the information across the whole graph.</p> <p>It’s an iterative algorithm that updates the messages at each timestep $t=1, \dots, T$. The algorithm is as follows. Note that in this post, we adopt the same notation as in the book Information, Physics, and Computation, where $\partial a$ denotes the set of nodes immediately adjacent to $a$.</p> <p>Belief propagation steps:</p> <ol> <li>$\mu_{j \to a}^{(t+1)}(x_j) = \prod\limits_{b \in \partial j \setminus a} \mu_{b \to j}^{(t)} (x_j)$</li> <li>$\mu_{a \to j}^{(t)}(x_j) = \sum\limits_{\mathbf{x}_{\partial a \setminus j}} f_a(\mathbf{x}_{\partial a}) \prod\limits_{k \in \partial a \setminus j} \mu_{k \to a}^{(t)} (x_k).$</li> </ol> <p>Belief propagation is also known as the “sum-product algorithm” because of the second step – in particular, the way that $\mu_{a \to j}$ is computed.</p> <p>A useful outcome of belief propagation is that its messages can be used to estimate the marginal distributions of each variable.
Specifically, the marginal for $x_i$ is estimated as the product of all incoming messages:</p> $p(x_i) \propto \prod\limits_{a \in \partial x_i} \mu_{a \to x_i}^{(t-1)} (x_i).$ <p>The product of these messages is only proportional to the marginal, so one must divide by the sum of the elements to make it sum to one.</p> <h2 id="example">Example</h2> <p>Consider the same example as in the sections above with variables $x_1, x_2, x_3$. Suppose we want to calculate the marginal distribution $p(x_2)$. Using the procedure above, we take the product of all messages incoming to $x_2$:</p> $p(x_2 = i) \propto \prod\limits_{a \in \partial x_2} \mu_{a \to x_2}[i]$ <p>where $\mu[i]$ denotes the $i$th element of $\mu$.</p> <p>Again, we’ll need to divide by the sum of the elements to make it sum to one.</p> <p>In this example, there are going to be two incoming messages to $x_2$: one from $a_1$ and one from $a_2$. To start, let’s compute the message going from $a_2$ to $x_2$, $\mu_{a_2 \to x_2}$. We have</p> <p>\begin{align} \mu_{a_2 \to x_2} &amp;= \sum\limits_{\mathbf{x}_{\partial a_2 \setminus x_2}} f_2(\mathbf{x}_{\partial a_2}) \prod\limits_{k \in \partial a_2 \setminus x_2} \mu_{k \to a_2} (x_k) \\ &amp;= \sum\limits_{x_3 \in \{0, 1\}} f_2(x_2, x_3) \mu_{x_3 \to a_2} (x_3) \\ &amp;= \sum\limits_{x_3 \in \{0, 1\}} f_2(x_2, x_3) \prod\limits_{b \in \partial x_3 \setminus a_2} \mu_{b \to x_3} (x_3) \\ &amp;= \sum\limits_{x_3 \in \{0, 1\}} f_2(x_2, x_3) \mu_{a_3 \to x_3} (x_3). \end{align}</p> <p>Since $a_3$ doesn’t have any neighbors other than $x_3$, $\mu_{a_3 \to x_3} (x_3)$ reduces to $f_3(x_3) = p(x_3)$. Continuing to simplify,</p> <p>\begin{align} \mu_{a_2 \to x_2}[i] &amp;= \sum\limits_{x_3 \in \{0, 1\}} p(x_2=i, x_3) p(x_3) \\ &amp;= p(x_2=i, x_3=0) p(x_3 = 0) + p(x_2=i, x_3=1) p(x_3 = 1). \end{align}</p> <p>Notice that to compute this update, we had to consider messages streaming all the way from $a_3$ to $x_2$.
We can visualize these steps like so:</p> <p><img src="/assets/bp4.png" alt="bp4" /></p> <p>For the message coming from $a_1$, we have</p> <p>\begin{align} \mu_{a_1 \to x_2} &amp;= \sum\limits_{\mathbf{x}_{\partial a_1 \setminus x_2}} f_1(\mathbf{x}_{\partial a_1}) \prod\limits_{k \in \partial a_1 \setminus x_2} \mu_{k \to a_1} (x_k) \\ &amp;= \sum\limits_{x_1 \in \{0, 1\}} f_1(x_1, x_2) \mu_{x_1 \to a_1} (x_1) \\ &amp;= 0.5 \, p(x_1 = 0, x_2 = i) + 0.5 \, p(x_1 = 1, x_2 = i). \end{align}</p> <p>Here, since $x_1$ doesn’t have any neighbors other than $a_1$, we assume that $\mu_{x_1 \to a_1} (x_1)$ is the uniform distribution over $\{0, 1\}$.</p> <p>Putting these together, we obtain that the unnormalized marginal is</p> <p>\begin{align} p(x_2=i) &amp;\propto \underbrace{0.5 \left[p(x_1 = 0, x_2 = i) + p(x_1 = 1, x_2 = i) \right]}_{\text{Contribution from $\mu_{a_1 \to x_2}$}} \underbrace{\left[ p(x_2=i, x_3=0) p(x_3=0) + p(x_2=i, x_3=1) p(x_3=1) \right]}_{\text{Contribution from $\mu_{a_2 \to x_2}$}} \\ &amp;= 0.5 \left[\sum\limits_{x_1} p(x_1, x_2=i)\right] \left[ \sum\limits_{x_3} p(x_2=i, x_3) p(x_3) \right]. \end{align}</p> <p>In this case, we can see that belief propagation simply reduces to computing the marginal by “brute force”. In other words, since the joint distribution factorizes as</p> $p(x_1, x_2, x_3) = p(x_1, x_2) p(x_2, x_3) p(x_3),$ <p>this belief propagation equation is just the complete sum of the joint over $x_1$ and $x_3$. However, in more general situations, belief propagation will require iterative updating of the messages between nodes.
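</p> <p>To check this equivalence numerically, here is a small sketch of the sum-product computation on this three-variable chain, using hypothetical nonnegative factor tables $f_1(x_1, x_2)$, $f_2(x_2, x_3)$, and $f_3(x_3)$ (any such tables work; after normalization the result matches brute-force marginalization):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
f1 = rng.random((2, 2))  # factor a1 over (x1, x2)
f2 = rng.random((2, 2))  # factor a2 over (x2, x3)
f3 = rng.random(2)       # factor a3 over x3

# Messages into x2, following the sum-product updates above.
# Leaf variable x1 sends a uniform message to a1.
mu_x1_to_a1 = np.array([0.5, 0.5])
mu_a1_to_x2 = np.array([np.sum(f1[:, i] * mu_x1_to_a1) for i in range(2)])

# Leaf factor a3 sends f3 to x3, which forwards it on to a2.
mu_x3_to_a2 = f3
mu_a2_to_x2 = np.array([np.sum(f2[i, :] * mu_x3_to_a2) for i in range(2)])

# Marginal of x2: normalized product of the incoming messages.
p_x2 = mu_a1_to_x2 * mu_a2_to_x2
p_x2 /= p_x2.sum()

# Brute-force check: build the full joint and sum out x1 and x3.
joint = np.einsum("ij,jk,k->ijk", f1, f2, f3)
joint /= joint.sum()
assert np.allclose(p_x2, joint.sum(axis=(0, 2)))
print(p_x2)
```

<p>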
I hope to provide a more complex example in a future post.</p> <h2 id="references">References</h2> <ul> <li>Graph images were created with BioRender.com</li> <li>Information, Physics, and Computation by Marc Mézard and Andrea Montanari.</li> </ul>Andy Jonesaj13@princeton.eduBelief propagation is a family of message passing algorithms, often used for computing marginal distributions and maximum a posteriori (MAP) estimates of random variables that have a graph structure.Visualizing differential equations in Python2021-01-07T00:00:00-08:002021-01-07T00:00:00-08:00https://andrewcharlesjones.github.io/posts/2021/01/differential-equation-viz<p>In this post, we try to visualize a couple of simple differential equations and their solutions with a few lines of Python code.</p> <h2 id="setup">Setup</h2> <p>Consider the following simple differential equation</p> <p>\begin{equation} \frac{dy}{dx} = x. \label{diffeq1} \end{equation}</p> <p>Clearly, the solution to this equation will have the form</p> $y = \frac12 x^2 + C$ <p>where $C \in \mathbb{R}$ is any constant.</p> <p>There are two ways we can think about the differential equation in \eqref{diffeq1}.</p> <h2 id="integral-curves">Integral curves</h2> <p>First, we can say that for a given point $(x_0, y_0)$, the equation computes the slope $m$ of the tangent line at that point as $m = x_0$.</p> <p>For example, consider the point $(1, 1)$. Clearly, at this point $\frac{dy}{dx} = 1$.
We can visualize this by plotting a small line with slope $1$ at the point $(1, 1)$.</p> <p><img src="/assets/diffeq_fig1.png" alt="diffeq_fig1" /></p> <p>We can plot another line at $(2, 1)$.</p> <p><img src="/assets/diffeq_fig2.png" alt="diffeq_fig2" /></p> <p>We can continue doing this at points throughout the graph to get a sense of what the vector field looks like.</p> <p><img src="/assets/diffeq_fig3.png" alt="diffeq_fig3" /></p> <p>As expected, the graph has a parabolic shape to it, as we saw from the solution to Equation \eqref{diffeq1}. Also notice that we can trace any single one of these curves to yield a single solution.</p> <p><img src="/assets/diffeq_fig6.png" alt="diffeq_fig6" /></p> <p>Consider a slightly different example:</p> $\frac{dy}{dx} = x + y.$ <p>We can plot similar lines for this equation and notice a different pattern, this time following the solution curves $y = Ce^x - x - 1$:</p> <p><img src="/assets/diffeq_fig4.png" alt="diffeq_fig4" /></p> <h2 id="isoclines">Isoclines</h2> <p>Here’s a second approach for visualizing differential equations and their solutions. For a given slope $m_0$, we can find all points $\{(x, y)\}$ that satisfy $\frac{dy}{dx} = m_0$. These points form a curve called an “isocline” (think iso = same, cline = slope).</p> <p>Consider again the example $\frac{dy}{dx} = x$. In this case, these points will lie along a vertical line:</p> $x = m_0.$ <p>Notationally, we can write this as the set $\{(x, y) : x = m_0\}$.</p> <p>To start plotting this, consider $m_0 = 1$. That is, let’s find all the points where the slope is $1$. In this example, these points will lie along the line $x = 1$. Plotting this, we have the following graph.</p> <p><img src="/assets/diffeq_fig5.png" alt="diffeq_fig5" /></p> <p>We can continue this for various values of $m_0$ to fill out the same plot:</p> <p><img src="/assets/diffeq_fig3.png" alt="diffeq_fig3" /></p> <h2 id="code">Code</h2> <p>Here’s the simple code used to visualize these equations.
Simply fill in the body of the function <code class="language-plaintext highlighter-rouge">dydx(x, y)</code>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import matplotlib.pyplot as plt

mesh_width = 0.5
dir_field_x_template = np.linspace(-mesh_width / 2, mesh_width / 2, 100)
xlims = [-5, 5]
ylims = [-5, 5]

def dydx(x, y):
    return x

plt.figure(figsize=(7, 6))
plt.xlim(xlims)
plt.ylim(ylims)
plt.axvline(0, c="black")
plt.axhline(0, c="black")
for x in np.arange(xlims[0], xlims[1], mesh_width):
    for y in np.arange(ylims[0], ylims[1], mesh_width):
        curr_slope = dydx(x, y)
        curr_intercept = y - curr_slope * x
        dir_field_xs = dir_field_x_template + x
        dir_field_ys = [curr_slope * dfx + curr_intercept for dfx in dir_field_xs]
        plt.plot(dir_field_xs, dir_field_ys, color="red")
plt.xlabel("x")
plt.ylabel("y")
plt.title("dy/dx")
plt.show()
</code></pre></div></div> <h2 id="references">References</h2> <ul> <li>Prof. Arthur Mattuck’s <a href="https://www.youtube.com/watch?v=XDhJ8lVGbl8">Differential Equations lecture videos</a></li> </ul>Andy Jonesaj13@princeton.eduIn this post, we try to visualize a couple of simple differential equations and their solutions with a few lines of Python code.Cubic splines2020-12-26T00:00:00-08:002020-12-26T00:00:00-08:00https://andrewcharlesjones.github.io/posts/2020/12/cubic-splines<p>Cubic splines are flexible nonparametric models.
Here, we discuss some of the spline fundamentals.</p> <h2 id="introduction">Introduction</h2> <p>Consider the following regression problem:</p> $Y = \Phi(X)\beta + \epsilon$ <p>where $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n}$, and $\Phi(X)$ denotes a basis expansion of $X$. Here, we work with a one-dimensional regression for simplicity, although all ideas will easily extend to multiple dimensions.</p> <p>In vanilla linear regression, we have $\Phi(X) = X$, that is, there is no basis expansion.</p> <p>In polynomial regression of order $d$, the basis expansion for the $i$th sample is</p> $\phi(x_i) = \begin{bmatrix} 1 &amp; x_i &amp; x_i^2 &amp; x_i^3 &amp; \cdots &amp; x_i^d \end{bmatrix}.$ <p>The coefficient vector $\beta$ then becomes a vector of length $d+1$, and we can still apply the OLS estimator:</p> $\widehat{\beta} = (\Phi(X)^\top \Phi(X))^{-1} \Phi(X)^\top Y.$ <h2 id="cubic-splines">Cubic splines</h2> <p>Cubic splines generalize cubic regression to allow for modeling local regions of the input space differently. Specifically, the approach is to choose $k$ thresholds (often called “knots”) $\tau_1, \dots, \tau_k$ and split up the input space into $k+1$ intervals based on these thresholds:</p> $(-\infty, \tau_1], (\tau_1, \tau_2], (\tau_2, \tau_3], \cdots, (\tau_k, \infty).$ <p>Within each of these intervals, we fit a local (cubic) polynomial subject to constraints of continuity and smoothness between intervals.</p> <p>The basis expansion then becomes</p> $\phi(x_i) = \begin{bmatrix} 1 &amp; x_i &amp; x_i^2 &amp; x_i^3 &amp; (x_i - \tau_1)_+^3 &amp; (x_i - \tau_2)_+^3 &amp; \cdots &amp; (x_i - \tau_k)_+^3 \end{bmatrix}.$ <h2 id="smoothing-cubic-splines">Smoothing cubic splines</h2> <p>In general, choosing the locations of the knots is a difficult problem. 
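</p> <p>As an aside, the truncated power basis from the previous section can be assembled into a design matrix in a few lines of NumPy (the data and knot locations here are hypothetical):</p>

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis for a cubic spline:
    [1, x, x^2, x^3, (x - tau_1)_+^3, ..., (x - tau_k)_+^3]."""
    cols = [np.ones_like(x), x, x**2, x**3]
    for tau in knots:
        cols.append(np.maximum(x - tau, 0.0) ** 3)
    return np.column_stack(cols)

x = np.linspace(0, 10, 50)
knots = [2.5, 5.0, 7.5]
Phi = cubic_spline_basis(x, knots)
print(Phi.shape)  # (50, 7): 4 polynomial columns plus one per knot
```

<p>The OLS estimator $(\Phi^\top \Phi)^{-1} \Phi^\top Y$ can then be applied to this matrix directly.</p> <p>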
Smoothing splines avoid having to make these choices by placing a knot on each data point, i.e., $k=n$ and $\tau_1=x_1, \tau_2=x_2, \dots, \tau_n=x_n$.</p> $\phi(x_i) = \begin{bmatrix}1 &amp; x_i &amp; x_i^2 &amp; x_i^3 &amp; (x_i-x_1)_+^3 &amp; \cdots &amp; (x_i-x_n)_+^3\end{bmatrix}.$ <p>“Natural” cubic splines make one more assumption: that the function is linear in the extremes of the input space (below the minimum and above the maximum). For a natural cubic spline, we can write the basis expansion in terms of $n$ basis functions. The design matrix then becomes an $n \times n$ matrix. Let’s call this $\Phi(X)$.</p> <p>It can be shown (see below) that natural cubic splines are the functions that minimize a penalized sum of squares:</p> <p>\begin{equation} \text{arg}\min_f \sum\limits_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b [f^{\prime\prime}(x)]^2 dx \label{eq:spline_objective} \end{equation}</p> <p>Notice that the second term penalizes the integrated squared second derivative of the function ($\lambda$ is a tuning parameter here). In other words, this objective function favors smooth functions. Clearly, this penalty will be zero when the function is linear.
Accordingly, as $\lambda \to \infty$, the optimal $f$ approaches a linear function.</p> <p>Since the optimal function will be a cubic smoothing spline, we can write it as $f(x) = \sum_{j=1}^n \beta_j \phi_j(x)$, where the $\phi_j$ are the natural spline basis functions, and rewrite the objective as</p> $\text{arg}\min_\beta \left[\sum\limits_{i=1}^n (y_i - \phi(x_i)^\top \beta)^2 + \lambda \sum\limits_{j=1}^n \sum\limits_{k=1}^n \beta_j \beta_k \int_a^b \phi_j^{\prime\prime}(x) \phi_k^{\prime\prime}(x) dx\right].$ <p>Letting $\Omega$ be the $n\times n$ matrix with $\Omega_{jk} = \int_a^b \phi_j^{\prime\prime}(x) \phi_k^{\prime\prime}(x) dx$, this simplifies to</p> $\text{arg}\min_\beta \sum\limits_{i=1}^n (y_i - \phi(x_i)^\top \beta)^2 + \lambda \beta^\top \Omega \beta.$ <p>In this form, we can recognize it as a ridge regression problem, and the coefficient estimate will then be</p> $\widehat{\beta} = (\Phi(X)^\top \Phi(X) + \lambda \Omega)^{-1} \Phi(X)^\top Y.$ <h2 id="cubic-splines-as-minimizer">Cubic splines as minimizer</h2> <p>In this section, we show that the natural cubic spline minimizes the objective in \eqref{eq:spline_objective}.</p> <p>We want to show that the function that minimizes the following objective is a natural cubic spline.</p> $\mathcal{L} = \sum\limits_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b \left[f^{\prime\prime}(x)\right]^2 dx.$ <p>Let $f(x)$ be a natural cubic spline that interpolates the points, and let $g(x)$ be any other function that interpolates the points.
Let $h(x)=g(x)-f(x)$ be their difference.</p> <p>For both of these functions, the first term will be zero because they perfectly interpolate the data.</p> <p>Focusing on the second term, our goal reduces to showing that</p> $\int_a^b \left[f^{\prime\prime}(x)\right]^2 dx \leq \int_a^b \left[g^{\prime\prime}(x)\right]^2 dx, ~~~ \forall g.$ <p>Notice that by rewriting $g(x) = h(x) + f(x)$ and using the linearity of derivatives and integrals, the right side is equal to</p> <p>\begin{align} \int_a^b [h^{\prime\prime}(x) + f^{\prime\prime}(x)]^2 dx &amp;= \int_a^b [h^{\prime\prime}(x)^2 + 2h^{\prime\prime}(x) f^{\prime\prime}(x) + f^{\prime\prime}(x)^2] dx \\ &amp;= \int_a^b [h^{\prime\prime}(x)]^2 dx + 2 \int_a^b h^{\prime\prime}(x) f^{\prime\prime}(x) dx + \int_a^b [f^{\prime\prime}(x)]^2 dx. \end{align}</p> <p>Notice that $\int_a^b h^{\prime\prime}(x)^2 dx \geq 0$. Thus, we just need to show that $\int_a^b h^{\prime\prime}(x) f^{\prime\prime}(x) dx \geq 0$. Using integration by parts, where $u=f^{\prime\prime}(x)$ and $dv=h^{\prime\prime}(x)$, we have</p> $\int_a^b h^{\prime\prime}(x) f^{\prime\prime}(x) dx = f^{\prime\prime}(x) h^\prime(x) \big\rvert_a^b - \int_a^b h^\prime(x) f^{\prime\prime\prime}(x) dx.$ <p>A natural cubic spline is linear on its endpoints, which implies that $f^{\prime\prime}(a) = f^{\prime\prime}(b) = 0$. Thus, the first term is zero.</p> <p>For the second term, notice that we can break up the integral into each of the $n-1$ intervals between the points.</p> <p>\begin{align} \int_a^b h^\prime(x) f^{\prime\prime\prime}(x) dx &amp;= \int_{x_1}^{x_2} h^\prime(x) f^{\prime\prime\prime}(x) dx + \int_{x_2}^{x_3} h^\prime(x) f^{\prime\prime\prime}(x) dx + \cdots + \int_{x_{n-1}}^{x_n} h^\prime(x) f^{\prime\prime\prime}(x) dx \\ &amp;= \sum\limits_{i=1}^{n-1} \int_{x_{i}}^{x_{i+1}} h^\prime(x) f^{\prime\prime\prime}(x) dx. 
\end{align}</p> <p>If we again integrate by parts with $u=f^{\prime\prime\prime}(x)$ and $dv=h^\prime(x) dx$, we have</p> $\sum\limits_{i=1}^{n-1} \left[f^{\prime\prime\prime}(x) h(x)\big\rvert_{x_i}^{x_{i+1}} - \int_{x_{i}}^{x_{i+1}} h(x) f^{\prime\prime\prime\prime}(x) dx\right].$ <p>Since both $f$ and $g$ perfectly interpolate the data, $h$ vanishes at every knot, so the first term is zero. Furthermore, since $f$ is a cubic polynomial on each interval, its fourth derivative is zero there, making the second term zero as well.</p> <p>Thus, we have proved that the penalty for any other function will be at least as great as that of a natural cubic spline:</p> $\int_a^b \left[f^{\prime\prime}(x)\right]^2 dx \leq \int_a^b \left[g^{\prime\prime}(x)\right]^2 dx, ~~~ \forall g.$ <p>Returning to the initial loss function, this implies that the overall loss for any other interpolating function is at least as great as that of the natural cubic spline (again, both functions interpolate the data, so the first term is zero for each):</p> $\mathcal{L}(f) \leq \mathcal{L}(g).$ <h2 id="references">References</h2> <ul> <li>Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. No. 10. New York: Springer Series in Statistics, 2001.</li> <li>Sergey Fomel’s <a href="http://sepwww.stanford.edu/sep/sergey/128A/answers6.pdf">notes on splines</a>.</li> </ul>Andy Jonesaj13@princeton.eduCubic splines are flexible nonparametric models. Here, we discuss some of the spline fundamentals.Relationship between the multivariate normal, SVD, and Cholesky decomposition2020-12-19T00:00:00-08:002020-12-19T00:00:00-08:00https://andrewcharlesjones.github.io/posts/2020/12/mvn-svd-cd<p>Matrix musings.</p> <p>Consider an $n\times p$ matrix $X$. Its singular value decomposition is $X = UDV^\top$.</p> <p>Let’s reconstruct this a different way.
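</p> <p>As a quick numerical anchor for what follows, the decomposition itself can be verified with NumPy (a hypothetical random matrix; note that <code class="language-plaintext highlighter-rouge">numpy.linalg.svd</code> returns $V^\top$ directly):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 3
X = rng.standard_normal((n, p))

# Thin SVD: U is n x p, d holds the p singular values, Vt is p x p.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(X, U @ np.diag(d) @ Vt)

# The Gram matrix depends only on V and D, not on U.
assert np.allclose(X.T @ X, Vt.T @ np.diag(d**2) @ Vt)
```

<p>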
Let $U = [u_1, \dots, u_n]^\top$ with</p> $u_1, \dots, u_n \sim \mathcal{N}\left(0, \frac{1}{n} I_p\right).$ <p>Further, let $D = \text{diag}(d_1, \dots, d_p)$. Then</p> $u_iD \sim \mathcal{N}\left(0, \frac{1}{n} D^2 \right).$ <p>Consider an orthogonal matrix $V = [v_1, \dots, v_p]$. Then</p> $u_iDV^\top \sim \mathcal{N}\left(0, \frac{1}{n} V D^2 V^\top \right).$ <p>We can immediately notice that this is the $i$th sample of $X$, where $x_i = u_iDV^\top$. Furthermore, we have a decomposition of its covariance matrix</p> $\Sigma = V D^2 V^\top.$ <p>Notice the relationship to the Cholesky decomposition:</p> $V D^2 V^\top = V D D^\top V^\top = LL^\top$ <p>where $L = VD$. This coincides with a popular way to generate multivariate normal samples with covariance $\Sigma$, namely</p> $x = Lz = VDz, ~~~ z\sim \mathcal{N}(0, I).$ <p>Furthermore, notice that given an observed data matrix $X \in \mathbb{R}^{n \times p}$, its covariance matrix can be completely described without the rotation matrix $U$,</p> $X^\top X = VDU^\top UDV^\top = VD^2V^\top = LL^\top.$ <p>In the context of the multivariate normal, this makes sense because the rows of $U$ have spherical covariance. Thus, we can arbitrarily rotate these samples about the origin and still yield the same covariance matrix. Concretely, define $\widetilde{U} = WU$ such that $W^\top W = I_n$ (i.e., let’s rotate the samples of $U$).
Then</p> $VD\widetilde{U}^\top \widetilde{U}DV^\top = VDU^\top W^\top WUDV^\top = VD^2V^\top,$ <p>which is the same as the covariance of $X$.</p>Andy Jonesaj13@princeton.eduMatrix musings.Shrinkage in ridge regression2020-12-18T00:00:00-08:002020-12-18T00:00:00-08:00https://andrewcharlesjones.github.io/posts/2020/12/shrinkage-ridge<p>A brief review of shrinkage in ridge regression and a comparison to OLS.</p> <h2 id="ols-review">OLS review</h2> <p>Consider the regression problem</p> $Y = X\beta + \epsilon$ <p>where $Y$ is an $n$-vector of responses, $X$ is an $n \times p$ matrix of covariates, $\beta$ is a $p$-vector of unknown coefficients, and $\epsilon$ is i.i.d. noise. We can estimate $\beta$ by minimizing the sum of squares:</p> $\ell = \frac12 \|Y - X\beta\|^2_2.$ <p>Taking the derivative w.r.t. $\beta$, we have</p> $\frac{\partial \ell}{\partial \beta} = -X^\top (Y - X\beta).$ <p>Setting to zero, we have $X^\top Y = X^\top X\beta$, which implies</p> $\widehat{\beta} = (X^\top X)^{-1} X^\top Y.$ <h2 id="mle-interpretation-of-ols">MLE interpretation of OLS</h2> <p>Note that an equivalent solution can be found by maximizing the likelihood of the model</p> $Y \sim \mathcal{N}(X\beta, \sigma^2 I_n).$ <p>Consider maximizing the log likelihood:</p> $\text{arg}\max_\beta \log p(Y | X, \beta) = \text{arg}\max_\beta -\frac{n}{2} \log 2\pi -\frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} (Y - X\beta)^\top (Y - X\beta).$ <p>Ignoring terms that are constant in $\beta$, this is equivalent to minimizing</p> $\frac{1}{2\sigma^2} (Y^\top Y - Y^\top X\beta - \beta^\top X^\top Y + \beta^\top X^\top X\beta).$ <p>Taking a derivative w.r.t.
$\beta$, we have</p> <p>\begin{align} &amp;\frac{\partial \ell}{\partial \beta} = 0 = \frac{1}{2\sigma^2} (- X^\top Y - X^\top Y + 2 X^\top X \beta) \\ \implies&amp; X^\top X \beta = X^\top Y \\ \implies&amp; \widehat{\beta}_{\text{MLE}} = (X^\top X)^{-1} X^\top Y \end{align}</p> <h2 id="ridge-regression">Ridge regression</h2> <p>Consider instead minimizing the sum of squares with an additional $\ell_2$ penalty on $\beta$:</p> $\min_\beta \frac12 \|Y - X\beta\|^2_2 + \frac{\lambda}{2} \|\beta\|_2^2.$ <p>Taking a derivative w.r.t. $\beta$,</p> $-X^\top (Y - X\beta) + \lambda \beta = 0.$ <p>This implies that $X^\top X\beta + \lambda \beta = X^\top Y$, and thus $(X^\top X + \lambda I_p) \beta = X^\top Y$. The solution can easily be seen as</p> $\widehat{\beta}_{\text{ridge}} = (X^\top X + \lambda I_p)^{-1} X^\top Y.$ <p>In other words, we add a small constant value $\lambda$ to the diagonal of the sample covariance $X^\top X$ before inverting it.</p> <p>To see this another way, consider the SVD of $X$,</p> $X = UDV^\top.$ <p>Plugging this into the ridge regression solution, we have</p> <p>\begin{align} \widehat{\beta} &amp;= (VDU^\top UDV^\top + \lambda I_p)^{-1} VDU^\top Y \\ &amp;= (VD^2V^\top + \lambda I_p)^{-1} VDU^\top Y \\ &amp;= V(D^2 + \lambda I_p)^{-1} DU^\top Y. \end{align}</p> <p>The fitted values $X \widehat{\beta}$ are then</p> <p>\begin{align} X \widehat{\beta} &amp;= UDV^\top V(D^2 + \lambda I_p)^{-1} DU^\top Y \\ &amp;= UD (D^2 + \lambda I_p)^{-1} DU^\top Y \\ &amp;= \sum\limits_{j=1}^p u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^\top Y \end{align}</p> <p>The “shrinkage factor” for feature $j$ is $\frac{d_j^2}{d_j^2 + \lambda}$. For mean-centered data, this shrinks the component of $Y$ along each singular direction $u_j$ toward zero. In effect, this means that the fitted values $\widehat{Y}$ are shrunk toward zero.
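</p> <p>These shrinkage factors are easy to compute directly from the singular values; a small sketch, using a hypothetical random design matrix:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 4))
d = np.linalg.svd(X, compute_uv=False)  # singular values of X

for lam in [0.0, 1.0, 10.0, 100.0]:
    shrinkage = d**2 / (d**2 + lam)
    print(lam, np.round(shrinkage, 3))

# lam = 0 gives factors of exactly 1 (no shrinkage);
# larger lam pushes every factor toward 0.
```

<p>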
Notice that if $\lambda=0$, this reduces to vanilla least squares with $X \widehat{\beta} = UU^\top Y$. This is the projection of $Y$ onto the column space of $X$.</p> <p>We can visualize this. Below is a scatter plot with $p=1$ and $n=20$ points where the true value $\beta=1.5$. As we plot the OLS and ridge estimates, we can see that the ridge coefficient rotates toward zero as $\lambda$ increases.</p> <p><img src="/assets/ols_ridge_lines.png" alt="ols_ridge_lines" /></p> <p>Furthermore, we can see that the shrinkage factor $\frac{d^2}{d^2 + \lambda}$ will decrease (causing more shrinkage) at the rate $\frac{c}{c + \lambda}$, where $c = d_j^2$ is the $j$th eigenvalue of $X^\top X$:</p> <p><img src="/assets/shrinkage_factors1.png" alt="shrinkage_factors1" /></p> <p>In higher dimensions, each covariate will have a different rate of shrinkage (determined by its corresponding singular value).</p> <h2 id="mle-interpretation-of-ridge-regression">MLE interpretation of ridge regression</h2> <p>Notice that we can arrive at an equivalent solution to ridge regression assuming the following model, where we place a prior on $\beta$</p> $Y \sim \mathcal{N}(X \beta, \sigma^2 I), ~~~ \beta \sim \mathcal{N}(0, \frac{1}{\lambda} I).$ <p>The posterior is</p> $p(\beta | X, Y) = \frac1Z p(Y | X, \beta) p(\beta)$ <p>where $Z$ is the normalizing constant. We can find a MAP solution by maximizing the quantity without the constant, which is proportional to the posterior.</p> <p>\begin{align} &amp;\max_\beta \log p(Y | X, \beta) + \log p(\beta) \\ &amp;= \max_\beta -\frac{n}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^2} (Y - X\beta)^\top (Y - X\beta) - \frac{p}{2} \log \frac{2\pi}{\lambda} - \frac{\lambda}{2} \|\beta\|_2^2. \end{align}</p> <p>Ignoring constant terms and taking a derivative w.r.t.
$\beta$, we have</p> <p>\begin{align} &amp;-\frac{1}{2\sigma^2} \left( -X^\top Y - X^\top Y + 2 X^\top X \beta\right) -\frac{\lambda}{2} 2\beta = 0 \\ \implies&amp; X^\top X \beta + \sigma^2 \lambda \beta = X^\top Y \\ \implies&amp; \widehat{\beta} = \left( X^\top X + \sigma^2 \lambda I\right)^{-1} X^\top Y \end{align}</p> <h2 id="the-ridge-solution-to-collinearity">The ridge solution to collinearity</h2> <p>Suppose our data lives in $\mathbb{R}^2$, that is, $X \in \mathbb{R}^{n \times 2}$. Further, suppose the two columns of $X$ are identical. If we then perform linear regression with response $Y$, the problem is under-constrained: there are an infinite number of equally good solutions. To see this, consider an SVD of $X = UDV^\top$, and notice that the least squares fitted values can be written as</p> $X \widehat{\beta} = UU^\top Y = \sum\limits_{j=1}^p u_j u_j^\top Y = u_1 u_1^\top Y + u_2 u_2^\top Y.$ <p>Since the columns of $X$ are equal, we know that $u_1 = u_2$. In this case, we can arbitrarily reweight the two terms to get an infinite number of equivalent solutions:</p> $X \widehat{\beta} = \gamma u_1 u_1^\top Y + (2 - \gamma) u_2 u_2^\top Y, ~~ \forall \gamma\in \mathbb{R}.$ <p>Ridge regression alleviates this issue by adding a small quantity to the diagonal of $X^\top X$ to make the solution unique:</p> $X \widehat{\beta}_{\text{ridge}} = u_1 \frac{d_1^2}{d_1^2 + \lambda} u_1^\top Y + u_2 \frac{d_2^2}{d_2^2 + \lambda} u_2^\top Y.$Andy Jonesaj13@princeton.eduA brief review of shrinkage in ridge regression and a comparison to OLS.Binomial model for options pricing2020-12-06T00:00:00-08:002020-12-06T00:00:00-08:00https://andrewcharlesjones.github.io/posts/2020/12/binomial-model-options<p>The binomial model is a simple method for determining the prices of options.</p> <h2 id="basic-binomial-model-assumptions">Basic binomial model assumptions</h2> <p>The binomial model makes a few simplifying assumptions (here, we’ll assume the underlying asset is a stock):</p> <ul> <li>In a given time interval, a stock price can only make two types of
moves: up or down. Furthermore, each of these moves is by a fixed amount.</li> <li>All time intervals are discretized.</li> </ul> <p>While these assumptions are fairly unrealistic, the binomial model is a discrete-time approximation to other, more interesting models, so it’s a good starting point for understanding more complex models. (Note that below we sometimes drop the $ sign to avoid notational clutter.)</p> <h2 id="starting-example">Starting example</h2> <p>Consider a call option $V$ for a stock $S$ that is currently worth 100. Recall that the earnings from a call option will be positive for increases in the underlying’s value, but will be 0 for a decrease in the underlying stock. The earnings from a call option can be visualized in the plot below:</p> <p><img src="/assets/call_option.png" alt="call_option" /></p> <p>If the stock goes up by 1 tomorrow, the option is worth 1. If the stock goes down by 1 tomorrow, the option is worth 0.</p> <p>The main question is: how much is the option worth today?</p> <p>Without knowing anything else about the situation or the market, it seems like the answer will depend on the probability $p$ that the stock will go up (and equivalently the probability $1-p$ that it will go down). Indeed, the expected value of the option is $$\mathbb{E}[V] = px,$$ where $x$ is the amount the stock could go up tomorrow. In this case $x=1$, so $\mathbb{E}[V] = p$.</p> <p>However, due to the opportunity for investors to hedge their bets, this reasoning is faulty.</p> <p>Consider the case when $p=0.2$, and the option costs $0.20$. Suppose an investor buys the call option $V$ and simultaneously takes a short position in $\frac12$ of the stock $S$, which costs $\frac12(100)=50$ in this case. Then this portfolio $P$ is worth $$P = \underbrace{0.2}_{\text{option}} - \underbrace{50}_{\text{short}}=-49.8.$$ How much will the portfolio be worth tomorrow?
There are two scenarios:</p> <p>\begin{align} &amp;\text{$S$ increases by 1} \implies P = 1-\frac12(101)=-49.50. \\ &amp;\text{$S$ decreases by 1} \implies P = 0-\frac12(99)=-49.50. \end{align}</p> <p>In either case, the portfolio will be worth $-49.50$. If the investor were to buy back the short position tomorrow, he or she would have gained $0.30$ without assuming any risk at all. This is an arbitrage opportunity.</p> <p>Alternatively, if the option costs $0.50$ initially, then the initial portfolio is worth $-49.50$, and there is no opportunity for riskless profit.</p> <h2 id="interest-rates">Interest rates</h2> <p>In practice, there is another, simpler way to make a risk-free profit: through the risk-free interest rate (usually approximated by bonds). The return on these bonds is the interest rate. Thus, we should factor this opportunity into the calculation of the option price.</p> <p>Denote the interest rate as $r$. (For simplicity, we’ll assume $r$ is the daily return.) If we currently own $50$ in cash, then by buying bonds, we could have a portfolio worth $50(1+r)$ tomorrow without assuming any risk. Thus, we should discount tomorrow’s portfolio value by $\frac{1}{1+r}$ to account for this.</p> <p>In the example above, this would mean</p> <p>\begin{align} &amp;V - \frac12(100) = -49.5\left(\frac{1}{1+r}\right) \\ \implies&amp; (1+r)(V-50) = -49.5 \\ \implies&amp; V-50+rV-50r=-49.5 \\ \implies&amp; V=\frac{0.5+50r}{1+r} \end{align}</p> <p>As an example, consider when $r=10^{-4}$. Plugging into the above, this implies that $V \approx 0.504950$.
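This pricing logic is easy to check numerically. Below is a minimal sketch of the example above (stock at 100 moving to 101 or 99, option paying 1 or 0); the interest rates passed in are illustrative:

```python
# One-step binomial pricing via replication: hedge the option by shorting
# Delta shares so that the portfolio V - Delta*S is riskless.
S, S_up, S_down = 100.0, 101.0, 99.0
V_up, V_down = 1.0, 0.0

delta = (V_up - V_down) / (S_up - S_down)  # hedge ratio: 1/2 here

def option_price(r):
    """Fair price implied by no-arbitrage with daily interest rate r."""
    # The hedged portfolio has the same value tomorrow in both scenarios,
    # so today's value is tomorrow's value discounted by 1/(1+r).
    P_tomorrow = V_up - delta * S_up  # = V_down - delta * S_down
    return P_tomorrow / (1 + r) + delta * S

print(option_price(0.0))   # 0.5, the no-interest fair price
print(option_price(1e-4))  # approximately 0.50495
```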
Intuitively it makes sense that the option should cost slightly more than in the no-interest case, because the discounted value of tomorrow’s portfolio is $$-49.5\left(\frac{1}{1+10^{-4}}\right) \approx -49.495,$$ which corresponds to a gain of $0.005$.</p> <p>By simply working with bonds, our portfolio value would have been: $$50(1+10^{-4}) = 50.005$$ for an equal gain of $0.005$.</p> <h2 id="more-general-form">More general form</h2> <p>Suppose the current time is $t$ and we’re considering the price of an option that expires at the next time step $t + \delta t$. The current stock price is $S$. There are two scenarios for the next time step:</p> <ul> <li>The stock price rises to $uS$, and the option price rises to $V^+$, making the portfolio worth $V^+ - \Delta uS$.</li> <li>The stock price falls to $vS$, and the option price falls to $V^-$, making the portfolio worth $V^- - \Delta vS$.</li> </ul> <p>To figure out how much of the stock to short (represented by $\Delta$ here), we must hedge so that these two possible portfolios have equal value. \begin{align} &amp;V^+ - \Delta uS = V^- - \Delta vS \\ \implies&amp; \Delta = \frac{V^+ - V^-}{uS - vS}. \end{align} We can think of this quantity as a discrete approximation to “Delta”, or the sensitivity of the option to the change in the underlying stock price, $$\frac{V^+ - V^-}{uS - vS} \to \frac{\partial V}{\partial S} ~~~\text{as}~~~ \delta t \to 0.$$</p> <p>The portfolio’s value at $t+\delta t$ then has two equivalent forms: \begin{align} P_{t + \delta t} &amp;= V^+ - u \frac{V^+ - V^-}{u - v} \\ P_{t + \delta t} &amp;= V^- - v \frac{V^+ - V^-}{u - v}. \end{align} To account for nonzero interest rates, this portfolio value must also be equal to the amount that could be earned just through the risk-free interest rate.
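The hedging choice above can be verified numerically: with $\Delta$ chosen as derived, the portfolio value at $t + \delta t$ is the same in both scenarios. A small sketch (the up/down factors and option values are illustrative):

```python
# General one-step hedge: choose Delta so V - Delta*S is scenario-independent.
S = 100.0
u, v = 1.02, 0.98          # illustrative up/down factors
V_up, V_down = 3.0, 1.0    # illustrative option values in the two scenarios

delta = (V_up - V_down) / (u * S - v * S)

P_up = V_up - delta * u * S
P_down = V_down - delta * v * S
print(P_up, P_down)  # equal: the hedged portfolio is riskless
```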
Recall that if the interest rate is $r$ and the current value of the portfolio is $P$, then the value of the portfolio that just earns based on the interest rate at $t + \delta t$ is $$P_{t + \delta t} = P + Pr\delta t = P(1 + r \delta t)$$ where $P = V-\Delta S$ is the original value of the portfolio.</p> <p>Setting this value equal to the portfolio under the option investment, we have \begin{align} &amp;P(1 + r \delta t) = V^+ - u \frac{V^+ - V^-}{u - v} \\ \implies&amp; (V-\Delta S) (1 + r \delta t) = V^+ - u \frac{V^+ - V^-}{u - v} \\ \implies&amp; \left(V-\left(\frac{V^+ - V^-}{uS - vS}\right) S\right) (1 + r \delta t) = V^+ - u \frac{V^+ - V^-}{u - v} \\ \implies&amp; V(1 + r \delta t) - \left(\frac{V^+ - V^-}{u - v}\right) (1 + r \delta t) = V^+ - u \frac{V^+ - V^-}{u - v} \\ \implies&amp; V(1 + r \delta t) = \left(\frac{V^+ - V^-}{u - v}\right) (1 + r \delta t) + V^+ - u \frac{V^+ - V^-}{u - v} \\ \implies&amp; V(1 + r \delta t) = \left(\frac{V^+ - V^-}{u - v}\right) (1 + r \delta t) + \frac{u V^+ - v V^+}{u-v} - \frac{uV^+ - uV^-}{u - v} \\ \implies&amp; V(1 + r \delta t) = \left(\frac{V^+ - V^-}{u - v}\right) (1 + r \delta t) + \frac{uV^- - vV^+}{u - v} \\ \end{align}</p> <p>Suppose we choose to model the stock’s behavior as a random walk, where $$S_{t+\delta t} \sim \mathcal{N}(S_t + \mu \delta t, \sigma^2 S_t^2 \delta t)$$</p> <p>We can then choose</p> <p>\begin{align} u &amp;= 1 + \sigma \sqrt{\delta t} \\ v &amp;= 1 - \sigma \sqrt{\delta t} \\ p &amp;= \frac12 + \frac{\mu \sqrt{\delta t}}{2\sigma} \end{align}</p> <p>Plugging these values into the equation for the option price, we have \begin{align} &amp;V(1 + r \delta t) = \left(\frac{V^+ - V^-}{(1 + \sigma \sqrt{\delta t}) - (1 - \sigma \sqrt{\delta t})}\right) (1 + r \delta t) + \frac{(1 + \sigma \sqrt{\delta t})V^- - (1 - \sigma \sqrt{\delta t})V^+}{(1 + \sigma \sqrt{\delta t}) - (1 - \sigma \sqrt{\delta t})} \\ \implies&amp; V(1 + r \delta t) = \left(\frac{V^+ - V^-}{2\sigma \sqrt{\delta 
t}}\right) (1 + r \delta t) + \frac{V^- + \sigma \sqrt{\delta t} V^- - V^+ + \sigma \sqrt{\delta t} V^+}{2\sigma \sqrt{\delta t}} \\ \implies&amp; V(1 + r \delta t) = V^+\left( \frac{1}{2\sigma \sqrt{\delta t}} + \frac{r \sqrt{\delta t}}{2 \sigma} - \frac{1}{2 \sigma \sqrt{\delta t}} + \frac12 \right) + V^- \left( -\frac{1}{2\sigma \sqrt{\delta t}} - \frac{r\sqrt{\delta t}}{2\sigma} + \frac{1}{2\sigma \sqrt{\delta t}} + \frac12 \right) \\ \implies&amp; V(1 + r \delta t) = V^+ \underbrace{\left( \frac{r \sqrt{\delta t}}{2 \sigma} + \frac12 \right)}_{p} + V^- \underbrace{\left( - \frac{r\sqrt{\delta t}}{2\sigma} + \frac12 \right)}_{1-p} \end{align} The quantities $p$ and $1-p$ labeled with brackets above can be seen as the “risk-neutral probabilities”. In simple terms, these are the “probabilities” under which, if the stock obeyed them, the portfolio would have equal value at step $t+\delta t$ regardless of the direction of movement.</p> <h2 id="references">References</h2> <ul> <li>Wilmott, Paul. Paul Wilmott on quantitative finance. John Wiley &amp; Sons, 2013.</li> </ul>Andy Jonesaj13@princeton.eduThe binomial model is a simple method for determining the prices of options.BFGS2020-11-27T00:00:00-08:002020-11-27T00:00:00-08:00https://andrewcharlesjones.github.io/posts/2020/11/bfgs<p>BFGS is a second-order optimization method – a close relative of Newton’s method – that approximates the Hessian of the objective function.</p> <p>Throughout this post, we let $f(x)$ be the objective function which we seek to minimize.</p> <h2 id="newtons-method">Newton’s method</h2> <p>Newton’s method is one of the most fundamental second-order methods in optimization. The key idea is to form a second-order approximation of $f(x)$ at the current point, find the minimum of this approximation, and repeat.</p> <p>Specifically, suppose we’d like to minimize $f(x)$ starting at $x_0$. Let $x \in \mathbb{R}$ be one-dimensional for simplicity for now. 
If we take a linear Taylor expansion of $f^\prime(x)$, we have</p> $f^\prime(x) \approx f^\prime(x_0) + f^{\prime\prime}(x_0)(x - x_0).$ <p>Setting this equal to zero, we have \begin{align} &amp;f^\prime(x_0) + f^{\prime\prime}(x_0) x - f^{\prime\prime}(x_0) x_0 = 0 \\ \implies&amp; x^\star = x_0 - \frac{f^{\prime}(x_0)}{f^{\prime\prime}(x_0)} \end{align}</p> <p>This update is known as Newton’s method.</p> <p>If $x \in \mathbb{R}^p$ and $p&gt;1$, Newton’s method requires the gradient and Hessian:</p> $x^\star = x_0 - [\nabla^2 f(x_0)]^{-1} \nabla f(x_0).$ <p>Note that the size of the Hessian is $p \times p$ in this case. For high-dimensional optimization problems, storing these matrices may become difficult. Furthermore, finding the inverse Hessian could be difficult or computationally expensive. Approximating the Hessian (or its inverse) can yield great computational boosts without much loss in accuracy. One method that uses such a trick is BFGS.</p> <h2 id="bfgs">BFGS</h2> <p>Consider again the scenario in which we are minimizing $f(x)$ where $x \in \mathbb{R}^p$. We are iteratively updating $x_k, k = 1, \dots, T$ where $T$ is determined by some convergence criterion.</p> <p>Suppose we use a quadratic approximation to $f$ at each iteration. Denote this approximation at step $k$ as $\hat{f}_k(x)$. Specifically,</p> $\hat{f}_k(x) = f(x_k) + [\nabla f(x_k)]^\top (x - x_k) + \frac12 (x - x_k)^\top \nabla^2 f(x_k) (x - x_k).$ <p>Now, instead of directly computing the Hessian $\nabla^2 f(x_k)$, let’s approximate it. Call this approximation $B_k$. Various choices for $B_k$ form a family of methods called “quasi-Newton methods”.
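Before specializing to BFGS, the plain one-dimensional Newton update from earlier can be sketched in code (the objective $f(x) = x^4$ here is illustrative; its minimizer is zero):

```python
# Newton's method in one dimension: x <- x - f'(x) / f''(x).
def newton(fprime, fdoubleprime, x0, n_iter=50):
    x = x0
    for _ in range(n_iter):
        x = x - fprime(x) / fdoubleprime(x)
    return x

# Minimize f(x) = x^4: f'(x) = 4x^3, f''(x) = 12x^2, so each step is x <- (2/3)x.
x_min = newton(lambda x: 4 * x**3, lambda x: 12 * x**2, x0=4.0)
print(x_min)  # close to 0, the minimizer
```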
Here, we review one of the most popular approximations, which leads to the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm.</p> <p>The BFGS update rule is the same as Newton’s method, simply replacing the Hessian with its approximation:</p> $x_{k+1} = x_k - B_k^{-1} \nabla f(x_k).$ <p>Suppose we are currently on step $k$, and we have just generated the next iterate $x_{k+1}$. Our goal is now to find a good $B_{k+1}$.</p> <p>BFGS constrains $B_{k+1}$ such that the gradients of $\hat{f}_{k+1}$ are equal to the true gradients of $f$ at the most recent two points: $x_k$ and $x_{k+1}$. Note that the gradient of $\hat{f}_{k+1}$ is</p> $\nabla \hat{f}_{k+1} = \nabla f(x_{k+1}) + B_{k+1} (x - x_{k+1}).$ <p>Plugging in $x_{k+1}$, we can immediately see that the second condition is met:</p> $\nabla \hat{f}_{k+1} = \nabla f(x_{k+1}) + B_{k+1} (x_{k+1} - x_{k+1}) = \nabla f(x_{k+1}).$ <p>For the first condition we have</p> <p>\begin{align} &amp;\nabla \hat{f}_{k+1} = \nabla f(x_{k+1}) + B_{k+1} (x_k - x_{k+1}) = \nabla f(x_k) \\ \implies&amp; B_{k+1} (x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k) \end{align}</p> <p>Denoting $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$, this simplifies to</p> $B_{k+1} s_k = y_k.$ <p>To make $B_{k+1}$ positive definite, we must have that</p> $s_k^\top B_{k+1} s_k = s_k^\top y_k &gt; 0.$ <p>While the closely-related <a href="https://www.wikiwand.com/en/Davidon%E2%80%93Fletcher%E2%80%93Powell_formula">DFP algorithm</a> parameterizes the problem in terms of the approximate Hessian $B_k$, BFGS parameterizes it in terms of the inverse Hessian $H_k := B_k^{-1}$. The constraints then become:</p> $H_{k+1} y_k = s_k \;\;\text{ and }\;\; H_{k+1} = H_{k+1}^\top.$ <p>We further specify $H_{k+1}$ by making it as close to $H_k$ as possible.</p> $H_{k+1} = \text{arg}\min_H \|H - H_k\| \;\;\; \text{ s.t.
} H = H^\top, \;\; Hy_k = s_k.$ <p>The solution is then given by</p> $H_{k+1} = (I - \frac{1}{y_k^\top s_k} s_k y_k^\top) H_k (I - \frac{1}{y_k^\top s_k} y_k s_k^\top) + \frac{1}{y_k^\top s_k} s_k s_k^\top.$ <p>This is the BFGS update rule. There are a couple important properties to notice about it:</p> <ol> <li>The inverse Hessian at step $k+1$ depends on the inverse Hessian at step $k$. This is unlike the traditional Newton’s method, which computes the inverse Hessian “from scratch” at each iteration.</li> <li>The update only depends on the previous inverse Hessian and the vectors $s_k$ and $y_k$. Furthermore, we only have to perform matrix multiplications and outer products (no inverses), so this update will be $\mathcal{O}(p^2)$, where $p$ is the dimension of $x$.</li> </ol> <h2 id="simple-example">Simple example</h2> <p>To further build intuition, notice that when $p=1$, this update reduces to</p> $H_{k+1} = \frac{s_k}{y_k} = \frac{x_{k+1} - x_k}{f^\prime(x_{k+1}) - f^\prime(x_{k})} = \left[ \frac{f^\prime(x_{k+1}) - f^\prime(x_{k})}{x_{k+1} - x_k} \right]^{-1}.$ <p>This is simply a linear approximation to the (reciprocal) second derivative.</p> <p>Suppose $f(x) = x^4$ is our objective function. Further, suppose $x_k = 4$ and $x_{k+1} = 2$. Then we can visualize the BFGS method by seeing that the approximation to the second derivative will just be the slope of a linear interpolation between the values of $f^\prime(x)$ at these two points. In this case, the computation is extremely simple:</p> $f^{\prime\prime}_{k+1} \approx \frac{f^\prime(4) - f^\prime(2)}{4 - 2}.$ <p>Here’s a plot of how this looks:</p> <p><img src="/assets/bfgs_approx.png" alt="bfgs_approx" /></p> <p>Notice that if $x_{k+1}$ and $x_k$ are extremely close to each other, the approximation will improve. 
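Putting the pieces together, here is a minimal BFGS sketch using the inverse-Hessian update above. A simple backtracking (Armijo) line search is added for stability, and the quadratic test problem is illustrative:

```python
import numpy as np

def bfgs(f, grad, x0, n_iter=100, tol=1e-10):
    """Minimize f using the BFGS inverse-Hessian update."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                 # initial inverse-Hessian approximation
    I = np.eye(x.size)
    g = grad(x)
    for _ in range(n_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                     # quasi-Newton search direction
        alpha = 1.0                    # backtracking (Armijo) line search
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:                 # curvature condition s^T y > 0
            rho = 1.0 / sy
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Quadratic test problem: f(x) = 0.5 x^T A x - b^T x, minimized at A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = bfgs(f, grad, np.zeros(2))
print(x_star)  # close to np.linalg.solve(A, b)
```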
In fact, in the limit, this reduces to the definition of a derivative:</p> $f^{\prime\prime}(x) = \lim_{\epsilon \to 0} \frac{f^\prime(x + \epsilon) - f^\prime(x)}{\epsilon}.$ <h2 id="references">References</h2> <ul> <li>Nocedal, Jorge, and Stephen Wright. Numerical optimization. Springer Science &amp; Business Media, 2006.</li> <li><a href="https://www.wikiwand.com/en/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm">Wikipedia page on BFGS</a></li> </ul>Andy Jonesaj13@princeton.eduBFGS is a second-order optimization method – a close relative of Newton’s method – that approximates the Hessian of the objective function.Tweedie distributions2020-11-21T00:00:00-08:002020-11-21T00:00:00-08:00https://andrewcharlesjones.github.io/posts/2020/11/tweedie<p>Tweedie distributions are a very general family of distributions that includes the Gaussian, Poisson, and Gamma (among many others) as special cases.</p> <h2 id="exponential-dispersion-models">Exponential dispersion models</h2> <p>Exponential dispersion models (EDMs) have the following form:</p> $f(x; \mu, \sigma^2) = h(\sigma^2, x) \exp\left(\frac{\theta x - A(\theta)}{\sigma^2}\right)$ <p>where $h(\sigma^2, x)$ is the “base distribution”, $\theta(\mu, \sigma^2)$ is a combination/function of the parameters, and $A(\theta)$ is the normalization quantity.</p> <p>Notice that if we treat $\sigma^2$ as constant, this reduces to the natural exponential family, which has the form $$f(x; \mu) = h(x) \exp\left(\widetilde{\theta} x - A(\widetilde{\theta})\right).$$ In this way, we can view EDMs as a generalization of the exponential family that allow for varying dispersion (hence the name).</p> <p>The mean and variance of an EDM-distributed random variable $X$ are \begin{align} \mathbb{E}[X] &amp;= \mu = A^\prime(\theta) \\ \mathbb{V}[X] &amp;= \sigma^2 A^{\prime \prime}(\theta) = \sigma^2 V(\mu) \end{align} where $V(\mu)$ is called the “variance function”.</p> <h2 id="tweedie">Tweedie</h2> <p>Tweedie families 
are a special case of the EDMs discussed above. Specifically, Tweedie distributions make an assumption about the relationship between the mean and the variance of the distribution. To specify a Tweedie distribution, another parameter $p \in \mathbb{R}$ is introduced, and we restrict the variance as: $$\mathbb{V}[X] = \sigma^2 V(\mu) = \sigma^2 \mu^p.$$</p> <p>If $X$ is Tweedie-distributed with power parameter $p$, we write $$X \sim \text{Tw}_p(\mu, \sigma^2).$$</p> <p>Given a value of $p$, writing down the pdf requires finding the proper base measure $h(\sigma^2, x)$ such that the density normalizes to 1 properly. In general, this is difficult for the Tweedie family, but we show a few special cases below.</p> <h2 id="special-cases">Special cases</h2> <h3 id="gaussian-p--0">Gaussian ($p = 0$)</h3> <p>Let $X \sim \text{Tw}_{p}(\mu, \sigma^2)$ where $p=0$. Then the variance is given by $\mathbb{V}[X] = \sigma^2$. Let $\theta = \mu$ and $A(\theta) = \frac12 \theta^2 = \frac12 \mu^2$, with base measure $$h(\sigma^2, x) = \frac{\exp(-\frac{1}{2\sigma^2} x^2)}{\sqrt{2 \pi \sigma^2}}.$$ Notice that $A^{\prime\prime}(\theta) = 1$, which implies that $$\mathbb{V}[X] = \sigma^2 A^{\prime\prime}(\theta) = \sigma^2 \cdot 1 = \sigma^2 \cdot \mu^0$$ so we have satisfied the mean-variance relationship.</p> <p>Then we have \begin{align} f(x; \mu, \sigma^2) &amp;= \frac{\exp(-\frac{1}{2\sigma^2} x^2)}{\sqrt{2 \pi \sigma^2}} \exp\left( \frac{\mu x - \frac12 \mu^2}{\sigma^2} \right) \\ &amp;= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x^2 - 2\mu x + \mu^2) \right) \\ &amp;= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right) \\ \end{align} which is the density for a Gaussian random variable with mean $\mu$ and variance $\sigma^2$.</p> <h3 id="poisson-p1-sigma21">Poisson ($p=1, \sigma^2=1$)</h3> <p>Let
$X \sim \text{Tw}_{p}(\mu, \sigma^2)$ where $p=1$ and we set $\sigma^2 = 1$ to be a constant. Then the variance is given by $\mathbb{V}[X] = \sigma^2 \mu^1 = \mu$, implying that the mean and variance are equal.</p> <p>Since $\sigma^2 = 1$, the density’s general form reduces to $$f(x; \mu) = h(x) \exp\left(\theta x - A(\theta)\right).$$</p> <p>Let $\theta = \log \mu$, $h(x) = \frac{1}{x!}$, $A(\theta) = e^\theta$. Then we have \begin{align} f(x; \mu) &amp;= \frac{1}{x!} \exp\left(x \log \mu - e^\theta \right) \\ &amp;= \frac{1}{x!} \exp(\log \mu^x) \exp(-e^{\log \mu}) \\ &amp;= \frac{1}{x!} \mu^x e^{-\mu} \end{align} which is the density of a Poisson-distributed random variable with rate parameter $\mu$.</p> <p>Notice that $$\mathbb{V}[X] = \sigma^2 A^{\prime\prime}(\theta) = \sigma^2 e^\theta = \sigma^2 e^{\log \mu} = \sigma^2 \mu = \sigma^2 \cdot \mu^1 = \sigma^2 \cdot \mu^p$$ so we have satisfied the mean-variance relationship (in the case of the Poisson, they’re identical).</p> <h3 id="gamma-p2">Gamma ($p=2$)</h3> <p>Let $X \sim \text{Tw}_{p}(\mu, \sigma^2)$ where $p=2$. The variance of $X$ is then $\mathbb{V}[X] = \sigma^2 \mu^2$.</p> <p>Let $\theta = -\frac{1}{\mu}$, $A(\theta) = -\log (-\theta) = \log \mu$, and $h(\sigma^2, x) = \frac{x^{1/\sigma^2 - 1}}{\Gamma(1/\sigma^2) (\sigma^2)^{1/\sigma^2}}$.
Then the density is \begin{align} f(x; \mu, \sigma^2) &amp;= h(\sigma^2, x) \exp\left(\frac{\theta x - A(\theta)}{\sigma^2}\right) \\ &amp;= \frac{x^{1/\sigma^2 - 1}}{\Gamma(1/\sigma^2) (\sigma^2)^{1/\sigma^2}} \exp\left( \frac{-x/\mu - \log \mu}{\sigma^2} \right) \\ &amp;= \frac{x^{1/\sigma^2 - 1}}{\Gamma(1/\sigma^2) (\sigma^2)^{1/\sigma^2}} e^{-x/(\mu \sigma^2)} \mu^{-1/\sigma^2} \\ &amp;= \frac{x^{1/\sigma^2 - 1}}{\Gamma(1/\sigma^2)} (\mu \sigma^2)^{-1/\sigma^2} e^{-x/(\mu \sigma^2)} \\ \end{align} which is the density of a Gamma-distributed random variable with shape $1/\sigma^2$ and rate $1/(\mu \sigma^2)$ — that is, with mean $\mu$ and variance $\sigma^2 \mu^2$, as required.</p> <h2 id="references">References</h2> <ul> <li>Wikipedia pages on <a href="https://www.wikiwand.com/en/Tweedie_distribution">Tweedie distributions</a> and <a href="https://www.wikiwand.com/en/Exponential_dispersion_model">exponential dispersion models</a></li> <li>Seth David Temple’s <a href="https://math.uoregon.edu/wp-content/uploads/2018/07/TempleStempleTweedieThesis.pdf">thesis, The Tweedie Index Parameter and Its Estimator</a></li> <li>Bonat, Wagner H., et al.
“Extended Poisson–Tweedie: Properties and regression models for count data.” Statistical Modelling 18.1 (2018): 24-49.</li> </ul>Andy Jonesaj13@princeton.eduTweedie distributions are a very general family of distributions that includes the Gaussian, Poisson, and Gamma (among many others) as special cases.Scale mixtures of normals2020-11-15T00:00:00-08:002020-11-15T00:00:00-08:00https://andrewcharlesjones.github.io/posts/2020/11/scale-mixtures<p>Here, we discuss two distributions which arise as scale mixtures of normals: the Laplace and the Student-$t$.</p> <h2 id="introduction">Introduction</h2> <p>Mixture models are typically first introduced in the context of discrete mixtures. For example, the Gaussian mixture model (GMM) is often the canonical mixture model. In particular, if we assume $x$ follows a GMM with $K$ mixture components, we can write the model as follows. \begin{align} p(x) = \sum\limits_{k=1}^K \pi_k \mathcal{N}(x; \mu_k, \sigma_k) \\ \end{align} where $\pi_1, \dots, \pi_K$ are the mixture weights that must satisfy $\sum_k \pi_k = 1$.</p> <p>However, we can also consider continuous mixtures. Consider, for example, a continuous mixture of Gaussians defined by the following hierarchical model: \begin{align} x &amp;\sim \mathcal{N}(\mu, \sigma^2) \\ \mu &amp;\sim \mathcal{N}(0, \tau^2) \\ \end{align} Now, we can write the marginal distribution of $x$: \begin{equation} p(x) = \int_{-\infty}^\infty p(x | \mu) p(\mu) d\mu \end{equation} We call this a continuous mixture of Gaussians. Notice that this mixture has a similar form to the discrete mixture above. 
In particular, $p(\mu)$ here plays a similar role as ${\pi_k}_{k=1}^K$ above as the “mixing distribution”.</p> <p>In the Gaussian setting, we can compute this marginal in closed form (see appendix for full derivation) to get: \begin{equation} x \sim \mathcal{N}(0, \tau^2 + \sigma^2) \end{equation} Intuitively, we can think of the Gaussian prior on $\mu$ as adding extra variability to $x$, as compared to a model with a fixed $\mu$.</p> <p>Here, the mixing distribution was specified for the mean parameter $\mu$, but we can also specify it for the variance $\sigma^2$, as we’ll see next.</p> <h2 id="scale-mixtures">Scale mixtures</h2> <p>Consider the following hierarchical model: \begin{align} x &amp;\sim \mathcal{N}(\mu, \sigma^2) \\ \sigma^2 &amp;\sim p(\sigma^2) \\ \end{align} Assuming a constant $\mu$ for now, the marginal distribution of $x$ is then \begin{equation} p(x) = \int_0^\infty p(x | \sigma^2) p(\sigma^2) d\sigma^2 \end{equation} Below, we consider two choices for $p(\sigma^2)$ and discuss the implications for the marginal density of $x$.</p> <h2 id="laplace">Laplace</h2> <p>Consider placing an exponential prior on $\sigma^2$ such that $\sigma^2 \sim \text{Exp}(2 \lambda^2)$. 
Recall that the PDF of the exponential distribution is \begin{equation} p(\sigma^2; \lambda) = 2 \lambda^2 \exp(-2 \lambda^2 \sigma^2) \end{equation} We can then compute the marginal of $x$ as follows:</p> <p>\begin{align} p(x) &amp;= \int_0^\infty p(x | \sigma^2) p(\sigma^2) d\sigma^2 \\ &amp;= \int_0^\infty \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{1}{2\sigma^2} x^2 \right) 2\lambda^2 \exp(-2 \lambda^2 \sigma^2) d\sigma^2 \\ &amp;= 2 \lambda^2 \int_0^\infty \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -2 \lambda^2 \sigma^2 - \frac{1}{2\sigma^2} x^2 \right) d\sigma^2 \\ &amp;= 2 \lambda^2 \int_0^\infty \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{2 \lambda^2}{\sigma^2} \left( (\sigma^2)^2 + \frac{1}{4 \lambda^2} x^2 \right) \right) d\sigma^2 \\ &amp;= 2 \lambda^2 \int_0^\infty \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{ 2\lambda^2}{\sigma^2} \left( (\sigma^2 - \frac{1}{2 \lambda} x)^2 + 2 \sigma^2 \frac{1}{2 \lambda} x \right) \right) d\sigma^2 \\ &amp;= 2 \lambda^2 \int_0^\infty \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{2\lambda^2}{\sigma^2} \left( (\sigma^2 - \frac{x}{2 \lambda})^2 \right) - 2 \lambda x \right) d\sigma^2 \\ &amp;= 2 \lambda^2 \exp(- 2\lambda x) \int_0^\infty \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{2\lambda^2 (x / 2\lambda)^2 }{\sigma^2 (x / 2\lambda)^2} \left( \sigma^2 - \frac{x}{2\lambda} \right)^2 \right) d\sigma^2 \\ &amp;= 2 \lambda^2 \exp(- 2\lambda x) \int_0^\infty \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{2\lambda^2 \frac{x^2}{4\lambda^2} }{\sigma^2 \frac{x^2}{4\lambda^2}} \left( \sigma^2 - \frac{x}{2\lambda} \right)^2 \right) d\sigma^2 \\ &amp;= 2 \lambda^2 \exp(- 2\lambda x) \int_0^\infty \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{x^2 }{2\sigma^2 (x / 2\lambda)^2} \left( \sigma^2 - \frac{x}{2\lambda} \right)^2 \right) d\sigma^2 \\ &amp;= 2 \lambda^2 \exp(- 2\lambda x) \frac{1}{x} \int_0^\infty \sigma^2 \sqrt{\frac{x^2}{ 2\pi (\sigma^2)^3}} \exp \left( -\frac{x^2 }{2\sigma^2 (x / 2\lambda)^2} \left(
\sigma^2 - \frac{x}{2\lambda} \right)^2 \right) d\sigma^2 \\ \end{align} Now let $\tau := x^2$ and $\mu = \frac{x}{2\lambda}$. \begin{align} p(x) = 2 \lambda^2 \exp(- 2\lambda x) \frac{1}{x} \int_0^\infty \sigma^2 \sqrt{\frac{\tau}{ 2\pi (\sigma^2)^3}} \exp \left( -\frac{\tau }{2\sigma^2 \mu^2} \left( \sigma^2 - \mu \right)^2 \right) d\sigma^2 \\ \end{align}</p> <p>Recognizing the integrand (past the first $\sigma^2$) as an inverse-Gaussian distribution with mean $\mu$ and scale parameter $\tau$, we can notice that the integral is an expectation of an inverse-Gaussian-distributed random variable $\sigma^2$. Thus, \begin{align} p(x) &amp;= 2 \lambda^2 \exp(- 2\lambda x) \frac{1}{x} \mathbb{E}[\sigma^2] \\ &amp;= 2 \lambda^2 \exp(- 2\lambda x) \frac{1}{x} \mu \\ &amp;= 2 \lambda^2 \exp(- 2\lambda x) \frac{1}{x} \frac{x}{2 \lambda} \\ &amp;= \lambda \exp(- 2\lambda x) \\ \end{align}</p> <p>Notice that this has the form of a Laplace distribution with scale parameter $b = \frac{1}{2\lambda}$. To be precise, I should have started using $|x|$ instead of $x$ starting in line $(14)$. Inserting that now, we have a real Laplace density: $$p(x) = \lambda \exp(- 2\lambda |x|).$$</p> <p>Laplace priors (or, equivalently, a scale mixture of normals with exponential mixing distribution) are often used as a Bayesian analogue to the LASSO.
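This mixture representation is easy to check by simulation: sampling $\sigma^2$ from the exponential mixing distribution and then $x \mid \sigma^2$ from a Gaussian should reproduce Laplace statistics. A Monte Carlo sketch ($\lambda$ and the sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 1.5, 200_000

# Hierarchical sampling: sigma^2 ~ Exp(rate = 2*lam^2), then x | sigma^2 ~ N(0, sigma^2).
sigma2 = rng.exponential(scale=1.0 / (2 * lam**2), size=n)
x = rng.normal(0.0, np.sqrt(sigma2))

# The marginal should be Laplace with density lam * exp(-2*lam*|x|),
# i.e. scale b = 1/(2*lam), for which E|x| = b.
print(np.mean(np.abs(x)), 1 / (2 * lam))  # the two should be close
```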
Intuitively, the correspondence with the LASSO can be seen in the log-likelihood, which will penalize large values of $|x|$.</p> <h2 id="student-t">Student-t</h2> <p>Consider the following hierarchical model: \begin{align} x &amp;\sim \mathcal{N}(0, \sigma^2) \\ \sigma^2 &amp;\sim \text{Inv-Gamma}(\nu / 2, \nu / 2) \\ \end{align} We can again find the marginal distribution of $x$: \begin{align} p(x) &amp;= \int_0^\infty p(x | \sigma^2) p(\sigma^2) d\sigma^2 \\ &amp;= \int_0^\infty \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{x^2}{2\sigma^2} \right) \frac{(\nu / 2)^{\nu / 2}}{\Gamma(\nu / 2)} (\sigma^2)^{-\nu / 2 - 1} \exp\left( -\frac{\nu / 2}{\sigma^2} \right) d\sigma^2 \\ &amp;= \frac{(\nu / 2)^{\nu / 2}}{\Gamma(\nu / 2) \sqrt{2 \pi}} \int_0^\infty \exp \left( -\frac{ x^2 + \nu}{2\sigma^2} \right) (\sigma^2)^{-\frac{\nu + 1}{2} - 1} d\sigma^2 \\ &amp;= \frac{(\nu / 2)^{\nu / 2}}{\Gamma(\nu / 2) \sqrt{2 \pi}} \Gamma\left(\frac{\nu + 1}{2}\right) \left(\frac{x^2 + \nu}{2}\right)^{-\frac{\nu + 1}{2}} \\ \end{align} With a little more algebra, this can be put in the form of a Student-t distribution with $\nu$ degrees of freedom.</p> <h2 id="references">References</h2> <ul> <li>Gelman, Andrew. “Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper).” Bayesian analysis 1.3 (2006): 515-534.</li> <li>Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. “Handling sparsity via the horseshoe.” Artificial Intelligence and Statistics.
2009.</li> <li>Kenneth Tay’s <a href="https://statisticaloddsandends.wordpress.com/2018/12/21/laplace-distribution-as-a-mixture-of-normals/">derivation of the Laplace distribution</a></li> <li>John Cook’s <a href="https://www.johndcook.com/t_normal_mixture.pdf">derivation of the Student-t</a></li> </ul> <h2 id="appendix">Appendix</h2> <h3 id="derivation-of-marginal-for-continuous-gaussian-mixture">Derivation of marginal for continuous Gaussian mixture</h3> <p>\begin{align} p(x) &amp;= \int_{-\infty}^\infty p(x | \mu) p(\mu) d\mu \\ &amp;= \int_{-\infty}^\infty \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right) \frac{1}{\tau \sqrt{2\pi}} \exp \left( -\frac{1}{2\tau^2} \mu^2 \right) d\mu \\ &amp;= \frac{1}{\sigma \tau 2\pi} \int_{-\infty}^\infty \exp \left( -\frac{1}{2\sigma^2} (x - \mu)^2 - \frac{1}{2\tau^2} \mu^2 \right) d\mu \\ &amp;= \frac{1}{\sigma \tau 2\pi} \int_{-\infty}^\infty \exp \left( -\frac{1}{2} \left(\frac{1}{\sigma^2} x^2 - \frac{1}{\sigma^2} 2x\mu + \frac{1}{\sigma^2} \mu^2 + \frac{1}{\tau^2} \mu^2\right) \right) d\mu \\ &amp;= \frac{1}{\sigma \tau 2\pi} \int_{-\infty}^\infty \exp \left( -\frac{1}{2} \left( \left(\frac{1}{\sigma^2} + \frac{1}{\tau^2}\right) \mu^2 - \frac{1}{\sigma^2} 2x\mu + \frac{1}{\sigma^2} x^2 \right) \right) d\mu \\ &amp;= \frac{1}{\sigma \tau 2\pi} \int_{-\infty}^\infty \exp \left( -\frac{1}{2} \left( \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \mu^2 - \frac{1}{\sigma^2} 2x\mu + \frac{1}{\sigma^2} x^2 \right) \right) d\mu \\ &amp;= \frac{1}{\sigma \tau 2\pi} \int_{-\infty}^\infty \exp \left( -\frac{1}{2} \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \left( \mu^2 - \frac{\tau^2}{\tau^2 + \sigma^2} 2x\mu + \frac{\tau^2}{\tau^2 + \sigma^2} x^2 \right) \right) d\mu \\ &amp;= \frac{1}{\sigma \tau 2\pi} \int_{-\infty}^\infty \exp \left( -\frac{1}{2} \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \left( \mu^2 - \frac{\tau^2}{\tau^2 + \sigma^2} 2x\mu + \frac{\tau^2}{\tau^2 + 
\sigma^2} x^2 \right) + \left(\left(\frac{\tau^2}{\tau^2 + \sigma^2} x\right)^2 - \frac{\tau^2}{\tau^2 + \sigma^2} x^2\right) - \left(\left(\frac{\tau^2}{\tau^2 + \sigma^2} x\right)^2 - \frac{\tau^2}{\tau^2 + \sigma^2} x^2\right) \right) d\mu \\ &amp;= \frac{1}{\sigma \tau 2\pi} \int_{-\infty}^\infty \exp \left( -\frac{1}{2} \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \left( \left( \mu^2 - \frac{\tau^2}{\tau^2 + \sigma^2} 2x\mu + \left(\frac{\tau^2}{\tau^2 + \sigma^2} x\right)^2 \right) - \left(\left(\frac{\tau^2}{\tau^2 + \sigma^2} x\right)^2 - \frac{\tau^2}{\tau^2 + \sigma^2} x^2\right) \right) \right) d\mu \\ &amp;= \frac{1}{\sigma \tau 2\pi} \int_{-\infty}^\infty \exp \left( -\frac{1}{2} \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \left( \left(\mu - \frac{\tau^2}{\tau^2 + \sigma^2} x \right)^2 - \left(\left(\frac{\tau^2}{\tau^2 + \sigma^2} x\right)^2 - \frac{\tau^2}{\tau^2 + \sigma^2} x^2\right)\right) \right) d\mu \\ &amp;= \frac{1}{\sigma \tau 2\pi} \int_{-\infty}^\infty \exp \left( -\frac{1}{2} \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \left(\mu - \frac{\tau^2}{\tau^2 + \sigma^2} x \right)^2\right) \exp \left( -\frac{1}{2} \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \left(\frac{\tau^2}{\tau^2 + \sigma^2} x^2 - \left(\frac{\tau^2}{\tau^2 + \sigma^2} x\right)^2\right)\right) d\mu \\ &amp;= \frac{1}{\sigma \tau 2\pi} \exp \left( -\frac{1}{2} \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \left(\frac{\tau^2}{\tau^2 + \sigma^2} x^2 - \left(\frac{\tau^2}{\tau^2 + \sigma^2} x\right)^2\right)\right) \int_{-\infty}^\infty \exp \left( -\frac{1}{2} \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \left(\mu - \frac{\tau^2}{\tau^2 + \sigma^2} x \right)^2\right) d\mu \\ &amp;= \frac{1}{\sigma \tau 2\pi} \exp \left( -\frac{1}{2} \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \left(\frac{\tau^2}{\tau^2 + \sigma^2} x^2 - \left(\frac{\tau^2}{\tau^2 + \sigma^2} x\right)^2\right)\right) \sqrt{2\pi} \sqrt{\frac{\sigma^2 \tau^2}{\tau^2 + \sigma^2}} \\ &amp;=
\frac{1}{\sqrt{2\pi (\tau^2 + \sigma^2)}} \exp \left( -\frac{1}{2} \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \left(\frac{\tau^2}{\tau^2 + \sigma^2} x^2 - \left(\frac{\tau^2}{\tau^2 + \sigma^2}\right)^2 x^2\right)\right) \\ &amp;= \frac{1}{\sqrt{2\pi (\tau^2 + \sigma^2)}} \exp \left( -\frac{1}{2} \left(\frac{\tau^2 + \sigma^2}{\sigma^2 \tau^2}\right) \left(\left(\frac{\tau^2}{\tau^2 + \sigma^2} - \left(\frac{\tau^2}{\tau^2 + \sigma^2}\right)^2\right) x^2 \right)\right) \\ &amp;= \frac{1}{\sqrt{2\pi (\tau^2 + \sigma^2)}} \exp \left( -\frac{1}{2 (\tau^2 + \sigma^2)} x^2 \right) \\ \end{align}</p>Andy Jonesaj13@princeton.eduHere, we discuss two distributions which arise as scale mixtures of normals: the Laplace and the Student-$t$.The Concrete Distribution2020-11-12T00:00:00-08:002020-11-12T00:00:00-08:00https://andrewcharlesjones.github.io/posts/2020/11/concrete<p>The Concrete distribution is a relaxation of discrete distributions.</p> <p>Here, we explain the motivation for relaxing discrete distributions and the properties of the Concrete distribution.</p> <p>To start, note that the distribution was discovered simultaneously by <a href="https://arxiv.org/abs/1611.00712">Maddison et al.</a> and <a href="https://arxiv.org/abs/1611.01144">Jang et al.</a>. Below, I fluidly use terminology from both papers, focusing more on the core, shared ideas than either implementation in particular.</p> <h2 id="reparameterization-trick">Reparameterization trick</h2> <p>The “reparameterization trick” actually refers to a family of methods for sampling from a distribution using alternate parameterizations. Most commonly, it amounts to reparameterizing a random variable so that it’s described by a deterministic function that takes as input the distribution’s parameters and another random variable from a <strong>fixed</strong> distribution.
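</p>

<p>As a minimal sketch of this idea (assuming NumPy; the function name <code>g</code> and the particular Gaussian choice here are just for illustration), we can write the deterministic function explicitly and check that feeding it fixed-distribution noise recovers the intended distribution:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def g(mu, sigma, eps):
    # Deterministic function of the parameters (mu, sigma) and of
    # noise eps drawn from a fixed N(0, 1) distribution.
    return mu + sigma * eps

# Draw z = g(mu, sigma, eps) many times with eps ~ N(0, 1).
mu, sigma = 2.0, 0.5
eps = rng.standard_normal(100_000)
z = g(mu, sigma, eps)

# The samples should match N(mu, sigma^2) in mean and standard deviation.
print(z.mean(), z.std())  # ≈ 2.0, ≈ 0.5
```

<p>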
This trick is useful for fitting certain statistical models using gradient descent.</p> <p>Concretely, let’s say $z$ is a random variable drawn from a distribution $p_\theta$. The reparameterization trick seeks to find a function $g_\phi(\theta, \epsilon)$ such that if $\epsilon$ is randomly drawn from a fixed distribution, then the output of $g$ is a sample from $p_\theta$.</p> <p>The most common example of the reparameterization trick uses the Gaussian distribution. Let $z \sim \mathcal{N}(\mu, \sigma^2)$. We can reparameterize this as $$\tilde{z} = \mu + \sigma \epsilon, \;\;\; \epsilon \sim \mathcal{N}(0, 1).$$</p> <p>Since adding a constant to a Gaussian shifts its mean, and multiplying by a constant $c$ scales the variance by $c^2$, we have that $\tilde{z} \sim \mathcal{N}(\mu, \sigma^2)$.</p> <p>There are many other instances of the reparameterization trick. In <a href="https://arxiv.org/abs/1312.6114">the paper that originally introduced it</a>, the authors mention that there are three primary classes of distributions that are amenable to the trick:</p> <ul> <li>When the distribution has a tractable inverse CDF. In this case, we can use the inverse transform method.</li> <li>When the distribution is in the location-scale family. In this case, we can use the same approach as the Gaussian example above.</li> <li>When the distribution can be expressed as a composition of other random variables (e.g., a log-normal random variable is the exponential of a normal random variable).</li> </ul> <p>Unfortunately, discrete random variables meet none of these criteria.</p> <h2 id="concrete-distribution">Concrete distribution</h2> <p>The Concrete distribution relies on a reparameterization of the multinomial distribution – this trick is called the <a href="https://andrewcharlesjones.github.io/posts/2020/02/gumbelmax/">Gumbel max trick</a>.</p> <p>Suppose we want to sample from a multinomial with $k$ classes with associated probabilities $\pi_1, \dots, \pi_k$.
The Gumbel max trick exploits the fact that the following quantity follows the distribution $\text{Mult}(\pi_1, \dots, \pi_k)$:</p> $z = \text{arg}\max_{i \in [k]} \left[\log(\pi_i) + g_i\right], \;\;\; g_i \sim \text{Gumbel}(0, 1).$ <p>In words, this means that if we add independent Gumbel noise to each of the log class probabilities and take the argmax, the output will have the desired multinomial distribution.</p> <p>However, in the setting of optimization, this trick still isn’t differentiable with respect to the parameters (since the $\text{arg}\max$ operation isn’t differentiable).</p> <p>This is where the Concrete distribution comes into the picture. This distribution is a relaxation of the multinomial distribution that is differentiable with respect to its parameters.</p> <p>The easiest way to understand the Concrete distribution is as a relaxation of the Gumbel max trick itself: instead of taking the $\text{arg}\max$ above, we simply take the softmax. The softmax is a function $f : \mathbb{R}^k \to \mathbb{R}^k$ whose $i$th output is defined as the following:</p> $f(x_1, \dots, x_k)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$ <p>The softmax function looks like this:</p> <p><img src="/assets/softmax.png" alt="softmax" /></p> <p>Often, a “temperature” parameter $\tau$ is included in the softmax function, which controls how steep the function is.
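</p>

<p>Both the Gumbel max trick and its softmax relaxation are easy to check numerically. A short sketch, assuming NumPy (the <code>softmax</code> helper below is mine, not from either paper):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.3, 0.5])  # class probabilities

# Gumbel max trick: z = argmax_i [log(pi_i) + g_i], g_i ~ Gumbel(0, 1).
n = 200_000
g = rng.gumbel(size=(n, 3))
z = np.argmax(np.log(pi) + g, axis=1)

# Empirical class frequencies should match pi.
freqs = np.bincount(z, minlength=3) / n
print(freqs)  # ≈ [0.2, 0.3, 0.5]

def softmax(x):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Relaxation: replace the argmax with a softmax. Each sample is now a
# point in the interior of the simplex rather than a hard class index.
z_relaxed = softmax(np.log(pi) + g)
print(z_relaxed.sum(axis=1)[:3])  # each row sums to 1
```

<p>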
The form of the function is then: $$f(x_1, \dots, x_k)_i = \frac{\exp(x_i / \tau)}{\sum_j \exp(x_j / \tau)}.$$</p> <p>Plotting this function across different temperatures, we can see that it approaches a step function as the temperature decreases.</p> <p><img src="/assets/softmax_temps.png" alt="softmax_temps" /></p> <p>Now, if we combine our temperature-controlled softmax function with the Gumbel max trick, we can approximate the multinomial with the following operation:</p> $z_i = \frac{\exp\left\{(\log(\pi_i) + g_i) / \tau\right\}}{\sum_j \exp\left\{(\log(\pi_j) + g_j) / \tau\right\}}$ <p>This is the sampling process for the Concrete distribution.</p> <p>As a simple example, consider the case when we have just two classes $k=2$, and the class probabilities are $\pi_1 = 0.2, \pi_2 = 0.8$. If we sample from the Concrete distribution at varying temperatures and plot $z_1$ (the first element of the output vector), we see that the samples approach the discrete multinomial as the temperature decreases, and become more uniform at higher temperatures.</p> <p><img src="/assets/concrete_hist.png" alt="concrete_hist" /></p> <p>So far, we have just described the sampling procedure. But we can also define a proper probability density. The Concrete distribution is a two-parameter distribution with the following PDF:</p> $p(x_1, \dots, x_k; \boldsymbol{\pi}, \tau) = \frac{\Gamma(k) \tau^{k-1}}{\left(\sum\limits_{i=1}^k \pi_i / x_i^\tau\right)^k} \prod\limits_{i=1}^k (\pi_i / x_i^{\tau + 1}).$ <p>Plotting the density for $\pi_1 = 0.2, \pi_2 = 0.8$, we can see similar behavior as the temperature increases and decreases:</p> <p><img src="/assets/concrete_density.png" alt="concrete_density" /></p> <h2 id="an-application-of-the-concrete-distribution-variational-inference">An application of the Concrete distribution: variational inference</h2> <p>One of the settings in which the reparameterization trick is most useful is variational inference.
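</p>

<p>The Concrete sampling procedure above can be sketched in a few lines (assuming NumPy; the helper name is mine). At a low temperature, $z_1$ is pushed toward the vertices $\{0, 1\}$, landing near $1$ with probability close to $\pi_1$, while at a high temperature it concentrates near the uniform value $1/2$:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.8])

def sample_concrete(pi, tau, n, rng):
    # z = softmax((log(pi) + g) / tau), with g ~ Gumbel(0, 1) i.i.d.
    g = rng.gumbel(size=(n, len(pi)))
    logits = (np.log(pi) + g) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

z_cold = sample_concrete(pi, tau=0.01, n=50_000, rng=rng)
z_hot = sample_concrete(pi, tau=100.0, n=50_000, rng=rng)

# Low temperature: samples are nearly one-hot, and z_1 is "on"
# with probability close to pi_1 = 0.2.
print(np.mean((z_cold[:, 0] < 0.05) | (z_cold[:, 0] > 0.95)))
print(np.mean(z_cold[:, 0] > 0.5))  # ≈ 0.2

# High temperature: z_1 piles up near 1/2.
print(np.mean(np.abs(z_hot[:, 0] - 0.5) < 0.05))
```

<p>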
VI is a general approach to performing approximate posterior inference in statistical models that seeks to approximate a distribution $p(z | x)$ using another family of distributions $q(z | x)$.</p> <p>Under the variational approach, the objective is to minimize the KL-divergence between the true posterior $p$ and the approximate posterior $q$. In turn, one typically seeks to maximize a lower bound on the log model evidence. This is called the evidence lower bound (ELBO) and is written as:</p> $\log p(x) \geq \mathbb{E}_{q(z | x)} [-\log q(z | x) + \log p(x, z)] =: \mathcal{L}.$ <p>In the seminal paper <a href="https://arxiv.org/abs/1312.6114">Auto-Encoding Variational Bayes</a>, the authors found that this objective $\mathcal{L}$ could be maximized using stochastic gradient descent. To use SGD in this setting, one has to reparameterize the stochastic nodes.</p> <p>In models with discrete latent variables, gradient descent was intractable because generic methods such as the inverse transform method do not yield a differentiable reparameterization of the distributions.</p> <p>This is where the benefit of the Concrete distribution arises: instead of directly using the desired discrete distribution, we can use the Concrete relaxation, which allows us to backpropagate gradients through the model.</p> <h2 id="references">References</h2> <p>Please note that the first two references (Maddison et al. and Jang et al.) simultaneously discovered this distribution and typically share credit in the literature. In this post, I used “Concrete distribution” as the default terminology but borrowed ideas and notation from both papers.</p> <ul> <li>Maddison, Chris J., Andriy Mnih, and Yee Whye Teh. “The concrete distribution: A continuous relaxation of discrete random variables.” arXiv preprint arXiv:1611.00712 (2016).</li> <li>Jang, Eric, Shixiang Gu, and Ben Poole.
“Categorical reparameterization with gumbel-softmax.” arXiv preprint arXiv:1611.01144 (2016).</li> <li>Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).</li> </ul>Andy Jonesaj13@princeton.eduThe Concrete distribution is a relaxation of discrete distributions.