Andy Jones CS PhD student @ Princeton

Log-derivative trick

The “log-derivative trick” is really just a simple application of the chain rule. However, it allows us to rewrite the gradient of an expectation as an expectation itself, which makes it amenable to Monte Carlo approximation.

Log-derivative trick

Suppose we have a function $p(x; \theta)$ (in this context we’ll mostly think of $p$ as a probability density), and we’d like to take the gradient of its logarithm with respect to $\theta$,

\[\nabla_\theta \log p(x; \theta).\]

By a simple application of the chain rule, we have

\[\nabla_\theta \log p(x; \theta) = \frac{\nabla_\theta p(x; \theta)}{p(x; \theta)}\]

which, rearranging, implies that

\[\nabla_\theta p(x; \theta) = p(x; \theta) \nabla_\theta \log p(x; \theta).\]
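As a quick sanity check, the identity can be verified numerically. The sketch below assumes a specific family that is not part of the derivation above: a unit-variance Gaussian whose mean is $\theta$. The helper names (`gaussian_pdf`, `central_difference`) and the values of `x` and `theta` are made up for illustration, and the gradients are approximated with finite differences.

```python
import math

def gaussian_pdf(x, theta):
    """Density of N(theta, 1) evaluated at x (an assumed example family)."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)

def central_difference(g, theta, eps=1e-5):
    """Finite-difference approximation of dg/dtheta at theta."""
    return (g(theta + eps) - g(theta - eps)) / (2.0 * eps)

x, theta = 1.3, 0.4  # arbitrary illustrative values

# Left-hand side: derivative of the density itself.
grad_p = central_difference(lambda t: gaussian_pdf(x, t), theta)

# Right-hand side: density times the derivative of the log-density.
grad_log_p = central_difference(lambda t: math.log(gaussian_pdf(x, t)), theta)
rhs = gaussian_pdf(x, theta) * grad_log_p

# For this family the score has the closed form x - theta.
analytic_score = x - theta

print(grad_p, rhs, grad_log_p, analytic_score)
```

The two sides should agree up to finite-difference error, and `grad_log_p` should match the closed-form score $x - \theta$ for this Gaussian.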

Score function estimator

In many statistical applications, we want to estimate the gradient of an expectation of a function $f$:

\[\nabla_\theta \mathbb{E}_{p(x; \theta)}[f(x)].\]

To learn more about a few applications where this gradient estimation problem shows up, as well as more modern methods for solving it, I’d recommend the review “Monte Carlo Gradient Estimation in Machine Learning” by Shakir Mohamed et al.

Unfortunately, we cannot directly approximate this expression with naive Monte Carlo methods, because the gradient of an expectation is not, in general, itself an expectation. Expanding the expectation, we have:

\begin{align} \nabla_\theta \mathbb{E}_{p(x; \theta)}[f(x)] &= \nabla_\theta \int p(x; \theta) f(x) dx \\ &= \int \underbrace{\nabla_\theta p(x; \theta)}_{\text{density?}} f(x) dx && \text{(Leibniz rule)} \end{align}

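However, $\nabla_\theta p(x; \theta)$ will not in general be a valid probability density (it can take negative values, and under the same regularity conditions it integrates to zero rather than one), so the integral is not an expectation with respect to it, and we cannot approximate it with a sample average such as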

\[\nabla_\theta \mathbb{E}_{p(x; \theta)}[f(x)] \approx \frac1n \sum\limits_{i=1}^n \nabla_\theta p(x_i; \theta) f(x_i).\]

Thankfully, the log-derivative trick allows us to rewrite it as a true expectation:

\begin{align} \nabla_\theta \mathbb{E}_{p(x; \theta)}[f(x)] &= \nabla_\theta \int p(x; \theta) f(x) dx \\ &= \int \nabla_\theta p(x; \theta) f(x) dx && \text{(Leibniz rule)} \\ &= \int p(x; \theta) \frac{\nabla_\theta p(x; \theta)}{p(x; \theta)} f(x) dx && \left(\text{Multiply by } 1=\frac{p(x; \theta)}{p(x; \theta)}\right) \\ &= \int p(x; \theta) \nabla_\theta \log p(x; \theta) f(x) dx && \text{(Log-derivative trick)} \\ &= \mathbb{E}_{p(x; \theta)}[\nabla_\theta \log p(x; \theta) f(x)]. \end{align}

We can then approximate this expectation with $n$ Monte Carlo samples $x_1, \dots, x_n$ drawn from $p(x; \theta)$:

\[\mathbb{E}_{p(x; \theta)}[\nabla_\theta \log p(x; \theta) f(x)] \approx \frac{1}{n} \sum\limits_{i=1}^n \nabla_\theta \log p(x_i; \theta) f(x_i).\]
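To make the estimator concrete, here is a minimal sketch under assumptions that are not in the text above: $p(x; \theta)$ is taken to be $\mathcal{N}(\theta, 1)$, so the score is $x - \theta$, and $f(x) = x^2$. With these choices $\mathbb{E}[f(x)] = \theta^2 + 1$, so the exact gradient is $2\theta$ and the estimate can be checked against it. The function name `score_function_gradient` is hypothetical.

```python
import random

def score_function_gradient(f, theta, n, rng):
    """Score function estimate of d/dtheta E[f(x)] with x ~ N(theta, 1).

    Assumes a unit-variance Gaussian, for which
    d/dtheta log p(x; theta) = x - theta.
    """
    total = 0.0
    for _ in range(n):
        x = rng.gauss(theta, 1.0)   # sample x_i from p(x; theta)
        score = x - theta           # gradient of log p at x_i
        total += score * f(x)       # one term of the Monte Carlo sum
    return total / n

rng = random.Random(0)              # fixed seed for reproducibility
theta = 1.5
estimate = score_function_gradient(lambda x: x ** 2, theta, n=200_000, rng=rng)
exact = 2.0 * theta                 # d/dtheta (theta^2 + 1)

print(estimate, exact)
```

Note that only samples of $x$, evaluations of $f$, and the score are needed; $f$ itself is never differentiated, which is why the estimator also applies when $f$ is a black box.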


Although relatively straightforward, the score function estimator shows up all over the place. In reinforcement learning, it underlies the REINFORCE algorithm, in which the gradient of the expected reward is taken with respect to the parameters of the policy. In variational inference, it shows up when optimizing the evidence lower bound (ELBO) with respect to the parameters of the variational distribution. And in computational finance, this estimator is important for performing “sensitivity analysis”, or understanding how financial outcomes change with underlying model assumptions.

Another interesting line of work explores ways to reduce the variance of the score function estimator, which can be extremely high, especially in discrete settings. Much of this work designs effective control variates. In discrete latent variable models, another popular approach is to introduce a continuous relaxation of the problem, which typically yields lower-variance gradient estimates at the cost of some bias.
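One of the simplest control variates is a constant baseline $b$. Because $\mathbb{E}_{p(x; \theta)}[\nabla_\theta \log p(x; \theta)] = 0$, replacing $f(x)$ with $f(x) - b$ leaves the expectation unchanged but can shrink the variance. The sketch below is only an illustration under assumed choices: $p(x; \theta) = \mathcal{N}(\theta, 1)$, $f(x) = x^2$, and a baseline set to an estimate of $\mathbb{E}[f(x)]$ from a separate pilot batch. It is not a prescription for how to choose $b$ in general.

```python
import random

rng = random.Random(1)
theta = 1.5

def f(x):
    return x ** 2

def mean_and_variance(values):
    """Sample mean and unbiased sample variance of a list of floats."""
    m = sum(values) / len(values)
    v = sum((t - m) ** 2 for t in values) / (len(values) - 1)
    return m, v

# Pilot batch: estimate the baseline b ~ E[f(x)] from independent samples,
# so the baseline is independent of the samples used in the estimator.
pilot = [f(rng.gauss(theta, 1.0)) for _ in range(10_000)]
baseline = sum(pilot) / len(pilot)

plain_terms = []
baseline_terms = []
for _ in range(100_000):
    x = rng.gauss(theta, 1.0)
    score = x - theta                       # d/dtheta log N(x; theta, 1)
    plain_terms.append(score * f(x))        # standard score function term
    baseline_terms.append(score * (f(x) - baseline))  # with control variate

plain_estimate, plain_variance = mean_and_variance(plain_terms)
baseline_estimate, baseline_variance = mean_and_variance(baseline_terms)

print(plain_estimate, plain_variance)
print(baseline_estimate, baseline_variance)
```

Both estimates should be close to the exact gradient $2\theta$, while the per-sample variance with the baseline should come out noticeably smaller for this toy problem.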


The log-derivative trick is a straightforward manipulation of the derivative of a logarithm, but it provides an important route to estimating gradients of expectations that would otherwise be hard to compute.

