Until I read the recent paper at ICML 2017, I hadn’t heard of the Gumbel trick. There is surprisingly little online about the Gumbel trick—related to the more popular Gumbel-max trick—so here we go.

We often want to characterize probabilistic models in discrete situations. The Gumbel trick allows us to estimate as associated partition function with relative ease. At a high level, finding or even is very difficult; however, we can add some noise and compute the maximum a posteriori (MAP) more easily through approximation methods. If we repeat this process enough times, we get a reliable estimate of .

In complexity theory, we know that finding the MAP is NP-hard but can be approximated quickly in practice. Note that the partition function is a harder even still, containing #P-hard problems.

Let’s formalize. For finite sample of size , we define an unnormalized mass function and let be the normalizing partition function. We then define as the log-unnormalized probabilities or the potential function.

Our algorithm is then:

- Add Gumbel distributed noise to our potential functions
- Find the MAP of this perturbed value over all . Call this value
- Repeat steps 1 and 2 multiple times and then collect the mean

## But why?

We want to prove the supposedly useful Gumbel trick then using the *Perturb-and-MAP* method, specifically

where $\phi(x)$ has been defined as the potentials and where is the Euler-Mascheroni constant.

Because the mean of is , we can show that and then are recoverable.

## A brief Gumbal interlude

The Gumbel distribution is traditionally used to model the maxima of already extreme events. For example, what will be worst earthquake next year given the measurements of the worst earthquakes in the past 10 years in San Francisco?

A variable drawn from has the probability distribution

where . For this problem, we set the scale parameter whereas the location parameter remains free.

Additionally, the cumulative distribution function of a is

This will become useful!

## The actual proof

We want to find the value of that maximizes . Thinking in terms of the CDF, we want all values of to produce smaller or equal values

The first equality follows from multiplying the Gumbel CDF of of all possible values to capture the maximum. The second equality comes from expanding out the Gumbel CDF. The third equality consolidates the potential functions such that . The fourth equality sticks the back into the function. The last equality compresses the probability back into the Gumbel CDF, except set at a different location.

## Other proofs

I enjoyed and modeled this post after the proof from Hazan and Jaakkola 2012; however, Matej et al 2017 has another proof using a cleverly chosen function and competing exponential clocks.