We often want to characterize probabilistic models in discrete situations. The Gumbel trick allows us to estimate as associated partition function with relative ease. At a high level, finding or even is very difficult; however, we can add some noise and compute the maximum a posteriori (MAP) more easily through approximation methods. If we repeat this process enough times, we get a reliable estimate of .
Let’s formalize. For finite sample of size , we define an unnormalized mass function and let be the normalizing partition function. We then define as the log-unnormalized probabilities or the potential function.
Our algorithm is then:
- Add Gumbel distributed noise to our potential functions
- Find the MAP of this perturbed value over all . Call this value
- Repeat steps 1 and 2 multiple times and then collect the mean
We want to prove the supposedly useful Gumbel trick then using the Perturb-and-MAP method, specifically
where $\phi(x)$ has been defined as the potentials and where is the Euler-Mascheroni constant.
Because the mean of is , we can show that and then are recoverable.
A brief Gumbal interlude
The Gumbel distribution is traditionally used to model the maxima of already extreme events. For example, what will be worst earthquake next year given the measurements of the worst earthquakes in the past 10 years in San Francisco?
A variable drawn from has the probability distribution
where . For this problem, we set the scale parameter whereas the location parameter remains free.
Additionally, the cumulative distribution function of a is
This will become useful!
The actual proof
We want to find the value of that maximizes . Thinking in terms of the CDF, we want all values of to produce smaller or equal values
The first equality follows from multiplying the Gumbel CDF of of all possible values to capture the maximum. The second equality comes from expanding out the Gumbel CDF. The third equality consolidates the potential functions such that . The fourth equality sticks the back into the function. The last equality compresses the probability back into the Gumbel CDF, except set at a different location.