Maximum Likelihood as Minimize KL-Divergence 8

When I study Product of Experts and Restricted Boltzmann Machine recently, I have found a very interesting technical point to be demonstrated. That is maximum likelihood estimation can be viewed as a process to minimize a KL-Divergence between two distributions.

We start from a simple notation. Let \( P(x_{i} \, | \, \boldsymbol{\Theta})\) be the distribution for generating each data point \( x_{i} \). We can define the model parameter distribution as:
P_{\boldsymbol{\Theta}}(x) = P(x \, | \, \boldsymbol{\Theta})
\end{equation}We define the following distribution as the “empirical data distribution”:
P_{D}(x) = \sum_{i=1}^{N} \frac{1}{N}\delta(x – x_{i})\label{eq:data_distribution}
\end{equation}where \(\delta\) is the Dirac delta function. We can verify that this is a valid distribution by summing over all \( x\): \(\sum_{j=1}^{N}P_{D}(x_{j}) = 1\). By using this empirical data distribution, we wish to calculate the following KL divergence:
\mathrm{KL}[ P_{D}(x) \, || \, P_{\boldsymbol{\Theta}}(x)] &= \int P_{D}(x) \log \frac{P_{D}(x)}{P_{\boldsymbol{\Theta}}(x)} \, dx \\
&= \int P_{D}(x) \log P_{D}(x) \, dx – \int P_{D}(x) \log P_{\boldsymbol{\Theta}}(x) \, dx \\
&= -H[P_{D}(x)] – \int P_{D}(x) \log P_{\boldsymbol{\Theta}}(x) \, dx \\
&\propto – \Bigr< \log P_{\boldsymbol{\Theta}}(x) \Bigl>_{P_{D}(x)}
\end{align}The last line comes when we drop \(-H[P_{D}(x)]\), which has nothing to do with parameters \(\boldsymbol{\Theta}\) and \(\Bigr< \Bigl>_{P_{D}(x)}\) represents the expectation over \(P_{D}(x)\). If we minimize the left hand side KL divergence, it is equivalent to minimize the negative expectation on the right hand side. This expectation is indeed:
\Bigr< \log P_{\boldsymbol{\Theta}}(x) \Bigl>_{P_{D}(x)} &= \sum_{x} P_{D}(x) \log P(x \, | \, \boldsymbol{\Theta}) \\
&= \sum_{x} \Bigr[ \frac{1}{N}\sum_{i=1}^{N}\delta(x – x_{i}) \Bigl] \log P(x \, | \, \boldsymbol{\Theta}) \\
&= \frac{1}{N} \sum_{i=1}^{N} \log P(x_{i} \, | \, \boldsymbol{\Theta})
\end{align}This is indeed log likelihood of the dataset.

Leave a comment

Your email address will not be published. Required fields are marked *

8 thoughts on “Maximum Likelihood as Minimize KL-Divergence

  • Dror

    Hi LiangJie,

    This is a really cool result. Where did you come across this?

    I had the feeling that something like this holds but wasn’t sure how to go about it.

    I’d appreciate it if you’ve got any more info regarding this.



  • D

    Hi LiangJie. This is nice but I think that KL(P ||Q) only exists when the radon nikodym derivative of P exists with respect to Q, ie P is absolutely continuous with respect to Q. Otherwise you get into difficulties integrating over the logarithm of zero. In your case, the expression log(P_D(x)) may not make mathematical sense. I don’t think you can take logarithms of the Dirac Delta function, since this is not a ‘function’ as such, but a distribution.

  • Liangjie Hong Post author

    Hi Luyao,

    Equation (5) does not leave an integral as the expectation is over all data points. So it is a summation. (But “dx” should be removed and I’m going to revise that.) Equation (6) directly comes from the definition of the empirical data distribution.

  • Luyao Yuan

    I think there should be an integral, you should just replace PD(x) with its definition. Now, it is not even a probability.

  • Liangjie Hong Post author

    OK. I do miss a summation there. Now, it’s clear. The integral is the summation here as the expectation is taken with respect to a discrete distribution, empirical data distribution. The last line is from the change from x to x_i.

    The outer summation is sum over all possible x while the inner summation is over all data instances. The outer summation is gone as the change of x to x_i.

    Thanks for pointing out this.

  • Praveen Narayanan

    I was solving a problem from Duda and Hart (also in Bishop). Minimizing the KL distance is equivalent to computing an MLE estimate. Aside from the Dirac-Delta function (which is quite cool), my interpretation is that we can view it as a sort of Monte Carlo estimate where we average over N (see Doucet’s notes – which also contains Dirac Delta function – points taken from the distribution we have data from.