Maximum Likelihood as Minimizing KL-Divergence


While recently studying Products of Experts and Restricted Boltzmann Machines, I came across a very interesting technical point worth demonstrating: maximum likelihood estimation can be viewed as minimizing a KL divergence between two distributions.

We start with some simple notation. Let \( P(x_{i} \, | \, \boldsymbol{\Theta})\) be the distribution from which each data point \( x_{i} \) is generated. We define the model distribution as:
\begin{equation}
P_{\boldsymbol{\Theta}}(x) = P(x \, | \, \boldsymbol{\Theta})
\end{equation}We define the following distribution as the “empirical data distribution”:
\begin{equation}
P_{D}(x) = \sum_{i=1}^{N} \frac{1}{N}\delta(x - x_{i})\label{eq:data_distribution}
\end{equation}where \(\delta\) is the Dirac delta function. We can verify that this is a valid distribution by integrating over all \( x\): \(\int P_{D}(x) \, dx = \frac{1}{N}\sum_{i=1}^{N}\int \delta(x - x_{i}) \, dx = 1\). Using this empirical data distribution, we wish to compute the following KL divergence:
\begin{align}
\mathrm{KL}[ P_{D}(x) \, || \, P_{\boldsymbol{\Theta}}(x)] &= \int P_{D}(x) \log \frac{P_{D}(x)}{P_{\boldsymbol{\Theta}}(x)} \, dx \\
&= \int P_{D}(x) \log P_{D}(x) \, dx - \int P_{D}(x) \log P_{\boldsymbol{\Theta}}(x) \, dx \\
&= -H[P_{D}(x)] - \int P_{D}(x) \log P_{\boldsymbol{\Theta}}(x) \, dx \\
&= \mathrm{const} - \left\langle \log P_{\boldsymbol{\Theta}}(x) \right\rangle_{P_{D}(x)}
\end{align}The last line absorbs \(-H[P_{D}(x)]\), which does not depend on the parameters \(\boldsymbol{\Theta}\), into a constant; \(\left\langle \cdot \right\rangle_{P_{D}(x)}\) denotes the expectation over \(P_{D}(x)\). Minimizing the KL divergence on the left-hand side is therefore equivalent to minimizing the negative expectation on the right-hand side, i.e. maximizing the expected log-likelihood. This expectation is indeed:
\begin{align}
\left\langle \log P_{\boldsymbol{\Theta}}(x) \right\rangle_{P_{D}(x)} &= \int P_{D}(x) \log P(x \, | \, \boldsymbol{\Theta}) \, dx \\
&= \int \left[ \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_{i}) \right] \log P(x \, | \, \boldsymbol{\Theta}) \, dx \\
&= \frac{1}{N} \sum_{i=1}^{N} \log P(x_{i} \, | \, \boldsymbol{\Theta})
\end{align}where the last line follows from the sifting property of the Dirac delta function. This is precisely the average log-likelihood of the dataset, so maximizing the likelihood is equivalent to minimizing \(\mathrm{KL}[ P_{D}(x) \, || \, P_{\boldsymbol{\Theta}}(x)]\).
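As a quick numerical sanity check of the decomposition \(\mathrm{KL} = -H[P_{D}] - \langle \log P_{\boldsymbol{\Theta}} \rangle_{P_{D}}\), we can verify it on a small discrete example (the numbers below are made up for illustration, not taken from any dataset):

```python
import numpy as np

# Two hypothetical discrete distributions over the same finite support.
p = np.array([0.5, 0.3, 0.2])   # plays the role of the data distribution P_D
q = np.array([0.4, 0.4, 0.2])   # plays the role of the model distribution P_Theta

# KL divergence computed directly from its definition.
kl = np.sum(p * np.log(p / q))

# The decomposition: KL[p || q] = -H[p] - <log q>_p
neg_entropy = np.sum(p * np.log(p))     # -H[p]
expected_log_q = np.sum(p * np.log(q))  # <log q>_p, the expected log-likelihood

print(kl, neg_entropy - expected_log_q)  # the two quantities agree
```

Since \(-H[p]\) is fixed by the data, any change in the KL divergence as \(q\) varies comes entirely from the expected log-likelihood term, which is the crux of the argument above.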
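We can also see the equivalence end-to-end with a tiny simulation: for a unit-variance Gaussian model, the parameter that maximizes the average log-likelihood \(\frac{1}{N}\sum_i \log P(x_i \, | \, \mu)\) (equivalently, minimizes the KL divergence above) should be the sample mean, the closed-form MLE. This is a sketch with synthetic data; the model family and grid search are illustrative choices, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)  # synthetic dataset

def avg_log_likelihood(mu, x):
    # Average log-likelihood of a unit-variance Gaussian N(mu, 1),
    # i.e. <log P_Theta(x)>_{P_D} from the derivation above.
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

# Scan candidate means; the maximizer of the average log-likelihood
# (equivalently, the minimizer of KL[P_D || P_Theta]) is the MLE.
grid = np.linspace(0.0, 6.0, 601)
best = grid[np.argmax([avg_log_likelihood(m, data) for m in grid])]

print(best, data.mean())  # best is the grid point closest to the sample mean
```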



8 thoughts on “Maximum Likelihood as Minimizing KL-Divergence”

  • Dror

    Hi LiangJie,

    This is a really cool result. Where did you come across this?

    I had the feeling that something like this holds but wasn’t sure how to go about it.

    I’d appreciate it if you’ve got any more info regarding this.

    Cheers,

    Dror

  • D

Hi LiangJie. This is nice, but I think that KL(P || Q) only exists when the Radon-Nikodym derivative of P with respect to Q exists, i.e. P is absolutely continuous with respect to Q. Otherwise you get into difficulties integrating over the logarithm of zero. In your case, the expression log(P_D(x)) may not make mathematical sense. I don’t think you can take logarithms of the Dirac delta function, since this is not a ‘function’ as such, but a distribution.

  • Liangjie Hong Post author

    Hi Luyao,

    Equation (5) does not leave an integral as the expectation is over all data points. So it is a summation. (But “dx” should be removed and I’m going to revise that.) Equation (6) directly comes from the definition of the empirical data distribution.

  • Luyao Yuan

    I think there should be an integral, you should just replace PD(x) with its definition. Now, it is not even a probability.

  • Liangjie Hong Post author

OK, I did miss a summation there. Now it’s clear: the integral is a summation here, as the expectation is taken with respect to a discrete distribution, the empirical data distribution. The last line comes from the change from x to x_i.

The outer summation is over all possible x, while the inner summation is over all data instances. The outer summation disappears with the change from x to x_i.

    Thanks for pointing out this.

  • Praveen Narayanan

    I was solving a problem from Duda and Hart (also in Bishop). Minimizing the KL distance is equivalent to computing an MLE estimate. Aside from the Dirac-Delta function (which is quite cool), my interpretation is that we can view it as a sort of Monte Carlo estimate where we average over N (see Doucet’s notes – which also contains Dirac Delta function – http://www.stats.ox.ac.uk/~doucet/doucet_defreitas_gordon_smcbookintro.pdf)from points taken from the distribution we have data from.