New MAP Estimation of latent Dirichlet allocation


A recent published paper entitled “On Estimation and Selection for Topic Models” by Matthew A. Taddy on AISTATS 2012 is interesting.

Firstly, this paper tries to introduce a new way to do inference for latent Dirichlet allocation (LDA). Traditionally, variational inference and Gibbs sampling are widely used to perform Bayesian inference for LDA. For a simpler version of LDA, Probabilistic Latent Semantic Analysis (PLSA), a Maximum Likelihood Estimation (MLE) via EM algorithm is usually used where no prior distributions are used at all. Both Bayesian inference and MLE can be treated as two extremes to learning a LDA or PLSA as MLE may be prone to over-fitting but Bayesian inference is a little bit overly complicated. The parameterization of the model is slightly different from normal settings. Let \( \mathbf{x}_{i} \) be a vector of counts in \( V \) categories (terms) for document \( i \). Basically, each element in this vector is the number of times term \( v \) appearing in document \( i \). We also have the total term count \( m_{i} = \sum_{j}^{V} x_{i,j} \). Thus, the \( K \)-topic model has the following generative story:

\[
\begin{aligned}
\mathbf{x}_{i} \sim \mbox{Multinomial}(\boldsymbol{\theta}_{i,1} \boldsymbol{\phi}_{1} + \cdots + \boldsymbol{\theta}_{i,K} \boldsymbol{\phi}_{K}, m_{i})
\end{aligned}
\]

where \( \boldsymbol{\theta}_{i} \) is a per-document distribution over topics and \( \boldsymbol{\phi}_{k} \) is a distribution over words. Of course, this setting is indeed exactly same as previously published settings. However, we can easily view the term counts as a function of total counts explicitly. Again, Dirichlet distributions are put on \( \boldsymbol{\theta} \) and \( \boldsymbol{\phi} \). The paper discusses a joint posterior maximization by using EM algorithm. The author treats the term count vector \( \mathbf{x}_{i} \) as:

\[
\begin{aligned}
\mathbf{x}_{i} \sim \mbox{Multinomial}(\boldsymbol{\phi}_{1}, t_{i,1}) + \cdots + \mbox{Multinomial}(\boldsymbol{\phi}_{K}, t_{i,K})
\end{aligned}
\]

where \( \mathbf{t}_{i} \sim \mbox{Multinomial}(\boldsymbol{\theta}_{i}, m_{i}) \) and these \( \mathbf{t}_{i} \) are treated as missing-data for EM algorithm. Thus, missing-data is no longer single topic assignments for each term but aggregated topic counts for the whole documents. The detailed derivations of the corresponding EM algorithm is in the paper.

The second point of the paper is to propose some new method to perform model selection, which is always difficult for topic models in some sense. The whole idea is to find \( K \) that can maximize \( P(\mathbf{X} \, | \, K) \), which is usually approximated through MCMC. In this paper, the author details Laplace’s method to approximate this probability. Again, details are in the paper. The last point made by the paper is to use multinomial dispersion to evaluate whether the model fits the data, which is standard in statistics but never used in topic modeling literature.

The experiments are somewhat interesting. The author demonstrated that MAP is almost always comparable to variational EM (VEM) and sometimes slightly better than Gibbs sampling. Sometimes, MAP is even better than VEM.

Note, this is NOT the first paper to propose a MAP learning for topic models. Here’s one at least:

J. T. Chien, J. T. Chien, M. S. Wu, and M. S. Wu. Adaptive Bayesian Latent Semantic Analysis. Audio, Speech, and Language Processing, IEEE Transactions on [seealso Speech and Audio Processing, IEEE Transactions on], 16(1):198–207, 2008.

Leave a comment

Your email address will not be published. Required fields are marked *