Category Archives: Topic Model

Topic Models meet Latent Factor Models

There is a trend in research communities to bring two well-established classes of models together, topic models and latent factor models. By doing so, we may enjoy the ability to analyze text information with topic models and incorporate the collaborative filtering analysis with latent factor models. In this section, I wish to discuss some of these efforts.

Three papers will be covered in this post are listed at the end of the post. Before that, let’s first review what latent factor models are. Latent factor models (LFM) are usually used in collaborative filtering context. Say, we have a user-item rating matrix \( \mathbf{R} \) where \( r_{ij} \) represents the rating user \( i \) gives to item \( j \). Now, we assume for each user \( i \), there is a vector \( \mathbf{u}_{i} \) with the dimensionality \( k \), representing the user in a latent space. Similarly, we assume for each item \( j \), a vector \( \mathbf{v}_{j} \) with the same dimensionality representing the item in a same latent space. Thus, the rating \( r_{ij} \) is therefore represented as:
\[ r_{ij} = \mathbf{u}_{i}^{T} \mathbf{v}_{j} \]This is the basic setting for LFM. In addition to this basic setting, additional biases can be incorporated, see here. For topic models (TM), the simplest case is Latent Dirichlet Allocation (LDA). The story of LDA is like this. For a document \( d \), we first sample a multinomial distribution \( \boldsymbol{\theta}_{d} \), which is a distribution over all possible topics. For each term position \( w \) in the document, we sample a discrete topic assignment \( z \) from \( \boldsymbol{\theta}_{d} \), indicating which topic we use for this term. Then, we sample a term \( v \) from a topic \( \boldsymbol{\beta} \), a multinomial distribution over the vocabulary.

For both LFM and TM, they are methods to reduce original data into latent spaces. Therefore, it might be possible to link them together. Especially, items in the LFM are associated with rich text information. One natural idea is that, for an item \( j \), the latent factor \( \mathbf{v}_{j} \) and its topic proportional parameter \( \boldsymbol{\theta}_{j} \) somehow gets connected. One way is to directly equalize these two variables. Since \( \mathbf{v}_{j} \) is a real-value variable and \( \boldsymbol{\theta}_{j} \) falls into a simplex, we need certain ways to keep these properties. Two possible methods can be used:

  1. Keep \( \boldsymbol{\theta}_{j} \) and make sure it is in the range of [0, 1] in the optimization process. Essentially put some constraint on the parameter.
  2. Keep \( \mathbf{v}_{j} \) and use logistic transformation to transfer a real-valued vector into simplex.

Hanhuai and Banerjee showed the second technique in their paper by combining Correlated Topic Model with LFM. Wang and Blei argued that this setting suffers from the limitation that it cannot distinguish topics for explaining recommendations from topics important for explaining content since the latent space is strictly equal. Thus, they proposed a slightly different approach. Namely, each \( \mathbf{v}_{j} \) derives from \( \boldsymbol{\theta}_{j} \) with item-dependent noise:
\[ \mathbf{v}_{j} = \boldsymbol{\theta}_{j} + \epsilon_{j} \] where \( \epsilon_{j} \) is a Gaussian noise.

A different approach is to not directly equal these two quantities but let me impact these each other. One such way explored by Hanhuai and Banerjee is that \( \boldsymbol{\theta}_{j} \) influences how \( \mathbf{v}_{j} \) is generated. More specifically, in Probabilistic Matrix Factorization (PMF) setting, all \( \mathbf{v} \)s are generated by a Gaussian distribution with a fixed mean and variance. Now, by combining LDA, the authors allow different topic has different Gaussian prior mean and variance values. A value similar to \( z \) is firstly generated from \( \boldsymbol{\theta}_{j} \) to decide which mean to use and then generate \( \mathbf{v}_{j} \) from that particular mean and variance.

A totally different direction was taken by Agarwal and Chen. In their fLDA paper, there is no direct relationship between item latent factor and content latent factor. In fact, their relationship is realized by the predictive equation:
\[ r_{ij} = \mathbf{a}^{T} \mathbf{u}_{i} + \mathbf{b}^{T} \mathbf{v}_{j} + \mathbf{s}_{i}^{T} \bar{\mathbf{z}}_{j}
\]where \( \mathbf{a} \), \( \mathbf{b} \) and \(\mathbf{s}_{i} \) are regression weights and \( \bar{\mathbf{z}}_{j} \) is the average topic assignments for item \( j \). Note, \(\mathbf{s}_{i} \) is a user-dependent regression weights. This formalism encodes the notion that all latent factors (including content) will contribute to the rating, not only item and user factors.

In summary, three directions have been taken for integrating TM and LFM:

  1. Equal item latent factor and topic proportion vector, or make some Gaussian noise.
  2. Let topic proportion vector to control the prior distribution for item latent factor.
  3. Let item latent factor and topic assignments, as well as user latent factor, contribute the rating.

Reference:

  • Deepak Agarwal and Bee-Chung Chen. 2010. fLDA: matrix factorization through latent dirichlet allocation. In Proceedings of the third ACM international conference on Web search and data mining (WSDM ’10). ACM, New York, NY, USA, 91-100. [PDF]
  • Hanhuai Shan and Arindam Banerjee. 2010. Generalized Probabilistic Matrix Factorizations for Collaborative Filtering. In Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM ’10). IEEE Computer Society, Washington, DC, USA, 1025-1030. [PDF]
  • Chong Wang and David M. Blei. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’11). ACM, New York, NY, USA, 448-456.[PDF]

An Easy Reading Tutorial for Bayesian Non-parametric Models

The “god-father” of LDA, David Blei, recently published a tutorial on Bayesian Non-parametric Models, with one of his student. The whole tutorial is easy-reading and provides very clear overview of Bayesian Non-parametric Models. In particular, Chinese Restaurant Process (CRP) and Indian Buffet Process are discussed in a very intuitive way. For those who are interests in technical details about these models, this tutorial may be just a starting point and the Appendix points out several ways to discuss models more formally, including inference algorithms.

One specific interesting property shown in this tutorial is the “exchangeable” property for CRP, which I wish to re-state as below.

Let \( c_{n} \) be the table assignment of the \(n\)th customer. A draw from CPR can be generated by sequentially assigning observations to classes with probability:
\[P(c_{n} = k | \mathbf{c}_{1:n-1}) = \begin{cases}
\frac{m_{k}}{n-1+\alpha}, & \mbox{if } k \leq \mathbf{K}_{+} \mbox{ (i.e., $k$ is a previously occupied table)} \\
\frac{\alpha}{n-1+\alpha}, & \mbox{otherwise (i.e., $k$ is the next unoccupied table)}
\end{cases}\]where \( m_{k} \) is the number of customers sitting at table \( k \), and \( \mathbf{K}_{+} \) is the number of tables for which \( m_{k} > 0 \). The parameter \( \alpha \) is called the concentration parameter. The CRP exhibits an important invariance property: The cluster assignments under this distribution are exchangeable. This means \( p(\mathbf{c}) \) is unchanged if the order of customers is shuffled.

Consider the joint distribution of a set of customer assignments \( c_{1:N} \). It decomposes according to the chain rule:
\[p(c_{1}, c_{2}, \cdots , c_{N}) = p(c_{1}) p(c_{2} | c_{1}) \cdots p(c_{N} | c_{1}, c_{2}, \cdots , c_{N-1}) \]where each terms comes from above equation. To show that this distribution is exchangeable, we will introduce some new notation. Let \( \mathbf{K}(c_{1:N}) \) be the number of groups in which these assignments place the customers, which is a number between 1 and \( N \). Let \( I_{k} \) be the set of indices of customers assigned to the \(k\)th group, and let \( N_{k} \) be the number of customers assigned to that group. Now, for a particular group \( k \) the joint probability of all assignments in this group is:
\[ \frac{\alpha}{I_{k,1}-1+\alpha} \frac{1}{I_{k,2}-1+\alpha} \frac{2}{I_{k,3}-1+\alpha} \cdots \frac{N_{k}-1}{I_{k,N}-1+\alpha} \]where each term in the equation represents a customer. The numerator can be re-written as \( \alpha (N_{k}-1)!\). Therefore, we have:
\[ p(c_{1}, c_{2}, \cdots , c_{N}) = \prod_{k=1}^{K} \frac{\alpha (N_{k}-1)!}{(I_{k,1}-1+\alpha)(I_{k,2}-1+\alpha)\cdots (I_{k,N_{k}}-1+\alpha)} \]Finally, notice that the union of \( \mathbf{I}_{k} \) across all groups \(k\) identifies each index once, because each customer is assigned to exactly one group. This simplifies the denominator and let us write the joint as:
\[ p(c_{1}, c_{2}, \cdots , c_{N}) = \frac{\alpha^{K} \prod_{k=1}^{K} (N_{k}-1)! }{\prod_{i=1}^{N} (i-1+\alpha)} \]This equation only depends on the number of groups \(\mathbf{K}\) and the size of each group \(\mathbf{N}_{k}\).

Some Recent Papers About Topic Models

In this post, I would like to talk about several recent papers about topic models. These papers may not belong to the same direction of applying or extending topic models. However, some of them are quite interesting and worth to be discussed here.

The first one is

Enhong Chen, Yanggang Lin, Hui Xiong, Qiming Luo, and Haiping Ma. 2011. Exploiting probabilistic topic models to improve text categorization under class imbalanceJournal of Information Processing and Management. 47, 2 (March 2011), 202-214.

The idea is straightforward and simple. The author proposed a two-step approach to mitigate the problem of unbalanced data. The first step is to learn topic models from the existing unbalanced data. Here, for each class label, a separate set of topics is learned. Once the models are obtained, synthetic documents or new samples are drawn from learned models. This is possible since topic distribution and word distribution are fixed after learning process. The number of new samples is determined by the difference between the dominant class and the rare class. A more aggressive method is also proposed, which is used to avoid noisy labeled data. The idea is to use all synthetic samples to train a classifier, rather than original samples. The experimental results demonstrate some performance improvement of this method over other ones that are proposed to tackle the same problem.

The second paper is

Wayne Xin Zhao, Jing Jiang, Hongfei Yan, and Xiaoming Li. 2010. Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP ’10). Association for Computational Linguistics, Stroudsburg, PA, USA, 56-65.

The paper is interesting because it also demonstrates a method to incorporate term-level features into a topic model. The list of features for each term is embedded through a Maximum Entropy Model. The supervised learning part of the model learns the fixing weights of these features and Gibbs sampling for the topic model uses these weights. For details, please refer to the paper.

The next one is

Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan and Xiaoming Li. Comparing Twitter and traditional media using topic models. In Proceedings of the 33rd European Conference on Information Retrieval (ECIR’11) (full paper), 2011.

The paper has several interesting aspects. First, it is claimed as a first study of topics obtained on Twitter and other traditional media. The authors use a standard LDA model to discover topics from NewYorkTimes corpus and a modified topic model for Twitter, separately. Then, they proposed a heuristic method to map Twitter topics onto NYT topics.  In addition, they manually assigned topic types to all the topics found by models. By doing all these, common topics and corpus-specific topics are obtained heuristically. It’s a little bit strange that they do not consider any techniques to mine topics from multiple corpus. Secondly, they do not compare to the method where only LDA is used. Note, the same Twitter-LDA is used in:

Xin Zhao, Jing Jiang, Jing He, Yang Song, Palakorn Achanauparp, Ee-Peng Lim and Xiaoming Li. Topical keyphrase extraction from Twitter. To appear in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT’11) (long paper), 2011.

 

Reviews on User Modeling in Topic Models

In this post, I would like to review several papers that wish to extend standard topic models with incorporating user information. The first paradigm or group of papers is introduced by M. Rosen-Zvi et al.

These three papers define a “Author-Topic” model, a simple extension of LDA. The generation process is as follows:

  1. For each document d:
    1. For each word position:
      1. Sample an author x uniformly sampled from the group of authors \mathbf{a}_{d} for this document.
      2. Sample an topic assignment z from per-author multinomial distribution over topics \theta_{x}.
      3. Sample a word w from topic z, a multinomial distribution over words.

The inference of the model is done by Gibbs Sampling. The biggest drawback of the model is that it loses the distribution over topics for documents. In “Learning Author-Topic Models from Text Corpora“, the authors proposed a heuristic solution to this problem: adding a fictitious author for each document. The second group of papers is from UMass.

They  proposed several models. The first one is “Author-Recipient-Topic” model, which is suitable for message data, like emails. The generation process is as follows:

  1. For each document d, we observe its author a_{d} and a set of recipients \mathbf{r}_{d}:
    1. For each word position:
      1. Sample a recipient x uniformly sampled from \mathbf{r}_{d}.
      2. Sample an topic assignment z from author-recipient multinomial distribution over topics \theta_{a_{d},x}.
      3. Sample a word w from topic z, a multinomial distribution over words.

This model is further extended into “Role-Author-Recipient-Topic” model. The idea is that each author or recipient may play different roles in the exchange of messages. Therefore, it is better to explicitly model them. Three possible variants are introduced. The first variant is that for each word position, we first sample a role for author and for the sampled recipient as well. Once the roles are sampled, the topic assignments are sampled from role-role pair-determined multinomial distribution over topics. The second variant is that only one role is generated for the author of the message. However, for recipients, each one has a role. For each word position, a recipient with his corresponding role is firstly sampled and a topic assignment is sampled from author-role author-role pair multinomial distribution over topics. The third variant is that all recipients share a single role. The third model is “Author-Persona-Topic” model. The generation process is as follows:

  1. For each author a:
    1. Sample a multinomial distribution over persona \eta_{a}.
    2. For each persona g, sample a multinomial distribution over topics \theta_{g}.
  2. For each document d with author a_{d}:
    1. Sample a persona g_{d} from \eta_{a_{d}}.
    2. For each word position:
      1. Sample an topic assignment z from \theta_{g_{d}}.
      2. Sample a word w from topic z, a multinomial distribution over words.

All these models do not have a per-document distribution for topics.

The third group of papers is from Noriaki Kawamae. Both models introduced in these papers extended the ideas of “Author-Topic” model and “Author-Persona-Topic” model in particular.

The first model is “Author-Interest-Topic” model. It introduced a notion of “document-class”. The authors have a distribution over document-classes and for each document class, it has a distribution over topics. Here, we can think of document-class as “persona” in previous models. For each document, it firstly samples a document-class from per-author distibution over document classes. Then, by using this document-class, we can draw topics from this particular class. The difference between ”Author-Interest-Topic” model and “Author-Persona-Topic” model is that the distribution over topics for each persona is under author level in “Author-Persona-Topic” but they are global variables in “Author-Interest Topic” model. The “Latent-Interest-Topic” model is much complicated than all previous models. It adds another layer of abstraction, author-classes. For each author, it has variable to indicate his author-class, which is drawn from a multinomial distribution. For each author-class, there is a multinomial distribution over topics. For each document, we first draw a document-class from its author’s per author-class distribution over document-classes. Then, the later generation process is same as “Author-Interest-Topic“. The key for “Author-Interest-Topic” and “Latent-Interest-Topic” models is that they are clustering models, in the sense that authors or documents are forced clustered into either author classes or document classes.

The last group of papers is from Jie Tang et al. All the proposed models are based on “Author-Topic” model.

They firstly proposed three variants of “Author-Conference-Topic” model. For each author, there is a multinomial distribution over topics. For each token in the document, an author is uniformly sampled and the topic assignment is sampled from per-author multinomial distribution over topics. The differences between three variants are how the conference stamp is generated. We omit the discussion here.

Reviews on Probabilistic Models for User Profiles

In this post, I would like to review some probabilistic models for user profiling. More specifically, I’m looking at the models that taking users’ preferences into account and try to predict certain quantities based on these preferences, which is a normal scenario for collaborative filtering.

  • Latent semantic models for collaborative filtering” by Thomas Hofmann, ACM Transactions on Information Systems, 2004
    The proposed model is based on pLSA. In order to incorporate ratings, the authors propose the following decomposition scheme:
    p(v|u,y) = \sum_{z}p(z|u)p(v|z,y)where p(v|z,y) follows Gaussian distribution. The paper also introduced practical techniques to normalize user ratings. The model is learned through (tempered) EM.
  • Modeling User Rating Profiles For Collaborative Filtering” by Benjamin Marlin, NIPS 2003
    The model proposed in the paper is essentially  LDA in the context of collaborative filtering. The original document-term matrix was replaced by a user-item (user-rating) matrix. Unlike this pLSA model for collaborative filtering, this model introduced the decomposition scheme as:p(r|u)=\sum_{z} p(r|z)p(z|u)where no “dummy” variable y gets involved. The model is learned through variational inference.
  • Flexible Mixture Model for Collaborative Filtering” by Luo Si and Rong Jin, ICML 2003
    The model proposed in the paper is an extension of two-side clustering model of pLSA. It assumes that users are belong to multiple clusters and items are also belong to multiple clusters. The rating of a particular item is based on the user clusters and item clusters. Therefore, p(x,y,r) = \sum_{z_{x}} \sum_{z_{y}} p(z_{x})p(y_{x})p(x|z_{x})p(y|z_{y})p(r|z_{x},z_{y})where z_{x} are latent factors for users and z_{y} are latent factors for items. All distributions here are multinomial distributions. The model is learned through EM.
    A full Bayesian treatment of the model is introduced in “Latent Grouping Models for User Preference Prediction” by Eerika Savia, Kai Puolamaki and Samuel Kaski in Machine Learning 2009, which is learned through Gibbs Sampling.
  • The Multiple Multiplicative Factor Model For Collaborative Filtering” by Benjamin Marlin and Richard S. Zemel, ICML 2004
    Rather than using a same set of latent factors to “explain” all ratings, the Multiple Multiplicative Factor Model (MMF) tries to use different latent factors to “explain” different ratings. Therefore, for each user, the model has a K-dimensional binary vector Z where each element z_{k} represents whether k-th factor is “active” or not. For rating X, the authors introduced to use softmax function to map arbitrary real values to simplex. The model is learned through variational inference.
  • Efficient Bayesian Hierarchical User Modeling For Recommendation Systems” by Yi Zhang and Jonanthan Koren, SIGIR 2007
    The model introduced in this paper is similar to the pLSA version of user profiles. The rating r of a item y by user u is decomposed as:p(r|y,u)=p(u)p(r|u,y)where p(u) is a Gaussian distribution and p(r|u,y) is modeled through a Generalized Linear Model r=u^{T}y+\epsilon, essentially another Gaussian distribution in the paper. The novel part of the model is that y is the document representation of an item. Therefore, the authors assume that the rating is weighted sum of terms of documents where the weights are user specific. The model is learned through EM.

A study about several variants of pLSA on collaborative filtering is done by Rong Jin, Luo Si and Chengxiang Zhai.

A Study of Mixture Models for Collaborative Filtering”  R. Jin, L. Si and C. X. Zhai, Information Retrieval Journal, 9(3), pp. 357-382, 2006

A study about how to normalize ratings is done by Rong Jin and Luo Si:

A Study of Methods for Normalizing User Ratings in Collaborative Filtering“ R. Jin and L. Si, The 27th Annual International ACM SIGIR Conference (SIGIR 2004), pp. 568-569, 2004

Two important distributions related to Dirichlet distribution

In this post, we review two important facts of Dirichlet distribution. The basic setting used here as follows. Suppose we have a discrete space \mathcal{X} = \{ \mathcal{X}_{1},\mathcal{X}_{2}, \ldots,\mathcal{X}_{K} \}. These are K outcomes can be observed from the “random experiments”. Suppose we have N observations \{X_{1}, X_{2},\ldots, X_{N}\} that are distributed according to a Multinomial distribution \theta where \theta_{i} = P(X_{j} = \mathcal{X}_{i}). After placing a Dirichlet distribution \mbox{Dir}(\alpha) on \theta, we are interested in the following two questions:

  • What is the posterior distribution P(\theta | \mathbf{X}, \alpha)?
  • What is the predictive distribution  P(X_{N+1} | \mathbf{X}, \alpha)?

Posterior Distribution

We can steadily derive the posterior distribution as below:

  P(\theta | \mathbf{X}, \alpha) = \frac{P(X|\theta)P(\theta|\alpha)}{\int_{\theta} P(X|\theta)P(\theta|\alpha) \, d\theta} \\  = \frac{\Bigr( \prod_{k=1}^{K} (\theta_{k})^{n(k)} \Bigl) \Bigr( \frac{\Gamma(\sum_{k} \alpha_{k})}{\prod_{k} \Gamma(\alpha_{k})} \prod_{k}^{K} \theta_{k}^{\alpha_{k} - 1} \Bigl) }{\int_{\theta} \Bigr( \prod_{k=1}^{K} (\theta_{k})^{n(k)} \Bigl) \Bigr( \frac{\Gamma(\sum_{k} \alpha_{k})}{\prod_{k} \Gamma(\alpha_{k})} \prod_{k}^{K} \theta_{k}^{\alpha_{k} - 1} \Bigl) \, d\theta } \\  = \frac{\Gamma \Bigr( \sum_{k} ( \alpha_{k} + n(k) ) \Bigl)}{\prod_{k}^{K} \Gamma \Bigr( \alpha_{k} + n(k) \Bigl)} \prod_{k}^{K} \theta_{k}^{\alpha_{k} + n(k) -1 }

where n(k) is the number of times outcome \mathcal{X}_{k} appear in the data. Therefore, the posterior distribution is indeed a Dirichlet distribution \mbox{Dir} \Bigr( \alpha_{1}+n(1),\alpha_{2}+n(2),\ldots,\alpha_{k}+n(k) \Bigl). If we only have one observation X, this posterior distribution is simply \mbox{Dir} \Bigr( \alpha_{1}+\delta_{1}(X),\alpha_{2}+\delta_{2}(X),\ldots,\alpha_{k}+\delta_{k}(X) \Bigl) where \delta_{k}(X) is 1 only if X takes the value of outcome \mathcal{X}_{k}. This notation can also extend to the general case that the posterior distribution is \mbox{Dir} \Bigr( \alpha_{1}+\sum_{i}^{N} \delta_{1}(X_{i}),\alpha_{2}+\sum_{i}^{N} \delta_{2}(X_{i}),\ldots,\alpha_{k}+\sum_{i}^{N} \delta_{k}(X_{i}) \Bigl).

Predictive Distribution

By using the posterior distribution derived above, we can have the predictive distribution as follows:

  P(X_{N+1} = \mathcal{X}_{i} | \mathbf{X}, \alpha) = \int_{\theta} P(X_{N+1} = \mathcal{X}_{i} | \theta) P(\theta | \mathbf{X}, \alpha) \, d\theta \\  =\int_{\theta} \theta_{i} \Biggr[ \frac{\Gamma \Bigr( \sum_{k} ( \alpha_{k} + n(k) ) \Bigl)}{\prod_{k}^{K} \Gamma \Bigr( \alpha_{k} + n(k) \Bigl)} \prod_{k}^{K} \theta_{k}^{\alpha_{k} + n(k) -1 } \Biggl] \, d\theta \\  = \mathbb{E}[\theta_{i}] \\  = \frac{\alpha_{i} + n(i)}{\sum_{k} \Bigr( \alpha_{k} + n(k) \Bigl)} = \frac{\alpha_{i} + \sum_{j}^{N} \delta_{i}(X_{j})}{\sum_{k} \alpha_{k} + N }

where the second last line is derived by observing that the posterior distribution is a Dirichlet distribution and the expression is essentially an expectation under that distribution. Note, the final line of the equation can be re-written into:

  \frac{\alpha_{i} + \sum_{j}^{N} \delta_{i}(X_{j})}{\sum_{k} \alpha_{k} + N } = \Bigr( \frac{\alpha_{i}}{\sum_{k} \alpha_{k}} \Bigl) \Bigr( \frac{\sum_{k} \alpha_{k}}{\sum_{k} \alpha_{k} + N} \Bigl) + \Bigr( \frac{1}{N} \sum_{j}^{N} \delta_{i}(X_{j}) \Bigl) \Bigr( \frac{N}{\sum_{k}\alpha_{k} + N }\Bigl)

It is the weighted summation of prior mean of \theta_{i} and the MLE of \theta_{i}. There is one extreme case for predictive distribution, which is that there is no data points before. By using the equation we derive above, we have:

  P(X_{N+1} = \mathcal{X}_{i} | \alpha) = \frac{\alpha_{i}}{\sum_{k} \alpha_{k}}

These two distributions are heavily used in Dirichlet/Multinomial Bayesian modeling and also are milestones for understanding Dirichlet Process.

Generate Synthetic Data for LDA

This post has a simple Python script to generate synthetic data for LDA. The purpose of this is to test some inference algorithms for LDA, e.g., Gibbs Sampling.

The script requires Numpy and Scipy.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
## This program is to generated synthetic data for LDA
import math
import random
import numpy
import numpy.random
import sys
 
## define some constant
TOPIC_N = 2
VOCABULARY_SIZE = 1000
DOC_NUM = 100
TERM_PER_DOC = 100
 
beta = [0.01 for i in range(VOCABULARY_SIZE)]
alpha = [0.9 for i in range(TOPIC_N)]
FILE_NAME = sys.argv[1]
 
phi = []
## generate multinomial distribution over words for each topic
for i in range(TOPIC_N):
	topic =	numpy.random.mtrand.dirichlet(beta, size = 1)
	phi.append(topic)
## generate words for each document
output_f = open(FILE_NAME+'.doc','w')
z_f = open(FILE_NAME+'.z','w')
theta_f = open(FILE_NAME+'.theta','w')
for i in range(DOC_NUM):
	buffer = {}
	z_buffer = {} ## keep track the true z
	## first sample theta
	theta = numpy.random.mtrand.dirichlet(alpha,size = 1)
	for j in range(TERM_PER_DOC):
		## first sample z
		z = numpy.random.multinomial(1,theta[0],size = 1)
		z_assignment = 0
		for k in range(TOPIC_N):
			if z[0][k] == 1:
				break
			z_assignment += 1
		if not z_assignment in z_buffer:
			z_buffer[z_assignment] = 0
		z_buffer[z_assignment] = z_buffer[z_assignment] + 1
		## sample a word from topic z
		w = numpy.random.multinomial(1,phi[z_assignment][0],size = 1)
		w_assignment = 0
		for k in range(VOCABULARY_SIZE):
			if w[0][k] == 1:
				break
			w_assignment += 1
		if not w_assignment in buffer:
			buffer[w_assignment] = 0
		buffer[w_assignment] = buffer[w_assignment] + 1
	## output
	output_f.write(str(i)+'\t'+str(TERM_PER_DOC)+'\t')
	for word_id, word_count in buffer.iteritems():
		output_f.write(str(word_id)+':'+str(word_count)+' ')
	output_f.write('\n')
	z_f.write(str(i)+'\t'+str(TERM_PER_DOC)+'\t')
	for z_id, z_count in z_buffer.iteritems():
		z_f.write(str(z_id)+':'+str(z_count)+' ')
	z_f.write('\n')
	theta_f.write(str(i)+'\t')
	for k in range(TOPIC_N):
		theta_f.write(str(k)+':'+str(theta[0][k])+' ')
	theta_f.write('\n')
z_f.close()
theta_f.close()
output_f.close()
 
## output phi
output_f = open(FILE_NAME+'.phi','w')
for i in range(TOPIC_N):
	output_f.write(str(i)+'\t')
	for j in range(VOCABULARY_SIZE):
		output_f.write(str(j)+':'+str(phi[i][0][j])+' ')
	output_f.write('\n')
output_f.close()
 
## output hyper-parameters
output_f = open(FILE_NAME+'.hyper','w')
output_f.write('TOPIC_N:'+str(TOPIC_N)+'\n')
output_f.write('VOCABULARY_SIZE:'+str(VOCABULARY_SIZE)+'\n')
output_f.write('DOC_NUM:'+str(DOC_NUM)+'\n')
output_f.write('TERM_PER_DOC:'+str(TERM_PER_DOC)+'\n')
output_f.write('alpha:'+str(alpha[0])+'\n')
output_f.write('beta:'+str(beta[0])+'\n')
output_f.close()

You can run the script by using the following command:

python generate_data.py test_data

where “test_data” is the output prefix for generated files.

Notes on Probabilistic Latent Semantic Analysis (PLSA)

Please note, I highly recommend you read the more detailed version of this note here.

Formulation of PLSA

There are two ways to formulate PLSA. They are equivalent but may lead to different inference process.

  1. P(d,w) = P(d) \sum_{z} P(w|z)P(z|d)
  2. P(d,w) = \sum_{z} P(w|z)P(d|z)P(z)

Let’s see why these two equations are equivalent by using Bayes rule.

P(z|d) = \frac{P(d|z)P(z)}{P(d)}
P(z|d)P(d) =P(d|z)P(z)
P(w|z)P(z|d)P(d) =P(w|z)P(d|z)P(z)
P(d) \sum_{z} P(w|z)P(z|d) = \sum_{z} P(w|z)P(d|z)P(z)

The whole data set is generated as (we assume that all words are generated independently):

D = \prod_{d} \prod_{w} P(d,w)^{n(d,w)}

The Log-likelihood of the whole data set for (1) and (2) are:

L_{1} = \sum_{d} \sum_{w} n(d,w) \log [ P(d) \sum_{z} P(w|z)P(z|d) ]

L_{2} = \sum_{d} \sum_{w} n(d,w) \log [ \sum_{z} P(w|z)P(d|z)P(z) ]

EM

For L_{1} or L_{2}, the optimization is hard due to the log of sum. Therefore, an algorithm called Expectation-Maximization is usually employed. Before we introduce anything about EM, please note that EM is only guarantee to find a local optimum (although it may be a global one).

First, we see how EM works in general. As we shown for PLSA, we usually want to estimate the likelihood of data, namely P(X|\theta), given the paramter \theta. The easiest way is to obtain a maximum likelihood estimator by maximizing P(X|\theta). However, sometimes, we also want to include some hidden variables which are usually useful for our task. Therefore, what we really want to maximize is P(X|\theta)=\sum_{z}P(X|z,\theta)P(z|\theta), the complete likelihood. Now, our attention becomes to this complete likelihood. Again, directly maximizing this likelihood is usually difficult. What we would like to show here is to obtain a lower bound of the likelihood and maximize this lower bound.

We need Jensen’s Inequality to help us obtain this lower bound. For any convex function f(x), Jensen’s Inequality states that :

\lambda f(x) + (1-\lambda) f(y) \geq f(\lambda x + (1-\lambda) y)

Thus, it is not difficult to show that :

E[f(x)] = \sum_{x} P(x) f(x) \geq f(\sum_{x} P(x) x) = f(E[x])

and for concave functions (like logarithm), it is :

E[f(x)] \leq f(E[x])

Back to our complete likelihood, we can obtain the following conclusion by using concave version of Jensen’s Inequality :

 \log \sum_{z}P(X|z,\theta)P(z|\theta)= \log \sum_{z}P(X|z,\theta)P(z|\theta)\frac{q(z)}{q(z)}
 = \log E[\frac{P(X|z,\theta)P(z|\theta)}{q(z)}]
 \geq E[\log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}]

Therefore, we obtained a lower bound of complete likelihood and we want to maximize it as tight as possible. EM is an algorithm that maximize this lower bound through a iterative fashion. Usually, EM first would fix current \theta value and maximize q(z) and then use the new q(z) value to obtain a new guess on \theta, which is essentially a two stage maximization process. The first step can be shown as follows:

E[\log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}] = \sum_{z} q(z) \log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}
= \sum_{z} q(z) \log \frac{P(z|X,\theta)P(X,\theta)}{q(z)}
= \sum_{z} q(z) \log P(x,\theta) + \sum_{z} q(z) \log \frac{P(z|X,\theta)}{q(z)}
= \log P(x,\theta) - \sum_{z} q(z) \log \frac{q(z)}{P(z|X,\theta)}
= \log P(x,\theta) - KL(q(z) || P(z|X,\theta))

The first term is the same for all z. Therefore, in order to maximize the whole equation, we need to minimize KL divergence between q(z) and P(z|X,\theta), which eventually leads to the optimum solution of q(z) = P(z|X,\theta). So, usually for E-step, we use current guess of \theta to calculate the posterior distribution of hidden variable as the new update score. For M-step, it is problem-dependent. We will see how to do that in later discussions.

Another explanation of EM is in terms of optimizing a so-called Q function. We devise the data generation process as P(X|\theta)=P(X,H|\theta)=P(H|X,\theta)P(X|\theta). Therefore, the complete likelihood is modified as:

L_{c}(\theta) = \log P(X,H|\theta) = \log P(X|\theta) + \log P(H|X,\theta) = L(\theta) + \log P(H|X,\theta)

Think about how to maximize L_{c}(\theta). Instead of directly maximizing it, we can iteratively maximize L_{c}(\theta^{(n+1)})-L_{c}(\theta^{(n)}) as :

L(\theta) - L(\theta^{(n)}) = L_{c}(\theta) - \log P(H|X,\theta) - L_{c}(\theta^{(n)}) + \log P(H|X,\theta^{(n)})
= L_{c}(\theta) - L_{c}(\theta^{(n)}) + \log \frac{P(H|X,\theta^{(n)})}{P(H|X,\theta)}

Now take the expectation of this equation, we have:

L(\theta) - L(\theta^{(n)}) = \sum_{H} L_{c}(\theta)P(H|X,\theta^{(n)}) - \sum_{H} L_{c}(\theta^{(n)})P(H|X,\theta^{(n)}) + \sum_{H} P(H|X,\theta^{(n)})\log \frac{P(H|X,\theta^{(n)})}{P(H|X,\theta)}

The last term is always non-negative since it can be recognized as the KL-divergence of P(H|X,\theta^{(n)} and P(H|X,\theta). Therefore, we obtain a lower bound of Likelihood :

L(\theta) \geq \sum_{H} L_{c}(\theta)P(H|X,\theta^{(n)}) + L(\theta^{(n)}) - \sum_{H} L_{c}(\theta^{(n)})P(H|X,\theta^{(n)})

The last two terms can be treated as constants as they do not contain the variable \theta, so the lower bound is essentially the first term, which is also sometimes called as “Q-function”.
Q(\theta;\theta^{(n)}) = E(L_{c}(\theta)) = \sum_{H} L_{c}(\theta) P (H|X,\theta^{(n)})

EM of Formulation 1

In case of Formulation 1, let us introduce hidden variables R(z,w,d) to indicate which hidden topic z is selected to generated w in d (\sum_{z} R(z,w,d) = 1). Therefore, the complete likelihood can be formulated as :

L_{c1} = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log [ P(d) P(w|z)P(z|d) ]
= \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) [ \log P(d) + \log P(w|z) + \log P(z|d) ]

From the equation above, we can write our Q-function for the complete likelihood E[L_{c1}]:

 E[L_{c1}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z|w,d) [ \log P(d) + \log P(w|z) + \log P(z|d) ]

For E-step, simply using Bayes Rule, we can obtain:

P(z|w,d) = \frac{P(w|z,d)}{P(w,d)}
= \frac{P(w|z)P(z|d)P(d)}{\sum_{z} P(w|z)P(z|d)P(d)}
= \frac{P(w|z)P(z|d)}{\sum_{z} P(w|z)P(z|d)}

For M-step, we need to maximize Q-function, which needs to be incorporated with other constraints:

H = E[L_{c1}]+ \alpha [1-\sum_{d} P(d) ]+ \beta \sum_{z}[1- \sum_{w} P(w|z)]
+\gamma \sum_{d}[1- \sum_{z} P(z|d)]

and take all derivatives:

\frac{\partial H}{\partial P(d)} = \sum_{w} \sum_{z} n(d,w) \frac{P(z|w,d)}{P(d)} - \alpha = 0
\rightarrow \sum_{w} \sum_{z} n(d,w) P(z|w,d) - \alpha P(d) = 0
\frac{\partial H}{\partial P(w|z)} = \sum_{d} n(d,w) \frac{P(z|w,d)}{P(w|z)} - \beta = 0
\rightarrow \sum_{d} n(d,w) P(z|w,d) - \beta P(w|z) = 0
\frac{\partial H}{\partial P(z|d)} = \sum_{w} n(d,w) \frac{P(z|w,d)}{P(z|d)} - \gamma = 0
\rightarrow \sum_{w} n(d,w) P(z|w,d) - \gamma P(z|d) = 0

Therefore, we can easily obtain:

P(d) = \frac{\sum_{w} \sum_{z} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w) P(z|w,d)}
= \frac{n(d)}{\sum_{d} n(d)}
P(w|z) = \frac{\sum_{d} n(d,w) P(z|w,d)}{\sum_{w} \sum_{d} n(d,w) P(z|w,d) }
P(z|d) = \frac{\sum_{w} n(d,w) P(z|w,d)}{\sum_{z} \sum_{w} n(d,w) P(z|w,d) }
= \frac{\sum_{w} n(d,w) P(z|w,d)}{n(d)}

EM of Formulation 2

Use similar method to introduce hidden variables to indicate which z is selected to generated w and d and we can have the following complete likelihood :

L_{c2} = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log [ P(z) P(w|z)P(d|z) ]
= \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) [ \log P(z) + \log P(w|z) + \log P(d|z) ]

Therefore, the Q-function E[L_{c2}] would be :

E[L_{c2}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z|w,d) [ \log P(z) + \log P(w|z) + \log P(d|z) ]

For E-step, again, simply using Bayes Rule, we can obtain:

P(z|w,d) = \frac{P(w|z,d)}{P(w,d)}
= \frac{P(w|z)P(d|z)P(z)}{\sum_{z} P(w|z)P(d|z)P(z)}

For M-step, we maximize the constraint version of Q-function:

H = E[L_{c2}] + \alpha [1-\sum_{z} P(z) ] + \beta \sum_{z}[1- \sum_{w} P(w|z)]+
+\gamma \sum_{z} [1- \sum_{d} P(d|z)]

and take all derivatives:

\frac{\partial H}{\partial P(z)}= \sum_{d} \sum_{w} n(d,w) \frac{P(z|w,d)}{P(z)} - \alpha = 0
\rightarrow \sum_{d} \sum_{w} n(d,w) P(z|w,d) - \alpha P(z)= 0
  [latex]\rightarrow \sum_{d} n(d,w) P(z|w,d) - \beta P(w|z) = 0
\frac{\partial H}{\partial P(d|z)} = \sum_{w} n(d,w) \frac{P(z|w,d)}{P(d|z)} - \gamma = 0
\rightarrow \sum_{w} n(d,w) P(z|w,d) - \gamma P(d|z) = 0

Therefore, we can easily obtain:

P(z) = \frac{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w) P(z|w,d)}
= \frac{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} n(d,w)}
P(w|z)= \frac{\sum_{d} n(d,w) P(z|w,d)}{\sum_{w} \sum_{d} n(d,w) P(z|w,d) }
P(d|z) = \frac{\sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} n(d,w) P(z|w,d) }