# Topic Model

## Slice Sampling for LDA Hyperparameters

For latent Dirichlet allocation (LDA) hyper-parameters, a typical approach is to utilize Monte-Carlo EM approach where E-step is approximated by Gibbs sampling while M-step is to perform a gradient-based optimization approach to optimize Dirichlet parameters. Such approach is implemented in Mallet package. For the detailed information about estimating Dirichlet parameters, see Tom Minka’s technical report.

Another less explored approach is to go with full Bayesian notion, interleaving sampling topic assignments and hyper-parameters. To sample hyper-parameters, as there is no closed-form solution, Slice Sampling is usually utilized. For introduction for Slice Sampling, see Radford M. Neal’s paper. For its application in LDA, please see the following two references:

We implemented the method in Fugue’s LDA algorithm.

## Tricks in Sampling Discrete Distributions (2) – Binary Search1

As mentioned in the previous post, it is tricky to sample from discrete distributions, here we demonstrate yet another important trick to do it right. No matter you do it in the original space, or in the log-space. Basically, you can easily come up some code snippet like this (we are using Java as an example here):

1 2 3 4 5 6 7 8 9 public int sample(){ double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1]; int index = -1; for (index = 0; index > p.length; index++) { if (u > p[index]) break; } return index; }

where $$p$$ is the accumulated un-normalized probabilities. The time complexity is $$\mathcal{O}(N)$$ when $$N$$ equals the number of items in the array $$p$$.

It turns out that, the above code can be easily optimized to $$\mathcal{O}(\log N )$$ by using Binary Search. The reason is quite simple. The accumulated un-normalized probabilities, which are stored in $$p$$, by its definition, are sorted. Therefore, binary search can be utilized. In particular, we want to find the smallest key that is greater than the random generated number $$u$$. This function is called ceiling in Algorithms in Section 3.1. We implemented it in our context as follows:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 public int sample(){ double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1]; int lower = 0; int upper = p.length - 1; while (lower >= upper){ int mid = lower + (upper - lower) / 2; if((p[mid] - u) > 0){ upper = mid - 1; } else{ lower = mid + 1; } } return lower; }

Interestingly, even though this trick seems trivial, it is not mentioned in many literature and only discussed:

## Tricks in Sampling Discrete Distributions (1) – Sampling in Log-Space

One critical component in Gibbs sampling for complex graphical models is to be able to draw samples from discrete distributions. Take latent Dirichlet allocation (LDA) as an example, the main computation focus is to draw samples from the following distribution:

P(z_{i} = k \, | \, \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto (n_{d,k}^{-i} + \alpha_{k}) \frac{n_{w_{i}, k}^{-i} + \beta_{w_{i}}}{n_{., k}^{-i} + \sum_{v} \beta_{v}}\label{eq:lda}

where $$n_{d,k}^{-i}$$ is the number of tokens in the document $$d$$ assigned to the topic $$k$$, excluding the token $$i$$, $$n_{w_{i}, k}^{-i}$$ is the number of times token $$w_{i}$$ assigned to the topic $$k$$, excluding $$i$$, and $$n_{.,k}^{-i}$$ is the total number of tokens assigned to the topic $$k$$.

So, a straightforward sampling algorithm works as follows:

1. Let $$c_{k}$$ be the right-hand side of Equation \eqref{eq:lda} for topic $$k$$, which is an un-normalized probability.
2. We compute the accumulated weights as: $$C[i] = C[i-1] + c_{i} , \,\, \forall i \in (0, K-1]$$ and $$C[0] = c_{0}$$.
3. Draw $$u \sim \mathcal{U}(0, C[K-1] )$$ and find $$t = \arg\min_{i} \left( x_{i} \right)$$ where $$x_{i} = C[i] – u$$ and $$x_{i} > 0$$

The last line is essentially to find the minimum index that the array value is greater than the random number $$u$$.

One difficulty to deal with \eqref{eq:lda} is that the right hand side might be too small and therefore overflow (thinking about too many near-zero numbers multiplying). Thus, we want to deal with probabilities in log-space. We start to work with:

\log P(z_{i} = k \, | \, \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto \log (n_{d,k}^{-i} + \alpha_{k}) + \log (n_{w_{i}, k}^{-i} + \beta_{w_{i}}) – \log ( n_{., k}^{-i} + \sum_{v} \beta_{v} )\label{eq:log_lda}
and store in $$c_{k}$$ but remember that now each value represents un-normalized log probability. The next step is to compute the accumulated weights, this time as accumulated probabilities but in log-space! Thanks to the trick mentioned in [Notes on Calculating Log Sum of Exponentials], we are able to compute log sum efficiently. Please use the last equation there to compute the accumulated weights. The last step is to draw the random number. We compute $$u \sim \mathcal{U}(0, 1) + \log C[K-1]$$ and again find the minimum index that satisfy the array value is greater than $$u$$.

Notes:

1. The log-sampling algorithm for LDA is implemented in Fugue Topic Modeling Package.
2. Unless you really face the issue of overflow, sampling in log-space is usually much slower than the original space as log and exp are expensive functions to compute.

## Grid Representations of Texts

One ever-lasting theme of text mining is to model text in a proper space. A simple yet powerful space is in a one-dimension space where words are generated from an uncorrelated array of terms. This representation is usually called bag-of-word (BOW) representation.

BOW is also a foundation to more complicated approaches like latent Dirichlet allocation (LDA) or latent Semantic indexing (LSI). In such models, terms in a document are essentially generated from different 1D arrays (topics) where each array has its distinct distribution over the whole vocabulary. However, like simple BOW, these 1D arrays are uncorrelated.

The idea of grid representations of texts is essentially to model correlations between words. One such example is called “Multi-dimensional counting grids” [1] and its “admixture” version of the model “Componential counting grids” [2]. The basic assumption here is that, texts are generated by a moving window of grids where each grid is a distribution over terms. One benefit of such representation is that it would be more natural to handle thematic shifts in the framework. Also, depending on the window size, the n-gram effect is also automatically considered.

Although from the surface, the true advantage of such representation over BOW is not very clear, it is nevertheless an interesting idea to explore.

## References

[1] Jojic, N., Perina, A.: Multidimensional counting grids: Inferring word order from disordered bags of words: In UAI. 2011 547-556 [PDF]
[2] Perina, A., Jojic, N., Bicego, M. and Turski, A.: Documents as Multiple Overlapping Windows into a Grid of Counts: In NIPS 2013 [PDF]

## New MAP Estimation of latent Dirichlet allocation

A recent published paper entitled “On Estimation and Selection for Topic Models” by Matthew A. Taddy on AISTATS 2012 is interesting.

Firstly, this paper tries to introduce a new way to do inference for latent Dirichlet allocation (LDA). Traditionally, variational inference and Gibbs sampling are widely used to perform Bayesian inference for LDA. For a simpler version of LDA, Probabilistic Latent Semantic Analysis (PLSA), a Maximum Likelihood Estimation (MLE) via EM algorithm is usually used where no prior distributions are used at all. Both Bayesian inference and MLE can be treated as two extremes to learning a LDA or PLSA as MLE may be prone to over-fitting but Bayesian inference is a little bit overly complicated. The parameterization of the model is slightly different from normal settings. Let $$\mathbf{x}_{i}$$ be a vector of counts in $$V$$ categories (terms) for document $$i$$. Basically, each element in this vector is the number of times term $$v$$ appearing in document $$i$$. We also have the total term count $$m_{i} = \sum_{j}^{V} x_{i,j}$$. Thus, the $$K$$-topic model has the following generative story:

\begin{aligned} \mathbf{x}_{i} \sim \mbox{Multinomial}(\boldsymbol{\theta}_{i,1} \boldsymbol{\phi}_{1} + \cdots + \boldsymbol{\theta}_{i,K} \boldsymbol{\phi}_{K}, m_{i}) \end{aligned}

where $$\boldsymbol{\theta}_{i}$$ is a per-document distribution over topics and $$\boldsymbol{\phi}_{k}$$ is a distribution over words. Of course, this setting is indeed exactly same as previously published settings. However, we can easily view the term counts as a function of total counts explicitly. Again, Dirichlet distributions are put on $$\boldsymbol{\theta}$$ and $$\boldsymbol{\phi}$$. The paper discusses a joint posterior maximization by using EM algorithm. The author treats the term count vector $$\mathbf{x}_{i}$$ as:

\begin{aligned} \mathbf{x}_{i} \sim \mbox{Multinomial}(\boldsymbol{\phi}_{1}, t_{i,1}) + \cdots + \mbox{Multinomial}(\boldsymbol{\phi}_{K}, t_{i,K}) \end{aligned}

where $$\mathbf{t}_{i} \sim \mbox{Multinomial}(\boldsymbol{\theta}_{i}, m_{i})$$ and these $$\mathbf{t}_{i}$$ are treated as missing-data for EM algorithm. Thus, missing-data is no longer single topic assignments for each term but aggregated topic counts for the whole documents. The detailed derivations of the corresponding EM algorithm is in the paper.

The second point of the paper is to propose some new method to perform model selection, which is always difficult for topic models in some sense. The whole idea is to find $$K$$ that can maximize $$P(\mathbf{X} \, | \, K)$$, which is usually approximated through MCMC. In this paper, the author details Laplace’s method to approximate this probability. Again, details are in the paper. The last point made by the paper is to use multinomial dispersion to evaluate whether the model fits the data, which is standard in statistics but never used in topic modeling literature.

The experiments are somewhat interesting. The author demonstrated that MAP is almost always comparable to variational EM (VEM) and sometimes slightly better than Gibbs sampling. Sometimes, MAP is even better than VEM.

Note, this is NOT the first paper to propose a MAP learning for topic models. Here’s one at least:

J. T. Chien, J. T. Chien, M. S. Wu, and M. S. Wu. Adaptive Bayesian Latent Semantic Analysis. Audio, Speech, and Language Processing, IEEE Transactions on [seealso Speech and Audio Processing, IEEE Transactions on], 16(1):198–207, 2008.

## Topic Models meet Latent Factor Models

There is a trend in research communities to bring two well-established classes of models together, topic models and latent factor models. By doing so, we may enjoy the ability to analyze text information with topic models and incorporate the collaborative filtering analysis with latent factor models. In this section, I wish to discuss some of these efforts.

Three papers will be covered in this post are listed at the end of the post. Before that, let’s first review what latent factor models are. Latent factor models (LFM) are usually used in collaborative filtering context. Say, we have a user-item rating matrix $$\mathbf{R}$$ where $$r_{ij}$$ represents the rating user $$i$$ gives to item $$j$$. Now, we assume for each user $$i$$, there is a vector $$\mathbf{u}_{i}$$ with the dimensionality $$k$$, representing the user in a latent space. Similarly, we assume for each item $$j$$, a vector $$\mathbf{v}_{j}$$ with the same dimensionality representing the item in a same latent space. Thus, the rating $$r_{ij}$$ is therefore represented as:
$r_{ij} = \mathbf{u}_{i}^{T} \mathbf{v}_{j}$This is the basic setting for LFM. In addition to this basic setting, additional biases can be incorporated, see here. For topic models (TM), the simplest case is Latent Dirichlet Allocation (LDA). The story of LDA is like this. For a document $$d$$, we first sample a multinomial distribution $$\boldsymbol{\theta}_{d}$$, which is a distribution over all possible topics. For each term position $$w$$ in the document, we sample a discrete topic assignment $$z$$ from $$\boldsymbol{\theta}_{d}$$, indicating which topic we use for this term. Then, we sample a term $$v$$ from a topic $$\boldsymbol{\beta}$$, a multinomial distribution over the vocabulary.

For both LFM and TM, they are methods to reduce original data into latent spaces. Therefore, it might be possible to link them together. Especially, items in the LFM are associated with rich text information. One natural idea is that, for an item $$j$$, the latent factor $$\mathbf{v}_{j}$$ and its topic proportional parameter $$\boldsymbol{\theta}_{j}$$ somehow gets connected. One way is to directly equalize these two variables. Since $$\mathbf{v}_{j}$$ is a real-value variable and $$\boldsymbol{\theta}_{j}$$ falls into a simplex, we need certain ways to keep these properties. Two possible methods can be used:

1. Keep $$\boldsymbol{\theta}_{j}$$ and make sure it is in the range of [0, 1] in the optimization process. Essentially put some constraint on the parameter.
2. Keep $$\mathbf{v}_{j}$$ and use logistic transformation to transfer a real-valued vector into simplex.

Hanhuai and Banerjee showed the second technique in their paper by combining Correlated Topic Model with LFM. Wang and Blei argued that this setting suffers from the limitation that it cannot distinguish topics for explaining recommendations from topics important for explaining content since the latent space is strictly equal. Thus, they proposed a slightly different approach. Namely, each $$\mathbf{v}_{j}$$ derives from $$\boldsymbol{\theta}_{j}$$ with item-dependent noise:
$\mathbf{v}_{j} = \boldsymbol{\theta}_{j} + \epsilon_{j}$ where $$\epsilon_{j}$$ is a Gaussian noise.

A different approach is to not directly equal these two quantities but let me impact these each other. One such way explored by Hanhuai and Banerjee is that $$\boldsymbol{\theta}_{j}$$ influences how $$\mathbf{v}_{j}$$ is generated. More specifically, in Probabilistic Matrix Factorization (PMF) setting, all $$\mathbf{v}$$s are generated by a Gaussian distribution with a fixed mean and variance. Now, by combining LDA, the authors allow different topic has different Gaussian prior mean and variance values. A value similar to $$z$$ is firstly generated from $$\boldsymbol{\theta}_{j}$$ to decide which mean to use and then generate $$\mathbf{v}_{j}$$ from that particular mean and variance.

A totally different direction was taken by Agarwal and Chen. In their fLDA paper, there is no direct relationship between item latent factor and content latent factor. In fact, their relationship is realized by the predictive equation:
$r_{ij} = \mathbf{a}^{T} \mathbf{u}_{i} + \mathbf{b}^{T} \mathbf{v}_{j} + \mathbf{s}_{i}^{T} \bar{\mathbf{z}}_{j}$where $$\mathbf{a}$$, $$\mathbf{b}$$ and $$\mathbf{s}_{i}$$ are regression weights and $$\bar{\mathbf{z}}_{j}$$ is the average topic assignments for item $$j$$. Note, $$\mathbf{s}_{i}$$ is a user-dependent regression weights. This formalism encodes the notion that all latent factors (including content) will contribute to the rating, not only item and user factors.

In summary, three directions have been taken for integrating TM and LFM:

1. Equal item latent factor and topic proportion vector, or make some Gaussian noise.
2. Let topic proportion vector to control the prior distribution for item latent factor.
3. Let item latent factor and topic assignments, as well as user latent factor, contribute the rating.

Reference:

• Deepak Agarwal and Bee-Chung Chen. 2010. fLDA: matrix factorization through latent dirichlet allocation. In Proceedings of the third ACM international conference on Web search and data mining (WSDM ’10). ACM, New York, NY, USA, 91-100. [PDF]
• Hanhuai Shan and Arindam Banerjee. 2010. Generalized Probabilistic Matrix Factorizations for Collaborative Filtering. In Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM ’10). IEEE Computer Society, Washington, DC, USA, 1025-1030. [PDF]
• Chong Wang and David M. Blei. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’11). ACM, New York, NY, USA, 448-456.[PDF]

## An Easy Reading Tutorial for Bayesian Non-parametric Models

The “god-father” of LDA, David Blei, recently published a tutorial on Bayesian Non-parametric Models, with one of his student. The whole tutorial is easy-reading and provides very clear overview of Bayesian Non-parametric Models. In particular, Chinese Restaurant Process (CRP) and Indian Buffet Process are discussed in a very intuitive way. For those who are interests in technical details about these models, this tutorial may be just a starting point and the Appendix points out several ways to discuss models more formally, including inference algorithms.

One specific interesting property shown in this tutorial is the “exchangeable” property for CRP, which I wish to re-state as below.

Let $$c_{n}$$ be the table assignment of the $$n$$th customer. A draw from CPR can be generated by sequentially assigning observations to classes with probability:
$P(c_{n} = k | \mathbf{c}_{1:n-1}) = \begin{cases} \frac{m_{k}}{n-1+\alpha}, & \mbox{if } k \leq \mathbf{K}_{+} \mbox{ (i.e., k is a previously occupied table)} \\ \frac{\alpha}{n-1+\alpha}, & \mbox{otherwise (i.e., k is the next unoccupied table)} \end{cases}$where $$m_{k}$$ is the number of customers sitting at table $$k$$, and $$\mathbf{K}_{+}$$ is the number of tables for which $$m_{k} > 0$$. The parameter $$\alpha$$ is called the concentration parameter. The CRP exhibits an important invariance property: The cluster assignments under this distribution are exchangeable. This means $$p(\mathbf{c})$$ is unchanged if the order of customers is shuffled.

Consider the joint distribution of a set of customer assignments $$c_{1:N}$$. It decomposes according to the chain rule:
$p(c_{1}, c_{2}, \cdots , c_{N}) = p(c_{1}) p(c_{2} | c_{1}) \cdots p(c_{N} | c_{1}, c_{2}, \cdots , c_{N-1})$where each terms comes from above equation. To show that this distribution is exchangeable, we will introduce some new notation. Let $$\mathbf{K}(c_{1:N})$$ be the number of groups in which these assignments place the customers, which is a number between 1 and $$N$$. Let $$I_{k}$$ be the set of indices of customers assigned to the $$k$$th group, and let $$N_{k}$$ be the number of customers assigned to that group. Now, for a particular group $$k$$ the joint probability of all assignments in this group is:
$\frac{\alpha}{I_{k,1}-1+\alpha} \frac{1}{I_{k,2}-1+\alpha} \frac{2}{I_{k,3}-1+\alpha} \cdots \frac{N_{k}-1}{I_{k,N}-1+\alpha}$where each term in the equation represents a customer. The numerator can be re-written as $$\alpha (N_{k}-1)!$$. Therefore, we have:
$p(c_{1}, c_{2}, \cdots , c_{N}) = \prod_{k=1}^{K} \frac{\alpha (N_{k}-1)!}{(I_{k,1}-1+\alpha)(I_{k,2}-1+\alpha)\cdots (I_{k,N_{k}}-1+\alpha)}$Finally, notice that the union of $$\mathbf{I}_{k}$$ across all groups $$k$$ identifies each index once, because each customer is assigned to exactly one group. This simplifies the denominator and let us write the joint as:
$p(c_{1}, c_{2}, \cdots , c_{N}) = \frac{\alpha^{K} \prod_{k=1}^{K} (N_{k}-1)! }{\prod_{i=1}^{N} (i-1+\alpha)}$This equation only depends on the number of groups $$\mathbf{K}$$ and the size of each group $$\mathbf{N}_{k}$$.

## Some Recent Papers About Topic Models

In this post, I would like to talk about several recent papers about topic models. These papers may not belong to the same direction of applying or extending topic models. However, some of them are quite interesting and worth to be discussed here.

The first one is

Enhong Chen, Yanggang Lin, Hui Xiong, Qiming Luo, and Haiping Ma. 2011. Exploiting probabilistic topic models to improve text categorization under class imbalanceJournal of Information Processing and Management. 47, 2 (March 2011), 202-214.

The idea is straightforward and simple. The author proposed a two-step approach to mitigate the problem of unbalanced data. The first step is to learn topic models from the existing unbalanced data. Here, for each class label, a separate set of topics is learned. Once the models are obtained, synthetic documents or new samples are drawn from learned models. This is possible since topic distribution and word distribution are fixed after learning process. The number of new samples is determined by the difference between the dominant class and the rare class. A more aggressive method is also proposed, which is used to avoid noisy labeled data. The idea is to use all synthetic samples to train a classifier, rather than original samples. The experimental results demonstrate some performance improvement of this method over other ones that are proposed to tackle the same problem.

The second paper is

Wayne Xin Zhao, Jing Jiang, Hongfei Yan, and Xiaoming Li. 2010. Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP ’10). Association for Computational Linguistics, Stroudsburg, PA, USA, 56-65.

The paper is interesting because it also demonstrates a method to incorporate term-level features into a topic model. The list of features for each term is embedded through a Maximum Entropy Model. The supervised learning part of the model learns the fixing weights of these features and Gibbs sampling for the topic model uses these weights. For details, please refer to the paper.

The next one is

Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan and Xiaoming Li. Comparing Twitter and traditional media using topic models. In Proceedings of the 33rd European Conference on Information Retrieval (ECIR’11) (full paper), 2011.

The paper has several interesting aspects. First, it is claimed as a first study of topics obtained on Twitter and other traditional media. The authors use a standard LDA model to discover topics from NewYorkTimes corpus and a modified topic model for Twitter, separately. Then, they proposed a heuristic method to map Twitter topics onto NYT topics.  In addition, they manually assigned topic types to all the topics found by models. By doing all these, common topics and corpus-specific topics are obtained heuristically. It’s a little bit strange that they do not consider any techniques to mine topics from multiple corpus. Secondly, they do not compare to the method where only LDA is used. Note, the same Twitter-LDA is used in:

Xin Zhao, Jing Jiang, Jing He, Yang Song, Palakorn Achanauparp, Ee-Peng Lim and Xiaoming Li. Topical keyphrase extraction from Twitter. To appear in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT’11) (long paper), 2011.

## Reviews on User Modeling in Topic Models

In this post, I would like to review several papers that wish to extend standard topic models with incorporating user information. The first paradigm or group of papers is introduced by M. Rosen-Zvi et al.

These three papers define a “Author-Topic” model, a simple extension of LDA. The generation process is as follows:

1. For each document $latex d$:
1. For each word position:
1. Sample an author $latex x$ uniformly sampled from the group of authors for this document.
2. Sample an topic assignment $latex z$ from per-author multinomial distribution over topics $latex \theta_{x}$.
3. Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.

The inference of the model is done by Gibbs Sampling. The biggest drawback of the model is that it loses the distribution over topics for documents. In “Learning Author-Topic Models from Text Corpora“, the authors proposed a heuristic solution to this problem: adding a fictitious author for each document. The second group of papers is from UMass.

They  proposed several models. The first one is “Author-Recipient-Topic” model, which is suitable for message data, like emails. The generation process is as follows:

1. For each document $latex d$, we observe its author $latex a_{d}$ and a set of recipients $latex \mathbf{r}_{d}$:
1. For each word position:
1. Sample a recipient $latex x$ uniformly sampled from $latex \mathbf{r}_{d}$.
2. Sample an topic assignment $latex z$ from author-recipient multinomial distribution over topics $latex \theta_{a_{d},x}$.
3. Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.

This model is further extended into “Role-Author-Recipient-Topic” model. The idea is that each author or recipient may play different roles in the exchange of messages. Therefore, it is better to explicitly model them. Three possible variants are introduced. The first variant is that for each word position, we first sample a role for author and for the sampled recipient as well. Once the roles are sampled, the topic assignments are sampled from role-role pair-determined multinomial distribution over topics. The second variant is that only one role is generated for the author of the message. However, for recipients, each one has a role. For each word position, a recipient with his corresponding role is firstly sampled and a topic assignment is sampled from author-role author-role pair multinomial distribution over topics. The third variant is that all recipients share a single role. The third model is “Author-Persona-Topic” model. The generation process is as follows:

1. For each author $latex a$:
1. Sample a multinomial distribution over persona $latex \eta_{a}$.
2. For each persona $latex g$, sample a multinomial distribution over topics $latex \theta_{g}$.
2. For each document $latex d$ with author $latex a_{d}$:
1. Sample a persona $latex g_{d}$ from $latex \eta_{a_{d}}$.
2. For each word position:
1. Sample an topic assignment $latex z$ from $latex \theta_{g_{d}}$.
2. Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.

All these models do not have a per-document distribution for topics.

The third group of papers is from Noriaki Kawamae. Both models introduced in these papers extended the ideas of “Author-Topic” model and “Author-Persona-Topic” model in particular.

The first model is “Author-Interest-Topic” model. It introduced a notion of “document-class”. The authors have a distribution over document-classes and for each document class, it has a distribution over topics. Here, we can think of document-class as “persona” in previous models. For each document, it firstly samples a document-class from per-author distibution over document classes. Then, by using this document-class, we can draw topics from this particular class. The difference between “Author-Interest-Topic” model and “Author-Persona-Topic” model is that the distribution over topics for each persona is under author level in “Author-Persona-Topic” but they are global variables in “Author-Interest Topic” model. The “Latent-Interest-Topic” model is much complicated than all previous models. It adds another layer of abstraction, author-classes. For each author, it has variable to indicate his author-class, which is drawn from a multinomial distribution. For each author-class, there is a multinomial distribution over topics. For each document, we first draw a document-class from its author’s per author-class distribution over document-classes. Then, the later generation process is same as “Author-Interest-Topic“. The key for “Author-Interest-Topic” and “Latent-Interest-Topic” models is that they are clustering models, in the sense that authors or documents are forced clustered into either author classes or document classes.

The last group of papers is from Jie Tang et al. All the proposed models are based on “Author-Topic” model.

They firstly proposed three variants of “Author-Conference-Topic” model. For each author, there is a multinomial distribution over topics. For each token in the document, an author is uniformly sampled and the topic assignment is sampled from per-author multinomial distribution over topics. The differences between three variants are how the conference stamp is generated. We omit the discussion here.

## Reviews on Probabilistic Models for User Profiles

In this post, I would like to review some probabilistic models for user profiling. More specifically, I’m looking at the models that taking users’ preferences into account and try to predict certain quantities based on these preferences, which is a normal scenario for collaborative filtering.

• Latent semantic models for collaborative filtering” by Thomas Hofmann, ACM Transactions on Information Systems, 2004
The proposed model is based on pLSA. In order to incorporate ratings, the authors propose the following decomposition scheme:
$$p(v|u,y) = \sum_{z}p(z|u)p(v|z,y)$$ where $$p(v|z,y)$$ follows Gaussian distribution. The paper also introduced practical techniques to normalize user ratings. The model is learned through (tempered) EM.
• Modeling User Rating Profiles For Collaborative Filtering” by Benjamin Marlin, NIPS 2003
The model proposed in the paper is essentially  LDA in the context of collaborative filtering. The original document-term matrix was replaced by a user-item (user-rating) matrix. Unlike this pLSA model for collaborative filtering, this model introduced the decomposition scheme as:$$p(r|u)=\sum_{z} p(r|z)p(z|u)$$where no “dummy” variable $$y$$ gets involved. The model is learned through variational inference.
• Flexible Mixture Model for Collaborative Filtering” by Luo Si and Rong Jin, ICML 2003
The model proposed in the paper is an extension of two-side clustering model of pLSA. It assumes that users are belong to multiple clusters and items are also belong to multiple clusters. The rating of a particular item is based on the user clusters and item clusters. Therefore, $$p(x,y,r) = \sum_{z_{x}} \sum_{z_{y}} p(z_{x})p(y_{x})p(x|z_{x})p(y|z_{y})p(r|z_{x},z_{y})$$ where $$z_{x}$$ are latent factors for users and $$z_{y}$$ are latent factors for items. All distributions here are multinomial distributions. The model is learned through EM.
A full Bayesian treatment of the model is introduced in “Latent Grouping Models for User Preference Prediction” by Eerika Savia, Kai Puolamaki and Samuel Kaski in Machine Learning 2009, which is learned through Gibbs Sampling.
• The Multiple Multiplicative Factor Model For Collaborative Filtering” by Benjamin Marlin and Richard S. Zemel, ICML 2004
Rather than using a same set of latent factors to “explain” all ratings, the Multiple Multiplicative Factor Model (MMF) tries to use different latent factors to “explain” different ratings. Therefore, for each user, the model has a $$K$$-dimensional binary vector $$Z$$ where each element $$z_{k}$$ represents whether $$k$$-th factor is “active” or not. For rating $$X$$, the authors introduced to use softmax function to map arbitrary real values to simplex. The model is learned through variational inference.
• Efficient Bayesian Hierarchical User Modeling For Recommendation Systems” by Yi Zhang and Jonanthan Koren, SIGIR 2007
The model introduced in this paper is similar to the pLSA version of user profiles. The rating $$r$$ of a item $$y$$ by user $$u$$ is decomposed as:$$p(r|y,u)=p(u)p(r|u,y)$$ where $$p(u)$$ is a Gaussian distribution and $$p(r|u,y)$$ is modeled through a Generalized Linear Model $$r=u^{T}y+\epsilon$$, essentially another Gaussian distribution in the paper. The novel part of the model is that $$y$$ is the document representation of an item. Therefore, the authors assume that the rating is weighted sum of terms of documents where the weights are user specific. The model is learned through EM.

A study about several variants of pLSA on collaborative filtering is done by Rong Jin, Luo Si and Chengxiang Zhai.

A study about how to normalize ratings is done by Rong Jin and Luo Si: