# Collaborative Filtering

## Structured Sparse Regression for Recommender Systems

Feature-based latent factor models have received increasing attention in recent years due to their capability to effectively solve the cold-start problem. Many feature-based collaborative filtering (CF) models have been proposed recently, and they can be grouped into two categories. The first category includes all variants of latent factor models (LFM), which have proven to be an effective approach to personalization and recommender systems. The core of LFM is to learn user-specific and item-specific features from user-item interactions and utilize these features for future predictions/recommendations. State-of-the-art LFM exploit low-rank latent spaces of users and items and treat the latent factors learned from user-item historical data as features. Models of this type have achieved significant success in a number of applications, including the Netflix competition. The second category is the factorization machine (FM), which explicitly learns a mapping function from features to rating scores, circumventing the dependency on user/item latent factors found in latent factor models and resulting in an effective model for the cold-start problem [9].

Although these feature-based CF models have been shown to be effective, they do not utilize feature structure information. For example, conventional latent factor models (e.g., matrix factorization or tensor factorization models) like RLFM [1, 2] learn mapping functions from user/item features to user/item latent vectors, assuming the features have a flat first-order structure. Later, [10] showed that this kind of mapping can be extended to any non-linear model. Although the formalism is flexible, it leaves too much room for practitioners to choose which non-linear model to use for a particular application. It is also hard to incorporate human prior knowledge about the feature structure into the framework except through careful feature engineering, and the proposed inference algorithm is difficult to use in large-scale settings. Similar to RLFM, Gantner et al. [4] proposed a model that explicitly learns the mapping function from features to latent factors, resulting in an effective model for the cold-start problem, but it still makes the flat first-order feature structure assumption. In another line of work, Rendle et al. [9] proposed a more compact model called the factorization machine (FM). FM is essentially a second-order regression model that directly maps the concatenated user-item-event features to a rating score by learning implicit mapping functions from features to latent factors, again yielding an effective model for the cold-start problem. However, encoding structural human prior information still boils down to sophisticated feature engineering, and it is unclear how to incorporate heterogeneous feature structures into model training to improve rating prediction performance. Moreover, although FM considers second-order feature structure, it simply uses all feature pairs for prediction.

A large body of work on sparse coding has shown that many natural signals have a sparse representation in terms of basic components, and that a sparse model often outperforms a dense one while also providing a variable selection effect. Inspired by this, a sparse model that uses an appropriate subset of feature pairs might achieve better performance. In practice, human prior knowledge or explicit structure information about the features is also sometimes available. For example, the topical categories of news articles may naturally be organized into hierarchies or graphs. Another good example is demographic information about users, especially geo-locations aligned with the countries, states, and cities defined in real-world geo-political settings. Such prior knowledge and structures are invaluable for better user understanding and profiling, and eventually for better recommendation results. However, it is not straightforward to encode this kind of structural prior knowledge into state-of-the-art recommendation models like RLFM and FM. One approach might be to construct features capturing these structures and embed them into regression models, but the interplay between constructing such features optimally and training a better regression model that maps them to latent features becomes non-trivial. Some previous work has imposed structural information on latent variable models, not necessarily directed graphical models. For instance, He et al. [5] proposed a general learning framework that induces sparsity on an undirected graphical model imposed on the vector of latent factors. Although that paper shares a similar idea with our framework, their work cannot handle heterogeneous types of features and complex dependencies, and it is hard to link it to the state-of-the-art LFM used in CF settings. Along the same line of undirected graphical models, Min et al.
[8] proposed sparse high-order Boltzmann machines, aiming to capture dependencies between latent variables. Again, it is not obvious how to plug that model into state-of-the-art CF approaches. In this paper, we propose a structured sparse second-order regression model that incorporates structural prior knowledge in a principled way. The notion of feature types is introduced so that different types of features can have different structures (e.g., topical categories versus geographical locations). We consider two kinds of structures. For inter-typed features (features of different kinds), the model learns sparse relationships between different types of features (e.g., age and gender). For intra-typed features (features of the same kind) that form a hierarchy (tree), e.g., the country-state-city hierarchy for geo-location feature terms, the model learns a sparse hierarchical structure on the tree such that if a parent feature edge (or, interchangeably, a feature pair) is selected, then all its descendant edges are selected as well, and if a parent edge is removed, then all its descendant edges are removed too.
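To make the intra-typed constraint concrete, here is a minimal sketch (pure Python, with a hypothetical country-state-city edge hierarchy) that checks whether a set of selected feature edges is consistent with the tree-structured sparsity rule, in the sense that removing a parent edge removes its whole subtree:

```python
def hierarchically_valid(selected, parent):
    """A selection is consistent with the tree-structured sparsity rule
    if every selected edge also has all of its ancestor edges selected;
    equivalently, removing a parent edge removes all of its descendants."""
    for edge in selected:
        p = parent.get(edge)
        while p is not None:
            if p not in selected:
                return False
            p = parent.get(p)
    return True

# Hypothetical geo-location hierarchy: country -> state -> city edges.
parent = {"state": "country", "city": "state"}
```

For instance, selecting only the city edge without its state and country ancestors would violate the constraint, while any prefix of the path from the root is valid.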

You can view the paper in detail here: [PDF].

### Reference

1. D. Agarwal and B.-C. Chen. Regression-based latent factor models. In Proceedings of KDD, pages 19–28. ACM, 2009.
2. D. Agarwal and B.-C. Chen. fLDA: matrix factorization through latent Dirichlet allocation. In Proceedings of WSDM, pages 91–100. ACM, 2010.
3. T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu. SVDFeature: A toolkit for feature-based collaborative filtering. The Journal of Machine Learning Research, 13(1):3619–3622, Dec. 2012.
4. Z. Gantner, L. Drumond, C. Freudenthaler, S. Rendle, and L. Schmidt-Thieme. Learning attribute-to-feature mappings for cold-start recommendations. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, pages 176–185, 2010.
5. Y. He, Y. Qi, K. Kavukcuoglu, and H. Park. Learning the dependency structure of latent factors. In Advances in Neural Information Processing Systems 25, pages 2375–2383. 2012.
6. R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. The Journal of Machine Learning Research, 12:2297–2334, 2011.
7. Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, Aug. 2009.
8. M. R. Min, X. Ning, C. Cheng, and M. Gerstein. Interpretable sparse high-order boltzmann machines. In AISTATS, pages 614–622, 2014.
9. S. Rendle. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology, 3(3):57:1–57:22, May 2012.
10. L. Zhang, D. Agarwal, and B.-C. Chen. Generalizing matrix factorization through flexible regression priors. In Proceedings of RecSys, pages 13–20. ACM, 2011.
11. P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, pages 3468–3497, 2009.

## Poisson Matrix Factorization

In recent years, Matrix Factorization (MF) methods have played a pivotal role in research on Collaborative Filtering and Recommender Systems. The basic assumption is that the data can be arranged into an $$N \times M$$ matrix $$X$$, where $$N$$ is the number of users and $$M$$ is the number of items. Each rating $$X_{i,j}$$ is modeled as the dot product of the latent factor $$\theta_{i}$$ for user $$i$$ and the latent factor $$\phi_{j}$$ for item $$j$$. In classic Probabilistic Matrix Factorization, the rating $$X_{i,j}$$ is defined as:

\begin{align}
X_{i,j} \sim \mathcal{N}(\theta_{i}^{T}\phi_{j}, \sigma^{2})
\end{align}
where each user factor and each item factor is drawn from another set of Gaussians:
\begin{align}
\theta_{i} \sim \mathcal{N}(0, \sigma_{U}^{2}I) \\
\phi_{j} \sim \mathcal{N}(0, \sigma_{I}^{2}I)
\end{align}This definition has two issues:

1. It does not prevent ratings from becoming negative, which is a natural consequence of the Gaussian assumption.
2. If no heuristics are applied, the model needs to explain all zero ratings as well, which dramatically impacts predictive performance.

To demonstrate the second point, note that the log-likelihood of the data can be given as:

\begin{align}
\log P(X \, | \, \theta, \phi ) = - \frac{1}{2\sigma^{2}} \sum_{i} \sum_{j} ( X_{i,j} - \theta_{i}^{T}\phi_{j})^{2} - \frac{1}{2} N \times M \log \sigma^{2} + C
\end{align}
where $$C$$ is a constant. Clearly, even if $$X_{i,j}$$ is zero, the model still needs to explain that rating. In [1], the authors use an indicator variable to mark whether each rating is observed and simply ignore the unobserved ones, which may be effective but is not justified within the model. In [2], the authors do not ignore unobserved or zero ratings; instead, they assign them a small confidence weight, which is still a heuristic.
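A minimal pure-Python sketch of the log-likelihood above (the toy factors and ratings are made up) makes the point concrete: the squared-error sum runs over every cell, so zero entries are penalized exactly like observed ratings:

```python
import math

def pmf_log_likelihood(X, theta, phi, sigma2):
    """Gaussian log-likelihood of a full rating matrix under PMF.
    Every cell, including zeros, enters the squared-error sum."""
    N, M = len(X), len(X[0])
    sq_err = 0.0
    for i in range(N):
        for j in range(M):
            pred = sum(t * p for t, p in zip(theta[i], phi[j]))
            sq_err += (X[i][j] - pred) ** 2
    return (-sq_err / (2.0 * sigma2)
            - 0.5 * N * M * math.log(sigma2))  # additive constant C dropped

# Toy example: 2 users, 3 items, 2 latent dimensions.
X = [[5.0, 0.0, 3.0],
     [0.0, 4.0, 0.0]]
theta = [[1.0, 0.5], [0.2, 1.0]]
phi = [[2.0, 1.0], [1.0, 2.0], [0.5, 0.5]]

ll_all = pmf_log_likelihood(X, theta, phi, sigma2=1.0)
```

Changing any zero cell changes the likelihood, which is exactly why the model is forced to "explain" the zeros.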

Recently, Prem Gopalan et al. [3, 4, 5, 6] proposed a new model called Poisson Factorization (PF) to address these two issues. The central idea is to replace the Gaussian assumption with a Poisson distribution:

\begin{align}
X_{i,j} \sim \mathrm{Poisson}(\theta_{i}^{T}\phi_{j})
\end{align}
where $$\theta_{i}$$ and $$\phi_{j}$$ are drawn from user-specific and item-specific Gamma distributions. One obvious consequence of this model is that $$X_{i,j}$$ is modeled as a non-negative value and the latent factors are non-negative as well. Regarding the second issue mentioned above, the probability of the data is:

\begin{align}
P(X_{i,j} \, | \, \theta_{i}, \phi_{j}) = (\theta_{i}^{T}\phi_{j})^{X_{i,j}}\frac{\exp \left\{ -\theta_{i}^{T} \phi_{j} \right\}}{X_{i,j}!}
\end{align}
and therefore, summing over all ratings, the log probability is:

\begin{align}
\log P(X \, | \, \theta, \phi) = \sum_{i,j} \left\{ X_{i,j}\log \left( \theta_{i}^{T}\phi_{j} \right) - \log X_{i,j}! - \theta_{i}^{T} \phi_{j} \right\}
\end{align}
As $$0!=1$$, when $$X_{i,j}=0$$ the first term is zero and the second term vanishes as well ($$\log 0! = 0$$); thus only entries with $$X_{i,j}>0$$ contribute data-dependent terms. Therefore, as mentioned in the papers, the model by definition supports sparse matrices.
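The same log-likelihood can be sketched in a few lines of pure Python (toy factors and ratings are made up), with the zero-entry terms skipped exactly as the algebra above allows:

```python
import math

def pf_log_likelihood(X, theta, phi):
    """Poisson log-likelihood of a rating matrix.

    The data-dependent terms x*log(rate) - log(x!) vanish for x = 0
    (since 0! = 1), so only non-zero entries, plus the rate term
    -theta^T phi, contribute."""
    ll = 0.0
    for i in range(len(X)):
        for j in range(len(X[0])):
            rate = sum(t * p for t, p in zip(theta[i], phi[j]))
            x = X[i][j]
            if x > 0:
                ll += x * math.log(rate) - math.lgamma(x + 1)  # lgamma(x+1) = log(x!)
            ll -= rate
    return ll

# Toy numbers: one user, two items, two latent dimensions.
theta = [[0.5, 1.0]]
phi = [[1.0, 0.5], [2.0, 1.0]]
X = [[0, 3]]
ll = pf_log_likelihood(X, theta, phi)
```

Note that the remaining $$-\sum_{i,j}\theta_{i}^{T}\phi_{j}$$ term factorizes as $$(\sum_{i}\theta_{i})^{T}(\sum_{j}\phi_{j})$$, so in practice even it can be computed without touching the zero cells.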

Another interesting property of PF, mentioned in [5, 6], is that we can rewrite the Poisson observation model as a two-stage process: a user $$u$$ first decides on a budget $$b_{u}$$ to spend on items, and then spends this budget rating the items she is interested in:
\begin{align}
b_{u} &\sim \mathrm{Poisson}\left(\theta_{u}^{T}\sum_{j} \phi_{j}\right) \nonumber \\
[X_{u1},\cdots,X_{uM}] &\sim \mathrm{Multinomial}\left(b_{u}, \left[\frac{\theta_{u}^{T}\phi_{1}}{\theta_{u}^{T}\sum_{j}\phi_{j}}, \cdots, \frac{\theta_{u}^{T}\phi_{M}}{\theta_{u}^{T}\sum_{j}\phi_{j}}\right]\right)
\end{align}This shows that learning a PF model for user-item ratings is effectively the same as learning a budget for each user while also learning how that budget is distributed across items.
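A quick numerical check (toy vectors, pure Python) of why the two views agree in expectation: under the two-stage process, $$\mathbb{E}[X_{uj}] = \mathbb{E}[b_{u}] \cdot p_{j}$$, and the budget normalizer cancels, so the per-item mean collapses back to $$\theta_{u}^{T}\phi_{j}$$, the Poisson rate of the original model:

```python
def two_stage_means(theta_u, phis):
    """Expected per-item counts under the budget/multinomial view."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    rates = [dot(theta_u, phi) for phi in phis]   # theta_u^T phi_j per item
    budget_rate = sum(rates)                      # E[b_u]
    probs = [r / budget_rate for r in rates]      # multinomial probabilities
    return [budget_rate * p for p in probs]       # E[b_u] * p_j

# Made-up toy factors: one user, three items, two latent dimensions.
theta_u = [0.3, 0.7]
phis = [[1.0, 0.2], [0.5, 0.5], [2.0, 1.0]]
```

The expected counts returned by `two_stage_means` match $$\theta_{u}^{T}\phi_{j}$$ item by item.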

### Reference

1. Ruslan Salakhutdinov, Andriy Mnih: Probabilistic Matrix Factorization. NIPS 2007: 1257-1264
2. Yifan Hu, Yehuda Koren, Chris Volinsky: Collaborative Filtering for Implicit Feedback Datasets. ICDM 2008: 263-272
3. Prem Gopalan, Laurent Charlin, David M. Blei: Content-based Recommendations with Poisson Factorization. NIPS 2014: 3176-3184
4. Prem Gopalan, Francisco J. Ruiz, Rajesh Ranganath, David M. Blei: Bayesian Nonparametric Poisson Factorization for Recommendation Systems. AISTATS 2014: 275-283
5. Prem Gopalan, Jake M. Hofman, David M. Blei: Scalable Recommendation with Poisson Factorization. CoRR abs/1311.1704 (2013)
6. Prem Gopalan, Jake M. Hofman, David M. Blei: Scalable Recommendation with Hierarchical Poisson Factorization. UAI 2015.

## Weighted Approximately Ranked Pairwise loss (WARP)

### Definition
To focus more on the top of the ranked list, where the top $$k$$ positions are those we care about via the precision at $$k$$ measure, one can weight the pairwise violations depending on their position in the ranked list. For the pairwise learning procedure, we construct the set of all positively labelled instances, denoted $$\mathcal{C}_{u}^{+}$$, and the set of negatively labelled instances, $$\mathcal{C}_{u}^{-}$$. The loss is defined as:
\begin{align} \mbox{err}_{\mbox{WARP}}(\mathbf{x}_{i}, y_{i}) = L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))] \end{align}where $$rank(f(y_{i} \, | \, \mathbf{x}_{i}))$$ measures how many negatively labelled instances are “wrongly” ranked higher than the positive example $$\mathbf{x}_{i}$$:
\begin{align} rank(f(y_{i} \, | \, \mathbf{x}_{i})) = \sum_{(\mathbf{x}^{\prime}, y^{\prime}) \in \mathcal{C}_{u}^{-}} \mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(y \, | \, \mathbf{x}_{i})] \nonumber \end{align}where $$\mathbb{I}(x)$$ is the indicator function, and $$L(\cdot)$$ transforms this rank into a loss:
\begin{align} L(r) = \sum_{j=1}^{r} \tau_{j}, \mbox{with} \; \tau_{1} \geq \tau_{2} \geq \cdots \geq 0. \end{align}Different choices of $$\tau$$ assign different importance to the relative position of the positive examples in the ranked list. In particular:

• For $$\tau_{i} = 1$$ for all $$i$$, we obtain the same AUC optimization as the margin ranking criterion.
• For $$\tau_{1} = 1$$ and $$\tau_{i > 1} = 0$$ the precision at $$1$$ is optimized.
• For $$\tau_{i \leq k} = 1$$ and $$\tau_{i > k}=0$$ the precision at $$k$$ is optimized.
• For $$\tau_{i} = 1/i$$ a smooth weighting over positions is given, where most weight is given to the top position, with rapidly decaying weight for lower positions. This is useful when one wants to optimize precision at $$k$$ for a variety of different values of $$k$$ at once.

The loss discussed above is also equal to:
\begin{align} \mbox{err}_{\mbox{WARP}}(\mathbf{x}_{i}, y_{i}) &= L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))] \times 1\nonumber \\ &= \frac{L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))] \sum_{(\mathbf{x}^{\prime}, y^{\prime}) \in \mathcal{C}_{u}^{-}} \mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(y_{i} \, | \, \mathbf{x}_{i})]}{rank(f(y_{i} \, | \, \mathbf{x}_{i}))} \nonumber \\ &= \sum_{(\mathbf{x}^{\prime}, y^{\prime}) \in \mathcal{C}_{u}^{-}} \frac{L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))] \, \mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(y_{i} \, | \, \mathbf{x}_{i})]}{rank(f(y_{i} \, | \, \mathbf{x}_{i}))} \end{align}with the convention $$0/0=0$$ when the correct label $$y_{i}$$ is top-ranked. Given a label $$y_{i}$$ from the positive set, the $$rank$$ function is the total number of labels from the negative set that violate the desired ordering. The probability of drawing a particular negative label, given a particular positive label, is:
\begin{align} P((y^{\prime}, \mathbf{x}^{\prime}) \, | \, (y_{i}, \mathbf{x}_{i})) = \frac{1}{rank(f(y_{i} \, | \, \mathbf{x}_{i}))} \end{align}Due to the discrete nature of the indicator function, we can replace it with the hinge loss:
\begin{align} \mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(y_{i} \, | \, \mathbf{x}_{i})] \approx \max(0, 1 - f(y_{i} \, | \, \mathbf{x}_{i}) + f(y^{\prime} \, | \, \mathbf{x}^{\prime})) \end{align}
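The definitions above can be sketched directly in a few lines of pure Python (the scores are made up, and $$\tau_i = 1/i$$ is just one of the choices listed above):

```python
def warp_rank(score_pos, neg_scores):
    """Number of negative scores ranked at or above the positive score."""
    return sum(1 for s in neg_scores if s >= score_pos)

def warp_loss(score_pos, neg_scores, tau):
    """err_WARP = L(rank): sum of the first `rank` position weights,
    with tau[0] >= tau[1] >= ... >= 0."""
    return sum(tau[:warp_rank(score_pos, neg_scores)])

# tau_i = 1/i: the smooth weighting that favors the top of the list.
tau = [1.0 / i for i in range(1, 101)]
```

With two violating negatives, for example, the loss is $$\tau_{1} + \tau_{2} = 1 + 1/2$$; with no violations it is zero.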

### Online Learning to Rank
The overall risk we want to minimize is:
\begin{align} Risk(f) = \int \hat{\mbox{err}}_{\mbox{WARP}}(\mathbf{x},y) dP(\mathbf{x},y) \end{align}An unbiased estimator of this risk can be obtained by stochastically sampling in the following way:

1. Sample a positive pair $$(\mathbf{x},y)$$ according to $$P(\mathbf{x},y)$$.
2. For the chosen $$(\mathbf{x},y)$$ sample a negative instance $$(\mathbf{x}^{\prime},y^{\prime})$$ such that $$1+f(y^{\prime} \, | \, \mathbf{x}^{\prime}) > f(y \, | \, \mathbf{x})$$.

This chosen negative instance, together with the positive instance, contributes:
\begin{align}\label{eq:warp_single_contribution} L[rank(f(y \, | \, \mathbf{x}))] \max(0,1 - f(y \, | \, \mathbf{x}) + f(y^{\prime} \, | \, \mathbf{x}^{\prime})) \end{align}to the total risk, i.e., taking the expectation of these contributions approximates the risk, because we have probability $$\frac{1}{rank(f(y \, | \, \mathbf{x}))}$$ of drawing $$(\mathbf{x}^{\prime}, y^{\prime})$$ in step 2 above (recall that the $$rank$$ function counts the number of violating negative instances). This suggests that we can perform a stochastic update over the parameters.

For large datasets, the stochastic optimization steps discussed above are inefficient:

1. In step (2) above, we need to compute $$f(y^{\prime} \, | \, \mathbf{x}^{\prime})$$ for all possible negative instances.
2. In the contribution above, we need to calculate the $$rank$$ function, which also requires computing $$f$$ for all negative instances.

Some approximations can be used. For step (2), we can sample labels with replacement until we find a violating label.

Now, if there are $$k = rank(f(y\, | \, \mathbf{x}))$$ violating labels, let the random variable $$N_{k}$$ denote the number of trials in the sampling step needed to obtain a violating label. This random variable follows a geometric distribution with parameter (here, the assumption is that the probability of sampling a violating label is the same on every trial):
\begin{align} \frac{k}{Y-1} \end{align}where $$Y$$ is the total number of labels. Thus, $$\mathbb{E}[N_{k}] = \frac{Y-1}{k}$$, i.e., $$k=\frac{Y-1}{\mathbb{E}[N_{k}]}$$. This suggests that the value of the $$rank$$ function may be approximated by:
\begin{align} rank(f(y \, | \, \mathbf{x})) \approx \left \lfloor \frac{Y-1}{N} \right \rfloor \end{align}where $$N$$ is the number of trials in the sampling.

References:
[1] Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35, October 2010.
[2] Nicolas Usunier, David Buffoni, and Patrick Gallinari. Ranking with ordered weighted pairwise classification. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1057–1064, New York, NY, USA, 2009. ACM.

## Topic Models meet Latent Factor Models

There is a trend in the research community to bring together two well-established classes of models: topic models and latent factor models. By doing so, we can enjoy the ability to analyze text information with topic models while incorporating collaborative filtering analysis with latent factor models. In this section, I wish to discuss some of these efforts.

The three papers covered in this post are listed at the end. Before that, let's first review what latent factor models are. Latent factor models (LFM) are usually used in the collaborative filtering context. Say we have a user-item rating matrix $$\mathbf{R}$$ where $$r_{ij}$$ represents the rating user $$i$$ gives to item $$j$$. We assume that for each user $$i$$ there is a vector $$\mathbf{u}_{i}$$ of dimensionality $$k$$ representing the user in a latent space, and similarly that for each item $$j$$ there is a vector $$\mathbf{v}_{j}$$ of the same dimensionality representing the item in the same latent space. The rating $$r_{ij}$$ is then represented as:
$$r_{ij} = \mathbf{u}_{i}^{T} \mathbf{v}_{j}$$ This is the basic setting for LFM; on top of it, additional bias terms can be incorporated. For topic models (TM), the simplest case is Latent Dirichlet Allocation (LDA). The story of LDA goes like this: for a document $$d$$, we first sample a multinomial distribution $$\boldsymbol{\theta}_{d}$$, which is a distribution over all possible topics. For each term position $$w$$ in the document, we sample a discrete topic assignment $$z$$ from $$\boldsymbol{\theta}_{d}$$, indicating which topic to use for that term. Then, we sample a term $$v$$ from the topic $$\boldsymbol{\beta}_{z}$$, a multinomial distribution over the vocabulary.

Both LFM and TM are methods that reduce the original data into latent spaces, so it is natural to try to link them, especially when the items in the LFM are associated with rich text information. One natural idea is that, for an item $$j$$, the latent factor $$\mathbf{v}_{j}$$ and its topic proportion parameter $$\boldsymbol{\theta}_{j}$$ are somehow connected. One way is to directly equate these two variables. Since $$\mathbf{v}_{j}$$ is real-valued and $$\boldsymbol{\theta}_{j}$$ lies on a simplex, we need some way to preserve these properties. Two possible methods:

1. Keep $$\boldsymbol{\theta}_{j}$$ and constrain it to the range $$[0, 1]$$ during optimization, essentially putting a constraint on the parameter.
2. Keep $$\mathbf{v}_{j}$$ and use a logistic transformation to map the real-valued vector into the simplex.
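The second option is essentially the softmax map; a small sketch in pure Python:

```python
import math

def to_simplex(v):
    """Logistic (softmax) transformation of a real-valued vector onto
    the probability simplex; subtracting the max keeps exp() stable."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]
```

The output is always a valid topic-proportion vector: strictly positive entries that sum to one, with the ordering of the input preserved.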

Hanhuai and Banerjee showed the second technique in their paper by combining the Correlated Topic Model with LFM. Wang and Blei argued that this setting suffers from the limitation that it cannot distinguish topics for explaining recommendations from topics for explaining content, since the two latent spaces are strictly equal. Thus, they proposed a slightly different approach: each $$\mathbf{v}_{j}$$ is derived from $$\boldsymbol{\theta}_{j}$$ with item-dependent noise:
$$\mathbf{v}_{j} = \boldsymbol{\theta}_{j} + \epsilon_{j}$$ where $$\epsilon_{j}$$ is Gaussian noise.

A different approach is not to directly equate these two quantities, but to let them influence each other. One such way, explored by Hanhuai and Banerjee, is to let $$\boldsymbol{\theta}_{j}$$ influence how $$\mathbf{v}_{j}$$ is generated. More specifically, in the Probabilistic Matrix Factorization (PMF) setting, all $$\mathbf{v}$$s are generated by a Gaussian distribution with a fixed mean and variance. By combining with LDA, the authors allow each topic to have its own Gaussian prior mean and variance. A value similar to $$z$$ is first generated from $$\boldsymbol{\theta}_{j}$$ to decide which mean to use, and then $$\mathbf{v}_{j}$$ is generated from that particular mean and variance.

A totally different direction was taken by Agarwal and Chen. In their fLDA paper, there is no direct relationship between the item latent factor and the content latent factor; instead, their relationship is realized through the predictive equation:
$$r_{ij} = \mathbf{a}^{T} \mathbf{u}_{i} + \mathbf{b}^{T} \mathbf{v}_{j} + \mathbf{s}_{i}^{T} \bar{\mathbf{z}}_{j}$$ where $$\mathbf{a}$$, $$\mathbf{b}$$ and $$\mathbf{s}_{i}$$ are regression weights and $$\bar{\mathbf{z}}_{j}$$ is the average topic assignment for item $$j$$. Note that $$\mathbf{s}_{i}$$ is a user-dependent regression weight vector. This formalism encodes the notion that all latent factors (including content) contribute to the rating, not only the item and user factors.
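The predictive equation is just a sum of three dot products; a toy sketch in pure Python (all weights and factors below are made up for illustration):

```python
def flda_rating(a, b, s_i, u_i, v_j, zbar_j):
    """r_ij = a^T u_i + b^T v_j + s_i^T zbar_j (fLDA predictive equation)."""
    dot = lambda x, y: sum(p * q for p, q in zip(x, y))
    return dot(a, u_i) + dot(b, v_j) + dot(s_i, zbar_j)

# Made-up global weights a, b; user-specific weights s_i;
# user/item factors u_i, v_j; and average topic assignments zbar_j.
a, b = [1.0, 0.0], [0.0, 1.0]
u_i, v_j = [2.0, 3.0], [1.0, 4.0]
s_i, zbar_j = [0.5, 0.5], [0.2, 0.8]
```

The third term is what ties content to ratings: the same $$\bar{\mathbf{z}}_{j}$$ that summarizes the item's text also enters the rating prediction, weighted per user by $$\mathbf{s}_{i}$$.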

In summary, three directions have been taken for integrating TM and LFM:

1. Equate the item latent factor and the topic proportion vector, possibly up to Gaussian noise.
2. Let the topic proportion vector control the prior distribution of the item latent factor.
3. Let the item latent factor and topic assignments, as well as the user latent factor, jointly contribute to the rating.

Reference:

• Deepak Agarwal and Bee-Chung Chen. 2010. fLDA: matrix factorization through latent dirichlet allocation. In Proceedings of the third ACM international conference on Web search and data mining (WSDM ’10). ACM, New York, NY, USA, 91-100. [PDF]
• Hanhuai Shan and Arindam Banerjee. 2010. Generalized Probabilistic Matrix Factorizations for Collaborative Filtering. In Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM ’10). IEEE Computer Society, Washington, DC, USA, 1025-1030. [PDF]
• Chong Wang and David M. Blei. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’11). ACM, New York, NY, USA, 448-456.[PDF]

## Reviews on Binary Matrix Decomposition

In this post, I would like to review several existing techniques for binary matrix decomposition.

• Andrew I. Schein, Lawrence K. Saul, and Lyle H. Ungar. A Generalized Linear Model for Principal Component Analysis of Binary Data. In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics. January 3-6, 2003. Key West, FL.
This paper introduced a logistic version of PCA for binary data. The model assumes that each observation comes from a single latent factor, and that there exist multiple latent factors. The model is quite straightforward, and inference is done by Alternating Least Squares.
• Tao Li. 2005. A general model for clustering binary data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (KDD ’05). ACM, New York, NY, USA, 188-197.
In this paper, the author introduced the problem of "binary data decomposition". The paper adapted several techniques popular for normal matrix factorization, like k-means and spectral clustering, to binary data. The proposed method factorizes the binary matrix into two binary matrices, where the binary indicators suggest membership.
• Tomas Singliar and Milos Hauskrecht. 2006. Noisy-OR Component Analysis and its Application to Link Analysis. J. Mach. Learn. Res. 7 (December 2006), 2189-2213.
This paper introduced a probabilistic view of binary data. Like other latent factor models, each observation is viewed as a sample from multiple binary latent Bernoulli factors, essentially a mixture model. Variational inference is conducted in the paper. The weak part of the paper is that the comparison of the model with PLSA and LDA is not quite convincing.
• Zhongyuan Zhang, Tao Li, Chris Ding, and Xiangsun Zhang. 2007. Binary Matrix Factorization with Applications. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining (ICDM ’07). IEEE Computer Society, Washington, DC, USA, 391-400.
This paper introduced a variant of Non-negative Matrix Factorization for binary data, meaning that a binary matrix is always decomposed into two matrices bounded between 0 and 1. The proposed method is a modification of NMF. However, on a document clustering problem, the performance difference between the proposed method and standard NMF is very small.
• Miettinen, P.; Mielikainen, T.; Gionis, A.; Das, G.; Mannila, H. "The Discrete Basis Problem," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 10, pp. 1348-1362, Oct. 2008.
Miettinen, P. "Sparse Boolean Matrix Factorizations," 2010 IEEE 10th International Conference on Data Mining (ICDM), pp. 935-940, 13-17 Dec. 2010.
These two papers present another view of factorizing binary data. Rather than directly using SVD-based or NMF-based methods, they introduce a "cover"-based discrete optimization method. However, in experiments, the performance advantages over traditional SVD or NMF methods are not very clear. Another drawback is that it is difficult to incorporate other existing methods into their framework.
• Andreas P. Streich, Mario Frank, David Basin, and Joachim M. Buhmann. 2009. Multi-assignment clustering for Boolean data. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09). ACM, New York, NY, USA, 969-976.
This paper introduced a probabilistic view of binary data. Each observation is assumed to be generated either by "signal" or by "noise", both of which are Bernoulli distributions. The switch variable is sampled from a third Bernoulli distribution. This is essentially a simplified PLSA. Inference is done by deterministic annealing.
• Ata Kaban, Ella Bingham, Factorisation and denoising of 0-1 data: A variational approach, Neurocomputing, Volume 71, Issues 10-12, Neurocomputing for Vision Research; Advances in Blind Signal Processing, June 2008, Pages 2291-2308, ISSN 0925-2312.
This paper is somewhat similar to the "Noisy-OR" model and to Logistic PCA. However, unlike Logistic PCA, the proposed model is a mixture model, meaning that a single observation is "generated" by multiple latent factors. The authors put a Beta prior over the latent factors, and inference is done by variational inference.
• Ella Bingham, Ata Kaban, and Mikael Fortelius. 2009. The aspect Bernoulli model: multiple causes of presences and absences. Pattern Anal. Appl. 12, 1 (January 2009), 55-78.
This paper goes back to the assumption that each observation is sampled from a single factor. Inference is done by EM.

In all, it seems that the performance advantages of specifically designed binary data models are small. However, their biggest advantage is that they can sometimes give better interpretations. For computational models, NMF seems a good approximation; for probabilistic models, a modified PLSA or LDA seems quite reasonable.

## Reviews on Probabilistic Models for User Profiles

In this post, I would like to review some probabilistic models for user profiling. More specifically, I look at models that take users' preferences into account and try to predict certain quantities based on these preferences, which is the usual scenario for collaborative filtering.

• "Latent semantic models for collaborative filtering" by Thomas Hofmann, ACM Transactions on Information Systems, 2004
The proposed model is based on pLSA. To incorporate ratings, the author proposes the following decomposition scheme:
$$p(v|u,y) = \sum_{z}p(z|u)p(v|z,y)$$ where $$p(v|z,y)$$ follows a Gaussian distribution. The paper also introduced practical techniques to normalize user ratings. The model is learned through (tempered) EM.
• "Modeling User Rating Profiles For Collaborative Filtering" by Benjamin Marlin, NIPS 2003
The model proposed in this paper is essentially LDA in the context of collaborative filtering. The original document-term matrix is replaced by a user-item (user-rating) matrix. Unlike the pLSA model for collaborative filtering above, this model uses the decomposition scheme $$p(r|u)=\sum_{z} p(r|z)p(z|u)$$, where no "dummy" variable $$y$$ is involved. The model is learned through variational inference.
• "Flexible Mixture Model for Collaborative Filtering" by Luo Si and Rong Jin, ICML 2003
The model proposed in this paper is an extension of the two-sided clustering model of pLSA. It assumes that users belong to multiple clusters and items also belong to multiple clusters; the rating of a particular item depends on the user clusters and item clusters. Therefore, $$p(x,y,r) = \sum_{z_{x}} \sum_{z_{y}} p(z_{x})p(z_{y})p(x|z_{x})p(y|z_{y})p(r|z_{x},z_{y})$$ where $$z_{x}$$ are latent factors for users and $$z_{y}$$ are latent factors for items. All distributions here are multinomial. The model is learned through EM.
A full Bayesian treatment of the model, learned through Gibbs sampling, is introduced in "Latent Grouping Models for User Preference Prediction" by Eerika Savia, Kai Puolamaki and Samuel Kaski, Machine Learning 2009.
• "The Multiple Multiplicative Factor Model For Collaborative Filtering" by Benjamin Marlin and Richard S. Zemel, ICML 2004
Rather than using the same set of latent factors to "explain" all ratings, the Multiple Multiplicative Factor model (MMF) uses different latent factors to "explain" different ratings. For each user, the model has a $$K$$-dimensional binary vector $$Z$$ where each element $$z_{k}$$ indicates whether the $$k$$-th factor is "active". For the rating $$X$$, the authors use a softmax function to map arbitrary real values onto the simplex. The model is learned through variational inference.
• "Efficient Bayesian Hierarchical User Modeling For Recommendation Systems" by Yi Zhang and Jonathan Koren, SIGIR 2007
The model introduced in this paper is similar to the pLSA version of user profiles. The rating $$r$$ of an item $$y$$ by user $$u$$ is decomposed as $$p(r|y,u)=p(u)p(r|u,y)$$, where $$p(u)$$ is a Gaussian distribution and $$p(r|u,y)$$ is modeled through a Generalized Linear Model $$r=u^{T}y+\epsilon$$, essentially another Gaussian distribution in the paper. The novel part of the model is that $$y$$ is the document representation of the item; the rating is therefore a weighted sum over the terms of the item's document, with user-specific weights. The model is learned through EM.

A study of several variants of pLSA for collaborative filtering was done by Rong Jin, Luo Si and Chengxiang Zhai.

A study of how to normalize ratings was done by Rong Jin and Luo Si: