# Posts

Anima Anandkumar is a prominent researcher about tensor methods (a.k.a., spectral methods) in machine learning. Recently, she has a QA session on Quora and I gathered some of her answers particular with tensor methods as follows:

1. What are some benefits and drawbacks of using tensor methods as opposed to more traditional techniques in machine learning?
The main gain is in terms of computation:
— a) tensor methods are embarrassingly parallel and scalable to  large problems
— b) they can build on efficient linear algebraic libraries, but are much more powerful and informative compared to matrix methods.
On the other hand, tensor methods are not sample efficient, meaning they require more samples than EM to reach the same level of accuracy (assuming computation is not an issue). Improving statistical efficiency of spectral methods is an ongoing research topic
2. What are your thoughts on the statistical efficiency of spectral methods? Do you think that they are competitive as they stand?
The short answer is that, MLE is sample efficient but may be difficult to compute while tensor methods (moment matching) is relatively easy to compute but sample inefficient. Some remedies are mentioned in the answer.
3. How are Tensor methods used in deep learning?
The short answer is that, currently used limited.
4. What are the best resources for starting with Tensor Analysis?
See her webpage for a start.

For detailed QA, please refer to Quora website.

## Slice Sampling for LDA Hyperparameters

For latent Dirichlet allocation (LDA) hyper-parameters, a typical approach is to utilize Monte-Carlo EM approach where E-step is approximated by Gibbs sampling while M-step is to perform a gradient-based optimization approach to optimize Dirichlet parameters. Such approach is implemented in Mallet package. For the detailed information about estimating Dirichlet parameters, see Tom Minka’s technical report.

Another less explored approach is to go with full Bayesian notion, interleaving sampling topic assignments and hyper-parameters. To sample hyper-parameters, as there is no closed-form solution, Slice Sampling is usually utilized. For introduction for Slice Sampling, see Radford M. Neal’s paper. For its application in LDA, please see the following two references:

We implemented the method in Fugue’s LDA algorithm.

## Tricks in Sampling Discrete Distributions (2) – Binary Search1

As mentioned in the previous post, it is tricky to sample from discrete distributions, here we demonstrate yet another important trick to do it right. No matter you do it in the original space, or in the log-space. Basically, you can easily come up some code snippet like this (we are using Java as an example here):

1 2 3 4 5 6 7 8 9 public int sample(){ double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1]; int index = -1; for (index = 0; index > p.length; index++) { if (u > p[index]) break; } return index; }

where $$p$$ is the accumulated un-normalized probabilities. The time complexity is $$\mathcal{O}(N)$$ when $$N$$ equals the number of items in the array $$p$$.

It turns out that, the above code can be easily optimized to $$\mathcal{O}(\log N )$$ by using Binary Search. The reason is quite simple. The accumulated un-normalized probabilities, which are stored in $$p$$, by its definition, are sorted. Therefore, binary search can be utilized. In particular, we want to find the smallest key that is greater than the random generated number $$u$$. This function is called ceiling in Algorithms in Section 3.1. We implemented it in our context as follows:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 public int sample(){ double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1]; int lower = 0; int upper = p.length - 1; while (lower >= upper){ int mid = lower + (upper - lower) / 2; if((p[mid] - u) > 0){ upper = mid - 1; } else{ lower = mid + 1; } } return lower; }

Interestingly, even though this trick seems trivial, it is not mentioned in many literature and only discussed:

## Tricks in Sampling Discrete Distributions (1) – Sampling in Log-Space

One critical component in Gibbs sampling for complex graphical models is to be able to draw samples from discrete distributions. Take latent Dirichlet allocation (LDA) as an example, the main computation focus is to draw samples from the following distribution:

P(z_{i} = k \, | \, \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto (n_{d,k}^{-i} + \alpha_{k}) \frac{n_{w_{i}, k}^{-i} + \beta_{w_{i}}}{n_{., k}^{-i} + \sum_{v} \beta_{v}}\label{eq:lda}

where $$n_{d,k}^{-i}$$ is the number of tokens in the document $$d$$ assigned to the topic $$k$$, excluding the token $$i$$, $$n_{w_{i}, k}^{-i}$$ is the number of times token $$w_{i}$$ assigned to the topic $$k$$, excluding $$i$$, and $$n_{.,k}^{-i}$$ is the total number of tokens assigned to the topic $$k$$.

So, a straightforward sampling algorithm works as follows:

1. Let $$c_{k}$$ be the right-hand side of Equation \eqref{eq:lda} for topic $$k$$, which is an un-normalized probability.
2. We compute the accumulated weights as: $$C[i] = C[i-1] + c_{i} , \,\, \forall i \in (0, K-1]$$ and $$C[0] = c_{0}$$.
3. Draw $$u \sim \mathcal{U}(0, C[K-1] )$$ and find $$t = \arg\min_{i} \left( x_{i} \right)$$ where $$x_{i} = C[i] – u$$ and $$x_{i} > 0$$

The last line is essentially to find the minimum index that the array value is greater than the random number $$u$$.

One difficulty to deal with \eqref{eq:lda} is that the right hand side might be too small and therefore overflow (thinking about too many near-zero numbers multiplying). Thus, we want to deal with probabilities in log-space. We start to work with:

\log P(z_{i} = k \, | \, \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto \log (n_{d,k}^{-i} + \alpha_{k}) + \log (n_{w_{i}, k}^{-i} + \beta_{w_{i}}) – \log ( n_{., k}^{-i} + \sum_{v} \beta_{v} )\label{eq:log_lda}
and store in $$c_{k}$$ but remember that now each value represents un-normalized log probability. The next step is to compute the accumulated weights, this time as accumulated probabilities but in log-space! Thanks to the trick mentioned in [Notes on Calculating Log Sum of Exponentials], we are able to compute log sum efficiently. Please use the last equation there to compute the accumulated weights. The last step is to draw the random number. We compute $$u \sim \mathcal{U}(0, 1) + \log C[K-1]$$ and again find the minimum index that satisfy the array value is greater than $$u$$.

Notes:

1. The log-sampling algorithm for LDA is implemented in Fugue Topic Modeling Package.
2. Unless you really face the issue of overflow, sampling in log-space is usually much slower than the original space as log and exp are expensive functions to compute.

## Federated Optimization: Even More Personalized World?

Like the previous post about the personalized models, another NIPS workshop paper discussed a related topic, yet, from another perspective:

The authors introduces a new setting of the learning problem in which data are distributed across a very large number of computers, each having access only to few data points. This is primarily motivated by the setting, where users keep their data on their devices, but the goal is still to train a high quality global model.

Although it looks like very different from the paper described in the previous post, two pieces can be linked together for two points:

• It is important to train a high quality local or personalized model by utilizing the global model or vice versa.
• It is very important to understand the interplay of the global mode land the local model as well.

These work can raise interesting new directions, like how to serve/update models that are fully personalized on mobile devices.

## Serving Personalized Models

In the recent NIPS 2016 Workshop on Machine Learning Systems, one paper attracts my attention:

The central idea is very simple: we need to learn and serve personalized models on top of a global model to users. The argument of such setting is two-folds:

1. A global model might be hard to train, given the size of the model. It usually takes significant amount of computing efforts.
2. Depending on what types of model, it might be even difficult to serve and update the global model as well.

So, it is very natural that, the model for each individual user is trained separately while it is derived from a global model. The paper demonstrates a particular way of deriving such a model. But there could be many different ways of doing this.

Of course, this is not the first such reasoning. As the paper mentioned, prior work in multi-task learning has formalized similar problems. However, it might be the first time from the system perspective to show the advantages of having personalized models.

## Building Software Systems: V1 and V2

In a recent issue of Communications of the ACM, the article “Lean Software Development — Building and Shipping Two Versions” caught my eyes, as it describes one of the basic dilemmas faced in a lot of companies: how to build multiple versions of the same software at the same time, although the article discussed the issue from human resources perspective. The central argument of the article is that, certain engineers like prototyping while the others might love to prune or optimize the code, or to beautify the code, in the author’s word and therefore, it’s better off to separate them into two groups.

However, in practice, the reason to split teams into so-called V1/V2 sometimes is rather arbitrary. One common issue is that, V1 is a large and working system which, over time, team members have gained expertise on and a lot of people is familiar with the code base. But, as it might be developed in a rush at the first place, V1 is not flexible and sometimes full of hacks and heuristics. It needs a re-write, in theory. On the other hand, some management folks realized the issue but formed another team to develop the V2 system (usually called “next-gen”) with or without the consultant from the team of V1. Therefore, V2 is born in many problems. It is called “next-gen” but the developers of it may not really understand the real problems of V1 and sometimes, in order to separate V2 from V1, showing the superiority of V2, V2 tends to be overly engineered and overly designed. In the end of the day, both V1 and V2 may not be properly designed and implemented.

In the heart of the issue is that, why we need V1 or V2 at the same time? We need to write “correct code” rather than “fast code”. The need to build V1 and V2 simultaneously sometimes implies the bad design, the wrong requirements of the products and hacks that jeopardize the quality of the products.

## Machine Learning as an Engineering Discipline

Recently, Leon Bottou delivered two exiting keynote speeches on machine learning as an engineering discipline. I suggest that anyone who is using machine learning techniques to build software needs to read through them at least once. There are deep thinking about machine learning and software engineering.

1. Two big challenges in machine learning given on ICML 2015.
2. How Big Data changes Statistical Machine Learning given on IEEE BigData 2015.

## Structured Sparse Regression for Recommender Systems

Feature based latent factor models have received increasing attention in recent years due to its capability to effectively solve the cold-start problem. There have been many feature based collaborative filtering (CF) models proposed recently, which can be grouped into two categories. The first type of models includes all variants of latent factor models (LFM) which have been proven as an effective approach to personalization and recommender systems. The core of LFM is to learn user-specific and item-specific features from user-item interactions and utilize these features for future predictions/recommendations. State-of-the-art LFM exploits low-rank latent spaces of users and features and treats latent factors that are learnt from user-item historical data as features. This type of models has gained significant successes in a number of applications, including the Netflix competition. The second category is the factorization machine (FM), which explicitly learns the mapping function from features to rating score circumventing the dependency on user/item latent factors as in the latent factor models, resulting in an effective model for the cold start problem [9].

Although these feature-based CF models have been shown to be effective, they do not utilize the feature structure information. For example, conventional latent factor models (e.g., matrix factorization or tensor factorization models) like RLFM [1, 2] learn mapping functions from user/item features to user/item latent vectors assuming the features have a flat first-order structure. Later, [10] showed that this kind of mapping can be extended to any non-linear models. Although the formalism is flexible, it leaves too much room for practitioners to choose which non-linear model to use for a particular application. Also, it is hard to incorporate human prior knowledge on the feature structure into the framework, unless through careful feature engineering, and the proposed inference algorithm is difficult to use in large-scale settings. Similar to RLFM, Gantner et al. [4] proposed a model to explicitly learn the mapping function from features to latent factors, resulting in an effective model for the cold start problem. But it still makes the flat first-order feature structure assumption. In the other line of work, Rendle et al. [9] proposed a more compact model called factorization machine FM. Basically FM is a second-order regression model which directly maps the user-item-event concatenated features to rating score by learning the implicit mapping functions from features to latent factors, resulting in an effective model for the cold start problem. However, the issue of encoding structural human prior information still boils down to sophisticated feature engineering and it’s not clear how to incorporate the heterogeneous feature structures into model training to enhance the rating prediction performance. Though FM considers the second-order feature structure, it simply uses all the feature pairs for prediction.

Quite a lot of work in sparse coding area have shown that many signals tend to have a sparse representation from basic components in nature, and a sparse model often outperforms a dense model and also has the variable selection effect. Inspired by this, a sparse model that uses an appropriate subset of feature pairs might have a better performance. In practices, human prior knowledge or explicit structure information about these features is also sometimes available. For example, the topical categories on news articles may naturally be organized into hierarchies or graphs. Another good example would be demographical information about users, especially their geo-locations that are aligned with countries, states and cities, defined in real-world geo-political settings. These prior knowledge and structures are invaluable information for better user understanding and profiling and eventually better recommendation results. However, it is not straightforward to encode this kind of structural prior knowledge into state-of-the-art recommendation models like RLFM and FM. One approach might be to construct features capturing these structures and embed them into regression models. But the interplay between an optimal way to construct such features and train a better regression model based on these features to map to latent features becomes non-trivial in this case. Some previous work has been proposed to impose structural information on latent variable models, which are not necessarily directed graphical models. For instance, He et al. [5] proposed a general learning framework which induces sparsity on the undirected graphical model imposed on the vector of latent factors. Although the paper shares a similar idea with our framework, their work cannot handle heterogeneous types of features and complex dependencies. Also, it is hard to link their work to state-of-the-art LFM used in CF settings. Along the line of undirected graphical models, Min et al. [8] proposed sparse high-order Boltzmann machines, aiming to capture dependencies between latent variables. Again, it is not obvious to plug the model into state-of-the-art CF approaches. In this paper, we propose a structured sparse second-order regression model with structural prior knowledge in a principled way. The notion of types of features is introduced such that different types of features would have different structures (e.g., topical categories versus geographical locations). We consider two kinds of structures. For inter-typed features (which are of different kinds), the model is able to learn sparse relationships between different types of features (e.g., age and gender). For intra-typed features (which are in the same kind) that have a hierarchy (tree), e.g., we have the country-state-city hierarchical tree for the geo-location feature terms, the model learns a sparse hierarchical structure on the tree such that if a parent feature edge (or interchangeably a feature pair) is selected, then all its descendant edges should be selected as well, and if a parent edge is removed then all its descendant edges should be removed too.

You can view the paper in details here: [PDF].

### Reference

1. D. Agarwal and B.-C. Chen. Regression-based latent factor models. In Proceedings of KDD, pages 19–28. ACM, 2009.
2. D. Agarwal and B.-C. Chen. fLDA: matrix factorization through latent Dirichlet allocation. In Proceedings of WSDM, pages 91–100. ACM, 2010.
3. T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu. SVDFeature: A toolkit for feature-based collaborative filtering. The Journal of Machine Learning Research, 13(1):3619–3622, Dec. 2012.
4. Z. Gantner, L. Drumond, C. Freudenthaler, S. Rendle, and L. Schmidt-Thieme. Learning attribute-to-feature mappings for cold-start recommendations. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, pages 176–185, 2010.
5. Y. He, Y. Qi, K. Kavukcuoglu, and H. Park. Learning the dependency structure of latent factors. In Advances in Neural Information Processing Systems 25, pages 2375–2383. 2012.
6. R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. The Journal of Machine Learning Research, 12:2297–2334, 2011.
7. Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, Aug. 2009.
8. M. R. Min, X. Ning, C. Cheng, and M. Gerstein. Interpretable sparse high-order boltzmann machines. In AISTATS, pages 614–622, 2014.
9. S. Rendle. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology, 3(3):57:1–57:22, May 2012.
10. L. Zhang, D. Agarwal, and B.-C. Chen. Generalizing matrix factorization through flexible regression priors. In Proceedings of RecSys, pages 13–20. ACM, 2011.
11. P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, pages 3468–3497, 2009.

## Poisson Matrix Factorization1

In recent years, Matrix Factorization (MF) methods play a pivotal role in the research of Collaborative Filtering and Recommender Systems. The basic assumption is that, the data can be formed into a $$N \times M$$ matrix $$X$$ where $$N$$ is the number of users and $$M$$ is the number of items. Each rating $$X_{i,j}$$ is modeled as a dot-product of the latent factor $$\theta_{i}$$ for the user $$i$$ and the latent factor $$\phi_{j}$$ for the item $$j$$. In the classic Probabilistic Matrix Factorization, the rating $$X_{i,j}$$ is defined as:

X_{i,j} \sim \mathcal{N}(\theta_{i}^{T}\phi_{j}, \sigma^{2})
where each user factor and each item factor is drawn from another set of Gaussians:
\begin{align}
\theta_{i} \sim \mathcal{N}(0, \sigma_{U}^{2}I) \\
\phi_{j} \sim \mathcal{N}(0, \sigma_{I}^{2}I)
\end{align}This definition has two issues:

1. It does not prevent that the ratings become negative, which is a natural result of the Gaussian assumption.
2. If no heuristics are applied, the model needs to model all zero ratings and therefore, dramatically impacting the predictive performance.

In order to demonstrate the second point, we could see that the log likelihood of the data can be give as:

\log P(X \, | \, \theta, \phi ) = – \frac{1}{2\sigma^{2}} \sum_{i} \sum_{j} ( X_{i,j} – \theta_{i}^{T}\phi_{j})^{2} – \frac{1}{2} N \times M \log \sigma^{2} + C
where $$C$$ is a constant. It is obvious that, if $$X_{i,j}$$ is zero, the model still needs to explain those ratings as well. In [1], the authors just use a variable to indicate whether the rating is observed or not, simply ignoring them, which might be effective, but not justified in the model. In [2], the authors do not ignore unobserved or zero ratings, they put a small confident number for those ratings, a heuristic still.

Recently, Prem Gopalan et al. [3, 4, 5, 6] have proposed a new model called Poisson Factorization (PF) to address these two issues. The central idea is to replace Gaussian assumption with Poisson distribution:

X_{i,j} \sim \mathrm{Poisson}(\theta_{i}^{T}\phi_{j})
where $$\theta_{i}$$ and $$\phi_{j}$$ are drawn from user specific and item specific Gamma distributions. One obvious outcome of this model is that, $$X_{i,j}$$ are modeled as non-negative values and the latent factors are also non-negative as well. To the second issue mentioned above, we could see the probability of data is：

P(X_{i,j} \, | \, \theta_{i}, \phi_{j}) = (\theta_{i}^{T}\phi_{j})^{X_{i,j}}\frac{\exp \left\{ -\theta_{i}^{T} \phi_{j} \right\}}{X_{i,j}!}
and therefore, for all items, the log probability is:

\log P(X \, | \, \theta, \phi) = \sum_{i,j} \left\{ X_{i,j}\log \left( \theta_{i}^{T}\phi_{j} \right) – \log X_{i,j}! – \theta_{i}^{T} \phi_{j} \right\}
As $$0!=1$$, thus, the first term would be zero if $$X_{i,j}$$ and the second term becomes $$1$$ and thus, only the terms that $$X_{i,j}$$ would remain. Therefore, as mentioned in the papers, the model, by definition, can support sparse matrices.

Another interesting property of PF, which is mentioned in [5, 6], is that, we can rewrite the Poisson observation model as a two stage process where a user $$u$$ first decides on a budget $$b_{u}$$ she has to spend on items, and then spends this budget rating items that she is interested in:
\begin{align}
b_{u} &\sim \mathrm{Poisson}(\theta_{u}^{T}\sum_{j} \phi_{j}) \nonumber \\
[X_{i1},\cdots,X_{iM}] &\sim \mathrm{Multinomial}(b_{u}, \frac{\theta_{u}^{T}\phi_{j}}{\theta_{u}^{T}\sum_{j}\phi_{j}})
\end{align}This shows that learning a PF model for user-item ratings is effectively the same as learning a budget for each user while also learning how that budget is distributed across items.

## Reference

1. Ruslan Salakhutdinov, Andriy Mnih:  NIPS 2007: 1257-1264
2. Yifan Hu, Yehuda Koren, Chris Volinsky:  ICDM 2008: 263-272
3. Prem Gopalan, Laurent Charlin, David M. Blei: NIPS 2014: 3176-3184
4. Prem Gopalan, Francisco J. Ruiz, Rajesh Ranganath, David M. Blei: AISTATS 2014: 275-283
5. Prem Gopalan, Jake M. Hofman, David M. Blei: CoRR abs/1311.1704 (2013)
6. Prem Gopalan, Jake M. Hofman, David M. Blei: UAI 2015.