Research in General

FAQ About Tensor Methods

Anima Anandkumar is a prominent researcher on tensor methods (a.k.a. spectral methods) in machine learning. She recently held a Q&A session on Quora, and I have gathered some of her answers related to tensor methods below:

  1. What are some benefits and drawbacks of using tensor methods as opposed to more traditional techniques in machine learning?
    The main gain is in terms of computation:
    — a) tensor methods are embarrassingly parallel and scalable to large problems;
    — b) they can build on efficient linear algebra libraries, but are much more powerful and informative than matrix methods.
    On the other hand, tensor methods are not sample efficient, meaning they require more samples than EM to reach the same level of accuracy (assuming computation is not an issue). Improving the statistical efficiency of spectral methods is an ongoing research topic.
  2. What are your thoughts on the statistical efficiency of spectral methods? Do you think that they are competitive as they stand?
    The short answer is that MLE is sample efficient but may be difficult to compute, while tensor methods (moment matching) are relatively easy to compute but sample inefficient. Some remedies are mentioned in her answer.
  3. How are Tensor methods used in deep learning?
    The short answer is that their use in deep learning is currently limited.
  4. What are the best resources for starting with Tensor Analysis?
    See her webpage for a start.

For the detailed Q&A, please refer to the Quora website.

Slice Sampling for LDA Hyperparameters

For latent Dirichlet allocation (LDA) hyper-parameters, a typical approach is a Monte Carlo EM approach, where the E-step is approximated by Gibbs sampling while the M-step performs gradient-based optimization of the Dirichlet parameters. This approach is implemented in the Mallet package. For details on estimating Dirichlet parameters, see Tom Minka’s technical report.

Another less explored approach is to go fully Bayesian, interleaving the sampling of topic assignments and hyper-parameters. To sample hyper-parameters, as there is no closed-form solution, slice sampling is usually utilized. For an introduction to slice sampling, see Radford M. Neal’s paper. For its application to LDA, please see the following two references:

  1. Hanna Wallach’s course note.
  2. Chapter 2 of Hanna Wallach’s dissertation.

We implemented the method in Fugue’s LDA algorithm.
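For readers who want a feel for the method before digging into the references, here is a minimal sketch of a single univariate slice-sampling update, following Neal's stepping-out and shrinkage procedures. The class name, the `logDensity` callback, and the step size `w` are illustrative assumptions, not Fugue's actual API:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleUnaryOperator;

public class SliceSampler {
    // One slice-sampling update for a univariate log-density,
    // using stepping-out to bracket the slice and shrinkage to sample from it.
    // x0 is the current state; w is the initial step size (both assumptions).
    public static double sample(DoubleUnaryOperator logDensity, double x0, double w) {
        ThreadLocalRandom rng = ThreadLocalRandom.current();
        // Auxiliary "height" drawn in log-space: log y = log f(x0) + log U(0,1).
        double logY = logDensity.applyAsDouble(x0) + Math.log(rng.nextDouble());
        // Step out to find an interval [left, right] that contains the slice.
        double left = x0 - w * rng.nextDouble();
        double right = left + w;
        while (logDensity.applyAsDouble(left) > logY) {
            left -= w;
        }
        while (logDensity.applyAsDouble(right) > logY) {
            right += w;
        }
        // Shrinkage: sample uniformly from the interval, shrinking on rejection.
        while (true) {
            double x1 = left + (right - left) * rng.nextDouble();
            if (logDensity.applyAsDouble(x1) > logY) {
                return x1;
            }
            if (x1 < x0) {
                left = x1;
            } else {
                right = x1;
            }
        }
    }
}
```

Because only the log-density is needed (up to an additive constant), this drops in naturally for hyper-parameters whose conditional posterior has no closed form, as is the case for the Dirichlet concentration parameters in LDA.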

Tricks in Sampling Discrete Distributions (2) – Binary Search

As mentioned in the previous post, it is tricky to sample from discrete distributions, and here we demonstrate yet another important trick to do it right, regardless of whether you work in the original space or in log-space. You can easily come up with a code snippet like this (we are using Java as an example here):

public int sample(){
    double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1];
    // linear scan: return the first index whose accumulated weight exceeds u
    for (int index = 0; index < p.length; index++) {
        if (u <= p[index]) {
            return index;
        }
    }
    return p.length - 1; // not reached when p is well-formed
}

where \( p \) holds the accumulated un-normalized probabilities. The time complexity is \( \mathcal{O}(N) \), where \( N \) is the number of items in the array \( p \).

It turns out that the above code can easily be optimized to \( \mathcal{O}(\log N) \) by using binary search. The reason is quite simple: the accumulated un-normalized probabilities stored in \( p \) are, by definition, sorted. Therefore, binary search can be utilized. In particular, we want to find the smallest key that is greater than the randomly generated number \( u \). This function is called ceiling in Section 3.1 of Algorithms. We implemented it in our context as follows:

public int sample(){
    double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1];
    int lower = 0;
    int upper = p.length - 1;
    // ceiling search: smallest index such that p[index] > u
    while (lower <= upper) {
        int mid = lower + (upper - lower) / 2;
        if (p[mid] > u) {
            upper = mid - 1;
        } else {
            lower = mid + 1;
        }
    }
    return lower;
}

Interestingly, even though this trick seems trivial, it is rarely mentioned or discussed in the literature.
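To make the snippets above self-contained, here is one way to put the pieces together; the class name, the constructor, and the raw weight array are illustrative assumptions, not the code of any particular package:

```java
import java.util.concurrent.ThreadLocalRandom;

public class DiscreteSampler {
    private final double[] p; // accumulated un-normalized probabilities

    // weights holds raw un-normalized probabilities (hypothetical input).
    public DiscreteSampler(double[] weights) {
        p = new double[weights.length];
        double acc = 0.0;
        for (int i = 0; i < weights.length; i++) {
            acc += weights[i];
            p[i] = acc; // p is sorted by construction
        }
    }

    // O(log N): ceiling search for the smallest index with p[index] > u.
    public int sample() {
        double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1];
        int lower = 0;
        int upper = p.length - 1;
        while (lower <= upper) {
            int mid = lower + (upper - lower) / 2;
            if (p[mid] > u) {
                upper = mid - 1;
            } else {
                lower = mid + 1;
            }
        }
        return lower;
    }
}
```

For example, `new DiscreteSampler(new double[]{1.0, 2.0, 7.0}).sample()` should return indices 0, 1, and 2 with probabilities roughly 0.1, 0.2, and 0.7.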

Tricks in Sampling Discrete Distributions (1) – Sampling in Log-Space

One critical component in Gibbs sampling for complex graphical models is to be able to draw samples from discrete distributions. Take latent Dirichlet allocation (LDA) as an example, the main computation focus is to draw samples from the following distribution:
\begin{equation}
P(z_{i} = k \, | \, \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto (n_{d,k}^{-i} + \alpha_{k}) \frac{n_{w_{i}, k}^{-i} + \beta_{w_{i}}}{n_{., k}^{-i} + \sum_{v} \beta_{v}}\label{eq:lda}
\end{equation}
where \( n_{d,k}^{-i} \) is the number of tokens in the document \( d \) assigned to the topic \( k \), excluding the token \( i \), \( n_{w_{i}, k}^{-i} \) is the number of times token \( w_{i} \) assigned to the topic \( k \), excluding \( i \), and \( n_{.,k}^{-i} \) is the total number of tokens assigned to the topic \( k \).

So, a straightforward sampling algorithm works as follows:

  1. Let \( c_{k} \) be the right-hand side of Equation \eqref{eq:lda} for topic \( k \), which is an un-normalized probability.
  2. We compute the accumulated weights as: \( C[i] = C[i-1] + c_{i} , \,\, \forall i \in (0, K-1] \) and \( C[0] = c_{0} \).
  3. Draw \( u \sim \mathcal{U}(0, C[K-1]) \) and find \( t = \arg\min_{i} x_{i} \) where \( x_{i} = C[i] - u \) and \( x_{i} > 0 \).

The last step essentially finds the minimum index whose array value is greater than the random number \( u \).

One difficulty in dealing with \eqref{eq:lda} is that the right-hand side might be too small and therefore underflow (think of many near-zero numbers being multiplied together). Thus, we want to work with probabilities in log-space. We start with:
\begin{equation}
\log P(z_{i} = k \, | \, \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto \log (n_{d,k}^{-i} + \alpha_{k}) + \log (n_{w_{i}, k}^{-i} + \beta_{w_{i}}) - \log ( n_{., k}^{-i} + \sum_{v} \beta_{v} )\label{eq:log_lda}
\end{equation}
and store the values in \( c_{k} \), remembering that each value now represents an un-normalized log probability. The next step is to compute the accumulated weights, this time as accumulated probabilities but in log-space! Thanks to the trick mentioned in [Notes on Calculating Log Sum of Exponentials], we are able to compute the log sum efficiently; please use the last equation there to compute the accumulated weights. The last step is to draw the random number: we draw \( u \sim \mathcal{U}(0, 1) \), compute \( \log u + \log C[K-1] \), and again find the minimum index whose array value is greater than this quantity.


Two final remarks:

  1. The log-sampling algorithm for LDA is implemented in the Fugue Topic Modeling Package.
  2. Unless you really face the issue of underflow, sampling in log-space is usually much slower than in the original space, as log and exp are expensive functions to compute.
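As a concrete illustration of the log-space procedure (not Fugue's actual implementation; the class and method names are made up), one might write:

```java
import java.util.concurrent.ThreadLocalRandom;

public class LogSpaceSampler {
    // log(exp(a) + exp(b)) computed stably (the log-sum-exp trick).
    static double logAdd(double a, double b) {
        if (a == Double.NEGATIVE_INFINITY) {
            return b;
        }
        double max = Math.max(a, b);
        return max + Math.log(Math.exp(a - max) + Math.exp(b - max));
    }

    // On entry, logC[k] holds the un-normalized log probability log(c_k);
    // the array is accumulated in place so that logC[k] = log(C[k]).
    public static int sample(double[] logC) {
        for (int k = 1; k < logC.length; k++) {
            logC[k] = logAdd(logC[k - 1], logC[k]);
        }
        // log u = log U(0,1) + log C[K-1], i.e. u ~ U(0, C[K-1]) in log-space.
        double logU = Math.log(ThreadLocalRandom.current().nextDouble())
                + logC[logC.length - 1];
        // Minimum index whose accumulated log weight exceeds log u.
        for (int index = 0; index < logC.length; index++) {
            if (logC[index] > logU) {
                return index;
            }
        }
        return logC.length - 1; // not reached, since logU < logC[K-1]
    }
}
```

The linear scan at the end could of course be replaced by the ceiling binary search from the previous post, since the accumulated log weights are also sorted.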

Federated Optimization: Even More Personalized World?

Like the previous post about personalized models, another NIPS workshop paper discussed a related topic, yet from another perspective:

The authors introduce a new setting of the learning problem in which data are distributed across a very large number of computers, each having access to only a few data points. This is primarily motivated by the setting where users keep their data on their devices, but the goal is still to train a high-quality global model.

Although it looks very different from the paper described in the previous post, the two pieces can be linked together on two points:

  • It is important to train a high-quality local or personalized model by utilizing the global model, or vice versa.
  • It is also very important to understand the interplay of the global model and the local model.

These works raise interesting new directions, such as how to serve and update models that are fully personalized on mobile devices.


Serving Personalized Models

In the recent NIPS 2016 Workshop on Machine Learning Systems, one paper attracted my attention:

The central idea is very simple: we need to learn and serve personalized models to users on top of a global model. The argument for such a setting is twofold:

  1. A global model might be hard to train, given its size. It usually takes a significant amount of computing effort.
  2. Depending on the type of model, it might even be difficult to serve and update the global model as well.

So, it is very natural that the model for each individual user is trained separately while being derived from a global model. The paper demonstrates one particular way of deriving such a model, but there could be many different ways of doing this.

Of course, this is not the first such reasoning. As the paper mentions, prior work in multi-task learning has formalized similar problems. However, it might be the first time the advantages of having personalized models are shown from a systems perspective.

Machine Learning as an Engineering Discipline

Recently, Leon Bottou delivered two exciting keynote speeches on machine learning as an engineering discipline. I suggest that anyone who uses machine learning techniques to build software read through them at least once. They contain deep thinking about machine learning and software engineering.

  1. Two big challenges in machine learning, given at ICML 2015.
  2. How Big Data changes Statistical Machine Learning, given at IEEE BigData 2015.

Bayesian treatment of product metrics?

In a recent conversation, a friend complained to me that the product he operates shows a weird behavior: on one hand, the daily active users of the product have declined over time, but on the other hand, the metric they use to measure user engagement has increased dramatically. His thinking was that the metric is wrong and they should find something else that captures the dynamics of the user base more accurately.

While the situation described above could be a bug in the system, it is actually quite possible for it to happen on many occasions. Let’s use Click-Through Rate (CTR) as an example. For a product, the CTR is defined as \( \theta_{1} = P( c = 1 \, | \, v = 1) \), where \( c \) is a binary random variable indicating whether a click happens on the product and \( v \) is a binary random variable indicating whether the product is viewed. The Maximum Likelihood Estimate (MLE) of the CTR is:

\begin{equation}
\theta_{1} = \frac{N_{c}}{V_{c}}
\end{equation}
where \( N_{c} \) is the number of visitors who clicked on the product and \( V_{c} \) is the total number of visitors. From this estimate, we can easily see that:

  1. CTR is a ratio where the denominator is the total number of visitors.
  2. It is possible that \( V_{c} \) is dropping while the whole ratio is increasing.

In fact, I would argue that any metric with the total number of visitors (users) in the denominator could have this issue. One potential reason is that, as in the CTR example above, the particular way the product is optimized might drive away some users, reducing \( V_{c} \), while making some heavily engaged users even more engaged, driving relatively more \( N_{c} \). Thus, even though \( V_{c} \) and \( N_{c} \) may both decrease, the ratio may increase significantly.
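A toy calculation with made-up numbers illustrates the point: halve the visitor count while clicks drop only slightly, and the ratio goes up even though the product is losing users.

```java
public class CtrDrift {
    public static void main(String[] args) {
        // Hypothetical numbers: the user base shrinks by half,
        // clicks shrink much less, so the ratio N_c / V_c rises.
        double before = 1000.0 / 100000.0; // N_c = 1000, V_c = 100000
        double after  = 900.0 / 50000.0;   // N_c = 900,  V_c = 50000
        System.out.println(before); // 0.01
        System.out.println(after);  // 0.018
    }
}
```

Here both \( N_{c} \) and \( V_{c} \) decreased, yet the measured CTR nearly doubled.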

OK, if that’s the case, what can we do? One alternative is to measure \( \theta_{2} = P( c=1 \, | \, \Omega ) = \int P( c \, | \, v ) P( v \, | \, \Omega )\, dv \), where \( \Omega \) represents the whole population. The idea is to measure how likely a random user in the whole universe is to click on your product, rather than a visitor who is already on your product. This is, of course, much harder to compute accurately.

One way to estimate \( P(v = 1 \, | \, \Omega) \) is to gather data from the Internet or from mobile devices on the total number of visitors per month or per year, and therefore see how popular your product is.