Research in General


Hack Tweepy to Get Raw JSON

Tweepy is a popular Python wrapper for the Twitter API. Although it provides many useful features that spare you from dealing with JSON or XML directly, sometimes you would like to store all the information you have requested. In other words, you would like to store the raw JSON returned by Tweepy.

For some reason, this feature is not supported by the current version of Tweepy (as of April 4, 2012), so we need to hack a little to get the result. Indeed, this issue was discussed here. The current solution is to add the following code to “parsers.py” in the Tweepy source code:

class RawJsonParser(Parser):
    def parse(self, method, payload):
        return payload

We can then create an API object that uses the new parser and returns the raw JSON:

import tweepy
from tweepy.parsers import RawJsonParser
# "auth" is assumed to be an authentication handler created beforehand
api = tweepy.API(auth_handler=auth, parser=RawJsonParser())
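Once the parser hands back the payload untouched, storing it is just a matter of writing the string to disk. Here is a minimal sketch of that step. To keep it runnable without Tweepy, `Parser` below is a local stand-in for the base class in “parsers.py”, and `save_raw`, the sample payload, and the file name are all hypothetical.

```python
import json
import os
import tempfile


class Parser(object):
    """Local stand-in for the Parser base class in tweepy/parsers.py (assumption)."""

    def parse(self, method, payload):
        raise NotImplementedError


class RawJsonParser(Parser):
    def parse(self, method, payload):
        # Return the response body untouched instead of building model objects.
        return payload


def save_raw(payload, path):
    """Hypothetical helper: append one JSON record per line."""
    record = json.loads(payload)  # fails early if the payload is not valid JSON
    with open(path, "a") as handle:
        # Re-serialize so that every record occupies exactly one line.
        handle.write(json.dumps(record) + "\n")


# A made-up payload standing in for what the API would return.
sample_payload = '{"id": 1, "text": "hello"}'
raw = RawJsonParser().parse(None, sample_payload)
out_path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")
save_raw(raw, out_path)
```

Writing one record per line keeps the file easy to append to and to read back later with `json.loads` on each line.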

Install Numpy and Scipy on CentOS without Root Privilege

Sometimes you want to install Numpy and Scipy on a remote CentOS machine without root privilege, which is usually the case when you are using a university server. Before you proceed with the following instructions, make sure that a copy of Python is installed. This can also be done without root privilege, meaning that you install it in an alternative directory rather than the system directory.

Prerequisite:

  1. Download the latest version of LAPACK and extract it into path [LAPACK].
  2. Download the latest version of BLAS and extract it into path [BLAS].

Note that almost all tutorials on installing Numpy and Scipy on Linux machines discuss how to install them with ATLAS. This is possible only if you have root privilege, which lets you turn off CPU throttling. Since we do not have root privilege, we can only install them with LAPACK and BLAS.

Step 1: Install Numpy

  1. Edit “site.cfg”
    a) Enable “[DEFAULT]” section and add

    src_dirs = [BLAS]:[LAPACK]

    b) Add

    [blas_opt]
    libraries = f77blas, cblas
     
    [lapack_opt]
    libraries = lapack, f77blas, cblas
  2. Type the following command in the shell:
    python setup.py build --fcompiler=gnu95

    which will compile the package with “gfortran”.

  3. Type the following command in the shell:
    python setup.py install

Step 2: Install Scipy

Once Numpy is installed, Scipy can be easily built and installed through the normal “python setup.py build” and “python setup.py install” process. Remember that these commands should be accompanied by “--fcompiler=gnu95”.


Sorting Tuples in C++

In this post, I would like to show how to create a tuple object in C++11 and how to sort tuples.

Here is the code for creating tuples and doing the sort. It is pretty straightforward.

#include <iostream>
#include <string>
#include <vector>
#include <tuple>
#include <algorithm>
using namespace std;
 
typedef tuple<string,double,int> mytuple;
 
// Order tuples by their second element (the double).
bool mycompare (const mytuple &lhs, const mytuple &rhs){
  return get<1>(lhs) < get<1>(rhs);
}
 
int main(void){
  vector<mytuple> data;
  data.push_back(make_tuple("abc",4.5,1));
  data.push_back(make_tuple("def",5.5,-1));
  data.push_back(make_tuple("wolf",-3.47,1));
  sort(data.begin(),data.end(),mycompare);
  for(vector<mytuple>::iterator iter = data.begin(); iter != data.end(); iter++){
    cout << get<0>(*iter) << "\t" << get<1>(*iter) << "\t" << get<2>(*iter) << endl;
  }
}

The code is successfully compiled by G++ 4.6.1 with the option “-std=gnu++0x”.


Two Forms of Logistic Regression

There are two forms of Logistic Regression used in the literature. In this post, I will build a bridge between these two forms and show that they are equivalent.

Logistic Function & Logistic Regression

The common definition of Logistic Function is as follows:

    \[P(x) = \frac{1}{1+\exp(-x)} \;\; \qquad (1)\]

where $x \in \mathbb{R}$ is the variable of the function and $P(x) \in [0,1]$. One important property of Equation (1) is that:

    \[
    \begin{aligned}
    P(-x) &= \frac{1}{1+\exp(x)} \\
    &= \frac{1}{1+\frac{1}{\exp(-x)}} \\
    &= \frac{\exp(-x)}{1+\exp(-x)} \\
    &= 1 - \frac{1}{1+\exp(-x)} \\
    &= 1 - P(x) \;\; \qquad (2)
    \end{aligned}
    \]
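Property (2) is easy to check numerically. The following is only a quick sanity check on a few arbitrary test points of my own choosing, not a proof:

```python
import math


def logistic(x):
    """Logistic function from Equation (1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Arbitrary test points; property (2) says P(-x) = 1 - P(x).
for x in [-5.0, -0.3, 0.0, 1.2, 7.5]:
    assert abs(logistic(-x) - (1.0 - logistic(x))) < 1e-12
```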

The form of Equation (2) is widely used as the form of Logistic Regression (e.g., [1,2,3]):

    \[
    \begin{aligned}
    P(y = 1 \, | \, \boldsymbol{\beta}, \mathbf{x}) &= \frac{\exp(\boldsymbol{\beta}^{T} \mathbf{x})}{1 + \exp(\boldsymbol{\beta}^{T} \mathbf{x})} \\
    P(y = 0 \, | \, \boldsymbol{\beta}, \mathbf{x}) &= \frac{1}{1 + \exp(\boldsymbol{\beta}^{T} \mathbf{x})} \;\; \qquad (3)
    \end{aligned}
    \]

where $\mathbf{x}$ is a feature vector and $\boldsymbol{\beta}$ is a coefficient vector. By using Equation (2), we also have:

    \[ P(y=1 \, | \, \boldsymbol{\beta}, \mathbf{x}) = 1 - P(y=0 \, | \, \boldsymbol{\beta}, \mathbf{x}) \]

This formalism of Logistic Regression is used in [1,2], where labels $y \in \{0,1\}$ and the probability takes a different functional form for each label. Another formalism, introduced in [3], unifies the two forms into a single equation by integrating the label and the prediction:

    \[ P(g= \pm 1 \, | \, \boldsymbol{\beta}, \mathbf{x}) = \frac{1}{1 + \exp( - g\boldsymbol{\beta}^{T} \mathbf{x})} \;\; \qquad (4) \]

where $g \in \{\pm 1\}$ is the label for data item $\mathbf{x}$. It is also easy to verify that $P(g=1 \, | \, \boldsymbol{\beta}, \mathbf{x}) = 1 - P(g=-1 \, | \, \boldsymbol{\beta}, \mathbf{x})$.

The Equivalence of Two Forms of Logistic Regression

At first glance, form (3) and form (4) look very different. However, the equivalence between these two forms can be easily established. Starting from form (3), we have:

    \[
    \begin{aligned}
    P(y = 1 \, | \, \boldsymbol{\beta}, \mathbf{x}) &= \frac{\exp(\boldsymbol{\beta}^{T} \mathbf{x})}{1 + \exp(\boldsymbol{\beta}^{T} \mathbf{x})} \\
    &= \frac{1}{\frac{1}{\exp(\boldsymbol{\beta}^{T} \mathbf{x})} + 1} \\
    &= \frac{1}{\exp(-\boldsymbol{\beta}^{T} \mathbf{x}) + 1} \\
    &= P(g= 1 \, | \, \boldsymbol{\beta}, \mathbf{x})
    \end{aligned}
    \]

We can also establish the equivalence between $P(y=0 \, | \, \boldsymbol{\beta}, \mathbf{x})$ and $P(g=-1 \, | \, \boldsymbol{\beta}, \mathbf{x})$ easily by using property (2). Another way to establish the equivalence is from the classification rule. For form (3), we have the following classification rule:

    \[
    \begin{aligned}
    \frac{\frac{\exp(\boldsymbol{\beta}^{T} \mathbf{x})}{1 + \exp(\boldsymbol{\beta}^{T} \mathbf{x})}}{\frac{1}{1 + \exp(\boldsymbol{\beta}^{T} \mathbf{x})}} &> 1 \;\; \rightarrow \;\; y = 1 \\
    \exp(\boldsymbol{\beta}^{T} \mathbf{x}) &> 1 \\
    \boldsymbol{\beta}^{T} \mathbf{x} &> 0
    \end{aligned}
    \]

Exactly the same classification rule can be obtained for form (4):

    \[
    \begin{aligned}
    \frac{\frac{1}{1 + \exp( - \boldsymbol{\beta}^{T} \mathbf{x})}}{\frac{1}{1 + \exp( \boldsymbol{\beta}^{T} \mathbf{x})}} &> 1 \;\; \rightarrow \;\; g = 1 \\
    \frac{1 + \exp(\boldsymbol{\beta}^{T} \mathbf{x})}{1 + \exp( - \boldsymbol{\beta}^{T} \mathbf{x})} &> 1 \\
    \exp(\boldsymbol{\beta}^{T} \mathbf{x}) &> 1 \\
    \boldsymbol{\beta}^{T} \mathbf{x} &> 0
    \end{aligned}
    \]

Therefore, we can see that the two forms essentially learn the same classification boundary.
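The equivalence can also be checked numerically. The sketch below evaluates form (3) and form (4) on a few made-up points with a made-up coefficient vector and confirms that both give the same probabilities and the same decision rule. The function names are mine.

```python
import math


def dot(beta, x):
    return sum(b * v for b, v in zip(beta, x))


def p_form3(y, beta, x):
    """Form (3): labels y in {0, 1}."""
    e = math.exp(dot(beta, x))
    return e / (1.0 + e) if y == 1 else 1.0 / (1.0 + e)


def p_form4(g, beta, x):
    """Form (4): labels g in {-1, +1}."""
    return 1.0 / (1.0 + math.exp(-g * dot(beta, x)))


beta = [0.8, -1.5, 0.3]  # made-up coefficients
points = [[1.0, 0.2, -0.7], [0.1, 1.4, 2.0], [-2.0, 0.5, 0.0]]

for x in points:
    # y = 1 corresponds to g = +1, and y = 0 corresponds to g = -1.
    assert abs(p_form3(1, beta, x) - p_form4(+1, beta, x)) < 1e-12
    assert abs(p_form3(0, beta, x) - p_form4(-1, beta, x)) < 1e-12
    # Both forms predict the positive class exactly when beta^T x > 0.
    assert (p_form3(1, beta, x) > 0.5) == (dot(beta, x) > 0)
```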

Logistic Loss

Since we have established the equivalence of the two forms of Logistic Regression, it is convenient to use the second form, as it can be explained by a general classification framework. Here, we assume $y$ is the label of a data item and $\mathbf{x}$ is a feature vector. The classification framework can be formalized as follows:

    \[ \arg\min \sum_{i} L\Bigl(y_{i},f(\mathbf{x}_{i})\Bigr) \]

where $f$ is a hypothesis function and $L$ is a loss function. For Logistic Regression, we have the following instantiation:

    \[
    \begin{aligned}
    f(\mathbf{x}) &= \boldsymbol{\beta}^{T} \mathbf{x} \\
    L\Bigl(y,f(\mathbf{x})\Bigr) &= \log \Bigl( 1 + \exp(-y f(\mathbf{x})) \Bigr)
    \end{aligned}
    \]

where $y \in \{ \pm 1 \}$.
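One way to see why this loss is the natural one: it is exactly the negative log of the probability in form (4), so minimizing the summed loss is the same as maximizing the likelihood. Here is a small sketch with made-up coefficients and data to illustrate this. The function names are mine.

```python
import math


def logistic_loss(y, score):
    """L(y, f(x)) = log(1 + exp(-y * f(x))) with y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * score))


def prob(g, score):
    """Form (4): P(g | beta, x), where score = beta^T x."""
    return 1.0 / (1.0 + math.exp(-g * score))


beta = [0.5, -1.0]  # made-up coefficients
data = [([1.0, 0.0], +1), ([0.0, 2.0], -1), ([1.0, 1.0], +1)]  # made-up (x, y) pairs

total = 0.0
for x, y in data:
    score = sum(b * v for b, v in zip(beta, x))
    loss = logistic_loss(y, score)
    # The loss equals the negative log-probability of the observed label.
    assert abs(loss + math.log(prob(y, score))) < 1e-12
    total += loss
```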

References

[1] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
[2] Tom M. Mitchell. Machine learning. McGraw Hill series in computer science. McGraw-Hill, 1997.
[3] Jason D. M. Rennie. Logistic Regression. http://people.csail.mit.edu/jrennie/writing, April 2003.


Simple Geographical Calculations

In this post, I would like to share some simple code to calculate geographical distances by using latitude and longitude points from some third-party services. This is particularly useful when we wish to compute the average distance users travel from the check-in or geo-tagging information from Twitter, for instance. The code is straightforward and simple.

import math
 
## Convert a location into 3D coordinates
## location is a list of [latitude, longitude]
## return: a list of [x, y, z]
def convert_location_cor(location):
    x_n = math.cos(math.radians(location[0])) * math.cos(math.radians(location[1]))
    y_n = math.cos(math.radians(location[0])) * math.sin(math.radians(location[1]))
    z_n = math.sin(math.radians(location[0]))
    return [x_n,y_n,z_n]
 
## Convert 3D coordinates into a location
## cor is a list of [x, y, z]
## return: a list of [latitude, longitude]
def convert_cor_location(cor):
    r = math.sqrt(cor[0] * cor[0] + cor[1] * cor[1] + cor[2] * cor[2])
    lat = math.asin(cor[2] / r)
    lon = math.atan2(cor[1], cor[0])
    return [math.degrees(lat), math.degrees(lon)]
 
## Compute the geographical midpoint of a set of locations
## location_list is a list of locations [location 0, location 1, location 2]
## return: the location of the midpoint
def geo_midpoint(location_list):
    x_list = []
    y_list = []
    z_list = []
    for i in range(len(location_list)):
        m = convert_location_cor(location_list[i])
        x_list.append(m[0])
        y_list.append(m[1])
        z_list.append(m[2])
    x_mean = sum(x_list) / float(len(location_list))
    y_mean = sum(y_list) / float(len(location_list))
    z_mean = sum(z_list) / float(len(location_list))
    return convert_cor_location([x_mean,y_mean,z_mean])
 
## Compute the distance between two locations
## a and b are two locations: [lat 1, lon 1] [lat 2, lon 2]
## return: the distance in KM
def geo_distance(a,b):
    theta = a[1] - b[1]
    dist = math.sin(math.radians(a[0])) * math.sin(math.radians(b[0])) \
     + math.cos(math.radians(a[0])) * math.cos(math.radians(b[0])) * math.cos(math.radians(theta))
    dist = math.acos(max(-1.0, min(1.0, dist)))  # clamp against floating-point round-off
    dist = math.degrees(dist)
    distance = dist * 60 * 1.1515 * 1.609344
    return distance
 
## main program
if __name__ == '__main__':
    l_list = []
    l_list.append([-8.70934,115.173695])
    l_list.append([-8.70934,115.235514])
    l_list.append([-8.591728,115.235514])
    l_list.append([-8.591728,115.173695])
    midpoint = geo_midpoint(l_list)
    print(midpoint)
    print(geo_distance([-8.70934,115.173695],[-8.70934,115.235514]))
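To connect this with the average travel distance mentioned at the beginning, here is a minimal self-contained sketch. Note that it swaps in the haversine formula rather than the formula used in `geo_distance`; haversine is a common alternative that behaves well for very short distances. The Earth radius constant and the helper `average_hop_km` are assumptions of mine.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(a, b):
    """Great-circle distance in km between two [lat, lon] points in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def average_hop_km(checkins):
    """Hypothetical helper: mean distance between consecutive check-ins."""
    hops = [haversine_km(p, q) for p, q in zip(checkins, checkins[1:])]
    return sum(hops) / len(hops) if hops else 0.0


# The four points from the main program, treated as check-ins visited in order.
trail = [[-8.70934, 115.173695], [-8.70934, 115.235514],
         [-8.591728, 115.235514], [-8.591728, 115.173695]]
print(average_hop_km(trail))
```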

A Must Read for Logistic Regression

I came across an old technical report written by Michael Jordan (no, not the basketball guy):

“Why the logistic function? A tutorial discussion on probabilities and neural networks”. M. I. Jordan. MIT Computational Cognitive Science Report 9503, August 1995.

The material is amazingly straightforward and easy to understand. It answers, at least partially, a long-standing question of mine: why is the logistic function used in regression? Regardless of how it was first used, the report shows that it can be derived from a simple binary classification case where we wish to estimate the posterior probability:

    \[ P(w_{0}|\mathbf{x}) = \frac{P(\mathbf{x}|w_{0})P(w_{0})}{P(\mathbf{x})} \]


where $w_{0}$ can be thought of as a class label and $\mathbf{x}$ can be treated as a feature vector. We can expand the denominator and introduce an exponential:

    \[ P(w_{0}|\mathbf{x}) = \frac{P(\mathbf{x}|w_{0})P(w_{0})}{P(\mathbf{x}|w_{0})P(w_{0})+P(\mathbf{x}|w_{1})P(w_{1})}=\frac{1}{1+\exp\{-\log a - \log b\}} \]


where $a=\frac{P(\mathbf{x}|w_{0})}{P(\mathbf{x}|w_{1})}$ and $b= \frac{P(w_{0})}{P(w_{1})}$. Through nothing more than mathematical maneuvering, we already get a flavor of how the logistic function can be derived from simple classification problems. Now, if we specify a particular distributional form for $P(\mathbf{x}|w)$ (the class-conditional densities), for instance a Gaussian distribution, we can recover logistic regression easily.
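To make the Gaussian case concrete, here is a small numerical sketch. It assumes one-dimensional Gaussians that share a single variance, a simplification I chose so that $\log a + \log b$ becomes linear in $x$; all parameter values are made up. The posterior computed directly from Bayes' rule matches a logistic function of a linear score $wx + c$.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Made-up parameters: two classes sharing one variance (an assumption).
mu0, mu1, var = 1.0, -0.5, 0.8
prior0, prior1 = 0.3, 0.7

# Expanding log a + log b for these Gaussians gives a linear function w * x + c.
w = (mu0 - mu1) / var
c = (mu1 ** 2 - mu0 ** 2) / (2.0 * var) + math.log(prior0 / prior1)


def posterior_bayes(x):
    """P(w0 | x) computed directly from Bayes' rule."""
    num = gaussian_pdf(x, mu0, var) * prior0
    return num / (num + gaussian_pdf(x, mu1, var) * prior1)


def posterior_logistic(x):
    """P(w0 | x) written as a logistic function of a linear score."""
    return 1.0 / (1.0 + math.exp(-(w * x + c)))


for x in [-2.0, -0.5, 0.0, 0.7, 3.0]:
    assert abs(posterior_bayes(x) - posterior_logistic(x)) < 1e-12
```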

However, the whole point of the report is not just to show where the logistic function comes into play, but to show how discriminative models and generative models in this particular setting are two sides of the same coin. In addition, Jordan demonstrated that these two sides are simply NOT equivalent and should be treated carefully when different learning criteria are considered. In general, a simple take-away is that the discriminative model (logistic regression) is more “robust”, whereas the generative model might be more accurate if its assumptions are correct.

For more details, please refer to the report.


Some Recent Papers About Topic Models

In this post, I would like to talk about several recent papers about topic models. These papers may not belong to the same direction of applying or extending topic models. However, some of them are quite interesting and worth discussing here.

The first one is

Enhong Chen, Yanggang Lin, Hui Xiong, Qiming Luo, and Haiping Ma. 2011. Exploiting probabilistic topic models to improve text categorization under class imbalance. Journal of Information Processing and Management. 47, 2 (March 2011), 202-214.

The idea is straightforward and simple. The authors propose a two-step approach to mitigate the problem of unbalanced data. The first step is to learn topic models from the existing unbalanced data. Here, for each class label, a separate set of topics is learned. Once the models are obtained, synthetic documents (new samples) are drawn from the learned models. This is possible since the topic distributions and word distributions are fixed after the learning process. The number of new samples is determined by the difference between the dominant class and the rare class. A more aggressive method is also proposed, which is used to avoid noisily labeled data. The idea is to use only synthetic samples, rather than the original samples, to train a classifier. The experimental results demonstrate some performance improvement of this method over others proposed to tackle the same problem.
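The sampling step can be sketched as follows. This is only my reading of the idea and not the authors' implementation: the “learned” topic and word distributions are made-up numbers, and drawing a topic and then a word for every position is the standard generative story of a topic model, assumed here.

```python
import random


def draw(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def synthetic_document(topic_probs, topic_word_probs, vocab, length, rng):
    words = []
    for _ in range(length):
        z = draw(topic_probs, rng)          # topic for this word position
        w = draw(topic_word_probs[z], rng)  # word from that topic
        words.append(vocab[w])
    return words


rng = random.Random(0)
vocab = ["loan", "bank", "river", "water"]
# Made-up "learned" parameters for the rare class: 2 topics over 4 words.
rare_topic_probs = [0.7, 0.3]
rare_topic_word_probs = [[0.5, 0.4, 0.05, 0.05], [0.05, 0.15, 0.4, 0.4]]

n_dominant, n_rare = 50, 8   # made-up class sizes
n_new = n_dominant - n_rare  # number of synthetic samples for the rare class
synthetic = [
    synthetic_document(rare_topic_probs, rare_topic_word_probs, vocab, 20, rng)
    for _ in range(n_new)
]
```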

The second paper is

Wayne Xin Zhao, Jing Jiang, Hongfei Yan, and Xiaoming Li. 2010. Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP ’10). Association for Computational Linguistics, Stroudsburg, PA, USA, 56-65.

The paper is interesting because it also demonstrates a method to incorporate term-level features into a topic model. The list of features for each term is embedded through a Maximum Entropy model. The supervised learning part of the model learns fixed weights for these features, and Gibbs sampling for the topic model uses these weights. For details, please refer to the paper.

The next one is

Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan and Xiaoming Li. Comparing Twitter and traditional media using topic models. In Proceedings of the 33rd European Conference on Information Retrieval (ECIR’11) (full paper), 2011.

The paper has several interesting aspects. First, it is claimed to be the first study to compare topics obtained from Twitter and from traditional media. The authors use a standard LDA model to discover topics from a New York Times corpus and, separately, a modified topic model for Twitter. Then, they propose a heuristic method to map Twitter topics onto NYT topics. In addition, they manually assign topic types to all the topics found by the models. By doing all this, common topics and corpus-specific topics are obtained heuristically. It’s a little strange that they do not consider any techniques for mining topics from multiple corpora. Secondly, they do not compare to the method where only LDA is used. Note that the same Twitter-LDA is used in:

Xin Zhao, Jing Jiang, Jing He, Yang Song, Palakorn Achanauparp, Ee-Peng Lim and Xiaoming Li. Topical keyphrase extraction from Twitter. To appear in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT’11) (long paper), 2011.

 


Reviews on Binary Matrix Decomposition

In this post, I would like to review several existing techniques for binary matrix decomposition.

 

  • Andrew I. Schein, Lawrence K. Saul, and Lyle H. Ungar. A Generalized Linear Model for Principal Component Analysis of Binary Data. Appeared in Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics. January 3-6, 2003. Key West, FL.
    This paper introduced a logistic version of PCA for binary data. The model assumes that each observation comes from a single latent factor and that there exist multiple latent factors. The model is quite straightforward and the inference is done by alternating least squares.
  • Tao Li. 2005. A general model for clustering binary data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (KDD ’05). ACM, New York, NY, USA, 188-197.
    In this paper, the author introduced the problem of “binary data decomposition”. The paper demonstrated several techniques that are popular for ordinary matrix factorization, such as k-means and spectral clustering, on binary data. The proposed method is to factorize the binary matrix into two binary matrices, where the binary indicators suggest membership.
  • Tomas Singliar and Milos Hauskrecht. 2006. Noisy-OR Component Analysis and its Application to Link Analysis. J. Mach. Learn. Res. 7 (December 2006), 2189-2213.
    This paper introduced a probabilistic view of binary data. Like other latent factor models, each observation can be viewed as a sample from multiple binary latent Bernoulli factors, essentially a mixture model. A variational inference is conducted in the paper. The weak part of the paper is that the comparison of the model with PLSA and LDA is not quite convincing.
  • Zhongyuan Zhang, Tao Li, Chris Ding, and Xiangsun Zhang. 2007. Binary Matrix Factorization with Applications. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining (ICDM ’07). IEEE Computer Society, Washington, DC, USA, 391-400.
    This paper introduced a variant of Non-negative Matrix Factorization for binary data, meaning that a binary matrix is always decomposed into two matrices bounded between 0 and 1. The proposed method is a modification of NMF. However, in a document clustering problem, the performance difference between the proposed method and NMF is very small.
  • Miettinen, P.; Mielikainen, T.; Gionis, A.; Das, G.; Mannila, H.; , “The Discrete Basis Problem,” Knowledge and Data Engineering, IEEE Transactions on , vol.20, no.10, pp.1348-1362, Oct. 2008.
    Miettinen, P.; , “Sparse Boolean Matrix Factorizations,” Data Mining (ICDM), 2010 IEEE 10th International Conference on , vol., no., pp.935-940, 13-17 Dec. 2010
    These two papers take another view of the factorization of binary data. Rather than directly using SVD-based or NMF-based methods, these papers introduce a “cover”-based discrete optimization method for the problem. However, judging from the experiments, the performance advantages over traditional SVD or NMF methods are not very clear. Another drawback is that their method is difficult to combine with other existing methods.
  • Andreas P. Streich, Mario Frank, David Basin, and Joachim M. Buhmann. 2009. Multi-assignment clustering for Boolean data. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09). ACM, New York, NY, USA, 969-976.
    This paper introduced a probabilistic view of binary data. Each observation is assumed to be generated either by “signal” or by “noise”, both of which are Bernoulli distributions. The switch variable is sampled from a third Bernoulli distribution. This is essentially a simplified PLSA. The inference is done by deterministic annealing.
  • Ata Kaban, Ella Bingham, Factorisation and denoising of 0-1 data: A variational approach, Neurocomputing, Volume 71, Issues 10-12, Neurocomputing for Vision Research; Advances in Blind Signal Processing, June 2008, Pages 2291-2308, ISSN 0925-2312.
    This paper is somewhat similar to the “Noisy-OR” model and to Logistic PCA as well. However, unlike Logistic PCA, the proposed model is a mixture model, meaning that a single observation is “generated” by multiple latent factors. The authors put a Beta prior over the latent factors and the inference is done by variational inference.
    Ella Bingham, Ata Kaban, and Mikael Fortelius. 2009. The aspect Bernoulli model: multiple causes of presences and absences. Pattern Anal. Appl. 12, 1 (January 2009), 55-78.
    This paper goes back to the assumption that each observation is sampled from a single factor. The inference is done by EM.

In all, it seems that the performance advantages of specifically designed binary data models are small. However, the biggest advantage of these models is that they can sometimes give better interpretations. For computational models, NMF seems a good approximation. For probabilistic models, a modified PLSA or LDA seems quite reasonable.



Reviews on User Modeling in Topic Models

In this post, I would like to review several papers that extend standard topic models by incorporating user information. The first paradigm or group of papers is introduced by M. Rosen-Zvi et al.

These three papers define an “Author-Topic” model, a simple extension of LDA. The generation process is as follows:

  1. For each document $d$:
    1. For each word position:
      1. Sample an author $x$ uniformly from the group of authors $\mathbf{a}_{d}$ for this document.
      2. Sample a topic assignment $z$ from the per-author multinomial distribution over topics $\theta_{x}$.
      3. Sample a word $w$ from topic $z$, a multinomial distribution over words.
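The generation process above can be sketched in a few lines. This is an illustrative sketch only: the distributions are fixed, made-up values rather than learned ones, and the names `theta`, `phi` and `generate_document` are mine.

```python
import random


def draw(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def generate_document(authors, theta, phi, vocab, length, rng):
    """Generate (author, topic, word) triples for one document.

    authors: the authors a_d of the document
    theta:   theta[x] is author x's distribution over topics
    phi:     phi[z] is topic z's distribution over words
    """
    tokens = []
    for _ in range(length):
        x = rng.choice(authors)  # author chosen uniformly from a_d
        z = draw(theta[x], rng)  # topic from that author's distribution
        w = draw(phi[z], rng)    # word from the topic
        tokens.append((x, z, vocab[w]))
    return tokens


rng = random.Random(1)
vocab = ["gene", "cell", "query", "index"]
theta = {"alice": [0.9, 0.1], "bob": [0.2, 0.8]}    # made-up values
phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # made-up values
doc = generate_document(["alice", "bob"], theta, phi, vocab, 10, rng)
```

Note that nothing in the loop depends on the document itself except its author list, which is why the model has no per-document distribution over topics.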

The inference of the model is done by Gibbs sampling. The biggest drawback of the model is that it loses the distribution over topics for documents. In “Learning Author-Topic Models from Text Corpora”, the authors proposed a heuristic solution to this problem: adding a fictitious author for each document.

The second group of papers is from UMass.

They proposed several models. The first one is the “Author-Recipient-Topic” model, which is suitable for message data such as emails. The generation process is as follows:

  1. For each document $d$, we observe its author $a_{d}$ and a set of recipients $\mathbf{r}_{d}$:
    1. For each word position:
      1. Sample a recipient $x$ uniformly from $\mathbf{r}_{d}$.
      2. Sample a topic assignment $z$ from the author-recipient multinomial distribution over topics $\theta_{a_{d},x}$.
      3. Sample a word $w$ from topic $z$, a multinomial distribution over words.
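As a sketch of the steps above, the only change from the Author-Topic process is that the topic distribution is indexed by an (author, recipient) pair. Again, all values are made up and the function and variable names are mine.

```python
import random


def draw(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def generate_message(author, recipients, theta, phi, vocab, length, rng):
    """Generate (recipient, topic, word) triples for one message.

    theta[(author, recipient)] is the pair's distribution over topics;
    phi[z] is topic z's distribution over words.
    """
    tokens = []
    for _ in range(length):
        x = rng.choice(recipients)         # recipient chosen uniformly from r_d
        z = draw(theta[(author, x)], rng)  # topic from the author-recipient pair
        w = draw(phi[z], rng)              # word from the topic
        tokens.append((x, z, vocab[w]))
    return tokens


rng = random.Random(2)
vocab = ["budget", "invoice", "lunch", "weekend"]
phi = [[0.6, 0.4, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # made-up values
theta = {("ann", "boss"): [0.95, 0.05], ("ann", "friend"): [0.1, 0.9]}
message = generate_message("ann", ["boss", "friend"], theta, phi, vocab, 12, rng)
```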

This model is further extended into the “Role-Author-Recipient-Topic” model. The idea is that each author or recipient may play different roles in the exchange of messages, so it is better to model the roles explicitly. Three possible variants are introduced. In the first variant, for each word position, we first sample a role for the author and a role for the sampled recipient. Once the roles are sampled, the topic assignment is sampled from a multinomial distribution over topics determined by the role-role pair. In the second variant, only one role is generated for the author of the message, while each recipient has a role of their own. For each word position, a recipient with the corresponding role is first sampled, and a topic assignment is sampled from the multinomial distribution over topics for the author-role and recipient-role pair. In the third variant, all recipients share a single role.

The third model is the “Author-Persona-Topic” model. The generation process is as follows:

  1. For each author $a$:
    1. Sample a multinomial distribution over personas $\eta_{a}$.
    2. For each persona $g$, sample a multinomial distribution over topics $\theta_{g}$.
  2. For each document $d$ with author $a_{d}$:
    1. Sample a persona $g_{d}$ from $\eta_{a_{d}}$.
    2. For each word position:
      1. Sample a topic assignment $z$ from $\theta_{g_{d}}$.
      2. Sample a word $w$ from topic $z$, a multinomial distribution over words.

None of these models has a per-document distribution over topics.
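The document-level part of this process can be sketched as follows. For simplicity, the sketch skips sampling $\eta_{a}$ and $\theta_{g}$ from their priors and plugs in fixed, made-up values instead; the function and variable names are mine.

```python
import random


def draw(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def generate_document(author, eta, theta, phi, vocab, length, rng):
    """Return the sampled persona and the (topic, word) pairs of one document.

    eta[a]      : author a's distribution over personas
    theta[a][g] : topic distribution of persona g of author a
    phi[z]      : topic z's distribution over words
    """
    g = draw(eta[author], rng)  # one persona for the whole document
    tokens = []
    for _ in range(length):
        z = draw(theta[author][g], rng)  # topic from the persona's distribution
        w = draw(phi[z], rng)            # word from the topic
        tokens.append((z, vocab[w]))
    return g, tokens


rng = random.Random(3)
vocab = ["proof", "lemma", "recipe", "oven"]
phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # made-up values
eta = {"carol": [0.6, 0.4]}                         # made-up values
theta = {"carol": [[0.9, 0.1], [0.1, 0.9]]}         # made-up values
persona, tokens = generate_document("carol", eta, theta, phi, vocab, 8, rng)
```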

The third group of papers is from Noriaki Kawamae. Both models introduced in these papers extended the ideas of “Author-Topic” model and “Author-Persona-Topic” model in particular.

The first model is the “Author-Interest-Topic” model. It introduces a notion of “document-class”. Each author has a distribution over document-classes, and each document-class has a distribution over topics. Here, we can think of a document-class as a “persona” in the previous models. For each document, the model first samples a document-class from the per-author distribution over document-classes. Then, by using this document-class, we can draw topics from this particular class. The difference between the “Author-Interest-Topic” model and the “Author-Persona-Topic” model is that the distributions over topics for each persona are defined at the author level in “Author-Persona-Topic”, but they are global variables in the “Author-Interest-Topic” model.

The “Latent-Interest-Topic” model is much more complicated than all the previous models. It adds another layer of abstraction, author-classes. Each author has a variable indicating his author-class, which is drawn from a multinomial distribution. For each author-class, there is a multinomial distribution over topics. For each document, we first draw a document-class from its author’s per-author-class distribution over document-classes. The rest of the generation process is the same as in “Author-Interest-Topic”. The key to the “Author-Interest-Topic” and “Latent-Interest-Topic” models is that they are clustering models, in the sense that authors or documents are forced into either author-classes or document-classes.

The last group of papers is from Jie Tang et al. All the proposed models are based on the “Author-Topic” model.

They first proposed three variants of the “Author-Conference-Topic” model. For each author, there is a multinomial distribution over topics. For each token in the document, an author is uniformly sampled and the topic assignment is sampled from the per-author multinomial distribution over topics. The three variants differ in how the conference stamp is generated. We omit the discussion here.