Grid Representations of Texts

A perennial theme of text mining is how to model text in a suitable space. A simple yet powerful choice is a one-dimensional space in which words are generated from an uncorrelated array of terms. This is usually called the bag-of-words (BOW) representation.

BOW is also the foundation for more sophisticated approaches such as latent Dirichlet allocation (LDA) and latent semantic indexing (LSI). In such models, the terms in a document are essentially generated from several 1D arrays (topics), each with its own distribution over the whole vocabulary. However, as in simple BOW, these 1D arrays are uncorrelated.

The idea behind grid representations of texts is essentially to model correlations between words. One example is the “multidimensional counting grid” [1] and its “admixture” variant, the “componential counting grid” [2]. The basic assumption is that texts are generated by a window moving over a grid, where each grid cell is a distribution over terms. One benefit of such a representation is that thematic shifts are handled more naturally within the framework. Also, depending on the window size, n-gram effects are taken into account automatically.
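To make the moving-window idea concrete, here is a minimal sketch on a toy one-dimensional, wrap-around grid. It assumes that the term distribution of a window is the uniform average of the cells it covers, which is my reading of the counting-grid setup. The function name `window_distribution`, the toy grid and its vocabulary are all hypothetical, and learning the cell distributions from data is omitted.

```python
def window_distribution(grid, start, width):
    """Average the per-cell term distributions inside one window.

    grid  : list of dicts, each mapping term -> probability (one dict per cell)
    start : index of the first cell covered by the window
    width : number of consecutive cells in the window (wraps around the grid)
    """
    mixed = {}
    for offset in range(width):
        cell = grid[(start + offset) % len(grid)]
        for term, prob in cell.items():
            mixed[term] = mixed.get(term, 0.0) + prob / width
    return mixed


# Toy grid (an assumption for illustration): neighbouring cells share terms,
# so neighbouring windows give similar but gradually shifting distributions.
toy_grid = [
    {"stock": 0.5, "market": 0.5},
    {"market": 0.5, "bank": 0.5},
    {"bank": 0.5, "river": 0.5},
    {"river": 0.5, "water": 0.5},
]

finance_window = window_distribution(toy_grid, start=0, width=2)
nature_window = window_distribution(toy_grid, start=2, width=2)
```

The two windows overlap only through the shared term "bank", which is how a slowly moving window can express a gradual thematic shift that independent BOW topics cannot.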

Although at first glance the real advantage of such a representation over BOW is not very clear, it is nevertheless an interesting idea to explore.


[1] Jojic, N., Perina, A.: Multidimensional counting grids: Inferring word order from disordered bags of words. In UAI 2011, pp. 547-556. [PDF]
[2] Perina, A., Jojic, N., Bicego, M., Turski, A.: Documents as Multiple Overlapping Windows into a Grid of Counts. In NIPS 2013. [PDF]

Weighted Approximately Ranked Pairwise loss (WARP)

To focus on the top of the ranked list, where the top \( k \) positions are the ones we care about under the precision at \( k \) measure, one can weight the pairwise violations according to their position in the ranked list. For the pairwise learning procedure, we construct the set of all positively labelled instances, denoted \(\mathcal{C}_{u}^{+}\), and the set of negatively labelled instances, denoted \(\mathcal{C}_{u}^{-}\). The loss is defined as:
\[
\mbox{err}_{\mbox{WARP}}(\mathbf{x}_{i}, y_{i}) = L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))]
\]
where \( rank(f(y_{i} \, | \, \mathbf{x}_{i})) \) counts how many negatively labelled instances are “wrongly” ranked higher than the positive example \( \mathbf{x}_{i} \):
\[
rank(f(y_{i} \, | \, \mathbf{x}_{i})) = \sum_{(\mathbf{x}^{\prime}, y^{\prime}) \in \mathcal{C}_{u}^{-}} \mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(y_{i} \, | \, \mathbf{x}_{i})]
\]
where \( \mathbb{I}[\cdot] \) is the indicator function, and \( L(\cdot) \) transforms this rank into a loss:
\[
L(r) = \sum_{j=1}^{r} \tau_{j}, \quad \mbox{with} \; \tau_{1} \geq \tau_{2} \geq \cdots \geq 0.
\]
Different choices of \( \tau \) assign different importance to the relative position of the positive examples in the ranked list. In particular:

  • For \( \tau_{i} = 1 \) for all \( i \) we have the same AUC optimization as the margin ranking criterion.
  • For \( \tau_{1} = 1 \) and \( \tau_{i > 1} = 0 \) the precision at \( 1 \) is optimized.
  • For \( \tau_{i \leq k} = 1 \) and \( \tau_{i > k}=0 \) the precision at \( k \) is optimized.
  • For \( \tau_{i} = 1/i \) a smooth weighting over positions is given, where most weight is given to the top position, with rapidly decaying weight for lower positions. This is useful when one wants to optimize precision at \( k \) for a variety of different values of \( k \) at once.
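The weighting schemes in the list above can be sketched in a few lines. This is only an illustration of the definitions of \( rank \) and \( L(r) \); the function names (`violating_rank`, `rank_loss`, `tau_auc`, `tau_precision_at`, `tau_harmonic`) and the toy scores are assumptions, not part of any library.

```python
def violating_rank(positive_score, negative_scores):
    """Count negatives scored at least as high as the positive example."""
    return sum(1 for s in negative_scores if s >= positive_score)


def rank_loss(rank, tau):
    """L(r) = sum_{j=1}^{r} tau_j, where tau(j) returns the j-th weight."""
    return sum(tau(j) for j in range(1, rank + 1))


def tau_auc(j):
    """tau_j = 1 for all j: every violation costs the same."""
    return 1.0


def tau_precision_at(k):
    """tau_j = 1 for j <= k and 0 otherwise."""
    return lambda j: 1.0 if j <= k else 0.0


def tau_harmonic(j):
    """tau_j = 1/j: most weight on the top positions."""
    return 1.0 / j


# Toy scores (hypothetical): two negatives outrank the positive example.
toy_rank = violating_rank(0.6, [0.9, 0.7, 0.2, 0.1])   # 2
loss_auc = rank_loss(toy_rank, tau_auc)                # 2.0
loss_p_at_1 = rank_loss(toy_rank, tau_precision_at(1)) # 1.0
loss_harmonic = rank_loss(toy_rank, tau_harmonic)      # 1.5
```

With \( \tau_{j} = 1/j \), moving the positive example from rank 2 to rank 1 removes only 0.5 of loss, while moving it from rank 1 to rank 0 removes 1.0, which is the "more weight at the top" behaviour described in the last bullet.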

The loss discussed above is also equal to:
\[\begin{align}
\mbox{err}_{\mbox{WARP}}(\mathbf{x}_{i}, y_{i}) &= L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))] \times 1 \nonumber \\
&= \frac{L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))] \sum_{(\mathbf{x}^{\prime}, y^{\prime}) \in \mathcal{C}_{u}^{-}} \mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(y_{i} \, | \, \mathbf{x}_{i})]}{rank(f(y_{i} \, | \, \mathbf{x}_{i}))} \nonumber \\
&= \sum_{(\mathbf{x}^{\prime}, y^{\prime}) \in \mathcal{C}_{u}^{-}} \frac{L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))] \, \mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(y_{i} \, | \, \mathbf{x}_{i})]}{rank(f(y_{i} \, | \, \mathbf{x}_{i}))}
\end{align}\]
with the convention \( 0/0=0 \) when the correct label \( y_{i} \) is top-ranked. Given a label from the positive set, the \( rank \) function is essentially the total number of labels from the negative set that violate the desired ordering. If a violating negative label is drawn uniformly at random, the probability of drawing a particular one, given the positive label, is:
\[
P((y^{\prime}, \mathbf{x}^{\prime}) \, | \, (y_{i}, \mathbf{x}_{i})) = \frac{1}{rank(f(y_{i} \, | \, \mathbf{x}_{i}))}
\]
Because indicator functions are discrete, we can replace them with the hinge loss:
\[
\mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(y_{i} \, | \, \mathbf{x}_{i})] \approx \max(0, 1 - f(y_{i} \, | \, \mathbf{x}_{i}) + f(y^{\prime} \, | \, \mathbf{x}^{\prime}))
\]
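A small numerical check may help show why the hinge loss is a reasonable stand-in for the indicator: with a margin of 1 it is never smaller than the indicator, and unlike the indicator it changes continuously with the scores. The names `indicator`, `hinge` and `example_pairs` below are hypothetical and only serve this illustration.

```python
def indicator(pos_score, neg_score):
    """1 if the negative is scored at least as high as the positive, else 0."""
    return 1.0 if neg_score >= pos_score else 0.0


def hinge(pos_score, neg_score, margin=1.0):
    """max(0, margin - f(positive) + f(negative))."""
    return max(0.0, margin - pos_score + neg_score)


# (positive score, negative score) pairs chosen arbitrarily for illustration.
example_pairs = [(0.6, 0.9), (0.6, 0.6), (0.6, 0.1), (2.0, 0.5)]

# The hinge loss upper-bounds the indicator on every pair.
hinge_dominates = all(hinge(p, n) >= indicator(p, n) for p, n in example_pairs)
```

Note that the pair `(0.6, 0.1)` has indicator 0 but a positive hinge loss, because the positive example wins by less than the margin.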

Online Learning to Rank
The overall risk we want to minimize is:
\[
Risk(f) = \int \hat{\mbox{err}}_{\mbox{WARP}}(\mathbf{x},y) \, dP(\mathbf{x},y)
\]
An unbiased estimator of this risk can be obtained by stochastic sampling as follows:

  1. Sample a positive pair \( (\mathbf{x},y)\) according to \( P(\mathbf{x},y) \).
  2. For the chosen \( (\mathbf{x},y) \) sample a negative instance \((\mathbf{x}^{\prime},y^{\prime})\) such that \(1+f(y^{\prime} \, | \, \mathbf{x}^{\prime}) > f(y \, | \, \mathbf{x})\).

The chosen negative instance, together with the positive instance, contributes:
\[
L[rank(f(y \, | \, \mathbf{x}))] \max(0, 1 - f(y \, | \, \mathbf{x}) + f(y^{\prime} \, | \, \mathbf{x}^{\prime}))
\]
to the total risk. That is, taking the expectation of these contributions approximates the risk, because we have probability \( \frac{1}{rank(f(y \, | \, \mathbf{x}))} \) of drawing \( (\mathbf{x}^{\prime}, y^{\prime}) \) in step 2 above (remember that the \( rank \) function essentially counts the violating negative instances). This suggests that we can perform a stochastic update over the parameters.

For large datasets, the stochastic optimization steps discussed above are inefficient:

  1. In step (2) above, we need to compute \( f(y^{\prime} \, | \, \mathbf{x}^{\prime})\) for all possible negative instances.
  2. For a single contribution as given above, we need to calculate the \( rank \) function, which also requires computing \( f \) for the negative instances.

Some approximations can be used. For step (2), we can sample labels with replacement until we find a violating label.

Now, if there are \( k = rank(f(y\, | \, \mathbf{x})) \) violating labels, let the random variable \( N_{k} \) denote the number of trials needed in the sampling step to obtain a violating label. This random variable follows a geometric distribution with parameter \( \frac{k}{Y-1} \) (the assumption here is that every draw has the same probability of hitting a violating label, which holds when sampling with replacement):
\[
P(N_{k} > q) = \left(1 - \frac{k}{Y-1}\right)^{q}, \qquad \mathbb{E}[N_{k}] = \frac{Y-1}{k}
\]
where \( Y \) is the total number of labels, so that \( Y-1 \) of them are negative. Thus, we have \( k=\frac{Y-1}{\mathbb{E}[N_{k}]} \). This suggests that the value of the \( rank \) function may be approximated by:
\[
rank(f(y \, | \, \mathbf{x})) \approx \left \lfloor \frac{Y-1}{N} \right \rfloor
\]
where \( N \) is the number of trials in the sampling.
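The sampling approximation can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the negative scores are given as a plain list (so `len(neg_scores)` plays the role of \( Y-1 \)), sampling is capped at \( Y-1 \) trials, and \( \tau_{j} = 1/j \) is used as the default weighting. All function names are hypothetical.

```python
import random


def sample_violator(pos_score, neg_scores, rng, margin=1.0):
    """Draw negatives uniformly with replacement until one violates the margin.

    Returns (index, trials); index is None if no violator was found
    within len(neg_scores) trials (the cap is an assumption of this sketch).
    """
    max_trials = len(neg_scores)
    for trials in range(1, max_trials + 1):
        idx = rng.randrange(len(neg_scores))
        if margin + neg_scores[idx] > pos_score:
            return idx, trials
    return None, max_trials


def estimated_rank(num_negatives, trials):
    """rank ~ floor((Y - 1) / N), with num_negatives standing in for Y - 1."""
    return num_negatives // trials


def warp_sample_loss(pos_score, neg_scores, rng, tau=lambda j: 1.0 / j):
    """One stochastic contribution: L(estimated rank) times the hinge loss."""
    idx, trials = sample_violator(pos_score, neg_scores, rng)
    if idx is None:
        return 0.0  # no violator found, so this sample contributes nothing
    rank = estimated_rank(len(neg_scores), trials)
    weight = sum(tau(j) for j in range(1, rank + 1))
    return weight * max(0.0, 1.0 - pos_score + neg_scores[idx])


rng = random.Random(0)
# Every negative violates the margin, so the first draw succeeds (N = 1)
# and the estimated rank equals the number of negatives.
loss_all_violate = warp_sample_loss(0.5, [2.0] * 10, rng)
# No negative violates the margin, so the contribution is zero.
loss_none_violate = warp_sample_loss(0.5, [-5.0] * 10, rng)
```

A violator found after few trials implies a large estimated rank and hence a large weight, while one found only after many trials implies that the positive example is already near the top.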

[1] Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35, October 2010.
[2] Nicolas Usunier, David Buffoni, and Patrick Gallinari. Ranking with ordered weighted pairwise classification. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1057–1064, New York, NY, USA, 2009. ACM.

Quantum Mechanics and Information Retrieval

How can quantum mechanics be related to information retrieval? At this year’s ECIR (2012), Dr. Benjamin Piwowarski and Dr. Massimo Melucci gave a tutorial on Quantum Information Access and Retrieval.

The ideas introduced in the slides are still somewhat difficult to follow. However, the tutorial demonstrates that many aspects of IR have been developed or rephrased under the quantum theory (QT) framework, such as relevance theory, mixtures of models, multi-document summarization and relevance feedback.

The biggest current initiative in this area is the Quantum Interaction conferences.

Axiomatic Analysis and Optimization of Information Retrieval Models

This is an “unusual” research direction in Information Retrieval (IR). By comparing and analyzing different IR models in a formal way, the axiomatic framework can reveal some interesting and even astonishing properties of IR models. For instance, it can show that IR models should satisfy a certain number of constraints. If a model fails to satisfy some of them, we can expect it to perform worse. This kind of comparison requires no experiments at all, although the claims are indeed supported by empirical studies.


  • Axiomatic Analysis and Optimization of Information Retrieval Models by ChengXiang Zhai at ICTIR11 [Slides]
  • Yuanhua Lv, ChengXiang Zhai. Lower-Bounding Term Frequency Normalization. Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM’11), 2011. [PDF]
  • Hui Fang, Tao Tao, ChengXiang Zhai. Diagnostic Evaluation of Information Retrieval Models. ACM Transactions on Information Systems (TOIS) 29, 2, Article 7 (April 2011), 42 pages. [PDF]
  • Hui Fang, ChengXiang Zhai. Semantic Term Matching in Axiomatic Approaches to Information Retrieval. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’06), pages 115-122. [PDF]
  • Hui Fang, ChengXiang Zhai. An Exploration of Axiomatic Approach to Information Retrieval. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’05), pages 480-487, 2005. [PDF]
  • Hui Fang, Tao Tao, ChengXiang Zhai. A formal study of information retrieval heuristics. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’04), pages 49-56, 2004. [PDF]
  • Hui Fang’s PhD dissertation. [PDF]

SIGIR 2010 Paper Reading

In this post, I would like to talk about several interesting papers from SIGIR 2010. Note that this only reflects my own view of these works and is not necessarily correct or thorough.

  • On Statistical Analysis and Optimization of Information Retrieval Effectiveness Metrics
    This paper is more theoretical than practical. The main contribution is the argument that the optimal ranking problem should be factorized into two distinct yet interrelated stages: the relevance prediction stage and the ranking decision stage. The paper shows that a number of IR metrics (e.g., Average Precision, DCG, Reciprocal Rank) can be decomposed into these two stages. The overall strategy is therefore to optimize the decomposed metrics directly. The authors show improved performance over simple language models. However, the paper does not compare against Learning to Rank techniques, where the metrics are also optimized. In all, this is an interesting paper for those who really work in the ad-hoc retrieval field.
  • Evaluating and Predicting Answer Quality in Community QA
    This paper is straightforward. The authors want to predict the best answers (in their words, “answer quality”) on Community QA sites. They first used a number of subjective features obtained from Amazon Mechanical Turk and found it difficult to do so. Then, they used a set of automatically extracted features (mostly meta-information features) and showed improved performance. The work is simple and indeed related to my work in SIGIR 2009. It still does not answer the question of whether a so-called “best answer” is truly the best among all answers to the corresponding question. Moreover, classification approaches are not compared to retrieval-based methods in this paper.
  • Segmentation of Multi-Sentence Questions: Towards Effective Question Retrieval in cQA Services
    This is another QA paper. The work extends a number of previous works. For example, for “question detection”, the authors propose a one-class SVM method to obtain the training dataset. In addition, they propose a graph-based method to segment questions into multiple sub-questions. Overall, the authors show that their method gives a significant boost to question matching and retrieval compared to traditional bag-of-words methods. Additionally, they show that sequential pattern mining and syntactic pattern mining can also improve the performance of question detection. One thing that remains unclear is which retrieval model the authors used.
  • Multi-Style Language Model for Web Scale Information Retrieval
    This paper is interesting and makes two points. First, it shows the significant gap between the query language model and the document language model, and demonstrates that the anchor and title language models are closer to queries. Second, it shows how to estimate a language model under an open, namely infinite, vocabulary. The problem for an open-vocabulary language model is how to assign probability mass to unseen terms and how to adjust the mass of seen terms. The paper presents a simple method with closed-form expressions. This “smoothed” language model is also embedded in a multi-component language model that uses multiple fields of a document.
  • Mining the Blogosphere for Top News Stories Identification
    This paper is straightforward and interesting. The problem addressed is to rank news stories for a given day according to the blogosphere. Here, the authors treat the “date” as the query. The overall approach falls within the language modeling framework. To estimate how likely blog posts are to be relevant to the query date, the authors use a clustering method to group blog posts into topics and estimate the query language model from these clusters. The news headline language model is estimated by a standard Dirichlet-smoothed language model. Then, the distance between language models is calculated using KL-divergence. The authors propose two heuristics to identify the importance of news stories. In all, the paper is well written and well organized. However, it is not clear why the authors use a clustering algorithm rather than a multiple-document representation for a blog. In addition, several important parameters are tuned manually, for example, the spread of a news story. This prevents the system from being used in real applications.
  • Serendipitous Recommendations via Innovators
    This paper highlights an interesting yet not heavily explored area in recommender systems: the “surprise” of recommendations. The author argues that a recommender with high accuracy may not help users much, since most recommended items are popular items that users would discover anyway. If a recommender wants to show something really interesting, it should provide items that users would not find without help. Therefore, the author proposes to use “time” as a measure of the success of a recommendation. However, the algorithm proposed in the paper is not very intuitive. Still, I think it is an interesting paper and worth reading.
  • Temporal Diversity in Recommender Systems
    This paper is simple and easy to follow. The main idea is to examine the temporal dynamics of recommender systems, especially on Netflix. One “obvious” observation is that users lose “patience” when they see the same recommendations over time. Therefore, the authors claim that diversity should be taken into account by recommenders.

Notes on Language Models (1)

Query Likelihood Language Models

The basic idea behind Query Likelihood Language Models (QLLM) is that a query is a sample drawn from a language model. In other words, given a document language model \( \theta_{D} \), we want to compute how likely it is that the posed query \( Q \) would be generated. Formally, this is expressed as \( P(Q|\theta_{D}) \). Two questions immediately arise from this formulation: (1) Which model should we choose to represent \( \theta_{D} \)? (2) How do we estimate \( \theta_{D} \)?

Multinomial Language Model

One popular choice for \( \theta_{D} \) is the multinomial distribution. The original multinomial distribution over \( k \) outcomes, with \( n = \sum_{i=1}^{k} x_{i} \) trials, is

\[
P(X_{1}=x_{1},X_{2}=x_{2},\ldots,X_{k}=x_{k})=\frac{n!}{x_{1}! \, x_{2}! \cdots x_{k}!} \, p_{1}^{x_{1}} p_{2}^{x_{2}} \cdots p_{k}^{x_{k}}
\]

In language modeling, we usually ignore the multinomial coefficient, which yields the unigram language model, in which the order of the text sequence does not matter. Using the multinomial distribution to model \( \theta_{D} \), we obtain

\[
P(Q|\theta_{D}) = \prod_{w \in Q} P(w | \theta_{D}) =\prod_{w \in V} P(w | \theta_{D})^{c(w,Q)}
\]

where \( c(w,Q) \) is the number of times term \( w \) appears in query \( Q \). Now the problem becomes estimating \( P(w|\theta_{D}) \). In theory, we need to estimate it for all terms in the vocabulary. However, since terms with \( c(w,Q) = 0 \) (that is, terms that do not appear in the query) contribute a factor of 1, we only care about the terms in the query.
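As a small illustration of the product above, the following sketch scores a query in log space, assuming the probabilities \( P(w|\theta_{D}) \) are already available as a dictionary. The function name `query_log_likelihood` and the toy model are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, term_prob):
    """log P(Q | theta_D) = sum over query terms of c(w, Q) * log P(w | theta_D).

    Only terms that occur in the query are visited, since all other
    vocabulary terms have c(w, Q) = 0 and contribute nothing to the sum.
    """
    total = 0.0
    for term, count in Counter(query_terms).items():
        prob = term_prob.get(term, 0.0)
        if prob == 0.0:
            # A single unseen term drives the whole product to zero.
            return float("-inf")
        total += count * math.log(prob)
    return total


# Toy document model (an assumption for illustration).
toy_model = {"language": 0.5, "model": 0.25, "query": 0.25}
score = query_log_likelihood(["language", "model", "language"], toy_model)
# exp(score) is 0.5 * 0.5 * 0.25 = 0.0625
```

The `-inf` case shows why an unsmoothed model is fragile: any query term missing from the model rules the document out entirely.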

The key point here is that we do not know \( \theta_{D} \)! How can we calculate \( P(w|\theta_{D}) \) for the terms in the query when we do not have a model in hand? One way is to use document \( D \) as a sample to estimate \( \theta_{D} \). Essentially, we choose the model \( \hat{\theta}_{D} = \arg\max_{\theta} P(D|\theta) \).

One simple way to estimate \( \theta_{D} \) is the maximum likelihood (ML) estimator. Let us derive the ML estimate for the multinomial language model. First, we usually work with the log-likelihood rather than the product of probabilities, to avoid very small numbers: \( \log P(D|\theta) = \sum_{w \in V} c(w,D) \log P(w|\theta) \). Then, using a Lagrange multiplier \( \lambda \) to enforce the constraint \( \sum_{w \in V} P(w|\theta) = 1 \), we obtain:

\[
L =\log P(D|\theta) + \lambda \left(1-\sum_{w \in V} P(w|\theta)\right)
\]

Taking the derivatives with respect to \( P(w|\theta) \) and \( \lambda \) and setting them to zero:

\[
\frac{\partial L}{\partial P(w|\theta)} = \frac{c(w,D)}{P(w|\theta)}-\lambda = 0
\]
\[
\frac{\partial L}{\partial \lambda } = 1 - \sum_{w \in V} P(w|\theta) = 0
\]

From the first equation, we get \( P(w|\theta) = \frac{c(w,D)}{\lambda} \). Summing over the vocabulary and using the second equation, \( \sum_{w \in V} c(w,D) = \lambda \sum_{w \in V} P(w|\theta)=\lambda \). Therefore,

\[
P(w|\theta) = \frac{c(w,D)}{\sum_{w^{\prime} \in V} c(w^{\prime},D)} =\frac{c(w,D)}{|D|}
\]
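The closed-form result is simply the relative frequency of each term in the document. A minimal sketch, with a hypothetical function name `mle_unigram` and a toy whitespace-tokenized document:

```python
from collections import Counter


def mle_unigram(document_terms):
    """Maximum likelihood unigram model: P(w | theta) = c(w, D) / |D|."""
    counts = Counter(document_terms)
    length = len(document_terms)
    return {term: count / length for term, count in counts.items()}


# Toy document (an assumption for illustration), |D| = 6.
toy_document = "the cat sat on the mat".split()
toy_unigram = mle_unigram(toy_document)
# "the" occurs twice, so P("the" | theta) = 2 / 6
```

Terms that do not occur in the document are absent from the dictionary, which corresponds to a probability of zero under the ML estimate.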