Weighted Approximately Ranked Pairwise loss (WARP)

Definition
To focus more on the top of the ranked list, where the top $$k$$ positions are those we care about using the precision at $$k$$ measure, one can weigh the pairwise violations depending on their position in the ranked list. For pair-wise learning procedure, we construct a set of all positive labelled instances, denoted as $$\mathcal{C}_{u}^{+}$$and a set of negative labelled instances as $$\mathcal{C}_{u}^{-}$$. The loss is defined as:
\begin{align} \mbox{err}_{\mbox{WARP}}(\mathbf{x}_{i}, y_{i}) = L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))] \end{align}where $$rank(f(y_{i} \, | \, \mathbf{x}_{i}))$$ is a function to measure how many negative labelled instances are “wrongly” ranked higher than this positive example $$\mathbf{x}_{i}$$:
\begin{align} rank(f(y_{i} \, | \, \mathbf{x}_{i})) = \sum_{(\mathbf{x}^{\prime}, y^{\prime}) \in \mathcal{C}_{u}^{-}} \mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(y \, | \, \mathbf{x}_{i})] \nonumber \end{align}where $$\mathbb{I}(x)$$ is the indicator function, and $$L(\cdot)$$ transforms this rank into a loss:
\begin{align} L(r) = \sum_{j=1}^{r} \tau_{j}, \mbox{with} \; \tau_{1} \geq \tau_{2} \geq \cdots \geq 0. \end{align}Different choices of $$\tau$$ define different importance of the relative position of the positive examples in the ranked list. In particular:

• For $$\tau_{i} = 1$$ for all $$i$$ we have the same AUC optimization as margin ranking criterion.
• For $$\tau_{1} = 1$$ and $$\tau_{i > 1} = 0$$ the precision at $$1$$ is optimized.
• For $$\tau_{i \leq k} = 1$$ and $$\tau_{i > k}=0$$ the precision at $$k$$ is optimized.
• For $$\tau_{i} = 1/i$$ a smooth weighting over positions is given, where most weight is given to the top position, with rapidly decaying weight for lower positions. This is useful when one wants to optimize precision at $$k$$ for a variety of different values of $$k$$ at once.

The loss discussed above is also equal to:
\begin{align} \mbox{err}_{\mbox{WARP}}(\mathbf{x}_{i}, y_{i}) &= L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))] \times 1\nonumber \\ &= \frac{L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))] \sum_{(\mathbf{x}^{\prime}, y^{\prime}) \in \mathcal{C}_{u}^{-}} \mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(\mathbf{x}_{i})]}{rank(f(\mathbf{x}_{i}))} \nonumber \\ &= \sum_{(\mathbf{x}^{\prime}, y^{\prime}) \in \mathcal{C}_{u}^{-}} \frac{L[rank(f(y_{i} \, | \, \mathbf{x}_{i}))] \mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(\mathbf{x}_{i})]}{rank(f(y_{i} \, | \, \mathbf{x}_{i}))} \end{align}with the convention $$0/0=0$$ when the correct label $$y$$ is top-ranked. Given a label $$y$$ from the positive set, the $$rank$$ function essentially is the total number of labels from the negative set which violate the functional relationships. The probability of one negative label to be drawn, given a particular positive label, is:
\begin{align} P((y^{\prime}, \mathbf{x}^{\prime}) \, | \, (y_{i}, \mathbf{x}_{i})) = \frac{1}{rank(f(y_{i} \, | \, \mathbf{x}_{i}))} \end{align}Due to the discrete nature of identity functions, we can always replace them with hinge loss:
\begin{align} \mathbb{I}[f(y^{\prime} \, | \, \mathbf{x}^{\prime}) \geq f(y_{i} \, | \, \mathbf{x}_{i})] \approx \max(0, 1 – f(y_{i} \, | \, \mathbf{x}_{i}) + f(y^{\prime} \, | \, \mathbf{x}^{\prime})) \end{align}

Online Learning to Rank
The overall risk we want to minimize is:
\begin{align} Risk(f) = \int \hat{\mbox{err}}_{\mbox{WARP}}(\mathbf{x},y) dP(\mathbf{x},y) \end{align}An unbiased estimator of this risk can be obtained by stochastically sampling in the following way:

1. Sample a positive pair $$(\mathbf{x},y)$$ according to $$P(\mathbf{x},y)$$.
2. For the chosen $$(\mathbf{x},y)$$ sample a negative instance $$(\mathbf{x}^{\prime},y^{\prime})$$ such that $$1+f(y^{\prime} \, | \, \mathbf{x}^{\prime}) > f(y \, | \, \mathbf{x})$$.

This chosen negative instance as well as the positive instance has the contribution:
\begin{align}\label{eq:warp_single_contribution} L[rank(f(y \, | \, \mathbf{x}))] \max(0,1 – f(y \, | \, \mathbf{x}) + f(y^{\prime} \, | \, \mathbf{x}^{\prime})) \end{align}to the total risk, i.e. taking the expectation of these contributions approximates the risk because we have probability $$\frac{1}{rank(f(y \, | \, \mathbf{x}))}$$ of drawing $$(\mathbf{x}^{\prime}, y^{\prime})$$ in step 2 above (Remember that the $$rank$$ function is essentially to approximate the total number of violating negative instances). This might suggest that we can perform a stochastic update over parameters.

For large dataset, the stochastic optimization steps discussed above is inefficient:

1. In step (2) above, we need to compute $$f(y^{\prime} \, | \, \mathbf{x}^{\prime})$$ for all possible negative instances.
2. In single contribution ??, we need to calculate $rank$ function, which also requires to compute $f$ for negative instances.

Some approximation can be used. For step (2), we can sample labels with replacement until we find a violating label.

Now if there are $$k = rank(f(y\, | \, \mathbf{x}))$$ violating labels, we use random variable $$N_{k}$$ to denote the number of trials in the sampling step to essentially obtain an violating label. This random variable follows a geometric distribution of parameter as (here, the assumption is that the probability to sample the first negative label equals the probability to sample a negative label):
\begin{align} \frac{k}{Y-1} \end{align}where $$Y$$ is the total number of negative labels. Thus, we have $$k=\frac{Y-1}{\mathbb{E}[N_{k}]}$$. This suggests that the value of the $rank$ function may be approximated by:
\begin{align} rank(f(y \, | \, \mathbf{x})) \approx \left \lfloor \frac{Y-1}{N} \right \rfloor \end{align}where $$N$$ is the number of trials in the sampling.

References:
[1] Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35, October 2010.
[2] Nicolas Usunier, David Buffoni, and Patrick Gallinari. Ranking with ordered weighted pairwise classification. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1057–1064, New York, NY, USA, 2009. ACM.