Research in General


KDD 2024

This year’s KDD was in Barcelona, Spain. It was the first time I visited the country and the city. Barcelona has a unique charm as a good combination of old Roman districts as well as 20 century art. The conference has more than 2300 participants, another year of high excitement despite economic challenges that made fewer corporations set up a booth to hire.

Here are some highlights from the conference.

Tutorials

Workshops

Selected Papers

Ranking/Recommender System

  • Enhancing Pre-Ranking Performance: Tackling Intermediary Challenges in Multi-Stage Cascading Recommendation Systems [Link]
    • From Ant Group
    • Large-scale recommender systems have 3 stages: Recall, Pre-Ranking and Ranking. Pre-Ranking is a selection stage that the team uses to sample down items from Recall to Ranking to reduce the size. The paper proposed a framework to tackle sample biases and model consistent performance issues for such 3 stage systems. The deployed system showed 7% improvement over the baseline.
    • [Note]: Worth reading to see how different  companies tackle multi-stage challenges.
  • Achieving a Better Tradeoff in Multi-stage Recommender Systems through Personalization [Link]
    • From Meta
    • In this paper, the authors claimed that a key observation is that, all else equal, ranking more items indeed improves the overall objective but has diminishing returns. With this observation, the authors proposed a framework to exploit the trade-off of performance and computation cost. The real-world experiments showed the effectiveness of the proposed framework.
    • [Note]: Worth reading to see how different  companies tackle multi-stage challenges.
  • On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-$n$ Recommendation [Link]
    • From ShareChat
    • The paper formally presented the assumptions that are necessary to consider DCG an unbiased estimator of online reward, providing a derivation for this metric from first principles whilst linking it to off-policy estimation. The authors then proved that the widespread practice of normalizing the DCG metric renders it inconsistent with respect to DCG, in that the ordering given by nDCG can differ from that given by DCG, and provide empirical evidence. Empirical results from off- and online experiments on a large scale recommendation platform show that the unbiased DCG metric strongly correlates with online metrics over time, whereas nDCG does not, whilst differences in online metrics directionally align with differences in both nDCG and DCG, the latter can enjoy improved sensitivity to detect statistically significant online improvements.
    • [Note]: Worth reading as the results are a bit surprising and might be worth examining whether it can be applied to a wide range of use cases.

Advertising

  • Offline Reinforcement Learning for Optimizing Production Bidding Policies [Link]
    • From Meta
    • The authors proposed a generalizable approach to optimizing bidding policies in production environments by learning from real data using offline reinforcement learning. This approach can be used to optimize any differentiable base policy, and only requires data generated by the base policy itself. The  approach does not incur additional infrastructure, safety, or explainability costs, as it directly optimizes the parameters of existing production routines without necessarily replacing them with black box-style models like neural networks.
    • [Note]: Worth reading as it is built on top of existing production systems and to exploit offline RL to improve them.
  • Joint Auction in the Online Advertising Market [Link]
    • From Meituan
    • The authors innovatively proposed a joint advertising model termed “Joint Auction”, allowing brand suppliers and stores to collaboratively bid for advertising slots, catering to both their needs.
    • [Note]: Worth reading as it shows a novel product offering with a newly designed auction mechanism.
  • Bi-Objective Contract Allocation for Guaranteed Delivery Advertising [Link]
    • From Chinese Academy of Sciences and Alibaba
    • Guaranteed Delivery (GD) advertising work with two different stages, namely, the offline selling stage and the online serving stage. The former deals with contract allocation, and the latter fulfills the impression allocation of signed contracts. Existing work usually handles these two stages separately. For example, contracts are formulated offline without concerning practical situations in the online serving stage. Therefore, this paper addresses in this paper a bi-objective contract allocation for GD advertising, which maximizes the impressions, i.e., Ad resource assignments, allocated for the new incoming advertising orders, and at the same time, controls the balance in the inventories. Since the proposed problem is high dimensional and heavily constrained, we design an efficient local search that focuses on the two objectives alternatively.
    • [Note]: Worth reading as a good reference for GD knowledge in general.
  • Optimized Cost Per Click in Online Advertising: A Theoretical Analysis [Link]
    • From The Hong Kong University of Science and Technology
    • Optimized Cost Per Click (OCPC) and Optimized Cost Per Mille (OCPM) have emerged as the most widely adopted pricing models in the online advertising industry. However, the existing literature has yet to identify the specific conditions under which these models outperform traditional pricing models like Cost Per Click (CPC) and Cost Per Action (CPA). To fill the gap, this paper builds an economic model that compares OCPC with CPC and CPA theoretically, which incorporates out-site scenarios and outside options as two key factors. The analysis reveals that OCPC can effectively replace CPA by tackling the problem of advertisers strategically manipulating conversion reporting in out-site scenarios where conversions occur outside the advertising platform. Furthermore, OCPC exhibits the potential to surpass CPC in platform payoffs by providing higher advertiser payoffs and consequently attracting more advertisers.
    • [Note]: Worth reading as a background paper for OCPC and OCPM.
  • An Efficient Local Search Algorithm for Large GD Advertising Inventory Allocation with Multilinear Constraints [Link]
    • From Chinese Academy of Sciences and Alibaba
    • As the requirements of advertisers become more and more diverse and fine-grained, the focus ratio requirement, which states that the portion of allocated impressions of a designated contract on focus media among all possible media should be greater than another contract, often appears in. business scenarios. However, taking these requirements into account brings hardness for the GD advertising inventory allocation as the focus ratio requirements involve non-convex multilinear constraints. Existing methods which rely on the convex properties are not suitable for processing this problem, while mathematical programming or constraint-based heuristic solvers are unable to produce high-quality solutions within the time limit. Therefore, the authors proposed a local search framework to address this challenge. It incorporates four new operators designed for handling multilinear constraints and a two-mode algorithmic architecture.
    • [Note]: Worth reading as a good reference for GD knowledge in general.

Experimentation

  • Learning Metrics that Maximise Power for Accelerated A/B-Tests [Link]
    • From ShareChat
    • The authors proposed to tackle this by learning metrics from short-term signals that directly maximize the statistical power they harness with respect to the North Star. The authors showed that existing approaches are prone to overfitting, in that higher average metric sensitivity does not imply improved type-II errors, and propose to instead minimize the p-values a metric would have produced on a log of past experiments. Empirical results show that they are able to increase statistical power by up to 78% when using learnt metrics stand-alone, and by up to 210% when used in tandem with the North Star.
    • [Note]: Worth reading for better learning proxy metrics.
  • Choosing a Proxy Metric from Past Experiments [Link]
    • From Google
    • The authors introduced a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments. The procedure first reduces the construction of an optimal proxy metric in a given experiment to a portfolio optimization problem which depends on the true latent treatment effects and noise level of experiment under consideration. They then denoise the observed treatment effects of the long-term metric and a set of proxies in a historical corpus of randomized experiments to extract estimates of the latent treatment effects for use in the optimization problem,
    • [Note]: Worth reading for better learning proxy metrics.
  • Metric Decomposition in A/B Tests [Link]
    • From Airbnb
    • In this article, authors proposed a new direction for sensitivity improvement via treatment effect augmentation whereby a target metric of interest is decomposed into components with high signal-to-noise disparity. Inference in the context of this decomposition is developed using both frequentist and Bayesian theory.
    • [Note]: Worth reading to understand how to further reduce metrics variance.
  • False Positives in A/B Tests [Link]
    • From A/B experimentation guru: Ronny Kohavi
    • The authors begin with motivation about why false positives are expensive in many software domains.  They offer several approaches to estimate the true success rate of experiments, given the observed “win” rate (statistically significant positive improvements), and show examples from Expedia and Optimizely. They offer a modified procedure for experimentation, based in sequential group testing, that selectively extends experiments to reduce false positives, increase power, at a small increase to runtime. They conclude with a discussion of the difference between ideas and experiments in practice, terms that are often incorrectly used interchangeably.
    • [Note]: A lot of practical suggestions for A/B testing.

LLM Labels

  • Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I. [Link]
    • From Google
    • In this paper, the authors proposed two methods based on prediction powered inference and conformal risk control that utilize computer- generated relevance annotations to place reliable confidence intervals ( CI s) around IR evaluation metrics. The proposed methods require a small number of reliable annotations from which the methods can statistically analyze the errors in the generated annotations. Using this information, we can place CI s around evaluation metrics with strong theoretical guarantees.
    • [Note]: A really cool paper to illustrate how to systematically evaluate LLM generated labels.


KDD 2022 1

Last week, I attended KDD 2022 in Washington DC, which was the first offline conference I have ever attended since WSDM 2020, almost two and half years ago in Houston. Although the COVID-19 pandemic still had a sizeable impact on the event as the majority of researchers and companies from China could not make it to the conference, participants demonstrated high enthusiasm and hoped all could return back to the prior norm before the pandemic.

Let me share a couple of thoughts regarding the conference.

First of all, Graph Neural Networks (GNN) is literally everywhere, in terms of applications in almost all domains, as well as how industry and academic both show tremendous interest in them, which has reminded me of how Graph Mining has gone viral during early years in the 2000s due to the rise of Social Networks and World Wide Web. Surprisingly, topics like PageRank, Community Detection, and Graph Laplacian have been rarely mentioned. Of course, this is not just for Graph Mining, other sub-fields such as Recommender Systems, Text Mining, and Information Retrieval have also been completely rewritten since the surge of Deep Learning.

Secondly, Causal Inference has become a much more mainstream topic than before where it is been utilized in many applications including Industry solutions. One thing worth mentioning is that many researchers still use Causal Inference as an intermediate step to further improve the prediction performance in some downstream tasks. However, it has been broadly ignored how Causal Inference could bring more insights to the problems at hand, for instance, few papers have talked about Causal Effect and its estimation. Moreover, the majority of papers in Causal Inference are essentially Observational Studies. But they do not offer discussions around data and assumptions necessary to have meaningful and robust causal estimation, which could be misleading and risky.

Thirdly, some research topics have declined dramatically. For instance, while Optimization once was a pretty hot topic given the popularity of large-scale machine learning, the need for understanding and improving it has been greatly reduced due to the rise of various of Deep Learning frameworks. Most researchers can reliably use these frameworks to achieve reasonable performance without understanding underlying optimization algorithms. In addition, it is surprising to observe that Probabilistic Modeling has been abandoned altogether. Now, few papers start with assumptions about how data is generated. Although it is arguably applicable to all scenarios, Probabilistic Modeling has its own advantages to describe certain data generation assumptions and help us understand certain problems better, which is quite different and possibly complementary from the computational point of view, which dominates the modeling language since the rise of Deep Learning.

Lastly, all Keynote Speakers delivered quite excellent talks, with their own colors. Lise Getoor from UC Santa Cruz is a prominent scholar from the last-round rise of Graph Mining. Some of the techniques like Collective Classification mentioned in the talk have been widely used and even included in some textbooks in the 2000s, which are completely ignored today. Her research is still around classical Graph Mining, especially in the domain of reasoning using graph structures. On the other hand, Milind Tambe from Harvard University has a surprisingly good talk. Rather than focusing on technologies and algorithms, Milind told stories about how to apply AI for social goods and how to build applications to impact real-world people’s life, which sent a strong message to the conference. The last speaker Shuang-Hua Teng from USC gave a rather theoretical talk, discussing how to balance the relationship between heuristics and theory. One thing quite interesting from the talk was that Shuang-Hua clearly categorized quite a number of widely used algorithms into heuristics, even though they might be considered theoretical-oriented algorithms by practitioners.

Compared to KDD 2019 that I attended in Alaska 3 years ago, this year’s KDD is different in the sense of the number of participants, the number of companies, and the diversity of research topics. It is quite obvious that we are around the corner of the last wave from Deep Learning. I’m looking forward to meeting with friends next time in LA.


FAQ About Tensor Methods

Anima Anandkumar is a prominent researcher about tensor methods (a.k.a., spectral methods) in machine learning. Recently, she has a QA session on Quora and I gathered some of her answers particular with tensor methods as follows:

  1. What are some benefits and drawbacks of using tensor methods as opposed to more traditional techniques in machine learning?
    The main gain is in terms of computation:
    — a) tensor methods are embarrassingly parallel and scalable to  large problems
    — b) they can build on efficient linear algebraic libraries, but are much more powerful and informative compared to matrix methods.
    On the other hand, tensor methods are not sample efficient, meaning they require more samples than EM to reach the same level of accuracy (assuming computation is not an issue). Improving statistical efficiency of spectral methods is an ongoing research topic
  2. What are your thoughts on the statistical efficiency of spectral methods? Do you think that they are competitive as they stand?
    The short answer is that, MLE is sample efficient but may be difficult to compute while tensor methods (moment matching) is relatively easy to compute but sample inefficient. Some remedies are mentioned in the answer.
  3. How are Tensor methods used in deep learning?
    The short answer is that, currently used limited.
  4. What are the best resources for starting with Tensor Analysis?
    See her webpage for a start.

For detailed QA, please refer to Quora website.


Slice Sampling for LDA Hyperparameters

For latent Dirichlet allocation (LDA) hyper-parameters, a typical approach is to utilize Monte-Carlo EM approach where E-step is approximated by Gibbs sampling while M-step is to perform a gradient-based optimization approach to optimize Dirichlet parameters. Such approach is implemented in Mallet package. For the detailed information about estimating Dirichlet parameters, see Tom Minka’s technical report.

Another less explored approach is to go with full Bayesian notion, interleaving sampling topic assignments and hyper-parameters. To sample hyper-parameters, as there is no closed-form solution, Slice Sampling is usually utilized. For introduction for Slice Sampling, see Radford M. Neal’s paper. For its application in LDA, please see the following two references:

  1. Hanna Wallach’s course note.
  2. Chapter 2 of Hanna Wallach’s dissertation.

We implemented the method in Fugue’s LDA algorithm.


Tricks in Sampling Discrete Distributions (2) – Binary Search 1

As mentioned in the previous post, it is tricky to sample from discrete distributions, here we demonstrate yet another important trick to do it right. No matter you do it in the original space, or in the log-space. Basically, you can easily come up some code snippet like this (we are using Java as an example here):

1
2
3
4
5
6
7
8
9
public int sample(){
    double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1];
    int index = -1;
    for (index = 0; index > p.length; index++) {
        if (u > p[index])
            break;
    }
    return index;
}

where \( p \) is the accumulated un-normalized probabilities. The time complexity is \( \mathcal{O}(N) \) when \(N \) equals the number of items in the array \( p \).

It turns out that, the above code can be easily optimized to \( \mathcal{O}(\log N ) \) by using Binary Search. The reason is quite simple. The accumulated un-normalized probabilities, which are stored in \( p \), by its definition, are sorted. Therefore, binary search can be utilized. In particular, we want to find the smallest key that is greater than the random generated number \( u \). This function is called ceiling in Algorithms in Section 3.1. We implemented it in our context as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
public int sample(){
    double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1];
    int lower = 0;
    int upper = p.length - 1;
    while (lower >= upper){ 
        int mid = lower + (upper - lower) / 2; 
        if((p[mid] - u) > 0){
            upper = mid - 1;
        }
        else{
            lower = mid + 1;
        }
    }
    return lower;
}

Interestingly, even though this trick seems trivial, it is not mentioned in many literature and only discussed:


Tricks in Sampling Discrete Distributions (1) – Sampling in Log-Space


One critical component in Gibbs sampling for complex graphical models is to be able to draw samples from discrete distributions. Take latent Dirichlet allocation (LDA) as an example, the main computation focus is to draw samples from the following distribution:

(1)   \begin{equation*}P(z_{i} = k \, | \, \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto (n_{d,k}^{-i} + \alpha_{k}) \frac{n_{w_{i}, k}^{-i} + \beta_{w_{i}}}{n_{., k}^{-i} + \sum_{v} \beta_{v}}\end{equation*}


where n_{d,k}^{-i} is the number of tokens in the document d assigned to the topic k, excluding the token i, n_{w_{i}, k}^{-i} is the number of times token w_{i} assigned to the topic k, excluding i, and n_{.,k}^{-i} is the total number of tokens assigned to the topic k.

So, a straightforward sampling algorithm works as follows:

  1. Let c_{k} be the right-hand side of Equation \eqref{eq:lda} for topic k, which is an un-normalized probability.
  2. We compute the accumulated weights as: C[i] = C[i-1] + c_{i} , \,\, \forall i \in (0, K-1] and C[0] = c_{0}.
  3. Draw u \sim \mathcal{U}(0, C[K-1] ) and find t = \arg\min_{i} \left( x_{i} \right) where x_{i} = C[i] - u and x_{i} > 0

The last line is essentially to find the minimum index that the array value is greater than the random number u.

One difficulty to deal with \eqref{eq:lda} is that the right hand side might be too small and therefore overflow (thinking about too many near-zero numbers multiplying). Thus, we want to deal with probabilities in log-space. We start to work with:

(2)   \begin{equation*}\log P(z_{i} = k \, | \, \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto \log (n_{d,k}^{-i} + \alpha_{k}) + \log (n_{w_{i}, k}^{-i} + \beta_{w_{i}}) - \log ( n_{., k}^{-i} + \sum_{v} \beta_{v} )\end{equation*}

and store in c_{k} but remember that now each value represents un-normalized log probability. The next step is to compute the accumulated weights, this time as accumulated probabilities but in log-space! Thanks to the trick mentioned in [Notes on Calculating Log Sum of Exponentials], we are able to compute log sum efficiently. Please use the last equation there to compute the accumulated weights. The last step is to draw the random number. We compute u \sim \mathcal{U}(0, 1) + \log C[K-1] and again find the minimum index that satisfy the array value is greater than u.

Notes:

  1. The log-sampling algorithm for LDA is implemented in Fugue Topic Modeling Package.
  2. Unless you really face the issue of overflow, sampling in log-space is usually much slower than the original space as log and exp are expensive functions to compute.

Federated Optimization: Even More Personalized World?

Like the previous post about the personalized models, another NIPS workshop paper discussed a related topic, yet, from another perspective:

The authors introduces a new setting of the learning problem in which data are distributed across a very large number of computers, each having access only to few data points. This is primarily motivated by the setting, where users keep their data on their devices, but the goal is still to train a high quality global model.

Although it looks like very different from the paper described in the previous post, two pieces can be linked together for two points:

  • It is important to train a high quality local or personalized model by utilizing the global model or vice versa, having a business management system is great.
  • It is very important to understand the interplay of the global mode land the local model as well.

These work can raise interesting new directions, like how to serve/update models that are fully personalized on mobile devices.


Serving Personalized Models

In the recent NIPS 2016 Workshop on Machine Learning Systems, one paper attracts my attention:

The central idea is very simple: we need to learn and serve personalized models on top of a global model to users. The argument of such setting is two-folds:

  1. A global model might be hard to train, given the size of the model. It usually takes significant amount of computing efforts.
  2. Depending on what types of model, it might be even difficult to serve and update the global model as well.

So, it is very natural that, the model for each individual user is trained separately while it is derived from a global model. The paper demonstrates a particular way of deriving such a model. But there could be many different ways of doing this.

Of course, this is not the first such reasoning. As the paper mentioned, prior work in multi-task learning has formalized similar problems. However, it might be the first time from the system perspective to show the advantages of having personalized models.


Machine Learning as an Engineering Discipline

Recently, Leon Bottou delivered two exiting keynote speeches on machine learning as an engineering discipline. I suggest that anyone who is using machine learning techniques to build software needs to read through them at least once. There are deep thinking about machine learning and software engineering.

  1. Two big challenges in machine learning given on ICML 2015.
  2. How Big Data changes Statistical Machine Learning given on IEEE BigData 2015.