## Structured Sparse Regression for Recommender Systems

Feature based latent factor models have received increasing attention in recent years due to its capability to effectively solve the cold-start problem. There have been many feature based collaborative filtering (**CF**) models proposed recently, which can be grouped into two categories. The first type of models includes all variants of latent factor models (**LFM**) which have been proven as an effective approach to personalization and recommender systems. The core of **LFM** is to learn user-specific and item-specific features from user-item interactions and utilize these features for future predictions/recommendations. State-of-the-art **LFM** exploits low-rank latent spaces of users and features and treats latent factors that are learnt from user-item historical data as features. This type of models has gained significant successes in a number of applications, including the Netflix competition. The second category is the factorization machine (**FM**), which explicitly learns the mapping function from features to rating score circumventing the dependency on user/item latent factors as in the latent factor models, resulting in an effective model for the cold start problem [9].

Although these feature-based **CF** models have been shown to be effective, they do not utilize the feature structure information. For example, conventional latent factor models (e.g., matrix factorization or tensor factorization models) like **RLFM** [1, 2] learn mapping functions from user/item features to user/item latent vectors assuming the features have a flat first-order structure. Later, [10] showed that this kind of mapping can be extended to any non-linear models. Although the formalism is flexible, it leaves too much room for practitioners to choose which non-linear model to use for a particular application. Also, it is hard to incorporate human prior knowledge on the feature structure into the framework, unless through careful feature engineering, and the proposed inference algorithm is difficult to use in large-scale settings. Similar to **RLFM**, Gantner et al. [4] proposed a model to explicitly learn the mapping function from features to latent factors, resulting in an effective model for the cold start problem. But it still makes the flat first-order feature structure assumption. In the other line of work, Rendle et al. [9] proposed a more compact model called factorization machine **FM**. Basically **FM** is a second-order regression model which directly maps the user-item-event concatenated features to rating score by learning the implicit mapping functions from features to latent factors, resulting in an effective model for the cold start problem. However, the issue of encoding structural human prior information still boils down to sophisticated feature engineering and it’s not clear how to incorporate the heterogeneous feature structures into model training to enhance the rating prediction performance. Though **FM** considers the second-order feature structure, it simply uses all the feature pairs for prediction.

Quite a lot of work in sparse coding area have shown that many signals tend to have a sparse representation from basic components in nature, and a sparse model often outperforms a dense model and also has the variable selection effect. Inspired by this, a sparse model that uses an appropriate subset of feature pairs might have a better performance. In practices, human prior knowledge or explicit structure information about these features is also sometimes available. For example, the topical categories on news articles may naturally be organized into hierarchies or graphs. Another good example would be demographical information about users, especially their geo-locations that are aligned with countries, states and cities, defined in real-world geo-political settings. These prior knowledge and structures are invaluable information for better user understanding and profiling and eventually better recommendation results. However, it is not straightforward to encode this kind of structural prior knowledge into state-of-the-art recommendation models like **RLFM** and **FM**. One approach might be to construct features capturing these structures and embed them into regression models. But the interplay between an optimal way to construct such features and train a better regression model based on these features to map to latent features becomes non-trivial in this case. Some previous work has been proposed to impose structural information on latent variable models, which are not necessarily directed graphical models. For instance, He et al. [5] proposed a general learning framework which induces sparsity on the undirected graphical model imposed on the vector of latent factors. Although the paper shares a similar idea with our framework, their work cannot handle heterogeneous types of features and complex dependencies. Also, it is hard to link their work to state-of-the-art **LFM** used in **CF** settings. Along the line of undirected graphical models, Min et al. [8] proposed sparse high-order Boltzmann machines, aiming to capture dependencies between latent variables. Again, it is not obvious to plug the model into state-of-the-art **CF** approaches. In this paper, we propose a structured sparse second-order regression model with structural prior knowledge in a principled way. The notion of types of features is introduced such that different types of features would have different structures (e.g., topical categories versus geographical locations). We consider two kinds of structures. For inter-typed features (which are of different kinds), the model is able to learn sparse relationships between different types of features (e.g., age and gender). For intra-typed features (which are in the same kind) that have a hierarchy (tree), e.g., we have the country-state-city hierarchical tree for the geo-location feature terms, the model learns a sparse hierarchical structure on the tree such that if a parent feature edge (or interchangeably a feature pair) is selected, then all its descendant edges should be selected as well, and if a parent edge is removed then all its descendant edges should be removed too.

You can view the paper in details here: [PDF].

### Reference

- D. Agarwal and B.-C. Chen. Regression-based latent factor models. In Proceedings of KDD, pages 19–28. ACM, 2009.
- D. Agarwal and B.-C. Chen. fLDA: matrix factorization through latent Dirichlet allocation. In Proceedings of WSDM, pages 91–100. ACM, 2010.
- T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu. SVDFeature: A toolkit for feature-based collaborative filtering. The Journal of Machine Learning Research, 13(1):3619–3622, Dec. 2012.
- Z. Gantner, L. Drumond, C. Freudenthaler, S. Rendle, and L. Schmidt-Thieme. Learning attribute-to-feature mappings for cold-start recommendations. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, pages 176–185, 2010.
- Y. He, Y. Qi, K. Kavukcuoglu, and H. Park. Learning the dependency structure of latent factors. In Advances in Neural Information Processing Systems 25, pages 2375–2383. 2012.
- R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. The Journal of Machine Learning Research, 12:2297–2334, 2011.
- Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, Aug. 2009.
- M. R. Min, X. Ning, C. Cheng, and M. Gerstein. Interpretable sparse high-order boltzmann machines. In AISTATS, pages 614–622, 2014.
- S. Rendle. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology, 3(3):57:1–57:22, May 2012.
- L. Zhang, D. Agarwal, and B.-C. Chen. Generalizing matrix factorization through flexible regression priors. In Proceedings of RecSys, pages 13–20. ACM, 2011.
- P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, pages 3468–3497, 2009.