

In this blog post, we discuss Abacus.AI’s NeurIPS 2022 paper on recommender systems.
Paper: https://arxiv.org/abs/2206.11886
Code: https://github.com/naszilla/reczilla
1 min video: https://www.youtube.com/watch?v=Eu8G3oNzcvU
5 min video: https://www.youtube.com/watch?v=NkNdxF5chZY
The concept of meta-learning is both simple and powerful: information about the past performance of machine learning models on various datasets can help predict the future performance of those models on new datasets. This concept underlies powerful pre-trained models for computer vision (such as ResNet and AlexNet) and large language models such as BERT. Given the success of meta-learning in vision and NLP, we explore whether meta-learning can be useful in another common ML domain: recommender systems. In case you’re not familiar with recommender systems, we’ll give a brief overview.
Recommender Systems
Whether you know it or not, you’ve probably interacted with a recommender system today. These systems are embedded in many consumer-facing applications, social media, and content hosting sites. As the name suggests, recommender systems essentially recommend items to users, where “items” might be consumer products (think Amazon) or content (think Netflix), and a “user” is anyone looking for one of these items.
Behind the scenes, recommender systems use information about all past user-item interactions to make new recommendations—for example, did user X buy product Y? Did user A give a five-star rating to movie B? Large consumer applications like Amazon or Netflix process billions of these interactions every year, so they have a lot of data to work with. Sometimes these systems also use data about specific users (their age, their purchasing history) and specific items (the color or price of a product) to make recommendations.
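To make this concrete for readers newer to the area, here is a minimal sketch (not from our paper) of how interaction data like this is commonly represented: as a sparse user-item matrix built from (user, item, rating) triples. The IDs and ratings below are made up purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy interaction log: (user, item, rating) triples -- made-up data for illustration.
interactions = pd.DataFrame({
    "user_id": ["A", "A", "B", "C", "C"],
    "item_id": ["movie1", "movie2", "movie1", "movie2", "movie3"],
    "rating":  [5, 3, 4, 2, 5],
})

# Map string IDs to contiguous integer indices.
user_index = {u: i for i, u in enumerate(interactions["user_id"].unique())}
item_index = {m: j for j, m in enumerate(interactions["item_id"].unique())}

rows = interactions["user_id"].map(user_index).to_numpy()
cols = interactions["item_id"].map(item_index).to_numpy()
vals = interactions["rating"].to_numpy(dtype=np.float32)

# Sparse user-item matrix: most entries are unobserved, which is why real
# recsys datasets with billions of interactions are still extremely sparse.
R = csr_matrix((vals, (rows, cols)), shape=(len(user_index), len(item_index)))
print(R.toarray())
```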
Meta-Learning for Recommender Systems
The first step toward meta-learning for recommender systems (recsys) is understanding which recsys algorithms perform well on which datasets. For this, we train and test a range of recsys algorithms on a large variety of datasets. Since most recsys algorithms are tuned using hyperparameters, we also test a range of hyperparameters for each algorithm. Overall, we test 24 recsys algorithms (each with up to 100 random hyperparameter sets, limited by a time budget) on 85 datasets, for a total of about 85,000 individual train-test experiments. (Aside: all of these results are publicly available via our GitHub repo above, in case you want to reproduce our results or run your own analyses!)
Generalizability of Recommender Systems
As a first step toward meta-learning, we use our experiment results to answer the question:
Q1: If a recsys algorithm performs well on one dataset, will it perform well on other datasets?
At first glance, two things are clear from our experiment results:
1. Every algorithm performs very well on at least one dataset
2. Every algorithm performs very poorly on at least one dataset
To show this, we rank the performance of all algorithms for every dataset (after tuning the hyperparameters of each algorithm for each dataset). The table below shows the best (smallest), worst (largest), and average rank of each algorithm, over all 85 datasets tested. (Aside: there are many performance metrics for recsys algorithms, and this ranking is based on several metrics—see our paper for details.)
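As a rough illustration of this ranking step, here is a sketch of how per-dataset ranks and the best/worst/average summary could be computed with pandas. The column names and scores are hypothetical placeholders, not the actual schema of our released results.

```python
import pandas as pd

# Hypothetical results table: one row per (dataset, algorithm) pair, where
# "score" is the (tuned) performance of the algorithm on that dataset.
results = pd.DataFrame({
    "dataset":   ["d1", "d1", "d1", "d2", "d2", "d2"],
    "algorithm": ["ItemKNN", "SLIM", "MF", "ItemKNN", "SLIM", "MF"],
    "score":     [0.31, 0.28, 0.35, 0.12, 0.40, 0.22],
})

# Rank algorithms within each dataset (rank 1 = best score on that dataset).
results["rank"] = results.groupby("dataset")["score"].rank(ascending=False, method="min")

# Best (smallest), worst (largest), and average rank of each algorithm across datasets.
summary = results.groupby("algorithm")["rank"].agg(["min", "max", "mean"])
print(summary)
```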

RecZilla: Meta-Learning for Recommender Systems
Since different algorithms perform well (and poorly) on different datasets, we cannot simply choose a single algorithm (or hyperparameter set) and assume that it will perform well on all datasets. In other words, to achieve the best performance, we should choose a different algorithm for each dataset. This leads to our second question:
Q2: Can we predict the performance of a recsys algorithm on a new recsys dataset?
To answer this question, we propose a meta-learning pipeline for recommender systems (outlined in the figure below), which we call RecZilla. At a high level, RecZilla takes a recsys dataset as input and predicts the performance of several recsys algorithms. This means that we need to represent each recsys dataset as a feature vector that can be passed to an ML model.

We calculate a large set of numerical features to represent each recsys dataset, including various statistics of the dataset and the performance of basic recsys algorithms on subsets of the dataset (“landmark” features). In total, we calculate 383 numerical features to represent a recsys dataset. As a first pass, we check whether any of these features are correlated with algorithm performance. The table below shows the highest correlations between any algorithm’s performance and any of the 383 dataset features (using performance metric PREC@10).
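To give a flavor of what such meta-features can look like, here is a small sketch that computes a handful of simple statistics of an interaction matrix and checks the correlation of one feature with one algorithm's performance across datasets. The specific feature names and numbers are illustrative only; the full 383-feature set in the paper is much richer.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.stats import pearsonr

def basic_meta_features(R) -> dict:
    """A few simple statistics of a sparse user-item interaction matrix (CSR).
    Illustrative only -- the full RecZilla feature set has 383 features,
    including distributional statistics and "landmark" features."""
    n_users, n_items = R.shape
    items_per_user = np.diff(R.indptr)  # interactions per user (CSR row counts)
    return {
        "num_users": n_users,
        "num_items": n_items,
        "density": R.nnz / (n_users * n_items),
        "mean_items_per_user": float(items_per_user.mean()),
        "std_items_per_user": float(items_per_user.std()),
    }

# Toy interaction matrix: 1000 users x 500 items, ~1% density.
R = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)
print(basic_meta_features(R))

# Correlation between one meta-feature and one algorithm's performance across
# datasets (placeholder numbers for four hypothetical datasets).
densities = np.array([0.010, 0.030, 0.002, 0.050])  # density of each dataset
precision = np.array([0.10, 0.22, 0.05, 0.31])      # PREC@10 of one algorithm
corr, _ = pearsonr(densities, precision)
print(f"Pearson correlation: {corr:.2f}")
```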

Many of the dataset features are strongly correlated with algorithm performance, which suggests that we can predict algorithm performance using dataset features. Our next step is to train a predictive model using these features. Since we have a relatively small number of data points in this prediction problem (85 recsys datasets), we take a few steps to simplify the learning pipeline (see the full paper for details):
1. We treat each pair of (algorithm + hyperparameters) as a separate “algorithm”. So for example, k-nearest neighbors with k=5 and k=10 are both included, and treated as separate algorithms.
2. We select a subset of n recsys algorithms that have good coverage over all datasets. At a high level, this means we choose the subset so that, for every dataset, at least one selected algorithm performs well (a sketch of one such greedy selection appears after this list).
3. We select a subset of m dataset features that have good coverage in their correlation with the selected algorithms’ performance.
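As a rough illustration of step 2, the sketch below greedily selects algorithms to cover datasets, under the assumption that “performs well” means scoring within some tolerance of the best observed score on that dataset. This is a simplified stand-in for, not a reproduction of, the selection procedure described in the paper.

```python
import numpy as np

def select_algorithms_greedy(perf: np.ndarray, n: int, tol: float = 0.05) -> list:
    """Greedily pick n algorithms so that, for as many datasets as possible,
    at least one selected algorithm scores within `tol` of that dataset's best.

    perf: (num_algorithms x num_datasets) matrix of (tuned) performance scores.
    """
    best_per_dataset = perf.max(axis=0)            # best score on each dataset
    good = perf >= (best_per_dataset - tol)        # bool: algorithm "covers" dataset
    selected, covered = [], np.zeros(perf.shape[1], dtype=bool)
    for _ in range(n):
        # Pick the algorithm that covers the most not-yet-covered datasets.
        gains = (good & ~covered).sum(axis=1)
        gains[selected] = -1                       # don't re-pick an algorithm
        best_algo = int(gains.argmax())
        selected.append(best_algo)
        covered |= good[best_algo]
    return selected

# Toy example: 4 algorithms, 3 datasets.
perf = np.array([[0.30, 0.10, 0.05],
                 [0.10, 0.35, 0.06],
                 [0.28, 0.33, 0.01],
                 [0.05, 0.05, 0.20]])
print(select_algorithms_greedy(perf, n=2))  # e.g. [2, 3]
```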
After selecting this subset of n algorithms and m dataset features, we train an ML model to predict the performance of all n algorithms using only the m features of a dataset. When we encounter a new dataset, we first calculate its m selected features and pass them to our meta-model. The model returns a vector of n estimated performance metrics, one for each selected algorithm.
We test this pipeline using three different meta-models (the middle module in the figure above): XGBoost (XGB), k-nearest neighbors (KNN), and linear regression. Each model is trained to minimize the mean absolute error (MAE) between the predicted and actual performance of all n algorithms.
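Here is a minimal sketch of this meta-learning step, using xgboost's XGBRegressor wrapped in scikit-learn's MultiOutputRegressor (one regressor per algorithm) on placeholder data. The model settings, data splits, and feature values below are assumptions for illustration; they are not the exact configuration used in our experiments.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor  # KNeighborsRegressor / LinearRegression are drop-in alternatives

rng = np.random.default_rng(0)

# Placeholder meta-dataset: one row per recsys dataset.
#   X_meta: m dataset features per dataset
#   Y_meta: performance of each of the n selected algorithms on that dataset
num_datasets, m, n = 85, 10, 10
X_meta = rng.normal(size=(num_datasets, m))
Y_meta = rng.uniform(size=(num_datasets, n))

train, test = slice(0, 70), slice(70, 85)

# One regressor per algorithm, each predicting that algorithm's performance
# from the m dataset features; trained to minimize error on the meta-train split.
meta_model = MultiOutputRegressor(XGBRegressor(n_estimators=100, max_depth=3))
meta_model.fit(X_meta[train], Y_meta[train])

Y_pred = meta_model.predict(X_meta[test])
print("MAE:", mean_absolute_error(Y_meta[test], Y_pred))

# For a new dataset: compute its m features, predict all n performances,
# and pick the algorithm with the highest predicted score.
new_dataset_features = rng.normal(size=(1, m))
predicted_perf = meta_model.predict(new_dataset_features)[0]
print("best predicted algorithm index:", int(predicted_perf.argmax()))
```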
As a first pass, we check whether RecZilla can actually learn, using two ablation studies (both with n=100 algorithms): (1) we vary the number of datasets used to train the meta-learner, and (2) we vary the number of dataset features the meta-learner uses. In each study we randomly select the subset of datasets (for (1)) or dataset features (for (2)), over 50 random trials.

These ablation studies show that RecZilla does improve as we increase the available information: both the number of datasets the meta-learner is trained on (left) and the number of dataset features it uses (right). As a proof of concept, this tells us that it is possible to predict the performance of recsys algorithms on new datasets (Q2). So where do we go from here?
Putting RecZilla to Use
The first takeaway from our results is that recsys algorithms do not generalize across datasets: an algorithm that performs well on some (or most) datasets will not necessarily perform well on a new dataset we encounter. However, using a meta-learning pipeline—RecZilla—we can predict which recsys algorithms will perform well on a new dataset using numerical features of the dataset.
The goal of RecZilla is to allow practitioners who are faced with a new recsys dataset to quickly identify and train a recsys algorithm that performs well. The practitioner may then choose to further tune the model to reach even stronger performance. To use RecZilla, we recommend the following steps:
1. Decide on an objective metric.
There are many metrics for evaluating the performance of recsys algorithms; common choices include normalized discounted cumulative gain (NDCG), precision, and coverage. Each metric is calculated at a specific cutoff (the number of items recommended to each user), and metrics are reported as <metric>@<cutoff>, for example NDCG@10. You can use any of the 315 metrics that we calculate as an objective metric for your meta-learner; these include various performance measures at various cutoffs, as well as timing measures. You can also specify your own metric as a function of any other metrics, such as NDCG@10 + training_time. (A sketch of two common metrics appears after this list.)
2. Build a RecZilla meta-model. You can use a pre-trained RecZilla model, or you can train one yourself. We provide pre-trained models for metrics PREC@10, NDCG@10, and Item-hit coverage@10. See here to get started: https://github.com/naszilla/reczilla#using-a-trained-meta-model-for-inference-
3. Train your own model (optional). You can train your own RecZilla model using all results from our experiments. See here to get started: https://github.com/naszilla/reczilla#training-a-new-meta-model-
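For reference, here is a minimal sketch of two of the metrics mentioned above, precision@k and NDCG@k with binary relevance, computed for a single user. This is a generic illustration, not the exact metric implementation used in our experiments.

```python
import numpy as np

def precision_at_k(recommended: list, relevant: set, k: int) -> float:
    """Fraction of the top-k recommended items that are relevant to the user."""
    top_k = recommended[:k]
    return sum(item in relevant for item in top_k) / k

def ndcg_at_k(recommended: list, relevant: set, k: int) -> float:
    """Normalized discounted cumulative gain at cutoff k, with binary relevance."""
    gains = np.array([1.0 if item in relevant else 0.0 for item in recommended[:k]])
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))  # 1 / log2(rank + 1)
    dcg = float((gains * discounts).sum())
    n_ideal = min(len(relevant), k)
    idcg = float((1.0 / np.log2(np.arange(2, n_ideal + 2))).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: one user's ranked recommendations and their true relevant items.
recommended = ["i3", "i7", "i1", "i9", "i4", "i2", "i8", "i5", "i6", "i0"]
relevant = {"i1", "i2", "i5"}
print("PREC@10:", precision_at_k(recommended, relevant, 10))
print("NDCG@10:", ndcg_at_k(recommended, relevant, 10))
```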
Finally, if you are going to NeurIPS 2022, come stop by our poster for On the Generalizability and Predictability of Recommender Systems!