2020 OnSampledMetricsforItemRecommendation

From GM-RKB

Subject Headings: Item Recommendation Performance Measure, High-Bias Estimator, Low-Variance Estimator, Sampling During Evaluation.

Notes

  • It observes that many research papers on recommender systems apply the same sampling during evaluation as is done during training, largely for reasons of simplicity, and that this biases the results of offline metrics.

Cited By

Quotes

Abstract

The task of item recommendation requires ranking a large catalogue of items given a context. Item recommendation algorithms are evaluated using ranking metrics that depend on the positions of relevant items. To speed up the computation of metrics, recent work often uses sampled metrics where only a smaller set of random items and the relevant items are ranked. This paper investigates sampled metrics in more detail and shows that they are inconsistent with their exact version, in the sense that they do not persist relative statements, e.g., recommender A is better than B, not even in expectation. Moreover, the smaller the sampling size, the less difference there is between metrics, and for very small sampling size, all metrics collapse to the AUC metric. We show that it is possible to improve the quality of the sampled metrics by applying a correction, obtained by minimizing different criteria such as bias or mean squared error. We conclude with an empirical evaluation of the naive sampled metrics and their corrected variants. To summarize, our work suggests that sampling should be avoided for metric calculation, however if an experimental study needs to sample, the proposed corrections can improve the quality of the estimate.

1 INTRODUCTION

Over recent years, item recommendation from implicit feedback has received a lot of attention from the recommender system research community. At its core, item recommendation is a retrieval task, where given a context, a catalogue of items should be ranked and the top-scoring ones are shown to the user. Usually the catalogue of items to retrieve from is large: tens of thousands in academic studies and often many millions in industrial applications. Finding matching items from this large pool is challenging as the user will usually only explore a few of the highest-ranked ones. For evaluating recommender algorithms, usually sharp metrics such as precision or recall over the few highest-scoring items (e.g., top 10) are chosen. Another popular class consists of smooth metrics such as average precision or normalized discounted cumulative gain (NDCG), which place a strong emphasis on the top-ranked items.

Recently, it has become common in research papers to speed up evaluation by sampling a small set of irrelevant items and ranking the relevant documents only among this smaller set [7, 9, 10, 12, 15–17]. Sampling of negatives is commonly used during training of large models [4, 13], and several works have studied the implications of sampling as well as various methods to improve it [5, 6]; see [18] for a comparative study of sampling methods. However, to the best of our knowledge, the implications of sampling during evaluation have not been explored. In this work, the consequences of this approach are studied. In particular, it is shown that findings from sampled metrics (even in expectation) can be inconsistent with exact metrics. This means that if a recommender A outperforms a recommender B on the sampled metric, it does not imply that A has a better metric than B when the metric is computed exactly. This is even a problem in expectation; i.e., with unlimited repetitions of the measurement. Moreover, a sampled metric has different characteristics than its exact counterpart. In general, the smaller the sampling size, the less difference there is between different metrics, and in the small sample limit, all metrics collapse to the area under the ROC curve, which discounts positions linearly. This is particularly problematic because many ranking metrics are designed to focus on the top positions.
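
The sampled protocol described here can be made concrete with a short sketch. The following is a minimal illustration, not code from the paper: it assumes item scores are available as a NumPy array for a single relevant item, and the function and variable names (exact_rank, sampled_rank, m) are chosen for this example only.

```python
import numpy as np

def exact_rank(scores: np.ndarray, relevant: int) -> int:
    """1-based rank of the relevant item among all n items:
    one plus the number of items scored higher."""
    return 1 + int(np.sum(scores > scores[relevant]))

def sampled_rank(scores: np.ndarray, relevant: int, m: int,
                 rng: np.random.Generator) -> int:
    """1-based rank of the relevant item among itself plus m
    uniformly sampled irrelevant items (the sampled protocol)."""
    candidates = np.delete(np.arange(len(scores)), relevant)
    negatives = rng.choice(candidates, size=m, replace=False)
    return 1 + int(np.sum(scores[negatives] > scores[relevant]))

# Toy example: n = 10,000 items, m = 100 sampled negatives. The sampled
# rank can never exceed m + 1, no matter how poor the exact rank is.
rng = np.random.default_rng(0)
scores = rng.standard_normal(10_000)
print(exact_rank(scores, relevant=42),
      sampled_rank(scores, relevant=42, m=100, rng=rng))
```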

As we will show, the sampled metrics can be viewed as high-bias, low-variance estimators of the exact metrics. Their low variance can be particularly misleading if one does not recognize that they are biased, as repeated measurements may indicate a low variance, and yet no meaningful conclusion can be drawn because the bias is recommender-dependent, i.e., the value of the bias depends on the recommender algorithm being evaluated. We also show that this issue can be alleviated if one applies a point-wise correction to the sampled metric, by minimizing criteria that trade off bias and variance. Empirical performance of the sampled metrics and their corrections is illustrated on a movie recommendation problem.

This analysis suggests that if a study is really interested in metrics that emphasize the top ranked items, sampling candidates should be avoided for the purposes of evaluation, and if the size of the problem is such that sampling is necessary, corrected metrics can provide a more accurate evaluation. Lastly, if sampling is used, the reader should be aware that the reported metric has different characteristics than its name implies.

Figure 1: Visualization of metric vs. predicted rank for 𝑛 = 10,000, for the metrics AUC, Average Precision, NDCG, Recall at 20, and Precision at 20. The left side shows the metrics over the whole set of 10,000 items. The right side zooms in on the contributions of the top 100 ranks. All metrics besides AUC are top-heavy and almost completely ignore the tail. This is usually a desirable property for evaluating ranking because users are unlikely to explore items further down the result list.
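
For a single relevant item at rank 𝑟, the metrics plotted in Figure 1 have simple closed forms that follow from the definitions given in Section 3 with |𝑅| = 1. The sketch below is an illustrative reconstruction of those curves, not code from the paper; all function names are chosen here.

```python
import math

# Metric value as a function of the rank r of a single relevant item
# in a catalogue of n items (i.e., |R| = 1 in the notation of Section 3).
def auc_of_rank(r: int, n: int) -> float:
    return (n - r) / (n - 1)        # fraction of irrelevant items ranked below r

def ap_of_rank(r: int) -> float:
    return 1.0 / r                  # precision at the single relevant position

def ndcg_of_rank(r: int) -> float:
    return 1.0 / math.log2(r + 1)   # ideal DCG for one relevant item is 1/log2(2) = 1

def recall_at_k_of_rank(r: int, k: int) -> float:
    return 1.0 if r <= k else 0.0

def precision_at_k_of_rank(r: int, k: int) -> float:
    return (1.0 if r <= k else 0.0) / k

# Sampling the curves of Figure 1 (n = 10,000, k = 20):
n = 10_000
for r in (1, 10, 20, 100, 1_000):
    print(r, auc_of_rank(r, n), ap_of_rank(r), ndcg_of_rank(r),
          recall_at_k_of_rank(r, 20), precision_at_k_of_rank(r, 20))
```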

2 EVALUATING ITEM RECOMMENDATION

This section starts by formalizing the most common evaluation scheme for item recommendation. Let there be a pool of 𝑛 items to recommend from. For a given instance x (e.g., a user, context, or query), a recommendation algorithm, 𝐴, returns a ranked list of the 𝑛 items. In an evaluation, the positions, 𝑅(𝐴, x) ⊆ {1, . . . , 𝑛}, of the withheld relevant items within this ranking are computed; 𝑅 will also be referred to as the predicted ranks. For example, 𝑅(𝐴, x) = {3, 5} means that for an instance x, recommender 𝐴 ranked two relevant items at positions 3 and 5. Then, a metric 𝑀 is used to translate the positions into a single number measuring the quality of the ranking. This process is repeated for a set of instances, 𝐷 = {x₁, x₂, . . .}, and an average metric is reported:

\[
\frac{1}{|D|} \sum_{\mathbf{x} \in D} M(R(A, \mathbf{x})) \qquad (1)
\]

This problem definition assumes that in the ground truth, all relevant items are equally preferred by the user, i.e., that the relevant items are a set. This is the most commonly used evaluation scheme in recommender systems. In more complex cases, the ground truth includes preferences among the relevant items. For example, the ground truth can be a ranked list or weighted set. Our work shows issues with sampling in the simpler setup, which implies that the issues carry over to the more complex case.
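
A minimal sketch of this evaluation scheme, i.e., Equation (1), is given below. It is not code from the paper; the recommender is abstracted as a function that returns the set of predicted ranks for an instance, and all names are illustrative.

```python
from typing import Callable, Iterable, Set

def evaluate(predicted_ranks: Callable[[object], Set[int]],
             metric: Callable[[Set[int]], float],
             instances: Iterable[object]) -> float:
    """Equation (1): average the metric M over the predicted ranks
    R(A, x) of every instance x in the evaluation set D."""
    values = [metric(predicted_ranks(x)) for x in instances]
    return sum(values) / len(values)

# Toy usage: a recommender that always ranks the relevant items at
# positions 3 and 5, evaluated with Recall at 10.
def recall_at_10(R: Set[int]) -> float:
    return sum(1 for r in R if r <= 10) / len(R)

print(evaluate(lambda x: {3, 5}, recall_at_10, instances=["x1", "x2", "x3"]))
```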

3 METRICS

This section recalls commonly used metrics for measuring the quality of a ranking. For convenience, the arguments 𝐴, x of 𝑅(𝐴, x) are omitted whenever the particular recommender, 𝐴, or instance, x, is clear from context, and the shorter form 𝑅 is used.

Area under the ROC curve (AUC) measures the likelihood that a random relevant item is ranked higher than a random irrelevant item:

\[
\mathrm{AUC}(R)_n = \frac{1}{|R| \, (n - |R|)} \sum_{r \in R} \; \sum_{r' \in \{1, \dots, n\} \setminus R} \delta(r < r') = \frac{n - \frac{|R| - 1}{2} - \frac{1}{|R|} \sum_{r \in R} r}{n - |R|} \qquad (2)
\]

with the indicator function 𝛿(𝑏) = 1 if 𝑏 is true and 0 otherwise.

Precision at position 𝑘 measures the fraction of relevant items among the top 𝑘 predicted items:

\[
\mathrm{Prec}(R)_k = \frac{|\{ r \in R : r \le k \}|}{k} \qquad (3)
\]

Recall at position 𝑘 measures the fraction of all relevant items that were recovered in the top 𝑘:

\[
\mathrm{Recall}(R)_k = \frac{|\{ r \in R : r \le k \}|}{|R|} \qquad (4)
\]

Average Precision at 𝑘 measures the precision at all ranks that hold a relevant item:

\[
\mathrm{AP}(R)_k = \frac{1}{\min(|R|, k)} \sum_{i=1}^{k} \delta(i \in R) \, \mathrm{Prec}(R)_i \qquad (5)
\]

Normalized discounted cumulative gain (NDCG) at 𝑘 places an inverse log reward on all positions that hold a relevant item:

\[
\mathrm{NDCG}(R)_k = \frac{1}{\sum_{i=1}^{\min(|R|, k)} \frac{1}{\log_2(i + 1)}} \sum_{i=1}^{k} \delta(i \in R) \, \frac{1}{\log_2(i + 1)} \qquad (6)
\]
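
The metric definitions (2)–(6) translate directly into code. The sketch below is an illustrative implementation over a set 𝑅 of predicted ranks, not code from the paper; the AUC function uses the closed form on the right-hand side of Equation (2), and all names are chosen for this example.

```python
import math
from typing import Set

def auc(R: Set[int], n: int) -> float:
    # Closed form on the right-hand side of Eq. (2).
    return (n - (len(R) - 1) / 2 - sum(R) / len(R)) / (n - len(R))

def precision_at(R: Set[int], k: int) -> float:
    return sum(1 for r in R if r <= k) / k       # Eq. (3)

def recall_at(R: Set[int], k: int) -> float:
    return sum(1 for r in R if r <= k) / len(R)  # Eq. (4)

def average_precision_at(R: Set[int], k: int) -> float:
    # Eq. (5): precision accumulated at every rank i <= k holding a relevant item.
    return sum(precision_at(R, i) for i in range(1, k + 1) if i in R) / min(len(R), k)

def ndcg_at(R: Set[int], k: int) -> float:
    # Eq. (6): inverse-log reward, normalized by the ideal ranking.
    ideal = sum(1 / math.log2(i + 1) for i in range(1, min(len(R), k) + 1))
    gain = sum(1 / math.log2(i + 1) for i in range(1, k + 1) if i in R)
    return gain / ideal

# The example from Section 2: R(A, x) = {3, 5} in a catalogue of n = 10,000 items.
R = {3, 5}
print(auc(R, 10_000), precision_at(R, 20), recall_at(R, 20),
      average_precision_at(R, 20), ndcg_at(R, 20))
```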

3.1 Simplified Metrics

3.2 Example

4 SAMPLED METRICS

4.1 Inconsistency of Sampled Metrics

…

References

• Steffen Rendle, and Walid Krichene. (2020). "On Sampled Metrics for Item Recommendation." In: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2020).