2002 CumulatedGainbasedEvaluationofIRTechniques


Subject Headings: Discounted Cumulative Gain, Normalized Discounted Cumulative Gain.

Notes

Cited By

Quotes

Author Keywords

Graded relevance judgments, cumulated gain

Abstract

Modern large retrieval environments tend to overwhelm their users by their large output. Since all documents are not of equal relevance to their users, highly relevant documents should be identified and ranked first for presentation. In order to develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. This can be done by extending traditional evaluation methods, that is, recall and precision based on binary relevance judgments, to graded relevance judgments. Alternatively, novel measures based on graded relevance judgments may be developed. This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. The first one accumulates the relevance scores of retrieved documents along the ranked result list. The second one is similar but applies a discount factor to the relevance scores in order to devaluate late-retrieved documents. The third one computes the relative-to-the-ideal performance of IR techniques, based on the cumulative gain they are able to yield. These novel measures are defined and discussed and their use is demonstrated in a case study using TREC data: sample system run results for 20 queries in TREC-7. As a relevance base we used novel graded relevance judgments on a four-point scale. The test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences. The graphs based on the measures also provide insight into the performance of IR techniques and allow interpretation, for example, from the user point of view.
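The three measures described in the abstract can be sketched in a few lines of code. The following is a minimal illustration, not the paper's reference implementation: the log-based discount with base b (ranks below b are left undiscounted) follows the formulation usually attributed to this paper, and the 0–3 gain values mirror the four-point relevance scale mentioned above.

```python
import math

def cg(gains):
    """Cumulated gain: running sum of graded relevance scores along the ranking."""
    out, total = [], 0
    for g in gains:
        total += g
        out.append(total)
    return out

def dcg(gains, b=2):
    """Discounted cumulated gain: the gain at rank i >= b is divided by log_b(i)
    (ranks are 1-based), so late-retrieved documents contribute less."""
    out, total = [], 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i < b else g / math.log(i, b)
        out.append(total)
    return out

def ndcg(gains, ideal_gains, b=2):
    """Normalized DCG: actual DCG divided, position by position, by the DCG of
    the ideal ranking built from the recall base."""
    actual, ideal = dcg(gains, b), dcg(ideal_gains, b)
    return [a / i for a, i in zip(actual, ideal)]

# Gains of a sample ranked result on the 0-3 scale. For brevity the ideal vector
# here just re-sorts the run's own gains; properly it is built from the full
# recall base of judged documents for the query.
run = [3, 2, 3, 0, 1, 2]
ideal = sorted(run, reverse=True)
print(ndcg(run, ideal))
```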

2.3 Relative to the Ideal Measure — the Normalized (D)CG Measure

Are two IR techniques significantly different in effectiveness from each other when evaluated through (D)CG curves? In the case of P–R performance, we may use the average of interpolated precision figures at standard points of operation, for example, 11 recall levels or DCV points, and then perform a statistical significance test. The practical significance may be judged by the Sparck Jones [1974] criteria;
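By analogy with averaging interpolated precision at fixed recall levels, a (D)CG-based comparison can average the normalized gain over queries at fixed rank positions (document cutoff values) and then test the per-query differences between two techniques. The sketch below assumes per-query nDCG vectors have already been computed (e.g. with the helpers above); the paired t-test and the cutoff ranks chosen here are illustrative, not necessarily the tests used in the article.

```python
from statistics import mean
from scipy import stats  # assumed available; any paired significance test could be substituted

def avg_at_cutoffs(per_query_ndcg, cutoffs=(5, 10, 20)):
    """Average nDCG over queries at a few document cutoff values (1-based ranks)."""
    return {k: mean(q[k - 1] for q in per_query_ndcg) for k in cutoffs}

def paired_test(system_a, system_b, k=10):
    """Paired test on per-query nDCG@k differences between two IR techniques."""
    a = [q[k - 1] for q in system_a]
    b = [q[k - 1] for q in system_b]
    return stats.ttest_rel(a, b)
```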

2.4 Comparison of Earlier Measures

In addition, the DCG measure has the following further advantages.

  • It realistically weights down the gain received through documents found later in the ranked results.
  • It allows modeling user persistence in examining long ranked result lists by adjusting the discounting factor.
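The second point can be made concrete by looking at the discount weights themselves: a small base (e.g. b = 2) models an impatient user and penalizes late ranks sharply, while a large base (e.g. b = 10) leaves the first b − 1 ranks undiscounted and models a persistent user. A brief sketch, using the same weighting scheme assumed in the code above:

```python
import math

def discount_weights(n, b):
    """Weight applied to the gain at rank i (1-based): 1 for i < b, 1/log_b(i) otherwise."""
    return [1.0 if i < b else 1.0 / math.log(i, b) for i in range(1, n + 1)]

print(discount_weights(5, b=2))   # [1.0, 1.0, 0.63..., 0.5, 0.43...] - steep discounting
print(discount_weights(5, b=10))  # [1.0, 1.0, 1.0, 1.0, 1.0] - no discount until rank 10
```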

Furthermore, the normalized nCG and nDCG measures support evaluation:

  • They represent performance as relative to the ideal based on a known (possibly large) recall base of graded relevance judgments.
  • The performance differences between IR techniques are also normalized in relation to the ideal thereby supporting the analysis of performance differences.
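The ideal vector behind these two points is built from the full recall base, not just the retrieved list: all judged documents for the query are sorted by decreasing relevance level, and their gains are accumulated exactly like a run's. A brief sketch, assuming the recall base is given as counts of judged documents per relevance level (the counts here are hypothetical):

```python
def ideal_gains(recall_base, n):
    """Ideal gain vector: documents of the recall base sorted by decreasing
    relevance level, padded with zeros and truncated to the evaluated rank n."""
    gains = []
    for level in sorted(recall_base, reverse=True):
        gains.extend([level] * recall_base[level])
    gains.extend([0] * max(0, n - len(gains)))
    return gains[:n]

# Hypothetical recall base for one query on the 0-3 scale:
# 2 highly relevant, 5 fairly relevant, 9 marginally relevant documents.
print(ideal_gains({3: 2, 2: 5, 1: 9}, n=20))
# -> [3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```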



References

Kalervo Järvelin, and Jaana Kekäläinen. (2002). "Cumulated Gain-based Evaluation of IR Techniques." In: ACM Transactions on Information Systems (TOIS), 20(4). doi:10.1145/582415.582418