1999 KEAPracticalAutomaticKeyphraseE


Subject Headings:

Notes

Cited By

Quotes

Abstract

Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea’s effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.

Introduction

In addition, keyphrases can help users get a feel for the content of a collection, provide sensible entry points into it, show how queries can be extended, facilitate document skimming by visually emphasizing important phrases, and offer a powerful means of measuring document similarity (e.g. [6], [8], [13]).

Keyphrases are usually chosen manually. In many academic contexts, authors assign keyphrases to documents they have written. Professional indexers often choose phrases from a predefined “controlled vocabulary” relevant to the domain at hand. However, the great majority of documents come without keyphrases, and assigning them manually is a tedious process that requires knowledge of the subject matter. Automatic extraction techniques are potentially of great benefit.

Several methods have been proposed for generating or extracting summary information from text (e.g. [1], [7], [10]). In the specific domain of keyphrases, there are two fundamentally different approaches: keyphrase assignment and keyphrase extraction. Both use machine learning methods, and both require, for training purposes, a set of documents with keyphrases already attached.

Keyphrase assignment seeks to select the phrases from a controlled vocabulary that best describe a document. The training data associates a set of documents with each phrase in the vocabulary, and builds a classifier for each phrase. A new document is processed by each classifier, and assigned the keyphrase of any model that classifies it positively (e.g. [3]). The only keyphrases that can be assigned are ones that have already been seen in the training data.
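The one-classifier-per-phrase setup can be illustrated with a minimal sketch. It is not the method of any cited system: the per-phrase "classifier" here is a deliberately crude stand-in (a set of cue words seen only in that phrase's training documents), and the names `train_assigner`, `assign`, and the `min_hits` threshold, as well as the toy training data, are hypothetical.

```python
def tokenize(text):
    """Lowercase the text and split it into words, dropping edge punctuation."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w]


def train_assigner(training_docs):
    """Build one model per vocabulary phrase.

    training_docs: list of (text, set_of_vocabulary_phrases).
    Each model is the set of words that occur in documents labeled with the
    phrase and in no document lacking it (an assumed, toy classifier).
    """
    vocabulary = set().union(*(phrases for _, phrases in training_docs))
    models = {}
    for phrase in vocabulary:
        positive, negative = set(), set()
        for text, phrases in training_docs:
            (positive if phrase in phrases else negative).update(tokenize(text))
        models[phrase] = positive - negative
    return models


def assign(models, text, min_hits=2):
    """Run every per-phrase model on the document; keep the positive ones."""
    words = set(tokenize(text))
    return sorted(p for p, cues in models.items() if len(words & cues) >= min_hits)


# Made-up training documents, each with phrases from a controlled vocabulary.
training = [
    ("neural networks learn weights by gradient descent", {"machine learning"}),
    ("decision trees split data on attribute values", {"machine learning"}),
    ("inverted index structures speed up query processing", {"information retrieval"}),
    ("ranking documents for a query using term weights", {"information retrieval"}),
]
models = train_assigner(training)
print(assign(models, "gradient descent for decision trees"))
```

Because `models` has one entry per phrase seen in training, the sketch also shows the limitation noted above: no phrase outside the training vocabulary can ever be assigned.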

Keyphrase extraction, the approach used here, does not use a controlled vocabulary, but instead chooses keyphrases from the text itself. It employs lexical and information retrieval techniques to extract phrases from the document text that are likely to characterize it [12]. In this approach, the training data is used to tune the parameters of the extraction algorithm.
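As an illustration of what a lexical candidate-identification step might look like, here is a minimal sketch. The excerpt does not spell out Kea's actual rules, so everything specific here is an assumption: the stopword list, the three-word length limit, the rule that a candidate may not begin or end with a stopword, and the function name `candidate_phrases`.

```python
import re

# Assumed, abbreviated stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for",
             "is", "are", "from", "that", "with", "on"}


def candidate_phrases(text, max_len=3):
    """Return word n-grams (up to max_len words) taken from the text itself.

    Candidates never span punctuation and never start or end with a stopword.
    """
    candidates = []
    # Split on punctuation so that no candidate crosses a phrase boundary.
    for chunk in re.split(r"[^\w\s'\-]+", text.lower()):
        words = chunk.split()
        for start in range(len(words)):
            for n in range(1, max_len + 1):
                gram = words[start:start + n]
                if len(gram) < n:  # ran off the end of this chunk
                    break
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                candidates.append(" ".join(gram))
    return candidates


print(candidate_phrases("Kea extracts keyphrases from text."))
```

Every candidate produced this way is a substring of the document, which is the defining difference from assignment against a controlled vocabulary.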

This paper describes a new keyphrase extraction algorithm, Kea, that is simple and effective, and performs at the current state of the art [5]. It uses the Naïve Bayes machine learning algorithm for training and keyphrase extraction. An implementation is available from the New Zealand Digital Library project (http://www.nzdl.org/).
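A Naïve Bayes scorer for candidate phrases could be sketched roughly as below. This is an assumption-laden illustration rather than Kea's implementation: the excerpt does not list the features, so candidates are represented as generic tuples of discrete feature values (the bin labels in the example are invented), Laplace smoothing is an added choice, and the class name `NaiveBayesKeyphraseModel` is hypothetical.

```python
from collections import Counter, defaultdict


class NaiveBayesKeyphraseModel:
    """Naive Bayes over discrete features with two classes: keyphrase or not."""

    def fit(self, examples):
        """examples: list of (feature_tuple, is_keyphrase) pairs."""
        self.class_counts = Counter(label for _, label in examples)
        self.feature_counts = defaultdict(Counter)  # (index, label) -> value counts
        self.values = defaultdict(set)              # index -> distinct values seen
        for features, label in examples:
            for i, value in enumerate(features):
                self.feature_counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def _joint(self, features, label):
        """P(label) times the product of P(feature value | label)."""
        total = sum(self.class_counts.values())
        p = self.class_counts[label] / total
        for i, value in enumerate(features):
            # Laplace smoothing keeps unseen values from zeroing the product.
            numerator = self.feature_counts[(i, label)][value] + 1
            denominator = self.class_counts[label] + len(self.values[i])
            p *= numerator / denominator
        return p

    def probability(self, features):
        """Posterior probability that a candidate is a keyphrase."""
        yes = self._joint(features, True)
        no = self._joint(features, False)
        return yes / (yes + no)


# Invented training candidates: (feature values, was it an author keyphrase?).
training = [
    (("high", "early"), True),
    (("high", "early"), True),
    (("high", "late"), False),
    (("low", "early"), False),
    (("low", "late"), False),
    (("low", "late"), False),
]
model = NaiveBayesKeyphraseModel().fit(training)

# Rank the candidates of a new document by predicted probability.
new_candidates = {"phrase a": ("high", "early"), "phrase b": ("low", "late")}
ranked = sorted(new_candidates, key=lambda c: model.probability(new_candidates[c]),
                reverse=True)
print(ranked)
```

The same model object serves both stages: `fit` corresponds to training on documents with known keyphrases, and `probability` is used to rank candidates in a new document.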

Kea’s output is illustrated in Table 1, which shows the titles of three research articles and two sets of keyphrases for each article. One set gives the keyphrases assigned by the author; the other was determined automatically from the article’s full text. Phrases in common between the two sets are italicized.
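The comparison between the two sets of keyphrases can be sketched as a simple overlap computation. The matching rule used here (case-insensitive comparison after collapsing whitespace) is an assumption, since the excerpt does not say how phrases are compared, and the phrase lists are made up rather than taken from Table 1.

```python
def normalize(phrase):
    """Lowercase a phrase and collapse runs of whitespace."""
    return " ".join(phrase.lower().split())


def matched_keyphrases(author_phrases, extracted_phrases):
    """Return the extracted phrases that also appear among the author's."""
    author = {normalize(p) for p in author_phrases}
    return [p for p in extracted_phrases if normalize(p) in author]


# Made-up example data.
author = ["Keyphrase extraction", "machine learning", "digital libraries"]
extracted = ["machine  learning", "keyphrase extraction", "naive bayes",
             "text mining", "metadata"]

matches = matched_keyphrases(author, extracted)
print(len(matches), "of", len(extracted), "extracted phrases match the author's")
```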

References

Authors: Eibe Frank, Carl Gutwin, Ian H Witten, Gordon W Paynter, Craig G Nevill-Manning
Title: KEA: Practical Automatic Keyphrase Extraction
Year: 1999