2010 SemEval2010Task5AutomaticKeyphr

From GM-RKB

Subject Headings: Keyphrase Extraction; Scientific Article Corpus; Task 5 of the Workshop on Semantic Evaluation 2010; Keyphrase-based Document Classification.

Notes

Cited By

2014

  • (Hasan & Ng, 2014) ⇒ Kazi Saidul Hasan, and Vincent Ng. (2014). “Automatic Keyphrase Extraction: A Survey of the State of the Art.” In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1262-1273.

Quotes

Abstract

This paper describes Task 5 of the Workshop on Semantic Evaluation 2010 (SemEval-2010). Systems are to automatically assign keyphrases or keywords to given scientific articles. The participating systems were evaluated by matching their extracted keyphrases against manually assigned ones. We present the overall ranking of the submitted systems and discuss our findings to suggest future directions for this task.

Introduction

Keyphrases [1] are words that capture the main topics of a document. As they represent these key ideas, extracting high-quality keyphrases can benefit various natural language processing (NLP) applications such as summarization, information retrieval and question-answering. In summarization, keyphrases can be used as a form of semantic metadata (Barzilay and Elhadad, 1997; Lawrie et al., 2001; D’Avanzo and Magnini, 2005). In search engines, keyphrases can supplement full-text indexing and assist users in formulating queries.

Recently, a resurgence of interest in keyphrase extraction has led to the development of several new systems and techniques for the task (Frank et al., 1999; Witten et al., 1999; Turney, 1999; Hulth, 2003; Turney, 2003; Park et al., 2004; Barker and Cornacchia, 2000; Hulth, 2004; Matsuo and Ishizuka, 2004; Mihalcea and Tarau, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007; Wan and Xiao, 2008; Liu et al., 2009; Medelyan, 2009; Nguyen and Phan, 2009). These have showcased the potential benefits of keyphrase extraction to downstream NLP applications.

In light of these developments, we felt that this was an appropriate time to conduct a shared task for keyphrase extraction, to provide a standard assessment to benchmark current approaches. A second goal of the task was to contribute an additional public dataset to spur future research in the area.

Currently, there are several publicly available data sets. [2] For example, Hulth (2003) contributed 2,000 abstracts of journal articles present in Inspec between the years 1998 and 2002. The data set contains keyphrases (i.e. controlled and uncontrolled terms) assigned by professional indexers — 1,000 for training, 500 for validation and 500 for testing. Nguyen and Kan (2007) collected a dataset containing 120 computer science articles, ranging in length from 4 to 12 pages. The articles contain author-assigned keyphrases as well as reader-assigned keyphrases contributed by undergraduate CS students. In the general newswire domain, Wan and Xiao (2008) developed a dataset of 308 documents taken from DUC 2001 which contain up to 10 manually-assigned keyphrases per document. Several databases, including the ACM Digital Library, IEEE Xplore, Inspec and PubMed, provide articles with author-assigned keyphrases and, occasionally, reader-assigned ones. Medelyan (2009) automatically generated a dataset using tags assigned by the users of the collaborative citation platform CiteULike. This dataset additionally records how many people have assigned the same keyword to the same publication. In total, 180 full-text publications were annotated by over 300 users. [3] Despite the availability of these datasets, a standardized benchmark dataset with a well-defined training and test split is needed to maximize comparability of results.

For the SemEval-2010 Task 5, we have compiled a set of 284 scientific articles with keyphrases carefully chosen by both their authors and readers. The participants’ task was to develop systems which automatically produce keyphrases for each paper. Each team was allowed to submit up to three system runs, to benchmark the contributions of different parameter settings and approaches. Each run consisted of extracting a ranked list of 15 keyphrases from each document, ranked by their probability of being reader-assigned keyphrases.

In the remainder of the paper, we describe the competition setup, including how data collection was managed and the evaluation methodology (Section 2). We present the results of the shared task, and discuss the immediate findings of the competition in Section 3. In Section 4 we assess the human performance by comparing reader-assigned keyphrases to those assigned by the authors. This gives an approximation of an upper-bound performance for this task.

2 Competition Setup

2.1 Data

We collected trial, training and test data from the ACM Digital Library (conference and workshop papers). The input papers ranged from 6 to 8 pages, including tables and pictures. To ensure a variety of different topics was represented in the corpus, we purposefully selected papers from four different research areas for the dataset. In particular, the selected articles belong to the following four 1998 ACM classifications: C2.4 (Distributed Systems), H3.3 (Information Search and Retrieval), I2.11 (Distributed Artificial Intelligence – Multiagent Systems) and J4 (Social and Behavioral Sciences – Economics). All three datasets (trial, training and test) had an equal distribution of documents from among the categories (see Table 1). This domain-specific information was provided with the papers (e.g. I2.4-1 or H3.3-2), in case participant systems wanted to utilize this information. We specifically decided to straddle different areas to see whether participant approaches would work better within specific areas.

Participants were provided with 40, 144, and 100 articles, respectively, in the trial, training and test data, distributed evenly across the four research areas in each case. Note that the trial data is a subset of the training data. Since the original format for the articles was PDF, we converted them into (UTF-8) plain text using pdftotext, and systematically restored full words that were originally hyphenated and broken across two lines. This policy potentially resulted in valid hyphenated forms having their hyphen (-) removed.
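The conversion pipeline itself is not included in the task description; a minimal sketch of the de-hyphenation step, under the assumption that a simple regular expression over the pdftotext output suffices (the function name is illustrative, not the organisers' code), might look like this:

```python
import re

def dehyphenate(text: str) -> str:
    """Rejoin words that were hyphenated and broken across two lines.

    As noted above, this also strips the hyphen from legitimately
    hyphenated forms that happen to be split at a line break.
    """
    # "ex-\ntraction" -> "extraction"
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

print(dehyphenate("keyphrase ex-\ntraction"))  # -> "keyphrase extraction"
```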

All collected papers contain author-assigned keyphrases, part of the original PDF file. We additionally collected reader-assigned keyphrases for each paper. We first performed a pilot annotation task with a group of students to check the stability of the annotations, finalize the guidelines, and discover and resolve potential issues that may occur during the actual annotation. To collect the actual reader-assigned keyphrases, we then hired 50 student annotators from the Computer Science department of the National University of Singapore.

We assigned 5 papers to each annotator, estimating that assigning keyphrases to each paper should take about 10-15 minutes. Annotators were explicitly told to extract keyphrases that actually appear in the text of each paper, rather than to create semantically-equivalent phrases, but could extract phrases from any part of the document (including headers and captions). In reality, on average 15% of the reader-assigned keyphrases did not appear in the text of the paper, but this is still less than the 19% of author-assigned keyphrases that did not appear in the papers. These values were computed using the test documents only. In other words, the maximum recall that the participating systems can achieve on these documents is 85% and 81% for the reader- and author-assigned keyphrases, respectively.

As some keyphrases may occur in multiple forms, in our evaluation we accepted two different versions of genitive keyphrases: A of B → B A (e.g. policy of school = school policy) and A’s B → A B (e.g. school’s policy = school policy). In certain cases, such alternations change the semantics of the candidate phrase (e.g., matter of fact vs. ?fact matter). We judged borderline cases by committee and do not include alternations that were judged to be semantically distinct.
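The matching script is not part of the paper; a small illustrative sketch of how the two accepted genitive alternations could be expanded when comparing candidate keyphrases (the function name and regular expressions are assumptions) is:

```python
import re

def genitive_variants(phrase: str) -> set:
    """Return the phrase plus its accepted genitive alternations.

    Covers the two patterns accepted in the evaluation:
      "A of B" -> "B A"   (policy of school -> school policy)
      "A's B"  -> "A B"   (school's policy  -> school policy)
    Semantically distinct alternations (e.g. "matter of fact") were
    excluded by committee judgement and are not modelled here.
    """
    variants = {phrase}
    m = re.match(r"^(.+) of (.+)$", phrase)
    if m:
        variants.add(f"{m.group(2)} {m.group(1)}")
    variants.add(re.sub(r"'s\b", "", phrase).replace("  ", " ").strip())
    return variants

print(genitive_variants("policy of school"))  # {'policy of school', 'school policy'}
print(genitive_variants("school's policy"))   # {"school's policy", 'school policy'}
```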

Table 1 shows the distribution of the trial, training and test documents over the four different research areas, while Table 2 shows the distribution of author- and reader-assigned keyphrases.

Dataset    Total    Documents per topic
                    C     H     I     J
Trial        40    10    10    10    10
Training    144    34    39    35    36
Test        100    25    25    25    25

Table 1: Number of documents per topic in the trial, training and test datasets, across the four ACM document classifications.

Dataset     Author    Reader    Combined
Trial          149       526         621
Training       559      1824        2223
Test           387      1217        1482

Table 2: Number of author- and reader-assigned keyphrases in the different datasets.

Interestingly, among the 387 author-assigned keywords, 125 keywords match exactly with reader-assigned keywords, while many more near-misses (i.e. partial matches) occur.

2.2 Evaluation Method and Baseline

Traditionally, automatic keyphrase extraction systems have been assessed using the proportion of top-N candidates that exactly match the gold-standard keyphrases (Frank et al., 1999; Witten et al., 1999; Turney, 1999). In some cases, inexact matches, or near-misses, have also been considered. Some have suggested treating semantically-similar keyphrases as correct based on similarities computed over a large corpus (Jarmasz and Barriere, 2004; Mihalcea and Tarau, 2004), or using semantic relations defined in a thesaurus (Medelyan and Witten, 2006). Zesch and Gurevych (2009) compute near-misses using an n-gram-based approach relative to the gold standard. For our shared task, we follow the traditional exact-match evaluation metric. That is, we match the keyphrases in the answer set with those the systems provide, and calculate micro-averaged precision, recall and F-score (β = 1). In the evaluation, we check the performance over the top 5, 10 and 15 candidates returned by each system. We rank the participating systems by F-score over the top 15 candidates.

Participants were required to extract existing phrases from the documents. Since it is theoretically possible to retrieve author-assigned keyphrases from the original PDF articles, we evaluate the participating systems over the independently-generated and held-out reader-assigned keyphrases, as well as the combined set of keyphrases (author- and reader-assigned).

All keyphrases in the answer set are stemmed using the English Porter stemmer for both the training and test dataset.
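The official scoring script is not reproduced here; a minimal sketch of the exact-match, micro-averaged evaluation over the top-N candidates, with Porter stemming applied to both system and gold keyphrases (NLTK's PorterStemmer stands in for the stemmer actually used, and the data structures are assumptions), could look like this:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_phrase(phrase: str) -> str:
    """Stem every word of a keyphrase, as done for the answer sets."""
    return " ".join(stemmer.stem(w) for w in phrase.lower().split())

def micro_prf(system_output: dict, gold: dict, top_n: int = 15):
    """Micro-averaged precision, recall and F-score (beta = 1) over the
    top-N candidates of every document.

    system_output: {doc_id: ranked list of keyphrases}
    gold:          {doc_id: set of gold-standard keyphrases}
    """
    matched = proposed = relevant = 0
    for doc_id, candidates in system_output.items():
        sys_set = {stem_phrase(k) for k in candidates[:top_n]}
        gold_set = {stem_phrase(k) for k in gold[doc_id]}
        matched += len(sys_set & gold_set)
        proposed += len(sys_set)
        relevant += len(gold_set)
    p = matched / proposed if proposed else 0.0
    r = matched / relevant if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```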

We computed TF×IDF n-gram-based baselines using both unsupervised and supervised learning. We used 1-, 2- and 3-grams as keyphrase candidates, and trained Naïve Bayes (NB) and Maximum Entropy (ME) classifiers over the keyphrase candidates and the gold-standard annotations for the training documents to obtain two supervised baseline systems. In total, there are three baselines: one unsupervised and two supervised. The performance of the baselines is presented in Table 3, where R indicates reader-assigned keyphrases and C indicates combined (both author- and reader-assigned) keyphrases.

Method         Top 5 candidates         Top 10 candidates        Top 15 candidates
               P      R      F          P      R      F          P      R      F
TF×IDF    R  17.8%   7.4%  10.4%     13.9%  11.5%  12.6%     11.6%  14.5%  12.9%
          C  22.0%   7.5%  11.2%     17.7%  12.1%  14.4%     14.9%  15.3%  15.1%
NB        R  16.8%   7.0%   9.9%     13.3%  11.1%  12.1%     11.4%  14.2%  12.7%
          C  21.4%   7.3%  10.9%     17.3%  11.8%  14.0%     14.5%  14.9%  14.7%
ME        R  16.8%   7.0%   9.9%     13.3%  11.1%  12.1%     11.4%  14.2%  12.7%
          C  21.4%   7.3%  10.9%     17.3%  11.8%  14.0%     14.5%  14.9%  14.7%

Table 3: Baseline keyphrase extraction performance for one unsupervised (TF×IDF) and two supervised (NB and ME) systems.
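The baseline implementations themselves were not released with this task overview; the following is a rough, illustrative sketch of what the unsupervised TF×IDF baseline could look like (the tokenisation, candidate filtering and function names are all assumptions, not the organisers' code):

```python
import math
import re
from collections import Counter

def ngrams(tokens, n_max=3):
    """All 1- to n_max-gram candidates of a token sequence."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_keyphrases(docs, top_n=15):
    """docs: {doc_id: raw text}; returns {doc_id: top_n candidates by TF*IDF}."""
    tokenised = {d: re.findall(r"[a-z0-9]+", text.lower()) for d, text in docs.items()}
    candidates = {d: Counter(ngrams(toks)) for d, toks in tokenised.items()}
    # Document frequency of each candidate over the whole collection
    df = Counter()
    for counts in candidates.values():
        df.update(counts.keys())
    n_docs = len(docs)
    ranked = {}
    for d, counts in candidates.items():
        total = sum(counts.values()) or 1
        scores = {c: (tf / total) * math.log(n_docs / df[c])
                  for c, tf in counts.items()}
        ranked[d] = [c for c, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_n]]
    return ranked
```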

3 Competition Results

The trial data was downloaded by 73 different teams, of which 36 teams subsequently downloaded the training and test data. 21 teams participated in the final competition, of which two teams withdrew their systems.

Table 4 shows the performance of the final 19 submitted systems. Five teams submitted one run, six teams submitted two runs and eight teams submitted the maximum number of three runs. We rank the best-performing system from each team by micro-averaged F-score over the top 15 candidates. We also show system performance over reader-assigned keywords in Table 5, and over author-assigned keywords in Table 6. In all these tables, P, R and F denote precision, recall and F-score, respectively. The best results over the reader-assigned and combined keyphrase sets are 23.5% and 27.5%, respectively, achieved by the HUMB team. Most systems outperformed the baselines. Systems also generally did better over the combined set, as the presence of a larger gold-standard answer set improved recall.

In Tables 7 and 8, we rank the teams by F-score, computed over the top 15 candidates, for each of the four ACM document classifications. The numbers in brackets are the actual F-scores for each team. Note that in the case of a tie in F-score, we ordered teams by descending F-score over all the data.

4 Discussion of the Upper-Bound Performance

The current evaluation is a testament to the gains made by keyphrase extraction systems. The system performance over the different keyword categories (reader-assigned and author-assigned) and numbers of keyword candidates (top 5, 10 and 15 candidates) attest to this fact.

The top-performing systems return F-scores in the upper twenties. Superficially, this number is low, and it is instructive to examine how much room there is for improvement. Keyphrase extraction is a subjective task, and an F-score of 100% is infeasible. On the author-assigned keyphrases in our test collection, the highest a system could theoretically achieve was 81% recall and 100% precision, which gives a maximum F-score of 89%. However, such a high value would only be possible if the number of keyphrases extracted per document could vary; in our task, we fixed the thresholds at 5, 10 and 15 keyphrases.
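Plugging these best-case numbers into the standard balanced F-score formula confirms the figure quoted above:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 1.00 \times 0.81}{1.00 + 0.81} \approx 0.89$$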

Another way of computing the upper-bound performance would be to look into how well people perform the same task. We analyzed the performance of our readers, taking the author-assigned keyphrases as the gold standard. The authors assigned an average of 4 keyphrases to each paper, whereas the readers assigned 12 on average. These 12 keyphrases cover 77.8% of the authors’ keyphrases, which corresponds to a precision of 21.5%. The F-score achieved by the readers on the author-assigned keyphrases is 33.6%, whereas the F-score of the best-performing system on the same data is 19.3% (for top 15, not top 12 keyphrases, see Table 6).

We conclude that there is definitely still room for improvement, and for any future shared tasks, we recommend against fixing any threshold on the number of keyphrases to be extracted per document. Finally, as we use a strict exact matching metric for evaluation, the presented evaluation figures are a lower bound for performance, as semantically equivalent keyphrases are not counted as correct. For future runs of this challenge, we believe a more semantically-motivated evaluation should be employed to give a more accurate impression of keyphrase acceptability.

5 Conclusion

This paper has described Task 5 of the Workshop on Semantic Evaluation 2010 (SemEval-2010), focusing on keyphrase extraction. We outlined the design of the datasets used in the shared task and the evaluation metrics, before presenting the official results for the task and summarising the immediate findings. We also analyzed the upper-bound performance for this task, and demonstrated that there is still room for improvement over the task. We look forward to future advances in automatic keyphrase extraction based on this and other datasets.

  1. We use “keyphrase” and “keywords” interchangeably to refer to both single words and phrases.
  2. All data sets listed below are available for download from http://github.com/snkim/AutomaticKeyphraseExtraction
  3. http://bit.ly/maui-datasets

References

  • 1. Ken Barker, Nadia Cornacchia, Using Noun Phrase Heads to Extract Document Keyphrases, Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, p.40-52, May 14-17, 2000
  • 2. Regina Barzilay and Michael Elhadad. Using Lexical Chains for Text Summarization. In Proceedings of ACL/EACL Workshop on Intelligent Scalable Text Summarization. 1997, Pp. 10--17.
  • 3. Ernesto D'Avanzo and Bernado Magnini. A Keyphrase-Based Approach to Summarization: The LAKE System at DUC-2005. In Proceedings of DUC. 2005.
  • 4. Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, Craig G. Nevill-Manning, Domain-specific Keyphrase Extraction, Proceedings of the 16th International Joint Conference on Artificial Intelligence, p.668-673, July 31-August 06, 1999, Stockholm, Sweden
  • 5. Anette Hulth, Improved Automatic Keyword Extraction Given More Linguistic Knowledge, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, p.216-223, July 11, 2003 doi:10.3115/1119355.1119383
  • 6. Anette Hulth, Enhancing Linguistically Oriented Automatic Keyword Extraction, Proceedings of HLT-NAACL 2004: Short Papers, p.17-20, May 02-07, 2004, Boston, Massachusetts
  • 7. Mario Jarmasz and Caroline Barriere. Using Semantic Similarity over Tera-byte Corpus, Compute the Performance of Keyphrase Extraction. In Proceedings of CLINE. 2004.
  • 8. Dawn Lawrie, W. Bruce Croft, Arnold Rosenberg, Finding Topic Words for Hierarchical Summarization, Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p.349-357, September 2001, New Orleans, Louisiana, USA doi:10.1145/383952.384022
  • 9. Zhiyuan Liu, Peng Li, Yabin Zheng, Maosong Sun, Clustering to Find Exemplar Terms for Keyphrase Extraction, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, August 06-07, 2009, Singapore
  • 10. Yutaka Matsuo and Mitsuru Ishizuka. Keyword Extraction from a Single Document Using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools. 2004, 13(1), Pp. 157--169.
  • 11. Olena Medelyan, Ian H. Witten, Thesaurus based Automatic Keyphrase Indexing, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, June 11-15, 2006, Chapel Hill, NC, USA doi:10.1145/1141753.1141819
  • 12. Olena Medelyan. Human-competitive Automatic Topic Indexing. PhD Thesis. University of Waikato. 2009.
  • 13. Rada Mihalcea and Paul Tarau. TextRank: Bringing Order Into Texts. In Proceedings of EMNLP. 2004, Pp. 404--411.
  • 14. Thuy Dung Nguyen, Min-Yen Kan, Keyphrase Extraction in Scientific Publications, Proceedings of the 10th International Conference on Asian Digital Libraries: Looking Back 10 Years and Forging New Frontiers, December 10-13, 2007, Hanoi, Vietnam
  • 15. Chau Q. Nguyen, Tuoi T. Phan, An Ontology-based Approach for Key Phrase Extraction, Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, August 04-04, 2009, Suntec, Singapore
  • 16. Youngja Park, Roy J Byrd, Branimir K Boguraev, Automatic Glossary Extraction: Beyond Terminology Identification, Proceedings of the 19th International Conference on Computational Linguistics, p.1-7, August 24-September 01, 2002, Taipei, Taiwan doi:10.3115/1072228.1072370
  • 17. Peter Turney. Learning to Extract Keyphrases from Text. In National Research Council, Institute for Information Technology, Technical Report ERB-1057. 1999.
  • 18. Peter D. Turney, Coherent Keyphrase Extraction via Web Mining, Proceedings of the 18th International Joint Conference on Artificial Intelligence, p.434-439, August 09-15, 2003, Acapulco, Mexico
  • 19. Xiaojun Wan, Jianguo Xiao, CollabRank: Towards a Collaborative Approach to Single-document Keyphrase Extraction, Proceedings of the 22nd International Conference on Computational Linguistics, p.969-976, August 18-22, 2008, Manchester, United Kingdom
  • 20. Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, Craig G. Nevill-Manning, KEA: Practical Automatic Keyphrase Extraction, Proceedings of the Fourth ACM Conference on Digital Libraries, p.254-255, August 11-14, 1999, Berkeley, California, USA doi:10.1145/313238.313437
  • 21. Torsten Zesch and Iryna Gurevych. Approximate Matching for Evaluating Keyphrase Extraction. In Proceedings of RANLP. 2009.


Author: Olena Medelyan, Timothy Baldwin, Min-Yen Kan, Su Nam Kim
Title: SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles
Year: 2010