WikiQA Dataset
		
		
		
		
		
		
		
	
A WikiQA Dataset is a QA dataset that consists of question-sentence pairs collected and annotated for research on open-domain question answering.
- AKA: WikiQA Corpus, Microsoft Research WikiQA Corpus.
- Context:
- It contains 3,047 questions and 29,258 candidate sentences, of which 1,473 sentences are labeled as answer sentences to their corresponding questions.
- Online repository and datasets available at: https://www.microsoft.com/en-us/download/details.aspx?id=52419
- Benchmark Tasks: Open-Domain Question Answering (QA), including Answer Sentence Selection and Answer Triggering.
 
- Example(s):
- Counter-Example(s):
- a CoQA Dataset,
- a HotpotQA Dataset,
- a MS COCO Dataset,
- a NarrativeQA Dataset,
- a Natural Questions Dataset,
- a NewsQA Dataset,
- a QuAC Dataset,
- a RACE Dataset,
- a SearchQA Dataset,
- a SQuAD Dataset,
- a TriviaQA Dataset.
 
- See: Question-Answering System, Natural Language Processing Task, Natural Language Understanding Task, Natural Language Generation Task.
References
2021
- (MS, 2021) ⇒ https://www.microsoft.com/en-us/download/details.aspx?id=52419 Retrieved:2021-01-03.
- QUOTE: The WikiQA corpus is a new publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. In order to reflect the true information need of general users, we used Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer. Because the summary section of a Wikipedia page provides the basic and usually most important information about the topic, we used sentences in this section as the candidate answers. With the help of crowdsourcing, we included 3,047 questions and 29,258 sentences in the dataset, where 1,473 sentences were labeled as answer sentences to their corresponding questions. More detail of this corpus can be found in our EMNLP-2015 paper, "WikiQA: A Challenge Dataset for Open-Domain Question Answering" Yang et al. 2015. In addition, this download also includes the experimental results in the paper, an evaluation script for judging the "answer triggering" task, as well as the answer phrases labeled by the authors of the paper.
 
2015
- (Yang et al., 2015) ⇒ Yi Yang, Wen-tau Yih, and Christopher Meek. (2015). “WikiQA: A Challenge Dataset for Open-Domain Question Answering.” In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015).
- QUOTE: We describe the WikiQA dataset, a new publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. Most previous work on answer sentence selection focuses on a dataset created using the TREC-QA data, which includes editor-generated questions and candidate answer sentences selected by matching content words in the question. WikiQA is constructed using a more natural process and is more than an order of magnitude larger than the previous dataset. In addition, the WikiQA dataset also includes questions for which there are no correct sentences, enabling researchers to work on answer triggering, a critical component in any QA system.
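 
- NOTE: The two ideas in the quotes above, candidate sentences labeled per question and questions with no correct sentence, can be illustrated with a minimal Python sketch. The rows below are hypothetical toy data, not the corpus's actual file layout, and the field order, function names, and model scores are assumptions made for illustration only. Mean reciprocal rank (MRR) is used here as one commonly used metric for answer sentence selection; skipping questions that have no correct sentence when computing it is also an assumption of this sketch, not a description of the official evaluation script.

```python
from collections import defaultdict

# Hypothetical toy rows: (question_id, sentence_id, label, model_score).
# label is 1 when the sentence answers the question, else 0.
ROWS = [
    ("Q1", "s1", 0, 0.2),
    ("Q1", "s2", 1, 0.9),
    ("Q1", "s3", 0, 0.4),
    ("Q2", "s4", 0, 0.7),  # Q2 has no correct sentence at all
    ("Q2", "s5", 0, 0.1),
    ("Q3", "s6", 0, 0.8),
    ("Q3", "s7", 1, 0.5),
]


def group_by_question(rows):
    """Collect the candidate sentences of each question."""
    groups = defaultdict(list)
    for question_id, sentence_id, label, score in rows:
        groups[question_id].append((sentence_id, label, score))
    return dict(groups)


def questions_without_answer(groups):
    """Questions whose candidates are all labeled 0 (the answer-triggering case)."""
    return sorted(
        question_id
        for question_id, candidates in groups.items()
        if not any(label for _, label, _ in candidates)
    )


def mean_reciprocal_rank(groups):
    """MRR over questions that have at least one correct sentence."""
    reciprocal_ranks = []
    for candidates in groups.values():
        if not any(label for _, label, _ in candidates):
            continue  # assumption: unanswerable questions are skipped here
        ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
        first_hit = next(
            rank for rank, (_, label, _) in enumerate(ranked, start=1) if label
        )
        reciprocal_ranks.append(1.0 / first_hit)
    if not reciprocal_ranks:
        return 0.0
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


groups = group_by_question(ROWS)
no_answer = questions_without_answer(groups)
mrr = mean_reciprocal_rank(groups)
```

- In this toy data, Q2 is the only question without a correct sentence, which is the case an answer-triggering system should detect by returning no answer instead of its top-ranked candidate.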