Question-Answer (QA) Benchmark Dataset
A Question-Answer (QA) Benchmark Dataset is a reading comprehension NLP benchmark database for QA systems.
- Context:
- It can (typically) consist of Question-Answer Records.
- ...
 
- Example(s):
- a CoQA Dataset (Reddy et al., 2019),
- a FigureQA Dataset,
- a Frames Dataset,
- a HotpotQA Dataset (Yang et al., 2018),
- a NarrativeQA Dataset (Kocisky et al., 2018),
- a Natural Questions Dataset (Kwiatkowski et al., 2019),
- a NewsQA Dataset (Trischler et al., 2016),
- a SearchQA Dataset (Dunn et al., 2017),
- a SQuAD Dataset (Rajpurkar et al., 2016; 2018),
- a TriviaQA Dataset (Joshi et al., 2017).
- …
 
- Counter-Example(s):
- See: Reading Comprehension System, Natural Language Processing Task, Natural Language Understanding Task, Natural Language Generation Task.
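As a rough illustration of the context above, a Question-Answer Record can be modeled as a small data structure, and a benchmark run as a loop that compares system output against reference answers. The following Python sketch is not drawn from any of the datasets listed; the names `QARecord`, `exact_match_score`, and `first_word_system` are hypothetical, and exact-match scoring is only one of several metrics used in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QARecord:
    """One hypothetical question-answer record; field names are assumptions."""
    context: str   # passage the question is asked about
    question: str
    answer: str    # reference (gold) answer


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial differences do not count as errors.
    return " ".join(text.lower().split())


def exact_match_score(records: List[QARecord],
                      system: Callable[[str, str], str]) -> float:
    """Fraction of records where the system's answer equals the reference answer."""
    if not records:
        return 0.0
    hits = sum(
        normalize(system(r.context, r.question)) == normalize(r.answer)
        for r in records
    )
    return hits / len(records)


# Toy data, invented for illustration only.
records = [
    QARecord("Paris is the capital of France.",
             "What is the capital of France?", "Paris"),
    QARecord("The Nile flows through Egypt.",
             "Which country does the Nile flow through?", "Egypt"),
]


def first_word_system(context: str, question: str) -> str:
    # A deliberately naive "system": always answer with the first word of the passage.
    return context.split()[0]


score = exact_match_score(records, first_word_system)
```

The naive system gets one of the two toy questions right, which shows how a benchmark exposes shallow strategies.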
References
2023
- (Ji et al., 2023) ⇒ Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. (2023). “BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset.” doi:10.48550/arXiv.2307.04657
- It introduces BEAVERTAILS, a new question-answering dataset aimed at facilitating alignment of AI assistants towards helpfulness and harmlessness.
- The dataset contains over 330,000 QA pairs annotated with safety meta-labels across 14 potential harm categories. 55% of the QA pairs are labeled as unsafe.
- It also includes over 360,000 pairs of human preference rankings judging the helpfulness and harmlessness of responses. This allows disentangling these two metrics.
- It demonstrates applications of the dataset, including training a QA moderation model, separate reward and cost models, and fine-tuning an LLM with safe RLHF.
 
2019a
- (Kwiatkowski et al., 2019) ⇒ Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. (2019). “Natural Questions: A Benchmark for Question Answering Research.” In: Transactions of the Association for Computational Linguistics, 7.
- QUOTE: The questions consist of real anonymized, aggregated queries issued to the Google search engine. Simple heuristics are used to filter questions from the query stream. Thus the questions are “natural” in that they represent real queries from people seeking information.
 
2019b
- (Reddy et al., 2019) ⇒ Siva Reddy, Danqi Chen, and Christopher D. Manning. (2019). “CoQA: A Conversational Question Answering Challenge.” In: Transactions of the Association for Computational Linguistics Journal, 7. DOI:10.1162/tacl_a_00266.
- QUOTE: We introduce CoQA, a novel dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains.  The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage.  We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets (e.g., coreference and pragmatic reasoning). We evaluate strong dialogue and reading comprehension models on CoQA.         (...)
 
| Dataset | Conversational | Answer Type | Domain | 
|---|---|---|---|
| MCTest (Richardson et al., 2013) | ✗ | Multiple choice | Children’s stories | 
| CNN/Daily Mail (Hermann et al., 2015) | ✗ | Spans | News | 
| Children's book test (Hill et al., 2016) | ✗ | Multiple choice | Children’s stories | 
| SQuAD (Rajpurkar et al., 2016) | ✗ | Spans | Wikipedia | 
| MS MARCO (Nguyen et al., 2016) | ✗ | Free-form text, Unanswerable | Web Search | 
| NewsQA (Trischler et al., 2017) | ✗ | Spans | News | 
| SearchQA (Dunn et al., 2017) | ✗ | Spans | Jeopardy | 
| TriviaQA (Joshi et al., 2017) | ✗ | Spans | Trivia | 
| RACE (Lai et al., 2017) | ✗ | Multiple choice | Mid/High School Exams | 
| Narrative QA (Kocisky et al., 2018) | ✗ | Free-form text | Movie Scripts, Literature | 
| SQuAD 2.0 (Rajpurkar et al., 2018) | ✗ | Spans, Unanswerable | Wikipedia | 
| CoQA (this work) | ✔ | Free-form text, Unanswerable; Each answer comes with a text span rationale | Children’s Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science | 
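The CoQA row in the table above (free-form answers, each with a text span rationale, asked within a conversation) suggests a record shape along the following lines. This is a minimal Python sketch under assumed names (`Turn`, `Conversation`, `rationale`, `model_input`); it is not the official CoQA file format, and the passage and turns are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    question: str
    answer: str      # free-form text, need not be copied verbatim from the passage
    rationale: str   # evidence span; assumed here to be a substring of the passage


@dataclass
class Conversation:
    passage: str
    turns: List[Turn] = field(default_factory=list)

    def model_input(self, turn_index: int) -> str:
        """Text a system would see for one turn: passage, earlier Q/A pairs, current question."""
        history = [f"Q: {t.question} A: {t.answer}" for t in self.turns[:turn_index]]
        current = f"Q: {self.turns[turn_index].question}"
        return "\n".join([self.passage, *history, current])

    def rationales_are_spans(self) -> bool:
        # Every rationale should be findable in the passage.
        return all(t.rationale in self.passage for t in self.turns)


conv = Conversation(
    passage="Anna adopted a grey cat. She named it Misty.",
    turns=[
        Turn("What did Anna adopt?", "a cat", "Anna adopted a grey cat"),
        # "she" and "it" can only be resolved using the previous turn.
        Turn("What did she call it?", "Misty", "She named it Misty"),
    ],
)

second_turn_input = conv.model_input(1)
```

Including earlier turns in the input is one simple way to give a system the context it needs for coreference; other history encodings are possible.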
2018a
- (Kocisky et al., 2018) ⇒ Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. (2018). “The NarrativeQA Reading Comprehension Challenge.” In: Transactions of the Association for Computational Linguistics, 6.
- QUOTE: To encourage progress on deeper comprehension of language, we present a new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts. These tasks are designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience. (...)
 
| Dataset | Documents | Questions | Answers | 
|---|---|---|---|
| MCTest (Richardson et al., 2013) | 660 short stories, grade school level | 2640 human generated, based on the document | multiple choice | 
| CNN/Daily Mail (Hermann et al., 2015) | 93K+220K news articles | 387K+997K Cloze-form, based on highlights | entities | 
| Children’s Book Test (CBT) (Hill et al., 2016) | 687K of 20 sentence passages from 108 children’s books | Cloze-form, from the 21st sentence | multiple choice | 
| BookTest (Bajgar et al., 2016) | 14.2M, similar to CBT | Cloze-form, similar to CBT | multiple choice | 
| SQuAD (Rajpurkar et al., 2016) | 23K paragraphs from 536 Wikipedia articles | 108K human generated, based on the paragraphs | spans | 
| NewsQA (Trischler et al., 2016) | 13K news articles from the CNN dataset | 120K human generated, based on headline, highlights | spans | 
| MS MARCO (Nguyen et al., 2016) | 1M passages from 200K+ documents retrieved using the queries | 100K search queries | human generated, based on the passages | 
| SearchQA (Dunn et al., 2017) | 6.9m passages retrieved from a search engine using the queries | 140k human generated Jeopardy! questions | human generated Jeopardy! answers | 
| NarrativeQA (this paper) | 1,572 stories (books, movie scripts) & human generated summaries | 46,765 human generated, based on summaries | human generated, based on summaries | 
2018b
- (Rajpurkar et al., 2018) ⇒ Pranav Rajpurkar, Robin Jia, and Percy Liang. (2018). “Know What You Don't Know: Unanswerable Questions for SQuAD.” In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Volume 2: Short Papers.
- QUOTE: In this work, we construct SQuADRUn, a new dataset that combines the existing questions in SQuAD with 53,775 new, unanswerable questions about the same paragraphs. Crowdworkers crafted these questions so that (1) they are relevant to the paragraph, and (2) the paragraph contains a plausible answer—something of the same type as what the question asks for.
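The construction described in the quote, combining existing answerable questions with new unanswerable ones about the same paragraphs, can be sketched as follows. This Python example assumes a hypothetical record layout in which an unanswerable question is marked by `answer=None`, and a hypothetical scoring rule in which a system earns credit on such a question only by abstaining; the real dataset's schema and official metric may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    paragraph: str
    question: str
    answer: Optional[str]  # None marks an unanswerable question (assumed convention)


def combine(answerable: List[Example], unanswerable: List[Example]) -> List[Example]:
    """Merge both question sets into one dataset, checking the labels are consistent."""
    assert all(e.answer is not None for e in answerable)
    assert all(e.answer is None for e in unanswerable)
    return answerable + unanswerable


def is_correct(example: Example, prediction: Optional[str]) -> bool:
    """Score one prediction; a prediction of None means the system abstains."""
    if example.answer is None:
        return prediction is None
    return prediction is not None and prediction.strip().lower() == example.answer.lower()


# Invented paragraph and questions, for illustration only.
paragraph = "The bridge opened in 1932 and was repainted in 1985."
dataset = combine(
    [Example(paragraph, "When did the bridge open?", "1932")],
    # The paragraph contains plausible answers of the right type (years),
    # but none of them answers this question.
    [Example(paragraph, "When was the bridge demolished?", None)],
)

results = [
    is_correct(dataset[0], "1932"),  # answerable question, correct answer
    is_correct(dataset[1], "1985"),  # guessing a plausible year
    is_correct(dataset[1], None),    # abstaining
]
```

Under this scoring rule, a system that always extracts a plausible span is penalized on the unanswerable question, which is the behavior the added questions are meant to test.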
 
2018c
- (Yang et al., 2018) ⇒ Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. (2018). “HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.” In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018).
2016
- (Rajpurkar et al., 2016) ⇒ Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. (2016). “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” In: arXiv preprint arXiv:1606.05250.
- QUOTE: We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees.
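Because each answer is a segment of text from the corresponding passage, a span-style record can store the answer as a character offset into that passage. The Python sketch below assumes hypothetical names (`SpanRecord`, `answer_start`, `answer_text`, `make_record`) and an invented passage; it is not the dataset's official schema.

```python
from dataclasses import dataclass


@dataclass
class SpanRecord:
    passage: str
    question: str
    answer_text: str
    answer_start: int  # character offset of the answer within the passage

    def span_is_valid(self) -> bool:
        """Check that the stored offset really points at the answer text."""
        end = self.answer_start + len(self.answer_text)
        return self.passage[self.answer_start:end] == self.answer_text


def make_record(passage: str, question: str, answer_text: str) -> SpanRecord:
    # Locate the first occurrence of the answer in the passage.
    # str.index raises ValueError when the answer is not a span of the passage.
    start = passage.index(answer_text)
    return SpanRecord(passage, question, answer_text, start)


# Invented passage and question, for illustration only.
passage = "Tea is grown mainly in Asia. It is brewed with hot water."
record = make_record(passage, "Where is tea mainly grown?", "Asia")
```

Storing the offset alongside the text removes ambiguity when the same string occurs more than once in a passage, although this sketch simply takes the first occurrence.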