2007 DevelopmentImplementationandaCo


Subject Headings: Definitional Sentence; Definitional Question Answering; MedQA Task.


The published medical literature and online medical resources are important sources to help physicians make patient treatment decisions. Traditional information retrieval systems (e.g., PubMed) often return a list of documents in response to a user’s query. Frequently, the number of documents returned from large knowledge repositories is so large that information seeking is practical only “after hours” and not in the clinical setting. This study developed novel algorithms and designed, implemented, and evaluated a medical definitional question answering system (MedQA). MedQA automatically analyzed a large number of electronic documents to generate short, coherent answers in response to definitional questions (i.e., questions of the form “What is X?”). Our preliminary cognitive evaluation shows that MedQA outperformed three other online information systems (Google, OneLook, and PubMed) on two important efficiency criteria; namely, the time spent and the number of actions taken for a physician to identify a definition. It is our contention that question answering systems that aggregate pertinent information scattered across different documents have the potential to address clinical information needs within a timeframe that meets the demands of clinicians.

1. Introduction

Physicians often have questions about the care of their patients. The published medical literature and online medical resources are important sources to answer physicians’ questions [1], [2], [3] and [4] and, as a result, to enhance the quality of patient care [5], [6] and [7]. Although a number of annotated medical knowledge databases, including UpToDate and Thomson Micromedex, are available to physicians, studies found that physicians often need to consult the primary literature for the latest information in patient care [2], [8] and [9]. Information retrieval systems (e.g., PubMed) return lists of retrieved documents in response to user queries. Frequently, the number of retrieved documents is large. For example, querying PubMed about the drug celecoxib returns more than one thousand articles. Physicians usually have limited time to browse the retrieved information. Studies indicate that physicians spend on average two minutes or less seeking an answer to a question, and that if a search takes longer, it is likely to be abandoned [1], [10], [11] and [12]. An evaluation study showed that it takes a healthcare provider an average of more than 30 minutes to search for an answer with the MEDLINE search engine, which means “information seeking is practical only ‘after hours’ and not in the clinical setting” [13].

Question answering can automatically analyze a large number of articles and generate short text, ideally within a few seconds, to answer questions posed by physicians. Such a technique may provide a practical alternative that enables physicians to seek information efficiently at the point of patient care. This paper reports the development, implementation, and a cognitive evaluation of a medical definitional question answering system (MedQA). Although our long-term goal is to enable MedQA to answer all types of medical questions, we started with the definitional question type because it tends to be more clear-cut and constrained in the medical domain; this contrasts with many other types of clinical questions, for which reasonable answers typically vary widely.

2. Background

Although the notion of computer-based question answering has been around since the 1970s (e.g., [14]), the field is still relatively young. Question answering has been driven by the Text REtrieval Conference (TREC), which supports research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. TREC introduced a question answering (QA) track in 1999. TREC (2004) reported that the best question answering system achieved 77% accuracy in answering factoid questions (e.g., “How many calories are there in a Big Mac?”) and a 62.2% F-score in answering list questions (e.g., “List the names of chewing gums”) [15]. In addition to factoid questions, TREC has since 2003 evaluated scenario questions (e.g., the definitional question “What is X?”) that require long and complex answers. Research on scenario question answering (e.g., [16] and [17]) has been supported by the Advanced Research and Development Activity (ARDA)’s Advanced Question and Answering for Intelligence (AQUAINT) program since 2001.

However, far fewer research groups are working on medical domain-specific question answering. Zweigenbaum [18] and [19] provided an in-depth analysis of the feasibility of question answering in the biomedical domain. Rinaldi and colleagues [20] adapted an open-domain question answering system to answer genomic questions (e.g., “Where was spontaneous apoptosis observed?”), focusing on identifying term relations with a linguistically rich full parser. Niu and colleagues [21] and Delbecque and colleagues [22] incorporated semantic information for term relation identification. Specifically, they mapped terms in a sentence to UMLS semantic classes (e.g., “Disease or Syndrome”) and then combined the semantic classes with surface cues or shallow parsing to capture term relations. For example, in the sentence “the combination of aspirin plus streptokinase significantly increased mortality at 3 months”, the word “plus” refers to the combination of two or more medications [21]. Xuang et al. [23] manually evaluated whether medical questions can be formulated in terms of problem/population, intervention, comparison, and outcome (PICO), the criteria recommended by the practice of evidence-based medicine. Yu [24] proposed a framework to answer biological questions with images. None of the systems described above, however, reported a fully implemented question answering system that generates answers to users’ questions from a text collection as large as the millions of MEDLINE records and other World Wide Web collections used in this study.

3.3. Answer extraction

Answer extraction identifies, from the retrieved documents, relevant sentences that answer the question. We developed multiple strategies to identify relevant sentences. MedQA first classifies sentences into specific types. Biomedical articles that report original research normally follow the rhetorical structure known as IMRAD (Introduction, Methods, Results, and Discussion) [50], [51], [52] and [53]. Within each section, there is a well-structured rhetorical substructure. For example, the introduction of a scientific paper usually begins with general statements about the significance of the topic and its history in the field [51]. A previous user study found that physicians prefer the Results section to the others for determining the relevance of an article [54]. We found that definitional sentences were more likely to appear in the Introduction and Background sections. Following previous approaches [55] and [56], we applied supervised machine learning (e.g., naïve Bayes) to identify the different sections (e.g., Introduction, Background, Methods, Results, and Conclusions) [57]. The training set was generated automatically from abstracts in which the sections were labeled by the abstracts’ authors. The trained classifier was then used to automatically predict the classes of MEDLINE sentences. This provided a total of 1,004,053 sentences for training, including 86,971 Introduction, 70,850 Background, 248,630 Methods, 371,419 Results, 167,252 Conclusions, and 58,930 Others. Using ten-fold cross-validation with bag-of-words features, the classifier achieved 78.6% accuracy in assigning a sentence to one of the section categories.
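The section classifier described above can be sketched as a bag-of-words naïve Bayes model. The following is a minimal, self-contained illustration with a tiny hand-made toy training set (the sentences and labels below are invented for illustration; the real system trained on over a million author-labeled MEDLINE abstract sentences):

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for the author-labeled abstract sentences (hypothetical data).
TRAIN = [
    ("little is known about the role of this gene", "INTRODUCTION"),
    ("the aim of this study was to evaluate treatment", "INTRODUCTION"),
    ("we randomly assigned patients to two groups", "METHODS"),
    ("samples were analyzed using mass spectrometry", "METHODS"),
    ("the mean response rate was 42 percent", "RESULTS"),
    ("treatment significantly reduced mortality", "RESULTS"),
    ("these findings suggest a new therapeutic target", "CONCLUSIONS"),
    ("our results support the use of early screening", "CONCLUSIONS"),
]

def train_nb(examples):
    """Estimate class counts and per-class word counts from labeled sentences."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Return the most probable section label (add-one smoothed naive Bayes)."""
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb(TRAIN)
print(classify("patients were randomly assigned to groups", *model))  # METHODS
```

With bag-of-words features like these, a larger labeled corpus would be needed to approach the reported 78.6% accuracy; the sketch only shows the mechanics of the approach.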

In addition, MedQA further categorized sentences based on linguistic and syntactic features. For example, in scientific writing it is customary to use the past tense when reporting original work and the present tense when describing established knowledge [50] and [51]. The biomedical literature reports not only experimental results, but also hypotheses, tentative conclusions, hedges, and speculations. MedQA applied the cue phrases (e.g., suggest, potential, likely, may, and at least) identified by [58], an approach reported to outperform machine-learning methods, to separate facts from speculations. Factual sentences were selected for capturing definitions.
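A cue-phrase speculation filter of this kind reduces to a lexicon lookup. Below is a minimal sketch; the cue list is abbreviated and partly invented (the full lexicon comes from [58]), and simple substring matching stands in for a proper tokenized match:

```python
# Abbreviated, partly hypothetical cue-phrase lexicon (the real one is from [58]).
SPECULATION_CUES = ("suggest", "potential", "likely", "may ", "at least",
                    "possibly", "might", "appear to")

def is_speculative(sentence: str) -> bool:
    """Flag a sentence as speculative if it contains any hedging cue phrase.
    Substring matching is a simplification; real systems match on tokens."""
    s = sentence.lower()
    return any(cue in s for cue in SPECULATION_CUES)

sentences = [
    "Celecoxib is a nonsteroidal anti-inflammatory drug.",
    "These results suggest a potential role in tumor growth.",
]
# Keep only factual sentences as candidate definition material.
facts = [s for s in sentences if not is_speculative(s)]
print(facts)
```

Only the first, factual sentence survives the filter and would proceed to definition extraction.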

Definitional sentences were identified by lexico-syntactic patterns. For example, the pattern “query term, Formative Verb (e.g., “is” or “are”), Noun Phrase” can be used to identify a definitional sentence such as “vulvar vestibulitis syndrome (VVS) is a common form of dyspareunia in premenopausal women” to answer a question such as “What is vulvar vestibulitis syndrome?” In contrast with other state-of-the-art definitional question answering systems [59], [60], [61] and [62], which mostly crafted lexico-syntactic patterns manually, our system automatically learned lexico-syntactic patterns from a large, automatically created training collection.
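A pattern of the form “query term, Formative Verb, Noun Phrase” can be approximated with a regular expression. This is an illustrative sketch, not the system’s actual pattern representation; the optional parenthesized-abbreviation group is an assumption added to handle examples like “(VVS)”:

```python
import re

def definition_pattern(term: str) -> re.Pattern:
    """Build a regex approximating the lexico-syntactic pattern
    '<query term> (ABBREV)? is/are <noun phrase>'."""
    return re.compile(
        rf"\b{re.escape(term)}\b(?:\s*\([A-Z]+\))?\s+(?:is|are)\s+(an?\s+.+)",
        re.IGNORECASE,
    )

sentence = ("Vulvar vestibulitis syndrome (VVS) is a common form of "
            "dyspareunia in premenopausal women.")
m = definition_pattern("vulvar vestibulitis syndrome").search(sentence)
print(m.group(1))  # the captured noun phrase serving as the definition
```

The captured group is the candidate definition, which would answer “What is vulvar vestibulitis syndrome?”.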

We automatically learned the lexico-syntactic patterns from a large set of Google definitions. Specifically, we took all of the terms included in the Unified Medical Language System (UMLS 2005AA) as candidate definitional terms and crawled the Web to search for definitions. We built our crawler on the Google:Definition service, which provides definitions that appear to come mostly from web glossaries. We found 36,535 of the roughly one million terms to have definitions specified by Google. We downloaded a total of 191,406 definitions; the average number of definitions per definitional term was 5.2.

With this set of definitions, we then automatically identified the lexico-syntactic patterns that make up the definitions. We applied the information extraction system AutoSlog-TS [63] and [64] to learn the lexico-syntactic patterns automatically. In the following, we describe AutoSlog-TS and how we applied it to lexico-syntactic pattern learning.

AutoSlog-TS is an information extraction system that automatically identifies extraction patterns for noun phrases by learning from two sets of unannotated texts. In our application, one collection contains relevant (definitional) sentences, while the other is irrelevant (background) because it contains sentences randomly selected from the MEDLINE collection.

AutoSlog-TS first performs part-of-speech tagging and shallow parsing, and then generates every possible lexico-syntactic pattern within a clause to extract every noun phrase in both the relevant and irrelevant collections. It then computes statistics on how often each pattern appears in the relevant texts versus the irrelevant texts, and produces a ranked list of extraction patterns coupled with statistics indicating how strongly each pattern is associated with the relevant texts. For each extraction pattern, AutoSlog-TS computes two frequency counts: totalfreq, the number of times the pattern appears anywhere in the corpus, and relfreq, the number of times the pattern appears in the relevant texts. The conditional probability estimates the likelihood that the pattern is relevant:

[math]\displaystyle{ P(relevant \mid pattern) = \frac{relfreq}{totalfreq} }[/math]

The RlogF measure [64] balances a pattern’s conditional probability estimate with the log of its frequency:

[math]\displaystyle{ RlogF(pattern) = \log_2(relfreq) \times P(relevant \mid pattern) }[/math]

The RlogF measure has been shown to be robust in a number of information extraction tasks [65]. We therefore used it to rank the lexico-syntactic patterns generated by AutoSlog-TS, and implemented the top-ranked patterns in MedQA to capture definitional sentences.
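The two formulas above combine into a simple scoring function. The following sketch ranks a few hypothetical patterns (the patterns and their frequency counts are invented for illustration) by RlogF, showing how a high-precision pattern can outrank a more frequent but noisier one:

```python
import math

def rlogf(relfreq: int, totalfreq: int) -> float:
    """RlogF(pattern) = log2(relfreq) * P(relevant | pattern),
    where P(relevant | pattern) = relfreq / totalfreq."""
    if relfreq == 0 or totalfreq == 0:
        return float("-inf")
    return math.log2(relfreq) * (relfreq / totalfreq)

# Hypothetical (pattern, relfreq, totalfreq) triples.
candidates = [
    ("<term> is defined as <NP>",    64,   70),  # precise, moderately frequent
    ("<term> is <NP>",              512, 2048),  # frequent but noisy
    ("<term> was observed in <NP>",   8,  400),  # rare and noisy
]
ranked = sorted(candidates, key=lambda c: rlogf(c[1], c[2]), reverse=True)
for pattern, rel, tot in ranked:
    print(f"{pattern:32s} RlogF={rlogf(rel, tot):.3f}")
```

The log term rewards frequent patterns while the conditional probability penalizes patterns that also fire on background text, so “is defined as” ranks first despite its lower raw frequency.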

3.4. Summarization and Answer Formulation

5. Conclusions

This study reports the development, implementation, and a cognitive evaluation of a biomedical question answering system (MedQA). MedQA generates short, paragraph-level texts to answer physicians’ and other biomedical researchers’ ad hoc questions. The contributions of this work include:

  1. Automatic generation of lexico-syntactic patterns for identifying definitions.
  2. The integration of document retrieval, answer extraction, and summarization into a working system that generated a short paragraph-level answer to a definitional question.
  3. A cognitive evaluation that compared MedQA with three other state-of-the-art online information systems; namely, Google, OneLook, and PubMed.

Our results show that MedQA in general outperformed OneLook and PubMed on the following four criteria: quality of answer, ease of use, time spent, and number of actions taken to obtain an answer. Although the evaluation results show that Google was the preferred system for quality of answer and ease of use, MedQA outperformed Google in time spent and actions taken to obtain an answer; both advantages suggest that MedQA could be useful in clinical settings.

It is important to point out the limitations of this work. This is a small-scale study involving four physicians who evaluated 12 medical questions. These physicians may not be representative of the broader population. In addition, the small sample size precludes measuring statistically significant differences among the search engines. Future work needs to increase both the number of subjects and the number of questions evaluated. Although MedQA is a work in progress, we can provisionally conclude that such a system has the potential to facilitate information seeking in demanding clinical contexts.

Appendix A.

Definitional questions found among the over four thousand clinical questions collected by Ely, D’Alessandro and colleagues [1], [25], [26] and [27]. The 12 questions selected for evaluation are in bold.

  1. What is cerebral palsy?
  2. What is gemfibrozil?
  3. What is d-dimer?
  4. What is the marsh score?
  5. What is hemoglobin A0?
  6. What is TCA (tetracaine, cocaine, alcohol)?
  7. What is Vagisil?
  8. What is Tamiflu (oseltamivir)?
  9. What is midodrine?
  10. What is Ambien (zolpidem)?
  11. What is Terazol (terconazole)?
  12. What is Maltsupex?
  13. What is DDAVP (1-desamino-8-d-arginine vasopressin)?
  14. What is Resaid?
  15. What is droperidol (Inapsine)?
  16. What is an appendix epididymis?
  17. What is henox premer?
  18. What is Lotrel?
  19. What is Hytrin?
  20. What is Cozaar?
  21. What is ceftazidime?
  22. What is Zolmitriptan (Zomig)?
  23. What is octreotide (sandostatin) and somatostatin?
  24. What is Proshield?
  25. What is risperidone (Risperdal)?
  26. What is Genora?
  27. What is mepron (Atovaquone)?
  28. What is Zoloft (sertraline)?
  29. What is fluvoxamine (Luvox)?
  30. What is Lunelle?
  31. What is amantadine dosing?
  32. What is cyclandelate?
  33. What is clonazepam (Klonopin)?
  34. What is westsoy formula?
  35. What is Zyprexa?
  36. What is Mexitil (mexiletine)?
  37. What is paregoric?
  38. What is Legatrin?
  39. What is nimodipine (Nimotop)?
  40. What is glatiramer (Copaxone)?
  41. What is propafenone?
  42. What is Ultravate cream (Halobetasol)?
  43. What is cilostazol (Pletal) (for intermittent claudication)?
  44. What is sotalol?
  45. What is Norvasc (amlodipine)?
  46. What is Uristat?
  47. What is nabumetone?
  48. What is Zofran (ondansetron)?
  49. What is terbinafine (Lamisil)?
  50. What is Cetirizine (Reactin)?
  51. What is Serzone (nefazodone)?
  52. What is Sansert?
  53. What is Urised?
  54. What is nedocromil sodium (Tilade)?
  55. What is Ultram (tramadol)?
  56. What is the Poland anomaly?
  57. What is vestibulitis?
  58. What is hemolytic uremic syndrome?
  59. What is a Lisfranc fracture?
  60. What is euthyroid sick syndrome?
  61. What is Williams Syndrome?
  62. What is senile tremor?
  63. What is NARES (nonallergic rhinitis with eosinophilia syndrome)?
  64. What is Walker–Warburg Syndrome?
  65. What is Carnett’s sign?
  66. What is the oxygen dissociation curve?
  67. What is the pivot-shift test?
  68. What is Osler’s sign?
  69. What is nephrocalcinosis?
  70. What is Vanderwoude syndrome?
  71. What is dysfibrinogenemia?
  72. What is Walker–Warburg syndrome?
  73. What is the antibiotic dose?
  74. What is bronchiolitis?
  75. What is occipital neuralgia?
  76. What is the cubital tunnel syndrome?
  77. What is Klippel Feil Syndrome?
  78. What is dyskinesia?
  79. What is Sandifer syndrome?
  80. What is heel pain syndrome?
  81. What is a blue dome cyst?
  82. What is central pontine myelinosis?
  83. What is Kenny syndrome?
  84. What is dysdiadokokinesis?
  85. What is Wegener’s granulomatosis?
  86. What is melanosis coli?
  87. What is lipoprotein A?
  88. What is FISH (fluorescence in situ hybridization)?
  89. What is schizoaffective disorder?
  90. What is Charcot Marie Tooth Disease?
  91. What is the adenosine-thallium exercise tolerance test?
  92. What is Ogilvie’s syndrome?
  93. What is prealbumin?
  94. What is serum sickness?
  95. What is Kussmaul breathing?
  96. What is Prader–Willi syndrome?
  97. What is an Adie’s pupil?
  98. What is Noonan syndrome?
  99. What is fetor hepaticus?
  100. What is dissociative disorder?
  101. What is the Jendrassik maneuver?
  102. What is a tethered spinal cord?
  103. What is the postcholecystectomy syndrome?
  104. What is herpes gladiatorum?
  105. What is Fragile X syndrome?
  106. What is a “high” ankle sprain?
  107. What is Peutz–Jegher’s Syndrome?
  108. What is an eccrine spiradenoma?
  109. What is Fanconi’s Syndrome?
  110. What is Still’s disease?
  111. What is the “dawn” phenomenon?
  112. What is BOOP (bronchiolitis obliterans and organizing pneumonia)?
  113. What is Best’s disease?
  114. What is Rovsing’s sign?
  115. What is a sliding hiatus hernia?
  116. What is biliary gastritis (bile gastritis)?
  117. What is a Hepatolite scan (questionably the same as a PAPIDA scan)?
  118. What is seronegative spondyloarthropathy?
  119. What is Stickler Syndrome?
  120. What is the Delphi technique?
  121. What is Xalatan eye drops?
  122. What is Betamol eye drops?
  123. What is the binomial distribution?
  124. What is a Roux-en-Y hepatoenterostomy?
  125. What is the bivariate normal distribution?
  126. What is Sperling’s sign (Sperling’s maneuver)?
  127. What is an incidence rate ratio?
  128. What are Alomide ophthalmic drops?
  129. What is the urological surgery technique?
  130. What are Acular ophthalmic drops? (good for allergic conjunctivitis?)
  131. What are “poppers”?
  132. What are Hawkins and Neer impingement signs (for shoulder pain, rotator cuff injury)?
  133. What are the Ottawa knee rules?
  134. What are endomysial antibodies?
  135. What are preeclampsia labs (laboratory studies)?
  136. What are Lewy bodies?
  137. What are the heat injury syndromes?
  138. What are pineal brain tumors?



Hong Yu, Minsuk Lee, David Kaufman, John Ely, Jerome A. Osheroff, George Hripcsak, James Cimino (2007). “Development, Implementation, and a Cognitive Evaluation of a Definitional Question Answering System for Physicians.” doi:10.1016/j.jbi.2007.03.002