2008 ATextMiningPersOnTheReqsForEAnnotAbs

From GM-RKB

Subject Headings: Biomedicine Domain, NLP Task, Information Extraction.

Notes

Quotes

  • Keywords: Information extraction; Article annotation; Text mining; Journal annotation pipeline; Review; Perspectives; Electronically annotated information

Abstract

  • We propose that the combination of human expertise and automatic text-mining systems can be used to create a first generation of electronically annotated information (EAI) that can be added to journal abstracts and that is directly related to the information in the corresponding text. The first experiments have concentrated on the annotation of gene/protein names and those of organisms, as these are the best resolved problems. A second generation of systems could then attempt to address the problems of annotating protein interactions and protein/gene functions, a more difficult task for text-mining systems. EAI will permit easier categorization of this information, it will help in the evaluation of papers for their curation in databases, and it will be invaluable for maintaining the links between the information in databases and the facts described in text. Additionally, it will contribute to the efforts towards completing database information and creating collections of annotated text that can be used to train new generations of text-mining systems. The recent introduction of the first meta-server for the annotation of biological text, with the possibility of collecting annotations from available text-mining systems, adds credibility to the technical feasibility of this proposal.

4. Proposal for the requirements of an interactive electronic annotation system

  • Taking into consideration the possible scenarios of usage and the basic categories of annotation types described above, our proposal for such a system would be (see Fig. 1):
    • Step I: During the manuscript submission process, automatic, web-based IE systems would examine the text, identifying and highlighting where genes and proteins are mentioned, and normalizing these mentions (suggesting links to core database identifiers along with the confidence associated with such associations).
    • Step II: The authors will then review the NER and document classification results, choosing the correct matches or providing new, more accurate ones. To facilitate this process, the verification of results must be straightforward (i.e., presenting the author with definitions of the terms or the main DB record content). It is particularly important to note that authors will still have to verify that no important mentions and/or normalizations have been missed by the IE systems. It is also interesting to consider that the process of interaction with the automatic system will ultimately influence the way in which papers are written.
    • Step III: In a second round, the systems will use the curated entity annotations to mine for relations between them, adding as much metadata as possible (e.g., mutations, interaction surfaces, and special genomic backgrounds). This two-step approach would eliminate the problem of detecting the correct entities prior to the detection of the interactions between entities. This will also make it easy for the authors to fill in the missing blanks by using templates and the formerly curated entities.
    • Step IV: An abstract will finally be submitted to the journal including an additional section with the “electronically annotated information (EAI)” and the corresponding database identifiers (i.e., protein or DNA sequence database identifiers) automatically associated. The EAI can then be included in MEDLINE to facilitate its future use, and used as labels in the XML/HTML/PDF versions of the text to facilitate the training of text-mining systems. Finally, the EAI can be deposited within databases (e.g., interaction databases) together with the explicit mention of the paper and authors.
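
The four steps above can be illustrated with a minimal Python sketch. It is not the authors' system: the lexicon, the `DB:` identifiers, the confidence values, and all function names are hypothetical placeholders, dictionary lookup stands in for a real NER and normalization service, and sentence-level co-occurrence stands in for real relation extraction.

```python
import re
from dataclasses import dataclass, replace
from itertools import combinations

# Hypothetical toy lexicon: surface form -> (placeholder identifier, confidence).
# These are NOT real database accessions.
LEXICON = {
    "p53": ("DB:0001", 0.9),
    "MDM2": ("DB:0002", 0.8),
}


@dataclass(frozen=True)
class Mention:
    text: str
    start: int
    end: int
    identifier: str
    confidence: float
    verified: bool = False


def annotate_entities(text):
    """Step I: highlight lexicon terms and suggest an identifier with a confidence."""
    mentions = []
    for term, (identifier, confidence) in LEXICON.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            mentions.append(Mention(term, m.start(), m.end(), identifier, confidence))
    return sorted(mentions, key=lambda m: m.start)


def author_review(mentions, rejected=(), corrections=None):
    """Step II: the author drops wrong matches and supplies better identifiers."""
    corrections = corrections or {}
    reviewed = []
    for m in mentions:
        if m.text in rejected:
            continue  # author marked this match as incorrect
        identifier = corrections.get(m.text, m.identifier)
        reviewed.append(replace(m, identifier=identifier, verified=True))
    return reviewed


def mine_relations(text, mentions):
    """Step III: propose candidate relations between verified entities
    that occur in the same sentence (a crude stand-in for relation mining)."""
    relations = set()
    for sentence in re.finditer(r"[^.!?]+[.!?]?", text):
        ids = {
            m.identifier
            for m in mentions
            if m.verified and sentence.start() <= m.start < sentence.end()
        }
        relations.update(combinations(sorted(ids), 2))
    return sorted(relations)


def build_eai(text, mentions):
    """Step IV: assemble the EAI section to attach to the abstract."""
    return {
        "entities": sorted({m.identifier for m in mentions if m.verified}),
        "relations": mine_relations(text, mentions),
    }


if __name__ == "__main__":
    abstract = "p53 binds MDM2. MDM2 is an oncogene."
    suggested = annotate_entities(abstract)
    curated = author_review(suggested, corrections={"MDM2": "DB:0099"})
    print(build_eai(abstract, curated))
```

Only mentions with `verified=True` reach the relation step, which mirrors the proposal's point that relations are mined after the authors have curated the entities rather than before.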


2008 ATextMiningPersOnTheReqsForEAnnotAbs

  • Author: Florian Leitner, Alfonso Valencia
  • Title: A Text-Mining Perspective on the Requirements for Electronically Annotated Abstracts
  • DOI: 10.1016/j.febslet.2008.02.072
  • Year: 2008