2005 TextMiningAndOntologiesInBiomedicine

Subject Headings: Biomedical Text, Term Detection Task. Text mining, ontology, terminology, information extraction, information retrieval

Notes

Cited By

2009

Quotes

Abstract

The volume of biomedical literature is increasing at such a rate that it is becoming difficult to locate, retrieve and manage the reported information without text mining, which aims to automatically distill information, extract facts, discover implicit links and generate hypotheses relevant to user needs. Ontologies, as conceptual models, provide the necessary framework for semantic representation of textual information. The principal link between text and an ontology is terminology, which maps terms to domain-specific concepts. This paper summarises different approaches in which ontologies have been used for text-mining applications in biomedicine.

Introduction

... it is not always straightforward to link textual information with ontology due to the inherent properties of language. Two major obstacles are: (1) inconsistent and imprecise practice in the naming of biomedical concepts (terminology), and (2) incomplete ontologies as a result of rapid knowledge expansion.

Terminology
  • The principal link between text and an ontology is a terminology, which aims to map concepts to terms (Fig. 2). A term is defined as a textual realization of a specialized concept, e.g. gene, protein, disease, etc.
  • In practice, TM applications are faced with the problems of term variation and term ambiguity, which make the integration of information available in text and ontologies difficult.
  • Term variation originates from the ability of a natural language to express a single concept in a number of ways. For example, in biomedicine there are many synonyms for proteins, enzymes, genes, etc. Having six or seven synonyms for a single concept is not unusual in this domain [8]. The probability of two experts using the same term to refer to the same concept is <20% [9]. In addition, biomedicine includes pharmacology, where numerous trademark names refer to the same compound (e.g. advil, brufen, motrin, nuprin, and nurofen all refer to ibuprofen).
  • Term ambiguity occurs when the same term is used to refer to multiple concepts. Ambiguity is an inherent feature of natural language. Words typically have multiple dictionary entries and the meaning of a word can be altered by its context. Sublanguages, as the languages confined to specialised domains [10], provide a context which generally reduces the level of ambiguity. However, biomedicine encompasses a plethora of subdomains, which is an additional cause for the high level of ambiguity in biomedical terminology. For example, the term promoter refers to a “binding site in a DNA chain at which RNA polymerase binds to initiate transcription of messenger RNA by one or more nearby structural genes” in biology, while in chemistry it denotes a “substance that in very small amounts is able to increase the activity of a catalyst.” In addition, acronyms are extensively used in biomedicine (a new acronym is introduced in every 5–10 abstracts) [11] and they are known to be highly ambiguous (>80% of acronyms are ambiguous, the average number of possible interpretations being >15) [12]. For example, AR could be expanded to any of the following terms: Androgen Receptor, AmphiRegulin, Acyclic Retinoid, Agonist-Receptor, Adrenergic Receptor, etc.
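
The two problems above can be pictured as a many-to-many mapping between terms and concepts: variation means one concept has several terms, and ambiguity means one term has several concepts. The following is a minimal sketch of such a mapping, not a structure described in the paper. The class name `TermLexicon` and the concept identifiers (e.g. `compound:ibuprofen`) are invented for illustration; only the example terms come from the text.

```python
from collections import defaultdict


class TermLexicon:
    """Many-to-many mapping between surface terms and concept identifiers."""

    def __init__(self):
        self._concepts_by_term = defaultdict(set)
        self._terms_by_concept = defaultdict(set)

    @staticmethod
    def _key(term):
        # Case-insensitive lookup key; a real system would normalise far more.
        return term.casefold()

    def add(self, term, concept):
        self._concepts_by_term[self._key(term)].add(concept)
        self._terms_by_concept[concept].add(term)

    def concepts(self, term):
        """All concepts a term may denote (more than one means ambiguity)."""
        return set(self._concepts_by_term.get(self._key(term), set()))

    def variants(self, concept):
        """All terms recorded for a concept (more than one means variation)."""
        return set(self._terms_by_concept.get(concept, set()))

    def is_ambiguous(self, term):
        return len(self.concepts(term)) > 1


lexicon = TermLexicon()

# Term variation: several trade names for one compound.
for name in ["ibuprofen", "Advil", "Brufen", "Motrin", "Nuprin", "Nurofen"]:
    lexicon.add(name, "compound:ibuprofen")  # hypothetical identifier

# Term ambiguity: one acronym with several expansions.
for expansion in ["androgen receptor", "amphiregulin", "acyclic retinoid"]:
    concept = "concept:" + expansion.replace(" ", "_")  # hypothetical identifier
    lexicon.add(expansion, concept)
    lexicon.add("AR", concept)
```

Here `lexicon.variants("compound:ibuprofen")` returns the six recorded names, while `lexicon.is_ambiguous("AR")` is true because the acronym maps to three concepts. Choosing among those concepts would need context, which this sketch does not model.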

Named entity recognition

  • IE depends on NER (i.e. term recognition, classification and mapping to designated concepts) as the main step in accessing textually described domain-specific information [36].
  • Typically, one third of term occurrences are variants [39], which means that many new terms can be recognised as variants of known terms.
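
One simple way to recognise a new term as a variant of a known one is to compare normalised forms. The sketch below assumes only orthographic variation (case, hyphens, spacing); it is not the method of the cited work, and the function names and the "NF-kappa B" example are illustrative assumptions.

```python
import re


def normalise(term):
    """Collapse case, hyphen and whitespace differences into one key."""
    return re.sub(r"[\s\-]+", "", term.casefold())


def build_index(known_terms):
    """Map each normalised key to the known term it came from."""
    return {normalise(term): term for term in known_terms}


def match_variant(candidate, index):
    """Return the known term that the candidate is a variant of, or None."""
    return index.get(normalise(candidate))


# Illustrative known terms, not taken from any real lexicon.
index = build_index(["NF-kappa B", "androgen receptor"])
```

With this index, `match_variant("nf kappaB", index)` and `match_variant("Androgen-Receptor", index)` both resolve to a known term, whereas an unseen term such as "amphiregulin" returns `None`. Morphological or structural variants would need richer rules than this key function provides.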

Ontology-based IE

  • Ontology-based IE systems attempt to map a term occurring in text to a concept in an ontology, typically in the absence of any explicit link between term and concept. This is passive ontology use.
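
Passive use of this kind can be approximated by a dictionary lookup that scans text for the longest known term at each position. The following is a rough sketch under that assumption. The table `TERM_TO_CONCEPT`, its concept identifiers and the function names are hypothetical and do not come from any system discussed in the paper.

```python
import re

# Hypothetical term-to-concept table; keys are lower-cased token tuples.
TERM_TO_CONCEPT = {
    ("androgen", "receptor"): "concept:androgen_receptor",
    ("rna", "polymerase"): "concept:rna_polymerase",
    ("promoter",): "concept:promoter",
}
MAX_TERM_LENGTH = max(len(key) for key in TERM_TO_CONCEPT)


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def annotate(text):
    """Return (term, concept) pairs, preferring the longest match at each position."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_TERM_LENGTH, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in TERM_TO_CONCEPT:
                found.append((" ".join(key), TERM_TO_CONCEPT[key]))
                i += n
                break
        else:
            i += 1  # no known term starts at this token
    return found
```

The ontology plays no part in the analysis itself here: it only supplies the target identifiers, which is what makes the use passive.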

Ontology-driven IE

  • Ontology-driven IE systems, unlike ontology-based ones, make active use of an ontology in processing, to strongly guide and constrain analysis. For example, Daraselia et al. [48] employ a full sentence parser [49] and a domain-specific filter to extract information on protein-protein interactions.
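
To illustrate what "constrain analysis" can mean, the sketch below rejects candidate relations whose argument types violate a signature declared in an ontology. It is a toy under stated assumptions and not the filter used by Daraselia et al.: the is-a hierarchy, the relation name `interacts_with` and all function names are invented for the example.

```python
# Hypothetical is-a hierarchy: child type -> parent type.
IS_A = {
    "enzyme": "protein",
    "receptor": "protein",
    "protein": "molecule",
    "disease": "condition",
}

# Hypothetical relation signatures: relation -> required argument types.
SIGNATURES = {"interacts_with": ("protein", "protein")}


def is_subtype(concept_type, ancestor):
    """Walk up the is-a chain and report whether the ancestor is reached."""
    while concept_type is not None:
        if concept_type == ancestor:
            return True
        concept_type = IS_A.get(concept_type)
    return False


def accept(relation, arg1_type, arg2_type):
    """Keep a candidate relation only if both arguments fit its signature."""
    signature = SIGNATURES.get(relation)
    if signature is None:
        return False  # relation not defined in the ontology
    return is_subtype(arg1_type, signature[0]) and is_subtype(arg2_type, signature[1])
```

A candidate such as `accept("interacts_with", "enzyme", "receptor")` passes because both types are kinds of protein, while `accept("interacts_with", "enzyme", "disease")` is filtered out.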

Conclusions

Different layers of text annotation (lexical, syntactic and semantic) are required for sophisticated TM in biomedicine. High terminological variability, typical of the domain, emphasises the need for lexico-syntactic procedures and annotations that can be used to neutralise the effects of such variation. Such phenomena can be tackled effectively through the use of rule-based or machine learning techniques. However, traditional heuristic and ad hoc TM methods simply do not deliver in a complex sublanguage such as that of biomedicine. Encoding of the explicit semantic layer in biomedical text representation needs to be supported by ontologies as the formal means of representing domain-specific knowledge. Up until recently, most TM systems have not relied on ontologies or terminologies, which is a main reason why biomedical TM systems generally provide poorer results compared to other domains (e.g. newswire).

Therefore, ontologies together with terminological lexicons are prerequisites for advanced TM. It is not enough to rely on one or the other: both are needed if we wish to produce highly accurate results of the kind needed by biomedical experts and also to obtain broad coverage of biomedical text. TM applications should aim at deriving complex information from text, e.g. temporal, causal, conditional and other types of semantic relations between biomedical entities as opposed to simple associations. In order to achieve such objectives, biomedical text needs to be semantically annotated and actively linked to ontologies.

This leads us to the question of the types of ontologies needed for TM. As demonstrated by GENIES [52] and GenIE [51], it is essential to focus on describing the syntactic and semantic behaviour of biomedical sublanguage and on the formal description of domain event concepts. These systems had to develop their own ontologies of events and their own terminological lexicons. Therefore, the challenge for the field is to develop appropriate ontology resources and link them to adequate terminological lexicons in order to support the kind of processing required – and also to support interoperability between such ontologies.

This can be greatly facilitated by recent advances in reducing the cost of configuring and tuning systems based on biomedical sublanguage: lexical standards enabling reusability; ML techniques to discover patterns of sublanguage behaviour in large annotated text corpora to help grammar writers; development of ontologies that can act as domain models and major developments in extracting and characterising terminology, including compound terms and acronyms.

References


2005 TextMiningAndOntologiesInBiomedicine
  • Authors: John McNaught, Irena Spasic, Sophia Ananiadou, Anand Kumar
  • Title: Text Mining and Ontologies in Biomedicine: Making sense of raw text
  • Journal: Briefings in Bioinformatics
  • URL: http://personalpages.manchester.ac.uk/staff/sophia.ananiadou/BIB.pdf
  • DOI: 10.1093/bib/6.3.239
  • Year: 2005