2013 LearningFormalDefinitionsforBiomedi

From GM-RKB

Subject Headings: Formal Definition; Biomedical Concept; Definition Extraction

Notes

Cited By

Quotes

Abstract

Ontologies play a major role in life sciences, enabling a number of applications, from new data integration to knowledge verification. Obtaining formalized knowledge from unstructured data is especially relevant for the biomedical domain, since the amount of textual biomedical data has been growing exponentially. The aim of this thesis is to develop a method of creating formal definitions for biomedical concepts using textual information from scientific literature (PubMed abstracts), encyclopedias (Wikipedia), controlled vocabularies (MeSH) and the Web. The knowledge representation formalism of choice is Description Logic, as it allows for integrating the newly acquired axioms into existing biomedical ontologies (e.g. SNOMED) as well as for automated reasoning on top of them. The work is specifically focused on extracting non-taxonomic relations and their instances from natural language texts. It encompasses the analysis, description, implementation and evaluation of a supervised relation extraction pipeline and sets the scene for unsupervised relation extraction, proposing a novel algorithm for relation discovery via semantic clustering.

Table of Contents

1.1 The growth of biomedical literature and the benefits of its formalization
1.2 Two examples of biomedical knowledge formalization
1.3 The task of formal definition generation
1.3.1 What is formal definition generation?
1.3.2 Why is formal definition generation important?
1.3.3 Is formal definition generation feasible? A case study
1.4 Objectives and Outline
2. Background
2.1 What is a definition?
2.2 What is Ontology Generation?
2.2.2 Ontology learning and Definition generation
2.3 Biomedical Knowledge Resources
2.3.1 SNOMED CT
2.3.2 UMLS
2.3.3 MeSH
2.4 Description Logics
2.4.1 Basic DL constructors
2.4.2 From triples to Description Logic formulas
3. Related work on relation extraction
3.1 Relation extraction for the general domain
3.1.1 Relation extraction and the types of linguistic processing
3.1.2 Relation extraction and different types of learning
3.1.3 Generating semantic representations
3.2 Biomedical extraction
4. Non-taxonomic relation extraction using SNOMED CT ontology
4.1 Dataset generation
4.2 Feature extraction
4.3 Relation classification
4.4 Discussion
5. Formal Definition Generation pipeline
5.1 Overview of the pipeline
5.2 Annotation of biomedical texts with ontology concepts
5.2.1 Introduction to the process of annotation and related work
5.2.2 The Attribute Alignment Annotator
5.2.3 The Extended Annotator
5.2.4 Implementation
5.2.5 Evaluation
5.2.6 Runtime Assessment
5.2.7 Summary of contributions and conclusions
5.2.8 Future work
5.3 Parser for Relation Extraction
5.3.1 Various types of the definitional structure of a sentence
5.3.2 The structure of definitions in MeSH
5.3.3 Functionality of the parser
5.3.4 Manual evaluation of the parser
5.3.5 Future improvements of the parser
5.4 Learning Relational Labels
5.4.1 Choosing the classifier
5.4.2 Choosing the features
5.4.3 Choosing the set of relations
6. Evaluation
6.1 SemRep: biomedical relation extraction system
6.1.1 SemRep relation extraction component
6.1.2 SemRep Gold Standard corpus
6.2 Experiments
6.2.1 Results
6.2.2 Improvement of the classification
6.2.3 Comparison with SemRep
7. Unsupervised Relation Extraction
7.1 From relation classification to unsupervised relation clustering
7.2 Relation construction via semantic clustering
7.2.1 Semantic clustering: assumptions and use cases
7.2.2 Semantic clustering of lexical elements
7.2.3 The DBSCAN algorithm and its hierarchical extension
7.3 Preliminary evaluation of the method
8. Future work
9. Conclusions

1. Introduction and Motivation

Formalization of biomedical knowledge has long been an area of active research. Existing biomedical knowledge resources vary considerably in their formalization principles, from databases and data collections (e.g. MEDLINE¹), to taxonomies and controlled vocabularies (e.g. MeSH²), to proper ontologies with rich formal semantics (e.g. SNOMED³). They also vary greatly in the sub-domains and areas they cover, as well as in their size, age, ways of maintaining and integrating new knowledge etc. Formally representing biomedical knowledge can bridge the gap between existing resources and enrich them, and it can help process the newly generated knowledge that comes in abundance and is publicly accessible.

1.1 The growth of biomedical literature and the benefits of its formalization

Research in life sciences is characterized by exponential growth of published scientific materials: articles, patents, technical reports etc. MEDLINE¹, one of the biggest bibliographic databases for biomedicine, currently contains more than 23 million articles. On average, 15,000 new articles are added per week. Figure 1 illustrates the growth rate of MEDLINE over the past half century [Tsatsaronis et al. 2013].

To handle such a large amount of information, multiple initiatives have been launched to organize biomedical knowledge formally, e.g. using ontologies [Bodenreider et al. 2006]. An ontology is a complex formal structure that can be decomposed into a set of logical axioms that state different relations between formal concepts. Together the axioms model the state of affairs in a domain. With the advances in Description Logics (DL), the process of designing, implementing and maintaining large-scale ontologies has been considerably facilitated [Baader et al. 2003]. In fact, DL has become the most widely used formalism underlying ontologies. Several well-known biomedical ontologies, such as GALEN [Rector et al. 2006] and SNOMED CT [SNOMED CT User Guide], are based on DL. SNOMED CT has adopted the lightweight description logic EL++, which allows for tractable reasoning.
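
The idea of an ontology as a set of axioms relating formal concepts can be illustrated with a minimal sketch. It assumes that knowledge is available as (subject, relation, object) triples and renders each subject as an EL-style conjunction of named concepts and existential restrictions. The helper name triples_to_definitions, the relation names is_a and finding_site, and the toy triples are hypothetical illustrations; they are not taken from the thesis or from the actual content of SNOMED CT.

```python
from collections import defaultdict


def triples_to_definitions(triples, taxonomic="is_a"):
    """Render (subject, relation, object) triples as EL-style definitions.

    Taxonomic triples contribute a named parent concept; every other
    triple contributes an existential restriction of the form ∃r.C.
    This is an illustrative sketch, not the method of the thesis.
    """
    parents = defaultdict(list)
    restrictions = defaultdict(list)
    for subject, relation, obj in triples:
        if relation == taxonomic:
            parents[subject].append(obj)
        else:
            restrictions[subject].append(f"∃{relation}.{obj}")

    definitions = {}
    # The subject list is materialized before the loop, so the
    # defaultdict lookups below cannot disturb the iteration.
    for subject in sorted(set(parents) | set(restrictions)):
        conjuncts = parents[subject] + restrictions[subject]
        definitions[subject] = f"{subject} ≡ " + " ⊓ ".join(conjuncts)
    return definitions


# Hypothetical toy input, for illustration only.
toy_triples = [
    ("Appendicitis", "is_a", "Inflammation"),
    ("Appendicitis", "finding_site", "Appendix"),
]
definitions = triples_to_definitions(toy_triples)
print(definitions["Appendicitis"])
```

The sketch writes every definition with ≡, i.e. it treats the collected conditions as both necessary and sufficient. Whether that is justified, or whether a subsumption axiom (⊑) is more appropriate, is a modeling decision that the code does not address.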

Formal knowledge representation has several benefits. First, an ontology can be viewed as a conceptualization of a domain; it thus provides a common language for the scientific community and facilitates communication between researchers. Second, formalization of entities enables efficient information integration: existing knowledge about an entity can be aggregated from multiple resources, and new knowledge can be easily integrated so that it is not lost or left unnoticed. Third, formal knowledge representation makes it possible to automate a number of crucial information-processing tasks: efficient search, validation and reasoning. Finally, formal representation can support knowledge visualization, which itself can bring about further insights about the domain, i.e., facilitate

¹ MEDLINE: http://www.ncbi.nlm.nih.gov/pubmed
² MeSH: http://www.nlm.nih.gov/mesh/
³ SNOMED CT: http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html

. …

References

Key: 2013 LearningFormalDefinitionsforBiomedi
Author: Alina Petrova
Title: Learning Formal Definitions for Biomedical Concepts
Year: 2013