2009 IntroToLinguisticAnnotation


Subject Headings: Linguistic Annotation Task.



Linguistic annotation and text analytics are active areas of research and development, with academic conferences and industry events such as the Linguistic Annotation Workshops and the annual Text Analytics Summits. This book provides a basic introduction to both fields, and aims to show that good linguistic annotations are the essential foundation for good text analytics.

After briefly reviewing the basics of XML, with practical exercises illustrating in-line and stand-off annotations, a chapter is devoted to explaining the different levels of linguistic annotations. The reader is encouraged to create example annotations using the WordFreak linguistic annotation tool. The next chapter shows how annotations can be created automatically using statistical NLP tools, and compares two sets of tools, the OpenNLP and Stanford NLP tools.

The second half of the book describes different annotation formats and gives practical examples of how to interchange annotations between different formats using XSLT transformations. The two main text analytics architectures, GATE and UIMA, are then described and compared, with practical exercises showing how to configure and customize them. The final chapter is an introduction to text analytics, describing the main applications and functions including named entity recognition, coreference resolution and information extraction, with practical examples using both open source and commercial tools. Copies of the example files, scripts, and stylesheets used in the book are available from the companion website at http://sites.morganclaypool.com/wilcock.

Table of Contents

  • 1 Working with XML
    • 1.1 Introduction
    • 1.2 XML Basics
    • 1.3 XML Parsing and Validation
    • 1.4 XML Transformations
    • 1.5 In-Line Annotations
    • 1.6 Stand-Off Annotations
    • 1.7 Annotation Standards
    • 1.8 Further Reading
  • 2 Linguistic Annotation
  • 3 Using Statistical NLP Tools
    • 3.1 Statistical Models
    • 3.2 OpenNLP and Stanford NLP Tools
    • 3.3 Sentences and Tokenization
    • 3.4 Statistical Tagging
    • 3.5 Chunking and Parsing
    • 3.6 Named Entity Recognition
    • 3.7 Coreference Resolution
    • 3.8 Further Reading
  • 4 Annotation Interchange
    • 4.1 XSLT Transformations
    • 4.2 WordFreak-OpenNLP Transformation
    • 4.3 GATE XML Format
    • 4.4 GATE-WordFreak Transformation
    • 4.5 XML Metadata Interchange: XMI
    • 4.6 WordFreak-XMI Transformation
    • 4.7 Towards Interoperability
    • 4.8 Further Reading
  • 5 Annotation Architectures
    • 5.1 GATE
    • 5.2 GATE Information Extraction Tools
    • 5.3 Annotations with JAPE Rules
    • 5.4 Customizing GATE Gazetteers
    • 5.5 UIMA
    • 5.6 UIMA Wrappers for OpenNLP Tools
    • 5.7 Annotations with Regular Expressions
    • 5.8 Customizing UIMA Dictionaries
    • 5.9 Further Reading
  • 6 Text Analytics
    • 6.1 Text Analytics Tools
    • 6.2 Named Entity Recognition
    • 6.3 Training Statistical Models
    • 6.4 Coreference Resolution
    • 6.5 Information Extraction
    • 6.6 Text Mining and Searching
    • 6.7 New Directions

1 Working with XML

... In general, annotations are notes of some kind that are attached to an object of some kind. In this book, the objects that are annotated are texts. Linguistic annotations are notes about linguistic features of the annotated text that give information about the words and sentences of the text. …

1.1 Introduction

1.2 XML Basics

1.3 XML Parsing and Validation

1.4 XML Transformations

1.5 In-Line Annotations

1.6 Stand-Off Annotations

1.7 Annotation Standards

1.8 Further Reading

2 Linguistic Annotation

2.1 Levels of Linguistic Annotation

In linguistic theory, the analysis and description of linguistic phenomena are usually organized into several distinct levels. The different sounds used by a language are described at the level of phonology. The writing system is described at the level of orthography. Morphology describes the formation and inflection of individual words. Syntax describes the ordering of words and their combination into phrases and sentences. Semantics analyzes the meaning of individual words (lexical semantics) and the meaning of phrases and sentences (compositional semantics). How words and phrases are actually used to make things happen is the level of pragmatics. How people and things are introduced as topics and subsequently referred to in later utterances is the level of discourse.

The different levels of linguistic description can be thought of as layers, as shown in Figure 2.1. Phonology and orthography deal with the smallest units (individual sounds and letters) at the bottom. Morphology, syntax and semantics deal with the medium-sized units (words, phrases and sentences). Discourse and pragmatics deal with the largest units (whole paragraphs and dialogues) at the top.

discourse      cohesion in a text or dialogue
pragmatics     functions of utterances
semantics      meaning of words and sentences
syntax         word order and sentence structure
morphology     word formation and inflections
orthography    spelling (written language)
phonology      sounds (spoken language)

Figure 2.1: Levels of linguistic description.

The current state of the art in linguistic annotation also divides the different annotation tasks into different levels, which can be arranged into a similar set of layers as shown in Figure 2.2. However, there is only an approximate correspondence between the levels of the tasks performed in practical corpus annotation work and the levels of description in linguistic theory.

coreference resolution      linking references to same entities in a text
named entity recognition    identifying and labeling named entities
semantic analysis           labeling predicate-argument relations
syntactic parsing           analyzing constituent phrases in a sentence
part-of-speech tagging      labeling words with word categories
tokenization                segmenting text into words
sentence boundaries         segmenting text into sentences

Figure 2.2: Levels of linguistic annotation.

This book focusses on the annotation of texts, where the language is written not spoken, so we do not include an annotation level matching phonology. The annotation tasks that deal with the level of orthography are tokenization and sentence boundary detection. These tasks segment the text into distinct words (tokens) and distinct sentences. It does not usually matter which of these two tasks is performed first, but it is important that both tasks are performed before the higher-level tasks are done.
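The point about task order can be illustrated with a minimal sketch. The regular expressions and function names below are assumptions made for illustration only; they are far cruder than the tools discussed in the book, and are not taken from it. The sketch segments a short text both ways (sentences first, then tokens; tokens first, then sentences) and checks that the resulting token offsets agree.

```python
import re

# Naive patterns, chosen for illustration only (assumptions, not the book's rules).
SENTENCE = re.compile(r"[^.!?\s][^.!?]*[.!?]")   # text up to the next ., ! or ?
TOKEN = re.compile(r"\w+|[^\w\s]")               # a word, or one punctuation mark


def spans(pattern, text, offset=0):
    """Return (start, end) character offsets of every match of pattern."""
    return [(m.start() + offset, m.end() + offset) for m in pattern.finditer(text)]


def sentences_first(text):
    """Split into sentences, then tokenize inside each sentence."""
    return [spans(TOKEN, text[s:e], offset=s) for s, e in spans(SENTENCE, text)]


def tokens_first(text):
    """Tokenize the whole text, then group the tokens by sentence span."""
    tokens = spans(TOKEN, text)
    return [[(s, e) for s, e in tokens if ss <= s and e <= se]
            for ss, se in spans(SENTENCE, text)]


text = "Annotations are notes. They are attached to texts."
by_sentence = sentences_first(text)

# Both orders yield the same token offsets for this text.
assert by_sentence == tokens_first(text)

for sentence in by_sentence:
    print([text[s:e] for s, e in sentence])
```

Either way, the higher-level tasks receive the same units: a list of sentences, each a list of token offsets into the unchanged source text.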

2.2 WordFreak Annotation Tool

There are many tools that can be used for linguistic annotation. We will use WordFreak (http://wordfreak.sourceforge.net/), a Java-based linguistic annotation tool designed to support both human and automatic annotation of linguistic data. WordFreak is briefly described by its developers Thomas Morton and Jeremy LaCivita in (Morton and LaCivita 2003). There is no user manual, so we will give detailed examples here.

We use WordFreak in order to gain practical experience of doing linguistic annotations by hand. That’s the only way to learn the difficulties involved in making decisions in linguistic annotations. Later, when we use statistical NLP tools, we will appreciate the speed and power of automatic annotations, by contrast with manual annotations.

As an example text, we will use Shakespeare’s Sonnet 130. Figure 2.3 shows sonnet130.txt, a plain text version of Sonnet 130.

WordFreak creates stand-off XML annotations. We will describe the format and see examples in the following sections. Note that GATE and WordFreak deal with existing annotations differently.
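As a rough illustration of the stand-off idea, the sketch below stores token annotations as character offsets in a separate XML structure and leaves the source text untouched. The element and attribute names (`annotations`, `token`, `start`, `end`, `pos`) and the tags are hypothetical; this is not WordFreak's actual file format.

```python
import xml.etree.ElementTree as ET

text = "Coral is red."

# Hypothetical token annotations: (start offset, end offset, part-of-speech tag).
tokens = [(0, 5, "NN"), (6, 8, "VBZ"), (9, 12, "JJ"), (12, 13, ".")]


def to_standoff(annotations):
    """Build an XML tree that points into the text by character offsets."""
    root = ET.Element("annotations")
    for i, (start, end, pos) in enumerate(annotations):
        ET.SubElement(root, "token", id=f"t{i}",
                      start=str(start), end=str(end), pos=pos)
    return root


def resolve(source, root):
    """Look up the text span each stand-off annotation refers to."""
    return [(source[int(t.get("start")):int(t.get("end"))], t.get("pos"))
            for t in root.iter("token")]


standoff = to_standoff(tokens)
print(ET.tostring(standoff, encoding="unicode"))
print(resolve(text, standoff))
```

Because the annotations only reference offsets, several annotation layers can point into the same text without modifying it, which is the main contrast with in-line markup.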

2.3 Sentence Boundaries

2.4 Tokenization

2.5 Part-of-Speech Tagging

2.6 Syntactic Parsing

2.7 Semantics and Discourse

2.8 WordFreak with OpenNLP

We have learned in the practical work that doing linguistic annotations by hand is a slow process. In this section we combine WordFreak with automatic tagging and parsing tools to do linguistic annotations much faster. We will learn more about statistical annotation tools in Chapter 3.

Automatic annotation tools inevitably make some mistakes, but the errors can be corrected by hand using the WordFreak user interface. The combination of high-speed automatic annotation and high-quality human checking and correction can be a good solution for some annotation tasks.
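The workflow of automatic annotation followed by human correction can be sketched in a few lines. Everything here is an assumption for illustration: the function names, the example tags (written in the style of the Penn Treebank tagset), and the idea of recording corrections as a mapping from token position to corrected tag. It does not reflect how WordFreak stores corrections internally.

```python
def apply_corrections(auto_tags, corrections):
    """Overlay human corrections on automatically assigned tags.

    auto_tags:   list of (token, tag) pairs from an automatic tagger
    corrections: dict mapping token position to the corrected tag
    """
    return [(token, corrections.get(i, tag))
            for i, (token, tag) in enumerate(auto_tags)]


def agreement(auto_tags, checked_tags):
    """Fraction of tokens whose automatic tag survived human checking."""
    if not auto_tags:
        return 1.0
    same = sum(1 for (_, a), (_, c) in zip(auto_tags, checked_tags) if a == c)
    return same / len(auto_tags)


# Hypothetical tagger output in which "red" was wrongly tagged as a noun.
auto = [("Coral", "NN"), ("is", "VBZ"), ("far", "RB"), ("more", "RBR"), ("red", "NN")]
checked = apply_corrections(auto, {4: "JJ"})

print(checked)
print(agreement(auto, checked))
```

Only the tokens the annotator actually changes need to be recorded, which is why checking automatic output tends to be much faster than annotating from scratch.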

2.9 Further Reading

3 Using Statistical NLP Tools

3.1 Statistical Models

3.2 OpenNLP and Stanford NLP Tools

3.3 Sentences and Tokenization

3.4 Statistical Tagging

3.5 Chunking and Parsing

3.6 Named Entity Recognition

3.7 Coreference Resolution

3.8 Further Reading

4 Annotation Interchange

4.1 XSLT Transformations

4.2 WordFreak-OpenNLP Transformation

4.3 GATE XML Format

4.4 GATE-WordFreak Transformation

4.5 XML Metadata Interchange: XMI

4.6 WordFreak-XMI Transformation

4.7 Towards Interoperability

4.8 Further Reading

5 Annotation Architectures

5.1 GATE

5.2 GATE Information Extraction Tools

5.3 Annotations with JAPE Rules

5.4 Customizing GATE Gazetteers

5.5 UIMA

5.6 UIMA Wrappers for OpenNLP Tools

5.7 Annotations with Regular Expressions

5.8 Customizing UIMA Dictionaries

5.9 Further Reading

6 Text Analytics

6.1 Text Analytics Tools

6.2 Named Entity Recognition

6.3 Training Statistical Models

6.4 Coreference Resolution

6.5 Information Extraction

6.6 Text Mining and Searching

6.7 New Directions


2009 IntroToLinguisticAnnotation

Author    Graham Wilcock
Title     Introduction to Linguistic Annotation and Text Analytics
URL       http://books.google.com/books?id=TDQJb1UgVywC
DOI       10.2200/S00194ED1V01Y200905HLT003
Year      2009