2007 ToponymResolutionInTextPhD

From GM-RKB

Subject headings: Toponym Mention Normalization Task, Toponym Mention Normalization Algorithm, Toponym Record, Gazetteer.

Notes

Cited By

Quotes

Abstract

In the area of Geographic Information Systems (GIS), a shared discipline between informatics and geography, the term geo-parsing is used to describe the process of identifying names in text, which in computational linguistics is known as named entity recognition and classification (NERC). The term geo-coding is used for the task of mapping from implicitly geo-referenced datasets (such as structured address records) to explicitly geo-referenced representations (e.g., using latitude and longitude). However, present-day GIS systems provide no automatic geo-coding functionality for unstructured text. In Information Extraction (IE), processing of named entities in text has traditionally been seen as a two-step process comprising a flat text span recognition sub-task and an atomic classification sub-task; relating the text span to a model of the world has been ignored by evaluations such as MUC or ACE (Chinchor (1998); U.S. NIST (2003)). However, spatial and temporal expressions refer to events in space-time, and the grounding of events is a precondition for accurate reasoning. Thus, automatic grounding can improve many applications such as automatic map drawing (e.g., for choosing a focus) and question answering (e.g., for questions like "How far is London from Edinburgh?", given a story in which both occur and can be resolved). Whereas temporal grounding has received considerable attention in the recent past (Mani and Wilson (2000); Setzer (2001)), robust spatial grounding has long been neglected.
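To illustrate why grounding matters for a question such as "How far is London from Edinburgh?", the following minimal sketch computes a great-circle distance once both toponyms have been resolved to coordinates. It is not taken from the thesis: the function name `haversine_km`, the spherical-Earth approximation, and the approximate city-centre coordinates are assumptions made for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate centroids (assumed values, not from the thesis gazetteer).
LONDON_UK = (51.5074, -0.1278)
EDINBURGH = (55.9533, -3.1883)

distance_km = haversine_km(*LONDON_UK, *EDINBURGH)
print(f"London (UK) to Edinburgh: about {distance_km:.0f} km")
```

The answer is only meaningful if "London" was resolved to the intended referent; resolving it to London, Ontario would give a distance of several thousand kilometres instead.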

  • The problem of automatic toponym resolution, or computing the mapping from occurrences of names for places as found in a text to an unambiguous spatial footprint of the location referred to, such as a geographic latitude/longitude centroid, is difficult to automate due to insufficient and error-prone geographic databases, and a large degree of place name ambiguity: common words need to be distinguished from proper names (geo/non-geo ambiguity), and the mapping between names and locations is ambiguous (London can refer to the capital of the UK or to London, Ontario, Canada, or to about forty other Londons on earth).
  • This thesis investigates how referentially ambiguous spatial named entities can be grounded, or resolved, with respect to an extensional coordinate model robustly on open-domain news text by collecting a repertoire of linguistic heuristics and extra-linguistic knowledge sources such as population. I then investigate how to combine these sources of evidence to obtain a superior method. Noise effects introduced by the named entity tagging that toponym resolution relies on are also studied. While a few attempts have been made to solve toponym resolution, these were either not evaluated, or evaluation was done by manual inspection of system output instead of creating a re-usable reference corpus. A systematic comparison leads to an inventory of heuristics and other sources of evidence. In order to carry out a comparative evaluation procedure, an evaluation resource is required, so a reference gazetteer and an associated novel reference corpus with human-labelled referent annotation were created for this thesis, to be used to benchmark a selection of the reconstructed algorithms and a novel re-combination of the heuristics catalogued in the inventory. Performance of the same resolution algorithms is compared under different conditions, namely applying them to the output of human named entity annotation and to automatic annotation using an existing Maximum Entropy sequence tagging model.
  • More formally, the task can be described as follows. We start with a corpus D comprising a set of documents D = {D_1, ..., D_|D|} as input. Each document D_i comprises a sequence of tokens TOKENS = (TOKEN[1] ... TOKEN[|TOKENS|]). We further need a gazetteer G, i.e. an inventory that lists all candidate referents R = {R_1 ... R_|R|}. A gazetteer entry G(T_i) for a toponym T_i is a tuple containing a feature type and a set of referents R_G for T_i. Here, referents are represented by the centroid of the location’s latitude and longitude, respectively. A toponym resolver is a function F_G(·, ·) that maps from a document D_i ∈ D in which the toponyms are not resolved yet, to a document with the same content in which the toponyms are resolved, i.e. where for each toponym (or for some toponyms, in the case of a partial toponym resolver) a referent from the set of candidate referents has been chosen. Referents can be represented in various ways, including polygons or simply pairs of latitude and longitude of the centroid.
  • Grounding. In linguistic pragmatics, grounding is the general concept of relating a linguistic entity to (a model of) the world (Figure 1.2). For Lakoff, for example, the ultimate basis for the human cognitive ability of language comprehension is that language is grounded in experience (Lakoff (1993)). According to Lakoff, humans have built-in physical properties that influence the way Homo sapiens uses language, including the universal perception of fundamental spatial dichotomies such as UP–DOWN.
  • Central Thesis Objective: The relative utility of heuristics and evidence sources for toponym resolution needs to be measured in a principled way.
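The formal task definition in the quotes above (a gazetteer mapping each toponym to candidate referents, and a resolver choosing one referent per toponym) can be sketched in a few lines. This is an illustrative sketch, not the thesis implementation: the names `Referent`, `Gazetteer`, `resolve_by_population` and `TOY_GAZETTEER` are hypothetical, the population figures are rough placeholders, and the "largest population" rule is just one of the extra-linguistic evidence sources the abstract mentions, used here as a simple baseline.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Referent:
    """A candidate location, represented by its centroid."""
    name: str
    lat: float
    lon: float
    population: int  # placeholder values below, for illustration only


# A gazetteer maps a toponym string to its candidate referents.
Gazetteer = Dict[str, List[Referent]]


def resolve_by_population(toponyms: List[str],
                          gazetteer: Gazetteer) -> Dict[str, Optional[Referent]]:
    """Partial resolver: pick the most populous candidate for each toponym.

    Toponyms missing from the gazetteer stay unresolved (None).
    """
    resolved: Dict[str, Optional[Referent]] = {}
    for toponym in toponyms:
        candidates = gazetteer.get(toponym, [])
        resolved[toponym] = (
            max(candidates, key=lambda r: r.population) if candidates else None
        )
    return resolved


# Toy data: coordinates and populations are rough, assumed values.
TOY_GAZETTEER: Gazetteer = {
    "London": [
        Referent("London, UK", 51.51, -0.13, 8_000_000),
        Referent("London, Ontario, Canada", 42.98, -81.25, 400_000),
    ],
}

result = resolve_by_population(["London", "Atlantis"], TOY_GAZETTEER)
print(result["London"].name)  # most populous candidate
print(result["Atlantis"])     # not in the gazetteer
```

A baseline like this ignores document context entirely, which is why measuring it against context-sensitive heuristics on a shared reference corpus is informative.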

1.5 Contributions

  • I have coined the technical term toponym resolution to describe the mapping from a place name in a prose text to an extensional representation of a location in a spatial model that the place name refers to. Before, the terms ‘place-name disambiguation’, ‘geo-coding’ and ‘grounding’ were often used, sometimes interchangeably, in a confusing way.
  • Fourth, I propose a new algorithm for toponym resolution (TR) that applies the notion of minimality heuristics (Gardent and Webber (2001)) to the task via a novel ‘geometric minimality heuristic’: assume the set of referent assignments that minimises the convex hull of the candidate referents. The algorithm uses the new heuristic together with the ‘one referent per discourse’ heuristic commonly used in Word Sense Disambiguation.
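The geometric minimality idea quoted above can be illustrated with a brute-force sketch: try every combination of candidate referents and keep the one whose convex hull has the smallest area. This is not the thesis algorithm itself. It is a simplified reading that assumes (lon, lat) pairs can be treated as planar coordinates, enumerates all combinations exhaustively, and encodes 'one referent per discourse' by assigning a single referent to each toponym name for the whole document. All names (`convex_hull`, `hull_area`, `resolve_min_hull`) and the approximate coordinates are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (lon, lat), treated as planar x, y


def _cross(o: Point, a: Point, b: Point) -> float:
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_area(points) -> float:
    """Shoelace area of the convex hull (0.0 for fewer than 3 vertices)."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0
    twice_area = 0.0
    for i, (x1, y1) in enumerate(hull):
        x2, y2 = hull[(i + 1) % len(hull)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def resolve_min_hull(candidates: Dict[str, List[Point]]) -> Dict[str, Point]:
    """Pick one referent per toponym so that the hull area is minimal."""
    names = sorted(n for n in candidates if candidates[n])
    best, best_area = (), float("inf")
    for combo in product(*(candidates[n] for n in names)):
        area = hull_area(combo)
        if area < best_area:  # ties keep the first combination found
            best, best_area = combo, area
    return dict(zip(names, best))


# Approximate (lon, lat) centroids, assumed for illustration.
doc_candidates = {
    "London": [(-0.13, 51.51), (-81.25, 42.98)],    # UK vs. Ontario
    "Cambridge": [(0.12, 52.20), (-71.11, 42.37)],  # UK vs. Massachusetts
    "Edinburgh": [(-3.19, 55.95)],
}
assignment = resolve_min_hull(doc_candidates)
print(assignment)
```

Note that with fewer than three distinct points the hull area is always zero, so this sketch cannot discriminate in that case; exhaustive enumeration also grows exponentially with the number of ambiguous toponyms.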

4.4 Gazetteer

  • For this project, a new gazetteer (henceforth TextGIS® Gazetteer) was built from existing sources. The GNIS gazetteer of the U.S. Geological Survey and the GNS gazetteers of the National Geospatial-Intelligence Agency (NGA) were used and supplemented by 267 CIA World Factbook (WFB) country centroids.
  • Some ambiguous location names
    • San Juan
    • San Francisco
    • Midway
    • Clinton
    • Victoria

References


  • Jochen L. Leidner. (2007). "Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names." http://www.era.lib.ed.ac.uk/bitstream/1842/1849/1/leidner-2007-phd.pdf